FIX guard ttest_1samp against zero-variance diff in HarmScorerEvaluator#1807
Open
immu4989 wants to merge 1 commit into
Open
FIX guard ttest_1samp against zero-variance diff in HarmScorerEvaluator#1807immu4989 wants to merge 1 commit into
immu4989 wants to merge 1 commit into
Conversation
HarmScorerEvaluator._compute_metrics passes diff = median_model_scores - gold_scores to scipy.stats.ttest_1samp. When the scorer perfectly agrees with gold labels (diff all zeros) or has constant systematic bias (diff = [c, c, ...]), the sample standard error is zero, scipy divides by zero and returns NaN, and three runtime warnings are emitted. The NaN propagates into HarmScorerMetrics.t_statistic / p_value and into serialized eval reports. Guard the call: perfect agreement -> t=0.0, p=1.0 (null hypothesis exactly satisfied); constant non-zero diff -> NaN explicitly (t-test undefined; MAE captures the bias). np.allclose handles float noise from np.median differences. Update both existing tests (perfect_agreement, partial_agreement) to assert on t_statistic / p_value so future regressions are caught. Add test_compute_harm_metrics_partial_agreement_with_variance covering the well-defined path. Document the conventions in HarmScorerMetrics docstring.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #1806.
HarmScorerEvaluator._compute_metricscallsscipy.stats.ttest_1samp(diff, 0)wherediff = median_model_scores - gold_scores. Whendiffhas zero within-sample variance (perfect agreement or constant systematic bias), scipy divides by zero, emits three runtime warnings, and returnsNaN. The NaN then flows intoHarmScorerMetrics.t_statistic/p_valueand into serialized eval reports.Both existing tests (
test_compute_harm_metrics_perfect_agreement,test_compute_harm_metrics_partial_agreement) hit this code path; they pass only because neither asserts on the t-test fields.Fix
Guard the
ttest_1sampcall:diff(numerically) all-zero → perfect agreement, null hypothesis is exactly satisfied →t_statistic=0.0, p_value=1.0.diff(numerically) constant but non-zero → t-test undefined, reportNaNexplicitly.mean_absolute_erroralready captures the bias magnitude.Uses
np.allcloseso float noise fromnp.median(...)differences (e.g.0.2 - 0.1 = 0.10000000000000003) doesn't escape the guard.Tests
test_compute_harm_metrics_perfect_agreementnow assertst_statistic == 0.0andp_value == 1.0.test_compute_harm_metrics_partial_agreement(constant +0.1 bias) now assertst_statisticandp_valueareNaN.test_compute_harm_metrics_partial_agreement_with_variancecovers the well-defined path: model has real variance across responses, t-statistic and p-value are finite floats.Docstring
HarmScorerMetrics.t_statistic/p_valueupdated to document the conventions:t_statistic = 0.0,p_value = 1.0on perfect agreement.NaNon the constant-non-zero-bias case; consultmean_absolute_errorinstead.Verification