Skip to content

FIX guard ttest_1samp against zero-variance diff in HarmScorerEvaluator#1807

Open
immu4989 wants to merge 1 commit into
microsoft:mainfrom
immu4989:fix/ttest-zero-variance-nan
Open

FIX guard ttest_1samp against zero-variance diff in HarmScorerEvaluator#1807
immu4989 wants to merge 1 commit into
microsoft:mainfrom
immu4989:fix/ttest-zero-variance-nan

Conversation

@immu4989
Copy link
Copy Markdown
Contributor

Fixes #1806.

HarmScorerEvaluator._compute_metrics calls scipy.stats.ttest_1samp(diff, 0) where diff = median_model_scores - gold_scores. When diff has zero within-sample variance (perfect agreement or constant systematic bias), scipy divides by zero, emits three runtime warnings, and returns NaN. The NaN then flows into HarmScorerMetrics.t_statistic / p_value and into serialized eval reports.

Both existing tests (test_compute_harm_metrics_perfect_agreement, test_compute_harm_metrics_partial_agreement) hit this code path; they pass only because neither asserts on the t-test fields.

Fix

Guard the ttest_1samp call:

  • diff (numerically) all-zero → perfect agreement, null hypothesis is exactly satisfied → t_statistic=0.0, p_value=1.0.
  • diff (numerically) constant but non-zero → t-test undefined, report NaN explicitly. mean_absolute_error already captures the bias magnitude.

Uses np.allclose so float noise from np.median(...) differences (e.g. 0.2 - 0.1 = 0.10000000000000003) doesn't escape the guard.

Tests

  • Existing test_compute_harm_metrics_perfect_agreement now asserts t_statistic == 0.0 and p_value == 1.0.
  • Existing test_compute_harm_metrics_partial_agreement (constant +0.1 bias) now asserts t_statistic and p_value are NaN.
  • New test_compute_harm_metrics_partial_agreement_with_variance covers the well-defined path: model has real variance across responses, t-statistic and p-value are finite floats.

Docstring

HarmScorerMetrics.t_statistic / p_value updated to document the conventions:

  • t_statistic = 0.0, p_value = 1.0 on perfect agreement.
  • NaN on the constant-non-zero-bias case; consult mean_absolute_error instead.

Verification

# Pre-fix (RuntimeWarning promoted to error to surface the bug):
pytest tests/unit/score/test_scorer_evaluator.py -W "error::RuntimeWarning"
-> 2 failed (perfect_agreement, partial_agreement)

# Post-fix:
pytest tests/unit/score/test_scorer_evaluator.py -W "error::RuntimeWarning"
-> 28 passed in 1.14s

# Full unit suite, no regressions:
pytest tests/unit -n auto
-> 8216 passed, 4 skipped in 45.03s   (was 8215 — new test added)

# pre-commit (incl. ty type check):
pre-commit run --files pyrit/score/scorer_evaluation/scorer_evaluator.py \
                       pyrit/score/scorer_evaluation/scorer_metrics.py \
                       tests/unit/score/test_scorer_evaluator.py
-> all hooks Passed

HarmScorerEvaluator._compute_metrics passes diff = median_model_scores - gold_scores to scipy.stats.ttest_1samp. When the scorer perfectly agrees with gold labels (diff all zeros) or has constant systematic bias (diff = [c, c, ...]), the sample standard error is zero, scipy divides by zero and returns NaN, and three runtime warnings are emitted. The NaN propagates into HarmScorerMetrics.t_statistic / p_value and into serialized eval reports.

Guard the call: perfect agreement -> t=0.0, p=1.0 (null hypothesis exactly satisfied); constant non-zero diff -> NaN explicitly (t-test undefined; MAE captures the bias). np.allclose handles float noise from np.median differences.

Update both existing tests (perfect_agreement, partial_agreement) to assert on t_statistic / p_value so future regressions are caught. Add test_compute_harm_metrics_partial_agreement_with_variance covering the well-defined path. Document the conventions in HarmScorerMetrics docstring.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG HarmScorerEvaluator emits scipy NaN/runtime-warnings on zero-variance diff (perfect agreement or constant bias)

1 participant