FIX guard ttest_1samp against zero-variance diff in HarmScorerEvaluator by immu4989 · Pull Request #1807 · microsoft/PyRIT

immu4989 · 2026-05-25T21:07:38Z

HarmScorerEvaluator._compute_metrics calls scipy.stats.ttest_1samp(diff, 0) where diff = median_model_scores - gold_scores. When diff has zero within-sample variance (perfect agreement or constant systematic bias), scipy divides by zero, emits three runtime warnings, and returns NaN. The NaN then flows into HarmScorerMetrics.t_statistic / p_value and into serialized eval reports.

Both existing tests (test_compute_harm_metrics_perfect_agreement, test_compute_harm_metrics_partial_agreement) hit this code path; they pass only because neither asserts on the t-test fields.

Fix

Guard the ttest_1samp call:

diff (numerically) all-zero → perfect agreement, null hypothesis is exactly satisfied → t_statistic=0.0, p_value=1.0.
diff (numerically) constant but non-zero → t-test undefined, report NaN explicitly. mean_absolute_error already captures the bias magnitude.

Uses np.allclose so float noise from np.median(...) differences (e.g. 0.2 - 0.1 = 0.10000000000000003) doesn't escape the guard.

Tests

Existing test_compute_harm_metrics_perfect_agreement now asserts t_statistic == 0.0 and p_value == 1.0.
Existing test_compute_harm_metrics_partial_agreement (constant +0.1 bias) now asserts t_statistic and p_value are NaN.
New test_compute_harm_metrics_partial_agreement_with_variance covers the well-defined path: model has real variance across responses, t-statistic and p-value are finite floats.

Docstring

HarmScorerMetrics.t_statistic / p_value updated to document the conventions:

t_statistic = 0.0, p_value = 1.0 on perfect agreement.
NaN on the constant-non-zero-bias case; consult mean_absolute_error instead.

Verification

# Pre-fix (RuntimeWarning promoted to error to surface the bug):
pytest tests/unit/score/test_scorer_evaluator.py -W "error::RuntimeWarning"
-> 2 failed (perfect_agreement, partial_agreement)

# Post-fix:
pytest tests/unit/score/test_scorer_evaluator.py -W "error::RuntimeWarning"
-> 28 passed in 1.14s

# Full unit suite, no regressions:
pytest tests/unit -n auto
-> 8216 passed, 4 skipped in 45.03s   (was 8215 — new test added)

# pre-commit (incl. ty type check):
pre-commit run --files pyrit/score/scorer_evaluation/scorer_evaluator.py \
                       pyrit/score/scorer_evaluation/scorer_metrics.py \
                       tests/unit/score/test_scorer_evaluator.py
-> all hooks Passed

HarmScorerEvaluator._compute_metrics passes diff = median_model_scores - gold_scores to scipy.stats.ttest_1samp. When the scorer perfectly agrees with gold labels (diff all zeros) or has constant systematic bias (diff = [c, c, ...]), the sample standard error is zero, scipy divides by zero and returns NaN, and three runtime warnings are emitted. The NaN propagates into HarmScorerMetrics.t_statistic / p_value and into serialized eval reports. Guard the call: perfect agreement -> t=0.0, p=1.0 (null hypothesis exactly satisfied); constant non-zero diff -> NaN explicitly (t-test undefined; MAE captures the bias). np.allclose handles float noise from np.median differences. Update both existing tests (perfect_agreement, partial_agreement) to assert on t_statistic / p_value so future regressions are caught. Add test_compute_harm_metrics_partial_agreement_with_variance covering the well-defined path. Document the conventions in HarmScorerMetrics docstring.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FIX guard ttest_1samp against zero-variance diff in HarmScorerEvaluator#1807

FIX guard ttest_1samp against zero-variance diff in HarmScorerEvaluator#1807
immu4989 wants to merge 1 commit into
microsoft:mainfrom
immu4989:fix/ttest-zero-variance-nan

immu4989 commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

immu4989 commented May 25, 2026

Fix

Tests

Docstring

Verification

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant