What happens
When a harm scorer's median output equals the human gold labels (or differs from them by a constant offset), HarmScorerEvaluator._compute_metrics emits three scipy runtime warnings per call and produces t_statistic=NaN, p_value=NaN in HarmScorerMetrics:
RuntimeWarning: divide by zero encountered in divide
RuntimeWarning: invalid value encountered in scalar multiply
RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.
The NaN propagates into HarmScorerMetrics.t_statistic / HarmScorerMetrics.p_value, gets serialized into the eval report alongside well-defined fields, and downstream consumers (JSON serializers, dashboards, comparison logic) silently ingest meaningless numbers.
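To illustrate why a NaN in the report is easy to miss, here is a small standard-library sketch. The dictionary keys mirror the HarmScorerMetrics field names, but the report layout and the 0.05 threshold are assumptions made for illustration. It shows that Python's json module writes a bare NaN token by default, and that every ordered comparison against NaN evaluates to False.

```python
import json
import math

# Hypothetical report fragment; only the field names come from HarmScorerMetrics.
report = {"mean_absolute_error": 0.1, "t_statistic": math.nan, "p_value": math.nan}

# json.dumps accepts NaN by default and writes a token that is not valid JSON.
serialized = json.dumps(report)
print(serialized)

# Strict serialization surfaces the problem instead of hiding it.
try:
    json.dumps(report, allow_nan=False)
    strict_ok = True
except ValueError:
    strict_ok = False

# Comparison logic silently falls through: both checks are False for NaN.
p_value = report["p_value"]
is_significant = p_value < 0.05
is_not_significant = p_value >= 0.05
print(strict_ok, is_significant, is_not_significant)
```

A consumer that only checks `p_value < 0.05` would treat the undefined result the same as a non-significant one.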
Root cause
pyrit/score/scorer_evaluation/scorer_evaluator.py:588:
t_statistic, p_value = cast("tuple[float, float]", ttest_1samp(diff, 0))
ttest_1samp divides by the sample standard error. When diff has zero (or near-zero) within-sample variance, this becomes a 0/0 or c/0 form and scipy returns NaN with three warnings. Two cases hit this:
- Perfect agreement: the scorer matches the human gold labels exactly, so diff = [0, 0, ...]. Triggered by the existing test_compute_harm_metrics_perfect_agreement test.
- Constant systematic bias: the scorer is off by a fixed offset, so diff = [c, c, ...] with c ≠ 0. Triggered by the existing test_compute_harm_metrics_partial_agreement test (model is +0.1 on every response).
Both existing tests emit the warnings but don't assert on t_statistic / p_value, so the bug is invisible to the suite even though the data path produces it.
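The perfect-agreement case can be reproduced outside PyRIT with a few lines. This is a minimal sketch that calls scipy directly on an all-zero diff; the array length is arbitrary, and whether the warnings are emitted may depend on the installed scipy version, so the snippet silences them and only inspects the result.

```python
import math
import warnings

import numpy as np
from scipy.stats import ttest_1samp

# Perfect agreement: every scorer-minus-human difference is exactly zero.
diff = np.zeros(5)

with warnings.catch_warnings():
    # Silence RuntimeWarnings so the snippet also runs under -W error.
    warnings.simplefilter("ignore", RuntimeWarning)
    result = ttest_1samp(diff, 0)

# Mean difference is 0 and the standard error is 0, so the statistic is 0/0.
t_statistic = float(result.statistic)
p_value = float(result.pvalue)
print(math.isnan(t_statistic), math.isnan(p_value))
```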
Repro
pytest tests/unit/score/test_scorer_evaluator.py -W "error::RuntimeWarning"
# -> 2 failed: test_compute_harm_metrics_perfect_agreement, test_compute_harm_metrics_partial_agreement
Proposed fix
Guard the ttest_1samp call. When diff is (numerically) constant:
- If diff[0] ≈ 0 (perfect agreement): the null hypothesis (mean diff = 0) is exactly satisfied, so report t_statistic=0.0, p_value=1.0. This is the conventional null-result interpretation.
- Otherwise (constant non-zero bias with no variance): the t-test is undefined, because there is a systematic offset but no within-sample variability. Report NaN explicitly. mean_absolute_error already captures the bias magnitude.
Use np.allclose rather than == so the float noise that creeps in from np.median(...) differences doesn't escape the guard. Update both existing tests to assert on t_statistic / p_value so future regressions are caught. Document the convention in the HarmScorerMetrics docstring.
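A rough sketch of the guard follows. The helper name guarded_ttest_1samp, the atol value, and the choice to raise on empty input are assumptions for illustration, not the code in scorer_evaluator.py; the actual PR may structure this differently.

```python
import math

import numpy as np
from scipy.stats import ttest_1samp


def guarded_ttest_1samp(diff, atol=1e-12):
    """Hypothetical helper: one-sample t-test against 0 with a constant-input guard."""
    diff = np.asarray(diff, dtype=float)
    if diff.size == 0:
        raise ValueError("diff must contain at least one value")

    # Numerically constant input: the standard error is (near) zero.
    if np.allclose(diff, diff[0], rtol=0.0, atol=atol):
        if np.isclose(diff[0], 0.0, rtol=0.0, atol=atol):
            # Perfect agreement: the null hypothesis holds exactly.
            return 0.0, 1.0
        # Constant non-zero bias: the test is undefined, so report NaN explicitly.
        return math.nan, math.nan

    result = ttest_1samp(diff, 0)
    return float(result.statistic), float(result.pvalue)


# Float noise: these differences are all "+0.1" but are not bit-identical,
# so an == comparison would miss them while np.allclose does not.
noisy = [0.2 - 0.1, 0.3 - 0.2, 0.8 - 0.7]
print(len(set(noisy)) > 1)
print(guarded_ttest_1samp([0.0, 0.0, 0.0]))
print(guarded_ttest_1samp(noisy))
```

The tolerance is deliberately much tighter than the np.allclose defaults (rtol=1e-5, atol=1e-8) so that a diff with small but genuine variance still reaches ttest_1samp; the right value for PyRIT's score scale is a judgment call.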
PR to follow.