BUG HarmScorerEvaluator emits scipy NaN/runtime-warnings on zero-variance diff (perfect agreement or constant bias) #1806

@immu4989

What happens

When a harm scorer's median output equals the human gold labels (or differs from them by a constant offset), HarmScorerEvaluator._compute_metrics emits three scipy runtime warnings per call and produces t_statistic=NaN, p_value=NaN in HarmScorerMetrics:

RuntimeWarning: divide by zero encountered in divide
RuntimeWarning: invalid value encountered in scalar multiply
RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation.
  This occurs when the data are nearly identical. Results may be unreliable.

The NaN propagates into HarmScorerMetrics.t_statistic / HarmScorerMetrics.p_value, gets serialized into the eval report alongside well-defined fields, and downstream consumers (JSON serializers, dashboards, comparison logic) silently ingest meaningless numbers.

Root cause

pyrit/score/scorer_evaluation/scorer_evaluator.py:588:

t_statistic, p_value = cast("tuple[float, float]", ttest_1samp(diff, 0))

ttest_1samp divides by the sample standard error. When diff has zero (or near-zero) within-sample variance, this becomes a 0/0 or c/0 form and scipy returns NaN with three warnings. Two cases hit this:

  1. Perfect agreement — scorer matches the human gold labels exactly, diff = [0, 0, ...]. Triggered by the existing test_compute_harm_metrics_perfect_agreement test.
  2. Constant systematic bias — scorer is off by a fixed offset, diff = [c, c, ...] with c ≠ 0. Triggered by the existing test_compute_harm_metrics_partial_agreement test (model is +0.1 on every response).

Both existing tests emit the warnings but don't assert on t_statistic / p_value, so the suite passes even though both tests exercise the failing code path.
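
To make the degenerate form concrete, here is a minimal sketch of the textbook one-sample statistic, t = mean(diff) / (std(diff, ddof=1) / sqrt(n)), computed by hand with NumPy. It is not scipy's internal code, and t_stat_parts is a hypothetical helper name. The bias case uses 0.25 instead of 0.1 only so the arithmetic is exact in binary floating point.

```python
import math

import numpy as np


def t_stat_parts(diff):
    """Return (mean, standard_error) of diff for a one-sample t-test against 0.

    Hypothetical helper for illustration only; not part of PyRIT or scipy.
    """
    d = np.asarray(diff, dtype=float)
    mean = float(d.mean())
    # Sample standard deviation (ddof=1) divided by sqrt(n).
    std_err = float(d.std(ddof=1) / math.sqrt(d.size))
    return mean, std_err


# Case 1: perfect agreement -> numerator 0, denominator 0 (0/0).
perfect = t_stat_parts([0.0, 0.0, 0.0, 0.0])

# Case 2: constant bias -> numerator c, denominator 0 (c/0).
biased = t_stat_parts([0.25, 0.25, 0.25, 0.25])

# Reference: a sample with real spread has a positive standard error.
varied = t_stat_parts([0.0, 0.5, -0.5, 0.25])
```

Both degenerate inputs give a standard error of exactly 0, so the ratio is 0/0 in the first case and c/0 in the second.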

Repro

pytest tests/unit/score/test_scorer_evaluator.py -W "error::RuntimeWarning"
# -> 2 failed: test_compute_harm_metrics_perfect_agreement, test_compute_harm_metrics_partial_agreement

Proposed fix

Guard the ttest_1samp call. When diff is (numerically) constant:

  • If diff[0] ≈ 0 (perfect agreement): the null hypothesis (mean diff = 0) is exactly satisfied, so report t_statistic=0.0, p_value=1.0. This is the conventional null-result interpretation.
  • Otherwise (constant non-zero bias with no variance): the t-test is undefined, because there is a systematic offset but no within-sample variability. Report NaN explicitly rather than letting scipy produce it along with warnings. mean_absolute_error already captures the bias magnitude.

Use np.allclose rather than == so the float noise that creeps in from np.median(...) differences doesn't escape the guard. Update both existing tests to assert on t_statistic / p_value so future regressions are caught. Document the convention in the HarmScorerMetrics docstring.
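
A minimal sketch of the guard described above, assuming diff is a 1-D array of floats. safe_ttest_1samp is a hypothetical helper name, not an existing PyRIT function, and the default np.allclose / np.isclose tolerances are an assumption that may need tuning for the scorer's value range.

```python
import numpy as np
from scipy.stats import ttest_1samp


def safe_ttest_1samp(diff):
    """One-sample t-test of diff against 0 that skips scipy on constant input.

    Hypothetical helper for illustration; returns (t_statistic, p_value).
    """
    d = np.asarray(diff, dtype=float)

    # Numerically constant sample: every element is close to the first one.
    if d.size > 0 and np.allclose(d, d[0]):
        if np.isclose(d[0], 0.0):
            # Perfect agreement: the null hypothesis (mean diff = 0) holds.
            return 0.0, 1.0
        # Constant non-zero bias: the t-test is undefined, so report NaN
        # explicitly instead of letting scipy warn and return it.
        return float("nan"), float("nan")

    # Non-degenerate sample: defer to scipy as before.
    result = ttest_1samp(d, 0)
    return float(result.statistic), float(result.pvalue)


agree_t, agree_p = safe_ttest_1samp([0.0, 0.0, 0.0])
bias_t, bias_p = safe_ttest_1samp([0.1, 0.1 + 1e-12, 0.1])
normal_t, normal_p = safe_ttest_1samp([0.0, 0.2, -0.1, 0.3])
```

The second call includes a 1e-12 perturbation to show why np.allclose is used: an exact == comparison would miss it and fall through to scipy.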

PR to follow.
