BUG HarmScorerEvaluator emits scipy NaN/runtime-warnings on zero-variance diff (perfect agreement or constant bias) #1806

### What happens

When a harm scorer's median output equals the human gold labels (or differs from them by a constant offset), `HarmScorerEvaluator._compute_metrics` emits three scipy runtime warnings per call and produces `t_statistic=NaN`, `p_value=NaN` in `HarmScorerMetrics`:

```
RuntimeWarning: divide by zero encountered in divide
RuntimeWarning: invalid value encountered in scalar multiply
RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation.
  This occurs when the data are nearly identical. Results may be unreliable.
```

These NaN values are serialized into the eval report alongside well-defined fields, so downstream consumers (JSON serializers, dashboards, comparison logic) silently ingest meaningless numbers.
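To illustrate the "silently ingest" part, here is a minimal sketch using only the standard library. The `report` dict is hypothetical: its keys mirror the `HarmScorerMetrics` field names, but the real report schema may differ. It shows that Python's default JSON encoder writes NaN without complaint, and that a strict mode exists for callers who want to fail loudly.

```python
import json
import math

# Hypothetical report payload; the keys mirror HarmScorerMetrics field names,
# but the actual eval report layout is an assumption.
report = {"mean_absolute_error": 0.1, "t_statistic": math.nan, "p_value": math.nan}

# The default encoder emits the bare token NaN, which is not valid per the JSON
# spec, and json.loads reads it back as float("nan") without raising.
serialized = json.dumps(report)
round_tripped = json.loads(serialized)


def dumps_strict(payload: dict) -> str:
    """Serialize, raising ValueError if any float is NaN or infinite."""
    return json.dumps(payload, allow_nan=False)
```

Strict parsers in other languages may reject the `NaN` token outright, so the same report can fail in one consumer and pass unnoticed in another.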

### Root cause

[`pyrit/score/scorer_evaluation/scorer_evaluator.py:588`](https://github.com/microsoft/PyRIT/blob/main/pyrit/score/scorer_evaluation/scorer_evaluator.py#L588):

```python
t_statistic, p_value = cast("tuple[float, float]", ttest_1samp(diff, 0))
```

`ttest_1samp` divides by the sample standard error. When `diff` has zero (or near-zero) within-sample variance, this becomes a `0/0` or `c/0` form and scipy returns NaN with three warnings. Two cases hit this:

1. **Perfect agreement** — scorer matches the human gold labels exactly, `diff = [0, 0, ...]`. Triggered by the existing `test_compute_harm_metrics_perfect_agreement` test.
2. **Constant systematic bias** — scorer is off by a fixed offset, `diff = [c, c, ...]` with c ≠ 0. Triggered by the existing `test_compute_harm_metrics_partial_agreement` test (model is +0.1 on every response).
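The zero denominator in both cases can be checked without scipy. The sketch below computes the sample standard error by hand (`standard_error` is a hypothetical helper, not part of PyRIT). The constant-bias example uses 0.5 because it is exactly representable in binary floating point; with 0.1, rounding may leave a tiny non-zero variance, which is the "near-zero" variant of the same problem.

```python
import numpy as np


def standard_error(diff: np.ndarray) -> float:
    """Sample standard error of the mean: s / sqrt(n), using ddof=1."""
    return float(np.std(diff, ddof=1) / np.sqrt(diff.size))


# Case 1: perfect agreement, every difference is zero.
perfect = np.zeros(5)
# Case 2: constant systematic bias, every difference is the same non-zero value.
constant_bias = np.full(4, 0.5)
# Contrast: differences that actually vary across responses.
noisy = np.array([0.1, -0.2, 0.3, 0.0])

se_perfect = standard_error(perfect)      # 0.0, so t would be 0/0
se_bias = standard_error(constant_bias)   # 0.0, so t would be c/0
se_noisy = standard_error(noisy)          # positive, so t is well defined
```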

Both existing tests emit the warnings but don't assert on `t_statistic` / `p_value`, so the bug is invisible to the suite even though the data path produces it.

### Repro

```shell
pytest tests/unit/score/test_scorer_evaluator.py -W "error::RuntimeWarning"
# -> 2 failed: test_compute_harm_metrics_perfect_agreement, test_compute_harm_metrics_partial_agreement
```

### Proposed fix

Guard the `ttest_1samp` call. When `diff` is (numerically) constant:

- If `diff[0] ≈ 0` (perfect agreement): the null hypothesis (mean diff = 0) is exactly satisfied, so report `t_statistic=0.0, p_value=1.0`. This is the conventional null-result interpretation.
- Otherwise (constant non-zero bias with no variance): the t-test is undefined — there's a systematic offset but no within-sample variability. Report `NaN` explicitly. `mean_absolute_error` already captures the bias magnitude.

Use `np.allclose` rather than `==` so the float noise that creeps in from `np.median(...)` differences doesn't escape the guard. Update both existing tests to assert on `t_statistic` / `p_value` so future regressions are caught. Document the convention in the `HarmScorerMetrics` docstring.
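A minimal sketch of what such a guard could look like, assuming a standalone helper rather than the actual change to `_compute_metrics`. The name `guarded_ttest` and the empty-input handling are illustrative assumptions, and the tolerances are simply `np.allclose` / `np.isclose` defaults; the eventual PR may choose differently.

```python
import numpy as np
from scipy.stats import ttest_1samp


def guarded_ttest(diff) -> tuple[float, float]:
    """Hypothetical helper: one-sample t-test of diff against 0 with a zero-variance guard."""
    diff = np.asarray(diff, dtype=float)

    # Treat diff as constant when every element is within tolerance of the first.
    if diff.size > 0 and np.allclose(diff, diff[0]):
        if np.isclose(diff[0], 0.0):
            # Perfect agreement: the null hypothesis holds exactly.
            return 0.0, 1.0
        # Constant non-zero bias: the t-test is undefined, so report NaN explicitly.
        return float("nan"), float("nan")

    # Non-constant diff: defer to scipy as before.
    result = ttest_1samp(diff, 0)
    return float(result.statistic), float(result.pvalue)
```

Because the constant cases return before scipy is called, no runtime warnings are raised for them, and the NaN in the constant-bias case becomes a documented outcome rather than a side effect.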

PR to follow.