Skip to content

FEAT: Add FORTRESS dataset loader and per-row rubric scorer#1805

Open
romanlutz wants to merge 1 commit into
microsoft:mainfrom
romanlutz:romanlutz/fortress-safety-utility
Open

FEAT: Add FORTRESS dataset loader and per-row rubric scorer#1805
romanlutz wants to merge 1 commit into
microsoft:mainfrom
romanlutz:romanlutz/fortress-safety-utility

Conversation

@romanlutz
Copy link
Copy Markdown
Contributor

@romanlutz romanlutz commented May 25, 2026

What

Adds the FORTRESS (Frontier Risk Evaluation for National Security and Public Safety) benchmark from Scale AI (arXiv:2506.14922, HF: ScaleAI/fortress_public) to PyRIT as a first-class dataset loader plus a generic per-row binary-rubric scorer.

FORTRESS is paired (each adversarial prompt ships a benign rephrasing on the same topic) and rubric-based (each adversarial prompt ships its own 4-7 binary Y/N criteria). Both properties are novel in the PyRIT catalog and motivated landing the dataset as a first-class loader rather than folding it into an existing one.

Why

SelfAskGeneralTrueFalseScorer is single-criterion; SelfAskScaleScorer returns a single Likert float. Neither handles the per-row rubric pattern. FortressRubricScorer is intentionally generic so any future dataset that ships a per-row binary rubric in metadata can reuse it unchanged.

What's in

Loader - pyrit/datasets/seed_datasets/remote/fortress_dataset.py

  • Three sibling classes: _FortressAdversarialDataset (500), _FortressBenignDataset (500), _FortressPairedDataset (1000).
  • Two filter enums: FortressRiskDomain (3) and FortressRiskSubdomain (10), matching upstream stored values.
  • HF revision pinned to 0c096becbc75bb12065c8059a53960c7f0d4d35c (same as the Inspect Evals reference port).
  • Adversarial seeds carry the per-row rubric + num_dim in SeedPrompt.metadata; both halves carry the partner via paired_prompt so downstream consumers can compute the paired (ARS, ORS) metric.
  • Reproduces the Scale AI use-restriction notice verbatim in docstrings; tags each seed metadata["use_restriction"] = "no_adversarial_training".

Scorer - pyrit/score/float_scale/fortress_rubric_scorer.py

  • New FortressRubricScorer(FloatScaleScorer) reads rubric + num_dim (and optional original_prompt) from MessagePiece.prompt_metadata, asks a configurable judge for an N-character Y/N string, and returns the fraction of Y verdicts in [0, 1].
  • Malformed grades (wrong length, non-Y/N chars) yield 0.0 with score_metadata["invalid"] = "true" so aggregators can filter (matches upstream Inspect Evals behavior).
  • YAML system prompt at pyrit/datasets/score/fortress/rubric_system_prompt.yaml, parameterized with criteria / num_dim / original_prompt.

Docs / registry

  • doc/references.bib: new @misc{knight2025fortress, ...}
  • doc/bibliography.md: hidden-citation block updated
  • doc/code/datasets/1_loading_datasets.{py,ipynb}: FORTRESS added to dataset list and discovery output
  • pyrit/datasets/seed_datasets/remote/__init__.py and pyrit/score/__init__.py: registrations

What's NOT in (intentional follow-up)

FortressScenario (wiring loader + scorer + over-refusal scorer into a single scenario emitting per-(sub)domain ARS/ORS), an aggregate scorecard primitive for paired-population reduction, and a notebook example are explicitly deferred to follow-up PRs.

Validation

  • 39 new unit tests (23 loader + 16 scorer), all green.
  • Full tests/unit/datasets/ + tests/unit/score/ (1545 passed).
  • Pre-commit clean (ruff format, ruff check, ty, nbqa-ruff, validate-docs).
  • Live HF fetch confirmed: 500 / 500 / 1000 prompts; domain counts match upstream README exactly (CBRNE 180, PVT 132, CFIA 188); per-subdomain counts reflect the README-documented mismatched-subdomain rows (HF discussion Upgrade transformers to >=4.36.0 to address dependabot alert #4) and are preserved verbatim per documented design.

Use-restriction reminder

Dataset is CC-BY-4.0 with the Scale AI use restriction: "This dataset should be used for safety evaluations only, and it is prohibited to use these data for any adversarial training or research." Loader docstrings reproduce this verbatim and tag each seed; no programmatic enforcement.

Adds the FORTRESS (Frontier Risk Evaluation for National Security and Public Safety) benchmark from Scale AI (arXiv:2506.14922) to PyRIT.

**Loader** (pyrit/datasets/seed_datasets/remote/fortress_dataset.py):

- Three sibling classes: `_FortressAdversarialDataset` (500), `_FortressBenignDataset` (500), and `_FortressPairedDataset` (1000).

- Two filter enums: `FortressRiskDomain` (3 values) and `FortressRiskSubdomain` (10 values), matching upstream values.

- Pins HF revision `0c096becbc75bb12065c8059a53960c7f0d4d35c` (same as the Inspect Evals port).

- Adversarial seeds carry the per-row rubric and `num_dim` in `SeedPrompt.metadata` so the scorer can grade against the prompt's own rubric. Both halves carry the partner prompt via `paired_prompt` for downstream (ARS, ORS) trade-off analysis.

- Reproduces the Scale AI use-restriction notice verbatim in the docstrings; tags each seed with `use_restriction='no_adversarial_training'`.

**Scorer** (pyrit/score/float_scale/fortress_rubric_scorer.py):

- New `FortressRubricScorer(FloatScaleScorer)` that reads `rubric` + `num_dim` (and optional `original_prompt`) from `MessagePiece.prompt_metadata` and asks a configurable judge to emit an N-character Y/N string. The score is the fraction of Y verdicts, in `[0, 1]`.

- Malformed grades (wrong length, non-Y/N characters) yield `0.0` with `score_metadata['invalid'] = 'true'` so aggregators can filter, matching upstream Inspect Evals behavior.

- YAML system prompt at `pyrit/datasets/score/fortress/rubric_system_prompt.yaml` is parameterized with `criteria` / `num_dim` / `original_prompt`; the rubric pattern is intentionally generic and works for any dataset that supplies a per-row binary rubric.

**Validation:** 39 new unit tests (23 loader + 16 scorer), all passing. Pre-commit clean (ruff format, ruff check, ty, nbqa-ruff, validate-docs). Live HF fetch confirmed: 500/500/1000 prompts; domain counts match upstream README exactly (CBRNE 180, PVT 132, CFIA 188); per-subdomain counts reflect upstream's known mismatched-subdomain rows (HF discussion #4) and are preserved verbatim per documented design.

References:

- `doc/references.bib`: new `@misc{knight2025fortress, ...}`

- `doc/bibliography.md`: hidden-citation block updated

- `doc/code/datasets/1_loading_datasets.{py,ipynb}`: FORTRESS added to dataset list and discovery output

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant