FEAT: Add FORTRESS dataset loader and per-row rubric scorer#1805
Open
romanlutz wants to merge 1 commit into
Open
FEAT: Add FORTRESS dataset loader and per-row rubric scorer#1805romanlutz wants to merge 1 commit into
romanlutz wants to merge 1 commit into
Conversation
Adds the FORTRESS (Frontier Risk Evaluation for National Security and Public Safety) benchmark from Scale AI (arXiv:2506.14922) to PyRIT. **Loader** (pyrit/datasets/seed_datasets/remote/fortress_dataset.py): - Three sibling classes: `_FortressAdversarialDataset` (500), `_FortressBenignDataset` (500), and `_FortressPairedDataset` (1000). - Two filter enums: `FortressRiskDomain` (3 values) and `FortressRiskSubdomain` (10 values), matching upstream values. - Pins HF revision `0c096becbc75bb12065c8059a53960c7f0d4d35c` (same as the Inspect Evals port). - Adversarial seeds carry the per-row rubric and `num_dim` in `SeedPrompt.metadata` so the scorer can grade against the prompt's own rubric. Both halves carry the partner prompt via `paired_prompt` for downstream (ARS, ORS) trade-off analysis. - Reproduces the Scale AI use-restriction notice verbatim in the docstrings; tags each seed with `use_restriction='no_adversarial_training'`. **Scorer** (pyrit/score/float_scale/fortress_rubric_scorer.py): - New `FortressRubricScorer(FloatScaleScorer)` that reads `rubric` + `num_dim` (and optional `original_prompt`) from `MessagePiece.prompt_metadata` and asks a configurable judge to emit an N-character Y/N string. The score is the fraction of Y verdicts, in `[0, 1]`. - Malformed grades (wrong length, non-Y/N characters) yield `0.0` with `score_metadata['invalid'] = 'true'` so aggregators can filter, matching upstream Inspect Evals behavior. - YAML system prompt at `pyrit/datasets/score/fortress/rubric_system_prompt.yaml` is parameterized with `criteria` / `num_dim` / `original_prompt`; the rubric pattern is intentionally generic and works for any dataset that supplies a per-row binary rubric. **Validation:** 39 new unit tests (23 loader + 16 scorer), all passing. Pre-commit clean (ruff format, ruff check, ty, nbqa-ruff, validate-docs). Live HF fetch confirmed: 500/500/1000 prompts; domain counts match upstream README exactly (CBRNE 180, PVT 132, CFIA 188); per-subdomain counts reflect upstream's known mismatched-subdomain rows (HF discussion #4) and are preserved verbatim per documented design. References: - `doc/references.bib`: new `@misc{knight2025fortress, ...}` - `doc/bibliography.md`: hidden-citation block updated - `doc/code/datasets/1_loading_datasets.{py,ipynb}`: FORTRESS added to dataset list and discovery output Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Adds the FORTRESS (Frontier Risk Evaluation for National Security and Public Safety) benchmark from Scale AI (arXiv:2506.14922, HF: ScaleAI/fortress_public) to PyRIT as a first-class dataset loader plus a generic per-row binary-rubric scorer.
FORTRESS is paired (each adversarial prompt ships a benign rephrasing on the same topic) and rubric-based (each adversarial prompt ships its own 4-7 binary Y/N criteria). Both properties are novel in the PyRIT catalog and motivated landing the dataset as a first-class loader rather than folding it into an existing one.
Why
SelfAskGeneralTrueFalseScoreris single-criterion;SelfAskScaleScorerreturns a single Likert float. Neither handles the per-row rubric pattern.FortressRubricScoreris intentionally generic so any future dataset that ships a per-row binary rubric in metadata can reuse it unchanged.What's in
Loader -
pyrit/datasets/seed_datasets/remote/fortress_dataset.py_FortressAdversarialDataset(500),_FortressBenignDataset(500),_FortressPairedDataset(1000).FortressRiskDomain(3) andFortressRiskSubdomain(10), matching upstream stored values.0c096becbc75bb12065c8059a53960c7f0d4d35c(same as the Inspect Evals reference port).num_diminSeedPrompt.metadata; both halves carry the partner viapaired_promptso downstream consumers can compute the paired (ARS, ORS) metric.metadata["use_restriction"] = "no_adversarial_training".Scorer -
pyrit/score/float_scale/fortress_rubric_scorer.pyFortressRubricScorer(FloatScaleScorer)readsrubric+num_dim(and optionaloriginal_prompt) fromMessagePiece.prompt_metadata, asks a configurable judge for an N-character Y/N string, and returns the fraction of Y verdicts in [0, 1].0.0withscore_metadata["invalid"] = "true"so aggregators can filter (matches upstream Inspect Evals behavior).pyrit/datasets/score/fortress/rubric_system_prompt.yaml, parameterized withcriteria/num_dim/original_prompt.Docs / registry
doc/references.bib: new@misc{knight2025fortress, ...}doc/bibliography.md: hidden-citation block updateddoc/code/datasets/1_loading_datasets.{py,ipynb}: FORTRESS added to dataset list and discovery outputpyrit/datasets/seed_datasets/remote/__init__.pyandpyrit/score/__init__.py: registrationsWhat's NOT in (intentional follow-up)
FortressScenario(wiring loader + scorer + over-refusal scorer into a single scenario emitting per-(sub)domain ARS/ORS), an aggregate scorecard primitive for paired-population reduction, and a notebook example are explicitly deferred to follow-up PRs.Validation
tests/unit/datasets/+tests/unit/score/(1545 passed).Use-restriction reminder
Dataset is CC-BY-4.0 with the Scale AI use restriction: "This dataset should be used for safety evaluations only, and it is prohibited to use these data for any adversarial training or research." Loader docstrings reproduce this verbatim and tag each seed; no programmatic enforcement.