FEAT add DecodingTrust Toxicity dataset loader (subtask of #291)

### Subtask of #291

DecodingTrust [1] publishes adversarial toxicity test prompts at
[`/data/toxicity/user_prompts/`](https://github.com/AI-secure/DecodingTrust/tree/main/data/toxicity/user_prompts)
(sampled from RealToxicityPrompts [2]). The Stereotypes perspective was added in PR #385;
the Toxicity perspective has not been added yet.

#### Data
- `toxic.jsonl`: 1196 records (prompts with a high Perspective `toxicity` score)
- `nontoxic.jsonl`: 1200 records
- Each record has a plain-text prompt (`prompt.text`) plus 8 Perspective-API scores
  (`toxicity`, `severe_toxicity`, `profanity`, `sexually_explicit`, `identity_attack`,
  `threat`, `insult`, `flirtation`) and a `challenging: bool` flag.
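
To make the record shape concrete, here is a minimal parsing sketch. It assumes the scores sit next to `text` inside the `prompt` object and that `challenging` is a top-level key; the helper name `parse_records` and the sample line are hypothetical and made up for illustration, not copied from the dataset.

```python
import json

# The 8 Perspective API attributes listed above.
SCORE_KEYS = (
    "toxicity", "severe_toxicity", "profanity", "sexually_explicit",
    "identity_attack", "threat", "insult", "flirtation",
)


def parse_records(jsonl_text: str) -> list[dict]:
    """Parse JSONL text into flat dicts (hypothetical helper).

    Assumes each line looks like
    {"challenging": bool, "prompt": {"text": str, "<score>": float | null, ...}}.
    """
    parsed = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        prompt = record["prompt"]
        parsed.append(
            {
                "text": prompt["text"],
                "challenging": bool(record.get("challenging", False)),
                # Missing or null scores are kept as None.
                "scores": {key: prompt.get(key) for key in SCORE_KEYS},
            }
        )
    return parsed


# Made-up, benign sample line (not taken from the dataset).
SAMPLE = (
    '{"challenging": true, "prompt": {"text": "example prompt", '
    '"toxicity": 0.91, "profanity": 0.72, "threat": null}}\n'
)

if __name__ == "__main__":
    print(parse_records(SAMPLE))
```

If the actual files nest the scores differently, only the two lookups on `record` and `prompt` would need to change.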

#### Proposed loader
- `_DecodingTrustToxicityDataset(_RemoteDatasetLoader)` following the
  `_DangerousQADataset` pattern (single URL, `_fetch_from_url`, plain-string extraction).
- Parameters: `subset: Literal["toxic", "nontoxic", "all"] = "all"`,
  `challenging_only: bool = False`, `toxicity_threshold: float = 0.5`.
- Per-prompt `harm_categories` derived from the Perspective scores (e.g. include
  `"profanity"` when `prompt.profanity >= toxicity_threshold`).
- Source URL pinned to a specific commit SHA
  (current main HEAD: `161ae8321ced62f45fcd9ceb412e05b47c603cd4`, 2024-09-16).
- Unit tests mock `_fetch_from_url`, mirroring
  `tests/unit/datasets/test_dangerous_qa_dataset.py`.
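
The parameter semantics above can be sketched as standalone functions, independent of PyRIT's base classes. The names `source_urls`, `harm_categories`, and `select_prompts` are hypothetical, the raw-URL layout is an assumption based on the directory linked above, and applying `toxicity_threshold` to every score (not only `toxicity`) is my reading of the proposal.

```python
from typing import Literal

# Commit SHA quoted in this issue; the raw-URL layout below is an assumption.
PINNED_SHA = "161ae8321ced62f45fcd9ceb412e05b47c603cd4"
RAW_BASE = "https://raw.githubusercontent.com/AI-secure/DecodingTrust"

SCORE_KEYS = (
    "toxicity", "severe_toxicity", "profanity", "sexually_explicit",
    "identity_attack", "threat", "insult", "flirtation",
)


def source_urls(
    subset: Literal["toxic", "nontoxic", "all"] = "all",
    sha: str = PINNED_SHA,
) -> list[str]:
    """Build commit-pinned URLs for the requested subset (hypothetical helper)."""
    names = ["toxic", "nontoxic"] if subset == "all" else [subset]
    return [f"{RAW_BASE}/{sha}/data/toxicity/user_prompts/{n}.jsonl" for n in names]


def harm_categories(prompt: dict, threshold: float = 0.5) -> list[str]:
    """Return every score name whose value is at or above the threshold."""
    # Null or missing scores are treated as 0.0, so they never qualify.
    return [k for k in SCORE_KEYS if (prompt.get(k) or 0.0) >= threshold]


def select_prompts(
    records: list[dict],
    challenging_only: bool = False,
    toxicity_threshold: float = 0.5,
) -> list[tuple[str, list[str]]]:
    """Filter records and pair each prompt text with its derived categories."""
    selected = []
    for record in records:
        if challenging_only and not record.get("challenging", False):
            continue
        prompt = record["prompt"]
        selected.append((prompt["text"], harm_categories(prompt, toxicity_threshold)))
    return selected
```

Keeping the URL construction and the filtering free of network calls means the unit tests only need to stub the fetch step and can feed these functions plain dicts.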

#### One question before I start
DecodingTrust's root LICENSE is **CC BY-SA 4.0**, while PyRIT is MIT. The existing
Stereotypes assets (`pyrit/datasets/jailbreak/templates/dt_stereotypes_*.yaml`) ship the
system prompts in-tree with attribution. For Toxicity I'd plan to fetch at runtime from
`raw.githubusercontent.com` (no vendoring) and add full attribution in the class
docstring. Is that the approach you'd like, or should I handle CC BY-SA sources
differently?

#### References
1. Wang et al., 2023. *DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models.* https://arxiv.org/abs/2306.11698
2. Gehman et al., 2020. *RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models.* https://arxiv.org/abs/2009.11462

> ⚠️ **Content warning:** the prompts include profanity, sexual content, and identity attacks (standard for red-team toxicity datasets).


FEAT add DecodingTrust Toxicity dataset loader (subtask of #291) #1798
