FEAT add DecodingTrust Toxicity dataset loader (subtask of #291) #1798

@v0ropaev

Description

Subtask of #291

DecodingTrust [1] publishes adversarial toxicity test prompts, sampled from
RealToxicityPrompts [2], at /data/toxicity/user_prompts/. The Stereotypes perspective
was added in PR #385; the Toxicity perspective has not been implemented yet.

Data

  • toxic.jsonl — 1196 records (high Perspective toxicity)
  • nontoxic.jsonl — 1200 records
  • Each record has a plain-text prompt (prompt.text) plus 8 Perspective-API scores
    (toxicity, severe_toxicity, profanity, sexually_explicit, identity_attack,
    threat, insult, flirtation) and a challenging: bool flag.
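
To make the record shape concrete, here is a minimal parsing sketch. It assumes the scores are nested under `prompt` (as the `prompt.profanity` reference in the proposal suggests) and that `challenging` is a top-level key. `ToxicityRecord` and `parse_record` are hypothetical names, and the sample line is a synthetic placeholder, not a real dataset entry.

```python
import json
from dataclasses import dataclass

# The eight Perspective attributes listed in the Data section.
PERSPECTIVE_ATTRIBUTES = (
    "toxicity", "severe_toxicity", "profanity", "sexually_explicit",
    "identity_attack", "threat", "insult", "flirtation",
)


@dataclass(frozen=True)
class ToxicityRecord:
    """Hypothetical in-memory view of one JSONL record."""
    text: str
    scores: dict
    challenging: bool


def parse_record(line: str) -> ToxicityRecord:
    """Parse one JSONL line; assumes scores live under the "prompt" key."""
    raw = json.loads(line)
    prompt = raw["prompt"]
    # Keep only attributes that are present and non-null.
    scores = {
        name: float(prompt[name])
        for name in PERSPECTIVE_ATTRIBUTES
        if prompt.get(name) is not None
    }
    return ToxicityRecord(
        text=prompt["text"],
        scores=scores,
        challenging=bool(raw.get("challenging", False)),
    )


# Synthetic placeholder line (values are made up for illustration).
sample_line = json.dumps({
    "challenging": True,
    "prompt": {"text": "placeholder prompt", "toxicity": 0.9, "profanity": 0.7},
})
record = parse_record(sample_line)
```

Missing or null scores are dropped rather than defaulted to 0.0, so a later threshold check cannot mistake "not scored" for "scored as harmless".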

Proposed loader

  • _DecodingTrustToxicityDataset(_RemoteDatasetLoader) following the
    _DangerousQADataset pattern (single URL, _fetch_from_url, plain-string extraction).
  • Parameters: subset: Literal["toxic", "nontoxic", "all"] = "all",
    challenging_only: bool = False, toxicity_threshold: float = 0.5.
  • Per-prompt harm_categories derived from the Perspective scores (e.g. include
    "profanity" when prompt.profanity >= toxicity_threshold).
  • Source URL pinned to a specific commit SHA
    (current main HEAD: 161ae8321ced62f45fcd9ceb412e05b47c603cd4, 2024-09-16).
  • Unit tests mock _fetch_from_url, mirroring
    tests/unit/datasets/test_dangerous_qa_dataset.py.
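
The parameter semantics above could be sketched roughly as follows. This is standalone Python and does not use PyRIT's `_RemoteDatasetLoader` or `_fetch_from_url`. The helper names are hypothetical, `repo_slug` is a placeholder the caller must supply, and applying `toxicity_threshold` only to `harm_categories` (not to record selection) is my reading of the proposal.

```python
from typing import Literal

# Commit SHA quoted in the proposal.
PINNED_SHA = "161ae8321ced62f45fcd9ceb412e05b47c603cd4"
DATA_DIR = "data/toxicity/user_prompts"
SUBSET_FILES = {
    "toxic": ("toxic.jsonl",),
    "nontoxic": ("nontoxic.jsonl",),
    "all": ("toxic.jsonl", "nontoxic.jsonl"),
}
PERSPECTIVE_ATTRIBUTES = (
    "toxicity", "severe_toxicity", "profanity", "sexually_explicit",
    "identity_attack", "threat", "insult", "flirtation",
)


def build_raw_urls(
    repo_slug: str,
    subset: Literal["toxic", "nontoxic", "all"] = "all",
    sha: str = PINNED_SHA,
) -> list:
    """Build commit-pinned raw URLs; repo_slug ("owner/repo") is caller-supplied."""
    return [
        f"https://raw.githubusercontent.com/{repo_slug}/{sha}/{DATA_DIR}/{name}"
        for name in SUBSET_FILES[subset]
    ]


def derive_harm_categories(prompt: dict, toxicity_threshold: float = 0.5) -> list:
    """Return the attributes whose score meets or exceeds the threshold."""
    return [
        name
        for name in PERSPECTIVE_ATTRIBUTES
        if isinstance(prompt.get(name), (int, float))
        and prompt[name] >= toxicity_threshold
    ]


def select_prompts(
    records: list,
    *,
    challenging_only: bool = False,
    toxicity_threshold: float = 0.5,
) -> list:
    """Return (text, harm_categories) pairs, optionally keeping only challenging records."""
    selected = []
    for record in records:
        if challenging_only and not record.get("challenging", False):
            continue
        prompt = record["prompt"]
        selected.append(
            (prompt["text"], derive_harm_categories(prompt, toxicity_threshold))
        )
    return selected
```

In the real loader, the fetched records would presumably be turned into seed prompts instead of tuples; the tuples here only keep the sketch self-contained.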

One question before I start

DecodingTrust's root LICENSE is CC BY-SA 4.0, while PyRIT is MIT. The existing
Stereotypes assets (pyrit/datasets/jailbreak/templates/dt_stereotypes_*.yaml) ship the
system prompts in-tree with attribution. For Toxicity I'd plan to fetch at runtime from
raw.githubusercontent.com (no vendoring) and add full attribution in the class
docstring. Is that the approach you'd like, or should I handle CC BY-SA sources
differently?

References

  1. Wang et al., 2023. DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. https://arxiv.org/abs/2306.11698
  2. Gehman et al., 2020. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. https://arxiv.org/abs/2009.11462

⚠️ Content warning: the prompts include profanity, sexual content, and identity attacks (standard for red-team toxicity datasets).
