PFW at SemEval-2026 Task 6: Multi-Seed DeBERTa Ensembles for Political Response Clarity and Evasion Classification
Official code and paper source for the PFW submission to SemEval-2026 Task 6 (CLARITY). Our system ranked 18th of 41 on Subtask 1 (Clarity) and 12th of 33 on Subtask 2 (Evasion). The paper has been accepted for presentation at SemEval-2026.
Tamsal, T. and Rusert, J. (2026). PFW at SemEval-2026 Task 6: Multi-Seed DeBERTa Ensembles for Political Response Clarity and Evasion Classification. In Proceedings of the 20th International Workshop on Semantic Evaluation (SemEval-2026), ACL.
| | Subtask 1 (3-way Clarity) | Subtask 2 (9-way Evasion) |
|---|---|---|
| Architecture | DeBERTa-xlarge (900M) | DeBERTa-v3-large (304M) |
| Ensemble | 5 folds × 10 seeds = 50 models | 5 folds × 10 seeds = 50 models |
| Aggregation | Simple logit averaging | Simple logit averaging |
| Macro F1 (official eval) | 0.76 (18/41) | 0.50 (12/33) |
No LLM prompting or API access is required; the largest model is under 1B parameters and runs on a single A100 GPU.
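The "simple logit averaging" aggregation in the table above is exactly what it sounds like: each of the 50 fold × seed models emits a logit vector per example, the vectors are averaged over the model axis, and the argmax gives the final label. A minimal sketch (the array shapes and random logits below are illustrative, not artifacts from this repo):

```python
import numpy as np

# Hypothetical stacked logits: (n_models, n_examples, n_classes),
# e.g. 50 models x 237 eval examples x 3 clarity classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(50, 237, 3))

# Simple logit averaging: mean over the model axis, then argmax.
avg_logits = logits.mean(axis=0)          # (n_examples, n_classes)
predictions = avg_logits.argmax(axis=1)   # (n_examples,)
```

No weighting, calibration, or thresholding is applied; as the results below show, this plain average is what transferred best to the official evaluation set.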
| System | Subtask 1 F1 | Subtask 2 F1 |
|---|---|---|
| Majority class | 0.248 | 0.052 |
| TF-IDF + Logistic Regression | 0.546 | 0.319 |
| ChatGPT zero-shot (Thomas et al., 2024) | 0.413 | 0.244 |
| DeBERTa-base fine-tuned (Thomas et al., 2024) | 0.441 | – |
| RoBERTa-base fine-tuned (Thomas et al., 2024) | 0.530 | – |
| XLNet-base fine-tuned (Thomas et al., 2024) | 0.518 | – |
| DeBERTa-v3-large (single seed, ours) | 0.643 ± 0.024 | 0.327 ± 0.040 |
| DeBERTa-xlarge (single seed, ours) | 0.663 ± 0.021 | – |
| Multi-seed ensemble (ours) — official eval | 0.76 | 0.50 |
See `latex/acl_latex.pdf` for full results and analysis.
Three independent post-hoc optimization strategies (learned ensemble weights, per-class thresholds, and hierarchical masking) each improved out-of-fold (OOF) macro F1 but degraded official evaluation scores by 0.02–0.10. We document this as an optimization paradox: with limited evaluation data (237 samples), model-level interventions (seed diversity) transfer robustly, whereas prediction-level interventions (post-hoc calibration) overfit OOF artifacts. See Section 5.3 of the paper.
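To make the failure mode concrete, per-class thresholding is the simplest of the three interventions: the class scores are biased by tuned offsets before the argmax. A hedged sketch (the offsets and arrays below are invented for illustration, not the tuned values from the paper):

```python
import numpy as np

# Hypothetical OOF class probabilities for a 3-way task.
rng = np.random.default_rng(1)
oof_probs = rng.dirichlet(np.ones(3), size=200)

# Plain argmax — what the submitted system does after logit averaging.
plain_preds = oof_probs.argmax(axis=1)

# Per-class thresholding: bias each class by an offset before the argmax.
# The offsets are fit to maximize OOF macro F1 — which is exactly where
# the overfitting risk comes from with only 237 evaluation samples.
offsets = np.array([0.00, 0.05, -0.05])
tuned_preds = (oof_probs + offsets).argmax(axis=1)
```

Offsets fit this way can only chase noise in the OOF score once the genuine class-imbalance signal is exhausted, which is consistent with the observed 0.02–0.10 drop on the official evaluation.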
```
.
├── latex/                           # Camera-ready paper source and PDF
│   ├── acl_latex.tex                # ACL-format paper
│   ├── acl_latex.pdf                # Compiled PDF
│   ├── custom.bib                   # Bibliography
│   ├── acl.sty                      # ACL template style
│   └── acl_natbib.bst               # ACL bibliography style
├── docs/
│   ├── methodology_report.md        # Extended methodology notes
│   └── paper_figures/               # Paper figures (PDF+PNG) and result tables (CSV/JSON)
├── src/
│   ├── data/                        # Dataset loading, preprocessing, CV splits
│   ├── models/                      # DeBERTa encoder-classifier architecture
│   ├── training/                    # Training loops (multi-seed, per-architecture)
│   ├── metrics/                     # Macro-F1 scorer and local evaluation harness
│   └── submission/                  # Prediction-file and submission-zip helpers
├── scripts/
│   ├── collect_task1_oof.py         # OOF-logit collection (Task 1, xlarge)
│   ├── collect_v3large_oof.py       # OOF-logit collection (both tasks, v3-large)
│   ├── build_oof_logits.py          # OOF-matrix utilities
│   ├── generate_final_ensemble.py   # End-to-end inference → submission
│   ├── generate_simple_ensemble.py  # Reference logit-averaging ensemble
│   ├── eval_task1_predictions.py    # Local T1 scorer
│   ├── eval_task2_predictions.py    # Local T2 scorer
│   ├── local_eval.py                # Combined local evaluator
│   ├── paper_baselines.py           # Reproduce majority / TF-IDF baselines
│   ├── paper_analysis.py            # Regenerate figures and result tables
│   └── slurm/                       # SLURM job scripts for an HPC workflow
├── requirements.txt
├── CITATION.cff
├── LICENSE
└── README.md
```
Everything outside this tree (raw data, checkpoints, OOF logits, training logs, submission archives) is produced by the training pipeline and gitignored.
- Python 3.9+ with PyTorch 2.0+ (CUDA 12.1)
- 1× NVIDIA A100-80GB recommended for DeBERTa-xlarge; a 24 GB card is sufficient for DeBERTa-v3-large
- Estimated compute for the full 100-model campaign: ~50 GPU-hours
```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

The QEvasion dataset (Thomas et al., 2024) is loaded from the HuggingFace Hub:
```python
from datasets import load_dataset

dataset = load_dataset("ailsntua/QEvasion")
```

The official SemEval-2026 evaluation set is distributed by the task organizers and is not included in this repository.
1. Generate stratified 5-fold splits (writes to `artifacts/splits/`):

```bash
python src/data/splits.py
```

2. Train one fold × one seed (primary entry point):
```bash
python src/training/train_10seed.py \
    --task 1 --fold 0 --seed 42 \
    --model_name microsoft/deberta-xlarge \
    --epochs 3 --lr 2e-5
```

For the full 50-model ensemble per architecture, submit the provided SLURM arrays:
```bash
sbatch scripts/slurm/task1_10seed.sbatch   # Task 1, all (fold × seed) combos
sbatch scripts/slurm/task2_10seed.sbatch   # Task 2, all (fold × seed) combos
```

3. Collect OOF logits:
```bash
python scripts/collect_task1_oof.py     # xlarge → Task 1
python scripts/collect_v3large_oof.py   # v3-large → Tasks 1 & 2
```

4. Generate ensemble predictions:
```bash
python scripts/generate_simple_ensemble.py   # logit averaging (no calibration)
```

5. Evaluate locally:
```bash
python scripts/local_eval.py --task 1 --prediction_file submissions/task1_prediction
python scripts/local_eval.py --task 2 --prediction_file submissions/task2_prediction
```

To reproduce the paper baselines and regenerate figures:

```bash
python scripts/paper_baselines.py   # majority-class and TF-IDF baselines
python scripts/paper_analysis.py    # figures and result tables in docs/paper_figures/
```

To build the paper:

```bash
cd latex
pdflatex acl_latex && bibtex acl_latex && pdflatex acl_latex && pdflatex acl_latex
```

The camera-ready PDF compiles cleanly under the ACL 2026 style, passes `aclpubcheck --paper_type long`, and fits within the 6-page main-body limit.
SemEval-2026 Task 6 (CLARITY) addresses political question evasion detection over the QEvasion dataset:
- Subtask 1: 3-way clarity classification — Clear Reply / Ambivalent / Clear Non-Reply
- Subtask 2: 9-way evasion-type classification — Explicit, Dodging, Deflection, Implicit, General, Partial/half-answer, Claims ignorance, Declining to answer, Clarification
Both subtasks are scored by macro F1 over a 237-sample held-out evaluation set.
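Macro F1 averages the per-class F1 scores with equal weight, so rare evasion types count as much as frequent ones. A self-contained sketch of the metric (equivalent to scikit-learn's `f1_score(..., average="macro")`; the toy labels are invented for illustration):

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1: every class counts equally."""
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Toy 3-way example; 0/1/2 stand in for Clear Reply / Ambivalent / Clear Non-Reply.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 2, 2]
print(round(macro_f1(y_true, y_pred, classes=[0, 1, 2]), 3))  # 0.822
```

With only 237 evaluation samples, the rarest classes contribute very few instances each, which is one reason the post-hoc calibration strategies discussed above failed to transfer.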
```bibtex
@inproceedings{tamsal2026pfw,
  title     = {{PFW} at {SemEval}-2026 Task 6: Multi-Seed {DeBERTa} Ensembles for Political Response Clarity and Evasion Classification},
  author    = {Tamsal, Taleef and Rusert, Jonathan},
  booktitle = {Proceedings of the 20th International Workshop on Semantic Evaluation (SemEval-2026)},
  year      = {2026},
  publisher = {Association for Computational Linguistics},
  note      = {To appear}
}
```

The code in this repository is released under the MIT License. The QEvasion dataset is governed by its own license terms and is not redistributed here.
We thank the CLARITY task organizers for the QEvasion dataset and the shared task. Computational resources were provided by Purdue University Fort Wayne through the Gilbreth HPC cluster.