Review: Evaluation/benchmarking scripts (David) #47
**Paper-to-Code Mapping Update**

This branch contains evaluation scripts for paper tasks RxN, RxR, RxI, and RxTF.

Priority for reviewer reproducibility: MEDIUM. The evaluation methodology is useful but needs to be integrated into a portable harness.
**Final Classification**

Paper tasks: Evaluation scripts for RxN, RxR, RxI, RxTF

**Verdict**

These 8 benchmarking scripts cover 4 paper tasks and demonstrate the evaluation methodology (prompt construction, MCQ shuffling, answer extraction, accuracy computation). However, they are entirely hardcoded to David's cluster, and one has a crashing bug. The methodology should be extracted into a single parameterized evaluation harness rather than merged as-is.

**Work needed for peer review**
**Recommendation**

Do not merge as-is. Use as reference when building a clean evaluation harness for Table 4 reproduction.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
**Updated classification**

Relabeled from the earlier classification. The 8 scripts cover evaluation methodology for all 4 reaction tasks (RxN, RxR, RxI, RxTF).
These should be used as reference when building a clean evaluation harness.
**Reproducibility Assessment**
**Branch: `david_eval`**

Author: David Segura | Commits: 1 | Files changed: 8 | +1,086 lines
**What this contributes**
Eight standalone vLLM-based evaluation scripts in `evaluation/` for four reaction tasks, each with a base and retry variant:

- `benchmarking_inversion-2.py` / `-retry.py`
- `benchmarking_naming-2.py` / `-retry.py`
- `benchmarking_replacement.py` / `-2-retry.py`
- `benchmarking_truefalse-2.py` / `-retry.py`

Each script loads a CSV dataset, constructs MCQ/binary prompts, runs vLLM inference, extracts answers via regex, and computes accuracy metrics.
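The shared pipeline can be sketched as follows. This is a minimal illustration of the flow the review describes (shuffled MCQ prompt, regex answer extraction, accuracy), not the actual script code: the helper names and the exact prompt format are hypothetical, and the vLLM inference call is stubbed out.

```python
import re
import numpy as np

def build_mcq_prompt(question, options, rng):
    """Shuffle the options and return (prompt, correct_letter).

    Assumes options[0] is the correct answer before shuffling."""
    order = rng.permutation(len(options))
    letters = "ABCD"
    lines = [f"{letters[i]}. {options[j]}" for i, j in enumerate(order)]
    correct_letter = letters[list(order).index(0)]
    prompt = question + "\n" + "\n".join(lines) + "\nAnswer with a single letter."
    return prompt, correct_letter

def extract_answer(completion):
    """Pull the first standalone A-D letter out of a model completion."""
    m = re.search(r"\b([A-D])\b", completion)
    return m.group(1) if m else None

def accuracy(preds, golds):
    """Fraction of predictions matching the gold letters."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)
```

In the real scripts the completion would come from vLLM rather than a stub, and the CSV loading sits in front of this loop.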
**Breaks / Blockers**
- `benchmarking_inversion-2-retry.py`: `needs_retry` referenced before definition; the script crashes with a NameError on first run.
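The bug class is easy to reproduce. Below is a hypothetical reconstruction, not the actual script code: the list of rows needing a second pass is appended to before it is ever assigned.

```python
def run_eval(row):
    """Placeholder for the real vLLM inference + answer check."""
    return row.get("correct", False)

def evaluate_with_retry_buggy(rows):
    # Bug pattern: `needs_retry` is used before it is ever assigned,
    # so the first failing row raises NameError.
    for row in rows:
        if not run_eval(row):
            needs_retry.append(row)   # NameError here
    return needs_retry

def evaluate_with_retry_fixed(rows):
    needs_retry = []                  # fix: define before first use
    for row in rows:
        if not run_eval(row):
            needs_retry.append(row)
    return needs_retry
```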
**Reproducibility Gaps**

- Hardcoded cluster paths (`/data/david/...`, `/data/share/sft_hf_3/`); zero CLI args, no env vars, no config files.
- `np.random.permutation()` without a fixed seed, so option ordering varies between runs.
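Both gaps have small fixes, sketched below. The `--data-dir` flag name and its fallback default are assumptions; `MIST_DATA_DIR` follows the review's own suggestion.

```python
import argparse
import os
import numpy as np

def parse_args(argv=None):
    """Read configuration from CLI flags with env-var fallbacks
    instead of hardcoded /data/david/... paths."""
    p = argparse.ArgumentParser(description="Portable eval configuration sketch")
    p.add_argument("--data-dir",
                   default=os.environ.get("MIST_DATA_DIR", "data"),
                   help="dataset root; falls back to $MIST_DATA_DIR, then ./data")
    p.add_argument("--seed", type=int, default=42,
                   help="seed for MCQ option shuffling")
    return p.parse_args(argv)

if __name__ == "__main__":
    args = parse_args()
    rng = np.random.default_rng(args.seed)   # seeded: same shuffle every run
    print("data dir:", args.data_dir)
    print("option order:", rng.permutation(4))
```

A seeded `np.random.Generator` (or a one-time `np.random.seed()` call in the legacy API) makes the MCQ option ordering identical across runs, which reviewers need for exact accuracy reproduction.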
**Relationship to other branches**

- `david_active_branch` is the full dev branch: it contains these scripts plus result CSVs, reaction tasks, and CSCS infrastructure. It also contains a committed WandB API key (security issue) and ~1.4M lines of result CSVs.
- `david_kuma_sampling_param` (PR #36, David kuma sampling param) contains the 3 reaction task classes that these scripts evaluate.
- `evaluate.rxnpred.py` in `Vu_active_branch`
**What is needed for reviewer reproducibility**

- Fix the `needs_retry` NameError in `benchmarking_inversion-2-retry.py`.
- Call `np.random.seed()` (or use a seeded generator) for deterministic MCQ option shuffling.
- Replace hardcoded paths with a configurable `${MIST_DATA_DIR}`.
- Integrate into the `src/open_r1/` evaluation infrastructure rather than standalone scripts.

🤖 Generated with Claude Code