
Review: Evaluation/benchmarking scripts (David)#47

Open
doncamilom wants to merge 3 commits into main from david_eval

Conversation

@doncamilom
Contributor

Reproducibility Assessment -- david_eval

Author: David Segura | Commits: 1 | Files changed: 8 | +1,086 lines

Status: NOT READY -- useful evaluation methodology, completely non-portable.


What this contributes

Eight standalone vLLM-based evaluation scripts in evaluation/ for four reaction tasks, each with a base and retry variant:

| Script | Task |
| --- | --- |
| `benchmarking_inversion-2.py` / `-retry.py` | MCQ: identify the correct reaction (swapped reactants) |
| `benchmarking_naming-2.py` / `-retry.py` | Reaction class prediction (10 classes) |
| `benchmarking_replacement.py` / `-2-retry.py` | MCQ: identify the correct reaction (modified reactions) |
| `benchmarking_truefalse-2.py` / `-retry.py` | True/False: is this reaction correct? |

Each script loads a CSV dataset, constructs MCQ/binary prompts, runs vLLM inference, extracts answers via regex, and computes accuracy metrics.
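The per-script pipeline can be sketched as follows. This is a hedged illustration of the regex-extraction and accuracy steps described above; the function names, regex, and column values are assumptions, not code taken from the branch.

```python
import re

def extract_answer(completion):
    """Pull a single standalone MCQ letter (A-D) from a model completion."""
    match = re.search(r"\b([A-D])\b", completion.strip())
    return match.group(1) if match else None

def accuracy(predictions, labels):
    """Fraction of extracted answers that match the gold labels."""
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels) if labels else 0.0

# Toy completions standing in for vLLM outputs.
completions = ["The answer is B.", "A", "I am not sure"]
labels = ["B", "A", "C"]
preds = [extract_answer(c) for c in completions]
print(accuracy(preds, labels))  # 2 of the 3 extractions match
```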


Breaks / Blockers

| Issue | Severity |
| --- | --- |
| `benchmarking_inversion-2-retry.py`: `needs_retry` is referenced before definition, so the script crashes with a `NameError` on first run | Bug |
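A minimal reproduction of this bug class, for reviewers unfamiliar with it. The variable name comes from the report above, but the surrounding logic is an illustrative assumption, not the actual script:

```python
# Buggy pattern (sketch):
#   if needs_retry:           # NameError: name 'needs_retry' is not defined
#       rerun_failed_items()
#   needs_retry = False       # assignment happens only after the first read

# Fix: initialize the flag before its first use.
needs_retry = False
failed_items = ["rxn_12", "rxn_47"]  # hypothetical leftover failures

if failed_items:
    needs_retry = True

print(needs_retry)  # True when there are failed items to rerun
```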

Reproducibility Gaps

| Gap | Details |
| --- | --- |
| Hardcoded absolute paths | All 8 scripts use `/data/david/...` and `/data/share/sft_hf_3/` -- zero CLI args, no env vars, no config files |
| Non-deterministic option shuffling | MCQ scripts use `np.random.permutation()` without a fixed seed -- option ordering varies between runs |
| Massive code duplication | ~60% identical boilerplate across the 8 scripts (data loading, splitting, extraction, metrics) |
| No tests | Zero unit or integration tests |
| No documentation | No docstrings, no README explaining which models are targeted or how to interpret results |
| Hardcoded model paths | Specific to David's cluster environment |
| Hardcoded tokenizer | Qwen-specific stop tokens in some scripts |
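The shuffling gap above has a small, standard fix: route the permutation through a seeded generator instead of global NumPy state. A sketch under assumed option strings (the helper name and options are illustrative):

```python
import numpy as np

def shuffle_options(options, seed=42):
    """Return MCQ options in a permutation that is stable across runs."""
    rng = np.random.default_rng(seed)  # seeded generator, not global state
    order = rng.permutation(len(options))
    return [options[i] for i in order]

options = ["correct reaction", "swapped reactants", "wrong product", "wrong reagent"]
first = shuffle_options(options)
second = shuffle_options(options)
print(first == second)  # True: identical ordering on every run with the same seed
```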

Relationship to other branches

  • david_active_branch is the full dev branch -- it contains these scripts plus reaction tasks and CSCS infrastructure, as well as a committed WandB API key (a security issue) and ~1.4M lines of result CSVs.
  • david_kuma_sampling_param (PR David kuma sampling param #36) contains the 3 reaction task classes that these scripts evaluate
  • The evaluation methodology overlaps with evaluate.rxnpred.py in Vu_active_branch

What is needed for reviewer reproducibility

  • Fix benchmarking_inversion-2-retry.py NameError
  • Refactor 8 scripts into a single parameterized evaluation harness with CLI args for model path, data path, task type
  • Add np.random.seed() for deterministic MCQ option shuffling
  • Replace all hardcoded paths with CLI arguments or ${MIST_DATA_DIR}
  • Add a README documenting usage, expected inputs/outputs, and interpretation
  • Add at least smoke tests for prompt construction and answer extraction logic
  • Consider integrating with the existing src/open_r1/ evaluation infrastructure rather than standalone scripts
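The parameterized harness suggested above might expose an interface like the following. The flag names and task list are assumptions drawn from this review, not an existing interface:

```python
import argparse
import os

def build_parser():
    parser = argparse.ArgumentParser(description="Reaction-task evaluation harness")
    parser.add_argument("--model-path", required=True,
                        help="HF model name or local checkpoint directory")
    parser.add_argument("--data-path",
                        default=os.environ.get("MIST_DATA_DIR", "data"),
                        help="CSV dataset root (defaults to $MIST_DATA_DIR)")
    parser.add_argument("--task", required=True,
                        choices=["inversion", "naming", "replacement", "truefalse"],
                        help="Which of the four reaction tasks to evaluate")
    parser.add_argument("--seed", type=int, default=42,
                        help="Seed for deterministic MCQ option shuffling")
    return parser

# Example invocation (paths are placeholders).
args = build_parser().parse_args(
    ["--model-path", "/models/demo", "--task", "truefalse"]
)
print(args.task, args.seed)  # truefalse 42
```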

🤖 Generated with Claude Code

@doncamilom
Contributor Author

Paper-to-Code Mapping Update

This branch contains evaluation scripts for paper tasks RxN, RxR, RxI, RxTF.

The benchmarking_truefalse-2.py script is one of the few places where RxTF evaluation logic exists. However, the scripts are completely non-portable (hardcoded paths) and would need significant rework to serve as reproducibility artifacts.

Priority for reviewer reproducibility: MEDIUM

The evaluation methodology is useful but needs to be integrated into a portable harness.

@doncamilom
Contributor Author

Final Classification

Paper tasks: Evaluation scripts for RxN, RxR, RxI, RxTF
Priority: MEDIUM
Paper relevance: Contains evaluation methodology needed for Table 4, but completely non-portable

Verdict

These 8 benchmarking scripts cover 4 paper tasks and demonstrate the evaluation methodology (prompt construction, MCQ shuffling, answer extraction, accuracy computation). However, they are entirely hardcoded to David's cluster and have a crashing bug. The methodology should be extracted into a single parameterized evaluation harness rather than merged as-is.

Work needed for peer review

| Item | Effort |
| --- | --- |
| Extract evaluation logic into a portable, parameterized harness | Medium-Large |
| Fix `benchmarking_inversion-2-retry.py` `NameError` | Tiny |
| Add `np.random.seed()` for deterministic MCQ shuffling | Tiny |
| Replace all hardcoded paths with CLI args | Medium |
| Add documentation for usage and output interpretation | Small |

Recommendation

Do not merge as-is. Use as reference when building a clean evaluation harness for Table 4 reproduction.

@doncamilom added the labels priority: medium (Useful but not blocking), effort: medium (A day or two of work), and has-bugs (Contains known runtime bugs) on Apr 6, 2026
Davidmingsegura and others added 2 commits April 6, 2026 14:58
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@doncamilom added the documentation label (Improvements or additions to documentation) and removed the has-bugs label (Contains known runtime bugs) on Apr 6, 2026
@doncamilom
Contributor Author

Updated classification

Relabeled from has-bugs to documentation. This PR should NOT be merged as-is (hardcoded paths, duplicated boilerplate, missing deps), but it serves as the reference implementation for the evaluation harness that needs to be built.

The 8 scripts cover evaluation methodology for all 4 reaction tasks:

  • RxN: prompt construction with 10-class options, per-class accuracy
  • RxR: MCQ with Enamine50k-corrupted options, letter extraction
  • RxI: MCQ with inverted-reagent options
  • RxTF: binary true/false classification
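As a reference point for the future `scripts/evaluate.py`, the RxTF-style binary task in the list above could look like this sketch. The prompt wording, SMILES string, and helper names are illustrative assumptions, not taken from the scripts:

```python
def build_truefalse_prompt(reaction_smiles):
    """Build a binary True/False prompt for one reaction string."""
    return ("Is the following reaction chemically correct? "
            "Answer True or False.\n"
            f"Reaction: {reaction_smiles}\nAnswer:")

def extract_bool(completion):
    """Map a model completion to True/False, or None if unparseable."""
    text = completion.strip().lower()
    if text.startswith("true"):
        return True
    if text.startswith("false"):
        return False
    return None

prompt = build_truefalse_prompt("CCO>>CC=O")  # placeholder reaction
print(extract_bool("True, the oxidation is valid."))  # True
```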

These should be used as reference when building scripts/evaluate.py.

