
Review: Evaluation/benchmarking scripts (David)#47

Open
doncamilom wants to merge 3 commits into main from david_eval

Conversation

@doncamilom
Contributor

Reproducibility Assessment -- david_eval

Author: David Segura | Commits: 1 | Files changed: 8 | +1,086 lines

Status: NOT READY -- useful evaluation methodology, completely non-portable.


What this contributes

Eight standalone vLLM-based evaluation scripts in evaluation/ for four reaction tasks, each with a base and retry variant:

| Script | Task |
| --- | --- |
| `benchmarking_inversion-2.py` / `-retry.py` | MCQ: identify the correct reaction (swapped reactants) |
| `benchmarking_naming-2.py` / `-retry.py` | Reaction class prediction (10 classes) |
| `benchmarking_replacement.py` / `-2-retry.py` | MCQ: identify the correct reaction (modified reactions) |
| `benchmarking_truefalse-2.py` / `-retry.py` | True/False: is this reaction correct? |

Each script loads a CSV dataset, constructs MCQ/binary prompts, runs vLLM inference, extracts answers via regex, and computes accuracy metrics.
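The per-script pipeline can be sketched as follows. This is a hedged illustration of the regex-extraction and accuracy steps described above; the function names, regex, and column values are assumptions, not code taken from the branch.

```python
import re

def extract_answer(completion):
    """Pull a single standalone MCQ letter (A-D) from a model completion."""
    match = re.search(r"\b([A-D])\b", completion.strip())
    return match.group(1) if match else None

def accuracy(predictions, labels):
    """Fraction of extracted answers that match the gold labels."""
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels) if labels else 0.0

# Toy completions standing in for vLLM outputs.
completions = ["The answer is B.", "A", "I am not sure"]
labels = ["B", "A", "C"]
preds = [extract_answer(c) for c in completions]
print(accuracy(preds, labels))  # 2 of the 3 extractions match
```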


Breaks / Blockers

| Issue | Severity |
| --- | --- |
| `benchmarking_inversion-2-retry.py`: `needs_retry` is referenced before definition, so the script crashes with a `NameError` on first run | Bug |
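A minimal reproduction of this bug class, for reviewers unfamiliar with it. The variable name comes from the report above, but the surrounding logic is an illustrative assumption, not the actual script:

```python
# Buggy pattern (sketch):
#   if needs_retry:           # NameError: name 'needs_retry' is not defined
#       rerun_failed_items()
#   needs_retry = False       # assignment happens only after the first read

# Fix: initialize the flag before its first use.
needs_retry = False
failed_items = ["rxn_12", "rxn_47"]  # hypothetical leftover failures

if failed_items:
    needs_retry = True

print(needs_retry)  # True when there are failed items to rerun
```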

Reproducibility Gaps

| Gap | Details |
| --- | --- |
| Hardcoded absolute paths | All 8 scripts use `/data/david/...` and `/data/share/sft_hf_3/` -- zero CLI args, no env vars, no config files |
| Non-deterministic option shuffling | MCQ scripts use `np.random.permutation()` without a fixed seed -- option ordering varies between runs |
| Massive code duplication | ~60% identical boilerplate across the 8 scripts (data loading, splitting, extraction, metrics) |
| No tests | Zero unit or integration tests |
| No documentation | No docstrings, no README explaining which models are targeted or how to interpret results |
| Hardcoded model paths | Specific to David's cluster environment |
| Hardcoded tokenizer | Qwen-specific stop tokens in some scripts |
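The shuffling gap above has a small, standard fix: route the permutation through a seeded generator instead of global NumPy state. A sketch under assumed option strings (the helper name and options are illustrative):

```python
import numpy as np

def shuffle_options(options, seed=42):
    """Return MCQ options in a permutation that is stable across runs."""
    rng = np.random.default_rng(seed)  # seeded generator, not global state
    order = rng.permutation(len(options))
    return [options[i] for i in order]

options = ["correct reaction", "swapped reactants", "wrong product", "wrong reagent"]
first = shuffle_options(options)
second = shuffle_options(options)
print(first == second)  # True: identical ordering on every run with the same seed
```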

Relationship to other branches

  • david_active_branch is the full dev branch -- it contains these scripts plus reaction tasks and CSCS infrastructure, as well as a committed WandB API key (a security issue) and ~1.4M lines of result CSVs.
  • david_kuma_sampling_param (PR David kuma sampling param #36) contains the 3 reaction task classes that these scripts evaluate
  • The evaluation methodology overlaps with evaluate.rxnpred.py in Vu_active_branch

What is needed for reviewer reproducibility

  • Fix benchmarking_inversion-2-retry.py NameError
  • Refactor 8 scripts into a single parameterized evaluation harness with CLI args for model path, data path, task type
  • Add np.random.seed() for deterministic MCQ option shuffling
  • Replace all hardcoded paths with CLI arguments or ${MIST_DATA_DIR}
  • Add a README documenting usage, expected inputs/outputs, and interpretation
  • Add at least smoke tests for prompt construction and answer extraction logic
  • Consider integrating with the existing src/open_r1/ evaluation infrastructure rather than standalone scripts
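The parameterized harness suggested above might expose an interface like the following. The flag names and task list are assumptions drawn from this review, not an existing interface:

```python
import argparse
import os

def build_parser():
    parser = argparse.ArgumentParser(description="Reaction-task evaluation harness")
    parser.add_argument("--model-path", required=True,
                        help="HF model name or local checkpoint directory")
    parser.add_argument("--data-path",
                        default=os.environ.get("MIST_DATA_DIR", "data"),
                        help="CSV dataset root (defaults to $MIST_DATA_DIR)")
    parser.add_argument("--task", required=True,
                        choices=["inversion", "naming", "replacement", "truefalse"],
                        help="Which of the four reaction tasks to evaluate")
    parser.add_argument("--seed", type=int, default=42,
                        help="Seed for deterministic MCQ option shuffling")
    return parser

# Example invocation (paths are placeholders).
args = build_parser().parse_args(
    ["--model-path", "/models/demo", "--task", "truefalse"]
)
print(args.task, args.seed)  # truefalse 42
```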

🤖 Generated with Claude Code

@doncamilom
Contributor Author

Paper-to-Code Mapping Update

This branch contains evaluation scripts for paper tasks RxN, RxR, RxI, RxTF.

The benchmarking_truefalse-2.py script is one of the few places where RxTF evaluation logic exists. However, the scripts are completely non-portable (hardcoded paths) and would need significant rework to serve as reproducibility artifacts.

Priority for reviewer reproducibility: MEDIUM

The evaluation methodology is useful but needs to be integrated into a portable harness.

@doncamilom
Contributor Author

Final Classification

Paper tasks: Evaluation scripts for RxN, RxR, RxI, RxTF
Priority: MEDIUM
Paper relevance: Contains evaluation methodology needed for Table 4, but completely non-portable

Verdict

These 8 benchmarking scripts cover 4 paper tasks and demonstrate the evaluation methodology (prompt construction, MCQ shuffling, answer extraction, accuracy computation). However, they are entirely hardcoded to David's cluster and have a crashing bug. The methodology should be extracted into a single parameterized evaluation harness rather than merged as-is.

Work needed for peer review

| Item | Effort |
| --- | --- |
| Extract evaluation logic into a portable, parameterized harness | Medium-Large |
| Fix `benchmarking_inversion-2-retry.py` `NameError` | Tiny |
| Add `np.random.seed()` for deterministic MCQ shuffling | Tiny |
| Replace all hardcoded paths with CLI args | Medium |
| Add documentation for usage and output interpretation | Small |

Recommendation

Do not merge as-is. Use as reference when building a clean evaluation harness for Table 4 reproduction.

@doncamilom added the labels priority: medium (Useful but not blocking), effort: medium (A day or two of work), and has-bugs (Contains known runtime bugs) on Apr 6, 2026
Davidmingsegura and others added 2 commits April 6, 2026 14:58
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@doncamilom added the documentation label (Improvements or additions to documentation) and removed the has-bugs label (Contains known runtime bugs) on Apr 6, 2026
@doncamilom
Contributor Author

Updated classification

Relabeled from has-bugs to documentation. This PR should NOT be merged as-is (hardcoded paths, duplicated boilerplate, missing deps), but it serves as the reference implementation for the evaluation harness that needs to be built.

The 8 scripts cover evaluation methodology for all 4 reaction tasks:

  • RxN: prompt construction with 10-class options, per-class accuracy
  • RxR: MCQ with Enamine50k-corrupted options, letter extraction
  • RxI: MCQ with inverted-reagent options
  • RxTF: binary true/false classification
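As a reference point for the future `scripts/evaluate.py`, the RxTF-style binary task in the list above could look like this sketch. The prompt wording, SMILES string, and helper names are illustrative assumptions, not taken from the scripts:

```python
def build_truefalse_prompt(reaction_smiles):
    """Build a binary True/False prompt for one reaction string."""
    return ("Is the following reaction chemically correct? "
            "Answer True or False.\n"
            f"Reaction: {reaction_smiles}\nAnswer:")

def extract_bool(completion):
    """Map a model completion to True/False, or None if unparseable."""
    text = completion.strip().lower()
    if text.startswith("true"):
        return True
    if text.startswith("false"):
        return False
    return None

prompt = build_truefalse_prompt("CCO>>CC=O")  # placeholder reaction
print(extract_bool("True, the oxidation is valid."))  # True
```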

These should be used as reference when building scripts/evaluate.py.

