[Refactor] Merge run_api.py into run.py and overhaul reuse/checkpoint mechanism #1499
mzr1996 wants to merge 2 commits into open-compass:main
Conversation
… mechanism

- Unify run_api.py into run.py via `--api-mode` flag, removing the separate entry point
- Redesign reuse system: structured run status (status.json), format-aware prediction file reuse, and three-level `--reuse-aux` control (all/infer/none) replacing the old boolean flag
- Standardize checkpoint keys to string type across inference/inference_mt/inference_video for consistent pickle serialization
- Replace `--ignore` with `--keep-failed` (old flag kept as deprecated alias)
- Add `--base-url` support to local mode for direct API model inference without config.py
- Add judge model configuration args (`--judge-base-url`, `--judge-key`, `--judge-api-nproc`, etc.)
- Improve executor shutdown with proper SIGTERM/SIGKILL worker termination
- Use relative symlinks for output files to support directory relocation
- Clean up hardcoded local model paths in config.py
Pull request overview
This PR consolidates the separate API runner into run.py (--api-mode) and refactors the reuse/checkpoint workflow to support structured run metadata (status.json), format-aware prediction reuse, and more granular auxiliary-file reuse controls.
Changes:
- Merge `run_api.py` functionality into `run.py` behind `--api-mode`, adding unified judge/inference CLI options and improved executor shutdown/symlink behavior.
- Redesign reuse logic with `status.json`, run discovery, prediction completeness checks, and format-converting prediction reuse.
- Standardize checkpoint/result handling around string keys and add "retry failed by default" controls (`--keep-failed`, deprecated `--ignore`).
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| `vlmeval/smp/file.py` | Adds run-id helpers, status.json read/write, run discovery, prediction reuse/conversion, and completeness checks. |
| `vlmeval/inference.py` | Switches to `_checkpoint.pkl` and string-key checkpoints; adds retry-failed behavior and cleanup. |
| `vlmeval/inference_mt.py` | Same checkpoint/retry refactor for multi-turn inference; adds cleanup and PREV handling. |
| `vlmeval/inference_video.py` | Same checkpoint/retry refactor for video inference; uses standardized pred path and cleanup. |
| `vlmeval/inference_api.py` | Adds retry_failed, improves executor shutdown/worker termination, and creates relocation-safe relative symlinks. |
| `run.py` | Introduces `--api-mode`, new reuse/aux semantics, status.json writing, base-url model support, and judge configuration args. |
| `run_api.py` | Removed; functionality migrated into `run.py --api-mode`. |
| `vlmeval/__init__.py` | Filters a specific pkg_resources deprecation warning. |
| `vlmeval/config.py` | Removes a hardcoded local path for Intern-S1-mini in favor of a repo-style model path. |
```python
judge_kwargs = get_judge_kwargs(dataset_name, dataset.TYPE, args)
judge_signature = judge_kwargs.get('model', '')

if RANK == 0:
    reuse_ctx = prepare_reuse_files(
        pred_root_meta=pred_root_meta,
        eval_id=eval_id,
        model_name=model_name,
        dataset=dataset,
        result_file=result_file,
        reuse=args.reuse,
        reuse_aux=args.reuse_aux,
        retry_failed=not args.keep_failed,
        judge_signature=judge_signature if args.mode != 'infer' else None,
        world_size=WORLD_SIZE,
```
judge_signature is currently derived only from judge_kwargs['model'], but prepare_reuse_files(... reuse_aux='all') uses this signature to decide whether to reuse evaluation auxiliary outputs. If judge configuration changes (e.g., base URL, verifier/vllm flags, judge_args affecting scoring), reusing eval aux based on model name alone can mix incompatible evaluation artifacts and yield incorrect results. Consider computing a stable signature from the relevant, non-secret judge settings (e.g., model + api_base + normalized judge_args + verifier/vllm flags) and store/compare that instead.
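One possible shape for such a signature is a short hash over the scoring-relevant, non-secret judge settings. The sketch below is illustrative only; the field names (`api_base`, `use_verifier`, `use_vllm`, `judge_args`) are assumptions mirroring common judge kwargs, not the repo's actual keys.

```python
# Sketch: a stable judge signature derived from scoring-relevant settings,
# not just the model name. Field names are assumed for illustration;
# secrets such as API keys are deliberately excluded.
import hashlib
import json

def judge_signature(judge_kwargs):
    relevant = {
        'model': judge_kwargs.get('model', ''),
        'api_base': judge_kwargs.get('api_base', ''),
        'use_verifier': judge_kwargs.get('use_verifier', False),
        'use_vllm': judge_kwargs.get('use_vllm', False),
        'judge_args': judge_kwargs.get('judge_args') or {},
    }
    # sort_keys makes the serialization deterministic across runs
    payload = json.dumps(relevant, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

sig_a = judge_signature({'model': 'gpt-4o-mini'})
sig_b = judge_signature({'model': 'gpt-4o-mini', 'api_base': 'http://host:8000/v1'})
assert sig_a != sig_b  # changing the base URL invalidates aux reuse
```

Comparing this stored signature against the current run's value would let `prepare_reuse_files` discard eval aux produced under a different judge configuration even when the model name matches.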
```python
# Try to load from the result file.
result_path = Path(cfg.result_file)
if result_path.exists():
    try:
        data = load(str(result_path))
        if isinstance(data, pd.DataFrame):
            if self.retry_failed:
                existing_results = {
                    str(idx): pred
                    for idx, pred in zip(data['index'], data['prediction'])
                    if FAIL_MSG not in str(pred)
                }
            else:
                existing_results = {
                    str(idx): pred
                    for idx, pred in zip(data['index'], data['prediction'])
                }
            results.update(existing_results)
            logger.info(f" [{dataset_name}] Loaded {len(existing_results)} "
                        "results from result file")
```
_load_checkpoint() only reuses results from cfg.result_file when load() returns a pandas DataFrame. If PRED_FORMAT=json, dump() writes a list of records and load() returns a list, so this reuse path is skipped (or hits the exception handler) and the pipeline will re-infer already-finished samples. Consider supporting list-of-records / dict formats here (similar to the prediction-table normalization used in vlmeval.smp.file).
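A normalization step along these lines would cover all three shapes `load()` can return. This is a sketch, not the repo's `_prediction_table` helper; the function name `as_prediction_map` is invented for illustration.

```python
# Illustrative helper (hypothetical name): normalize whatever load() returns
# -- a DataFrame, a column-oriented dict, or a JSON list of records --
# into a uniform {str(index): prediction} mapping, or None if unusable.
import pandas as pd

def as_prediction_map(data):
    if isinstance(data, list):           # JSON "records" format
        frame = pd.DataFrame.from_records(data)
    elif isinstance(data, dict):         # column-oriented dict
        frame = pd.DataFrame(data)
    else:                                # assume DataFrame-like
        frame = data
    if 'index' not in frame or 'prediction' not in frame:
        return None
    return {str(i): p for i, p in zip(frame['index'], frame['prediction'])}

records = [{'index': 0, 'prediction': 'A'}, {'index': 1, 'prediction': 'B'}]
assert as_prediction_map(records) == {'0': 'A', '1': 'B'}
```

With such a helper, `_load_checkpoint()` could reuse finished samples regardless of `PRED_FORMAT`, instead of silently re-inferring them when the result file is JSON.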
```python
def write_run_status(run_dir, status):
    run_dir = Path(run_dir)
    run_dir.mkdir(exist_ok=True)
```
write_run_status() calls run_dir.mkdir(exist_ok=True) without parents=True. If a caller passes a nested run directory whose parents haven't been created yet, this will raise and prevent status.json from being written. Using mkdir(parents=True, exist_ok=True) would make this helper robust and consistent with other directory creation in the repo.
```diff
-    run_dir.mkdir(exist_ok=True)
+    run_dir.mkdir(parents=True, exist_ok=True)
```
```python
frame = _prediction_table(src_file)
if frame is None:
    logger.warning(
        f'Can not convert prediction file {src_file} to target format {dst_suffix}, reuse skipped.'
```
Spelling/grammar: "Can not convert ..." should be "Cannot convert ..." in this warning message.
```diff
-        f'Can not convert prediction file {src_file} to target format {dst_suffix}, reuse skipped.'
+        f'Cannot convert prediction file {src_file} to target format {dst_suffix}, reuse skipped.'
```
```python
if args.ignore:
    logger.warning('[Deprecated] the `--ignore` flag is deprecated since it is '
                   'the default behavior, use `--keep-failed` to disable it.')
```
--ignore is described as a deprecated alias, but it currently only logs a warning and does not change behavior. As a result, users passing --ignore will still retry failed samples (since retry_failed=not args.keep_failed). If the intent is to keep backward compatibility, map args.ignore to args.keep_failed=True (and adjust the help text accordingly).
```diff
-if args.ignore:
-    logger.warning('[Deprecated] the `--ignore` flag is deprecated since it is '
-                   'the default behavior, use `--keep-failed` to disable it.')
+if getattr(args, "ignore", False):
+    logger.warning('[Deprecated] the `--ignore` flag is deprecated since it is '
+                   'the default behavior, use `--keep-failed` to disable it.')
+    # Maintain backward compatibility: `--ignore` behaves like `--keep-failed`.
+    # This ensures users passing `--ignore` will not retry failed samples.
+    if hasattr(args, "keep_failed"):
+        args.keep_failed = True
```
```python
if rank == 0:
    data = load(result_file)
    # breakpoint()
    results = {k: v for k, v in zip(data['index'], data['prediction'])}
```
load(result_file) can return a list of records when PRED_FORMAT=json (because dump() serializes DataFrames to to_dict('records') for JSON). This code assumes data is a DataFrame/dict with data['index'] and data['prediction'], which will raise at runtime when resuming from a JSON result file. Consider normalizing data to a DataFrame (or reusing the shared prediction-table parsing logic in vlmeval.smp.file._prediction_table) before zipping index/prediction.
```diff
-    results = {k: v for k, v in zip(data['index'], data['prediction'])}
+    # Normalize loaded prediction table to a dict[index] -> prediction
+    if isinstance(data, list):
+        # JSON format: list of record dicts, e.g. [{'index': ..., 'prediction': ...}, ...]
+        results = {
+            rec['index']: rec['prediction']
+            for rec in data
+            if isinstance(rec, dict) and 'index' in rec and 'prediction' in rec
+        }
+    else:
+        # DataFrame-like or dict-like with 'index' and 'prediction' columns/keys
+        try:
+            indices = data['index']
+            predictions = data['prediction']
+            results = {k: v for k, v in zip(indices, predictions)}
+        except (TypeError, KeyError):
+            # Unexpected structure; fall back to empty results to avoid crashing
+            results = {}
```
```python
if osp.exists(result_file):
    if rank == 0:
        data = load(result_file)
        results = {k: v for k, v in zip(data['index'], data['prediction'])}
```
load(result_file) can return a list of records when PRED_FORMAT=json (since dump() serializes DataFrames as records for JSON). This block assumes data supports data['index']/data['prediction'] like a DataFrame/dict, which will fail on resume with JSON output. Normalize data to a DataFrame (or reuse the shared prediction parsing helper) before building results.
```diff
-        results = {k: v for k, v in zip(data['index'], data['prediction'])}
+        # `load` may return a dict/DataFrame-like object or a list of records
+        # (for the JSON "records" format).
+        if isinstance(data, list):
+            # Expect a list of {'index': ..., 'prediction': ...} records.
+            results = {rec['index']: rec['prediction'] for rec in data}
+        else:
+            results = {k: v for k, v in zip(data['index'], data['prediction'])}
```
```python
results = {k: v for k, v in zip(data['index'], data['prediction'])}
results = {k: v for k, v in results.items() if FAIL_MSG not in str(v)}
if len(results) == len(data):
```
When PRED_FORMAT=json, dump() writes DataFrames as a list of record dicts, so load(result_file) returns a list. This resume/retry block assumes data['index'] and data['prediction'] exist, which will raise for JSON result files. Consider normalizing data to a DataFrame (or using the same prediction-table parsing logic used by reuse checks) before extracting index/prediction.
```diff
-results = {k: v for k, v in zip(data['index'], data['prediction'])}
-results = {k: v for k, v in results.items() if FAIL_MSG not in str(v)}
-if len(results) == len(data):
+# Normalize loaded prediction table to support both dict/DataFrame-like
+# objects (with 'index' and 'prediction' keys/columns) and JSON
+# record-style lists (list of dicts with those keys).
+if isinstance(data, list):
+    indices = [item.get('index') for item in data]
+    preds = [item.get('prediction') for item in data]
+else:
+    indices = data['index']
+    preds = data['prediction']
+results = {k: v for k, v in zip(indices, preds)}
+results = {k: v for k, v in results.items() if FAIL_MSG not in str(v)}
+if len(results) == len(indices):
```
```python
temperature=args.temperature,
top_k=args.top_k,
top_p=args.top_p,
repetition_penalty=args.repetition_penalty,
```
build_model_from_base_url() (and the corresponding CLI args) supports repetition_penalty but does not expose presence_penalty, which was previously supported by the removed run_api.py runner and is used by several API/model implementations in this repo. If the intent is full feature parity when using --base-url, consider adding a --presence-penalty arg and including it in model_args (similar to repetition_penalty).
```diff
 repetition_penalty=args.repetition_penalty,
+presence_penalty=args.presence_penalty,
```