[Refactor] Merge run_api.py into run.py and overhaul reuse/checkpoint mechanism #1499
mzr1996 wants to merge 2 commits into open-compass:main
Conversation
… mechanism

- Unify run_api.py into run.py via `--api-mode` flag, removing the separate entry point
- Redesign reuse system: structured run status (status.json), format-aware prediction file reuse, and three-level `--reuse-aux` control (all/infer/none) replacing the old boolean flag
- Standardize checkpoint keys to string type across inference/inference_mt/inference_video for consistent pickle serialization
- Replace `--ignore` with `--keep-failed` (old flag kept as deprecated alias)
- Add `--base-url` support to local mode for direct API model inference without config.py
- Add judge model configuration args (`--judge-base-url`, `--judge-key`, `--judge-api-nproc`, etc.)
- Improve executor shutdown with proper SIGTERM/SIGKILL worker termination
- Use relative symlinks for output files to support directory relocation
- Clean up hardcoded local model paths in config.py
Pull request overview
This PR consolidates the separate API runner into run.py (--api-mode) and refactors the reuse/checkpoint workflow to support structured run metadata (status.json), format-aware prediction reuse, and more granular auxiliary-file reuse controls.
Changes:
- Merge `run_api.py` functionality into `run.py` behind `--api-mode`, adding unified judge/inference CLI options and improved executor shutdown/symlink behavior.
- Redesign reuse logic with `status.json`, run discovery, prediction completeness checks, and format-converting prediction reuse.
- Standardize checkpoint/result handling around string keys and add "retry failed by default" controls (`--keep-failed`, deprecated `--ignore`).
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| `vlmeval/smp/file.py` | Adds run-id helpers, status.json read/write, run discovery, prediction reuse/conversion, and completeness checks. |
| `vlmeval/inference.py` | Switches to `_checkpoint.pkl` and string-key checkpoints; adds retry-failed behavior and cleanup. |
| `vlmeval/inference_mt.py` | Same checkpoint/retry refactor for multi-turn inference; adds cleanup and PREV handling. |
| `vlmeval/inference_video.py` | Same checkpoint/retry refactor for video inference; uses standardized pred path and cleanup. |
| `vlmeval/inference_api.py` | Adds retry_failed, improves executor shutdown/worker termination, and creates relocation-safe relative symlinks. |
| `run.py` | Introduces `--api-mode`, new reuse/aux semantics, status.json writing, base-url model support, and judge configuration args. |
| `run_api.py` | Removed; functionality migrated into `run.py --api-mode`. |
| `vlmeval/__init__.py` | Filters a specific pkg_resources deprecation warning. |
| `vlmeval/config.py` | Removes a hardcoded local path for Intern-S1-mini in favor of a repo-style model path. |
```python
judge_kwargs = get_judge_kwargs(dataset_name, dataset.TYPE, args)
judge_signature = judge_kwargs.get('model', '')

if RANK == 0:
    reuse_ctx = prepare_reuse_files(
        pred_root_meta=pred_root_meta,
        eval_id=eval_id,
        model_name=model_name,
        dataset=dataset,
        result_file=result_file,
        reuse=args.reuse,
        reuse_aux=args.reuse_aux,
        retry_failed=not args.keep_failed,
        judge_signature=judge_signature if args.mode != 'infer' else None,
        world_size=WORLD_SIZE,
```
judge_signature is currently derived only from judge_kwargs['model'], but prepare_reuse_files(... reuse_aux='all') uses this signature to decide whether to reuse evaluation auxiliary outputs. If judge configuration changes (e.g., base URL, verifier/vllm flags, judge_args affecting scoring), reusing eval aux based on model name alone can mix incompatible evaluation artifacts and yield incorrect results. Consider computing a stable signature from the relevant, non-secret judge settings (e.g., model + api_base + normalized judge_args + verifier/vllm flags) and store/compare that instead.
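One possible shape for such a signature is a short hash over the scoring-relevant, non-secret judge settings. The sketch below is illustrative only; the field names (`api_base`, `use_verifier`, `use_vllm`, `judge_args`) are assumptions mirroring common judge kwargs, not the repo's actual keys.

```python
# Sketch: a stable judge signature derived from scoring-relevant settings,
# not just the model name. Field names are assumed for illustration;
# secrets such as API keys are deliberately excluded.
import hashlib
import json

def judge_signature(judge_kwargs):
    relevant = {
        'model': judge_kwargs.get('model', ''),
        'api_base': judge_kwargs.get('api_base', ''),
        'use_verifier': judge_kwargs.get('use_verifier', False),
        'use_vllm': judge_kwargs.get('use_vllm', False),
        'judge_args': judge_kwargs.get('judge_args') or {},
    }
    # sort_keys makes the serialization deterministic across runs
    payload = json.dumps(relevant, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

sig_a = judge_signature({'model': 'gpt-4o-mini'})
sig_b = judge_signature({'model': 'gpt-4o-mini', 'api_base': 'http://host:8000/v1'})
assert sig_a != sig_b  # changing the base URL invalidates aux reuse
```

Comparing this stored signature against the current run's value would let `prepare_reuse_files` discard eval aux produced under a different judge configuration even when the model name matches.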
```python
# Try to load from the result file.
result_path = Path(cfg.result_file)
if result_path.exists():
    try:
        data = load(str(result_path))
        if isinstance(data, pd.DataFrame):
            if self.retry_failed:
                existing_results = {
                    str(idx): pred
                    for idx, pred in zip(data['index'], data['prediction'])
                    if FAIL_MSG not in str(pred)
                }
            else:
                existing_results = {
                    str(idx): pred
                    for idx, pred in zip(data['index'], data['prediction'])
                }
            results.update(existing_results)
            logger.info(f" [{dataset_name}] Loaded {len(existing_results)} "
                        "results from result file")
```
_load_checkpoint() only reuses results from cfg.result_file when load() returns a pandas DataFrame. If PRED_FORMAT=json, dump() writes a list of records and load() returns a list, so this reuse path is skipped (or hits the exception handler) and the pipeline will re-infer already-finished samples. Consider supporting list-of-records / dict formats here (similar to the prediction-table normalization used in vlmeval.smp.file).
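A normalization step along these lines would cover all three shapes `load()` can return. This is a sketch, not the repo's `_prediction_table` helper; the function name `as_prediction_map` is invented for illustration.

```python
# Illustrative helper (hypothetical name): normalize whatever load() returns
# -- a DataFrame, a column-oriented dict, or a JSON list of records --
# into a uniform {str(index): prediction} mapping, or None if unusable.
import pandas as pd

def as_prediction_map(data):
    if isinstance(data, list):           # JSON "records" format
        frame = pd.DataFrame.from_records(data)
    elif isinstance(data, dict):         # column-oriented dict
        frame = pd.DataFrame(data)
    else:                                # assume DataFrame-like
        frame = data
    if 'index' not in frame or 'prediction' not in frame:
        return None
    return {str(i): p for i, p in zip(frame['index'], frame['prediction'])}

records = [{'index': 0, 'prediction': 'A'}, {'index': 1, 'prediction': 'B'}]
assert as_prediction_map(records) == {'0': 'A', '1': 'B'}
```

With such a helper, `_load_checkpoint()` could reuse finished samples regardless of `PRED_FORMAT`, instead of silently re-inferring them when the result file is JSON.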
```python
def write_run_status(run_dir, status):
    run_dir = Path(run_dir)
    run_dir.mkdir(exist_ok=True)
```
write_run_status() calls run_dir.mkdir(exist_ok=True) without parents=True. If a caller passes a nested run directory whose parents haven't been created yet, this will raise and prevent status.json from being written. Using mkdir(parents=True, exist_ok=True) would make this helper robust and consistent with other directory creation in the repo.
```diff
-    run_dir.mkdir(exist_ok=True)
+    run_dir.mkdir(parents=True, exist_ok=True)
```
```python
frame = _prediction_table(src_file)
if frame is None:
    logger.warning(
        f'Can not convert prediction file {src_file} to target format {dst_suffix}, reuse skipped.'
```
Spelling/grammar: "Can not convert ..." should be "Cannot convert ..." in this warning message.
```diff
-        f'Can not convert prediction file {src_file} to target format {dst_suffix}, reuse skipped.'
+        f'Cannot convert prediction file {src_file} to target format {dst_suffix}, reuse skipped.'
```
```python
if args.ignore:
    logger.warning('[Deprecated] the `--ignore` flag is deprecated since it is '
                   'the default behavior, use `--keep-failed` to disable it.')
```
--ignore is described as a deprecated alias, but it currently only logs a warning and does not change behavior. As a result, users passing --ignore will still retry failed samples (since retry_failed=not args.keep_failed). If the intent is to keep backward compatibility, map args.ignore to args.keep_failed=True (and adjust the help text accordingly).
```diff
-if args.ignore:
-    logger.warning('[Deprecated] the `--ignore` flag is deprecated since it is '
-                   'the default behavior, use `--keep-failed` to disable it.')
+if getattr(args, "ignore", False):
+    logger.warning('[Deprecated] the `--ignore` flag is deprecated since it is '
+                   'the default behavior, use `--keep-failed` to disable it.')
+    # Maintain backward compatibility: `--ignore` behaves like `--keep-failed`.
+    # This ensures users passing `--ignore` will not retry failed samples.
+    if hasattr(args, "keep_failed"):
+        args.keep_failed = True
```
```python
if rank == 0:
    data = load(result_file)
    # breakpoint()
    results = {k: v for k, v in zip(data['index'], data['prediction'])}
```
load(result_file) can return a list of records when PRED_FORMAT=json (because dump() serializes DataFrames to to_dict('records') for JSON). This code assumes data is a DataFrame/dict with data['index'] and data['prediction'], which will raise at runtime when resuming from a JSON result file. Consider normalizing data to a DataFrame (or reusing the shared prediction-table parsing logic in vlmeval.smp.file._prediction_table) before zipping index/prediction.
```diff
-    results = {k: v for k, v in zip(data['index'], data['prediction'])}
+    # Normalize loaded prediction table to a dict[index] -> prediction
+    if isinstance(data, list):
+        # JSON format: list of record dicts, e.g. [{'index': ..., 'prediction': ...}, ...]
+        results = {
+            rec['index']: rec['prediction']
+            for rec in data
+            if isinstance(rec, dict) and 'index' in rec and 'prediction' in rec
+        }
+    else:
+        # DataFrame-like or dict-like with 'index' and 'prediction' columns/keys
+        try:
+            indices = data['index']
+            predictions = data['prediction']
+            results = {k: v for k, v in zip(indices, predictions)}
+        except (TypeError, KeyError):
+            # Unexpected structure; fall back to empty results to avoid crashing
+            results = {}
```
```python
if osp.exists(result_file):
    if rank == 0:
        data = load(result_file)
        results = {k: v for k, v in zip(data['index'], data['prediction'])}
```
load(result_file) can return a list of records when PRED_FORMAT=json (since dump() serializes DataFrames as records for JSON). This block assumes data supports data['index']/data['prediction'] like a DataFrame/dict, which will fail on resume with JSON output. Normalize data to a DataFrame (or reuse the shared prediction parsing helper) before building results.
```diff
-        results = {k: v for k, v in zip(data['index'], data['prediction'])}
+        # `load` may return a dict/DataFrame-like object or a list of records
+        # (for the JSON "records" format).
+        if isinstance(data, list):
+            # Expect a list of {'index': ..., 'prediction': ...} records.
+            results = {rec['index']: rec['prediction'] for rec in data}
+        else:
+            results = {k: v for k, v in zip(data['index'], data['prediction'])}
```
```python
results = {k: v for k, v in zip(data['index'], data['prediction'])}
results = {k: v for k, v in results.items() if FAIL_MSG not in str(v)}
if len(results) == len(data):
```
When PRED_FORMAT=json, dump() writes DataFrames as a list of record dicts, so load(result_file) returns a list. This resume/retry block assumes data['index'] and data['prediction'] exist, which will raise for JSON result files. Consider normalizing data to a DataFrame (or using the same prediction-table parsing logic used by reuse checks) before extracting index/prediction.
```diff
-results = {k: v for k, v in zip(data['index'], data['prediction'])}
-results = {k: v for k, v in results.items() if FAIL_MSG not in str(v)}
-if len(results) == len(data):
+# Normalize loaded prediction table to support both dict/DataFrame-like
+# objects (with 'index' and 'prediction' keys/columns) and JSON
+# record-style lists (list of dicts with those keys).
+if isinstance(data, list):
+    indices = [item.get('index') for item in data]
+    preds = [item.get('prediction') for item in data]
+else:
+    indices = data['index']
+    preds = data['prediction']
+results = {k: v for k, v in zip(indices, preds)}
+results = {k: v for k, v in results.items() if FAIL_MSG not in str(v)}
+if len(results) == len(indices):
```
```python
temperature=args.temperature,
top_k=args.top_k,
top_p=args.top_p,
repetition_penalty=args.repetition_penalty,
```
build_model_from_base_url() (and the corresponding CLI args) supports repetition_penalty but does not expose presence_penalty, which was previously supported by the removed run_api.py runner and is used by several API/model implementations in this repo. If the intent is full feature parity when using --base-url, consider adding a --presence-penalty arg and including it in model_args (similar to repetition_penalty).
```diff
 repetition_penalty=args.repetition_penalty,
+presence_penalty=args.presence_penalty,
```