fix: gaia dataset file attach issue and evaluation script format support #25

Merged: junjzhang merged 2 commits into cmriat:main from kyh035:fix_eval_format_support on Jul 1, 2025
@kyh035 kyh035 commented Jul 1, 2025

Contributor

What did you do

Fix the GAIA dataset file-attachment issue and the format-handling issue in the metric-computing script.

New test cases

None

Test results

[Screenshot of test results; image not preserved]

Other comments

None

@junjzhang junjzhang requested review from Copilot and junjzhang July 1, 2025 08:32

Copilot AI left a comment

Contributor

Pull Request Overview

This PR addresses file attachment formatting for the GAIA dataset and enhances the evaluation scripts’ configuration and environment handling.

  • Added visual_qa_tool_factory to the tool maps.
  • Enabled .env loading, string-label wrapping, and increased concurrency in the LLM evaluation script.
  • Updated evaluation runner script for the GAIA dataset.
  • Fixed prompt formatting in GAIA batch builder.
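
The string-label wrapping mentioned in the list above can be sketched as follows. This is a minimal illustration rather than the code in simpleqa_metrics.py: the helper name `normalize_labels` is hypothetical, and it assumes labels arrive either as one string or as a sequence of strings.

```python
def normalize_labels(labels):
    """Return labels as a list of strings.

    A bare string is wrapped in a one-element list so that later code
    iterating over the labels sees whole answers, not single characters.
    """
    if isinstance(labels, str):
        return [labels]
    return list(labels)


# Without wrapping, iterating over a string yields its characters.
print(list("Paris"))
print(normalize_labels("Paris"))
print(normalize_labels(["Paris", "paris"]))
```

Wrapping at the point where labels are read keeps the rest of the metric code free of type checks.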

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/l0/traj_sampler/nb_agent_sampler/tool_specs.py | Added visual_qa_tool_factory to TOOL_FACTORY_MAP and enabled in TOOL_SPECS_MAP |
| evaluation/nb_agent_eval/simpleqa_metrics.py | Loaded environment from .env, wrapped labels when a string, and bumped --workers default to 64 |
| evaluation/nb_agent_eval/run_eval.sh | Switched datasets to GAIA and updated config path |
| evaluation/nb_agent_eval/eval_datasets/gaia.py | Applied .format(file_path=…) to the file-attach prompt |

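
For the gaia.py change, a minimal sketch of applying `.format(file_path=…)` to a file-attach prompt might look like this. The template text, the constant name, and the function name are all assumptions made for illustration; the actual prompt in gaia.py is not shown in this conversation.

```python
from typing import Optional

# Hypothetical template; the real prompt text is not visible in this PR page.
FILE_ATTACH_PROMPT = "The attached file is available at: {file_path}"


def build_question(question: str, file_path: Optional[str]) -> str:
    """Append the file-attach prompt only when a file is attached."""
    if not file_path:
        return question
    # Without .format(), the literal text "{file_path}" would reach the model.
    return question + "\n\n" + FILE_ATTACH_PROMPT.format(file_path=file_path)


print(build_question("What is in the spreadsheet?", "/data/sheet.xlsx"))
```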
Comments suppressed due to low confidence (1)

evaluation/nb_agent_eval/simpleqa_metrics.py:257

  • os.getenv is used here but os is not imported. Add import os at the top to avoid a NameError.
        client = openai.OpenAI(base_url=os.getenv("OPENAI_API_BASE"), api_key=os.getenv("OPENAI_API_KEY"))
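
A small sketch of the pattern this comment asks for, assuming nothing about the rest of the file: `os` is imported at module level before `os.getenv` is called. The helper name `openai_client_kwargs` is hypothetical, and the sketch stops short of constructing the client so that it runs without the openai package. Whether `os` is in fact missing from simpleqa_metrics.py cannot be confirmed from this page; the comment was suppressed for low confidence.

```python
import os  # must be imported before os.getenv is used, otherwise NameError


def openai_client_kwargs() -> dict:
    """Collect the client settings from the environment.

    os.getenv returns None for an unset variable, so the caller decides
    how to handle missing values.
    """
    return {
        "base_url": os.getenv("OPENAI_API_BASE"),
        "api_key": os.getenv("OPENAI_API_KEY"),
    }
```

The resulting dictionary could then be passed as `openai.OpenAI(**openai_client_kwargs())`, matching the call quoted above.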

Comment on lines +18 to +20
from typing import Any

- TOOL_FACTORY_MAP: dict[str, str] = {"qa": {"web_search_tool_factory", "jina_reader_tool_factory"}, "math": {}}
+ TOOL_FACTORY_MAP: dict[str, str] = {"qa": {"web_search_tool_factory", "jina_reader_tool_factory", "visual_qa_tool_factory"}, "math": {}}

Copilot AI Jul 1, 2025

The type annotation dict[str, str] is incorrect since the values are sets of strings. Consider using dict[str, set[str]] or Mapping[str, Set[str]] for accuracy.
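
One way the suggested annotation could be applied is sketched below. It is not the merged code. It also replaces the `{}` under "math" with `set()`, because `{}` in Python is an empty dict rather than an empty set; that second change goes beyond what the comment requests and is an assumption about the intended type.

```python
TOOL_FACTORY_MAP: dict[str, set[str]] = {
    "qa": {
        "web_search_tool_factory",
        "jina_reader_tool_factory",
        "visual_qa_tool_factory",
    },
    # set() is used because the literal {} creates an empty dict, not a set.
    "math": set(),
}

# Every value is now a set of factory names, matching the annotation.
for task, factories in TOOL_FACTORY_MAP.items():
    print(task, sorted(factories))
```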

Comment thread evaluation/nb_agent_eval/simpleqa_metrics.py
Comment thread evaluation/nb_agent_eval/run_eval.sh Outdated
- --datasets bamboogle musique simpleqa hotpotqa \
- --config_path /root/AgentRL/evaluation/nb_agent_eval/config/sampler_config_direct.yaml
\ No newline at end of file
+ --datasets gaia \
+ --config_path /root/l0/evaluation/nb_agent_eval/config/sampler_config_claude.yaml
\ No newline at end of file

Copilot AI Jul 1, 2025

[nitpick] Hardcoded absolute paths reduce portability. Consider switching to relative paths or an environment variable for the project root.

Suggested change
- --config_path /root/l0/evaluation/nb_agent_eval/config/sampler_config_claude.yaml
+ --config_path "${PROJECT_ROOT}/evaluation/nb_agent_eval/config/sampler_config_claude.yaml"


@junjzhang junjzhang left a comment

Contributor

LGTM

@junjzhang junjzhang merged commit 2454c54 into cmriat:main Jul 1, 2025
1 check passed