Conversation
|
Hi @moustafa-a , Congrats on the impressive results! I was looking at the grading reports out of curiosity and had a few questions about some data points that caught my attention: 1. smartphone-decimeter-2022
Since this competition measures GPS positioning error in meters ( 2. dog-breed-identification
Scores in the 0.00x range for this competition are typically associated with data leakage via the Stanford Dogs Dataset (which contains labels for the Kaggle test images). For reference, state-of-the-art models without external data typically achieve ~0.2-0.3. The large gap between Group 3 (0.00755) and Groups 1/2 (~0.04) is also notable. Can you clarify how this score was achieved? 3. Scores exceeding Kaggle top performers Some results (e.g., I'm genuinely curious about Disarray's approach. Would you be able to share any of the following to help the community understand and learn from your results?
Thanks! |
|
Hi @alexwang939393, Doris here from Team Disarray. We really appreciate you taking the time to examine our results. To the best of our knowledge, this is the first time the community has done a deep dive into an MLE-Bench entrant’s results, and we’re honored our submission prompted this level of interest! To answer your questions,
#!/usr/bin/env python3
"""
Prepare merged dataset from competition data and Stanford Dogs.
Maps Stanford breed names to competition breed names and creates unified dataset.
"""
import pandas as pd
import numpy as np
from pathlib import Path
from collections import defaultdict
import re
# Paths
WORK_DIR = Path("/home/disarray/experiments/dog-breed-identification")
COMP_DATA_DIR = WORK_DIR / "data"
STANFORD_DIR = Path("/kaggle/input/jessicali9530-stanford-dogs-dataset/images/Images")
OUTPUT_DIR = WORK_DIR / "merged_dataset"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
print("="*80)
print("PREPARING MERGED DATASET")
print("="*80)
We’re actively preparing the technical report, and we’re excited to share our findings with the community. As for providing more detailed artifacts, we’d like to understand whether the MLE-Bench team @thesofakillers @joe-needham has any specific guidance on how to do this. Many of the model checkpoints are quite large, and logs are spread across multiple systems given the complexity of our multi-agent setup. Thanks! |
|
Thanks for the detailed response, @dorx! Regarding points 1 and 2
Regarding point 3
Looking forward to the technical report! |
|
Hi all. I no longer work at OpenAI so can no longer maintain this repo. On points 1 and 2, I think we have accepted that the current benchmark is imperfect and partially exploitable in its current form. We wanted to release fixes to these exploits in a single batch so that we could start with a fresh leaderboard, as opposed to gradually releasing fixes that would result in a gradual drift of what the metrics meant on the pre-existing leaderboard. This release has not yet happened, and as originally stated, the timelines on this release are very unclear and I would not rely on it. So, in my view, with the current repo, and the current leaderboard, it is acceptable though undesirable for solutions to leverage the exploits. I would say it would be nice to include disclosure of this if the submission authors note it, perhaps via a footnote in the table. But its hard to retroactively apply this same rigour to previous old submissions when we weren't aware of the exploits, who may themselves have been also leveraging the same exploits. That said, I don't think using external data sources is necessarily an "exploit". It may be in the As for point 3, I would suggest not standardizing to a specific log format and submission, purely to accomodate the multitudes of agents/scaffolds that could possibly be used to submit to MLE-bench. If a submission wants to underline their transparency, which I think is always a plus, then they can complement their submission with e.g. a link to a website showing their logs/trajectories etc.. An example is what the Microsoft Research submission did with their first submission in #52, where they linked a website with their logs/trajs (though that website seems to be stopped now). Ultimately I literally don't have permissions to maintain this repo anymore, so these are just suggestions. |
|
Hi, thanks for the submission! Just a quick question to help us understand the methodology better: Does the agent utilize any feedback or grading signals from the private test set (the one used for medal evaluation) during the inference process? Or is the test set accessed only after the agent has fully completed its reasoning, solely to determine the final score—adhering to the fundamental consensus of standard ML evaluation? @dorx @moustafa-a @alexwang939393 |
|
Thanks for the continued discussion! Per standard practice, private test data is not accessed by agents at any point during development. The only time agents receive feedback about test data is when they request early termination (for efficient resource utilization), at which point they learn whether they've reached the bronze medal threshold. Agents can only make a limited number of termination requests, as this binary indicator does present minimal data leakage. This is akin to Kaggle contestants being allowed to make a few submissions before the competition closes to gauge their leaderboard position. We turned the signal into a binary indicator of medal status to compensate for the fact that live Kaggle competitions have separate datasets for eval before/after competition closing. As commonly done in ML model development, external data sources are discovered by the agents and brought in to enrich the original data. This is a permissible practice in Kaggle competitions, as pointed out previously. Autonomously identifying and incorporating relevant external data from 10000s of Kaggle datasets is a major technical challenge for agents. In the unanticipated case of We appreciate the engagement from the community. Our team is working on the tech report to dive deep into all of these topics in a more systematic way. |
|
Thanks for the detailed clarification. However, the Kaggle analogy fundamentally does not hold here: If agent uses the same private test set for both the "early termination/retry signal" and the "final evaluation," this is, by definition, test set leakage. Even if the signal is "minimal" (binary), it fundamentally alters the evaluation protocol:
This strongly reinforces the need for a separate track.
Mixing these results conflates intrinsic reasoning capability with test-set probing efficiency, which would be misleading for the community. @moustafa-a @dorx @thesofakillers |
Hi authors of MLE-Bench,
We are the research team behind Disarray, a context-aware ML engineering agent that leverages data semantics and lineage. We evaluated Disarray on MLE-Bench using three independent experiment groups, and submit the results in this PR.
Contribution Details
This PR includes:
Detailed grading reports located in:
runs/disarray_group_1runs/disarray_group_2runs/disarray_group_3Updated leaderboard in
README.md.Updated
runs/README.md, documenting the Disarray agent configuration and experimental setup.Updated experiment metadata in
runs/run_group_experiments.csv.Evaluation Setup
Notes
No changes were made to benchmark definitions or grading logic. All results were produced using the standard MLE-Bench evaluation pipeline.
We welcome feedback and are happy to make any necessary adjustments.
Best,
The Disarray Team