Add Disarray MLE-Bench Results #118

Open

moustafa-a wants to merge 1 commit into openai:main from LineaLabs:main

Conversation

@moustafa-a

Hi authors of MLE-Bench,

We are the research team behind Disarray, a context-aware ML engineering agent that leverages data semantics and lineage. We evaluated Disarray on MLE-Bench using three independent experiment groups, and we are submitting the results in this PR.

Contribution Details

This PR includes:

  • Detailed grading reports located in:

    • runs/disarray_group_1
    • runs/disarray_group_2
    • runs/disarray_group_3
  • Updated leaderboard in README.md.

  • Updated runs/README.md, documenting the Disarray agent configuration and experimental setup.

  • Updated experiment metadata in runs/run_group_experiments.csv.

Evaluation Setup

  • Model ensemble: Claude-Opus-4.5, Claude-Sonnet-4.5, GPT-5.2-Codex, Gemini-3-Pro-Preview
  • Runtime: 24 hours
  • Hardware: 24 vCPUs, 220 GB RAM, 1× A100 GPU

Notes

No changes were made to benchmark definitions or grading logic. All results were produced using the standard MLE-Bench evaluation pipeline.

We welcome feedback and are happy to make any necessary adjustments.

Best,
The Disarray Team

@alexwang939393

Hi @moustafa-a ,

Congrats on the impressive results! I was looking at the grading reports out of curiosity and had a few questions about some data points that caught my attention:

1. smartphone-decimeter-2022

  • Group 3: score = 0.0
  • Group 2: score = 0.054

Since this competition measures GPS positioning error in meters (is_lower_better: true), a score of 0.0 would mean perfect prediction with zero error. This seems physically implausible for a GPS task. Could you share how this was achieved?

2. dog-breed-identification

  • Group 3: score = 0.00755
  • Group 1: score = 0.04147
  • Group 2: score = 0.04093

Scores in the 0.00x range for this competition are typically associated with data leakage via the Stanford Dogs Dataset (which contains labels for the Kaggle test images). For reference, state-of-the-art models without external data typically achieve ~0.2-0.3. The large gap between Group 3 (0.00755) and Groups 1/2 (~0.04) is also notable. Can you clarify how this score was achieved?
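For context, a quick way to confirm this kind of overlap is to hash the images in both datasets and count exact duplicates. A minimal sketch (the local paths below are placeholders, not anything taken from this submission), which only catches byte-identical files:

import hashlib
from pathlib import Path

def md5_of(path: Path) -> str:
    # Hash the raw file bytes so byte-identical images match exactly.
    return hashlib.md5(path.read_bytes()).hexdigest()

# Placeholder locations for the two datasets.
kaggle_test_dir = Path("dog-breed-identification/test")
stanford_dir = Path("stanford-dogs/Images")

stanford_hashes = {md5_of(p) for p in stanford_dir.rglob("*.jpg")}
overlap = [p.name for p in kaggle_test_dir.glob("*.jpg") if md5_of(p) in stanford_hashes]
print(f"{len(overlap)} Kaggle test images appear verbatim in Stanford Dogs")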

3. Scores exceeding Kaggle top performers

Some results (e.g., google-research-identify-contrails, osic-pulmonary-fibrosis-progression) appear to exceed the original Kaggle competition winners. While MLE-bench uses different test splits, achieving scores better than human top performers across multiple competitions is noteworthy.

I'm genuinely curious about Disarray's approach. Would you be able to share any of the following to help the community understand and learn from your results?

  • Agent execution logs
  • Generated training code
  • Model checkpoints or training artifacts

Thanks!

@dorx

dorx commented Feb 4, 2026

Hi @alexwang939393,

Doris here from Team Disarray. We really appreciate you taking the time to examine our results. To the best of our knowledge, this is the first time the community has done a deep dive into an MLE-Bench entrant’s results, and we’re honored our submission prompted this level of interest!

To answer your questions:

  1. RE: smartphone-decimeter-2022. The issue you flagged is a known MLE-Bench issue ([Issue] Google Smartphone Decimeter Challenge 2022 - Hackable #93), and our agent was indeed able to exploit this quirk to achieve a score of 0.0. We look forward to re-evaluating our solution on https://github.com/openai/frontier-evals when it’s ready for the new version of MLE-Bench.
  2. RE: dog-breed-identification. Our agent discovered on its own that this Kaggle competition was derived from the Stanford Dogs Dataset, via its ability to make connections across competitions and datasets on Kaggle. Below is a snippet of the code the agent used to generate the Group 3 submission. The ability to transfer knowledge across tasks is a major strength of the Disarray system.
#!/usr/bin/env python3
"""
Prepare merged dataset from competition data and Stanford Dogs.
Maps Stanford breed names to competition breed names and creates unified dataset.
"""

import pandas as pd
import numpy as np
from pathlib import Path
from collections import defaultdict
import re

# Paths
WORK_DIR = Path("/home/disarray/experiments/dog-breed-identification")
COMP_DATA_DIR = WORK_DIR / "data"
STANFORD_DIR = Path("/kaggle/input/jessicali9530-stanford-dogs-dataset/images/Images")
OUTPUT_DIR = WORK_DIR / "merged_dataset"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print("="*80)
print("PREPARING MERGED DATASET")
print("="*80)
  3. RE: Scores exceeding Kaggle top performers. Results like these are what truly excite our team about working in this space! This phenomenon isn't unique to Disarray. Nearly all existing agents on the leaderboard have achieved similar results to varying degrees. Major shoutout to the PiEvolve team for doing an even better job than us on this front!

Top-performing agents vs. top Kaggle leaderboard score (ranked by absolute wins):

1. PiEvolve_24hrs: 19 wins
2. PiEvolve_12hrs: 16 wins
3. Disarray: 15 wins ⭐
4. Thesis: 15 wins
5. AIRA-dojo: 14 wins
6. Famou-Agent-2.0: 12 wins
7. Leeroo: 12 wins
8. multi-agent-Neo: 10 wins
9. MLE-STAR-Pro-1.0: 10 wins
10. deepseek-v3.2-speciale-ML-Master-2.0: 10 wins

We’re actively preparing the technical report, and we’re excited to share our findings with the community. As for providing more detailed artifacts, we’d like to understand whether the MLE-Bench team @thesofakillers @joe-needham has any specific guidance on how to do this. Many of the model checkpoints are quite large, and logs are spread across multiple systems given the complexity of our multi-agent setup.

Thanks!

@alexwang939393

Thanks for the detailed response, @dorx!

Regarding points 1 and 2

  • I appreciate the transparency. My remaining question is whether exploiting benchmark quirks (smartphone-decimeter) or leveraging external datasets to directly map test labels (dog-breed) aligns with what MLE-Bench intends to measure as ML engineering capability. That's ultimately for the maintainers to judge.

Regarding point 3

  • Other teams beating Kaggle top scores on older competitions makes sense, given advances in pretrained models and architectures since those competitions ran.
    However, I noticed that several of Disarray's wins over Kaggle top performers are specifically on competitions where MLE-Bench uses its own custom test splits (e.g., contrails, osic).
    Beating the original Kaggle winners on MLE-Bench-specific test data is a different claim than doing so on well-studied benchmarks. Combined with the dog-breed pattern, this raises the question of whether external data sourcing may have played a broader role.

Looking forward to the technical report!

@thesofakillers
Contributor

thesofakillers commented Feb 5, 2026

Hi all. I no longer work at OpenAI so can no longer maintain this repo.
As author of the original benchmark, I'll chime in at a personal level, not representing any company.

On points 1 and 2, I think we have accepted that the benchmark is imperfect and partially exploitable in its current form.

We wanted to release fixes to these exploits in a single batch so that we could start with a fresh leaderboard, as opposed to gradually releasing fixes that would result in a gradual drift of what the metrics on the pre-existing leaderboard meant. This release has not yet happened, and as originally stated, the timeline for it is very unclear, so I would not rely on it.

So, in my view, with the current repo and the current leaderboard, it is acceptable, though undesirable, for solutions to leverage the exploits. It would be nice to disclose this when submission authors are aware of it, perhaps via a footnote in the table. But it's hard to retroactively apply the same rigour to older submissions made before we were aware of the exploits, which may themselves have been leveraging them.

That said, I don't think using external data sources is necessarily an "exploit". It may be in the dog-breeds case, but for many other Kaggle competitions, finding external data sources is part of the meta.


As for point 3, I would suggest not standardizing on a specific log format and submission process, purely to accommodate the multitude of agents/scaffolds that could possibly be used to submit to MLE-bench. If submitters want to underline their transparency, which I think is always a plus, they can complement their submission with e.g. a link to a website showing their logs/trajectories. An example is what the Microsoft Research team did with their first submission in #52, where they linked a website with their logs/trajectories (though that website seems to be down now).


Ultimately I literally don't have permissions to maintain this repo anymore, so these are just suggestions.

@AtrixTang

Hi, thanks for the submission! Just a quick question to help us understand the methodology better:

Does the agent utilize any feedback or grading signals from the private test set (the one used for medal evaluation) during the inference process?

Or is the test set accessed only after the agent has fully completed its run, solely to determine the final score, in line with standard ML evaluation practice? @dorx @moustafa-a @alexwang939393

@dorx

dorx commented Feb 13, 2026

Thanks for the continued discussion!

Per standard practice, private test data is not accessed by agents at any point during development. The only time agents receive feedback about test data is when they request early termination (for efficient resource utilization), at which point they learn whether they've reached the bronze medal threshold. Agents can only make a limited number of termination requests, since even this binary indicator introduces a small amount of data leakage. This is akin to Kaggle contestants being allowed to make a few submissions before the competition closes to gauge their leaderboard position. We reduced the signal to a binary indicator of medal status to compensate for the fact that live Kaggle competitions have separate datasets for evaluation before and after the competition closes.
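For concreteness, the gate looks roughly like the sketch below. This is illustrative pseudocode written for this comment rather than our production code, and the names and the request budget are made up:

# Illustrative sketch only: the gate exposes a single boolean ("at or above
# bronze") and nothing else, and the agent has a small fixed budget of requests.
class TerminationGate:
    def __init__(self, grade_fn, bronze_threshold, budget=3):
        self.grade_fn = grade_fn              # grades a submission on the held-out test set
        self.bronze_threshold = bronze_threshold
        self.remaining = budget

    def request_early_termination(self, submission) -> bool:
        """Return True iff the submission is at or above bronze; the raw score
        is never revealed. (For lower-is-better metrics, flip the comparison.)"""
        if self.remaining <= 0:
            raise RuntimeError("Termination-request budget exhausted")
        self.remaining -= 1
        return self.grade_fn(submission) >= self.bronze_threshold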

As is commonly done in ML model development, external data sources are discovered by the agents and brought in to enrich the original data. This is a permissible practice in Kaggle competitions, as pointed out previously. Autonomously identifying and incorporating relevant external data from tens of thousands of Kaggle datasets is a major technical challenge for agents. In the unanticipated case of dog-breed, there happens to be an open dataset that contains the labels.

We appreciate the engagement from the community. Our team is working on the tech report to dive deep into all of these topics in a more systematic way.

@AtrixTang

AtrixTang commented Feb 14, 2026

Thanks for the detailed clarification.

However, the Kaggle analogy fundamentally does not hold here:
In Kaggle competitions, the feedback (Public Leaderboard) comes from a separate data split, while the final ranking is determined by a completely hidden Private Leaderboard. Participants never receive feedback on the Private LB during the competition.

If an agent uses the same private test set for both the "early termination/retry signal" and the "final evaluation," this is, by definition, test-set leakage.

Even if the signal is "minimal" (binary), it fundamentally alters the evaluation protocol:

  1. Optimization vs. Generalization: It allows the agent to treat the test set as an objective function. If an agent knows it hasn't reached the Bronze threshold, it can retry or pivot strategies. This is no longer measuring generalization (pass@1); it is measuring search efficiency with an Oracle.
  2. Benchmark Integrity: Standard ML benchmarks strictly prohibit accessing the test set during the inference process. Allowing "retry until Bronze" creates a non-stationary evaluation target that cannot be compared to standard baselines.

This strongly reinforces the need for a separate track.

  • Standard Track: Strict pass@1, no feedback from the test set.
  • Oracle/Interactive Track: Allows binary feedback/retries against the test set.

Mixing these results conflates intrinsic reasoning capability with test-set probing efficiency, which would be misleading for the community. @moustafa-a @dorx @thesofakillers
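To make the distinction concrete, here is a toy illustration (hypothetical code, not MLE-Bench internals). Under strict pass@1, one attempt is graded once with no prior feedback; with a binary bronze oracle, the agent can keep switching strategies until the oracle says yes, so the reported medal rate reflects search efficiency against the test set rather than single-shot generalization:

import random

BRONZE = 0.9  # stand-in medal threshold

def grade_on_private_test(solution) -> float:
    # Stand-in for grading a submission on the private test set.
    return random.random()

def pass_at_1(make_solution) -> bool:
    # Standard track: one attempt, graded once, no feedback beforehand.
    return grade_on_private_test(make_solution()) >= BRONZE

def retry_until_bronze(make_solution, budget=3) -> bool:
    # Oracle/interactive track: binary test-set feedback drives a retry loop.
    return any(grade_on_private_test(make_solution()) >= BRONZE for _ in range(budget))

# If a single attempt reaches bronze with probability p, pass@1 succeeds with
# probability p, while the retry loop succeeds with probability 1 - (1 - p)**budget.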
