# SWE-bench-testsuite

A well-structured wrapper and test suite for running SWE-bench evaluations and inference.

## Project Structure

```
SWE-bench-testsuite/
├── src/                # Source code
│   ├── __init__.py
│   ├── evaluate.py     # Evaluation wrapper
│   ├── inference.py    # Inference wrapper
│   └── README.md
├── tests/              # Test files
│   ├── __init__.py
│   ├── test_setup.py   # Setup validation test
│   ├── test_eval.py    # Evaluation tests
│   └── README.md
├── SWE-bench/          # Git submodule
├── logs/               # Evaluation logs (generated)
├── outputs/            # Model outputs (generated)
├── .venv/              # Virtual environment
├── pyproject.toml      # Project configuration
├── setup.sh            # Setup script
└── readme.md           # This file
```
## Prerequisites

- Python 3.9+
- Docker Desktop (running)
- Git
## Quick Setup

```bash
# Clone with submodules
git clone --recursive https://github.com/VAR-META-Tech/SWE-bench-testsuite.git
cd SWE-bench-testsuite

# Run setup script
sh setup.sh

# Activate virtual environment
source .venv/bin/activate
```

## Manual Setup

```bash
# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Initialize SWE-bench submodule
git submodule update --init --recursive

# Install dependencies
pip install -e .
pip install -e ./SWE-bench
```

## Running Evaluation

```bash
# Using the module
python -m src.evaluate

# Or import in your code
python -c "from src.evaluate import run_evaluation; run_evaluation()"
```

## Running Inference

```bash
# Using the module
python -m src.inference

# Or import in your code
python -c "from src.inference import run_inference; run_inference()"
```

## Running Tests

```bash
# Run all tests
pytest

# Run specific test with output
pytest tests/test_setup.py -v -s

# Run with coverage
pytest --cov=src tests/

# Quick setup validation test
pytest tests/test_setup.py -v
```

This will verify:
- ✅ Virtual environment is configured
- ✅ Dependencies are installed
- ✅ SWE-bench can load datasets
- ✅ Docker connection works
- ✅ Evaluation harness executes
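For illustration, the first two checks above can be approximated with a small helper like the one below. This is a sketch only, not the repo's actual `test_setup.py`; the `check_environment` function and its defaults are hypothetical.

```python
import importlib.util
import shutil

def check_environment(required_modules=("swebench", "datasets"),
                      required_tools=("docker", "git")):
    """Report which Python dependencies are importable and which CLI tools are on PATH."""
    status = {}
    for mod in required_modules:
        # find_spec returns None when the module is not installed
        status[f"module:{mod}"] = importlib.util.find_spec(mod) is not None
    for tool in required_tools:
        # shutil.which returns None when the executable is not on PATH
        status[f"tool:{tool}"] = shutil.which(tool) is not None
    return status

if __name__ == "__main__":
    for name, ok in sorted(check_environment().items()):
        print(f"{'OK     ' if ok else 'MISSING'} {name}")
```

The Docker and harness checks are heavier (they need a live daemon), which is why the real test suite may report instance "errors" on a fresh machine.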
## Advanced Usage

### Custom Evaluation

```python
from src.evaluate import run_evaluation

run_evaluation(
    dataset_name="princeton-nlp/SWE-bench_Lite",
    predictions_path="outputs/predictions.jsonl",
    instance_ids=["sympy__sympy-20590", "django__django-11001"],
    max_workers=2,
    run_id="my-custom-eval",
    namespace="",  # Required on macOS Apple Silicon
    cache_level="env",
)
```

### Custom Inference

```python
from src.inference import run_inference

run_inference(
    model_name_or_path="princeton-nlp/SWE-Llama-13b",
    dataset_name="princeton-nlp/SWE-bench_Lite",
    max_instances=10,
    output_dir="outputs",
)
```

### Platform Notes

On macOS with Apple Silicon, you must use `namespace=""`:

```python
run_evaluation(..., namespace="")
```

### Cache Levels

- `cache_level="env"`: Cache at environment level (recommended)
- `cache_level="instance"`: Cache at instance level (faster rebuilds)
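The `predictions_path` file is JSONL, one prediction per line. The sketch below shows how such a file might be produced; the three keys follow the usual SWE-bench prediction schema (`instance_id`, `model_name_or_path`, `model_patch`), but check the submodule's documentation to confirm, and note the patch string here is a placeholder.

```python
import json
import os

# Illustrative predictions; model_patch would normally be a unified diff
# emitted by your model for the given instance.
predictions = [
    {
        "instance_id": "sympy__sympy-20590",
        "model_name_or_path": "my-model",
        "model_patch": "diff --git a/placeholder b/placeholder\n",
    },
]

os.makedirs("outputs", exist_ok=True)
with open("outputs/predictions.jsonl", "w") as f:
    for pred in predictions:
        f.write(json.dumps(pred) + "\n")
```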
## Notes

- First Run: Docker images will be built on demand, which can take time
- Test Behavior: Setup tests may show instance "errors" due to missing images; this is expected
- Docker: Ensure Docker Desktop is running before evaluation
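Since a stopped daemon is the most common failure mode, a quick programmatic check before launching an evaluation can save a long wait. This helper is a sketch, not part of the repo:

```python
import subprocess

def docker_is_running(timeout=10):
    """Return True if the Docker daemon responds to `docker info`."""
    try:
        return subprocess.run(
            ["docker", "info"], capture_output=True, timeout=timeout
        ).returncode == 0
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # Docker CLI missing from PATH, or the daemon hung
        return False

if __name__ == "__main__":
    print("Docker running:", docker_is_running())
```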
## Troubleshooting

```bash
# Activate the environment and install pytest if tests won't run
source .venv/bin/activate
pip install pytest

# Ensure Docker Desktop is running
docker ps

# Build images manually
python -m swebench.harness.docker_build \
    --instances sympy__sympy-20590 \
    --namespace ""
```

## License

See the SWE-bench repository for license information.