World-class benchmarking suite for comparing NVIDIA OpenFold2 NIM microservice against open-source OpenFold with publication-quality statistical rigor.
🚀 Deploying on Colossus? See COLOSSUS_DEPLOY.md for push-button setup (5 minutes from clone to running benchmarks)
- Out-of-the-box execution: Clone → bootstrap → run benchmarks → view HTML report
- Comprehensive metrics: Latency, GPU utilization, memory, energy, accuracy
- Pareto frontier analysis: Find optimal accuracy/performance trade-offs
- Multi-GPU support: Run on A100, H100, or other GPUs
- Colossus-optimized: Scripts and docs for NVIDIA Colossus infrastructure
- CASP-standard accuracy metrics: TM-score, GDT_TS in addition to RMSD/lDDT
- Statistical rigor: Proper warmup/measurement separation, p90/p95/p99 percentiles
- Apples-to-apples comparison: Precomputed MSAs ensure identical inputs for both systems
- Cold-start measurements: Quantify container startup and initialization overhead
- Version tracking: Automatic warnings for `:latest` tags, plus digest capture for reproducibility
- Tail latency analysis: Identify performance variance with percentile tracking
- NVIDIA GPU with drivers
- Docker with nvidia-container-toolkit
- NGC API key (for NIM): https://catalog.ngc.nvidia.com/
- Python 3.10+
- 100 GB free disk space
```bash
git clone https://github.com/your-org/openfold2-bench.git
cd openfold2-bench

# Install dependencies with pip
pip install -e .

# Optional: Install PNG export support (recommended)
pip install -e ".[plots]"

# Set NGC API key
export NGC_API_KEY="your_key_here"
```

Note: PNG export requires kaleido, which can be difficult to install on some HPC systems. If you skip `[plots]`, the benchmark will still work but will only generate HTML plots (not PNG).
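Whether PNG export is actually available can be checked at runtime before a long run. A minimal sketch (the `plot_export_format` helper is illustrative, not part of the suite's API):

```python
def plot_export_format() -> str:
    """Return 'png' if the optional kaleido dependency is importable, else fall back to 'html'."""
    try:
        import kaleido  # noqa: F401  # optional dependency from the [plots] extra
        return "png"
    except ImportError:
        return "html"

if __name__ == "__main__":
    print(f"Plots will be exported as: {plot_export_format()}")
```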
The benchmark suite includes production-ready Colossus integration with:
- One-click execution - Single command runs entire workflow
- Hardware auto-detection - Automatic GPU/storage detection and optimization
- Resumable benchmarks - Checkpoint and resume long-running jobs
- Campaign mode - Aggregate results across GPU types
```bash
# One command to bootstrap, configure, and run
./scripts/colossus/run.sh configs/default.yaml
```

This automatically:
- Bootstraps environment (first time only)
- Detects hardware (GPU type, VRAM, fast storage)
- Generates optimized config
- Validates prerequisites
- Runs benchmark with checkpointing
```bash
# 1. Bootstrap (first time only)
./scripts/colossus/bootstrap.sh
source .env.colossus

# 2. Detect hardware and generate config
bench colossus detect
bench colossus auto-config --base-config configs/default.yaml

# 3. Validate environment
bench preflight

# 4. Run benchmark
bench run --config configs/generated/auto_*.yaml

# Results will be at: $PERSIST_DIR/results/run_*/
```

```bash
# Create a campaign for GPU comparison
bench colossus campaign-create campaigns/gpu_comparison --name "H100 vs A100"

# Run on H100
bench run --config configs/auto_h100.yaml
bench colossus campaign-add --campaign-dir campaigns/gpu_comparison \
    --run-dir results/run_h100

# Run on A100
bench run --config configs/auto_a100.yaml
bench colossus campaign-add --campaign-dir campaigns/gpu_comparison \
    --run-dir results/run_a100

# Aggregate and compare
bench colossus campaign-aggregate campaigns/gpu_comparison
bench colossus campaign-report campaigns/gpu_comparison

# View report
open campaigns/gpu_comparison/campaign_report.html
```

See the Colossus Runbook for a detailed operations guide.
```bash
# Preflight checks
bench preflight

# Run default benchmark (~1-2 hours)
bench run --config configs/default.yaml

# View results
open results/run_*/analysis/report.html
```

The benchmark suite supports multiple protein datasets:
- 20 well-studied proteins covering diverse fold types
- Size range: 20-260 residues
- All fold types: all-α, all-β, α+β
- Use for: Quick validation and development
- 18 CASP15 (2022) targets from the Critical Assessment of Structure Prediction
- Categories: Free Modeling (FM), Template-Based Modeling Easy/Hard (TBM-easy/TBM-hard)
- Size range: 112-324 residues
- Use for: Research-grade benchmarking and publication-quality results
To run the CASP15 benchmark:

```bash
bench run --config configs/casp15.yaml
```

Standard Benchmarks:
- `default.yaml`: Fast benchmark with accuracy evaluation (~1-2 hours)
- `casp15.yaml`: CASP15 benchmark with deeper MSA (~3-4 hours)
- `full_matrix.yaml`: Comprehensive benchmark with scaling studies (~8-12 hours)

World-Class Benchmarks (NEW):
- `inference_only_benchmark.yaml`: Apples-to-apples comparison with precomputed MSAs
- `statistical_rigor.yaml`: 20 measurement passes for reliable percentiles
- `nim_backend_comparison.yaml`: TensorRT vs Torch comparison
- `cold_start.yaml`: Container initialization overhead measurements
- `world_class_benchmark.yaml`: Comprehensive publication-quality suite
```yaml
gpu:
  device_ids: [0]
  sampling_interval_ms: 50

nim:
  enabled: true
  container_image: "nvcr.io/nim/openfold/openfold2:latest"
  cache_dir: "${FAST_CACHE:-$HOME}/nim_cache"
  backend: "tensorrt"
  model_sets:
    - [3]              # Single model
    - [1, 2, 3, 4, 5]  # Ensemble

suites:
  - name: accuracy_small
    targets_file: bench/dataset/targets.yaml
    msa_depth: 1
    repeats: 3
```

For a fair comparison, use precomputed MSAs to ensure identical inputs:
```bash
# Step 1: Generate precomputed MSAs
python scripts/precompute_msas.py \
    --targets bench/dataset/casp15_targets.yaml \
    --output data/precomputed/casp15_msa128 \
    --msa-depth 128

# Step 2: Run the inference-only benchmark
bench run --config configs/inference_only_benchmark.yaml
```

This eliminates MSA generation variance and measures pure inference performance.
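Before a long inference-only run, it is worth sanity-checking that the precomputed MSAs actually have the expected depth. A small sketch that counts sequences in A3M files (the `count_a3m_sequences` and `check_msa_depth` helpers are illustrative, not part of the suite):

```python
from pathlib import Path

def count_a3m_sequences(a3m_text: str) -> int:
    """Count sequences in A3M/FASTA-style text (one '>' header per sequence)."""
    return sum(1 for line in a3m_text.splitlines() if line.startswith(">"))

def check_msa_depth(msa_dir: str, expected_depth: int = 128) -> dict:
    """Map each .a3m file under msa_dir to whether it meets the expected depth."""
    results = {}
    for path in Path(msa_dir).glob("*.a3m"):
        depth = count_a3m_sequences(path.read_text())
        results[path.name] = depth >= expected_depth
    return results

if __name__ == "__main__":
    demo = ">query\nMKV\n>hit1\nMKV\n>hit2\nMK-\n"
    print(count_a3m_sequences(demo))  # 3
```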
For publication-quality results with proper statistics:

```bash
bench run --config configs/statistical_rigor.yaml
```

Features:
- 10 warmup passes (excluded from results)
- 20 measurement passes (for reliable p95/p99 calculation)
- Target shuffling between passes
- Distribution statistics (median, p90, p95, p99)
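The warmup/measurement split and tail percentiles can be reproduced offline from raw timings. A minimal nearest-rank percentile sketch (the `latency_summary` helper is illustrative, not part of the suite's CLI):

```python
import statistics

def latency_summary(samples_s: list[float], warmup: int = 10) -> dict:
    """Drop warmup passes, then report median and tail percentiles (nearest-rank)."""
    measured = sorted(samples_s[warmup:])

    def pct(p: int) -> float:
        # Nearest-rank percentile: index = ceil(p * n / 100) - 1.
        idx = max(0, -(-p * len(measured) // 100) - 1)
        return measured[idx]

    return {
        "n": len(measured),
        "median": statistics.median(measured),
        "p90": pct(90),
        "p95": pct(95),
        "p99": pct(99),
    }

if __name__ == "__main__":
    # 10 warmup passes followed by 20 measurement passes.
    samples = [0.0] * 10 + [float(i) for i in range(1, 21)]
    print(latency_summary(samples))  # {'n': 20, 'median': 10.5, 'p90': 18.0, 'p95': 19.0, 'p99': 20.0}
```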
Measure container startup and initialization overhead:

```bash
bench run --config configs/cold_start.yaml
```

This measures:
- Container startup time (stop → ready)
- First request latency (cold)
- Second request latency (warm)
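The cold/warm request split can be approximated with plain wall-clock timing. A sketch using a stand-in `fake_predict` in place of a real NIM request (both helper names are illustrative):

```python
import time

def time_request(fn) -> float:
    """Wall-clock one request in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def cold_warm_split(fn) -> dict:
    """Issue the same request twice: the first hit is 'cold', the second 'warm'."""
    cold = time_request(fn)
    warm = time_request(fn)
    return {"cold_s": cold, "warm_s": warm, "overhead_s": cold - warm}

if __name__ == "__main__":
    cache = {}
    def fake_predict():
        # Simulate one-time initialization on the first call.
        if "ready" not in cache:
            time.sleep(0.05)
            cache["ready"] = True
    print(cold_warm_split(fake_predict))
```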
Compare the TensorRT and Torch backends:

```bash
# Run TensorRT
bench run --config configs/nim_backend_comparison.yaml

# Edit the config: set backend to "torch", then run again
bench run --config configs/nim_backend_comparison.yaml
```

Don't use `:latest` tags in production!
```yaml
# ❌ NOT RECOMMENDED
nim:
  container_image: "nvcr.io/nim/openfold/openfold2:latest"

# ✅ RECOMMENDED - Pin to a specific version
nim:
  container_image: "nvcr.io/nim/openfold/openfold2:1.0"

# ✅ BEST - Pin to a specific digest
nim:
  container_image: "nvcr.io/nim/openfold/openfold2@sha256:abc123..."
```

The benchmark automatically:
- Warns when a `:latest` tag is used
- Captures container digests in results
- Tracks exact versions for reproducibility
For publication-quality benchmarks:

1. Use warmup passes to eliminate cold-start effects:
   ```yaml
   warmup_passes: 10  # 5-10 recommended
   ```
2. Run sufficient measurements for tail latency analysis:
   ```yaml
   measurement_passes: 20  # Enables reliable p95/p99
   ```
3. Enable target shuffling to reduce order effects:
   ```yaml
   shuffle_targets_each_pass: true
   ```
4. Use precomputed MSAs for inference-only comparison:
   ```yaml
   precomputed_msa_dir: "data/precomputed/casp15_msa128"
   inference_only_mode: true
   ```
Ensure a fair comparison by using the official AlphaFold weights:

```yaml
openfold:
  weights_source: "alphafold_official"  # Match NIM's weights
```

```bash
# Validate environment
bench preflight

# Prepare datasets
bench prepare-data --config configs/default.yaml

# Run benchmark
bench run --config configs/default.yaml --output-dir results/run_001

# Generate analysis
bench analyze results/run_001

# Generate analysis with README update
bench analyze results/run_001 --update-readme

# Use the latest symlink for the most recent run
bench analyze results/latest --update-readme

# Aggregate multi-machine results
bench aggregate results/*/manifest.json --output-dir results/aggregate
```

Each benchmark run produces:
- `manifest.json`: Run metadata and environment info
- `records.parquet`: Per-prediction metrics (fast analysis)
- `records.jsonl`: Human-readable metrics
- `timeseries/*.parquet`: GPU/CPU monitoring data
- `structures/`: Predicted protein structures
- `analysis/report.html`: Interactive HTML report with plots
- `analysis/data/*.csv`: Exported CSV data files for external analysis
The `analyze` command automatically exports CSV files for all plots and analyses:
- `raw_records.csv` - All prediction records
- `summary_statistics.csv` - Aggregated performance metrics
- `pareto_data.csv` - Pareto frontier analysis data
- `scaling_data.csv` - Sequence length scaling data
- `accuracy_per_target.csv` - Per-target accuracy breakdown
These CSV files enable:
- External data analysis and visualization
- Publication-quality figure generation
- Integration with other tools and workflows
- Long-term data archival and reproducibility
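For external analysis, the exported CSVs can be consumed with nothing beyond the standard library. A sketch that averages per-system latency (the `system` and `wall_time_s` column names are assumptions for illustration, not the documented schema):

```python
import csv
import io

def load_records(csv_text: str) -> list[dict]:
    """Parse an exported CSV (e.g. raw_records.csv) into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def mean_latency_by_system(rows: list[dict]) -> dict[str, float]:
    """Average the 'wall_time_s' column grouped by the 'system' column."""
    groups: dict[str, list[float]] = {}
    for row in rows:
        groups.setdefault(row["system"], []).append(float(row["wall_time_s"]))
    return {system: sum(v) / len(v) for system, v in groups.items()}

if __name__ == "__main__":
    demo = "system,wall_time_s\nnim,1.2\nnim,1.4\nopenfold,2.0\n"
    print(mean_latency_by_system(load_records(demo)))
```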
- Wall time: End-to-end latency per prediction
- GPU hours: Normalized compute cost
- Time-to-first-GPU: Startup/overhead latency
- GPU utilization: SM and memory utilization
- Energy: Power consumption (Wh)
- Cα RMSD: Coordinate accuracy after Kabsch alignment (Ångströms)
- Cα lDDT: Local distance difference test (0-100)
- TM-score: Length-normalized structural similarity (0-1, >0.5 = same fold)
- GDT_TS: CASP standard metric (0-100, measures % residues within distance thresholds)
- Mean pLDDT: Model confidence (0-100)
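The Cα RMSD metric can be reproduced independently with the standard Kabsch algorithm; a minimal NumPy sketch, assuming coordinates arrive as already-paired (N, 3) Cα arrays:

```python
import numpy as np

def kabsch_rmsd(P: np.ndarray, Q: np.ndarray) -> float:
    """Cα RMSD (Å) between paired (N, 3) coordinate sets after optimal superposition."""
    # Center both coordinate sets on their centroids.
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # Kabsch: the optimal rotation comes from the SVD of the covariance matrix.
    U, _, Vt = np.linalg.svd(P.T @ Q)
    # Correct for reflection so the result is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(U @ Vt))
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    diff = P @ R - Q
    return float(np.sqrt((diff ** 2).sum() / len(P)))

if __name__ == "__main__":
    # A rotated-and-translated copy should superpose to RMSD ~0.
    P = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]], float)
    theta = 0.7
    Rz = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
    Q = P @ Rz + np.array([5.0, -3.0, 2.0])
    print(round(kabsch_rmsd(P, Q), 6))  # 0.0
```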
- Colossus Runbook: Operations guide for Colossus
- Methodology: Benchmarking methodology and fairness
- Metrics Schema: Complete metrics documentation
- API Integration: NIM and OpenFold integration details
The benchmark generates comprehensive HTML reports with interactive Plotly visualizations:
Performance Analysis:
- Latency distribution analysis: Violin plots, percentile comparison (NEW)
- Tail latency analysis: p99/median ratios for stability assessment (NEW)
- Pareto frontier plot: Accuracy vs performance trade-offs
- Scaling analysis: Sequence length and MSA depth scaling
- GPU utilization: Distribution and time-series plots
Accuracy Analysis:
- Per-target accuracy: RMSD, lDDT, TM-score, GDT_TS breakdown (NEW)
- Energy efficiency: Power consumption vs accuracy trade-offs
Data Exports:
- `raw_records.csv`: All prediction results
- `latency_distributions.csv`: p50/p90/p95/p99 statistics (NEW)
- `accuracy_distributions.csv`: Statistical distribution of accuracy metrics (NEW)
- `summary_statistics.csv`: Aggregated performance summary
- `cold_start_records.parquet`: Container initialization metrics (NEW)
```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Format code
black bench/
ruff check bench/

# Type checking
mypy bench/
```

- Check that NGC_API_KEY is set: `echo $NGC_API_KEY`
- Check that port 8000 is available: `lsof -i :8000`
- Check GPU access: `docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi`
- Clean old results (after backing them up): `rm -rf results/run_*`
- Clean Docker: `docker system prune -a`
- Ensure you are using fast storage (the primary drive on Colossus, not a volume)
- Check that the GPU is not throttling: `nvidia-smi dmon -s puct`
MIT License - see LICENSE file
If you use this benchmark in your research, please cite:
```bibtex
@software{openfold2_nim_bench,
  title={OpenFold2-NIM Performance Benchmark},
  author={NVIDIA Performance Team},
  year={2025},
  url={https://github.com/your-org/openfold2-bench}
}
```

Contributions are welcome! Please open an issue or pull request.
For questions or issues:
- GitHub Issues: https://github.com/your-org/openfold2-bench/issues
- NVIDIA Developer Forums: https://forums.developer.nvidia.com/