GitHub - YLab-Open/BRIDGE

BRIDGE (Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text)

📜 Background

Large Language Models (LLMs) have demonstrated transformative potential in healthcare, yet concerns remain about their reliability and clinical validity across diverse clinical tasks, specialties, and languages. To support timely and trustworthy evaluation, and building on our systematic review of global clinical text resources, we introduce BRIDGE, a multilingual benchmark comprising 87 real-world clinical text tasks that span nine languages and more than one million samples. The first leaderboard release systematically evaluated 52 state-of-the-art LLMs; the leaderboard has since been expanded through regular updates.

Key Features: Real-world Clinical Text, 9 Languages, 9 Task types, 14 Clinical specialties, 7 Clinical document types, 20 Clinical applications covering 6 clinical stages of patient care.

More details can be found in our BRIDGE paper and systematic review; the comprehensive leaderboard is available at BRIDGE Leaderboard.

This project is led and maintained by the team of Prof. Jie Yang and Prof. Kueiyu Joshua Lin at Harvard Medical School and Brigham and Women's Hospital.

📢 Updates

🗓️ 2026/04/08: We continue to evaluate the latest models. BRIDGE Leaderboard now includes 107 evaluated models.
🗓️ 2025/06/03: Updated the leaderboard with 21 additional models, for 75 evaluated models in total.
🗓️ 2025/04/28: BRIDGE Leaderboard V1.0.0 is now live with 54 evaluated models.
🗓️ 2025/04/28: Our paper BRIDGE is available on arXiv.

🛠️ How to Use?

1. Download the BRIDGE Dataset

All fully open-access datasets in BRIDGE are available in BRIDGE-Open. To ensure leaderboard fairness, we publicly release, for each task, five complete few-shot examples and all test samples with their instruction/input fields. Regulated-access clinical datasets cannot be published directly, but their task descriptions and original data sources are listed in the BRIDGE paper.

Put task files under dataset_raw/:

dataset_raw/<task_name>.SFT.json
dataset_raw/example/<task_name>.example.json
dataset_raw/example_cot/<task_name>.example.json
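
Before launching a run, it can help to confirm that a task has all three files in place. The following is a minimal sketch, not part of the repository: the helper names `expected_task_files` and `missing_task_files` are hypothetical, and `my_task` is a placeholder rather than a real BRIDGE task name.

```python
from pathlib import Path


def expected_task_files(task_name: str, root: str = "dataset_raw") -> list[Path]:
    """Return the three paths that the layout above implies for one task."""
    base = Path(root)
    return [
        base / f"{task_name}.SFT.json",
        base / "example" / f"{task_name}.example.json",
        base / "example_cot" / f"{task_name}.example.json",
    ]


def missing_task_files(task_name: str, root: str = "dataset_raw") -> list[Path]:
    """List the expected files that are not present on disk."""
    return [p for p in expected_task_files(task_name, root) if not p.is_file()]


if __name__ == "__main__":
    # "my_task" is a placeholder, not a real BRIDGE task name.
    for path in missing_task_files("my_task"):
        print(f"missing: {path}")
```

An empty list from `missing_task_files` means the task's main file and both example files were found.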

2. Prepare Local Configs

Real API keys and local checkpoint paths are intentionally not committed. Create local files from the templates:

cp configs/API.key.example.yaml configs/API.key.yaml
cp configs/dict_model_path.example.json configs/dict_model_path.json

configs/dict_model_path.json maps model names to local checkpoint paths. configs/API.key.yaml stores provider credentials, including the AZURE-OPENAI block used by the Azure example notebook.
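
A quick sanity check on the model-path file can catch typos before a long inference job starts. The sketch below assumes the file is a flat JSON object of the form `{"model_name": "/path/to/checkpoint"}`; that shape is inferred from the description above rather than taken from the template, and both function names are hypothetical.

```python
import json
from pathlib import Path


def load_model_paths(config_file: str = "configs/dict_model_path.json") -> dict[str, str]:
    """Load the model-name -> checkpoint-path mapping (assumed flat JSON object)."""
    with open(config_file, encoding="utf-8") as fh:
        mapping = json.load(fh)
    if not isinstance(mapping, dict):
        raise ValueError(f"{config_file} should contain a JSON object")
    return mapping


def unresolved_models(mapping: dict[str, str]) -> list[str]:
    """Return the model names whose checkpoint path does not exist locally."""
    return [name for name, path in mapping.items() if not Path(path).exists()]
```

Calling `unresolved_models(load_model_paths())` returns the names that would fail to load from disk, under the assumed file shape.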

3. Run Local Inference

The recommended local runtime is vLLM:

mamba activate vllm
python main.py \
  --model_name gpt-oss-20b \
  --gpus 0,1 \
  --engine vllm \
  --config configs/BRIDGE.yaml \
  --model_file configs/dict_model_path.json

Configure tasks, data paths, result paths, prompt modes, decoding modes, and token budgets in configs/BRIDGE.yaml. Supported prompt modes are direct, cot, direct-N-shot, and cot-N-shot.
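
To make the prompt-mode naming concrete, here is a small sketch that splits a mode string into a prompting style and a shot count. It assumes that `N` stands for a non-negative integer written into the name (for example `direct-5-shot`); the function `parse_prompt_mode` is illustrative and is not the repository's own parser.

```python
import re

# Matches "direct", "cot", "direct-<N>-shot", and "cot-<N>-shot".
_MODE_PATTERN = re.compile(r"^(direct|cot)(?:-(\d+)-shot)?$")


def parse_prompt_mode(mode: str) -> tuple[str, int]:
    """Split a prompt mode into (style, number_of_shots); zero-shot gives 0."""
    match = _MODE_PATTERN.match(mode)
    if match is None:
        raise ValueError(f"unsupported prompt mode: {mode!r}")
    style, shots = match.groups()
    return style, (int(shots) if shots is not None else 0)
```

Under this assumption, `parse_prompt_mode("cot")` gives `("cot", 0)` and `parse_prompt_mode("direct-5-shot")` gives `("direct", 5)`.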

Results are saved as:

result/<task_name>/<model_name>/<task_name>-<prompt_mode>-<decoding>-<seed>.result.json
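
When collecting outputs across many runs, it can be convenient to build this path programmatically. The sketch below follows the pattern shown above; the function name `result_path` is hypothetical, and the value `"greedy"` in the usage note is only an assumed example of a decoding label.

```python
from pathlib import Path


def result_path(
    task_name: str,
    model_name: str,
    prompt_mode: str,
    decoding: str,
    seed: int,
    root: str = "result",
) -> Path:
    """Build the result file path following the naming pattern above."""
    filename = f"{task_name}-{prompt_mode}-{decoding}-{seed}.result.json"
    return Path(root) / task_name / model_name / filename
```

For instance, `result_path("my_task", "gpt-oss-20b", "direct", "greedy", 42)` points to `result/my_task/gpt-oss-20b/my_task-direct-greedy-42.result.json`, where `my_task` is a placeholder task name.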

4. Evaluate Results

Evaluation is configured separately from inference through evaluation/evaluate_BRIDGE.yaml:

python -m evaluation.bridge
python -m evaluation.bridge --config evaluation/evaluate_BRIDGE.yaml

Set models, prompt modes, task types, and task subsets in the YAML file. The evaluation code warns when a result file does not align with the expected task rows, which supports partial sampling runs while keeping data-integrity issues visible.
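
The warn-but-continue behavior described above can be illustrated with a simple row-count comparison. This is only a sketch of the general idea: the actual alignment criteria used by the evaluation code are not described here, and `check_alignment` is a hypothetical helper.

```python
import warnings


def check_alignment(n_expected: int, result_rows: list, label: str = "result") -> bool:
    """Warn, without raising, when the result row count differs from the task rows."""
    if len(result_rows) != n_expected:
        warnings.warn(
            f"{label}: {len(result_rows)} result rows vs {n_expected} expected task rows"
        )
        return False
    return True
```

Returning a flag instead of raising lets a partially sampled run still be scored while the mismatch stays visible in the logs.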

5. More Documentation

USAGE.md: detailed runtime, data, inference, and evaluation guide.
configs/README.md: runtime config templates and local-only config paths.
scripts/README.md: single-model, multi-model, and delayed launch scripts.
notebooks/README.md: interactive notebook examples.
dataset/README.md: task data loading and prompt setup.
model/README.md: local inference and provider batch helpers.
evaluation/README.md: maintained evaluation entry point and YAML examples.
tests/README.md: lightweight regression tests.

6. Update the Leaderboard

To submit model results to the BRIDGE Leaderboard, send the generated result folder to the maintainers. We update the leaderboard regularly and will notify you when your results are added.

🤝 Contributing

We welcome and greatly value contributions and collaborations from the community! If you have clinical text datasets that you would like to share for broader exploration, please contact us! We are committed to expanding BRIDGE while strictly adhering to appropriate data use agreements and ethical guidelines. Let's work together to advance the responsible application of LLMs in medicine!

🚀 Donation

BRIDGE is a non-profit, researcher-led benchmark that requires substantial resources (e.g., high-performance GPUs, a dedicated team) to sustain. To support open and impactful academic research that advances clinical care, we welcome your contributions. Please contact Prof. Jie Yang at jyang66@bwh.harvard.edu to discuss donation opportunities.

📬 Contact Information

If you have any questions about BRIDGE or the leaderboard, feel free to reach out!

Leaderboard Managers: Jiageng Wu (jiwu7@bwh.harvard.edu), Kevin Xie (kevinxie@mit.edu), Bowen Gu (bogu@bwh.harvard.edu)
Benchmark Managers: Jiageng Wu, Bowen Gu
Project Lead: Jie Yang (jyang66@bwh.harvard.edu)

📚 Citation

If you find this leaderboard useful for your research and applications, please cite the following papers:

@article{BRIDGE-benchmark,
    title={BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text},
    author={Wu, Jiageng and Gu, Bowen and Zhou, Ren and Xie, Kevin and Snyder, Doug and Jiang, Yixing and Carducci, Valentina and Wyss, Richard and Desai, Rishi J and Alsentzer, Emily and Celi, Leo Anthony and Rodman, Adam and Schneeweiss, Sebastian and Chen, Jonathan H. and Romero-Brufau, Santiago and Lin, Kueiyu Joshua and Yang, Jie},
    year={2025},
    journal={arXiv preprint arXiv:2504.19467},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2504.19467},
}
@article{clinical-text-review,
    title={Clinical text datasets for medical artificial intelligence and large language models—a systematic review},
    author={Wu, Jiageng and Liu, Xiaocong and Li, Minghui and Li, Wanxin and Su, Zichang and Lin, Shixu and Garay, Lucas and Zhang, Zhiyun and Zhang, Yujie and Zeng, Qingcheng and Shen, Jie and Yuan, Changzheng and Yang, Jie},
    journal={NEJM AI},
    volume={1},
    number={6},
    pages={AIra2400012},
    year={2024},
    publisher={Massachusetts Medical Society}
}

If you use the datasets in BRIDGE, please also cite the original papers for those datasets, which are listed in our BRIDGE paper.
