**CLIP Is Shortsighted: Paying Attention Beyond the First Sentence**

Marc-Antoine Lavoie, Anas Mahmoud, Aldo Zaimi, Arsène Fansi Tchango, Steven L. Waslander

This is the official implementation of DeBias-CLIP.
| Checkpoint | Size | Urban1k T2I | DCI T2I | DOCCI T2I | COCO T2I | Flickr T2I |
|---|---|---|---|---|---|---|
| debias_vitb_3e.pt | ViT-B-16 | 93.0 | 67.6 | 80.0 | 43.0 | 36.6 |
| debias_vitl_3e.pt | ViT-L-14 | 95.2 | 73.5 | 85.6 | 48.1 | 43.9 |
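The T2I columns report text-to-image retrieval accuracy, i.e. the percentage of captions whose ground-truth image is ranked first by similarity. As an illustrative sketch only (not the repo's evaluation code, which is derived from COSMOS), recall@1 over a text-image similarity matrix can be computed like this:

```python
def t2i_recall_at_1(sim):
    """Text-to-image recall@1 (as a percentage).

    sim[i][j] is the similarity between text i and image j;
    the ground-truth pair for text i is assumed to be image i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Index of the highest-scoring image for this text.
        best = max(range(len(row)), key=lambda j: row[j])
        if best == i:
            hits += 1
    return 100.0 * hits / len(sim)

# Toy 3x3 similarity matrix: texts 0 and 1 retrieve their images,
# text 2's best match is image 1, so recall@1 is 2/3 ≈ 66.7.
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
    [0.2, 0.6, 0.5],
]
print(round(t2i_recall_at_1(sim), 1))  # → 66.7
```

In practice the similarities would come from the cosine similarity of CLIP text and image embeddings over the whole test split.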
Install the environment packages with `pip install -r requirements.txt`.
Please refer to `data_parsers/data_folder.md` for dataset download instructions and the directory organization required for training and evaluation.
We provide a training script, `train_script.sh`, with default parameters; you will need to set some arguments for your local environment. An example evaluation script is provided in `test_script.sh`.
Our code is based on the OpenCLIP codebase and builds upon Long-CLIP. Evaluation implementation is derived from COSMOS.
If you find DeBias-CLIP useful for your work, please cite:
```bibtex
@article{lavoie2026clip,
  title={CLIP Is Shortsighted: Paying Attention Beyond the First Sentence},
  author={Lavoie, Marc-Antoine and Mahmoud, Anas and Zaimi, Aldo and Tchango, Arsene Fansi and Waslander, Steven L},
  journal={arXiv preprint arXiv:2602.22419},
  year={2026}
}
```