Seeing Isn’t Believing: Uncovering Blind Spots in Evaluator Vision-Language Models
FOCUS is a benchmark and framework for testing the robustness of Evaluator VLMs (Vision-Language Models) across diverse tasks and evaluation strategies. The framework covers two task families:
- I2T (Image-to-Text): Evaluating VLM answers to visual questions.
- T2I (Text-to-Image): Evaluating images generated from text prompts.
The core idea: generate high-quality gold responses/images, then introduce carefully crafted adversarial perturbations across a taxonomy of error types. LLM-as-a-judge evaluators are then tested on whether they can detect these perturbations.
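As a minimal illustration of the testing idea (not the FOCUS implementation), a judge can be scored on how often it rates the gold response above its perturbed counterpart; the helper below is hypothetical:

```python
# Sketch: given a judge's scores for matched (gold, perturbed) pairs,
# count how often the judge ranks the gold response strictly higher.
def detection_rate(pairs: list[tuple[float, float]]) -> float:
    """pairs: (score_for_gold, score_for_perturbed) from the same judge."""
    if not pairs:
        return 0.0
    detected = sum(1 for gold, pert in pairs if gold > pert)
    return detected / len(pairs)

# Example: the judge detects the perturbation in 2 of 3 pairs.
rate = detection_rate([(9.0, 6.0), (8.0, 8.0), (7.0, 4.0)])
print(rate)
```

A tie (as in the second pair above) counts as a miss, since the judge failed to penalize the perturbation.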
Benchmark Instances
│
▼
┌───────────────────┐
│ 1. IPTonate App │ ← Human annotation & instance selection
└───────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ 2. Perturbation Generation │
│ │
│ I2T: gold answers ──► perturbed text answers │
│ T2I: gold images ──► perturbed images │
└──────────────────────────────────────────────────────┘
│
▼
┌───────────────────────┐
│ 3. PerturbVal App │ ← Human validation of perturbations
└───────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ 4. Evaluator Benchmarking │
│ │
│ Single-answer │ Comparison │ Reference-based │
│ Vanilla CoT │ Vanilla CoT │ Score vs. ref │
│ Rubrics │ Rules-based │ │
│ Multi-Axes │ Multi-Axes │ │
└──────────────────────────────────────────────────────┘
focus/
├── app/ # Streamlit annotation tools
│ ├── iptonate_benchmark_selection_app.py # Instance selection & annotation
│ └── perturbval_pertubation_validation_app.py # Perturbation validation
│
├── i2t/ # Image-to-Text pipeline
│ ├── perturbations/ # Generate adversarial text perturbations
│ └── evaluators/ # LLM-as-a-judge evaluation harness
│
└── t2i/ # Text-to-Image pipeline
├── perturbations/ # Generate adversarial image perturbations
└── evaluators/ # LLM-as-a-judge evaluation harness
pip install google-genai openai anthropic Pillow streamlit markdown aiohttp requests

Set API keys in your environment (or in a .env file):
export GEMINI_API_KEY="..."
export OPENAI_API_KEY="..."
export ANTHROPIC_API_KEY="..."
# For Vertex AI (optional)
export GOOGLE_CLOUD_PROJECT="..."
export GOOGLE_CLOUD_LOCATION="global"
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service_account.json"
export GEMINI_BUCKET_NAME="..."

Annotation Apps (app/)
Two Streamlit UIs for human-in-the-loop annotation:
| App | Script | Purpose |
|---|---|---|
| IPTonate | iptonate_benchmark_selection_app.py | Select and annotate benchmark instances; assign feasibility labels (Yes/No/Maybe) and difficulty |
| PerturbVal | perturbval_pertubation_validation_app.py | Validate generated perturbations; label as Valid / Score-Invariant / Incorrect / Not Sure / Not Relevant |
See app/README.md for setup and usage.
I2T Perturbation Generation (i2t/perturbations/)
Generates adversarial text perturbations of VLM answers across four evaluation categories and 21 subcategories:
| Category | Example Perturbation Types |
|---|---|
| General Perception | Entity mislabelling, attribute substitution, spatial relation perturbation |
| Semantic Understanding | Contextual nuance ignoring, cultural context substitution |
| Reasoning | Numerical errors, sequence misordering, misattributed relations |
| Creative Generation | Incoherent details, thematic drift, tone mismatch |
See i2t/perturbations/README.md for the full pipeline.
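To make the taxonomy concrete, one I2T instance pairs a gold answer with a perturbed answer tagged by category and subcategory. The record shape below is illustrative only; field names are assumptions, not the pipeline's actual schema:

```python
from dataclasses import dataclass

# Hypothetical record for one I2T perturbation instance.
@dataclass
class I2TPerturbation:
    question: str
    gold_answer: str
    perturbed_answer: str
    category: str      # e.g. "Reasoning"
    subcategory: str   # e.g. "Numerical errors"

p = I2TPerturbation(
    question="How many apples are on the table?",
    gold_answer="There are three apples on the table.",
    perturbed_answer="There are five apples on the table.",
    category="Reasoning",
    subcategory="Numerical errors",
)
print(p.subcategory)
```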
T2I Perturbation Generation (t2i/perturbations/)
Generates adversarial image perturbations using a two-model architecture (Gemini image generation + edit instruction models) across four categories and 21 subcategories:
| Category | Example Perturbation Types |
|---|---|
| Basic Skill | Object substitution, element omission, attribute manipulation |
| Scene Context & Style | Style inconsistency, environmental conflict, overcrowding |
| Reasoning | Physics manipulation, logical contradiction, functional absurdity |
| Text Rendering | Typographical substitution, incomplete rendering, mislabelled symbols |
See t2i/perturbations/README.md for the full pipeline.
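The two-model flow can be sketched as: one model proposes an edit instruction targeting a subcategory, and a second model applies that instruction to the gold image. Both functions below are stubs standing in for the actual Gemini calls; the names and wording are assumptions:

```python
# Hypothetical sketch of the two-model T2I perturbation flow.
def propose_edit(prompt: str, subcategory: str) -> str:
    # Stub for the edit-instruction model.
    return f"Edit the image for '{prompt}': introduce a {subcategory} error."

def apply_edit(gold_image: bytes, instruction: str) -> bytes:
    # Stub for the image-generation model; a real call would return
    # the edited image bytes.
    return gold_image + instruction.encode()

instruction = propose_edit("a red bicycle by a lake", "object substitution")
perturbed = apply_edit(b"<gold-image-bytes>", instruction)
print(instruction)
```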
I2T Evaluators (i2t/evaluators/)
LLM-as-a-judge harness for image-to-text evaluation. Supports batch API and parallel execution across OpenAI, Google Gemini, Vertex AI, and Anthropic Claude.
Evaluator types: Vanilla CoT · Rubrics · Multi-Axes · Comparison · Reference-Based
See i2t/evaluators/README.md for full documentation.
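For a sense of what a single-answer "Vanilla CoT" judge looks like, here is an illustrative prompt builder; the wording is an assumption, not the harness's actual template:

```python
# Hypothetical single-answer Vanilla CoT judge prompt.
def vanilla_cot_prompt(question: str, answer: str) -> str:
    return (
        "You are evaluating a VLM's answer to a visual question.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Think step by step about the answer's correctness with respect "
        "to the image, then output a score from 1 to 10 as 'Score: <n>'."
    )

prompt = vanilla_cot_prompt("What color is the car?", "The car is blue.")
print(prompt.splitlines()[1])
```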
T2I Evaluators (t2i/evaluators/)
LLM-as-a-judge harness for text-to-image evaluation — same evaluator architecture as I2T, adapted for image quality assessment.
See t2i/evaluators/README.md for full documentation.
If you use this work, please cite:
@article{khan2026seeing,
title = {Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models},
author = {Mohammed Safi Ur Rahman Khan and Sanjay Suryanarayanan and Tushar Anand and Mitesh M. Khapra},
year = {2026},
journal = {arXiv preprint arXiv:2604.21523}
}