FOCUS

Seeing Isn’t Believing: Uncovering Blind Spots in Evaluator Vision-Language Models

📜 Paper | 🤗 Data



Overview

FOCUS is a benchmark and framework for testing the robustness of Evaluator VLMs (Vision-Language Models) across diverse tasks and evaluation strategies. The framework covers two task families:

  • I2T (Image-to-Text): Evaluating VLM answers to visual questions.
  • T2I (Text-to-Image): Evaluating images generated from text prompts.

The core idea: generate high-quality gold responses/images, then introduce carefully crafted adversarial perturbations across a taxonomy of error types. LLM-as-a-judge evaluators are then tested on whether they can detect these perturbations.
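Concretely, in the I2T setting each benchmark item pairs a gold answer with a perturbed variant that the judge should penalize. A hypothetical instance is sketched below (field names are illustrative, not the FOCUS dataset schema):

# Hypothetical I2T instance; field names are illustrative only.
instance = {
    "question": "What is the man in the photo holding?",
    "gold_answer": "The man is holding a red umbrella.",
    "perturbed_answer": "The man is holding a red cane.",  # entity mislabelling
    "category": "General Perception",
    "subcategory": "entity mislabelling",
}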

Framework Pipeline

Benchmark Instances
        │
        ▼
┌───────────────────┐
│  1. IPTonate App  │  ← Human annotation & instance selection
└───────────────────┘
        │
        ▼
┌──────────────────────────────────────────────────────┐
│  2. Perturbation Generation                          │
│                                                      │
│   I2T: gold answers  ──►  perturbed text answers     │
│   T2I: gold images   ──►  perturbed images           │
└──────────────────────────────────────────────────────┘
        │
        ▼
┌───────────────────────┐
│  3. PerturbVal App    │  ← Human validation of perturbations
└───────────────────────┘
        │
        ▼
┌──────────────────────────────────────────────────────┐
│  4. Evaluator Benchmarking                           │
│                                                      │
│   Single-answer  │  Comparison  │  Reference-based  │
│   Vanilla CoT    │  Vanilla CoT │  Score vs. ref    │
│   Rubrics        │  Rules-based │                   │
│   Multi-Axes     │  Multi-Axes  │                   │
└──────────────────────────────────────────────────────┘

Repository Structure

focus/
├── app/                        # Streamlit annotation tools
│   ├── iptonate_benchmark_selection_app.py   # Instance selection & annotation
│   └── perturbval_pertubation_validation_app.py  # Perturbation validation
│
├── i2t/                        # Image-to-Text pipeline
│   ├── perturbations/          # Generate adversarial text perturbations
│   └── evaluators/             # LLM-as-a-judge evaluation harness
│
└── t2i/                        # Text-to-Image pipeline
    ├── perturbations/          # Generate adversarial image perturbations
    └── evaluators/             # LLM-as-a-judge evaluation harness

Getting Started

Requirements

pip install google-genai openai anthropic Pillow streamlit markdown aiohttp requests

Set API keys in your environment (or in a .env file):

export GEMINI_API_KEY="..."
export OPENAI_API_KEY="..."
export ANTHROPIC_API_KEY="..."

# For Vertex AI (optional)
export GOOGLE_CLOUD_PROJECT="..."
export GOOGLE_CLOUD_LOCATION="global"
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service_account.json"
export GEMINI_BUCKET_NAME="..."
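A minimal sketch of initializing the provider clients from these variables (the OpenAI and Anthropic SDKs pick up their keys from the environment automatically):

import os

from anthropic import Anthropic
from google import genai
from openai import OpenAI

openai_client = OpenAI()           # reads OPENAI_API_KEY
anthropic_client = Anthropic()     # reads ANTHROPIC_API_KEY
gemini_client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])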

Components

Annotation Apps (app/)

Two Streamlit UIs for human-in-the-loop annotation:

  • IPTonate (iptonate_benchmark_selection_app.py): select and annotate benchmark instances; assign feasibility labels (Yes/No/Maybe) and difficulty.
  • PerturbVal (perturbval_pertubation_validation_app.py): validate generated perturbations; label each as Valid / Score-Invariant / Incorrect / Not Sure / Not Relevant.

See app/README.md for setup and usage.


I2T Perturbation Generation (i2t/perturbations/)

Generates adversarial text perturbations of VLM answers across four evaluation categories and 21 subcategories:

Category                Example Perturbation Types
General Perception      Entity mislabelling, attribute substitution, spatial relation perturbation
Semantic Understanding  Contextual nuance ignoring, cultural context substitution
Reasoning               Numerical errors, sequence misordering, misattributed relations
Creative Generation     Incoherent details, thematic drift, tone mismatch
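A minimal sketch of the kind of LLM call such a pipeline makes, assuming an OpenAI model as the perturber (the prompt wording and model name are assumptions, not the repository's actual templates):

from openai import OpenAI  # requires OPENAI_API_KEY in the environment

client = OpenAI()

def perturb_answer(gold_answer: str, subcategory: str) -> str:
    """Ask an LLM to inject exactly one subtle error of the given type."""
    prompt = (
        f"Rewrite the answer below, introducing exactly one subtle "
        f"'{subcategory}' error. Keep everything else unchanged.\n\n"
        f"Answer: {gold_answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(perturb_answer("The man is holding a red umbrella.", "entity mislabelling"))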

See i2t/perturbations/README.md for the full pipeline.


T2I Perturbation Generation (t2i/perturbations/)

Generates adversarial image perturbations using a two-model architecture (Gemini image generation + edit instruction models) across four categories and 21 subcategories:

Category               Example Perturbation Types
Basic Skill            Object substitution, element omission, attribute manipulation
Scene Context & Style  Style inconsistency, environmental conflict, overcrowding
Reasoning              Physics manipulation, logical contradiction, functional absurdity
Text Rendering         Typographical substitution, incomplete rendering, mislabelled symbols
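A rough sketch of the two-model flow with the google-genai SDK (model names and prompts are assumptions, not the pipeline's actual configuration): one call drafts an edit instruction, a second applies it with an image-output Gemini model.

import os
from io import BytesIO

from PIL import Image
from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
gold = Image.open("gold_image.png")

# Model 1: draft a targeted edit instruction for one perturbation type.
instruction = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[gold, "Write one edit instruction that introduces a subtle "
                    "'object substitution' error into this image."],
).text

# Model 2: apply the instruction with an image-capable Gemini model.
edited = client.models.generate_content(
    model="gemini-2.0-flash-preview-image-generation",
    contents=[gold, instruction],
    config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
)
for part in edited.candidates[0].content.parts:
    if part.inline_data is not None:  # image parts carry raw bytes
        Image.open(BytesIO(part.inline_data.data)).save("perturbed_image.png")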

See t2i/perturbations/README.md for the full pipeline.


I2T Evaluators (i2t/evaluators/)

LLM-as-a-judge harness for image-to-text evaluation. Supports batch API and parallel execution across OpenAI, Google Gemini, Vertex AI, and Anthropic Claude.

Evaluator types: Vanilla CoT · Rubrics · Multi-Axes · Comparison · Reference-Based
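For flavour, a single-answer Vanilla CoT judgment could be sketched as follows (the prompt and model are assumptions, not the harness's actual templates):

import base64

from openai import OpenAI

client = OpenAI()

def judge_answer(image_path: str, question: str, answer: str) -> str:
    """Single-answer Vanilla CoT: reason step by step, then score 1-10."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    f"Question: {question}\nAnswer: {answer}\n\n"
                    "Check the answer against the image step by step, "
                    "then give a 1-10 score on the final line."
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content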

See i2t/evaluators/README.md for full documentation.


T2I Evaluators (t2i/evaluators/)

LLM-as-a-judge harness for text-to-image evaluation — same evaluator architecture as I2T, adapted for image quality assessment.
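For example, a comparison-mode judgment over a gold/perturbed image pair might look like this sketch (again assumption-level, not the harness itself):

import base64

from openai import OpenAI

client = OpenAI()

def _data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

def compare_images(prompt: str, path_a: str, path_b: str) -> str:
    """Comparison mode: which image better matches the text prompt?"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    f"Prompt: {prompt}\n\nImage A and Image B follow. "
                    "Reason step by step, then answer 'A' or 'B' on the final line."
                )},
                {"type": "image_url", "image_url": {"url": _data_url(path_a)}},
                {"type": "image_url", "image_url": {"url": _data_url(path_b)}},
            ],
        }],
    )
    return response.choices[0].message.content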

See t2i/evaluators/README.md for full documentation.


Citation

If you use this work, please cite:

@article{khan2026seeing,
  title   = {Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models},
  author  = {Mohammed Safi Ur Rahman Khan and Sanjay Suryanarayanan and Tushar Anand and Mitesh M. Khapra},
  year    = {2026},
  journal = {arXiv preprint arXiv:2604.21523}
}
