This repository contains the official implementation of V-Reflection, a framework that transforms an MLLM into an active interrogator through a "think-then-look" visual reflection mechanism.
- Stage 1 (BCM): Box-Guided Compression establishes stable pixel-to-latent targets through explicit spatial grounding. Stochastic Decoupled Alignment, a bidirectional symmetric loss, jointly trains the resampler and the LLM.
- Stage 2 (DAC): Dynamic Autoregressive Compression maps the model's hidden states into dynamic probes that interrogate the global visual feature map. The student uses LLM hidden states as queries and full-image features as keys/values, with MSE distillation from the frozen BCM teacher.
- Inference: Both BCM and DAC remain entirely inactive. Decoding is purely end-to-end and autoregressive in the latent space: the last-position hidden state serves as the next-step input embedding for 8-step latent reasoning with minimal inference overhead.
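The inference loop above can be sketched as follows. This is a minimal illustration, not the repo's actual modules: `TinyLM` is a hypothetical stand-in for the frozen MLLM backbone, and for simplicity the sketch re-encodes the full sequence each step instead of using a KV cache.

```python
import torch
import torch.nn as nn

# Minimal sketch of 8-step latent reasoning: the hidden state at the last
# position is fed back as the next step's input embedding.
class TinyLM(nn.Module):
    """Hypothetical stand-in for the MLLM backbone."""
    def __init__(self, d_model=32):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)

    def forward(self, inputs_embeds):
        return self.layer(inputs_embeds)  # (B, T, D) hidden states

def latent_reasoning(model, prompt_embeds, num_latent_steps=8):
    """Append one latent token per step, reusing the last hidden state."""
    seq = prompt_embeds  # (B, T, D)
    for _ in range(num_latent_steps):
        hidden = model(seq)             # (B, T, D)
        next_embed = hidden[:, -1:, :]  # last-position hidden state
        seq = torch.cat([seq, next_embed], dim=1)
    return seq

model = TinyLM()
prompt = torch.randn(2, 5, 32)
out = latent_reasoning(model, prompt)
print(out.shape)  # 5 prompt tokens + 8 latent steps
```

After the 8 latent steps, the model would resume normal token decoding to produce the `<answer>` span.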
```bash
conda env create -f environment.yaml
conda activate train
pip install qwen-vl-utils
pip install flash-attn --no-build-isolation
```

Note: Install flash-attn after the other packages (it requires a CUDA build). For wandb logging, run `wandb login` or set `WANDB_API_KEY`.
We provide pre-formatted LVR training data. Download it from HuggingFace and place the JSON files in `./data/`.
The directory structure after downloading should be:
```
data/
├── meta_data_lvr_sft_stage1.json        # Meta config for default (SROIE + DUDE)
├── viscot_sroie_dude_lvr_formatted.json # SROIE + DUDE subset
└── viscot_363k_lvr_formatted.json       # Full 363K Visual CoT dataset
```

Download images for the Visual CoT dataset. Some sources may require registration or form completion.
| Dataset | Source |
|---|---|
| COCO | train2017 / train2014 |
| GQA | images |
| Flickr30k | homepage |
| TextVQA | train_val_images |
| DocVQA | homepage |
| InfographicsVQA | homepage |
| Open Images | download script (splits 0–5) |
| VSR | images |
| DUDE | images |
| SROIE | homepage |
| CUB | images |
| Visual7W | repo |
Place images under `{image_folder}` (e.g. `./data/images` or your custom path). The structure should match the paths in the annotation JSON:
```
{image_folder}/
├── coco/
│   ├── train2017/
│   └── train2014/
├── gqa/
│   └── images/
├── flickr30k/
│   └── flickr30k-images/
├── textvqa/
│   └── train_images/
├── docvqa/
├── infographicsvqa/
├── openimages/            # only need train_0 ... train_5
├── vsr/
│   └── images/
├── dude/
│   └── viscot/dude/       # or DUDE_train-val-test_binaries/images/train/
├── sroie/
│   └── viscot/sroie/
├── cub/
│   └── CUB_200_2011/images/
└── visual7w/
    └── images/
```

Set `--data_path` to a meta JSON that lists your formatted datasets. Example `data/meta_data_lvr_sft_stage1.json`:
```json
[
  {"ds_name": "viscot_1", "data_path": "data/viscot_sroie_dude_lvr_formatted.json", "image_folder": "/path/to/images", "ds_type": "Q_A"},
  {"ds_name": "viscot_2", "data_path": "data/viscot_363k_lvr_formatted.json", "image_folder": "/path/to/images", "ds_type": "Q_A"}
]
```

Each entry follows the LLaVA specification with `image`, `conversations`, and `bboxes` fields. `<image>` and `<lvr>` are placeholder tokens used during data collation.
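As a quick sanity check before launching training, a minimal sketch (hypothetical helper, not part of the repo) that loads such a meta JSON and verifies each referenced annotation file exists:

```python
import json
import os
import tempfile

def load_meta(meta_path):
    """Load a meta config and check each dataset's annotation file exists."""
    with open(meta_path) as f:
        datasets = json.load(f)
    for ds in datasets:
        # Required keys per the meta-config format shown above.
        assert {"ds_name", "data_path", "image_folder", "ds_type"} <= ds.keys(), ds
        if not os.path.exists(ds["data_path"]):
            print(f"warning: missing annotations for {ds['ds_name']}: {ds['data_path']}")
    return datasets

# Usage with a throwaway meta file:
with tempfile.TemporaryDirectory() as tmp:
    ann = os.path.join(tmp, "ds.json")
    with open(ann, "w") as f:
        json.dump([], f)
    meta = os.path.join(tmp, "meta.json")
    with open(meta, "w") as f:
        json.dump([{"ds_name": "demo", "data_path": ann,
                    "image_folder": tmp, "ds_type": "Q_A"}], f)
    entries = load_meta(meta)
```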
Example dataset entry
```json
{
  "dataset": "flickr30k",
  "image": ["viscot/flickr30k/2618322793.jpg"],
  "conversations": [
    {"from": "human", "value": "<image>\nCan you describe the lower apparel of the child on the swing?\nProvide a short and direct response."},
    {"from": "gpt", "value": "<lvr>\n<answer> The child on the swing is wearing dark blue denim shorts. </answer>"}
  ],
  "bboxes": [[0.382, 0.456, 0.718, 0.656]]
}
```

First, download our provided model weights and set `EVAL_CHECKPOINT_PATH` to your checkpoint directory.
For full benchmark evaluation (BLINK, MMVP, VSTAR, HRBench4K, HRBench8K, MME-RealWorld-Lite):
```bash
bash scripts_release/evaluation/evaluation_7b_stage2.sh
```

Prerequisites: Download the base model Qwen2.5-VL-7B-Instruct from HuggingFace before training. The training scripts load it by default as `Qwen/Qwen2.5-VL-7B-Instruct` (or set `HF_HOME` / `TRANSFORMERS_CACHE` to use custom cache paths).
Stage 1 (Box-Guided Compression): Compresses variable-length bbox visual features into a fixed set of 8 latent tokens via cross-attention. Set `--data_path` and `--image_folder` in the script.
```bash
bash scripts_release/train/sft_7b_stage1_box_resampler.sh
```

Stage 2 (Dynamic Autoregressive Compression): Teacher-student distillation. Requires a Stage 1 checkpoint.
```bash
export CHECKPOINT_PATH="path/to/stage1_checkpoint"
bash scripts_release/train/sft_7b_stage2_distillation.sh
```

Note: We use InternVL-style data packing; enable it with `--enable_data_packing True`.
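The Stage 2 distillation objective can be sketched as follows. Module names, dimensions, and the 256-token feature map are hypothetical placeholders; the actual implementation lives in the training scripts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of DAC distillation: LLM hidden states act as queries ("dynamic
# probes"), full-image features act as keys/values, and the student's
# compressed latents are regressed (MSE) onto frozen BCM-teacher latents.
class StudentProbe(nn.Module):
    def __init__(self, d_model=32, nhead=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, hidden_states, image_features):
        latents, _ = self.attn(query=hidden_states,
                               key=image_features,
                               value=image_features)
        return latents  # same length as the probe sequence

B, D = 2, 32
probe_states = torch.randn(B, 8, D)     # hidden states of the 8 latent steps
image_feats = torch.randn(B, 256, D)    # global visual feature map (toy size)
teacher_latents = torch.randn(B, 8, D)  # targets from the frozen BCM teacher

student = StudentProbe(D)
student_latents = student(probe_states, image_feats)
loss = F.mse_loss(student_latents, teacher_latents)
print(student_latents.shape, loss.item())
```

In the real pipeline the teacher latents would come from the Stage 1 BCM resampler rather than random tensors, and only the student side receives gradients.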
We provide checkpoints for V-Reflection. Results on visual perception and high-resolution benchmarks:
| Benchmark | V-Reflection (ours) | Qwen2.5-VL-7B |
|---|---|---|
| MMVP | 72.3 | 66.7 |
| BLINK | 56.4 | 54.5 |
| V* | 81.7 | 78.5 |
| HRBench-4K | 72.6 | 68.0 |
| HRBench-8K | 66.3 | 63.8 |
| MME-Real-Lite | 53.9 | 45.8 |
Visualizations confirm that latent reasoning autonomously localizes task-critical visual evidence.
Training Attention (Stage 2): Teacher vs Student attention maps during distillation.
Inference Attention: Latent reasoning visualization — dynamic probes interrogate the visual feature space during inference.
We would like to thank the authors of the following projects for their excellent work:
- Qwen2.5-VL - MLLM series from Qwen family
- LVR - Latent Visual Reasoning model by Vincent Lee
- Visual-CoT - Visual CoT dataset
- InternVL - Open-source MLLM family by Shanghai AI Lab
This project is licensed under the Apache-2.0 License. See the LICENSE file for details.



