This repository contains the official implementation of V-Reflection, a framework that transforms an MLLM into an active interrogator through a "think-then-look" visual reflection mechanism.
- Stage 1 (BCM): Box-Guided Compression establishes stable pixel-to-latent targets through explicit spatial grounding. Stochastic Decoupled Alignment, a bidirectional symmetric loss, jointly trains the resampler and the LLM.
- Stage 2 (DAC): Dynamic Autoregressive Compression maps the model's hidden states into dynamic probes that interrogate the global visual feature map. The student uses LLM hidden states as queries and full-image features as keys/values, with MSE distillation from the frozen BCM teacher.
- Inference: Both BCM and DAC remain entirely inactive. Decoding is purely end-to-end and autoregressive in the latent space: the last-position hidden state serves as the next-step input embedding for 8-step latent reasoning with minimal inference overhead.
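The inference loop above can be sketched as follows. This is a minimal illustration, not the repo's actual modules: `TinyLM` is a hypothetical stand-in for the frozen MLLM backbone, and for simplicity the sketch re-encodes the full sequence each step instead of using a KV cache.

```python
import torch
import torch.nn as nn

# Minimal sketch of 8-step latent reasoning: the hidden state at the last
# position is fed back as the next step's input embedding.
class TinyLM(nn.Module):
    """Hypothetical stand-in for the MLLM backbone."""
    def __init__(self, d_model=32):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)

    def forward(self, inputs_embeds):
        return self.layer(inputs_embeds)  # (B, T, D) hidden states

def latent_reasoning(model, prompt_embeds, num_latent_steps=8):
    """Append one latent token per step, reusing the last hidden state."""
    seq = prompt_embeds  # (B, T, D)
    for _ in range(num_latent_steps):
        hidden = model(seq)             # (B, T, D)
        next_embed = hidden[:, -1:, :]  # last-position hidden state
        seq = torch.cat([seq, next_embed], dim=1)
    return seq

model = TinyLM()
prompt = torch.randn(2, 5, 32)
out = latent_reasoning(model, prompt)
print(out.shape)  # 5 prompt tokens + 8 latent steps
```

After the 8 latent steps, the model would resume normal token decoding to produce the `<answer>` span.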
```bash
conda env create -f environment.yaml
conda activate train
pip install qwen-vl-utils
pip install flash-attn --no-build-isolation
```

Note: Install flash-attn after the other packages (it requires a CUDA build). For wandb logging, run `wandb login` or set `WANDB_API_KEY`.
We provide pre-formatted LVR training data. Download it from HuggingFace and place the JSON files in `./data/`.
The directory structure after downloading should be:
```
data/
├── meta_data_lvr_sft_stage1.json        # Meta config for default (SROIE + DUDE)
├── viscot_sroie_dude_lvr_formatted.json # SROIE + DUDE subset
└── viscot_363k_lvr_formatted.json       # Full 363K Visual CoT dataset
```

Download images for the Visual CoT dataset. Some sources may require registration or form completion.
| Dataset | Source |
|---|---|
| COCO | train2017 / train2014 |
| GQA | images |
| Flickr30k | homepage |
| TextVQA | train_val_images |
| DocVQA | homepage |
| InfographicsVQA | homepage |
| Open Images | download script (splits 0–5) |
| VSR | images |
| DUDE | images |
| SROIE | homepage |
| CUB | images |
| Visual7W | repo |
Place images under `{image_folder}` (e.g. `./data/images` or your custom path). The structure should match the paths in the annotation JSON:
```
{image_folder}/
├── coco/
│   ├── train2017/
│   └── train2014/
├── gqa/
│   └── images/
├── flickr30k/
│   └── flickr30k-images/
├── textvqa/
│   └── train_images/
├── docvqa/
├── infographicsvqa/
├── openimages/            # only need train_0 ... train_5
├── vsr/
│   └── images/
├── dude/
│   └── viscot/dude/       # or DUDE_train-val-test_binaries/images/train/
├── sroie/
│   └── viscot/sroie/
├── cub/
│   └── CUB_200_2011/images/
└── visual7w/
    └── images/
```

Set `--data_path` to a meta JSON that lists your formatted datasets. Example `data/meta_data_lvr_sft_stage1.json`:
```json
[
  {"ds_name": "viscot_1", "data_path": "data/viscot_sroie_dude_lvr_formatted.json", "image_folder": "/path/to/images", "ds_type": "Q_A"},
  {"ds_name": "viscot_2", "data_path": "data/viscot_363k_lvr_formatted.json", "image_folder": "/path/to/images", "ds_type": "Q_A"}
]
```

Each entry follows the LLaVA specification with `image`, `conversations`, and `bboxes` fields. `<image>` and `<lvr>` are placeholder tokens used during data collation.
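As a quick sanity check before launching training, a minimal sketch (hypothetical helper, not part of the repo) that loads such a meta JSON and verifies each referenced annotation file exists:

```python
import json
import os
import tempfile

def load_meta(meta_path):
    """Load a meta config and check each dataset's annotation file exists."""
    with open(meta_path) as f:
        datasets = json.load(f)
    for ds in datasets:
        # Required keys per the meta-config format shown above.
        assert {"ds_name", "data_path", "image_folder", "ds_type"} <= ds.keys(), ds
        if not os.path.exists(ds["data_path"]):
            print(f"warning: missing annotations for {ds['ds_name']}: {ds['data_path']}")
    return datasets

# Usage with a throwaway meta file:
with tempfile.TemporaryDirectory() as tmp:
    ann = os.path.join(tmp, "ds.json")
    with open(ann, "w") as f:
        json.dump([], f)
    meta = os.path.join(tmp, "meta.json")
    with open(meta, "w") as f:
        json.dump([{"ds_name": "demo", "data_path": ann,
                    "image_folder": tmp, "ds_type": "Q_A"}], f)
    entries = load_meta(meta)
```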
Example dataset entry
```json
{
  "dataset": "flickr30k",
  "image": ["viscot/flickr30k/2618322793.jpg"],
  "conversations": [
    {"from": "human", "value": "<image>\nCan you describe the lower apparel of the child on the swing?\nProvide a short and direct response."},
    {"from": "gpt", "value": "<lvr>\n<answer> The child on the swing is wearing dark blue denim shorts. </answer>"}
  ],
  "bboxes": [[0.382, 0.456, 0.718, 0.656]]
}
```

First, download our provided model weights and set `EVAL_CHECKPOINT_PATH` to your checkpoint directory.
For full benchmark evaluation (BLINK, MMVP, VSTAR, HRBench4K, HRBench8K, MME-RealWorld-Lite):
```bash
bash scripts_release/evaluation/evaluation_7b_stage2.sh
```

Prerequisites: Download the base model Qwen2.5-VL-7B-Instruct from HuggingFace before training. The training scripts load it by default as `Qwen/Qwen2.5-VL-7B-Instruct` (or set `HF_HOME` / `TRANSFORMERS_CACHE` to use custom cache paths).
Stage 1 (Box-Guided Compression): Compresses variable-length bbox visual features into a fixed set of 8 latent tokens via cross-attention. Set `--data_path` and `--image_folder` in the script.
```bash
bash scripts_release/train/sft_7b_stage1_box_resampler.sh
```

Stage 2 (Dynamic Autoregressive Compression): Teacher-student distillation. Requires a Stage 1 checkpoint.
```bash
export CHECKPOINT_PATH="path/to/stage1_checkpoint"
bash scripts_release/train/sft_7b_stage2_distillation.sh
```

Note: We use InternVL-style data packing; enable it with `--enable_data_packing True`.
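The Stage 2 distillation objective can be sketched as follows. Module names, dimensions, and the 256-token feature map are hypothetical placeholders; the actual implementation lives in the training scripts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of DAC distillation: LLM hidden states act as queries ("dynamic
# probes"), full-image features act as keys/values, and the student's
# compressed latents are regressed (MSE) onto frozen BCM-teacher latents.
class StudentProbe(nn.Module):
    def __init__(self, d_model=32, nhead=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, hidden_states, image_features):
        latents, _ = self.attn(query=hidden_states,
                               key=image_features,
                               value=image_features)
        return latents  # same length as the probe sequence

B, D = 2, 32
probe_states = torch.randn(B, 8, D)     # hidden states of the 8 latent steps
image_feats = torch.randn(B, 256, D)    # global visual feature map (toy size)
teacher_latents = torch.randn(B, 8, D)  # targets from the frozen BCM teacher

student = StudentProbe(D)
student_latents = student(probe_states, image_feats)
loss = F.mse_loss(student_latents, teacher_latents)
print(student_latents.shape, loss.item())
```

In the real pipeline the teacher latents would come from the Stage 1 BCM resampler rather than random tensors, and only the student side receives gradients.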
We provide checkpoints for V-Reflection. Results on visual perception and high-resolution benchmarks:
| Benchmark | V-Reflection (ours) | Qwen2.5-VL-7B |
|---|---|---|
| MMVP | 72.3 | 66.7 |
| BLINK | 56.4 | 54.5 |
| V* | 81.7 | 78.5 |
| HRBench-4K | 72.6 | 68.0 |
| HRBench-8K | 66.3 | 63.8 |
| MME-Real-Lite | 53.9 | 45.8 |
Visualizations confirm that latent reasoning autonomously localizes task-critical visual evidence.
Training Attention (Stage 2): Teacher vs Student attention maps during distillation.
Inference Attention: Latent reasoning visualization — dynamic probes interrogate the visual feature space during inference.
We would like to thank the authors of the following projects for their excellent work:
- Qwen2.5-VL - MLLM series from Qwen family
- LVR - Latent Visual Reasoning model by Vincent Lee
- Visual-CoT - Visual CoT dataset
- InternVL - Open-source MLLM family by Shanghai AI Lab
This project is licensed under the Apache-2.0 License. See the LICENSE file for details.



