CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning

Hang Wu1      Yujun Cai†2,3      Zehao Li4      Haonan Ge1      Bowen Sun1
Junsong Yuan5      Yiwei Wang1

1University of California, Merced        2The University of Queensland        3Ant Group
4Institute of Computing Technology, Chinese Academy of Sciences       
5University at Buffalo, State University of New York

† Indicates Corresponding Author

🔥 Update

  • [2026-01-28]: 🚀 CamReasoner-7B released on Huggingface.
  • [2026-01-28]: 🚀 Code and training dataset released.

🎯 Overview

Abstract: Understanding camera dynamics is a fundamental pillar of video spatial intelligence. However, existing multimodal models predominantly treat this task as black-box classification, often confusing physically distinct motions by relying on superficial visual patterns rather than geometric cues. We present CamReasoner, a framework that reformulates camera movement understanding as a structured inference process to bridge the gap between perception and cinematic logic. Our approach centers on the Observation-Thinking-Answer (O-T-A) paradigm, which compels the model to decode spatio-temporal cues such as trajectories and view frustums within an explicit reasoning block. To instill this capability, we construct a Large-scale Inference Trajectory Suite comprising 18k SFT reasoning chains and 38k RL feedback samples. Notably, we are the first to employ RL for logical alignment in this domain, ensuring motion inferences are grounded in physical geometry rather than contextual guesswork. By applying Reinforcement Learning to the O-T-A reasoning paradigm, CamReasoner effectively suppresses hallucinations and achieves state-of-the-art performance across multiple benchmarks.

🕹️ Usage

Supervised Fine-tuning

Supervised Fine-Tuning establishes a foundational reasoning baseline by injecting structured templates and domain-specific knowledge, enabling the model to follow instructions and generate coherent initial responses.

git clone https://github.com/wuhang03/CamReasoner
cd CamReasoner

# build SFT environment
conda create -n sft python=3.11 
conda activate sft
cd LLaMA-Factory
bash setup.sh

# download data
bash download.sh

# run SFT (modify parameters according to your needs)
bash local_scripts/run_sft.sh

Our proposed SFT dataset, CamReasoning-SFT-18k, is in camerabench_sft.json.
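Before launching training, it can help to check that the downloaded dataset file loads and to see how many records it holds. The following is a minimal sketch, not part of the repository: the helper name `summarize_json_dataset` is hypothetical, and it deliberately assumes nothing about the record schema beyond the file being valid JSON. The path to camerabench_sft.json depends on where download.sh places it.

```python
import json
from pathlib import Path


def summarize_json_dataset(path):
    """Summarize a JSON dataset file without assuming its schema.

    Hypothetical helper for illustration; not provided by CamReasoner.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    summary = {
        "top_level_type": type(data).__name__,
        "num_records": None,
        "first_record_keys": None,
    }
    if isinstance(data, (list, dict)):
        # For a list this is the number of samples; for a dict, the number of keys.
        summary["num_records"] = len(data)
    if isinstance(data, list) and data and isinstance(data[0], dict):
        # Field names are read from the file itself rather than guessed.
        summary["first_record_keys"] = sorted(data[0].keys())
    return summary


# Usage (adjust the path to wherever the file was downloaded):
# print(summarize_json_dataset("camerabench_sft.json"))
```

The same helper can be pointed at camerabench_rl.json to inspect the RL data.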

Reinforcement Learning

Reinforcement Learning drives the model to self-evolve through trial and error, refining the internal logic chain and optimizing decision-making performance beyond the limitations of static training data.

git clone https://github.com/wuhang03/CamReasoner
cd CamReasoner

# build RL environment
conda create -n rl python=3.11 
conda activate rl
cd EasyR1
bash setup.sh

# download data
bash download.sh

# run RL (modify parameters according to your needs)
bash local_scripts/run_rl.sh

Our proposed RL dataset, CamReasoning-RL-38k, is in camerabench_rl.json.

For more details on installing the SFT and RL environments, please refer to LLaMA-Factory and EasyR1.

Evaluation

You can use CamReasoner-7B to run inference and reproduce the experimental results by following this section.

git clone https://github.com/wuhang03/CamReasoner
cd CamReasoner

# build evaluation environment
conda create -n eval python=3.11 
conda activate eval
cd Evaluation
bash setup.sh

# download data
python data_download.py

# run evaluation (modify parameters according to your needs)
bash eval/eval.sh

🏅 Experiments

  • Please refer to our paper for detailed experimental results.

📌 Examples

Qualitative results across four typical camera movements. For each case, we visualize the temporal frame sequence alongside the CamReasoner-7B response. The model demonstrates robust spatial reasoning by generating a detailed observation of visual cues and a logical thinking process to accurately identify the movement and provide the final answer.
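When post-processing model responses, the three O-T-A stages can be separated with a small parser. The sketch below is illustrative only. It assumes the stages are wrapped in `<observation>`, `<think>`, and `<answer>` tags, which this README does not confirm. The function names and the sample response are also hypothetical, so adjust `OTA_TAGS` to match the actual output format.

```python
import re

# Assumed tag names; the README names only the Observation-Thinking-Answer stages.
OTA_TAGS = ("observation", "think", "answer")


def parse_ota_response(text):
    """Return the first block found for each assumed tag, or None if it is absent."""
    parsed = {}
    for tag in OTA_TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        parsed[tag] = match.group(1).strip() if match else None
    return parsed


def is_well_formed(parsed):
    """True only if every stage is present and non-empty."""
    return all(parsed[tag] for tag in OTA_TAGS)


# Made-up response used purely to exercise the parser.
sample = (
    "<observation>Foreground objects grow larger across frames.</observation>\n"
    "<think>The scene scale changes while the framing stays centered.</think>\n"
    "<answer>example-label</answer>"
)
result = parse_ota_response(sample)
print(is_well_formed(result), result["answer"])
```

A response missing any stage yields `None` for that key, so `is_well_formed` can serve as a simple format check before scoring answers.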

📝 Acknowledgements

We sincerely appreciate the contributions of the open-source community. The related projects are as follows:
