🔄 Automatically Update VLA / VLN / VLM Papers Daily with GitHub Actions.

20bytes/vlm-arxiv-daily

 
 


🤖 VLM-Arxiv-Daily

Robotics Navigation VLM

🚀 Automatically tracks the latest arXiv papers on Vision-Language-Action (VLA), Vision-Language Navigation (VLN), and Vision-Language Models (VLM) every day.

📅 Updated on 2026.03.02
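
The PDF column in the tables below lists bare arXiv identifiers such as 2602.23205. As a rough illustration only, an identifier can be pulled out of a listing row and turned into links as sketched here. This is not the repository's own code: the function name `arxiv_links` and the assumed row layout are hypothetical, and only the public `arxiv.org/abs/` and `arxiv.org/pdf/` URL patterns are relied on.

```python
import re

# New-style arXiv identifiers: four digits (YYMM), a dot, then 4-5 digits,
# optionally followed by a version suffix such as "v2".
ARXIV_ID = re.compile(r"\b(\d{4}\.\d{4,5})(?:v\d+)?\b")


def arxiv_links(row: str) -> dict:
    """Return the abstract and PDF URLs for the first arXiv ID in a row.

    `row` is assumed to be one line of the listing, for example
    "Taku Komura Team 2602.23205 HJFY" (hypothetical input format).
    """
    match = ARXIV_ID.search(row)
    if match is None:
        raise ValueError(f"no arXiv ID found in row: {row!r}")
    paper_id = match.group(1)  # version suffix, if any, is dropped
    return {
        "id": paper_id,
        "abs": f"https://arxiv.org/abs/{paper_id}",
        "pdf": f"https://arxiv.org/pdf/{paper_id}",
    }


if __name__ == "__main__":
    print(arxiv_links("Taku Komura Team 2602.23205 HJFY")["abs"])
    # prints: https://arxiv.org/abs/2602.23205
```

The publish date at the start of a row (for example 2026-02-26) uses hyphens rather than a dot, so the pattern does not confuse it with an identifier.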

Click to view the Table of Contents
  1. VLA
  2. VLN
  3. VLM

📌 VLA

Publish Date (YYYY-MM-DD) Title Authors PDF HJFY
2026-02-26 EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents
EmbodMocap:面向具身智能体的野外四维人-场景重建
摘要
Taku Komura Team 2602.23205 HJFY
2026-02-26 Residual Koopman Spectral Profiling for Predicting and Preventing Transformer Training Instability
基于残差库普曼谱分析预测与防止Transformer训练不稳定性
摘要
Yutaka Matsuo Team 2602.22988 HJFY
2026-02-26 Automated Robotic Needle Puncture for Percutaneous Dilatational Tracheostomy
经皮扩张气管切开术的自动化机器人针穿刺系统
摘要
Andrew Weightman Team 2602.22952 HJFY
2026-02-26 DySL-VLA: Efficient Vision-Language-Action Model Inference via Dynamic-Static Layer-Skipping for Robot Manipulation
DySL-VLA:通过动态-静态层跳跃实现机器人操作中高效视觉-语言-动作模型推理
摘要
Meng Li Team 2602.22896 HJFY
2026-02-26 GraspLDP: Towards Generalizable Grasping Policy via Latent Diffusion
GraspLDP:基于潜在扩散的通用化抓取策略研究
摘要
Di Huang Team 2602.22862 HJFY
2026-02-26 ArtPro: Self-Supervised Articulated Object Reconstruction with Adaptive Integration of Mobility Proposals
ArtPro:基于自适应运动提议集成的自监督关节物体重建
摘要
Changhe Tu Team 2602.22666 HJFY
2026-02-26 Rethinking the Practicality of Vision-language-action Model: A Comprehensive Benchmark and An Improved Baseline
重新审视视觉-语言-动作模型的实用性:一个综合性基准与改进基线
摘要
Haoang Li Team 2602.22663 HJFY
2026-02-26 Metamorphic Testing of Vision-Language Action-Enabled Robots
视觉-语言-动作赋能机器人的蜕变测试
摘要
Aitor Arrieta Team 2602.22579 HJFY
2026-02-26 SignVLA: A Gloss-Free Vision-Language-Action Framework for Real-Time Sign Language-Guided Robotic Manipulation
SignVLA:一种无需注释的视觉-语言-动作框架,用于实时手语引导的机器人操作
摘要
Zezhi Tang Team 2602.22514 HJFY
2026-02-25 When to Act, Ask, or Learn: Uncertainty-Aware Policy Steering
何时执行、询问或学习:不确定性感知的策略引导
摘要
Andrea Bajcsy Team 2602.22474 HJFY
2026-02-24 NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning
NoRD:一种无需推理、数据高效驱动的视觉-语言-动作模型
摘要
Wei Zhan Team 2602.21172 HJFY
2026-02-24 ActionReasoning: Robot Action Reasoning in 3D Space with LLM for Robotic Brick Stacking
行动推理:基于大语言模型的机器人三维空间动作推理与砖块堆叠应用
摘要
Brian Sheil Team 2602.21161 HJFY
2026-02-24 HALO: A Unified Vision-Language-Action Model for Embodied Multimodal Chain-of-Thought Reasoning
HALO:面向具身多模态思维链推理的统一视觉-语言-动作模型
摘要
Song Guo Team 2602.21157 HJFY
2026-02-24 From Perception to Action: An Interactive Benchmark for Vision Reasoning
从感知到行动:视觉推理的交互式基准测试
摘要
Roy Ka-Wei Lee Team 2602.21015 HJFY
2026-02-24 Notes-to-Self: Scratchpad Augmented VLAs for Memory Dependent Manipulation Tasks
自我笔记:用于依赖记忆操作任务的便签增强型视觉语言动作模型
摘要
Roland Memisevic Team 2602.21013 HJFY
2026-02-24 Toward an Agentic Infused Software Ecosystem
迈向赋能代理的软件生态系统
摘要
Mark Marron Team 2602.20979 HJFY
2026-02-24 IG-RFT: An Interaction-Guided RL Framework for VLA Models in Long-Horizon Robotic Manipulation
IG-RFT:面向长时程机器人操作的交互引导强化学习框架,用于视觉-语言-动作模型
摘要
Huixu Dong Team 2602.20715 HJFY
2026-02-24 How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective
基础技能如何影响基于视觉语言模型的具身智能体:一个原生视角
摘要
Tong Xu Team 2602.20687 HJFY
2026-02-24 Recursive Belief Vision Language Model
递归信念视觉语言模型
摘要
Nirav Patel Team 2602.20659 HJFY
2026-02-24 Efficient and Explainable End-to-End Autonomous Driving via Masked Vision-Language-Action Diffusion
基于掩码视觉-语言-动作扩散的高效可解释端到端自动驾驶
摘要
Ziran Wang Team 2602.20577 HJFY
2026-02-19 When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs
当视觉凌驾于语言之上:评估与缓解视觉语言动作模型中的反事实失败
摘要
Mingyu Ding Team 2602.17659 HJFY
2026-02-19 What Breaks Embodied AI Security: LLM Vulnerabilities, CPS Flaws, or Something Else?
什么在破坏具身人工智能安全:大语言模型漏洞、信息物理系统缺陷,还是其他因素?
摘要
Yue Zhang Team 2602.17345 HJFY
2026-02-19 FRAPPE: Infusing World Modeling into Generalist Policies via Multiple Future Representation Alignment
FRAPPE:通过多未来表示对齐将世界建模融入通用策略
摘要
Donglin Wang Team 2602.17259 HJFY
2026-02-19 Web Verbs: Typed Abstractions for Reliable Task Composition on the Agentic Web
网络动词:面向智能网络可靠任务组合的类型化抽象
摘要
Suman Nath Team 2602.17245 HJFY
2026-02-19 Benchmarking the Effects of Object Pose Estimation and Reconstruction on Robotic Grasping Success
评估物体姿态估计与重建对机器人抓取成功率影响的基准研究
摘要
Torsten Sattler Team 2602.17101 HJFY
2026-02-18 MALLVI: a multi agent framework for integrated generalized robotics manipulation
MALLVI:一种面向集成通用机器人操作的多智能体框架
摘要
Babak Khalaj Team 2602.16898 HJFY
2026-02-18 EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data
EgoScale:利用多样化的自我中心人类数据扩展灵巧操作能力
摘要
Linxi Fan Team 2602.16710 HJFY
2026-02-19 RoboGene: Boosting VLA Pre-training via Diversity-Driven Agentic Framework for Real-World Task Generation
RoboGene:通过多样性驱动的智能体框架提升视觉语言动作预训练,实现真实世界任务生成
摘要
Jian Tang Team 2602.16444 HJFY
2026-02-17 Learning to Retrieve Navigable Candidates for Efficient Vision-and-Language Navigation
学习检索可导航候选对象以实现高效的视觉与语言导航
摘要
Lina Yao Team 2602.15724 HJFY
2026-02-17 The Next Paradigm Is User-Centric Agent, Not Platform-Centric Service
下一代范式是用户中心智能体,而非平台中心服务
摘要
Enhong Chen Team 2602.15682 HJFY
2026-02-12 Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment
扩展验证在视觉-语言-动作对齐中比扩展策略学习更有效
摘要
Marco Pavone Team 2602.12281 HJFY
2026-02-12 Embodied AI Agents for Team Collaboration in Co-located Blue-Collar Work
面向共址蓝领工作团队协作的具身人工智能体
摘要
Thomas Olsson Team 2602.12136 HJFY
2026-02-12 GigaBrain-0.5M: a VLA That Learns From World Model-Based Reinforcement Learning
GigaBrain-0.5M:一种基于世界模型强化学习训练的视觉-语言-动作模型
摘要
Zheng Zhu Team 2602.12099 HJFY
2026-02-12 VLAW: Iterative Co-Improvement of Vision-Language-Action Policy and World Model
VLAW:视觉-语言-动作策略与世界模型的迭代协同改进
摘要
Chelsea Finn Team 2602.12063 HJFY
2026-02-12 HoloBrain-0 Technical Report
HoloBrain-0技术报告
摘要
Zhizhong Su Team 2602.12062 HJFY
2026-02-12 When would Vision-Proprioception Policies Fail in Robotic Manipulation?
视觉-本体感知策略在机器人操作中何时会失效?
摘要
Di Hu Team 2602.12032 HJFY
2026-02-12 Robot-DIFT: Distilling Diffusion Features for Geometrically Consistent Visuomotor Control
Robot-DIFT:提取扩散特征以实现几何一致的视觉运动控制
摘要
Georgia Chalvatzaki Team 2602.11934 HJFY
2026-02-12 JEPA-VLA: Video Predictive Embedding is Needed for VLA Models
JEPA-VLA:视觉语言动作模型需要视频预测性嵌入
摘要
Mingsheng Long Team 2602.11832 HJFY
2026-02-12 Clutt3R-Seg: Sparse-view 3D Instance Segmentation for Language-grounded Grasping in Cluttered Scenes
Clutt3R-Seg:面向杂乱场景中语言驱动抓取的稀疏视角三维实例分割
摘要
Ayoung Kim Team 2602.11660 HJFY
2026-02-12 ViTaS: Visual Tactile Soft Fusion Contrastive Learning for Visuomotor Learning
ViTaS:面向视觉运动学习的视觉触觉软融合对比学习
摘要
Huazhe Xu Team 2602.11643 HJFY
2026-02-10 MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation
MVISTA-4D:具有测试时动作推理能力的视图一致4D世界模型,用于机器人操作
摘要
Xiangyu Yue Team 2602.09878 HJFY
2026-02-10 Code2World: A GUI World Model via Renderable Code Generation
Code2World:通过可渲染代码生成的GUI世界模型
摘要
Kevin Qinghong Lin Team 2602.09856 HJFY
2026-02-10 BagelVLA: Enhancing Long-Horizon Manipulation via Interleaved Vision-Language-Action Generation
BagelVLA:通过交错视觉-语言-动作生成增强长时程操作能力
摘要
Jianyu Chen Team 2602.09849 HJFY
2026-02-10 NavDreamer: Video Models as Zero-Shot 3D Navigators
NavDreamer:视频模型作为零样本三维导航器
摘要
Fei Gao Team 2602.09765 HJFY
2026-02-10 Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization
重新审视视觉-语言-动作模型的规模化:对齐、混合与正则化
摘要
Qin Jin Team 2602.09722 HJFY
2026-02-10 AutoFly: Vision-Language-Action Model for UAV Autonomous Navigation in the Wild
AutoFly:面向野外无人机自主导航的视觉-语言-动作模型
摘要
Hui Xiong Team 2602.09657 HJFY
2026-02-10 VideoAfford: Grounding 3D Affordance from Human-Object-Interaction Videos via Multimodal Large Language Model
VideoAfford:基于多模态大语言模型从人-物交互视频中实现三维功能可及性接地
摘要
Hui Xiong Team 2602.09638 HJFY
2026-02-10 Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures
Hand2World:基于自由空间手势的自回归第一人称交互生成
摘要
Xingang Pan Team 2602.09600 HJFY
2026-02-10 Preference Aligned Visuomotor Diffusion Policies for Deformable Object Manipulation
面向可变形物体操作的偏好对齐视觉运动扩散策略
摘要
Danica Kragic Team 2602.09583 HJFY
2026-02-10 AUHead: Realistic Emotional Talking Head Generation via Action Units Control
AUHead:基于动作单元控制的逼真情感说话头部生成
摘要
Tat-Seng Chua Team 2602.09534 HJFY
2026-02-04 Capturing Visual Environment Structure Correlates with Control Performance
捕捉视觉环境结构与控制性能的相关性
摘要
Yu-Xiong Wang Team 2602.04880 HJFY
2026-02-04 CoWTracker: Tracking by Warping instead of Correlation
CoWTracker:通过变形而非相关性进行跟踪
摘要
Andrea Vedaldi Team 2602.04877 HJFY
2026-02-04 Relational Scene Graphs for Object Grounding of Natural Language Commands
面向自然语言指令中物体定位的关系场景图
摘要
Ville Kyrki Team 2602.04635 HJFY
2026-02-04 Act, Sense, Act: Learning Non-Markovian Active Perception Strategies from Large-Scale Egocentric Human Data
行动、感知、再行动:从大规模第一人称人类数据中学习非马尔可夫主动感知策略
摘要
Wenzhao Lian Team 2602.04600 HJFY
2026-02-04 A Unified Complementarity-based Approach for Rigid-Body Manipulation and Motion Prediction
基于互补性的统一方法在刚体操作与运动预测中的应用
摘要
Riddhiman Laha Team 2602.04522 HJFY
2026-02-04 EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models
EgoActor:通过视觉语言模型将任务规划落地为具身机器人的空间感知自我中心动作
摘要
Börje F. Karlsson Team 2602.04515 HJFY
2026-02-04 Self-evolving Embodied AI
自演化的具身人工智能
摘要
Wenwu Zhu Team 2602.04411 HJFY
2026-02-04 GeneralVLA: Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning
GeneralVLA:具备知识引导轨迹规划的通用视觉-语言-动作模型
摘要
Hao Tang Team 2602.04315 HJFY
2026-02-04 Viewpoint Matters: Dynamically Optimizing Viewpoints with Masked Autoencoder for Visual Manipulation
视角至关重要:利用掩码自编码器动态优化视觉操控的视角
摘要
Wenzhao Lian Team 2602.04243 HJFY
2026-02-04 GeoLanG: Geometry-Aware Language-Guided Grasping with Unified RGB-D Multimodal Learning
GeoLanG:基于统一RGB-D多模态学习的几何感知语言引导抓取
摘要
Hongliang Ren Team 2602.04231 HJFY
2026-02-02 TIC-VLA: A Think-in-Control Vision-Language-Action Model for Robot Navigation in Dynamic Environments
TIC-VLA:一种用于动态环境中机器人导航的思控一体化视觉-语言-动作模型
摘要
Jiaqi Ma Team 2602.02459 HJFY
2026-02-02 World-Gymnast: Training Robots with Reinforcement Learning in a World Model
世界体操家:在世界模型中通过强化学习训练机器人
摘要
Sherry Yang Team 2602.02454 HJFY
2026-02-02 SoMA: A Real-to-Sim Neural Simulator for Robotic Soft-body Manipulation
SoMA:面向机器人软体操作的真实到仿真神经模拟器
摘要
Jiangmiao Pang Team 2602.02402 HJFY
2026-02-02 MAIN-VLA: Modeling Abstraction of Intention and eNvironment for Vision-Language-Action Models
MAIN-VLA:为视觉-语言-动作模型建模意图与环境的抽象
摘要
Lemiao Qiu Team 2602.02212 HJFY
2026-02-02 FD-VLA: Force-Distilled Vision-Language-Action Model for Contact-Rich Manipulation
FD-VLA:用于接触丰富操作的力蒸馏视觉-语言-动作模型
摘要
Haiyue Zhu Team 2602.02142 HJFY
2026-02-02 See2Refine: Vision-Language Feedback Improves LLM-Based eHMI Action Designers
See2Refine:视觉-语言反馈提升基于大语言模型的eHMI行为设计能力
摘要
Takeo Igarashi Team 2602.02063 HJFY
2026-02-02 Concept-Based Dictionary Learning for Inference-Time Safety in Vision Language Action Models
面向视觉语言动作模型推理时安全性的概念词典学习方法
摘要
Di Wang Team 2602.01834 HJFY
2026-02-02 From Knowing to Doing Precisely: A General Self-Correction and Termination Framework for VLA models
从精确认知到精准执行:面向视觉语言动作模型的通用自校正与终止框架
摘要
Jianzong Wang Team 2602.01811 HJFY
2026-02-02 AgenticLab: A Real-World Robot Agent Platform that Can See, Think, and Act
AgenticLab:一个能够观察、思考与行动的真实世界机器人智能体平台
摘要
Yu She Team 2602.01662 HJFY
2026-02-02 From Perception to Action: Spatial AI Agents and World Models
从感知到行动:空间人工智能代理与世界模型
摘要
Esteban Rojas Team 2602.01644 HJFY
2026-01-30 Temporally Coherent Imitation Learning via Latent Action Flow Matching for Robotic Manipulation Wu Songwei et al. 2601.23087
2026-01-30 EAG-PT: Emission-Aware Gaussians and Path Tracing for Indoor Scene Reconstruction and Editing Xijie Yang et al. 2601.23065
2026-01-30 Learning Geometrically-Grounded 3D Visual Representations for View-Generalizable Robotic Manipulation Di Zhang et al. 2601.22988
2026-01-30 Alignment among Language, Vision and Action Representations Nicola Milano et al. 2601.22948
2026-01-30 When Anomalies Depend on Context: Learning Conditional Compatibility for Anomaly Detection Shashank Mishra et al. 2601.22868
2026-01-30 Vision-Language Models Unlock Task-Centric Latent Actions Alexander Nikulin et al. 2601.22714
2026-01-30 Best-of-Q: Improving VLM agents with Q-function Action Ranking at Inference Emilien Biré et al. 2601.22701
2026-01-30 CARE: Multi-Task Pretraining for Latent Continuous Action Representation in Robot Control Jiaqi Shi et al. 2601.22467
2026-01-29 PoSafeNet: Safe Learning with Poset-Structured Neural Nets Kiwan Wong et al. 2601.22356
2026-01-29 DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation Haozhe Xie et al. 2601.22153
2026-01-29 PLANING: A Loosely Coupled Triangle-Gaussian Framework for Streaming 3D Reconstruction Changjian Jiang et al. 2601.22046
2026-01-29 PocketDP3: Efficient Pocket-Scale 3D Visuomotor Policy Jinhao Zhang et al. 2601.22018
2026-01-29 Causal World Modeling for Robot Control Lin Li et al. 2601.21998
2026-01-29 MoE-ACT: Improving Surgical Imitation Learning Policies through Supervised Mixture-of-Experts Lorenzo Mazza et al. 2601.21971
2026-01-29 Information Filtering via Variational Regularization for Robot Manipulation Jinhao Zhang et al. 2601.21926
2026-01-29 Dynamic Topology Awareness: Breaking the Granularity Rigidity in Vision-Language Navigation Jiankun Peng et al. 2601.21751
2026-01-29 CoFreeVLA: Collision-Free Dual-Arm Manipulation via Vision-Language-Action Model and Risk Estimation Xuanran Zhai et al. 2601.21712
2026-01-29 AIR-VLA: Vision-Language-Action Systems for Aerial Manipulation Jianli Sun et al. 2601.21602
2026-01-29 EmboCoach-Bench: Benchmarking AI Agents on Developing Embodied Robots Zixing Lei et al. 2601.21570

(back to top)

📌 VLN

Publish Date (YYYY-MM-DD) Title Authors PDF HJFY
2026-02-20 CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation
CapNav:基于能力条件室内导航的视觉语言模型基准测试
摘要
Jon Froehlich Team 2602.18424 HJFY
2026-02-17 Learning to Retrieve Navigable Candidates for Efficient Vision-and-Language Navigation
学习检索可导航候选对象以实现高效的视觉与语言导航
摘要
Lina Yao Team 2602.15724 HJFY
2026-02-17 One Agent to Guide Them All: Empowering MLLMs for Vision-and-Language Navigation via Explicit World Representation
一智体引领全局:通过显式世界表征赋能多模态大语言模型实现视觉与语言导航
摘要
Qi Wu Team 2602.15400 HJFY
2026-02-16 pFedNavi: Structure-Aware Personalized Federated Vision-Language Navigation for Embodied AI
pFedNavi:面向具身AI的结构感知个性化联邦视觉语言导航
摘要
Haibing Guan Team 2602.14401 HJFY
2026-02-12 ABot-N0: Technical Report on the VLA Foundation Model for Versatile Embodied Navigation
ABot-N0:面向通用具身导航的视觉-语言-动作基础模型技术报告
摘要
Mu Xu Team 2602.11598 HJFY
2026-02-10 Hydra-Nav: Object Navigation via Adaptive Dual-Process Reasoning
Hydra-Nav:基于自适应双过程推理的目标导航
摘要
Yiming Gan Team 2602.09972 HJFY
2026-02-10 AutoFly: Vision-Language-Action Model for UAV Autonomous Navigation in the Wild
AutoFly:面向野外无人机自主导航的视觉-语言-动作模型
摘要
Hui Xiong Team 2602.09657 HJFY
2026-02-09 When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning
何时想象与想象多少:基于世界模型的自适应测试时缩放用于视觉空间推理
摘要
Mohit Bansal Team 2602.08236 HJFY
2026-02-10 LCLA: Language-Conditioned Latent Alignment for Vision-Language Navigation
LCLA:面向视觉语言导航的语言条件化潜在对齐框架
摘要
Soumik Sarkar Team 2602.07629 HJFY
2026-02-06 Bridging the Indoor-Outdoor Gap: Vision-Centric Instruction-Guided Embodied Navigation for the Last Meters
弥合室内外鸿沟:面向最后几米的视觉中心化指令引导具身导航
摘要
Mu Xu Team 2602.06427 HJFY
2026-02-06 Nipping the Drift in the Bud: Retrospective Rectification for Robust Vision-Language Navigation
防微杜渐:基于回溯修正的鲁棒视觉语言导航
摘要
Weiying Xie Team 2602.06356 HJFY
2026-02-05 Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation
稀疏视频生成推动现实世界超视距视觉语言导航
摘要
Hongyang Li Team 2602.05827 HJFY
2026-02-05 Allocentric Perceiver: Disentangling Allocentric Reasoning from Egocentric Visual Priors via Frame Instantiation
他者中心感知器:通过框架实例化从他者视觉先验中解耦他者中心推理
摘要
Weiming Zhang Team 2602.05789 HJFY
2026-02-05 MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation
MerNav:一种高度可泛化的记忆-执行-回顾框架,用于零样本目标导航
摘要
Mu Xu Team 2602.05467 HJFY
2026-02-02 LangMap: A Hierarchical Benchmark for Open-Vocabulary Goal Navigation
LangMap:面向开放词汇目标导航的分层基准
摘要
Anton van den Hengel Team 2602.02220 HJFY
2026-01-31 APEX: A Decoupled Memory-based Explorer for Asynchronous Aerial Object Goal Navigation
APEX:一种用于异步空中目标导航的解耦记忆型探索器
摘要
Shuo Yang Team 2602.00551 HJFY
2026-02-03 MapDream: Task-Driven Map Learning for Vision-Language Navigation
MapDream:面向视觉语言导航的任务驱动地图学习
摘要
Zhaoxin Fan Team 2602.00222 HJFY
2026-01-29 Dynamic Topology Awareness: Breaking the Granularity Rigidity in Vision-Language Navigation
动态拓扑感知:打破视觉语言导航中的粒度僵化
摘要
Xiaoming Wang Team 2601.21751 HJFY
2026-01-26 DV-VLN: Dual Verification for Reliable LLM-Based Vision-and-Language Navigation
DV-VLN:基于大语言模型的视觉与语言导航双重验证可靠框架
摘要
Shoujun Zhou Team 2601.18492 HJFY
2026-01-26 NaVIDA: Vision-Language Navigation with Inverse Dynamics Augmentation
NaVIDA:基于逆动力学增强的视觉语言导航
摘要
Feng Zheng Team 2601.18188 HJFY
2026-01-22 AION: Aerial Indoor Object-Goal Navigation Using Dual-Policy Reinforcement Learning
AION:基于双策略强化学习的空中室内目标导航系统
摘要
Lin Zhao Team 2601.15614 HJFY
2026-01-23 FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation
FantasyVLN:面向视觉语言导航的统一多模态思维链推理框架
摘要
Yonggang Qi Team 2601.13976 HJFY
2026-01-19 Spatial-VLN: Zero-Shot Vision-and-Language Navigation With Explicit Spatial Perception and Exploration
Spatial-VLN:具备显式空间感知与探索能力的零样本视觉语言导航
摘要
Feitian Zhang Team 2601.12766 HJFY
2026-01-14 Towards Open Environments and Instructions: General Vision-Language Navigation via Fast-Slow Interactive Reasoning
迈向开放环境与指令:基于快慢交互推理的通用视觉语言导航
摘要
Yahong Han Team 2601.09111 HJFY
2026-01-11 Residual Cross-Modal Fusion Networks for Audio-Visual Navigation Yi Wang et al. 2601.08868
2026-01-13 VLingNav: Embodied Navigation with Adaptive Reasoning and Visual-Assisted Linguistic Memory Shaoan Wang et al. 2601.08665
2026-01-12 GROKE: Vision-Free Navigation Instruction Evaluation via Graph Reasoning on OpenStreetMap Farzad Shami et al. 2601.07375

(back to top)

📌 VLM

Publish Date (YYYY-MM-DD) Title Authors PDF HJFY
2026-02-26 MediX-R1: Open Ended Medical Reinforcement Learning
MediX-R1:开放式医学强化学习框架
摘要
Hisham Cholakkal Team 2602.23363 HJFY
2026-02-26 Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning
规模无法克服语用学:报告偏差对视觉-语言推理的影响
摘要
Ranjay Krishna Team 2602.23351 HJFY
2026-02-26 Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?
检索与分割:少量示例足以弥合开放词汇分割中的监督鸿沟吗?
摘要
Giorgos Tolias Team 2602.23339 HJFY
2026-02-26 CXReasonAgent: Evidence-Grounded Diagnostic Reasoning Agent for Chest X-rays
CXReasonAgent:基于证据的胸部X光诊断推理智能体
摘要
Edward Choi Team 2602.23276 HJFY
2026-02-26 Large Multimodal Models as General In-Context Classifiers
大型多模态模型作为通用上下文内分类器
摘要
Elisa Ricci Team 2602.23229 HJFY
2026-02-26 MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction
MovieTeller:基于工具增强的电影剧情摘要与ID一致渐进式抽象
摘要
Gaoang Wang Team 2602.23228 HJFY
2026-02-26 Efficient Encoder-Free Fourier-based 3D Large Multimodal Model
高效无编码器的基于傅里叶变换的3D大型多模态模型
摘要
Fabio Poiesi Team 2602.23153 HJFY
2026-02-26 Cytoarchitecture in Words: Weakly Supervised Vision-Language Modeling for Human Brain Microscopy
以言构形:弱监督视觉-语言建模用于人脑显微成像
摘要
Christian Schiffer Team 2602.23088 HJFY
2026-02-26 SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling
SubspaceAD:基于子空间建模的无训练少样本异常检测方法
摘要
Egor Bondarev Team 2602.23013 HJFY
2026-02-26 FactGuard: Agentic Video Misinformation Detection via Reinforcement Learning
FactGuard:基于强化学习的智能体视频虚假信息检测
摘要
Zhaoqi Wang Team 2602.22963 HJFY
2026-02-24 Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning
Spa3R:面向三维视觉推理的预测性空间场建模
摘要
Xinggang Wang Team 2602.21186 HJFY
2026-02-24 Seeing Through Words: Controlling Visual Retrieval Quality with Language Models
透过文字看见:利用语言模型控制视觉检索质量
摘要
Yun Fu Team 2602.21175 HJFY
2026-02-24 LUMEN: Longitudinal Multi-Modal Radiology Model for Prognosis and Diagnosis
LUMEN:用于预后与诊断的纵向多模态放射学模型
摘要
Marius George Linguraru Team 2602.21142 HJFY
2026-02-24 VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation
VAUQ:面向LVLM自评估的视觉感知不确定性量化
摘要
Sharon Li Team 2602.21054 HJFY
2026-02-24 OCR-Agent: Agentic OCR with Capability and Memory Reflection
OCR-Agent:具备能力与记忆反思的智能OCR代理
摘要
Ying Cai Team 2602.21053 HJFY
2026-02-24 Not Just What's There: Enabling CLIP to Comprehend Negated Visual Descriptions Without Fine-tuning
不止于所见:无需微调,让CLIP理解否定的视觉描述
摘要
Zejiang He Team 2602.21035 HJFY
2026-02-24 From Perception to Action: An Interactive Benchmark for Vision Reasoning
从感知到行动:视觉推理的交互式基准测试
摘要
Roy Ka-Wei Lee Team 2602.21015 HJFY
2026-02-24 CrystaL: Spontaneous Emergence of Visual Latents in MLLMs
CrystaL:多模态大语言模型中视觉潜在特征的自发涌现
摘要
Xiang Li Team 2602.20980 HJFY
2026-02-24 Are Multimodal Large Language Models Good Annotators for Image Tagging?
多模态大语言模型是图像标注的优秀注释者吗?
摘要
Masashi Sugiyama Team 2602.20972 HJFY
2026-02-24 LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding
LongVideo-R1:面向低成本长视频理解的智能导航方法
摘要
Qixiang Ye Team 2602.20913 HJFY
2026-02-19 Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting
通过细粒度细节定位推动黑盒大视觉语言模型攻击前沿
摘要
Zhiqiang Shen Team 2602.17645 HJFY
2026-02-19 Catastrophic Forgetting Resilient One-Shot Incremental Federated Learning
抗灾难性遗忘的单次增量联邦学习
摘要
Monowar Bhuyan Team 2602.17625 HJFY
2026-02-19 AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games
AI游戏商店:通过人类游戏实现机器通用智能的可扩展、开放式评估
摘要
Joshua B. Tenenbaum Team 2602.17594 HJFY
2026-02-19 RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward
RetouchIQ:基于指令的图像修饰多模态大语言模型智能体与通用奖励机制
摘要
Handong Zhao Team 2602.17558 HJFY
2026-02-19 GraphThinker: Reinforcing Video Reasoning with Event Graph Thinking
GraphThinker:通过事件图思维强化视频推理
摘要
Shaogang Gong Team 2602.17555 HJFY
2026-02-19 LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs
LATA:面向医学视觉语言模型置信度校准的拉普拉斯辅助转导自适应方法
摘要
Zongyuan Ge Team 2602.17535 HJFY
2026-02-19 QuPAINT: Physics-Aware Instruction Tuning Approach to Quantum Material Discovery
QuPAINT:面向量子材料发现的物理感知指令调优方法
摘要
Khoa Luu Team 2602.17478 HJFY
2026-02-19 EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models
EAGLE:面向多模态大语言模型免调优工业异常检测的专家增强注意力引导方法
摘要
Seon Han Choi Team 2602.17419 HJFY
2026-02-19 EntropyPrune: Matrix Entropy Guided Visual Token Pruning for Multimodal Large Language Models
EntropyPrune:基于矩阵熵引导的多模态大语言模型视觉令牌剪枝
摘要
Lianghua He Team 2602.17196 HJFY
2026-02-19 Selective Training for Large Vision Language Models via Visual Information Gain
基于视觉信息增益的大型视觉语言模型选择性训练
摘要
Sangheum Hwang Team 2602.17186 HJFY
2026-02-12 Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment
扩展验证在视觉-语言-动作对齐中比扩展策略学习更有效
摘要
Marco Pavone Team 2602.12281 HJFY
2026-02-12 ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images
ExStrucTiny:面向文档图像中模式可变结构化信息提取的基准数据集
摘要
Manuela Veloso Team 2602.12203 HJFY
2026-02-12 Visual Reasoning Benchmark: Evaluating Multimodal LLMs on Classroom-Authentic Visual Problems from Primary Education
视觉推理基准:评估多模态大语言模型在基础教育课堂真实视觉问题上的表现
摘要
Oliver G. B. Garrod Team 2602.12196 HJFY
2026-02-12 3DGSNav: Enhancing Vision-Language Model Reasoning for Object Navigation via Active 3D Gaussian Splatting
3DGSNav:通过主动3D高斯泼溅增强视觉语言模型在物体导航中的推理能力
摘要
Xinyi Yu Team 2602.12159 HJFY
2026-02-12 DeepSight: An All-in-One LM Safety Toolkit
DeepSight:一体化大型模型安全工具箱
摘要
Xia Hu Team 2602.12092 HJFY
2026-02-12 Affordance-Graphed Task Worlds: Self-Evolving Task Generation for Scalable Embodied Learning
可供性图化任务世界:面向可扩展具身学习的自演化任务生成
摘要
Changshui Zhang Team 2602.12065 HJFY
2026-02-12 Can Local Vision-Language Models improve Activity Recognition over Vision Transformers? -- Case Study on Newborn Resuscitation
本地视觉语言模型能否超越视觉Transformer提升活动识别能力?——以新生儿复苏为例的研究
摘要
Øyvind Meinich-Bache Team 2602.12002 HJFY
2026-02-12 Spatial Chain-of-Thought: Bridging Understanding and Generation Models for Spatial Reasoning Generation
空间思维链:连接理解与生成模型以实现空间推理生成
摘要
Long Chen Team 2602.11980 HJFY
2026-02-12 Benchmarking Vision-Language Models for French PDF-to-Markdown Conversion
评估视觉语言模型在法语PDF转Markdown任务中的性能基准
摘要
Nicolas Mery Team 2602.11960 HJFY
2026-02-12 Are Two LLMs Better Than One? A Student-Teacher Dual-Head LLMs Architecture for Pharmaceutical Content Optimization
双LLM是否优于单一模型?一种用于医药内容优化的师生双头LLM架构
摘要
Anubhav Girdhar Team 2602.11957 HJFY
2026-02-10 Reason-IAD: Knowledge-Guided Dynamic Latent Reasoning for Explainable Industrial Anomaly Detection
Reason-IAD:面向可解释工业异常检测的知识引导动态潜在推理框架
摘要
Xiaochun Cao Team 2602.09850 HJFY
2026-02-10 Kelix Technique Report
Kelix技术报告
摘要
Ziqi Wang Team 2602.09843 HJFY
2026-02-10 SAKED: Mitigating Hallucination in Large Vision-Language Models via Stability-Aware Knowledge Enhanced Decoding
SAKED:通过稳定性感知的知识增强解码缓解大型视觉语言模型中的幻觉问题
摘要
Xudong Jiang Team 2602.09825 HJFY
2026-02-10 GenSeg-R1: RL-Driven Vision-Language Grounding for Fine-Grained Referring Segmentation
GenSeg-R1:基于强化学习的视觉语言细粒度指代分割
摘要
Uma Mahesh Team 2602.09701 HJFY
2026-02-10 VideoAfford: Grounding 3D Affordance from Human-Object-Interaction Videos via Multimodal Large Language Model
VideoAfford:基于多模态大语言模型从人-物交互视频中实现三维功能可及性接地
摘要
Hui Xiong Team 2602.09638 HJFY
2026-02-10 AGMark: Attention-Guided Dynamic Watermarking for Large Vision-Language Models
AGMark:面向大型视觉语言模型的注意力引导动态水印技术
摘要
Linlin Wang Team 2602.09611 HJFY
2026-02-10 Tele-Omni: a Unified Multimodal Framework for Video Generation and Editing
Tele-Omni:面向视频生成与编辑的统一多模态框架
摘要
Xuelong Li Team 2602.09609 HJFY
2026-02-10 Delving into Spectral Clustering with Vision-Language Representations
探索基于视觉-语言表征的光谱聚类方法
摘要
Zhen Fang Team 2602.09586 HJFY
2026-02-10 Scalpel: Fine-Grained Alignment of Attention Activation Manifolds via Mixture Gaussian Bridges to Mitigate Multimodal Hallucination
手术刀:通过混合高斯桥精细对齐注意力激活流形以缓解多模态幻觉
摘要
Koichi Shirahata Team 2602.09541 HJFY
2026-02-10 DR.Experts: Differential Refinement of Distortion-Aware Experts for Blind Image Quality Assessment
DR.Experts:面向盲图像质量评估的失真感知专家差分细化方法
摘要
Runze Hu Team 2602.09531 HJFY
2026-02-04 When LLaVA Meets Objects: Token Composition for Vision-Language-Models
当LLaVA遇见物体:视觉语言模型的令牌组合
摘要
Hilde Kuehne Team 2602.04864 HJFY
2026-02-04 El Agente Estructural: An Artificially Intelligent Molecular Editor
结构智能体:一种人工智能分子编辑器
摘要
Varinia Bernales Team 2602.04849 HJFY
2026-02-04 VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?
VISTA-Bench:视觉语言模型真的能像理解纯文本一样理解图像中的文本吗?
摘要
Huchuan Lu Team 2602.04802 HJFY
2026-02-04 Alignment Drift in Multimodal LLMs: A Two-Phase, Longitudinal Evaluation of Harm Across Eight Model Releases
多模态大语言模型中的对齐漂移:对八个模型版本有害性的两阶段纵向评估
摘要
Emily Dix Team 2602.04739 HJFY
2026-02-04 SAR-RAG: ATR Visual Question Answering by Semantic Search, Retrieval, and MLLM Generation
SAR-RAG:通过语义搜索、检索与多模态大语言模型生成的自动目标识别视觉问答
摘要
Andreas Spanias Team 2602.04712 HJFY
2026-02-04 Annotation Free Spacecraft Detection and Segmentation using Vision Language Models
基于视觉语言模型的无标注航天器检测与分割
摘要
Djamila Aouada Team 2602.04699 HJFY
2026-02-04 AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation
AGILE:基于智能体生成从视频重建手-物交互
摘要
Chunhua Shen Team 2602.04672 HJFY
2026-02-04 PIO-FVLM: Rethinking Training-Free Visual Token Reduction for VLM Acceleration from an Inference-Objective Perspective
PIO-FVLM:从推理目标视角重新审视用于VLM加速的无训练视觉令牌缩减
摘要
Chunhua Shen Team 2602.04657 HJFY
2026-02-04 Relational Scene Graphs for Object Grounding of Natural Language Commands
面向自然语言指令中物体定位的关系场景图
摘要
Ville Kyrki Team 2602.04635 HJFY
2026-02-04 LEAD: Layer-wise Expert-aligned Decoding for Faithful Radiology Report Generation
LEAD:面向忠实放射学报告生成的层级专家对齐解码
摘要
Yan Song Team 2602.04617 HJFY
2026-02-02 Avenir-Web: Human-Experience-Imitating Multimodal Web Agents with Mixture of Grounding Experts
Avenir-Web:基于混合定位专家的人类经验模仿式多模态网络代理
摘要
Mengdi Wang Team 2602.02468 HJFY
2026-02-02 MentisOculi: Revealing the Limits of Reasoning with Mental Imagery
MentisOculi:揭示心智意象推理的局限性
摘要
Wieland Brendel Team 2602.02465 HJFY
2026-02-02 Relationship-Aware Hierarchical 3D Scene Graph for Task Reasoning
面向任务推理的关系感知分层三维场景图
摘要
Kostas Alexis Team 2602.02456 HJFY
2026-02-02 World-Gymnast: Training Robots with Reinforcement Learning in a World Model
世界体操家:在世界模型中通过强化学习训练机器人
摘要
Sherry Yang Team 2602.02454 HJFY
2026-02-02 ReasonEdit: Editing Vision-Language Models using Human Reasoning
ReasonEdit:基于人类推理的视觉语言模型编辑
摘要
Thomas Hartvigsen Team 2602.02408 HJFY
2026-02-02 LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization
LongVPO:从锚定线索到自我推理的长视频偏好优化
摘要
Limin Wang Team 2602.02341 HJFY
2026-02-02 Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models
Vision-DeepResearch基准:重新思考多模态大语言模型的视觉与文本搜索能力
摘要
Shaosheng Cao Team 2602.02185 HJFY
2026-02-02 See2Refine: Vision-Language Feedback Improves LLM-Based eHMI Action Designers
See2Refine:视觉-语言反馈提升基于大语言模型的eHMI行为设计能力
摘要
Takeo Igarashi Team 2602.02063 HJFY
2026-02-02 Auto-Comp: An Automated Pipeline for Scalable Compositional Probing of Contrastive Vision-Language Models
Auto-Comp:面向对比式视觉语言模型可扩展组合性探测的自动化流程
摘要
Toshihiko Yamasaki Team 2602.02043 HJFY
2026-02-02 One Size, Many Fits: Aligning Diverse Group-Wise Click Preferences in Large-Scale Advertising Image Generation
一图多配:在大规模广告图像生成中协调多样化的群体点击偏好
摘要
Jian Liang Team 2602.02033 HJFY
2026-01-30 User Prompting Strategies and Prompt Enhancement Methods for Open-Set Object Detection in XR Environments Junfeng Lin et al. 2601.23281
2026-01-30 Training-Free Test-Time Adaptation with Brownian Distance Covariance in Vision-Language Models Yi Zhang et al. 2601.23253
2026-01-30 Structured Over Scale: Learning Spatial Reasoning from Educational Video Bishoy Galoaa et al. 2601.23251
2026-01-30 Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning Xiangyu Zeng et al. 2601.23224
2026-01-30 Med-Scout: Curing MLLMs' Geometric Blindness in Medical Perception via Geometry-Aware RL Post-Training Anglin Liu et al. 2601.23220
2026-01-30 Make Anything Match Your Target: Universal Adversarial Perturbations against Closed-Source MLLMs via Multi-Crop Routed Meta Optimization Hui Lu et al. 2601.23179
2026-01-30 Hearing is Believing? Evaluating and Analyzing Audio Language Model Sycophancy with SYAUDIO Junchi Yao et al. 2601.23149
2026-01-30 One-shot Optimized Steering Vector for Hallucination Mitigation for VLMs Youxu Shi et al. 2601.23041
2026-01-30 Triage: Hierarchical Visual Budgeting for Efficient Video Reasoning in Vision-Language Models Anmin Wang et al. 2601.22959
2026-01-30 Alignment among Language, Vision and Action Representations Nicola Milano et al. 2601.22948
2026-01-29 UEval: A Benchmark for Unified Multimodal Generation Bo Li et al. 2601.22155
2026-01-29 Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions Xiaoxiao Sun et al. 2601.22150
2026-01-29 SINA: A Circuit Schematic Image-to-Netlist Generator Using Artificial Intelligence Saoud Aldowaish et al. 2601.22114
2026-01-29 VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning Yibo Wang et al. 2601.22069
2026-01-29 Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models Wenxuan Huang et al. 2601.22060
2026-01-29 MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources Baorui Ma et al. 2601.22054
2026-01-29 Visual-Guided Key-Token Regularization for Multimodal Large Language Model Unlearning Chengyi Cai et al. 2601.22020
2026-01-29 Causal World Modeling for Robot Control Lin Li et al. 2601.21998
2026-01-29 Clarity: The Flexibility-Interpretability Trade-Off in Sparsity-aware Concept Bottleneck Models Konstantinos P. Panousis et al. 2601.21944
2026-01-29 VideoAesBench: Benchmarking the Video Aesthetics Perception Capabilities of Large Multimodal Models Yunhao Li et al. 2601.21915

(back to top)
