Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint-aware reasoning. Recent efforts aim to augment the input representations with geometric cues rather than explicitly teaching models to reason in 3D space. We introduce Loc3R-VLM, a framework that equips 2D Vision–Language Models with advanced 3D understanding capabilities from monocular video input.
Inspired by human spatial cognition, Loc3R-VLM relies on two joint objectives: global layout reconstruction to build a holistic representation of the scene structure, and explicit situation modeling to anchor the egocentric perspective. These objectives provide direct spatial supervision that grounds both perception and language in a 3D context. To ensure geometric consistency and metric-scale alignment, we leverage lightweight camera pose priors extracted from a pre-trained 3D foundation model.
Loc3R-VLM achieves state-of-the-art performance in language-based localization and outperforms existing 2D- and video-based approaches on situated and general 3D question-answering benchmarks, demonstrating that our spatial supervision framework enables strong 3D understanding.
Our framework takes a monocular video as input and augments the vision token sequence with per-frame latent camera pose priors extracted from the 3D foundation model CUT3R. The model is jointly trained using two spatial objectives: (1) layout reconstruction, which grounds vision patch tokens in a bird's-eye-view (BEV) space to capture global scene structure, and (2) situation modeling, which utilizes dedicated localization query tokens to localize an agent from a situation description. During answer generation, the model leverages the inferred layout and location to perform viewpoint-aware 3D reasoning.
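The token-augmentation step above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the `Frame` container, `build_token_sequence` helper, and the choice to prepend one pose token per frame are all assumptions for exposition; in practice the tokens would be tensors and the pose priors latent embeddings from CUT3R.

```python
# Sketch of augmenting a vision token sequence with per-frame pose priors.
# All names and shapes here are illustrative assumptions, not the paper's API.
from dataclasses import dataclass
from typing import List

@dataclass
class Frame:
    patch_tokens: List[List[float]]  # [num_patches, dim] vision patch embeddings
    pose_prior: List[float]          # [dim] latent camera pose cue (e.g. from CUT3R)

def build_token_sequence(frames: List[Frame]) -> List[List[float]]:
    """Interleave each frame's latent pose prior with its vision patch tokens,
    so the language model sees a camera-pose cue before every frame's patches."""
    seq: List[List[float]] = []
    for frame in frames:
        seq.append(frame.pose_prior)    # per-frame pose token first
        seq.extend(frame.patch_tokens)  # then that frame's patch tokens
    return seq
```

Prepending the pose cue per frame (rather than once per video) keeps the metric-scale, viewpoint information locally attached to the patches it describes.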
The spatial supervision framework introduces complementary training signals to equip the model with enhanced 3D understanding. For the layout reconstruction objective, the model learns to ground each vision patch token onto its corresponding BEV coordinate in a cognitive map to capture global scene structure. For the situation modeling objective, dedicated localization query tokens explicitly model the agent's position and orientation. The framework is trained end-to-end using a joint objective of layout, localization, and language losses.
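The joint objective can be sketched as a weighted sum of the three losses. The concrete loss forms below (mean L2 error for BEV coordinates, position error plus wrapped angular error for localization) and the unit weights are assumptions for illustration; the paper's exact loss definitions and weighting may differ.

```python
# Sketch of the joint training objective: layout + localization + language.
# Loss forms and weights are illustrative assumptions, not the paper's definitions.
import math

def layout_loss(pred_bev, gt_bev):
    """Mean L2 distance between predicted and ground-truth BEV coordinates
    of the vision patch tokens (the layout reconstruction signal)."""
    return sum(math.dist(p, g) for p, g in zip(pred_bev, gt_bev)) / len(pred_bev)

def localization_loss(pred_pos, gt_pos, pred_yaw, gt_yaw):
    """Agent position error plus a wrapped angular error for orientation."""
    ang = abs((pred_yaw - gt_yaw + math.pi) % (2 * math.pi) - math.pi)
    return math.dist(pred_pos, gt_pos) + ang

def joint_loss(l_layout, l_loc, l_lang, w_layout=1.0, w_loc=1.0, w_lang=1.0):
    """End-to-end objective: weighted sum of layout, localization,
    and language-modeling losses."""
    return w_layout * l_layout + w_loc * l_loc + w_lang * l_lang
```

Wrapping the yaw difference into [-pi, pi] avoids penalizing equivalent orientations (e.g. 0 and 2*pi) differently.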
We show example predictions of Loc3R-VLM for language-based localization and situated 3D question answering on SQA3D. Our model accurately grounds the described situations (blue: prediction, green: ground truth) and provides the correct viewpoint-dependent answer. Meshes are shown for visualization only and are not used by the model.
We benchmark language-based localization on SQA3D, where the model must predict an agent's position and orientation from a natural-language situation description. Loc3R-VLM achieves state-of-the-art accuracy.
We evaluate Loc3R-VLM on established 3D situated question-answering and general 3D question-answering benchmarks spanning diverse scene types and question categories. Our model consistently outperforms prior video-based approaches.
@misc{qu2026loc3rvlm,
title={Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models},
author={Kevin Qu and Haozhe Qi and Mihai Dusmanu and Mahdi Rad and Rui Wang and Marc Pollefeys},
year={2026},
eprint={2603.18002},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.18002}
}