Planning with the Views

This repository contains a model checkpoint presented in the paper Planning with the Views.

Project Page | GitHub | Paper

Overview

Can VLMs predict how each camera move changes the view, and plan many such moves ahead? This capability, called view planning, requires (1) understanding how a single action transforms the view, and (2) composing many such transformations across multi-turn plans to identify a target view.

ViewSuite is a 3D point-cloud environment and benchmark suite for view planning, built on real ScanNet indoor scenes. It probes view planning through three diagnostic tasks:

  • Path-to-View (P2V): Predict the resulting view from an action sequence.
  • View-to-Path (V2P): Infer the action sequence between two views.
  • Interactive View Planning (IVP): Plan view changes over multiple turns to identify a target view.

This model is an optimized version of Qwen2.5-VL-7B, trained using an iterative framework that alternates self-exploration with view graph distillation. This approach significantly closes the planning gap found in frontier VLMs, improving performance on interactive view planning tasks.

Citation

If you find ViewAgent or these checkpoints useful in your research, please consider citing:

@article{wang2026viewagent,
  title   = {Planning with the Views},
  author  = {Wang, Kangrui and Li, Linjie and Yang, Zhengyuan and Chen, Shiqi and
             Wang, Zihan and Fei-Fei, Li and Wu, Jiajun and Guibas, Leonidas and
             Wang, Lijuan and Li, Manling},
  year    = {2026}
}
Downloads last month
30
Safetensors
Model size
8B params
Tensor type
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for MLL-Lab/viewagent-all-qwen25vl7b

Finetuned
(1126)
this model

Dataset used to train MLL-Lab/viewagent-all-qwen25vl7b

Collection including MLL-Lab/viewagent-all-qwen25vl7b

Paper for MLL-Lab/viewagent-all-qwen25vl7b