---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---
# Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
## Model Description
This CoVT checkpoint is built on LLaVA-v1.5-13B and aligned with 4 depth tokens.
These task-specific tokens are integrated into the model’s embedding space to enhance 3D awareness.
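Below is a minimal usage sketch, assuming the checkpoint exposes the standard LLaVA-1.5 interface in `transformers`. The repo id, prompt format, and processor class here are illustrative assumptions; please follow the GitHub repository for the exact loading and inference code.

```python
# Minimal usage sketch (not the official loading code): assumes the checkpoint follows the
# standard LLaVA-1.5 interface in transformers. The repo id below is a placeholder.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "CoVT/covt-llava-v1.5-13b-depth"  # hypothetical repo id; replace with the real one
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")
prompt = "USER: <image>\nHow far is the chair from the camera? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

# The depth tokens live in the model's embedding space and are used internally during
# generation; inference is invoked the same way as for the base LLaVA model.
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```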
## Overview
Vision-Language Models (VLMs) excel at reasoning in linguistic space but struggle with tasks that require dense visual perception, e.g., spatial reasoning and geometric awareness. This limitation stems from the fact that current VLMs lack effective mechanisms to capture dense visual information across spatial dimensions.
Chain-of-Visual-Thought (CoVT) is a framework that enables VLMs to reason not only in words but also through continuous visual tokens — compact latent representations that encode rich perceptual cues. Within a small budget of roughly 20 tokens, CoVT distills knowledge from lightweight vision experts, capturing complementary properties such as 2D appearance, 3D geometry, spatial layout, and edge structure. During training, the VLM with CoVT autoregressively predicts these visual tokens to reconstruct dense supervision signals (e.g., depth, segmentation, edges, and DINO features). At inference, the model reasons directly in the continuous visual token space, preserving efficiency while optionally decoding dense predictions for interpretability. Evaluated across more than ten diverse perception benchmarks, integrating CoVT into strong VLMs such as Qwen2.5-VL and LLaVA consistently improves performance.
These visual “thought chains” bridge language and vision, enabling fine-grained understanding, spatial precision, and geometric awareness beyond the reach of text-based reasoning.
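To make the training signal above concrete, the following toy sketch shows how a handful of continuous visual tokens can be decoded into a dense map and supervised by a frozen vision expert. This is not the authors' implementation: the module names, shapes, and the L1 reconstruction loss are illustrative assumptions.

```python
# Conceptual sketch of distilling a depth expert into continuous visual tokens.
# All names, shapes, and the loss choice are assumptions for illustration only.
import torch
import torch.nn as nn

class DepthTokenHead(nn.Module):
    """Decodes a small set of continuous visual tokens into a coarse depth map."""
    def __init__(self, hidden_size: int = 5120, num_tokens: int = 4, map_size: int = 24):
        super().__init__()
        self.map_size = map_size
        self.decoder = nn.Sequential(
            nn.Linear(hidden_size * num_tokens, 1024),
            nn.GELU(),
            nn.Linear(1024, map_size * map_size),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, hidden_size) hidden states produced
        # autoregressively by the VLM at the positions of the depth tokens.
        b = visual_tokens.shape[0]
        depth = self.decoder(visual_tokens.flatten(1))
        return depth.view(b, 1, self.map_size, self.map_size)

def distillation_loss(pred_depth: torch.Tensor, expert_depth: torch.Tensor) -> torch.Tensor:
    """L1 reconstruction against the frozen depth expert's output, resized to match."""
    target = nn.functional.interpolate(expert_depth, size=pred_depth.shape[-2:], mode="bilinear")
    return nn.functional.l1_loss(pred_depth, target)

# Example with random tensors standing in for VLM hidden states and expert output.
head = DepthTokenHead()
tokens = torch.randn(2, 4, 5120)     # 4 depth tokens per image, LLaVA-13B hidden size
expert = torch.rand(2, 1, 384, 384)  # dense depth map from a lightweight expert
loss = distillation_loss(head(tokens), expert)
loss.backward()
```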
For more details on evaluation, the Gradio demo, and training CoVT, please refer to the GitHub repository.
## Citation
If you use this work in your research, please cite:
@article{qin2025chainofvisualthought,
  title={Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens},
  author={Qin, Yiming and Wei, Bomin and Ge, Jiaxin and Kallidromitis, Konstantinos and Fu, Stephanie and Darrell, Trevor and Wang, Xudong},
  journal={arXiv preprint arXiv:2511.19418},
  year={2025},
  eprint={2511.19418},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.19418},
}