---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
This repository hosts a CoVT checkpoint, as presented in the paper Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens.
- Project Page: https://wakalsprojectpage.github.io/comt-website
- Code: https://github.com/Wakals/CoMT

## Overview of CoVT
Rather than restricting VLM reasoning to a discrete language space with limited representational capacity, CoVT forms a visual thought chain that enables VLMs to reason in continuous visual space. By introducing continuous visual tokens that encode perceptual cues (e.g., segmentation, depth, instance, and edge structure), CoVT composes chains of textual and visual thoughts that link semantic reasoning with perceptual grounding. These visual “thought chains” bridge language and vision, enabling fine-grained understanding, spatial precision, and geometric awareness beyond the reach of text-based reasoning.
## CoVT Checkpoint (Depth Aligned)

This CoVT checkpoint is aligned with 4 depth tokens. These task-specific tokens are integrated into the model's embedding space to enhance depth awareness.
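To illustrate how task-specific tokens can be integrated into a model's embedding space, here is a minimal PyTorch sketch. The vocabulary size, hidden dimension, token names, and mean-initialization heuristic are all illustrative assumptions, not CoVT's actual configuration or training procedure:

```python
import torch
import torch.nn as nn

# Illustrative sizes only; real VLM embeddings are far larger.
vocab_size, hidden = 1000, 64
num_depth_tokens = 4  # matches the 4 depth tokens this checkpoint uses

# Pretend this is the VLM's existing token embedding table.
embed = nn.Embedding(vocab_size, hidden)

# Extend the table with rows for hypothetical <DEPTH_0>..<DEPTH_3> tokens,
# initialized from the mean of existing embeddings (a common heuristic;
# CoVT's actual initialization may differ).
extended = nn.Embedding(vocab_size + num_depth_tokens, hidden)
with torch.no_grad():
    extended.weight[:vocab_size] = embed.weight
    extended.weight[vocab_size:] = embed.weight.mean(dim=0, keepdim=True)

# The new token ids sit at the end of the enlarged vocabulary.
depth_token_ids = torch.arange(vocab_size, vocab_size + num_depth_tokens)
depth_embeddings = extended(depth_token_ids)
print(depth_embeddings.shape)  # torch.Size([4, 64])
```

In practice this kind of extension is typically done via `model.resize_token_embeddings` after adding special tokens to the tokenizer; the sketch above only shows the underlying embedding-table mechanics.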