---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---
# Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
This repository hosts a CoVT checkpoint, as presented in the paper [Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens](https://huggingface.co/papers/2511.19418).
- **Project Page**: https://wakalsprojectpage.github.io/comt-website
- **Code**: https://github.com/Wakals/CoMT
## Overview of CoVT
Rather than restricting VLM reasoning to a discrete language space with limited representational capacity, **CoVT** forms a visual thought chain that enables VLMs to reason in continuous visual space. By introducing *continuous visual tokens* that encode perceptual cues (e.g., segmentation, depth, instance, and edge structure), CoVT composes *chains of textual and visual thoughts* that link semantic reasoning with perceptual grounding. These visual “thought chains” bridge language and vision, enabling fine-grained understanding, spatial precision, and geometric awareness beyond the reach of text-based reasoning.
## CoVT Checkpoint (Depth Aligned)
This CoVT checkpoint is aligned with **4 depth tokens**: task-specific tokens integrated into the model's embedding space to enhance depth awareness.
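The idea of integrating a small set of task-specific tokens into a model's embedding space can be illustrated with a minimal, self-contained sketch. Note that the sizes, initialization, and variable names below are toy assumptions for illustration, not the actual CoVT implementation (where the depth tokens are trained to encode depth cues):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (a real 7B VLM would use e.g. ~32k vocab, ~4096 hidden dim).
vocab_size, hidden_dim = 1000, 64
num_depth_tokens = 4  # CoVT-depth adds 4 depth tokens

# Existing (toy) token-embedding table of the base VLM.
embeddings = rng.normal(0.0, 0.02, size=(vocab_size, hidden_dim))

# New depth-token embeddings, here randomly initialized like the rest of
# the table; in CoVT these would be trained to carry continuous depth cues.
depth_embeddings = rng.normal(0.0, 0.02, size=(num_depth_tokens, hidden_dim))

# Integrate: the vocabulary grows by 4, and the new ids at the end of the
# table are reserved for the depth tokens.
extended = np.concatenate([embeddings, depth_embeddings], axis=0)
depth_token_ids = list(range(vocab_size, vocab_size + num_depth_tokens))

print(extended.shape)      # (1004, 64)
print(depth_token_ids)     # [1000, 1001, 1002, 1003]
```

With `transformers`, the analogous step is typically done by adding special tokens to the tokenizer and resizing the model's embedding matrix to match.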