---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens

This repository hosts a CoVT checkpoint, as presented in the paper *Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens*.

- **Project Page:** https://wakalsprojectpage.github.io/comt-website
- **Code:** https://github.com/Wakals/CoMT

## Overview of CoVT

Rather than restricting VLM reasoning to a discrete language space with limited representational capacity, CoVT forms a visual thought chain that enables VLMs to reason in continuous visual space. By introducing continuous visual tokens that encode perceptual cues (e.g., segmentation, depth, instance, and edge structure), CoVT composes chains of textual and visual thoughts that link semantic reasoning with perceptual grounding. These visual “thought chains” bridge language and vision, enabling fine-grained understanding, spatial precision, and geometric awareness beyond the reach of text-based reasoning.

## CoVT Checkpoint (Depth-Aligned)

This CoVT checkpoint is aligned with 4 depth tokens. These task-specific tokens are integrated into the model's embedding space to enhance depth awareness.
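To make the mechanism concrete, here is a minimal, hypothetical sketch (not the official CoVT code) of what "integrating 4 task-specific depth tokens into the embedding space" can look like in PyTorch: the language model's embedding table is extended with 4 new learnable rows reserved for the depth tokens. Toy dimensions are used; the real model's vocabulary size and hidden size differ.

```python
import torch
import torch.nn as nn

# Toy dimensions for illustration only; the actual checkpoint uses the
# underlying VLM's vocabulary and hidden sizes.
vocab_size, hidden_dim, num_depth_tokens = 1000, 64, 4

# Original (pretrained) token embedding table.
embed = nn.Embedding(vocab_size, hidden_dim)

# 4 new learnable embeddings for the depth tokens, small-scale initialized.
depth_tokens = torch.randn(num_depth_tokens, hidden_dim) * 0.02

# Extended table: original vocabulary followed by the depth tokens, which
# occupy ids vocab_size .. vocab_size + num_depth_tokens - 1.
extended_weight = torch.cat([embed.weight.detach(), depth_tokens], dim=0)
extended = nn.Embedding.from_pretrained(extended_weight, freeze=False)

depth_ids = torch.arange(vocab_size, vocab_size + num_depth_tokens)
print(extended(depth_ids).shape)  # torch.Size([4, 64])
```

During fine-tuning, only rows like these (and whatever the training recipe specifies) need to learn to encode the continuous depth cues that the visual thought chain conditions on.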