Improve model card: add metadata and links
#1
by nielsr (HF Staff) - opened

README.md CHANGED
@@ -1,8 +1,21 @@
 ---
 license: apache-2.0
+pipeline_tag: image-text-to-text
+library_name: transformers
 ---
-# CoVT Checkpoint (Depth Aligned)
 
-
-
+# Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
+
+This repository hosts a CoVT checkpoint, as presented in the paper [Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens](https://huggingface.co/papers/2511.19418).
+
+**Project Page**: https://wakalsprojectpage.github.io/comt-website
+**Code**: https://github.com/Wakals/CoMT
+
+## Overview of CoVT
+
+Rather than restricting VLM reasoning to a discrete language space with limited representational capacity, **CoVT** forms a visual thought chain that enables VLMs to reason in continuous visual space. By introducing *continuous visual tokens* that encode perceptual cues (e.g., segmentation, depth, instance, and edge structure), CoVT composes *chains of textual and visual thoughts* that link semantic reasoning with perceptual grounding. These visual “thought chains” bridge language and vision, enabling fine-grained understanding, spatial precision, and geometric awareness beyond the reach of text-based reasoning.
+
+## CoVT Checkpoint (Depth Aligned)
+
+This CoVT checkpoint is aligned with **4 Depth tokens**.
 These task-specific tokens are integrated into the model’s embedding space to enhance depth-awareness.
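As a loose illustration of what "task-specific tokens integrated into the model's embedding space" can mean, the sketch below appends four hypothetical `<DEPTH_i>` entries to a toy vocabulary and grows a toy embedding matrix to match. The token names, sizes, and initialization here are assumptions for illustration only, not details of the actual CoVT checkpoint:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a VLM tokenizer vocabulary and its input embedding matrix
# (hypothetical names/sizes, not the real CoVT configuration).
vocab = {"<s>": 0, "</s>": 1, "depth": 2}
hidden_size = 8
embeddings = rng.normal(0.0, 0.02, size=(len(vocab), hidden_size))

# Append 4 depth tokens and matching embedding rows, initialized like the
# existing rows; alignment training would then learn these new rows.
depth_tokens = [f"<DEPTH_{i}>" for i in range(4)]
for tok in depth_tokens:
    vocab[tok] = len(vocab)
new_rows = rng.normal(0.0, 0.02, size=(len(depth_tokens), hidden_size))
embeddings = np.vstack([embeddings, new_rows])

print(len(vocab), embeddings.shape)  # 7 tokens total, a (7, 8) matrix
```

With the Transformers library, the analogous steps are `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`.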