---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---
# Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens

This repository hosts a CoVT checkpoint, as presented in the paper [Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens](https://huggingface.co/papers/2511.19418).
**Project Page**: https://wakalsprojectpage.github.io/comt-website

**Code**: https://github.com/Wakals/CoMT
## Overview of CoVT

Rather than restricting VLM reasoning to a discrete language space with limited representational capacity, **CoVT** forms a visual thought chain that enables VLMs to reason in continuous visual space. By introducing *continuous visual tokens* that encode perceptual cues (e.g., segmentation, depth, instance, and edge structure), CoVT composes *chains of textual and visual thoughts* that link semantic reasoning with perceptual grounding. These visual "thought chains" bridge language and vision, enabling fine-grained understanding, spatial precision, and geometric awareness beyond the reach of text-based reasoning.
## CoVT Checkpoint (Depth Aligned)

This CoVT checkpoint is aligned with **4 depth tokens**. These task-specific tokens are integrated into the model's embedding space to improve depth awareness.
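To make the idea concrete, the sketch below illustrates how continuous visual tokens can be appended to a sequence of text embeddings to form a mixed text–visual thought chain. This is a minimal, hypothetical illustration with made-up dimensions (NumPy, random values), not the actual CoVT implementation; only the count of 4 depth tokens comes from the checkpoint description above.

```python
import numpy as np

# Illustrative sketch only: dimensions and initialization are assumptions,
# not the real CoVT architecture.
batch, seq_len, hidden_dim = 2, 10, 64
num_depth_tokens = 4  # this checkpoint uses 4 depth tokens

# Continuous visual (depth) tokens living in the model's embedding space.
depth_tokens = np.random.randn(num_depth_tokens, hidden_dim)

# Text-token embeddings for a batch of prompts.
text_embeds = np.random.randn(batch, seq_len, hidden_dim)

# Form the thought chain: text embeddings followed by continuous depth tokens,
# so later transformer layers can attend over both modalities jointly.
chain = np.concatenate(
    [text_embeds, np.broadcast_to(depth_tokens, (batch, num_depth_tokens, hidden_dim))],
    axis=1,
)
print(chain.shape)  # (2, 14, 64)
```

In the real model, the depth tokens are learned parameters rather than random values, and the concatenation happens inside the VLM's forward pass; the point here is only the shape of the resulting mixed sequence.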