nielsr (HF Staff) committed
Commit 862faeb · verified · 1 Parent(s): 4dd407d

Improve model card: Add metadata, links, overview, and citation


This PR enhances the model card by adding key metadata and comprehensive information:

- Adds `pipeline_tag: image-text-to-text` to correctly categorize the model for multimodal tasks.
- Adds `library_name: transformers`, since the model architectures (`llava_llama` and `AnchorLlava`) and the `transformers_version` field in `config.json` indicate compatibility with the `transformers` library, enabling the "How to use" widget.
- Includes a direct link to the paper: [Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens](https://huggingface.co/papers/2511.19418).
- Provides links to the official project page (https://wakalsprojectpage.github.io/comt-website) and the GitHub repository (https://github.com/Wakals/CoMT) for easy access to more resources.
- Expands the "Model Description" with a detailed overview of CoVT's methodology and benefits, derived from the paper's abstract and the GitHub README.
- Embeds relevant demo images from the GitHub repository to visually illustrate the model's capabilities.
- Adds a BibTeX citation for the paper.

Please review and merge if these improvements align with your expectations.
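As context for the `library_name: transformers` addition: that tag implies the checkpoint should be loadable through the standard `transformers` LLaVA API. A minimal sketch under stated assumptions — the repo id below is a placeholder (not the actual checkpoint name), the checkpoint is assumed to follow the stock LLaVA-v1.5 layout, and the custom `AnchorLlava` architecture may additionally require `trust_remote_code=True`:

```python
# Hedged sketch -- the repo id is a PLACEHOLDER, not the real checkpoint name,
# and the custom CoVT architectures may need trust_remote_code=True.

def build_prompt(question: str) -> str:
    """LLaVA-v1.5 conversation template with the <image> placeholder token."""
    return f"USER: <image>\n{question} ASSISTANT:"

def run_inference(image, question: str, repo_id: str = "your-org/covt-llava-v1.5-13b"):
    # Deferred import so the prompt helper works without transformers installed.
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    processor = AutoProcessor.from_pretrained(repo_id)
    model = LlavaForConditionalGeneration.from_pretrained(repo_id, device_map="auto")
    inputs = processor(
        images=image, text=build_prompt(question), return_tensors="pt"
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=128)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

This is the generation path the "How to use" widget assumes; if the repo ships a custom processor or anchor-token logic, the GitHub repository's inference scripts take precedence.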

Files changed (1)
  1. README.md +38 -1
README.md CHANGED
@@ -1,8 +1,45 @@
  ---
  license: apache-2.0
+ pipeline_tag: image-text-to-text
+ library_name: transformers
  ---
- # CoVT Checkpoint (Segmentation, Depth, and DINO Aligned)
+
+ # Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
+
+ [![arXiv](https://img.shields.io/badge/arxiv-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https://huggingface.co/papers/2511.19418)
+ [![Project Page](https://img.shields.io/badge/Project_Page-00CED1?style=for-the-badge&logoColor=white&logo=data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgMCAyNCAyNCIgZmlsbD0id2hpdGUiPjxwYXRoIGQ9Ik0xMCAyMHYtNmg0djZoNXYtOGgzTDEyIDMgMiAxMmgzdjh6Ii8+PC9zdmc+)](https://wakalsprojectpage.github.io/comt-website)
+ [![GitHub Code](https://img.shields.io/badge/Github-Code-keygen.svg?logo=github&style=flat-square)](https://github.com/Wakals/CoMT)

  ## Model Description
  This CoVT checkpoint is aligned with **4 Depth tokens**, based on LLaVA-v1.5-13B.
  These task-specific tokens are integrated into the model’s embedding space to enhance 3D-awareness.
+
+ ## Overview
+ Vision-Language Models (VLMs) excel at reasoning in linguistic space but struggle with tasks that require dense visual perception, *e.g.*, spatial reasoning and geometric awareness. This limitation stems largely from current VLMs lacking mechanisms to capture dense visual information across spatial dimensions.
+
+ **Chain-of-Visual-Thought (CoVT)** is a framework that enables VLMs to reason not only in words but also through **continuous visual tokens** — compact latent representations that encode rich perceptual cues. Within a small budget of roughly **20 tokens**, CoVT distills knowledge from lightweight vision experts, capturing complementary properties such as **2D appearance, 3D geometry, spatial layout, and edge structure**. During training, the VLM with CoVT autoregressively predicts these visual tokens to reconstruct dense supervision signals (*e.g.*, depth, segmentation, edges, and DINO features). At inference, the model reasons directly in the continuous visual token space, preserving efficiency while optionally decoding dense predictions for interpretability. Across more than **ten diverse perception benchmarks**, integrating CoVT into strong VLMs such as Qwen2.5-VL and LLaVA consistently improves performance.
+
+ These visual “thought chains” bridge language and vision, enabling fine-grained understanding, spatial precision, and geometric awareness beyond the reach of text-based reasoning.
+
+ <div align="center">
+ <img src="https://github.com/Wakals/CoMT/raw/main/assets/DEMO.jpg" alt="CoVT Demo" style="width: 100%; margin: 10px 0;">
+ <img src="https://github.com/Wakals/CoMT/raw/main/assets/edit_demo.jpg" alt="CoVT Edit Demo" style="width: 100%; margin: 10px 0;">
+ </div>
+
+ For more details on evaluation, the Gradio demo, and training CoVT, please refer to the [GitHub repository](https://github.com/Wakals/CoMT).
+
+ ## Citation
+ If you use this work in your research, please cite:
+
+ ```bibtex
+ @article{qin2025chainofvisualthought,
+   title={Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens},
+   author={Qin, Yiming and Wei, Bomin and Ge, Jiaxin and Kallidromitis, Konstantinos and Fu, Stephanie and Darrell, Trevor and Wang, Xudong},
+   journal={arXiv preprint arXiv:2511.19418},
+   year={2025},
+   eprint={2511.19418},
+   archivePrefix={arXiv},
+   primaryClass={cs.CV},
+   url={https://arxiv.org/abs/2511.19418},
+ }
+ ```