---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2-7B
- google/siglip-so400m-patch14-384
- facebook/dinov3-vitl16-pretrain-lvd1689m
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- multimodal
- charts
- diagrams
- pointing
- localization
- CoME-VL
---

<div align="center">
<h1>CoME-VL: Scaling Complementary Multi-Encoder Vision-Language</h1>
</div>
<p align="center">
<a href="https://github.com/mbzuai-oryx/CoME-VL">
<img alt="GitHub" src="https://img.shields.io/badge/GitHub-CoME--VL-black?logo=github">
</a>
<a href="https://arxiv.org/abs/XXXX.XXXXX">
<img alt="Paper" src="https://img.shields.io/badge/arxiv-XXXX.XXXXX-blue">
</a>
<a href="https://mbzuai-oryx.github.io/CoME-VL/">
<img alt="Project Page" src="https://img.shields.io/badge/Project-Page-green">
</a>
<a href="https://huggingface.co/MBZUAI/CoME-VL">
<img alt="HuggingFace" src="https://img.shields.io/badge/🤗%20HuggingFace-CoME--VL-yellow">
</a>
</p>
<div align="center">
<img src="assets/teaser_fig.png" alt="CoME-VL Teaser" width="800"/>
</div>

## Overview

**CoME-VL** is a complementary multi-encoder vision-language framework that fuses contrastively trained and self-supervised visual representations to improve both visual understanding and grounding. Built on top of [Molmo](https://github.com/allenai/molmo) (Ai2), CoME-VL introduces three key architectural innovations:

- **Entropy-guided layer selection** to select complementary layer ranges from SigLIP2 and DINOv3
- **Orthogonality-regularized multi-layer mixing (OL)** to reduce redundancy and promote complementary feature fusion
- **RoPE-enhanced cross-attention (RGCA)** to spatially align heterogeneous token grids across encoders
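
The orthogonality regularizer can be pictured with a toy example. The sketch below is illustrative only (the actual loss operates on encoder token features inside the fusion module, and all function names here are hypothetical): it penalizes the mean squared cosine similarity between feature vectors drawn from the two encoder branches, which is zero exactly when the branches are pairwise orthogonal.

```python
import math

def _normalize(v):
    """Scale a vector to unit length (all-zero vectors pass through)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else v

def orthogonality_penalty(feats_a, feats_b):
    """Mean squared cosine similarity over every cross-branch pair of
    feature vectors: 0.0 for mutually orthogonal branches, 1.0 when
    every pair points in the same direction."""
    a = [_normalize(v) for v in feats_a]
    b = [_normalize(v) for v in feats_b]
    total = 0.0
    for u in a:
        for w in b:
            dot = sum(x * y for x, y in zip(u, w))
            total += dot * dot
    return total / (len(a) * len(b))
```

Minimizing a quantity of this shape pushes the two branches toward complementary, non-redundant subspaces, which is the intuition behind the OL component.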

<div align="center">
<img src="assets/main_arct.png" alt="CoME-VL Architecture" width="800"/>
<p>Overview of CoME-VL: dual encoders (SigLIP2 + DINOv3) fused via orthogonality-regularized mixing and RoPE-based cross-attention, injected into a decoder-only LLM.</p>
</div>
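
The RoPE-based alignment in RGCA depends on placing tokens from differently sized encoder grids into one shared coordinate frame before computing rotary phases. A minimal sketch under that assumption (hypothetical helper names; the real model applies the rotation per attention head across feature pairs):

```python
import math

def grid_coords(h, w):
    """Normalized (row, col) centers of an h x w token grid: grids of
    different resolutions land in the same [0, 1)^2 frame."""
    return [((i + 0.5) / h, (j + 0.5) / w)
            for i in range(h) for j in range(w)]

def rope_rotate(x, y, pos, freq=1.0):
    """Rotate one 2-D feature pair by an angle proportional to pos:
    the core operation of rotary position embeddings."""
    ang = pos * freq
    c, s = math.cos(ang), math.sin(ang)
    return (x * c - y * s, x * s + y * c)
```

Because tokens from two grids that cover the same image region receive nearly the same normalized coordinates, their rotary phases match, and cross-attention between the encoders stays position-consistent.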

---

## Installation

Python 3.10 is recommended. First install [PyTorch](https://pytorch.org) for your platform, then:

```bash
git clone https://github.com/ankan8145/COME-VL.git
cd COME-VL
pip install -e ".[all]"
```

(The quotes around `".[all]"` keep shells such as zsh from globbing the extras specifier.)

---

## Environment Setup

Set the following environment variables before training or evaluation:

```bash
export MOLMO_DATA_DIR=/path/to/data
export HF_HOME=/path/to/huggingface/cache
```

---

## Training / Fine-tuning

Fine-tune starting from a pretrained checkpoint:

```bash
HF_HUB_OFFLINE=1 \
TRANSFORMERS_OFFLINE=1 \
WANDB_MODE=offline \
WANDB_API_KEY="<your_wandb_key>" \
WANDB_PROJECT="come-vl" \
WANDB_ENTITY="<your_entity>" \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
torchrun --standalone --nnodes=1 --nproc_per_node=8 \
  launch_scripts/train_multitask_model.py \
  3.2-synthetic \
  checkpoint_folder \
  --save_folder=output_folder \
  --save_overwrite
```

**Notes:**
- `checkpoint_folder` should point to your starting model checkpoint directory.
- `--save_folder` should use a short, descriptive name; avoid long paths with special characters.
- `3.2-synthetic` specifies the training data mixture.
- `--save_overwrite` allows overwriting an existing save folder.

---

## Evaluation

```bash
torchrun --nproc-per-node 1 --master_port 29504 \
  launch_scripts/eval_downstream.py \
  checkpoint_folder \
  "test-low-res" \
  --save_to_checkpoint_dir
```

**Notes:**
- `test-low-res` evaluates at standard resolution on the test split.
- Use `test-high-res` for high-resolution evaluation (add the `--fsdp --high_res` flags).
- Results and predictions are saved into the checkpoint directory.
- Add `--overwrite` to re-run and replace cached metrics.

---

## Model Architecture

CoME-VL uses:

- **Language backbone:** Qwen2-7B
- **Contrastive encoder:** SigLIP2-SO400M for semantic alignment
- **Self-supervised encoder:** DINOv3-Large for spatial grounding
- **Selected layers:** SigLIP2 layers 0–27 (all) + DINOv3 layers 10–23 (entropy-guided)
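
The entropy-guided layer selection can likewise be sketched in a few lines. This card does not specify the exact activation statistic that is scored, so the following is a generic illustration with hypothetical names: each layer contributes a normalized distribution, and the k highest-entropy layers are kept.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_layers(layer_dists, k):
    """Return the indices (ascending) of the k layers whose
    distributions carry the most entropy."""
    ranked = sorted(range(len(layer_dists)),
                    key=lambda i: shannon_entropy(layer_dists[i]),
                    reverse=True)
    return sorted(ranked[:k])
```

Under a criterion of this shape, sharply peaked (low-entropy) layers are dropped and broadly informative ones are retained, yielding contiguous high-entropy ranges such as the DINOv3 layers 10–23 listed above.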

---

## Data

Most data is managed via HuggingFace Datasets. Training uses the [PixMo dataset](https://huggingface.co/collections/allenai/pixmo-674746ea613028006285687b) and RefCOCO.

Download all datasets:

```bash
python3 scripts/download_data.py all --n_proc 12
```

Download a specific dataset:

```bash
python3 scripts/download_data.py ChartQa --n_proc 12
```

---

## Pretrained Model Initialization

Convert HuggingFace weights before training from scratch:

```bash
python3 scripts/convert_hf_to_molmo.py qwen2_7b
python3 scripts/convert_hf_to_molmo.py openai
```

---

## Citation

If you find CoME-VL useful in your research, please consider citing:

```bibtex
@article{comevl2026,
  title={CoME-VL: Scaling Complementary Multi-Encoder Vision-Language},
  author={Deria, Ankan and Kumar, Komal and He, Xilin and Razzak, Imran and Cholakkal, Hisham and Khan, Fahad Shahbaz and Khan, Salman},
  journal={arXiv preprint},
  year={2026}
}
```

---

## Acknowledgements

This codebase is built on top of **[Molmo](https://github.com/allenai/molmo)** by the Allen Institute for AI (Ai2). We thank the Ai2 team for open-sourcing their work.