	---
	license: apache-2.0

	language:
	- en
	base_model:
	- allenai/Molmo-7B-D-0924
	- Qwen/Qwen2-7B
	- google/siglip-so400m-patch14-384
	- facebook/dinov3-vitl16-pretrain-lvd1689m
	pipeline_tag: image-text-to-text
	library_name: transformers
	tags:
	- multimodal
	- charts
	- diagrams
	- pointing
	- localization
	- CoME-VL
	---



	<div align="center">
	<h1>CoME-VL: Scaling Complementary Multi-Encoder Vision-Language</h1>
	</div>
	<p align="center">
	<a href="https://github.com/mbzuai-oryx/CoME-VL">
	<img alt="GitHub" src="https://img.shields.io/badge/GitHub-CoME--VL-black?logo=github">
	</a>
	<a href="https://arxiv.org/abs/2604.03231">
	<img alt="Paper" src="https://img.shields.io/badge/arxiv-2604.03231-blue">
	</a>
	<a href="https://mbzuai-oryx.github.io/CoME-VL/">
	<img alt="Project Page" src="https://img.shields.io/badge/Project-Page-green">
	</a>
	<a href="https://huggingface.co/MBZUAI/CoME-VL">
	<img alt="HuggingFace" src="https://img.shields.io/badge/🤗%20HuggingFace-CoME--VL-yellow">
	</a>
	</p>
	<div align="center">
	<img src="assets/teaser_fig.png" alt="CoME-VL Teaser" width="800"/>
	</div>


	## Overview

	CoME-VL is a complementary multi-encoder vision-language framework that fuses contrastively trained and self-supervised visual representations to improve both visual understanding and grounding. Built on top of [Molmo](https://github.com/allenai/molmo) (Ai2), CoME-VL introduces three key architectural innovations:

	- Entropy-guided layer selection to identify and select complementary layer ranges from SigLIP2 and DINOv3
	- Orthogonality-regularized multi-layer mixing (OL) to reduce redundancy and promote complementary feature fusion
	- RoPE-enhanced cross-attention (RGCA) to spatially align heterogeneous token grids across encoders

	<div align="center">
	<img src="assets/main_arct.png" alt="CoME-VL Architecture" width="800"/>
	<p>Overview of CoME-VL: dual encoders (SigLIP2 + DINOv3) fused via orthogonality-regularized mixing and RoPE-based cross-attention, injected into a decoder-only LLM.</p>
	</div>
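
To make the idea behind orthogonality-regularized mixing concrete, here is a minimal sketch of a generic orthogonality penalty over per-layer feature vectors. This README does not specify the exact loss, so the functions `cosine` and `orthogonality_penalty` and the toy vectors are illustrative assumptions, not the CoME-VL implementation.

```python
import math

# Hypothetical helpers: a generic orthogonality penalty, not CoME-VL's actual loss.

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def orthogonality_penalty(layer_features):
    """Sum of squared cosine similarities over all pairs of distinct layers.

    The value is zero when all layer features are mutually orthogonal and
    grows as layers become redundant (point in similar directions).
    """
    penalty = 0.0
    for i in range(len(layer_features)):
        for j in range(i + 1, len(layer_features)):
            penalty += cosine(layer_features[i], layer_features[j]) ** 2
    return penalty

# Toy per-layer features (assumed values, for illustration only).
orthogonal = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
redundant = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

print(orthogonality_penalty(orthogonal))  # 0.0
print(orthogonality_penalty(redundant))   # 1.0
```

Adding a term like this to the training objective would push mixed layers toward carrying different information, which is the stated goal of reducing redundancy.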

	---

	## Installation

	Python 3.10 is recommended. First install [PyTorch](https://pytorch.org) for your platform, then:

	```bash
	git clone https://github.com/ankan8145/COME-VL.git
	cd COME-VL
	pip install -e ".[all]"
	```

	---

	## Environment Setup

	```bash
	export MOLMO_DATA_DIR=/path/to/data
	export HF_HOME=/path/to/huggingface/cache
	```
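
Before launching a job, it can help to confirm that both variables are visible to Python. The following is a small sketch of such a check; the helper name `missing_env_vars` is hypothetical and not part of the repository.

```python
import os

# Variables named in the Environment Setup section above.
REQUIRED_VARS = ("MOLMO_DATA_DIR", "HF_HOME")

def missing_env_vars(environ):
    """Return the required variables that are unset or empty in `environ`."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]

missing = missing_env_vars(os.environ)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    for name in REQUIRED_VARS:
        print(f"{name}={os.environ[name]}")
```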

	---

	## Training / Fine-tuning

	Fine-tune starting from a pretrained checkpoint:

	```bash
	HF_HUB_OFFLINE=1 \
	TRANSFORMERS_OFFLINE=1 \
	WANDB_MODE=offline \
	WANDB_API_KEY="<your_wandb_key>" \
	WANDB_PROJECT="come-vl" \
	WANDB_ENTITY="<your_entity>" \
	CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
	torchrun --standalone --nnodes=1 --nproc_per_node=8 \
	launch_scripts/train_multitask_model.py \
	3.2-synthetic \
	checkpoint_folder \
	--save_folder=output_folder \
	--save_overwrite
	```

	Notes:
	- `checkpoint_folder` should point to your starting model checkpoint directory.
	- `--save_folder` should use a short, descriptive name — avoid long paths with special characters.
	- `3.2-synthetic` specifies the training data mixture.
	- `--save_overwrite` allows overwriting an existing save folder.
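
When the number of GPUs varies between machines, the process count and `CUDA_VISIBLE_DEVICES` must stay in sync. The sketch below assembles the same single-node command from a list of GPU ids; `build_train_command` is a hypothetical helper, and the script path and arguments are copied from the command above.

```python
import shlex

def build_train_command(mixture, checkpoint, save_folder, gpu_ids):
    """Return (env, argv) mirroring the single-node torchrun call above."""
    gpu_ids = list(gpu_ids)
    env = {"CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in gpu_ids)}
    argv = [
        "torchrun", "--standalone", "--nnodes=1",
        f"--nproc_per_node={len(gpu_ids)}",  # one process per visible GPU
        "launch_scripts/train_multitask_model.py",
        mixture,
        checkpoint,
        f"--save_folder={save_folder}",
        "--save_overwrite",
    ]
    return env, argv

env, argv = build_train_command(
    "3.2-synthetic", "checkpoint_folder", "output_folder", range(4)
)
print(env["CUDA_VISIBLE_DEVICES"])  # 0,1,2,3
print(shlex.join(argv))
```

The returned `env` can be merged into `os.environ` and passed to `subprocess.run(argv, env=...)` together with the offline and W&B variables shown above.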

	---

	## Evaluation

	```bash
	torchrun --nproc_per_node=1 --master_port 29504 \
	launch_scripts/eval_downstream.py \
	checkpoint_folder \
	"test-low-res" \
	--save_to_checkpoint_dir
	```

	Notes:
	- `test-low-res` evaluates at standard resolution on the test split.
	- Use `test-high-res` for high-resolution evaluation (add `--fsdp --high_res` flags).
	- Results and predictions are saved into the checkpoint directory.
	- Add `--overwrite` to re-run and replace cached metrics.
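
The notes above amount to a small rule for choosing flags. As a sketch, assuming only the task names and flags listed in this section, a hypothetical helper `build_eval_args` could encode it like this:

```python
def build_eval_args(checkpoint, task, overwrite=False):
    """Return the arguments passed to launch_scripts/eval_downstream.py."""
    args = [checkpoint, task, "--save_to_checkpoint_dir"]
    if task == "test-high-res":
        # High-resolution evaluation needs these two extra flags.
        args += ["--fsdp", "--high_res"]
    if overwrite:
        # Re-run and replace cached metrics.
        args.append("--overwrite")
    return args

print(build_eval_args("checkpoint_folder", "test-low-res"))
print(build_eval_args("checkpoint_folder", "test-high-res", overwrite=True))
```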

	---

	## Model Architecture

	CoME-VL uses:

	- Language backbone: Qwen2-7B
	- Contrastive encoder: SigLIP2-SO400M — semantic alignment
	- Self-supervised encoder: DINOv3-Large — spatial grounding
	- Selected layers: SigLIP2 layers 0–27 (all) + DINOv3 layers 10–23 (entropy-guided)
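
The layer ranges above are inclusive. A short sketch that spells out which indices they cover (the dictionary name and keys are illustrative, not taken from the codebase):

```python
# Inclusive layer ranges from the list above, written as Python ranges.
SELECTED_LAYERS = {
    "siglip2": range(0, 28),  # layers 0-27 inclusive
    "dinov3": range(10, 24),  # layers 10-23 inclusive
}

for name, layers in SELECTED_LAYERS.items():
    print(name, len(layers), layers[0], layers[-1])
# siglip2 28 0 27
# dinov3 14 10 23
```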

	---

	## Data

	Most data is managed via HuggingFace Datasets. Training uses the [PixMo dataset](https://huggingface.co/collections/allenai/pixmo-674746ea613028006285687b) and RefCOCO.

	Download all datasets:

	```bash
	python3 scripts/download_data.py all --n_proc 12
	```

	Download a specific dataset:

	```bash
	python3 scripts/download_data.py pixmo_count_counting --n_proc 12
	```

	---

	## Pretrained Model Initialization

	Convert HuggingFace weights before training from scratch:

	```bash
	python3 scripts/convert_hf_to_molmo.py qwen2_7b
	python3 scripts/convert_hf_to_molmo.py openai
	```

	---

	## Citation

	If you find CoME-VL useful in your research, please consider citing:
	```bibtex
	@article{comevl2026,
	title={CoME-VL: Scaling Complementary Multi-Encoder Vision-Language},
	author={Deria, Ankan and Kumar, Komal and He, Xilin and Razzak, Imran and Cholakkal, Hisham and Khan, Fahad Shahbaz and Khan, Salman},
	journal={arXiv preprint arXiv:2604.03231},
	year={2026}
	}
	```

	---

	## Acknowledgements

	This codebase is built on top of [Molmo](https://github.com/allenai/molmo) by the Allen Institute for AI (Ai2). We thank the Ai2 team for open-sourcing their work.