---
base_model:
- facebook/dinov2-large
license: apache-2.0
pipeline_tag: image-feature-extraction
library_name: slimm
---

# Model Card for CoMP-DINOv2-Large

<!-- Provide a quick summary of what the model is/does. -->
This is a vision foundation model (VFM) that supports <b>native image resolution inputs</b>, continually pre-trained from [DINOv2](https://huggingface.co/facebook/dinov2-large).

## Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/SliMM-X/CoMP-MM
- **Paper:** https://arxiv.org/abs/2503.18931
- **Project Page:** https://slimm-x.github.io/comp

## How to Get Started with the Model

Install the [GitHub repository](https://github.com/SliMM-X/CoMP-MM), then use the code below to extract image features with the model.

```python
from io import BytesIO

import requests
import torch
from PIL import Image
from slimm.model.processor import SliMMQwen2VLProcessor
from slimm.model.vision_encoder import CoMPDinov2Model

model_path = "SliMM-X/CoMP-DINOv2-Large"

# Load the vision encoder in bfloat16 on the GPU.
model = CoMPDinov2Model.from_pretrained(
    model_path, torch_dtype="auto", device_map="cuda", w_merger=False
).to(torch.bfloat16)

processor = SliMMQwen2VLProcessor.from_pretrained(model_path)

# PIL cannot open a URL directly, so download the image bytes first.
image_url = "https://slimm-x.github.io/comp/figs/teaser.png"
response = requests.get(image_url, timeout=30)
response.raise_for_status()
image_input = Image.open(BytesIO(response.content)).convert("RGB")

inputs = processor(
    images=image_input,
    return_tensors="pt",
)

inputs = inputs.to("cuda")
# Inference only, so gradient tracking is not needed.
with torch.no_grad():
    output_feat = model(inputs.pixel_values.to(torch.bfloat16), inputs.image_grid_thw)
print(output_feat)
```
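
The `image_grid_thw` tensor passed to the model is expected, following the Qwen2-VL processor convention, to describe each image's patch grid as (temporal, height, width). To illustrate how native-resolution input affects the number of patches, the sketch below estimates that grid from an image size. It assumes a patch size of 14 (the DINOv2 ViT-L/14 default) and simple rounding of each side to the nearest multiple of the patch size. The helper names `estimate_patch_grid` and `num_patches` are hypothetical, and the actual resizing rules in `SliMMQwen2VLProcessor` may differ, for example by enforcing pixel budgets.

```python
from typing import Tuple


def estimate_patch_grid(width: int, height: int, patch_size: int = 14) -> Tuple[int, int, int]:
    """Estimate a (t, h, w) patch grid for a single still image.

    Assumption: each side is rounded to the nearest multiple of
    `patch_size`, with at least one patch per side. The real processor
    may apply additional constraints.
    """
    grid_h = max(1, round(height / patch_size))
    grid_w = max(1, round(width / patch_size))
    return 1, grid_h, grid_w  # t = 1 for a still image


def num_patches(grid_thw: Tuple[int, int, int]) -> int:
    """Number of patches implied by a (t, h, w) grid."""
    t, h, w = grid_thw
    return t * h * w


grid = estimate_patch_grid(width=448, height=224)
print(grid, num_patches(grid))  # (1, 16, 32) 512
```

Under these assumptions, the patch count grows with the image area rather than being fixed by a single square input size, which is the practical consequence of feeding images at their native resolution.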

## Citation

**BibTeX:**

```bibtex
@article{comp2025,
  title={CoMP: Continual Multimodal Pre-training for Vision Foundation Models},
  author={Chen, Yitong and Meng, Lingchen and Peng, Wujian and Wu, Zuxuan and Jiang, Yu-Gang},
  year={2025},
  journal={arXiv preprint arXiv:2503.18931},
}
```