---
base_model:
- Qwen/Qwen2-VL-7B-Instruct
language:
- en
- zh
license: other
license_name: license-term-of-versavit
metrics:
- accuracy
library_name: transformers
---

<p align="center">
<img src="assets/versavit_logo.png" width="480"/>
</p>

<p align="center">
<a href="https://huggingface.co/tencent/VersaViT">
<img src="https://img.shields.io/badge/%F0%9F%A4%97_HuggingFace-Model-ffbd45.svg" alt="HuggingFace">
</a>
<a href="https://arxiv.org/pdf/2602.09934">
<img src="https://img.shields.io/badge/Paper-VersaViT-d4333f?logo=arxiv&logoColor=white&colorA=cccccc&colorB=d4333f&style=flat" alt="Paper">
</a>
</p>

## 🌟 Model Overview

**VersaViT** is a vision transformer designed to serve as a general-purpose visual encoder for multimodal systems. It is refined with a **multi-task collaborative post-training** recipe, making it **well suited both to language-mediated reasoning** (e.g., vision–language understanding when paired with an LLM) **and to pixel-level understanding** (e.g., segmentation and depth probing).

## Quick Start

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor
from models.versavit import VersaViTPretrainedModel

model_path = 'tencent/VersaViT'
processor = AutoImageProcessor.from_pretrained(model_path)
model = VersaViTPretrainedModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map='cuda',
)

# Preprocess an image and extract visual features through the merger head.
image = Image.open("./assets/versavit_logo.png")
inputs = processor(images=image, return_tensors="pt").to('cuda')
outputs = model.forward_wt_merger(inputs['pixel_values'], inputs['image_grid_thw'])
```
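
The extracted patch features can be pooled into a single image embedding for retrieval-style comparisons. Below is a minimal, model-free sketch of that step: the random tensor stands in for the encoder's output, and the `(1, 256, 1024)` shape is an illustrative assumption (the actual number of tokens depends on the image grid):

```python
import torch
import torch.nn.functional as F

# Stand-in for the patch features returned by the encoder
# (assumed shape: 1 image, 256 patch tokens, 1024-dim features).
feats = torch.randn(1, 256, 1024)

# Mean-pool over the patch tokens, then L2-normalize so that
# cosine similarity between two images reduces to a dot product.
embedding = F.normalize(feats.mean(dim=1), dim=-1)
print(embedding.shape)  # torch.Size([1, 1024])
```
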
## Citation

If you use this model in your research or project, please cite:

```bibtex
@article{liu2026versavit,
  title={VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization},
  author={Liu, Yikun and Liu, Yuan and Di, Shangzhe and Wang, Haicheng and Zhao, Zhongyin and Tian, Le and Zhou, Xiao and Zhou, Jie and Yao, Jiangchao and Wang, Yanfeng and others},
  journal={arXiv preprint arXiv:2602.09934},
  year={2026}
}
```