---
base_model:
- Qwen/Qwen2-VL-7B-Instruct
language:
- en
- zh
license: other
license_name: license-term-of-versavit
metrics:
- accuracy
library_name: transformers
---
<p align="center">
    <img src="assets/versavit_logo.png" width="480"/>
</p>
<p align="center">
<a href="https://huggingface.co/tencent/VersaViT">
<img src="https://img.shields.io/badge/%F0%9F%A4%97_HuggingFace-Model-ffbd45.svg" alt="HuggingFace">
</a>
<a href="https://arxiv.org/pdf/2602.09934">
<img src="https://img.shields.io/badge/Paper-VersaViT-d4333f?logo=arxiv&logoColor=white&colorA=cccccc&colorB=d4333f&style=flat" alt="Paper">
</a>
</p>
## 🌟 Model Overview
**VersaViT** is a vision transformer refined with a **multi-task collaborative post-training** recipe to serve as a capable, general-purpose visual encoder for multimodal systems. It is well suited **both to language-mediated reasoning** (e.g., vision–language understanding when paired with an LLM) **and to pixel-level understanding** (e.g., segmentation and depth probing).
## Quick Start
```python
import torch
from PIL import Image
from transformers import AutoImageProcessor
from models.versavit import VersaViTPretrainedModel

model_path = 'tencent/VersaViT'

# Load the image processor and the model (bf16, FlashAttention-2, on GPU).
processor = AutoImageProcessor.from_pretrained(model_path)
model = VersaViTPretrainedModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map='cuda',
)

# Preprocess an image and run the encoder with its feature merger.
image = Image.open("./assets/versavit_logo.png")
inputs = processor(images=image, return_tensors="pt").to('cuda')
outputs = model.forward_wt_merger(inputs['pixel_values'], inputs['image_grid_thw'])
```
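The encoder's patch features can be pooled into a single image embedding, for instance by mean pooling followed by L2 normalization. A minimal sketch, using a random dummy tensor in place of the real `outputs` (the actual shape and hidden size depend on the checkpoint and the image grid, so both are assumptions here):

```python
import torch
import torch.nn.functional as F

# Dummy stand-in for the encoder output: (num_patches, hidden_dim).
# With real weights, use the `outputs` tensor from the Quick Start instead;
# the 256 x 1280 shape below is illustrative only.
features = torch.randn(256, 1280)

# Mean-pool over patches, then L2-normalize to get one unit-norm embedding.
embedding = F.normalize(features.mean(dim=0), dim=-1)

print(embedding.shape)          # torch.Size([1280])
print(embedding.norm().item())  # ~1.0
```

Unit-norm embeddings like this can be compared directly with a dot product (cosine similarity), which is a common way to use a frozen visual encoder for retrieval-style tasks.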
## Citation
If you use this model in your research or projects, please cite:
```bibtex
@article{liu2026versavit,
title={VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization},
author={Liu, Yikun and Liu, Yuan and Di, Shangzhe and Wang, Haicheng and Zhao, Zhongyin and Tian, Le and Zhou, Xiao and Zhou, Jie and Yao, Jiangchao and Wang, Yanfeng and others},
journal={arXiv preprint arXiv:2602.09934},
year={2026}
}
```