---
base_model:
- Qwen/Qwen2-VL-7B-Instruct
language:
- en
- zh
license: other
license_name: license-term-of-versavit
metrics:
- accuracy
library_name: transformers
---
## Model Overview

**VersaViT** is a vision transformer tuned to serve as a capable, general-purpose visual encoder for multimodal systems. It is refined with a **multi-task collaborative post-training** recipe. VersaViT is **well suited both to language-mediated reasoning** (e.g., vision-language understanding when paired with an LLM) **and to pixel-level understanding** (e.g., segmentation and depth probing).

## Quick Start

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor

from models.versavit import VersaViTPretrainedModel

model_path = 'tencent/VersaViT'

# Load the image processor and the model (bfloat16 + FlashAttention 2 on GPU).
processor = AutoImageProcessor.from_pretrained(model_path)
model = VersaViTPretrainedModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map='cuda',
)

# Preprocess an image and run it through the encoder with the merger head.
image = Image.open("./assets/versavit_logo.png")
inputs = processor(images=image, return_tensors="pt").to('cuda')
outputs = model.forward_wt_merger(inputs['pixel_values'], inputs['image_grid_thw'])
```

## Citation

If you use this model in your research or project, please cite:

```bibtex
@article{liu2026versavit,
  title={VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization},
  author={Liu, Yikun and Liu, Yuan and Di, Shangzhe and Wang, Haicheng and Zhao, Zhongyin and Tian, Le and Zhou, Xiao and Zhou, Jie and Yao, Jiangchao and Wang, Yanfeng and others},
  journal={arXiv preprint arXiv:2602.09934},
  year={2026}
}
```
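A common way to turn encoder outputs like those above into a single image embedding (e.g., for retrieval or similarity search) is to mean-pool the token features and L2-normalize. The sketch below is a hypothetical post-processing step, not part of the VersaViT API: `feats` stands in for the Quick Start's `outputs`, and its shape `(num_tokens, hidden_dim)` and the hidden size of 3584 are assumptions, with random data used so the snippet runs standalone.

```python
import torch

# Stand-in for VersaViT token features (assumed shape: [num_tokens, hidden_dim]).
# Real code would use the `outputs` tensor from the Quick Start instead.
feats = torch.randn(64, 3584, dtype=torch.float32)

# Mean-pool over tokens to get one vector per image, then L2-normalize
# so dot products between embeddings equal cosine similarities.
embedding = feats.mean(dim=0)
embedding = embedding / embedding.norm()

print(tuple(embedding.shape))  # (3584,)
```

After normalization, `embedding @ other_embedding` directly gives a cosine-similarity score between two images.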