---
base_model:
- Qwen/Qwen2-VL-7B-Instruct
language:
- en
- zh
license: other
license_name: license-term-of-versavit
metrics:
- accuracy
library_name: transformers
---
<p align="center">
    <img src="assets/versavit_logo.png" width="480"/>
</p>
<p align="center">
<a href="https://huggingface.co/tencent/VersaViT">
<img src="https://img.shields.io/badge/%F0%9F%A4%97_HuggingFace-Model-ffbd45.svg" alt="HuggingFace">
</a>
<a href="https://arxiv.org/pdf/2602.09934">
<img src="https://img.shields.io/badge/Paper-VersaViT-d4333f?logo=arxiv&logoColor=white&colorA=cccccc&colorB=d4333f&style=flat" alt="Paper">
</a>
</p>
## 🌟 Model Overview
**VersaViT** is a vision transformer refined with a **multi-task collaborative post-training** recipe to serve as a capable, general-purpose visual encoder for multimodal systems. It is well suited **both to language-mediated reasoning** (e.g., vision–language understanding when paired with an LLM) **and to pixel-level understanding** (e.g., segmentation and depth probing).
## Quick Start
```python
import torch
from PIL import Image
from transformers import AutoImageProcessor
from models.versavit import VersaViTPretrainedModel

model_path = 'tencent/VersaViT'

# Load the image processor and the model (bf16, FlashAttention-2, on GPU).
processor = AutoImageProcessor.from_pretrained(model_path)
model = VersaViTPretrainedModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map='cuda',
)

# Preprocess an image and run the encoder with its feature merger.
image = Image.open("./assets/versavit_logo.png")
inputs = processor(images=image, return_tensors="pt").to('cuda')
outputs = model.forward_wt_merger(inputs['pixel_values'], inputs['image_grid_thw'])
```
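The encoder's patch features can be pooled into a single image embedding, for instance by mean pooling followed by L2 normalization. A minimal sketch, using a random dummy tensor in place of the real `outputs` (the actual shape and hidden size depend on the checkpoint and the image grid, so both are assumptions here):

```python
import torch
import torch.nn.functional as F

# Dummy stand-in for the encoder output: (num_patches, hidden_dim).
# With real weights, use the `outputs` tensor from the Quick Start instead;
# the 256 x 1280 shape below is illustrative only.
features = torch.randn(256, 1280)

# Mean-pool over patches, then L2-normalize to get one unit-norm embedding.
embedding = F.normalize(features.mean(dim=0), dim=-1)

print(embedding.shape)          # torch.Size([1280])
print(embedding.norm().item())  # ~1.0
```

Unit-norm embeddings like this can be compared directly with a dot product (cosine similarity), which is a common way to use a frozen visual encoder for retrieval-style tasks.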
## Citation
If you use this model in your research or projects, please cite:
```bibtex
@article{liu2026versavit,
title={VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization},
author={Liu, Yikun and Liu, Yuan and Di, Shangzhe and Wang, Haicheng and Zhao, Zhongyin and Tian, Le and Zhou, Xiao and Zhou, Jie and Yao, Jiangchao and Wang, Yanfeng and others},
journal={arXiv preprint arXiv:2602.09934},
year={2026}
}
```