code-kunkun committed on
Commit e795dc7 · verified · 1 Parent(s): e5fe3f7

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+assets/versavit_logo.png filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,60 @@
---
base_model:
- Qwen/Qwen2-VL-7B-Instruct
language:
- en
- zh
license: other
license_name: license-term-of-versavit
metrics:
- accuracy
library_name: transformers
---

<p align="center">
    <img src="assets/versavit_logo.png" width="480"/>
</p>

<p align="center">
  <a href="https://huggingface.co/tencent/VersaViT">
    <img src="https://img.shields.io/badge/%F0%9F%A4%97_HuggingFace-Model-ffbd45.svg" alt="HuggingFace">
  </a>
  <a href="https://arxiv.org/pdf/2602.09934">
    <img src="https://img.shields.io/badge/Paper-VersaViT-d4333f?logo=arxiv&logoColor=white&colorA=cccccc&colorB=d4333f&style=flat" alt="Paper">
  </a>
</p>
## 🌟 Model Overview

**VersaViT** is a vision transformer tuned to serve as a capable, general-purpose visual encoder for multimodal systems. It is refined with a **multi-task collaborative post-training** recipe, which makes it **well suited both to language-mediated reasoning** (e.g., vision–language understanding when paired with an LLM) **and to pixel-level understanding** (e.g., segmentation and depth probing).

## Quick Start

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor
from models.versavit import VersaViTPretrainedModel  # provided in this repository

model_path = 'tencent/VersaViT'
processor = AutoImageProcessor.from_pretrained(model_path)
model = VersaViTPretrainedModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map='cuda',
)

# Encode a single image into patch features via the spatial merger.
image = Image.open("./assets/versavit_logo.png")
inputs = processor(images=image, return_tensors="pt").to('cuda')
outputs = model.forward_wt_merger(inputs['pixel_values'], inputs['image_grid_thw'])
```
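The features returned above correspond to 2×2-merged patch tokens. As a rough illustration (this helper is hypothetical and not part of the repo), the merged token count per image can be estimated from the `(t, h, w)` entries of `image_grid_thw`:

```python
# Hypothetical helper (not part of this repo): estimate the number of visual
# tokens after the 2x2 spatial merge, given a (t, h, w) patch grid as found
# in image_grid_thw for Qwen2-VL-style processors.
def num_merged_tokens(grid_thw, merge_size=2):
    t, h, w = grid_thw
    return t * h * w // (merge_size * merge_size)

print(num_merged_tokens((1, 16, 16)))  # 64
```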

## Citation

If you use this model in your research or project, please cite:

```bibtex
@article{liu2026versavit,
  title={VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization},
  author={Liu, Yikun and Liu, Yuan and Di, Shangzhe and Wang, Haicheng and Zhao, Zhongyin and Tian, Le and Zhou, Xiao and Zhou, Jie and Yao, Jiangchao and Wang, Yanfeng and others},
  journal={arXiv preprint arXiv:2602.09934},
  year={2026}
}
```
assets/versavit_logo.png ADDED

Git LFS Details

  • SHA256: 1d85b8b72286239f93e2d7c21cb3e7f844547ecbbcfb9e0cfcd3a8f0b4f0252d
  • Pointer size: 131 Bytes
  • Size of remote file: 662 kB
config.json ADDED
@@ -0,0 +1,21 @@
{
  "architectures": [
    "Qwen2VisionTransformerPretrainedModel"
  ],
  "depth": 32,
  "embed_dim": 1280,
  "hidden_act": "quick_gelu",
  "hidden_size": 3584,
  "in_channels": 3,
  "in_chans": 3,
  "initializer_range": 0.02,
  "mlp_ratio": 4,
  "model_type": "qwen2_vl",
  "num_heads": 16,
  "patch_size": 14,
  "spatial_merge_size": 2,
  "spatial_patch_size": 14,
  "temporal_patch_size": 2,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.52.1"
}
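The config above pins the ViT geometry. As a quick sanity check of the derived widths (the merger-input relation is an assumption following the Qwen2-VL design, where `spatial_merge_size × spatial_merge_size` patch features are concatenated before projection to `hidden_size`):

```python
# Values copied from config.json above.
embed_dim, num_heads = 1280, 16
spatial_merge_size, hidden_size = 2, 3584

head_dim = embed_dim // num_heads               # per-head width inside each ViT block
merged_in = embed_dim * spatial_merge_size**2   # assumed merger input: 2x2 features concatenated
print(head_dim, merged_in, hidden_size)         # 80 5120 3584
```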
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f3315fc72a1bc1441bdab10d054785cd1b8e5b9cebeb5a8286d6c08698765a7a
size 1351555904
preprocessor_config.json ADDED
@@ -0,0 +1,29 @@
{
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "longest_edge": 12845056,
    "shortest_edge": 3136
  },
  "temporal_patch_size": 2
}
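The `min_pixels`/`max_pixels` and patch settings imply a Qwen2-VL-style resize where image sides snap to multiples of `patch_size × merge_size = 28` and the pixel count stays within the configured budget. A simplified sketch of that idea (the actual `Qwen2VLImageProcessor` logic is more involved; this is only illustrative):

```python
import math

# Simplified sketch of the resize implied by preprocessor_config.json:
# sides round to multiples of factor = patch_size * merge_size = 28, and the
# total pixel count is rescaled into [min_pixels, max_pixels] when needed.
def smart_resize(h, w, factor=28, min_pixels=3136, max_pixels=12845056):
    h_bar = max(factor, round(h / factor) * factor)
    w_bar = max(factor, round(w / factor) * factor)
    if h_bar * w_bar > max_pixels:
        # Shrink so the rounded size fits under the pixel budget.
        scale = math.sqrt((h * w) / max_pixels)
        h_bar = math.floor(h / scale / factor) * factor
        w_bar = math.floor(w / scale / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Grow tiny images up to the minimum pixel count.
        scale = math.sqrt(min_pixels / (h * w))
        h_bar = math.ceil(h * scale / factor) * factor
        w_bar = math.ceil(w * scale / factor) * factor
    return h_bar, w_bar

print(smart_resize(224, 224))  # (224, 224) -- already a multiple of 28
```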