Upload folder using huggingface_hub
- .gitattributes +1 -0
- README.md +60 -0
- assets/versavit_logo.png +3 -0
- config.json +21 -0
- model.safetensors +3 -0
- preprocessor_config.json +29 -0
.gitattributes CHANGED

```diff
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+assets/versavit_logo.png filter=lfs diff=lfs merge=lfs -text
```
README.md ADDED
---
base_model:
- Qwen/Qwen2-VL-7B-Instruct
language:
- en
- zh
license: other
license_name: license-term-of-versavit
metrics:
- accuracy
library_name: transformers
---

<p align="center">
    <img src="assets/versavit_logo.png" width="480"/>
</p>

<p align="center">
    <a href="https://huggingface.co/tencent/VersaViT">
        <img src="https://img.shields.io/badge/%F0%9F%A4%97_HuggingFace-Model-ffbd45.svg" alt="HuggingFace">
    </a>
    <a href="https://arxiv.org/pdf/2602.09934">
        <img src="https://img.shields.io/badge/Paper-VersaViT-d4333f?logo=arxiv&logoColor=white&colorA=cccccc&colorB=d4333f&style=flat" alt="Paper">
    </a>
</p>

## 🌟 Model Overview

**VersaViT** is a vision transformer refined with a **multi-task collaborative post-training** recipe to serve as a capable, general-purpose visual encoder for multimodal systems. It is **well suited both to language-mediated reasoning** (e.g., vision-language understanding when paired with an LLM) **and to pixel-level understanding** (e.g., segmentation and depth probing).

## Quick Start

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor

# VersaViT model class shipped with this repository's code (not part of transformers).
from models.versavit import VersaViTPretrainedModel

model_path = 'tencent/VersaViT'
processor = AutoImageProcessor.from_pretrained(model_path)
model = VersaViTPretrainedModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map='cuda',
)

# Preprocess an image into flattened patches plus its (temporal, height, width) grid.
image = Image.open("./assets/versavit_logo.png")
inputs = processor(images=image, return_tensors="pt").to('cuda')

# Encode the patches and apply the spatial merger to obtain visual tokens.
outputs = model.forward_wt_merger(inputs['pixel_values'], inputs['image_grid_thw'])
```
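
The snippet below continues from the Quick Start and shows one way to turn the encoder output into a single image-level embedding. It is a minimal sketch under an assumption this card does not confirm: that `forward_wt_merger` returns a 2-D float tensor of shape `(num_merged_tokens, hidden_size)`. Check the shape and type of `outputs` before relying on it.

```python
import torch.nn.functional as F

# Assumption (not guaranteed by this card): `outputs` is a float tensor of
# merged visual tokens, shape (num_merged_tokens, hidden_size). Mean-pool the
# tokens into one vector and L2-normalize it, e.g. for retrieval-style similarity.
image_embedding = F.normalize(outputs.float().mean(dim=0), dim=-1)
print(image_embedding.shape)  # expected: torch.Size([hidden_size])
```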

## Citation

If you use this model in your research or projects, please cite:

```bibtex
@article{liu2026versavit,
  title={VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization},
  author={Liu, Yikun and Liu, Yuan and Di, Shangzhe and Wang, Haicheng and Zhao, Zhongyin and Tian, Le and Zhou, Xiao and Zhou, Jie and Yao, Jiangchao and Wang, Yanfeng and others},
  journal={arXiv preprint arXiv:2602.09934},
  year={2026}
}
```
assets/versavit_logo.png ADDED (Git LFS)
config.json ADDED

```json
{
  "architectures": [
    "Qwen2VisionTransformerPretrainedModel"
  ],
  "depth": 32,
  "embed_dim": 1280,
  "hidden_act": "quick_gelu",
  "hidden_size": 3584,
  "in_channels": 3,
  "in_chans": 3,
  "initializer_range": 0.02,
  "mlp_ratio": 4,
  "model_type": "qwen2_vl",
  "num_heads": 16,
  "patch_size": 14,
  "spatial_merge_size": 2,
  "spatial_patch_size": 14,
  "temporal_patch_size": 2,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.52.1"
}
```
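
A note on this config: `patch_size` 14 and `spatial_merge_size` 2 together determine how many visual tokens an image produces — the encoder patchifies at 14×14 pixels, then the merger fuses each 2×2 patch neighborhood into one token. The sketch below just does that arithmetic; the helper name and the example resolution are illustrative, not part of the release.

```python
# Illustrative arithmetic from config.json (patch_size=14, spatial_merge_size=2).
def merged_token_count(height: int, width: int,
                       patch_size: int = 14, merge_size: int = 2) -> int:
    # height/width are assumed to already be multiples of patch_size * merge_size.
    patches = (height // patch_size) * (width // patch_size)
    return patches // (merge_size * merge_size)

print(merged_token_count(448, 448))  # 32*32 = 1024 patches -> 256 merged tokens
```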
model.safetensors ADDED

```
version https://git-lfs.github.com/spec/v1
oid sha256:f3315fc72a1bc1441bdab10d054785cd1b8e5b9cebeb5a8286d6c08698765a7a
size 1351555904
```
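
The block above is a Git LFS pointer: `oid` is the SHA-256 of the actual weights file and `size` is its byte count. If you want to verify a downloaded copy against the pointer, a straightforward check is to hash the file yourself, as in the sketch below; the local path is a placeholder.

```python
import hashlib

EXPECTED_OID = "f3315fc72a1bc1441bdab10d054785cd1b8e5b9cebeb5a8286d6c08698765a7a"
EXPECTED_SIZE = 1351555904

def verify_lfs_file(path: str) -> bool:
    # Hash the file in 1 MiB chunks so large weights never need to fit in memory.
    h = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
            size += len(chunk)
    return h.hexdigest() == EXPECTED_OID and size == EXPECTED_SIZE

print(verify_lfs_file("model.safetensors"))  # placeholder local path
```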
preprocessor_config.json ADDED

```json
{
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "longest_edge": 12845056,
    "shortest_edge": 3136
  },
  "temporal_patch_size": 2
}
```
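
In this preprocessor config, `min_pixels` and `max_pixels` bound the total pixel budget, and each side of the image is snapped to a multiple of `patch_size * merge_size = 28` so the patch grid divides evenly. Below is a sketch of this resize rule, simplified from the Qwen2-VL reference implementation; treat the helper name and the edge-case handling as illustrative rather than exact.

```python
import math

# Illustrative sketch of Qwen2-VL-style "smart resize", simplified from the
# reference implementation. factor = patch_size * merge_size = 28 here;
# min/max pixel budgets come from preprocessor_config.json.
def smart_resize(height: int, width: int, factor: int = 28,
                 min_pixels: int = 3136, max_pixels: int = 12845056):
    # Snap each side to the nearest multiple of `factor`.
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Too many pixels: scale down, rounding sides down to multiples of factor.
        beta = math.sqrt((height * width) / max_pixels)
        h = math.floor(height / beta / factor) * factor
        w = math.floor(width / beta / factor) * factor
    elif h * w < min_pixels:
        # Too few pixels: scale up, rounding sides up to multiples of factor.
        beta = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * beta / factor) * factor
        w = math.ceil(width * beta / factor) * factor
    return h, w

print(smart_resize(1080, 1920))  # -> (1092, 1932), both multiples of 28
```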