Image-Text-to-Text
Transformers
Safetensors
English
qwen3_vl
conversational
langdaohlb committed
Commit 7acd034 · verified · 1 parent: 98a57b7

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +7 -2
README.md CHANGED
@@ -11,9 +11,14 @@
  **ZwZ-4B** is a fine-grained multimodal perception model built upon [Qwen3-VL-4B](https://huggingface.co/Qwen/Qwen3-VL-4B). It is trained using **Region-to-Image Distillation (R2I)** combined with reinforcement learning, enabling superior fine-grained visual understanding in a single forward pass — no inference-time zooming or tool calling required. ZwZ-4B achieves state-of-the-art performance on fine-grained perception benchmarks among open-source models of comparable size.
 
  <div align=center>
- <img src="gp_avg_comparison.png" width="90%" alt="avg_comparison"/>
+ <img src="assets/gp_avg_comparison.png" width="90%" alt="avg_comparison"/>
  </div>
 
+ ## Key Features
+
+ - **⚡ Single-Pass Efficiency**: Achieves fine-grained perception in one forward pass, eliminating inference-time tool-calling overhead
+ - **🎯 Superior Accuracy**: State-of-the-art on perception benchmarks among open-source models
+ - **📈 Broad Improvements**: Enhances not only perception benchmarks but also out-of-distribution generalization on visual reasoning, GUI agent, and AIGC detection
 
  ## How It Works
 
@@ -96,4 +101,4 @@ ZwZ-4B is trained on [inclusionAI/ZwZ-RL-VQA](https://huggingface.co/datasets/in
 
  ## License
 
- This model follows the license of Apache License 2.0.
+ This model follows the license of [Qwen3-VL-4B](https://huggingface.co/Qwen/Qwen3-VL-4B). Please refer to the base model's license for usage terms.