---
language:
- en
- zh
license: apache-2.0
tags:
- multimodal
- vision-language-model
- image-text-to-text
- ocr
- content-moderation
- safety
- reasoning
pipeline_tag: image-text-to-text
---

# Xuanwu

### Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems

[![Paper](https://img.shields.io/badge/arXiv-2603.29211-b31b1b.svg)](https://arxiv.org/abs/2603.29211)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://www.apache.org/licenses/LICENSE-2.0)
[![GitHub](https://img.shields.io/badge/GitHub-Xuanwu-181717?logo=github)](https://github.com/hellogroup-opensource/Xuanwu)

Xuanwu VL-2B is a compact vision-language foundation model for industrial content ecosystems. It is designed to balance general multimodal understanding, fine-grained content moderation, and adversarial OCR robustness within an approximately 2B-parameter budget.

The model combines **InternViT-300M + a lightweight 2-layer MLP projector + Qwen3-1.7B**, together with dynamic high-resolution perception, business-oriented data curation, structured chain-of-thought supervision, and GRPO-based post-training alignment.

## Highlights

- Compact ~2B architecture for deployment-sensitive moderation settings.
- Dynamic high-resolution perception with up to 12 local `448 x 448` tiles plus a global thumbnail.
- Progressive three-stage training: pre-training, mid-training, and post-training.
- Structured moderation reasoning in the form of `Observation -> Extraction -> Reasoning -> Conclusion`.
- Strong business moderation and adversarial OCR performance while retaining competitive general multimodal capability.

## Model Details

| Item | Value |
|---|---|
| Model type | Autoregressive vision-language model |
| Architecture | InternViT-300M + 2-layer MLP projector + Qwen3-1.7B |
| Parameters | Approximately 2B |
| Visual front end | Dynamic tiling with up to 12 local crops plus 1 global thumbnail |
| Token control | Pixel Unshuffle; each `448 x 448` tile contributes 256 visual tokens |
| Context length | Up to 16,384 packed tokens |
| Training stack | DeepSpeed, bf16 / AMP, Flash Attention-2 |
| Hardware | 64 x NVIDIA A100 80GB GPUs |
| Training cost | ~3,500 GPU hours |
| Language coverage | Primarily English and Chinese |

## Training

| Stage | Effective scale (samples) | Purpose |
|---|---:|---|
| Pre-training | 18.63M | Cross-modal alignment and general image-text learning |
| Mid-training | 2.801M | General-capability retention plus moderation and adversarial OCR injection |
| SFT | 8.408M | High-fidelity supervised tuning for rules, format, and reasoning |
| RL | 810k | GRPO alignment for classification, output format, and OCR character accuracy |

The raw pretraining inventory contains **20,078,399** source samples across nine top-level categories, including captioning, chart and table understanding, VQA, OCR, document understanding, science, mathematics, and text-only data. Mid-training and post-training further add business moderation data, adversarial OCR data, and manually reviewed LLM-assisted annotations.

## Evaluation

### Headline Results

Bold marks the better score in each row.

| Area | Metric | Xuanwu VL-2B | Reference |
|---|---|---:|---:|
| General multimodal | OpenCompass average-7 | **67.90** | InternVL 3.5 2B: 64.27 |
| Text-only retention | average-9 | 58.38 | InternVL 3.5 2B: **59.02** |
| Business moderation | average recall over 7 categories | **94.38** | InternVL 3.5 2B: 47.98 |
| Adversarial OCR | weighted overall recall | **82.82** | Gemini-2.5-Pro: 76.72 |

### General Multimodal Benchmarks

| Benchmark | InternVL 3.5 2B | Xuanwu VL-2B |
|---|---:|---:|
| HallusionBench | 46.78 | **47.32** |
| AI2D | 77.95 | **82.19** |
| MMStar | 56.20 | **60.47** |
| OCRBench | 83.10 | **89.80** |
| MMBench v1.1 | 75.08 | **79.02** |
| MMMU (val) | **50.51** | 48.11 |
| MathVista | 60.30 | **68.40** |
| average-7 | 64.27 | **67.90** |

### Business Moderation and Adversarial OCR

Xuanwu VL-2B reaches **94.38 average recall** over seven business moderation categories and **82.82 weighted overall recall** on the adversarial OCR benchmark. On adversarial OCR, it performs particularly strongly on the `aigc`, `noise`, `warp`, and `watermark` subsets.

Gemini-2.5-Pro is reported in the paper as a zero-shot commercial control model without domain adaptation. The comparison is intended as a task-specialized reference rather than a claim of superiority under identical training conditions.

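The two aggregate metrics above differ in how per-category scores are combined: moderation reports a plain mean of per-category recalls, while the adversarial OCR number weights each subset by its sample count. A minimal sketch with hypothetical numbers (the recalls and supports below are invented for illustration, not the paper's data):

```python
# Illustrative aggregation of per-category recalls (hypothetical numbers).

def average_recall(recalls: dict[str, float]) -> float:
    """Unweighted mean of per-category recalls."""
    return sum(recalls.values()) / len(recalls)

def weighted_recall(recalls: dict[str, float], support: dict[str, int]) -> float:
    """Overall recall with each subset weighted by its sample count."""
    total = sum(support.values())
    return sum(recalls[k] * support[k] for k in recalls) / total

# Hypothetical per-subset recalls and sample counts for an OCR benchmark:
ocr_recall = {"aigc": 0.90, "noise": 0.85, "warp": 0.80, "watermark": 0.75}
ocr_support = {"aigc": 100, "noise": 300, "warp": 400, "watermark": 200}

assert abs(weighted_recall(ocr_recall, ocr_support) - 0.815) < 1e-9
```

With skewed supports, the weighted figure can differ noticeably from the plain mean, which is why the OCR headline is reported as a weighted recall.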
## Prompt Format

For difficult moderation and adversarial OCR cases, the paper uses the following structured response pattern:

```text
[Observation] describe the main subjects and background
[Extraction] recover visible or concealed text and symbols
[Reasoning] compare the extracted evidence against moderation rules
[Conclusion] output the final decision (Safe / Violating-Category)
```

Reported evaluation results use greedy decoding with `temperature = 0` and `max_tokens = 8192`.
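
A downstream system consuming this format would typically split a response back into its four sections. The tag names come from the pattern above; the regex-based parser and the example reply (including the `Violating-Solicitation` category label) are our own illustration, not code from the Xuanwu release.

```python
import re

# Sketch of a parser for the structured moderation response format above.
SECTIONS = ("Observation", "Extraction", "Reasoning", "Conclusion")

def parse_moderation_output(text: str) -> dict[str, str]:
    """Split a response into its [Observation]..[Conclusion] sections."""
    tags = "|".join(SECTIONS)
    # Each section runs from its [Tag] marker up to the next marker or EOF.
    pattern = r"\[(" + tags + r")\]\s*(.*?)(?=\[(?:" + tags + r")\]|\Z)"
    return {tag: body.strip() for tag, body in re.findall(pattern, text, flags=re.S)}

reply = (
    "[Observation] A poster with dense overlapping text.\n"
    "[Extraction] Hidden phone number in the corner.\n"
    "[Reasoning] Contact info in ads matches the solicitation rule.\n"
    "[Conclusion] Violating-Solicitation"
)
fields = parse_moderation_output(reply)
assert fields["Conclusion"] == "Violating-Solicitation"
```

Keeping the final decision in its own `[Conclusion]` section is what lets a pipeline act on the verdict while logging the full reasoning trace for human review.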

## Intended Uses

- Research on industrial multimodal systems, especially content moderation and adversarial OCR.
- Deployment-sensitive multimodal understanding where a compact 2B-scale model is preferred.
- Assistive moderation workflows with structured explanations and OCR-aware reasoning.

## Limitations

- The model can still miss violations in extreme cases such as ultra-dense overlapping watermarks or nearly invisible hidden text.
- In long reasoning chains, the model may occasionally hallucinate or over-attribute a violation because of local high-risk cues.
- Moderation behavior is shaped partly by business-specific policies and data, so transfer to other platforms or jurisdictions is not guaranteed.
- Outputs should be verified before use in high-stakes review decisions.

## Citation

```bibtex
@article{zhang2026xuanwu,
  title={Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems},
  author={Zhang, Zhiqian and Zhao, Xu and Xu, Xiaoqing and Liang, Guangdong and Wang, Weijia and Lv, Xiaolei and Li, Bo and Gao, Jun},
  journal={arXiv preprint arXiv:2603.29211},
  year={2026}
}
```

## Links

- Paper: [https://arxiv.org/abs/2603.29211](https://arxiv.org/abs/2603.29211)
- GitHub: [https://github.com/hellogroup-opensource/Xuanwu](https://github.com/hellogroup-opensource/Xuanwu)