---
language:
- en
- zh
license: apache-2.0
tags:
- multimodal
- vision-language-model
- image-text-to-text
- ocr
- content-moderation
- safety
- reasoning
pipeline_tag: image-text-to-text
---

# Xuanwu

### Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems

[![Paper](https://img.shields.io/badge/arXiv-2603.29211-b31b1b.svg)](https://arxiv.org/abs/2603.29211)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://www.apache.org/licenses/LICENSE-2.0)
[![GitHub](https://img.shields.io/badge/GitHub-Xuanwu-181717?logo=github)](https://github.com/hellogroup-opensource/Xuanwu)

Xuanwu VL-2B is a compact vision-language foundation model for industrial content ecosystems. It is designed to balance general multimodal understanding, fine-grained content moderation, and adversarial OCR robustness within an approximately 2B-parameter budget.

The model combines **InternViT-300M + a lightweight 2-layer MLP projector + Qwen3-1.7B**, together with dynamic high-resolution perception, business-oriented data curation, structured chain-of-thought supervision, and GRPO-based post-training alignment.

## Highlights

- Compact ~2B architecture for deployment-sensitive moderation settings.
- Dynamic high-resolution perception with up to 12 local `448 x 448` tiles plus a global thumbnail.
- Progressive three-stage training: pre-training, mid-training, and post-training.
- Structured moderation reasoning in the form of `Observation -> Extraction -> Reasoning -> Conclusion`.
- Strong business moderation and adversarial OCR performance while retaining competitive general multimodal capability.

## Model Details

| Item | Value |
|---|---|
| Model type | Autoregressive vision-language model |
| Architecture | InternViT-300M + 2-layer MLP projector + Qwen3-1.7B |
| Parameters | Approximately 2B |
| Visual front end | Dynamic tiling with up to 12 local crops plus 1 global thumbnail |
| Token control | Pixel Unshuffle; each `448 x 448` tile contributes 256 visual tokens |
| Context length | Up to 16,384 packed tokens |
| Training stack | DeepSpeed, bf16 / AMP, Flash Attention-2 |
| Hardware | 64 x NVIDIA A100 80GB GPUs |
| Training cost | ~3,500 GPU hours |
| Language coverage | Primarily English and Chinese |

## Training

| Stage | Effective scale | Purpose |
|---|---:|---|
| Pre-training | 18.63M | Cross-modal alignment and general image-text learning |
| Mid-training | 2.801M | General-capability retention plus moderation and adversarial OCR injection |
| SFT | 8.408M | High-fidelity supervised tuning for rules, format, and reasoning |
| RL | 810k | GRPO post-training for classification accuracy, output format, and OCR character alignment |

The raw pretraining inventory contains **20,078,399** source samples across nine top-level categories, including captioning, chart and table understanding, VQA, OCR, document understanding, science, mathematics, and text-only data. Mid-training and post-training further add business moderation data, adversarial OCR data, and manually reviewed LLM-assisted annotations.

## Evaluation

### Headline Results

| Area | Metric | Xuanwu VL-2B | Reference |
|---|---|---:|---:|
| General multimodal | OpenCompass average-7 | **67.90** | InternVL 3.5 2B: 64.27 |
| Text-only retention | average-9 | **58.38** | InternVL 3.5 2B: 59.02 |
| Business moderation | average recall over 7 categories | **94.38** | InternVL 3.5 2B: 47.98 |
| Adversarial OCR | weighted overall recall | **82.82** | Gemini-2.5-Pro: 76.72 |

### General Multimodal Benchmarks

| Benchmark | InternVL 3.5 2B | Xuanwu VL-2B |
|---|---:|---:|
| HallusionBench | 46.78 | **47.32** |
| AI2D | 77.95 | **82.19** |
| MMStar | 56.20 | **60.47** |
| OCRBench | 83.10 | **89.80** |
| MMBench v1.1 | 75.08 | **79.02** |
| MMMU (val) | **50.51** | 48.11 |
| MathVista | 60.30 | **68.40** |
| average-7 | 64.27 | **67.90** |
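
Assuming average-7 is the unweighted mean of the seven benchmarks above (scores copied from the table), the reported figures can be reproduced directly:

```python
# Verify that average-7 is the plain mean of the seven benchmark scores.
internvl = [46.78, 77.95, 56.20, 83.10, 75.08, 50.51, 60.30]
xuanwu = [47.32, 82.19, 60.47, 89.80, 79.02, 48.11, 68.40]


def avg(scores):
    return f"{sum(scores) / len(scores):.2f}"


print(avg(internvl))  # 64.27
print(avg(xuanwu))    # 67.90
```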

### Business Moderation and Adversarial OCR

Xuanwu VL-2B reaches **94.38 average recall** over seven business moderation categories and **82.82 weighted overall recall** on the adversarial OCR benchmark. On adversarial OCR, it performs particularly strongly on `aigc`, `noise`, `warp`, and `watermark` subsets.

Gemini-2.5-Pro is reported in the paper as a zero-shot commercial control model without domain adaptation. The comparison is intended as a task-specialized reference rather than a claim of superiority under identical training conditions.

## Prompt Format

For difficult moderation and adversarial OCR cases, the paper uses the following structured response pattern:

```text
[Observation] describe the main subjects and background
[Extraction] recover visible or concealed text and symbols
[Reasoning] compare the extracted evidence against moderation rules
[Conclusion] output the final decision (Safe / Violating-Category)
```

Reported evaluation results use greedy decoding with `temperature = 0` and `max_tokens = 8192`.
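
A response in this format can be split back into its four stages for downstream logging or review. The sketch below is illustrative only; the tag names follow the prompt format above, but the parsing logic and sample text are not part of the released model:

```python
import re

# Parse a structured moderation response into its four stages.
# Stage names follow the prompt format above; the parser itself
# is an illustrative sketch, not part of the released model.
STAGES = ["Observation", "Extraction", "Reasoning", "Conclusion"]


def parse_structured_response(text):
    pattern = r"\[({})\]\s*(.*?)(?=\n\[|\Z)".format("|".join(STAGES))
    return {
        m.group(1): m.group(2).strip()
        for m in re.finditer(pattern, text, flags=re.S)
    }


sample = (
    "[Observation] a poster with stylized text over a crowd\n"
    "[Extraction] visible phrase: 'example slogan'\n"
    "[Reasoning] the phrase does not match any violating category\n"
    "[Conclusion] Safe"
)
result = parse_structured_response(sample)
print(result["Conclusion"])  # Safe
```

Keying the final decision to the `[Conclusion]` tag keeps the free-form reasoning auditable while giving automated pipelines a single field to act on.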

## Intended Uses

- Research on industrial multimodal systems, especially content moderation and adversarial OCR.
- Deployment-sensitive multimodal understanding where a compact 2B-scale model is preferred.
- Assistive moderation workflows with structured explanations and OCR-aware reasoning.

## Limitations

- The model can still miss violations in extreme cases such as ultra-dense overlapping watermarks or nearly invisible hidden text.
- In long reasoning chains, the model may occasionally hallucinate or over-attribute a violation because of local high-risk cues.
- Moderation behavior is shaped partly by business-specific policies and data, so transfer to other platforms or jurisdictions is not guaranteed.
- Outputs should be verified before use in high-stakes review decisions.

## Citation

```bibtex
@article{zhang2026xuanwu,
  title={Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems},
  author={Zhang, Zhiqian and Zhao, Xu and Xu, Xiaoqing and Liang, Guangdong and Wang, Weijia and Lv, Xiaolei and Li, Bo and Gao, Jun},
  journal={arXiv preprint arXiv:2603.29211},
  year={2026}
}
```

## Links

- Paper: [https://arxiv.org/abs/2603.29211](https://arxiv.org/abs/2603.29211)
- GitHub: [https://github.com/hellogroup-opensource/Xuanwu](https://github.com/hellogroup-opensource/Xuanwu)