---
language:
- en
- zh
license: apache-2.0
tags:
- multimodal
- vision-language-model
- image-text-to-text
- ocr
- content-moderation
- safety
- reasoning
pipeline_tag: image-text-to-text
---
# Xuanwu
### Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems
[![Paper](https://img.shields.io/badge/arXiv-2603.29211-b31b1b.svg)](https://arxiv.org/abs/2603.29211)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://www.apache.org/licenses/LICENSE-2.0)
[![GitHub](https://img.shields.io/badge/GitHub-Xuanwu-181717?logo=github)](https://github.com/hellogroup-opensource/Xuanwu)
Xuanwu VL-2B is a compact vision-language foundation model for industrial content ecosystems. It is designed to balance general multimodal understanding, fine-grained content moderation, and adversarial OCR robustness within an approximately 2B-parameter budget.
The model combines **InternViT-300M + a lightweight 2-layer MLP projector + Qwen3 1.7B**, together with dynamic high-resolution perception, business-oriented data curation, structured chain-of-thought supervision, and GRPO-based post-training alignment.
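The vision-to-language connector is described only as a lightweight 2-layer MLP projector. A minimal PyTorch sketch of such a connector is shown below; the hidden sizes (`vit_dim`, `llm_dim`) are illustrative placeholders, not the model's actual dimensions, which this card does not specify:

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Hedged sketch of a 2-layer MLP connector.

    Maps vision-encoder features into the LLM embedding space.
    vit_dim and llm_dim are illustrative placeholders, not the
    model's published dimensions.
    """
    def __init__(self, vit_dim: int = 4096, llm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(vit_dim),
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_visual_tokens, vit_dim)
        return self.proj(visual_tokens)

# 13 tiles x 256 tokens per tile = 3328 visual tokens per image
x = torch.randn(1, 13 * 256, 4096)
y = MLPProjector()(x)
print(y.shape)  # torch.Size([1, 3328, 2048])
```

The projected token sequence is then interleaved with text tokens and consumed autoregressively by the language model.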
## Highlights
- Compact ~2B architecture for deployment-sensitive moderation settings.
- Dynamic high-resolution perception with up to 12 local `448 x 448` tiles plus a global thumbnail.
- Progressive three-stage training: pre-training, mid-training, and post-training.
- Structured moderation reasoning in the form of `Observation -> Extraction -> Reasoning -> Conclusion`.
- Strong business moderation and adversarial OCR performance while retaining competitive general multimodal capability.
## Model Details
| Item | Value |
|---|---|
| Model type | Autoregressive vision-language model |
| Architecture | InternViT-300M + 2-layer MLP projector + Qwen3-1.7B |
| Parameters | Approximately 2B |
| Visual front end | Dynamic tiling with up to 12 local crops plus 1 global thumbnail |
| Token control | Pixel Unshuffle; each `448 x 448` tile contributes 256 visual tokens |
| Context length | Up to 16,384 packed tokens |
| Training stack | DeepSpeed, bf16 / AMP, Flash Attention-2 |
| Hardware | 64 x NVIDIA A100 80GB GPUs |
| Training cost | ~3,500 GPU hours |
| Language coverage | Primarily English and Chinese |
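The visual token budget implied by the table can be worked out directly: a `448 x 448` tile yields `(448/14)^2 = 1024` patches at a 14-pixel patch size, which a factor-2 pixel unshuffle compresses 4x down to 256 tokens; with 12 local tiles plus the global thumbnail, a single image contributes at most `13 x 256 = 3328` visual tokens. A small sketch (the 14-pixel patch size and unshuffle factor of 2 are assumptions consistent with the 256-token figure, not stated in this card):

```python
def visual_token_budget(
    tiles: int,
    tile_px: int = 448,
    patch_px: int = 14,        # assumed ViT patch size
    unshuffle: int = 2,        # assumed pixel-unshuffle factor
    global_thumbnail: bool = True,
) -> int:
    """Tokens contributed by a dynamically tiled image.

    Each tile is split into (tile_px / patch_px)^2 patches; pixel
    unshuffle merges unshuffle x unshuffle neighbourhoods, dividing
    the token count per tile by unshuffle**2.
    """
    patches_per_side = tile_px // patch_px                  # 448 / 14 = 32
    tokens_per_tile = (patches_per_side // unshuffle) ** 2  # 16^2 = 256
    total_tiles = tiles + (1 if global_thumbnail else 0)
    return total_tiles * tokens_per_tile

print(visual_token_budget(tiles=1, global_thumbnail=False))  # 256
print(visual_token_budget(tiles=12))                         # 3328
```

At the 12-tile maximum, the image therefore consumes roughly a fifth of the 16,384-token packed context.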
## Training
| Stage | Effective scale | Purpose |
|---|---:|---|
| Pre-training | 18.63M | Cross-modal alignment and general image-text learning |
| Mid-training | 2.801M | General-capability retention plus moderation and adversarial OCR injection |
| SFT | 8.408M | High-fidelity supervised tuning for rules, format, and reasoning |
| RL | 810k | GRPO-based alignment for classification accuracy, response format, and OCR character fidelity |
The raw pretraining inventory contains **20,078,399** source samples across nine top-level categories, including captioning, chart and table understanding, VQA, OCR, document understanding, science, mathematics, and text-only data. Mid-training and post-training further add business moderation data, adversarial OCR data, and manually reviewed LLM-assisted annotations.
## Evaluation
### Headline Results
| Area | Metric | Xuanwu VL-2B | Reference |
|---|---|---:|---:|
| General multimodal | OpenCompass average-7 | **67.90** | InternVL 3.5 2B: 64.27 |
| Text-only retention | average-9 | **58.38** | InternVL 3.5 2B: 59.02 |
| Business moderation | average recall over 7 categories | **94.38** | InternVL 3.5 2B: 47.98 |
| Adversarial OCR | weighted overall recall | **82.82** | Gemini-2.5-Pro: 76.72 |
### General Multimodal Benchmarks
| Benchmark | InternVL 3.5 2B | Xuanwu VL-2B |
|---|---:|---:|
| HallusionBench | 46.78 | **47.32** |
| AI2D | 77.95 | **82.19** |
| MMStar | 56.20 | **60.47** |
| OCRBench | 83.10 | **89.80** |
| MMBench v1.1 | 75.08 | **79.02** |
| MMMU (val) | **50.51** | 48.11 |
| MathVista | 60.30 | **68.40** |
| average-7 | 64.27 | **67.90** |
### Business Moderation and Adversarial OCR
Xuanwu VL-2B reaches **94.38 average recall** over seven business moderation categories and **82.82 weighted overall recall** on the adversarial OCR benchmark. On adversarial OCR, it performs particularly strongly on `aigc`, `noise`, `warp`, and `watermark` subsets.
Gemini-2.5-Pro is reported in the paper as a zero-shot commercial control model without domain adaptation. The comparison is intended as a task-specialized reference rather than a claim of superiority under identical training conditions.
## Prompt Format
For difficult moderation and adversarial OCR cases, the paper uses the following structured response pattern:
```text
[Observation] describe the main subjects and background
[Extraction] recover visible or concealed text and symbols
[Reasoning] compare the extracted evidence against moderation rules
[Conclusion] output the final decision (Safe / Violating-Category)
```
Reported evaluation results use greedy decoding with `temperature = 0` and `max_tokens = 8192`.
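A downstream system can recover the four bracketed sections from a model response with a small parser. A minimal sketch, where the section names come from the pattern above and the sample response text is invented purely for illustration:

```python
import re

SECTIONS = ("Observation", "Extraction", "Reasoning", "Conclusion")

def parse_moderation_response(text: str) -> dict:
    """Split a structured moderation response into its four sections.

    Returns a dict mapping each section name to its text, or an
    empty string when a section is missing from the response.
    """
    result = {}
    alternation = "|".join(SECTIONS)
    for name in SECTIONS:
        # Capture everything after [Name] up to the next [Section] or end.
        pattern = rf"\[{name}\]\s*(.*?)(?=\[(?:{alternation})\]|\Z)"
        match = re.search(pattern, text, flags=re.DOTALL)
        result[name] = match.group(1).strip() if match else ""
    return result

# Invented sample response, for illustration only.
sample = (
    "[Observation] A product photo with a faint text overlay.\n"
    "[Extraction] Overlay reads: 'contact me off-platform'.\n"
    "[Reasoning] Off-platform solicitation matches a violating rule.\n"
    "[Conclusion] Violating-Solicitation"
)
parsed = parse_moderation_response(sample)
print(parsed["Conclusion"])  # Violating-Solicitation
```

Keeping the `[Conclusion]` label machine-readable lets the final decision be routed automatically while the earlier sections remain available as an audit trail.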
## Intended Uses
- Research on industrial multimodal systems, especially content moderation and adversarial OCR.
- Deployment-sensitive multimodal understanding where a compact 2B-scale model is preferred.
- Assistive moderation workflows with structured explanations and OCR-aware reasoning.
## Limitations
- The model can still miss violations in extreme cases such as ultra-dense overlapping watermarks or nearly invisible hidden text.
- In long reasoning chains, the model may occasionally hallucinate or over-attribute a violation because of local high-risk cues.
- Moderation behavior is shaped partly by business-specific policies and data, so transfer to other platforms or jurisdictions is not guaranteed.
- Outputs should be verified before use in high-stakes review decisions.
## Citation
```bibtex
@article{zhang2026xuanwu,
  title={Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems},
  author={Zhang, Zhiqian and Zhao, Xu and Xu, Xiaoqing and Liang, Guangdong and Wang, Weijia and Lv, Xiaolei and Li, Bo and Gao, Jun},
  journal={arXiv preprint arXiv:2603.29211},
  year={2026}
}
```
## Links
- Paper: [https://arxiv.org/abs/2603.29211](https://arxiv.org/abs/2603.29211)
- GitHub: [https://github.com/hellogroup-opensource/Xuanwu](https://github.com/hellogroup-opensource/Xuanwu)