---
language:
- en
- zh
license: apache-2.0
tags:
- multimodal
- vision-language-model
- image-text-to-text
- ocr
- content-moderation
- safety
- reasoning
pipeline_tag: image-text-to-text
---

# Xuanwu

### Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems

[![Paper](https://img.shields.io/badge/arXiv-2603.29211-b31b1b.svg)](https://arxiv.org/abs/2603.29211)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://www.apache.org/licenses/LICENSE-2.0)
[![GitHub](https://img.shields.io/badge/GitHub-Xuanwu-181717?logo=github)](https://github.com/hellogroup-opensource/Xuanwu)

Xuanwu VL-2B is a compact vision-language foundation model for industrial content ecosystems. It is designed to balance general multimodal understanding, fine-grained content moderation, and adversarial OCR robustness within an approximately 2B-parameter budget.

The model combines **InternViT-300M + a lightweight 2-layer MLP projector + Qwen3-1.7B**, together with dynamic high-resolution perception, business-oriented data curation, structured chain-of-thought supervision, and GRPO-based post-training alignment.

## Highlights

- Compact ~2B architecture for deployment-sensitive moderation settings.
- Dynamic high-resolution perception with up to 12 local `448 x 448` tiles plus a global thumbnail.
- Progressive three-stage training: pre-training, mid-training, and post-training.
- Structured moderation reasoning in the form of `Observation -> Extraction -> Reasoning -> Conclusion`.
- Strong business moderation and adversarial OCR performance while retaining competitive general multimodal capability.

## Model Details

| Item | Value |
|---|---|
| Model type | Autoregressive vision-language model |
| Architecture | InternViT-300M + 2-layer MLP projector + Qwen3-1.7B |
| Parameters | Approximately 2B |
| Visual front end | Dynamic tiling with up to 12 local crops plus 1 global thumbnail |
| Token control | Pixel Unshuffle; each `448 x 448` tile contributes 256 visual tokens |
| Context length | Up to 16,384 packed tokens |
| Training stack | DeepSpeed, bf16 / AMP, Flash Attention-2 |
| Hardware | 64 x NVIDIA A100 80GB GPUs |
| Training cost | ~3,500 GPU hours |
| Language coverage | Primarily English and Chinese |

## Training

| Stage | Effective scale | Purpose |
|---|---:|---|
| Pre-training | 18.63M | Cross-modal alignment and general image-text learning |
| Mid-training | 2.801M | General-capability retention plus moderation and adversarial OCR injection |
| SFT | 8.408M | High-fidelity supervised tuning for rules, format, and reasoning |
| RL | 810k | GRPO post-training for classification accuracy, output format, and OCR character alignment |

The raw pretraining inventory contains **20,078,399** source samples across nine top-level categories, including captioning, chart and table understanding, VQA, OCR, document understanding, science, mathematics, and text-only data. Mid-training and post-training further add business moderation data, adversarial OCR data, and manually reviewed LLM-assisted annotations.

## Evaluation

### Headline Results

| Area | Metric | Xuanwu VL-2B | Reference |
|---|---|---:|---:|
| General multimodal | OpenCompass average-7 | **67.90** | InternVL 3.5 2B: 64.27 |
| Text-only retention | average-9 | **58.38** | InternVL 3.5 2B: 59.02 |
| Business moderation | average recall over 7 categories | **94.38** | InternVL 3.5 2B: 47.98 |
| Adversarial OCR | weighted overall recall | **82.82** | Gemini-2.5-Pro: 76.72 |

### General Multimodal Benchmarks

| Benchmark | InternVL 3.5 2B | Xuanwu VL-2B |
|---|---:|---:|
| HallusionBench | 46.78 | **47.32** |
| AI2D | 77.95 | **82.19** |
| MMStar | 56.20 | **60.47** |
| OCRBench | 83.10 | **89.80** |
| MMBench v1.1 | 75.08 | **79.02** |
| MMMU (val) | **50.51** | 48.11 |
| MathVista | 60.30 | **68.40** |
| average-7 | 64.27 | **67.90** |
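
Assuming average-7 is the unweighted mean of the seven benchmarks above (scores copied from the table), the reported figures can be reproduced directly:

```python
# Verify that average-7 is the plain mean of the seven benchmark scores.
internvl = [46.78, 77.95, 56.20, 83.10, 75.08, 50.51, 60.30]
xuanwu = [47.32, 82.19, 60.47, 89.80, 79.02, 48.11, 68.40]


def avg(scores):
    return f"{sum(scores) / len(scores):.2f}"


print(avg(internvl))  # 64.27
print(avg(xuanwu))    # 67.90
```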

### Business Moderation and Adversarial OCR

Xuanwu VL-2B reaches **94.38 average recall** over seven business moderation categories and **82.82 weighted overall recall** on the adversarial OCR benchmark. On adversarial OCR, it performs particularly strongly on `aigc`, `noise`, `warp`, and `watermark` subsets.

Gemini-2.5-Pro is reported in the paper as a zero-shot commercial control model without domain adaptation. The comparison is intended as a task-specialized reference rather than a claim of superiority under identical training conditions.

## Prompt Format

For difficult moderation and adversarial OCR cases, the paper uses the following structured response pattern:

```text
[Observation] describe the main subjects and background
[Extraction] recover visible or concealed text and symbols
[Reasoning] compare the extracted evidence against moderation rules
[Conclusion] output the final decision (Safe / Violating-Category)
```

Reported evaluation results use greedy decoding with `temperature = 0` and `max_tokens = 8192`.
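
A response in this format can be split back into its four stages for downstream logging or review. The sketch below is illustrative only; the tag names follow the prompt format above, but the parsing logic and sample text are not part of the released model:

```python
import re

# Parse a structured moderation response into its four stages.
# Stage names follow the prompt format above; the parser itself
# is an illustrative sketch, not part of the released model.
STAGES = ["Observation", "Extraction", "Reasoning", "Conclusion"]


def parse_structured_response(text):
    pattern = r"\[({})\]\s*(.*?)(?=\n\[|\Z)".format("|".join(STAGES))
    return {
        m.group(1): m.group(2).strip()
        for m in re.finditer(pattern, text, flags=re.S)
    }


sample = (
    "[Observation] a poster with stylized text over a crowd\n"
    "[Extraction] visible phrase: 'example slogan'\n"
    "[Reasoning] the phrase does not match any violating category\n"
    "[Conclusion] Safe"
)
result = parse_structured_response(sample)
print(result["Conclusion"])  # Safe
```

Keying the final decision to the `[Conclusion]` tag keeps the free-form reasoning auditable while giving automated pipelines a single field to act on.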

## Intended Uses

- Research on industrial multimodal systems, especially content moderation and adversarial OCR.
- Deployment-sensitive multimodal understanding where a compact 2B-scale model is preferred.
- Assistive moderation workflows with structured explanations and OCR-aware reasoning.

## Limitations

- The model can still miss violations in extreme cases such as ultra-dense overlapping watermarks or nearly invisible hidden text.
- In long reasoning chains, the model may occasionally hallucinate or over-attribute a violation because of local high-risk cues.
- Moderation behavior is shaped partly by business-specific policies and data, so transfer to other platforms or jurisdictions is not guaranteed.
- Outputs should be verified before use in high-stakes review decisions.

## Citation

```bibtex
@article{zhang2026xuanwu,
  title={Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems},
  author={Zhang, Zhiqian and Zhao, Xu and Xu, Xiaoqing and Liang, Guangdong and Wang, Weijia and Lv, Xiaolei and Li, Bo and Gao, Jun},
  journal={arXiv preprint arXiv:2603.29211},
  year={2026}
}
```

## Links

- Paper: [https://arxiv.org/abs/2603.29211](https://arxiv.org/abs/2603.29211)
- GitHub: [https://github.com/hellogroup-opensource/Xuanwu](https://github.com/hellogroup-opensource/Xuanwu)