---
language:
- en
- zh
license: apache-2.0
tags:
- multimodal
- vision-language-model
- image-text-to-text
- ocr
- content-moderation
- safety
- reasoning
pipeline_tag: image-text-to-text
---

# Xuanwu

### Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems

[![Paper](https://img.shields.io/badge/arXiv-2603.29211-b31b1b.svg)](https://arxiv.org/abs/2603.29211)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://www.apache.org/licenses/LICENSE-2.0)
[![GitHub](https://img.shields.io/badge/GitHub-Xuanwu-181717?logo=github)](https://github.com/hellogroup-opensource/Xuanwu)

Xuanwu VL-2B is a compact vision-language foundation model for industrial content ecosystems. It is designed to balance general multimodal understanding, fine-grained content moderation, and adversarial OCR robustness within an approximately 2B-parameter budget.

The model combines **InternViT-300M + a lightweight 2-layer MLP projector + Qwen3-1.7B**, together with dynamic high-resolution perception, business-oriented data curation, structured chain-of-thought supervision, and GRPO-based post-training alignment.

## Highlights

- Compact ~2B architecture for deployment-sensitive moderation settings.
- Dynamic high-resolution perception with up to 12 local `448 x 448` tiles plus a global thumbnail.
- Progressive three-stage training: pre-training, mid-training, and post-training.
- Structured moderation reasoning in the form of `Observation -> Extraction -> Reasoning -> Conclusion`.
- Strong business moderation and adversarial OCR performance while retaining competitive general multimodal capability.

## Model Details

| Item | Value |
|---|---|
| Model type | Autoregressive vision-language model |
| Architecture | InternViT-300M + 2-layer MLP projector + Qwen3-1.7B |
| Parameters | Approximately 2B |
| Visual front end | Dynamic tiling with up to 12 local crops plus 1 global thumbnail |
| Token control | Pixel Unshuffle; each `448 x 448` tile contributes 256 visual tokens |
| Context length | Up to 16,384 packed tokens |
| Training stack | DeepSpeed, bf16 / AMP, Flash Attention-2 |
| Hardware | 64 x NVIDIA A100 80GB GPUs |
| Training cost | ~3,500 GPU hours |
| Language coverage | Primarily English and Chinese |

## Training

| Stage | Effective scale (samples) | Purpose |
|---|---:|---|
| Pre-training | 18.63M | Cross-modal alignment and general image-text learning |
| Mid-training | 2.801M | General-capability retention plus moderation and adversarial OCR injection |
| SFT | 8.408M | High-fidelity supervised tuning for rules, format, and reasoning |
| RL | 810k | GRPO alignment for classification, output format, and OCR character accuracy |

The raw pretraining inventory contains **20,078,399** source samples across nine top-level categories, including captioning, chart and table understanding, VQA, OCR, document understanding, science, mathematics, and text-only data. Mid-training and post-training further add business moderation data, adversarial OCR data, and manually reviewed LLM-assisted annotations.

## Evaluation

### Headline Results

Bold marks the better score in each row.

| Area | Metric | Xuanwu VL-2B | Reference |
|---|---|---:|---:|
| General multimodal | OpenCompass average-7 | **67.90** | InternVL 3.5 2B: 64.27 |
| Text-only retention | average-9 | 58.38 | InternVL 3.5 2B: **59.02** |
| Business moderation | average recall over 7 categories | **94.38** | InternVL 3.5 2B: 47.98 |
| Adversarial OCR | weighted overall recall | **82.82** | Gemini-2.5-Pro: 76.72 |

### General Multimodal Benchmarks

| Benchmark | InternVL 3.5 2B | Xuanwu VL-2B |
|---|---:|---:|
| HallusionBench | 46.78 | **47.32** |
| AI2D | 77.95 | **82.19** |
| MMStar | 56.20 | **60.47** |
| OCRBench | 83.10 | **89.80** |
| MMBench v1.1 | 75.08 | **79.02** |
| MMMU (val) | **50.51** | 48.11 |
| MathVista | 60.30 | **68.40** |
| average-7 | 64.27 | **67.90** |

### Business Moderation and Adversarial OCR

Xuanwu VL-2B reaches **94.38 average recall** over seven business moderation categories and **82.82 weighted overall recall** on the adversarial OCR benchmark. On adversarial OCR, it performs particularly strongly on the `aigc`, `noise`, `warp`, and `watermark` subsets.

Gemini-2.5-Pro is reported in the paper as a zero-shot commercial control model without domain adaptation. The comparison is intended as a task-specialized reference rather than a claim of superiority under identical training conditions.

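The two aggregate metrics above differ in how per-category scores are combined: moderation reports a plain mean of per-category recalls, while the adversarial OCR number weights each subset by its sample count. A minimal sketch with hypothetical numbers (the recalls and supports below are invented for illustration, not the paper's data):

```python
# Illustrative aggregation of per-category recalls (hypothetical numbers).

def average_recall(recalls: dict[str, float]) -> float:
    """Unweighted mean of per-category recalls."""
    return sum(recalls.values()) / len(recalls)

def weighted_recall(recalls: dict[str, float], support: dict[str, int]) -> float:
    """Overall recall with each subset weighted by its sample count."""
    total = sum(support.values())
    return sum(recalls[k] * support[k] for k in recalls) / total

# Hypothetical per-subset recalls and sample counts for an OCR benchmark:
ocr_recall = {"aigc": 0.90, "noise": 0.85, "warp": 0.80, "watermark": 0.75}
ocr_support = {"aigc": 100, "noise": 300, "warp": 400, "watermark": 200}

assert abs(weighted_recall(ocr_recall, ocr_support) - 0.815) < 1e-9
```

With skewed supports, the weighted figure can differ noticeably from the plain mean, which is why the OCR headline is reported as a weighted recall.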
## Prompt Format

For difficult moderation and adversarial OCR cases, the paper uses the following structured response pattern:

```text
[Observation] describe the main subjects and background
[Extraction] recover visible or concealed text and symbols
[Reasoning] compare the extracted evidence against moderation rules
[Conclusion] output the final decision (Safe / Violating-Category)
```

Reported evaluation results use greedy decoding with `temperature = 0` and `max_tokens = 8192`.
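
A downstream system consuming this format would typically split a response back into its four sections. The tag names come from the pattern above; the regex-based parser and the example reply (including the `Violating-Solicitation` category label) are our own illustration, not code from the Xuanwu release.

```python
import re

# Sketch of a parser for the structured moderation response format above.
SECTIONS = ("Observation", "Extraction", "Reasoning", "Conclusion")

def parse_moderation_output(text: str) -> dict[str, str]:
    """Split a response into its [Observation]..[Conclusion] sections."""
    tags = "|".join(SECTIONS)
    # Each section runs from its [Tag] marker up to the next marker or EOF.
    pattern = r"\[(" + tags + r")\]\s*(.*?)(?=\[(?:" + tags + r")\]|\Z)"
    return {tag: body.strip() for tag, body in re.findall(pattern, text, flags=re.S)}

reply = (
    "[Observation] A poster with dense overlapping text.\n"
    "[Extraction] Hidden phone number in the corner.\n"
    "[Reasoning] Contact info in ads matches the solicitation rule.\n"
    "[Conclusion] Violating-Solicitation"
)
fields = parse_moderation_output(reply)
assert fields["Conclusion"] == "Violating-Solicitation"
```

Keeping the final decision in its own `[Conclusion]` section is what lets a pipeline act on the verdict while logging the full reasoning trace for human review.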

## Intended Uses

- Research on industrial multimodal systems, especially content moderation and adversarial OCR.
- Deployment-sensitive multimodal understanding where a compact 2B-scale model is preferred.
- Assistive moderation workflows with structured explanations and OCR-aware reasoning.

## Limitations

- The model can still miss violations in extreme cases such as ultra-dense overlapping watermarks or nearly invisible hidden text.
- In long reasoning chains, the model may occasionally hallucinate or over-attribute a violation because of local high-risk cues.
- Moderation behavior is shaped partly by business-specific policies and data, so transfer to other platforms or jurisdictions is not guaranteed.
- Outputs should be verified before use in high-stakes review decisions.

## Citation

```bibtex
@article{zhang2026xuanwu,
  title={Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems},
  author={Zhang, Zhiqian and Zhao, Xu and Xu, Xiaoqing and Liang, Guangdong and Wang, Weijia and Lv, Xiaolei and Li, Bo and Gao, Jun},
  journal={arXiv preprint arXiv:2603.29211},
  year={2026}
}
```

## Links

- Paper: [https://arxiv.org/abs/2603.29211](https://arxiv.org/abs/2603.29211)
- GitHub: [https://github.com/hellogroup-opensource/Xuanwu](https://github.com/hellogroup-opensource/Xuanwu)