---
language:
- en
- zh
license: apache-2.0
tags:
- multimodal
- vision-language-model
- image-text-to-text
- ocr
- content-moderation
- safety
- reasoning
pipeline_tag: image-text-to-text
---

# Xuanwu

### Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems

[Paper](https://arxiv.org/abs/2603.29211)
[License](https://www.apache.org/licenses/LICENSE-2.0)
[GitHub](https://github.com/hellogroup-opensource/Xuanwu)

Xuanwu VL-2B is a compact vision-language foundation model for industrial content ecosystems. It is designed to balance general multimodal understanding, fine-grained content moderation, and adversarial OCR robustness within an approximately 2B-parameter budget.

The model combines **InternViT-300M + a lightweight 2-layer MLP projector + Qwen3-1.7B**, together with dynamic high-resolution perception, business-oriented data curation, structured chain-of-thought supervision, and GRPO-based post-training alignment.
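
The projector is the small glue component between the two pretrained backbones: it maps each visual token from the ViT feature space into the LLM embedding space. A minimal sketch of what a 2-layer MLP projector computes (the 1024 -> 2048 dimensions are illustrative assumptions, not the released configuration; in a real implementation this would be a PyTorch `nn.Sequential` of Linear -> GELU -> Linear):

```python
import numpy as np

rng = np.random.default_rng(0)
vit_dim, llm_dim = 1024, 2048  # assumed feature sizes, for illustration only

# Two linear layers with a GELU nonlinearity in between.
W1 = rng.standard_normal((vit_dim, llm_dim)) * 0.02
b1 = np.zeros(llm_dim)
W2 = rng.standard_normal((llm_dim, llm_dim)) * 0.02
b2 = np.zeros(llm_dim)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def project(visual_tokens):
    """Map (num_tokens, vit_dim) ViT features to (num_tokens, llm_dim) embeddings."""
    return gelu(visual_tokens @ W1 + b1) @ W2 + b2

tokens = rng.standard_normal((256, vit_dim))  # one 448x448 tile -> 256 visual tokens
print(project(tokens).shape)  # (256, 2048)
```

The projected tokens are then interleaved with text embeddings and fed to the language model as an ordinary sequence.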

## Highlights

- Compact ~2B architecture for deployment-sensitive moderation settings.
- Dynamic high-resolution perception with up to 12 local `448 x 448` tiles plus a global thumbnail.
- Progressive three-stage training: pre-training, mid-training, and post-training.
- Structured moderation reasoning in the form of `Observation -> Extraction -> Reasoning -> Conclusion`.
- Strong business moderation and adversarial OCR performance while retaining competitive general multimodal capability.
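
The dynamic high-resolution step can be pictured as choosing a tile grid whose aspect ratio matches the input image, subject to the 12-tile cap. The heuristic below follows the common InternVL-style recipe; the exact grid-selection rule used by Xuanwu is not spelled out in this card, so treat this as a sketch:

```python
def choose_tile_grid(width: int, height: int, max_tiles: int = 12):
    """Pick a (cols, rows) grid of 448x448 tiles whose aspect ratio best
    matches the image, then add one global thumbnail.

    Illustrative InternVL-style heuristic, not the model's exact rule."""
    aspect = width / height
    candidates = [
        (c, r)
        for c in range(1, max_tiles + 1)
        for r in range(1, max_tiles + 1)
        if c * r <= max_tiles
    ]
    # Minimize aspect-ratio mismatch; break ties in favor of more tiles.
    cols, rows = min(candidates, key=lambda g: (abs(aspect - g[0] / g[1]), -g[0] * g[1]))
    return cols, rows, cols * rows + 1  # +1 for the global thumbnail

print(choose_tile_grid(1344, 896))  # (3, 2, 7): a 3x2 grid plus the thumbnail
```

Each selected crop is resized to `448 x 448` before being encoded, so wide or tall images keep legible text at native-like resolution instead of being squashed into a single square view.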

## Model Details

| Item | Value |
|---|---|
| Model type | Autoregressive vision-language model |
| Architecture | InternViT-300M + 2-layer MLP projector + Qwen3-1.7B |
| Parameters | Approximately 2B |
| Visual front end | Dynamic tiling with up to 12 local crops plus 1 global thumbnail |
| Token control | Pixel unshuffle; each `448 x 448` tile contributes 256 visual tokens |
| Context length | Up to 16,384 packed tokens |
| Training stack | DeepSpeed, bf16 / AMP, FlashAttention-2 |
| Hardware | 64 x NVIDIA A100 80GB GPUs |
| Training cost | ~3,500 GPU hours |
| Language coverage | Primarily English and Chinese |
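
Taking the table's numbers at face value, the visual token budget at maximum resolution is easy to work out (assuming the global thumbnail also costs 256 tokens, which this card does not state explicitly):

```python
def visual_token_budget(num_local_tiles: int, tokens_per_tile: int = 256) -> int:
    """Visual tokens after pixel unshuffle: local tiles plus 1 global thumbnail,
    each contributing `tokens_per_tile` tokens (thumbnail cost is an assumption)."""
    return (num_local_tiles + 1) * tokens_per_tile

print(visual_token_budget(12))  # 3328
```

At the 12-tile maximum that is 3,328 visual tokens, leaving roughly 13k of the 16,384-token context for the prompt and the model's reasoning.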

## Training

| Stage | Effective scale | Purpose |
|---|---:|---|
| Pre-training | 18.63M | Cross-modal alignment and general image-text learning |
| Mid-training | 2.801M | General-capability retention plus moderation and adversarial OCR injection |
| SFT | 8.408M | High-fidelity supervised tuning for rules, format, and reasoning |
| RL | 810k | GRPO rewards for classification, format, and OCR character alignment |
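
The card names GRPO but does not spell out the objective. For orientation, the core of GRPO is a group-relative advantage: sample several responses per prompt, score each with the task rewards (classification correctness, format validity, OCR character alignment), and standardize the rewards within the group. A minimal sketch of that normalization step:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: standardize each sampled response's reward
    against the mean and std of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:  # every response scored the same -> no learning signal
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Illustrative rewards for 4 sampled responses to one moderation prompt:
# responses with the correct class and valid format score 1.0, others 0.0.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```

Because advantages are computed within each group, GRPO needs no separate value model, which keeps post-training cheap at this parameter scale.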

The raw pre-training inventory contains **20,078,399** source samples across nine top-level categories, including captioning, chart and table understanding, VQA, OCR, document understanding, science, mathematics, and text-only data. Mid-training and post-training further add business moderation data, adversarial OCR data, and manually reviewed LLM-assisted annotations.

## Evaluation

### Headline Results

| Area | Metric | Xuanwu VL-2B | Reference |
|---|---|---:|---:|
| General multimodal | OpenCompass average-7 | **67.90** | InternVL 3.5 2B: 64.27 |
| Text-only retention | average-9 | **58.38** | InternVL 3.5 2B: 59.02 |
| Business moderation | Average recall over 7 categories | **94.38** | InternVL 3.5 2B: 47.98 |
| Adversarial OCR | Weighted overall recall | **82.82** | Gemini-2.5-Pro: 76.72 |

### General Multimodal Benchmarks

| Benchmark | InternVL 3.5 2B | Xuanwu VL-2B |
|---|---:|---:|
| HallusionBench | 46.78 | **47.32** |
| AI2D | 77.95 | **82.19** |
| MMStar | 56.20 | **60.47** |
| OCRBench | 83.10 | **89.80** |
| MMBench v1.1 | 75.08 | **79.02** |
| MMMU (val) | **50.51** | 48.11 |
| MathVista | 60.30 | **68.40** |
| average-7 | 64.27 | **67.90** |

### Business Moderation and Adversarial OCR

Xuanwu VL-2B reaches **94.38 average recall** over seven business moderation categories and **82.82 weighted overall recall** on the adversarial OCR benchmark. On adversarial OCR, it performs particularly strongly on `aigc`, `noise`, `warp`, and `watermark` subsets.
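
This card does not define "weighted overall recall"; a common construction, assumed here, pools detections across subsets so that larger subsets weigh proportionally more. The subset names below reuse those above, but the counts are invented for illustration:

```python
def weighted_overall_recall(subsets: dict[str, tuple[int, int]]) -> float:
    """Recall pooled over subsets: total detected positives / total positives,
    as a percentage. `subsets` maps name -> (detected, total). The construction
    is an assumption; the counts below are illustrative, not the paper's data."""
    detected = sum(d for d, _ in subsets.values())
    total = sum(t for _, t in subsets.values())
    return 100.0 * detected / total

demo = {"aigc": (90, 100), "noise": (80, 100), "warp": (70, 100), "watermark": (60, 100)}
print(weighted_overall_recall(demo))  # 75.0
```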

Gemini-2.5-Pro is reported in the paper as a zero-shot commercial control model without domain adaptation. The comparison is intended as a task-specialized reference rather than a claim of superiority under identical training conditions.

## Prompt Format

For difficult moderation and adversarial OCR cases, the paper uses the following structured response pattern:

```text
[Observation] describe the main subjects and background
[Extraction] recover visible or concealed text and symbols
[Reasoning] compare the extracted evidence against moderation rules
[Conclusion] output the final decision (Safe / Violating-Category)
```
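
Downstream moderation pipelines usually need the `[Conclusion]` field as machine-readable output. A minimal parser for the pattern above (the regex mirrors the four-section template; it is a sketch, not an official API):

```python
import re

SECTIONS = ("Observation", "Extraction", "Reasoning", "Conclusion")

def parse_structured_response(text: str) -> dict[str, str]:
    """Split an Observation/Extraction/Reasoning/Conclusion response into fields."""
    fields = {}
    any_tag = "|".join(SECTIONS)
    for name in SECTIONS:
        # Capture everything after [Name] up to the next section tag or end of text.
        match = re.search(
            rf"\[{name}\]\s*(.*?)(?=\[(?:{any_tag})\]|\Z)", text, re.DOTALL
        )
        fields[name] = match.group(1).strip() if match else ""
    return fields

demo = (
    "[Observation] a poster with dense small text "
    "[Extraction] promo code hidden in the watermark "
    "[Reasoning] matches the spam-advertising rule "
    "[Conclusion] Violating-Advertising"
)
print(parse_structured_response(demo)["Conclusion"])  # Violating-Advertising
```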

Reported evaluation results use greedy decoding with `temperature = 0` and `max_tokens = 8192`.
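
If the model is served behind an OpenAI-compatible chat endpoint (an assumption; this card does not name a serving stack), those settings map directly onto the request payload. Only `temperature` and `max_tokens` come from the card; the deployment name and message content are hypothetical:

```python
# Reported decoding settings expressed as an OpenAI-compatible request body.
# "xuanwu-vl-2b" is a hypothetical deployment name, not an official endpoint.
payload = {
    "model": "xuanwu-vl-2b",
    "temperature": 0,   # greedy decoding, per the reported evaluation setup
    "max_tokens": 8192, # per the reported evaluation setup
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
                {"type": "text", "text": "Review this image against the moderation rules."},
            ],
        }
    ],
}
```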

## Intended Uses

- Research on industrial multimodal systems, especially content moderation and adversarial OCR.
- Deployment-sensitive multimodal understanding where a compact 2B-scale model is preferred.
- Assistive moderation workflows with structured explanations and OCR-aware reasoning.

## Limitations

- The model can still miss violations in extreme cases such as ultra-dense overlapping watermarks or nearly invisible hidden text.
- In long reasoning chains, the model may occasionally hallucinate or over-attribute a violation because of local high-risk cues.
- Moderation behavior is shaped partly by business-specific policies and data, so transfer to other platforms or jurisdictions is not guaranteed.
- Outputs should be verified before use in high-stakes review decisions.

## Citation

```bibtex
@article{zhang2026xuanwu,
  title={Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems},
  author={Zhang, Zhiqian and Zhao, Xu and Xu, Xiaoqing and Liang, Guangdong and Wang, Weijia and Lv, Xiaolei and Li, Bo and Gao, Jun},
  journal={arXiv preprint arXiv:2603.29211},
  year={2026}
}
```

## Links

- Paper: [https://arxiv.org/abs/2603.29211](https://arxiv.org/abs/2603.29211)
- GitHub: [https://github.com/hellogroup-opensource/Xuanwu](https://github.com/hellogroup-opensource/Xuanwu)