---
language:
- en
- zh
license: apache-2.0
tags:
- multimodal
- vision-language-model
- image-text-to-text
- ocr
- content-moderation
- safety
- reasoning
pipeline_tag: image-text-to-text
---

# Xuanwu

### Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems

[![Paper](https://img.shields.io/badge/arXiv-2603.29211-b31b1b.svg)](https://arxiv.org/abs/2603.29211)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://www.apache.org/licenses/LICENSE-2.0)
[![GitHub](https://img.shields.io/badge/GitHub-Xuanwu-181717?logo=github)](https://github.com/hellogroup-opensource/Xuanwu)

Xuanwu VL-2B is a compact vision-language foundation model for industrial content ecosystems. It is designed to balance general multimodal understanding, fine-grained content moderation, and adversarial OCR robustness within an approximately 2B-parameter budget. The model pairs **InternViT-300M + a lightweight 2-layer MLP projector + Qwen3-1.7B** with dynamic high-resolution perception, business-oriented data curation, structured chain-of-thought supervision, and GRPO-based post-training alignment.

## Highlights

- Compact ~2B architecture for deployment-sensitive moderation settings.
- Dynamic high-resolution perception with up to 12 local `448 x 448` tiles plus a global thumbnail (see the token-budget sketch after the Training section).
- Progressive three-stage training: pre-training, mid-training, and post-training.
- Structured moderation reasoning in the form of `Observation -> Extraction -> Reasoning -> Conclusion`.
- Strong business moderation and adversarial OCR performance while retaining competitive general multimodal capability.

## Model Details

| Item | Value |
|---|---|
| Model type | Autoregressive vision-language model |
| Architecture | InternViT-300M + 2-layer MLP projector + Qwen3-1.7B |
| Parameters | Approximately 2B |
| Visual front end | Dynamic tiling with up to 12 local crops plus 1 global thumbnail |
| Token control | Pixel unshuffle; each `448 x 448` tile contributes 256 visual tokens |
| Context length | Up to 16,384 packed tokens |
| Training stack | DeepSpeed, bf16 / AMP, FlashAttention-2 |
| Hardware | 64 x NVIDIA A100 80GB GPUs |
| Training cost | ~3,500 GPU hours |
| Language coverage | Primarily English and Chinese |

## Training

| Stage | Effective scale | Purpose |
|---|---:|---|
| Pre-training | 18.63M | Cross-modal alignment and general image-text learning |
| Mid-training | 2.801M | General-capability retention plus moderation and adversarial OCR injection |
| SFT | 8.408M | High-fidelity supervised tuning for rules, format, and reasoning |
| RL | 810k | GRPO alignment for classification, format, and OCR character alignment (sketched below) |

The raw pre-training inventory contains **20,078,399** source samples across nine top-level categories, including captioning, chart and table understanding, VQA, OCR, document understanding, science, mathematics, and text-only data. Mid-training and post-training further add business moderation data, adversarial OCR data, and manually reviewed LLM-assisted annotations.
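As a concrete check on the numbers in the Model Details table, the sketch below works through the visual token budget. The 14 x 14 patch size and 2 x 2 pixel-unshuffle factor are assumptions consistent with InternViT-style encoders, not values stated on this card; only the 256-tokens-per-tile figure, the 12-tile cap, and the 16,384-token context come from the tables above.

```python
# Visual token budget implied by the Model Details table.
# ASSUMPTIONS (not stated on this card): a 14x14 ViT patch size and a 2x2
# pixel-unshuffle factor, which together reproduce the reported 256 tokens
# per tile: (448 / 14 / 2) ** 2 == 256.

TILE_SIZE = 448
PATCH_SIZE = 14          # assumed InternViT patch size
UNSHUFFLE_FACTOR = 2     # assumed pixel-unshuffle downsampling factor
TOKENS_PER_TILE = (TILE_SIZE // PATCH_SIZE // UNSHUFFLE_FACTOR) ** 2  # 256

def visual_token_budget(num_local_tiles: int, use_thumbnail: bool = True) -> int:
    """Total visual tokens for one image under the card's tiling scheme."""
    if not 1 <= num_local_tiles <= 12:
        raise ValueError("the card reports at most 12 local tiles per image")
    total_tiles = num_local_tiles + (1 if use_thumbnail else 0)  # + global thumbnail
    return total_tiles * TOKENS_PER_TILE

# Worst case: 12 local tiles + 1 global thumbnail = 13 * 256 = 3,328 visual
# tokens, leaving roughly 13k of the 16,384-token packed context for text.
print(visual_token_budget(12))  # -> 3328
```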
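The RL row in the Training table refers to GRPO, whose defining step is normalizing each sampled response's reward against its own group of rollouts, so no learned critic is needed. The sketch below shows that group-relative advantage computation in isolation; the binary reward in the example is illustrative, since the paper's exact classification, format, and OCR character rewards are not reproduced on this card.

```python
# Group-relative advantage computation at the core of GRPO. The reward
# values here are placeholders; the actual rewards combine classification,
# format, and OCR character alignment signals as described above.

from statistics import mean, stdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward against its group's mean and std,
    replacing the learned value function used by PPO-style methods."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts for one moderation prompt: two get both the label and the
# structured format right (reward 1.0), two do not (reward 0.0).
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
# -> approximately [0.866, -0.866, 0.866, -0.866]
```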
## Evaluation

### Headline Results

| Area | Metric | Xuanwu VL-2B | Reference |
|---|---|---:|---:|
| General multimodal | OpenCompass average-7 | **67.90** | InternVL 3.5 2B: 64.27 |
| Text-only retention | average-9 | 58.38 | InternVL 3.5 2B: 59.02 |
| Business moderation | average recall over 7 categories | **94.38** | InternVL 3.5 2B: 47.98 |
| Adversarial OCR | weighted overall recall | **82.82** | Gemini-2.5-Pro: 76.72 |

### General Multimodal Benchmarks

| Benchmark | InternVL 3.5 2B | Xuanwu VL-2B |
|---|---:|---:|
| HallusionBench | 46.78 | **47.32** |
| AI2D | 77.95 | **82.19** |
| MMStar | 56.20 | **60.47** |
| OCRBench | 83.10 | **89.80** |
| MMBench v1.1 | 75.08 | **79.02** |
| MMMU (val) | **50.51** | 48.11 |
| MathVista | 60.30 | **68.40** |
| average-7 | 64.27 | **67.90** |

### Business Moderation and Adversarial OCR

Xuanwu VL-2B reaches **94.38 average recall** over seven business moderation categories and **82.82 weighted overall recall** on the adversarial OCR benchmark. On adversarial OCR, it performs particularly strongly on the `aigc`, `noise`, `warp`, and `watermark` subsets.

Gemini-2.5-Pro is reported in the paper as a zero-shot commercial control model without domain adaptation. The comparison is intended as a task-specialized reference rather than a claim of superiority under identical training conditions.

## Prompt Format

For difficult moderation and adversarial OCR cases, the paper uses the following structured response pattern (see the inference sketch at the end of this card):

```text
[Observation] describe the main subjects and background
[Extraction] recover visible or concealed text and symbols
[Reasoning] compare the extracted evidence against moderation rules
[Conclusion] output the final decision (Safe / Violating-Category)
```

Reported evaluation results use greedy decoding with `temperature = 0` and `max_tokens = 8192`.

## Intended Uses

- Research on industrial multimodal systems, especially content moderation and adversarial OCR.
- Deployment-sensitive multimodal understanding where a compact 2B-scale model is preferred.
- Assistive moderation workflows with structured explanations and OCR-aware reasoning.

## Limitations

- The model can still miss violations in extreme cases such as ultra-dense overlapping watermarks or nearly invisible hidden text.
- In long reasoning chains, the model may occasionally hallucinate or over-attribute a violation on the basis of local high-risk cues.
- Moderation behavior is shaped partly by business-specific policies and data, so transfer to other platforms or jurisdictions is not guaranteed.
- Outputs should be verified before use in high-stakes review decisions.

## Citation

```bibtex
@article{zhang2026xuanwu,
  title={Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems},
  author={Zhang, Zhiqian and Zhao, Xu and Xu, Xiaoqing and Liang, Guangdong and Wang, Weijia and Lv, Xiaolei and Li, Bo and Gao, Jun},
  journal={arXiv preprint arXiv:2603.29211},
  year={2026}
}
```

## Links

- Paper: [https://arxiv.org/abs/2603.29211](https://arxiv.org/abs/2603.29211)
- GitHub: [https://github.com/hellogroup-opensource/Xuanwu](https://github.com/hellogroup-opensource/Xuanwu)
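## Inference Sketch

The snippet below is a hypothetical sketch, not the model's documented API: the loading call and the instruction wording are placeholders, and the GitHub repository linked above is the authoritative reference for the real interface. What the sketch does pin down from this card is the structured response pattern from the Prompt Format section and the reported greedy decoding settings (`temperature = 0`, expressed in `transformers` as `do_sample=False`, and `max_tokens = 8192` as `max_new_tokens`).

```python
# Hypothetical inference setup for Xuanwu VL-2B. Only the prompt structure
# and decoding settings below are taken from this card; everything else is
# a placeholder to be replaced with the actual interface from the GitHub repo.

from transformers import GenerationConfig

# Structured moderation prompt following the card's Prompt Format section.
# The surrounding instruction wording is illustrative, not from the paper.
MODERATION_PROMPT = (
    "Inspect the image and answer in exactly this format:\n"
    "[Observation] describe the main subjects and background\n"
    "[Extraction] recover visible or concealed text and symbols\n"
    "[Reasoning] compare the extracted evidence against moderation rules\n"
    "[Conclusion] output the final decision (Safe / Violating-Category)"
)

# Greedy decoding as reported for evaluation: temperature = 0 corresponds to
# do_sample=False in transformers; max_tokens maps to max_new_tokens.
generation_config = GenerationConfig(do_sample=False, max_new_tokens=8192)

# Placeholder loading call; the real repo id and chat interface may differ:
# from transformers import AutoModel, AutoTokenizer
# model = AutoModel.from_pretrained("<xuanwu-vl-2b-repo-id>", trust_remote_code=True)
```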