---
language:
- en
- zh
license: apache-2.0
tags:
- multimodal
- vision-language-model
- image-text-to-text
- ocr
- content-moderation
- safety
- reasoning
pipeline_tag: image-text-to-text
---

# Xuanwu

### Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems

[Paper](https://arxiv.org/abs/2603.29211)
[License](https://www.apache.org/licenses/LICENSE-2.0)
[GitHub](https://github.com/hellogroup-opensource/Xuanwu)

Xuanwu VL-2B is a compact vision-language foundation model for industrial content ecosystems. It is designed to balance general multimodal understanding, fine-grained content moderation, and adversarial OCR robustness within an approximately 2B-parameter budget.

The model combines **InternViT-300M + a lightweight 2-layer MLP projector + Qwen3-1.7B**, together with dynamic high-resolution perception, business-oriented data curation, structured chain-of-thought supervision, and GRPO-based post-training alignment.
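
The projector is the small glue component between the two pretrained backbones: it maps each visual token from the ViT feature space into the LLM embedding space. A minimal sketch of what a 2-layer MLP projector computes (the 1024 -> 2048 dimensions are illustrative assumptions, not the released configuration; in a real implementation this would be a PyTorch `nn.Sequential` of Linear -> GELU -> Linear):

```python
import numpy as np

rng = np.random.default_rng(0)
vit_dim, llm_dim = 1024, 2048  # assumed feature sizes, for illustration only

# Two linear layers with a GELU nonlinearity in between.
W1 = rng.standard_normal((vit_dim, llm_dim)) * 0.02
b1 = np.zeros(llm_dim)
W2 = rng.standard_normal((llm_dim, llm_dim)) * 0.02
b2 = np.zeros(llm_dim)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def project(visual_tokens):
    """Map (num_tokens, vit_dim) ViT features to (num_tokens, llm_dim) embeddings."""
    return gelu(visual_tokens @ W1 + b1) @ W2 + b2

tokens = rng.standard_normal((256, vit_dim))  # one 448x448 tile -> 256 visual tokens
print(project(tokens).shape)  # (256, 2048)
```

The projected tokens are then interleaved with text embeddings and fed to the language model as an ordinary sequence.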

## Highlights

- Compact ~2B architecture for deployment-sensitive moderation settings.
- Dynamic high-resolution perception with up to 12 local `448 x 448` tiles plus a global thumbnail.
- Progressive three-stage training: pre-training, mid-training, and post-training.
- Structured moderation reasoning in the form of `Observation -> Extraction -> Reasoning -> Conclusion`.
- Strong business moderation and adversarial OCR performance while retaining competitive general multimodal capability.
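
The dynamic high-resolution step can be pictured as choosing a tile grid whose aspect ratio matches the input image, subject to the 12-tile cap. The heuristic below follows the common InternVL-style recipe; the exact grid-selection rule used by Xuanwu is not spelled out in this card, so treat this as a sketch:

```python
def choose_tile_grid(width: int, height: int, max_tiles: int = 12):
    """Pick a (cols, rows) grid of 448x448 tiles whose aspect ratio best
    matches the image, then add one global thumbnail.

    Illustrative InternVL-style heuristic, not the model's exact rule."""
    aspect = width / height
    candidates = [
        (c, r)
        for c in range(1, max_tiles + 1)
        for r in range(1, max_tiles + 1)
        if c * r <= max_tiles
    ]
    # Minimize aspect-ratio mismatch; break ties in favor of more tiles.
    cols, rows = min(candidates, key=lambda g: (abs(aspect - g[0] / g[1]), -g[0] * g[1]))
    return cols, rows, cols * rows + 1  # +1 for the global thumbnail

print(choose_tile_grid(1344, 896))  # (3, 2, 7): a 3x2 grid plus the thumbnail
```

Each selected crop is resized to `448 x 448` before being encoded, so wide or tall images keep legible text at native-like resolution instead of being squashed into a single square view.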

## Model Details

| Item | Value |
|---|---|
| Model type | Autoregressive vision-language model |
| Architecture | InternViT-300M + 2-layer MLP projector + Qwen3-1.7B |
| Parameters | Approximately 2B |
| Visual front end | Dynamic tiling with up to 12 local crops plus 1 global thumbnail |
| Token control | Pixel unshuffle; each `448 x 448` tile contributes 256 visual tokens |
| Context length | Up to 16,384 packed tokens |
| Training stack | DeepSpeed, bf16 / AMP, FlashAttention-2 |
| Hardware | 64 x NVIDIA A100 80GB GPUs |
| Training cost | ~3,500 GPU hours |
| Language coverage | Primarily English and Chinese |
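
Taking the table's numbers at face value, the visual token budget at maximum resolution is easy to work out (assuming the global thumbnail also costs 256 tokens, which this card does not state explicitly):

```python
def visual_token_budget(num_local_tiles: int, tokens_per_tile: int = 256) -> int:
    """Visual tokens after pixel unshuffle: local tiles plus 1 global thumbnail,
    each contributing `tokens_per_tile` tokens (thumbnail cost is an assumption)."""
    return (num_local_tiles + 1) * tokens_per_tile

print(visual_token_budget(12))  # 3328
```

At the 12-tile maximum that is 3,328 visual tokens, leaving roughly 13k of the 16,384-token context for the prompt and the model's reasoning.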

## Training

| Stage | Effective scale | Purpose |
|---|---:|---|
| Pre-training | 18.63M | Cross-modal alignment and general image-text learning |
| Mid-training | 2.801M | General-capability retention plus moderation and adversarial OCR injection |
| SFT | 8.408M | High-fidelity supervised tuning for rules, format, and reasoning |
| RL | 810k | GRPO rewards for classification, format, and OCR character alignment |
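
The card names GRPO but does not spell out the objective. For orientation, the core of GRPO is a group-relative advantage: sample several responses per prompt, score each with the task rewards (classification correctness, format validity, OCR character alignment), and standardize the rewards within the group. A minimal sketch of that normalization step:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: standardize each sampled response's reward
    against the mean and std of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:  # every response scored the same -> no learning signal
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Illustrative rewards for 4 sampled responses to one moderation prompt:
# responses with the correct class and valid format score 1.0, others 0.0.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```

Because advantages are computed within each group, GRPO needs no separate value model, which keeps post-training cheap at this parameter scale.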

The raw pre-training inventory contains **20,078,399** source samples across nine top-level categories, including captioning, chart and table understanding, VQA, OCR, document understanding, science, mathematics, and text-only data. Mid-training and post-training further add business moderation data, adversarial OCR data, and manually reviewed LLM-assisted annotations.

## Evaluation

### Headline Results

| Area | Metric | Xuanwu VL-2B | Reference |
|---|---|---:|---:|
| General multimodal | OpenCompass average-7 | **67.90** | InternVL 3.5 2B: 64.27 |
| Text-only retention | average-9 | **58.38** | InternVL 3.5 2B: 59.02 |
| Business moderation | Average recall over 7 categories | **94.38** | InternVL 3.5 2B: 47.98 |
| Adversarial OCR | Weighted overall recall | **82.82** | Gemini-2.5-Pro: 76.72 |

### General Multimodal Benchmarks

| Benchmark | InternVL 3.5 2B | Xuanwu VL-2B |
|---|---:|---:|
| HallusionBench | 46.78 | **47.32** |
| AI2D | 77.95 | **82.19** |
| MMStar | 56.20 | **60.47** |
| OCRBench | 83.10 | **89.80** |
| MMBench v1.1 | 75.08 | **79.02** |
| MMMU (val) | **50.51** | 48.11 |
| MathVista | 60.30 | **68.40** |
| average-7 | 64.27 | **67.90** |

### Business Moderation and Adversarial OCR

Xuanwu VL-2B reaches **94.38 average recall** over seven business moderation categories and **82.82 weighted overall recall** on the adversarial OCR benchmark. On adversarial OCR, it performs particularly strongly on `aigc`, `noise`, `warp`, and `watermark` subsets.
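
This card does not define "weighted overall recall"; a common construction, assumed here, pools detections across subsets so that larger subsets weigh proportionally more. The subset names below reuse those above, but the counts are invented for illustration:

```python
def weighted_overall_recall(subsets: dict[str, tuple[int, int]]) -> float:
    """Recall pooled over subsets: total detected positives / total positives,
    as a percentage. `subsets` maps name -> (detected, total). The construction
    is an assumption; the counts below are illustrative, not the paper's data."""
    detected = sum(d for d, _ in subsets.values())
    total = sum(t for _, t in subsets.values())
    return 100.0 * detected / total

demo = {"aigc": (90, 100), "noise": (80, 100), "warp": (70, 100), "watermark": (60, 100)}
print(weighted_overall_recall(demo))  # 75.0
```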

Gemini-2.5-Pro is reported in the paper as a zero-shot commercial control model without domain adaptation. The comparison is intended as a task-specialized reference rather than a claim of superiority under identical training conditions.

## Prompt Format

For difficult moderation and adversarial OCR cases, the paper uses the following structured response pattern:

```text
[Observation] describe the main subjects and background
[Extraction] recover visible or concealed text and symbols
[Reasoning] compare the extracted evidence against moderation rules
[Conclusion] output the final decision (Safe / Violating-Category)
```
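
Downstream moderation pipelines usually need the `[Conclusion]` field as machine-readable output. A minimal parser for the pattern above (the regex mirrors the four-section template; it is a sketch, not an official API):

```python
import re

SECTIONS = ("Observation", "Extraction", "Reasoning", "Conclusion")

def parse_structured_response(text: str) -> dict[str, str]:
    """Split an Observation/Extraction/Reasoning/Conclusion response into fields."""
    fields = {}
    any_tag = "|".join(SECTIONS)
    for name in SECTIONS:
        # Capture everything after [Name] up to the next section tag or end of text.
        match = re.search(
            rf"\[{name}\]\s*(.*?)(?=\[(?:{any_tag})\]|\Z)", text, re.DOTALL
        )
        fields[name] = match.group(1).strip() if match else ""
    return fields

demo = (
    "[Observation] a poster with dense small text "
    "[Extraction] promo code hidden in the watermark "
    "[Reasoning] matches the spam-advertising rule "
    "[Conclusion] Violating-Advertising"
)
print(parse_structured_response(demo)["Conclusion"])  # Violating-Advertising
```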

Reported evaluation results use greedy decoding with `temperature = 0` and `max_tokens = 8192`.
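
If the model is served behind an OpenAI-compatible chat endpoint (an assumption; this card does not name a serving stack), those settings map directly onto the request payload. Only `temperature` and `max_tokens` come from the card; the deployment name and message content are hypothetical:

```python
# Reported decoding settings expressed as an OpenAI-compatible request body.
# "xuanwu-vl-2b" is a hypothetical deployment name, not an official endpoint.
payload = {
    "model": "xuanwu-vl-2b",
    "temperature": 0,   # greedy decoding, per the reported evaluation setup
    "max_tokens": 8192, # per the reported evaluation setup
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
                {"type": "text", "text": "Review this image against the moderation rules."},
            ],
        }
    ],
}
```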

## Intended Uses

- Research on industrial multimodal systems, especially content moderation and adversarial OCR.
- Deployment-sensitive multimodal understanding where a compact 2B-scale model is preferred.
- Assistive moderation workflows with structured explanations and OCR-aware reasoning.

## Limitations

- The model can still miss violations in extreme cases such as ultra-dense overlapping watermarks or nearly invisible hidden text.
- In long reasoning chains, the model may occasionally hallucinate or over-attribute a violation because of local high-risk cues.
- Moderation behavior is shaped partly by business-specific policies and data, so transfer to other platforms or jurisdictions is not guaranteed.
- Outputs should be verified before use in high-stakes review decisions.

## Citation

```bibtex
@article{zhang2026xuanwu,
  title={Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems},
  author={Zhang, Zhiqian and Zhao, Xu and Xu, Xiaoqing and Liang, Guangdong and Wang, Weijia and Lv, Xiaolei and Li, Bo and Gao, Jun},
  journal={arXiv preprint arXiv:2603.29211},
  year={2026}
}
```

## Links

- Paper: [https://arxiv.org/abs/2603.29211](https://arxiv.org/abs/2603.29211)
- GitHub: [https://github.com/hellogroup-opensource/Xuanwu](https://github.com/hellogroup-opensource/Xuanwu)