---
language:
- en
- zh
license: apache-2.0
tags:
- multimodal
- vision-language-model
- image-text-to-text
- ocr
- content-moderation
- safety
- reasoning
pipeline_tag: image-text-to-text
---

# Xuanwu

### Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems

[Paper](https://arxiv.org/abs/2603.29211)
[License: Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)
[GitHub](https://github.com/hellogroup-opensource/Xuanwu)

Xuanwu VL-2B is a compact vision-language foundation model for industrial content ecosystems. It is designed to balance general multimodal understanding, fine-grained content moderation, and adversarial OCR robustness within an approximately 2B-parameter budget.

The model combines **InternViT-300M + a lightweight 2-layer MLP projector + Qwen3-1.7B**, together with dynamic high-resolution perception, business-oriented data curation, structured chain-of-thought supervision, and GRPO-based post-training alignment.

## Highlights

- Compact ~2B architecture for deployment-sensitive moderation settings.
- Dynamic high-resolution perception with up to 12 local `448 x 448` tiles plus a global thumbnail.
- Progressive three-stage training: pre-training, mid-training, and post-training.
- Structured moderation reasoning in the form of `Observation -> Extraction -> Reasoning -> Conclusion`.
- Strong business moderation and adversarial OCR performance while retaining competitive general multimodal capability.

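The tile-grid selection behind dynamic high-resolution perception can be sketched as follows. This is an illustrative, InternVL-style heuristic that picks the grid whose aspect ratio best matches the image; the released preprocessing may differ in details such as tie-breaking.

```python
def choose_tile_grid(width: int, height: int, max_tiles: int = 12):
    """Pick a (cols, rows) grid of 448x448 tiles whose aspect ratio
    best matches the image; illustrative InternVL-style heuristic."""
    target = width / height
    candidates = [
        (c, r)
        for c in range(1, max_tiles + 1)
        for r in range(1, max_tiles + 1)
        if c * r <= max_tiles
    ]
    # Closest aspect ratio wins; prefer the smaller grid on ties.
    return min(candidates, key=lambda cr: (abs(cr[0] / cr[1] - target), cr[0] * cr[1]))


# A 1344x448 banner maps to a 3x1 grid of local 448x448 tiles;
# a global 448x448 thumbnail is added alongside them.
print(choose_tile_grid(1344, 448))  # (3, 1)
```
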
## Model Details

| Item | Value |
|---|---|
| Model type | Autoregressive vision-language model |
| Architecture | InternViT-300M + 2-layer MLP projector + Qwen3-1.7B |
| Parameters | Approximately 2B |
| Visual front end | Dynamic tiling with up to 12 local crops plus 1 global thumbnail |
| Token control | Pixel Unshuffle; each `448 x 448` tile contributes 256 visual tokens |
| Context length | Up to 16,384 packed tokens |
| Training stack | DeepSpeed, bf16 / AMP, Flash Attention-2 |
| Hardware | 64 x NVIDIA A100 80GB GPUs |
| Training cost | ~3,500 GPU hours |
| Language coverage | Primarily English and Chinese |

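The 256-token figure follows from the tile geometry, assuming InternViT's 14-pixel patches and a 2x Pixel Unshuffle; the downsampling factor is inferred from the numbers rather than stated in this card.

```python
# Per-tile visual token budget. Patch size 14 matches InternViT;
# the 2x unshuffle factor is inferred from 256 tokens per tile.
TILE = 448
PATCH = 14       # assumed ViT patch size
UNSHUFFLE = 2    # assumed pixel-unshuffle downsampling factor

patches_per_side = TILE // PATCH                  # 448 / 14 = 32
tokens_per_side = patches_per_side // UNSHUFFLE   # 32 / 2 = 16
tokens_per_tile = tokens_per_side ** 2            # 16 * 16 = 256

# Worst case: 12 local tiles + 1 global thumbnail.
max_visual_tokens = (12 + 1) * tokens_per_tile
print(tokens_per_tile, max_visual_tokens)  # 256 3328
```
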
## Training

| Stage | Effective scale (samples) | Purpose |
|---|---:|---|
| Pre-training | 18.63M | Cross-modal alignment and general image-text learning |
| Mid-training | 2.801M | General-capability retention plus moderation and adversarial OCR injection |
| SFT | 8.408M | High-fidelity supervised tuning for rules, format, and reasoning |
| RL | 810k | GRPO alignment for classification, format, and OCR character alignment |

The raw pre-training inventory contains **20,078,399** source samples across nine top-level categories, including captioning, chart and table understanding, VQA, OCR, document understanding, science, mathematics, and text-only data. Mid-training and post-training further add business moderation data, adversarial OCR data, and manually reviewed LLM-assisted annotations.

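GRPO's central step, standardizing each sampled response's reward against its sampling group, can be sketched as below; the reward functions for classification, format, and OCR character alignment are paper-specific and not reproduced here.

```python
def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages: standardize each reward by the
    mean and std of its sampling group (GRPO's core normalization)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# Two correct and two incorrect responses in one group:
# correct ones get a positive advantage, incorrect ones a negative.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
```
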
## Evaluation

### Headline Results

| Area | Metric | Xuanwu VL-2B | Reference |
|---|---|---:|---:|
| General multimodal | OpenCompass average-7 | **67.90** | InternVL 3.5 2B: 64.27 |
| Text-only retention | average-9 | 58.38 | InternVL 3.5 2B: 59.02 |
| Business moderation | average recall over 7 categories | **94.38** | InternVL 3.5 2B: 47.98 |
| Adversarial OCR | weighted overall recall | **82.82** | Gemini-2.5-Pro: 76.72 |

### General Multimodal Benchmarks

| Benchmark | InternVL 3.5 2B | Xuanwu VL-2B |
|---|---:|---:|
| HallusionBench | 46.78 | **47.32** |
| AI2D | 77.95 | **82.19** |
| MMStar | 56.20 | **60.47** |
| OCRBench | 83.10 | **89.80** |
| MMBench v1.1 | 75.08 | **79.02** |
| MMMU (val) | **50.51** | 48.11 |
| MathVista | 60.30 | **68.40** |
| average-7 | 64.27 | **67.90** |

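The average-7 row can be reproduced directly from the seven per-benchmark scores:

```python
# Per-benchmark scores in table order: HallusionBench, AI2D, MMStar,
# OCRBench, MMBench v1.1, MMMU (val), MathVista.
xuanwu = [47.32, 82.19, 60.47, 89.80, 79.02, 48.11, 68.40]
internvl = [46.78, 77.95, 56.20, 83.10, 75.08, 50.51, 60.30]

def avg(scores):
    return round(sum(scores) / len(scores), 2)

print(avg(xuanwu), avg(internvl))  # 67.9 64.27
```
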
### Business Moderation and Adversarial OCR

Xuanwu VL-2B reaches **94.38 average recall** over seven business moderation categories and **82.82 weighted overall recall** on the adversarial OCR benchmark. On adversarial OCR, it performs particularly strongly on the `aigc`, `noise`, `warp`, and `watermark` subsets.

Gemini-2.5-Pro is reported in the paper as a zero-shot commercial control model without domain adaptation. The comparison is intended as a task-specialized reference rather than a claim of superiority under identical training conditions.

## Prompt Format

For difficult moderation and adversarial OCR cases, the paper uses the following structured response pattern:

```text
[Observation] describe the main subjects and background
[Extraction] recover visible or concealed text and symbols
[Reasoning] compare the extracted evidence against moderation rules
[Conclusion] output the final decision (Safe / Violating-Category)
```

Reported evaluation results use greedy decoding with `temperature = 0` and `max_tokens = 8192`.

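A response in this pattern can be split back into its four sections with a small parser. The snippet below is an illustrative sketch, not part of the released tooling:

```python
import re

SECTIONS = ("Observation", "Extraction", "Reasoning", "Conclusion")

def parse_moderation_response(text: str) -> dict:
    """Map each [Section] tag to its content (illustrative sketch)."""
    names = "|".join(SECTIONS)
    pattern = rf"\[({names})\]\s*(.*?)(?=\[(?:{names})\]|\Z)"
    return {k: v.strip() for k, v in re.findall(pattern, text, flags=re.S)}

resp = (
    "[Observation] a street photo with a billboard\n"
    "[Extraction] billboard text: 'SALE 50%'\n"
    "[Reasoning] no moderation rule matches the extracted text\n"
    "[Conclusion] Safe"
)
print(parse_moderation_response(resp)["Conclusion"])  # Safe
```
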
## Intended Uses

- Research on industrial multimodal systems, especially content moderation and adversarial OCR.
- Deployment-sensitive multimodal understanding where a compact 2B-scale model is preferred.
- Assistive moderation workflows with structured explanations and OCR-aware reasoning.

## Limitations

- The model can still miss violations in extreme cases such as ultra-dense overlapping watermarks or nearly invisible hidden text.
- In long reasoning chains, the model may occasionally hallucinate or over-attribute a violation because of local high-risk cues.
- Moderation behavior is shaped partly by business-specific policies and data, so transfer to other platforms or jurisdictions is not guaranteed.
- Outputs should be verified before use in high-stakes review decisions.

## Citation

```bibtex
@article{zhang2026xuanwu,
  title={Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems},
  author={Zhang, Zhiqian and Zhao, Xu and Xu, Xiaoqing and Liang, Guangdong and Wang, Weijia and Lv, Xiaolei and Li, Bo and Gao, Jun},
  journal={arXiv preprint arXiv:2603.29211},
  year={2026}
}
```

## Links

- Paper: [https://arxiv.org/abs/2603.29211](https://arxiv.org/abs/2603.29211)
- GitHub: [https://github.com/hellogroup-opensource/Xuanwu](https://github.com/hellogroup-opensource/Xuanwu)