| --- |
| license: apache-2.0 |
| base_model: infly/Infinity-Parser2-Pro |
| pipeline_tag: image-text-to-text |
| tags: [ocr, document-parsing, vlm, awq, int4, w4a16, quantized, vllm] |
| --- |
| |
| # Infinity-Parser2-Pro — AWQ W4A16 (int4) |
|
|
| An int4 (W4A16) AWQ quantization of [`infly/Infinity-Parser2-Pro`](https://huggingface.co/infly/Infinity-Parser2-Pro), made to run on **NVIDIA A100 / sm80**, where the original FP8 path is unsupported. ~70 GB bf16 → **~21 GB**. |
|
|
| ## Method |
| AWQ via [llm-compressor](https://github.com/vllm-project/llm-compressor), **routed experts only** (W4A16). Attention, GDN/linear-attention, shared expert, vision tower, and `lm_head` are kept in bf16. Calibrated on a few hundred diverse document + general-vision samples. |
|
|
| ## Serving note (important) |
| vLLM fuses some bf16 layers before consulting the ignore list, which can otherwise yield all-`!` output. Fix: the saved `config.json` `quantization_config.ignore` uses broad regexes matching the *fused* names. Already applied here. |
|
|
| ## Quality (VLMEvalKit, AI-judged reproduction) |
| | Benchmark | Published bf16 | This int4 | |
| |---|---|---| |
| | MMStar | 69.7 | 66.9 | |
| | OCRBench | 86.2 | ~89 | |
| | DocVQA (val) | 96.4 | 96.4 | |
|
|
| Near-lossless on the cleanly-comparable axes. (MMBench omitted to avoid a circular-vs-vanilla scoring mismatch.) |
|
|
| ## Usage (vLLM) |
| ``` |
| vllm serve spectator2026/Infinity-Parser2-Pro-AWQ-W4A16 --dtype bfloat16 --trust-remote-code --reasoning-parser qwen3 |
| ``` |
| Pass `chat_template_kwargs={"enable_thinking": false}` in requests, or answers land in the reasoning channel. |
|
|
| --- |
| Quantized by [@spectator2026](https://huggingface.co/spectator2026). Original model © infly, Apache-2.0. |
|
|