Add int4 AWQ W4A16 checkpoint

37c6a3d verified 7 days ago

1.67 kB

license: apache-2.0
base_model: infly/Infinity-Parser2-Flash
pipeline_tag: image-text-to-text
tags:
  - ocr
  - document-parsing
  - vlm
  - awq
  - int4
  - w4a16
  - quantized
  - vllm

Infinity-Parser2-Flash — AWQ W4A16 (int4)

An int4 (W4A16) AWQ quantization of infly/Infinity-Parser2-Flash, made to run on NVIDIA A100 / sm80, where the original FP8 path is unsupported. 4.2 GB bf16 → **2.9 GB**.

Method

AWQ via llm-compressor, routed experts only (W4A16). Attention, GDN/linear-attention, shared expert, vision tower, and lm_head are kept in bf16. Calibrated on a few hundred diverse document + general-vision samples.

Serving note (important)

vLLM fuses some bf16 layers before consulting the ignore list, which can otherwise yield all-! output. Fix: the saved config.json quantization_config.ignore uses broad regexes matching the fused names. Already applied here.

Quality (VLMEvalKit, AI-judged reproduction)

Benchmark	Published bf16	This int4
MMStar	57.1	54.8
OCRBench	81.6	85.0
DocVQA (val)	93.2	93.5

Near-lossless on the cleanly-comparable axes. (MMBench omitted to avoid a circular-vs-vanilla scoring mismatch.)

Usage (vLLM)

vllm serve spectator2026/Infinity-Parser2-Flash-AWQ-W4A16 --dtype bfloat16 --trust-remote-code --reasoning-parser qwen3

Pass chat_template_kwargs={"enable_thinking": false} in requests, or answers land in the reasoning channel.