spectator2026
/

Infinity-Parser2-Flash-AWQ-W4A16

Image-Text-to-Text

document-parsing

compressed-tensors

Model card Files Files and versions

Infinity-Parser2-Flash-AWQ-W4A16 / README.md

spectator2026's picture

Add int4 AWQ W4A16 checkpoint

37c6a3d verified 8 days ago

|

history blame contribute delete

1.67 kB

	---
	license: apache-2.0
	base_model: infly/Infinity-Parser2-Flash
	pipeline_tag: image-text-to-text
	tags: [ocr, document-parsing, vlm, awq, int4, w4a16, quantized, vllm]
	---

	# Infinity-Parser2-Flash — AWQ W4A16 (int4)

	An int4 (W4A16) AWQ quantization of [`infly/Infinity-Parser2-Flash`](https://huggingface.co/infly/Infinity-Parser2-Flash), made to run on NVIDIA A100 / sm80, where the original FP8 path is unsupported. ~4.2 GB bf16 → ~2.9 GB.

	## Method
	AWQ via [llm-compressor](https://github.com/vllm-project/llm-compressor), routed experts only (W4A16). Attention, GDN/linear-attention, shared expert, vision tower, and `lm_head` are kept in bf16. Calibrated on a few hundred diverse document + general-vision samples.

	## Serving note (important)
	vLLM fuses some bf16 layers before consulting the ignore list, which can otherwise yield all-`!` output. Fix: the saved `config.json` `quantization_config.ignore` uses broad regexes matching the fused names. Already applied here.

	## Quality (VLMEvalKit, AI-judged reproduction)
	\| Benchmark \| Published bf16 \| This int4 \|
	\|---\|---\|---\|
	\| MMStar \| 57.1 \| 54.8 \|
	\| OCRBench \| 81.6 \| 85.0 \|
	\| DocVQA (val) \| 93.2 \| 93.5 \|

	Near-lossless on the cleanly-comparable axes. (MMBench omitted to avoid a circular-vs-vanilla scoring mismatch.)

	## Usage (vLLM)
	```
	vllm serve spectator2026/Infinity-Parser2-Flash-AWQ-W4A16 --dtype bfloat16 --trust-remote-code --reasoning-parser qwen3
	```
	Pass `chat_template_kwargs={"enable_thinking": false}` in requests, or answers land in the reasoning channel.

	---
	Quantized by [@spectator2026](https://huggingface.co/spectator2026). Original model © infly, Apache-2.0.