File size: 2,538 Bytes
93486da
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
---
license: apache-2.0
base_model: infly/Infinity-Parser2-Flash
pipeline_tag: image-text-to-text
library_name: gguf
language:
  - en
  - zh
tags:
  - ocr
  - document-parsing
  - document-understanding
  - vlm
  - vision-language
  - gguf
  - llama.cpp
  - q6_k
  - imatrix
  - quantized
---

# Infinity-Parser2-Flash — Q6_K GGUF (+ vision mmproj)

A **Q6_K GGUF** quantization of [`infly/Infinity-Parser2-Flash`](https://huggingface.co/infly/Infinity-Parser2-Flash) for **llama.cpp / `llama-server`**, so the model runs on a **single consumer GPU** (validated on an RTX 3080 Ti, 12 GB) without vLLM. ~4.2 GB bf16 → **~1.5 GB** Q6_K weights (+ 0.67 GB f16 vision projector).

The base is a Qwen3.5-architecture vision-language model for document understanding: OCR, layout analysis, tables→HTML, charts→JSON, formulas→LaTeX, and Markdown conversion (EN/ZH).

## Files
| File | What |
|---|---|
| `Infinity-Parser2-Flash-Q6_K.gguf` | Q6_K-quantized weights (imatrix) |
| `Infinity-Parser2-Flash-mmproj-f16.gguf` | f16 multimodal projector — **required for image input** |

## Method
`convert_hf_to_gguf` → f16 GGUF → `llama-quantize Q6_K` with an **importance matrix** computed from a clean native-PDF document corpus (~519 k tokens). (`llama-imatrix` is text-only; the mmproj carries the vision tower at serve time.)

## Quality (VLMEvalKit, vs published bf16)
| Benchmark | bf16 | Q6_K GGUF |
|---|---|---|
| DocVQA (val) | 93.80 | 93.63 |
| OCRBench | 84.3 | 82.8 |
| MMStar / MMBench | ref | ≥ bf16 |

Effectively **lossless** for the 6-bit quant. The small OCRBench dip is **not** the quantization — an f16 GGUF on the same stack scores ≈ 83.0 ≈ Q6_K's 82.8, so the residual gap is the llama.cpp vision preprocessing (candle CLIP), not the 6-bit weights.

## Serving (llama.cpp)
```bash
llama-server \
  --model Infinity-Parser2-Flash-Q6_K.gguf \
  --mmproj Infinity-Parser2-Flash-mmproj-f16.gguf \
  --ctx-size 32768 --n-gpu-layers 99 \
  --host 0.0.0.0 --port 8105
```
OpenAI-compatible `/v1/chat/completions` with `image_url` content. Notes:
- **Reasoning-capable model:** output may arrive in the `reasoning_content` channel (llama.cpp routes the think block there) — read it accordingly, or disable thinking.
- A 16 MP page ≈ 15.6 K vision tokens, so `--ctx-size 32768` comfortably fits one page + output.

---
Quantized by [@spectator2026](https://huggingface.co/spectator2026). Original model © infly, Apache-2.0 — see the [base model card](https://huggingface.co/infly/Infinity-Parser2-Flash).