spectator2026 commited on
Commit
93486da
·
verified ·
1 Parent(s): 5f40ea0

Upload folder using huggingface_hub

Browse files
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
 
 
 
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ Infinity-Parser2-Flash-Q6_K.gguf filter=lfs diff=lfs merge=lfs -text
37
+ Infinity-Parser2-Flash-mmproj-f16.gguf filter=lfs diff=lfs merge=lfs -text
Infinity-Parser2-Flash-Q6_K.gguf ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:43ebc6070502467e96dfacc7423d75e380b11be05a3edced071a9faf707a9f62
3
+ size 1556391232
Infinity-Parser2-Flash-mmproj-f16.gguf ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1a04d63111ac206c7db99f17cf6f1f7a8bd19c07baa1e17522f9ceef732f3d7a
3
+ size 671373056
README.md ADDED
@@ -0,0 +1,59 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: apache-2.0
3
+ base_model: infly/Infinity-Parser2-Flash
4
+ pipeline_tag: image-text-to-text
5
+ library_name: gguf
6
+ language:
7
+ - en
8
+ - zh
9
+ tags:
10
+ - ocr
11
+ - document-parsing
12
+ - document-understanding
13
+ - vlm
14
+ - vision-language
15
+ - gguf
16
+ - llama.cpp
17
+ - q6_k
18
+ - imatrix
19
+ - quantized
20
+ ---
21
+
22
+ # Infinity-Parser2-Flash — Q6_K GGUF (+ vision mmproj)
23
+
24
+ A **Q6_K GGUF** quantization of [`infly/Infinity-Parser2-Flash`](https://huggingface.co/infly/Infinity-Parser2-Flash) for **llama.cpp / `llama-server`**, so the model runs on a **single consumer GPU** (validated on an RTX 3080 Ti, 12 GB) without vLLM. ~4.2 GB bf16 → **~1.5 GB** Q6_K weights (+ 0.67 GB f16 vision projector).
25
+
26
+ The base is a Qwen3.5-architecture vision-language model for document understanding: OCR, layout analysis, tables→HTML, charts→JSON, formulas→LaTeX, and Markdown conversion (EN/ZH).
27
+
28
+ ## Files
29
+ | File | What |
30
+ |---|---|
31
+ | `Infinity-Parser2-Flash-Q6_K.gguf` | Q6_K-quantized weights (imatrix) |
32
+ | `Infinity-Parser2-Flash-mmproj-f16.gguf` | f16 multimodal projector — **required for image input** |
33
+
34
+ ## Method
35
+ `convert_hf_to_gguf` → f16 GGUF → `llama-quantize Q6_K` with an **importance matrix** computed from a clean native-PDF document corpus (~519 k tokens). (`llama-imatrix` is text-only; the mmproj carries the vision tower at serve time.)
36
+
37
+ ## Quality (VLMEvalKit, vs published bf16)
38
+ | Benchmark | bf16 | Q6_K GGUF |
39
+ |---|---|---|
40
+ | DocVQA (val) | 93.80 | 93.63 |
41
+ | OCRBench | 84.3 | 82.8 |
42
+ | MMStar / MMBench | ref | ≥ bf16 |
43
+
44
+ Effectively **lossless** for the 6-bit quant. The small OCRBench dip is **not** the quantization — an f16 GGUF on the same stack scores ≈ 83.0 ≈ Q6_K's 82.8, so the residual gap is the llama.cpp vision preprocessing (candle CLIP), not the 6-bit weights.
45
+
46
+ ## Serving (llama.cpp)
47
+ ```bash
48
+ llama-server \
49
+ --model Infinity-Parser2-Flash-Q6_K.gguf \
50
+ --mmproj Infinity-Parser2-Flash-mmproj-f16.gguf \
51
+ --ctx-size 32768 --n-gpu-layers 99 \
52
+ --host 0.0.0.0 --port 8105
53
+ ```
54
+ OpenAI-compatible `/v1/chat/completions` with `image_url` content. Notes:
55
+ - **Reasoning-capable model:** output may arrive in the `reasoning_content` channel (llama.cpp routes the think block there) — read it accordingly, or disable thinking.
56
+ - A 16 MP page ≈ 15.6 K vision tokens, so `--ctx-size 32768` comfortably fits one page + output.
57
+
58
+ ---
59
+ Quantized by [@spectator2026](https://huggingface.co/spectator2026). Original model © infly, Apache-2.0 — see the [base model card](https://huggingface.co/infly/Infinity-Parser2-Flash).