CompressedGemma/HPC-Quantize

CompressedGemma committed on May 8

Commit 099fd3c (verified) · 1 Parent(s): 5a67f67

Update README.md

Files changed (1)

README.md +3 -2

@@ -204,14 +204,15 @@ When these globally-informed tokens are fed through the HPC forward pass for imp
 Standard `llama.cpp` imatrix calibration at Q2_K typically requires hundreds of chunks (500K+ tokens) to avoid catastrophic degradation. The HPC pipeline achieves superior results with **one chunk** because the tokenizer has already done the work of compressing the entire document's structure into that chunk.
 ```bash
 # Generate HPC importance matrix
 python3 LLM/generate_imatrix.py \
     model.gguf calibration_data.txt \
-    -o imatrix.dat --chunks 1 --verbose
+    -o imatrix.dat --chunks 5 --verbose
 ```
+I've found that 5 chunks is the 'sweet spot' for retaining most of the model's intelligence.
 **Step C: Quantize with HPC**