Commit 099fd3c by CompressedGemma · verified · 1 Parent(s): 5a67f67

Update README.md

Files changed (1)
  1. README.md +3 -2
README.md CHANGED
@@ -204,14 +204,15 @@ When these globally-informed tokens are fed through the HPC forward pass for imp
  
  Standard `llama.cpp` imatrix calibration at Q2_K typically requires hundreds of chunks (500K+ tokens) to avoid catastrophic degradation. The HPC pipeline achieves superior results with **one chunk** because the tokenizer has already done the work of compressing the entire document's structure into that chunk.
  
+
  ```bash
  # Generate HPC importance matrix
  python3 LLM/generate_imatrix.py \
  model.gguf calibration_data.txt \
- -o imatrix.dat --chunks 1 --verbose
+ -o imatrix.dat --chunks 5 --verbose
  ```
  
-
+ 5 Chunks is the 'sweet spot' for retaining most model intelligence I've found.
  
  
  **Step C: Quantize with HPC**