CompressedGemma
/

HPC-Quantize

Model card Files Files and versions

xet

Community

CompressedGemma commited on May 6

Commit

96fce02

verified ·

1 Parent(s): 07b428c

Update README.md

Browse files

Files changed (1) hide show

README.md +195 -3

README.md CHANGED Viewed

@@ -1,3 +1,195 @@
----
-license: mit
----

+---
+license: mit
+---
+# HPC Quantizer — Shor-Optimized
+**GGUF quantization powered by Shor's algorithm.**
+HPC is probably a breakthrough in model compression, utilizing a quantization pipeline derived from Shor's factoring algorithm. It compresses large language models (like Gemma 4) by mapping quantization candidates to a quantum-inspired state space.
+Instead of independently rounding each weight block or running iterative belief propagation, HPC encodes scale candidates as Z₆ complex amplitudes on a constraint graph and applies the **Griffiths-Niu sequential measurement protocol**. This uses the same IDFT + feed-forward + collapse/back-action loop that extracts periods in Shor's algorithm — finding globally optimal scale configurations where quantization noise is rotated away from the transformer's reasoning dimensions.
+## 1. Why Shor's Algorithm?
+The previous HPC engine used iterative **belief propagation (BP)** to find optimal scale configurations. BP converges to the element-wise MSE minimum — the scale configuration that minimizes total `Σ (w_original - w_quantized)²`. This produces the lowest possible RMSE, but at 2 bits per weight, the noise floor still slightly bleeds into reasoning-critical dimensions.
+Shor's Griffiths-Niu measurement protocol replaces BP with a fundamentally different optimization strategy:
+| Feature | Belief Propagation (v2) | Shor's Measurement (v3) |
+|---|---|---|
+| **Mechanism** | Iterative message-passing (200+ rounds) | Single-pass sequential measurement |
+| **Convergence** | May oscillate or get stuck | Exact marginals, no iteration |
+| **Inter-block coordination** | Local messages only | Global conditioning via collapse back-action |
+| **Error metric** | Element-wise MSE (isotropic) | D₆ vesica gate (anisotropic) |
+| **RMSE** | Lower | Slightly higher |
+| **Reasoning fidelity** | Good | **Significantly better** |
+The key insight: **RMSE measures the wrong thing.** Standard RMSE treats every weight dimension equally. But during matrix multiplication, some error dimensions propagate through the computation graph and destroy reasoning, while others cancel out and are invisible. Shor's measurement finds configurations where block-to-block errors are anti-correlated along the computation path — they cancel during matmul even though each individual block has slightly higher error.
+## 2. Performance & Benchmarks
+### Gemma 4 26B-A4B-it MoE (25.8B params)
+| Quantization | Size | Fits 12 GB? | Method |
+|-------------|------|:-----------:|--------|
+| BF16 | 48.5 GB | ❌ | — |
+| Q8_0 | ~27 GB | ❌ | Round-to-nearest |
+| Q4_K_M | 16.8 GB | ❌ | Round-to-nearest |
+| IQ3_K_XXS | ~12 GB | ⚠️ | Unsloth |
+| **HPC·Shor** | **10.2 GB** | **✅** | **Griffiths-Niu measurement** |
+### Gemma 4 E2B-it (4.65B params)
+| Model | Size | BPW | PPL | Speed |
+|-------|------|-----|-----|-------|
+| BF16 (original) | 8.67 GB | 16.00 | 154.0 | 4.2 t/s |
+| ggml Q2_K + iMatrix | 2.77 GB | 5.12 | 89.1 | 14.0 t/s |
+| **HPC Q2_K + Q4_0·Shor** | **1.44 GB** | **~3.0** | **129.6** | **18.1 t/s** |
+### Reasoning Benchmarks (Gemma 4 31B, Q2_K·Shor, 12.5 GB)
+- **25 Horses combinatorial proof:** ✅ (7 races, complete elimination)
+- **Hindley-Milner type inference:** ✅ (correct let-polymorphism)
+- **Arto Inkala "World's Hardest Sudoku":** ✅ (AC-3 + backtracking)
+- **Diagnose 3 non-obvious bugs in C:** ✅ (first attempt)
+- **Tarjan's bridge-finding algorithm:** ✅ (correct `>` vs `>=` distinction)
+## 3. Quantum-Inspired Mechanics (Shor Pipeline)
+Standard quantization picks scales independently per block. Shor-powered quantization treats scale selection as a **global optimization problem** where the measurement of each block conditions all remaining blocks through quantum-inspired back-action.
+The domain mapping from Shor's integer factoring to HPC Quantization:
+| Shor's Factoring | HPC Quantization |
+|---|---|
+| Oracle phase `2π × d × cₖ / N` | Boltzmann amplitude from candidate error |
+| Period `r` | Optimal scale configuration |
+| QFT interference peaks at `r` | IDFT6 interference peaks at optimal RMSE |
+| Semi-classical feed-forward | Phase correction from measured blocks |
+| Born measurement → period bits | Born measurement → scale candidate selection |
+| Collapse + entanglement | Collapse + back-action into neighbor amplitudes |
+By utilizing the **IDFT6** inside the coherent sum, the algorithm creates constructive interference at the optimal RMSE configuration, similarly to how Shor's QFT creates interference at the correct period.
+## 4. Q2_K and Q4_0 Promotion Strategies
+The quantizer automatically assigns precision tiers:
+- **Q4_0·Shor** — Attention projections (Q/K/V/O) — 16 candidate scales, 24-beam search.
+- **Q2_K·Shor** — FFN, MLP, expert weights — 36 candidate (d, dmin) pairs, dual-quhit graph, 24-beam search.
+- **Preserved** — Embeddings, norms, router/gate weights (kept as-is in high precision).
+### Tied Embeddings
+If no separate output weight tensor exists, `token_embd.weight` doubles as the LM head. HPC automatically detects tied embeddings and promotes them to Q4_0 to preserve generation logic, ensuring that output tokens remain accurate despite the extreme compression of the feed-forward layers.
+## 5. Sub-Block Refinement and Beam Search
+The HPC process executes in multiple phases to guarantee globally coherent scaling parameters:
+1. **Phase 1: Greedy Seed & WLS Refinement**: Computes reference scale and min per 256-weight superblock.
+2. **Phase 2: Candidate Generation via D₆ Vesica Scoring**: Instead of MSE, weights are scored with the D₆ Vesica gate (penalizing DC/summed errors 4x more than AC/wave errors).
+3. **Phase 3: Sequential Measurement**: Builds an HPCGraph encoding candidates as Boltzmann amplitudes. Connects blocks via CZ phase gates and runs Griffiths-Niu measurement (IDFT6, Born Rule, Collapse).
+4. **Phase 4: 24-Beam Hensel Search**: Maintains 24 parallel configuration beams across the tensor, branching candidates evaluated via triality-weighted scoring.
+5. **Phase 5: Sub-Block Shor Refinement**: A second, smaller Shor sequential measurement over a 16-node graph corresponding to the 16 sub-blocks within each 256-weight superblock.
+## 6. Prerequisites and Build Instructions
+Before you can quantize models, you must build the Shor-optimized HPC C engine.
+### Dependencies (Ubuntu/Debian)
+```bash
+sudo apt install gcc libgmp-dev libmpfr-dev python3 python3-numpy
+```
+You will also need `llama.cpp` built from source for iMatrix generation:
+```bash
+git clone https://github.com/ggerganov/llama.cpp
+cd llama.cpp && cmake -B build && cmake --build build --target llama-imatrix -j$(nproc)
+```
+### Build the HPC Engine
+Navigate to the `LLM-distributed` directory and compile:
+```bash
+make -f makefile.quantize
+```
+Verify the build generated `libhexstate_q2k.so`. The Python requantizer (`hexstate_requantize.py`) auto-detects this library to enable Shor optimization.
+## 7. The Quantization Pipeline (End-to-End)
+**Step A: Convert Model to BF16 GGUF**
+Use `llama.cpp` to convert your source HuggingFace model to BF16.
+```bash
+python3 llama.cpp/convert_hf_to_gguf.py /path/to/model/     --outfile Model-BF16.gguf     --outtype bf16
+```
+**Step B: Generate Importance Matrix (iMatrix)**
+Download a calibration dataset (like wikitext-2) and generate the iMatrix:
+```bash
+llama-imatrix     -m Model-BF16.gguf     -f wikitext-2-raw/wiki.train.raw     -o imatrix.gguf     --chunks 300     -ngl 0
+```
+*Tip: Set `-ngl 99` to use GPU acceleration, which speeds up this step significantly.*
+**Step C: Quantize with HPC**
+Execute the re-quantizer with your newly generated BF16 GGUF and iMatrix.
+```bash
+python3 hexstate_requantize.py     Model-BF16.gguf     Model-Q2_K-HexState.gguf     --keep-metadata     --imatrix imatrix.gguf
+```
+This automatically routes the attention layers to Q4_0 and FFN/MLP layers to Q2_K using the Shor measurement graph.
+## 8. Inference & Runtime Configuration
+For correct operation, it is highly recommended to use appropriate chat templates and specific configuration flags in llama.cpp to prevent context length bugs.
+**Download Correct Chat Template:**
+```bash
+curl -L -o chat_template.jinja "https://huggingface.co/google/gemma-4-26B-A4B-it/raw/main/chat_template.jinja"
+```
+**Run llama-server:**
+```bash
+llama-server     -m Model-Q2_K-HexState.gguf     -ngl 0     -c 4096     --host 0.0.0.0 --port 8989     --jinja --chat-template-file chat_template.jinja     --cache-ram 0 -ctxcp 1
+```
+**Recommended Sampling Settings:**
+- **Temperature:** 0.3–0.4 (Lower reduces sampling noise at low BPW)
+- **Top_k:** 20–30 (Narrow sampling for coherence)
+- **Top_p:** 0.8–0.85 (Cuts noisy long tail)
+- **Repeat_penalty:** 1.15–1.2 (Prevents self-correction loops)
+## 9. Fidelity Classification & Troubleshooting
+The quantizer reports a fidelity rating based on total RMSE across all quantized tensors:
+| Rating | RMSE Threshold | Icon |
+|--------|:-:|:-:|
+| ULTRA | ≤ 1e-04 | ★★★★ |
+| HIGH | ≤ 3e-04 | ★★★☆ |
+| GOOD | ≤ 1e-03 | ★★☆☆ |
+| STANDARD | > 1e-03 | ★☆☆☆ |
+*Note: Due to anisotropic error shaping, a "GOOD" Shor-quantized model will typically outperform a "HIGH" BP-quantized model on reasoning tasks.*
+### Common Issues
+- **Arabic/Korean characters in output:** The embedded chat template is broken. Use `--chat-template-file` with the correct Jinja template.
+- **RAM usage keeps growing:** Use `--cache-ram 0 -ctxcp 1` with llama.cpp to manage sliding window attention.
+- **RMSE is higher than standard Q2_K:** This is intentional. The D₆ vesica gate trades total RMSE for computation-aligned error minimization.
+- **libhexstate_q2k.so not found:** Make sure to compile the C engine using `make -f makefile.quantize`.
+## 10. License
+The quantizer code is part of the HPC project (MIT). Quantized models inherit the license of the base model (e.g., Gemma Terms of Use).