File size: 22,247 Bytes
96fce02 5a67f67 099fd3c 96fce02 5a67f67 099fd3c 96fce02 5a67f67 099fd3c 2a6ab91 96fce02 dd6d6ba 96fce02 dd6d6ba | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 | ---
license: mit
---
# HPC Quantizer β Shor-Optimized
**GGUF quantization powered by Shor's algorithm.**
HPC is probably a breakthrough in model compression, utilizing a quantization pipeline derived from Shor's factoring algorithm. It compresses large language models (like Gemma 4) by mapping quantization candidates to a quantum-inspired state space.
Instead of independently rounding each weight block or running iterative belief propagation, HPC encodes scale candidates as Zβ complex amplitudes on a constraint graph and applies the **Griffiths-Niu sequential measurement protocol**. This uses the same IDFT + feed-forward + collapse/back-action loop that extracts periods in Shor's algorithm β finding globally optimal scale configurations where quantization noise is rotated away from the transformer's reasoning dimensions.
## 1. Why Shor's Algorithm?
The previous HPC engine used iterative **belief propagation (BP)** to find optimal scale configurations. BP converges to the element-wise MSE minimum β the scale configuration that minimizes total `Ξ£ (w_original - w_quantized)Β²`. This produces the lowest possible RMSE, but at 2 bits per weight, the noise floor still slightly bleeds into reasoning-critical dimensions.
Shor's Griffiths-Niu measurement protocol replaces BP with a fundamentally different optimization strategy:
| Feature | Belief Propagation (v2) | Shor's Measurement (v3) |
|---|---|---|
| **Mechanism** | Iterative message-passing (200+ rounds) | Single-pass sequential measurement |
| **Convergence** | May oscillate or get stuck | Exact marginals, no iteration |
| **Inter-block coordination** | Local messages only | Global conditioning via collapse back-action |
| **Error metric** | Element-wise MSE (isotropic) | Dβ vesica gate (anisotropic) |
| **RMSE** | Lower | Slightly higher |
| **Reasoning fidelity** | Good | **Significantly better** |
The key insight: **RMSE measures the wrong thing.** Standard RMSE treats every weight dimension equally. But during matrix multiplication, some error dimensions propagate through the computation graph and destroy reasoning, while others cancel out and are invisible. Shor's measurement finds configurations where block-to-block errors are anti-correlated along the computation path β they cancel during matmul even though each individual block has slightly higher error.
## 2. Performance & Benchmarks
### Gemma 4 26B-A4B-it MoE (25.8B params)
| Quantization | Size | Fits 12 GB? | Method |
|-------------|------|:-----------:|--------|
| BF16 | 48.5 GB | β | β |
| Q8_0 | ~27 GB | β | Round-to-nearest |
| Q4_K_M | 16.8 GB | β | Round-to-nearest |
| IQ3_K_XXS | ~12 GB | β οΈ | Unsloth |
| **HPCΒ·Shor** | **10.2 GB** | **β
** | **Griffiths-Niu measurement** |
### Gemma 4 E2B-it (4.65B params)
| Model | Size | BPW | PPL | Speed |
|-------|------|-----|-----|-------|
| BF16 (original) | 8.67 GB | 16.00 | 154.0 | 4.2 t/s |
| ggml Q2_K + iMatrix | 2.77 GB | 5.12 | 89.1 | 14.0 t/s |
| **HPC Q2_K + Q4_0Β·Shor** | **1.44 GB** | **~3.0** | **129.6** | **18.1 t/s** |
### Reasoning Benchmarks (Gemma 4 31B, Q2_KΒ·Shor, 12.5 GB)
- **25 Horses combinatorial proof:** β
(7 races, complete elimination)
- **Hindley-Milner type inference:** β
(correct let-polymorphism)
- **Arto Inkala "World's Hardest Sudoku":** β
(AC-3 + backtracking)
- **Diagnose 3 non-obvious bugs in C:** β
(first attempt)
- **Tarjan's bridge-finding algorithm:** β
(correct `>` vs `>=` distinction)
## 3. Quantum-Inspired Mechanics (Shor Pipeline)
Standard quantization picks scales independently per block. Shor-powered quantization treats scale selection as a **global optimization problem** where the measurement of each block conditions all remaining blocks through quantum-inspired back-action.
The domain mapping from Shor's integer factoring to HPC Quantization:
| Shor's Factoring | HPC Quantization |
|---|---|
| Oracle phase `2Ο Γ d Γ cβ / N` | Boltzmann amplitude from candidate error |
| Period `r` | Optimal scale configuration |
| QFT interference peaks at `r` | IDFT6 interference peaks at optimal RMSE |
| Semi-classical feed-forward | Phase correction from measured blocks |
| Born measurement β period bits | Born measurement β scale candidate selection |
| Collapse + entanglement | Collapse + back-action into neighbor amplitudes |
By utilizing the **IDFT6** inside the coherent sum, the algorithm creates constructive interference at the optimal RMSE configuration, similarly to how Shor's QFT creates interference at the correct period.
## 4. Q2_K and Q4_0 Promotion Strategies
The quantizer automatically assigns precision tiers:
- **Q4_0Β·Shor** β Attention projections (Q/K/V/O) β 16 candidate scales, 24-beam search.
- **Q2_KΒ·Shor** β FFN, MLP, expert weights β 36 candidate (d, dmin) pairs, dual-quhit graph, 24-beam search.
- **Preserved** β Embeddings, norms, router/gate weights (kept as-is in high precision).
### Tied Embeddings
If no separate output weight tensor exists, `token_embd.weight` doubles as the LM head. HPC automatically detects tied embeddings and promotes them to Q4_0 to preserve generation logic, ensuring that output tokens remain accurate despite the extreme compression of the feed-forward layers.
## 5. Sub-Block Refinement and Beam Search
The HPC process executes in multiple phases to guarantee globally coherent scaling parameters:
1. **Phase 1: Greedy Seed & WLS Refinement**: Computes reference scale and min per 256-weight superblock.
2. **Phase 2: Candidate Generation via Dβ Vesica Scoring**: Instead of MSE, weights are scored with the Dβ Vesica gate (penalizing DC/summed errors 4x more than AC/wave errors).
3. **Phase 3: Sequential Measurement**: Builds an HPCGraph encoding candidates as Boltzmann amplitudes. Connects blocks via CZ phase gates and runs Griffiths-Niu measurement (IDFT6, Born Rule, Collapse).
4. **Phase 4: 24-Beam Hensel Search**: Maintains 24 parallel configuration beams across the tensor, branching candidates evaluated via triality-weighted scoring.
5. **Phase 5: Sub-Block Shor Refinement**: A second, smaller Shor sequential measurement over a 16-node graph corresponding to the 16 sub-blocks within each 256-weight superblock.
## 6. Prerequisites and Build Instructions
Before you can quantize models, you must build the Shor-optimized HPC C engine.
### Dependencies (Ubuntu/Debian)
```bash
sudo apt install gcc libgmp-dev libmpfr-dev python3 python3-numpy
```
You will also need `llama.cpp` built from source for iMatrix generation:
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --target llama-imatrix -j$(nproc)
```
### Build the HPC Engine
Navigate to the `LLM-distributed` directory and compile:
```bash
make -f makefile.quantize
```
Verify the build generated `libhexstate_q2k.so`. The Python requantizer (`hexstate_requantize.py`) auto-detects this library to enable Shor optimization.
## 7. The Quantization Pipeline (End-to-End)
**Step A: Convert Model to BF16 GGUF**
Use `llama.cpp` to convert your source HuggingFace model to BF16.
```bash
python3 llama.cpp/convert_hf_to_gguf.py /path/to/model/ --outfile Model-BF16.gguf --outtype bf16
```
**Step B: Generate Importance Matrix (iMatrix)**
HPC includes a native C engine for generating importance matrices (imatrix) used in aggressive LLM weight quantization (Q2_K and below). The engine replaces the standard `llama.cpp` calibration pipeline with an HPC-graph-accelerated tokenizer and forward pass that produces structurally superior importance data.
**Component files:**
| File | Description |
|:---|:---|
| `LLM/hexstate_quantize.c` | C engine: HPC BPE tokenizer, graph-based forward pass, quantization kernels |
| `LLM/generate_imatrix.py` | Orchestrator: GGUF loading, weight dequantization, C-bridge, imatrix output |
| `LLM/calibration_data.txt` | Calibration corpus (~12.8M characters) |
---
#### Why It Works: Global Geometric Tokenization
The core innovation is the **HPC BPE tokenizer** β a byte-pair encoding engine that operates on the HPCGraph substrate without regex word boundaries. This seemingly simple architectural choice has profound consequences for quantization quality.
##### The Standard Approach: Artificial Isolation
Standard tokenizers (tiktoken, SentencePiece, HuggingFace) apply a **regex pre-split** before BPE:
```
"Hello world" β regex β ["Hello", " world"] β BPE per word β [15496, 1917]
```
The regex fence means:
- **Merges cannot cross word boundaries.** The characters `"o "` (letter-o + space) can never form a token.
- **Each word is tokenized in isolation.** The token for `"Hello"` at position 0 is mathematically independent of any text at position 10,000.
- **Token boundaries are locally determined.** Changing text on page 500 cannot affect how page 1 is tokenized.
When a standard tokenizer produces 4,096 tokens for imatrix calibration, those tokens are a **shallow, context-free sample** β an arbitrary window of generic subwords that exercises only the activation patterns present in that one fragment of text.
##### The HPC Approach: Unrestricted Graph Contraction
The HPC BPE tokenizer treats the **entire calibration corpus as a single, continuous phase graph**:
```
"Hello world" β 11 sites, each CZ-coupled to its neighbor β global merge competition
```
1. **Graph Construction** β Each character becomes a site in an `HPCGraph`. Adjacent sites are connected by CZ edges, encoding pair structure as phase entanglement. For a 12.8M character corpus, this creates a graph with 12,799,706 sites and 12,799,705 CZ edges.
2. **Global Merge Competition** β Each BPE pass scans every alive position across the entire graph, finds the lowest-rank merge pair that exists *anywhere* in the 12.8M character sequence, and contracts *all* instances simultaneously. This is not local β a merge decision at position 10,000,000 competes with and can preempt merge decisions at position 100.
3. **Cascade Propagation** β When a merge at position `i` contracts sites `i` and `i+1` into a single token, site `i`'s neighbor changes. The new adjacent pair `(merged_token, next_neighbor)` may have a very low rank, causing it to fire in the next pass β which creates *another* new pair, propagating further. Without regex boundaries, these cascades propagate freely across spaces, punctuation, and line breaks, threading through the entire document's character geometry.
4. **Phase Contraction** β Each merge is simultaneously a **graph contraction** on the HPCGraph. The merged site's local quhit amplitude is updated to a sharp basis state encoding the new token ID (`best_merged % D`), which mathematically severs the entanglement from the consumed CZ edge. The graph tracks the full contraction history.
After 21,576 passes (Mistral) or 24,447 passes (Qwen), the 12.8M character graph contracts to ~3.5Mβ5.2M surviving sites. Each surviving token is not a generic subword β it is **the fixed point of a global contraction** over the entire document's character geometry.
##### Tokens as Global Geometric Frequencies
In standard NLP, a token represents a word. In the HPC tokenizer, a token represents a **global geometric frequency** β a structural alignment that was carved out by 20,000+ rounds of competition across the full corpus.
The key consequence: **the first 4,096 tokens of the output already contain the structural dependencies of the entire 12.8M character document.** This is because:
- The merge rules (pre-trained vocabulary) act as a fixed **geometric frame** β a coordinate system for projecting character sequences into token space.
- The absence of regex boundaries means the projection is **unrestricted** β merges resolve cross-word, cross-line, and cross-paragraph structures that regex-split tokenizers are blind to.
- The specific token IDs and their boundary placements at *any* position are determined by 20,000+ passes of global competition where *every* merge decision was influenced by *every* position in the 12.8M character sequence.
- **You cannot know how the first 500 characters will be tokenized until you have evaluated the last 500 characters.** The tokenization of position 0 is a function of the entire document.
##### The Result: Surgical Quantization at 2 Bits
When these globally-informed tokens are fed through the HPC forward pass for importance collection, the resulting E[xΒ²] statistics are **structurally representative** of the full corpus β even from a single 4,096-token chunk. The cross-layer Belief Propagation then smooths these statistics via residual stream coupling, producing an importance matrix that precisely identifies the load-bearing weights.
**Empirical validation:** Mistral-7B-Instruct-v0.3 quantized to Q2_K (87.5% weight compression, 14.5 GB β 3 GB) using a single HPC-calibrated chunk produces:
- Flawless English grammar and complex vocabulary
- Correct factual retrieval from context ("What color was John's suit?" β "Neon green.")
- Coherent multi-sentence reasoning structure
Standard `llama.cpp` imatrix calibration at Q2_K typically requires hundreds of chunks (500K+ tokens) to avoid catastrophic degradation. The HPC pipeline achieves superior results with **one chunk** because the tokenizer has already done the work of compressing the entire document's structure into that chunk.
```bash
# Generate HPC importance matrix
python3 LLM/generate_imatrix.py \
model.gguf calibration_data.txt \
-o imatrix.dat --chunks 5 --verbose
```
5 Chunks is the 'sweet spot' for retaining most model intelligence I've found.
**Step C: Quantize with HPC**
Execute the re-quantizer with your newly generated BF16 GGUF and iMatrix.
```bash
python3 hexstate_requantize.py Model-BF16.gguf Model-Q2_K-HexState.gguf --keep-metadata --imatrix imatrix.gguf
```
This automatically routes the attention layers to Q4_0 and FFN/MLP layers to Q2_K using the Shor measurement graph.
## 8. Inference & Runtime Configuration
For correct operation, it is highly recommended to use appropriate chat templates and specific configuration flags in llama.cpp to prevent context length bugs.
**Download Correct Chat Template:**
```bash
curl -L -o chat_template.jinja "https://huggingface.co/google/gemma-4-26B-A4B-it/raw/main/chat_template.jinja"
```
**Run llama-server:**
```bash
llama-server -m Model-Q2_K-HexState.gguf -ngl 0 -c 4096 --host 0.0.0.0 --port 8989 --jinja --chat-template-file chat_template.jinja --cache-ram 0 -ctxcp 1
```
**Recommended Sampling Settings:**
- **Temperature:** 0.3β0.4 (Lower reduces sampling noise at low BPW)
- **Top_k:** 20β30 (Narrow sampling for coherence)
- **Top_p:** 0.8β0.85 (Cuts noisy long tail)
- **Repeat_penalty:** 1.15β1.2 (Prevents self-correction loops)
## 9. Fidelity Classification & Troubleshooting
The quantizer reports a fidelity rating based on total RMSE across all quantized tensors:
| Rating | RMSE Threshold | Icon |
|--------|:-:|:-:|
| ULTRA | β€ 1e-04 | β
β
β
β
|
| HIGH | β€ 3e-04 | β
β
β
β |
| GOOD | β€ 1e-03 | β
β
ββ |
| STANDARD | > 1e-03 | β
βββ |
*Note: Due to anisotropic error shaping, a "GOOD" Shor-quantized model will typically outperform a "HIGH" BP-quantized model on reasoning tasks.*
### Common Issues
- **Arabic/Korean characters in output:** The embedded chat template is broken. Use `--chat-template-file` with the correct Jinja template.
- **RAM usage keeps growing:** Use `--cache-ram 0 -ctxcp 1` with llama.cpp to manage sliding window attention.
- **RMSE is higher than standard Q2_K:** This is intentional. The Dβ vesica gate trades total RMSE for computation-aligned error minimization.
- **libhexstate_q2k.so not found:** Make sure to compile the C engine using `make -f makefile.quantize`.
# How the HPC Engine Makes Global Error Decisions
The HPC (Holographic Phase Contraction) engine takes a fundamentally different approach to LLM weight quantization. Instead of minimizing local error block-by-block, it frames scale selection as a **global quantum-inspired optimization problem**. It maps quantization candidates to a constraint graph and evaluates them using the **Griffiths-Niu Sequential Measurement** protocol derived from Shor's factoring algorithm.
Here is a step-by-step breakdown of how the engine makes global error decisions, directly referencing the implementation in `hexstate_quantize.c`.
---
## 1. Candidate Generation and Dβ Vesica Scoring
Before any global optimization occurs, the engine needs candidates. For each block of weights, it generates candidate scales (`d` and `dmin`) around a baseline least-squares optimum.
Instead of scoring these candidates using standard Mean Squared Error (MSE), it uses the **Dβ Vesica Gate**.
### Code Citation (`hexstate_quantize.c`, lines ~1420β1432)
```c
/* Decompose into vesica (DC) and wave (AC) components */
float vesica_err = 0.0f, wave_err = 0.0f;
for (int p = 0; p < half_g; p++) {
float v = e_cur[p] + e_cur[p + half_g];
float w_wave = e_cur[p] - e_cur[p + half_g];
float w_avg = (w_cur[p] + w_cur[p + half_g]) * 0.5f;
vesica_err += v * v * w_avg;
wave_err += w_wave * w_wave * w_avg;
}
/* Triality weighting: penalize vesica 4Γ, wave 1Γ. */
err += 0.5f * (4.0f * vesica_err + 1.0f * wave_err);
```
**Why this matters:** `vesica_err` represents errors that sum together during matrix multiplication, propagating through the network and destroying reasoning. `wave_err` represents errors that naturally cancel out. By penalizing `vesica_err` by 4x, the engine heavily biases towards candidates whose errors cancel out during inference, even if their total Euclidean distance (MSE) from the original weights is larger.
## 2. Graph Construction and Boltzmann Amplitudes
The selected candidates and their Vesica errors are grouped into 6 bins (representing the 6 states of a $Z_6$ "quhit" or quantum digit). The engine constructs an `HPCGraph` mapping each block to one or more quhits.
The errors are transformed into "Boltzmann amplitudes"βrepresenting the likelihood of selecting each state.
### Code Citation (`hexstate_quantize.c`, lines ~1502β1506)
```c
for (int ci = 0; ci < Q4_N_CAND; ci++) {
int qi = Q4_CAND_TO_QUHIT[ci];
amp_re[qi] += exp(-(double)(agg_errors[ci] - min_err) /
(2.0 * (double)temperature));
}
```
## 3. The Griffiths-Niu Sequential Measurement
This is where the global coordination happens. The function `shor_measure_graph` (lines 1166β1314) executes a sequential measurement MSB to LSB. This replaces standard Belief Propagation with a deterministic evaluation that creates massive global correlation.
For each block $k$ being measured:
### A. Neighbor Contribution (Entanglement)
It evaluates how neighboring blocks influence block $k$ by projecting their current amplitudes across the graph edges.
```c
// hexstate_quantize.c: lines 1217-1221
sr += lr*wr - li*wi;
si += lr*wi + li*wr;
```
### B. Feed-Forward Phase Correction
It applies a phase shift based on the outcomes of all blocks measured before it, a signature trait of the semi-classical QFT used in Shor's algorithm.
```c
// hexstate_quantize.c: lines 1181-1184
double power = 36.0;
for (int64_t j = k + 1; j < n_sites; j++) {
theta_k += (double)measured_out[j] / power;
power *= 6.0;
}
```
### C. IDFT6 and Constructive Interference
It runs an Inverse Discrete Fourier Transform (IDFT6). Because the neighbor influence ($C_k$) was baked into the amplitudes *before* the IDFT, the IDFT acts as a coherence filter. **It produces constructive interference peaks precisely at the scale candidate that creates the best global configuration.**
```c
// hexstate_quantize.c: lines 1256-1261
double angle = 2.0 * 3.14159265358979323846 * d * v / 6.0;
double er = cos(angle), ei = sin(angle);
sum_re += alpha_re[d]*er - alpha_im[d]*ei;
sum_im += alpha_re[d]*ei + alpha_im[d]*er;
```
### D. Measurement and Back-Action (The "Magic Pointer")
Once an optimal state is selected (using `argmax` on the squared amplitudes via the Born Rule), the engine calls `shor_collapse_site`.
This function creates the **global error decision**. It doesn't just lock in block $k$'s choice; it propagates the decision to all unmeasured neighbors.
```c
// hexstate_quantize.c: lines 1120-1121
double old_re = pq->edge_re[d], old_im = pq->edge_im[d];
pq->edge_re[d] = old_re * w_re - old_im * w_im;
pq->edge_im[d] = old_re * w_im + old_im * w_re;
```
This back-action fundamentally alters the amplitudes of adjacent blocks, *conditioning* their future quantization choices on the choice just made for block $k$. This creates anti-correlated quantization noise across the entire tensor, ensuring that when the matrix is multiplied against an activation vector, the localized errors cancel each other out.
## 4. Beam Search Refinement
Because the measurement is performed left-to-right (MSB to LSB), the sequence is prone to greedy failure. The HPC engine circumvents this using a **24-Beam Hensel Search**.
Instead of accepting the single path created by the graph collapse, it maintains 24 parallel quantization paths (`Q4_N_BEAMS`). It evaluates candidate extensions by dividing the Shor probability (from the graph marginals) by the local normalized error, advancing the 24 best global configurations simultaneously.
### Code Citation (`hexstate_quantize.c`, lines ~1581β1582)
```c
double ext_err = beams[b].acc_error + cand_errors[blk][c];
extensions[n_ext].score = cand_score[c] / (ext_err + 1e-15);
```
## Summary
The global error decision is not an iterative smoothing process (like BP). It is an exact evaluation of quantum interference. By encoding quantization errors into complex amplitudes, passing them through an IDFT6, and forcing neighbors to condition their subsequent scale selections via wave-collapse back-action, the engine ensures that the chosen scales globally cancel each other's destructive matrix-multiplication errors.
## 10. License
The quantizer code is part of the HPC project (MIT). Quantized models inherit the license of the base model (e.g., Gemma Terms of Use).
|