harshithsaiv
/

kv-cache-compression

@@ -1,53 +1,5 @@
-# Per-Head Mixed-Precision KV Cache Compression
-> Calibrate once. Compress smarter. Same quality.
-Most KV cache quantization treats every attention head equally.
-This is wrong. Some heads are **26x more sensitive** to quantization than others.
-We measure this, allocate bits accordingly, and get better compression than uniform 8-bit with zero quality loss.
----
-## Results
-![Memory vs Context](figures/memory_vs_context_both.png)
-![Compression](figures/compression_bar_both.png)
-| Model | Method | Avg Bits | KV @ 8K | vs FP16 | vs 8-bit | Perplexity | Speed |
-|-------|--------|----------|---------|---------|---------|------------|-------|
-| Mistral-7B | FP16 Baseline | 16 | 1073 MB | 1.0x | — | 14.23 | 37.2 t/s |
-| Mistral-7B | Uniform 8-bit | 8 | 537 MB | 2.0x | 1.0x | ~same | ~same |
-| Mistral-7B | **Per-Head Mixed (Ours)** | **6.95** | **467 MB** | **2.3x** | **1.15x** | **14.23** | **37.2 t/s** |
-| Llama-3-8B | FP16 Baseline | 16 | 1073 MB | 1.0x | — | 20.7 | 36.7 t/s |
-| Llama-3-8B | Uniform 8-bit | 8 | 537 MB | 2.0x | 1.0x | ~same | ~same |
-| Llama-3-8B | **Per-Head Mixed (Ours)** | **7.84** | **526 MB** | **2.04x** | **1.02x** | **20.7** | **36.7 t/s** |
----
-## Long Context Results
-![Long Context](figures/long_context_both.png)
-![OOM Story](figures/oom_story.png)
-| Context | FP16 (Mistral) | Ours (Mistral) | FP16 (Llama) | Ours (Llama) |
-|---------|---------------|----------------|--------------|--------------|
-| 8K | 1,074 MB | 467 MB | 1,074 MB | 526 MB |
-| 16K | 2,147 MB | 933 MB | 2,147 MB | 1,053 MB |
-| 32K | 4,295 MB | 1,866 MB | OOM | ~2,106 MB |
-Llama-3-8B FP16 runs out of memory at 32K context. Our method fits.
----
-## The Key Insight
-![Sensitivity Heatmap](figures/mistral-7b_sensitivity_heatmap.png)
-Each cell is one attention head. Darker means more sensitive, which means it needs higher precision.
-The variance is massive — heads in the same layer need completely different treatment.
-Uniform quantization ignores this entirely.
 ---
@@ -55,18 +7,21 @@ Uniform quantization ignores this entirely.
 **Step 1 — Calibrate (once, ~20 minutes)**
-Run 256 WikiText samples through the model. For each attention head measure reconstruction error at 4-bit and 8-bit. Save the optimal bit allocation to a JSON file (~1KB).
 **Step 2 — Compress (every inference)**
-Load the bit allocation. Quantize each head to its optimal precision. 4-bit heads use half the memory. 8-bit heads stay accurate.
 **Step 3 — Results**
-- 2.3x memory reduction on Mistral-7B
 - 2.04x memory reduction on Llama-3-8B
 - Zero perplexity degradation on both models
 - Same decode speed at 37 tokens/sec
 ---
@@ -94,11 +49,11 @@ Run full pipeline:
 Run step by step:
-    make baseline       MODEL=mistral-7b
-    make calibrate      MODEL=mistral-7b
-    make integrate      MODEL=mistral-7b
-    make benchmark      MODEL=mistral-7b
-    make benchmark-long MODEL=mistral-7b
     make visualize
 ---
@@ -107,12 +62,13 @@ Run step by step:
     kv-cache-compression/
     ├── kernel/
-    │   └── quant_cache.py              mixed-precision quantize/dequantize
     ├── scripts/
     │   ├── baseline.py                 FP16 baseline measurements
     │   ├── calibrate.py                per-head sensitivity calibration
-    │   ├── integrate.py                quantized inference integration
-    │   ├── benchmark.py                full benchmark suite
     │   ├── benchmark_long_context.py   16K/32K context benchmarks
     │   ├── visualize_results.py        benchmark graphs
     │   ├── visualize_long_context.py   long context graphs
@@ -122,8 +78,8 @@ Run step by step:
     │   ├── run_mistral.sh              full Mistral pipeline
     │   └── run_llama.sh                full Llama pipeline
     ├── results/
-    │   ├─��� mistral-7b/                 baseline, calibration, benchmark
-    │   └── llama-3-8b/                 baseline, calibration, benchmark
     ├── figures/                        all generated graphs
     ├── requirements.txt                pip dependencies
     ├── Makefile                        one-command pipeline
@@ -143,28 +99,27 @@ Run step by step:
 ## Limitations
-- Current 4-bit implementation stores values in uint8 which wastes half the space. True bit-packing via Triton kernel is in progress on the triton-kernel branch.
-- Calibration uses WikiText-2. Domain-specific calibration may improve results for specialized use cases.
-- Tested on 7-8B models only. Larger models need validation.
 - Integration is HuggingFace only. vLLM integration is planned.
-<!-- ---
 ## What's Next
-- True Triton 4-bit bit-packing kernel (triton-kernel branch)
 - vLLM PagedAttention integration
 - 32K and 128K context experiments
 - Llama-3-70B and Qwen-72B validation
 - Dynamic per-token bit allocation at decode time
-- ArXiv paper with full evaluation -->
-<!-- ---
-## Citation
     @misc{kvcache-perhead-2026,
-      title  = {Per-Head Mixed-Precision KV Cache Compression},
       author = {Your Name},
       year   = {2026},
       url    = {https://github.com/YOURUSERNAME/kv-cache-compression}

+This is why the Triton kernel is essential — not just faster, but the only way
+to realize the theoretical memory savings from 4-bit quantization.
 ---
 **Step 1 — Calibrate (once, ~20 minutes)**
+Run 256 WikiText samples through the model. For each attention head measure
+reconstruction error at 4-bit and 8-bit. Save optimal bit allocation to JSON (~1KB).
 **Step 2 — Compress (every inference)**
+Load the bit allocation. Use Triton kernel to truly pack 4-bit heads (2 values per byte).
+Keep 8-bit heads at full precision. Result: genuine memory reduction.
 **Step 3 — Results**
+- 2.30x memory reduction on Mistral-7B (vs 2.00x for naive/uniform)
 - 2.04x memory reduction on Llama-3-8B
 - Zero perplexity degradation on both models
 - Same decode speed at 37 tokens/sec
+- Triton kernel is 10-12% faster than naive PyTorch implementation
 ---
 Run step by step:
+    make baseline          MODEL=mistral-7b
+    make calibrate         MODEL=mistral-7b
+    make integrate         MODEL=mistral-7b
+    make benchmark         MODEL=mistral-7b
+    make benchmark-long    MODEL=mistral-7b
     make visualize
 ---
     kv-cache-compression/
     ├── kernel/
+    │   ├── quant_cache.py              naive uint8 implementation
+    │   └── quant_cache_triton.py       true Triton 4-bit bit-packing
     ├── scripts/
     │   ├── baseline.py                 FP16 baseline measurements
     │   ├── calibrate.py                per-head sensitivity calibration
+    │   ├── integrate.py                naive vs triton comparison
+    │   ├── benchmark.py                full 4-method benchmark suite
     │   ├── benchmark_long_context.py   16K/32K context benchmarks
     │   ├── visualize_results.py        benchmark graphs
     │   ├── visualize_long_context.py   long context graphs
     │   ├── run_mistral.sh              full Mistral pipeline
     │   └── run_llama.sh                full Llama pipeline
     ├── results/
+    │   ├── mistral-7b/                 all mistral results and JSONs
+    │   └── llama-3-8b/                 all llama results and JSONs
     ├── figures/                        all generated graphs
     ├── requirements.txt                pip dependencies
     ├── Makefile                        one-command pipeline
 ## Limitations
+- Tested on 7-8B models only. Larger models (70B+) need validation.
+- Calibration uses WikiText-2. Domain-specific calibration may improve results.
 - Integration is HuggingFace only. vLLM integration is planned.
+- Llama-3-8B Triton compression (2.04x) is modest due to high head sensitivity.
+---
 ## What's Next
 - vLLM PagedAttention integration
 - 32K and 128K context experiments
 - Llama-3-70B and Qwen-72B validation
 - Dynamic per-token bit allocation at decode time
+- ArXiv paper with full evaluation
+---
+<!-- ## Citation
     @misc{kvcache-perhead-2026,
+      title  = {Per-Head Mixed-Precision KV Cache Compression with True Triton Bit-Packing},
       author = {Your Name},
       year   = {2026},
       url    = {https://github.com/YOURUSERNAME/kv-cache-compression}