Zoberzzz posted an update 24 days ago
Attached: Hacker News post (TXT)


Show HN: I compressed a 160GB KV cache to 640MB at 0.9994 fidelity on a $300 GPU

Title: Show HN: DenseMem — 256x KV cache compression, 0.9994 fidelity, runs on consumer hardware

---

A 72B model at 32K context needs 160GB of KV cache. That's more than a single 80GB H100 can hold, and about $32,000 worth of HBM3e.
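
For scale, the KV cache footprint follows directly from the architecture. A sizing sketch under an assumed dense-attention 72B config (the post doesn't state the exact model, dtype, or batch size; grouped-query attention would shrink this considerably):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_elem=2):
    # K and V tensors: 2 * layers * heads * head_dim * tokens * dtype size
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem

# Hypothetical dense-attention 72B config: 80 layers, 64 heads of dim 128, fp16, 32K tokens
size = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, tokens=32_768)
print(f"{size / 2**30:.0f} GiB")  # ~80 GiB at fp16; fp32 or batch > 1 lands near the 160GB figure
```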

I built a protocol that stores the same KV cache in 640MB of DDR5 RAM — on a consumer RTX 4090 and Core i9.

256x compression. 0.9994 cosine similarity. 1.95ms average fetch latency. Verified.

**How:**

Transformer KV cache activations are highly structured and correlated. SVD at rank=64 exploits that structure. Random noise compresses to 0.12 fidelity. Real KV cache activations compress to 0.9994. The math works because the data isn't random — it has geometry.
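
Not the repo's code, but a minimal sketch of the claim: truncated SVD at rank 64, with mean row-wise cosine similarity as the fidelity metric. The shapes and synthetic data are assumptions; structured activations reconstruct near-perfectly while i.i.d. noise doesn't:

```python
import numpy as np

def compress_kv(kv: np.ndarray, rank: int = 64):
    """Truncated SVD of one KV block; kv is (tokens, hidden_dim)."""
    U, S, Vt = np.linalg.svd(kv, full_matrices=False)
    return U[:, :rank] * S[:rank], Vt[:rank, :]   # the only state stored

def decompress_kv(US: np.ndarray, Vt: np.ndarray) -> np.ndarray:
    return US @ Vt

def fidelity(a: np.ndarray, b: np.ndarray) -> float:
    """Mean row-wise cosine similarity between original and reconstruction."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12
    return float(np.mean(num / den))

rng = np.random.default_rng(0)
basis = rng.standard_normal((64, 4096))                 # low-dim latent structure
structured = rng.standard_normal((2048, 64)) @ basis    # correlated, like real activations
noise = rng.standard_normal((2048, 4096))               # no structure to exploit

for name, x in [("structured", structured), ("noise", noise)]:
    print(name, round(fidelity(x, decompress_kv(*compress_kv(x))), 4))
# structured reconstructs near 1.0; noise lands far below
```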

The system manages a two-tier hierarchy: VRAM is the hot tier, DDR5 is the warm tier. An attention-weighted evictor (0.5 attn + 0.3 recency + 0.2 freq) decides what stays hot. A prefetcher using layer lookahead and token prediction pre-positions pages before they're needed. Average fetch latency: 1.95ms. Max under load: 3.96ms.
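
The post gives the evictor weights outright. A minimal sketch of how that scoring might look (PageStats, the normalization constants, and pick_victims are my assumptions, not the repo's API):

```python
from dataclasses import dataclass, field
import time

@dataclass
class PageStats:                      # hypothetical per-page bookkeeping
    attn_weight: float = 0.0          # attention mass over the page, assumed normalized to [0, 1]
    last_access: float = field(default_factory=time.monotonic)
    access_count: int = 0

def hotness(p: PageStats, now: float, max_age: float = 60.0, max_freq: int = 1000) -> float:
    """The post's weighting: 0.5*attention + 0.3*recency + 0.2*frequency.
    max_age and max_freq are assumed normalization constants."""
    recency = max(0.0, 1.0 - (now - p.last_access) / max_age)
    freq = min(1.0, p.access_count / max_freq)
    return 0.5 * p.attn_weight + 0.3 * recency + 0.2 * freq

def pick_victims(pages: dict[int, PageStats], n: int) -> list[int]:
    """Return the n coldest page IDs to demote from VRAM (hot) to DDR5 (warm)."""
    now = time.monotonic()
    return sorted(pages, key=lambda pid: hotness(pages[pid], now))[:n]
```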

Current hit rate is 25% — bottlenecked by my i9's 2-channel DDR5 bandwidth (~38 GB/s). On an 8-channel Threadripper PRO (~224 GB/s) I'm projecting 65-75%.

**Running live:**
- Qwen2.5-7B on RTX 4090 at 32K context (was 4K)
- Every inference tick, KV pages are compressed to INT8 via PCA and streamed to DDR5 (sketch after this list)
- 2.4s cold start
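
A hypothetical version of that compression step; the function names and the single per-block scale factor are assumptions, not the repo's pipeline:

```python
import numpy as np

def pca_int8_pack(x: np.ndarray, rank: int = 64):
    """Project a block of activations onto its top principal components,
    then quantize the low-rank codes to INT8 with one per-block scale."""
    mean = x.mean(axis=0)
    _, _, Vt = np.linalg.svd(x - mean, full_matrices=False)
    comps = Vt[:rank]                             # (rank, dim) PCA basis
    z = (x - mean) @ comps.T                      # (tokens, rank) codes
    scale = max(float(np.abs(z).max()) / 127.0, 1e-12)
    q = np.round(z / scale).astype(np.int8)       # this is what goes to DDR5
    return q, scale, mean, comps

def pca_int8_unpack(q, scale, mean, comps):
    return (q.astype(np.float32) * scale) @ comps + mean
```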

**The cost math:**
- Uncompressed 72B KV cache: $32,000 in HBM3e
- DenseMem: $1.88 in DDR5
- 99.99% cost reduction. Verified on consumer hardware.
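
Taking the post's prices at face value, the ratios check out:

```python
hbm3e, ddr5 = 32_000.00, 1.88                    # the post's claimed prices
print(f"{160 * 1024 / 640:.0f}x compression")    # 256x
print(f"{1 - ddr5 / hbm3e:.2%} cost reduction")  # 99.99%
```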

GitHub: https://github.com/thorshammerztp-arch/densemem-protocol
Patent Pending: US 64/045,595

Solo developer. Navy veteran. No funding. Consumer hardware.

We're very much on the same wavelength. I think we could do far better if we combined tech.
Just saying: I've been able to reduce debugging by 95% and build out predeterministic builds through language design, with all-encompassing math turned into 3-dimensional code I call auqqua. If you apply that RAM technique to a parallel hold state, you've also just reduced debugging. If you use my math, Reimiro Miro, it's 400,000x faster. The key is implementing resonance in a customized way that seems like it would be slower, but it definitely is not.


some benchmarks
[attached screenshot: Screenshot_20260414_014125_Firefox]

my bad, thought this was a chat
