Zoberzzz posted an update 24 days ago
Attached: Hacker News post (TXT)


Show HN: I compressed a 160GB KV cache to 640MB at 0.9994 fidelity on a $300 GPU

Title: Show HN: DenseMem — 256x KV cache compression, 0.9994 fidelity, runs on consumer hardware

---

A 72B model at 32K context needs 160GB of KV cache. That's more than a single 80GB H100 can hold, and about $32,000 worth of HBM3e.
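
For scale, the KV cache footprint follows directly from the architecture. A sizing sketch under an assumed dense-attention 72B config (the post doesn't state the exact model, dtype, or batch size; grouped-query attention would shrink this considerably):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_elem=2):
    # K and V tensors: 2 * layers * heads * head_dim * tokens * dtype size
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem

# Hypothetical dense-attention 72B config: 80 layers, 64 heads of dim 128, fp16, 32K tokens
size = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, tokens=32_768)
print(f"{size / 2**30:.0f} GiB")  # ~80 GiB at fp16; fp32 or batch > 1 lands near the 160GB figure
```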

I built a protocol that stores the same KV cache in 640MB of DDR5 RAM — on a consumer RTX 4090 and Core i9.

256x compression. 0.9994 cosine similarity. 1.95ms average fetch latency. Verified.

**How:**

Transformer KV cache activations are highly structured and correlated. SVD at rank=64 exploits that structure. Random noise compresses to 0.12 fidelity. Real KV cache activations compress to 0.9994. The math works because the data isn't random — it has geometry.
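
Not the repo's code, but a minimal sketch of the claim: truncated SVD at rank 64, with mean row-wise cosine similarity as the fidelity metric. The shapes and synthetic data are assumptions; structured activations reconstruct near-perfectly while i.i.d. noise doesn't:

```python
import numpy as np

def compress_kv(kv: np.ndarray, rank: int = 64):
    """Truncated SVD of one KV block; kv is (tokens, hidden_dim)."""
    U, S, Vt = np.linalg.svd(kv, full_matrices=False)
    return U[:, :rank] * S[:rank], Vt[:rank, :]   # the only state stored

def decompress_kv(US: np.ndarray, Vt: np.ndarray) -> np.ndarray:
    return US @ Vt

def fidelity(a: np.ndarray, b: np.ndarray) -> float:
    """Mean row-wise cosine similarity between original and reconstruction."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12
    return float(np.mean(num / den))

rng = np.random.default_rng(0)
basis = rng.standard_normal((64, 4096))                 # low-dim latent structure
structured = rng.standard_normal((2048, 64)) @ basis    # correlated, like real activations
noise = rng.standard_normal((2048, 4096))               # no structure to exploit

for name, x in [("structured", structured), ("noise", noise)]:
    print(name, round(fidelity(x, decompress_kv(*compress_kv(x))), 4))
# structured reconstructs near 1.0; noise lands far below
```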

The system manages a two-tier hierarchy: VRAM is the hot tier, DDR5 is the warm tier. An attention-weighted evictor (0.5 attn + 0.3 recency + 0.2 freq) decides what stays hot. A prefetcher using layer lookahead and token prediction pre-positions pages before they're needed. Average fetch latency: 1.95ms. Max under load: 3.96ms.
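
The post gives the evictor weights outright. A minimal sketch of how that scoring might look (PageStats, the normalization constants, and pick_victims are my assumptions, not the repo's API):

```python
from dataclasses import dataclass, field
import time

@dataclass
class PageStats:                      # hypothetical per-page bookkeeping
    attn_weight: float = 0.0          # attention mass over the page, assumed normalized to [0, 1]
    last_access: float = field(default_factory=time.monotonic)
    access_count: int = 0

def hotness(p: PageStats, now: float, max_age: float = 60.0, max_freq: int = 1000) -> float:
    """The post's weighting: 0.5*attention + 0.3*recency + 0.2*frequency.
    max_age and max_freq are assumed normalization constants."""
    recency = max(0.0, 1.0 - (now - p.last_access) / max_age)
    freq = min(1.0, p.access_count / max_freq)
    return 0.5 * p.attn_weight + 0.3 * recency + 0.2 * freq

def pick_victims(pages: dict[int, PageStats], n: int) -> list[int]:
    """Return the n coldest page IDs to demote from VRAM (hot) to DDR5 (warm)."""
    now = time.monotonic()
    return sorted(pages, key=lambda pid: hotness(pages[pid], now))[:n]
```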

Current hit rate is 25% — bottlenecked by my i9's 2-channel DDR5 bandwidth (~38 GB/s). On an 8-channel Threadripper PRO (~224 GB/s) I'm projecting 65-75%.

**Running live:**
- Qwen2.5-7B on RTX 4090 at 32K context (was 4K)
- Every inference tick, KV pages are compressed to INT8 via PCA and streamed to DDR5 (sketch after this list)
- 2.4s cold start
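
A hypothetical version of that compression step; the function names and the single per-block scale factor are assumptions, not the repo's pipeline:

```python
import numpy as np

def pca_int8_pack(x: np.ndarray, rank: int = 64):
    """Project a block of activations onto its top principal components,
    then quantize the low-rank codes to INT8 with one per-block scale."""
    mean = x.mean(axis=0)
    _, _, Vt = np.linalg.svd(x - mean, full_matrices=False)
    comps = Vt[:rank]                             # (rank, dim) PCA basis
    z = (x - mean) @ comps.T                      # (tokens, rank) codes
    scale = max(float(np.abs(z).max()) / 127.0, 1e-12)
    q = np.round(z / scale).astype(np.int8)       # this is what goes to DDR5
    return q, scale, mean, comps

def pca_int8_unpack(q, scale, mean, comps):
    return (q.astype(np.float32) * scale) @ comps + mean
```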

**The cost math:**
- Uncompressed 72B KV cache: $32,000 in HBM3e
- DenseMem: $1.88 in DDR5
- 99.99% cost reduction. Verified on consumer hardware.
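
Taking the post's prices at face value, the ratios check out:

```python
hbm3e, ddr5 = 32_000.00, 1.88                    # the post's claimed prices
print(f"{160 * 1024 / 640:.0f}x compression")    # 256x
print(f"{1 - ddr5 / hbm3e:.2%} cost reduction")  # 99.99%
```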

GitHub: https://github.com/thorshammerztp-arch/densemem-protocol
Patent Pending: US 64/045,595

Solo developer. Navy veteran. No funding. Consumer hardware.

We're very much on the same wavelength. I think we could do far better if we combined tech.
Just saying: I've been able to reduce debugging by 95% and build out predeterministic builds through language design, with all-encompassing math turned into 3-dimensional code I call auqqua. If you apply that RAM technique to a parallel hold state, you've also just reduced debugging. If you use my math, Reimiro Miro, it's 400,000x faster. The key is implementing resonance in a customized way that seems like it would be slower, but it definitely is not.


some benchmarks
[attached screenshot: Screenshot_20260414_014125_Firefox]

my bad, thought this was a chat
