tags:
  - fraqtl
  - kv-cache-optimized
  - inference
license: other

Llama 3.2 3B — fraQtl KV Cache Optimized

This checkpoint's KV cache is optimized with fraQtl: it uses 3.5x less KV cache memory during inference.

Note: The model file size is the same as the original (~6.4GB). The optimization modifies V projection weights so that at inference time, the KV cache uses 3.5x less GPU memory. The savings happen at runtime, not at download.
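To get a feel for what a 3.5x runtime saving means, here is a rough back-of-the-envelope sketch. The layer count, KV head count, and head dimension are assumptions chosen to match the public Llama 3.2 3B configuration and should be checked against the checkpoint's config.json. The 3.5x factor is simply the figure quoted above, not something this sketch derives.

```python
# Rough KV cache size estimate. The architecture numbers are assumptions;
# verify them against the checkpoint's config.json before relying on them.
NUM_LAYERS = 28          # assumed num_hidden_layers
NUM_KV_HEADS = 8         # assumed num_key_value_heads (grouped-query attention)
HEAD_DIM = 128           # assumed per-head dimension
BYTES_PER_VALUE = 2      # fp16/bf16 cache entries
CLAIMED_REDUCTION = 3.5  # figure quoted in this model card


def kv_cache_bytes(seq_len: int, batch_size: int = 1) -> int:
    """Bytes for an uncompressed KV cache: one K and one V tensor per layer."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * seq_len * batch_size


baseline = kv_cache_bytes(seq_len=8192)
optimized = baseline / CLAIMED_REDUCTION
print(f"baseline:  {baseline / 2**20:.0f} MiB")
print(f"optimized: {optimized / 2**20:.0f} MiB")
```

Under these assumptions, an 8192-token context needs about 896 MiB of cache uncompressed and about 256 MiB at the quoted 3.5x reduction. Actual numbers depend on cache dtype, batch size, and the runtime used.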

| Metric | Value |
|---|---|
| Original | meta-llama/Llama-3.2-3B |
| File size | Same as original (~6.4GB) |
| KV cache memory | 3.5x less at runtime |
| PPL before | 14.3943 |
| PPL after | 14.8613 |
| Delta | +0.467 (weight-level) |
| Config | k=32, INT3 |
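The perplexity delta can be checked directly from the two reported values. The snippet below is only arithmetic on the figures in this card. The relative increase it prints is derived here and is not a number reported by the authors.

```python
# Perplexity values as reported in this model card.
ppl_before = 14.3943
ppl_after = 14.8613

delta = ppl_after - ppl_before   # absolute increase in perplexity
relative = delta / ppl_before    # increase relative to the original model

print(f"delta:    {delta:+.3f}")
print(f"relative: {relative:.2%}")
```

This reproduces the +0.467 delta, which corresponds to a relative perplexity increase of a little over 3%.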

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("fraQtl/Llama-3.2-3B-compressed")
tokenizer = AutoTokenizer.from_pretrained("fraQtl/Llama-3.2-3B-compressed")
# KV cache uses 3.5x less memory during inference.
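Once loaded, the model can presumably be used like any other causal LM in transformers. The helper below is a minimal sketch rather than an official example: `generate_text` is a hypothetical name, and it assumes the repository ships a compatible tokenizer and that enough memory is available for the ~6.4GB of weights.

```python
def generate_text(prompt: str,
                  model_id: str = "fraQtl/Llama-3.2-3B-compressed",
                  max_new_tokens: int = 32) -> str:
    """Hypothetical helper: load the checkpoint and continue a prompt."""
    # Imported lazily so that defining the helper does not require the library.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Tokenize the prompt into PyTorch tensors and generate a continuation.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # The output contains the prompt tokens followed by the new tokens.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_text("The capital of France is")` downloads the weights on first use and returns the prompt followed by the generated continuation.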

Runtime Compression

Our runtime compression achieves a PPL delta of +0.01, roughly 50x smaller than the +0.467 delta of this weight-level demo. Contact us for integration.


fraqtl.ai | contact@fraqtl.ai | Patent pending. Paper: arXiv:2604.11501