Qwen 2.5 3B — fraQtl KV Cache Optimized

Qwen 2.5 3B with its KV cache optimized by fraQtl: the KV cache uses 3.5x less memory during inference.

Note: The model file size is the same as the original (~6.2GB). The optimization modifies V projection weights so that at inference time, the KV cache uses 3.5x less GPU memory. The savings happen at runtime, not at download.
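To give a sense of scale for the runtime savings, here is a rough back-of-the-envelope sketch of KV cache size. The layer count, KV head count, and head dimension below are assumed values for Qwen2.5-3B and should be checked against the model's config.json. The 32,768-token sequence length is only an illustrative choice, and the 3.5x figure is applied as a flat ratio taken from this card.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes held by an uncompressed KV cache (keys + values, all layers)."""
    # The leading factor 2 covers the separate key and value tensors.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed architecture values for Qwen2.5-3B; verify against config.json.
baseline = kv_cache_bytes(num_layers=36, num_kv_heads=2, head_dim=128,
                          seq_len=32768)  # FP16 cache, batch size 1

# The card's claimed reduction, applied as a simple ratio.
reduced = baseline / 3.5

print(f"baseline: {baseline / 2**20:.0f} MiB")  # baseline: 1152 MiB
print(f"at 3.5x:  {reduced / 2**20:.0f} MiB")   # at 3.5x:  329 MiB
```

The cache grows linearly with sequence length and batch size, so the absolute savings scale with both, while the file size on disk stays unchanged.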

Metric	Value
Base model	Qwen/Qwen2.5-3B
File size	Same as original (~6.2GB)
KV cache memory	3.5x less at runtime
Perplexity (PPL) before	14.4222
Perplexity (PPL) after	14.7302
PPL delta	+0.308 (weight-level)
Config	k=32, INT3

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("fraQtl/Qwen-2.5-3B-compressed")
tokenizer = AutoTokenizer.from_pretrained("fraQtl/Qwen-2.5-3B-compressed")
# KV cache uses 3.5x less memory during inference.
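
The snippet above only loads the model. A minimal sketch of running generation with it follows, assuming the standard transformers `generate` API and default CPU placement. The helper name `generate_text` and the prompt are illustrative and not part of the fraQtl release.

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=32):
    """Generate a short continuation using the model's default settings."""
    # Tokenize the prompt into PyTorch tensors (input_ids, attention_mask).
    inputs = tokenizer(prompt, return_tensors="pt")
    # use_cache=True keeps the KV cache enabled while decoding.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                use_cache=True)
    # Decode the first (and only) sequence in the batch back to text.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Usage, with `model` and `tokenizer` loaded as shown above:
# print(generate_text(model, tokenizer, "The KV cache stores"))
```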

Runtime Compression

Our runtime compression adds only +0.01 PPL, roughly 30x less degradation than the +0.308 of this weight-level demo. Contact us for integration.
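
The perplexity figures quoted on this card can be cross-checked with a few lines of arithmetic. This is only a sanity check on the published numbers, not a re-evaluation of the model. The variable names are illustrative.

```python
ppl_before = 14.4222       # base model, from the table
ppl_after = 14.7302        # this weight-level checkpoint, from the table
runtime_delta = 0.01       # runtime compression figure quoted above

delta = ppl_after - ppl_before
relative = delta / ppl_before
ratio = delta / runtime_delta

print(f"delta:    +{delta:.3f}")      # delta:    +0.308
print(f"relative: +{relative:.2%}")   # relative: +2.14%
print(f"ratio:    {ratio:.1f}x")      # ratio:    30.8x
```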

fraqtl.ai | contact@fraqtl.ai | Patent pending. Paper: arXiv:2604.11501

Format: Safetensors | Model size: 3B params | Tensor type: F16

Paper: Quantization Dominates Rank Reduction for KV-Cache Compression (arXiv:2604.11501)