# TinyLlama 1.1B – fraQtl KV Cache Optimized

KV cache optimized with fraQtl: 3.5x less KV cache memory during inference.

**Note:** The model file size is unchanged from the original (~2.2 GB). The optimization modifies the V projection weights so that, at inference time, the KV cache uses less GPU memory. The savings occur at runtime, not at download.
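For scale, here is a back-of-the-envelope sketch of what a 3.5x reduction means at long context, assuming TinyLlama-1.1B's published geometry (22 layers, 4 grouped-query KV heads of dimension 64) and an FP16 cache. The 3.5x factor is taken from the claim above; the helper function is ours, not part of fraQtl:

```python
def kv_cache_bytes(seq_len, n_layers=22, n_kv_heads=4, head_dim=64, bytes_per_elt=2):
    # Per token, the cache stores K and V: n_layers * n_kv_heads * head_dim each.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elt * seq_len

baseline = kv_cache_bytes(4096)   # FP16 cache at 4k context
optimized = baseline / 3.5        # claimed 3.5x reduction
print(f"{baseline / 2**20:.1f} MiB -> {optimized / 2**20:.1f} MiB")
```

At a 4k context this works out to roughly 88 MiB of cache shrinking to about 25 MiB per sequence; the absolute saving grows linearly with context length and batch size.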

| Metric | Value |
|---|---|
| Original model | TinyLlama/TinyLlama-1.1B-Chat-v1.0 |
| File size | Same as original (~2.2 GB) |
| PPL before | 15.5249 |
| PPL after | 15.8782 |
| PPL delta | +0.353 (weight-level) |
| Config | k=16, INT3 |
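The perplexity delta in the table corresponds to roughly a 2.3% relative increase; a quick arithmetic check:

```python
ppl_before, ppl_after = 15.5249, 15.8782
delta = ppl_after - ppl_before
rel = delta / ppl_before
print(f"delta = {delta:+.3f} ({rel:.1%} relative)")
```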

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("fraQtl/TinyLlama-1.1B-compressed")
tokenizer = AutoTokenizer.from_pretrained("fraQtl/TinyLlama-1.1B-compressed")
```

## Runtime Compression

Our runtime compression achieves significantly better results on larger models. Contact us for integration.


fraqtl.ai | contact@fraqtl.ai | Patent pending. Paper: arXiv:2604.11501

Format: Safetensors · 1B params · F16 tensors
