---
tags:
- fraqtl
- kv-cache-optimized
- inference
license: other
---
# Mistral 7B — fraQtl KV Cache Optimized
**KV cache optimized with [fraQtl](https://fraqtl.ai)** — 3.5x less KV cache memory during inference.
> **Note:** The model file is the same size as the original (~14GB). The optimization modifies the V projection weights so that, at inference time, the KV cache uses 3.5x less GPU memory. The savings are realized at runtime, not in the downloaded file.
| Metric | Value |
|--------|-------|
| Original | [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) |
| File size | Same as original (~14GB) |
| KV cache memory | **3.5x less at runtime** |
| PPL before (lower is better) | 10.4690 |
| PPL after | 10.6908 |
| PPL delta | +0.222 (weight-level) |
| Config | k=64, INT3 |
## How It Works
The model weights are rotated into an eigenbasis that separates important V-cache directions from noise. At inference, the KV cache concentrates information in fewer dimensions — using 3.5x less memory.
**Our runtime compression (the full product) achieves a delta of just +0.01 PPL** on the same model. Contact us for integration.
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("fraQtl/Mistral-7B-compressed")
tokenizer = AutoTokenizer.from_pretrained("fraQtl/Mistral-7B-compressed")

# KV cache uses 3.5x less memory during inference.
inputs = tokenizer("Explain how photosynthesis works in simple terms:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
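To put the 3.5x runtime saving in concrete terms, here is a back-of-envelope estimate of FP16 KV-cache size using Mistral 7B's public configuration (32 layers, 8 grouped-query KV heads, head dimension 128); the formula is the standard cache-size calculation, and the 3.5x divisor is the figure quoted above.

```python
def kv_cache_bytes(seq_len, layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2x accounts for keys and values; per layer, per KV head, per position.
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

baseline = kv_cache_bytes(seq_len=32_768)   # FP16 cache at 32k context
optimized = baseline / 3.5                  # quoted 3.5x runtime saving
print(f"baseline:  {baseline / 2**30:.2f} GiB")   # → baseline:  4.00 GiB
print(f"optimized: {optimized / 2**30:.2f} GiB")  # → optimized: 1.14 GiB
```

At a 32k-token context that is roughly 2.9 GiB of GPU memory freed per sequence, which compounds across batched requests.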
## Generation Samples
**Prompt:** Explain how photosynthesis works in simple terms:
**Output:** Photosynthesis is the process by which plants use energy from sunlight to make their own food. Plants need carbon dioxide, water, and light to make their own food...
**Prompt:** The three most important breakthroughs in physics during the 20th century were
**Output:** The three most important breakthroughs in physics during the 20th century were the theory of relativity, quantum mechanics, and string theory...
## Runtime Compression (the full product)
| Method | PPL Delta | How |
|--------|-----------|-----|
| This download (weight-level) | +0.222 | Modified weights, download and use |
| Runtime cache compression | **+0.01** | fraQtl applied during inference |
Runtime compression cuts the perplexity penalty by roughly 22x (+0.01 vs. +0.222). Available for production deployment.
---
[fraqtl.ai](https://fraqtl.ai) | contact@fraqtl.ai | Patent pending. [Paper: arXiv:2604.11501](https://arxiv.org/abs/2604.11501)