---
tags:
- fraqtl
- kv-cache-optimized
license: other
---

# Mistral 7B — fraQtl KV Cache Optimized

**KV cache optimized with [fraQtl](https://fraqtl.ai)** — 3.5x less KV cache memory during inference.

> **Note:** The model file size is the same as the original (~14GB). The optimization modifies the V projection weights so that at inference time the KV cache uses 3.5x less GPU memory. The savings happen at runtime, not at download.

| Metric | Value |
|--------|-------|
| Base model | [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) |
| File size | Same as original (~14GB) |
| KV cache memory | **3.5x less at runtime** |
| PPL before | 10.4690 |
| PPL after | 10.6908 |
| Delta | +0.222 (weight-level) |
| Config | k=64, INT3 |

## How It Works

The model weights are rotated into an eigenbasis that separates important V-cache directions from noise. At inference, the KV cache concentrates information in fewer dimensions, using 3.5x less memory.

**Our runtime compression (the full product) achieves +0.01 PPL** on the same model. Contact us for integration.

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("fraQtl/Mistral-7B-compressed")
tokenizer = AutoTokenizer.from_pretrained("fraQtl/Mistral-7B-compressed")

# The KV cache uses 3.5x less memory during inference.
```

## Generation Samples

**Prompt:** Explain how photosynthesis works in simple terms:

**Output:** Photosynthesis is the process by which plants use energy from sunlight to make their own food. Plants need carbon dioxide, water, and light to make their own food...

**Prompt:** The three most important breakthroughs in physics during the 20th century were

**Output:** The three most important breakthroughs in physics during the 20th century were the theory of relativity, quantum mechanics, and string theory...
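The eigenbasis idea in "How It Works" can be illustrated with a generic PCA-style sketch: diagonalize the second-moment matrix of sample value activations, rotate into that eigenbasis, and keep only the leading k dimensions (the config table lists k=64). This is an illustrative stand-in, not fraQtl's actual algorithm; the data and all names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical calibration data for one head; Mistral-7B uses head_dim=128,
# and the card's config keeps k=64 cache dimensions.
head_dim, k, n_tokens = 128, 64, 1024
V = rng.normal(size=(n_tokens, head_dim)) @ rng.normal(size=(head_dim, head_dim)) * 0.1

# Eigenbasis of the values' second-moment matrix, sorted by explained variance.
cov = V.T @ V / n_tokens
eigvals, Q = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
Q = Q[:, ::-1]                     # reorder: most important direction first

# Folding Q into the V projection means the cached values arrive pre-rotated;
# information concentrates in the leading dims, so only k of them are cached.
V_rot = V @ Q
V_trunc = V_rot[:, :k]

# Attention can approximately reconstruct the full values from the k kept dims.
V_hat = V_trunc @ Q[:, :k].T
rel_err = np.linalg.norm(V - V_hat) / np.linalg.norm(V)
print(f"relative reconstruction error at k={k}: {rel_err:.3f}")
```

Because the eigenbasis is optimal among rank-k orthogonal projections, this truncation loses less energy than simply dropping half of the original coordinates; the card's INT3 setting would additionally quantize the kept dimensions, which is omitted here.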
## Runtime Compression (the full product)

| Method | PPL Delta | How |
|--------|-----------|-----|
| This download (weight-level) | +0.222 | Modified weights, download and use |
| Runtime cache compression | **+0.01** | fraQtl applied during inference |

Runtime compression yields roughly 22x less perplexity degradation (+0.01 vs +0.222). Available for production deployment.

---

[fraqtl.ai](https://fraqtl.ai) | contact@fraqtl.ai | Patent pending.

[Paper: arXiv:2604.11501](https://arxiv.org/abs/2604.11501)
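As a back-of-envelope check of what the 3.5x runtime figure means in absolute terms, the sketch below sizes the fp16 KV cache using Mistral-7B-v0.1's published attention config (32 layers, 8 KV heads via grouped-query attention, head dim 128). The 3.5x divisor is the claim made on this card, not something this snippet measures.

```python
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Bytes held by the K and V caches at fp16 (Mistral-7B-v0.1 GQA shapes)."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * n_tokens

baseline = kv_cache_bytes(8192)    # exactly 1 GiB at an 8K-token context
optimized = baseline / 3.5         # applying the 3.5x figure claimed above
print(f"baseline:  {baseline / 2**30:.2f} GiB")   # 1.00 GiB
print(f"optimized: {optimized / 2**30:.2f} GiB")  # 0.29 GiB
```

At longer contexts or larger batches the cache, not the weights, dominates GPU memory, which is why a runtime-only saving of this size matters despite the unchanged ~14GB download.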