---
tags:
  - fraqtl
  - kv-cache-optimized
  - inference
license: other
---
# Mistral 7B — fraQtl KV Cache Optimized

**KV cache optimized with [fraQtl](https://fraqtl.ai)** — 3.5x less KV cache memory during inference.

> **Note:** The model file size is the same as the original (~14GB). The optimization modifies V projection weights so that at inference time, the KV cache uses 3.5x less GPU memory. The savings happen at runtime, not at download.

| Metric | Value |
|--------|-------|
| Original | [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) |
| File size | Same as original (~14GB) |
| KV cache memory | **3.5x less at runtime** |
| PPL before | 10.4690 |
| PPL after | 10.6908 |
| PPL delta | +0.222 (weight-level) |
| Config | k=64, INT3 |
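
To put the runtime saving in absolute terms, here is a rough back-of-the-envelope sketch. It assumes the public Mistral-7B-v0.1 configuration (32 layers, 8 KV heads, head dimension 128) and an fp16 baseline cache, and it applies the 3.5x figure from the table as a flat ratio. The helper `kv_cache_bytes` is a hypothetical name used for illustration, not part of any fraQtl API.

```python
def kv_cache_bytes(seq_len, batch_size=1, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for `seq_len` tokens (fp16 by default)."""
    # Factor 2: one tensor for keys and one for values in every layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return batch_size * seq_len * per_token


# Ratio quoted in this card, applied as a flat factor (assumption).
REDUCTION = 3.5

for seq_len in (4096, 8192, 32768):
    baseline_gib = kv_cache_bytes(seq_len) / 2**30
    print(f"{seq_len:>6} tokens: {baseline_gib:.2f} GiB baseline, "
          f"about {baseline_gib / REDUCTION:.2f} GiB at 3.5x")
```

Under these assumptions the baseline cache costs 128 KiB per token, so the saving grows linearly with context length and batch size.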

## How It Works

The V projection weights are rotated into an eigenbasis that separates the important V-cache directions from noise. At inference time the KV cache concentrates information in fewer dimensions, which is where the 3.5x memory saving comes from.
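
The general idea of rotating a value projection into an eigenbasis can be sketched with NumPy on a toy single-head layer. This is a generic PCA-style illustration, not fraQtl's actual (patent-pending) method: the sizes, the random weights `W_v` and `W_o`, and the calibration inputs `X` are all made up, and the INT3 quantization step is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, head_dim, k, n_tokens = 256, 128, 64, 2048  # toy sizes, single head

# Hypothetical calibration inputs and weights. The value projection is built so
# that its output variance is unevenly spread across hidden directions.
X = rng.standard_normal((n_tokens, d_model))
scales = np.linspace(1.0, 0.01, head_dim)
mix, _ = np.linalg.qr(rng.standard_normal((head_dim, head_dim)))
W_v = (rng.standard_normal((d_model, head_dim)) * scales) @ mix
W_o = rng.standard_normal((head_dim, d_model))

# Eigenbasis of the value covariance, sorted by decreasing variance.
V = X @ W_v
eigvals, Q = np.linalg.eigh(V.T @ V / n_tokens)
order = np.argsort(eigvals)[::-1]
eigvals, Q = eigvals[order], Q[:, order]

# Fold the rotation into the weights: Q into W_v, its transpose into W_o.
W_v_rot = W_v @ Q
W_o_rot = Q.T @ W_o

# The rotation alone is lossless. Attention weights are omitted here because
# they mix tokens (rows), so they commute with this rotation of the columns.
out_ref = V @ W_o
out_rot = (X @ W_v_rot) @ W_o_rot
assert np.allclose(out_ref, out_rot)

# Cache only the k leading directions of the rotated values.
V_trunc = (X @ W_v_rot)[:, :k]
out_trunc = V_trunc @ W_o_rot[:k]
energy_kept = eigvals[:k].sum() / eigvals.sum()
rel_err = np.linalg.norm(out_ref - out_trunc) / np.linalg.norm(out_ref)
print(f"variance kept by top {k} of {head_dim} dims: {energy_kept:.3f}")
print(f"relative output error after truncation: {rel_err:.3f}")
```

Because the rotation is orthogonal, folding it into both projections changes nothing until dimensions are dropped; the quality cost comes only from the discarded low-variance directions (and, in a real pipeline, from quantization).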

**Our runtime compression (the full product) achieves a PPL delta of only +0.01** on the same model. Contact us for integration.

## Usage

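```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "fraQtl/Mistral-7B-compressed"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Generate as usual; the KV cache uses 3.5x less memory during inference.
inputs = tokenizer("Explain how photosynthesis works in simple terms:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```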

## Generation Samples

**Prompt:** Explain how photosynthesis works in simple terms:

**Output:** Photosynthesis is the process by which plants use energy from sunlight to make their own food. Plants need carbon dioxide, water, and light to make their own food...

**Prompt:** The three most important breakthroughs in physics during the 20th century were

**Output:** The three most important breakthroughs in physics during the 20th century were the theory of relativity, quantum mechanics, and string theory...

## Runtime Compression (the full product)

| Method | PPL Delta | How |
|--------|-----------|-----|
| This download (weight-level) | +0.222 | Modified weights, download and use |
| Runtime cache compression | **+0.01** | fraQtl applied during inference |

Runtime compression cuts the perplexity penalty from +0.222 to +0.01, roughly 20x smaller going by the rounded figures in the table. It is available for production deployment.

---

[fraqtl.ai](https://fraqtl.ai) | contact@fraqtl.ai | Patent pending. [Paper: arXiv:2604.11501](https://arxiv.org/abs/2604.11501)