---
tags:
- fraqtl
- kv-cache-optimized
- inference
license: other
---
# Mistral 7B — fraQtl KV Cache Optimized

**KV cache optimized with [fraQtl](https://fraqtl.ai)** — 3.5x less KV cache memory during inference.

> **Note:** The model file size is the same as the original (~14GB). The optimization modifies the V projection weights so that, at inference time, the KV cache uses 3.5x less GPU memory. The savings happen at runtime, not at download.

| Metric | Value |
|--------|-------|
| Original model | [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) |
| File size | Same as original (~14GB) |
| KV cache memory | **3.5x less at runtime** |
| PPL before | 10.4690 |
| PPL after | 10.6908 |
| Delta | +0.222 (weight-level) |
| Config | k=64, INT3 |
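
For context, a back-of-the-envelope FP16 KV cache footprint for Mistral-7B (32 layers, 8 KV heads via grouped-query attention, head dim 128, per the public model config; the 3.5x factor below is the table's claim, not something this arithmetic derives):

```python
# Rough FP16 KV-cache size for Mistral-7B. Config values are from the
# public mistralai/Mistral-7B-v0.1 config; the 3.5x reduction is fraQtl's
# reported figure, applied here only to show the resulting footprint.
layers, kv_heads, head_dim, bytes_fp16 = 32, 8, 128, 2

per_token = 2 * layers * kv_heads * head_dim * bytes_fp16  # K and V tensors
print(per_token / 1024, "KiB per cached token")            # 128.0 KiB
print(per_token * 8192 / 2**20, "MiB at 8K context")       # 1024.0 MiB
print(per_token * 8192 / 2**20 / 3.5, "MiB with 3.5x reduction")
```

At an 8K-token context, the FP16 cache is about 1 GiB per sequence, so a 3.5x reduction brings it to roughly 293 MiB.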

## How It Works

The model weights are rotated into an eigenbasis that separates the important V-cache directions from noise. At inference, the cached values concentrate their information in fewer dimensions, so the KV cache uses 3.5x less GPU memory.
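
As an illustrative sketch only (fraQtl's actual rotation, and how the k=64/INT3 configuration enters, are not published here; all shapes below are hypothetical toys), rotating a V projection into the eigenbasis of the value covariance concentrates variance in the leading cache dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; real values come from the model config).
d_model, d_head, n_tokens = 256, 64, 1024
W_v = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)  # toy V projection
X = rng.standard_normal((n_tokens, d_model))                     # toy activations

V = X @ W_v                            # value vectors that would enter the cache
cov = np.cov(V, rowvar=False)          # covariance across the d_head cache dims
_, Q = np.linalg.eigh(cov)             # orthonormal eigenbasis, ascending order
Q = Q[:, ::-1]                         # flip to descending: important dims first

W_v_rot = W_v @ Q                      # fold the rotation into the weights...
V_rot = X @ W_v_rot                    # ...so cached values land pre-rotated

k = 16                                 # keep only the leading k dims (cf. k=64)
V_small = V_rot[:, :k]                 # smaller cache entry; trailing dims dropped
```

Because `Q` is orthonormal, `V_rot` carries exactly the same information as `V`; truncating to the leading `k` dimensions is what shrinks the cache, and quantizing the surviving dimensions (e.g. to INT3) would shrink it further.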

**Our runtime cache compression (the full product) achieves a +0.01 PPL delta** on the same model. Contact us for integration.

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("fraQtl/Mistral-7B-compressed")
tokenizer = AutoTokenizer.from_pretrained("fraQtl/Mistral-7B-compressed")

# Standard generation; the KV cache uses 3.5x less GPU memory at inference time.
inputs = tokenizer("Explain how photosynthesis works:", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Generation Samples

**Prompt:** Explain how photosynthesis works in simple terms:

**Output:** Photosynthesis is the process by which plants use energy from sunlight to make their own food. Plants need carbon dioxide, water, and light to make their own food...

**Prompt:** The three most important breakthroughs in physics during the 20th century were

**Output:** The three most important breakthroughs in physics during the 20th century were the theory of relativity, quantum mechanics, and string theory...

## Runtime Compression (the full product)

| Method | PPL delta | How |
|--------|-----------|-----|
| This download (weight-level) | +0.222 | Modified weights; download and use |
| Runtime cache compression | **+0.01** | fraQtl applied during inference |

Runtime cache compression gives a roughly 22x smaller perplexity increase (+0.01 vs. +0.222). Available for production deployment.

---

[fraqtl.ai](https://fraqtl.ai) | contact@fraqtl.ai | Patent pending. [Paper: arXiv:2604.11501](https://arxiv.org/abs/2604.11501)