---
tags:
- fraqtl
- kv-cache-optimized
license: other
---

# Mistral 7B — fraQtl KV Cache Optimized

**KV cache optimized with [fraQtl](https://fraqtl.ai)** — 3.5x less KV cache memory during inference.

> **Note:** The model file size is the same as the original (~14GB). The optimization modifies the V projection weights so that at inference time the KV cache uses 3.5x less GPU memory. The savings happen at runtime, not at download.

| Metric | Value |
|--------|-------|
| Base model | [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) |
| File size | Same as original (~14GB) |
| KV cache memory | **3.5x less at runtime** |
| PPL before | 10.4690 |
| PPL after | 10.6908 |
| Delta | +0.222 (weight-level) |
| Config | k=64, INT3 |

## How It Works

The model weights are rotated into an eigenbasis that separates important V-cache directions from noise. At inference, the KV cache concentrates information in fewer dimensions, using 3.5x less memory.

**Our runtime compression (the full product) achieves +0.01 PPL** on the same model. Contact us for integration.

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("fraQtl/Mistral-7B-compressed")
tokenizer = AutoTokenizer.from_pretrained("fraQtl/Mistral-7B-compressed")

# The KV cache uses 3.5x less memory during inference.
```

## Generation Samples

**Prompt:** Explain how photosynthesis works in simple terms:

**Output:** Photosynthesis is the process by which plants use energy from sunlight to make their own food. Plants need carbon dioxide, water, and light to make their own food...

**Prompt:** The three most important breakthroughs in physics during the 20th century were

**Output:** The three most important breakthroughs in physics during the 20th century were the theory of relativity, quantum mechanics, and string theory...
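The eigenbasis idea in "How It Works" can be illustrated with a generic PCA-style sketch: diagonalize the second-moment matrix of sample value activations, rotate into that eigenbasis, and keep only the leading k dimensions (the config table lists k=64). This is an illustrative stand-in, not fraQtl's actual algorithm; the data and all names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical calibration data for one head; Mistral-7B uses head_dim=128,
# and the card's config keeps k=64 cache dimensions.
head_dim, k, n_tokens = 128, 64, 1024
V = rng.normal(size=(n_tokens, head_dim)) @ rng.normal(size=(head_dim, head_dim)) * 0.1

# Eigenbasis of the values' second-moment matrix, sorted by explained variance.
cov = V.T @ V / n_tokens
eigvals, Q = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
Q = Q[:, ::-1]                     # reorder: most important direction first

# Folding Q into the V projection means the cached values arrive pre-rotated;
# information concentrates in the leading dims, so only k of them are cached.
V_rot = V @ Q
V_trunc = V_rot[:, :k]

# Attention can approximately reconstruct the full values from the k kept dims.
V_hat = V_trunc @ Q[:, :k].T
rel_err = np.linalg.norm(V - V_hat) / np.linalg.norm(V)
print(f"relative reconstruction error at k={k}: {rel_err:.3f}")
```

Because the eigenbasis is optimal among rank-k orthogonal projections, this truncation loses less energy than simply dropping half of the original coordinates; the card's INT3 setting would additionally quantize the kept dimensions, which is omitted here.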
## Runtime Compression (the full product)

| Method | PPL Delta | How |
|--------|-----------|-----|
| This download (weight-level) | +0.222 | Modified weights, download and use |
| Runtime cache compression | **+0.01** | fraQtl applied during inference |

Runtime compression yields roughly 22x less perplexity degradation (+0.01 vs +0.222). Available for production deployment.

---

[fraqtl.ai](https://fraqtl.ai) | contact@fraqtl.ai | Patent pending.

[Paper: arXiv:2604.11501](https://arxiv.org/abs/2604.11501)
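As a back-of-envelope check of what the 3.5x runtime figure means in absolute terms, the sketch below sizes the fp16 KV cache using Mistral-7B-v0.1's published attention config (32 layers, 8 KV heads via grouped-query attention, head dim 128). The 3.5x divisor is the claim made on this card, not something this snippet measures.

```python
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Bytes held by the K and V caches at fp16 (Mistral-7B-v0.1 GQA shapes)."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * n_tokens

baseline = kv_cache_bytes(8192)    # exactly 1 GiB at an 8K-token context
optimized = baseline / 3.5         # applying the 3.5x figure claimed above
print(f"baseline:  {baseline / 2**30:.2f} GiB")   # 1.00 GiB
print(f"optimized: {optimized / 2**30:.2f} GiB")  # 0.29 GiB
```

At longer contexts or larger batches the cache, not the weights, dominates GPU memory, which is why a runtime-only saving of this size matters despite the unchanged ~14GB download.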