---
tags:
- fraqtl
- kv-cache-optimized
- inference
license: other
---
# Mistral 7B — fraQtl KV Cache Optimized

**KV cache optimized with [fraQtl](https://fraqtl.ai)** — 3.5x less KV cache memory during inference.

> **Note:** The model file size is the same as the original (~14GB). The optimization modifies the V projection weights so that, at inference time, the KV cache uses 3.5x less GPU memory. The savings happen at runtime, not at download.

| Metric | Value |
|--------|-------|
| Original model | [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) |
| File size | Same as original (~14GB) |
| KV cache memory | **3.5x less at runtime** |
| PPL before | 10.4690 |
| PPL after | 10.6908 |
| Delta | +0.222 (weight-level) |
| Config | k=64, INT3 |
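
For context, a back-of-the-envelope FP16 KV cache footprint for Mistral-7B (32 layers, 8 KV heads via grouped-query attention, head dim 128, per the public model config; the 3.5x factor below is the table's claim, not something this arithmetic derives):

```python
# Rough FP16 KV-cache size for Mistral-7B. Config values are from the
# public mistralai/Mistral-7B-v0.1 config; the 3.5x reduction is fraQtl's
# reported figure, applied here only to show the resulting footprint.
layers, kv_heads, head_dim, bytes_fp16 = 32, 8, 128, 2

per_token = 2 * layers * kv_heads * head_dim * bytes_fp16  # K and V tensors
print(per_token / 1024, "KiB per cached token")            # 128.0 KiB
print(per_token * 8192 / 2**20, "MiB at 8K context")       # 1024.0 MiB
print(per_token * 8192 / 2**20 / 3.5, "MiB with 3.5x reduction")
```

At an 8K-token context, the FP16 cache is about 1 GiB per sequence, so a 3.5x reduction brings it to roughly 293 MiB.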

## How It Works

The model weights are rotated into an eigenbasis that separates the important V-cache directions from noise. At inference, the cached values concentrate their information in fewer dimensions, so the KV cache uses 3.5x less GPU memory.
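
As an illustrative sketch only (fraQtl's actual rotation, and how the k=64/INT3 configuration enters, are not published here; all shapes below are hypothetical toys), rotating a V projection into the eigenbasis of the value covariance concentrates variance in the leading cache dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; real values come from the model config).
d_model, d_head, n_tokens = 256, 64, 1024
W_v = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)  # toy V projection
X = rng.standard_normal((n_tokens, d_model))                     # toy activations

V = X @ W_v                            # value vectors that would enter the cache
cov = np.cov(V, rowvar=False)          # covariance across the d_head cache dims
_, Q = np.linalg.eigh(cov)             # orthonormal eigenbasis, ascending order
Q = Q[:, ::-1]                         # flip to descending: important dims first

W_v_rot = W_v @ Q                      # fold the rotation into the weights...
V_rot = X @ W_v_rot                    # ...so cached values land pre-rotated

k = 16                                 # keep only the leading k dims (cf. k=64)
V_small = V_rot[:, :k]                 # smaller cache entry; trailing dims dropped
```

Because `Q` is orthonormal, `V_rot` carries exactly the same information as `V`; truncating to the leading `k` dimensions is what shrinks the cache, and quantizing the surviving dimensions (e.g. to INT3) would shrink it further.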

**Our runtime cache compression (the full product) achieves a +0.01 PPL delta** on the same model. Contact us for integration.

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("fraQtl/Mistral-7B-compressed")
tokenizer = AutoTokenizer.from_pretrained("fraQtl/Mistral-7B-compressed")

# Standard generation; the KV cache uses 3.5x less GPU memory at inference time.
inputs = tokenizer("Explain how photosynthesis works:", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Generation Samples

**Prompt:** Explain how photosynthesis works in simple terms:

**Output:** Photosynthesis is the process by which plants use energy from sunlight to make their own food. Plants need carbon dioxide, water, and light to make their own food...

**Prompt:** The three most important breakthroughs in physics during the 20th century were

**Output:** The three most important breakthroughs in physics during the 20th century were the theory of relativity, quantum mechanics, and string theory...

## Runtime Compression (the full product)

| Method | PPL delta | How |
|--------|-----------|-----|
| This download (weight-level) | +0.222 | Modified weights; download and use |
| Runtime cache compression | **+0.01** | fraQtl applied during inference |

Runtime cache compression gives a roughly 22x smaller perplexity increase (+0.01 vs. +0.222). Available for production deployment.

---

[fraqtl.ai](https://fraqtl.ai) | contact@fraqtl.ai | Patent pending. [Paper: arXiv:2604.11501](https://arxiv.org/abs/2604.11501)