# Mistral 7B — fraQtl Weight-Compressed

Full model weight compression with fraQtl.
| Metric | Value |
|---|---|
| Original | mistralai/Mistral-7B-v0.1 |
| Weight compression | 4.4x on MLP projections |
| PPL delta | +0.43 to +0.48 (run-to-run variance ±0.07) |
| File size | 14.5 GB (fp16 stub — packed INT3 version coming) |
| KV cache | Additional 3.5x with runtime compression |
## What This Is
This model's MLP weights (68% of parameters) have been compressed using the fraQtl eigenbasis — the same V Theorem that powers our KV cache compression, applied to weight matrices.
The model file is currently stored as fp16 (same size as original). A packed INT3 version (~8.5 GB) is coming soon.
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "fraQtl/Mistral-7B-fraqtl",
    torch_dtype="float16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("fraQtl/Mistral-7B-fraqtl")

# Optional: enable KV cache compression for additional memory savings
# pip install fraqtl
# import fraqtl
# fraqtl.enable_cache_compression(model, k=16, bits=3)

inputs = tokenizer("The future of AI is", return_tensors="pt").input_ids.to(model.device)
output = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## Combined Savings (Weight + KV Cache)
| Component | FP16 | With fraQtl |
|---|---|---|
| Model weights | 14.5 GB | ~8.5 GB (packed, coming) |
| KV cache (32K) | 4.3 GB | 1.2 GB |
| Total GPU | 18.8 GB | ~9.7 GB |
This fits Mistral 7B on a consumer RTX 3080 (10 GB), where the uncompressed fp16 model previously required an A100-class GPU.
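The 4.3 GB fp16 figure for a 32K-token KV cache can be checked against Mistral 7B's public architecture (32 layers, 8 grouped-query KV heads of dimension 128 — these numbers come from the model's config, not from this card):

```python
# KV cache size for Mistral 7B at a 32K context in fp16.
# Architecture constants are from the public Mistral 7B config.
layers, kv_heads, head_dim = 32, 8, 128
seq_len, bytes_fp16 = 32_768, 2
tensors = 2  # one K cache and one V cache per layer

cache_bytes = tensors * bytes_fp16 * layers * kv_heads * head_dim * seq_len
print(f"fp16 KV cache:       {cache_bytes / 1e9:.1f} GB")  # → 4.3 GB
print(f"with 3.5x compression: {cache_bytes / 3.5 / 1e9:.1f} GB")  # → 1.2 GB
```

Both values round to the table entries above.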
## Technical Details
- MLP projections (gate, up, down) compressed via eigenbasis-guided GPTQ
- Eigenbasis from input covariance (X^T X) per layer
- Lloyd-Max quantization for INT2 sacrifice tiers
- Sequential calibration across 32 layers
- V/K projections preserved at full precision (compressed at runtime via KV cache hook)
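The eigenbasis step above can be sketched in a few lines. This is illustrative only: fraQtl's actual pipeline (GPTQ integration, Lloyd-Max tiers, sequential calibration) is not public, and the uniform quantizer below is a stand-in. The idea is to eigendecompose the calibration input covariance X^T X, rotate the weight matrix into that basis, quantize there, and rotate back:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes: a d_in -> d_out linear layer with a calibration batch X.
# Illustrative sketch only, not fraQtl's implementation.
n, d_in, d_out = 256, 64, 32
X = rng.standard_normal((n, d_in))
W = rng.standard_normal((d_in, d_out))

# 1. Eigenbasis of the input covariance X^T X (computed per layer).
cov = X.T @ X
eigvals, Q = np.linalg.eigh(cov)  # columns of Q are eigenvectors

def quantize(w, bits=3):
    """Uniform symmetric quantizer (stand-in for GPTQ / Lloyd-Max)."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    return np.round(w / scale) * scale

# 2. Rotate weights into the eigenbasis, quantize, rotate back.
W_hat = Q @ quantize(Q.T @ W)

# 3. Error is measured on calibration outputs, not raw weights.
err = np.linalg.norm(X @ W - X @ W_hat) / np.linalg.norm(X @ W)
print(f"relative output error: {err:.3f}")
```

The point of the rotation is that quantization error is spent where the input covariance says it matters least; the actual bit allocation across eigendirections is part of the proprietary pipeline.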
## Note on Quality
Measured PPL delta is +0.43 to +0.48, with run-to-run variance of ±0.07 PPL (CUDA non-determinism in the eigenbasis computation); the honest range is therefore +0.35 to +0.55 PPL. A seed stability experiment (C26) is in progress to tighten these error bars.
This base model is not optimized for instruction following, so expect weak demo quality on chat-style prompts. A weight-compressed Mistral-7B-Instruct is coming.
fraqtl.ai | contact@fraqtl.ai | Patent pending. Paper: arXiv:2604.11501