Voxtral-Mini-3B-2507-TurboQuant
TurboQuant KV-cache for mistralai/Voxtral-Mini-3B-2507, a 3B-parameter speech-understanding model that handles transcription, speech translation, and audio question-answering.
This artifact ships only the quantized KV-cache bundle (not the model weights). It plugs into the standard transformers loader for Voxtral and reduces audio-context memory during long-form transcription and multi-turn audio QA.
Overview
- Base model: mistralai/Voxtral-Mini-3B-2507 (Apache 2.0, ~578K downloads)
- Capabilities: transcription, speech translation, speech understanding/QA
- Quantization target: attention KV-cache only; weights remain in their original precision
- Method: TurboQuant, a per-head, per-channel calibrated cache quantization scheme tuned for audio token streams
Voxtral's long audio contexts make the KV-cache the dominant memory cost on-device. TurboQuant shrinks that footprint while preserving WER on standard ASR benchmarks.
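To see why the cache dominates, here is a back-of-envelope sizing sketch. The layer, head, and dimension numbers below are illustrative assumptions for a ~3B decoder, not Voxtral-Mini's actual configuration, and the token rate for audio is likewise assumed.

```python
# Rough KV-cache sizing: 2 (keys + values) x layers x kv-heads x head_dim
# x sequence length x bytes per element. All shapes here are assumptions.

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # Factor of 2 accounts for storing both keys and values per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

seq_len = 32_000                              # tokens for a long audio context (assumed)
n_layers, n_kv_heads, head_dim = 26, 8, 128   # assumed model shapes

fp16 = kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, 2.0)
int8 = kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, 1.0)

print(f"fp16 cache: {fp16 / 2**30:.2f} GiB")
print(f"int8 cache: {int8 / 2**30:.2f} GiB ({fp16 / int8:.1f}x smaller)")
```

With these assumed shapes the fp16 cache alone runs to several GiB, which is why cache-only quantization pays off on-device even when weights stay in full precision.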
Quickstart
```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
from majentik_quant import TurboQuantCache

model_id = "mistralai/Voxtral-Mini-3B-2507"
processor = AutoProcessor.from_pretrained(model_id)
model = VoxtralForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")

# Load the quantized KV-cache bundle and pass it to generate()
cache = TurboQuantCache.from_pretrained("majentik/Voxtral-Mini-3B-2507-TurboQuant")

inputs = processor(audio="sample.wav", return_tensors="pt")
out = model.generate(**inputs, past_key_values=cache, max_new_tokens=256)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```
Model specs
| Field | Value |
|---|---|
| Parameters | 3B |
| Modality | Audio-in, text-out |
| Languages | Multilingual (24+) |
| Context | Long-form audio |
| Cache quantization | TurboQuant (int8 heads, int4 channels) |
| License | Apache 2.0 |
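The "int8 heads, int4 channels" scheme in the table is a form of per-channel integer quantization. The sketch below illustrates the general idea with symmetric int8 quantization of one cache channel; the scale choice (max-abs / 127) and function names are illustrative assumptions, not the actual TurboQuant implementation.

```python
# Minimal sketch of symmetric per-channel integer quantization, the family
# of schemes TurboQuant's cache compression belongs to. Not the real API.

def quantize_channel(values, n_bits=8):
    """Quantize one channel to signed n_bits integers with a shared scale."""
    qmax = 2 ** (n_bits - 1) - 1              # 127 for int8, 7 for int4
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [round(v / scale) for v in values], scale

def dequantize_channel(q, scale):
    return [v * scale for v in q]

# One key-cache "channel" of activations (made-up numbers):
channel = [0.12, -0.53, 0.97, -0.08]
q, scale = quantize_channel(channel, n_bits=8)
recovered = dequantize_channel(q, scale)
err = max(abs(a - b) for a, b in zip(channel, recovered))
print(q, round(scale, 5), round(err, 5))
```

The reconstruction error stays within half a quantization step of the original values, which is the property the calibrated per-head scales are meant to preserve across real audio activations.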
RotorQuant vs TurboQuant
| | TurboQuant | RotorQuant |
|---|---|---|
| Strategy | Per-head static calibration | Rotational online re-basis |
| Memory reduction | ~3.5x on KV-cache | ~4x on KV-cache |
| Best for | Batch transcription, fixed domains | Streaming, code-switching audio |
| Calibration cost | One-shot, fast | Per-session light re-basis |
TurboQuant is the recommended default for offline transcription and speech translation workloads. For highly variable or streaming audio, see the RotorQuant variant.
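The "one-shot, fast" calibration cost in the table reflects static calibration: run a representative batch once, record per-channel ranges, and reuse the resulting scales at inference. The sketch below shows that pattern; all names and data here are illustrative assumptions, not the actual calibration pipeline.

```python
# Sketch of one-shot static calibration (the TurboQuant column above):
# accumulate per-channel max-abs over a calibration set, then derive fixed
# quantization scales that are reused unchanged at inference time.

def calibrate_scales(batches, n_channels, n_bits=8):
    qmax = 2 ** (n_bits - 1) - 1
    max_abs = [0.0] * n_channels
    for batch in batches:                 # each batch: rows of channel values
        for row in batch:
            for c, v in enumerate(row):
                max_abs[c] = max(max_abs[c], abs(v))
    return [m / qmax if m else 1.0 for m in max_abs]

# Tiny made-up calibration set with two channels:
calib = [[[0.1, -2.0], [0.4, 1.0]], [[-0.3, 0.5]]]
scales = calibrate_scales(calib, n_channels=2)
print([round(s, 5) for s in scales])      # fixed scales, computed once
```

Because the scales are frozen after this single pass, there is no per-session overhead, which is why this style of calibration suits batch transcription over fixed domains better than highly variable streaming input.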
See also
- majentik/Voxtral-Mini-3B-2507-RotorQuant – rotational KV-cache variant
- majentik/Voxtral-Mini-3B-2507-TurboQuant-MLX-8bit – MLX weight-quantized 8-bit
- majentik/Voxtral-Mini-3B-2507-TurboQuant-MLX-4bit – MLX weight-quantized 4-bit
- majentik/Voxtral-Mini-3B-2507-TurboQuant-MLX-2bit – MLX weight-quantized 2-bit
- mistralai/Voxtral-Mini-3B-2507 – upstream base model
Model tree for majentik/Voxtral-Mini-3B-2507-TurboQuant
Base model
mistralai/Voxtral-Mini-3B-2507