Voxtral-Mini-3B-2507-TurboQuant

TurboQuant KV-cache for mistralai/Voxtral-Mini-3B-2507, a 3B-parameter speech-understanding model that handles transcription, speech translation, and audio question-answering.

This artifact ships only the quantized KV-cache bundle (not the model weights). It plugs into the standard transformers loader for Voxtral and reduces audio-context memory during long-form transcription and multi-turn audio QA.

Overview

  • Base model: mistralai/Voxtral-Mini-3B-2507 (Apache 2.0, ~578K downloads)
  • Capabilities: transcription, speech translation, speech understanding/QA
  • Quantization target: attention KV-cache only; weights remain in their original precision
  • Method: TurboQuant, a per-head, per-channel calibrated cache quantization scheme tuned for audio token streams
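TurboQuant's exact scheme is not published in this card, but per-head, per-channel calibrated quantization generally means one scale per (head, channel) pair, fit from calibration data. A minimal illustrative sketch (shapes and the `quantize_kv`/`dequantize_kv` helpers are assumptions, not the TurboQuant API):

```python
# Illustrative per-head, per-channel symmetric int8 quantization of a
# KV-cache slice. This is a sketch of the general technique, not
# TurboQuant's actual implementation.
import numpy as np

def quantize_kv(kv, n_bits=8):
    """kv: (heads, seq_len, head_dim). One scale per (head, channel)."""
    qmax = 2 ** (n_bits - 1) - 1                          # 127 for int8
    scale = np.abs(kv).max(axis=1, keepdims=True) / qmax  # (heads, 1, head_dim)
    scale = np.where(scale == 0, 1.0, scale)              # avoid divide-by-zero
    q = np.clip(np.round(kv / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_kv(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.standard_normal((8, 64, 128)).astype(np.float32)  # toy cache slice
q, scale = quantize_kv(kv)
err = np.abs(dequantize_kv(q, scale) - kv).max()
print(q.dtype, q.shape, f"max abs error = {err:.4f}")
```

With symmetric rounding, the per-element reconstruction error is bounded by half a scale step, which is why per-channel (rather than per-tensor) scales matter when channel magnitudes vary widely.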

Voxtral's long audio contexts make the KV-cache the dominant memory cost on-device. TurboQuant shrinks that footprint while preserving WER on standard ASR benchmarks.
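To see why the cache dominates, a back-of-envelope sizing helps. The layer/head dimensions and context length below are illustrative assumptions, not Voxtral's actual config:

```python
# Back-of-envelope KV-cache sizing. All dimensions are assumed for
# illustration; substitute the real model config to size your workload.
n_layers, n_kv_heads, head_dim = 32, 8, 128
seq_len = 32_000        # long-form audio context, in tokens (assumed)
bytes_fp16 = 2

# Factor of 2 accounts for storing both keys and values.
cache_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_fp16
print(f"fp16 KV-cache: {cache_bytes / 2**30:.1f} GiB")
print(f"at ~3.5x reduction: {cache_bytes / 3.5 / 2**30:.1f} GiB")
```

Under these assumptions the fp16 cache alone is several GiB, which is why a ~3.5x cache reduction moves long-form transcription into on-device memory budgets.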

Quickstart

from transformers import VoxtralForConditionalGeneration, AutoProcessor
from majentik_quant import TurboQuantCache

model_id = "mistralai/Voxtral-Mini-3B-2507"
processor = AutoProcessor.from_pretrained(model_id)
model = VoxtralForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")

# Load the calibrated cache bundle from this repo; the base weights are unchanged.
cache = TurboQuantCache.from_pretrained("majentik/Voxtral-Mini-3B-2507-TurboQuant")

inputs = processor(audio="sample.wav", return_tensors="pt")
out = model.generate(**inputs, past_key_values=cache, max_new_tokens=256)
print(processor.batch_decode(out, skip_special_tokens=True)[0])

Model specs

| Field | Value |
| --- | --- |
| Parameters | 3B |
| Modality | Audio in, text out |
| Languages | Multilingual (24+) |
| Context | Long-form audio |
| Cache quantization | TurboQuant (int8 heads, int4 channels) |
| License | Apache 2.0 |

RotorQuant vs TurboQuant

| | TurboQuant | RotorQuant |
| --- | --- | --- |
| Strategy | Per-head static calibration | Rotational online re-basis |
| Memory reduction | ~3.5x on KV-cache | ~4x on KV-cache |
| Best for | Batch transcription, fixed domains | Streaming, code-switching audio |
| Calibration cost | One-shot, fast | Per-session light re-basis |

TurboQuant is the recommended default for offline transcription and speech translation workloads. For highly variable streaming audio see the RotorQuant variant.
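The reduction factors translate directly into footprint. A rough comparison, assuming an illustrative 4.0 GiB fp16 cache baseline (not a measured Voxtral number):

```python
# Footprint comparison from the reduction factors above.
# The 4.0 GiB fp16 baseline is an illustrative assumption.
baseline_gib = 4.0
for name, factor in [("TurboQuant", 3.5), ("RotorQuant", 4.0)]:
    print(f"{name}: {baseline_gib / factor:.2f} GiB")
```

The gap between the two variants is small in absolute terms; the choice is driven mainly by workload shape (batch vs. streaming), not memory.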
