Voxtral-Mini-4B-Realtime-2602-TurboQuant

TurboQuant KV-cache bundle for mistralai/Voxtral-Mini-4B-Realtime-2602, a 4B-parameter real-time speech input ASR model optimized for low-latency streaming transcription.

This artifact ships only the quantized KV-cache; model weights load from the upstream repo.

Overview

  • Base model: mistralai/Voxtral-Mini-4B-Realtime-2602 (Apache 2.0, ~864K downloads)
  • Capabilities: real-time ASR, streaming speech-to-text
  • Quantization target: attention KV-cache only
  • Method: TurboQuant (per-head, per-channel calibrated cache quantization)

Real-time ASR keeps a rolling audio context in the KV-cache; TurboQuant cuts both its memory footprint and the memory-bandwidth cost of reading it, allowing longer sessions on the same device without dropped frames.
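To make the savings concrete, here is a back-of-envelope sketch of KV-cache memory for a rolling context. The layer count, KV-head count, and head dimension below are illustrative assumptions, not the published Voxtral-Mini architecture; the ~3.5x factor comes from the comparison table later in this card.

```python
# Back-of-envelope KV-cache memory for a rolling audio context.
# n_layers / n_kv_heads / head_dim are assumed values for illustration.

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    """Bytes held by the K and V caches for one sequence."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K + V
    return seq_len * per_token

fp16_bytes = kv_cache_bytes(8192)   # fp16 baseline, 2 bytes per element
quant_bytes = fp16_bytes / 3.5      # ~3.5x TurboQuant reduction (per this card)
print(f"fp16:  {fp16_bytes / 2**20:.0f} MiB")   # 1024 MiB
print(f"quant: {quant_bytes / 2**20:.0f} MiB")  # 293 MiB
```

Under these assumptions an 8K-token context drops from about 1 GiB to under 300 MiB, which is the headroom that lets longer sessions fit on the same device.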

Quickstart

from transformers import VoxtralForConditionalGeneration, AutoProcessor
from majentik_quant import TurboQuantCache

model_id = "mistralai/Voxtral-Mini-4B-Realtime-2602"
processor = AutoProcessor.from_pretrained(model_id)
model = VoxtralForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")

# Quantized KV-cache bundle; the model weights come from the upstream repo above.
cache = TurboQuantCache.from_pretrained("majentik/Voxtral-Mini-4B-Realtime-2602-TurboQuant")

for chunk in audio_stream():  # audio_stream() yields 20 ms PCM chunks (user-supplied)
    inputs = processor(audio=chunk, return_tensors="pt")
    out = model.generate(**inputs, past_key_values=cache, max_new_tokens=32)
    emit(processor.batch_decode(out, skip_special_tokens=True)[0])  # emit() is user-supplied

Model specs

Field                Value
-------------------  --------------------------------------
Parameters           4B
Modality             Streaming audio-in, text-out
Use case             Real-time ASR
Cache quantization   TurboQuant (int8 heads, int4 channels)
License              Apache 2.0
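The exact TurboQuant calibration procedure is not documented in this card, but the "int8 heads" half of the scheme can be sketched as generic per-head symmetric int8 quantization, with one scale per head taken from that head's absolute maximum. Everything below (function names, shapes) is an illustrative assumption, not the library's API.

```python
import numpy as np

# Generic per-head symmetric int8 quantization sketch for a KV tensor.
# Scales here come from per-head absolute maxima; TurboQuant's actual
# calibration is not published in this card.

def quantize_per_head(kv):
    """kv: (n_heads, seq_len, head_dim) float32 -> int8 codes + per-head scales."""
    scales = np.abs(kv).max(axis=(1, 2), keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero on empty heads
    codes = np.clip(np.round(kv / scales), -127, 127).astype(np.int8)
    return codes, scales

def dequantize_per_head(codes, scales):
    return codes.astype(np.float32) * scales

rng = np.random.default_rng(0)
kv = rng.standard_normal((8, 16, 64)).astype(np.float32)
codes, scales = quantize_per_head(kv)
err = np.abs(dequantize_per_head(codes, scales) - kv).max()
print(f"max abs error: {err:.4f}")  # rounding error is bounded by ~scale/2 per head
```

A per-head scale keeps outlier heads from inflating the quantization error of well-behaved ones, which is the usual motivation for head-granular cache schemes.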

RotorQuant vs TurboQuant

                  TurboQuant                               RotorQuant
----------------  ---------------------------------------  ---------------------------
Strategy          Per-head static calibration              Rotational online re-basis
Memory reduction  ~3.5x on KV-cache                        ~4x on KV-cache
Best for          Predictable domains, lowest p50 latency  Noisy/multi-speaker streams
Calibration cost  One-shot, fast                           Per-session light re-basis

TurboQuant is the lowest-latency option. RotorQuant preserves more quality when domains drift mid-session.
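The trade-off above can be condensed into a small selection helper. This is purely an illustration of the guidance in this section; the function, its parameters, and the thresholds are hypothetical, not a published API.

```python
# Hypothetical helper condensing the TurboQuant vs RotorQuant guidance above.
# The function and its inputs are illustrative assumptions, not a real API.

def pick_cache(expected_speakers: int, domain_drift: bool) -> str:
    """Return the bundle suggested by the comparison table."""
    if expected_speakers > 1 or domain_drift:
        return "RotorQuant"  # online re-basis copes with noisy/drifting streams
    return "TurboQuant"      # one-shot calibration, lowest p50 latency

print(pick_cache(expected_speakers=1, domain_drift=False))  # TurboQuant
print(pick_cache(expected_speakers=3, domain_drift=True))   # RotorQuant
```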
