metadata
base_model: mistralai/Voxtral-Mini-4B-Realtime-2602
library_name: transformers
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
  - voxtral
  - audio
  - speech
  - speech-recognition
  - realtime
  - streaming
  - asr
  - kv-cache
  - rotorquant
  - quantization

Voxtral-Mini-4B-Realtime-2602-RotorQuant

RotorQuant KV-cache bundle for mistralai/Voxtral-Mini-4B-Realtime-2602. RotorQuant applies a rotational online re-basis to the attention cache and is the preferred option for noisy, multi-speaker, or code-switching real-time streams.

This artifact ships only the quantized KV-cache; the model weights load from the upstream repository.

Overview

  • Base model: mistralai/Voxtral-Mini-4B-Realtime-2602
  • Capabilities: real-time ASR, streaming speech-to-text
  • Quantization target: attention KV-cache only
  • Method: RotorQuant (orthogonal rotation followed by low-bit quantization, refreshed per session)
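
The rotate-then-quantize idea in the Method bullet can be illustrated with a small, dependency-free sketch. This is not RotorQuant's implementation: the card does not specify the rotation, so a normalized Hadamard transform stands in as one example of an orthogonal rotation, and the symmetric per-vector int4 scheme and all function names are assumptions made for illustration.

```python
import math

def hadamard_rotate(vec):
    """Apply a normalized fast Walsh-Hadamard transform (an orthogonal rotation)."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]

def quantize_int4(vec):
    """Symmetric int4 quantization: integer codes in [-8, 7] plus one float scale."""
    peak = max(abs(x) for x in vec)
    scale = peak / 7.0 if peak > 0 else 1.0
    codes = [max(-8, min(7, round(x / scale))) for x in vec]
    return codes, scale

def dequantize_int4(codes, scale):
    """Map integer codes back to floats."""
    return [c * scale for c in codes]

def roundtrip(vec):
    """Rotate, quantize, dequantize, and rotate back."""
    codes, scale = quantize_int4(hadamard_rotate(vec))
    # The normalized Hadamard matrix is symmetric and orthogonal,
    # so applying it a second time undoes the rotation.
    return hadamard_rotate(dequantize_int4(codes, scale))

# A hypothetical key vector with one large outlier.
key = [0.1, -0.2, 0.05, 8.0, 0.3, -0.1, 0.2, -0.05]
restored = roundtrip(key)
max_err = max(abs(a - b) for a, b in zip(key, restored))
```

The rotation spreads the outlier's energy across all coordinates before the shared scale is chosen, which is the usual motivation for rotating ahead of low-bit quantization.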

Quickstart

from transformers import VoxtralForConditionalGeneration, AutoProcessor
from majentik_quant import RotorQuantCache

# Model weights and processor come from the upstream repository.
model_id = "mistralai/Voxtral-Mini-4B-Realtime-2602"
processor = AutoProcessor.from_pretrained(model_id)
model = VoxtralForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")

# Only the RotorQuant cache bundle is loaded from this repository.
cache = RotorQuantCache.from_pretrained("majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant")

# audio_stream() and emit() are placeholders: supply your own audio chunk
# source and transcript sink.
for chunk in audio_stream():
    inputs = processor(audio=chunk, return_tensors="pt")
    # The same cache object is passed on every chunk of the session.
    out = model.generate(**inputs, past_key_values=cache, max_new_tokens=32)
    emit(processor.batch_decode(out, skip_special_tokens=True)[0])
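
The Quickstart leaves `audio_stream` and `emit` undefined. A minimal sketch of both follows, under assumptions not stated in this card: the 16 kHz sample rate, the 0.5 s chunk length, and the plain list-of-floats chunk format are all illustrative, and unlike the Quickstart call this `audio_stream` takes the sample buffer as an argument.

```python
from typing import Iterator, List, Sequence

SAMPLE_RATE = 16_000   # assumed sample rate in Hz
CHUNK_SECONDS = 0.5    # assumed chunk duration in seconds

def audio_stream(samples: Sequence[float],
                 sample_rate: int = SAMPLE_RATE,
                 chunk_seconds: float = CHUNK_SECONDS) -> Iterator[List[float]]:
    """Yield consecutive fixed-length chunks; the last chunk may be shorter."""
    chunk_len = max(1, int(sample_rate * chunk_seconds))
    for start in range(0, len(samples), chunk_len):
        yield list(samples[start:start + chunk_len])

def emit(text: str) -> None:
    """Print a partial transcript as soon as it is available."""
    print(text, flush=True)

# Example: 20,000 samples of silence split into chunks of 8000, 8000, and 4000.
chunk_sizes = [len(c) for c in audio_stream([0.0] * 20_000)]
```

In a live deployment the generator would read from a microphone or network socket instead of an in-memory buffer; the chunk length trades latency against the amount of context per call.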

Model specs

| Field | Value |
| --- | --- |
| Parameters | 4B |
| Modality | Streaming audio-in, text-out |
| Use case | Real-time ASR |
| Cache quantization | RotorQuant (rotated int4) |
| License | Apache 2.0 |

RotorQuant vs TurboQuant

| | RotorQuant | TurboQuant |
| --- | --- | --- |
| Strategy | Rotational online re-basis | Per-head static calibration |
| Memory reduction | ~4x on KV-cache | ~3.5x on KV-cache |
| Best for | Noisy/multi-speaker streams | Predictable domains, lowest p50 latency |
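
For intuition about where a figure near 4x can come from, the arithmetic below compares a 16-bit cache with 4-bit codes plus shared scales. Everything here is an assumption for illustration: the card does not state the baseline precision, the scale format, or the group size, and the model shape is hypothetical rather than taken from Voxtral.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_value):
    """Bytes needed to hold keys and values for `tokens` positions."""
    values = 2 * tokens * layers * kv_heads * head_dim  # 2 = keys + values
    return values * bits_per_value / 8

def effective_bits(code_bits, scale_bits, group_size):
    """Average bits per value when each group of values shares one scale."""
    return code_bits + scale_bits / group_size

# Hypothetical model shape, chosen only for illustration.
shape = dict(tokens=4096, layers=32, kv_heads=8, head_dim=128)

fp16_bytes = kv_cache_bytes(bits_per_value=16, **shape)
int4_bytes = kv_cache_bytes(bits_per_value=effective_bits(4, 16, 128), **shape)
reduction = fp16_bytes / int4_bytes
```

With these assumed numbers the 16-bit cache occupies 512 MiB and the reduction is 16 / 4.125, a little under 4x. The ratio depends only on bits per value, not on the model shape, so smaller scale groups or extra metadata lower it further.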

See also