	---
	base_model: mistralai/Voxtral-Mini-4B-Realtime-2602
	library_name: transformers
	license: apache-2.0
	pipeline_tag: automatic-speech-recognition
	tags:
	- voxtral
	- audio
	- speech
	- speech-recognition
	- realtime
	- streaming
	- asr
	- kv-cache
	- rotorquant
	- quantization
	---

	# Voxtral-Mini-4B-Realtime-2602-RotorQuant

	RotorQuant KV-cache bundle for [`mistralai/Voxtral-Mini-4B-Realtime-2602`](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602). It applies a rotational online re-basis to the attention cache and is the preferred variant for noisy, multi-speaker, or code-switching real-time streams.

	This artifact ships only the quantized KV-cache; the model weights load from the upstream repository.

	## Overview

	- Base model: `mistralai/Voxtral-Mini-4B-Realtime-2602`
	- Capabilities: real-time ASR, streaming speech-to-text
	- Quantization target: attention KV-cache only
	- Method: RotorQuant — orthogonal rotation + low-bit quantization, refreshed per session
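
The "orthogonal rotation + low-bit quantization" idea can be illustrated with a small, self-contained sketch. This is not the RotorQuant implementation: it assumes a normalized Hadamard transform as the rotation and a simple symmetric per-vector int4 quantizer, and all function names below are hypothetical. It only shows the general pattern of rotating a vector, quantizing it to 4-bit codes, and undoing both steps.

```python
import math
import random


def hadamard_rotate(x):
    """Apply a normalized fast Walsh-Hadamard transform (an orthogonal rotation).

    The length of x must be a power of two. The normalized transform is its
    own inverse, so applying it twice returns the input (up to float error).
    """
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


def quantize_int4(x):
    """Symmetric per-vector quantization to integer codes in [-7, 7]."""
    max_abs = max(abs(v) for v in x)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(v / scale))) for v in x]
    return codes, scale


def dequantize_int4(codes, scale):
    """Map integer codes back to floats."""
    return [c * scale for c in codes]


def roundtrip(x):
    """Rotate, quantize, dequantize, then rotate back."""
    codes, scale = quantize_int4(hadamard_rotate(x))
    return hadamard_rotate(dequantize_int4(codes, scale)), codes


# Illustrative input: Gaussian values plus one large outlier channel.
random.seed(0)
vec = [random.gauss(0.0, 1.0) for _ in range(64)]
vec[3] = 40.0
restored, codes = roundtrip(vec)
err = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec, restored)))
print(f"L2 reconstruction error: {err:.3f}")
```

The rotation spreads the energy of an outlier channel across all coordinates before quantization, which is the usual motivation for rotating before a low-bit quantizer. How RotorQuant chooses and refreshes its rotation per session is not specified here.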

	## Quickstart

	```python
	from transformers import VoxtralForConditionalGeneration, AutoProcessor
	from majentik_quant import RotorQuantCache

	model_id = "mistralai/Voxtral-Mini-4B-Realtime-2602"
	processor = AutoProcessor.from_pretrained(model_id)
	model = VoxtralForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")

	cache = RotorQuantCache.from_pretrained("majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant")

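	# `audio_stream()` and `emit()` are placeholders for your own audio source and output sink.
	for chunk in audio_stream():
	    inputs = processor(audio=chunk, return_tensors="pt")
	    # Pass the RotorQuant cache as the KV cache for generation.
	    out = model.generate(**inputs, past_key_values=cache, max_new_tokens=32)
	    emit(processor.batch_decode(out, skip_special_tokens=True)[0])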
	```

	## Model specs

	| Field | Value |
	|---|---|
	| Parameters | 4B |
	| Modality | Streaming audio-in, text-out |
	| Use case | Real-time ASR |
	| Cache quantization | RotorQuant (rotated int4) |
	| License | Apache 2.0 |

	## RotorQuant vs TurboQuant

	| | RotorQuant | TurboQuant |
	|---|---|---|
	| Strategy | Rotational online re-basis | Per-head static calibration |
	| Memory reduction | ~4x on KV-cache | ~3.5x on KV-cache |
	| Best for | Noisy/multi-speaker streams | Predictable domains, lowest p50 latency |
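
A figure of roughly 4x is what one would expect from storing 4-bit codes in place of 16-bit values. The back-of-envelope sketch below makes that arithmetic explicit. It assumes a 16-bit baseline cache and one shared scale per group of values; the group size and scale width are illustrative assumptions, not published parameters of this bundle, and the function name is hypothetical.

```python
def kv_cache_compression_ratio(base_bits=16, quant_bits=4, group_size=64, scale_bits=16):
    """Ratio of unquantized to quantized storage per cached value.

    Each group of `group_size` values shares one scale of `scale_bits` bits,
    so the effective cost per value is quant_bits + scale_bits / group_size.
    """
    effective_bits = quant_bits + scale_bits / group_size
    return base_bits / effective_bits


print(round(kv_cache_compression_ratio(), 2))                # 3.76
print(round(kv_cache_compression_ratio(group_size=128), 2))  # 3.88
```

Under these assumptions the scale overhead keeps the ratio slightly below the ideal 4x, and larger groups move it closer to 4x.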

	## See also

	- [`majentik/Voxtral-Mini-4B-Realtime-2602-TurboQuant`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-TurboQuant)
	- [`majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-8bit`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-8bit)
	- [`majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-4bit`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-4bit)
	- [`majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-2bit`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-2bit)
	- [`mistralai/Voxtral-Mini-4B-Realtime-2602`](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602) — upstream base model