---
base_model: mistralai/Voxtral-Mini-3B-2507
library_name: transformers
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
- voxtral
- audio
- speech
- speech-recognition
- transcription
- translation
- kv-cache
- rotorquant
- quantization
---

# Voxtral-Mini-3B-2507-RotorQuant

A RotorQuant-quantized KV-cache for [`mistralai/Voxtral-Mini-3B-2507`](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507). RotorQuant applies a rotational online re-basis to the attention cache, which keeps low-bit quantization robust to distributional drift across long, code-switched, or noisy audio streams.

This artifact ships only the quantized KV-cache bundle; the model weights are loaded from the upstream repository.

## Overview

- Base model: `mistralai/Voxtral-Mini-3B-2507`
- Capabilities: transcription, speech translation, audio understanding
- Quantization target: attention KV-cache only
- Method: RotorQuant (orthogonal rotation plus low-bit quantization, refreshed per session)
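To make the "orthogonal rotation plus low-bit quantization" idea concrete, here is a minimal sketch of the general technique. It is not RotorQuant's implementation: the card does not document which rotation RotorQuant uses or how its re-basis is computed, so this sketch swaps in a fixed normalized Walsh-Hadamard rotation (a common choice in rotation-based quantization) and symmetric per-vector int4 quantization. All function names are hypothetical. It shows two things: the rotation spreads a single outlier channel across every dimension, shrinking the largest magnitude the int4 scale has to cover, and rotating queries and keys by the same orthogonal matrix leaves their dot products (the attention logits) unchanged.

```python
import math

# Hypothetical illustration of rotate-then-quantize; not the RotorQuant implementation.


def hadamard_rotate(vec):
    """Normalized fast Walsh-Hadamard transform, an orthogonal rotation.

    The normalized transform is symmetric and orthogonal, hence its own inverse:
    applying it twice returns the input. len(vec) must be a power of two.
    """
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)
    return [v * norm for v in out]


def quantize_int4(vec):
    """Symmetric per-vector int4: codes in [-8, 7] plus one float scale."""
    max_abs = max(abs(v) for v in vec)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in vec]
    return codes, scale


def compress(vec):
    """Rotate, then quantize: the (codes, scale) pair is what a cache would store."""
    return quantize_int4(hadamard_rotate(vec))


def decompress(codes, scale):
    """Dequantize, then undo the rotation by applying it again (self-inverse)."""
    return hadamard_rotate([c * scale for c in codes])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Made-up key vector with one outlier channel, mimicking outliers often seen in keys.
key = [20.0, 0.4, -0.3, 0.5, -0.2, 0.1, 0.3, -0.4]
query = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 0.25, 1.0]

rotated = hadamard_rotate(key)
print("max |value| before/after rotation:",
      max(abs(v) for v in key), round(max(abs(v) for v in rotated), 3))

codes, scale = compress(key)
restored = decompress(codes, scale)
print("int4 codes:", codes)
print("restored key:", [round(v, 2) for v in restored])

# Rotating query and key by the same orthogonal matrix preserves their dot product.
print(round(dot(query, key), 6), round(dot(hadamard_rotate(query), rotated), 6))
```

Because the rotation is orthogonal, the reconstruction error after rotating back equals the quantization error in the rotated space, which is bounded by half a quantization step per element.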

RotorQuant spends a small per-session calibration pass to gain better low-bit stability on streaming audio. It is the better choice when the audio domain shifts mid-stream, as in multi-speaker meetings, code-switching, or bursts of noise.
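The card does not describe how RotorQuant's per-session pass works. The hypothetical sketch below only illustrates why recalibrating pays off after a domain shift: an int4 scale fixed on earlier audio clips the larger post-shift values, while a scale refreshed on the new segment covers them. The data, the function names, and the max-abs calibration rule are all assumptions made for illustration.

```python
# Hypothetical illustration of static vs refreshed calibration; not RotorQuant's algorithm.


def calibrate_scale(samples):
    """Symmetric int4 scale chosen so the largest calibration magnitude maps to code 7."""
    max_abs = max(abs(v) for v in samples)
    return max_abs / 7 if max_abs > 0 else 1.0


def roundtrip(values, scale):
    """Quantize to int4 codes in [-8, 7] with a fixed scale, then dequantize."""
    return [max(-8, min(7, round(v / scale))) * scale for v in values]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Made-up cache activations before and after a domain shift,
# e.g. a noise burst that inflates magnitudes.
before_shift = [0.8, -1.1, 0.5, 1.2, -0.9, 0.3, -0.6, 1.0]
after_shift = [3.9, -4.2, 2.7, 4.8, -3.1, 1.6, -4.4, 3.5]

static_scale = calibrate_scale(before_shift)    # calibrated once, never refreshed
refreshed_scale = calibrate_scale(after_shift)  # re-calibrated on the new segment

static_mse = mse(after_shift, roundtrip(after_shift, static_scale))
refreshed_mse = mse(after_shift, roundtrip(after_shift, refreshed_scale))
print(f"static scale MSE:    {static_mse:.3f}")
print(f"refreshed scale MSE: {refreshed_mse:.3f}")
```

In a real streaming system the refreshed scale would be estimated from a short calibration window and then applied to the tokens that follow, rather than fitted to the very values it encodes as this toy does.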

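## Quickstart

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
from majentik_quant import RotorQuantCache

model_id = "mistralai/Voxtral-Mini-3B-2507"
processor = AutoProcessor.from_pretrained(model_id)
model = VoxtralForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Load the RotorQuant KV-cache bundle; the model weights above come from the upstream repo.
cache = RotorQuantCache.from_pretrained("majentik/Voxtral-Mini-3B-2507-RotorQuant")

# Build a transcription prompt from a local audio file (set `language` to the spoken language).
inputs = processor.apply_transcription_request(language="en", audio="meeting.wav", model_id=model_id)
inputs = inputs.to(model.device, dtype=model.dtype)

out = model.generate(**inputs, past_key_values=cache, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```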

## Model specs

| Field | Value |
|---|---|
| Parameters | 3B |
| Modality | Audio-in, text-out |
| Languages | Multilingual (24+) |
| Cache quantization | RotorQuant (rotated int4) |
| License | Apache 2.0 |

## RotorQuant vs TurboQuant

| | RotorQuant | TurboQuant |
|---|---|---|
| Strategy | Rotational online re-basis | Per-head static calibration |
| Memory reduction | ~4x on KV-cache | ~3.5x on KV-cache |
| Best for | Streaming, code-switching audio | Batch transcription, fixed domains |
| Calibration cost | Per-session light re-basis | One-shot, fast |
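The roughly 4x memory figure follows from bit widths: a 16-bit (bf16) cache stores 16 bits per value, while an int4 cache stores 4 bits per value plus whatever per-group scale overhead the scheme carries. The sketch below works through that arithmetic. The layer count, KV-head count, head dimension, sequence length, and group size are hypothetical placeholders, not Voxtral-Mini-3B-2507's actual configuration, and per-group 16-bit scales are an assumption rather than a documented detail of RotorQuant.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value):
    """Bytes for keys and values (the factor 2) across all layers."""
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return values * bits_per_value / 8


def int4_bits_per_value(group_size, scale_bits=16):
    """4-bit codes plus one scale shared by each group of `group_size` values (assumed)."""
    return 4 + scale_bits / group_size


# Hypothetical config for illustration only; not Voxtral-Mini-3B's actual shape.
config = dict(num_layers=30, num_kv_heads=8, head_dim=128, seq_len=32_000)

bf16 = kv_cache_bytes(**config, bits_per_value=16)
int4 = kv_cache_bytes(**config, bits_per_value=int4_bits_per_value(group_size=128))
print(f"bf16: {bf16 / 2**20:.0f} MiB, int4: {int4 / 2**20:.0f} MiB, ratio: {bf16 / int4:.2f}x")
```

The ratio does not depend on the model shape: with a group size of 128 it is 16 / 4.125 ≈ 3.88, and smaller groups push it further below 4x, which is why the reductions above are quoted as approximate.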

## See also

- [`majentik/Voxtral-Mini-3B-2507-TurboQuant`](https://huggingface.co/majentik/Voxtral-Mini-3B-2507-TurboQuant): static calibrated KV-cache variant
- [`majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-8bit`](https://huggingface.co/majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-8bit): 8-bit MLX weight quantization
- [`majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-4bit`](https://huggingface.co/majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-4bit): 4-bit MLX weight quantization
- [`majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-2bit`](https://huggingface.co/majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-2bit): 2-bit MLX weight quantization
- [`mistralai/Voxtral-Mini-3B-2507`](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507): upstream base model