	---
	base_model: mistralai/Voxtral-Mini-4B-Realtime-2602
	library_name: mlx
	license: apache-2.0
	pipeline_tag: automatic-speech-recognition
	tags:
	- voxtral
	- audio
	- speech
	- speech-recognition
	- realtime
	- streaming
	- asr
	- mlx
	- rotorquant
	- quantization
	- 2-bit
	---

	# Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-2bit

	2-bit MLX weight-quantized build of [`mistralai/Voxtral-Mini-4B-Realtime-2602`](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602) with the RotorQuant KV-cache profile. An ultra-compact real-time ASR build for memory-constrained Apple Silicon, aimed at keeping 2-bit decoding stable on streaming audio.

	## Overview

	- Base: `mistralai/Voxtral-Mini-4B-Realtime-2602` — 4B real-time ASR model
	- Weight precision: 2-bit (group-wise)
	- KV-cache profile: RotorQuant
	- On-disk size: ~1.2 GB
	- Runtime: MLX on Apple Silicon

	## Quickstart

	```bash
	pip install mlx-lm
	```

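	```python
	from mlx_lm import load, generate

	# Download (or load from the local cache) the quantized weights and tokenizer.
	model, tokenizer = load("majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-2bit")

	# audio_stream() and emit() are placeholders, not part of mlx_lm:
	# supply your own source of audio chunk paths and your own output sink.
	for chunk in audio_stream():
	    # Wrap each audio chunk in a single-turn chat prompt.
	    prompt = tokenizer.apply_chat_template(
	        [{"role": "user", "content": [{"type": "audio", "path": chunk}]}],
	        add_generation_prompt=True,
	    )
	    # Keep max_tokens small so each chunk is transcribed with low latency.
	    emit(generate(model, tokenizer, prompt=prompt, max_tokens=32))
	```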
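
The quickstart loop calls `audio_stream()` without defining it. As a minimal sketch, assuming the input is a WAV file on disk and that chunks are handed to the model as file paths, a stand-in could look like the following. The `wav_path`, `chunk_seconds`, and `out_dir` parameters are hypothetical and not part of this repository or of `mlx_lm`; a live microphone source would need a different implementation.

```python
import os
import tempfile
import wave


def audio_stream(wav_path, chunk_seconds=2.0, out_dir=None):
    """Yield paths to consecutive fixed-length WAV chunks cut from wav_path.

    Hypothetical helper: it only illustrates what a chunk source could look like.
    """
    out_dir = out_dir or tempfile.mkdtemp(prefix="chunks_")
    with wave.open(wav_path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(src.getframerate() * chunk_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # end of input file
            chunk_path = os.path.join(out_dir, f"chunk_{index:05d}.wav")
            with wave.open(chunk_path, "wb") as dst:
                # Copy channel count, sample width, and rate from the source;
                # the frame count in the header is corrected on write.
                dst.setparams(params)
                dst.writeframes(frames)
            yield chunk_path
            index += 1
```

With this helper the quickstart loop would read `for chunk in audio_stream("input.wav"):`, where `input.wav` is a placeholder file name.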

	## Model specs

	| Field | Value |
	|---|---|
	| Parameters | 4B |
	| Weight bits | 2 |
	| Group size | 32 |
	| Cache profile | RotorQuant |
	| Size on disk | ~1.2 GB |
	| Target hardware | Apple Silicon (M1/M2/M3/M4) |
	| License | Apache 2.0 |
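
As a rough sanity check on the size figure, the weight footprint of a group-wise quantized model can be estimated from the parameter count, bit width, and group size. The sketch below is a back-of-the-envelope calculation, not the exact layout of this checkpoint: the per-group metadata size (`meta_bits_per_group`, here 32 bits for a 16-bit scale plus a 16-bit bias) is an assumption, the parameter count is rounded to 4B, and any layers left unquantized are ignored.

```python
def quantized_weight_bytes(n_params, bits, group_size, meta_bits_per_group=32):
    """Estimate storage in bytes for group-wise quantized weights.

    meta_bits_per_group is an assumed per-group overhead
    (for example a 16-bit scale plus a 16-bit bias).
    """
    packed_bits = n_params * bits
    meta_bits = (n_params / group_size) * meta_bits_per_group
    return (packed_bits + meta_bits) / 8


# Packed weights only, no per-group metadata.
lower = quantized_weight_bytes(4_000_000_000, bits=2, group_size=32, meta_bits_per_group=0)
# Packed weights plus the assumed 32 bits of metadata per group.
upper = quantized_weight_bytes(4_000_000_000, bits=2, group_size=32)

print(f"{lower / 1e9:.2f} GB to {upper / 1e9:.2f} GB")
```

Under these assumptions the estimate spans 1.00 GB to 1.50 GB, and the listed ~1.2 GB falls inside that range.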

	## RotorQuant vs TurboQuant

	| | RotorQuant | TurboQuant |
	|---|---|---|
	| Strategy | Rotational online re-basis | Per-head static calibration |
	| Memory reduction | ~4x on KV-cache | ~3.5x on KV-cache |
	| Best for | Noisy/multi-speaker streams | Predictable domains, lowest p50 latency |
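
To put the reduction factors in the table in concrete terms, the sketch below estimates the size of an uncompressed 16-bit KV-cache and divides it by each factor. The layer count, KV-head count, head dimension, and sequence length are hypothetical placeholders, not this model's actual configuration; substitute the values from the model's config for a real estimate.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size in bytes of an uncompressed KV-cache (keys plus values)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical dimensions, for illustration only.
baseline = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)

MIB = 1024 ** 2
print(f"16-bit baseline:   {baseline / MIB:.0f} MiB")
print(f"RotorQuant (~4x):  {baseline / 4 / MIB:.0f} MiB")
print(f"TurboQuant (~3.5x): {baseline / 3.5 / MIB:.0f} MiB")
```

The cache grows linearly with sequence length, so the absolute savings from either profile increase as the stream gets longer.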

	## See also

	- [`majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-4bit`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-4bit)
	- [`majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-8bit`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-8bit)
	- [`majentik/Voxtral-Mini-4B-Realtime-2602-TurboQuant-MLX-2bit`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-TurboQuant-MLX-2bit)
	- [`majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant) — KV-cache-only bundle
	- [`mistralai/Voxtral-Mini-4B-Realtime-2602`](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602) — upstream base model