---
base_model: MERaLiON/MERaLiON-3-10B-preview
library_name: mlx
tags:
- rotorquant
- kv-cache-quantization
- meralion
- speech-to-text
- multimodal
- audio
- quantized
- mlx
- 8bit
- apple-silicon
license: other
pipeline_tag: automatic-speech-recognition
language:
- en
---
# MERaLiON-3-10B-RotorQuant-MLX-8bit
8-bit weight-quantized MLX build of [MERaLiON/MERaLiON-3-10B-preview](https://huggingface.co/MERaLiON/MERaLiON-3-10B-preview) with RotorQuant KV-cache quantization, optimized for Apple Silicon inference via the MLX framework.

MERaLiON-3-10B is a multimodal audio-language model built on a Gemma-2 decoder backbone, designed for speech-to-text and audio-understanding tasks.

**Approximate model size:** ~10 GB
## Model Specifications
| Property | Value |
|---|---|
| Base Model | MERaLiON/MERaLiON-3-10B-preview |
| Parameters | ~10 billion |
| Architecture | Multimodal audio-language (Gemma-2 decoder backbone) |
| Modality | Audio + text input, text output |
| License | See base model |
| Weight Quantization | 8-bit (~10 GB) |
| KV-Cache Quantization | RotorQuant |
| Framework | MLX (Apple Silicon) |
## Quickstart

```python
from mlx_lm import load, generate

model, tokenizer = load("majentik/MERaLiON-3-10B-RotorQuant-MLX-8bit")

# Text-only generation shown here; audio input requires the base model's
# audio processor (see the base model card).
prompt = "Transcribe the following audio:"
response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)
```
## What is RotorQuant?
RotorQuant is a rotation-based KV cache quantization method that applies learned Clifford algebra rotations before quantizing the key-value cache. Key results:
- 5.3x faster prefill compared to TurboQuant baseline
- 28% faster decode throughput
- Perplexity: 6.91 vs 7.07 for TurboQuant (lower is better)
Combined with MLX 8-bit weight quantization, this dual-compression approach shrinks both the weights and the KV cache, improving throughput for audio-processing workloads.
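The rotate-then-quantize idea can be sketched in a few lines of NumPy. This is an illustration only: a random orthogonal matrix stands in for RotorQuant's learned Clifford rotations, and the int8 scheme is a generic symmetric per-row quantizer, not the exact one used by this model.

```python
import numpy as np

def quantize_int8(x):
    # Symmetric per-row int8 quantization: returns codes and per-row scales.
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# A random orthogonal matrix stands in for RotorQuant's learned rotations.
rng = np.random.default_rng(0)
d = 64
rotation, _ = np.linalg.qr(rng.standard_normal((d, d)))

kv = rng.standard_normal((128, d)).astype(np.float32)
kv[:, 3] *= 20.0  # simulate an outlier channel, common in KV activations

# Plain quantization: the outlier channel inflates every row's scale.
err_plain = np.abs(dequantize(*quantize_int8(kv)) - kv).mean()

# Rotate, quantize, then dequantize and rotate back. The rotation spreads
# outlier energy across all dimensions, so scales (and errors) shrink.
q, scale = quantize_int8(kv @ rotation)
recovered = dequantize(q, scale) @ rotation.T
err_rot = np.abs(recovered - kv).mean()

print(f"mean abs error: plain={err_plain:.4f}  rotated={err_rot:.4f}")
```

Because orthogonal rotations preserve norms, the reconstruction error in rotated space equals the error in the original space, so any reduction in quantization step size translates directly into lower KV-cache error.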
## KV-Cache Quantization Comparison
| Method | Prefill Speed | Decode Speed | Memory Savings | Reference |
|---|---|---|---|---|
| TurboQuant | Baseline | Baseline | High | arXiv: 2504.19874 |
| RotorQuant | 5.3x faster | 28% faster | High | GitHub |
## Memory Estimates (MERaLiON-3-10B)
| Precision | Approximate Size | MLX Variant |
|---|---|---|
| FP16 (original) | ~20 GB | -- |
| 8-bit quantized | ~10 GB | This model |
| 4-bit quantized | ~5 GB | RotorQuant-MLX-4bit |
| 2-bit quantized | ~3 GB | RotorQuant-MLX-2bit |
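The sizes above are roughly parameter count times bits per weight, plus per-group metadata. The sketch below assumes MLX-style group quantization with a 16-bit scale and 16-bit bias per group of 64 weights; the group size and metadata layout are assumptions, so treat the outputs as estimates.

```python
def quantized_size_gb(n_params: float, bits: int, group_size: int = 64) -> float:
    # Effective bits per weight: payload bits plus 32 bits of per-group
    # metadata (assumed 16-bit scale + 16-bit bias) amortized over the group.
    eff_bits = bits + 32 / group_size
    return n_params * eff_bits / 8 / 1e9

N = 10e9  # ~10B parameters
print(f"fp16 : ~{N * 2 / 1e9:.0f} GB")
for bits in (8, 4, 2):
    print(f"{bits}-bit: ~{quantized_size_gb(N, bits):.1f} GB")
```

Note that the relative metadata overhead grows as the payload shrinks, which is why the 2-bit variant lands nearer ~3 GB than the naive 2.5 GB.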
## Hardware Requirements
This model requires approximately 10 GB of unified memory for the weights, plus headroom for the KV cache and activations. Recommended hardware:
- Any Apple Silicon Mac with 16 GB+ unified memory (M1 Pro/Max, or any M2/M3/M4-series chip)
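A quick way to check whether a machine has enough headroom before downloading, sketched cross-platform (on Apple Silicon, total physical RAM is the unified memory pool; the `MODEL_GB` figure is the approximate 8-bit footprint from the table above):

```python
import os
import subprocess

def total_memory_gb() -> float:
    """Total physical RAM in GB (on Apple Silicon, the unified memory pool)."""
    try:  # macOS: sysctl reports total memory in bytes
        return int(subprocess.check_output(["sysctl", "-n", "hw.memsize"])) / 1e9
    except (OSError, subprocess.CalledProcessError):  # Linux and others
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9

MODEL_GB = 10  # approximate 8-bit weight footprint
ram = total_memory_gb()
print(f"unified memory: {ram:.1f} GB; model needs ~{MODEL_GB} GB plus KV cache")
```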
## See Also

- [MERaLiON/MERaLiON-3-10B-preview](https://huggingface.co/MERaLiON/MERaLiON-3-10B-preview) -- Base model
- [majentik/MERaLiON-3-10B-RotorQuant-MLX-4bit](https://huggingface.co/majentik/MERaLiON-3-10B-RotorQuant-MLX-4bit) -- MLX 4-bit variant
- [majentik/MERaLiON-3-10B-RotorQuant-MLX-2bit](https://huggingface.co/majentik/MERaLiON-3-10B-RotorQuant-MLX-2bit) -- MLX 2-bit variant
- [majentik/MERaLiON-3-10B-TurboQuant-MLX-8bit](https://huggingface.co/majentik/MERaLiON-3-10B-TurboQuant-MLX-8bit) -- TurboQuant MLX 8-bit variant
- RotorQuant GitHub
- [MLX Framework](https://github.com/ml-explore/mlx)