MERaLiON-2-3B-TurboQuant-MLX-2bit
MLX 2-bit TurboQuant quantization of aisingapore/MERaLiON-AudioLLM-Whisper-SEA-LION-V3-3B for Apple Silicon inference.
TurboQuant applies mixed-precision quantization: it keeps critical attention layers at higher precision and quantizes the less sensitive feed-forward layers more aggressively, trading a small amount of accuracy for a smaller footprint and faster inference.
Model Specifications
| Property | Value |
|---|---|
| Base Model | aisingapore/MERaLiON-AudioLLM-Whisper-SEA-LION-V3-3B |
| Parameters | ~3B |
| Architecture | Whisper-large-v3 encoder + Gemma-2-2B-IT decoder |
| Quantization | TurboQuant 2-bit (MLX) |
| Disk Size | ~1 GB |
| Peak RAM | ~1.5 GB |
| License | Apache 2.0 |
| Task | Automatic Speech Recognition / Speech-to-Text |
Quickstart
Installation
pip install mlx-lm mlx-whisper
Inference
from mlx_lm import load, generate
from mlx_lm.cache import TurboQuantCache

# Load the quantized weights and tokenizer from the Hub
model, tokenizer = load("majentik/MERaLiON-2-3B-TurboQuant-MLX-2bit")

# Create TurboQuant-optimized KV cache
cache = TurboQuantCache(model)

# Build the chat-formatted prompt as a string
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Transcribe the following audio."}],
    tokenize=False,
    add_generation_prompt=True,
)

# Generate up to 512 tokens using the cache created above
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=512,
    cache=cache,
)
print(response)
Quantization Details
TurboQuant is a mixed-precision quantization strategy that:
- Retains attention projection layers at higher precision
- Quantizes MLP/feed-forward layers more aggressively where precision loss is tolerable
- Optimizes KV-cache memory layout for faster autoregressive decoding on Apple Silicon
This 2-bit variant has the smallest footprint of the 3B variants, enabling speech recognition on extremely memory-constrained Apple Silicon devices. Expect some quality degradation compared to the 4-bit and 8-bit variants.
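To make the quality trade-off concrete, here is a minimal sketch of generic group-wise affine quantization to 2 bits. It is not TurboQuant's actual algorithm, which this card does not specify. The helper names `quantize_group` and `dequantize_group` are hypothetical and exist only for illustration.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int = 2) -> Tuple[List[int], float, float]:
    """Affine-quantize one group of weights to integer codes.

    Illustrative only: a generic min/max scheme, assumed here,
    not the TurboQuant method itself.
    """
    levels = (1 << bits) - 1              # 3 steps for 2-bit, so codes are 0..3
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo               # lo acts as the per-group bias


def dequantize_group(codes: List[int], scale: float, bias: float) -> List[float]:
    """Map integer codes back to approximate float weights."""
    return [c * scale + bias for c in codes]


group = [-0.30, -0.10, 0.05, 0.30]
codes, scale, bias = quantize_group(group, bits=2)
restored = dequantize_group(codes, scale, bias)

# Rounding to the nearest level bounds the per-weight error by scale / 2.
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(codes, restored, max_err)
```

With only four representable values per group, every weight snaps to the nearest of four levels. That is why a 2-bit model loses more accuracy than its 4-bit and 8-bit siblings.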
Supported Languages
MERaLiON-2 supports speech recognition in languages widely spoken in Southeast Asia, including English, Mandarin Chinese, Malay, Tamil, and Indonesian.
Memory Estimates
| Device | Feasibility |
|---|---|
| MacBook Air M1 (8 GB) | Comfortable |
| iPad Pro M1/M2 | Comfortable |
| iPad Air M1 | Feasible |
| iPhone 15 Pro (8 GB) | Feasible |
See Also
- majentik/MERaLiON-2-3B-RotorQuant-MLX-2bit -- RotorQuant 2-bit variant
- majentik/MERaLiON-2-3B-TurboQuant-MLX-4bit -- TurboQuant 4-bit (higher quality)
- majentik/MERaLiON-2-3B-TurboQuant-MLX-8bit -- TurboQuant 8-bit (highest quality)
- majentik/MERaLiON-2-10B-TurboQuant-MLX-2bit -- 10B TurboQuant 2-bit (larger model)
- aisingapore/MERaLiON-AudioLLM-Whisper-SEA-LION-V3-3B -- Original base model
Quant trade-off (MLX lane)
| Bits | Approx size | Use case | Recommendation |
|---|---|---|---|
| **2-bit** | **~799 MB** | **Aggressive quantization** | **Very low-RAM Macs** |
| 3-bit | ~1.1 GB | Lossy but small | Low-RAM Macs |
| 4-bit | ~1.3 GB | Balanced default | Recommended for most Macs |
| 5-bit | ~1.5 GB | Higher fidelity | Quality-sensitive |
| 6-bit | ~1.8 GB | Approaching FP16 quality | High-fidelity |
| 8-bit | ~2.3 GB | Near-lossless reference | Fidelity-critical work |
(The current variant, 2-bit, is bolded.)
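As a rough back-of-envelope check on how size scales with bit width, the sketch below estimates on-disk size from a parameter count. It assumes a uniform group-wise scheme in which each group of 64 weights stores one 16-bit scale and one 16-bit bias. The group size, the overhead, and both function names are assumptions for illustration, and the ~3B parameter count is taken from the specifications table. Because it ignores mixed precision and any unquantized layers, it will not reproduce the figures in the tables exactly.

```python
def effective_bits_per_weight(bits: int, group_size: int = 64, overhead_bits: int = 32) -> float:
    """Bits per weight including per-group metadata.

    Assumption: one fp16 scale plus one fp16 bias (32 bits total) per group.
    """
    return bits + overhead_bits / group_size


def estimated_size_gb(num_params: float, bits: int, group_size: int = 64) -> float:
    """Approximate size in GB (10^9 bytes) for uniformly quantized weights."""
    total_bits = num_params * effective_bits_per_weight(bits, group_size)
    return total_bits / 8 / 1e9


# ~3B parameters, per the specifications table.
for bits in (2, 4, 8):
    print(f"{bits}-bit: ~{estimated_size_gb(3e9, bits):.2f} GB")
```

The per-group metadata adds a fixed half bit per weight under these assumptions, so the relative overhead is largest at 2 bits.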
Variants in this family
(Showing 8 sibling variants under majentik/meralion2-3b-*. The current variant, TurboQuant-MLX-2bit, is bolded.)
| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| RotorQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| RotorQuant-MLX-2bit | mlx-lm | ~983 MB | Apple Silicon, smallest |
| RotorQuant-MLX-4bit | mlx-lm | ~1.9 GB | Apple Silicon balanced |
| RotorQuant-MLX-8bit | mlx-lm | ~3.5 GB | Apple Silicon reference |
| TurboQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| **TurboQuant-MLX-2bit** | **mlx-lm** | **~983 MB** | **Apple Silicon, smallest** |
| TurboQuant-MLX-4bit | mlx-lm | ~1.9 GB | Apple Silicon balanced |
| TurboQuant-MLX-8bit | mlx-lm | ~3.5 GB | Apple Silicon reference |