# MERaLiON-3-10B-MLX-4bit

A 4-bit MLX port of MERaLiON/MERaLiON-3-10B-preview for Apple Silicon, using 4-bit group quantization (group_size=64) on the decoder weights. No KV-cache compression is applied.
MERaLiON-3 is a multimodal audio-language model from I2R, A*STAR (Singapore), built on a Gemma-2 decoder backbone. It targets speech-to-text, speech translation, and audio understanding across English, Mandarin, Malay, Tamil, Indonesian, and other Southeast Asian languages, with particular strength on Singlish and code-switched speech.
This is the raw MLX 4-bit lane: weight quantization only, no KV-cache compression. If you want KV-cache compression stacked on top, see the RotorQuant-MLX-4bit or TurboQuant-MLX-4bit variants.

## What's in this repo
| Component | Format | Approximate size |
|---|---|---|
| Decoder (Gemma-2 backbone) | 4-bit MLX (group_size=64, affine) | ~5.0 GB |
| Encoder (Whisper-large-v3) | float16 (unquantized) | ~1.2 GB |
| Speech-text adaptor | float16 (unquantized) | ~0.4 GB |
| Tokenizer | Unchanged from upstream | n/a |

Approximate total size: ~6.6 GB (about 43% smaller than the 8-bit variant).
Quantized directly from the original full-precision upstream weights (not re-quantized from 8-bit). Encoder and adaptor are kept at float16 since their parameter count is small and quality is sensitive to aggressive quantization on audio feature extraction.
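To make the quantization scheme concrete, here is a minimal NumPy sketch of 4-bit affine group quantization with group_size=64. This is illustrative only; it is not MLX's actual packed kernel, and the function names are hypothetical.

```python
import numpy as np

def quantize_4bit_affine(w, group_size=64):
    """Illustrative 4-bit affine group quantization (not MLX's real kernel).

    Each group of 64 consecutive weights gets its own scale and offset,
    and values are rounded to 4-bit integers in [0, 15].
    """
    flat = w.reshape(-1, group_size)
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero on flat groups
    q = np.clip(np.round((flat - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo, shape):
    # Reconstruct an approximation of the original weights.
    return (q.astype(np.float32) * scale + lo).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 256)).astype(np.float32)
q, scale, lo = quantize_4bit_affine(w)
w_hat = dequantize(q, scale, lo, w.shape)
print("max abs error:", np.abs(w - w_hat).max())
```

The per-group rounding error is bounded by half a quantization step (scale / 2), which is why small, well-conditioned groups keep the decoder usable at 4 bits while the smaller encoder and adaptor stay at float16.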
## Quickstart

### Full multimodal (audio in, text out)

```bash
pip install mlx-meralion
```
```python
from mlx_meralion import load_model, transcribe

model = load_model("majentik/MERaLiON-3-10B-MLX-4bit")

# ASR
text = transcribe(model, "audio.wav")
print(text)

# Translation
text_zh = transcribe(model, "audio.wav", task="translate_zh")
```
### Decoder-only (text generation)

The 4-bit Gemma-2 decoder can be loaded standalone with `mlx-lm`:
```python
from mlx_lm import load, generate

model, tokenizer = load("majentik/MERaLiON-3-10B-MLX-4bit/decoder")
out = generate(model, tokenizer, prompt="Hello", max_tokens=128)
print(out)
```
## Size comparison
| Precision | Approximate size | Variant |
|---|---|---|
| FP16 (upstream) | ~20 GB | MERaLiON/MERaLiON-3-10B-preview |
| MLX 8-bit | ~11.6 GB | majentik/MERaLiON-3-10B-MLX |
| MLX 4-bit (this repo) | ~6.6 GB | majentik/MERaLiON-3-10B-MLX-4bit |
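The table's size reductions are easy to sanity-check. A quick arithmetic sketch (sizes taken from the table above):

```python
# Approximate on-disk sizes from the comparison table (GB).
sizes_gb = {"fp16": 20.0, "mlx_8bit": 11.6, "mlx_4bit": 6.6}

def reduction(small, big):
    """Fractional size reduction of `small` relative to `big`."""
    return 1.0 - small / big

print(f"4-bit vs fp16:  {reduction(sizes_gb['mlx_4bit'], sizes_gb['fp16']):.0%}")    # ~67% smaller
print(f"4-bit vs 8-bit: {reduction(sizes_gb['mlx_4bit'], sizes_gb['mlx_8bit']):.0%}")  # ~43% smaller
```

Note the 4-bit repo is not half the 8-bit size: the float16 encoder and adaptor (~1.6 GB combined) are unquantized in both variants, so only the decoder shrinks.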
## What's not in this repo
- Training data. Quantization-only release.
- Benchmarks. Expect mild WER drift at 4-bit relative to the 8-bit and FP16 variants; formal measurement is pending.
- Safety alignment changes. Inherited from upstream.
## Known limitations
- Inherits upstream's out-of-scope warning: not intended for tool-calling, math, or coding tasks.
- At 4-bit, some multilingual tail languages (Tamil, Thai, Vietnamese) may see larger WER regressions than English/Mandarin. If quality matters more than memory, prefer the 8-bit variant.
- Audio longer than 30 s is chunked automatically by mlx-meralion; boundary artifacts are possible.
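For intuition on where those boundary artifacts come from, here is a hypothetical fixed-window chunker with a small overlap. This is a sketch, not mlx-meralion's actual implementation, which may use silence-aware boundaries instead of fixed windows.

```python
import numpy as np

def chunk_audio(wave, sr=16000, chunk_s=30.0, overlap_s=1.0):
    """Split a waveform into ~30 s windows that overlap by `overlap_s` seconds.

    Hypothetical sketch of long-audio chunking; words cut at a window edge
    are the usual source of boundary artifacts.
    """
    size = int(chunk_s * sr)                # samples per chunk
    step = int((chunk_s - overlap_s) * sr)  # hop between chunk starts
    return [wave[i:i + size] for i in range(0, len(wave), step)]

wave = np.zeros(16000 * 75, dtype=np.float32)  # 75 s of dummy audio
chunks = chunk_audio(wave)
print(len(chunks), [len(c) / 16000 for c in chunks])  # 3 chunks: 30 s, 30 s, 17 s
```

Transcripts from overlapping chunks then need to be merged, which is another place duplicated or dropped words can appear.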
## Hardware requirements
- Apple Silicon (M1/M2/M3/M4), macOS
- ~8 GB unified memory recommended (runs comfortably on a base M1/M2 MacBook Air)
## License
Released under the MERaLiON Public Licence v3, inherited from the upstream model. See the license PDF.
## Links
- Upstream: MERaLiON/MERaLiON-3-10B-preview
- Sibling: majentik/MERaLiON-3-10B-MLX (8-bit)
- KV-quant variants: RotorQuant-MLX-4bit, TurboQuant-MLX-4bit
- Garden hub: majentik/garden
- MLX framework: ml-explore/mlx
- Runtime: mlx-audiollm
## Model tree

- Base model: google/gemma-2-9b