MERaLiON-3-10B-MLX-4bit

A 4-bit MLX port of MERaLiON/MERaLiON-3-10B-preview for Apple Silicon: 4-bit group quantization (group_size=64), with no KV-cache compression applied.

MERaLiON-3 is a multimodal audio-language model from I2R, A*STAR (Singapore), built on a Gemma-2 decoder backbone. It targets speech-to-text, speech translation, and audio understanding across English, Mandarin, Malay, Tamil, Indonesian, and other Southeast Asian languages, with particular strength on Singlish and code-switched speech.

This is the raw MLX 4-bit lane: weight quantization only, no KV-cache compression. If you want KV-cache compression stacked on top, see the RotorQuant-MLX-4bit or TurboQuant-MLX-4bit variants.

What's in this repo

| Component | Format | Approximate size |
|---|---|---|
| Decoder (Gemma-2 backbone) | 4-bit MLX (group_size=64, affine) | ~5.0 GB |
| Encoder (Whisper-large-v3) | float16 (unquantized) | ~1.2 GB |
| Speech-text adaptor | float16 (unquantized) | ~0.4 GB |
| Tokenizer | Unchanged from upstream | n/a |

Approximate total size: ~6.6 GB (about 43% smaller than the ~11.6 GB 8-bit variant).

Quantized directly from the original full-precision upstream weights (not re-quantized from 8-bit). The encoder and adaptor are kept at float16: their parameter counts are small, and audio feature extraction is sensitive to aggressive quantization.
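For intuition, 4-bit affine group quantization with group_size=64 maps each group of 64 weights to integers in [0, 15] using a per-group scale and zero-point. The following is a pure-Python sketch of the idea, not the actual MLX kernels:

```python
import random


def quantize_group(weights, bits=4):
    """Affine-quantize one group of float weights to `bits`-bit integers.

    Returns (q, scale, zero) such that w ~= q * scale + zero.
    Illustrative only; MLX's real kernels pack these into fused buffers.
    """
    qmax = (1 << bits) - 1             # 15 for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0    # guard against constant groups
    zero = lo
    q = [round((w - zero) / scale) for w in weights]
    return q, scale, zero


def dequantize_group(q, scale, zero):
    return [v * scale + zero for v in q]


# Example: quantize one group of 64 weights and measure reconstruction error
random.seed(0)
group = [random.uniform(-0.1, 0.1) for _ in range(64)]
q, scale, zero = quantize_group(group)
recon = dequantize_group(q, scale, zero)
max_err = max(abs(a - b) for a, b in zip(group, recon))
print(f"max reconstruction error: {max_err:.5f}")  # bounded by scale / 2
```

Smaller groups (e.g. 32) lower the per-group error at the cost of more scale/zero-point metadata; group_size=64 is a common middle ground.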

Quickstart

Full multimodal (audio in, text out)

```shell
pip install mlx-meralion
```

```python
from mlx_meralion import load_model, transcribe

model = load_model("majentik/MERaLiON-3-10B-MLX-4bit")

# ASR
text = transcribe(model, "audio.wav")
print(text)

# Translation
text_zh = transcribe(model, "audio.wav", task="translate_zh")
print(text_zh)
```

Decoder-only (text generation)

The 4-bit Gemma-2 decoder can be loaded standalone with mlx-lm:

```python
from mlx_lm import load, generate

model, tokenizer = load("majentik/MERaLiON-3-10B-MLX-4bit/decoder")
out = generate(model, tokenizer, prompt="Hello", max_tokens=128)
print(out)
```

Size comparison

| Precision | Approximate size | Variant |
|---|---|---|
| FP16 (upstream) | ~20 GB | MERaLiON/MERaLiON-3-10B-preview |
| MLX 8-bit | ~11.6 GB | majentik/MERaLiON-3-10B-MLX |
| MLX 4-bit | ~6.6 GB | this repo |
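A back-of-envelope check on the decoder size: packed 4-bit weights plus per-group metadata land near the reported ~5.0 GB. The figures below are assumptions for illustration (roughly 9B decoder parameters, with the rest of the 10B in the encoder and adaptor, and a float16 scale plus float16 zero-point per 64-weight group), not numbers from the repo:

```python
# Rough size estimate for a 4-bit group-quantized decoder.
params = 9e9                 # assumed decoder parameter count
group_size = 64
weight_bytes = 4 / 8         # 4-bit packed weights = 0.5 bytes each
metadata_bytes = 4 / group_size  # 2 x float16 (scale + zero) per group

total_gb = params * (weight_bytes + metadata_bytes) / 1e9
print(f"estimated decoder size: {total_gb:.1f} GB")
```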

What's not in this repo

  • Training data. Quantization-only release.
  • Benchmarks. Expect mild WER drift on 4-bit vs. 8-bit upstream; formal measurement is pending.
  • Safety alignment changes. Inherited from upstream.

Known limitations

  • Inherits upstream's out-of-scope warning: not intended for tool-calling, math, or coding tasks.
  • At 4-bit, some multilingual tail languages (Tamil, Thai, Vietnamese) may see larger WER regressions than English/Mandarin. If quality matters more than memory, prefer the 8-bit variant.
  • Audio >30s is chunked automatically by mlx-meralion; boundary artifacts are possible.
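The >30 s chunking noted above can be approximated with overlapping windows, roughly as follows. The window and overlap values here are illustrative, not the actual mlx-meralion defaults:

```python
def chunk_audio(num_samples, sample_rate=16000, window_s=30.0, overlap_s=1.0):
    """Split long audio into overlapping windows of (start, end) sample offsets.

    A small overlap between adjacent chunks reduces boundary artifacts when
    the per-chunk transcripts are stitched back together.
    """
    window = int(window_s * sample_rate)
    step = int((window_s - overlap_s) * sample_rate)
    chunks = []
    start = 0
    while start < num_samples:
        chunks.append((start, min(start + window, num_samples)))
        if start + window >= num_samples:
            break
        start += step
    return chunks


# 75 seconds of 16 kHz audio -> three overlapping ~30 s chunks
print(chunk_audio(75 * 16000))
```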

Hardware requirements

  • Apple Silicon (M1/M2/M3/M4), macOS
  • ~8 GB unified memory recommended (runs comfortably on a base M1/M2 MacBook Air)

License

Released under the MERaLiON Public Licence v3, inherited from the upstream model. See the license PDF.
