# MERaLiON-2-3B-RotorQuant-MLX-4bit

MLX 4-bit RotorQuant quantization of aisingapore/MERaLiON-AudioLLM-Whisper-SEA-LION-V3-3B for Apple Silicon inference.

RotorQuant applies rotation-based quantization that decorrelates weight matrices before quantization, distributing outlier magnitudes more evenly across channels for improved accuracy at low bit-widths.

## Model Specifications

| Property | Value |
|---|---|
| Base model | aisingapore/MERaLiON-AudioLLM-Whisper-SEA-LION-V3-3B |
| Parameters | ~3B |
| Architecture | Whisper-large-v3 encoder + Gemma-2-2B-IT decoder |
| Quantization | RotorQuant 4-bit (MLX) |
| Disk size | ~1.5 GB |
| Peak RAM | ~2.5 GB |
| License | Apache 2.0 |
| Task | Automatic speech recognition / speech-to-text |
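The ~1.5 GB disk figure follows directly from the bit-width. A back-of-the-envelope sketch (the helper and the 5% overhead factor for quantization scales and metadata are illustrative assumptions, not measured values):

```python
def quantized_size_gb(n_params: float, bits: int, overhead: float = 0.05) -> float:
    """Rough disk-size estimate: n_params weights at `bits` bits each,
    plus a fractional overhead for quantization scales/zero-points."""
    raw_bytes = n_params * bits / 8
    return raw_bytes * (1 + overhead) / 1e9

# ~3B parameters at 4 bits comes out to roughly 1.5 GB,
# matching the disk-size row in the table above
print(f"{quantized_size_gb(3e9, 4):.2f} GB")
```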

## Quickstart

### Installation

```bash
pip install mlx-lm mlx-whisper
```

### Inference

```python
from mlx_lm import load, generate
from mlx_lm.cache import IsoQuantCache

model, tokenizer = load("majentik/MERaLiON-2-3B-RotorQuant-MLX-4bit")

# RotorQuant models pair with IsoQuantCache for KV-cache quantization
cache = IsoQuantCache(model)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Transcribe the following audio."}],
    tokenize=False,
    add_generation_prompt=True,
)

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=512,
    cache=cache,
)
print(response)
```

## Quantization Details

RotorQuant is a rotation-based quantization strategy that:

- Applies learned rotation matrices to decorrelate weight channels before quantization
- Reduces the impact of outlier weights that typically degrade quantized model quality
- Provides more uniform weight distributions, leading to better accuracy retention
- Pairs with IsoQuantCache for consistent KV-cache quantization during inference

This 4-bit variant provides a strong balance between quality and memory usage for the 3B model. The rotation-based approach is particularly effective at 4-bit, where outlier sensitivity is more pronounced.
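RotorQuant's rotations are learned per layer; as a minimal illustration of the general principle only, this NumPy sketch applies a *random* orthogonal rotation before uniform 4-bit quantization and inverts it after dequantization. Because the rotation is orthogonal, it preserves the reconstruction error measured in the original weight space, while spreading outlier energy across channels so a smaller quantization scale can be used:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_4bit(w):
    """Uniform symmetric 4-bit quantization: integers in [-8, 7]."""
    scale = np.abs(w).max() / 7
    q = np.clip(np.round(w / scale), -8, 7)
    return q, scale

def dequantize(q, scale):
    return q * scale

# Weight matrix with one large outlier channel
w = rng.normal(size=(64, 64))
w[:, 0] *= 20  # the outlier column dominates the quantization scale

# Plain quantization: the outlier inflates the scale, hurting small weights
q, s = quantize_4bit(w)
err_plain = np.abs(w - dequantize(q, s)).mean()

# Rotate first: an orthogonal matrix spreads the outlier energy
# across all columns before quantization
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
wr = w @ Q
qr_, sr = quantize_4bit(wr)
w_hat = dequantize(qr_, sr) @ Q.T  # rotate back after dequantization
err_rot = np.abs(w - w_hat).mean()

print(f"mean abs error, plain:   {err_plain:.4f}")
print(f"mean abs error, rotated: {err_rot:.4f}")  # typically much lower
```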

## Supported Languages

MERaLiON-2 supports speech recognition in Southeast Asian languages including English, Mandarin Chinese, Malay, Tamil, and Indonesian.

## Memory Estimates

| Device | Feasibility |
|---|---|
| MacBook Air M1 (8 GB) | Comfortable |
| MacBook Pro M1/M2 (16 GB) | Ideal |
| iPad Pro M1/M2 | Comfortable |
| iPhone 15 Pro (8 GB) | Feasible |
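The figures above are dominated by the weights; the KV cache adds a context-length-dependent term on top. A rough sketch, assuming the Gemma-2-2B decoder's published geometry (26 layers, 4 grouped KV heads of dimension 256 — verify against the model config) and that the cache is also held at 4 bits:

```python
def kv_bytes_per_token(layers=26, kv_heads=4, head_dim=256, bits=4):
    """Per-token KV-cache footprint: keys + values across all layers."""
    return 2 * layers * kv_heads * head_dim * bits / 8

per_token = kv_bytes_per_token()
# A long context adds this on top of the ~1.5 GB of weights
print(f"{per_token * 4096 / 1e6:.0f} MB for a 4096-token cache")
```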
