MLX quantizations + M3 Ultra benchmark (107 tok/s on 4-bit)

by Raullen - opened 9 days ago

Hi — wanted to share that Tmax-9B is now available as four MLX quantizations on mlx-community, and runs through an OpenAI-compatible local server with one command:

pip install rapid-mlx==0.8.18
rapid-mlx serve tmax-9b

Quantizations (all at https://huggingface.co/mlx-community):

Tmax-9B-MLX-bf16 — full precision
Tmax-9B-MLX-8bit
Tmax-9B-MLX-6bit
Tmax-9B-MLX-4bit ← recommended for M3 Ultra

Chat completions + tool calls (via the qwen3_xml parser) both verified end-to-end.

M3 Ultra benchmark (median of 3 runs, rapid-mlx 0.8.18)

Apple M3 Ultra Studio · 28-core CPU · 60-core GPU · 256 GB unified

Tmax 9B variant	Decode tok/s	TTFT ms	Prefill 1k	Prefill 4k	Prefill 16k	Tool-call ms
4-bit	107.4	127.2	1059.8	1123.9	1091.7	726.4
6-bit	79.82	140.1	1032.5	1097.3	1063.7	813.9
8-bit	67.54	142.7	1053.7	1121.3	1086.5	870.6
bf16	—	—	—	—	—	—

Highlights:

107 decode tok/s on 4-bit — ~19% faster than Qwen3.5-9B-4bit control on the same hardware (90.5 tok/s).
127 ms TTFT, sub-1s prefill on 1k, sub-1s tool-call e

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment