Safetensors
qwen3_5
Eval Results

MLX quantizations + M3 Ultra benchmark (107 tok/s on 4-bit)

#2
by Raullen - opened

Hi — wanted to share that Tmax-9B is now available as four MLX quantizations on mlx-community, and runs through an OpenAI-compatible local server with one command:

pip install rapid-mlx==0.8.18
rapid-mlx serve tmax-9b

Quantizations (all at https://huggingface.co/mlx-community):

  • Tmax-9B-MLX-bf16 — full precision
  • Tmax-9B-MLX-8bit
  • Tmax-9B-MLX-6bit
  • Tmax-9B-MLX-4bit ← recommended for M3 Ultra

Chat completions + tool calls (via the qwen3_xml parser) both verified end-to-end.

M3 Ultra benchmark (median of 3 runs, rapid-mlx 0.8.18)

Apple M3 Ultra Studio · 28-core CPU · 60-core GPU · 256 GB unified

Tmax 9B variant Decode tok/s TTFT ms Prefill 1k Prefill 4k Prefill 16k Tool-call ms
4-bit 107.4 127.2 1059.8 1123.9 1091.7 726.4
6-bit 79.82 140.1 1032.5 1097.3 1063.7 813.9
8-bit 67.54 142.7 1053.7 1121.3 1086.5 870.6
bf16

Highlights:

  • 107 decode tok/s on 4-bit — ~19% faster than Qwen3.5-9B-4bit control on the same hardware (90.5 tok/s).
  • 127 ms TTFT, sub-1s prefill on 1k, sub-1s tool-call e

Sign up or log in to comment