MLX quantizations + M3 Ultra benchmark (107 tok/s on 4-bit)
#2
by Raullen - opened
Hi — wanted to share that Tmax-9B is now available as four MLX quantizations on mlx-community, and runs through an OpenAI-compatible local server with one command:
pip install rapid-mlx==0.8.18
rapid-mlx serve tmax-9b
Quantizations (all at https://huggingface.co/mlx-community):
- Tmax-9B-MLX-bf16 — full precision
- Tmax-9B-MLX-8bit
- Tmax-9B-MLX-6bit
- Tmax-9B-MLX-4bit ← recommended for M3 Ultra
Chat completions + tool calls (via the qwen3_xml parser) both verified end-to-end.
M3 Ultra benchmark (median of 3 runs, rapid-mlx 0.8.18)
Apple M3 Ultra Studio · 28-core CPU · 60-core GPU · 256 GB unified
| Tmax 9B variant | Decode tok/s | TTFT ms | Prefill 1k | Prefill 4k | Prefill 16k | Tool-call ms |
|---|---|---|---|---|---|---|
| 4-bit | 107.4 | 127.2 | 1059.8 | 1123.9 | 1091.7 | 726.4 |
| 6-bit | 79.82 | 140.1 | 1032.5 | 1097.3 | 1063.7 | 813.9 |
| 8-bit | 67.54 | 142.7 | 1053.7 | 1121.3 | 1086.5 | 870.6 |
| bf16 | — | — | — | — | — | — |
Highlights:
- 107 decode tok/s on 4-bit — ~19% faster than Qwen3.5-9B-4bit control on the same hardware (90.5 tok/s).
- 127 ms TTFT, sub-1s prefill on 1k, sub-1s tool-call e