Qwen3.6-35B-A3B optimized for MLX.

  • 4-bit baseline with important layers at 8-bit and BF16.
  • This quant does not support image input.

I ended up selecting two winners from my trials. This is the quality+ version, and here's the speed+ version.

Usage

# Start server at http://localhost:8080/v1/chat/completions
uvx --from mlx-lm mlx_lm.server \
  --host 127.0.0.1 \
  --port 8080 \
  --model spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit

Benchmarks

metric mlx-community/ Qwen3.6-35B-A3B-4bit mlx-community/ Qwen3.6-35B-A3B-4.4bit-msq 4.8 bit 5.4 bit (this model)
bpw 4.503 4.787 4.788 5.438
peak memory (1024/512) 20.683 21.922 21.928 24.741
prompt tok/s (1024) 2719.4470 ± 15.2250 2695.9370 ± 12.5260 2734.5260 ± 3.8810 2665.3060 ± 11.4520
gen tok/s (512) 108.4990 ± 0.4910 94.2940 ± 0.3650 97.2820 ± 0.0800 89.4920 ± 0.2610
kl divergence 0.0838 ± 0.0008 0.1689 ± 0.0015 0.0244 ± 0.0004 0.0189 ± 0.0003
perplexity 4.6150 ± 0.0320 4.2490 ± 0.0280 4.6410 ± 0.0320 4.6440 ± 0.0320
hellaswag 0.5560 ± 0.0220 0.5780 ± 0.0220 0.5440 ± 0.0220 0.5370 ± 0.0110
piqa 0.7940 ± 0.0180 0.7920 ± 0.0180 0.7920 ± 0.0180 0.7980 ± 0.0180
winogrande 0.7260 ± 0.0200 0.7400 ± 0.0200 0.7120 ± 0.0200 0.7100 ± 0.0200

I've moved over to using speed + KL divergence as my primary optimization metrics. Hellaswag, PIQA, Winogrande, and perplexity are kept as sanity checks, though these require high sample sizes to get usable signal.

Tested on a Mac Studio M3 Ultra with:

mlx_lm.convert --hf-path Qwen/Qwen3.6-35B-A3B --mlx-path ./mlx && mlx_lm.kld --baseline-model ./mlx
mlx_lm.perplexity --sequence-length 512 --seed 123
mlx_lm.benchmark --prompt-tokens 1024 --generation-tokens 512 --num-trials 5
mlx_lm.evaluate --tasks hellaswag --seed 123 --num-shots 0 --limit 500
mlx_lm.evaluate --tasks piqa --seed 123 --num-shots 0 --limit 500
mlx_lm.evaluate --tasks winogrande --seed 123 --num-shots 0 --limit 500

mlx_lm.kld is still an open PR.

Methodology

Quantized with a mlx-lm fork, drawing inspiration from Unsloth/AesSedai/ubergarm style mixed-precision GGUFs. MLX quantization options differ than llama.cpp, but the principles are the same:

  • Sensitive layers like MoE routing, attention, and output embeddings get higher precision
  • More tolerant layers like MoE experts get lower precision
Downloads last month
1,685
Safetensors
Model size
35B params
Tensor type
BF16
·
U32
·
MLX
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit

Quantized
(405)
this model