Qwen3.5 9B optimized to run on Mac.

  • A mixed-precision quant that balances speed, memory, and accuracy.
  • 4-bit baseline with important layers kept at 8-bit or BF16.

Usage

# Start server at http://localhost:8080/chat/completions
uvx --from mlx-vlm --with torchvision \
  mlx_vlm.server \
  --host 127.0.0.1 \
  --port 8080 \
  --model spicyneuron/Qwen3.5-9B-MLX-5.6bit-vision
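
Once the server is up, any HTTP client can query it. A minimal sketch using Python's requests library; the endpoint path comes from the comment above, while the prompt, max_tokens value, and OpenAI-style response shape are assumptions, not documented behavior of the server:

import requests

resp = requests.post(
    "http://localhost:8080/chat/completions",
    json={
        "model": "spicyneuron/Qwen3.5-9B-MLX-5.6bit-vision",
        "messages": [{"role": "user", "content": "Summarize MLX in one line."}],
        "max_tokens": 128,
    },
    timeout=300,
)
resp.raise_for_status()
# Assumes an OpenAI-compatible response: choices[0].message.content
print(resp.json()["choices"][0]["message"]["content"])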

Benchmarks

metric                        this model
bpw                           5.554
base memory (GB)              5.789
peak memory (GB, 1024/512)    7.446
prompt tok/s (1024)           1481.661 ± 6.709
gen tok/s (512)               91.086 ± 0.101
KL mean                       0.032 ± 0.002
KL p95                        0.069 ± 0.002
perplexity                    3.739 ± 0.018
winogrande                    0.660 ± 0.021

Tested on a Mac Studio M3 Ultra. The KL divergence is approximate: it is computed over the top-k logits rather than the full distributions. Here are the commands:

mlx_lm.kld --baseline-model path/to/mlx-full-precision
mlx_lm.perplexity --sequence-length 2048 --seed 123
mlx_lm.benchmark --prompt-tokens 1024 --generation-tokens 512 --num-trials 5
mlx_lm.evaluate --tasks winogrande --seed 123 --num-shots 0 --limit 500
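
For intuition, the top-k approximation restricts both models' next-token distributions to the baseline's k most likely tokens, renormalizes, and computes KL on that reduced support. A rough, hypothetical sketch in MLX; the function name and k=100 are illustrative, not the actual implementation behind mlx_lm.kld:

import mlx.core as mx

def topk_kl(base_logits: mx.array, quant_logits: mx.array, k: int = 100) -> mx.array:
    """Approximate KL(base || quant) over the baseline's top-k tokens.

    Both inputs are raw logits of shape (..., vocab_size).
    """
    # Indices of the baseline's k largest logits per position
    top_ids = mx.argpartition(-base_logits, kth=k, axis=-1)[..., :k]
    # Renormalize both distributions over that restricted support
    p = mx.softmax(mx.take_along_axis(base_logits, top_ids, axis=-1), axis=-1)
    q = mx.softmax(mx.take_along_axis(quant_logits, top_ids, axis=-1), axis=-1)
    # KL divergence per position, averaged over all positions
    return (p * (mx.log(p) - mx.log(q))).sum(axis=-1).mean()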

Methodology

Quantized with an mlx-lm fork, drawing inspiration from Unsloth/AesSedai/ubergarm-style mixed-precision GGUFs. MLX's quantization options differ from llama.cpp's, but the principles are the same:

  • Sensitive layers like MoE routing, attention, and output embeddings get higher precision
  • More tolerant layers like MoE experts get lower precision (see the sketch below)
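
As a rough sketch of how such a recipe looks with mlx-lm's quant_predicate hook; the path patterns, bit assignments, and output path below are illustrative assumptions, not the fork's published recipe:

from mlx_lm import convert

def mixed_precision_predicate(path: str, module, config: dict):
    """Pick quantization settings per layer; returning False leaves a layer in BF16."""
    # Sensitive: MoE router gates and embeddings get 8-bit
    if path.endswith(".gate") or path in ("lm_head", "model.embed_tokens"):
        return {"bits": 8, "group_size": 64}
    # Attention projections are also treated as sensitive here
    if any(p in path for p in ("q_proj", "k_proj", "v_proj", "o_proj")):
        return {"bits": 8, "group_size": 64}
    # Tolerant: MoE expert weights fall back to the 4-bit baseline
    return {"bits": 4, "group_size": 64}

convert(
    "Qwen/Qwen3.5-9B",              # upstream model
    mlx_path="qwen3.5-9b-mixed",    # illustrative output path
    quantize=True,
    q_bits=4,                       # baseline precision
    q_group_size=64,
    quant_predicate=mixed_precision_predicate,
)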