nightmedia
posted an update 5 days ago
The Qwen3.5-27B performance landscape

I started gathering some numbers on the 27Bs.

You might have noticed that reported metrics differ between Thinking and Instruct models; this is expected. The mxfp8/mxfp4 quants are the most stable I could measure, and I provided Deckard(qx) quants where possible.

Converting a Thinking model to Instruct

The model is a hybrid thinking/instruct model; instruct mode can be forced by setting the first line of the Jinja template to:
{%- set enable_thinking = false %}
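As a convenience, the toggle can be prepended programmatically. A minimal sketch, assuming the chat template is stored in the model directory's tokenizer_config.json under the standard "chat_template" key (adjust the path and key if your model ships the template elsewhere):

```python
import json
from pathlib import Path

def force_instruct_mode(model_dir: str) -> str:
    """Prepend the thinking toggle to a model's Jinja chat template.

    Assumes the template lives in tokenizer_config.json under the
    standard "chat_template" key; adjust for your model layout.
    """
    config_path = Path(model_dir) / "tokenizer_config.json"
    config = json.loads(config_path.read_text())
    toggle = "{%- set enable_thinking = false %}\n"
    template = config.get("chat_template", "")
    # Only prepend once, so repeated runs are harmless
    if not template.startswith(toggle):
        config["chat_template"] = toggle + template
        config_path.write_text(json.dumps(config, indent=2))
    return config["chat_template"]
```

Running it twice leaves the template unchanged after the first call, so it is safe to script.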


Qwen3.5-27B-Text

This is a model I tested with the vision tower removed; its performance matches the VL model.
nightmedia/Qwen3.5-27B-Text-qx86-hi-mlx
quant     arc   arc/e boolq hswag obkqa piqa  wino
qx86-hi   0.443,0.498,0.857,0.701,0.372,0.770,0.752
mxfp4     0.460,0.527,0.871,0.694,0.370,0.772,0.752


DavidAU/Qwen3.5-27B-Claude-4.6-OS-INSTRUCT

At the top of the heap of the models I tested, as far as metrics go, is this model created by DavidAU. Samples of the output are provided on the model card.
nightmedia/Qwen3.5-27B-Claude-4.6-OS-INSTRUCT-mxfp8-mlx
quant     arc   arc/e boolq hswag obkqa piqa  wino
mxfp8     0.675,0.827,0.900,0.750,0.496,0.800,0.721
qx86-hi   0.667,0.824,0.902,0.752,0.502,0.791,0.725
qx64-hi   0.664,0.820,0.902
mxfp4     0.653,0.815,0.899

For the Thinking version, see nightmedia/Qwen3.5-27B-Architect-Claude-qx86-hi-mlx

More metrics in comments.

-G

P.S. I will update this as soon as I have new numbers or I find a typo, whichever comes first. The models that show just the arc-check numbers are in the test queue and will be updated soon.

Deckard

nightmedia/Qwen3.5-27B-Architect-Deckard-Heretic

I created a line of Architects and Engineers that use XML tool descriptions in the Jinja template. This seems to stabilize inference, raise performance, and eliminate looping. It works best in the 35B-A3B MoE, but it appears to work here too.
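As a rough illustration of the idea (not the exact template from these models), tool definitions can be rendered as XML-style tags inside the Jinja template instead of raw JSON; the `tool.function.*` layout below assumes the common OpenAI-style tool schema:

```jinja
{%- for tool in tools %}
<tool>
  <name>{{ tool.function.name }}</name>
  <description>{{ tool.function.description }}</description>
  <parameters>{{ tool.function.parameters | tojson }}</parameters>
</tool>
{%- endfor %}
```

The hypothesis is that explicit open/close tags give the model clearer structural boundaries than nested JSON, which may be why inference stabilizes.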

You can convert any Architect into an Engineer by disabling thinking.

I have a few trained Architects, some in 35B.

One of my favorites in 27B is Deckard, here in mxfp4. This model was trained by DavidAU on the works of Philip K. Dick, from Ubik to Blade Runner to The Man in the High Castle, and it can RP and offer snarky commentary, just like the detective.

When doing a character preference check in the Star Trek universe, the model picks Geordi LaForge or The Doctor, depending on quant size.

https://huggingface.co/nightmedia/Qwen3.5-27B-Architect-Deckard-Heretic-mxfp4-mlx

          arc   arc/e boolq hswag obkqa piqa  wino
mxfp4     0.461,0.513,0.821,0.727,0.396,0.777,0.773

I did not publish Deckard in the Deckard(qx) formula simply because I haven't tested it yet.

It's coming soon.

Opus trained models

This seems to be a popular distill that everyone is doing now, so I tested a few variants. As they might be using different training sets, the IQ and output quality may vary.

Qwen3.5-40B-Claude-4.5-Opus-High-Reasoning-Thinking
mxfp8     0.462,0.547,0.859

Qwen3.5-27B-Claude-4.6-OS-Auto-Variable-Thinking
mxfp8     0.485,0.566,0.875,0.746,0.408,0.789,0.730

Qwen3.5-27B-Claude-4.6-OS-Auto-Variable-Heretic-Uncensored-Thinking
mxfp8     0.467,0.556,0.859,0.739,0.400,0.786,0.732

TeichAI/Qwen3.5-27b-Opus-4.6-Distill
quant     arc   arc/e boolq hswag obkqa piqa  wino
qx86-hi   0.458,0.544,...
qx64-hi   0.459,0.542,0.724,0.764,0.402,0.790,0.783

Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled
qx64-hi   0.434,0.530,0.850,0.708,0.384,0.766,0.721

DavidAU/Qwen3.5-27B-Claude-4.6-OS-Auto-Variable-Thinking
mxfp8     0.485,0.566,0.875,0.746,0.408,0.789,0.730

DavidAU/Qwen3.5-27B-Claude-4.6-OS-INSTRUCT
mxfp8     0.675,0.827,0.900,0.750,0.496,0.800,0.721
qx86-hi   0.667,0.824,0.902,0.752,0.502,0.791,0.725
qx64-hi   0.664,0.820,0.902
mxfp4     0.653,0.815,0.899

DavidAU's model can be found here:

https://huggingface.co/nightmedia/Qwen3.5-27B-Claude-4.6-OS-INSTRUCT-mxfp8-mlx

The thinking version with Architect features is the Nightmedia model:

https://huggingface.co/nightmedia/Qwen3.5-27B-Architect-Claude-qx86-hi-mlx

Polaris, GLM, Gemini, and other distills

          arc   arc/e boolq hswag obkqa piqa  wino
Jackrong/Qwen3.5-27B-Gemini-3.1-Pro-Reasoning-Distill
mxfp8    0.477,0.525,0.822,0.711,0.398,0.784,0.758

DavidAU/Qwen3.5-27B-Polaris-Advanced-Thinking-Alpha
mxfp4     0.473,0.548,0.709,0.728,0.396,0.777,0.753

DavidAU/Qwen3.5-27B-HERETIC-Polaris-Advanced-Thinking-Alpha-uncensored
mxfp4     0.476,0.537,0.694,...

DavidAU/Qwen3.5-27B-GLM-4.7-Flash-Thinking-ALPHA
mxfp4     0.443,0.504,0.851,...

Qwen3.5-27B-HERETIC-Polaris-Advanced-Thinking-Alpha-uncensored
mxfp4     0.473,0.548,0.709,0.728,0.396,0.777,0.753

Qwen3.5-27B-Architect-Polaris-Heretic
mxfp4     0.474,0.539,0.699,0.724,0.390,0.779,0.762

Qwen3.5-27B-Architect-Deckard-Heretic
mxfp4     0.461,0.513,0.821,0.727,0.396,0.777,0.773

Qwen3.5-27B-Polaris-Advanced-Thinking-Alpha
mxfp4     0.473,0.548,0.709,0.728,0.396,0.777,0.753

Qwen3.5-27B-Text
mxfp4     0.460,0.527,0.871,0.694,0.370,0.772,0.752

More numbers coming soon.

-G

Running metrics

I use a very simple test provided by the MLX framework:

mlx_lm.evaluate --model {{name}} --tasks winogrande boolq arc_challenge arc_easy hellaswag openbookqa piqa

The numbers shared are the resulting normalized accuracy values.
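To turn a results file into one of the comma-separated rows above, a small helper works. This is a sketch assuming the lm-evaluation-harness result layout (`results[task]["acc_norm,none"]`, with a plain-accuracy fallback); check the actual JSON your mlx_lm version emits before relying on the key names:

```python
def extract_norm_scores(results: dict, tasks: list[str]) -> str:
    """Flatten an evaluation results dict into one comma-separated row.

    Assumes the lm-evaluation-harness layout: the score for each task
    sits under results["results"][task]["acc_norm,none"], falling back
    to "acc,none" for tasks without a normalized variant.
    """
    scores = []
    for task in tasks:
        metrics = results["results"][task]
        value = metrics.get("acc_norm,none", metrics.get("acc,none"))
        scores.append(f"{value:.3f}")
    return ",".join(scores)
```

Feeding it the task list in the table's column order (arc_challenge, arc_easy, boolq, hellaswag, openbookqa, piqa, winogrande) reproduces a row like the ones above.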

Test run times by model

0.8B-mxfp8   21:32
2B-mxfp8     41:25
4B-mxfp8   1:25:52
9B-mxfp8   2:33:18
27B-mxfp4  5:59:15
35B-A3B    1:47:22
122B-A10B  5:00:13

Some take longer; the mxfp quants are usually the fastest, except where a Deckard(qx) quant "clicks", in which case that one is faster.

The numbers shown were measured on a MacBook Pro M4 Max.

The latest 27B numbers include a few 40B brainstormed models by DavidAU; their metrics are being processed and will be available this weekend.
