Inference much slower compared to other A3B models
I have tested the model's speed in llama.cpp on my hardware, and it turns out to be much slower than other A3B models. Both pp and tg also drop much faster with context depth than in other similar-size MoE models. I want to check whether you have any benchmarks from your internal testing showing how the model fares against other similar-size models for prompt processing and token generation.
I want to understand whether this is a llama.cpp problem, a Vulkan back-end problem, or inherent to the model's internal architecture.
`llama-bench` build: 8f91ca54e (7822)

**FA = off** (flash attention disabled)
| Test | gpt-oss 20B MXFP4 MoE | nemotron_h_moe 31B.A3.5B Q8_0 | qwen3moe 30B.A3B Q8_0 | GLM4.7 Flash Q8_0 |
|---|---|---|---|---|
| pp512 | 147.08 ± 1.49 | 113.02 ± 0.32 | 119.46 ± 0.18 | 102.81 ± 0.31 |
| tg128 | 16.17 ± 0.00 | 12.05 ± 0.01 | 12.95 ± 0.01 | 10.77 ± 0.00 |
| pp512 @ d1024 | 136.19 ± 1.73 | 111.24 ± 0.13 | 105.93 ± 0.34 | 86.65 ± 0.31 |
| tg128 @ d1024 | 15.78 ± 0.03 | 11.84 ± 0.01 | 12.06 ± 0.06 | 7.29 ± 0.05 |
| pp512 @ d2048 | 128.45 ± 1.21 | 108.86 ± 0.40 | 94.63 ± 0.51 | 73.20 ± 0.48 |
| tg128 @ d2048 | 15.20 ± 0.03 | 11.50 ± 0.00 | 11.23 ± 0.00 | 5.28 ± 0.03 |
| pp512 @ d8096 | 95.64 ± 0.76 | 98.47 ± 0.93 | 56.28 ± 0.18 | 38.71 ± 0.02 |
| tg128 @ d8096 | 12.28 ± 0.01 | 9.17 ± 0.05 | 5.89 ± 0.05 | 2.19 ± 0.02 |
**FA = on** (flash attention enabled)
| Test | gpt-oss 20B MXFP4 MoE | nemotron_h_moe 31B.A3.5B Q8_0 | qwen3moe 30B.A3B Q8_0 | GLM4.7 Flash Q8_0 |
|---|---|---|---|---|
| pp512 | 146.69 ± 0.87 | 112.54 ± 0.65 | 114.26 ± 0.87 | 86.09 ± 0.12 |
| tg128 | 16.64 ± 0.01 | 12.12 ± 0.01 | 13.39 ± 0.01 | 10.97 ± 0.01 |
| pp512 @ d1024 | 132.76 ± 0.39 | 107.09 ± 0.32 | 77.43 ± 0.10 | 50.39 ± 0.10 |
| tg128 @ d1024 | 16.36 ± 0.08 | 12.05 ± 0.01 | 12.29 ± 0.00 | 9.76 ± 0.01 |
| pp512 @ d2048 | 120.38 ± 0.10 | 101.26 ± 0.28 | 55.47 ± 0.35 | 35.40 ± 0.02 |
| tg128 @ d2048 | 16.11 ± 0.08 | 11.98 ± 0.00 | 11.66 ± 0.01 | 8.79 ± 0.00 |
| pp512 @ d8096 | 77.32 ± 0.34 | 77.85 ± 0.48 | 20.76 ± 0.17 | 12.94 ± 0.01 |
| tg128 @ d8096 | 14.91 ± 0.01 | 11.52 ± 0.00 | 8.92 ± 0.00 | 5.58 ± 0.00 |
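To make the "drops much faster" claim concrete, here is a quick sketch that computes what fraction of its base tg128 throughput each model retains at d8096 (FA = on, mean values copied from the table above):

```python
# How well does each model retain tg128 throughput at depth?
# Pairs are (tg128 at d0, tg128 at d8096), FA = on, means from the table above.
tg128 = {
    "gpt-oss 20B MXFP4":        (16.64, 14.91),
    "nemotron_h_moe 31B.A3.5B": (12.12, 11.52),
    "qwen3moe 30B.A3B":         (13.39, 8.92),
    "GLM4.7 Flash":             (10.97, 5.58),
}

# Percentage of base throughput retained at d8096, rounded to one decimal.
retention = {name: round(100 * deep / base, 1)
             for name, (base, deep) in tg128.items()}

for name, pct in sorted(retention.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {pct}% of base tg128 retained at d8096")
# nemotron retains ~95%, gpt-oss ~90%, qwen3moe ~67%, GLM4.7 Flash ~51%
```

This is what the last comment below is pointing at: the Nemotron model degrades far less with depth than the pure-attention MoEs.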
I suppose you should compare the speed with vLLM / SGLang, since the llama.cpp support was not added by z.ai.
I do not have a machine to do that. That is why I am asking for results from their internal testing; that would give a reference point for whether the slowdown comes from the inference engine or from the model itself. The Qwen team, for example, published benchmarks for Qwen3-Next against other similar-size models.
So far GPT-OSS 20B is king at lower contexts, and NVIDIA-Nemotron-3-Nano-30B-A3B is best at retaining pp and tg with depth, thanks to its Mamba-2 architecture.