Inference much slower compared to other A3B models

#47
by engrtipusultan - opened

@ZHANGYUXUAN-zR

I have tested the model's speed in llama.cpp on my hardware, and it turns out to be much slower than other A3B models. Both pp and tg also drop off with context depth much faster than in other similar-size MoE models. I want to check whether you have any benchmarks from your internal testing showing how the model fares against other similar-size models in prompt processing (pp) and token generation (tg).

I want to understand whether this is a llama.cpp issue, a Vulkan back-end issue, or whether the model is simply like this because of its internal architecture.

llama-bench build: 8f91ca54e (7822)
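
For reference, something along these lines should reproduce the sweep below. The model path is a placeholder and the exact flag syntax may differ between builds, so treat this as a sketch and check `./llama-bench --help`:

```sh
# Sketch only: -p/-n match the pp512/tg128 tests, -d adds prefill depth
# before each test, and -fa toggles flash attention for the two tables below.
# The model filename is a placeholder, not the exact file used here.
./llama-bench -m ./model-Q8_0.gguf -p 512 -n 128 -d 0,1024,2048,8096 -fa 0
./llama-bench -m ./model-Q8_0.gguf -p 512 -n 128 -d 0,1024,2048,8096 -fa 1
```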

FA = off (all figures in t/s)

| test | gpt-oss 20B MXFP4 MoE | nemotron_h_moe 31B.A3.5B Q8_0 | qwen3moe 30B.A3B Q8_0 | GLM4.7 Flash Q8_0 |
| --- | --- | --- | --- | --- |
| pp512 | 147.08 ± 1.49 | 113.02 ± 0.32 | 119.46 ± 0.18 | 102.81 ± 0.31 |
| tg128 | 16.17 ± 0.00 | 12.05 ± 0.01 | 12.95 ± 0.01 | 10.77 ± 0.00 |
| pp512 @ d1024 | 136.19 ± 1.73 | 111.24 ± 0.13 | 105.93 ± 0.34 | 86.65 ± 0.31 |
| tg128 @ d1024 | 15.78 ± 0.03 | 11.84 ± 0.01 | 12.06 ± 0.06 | 7.29 ± 0.05 |
| pp512 @ d2048 | 128.45 ± 1.21 | 108.86 ± 0.40 | 94.63 ± 0.51 | 73.20 ± 0.48 |
| tg128 @ d2048 | 15.20 ± 0.03 | 11.50 ± 0.00 | 11.23 ± 0.00 | 5.28 ± 0.03 |
| pp512 @ d8096 | 95.64 ± 0.76 | 98.47 ± 0.93 | 56.28 ± 0.18 | 38.71 ± 0.02 |
| tg128 @ d8096 | 12.28 ± 0.01 | 9.17 ± 0.05 | 5.89 ± 0.05 | 2.19 ± 0.02 |

FA = on (all figures in t/s)

| test | gpt-oss 20B MXFP4 MoE | nemotron_h_moe 31B.A3.5B Q8_0 | qwen3moe 30B.A3B Q8_0 | GLM4.7 Flash Q8_0 |
| --- | --- | --- | --- | --- |
| pp512 | 146.69 ± 0.87 | 112.54 ± 0.65 | 114.26 ± 0.87 | 86.09 ± 0.12 |
| tg128 | 16.64 ± 0.01 | 12.12 ± 0.01 | 13.39 ± 0.01 | 10.97 ± 0.01 |
| pp512 @ d1024 | 132.76 ± 0.39 | 107.09 ± 0.32 | 77.43 ± 0.10 | 50.39 ± 0.10 |
| tg128 @ d1024 | 16.36 ± 0.08 | 12.05 ± 0.01 | 12.29 ± 0.00 | 9.76 ± 0.01 |
| pp512 @ d2048 | 120.38 ± 0.10 | 101.26 ± 0.28 | 55.47 ± 0.35 | 35.40 ± 0.02 |
| tg128 @ d2048 | 16.11 ± 0.08 | 11.98 ± 0.00 | 11.66 ± 0.01 | 8.79 ± 0.00 |
| pp512 @ d8096 | 77.32 ± 0.34 | 77.85 ± 0.48 | 20.76 ± 0.17 | 12.94 ± 0.01 |
| tg128 @ d8096 | 14.91 ± 0.01 | 11.52 ± 0.00 | 8.92 ± 0.00 | 5.58 ± 0.00 |
I suppose you should compare the speed with vLLM / SGLang, since the llama.cpp support was not added by z.ai.

I do not have a machine to do that, which is why I am asking for results from their internal testing. That would give a reference point for whether this is due to the inference engine or the model itself. The Qwen team published benchmarks for Qwen3-Next against other similar-size models.
So far gpt-oss 20B is the fastest at lower context depths, and NVIDIA-Nemotron-3-Nano-30B-A3B is the best at retaining pp and tg as depth grows, thanks to its Mamba-2 architecture; a rough illustration of why is sketched below.
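
A back-of-envelope calculation suggests why attention-based MoEs fall off with depth on a bandwidth-limited Vulkan setup. The layer count, KV head count and head dim below are assumptions for a typical ~30B A3B model, not exact figures for any of the models above:

```sh
# Illustrative only: approximate KV-cache bytes read per generated token at depth 8096,
# assuming ~48 layers, 4 KV heads, head dim 128, fp16 cache.
#        K+V   layers  kv_heads  head_dim  bytes  depth
echo $(( 2   * 48    * 4       * 128     * 2    * 8096 ))   # ~796 MB of cache traffic per token
```

A Mamba-2 block keeps a fixed-size state instead of a growing KV cache, so its per-token cost stays roughly flat with depth, which matches the nemotron numbers above.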