I got qwen3-coder-next up to 24 t/s when I removed -nkvo and kept -kvu. I think I could push it further if I took the time to compile llama.cpp myself instead of using the Docker image.
I run it on a Threadripper 3970X with 256GB of system RAM, offloading compute layers to a GTX 1660 with 6GB of VRAM. Using llama.cpp with -nkvo -kvu and all MoE layers on the CPU, I get an amazing 14 t/s generation speed at q8_0. I'm amazed.
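For anyone wanting to reproduce something like this, here is a rough sketch of what such a launch command could look like. The model path and the -ngl layer count are placeholders I made up, and I'm assuming --cpu-moe is how the "all MoE on CPU" part was done; the original post only confirms -nkvo and -kvu.

```shell
llama-server \
  -m ./qwen3-coder-next-q8_0.gguf \
  -ngl 99 \
  --cpu-moe \
  -nkvo \
  -kvu
# -ngl 99   : offload as many layers as fit onto the GPU (placeholder count)
# --cpu-moe : keep the MoE expert tensors on the CPU (assumed mechanism)
# -nkvo     : --no-kv-offload, KV cache stays in system RAM
# -kvu      : --kv-unified, use a single unified KV cache buffer
```

Per the follow-up above, dropping -nkvo (so the KV cache goes to the GPU) while keeping -kvu is what bumped it from 14 to 24 t/s.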