IQ3_KS on EPYC 9355 + 1x RTX 5090
I have replaced the failed PSU in my machine, so I can spam you all again with some free EPYC Turin marketing :)
64K @ q8_0
```bash
./llama-sweep-bench \
--model "$MODEL_PATH" \
--no-mmap --merge-qkv \
-mla 3 -amb 1024 \
-b 8192 -ub 8192 \
-ctk q8_0 -ctv q8_0 -c 65536 \
-ngl 999 --n-cpu-moe 999 \
--threads 16 \
--threads-batch 28 \
--warmup-batch \
-n 128
```
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 8192 | 128 | 0 | 14.510 | 564.57 | 6.638 | 19.28 |
| 8192 | 128 | 8192 | 18.004 | 455.01 | 7.110 | 18.00 |
| 8192 | 128 | 16384 | 21.818 | 375.46 | 7.197 | 17.78 |
| 8192 | 128 | 24576 | 25.883 | 316.50 | 7.431 | 17.23 |
| 8192 | 128 | 32768 | 29.354 | 279.08 | 7.647 | 16.74 |
| 8192 | 128 | 40960 | 33.382 | 245.40 | 7.853 | 16.30 |
| 8192 | 128 | 49152 | 36.811 | 222.54 | 8.093 | 15.82 |
| 8192 | 128 | 57344 | 40.593 | 201.81 | 8.362 | 15.31 |
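For anyone skimming the tables: the throughput columns are simply the token counts divided by the wall times (S_PP = PP / T_PP, S_TG = TG / T_TG). A quick sanity check against the first row:

```bash
# Recompute the first row's throughput from the raw times:
awk 'BEGIN { printf "S_PP = %.2f t/s, S_TG = %.2f t/s\n", 8192/14.510, 128/6.638 }'
# -> S_PP = 564.58 t/s, S_TG = 19.28 t/s (matches the table up to rounding)
```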
80K @ q8_0
```bash
./llama-sweep-bench \
--model "$MODEL_PATH" \
--no-mmap --merge-qkv \
-mla 3 -amb 512 \
-b 4096 -ub 4096 \
-ctk q8_0 -ctv q8_0 -c 81920 \
-ngl 999 --n-cpu-moe 999 \
--threads 16 \
--threads-batch 28 \
--warmup-batch \
-n 128
```
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 128 | 0 | 10.241 | 399.95 | 6.581 | 19.45 |
| 4096 | 128 | 8192 | 12.155 | 336.99 | 6.945 | 18.43 |
| 4096 | 128 | 16384 | 13.839 | 295.98 | 7.160 | 17.88 |
| 4096 | 128 | 24576 | 15.226 | 269.02 | 7.401 | 17.30 |
| 4096 | 128 | 32768 | 16.732 | 244.81 | 7.622 | 16.79 |
| 4096 | 128 | 40960 | 18.418 | 222.39 | 7.875 | 16.25 |
| 4096 | 128 | 49152 | 20.205 | 202.72 | 8.078 | 15.85 |
| 4096 | 128 | 57344 | 22.105 | 185.30 | 8.321 | 15.38 |
| 4096 | 128 | 65536 | 24.047 | 170.33 | 8.601 | 14.88 |
| 4096 | 128 | 73728 | 26.148 | 156.65 | 8.890 | 14.40 |
32K @ f16 on Unsloth's UD-Q6_K_XL (for the curious)
```bash
./llama-sweep-bench \
--model "$MODEL_PATH" \
--no-mmap --merge-qkv \
-mla 3 -amb 512 \
-b 4096 -ub 4096 \
-ctk f16 -ctv f16 -c 32768 \
-ngl 999 --n-cpu-moe 999 \
--threads 16 \
--threads-batch 28 \
--warmup-batch \
-n 128
```
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 128 | 0 | 17.417 | 235.18 | 8.766 | 14.60 |
| 4096 | 128 | 4096 | 18.230 | 224.69 | 9.022 | 14.19 |
| 4096 | 128 | 8192 | 19.344 | 211.75 | 9.205 | 13.91 |
| 4096 | 128 | 12288 | 20.042 | 204.37 | 9.182 | 13.94 |
| 4096 | 128 | 16384 | 20.868 | 196.28 | 9.405 | 13.61 |
| 4096 | 128 | 20480 | 21.375 | 191.63 | 9.406 | 13.61 |
| 4096 | 128 | 24576 | 22.075 | 185.55 | 9.607 | 13.32 |
| 4096 | 128 | 28672 | 22.766 | 179.92 | 9.608 | 13.32 |
The speeds seem fine to me, but GLM needs a lot of VRAM for context, so a larger GPU would certainly help.
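To get a feel for why context is so expensive, here is a back-of-the-envelope KV-cache estimate. This is only a sketch assuming a plain GQA-style cache with made-up layer/head/dim values, not GLM's real ones; MLA (as enabled by -mla 3 above) caches compressed latents instead, so actual usage will differ:

```bash
# q8_0 packs 32 elements into 34 bytes -> ~1.0625 bytes per element.
# Plain GQA KV cache: 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes/elem.
layers=92; kv_heads=8; head_dim=128; ctx=65536; bpe=1.0625
awk -v l="$layers" -v h="$kv_heads" -v d="$head_dim" -v c="$ctx" -v b="$bpe" \
  'BEGIN { printf "KV cache ~ %.1f GiB\n", 2*l*h*d*c*b / 1073741824 }'
```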
Thank you for the quants @ubergarm, as always. I would love to see some slightly larger ones, too ;).
> --n-cpu-moe 999
Oh interesting, does this end up being the same as --cpu-moe, aka -ot exps=CPU? So you're not offloading any additional routed experts, I guess? Just curious, as I'm still trying to figure out the best way to run this big A40B beast; one guy suggested that offloading more routed experts to the GPU didn't help, or even made it slower.
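My rough mental model of how the three spellings relate, as a sketch (I haven't verified the exact tensor regexes each one expands to):

```bash
# All three aim to keep the routed-expert weights in system RAM
# (appended to a normal llama invocation):
--cpu-moe          # keep all MoE expert tensors on CPU
--n-cpu-moe 999    # keep the experts of the first 999 layers on CPU;
                   # with 999 >= n_layers this should behave like --cpu-moe
-ot exps=CPU       # manual override: tensors whose names match "exps" go to CPU
```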
Thanks for the great comparisons. Personally, I've found GLM-5 to be a pretty capable model, and it hasn't had the same tool-call parameter-order issues as qwen35 with opencode (though I saw a PR for that on ik recently).
Yes, I've had some requests for larger quants; we'll see. I'm being a bit cautious with my public quota, unfortunately, hah.
Glad your PSU is back up and thanks for sharing your results!!
> --n-cpu-moe 999
>
> Oh interesting, does this end up being the same as --cpu-moe, aka -ot exps=CPU? So you're not offloading any additional routed experts, I guess? Just curious, as I'm still trying to figure out the best way to run this big A40B beast; one guy suggested that offloading more routed experts to the GPU didn't help, or even made it slower.
I've been out of the loop for a while, so I am still a bit lost in all those new convenience params. Yes, I definitely meant it as a replacement for -ot exps=CPU, and I guess that is more or less what it did. I noticed your post here and took it as a base; I just needed to get rid of the dual-GPU setup. For some reason -ot exps=CPU was failing with "out of memory", so I researched a bit and ended up with this. :)
I also noticed you had -ger there, which I thought was only for Ring/Ling models, but maybe that changed, too.
Generally speaking, on this low-VRAM card what I want is to leave as much VRAM as possible for the KV cache, so I am happy to have all experts on the CPU.
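If I ever had VRAM to spare, my understanding is the usual trick would be to pin a few layers' experts to the GPU and leave the rest on CPU, something like the sketch below (the layer range and regex are illustrative, and exact tensor names depend on the model):

```bash
# As far as I understand, the first matching -ot pattern wins, so the
# GPU exceptions must come before the catch-all CPU rule:
-ot "blk\.[0-3]\.ffn_.*_exps=CUDA0" \
-ot exps=CPU
```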
I haven't had a chance yet to play with Kimi-K2.5 or GLM-5 (not even GLM-4.7 or GLM-4.7 Flash).
Things are moving too fast for me :) - and now all the Qwens and MiniMax and DeepSeek V4 are around the corner...
I noticed tool calling still seems broken for K2.5 on ik_llama, so I am using mainline for that. Your GLM-5 seems to work fine, though!
> -ger there, which I thought was only for Ring/Ling models
Ahh, pretty sure you are right. I just leave it on most things now because I wasn't 100% sure and didn't go back to read the PRs, lol. I'll likely remove it then, as it is not used much.
> For some reason -ot exps=CPU was failing with "out of memory"
With some models I've had a more difficult time getting them to load without OOMing in one place or another, depending on whether I'm using -sm layer or the newer -sm graph, etc. But yeah, I think --cpu-moe is doing more or less the same thing, though it maybe won't match some small vectors with similar regex names; I'm not 100% sure myself :)
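For completeness, the two split modes I mean, as bare flags (the -ts ratio is made up, just to show the shape):

```bash
-sm layer -ts 1,1,1   # classic per-layer split across three GPUs
-sm graph             # the newer graph-level split
```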
> Things are moving too fast for me :)
Omg, you're not kidding; it feels like all my quick-start commands are out of date a few days later, haha... it really is drinking from a fire hose. We'll see if DeepSeek V4 comes out soon too.. my HF public quota won't be able to keep up forever, lol.
Anyway, I always appreciate sharing and learning with you! Cheers!
I was having trouble properly splitting the layers across my GPUs, and as a result was getting slower-than-expected token generation and overall performance (around 7 t/s generation). I wouldn't mind playing with this model more in opencode if I had at least double that, which is about the speed you saw in your sweep bench.
Just wanted to say thanks for posting the launch command; I referenced it against my old command and was able to bring it up to around 13 t/s generation on my 1x 4090, 3x 3090 Intel QYFS system!
