IQ3_KS on EPYC 9355 + 1x RTX 5090
I have replaced the failed PSU in my machine, so I can spam you all again with some free EPYC Turin marketing :)
64K @ q8_0
```bash
./llama-sweep-bench \
--model "$MODEL_PATH" \
--no-mmap --merge-qkv \
-mla 3 -amb 1024 \
-b 8192 -ub 8192 \
-ctk q8_0 -ctv q8_0 -c 65536 \
-ngl 999 --n-cpu-moe 999 \
--threads 16 \
--threads-batch 28 \
--warmup-batch \
-n 128
```
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 8192 | 128 | 0 | 14.510 | 564.57 | 6.638 | 19.28 |
| 8192 | 128 | 8192 | 18.004 | 455.01 | 7.110 | 18.00 |
| 8192 | 128 | 16384 | 21.818 | 375.46 | 7.197 | 17.78 |
| 8192 | 128 | 24576 | 25.883 | 316.50 | 7.431 | 17.23 |
| 8192 | 128 | 32768 | 29.354 | 279.08 | 7.647 | 16.74 |
| 8192 | 128 | 40960 | 33.382 | 245.40 | 7.853 | 16.30 |
| 8192 | 128 | 49152 | 36.811 | 222.54 | 8.093 | 15.82 |
| 8192 | 128 | 57344 | 40.593 | 201.81 | 8.362 | 15.31 |
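For anyone skimming the tables: the throughput columns are simply the token counts divided by the wall times (S_PP = PP / T_PP, S_TG = TG / T_TG). A quick sanity check against the first row:

```bash
# Recompute the first row's throughput from the raw times:
awk 'BEGIN { printf "S_PP = %.2f t/s, S_TG = %.2f t/s\n", 8192/14.510, 128/6.638 }'
# -> S_PP = 564.58 t/s, S_TG = 19.28 t/s (matches the table up to rounding)
```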
80K @ q8_0
```bash
./llama-sweep-bench \
--model "$MODEL_PATH" \
--no-mmap --merge-qkv \
-mla 3 -amb 512 \
-b 4096 -ub 4096 \
-ctk q8_0 -ctv q8_0 -c 81920 \
-ngl 999 --n-cpu-moe 999 \
--threads 16 \
--threads-batch 28 \
--warmup-batch \
-n 128
```
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 128 | 0 | 10.241 | 399.95 | 6.581 | 19.45 |
| 4096 | 128 | 8192 | 12.155 | 336.99 | 6.945 | 18.43 |
| 4096 | 128 | 16384 | 13.839 | 295.98 | 7.160 | 17.88 |
| 4096 | 128 | 24576 | 15.226 | 269.02 | 7.401 | 17.30 |
| 4096 | 128 | 32768 | 16.732 | 244.81 | 7.622 | 16.79 |
| 4096 | 128 | 40960 | 18.418 | 222.39 | 7.875 | 16.25 |
| 4096 | 128 | 49152 | 20.205 | 202.72 | 8.078 | 15.85 |
| 4096 | 128 | 57344 | 22.105 | 185.30 | 8.321 | 15.38 |
| 4096 | 128 | 65536 | 24.047 | 170.33 | 8.601 | 14.88 |
| 4096 | 128 | 73728 | 26.148 | 156.65 | 8.890 | 14.40 |
32K @ f16 on Unsloth's UD-Q6_K_XL (for the curious)
```bash
./llama-sweep-bench \
--model "$MODEL_PATH" \
--no-mmap --merge-qkv \
-mla 3 -amb 512 \
-b 4096 -ub 4096 \
-ctk f16 -ctv f16 -c 32768 \
-ngl 999 --n-cpu-moe 999 \
--threads 16 \
--threads-batch 28 \
--warmup-batch \
-n 128
```
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 128 | 0 | 17.417 | 235.18 | 8.766 | 14.60 |
| 4096 | 128 | 4096 | 18.230 | 224.69 | 9.022 | 14.19 |
| 4096 | 128 | 8192 | 19.344 | 211.75 | 9.205 | 13.91 |
| 4096 | 128 | 12288 | 20.042 | 204.37 | 9.182 | 13.94 |
| 4096 | 128 | 16384 | 20.868 | 196.28 | 9.405 | 13.61 |
| 4096 | 128 | 20480 | 21.375 | 191.63 | 9.406 | 13.61 |
| 4096 | 128 | 24576 | 22.075 | 185.55 | 9.607 | 13.32 |
| 4096 | 128 | 28672 | 22.766 | 179.92 | 9.608 | 13.32 |
The speeds seem fine to me, but GLM needs a lot of VRAM for context, so a larger GPU would certainly help.
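To get a feel for why context is so expensive, here is a back-of-the-envelope KV-cache estimate. This is only a sketch assuming a plain GQA-style cache with made-up layer/head/dim values, not GLM's real ones; MLA (as enabled by -mla 3 above) caches compressed latents instead, so actual usage will differ:

```bash
# q8_0 packs 32 elements into 34 bytes -> ~1.0625 bytes per element.
# Plain GQA KV cache: 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes/elem.
layers=92; kv_heads=8; head_dim=128; ctx=65536; bpe=1.0625
awk -v l="$layers" -v h="$kv_heads" -v d="$head_dim" -v c="$ctx" -v b="$bpe" \
  'BEGIN { printf "KV cache ~ %.1f GiB\n", 2*l*h*d*c*b / 1073741824 }'
```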
Thank you for the quants @ubergarm, as always. I would love to see some slightly larger ones, too ;).
> --n-cpu-moe 999
Oh interesting, does this end up being the same as --cpu-moe, aka -ot exps=CPU? So you're not offloading any additional routed experts, I guess? Just curious, as I'm still trying to figure out the best way to run this big A40B beast; one guy suggested that offloading more routed experts to the GPU didn't help, or even made it slower.
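My rough mental model of how the three spellings relate, as a sketch (I haven't verified the exact tensor regexes each one expands to):

```bash
# All three aim to keep the routed-expert weights in system RAM
# (appended to a normal llama invocation):
--cpu-moe          # keep all MoE expert tensors on CPU
--n-cpu-moe 999    # keep the experts of the first 999 layers on CPU;
                   # with 999 >= n_layers this should behave like --cpu-moe
-ot exps=CPU       # manual override: tensors whose names match "exps" go to CPU
```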
Thanks for the great comparisons. Personally, I've found GLM-5 to be a pretty capable model, and it hasn't had the same tool-call parameter-order issues as qwen35 with opencode (though I saw a PR for that on ik recently).
Yes, I've had some requests for larger quants; we'll see. I'm being a bit cautious with my public quota, unfortunately, hah.
Glad your PSU is back up and thanks for sharing your results!!
> --n-cpu-moe 999
>
> Oh interesting, does this end up being the same as --cpu-moe, aka -ot exps=CPU? So you're not offloading any additional routed experts, I guess? Just curious, as I'm still trying to figure out the best way to run this big A40B beast; one guy suggested that offloading more routed experts to the GPU didn't help, or even made it slower.
I've been out of the loop for a while, so I am still a bit lost in all those new convenience params. Yes, I definitely meant it as a replacement for -ot exps=CPU, and I guess that is more or less what it did. I noticed your post here and took it as a base; I just needed to get rid of the dual-GPU setup. For some reason -ot exps=CPU was failing with "out of memory", so I researched a bit and ended up with this. :)
I also noticed you had -ger there, which I thought was only for Ring/Ling models, but maybe that changed, too.
Generally speaking, on this low-VRAM card what I want is to leave as much VRAM as possible for the KV cache, so I am happy to have all experts on the CPU.
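If I ever had VRAM to spare, my understanding is the usual trick would be to pin a few layers' experts to the GPU and leave the rest on CPU, something like the sketch below (the layer range and regex are illustrative, and exact tensor names depend on the model):

```bash
# As far as I understand, the first matching -ot pattern wins, so the
# GPU exceptions must come before the catch-all CPU rule:
-ot "blk\.[0-3]\.ffn_.*_exps=CUDA0" \
-ot exps=CPU
```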
I haven't had a chance yet to play with Kimi-K2.5 or GLM-5 (not even GLM-4.7 or GLM-4.7 Flash).
Things are moving too fast for me :) - and now all the Qwens and MiniMax and DeepSeek V4 are around the corner...
I noticed tool calling still seems broken for K2.5 on ik_llama, so I am using mainline for that. Your GLM-5 seems to work fine, though!
> -ger there, which I thought was only for Ring/Ling models
Ahh, pretty sure you are right. I just leave it on most things now because I wasn't 100% sure and didn't go back to read the PRs, lol. I'll likely remove it then, as it is not used much.
> For some reason -ot exps=CPU was failing with "out of memory"
With some models I've had a more difficult time getting them to load without OOMing in one place or another, depending on whether I'm using -sm layer or the newer -sm graph, etc. But yeah, I think --cpu-moe is doing more or less the same thing, though it maybe won't match some small vectors with similar regex names; I'm not 100% sure myself :)
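For completeness, the two split modes I mean, as bare flags (the -ts ratio is made up, just to show the shape):

```bash
-sm layer -ts 1,1,1   # classic per-layer split across three GPUs
-sm graph             # the newer graph-level split
```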
> Things are moving too fast for me :)
Omg, you're not kidding; it feels like all my quick-start commands are out of date a few days later, haha... it really is drinking from a fire hose. We'll see if DeepSeek V4 comes out soon too.. my HF public quota won't be able to keep up forever, lol.
Anyway, I always appreciate sharing and learning with you! Cheers!
I was having trouble properly splitting the layers across my GPUs, and as a result was getting slower-than-expected token generation and overall performance (around 7 t/s generation). I wouldn't mind playing with this model more in opencode if I had at least double that, which is about the speed you saw in your sweep bench.
Just wanted to say thanks for posting the launch command; I referenced it against my old command and was able to bring it up to around 13 t/s generation on my 1x 4090, 3x 3090 Intel QYFS system!
