llama.cpp inference - 20 times (!) slower than OSS 20 on an RTX 5090

#12
by cmp-nct - opened

Is llama.cpp inference properly supported?
I ran a test on a 5090, using the Q4 model.

GPT-OSS-20B (also an A3B-class model, i.e. about 3B active parameters) generates anywhere from 200-400 tokens/sec.
GLM-4.7-flash runs at only 40 tokens/sec with an empty context, and at 15k context it drops below 10 tokens/sec.
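For anyone wanting to reproduce these numbers, llama.cpp ships a `llama-bench` tool that reports prompt-processing and token-generation throughput. A minimal sketch, assuming a local build and with the model path as a placeholder:

```shell
# Measure prompt processing (-p) and token generation (-n) throughput.
# ./model.gguf is a placeholder path; -ngl 99 offloads all layers to the GPU.
./llama-bench -m ./model.gguf -p 512 -n 128 -ngl 99
```

Running the same command against a GPT-OSS-20B GGUF and the GLM GGUF on the same GPU makes the gap directly comparable.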

It is not yet supported in llama.cpp, but it will be very soon; watch: https://github.com/ggml-org/llama.cpp/pull/18936

@ddh0
I can see you are quite far along already, nice!
Did you already get good prefill and generation token speeds in your preliminary work?
My hopes are high for something qualitatively better than GPT-OSS (which also runs at 3B activated) while having similar performance.

Yes, the speed is great, as expected; comparable to an 8B dense model, I would say.

Did you force-enable flash attention? If so: llama.cpp doesn't support flash attention with this model (at least with CUDA), and performance will tank badly because attention falls back to the CPU.
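To rule this out, flash attention can be controlled explicitly from the command line. A hedged sketch (model path is a placeholder; the exact flag syntax depends on the llama.cpp build):

```shell
# Explicitly disable flash attention for this run.
# Recent llama.cpp builds accept -fa on|off|auto; older builds treat
# -fa as a bare on-toggle, so there simply omit the flag instead.
./llama-cli -m ./model.gguf -ngl 99 -fa off -p "Hello"
```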

You are right, theo.
Speed is better now, but there is still severe degradation.
It starts at 140 tokens/sec (OSS would be 350-400), is already down to 105 at 4000 tokens, and at 15k context only around 50 tokens/sec remain.
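
To put those numbers in perspective, a quick sketch of the degradation relative to the empty-context speed (the figures are the ones reported above):

```python
# Reported generation speed at several context depths: context tokens -> tokens/sec.
reported = {0: 140, 4000: 105, 15000: 50}

baseline = reported[0]
for ctx, tps in reported.items():
    # Fraction of the empty-context speed that remains at this depth.
    retained = tps / baseline
    print(f"ctx={ctx:>6}: {tps} tok/s ({retained:.0%} of empty-context speed)")
```

So by 15k context only about a third of the empty-context speed is left, on top of already starting well below GPT-OSS.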

I confirm

Relevant llama.cpp thread (please don't crowd the thread with comments, I'm just putting it here so people can follow along with the progress): https://github.com/ggml-org/llama.cpp/issues/18944

I got 97 tokens/sec on an RTX 3090 with llama.cpp updated; yesterday it worked the same for me.
