llama.cpp inference - 20 times (!) slower than GPT-OSS-20B on an RTX 5090
Is llama.cpp inference properly supported for this model?
I ran a test on a 5090 using the Q4 quant.
GPT-OSS-20B (also an A3B model) gives anywhere from 200-400 tokens/sec generation speed.
GLM-4.7-flash runs at only 40 tokens/sec with an empty context, and at 15k context it drops below 10 tokens/sec.
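For reference, this is roughly how I measure it - a minimal sketch, assuming a recent llama-bench build where -d sets the context depth before the timed generation; the model filename is a placeholder:

```
# Hypothetical filename; point -m at your actual Q4 GGUF.
# -n 128: generate 128 tokens per run; -d 0,15000: repeat the run with
# an empty context and with ~15k tokens already in the KV cache.
llama-bench -m glm-4.7-flash-Q4_K_M.gguf -n 128 -d 0,15000
```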
It is not yet supported in llama.cpp, but it will be very soon; watch: https://github.com/ggml-org/llama.cpp/pull/18936
Yes, the speed is great, as expected - comparable to an 8B dense model, I would say.
Did you force-enable flash attention? If so, that would explain it: llama.cpp doesn't support flash attention with this model (at least with CUDA), and performance will tank a lot because attention will be computed on the CPU.
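To rule that out, a hedged example (recent builds accept -fa on/off/auto, defaulting to auto, while older builds treat -fa as a plain on switch; the model filename is a placeholder):

```
# Don't force flash attention on: "auto" lets llama.cpp decide, and
# "off" disables it outright for models without FA support.
llama-cli -m glm-4.7-flash-Q4_K_M.gguf -fa off -p "hello" -n 64
```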
You are right, theo.
Speed is better now, but there is still severe degradation.
It starts at 140 tokens/sec (GPT-OSS would be 350-400); at 4,000 tokens of context it is already down to 105, and at 15k context only around 50 remain.
I confirm
Relevant llama.cpp thread (please don't crowd the thread with comments; I'm just putting it here so people can follow along with the progress): https://github.com/ggml-org/llama.cpp/issues/18944
I got 97 tokens/sec on an RTX 3090 with llama.cpp up to date; yesterday it was the same for me.