Some questions about the inference benchmark

#8
by hebangwen - opened

The speeds in the image and in the table are different. The image says the prefill throughput is 244 tok/s, while the table says it is 335 tok/s. The memory figures also differ considerably. I suppose the context length in the image (4k) is higher than the context length in the table (1k ~ 2k). It would be nice if you could share more detailed information.


The image:

| Device | Model | Prefill (tokens/s) | Decode (tokens/s) | Memory (MB) |
|---|---|---|---|---|
| AMD Ryzen AI Max 395 | LFM2.5-1.2B | 5,049 | 239 | 896 |
| AMD Ryzen AI Max 395 | Granite-4.0-h-1b | 3,994 | 146 | 1,129 |
| AMD Ryzen AI Max 395 | Qwen3-1.7B | 3,092 | 141 | 1,804 |
| Qualcomm Snapdragon Gen4 (Samsung Galaxy S25 Ultra) | LFM2.5-1.2B | 244 | 71 | 799 |
| Qualcomm Snapdragon Gen4 (Samsung Galaxy S25 Ultra) | Granite-4.0-h-1b | 195 | 47 | 1,055 |
| Qualcomm Snapdragon Gen4 (Samsung Galaxy S25 Ultra) | Qwen3-1.7B | 104 | 42 | 1,985 |
The table:

| Device | Inference Framework | Model | Prefill (tok/s) | Decode (tok/s) | Memory |
|---|---|---|---|---|---|
| Qualcomm Snapdragon® X Elite NPU | NexaML | LFM2.5-1.2B-Instruct | 2591 | 63 | 0.9 GB |
| Qualcomm Snapdragon® Gen4 (ROG Phone9 Pro) NPU | NexaML | LFM2.5-1.2B-Instruct | 4391 | 82 | 0.9 GB |
| Qualcomm Snapdragon® Gen4 (Samsung Galaxy S25 Ultra) CPU | llama.cpp (Q4_0) | LFM2.5-1.2B-Instruct | 335 | 70 | 719 MB |
| Qualcomm Snapdragon® Gen4 (Samsung Galaxy S25 Ultra) CPU | llama.cpp (Q4_0) | Qwen3-1.7B | 181 | 40 | 1306 MB |
Liquid AI org

Hi, yes indeed, the table uses 1K prefill and 100 decode tokens (you can see it in the blog post). I'm updating the model card to reflect this here as well.
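
If you want to check the llama.cpp rows under the same settings, here is a minimal sketch. It assumes a local llama.cpp build with the `llama-bench` tool on your PATH and a Q4_0 GGUF of LFM2.5-1.2B-Instruct at the path shown (both the binary location and the model path are assumptions, not something from the model card):

```python
# Minimal sketch: run llama.cpp's llama-bench with the table's settings
# (1K prefill tokens, 100 decode tokens). The binary and model path below
# are assumptions; adjust them to your own setup.
import subprocess

MODEL_PATH = "models/LFM2.5-1.2B-Instruct-Q4_0.gguf"  # hypothetical path

# -p sets the prompt (prefill) length, -n the number of generated (decode) tokens.
result = subprocess.run(
    ["llama-bench", "-m", MODEL_PATH, "-p", "1024", "-n", "100"],
    capture_output=True,
    text=True,
    check=True,
)

# llama-bench prints one row per configuration with prefill and decode tok/s.
print(result.stdout)
```

Throughput reported this way will still vary with thread count, thermals, and quantization, so device-to-device numbers are only comparable when those settings match.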

hebangwen changed discussion status to closed
