Some questions about the inference benchmark
#8
by
hebangwen
- opened
The speeds in the image and the table are different: the image reports a prefill throughput of 244 tok/s, while the table reports 335 tok/s. The memory figures also differ considerably. I suspect this is because the context length in the image (4k) is longer than the context length in the table (1k–2k). It would be nice if you could share more detailed information.
| Device | Model | Prefill (tokens/s) | Decode (tokens/s) | Memory (MB) |
|---|---|---|---|---|
| AMD Ryzen AI Max 395 | LFM2.5-1.2B | 5,049 | 239 | 896 |
| | Granite-4.0-h-1b | 3,994 | 146 | 1,129 |
| | Qwen3-1.7B | 3,092 | 141 | 1,804 |
| Qualcomm Snapdragon Gen4 (Samsung Galaxy S25 Ultra) | LFM2.5-1.2B | 244 | 71 | 799 |
| | Granite-4.0-h-1b | 195 | 47 | 1,055 |
| | Qwen3-1.7B | 104 | 42 | 1,985 |
| Device | Inference | Framework | Model | Prefill (tok/s) | Decode (tok/s) | Memory |
|---|---|---|---|---|---|---|
| Qualcomm Snapdragon® X Elite | NPU | NexaML | LFM2.5-1.2B-Instruct | 2591 | 63 | 0.9 GB |
| Qualcomm Snapdragon® Gen4 (ROG Phone9 Pro) | NPU | NexaML | LFM2.5-1.2B-Instruct | 4391 | 82 | 0.9 GB |
| Qualcomm Snapdragon® Gen4 (Samsung Galaxy S25 Ultra) | CPU | llama.cpp (Q4_0) | LFM2.5-1.2B-Instruct | 335 | 70 | 719 MB |
| Qualcomm Snapdragon® Gen4 (Samsung Galaxy S25 Ultra) | CPU | llama.cpp (Q4_0) | Qwen3-1.7B | 181 | 40 | 1306 MB |
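To illustrate why a longer context could explain the gap: prefill throughput is just prompt tokens divided by prefill wall time, so if per-token cost grows with context length, a 4k-token prompt reports a lower tok/s figure than a 1k-token one. A toy sketch with made-up timings (the seconds below are illustrative, not measured):

```python
def throughput(tokens: int, seconds: float) -> float:
    """Prefill throughput in tokens per second."""
    return tokens / seconds

# Hypothetical timings chosen only to show how the two reported
# figures (244 vs 335 tok/s) could both be correct at different
# context lengths:
print(round(throughput(4096, 16.8), 1))  # 4k-token prompt -> 243.8
print(round(throughput(1024, 3.06), 1))  # 1k-token prompt -> 334.6
```

Rerunning the benchmark at matched prompt lengths (and reporting the context length alongside each number) would settle which figure applies where.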
hebangwen changed discussion status to closed