Performance report with 2 GPUs: 85 t/s
#1
by SlavikF - opened
System:
- Intel Xeon W5-3425 with 256 GB of DDR5-4800
- Nvidia RTX 4090D 48GB VRAM
- Nvidia RTX 3090 24GB VRAM
- Ubuntu 24
Running with Docker Compose:

```yaml
services:
  qwen3coder:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda12-b7917
    container_name: qwen3coder
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    ports:
      - "8080:8080"
    volumes:
      - /home/slavik/.cache/llama.cpp/router/local-qwen3-coder80b:/root/.cache/llama.cpp
    entrypoint: ["./llama-server"]
    command: >
      --model /root/.cache/llama.cpp/Qwen_Qwen3-Coder-Next-GGUF_Qwen3-Coder-Next-Q5_K_M_Qwen3-Coder-Next-Q5_K_M-00001-of-00004.gguf
      --alias local-qwen3-coder80b
      --host 0.0.0.0 --port 8080
      --ctx-size 262144
      --parallel 2 --kv-unified
      --top-p 0.95 --top-k 40 --temp 1.0 --min-p 0.01
```
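Once the container is up, llama-server exposes an OpenAI-compatible API on port 8080. A minimal client sketch (the model name matches the `--alias` flag above; the prompt and the use of `urllib` are just illustrative assumptions):

```python
import json
import urllib.request

# Build an OpenAI-style chat completion request for the local llama-server.
# "local-qwen3-coder80b" matches the --alias flag in the compose file above.
payload = {
    "model": "local-qwen3-coder80b",
    "messages": [{"role": "user", "content": "Write hello world in Go."}],
    "temperature": 1.0,  # same sampling temperature as the server command line
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment with the server running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```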
Everything fits in VRAM.
A few relevant log lines:

```
[42849] llama_params_fit_impl: projected to use 65578 MiB of device memory vs. 71856 MiB of free device memory
[42849] llama_params_fit: successfully fit params to free device memory
[42849] llama_context: pipeline parallelism enabled
[42849] sched_reserve: Flash Attention was auto, set to enabled
[42849] no implementations specified for speculative decoding
[42849] slot load_model: id 0 | task -1 | speculative decoding context not initialized
prompt eval time = 2403.32 ms /  4103 tokens ( 1707.22 tokens per second)
       eval time = 10383.54 ms /  885 tokens (   85.23 tokens per second)
```
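The throughput figures in those log lines are easy to sanity-check from the raw times and token counts (numbers copied from the log above):

```python
# Verify the tokens-per-second figures reported by llama-server.
prompt_ms, prompt_tokens = 2403.32, 4103   # "prompt eval time" line
eval_ms, eval_tokens = 10383.54, 885       # "eval time" (decode) line

prompt_tps = prompt_tokens / (prompt_ms / 1000)  # ~1707.22 t/s
eval_tps = eval_tokens / (eval_ms / 1000)        # ~85.23 t/s

print(f"prompt: {prompt_tps:.2f} t/s, decode: {eval_tps:.2f} t/s")
```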
Slavik, have you seen DDR prices? Where did a rig like that come from?
SlavikF changed discussion title from Tools not working. Performance report with 2 GPUs: 85 t/s to Performance report with 2 GPUs: 85 t/s