Performance report with 2 GPUs: 85 t/s

#1
by SlavikF - opened

System:

  • Intel Xeon W5-3425 with 256GB of DDR5-4800 MHz
  • Nvidia RTX 4090D 48GB VRAM
  • Nvidia RTX 3090 24GB VRAM
  • Ubuntu 24

Running with Docker Compose:

services:
  qwen3coder:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda12-b7917
    container_name: qwen3coder
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    ports:
      - "8080:8080"
    volumes:
      - /home/slavik/.cache/llama.cpp/router/local-qwen3-coder80b:/root/.cache/llama.cpp
    entrypoint: ["./llama-server"]
    command: >
      --model  /root/.cache/llama.cpp/Qwen_Qwen3-Coder-Next-GGUF_Qwen3-Coder-Next-Q5_K_M_Qwen3-Coder-Next-Q5_K_M-00001-of-00004.gguf
      --alias local-qwen3-coder80b
      --host 0.0.0.0  --port 8080 
      --ctx-size 262144
      --parallel 2  --kv-unified
      --top-p 0.95 --top-k 40 --temp 1.0 --min-p 0.01

Everything fits in VRAM.
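
Once the container is up, llama-server exposes an OpenAI-compatible API on the mapped port, and the --alias above is what goes in the model field. A minimal request sketch (endpoint path and payload shape per llama.cpp's server API; the prompt text is just an example):

```python
import json
import urllib.request

# Request payload for llama-server's OpenAI-compatible chat endpoint.
# "model" matches the --alias from the compose command above.
payload = {
    "model": "local-qwen3-coder80b",
    "messages": [{"role": "user", "content": "Write hello world in Go."}],
    "temperature": 1.0,
    "top_p": 0.95,
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = urllib.request.urlopen(req)  # uncomment with the server running
```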

A few relevant log lines:

[42849] llama_params_fit_impl: projected to use 65578 MiB of device memory vs. 71856 MiB of free device memory

[42849] llama_params_fit: successfully fit params to free device memory

[42849] llama_context: pipeline parallelism enabled

[42849] sched_reserve: Flash Attention was auto, set to enabled

[42849] no implementations specified for speculative decoding
[42849] slot   load_model: id  0 | task -1 | speculative decoding context not initialized

prompt eval time =    2403.32 ms /  4103 tokens (   1707.22 tokens per second)
       eval time =   10383.54 ms /   885 tokens (     85.23 tokens per second)
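
The reported rates follow directly from the raw timings; a quick sanity check recomputing tokens per second from the ms and token counts in the log:

```python
# Recompute throughput from the log's raw timings.
def tokens_per_second(ms: float, tokens: int) -> float:
    return tokens / (ms / 1000.0)

prompt_tps = tokens_per_second(2403.32, 4103)  # prompt eval
gen_tps = tokens_per_second(10383.54, 885)     # generation

print(round(prompt_tps, 2))  # -> 1707.22
print(round(gen_tps, 2))     # -> 85.23
```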

Slavik, have you seen DDR prices? Where did you get such a beast of a machine?

SlavikF changed discussion title from Tools not working. Performance report with 2 GPUs: 85 t/s to Performance report with 2 GPUs: 85 t/s
