Report: getting 20 t/s with UD-Q4_K_XL and 72GB VRAM

#2
by SlavikF - opened

Thank you for publishing quants!

My system:

  • Intel Xeon W5-3425 (12c/24t)
  • 256GB DDR5-4800 (8 channels)
  • RTX 4090D 48GB
  • RTX 3090 24GB

llama.cpp:

  Device 0: NVIDIA GeForce RTX 4090 D, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7542 (af3be131c) with GNU 11.4.0 for Linux x86_64
system info: n_threads = 12, n_threads_batch = 12, total_threads = 12

Getting speed:

prompt eval time =   29609.83 ms /  1987 tokens (  ...   67.11 tokens per second)
       eval time =   91223.57 ms /  1829 tokens (   ...  20.05 tokens per second)

Obviously the speed is lower on larger context.
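
If anyone wants to reproduce this more systematically, llama.cpp ships a llama-bench tool that measures PP and TG separately. A rough sketch (the model path is a placeholder and the token counts are arbitrary - check ./llama-bench --help for the exact flags on your build):

  # -p = prompt tokens processed (PP), -n = tokens generated (TG), -ngl = layers offloaded to GPU
  ./llama-bench -m /path/to/MiniMax-M2-UD-Q4_K_XL.gguf -p 2048 -n 128 -ngl 99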

I'm running it in router mode and using model.ini settings:

[local-minimax230b]
temp=1.0
top-p=0.95
top-k=40
ctx-size=65536

Works great!
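
For reference, requests go through llama-server's standard OpenAI-compatible API; a minimal example would look roughly like this (host/port and the prompt are just placeholders, the model name is the alias from model.ini above):

  curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "local-minimax230b", "messages": [{"role": "user", "content": "Hello"}]}'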

Unsloth AI org

Oh very nice!

Would love to see what kind of speed you get with 100k of context. With the previous M2 I got 80 tokens per second on a simple prompt, but it slows down considerably over 100k - on my rig at least.

That's normal for transformer models :)
Linear (hybrid) attention models like qwen3-next or kimi-linear slow down less with long context.
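
Rough numbers behind that: with full attention every new token attends over the entire KV cache, so per-token decode cost scales roughly as O(n·d) in context length n (and the cache itself grows with n), while a linear/hybrid layer keeps a fixed-size state, so its per-token cost stays around O(d²) no matter how long the context gets.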

Tested with 16k context:

  • PP stays over 60 t/s
  • TG goes down to 10 t/s

Tried higher context, but requests with larger context fail:

srv operator(): http client error: Failed to read connection

Looks like that's because of the new llama.cpp "--fit" flag. Need to sort it out...

124k of context in opencode and I'm getting 22 t/s. Not bad. 3x RTX PRO 6000 Blackwell. Maybe I can improve it with a 4th GPU.

Hello.
My numbers from a homelab with two nodes built from scrap parts - maybe this info is useful for someone looking at secondhand components.
MiniMax-M2.1-UD-Q4_K_XL

Running params:
./llama-server -m /mnt/nvme1/llms/MiniMax-M2.1-UD-Q4_K_XL-00001-of-00003.gguf -c 16000 -fa on --host 0.0.0.0 --port 8080 --jinja --threads 36 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0 --presence-penalty 0.25 --no-context-shift -b 1000 -ub 1000 --no-mmap -fit off -ngl 99 -ctk q8_0 -ctv q8_0 --rpc 192.168.10.102:50052 -ts 8,8,8,15,15,8,0,0,8,15,15,15,17
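
For the --rpc part: the second node runs llama.cpp's rpc-server so its GPUs show up as extra devices in the -ts list. On that box it's started roughly like this (flag names can differ between builds, check ./rpc-server --help):

  ./rpc-server --host 0.0.0.0 --port 50052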

PP:
115 t/s at 2K
108.83 t/s at 6.2K

TG:
18 t/s at near-zero context
15 t/s at 2.5K
13.5 t/s at 6.8K
13.2 t/s at 8K
13.0 t/s at 9K
12.8 t/s at 10K

Latest llama.cpp + llama.cpp-rpc
2.5G LAN

Server1
i7-7800X (LGA2066) + 64GB DDR4 3200 (4 channels)
7800XT 16G
7800XT 16G
Mi50 16G
Mi50 16G

Server2
2x E5-2697V4 on Supermicro X10-DRG + 256GB DDR4 2400 (4 channels)
Nvidia-P100 16G
Nvidia-P100 16G
CMP 90HX 10G
CMP 50HX 10G
CMP 50HX 10G
P102-100 10G

PCIe switch, 1x x1 -> 4x x1 (this part ruins PP. DO NOT USE THAT)
P102-100 10G
P102-100 10G
P102-100 10G (disabled by -ts)
P102-100 10G (disabled by -ts)
