Testing Q5 flavors (ubergarm / aessedai / unsloth) for "speed" on 8x RTX 3090

#10
by dehnhaide - opened

In scope:

MODEL: ubergarm/MiniMax-M2.7-IQ5_K
model type = 230B.A10B
model ftype = IQ5_K - 5.5 bpw
model params = 228.690 B
model size = 157.771 GiB (5.926 BPW)
repeating layers = 156.555 GiB (5.912 BPW, 227.461 B parameters)
general.name = MiniMax M2.7

MODEL: aessedai/MiniMax-M2.7-Q5_K_M
model type = 230B.A10B
model ftype = Q8_0
model params = 228.690 B
model size = 157.226 GiB (5.906 BPW)
repeating layers = 156.010 GiB (5.892 BPW, 227.461 B parameters)
general.name = MiniMax M2.7

MODEL: unsloth/MiniMax-M2.7-UD-Q5_K_XL
model type = 230B.A10B
model ftype = Q5_K - Medium
model params = 228.690 B
model size = 157.797 GiB (5.927 BPW)
repeating layers = 156.581 GiB (5.913 BPW, 227.461 B parameters)
general.name = Minimax-M2.7
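
For anyone wondering how the BPW figures above relate to the file sizes: the reported value appears to be simply total bits divided by parameter count. A minimal sketch, assuming the sizes are in GiB (2^30 bytes) and the parameter count is in units of 10^9; the function name `bits_per_weight` is mine, not something from llama.cpp:

```python
GIB = 2**30  # bytes per GiB (assumption: "GiB" in the log means 2^30 bytes)


def bits_per_weight(size_gib: float, params_billion: float) -> float:
    """Average number of bits stored per parameter."""
    total_bits = size_gib * GIB * 8
    return total_bits / (params_billion * 1e9)


# Sizes (GiB) copied from the model listings above; all three report 228.690 B params.
models = {
    "ubergarm IQ5_K": 157.771,
    "aessedai Q5_K_M": 157.226,
    "unsloth UD-Q5_K_XL": 157.797,
}

for name, size_gib in models.items():
    print(f"{name}: {bits_per_weight(size_gib, 228.690):.3f} BPW")
```

This should land on the 5.926 / 5.906 / 5.927 BPW values printed in the logs, so the three files are within about 0.4% of each other in size.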

Common launch command (version: 4416 (4945d3b7)):
/ik_llama.cpp/build/bin/llama-sweep-bench --model "model_name" -c 16384 -fa 1 --no-mmap -ngl 999 --jinja --seed 1976 -ctk q8_0 -khad -ctv q8_0 -vhad -muge -b 2048 -ub 512 -sm graph --threads 24 --parallel 1 --fit --fit-margin 3072 -ts 0.875,0.885,0.875,0.885,0.875,0.885,0.875,0.8

MODEL: ubergarm/MiniMax-M2.7-IQ5_K
main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 999, n_threads = 24, n_threads_batch = 24

PP    TG    N_KV    T_PP (s)    S_PP (t/s)    T_TG (s)    S_TG (t/s)
512 128 0 0.796 643.32 2.486 51.49
512 128 512 0.369 1388.78 2.629 48.68
512 128 1024 0.370 1384.23 2.663 48.07
512 128 1536 0.372 1377.91 2.642 48.45
512 128 2048 0.373 1372.47 2.642 48.45
512 128 2560 0.393 1304.17 2.650 48.30
512 128 3072 0.375 1363.74 2.625 48.76
512 128 3584 0.376 1363.38 2.611 49.03
512 128 4096 0.411 1246.88 2.744 46.65
512 128 4608 0.378 1353.56 2.779 46.06
512 128 5120 0.380 1346.52 2.773 46.16
512 128 5632 0.409 1252.25 2.771 46.19
512 128 6144 0.382 1340.22 2.767 46.26
512 128 6656 0.382 1339.58 2.845 44.99
512 128 7168 0.384 1334.34 2.817 45.44
512 128 7680 0.383 1336.05 2.897 44.19
512 128 8192 0.385 1329.24 2.826 45.30
512 128 8704 0.387 1324.24 2.804 45.65
512 128 9216 0.388 1321.28 2.875 44.52
512 128 9728 0.388 1320.50 2.843 45.03
512 128 10240 0.390 1312.12 2.910 43.98
512 128 10752 0.391 1310.44 2.888 44.32
512 128 11264 0.391 1307.79 2.936 43.59
512 128 11776 0.444 1152.42 2.939 43.56
512 128 12288 0.465 1101.35 2.860 44.76
512 128 12800 0.394 1299.20 2.930 43.68
512 128 13312 0.397 1290.64 2.980 42.95
512 128 13824 0.399 1283.33 2.978 42.98
512 128 14336 0.397 1290.10 2.938 43.57
512 128 14848 0.399 1282.03 2.945 43.47
512 128 15360 0.426 1201.08 2.970 43.10
512 128 15872 0.402 1274.47 2.955 43.31

MODEL: aessedai/MiniMax-M2.7-Q5_K_M
main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 999, n_threads = 24, n_threads_batch = 24

PP    TG    N_KV    T_PP (s)    S_PP (t/s)    T_TG (s)    S_TG (t/s)
512 128 0 0.569 900.06 2.131 60.06
512 128 512 0.353 1451.87 2.224 57.55
512 128 1024 0.369 1386.36 2.197 58.25
512 128 1536 0.356 1439.61 2.200 58.19
512 128 2048 0.377 1358.53 2.218 57.71
512 128 2560 0.370 1384.66 2.270 56.39
512 128 3072 0.361 1417.56 2.291 55.87
512 128 3584 0.363 1411.63 2.315 55.30
512 128 4096 0.361 1419.19 2.387 53.62
512 128 4608 0.362 1414.17 2.363 54.17
512 128 5120 0.366 1397.51 2.360 54.24
512 128 5632 0.366 1399.66 2.410 53.11
512 128 6144 0.370 1382.89 2.461 52.00
512 128 6656 0.367 1396.45 2.418 52.95
512 128 7168 0.369 1386.97 2.368 54.06
512 128 7680 0.369 1386.16 2.408 53.17
512 128 8192 0.370 1383.54 2.432 52.64
512 128 8704 0.370 1382.38 2.519 50.82
512 128 9216 0.372 1375.36 2.521 50.77
512 128 9728 0.373 1372.62 2.498 51.24
512 128 10240 0.376 1363.38 2.525 50.70
512 128 10752 0.375 1365.42 2.489 51.43
512 128 11264 0.377 1358.84 2.558 50.05
512 128 11776 0.376 1360.63 2.537 50.45
512 128 12288 0.378 1353.63 2.603 49.16
512 128 12800 0.378 1353.04 2.587 49.48
512 128 13312 0.381 1342.72 2.629 48.69
512 128 13824 0.383 1338.30 2.582 49.57
512 128 14336 0.413 1240.04 2.648 48.34
512 128 14848 0.389 1317.81 2.575 49.71
512 128 15360 0.388 1321.22 2.554 50.11
512 128 15872 0.386 1328.07 2.552 50.16

MODEL: unsloth/MiniMax-M2.7-UD-Q5_K_XL

main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 999, n_threads = 24, n_threads_batch = 24

PP    TG    N_KV    T_PP (s)    S_PP (t/s)    T_TG (s)    S_TG (t/s)
512 128 0 0.650 787.93 2.094 61.12
512 128 512 0.353 1451.45 2.243 57.07
512 128 1024 0.353 1449.03 2.212 57.88
512 128 1536 0.356 1440.19 2.228 57.44
512 128 2048 0.358 1432.16 2.272 56.33
512 128 2560 0.358 1428.59 2.272 56.34
512 128 3072 0.360 1420.65 2.237 57.23
512 128 3584 0.360 1423.10 2.283 56.07
512 128 4096 0.361 1419.62 2.356 54.33
512 128 4608 0.362 1415.34 2.350 54.48
512 128 5120 0.363 1411.33 2.323 55.09
512 128 5632 0.365 1401.12 2.381 53.75
512 128 6144 0.369 1388.57 2.404 53.23
512 128 6656 0.367 1394.91 2.368 54.06
512 128 7168 0.368 1392.53 2.353 54.40
512 128 7680 0.368 1390.67 2.433 52.60
512 128 8192 0.370 1385.52 2.463 51.97
512 128 8704 0.372 1377.16 2.446 52.33
512 128 9216 0.371 1378.97 2.533 50.54
512 128 9728 0.372 1374.74 2.514 50.92
512 128 10240 0.374 1368.52 2.565 49.91
512 128 10752 0.374 1369.33 2.475 51.73
512 128 11264 0.376 1363.03 2.505 51.09
512 128 11776 0.376 1363.24 2.499 51.22
512 128 12288 0.378 1355.07 2.534 50.51
512 128 12800 0.378 1353.47 2.540 50.38
512 128 13312 0.380 1349.10 2.542 50.36
512 128 13824 0.381 1344.73 2.554 50.12
512 128 14336 0.449 1139.19 2.498 51.25
512 128 14848 0.383 1337.16 2.597 49.29
512 128 15360 0.386 1326.97 2.554 50.11
512 128 15872 0.385 1329.90 2.545 50.30

Useless (non-belligerent) conclusion: for me the IQ quants continue to be (for reasons / explanations I must admit I don't master) still the slowest on 8X RTX3090 (with or without CPU / RAM offloading)
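
To put a number on the gap, something like the following could be used to average the TG column per model. It is only a sketch: `parse_sweep` and `mean_tg` are hypothetical helper names, and `EXCERPTS` holds just three rows per model (N_KV = 0, 8192, 15872) copied from the tables above, not the full sweep.

```python
from statistics import mean


def parse_sweep(text):
    """Parse llama-sweep-bench rows (whitespace- or pipe-separated) into dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.replace("|", " ").split()
        if len(fields) != 7:
            continue  # header, log line, or blank
        try:
            pp, tg, n_kv = (int(f) for f in fields[:3])
            t_pp, s_pp, t_tg, s_tg = (float(f) for f in fields[3:])
        except ValueError:
            continue  # e.g. a table separator line
        rows.append({"pp": pp, "tg": tg, "n_kv": n_kv, "t_pp": t_pp,
                     "s_pp": s_pp, "t_tg": t_tg, "s_tg": s_tg})
    return rows


def mean_tg(rows):
    """Mean token-generation speed (t/s) over the parsed rows."""
    return mean(r["s_tg"] for r in rows)


# Excerpts only: three rows per model taken from the tables above.
EXCERPTS = {
    "ubergarm IQ5_K": """
512 128 0 0.796 643.32 2.486 51.49
512 128 8192 0.385 1329.24 2.826 45.30
512 128 15872 0.402 1274.47 2.955 43.31
""",
    "aessedai Q5_K_M": """
512 128 0 0.569 900.06 2.131 60.06
512 128 8192 0.370 1383.54 2.432 52.64
512 128 15872 0.386 1328.07 2.552 50.16
""",
    "unsloth UD-Q5_K_XL": """
512 128 0 0.650 787.93 2.094 61.12
512 128 8192 0.370 1385.52 2.463 51.97
512 128 15872 0.385 1329.90 2.545 50.30
""",
}

means = {name: mean_tg(parse_sweep(text)) for name, text in EXCERPTS.items()}
fastest = max(means.values())
for name, value in means.items():
    print(f"{name}: {value:.2f} t/s TG ({value / fastest:.1%} of fastest)")
```

Pasting the full tables into `EXCERPTS` instead of the three-row samples gives the averages over the whole sweep.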

@dehnhaide

Thanks for doing some llama-sweep-bench speed benchmarks. Sounds like you're fully offloading onto the 8x 3090s?

/ik_llama.cpp/build/bin/llama-sweep-bench \
  --model "model_name" \
  -c 16384 \
  -fa 1 \
  --no-mmap \
  -ngl 999 \
  --jinja \
  --seed 1976 \
  -ctk q8_0 -khad -ctv q8_0 -vhad \
  -muge \
  -b 2048 -ub 512 \
  -sm graph \
  --threads 24 \
  --parallel 1 \
  --fit \
  --fit-margin 3072 -ts 0.875,0.885,0.875,0.885,0.875,0.885,0.875,0.8

A few tips:

  1. Use --threads 1 when doing full GPU offload to avoid extra synchronization latency from unused CPU threads (might give a 1-3% boost)
  2. For more PP speed, pump up the batch sizes if VRAM allows, e.g. -ub 2048 -b 2048 or even -ub 4096 -b 4096
  3. -khad -vhad can add some overhead and are not really needed for q8_0 (they likely help most at q6_0 and below), so try omitting them
  4. Add --warmup-batch to help fix the odd dip in the first batch
  5. Add -n 64 to generate fewer tokens per batch, so your tests run much faster
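
Putting those tips together, the adjusted command line could be assembled roughly like this. This is an untested sketch: `BASE` is the original command copied from above (including the `-ts` list exactly as posted), `set_option` and `apply_tips` are hypothetical helpers, and `-ub 2048` is just one of the suggested sizes, which may not fit in VRAM alongside `--fit`.

```python
import shlex

# Original launch command from the post, split into arguments.
BASE = [
    "/ik_llama.cpp/build/bin/llama-sweep-bench",
    "--model", "model_name",
    "-c", "16384", "-fa", "1", "--no-mmap", "-ngl", "999",
    "--jinja", "--seed", "1976",
    "-ctk", "q8_0", "-khad", "-ctv", "q8_0", "-vhad",
    "-muge", "-b", "2048", "-ub", "512", "-sm", "graph",
    "--threads", "24", "--parallel", "1",
    "--fit", "--fit-margin", "3072",
    # -ts value copied as posted
    "-ts", "0.875,0.885,0.875,0.885,0.875,0.885,0.875,0.8",
]


def set_option(argv, flag, value):
    """Return a copy of argv with the value following `flag` replaced."""
    out = list(argv)
    out[out.index(flag) + 1] = value
    return out


def apply_tips(argv):
    """Apply tips 1-5 from the list above to a copy of argv."""
    out = [a for a in argv if a not in ("-khad", "-vhad")]  # tip 3
    out = set_option(out, "--threads", "1")                 # tip 1
    out = set_option(out, "-b", "2048")                     # tip 2
    out = set_option(out, "-ub", "2048")                    # tip 2
    return out + ["--warmup-batch", "-n", "64"]             # tips 4 and 5


print(shlex.join(apply_tips(BASE)))
```

Changing one tip at a time and re-running the sweep makes it easier to see which of them actually moves the numbers on a given rig.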

"the IQ quants continue to be (for reasons / explanations I must admit I don't master) still the slowest on 8X RTX3090 (with or without CPU / RAM offloading)"

Yes, it is a trade-off: a more complex IQ quant can give better perplexity per BPW, but at the cost of extra compute, depending on the backend implementation.

The flags --merge-qkv and -grt bf16 can boost TPS (tokens per second) by ~20%. The -grt flag is particularly beneficial if your GPU is not running at PCIe x16 speed.
