Quick bench for smol-IQ3_KS on 2 GPUs

#4
by curiouspp8 - opened

Uses ~54GB on each GPU @ 120k context, with ~45.52GB of that for the base weights. Graph split mode on. ~60GB on each of the 2 GPUs with the full 204k context. KV cache q8/q6 for all runs.

+-----------+---------+--------+
| Prefilled | PP@4096 | TG@512 |
+-----------+---------+--------+
|         0 |  6158.3 | 108.60 |
|        4K |  5306.1 | 100.41 |
|       16K |  5065.6 |  83.55 |
|       32K |  4304.8 |  73.98 |
|       64K |  3410.7 |  51.13 |
+-----------+---------+--------+
|    TTFR 0 |     544 |      - |
|   TTFR 4K |    1355 |      - |
|  TTFR 16K |    3425 |      - |
|  TTFR 32K |    7471 |      - |
|  TTFR 64K |   17904 |      - |
+-----------+---------+--------+

  TG Peak (burst): 112.00 104.00 88.00 78.00 56.00
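For what it's worth, subtracting the empty-prompt TTFR (~544 ms of fixed overhead) from the other TTFR rows gives a rough effective average prefill throughput. This assumes the 4K/16K/32K/64K labels mean 4096/16384/32768/65536 tokens:

```python
# Rough sanity check on the TTFR rows above: net of the ~544 ms measured
# at 0 prefilled, each TTFR implies an average prefill speed over the
# whole prompt, which lands in the same ballpark as the PP@4096 readings.
ttfr_ms = {4096: 1355, 16384: 3425, 32768: 7471, 65536: 17904}
overhead_ms = 544  # TTFR at 0 prefilled

for tokens, ms in ttfr_ms.items():
    tps = tokens / ((ms - overhead_ms) / 1000)
    print(f"{tokens:6d} tok prompt -> ~{tps:6.0f} tok/s average prefill")
```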

Updated with speculative decoding on. Nice improvement in PP, but less so once you get to larger prompts.

+-----------+---------+--------+
| Prefilled | PP@4096 | TG@512 |
+-----------+---------+--------+
|         0 |  7117.8 | 110.04 |
|        4K |  6017.1 |  98.01 |
|       16K |  5547.8 |  85.27 |
|       32K |  4751.0 |  74.88 |
|       64K |  3690.3 |  52.15 |
+-----------+---------+--------+
|    TTFR 0 |     500 |      - |
|   TTFR 4K |    1215 |      - |
|  TTFR 16K |    3198 |      - |
|  TTFR 32K |    6736 |      - |
|  TTFR 64K |   16488 |      - |
+-----------+---------+--------+

  TG Peak (burst): 112.00 103.00 91.00 78.00 55.00

Are you using both -khad and -vhad on these tests? Oh yes, I think so, I see it now: https://github.com/ikawrakow/ik_llama.cpp/pull/1625

I haven't tried removing those or using an unquantized kv cache to see the effects. Generally for full GPU offload, unquantized f16 kv-cache can be faster for PP, I believe. But then you might not have enough VRAM for the full context size... tradeoffs!

so many knobs to tweak and benchmark haha...

Yep, used -khad without your patch. Then tried the patch yesterday: the model loaded, and that fixed inference on minimax graph 2 + -vhad + -mudge, but VRAM usage was identical to running without them. Not sure what I was doing wrong. I am rebuilding from main now to compare against the officially merged version and see if that actually compresses the KV.

So -khad -vhad won't change the size of the kv-cache; they just rotate the tensors before quantizing, which gives some quality boost. I don't know what -mudge is... oh, -muge should be used; it will give maybe a 10% boost in PP and a few percent in TG, probably. It is the equivalent of using a mainline pre-fused ffn_(up|gate)_exps quant.
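A toy illustration of why rotating before quantizing helps (pure Python sketch of the general idea, not ik_llama.cpp's actual kernels): an outlier-heavy vector forces a large absmax scale, so the small values get crushed to zero. A Hadamard rotation spreads the outlier across all components first, and since the transform is orthonormal you just rotate back after dequantizing.

```python
import math

def hadamard(n):
    """Sylvester-construction Hadamard matrix, orthonormal rows (n a power of 2)."""
    h = [[1.0]]
    while len(h) < n:
        h = [r + r for r in h] + [r + [-v for v in r] for r in h]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in h]

def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def quant_roundtrip(x, bits=4):
    """Symmetric absmax quantization to signed `bits`, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in x) / qmax
    return [round(v / scale) * scale for v in x]

def rmse(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)) / len(a))

x = [10.0, 0.6, -0.7, 0.5, -0.55, 0.65, -0.6, 0.7]  # one big outlier

direct = quant_roundtrip(x)
H = hadamard(len(x))
# this orthonormal Sylvester H is symmetric, so it is its own inverse
rotated = matvec(H, quant_roundtrip(matvec(H, x)))

print(f"direct  q4 RMSE: {rmse(x, direct):.3f}")
print(f"rotated q4 RMSE: {rmse(x, rotated):.3f}")
```

With the outlier present, direct 4-bit quantization rounds every small component to zero, while the rotated version recovers them with much lower error.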

Right, the question to answer with some speed benchmarks is how these setups compare:

  • -khad -ctk q8_0 -vhad -ctv q6_0
  • -khad -vhad
  • nothing, just leave it at the default f16, without Hadamard transforms on either the k or v cache.
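Something like this can enumerate those runs (just a sketch; llama-sweep-bench is ik_llama.cpp's sweep tool, and the model path and context size here are placeholders):

```python
# Sketch: generate the three A/B bench invocations listed above so each
# setup gets the same base flags. $MODEL and the context size are
# placeholders, not the exact values used in this thread.
SETUPS = {
    "q8/q6 + hadamard": "-khad -ctk q8_0 -vhad -ctv q6_0",
    "hadamard only":    "-khad -vhad",
    "default f16":      "",
}

def bench_cmd(flags, model="$MODEL", ctx=65536):
    base = f"./llama-sweep-bench -m {model} -c {ctx} -ngl 99 -sm graph"
    return f"{base} {flags}".rstrip()

for name, flags in SETUPS.items():
    print(f"# {name}\n{bench_cmd(flags)}")
```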

Ahh I see ik already explained it more here: https://github.com/ikawrakow/ik_llama.cpp/pull/1625#issuecomment-4237769371

Since you've been so helpful, I paused some workloads and just ran the full set of combinations. Note that those cards had some other stuff loaded into VRAM, but it was fully idle during the test. Just in case that impacts anything.

Defaults (GPUs at 90.6/94.9GB, just as a relative reference point)

+-----------+---------+--------+
| Prefilled | PP@4096 | TG@512 |
+-----------+---------+--------+
|         0 |  6577.2 | 116.56 |
|        4K |  4853.0 | 108.18 |
|       16K |  4786.6 |  91.15 |
|       32K |  3881.8 |  82.27 |
|       64K |  3179.0 |  69.71 |
+-----------+---------+--------+
|    TTFR 0 |     531 |      - |
|   TTFR 4K |    1481 |      - |
|  TTFR 16K |    3727 |      - |
|  TTFR 32K |    8241 |      - |
|  TTFR 64K |   19177 |      - |
+-----------+---------+--------+

  TG Peak (burst): 120.00 111.00 94.00 86.00 72.00

With -ctk q8_0 -ctv q6_0 (GPUs at 85.3/89.7GB)

+-----------+---------+--------+
| Prefilled | PP@4096 | TG@512 |
+-----------+---------+--------+
|         0 |  6716.3 | 110.31 |
|        4K |  5565.6 | 100.74 |
|       16K |  5207.9 |  85.93 |
|       32K |  4465.5 |  73.30 |
|       64K |  3486.1 |  55.18 |
+-----------+---------+--------+
|    TTFR 0 |     536 |      - |
|   TTFR 4K |    1286 |      - |
|  TTFR 16K |    3338 |      - |
|  TTFR 32K |    7280 |      - |
|  TTFR 64K |   17459 |      - |
+-----------+---------+--------+

  TG Peak (burst): 120.00 107.00 92.00 76.00 71.00

With -ctk q8_0 -ctv q6_0 -khad -vhad

+-----------+---------+--------+
| Prefilled | PP@4096 | TG@512 |
+-----------+---------+--------+
|         0 |  6417.0 | 111.95 |
|        4K |  5477.4 |  83.21 |
|       16K |  5257.7 |  78.56 |
|       32K |  4424.8 |  65.65 |
|       64K |  3504.7 |  48.56 |
+-----------+---------+--------+
|    TTFR 0 |     567 |      - |
|   TTFR 4K |    1283 |      - |
|  TTFR 16K |    3418 |      - |
|  TTFR 32K |    7262 |      - |
|  TTFR 64K |   17226 |      - |
+-----------+---------+--------+

  TG Peak (burst): 142.00 88.00 93.00 68.00 51.00

With -ctk q8_0 -ctv q6_0 -khad -vhad (GPUs at 83.8/88.2GB)

+-----------+---------+--------+
| Prefilled | PP@4096 | TG@512 |
+-----------+---------+--------+
|         0 |  6487.2 | 108.75 |
|        4K |  5869.8 |  93.81 |
|       16K |  5457.9 |  80.12 |
|       32K |  4625.9 |  66.09 |
|       64K |  3606.3 |  49.35 |
+-----------+---------+--------+
|    TTFR 0 |     549 |      - |
|   TTFR 4K |    1257 |      - |
|  TTFR 16K |    3310 |      - |
|  TTFR 32K |    7043 |      - |
|  TTFR 64K |   16950 |      - |
+-----------+---------+--------+

  TG Peak (burst): 130.00 97.00 89.00 70.00 59.00

Full final config

  "minimax-m2.7-q3":
    proxy: "http://127.0.0.1:8088"
    env:
      - "CUDA_VISIBLE_DEVICES=0,1"
    cmd: >
      /app/run-server.sh
      --model /models/models--ubergarm--MiniMax-M2.7-GGUF/snapshots/b39e25f035f93fbb15d52bcc5cc8081b717efe65/smol-IQ3_KS/MiniMax-M2.7-smol-IQ3_KS-00001-of-00003.gguf
      --port 8088
      --alias minimax-m2.7-q3
      --jinja
      -c 80000
      -sm graph
      --threads 1
      --n-gpu-layers 99
      -muge
      -ctk q4_0 -ctv q4_0 -khad -vhad
      --batch-size 4096
      --ubatch-size 2048
      --no-mmap
      --host 0.0.0.0
      -cram 20000
      --spec-type ngram-map-k4v --spec-ngram-size-n 8 --draft-min 1 --draft-max 16 --draft-p-min 0.4
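
The --spec-type ngram-map flags in that config draft tokens by looking the trailing n-gram up in the existing context. A toy sketch of the general idea, prompt-lookup style (illustrative only, not ik_llama.cpp's actual algorithm; the function name is made up):

```python
def ngram_draft(ctx, k=4, n=8):
    """Propose up to n draft tokens by matching the trailing k-gram
    of the context against its most recent earlier occurrence.
    The target model then verifies the draft in one batch, accepting
    the longest matching prefix."""
    if len(ctx) < k + 1:
        return []
    tail = ctx[-k:]
    # scan backwards from the most recent possible earlier match
    for i in range(len(ctx) - k - 1, -1, -1):
        if ctx[i:i + k] == tail:
            return ctx[i + k:i + k + n]
    return []  # no repeat found; fall back to normal decoding
```

This is why the speedup shrinks on some workloads: the draft only hits when the output repeats spans already present in the context.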

After a few tasks done using "-ctk q4_0 -ctv q4_0 -khad -vhad", I am surprised by the coherence; it generally doesn't feel like the typical degradation due to quantization. It mostly works but feels a bit lazier. Like, I asked it to do something for me, it went and researched how to do that in my project, then said "here is how you do this if you want to", whereas less quantized versions generally would also just do it.

I noticed that with weight quantization before. E.g. Kimi Q1 was 100% like that: very smart but very lazy. Q2 was less lazy. Q3 was about what you would expect. But that was weight quantization; with just ctk/ctv at Q8 I found it degraded on longer sessions in very obvious ways, becoming just dumb, or failing tool calls / looping etc.

This is not a very scientific comparison and it mixes a few things, but hopefully it is still useful, since this is not a bench but a real person doing real work with these models. With the very latest version of ik and those -khad -vhad optimizations I'd have to experiment more, and I am now more inclined to make at least Q8 KV the default for all models. q4 on minimax looks very promising so far, but it's too early to say. I desperately need VRAM, and a tiny tradeoff might be 100% worth it. Need to see how a long session with opencode holds up.

Yeah, it's all trade-offs! Thanks for the benchmarks; interesting that the price to pay for -khad -vhad shows up mostly in TG speeds.

Right, for MLA models (e.g. GLM-5.1, DeepSeek, Kimi-K2.5 etc., which use latent attention compression already by design) I try not to add extra kv-cache compression and go no lower than q8_0. On stuff like MiniMax it's fine to play around going lower, but personally I try not to go below -khad -ctk q6_0 -vhad -ctv q4_0, and instead stay above that and just control my prompts / restart the client frequently. You kind of have to look at the dimensions of the GQA and the attention style to know how much the architecture is already skimping to save on kv-cache memory. Qwen3.5 is already very efficient given the gated delta net attention stuff, so I usually leave it at full f16 since it's already "cheap".
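To put numbers on that tradeoff, here's a back-of-envelope kv-cache sizing sketch. The model dimensions below are a hypothetical GQA config (not MiniMax's or Qwen's real numbers), and the bits-per-weight figures assume llama.cpp-style block quants where each 32-value block carries a scale (8.5 / 6.5 / 4.5 bpw for q8_0 / q6_0 / q4_0):

```python
# Back-of-envelope KV-cache sizing. Dims are hypothetical, NOT any
# specific model; bpw values include per-block scale overhead.
def kv_bytes(ctx, n_layers, n_kv_heads, head_dim, bpw):
    # K and V each store n_kv_heads * head_dim values per layer per token
    return ctx * n_layers * n_kv_heads * head_dim * 2 * bpw / 8

GIB = 1024 ** 3
for name, bpw in [("f16", 16), ("q8_0", 8.5), ("q6_0", 6.5), ("q4_0", 4.5)]:
    gib = kv_bytes(131072, 60, 8, 128, bpw) / GIB
    print(f"{name:5s} kv-cache @ 128k ctx: {gib:5.2f} GiB")
```

The same formula also shows why MLA or aggressive GQA models have less to gain: they have already shrunk the n_kv_heads * head_dim factor architecturally.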

The idea of what is now known as the "Ralph Wiggum" loop is interesting. It is just a for loop that keeps restarting the client, with state persisted in local files / the git repo. This lets you keep max context lower, which is generally better than pushing past 200k imo.

Also check out this guy's speculative decoding settings; I'm still fooling around to dial those in on MiniMax and GLM-5.1: https://huggingface.co/zai-org/GLM-5.1/discussions/5#69dce5bc7d17d64f187cfa5f

Cheers!
