Quick bench for IQ2_KS on 1 GPU

#3
by curiouspp8 - opened

Thank you once again for being on top of the new model!

Basic opencode stuff works. Haven't done any real work with it yet. Nice option to have since it fits on one RTX 6000 Pro (the IQ3_XXS almost does). Base VRAM use was 88 GB, but context seems pretty VRAM-heavy: with 40k context loaded it hit 95.3 GB.

IQ2_KS used 86.2 GB of VRAM during the test with a 120k context size. Overall it's in line with vLLM without concurrency, but it's weird to see such a big TG drop as context increases. Not sure if that's expected in ik_llama.
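For anyone budgeting context VRAM: with standard attention the KV cache grows linearly with context, roughly 2 × n_layers × n_kv_heads × head_dim × bytes_per_element per token (K plus V). A toy estimator; the model dimensions below are placeholders for illustration, not MiniMax's actual config, and MiniMax's attention layout may differ:

```python
# Rough KV-cache VRAM estimator. The dimensions used in the example are
# PLACEHOLDER values, not MiniMax-M2's real architecture.
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elt=2.0):
    """Bytes for the K + V caches at a given context length.

    bytes_per_elt: 2.0 for f16; quantized cache types (q8_0, q6_0, ...)
    are smaller per element but carry per-block scale overhead.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elt
    return n_ctx * per_token

# Example: hypothetical 60-layer model, 8 KV heads, head_dim 128, f16 cache
gib = kv_cache_bytes(40_960, 60, 8, 128) / (1 << 30)
print(f"{gib:.1f} GiB")  # 9.4 GiB at 40k context for these made-up dims
```

That linear growth is why the jump from 88 GB base to 95.3 GB at 40k context is plausible, and why quantizing the cache (-ctk/-ctv) buys so much headroom.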

+---------------+---------+--------+
| Prefilled     | PP@4096 | TG@512 |
+---------------+---------+--------+
|             0 |  4558.0 | 103.50 |
|            4K |  3946.5 |  90.62 |
|           16K |  3371.0 |  70.50 |
|           32K |  2700.1 |  48.32 |
|           64K |  1974.8 |  28.96 |
+---------------+---------+--------+
|   TTFR (ms) 0 |     881 |      - |
|  TTFR (ms) 4K |    2032 |      - |
| TTFR (ms) 16K |    5975 |      - |
| TTFR (ms) 32K |   13456 |      - |
| TTFR (ms) 64K |   34677 |      - |
+---------------+---------+--------+

  TG Peak (burst): 107.00 94.00 74.00 51.00 31.00
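To quantify that drop-off, here's a quick sketch computing the TG slowdown relative to empty context, using the numbers from the sweep above:

```python
# TG (tokens/s) from the sweep above, keyed by prefilled context depth.
tg = {0: 103.50, 4_096: 90.62, 16_384: 70.50, 32_768: 48.32, 65_536: 28.96}

base = tg[0]
for depth, rate in tg.items():
    drop = 100 * (1 - rate / base)
    print(f"{depth:>6} prefilled: {rate:6.2f} t/s ({drop:4.1f}% below empty-context TG)")
```

That works out to roughly a 72% TG drop by 64k prefilled, falling off nearly linearly with depth, which is about what you'd expect if per-token attention over the full KV cache dominates generation cost.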

@curiouspp8

Great, thanks for the quick test. I'm currently trying to figure out how to run some kind of HumanEval test to decide which will be my daily driver for 96GB VRAM:

  • MiniMax-M2.7-GGUF IQ2_KS 69.800 GiB (2.622 BPW) fits ~160k quantized kv-cache
  • Qwen3.5-122B-A10B-GGUF IQ5_KS 77.341 GiB (5.441 BPW) fits ~256k unquantized kv-cache + mmproj support

If you're following along here, I just figured out one more small thing to get -vhad working now too: https://github.com/ikawrakow/ik_llama.cpp/pull/1625#issuecomment-4232579356

On a single GPU you don't need -sm graph, so after applying the above branch+patch and rebuilding ik_llama.cpp you can run with:

./build/bin/llama-server \
  --model "$model" \
  --alias ubergarm/MiniMax-M2.7 \
  -c 163840 \
  -khad -ctk q8_0 -vhad -ctv q6_0 \
  --merge-qkv \
  -muge \
  -ngl 999 \
  -ub 1024 -b 2048 \
  --threads 1 \
  --host 127.0.0.1 \
  --port 8080 \
  --jinja \
  --no-mmap \
  --spec-type ngram-map-k4v --spec-ngram-size-n 8 --draft-min 1 --draft-max 16 --draft-p-min 0.4 \
  --cache-ram 32768 \
  --prompt-cache-all

If you don't have 32GB of RAM free, drop --cache-ram to whatever you want, e.g. 8192 for 8GiB, etc...

you can probably squeeze out some more PP speed by increasing to -ub 2048 -b 2048, but you might need to reduce context length... fiddle with it and find what you like.

i'll eventually run some llama-sweep-bench and look into that drop-off issue...

I would highly recommend spending more time with minimax. 2.7 especially seems to be a very solid update (based on my personal usage so far).
What kind of impact does -vhad have? Haven't encountered this one before.

I would highly recommend spending more time with minimax.

yeah, initial vibes are that it seems pretty good for some tasks, works well in opencode so far...

but, for now i ran all 164 humaneval questions against both models:

  • MiniMax-M2.7-GGUF IQ2_KS 69.800 GiB (2.622 BPW) fits ~160k quantized kv-cache
    • humaneval pass@1 (base) 0.220 taking 32m48s
  • Qwen3.5-122B-A10B-GGUF IQ5_KS 77.341 GiB (5.441 BPW) fits ~256k unquantized kv-cache + mmproj support
    • humaneval pass@1 (base) 0.494 taking 31m20s

assuming my vibecoded EvalPlus client was actually doing the right thing then Qwen3.5-122B is looking better so far...
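For anyone wanting to sanity-check the scoring: with a single sample per task, pass@1 is just the fraction of tasks solved, but the unbiased pass@k estimator from the HumanEval paper generalizes to n > 1 samples per task. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn per task, c correct.

    Probability that at least one of k randomly chosen samples
    (out of the n drawn) is correct: 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per task, pass@1 reduces to the solved fraction:
# e.g. 81 of 164 tasks solved would give the 0.494 reported above.
print(round(81 / 164, 3))  # 0.494
```

Averaging pass_at_k over all 164 tasks gives the benchmark score; with n=1 it's exactly the solved-task fraction.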

What kinda impact does -vhad have? Haven't encountered this one before.

it's new; ik added it after all the "turboquant" hype... it can help if you're quantizing the V cache.

details here: https://github.com/ikawrakow/ik_llama.cpp/pull/1527
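For context, my understanding is that the "had" in -khad/-vhad stands for Hadamard: the cache values get rotated by a Hadamard transform before quantization, which spreads outlier values across dimensions so blockwise quantization loses less precision (see the PR for details). A toy numpy sketch of the transform itself, not ik_llama.cpp's actual kernel:

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Fast Walsh-Hadamard transform (length must be a power of 2)."""
    x = x.astype(np.float64).copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b          # butterfly: sums
            x[i + h:i + 2 * h] = a - b  # butterfly: differences
        h *= 2
    return x

# The normalized transform is its own inverse: H(H(x)) / n == x,
# so the rotation is losslessly undone after dequantization.
v = np.array([3.0, -1.0, 0.5, 7.0])
assert np.allclose(fwht(fwht(v)) / len(v), v)

# An outlier concentrated in one dimension gets spread across all of them,
# shrinking the dynamic range a blockwise quantizer has to cover:
spike = np.array([100.0, 0.1, 0.2, -0.1, 0.0, 0.3, -0.2, 0.1])
rotated = fwht(spike) / np.sqrt(len(spike))
print(spike.max() - spike.min(), rotated.max() - rotated.min())
```

This is why it matters more the harder you quantize the cache (e.g. -ctv q6_0): the rotation flattens outliers that would otherwise blow up the per-block quantization error.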

Another reason I'll likely stick with Qwen3.5-122B for now on 96GB VRAM:

(chart: sweep-bench-Qwen3.5-vs-MiniMax-2.7)
