8x RTX 3090, EPYC 7532, 512GB RAM: Benchmarking MiniMax-M2.5-IQ4_NL for High-Speed Coding (35-75 t/s) with opencode

#10
by martossien - opened

Hello everyone,

Following my tests with ubergarm's GLM-4.7-355B-IQ5_K, I have now extensively tested MiniMax-M2.5-IQ4_NL. My primary use case is development with OpenCode (agentic coding workflow), requiring both large context (100k+) and high generation speed.

While MiniMax-M2.5 might not beat the top-tier coding models in pure reasoning benchmarks, its speed/quality ratio on this hardware is phenomenal for iterative development.

Generation Speed: 35 to 75 tokens/s
VRAM Usage: ~175 GB / 192 GB (leaving room for 2 simultaneous OpenCode sessions)
Context: Tested stable at 262k tokens context length

GPU: 8x NVIDIA RTX 3090 (24GB each, total 192GB VRAM)
Note: 2 cards are linked via NVLink, the others via PCIe
CPU: AMD EPYC 7532 (32 cores / 64 threads)
RAM: 512 GB DDR4 2933 MHz ECC
Software: ik_llama.cpp + NVIDIA Drivers 580.126.09

~/ik_llama.cpp/build/bin/llama-server
--model /path/to/MiniMax-M2.5-IQ4_NL-00001-of-00004.gguf
--alias MiniMax-M2.5-IQ4_NL
--host 0.0.0.0
--port 8080
--ctx-size 262144
--no-mmap
--threads 32
--threads-batch 64
--batch-size 2048
--ubatch-size 4096
--parallel 2
--flash-attn 1
--n-gpu-layers 999
--split-mode graph
--tensor-split 0.9,1,1,1,1,1,1,1
--numa distribute
--run-time-repack
-gr -ger
--merge-qkv
--cache-type-k q5_1
--cache-type-v q5_1
--k-cache-hadamard
--jinja
--chat-template-kwargs '{"enable_thinking": false}'

--chat-template-kwargs '{"enable_thinking": false}': Mandatory for OpenCode. Without this, you will get an "Assistant response prefill is incompatible with enable_thinking" error because OpenCode forces the start of the assistant's reply (prefill).
--parallel 2: Allows two OpenCode developers to work simultaneously (splitting the 262k context budget).
--tensor-split 0.9,...: Reduces the load on the first GPU (the server is otherwise headless, but this card drives 2 displays), preventing OOM from display tasks.
--cache-type-k/v q5_1: I'm not sure about this one, but it works (I previously used q4_0).
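Once the server is up, a quick smoke test before pointing OpenCode at it can confirm everything responds. A sketch, assuming the host/port above and llama-server's standard OpenAI-compatible endpoint; the model name is the `--alias` value:

```shell
# Smoke-test the OpenAI-compatible chat endpoint exposed by llama-server.
# Host, port, and model name match the --host/--port/--alias flags above.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "MiniMax-M2.5-IQ4_NL",
        "messages": [{"role": "user", "content": "Write a Python one-liner to reverse a string."}],
        "max_tokens": 128
      }'
```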

Huge thanks to ubergarm for providing these high-quality IQ quants! This makes running such massive MoE models locally not just possible, but incredibly fast.

MiniMax-M2.5-IQ4_NL

@martossien

Super, glad you're enjoying the quants on your huge rig!

A few tips given you are using full GPU offload:

  1. When using full GPU offload, always set -t 1 to minimize the overhead of synchronizing unused threads. Anecdotally this might give 1-3% more performance. So remove --threads 32 --threads-batch 64 unless you are explicitly leaving some layers on CPU/RAM.

  2. Your --batch-size 2048 --ubatch-size 4096 is strange. The defaults are -ub 512 -b 2048, and the microbatch -ub can never be larger than the logical batch -b. You can just use -ub 4096 -b 4096 most of the time for increased PP performance, at the cost of some latency on very small prompts and a little more VRAM for the compute buffer that holds the larger batch.

  3. Remove --numa distribute, as that only matters if you are running some tensors on CPU/RAM, and only if you also launched llama-server under numactl. It's not being used here, but it's nice to keep your commands clean for less confusion.

  4. Remove --run-time-repack, as that is likewise only used for CPU/RAM tensors. Fortunately it is smart enough not to repack tensors on GPU, so you're not hurting yourself. In general I don't use -rtr anymore, as even for MoEs the normal non-repacked quant types give big PP gains at large batch sizes.

  5. --cache-type-k q5_1 --cache-type-v q5_1 --k-cache-hadamard Wow, that is pretty aggressive KV-cache quantization; it's pretty amazing it works at such long context. If you can spare the extra VRAM, I'd suggest trying -khad -ctk q6_0 -ctv q8_0, as keeping the V cache a bit larger is likely good, and q6_0 with khad is still quite good quality.
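Putting tips 1-5 together, the adjusted launch command would look roughly like this (a sketch only; your paths and the 0.9 tensor-split are kept from the original command):

```shell
# Tips applied: -t 1 (tip 1); -ub <= -b, both 4096 (tip 2);
# --numa distribute and --run-time-repack dropped (tips 3-4);
# -khad -ctk q6_0 -ctv q8_0 for gentler KV quantization (tip 5).
~/ik_llama.cpp/build/bin/llama-server \
  --model /path/to/MiniMax-M2.5-IQ4_NL-00001-of-00004.gguf \
  --alias MiniMax-M2.5-IQ4_NL \
  --host 0.0.0.0 --port 8080 \
  --ctx-size 262144 --no-mmap \
  --threads 1 \
  -ub 4096 -b 4096 \
  --parallel 2 --flash-attn 1 \
  --n-gpu-layers 999 --split-mode graph \
  --tensor-split 0.9,1,1,1,1,1,1,1 \
  -gr -ger --merge-qkv \
  -khad -ctk q6_0 -ctv q8_0 \
  --jinja --chat-template-kwargs '{"enable_thinking": false}'
```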

Have fun, and hopefully I'll have some GLM-5 quants up eventually! Also super curious about Qwen3(.5)Next too!

With:
~/ik_llama.cpp/build/bin/llama-server
--model /home/admin_ia/.cache/lm-studio/models/ubergarm/MiniMax-M2.5-GGUF/MiniMax-M2.5-IQ4_NL-00001-of-00004.gguf
--alias MiniMax-M2.5-IQ4_NL
--host 0.0.0.0
--port 8080
--ctx-size 198000
--no-mmap
--threads 16
--batch-size 4096
--ubatch-size 4096
--parallel 1
--flash-attn 1
--n-gpu-layers 999
--split-mode graph
--tensor-split 0.9,1,1,1,1,1,1,1
-gr -ger
--merge-qkv
--cache-type-k q6_0
--cache-type-v q8_0
--k-cache-hadamard
--jinja
--chat-template-kwargs '{"enable_thinking": false}'

I tested --threads 1, but prompt processing is better with more threads (16 works well; beyond that, tokens/s degrades, and I'm not sure why).
--threads 1 --threads-batch 32 seems better.
Result: Generation: 58.0 t/s (35 to 75) | Prompt: 891.5 t/s (75 to 900)
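To pin down the thread sweet spot more systematically, llama-bench (built alongside llama-server) can sweep several thread counts in one run. A sketch, assuming the same model path; llama-bench accepts comma-separated value lists for its parameters:

```shell
# Sweep generation-thread counts in one run.
# -p / -n set the prompt-processing and generation test lengths.
~/ik_llama.cpp/build/bin/llama-bench \
  -m /path/to/MiniMax-M2.5-IQ4_NL-00001-of-00004.gguf \
  -ngl 999 \
  -t 1,8,16,32 \
  -p 512 -n 128
```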

Thanks again

@martossien

Thanks again for sharing the details of your unique rig! Huh so strange about adding CPU threads when the model should be 100% GPU offloaded unless there is still some part on CPU/RAM somehow?

Obviously use whatever gives better results for you, I don't have the hardware to check all the combinations myself!

Cheers!

@martossien

Huh so strange about adding CPU threads when the model should be 100% GPU offloaded unless there is still some part on CPU/RAM somehow?

Hei John,

Got my beast, similar to @martossien's, back online:
GPU: 8x NVIDIA RTX 3090 (24GB each, total 192GB VRAM)
CPU: AMD EPYC 7443P (24 cores / 48 threads)
RAM: 256 GB DDR4 3200 MHz ECC

Not here to brag about it, but much to my surprise I cannot fully offload MiniMax-M2.5-IQ4_NL, not even with the context at 196608... :( It hits OOM when trying to allocate the context, I guess.
I am launching it with:
/llms/ik_llama.cpp/build/bin/llama-server
--model ~/models/gguf/ubergarm/MiniMax-M2.5-IQ4_NL/MiniMax-M2.5-IQ4_NL-00001-of-00004.gguf
--alias "ubergarm/MiniMax-M2.5-IQ4_NL"
-c 196608
-ctk q8_0 -ctv q8_0
-mla 3 -fa 1 -amb 512 --no-mmap
-ngl 99
-ub 4096
-b 4096
--threads 1
--host 0.0.0.0
--port 5005
--jinja
--temp 1.0
--top_p 0.95
--top_k 40
--api-key VLLM_API_KEY_2026
--seed 3407
--chat-template-kwargs '{"reasoning_effort": "high"}'
-ger -sm graph --cache-ram 32768

--cache-ram 32768 --> via https://github.com/ggml-org/llama.cpp/pull/16391 since my OpenCode has to chew on huge CTX for its session-2-session "memory" transfers.
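Since the OOM hits during allocation, one way to see which card fills up first is to watch per-GPU memory while the model loads (a sketch; nvidia-smi ships with the driver, and these query flags are standard):

```shell
# Watch per-GPU memory once per second while llama-server loads
# the model in another terminal (Ctrl-C to stop).
watch -n 1 'nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader'
```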
P.S. With exactly the same parameters IQ4_XS works like a charm.

Any hints would be much appreciated!

@dehnhaide

Hei good to see you!

My understanding is you can run the smaller IQ4_XS no problem, but the IQ4_NL is bigger so you OOM now?

  • IQ4_XS 114.842 GiB
  • IQ4_NL 121.386 GiB

The difference is not huge, so maybe you just need to save a little VRAM so you can fully offload it onto the GPUs?

A few ways to save VRAM here:

  1. Lower batch sizes down to -ub 2048 -b 2048 which takes a little less VRAM but likely slower PP.
  2. -khad -ctk q6_0 -ctv q8_0 will save a little more space on KV-Cache if you are okay quantizing further
  3. Worst case, you offload some layers onto CPU/RAM, but that will probably slow it down a lot.
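For a rough sense of what option 2 saves: KV-cache size scales linearly with bits per element (roughly 8.5 for q8_0, 6.5 for q6_0, and 6.0 for q5_1 in the usual block formats). A back-of-the-envelope sketch; the layer/head/dim numbers below are illustrative placeholders, not MiniMax-M2.5's actual architecture:

```shell
# Per-cache (K or V) size = ctx * n_layer * n_kv_head * head_dim * bpw / 8 bytes.
# NOTE: n_layer / n_kv_head / head_dim are placeholder values for illustration.
ctx=196608; n_layer=60; n_kv_head=8; head_dim=128
for q in "q8_0 8.5" "q6_0 6.5" "q5_1 6.0"; do
  set -- $q
  awk -v c="$ctx" -v l="$n_layer" -v h="$n_kv_head" -v d="$head_dim" \
      -v bpw="$2" -v name="$1" \
      'BEGIN { printf "%s: %.2f GiB per K or V cache\n", name, c*l*h*d*bpw/8/2^30 }'
done
# Prints (with these placeholder dims):
# q8_0: 11.95 GiB per K or V cache
# q6_0: 9.14 GiB per K or V cache
# q5_1: 8.44 GiB per K or V cache
```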

Otherwise, a few thoughts:

  1. --cache-ram 32768 you could probably increase this further if you're not using the RAM for anything else
  2. -mla 3 -amb 512 only applies to models with MLA like Kimi-K2.5/DeepSeek/GLM-5; it doesn't hurt here, but has no effect

Keep us posted!

@dehnhaide

Hei good to see you!

My understanding is you can run the smaller IQ4_XS no problem, but the IQ4_NL is bigger so you OOM now?

  • IQ4_XS 114.842 GiB
    All OK!
  • IQ4_NL 121.386 GiB
    OOM
  1. Lower batch sizes down to -ub 2048 -b 2048 which takes a little less VRAM but likely slower PP.
    WOW, this did the trick! Who would have thought... so subtle and counter-intuitive for my tired brain...
  2. --cache-ram 32768 you could probably increase this further if you're not using the RAM for anything else
    Good hint too! ;)

As always, much appreciated!
