8x RTX 3090, EPYC 7532, 512GB RAM: Benchmarking MiniMax-M2.5-IQ4_NL for High-Speed Coding (35-75 t/s) with opencode
Hello everyone,
Following my tests with ubergarm's GLM-4.7-355B-IQ5_K, I have now extensively tested MiniMax-M2.5-IQ4_NL. My primary use case is development with OpenCode (agentic coding workflow), requiring both large context (100k+) and high generation speed.
While MiniMax-M2.5 might not beat the top-tier coding models in pure reasoning benchmarks, its speed/quality ratio on this hardware is phenomenal for iterative development.
Generation Speed: 35 to 75 tokens/s
VRAM Usage: ~175 GB / 192 GB (leaving room for 2 simultaneous OpenCode sessions)
Context: Tested stable at 262k tokens context length
GPU: 8x NVIDIA RTX 3090 (24GB each, total 192GB VRAM)
Note: 2 cards are linked via NVLink, others via PCIe
CPU: AMD EPYC 7532 (32 cores / 64 threads)
RAM: 512 GB DDR4 2933 MHz ECC
Software: ik_llama.cpp + NVIDIA Drivers 580.126.09
~/ik_llama.cpp/build/bin/llama-server \
  --model /path/to/MiniMax-M2.5-IQ4_NL-00001-of-00004.gguf \
  --alias MiniMax-M2.5-IQ4_NL \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 262144 \
  --no-mmap \
  --threads 32 \
  --threads-batch 64 \
  --batch-size 2048 \
  --ubatch-size 4096 \
  --parallel 2 \
  --flash-attn 1 \
  --n-gpu-layers 999 \
  --split-mode graph \
  --tensor-split 0.9,1,1,1,1,1,1,1 \
  --numa distribute \
  --run-time-repack \
  -gr -ger \
  --merge-qkv \
  --cache-type-k q5_1 \
  --cache-type-v q5_1 \
  --k-cache-hadamard \
  --jinja \
  --chat-template-kwargs '{"enable_thinking": false}'
--chat-template-kwargs '{"enable_thinking": false}': Mandatory for OpenCode. Without this, you will get an "Assistant response prefill is incompatible with enable_thinking" error because OpenCode forces the start of the assistant's reply (prefill).
--parallel 2: Allows two OpenCode developers to work simultaneously (splitting the 262k context budget).
--tensor-split 0.9,...: Reduces the load on the first GPU (headless server, but it drives 2 screens), preventing OOM on display tasks.
--cache-type-k/v q5_1: I'm not sure about this one, but it works (I used q4_0 before).
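As a quick sanity check that the `enable_thinking` kwarg is being picked up, you can hit the OpenAI-compatible endpoint that llama-server exposes (a sketch: host, port, and alias match the launch command above; the prompt is just an example):

```shell
# Query the server's OpenAI-compatible chat endpoint; with
# enable_thinking disabled, the reply should contain no thinking block.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "MiniMax-M2.5-IQ4_NL",
        "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
        "max_tokens": 64
      }'
```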
Huge thanks to ubergarm for providing these high-quality IQ quants! This makes running such massive MoE models locally not just possible, but incredibly fast.
Super, glad you're enjoying the quants on your huge rig!
A few tips given you are using full GPU offload:
- When using full GPU offload, always set threads to `-t 1` to minimize the overhead of synchronizing unused threads. Might give 1-3% more benefit anecdotally. So remove `--threads 32 --threads-batch 64` unless you are explicitly leaving some layers on CPU/RAM.
- Your `--batch-size 2048 --ubatch-size 4096` is strange. The defaults are `-ub 512 -b 2048`. The microbatch `-ub` is always smaller than the logical batch `-b` size, so you can just use `-ub 4096 -b 4096` most of the time for increased PP performance, at the cost of some latency on very small prompts and a little more VRAM for the compute buffer allocation that holds the larger batch.
- Remove `--numa distribute`, as that is only important if you are running any tensors on CPU/RAM, and only if you also used `numactl` before llama-server. It's not being used, but it's nice to keep your commands clean for less confusion.
- Remove `--run-time-repack`, as that is once again only used for CPU/RAM tensors. It is smart enough not to repack tensors on GPU, fortunately, so you're not hurting yourself. In general I don't use `-rtr` anymore, as even for MoEs the normal non-repacked quant types give big PP gains at large batch sizes.
- `--cache-type-k q5_1 --cache-type-v q5_1 --k-cache-hadamard`: wow, that is pretty aggressive KV-cache quantization; pretty amazing it works with such long context. If you can spare the extra VRAM, I'd suggest trying `-khad -ctk q6_0 -ctv q8_0`, as keeping the V cache a bit larger is likely good, and q6_0 with khad is still quite good quality.
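Putting the tips above together, the trimmed launch command would look roughly like this (a sketch, not tested on this rig; model path and ports taken from the original post):

```shell
# Sketch of the launch command with the suggestions applied:
# -t 1 for full GPU offload, matched -ub/-b, no --numa / -rtr,
# and slightly less aggressive KV-cache quantization.
~/ik_llama.cpp/build/bin/llama-server \
  --model /path/to/MiniMax-M2.5-IQ4_NL-00001-of-00004.gguf \
  --alias MiniMax-M2.5-IQ4_NL \
  --host 0.0.0.0 --port 8080 \
  --ctx-size 262144 --no-mmap \
  -t 1 \
  -ub 4096 -b 4096 \
  --parallel 2 \
  --flash-attn 1 \
  --n-gpu-layers 999 \
  --split-mode graph \
  --tensor-split 0.9,1,1,1,1,1,1,1 \
  -gr -ger --merge-qkv \
  -khad -ctk q6_0 -ctv q8_0 \
  --jinja \
  --chat-template-kwargs '{"enable_thinking": false}'
```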
Have fun, and hopefully I'll have some GLM-5 quants up eventually! Also super curious about Qwen3(.5)Next too!
With :
~/ik_llama.cpp/build/bin/llama-server --model /home/admin_ia/.cache/lm-studio/models/ubergarm/MiniMax-M2.5-GGUF/MiniMax-M2.5-IQ4_NL-00001-of-00004.gguf --alias MiniMax-M2.5-IQ4_NL --host 0.0.0.0 --port 8080 --ctx-size 198000 --no-mmap --threads 16 --batch-size 4096 --ubatch-size 4096 --parallel 1 --flash-attn 1 --n-gpu-layers 999 --split-mode graph --tensor-split 0.9,1,1,1,1,1,1,1 -gr -ger --merge-qkv --cache-type-k q6_0 --cache-type-v q8_0 --k-cache-hadamard --jinja --chat-template-kwargs '{"enable_thinking": false}'
I tested --threads 1, but prompt processing is better if I add more (16 works well; more than that degrades tokens/s, I'm not sure about this).
--threads 1 --threads-batch 32 seems better.
Result: Token: 58.0 t/s (35 to 75) | Prompt: 891.5 t/s (75 to 900)
Thanks again
Thanks again for sharing the details of your unique rig! Huh so strange about adding CPU threads when the model should be 100% GPU offloaded unless there is still some part on CPU/RAM somehow?
Obviously use whatever gives better results for you, I don't have the hardware to check all the combinations myself!
Cheers!
Huh so strange about adding CPU threads when the model should be 100% GPU offloaded unless there is still some part on CPU/RAM somehow?
Hei John,
Got my beast, similar to @martossien's, back online:
GPU: 8x NVIDIA RTX 3090 (24GB each, total 192GB VRAM)
CPU: AMD EPYC 7443P (24 cores / 48 threads)
RAM: 256 GB DDR4 3200 MHz ECC
Not here to brag about it, but much to my surprise I cannot fully offload MiniMax-M2.5-IQ4_NL, not even with CTX at 196608... :( It hits OOM when trying to allocate the CTX, I guess.
I am launching it with:
/llms/ik_llama.cpp/build/bin/llama-server \
  --model ~/models/gguf/ubergarm/MiniMax-M2.5-IQ4_NL/MiniMax-M2.5-IQ4_NL-00001-of-00004.gguf \
  --alias "ubergarm/MiniMax-M2.5-IQ4_NL" \
  -c 196608 \
  -ctk q8_0 -ctv q8_0 \
  -mla 3 -fa 1 -amb 512 --no-mmap \
  -ngl 99 \
  -ub 4096 \
  -b 4096 \
  --threads 1 \
  --host 0.0.0.0 \
  --port 5005 \
  --jinja \
  --temp 1.0 \
  --top_p 0.95 \
  --top_k 40 \
  --api-key VLLM_API_KEY_2026 \
  --seed 3407 \
  --chat-template-kwargs '{"reasoning_effort": "high"}' \
  -ger -sm graph --cache-ram 32768
--cache-ram 32768 --> via https://github.com/ggml-org/llama.cpp/pull/16391 since my OpenCode has to chew on huge CTX for its session-2-session "memory" transfers.
P.S. With exactly the same parameters IQ4_XS works like a charm.
Any hints would be much appreciated!
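One quick way to see where the allocation fails is to watch per-GPU memory while the model loads (standard nvidia-smi query flags; the 1-second refresh interval is just a convenient choice):

```shell
# Watch per-GPU VRAM usage while llama-server loads the model,
# refreshing every second; useful to spot which card hits OOM first.
watch -n 1 'nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits'
```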
Hei good to see you!
My understanding is you can run the smaller IQ4_XS no problem, but the IQ4_NL is bigger so you OOM now?
- IQ4_XS 114.842 GiB
- IQ4_NL 121.386 GiB
The difference is not huge, so maybe you just need to save a little VRAM so you can fully offload it onto the GPUs?
A few ways to save VRAM here:
- Lower batch sizes down to `-ub 2048 -b 2048`, which takes a little less VRAM but likely gives slower PP.
- `-khad -ctk q6_0 -ctv q8_0` will save a little more space on the KV cache if you are okay quantizing further.
- Worst case, you offload some layers onto CPU/RAM, but that will probably slow it down a lot.
Otherwise, a few thoughts:
- `--cache-ram 32768`: you could probably increase this further if you're not using the RAM for anything else.
- `-mla 3 -amb 512`: this only applies to models with MLA like Kimi-K2.5/DeepSeek/GLM-5; it doesn't hurt here, but it has no effect.
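For reference, the VRAM-saving suggestions amount to changing just a few flags in the command from the post above (a sketch with the same model path and port; the MLA flags are dropped since they have no effect here):

```shell
# Lower-VRAM variant of the earlier launch command: smaller batches and
# further-quantized K cache; everything else as in the original post.
/llms/ik_llama.cpp/build/bin/llama-server \
  --model ~/models/gguf/ubergarm/MiniMax-M2.5-IQ4_NL/MiniMax-M2.5-IQ4_NL-00001-of-00004.gguf \
  -c 196608 \
  -khad -ctk q6_0 -ctv q8_0 \
  -fa 1 --no-mmap \
  -ngl 99 \
  -ub 2048 -b 2048 \
  --threads 1 \
  --host 0.0.0.0 --port 5005 \
  --jinja \
  -ger -sm graph --cache-ram 32768
```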
Keep us posted!
Hei good to see you!
My understanding is you can run the smaller IQ4_XS no problem, but the IQ4_NL is bigger so you OOM now?
- IQ4_XS 114.842 GiB
All OK!
- IQ4_NL 121.386 GiB
OOM
- Lower batch sizes down to `-ub 2048 -b 2048`, which takes a little less VRAM but likely slower PP.
WOW, this did the trick! Who would have thought... so subtle and counter-intuitive for my tired brain...
`--cache-ram 32768`: you could probably increase this further if you're not using the RAM for anything else
Good hint too! ;)
As always, much appreciated!
