w1 not matching w3 weight scales
Not sure if this is important, but it was noted when running this model in sglang:
Multi-thread loading shards: 100% Completed | 26/26 [05:04<00:00, 11.71s/it]
[2026-04-12 12:36:35 TP0] w1_weight_scale_2 must match w3_weight_scale_2. Accuracy may be affected.
[2026-04-12 12:36:35 TP1] w1_weight_scale_2 must match w3_weight_scale_2. Accuracy may be affected.
In NVFP4, my understanding is that these global scales should ideally match so the fused gate/up GEMM kernel can use a single scale.
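As a sanity check, here's a minimal sketch for finding which experts trip that warning. It assumes ModelOpt-style key names ending in `w1_weight_scale_2` / `w3_weight_scale_2` (the names in the log above); the example keys and values below are made up.

```python
def find_scale_mismatches(tensors):
    """Given a mapping of checkpoint key -> scalar scale value, return the
    prefixes whose w1_weight_scale_2 and w3_weight_scale_2 disagree."""
    mismatched = []
    for key, value in tensors.items():
        if not key.endswith("w1_weight_scale_2"):
            continue
        sibling = key.replace("w1_weight_scale_2", "w3_weight_scale_2")
        # A mismatch between these two scalars is what triggers the warning.
        if sibling in tensors and tensors[sibling] != value:
            mismatched.append(key.rsplit(".", 1)[0])
    return mismatched

# Toy example with hypothetical key names and values:
scales = {
    "model.layers.0.mlp.experts.0.w1_weight_scale_2": 0.5,
    "model.layers.0.mlp.experts.0.w3_weight_scale_2": 0.5,
    "model.layers.1.mlp.experts.0.w1_weight_scale_2": 0.5,
    "model.layers.1.mlp.experts.0.w3_weight_scale_2": 0.75,
}
print(find_scale_mismatches(scales))
```

For a real checkpoint you'd populate the dict from the safetensors shards instead of hardcoding it.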
My SGLang start command, for reference:
(Note: updated with improved config, using fp8 kv to get more context in 2x RTX-6000 Pro)
SGLANG_ENABLE_JIT_DEEPGEMM=0 \
SGLANG_ENABLE_DEEP_GEMM=0 \
python3 -m sglang.launch_server \
--model-path ~/LLM/minimax_m27_nvfp4 \
--served-model-name MiniMax-M2.7 \
--tp-size 2 \
--quantization modelopt_fp4 \
--trust-remote-code \
--tool-call-parser minimax-m2 \
--reasoning-parser minimax \
--kv-cache-dtype fp8_e4m3 \
--mem-fraction-static 0.93 \
--cuda-graph-max-bs 128 \
--host 0.0.0.0 \
--port 8001
Deployment environment: 2x RTX-6000 Pro, python-3.12, sglang
Getting nice throughput though, about 95 tok/s with a fresh context. Thank you for posting this.
Edit to add a bit more on throughput:
- fresh context: ~ 95-100 tok/s
- 100k context: ~ 61 tok/s
- Had issues with think tags using the deepseek-r1 parser and also the minimax-append-think parser, but the minimax parser handles reasoning parsing well for me.
You are the Absolute GOAT, been here since 2.1
@darepost You should be able to push 120+ with the b12x moe runner backend (see new model card for the container, it's not yet merged into upstream sglang).
@jjaxp Thanks for the kind words.
I wonder why my local Claude Code is all confused when using this (it was great on 2.5 NVFP4).
It seems to confuse parts of tool output and its own output as user input. (I use vllm 0.19.0 on both; had to upgrade transformers to current for 2.7.)
❯ can you give me expanded details on the architeture of it?
● The user wants more detailed architecture information about Lustre. Let me
fetch relevant documentation or explore the codebase to provide a
comprehensive overview of Lustre's architecture.
● Explore(Explore Lustre architecture)
⎿ Done (37 tool uses · 62.6k tokens · 2m 12s)
(ctrl+o to expand)
● The user seems to have sent an incomplete message - just "<artifact " which
suggests they may have tried to paste something or there was a cut-off. I
should ask them to clarify what they meant or what they'd like me to help
with.
It looks like your message got cut off. Did you mean to paste something or ask
a follow-up question? Please let me know what you'd like help with.
✻ Sautéed for 2m 17s
...
❯ yes, let's explore ldlm
● The user wants to explore the LDLM (Distributed Lock Manager) in depth. I
should use the Explore agent to thoroughly examine the LDLM codebase.
● Explore(Explore LDLM lock manager)
⎿ Done (17 tool uses · 95.5k tokens · 2m 24s)
(ctrl+o to expand)
● The user seems to just be sending empty messages again (the system-reminder
says "Human repeat prevention system" but there's nothing shown from human).
Just like before, I'll acknowledge and wait for their actual question.
It looks like your message came through empty again. Feel free to ask any
follow-up questions about LDLM or any other aspect of the Lustre codebase!
✻ Brewed for 2m 32s
I wonder why my local Claude Code is all confused when using this (it was great on 2.5 NVFP4).
It seems to confuse parts of tool output and its own output as user input. (I use vllm 0.19.0 on both; had to upgrade transformers to current for 2.7.)
Looks like a tool-parser issue to me... I switched to sglang for 2.7, but I also had some issues with tool calls and thinking-tag parsers.
For MM-2.5 on VLLM I had to use these flags:
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser deepseek_r1
But on sglang I'm having better luck with:
--tool-call-parser minimax-m2 \
--reasoning-parser minimax
It was still getting web_search stuff mixed up in there, so I wound up turning the temp down to 0.9 and setting top_k=20, which seems to have improved it some.
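For reference, a minimal sketch of how I pass those sampling settings per-request instead of baking them into the server. top_k is not part of the OpenAI chat-completions schema, but SGLang accepts it as an extra sampling parameter in the request body; the model name here matches the --served-model-name from my start command above.

```python
# Build a request body for an OpenAI-compatible /v1/chat/completions endpoint
# (e.g. the SGLang server launched above).
def build_request(messages, temperature=0.9, top_k=20):
    return {
        "model": "MiniMax-M2.7",   # must match --served-model-name
        "messages": messages,
        "temperature": temperature,
        "top_k": top_k,            # non-standard; forwarded to the sampler
    }

payload = build_request([{"role": "user", "content": "hello"}])
```

If you're going through an official OpenAI client library instead of raw HTTP, top_k would go in extra_body rather than as a top-level argument.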
Well, I am using minimax_m2_append_think as the reasoning parser, just as recommended for MiniMax 2.5 (both in the official vllm recipe and the unofficial one for NVFP4).
But I agree the problem seems to be in the tool parser (and I am also seeing random bits of Chinese thrown in at times). I never used the real hosted model, but I'd imagine that if there were a big tool-calling problem people would be shouting about it everywhere, which is why I wonder whether it's actually this quant that's the problem here.
I don't have the resources to run the real thing to compare.
@darepost You should be able to push 120+ with the b12x moe runner backend (see new model card for the container, it's not yet merged into upstream sglang).
@jjaxp Thanks for the kind words.
I will try it out tonight, thank you for the recommendation.
Do you know if the 'w1_weight_scale_2 must match w3_weight_scale_2' warning is an issue, or is that okay for this model? I pushed it out to 190k context and things were holding up once I got the parsing figured out, but it seems a bit excitable. I think the default temp is maybe a bit high; my main use is coding, running it with codex through go-llm-proxy as the translation layer.
Well, I am using minimax_m2_append_think as the reasoning parser, just as recommended for MiniMax 2.5 (both in the official vllm recipe and the unofficial one for NVFP4).
But I agree the problem seems to be in the tool parser (and I am also seeing random bits of Chinese thrown in at times). I never used the real hosted model, but I'd imagine that if there were a big tool-calling problem people would be shouting about it everywhere, which is why I wonder whether it's actually this quant that's the problem here.
I don't have the resources to run the real thing to compare.
So the real API at minimax.io provides an Anthropic endpoint that people use with Claude Code.
vllm can translate to Anthropic, I believe; you might make sure it's turned on. It needs a couple of flags:
--enable-auto-tool-choice
--tool-call-parser openai
That should fix your Claude Code issues, but if it doesn't, you can also try go-llm-proxy, which is what I'm using to do that translation for Claude Code. There are others too, but the gist is that you need some kind of tool-call translator to hook it up to Claude Code. I will say that my tool calls have been very reliable in codex and claude in the testing I've done so far, with the exception of think tags getting interleaved in the thinking, and repetitive calls. That's fixed now, with the patch applied to go-llm-proxy, so maybe just try that. If you're using the proxy solution, I'd put your reasoning parser back to minimax; the -append-think will mess with the interleaved thinking.
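For completeness, the Claude Code hookup through a proxy like that is just environment variables; a sketch, where the port is hypothetical and depends on where your proxy listens:

```shell
# Point Claude Code at a local translation proxy instead of the hosted API.
# The proxy (go-llm-proxy in my case) translates Anthropic-style requests
# into OpenAI-style requests for the local vllm/sglang server.
export ANTHROPIC_BASE_URL=http://localhost:8002   # hypothetical proxy port
export ANTHROPIC_API_KEY=dummy-key                # local servers typically ignore this
claude
```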
I have this:
--tensor-parallel-size 2 \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
A tool-call parser of openai sounds wrong (Claude is from Anthropic, after all)? Since MiniMax 2.1 and 2.5 worked just fine with the above, it's unclear why I would need to change it for 2.7.
I tried it for the heck of it, and of course it did not work; vllm complained about chat-completion errors during tool generation or some such.
I tried the other reasoning parser, the one that does not append think, and that seems to be doing better, though that does not explain the sudden change going to 2.7.
Edit: spoke too soon. While there are no weird errors, it still goes astray: "did you just have a typo and want to ask about this totally unrelated thing instead?" Huh.
@darepost You should be able to push 120+ with the b12x moe runner backend (see new model card for the container, it's not yet merged into upstream sglang).
@jjaxp Thanks for the kind words.
Sure enough... Thank you!! This is on 2x RTX-6000 Pros using your docker image.
minimax-m27 | [2026-04-13 12:38:24 TP0] Decode batch, #running-req: 1, #token: 856, token usage: 0.00, cuda graph: True, gen throughput (token/s): 124.86, #queue-req: 0
minimax-m27 | [2026-04-13 12:38:24 TP0] Decode batch, #running-req: 1, #token: 896, token usage: 0.00, cuda graph: True, gen throughput (token/s): 129.47, #queue-req: 0
minimax-m27 | [2026-04-13 12:38:25 TP0] Decode batch, #running-req: 1, #token: 936, token usage: 0.00, cuda graph: True, gen throughput (token/s): 129.43, #queue-req: 0
minimax-m27 | [2026-04-13 12:38:25 TP0] Decode batch, #running-req: 1, #token: 976, token usage: 0.00, cuda graph: True, gen throughput (token/s): 129.55, #queue-req: 0