"w1_weight_scale_2 must match w3_weight_scale_2. Accuracy may be affected."
Thanks for the NVFP4 Quant @lukealonso !
Getting "w1_weight_scale_2 must match w3_weight_scale_2. Accuracy may be affected."
This will make the model a bit dumber, and it's something that has to be addressed at quantization time. The fix is to re-quantize the model so w1 and w3 share the same scale_2 per expert. When creating the NVFP4 checkpoint:
- Use one shared secondary scale for the fused w13 block (e.g. per-expert mean or max of the two scales), or
- Constrain calibration so w1 and w3 are treated as one block and get a single scale_2.
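A minimal sketch of the first option, assuming a two-level NVFP4 layout where a dequantized weight is q * block_scale * scale_2 (array names and shapes here are hypothetical stand-ins, not the actual checkpoint keys): pick one shared scale_2 per expert, then compensate the per-block scales so the dequantized weights are unchanged.

```python
import numpy as np

# Hypothetical per-expert secondary scales from a calibrated NVFP4 checkpoint.
w1_scale_2 = np.array([0.021, 0.018, 0.025], dtype=np.float32)
w3_scale_2 = np.array([0.021, 0.019, 0.025], dtype=np.float32)

# Take the max so neither projection's dynamic range gets clipped.
shared_scale_2 = np.maximum(w1_scale_2, w3_scale_2)

def refit_block_scales(block_scales, old_scale_2, new_scale_2):
    """Changing scale_2 requires compensating the per-block scales so the
    dequantized weight q * block_scale * scale_2 stays the same."""
    return block_scales * (old_scale_2 / new_scale_2)[:, None]

# Hypothetical per-expert block scales, shape (experts, blocks).
w3_block_scales = np.ones((3, 4), dtype=np.float32)
w3_refit = refit_block_scales(w3_block_scales, w3_scale_2, shared_scale_2)

# block_scale * scale_2 is invariant, so dequantized values are unchanged.
assert np.allclose(w3_refit * shared_scale_2[:, None],
                   w3_block_scales * w3_scale_2[:, None])
```

In a real checkpoint the refit block scales would still need to be re-stored in their fp8 format, which can cost a little precision; re-running calibration with w1/w3 treated as one block avoids that.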
Thanks for the heads-up. I actually already re-quantized it with the max of the two scales, but the upload was already running for several hours and I didn’t want to stop it. I’ll upload the corrected version overnight.
The good news is the weights are identical for 99.6% of the layers, and where they don't match the absolute difference is very small, so you shouldn't see a meaningful difference with the current version.
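A simple sketch of how such a comparison can be made, using plain dicts of arrays as stand-ins for the two checkpoints (in practice you would iterate over the safetensors shards; the tensor names here are illustrative):

```python
import numpy as np

def compare_checkpoints(a, b):
    """Return (fraction of exactly-matching tensors, worst absolute
    difference among the mismatching ones). a, b: name -> np.ndarray."""
    identical = 0
    worst = 0.0
    for name in a:
        if np.array_equal(a[name], b[name]):
            identical += 1
        else:
            worst = max(worst, float(np.max(np.abs(a[name] - b[name]))))
    return identical / len(a), worst

# Toy stand-ins for the old and corrected quantized checkpoints.
old = {"w1": np.zeros(4, np.float32),
       "w3": np.array([0.0, 0.001, 0.0, 0.0], np.float32)}
new = {"w1": np.zeros(4, np.float32),
       "w3": np.zeros(4, np.float32)}

frac, worst = compare_checkpoints(old, new)
print(frac, worst)
```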
I also have a version with K/V scales, but it was actually worse than the default 1.0 fp8 KV-cache scales; I need to figure out what's going on there.
Thanks for your careful attention and for sharing this :-)
Thanks a ton for this @lukealonso ! I'm just now getting started with 3rd party / community quants, as I haven't really played with anything < FP8 besides Kimi-K2.5
It's uploading now, but the differences are incredibly marginal so I wouldn't expect much. It does make the warning go away though.
Thank you so much!
@lukealonso I am downloading it as well, as we speak. I can run it both in native precision and nvfp4, if there is an easy test that I can run please let me know. Otherwise, I will test with my normal opencode usecases and will report back as well.
Works perfectly fine on vLLM and on SGLang...
Note that I expose it on port 10002.
SGLANG
docker run --rm -it \
--gpus '"device=0,1"' \
--shm-size 32g \
-p 10002:8000 \
-v /media/mukul/data/models:/models \
-e PYTORCH_ALLOC_CONF=expandable_segments:True \
lmsysorg/sglang:latest \
python -m sglang.launch_server \
--model-path /models/lukealonso/MiniMax-M2.5-NVFP4 \
--served-model-name jarvis-thinker \
--tp-size 2 \
--quantization modelopt_fp4 \
--tool-call-parser minimax-m2 \
--reasoning-parser minimax \
--host 0.0.0.0 \
--port 8000 \
--trust-remote-code \
--dtype auto \
--mem-fraction-static 0.95 \
--context-length 196608 \
--max-running-requests 16 \
--chunked-prefill-size 16384
VLLM
docker run --gpus all --rm --shm-size=24g --ipc=host \
-p 10002:8000 \
-v /media/mukul/data/models:/models \
-e VLLM_SLEEP_WHEN_IDLE=1 \
-e VLLM_FLASHINFER_FORCE_TENSOR_CORES=1 \
-e PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512 \
-e CUDA_DEVICE_ORDER=PCI_BUS_ID \
-e CUDA_VISIBLE_DEVICES=0,1 \
-e VLLM_NVFP4_GEMM_BACKEND=cutlass \
-e VLLM_USE_FLASHINFER_MOE_FP4=0 \
nvcr.io/nvidia/vllm:26.01-py3 \
vllm serve /models/lukealonso/MiniMax-M2.5-NVFP4 \
--served-model-name jarvis-thinker \
--host 0.0.0.0 --port 8000 \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.95 \
--max-num-seqs 16 \
--attention-config.backend FLASHINFER \
--disable-custom-all-reduce \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2 \
--trust-remote-code
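Once either container is up, a quick smoke test is a single chat request: both recipes map the server to host port 10002 and speak the OpenAI-compatible API, with the served model name set to jarvis-thinker. A minimal stdlib-only sketch (the prompt and payload values are just illustrative):

```python
import json
import urllib.request

# Minimal OpenAI-compatible chat request against either server above.
payload = {
    "model": "jarvis-thinker",
    "messages": [{"role": "user", "content": "Say hi in one word."}],
    "max_tokens": 16,
}
req = urllib.request.Request(
    "http://localhost:10002/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment once a server is actually running:
# with urllib.request.urlopen(req, timeout=60) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])

print(req.full_url, payload["model"])
```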
Two issues with the vLLM recipe:
- Setting VLLM_USE_FLASHINFER_MOE_FP4=0 forces the MARLIN backend, not FLASHINFER_CUTLASS, so I set VLLM_USE_FLASHINFER_MOE_FP4=1 instead.
- grep -R VLLM_FLASHINFER_FORCE_TENSOR_CORES /opt/vllm-0.15.1/ reveals nothing, so that variable is a no-op.
@lukealonso we are down to 0 tests failing. But I had to work around some issues (probably native to the model), at the prompt level:
- text degeneration (stop=length at 16k tokens), rare and it seems fixed with the right prompting (smooth out contradictions/stress in prompts)
- ignoring instructions under stress, rare, fixed with the right prompting (expand and explain more about what needs to be done)
- hallucinations, very rare and only under extreme stress, also fixable in prompts (clearly prioritize factual accuracy over satisfying the user's request for a specific number of entries in the output)
MiniMax-M2.1 was a nice toy: fast, sometimes impressive, usually sloppy. MiniMax-M2.5 is a significant update. The issues above are rare and only show up when pushing the model beyond its limits (my prompt alone was 57k tokens).
MiniMax-M2.5 is a reliable worker and a very good coder too. Amazing!
Thank you @lukealonso ! You rock!
Awesome, thanks for testing!
I'm having excellent results with MiniMax-M2.5 as well, exciting times.
I concur, minimax m2.5 feels very solid. I'm working with 100k+ context normally (opencode) and it stays strong.
Loving this quant too! SGLang is solid. Getting around 90 tk/s at lower contexts and 60 tk/s at 100k+ contexts, single request.