"w1_weight_scale_2 must match w3_weight_scale_2. Accuracy may be affected."

#2
by zenmagnets - opened

Thanks for the NVFP4 Quant @lukealonso !

Getting "w1_weight_scale_2 must match w3_weight_scale_2. Accuracy may be affected."

This will make the model a bit dumber, and it's something that has to be addressed at quantization time. The fix is to re-quantize the model so that w1 and w3 share the same scale_2 per expert. When creating the NVFP4 checkpoint:

  • Use one shared secondary scale for the fused w13 block (e.g. per-expert mean or max of the two scales), or
  • Constrain calibration so w1 and w3 are treated as one block and get a single scale_2.
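
The first option above can be sketched like this, assuming weight_scale_2 is stored as one scalar per expert for each of w1 and w3 (the names mirror the warning message, not a real checkpoint API):

```python
import numpy as np

# Hypothetical per-expert secondary scales for w1 and w3 (3 experts).
w1_weight_scale_2 = np.array([0.8, 1.2, 0.9])
w3_weight_scale_2 = np.array([1.0, 0.9, 0.9])

# One shared secondary scale per expert; taking the max keeps both tensors
# within range (per-expert mean is the other option mentioned above).
shared_scale_2 = np.maximum(w1_weight_scale_2, w3_weight_scale_2)
print(shared_scale_2)  # [1.  1.2 0.9]

# w1 and w3 then need re-quantizing against shared_scale_2, e.g. by folding
# the ratio old_scale / shared_scale into the per-block scales.
w1_rescale = w1_weight_scale_2 / shared_scale_2  # all <= 1.0 by construction
w3_rescale = w3_weight_scale_2 / shared_scale_2
```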

Thanks for the heads-up. I actually already re-quantized it with the max of the two scales, but the upload was already running for several hours and I didn’t want to stop it. I’ll upload the corrected version overnight.

The good news is the weights are identical for 99.6% of the layers, and where they do differ the absolute difference is very small, so you shouldn't see a meaningful difference with the current version.
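
If you want to verify a claim like this yourself, a minimal sketch of a checkpoint diff looks like the following (plain dicts of numpy arrays stand in for the safetensors shards; loading the real files is left out):

```python
import numpy as np

def compare_state_dicts(a, b):
    """Return (fraction of exactly-identical tensors, max absolute diff)."""
    identical = sum(np.array_equal(a[k], b[k]) for k in a)
    max_abs_diff = max(float(np.max(np.abs(a[k] - b[k]))) for k in a)
    return identical / len(a), max_abs_diff

# Toy stand-ins for the old and re-quantized checkpoints.
old = {"w1": np.array([1.0, 2.0]), "w2": np.array([3.0, 4.0])}
new = {"w1": np.array([1.0, 2.0]), "w2": np.array([3.0, 4.001])}

frac, diff = compare_state_dicts(old, new)
print(frac, diff)  # half the tensors identical, tiny max difference
```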

I also have a version with K/V scales, but it was actually worse than the default 1.0 FP8 KV-cache scales; I still need to figure out what's going on there.

Thanks for your careful attention and for sharing this :-)

Thanks a ton for this @lukealonso ! I'm just now getting started with 3rd party / community quants, as I haven't really played with anything < FP8 besides Kimi-K2.5

@lukealonso can you please let us know when you have re-uploaded the newer quants? I'll redownload.

It's uploading now, but the differences are incredibly marginal so I wouldn't expect much. It does make the warning go away though.

Thank you so much!

Uploaded. @ktsaou - would you mind running your test against this version whenever you have time?

@lukealonso I am downloading it as well, as we speak. I can run it both in native precision and nvfp4, if there is an easy test that I can run please let me know. Otherwise, I will test with my normal opencode usecases and will report back as well.

@mtcl Test however you normally do. Thanks!

Works perfectly fine on both vLLM and SGLang...

Note that I expose it on port 10002.

SGLANG

docker run --rm -it \
  --gpus '"device=0,1"' \
  --shm-size 32g \
  -p 10002:8000 \
  -v /media/mukul/data/models:/models \
  -e PYTORCH_ALLOC_CONF=expandable_segments:True \
  lmsysorg/sglang:latest \
  python -m sglang.launch_server \
    --model-path /models/lukealonso/MiniMax-M2.5-NVFP4 \
    --served-model-name jarvis-thinker \
    --tp-size 2 \
    --quantization modelopt_fp4 \
    --tool-call-parser minimax-m2 \
    --reasoning-parser minimax \
    --host 0.0.0.0 \
    --port 8000 \
    --trust-remote-code \
    --dtype auto \
    --mem-fraction-static 0.95 \
    --context-length 196608 \
    --max-running-requests 16 \
    --chunked-prefill-size 16384

VLLM

docker run --gpus all --rm --shm-size=24g --ipc=host \
  -p 10002:8000 \
  -v /media/mukul/data/models:/models \
  -e VLLM_SLEEP_WHEN_IDLE=1 \
  -e VLLM_FLASHINFER_FORCE_TENSOR_CORES=1 \
  -e PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512 \
  -e CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -e CUDA_VISIBLE_DEVICES=0,1 \
  -e VLLM_NVFP4_GEMM_BACKEND=cutlass \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  nvcr.io/nvidia/vllm:26.01-py3 \
  vllm serve /models/lukealonso/MiniMax-M2.5-NVFP4 \
    --served-model-name jarvis-thinker \
    --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.95 \
    --max-num-seqs 16 \
    --attention-config.backend FLASHINFER \
    --disable-custom-all-reduce \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2 \
    --trust-remote-code

Two issues with the vLLM recipe:

  1. Setting VLLM_USE_FLASHINFER_MOE_FP4=0 forces the MARLIN backend, not FLASHINFER_CUTLASS, so I set VLLM_USE_FLASHINFER_MOE_FP4=1 instead.
  2. grep -R VLLM_FLASHINFER_FORCE_TENSOR_CORES /opt/vllm-0.15.1/ reveals nothing, so that variable is a no-op.
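
A generic way to catch no-op variables like this, sketched below: compare the VLLM_-prefixed variables you set against the set the installed version actually recognizes. The known set here is a made-up assumption; in practice you'd build it from vllm/envs.py inside the container.

```python
# Hypothetical helper: flag VLLM_* env vars that this vLLM build ignores.
# KNOWN_VLLM_VARS is an assumption for illustration, not the real list.
KNOWN_VLLM_VARS = {
    "VLLM_SLEEP_WHEN_IDLE",
    "VLLM_NVFP4_GEMM_BACKEND",
    "VLLM_USE_FLASHINFER_MOE_FP4",
}

def unrecognized_vllm_vars(environ):
    return sorted(
        name for name in environ
        if name.startswith("VLLM_") and name not in KNOWN_VLLM_VARS
    )

env = {
    "VLLM_SLEEP_WHEN_IDLE": "1",
    "VLLM_FLASHINFER_FORCE_TENSOR_CORES": "1",  # the no-op from above
}
print(unrecognized_vllm_vars(env))  # ['VLLM_FLASHINFER_FORCE_TENSOR_CORES']
```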

@lukealonso we are down to 0 tests failing. But I had to work around some issues (probably native to the model), at the prompt level:

  1. Text degeneration (stop=length at 16k tokens): rare, and it seems fixed with the right prompting (smoothing out contradictions/stress in prompts).
  2. Ignoring instructions under stress: rare, fixed with the right prompting (expanding on and explaining more about what needs to be done).
  3. Hallucinations: very rare and only under extreme stress, also fixable in prompts (clearly prioritizing factual accuracy over satisfying a user request for a specific number of entries in the output).

MiniMax-M2.1 was a nice toy - fast, sometimes impressive, usually sloppy. MiniMax-M2.5 is a significant update. The issues above are rare and appear only when pushing the model beyond its limits (my prompt alone was 57k tokens).

MiniMax-M2.5 is a reliable worker and a very good coder too. Amazing!

Thank you @lukealonso ! You rock!

Awesome, thanks for testing!

I'm having excellent results with MiniMax-M2.5 as well, exciting times.

I concur, MiniMax-M2.5 feels very solid. I'm working with 100k+ context normally (opencode) and it stays strong.

Loving this quant too! SGLang is solid. Getting around 90 tk/s at lower contexts, and 60 tk/s at 100k+ contexts. Single request.
