"w1_weight_scale_2 must match w3_weight_scale_2. Accuracy may be affected."
Thanks for the NVFP4 Quant @lukealonso !
Getting "w1_weight_scale_2 must match w3_weight_scale_2. Accuracy may be affected."
This will make the model a bit dumber, and it's something that has to be addressed at quantization time. The fix is to re-quantize the model so w1 and w3 share the same scale_2 per expert. When creating the NVFP4 checkpoint:
- Use one shared secondary scale for the fused w13 block (e.g. per-expert mean or max of the two scales), or
- Constrain calibration so w1 and w3 are treated as one block and get a single scale_2.
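A minimal sketch of the first option, assuming a two-level NVFP4 layout where a dequantized weight is q * block_scale * scale_2 (array names and shapes here are hypothetical stand-ins, not the actual checkpoint keys): pick one shared scale_2 per expert, then compensate the per-block scales so the dequantized weights are unchanged.

```python
import numpy as np

# Hypothetical per-expert secondary scales from a calibrated NVFP4 checkpoint.
w1_scale_2 = np.array([0.021, 0.018, 0.025], dtype=np.float32)
w3_scale_2 = np.array([0.021, 0.019, 0.025], dtype=np.float32)

# Take the max so neither projection's dynamic range gets clipped.
shared_scale_2 = np.maximum(w1_scale_2, w3_scale_2)

def refit_block_scales(block_scales, old_scale_2, new_scale_2):
    """Changing scale_2 requires compensating the per-block scales so the
    dequantized weight q * block_scale * scale_2 stays the same."""
    return block_scales * (old_scale_2 / new_scale_2)[:, None]

# Hypothetical per-expert block scales, shape (experts, blocks).
w3_block_scales = np.ones((3, 4), dtype=np.float32)
w3_refit = refit_block_scales(w3_block_scales, w3_scale_2, shared_scale_2)

# block_scale * scale_2 is invariant, so dequantized values are unchanged.
assert np.allclose(w3_refit * shared_scale_2[:, None],
                   w3_block_scales * w3_scale_2[:, None])
```

In a real checkpoint the refit block scales would still need to be re-stored in their fp8 format, which can cost a little precision; re-running calibration with w1/w3 treated as one block avoids that.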
Thanks for the heads-up. I actually already re-quantized it with the max of the two scales, but the upload was already running for several hours and I didn’t want to stop it. I’ll upload the corrected version overnight.
The good news is the weights are identical for 99.6% of the layers, and where they don't match the absolute difference is very small, so you shouldn't see a meaningful difference with the current version.
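A simple sketch of how such a comparison can be made, using plain dicts of arrays as stand-ins for the two checkpoints (in practice you would iterate over the safetensors shards; the tensor names here are illustrative):

```python
import numpy as np

def compare_checkpoints(a, b):
    """Return (fraction of exactly-matching tensors, worst absolute
    difference among the mismatching ones). a, b: name -> np.ndarray."""
    identical = 0
    worst = 0.0
    for name in a:
        if np.array_equal(a[name], b[name]):
            identical += 1
        else:
            worst = max(worst, float(np.max(np.abs(a[name] - b[name]))))
    return identical / len(a), worst

# Toy stand-ins for the old and corrected quantized checkpoints.
old = {"w1": np.zeros(4, np.float32),
       "w3": np.array([0.0, 0.001, 0.0, 0.0], np.float32)}
new = {"w1": np.zeros(4, np.float32),
       "w3": np.zeros(4, np.float32)}

frac, worst = compare_checkpoints(old, new)
print(frac, worst)
```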
I also have a version with K/V scales, but it was actually worse than the default 1.0 fp8 KV-cache scales; I need to figure out what's going on there.
Thanks for your careful attention and for sharing this :-)
Thanks a ton for this @lukealonso ! I'm just now getting started with 3rd party / community quants, as I haven't really played with anything < FP8 besides Kimi-K2.5
It's uploading now, but the differences are incredibly marginal so I wouldn't expect much. It does make the warning go away though.
Thank you so much!
@lukealonso I am downloading it as well, as we speak. I can run it both in native precision and nvfp4, if there is an easy test that I can run please let me know. Otherwise, I will test with my normal opencode usecases and will report back as well.
Works perfectly fine on vLLM and on SGLang...
Note that I expose it on port 10002.
SGLANG
docker run --rm -it \
--gpus '"device=0,1"' \
--shm-size 32g \
-p 10002:8000 \
-v /media/mukul/data/models:/models \
-e PYTORCH_ALLOC_CONF=expandable_segments:True \
lmsysorg/sglang:latest \
python -m sglang.launch_server \
--model-path /models/lukealonso/MiniMax-M2.5-NVFP4 \
--served-model-name jarvis-thinker \
--tp-size 2 \
--quantization modelopt_fp4 \
--tool-call-parser minimax-m2 \
--reasoning-parser minimax \
--host 0.0.0.0 \
--port 8000 \
--trust-remote-code \
--dtype auto \
--mem-fraction-static 0.95 \
--context-length 196608 \
--max-running-requests 16 \
--chunked-prefill-size 16384
VLLM
docker run --gpus all --rm --shm-size=24g --ipc=host \
-p 10002:8000 \
-v /media/mukul/data/models:/models \
-e VLLM_SLEEP_WHEN_IDLE=1 \
-e VLLM_FLASHINFER_FORCE_TENSOR_CORES=1 \
-e PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512 \
-e CUDA_DEVICE_ORDER=PCI_BUS_ID \
-e CUDA_VISIBLE_DEVICES=0,1 \
-e VLLM_NVFP4_GEMM_BACKEND=cutlass \
-e VLLM_USE_FLASHINFER_MOE_FP4=0 \
nvcr.io/nvidia/vllm:26.01-py3 \
vllm serve /models/lukealonso/MiniMax-M2.5-NVFP4 \
--served-model-name jarvis-thinker \
--host 0.0.0.0 --port 8000 \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.95 \
--max-num-seqs 16 \
--attention-config.backend FLASHINFER \
--disable-custom-all-reduce \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2 \
--trust-remote-code
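Once either container is up, a quick smoke test is a single chat request: both recipes map the server to host port 10002 and speak the OpenAI-compatible API, with the served model name set to jarvis-thinker. A minimal stdlib-only sketch (the prompt and payload values are just illustrative):

```python
import json
import urllib.request

# Minimal OpenAI-compatible chat request against either server above.
payload = {
    "model": "jarvis-thinker",
    "messages": [{"role": "user", "content": "Say hi in one word."}],
    "max_tokens": 16,
}
req = urllib.request.Request(
    "http://localhost:10002/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment once a server is actually running:
# with urllib.request.urlopen(req, timeout=60) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])

print(req.full_url, payload["model"])
```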
Two issues with the vLLM recipe:
- Setting VLLM_USE_FLASHINFER_MOE_FP4=0 forces the MARLIN backend, not FLASHINFER_CUTLASS, so I set VLLM_USE_FLASHINFER_MOE_FP4=1 instead.
- grep -R VLLM_FLASHINFER_FORCE_TENSOR_CORES /opt/vllm-0.15.1/ reveals nothing, so that variable is a no-op.
@lukealonso we are down to 0 tests failing. But I had to work around some issues (probably native to the model), at the prompt level:
- text degeneration (stop=length at 16k tokens), rare and it seems fixed with the right prompting (smooth out contradictions/stress in prompts)
- ignoring instructions under stress, rare, fixed with the right prompting (expand and explain more about what needs to be done)
- hallucinations, very rare and only under extreme stress, also fixable in prompts (clearly prioritize factual accuracy over satisfying the user's request for a specific number of entries in the output)
MiniMax-M2.1 was a nice toy: fast, sometimes impressive, usually sloppy. MiniMax-M2.5 is a significant update. The issues above are rare and only show up when pushing the model beyond its limits (my prompt alone was 57k tokens).
MiniMax-M2.5 is a reliable worker and a very good coder too. Amazing!
Thank you @lukealonso ! You rock!
Awesome, thanks for testing!
I'm having excellent results with MiniMax-M2.5 as well, exciting times.
I concur, minimax m2.5 feels very solid. I'm working with 100k+ context normally (opencode) and it stays strong.
Loving this quant too! SGLang is solid. Getting around 90 tk/s at lower contexts and 60 tk/s at 100k+ contexts, single request.