Request: NVFP4 version of MiniMax-M2.5-REAP-139B (to fit on a single RTX 6000 Pro)
Hi,
First of all, thank you so much for converting the main MiniMax model to NVFP4! It's amazing work and incredibly helpful for the community.
I was wondering if you might be able to do the same NVFP4 conversion for the cerebras/MiniMax-M2.5-REAP-139B-A10B model as well?
If that model were available in NVFP4 format, it would fit perfectly into a single RTX 6000 Pro GPU, which would be a game-changer for my setup.
Thank you again for your time and all your great contributions!
Will do.
Super:) Thank you.
Hello! I wanted to try this model, but I noticed there are no model files here yet (only the README).
Sorry, should be fully uploaded now.
Thank You, I will check:)
Hi @lukealonso ,
I'm trying to run this model on a single NVIDIA RTX PRO 6000 Blackwell Server Edition (96GB) with vLLM v0.14.1-cu130 and the cu130 nightly build, but I consistently get completely incoherent multilingual gibberish output regardless of configuration.
Example output for a simple "hello" prompt:
lymp_A Tanadows rain protease预习 mercenariesolidated fragment歌手 DieseregaraPRESSान meminta超人 dew:|: груп Particular uniformly金龙 reportajeodynam diretrizes.tv Alaska Oklahoma该校...
What I tried
Your recommended config (CUTLASS + FLASH_ATTN):
VLLM_USE_FLASHINFER_MOE_FP4=0
VLLM_NVFP4_GEMM_BACKEND=cutlass
--attention-backend FLASH_ATTN
→ Fails immediately: ValueError: FLASH_ATTN is not valid — kv_cache_dtype not supported. The NVFP4 checkpoint forces an fp8 KV cache, which FLASH_ATTN cannot handle.
CUTLASS MOE + FlashInfer attention (auto) + fp8 KV cache:
VLLM_USE_FLASHINFER_MOE_FP4=0
VLLM_NVFP4_GEMM_BACKEND=cutlass
--kv-cache-dtype fp8_e4m3
→ Loads, but produces gibberish output.
FlashInfer MOE + FlashInfer attention (auto) + fp8 KV cache:
VLLM_USE_FLASHINFER_MOE_FP4=1
--kv-cache-dtype fp8_e4m3
--max-model-len 50000
--tensor-parallel-size 1
→ Loads, but produces gibberish output.
I also tested with and without --enable-expert-parallel, --calculate-kv-scales, --enable-chunked-prefill, etc. Same result every time.
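For context, the fp8 KV-cache requirement appears to come from the checkpoint's quantization metadata rather than from any launch flag. A minimal sketch of how one could check it, assuming a ModelOpt-style hf_quant_config.json with a kv_cache_quant_algo key (the actual file and key names in this checkpoint may differ):

```python
import json

# Hypothetical excerpt of a ModelOpt-style hf_quant_config.json;
# the real file shipped with the checkpoint may use different keys.
sample = json.loads("""
{
  "quantization": {
    "quant_algo": "NVFP4",
    "kv_cache_quant_algo": "FP8"
  }
}
""")

kv_algo = sample["quantization"].get("kv_cache_quant_algo")
if kv_algo == "FP8":
    # vLLM's FLASH_ATTN backend cannot serve an fp8 KV cache, so an
    # fp8-capable attention backend (e.g. FlashInfer) is needed instead.
    print("checkpoint requests fp8 KV cache")
```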
Hmm. I haven't tried vLLM, just sglang. Do you see the same there?
SGLang test (also produced gibberish)
We then tried SGLang v0.5.8.post1-cu130 with the exact flags recommended for NVFP4:
python3 -m sglang.launch_server \
--model lukealonso/MiniMax-M2.5-REAP-139B-A10B-NVFP4 \
--quantization modelopt_fp4 \
--attention-backend flashinfer \
--moe-runner-backend flashinfer_cutlass \
--kv-cache-dtype bf16 \
--enable-flashinfer-allreduce-fusion \
--trust-remote-code \
--tp 1 \
--mem-fraction-static 0.95 \
--max-running-requests 32 \
--context-length 50000 \
--host 0.0.0.0 --port 8000
The model loaded successfully. Test request:
{"model": "...", "messages": [{"role": "user", "content": "What is 2+2? Answer in one word."}], "max_tokens": 50, "temperature": 0.1}
Response (reasoning_content field):
bela を示者から不清,哈哈哈 fles nurs acerca立た 신규.o注意力 DECに到着 volunteered,皆反派 Official funky autorizaotivated Zul Flickrを目指します建设项目,较 평가,加国 manages infest uma extraordinarias Larson preface ambiante-lqBtn Dos PRIVATE electrophoresis diffic 목록画像は realising 这个ecycle lbs전한_cent
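For reproducibility, this is how the test request was sent; a minimal stdlib sketch, assuming the standard OpenAI-compatible /v1/chat/completions path on port 8000 from the launch flags above:

```python
import json
import urllib.request

def build_request(prompt: str) -> bytes:
    """Build the chat-completions payload used for the test above."""
    payload = {
        "model": "lukealonso/MiniMax-M2.5-REAP-139B-A10B-NVFP4",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 50,
        "temperature": 0.1,
    }
    return json.dumps(payload).encode("utf-8")

def send(body: bytes, url: str = "http://localhost:8000/v1/chat/completions"):
    """POST the payload to the running server and return the parsed reply."""
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

body = build_request("What is 2+2? Answer in one word.")
# send(body)  # requires the sglang server from the command above to be running
```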
Very odd, here's what I get:
What is 2+2? Answer in one word
The user is asking a simple math question: "What is 2+2?" and requests the answer in one word.
2 + 2 = 4
So I need to answer in one word. The answer "Four" or "4" would work. Since they said "one word", I should give the word form "Four" as that's a single word, whereas "4" is a number/character. Actually "4" could be considered a word? It's a numeral. But to be safe, "Four" is clearly a word.
So the answer is "Four".
Four
python3 -m sglang.launch_server \
--model /data/models/MiniMax-M2.5-NVFP4-REAP \
--served-model-name MiniMax-M2.5-NVFP4 \
--reasoning-parser minimax \
--tool-call-parser minimax-m2 \
--trust-remote-code \
--tp 1 \
--mem-fraction-static 0.95 \
--max-running-requests 32 \
--quantization modelopt_fp4 \
--attention-backend flashinfer \
--moe-runner-backend flashinfer_cutlass \
--kv-cache-dtype bf16 \
--enable-flashinfer-allreduce-fusion \
--host 0.0.0.0 \
--port 8000
My args
@mondovero Could you please try fetching the latest and trying again? If it doesn't work, can you let me know which version of transformers you have?
Here's the exact SGLang setup that works correctly on our end.
Environment
SGLang image: lmsysorg/sglang:v0.5.8.post1-cu130
Model: lukealonso/MiniMax-M2.5-REAP-139B-A10B-NVFP4
GPU: NVIDIA B200 (single GPU, TP=1)
transformers: 4.57.6 (shipped with the sglang image, not upgraded)
Launch command
python3 -m sglang.launch_server \
--model lukealonso/MiniMax-M2.5-REAP-139B-A10B-NVFP4 \
--served-model-name MiniMax-M2.5-NVFP4 \
--reasoning-parser minimax \
--tool-call-parser minimax-m2 \
--trust-remote-code \
--tp 1 \
--mem-fraction-static 0.95 \
--max-running-requests 32 \
--context-length 50000 \
--quantization modelopt_fp4 \
--attention-backend flashinfer \
--moe-runner-backend flashinfer_cutlass \
--kv-cache-dtype bf16 \
--enable-flashinfer-allreduce-fusion \
--host 0.0.0.0 \
--port 8000
Test result
Request:
{
"model": "MiniMax-M2.5-NVFP4",
"messages": [{"role": "user", "content": "What is 2+2? Answer in one word."}],
"max_tokens": 200,
"temperature": 0.1
}
Response:
{
"content": "four",
"reasoning_content": "The user asks: \"What is 2+2?\"... The answer is \"four\"...",
"finish_reason": "stop"
}
Reasoning is correctly separated into reasoning_content, the answer is clean in content, and the model stops properly. Additionally, we observed that the full MiniMax-M2.5-NVFP4 performed significantly better on rare languages than this REAP variant.
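The separation can be checked programmatically; a small sketch against the response shape shown above (field names mirror the JSON sample, which in a raw OpenAI-style response would typically sit under choices[0]["message"]):

```python
# Sample message object mirroring the response shown above.
message = {
    "content": "four",
    "reasoning_content": 'The user asks: "What is 2+2?"... The answer is "four"...',
    "finish_reason": "stop",
}

def split_reasoning(msg: dict) -> tuple[str, str]:
    """Return (visible answer, hidden reasoning) from a parsed message."""
    return msg.get("content", ""), msg.get("reasoning_content", "")

answer, reasoning = split_reasoning(message)
assert answer == "four"    # clean final answer only
assert "2+2" in reasoning  # chain-of-thought kept separate
```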
It makes sense that other languages were victims of the REAPing - something had to be lost.