Request: NVFP4 version of MiniMax-M2.5-REAP-139B (to fit on a single RTX 6000 Pro)
Hi,
First of all, thank you so much for converting the main MiniMax model to NVFP4! It's amazing work and incredibly helpful for the community.
I was wondering if you might be able to do the same NVFP4 conversion for the cerebras/MiniMax-M2.5-REAP-139B-A10B model as well?
If that model were available in NVFP4 format, it would fit perfectly into a single RTX 6000 Pro GPU, which would be a game-changer for my setup.
Thank you again for your time and all your great contributions!
Will do.
Super:) Thank you.
Hello! I wanted to try this model, but I noticed there are no model files here yet (only the README).
Sorry, should be fully uploaded now.
Thank You, I will check:)
Hi @lukealonso ,
I'm trying to run this model on a single NVIDIA RTX PRO 6000 Blackwell Server Edition (96GB) with vLLM v0.14.1-cu130 and the cu130 nightly build, but I consistently get completely incoherent multilingual gibberish output regardless of configuration.
Example output for a simple "hello" prompt:
lymp_A Tanadows rain protease预习 mercenariesolidated fragment歌手 DieseregaraPRESSान meminta超人 dew:|: груп Particular uniformly金龙 reportajeodynam diretrizes.tv Alaska Oklahoma该校...
What I tried
Your recommended config (CUTLASS + FLASH_ATTN):
VLLM_USE_FLASHINFER_MOE_FP4=0
VLLM_NVFP4_GEMM_BACKEND=cutlass
--attention-backend FLASH_ATTN
→ Fails immediately: ValueError: FLASH_ATTN is not valid — kv_cache_dtype not supported. The NVFP4 checkpoint forces an fp8 KV cache, which FLASH_ATTN cannot handle.
CUTLASS MOE + FlashInfer attention (auto) + fp8 KV cache:
VLLM_USE_FLASHINFER_MOE_FP4=0
VLLM_NVFP4_GEMM_BACKEND=cutlass
--kv-cache-dtype fp8_e4m3
→ Loads, but produces gibberish output.
FlashInfer MOE + FlashInfer attention (auto) + fp8 KV cache:
VLLM_USE_FLASHINFER_MOE_FP4=1
--kv-cache-dtype fp8_e4m3
--max-model-len 50000
--tensor-parallel-size 1
→ Loads, but produces gibberish output.
I also tested with and without --enable-expert-parallel, --calculate-kv-scales, --enable-chunked-prefill, etc. Same result every time.
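For context, the fp8 KV-cache requirement appears to come from the checkpoint's quantization metadata rather than from any launch flag. A minimal sketch of how one could check it, assuming a ModelOpt-style hf_quant_config.json with a kv_cache_quant_algo key (the actual file and key names in this checkpoint may differ):

```python
import json

# Hypothetical excerpt of a ModelOpt-style hf_quant_config.json;
# the real file shipped with the checkpoint may use different keys.
sample = json.loads("""
{
  "quantization": {
    "quant_algo": "NVFP4",
    "kv_cache_quant_algo": "FP8"
  }
}
""")

kv_algo = sample["quantization"].get("kv_cache_quant_algo")
if kv_algo == "FP8":
    # vLLM's FLASH_ATTN backend cannot serve an fp8 KV cache, so an
    # fp8-capable attention backend (e.g. FlashInfer) is needed instead.
    print("checkpoint requests fp8 KV cache")
```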
Hmm. I haven't tried vLLM, just sglang. Do you see the same there?
SGLang test (also produced gibberish)
We then tried SGLang v0.5.8.post1-cu130 with the exact flags recommended for NVFP4:
python3 -m sglang.launch_server \
--model lukealonso/MiniMax-M2.5-REAP-139B-A10B-NVFP4 \
--quantization modelopt_fp4 \
--attention-backend flashinfer \
--moe-runner-backend flashinfer_cutlass \
--kv-cache-dtype bf16 \
--enable-flashinfer-allreduce-fusion \
--trust-remote-code \
--tp 1 \
--mem-fraction-static 0.95 \
--max-running-requests 32 \
--context-length 50000 \
--host 0.0.0.0 --port 8000
The model loaded successfully. Test request:
{"model": "...", "messages": [{"role": "user", "content": "What is 2+2? Answer in one word."}], "max_tokens": 50, "temperature": 0.1}
Response (reasoning_content field):
bela を示者から不清,哈哈哈 fles nurs acerca立た 신규.o注意力 DECに到着 volunteered,皆反派 Official funky autorizaotivated Zul Flickrを目指します建设项目,较 평가,加国 manages infest uma extraordinarias Larson preface ambiante-lqBtn Dos PRIVATE electrophoresis diffic 목록画像は realising 这个ecycle lbs전한_cent
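For reproducibility, this is how the test request was sent; a minimal stdlib sketch, assuming the standard OpenAI-compatible /v1/chat/completions path on port 8000 from the launch flags above:

```python
import json
import urllib.request

def build_request(prompt: str) -> bytes:
    """Build the chat-completions payload used for the test above."""
    payload = {
        "model": "lukealonso/MiniMax-M2.5-REAP-139B-A10B-NVFP4",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 50,
        "temperature": 0.1,
    }
    return json.dumps(payload).encode("utf-8")

def send(body: bytes, url: str = "http://localhost:8000/v1/chat/completions"):
    """POST the payload to the running server and return the parsed reply."""
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

body = build_request("What is 2+2? Answer in one word.")
# send(body)  # requires the sglang server from the command above to be running
```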
Very odd, here's what I get:
What is 2+2? Answer in one word
The user is asking a simple math question: "What is 2+2?" and requests the answer in one word.
2 + 2 = 4
So I need to answer in one word. The answer "Four" or "4" would work. Since they said "one word", I should give the word form "Four" as that's a single word, whereas "4" is a number/character. Actually "4" could be considered a word? It's a numeral. But to be safe, "Four" is clearly a word.
So the answer is "Four".
Four
python3 -m sglang.launch_server \
--model /data/models/MiniMax-M2.5-NVFP4-REAP \
--served-model-name MiniMax-M2.5-NVFP4 \
--reasoning-parser minimax \
--tool-call-parser minimax-m2 \
--trust-remote-code \
--tp 1 \
--mem-fraction-static 0.95 \
--max-running-requests 32 \
--quantization modelopt_fp4 \
--attention-backend flashinfer \
--moe-runner-backend flashinfer_cutlass \
--kv-cache-dtype bf16 \
--enable-flashinfer-allreduce-fusion \
--host 0.0.0.0 \
--port 8000
My args
@mondovero Could you please try fetching the latest and trying again? If it doesn't work, can you let me know which version of transformers you have?
Here's the exact SGLang setup that works correctly on our end.
Environment
SGLang image: lmsysorg/sglang:v0.5.8.post1-cu130
Model: lukealonso/MiniMax-M2.5-REAP-139B-A10B-NVFP4
GPU: NVIDIA B200 (single GPU, TP=1)
transformers: 4.57.6 (shipped with the sglang image, not upgraded)
Launch command
python3 -m sglang.launch_server \
--model lukealonso/MiniMax-M2.5-REAP-139B-A10B-NVFP4 \
--served-model-name MiniMax-M2.5-NVFP4 \
--reasoning-parser minimax \
--tool-call-parser minimax-m2 \
--trust-remote-code \
--tp 1 \
--mem-fraction-static 0.95 \
--max-running-requests 32 \
--context-length 50000 \
--quantization modelopt_fp4 \
--attention-backend flashinfer \
--moe-runner-backend flashinfer_cutlass \
--kv-cache-dtype bf16 \
--enable-flashinfer-allreduce-fusion \
--host 0.0.0.0 \
--port 8000
Test result
Request:
{
"model": "MiniMax-M2.5-NVFP4",
"messages": [{"role": "user", "content": "What is 2+2? Answer in one word."}],
"max_tokens": 200,
"temperature": 0.1
}
Response:
{
"content": "four",
"reasoning_content": "The user asks: \"What is 2+2?\"... The answer is \"four\"...",
"finish_reason": "stop"
}
Reasoning is correctly separated into reasoning_content, the answer is clean in content, and the model stops properly. Additionally, we observed that the full MiniMax-M2.5-NVFP4 performed significantly better on rare languages than this REAP variant.
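The separation can be checked programmatically; a small sketch against the response shape shown above (field names mirror the JSON sample, which in a raw OpenAI-style response would typically sit under choices[0]["message"]):

```python
# Sample message object mirroring the response shown above.
message = {
    "content": "four",
    "reasoning_content": 'The user asks: "What is 2+2?"... The answer is "four"...',
    "finish_reason": "stop",
}

def split_reasoning(msg: dict) -> tuple[str, str]:
    """Return (visible answer, hidden reasoning) from a parsed message."""
    return msg.get("content", ""), msg.get("reasoning_content", "")

answer, reasoning = split_reasoning(message)
assert answer == "four"    # clean final answer only
assert "2+2" in reasoning  # chain-of-thought kept separate
```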
It makes sense that other languages were victims of the REAPing - something had to be lost.