accuracy
@mratsim I have this test that I run with agents to understand the quality of the model. The test is simple: a prompt describing BigQuery data and their relations; I then ask agents questions about these data, and I also run the corresponding queries with a script to get the expected answers and compare accuracy. It is multi-turn: the prompt is big, there is a lot of data, and the agents need 3-4 turns to find the answers, so context is about 80-100k tokens for most tests.
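The scoring step can be sketched as below — a minimal, hypothetical harness where `expected` comes from running the BigQuery queries directly and `answers` from the agent (names and data are illustrative, not the real test set):

```python
# Hypothetical sketch of the accuracy comparison described above:
# "expected" holds ground-truth answers computed by a script that runs
# the corresponding BigQuery queries; "answers" holds the agent's output.
def score_run(expected: dict, answers: dict) -> tuple[int, int]:
    """Return (errors, total) for one test run."""
    errors = 0
    for question, expected_value in expected.items():
        if answers.get(question) != expected_value:
            errors += 1
    return errors, len(expected)

expected = {"q1": 42, "q2": "2024-01-01", "q3": 7}
answers = {"q1": 42, "q2": "2024-01-02", "q3": 7}
errors, total = score_run(expected, answers)
print(f"{errors} errors out of {total}")  # 1 errors out of 3
```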
I tested this quant vs @lukealonso's nvfp4, 3 times each.
The AWQ quant consistently made 10-12 errors out of 52 questions. @lukealonso's nvfp4 version has a much lower error rate of 4-5 errors per run.
So, somehow they are now flipped. MiniMax-M2.5 is more accurate in NVFP4 than AWQ.
ah! I would love to share. But these are real bigquery data, not a dataset I can share.
I will run the tests again...
Interesting, I might requant then.
The only thing I changed compared to the previous one is using batch_size=32 from the llmcompressor release.
I see that the default is to truncate, but I might change it to padding or set batch_size to 1:
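To illustrate why this matters for calibration: with batch_size > 1, every sequence in a batch must share one length, so the batching either truncates long samples (losing calibration tokens) or pads short ones. A toy sketch, with no real tokenizer involved:

```python
# Toy illustration (no real tokenizer) of truncation vs padding in a
# calibration batch: under truncation the long sample loses tokens,
# under padding nothing is lost but filler tokens are added.
PAD = 0

def batch(seqs, max_length, mode):
    if mode == "truncate":
        return [s[:max_length] for s in seqs]
    if mode == "pad":
        longest = max(len(s) for s in seqs)
        return [s + [PAD] * (longest - len(s)) for s in seqs]
    raise ValueError(mode)

seqs = [[1, 2, 3], list(range(1, 101))]  # one short, one long calibration sample
trunc = batch(seqs, max_length=32, mode="truncate")
pad = batch(seqs, max_length=32, mode="pad")

print(len(trunc[1]))  # 32  -> 68 calibration tokens of the long sample lost
print(len(pad[1]))    # 100 -> nothing lost; short sample padded to 100
```

With batch_size=1 the question disappears entirely, since each sample keeps its own length.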
@ktsaou
1). NVFP4 is within 1% of FP8 per Nvidia's own tests, which matches what most have seen in the wild across broad tests.
2). INT4, even with some tensors kept at BF16, is still INT4. In my edited vLLM with real PPL measurements, W4A16 deviates from FP8 by ~7%, whereas INT8 deviates by ~0.018%.
3). NVFP4 will ALWAYS be more accurate than INT4, and INT8 will almost always be more accurate than NVFP4.
4). @mratsim was playing with batch sizing; I saw it deliver INSANE speed, but I've seen every model I quanted lose accuracy when using ANY batch size > 1. LLM Compressor warns that truncation may occur, and EXTREME truncation occurs at batch sizes > 16.
TLDR: This is normal for NVFP4 when compared to ANY INT4. @mratsim will requant at batch size 1, and I would expect fewer errors, but not NVFP4 levels of fewer.
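For reference, the deviation figures above are just the relative perplexity drift of the quantized model against the FP8 baseline; a quick sketch of the formula (the perplexity values below are made up, only the computation is the point):

```python
# Relative perplexity deviation of a quantized model vs an FP8 baseline,
# in percent. Input perplexities here are invented for illustration.
def ppl_deviation(ppl_quant: float, ppl_baseline: float) -> float:
    """Relative perplexity increase over the baseline, in percent."""
    return (ppl_quant - ppl_baseline) / ppl_baseline * 100.0

print(round(ppl_deviation(8.56, 8.0), 2))  # 7.0 -> a ~7% deviation
```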
@ktsaou If you want to compare the quant @mratsim did here, you should compare another person's normal W4A16 (INT4) against this one. That way you can see whether the BF16 layers actually make a difference, but remember: MAKE SURE YOU KNOW THE GROUP SIZE before comparing. A W4A16_GS32 is demonstrably better than a W4A16_GS128 when observing nuance and context.
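The group-size effect is easy to demonstrate on toy weights: with symmetric int4 quantization and one scale per group, smaller groups track outliers more tightly, so GS32 typically shows lower reconstruction error than GS128. A rough sketch (illustrative only, not the AWQ algorithm itself):

```python
# Toy demo of why group size matters for W4A16: symmetric int4
# quantization with one scale per group. An outlier inflates the scale
# of its whole group, so larger groups quantize more weights coarsely.
import numpy as np

def quantize_int4(w, group_size):
    g = w.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0  # int4 range [-8, 7]
    q = np.clip(np.round(g / scale), -8, 7)
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=4096)
w[::97] *= 10  # sprinkle in some outlier weights

err32 = np.mean((w - quantize_int4(w, 32)) ** 2)
err128 = np.mean((w - quantize_int4(w, 128)) ** 2)
print(err32 < err128)  # smaller groups -> lower reconstruction error here
```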
Thank you @shambler74. Yes, you're right. However, for MiniMax-M2.1 the quality was flipped between the 2 quant types. We were discussing this at https://huggingface.co/mratsim/MiniMax-M2.1-FP8-INT4-AWQ/discussions/9#698f3598ff0cc62f5009fb56 - @mratsim did a great job helping @lukealonso understand how to get a max-quality quant, and it seems it paid off.
New quant with batch_size=1 uploaded
do you have some updates on the batch size 1 comparison ?
I downloaded the updated quant and ran the tests again.
I said above that the AWQ quant gives 10-12 errors out of 52, while nvfp4 gives 4-5 errors out of 52.
@lukealonso has also made changes to the nvfp4 quant and I also improved my prompts a bit, and now the nvfp4 version gives 0 errors. It reliably passes all 52 tests every time. Check https://huggingface.co/lukealonso/MiniMax-M2.5-NVFP4/discussions/2#6991b62b1df8b49d64736b47 for more info.
@mratsim, however, the AWQ quant still fails on ~7 cases per run. I identified 2 distinct issues:
ISSUE 1
The model turns PLURAL JSON field names into SINGULAR, although the data values are correct. So, it does the work properly and accurately, and there are no retries due to schema validation errors, but it randomly drops the trailing "s" from non-required JSON field names. Examples:
customers -> customer
business_subscriptions -> business_subscription
community_nodes -> community_node
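Since the singularized names only hit non-required fields (so schema validation still passes), one way to surface them is a strict key check against the expected schema. A hypothetical guard, with illustrative field names taken from the examples above:

```python
# Hypothetical guard for ISSUE 1: flag any emitted JSON key that is not
# in the expected schema, catching silently singularized field names
# that optional-field validation would otherwise let through.
EXPECTED_KEYS = {"customers", "business_subscriptions", "community_nodes"}

def unexpected_keys(payload: dict) -> set[str]:
    """Keys the model emitted that are not in the expected schema."""
    return set(payload) - EXPECTED_KEYS

bad = unexpected_keys({"customers": 5, "business_subscription": 2})
print(sorted(bad))  # ['business_subscription']
```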
ISSUE 2
In a couple of cases it was hitting a wall, generating wrong SQL queries again and again (validation errors from BigQuery), which resulted in partial responses.
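One way an agent harness can avoid burning turns on this failure mode is to cap the retry loop on SQL validation. A generic sketch (`validate` stands in for a BigQuery dry-run or schema check; the names are hypothetical):

```python
# Sketch of capping the ISSUE 2 retry loop: try candidate SQL strings
# until one validates, but give up after a fixed number of attempts
# instead of regenerating broken queries forever.
def run_with_retries(candidates, validate, max_attempts=3):
    """Return (first valid SQL or None, attempts used)."""
    for attempt, sql in enumerate(candidates[:max_attempts], start=1):
        if validate(sql):
            return sql, attempt
    return None, min(len(candidates), max_attempts)

queries = ["SELECT * FORM t", "SELECT * FROM t"]  # first one is invalid
sql, attempts = run_with_retries(queries, lambda q: " FROM " in q)
print(sql, attempts)  # SELECT * FROM t 2
```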
This is a 1:1 test between AWQ and NVFP4: same h/w, same vllm (0.15.1), same prompts, same agentic software, same data, same tools.
In case I am doing something wrong, here are the 2 recipes:
Recipe for AWQ:
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export SAFETENSORS_FAST_GPU=1
export NVIDIA_TF32_OVERRIDE=1
export NCCL_ALGO=Ring
export NCCL_PROTO=Simple
export NCCL_MIN_NCHANNELS=8
export NCCL_MAX_NCHANNELS=16
export NCCL_BUFFSIZE=16777216
export NCCL_P2P_DISABLE=0
export NCCL_IB_DISABLE=1
export NCCL_NVLS_ENABLE=0
export NCCL_SHM_DISABLE=0
export PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512
SAMPLER_OVERRIDE='{"temperature": 1, "top_p": 0.95, "top_k": 40, "repetition_penalty": 1.1, "frequency_penalty": 0.40}'
export VLLM_SLEEP_WHEN_IDLE=1
exec env CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-0,1} /opt/vllm-0.15.1/bin/vllm serve mratsim/Minimax-M2.5-BF16-INT4-AWQ \
--host 0.0.0.0 \
--port 8354 \
--served-model-name minimax-m2.5 \
--trust-remote-code \
--gpu-memory-utilization 0.94 \
--tensor-parallel-size 2 \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--enable-chunked-prefill \
--enable-prefix-caching \
--max-num-batched-tokens 32768 \
--kv-cache-dtype fp8_e4m3 \
--attention-config.disable_flashinfer_q_quantization True \
--max-model-len 196608 \
--max-num-seqs 64 \
--dtype auto \
--compilation-config "{\"cudagraph_mode\": \"PIECEWISE\"}" \
--attention-config.backend FLASHINFER \
--override-generation-config "${SAMPLER_OVERRIDE}" \
--disable-custom-all-reduce
Recipe for NVFP4:
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS=1
export VLLM_FLASHINFER_MOE_BACKEND=throughput
export VLLM_NVFP4_GEMM_BACKEND=cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export SAFETENSORS_FAST_GPU=1
export NVIDIA_TF32_OVERRIDE=1
export NCCL_ALGO=Ring
export NCCL_PROTO=Simple
export NCCL_MIN_NCHANNELS=8
export NCCL_MAX_NCHANNELS=16
export NCCL_BUFFSIZE=16777216
export NCCL_P2P_DISABLE=0
export NCCL_IB_DISABLE=1
export NCCL_NVLS_ENABLE=0
export NCCL_SHM_DISABLE=0
export PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512
SAMPLER_OVERRIDE='{"temperature": 1, "top_p": 0.95, "top_k": 40, "repetition_penalty": 1.1, "frequency_penalty": 0.40}'
export VLLM_SLEEP_WHEN_IDLE=1
exec env CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-0,1} /opt/vllm-0.15.1/bin/vllm serve lukealonso/MiniMax-M2.5-NVFP4 \
--host 0.0.0.0 \
--port 8354 \
--served-model-name minimax-m2.5 \
--trust-remote-code \
--gpu-memory-utilization 0.94 \
--tensor-parallel-size 2 \
--pipeline-parallel-size 1 \
--max-model-len 196608 \
--max-num-seqs 16 \
--max-num-batched-tokens 32768 \
--dtype auto \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--compilation-config "{\"cudagraph_mode\": \"PIECEWISE\"}" \
--enable-prefix-caching \
--enable-chunked-prefill \
--attention-config.backend FLASHINFER \
--kv-cache-dtype fp8_e4m3 \
--attention-config.disable_flashinfer_q_quantization True \
--all2all-backend pplx \
--enable-expert-parallel \
--disable-custom-all-reduce \
--override-generation-config "${SAMPLER_OVERRIDE}"
ISSUE 1
The model turns PLURAL JSON field names into SINGULAR
That has been driving me crazy, I'm not even sure how that can happen.
This PR is full of this https://github.com/mratsim/delulu/pull/5#pullrequestreview-3816898376
and I only asked to copy a yaml file from one repo to the other.
I'm not too sure what could be triggering this and I'm away for 3 weeks.
did you use fp8 kv cache? with fp16 kv cache i haven't seen issues using it in roocode
I use fp8 kv. I cannot use fp16 kv (I need long context and parallel queries).
I think it does not happen all the time. It is kind of random. Something triggers it, but I am not sure what.
It could be long-context related. My prompt alone is 57k tokens.
i was curious whether it still happens without fp8, and also whether it depends on using vllm or sglang..
Hi Mamy,
I am seconding your observations and others related to m2.5. Yesterday a long coding sprint ended in disaster for me when I asked the model to create an init.sh to integrate its development stack (mainly Next.js + Python) into a CI/CD pipeline.
Because it cannot manage plurals properly, it went in circles for more than 2h regenerating the same stupid build, going for "Product" then switching back to "Products", until I couldn't take it any further.
Given the amount of negative feedback I've got from other sources too, it looks like the devil is in some low-bit quantized values that simply go bananas, and not specifically in your cooking recipe. Could be that, despite proper care to activations, the experts dealing with "lexical consistency and alignment" fire at lower activation levels and eventually get 'brainwashed' by some low-bit quantized value.
I have had a better experience with Qwen3.5 397b, despite running it 4 times slower in llama.cpp.