accuracy
@mratsim I have this test that I run with agents to understand the quality of the model. The test is simple: a prompt describing BigQuery data and their relations; I then ask agents questions about these data, and I also run the corresponding queries with a script to get the expected answers and compare accuracy. It is multi-turn: the prompt is big, there is a lot of data, and the agents need 3-4 turns to find the answers, so context is about 80-100k tokens for most tests.
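The scoring step can be sketched as below — a minimal, hypothetical harness where `expected` comes from running the BigQuery queries directly and `answers` from the agent (names and data are illustrative, not the real test set):

```python
# Hypothetical sketch of the accuracy comparison described above:
# "expected" holds ground-truth answers computed by a script that runs
# the corresponding BigQuery queries; "answers" holds the agent's output.
def score_run(expected: dict, answers: dict) -> tuple[int, int]:
    """Return (errors, total) for one test run."""
    errors = 0
    for question, expected_value in expected.items():
        if answers.get(question) != expected_value:
            errors += 1
    return errors, len(expected)

expected = {"q1": 42, "q2": "2024-01-01", "q3": 7}
answers = {"q1": 42, "q2": "2024-01-02", "q3": 7}
errors, total = score_run(expected, answers)
print(f"{errors} errors out of {total}")  # 1 errors out of 3
```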
I tested this quant vs @lukealonso's nvfp4, 3 times each.
The AWQ quant consistently made 10-12 errors out of 52 questions. @lukealonso's nvfp4 version has a much lower error rate of 4-5 errors per run.
So, somehow they are now flipped. MiniMax-M2.5 is more accurate in NVFP4 than AWQ.
ah! I would love to share. But these are real bigquery data, not a dataset I can share.
I will run the tests again...
Interesting, I might requant then.
The only thing I changed compared to the previous one is using batch_size=32 from the llmcompressor release.
I see that the default is to truncate, but I might change it to padding or set batch_size to 1:
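To illustrate why this matters for calibration: with batch_size > 1, every sequence in a batch must share one length, so the batching either truncates long samples (losing calibration tokens) or pads short ones. A toy sketch, with no real tokenizer involved:

```python
# Toy illustration (no real tokenizer) of truncation vs padding in a
# calibration batch: under truncation the long sample loses tokens,
# under padding nothing is lost but filler tokens are added.
PAD = 0

def batch(seqs, max_length, mode):
    if mode == "truncate":
        return [s[:max_length] for s in seqs]
    if mode == "pad":
        longest = max(len(s) for s in seqs)
        return [s + [PAD] * (longest - len(s)) for s in seqs]
    raise ValueError(mode)

seqs = [[1, 2, 3], list(range(1, 101))]  # one short, one long calibration sample
trunc = batch(seqs, max_length=32, mode="truncate")
pad = batch(seqs, max_length=32, mode="pad")

print(len(trunc[1]))  # 32  -> 68 calibration tokens of the long sample lost
print(len(pad[1]))    # 100 -> nothing lost; short sample padded to 100
```

With batch_size=1 the question disappears entirely, since each sample keeps its own length.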
@ktsaou
1). NVFP4 is within 1% of FP8 per Nvidia's own tests, which matches what most have seen in the wild across broad tests.
2). INT4, even with some tensors kept at BF16, is still INT4. In my edited vLLM with real PPL measurements, W4A16 deviates from FP8 by ~7%, whereas INT8 deviates by ~0.018%.
3). NVFP4 will ALWAYS be more accurate than INT4, and INT8 will almost always be more accurate than NVFP4.
4). @mratsim was playing with batch sizing; I saw it deliver INSANE speed, but I've seen every model I quanted lose accuracy when using ANY batch size > 1. LLM Compressor warns that truncation may occur, and EXTREME truncation occurs at batch sizes > 16.
TLDR: This is normal for NVFP4 when compared to ANY INT4. @mratsim will requant at batch size 1, and I would expect fewer errors, but not NVFP4 levels of fewer.
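For reference, the deviation figures above are just the relative perplexity drift of the quantized model against the FP8 baseline; a quick sketch of the formula (the perplexity values below are made up, only the computation is the point):

```python
# Relative perplexity deviation of a quantized model vs an FP8 baseline,
# in percent. Input perplexities here are invented for illustration.
def ppl_deviation(ppl_quant: float, ppl_baseline: float) -> float:
    """Relative perplexity increase over the baseline, in percent."""
    return (ppl_quant - ppl_baseline) / ppl_baseline * 100.0

print(round(ppl_deviation(8.56, 8.0), 2))  # 7.0 -> a ~7% deviation
```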
@ktsaou If you want to compare the quant @mratsim did here, you should compare another person's normal W4A16 (INT4) against this one. That way you can see whether the BF16 layers actually make a difference, but remember: MAKE SURE YOU KNOW THE GROUP SIZE before comparing. A W4A16_GS32 is demonstrably better than a W4A16_GS128 when observing nuance and context.
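The group-size effect is easy to demonstrate on toy weights: with symmetric int4 quantization and one scale per group, smaller groups track outliers more tightly, so GS32 typically shows lower reconstruction error than GS128. A rough sketch (illustrative only, not the AWQ algorithm itself):

```python
# Toy demo of why group size matters for W4A16: symmetric int4
# quantization with one scale per group. An outlier inflates the scale
# of its whole group, so larger groups quantize more weights coarsely.
import numpy as np

def quantize_int4(w, group_size):
    g = w.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0  # int4 range [-8, 7]
    q = np.clip(np.round(g / scale), -8, 7)
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=4096)
w[::97] *= 10  # sprinkle in some outlier weights

err32 = np.mean((w - quantize_int4(w, 32)) ** 2)
err128 = np.mean((w - quantize_int4(w, 128)) ** 2)
print(err32 < err128)  # smaller groups -> lower reconstruction error here
```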
Thank you @shambler74. Yes, you're right. However, for MiniMax-M2.1 the quality was flipped between the 2 quant types. We were discussing this at https://huggingface.co/mratsim/MiniMax-M2.1-FP8-INT4-AWQ/discussions/9#698f3598ff0cc62f5009fb56 - @mratsim did a great job helping @lukealonso understand how to get a max-quality quant, and it seems it paid off.
New quant with batch_size=1 uploaded
do you have some updates on the batch size 1 comparison ?
I downloaded the updated quant and ran the tests again.
I said above that the AWQ quant gives 10-12 errors out of 52, while nvfp4 gives 4-5 errors out of 52.
@lukealonso has also made changes to the nvfp4 quant and I also improved my prompts a bit, and now the nvfp4 version gives 0 errors. It reliably passes all 52 tests every time. Check https://huggingface.co/lukealonso/MiniMax-M2.5-NVFP4/discussions/2#6991b62b1df8b49d64736b47 for more info.
@mratsim, however, the AWQ quant still fails on ~7 cases per run. I identified 2 distinct issues:
ISSUE 1
The model turns PLURAL JSON field names into SINGULAR, although the data values are correct. So, it does the work properly and accurately, and there are no retries due to schema validation errors, but it randomly drops the trailing "s" from non-required JSON field names. Examples:
customers -> customer
business_subscriptions -> business_subscription
community_nodes -> community_node
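Since the singularized names only hit non-required fields (so schema validation still passes), one way to surface them is a strict key check against the expected schema. A hypothetical guard, with illustrative field names taken from the examples above:

```python
# Hypothetical guard for ISSUE 1: flag any emitted JSON key that is not
# in the expected schema, catching silently singularized field names
# that optional-field validation would otherwise let through.
EXPECTED_KEYS = {"customers", "business_subscriptions", "community_nodes"}

def unexpected_keys(payload: dict) -> set[str]:
    """Keys the model emitted that are not in the expected schema."""
    return set(payload) - EXPECTED_KEYS

bad = unexpected_keys({"customers": 5, "business_subscription": 2})
print(sorted(bad))  # ['business_subscription']
```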
ISSUE 2
In a couple of cases it was hitting a wall, generating wrong SQL queries again and again (validation errors from BigQuery), which resulted in partial responses.
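One way an agent harness can avoid burning turns on this failure mode is to cap the retry loop on SQL validation. A generic sketch (`validate` stands in for a BigQuery dry-run or schema check; the names are hypothetical):

```python
# Sketch of capping the ISSUE 2 retry loop: try candidate SQL strings
# until one validates, but give up after a fixed number of attempts
# instead of regenerating broken queries forever.
def run_with_retries(candidates, validate, max_attempts=3):
    """Return (first valid SQL or None, attempts used)."""
    for attempt, sql in enumerate(candidates[:max_attempts], start=1):
        if validate(sql):
            return sql, attempt
    return None, min(len(candidates), max_attempts)

queries = ["SELECT * FORM t", "SELECT * FROM t"]  # first one is invalid
sql, attempts = run_with_retries(queries, lambda q: " FROM " in q)
print(sql, attempts)  # SELECT * FROM t 2
```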
This is a 1:1 test between AWQ and NVFP4: same h/w, same vllm (0.15.1), same prompts, same agentic software, same data, same tools.
In case I am doing something wrong, here are the 2 recipes:
Recipe for AWQ:
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export SAFETENSORS_FAST_GPU=1
export NVIDIA_TF32_OVERRIDE=1
export NCCL_ALGO=Ring
export NCCL_PROTO=Simple
export NCCL_MIN_NCHANNELS=8
export NCCL_MAX_NCHANNELS=16
export NCCL_BUFFSIZE=16777216
export NCCL_P2P_DISABLE=0
export NCCL_IB_DISABLE=1
export NCCL_NVLS_ENABLE=0
export NCCL_SHM_DISABLE=0
export PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512
SAMPLER_OVERRIDE='{"temperature": 1, "top_p": 0.95, "top_k": 40, "repetition_penalty": 1.1, "frequency_penalty": 0.40}'
export VLLM_SLEEP_WHEN_IDLE=1
exec env CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-0,1} /opt/vllm-0.15.1/bin/vllm serve mratsim/Minimax-M2.5-BF16-INT4-AWQ \
--host 0.0.0.0 \
--port 8354 \
--served-model-name minimax-m2.5 \
--trust-remote-code \
--gpu-memory-utilization 0.94 \
--tensor-parallel-size 2 \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--enable-chunked-prefill \
--enable-prefix-caching \
--max-num-batched-tokens 32768 \
--kv-cache-dtype fp8_e4m3 \
--attention-config.disable_flashinfer_q_quantization True \
--max-model-len 196608 \
--max-num-seqs 64 \
--dtype auto \
--compilation-config "{\"cudagraph_mode\": \"PIECEWISE\"}" \
--attention-config.backend FLASHINFER \
--override-generation-config "${SAMPLER_OVERRIDE}" \
--disable-custom-all-reduce
Recipe for NVFP4:
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS=1
export VLLM_FLASHINFER_MOE_BACKEND=throughput
export VLLM_NVFP4_GEMM_BACKEND=cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export SAFETENSORS_FAST_GPU=1
export NVIDIA_TF32_OVERRIDE=1
export NCCL_ALGO=Ring
export NCCL_PROTO=Simple
export NCCL_MIN_NCHANNELS=8
export NCCL_MAX_NCHANNELS=16
export NCCL_BUFFSIZE=16777216
export NCCL_P2P_DISABLE=0
export NCCL_IB_DISABLE=1
export NCCL_NVLS_ENABLE=0
export NCCL_SHM_DISABLE=0
export PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512
SAMPLER_OVERRIDE='{"temperature": 1, "top_p": 0.95, "top_k": 40, "repetition_penalty": 1.1, "frequency_penalty": 0.40}'
export VLLM_SLEEP_WHEN_IDLE=1
exec env CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-0,1} /opt/vllm-0.15.1/bin/vllm serve lukealonso/MiniMax-M2.5-NVFP4 \
--host 0.0.0.0 \
--port 8354 \
--served-model-name minimax-m2.5 \
--trust-remote-code \
--gpu-memory-utilization 0.94 \
--tensor-parallel-size 2 \
--pipeline-parallel-size 1 \
--max-model-len 196608 \
--max-num-seqs 16 \
--max-num-batched-tokens 32768 \
--dtype auto \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--compilation-config "{\"cudagraph_mode\": \"PIECEWISE\"}" \
--enable-prefix-caching \
--enable-chunked-prefill \
--attention-config.backend FLASHINFER \
--kv-cache-dtype fp8_e4m3 \
--attention-config.disable_flashinfer_q_quantization True \
--all2all-backend pplx \
--enable-expert-parallel \
--disable-custom-all-reduce \
--override-generation-config "${SAMPLER_OVERRIDE}"
ISSUE 1
The model turns PLURAL JSON field names into SINGULAR
That has been driving me crazy, I'm not even sure how that can happen.
This PR is full of this https://github.com/mratsim/delulu/pull/5#pullrequestreview-3816898376
and I only asked to copy a yaml file from one repo to the other.
I'm not too sure what could be triggering this and I'm away for 3 weeks.
did you use fp8 kv cache? with fp16 kv cache i haven't seen issues using it in roocode
I use fp8 kv. I cannot use fp16 kv (I need long context and parallel queries).
I think it does not happen all the time. It is kind of random. Something triggers it, but I am not sure what.
It could be long-context related. My prompt alone is 57k tokens.
i was curious whether it still happens without fp8, and also whether it depends on using vllm or sglang..
Hi Mamy,
I am seconding your observations and others related to m2.5. Yesterday a long coding sprint ended in disaster for me when I asked the model to create an init.sh to integrate its development stack (mainly Next.js + Python) into a CI/CD pipeline.
Because it cannot manage plurals properly, it went in circles for more than 2h regenerating the same stupid build, going for "Product" then switching back to "Products", until I couldn't take it any further.
Given the amount of negative feedback I've got from other sources too, it looks like the devil is in some low-bit quantized values that simply go bananas, and not specifically in your cooking recipe. Could be that, despite proper care to activations, the experts dealing with "lexical consistency and alignment" fire at lower activation levels and eventually get 'brainwashed' by some low-bit quantized value.
I have had a better experience with Qwen3.5 397b, despite running it 4 times slower in llama.cpp.