Hoping for your magic on MiniMax-M2.7-FP8-INT4-AWQ quant

#2
by dehnhaide - opened

Hei Mamy,

I know / feel you're already working on this, and I'm keeping my fingers crossed that this time the quant won't suffer from the same issues M2.5 suffered from.
Many thanks for your effort!

P.S. I don't think a MiniMax M2.7 (Mixed-Precision BF16 + INT4 AWQ) would be needed anymore. I tried the FP8 quant on 8x3090 and it worked like a champ!

Yes, I'm planning to.

I'm still doing the BF16+INT4 because it's the base quant that I recombine with the original upstream quant (FP8 from upstream, INT4 from my own).
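For illustration, that recombination step can be sketched as a state-dict merge: expert weights come from my own INT4 quant, everything else from the upstream FP8 checkpoint. The function, the `.experts.` key pattern, and the toy tensors below are hypothetical, not the actual pipeline:

```python
# Hypothetical sketch of recombining two quantized checkpoints:
# expert MLP weights from the INT4 quant, everything else
# (attention, embeddings, router) from the upstream FP8 quant.

def recombine(upstream_fp8: dict, own_int4: dict) -> dict:
    """Merge two state dicts, preferring INT4 tensors for expert layers."""
    merged = dict(upstream_fp8)           # start from the FP8 checkpoint
    for name, tensor in own_int4.items():
        if ".experts." in name:           # expert MLPs come from the INT4 quant
            merged[name] = tensor
    return merged

# Toy example with strings standing in for tensors:
fp8 = {"model.layers.0.self_attn.q_proj.weight": "fp8",
       "model.layers.0.mlp.experts.0.w1.weight": "fp8"}
int4 = {"model.layers.0.mlp.experts.0.w1.weight": "int4"}
merged = recombine(fp8, int4)
```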

And for now SGLang does not support llm-compressor mixed-precision quants :/. I plan to switch to ModelOpt at some point, but it's time-consuming.
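For context, a mixed-precision quant maps different schemes to different module groups. The dict below is a hypothetical sketch loosely modeled on the compressed-tensors config-group structure; it is not a recipe SGLang (or any runtime) is guaranteed to accept as-is:

```python
# Hypothetical mixed-precision config sketch (structure loosely modeled
# on compressed-tensors config groups; illustrative only).
mixed_precision_config = {
    "config_groups": {
        "group_fp8": {
            "targets": ["re:.*self_attn.*"],   # attention stays FP8
            "weights": {"num_bits": 8, "type": "float"},
        },
        "group_int4": {
            "targets": ["re:.*experts.*"],     # expert MLPs go to INT4 AWQ
            "weights": {"num_bits": 4, "type": "int", "group_size": 128},
        },
    },
    "ignore": ["lm_head"],                     # keep the head unquantized
}
```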

I'm really looking forward to it; with 2 Sparks this fits perfectly and currently gives the best tk/s. Any idea when we can expect it to drop here? :)

Here you're grateful because it's coming - and you're not asking for deadlines... patience is a virtue, right? ;)

I'm very grateful, I've just been waiting for this since 2.7 was announced. I'm happy with every bit of info, and if there is none I will try to wait as patiently as possible 🤤

There are 3 time-consuming things to do:

  1. Quantization itself
  2. Update my quantization setup to Transformers v5 to ensure I don't get hit by positional-embedding changes between Transformers v4 and v5: https://huggingface.co/MiniMaxAI/MiniMax-M2.5/discussions/48. This might be a time sink, because someone tried to adapt my llm-compressor PR to Transformers v5 and it got reverted due to many hidden bugs 🤷 https://github.com/vllm-project/llm-compressor/pull/2485
  3. MiniMax-M2.7 has three focus areas: software engineering, business in general (Word/Excel/PPT, but also domains like HR, Finance, Legal, ...), and entertainment (creative writing, ...), so I need to merge my software_engineering + creative calibration sets and hopefully find new datasets that stress-test the plurals-before-special-characters issue I found in https://huggingface.co/MiniMaxAI/MiniMax-M2.5/discussions/48. For example, a GitHub Actions dataset.
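The calibration-set merge in the last step could be sketched like this; the sample contents, set sizes, and budget are placeholders, not the real calibration data:

```python
import random

# Sketch of merging domain-specific calibration sets into one pool
# (dataset contents and sizes are placeholders, not the real data).
software = [f"code sample {i}" for i in range(100)]
creative = [f"creative sample {i}" for i in range(100)]
gha      = [f"github actions sample {i}" for i in range(100)]  # hypothetical new set

rng = random.Random(42)          # fixed seed so calibration is reproducible
pool = software + creative + gha
rng.shuffle(pool)                # mix domains so calibration batches are diverse
calib = pool[:200]               # cap at the calibration budget
```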
> 2. Update my quantization setup to Transformers v5 to ensure I don't get hit by positional-embedding changes between Transformers v4 and v5: https://huggingface.co/MiniMaxAI/MiniMax-M2.5/discussions/48. This might be a time sink, because someone tried to adapt my llm-compressor PR to Transformers v5 and it got reverted due to many hidden bugs 🤷 https://github.com/vllm-project/llm-compressor/pull/2485

I've seen it and was a bit puzzled about how that could happen... but hey, vLLM is huge and has armies of maintainers, each with their own ideas...

> 3. MiniMax-M2.7 has three focus areas: software engineering, business in general (Word/Excel/PPT, but also domains like HR, Finance, Legal, ...), and entertainment (creative writing, ...), so I need to merge my software_engineering + creative calibration sets and hopefully find new datasets that stress-test the plurals-before-special-characters issue I found in https://huggingface.co/MiniMaxAI/MiniMax-M2.5/discussions/48. For example, a GitHub Actions dataset.

BTW... both M2.5 and M2.7 are very well behaved (meaning no plurals & punctuation drift) on the Q5_K_M and UD_Q4_K_XL quants on ik_llama, while performance is more than acceptable (>50 tok/s on 8x3090s). The head-scratcher is that most of the quanters behind the GGUFs never go to the lengths of the expert-activation calibration you've done, and still (at least for coding!) the GGUFs seem immune to such issues..
Or call me a fool for not being able to uncover them so far!

I think Transformers v5 changed how rotary positional embeddings are handled. I can't dive in right now, but I remember from submitting the MiniMax PR that some RoPE types/functions were modified.

If a model was trained with the unmodified RoPE and is then run with the modified one, it may have issues.

Note this is only speculation, but it's what makes the most sense to me given that even MiniMax's original FP8 weights are reported to be buggy with Transformers v5.
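That kind of mismatch can be illustrated numerically: two common RoPE layouts (GPT-J-style interleaved pairs vs GPT-NeoX-style half-split) rotate the same frequencies but pair up different dimensions, so weights trained under one layout produce different activations under the other. A minimal numpy sketch (not the actual Transformers code):

```python
import numpy as np

def rope_half_split(x, pos, base=10000.0):
    """GPT-NeoX-style RoPE: dim i is paired with dim i + d/2."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d // 2) * 2.0 / d)
    theta = pos * inv_freq
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[: d // 2], x[d // 2 :]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

def rope_interleaved(x, pos, base=10000.0):
    """GPT-J-style RoPE: adjacent dims (2i, 2i+1) form a pair."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d // 2) * 2.0 / d)
    theta = pos * inv_freq
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

x = np.arange(8, dtype=np.float64)
a = rope_half_split(x, pos=3)
b = rope_interleaved(x, pos=3)
# Same input, same frequencies, different pairing -> different vectors,
# even though both are norm-preserving rotations.
```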

I appreciate the work!!! 😊

The latest llm-compressor (commit 3c9d4fd, as of Apr 10, 2026) changed the offloading backend from hf/accelerate to a custom one in llm-compressor, which I assume landed in v0.10 (March) and onward: https://github.com/vllm-project/llm-compressor/releases/tag/0.10.0

Unfortunately, llm-compressor now tries to load the full model into RAM/swap despite my trying various configurations of:

```python
from transformers import AutoModelForCausalLM

with offloaded_model():
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto_offload",
        offload_folder="./offload_folder",
    )
```

It gets stuck swapping there.

Hence I reverted to an old branch of llm-compressor that I know works. This means I will quantize with Transformers v4. Hopefully the M2.5 issues don't appear there.

Also, I'm trying a new quantization strategy. Instead of calibrating all experts, which might trigger large activations if an expert receives an input it wasn't supposed to and drown out the inputs it actually specializes in, I will significantly increase the calibration set (2K samples instead of 0.6K, with more diversity). This will hopefully improve quality (or make it worse if my dataset is not general enough, but it can't be worse than REAP).

Also, this might be my last 4-bit quant in llm-compressor for a while; it has too many issues I'm tired of dealing with.

So I'm looking into ModelOpt, which might require some time to adapt my scripts.

Well, I've been fighting with llm-compressor and compressed-tensors, but somehow I can't get my setup to quantize MiniMax without needing hundreds of GB of swap space:

llm-compressor + compressed-tensors from Feb 13 used to work: https://github.com/mratsim/llm-compressor/commits/minimax-m2/

but now no combination of offload_device (an AWQModifier parameter), offload_folder (an AutoModelForCausalLM.from_pretrained parameter), or sequential_offload_device lets me avoid swapping on the very first layer on my 192GB RAM machine.
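For scale, a back-of-the-envelope estimate shows why loading the full model in RAM can't work on 192GB (the ~230B total-parameter count below is an assumption for illustration, not an official figure):

```python
# Back-of-the-envelope RAM estimate for quantizing a large MoE
# (the parameter count is an assumption, not an official figure).
total_params = 230e9                  # assumed total parameters, experts included

bf16_gb = total_params * 2 / 1e9      # 2 bytes per parameter in BF16
int4_gb = total_params * 0.5 / 1e9    # ~0.5 bytes per parameter at 4-bit (scales ignored)

# ~460 GB of BF16 weights cannot fit in 192 GB of RAM, so without
# working layer-by-layer disk offload the process must hit swap.
```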

Furthermore, I apparently narrowly avoided this PR, https://github.com/vllm-project/compressed-tensors/pull/572, which changed loading/offloading in my previous quants.

And it was already really a mess

And now there is offload_dir too https://github.com/vllm-project/compressed-tensors/pull/650

Last but not least, according to @phaedawg, with the latest llm-compressor + compressed-tensors, disk offloading does not work with the sequential pipeline.

So the weights are on hiatus. Sorry folks.

PS: I wonder if maybe I quantized using 1TB of swap space and forgot about it because of how painful it was.
