Hoping for your magic on MiniMax-M2.7-FP8-INT4-AWQ quant

#2
by dehnhaide - opened

Hei Mamy,

I know / feel you're already working on this, and I'm keeping my fingers crossed that this time the quant won't suffer from the same issues M2.5 suffered from.
Many thanks for your effort!

P.S. I don't think a MiniMax M2.7 (Mixed-Precision BF16 + INT4 AWQ) would be needed anymore. I tried the FP8 quant on 8x3090 and it worked like a champ!

Yes, I'm planning to.

I'm still doing the BF16+INT4 because it's the base quant that I recombine with the original upstream quant (FP8 from upstream, INT4 from my own).
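For illustration, that recombination step can be sketched as a state-dict merge: expert weights come from my own INT4 quant, everything else from the upstream FP8 checkpoint. The function, the `.experts.` key pattern, and the toy tensors below are hypothetical, not the actual pipeline:

```python
# Hypothetical sketch of recombining two quantized checkpoints:
# expert MLP weights from the INT4 quant, everything else
# (attention, embeddings, router) from the upstream FP8 quant.

def recombine(upstream_fp8: dict, own_int4: dict) -> dict:
    """Merge two state dicts, preferring INT4 tensors for expert layers."""
    merged = dict(upstream_fp8)           # start from the FP8 checkpoint
    for name, tensor in own_int4.items():
        if ".experts." in name:           # expert MLPs come from the INT4 quant
            merged[name] = tensor
    return merged

# Toy example with strings standing in for tensors:
fp8 = {"model.layers.0.self_attn.q_proj.weight": "fp8",
       "model.layers.0.mlp.experts.0.w1.weight": "fp8"}
int4 = {"model.layers.0.mlp.experts.0.w1.weight": "int4"}
merged = recombine(fp8, int4)
```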

And for now SGLang does not support llm-compressor mixed-precision quants :/. I plan to switch to ModelOpt at some point, but it's time-consuming.
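For context, a mixed-precision quant maps different schemes to different module groups. The dict below is a hypothetical sketch loosely modeled on the compressed-tensors config-group structure; it is not a recipe SGLang (or any runtime) is guaranteed to accept as-is:

```python
# Hypothetical mixed-precision config sketch (structure loosely modeled
# on compressed-tensors config groups; illustrative only).
mixed_precision_config = {
    "config_groups": {
        "group_fp8": {
            "targets": ["re:.*self_attn.*"],   # attention stays FP8
            "weights": {"num_bits": 8, "type": "float"},
        },
        "group_int4": {
            "targets": ["re:.*experts.*"],     # expert MLPs go to INT4 AWQ
            "weights": {"num_bits": 4, "type": "int", "group_size": 128},
        },
    },
    "ignore": ["lm_head"],                     # keep the head unquantized
}
```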

I'm really looking forward to it; with 2 Sparks this fits perfectly and currently gives the best tk/s. Any idea when we can expect it to drop here? :)

Here you're grateful because it's coming - and you're not asking for deadlines... patience is a virtue, right? ;)

I'm very grateful, I've just been waiting for this since 2.7 was announced. I'm happy with every bit of info, and if there is none I will try to wait as patiently as possible 🤤

There are 3 time-consuming things to do:

  1. Quantization itself
  2. Update my quantization setup to Transformers v5 to ensure I don't get hit by positional-embedding changes between Transformers v4 and v5: https://huggingface.co/MiniMaxAI/MiniMax-M2.5/discussions/48. This might be a time sink, because someone tried to adapt my llm-compressor PR to Transformers v5 and it got reverted due to many hidden bugs 🤷 https://github.com/vllm-project/llm-compressor/pull/2485
  3. MiniMax-M2.7 has three focus areas: software engineering, business in general (Word/Excel/PPT, but also domains like HR, Finance, Legal, ...), and entertainment (creative writing, ...), so I need to merge my software_engineering + creative calibration sets and hopefully find new datasets that stress-test the plurals-before-special-characters issue I found in https://huggingface.co/MiniMaxAI/MiniMax-M2.5/discussions/48. For example, a GitHub Actions dataset.
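The calibration-set merge in the last step could be sketched like this; the sample contents, set sizes, and budget are placeholders, not the real calibration data:

```python
import random

# Sketch of merging domain-specific calibration sets into one pool
# (dataset contents and sizes are placeholders, not the real data).
software = [f"code sample {i}" for i in range(100)]
creative = [f"creative sample {i}" for i in range(100)]
gha      = [f"github actions sample {i}" for i in range(100)]  # hypothetical new set

rng = random.Random(42)          # fixed seed so calibration is reproducible
pool = software + creative + gha
rng.shuffle(pool)                # mix domains so calibration batches are diverse
calib = pool[:200]               # cap at the calibration budget
```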
> 2. Update my quantization setup to Transformers v5 to ensure I don't get hit by positional-embedding changes between Transformers v4 and v5: https://huggingface.co/MiniMaxAI/MiniMax-M2.5/discussions/48. This might be a time sink, because someone tried to adapt my llm-compressor PR to Transformers v5 and it got reverted due to many hidden bugs 🤷 https://github.com/vllm-project/llm-compressor/pull/2485

I've seen it and was a bit puzzled about how that could happen... but hey, vLLM is huge and has armies of maintainers, each with their own ideas...

> 3. MiniMax-M2.7 has three focus areas: software engineering, business in general (Word/Excel/PPT, but also domains like HR, Finance, Legal, ...), and entertainment (creative writing, ...), so I need to merge my software_engineering + creative calibration sets and hopefully find new datasets that stress-test the plurals-before-special-characters issue I found in https://huggingface.co/MiniMaxAI/MiniMax-M2.5/discussions/48. For example, a GitHub Actions dataset.

BTW... both M2.5 and M2.7 are very well behaved (meaning no plurals & punctuation drift) on the Q5_K_M and UD_Q4_K_XL quants on ik_llama, while performance is more than acceptable (>50 tok/s on 8x3090s). The head-scratcher is that most of the quanters behind the GGUFs never go to the lengths of the expert-activation calibration you've done, and still (at least for coding!) the GGUFs seem immune to such issues..
Or call me a fool for not being able to uncover them so far!

I think Transformers v5 changed how rotary positional embeddings are handled. I can't dive in right now, but I remember from submitting the MiniMax PR that some RoPE types/functions were modified.

If a model was trained with the unmodified RoPE and is then run with the modified one, it may have issues.

Note this is only speculation, but it's what makes the most sense to me given that even MiniMax's original FP8 weights are reported to be buggy with Transformers v5.
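That kind of mismatch can be illustrated numerically: two common RoPE layouts (GPT-J-style interleaved pairs vs GPT-NeoX-style half-split) rotate the same frequencies but pair up different dimensions, so weights trained under one layout produce different activations under the other. A minimal numpy sketch (not the actual Transformers code):

```python
import numpy as np

def rope_half_split(x, pos, base=10000.0):
    """GPT-NeoX-style RoPE: dim i is paired with dim i + d/2."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d // 2) * 2.0 / d)
    theta = pos * inv_freq
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[: d // 2], x[d // 2 :]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

def rope_interleaved(x, pos, base=10000.0):
    """GPT-J-style RoPE: adjacent dims (2i, 2i+1) form a pair."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d // 2) * 2.0 / d)
    theta = pos * inv_freq
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

x = np.arange(8, dtype=np.float64)
a = rope_half_split(x, pos=3)
b = rope_interleaved(x, pos=3)
# Same input, same frequencies, different pairing -> different vectors,
# even though both are norm-preserving rotations.
```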

I appreciate the work!!! 😊

The latest llm-compressor (commit 3c9d4fd, as of Apr 10, 2026) changed the offloading backend from hf/accelerate to a custom one in llm-compressor, which I assume landed in v0.10 (March) and onward: https://github.com/vllm-project/llm-compressor/releases/tag/0.10.0

Unfortunately, llm-compressor now tries to load the full model into RAM/swap despite my trying various configurations of:

```python
from transformers import AutoModelForCausalLM

with offloaded_model():
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto_offload",
        offload_folder="./offload_folder",
    )
```

It gets stuck swapping there.

Hence I reverted to an old branch of llm-compressor that I know works. This means I will quantize with Transformers v4. Hopefully the M2.5 issues don't appear there.

Also, I'm trying a new quantization strategy. Instead of calibrating all experts, which might trigger large activations if an expert receives an input it wasn't supposed to and drown out the inputs it actually specializes in, I will significantly increase the calibration set (2K samples instead of 0.6K, with more diversity). This will hopefully improve quality (or make it worse if my dataset is not general enough, but it can't be worse than REAP).

Also, this might be my last 4-bit quant in llm-compressor for a while; it has too many issues I'm tired of dealing with.

So I'm looking into ModelOpt, which might require some time to adapt my scripts.

Well, I've been fighting with llm-compressor and compressed-tensors, but somehow I can't get my setup to quantize MiniMax without needing hundreds of GB of swap space:

llm-compressor + compressed-tensors from Feb 13 used to work: https://github.com/mratsim/llm-compressor/commits/minimax-m2/

but now no combination of offload_device (an AWQModifier parameter), offload_folder (an AutoModelForCausalLM.from_pretrained parameter), or sequential_offload_device lets me avoid swapping on the very first layer on my 192GB RAM machine.
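For scale, a back-of-the-envelope estimate shows why loading the full model in RAM can't work on 192GB (the ~230B total-parameter count below is an assumption for illustration, not an official figure):

```python
# Back-of-the-envelope RAM estimate for quantizing a large MoE
# (the parameter count is an assumption, not an official figure).
total_params = 230e9                  # assumed total parameters, experts included

bf16_gb = total_params * 2 / 1e9      # 2 bytes per parameter in BF16
int4_gb = total_params * 0.5 / 1e9    # ~0.5 bytes per parameter at 4-bit (scales ignored)

# ~460 GB of BF16 weights cannot fit in 192 GB of RAM, so without
# working layer-by-layer disk offload the process must hit swap.
```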

Furthermore, I apparently narrowly avoided this PR, https://github.com/vllm-project/compressed-tensors/pull/572, which changed loading/offloading in my previous quants.

And it was already really a mess

And now there is offload_dir too https://github.com/vllm-project/compressed-tensors/pull/650

Last but not least, according to @phaedawg, with the latest llm-compressor + compressed-tensors, disk offloading does not work with the sequential pipeline.

So the weights are on hiatus. Sorry folks.

PS: I wonder if maybe I quantized using 1TB of swap space and forgot about it because of how painful it was.
