Kind request: GLM-4.7-Flash
#6 opened by dehnhaide
Hi mratsim,
Would it be possible, if your time allows, to use your magic for an FP8-INT4-AWQ / BF16-INT4-AWQ quant of zai-org/GLM-4.7-Flash?
I have already tried the FP16 version as published, but on 4x 3090s the context is a mere 16k. However, I was able to test the model's reasoning and it impressed me, so I thought it might be worth a run of your quantization recipe.
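For reference, what I'm hoping for is roughly llm-compressor's AWQ oneshot flow. A minimal sketch of the BF16-INT4 (W4A16) half of the ask, where the calibration dataset, sample counts, and output directory are just placeholders (and the architecture may not be supported yet, per the reply below):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "zai-org/GLM-4.7-Flash"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize Linear weights to INT4, keep activations in 16-bit (W4A16).
# The LM head is kept in full precision, as is typical.
recipe = AWQModifier(
    targets=["Linear"],
    scheme="W4A16",
    ignore=["lm_head"],
)

oneshot(
    model=model,
    dataset="open_platypus",       # placeholder calibration set
    recipe=recipe,
    max_seq_length=2048,           # placeholder calibration settings
    num_calibration_samples=256,
    output_dir="GLM-4.7-Flash-AWQ-W4A16",
)
```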
Many thanks for all your efforts on HF!
Looks like it's a new architecture, "Glm4MoeLiteForCausalLM", so I need to create a new wrapper in LLMCompressor, as it's incompatible with https://github.com/vllm-project/llm-compressor/pull/2170.
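Roughly, the pattern such a wrapper follows is a calibration-time replacement of the MoE block that pushes every token through every expert, so the quantization observers collect statistics for all experts while the returned output still honours the router's top-k choice. A sketch of that pattern, where the module/attribute names (gate, experts, top_k) are guesses rather than the real Glm4MoeLite layout:

```python
import torch
from torch import nn

class Glm4MoeLiteCalibrate(nn.Module):
    """Hypothetical calibration wrapper for a GLM-4 MoE block."""

    def __init__(self, original: nn.Module):
        super().__init__()
        self.gate = original.gate        # router producing expert logits
        self.experts = original.experts  # nn.ModuleList of expert MLPs
        self.top_k = original.top_k      # experts activated per token

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Routing probabilities over all experts.
        logits = self.gate(hidden_states)
        probs = torch.softmax(logits, dim=-1, dtype=torch.float32)
        topk_probs, topk_idx = torch.topk(probs, self.top_k, dim=-1)

        output = torch.zeros_like(hidden_states)
        for idx, expert in enumerate(self.experts):
            # Run the expert on ALL tokens so its observers see data...
            expert_out = expert(hidden_states)
            # ...but only the routed tokens contribute to the output.
            routed = (topk_idx == idx)
            weight = (topk_probs * routed).sum(dim=-1, keepdim=True)
            output = output + expert_out * weight.to(expert_out.dtype)
        return output
```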
I can't say when I'll be able to get to it.
Unfortunately, I'll be off for a month or so, so this will have to be delayed.