Kind request: GLM-4.7-Flash
#6 opened by dehnhaide
Hi mratsim,
Would it be possible, if your time allows, to use your magic for an FP8-INT4-AWQ / BF16-INT4-AWQ quant of zai-org/GLM-4.7-Flash?
I have already tried the FP16 version as published, but on 4x 3090s the context is a mere 16k. However, I was able to test the model's reasoning and it impressed me, so I thought it might be worth a run of your quantization recipe.
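For reference, what I'm hoping for is roughly llm-compressor's AWQ oneshot flow. A minimal sketch of the BF16-INT4 (W4A16) half of the ask, where the calibration dataset, sample counts, and output directory are just placeholders (and the architecture may not be supported yet, per the reply below):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "zai-org/GLM-4.7-Flash"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize Linear weights to INT4, keep activations in 16-bit (W4A16).
# The LM head is kept in full precision, as is typical.
recipe = AWQModifier(
    targets=["Linear"],
    scheme="W4A16",
    ignore=["lm_head"],
)

oneshot(
    model=model,
    dataset="open_platypus",       # placeholder calibration set
    recipe=recipe,
    max_seq_length=2048,           # placeholder calibration settings
    num_calibration_samples=256,
    output_dir="GLM-4.7-Flash-AWQ-W4A16",
)
```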
Many thanks for all your efforts on HF!
Looks like it's a new architecture, "Glm4MoeLiteForCausalLM", so I need to create a new wrapper in LLMCompressor, as it's incompatible with https://github.com/vllm-project/llm-compressor/pull/2170.
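Roughly, the pattern such a wrapper follows is a calibration-time replacement of the MoE block that pushes every token through every expert, so the quantization observers collect statistics for all experts while the returned output still honours the router's top-k choice. A sketch of that pattern, where the module/attribute names (gate, experts, top_k) are guesses rather than the real Glm4MoeLite layout:

```python
import torch
from torch import nn

class Glm4MoeLiteCalibrate(nn.Module):
    """Hypothetical calibration wrapper for a GLM-4 MoE block."""

    def __init__(self, original: nn.Module):
        super().__init__()
        self.gate = original.gate        # router producing expert logits
        self.experts = original.experts  # nn.ModuleList of expert MLPs
        self.top_k = original.top_k      # experts activated per token

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Routing probabilities over all experts.
        logits = self.gate(hidden_states)
        probs = torch.softmax(logits, dim=-1, dtype=torch.float32)
        topk_probs, topk_idx = torch.topk(probs, self.top_k, dim=-1)

        output = torch.zeros_like(hidden_states)
        for idx, expert in enumerate(self.experts):
            # Run the expert on ALL tokens so its observers see data...
            expert_out = expert(hidden_states)
            # ...but only the routed tokens contribute to the output.
            routed = (topk_idx == idx)
            weight = (topk_probs * routed).sum(dim=-1, keepdim=True)
            output = output + expert_out * weight.to(expert_out.dtype)
        return output
```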
I can't say when I'll be able to get to it.
Unfortunately, I'll be off for a month or so, so this will have to be delayed.