Kind request

#1
by dehnhaide - opened

Hi!
Another great quantization ... until I taste it! Any chance of releasing a W8A16 version for the poor Ampere brothers, when time and resources allow? There's a plethora of wrongly quantized versions of this model, and I admit I'm already tired of trying 2-3 per day... but I fully trust your cookbook!

Thanks!

Yes, it's planned!

@dehnhaide
Does this work for you https://huggingface.co/Geodd/GLM-4.7-Flash-W8A16 ?

NOPE --> Value error, ModelOpt currently only supports: ['FP8', 'FP8_PER_CHANNEL_PER_TOKEN', 'FP8_PB_WO', 'NVFP4']
The model is "Tensor type F32 · BF16 · I8", most likely an NVFP4 derivative. To work on Ampere, depending on the kernel you want to run on, it would need to be any of:
F32 · BF16 · F8_E4M3
F64 · I32 · BF16
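For context, a minimal sketch of why these dtype lists differ by GPU generation. The helper below is illustrative (not part of any library in this thread); the generation cutoffs are assumptions based on public NVIDIA compute-capability specs:

```python
# Illustrative sketch: rough mapping from CUDA compute capability to the
# tensor dtypes that mainstream inference kernels can run natively.
# Cutoffs are assumptions: BF16 from Ampere (SM 8.0), FP8 from Ada/Hopper
# (SM 8.9+), NVFP4 from Blackwell (SM 10.0+).

def supported_formats(compute_capability):
    """Return the set of weight dtypes usable on a given (major, minor) SM version."""
    major, minor = compute_capability
    formats = {"F32", "F64", "I32", "I8"}   # broadly available
    if (major, minor) >= (8, 0):            # Ampere adds BF16 tensor cores
        formats.add("BF16")
    if (major, minor) >= (8, 9):            # Ada/Hopper add FP8 (E4M3)
        formats.add("F8_E4M3")
    if major >= 10:                         # Blackwell adds NVFP4
        formats.add("NVFP4")
    return formats

# Ampere (SM 8.6, e.g. RTX 3090): BF16 and INT8 work, FP8/NVFP4 do not
print(sorted(supported_formats((8, 6))))
```

This is why an NVFP4 (or FP8-activation) checkpoint errors out on Ampere while a genuine W8A16 (INT8 weights, BF16 activations) one would load.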

P.S. Wasted bandwidth...
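One way to avoid the wasted bandwidth: inspect the repo's `config.json` before pulling the weights. A minimal sketch, assuming the config text has already been fetched (e.g. via `huggingface_hub.hf_hub_download(repo_id, "config.json")`); the helper name and the example config below are made up for illustration:

```python
import json

def quant_method_of(config_text):
    """Read the quantization method declared in a model's config.json, if any."""
    cfg = json.loads(config_text)
    qc = cfg.get("quantization_config") or {}
    return qc.get("quant_method", "none")

# Hypothetical config.json snippet, for illustration only:
example = """
{
  "model_type": "glm4",
  "quantization_config": {"quant_method": "modelopt", "quant_algo": "NVFP4"}
}
"""
print(quant_method_of(example))  # -> modelopt
```

A few hundred bytes of config tell you the quantization scheme before committing to a multi-gigabyte download.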

Will get it fixed. What's the GPU you are running on?

Sounds like there were a few issues with the export. We fixed those, but it seems vLLM doesn't support it? Let me check if we can get our stack to support the model over the weekend. If so, do you rent out Ampere cards, and what is the model?
