Kind request
Hi!
Another great quantization ... until I taste it! Any chance of releasing a W8A16 version for the poor Ampere brothers, when time and resources allow? There's a plethora of wrongly quantized versions of this model and I admit I'm already tired of trying 2-3 per day... but I fully trust your cookbook!
Thanks!
Yes, it's planned!
@dehnhaide
Does this work for you https://huggingface.co/Geodd/GLM-4.7-Flash-W8A16 ?
NOPE --> Value error, ModelOpt currently only supports: ['FP8', 'FP8_PER_CHANNEL_PER_TOKEN', 'FP8_PB_WO', 'NVFP4']
The model is "Tensor type F32 · BF16 · I8", most likely an NVFP4 derivative. To work on Ampere, depending on the kernel you want to run on, it would need to be any of:
F32 · BF16 · F8_E4M3
F64 · I32 · BF16
P.S. wasted bandwidth...
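For what it's worth, a minimal sketch (not from this thread, just an assumption about the usual layout of these repos) of how to peek at a repo's declared quantization scheme before pulling the full weights, since an unsupported `quant_algo` is presumably what triggers the ValueError above:

```python
# Sketch: inspect the declared quantization from config.json without
# downloading the weight shards. Repo ID is the one from this thread.
import json
from huggingface_hub import hf_hub_download

repo_id = "Geodd/GLM-4.7-Flash-W8A16"

# Only config.json is fetched (a few KB), not the multi-GB weights.
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")
with open(config_path) as f:
    config = json.load(f)

# The quantization_config block (if present) is what loaders such as vLLM
# read to pick a quant method; an unsupported algo there would explain
# the "ModelOpt currently only supports: [...]" error.
print(json.dumps(config.get("quantization_config", "none declared"), indent=2))
```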
Will get it fixed. What's the GPU you are running on?
Sounds like there were a few issues with the export. We fixed those, but it seems vLLM doesn't support it? Let me check if we can get our stack to support the model over the weekend. If so, do you rent out Ampere cards, and which card model is it?
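For reference, a hedged reproduction sketch, assuming vLLM is the serving stack (model ID from the thread; the flags are illustrative, not a confirmed recipe). Today the constructor raises the ValueError quoted above on Ampere; once the export is fixed, loading should look roughly like this:

```python
# Assumption: vLLM as the serving stack with its default loader.
# On current Ampere + this checkpoint, LLM(...) is where the
# "ModelOpt currently only supports: [...]" ValueError is raised.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Geodd/GLM-4.7-Flash-W8A16",  # repo from the thread
    dtype="bfloat16",                   # Ampere has no native FP8 compute
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```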