Kind request

#1
by dehnhaide - opened

Hi!
Another great quantization ... until I taste it! Any chance of releasing a W8A16 version for the poor Ampere brothers, when time and resources allow? There's a plethora of wrongly quantized versions of this model, and I admit I'm already tired of trying 2-3 per day... but I fully trust your cookbook!

Thanks!

Yes, it's planned!

@dehnhaide
Does this work for you https://huggingface.co/Geodd/GLM-4.7-Flash-W8A16 ?

NOPE --> Value error, ModelOpt currently only supports: ['FP8', 'FP8_PER_CHANNEL_PER_TOKEN', 'FP8_PB_WO', 'NVFP4']
The model is "Tensor type F32 · BF16 · I8", most likely an NVFP4 derivative. To work on Ampere, depending on the kernel you want to run on, it would need to be any of:
F32 · BF16 · F8_E4M3
F64 · I32 · BF16
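For context, a minimal sketch of why these dtype lists differ by GPU generation. The helper below is illustrative (not part of any library in this thread); the generation cutoffs are assumptions based on public NVIDIA compute-capability specs:

```python
# Illustrative sketch: rough mapping from CUDA compute capability to the
# tensor dtypes that mainstream inference kernels can run natively.
# Cutoffs are assumptions: BF16 from Ampere (SM 8.0), FP8 from Ada/Hopper
# (SM 8.9+), NVFP4 from Blackwell (SM 10.0+).

def supported_formats(compute_capability):
    """Return the set of weight dtypes usable on a given (major, minor) SM version."""
    major, minor = compute_capability
    formats = {"F32", "F64", "I32", "I8"}   # broadly available
    if (major, minor) >= (8, 0):            # Ampere adds BF16 tensor cores
        formats.add("BF16")
    if (major, minor) >= (8, 9):            # Ada/Hopper add FP8 (E4M3)
        formats.add("F8_E4M3")
    if major >= 10:                         # Blackwell adds NVFP4
        formats.add("NVFP4")
    return formats

# Ampere (SM 8.6, e.g. RTX 3090): BF16 and INT8 work, FP8/NVFP4 do not
print(sorted(supported_formats((8, 6))))
```

This is why an NVFP4 (or FP8-activation) checkpoint errors out on Ampere while a genuine W8A16 (INT8 weights, BF16 activations) one would load.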

P.S. Wasted bandwidth...
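One way to avoid the wasted bandwidth: inspect the repo's `config.json` before pulling the weights. A minimal sketch, assuming the config text has already been fetched (e.g. via `huggingface_hub.hf_hub_download(repo_id, "config.json")`); the helper name and the example config below are made up for illustration:

```python
import json

def quant_method_of(config_text):
    """Read the quantization method declared in a model's config.json, if any."""
    cfg = json.loads(config_text)
    qc = cfg.get("quantization_config") or {}
    return qc.get("quant_method", "none")

# Hypothetical config.json snippet, for illustration only:
example = """
{
  "model_type": "glm4",
  "quantization_config": {"quant_method": "modelopt", "quant_algo": "NVFP4"}
}
"""
print(quant_method_of(example))  # -> modelopt
```

A few hundred bytes of config tell you the quantization scheme before committing to a multi-gigabyte download.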

Will get it fixed. What's the GPU you are running on?

Sounds like there were a few issues with the export. We fixed those, but it seems vLLM doesn't support it? Let me check if we can get our stack to support the model over the weekend. If so, do you rent out Ampere cards, and what is the model?
