Question regarding the FP8 version

#9
by thecr7guy - opened

Hello,

Great work on this amazing model. Is there a plan to release an FP8 version of zai-org/GLM-4.6V-Flash?

If you have enough VRAM, you don't need a separate FP8 checkpoint: https://docs.vllm.ai/en/v0.5.4/quantization/fp8.html

vLLM can dynamically quantize an original-precision BF16/FP16 model to FP8 with no calibration data required. Enable it by passing --quantization="fp8" on the command line or setting quantization="fp8" in the LLM constructor.
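For example, a minimal sketch of the constructor route (assumes vLLM is installed, the GPU supports FP8, and uses the model ID from the question above; not tested on this model):

```python
from vllm import LLM

# Load the BF16 checkpoint and quantize weights to FP8 on the fly.
# No calibration dataset is needed for this dynamic quantization mode.
llm = LLM(model="zai-org/GLM-4.6V-Flash", quantization="fp8")

outputs = llm.generate("Describe this model in one sentence.")
print(outputs[0].outputs[0].text)
```

The equivalent when launching the OpenAI-compatible server is adding `--quantization="fp8"` to the serve command.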
