FP8 Quantization?

#1
by MoRanYue - opened

At 10 GB, the model may be too large to run inference on low-VRAM GPUs. Will there be a mixed FP8, or even FP4, quantized version?

Thanks for raising this! We haven’t tested model quantization yet, but we’ll consider providing a quantized version once its stability is confirmed.
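For a rough sense of what FP8 or FP4 could save, the weight storage scales roughly linearly with bits per weight. The sketch below assumes the ~10 GB checkpoint stores its weights in a 16-bit format (FP16/BF16). That is an assumption, not something stated in this thread. It also ignores activations, KV cache, quantization scale factors, and any layers a *mixed* scheme would keep at higher precision, so real savings would be smaller. The helper name `estimated_size_gb` is hypothetical.

```python
# Rough estimate of weight-storage size at lower precisions.
# ASSUMPTION: the ~10 GB checkpoint stores weights in 16-bit (FP16/BF16).
# Ignores activations, KV cache, scale factors, and layers kept at higher
# precision in a mixed-precision scheme.

BITS_PER_WEIGHT = {"bf16": 16, "fp8": 8, "fp4": 4}


def estimated_size_gb(checkpoint_gb: float, source_bits: int, target_bits: int) -> float:
    """Scale the checkpoint size linearly with bits per weight (hypothetical helper)."""
    return checkpoint_gb * target_bits / source_bits


def main() -> None:
    checkpoint_gb = 10.0  # size quoted in the question
    for name, bits in BITS_PER_WEIGHT.items():
        size = estimated_size_gb(checkpoint_gb, BITS_PER_WEIGHT["bf16"], bits)
        print(f"{name}: ~{size:.1f} GB")


if __name__ == "__main__":
    main()
```

Under these assumptions, FP8 would bring the weights to roughly 5 GB and FP4 to roughly 2.5 GB. A mixed scheme would land somewhere between those figures and the 16-bit size.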
