Will there be a chance to have INT4 later?
Is INT8 already the limit?
I have been playing with INT4 a little, combining SDNQ for quantization speed with Nunchaku for the fast kernel. It ends up only ~1.1x faster than the current INT8 in my testing, and the quality is nearly unusable. A more naive INT4 approach without the LoRA correction would be faster still, but with even worse quality.
Not sure INT4 will ever be feasible, outside of going full Nunchaku. Done naively, I estimate it could only cover maybe 20% of a model's layers.
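To illustrate why naive INT4 hurts quality so much, here is a rough sketch of symmetric per-tensor quantization at different bit widths (numpy, toy weights; this is an illustration, not SDNQ's actual scheme, which uses more sophisticated grouping):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)  # toy weight matrix

def fake_quantize(w, bits):
    # symmetric per-tensor quantization to signed `bits`-bit integers,
    # then dequantize so we can measure the rounding error
    qmax = 2 ** (bits - 1) - 1          # 7 for INT4, 127 for INT8
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

err4 = np.abs(w - fake_quantize(w, 4)).mean()
err8 = np.abs(w - fake_quantize(w, 8)).mean()
# INT4 has far fewer levels, so its mean error is an order of
# magnitude larger than INT8's on the same weights
```

With only 16 levels instead of 256, the quantization step is roughly 16x coarser, which matches the quality collapse described above.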
I think INT4 will only be slightly faster on 3xxx-series cards or older because they lack native INT4 tensor cores; there is overhead in fitting two INT4 values into a single INT8, which effectively makes it not worth it.
Better to stick with INT8 and retain the quality.
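The packing overhead mentioned above can be sketched as follows (numpy, hypothetical helper names): without hardware INT4 support, every pair of 4-bit values has to be packed into one byte and unpacked again with shifts, masks, and sign extension before each use.

```python
import numpy as np

def pack_int4(q):
    # q: int8 array (even length) of values in [-8, 7];
    # store two's-complement low nibbles, two values per byte
    u = q.astype(np.uint8) & 0x0F
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

def unpack_int4(p):
    # extract both nibbles and sign-extend them back to int8 --
    # this shift/mask work is the per-element overhead on GPUs
    # without native INT4 tensor cores
    lo = (p & 0x0F).astype(np.int16)
    hi = ((p >> 4) & 0x0F).astype(np.int16)
    lo = np.where(lo >= 8, lo - 16, lo)
    hi = np.where(hi >= 8, hi - 16, hi)
    out = np.empty(p.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return out
```

The memory footprint halves, but each load now costs extra integer ops, so without native INT4 instructions the compute path gains little or nothing.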