Will there be a chance to have INT4 later?
Is INT8 already the limit?
I have been playing with INT4 a little, combining SDNQ for quantization speed with Nunchaku for the fast kernel. It ends up only ~1.1x faster than the current INT8 in my testing, and the quality is nearly unusable. A more naive INT4 approach without the LoRA correction would be faster still, but with even worse quality.
Not sure INT4 will ever be feasible, outside of going full Nunchaku. Done naively, I estimate it could only cover maybe 20% of a model's layers.
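To illustrate why naive INT4 hurts quality so much, here is a rough sketch of symmetric per-tensor quantization at different bit widths (numpy, toy weights; this is an illustration, not SDNQ's actual scheme, which uses more sophisticated grouping):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)  # toy weight matrix

def fake_quantize(w, bits):
    # symmetric per-tensor quantization to signed `bits`-bit integers,
    # then dequantize so we can measure the rounding error
    qmax = 2 ** (bits - 1) - 1          # 7 for INT4, 127 for INT8
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

err4 = np.abs(w - fake_quantize(w, 4)).mean()
err8 = np.abs(w - fake_quantize(w, 8)).mean()
# INT4 has far fewer levels, so its mean error is an order of
# magnitude larger than INT8's on the same weights
```

With only 16 levels instead of 256, the quantization step is roughly 16x coarser, which matches the quality collapse described above.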
I think INT4 will only be slightly faster on 3xxx-series cards or older because they lack native INT4 tensor cores; there is overhead in fitting two INT4 values into a single INT8, which effectively makes it not worth it.
Better to stick with INT8 and retain the quality.
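The packing overhead mentioned above can be sketched as follows (numpy, hypothetical helper names): without hardware INT4 support, every pair of 4-bit values has to be packed into one byte and unpacked again with shifts, masks, and sign extension before each use.

```python
import numpy as np

def pack_int4(q):
    # q: int8 array (even length) of values in [-8, 7];
    # store two's-complement low nibbles, two values per byte
    u = q.astype(np.uint8) & 0x0F
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

def unpack_int4(p):
    # extract both nibbles and sign-extend them back to int8 --
    # this shift/mask work is the per-element overhead on GPUs
    # without native INT4 tensor cores
    lo = (p & 0x0F).astype(np.int16)
    hi = ((p >> 4) & 0x0F).astype(np.int16)
    lo = np.where(lo >= 8, lo - 16, lo)
    hi = np.where(hi >= 8, hi - 16, hi)
    out = np.empty(p.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return out
```

The memory footprint halves, but each load now costs extra integer ops, so without native INT4 instructions the compute path gains little or nothing.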