NVFP4 / MXFP4 / FP8 quantizations for faster inference

#173

by Iwaku-Real - opened May 27

May 27

The current Anima model is only available in BF16, and it is more than twice as slow as Illustrious SDXL, even on my RTX 5070 Ti.

With 4-bit and 8-bit quantization, the model would not only fit into less VRAM; on hardware that accelerates these formats natively (such as the RTX 50xx series, which supports NVFP4), generation could be up to 2x faster at 8-bit and up to 4x faster at 4-bit. There will always be a very minor quality loss, but quantization keeps negative prompts usable, unlike the Turbo LoRA, which nullifies them.
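To make the VRAM side of this concrete, here is a rough back-of-the-envelope sketch of weight storage per format. The block sizes and scale widths (one 8-bit scale per 32 values for the MX formats, one 8-bit scale per 16 values for NVFP4) come from the published format definitions, not from this thread, and the parameter count is a hypothetical placeholder rather than Anima's real size. Per-tensor scales, activations, the text encoder, and the VAE are ignored.

```python
# Effective storage cost per weight, in bits, including per-block scale factors.
BITS_PER_WEIGHT = {
    "bf16": 16.0,
    "fp8": 8.0,               # a single per-tensor scale is negligible
    "mxfp8": 8.0 + 8.0 / 32,  # 8-bit values + one 8-bit scale per 32 values
    "mxfp4": 4.0 + 8.0 / 32,  # 4-bit values + one 8-bit scale per 32 values
    "nvfp4": 4.0 + 8.0 / 16,  # 4-bit values + one 8-bit scale per 16 values
}


def weight_gib(num_params: int, fmt: str) -> float:
    """Approximate size of the weights alone, in GiB, for the given format."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 2**30


# Hypothetical parameter count, used only for illustration.
HYPOTHETICAL_PARAMS = 2_000_000_000

for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:>6}: {weight_gib(HYPOTHETICAL_PARAMS, fmt):.2f} GiB")
```

This only estimates memory. The speedup depends on whether the GPU has native kernels for the format, which is a separate question from how small the file is.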

FLUX.2 has officially released quantized versions like this: https://huggingface.co/collections/black-forest-labs/flux2
It might also be worth working with Nunchaku on this; they have their own very effective FP4 quantization method for models like FLUX, Z-Image, and Qwen-Image: https://github.com/nunchaku-ai/nunchaku

Bedovyy

May 29

edited about 1 month ago

~~As for the FP8/MXFP8 models, I couldn't use torch.compile with them, which means they are slower than BF16.~~
To use torch.compile with FP8/MXFP8 models, use the TorchCompileModelAdvanced node from KJNodes, set the mode to max-autotune-no-cudagraphs, and set dynamic to false.
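For anyone working outside ComfyUI, the settings named above presumably correspond to the `mode` and `dynamic` arguments of `torch.compile` in plain PyTorch. A minimal sketch, assuming that mapping; how the KJNodes node wires these options internally is not confirmed here, and `compile_model` is a hypothetical helper name:

```python
# The two settings from the comment above, as torch.compile keyword arguments.
COMPILE_KWARGS = {
    "mode": "max-autotune-no-cudagraphs",  # autotuned kernels, CUDA graphs disabled
    "dynamic": False,                      # specialize on fixed input shapes
}


def compile_model(model):
    """Wrap `model` with torch.compile; actual compilation runs on the first call."""
    # Imported lazily so the settings can be inspected without PyTorch installed.
    import torch

    return torch.compile(model, **COMPILE_KWARGS)
```

With `dynamic` set to false, changing the resolution or batch size will likely trigger a recompile, so the first generation at each new shape is slow.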

Try the INT8 model with the INT8-Fast custom node. It seems to be the best way to boost generation speed (and it works on Turing and Ampere GPUs).

As for NVFP4, I wasn't satisfied with its quality.
Nunchaku may be worth trying because it calibrates, but I'm not sure ordinary users can easily do that (it takes resources, meaning money and time), and post-training means the artist tags may work differently.
