Man, you have done great work!
I am very curious how the nvfp4 conversion was made. Could you share the conversion script, and can I apply it to other community models?
I'll probably release the conversion script when it's finished; it can currently load regular model formats but not quantized formats such as fp8 scaled (naive fp8 does work) or nvfp4.
I might also change the script to be runnable as a command instead of with hardcoded variables. Also the script uses https://github.com/Comfy-Org/comfy-kitchen for the quantization (currently nvfp4 and fp8 scaled) itself.
I'll release it when it's ready.
I'm working on my own conversion script for all wan models and qwen models. I'm having a bit of an issue maintaining quality with nvfp4.
I'm taking sneak peeks at the quantstack ggufs to see what they are keeping as fp16 and what they are quantizing down to 6-bit or 5-bit.
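In case it's useful, dumping the per-tensor quant types from a GGUF is easy with the `gguf` Python package from llama.cpp. This is just a sketch, the filename is a placeholder and field names may vary slightly between package versions:

```python
# Sketch: list which tensors a GGUF keeps in f16/f32 vs k-quants.
# Assumes the `gguf` pip package (from llama.cpp); filename is a placeholder.
from collections import Counter
from gguf import GGUFReader

reader = GGUFReader("wan2.2-ti2v-5b-Q4_K_M.gguf")  # placeholder path

counts = Counter()
for t in reader.tensors:
    counts[t.tensor_type.name] += 1
    print(f"{t.name:60s} {t.tensor_type.name:8s} {list(t.shape)}")

print(counts)  # e.g. how many tensors stayed F16 vs went to Q4_K / Q5_K / Q6_K
```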
But the quality isn't there. The same goes for your model. It runs fast, but quality is noticeably worse (and maybe unusable) compared to a q4km quant. And a q5km quant is on another level.
I do get a major quality boost when I enable full-precision-mm for all attention layers. But that negates any speed boost from nvfp4, by basically running the main loop of the model in fp16 instead of nvfp4.
So if you have any tips or tricks to spare, I'd be grateful. I'm testing with wan2.2-ti2v-5b, and it might be harder because the model seems more sensitive to attention quantization. Output is really noisy without full-precision-mm on the input layers.
I've done a few calibration runs on it, trying to get proper input scaling and figuring out which attention layers vary the most in input data (since nvfp4 only has 16 representable levels, layers with huge variation might be better off in fp16).
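The calibration idea is roughly this (a simplified sketch; `model` stands for the loaded diffusion model and the ranking at the end is just one way to pick fp16 candidates, not my actual script):

```python
# Sketch: record per-layer input amax over a few runs, then keep the layers
# with the widest / most erratic input range in fp16.
import torch

stats = {}  # layer name -> list of observed activation amax values

def make_hook(name):
    def hook(module, inputs, output):
        x = inputs[0]
        stats.setdefault(name, []).append(x.detach().abs().max().item())
    return hook

handles = []
for name, module in model.named_modules():  # `model` = the loaded diffusion model (assumption)
    if isinstance(module, torch.nn.Linear):
        handles.append(module.register_forward_hook(make_hook(name)))

# ... run a handful of representative prompts / timesteps through the model here ...

for h in handles:
    h.remove()

# Layers whose amax swings a lot between calls are candidates to keep in fp16.
for name, vals in sorted(stats.items(),
                         key=lambda kv: max(kv[1]) / (min(kv[1]) + 1e-8),
                         reverse=True):
    print(f"{name:60s} min={min(vals):8.2f} max={max(vals):8.2f}")
```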
I've also noticed that quantizing chroma gave me nonsensical outputs, then a couple days later I got outputs close to q4_k_m in quality, and that's with quantizing everything quantizable to nvfp4. I think to some extent it's also a bit of a comfy problem.
Currently, on euler/euler a, outputs with the quantized chroma are comparable to the gguf and similar to the fp16 content-wise, but on dpmpp 2m sde, for example, the outputs often become super random and nonsensical. This could also be a mistake in the comfy implementation of nvfp4.
I think to a certain extent comfy's nvfp4 implementation is incomplete. Dual-stream models (like chroma) were broken until an update (they produced extremely chaotic, noisy, but kind of recognisable outputs; I think it was a scaling issue), and the issues were fixed without changing the model file. So I wouldn't be surprised if the weird dpmpp 2m sde results are also because of a mistake in the implementation, causing the model to, for example, add way more noise than it should.
(For reference, before the update not even fp4 mixed worked properly for chroma; after the update you get coherent outputs with all qkv+o and linear layers quantized, much better.)
Also, when quantizing something like z-image without any scaling, the results are still very good: it can still spell words consistently and stays compatible with loras. I feel like a lot of the issues I've run into are actually just code bugs, not issues with the weights themselves.
The thing with quantizing to nvfp4 is that if you want the GPU to use accelerated nvfp4, the inputs (the attention activations) need to be scaled down to nvfp4 as well. So it's not (per se) about the weights, but about the tensor inputs being fed into the model during inference.
Blackwell can do accelerated nvfp4 matmul, but that requires both the input and the weights to be nvfp4. We quantize the weights to nvfp4 ahead of time, which means the inputs must also be quantized to nvfp4 on the fly during a workflow run.
Some models have huge variation in input values, which doesn't suit nvfp4 very well. Outputs becoming way more noisy is an indication of this: a working model, following the prompt just the same as higher quants, even with recognizable low-level micro detail... but just very noisy, so the diffusion model spends more effort trying to denoise that instead of adding detail.
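To make the 16-level point a bit more concrete, here's a toy version of the per-block quantization. nvfp4 stores fp4 e2m1 values in blocks of 16 with a per-block scale (the real format uses fp8 block scales plus a per-tensor scale; this just uses plain floats to show the dynamic-range effect):

```python
# Rough illustration (not the real kernel): fp4 e2m1 can only represent
# {0, 0.5, 1, 1.5, 2, 3, 4, 6} (plus signs), and nvfp4 scales blocks of 16
# values so the block max maps onto 6. One outlier in a block drags the
# scale up and squashes everything else toward zero.
import torch

E2M1 = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_nvfp4(block: torch.Tensor) -> torch.Tensor:
    scale = block.abs().max() / 6.0 + 1e-12                    # simplified per-block scale
    scaled = (block / scale).abs()
    idx = (scaled.unsqueeze(-1) - E2M1).abs().argmin(dim=-1)    # nearest representable magnitude
    return E2M1[idx] * block.sign() * scale

calm = torch.randn(16)                   # well-behaved activations
spiky = calm.clone(); spiky[0] = 80.0    # same block with one big outlier

for name, x in [("calm", calm), ("spiky", spiky)]:
    err = (fake_nvfp4(x) - x).abs().mean().item()
    print(f"{name:5s} mean abs error = {err:.4f}")
```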
That's what I mean by setting layers to 'full-precision-mm'. What this does - as I see it, simplified - is NOT converting the input to nvfp4 and then doing an nvfp4 matmul on the GPU, but scaling the weights up to fp16 and doing a regular fp16 matmul on the GPU. This keeps the input tensors at good precision and fixes a lot of the quality issues I have, but it negates any speed benefit.
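In pseudo-ish code, my mental model of the two paths is roughly this (the helper names are made up, not ComfyUI's actual API):

```python
# My mental model of the two code paths (not ComfyUI's actual implementation).
import torch

def nvfp4_path(x_fp16, w_fp4, w_scales, input_scale):
    # Quantize the activations on the fly, then let Blackwell do the fp4 x fp4 matmul.
    x_fp4 = quantize_to_nvfp4(x_fp16, input_scale)    # hypothetical helper
    return nvfp4_matmul(x_fp4, w_fp4, w_scales)        # hypothetical accelerated kernel

def full_precision_mm_path(x_fp16, w_fp4, w_scales):
    # Dequantize the weights back up and do a plain fp16 matmul:
    # activations keep their precision, but the fp4 speed advantage is gone.
    w_fp16 = dequantize_from_nvfp4(w_fp4, w_scales).to(torch.float16)  # hypothetical helper
    return x_fp16 @ w_fp16.T
```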
Like you said, it can also be a scaling issue. The model might just work better with the inputs in a certain scale range. If we don't supply any input scaling, ComfyUI starts trying to auto-discover input scaling per layer on the fly when we run the model. It does this by just looking at the min/max values at that moment in time. But the scaling ComfyUI discovers that way might not be one the model works well with.
Diffusion models are basically denoisers... if the input scale isn't correct and less denoising is applied in each step, we end up with more noise.
GGUFs have the 'benefit' that the weights are supplied in smaller quantized form, but as the model executes they are dequantized back up to fp16. So the model is effectively running at fp16 precision, just with the weights stored at lower precision.
When we run nvfp4, the model is running in nvfp4 (at least, we want it to, because that's where the speed is). But that means the whole model is running at nvfp4 precision. Which isn't much :).
Maybe I'll spend some time looking at the 'official' flux1 / flux2 / ltx2 nvfp4 versions, just to see which layers they kept as fp16 and which they put in nvfp4. Maybe my noise problem comes from the fact that my inputs are still in fp16 and the model doesn't scale mid-way through the timesteps...
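Dumping the layer list from those official checkpoints should be straightforward, since the safetensors header is just a JSON blob at the start of the file. Something like this (sketch; the filename is a placeholder, and the packed fp4 weights and their scales will most likely just show up as extra low-bit/uint8 tensors):

```python
# Sketch: read only the safetensors header to see which tensors an official
# nvfp4 checkpoint keeps at fp16/bf16 and which are stored in packed form.
import json
import struct
from collections import Counter

path = "flux2-dev-nvfp4.safetensors"  # placeholder filename

with open(path, "rb") as f:
    header_len = struct.unpack("<Q", f.read(8))[0]   # first 8 bytes = header size
    header = json.loads(f.read(header_len))

counts = Counter()
for name, info in header.items():
    if name == "__metadata__":
        continue
    counts[info["dtype"]] += 1
    print(f"{name:70s} {info['dtype']:10s} {info['shape']}")

print(counts)
```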
Anyway, I'm still learning and experimenting... and a far cry from an 'expert' (heck, I'm learning as I go and probably making lots of mistakes). I'd love to be corrected so I learn it properly, but I also love to just share and read what others have to share as I go.
After testing some more with the fp8-scaled version by KJ and the 'reference' fp16 version... maybe what I was seeing as bad quality is just the bad quality of the 5B version. The 14B versions come out better (but i2v quality still suffers compared to Q5KM). I'll get a good baseline first before deciding which conversion settings matter for speed and quality :).