No difference?

#1
by Seeker36087 - opened

Maybe I'm doing something wrong here, but I'm not seeing any difference in speed compared to using FP8 or GGUF versions...

I'm running an RTX 5070 and loading the diffusion models through the standard 'Load Diffusion Model' node in ComfyUI - everything runs, but it runs just as slowly as, if not more slowly than, a standard non-FP4 model.

Am I doing something wrong here or is it an issue with the quantisation?

If you have --novram set, it will run the text encoders on the CPU; otherwise it should be pretty much instant no matter whether you're on NVFP4, FP8, or another format. On the CPU, NVFP4 will need to dequantize, which makes the model a tiny bit slower than FP8, comparable to GGUF.
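
For intuition, here's a rough Python sketch of what that dequant step amounts to. The E2M1 value table is the real FP4 set, but the layout and function are simplified stand-ins, not the actual kernel (which works on packed 4-bit data):

```python
import torch

# NVFP4 stores 4-bit E2M1 values in blocks of 16, each block carrying an
# FP8 (E4M3) scale, plus one FP32 scale for the whole tensor.
E2M1_VALUES = torch.tensor(  # the 16 representable FP4 (E2M1) values
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def dequant_nvfp4(codes, block_scales, tensor_scale):
    """codes: indices into E2M1_VALUES, shape (n_blocks, 16);
    block_scales: per-block scales, shape (n_blocks, 1)."""
    # A table lookup plus two multiplies per weight -- this extra work
    # is why the CPU path ends up a bit slower than plain FP8.
    return E2M1_VALUES[codes] * block_scales * tensor_scale

codes = torch.randint(0, 16, (4, 16))      # 4 blocks of 16 weights
block_scales = torch.rand(4, 1) + 0.5      # stand-in E4M3 scales
print(dequant_nvfp4(codes, block_scales, tensor_scale=0.01).shape)
```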

I don't currently have a VRAM launch argument set, so ComfyUI defaults to Normal VRAM.

Am I using it correctly, loading it directly into the 'Load Diffusion Model' node the same as any other model?

I tried running a 720x720 generation with 81 frames - which would usually take around 30-40 s/it with FP8 - and I was getting the same 30-40 s/it.
That was with CFG at 1 and Lightning LoRAs loaded. When I tried running without them and CFG at 3, each step took significantly longer.

Are you talking about the text encoder running faster or LTX running faster? This only affects the text encoder, which runs at the start of execution to encode your prompt.

Hmm, the text encoder has always been pretty fast for me with FP8 - it's only when using GGUF models that it's slow as hell.

I thought the idea with NVFP4 was that the model ran faster. In a ComfyUI Discord announcement yesterday about NVFP4 they claimed that "with last week's release, NVFP4 models give you double the sampling speed of FP8 on NVIDIA Blackwell GPUs" πŸ€”

Yes, but this NVFP4 is the text encoder in NVFP4. If you want a speedup with LTX 2, you should use the LTX 2 NVFP4 model.

Wait, actually I'm confusing this with a different repo

Right, so this repo is theoretically faster than standard Wan, mostly in sampling time. You won't see much speedup in comparison if you're running with speedup LoRAs, though, since they don't use a lot of steps. For 20 steps it will usually be faster than the standard model.
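
Rough numbers to illustrate (all made up, not measurements): with some fixed per-run overhead for model loading, text encoding, and VAE decode, a per-step speedup barely shows at 4 LoRA steps but adds up at 20:

```python
# Back-of-envelope for why few-step LoRA runs hide a sampling speedup.
# 35 s/it vs 18 s/it and the 60 s overhead are illustrative numbers only.
def run_time(steps, sec_per_it, overhead=60):
    # overhead = model load, text encode, VAE decode (same either way)
    return overhead + steps * sec_per_it

for steps in (4, 20):
    fp8 = run_time(steps, 35)
    nvfp4 = run_time(steps, 18)  # hypothetical ~2x faster sampling
    print(f"{steps} steps: FP8 {fp8}s vs NVFP4 {nvfp4}s "
          f"({fp8 / nvfp4:.2f}x overall)")
```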

I uploaded it here on request; there isn't a particularly huge speedup compared to the regular models. I might have a fix for that later, though, using a LoRA to correct the imprecise weights (similar to SVDQuant) instead of using mixed precision.
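
For anyone curious, a minimal sketch of that idea: take the residual between the original weights and their quantized reconstruction and keep its top singular components as a low-rank correction running alongside the quantized layer. The quantizer below is a uniform-rounding stand-in, not NVFP4, and how much error this recovers depends on the error's structure:

```python
import torch

def fake_quant(w, levels=16):
    # stand-in for NVFP4: uniform rounding to `levels` values
    scale = w.abs().max() / (levels // 2 - 1)
    return (w / scale).round().clamp(-(levels // 2), levels // 2 - 1) * scale

w = torch.randn(512, 512)
w_q = fake_quant(w)

# Low-rank (rank-32) approximation of the quantization error,
# applied as an extra LoRA branch next to the quantized weights:
u, s, vh = torch.linalg.svd(w - w_q)
r = 32
lora_a = u[:, :r] * s[:r]   # (512, r)
lora_b = vh[:r, :]          # (r, 512)

err_before = (w - w_q).norm()
err_after = (w - (w_q + lora_a @ lora_b)).norm()
print(f"residual norm: {err_before:.2f} -> {err_after:.2f}")
```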

Well, I just appreciate the fact that you're trying to create an NVFP4 model of WAN 2.2 - as it seems to have been the holy grail for a while now!

I did try running without the Lightning LoRAs and with a higher CFG, but I was getting 45 seconds per step - and at 20 steps total, that was a bit more than I was willing to commit to 🀣

Kudos for your work though and I look forward to seeing it develop πŸ‘

I'm consistently seeing an almost 2x speed improvement on NVFP4 layers vs FP8/FP16, and also compared to GGUF model speed.

Up-to-date drivers, PyTorch 2.9.1 with CUDA 13, and an updated ComfyUI are required, though.
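
A quick way to verify the stack (plain PyTorch, nothing specific to this repo); Blackwell cards report compute capability 12.x:

```python
import torch

print("torch:", torch.__version__)    # expect e.g. 2.9.1+cu130
print("cuda:", torch.version.cuda)    # expect 13.x
if torch.cuda.is_available():
    print("gpu:", torch.cuda.get_device_name(0))
    # (12, 0) = consumer Blackwell, needed for the fast NVFP4 kernels
    print("compute capability:", torch.cuda.get_device_capability(0))
```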

Also, if the model (or part of it) is being offloaded, your GPU can't get up to speed, which means NVFP4 isn't going to help. Maybe without any VRAM flags there isn't enough VRAM for the text encoder and the Wan model (let alone the two Wan models).
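
An easy way to check is to watch free VRAM while sampling - if it sits near zero, ComfyUI is almost certainly offloading:

```python
import torch

# free/total device memory in bytes; run this while the sampler is going
free, total = torch.cuda.mem_get_info(0)
print(f"free: {free / 2**30:.1f} GiB of {total / 2**30:.1f} GiB")
```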

I still run my RTX 5060 Ti 16 GB with the --lowvram flag. And in a Wan workflow I place 'VRAM cleaner' nodes between each step to tell ComfyUI explicitly to unload the previous model(s).
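
For anyone without those custom nodes, this is roughly the effect they have between steps (an approximation - the actual cleaner nodes also go through ComfyUI's model management to unload models, which I'm not reproducing here):

```python
import gc
import torch

gc.collect()                 # drop Python references to old tensors
torch.cuda.empty_cache()     # return cached blocks to the driver
print(f"{torch.cuda.memory_allocated(0) / 2**30:.1f} GiB still allocated")
```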

I always take a look at the GPU temperature during a workflow run to see if my GPU is actually working or not. I've seen situations where my GPU usage is constantly at 100%, but the temperature isn't above idle temps. That 100% usage is just copying data from system memory - no real calculations.
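
You can script the same check with NVML (the nvidia-ml-py package); high utilization combined with idle-level temperature and low power draw is the tell:

```python
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)
util = pynvml.nvmlDeviceGetUtilizationRates(h)
temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
power = pynvml.nvmlDeviceGetPowerUsage(h) / 1000  # reported in milliwatts
print(f"util {util.gpu}% | temp {temp}C | power {power:.0f}W")
pynvml.nvmlShutdown()
```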

Anyway, on my system (an older 5800X3D, 16 GB DDR4 system RAM, 16 GB 5060 Ti) I get at most 55 s/it with Wan GGUF, but this NVFP4 is consistently at 35 s/it.
