No difference?

#1
by Seeker36087 - opened

Maybe I'm doing something wrong here, but I'm not seeing any difference in speed compared to using FP8 or GGUF versions...

I'm running an RTX 5070 and loading the diffusion models through the standard 'Load Diffusion Model' node in ComfyUI - everything runs, but it runs just as slowly as, if not more slowly than, a standard non-FP4 model.

Am I doing something wrong here or is it an issue with the quantisation?

If you have --novram set, it will run the text encoders on the CPU; otherwise it should be pretty much instant no matter whether you're on NVFP4, FP8, or another format. On the CPU, NVFP4 will need to dequantize, which makes the model a tiny bit slower than FP8, comparable to GGUF.
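
For intuition, here's a rough Python sketch of what that dequant step amounts to. The E2M1 value table is the real FP4 set, but the layout and function are simplified stand-ins, not the actual kernel (which works on packed 4-bit data):

```python
import torch

# NVFP4 stores 4-bit E2M1 values in blocks of 16, each block carrying an
# FP8 (E4M3) scale, plus one FP32 scale for the whole tensor.
E2M1_VALUES = torch.tensor(  # the 16 representable FP4 (E2M1) values
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def dequant_nvfp4(codes, block_scales, tensor_scale):
    """codes: indices into E2M1_VALUES, shape (n_blocks, 16);
    block_scales: per-block scales, shape (n_blocks, 1)."""
    # A table lookup plus two multiplies per weight -- this extra work
    # is why the CPU path ends up a bit slower than plain FP8.
    return E2M1_VALUES[codes] * block_scales * tensor_scale

codes = torch.randint(0, 16, (4, 16))      # 4 blocks of 16 weights
block_scales = torch.rand(4, 1) + 0.5      # stand-in E4M3 scales
print(dequant_nvfp4(codes, block_scales, tensor_scale=0.01).shape)
```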

I don't currently have a VRAM launch argument set, so ComfyUI defaults to Normal VRAM.

Am I using it correctly, loading it directly into the 'Load Diffusion Model' node the same as any other model?

I tried running a 720x720 generation with 81 frames - which would usually take around 30-40 s/it with FP8 - and I was getting the same 30-40 s/it.
That was with CFG at 1 and Lightning LoRAs loaded. When I tried running without them and CFG at 3, each step took significantly longer.

Are you talking about the text encoder running faster or LTX running faster? This only affects the text encoder, which runs at the start of execution to encode your prompt.

Hmm, the text encoder has always been pretty fast for me with FP8 - it's only when using GGUF models that it's slow as hell.

I thought the idea with NVFP4 was that the model ran faster. In a ComfyUI Discord announcement yesterday about NVFP4 they claimed that "with last week's release, NVFP4 models give you double the sampling speed of FP8 on NVIDIA Blackwell GPUs" πŸ€”

Yes, but this NVFP4 is the text encoder in NVFP4. If you want a speedup with LTX 2, you should use the LTX 2 NVFP4 model.

Wait, actually I'm confusing this with a different repo

Right, so this repo is theoretically faster than standard Wan, mostly in sampling time. You won't see much speedup in comparison if you're running with speedup LoRAs, though, since they don't use a lot of steps. For 20 steps it will usually be faster than the standard model.
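
Rough numbers to illustrate (all made up, not measurements): with some fixed per-run overhead for model loading, text encoding, and VAE decode, a per-step speedup barely shows at 4 LoRA steps but adds up at 20:

```python
# Back-of-envelope for why few-step LoRA runs hide a sampling speedup.
# 35 s/it vs 18 s/it and the 60 s overhead are illustrative numbers only.
def run_time(steps, sec_per_it, overhead=60):
    # overhead = model load, text encode, VAE decode (same either way)
    return overhead + steps * sec_per_it

for steps in (4, 20):
    fp8 = run_time(steps, 35)
    nvfp4 = run_time(steps, 18)  # hypothetical ~2x faster sampling
    print(f"{steps} steps: FP8 {fp8}s vs NVFP4 {nvfp4}s "
          f"({fp8 / nvfp4:.2f}x overall)")
```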

I uploaded it here on request; there isn't a particularly huge speedup compared to the regular models. I might have a fix for that later, though, using a LoRA to correct the imprecise weights (similar to SVDQuant) instead of using mixed precision.
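
For anyone curious, a minimal sketch of that idea: take the residual between the original weights and their quantized reconstruction and keep its top singular components as a low-rank correction running alongside the quantized layer. The quantizer below is a uniform-rounding stand-in, not NVFP4, and how much error this recovers depends on the error's structure:

```python
import torch

def fake_quant(w, levels=16):
    # stand-in for NVFP4: uniform rounding to `levels` values
    scale = w.abs().max() / (levels // 2 - 1)
    return (w / scale).round().clamp(-(levels // 2), levels // 2 - 1) * scale

w = torch.randn(512, 512)
w_q = fake_quant(w)

# Low-rank (rank-32) approximation of the quantization error,
# applied as an extra LoRA branch next to the quantized weights:
u, s, vh = torch.linalg.svd(w - w_q)
r = 32
lora_a = u[:, :r] * s[:r]   # (512, r)
lora_b = vh[:r, :]          # (r, 512)

err_before = (w - w_q).norm()
err_after = (w - (w_q + lora_a @ lora_b)).norm()
print(f"residual norm: {err_before:.2f} -> {err_after:.2f}")
```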

Well, I just appreciate the fact that you're trying to create an NVFP4 model of WAN 2.2 - as it seems to have been the holy grail for a while now!

I did try running without the Lightning LoRAs and with a higher CFG, but I was getting 45 seconds per step - and at 20 steps total, that was a bit more than I was willing to commit to 🀣

Kudos for your work though and I look forward to seeing it develop πŸ‘

I'm consistently seeing an almost 2x speed improvement on NVFP4 layers vs FP8/FP16, and also compared to GGUF model speed.

Up-to-date drivers, PyTorch 2.9.1 with CUDA 13, and an updated ComfyUI are required, though.
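
A quick way to verify the stack (plain PyTorch, nothing specific to this repo); Blackwell cards report compute capability 12.x:

```python
import torch

print("torch:", torch.__version__)    # expect e.g. 2.9.1+cu130
print("cuda:", torch.version.cuda)    # expect 13.x
if torch.cuda.is_available():
    print("gpu:", torch.cuda.get_device_name(0))
    # (12, 0) = consumer Blackwell, needed for the fast NVFP4 kernels
    print("compute capability:", torch.cuda.get_device_capability(0))
```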

Also, if the model (or part of it) is being offloaded, your GPU can't get up to speed, which means NVFP4 isn't going to help. Maybe without any VRAM flags there isn't enough VRAM for the text encoder and the Wan model (let alone the two Wan models).
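
An easy way to check is to watch free VRAM while sampling - if it sits near zero, ComfyUI is almost certainly offloading:

```python
import torch

# free/total device memory in bytes; run this while the sampler is going
free, total = torch.cuda.mem_get_info(0)
print(f"free: {free / 2**30:.1f} GiB of {total / 2**30:.1f} GiB")
```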

I still run my RTX 5060 Ti 16 GB with the --lowvram flag. And in a Wan workflow I place 'VRAM cleaner' nodes between each step to tell ComfyUI explicitly to unload the previous model(s).
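
For anyone without those custom nodes, this is roughly the effect they have between steps (an approximation - the actual cleaner nodes also go through ComfyUI's model management to unload models, which I'm not reproducing here):

```python
import gc
import torch

gc.collect()                 # drop Python references to old tensors
torch.cuda.empty_cache()     # return cached blocks to the driver
print(f"{torch.cuda.memory_allocated(0) / 2**30:.1f} GiB still allocated")
```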

I always take a look at the GPU temperature during a workflow run to see if my GPU is actually working or not. I've seen situations where my GPU usage is constantly at 100%, but the temperature isn't above idle temps. That 100% usage is just copying data from system memory - no real calculations.
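
You can script the same check with NVML (the nvidia-ml-py package); high utilization combined with idle-level temperature and low power draw is the tell:

```python
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)
util = pynvml.nvmlDeviceGetUtilizationRates(h)
temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
power = pynvml.nvmlDeviceGetPowerUsage(h) / 1000  # reported in milliwatts
print(f"util {util.gpu}% | temp {temp}C | power {power:.0f}W")
pynvml.nvmlShutdown()
```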

Anyway, on my system (an older 5800X3D, 16 GB DDR4 system RAM, 16 GB 5060 Ti) I get at most 55 s/it with Wan GGUF, but this NVFP4 is consistently at 35 s/it.
