On my Tesla V100, it's ten times slower than SDXL.
Hello, I am using a Tesla V100 16GB, and this model is approximately 10 times slower than other SDXL models. With the same sampler settings (30 steps, 1024x1024), 'noob' or 'Illustrious' takes about 13 seconds, whereas 'anima' takes 130 seconds.
Is it FP16?
Or do I need to update something? z-image-turbo and flux2-klein also run fine on my system, taking about 10-15 seconds for 4-8 steps.
It is slower due to the higher step count, and the model is in fp16.
Please try the distilled fp8mixed version, or wait for the distilled model after the full release.
So the speed I measured is expected, then? It really is about ten times slower than SDXL with the same parameters (30 steps, 1024x1024)?
I understand now. He's running it in BF16, but the Tesla V100 doesn't support BF16, so it can only run in FP32.
However, the V100's FP32 performance is far lower than its FP16 performance.
There was the same problem a month ago when z-image-turbo was released.
I forced fp16 for the UNet in the settings, and the speed returned to normal, similar to an RTX 3080, generating one image in about 30 seconds. However, it only produces completely black images.
If FP8 is not supported, the weights are automatically upcast to FP32, so there is no speed improvement.
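The fallback described above can be checked directly: bf16 requires CUDA compute capability 8.0 (Ampere) or newer, while the V100 is Volta at 7.0. A minimal sketch (the capability threshold is standard CUDA; the helper name is mine, not part of any library):

```python
# Sketch: decide whether a GPU can run bf16 natively, based on its
# CUDA compute capability. bf16 requires Ampere (8.0) or newer;
# the Tesla V100 is Volta (7.0), so frameworks fall back to fp32.
def cc_supports_bf16(capability):
    """capability is a (major, minor) tuple, e.g. the value returned
    by torch.cuda.get_device_capability()."""
    major, minor = capability
    return (major, minor) >= (8, 0)

print(cc_supports_bf16((7, 0)))  # Tesla V100 (Volta) -> False
print(cc_supports_bf16((8, 9)))  # RTX 4060 Ti (Ada)  -> True
```

With PyTorch on an actual GPU, `torch.cuda.is_bf16_supported()` performs an equivalent check.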
Ahhh, I forgot that the V100 is an older architecture.
It would be great if the creators of this model would consider making it work in fp16.
It's the same with many AMD GPUs (apart from the latest models): they're forced to use fp32 and are thus very slow.
On a bf16-capable GPU (4060 Ti 16GB), it's only about 3x slower than SDXL, which is expected since this model is more compute-intensive. The rest of the slowdown could be attributed to fp32 upcasting (as the model is in bf16), but since all of my GPUs support bf16, I can't really prove this.
I've now solved the problem. I found a patch: https://civitai.com/models/2356447?modelVersionId=2652286.
Place it at this path: ComfyUI\custom_nodes\anina_fp16_patch.py
After installing the fp16 patch, it can now run successfully in fp16 mode.
According to the author, "40xx" cards have also received some speed improvements.
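For anyone wondering why the naive fp16 cast earlier in the thread produced black images: fp16's largest finite value is 65504, so any activation beyond that overflows to infinity, and the resulting NaNs propagate through the network and decode to a black image. A quick numpy illustration (not the model's actual code):

```python
import numpy as np

# fp16's largest finite value is 65504; larger magnitudes overflow to inf.
big = np.float16(70000.0)
print(big)              # inf

# Once an inf appears, common operations (e.g. inf - inf inside a
# softmax or normalization) produce NaN, which then propagates
# through the rest of the network.
print(big - big)        # nan

# bf16 sidesteps this entirely because it shares fp32's exponent range;
# an fp16 patch typically keeps overflow-prone layers in fp32 instead.
print(np.finfo(np.float16).max)   # 65504.0
```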