On my Tesla V100, it's ten times slower than SDXL.
Hello, I am using a Tesla V100 16GB, and this model is approximately 10 times slower than other SDXL models. With the same sampler settings (30 steps, 1024x1024), 'noob' or 'Illustrious' takes about 13 seconds, whereas 'anima' takes 130 seconds.
Is it FP16?
Or do I need to update something? z-image-turbo and flux2-klein also run fine on my system, taking about 10-15 seconds for 4-8 steps.
It is slower due to the higher step count, and the model is in fp16.
Please try the distilled fp8mixed version, or wait for the distilled model after the full release.
So the speed I measured is expected, then? It really is about ten times slower than SDXL with the same parameters (30 steps, 1024x1024)?
I understand now. He's running it in BF16, but the Tesla V100 doesn't support BF16, so it can only run in FP32.
However, the V100's FP32 performance is far lower than its FP16 performance.
There was the same problem a month ago when z-image-turbo was released.
I forced fp16 for the UNet in the settings, and the speed returned to normal, similar to an RTX 3080, generating one image in about 30 seconds. However, it only produces completely black images.
If FP8 is not supported, the weights are automatically upcast to FP32, so there is no speed improvement.
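The fallback described above can be checked directly: bf16 requires CUDA compute capability 8.0 (Ampere) or newer, while the V100 is Volta at 7.0. A minimal sketch (the capability threshold is standard CUDA; the helper name is mine, not part of any library):

```python
# Sketch: decide whether a GPU can run bf16 natively, based on its
# CUDA compute capability. bf16 requires Ampere (8.0) or newer;
# the Tesla V100 is Volta (7.0), so frameworks fall back to fp32.
def cc_supports_bf16(capability):
    """capability is a (major, minor) tuple, e.g. the value returned
    by torch.cuda.get_device_capability()."""
    major, minor = capability
    return (major, minor) >= (8, 0)

print(cc_supports_bf16((7, 0)))  # Tesla V100 (Volta) -> False
print(cc_supports_bf16((8, 9)))  # RTX 4060 Ti (Ada)  -> True
```

With PyTorch on an actual GPU, `torch.cuda.is_bf16_supported()` performs an equivalent check.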
Ahhh, I forgot that the V100 is an older architecture.
It would be great if the creators of this model would consider making it work in fp16.
It's the same with many AMD GPUs (apart from the latest models): they're forced to use fp32 and are thus very slow.
On a bf16-capable GPU (4060 Ti 16GB), it's only about 3x slower than SDXL, which is expected since this model is more compute-intensive. The rest of the slowdown could be attributed to fp32 upcasting (as the model is in bf16), but since all of my GPUs support bf16, I can't really prove this.
I've now solved the problem. I found a patch: https://civitai.com/models/2356447?modelVersionId=2652286.
Place it at this path: ComfyUI\custom_nodes\anina_fp16_patch.py
After installing the fp16 patch, it can now run successfully in fp16 mode.
According to the author, "40xx" cards have also received some speed improvements.
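For anyone wondering why the naive fp16 cast earlier in the thread produced black images: fp16's largest finite value is 65504, so any activation beyond that overflows to infinity, and the resulting NaNs propagate through the network and decode to a black image. A quick numpy illustration (not the model's actual code):

```python
import numpy as np

# fp16's largest finite value is 65504; larger magnitudes overflow to inf.
big = np.float16(70000.0)
print(big)              # inf

# Once an inf appears, common operations (e.g. inf - inf inside a
# softmax or normalization) produce NaN, which then propagates
# through the rest of the network.
print(big - big)        # nan

# bf16 sidesteps this entirely because it shares fp32's exponent range;
# an fp16 patch typically keeps overflow-prone layers in fp32 instead.
print(np.finfo(np.float16).max)   # 65504.0
```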