Want to know the difference
What is the difference between the scaled and input_scaled models, Kijai? Thanks for your contributions to the open source community, love u.
It is mentioned in the readme.
Pro tip:
Simply ask an AI about the difference, with the HuggingFace link! This helps me so much when I want to know something simple without bothering the contributor.
The input_scaled one is said to be more than 50% faster on RTX 40 series or newer.
But the v2 version is said to have terrible quality, according to someone's test on Reddit.
Do you need to select 'fp8_e4m3fn_fast' in the 'Load Diffusion Model' node for these input_scaled models to work? In my own tests, input_scaled (v1) was slower than the normal fp8 model. V2 was around the same speed as the normal fp8, but the quality was pretty bad. Still, I'm aware Kijai mentioned these are experimental models and he's doing this for us completely free, so zero complaints here.
My weak 4060 Ti (16 GB) or my launch flags (--fast fp16_accumulation fp8_matrix_mult?) could quite possibly be to blame here. Haha.
You should leave it at default; these are mixed precision models that include many bf16 layers too. The fp8 layers are already marked to use fp8 matmuls (fp8_fast).
The input_scaled models in default mode are ~40% faster for me.
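For anyone curious about the conceptual difference, here is a rough numpy sketch. This is my own illustration, not Kijai's actual code: the `fake_fp8` helper and both matmul functions are made-up stand-ins, and real fp8 kernels work on the GPU rather than in numpy. The idea is that "scaled" models only store the weights in fp8 and dequantize before a high-precision matmul, while "input_scaled" models also scale the activations into fp8 range, so the matmul itself can run on fp8 tensor cores (the speed win on RTX 40 series and newer).

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in fp8 e4m3

def fake_fp8(x):
    """Crude fp8-e4m3 stand-in: clamp to the fp8 range, keep ~4 mantissa bits."""
    x = np.clip(x, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    mant, exp = np.frexp(x)               # x = mant * 2**exp, |mant| in [0.5, 1)
    mant = np.round(mant * 16.0) / 16.0   # coarsen the mantissa grid
    return np.ldexp(mant, exp)

def weight_scaled_matmul(x, w_fp8, w_scale):
    # "scaled" style: only the weights live in fp8; they are dequantized
    # back to high precision and the matmul runs in fp16/bf16.
    return x @ (w_fp8 * w_scale)

def input_scaled_matmul(x, w_fp8, w_scale):
    # "input_scaled" style: the activations are scaled into fp8 range too,
    # so the matmul itself could run on fp8 hardware; the two scales are
    # multiplied back afterwards to recover the original magnitude.
    x_scale = np.abs(x).max() / FP8_E4M3_MAX
    x_fp8 = fake_fp8(x / x_scale)
    return (x_fp8 @ w_fp8) * (x_scale * w_scale)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
w = rng.standard_normal((64, 32)).astype(np.float32)
w_scale = np.abs(w).max() / FP8_E4M3_MAX
w_fp8 = fake_fp8(w / w_scale)

ref = x @ w
for name, out in [("weight-scaled", weight_scaled_matmul(x, w_fp8, w_scale)),
                  ("input-scaled", input_scaled_matmul(x, w_fp8, w_scale))]:
    err = np.abs(out - ref).max() / np.abs(ref).max()
    print(f"{name}: max relative error {err:.4f}")
```

Both variants stay close to the full-precision result; the input_scaled path just trades a little extra rounding on the activations for the ability to do the multiply itself in fp8.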
Thank you, Kijai. I just tried without any launch parameter tweaks (only sage attention) and saw no change; the old fp8_scaled is still faster for me by around 30%. Weird, perhaps it's some weirdness with Arch Linux, or because I'm using all-nightly ComfyUI and repos, haha. Also I'm using Python 3.14.3 and the latest torch 2.12 dev from today.
Anyway, don't worry about it, you have enough on your plate. I'm very happy with everything. The speed is good enough for me, so no complaints from me.
I just wanted to say I was wrong in my earlier assumptions. The input_scaled models are indeed faster, as you said, Kijai. Also, when I mentioned that input_scaled gave bad quality, the reason was most likely that I had mistakenly used the distill lora with an already distilled model, making the quality really bad. The actual difference is not very large. So I gather everything that contradicted what you said was my own human error. Live and learn, I guess. Haha.