Workflow
Hello, can you please also share your workflow? I am not even close to your 2.03s/it. On an RTX 3090 I get 5.6s/it at 1MP.
The workflow I used was the default ComfyUI workflow, with just the UNet loaders replaced with the int8 fast ones. 5.6s/it sounds very anomalous, even slower than fp8 would be. I would suggest using https://github.com/kijai/ComfyUI-MemoryVisualization to see whether there is major model offloading going on or something, because that shit ain't sounding right.
(Edit: If you are using torch compile, you need to use the compile model advanced node from https://github.com/kijai/ComfyUI-KJNodes with dynamic VRAM disabled; not doing so may explain some of the slowdown.)
Can you upload your workflow, @bertbobson? I tried using your diffuser loader node along with the KJ workflow and I'm getting garbled output.