
Amazing work, few questions!

#1
by Luke2642 - opened

I think your philosophy of emphasising training speed with pragmatic architecture choices is a good one, I'm glad it worked out!

I'd love to know more, especially as you mentioned many AI suggestions from papers were fruitless. I love reading all the latest developments, so I'd greatly appreciate it if you would address or dismiss any of the following points I've been wondering about.

I wrote this big list one day while indulging a little fantasy of an alternate timeline where we send a USB stick back to Robin Rombach and Patrick Esser (architects), Emad Mostaque (Stability financier) and Christoph Schuhmann (LAION). What would it contain?

  • Aspect Ratio Bucketing instructions

  • Training the VAE in the OKLAB colourspace instead of RGB. Any MSE loss term then instantly aligns more closely with human perception.

  • Equivariance loss in VAE training, not rigidly enforced by architecture, just gentle equivariance in the loss smooths out latent manifold. 4x-7x faster training. https://github.com/zelaki/eqvae

  • 2D-RoPE in the UNet and scale-conditioned diffusion for texture vs. global learning

  • Flow Matching / V-Pred - 2x faster training

  • Zero-Terminal SNR / Offset Noise - for HDR / high contrast

  • Cosine Noise Schedule - more compute on harder steps in middle

  • FreeU at training time - skip connection re-weighting and frequency separation

  • Early flash attention and Quantisation Aware Training for fp16 and int8, so it runs on a potato, 2x faster training

  • Native ControlNet to ingest canny/depth/normal maps - probably yields 2x faster training.

  • Optional/disposable output heads that emit depth + normal maps, encouraging 3D geometry learning, probably yields 2x faster training.

  • The simple maths trick DSINE used to massively improve normal map prediction.

  • ConvNeXt-Style Large Kernel Depthwise Convolutions

  • Native tiled VAE overlap or padding support without artefacts

  • SLERP to prevent variance squashing when lerping high-dimensional noise/latents

  • Decoupled Weight Decay & Modern Optimizers - Lion or Prodigy for weight decay, 3x faster training

  • SNR Loss Weighting Strategy - more learning when generating real structure not just pure noise.

  • Aesthetic Score & Quality Conditioning - a linear probe on CLIP to augment LAION alt text with score_0-9 tags, to learn from bad data without lowering quality.

  • Masked Diffusion Training (MDT) - for global coherence and inpainting

  • Blockwise Flow Matching (BFM) - improved handling of different timesteps, 3x faster learning
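
To illustrate the SLERP point above: two high-dimensional Gaussian vectors are almost always nearly orthogonal, so a plain lerp at t=0.5 shrinks the norm by roughly 1/√2, handing the model an under-variance latent it was never trained on, while SLERP preserves it. A minimal numpy sketch (the dimensionality and seed are arbitrary demo choices):

```python
import numpy as np

def slerp(a, b, t):
    # Spherical interpolation between two high-dimensional vectors.
    # Unlike lerp, it (approximately) preserves the norm of Gaussian noise.
    a_n = a / np.linalg.norm(a)
    b_n = b / np.linalg.norm(b)
    omega = np.arccos(np.clip(np.dot(a_n, b_n), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return (1 - t) * a + t * b  # nearly parallel: fall back to lerp
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
y = rng.standard_normal(4096)

mid_lerp = 0.5 * (x + y)      # std collapses toward ~0.71
mid_slerp = slerp(x, y, 0.5)  # std stays near 1.0
print(mid_lerp.std(), mid_slerp.std())
```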

Taken together, the advancements since 2022 could unlock a 100x improvement in training speed over SD 1.5. But how well do they really stack?

There are a lot more challenges. How much did you invest in recaptioning and dataset enhancement?

I was thinking we need the data and excellent captions from the best VL model, and we need extensive pre-processing to generate top-quality depth, normal and segmentation maps for every image, and also to de-matte images into foreground and background layers, to train an alpha/matting-aware network natively - something sorely lacking in almost every architecture. This could even live in the VAE: the VAE itself could be alpha-layer-aware.

Of course there are many other improvements like spatially aware text encoders and latent masking conditioning, the list goes on!

This project seems aligned with yours too:

https://github.com/shallowdream204/dico

You mentioned Sana, but what about

https://huggingface.co/amd/Nitro-T-0.6B

https://stability.ai/news/introducing-stable-cascade

https://github.com/PixArt-alpha/PixArt-sigma

Anyway, sorry for the crazy long question!

AiArtLab org

Please forgive me, but I don't have the time, energy, or desire to refute these articles. I just wanted to point out that following them and believing they can truly accelerate training 10-100x is, in my opinion, a misconception. However, thanks to you, I realized it's better not to mention this at all. In my opinion, some of these articles are greatly underestimated, such as the claim that flow matching can accelerate training by dozens or even hundreds of times. On the contrary, some things are overrated. For example, I got better quality than EQ-VAE in 1 day on 1 GPU ( https://huggingface.co/collections/AiArtLab/vae ), and in my opinion Pixart Sigma is also incredibly poorly trained, just like Sana. Or the Cosine Noise Schedule: I experimented with something like this a lot and didn't get improvements; it's just another training run. But I may be wrong. Don't believe me. Don't believe the articles. Trust yourself and your experiments. Good luck!

The results here are still very poor even when compared to SD1.5.

Thanks recoilme for your reply. I do really appreciate it! I wasn't trying to take anything away from your success, nor was I trying to suggest I know better, I don't!

I'd compiled the list months ago, before I saw your awesome project, and like Dico in particular, I was just happy to see someone pushing the limits of what a unet can do.

I think the goal of EQ-VAE was to help speed up training, not boost quality: the diffusion model can learn a concept and its spatial relationships faster if mirrored, flipped, rotated or scaled images map to correspondingly transformed latents. Intuitively it makes sense, and they do admit the quality drop in their analysis. Maybe it needs more latent channels to maintain quality, and then the speed-up is reduced. I don't know. It would be a dream if a model could train at 256 and output 1024, just like you achieved with your 2x VAE decode stage!
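
For anyone curious, the equivariance idea can be sketched in a few lines: penalise the difference between encoding a transformed image and transforming the encoding. This toy numpy version uses a single 3x3 conv + ReLU as a stand-in for the VAE encoder and a horizontal flip as the transform; all names and shapes are illustrative, not the actual EQ-VAE code:

```python
import numpy as np

def encode(x, w):
    # Toy "encoder": one 3x3 valid convolution followed by ReLU.
    h, wd = x.shape
    out = np.zeros((h - 2, wd - 2))
    for i in range(h - 2):
        for j in range(wd - 2):
            out[i, j] = max(0.0, float(np.sum(x[i:i+3, j:j+3] * w)))
    return out

def eq_loss(x, w):
    # EQ-VAE-style regulariser: encoding a flipped image should equal
    # flipping the encoding. Added on top of the usual recon/KL losses.
    z_of_flip = encode(x[:, ::-1], w)
    flip_of_z = encode(x, w)[:, ::-1]
    return float(np.mean((z_of_flip - flip_of_z) ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))
w_asym = rng.standard_normal((3, 3))  # arbitrary kernel: not flip-equivariant
w_sym = w_asym + w_asym[:, ::-1]      # left-right symmetric kernel: equivariant
print(eq_loss(x, w_asym), eq_loss(x, w_sym))
```

Minimising the loss nudges the encoder toward the symmetric-kernel behaviour without hard-wiring it into the architecture, which matches the "gentle equivariance" framing in the list above.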

This principle is why I thought training the ControlNet inputs and disposable 3D-geometry or layer output heads would actually boost training speed too, if the augmented data is good. The trick DSINE used of making normal maps relative rather than absolute made an enormous difference to normal map prediction at the time: something like a 100x improvement, 10x less data, 10x faster training.

Anyway, I am grateful for your reply and time, I wish you the best too!

AiArtLab org

Thanks for the ideas Luke! I haven't looked into ControlNet yet, I want to think about that later. As for the channels: in my case, actually reducing them from 128 (Flux2 VAE) to 32 dramatically sped up training. Though, this might be specific to this UNet's architecture: the first Conv2D block going from 32 to 320 channels just seems to work better than 128 to 320. Good luck!
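
To put rough numbers on that first block (assuming a standard 3x3 Conv2D, which is a guess about the exact layer):

```python
def conv2d_params(c_in, c_out, k=3):
    # Conv2D parameter count: weights (c_out, c_in, k, k) plus one bias per output channel.
    return c_out * c_in * k * k + c_out

print(conv2d_params(128, 320))  # 128-ch latent (Flux2-VAE-style): 368,960 params
print(conv2d_params(32, 320))   # 32-ch latent: 92,480 params, roughly 4x fewer
```

The first layer alone is small either way; the bigger effect is that a 4x narrower latent shrinks everything downstream that touches it.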

https://github.com/shitagaki-lab/see-through

Just saw this, more training data augmentation possibilities, even without architecture changes to make alpha native.

recoilme changed discussion status to closed
