Feedback on your model
This model yields the most detailed photos in under 20 seconds (when compared to z-image or flux). I think the architecture tweaks you found are very useful, especially for getting rid of the VAE.
I'm currently finetuning the model with my training code and it yields good results (although training is a bit unstable unless you lower the learning rate). I've noticed it is hard to get coherent hands and text, probably because of the low parameter count or the limited amount of training data.
SANA could output fairly simple coherent text and good hands, but overall image coherence was lacking. PixelDiT, however, seems to fix this issue while adding finer detail to the images.
Congratulations on developing such a model, you have something really innovative here. I would assume the same concept of separate patch-wise and pixel transformer blocks could also be applied to video generation...