Plans to upscale the parameter count?

#139

by kabachuha - opened May 2

May 2

Hi! I really like the model; however, I wonder if the parameter count might be a limitation, because it is even smaller than Z-Image's.

In the LLM community there is a well-liked practice of making so-called RYS or "depth upscale" models: layers are duplicated to give the model a higher parameter count without retraining from scratch, while preserving most of the model's knowledge. These upscales can later be trained further to get an even better grasp of the subjects.

Example depth upscales: https://dnhkng.github.io/posts/rys/, DavidAU's fine-tunes like https://huggingface.co/DavidAU/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NEO-CODE-Di-IMatrix-MAX-GGUF
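To make the idea concrete, here is a minimal sketch of the layer duplication step, assuming the model is just an ordered list of blocks. The `depth_upscale` helper, the dict-based "layers", and the chosen indices are all hypothetical and only illustrate the technique; they are not taken from any of the linked projects or from Anima's code.

```python
import copy


def depth_upscale(layers, start, end, repeats=2):
    """Return a new layer list where layers[start:end] appears `repeats` times in a row.

    Hypothetical helper: every layer is deep-copied so the duplicated
    blocks can later be fine-tuned independently of the originals.
    """
    if not (0 <= start < end <= len(layers)):
        raise ValueError("need 0 <= start < end <= len(layers)")
    if repeats < 1:
        raise ValueError("repeats must be >= 1")

    block = layers[start:end]
    upscaled = [copy.deepcopy(layer) for layer in layers[:start]]
    for _ in range(repeats):
        upscaled.extend(copy.deepcopy(layer) for layer in block)
    upscaled.extend(copy.deepcopy(layer) for layer in layers[end:])
    return upscaled


# Toy stand-in for a transformer stack: each "layer" is a dict holding its weights.
base = [{"name": f"block_{i}", "weight": [float(i)]} for i in range(6)]

# Repeat the middle blocks 2..4 once more, going from 6 to 9 layers.
bigger = depth_upscale(base, start=2, end=5, repeats=2)
order = [layer["name"] for layer in bigger]
print(len(base), "->", len(bigger), order)
```

Which blocks to repeat and how many times is the experimental part; the sketch only shows the bookkeeping, not whether the result still produces good outputs without further training.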

Would you like to do this or have you already tried it and the model broke?

guri06

May 2

Great idea, as much as applying the DeepSeek-V4-Pro 1.6T A49B Text Encoder.

onixxexxd5555LOAF

May 2

So the thing is that even the most minuscule changes to the text encoder (such as very low KLD abliteration or even "high quality" quants like q8/fp8) degrade the output quality, because the model is trained on the very precise outputs the text encoder currently generates. So even if we ended up with a significantly smarter text encoder, it would take very extensive and expensive training to take advantage of it.
Furthermore, tdrussell has said in the past that he believes the current text encoder is good enough and that the model is being bottlenecked in other ways. (I believe he didn't elaborate further, but it sounds believable to me.)
And lastly, Anima is also a weird Frankenstein model, with Qwen 0.6 outputs being mapped to the underlying T5 by a tiny LLM adapter duct-taping both together, which further complicates any such architecture change proposals.
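As a rough illustration of why even a "high quality" quant is not a no-op, here is a toy sketch that rounds a handful of made-up weights to int8 and back. It assumes simple per-tensor symmetric rounding, which is a simplification and not the actual q8 or fp8 format; the function names and the values are hypothetical.

```python
import math


def quantize_int8(values):
    """Symmetric per-tensor int8 rounding (a simplification of real q8 schemes)."""
    scale = max(abs(v) for v in values) / 127.0
    if scale == 0.0:
        return [0] * len(values), 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the int8 codes back to floats."""
    return [q * scale for q in quantized]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up weights, purely for illustration.
weights = [0.013, -0.42, 0.3071, 0.0009, -0.155]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

max_error = max(abs(w - r) for w, r in zip(weights, restored))
similarity = cosine_similarity(weights, restored)
print("max abs error:", max_error)
print("cosine similarity:", similarity)
```

The restored weights are very close to the originals but not identical, and small weights can be rounded to zero. The sketch only shows the perturbation of the weights themselves; how much that shifts the encoder's outputs, and how sensitive the image model is to the shift, is the claim made above and is not demonstrated here.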

kabachuha

May 3

@onixxexxd5555LOAF I meant increasing the layer count of the Anima Cosmos image transformer, without touching the text encoder.

onixxexxd5555LOAF

May 3

In that case, don't get me wrong, I am interested in seeing weird experiments like these: whether a DiT equivalent of the layer duplication trick can be done, whether it would be functional with no or minimal fine-tuning, and how much it would help. But this model has been training for months and is closer to the finish line than to the beginning. It would be fairly crazy to do highly experimental architectural surgery at this point. But perhaps someone can tinker with this after the model is done training.

mingyi456

May 5

> In that case, don't get me wrong, I am interested in seeing weird experiments like these: whether a DiT equivalent of the layer duplication trick can be done, whether it would be functional with no or minimal fine-tuning, and how much it would help.

@onixxexxd5555LOAF I think this has actually been done before, with the Flux.1-heavy model by city96. He says it is a self-merge over here, and it required a bit of training to "recover" from the self-merge.
