Scaling up the Qwen3 text encoder
Hi team,
First of all, amazing work on Anima! (I know I've said it before, but I mean it, especially as it's my favourite model at the moment.) I've been really impressed with what you've put together.
I just wanted to drop a quick, friendly suggestion for any future iterations you might be planning. Have you considered experimenting with a larger version of the Qwen3 text encoder?
While the current setup works well, stepping up to a larger Qwen3 model could potentially take prompt comprehension and complex-detail adherence to the next level. For example, models like the legendary Z Image Turbo/base currently use Qwen3 4B with really fantastic results, and Flux 2 Klein utilizes Qwen3 8B to achieve incredible text, world, and character understanding with exceptional prompt adherence.
I have to say, though, that the current text encoder Anima is using (Qwen3 0.6B) has good prompt adherence too.
I definitely don't want to sound like I'm telling you how to build your model; I know how much work goes into balancing these architectures! I just wanted to share some thoughts on what is working beautifully in the space right now as a potential idea for down the road.
Either way, I'm really looking forward to seeing how Anima continues to evolve. Thanks for all your hard work! Like, really hard work, and well done listening to the community here too. I know the licence isn't the most popular, but I can see this being the next SDXL/Illustrious soon.
Why do you need a bigger Qwen model? What is the improvement? Is your reasoning that a bigger parameter count = a better model?
I'm not telling you how to build your model.
Tells him how to build his model, doesn't have proof or data saying why, then uses anecdotal evidence citing another model whose only legendary point is having spent a ton of money on training with a mostly old arch.
You need to present some proof if you think Russel is wrong to use the 0.6B-param model.
Bigger is not always better. Depending on the situation and the design, an appropriately sized model can deliver much better performance at more reasonable cost.
0.6B is probably cheaper and faster to train, and it makes the resulting model as accessible as Illustrious/NoobAi, which is their target audience. You can see how quickly PonyV7 died because it decided to go with a completely obscure base that suddenly needed 2x the compute of its predecessor.
Made on NotebookLM using both this website and a great YouTube video review by Fahd Mirza as the sources.
What does this have to do with the text encoder?
This was already discussed in a previous thread, and scruffynerf summed up the gist of it:
"User -> any human language -> Qwen3 0.6b into qwen tokens -> T5 into T5 tokens -> Anima model (who only speaks 'T5')"
Even if you scale the LLM up, T5 is the limit of Anima's prompting capabilities. There might be some slight benefit from a larger LLM, but nothing huge.
The encoder language isn't some hardcoded value. This is a lot of speculation for a model without released code. But if you just infer from Cosmos-Predict2, it's conditioned on embeddings from T5XXL, which is 11B parameters with a hidden size of 4096 dims. Compare that to Qwen3-0.6B-Base, which has only 1024. If anything, you could probably plug in Qwen3-8B, which has the same embedding dimension as T5XXL-11B. But suddenly your training expense has quadrupled as well, 3/4 of your audience has fallen away due to hardware constraints, and people are scrambling to use fp8 versions of your TE, which kneecaps the model hard. So unless the dev states otherwise: no there's probably no Prompt -> Q3 embeddings -> T5 embeddings bridge. Considering the small 2B size of the diffuser weights and the data the dev claims it's being trained on, you can be safe in thinking that by the end of training 99%++ of the original T5 embedding geometry will have been overwritten and replaced with Qwen3's.
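The hardware-constraint point is easy to quantify with back-of-the-envelope numbers (illustrative, weights-only; activations and the diffuser itself are extra):

```python
# Rough VRAM needed just to hold the text encoder weights in memory.
# Parameter counts are the nominal model sizes, not measured figures.
def weight_gib(params_billions: float, bytes_per_param: float) -> float:
    """GiB occupied by a model's weights at a given precision."""
    return params_billions * 1e9 * bytes_per_param / 2**30

print(f"Qwen3-0.6B @ bf16: {weight_gib(0.6, 2):.1f} GiB")  # ~1.1 GiB
print(f"Qwen3-8B   @ bf16: {weight_gib(8.0, 2):.1f} GiB")  # ~14.9 GiB
print(f"Qwen3-8B   @ fp8 : {weight_gib(8.0, 1):.1f} GiB")  # ~7.5 GiB
```

So swapping in the 8B TE alone eats more VRAM than many consumer GPUs have in total, which is exactly why people would fall back to fp8.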
The best the community could do in the future is upcast Qwen3-0.6B-Base to fp32 (afaik the originally released weights are only bf16) and fine-tune the fp32 version on a generalized uncensored dataset. Increased precision seems to matter a lot more with smaller models. (Just try using Qwen3-0.6B-Base at fp8 with Anima and see how ass your gens become.) But unless the final Anima training goes up to 1.7B, that's the best you can do from the end-user side.
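That upcast step itself is trivial; a minimal torch sketch (hypothetical tensor name, standing in for a real checkpoint loaded via safetensors/transformers). Note the cast is lossless but adds no information by itself; the point is that the subsequent fine-tuning then runs with full fp32 precision headroom:

```python
import torch

# Stand-in for the released bf16 checkpoint (hypothetical tensor name).
bf16_state = {"model.embed_tokens.weight": torch.randn(8, 4).to(torch.bfloat16)}

# Upcast every tensor to fp32. Every bf16 value is exactly representable in
# fp32, so nothing is lost; the benefit comes from fine-tuning afterwards
# with the extra mantissa bits available.
fp32_state = {k: v.to(torch.float32) for k, v in bf16_state.items()}

assert all(v.dtype == torch.float32 for v in fp32_state.values())
```

Casting back down to bf16 recovers the original weights bit-for-bit, which is an easy sanity check before kicking off a fine-tune.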
This is a lot of speculation for a model without released code. But if you just infer from Cosmos-Predict2
There is Anima training code in diffusion-pipe (i.e. official) and there is, of course, Anima inference code in Comfy (probably with tdrussell's guidance), at least. Probably in diffusion-pipe too; any trainer should be able to plop out sample images, I just haven't used diffusion-pipe.
So unless the dev states otherwise: no there's probably no Prompt -> Q3 embeddings -> T5 embeddings bridge
The 0.6B embedding dimensions don't match, you say so yourself. So, what then?
Either way, there is also an LLM adapter.
Comfy code for the LLM adapter
Scroll down: you can see that whatever text-related embeddings there are get run through the LLM adapter and then fed into the model. "t5xxl_ids" is only used by Anima's text encoder; looking at the text encoder code, I think it's fair to assume they're always there and so are always run through the adapter (though even if they weren't, the adapter would still be a thing).
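To make the dimension argument concrete, here's a toy numpy sketch (not Anima's actual code; only the 1024 and 4096 hidden sizes are taken from the thread, everything else is made up for illustration) of why raw Qwen3-0.6B embeddings can't feed layers sized for T5XXL, and what a linear adapter does about it:

```python
import numpy as np

# Hidden sizes quoted in the thread; SEQ and the key width are arbitrary.
QWEN_DIM, T5_DIM, SEQ = 1024, 4096, 77

rng = np.random.default_rng(0)
qwen_embeddings = rng.standard_normal((SEQ, QWEN_DIM))

# Without an adapter, a cross-attention projection sized for 4096-dim
# context simply can't consume 1024-dim input:
w_k = rng.standard_normal((T5_DIM, 256))  # toy key projection
mismatch = False
try:
    qwen_embeddings @ w_k
except ValueError:
    mismatch = True  # (77, 1024) @ (4096, 256) -> shape error

# A learned linear adapter (sketch of the general technique, not the real
# weights) maps Qwen-space vectors into the dimension the model expects:
w_adapter = rng.standard_normal((QWEN_DIM, T5_DIM)) * 0.02
adapted = qwen_embeddings @ w_adapter

keys = adapted @ w_k  # now the shapes line up
print(mismatch, adapted.shape, keys.shape)
```

Whether Anima's adapter is a single linear layer or something deeper is exactly the kind of detail only the released code can settle; this just shows why *some* projection has to sit between the two spaces.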
tdrussell suggesting not to train the LLM adapter (i.e. it exists) to hopefully lessen the current LoRA forgetting issues:
"Don't train the LLM adapter. The example config in diffusion-pipe has it like this by default, but I don't know what other training scripts are defaulting to."
-- https://huggingface.co/circlestone-labs/Anima/discussions/60#6998b3f592a56ca2caee79fd
Personally, I've seen lots of papers on architectural improvements which don't make the model heavier, like DDT or ERW/REPA/RAE (-> Flux 2 VAE). DDT claims a good boost to FID and IS, convergence, as well as a reduction in worldwide cancer rates and everything else I guess. I don't have enough ML knowledge though to tell what its pitfalls are and it could be risky.
But at this point, it's likely too late to swap the VAE/TE, let alone do a wacky architectural change like DDT.
The 0.6B embedding dimensions don't match, you say so yourself. So, what then?
Either way, there is also an LLM adapter.
Comfy code for the LLM adapter
Scroll down: you can see that whatever text-related embeddings there are get run through the LLM adapter and then fed into the model. "t5xxl_ids" is only used by Anima's text encoder; looking at the text encoder code, I think it's fair to assume they're always there and so are always run through the adapter (though even if they weren't, the adapter would still be a thing).
I stand (partially) corrected. The adapter indeed exists.
Anyway, I went down the code rabbit hole you pointed to. It appears to be fallback code to switch to T5 token IDs, because you can literally still just plug in T5XXL and generate images using that as a TE (results are funny, it has zero knowledge of specific anime characters, but that's maybe because I'm using a 4-bit version). So the inference code needs to be able to switch back and forth. Either way, the argument being made here over and over is that Q3-to-T5 is somehow the bottleneck, when by all accounts it seems to be the other way around, if there's a bottleneck at all. The adapter in Comfy seems to make sure the sampler knows to expect the smaller input dimension from Qwen3-0.6B, which is 1024 as opposed to T5XXL's 4096?
Actually, after extensive research, especially watching a great YouTube video https://youtu.be/O04fpKBL--c?si=TAQYZptYC9ExTuFS
