Diffusion Single File
comfyui

Qwen 3.5

#67
by kdutt2000 - opened

So I follow AI news quite a lot and it looks like Qwen has released the smaller variants it's Qwen 3.5 series. I'm curious if they have a 0.6b version and if they do it should be the same size as the current great Qwen3 0.6b text encoder.

I know I said previously that I it might be better with a bigger model. However, I saw videos of the original Qwen3 0.6b by Bijanbowen And I have to say size definitely doesn't matter. It's about how you train the model and the current model they have as the text encoder is very very good and trained very well so hats off to the Devs of this great anime model.

I'm also curious if there's a big difference between Qwen 3.5 and Qwen 3.

I love the open source models at the moment. I think they are incredible. Both video image and overall LLMs. It's definitely exciting times for the open source community and AI in general.

There is no 0.6B model in Qwen 3.5, but there is a 0.8B model available. I saw someone in an overseas community experiment with it. The architecture is slightly different, and some unnecessary parts were trimmed down. Fortunately, the hidden_size is the same, so they managed to integrate it into the existing DiT structure and get it running. However, the generated images did not come out properly—the shapes seemed to collapse and blend together.

It might be an issue with the adapter not being properly aligned, but it is difficult to determine exactly what the problem is. For now, it can be considered “possible,” but whether it will lead to “better” results remains to be seen.

I haven't tried the 0.8B yet,

but judging from the 9B and 27B versions, there are obvious improvements in the versions.
And the entire series comes with multimodality, and it might even be possible to consider passing reference images in clip?
(Regarding te's multimodality, it can run smoothly in the Newbie( test version), even though no special training has been conducted for this.)

However, if the existing structure is maintained, using T5's tokenizer, simply replacing from qwen3 0.6b to 3.5 0.8b may result in limited improvement.

I expect there can be some behavior alignment by fine-tuning adapter part. I'll post comment if there are some meaningful results.

There is no 0.6B model in Qwen 3.5, but there is a 0.8B model available. I saw someone in an overseas community experiment with it. The architecture is slightly different, and some unnecessary parts were trimmed down. Fortunately, the hidden_size is the same, so they managed to integrate it into the existing DiT structure and get it running. However, the generated images did not come out properly—the shapes seemed to collapse and blend together.

It might be an issue with the adapter not being properly aligned, but it is difficult to determine exactly what the problem is. For now, it can be considered “possible,” but whether it will lead to “better” results remains to be seen.

Alright — the person from the overseas community I mentioned earlier has finally produced some promising results. For clarity: I won’t post a direct link because the post contains NSFW images and comes from a non-English community, so quoting it directly would be inappropriate. I’ll only report the results here.

In short: about what we expected. They ported the Qwen 3.5 0.8B model and then fine-tuned the LLM adapter on a small dataset. A plain port had previously produced collapsed and distorted images; the additional adapter training, however, produced outputs that better preserve structure and align more closely with the prompts.

The overall conclusion remains the same: it’s still “possible,” but whether switching will actually yield broadly better results is uncertain.

Comparing the two models — Qwen 3.0 0.6B vs. Qwen 3.5 0.8B — you might notice some differences in basic UX, but within Anima there’s no clear, encouraging performance gap yet; the model mostly acts as a text encoder. Fully retraining the adapter on a much larger dataset to utilize the 0.8B model could mean discarding roughly half the existing progress and rebuilding the adapter from scratch.

I hope the developer sees this and posts more details so people can be reassured — I’m waiting for any additional information.

Yeah to be honest The prompt adherence is very good even with the current great Qwen3 0.6b model. Especially when you have multiple characters. Make sure you say names, Jess, John... rather than just 1girl or 1boy I've noticed it gives better results.

Also If you use pure natural language make sure it's descriptive personally I use Dan tags for certain poses/anatomy stuff 😏 and mostly natural language for everything else as it works very well. I have also seen templates that people using for z Imege turbo which uses Qwen3 4b as its Text encoder. This is one of them that has been working well for more complex images:

[Style & Aesthetic]
(Your quality tags)

[Composition & Camera]
(What camera angle you want the photo, full body shot, Dutch angle...)

[Subjects & Anatomy]
(How many subjects do you want? Make sure that you use names like Jess a 21-year-old blonde woman, John a 48-Year-Old african man.... Instead of just 1girl and 1boy As I found this works better sometimes, but your mileage may vary.)

[Action]
(What your subject/subjects are doing...)

[Environment & Atmosphere]
(The more description of your environment and background the better as if you don't. It tends to give you the same background because the model is very good at following the prompt. The variety isn't the best sometimes at the moment)

[Lighting & Contrast]
(Be careful with this because if you go too much on the lighting it can make the characters look a bit washed out and things like that I tend to just say if it's a sunny day or something simple like that)

I've also experimented with BREAK like in SDXL and Illustrious And it's surprisingly worked pretty well, especially when my prompt has been very long.

These may or may not work for you but these are some of the things I've been trying. I hope it helps.

I've also been using a prompt enhancer as well sometimes which has been helping a lot and expanding my prompt although it's through open router so the latency can be a little bit on the slow side, sometimes depending on the LLM model.

Sign up or log in to comment