Diffusion Single File
comfyui

Artist mixing does not work (or at the very least not in the way you'd expect it to)

#44
by qtpie - opened

artist-mix
See attached image. Top row is Anima, middle is 291H (a NoobAI merge) and bottom is base NoobAI 1.0 vpred. Styles are "Mochirong", "Houkisei" and lastly "Mochirong, Houkisei"
The same prompt has been used (and I used @s in Anima's case, of course). As you can see, on the NoobAI models there is a mixture using elements from both style tags, while mixing on Anima doesn't retain much of either style.
And for good measure I flipped the mix on Anima to "@Houkisei, @Mochirong" just to show that order doesn't really seem to do anything.
flipped

Artist/style mixing is a consequence of CLIP-based models. CLIP is the text encoder used for Stable Diffusion XL/1.5/etc. Due to the way CLIP works, it allowed for mixing of tags such as artist, character, artistic medium and so forth. In the newer models after XL, CLIP was abandoned due to its limitations, and artist/style mixing went with it. The LLM encoders used in newer models like Anima are much better than CLIP, but we will most likely never see CLIP's mixing effect again. It was never the intention anyway; it was just a consequence of CLIP's limitations.
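For intuition, the mixing effect described above can be pictured as interpolation between tag embeddings in CLIP's shared embedding space. A toy numpy sketch, with random vectors standing in for real CLIP text embeddings (no actual encoder involved):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 768  # width of CLIP ViT-L text embeddings

# Hypothetical stand-ins for pooled CLIP embeddings of two artist tags.
# In a real pipeline these would come out of the CLIP text encoder.
emb_artist_a = rng.normal(size=DIM)
emb_artist_b = rng.normal(size=DIM)

def mix(a: np.ndarray, b: np.ndarray, t: float = 0.5) -> np.ndarray:
    """Linearly interpolate two embeddings, renormalized to unit length."""
    m = (1.0 - t) * a + t * b
    return m / np.linalg.norm(m)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

mixed = mix(emb_artist_a, emb_artist_b)
# The blend sits "between" both styles: it stays similar to each parent.
print(cosine(mixed, emb_artist_a), cosine(mixed, emb_artist_b))
```

Because CLIP embeds concepts in one continuous space, a point between two artist embeddings still decodes to a plausible style; an autoregressive LLM encoder doesn't give you the same geometry to interpolate in.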

It's going to be interesting seeing the reaction of everyone once they fully realize these newer models can't mix tags the way XL could.

Illustrious was extremely schizo when it came to style, so using a large number of artist tags was a good way to stabilize it. But if you want to do any sort of serious style mixing, LoRAs were ALWAYS the way to go.

CLIP itself isn't a limitation; SDXL's CLIP just had like 77 tokens to work with...
There are modern CLIP solutions that let you use way more tokens and work better in general, like Jina CLIP. The model maker decided not to use one, so you lost the ability to mix very well, and the model also shows some issues with concept separation compared to using CLIP.
Heart-shaped hands will often spawn hearts around...
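For reference, the 77-token cap mentioned above is what SD-era UIs work around by splitting long prompts into 75-token chunks (plus BOS/EOS per chunk) and concatenating the per-chunk embeddings. A minimal sketch with made-up token IDs, assuming the usual CLIP special-token layout:

```python
# Hypothetical token IDs; a real pipeline would use the CLIP tokenizer's values.
BOS, EOS, PAD = 49406, 49407, 0
CHUNK = 75  # a 77-token window minus BOS and EOS

def chunk_tokens(token_ids: list[int]) -> list[list[int]]:
    """Split a long prompt into 77-token windows the encoder can handle.

    Each window is BOS + up to 75 content tokens + EOS, padded to 77.
    The per-window embeddings are then concatenated along the sequence axis.
    """
    windows = []
    for i in range(0, max(len(token_ids), 1), CHUNK):
        body = token_ids[i:i + CHUNK]
        win = [BOS] + body + [EOS]
        win += [PAD] * (77 - len(win))
        windows.append(win)
    return windows

# A 150-token prompt becomes two full windows.
prompt = list(range(1, 151))
windows = chunk_tokens(prompt)
print(len(windows), len(windows[0]))  # 2 77
```

The catch is that attention never crosses chunk boundaries, which is part of why very long SDXL prompts degrade; longer-context encoders avoid the chunking entirely.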

CLIP itself isn't a limitation; SDXL's CLIP just had like 77 tokens to work with...
There are modern CLIP solutions that let you use way more tokens and work better in general, like Jina CLIP. The model maker decided not to use one, so you lost the ability to mix very well, and the model also shows some issues with concept separation compared to using CLIP.

The CLIP architecture is very much a limitation compared to newer text encoders. Newer models have semantic understanding and scene reasoning, they can actually follow instructions, and the token limit for the Qwen model used for Anima is upwards of 32k (even Jina CLIP caps out at 8k). All of these features are the polar opposite of CLIP's cross-modal understanding and retrieval method, which is what results in tag blending (aka style mixing).

Heart-shaped hands will often spawn hearts around...

This isn't a problem with the Qwen text encoder or CLIP. The Danbooru heart_hands tag is riddled with heart-shaped symbols: https://danbooru.donmai.us/posts?tags=heart_hands
The model is just giving you what shows up in the data.

I think it's worth mentioning that besides mixing, CLIP-less models also can't weight prompts above 1.0 properly. Though, like how you can still mix LoRAs, I've heard there's some more complex way of messing with the model's attention to get that feature back?

Am I missing the point here? It seems to me that the styles did change when you added the extra artists. Maybe not quite as extreme as on the other two, but there's still a clear difference between the images.

The CLIP architecture is very much a limitation compared to newer text encoders. Newer models have semantic understanding and scene reasoning, they can actually follow instructions, and the token limit for the Qwen model used for Anima is upwards of 32k (even Jina CLIP caps out at 8k). All of these features are the polar opposite of CLIP's cross-modal understanding and retrieval method, which is what results in tag blending (aka style mixing).

Heart-shaped hands will often spawn hearts around...

This isn't a problem with the Qwen text encoder or CLIP. The Danbooru heart_hands tag is riddled with heart-shaped symbols: https://danbooru.donmai.us/posts?tags=heart_hands
The model is just giving you what shows up in the data.

It's not a limitation, and it's very much used, for example by Google. The architecture excels at learning concepts and separating them. 8k tokens is hardly a limit; even if you somehow hit it, which is very unlikely, 8k tokens is around 1,000 words, basically already an essay. Are you prompting essays? I'm never using 32k tokens.
Regarding the hearts, no. On SDXL it really wasn't an issue, even if the dataset is the same. I could cite several other cases where concepts blend. I hope DiT somehow fixes these issues with more training...
Tag mixing works because CLIP has a deeper understanding of concepts and can mix them, and it also understands strengthening and weakening them; this is why scaling the vectors works, because you're scaling a concept. Contrast that with LLMs, where you can't really scale vectors. Comfy implemented a very makeshift solution in ComfyUI. You can check it here:
https://github.com/Comfy-Org/ComfyUI/blob/855849c6588180fec88186127aae1a3299387fa6/comfy/text_encoders/anima.py
It's very different from what you'd do on SDXL, where you scale the vector and normalize it.
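The SDXL-style scale-and-normalize weighting referred to above can be sketched in a few lines of numpy. This is a simplified illustration of the general trick (multiply the weighted token's embedding, then restore the sequence's original mean magnitude), not the code of any particular UI; the arrays are random stand-ins for real CLIP token embeddings:

```python
import numpy as np

def apply_weights(embs: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Scale per-token embeddings by prompt weights, then restore the
    sequence's original mean magnitude so the overall signal the UNet
    sees stays in its usual range."""
    original_mean = np.abs(embs).mean()
    weighted = embs * weights[:, None]
    return weighted * (original_mean / np.abs(weighted).mean())

rng = np.random.default_rng(1)
seq = rng.normal(size=(77, 768))  # stand-in for a CLIP token-embedding sequence
w = np.ones(77)
w[5] = 1.5                        # "(artist:1.5)"-style emphasis on token 5

out = apply_weights(seq, w)
# Token 5 now contributes relatively more than its neighbors, while the
# sequence as a whole keeps its original average magnitude.
```

Because each token is a vector in the same concept space, scaling it really does scale the concept; an LLM encoder's hidden states don't behave that way, hence the makeshift workaround linked above.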

I think it's worth mentioning that besides mixing, CLIP-less models also can't weight prompts above 1.0 properly. Though, like how you can still mix LoRAs, I've heard there's some more complex way of messing with the model's attention to get that feature back?

In Comfy's makeshift solution for Anima, you can set an artist to 1.5 and sometimes get art more similar to that artist. It's very finicky.
