Inconsistencies with directional prompts in the Anima model
Hi team,
I've noticed some inconsistencies with directional prompts in the Anima model:
1. Prompts like "facing left/right" can be interpreted either as the left/right of the frame or the character's own left/right. The result also varies depending on the seed.
2. Even specifying left-hand/right-hand actions is not consistently reproducible across seeds.
3. When prompting "face the camera, look back at the background," the model consistently outputs the opposite: the character facing the background while looking back at the camera. I suspect this is because the training data heavily leans toward "looking back" meaning looking toward the camera.
Just wanted to share these observations. Thanks as always for your great work!
It would be great if future versions could accurately interpret spatial relationships — for example, understanding prompts like "character facing A, with back to B, looking at C," etc.
I'd say it comes down to the text encoder: it has only 600M parameters and processes your prompt in one direction only (as opposed to T5, which has bidirectional attention). And small models like this are really, really bad at understanding spatial relations. As in, they're too "stupid".
You can mitigate a little with good prompting:
Establish a sectioning of the frame from your point of view (e.g. start with "The image is sectioned into a left side and a right side", and then refer to it further on: "On the left side of the image is a girl with long red hair. On the right side of the image is a girl with short blue hair").
Don't mix and match: don't suddenly prompt "her left eye is blue and her right eye is red"; keep to your sectioning: "she has a blue eye on the right side and a red eye on the left side".
But ultimately this will never be great, since Qwen3-0.6B just isn't on the level to process spatial info in prompts very well; we'll have to use other methods (ControlNet, if ever available) for that.
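The sectioning advice above can be sketched as a tiny helper. This is purely illustrative (the function name and structure are my own invention, not an Anima API): it declares the frame split once and then phrases every detail from the viewer's point of view, so the prompt never mixes frame-left with character-left.

```python
def sectioned_prompt(left_desc: str, right_desc: str, extra: str = "") -> str:
    """Build a prompt that fixes left/right to the image frame (viewer's POV)."""
    parts = [
        # Declare the frame of reference up front...
        "The image is sectioned into a left side and a right side.",
        # ...then consistently refer back to it for every element.
        f"On the left side of the image: {left_desc}.",
        f"On the right side of the image: {right_desc}.",
    ]
    if extra:
        parts.append(extra)
    return " ".join(parts)

prompt = sectioned_prompt(
    "a girl with long red hair",
    "a girl with short blue hair",
)
```

The point is consistency: every left/right in the output string refers to the frame, never to the character's own body.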
A note to 3, since that's an entirely different issue:
There's probably like one image in the dataset where a character is looking away from the viewer and you're seeing the back of their head, much less with their torso facing the viewer. That's just not gonna happen; maybe if the dataset includes more photos to broaden its knowledge, but we're really pushing the 2B here. :,D
To 3., my advice is to treat the booru-tag concepts as an integral part of the model. "looking back" is a booru tag with a strong visual representation: a character seen from behind who is looking back towards the camera. The booru wiki defines it as:
"A character turning their head, body or eyes to look behind themselves.
The subject's gaze should be directed fully or almost fully backwards over the shoulder, anywhere from 135 to 180 degrees behind them relative to their front.
Oftentimes combined with 'from behind' and 'looking at viewer', and also 'sideways glance' if the character's body is partly facing away.
If the character has both of their eyes closed, use 'facing back' instead."
It would be troublesome if natural-language prompts actively fought against booru-tag concepts; rather, they should work together.
On that note: "viewer" is much better defined than "camera".
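The wiki definitions quoted above can be read as a small decision rule for which gaze tag to use. The tag names below are real booru tags; the selection logic is just my reading of the wiki text, sketched for illustration:

```python
def gaze_tags(seen_from_behind: bool, looking_at_viewer: bool, eyes_closed: bool) -> list[str]:
    """Pick gaze-related booru tags consistently with the wiki definitions."""
    tags = []
    if seen_from_behind:
        tags.append("from behind")
        if eyes_closed:
            # Wiki: "If the character has both of their eyes closed, use 'facing back'"
            tags.append("facing back")
        elif looking_at_viewer:
            # The strong "looking back" concept: back to viewer, gaze over the
            # shoulder towards the camera ("oftentimes combined with
            # 'from behind' and 'looking at viewer'").
            tags += ["looking back", "looking at viewer"]
    elif looking_at_viewer:
        tags.append("looking at viewer")
    return tags
```

Working with the tags this way, instead of describing gaze in free-form sentences that contradict them, keeps the prompt aligned with what the model actually learned.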
I'd say it comes down to the text encoder, it has only 600m parameters
The model understands left/right and other spatial relationships well enough in general. I'm blaming this more on whatever captions the VLM generated (or rather didn't), since those are the only source of left/rightness. No way "looking left" and "looking right", if consistently captioned (if they were hypothetically booru tags), would be too much for it. Probably even CLIP could deal with that.
To 3. my advice is to consider the booru-tag concepts as integral part of the model.
Thanks for the explanation. If the model consistently interprets "looking back" as the character facing away from the viewer while looking back toward the viewer, how would you recommend phrasing a prompt like "character facing the viewer while looking back toward the background"?
Like this picture:
I'd say it comes down to the text encoder, it has only 600m parameters
Thanks for the clarification. Regarding the model's difficulty distinguishing left from right: this issue still occurs even when only one character is in the prompt, especially when the prompt includes a heavy amount of elements (such as details about the character's appearance or the background). In such cases, the model tends to prioritize rendering those elements, leaving less room to accurately follow directional cues (like "on the left/right of the frame," or actions involving the left/right hand).
how would you recommend phrasing a prompt like "character facing the viewer while looking back toward the background"?
Like InvictusCreations said, this is indeed a dataset issue, with anime just being face-focused. I tinkered a bit but it's very inconsistent. I tried to exclude certain things via negatives and imply a front view with descriptions of the torso (like breast size) to force the body to face the viewer while looking away, but it's not really working.
positive: masterpiece, best quality, 1girl, black hair, long hair, dress, medium breasts, cleavage, cowboy shot, back of the head,. She is looking at the tree in the background. Her torso is facing the viewer but her head is turning towards the background. 2000s (style),
negs: worst quality, low quality, score_1, score_2, score_3, 6 fingers, 6 toes, ai-generated, looking at viewer, eyes, from behind, back, eyes, face,
What made a big spatial impact: describing the back as "background". "Her torso is facing the viewer but her head is turning towards the background" seems to help a lot.
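The workaround above (stripping pose tags that fight the goal out of the positives and pushing them into the negatives) can be sketched as a tiny helper. The tag names are taken from the prompts above; the helper itself is just an illustration, not part of any real tooling:

```python
# Tags that pull the model towards the unwanted "looking back at viewer"
# composition, per the experiment above.
CONFLICTS = {"looking at viewer", "from behind", "back", "eyes", "face"}

def split_prompt(tags: list[str], extra_negs: tuple[str, ...] = ()) -> tuple[str, str]:
    """Remove conflicting tags from the positives and collect them as negatives."""
    positives = [t for t in tags if t not in CONFLICTS]
    negatives = sorted(CONFLICTS.union(extra_negs))
    return ", ".join(positives), ", ".join(negatives)

pos, neg = split_prompt(
    ["masterpiece", "best quality", "1girl", "black hair", "cowboy shot",
     "looking at viewer"],
    extra_negs=("worst quality", "low quality"),
)
```

This only automates the bookkeeping; as noted above, the results are still inconsistent because the underlying composition is barely represented in the dataset.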
