Looking for ways to precisely control the subject and environment.

#140

by Jamerrone - opened May 3

May 3

Hi! First of all, thanks for the amazing model. Anima is pretty much in its own league of base models for anime images. I am looking for some help, or rather, tips. It's quite clear that Anima supports both tags and natural language, but they are not equal. I have generated about 100 or so images today, and I noticed that when you generate images with only tags, the quality is quite a bit higher, and it rarely makes mistakes. Whenever I try to introduce natural language, most images have anatomical issues, mainly with hands, and the overall quality is different. Personally I prefer tags because they're extremely precise, but tags also have their issues. The images tend to be more flat (not always), and there is no way of controlling things like background, atmosphere, composition, lighting, etc. in a clean and clear way. Then I came up with the idea of hybrid prompts. Tags only for the subject with a few sentences below for everything else. It kind of works but has the same anatomical issues with hands. I guess I am looking for a way to have precise control over the subject, preferably with tags, but also the option to be artistic with everything else, and I hope someone can help me out.

Example tags only:

Example natural language only (very long, 3-5 paragraphs, LLM-generated):

I tried the hybrid approach, but because the natural language part just described the environment and not the subject, it always zoomed out.

Tags only:

Hybrid:

Long natural language:

This is the best style I was able to come up with:
Tags only:

Hybrid:

best quality, highres, masterpiece, score_7, 1girl, maomao (kusuriya no hitorigoto), kusuriya no hitorigoto, black hair, braid, chromatic aberration, earrings, floral print, flower, flower earrings, from behind, hair flower, hair ornament, hair over shoulder, hand fan, holding, holding fan, jewelry, long hair, looking at viewer, lotus leaf, obi, parted lips, plant, purple eyes, railing, sash, side braids, solo, standing, swept bangs, tuanshan, turning head, very long hair, wooden railing

The composition is a medium shot taken from a rear-three-quarter angle that captures the subject standing against a railing. High-key, dappled sunlight filters through overhead willow branches, creating a bright and airy atmosphere filled with soft cel-shaded highlights. The color palette is vibrant and saturated, emphasizing lush greens and brilliant blues reflecting off a serene water surface. In the background, a tranquil pond is covered in floating lotus leaves and blooming white lilies under a clear, sunlit sky. Dainty light particles and a slight atmospheric haze soften the distant traditional structures visible across the water. The overall lighting is warm and directional, casting intricate shadows that dance across the environmental textures.

As you can see, the hybrid prompt kind of allows me to do what I want, but the hands are 90% broken while they are always perfect with tags only.

Jamerrone

May 3

Another example. Tags only:

best quality, highres, masterpiece, score_7, 1girl, reze (chainsaw man), chainsaw man, anime coloring, black choker, black hair, bruise, bruise on face, bruised eye, choker, closed mouth, eye twitch, glaring, hair between eyes, half-closed eye, half-closed eyes, injury, messy hair, portrait, rain, reze's eye twitch (chainsaw man), scene reference, shirt, short hair, solo, straight-on, uneven eyes, water drop, wet, wet hair, white shirt

Natural lang (long):

masterpiece, best quality, highres, anime screenshot, 1girl, reze, chainsaw man, 

The image presents a striking, straight-on close-up portrait of a young woman, captured with an intense, cinematic anime screenshot aesthetic. Her expression is dark and deeply unsettling, characterized by a heavy, piercing glare. Her vivid teal eyes are half-closed and distinctly uneven, with the right eye appearing to twitch, an effect emphasized by heavy, dark bags, exhaustion lines, and fine textural linework beneath the lower lids. Her face bears the raw marks of recent physical conflict; a prominent, dark bruise shadows the skin beneath her right eye, and several fine scrapes and abrasions are scattered across her cheeks, adding a gritty, battered realism to her otherwise stoic, firmly closed mouth. The overall color grading leans heavily into a melancholic, monochromatic blue-purple palette, reinforcing the somber mood.

Her short, dark raven hair is completely drenched, plastered messily against her pale skin by an unrelenting downpour. Thick, wet strands of hair fall directly between her eyes and curtain the sides of her face, breaking up her silhouette. The water effects are meticulously detailed; individual water droplets cling to the tips of her hair, slide down the bridge of her nose, and bead across her cheeks and chin. The interaction of the heavy rain with her hair and skin features thin, glistening highlights that map the precise flow of moisture over her facial structure, emphasizing the sheer volume of water washing over her in the freezing atmosphere. 

She wears a thick, matte black choker closely wrapped around her neck, sharply defining her throat, sitting just above the visible collar of a white collared shirt that appears muted, saturated, and gray under the oppressive, low-light conditions of the storm. The environment is entirely consumed by the heavy weather. The background is a deliberate, murky expanse of dark, ominous purples and charcoal tones, suggesting a stormy, overcast sky with faint, looming shapes of distant clouds. Thick, vertical streaks of rain fall continuously, cutting across both the foreground and background in dynamic, translucent lines, framing her battered face and amplifying the cinematic, heavy tension of the scene.

Hybrid:

best quality, highres, masterpiece, score_7, 1girl, reze (chainsaw man), chainsaw man, anime coloring, black choker, black hair, bruise, bruise on face, bruised eye, choker, closed mouth, eye twitch, glaring, hair between eyes, half-closed eye, half-closed eyes, injury, messy hair, portrait, rain, reze's eye twitch (chainsaw man), scene reference, shirt, short hair, solo, straight-on, uneven eyes, water drop, wet, wet hair, white shirt

The composition is a centered, straight-on portrait that tightly anchors the frame on the subject. A heavy, monochromatic indigo color wash dominates the scene, creating a somber and desaturated atmosphere. Volumetric rain streaks fall vertically across the frame, adding texture and motion to the midground. The lighting is low-key and flat, with subtle shadows that blend into the deep blue tones of the environment. In the background, dark and indistinct silhouettes of storm clouds loom over a muted, desolate horizon. The overall aesthetic utilizes dynamic anime lighting to emphasize the cold, damp environment without distracting from the central focus.

rconhf

May 3

edited May 3

this probably isn't the answer you're hoping for, but Anima really doesn't like long prompts. once you go past ~2–3 paragraphs, it tends to fall apart into mush. structure breaks down, and hands are usually the first thing to go.

what’s worked way better for me is a solid tag base for structure, then a bit of natural language to fill gaps.

full natlang can work too, but you have to keep it tight and precise. if it gets too long or too descriptive, the model just loses the plot. Maybe tdrussel knows why that's the case, unsure if it's something to do with how the prompt is chunked or whatever. I wish I could offer technical details, but all I got is observations 🤷‍♂️
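
A minimal sketch of the "tag base plus a short natural-language tail" idea, assuming the same layout used in the hybrid examples earlier in the thread (tags on one line, sentences below). The helper name `build_hybrid_prompt` and the `max_sentences` budget are hypothetical; the cap is only a way to keep the tail tight, not a documented Anima limit.

```python
import re


def build_hybrid_prompt(tags, description, max_sentences=4):
    """Join a tag base with a short natural-language tail.

    max_sentences is an illustrative budget, not a known model limit.
    """
    # Deduplicate tags while keeping their original order.
    seen = set()
    unique_tags = []
    for tag in tags:
        key = tag.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique_tags.append(tag.strip())

    # Split after sentence-ending punctuation and keep only the first few sentences.
    parts = re.split(r"(?<=[.!?])\s+", description.strip())
    sentences = [s.strip() for s in parts if s.strip()]
    tail = " ".join(sentences[:max_sentences])

    tag_line = ", ".join(unique_tags)
    return f"{tag_line}\n\n{tail}" if tail else tag_line


prompt = build_hybrid_prompt(
    ["1girl", "solo", "upper body", "rain", "solo"],
    "The lighting is low-key and flat. Rain streaks fall across the frame. "
    "Storm clouds loom in the background.",
    max_sentences=2,
)
print(prompt)
```

Trimming by sentence count is a blunt instrument; it only makes the "keep it tight" advice easy to apply consistently across a batch of prompts.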

Jamerrone

May 3

Yeah, so that's pretty much what I noticed and tried to display here. Tags only = simpler image, sharp lines, flatter colours, high quality, and no or few issues. Long LLM natural language: highly detailed, Anima understands the prompt itself just fine, with lots of micro-details and dynamic light. It looks great from a distance but not so good once you have a closer look; it always has minor issues and always has broken hands. Hybrid = a mix of both; it has lower image quality than just tags. It can have issues, including broken hands, but it gives you 80% of the subject control you get with only tags, and it allows you to control light, background, angle, weather, etc. It can, however, very easily separate the subject from the background, making it harder to position the subject or control the camera shot. An example of this is the hybrid Reze shot in the pool, because I described the background; it ignored all the angle tags like "upper body", "close-up", etc., which was very annoying. It kept generating a wide shot.

cavemanextreme

May 3

If you give a long natural-language prompt that only describes the environment, the environment is what you get; that part of the prompt is way stronger. If upper body doesn't do anything, just make it stronger until it does, something like (upper body:7).

Same thing with hands. With a long natural language prompt it's just a mush of tokens, and you probably need something to tell the model that the hands are important, like a weighted "holding staff" tag or something. You can also downweight the natural language part of the prompt.
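
A rough sketch of both suggestions (upweighting a tag, downweighting the natural-language tail) using the `(text:weight)` form from the `(upper body:7)` example. The helper names and the values 1.4, 1.2 and 0.8 are made up for illustration, and whether your frontend applies this syntax to whole sentences is an assumption worth checking.

```python
def weight(text, w):
    """Wrap text in (text:weight) emphasis syntax, e.g. "(upper body:7)"."""
    return f"({text}:{w:g})"


def emphasize_tag(tags, target, w):
    """Return a copy of tags with every occurrence of target weighted."""
    return [weight(t, w) if t == target else t for t in tags]


tags = ["1girl", "solo", "upper body", "holding staff"]
tags = emphasize_tag(tags, "upper body", 1.4)     # push the framing tag
tags = emphasize_tag(tags, "holding staff", 1.2)  # tell the model the hands matter

# Downweight the environment description so it competes less with the tags.
tail = weight("dark storm clouds loom over a muted horizon", 0.8)

prompt = ", ".join(tags) + "\n\n" + tail
print(prompt)
```

Raising a weight in small steps and regenerating with a fixed seed makes it easier to see which change actually moved the framing.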

Also, getting perfect hands in an image where they take up 5% or so of the image is pretty hard; you can just upscale.
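
To put a number on the "5%" point, here is a back-of-the-envelope calculation. It assumes the 5% refers to image area and uses a 1024x1024 canvas and a 2x upscale, none of which are stated above.

```python
import math


def region_side_px(width, height, area_fraction, upscale=1.0):
    """Side length in pixels of a square region covering area_fraction of the image."""
    area = (width * upscale) * (height * upscale) * area_fraction
    return math.sqrt(area)


base = region_side_px(1024, 1024, 0.05)
upscaled = region_side_px(1024, 1024, 0.05, upscale=2.0)
print(round(base), round(upscaled))  # roughly 229 px vs 458 px per side
```

Under these assumptions the hands get about 229 pixels per side at base resolution, so every finger has only a few dozen pixels to work with; a 2x upscale doubles that.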

synta

May 5

A workaround that helped me is to use the aesthetic quality modifier lora at a lower weight of 0.2-0.4. It's just enough to introduce the dataset bias of the lora which is mainly close-ups. Helps also to get full body close-ups in vertical images, when the natlang is screwing with the composition towards ultra wide shots.
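
For intuition on why a low weight only nudges the result, here is a toy sketch of how LoRA strength is commonly applied: the low-rank update is scaled linearly before being added to the base weight. The matrices are invented, any alpha/rank factor is folded into `strength`, and I am assuming the loader in use follows this usual scheme.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(weight, down, up, strength):
    """Return weight + strength * (up @ down), the usual LoRA merge."""
    delta = matmul(up, down)
    return [
        [w + strength * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 base weight with a rank-1 update; all numbers are illustrative.
base = [[1.0, 0.0], [0.0, 1.0]]
up = [[1.0], [2.0]]    # shape (2, 1)
down = [[0.5, -0.5]]   # shape (1, 2)

full = apply_lora(base, down, up, 1.0)
light = apply_lora(base, down, up, 0.25)
print(full)
print(light)
```

At strength 0.25 each entry moves a quarter as far from the base weight as at 1.0, which fits the idea of borrowing a bit of the LoRA's close-up bias without taking on its full style.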

Jamerrone

May 5

Interesting! You mean the most downloaded Anima lora, right? So what you are saying is that because the lora is trained on close-up images, it will help my wide-angle issue? I have to give it a shot!

venluxy

May 6

use this text encoder, maybe it works. I use it myself when trying more complex images, like a comic. https://huggingface.co/DavidAU/Qwen3-0.6B-heretic-abliterated-uncensored

rconhf

May 6

While I can understand the idea, in practice using a different text encoder usually leads to worse results, especially in prompt adherence, and ablation bonks the model on the head even more, so this combination should be doubly detrimental. I'd be surprised if it really helped you 🤔

venluxy

May 6

it's not a different text encoder. it is the same model Anima is using; the dev just tried to remove the built-in safeguards to make it easier to get more explicit results. he said that the censored model can produce explicit results but you need to push it, so he just removed the filter to make it easier.

guri06

May 7

Oh, please. The current LLM adapter has already been fine-tuned to a certain extent. The moment you replace it, you will see that the embeddings for the artist tags are completely corrupted.

LeftHandedAIart

May 7

I have encountered similar issues myself. I lean more towards tags than natural language.
It has proven effective but I really hope these issues are resolved (at least mostly) by the time we get the final release.

venluxy

May 7

I don't use artist tags often, so I am not aware of the problem you have. I use both of the models to get more variety in results and have not had any problems myself.
