
Potential selection bias in the dataset scoring pipeline (ai-generated tag behavior)

#83 by VOIDER

Hey, great work on the Anima preview. I wanted to point out a dataset-side issue regarding the scoring pipeline that might be worth looking into before the final release.

The ai-generated tag (along with a few related ones) is clearly prompt-active. The model card explicitly states no synthetic data was used, and I assume that means no intentionally generated images were added, which makes sense. However, given recent dataset cutoffs, scraped community content inevitably contains user-uploaded AI generations. The fact that the tag works so reliably doesn't prove intentional dataset padding, but it shows the token was learned strongly enough to map to a stable visual prior. The deeper concern is what happened when those images went through the filtering and evaluation pipeline.

Here is a grid showing different AI-related tags in the positive prompt:

[Image: XYZ grid comparing AI-related tags in the positive prompt (xyz_grid-0000-1)]

Both Waifu Scorer and the PonyV7 Aesthetic Classifier systematically rate AI-generated content higher than equivalent human-drawn art. In my tests, generations with ai-generated in the positive prompt consistently scored higher on both scorers than comparable generations where the same tag was pushed negative. Waifu Scorer was trained on pre-AI data and doesn't meaningfully distinguish between clean hand-drawn art and AI output that shares similar surface-level visual patterns. It effectively rewards typical AI artifacts like over-smoothed skin or certain specular eye highlights with inflated scores. If scorers like these were used upstream for data selection, it creates an obvious retention bias: AI-looking images didn't just slip through, they were preserved at disproportionately high rates.

Here is the score comparison with and without ai-generated across both scorers:

For reference, the scores were tested using these public spaces: Waifu Scorer v3 and PonyV7 Aesthetic Classifier.

[Image: score comparison with and without ai-generated across both scorers (Frame 2)]
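For anyone wanting to replicate the comparison, here is a minimal sketch of the methodology, with a dummy scorer standing in so the script runs end to end. The directory layout and the scorer interface are assumptions, not the actual APIs of the two spaces linked above; swap in real wrappers around Waifu Scorer v3 and the PonyV7 Aesthetic Classifier.

```python
# Sketch: mean score for generations with ai-generated in the positive
# prompt vs. pushed to the negative prompt. The dummy scorer is a
# placeholder; replace it with real wrappers around the two spaces.
from pathlib import Path
from statistics import mean
from typing import Callable

def dummy_scorer(path: Path) -> float:
    """Placeholder so the script runs; swap in a real aesthetic scorer."""
    return float(len(path.name) % 10)

def compare(scorer: Callable[[Path], float], dir_pos: str, dir_neg: str) -> None:
    pos = [scorer(p) for p in Path(dir_pos).glob("*.png")]
    neg = [scorer(p) for p in Path(dir_neg).glob("*.png")]
    if not pos or not neg:
        print("no images found")
        return
    print(f"positive={mean(pos):.3f}  negative={mean(neg):.3f}  "
          f"delta={mean(pos) - mean(neg):+.3f}")

# Directory names are assumptions about how the test grids were saved.
compare(dummy_scorer, "gens/ai_tag_positive", "gens/ai_tag_negative")
```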

I know Anima is currently a base model and quality tags might be dropped out randomly during training. That reduces direct dependence on the tags at inference, but it does not undo bias introduced earlier. If certain image types were overrepresented in the retained subset because of inflated scores, optional tags at training time do not fix a biased scoring pipeline.
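To make the retention-bias point concrete, here is a toy simulation (all numbers are illustrative assumptions, not measurements): a scorer that inflates one style's scores skews the composition of the filtered set, and tag dropout applied later at training time does nothing to restore the original mix.

```python
# Toy simulation of the retention-bias argument: a scorer that inflates
# one style's scores skews the *composition* of the filtered dataset,
# and no amount of tag dropout at training time changes that mix.
import random

random.seed(0)

N = 100_000
AI_FRACTION = 0.10   # assumed share of AI-looking images in the scrape
BIAS = 0.5           # assumed score inflation the scorer gives that style
THRESHOLD = 1.0      # assumed filtering cutoff

population = []
for _ in range(N):
    is_ai = random.random() < AI_FRACTION
    true_quality = random.gauss(0.0, 1.0)
    observed = true_quality + (BIAS if is_ai else 0.0)  # biased scorer
    population.append((is_ai, observed))

retained = [(is_ai, s) for is_ai, s in population if s >= THRESHOLD]
ai_share = sum(is_ai for is_ai, _ in retained) / len(retained)
print(f"AI share before filtering: {AI_FRACTION:.1%}")
print(f"AI share after filtering:  {ai_share:.1%}")
# Dropping quality tags on 50% of captions later rebalances nothing:
# the retained set itself is already skewed toward the inflated style.
```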

The Pony-derived quality side has a separate issue. Pony V6 explicitly describes its training set as being aesthetically ranked based on the author's personal preferences. V7 expanded the dataset significantly but maintained the same scoring methodology, heavily weighted toward a specific composition of anime, cartoon, furry, and pony content. So the classifier used here is essentially a neural network trained to replicate a project-specific house aesthetic rather than a neutral measure of general image quality. The practical result is that using score_9 in prompts pushes Anima noticeably toward a Pony v6-era look, simply because those images scored highly on that classifier and entered the training data at elevated rates.

The root cause here is a variant of dataset contamination related to the model collapse problem. Old taggers and aesthetic scorers assign high-quality labels to AI-generated images that visually resemble their preferred distributions. Those labels and artifacts then get learned by the next model as actual markers of quality, slowly narrowing the style distribution.
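A toy feedback loop shows the narrowing mechanism (the preferred mode and keep-rate are arbitrary assumptions): each generation keeps only the samples a fixed scorer prefers and refits to the survivors, so the style distribution drifts toward the scorer's taste while its spread shrinks.

```python
# Toy illustration of the collapse loop: each "generation" keeps the
# half of its samples a fixed scorer likes best, then refits its style
# distribution to what survived. Mean drifts toward the scorer's
# preferred mode; spread shrinks. Numbers are illustrative only.
import random
import statistics

random.seed(0)
preferred = 2.0          # the scorer's favorite style coordinate (assumption)
mu, sigma = 0.0, 1.0     # initial style distribution

for gen in range(5):
    samples = [random.gauss(mu, sigma) for _ in range(20_000)]
    # Keep the half closest to the scorer's preferred mode.
    samples.sort(key=lambda x: abs(x - preferred))
    kept = samples[: len(samples) // 2]
    mu, sigma = statistics.mean(kept), statistics.stdev(kept)
    print(f"gen {gen}: mean={mu:+.3f}  stdev={sigma:.3f}")
```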

I am not saying this breaks Anima, as the overall prompt comprehension and character knowledge are clearly very strong. I am just pointing out a plausible selection bias visible in prompting behavior and scorer response. For the final release, it might be worth either excluding ai-generated-tagged images entirely, penalizing their scores before filtering, or cross-validating Pony-based scores with a more independent scorer to improve quality tag behavior and style diversity.
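As a rough sketch of what the second and third options could look like in a filtering script (the function signature, tag names, and blend weights are all hypothetical, not a known part of the pipeline):

```python
# Hypothetical score adjustment before filtering: exclude explicitly
# tagged AI images, penalize related tags, and blend the Pony score
# with an independent scorer so one house aesthetic can't dominate.
def adjusted_score(tags: set[str], pony_score: float,
                   independent_score: float,
                   penalty: float = 2.0) -> float | None:
    # Option 1: exclude explicitly tagged AI images outright.
    if "ai-generated" in tags:
        return None
    # Option 3: cross-validate with an independent scorer (50/50 blend
    # is an arbitrary choice).
    score = 0.5 * pony_score + 0.5 * independent_score
    # Option 2: penalize related tags before filtering (tag names are
    # hypothetical examples).
    if tags & {"ai-assisted", "ai_upscaled"}:
        score -= penalty
    return score

print(adjusted_score({"1girl", "ai-assisted"}, pony_score=8.0, independent_score=5.0))
```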

CircleStone Labs org

The sources I used for the anime images all ban AI-generated content, as a rule.

But I did go and check explicitly. In the training set, there are 230 images with the ai-generated tag. I guess they were scraped after someone added the tag but before it could be removed. I'll filter this out in any future versions, but I don't think this is enough volume to have any real impact on the model. If the model managed to learn some consistent AI-generated look when prompted with that tag, then that's just another thing you can put in the negative prompt if you want.

Also, the Pony aesthetic scorer wasn't used to select any images, only to score them after I already had the dataset assembled. It gives you another knob to control aesthetics, and since those score tags are trained with 50% dropout, they're entirely optional.
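For reference, the dropout mechanism looks roughly like this (a minimal sketch; the caption format is an assumption, not the actual training code):

```python
# Per-tag dropout: the score tag is included in the training caption
# only (1 - score_dropout) of the time, so conditioning on it stays
# optional at inference.
import random

def build_caption(tags: list[str], score_tag: str | None,
                  score_dropout: float = 0.5) -> str:
    parts = list(tags)
    if score_tag is not None and random.random() >= score_dropout:
        parts.insert(0, score_tag)
    return ", ".join(parts)

# Roughly half of these captions will carry the score tag:
for _ in range(4):
    print(build_caption(["1girl", "solo", "smile"], "score_9"))
```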

Quick input on these Pony score tags. Anima-Preview2 is also capable of ultra-realistic photo images (really close to z-image-turbo quality), but only if you put score_9 in the negatives, or even negpip it. If you don't explicitly limit the model with that (and even that's not enough), it sometimes leaks this score_9 bias, resulting in that really ugly, artificial "AI-generated" look.
If people really do desire this ugly look (and judging by the overabundance of those "finetunes" on CivitAI, they do), I'd rather they train a LoRA on an ai-generated dataset than poison the whole model with this bias.
In my personal opinion, Pony scoring is more harmful than if you were to include E621 as a dataset.
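For anyone wanting to try the workaround, here is a hedged sketch with a generic diffusers pipeline. The checkpoint path is a placeholder, real Anima inference may go through ComfyUI instead, and negpip would need its own extension; this only shows the negative-prompt part.

```python
# Sketch of pushing score_9 (and ai-generated) into the negative prompt
# to suppress the score_9 style prior. Generic diffusers usage; the
# model path below is a placeholder, not an official repo id.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "path/to/anima-preview",            # placeholder path
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    prompt="photo, 1girl, natural lighting, detailed skin texture",
    negative_prompt="score_9, ai-generated",  # suppress the leaked style prior
    num_inference_steps=30,
).images[0]
image.save("out.png")
```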

Thanks for taking the time to check the dataset directly, @tdrussell, and for clarifying the pipeline. Knowing that the aesthetic scorers were only used for tagging after assembly, rather than for filtering, definitely shifts the context. It means the core issue isn't dataset selection bias, but rather conditioning and label semantics.

Regarding the 230 explicitly tagged images, I agree that volume alone won't break a model. However, my concern was never primarily about honest uploads with the ai-generated tag. The deeper issue is the sheer volume of untagged AI art on platforms like Pixiv and Danbooru that routinely slips past human moderation. The explicitly tagged ones just confirm that the dataset sources are vulnerable to that contamination.

But the most critical part is how both the Pony scorer and Waifu Scorer interact with that untagged data, especially combined with the 50% dropout rate. While the Pony classifier pushes the model toward a very specific recognizable aesthetic, Waifu Scorer tends to highly rate older, more general AI patterns reminiscent of SD 1.5 generations. It might seem like dropout makes these score tags entirely optional, but mechanically it can actually cause the exact leakage @shiboishi is observing above.

If an untagged AI-generated image gets a high score from either of these biased evaluators, the model sees that image paired with a high-quality tag half the time, and with no quality tag the other half. Instead of safely isolating the artificial aesthetic behind the score trigger, the model learns that those specific visual patterns—whether the Pony look or the older SD 1.5 slop—are also completely acceptable in the unconditional, default state. Dropout reduces strict prompt dependence, but it actively smears the biased aesthetic into the base prior.

This perfectly explains @shiboishi's practical observation. If users have to actively push score_9 into the negative prompt just to get clean photographic output, it proves that these tags are not functioning as neutral quality knobs. They act as aggressive style priors that the model defaults to. Relying on negative prompts to fix this is treating the symptom rather than the root cause.

Since this is a preview build, it would be incredibly insightful to see an ablation test for the final release: the exact same dataset but completely stripped of both Pony-derived and Waifu Scorer tags. I strongly suspect the base output would be much cleaner, more stylistically flexible, and less prone to that artificial look without those specific aesthetics bleeding into the weights.

Now I'm just simple country folk, I don't know a whole lot about this training business and scoring, but from what I've observed, score tags are absolutely not major drivers of style and quality in prompting. BUT:
I did test @shiboishi's statement about using score_9 in the negatives to improve realism, and from my observations, it absolutely does improve realism for some reason. Same prompt, same model, same settings, just adding score_9 in the negative. There's much less of a plastic/3D-render look to it, but it still has a ways to go before it's "z-image" quality, as he stated. And this was using "animaika", a model on CivitAI, not base Anima v0.1/v0.2.

It doesn't improve the anatomy or anything either, lol. If that helps prove anything either way: it still fucks up the hands and feet in the same predictable ways across different seeds. Overall, if I gave my opinion from testing, score tagging isn't massively impacting the quality of Anima.

My test images: (PG13 warning)
https://civitai.com/posts/27255617
Original image:
https://civitai.com/posts/27163246

(And while I was typing this up, I realized I still prefer the original images without that negative. A worthy test I may do in a moment: does trying TOO hard to steer Anima away from undesirable results ironically create undesirable results? May be the case, as I look back at the test images and see how the skin texture degraded in some.)

> Quick input on these Pony score tags. Anima-Preview2 is also capable of ultra-realistic photo images (really close to z-image-turbo quality), but only if you put score_9 in the negatives, or even negpip it. If you don't explicitly limit the model with that (and even that's not enough), it sometimes leaks this score_9 bias, resulting in that really ugly, artificial "AI-generated" look.
> If people really do desire this ugly look (and judging by the overabundance of those "finetunes" on CivitAI, they do), I'd rather they train a LoRA on an ai-generated dataset than poison the whole model with this bias.
> In my personal opinion, Pony scoring is more harmful than if you were to include E621 as a dataset.

Oh yeah, also: if people train LoRAs on ai-generated datasets, it also means you could apply the same LoRAs with a negative weight to DRIVE the output even further away from that look. I'm going to do some experiments later on "deaislopify" LoRAs.
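A rough sketch of what that negative-weight trick could look like in diffusers (paths and adapter names are placeholders; negative adapter weights are applied mechanically, and whether they actually "de-slop" outputs is exactly what needs testing):

```python
# Sketch of the negative-weight LoRA idea: load a LoRA trained on an
# ai-generated dataset, then apply it with a NEGATIVE weight to steer
# outputs away from that style.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "path/to/anima-preview", torch_dtype=torch.float16  # placeholder path
).to("cuda")

pipe.load_lora_weights("path/to/ai-slop-lora", adapter_name="aislop")  # placeholder
pipe.set_adapters(["aislop"], adapter_weights=[-0.6])  # negative weight = steer away

image = pipe(
    prompt="1girl, watercolor, traditional media",
    num_inference_steps=30,
).images[0]
image.save("deaisloped.png")
```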

I actually think training on AI-generated content is a good thing, because it can help with handling negative content. I really like using AI-generated tags as negative prompts, but Anima doesn't perform well with this approach.

> If people really do desire this ugly look (and judging by the overabundance of those "finetunes" on CivitAI, they do), I'd rather they train a LoRA on an ai-generated dataset than poison the whole model with this bias.
> In my personal opinion, Pony scoring is more harmful than if you were to include E621 as a dataset.

That's actually true. score_9 literally kills the model's base style, as if it's dragging you back two or three years to that creepy, soulless PonyV6 style.
Even in its preview state, Anima is already a truly wonderful model, and I really hope this rating method won't end up dragging it down and keeping it from reaching its full potential.

A short LoRA tune on photos can add much of the photo style back (regardless of tagging).

I don't see much effect from score_9 in negatives. You should not expect an anime-focused model to do Z-Image-quality photos without some additional training or LoRAs.

> A short LoRA tune on photos can add much of the photo style back (regardless of tagging).
>
> I don't see much effect from score_9 in negatives. You should not expect an anime-focused model to do Z-Image-quality photos without some additional training or LoRAs.

How short are we talking? 50 images? 500? I noticed that 15-20-image character LoRAs alone restored a ton of realism. I'm curious how simple the process could be.

In my testing, the score tags coupled with period tags past 2020 show way more of the AI-generated style; 'year 2024' and 'year 2025' are by far the worst. Maybe the AI-generated data past 2020 became the 'default' because those posts usually aren't tagged with an artist name, so the model takes that as the default style?
