Volunteers wanted: cleaning the danbooru tag set to fix concepts that trip up Anima (and taggers)

#201

by Grio43 - opened 4 days ago

Hi all — I'm the author of OppaiOracle, a from-scratch ViT danbooru tagger. I'm posting here because the concepts my tagger struggles with are, in my experience, the same ones Anima and other danbooru-trained models (SDXL/Illustrious/etc.) struggle with — and the root cause is shared: noise in the danbooru tag data itself. Wherever the tags are noisy, anything trained on them inherits that noise.

I'm looking for one or more volunteers to help clean that data.

Why this matters for Anima

Anima learns concepts from danbooru-style tags, so the failure patterns I've documented on the tagging side map almost directly onto generation failures:

Missing tags — the dominant problem. A typical image ships with ~28 tags when a careful pass would put it at 50+. Concepts that are present but unlabeled get little to no training signal, so the model under-learns them.
Color leakage — aqua bleeds into both blue and green; warm colors (yellow/orange/red) blur together. These boundaries are perceptual, not RGB, so they can't be fixed mechanically.
Hair length — long_hair / very_long_hair are inconsistent across visually similar images.
Size buckets — flat / small / medium / large / huge are continuous quantities crammed into noisy discrete labels.
The neckwear cluster — bow / bowtie / ribbon / ascot / necktie get confused for one another constantly (and it's representative of a broader small-accessory problem).
Character/concept leakage — when a tag is dominated by a handful of characters, the model learns the character instead of the concept.
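
As a rough illustration of the character/concept leakage item above, here is a minimal sketch that measures how much of a tag is covered by its few most common characters. The record layout (`tags`, `characters`), the function name, and the sample data are assumptions for illustration, not part of OppaiOracle or any danbooru tooling.

```python
from collections import Counter

def character_concentration(posts, concept, top_k=3):
    """Share of a concept's images that carry one of its top_k most common characters."""
    with_concept = [p for p in posts if concept in p["tags"]]
    if not with_concept:
        return 0.0
    # Count how often each character appears among images carrying the concept.
    counts = Counter(c for p in with_concept for c in p["characters"])
    top = {name for name, _ in counts.most_common(top_k)}
    covered = sum(1 for p in with_concept if top & set(p["characters"]))
    return covered / len(with_concept)

# Hypothetical toy data: "ascot" is mostly worn by one character.
posts = [
    {"tags": {"ascot"}, "characters": ["char_a"]},
    {"tags": {"ascot"}, "characters": ["char_a"]},
    {"tags": {"ascot"}, "characters": ["char_a"]},
    {"tags": {"ascot"}, "characters": ["char_b"]},
    {"tags": {"ascot"}, "characters": []},
    {"tags": {"bow"}, "characters": ["char_a"]},
]
print(character_concentration(posts, "ascot", top_k=1))
```

A value close to 1.0 would suggest the tag is a candidate for character leakage; where to draw that line is a judgment call.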

There's a much longer write-up of all of these on the OppaiOracle model card if you want the detail.

Where I'm at

I've individually cleaned ~2M tags so far (up from ~1.2M for my first model), and I supply cleaned data to duongve. I'm still on a 2024 snapshot; after this cleaning pass I'll convert the old tags to the 2026 taxonomy and train my next model on the corrected 2026 set. I also have ~12k images set aside as a golden set to de-noise some of the most deeply-rooted issues.

How you can help (two tracks)

Tag cleaning — preferred. Help normalize the commonly mixed/wrong concepts above. This is the highest-value contribution and benefits everything downstream, Anima included.
Golden data via AI generation. For concepts that are absent from danbooru, curated AI-generated images help seed them: roughly 500–600 images to start a brand-new concept. For commonly confused concepts, clean reference examples are useful too. When a tag is mixing with another tag, I need at least 100 images per tag before the golden dataset starts to separate them.

No big commitment required — even a small, focused pass on a single tag family (just the neckwear cluster, or one color axis) is genuinely useful.

If you're interested, reply here or reach me on Discord: mastoras2334. I'm happy to share the tooling and workflow I use, and point you at the highest-impact tag families to start with.

Thanks!

sorryhyun

4 days ago

I'm also quite interested in this project. I'm the main developer of sorryhyun/anima_lora, which includes a pipeline for a simple image-encoder-based tagger and various image-understanding experiments for Anima. I've been wondering how to balance tag granularity (keeping very_long_hair and long_hair separate) against model F1 (consolidating both into long_hair). The same question applies to color tags (red_bow alone, or red_bow plus bow). It would be great if you could share how you resolved this. Thanks for the huge contributions either way.

Grio43

4 days ago

edited 4 days ago

For uncommon / long-tailed tags, ASL helps keep them from degrading. The issue is that ASL also weakens the boundaries between tags, due to the nature of the loss.
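
To make the tradeoff concrete, here is a minimal per-tag sketch, assuming "ASL" means the asymmetric loss commonly used for multi-label classification (focusing exponents plus a probability margin on negatives). The hyperparameter values are illustrative, not the ones used in OppaiOracle.

```python
import math

def bce_term(p, y):
    """Binary cross-entropy for one tag with predicted probability p and label y."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

def asl_term(p, y, gamma_pos=0.0, gamma_neg=4.0, margin=0.05):
    """Asymmetric loss for one tag (hyperparameter values are illustrative)."""
    if y == 1:
        return -((1.0 - p) ** gamma_pos) * math.log(p)
    # Probability shifting: negatives scored below the margin cost nothing.
    p_m = max(p - margin, 0.0)
    # Focusing: the remaining negatives are down-weighted by p_m ** gamma_neg.
    return -(p_m ** gamma_neg) * math.log(1.0 - p_m)

# Compare the penalty on a negative label at a few predicted probabilities.
for p in (0.03, 0.3, 0.6):
    print(p, bce_term(p, 0), asl_term(p, 0))
```

With these settings, a negative that the model scores at a moderate probability is penalized far less under the asymmetric loss than under BCE. That protects rare positives from being swamped by negatives, and it is plausibly also why the boundary between two similar tags gets pushed apart less firmly.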

Now for the boundary decision: I'll probably switch from ASL to BCE. You can use a golden dataset for commonly confused pairs, say long_hair and very_long_hair. Fine-tuning on a small golden set can cause regression, even with the backbone locked. I'm fine-tuning on my cleaned golden set anyway. It might cause regression, but I'm using the tuned model as a tool to find incorrectly tagged data in my main dataset.

Once the tuned model has found the incorrectly tagged images that cause boundary issues, I'll fix those tags, revert to an early checkpoint, and train again before I switch to BCE.
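
The "use the tuned model to find bad tags" step could be sketched roughly as below. The record layout, the thresholds, and the function name are assumptions for illustration; this is not the actual OppaiOracle tooling, and flagged items would still need a manual review.

```python
def flag_suspect_labels(records, tag, high=0.9, low=0.1):
    """Return (image_id, reason) pairs where model confidence contradicts the stored tags."""
    suspects = []
    for rec in records:
        score = rec["scores"].get(tag, 0.0)  # tuned model's probability for this tag
        labeled = tag in rec["tags"]         # whether the dataset currently has the tag
        if labeled and score <= low:
            suspects.append((rec["id"], "labeled but model says absent"))
        elif not labeled and score >= high:
            suspects.append((rec["id"], "unlabeled but model says present"))
    return suspects

# Hypothetical records: dataset tags plus scores from the golden-set-tuned model.
records = [
    {"id": 1, "tags": {"long_hair"}, "scores": {"very_long_hair": 0.95}},
    {"id": 2, "tags": {"very_long_hair"}, "scores": {"very_long_hair": 0.04}},
    {"id": 3, "tags": {"very_long_hair"}, "scores": {"very_long_hair": 0.88}},
]
for image_id, reason in flag_suspect_labels(records, "very_long_hair"):
    print(image_id, reason)
```

Only strong disagreements are flagged here, which keeps the review queue short at the cost of missing borderline cases.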

I follow the wiki definitions for parent and child tags. For bow and red_bow, it helps the model overall to keep the hierarchy: every color of bow falls under bow, so a red bow should also have the bow tag, and a green bow should also have the bow tag. It helps cluster the data in the gradient.
Note: I'm speaking only about the tagging side, based on my own ViT tagging project. I haven't looked at your specific project in depth.
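
The parent/child rule described above (a red bow also carries bow) can be applied mechanically once the hierarchy is known. A minimal sketch follows; the `PARENTS` map is a hand-written illustration covering only the bow examples from this post, not a dump of the wiki's implications.

```python
def expand_with_parents(tags, parents):
    """Add every ancestor tag implied by the tags already present."""
    expanded = set(tags)
    stack = list(tags)
    while stack:
        tag = stack.pop()
        for parent in parents.get(tag, ()):
            if parent not in expanded:
                expanded.add(parent)
                stack.append(parent)  # follow chains of parents too
    return expanded

# Hypothetical child -> parents map, limited to the examples in this thread.
PARENTS = {"red_bow": ["bow"], "green_bow": ["bow"]}

print(sorted(expand_with_parents({"1girl", "red_bow"}, PARENTS)))
```

Running the expansion over the whole dataset before training keeps the parent tag consistent, so every colored bow contributes signal to bow as well.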

edit:
https://huggingface.co/datasets/Grio43/Tag_cleaning/tree/main I'll start maintaining this. It doesn't contain all of the cleaned tags and is still on the 2024 tags. After my current cleaning push, I'll update the list to the 2026 tags and add more of my recent cleaning. I expect my current light pass of manual review to take 2-3 more months going solo. Expect some errors, but it will be better than the default dataset.
