Update README.md
README.md
CHANGED
|
@@ -11,7 +11,7 @@ license: apache-2.0
|
|
| 11 |
|
| 12 |
## TL;DR
|
| 13 |
|
| 14 |
-
A multi-label anime tagger trained from scratch on a ~5.9M image dataset that received a targeted cleaning and vocabulary-expansion pass before training. The corrections touched roughly **1.3M tags** — large in absolute terms, but only on the order of **~3% of all tags** in the corpus, so this is best described as a *targeted* cleaning rather than a heavy one. The pass was deliberately weighted toward **low-frequency tags**, which is where mislabels and missing labels hurt a tagger the most. On my evaluation set the model achieves the best precision-equals-recall point and a good mAP relative to comparable open tagger checkpoints, but the underlying training data still contains category-level noise that no amount of training would have erased. **All predictions should be human-reviewed before they are trusted.**
|
| 15 |
|
| 16 |
This release ships two checkpoints — **V1** (the from-scratch 320×320 model) and **V1.1** (a 448×448 fine-tune of V1). Pick the checkpoint whose native resolution matches the resolution you intend to feed it (see *Variants* below).
|
| 17 |
|
|
@@ -35,12 +35,12 @@ Two practical notes:
|
|
| 35 |
|
| 36 |
I started with a corpus of roughly **5.9 million images** with publicly-sourced tags. Before training anything of my own, I used **SmilingWolf's ViT v3 tagger** to help clean the dataset. With that pipeline I:
|
| 37 |
|
| 38 |
-
- **Removed ~300k incorrect tags** from images where the public labels disagreed with the AI tagger and a human spot-check confirmed the public labels were wrong.
|
| 39 |
-
- **Added ~1,000,000 missing tags** in the same fashion — places where the AI tagger surfaced a label the public tag set had simply omitted, and human review agreed.
|
| 40 |
|
| 41 |
-
That is ~1.3M corrections in total, which is only on the order of **~3% of the tags in the corpus**. This was a *targeted* pass, not a top-to-bottom relabel. Effort was deliberately concentrated on **low-frequency tags**, on the assumption that mislabels and missing labels do disproportionate damage in the long tail — a missing label on a tag with 800 positives in the entire dataset matters far more than a missing label on a tag with 800k positives.
|
| 42 |
|
| 43 |
-
I then trained a small "light" model on this cleaned dataset, primarily as a vehicle to **expand the tag vocabulary by ~20,000 additional low-frequency tags** that the original tag set under-represented. That expanded vocabulary is what the released model was trained against.
|
| 44 |
|
| 45 |
The released checkpoint is the main training run on the cleaned dataset with the expanded vocabulary.
|
| 46 |
|
|
@@ -61,11 +61,11 @@ If you only remember one thing from this section, remember this: **the biggest s
|
|
| 61 |
|
| 62 |
A rough empirical sense of the gap, from manual review:
|
| 63 |
|
| 64 |
-
- A typical image in this dataset arrives with roughly **~28 tags** from the source.
|
| 65 |
- A reasonably-tagged image — judged by what is actually visible, sticking to common in-vocabulary concepts and not reaching for rare tags — should have **50+ tags**, often more.
|
| 66 |
-
- During spot-checks I have routinely taken images that arrived with **~40 tags** to **more than 60 tags** just by adding common, obviously present concepts. That is without making any effort to surface rare tags; including those would push the number higher still.
|
| 67 |
|
| 68 |
-
So the source tag count is on the order of **half** of what a careful tagger would emit on the same image, and the gap is concentrated in concepts that are not subjective — they are simply omissions. The cleaning pass added ~1M missing tags back, but with the gap this large there are many millions still missing across the corpus.
|
| 69 |
|
| 70 |
The training-time consequence is that for every missing-but-present tag, the model receives **no positive gradient at all** for that concept on that image — only an implicit negative through the loss. This systematically biases the model toward under-predicting any tag with a high source-data omission rate, and the effect is uneven across tags: some tag families are well-tagged at the source and some are very sparsely tagged. Practically, this means **low predicted scores are less informative than they look** — a tag scoring below threshold may be genuinely absent, or it may be a concept the model has learned is "usually unlabeled even when present."
|
| 71 |
|
|
@@ -102,7 +102,7 @@ For some tags, the data is dominated by a small number of characters. When that
|
|
| 102 |
|
| 103 |
### My estimate of cleaning quality
|
| 104 |
|
| 105 |
-
The 300k removals and ~1M additions were **AI-assisted and then human-reviewed by me**. My honest estimate is that the corrections themselves are **<5% error**. That is a statement about the *changes I made*, not about the *underlying dataset* — the underlying dataset still contains the structured noise described above, because cleaning was driven by AI-flagged disagreements and the AI shares the same color/length/size confusion as the source data does.
|
| 106 |
|
| 107 |
---
|
| 108 |
|
|
@@ -148,7 +148,7 @@ I want to be honest about *why* I think it performs well: **it is almost certain
|
|
| 148 |
|
| 149 |
## Image augmentation settings (V1 and V1.1)
|
| 150 |
|
| 151 |
-
For reproducibility, here are the exact augmentation pipelines used for each checkpoint. V1.1 is a fine-tune of V1, so its augmentation is a *reduced* version of V1's — narrower ranges and lower probabilities at the higher 448×448 resolution. The reductions follow EfficientNetV2 / FixRes guidance for progressive-resolution training, but only partially (~¼ reduction rather than ½), because Phase 1 stopped at 33/40 epochs and the V1 base was under-converged when V1.1 began.
|
| 152 |
|
| 153 |
| Augmentation | V1 (320×320, from scratch, 40 epochs planned) | V1.1 (448×448, fine-tune of V1, 15 epochs) |
|
| 154 |
|---|---|---|
|
|
@@ -164,7 +164,7 @@ For reproducibility, here are the exact augmentation pipelines used for each che
|
|
| 164 |
|
| 165 |
Notes on a few of these choices:
|
| 166 |
|
| 167 |
-
- **Saturation is held well below brightness/contrast** in both phases. Saturation is the only color-jitter axis that directly attacks color-named tag identity (`blue_eyes`, `pink_skin`, etc.); brightness and contrast are luminance-driven and largely chroma-safe. The ratio (~¼ of brightness) is taken from BYOL's asymmetric augmentation.
|
| 168 |
- **Rotation is kept on at V1.1**, against the plain FixRes recommendation. The original plan was to disable it at 448 for spatial precision, but with V1 under-converged it was safer to keep a residual rotational-invariance signal. The compromise was a tighter angle band (±5° vs. ±8°) and a lower fire rate (0.30 vs. 0.50).
|
| 169 |
- **Gaussian blur is also kept on at V1.1** for the same reason (under-converged base + reduced color/rotation aug → strips too much input variability if blur is dropped entirely). Frequency was halved and the σ ceiling pulled in from 1.5 to 1.0.
|
| 170 |
- **No mixup, no cutmix, no RandAugment, no random erasing** in either phase. The recipe is intentionally close to DeiT III's "3-Augment" regime (flip + color jitter + blur) plus a small rotation, not a heavy AugReg/RandAugment stack.
|
|
@@ -179,7 +179,7 @@ Notes on a few of these choices:
|
|
| 179 |
| Hair length (especially `long_hair`, `very_long_hair`) | **High** | Boundary tags inherently noisy in source |
|
| 180 |
| Size-bucket body-part tags | **High** | Continuous quantity discretized into noisy buckets |
|
| 181 |
| Neckwear (`bow`, `bowtie`, `ribbon`, `ascot`, `necktie`) | **High** | Visually similar accessories routinely confused at source; representative of a broader small-accessory pattern |
|
| 182 |
-
| Missing tags (concept present, no label) | **Dominant** | The single biggest source of noise in the corpus. Typical ~28 tags/image vs. 50+ that should be present. ~1M added back during cleaning; many millions remain. Hurts performance broadly and biases the model toward under-prediction. |
|
| 183 |
| Character-overwhelmed tags | **Medium** | Some tags are learned as proxies for specific characters |
|
| 184 |
| Rare / low-frequency tags | **Medium** | The +20k vocabulary expansion helps, but tail tags still see fewer examples |
|
| 185 |
| Anything not on the above list | Use with normal caution | The above are illustrative, not exhaustive — many tag families show similar source-data issues |
|
|
|
|
| 11 |
|
| 12 |
## TL;DR
|
| 13 |
|
| 14 |
+
A multi-label anime tagger trained from scratch on a \~5.9M image dataset that received a targeted cleaning and vocabulary-expansion pass before training. The corrections touched roughly **1.3M tags** — large in absolute terms, but only on the order of **\~3% of all tags** in the corpus, so this is best described as a *targeted* cleaning rather than a heavy one. The pass was deliberately weighted toward **low-frequency tags**, which is where mislabels and missing labels hurt a tagger the most. On my evaluation set the model achieves the best precision-equals-recall point and a good mAP relative to comparable open tagger checkpoints, but the underlying training data still contains category-level noise that no amount of training would have erased. **All predictions should be human-reviewed before they are trusted.**
|
| 15 |
|
| 16 |
This release ships two checkpoints — **V1** (the from-scratch 320×320 model) and **V1.1** (a 448×448 fine-tune of V1). Pick the checkpoint whose native resolution matches the resolution you intend to feed it (see *Variants* below).
|
| 17 |
|
|
|
|
| 35 |
|
| 36 |
I started with a corpus of roughly **5.9 million images** with publicly-sourced tags. Before training anything of my own, I used **SmilingWolf's ViT v3 tagger** to help clean the dataset. With that pipeline I:
|
| 37 |
|
| 38 |
+
- **Removed \~300k incorrect tags** from images where the public labels disagreed with the AI tagger and a human spot-check confirmed the public labels were wrong.
|
| 39 |
+
- **Added \~1,000,000 missing tags** in the same fashion — places where the AI tagger surfaced a label the public tag set had simply omitted, and human review agreed.
|
| 40 |
|
| 41 |
+
That is \~1.3M corrections in total, which is only on the order of **\~3% of the tags in the corpus**. This was a *targeted* pass, not a top-to-bottom relabel. Effort was deliberately concentrated on **low-frequency tags**, on the assumption that mislabels and missing labels do disproportionate damage in the long tail — a missing label on a tag with 800 positives in the entire dataset matters far more than a missing label on a tag with 800k positives.
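As a rough illustration of the long-tail argument above, the sketch below compares what one missing label costs a rare tag and a common tag, as a share of that tag's labeled positives. It uses only the two counts quoted in the text; the function name is hypothetical and this is a back-of-envelope comparison, not a measurement from the dataset.

```python
def share_of_positives(missing_labels: int, labeled_positives: int) -> float:
    """Missing labels expressed as a fraction of a tag's labeled positives."""
    return missing_labels / labeled_positives


# Counts quoted in the text: a rare tag with 800 positives, a common one with 800k.
rare = share_of_positives(1, 800)
common = share_of_positives(1, 800_000)

print(f"rare tag:   {rare:.6%} of positives per missing label")
print(f"common tag: {common:.6%} of positives per missing label")
print(f"relative impact: {rare / common:.0f}x")
```

Under this simple view, each missing label weighs about a thousand times more on the rare tag, which is the motivation for concentrating review effort there.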
|
| 42 |
|
| 43 |
+
I then trained a small "light" model on this cleaned dataset, primarily as a vehicle to **expand the tag vocabulary by \~20,000 additional low-frequency tags** that the original tag set under-represented. That expanded vocabulary is what the released model was trained against.
|
| 44 |
|
| 45 |
The released checkpoint is the main training run on the cleaned dataset with the expanded vocabulary.
|
| 46 |
|
|
|
|
| 61 |
|
| 62 |
A rough empirical sense of the gap, from manual review:
|
| 63 |
|
| 64 |
+
- A typical image in this dataset arrives with roughly **\~28 tags** from the source.
|
| 65 |
- A reasonably-tagged image — judged by what is actually visible, sticking to common in-vocabulary concepts and not reaching for rare tags — should have **50+ tags**, often more.
|
| 66 |
+
- During spot-checks I have routinely taken images that arrived with **\~40 tags** to **more than 60 tags** just by adding common, obviously present concepts. That is without making any effort to surface rare tags; including those would push the number higher still.
|
| 67 |
|
| 68 |
+
So the source tag count is on the order of **half** of what a careful tagger would emit on the same image, and the gap is concentrated in concepts that are not subjective — they are simply omissions. The cleaning pass added \~1M missing tags back, but with the gap this large there are many millions still missing across the corpus.
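The scale of the remaining gap can be sketched with the figures quoted above. This is a back-of-envelope estimate only: it treats the typical count of roughly 28 tags as a per-image mean and 50 as a floor for every image, neither of which the text guarantees, so the real number could differ substantially.

```python
images = 5_900_000      # corpus size quoted in the text
typical_tags = 28       # typical source tags per image (assumed to be a mean)
expected_tags = 50      # floor for a reasonably tagged image (assumed for all images)
added_back = 1_000_000  # missing tags restored by the cleaning pass

# Missing tags implied by the per-image gap, before and after cleaning.
missing_before = (expected_tags - typical_tags) * images
missing_after = missing_before - added_back
coverage = typical_tags / expected_tags

print(f"implied missing tags before cleaning: {missing_before:,}")
print(f"implied missing tags after cleaning:  {missing_after:,}")
print(f"source coverage of expected tags:     {coverage:.0%}")
```

Under these assumptions the cleaning pass closes under 1% of the implied gap, which is consistent with the statement that many millions of tags are still missing.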
|
| 69 |
|
| 70 |
The training-time consequence is that for every missing-but-present tag, the model receives **no positive gradient at all** for that concept on that image — only an implicit negative through the loss. This systematically biases the model toward under-predicting any tag with a high source-data omission rate, and the effect is uneven across tags: some tag families are well-tagged at the source and some are very sparsely tagged. Practically, this means **low predicted scores are less informative than they look** — a tag scoring below threshold may be genuinely absent, or it may be a concept the model has learned is "usually unlabeled even when present."
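To make the gradient argument concrete, here is a minimal sketch that assumes an independent sigmoid output per tag trained with binary cross-entropy. The README does not name the loss, so that choice is an assumption, and the function names are hypothetical. Under that loss, the derivative with respect to a tag's logit is `sigmoid(logit) - label`.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def bce_grad_wrt_logit(logit: float, label: float) -> float:
    """Derivative of binary cross-entropy with respect to the logit."""
    return sigmoid(logit) - label


logit = 0.0  # the model is undecided about this tag (p = 0.5)

# Concept is visible and labeled: negative gradient, so a descent step raises the logit.
grad_if_labeled = bce_grad_wrt_logit(logit, 1.0)

# Concept is visible but the label is missing, so it is trained as a negative:
# positive gradient, so a descent step lowers the logit.
grad_if_missing = bce_grad_wrt_logit(logit, 0.0)

print(grad_if_labeled, grad_if_missing)
```

The two cases differ only in the label, yet they push the logit in opposite directions. A tag that is often omitted at the source therefore accumulates downward pressure on images where the concept is actually present.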
|
| 71 |
|
|
|
|
| 102 |
|
| 103 |
### My estimate of cleaning quality
|
| 104 |
|
| 105 |
+
The 300k removals and \~1M additions were **AI-assisted and then human-reviewed by me**. My honest estimate is that the corrections themselves are **<5% error**. That is a statement about the *changes I made*, not about the *underlying dataset* — the underlying dataset still contains the structured noise described above, because cleaning was driven by AI-flagged disagreements and the AI shares the same color/length/size confusion as the source data does.
|
| 106 |
|
| 107 |
---
|
| 108 |
|
|
|
|
| 148 |
|
| 149 |
## Image augmentation settings (V1 and V1.1)
|
| 150 |
|
| 151 |
+
For reproducibility, here are the exact augmentation pipelines used for each checkpoint. V1.1 is a fine-tune of V1, so its augmentation is a *reduced* version of V1's — narrower ranges and lower probabilities at the higher 448×448 resolution. The reductions follow EfficientNetV2 / FixRes guidance for progressive-resolution training, but only partially (\~¼ reduction rather than ½), because Phase 1 stopped at 33/40 epochs and the V1 base was under-converged when V1.1 began.
|
| 152 |
|
| 153 |
| Augmentation | V1 (320×320, from scratch, 40 epochs planned) | V1.1 (448×448, fine-tune of V1, 15 epochs) |
|
| 154 |
|---|---|---|
|
|
|
|
| 164 |
|
| 165 |
Notes on a few of these choices:
|
| 166 |
|
| 167 |
+
- **Saturation is held well below brightness/contrast** in both phases. Saturation is the only color-jitter axis that directly attacks color-named tag identity (`blue_eyes`, `pink_skin`, etc.); brightness and contrast are luminance-driven and largely chroma-safe. The ratio (\~¼ of brightness) is taken from BYOL's asymmetric augmentation.
|
| 168 |
- **Rotation is kept on at V1.1**, against the plain FixRes recommendation. The original plan was to disable it at 448 for spatial precision, but with V1 under-converged it was safer to keep a residual rotational-invariance signal. The compromise was a tighter angle band (±5° vs. ±8°) and a lower fire rate (0.30 vs. 0.50).
|
| 169 |
- **Gaussian blur is also kept on at V1.1** for the same reason (under-converged base + reduced color/rotation aug → strips too much input variability if blur is dropped entirely). Frequency was halved and the σ ceiling pulled in from 1.5 to 1.0.
|
| 170 |
- **No mixup, no cutmix, no RandAugment, no random erasing** in either phase. The recipe is intentionally close to DeiT III's "3-Augment" regime (flip + color jitter + blur) plus a small rotation, not a heavy AugReg/RandAugment stack.
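The rotation settings described in the notes can be sketched as follows. Only the angle bands and fire rates come from the text. Everything else is an assumption made for illustration: the uniform angle sampling, the independent per-image coin flip, and every name in the code. The saturation helper simply encodes the "about a quarter of brightness" ratio; the actual brightness values are not given in this excerpt and are left to the caller. This is not the training code.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RotationSpec:
    max_degrees: float  # symmetric band: angle drawn from [-max, +max]
    prob: float         # chance that rotation fires on a given image


# Values quoted in the notes above.
V1_ROTATION = RotationSpec(max_degrees=8.0, prob=0.50)
V1_1_ROTATION = RotationSpec(max_degrees=5.0, prob=0.30)


def sample_rotation(spec: RotationSpec, rng: random.Random) -> float:
    """Return the angle for one image, or 0.0 when rotation does not fire."""
    if rng.random() >= spec.prob:
        return 0.0
    return rng.uniform(-spec.max_degrees, spec.max_degrees)


def saturation_from_brightness(brightness: float, ratio: float = 0.25) -> float:
    """Saturation jitter strength held at a fixed fraction of brightness."""
    return brightness * ratio


rng = random.Random(0)  # fixed seed so the sketch is repeatable
angles = [sample_rotation(V1_1_ROTATION, rng) for _ in range(10_000)]
fire_rate = sum(a != 0.0 for a in angles) / len(angles)
print(f"empirical V1.1 rotation fire rate: {fire_rate:.3f}")
```

In a real pipeline the sampled angle would be handed to whatever image library performs the rotation. Keeping the band and probability in one small spec object makes the V1 and V1.1 differences easy to compare side by side.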
|
|
|
|
| 179 |
| Hair length (especially `long_hair`, `very_long_hair`) | **High** | Boundary tags inherently noisy in source |
|
| 180 |
| Size-bucket body-part tags | **High** | Continuous quantity discretized into noisy buckets |
|
| 181 |
| Neckwear (`bow`, `bowtie`, `ribbon`, `ascot`, `necktie`) | **High** | Visually similar accessories routinely confused at source; representative of a broader small-accessory pattern |
|
| 182 |
+
| Missing tags (concept present, no label) | **Dominant** | The single biggest source of noise in the corpus. Typical \~28 tags/image vs. 50+ that should be present. \~1M added back during cleaning; many millions remain. Hurts performance broadly and biases the model toward under-prediction. |
|
| 183 |
| Character-overwhelmed tags | **Medium** | Some tags are learned as proxies for specific characters |
|
| 184 |
| Rare / low-frequency tags | **Medium** | The +20k vocabulary expansion helps, but tail tags still see fewer examples |
|
| 185 |
| Anything not on the above list | Use with normal caution | The above are illustrative, not exhaustive — many tag families show similar source-data issues |
|