Update README.md

README.md +12 -12

@@ -11,7 +11,7 @@ license: apache-2.0
 ## TL;DR
-A multi-label anime tagger trained from scratch on a ~5.9M image dataset that received a targeted cleaning and vocabulary-expansion pass before training. The corrections touched roughly **1.3M tags** — large in absolute terms, but only on the order of **~3% of all tags** in the corpus, so this is best described as a *targeted* cleaning rather than a heavy one. The pass was deliberately weighted toward **low-frequency tags**, which is where mislabels and missing labels hurt a tagger the most. On my evaluation set the model achieves the best precision-equals-recall point and a good mAP relative to comparable open tagger checkpoints, but the underlying training data still contains category-level noise that no amount of training would have erased. **All predictions should be human-reviewed before they are trusted.**
 This release ships two checkpoints — **V1** (the from-scratch 320×320 model) and **V1.1** (a 448×448 fine-tune of V1). Pick the checkpoint whose native resolution matches the resolution you intend to feed it (see *Variants* below).
@@ -35,12 +35,12 @@ Two practical notes:
 I started with a corpus of roughly **5.9 million images** with publicly-sourced tags. Before training anything of my own, I used **SmilingWolf's ViT v3 tagger** to help clean the dataset. With that pipeline I:
-- **Removed ~300k incorrect tags** from images where the public labels disagreed with the AI tagger and a human spot-check confirmed the public labels were wrong.
-- **Added ~1,000,000 missing tags** in the same fashion — places where the AI tagger surfaced a label the public tag set had simply omitted, and human review agreed.
-That is ~1.3M corrections in total, which is only on the order of **~3% of the tags in the corpus**. This was a *targeted* pass, not a top-to-bottom relabel. Effort was deliberately concentrated on **low-frequency tags**, on the assumption that mislabels and missing labels do disproportionate damage in the long tail — a missing label on a tag with 800 positives in the entire dataset matters far more than a missing label on a tag with 800k positives.
-I then trained a small "light" model on this cleaned dataset, primarily as a vehicle to **expand the tag vocabulary by ~20,000 additional low-frequency tags** that the original tag set under-represented. That expanded vocabulary is what the released model was trained against.
 The released checkpoint is the main training run on the cleaned dataset with the expanded vocabulary.
@@ -61,11 +61,11 @@ If you only remember one thing from this section, remember this: **the biggest s
 A rough empirical sense of the gap, from manual review:
-- A typical image in this dataset arrives with roughly **~28 tags** from the source.
 - A reasonably-tagged image — judged by what is actually visible, sticking to common in-vocabulary concepts and not reaching for rare tags — should have **50+ tags**, often more.
-- During spot-checks I have routinely taken images that arrived with **~40 tags up past 60 tags** just by adding common, obviously-present concepts. That is without making any effort to surface rare tags; including those would push the number higher still.
-So the source tag count is on the order of **half** of what a careful tagger would emit on the same image, and the gap is concentrated in concepts that are not subjective — they are simply omissions. The cleaning pass added ~1M missing tags back, but with the gap this large there are many millions still missing across the corpus.
 The training-time consequence is that for every missing-but-present tag, the model receives **no positive gradient at all** for that concept on that image — only an implicit negative through the loss. This systematically biases the model toward under-predicting any tag with a high source-data omission rate, and the effect is uneven across tags: some tag families are well-tagged at the source and some are very sparsely tagged. Practically, this means **low predicted scores are less informative than they look** — a tag scoring below threshold may be genuinely absent, or it may be a concept the model has learned is "usually unlabeled even when present."
@@ -102,7 +102,7 @@ For some tags, the data is dominated by a small number of characters. When that
 ### My estimate of cleaning quality
-The 300k removals and ~1M additions were **AI-assisted and then human-reviewed by me**. My honest estimate is that the corrections themselves are **<5% error**. That is a statement about the *changes I made*, not about the *underlying dataset* — the underlying dataset still contains the structured noise described above, because cleaning was driven by AI-flagged disagreements and the AI shares the same color/length/size confusion as the source data does.
 ---
@@ -148,7 +148,7 @@ I want to be honest about *why* I think it performs well: **it is almost certain
 ## Image augmentation settings (V1 and V1.1)
-For reproducibility, here are the exact augmentation pipelines used for each checkpoint. V1.1 is a fine-tune of V1, so its augmentation is a *reduced* version of V1's — narrower ranges and lower probabilities at the higher 448×448 resolution. The reductions follow EfficientNetV2 / FixRes guidance for progressive-resolution training, but only partially (~¼ reduction rather than ½), because Phase 1 stopped at 33/40 epochs and the V1 base was under-converged when V1.1 began.
 | Augmentation | V1 (320×320, from scratch, 40 epochs planned) | V1.1 (448×448, fine-tune of V1, 15 epochs) |
 |---|---|---|
@@ -164,7 +164,7 @@ For reproducibility, here are the exact augmentation pipelines used for each che
 Notes on a few of these choices:
-- **Saturation is held well below brightness/contrast** in both phases. Saturation is the only color-jitter axis that directly attacks color-named tag identity (`blue_eyes`, `pink_skin`, etc.); brightness and contrast are luminance-driven and largely chroma-safe. The ratio (~¼ of brightness) is taken from BYOL's asymmetric augmentation.
 - **Rotation is kept on at V1.1**, against the plain FixRes recommendation. The original plan was to disable it at 448 for spatial precision, but with V1 under-converged it was safer to keep a residual rotational-invariance signal. The compromise was a tighter angle band (±5° vs. ±8°) and a lower fire rate (0.30 vs. 0.50).
 - **Gaussian blur is also kept on at V1.1** for the same reason (under-converged base + reduced color/rotation aug → strips too much input variability if blur is dropped entirely). Frequency was halved and the σ ceiling pulled in from 1.5 to 1.0.
 - **No mixup, no cutmix, no RandAugment, no random erasing** in either phase. The recipe is intentionally close to DeiT III's "3-Augment" regime (flip + color jitter + blur) plus a small rotation, not a heavy AugReg/RandAugment stack.
@@ -179,7 +179,7 @@ Notes on a few of these choices:
 | Hair length (especially `long_hair`, `very_long_hair`) | **High** | Boundary tags inherently noisy in source |
 | Size-bucket body-part tags | **High** | Continuous quantity discretized into noisy buckets |
 | Neckwear (`bow`, `bowtie`, `ribbon`, `ascot`, `necktie`) | **High** | Visually similar accessories routinely confused at source; representative of a broader small-accessory pattern |
-| Missing tags (concept present, no label) | **Dominant** | The single biggest source of noise in the corpus. Typical ~28 tags/image vs. 50+ that should be present. ~1M added back during cleaning; many millions remain. Hurts performance broadly and biases the model toward under-prediction. |
 | Character-overwhelmed tags | **Medium** | Some tags are learned as proxies for specific characters |
 | Rare / low-frequency tags | **Medium** | The +20k vocabulary expansion helps, but tail tags still see fewer examples |
 | Anything not on the above list | Use with normal caution | The above are illustrative, not exhaustive — many tag families show similar source-data issues |

 ## TL;DR
+A multi-label anime tagger trained from scratch on a \~5.9M image dataset that received a targeted cleaning and vocabulary-expansion pass before training. The corrections touched roughly **1.3M tags** — large in absolute terms, but only on the order of **\~3% of all tags** in the corpus, so this is best described as a *targeted* cleaning rather than a heavy one. The pass was deliberately weighted toward **low-frequency tags**, which is where mislabels and missing labels hurt a tagger the most. On my evaluation set the model achieves the best precision-equals-recall point and a good mAP relative to comparable open tagger checkpoints, but the underlying training data still contains category-level noise that no amount of training would have erased. **All predictions should be human-reviewed before they are trusted.**
 This release ships two checkpoints — **V1** (the from-scratch 320×320 model) and **V1.1** (a 448×448 fine-tune of V1). Pick the checkpoint whose native resolution matches the resolution you intend to feed it (see *Variants* below).
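For readers unfamiliar with the "precision-equals-recall point" mentioned in the TL;DR, the following is a minimal sketch of how such a point can be located by sweeping a decision threshold. The README does not say how the metric was averaged or which thresholds were swept, so the pooled (micro-style) counting, the helper names `precision_recall` and `find_pr_crossover`, and the toy scores and labels are all assumptions made for illustration.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall over pooled (score, label) pairs at one threshold."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


def find_pr_crossover(scores, labels, thresholds):
    """Return the threshold where |precision - recall| is smallest."""
    def gap(threshold):
        precision, recall = precision_recall(scores, labels, threshold)
        return abs(precision - recall)
    return min(thresholds, key=gap)


# Toy data, invented for this sketch (not taken from the model's evaluation).
toy_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
toy_labels = [1, 1, 0, 1, 1, 0, 1, 0]
toy_thresholds = [i / 20 for i in range(1, 20)]

crossover = find_pr_crossover(toy_scores, toy_labels, toy_thresholds)
p_at_cross, r_at_cross = precision_recall(toy_scores, toy_labels, crossover)
print(f"threshold={crossover:.2f} precision={p_at_cross:.3f} recall={r_at_cross:.3f}")
```

At the crossover the number of predicted positives equals the number of true positives in the labels, which is why it is a convenient single-number summary for comparing taggers.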
 I started with a corpus of roughly **5.9 million images** with publicly-sourced tags. Before training anything of my own, I used **SmilingWolf's ViT v3 tagger** to help clean the dataset. With that pipeline I:
+- **Removed \~300k incorrect tags** from images where the public labels disagreed with the AI tagger and a human spot-check confirmed the public labels were wrong.
+- **Added \~1,000,000 missing tags** in the same fashion — places where the AI tagger surfaced a label the public tag set had simply omitted, and human review agreed.
+That is \~1.3M corrections in total, which is only on the order of **\~3% of the tags in the corpus**. This was a *targeted* pass, not a top-to-bottom relabel. Effort was deliberately concentrated on **low-frequency tags**, on the assumption that mislabels and missing labels do disproportionate damage in the long tail — a missing label on a tag with 800 positives in the entire dataset matters far more than a missing label on a tag with 800k positives.
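The disagreement-driven cleaning described above can be sketched roughly as follows. This is not the author's pipeline: the function name `flag_disagreements`, the two thresholds, and the example tags and scores are hypothetical, and the output is only a list of candidates for the human review step the text describes.

```python
def flag_disagreements(public_tags, tagger_scores, add_above=0.90, remove_below=0.05):
    """Compare source tags with tagger scores and return review candidates.

    public_tags:   set of tags attached to the image at the source.
    tagger_scores: dict mapping tag -> score in [0, 1] from the reference tagger.
    Thresholds are hypothetical values chosen for this sketch.
    """
    # Source says "present", tagger is confident it is absent.
    # Tags outside the tagger's vocabulary are skipped, not flagged.
    removal_candidates = sorted(
        tag for tag in public_tags
        if tag in tagger_scores and tagger_scores[tag] < remove_below
    )
    # Tagger is confident the concept is present, source omitted it.
    addition_candidates = sorted(
        tag for tag, score in tagger_scores.items()
        if score > add_above and tag not in public_tags
    )
    return removal_candidates, addition_candidates


# Invented example image, for illustration only.
example_public = {"1girl", "long_hair", "hat"}
example_scores = {"1girl": 0.99, "long_hair": 0.02, "smile": 0.95,
                  "hat": 0.50, "outdoors": 0.30}

to_remove, to_add = flag_disagreements(example_public, example_scores)
print("review for removal:", to_remove)
print("review for addition:", to_add)
```

Neither list is applied automatically in this sketch; both are queues for a human to confirm or reject, matching the AI-assisted, human-reviewed process described in the text.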
+I then trained a small "light" model on this cleaned dataset, primarily as a vehicle to **expand the tag vocabulary by \~20,000 additional low-frequency tags** that the original tag set under-represented. That expanded vocabulary is what the released model was trained against.
 The released checkpoint is the main training run on the cleaned dataset with the expanded vocabulary.
 A rough empirical sense of the gap, from manual review:
+- A typical image in this dataset arrives with roughly **\~28 tags** from the source.
 - A reasonably-tagged image — judged by what is actually visible, sticking to common in-vocabulary concepts and not reaching for rare tags — should have **50+ tags**, often more.
+- During spot-checks I have routinely taken images that arrived with **\~40 tags up past 60 tags** just by adding common, obviously-present concepts. That is without making any effort to surface rare tags; including those would push the number higher still.
+So the source tag count is on the order of **half** of what a careful tagger would emit on the same image, and the gap is concentrated in concepts that are not subjective — they are simply omissions. The cleaning pass added \~1M missing tags back, but with the gap this large there are many millions still missing across the corpus.
 The training-time consequence is that for every missing-but-present tag, the model receives **no positive gradient at all** for that concept on that image — only an implicit negative through the loss. This systematically biases the model toward under-predicting any tag with a high source-data omission rate, and the effect is uneven across tags: some tag families are well-tagged at the source and some are very sparsely tagged. Practically, this means **low predicted scores are less informative than they look** — a tag scoring below threshold may be genuinely absent, or it may be a concept the model has learned is "usually unlabeled even when present."
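A small numeric sketch of the gradient argument above. It assumes a per-tag sigmoid output trained with binary cross-entropy, which the README does not state explicitly; under that assumption the gradient of the loss with respect to the logit is `sigmoid(logit) - target`. The logit value is arbitrary.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def bce_grad_wrt_logit(logit: float, target: float) -> float:
    """Gradient of sigmoid + binary cross-entropy with respect to the logit."""
    return sigmoid(logit) - target


logit = 1.5  # arbitrary: the model currently leans toward "present"

# Concept present and labeled: gradient is negative, so a descent step raises the logit.
grad_labeled = bce_grad_wrt_logit(logit, target=1.0)

# Concept present but label missing (target treated as 0): gradient is positive,
# so a descent step lowers the logit even though the concept is in the image.
grad_missing = bce_grad_wrt_logit(logit, target=0.0)

print(f"labeled positive: {grad_labeled:+.3f}")
print(f"missing label:    {grad_missing:+.3f}")
```

The more confident the model is that the concept is present, the larger the push in the wrong direction on an unlabeled image, which is one way to read the under-prediction bias described above.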
 ### My estimate of cleaning quality
+The 300k removals and \~1M additions were **AI-assisted and then human-reviewed by me**. My honest estimate is that the corrections themselves are **<5% error**. That is a statement about the *changes I made*, not about the *underlying dataset* — the underlying dataset still contains the structured noise described above, because cleaning was driven by AI-flagged disagreements and the AI shares the same color/length/size confusion as the source data does.
 ---
 ## Image augmentation settings (V1 and V1.1)
+For reproducibility, here are the exact augmentation pipelines used for each checkpoint. V1.1 is a fine-tune of V1, so its augmentation is a *reduced* version of V1's — narrower ranges and lower probabilities at the higher 448×448 resolution. The reductions follow EfficientNetV2 / FixRes guidance for progressive-resolution training, but only partially (\~¼ reduction rather than ½), because Phase 1 stopped at 33/40 epochs and the V1 base was under-converged when V1.1 began.
 | Augmentation | V1 (320×320, from scratch, 40 epochs planned) | V1.1 (448×448, fine-tune of V1, 15 epochs) |
 |---|---|---|
 Notes on a few of these choices:
+- **Saturation is held well below brightness/contrast** in both phases. Saturation is the only color-jitter axis that directly attacks color-named tag identity (`blue_eyes`, `pink_skin`, etc.); brightness and contrast are luminance-driven and largely chroma-safe. The ratio (\~¼ of brightness) is taken from BYOL's asymmetric augmentation.
 - **Rotation is kept on at V1.1**, against the plain FixRes recommendation. The original plan was to disable it at 448 for spatial precision, but with V1 under-converged it was safer to keep a residual rotational-invariance signal. The compromise was a tighter angle band (±5° vs. ±8°) and a lower fire rate (0.30 vs. 0.50).
 - **Gaussian blur is also kept on at V1.1** for the same reason (under-converged base + reduced color/rotation aug → strips too much input variability if blur is dropped entirely). Frequency was halved and the σ ceiling pulled in from 1.5 to 1.0.
 - **No mixup, no cutmix, no RandAugment, no random erasing** in either phase. The recipe is intentionally close to DeiT III's "3-Augment" regime (flip + color jitter + blur) plus a small rotation, not a heavy AugReg/RandAugment stack.
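As an illustration of the rotation settings in the notes above, here is a minimal sketch of how the per-image rotation could be sampled in each phase. The angle bands and fire rates (±8° at 0.50 for V1, ±5° at 0.30 for V1.1) come from the text; the uniform sampling within the band, the `ROTATION` table, and the `sample_rotation` helper are assumptions, since the README does not specify the sampling distribution or the library used.

```python
import random

# (max angle in degrees, probability that rotation fires), per the notes above.
ROTATION = {"V1": (8.0, 0.50), "V1.1": (5.0, 0.30)}


def sample_rotation(rng: random.Random, phase: str) -> float:
    """Return a rotation angle in degrees, or 0.0 when rotation does not fire.

    Uniform sampling inside the band is an assumption of this sketch.
    """
    max_deg, prob = ROTATION[phase]
    if rng.random() >= prob:
        return 0.0
    return rng.uniform(-max_deg, max_deg)


rng = random.Random(0)
angles_v1 = [sample_rotation(rng, "V1") for _ in range(10_000)]
angles_v11 = [sample_rotation(rng, "V1.1") for _ in range(10_000)]

fire_rate_v1 = sum(a != 0.0 for a in angles_v1) / len(angles_v1)
fire_rate_v11 = sum(a != 0.0 for a in angles_v11) / len(angles_v11)
print(f"V1 fire rate ~ {fire_rate_v1:.2f}, V1.1 fire rate ~ {fire_rate_v11:.2f}")
```

The sampled angle would then be passed to whatever image-rotation routine the training pipeline uses; that step is left out because the source does not name it.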
 | Hair length (especially `long_hair`, `very_long_hair`) | **High** | Boundary tags inherently noisy in source |
 | Size-bucket body-part tags | **High** | Continuous quantity discretized into noisy buckets |
 | Neckwear (`bow`, `bowtie`, `ribbon`, `ascot`, `necktie`) | **High** | Visually similar accessories routinely confused at source; representative of a broader small-accessory pattern |
+| Missing tags (concept present, no label) | **Dominant** | The single biggest source of noise in the corpus. Typical \~28 tags/image vs. 50+ that should be present. \~1M added back during cleaning; many millions remain. Hurts performance broadly and biases the model toward under-prediction. |
 | Character-overwhelmed tags | **Medium** | Some tags are learned as proxies for specific characters |
 | Rare / low-frequency tags | **Medium** | The +20k vocabulary expansion helps, but tail tags still see fewer examples |
 | Anything not on the above list | Use with normal caution | The above are illustrative, not exhaustive — many tag families show similar source-data issues |