Grio43 committed on
Commit d41b79c · verified · 1 Parent(s): 6d0d089

Update README.md

Files changed (1)
  1. README.md +12 -12
README.md CHANGED
@@ -11,7 +11,7 @@ license: apache-2.0
11
 
12
  ## TL;DR
13
 
14
- A multi-label anime tagger trained from scratch on a ~5.9M image dataset that received a targeted cleaning and vocabulary-expansion pass before training. The corrections touched roughly **1.3M tags** — large in absolute terms, but only on the order of **~3% of all tags** in the corpus, so this is best described as a *targeted* cleaning rather than a heavy one. The pass was deliberately weighted toward **low-frequency tags**, which is where mislabels and missing labels hurt a tagger the most. On my evaluation set the model achieves the best precision-equals-recall point and a good mAP relative to comparable open tagger checkpoints, but the underlying training data still contains category-level noise that no amount of training would have erased. **All predictions should be human-reviewed before they are trusted.**
15
 
16
  This release ships two checkpoints — **V1** (the from-scratch 320×320 model) and **V1.1** (a 448×448 fine-tune of V1). Pick the checkpoint whose native resolution matches the resolution you intend to feed it (see *Variants* below).
17
 
@@ -35,12 +35,12 @@ Two practical notes:
35
 
36
  I started with a corpus of roughly **5.9 million images** with publicly-sourced tags. Before training anything of my own, I used **SmilingWolf's ViT v3 tagger** to help clean the dataset. With that pipeline I:
37
 
38
- - **Removed ~300k incorrect tags** from images where the public labels disagreed with the AI tagger and a human spot-check confirmed the public labels were wrong.
39
- - **Added ~1,000,000 missing tags** in the same fashion — places where the AI tagger surfaced a label the public tag set had simply omitted, and human review agreed.
40
 
41
- That is ~1.3M corrections in total, which is only on the order of **~3% of the tags in the corpus**. This was a *targeted* pass, not a top-to-bottom relabel. Effort was deliberately concentrated on **low-frequency tags**, on the assumption that mislabels and missing labels do disproportionate damage in the long tail — a missing label on a tag with 800 positives in the entire dataset matters far more than a missing label on a tag with 800k positives.
42
 
43
- I then trained a small "light" model on this cleaned dataset, primarily as a vehicle to **expand the tag vocabulary by ~20,000 additional low-frequency tags** that the original tag set under-represented. That expanded vocabulary is what the released model was trained against.
44
 
45
  The released checkpoint is the main training run on the cleaned dataset with the expanded vocabulary.
46
 
@@ -61,11 +61,11 @@ If you only remember one thing from this section, remember this: **the biggest s
61
 
62
  A rough empirical sense of the gap, from manual review:
63
 
64
- - A typical image in this dataset arrives with roughly **~28 tags** from the source.
65
  - A reasonably-tagged image — judged by what is actually visible, sticking to common in-vocabulary concepts and not reaching for rare tags — should have **50+ tags**, often more.
66
- - During spot-checks I have routinely taken images that arrived with **~40 tags up past 60 tags** just by adding common, obviously-present concepts. That is without making any effort to surface rare tags; including those would push the number higher still.
67
 
68
- So the source tag count is on the order of **half** of what a careful tagger would emit on the same image, and the gap is concentrated in concepts that are not subjective — they are simply omissions. The cleaning pass added ~1M missing tags back, but with the gap this large there are many millions still missing across the corpus.
69
 
70
  The training-time consequence is that for every missing-but-present tag, the model receives **no positive gradient at all** for that concept on that image — only an implicit negative through the loss. This systematically biases the model toward under-predicting any tag with a high source-data omission rate, and the effect is uneven across tags: some tag families are well-tagged at the source and some are very sparsely tagged. Practically, this means **low predicted scores are less informative than they look** — a tag scoring below threshold may be genuinely absent, or it may be a concept the model has learned is "usually unlabeled even when present."
71
 
@@ -102,7 +102,7 @@ For some tags, the data is dominated by a small number of characters. When that
102
 
103
  ### My estimate of cleaning quality
104
 
105
- The 300k removals and ~1M additions were **AI-assisted and then human-reviewed by me**. My honest estimate is that the corrections themselves are **<5% error**. That is a statement about the *changes I made*, not about the *underlying dataset* — the underlying dataset still contains the structured noise described above, because cleaning was driven by AI-flagged disagreements and the AI shares the same color/length/size confusion as the source data does.
106
 
107
  ---
108
 
@@ -148,7 +148,7 @@ I want to be honest about *why* I think it performs well: **it is almost certain
148
 
149
  ## Image augmentation settings (V1 and V1.1)
150
 
151
- For reproducibility, here are the exact augmentation pipelines used for each checkpoint. V1.1 is a fine-tune of V1, so its augmentation is a *reduced* version of V1's — narrower ranges and lower probabilities at the higher 448×448 resolution. The reductions follow EfficientNetV2 / FixRes guidance for progressive-resolution training, but only partially (~¼ reduction rather than ½), because Phase 1 stopped at 33/40 epochs and the V1 base was under-converged when V1.1 began.
152
 
153
  | Augmentation | V1 (320×320, from scratch, 40 epochs planned) | V1.1 (448×448, fine-tune of V1, 15 epochs) |
154
  |---|---|---|
@@ -164,7 +164,7 @@ For reproducibility, here are the exact augmentation pipelines used for each che
164
 
165
  Notes on a few of these choices:
166
 
167
- - **Saturation is held well below brightness/contrast** in both phases. Saturation is the only color-jitter axis that directly attacks color-named tag identity (`blue_eyes`, `pink_skin`, etc.); brightness and contrast are luminance-driven and largely chroma-safe. The ratio (~¼ of brightness) is taken from BYOL's asymmetric augmentation.
168
  - **Rotation is kept on at V1.1**, against the plain FixRes recommendation. The original plan was to disable it at 448 for spatial precision, but with V1 under-converged it was safer to keep a residual rotational-invariance signal. The compromise was a tighter angle band (±5° vs. ±8°) and a lower fire rate (0.30 vs. 0.50).
169
  - **Gaussian blur is also kept on at V1.1** for the same reason (under-converged base + reduced color/rotation aug → strips too much input variability if blur is dropped entirely). Frequency was halved and the σ ceiling pulled in from 1.5 to 1.0.
170
  - **No mixup, no cutmix, no RandAugment, no random erasing** in either phase. The recipe is intentionally close to DeiT III's "3-Augment" regime (flip + color jitter + blur) plus a small rotation, not a heavy AugReg/RandAugment stack.
@@ -179,7 +179,7 @@ Notes on a few of these choices:
179
  | Hair length (especially `long_hair`, `very_long_hair`) | **High** | Boundary tags inherently noisy in source |
180
  | Size-bucket body-part tags | **High** | Continuous quantity discretized into noisy buckets |
181
  | Neckwear (`bow`, `bowtie`, `ribbon`, `ascot`, `necktie`) | **High** | Visually similar accessories routinely confused at source; representative of a broader small-accessory pattern |
182
- | Missing tags (concept present, no label) | **Dominant** | The single biggest source of noise in the corpus. Typical ~28 tags/image vs. 50+ that should be present. ~1M added back during cleaning; many millions remain. Hurts performance broadly and biases the model toward under-prediction. |
183
  | Character-overwhelmed tags | **Medium** | Some tags are learned as proxies for specific characters |
184
  | Rare / low-frequency tags | **Medium** | The +20k vocabulary expansion helps, but tail tags still see fewer examples |
185
  | Anything not on the above list | Use with normal caution | The above are illustrative, not exhaustive — many tag families show similar source-data issues |
 
11
 
12
  ## TL;DR
13
 
14
+ A multi-label anime tagger trained from scratch on a \~5.9M image dataset that received a targeted cleaning and vocabulary-expansion pass before training. The corrections touched roughly **1.3M tags** — large in absolute terms, but only on the order of **\~3% of all tags** in the corpus, so this is best described as a *targeted* cleaning rather than a heavy one. The pass was deliberately weighted toward **low-frequency tags**, which is where mislabels and missing labels hurt a tagger the most. On my evaluation set the model achieves the best precision-equals-recall point and a good mAP relative to comparable open tagger checkpoints, but the underlying training data still contains category-level noise that no amount of training would have erased. **All predictions should be human-reviewed before they are trusted.**
15
 
16
  This release ships two checkpoints — **V1** (the from-scratch 320×320 model) and **V1.1** (a 448×448 fine-tune of V1). Pick the checkpoint whose native resolution matches the resolution you intend to feed it (see *Variants* below).
17
 
 
35
 
36
  I started with a corpus of roughly **5.9 million images** with publicly-sourced tags. Before training anything of my own, I used **SmilingWolf's ViT v3 tagger** to help clean the dataset. With that pipeline I:
37
 
38
+ - **Removed \~300k incorrect tags** from images where the public labels disagreed with the AI tagger and a human spot-check confirmed the public labels were wrong.
39
+ - **Added \~1,000,000 missing tags** in the same fashion — places where the AI tagger surfaced a label the public tag set had simply omitted, and human review agreed.
40
 
41
+ That is \~1.3M corrections in total, which is only on the order of **\~3% of the tags in the corpus**. This was a *targeted* pass, not a top-to-bottom relabel. Effort was deliberately concentrated on **low-frequency tags**, on the assumption that mislabels and missing labels do disproportionate damage in the long tail — a missing label on a tag with 800 positives in the entire dataset matters far more than a missing label on a tag with 800k positives.
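The disagreement-driven pass described above can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the function name `flag_disagreements`, both thresholds, and the score dictionary format are assumptions made for the example. The README only states that tagger/public-label disagreements were surfaced and then confirmed by human review.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical thresholds, not taken from the README.
ADD_THRESHOLD = 0.90     # tagger is confident the tag is present
REMOVE_THRESHOLD = 0.05  # tagger is confident the tag is absent


def flag_disagreements(
    public_tags: Set[str],
    tagger_scores: Dict[str, float],
) -> Tuple[List[str], List[str]]:
    """Return (removal_candidates, addition_candidates) for one image.

    Both lists are only suggestions for a human reviewer; nothing is
    changed automatically.
    """
    # Public tags the tagger knows about but scores as almost certainly absent.
    removal_candidates = sorted(
        tag for tag in public_tags
        if tag in tagger_scores and tagger_scores[tag] <= REMOVE_THRESHOLD
    )
    # Tags the tagger scores as almost certainly present but the source omitted.
    addition_candidates = sorted(
        tag for tag, score in tagger_scores.items()
        if score >= ADD_THRESHOLD and tag not in public_tags
    )
    return removal_candidates, addition_candidates


# Made-up example image: "bowtie" looks wrong, "ribbon" looks missing.
public = {"1girl", "long_hair", "bowtie"}
scores = {"1girl": 0.99, "long_hair": 0.97, "bowtie": 0.02, "ribbon": 0.95}
to_remove, to_add = flag_disagreements(public, scores)
print(to_remove, to_add)
```

Public tags that fall outside the tagger's vocabulary are deliberately never flagged for removal here, since the tagger has no opinion on them.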
42
 
43
+ I then trained a small "light" model on this cleaned dataset, primarily as a vehicle to **expand the tag vocabulary by \~20,000 additional low-frequency tags** that the original tag set under-represented. That expanded vocabulary is what the released model was trained against.
44
 
45
  The released checkpoint is the main training run on the cleaned dataset with the expanded vocabulary.
46
 
 
61
 
62
  A rough empirical sense of the gap, from manual review:
63
 
64
+ - A typical image in this dataset arrives with roughly **\~28 tags** from the source.
65
  - A reasonably-tagged image — judged by what is actually visible, sticking to common in-vocabulary concepts and not reaching for rare tags — should have **50+ tags**, often more.
66
+ - During spot-checks I have routinely taken images that arrived with **\~40 tags up past 60 tags** just by adding common, obviously-present concepts. That is without making any effort to surface rare tags; including those would push the number higher still.
67
 
68
+ So the source tag count is on the order of **half** of what a careful tagger would emit on the same image, and the gap is concentrated in concepts that are not subjective — they are simply omissions. The cleaning pass added \~1M missing tags back, but with the gap this large there are many millions still missing across the corpus.
69
 
70
  The training-time consequence is that for every missing-but-present tag, the model receives **no positive gradient at all** for that concept on that image — only an implicit negative through the loss. This systematically biases the model toward under-predicting any tag with a high source-data omission rate, and the effect is uneven across tags: some tag families are well-tagged at the source and some are very sparsely tagged. Practically, this means **low predicted scores are less informative than they look** — a tag scoring below threshold may be genuinely absent, or it may be a concept the model has learned is "usually unlabeled even when present."
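To make the gradient argument concrete, here is a small sketch. It assumes an independent sigmoid output per tag trained with binary cross-entropy; the README does not state which loss was actually used, so treat that as an assumption. Under it, the gradient of the loss with respect to a tag's logit is `sigmoid(logit) - target`, so a tag that is present but unlabeled (target 0) always pushes the logit down.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def bce_logit_gradient(logit: float, target: float) -> float:
    """d(BCE)/d(logit) for a single tag: sigmoid(logit) - target."""
    return sigmoid(logit) - target


# Assumed example: the model already leans toward "present" for this tag.
logit = 1.0

# Tag is visible AND labeled: negative gradient, so a descent step raises the logit.
grad_if_labeled = bce_logit_gradient(logit, target=1.0)

# Tag is visible but OMITTED at the source: positive gradient, so a descent
# step lowers the logit even though the concept is in the image.
grad_if_omitted = bce_logit_gradient(logit, target=0.0)

print(grad_if_labeled, grad_if_omitted)
```

The more confident the model is that an omitted tag is present, the larger the push in the wrong direction, which is consistent with the under-prediction bias described above.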
71
 
 
102
 
103
  ### My estimate of cleaning quality
104
 
105
+ The 300k removals and \~1M additions were **AI-assisted and then human-reviewed by me**. My honest estimate is that the corrections themselves are **<5% error**. That is a statement about the *changes I made*, not about the *underlying dataset* — the underlying dataset still contains the structured noise described above, because cleaning was driven by AI-flagged disagreements and the AI shares the same color/length/size confusion as the source data does.
106
 
107
  ---
108
 
 
148
 
149
  ## Image augmentation settings (V1 and V1.1)
150
 
151
+ For reproducibility, here are the exact augmentation pipelines used for each checkpoint. V1.1 is a fine-tune of V1, so its augmentation is a *reduced* version of V1's — narrower ranges and lower probabilities at the higher 448×448 resolution. The reductions follow EfficientNetV2 / FixRes guidance for progressive-resolution training, but only partially (\~¼ reduction rather than ½), because Phase 1 stopped at 33/40 epochs and the V1 base was under-converged when V1.1 began.
152
 
153
  | Augmentation | V1 (320×320, from scratch, 40 epochs planned) | V1.1 (448×448, fine-tune of V1, 15 epochs) |
154
  |---|---|---|
 
164
 
165
  Notes on a few of these choices:
166
 
167
+ - **Saturation is held well below brightness/contrast** in both phases. Saturation is the only color-jitter axis that directly attacks color-named tag identity (`blue_eyes`, `pink_skin`, etc.); brightness and contrast are luminance-driven and largely chroma-safe. The ratio (\~¼ of brightness) is taken from BYOL's asymmetric augmentation.
168
  - **Rotation is kept on at V1.1**, against the plain FixRes recommendation. The original plan was to disable it at 448 for spatial precision, but with V1 under-converged it was safer to keep a residual rotational-invariance signal. The compromise was a tighter angle band (±5° vs. ±8°) and a lower fire rate (0.30 vs. 0.50).
169
  - **Gaussian blur is also kept on at V1.1** for the same reason (under-converged base + reduced color/rotation aug → strips too much input variability if blur is dropped entirely). Frequency was halved and the σ ceiling pulled in from 1.5 to 1.0.
170
  - **No mixup, no cutmix, no RandAugment, no random erasing** in either phase. The recipe is intentionally close to DeiT III's "3-Augment" regime (flip + color jitter + blur) plus a small rotation, not a heavy AugReg/RandAugment stack.
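As an illustration of the rotation settings mentioned in the notes above (±8° at a 0.50 fire rate for V1, ±5° at 0.30 for V1.1), the sampling step might look like the sketch below. The `RotationAug` class and the uniform angle distribution are assumptions made for the example; the notes give the ranges and probabilities but not how angles are drawn or which library applies them.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RotationAug:
    max_degrees: float   # half-width of the angle band
    probability: float   # chance that the augmentation fires at all

    def sample(self, rng: random.Random) -> Optional[float]:
        """Return a rotation angle in degrees, or None when the aug does not fire."""
        if rng.random() >= self.probability:
            return None
        # Assumption: angles are drawn uniformly from the band.
        return rng.uniform(-self.max_degrees, self.max_degrees)


# Values taken from the notes above.
ROTATION = {
    "V1": RotationAug(max_degrees=8.0, probability=0.50),
    "V1.1": RotationAug(max_degrees=5.0, probability=0.30),
}

rng = random.Random(0)
angles = [ROTATION["V1.1"].sample(rng) for _ in range(10_000)]
fired = [a for a in angles if a is not None]
print(len(fired) / len(angles), max(abs(a) for a in fired))
```

Keeping the band and the fire rate as two separate knobs mirrors how the notes describe the V1.1 compromise: both were tightened rather than rotation being switched off.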
 
179
  | Hair length (especially `long_hair`, `very_long_hair`) | **High** | Boundary tags inherently noisy in source |
180
  | Size-bucket body-part tags | **High** | Continuous quantity discretized into noisy buckets |
181
  | Neckwear (`bow`, `bowtie`, `ribbon`, `ascot`, `necktie`) | **High** | Visually similar accessories routinely confused at source; representative of a broader small-accessory pattern |
182
+ | Missing tags (concept present, no label) | **Dominant** | The single biggest source of noise in the corpus. Typical \~28 tags/image vs. 50+ that should be present. \~1M added back during cleaning; many millions remain. Hurts performance broadly and biases the model toward under-prediction. |
183
  | Character-overwhelmed tags | **Medium** | Some tags are learned as proxies for specific characters |
184
  | Rare / low-frequency tags | **Medium** | The +20k vocabulary expansion helps, but tail tags still see fewer examples |
185
  | Anything not on the above list | Use with normal caution | The above are illustrative, not exhaustive — many tag families show similar source-data issues |