Update README.md

Files changed (1)

README.md +9 -4

@@ -13,7 +13,7 @@ license: apache-2.0
 A multi-label anime tagger trained from scratch on a \~5.9M image dataset that received a targeted cleaning and vocabulary-expansion pass before training. The corrections touched roughly **1.3M tags** — large in absolute terms, but only on the order of **\~3% of all tags** in the corpus, so this is best described as a *targeted* cleaning rather than a heavy one. The pass was deliberately weighted toward **low-frequency tags**, which is where mislabels and missing labels hurt a tagger the most. On my evaluation set the model achieves the best precision-equals-recall point and a good mAP relative to comparable open tagger checkpoints, but the underlying training data still contains category-level noise that no amount of training would have erased. **All predictions should be human-reviewed before they are trusted.**
-This release ships two checkpoints — **V1** (the from-scratch 320×320 model) and **V1.1** (a 448×448 fine-tune of V1). Pick the checkpoint whose native resolution matches the resolution you intend to feed it (see *Variants* below).
 ---
@@ -21,13 +21,14 @@ This release ships two checkpoints — **V1** (the from-scratch 320×320 model)
 | Checkpoint | Native resolution | How it was produced | When to use |
 |---|---|---|---|
-| **V1** | 320×320 | Trained **from scratch** at 320×320. This is the model's native resolution and the one it performs best at. | Default choice. Use when you are running inference at 320×320, when throughput matters, or when you want the checkpoint that has seen the most training. |
-| **V1.1** | 448×448 | A **fine-tune of V1** at 448×448. Position embeddings were interpolated from the 20×20 grid to 28×28, optimizer state was reset, and training continued at the new resolution following the FixRes / DeiT III progressive-resolution recipe. | Use when you specifically want 448×448 inference for finer spatial detail (small accessories, eye details). It will not magically be better than V1 across the board — it is a *resolution* upgrade, not a *model quality* upgrade. |
 Two practical notes:
-- **Match input resolution to the checkpoint.** Feeding 448×448 images to V1, or 320×320 images to V1.1, will give worse results than matching them. The position-embedding grid is fixed at load time.
 - **V1 is not deprecated by V1.1.** They are siblings with different operating points, not generations of the same model.
 ---
@@ -142,6 +143,10 @@ On my evaluation set this model achieves:
 Note the inversion: rare/mid tags out-score head/very-common tags on mAP. This is consistent with the missing-tag bias described above — high-frequency concepts are the ones most often present-but-unlabeled in the source data, which depresses their measured precision against a noisy reference.
 I want to be honest about *why* I think it performs well: **it is almost certainly not because of a special training regimen.** The training recipe is grounded in standard ViT-from-scratch literature (DeiT / DeiT III / FixRes / ASL / AugReg) without exotic tricks. The most likely explanation is simply that the **input dataset is cleaner** than what most comparable taggers were trained on. If you are trying to reproduce or beat this result, I would put your effort into data curation before you put it into training-recipe tuning.
 ---

 A multi-label anime tagger trained from scratch on a \~5.9M image dataset that received a targeted cleaning and vocabulary-expansion pass before training. The corrections touched roughly **1.3M tags** — large in absolute terms, but only on the order of **\~3% of all tags** in the corpus, so this is best described as a *targeted* cleaning rather than a heavy one. The pass was deliberately weighted toward **low-frequency tags**, which is where mislabels and missing labels hurt a tagger the most. On my evaluation set the model achieves the best precision-equals-recall point and a good mAP relative to comparable open tagger checkpoints, but the underlying training data still contains category-level noise that no amount of training would have erased. **All predictions should be human-reviewed before they are trusted.**
+**V1** (the from-scratch 320×320 model) is shipping now. **V1.1** (a 448×448 fine-tune of V1) is **planned for release the week of 2026-05-15** and is expected to outperform V1 on this evaluation set; final numbers will be filled in once training and eval are complete. Pick the checkpoint whose native resolution matches the resolution you intend to feed it (see *Variants* below).
 ---
 | Checkpoint | Native resolution | How it was produced | When to use |
 |---|---|---|---|
+| **V1** *(available now)* | 320×320 | Trained **from scratch** at 320×320. This is the model's native resolution. | Use this today. Also the right pick if you are running inference at 320×320 or if throughput matters. |
+| **V1.1** *(coming \~2026-05-15)* | 448×448 | A **fine-tune of V1** at 448×448. Position embeddings were interpolated from the 20×20 grid to 28×28, optimizer state was reset, and training continued at the new resolution following the FixRes / DeiT III progressive-resolution recipe. | Use when you specifically want 448×448 inference for finer spatial detail (small accessories, eye details). I expect V1.1 to outperform V1 on the same eval set, but final numbers will only be published after V1.1 training completes. Until then, treat V1 as the reference checkpoint. |
 Two practical notes:
+- **Match input resolution to the checkpoint.** Feeding 448×448 images to V1, or 320×320 images to V1.1 once it ships, will give worse results than matching them. The position-embedding grid is fixed at load time.
 - **V1 is not deprecated by V1.1.** They are siblings with different operating points, not generations of the same model.
+- **Until V1.1 ships, V1 is the only released checkpoint.** Numbers and recommendations in this card refer to V1 unless explicitly labeled V1.1.
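
To make the grid arithmetic in the table concrete, here is a minimal sketch of how a 20×20 position grid maps to 28×28. It is not the training code. The patch size of 16 is inferred from 320 / 20 and 448 / 28 and is not stated in this card. The card also does not say which interpolation mode was used, so bilinear resampling on a single scalar channel is used here purely for brevity; `grid_side` and `resize_grid_bilinear` are hypothetical helper names.

```python
# Sketch only: patch size 16 is inferred from 320 / 20 = 448 / 28 = 16.
PATCH_SIZE = 16


def grid_side(resolution: int, patch_size: int = PATCH_SIZE) -> int:
    """Number of patches along one side of a square input."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    return resolution // patch_size


def resize_grid_bilinear(grid, new_side):
    """Bilinearly resample a square 2-D grid (list of lists of floats).

    Stands in for one channel of a position-embedding grid. Corner values
    are preserved (the corners of the old and new grids line up).
    """
    old_side = len(grid)
    if old_side < 2 or new_side < 2:
        raise ValueError("both grids need at least 2 points per side")
    scale = (old_side - 1) / (new_side - 1)
    out = []
    for i in range(new_side):
        y = i * scale
        y0 = min(int(y), old_side - 2)  # clamp so y0 + 1 stays in range
        wy = y - y0
        row = []
        for j in range(new_side):
            x = j * scale
            x0 = min(int(x), old_side - 2)
            wx = x - x0
            top = grid[y0][x0] * (1 - wx) + grid[y0][x0 + 1] * wx
            bottom = grid[y0 + 1][x0] * (1 - wx) + grid[y0 + 1][x0 + 1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


old = grid_side(320)  # 20 patches per side for V1
new = grid_side(448)  # 28 patches per side for V1.1

# Toy single-channel "embedding": value = row index + column index.
toy = [[float(y + x) for x in range(old)] for y in range(old)]
resized = resize_grid_bilinear(toy, new)
print(old, new, len(resized), len(resized[0]))
```

The resampled grid only gives the fine-tune a starting point. The weights are then trained at the new resolution, which is why each checkpoint should be fed its own native input size.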
 ---
 Note the inversion: rare/mid tags out-score head/very-common tags on mAP. This is consistent with the missing-tag bias described above — high-frequency concepts are the ones most often present-but-unlabeled in the source data, which depresses their measured precision against a noisy reference.
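
A toy calculation may help show how present-but-unlabeled tags depress measured precision. All counts below are hypothetical and chosen only for illustration; they are not taken from this model's evaluation.

```python
def precision(predicted: set, reference: set) -> float:
    """Fraction of predicted items that also appear in the reference set."""
    if not predicted:
        return 0.0
    return len(predicted & reference) / len(predicted)


# Hypothetical numbers for a single high-frequency tag.
truly_present = set(range(100))  # images that really show the concept
labeled = set(range(60))         # only 60 of them carry the tag in the reference
predicted = set(range(90))       # model predicts the tag on 90 images, all correct

true_precision = precision(predicted, truly_present)  # against the real world
measured_precision = precision(predicted, labeled)    # against the noisy reference

print(f"true precision:     {true_precision:.3f}")
print(f"measured precision: {measured_precision:.3f}")
```

In this toy case every prediction is correct, yet a third of them count as false positives because the reference is missing the label. The more often a tag is left unlabeled in the source data, the larger this gap becomes.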
+### V1.1 headline numbers (pending)
+V1.1 has not finished training yet. Final Macro F1 / Micro F1 / mAP and the per-bucket mAP breakdown will be added here once V1.1 is released (planned for the week of 2026-05-15). I expect V1.1 to exceed V1 on this eval set at its native 448×448 input resolution, but this card claims no measured V1.1 numbers until they exist.
 I want to be honest about *why* I think it performs well: **it is almost certainly not because of a special training regimen.** The training recipe is grounded in standard ViT-from-scratch literature (DeiT / DeiT III / FixRes / ASL / AugReg) without exotic tricks. The most likely explanation is simply that the **input dataset is cleaner** than what most comparable taggers were trained on. If you are trying to reproduce or beat this result, I would put your effort into data curation before you put it into training-recipe tuning.
 ---