Update README.md

README.md +65 -4

@@ -52,6 +52,20 @@ Two practical notes:
 ---
 ## How this model came to be
 I started with a corpus of roughly **5.9 million images** with publicly-sourced tags. Before training anything of my own, I used **SmilingWolf's ViT v3 tagger** to help clean the dataset. With that pipeline I:
@@ -132,7 +146,7 @@ The 300k removals and \~1M additions were **AI-assisted and then human-reviewed
 - **Human review every output.** This applies most strongly to color, hair length, and size-bucket tags. The model is a fast first pass, not an authoritative labeler.
 - **Treat sibling tags as a group, not a hard pick.** If the model emits `blue_eyes` with high confidence, also check the `purple_eyes` / `aqua_eyes` / `black_eyes` scores before you commit.
 - **Do not use the raw output as ground-truth for downstream training** without manual review. The very confusion patterns that this model can't resolve will get baked into your downstream model.
-- **For thresholding, prefer per-tag thresholds over a single global threshold.** Different tag families have very different precision/recall behavior on this dataset.
 ---
@@ -143,6 +157,14 @@ On my evaluation set this model achieves:
 - The best **precision-equals-recall** point I have measured among comparable open anime taggers.
 - A solid **mAP** relative to the same comparison set.
 ### V1 headline numbers (e27/40, Phase 1, 320×320, 19,292 tags)
 | Metric | Value |
@@ -167,9 +189,12 @@ Note the inversion: rare/mid tags out-score head/very-common tags on mAP. This i
 | Metric | Value |
 |---|---|
 | Overall val/mAP | 0.674 |
-F1 / P=R numbers are intentionally not reported alongside this row — see the *Why F1 numbers are not reported for V1.1* paragraph below for the calibration reason.
 **mAP broken out by tag frequency bucket — V1 vs. V1.1 on the same eval set:**
@@ -183,9 +208,45 @@ F1 / P=R numbers are intentionally not reported alongside this row — see the *
 The same rare-vs-head inversion noted for V1 (rare/mid > head/very-common on mAP) is still present in V1.1, and for the same reason — high-frequency tags are the ones most often present-but-unlabeled in the source data, which depresses their measured precision against a noisy reference.
-**Why V1.1 stopped at 6 of 15 planned epochs.** Per-epoch mAP growth decelerated from ~+0.7%/epoch in early Phase 2 to ~+0.3%/epoch by epoch 5, while validation loss continued to fall and per-tag calibration shifted (mean activations per image dropped from ~4500 at epoch 0 to ~4200 at epoch 5, but the auto-stop F1 metric is calibration-floored at a fixed threshold of 0.2653 and therefore unreliable as a stop signal — see [TRAINING_HEALTH_TRACKER.md](../TRAINING_HEALTH_TRACKER.md)). At that growth rate, the remaining 9 epochs would have been operating in the regime where it is no longer cleanly distinguishable whether mAP gains are *real ranking improvement* or *memorization of the labeled subset of a noisy multi-label corpus* (the missing-positive bias documented earlier in this card sets a soft ceiling somewhere in this neighbourhood). Continuing was unlikely to buy enough real gain to justify the extra training time, so V1.1 ships at the epoch-5 / step-81822 checkpoint.
-**Why F1 numbers are not reported for V1.1.** V1.1's loss configuration (`gamma_neg=7.0`, `clip=0.2`) shifts the logit distribution relative to V1's (`gamma_neg=4.0`, `clip=0.05`). The in-training F1 metric uses a fixed threshold (0.2653) calibrated against V1's distribution, so V1.1's in-training F1 values are calibration-floored and not comparable to V1's reported F1. Reporting them alongside V1's would invite the wrong comparison. mAP, on the other hand, is threshold-independent and the V1 vs. V1.1 mAP comparison above is apples-to-apples — that is the comparison this card stands behind.
 I want to be honest about *why* I think it performs well: **it is almost certainly not because of a special training regimen.** The training recipe is grounded in standard ViT-from-scratch literature (DeiT / DeiT III / FixRes / ASL / AugReg) without exotic tricks. The most likely explanation is simply that the **input dataset is cleaner** than what most comparable taggers were trained on. If you are trying to reproduce or beat this result, I would put your effort into data curation before you put it into training-recipe tuning.

 ---
+## Architecture
+A Vision Transformer (ViT) trained from scratch. Spec (from `V1.1_safetensors/config.json`; V1 shares the same backbone at a smaller patch grid):
+- 18 layers, hidden size 1024, 16 attention heads, FFN dim 4096, patch size 16
+- Regularization: drop-path 0.2, attention dropout 0.05, hidden dropout 0.1
+- Output head: 19,294 classes (all general-category tags; see *Vocabulary* in the comparison section below)
+- Patch grid: V1 = 20×20 (320×320 input); V1.1 = 28×28 (448×448 input, position embeddings interpolated from V1's 20×20 grid)
+- About **247M parameters** total (estimated from the spec; the exact count can be obtained by summing tensor sizes in `model.safetensors`)
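The patch-grid bullet above says V1.1's 28×28 position embeddings were interpolated from V1's 20×20 grid. The following is a minimal, dependency-free sketch of that idea, not the code used for this model. It assumes bilinear interpolation with half-pixel centres and assumes the position embeddings cover patch tokens only; the card does not state the interpolation mode or how any class token is handled. The names `resize_bilinear` and `interpolate_pos_embed` are hypothetical.

```python
def resize_bilinear(grid, new_h, new_w):
    """Bilinearly resize a 2-D list of floats (half-pixel-centre convention)."""
    old_h, old_w = len(grid), len(grid[0])

    def source_coord(i, old, new):
        # Map output index i to a fractional input coordinate, clamped to the grid.
        x = (i + 0.5) * old / new - 0.5
        x = min(max(x, 0.0), old - 1.0)
        lo = int(x)
        hi = min(lo + 1, old - 1)
        return lo, hi, x - lo

    out = []
    for i in range(new_h):
        y0, y1, wy = source_coord(i, old_h, new_h)
        row = []
        for j in range(new_w):
            x0, x1, wx = source_coord(j, old_w, new_w)
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def interpolate_pos_embed(pos, old_side, new_side):
    """pos: old_side*old_side embedding vectors in row-major order.

    Returns new_side*new_side vectors, resizing each channel independently.
    ASSUMPTION: patch-token embeddings only (no class token in `pos`).
    """
    dim = len(pos[0])
    channels = []
    for c in range(dim):
        grid = [[pos[r * old_side + k][c] for k in range(old_side)]
                for r in range(old_side)]
        channels.append(resize_bilinear(grid, new_side, new_side))
    return [[channels[c][r][k] for c in range(dim)]
            for r in range(new_side) for k in range(new_side)]


# Toy 2-channel embedding whose channels encode (row, column) of each patch.
pos_v1 = [[float(r), float(c)] for r in range(20) for c in range(20)]
pos_v1_1 = interpolate_pos_embed(pos_v1, 20, 28)
print(len(pos_v1_1), len(pos_v1_1[0]))
```

In practice a tensor library's resize routine (often bicubic) would replace the hand-written loop; the pure-Python version only shows what "interpolated" means for a learned position grid.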
+For context, the comparison set in *Performance notes / Comparison vs other open anime taggers* is **not at parameter parity**: WD swinv2-base v3 is on the order of 99M params and Camie tagger v2 is on the order of 143M params. OppaiOracle is the largest model in that set, not the smallest, so its parameter count should be kept in mind when reading the headline gap below.
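As a sanity check on the parameter estimate, the spec above can be turned into a back-of-envelope count. This sketch assumes a standard ViT layout that the card does not spell out: biases on every linear layer, two LayerNorms per block plus a final one, a learned position embedding, a class token, and 3 input channels. `estimate_vit_params` is a hypothetical helper; summing tensor sizes in `model.safetensors` remains the authoritative count.

```python
def estimate_vit_params(layers, hidden, ffn, patch, num_classes, grid_side,
                        channels=3, cls_token=True):
    """Rough parameter count for a plain ViT under the assumptions stated above."""
    attention = 4 * (hidden * hidden + hidden)           # Q, K, V, output projections + biases
    mlp = (hidden * ffn + ffn) + (ffn * hidden + hidden)  # two linear layers + biases
    norms = 2 * (2 * hidden)                             # two LayerNorms per block
    block = attention + mlp + norms

    patch_embed = patch * patch * channels * hidden + hidden
    tokens = grid_side * grid_side + (1 if cls_token else 0)
    pos_embed = tokens * hidden
    cls = hidden if cls_token else 0
    final_norm = 2 * hidden
    head = hidden * num_classes + num_classes

    return layers * block + patch_embed + pos_embed + cls + final_norm + head


v1 = estimate_vit_params(18, 1024, 4096, 16, 19294, grid_side=20)
v1_1 = estimate_vit_params(18, 1024, 4096, 16, 19294, grid_side=28)
print(f"V1 ~{v1 / 1e6:.1f}M, V1.1 ~{v1_1 / 1e6:.1f}M")
```

Under these assumptions the estimate lands in the same neighbourhood as the figure quoted above, and the only difference between V1 and V1.1 is the larger position-embedding table.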
+---
 ## How this model came to be
 I started with a corpus of roughly **5.9 million images** with publicly-sourced tags. Before training anything of my own, I used **SmilingWolf's ViT v3 tagger** to help clean the dataset. With that pipeline I:
 - **Human review every output.** This applies most strongly to color, hair length, and size-bucket tags. The model is a fast first pass, not an authoritative labeler.
 - **Treat sibling tags as a group, not a hard pick.** If the model emits `blue_eyes` with high confidence, also check the `purple_eyes` / `aqua_eyes` / `black_eyes` scores before you commit.
 - **Do not use the raw output as ground-truth for downstream training** without manual review. The very confusion patterns that this model can't resolve will get baked into your downstream model.
+- **For thresholding, prefer per-tag thresholds over a single global threshold.** Different tag families have very different precision/recall behavior on this dataset. Each variant directory ships `pr_thresholds.json` containing per-tag P=R thresholds for tags with support≥5 in the held-out split — this covers **19,290 of the 19,292 evaluated tags** (essentially every non-`<PAD>`/`<UNK>` tag has ≥5 positives in 296,056 samples) for both V1 and V1.1.
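The thresholding and sibling-tag advice above can be sketched as follows. This is an illustration only: the card does not describe the schema of `pr_thresholds.json`, so the sketch assumes the per-tag thresholds have already been loaded into a plain `{tag: threshold}` dict. `select_tags`, `sibling_scores`, and all numeric values are hypothetical.

```python
def select_tags(scores, per_tag_thresholds, fallback_threshold):
    """Keep tags whose score clears their own threshold (or the fallback if none exists)."""
    return [
        tag for tag, score in scores.items()
        if score >= per_tag_thresholds.get(tag, fallback_threshold)
    ]


def sibling_scores(scores, siblings):
    """Return (tag, score) pairs for a sibling group, highest score first.

    Tags missing from `scores` are reported with 0.0 so the whole group stays visible.
    """
    pairs = [(tag, scores.get(tag, 0.0)) for tag in siblings]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical model output and thresholds for one image.
scores = {"blue_eyes": 0.82, "purple_eyes": 0.61, "aqua_eyes": 0.12, "long_hair": 0.40}
per_tag = {"blue_eyes": 0.75, "purple_eyes": 0.70, "aqua_eyes": 0.55}

kept = select_tags(scores, per_tag, fallback_threshold=0.35)
eye_group = sibling_scores(scores, ["blue_eyes", "purple_eyes", "aqua_eyes", "black_eyes"])
print(kept)
print(eye_group)
```

Here `purple_eyes` falls below its own threshold but still scores well within the eye-color group, which is the kind of case the human-review bullet is meant to catch.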
 ---
 - The best **precision-equals-recall** point I have measured among comparable open anime taggers.
 - A solid **mAP** relative to the same comparison set.
+### Evaluation methodology
+So that the headline numbers are interpretable:
+- **Eval set.** Both V1 and V1.1 are evaluated on the **same 296,056-image held-out split**, drawn from the cleaned-and-expanded corpus described above. V1 was evaluated at epoch 27 / step 170,799; V1.1 at epoch 7 / step 85,517 (the full-val recompute completed 2026-05-09).
+- **Threshold sweep.** F1 / P=R operating points are obtained from a post-training sweep of the decision threshold over **[0.001, 0.999] in steps of 0.001** (999 points), independently for each model. Both the per-tag and the single-threshold operating points come from this sweep. Indices `[0, 1]` (`<PAD>` and `<UNK>`) are excluded from all metrics.
+- **Source of truth.** Numbers in this section are pulled from each variant's `pr_thresholds.json`. Both files are at parity on the full-val split. (The copy of `pr_thresholds.json` shipped inside `V1.1_safetensors/` may temporarily be a pre-recompute snapshot at `val_samples: 30000`; the authoritative full-val V1.1 numbers — same checkpoint — live in [experiments/run1_vit/checkpoints/pr_threshold_last.json](../experiments/run1_vit/checkpoints/pr_threshold_last.json) and will be synced into the release directory before push.)
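The threshold sweep described above can be sketched for the micro P=R point as follows. This is a minimal illustration on toy data, not the evaluation script behind `pr_thresholds.json`. It assumes scores and labels arrive as per-image lists with `<PAD>` and `<UNK>` in the first two columns; `micro_pr_breakeven` is a hypothetical name.

```python
def micro_pr_breakeven(scores, labels, skip=2, steps=999):
    """Sweep thresholds 0.001 .. 0.999 and return the point where |P - R| is smallest.

    `skip` drops the leading <PAD>/<UNK> columns from every metric.
    """
    best = None
    for k in range(1, steps + 1):
        threshold = k / (steps + 1)
        tp = fp = fn = 0
        for score_row, label_row in zip(scores, labels):
            for score, label in zip(score_row[skip:], label_row[skip:]):
                predicted = score >= threshold
                if predicted and label:
                    tp += 1
                elif predicted:
                    fp += 1
                elif label:
                    fn += 1
        if tp == 0:
            continue  # no true positives at this threshold; not a useful operating point
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        gap = abs(precision - recall)
        if best is None or gap < best["gap"]:
            best = {
                "threshold": threshold,
                "precision": precision,
                "recall": recall,
                "f1": 2 * precision * recall / (precision + recall),
                "gap": gap,
            }
    return best


# Toy data: 3 images, columns 0-1 stand in for <PAD>/<UNK>, columns 2-3 are real tags.
toy_scores = [[0.9, 0.9, 0.8, 0.3], [0.9, 0.9, 0.6, 0.7], [0.9, 0.9, 0.2, 0.4]]
toy_labels = [[1, 1, 1, 0], [1, 1, 0, 1], [1, 1, 1, 0]]
result = micro_pr_breakeven(toy_scores, toy_labels)
print(result)
```

At full scale the same sweep would be vectorized rather than looped, but the selection rule (smallest gap between precision and recall on the 999-point grid) is the part this sketch is meant to show.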
 ### V1 headline numbers (e27/40, Phase 1, 320×320, 19,292 tags)
 | Metric | Value |
 | Metric | Value |
 |---|---|
+| Macro F1 (P=R) | 0.646 |
+| Micro F1 (P=R) | 0.699 |
+| P=R threshold (macro / micro) | 0.753 / 0.793 |
 | Overall val/mAP | 0.674 |
+Macro F1 is the mean of per-tag F1 at a single global threshold (the WD14-comparable convention). The macro number is essentially identical at any support cutoff (0 / 1 / 5) on this val split, since 19,290 of 19,292 non-PAD/UNK tags have ≥5 positives in 296,056 samples, so there are no structural-zero outliers depressing the mean. With per-tag-tuned operating points (each tag at its own break-even threshold for P=R, or its own F1-maximizing threshold for F1-opt), the mean per-tag P=R is 0.648 and the mean per-tag F1-opt is 0.675. See the *A note on the F1 numbers* paragraph below for what "F1 at P=R" means here vs. the in-training F1 metric you may have seen in earlier drafts.
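To make the macro convention concrete, here is a small sketch of "mean of per-tag F1 at a single global threshold" with an optional support cutoff. It is an illustration on toy data under the same assumed layout as the rest of this card's description (`<PAD>`/`<UNK>` in the first two columns); `macro_f1_at` is a hypothetical helper, not the evaluation code.

```python
def macro_f1_at(scores, labels, threshold, min_support=0, skip=2):
    """Mean of per-tag F1 at one global threshold.

    Tags with fewer than `min_support` positives are left out of the mean;
    the first `skip` columns (<PAD>/<UNK>) are always excluded.
    """
    num_tags = len(scores[0])
    per_tag_f1 = []
    for j in range(skip, num_tags):
        tp = fp = fn = 0
        for score_row, label_row in zip(scores, labels):
            predicted = score_row[j] >= threshold
            if predicted and label_row[j]:
                tp += 1
            elif predicted:
                fp += 1
            elif label_row[j]:
                fn += 1
        if tp + fn < min_support:
            continue
        denominator = 2 * tp + fp + fn
        per_tag_f1.append(2 * tp / denominator if denominator else 0.0)
    return sum(per_tag_f1) / len(per_tag_f1)


# Toy data: 3 images, columns 0-1 stand in for <PAD>/<UNK>, columns 2-3 are real tags.
demo_scores = [[0.9, 0.9, 0.8, 0.3], [0.9, 0.9, 0.6, 0.7], [0.9, 0.9, 0.2, 0.4]]
demo_labels = [[1, 1, 1, 0], [1, 1, 0, 1], [1, 1, 1, 0]]
macro = macro_f1_at(demo_scores, demo_labels, threshold=0.5)
print(macro)
```

On the toy data the two tags score F1 of 0.5 and 1.0 at threshold 0.5, so the macro mean weights them equally regardless of how many positives each has. That equal weighting is why the support cutoff matters in general and why it does not matter on this val split.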
 **mAP broken out by tag frequency bucket — V1 vs. V1.1 on the same eval set:**
 The same rare-vs-head inversion noted for V1 (rare/mid > head/very-common on mAP) is still present in V1.1, and for the same reason — high-frequency tags are the ones most often present-but-unlabeled in the source data, which depresses their measured precision against a noisy reference.
+### Comparison vs other open anime taggers
+The TL;DR claim "best P=R I've measured" deserves the underlying numbers. The comparison below is at each model's own break-even threshold (the same convention WD v3 publishes its headline numbers under). All OppaiOracle numbers are pulled from the `pr_thresholds.json` referenced in *Evaluation methodology* above; competitor numbers are quoted from each model's published model card.
+**Macro-F1 at P=R (each model evaluated against its own training distribution / val split):**
+| Model | Macro-F1 (P=R) | Notes |
+|---|---|---|
+| **OppaiOracle V1.1** | **0.646** | `macro_single_threshold.support_ge_1.pr_breakeven.f1`. On the 296K val split, support≥0 / ≥1 / ≥5 collapse to essentially the same number (0.6460 in all three) because 19,290 of 19,292 non-`<PAD>`/`<UNK>` tags have ≥5 positives. |
+| OppaiOracle V1 | 0.588 | `V1_safetensors/pr_thresholds.json`, support≥0 |
+| camie-tagger-v2 | 0.506 | "Macro-OPT" at threshold 0.492, from the Camie model card |
+| wd-eva02-large-tagger-v3 | 0.4772 | model card |
+| wd-vit-large-tagger-v3 | 0.4674 | model card |
+| wd-swinv2-tagger-v3 | 0.4541 | model card |
+**Micro-F1 at P=R:**
+| Model | Micro-F1 (P=R) | Notes |
+|---|---|---|
+| **OppaiOracle V1.1** | **0.699** | `micro.pr_breakeven.f1` |
+| camie-tagger-v2 | 0.673 | "Micro-OPT" at threshold 0.614. Note this is a **different threshold** from the macro-headline operating point (0.492), so Camie's macro and micro numbers cannot both be obtained from a single deployed threshold |
+| OppaiOracle V1 | 0.659 | `V1_safetensors/pr_thresholds.json`, `micro.pr_breakeven` |
+| WD v3 | not reported | — |
+**Apples-to-apples vocabulary.** Comparing F1 numbers across these models is fair only once the vocabularies are described, because "70K tags" and "19K tags" are not the same target. Camie's headline 70K is dominated by named-entity (character / copyright / artist) tags, while OppaiOracle's vocabulary is general-only:
+| Model | General tags | Total vocab |
+|---|---|---|
+| OppaiOracle V1 / V1.1 | **19,294** | 19,294 (100% general) |
+| camie-tagger-v2 | 30,841 | 70,527 |
+| wd-vit-large-tagger-v3 | 8,106 | 10,861 (8,106 cat-0 general + 2,751 cat-4 character + 4 cat-9) |
+So on the general-tag axis, OppaiOracle's vocabulary is roughly **2.4× WD's** and roughly **0.6× Camie's general slice**, while still beating both on macro-F1. The named-entity tags Camie's total includes are a different problem domain (recognizing specific characters / copyrights / artists) and are not what this model is measured on.
+**Why this comparison is fair on the metric, but not at parameter parity.** Macro-F1 at the model's own P=R threshold is a calibration-agnostic operating-point comparison — every model is being scored at *its own* best threshold, so loss-function calibration differences don't bias the ranking. What it isn't is a parameter-parity comparison: as noted in *Architecture*, OppaiOracle is the largest model in this set (~247M vs. WD swinv2-base ~99M and Camie ~143M). The gap is real on the metric used; the parameter-count caveat belongs in any deeper analysis.
+**Why V1.1 stopped at 6 of 15 planned epochs.** This was a deliberate noise-robust stopping decision, not a regret. Per-epoch mAP growth decelerated from ~+0.7%/epoch in early Phase 2 to ~+0.3%/epoch by epoch 5, while validation loss continued to fall and per-tag calibration shifted (mean activations per image dropped from ~4500 at epoch 0 to ~4200 at epoch 5; the auto-stop F1 metric is calibration-floored at a fixed threshold of 0.2653 and therefore unreliable as a stop signal, see [TRAINING_HEALTH_TRACKER.md](../TRAINING_HEALTH_TRACKER.md)). The deceleration coincides with a known phase transition in the weakly-supervised multi-label / asymmetric-loss literature: V1.1's loss configuration (`gamma_neg=7.0` with reduced regularization) is precisely the regime most exposed to **missing-positive memorization**, in which the model begins learning that *labeled-but-noisy = positive* and *unlabeled-but-actually-present = negative*. Validation has the same missing-positive structure as training, so a model that has crossed into this regime will *raise* noisy-reference mAP even after true ranking quality has plateaued. The remaining 9 epochs would have been operating in the regime where it is no longer cleanly distinguishable whether mAP gains are *real ranking improvement* or *memorization of the labeled subset of a noisy multi-label corpus* (the missing-positive bias documented earlier in this card sets a soft ceiling somewhere in this neighbourhood). Continuing was unlikely to buy enough real gain to justify the extra training time, so V1.1 ships at the epoch-5 / step-81822 checkpoint. **Implication for the comparison numbers above:** competitors that train closer to convergence almost certainly trained past this same phase transition (WD v3 trains for 50+ epochs; Camie's training duration is not disclosed). The headline gap is therefore between OppaiOracle's *pre-memorization* checkpoint and competitor checkpoints that have very likely crossed into it, meaning the gap on cleanly-tagged data is real, and probably understated rather than overstated.
+**A note on the F1 numbers.** V1.1's loss configuration (`gamma_neg=7.0`, `clip=0.2`) shifts the logit distribution relative to V1's (`gamma_neg=4.0`, `clip=0.05`), so the **in-training** F1 metric — which uses a fixed threshold (0.2653) calibrated against V1's distribution — is calibration-floored for V1.1 and unreliable both as a stop signal during training and as a comparison number against V1. Earlier drafts of this card therefore declined to report any F1 number for V1.1. The F1 / P=R numbers in the headline table above are **not** the in-training metric; they are the operating point from a post-training threshold sweep over [0.001, 0.999] (step 0.001) on the same 296,056-image held-out split used for V1, with `<PAD>` and `<UNK>` excluded. The sweep finds each model's own break-even / argmax-F1 threshold from scratch, which is calibration-agnostic — so the V1 → V1.1 F1 comparison is apples-to-apples (Macro F1 0.588 → 0.646, Micro F1 0.659 → 0.699; mAP 0.614 → 0.674).
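One way to see why a threshold calibrated under one loss configuration does not transfer to the other is to look at the negative-label term of the asymmetric loss (ASL) under both settings. The sketch below assumes the standard ASL form, in which `clip` is the probability margin subtracted from the predicted probability and `gamma_neg` is the focusing exponent on negatives. The card lists the two hyperparameters but does not show the loss code, so treat the function as an assumption; `asl_negative_term` is a hypothetical name.

```python
import math


def asl_negative_term(p, gamma_neg, clip):
    """Loss contribution of one NEGATIVE label predicted with probability p.

    Standard ASL form (assumed): p_m = max(p - clip, 0); loss = -(p_m ** gamma_neg) * log(1 - p_m).
    """
    p_m = max(p - clip, 0.0)
    if p_m == 0.0:
        return 0.0  # predictions at or below the margin are not penalised at all
    return -(p_m ** gamma_neg) * math.log(1.0 - p_m)


# Compare the two configurations named in the paragraph above on the same predictions.
for p in (0.1, 0.3, 0.5, 0.8):
    v1 = asl_negative_term(p, gamma_neg=4.0, clip=0.05)
    v1_1 = asl_negative_term(p, gamma_neg=7.0, clip=0.2)
    print(f"p={p:.1f}  V1 term={v1:.3e}  V1.1 term={v1_1:.3e}")
```

Under this form, the V1.1 setting ignores negatives up to a larger margin and down-weights the rest more steeply, so the two models are pushed toward different score distributions. A fixed in-training threshold cannot account for that, whereas a per-model sweep can.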
 I want to be honest about *why* I think it performs well: **it is almost certainly not because of a special training regimen.** The training recipe is grounded in standard ViT-from-scratch literature (DeiT / DeiT III / FixRes / ASL / AugReg) without exotic tricks. The most likely explanation is simply that the **input dataset is cleaner** than what most comparable taggers were trained on. If you are trying to reproduce or beat this result, I would put your effort into data curation before you put it into training-recipe tuning.