---
title: OppaiOracle V1.1 (CPU)
emoji: ⚡
colorFrom: pink
colorTo: purple
sdk: gradio
sdk_version: 6.14.0
python_version: '3.12'
app_file: app.py
pinned: false
short_description: Anime ViT tagger (V1.1, 448×448, CPU)
models:
- Grio43/OppaiOracle
---

# OppaiOracle V1.1: CPU Space

Public inference UI for the **V1.1** (448×448) anime tagger checkpoint that lives in the
[Grio43/OppaiOracle](https://huggingface.co/Grio43/OppaiOracle) model repo, running on CPU
hardware. For faster GPU-backed inference, see the [Grio43/OppaiOracle](https://huggingface.co/spaces/Grio43/OppaiOracle) Space.

## What this Space does

- Downloads `V1.1_onnx/model.onnx` and `V1.1_onnx/vocabulary.json` from the model repo on
  first launch (cached afterwards).
- Letterboxes the uploaded image to 448×448, normalizes with mean/std = 0.5, runs the ONNX
  graph with `onnxruntime` (CPU execution provider), and returns the top tags above the
  chosen threshold.
- The sigmoid is already inside the ONNX graph, so the output is a vector of independent
  per-tag probabilities over 19,294 tags.
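
The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `app.py`: the helper names (`letterbox_geometry`, `normalize`, `select_tags`, `fetch_assets`) are hypothetical, centered padding is assumed, and a tag is kept when its score is at or above the threshold. Only `hf_hub_download` is a real `huggingface_hub` call, and it is imported lazily so the rest runs without that package.

```python
from typing import List, Sequence, Tuple

INPUT_SIZE = 448  # side length stated in this README


def letterbox_geometry(width: int, height: int,
                       size: int = INPUT_SIZE) -> Tuple[int, int, int, int]:
    """Return (new_w, new_h, pad_left, pad_top) for an aspect-preserving fit."""
    scale = min(size / width, size / height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    # Assumption: the resized image is centered; the Space may pad differently.
    return new_w, new_h, (size - new_w) // 2, (size - new_h) // 2


def normalize(pixel: int) -> float:
    """Map an 8-bit channel value to [-1, 1] using mean = std = 0.5."""
    return (pixel / 255.0 - 0.5) / 0.5


def select_tags(probs: Sequence[float], tags: Sequence[str],
                threshold: float = 0.5, top_k: int = 50) -> List[Tuple[str, float]]:
    """Keep tags scoring at or above the threshold, highest score first."""
    kept = [(tag, p) for tag, p in zip(tags, probs) if p >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


def fetch_assets() -> Tuple[str, str]:
    """Download (or reuse the cached) model and vocabulary files."""
    from huggingface_hub import hf_hub_download  # lazy: optional dependency
    repo = "Grio43/OppaiOracle"
    model = hf_hub_download(repo_id=repo, filename="V1.1_onnx/model.onnx")
    vocab = hf_hub_download(repo_id=repo, filename="V1.1_onnx/vocabulary.json")
    return model, vocab
```

For example, an 896×448 image would be resized to 448×224 and padded by 112 pixels above and below, and a pixel value of 255 maps to 1.0 after normalization.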

## Performance notes

- Each prediction takes **~10–30 s** on the standard HF CPU tier. The V1.1 ViT has ~250M
  parameters and runs at 448² input, so this cost is inherent to CPU inference.
- ORT graph optimization is set to `BASIC` (with a fallback to `DISABLE_ALL`) because
  `ENABLE_ALL` synthesizes a `MemcpyToHost` node on the bool `padding_mask` input, which the
  HF CPU build of ORT rejects with `NOT_IMPLEMENTED`.
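
One way to express the optimization-level fallback is sketched below. It is a guess at the shape of the logic, not the Space's actual code: `first_working` and `build_ort_session` are hypothetical helpers, and the sketch assumes the failure surfaces when the session is created. The `onnxruntime` names used (`SessionOptions`, `GraphOptimizationLevel.ORT_ENABLE_BASIC`, `GraphOptimizationLevel.ORT_DISABLE_ALL`, `InferenceSession`) are the library's real identifiers.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def first_working(build: Callable[[str], T], levels: Sequence[str]) -> Tuple[str, T]:
    """Try each optimization level in order and return the first that builds."""
    errors = []
    for level in levels:
        try:
            return level, build(level)
        except Exception as exc:  # e.g. an ORT NOT_IMPLEMENTED failure
            errors.append(f"{level}: {exc}")
    raise RuntimeError("no optimization level worked: " + "; ".join(errors))


def build_ort_session(model_path: str, level: str):
    """Create a CPU-only ONNX Runtime session at the named optimization level."""
    import onnxruntime as ort  # lazy: keeps first_working dependency-free
    opts = ort.SessionOptions()
    opts.graph_optimization_level = getattr(ort.GraphOptimizationLevel, level)
    return ort.InferenceSession(
        model_path, sess_options=opts, providers=["CPUExecutionProvider"]
    )


# Usage (assumes model_path points at the downloaded model.onnx):
# level, session = first_working(
#     lambda lvl: build_ort_session(model_path, lvl),
#     ["ORT_ENABLE_BASIC", "ORT_DISABLE_ALL"],
# )
```

If the error only appears on the first `session.run(...)` call rather than at construction, the same pattern applies with a warm-up inference placed inside the `build` callable.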

## Tag thresholds

- The default threshold is **0.50** (broader recall for browsing).
- The macro PR break-even point measured on the validation set for V1.1 is **≈ 0.760**. Set
  the slider there to operate at the point where macro precision equals macro recall.
- Per-tag thresholds are available in `V1.1_onnx/pr_thresholds.json` in the model repo for
  users who want to apply them externally.
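
Applying per-tag thresholds outside the Space could look roughly like this. The layout of `pr_thresholds.json` is not described here, so the sketch assumes a flat `{"tag": threshold}` JSON object; both helper names are hypothetical, and tags without an entry fall back to the global break-even value of 0.76.

```python
import json
from typing import Dict, List


def load_thresholds(path: str) -> Dict[str, float]:
    """Read per-tag thresholds from disk."""
    # Assumption: a flat {"tag": threshold} object; adapt if the real file is nested.
    with open(path, encoding="utf-8") as fh:
        return {str(tag): float(value) for tag, value in json.load(fh).items()}


def filter_with_per_tag_thresholds(scores: Dict[str, float],
                                   per_tag: Dict[str, float],
                                   default: float = 0.76) -> List[str]:
    """Keep tags whose score reaches their own threshold, highest score first."""
    kept = [tag for tag, p in scores.items() if p >= per_tag.get(tag, default)]
    return sorted(kept, key=lambda tag: scores[tag], reverse=True)
```

With scores `{"long_hair": 0.8, "bow": 0.55, "ribbon": 0.7}` and thresholds `{"long_hair": 0.85, "bow": 0.5}`, only `bow` survives: `long_hair` misses its own threshold and `ribbon` misses the default.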

## Known noise patterns (read the model card first)

The full model card in the model repo documents these in detail. In short:

- **Missing tags are the dominant noise mode.** Many concepts that are present in an image
  are simply unlabeled in the training corpus, so low predicted scores are less informative
  than they look.
- **Color tags** (eye/hair/general) leak between perceptual neighbours.
- **Hair length** boundaries (`long_hair` / `very_long_hair`) are inherently noisy in the
  source data.
- **Neckwear** (`bow` / `bowtie` / `ribbon` / `ascot` / `necktie`) is routinely confused at
  the source.
- Treat any prediction as a *suggestion to inspect*, not a final answer.