Polish Hugging Face repository docs


Files changed (9)

.gitignore +1 -0
ANDROID.md +86 -30
MAINTENANCE.md +127 -61
README.md +190 -145
docs/onnx.md +154 -0
docs/training.md +233 -0
export_onnx.py +2 -2
onnx_inference.py +105 -0
train.py +36 -6

.gitignore CHANGED

@@ -9,6 +9,7 @@ test_checkpoints*/
 ab_checkpoints*/
 *.log
 *.onnx.data
 data/**/*.jsonl
 !data/synthetic_small.jsonl
 !data/test_smoke.jsonl

 ab_checkpoints*/
 *.log
 *.onnx.data
+docs/training_notes.md
 data/**/*.jsonl
 !data/synthetic_small.jsonl
 !data/test_smoke.jsonl

ANDROID.md CHANGED

@@ -1,58 +1,114 @@
-# Android export and runtime
-This repository is used by MiruPlay as a Git submodule at
-`tools/anime_parser`. It contains the Python training pipeline plus an ONNX
-export path for Android.
-For the full scanner integration notes, file-vs-folder behavior, and device
-test procedure, see MiruPlay's `docs/anime-filename-parser.md`.
-## Export
-From `tools/anime_parser`:
-```bash
-python -m pip install -r requirements.txt
-python export_onnx.py --model-dir checkpoints/dmhy-finetune/final --android-assets-dir ../../scraper/src/main/assets/anime_parser
 ```
 The exporter writes:
 - `exports/anime_filename_parser.onnx`
 - `exports/anime_filename_parser.metadata.json`
 - `scraper/src/main/assets/anime_parser/anime_filename_parser.onnx`
 - `scraper/src/main/assets/anime_parser/vocab.json`
 - `scraper/src/main/assets/anime_parser/config.json`
-The ONNX graph uses fixed Android inputs:
-- `input_ids`: `int64[1,64]`
-- `attention_mask`: `int64[1,64]`
-- `logits`: `float32[1,64,15]`
-The current export was verified against PyTorch with max absolute logits
-difference `1.621246337890625e-05`.
-## Runtime
-Android runs the exported graph through ONNX Runtime Android. Tokenization and
-BIO postprocessing are implemented in:
-`scraper/src/main/kotlin/com/miruplay/tv/scraper/filename/AnimeFilenameParser.kt`
-The app exposes it through `FilenameMetadataParser` in `core:model`. During a
-scan, `ScanCoordinator` passes that parser into `VideoDirectoryClassifier`; the
-classifier keeps the existing release/folder regexes first and lazily calls the
-model only when those heuristics are missing title, season, or episode data.
-Example Kotlin usage:
-```kotlin
-val parsed = animeFilenameParser.parse("[ANi] 葬送的芙莉莲 S2 - 03 [1080P][WEB-DL]")
 ```
-Expected fields:
 ```text
-title=葬送的芙莉莲, season=2, episode=3, group=ANi, resolution=1080P, source=WEB-DL
 ```

+# Android Export and Runtime / Android 导出与运行时
+AniFileBERT is used by MiruPlay as a Git submodule at `tools/anime_parser`.
+AniFileBERT 在 MiruPlay 中作为 `tools/anime_parser` 子模块使用。
+## Export / 导出
+From this repository root, export the published root checkpoint:
+在本仓库根目录导出当前发布 checkpoint：
+```powershell
+uv sync
+uv run python export_onnx.py --model-dir . --max-length 128 --android-assets-dir ../../scraper/src/main/assets/anime_parser
 ```
 The exporter writes:
+导出器会写入：
 - `exports/anime_filename_parser.onnx`
 - `exports/anime_filename_parser.metadata.json`
 - `scraper/src/main/assets/anime_parser/anime_filename_parser.onnx`
 - `scraper/src/main/assets/anime_parser/vocab.json`
 - `scraper/src/main/assets/anime_parser/config.json`
+## Static Graph Shape / 静态图 Shape
+```text
+input_ids      int64[1,128]
+attention_mask int64[1,128]
+logits         float32[1,128,15]
+```
+The current export is verified against PyTorch, with max absolute logits
+difference recorded in `exports/anime_filename_parser.metadata.json`.
+当前导出会和 PyTorch 做数值对齐，最大 logits 误差记录在
+`exports/anime_filename_parser.metadata.json`。
+## Local ONNX Smoke Test / 本地 ONNX 冒烟测试
+```powershell
+uv run python onnx_inference.py "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
+```
+Expected fields / 期望字段：
+```text
+title=神印王座, episode=200, group=GM-Team, resolution=1080P, source=GB
+```
+Special-code example / 特典编号示例：
+```powershell
+uv run python onnx_inference.py "[YYDM&VCB-Studio] Shinsekai Yori [NCED02][Ma10p_1080p][x265_flac].mkv"
+```
+Expected fields / 期望字段：
+```text
+title=Shinsekai Yori, episode=null, group=YYDM&VCB-Studio, special=NCED02
+```
+## Runtime Contract / 运行时契约
+The ONNX graph returns token logits only. Android must implement the same:
+ONNX 图只返回 token logits。Android 必须实现同一套：
+- custom character tokenizer / 自定义字符 tokenizer
+- token id lookup from `vocab.json` / 使用 `vocab.json` 查 token id
+- fixed-length padding to 128 / padding 到固定长度 128
+- constrained BIO decoding / 约束 BIO 解码
+- field aggregation / 字段聚合
+- high-confidence structural cleanup / 高置信结构修正
+The Android runtime implementation lives in MiruPlay:
+Android 运行时实现位于 MiruPlay：
+```text
+scraper/src/main/kotlin/com/miruplay/tv/scraper/filename/AnimeFilenameParser.kt
 ```
+The app exposes it through `FilenameMetadataParser` in `core:model`. During a
+scan, `ScanCoordinator` passes that parser into `VideoDirectoryClassifier`.
+应用通过 `core:model` 的 `FilenameMetadataParser` 暴露解析能力。扫描时，
+`ScanCoordinator` 会把解析器传给 `VideoDirectoryClassifier`。
+## Asset Update Rule / 资产更新规则
+When updating the parser, keep these files in sync:
+更新解析器时，以下文件必须同步：
 ```text
+anime_filename_parser.onnx
+vocab.json
+config.json
 ```
+Do not update only the ONNX file. Token ids, label ids, and max length are part
+of the runtime contract.
+不要只更新 ONNX。token id、label id 和 max length 都是运行时契约的一部分。
+## More Details / 更多说明
+See [`docs/onnx.md`](docs/onnx.md) for a minimal Python ONNX Runtime reference.
+最小 Python ONNX Runtime 参考见 [`docs/onnx.md`](docs/onnx.md)。
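
The runtime contract in ANDROID.md names constrained BIO decoding and field aggregation without showing them. The sketch below is one possible reading, not the repository's implementation. It assumes a greedy decoder (the real decoder may use a different search), a hypothetical label order (the real id-to-label map is in `config.json`), and a "first span wins" aggregation rule. The label count itself follows from the docs: 7 entity types with `B-`/`I-` variants plus `O` gives the 15 logits per token.

```python
ENTITY_TYPES = ["TITLE", "SEASON", "EPISODE", "GROUP", "RESOLUTION", "SOURCE", "SPECIAL"]
# Assumed label order for illustration only; read the real mapping from config.json.
LABELS = ["O"] + [f"{prefix}-{kind}" for kind in ENTITY_TYPES for prefix in ("B", "I")]


def allowed(prev, cur):
    """BIO constraint: I-X may only follow B-X or I-X; B-* and O are always legal."""
    if cur.startswith("I-"):
        return prev in ("B-" + cur[2:], "I-" + cur[2:])
    return True


def decode_constrained(logits):
    """Greedy decode: per token, take the best-scoring label that is a legal transition."""
    labels, prev = [], "O"
    for row in logits:
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        cur = next(LABELS[i] for i in ranked if allowed(prev, LABELS[i]))
        labels.append(cur)
        prev = cur
    return labels


def aggregate(chars, labels):
    """Join the characters of each span; keep the first span seen for each field."""
    fields = {kind.lower(): None for kind in ENTITY_TYPES}
    spans, current = [], None
    for ch, label in zip(chars, labels):
        if label.startswith("B-"):
            current = [label[2:], ch]
            spans.append(current)
        elif label.startswith("I-") and current is not None and current[0] == label[2:]:
            current[1] += ch
        else:
            current = None
    for kind, text in spans:
        if fields[kind.lower()] is None:
            fields[kind.lower()] = text
    return fields


def row(best, runner_up=None):
    """Build a fake logits row for the demo (hypothetical helper)."""
    scores = [0.0] * len(LABELS)
    scores[LABELS.index(best)] = 1.0
    if runner_up is not None:
        scores[LABELS.index(runner_up)] = 0.9
    return scores


chars = list("[AB] C 03")
logits = [
    row("O"), row("B-GROUP"), row("I-GROUP"), row("O"), row("O"),
    row("I-TITLE", "B-TITLE"),  # I-TITLE after O is illegal, so B-TITLE is chosen
    row("O"), row("B-EPISODE"), row("I-EPISODE"),
]
labels = decode_constrained(logits)
fields = aggregate(chars, labels)
print(fields)
```

This sketch keeps every field as a string. The expected outputs in the docs show `episode` as a number, so the real parser evidently performs type conversion and further cleanup that is not reproduced here.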

MAINTENANCE.md CHANGED

@@ -1,117 +1,183 @@
-# AniFileBERT Maintenance
 This repository is the standalone Hugging Face model repo used by MiruPlay as
 `tools/anime_parser`.
-## Related Repositories
-| Repository | URL | Purpose |
-|------------|-----|---------|
-| AniFileBERT | `https://huggingface.co/ModerRAS/AniFileBERT` | Model, training scripts, ONNX export |
-| AnimeName | `https://huggingface.co/datasets/ModerRAS/AnimeName` | Training datasets and manifests |
-| MiruPlay | `https://github.com/ModerRAS/MiruPlay` | Android app and runtime integration |
-Nested structure:
 ```text
 AniFileBERT
   datasets/AnimeName -> ModerRAS/AnimeName
 ```
-## Clone
-```bash
 git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
 ```
-After a normal clone:
-```bash
 git submodule update --init --recursive
 ```
-## Dataset Waterline
-Current DMHY snapshot:
 ```text
-labeled_samples: 632002
-char_vocab_size: 6199
-strict_bio_violations: 0
 ```
-The authoritative dataset files live in `datasets/AnimeName`.
-## Train
-```bash
-uv sync
-uv run python train.py \
-  --tokenizer char \
-  --data-file datasets/AnimeName/dmhy_weak_char.jsonl \
-  --vocab-file datasets/AnimeName/vocab.char.json \
-  --save-dir checkpoints/dmhy-char-guoman-relabel \
-  --init-model-dir . \
-  --epochs 2 \
-  --batch-size 256 \
-  --learning-rate 0.00008 \
-  --warmup-steps 300 \
-  --max-seq-length 128 \
-  --checkpoint-steps 1000 \
-  --parse-eval-limit 2048 \
-  --seed 52
 ```
-## Publish a New Checkpoint
-Copy the final checkpoint to the repository root:
 ```powershell
-Copy-Item checkpoints/dmhy-char-guoman-relabel/final/config.json . -Force
-Copy-Item checkpoints/dmhy-char-guoman-relabel/final/model.safetensors . -Force
-Copy-Item checkpoints/dmhy-char-guoman-relabel/final/tokenizer_config.json . -Force
-Copy-Item checkpoints/dmhy-char-guoman-relabel/final/training_args.bin . -Force
-Copy-Item checkpoints/dmhy-char-guoman-relabel/final/vocab.json . -Force
 Copy-Item datasets/AnimeName/vocab.char.json .\vocab.char.json -Force
-Copy-Item checkpoints/dmhy-char-guoman-relabel/final/run_metadata.json . -Force
-Copy-Item checkpoints/dmhy-char-guoman-relabel/final/trainer_eval_metrics.json . -Force
-Copy-Item checkpoints/dmhy-char-guoman-relabel/final/parse_eval_metrics.json . -Force
 ```
-There is no tracked `model/` duplicate. The root checkpoint is the publishing
-surface; ignored `checkpoints/` directories are training artifacts.
-Then commit and push:
-```bash
-git add .
-git commit -m "Update AniFileBERT checkpoint"
-git push origin main
 ```
-## Update the Dataset Submodule
-After pushing new files to `ModerRAS/AnimeName`, update the nested pointer:
-```bash
-git submodule update --remote datasets/AnimeName
 git add datasets/AnimeName
 git commit -m "Update AnimeName dataset pointer"
 git push origin main
 ```
-## Update MiruPlay
-From the MiruPlay root:
-```bash
 git submodule update --remote --recursive tools/anime_parser
 git add tools/anime_parser
 git commit -m "Update AniFileBERT submodule"
-git push origin master
 ```
-If a new ONNX export changed Android runtime assets, also stage:
 ```text
 scraper/src/main/assets/anime_parser/anime_filename_parser.onnx

+# AniFileBERT Maintenance / 维护手册
 This repository is the standalone Hugging Face model repo used by MiruPlay as
 `tools/anime_parser`.
+本仓库是 MiruPlay 通过 `tools/anime_parser` 引用的独立 Hugging Face 模型仓库。
+## Related Repositories / 相关仓库
+| Repository / 仓库 | URL | Purpose / 用途 |
+| --- | --- | --- |
+| AniFileBERT | `https://huggingface.co/ModerRAS/AniFileBERT` | Model, scripts, ONNX export / 模型、脚本、ONNX 导出 |
+| AnimeName | `https://huggingface.co/datasets/ModerRAS/AnimeName` | Dataset snapshot / 数据集快照 |
+| MiruPlay | `https://github.com/ModerRAS/MiruPlay` | Android integration / Android 集成 |
+Nested structure / 嵌套结构：
 ```text
 AniFileBERT
   datasets/AnimeName -> ModerRAS/AnimeName
 ```
+## Clone / 克隆
+```powershell
 git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
 ```
+After a normal clone / 普通 clone 后：
+```powershell
 git submodule update --init --recursive
+uv sync
 ```
+## Publishing Surface / 发布面
+The repository root is the only published Hugging Face checkpoint location:
+仓库根目录是唯一的 Hugging Face checkpoint 发布位置：
 ```text
+config.json
+model.safetensors
+tokenizer_config.json
+training_args.bin
+vocab.json
+vocab.char.json
+run_metadata.json
+trainer_eval_metrics.json
+parse_eval_metrics.json
+case_metrics.json
 ```
+There is no tracked `model/` duplicate. Ignored `checkpoints/` directories are
+local training artifacts only.
+仓库不再跟踪旧的 `model/` 副本。被 ignore 的 `checkpoints/` 仅是本地训练产物。
+## Standard Training / 标准训练
+For full details, see [`docs/training.md`](docs/training.md).
+完整流程见 [`docs/training.md`](docs/training.md)。
+Recommended full training command / 推荐全量训练命令：
+```powershell
+uv run python train.py --tokenizer char `
+  --data-file datasets/AnimeName/dmhy_weak_char.jsonl `
+  --vocab-file datasets/AnimeName/vocab.char.json `
+  --save-dir checkpoints/dmhy-char-full `
+  --init-model-dir . `
+  --epochs 2 `
+  --batch-size 256 `
+  --learning-rate 0.00008 `
+  --warmup-steps 300 `
+  --max-seq-length 128 `
+  --train-split 0.98 `
+  --num-workers 4 `
+  --checkpoint-steps 1000 `
+  --save-total-limit 3 `
+  --parse-eval-limit 2048 `
+  --case-eval-file data/parser_regression_cases.json `
+  --seed 52 `
+  --experiment-name dmhy-char-full
 ```
+## Publish a New Checkpoint / 发布新 checkpoint
+Copy final files to the repository root:
+把 `final` 文件复制到仓库根目录：
 ```powershell
+$final = "checkpoints/dmhy-char-full/final"
+Copy-Item "$final/config.json" . -Force
+Copy-Item "$final/model.safetensors" . -Force
+Copy-Item "$final/tokenizer_config.json" . -Force
+Copy-Item "$final/training_args.bin" . -Force
+Copy-Item "$final/vocab.json" . -Force
+Copy-Item "$final/run_metadata.json" . -Force
+Copy-Item "$final/trainer_eval_metrics.json" . -Force
+Copy-Item "$final/parse_eval_metrics.json" . -Force
+Copy-Item "$final/case_metrics.json" . -Force
 Copy-Item datasets/AnimeName/vocab.char.json .\vocab.char.json -Force
 ```
+Export ONNX / 导出 ONNX：
+```powershell
+uv run python export_onnx.py --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
+```
+Validate / 验证：
+```powershell
+uv run python evaluate_parser_cases.py --model-dir . --case-file data/parser_regression_cases.json --output case_metrics.json
+uv run python onnx_inference.py "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
+```
+## Dataset Submodule / 数据集子模块
+If `datasets/AnimeName` changed, commit and push it first:
+如果 `datasets/AnimeName` 有变动，先提交并推送它：
+```powershell
+git -C datasets/AnimeName add dmhy_weak.jsonl dmhy_weak.manifest.json dmhy_weak_char.jsonl dmhy_weak_char.manifest.json vocab.json vocab.char.json
+git -C datasets/AnimeName commit -m "Update anime filename labels"
+git -C datasets/AnimeName lfs push origin main --all
+git -C datasets/AnimeName push origin main
 ```
+Then commit the submodule pointer in this repo:
+然后在本仓库提交 submodule pointer：
+```powershell
 git add datasets/AnimeName
 git commit -m "Update AnimeName dataset pointer"
+```
+## LFS Push Order / LFS 推送顺序
+Large model artifacts are tracked with Git LFS. If Hugging Face rejects a push
+because an LFS pointer points to a missing object, upload LFS objects first:
+大模型文件通过 Git LFS 跟踪。如果 Hugging Face 因 LFS pointer 缺对象拒绝 push，
+先上传 LFS 对象：
+```powershell
+git lfs push origin main --all
 git push origin main
 ```
+For dataset changes:
+数据集变动：
+```powershell
+git -C datasets/AnimeName lfs push origin main --all
+git -C datasets/AnimeName push origin main
+```
+## Update MiruPlay / 更新 MiruPlay
+From MiruPlay root:
+在 MiruPlay 根目录：
+```powershell
 git submodule update --remote --recursive tools/anime_parser
 git add tools/anime_parser
 git commit -m "Update AniFileBERT submodule"
 ```
+If Android assets changed, also stage:
+如果 Android assets 变化，也要提交：
 ```text
 scraper/src/main/assets/anime_parser/anime_filename_parser.onnx

README.md CHANGED

@@ -3,93 +3,100 @@ license: apache-2.0
 library_name: transformers
 pipeline_tag: token-classification
 tags:
-- anime
-- filename-parsing
-- bert
-- token-classification
 datasets:
-- ModerRAS/AnimeName
 language:
-- en
-- ja
-- zh
 ---
 # AniFileBERT
-AniFileBERT is a tiny BERT token-classification model for parsing anime release filenames into structured fields such as release group, title, season, episode, resolution, source, and special tags.
-The checkpoint in this repository is the full-relabel DMHY character-token model used by MiruPlay.
-## Model
-- Architecture: `BertForTokenClassification`
-- Hidden size: 256
-- Layers: 4
-- Attention heads: 8
-- Labels: BIO token labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, and `SPECIAL`
-- Tokenizer: custom character tokenizer implemented in `tokenizer.py`
-- Max sequence length: 128
-- Parameters: 4,783,631
-The model files are stored at the repository root so `BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")` can load the weights. Use `inference.py` for end-to-end parsing because the tokenizer is custom rather than a standard WordPiece tokenizer.
-## Dataset
-Training data snapshots are published separately in [`ModerRAS/AnimeName`](https://huggingface.co/datasets/ModerRAS/AnimeName), and this repository includes it as a nested git submodule at `datasets/AnimeName`.
-Current DMHY export waterline (from `datasets/AnimeName`):
-- Last exported `files.id`: `1675184`
-- Next incremental export: `--min-id 1675185`
-- Weak-labeled samples: `632002`
-- Mixed training samples: `732002`
-## Vocabulary
-The published checkpoint uses a character vocabulary. `vocab.json` at the
-repository root is the deployed tokenizer vocab, and `vocab.char.json` is kept
-as a mirrored explicit copy for training/data maintenance. The full DMHY weak
-dataset has **6195 unique characters**, so the complete character vocab is only
-**6199** entries including special tokens and reaches 100% token coverage.
-The regex vocabulary is still maintained in `datasets/AnimeName/vocab.json` for
-dataset relabeling and diagnostics, but the root checkpoint loads as `char`.
-## Evaluation
-Final full-relabel char training (`632002` DMHY rows, 2 epochs, batch size 256,
-seed 52):
-| Metric | Value |
-|--------|-------|
-| Eval loss | 0.0058 |
-| Entity precision | 0.9922 |
-| Entity recall | 0.9946 |
-| Entity F1 | 0.9934 |
-| Token accuracy | 0.9981 |
-| Held-out parse full match | 2029/2048 (0.9907) |
-| Fixed regression full match | 22/22 (1.0000) |
-The fixed regression set includes second-season aliases such as `Ni`,
-`Ni no Sara`, `貳`, and `弐ノ章`, plus GM-Team bilingual Chinese animation
-bracket layouts, long-running episode IDs, and dense meta blocks.
-## Usage
-Install dependencies:
-```bash
-uv sync
 ```
-Parse a filename with this repository cloned locally:
-```bash
-python inference.py --model-dir . "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub"
 ```
-Load only the model weights from the Hub:
 ```python
 from transformers import BertForTokenClassification
@@ -97,114 +104,152 @@ from transformers import BertForTokenClassification
 model = BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
 ```
-For full parsing, clone this repo and use `load_tokenizer` from `tokenizer.py` or the CLI in `inference.py`.
-## Clone with Dataset Submodule
-```bash
-git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
-# or, after a normal clone:
-git submodule update --init --recursive
-```
-## Training
-### Character-token DMHY training
-```bash
-uv run python convert_to_char_dataset.py \
-  --input datasets/AnimeName/dmhy_weak.jsonl \
-  --output datasets/AnimeName/dmhy_weak_char.jsonl \
-  --vocab-output datasets/AnimeName/vocab.char.json \
-  --manifest-output datasets/AnimeName/dmhy_weak_char.manifest.json
-uv run python train.py --tokenizer char \
-  --data-file datasets/AnimeName/dmhy_weak_char.jsonl \
-  --vocab-file datasets/AnimeName/vocab.char.json \
-  --save-dir checkpoints/dmhy-char-guoman-relabel \
-  --init-model-dir . \
-  --epochs 2 --batch-size 256 \
-  --learning-rate 0.00008 --warmup-steps 300 \
-  --checkpoint-steps 1000 --save-total-limit 3 \
-  --parse-eval-limit 2048 \
-  --max-seq-length 128 --seed 52
-```
-The converter keeps source metadata and adds `tokenizer_variant`, source token
-count, and character token count fields to each record. The char dataset's
-p99 length is 107 characters, so `--max-seq-length 128` covers almost all rows
-while leaving room for `[CLS]` and `[SEP]`.
-### Relabel the full dataset
-```bash
-uv run python relabel_dataset_from_filenames.py \
-  --input datasets/AnimeName/dmhy_weak.jsonl \
-  --output datasets/AnimeName/dmhy_weak.relabel.jsonl \
-  --manifest-output datasets/AnimeName/dmhy_weak.relabel.manifest.json \
-  --vocab-output datasets/AnimeName/vocab.relabel.json \
-  --base-vocab datasets/AnimeName/vocab.json \
-  --max-vocab-size 8000
-Move-Item datasets/AnimeName/dmhy_weak.relabel.jsonl datasets/AnimeName/dmhy_weak.jsonl -Force
-Move-Item datasets/AnimeName/vocab.relabel.json datasets/AnimeName/vocab.json -Force
-Copy-Item datasets/AnimeName/dmhy_weak.relabel.manifest.json datasets/AnimeName/dmhy_weak.manifest.json -Force
-Remove-Item datasets/AnimeName/dmhy_weak.relabel.manifest.json -Force
-```
-### Rebuild vocabulary (if needed)
-```bash
-python -c "
-import json, collections
-tokens = collections.Counter()
-[ tokens.update(item['tokens']) for item in [json.loads(l) for l in open('datasets/AnimeName/dmhy_weak.jsonl')] if item ]
-vocab = {t:i for i,t in enumerate(['[PAD]','[UNK]','[CLS]','[SEP]'] + [t for t,_ in tokens.most_common(7996)])}
-json.dump(vocab, open('vocab.json','w'), ensure_ascii=False, indent=2)
-"
 ```
-### Export ONNX for MiruPlay Android
-```bash
-uv run python export_onnx.py --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
 ```
----
-## Google Colab Training
-For Codex-controlled short Colab sessions, see [`colab/README.md`](colab/README.md).
-Free Colab still has to be started manually, but once `colab_worker.py` is
-running Codex can submit jobs through `colab_client.py`, tail logs, and inspect
-status. Checkpoints live on Google Drive and default profiles resume from the
-latest checkpoint automatically.
-Manual one-shot runs are also supported:
-```bash
-python colab_train.py --profile dmhy_regex_finetune
 ```
-## Repository Layout
-- `model.safetensors`, `config.json`, `vocab.json`: default published model
-- `train.py`, `dataset.py`, `tokenizer.py`, `model.py`: training pipeline
-- `dmhy_dataset.py`, `mix_datasets.py`: weak-label export and dataset mixing
-- `convert_to_char_dataset.py`: full character-token projection for weak labels
-- `inference.py`: end-to-end filename parser CLI
-- `export_onnx.py`: ONNX export for Android integration
-- `exports/`: exported ONNX model and metadata
-- `datasets/AnimeName/`: nested dataset submodule
-## Maintenance Notes
-MiruPlay tracks this repository as `tools/anime_parser`, and this repository
-tracks `ModerRAS/AnimeName` as `datasets/AnimeName`. After updating either
-repo, remember to commit the submodule pointer in the parent repo.
-For the full maintenance workflow, see MiruPlay's
-`docs/anifilebert-maintenance.md`.

 library_name: transformers
 pipeline_tag: token-classification
 tags:
+  - anime
+  - filename-parsing
+  - bert
+  - token-classification
+  - onnx
 datasets:
+  - ModerRAS/AnimeName
 language:
+  - en
+  - ja
+  - zh
+model-index:
+  - name: AniFileBERT
+    results:
+      - task:
+          type: token-classification
+          name: Anime filename token classification
+        dataset:
+          name: AniFileBERT fixed parser regression cases
+          type: parser-regression
+        metrics:
+          - type: accuracy
+            name: Fixed parser full-match accuracy
+            value: 1.0
 ---
 # AniFileBERT
+**中文**：AniFileBERT 是一个面向番剧发布文件名的轻量级 BERT token-classification 解析器。它把常见发布名解析为结构化字段：字幕组、标题、季、集数、分辨率、来源和 special tag。
+**English**: AniFileBERT is a lightweight BERT token-classification parser for anime release filenames. It extracts structured fields: release group, title, season, episode, resolution, source, and special tags.
+This repository is the Hugging Face model repo used by MiruPlay as `tools/anime_parser`.
+## Model Details / 模型信息
+| Item | Value |
+| --- | --- |
+| Architecture / 架构 | `BertForTokenClassification` |
+| Tokenizer / 分词器 | Custom character tokenizer in `tokenizer.py` |
+| Parameters / 参数量 | 4,783,631 |
+| Hidden size / 隐层维度 | 256 |
+| Layers / 层数 | 4 |
+| Attention heads / 注意力头 | 8 |
+| Max sequence length / 最大长度 | 128 |
+| Labels / 标签 | BIO labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, `SPECIAL` |
+| Default checkpoint / 默认权重 | Repository root files (`config.json`, `model.safetensors`, `vocab.json`, `tokenizer_config.json`) |
+| ONNX export / ONNX 导出 | `exports/anime_filename_parser.onnx` |
+**中文**：根目录就是发布 checkpoint，不再保留旧的 `model/` 重复副本。完整解析请使用本仓库的 `inference.py` 或复用 `tokenizer.py`、BIO decode 和字段聚合逻辑；直接 `from_pretrained()` 只能加载 token-classification 权重。
+**English**: The repository root is the published checkpoint. The old duplicate `model/` directory is intentionally not used. For end-to-end parsing, use `inference.py` or reuse this repo's tokenizer, BIO decoder, and field aggregation logic; `from_pretrained()` only loads token-classification weights.
+## Intended Use / 使用场景
+**中文**
+- 解析番剧/动画发布文件名，用于媒体库刮削、归类、搜索和展示。
+- 覆盖常见结构：`[GROUP] TITLE - EP [META]`、点分隔 `S01E07`、国漫多括号标题、BD 特典 `NCOP/NCED/IV05`、长集数、第二季别名等。
+- 不适合泛化为自然语言 NER；这是结构化文件名解析任务。
+**English**
+- Parse anime release filenames for media library scraping, classification, search, and display.
+- Covers common layouts: `[GROUP] TITLE - EP [META]`, dotted `S01E07`, Chinese animation bracket layouts, BD extras such as `NCOP/NCED/IV05`, long-running episode numbers, and season aliases.
+- This is not a general natural-language NER model; it is a structured filename parser.
+## Install / 安装
+```powershell
+uv sync
+```
+If the dataset submodule is missing:
+```powershell
+git submodule update --init --recursive
+```
+## Quick Start / 快速使用
+Run the Python parser:
+```powershell
+uv run python inference.py --model-dir . "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
 ```
+Expected output:
+```json
+{"title":"神印王座","season":null,"episode":200,"group":"GM-Team","resolution":"1080P","source":"GB","special":null}
 ```
+Load the raw Transformers model:
 ```python
 from transformers import BertForTokenClassification
 model = BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
 ```
+**中文**：如果需要完整字段解析，请 clone 本仓库并使用 `inference.py`，因为分词器和后处理是自定义的。
+**English**: For complete field parsing, clone this repo and use `inference.py`; the tokenizer and postprocessing are custom.
+## ONNX Usage / ONNX 使用
+The ONNX graph outputs token logits only. A complete parser still needs:
+1. custom character tokenization,
+2. constrained BIO decoding,
+3. field aggregation and high-confidence structural cleanup.
+This repository provides a minimal runnable example / 本仓库提供最小可运行示例：
+```powershell
+uv run python onnx_inference.py "[YYDM&VCB-Studio] Shinsekai Yori [NCED02][Ma10p_1080p][x265_flac].mkv"
 ```
+Static graph shapes:
+- `input_ids`: `int64[1,128]`
+- `attention_mask`: `int64[1,128]`
+- `logits`: `float32[1,128,15]`
+More details: [`docs/onnx.md`](docs/onnx.md) and [`ANDROID.md`](ANDROID.md).
+## Evaluation / 评估
+Current published checkpoint:
+| Metric / 指标 | Value / 数值 |
+| --- | --- |
+| Fixed real-case regression / 固定真实回归 | 26/26 full match |
+| ONNX parity / ONNX 误差 | max abs diff `2.6703e-05` |
+| Token/entity eval after focus tuning / focus 微调后实体评估 | F1 `0.9666`, token accuracy `0.9904` |
+| Focus parse eval / focus 解析评估 | 385/512 full match |
+**中文**：当前发布模型是“全量重标注 char 模型 + special-code focus 微调”。固定回归集覆盖真实用户反馈样式；focus eval 是偏向困难样本的评估，不等同于全量随机 DMHY 评估。
+**English**: The published checkpoint is the full-relabel character model plus a targeted special-code focus fine-tune. The fixed regression set covers real user-reported patterns; focus eval is intentionally biased toward hard examples and is not equivalent to a broad random DMHY evaluation.
+Run regression:
+```powershell
+uv run python evaluate_parser_cases.py --model-dir . --case-file data/parser_regression_cases.json --output case_metrics.json
 ```
+## Training / 训练
+Training uses the dataset submodule at `datasets/AnimeName`.
+Recommended full character-token run:
+```powershell
+uv run python train.py --tokenizer char `
+  --data-file datasets/AnimeName/dmhy_weak_char.jsonl `
+  --vocab-file datasets/AnimeName/vocab.char.json `
+  --save-dir checkpoints/dmhy-char-full `
+  --init-model-dir . `
+  --epochs 2 `
+  --batch-size 256 `
+  --learning-rate 0.00008 `
+  --warmup-steps 300 `
+  --max-seq-length 128 `
+  --train-split 0.98 `
+  --num-workers 4 `
+  --checkpoint-steps 1000 `
+  --save-total-limit 3 `
+  --parse-eval-limit 2048 `
+  --seed 52 `
+  --experiment-name dmhy-char-full
+```
+`train.py` writes:
+- Hugging Face checkpoints under `--save-dir`,
+- `final/run_metadata.json`,
+- `final/trainer_eval_metrics.json`,
+- `final/parse_eval_metrics.json`,
+- `final/case_metrics.json` unless `--no-case-eval` is used,
+- TensorBoard logs unless `--no-tensorboard` is used.
+Full workflow: [`docs/training.md`](docs/training.md).
+## Dataset / 数据集
+Authoritative dataset snapshot:
+```text
+datasets/AnimeName/dmhy_weak.jsonl
+datasets/AnimeName/dmhy_weak_char.jsonl
+datasets/AnimeName/vocab.json
+datasets/AnimeName/vocab.char.json
+```
+Current snapshot:
+- rows / 行数: `632002`
+- failed relabel rows / 重标注失败行: `0`
+- strict BIO violations / 严格 BIO 违规: `0`
+- character vocab / 字符词表: `6199`
+- character coverage / 字符覆盖率: `100%`
+**中文**：`datasets/AnimeName` 是嵌套数据集仓库。更新数据后需要先提交/推送子仓库，再提交父仓库的 submodule pointer。
+**English**: `datasets/AnimeName` is a nested dataset repository. Commit and push the dataset repo first, then commit the updated submodule pointer in this model repo.
+## Repository Layout / 仓库结构
+```text
+config.json
+model.safetensors
+tokenizer_config.json
+vocab.json
+training_args.bin
+inference.py
+onnx_inference.py
+export_onnx.py
+train.py
+dataset.py
+tokenizer.py
+dmhy_dataset.py
+label_repairs.py
+relabel_dataset_from_filenames.py
+convert_to_char_dataset.py
+data/parser_regression_cases.json
+datasets/AnimeName/
+exports/anime_filename_parser.onnx
+docs/
 ```
+## Maintenance / 维护
+See [`MAINTENANCE.md`](MAINTENANCE.md) for release steps, LFS order, dataset submodule updates, and MiruPlay integration notes.
+## Limitations / 局限
+**中文**
+- 发布命名没有统一标准，极端 OCR 噪声、乱码、非动画命名仍可能失败。
+- ONNX 只包含模型 logits，不包含 tokenizer 和后处理；移动端必须保持 tokenizer/vocab/config 一致。
+- `source` 当前是单值字段，复杂文件名里可能同时存在平台、发布源、编码器和语言标签。
+**English**
+- Anime release names are not standardized; extreme OCR noise, mojibake, or non-anime names can still fail.
+- ONNX contains logits only. Mobile runtimes must keep tokenizer, vocabulary, config, BIO decode, and postprocessing in sync.
+- `source` is currently a single field, while real filenames may contain platform, release source, codec, and language tags together.
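
The recommended training command in README.md implies a rough step budget, which helps when reading `--warmup-steps`, `--checkpoint-steps`, and `--save-total-limit`. The arithmetic below is a back-of-the-envelope sketch. It assumes a floor-based 98/2 row split, a single device, no gradient accumulation, and a retained final partial batch. None of these are confirmed by the diff, so `train.py` may count differently.

```python
import math

ROWS = 632002            # dataset rows stated in the README
TRAIN_SPLIT = 0.98       # --train-split
BATCH_SIZE = 256         # --batch-size
EPOCHS = 2               # --epochs
CHECKPOINT_STEPS = 1000  # --checkpoint-steps
SAVE_TOTAL_LIMIT = 3     # --save-total-limit

# Assumed split rule: floor of rows * split goes to training, the rest to eval.
train_rows = int(ROWS * TRAIN_SPLIT)
eval_rows = ROWS - train_rows

# Assumed: one optimizer step per batch, last partial batch kept.
steps_per_epoch = math.ceil(train_rows / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

# Periodic checkpoints written during the run, and how many survive pruning.
checkpoints_written = total_steps // CHECKPOINT_STEPS
checkpoints_kept = min(checkpoints_written, SAVE_TOTAL_LIMIT)

print(train_rows, eval_rows, steps_per_epoch, total_steps, checkpoints_written, checkpoints_kept)
```

Under these assumptions the 300 warmup steps cover a small fraction of the run, and the step-based checkpoint limit of 3 prunes the earliest periodic checkpoint.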

docs/onnx.md ADDED

	@@ -0,0 +1,154 @@

+# ONNX Usage / ONNX 使用说明
+AniFileBERT exports a static-shape ONNX graph for Android and local inference.
+AniFileBERT 导出静态 shape 的 ONNX 图，用于 Android 和本地推理。
+## 1. What ONNX Contains / ONNX 包含什么
+The ONNX graph contains only the BERT token-classification forward pass:
+ONNX 图只包含 BERT token-classification 前向计算：
+```text
+input_ids      int64[1,128]
+attention_mask int64[1,128]
+logits         float32[1,128,15]
+```
+It does **not** contain:
+它**不包含**：
+- filename tokenization / 文件名分词
+- token-to-id conversion / token 到 id 的转换
+- constrained BIO decoding / 约束 BIO 解码
+- field aggregation / 字段聚合
+- structural cleanup / 结构化清理
+Those steps must stay aligned with `tokenizer.py`, `inference.py`, `config.json`,
+and `vocab.json`.
+这些步骤必须与 `tokenizer.py`、`inference.py`、`config.json`、`vocab.json`
+保持一致。
+## 2. Export / 导出
+```powershell
+uv run python export_onnx.py --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
+```
+The exporter also writes:
+导出器还会写入：
+```text
+exports/anime_filename_parser.metadata.json
+```
+The metadata records the sample filename, output shape, and PyTorch/ONNX max
+absolute logits difference.
+metadata 会记录样本文件名、输出 shape、PyTorch/ONNX logits 最大绝对误差。
+## 3. Local ONNX Inference / 本地 ONNX 推理
+Use `onnx_inference.py` as the minimal runnable reference.
+使用 `onnx_inference.py` 作为最小可运行参考实现。
+```powershell
+uv run python onnx_inference.py "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
+```
+Expected:
+期望输出：
+```json
+{"title":"神印王座","season":null,"episode":200,"group":"GM-Team","resolution":"1080P","source":"GB","special":null}
+```
+Special-code example:
+特典编号示例：
+```powershell
+uv run python onnx_inference.py "[YYDM&VCB-Studio] Shinsekai Yori [NCED02][Ma10p_1080p][x265_flac].mkv"
+```
+Expected:
+期望输出：
+```json
+{"title":"Shinsekai Yori","season":null,"episode":null,"group":"YYDM&VCB-Studio","resolution":"1080p","source":"x265_flac","special":"NCED02"}
+```
+## 4. Implementation Steps / 实现步骤
+The runtime parser should do this:
+运行时解析器应按以下步骤实现：
+1. Tokenize filename with the custom character tokenizer.
+   使用自定义字符 tokenizer 对文件名分词。
+2. Truncate to `max_length - 2` tokens, then add `[CLS]` and `[SEP]`.
+   截断到 `max_length - 2` 个 token，再添加 `[CLS]` 和 `[SEP]`。
+3. Convert tokens to ids with `vocab.json`.
+   使用 `vocab.json` 转换 token id。
+4. Pad `input_ids` and `attention_mask` to exactly `128`.
+   将 `input_ids` 和 `attention_mask` padding 到固定 `128`。
+5. Run ONNX Runtime.
+   执行 ONNX Runtime。
+6. Slice logits back to real token count, excluding `[CLS]` and `[SEP]`.
+   去掉 `[CLS]` / `[SEP]`，只保留真实 token 的 logits。
+7. Decode labels with constrained BIO transitions.
+   使用约束 BIO transition 解码标签。
+8. Aggregate labels into parser fields.
+   聚合标签为结构化字段。
+9. Apply high-confidence structural cleanup.
+   应用高置信结构修正。
+## 5. Android Notes / Android 注意事项
+Android must bundle these files together:
+Android 端必须同时打包：
+```text
+anime_filename_parser.onnx
+vocab.json
+config.json
+```
+When changing any of them, update all of them in the same commit.
+只要其中任意一个变化，三者必须在同一次提交中一起更新。
+## 6. Common Mistakes / 常见错误
+**Using a standard Hugging Face tokenizer**
+**误用标准 Hugging Face tokenizer**
+This model uses `AnimeTokenizer`, not WordPiece/BPE.
+本模型使用 `AnimeTokenizer`，不是 WordPiece/BPE。
+**Treating ONNX output as final fields**
+**把 ONNX 输出当成最终字段**
+ONNX returns token logits. You still need BIO decode and field aggregation.
+ONNX 返回 token logits，仍然需要 BIO 解码和字段聚合。
+**Changing max length without updating Android**
+**改 max length 但没有同步 Android**
+The exported graph has static input shapes. The runtime `input_ids` and `attention_mask` arrays must be exactly `[1,128]`.
+导出的图是静态 shape，运行时数组必须匹配 `[1,128]`。

docs/training.md ADDED

	@@ -0,0 +1,233 @@

+# Training Guide / 训练指南
+This document describes the reproducible training workflow for AniFileBERT.
+本文档记录 AniFileBERT 的可复现训练流程。
+## 1. Environment / 环境
+Use `uv` for dependency management and for running all commands.
+所有依赖和命令优先使用 `uv`。
+```powershell
+uv sync
+uv run python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
+```
+Recommended GPU configuration:
+推荐 GPU 配置：
+- RTX 3080 class GPU or better
+- batch size `192` to `256` for full char training
+- `fp16` enabled automatically when CUDA is available
+- `--num-workers 4` or `8` when the local disk can keep up
+## 2. Dataset / 数据集
+The authoritative dataset lives in the nested submodule:
+权威数据集位于嵌套子模块：
+```text
+datasets/AnimeName/dmhy_weak.jsonl
+datasets/AnimeName/dmhy_weak_char.jsonl
+datasets/AnimeName/vocab.json
+datasets/AnimeName/vocab.char.json
+```
+Current expected properties:
+当前期望属性：
+- rows / 行数: `632002`
+- strict BIO violations / 严格 BIO 违规: `0`
+- character vocab / 字符词表: `6199`
+- character coverage / 字符覆盖率: `100%`
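The row count and the strict BIO violation count can be re-checked with a short script. This is a sketch under assumptions: the JSONL schema is not documented here, so the `labels` key is a guess, and "strict violation" is taken to mean an `I-X` tag that does not directly follow `B-X` or `I-X` of the same type.

```python
import json
from typing import Iterable, List, Tuple


def bio_violations(labels: List[str]) -> int:
    """Count I-X tags that do not continue a B-X or I-X of the same type."""
    count = 0
    prev = "O"
    for label in labels:
        if label.startswith("I-") and prev[2:] != label[2:]:
            count += 1
        prev = label
    return count


def scan_jsonl(lines: Iterable[str], label_key: str = "labels") -> Tuple[int, int]:
    """Return (row_count, total_violations) for an iterable of JSONL lines."""
    rows = 0
    violations = 0
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        rows += 1
        violations += bio_violations(record[label_key])
    return rows, violations
```

To use it, open `datasets/AnimeName/dmhy_weak_char.jsonl` with `encoding="utf-8"` and pass the file object to `scan_jsonl`. If the schema matches the assumption, the result can be compared with the expected `632002` rows and `0` violations.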
+## 3. Relabel Full Dataset / 全量重标注
+Use this when the weak-label rules in `dmhy_dataset.py` or `label_repairs.py` have changed.
+当 `dmhy_dataset.py` 或 `label_repairs.py` 的弱标注规则改变时，使用此流程。
+```powershell
+uv run python relabel_dataset_from_filenames.py `
+  --input datasets/AnimeName/dmhy_weak.jsonl `
+  --output datasets/AnimeName/dmhy_weak.relabel.jsonl `
+  --manifest-output datasets/AnimeName/dmhy_weak.relabel.manifest.json `
+  --vocab-output datasets/AnimeName/vocab.relabel.json `
+  --base-vocab datasets/AnimeName/vocab.json `
+  --max-vocab-size 8000 `
+  --progress 50000
+```
+After checking the manifest and sample labels, replace the authoritative files:
+检查 manifest 和样本标注后，再替换权威文件：
+```powershell
+Move-Item datasets/AnimeName/dmhy_weak.relabel.jsonl datasets/AnimeName/dmhy_weak.jsonl -Force
+Move-Item datasets/AnimeName/vocab.relabel.json datasets/AnimeName/vocab.json -Force
+Move-Item datasets/AnimeName/dmhy_weak.relabel.manifest.json datasets/AnimeName/dmhy_weak.manifest.json -Force
+```
+## 4. Convert to Character Dataset / 转换为字符数据集
+The published checkpoint uses the character tokenizer.
+当前发布模型使用字符级 tokenizer。
+```powershell
+uv run python convert_to_char_dataset.py `
+  --input datasets/AnimeName/dmhy_weak.jsonl `
+  --output datasets/AnimeName/dmhy_weak_char.jsonl `
+  --vocab-output datasets/AnimeName/vocab.char.json `
+  --manifest-output datasets/AnimeName/dmhy_weak_char.manifest.json `
+  --progress 50000
+```
+## 5. Full Training / 全量训练
+Recommended RTX 3080 run:
+推荐 RTX 3080 训练命令：
+```powershell
+uv run python train.py --tokenizer char `
+  --data-file datasets/AnimeName/dmhy_weak_char.jsonl `
+  --vocab-file datasets/AnimeName/vocab.char.json `
+  --save-dir checkpoints/dmhy-char-full `
+  --init-model-dir . `
+  --epochs 2 `
+  --batch-size 256 `
+  --learning-rate 0.00008 `
+  --warmup-steps 300 `
+  --max-seq-length 128 `
+  --train-split 0.98 `
+  --num-workers 4 `
+  --checkpoint-steps 1000 `
+  --save-total-limit 3 `
+  --parse-eval-limit 2048 `
+  --case-eval-file data/parser_regression_cases.json `
+  --seed 52 `
+  --experiment-name dmhy-char-full
+```
+Training outputs:
+训练输出：
+- `checkpoints/<run>/checkpoint-*`: resumable checkpoints / 可恢复 checkpoint
+- `checkpoints/<run>/final`: final Hugging Face checkpoint / 最终 checkpoint
+- `final/run_metadata.json`: run configuration / 训练配置
+- `final/trainer_eval_metrics.json`: seqeval metrics / token/entity 指标
+- `final/parse_eval_metrics.json`: held-out parser exact-match / held-out 解析准确率
+- `final/case_metrics.json`: fixed real-world case regression / 固定真实 case 回归
+- TensorBoard logs unless `--no-tensorboard` is set / 默认写 TensorBoard
+## 6. Focus Fine-Tuning / 针对性微调
+Use focus fine-tuning only after a specific real-world failure pattern has been
+confirmed and added to `data/parser_regression_cases.json`.
+只有在确认某类真实失败样式，并加入 `data/parser_regression_cases.json` 后，才使用针对性微调。
+```powershell
+uv run python build_repair_focus_dataset.py `
+  --input datasets/AnimeName/dmhy_weak_char.jsonl `
+  --output data/repair_focus_char.jsonl `
+  --context-samples 50000 `
+  --repeat-repaired 4 `
+  --repeat-manual 24 `
+  --seed 75
+uv run python train.py --tokenizer char `
+  --data-file data/repair_focus_char.jsonl `
+  --vocab-file datasets/AnimeName/vocab.char.json `
+  --save-dir checkpoints/dmhy-char-special-focus `
+  --init-model-dir . `
+  --epochs 1 `
+  --batch-size 64 `
+  --learning-rate 0.00003 `
+  --warmup-steps 50 `
+  --max-seq-length 128 `
+  --train-split 0.95 `
+  --num-workers 0 `
+  --checkpoint-steps 500 `
+  --save-total-limit 2 `
+  --parse-eval-limit 512 `
+  --case-eval-file data/parser_regression_cases.json `
+  --seed 75 `
+  --experiment-name dmhy-char-special-focus
+```
+## 7. Publish to Repository Root / 发布到仓库根目录
+The repository root is the Hugging Face checkpoint surface.
+仓库根目录就是 Hugging Face checkpoint 发布面。
+```powershell
+$final = "checkpoints/dmhy-char-full/final"
+Copy-Item "$final/config.json" . -Force
+Copy-Item "$final/model.safetensors" . -Force
+Copy-Item "$final/tokenizer_config.json" . -Force
+Copy-Item "$final/training_args.bin" . -Force
+Copy-Item "$final/vocab.json" . -Force
+Copy-Item "$final/run_metadata.json" . -Force
+Copy-Item "$final/trainer_eval_metrics.json" . -Force
+Copy-Item "$final/parse_eval_metrics.json" . -Force
+Copy-Item "$final/case_metrics.json" . -Force
+Copy-Item datasets/AnimeName/vocab.char.json .\vocab.char.json -Force
+```
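For machines without PowerShell, the same copy step could be sketched in Python. `publish_checkpoint` is a hypothetical helper, not a script in the repository. It mirrors the file list above, does not cover the extra `vocab.char.json` copy from the dataset submodule, and checks that every file exists before copying anything.

```python
import shutil
from pathlib import Path
from typing import List

PUBLISH_FILES = (
    "config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "training_args.bin",
    "vocab.json",
    "run_metadata.json",
    "trainer_eval_metrics.json",
    "parse_eval_metrics.json",
    "case_metrics.json",
)


def publish_checkpoint(final_dir: Path, repo_root: Path) -> List[Path]:
    """Copy the checkpoint files into repo_root, failing before any copy if one is missing."""
    missing = [name for name in PUBLISH_FILES if not (final_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {final_dir}: {', '.join(missing)}")
    copied: List[Path] = []
    for name in PUBLISH_FILES:
        # copy2 overwrites an existing destination, like Copy-Item -Force.
        copied.append(Path(shutil.copy2(final_dir / name, repo_root / name)))
    return copied
```

Checking first avoids a repository root that mixes files from two different runs when the training output is incomplete.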
+Then export ONNX:
+然后导出 ONNX：
+```powershell
+uv run python export_onnx.py --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
+```
+## 8. Validation Checklist / 验证清单
+Run these before committing:
+提交前执行：
+```powershell
+uv run python -m py_compile tokenizer.py dataset.py dmhy_dataset.py label_repairs.py train.py inference.py export_onnx.py onnx_inference.py
+uv run python evaluate_parser_cases.py --model-dir . --case-file data/parser_regression_cases.json --output case_metrics.json
+uv run python inference.py --model-dir . "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
+uv run python onnx_inference.py "[YYDM&VCB-Studio] Shinsekai Yori [NCED02][Ma10p_1080p][x265_flac].mkv"
+```
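The regression result can also gate a commit automatically. The sketch below reads the keys that `train.py` prints from the case metrics (`full_correct`, `case_count`, `full_accuracy`, `failures`). It assumes that `evaluate_parser_cases.py --output` writes the same dictionary, and `case_gate` with its `min_accuracy` threshold is a hypothetical helper.

```python
import json
from pathlib import Path


def case_gate(metrics_path: Path, min_accuracy: float = 1.0) -> bool:
    """Print a summary of case_metrics.json and report whether it meets the threshold."""
    metrics = json.loads(metrics_path.read_text(encoding="utf-8"))
    print(
        f"full_match: {metrics['full_correct']}/"
        f"{metrics['case_count']} ({metrics['full_accuracy']:.4f})"
    )
    # The structure of each failure entry is not documented, so dump it as-is.
    for failure in metrics.get("failures", []):
        print("  failure:", json.dumps(failure, ensure_ascii=False))
    return metrics["full_accuracy"] >= min_accuracy
```

A wrapper script could call `case_gate(Path("case_metrics.json"))` and exit with a non-zero status when it returns `False`.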
+## 9. Git and LFS Order / Git 与 LFS 顺序
+If the dataset submodule changed:
+如果数据集子模块有变动：
+```powershell
+git -C datasets/AnimeName add dmhy_weak.jsonl dmhy_weak.manifest.json dmhy_weak_char.jsonl dmhy_weak_char.manifest.json vocab.json vocab.char.json
+git -C datasets/AnimeName commit -m "Update anime filename labels"
+git -C datasets/AnimeName lfs push origin main --all
+git -C datasets/AnimeName push origin main
+```
+Then commit the model repo:
+再提交模型仓库：
+```powershell
+git add README.md MAINTENANCE.md ANDROID.md docs/training.md docs/onnx.md `
+  config.json model.safetensors tokenizer_config.json training_args.bin vocab.json vocab.char.json `
+  exports/anime_filename_parser.onnx exports/anime_filename_parser.metadata.json `
+  train.py inference.py export_onnx.py onnx_inference.py data/parser_regression_cases.json datasets/AnimeName
+git commit -m "Update AniFileBERT model and documentation"
+git lfs push origin main --all
+git push origin main
+```

export_onnx.py CHANGED

@@ -66,9 +66,9 @@ def copy_android_assets(model_dir: Path, onnx_path: Path, assets_dir: Path) -> N
 def main() -> None:
     parser = argparse.ArgumentParser(description="Export anime filename parser to ONNX")
-    parser.add_argument("--model-dir", default="checkpoints/final", help="HuggingFace checkpoint directory")
     parser.add_argument("--output", default="exports/anime_filename_parser.onnx", help="Output ONNX file")
-    parser.add_argument("--max-length", type=int, default=64, help="Fixed sequence length used on Android")
     parser.add_argument(
         "--android-assets-dir",
         help="Optional Android assets directory that receives the ONNX model, vocab, and config",

 def main() -> None:
     parser = argparse.ArgumentParser(description="Export anime filename parser to ONNX")
+    parser.add_argument("--model-dir", default=".", help="HuggingFace checkpoint directory")
     parser.add_argument("--output", default="exports/anime_filename_parser.onnx", help="Output ONNX file")
+    parser.add_argument("--max-length", type=int, default=128, help="Fixed sequence length used on Android")
     parser.add_argument(
         "--android-assets-dir",
         help="Optional Android assets directory that receives the ONNX model, vocab, and config",

onnx_inference.py ADDED

	@@ -0,0 +1,105 @@

+"""
+Minimal ONNX Runtime inference example for AniFileBERT.
+The ONNX file outputs token logits only. End-to-end parsing still needs the
+repository tokenizer, constrained BIO decoding, and the same field aggregation
+used by inference.py.
+Usage:
+    python onnx_inference.py "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
+"""
+import argparse
+import json
+from pathlib import Path
+from typing import Dict, List, Tuple
+import numpy as np
+import onnxruntime as ort
+import torch
+from inference import constrained_bio_decode, postprocess
+from tokenizer import AnimeTokenizer, load_tokenizer
+def encode(
+    filename: str,
+    tokenizer: AnimeTokenizer,
+    max_length: int,
+) -> Tuple[List[str], np.ndarray, np.ndarray, int]:
+    tokens = tokenizer.tokenize(filename)
+    # Reserve two positions for [CLS] and [SEP]; longer filenames are truncated.
+    available = min(len(tokens), max_length - 2)
+    used_tokens = tokens[:available]
+    input_ids = [tokenizer.cls_token_id]
+    input_ids.extend(tokenizer.convert_tokens_to_ids(used_tokens))
+    input_ids.append(tokenizer.sep_token_id)
+    attention_mask = [1] * len(input_ids)
+    pad_len = max_length - len(input_ids)
+    if pad_len > 0:
+        input_ids.extend([tokenizer.pad_token_id] * pad_len)
+        attention_mask.extend([0] * pad_len)
+    return (
+        used_tokens,
+        np.asarray([input_ids], dtype=np.int64),
+        np.asarray([attention_mask], dtype=np.int64),
+        available,
+    )
+def load_id2label(model_dir: Path) -> Dict[int, str]:
+    config = json.loads((model_dir / "config.json").read_text(encoding="utf-8"))
+    return {int(label_id): label for label_id, label in config["id2label"].items()}
+def parse_with_onnx(
+    filename: str,
+    model_dir: Path,
+    onnx_path: Path,
+    max_length: int,
+    use_rules: bool = True,
+) -> Dict:
+    tokenizer = load_tokenizer(str(model_dir))
+    id2label = load_id2label(model_dir)
+    tokens, input_ids, attention_mask, available = encode(filename, tokenizer, max_length)
+    session = ort.InferenceSession(str(onnx_path), providers=["CPUExecutionProvider"])
+    logits = session.run(
+        ["logits"],
+        {
+            "input_ids": input_ids,
+            "attention_mask": attention_mask,
+        },
+    )[0]
+    # Skip [CLS] at index 0 and keep only the logits of the real tokens.
+    token_logits = torch.from_numpy(logits[0, 1:1 + available, :])
+    label_ids = constrained_bio_decode(token_logits, id2label)
+    labels = [id2label.get(label_id, "O") for label_id in label_ids]
+    result = postprocess(tokens, labels, tokenizer=tokenizer, filename=filename, use_rules=use_rules)
+    result["_input"] = filename
+    return result
+def main() -> None:
+    parser = argparse.ArgumentParser(description="Run AniFileBERT ONNX inference")
+    parser.add_argument("filename", help="Anime filename to parse")
+    parser.add_argument("--model-dir", default=".", help="Directory containing vocab.json and config.json")
+    parser.add_argument("--onnx", default="exports/anime_filename_parser.onnx", help="ONNX model path")
+    parser.add_argument("--max-length", type=int, default=128, help="Static ONNX sequence length")
+    parser.add_argument("--no-rule-assist", action="store_true", help="Disable structural postprocessing")
+    args = parser.parse_args()
+    result = parse_with_onnx(
+        filename=args.filename,
+        model_dir=Path(args.model_dir),
+        onnx_path=Path(args.onnx),
+        max_length=args.max_length,
+        use_rules=not args.no_rule_assist,
+    )
+    print(json.dumps(result, ensure_ascii=False))
+if __name__ == "__main__":
+    main()

train.py CHANGED

@@ -1,11 +1,9 @@
 """
-Training script for anime filename parser.
-Trains a Tiny BERT model for token classification on synthetic anime filename data.
-Uses HuggingFace Trainer for CPU training.
-Usage:
-    python train.py
 """
 import os
@@ -106,6 +104,12 @@ def parse_args() -> argparse.Namespace:
                         help="Optional experiment name written to run_metadata.json")
     parser.add_argument("--parse-eval-limit", type=int, default=512,
                         help="Run field exact-match evaluation on up to N eval samples after training; 0 disables it")
     parser.add_argument("--hidden-size", type=int, default=None, help="Override BERT hidden size")
     parser.add_argument("--num-hidden-layers", type=int, default=None, help="Override BERT layer count")
     parser.add_argument("--num-attention-heads", type=int, default=None, help="Override BERT attention heads")
@@ -626,6 +630,32 @@ def main():
             total = parse_metrics["field_total"][field]
             print(f"  {field}: {correct}/{total} ({accuracy:.4f})")
 if __name__ == "__main__":
     main()

 """
+Train AniFileBERT for structured anime filename parsing.
+The training loop keeps the existing PyTorch/Transformers stack, writes
+Hugging Face checkpoints, records token/entity metrics, and also evaluates
+end-to-end parser exact-match on held-out filenames and fixed real-world cases.
 """
 import os
                         help="Optional experiment name written to run_metadata.json")
     parser.add_argument("--parse-eval-limit", type=int, default=512,
                         help="Run field exact-match evaluation on up to N eval samples after training; 0 disables it")
+    parser.add_argument("--case-eval-file", default=os.path.join("data", "parser_regression_cases.json"),
+                        help="Fixed real-world parser regression case file evaluated after training")
+    parser.add_argument("--case-eval-output", default=None,
+                        help="Optional output path for fixed case metrics; defaults to final/case_metrics.json")
+    parser.add_argument("--no-case-eval", action="store_true",
+                        help="Skip fixed real-world parser regression evaluation")
     parser.add_argument("--hidden-size", type=int, default=None, help="Override BERT hidden size")
     parser.add_argument("--num-hidden-layers", type=int, default=None, help="Override BERT layer count")
     parser.add_argument("--num-attention-heads", type=int, default=None, help="Override BERT attention heads")
             total = parse_metrics["field_total"][field]
             print(f"  {field}: {correct}/{total} ({accuracy:.4f})")
+    if not args.no_case_eval:
+        if args.case_eval_file and os.path.isfile(args.case_eval_file):
+            from evaluate_parser_cases import evaluate_cases
+            case_metrics = evaluate_cases(
+                model_dir=final_save_path,
+                case_file=args.case_eval_file,
+                tokenizer_variant=tokenizer_variant,
+                max_length=config.max_seq_length,
+                use_rules=True,
+                constrain_bio=True,
+            )
+            case_output = args.case_eval_output or os.path.join(final_save_path, "case_metrics.json")
+            os.makedirs(os.path.dirname(case_output) or ".", exist_ok=True)
+            with open(case_output, "w", encoding="utf-8") as f:
+                json.dump(case_metrics, f, ensure_ascii=False, indent=2)
+            print("\nFixed case regression evaluation:")
+            print(
+                f"  full_match: {case_metrics['full_correct']}/"
+                f"{case_metrics['case_count']} ({case_metrics['full_accuracy']:.4f})"
+            )
+            if case_metrics["failures"]:
+                print(f"  failures: {len(case_metrics['failures'])} (see {case_output})")
+        elif args.case_eval_file:
+            print(f"\nSkipping fixed case regression evaluation; file not found: {args.case_eval_file}")
 if __name__ == "__main__":
     main()