Polish Hugging Face repository docs
- .gitignore +1 -0
- ANDROID.md +86 -30
- MAINTENANCE.md +127 -61
- README.md +190 -145
- docs/onnx.md +154 -0
- docs/training.md +233 -0
- export_onnx.py +2 -2
- onnx_inference.py +105 -0
- train.py +36 -6
.gitignore
CHANGED
|
@@ -9,6 +9,7 @@ test_checkpoints*/
|
|
| 9 |
ab_checkpoints*/
|
| 10 |
*.log
|
| 11 |
*.onnx.data
|
|
|
|
| 12 |
data/**/*.jsonl
|
| 13 |
!data/synthetic_small.jsonl
|
| 14 |
!data/test_smoke.jsonl
|
|
|
|
| 9 |
ab_checkpoints*/
|
| 10 |
*.log
|
| 11 |
*.onnx.data
|
| 12 |
+
docs/training_notes.md
|
| 13 |
data/**/*.jsonl
|
| 14 |
!data/synthetic_small.jsonl
|
| 15 |
!data/test_smoke.jsonl
|
ANDROID.md
CHANGED
|
@@ -1,58 +1,114 @@
|
|
| 1 |
-
# Android
|
| 2 |
|
| 3 |
-
|
| 4 |
-
`tools/anime_parser`. It contains the Python training pipeline plus an ONNX
|
| 5 |
-
export path for Android.
|
| 6 |
|
| 7 |
-
|
| 8 |
-
test procedure, see MiruPlay's `docs/anime-filename-parser.md`.
|
| 9 |
|
| 10 |
-
## Export
|
| 11 |
|
| 12 |
-
From
|
| 13 |
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
|
|
|
|
|
|
|
| 17 |
```
|
| 18 |
|
| 19 |
The exporter writes:
|
| 20 |
|
|
|
|
|
|
|
| 21 |
- `exports/anime_filename_parser.onnx`
|
| 22 |
- `exports/anime_filename_parser.metadata.json`
|
| 23 |
- `scraper/src/main/assets/anime_parser/anime_filename_parser.onnx`
|
| 24 |
- `scraper/src/main/assets/anime_parser/vocab.json`
|
| 25 |
- `scraper/src/main/assets/anime_parser/config.json`
|
| 26 |
|
| 27 |
-
|
| 28 |
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
- `logits`: `float32[1,64,15]`
|
| 32 |
|
| 33 |
-
|
| 34 |
-
|
| 35 |
|
| 36 |
-
##
|
| 37 |
|
| 38 |
-
|
| 39 |
-
|
|
|
|
| 40 |
|
| 41 |
-
|
| 42 |
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
|
| 48 |
-
|
| 49 |
|
| 50 |
-
|
| 51 |
-
|
| 52 |
```
|
| 53 |
|
| 54 |
-
|
| 55 |
|
| 56 |
```text
|
| 57 |
-
|
|
|
|
|
|
|
| 58 |
```
|
| 1 |
+
# Android Export and Runtime / Android 导出与运行时
|
| 2 |
|
| 3 |
+
AniFileBERT is used by MiruPlay as a Git submodule at `tools/anime_parser`.
|
|
|
|
|
|
|
| 4 |
|
| 5 |
+
AniFileBERT 在 MiruPlay 中作为 `tools/anime_parser` 子模块使用。
|
|
|
|
| 6 |
|
| 7 |
+
## Export / 导出
|
| 8 |
|
| 9 |
+
From this repository root, export the published root checkpoint:
|
| 10 |
|
| 11 |
+
在本仓库根目录导出当前发布 checkpoint:
|
| 12 |
+
|
| 13 |
+
```powershell
|
| 14 |
+
uv sync
|
| 15 |
+
uv run python export_onnx.py --model-dir . --max-length 128 --android-assets-dir ../../scraper/src/main/assets/anime_parser
|
| 16 |
```
|
| 17 |
|
| 18 |
The exporter writes:
|
| 19 |
|
| 20 |
+
导出器会写入:
|
| 21 |
+
|
| 22 |
- `exports/anime_filename_parser.onnx`
|
| 23 |
- `exports/anime_filename_parser.metadata.json`
|
| 24 |
- `scraper/src/main/assets/anime_parser/anime_filename_parser.onnx`
|
| 25 |
- `scraper/src/main/assets/anime_parser/vocab.json`
|
| 26 |
- `scraper/src/main/assets/anime_parser/config.json`
|
| 27 |
|
| 28 |
+
## Static Graph Shape / 静态图 Shape
|
| 29 |
+
|
| 30 |
+
```text
|
| 31 |
+
input_ids int64[1,128]
|
| 32 |
+
attention_mask int64[1,128]
|
| 33 |
+
logits float32[1,128,15]
|
| 34 |
+
```
|
| 35 |
|
| 36 |
+
The current export is verified against PyTorch, with max absolute logits
|
| 37 |
+
difference recorded in `exports/anime_filename_parser.metadata.json`.
|
|
|
|
| 38 |
|
| 39 |
+
当前导出会和 PyTorch 做数值对齐,最大 logits 误差记录在
|
| 40 |
+
`exports/anime_filename_parser.metadata.json`。
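The parity check described above can be sketched in a few lines. This is a minimal illustration, not the exporter's actual code: `max_abs_diff`, the toy logits, and the `1e-4` tolerance are assumptions made for the example.

```python
def max_abs_diff(a, b):
    """Largest element-wise |a - b| over two equally shaped nested lists."""
    if isinstance(a, (int, float)):
        return abs(a - b)
    if len(a) != len(b):
        raise ValueError("shape mismatch")
    return max((max_abs_diff(x, y) for x, y in zip(a, b)), default=0.0)


# Toy stand-ins for PyTorch and ONNX logits of shape [1, 2, 2].
torch_logits = [[[0.10, -1.20], [2.50, 0.00]]]
onnx_logits = [[[0.10, -1.25], [2.50, 0.00]]]

diff = max_abs_diff(torch_logits, onnx_logits)
# Hypothetical tolerance; the value the exporter enforces is not stated here.
within_tolerance = diff < 1e-4
```

In this toy case one element differs by about 0.05, so the check fails; a healthy export would be expected to stay several orders of magnitude below that.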
|
| 41 |
|
| 42 |
+
## Local ONNX Smoke Test / 本地 ONNX 冒烟测试
|
| 43 |
|
| 44 |
+
```powershell
|
| 45 |
+
uv run python onnx_inference.py "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
|
| 46 |
+
```
|
| 47 |
|
| 48 |
+
Expected fields / 期望字段:
|
| 49 |
|
| 50 |
+
```text
|
| 51 |
+
title=神印王座, episode=200, group=GM-Team, resolution=1080P, source=GB
|
| 52 |
+
```
|
| 53 |
+
|
| 54 |
+
Special-code example / 特典编号示例:
|
| 55 |
+
|
| 56 |
+
```powershell
|
| 57 |
+
uv run python onnx_inference.py "[YYDM&VCB-Studio] Shinsekai Yori [NCED02][Ma10p_1080p][x265_flac].mkv"
|
| 58 |
+
```
|
| 59 |
+
|
| 60 |
+
Expected fields / 期望字段:
|
| 61 |
+
|
| 62 |
+
```text
|
| 63 |
+
title=Shinsekai Yori, episode=null, group=YYDM&VCB-Studio, special=NCED02
|
| 64 |
+
```
|
| 65 |
|
| 66 |
+
## Runtime Contract / 运行时契约
|
| 67 |
|
| 68 |
+
The ONNX graph returns token logits only. Android must implement the same:
|
| 69 |
+
|
| 70 |
+
ONNX 图只返回 token logits。Android 必须实现同一套:
|
| 71 |
+
|
| 72 |
+
- custom character tokenizer / 自定义字符 tokenizer
|
| 73 |
+
- token id lookup from `vocab.json` / 使用 `vocab.json` 查 token id
|
| 74 |
+
- fixed-length padding to 128 / padding 到固定长度 128
|
| 75 |
+
- constrained BIO decoding / 约束 BIO 解码
|
| 76 |
+
- field aggregation / 字段聚合
|
| 77 |
+
- high-confidence structural cleanup / 高置信结构修正
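The first three items of the contract (character tokenization, vocabulary lookup, fixed-length padding) might look roughly like the sketch below. It assumes the tokenizer wraps the characters in `[CLS]`/`[SEP]` and pads on the right with `[PAD]`; `encode` and `toy_vocab` are hypothetical names, and the real ids must come from `vocab.json`.

```python
MAX_LENGTH = 128  # matches the static ONNX input shape int64[1,128]


def encode(filename: str, vocab: dict, max_length: int = MAX_LENGTH):
    """Character-level encoding with [CLS]/[SEP] and right padding (illustrative)."""
    # Reserve two positions for [CLS] and [SEP]; truncate anything longer.
    chars = list(filename)[: max_length - 2]
    unk = vocab["[UNK]"]
    ids = [vocab["[CLS]"]] + [vocab.get(ch, unk) for ch in chars] + [vocab["[SEP]"]]
    mask = [1] * len(ids)
    pad = max_length - len(ids)
    ids += [vocab["[PAD]"]] * pad
    mask += [0] * pad
    return ids, mask


# Toy vocabulary for demonstration only; it is not the shipped vocab.json.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[": 4, "]": 5, "A": 6}
input_ids, attention_mask = encode("[A]x", toy_vocab)
```

Both lists always have length 128, so they can be fed to the static graph as `int64[1,128]` tensors without reshaping logic on the Android side.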
|
| 78 |
+
|
| 79 |
+
The Android runtime implementation lives in MiruPlay:
|
| 80 |
+
|
| 81 |
+
Android 运行时实现位于 MiruPlay:
|
| 82 |
+
|
| 83 |
+
```text
|
| 84 |
+
scraper/src/main/kotlin/com/miruplay/tv/scraper/filename/AnimeFilenameParser.kt
|
| 85 |
```
|
| 86 |
|
| 87 |
+
The app exposes it through `FilenameMetadataParser` in `core:model`. During a
|
| 88 |
+
scan, `ScanCoordinator` passes that parser into `VideoDirectoryClassifier`.
|
| 89 |
+
|
| 90 |
+
应用通过 `core:model` 的 `FilenameMetadataParser` 暴露解析能力。扫描时,
|
| 91 |
+
`ScanCoordinator` 会把解析器传给 `VideoDirectoryClassifier`。
|
| 92 |
+
|
| 93 |
+
## Asset Update Rule / 资产更新规则
|
| 94 |
+
|
| 95 |
+
When updating the parser, keep these files in sync:
|
| 96 |
+
|
| 97 |
+
更新解析器时,以下文件必须同步:
|
| 98 |
|
| 99 |
```text
|
| 100 |
+
anime_filename_parser.onnx
|
| 101 |
+
vocab.json
|
| 102 |
+
config.json
|
| 103 |
```
|
| 104 |
+
|
| 105 |
+
Do not update only the ONNX file. Token ids, label ids, and max length are part
|
| 106 |
+
of the runtime contract.
|
| 107 |
+
|
| 108 |
+
不要只更新 ONNX。token id、label id 和 max length 都是运行时契约的一部分。
|
| 109 |
+
|
| 110 |
+
## More Details / 更多说明
|
| 111 |
+
|
| 112 |
+
See [`docs/onnx.md`](docs/onnx.md) for a minimal Python ONNX Runtime reference.
|
| 113 |
+
|
| 114 |
+
最小 Python ONNX Runtime 参考见 [`docs/onnx.md`](docs/onnx.md)。
|
MAINTENANCE.md
CHANGED
|
@@ -1,117 +1,183 @@
|
|
| 1 |
-
# AniFileBERT Maintenance
|
| 2 |
|
| 3 |
This repository is the standalone Hugging Face model repo used by MiruPlay as
|
| 4 |
`tools/anime_parser`.
|
| 5 |
|
| 6 |
-
|
| 7 |
|
| 8 |
-
|
| 9 |
-
|------------|-----|---------|
|
| 10 |
-
| AniFileBERT | `https://huggingface.co/ModerRAS/AniFileBERT` | Model, training scripts, ONNX export |
|
| 11 |
-
| AnimeName | `https://huggingface.co/datasets/ModerRAS/AnimeName` | Training datasets and manifests |
|
| 12 |
-
| MiruPlay | `https://github.com/ModerRAS/MiruPlay` | Android app and runtime integration |
|
| 13 |
|
| 14 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 15 |
|
| 16 |
```text
|
| 17 |
AniFileBERT
|
| 18 |
datasets/AnimeName -> ModerRAS/AnimeName
|
| 19 |
```
|
| 20 |
|
| 21 |
-
## Clone
|
| 22 |
|
| 23 |
-
```
|
| 24 |
git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
|
| 25 |
```
|
| 26 |
|
| 27 |
-
After a normal clone
|
| 28 |
|
| 29 |
-
```
|
| 30 |
git submodule update --init --recursive
|
|
|
|
| 31 |
```
|
| 32 |
|
| 33 |
-
##
|
|
|
|
|
|
|
| 34 |
|
| 35 |
-
|
| 36 |
|
| 37 |
```text
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 41 |
```
|
| 42 |
|
| 43 |
-
|
|
|
|
| 44 |
|
| 45 |
-
|
| 46 |
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
--
|
| 58 |
-
--
|
| 59 |
-
--
|
| 60 |
-
--
|
| 61 |
-
--
|
| 62 |
-
--
|
| 63 |
```
|
| 64 |
|
| 65 |
-
## Publish a New Checkpoint
|
|
|
|
|
|
|
| 66 |
|
| 67 |
-
|
| 68 |
|
| 69 |
```powershell
|
| 70 |
-
|
| 71 |
-
Copy-Item
|
| 72 |
-
Copy-Item
|
| 73 |
-
Copy-Item
|
| 74 |
-
Copy-Item
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 75 |
Copy-Item datasets/AnimeName/vocab.char.json .\vocab.char.json -Force
|
| 76 |
-
Copy-Item checkpoints/dmhy-char-guoman-relabel/final/run_metadata.json . -Force
|
| 77 |
-
Copy-Item checkpoints/dmhy-char-guoman-relabel/final/trainer_eval_metrics.json . -Force
|
| 78 |
-
Copy-Item checkpoints/dmhy-char-guoman-relabel/final/parse_eval_metrics.json . -Force
|
| 79 |
```
|
| 80 |
|
| 81 |
-
|
| 82 |
-
surface; ignored `checkpoints/` directories are training artifacts.
|
| 83 |
|
| 84 |
-
|
|
|
|
|
|
|
| 85 |
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
|
| 90 |
```
|
| 91 |
|
| 92 |
-
|
| 93 |
|
| 94 |
-
|
| 95 |
|
| 96 |
-
```
|
| 97 |
-
git submodule update --remote datasets/AnimeName
|
| 98 |
git add datasets/AnimeName
|
| 99 |
git commit -m "Update AnimeName dataset pointer"
|
| 100 |
git push origin main
|
| 101 |
```
|
| 102 |
|
| 103 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 104 |
|
| 105 |
-
|
| 106 |
|
| 107 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 108 |
git submodule update --remote --recursive tools/anime_parser
|
| 109 |
git add tools/anime_parser
|
| 110 |
git commit -m "Update AniFileBERT submodule"
|
| 111 |
-
git push origin master
|
| 112 |
```
|
| 113 |
|
| 114 |
-
If
|
|
|
|
|
|
|
| 115 |
|
| 116 |
```text
|
| 117 |
scraper/src/main/assets/anime_parser/anime_filename_parser.onnx
|
|
|
|
| 1 |
+
# AniFileBERT Maintenance / 维护手册
|
| 2 |
|
| 3 |
This repository is the standalone Hugging Face model repo used by MiruPlay as
|
| 4 |
`tools/anime_parser`.
|
| 5 |
|
| 6 |
+
本仓库是 MiruPlay 通过 `tools/anime_parser` 引用的独立 Hugging Face 模型仓库。
|
| 7 |
|
| 8 |
+
## Related Repositories / 相关仓库
|
|
|
|
|
|
|
|
|
|
|
|
|
| 9 |
|
| 10 |
+
| Repository / 仓库 | URL | Purpose / 用途 |
|
| 11 |
+
| --- | --- | --- |
|
| 12 |
+
| AniFileBERT | `https://huggingface.co/ModerRAS/AniFileBERT` | Model, scripts, ONNX export / 模型、脚本、ONNX 导出 |
|
| 13 |
+
| AnimeName | `https://huggingface.co/datasets/ModerRAS/AnimeName` | Dataset snapshot / 数据集快照 |
|
| 14 |
+
| MiruPlay | `https://github.com/ModerRAS/MiruPlay` | Android integration / Android 集成 |
|
| 15 |
+
|
| 16 |
+
Nested structure / 嵌套结构:
|
| 17 |
|
| 18 |
```text
|
| 19 |
AniFileBERT
|
| 20 |
datasets/AnimeName -> ModerRAS/AnimeName
|
| 21 |
```
|
| 22 |
|
| 23 |
+
## Clone / 克隆
|
| 24 |
|
| 25 |
+
```powershell
|
| 26 |
git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
|
| 27 |
```
|
| 28 |
|
| 29 |
+
After a normal clone / 普通 clone 后:
|
| 30 |
|
| 31 |
+
```powershell
|
| 32 |
git submodule update --init --recursive
|
| 33 |
+
uv sync
|
| 34 |
```
|
| 35 |
|
| 36 |
+
## Publishing Surface / 发布面
|
| 37 |
+
|
| 38 |
+
The repository root is the only published Hugging Face checkpoint location:
|
| 39 |
|
| 40 |
+
仓库根目录是唯一的 Hugging Face checkpoint 发布位置:
|
| 41 |
|
| 42 |
```text
|
| 43 |
+
config.json
|
| 44 |
+
model.safetensors
|
| 45 |
+
tokenizer_config.json
|
| 46 |
+
training_args.bin
|
| 47 |
+
vocab.json
|
| 48 |
+
vocab.char.json
|
| 49 |
+
run_metadata.json
|
| 50 |
+
trainer_eval_metrics.json
|
| 51 |
+
parse_eval_metrics.json
|
| 52 |
+
case_metrics.json
|
| 53 |
```
|
| 54 |
|
| 55 |
+
There is no tracked `model/` duplicate. Ignored `checkpoints/` directories are
|
| 56 |
+
local training artifacts only.
|
| 57 |
|
| 58 |
+
仓库不再跟踪旧的 `model/` 副本。被 ignore 的 `checkpoints/` 仅是本地训练产物。
|
| 59 |
|
| 60 |
+
## Standard Training / 标准训练
|
| 61 |
+
|
| 62 |
+
For full details, see [`docs/training.md`](docs/training.md).
|
| 63 |
+
|
| 64 |
+
完整流程见 [`docs/training.md`](docs/training.md)。
|
| 65 |
+
|
| 66 |
+
Recommended full training command / 推荐全量训练命令:
|
| 67 |
+
|
| 68 |
+
```powershell
|
| 69 |
+
uv run python train.py --tokenizer char `
|
| 70 |
+
--data-file datasets/AnimeName/dmhy_weak_char.jsonl `
|
| 71 |
+
--vocab-file datasets/AnimeName/vocab.char.json `
|
| 72 |
+
--save-dir checkpoints/dmhy-char-full `
|
| 73 |
+
--init-model-dir . `
|
| 74 |
+
--epochs 2 `
|
| 75 |
+
--batch-size 256 `
|
| 76 |
+
--learning-rate 0.00008 `
|
| 77 |
+
--warmup-steps 300 `
|
| 78 |
+
--max-seq-length 128 `
|
| 79 |
+
--train-split 0.98 `
|
| 80 |
+
--num-workers 4 `
|
| 81 |
+
--checkpoint-steps 1000 `
|
| 82 |
+
--save-total-limit 3 `
|
| 83 |
+
--parse-eval-limit 2048 `
|
| 84 |
+
--case-eval-file data/parser_regression_cases.json `
|
| 85 |
+
--seed 52 `
|
| 86 |
+
--experiment-name dmhy-char-full
|
| 87 |
```
|
| 88 |
|
| 89 |
+
## Publish a New Checkpoint / 发布新 checkpoint
|
| 90 |
+
|
| 91 |
+
Copy final files to the repository root:
|
| 92 |
|
| 93 |
+
把 `final` 文件复制到仓库根目录:
|
| 94 |
|
| 95 |
```powershell
|
| 96 |
+
$final = "checkpoints/dmhy-char-full/final"
|
| 97 |
+
Copy-Item "$final/config.json" . -Force
|
| 98 |
+
Copy-Item "$final/model.safetensors" . -Force
|
| 99 |
+
Copy-Item "$final/tokenizer_config.json" . -Force
|
| 100 |
+
Copy-Item "$final/training_args.bin" . -Force
|
| 101 |
+
Copy-Item "$final/vocab.json" . -Force
|
| 102 |
+
Copy-Item "$final/run_metadata.json" . -Force
|
| 103 |
+
Copy-Item "$final/trainer_eval_metrics.json" . -Force
|
| 104 |
+
Copy-Item "$final/parse_eval_metrics.json" . -Force
|
| 105 |
+
Copy-Item "$final/case_metrics.json" . -Force
|
| 106 |
Copy-Item datasets/AnimeName/vocab.char.json .\vocab.char.json -Force
|
|
|
|
|
|
|
|
|
|
| 107 |
```
|
| 108 |
|
| 109 |
+
Export ONNX / 导出 ONNX:
|
|
|
|
| 110 |
|
| 111 |
+
```powershell
|
| 112 |
+
uv run python export_onnx.py --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
|
| 113 |
+
```
|
| 114 |
|
| 115 |
+
Validate / 验证:
|
| 116 |
+
|
| 117 |
+
```powershell
|
| 118 |
+
uv run python evaluate_parser_cases.py --model-dir . --case-file data/parser_regression_cases.json --output case_metrics.json
|
| 119 |
+
uv run python onnx_inference.py "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
|
| 120 |
+
```
|
| 121 |
+
|
| 122 |
+
## Dataset Submodule / 数据集子模块
|
| 123 |
+
|
| 124 |
+
If `datasets/AnimeName` changed, commit and push it first:
|
| 125 |
+
|
| 126 |
+
如果 `datasets/AnimeName` 有变动,先提交并推送它:
|
| 127 |
+
|
| 128 |
+
```powershell
|
| 129 |
+
git -C datasets/AnimeName add dmhy_weak.jsonl dmhy_weak.manifest.json dmhy_weak_char.jsonl dmhy_weak_char.manifest.json vocab.json vocab.char.json
|
| 130 |
+
git -C datasets/AnimeName commit -m "Update anime filename labels"
|
| 131 |
+
git -C datasets/AnimeName lfs push origin main --all
|
| 132 |
+
git -C datasets/AnimeName push origin main
|
| 133 |
```
|
| 134 |
|
| 135 |
+
Then commit the submodule pointer in this repo:
|
| 136 |
|
| 137 |
+
然后在本仓库提交 submodule pointer:
|
| 138 |
|
| 139 |
+
```powershell
|
|
|
|
| 140 |
git add datasets/AnimeName
|
| 141 |
git commit -m "Update AnimeName dataset pointer"
|
| 142 |
+
```
|
| 143 |
+
|
| 144 |
+
## LFS Push Order / LFS 推送顺序
|
| 145 |
+
|
| 146 |
+
Large model artifacts are tracked with Git LFS. If Hugging Face rejects a push
|
| 147 |
+
because an LFS pointer points to a missing object, upload LFS objects first:
|
| 148 |
+
|
| 149 |
+
大模型文件通过 Git LFS 跟踪。如果 Hugging Face 因 LFS pointer 缺对象拒绝 push,
|
| 150 |
+
先上传 LFS 对象:
|
| 151 |
+
|
| 152 |
+
```powershell
|
| 153 |
+
git lfs push origin main --all
|
| 154 |
git push origin main
|
| 155 |
```
|
| 156 |
|
| 157 |
+
For dataset changes:
|
| 158 |
+
|
| 159 |
+
数据集变动:
|
| 160 |
+
|
| 161 |
+
```powershell
|
| 162 |
+
git -C datasets/AnimeName lfs push origin main --all
|
| 163 |
+
git -C datasets/AnimeName push origin main
|
| 164 |
+
```
|
| 165 |
|
| 166 |
+
## Update MiruPlay / 更新 MiruPlay
|
| 167 |
|
| 168 |
+
From MiruPlay root:
|
| 169 |
+
|
| 170 |
+
在 MiruPlay 根目录:
|
| 171 |
+
|
| 172 |
+
```powershell
|
| 173 |
git submodule update --remote --recursive tools/anime_parser
|
| 174 |
git add tools/anime_parser
|
| 175 |
git commit -m "Update AniFileBERT submodule"
|
|
|
|
| 176 |
```
|
| 177 |
|
| 178 |
+
If Android assets changed, also stage:
|
| 179 |
+
|
| 180 |
+
如果 Android assets 变化,也要提交:
|
| 181 |
|
| 182 |
```text
|
| 183 |
scraper/src/main/assets/anime_parser/anime_filename_parser.onnx
|
README.md
CHANGED
|
@@ -3,93 +3,100 @@ license: apache-2.0
|
|
| 3 |
library_name: transformers
|
| 4 |
pipeline_tag: token-classification
|
| 5 |
tags:
|
| 6 |
-
- anime
|
| 7 |
-
- filename-parsing
|
| 8 |
-
- bert
|
| 9 |
-
- token-classification
|
|
|
|
| 10 |
datasets:
|
| 11 |
-
- ModerRAS/AnimeName
|
| 12 |
language:
|
| 13 |
-
- en
|
| 14 |
-
- ja
|
| 15 |
-
- zh
|
| 16 |
---
|
| 17 |
|
| 18 |
# AniFileBERT
|
| 19 |
|
| 20 |
-
AniFileBERT
|
| 21 |
|
| 22 |
-
|
| 23 |
|
| 24 |
-
|
| 25 |
|
| 26 |
-
|
| 27 |
-
- Hidden size: 256
|
| 28 |
-
- Layers: 4
|
| 29 |
-
- Attention heads: 8
|
| 30 |
-
- Labels: BIO token labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, and `SPECIAL`
|
| 31 |
-
- Tokenizer: custom character tokenizer implemented in `tokenizer.py`
|
| 32 |
-
- Max sequence length: 128
|
| 33 |
-
- Parameters: 4,783,631
|
| 34 |
|
| 35 |
-
|
| 36 |
|
| 37 |
-
|
| 38 |
|
| 39 |
-
|
| 40 |
|
| 41 |
-
|
| 42 |
|
| 43 |
-
|
| 44 |
-
- Next incremental export: `--min-id 1675185`
|
| 45 |
-
- Weak-labeled samples: `632002`
|
| 46 |
-
- Mixed training samples: `732002`
|
| 47 |
|
| 48 |
-
|
|
|
|
|
|
|
| 49 |
|
| 50 |
-
|
| 51 |
-
repository root is the deployed tokenizer vocab, and `vocab.char.json` is kept
|
| 52 |
-
as a mirrored explicit copy for training/data maintenance. The full DMHY weak
|
| 53 |
-
dataset has **6195 unique characters**, so the complete character vocab is only
|
| 54 |
-
**6199** entries including special tokens and reaches 100% token coverage.
|
| 55 |
|
| 56 |
-
|
| 57 |
-
|
|
|
|
| 58 |
|
| 59 |
-
##
|
| 60 |
|
| 61 |
-
|
| 62 |
-
|
|
|
|
| 63 |
|
| 64 |
-
|
| 65 |
-
|--------|-------|
|
| 66 |
-
| Eval loss | 0.0058 |
|
| 67 |
-
| Entity precision | 0.9922 |
|
| 68 |
-
| Entity recall | 0.9946 |
|
| 69 |
-
| Entity F1 | 0.9934 |
|
| 70 |
-
| Token accuracy | 0.9981 |
|
| 71 |
-
| Held-out parse full match | 2029/2048 (0.9907) |
|
| 72 |
-
| Fixed regression full match | 22/22 (1.0000) |
|
| 73 |
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
|
| 78 |
-
##
|
| 79 |
|
| 80 |
-
|
| 81 |
|
| 82 |
-
```
|
| 83 |
-
uv
|
| 84 |
```
|
| 85 |
|
| 86 |
-
|
| 87 |
|
| 88 |
-
```
|
| 89 |
-
|
| 90 |
```
|
| 91 |
|
| 92 |
-
Load
|
| 93 |
|
| 94 |
```python
|
| 95 |
from transformers import BertForTokenClassification
|
|
@@ -97,114 +104,152 @@ from transformers import BertForTokenClassification
|
|
| 97 |
model = BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
|
| 98 |
```
|
| 99 |
|
| 100 |
-
|
| 101 |
|
| 102 |
-
|
| 103 |
|
| 104 |
-
|
| 105 |
-
git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
|
| 106 |
-
# or, after a normal clone:
|
| 107 |
-
git submodule update --init --recursive
|
| 108 |
-
```
|
| 109 |
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
### Character-token DMHY training
|
| 113 |
-
|
| 114 |
-
```bash
|
| 115 |
-
uv run python convert_to_char_dataset.py \
|
| 116 |
-
--input datasets/AnimeName/dmhy_weak.jsonl \
|
| 117 |
-
--output datasets/AnimeName/dmhy_weak_char.jsonl \
|
| 118 |
-
--vocab-output datasets/AnimeName/vocab.char.json \
|
| 119 |
-
--manifest-output datasets/AnimeName/dmhy_weak_char.manifest.json
|
| 120 |
-
|
| 121 |
-
uv run python train.py --tokenizer char \
|
| 122 |
-
--data-file datasets/AnimeName/dmhy_weak_char.jsonl \
|
| 123 |
-
--vocab-file datasets/AnimeName/vocab.char.json \
|
| 124 |
-
--save-dir checkpoints/dmhy-char-guoman-relabel \
|
| 125 |
-
--init-model-dir . \
|
| 126 |
-
--epochs 2 --batch-size 256 \
|
| 127 |
-
--learning-rate 0.00008 --warmup-steps 300 \
|
| 128 |
-
--checkpoint-steps 1000 --save-total-limit 3 \
|
| 129 |
-
--parse-eval-limit 2048 \
|
| 130 |
-
--max-seq-length 128 --seed 52
|
| 131 |
-
```
|
| 132 |
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
|
| 136 |
-
while leaving room for `[CLS]` and `[SEP]`.
|
| 137 |
-
|
| 138 |
-
### Relabel the full dataset
|
| 139 |
-
|
| 140 |
-
```bash
|
| 141 |
-
uv run python relabel_dataset_from_filenames.py \
|
| 142 |
-
--input datasets/AnimeName/dmhy_weak.jsonl \
|
| 143 |
-
--output datasets/AnimeName/dmhy_weak.relabel.jsonl \
|
| 144 |
-
--manifest-output datasets/AnimeName/dmhy_weak.relabel.manifest.json \
|
| 145 |
-
--vocab-output datasets/AnimeName/vocab.relabel.json \
|
| 146 |
-
--base-vocab datasets/AnimeName/vocab.json \
|
| 147 |
-
--max-vocab-size 8000
|
| 148 |
-
|
| 149 |
-
Move-Item datasets/AnimeName/dmhy_weak.relabel.jsonl datasets/AnimeName/dmhy_weak.jsonl -Force
|
| 150 |
-
Move-Item datasets/AnimeName/vocab.relabel.json datasets/AnimeName/vocab.json -Force
|
| 151 |
-
Copy-Item datasets/AnimeName/dmhy_weak.relabel.manifest.json datasets/AnimeName/dmhy_weak.manifest.json -Force
|
| 152 |
-
Remove-Item datasets/AnimeName/dmhy_weak.relabel.manifest.json -Force
|
| 153 |
-
```
|
| 154 |
|
| 155 |
-
|
| 156 |
|
| 157 |
-
```
|
| 158 |
-
python -
|
| 159 |
-
import json, collections
|
| 160 |
-
tokens = collections.Counter()
|
| 161 |
-
[ tokens.update(item['tokens']) for item in [json.loads(l) for l in open('datasets/AnimeName/dmhy_weak.jsonl')] if item ]
|
| 162 |
-
vocab = {t:i for i,t in enumerate(['[PAD]','[UNK]','[CLS]','[SEP]'] + [t for t,_ in tokens.most_common(7996)])}
|
| 163 |
-
json.dump(vocab, open('vocab.json','w'), ensure_ascii=False, indent=2)
|
| 164 |
-
"
|
| 165 |
```
|
| 166 |
|
| 167 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 168 |
|
| 169 |
-
|
| 170 |
-
|
| 171 |
```
|
| 172 |
|
| 173 |
-
|
| 174 |
|
| 175 |
-
|
| 176 |
|
| 177 |
-
|
| 178 |
-
Free Colab still has to be started manually, but once `colab_worker.py` is
|
| 179 |
-
running Codex can submit jobs through `colab_client.py`, tail logs, and inspect
|
| 180 |
-
status. Checkpoints live on Google Drive and default profiles resume from the
|
| 181 |
-
latest checkpoint automatically.
|
| 182 |
|
| 183 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 184 |
|
| 185 |
-
|
| 186 |
-
|
| 187 |
```
|
| 188 |
|
| 189 |
-
##
|
| 190 |
|
| 191 |
-
|
| 192 |
-
- `train.py`, `dataset.py`, `tokenizer.py`, `model.py`: training pipeline
|
| 193 |
-
- `dmhy_dataset.py`, `mix_datasets.py`: weak-label export and dataset mixing
|
| 194 |
-
- `convert_to_char_dataset.py`: full character-token projection for weak labels
|
| 195 |
-
- `inference.py`: end-to-end filename parser CLI
|
| 196 |
-
- `export_onnx.py`: ONNX export for Android integration
|
| 197 |
-
- `exports/`: exported ONNX model and metadata
|
| 198 |
-
- `datasets/AnimeName/`: nested dataset submodule
|
| 199 |
|
| 200 |
-
##
|
| 201 |
|
| 202 |
-
|
| 203 |
-
tracks `ModerRAS/AnimeName` as `datasets/AnimeName`. After updating either
|
| 204 |
-
repo, remember to commit the submodule pointer in the parent repo.
|
| 205 |
|
| 206 |
-
|
| 207 |
-
|
|
|
|
| 208 |
|
|
|
|
| 209 |
|
|
|
|
|
|
|
|
|
|
| 210 |
|
|
|
|
| 3 |
library_name: transformers
|
| 4 |
pipeline_tag: token-classification
|
| 5 |
tags:
|
| 6 |
+
- anime
|
| 7 |
+
- filename-parsing
|
| 8 |
+
- bert
|
| 9 |
+
- token-classification
|
| 10 |
+
- onnx
|
| 11 |
datasets:
|
| 12 |
+
- ModerRAS/AnimeName
|
| 13 |
language:
|
| 14 |
+
- en
|
| 15 |
+
- ja
|
| 16 |
+
- zh
|
| 17 |
+
model-index:
|
| 18 |
+
- name: AniFileBERT
|
| 19 |
+
results:
|
| 20 |
+
- task:
|
| 21 |
+
type: token-classification
|
| 22 |
+
name: Anime filename token classification
|
| 23 |
+
dataset:
|
| 24 |
+
name: AniFileBERT fixed parser regression cases
|
| 25 |
+
type: parser-regression
|
| 26 |
+
metrics:
|
| 27 |
+
- type: accuracy
|
| 28 |
+
name: Fixed parser full-match accuracy
|
| 29 |
+
value: 1.0
|
| 30 |
---
|
| 31 |
|
| 32 |
# AniFileBERT
|
| 33 |
|
| 34 |
+
**中文**:AniFileBERT 是一个面向番剧发布文件名的轻量级 BERT token-classification 解析器。它把常见发布名解析为结构化字段:字幕组、标题、季、集数、分辨率、来源和 special tag。
|
| 35 |
|
| 36 |
+
**English**: AniFileBERT is a lightweight BERT token-classification parser for anime release filenames. It extracts structured fields: release group, title, season, episode, resolution, source, and special tags.
|
| 37 |
|
| 38 |
+
This repository is the Hugging Face model repo used by MiruPlay as `tools/anime_parser`.
|
| 39 |
|
| 40 |
+
## Model Details / 模型信息
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 41 |
|
| 42 |
+
| Item | Value |
|
| 43 |
+
| --- | --- |
|
| 44 |
+
| Architecture / 架构 | `BertForTokenClassification` |
|
| 45 |
+
| Tokenizer / 分词器 | Custom character tokenizer in `tokenizer.py` |
|
| 46 |
+
| Parameters / 参数量 | 4,783,631 |
|
| 47 |
+
| Hidden size / 隐层维度 | 256 |
|
| 48 |
+
| Layers / 层数 | 4 |
|
| 49 |
+
| Attention heads / 注意力头 | 8 |
|
| 50 |
+
| Max sequence length / 最大长度 | 128 |
|
| 51 |
+
| Labels / 标签 | BIO labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, `SPECIAL` |
|
| 52 |
+
| Default checkpoint / 默认权重 | Repository root files (`config.json`, `model.safetensors`, `vocab.json`, `tokenizer_config.json`) |
|
| 53 |
+
| ONNX export / ONNX 导出 | `exports/anime_filename_parser.onnx` |
|
| 54 |
|
| 55 |
+
**中文**:根目录就是发布 checkpoint,不再保留旧的 `model/` 重复副本。完整解析请使用本仓库的 `inference.py` 或复用 `tokenizer.py`、BIO decode 和字段聚合逻辑;直接 `from_pretrained()` 只能加载 token-classification 权重。
|
| 56 |
|
| 57 |
+
**English**: The repository root is the published checkpoint. The old duplicate `model/` directory is intentionally not used. For end-to-end parsing, use `inference.py` or reuse this repo's tokenizer, BIO decoder, and field aggregation logic; `from_pretrained()` only loads token-classification weights.
|
| 58 |
|
| 59 |
+
## Intended Use / 使用场景
|
| 60 |
|
| 61 |
+
**中文**
|
|
|
|
|
|
|
|
|
|
| 62 |
|
| 63 |
+
- 解析番剧/动画发布文件名,用于媒体库刮削、归类、搜索和展示。
|
| 64 |
+
- 覆盖常见结构:`[GROUP] TITLE - EP [META]`、点分隔 `S01E07`、国漫多括号标题、BD 特典 `NCOP/NCED/IV05`、长集数、第二季别名等。
|
| 65 |
+
- 不适合泛化为自然语言 NER;这是结构化文件名解析任务。
|
| 66 |
|
| 67 |
+
**English**
|
|
|
|
|
|
|
|
|
|
|
|
|
| 68 |
|
| 69 |
+
- Parse anime release filenames for media library scraping, classification, search, and display.
|
| 70 |
+
- Covers common layouts: `[GROUP] TITLE - EP [META]`, dotted `S01E07`, Chinese animation bracket layouts, BD extras such as `NCOP/NCED/IV05`, long-running episode numbers, and season aliases.
|
| 71 |
+
- This is not a general natural-language NER model; it is a structured filename parser.
|
| 72 |
|
| 73 |
+
## Install / 安装
|
| 74 |
|
| 75 |
+
```powershell
|
| 76 |
+
uv sync
|
| 77 |
+
```
|
| 78 |
|
| 79 |
+
If the dataset submodule is missing:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 80 |
|
| 81 |
+
```powershell
|
| 82 |
+
git submodule update --init --recursive
|
| 83 |
+
```
|
| 84 |
|
| 85 |
+
## Quick Start / 快速使用
|
| 86 |
|
| 87 |
+
Run the Python parser:
|
| 88 |
|
| 89 |
+
```powershell
|
| 90 |
+
uv run python inference.py --model-dir . "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
|
| 91 |
```
|
| 92 |
|
| 93 |
+
Expected output:
|
| 94 |
|
| 95 |
+
```json
|
| 96 |
+
{"title":"神印王座","season":null,"episode":200,"group":"GM-Team","resolution":"1080P","source":"GB","special":null}
|
| 97 |
```
|
| 98 |
|
| 99 |
+
Load the raw Transformers model:
|
| 100 |
|
| 101 |
```python
|
| 102 |
from transformers import BertForTokenClassification
|
|
|
|
| 104 |
model = BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
|
| 105 |
```
|
| 106 |
|
| 107 |
+
**中文**:如果需要完整字段解析,请 clone 本仓库并使用 `inference.py`,因为分词器和后处理是自定义的。
|
| 108 |
|
| 109 |
+
**English**: For complete field parsing, clone this repo and use `inference.py`; the tokenizer and postprocessing are custom.
|
| 110 |
|
| 111 |
+
## ONNX Usage / ONNX 使用
|
|
|
|
|
|
|
|
|
|
|
|
|
| 112 |
|
| 113 |
+
The ONNX graph outputs token logits only. A complete parser still needs:
|
| 114 |
|
| 115 |
+
1. custom character tokenization,
|
| 116 |
+
2. constrained BIO decoding,
|
| 117 |
+
3. field aggregation and high-confidence structural cleanup.
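Steps 2 and 3 above can be illustrated with a small sketch. It assumes a greedy decoder that rejects an `I-X` tag unless it continues a `B-X` or `I-X` of the same type, and a "first span per field wins" aggregation; the repository's real decoder and cleanup rules may differ, and all function names here are hypothetical.

```python
def allowed(prev: str, cur: str) -> bool:
    """BIO constraint: I-X may only continue a B-X or I-X of the same type."""
    if not cur.startswith("I-"):
        return True
    return prev in ("B-" + cur[2:], "I-" + cur[2:])


def greedy_bio_decode(scores, labels):
    """Pick, per position, the best-scoring label that respects the constraint."""
    out, prev = [], "O"
    for row in scores:
        ranked = sorted(range(len(labels)), key=lambda i: row[i], reverse=True)
        best = next(labels[i] for i in ranked if allowed(prev, labels[i]))
        out.append(best)
        prev = best
    return out


def aggregate(chars, tags):
    """Join consecutive characters of one entity span; keep the first span per field."""
    fields, cur_type, buf = {}, None, []
    for ch, tag in list(zip(chars, tags)) + [("", "O")]:  # sentinel flushes the last span
        etype = None if tag == "O" else tag[2:]
        if tag.startswith("B-") or etype != cur_type:
            if cur_type and buf:
                fields.setdefault(cur_type.lower(), "".join(buf))
            cur_type, buf = etype, []
        if etype:
            buf.append(ch)
    return fields


labels = ["O", "B-EPISODE", "I-EPISODE"]
scores = [[0.9, 0.05, 0.05], [0.1, 0.3, 0.6], [0.1, 0.2, 0.7], [0.8, 0.1, 0.1]]
tags = greedy_bio_decode(scores, labels)
fields = aggregate(list("[07]"), tags)
```

At the second position the raw argmax is `I-EPISODE`, which is illegal after `O`, so the decoder falls back to `B-EPISODE`; this is the kind of repair that plain per-token argmax would miss.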
|
| 118 |
|
| 119 |
+
This repository provides a minimal runnable example:
|
| 120 |
|
| 121 |
+
```powershell
|
| 122 |
+
uv run python onnx_inference.py "[YYDM&VCB-Studio] Shinsekai Yori [NCED02][Ma10p_1080p][x265_flac].mkv"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 123 |
```
|
| 124 |
|
| 125 |
+
Static graph shapes:
|
| 126 |
+
|
| 127 |
+
- `input_ids`: `int64[1,128]`
|
| 128 |
+
- `attention_mask`: `int64[1,128]`
|
| 129 |
+
- `logits`: `float32[1,128,15]`
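The last `logits` dimension of 15 is consistent with a BIO scheme over the seven entity types plus `O` (2 × 7 + 1). A minimal sketch of that count follows; the ordering of `labels` is an assumption made for illustration, and the authoritative mapping is whatever the checkpoint's `config.json` defines.

```python
ENTITY_TYPES = ["TITLE", "SEASON", "EPISODE", "GROUP", "RESOLUTION", "SOURCE", "SPECIAL"]

# Hypothetical ordering: "O" first, then B-/I- pairs per entity type.
labels = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]
num_labels = len(labels)
```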
|
| 130 |
+
|
| 131 |
+
More details: [`docs/onnx.md`](docs/onnx.md) and [`ANDROID.md`](ANDROID.md).
|
| 132 |
+
|
| 133 |
+
## Evaluation / 评估
|
| 134 |
|
| 135 |
+
Current published checkpoint:
|
| 136 |
+
|
| 137 |
+
| Metric / 指标 | Value / 数值 |
|
| 138 |
+
| --- | --- |
|
| 139 |
+
| Fixed real-case regression / 固定真实回归 | 26/26 full match |
|
| 140 |
+
| ONNX parity / ONNX 误差 | max abs diff `2.6703e-05` |
|
| 141 |
+
| Token/entity eval after focus tuning / focus 微调后实体评估 | F1 `0.9666`, token accuracy `0.9904` |
|
| 142 |
+
| Focus parse eval / focus 解析评估 | 385/512 full match |
|
| 143 |
+
|
| 144 |
+
**中文**:当前发布模型是“全量重标注 char 模型 + special-code focus 微调”。固定回归集覆盖真实用户反馈样式;focus eval 是偏向困难样本的评估,不等同于全量随机 DMHY 评估。
|
| 145 |
+
|
| 146 |
+
**English**: The published checkpoint is the full-relabel character model plus a targeted special-code focus fine-tune. The fixed regression set covers real user-reported patterns; focus eval is intentionally biased toward hard examples and is not equivalent to a broad random DMHY evaluation.
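For reference, a full-match score counts a filename as correct only when every parsed field equals the expected value. The sketch below shows that metric under this assumption; `full_match_rate` is a hypothetical helper, not the code in `evaluate_parser_cases.py`.

```python
def full_match_rate(predictions, expected):
    """Return (matches, total, rate) where a match requires identical field dicts."""
    matches = sum(p == e for p, e in zip(predictions, expected))
    total = len(expected)
    return matches, total, (matches / total if total else 0.0)


expected = [{"title": "A", "episode": 1}, {"title": "B", "episode": 2}]
predicted = [{"title": "A", "episode": 1}, {"title": "B", "episode": 3}]
matches, total, rate = full_match_rate(predicted, expected)

# The table's focus parse figure expressed as a rate.
focus_rate = 385 / 512
```

Under this definition the focus parse figure of 385/512 corresponds to a rate of about 0.752, while 26/26 on the fixed regression set is 1.0.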
|
| 147 |
+
|
| 148 |
+
Run regression:
|
| 149 |
+
|
| 150 |
+
```powershell
|
| 151 |
+
uv run python evaluate_parser_cases.py --model-dir . --case-file data/parser_regression_cases.json --output case_metrics.json
|
| 152 |
```
|
| 153 |
|
| 154 |
+
## Training / 训练
|
| 155 |
+
|
| 156 |
+
Training uses the dataset submodule at `datasets/AnimeName`.
|
| 157 |
+
|
| 158 |
+
Recommended full character-token run:
|
| 159 |
+
|
| 160 |
+
```powershell
|
| 161 |
+
uv run python train.py --tokenizer char `
|
| 162 |
+
--data-file datasets/AnimeName/dmhy_weak_char.jsonl `
|
| 163 |
+
--vocab-file datasets/AnimeName/vocab.char.json `
|
| 164 |
+
--save-dir checkpoints/dmhy-char-full `
|
| 165 |
+
--init-model-dir . `
|
| 166 |
+
--epochs 2 `
|
| 167 |
+
--batch-size 256 `
|
| 168 |
+
--learning-rate 0.00008 `
|
| 169 |
+
--warmup-steps 300 `
|
| 170 |
+
--max-seq-length 128 `
|
| 171 |
+
--train-split 0.98 `
|
| 172 |
+
--num-workers 4 `
|
| 173 |
+
--checkpoint-steps 1000 `
|
| 174 |
+
--save-total-limit 3 `
|
| 175 |
+
--parse-eval-limit 2048 `
|
| 176 |
+
--seed 52 `
|
| 177 |
+
--experiment-name dmhy-char-full
|
| 178 |
+
```
|
| 179 |
+
|
| 180 |
+
`train.py` writes:
|
| 181 |
+
|
| 182 |
+
- Hugging Face checkpoints under `--save-dir`,
|
| 183 |
+
- `final/run_metadata.json`,
|
| 184 |
+
- `final/trainer_eval_metrics.json`,
|
| 185 |
+
- `final/parse_eval_metrics.json`,
|
| 186 |
+
- `final/case_metrics.json` unless `--no-case-eval` is used,
|
| 187 |
+
- TensorBoard logs unless `--no-tensorboard` is used.
|
| 188 |
|
| 189 |
+
Full workflow: [`docs/training.md`](docs/training.md).
|
| 190 |
|
| 191 |
+
## Dataset / 数据集
|
| 192 |
|
| 193 |
+
Authoritative dataset snapshot:
|
| 194 |
+
|
| 195 |
+
```text
|
| 196 |
+
datasets/AnimeName/dmhy_weak.jsonl
|
| 197 |
+
datasets/AnimeName/dmhy_weak_char.jsonl
|
| 198 |
+
datasets/AnimeName/vocab.json
|
| 199 |
+
datasets/AnimeName/vocab.char.json
|
| 200 |
+
```
|
| 201 |
|
| 202 |
+
Current snapshot:
|
| 203 |
+
|
| 204 |
+
- rows / 行数: `632002`
|
| 205 |
+
- failed relabel rows / 重标注失败行: `0`
|
| 206 |
+
- strict BIO violations / 严格 BIO 违规: `0`
|
| 207 |
+
- character vocab / 字符词表: `6199`
|
| 208 |
+
- character coverage / 字符覆盖率: `100%`
|
| 209 |
+
|
| 210 |
+
**中文**:`datasets/AnimeName` 是嵌套数据集仓库。更新数据后需要先提交/推送子仓库,再提交父仓库的 submodule pointer。
|
| 211 |
+
|
| 212 |
+
**English**: `datasets/AnimeName` is a nested dataset repository. Commit and push the dataset repo first, then commit the updated submodule pointer in this model repo.
|
| 213 |
+
|
| 214 |
+
## Repository Layout / 仓库结构
|
| 215 |
+
|
| 216 |
+
```text
|
| 217 |
+
config.json
|
| 218 |
+
model.safetensors
|
| 219 |
+
tokenizer_config.json
|
| 220 |
+
vocab.json
|
| 221 |
+
training_args.bin
|
| 222 |
+
inference.py
|
| 223 |
+
onnx_inference.py
|
| 224 |
+
export_onnx.py
|
| 225 |
+
train.py
|
| 226 |
+
dataset.py
|
| 227 |
+
tokenizer.py
|
| 228 |
+
dmhy_dataset.py
|
| 229 |
+
label_repairs.py
|
| 230 |
+
relabel_dataset_from_filenames.py
|
| 231 |
+
convert_to_char_dataset.py
|
| 232 |
+
data/parser_regression_cases.json
|
| 233 |
+
datasets/AnimeName/
|
| 234 |
+
exports/anime_filename_parser.onnx
|
| 235 |
+
docs/
|
| 236 |
```
|
| 237 |
|
| 238 |
+
## Maintenance / 维护
|
| 239 |
|
| 240 |
+
See [`MAINTENANCE.md`](MAINTENANCE.md) for release steps, LFS order, dataset submodule updates, and MiruPlay integration notes.
|
| 241 |
|
| 242 |
+
## Limitations / 局限
|
| 243 |
|
| 244 |
+
**中文**
|
| 245 |
|
| 246 |
+
- 发布命名没有统一标准,极端 OCR 噪声、乱码、非动画命名仍可能失败。
|
| 247 |
+
- ONNX 只包含模型 logits,不包含 tokenizer 和后处理;移动端必须保持 tokenizer/vocab/config 一致。
|
| 248 |
+
- `source` 当前是单值字段,复杂文件名里可能同时存在平台、发布源、编码器和语言标签。
|
| 249 |
|
| 250 |
+
**English**
|
| 251 |
|
| 252 |
+
- Anime release names are not standardized; extreme OCR noise, mojibake, or non-anime names can still fail.
|
| 253 |
+
- ONNX contains logits only. Mobile runtimes must keep tokenizer, vocabulary, config, BIO decode, and postprocessing in sync.
|
| 254 |
+
- `source` is currently a single field, while real filenames may contain platform, release source, codec, and language tags together.
|
| 255 |
|
docs/onnx.md
ADDED
|
@@ -0,0 +1,154 @@
|
| 1 |
+
# ONNX Usage / ONNX 使用说明
|
| 2 |
+
|
| 3 |
+
AniFileBERT exports a static-shape ONNX graph for Android and local inference.
|
| 4 |
+
|
| 5 |
+
AniFileBERT 导出静态 shape 的 ONNX 图,用于 Android 和本地推理。
|
| 6 |
+
|
| 7 |
+
## 1. What ONNX Contains / ONNX 包含什么
|
| 8 |
+
|
| 9 |
+
The ONNX graph contains only the BERT token-classification forward pass:
|
| 10 |
+
|
| 11 |
+
ONNX 图只包含 BERT token-classification 前向计算:
|
| 12 |
+
|
| 13 |
+
```text
|
| 14 |
+
input_ids int64[1,128]
|
| 15 |
+
attention_mask int64[1,128]
|
| 16 |
+
logits float32[1,128,15]
|
| 17 |
+
```
|
| 18 |
+
|
| 19 |
+
It does **not** contain:
|
| 20 |
+
|
| 21 |
+
它**不包含**:
|
| 22 |
+
|
| 23 |
+
- filename tokenization / 文件名分词
|
| 24 |
+
- token-to-id conversion / token 到 id 的转换
|
| 25 |
+
- constrained BIO decoding / 约束 BIO 解码
|
| 26 |
+
- field aggregation / 字段聚合
|
| 27 |
+
- structural cleanup / 结构化清理
|
| 28 |
+
|
| 29 |
+
Those steps must stay aligned with `tokenizer.py`, `inference.py`, `config.json`,
|
| 30 |
+
and `vocab.json`.
|
| 31 |
+
|
| 32 |
+
这些步骤必须与 `tokenizer.py`、`inference.py`、`config.json`、`vocab.json`
|
| 33 |
+
保持一致。
|
| 34 |
+
|
| 35 |
+
## 2. Export / 导出
|
| 36 |
+
|
| 37 |
+
```powershell
|
| 38 |
+
uv run python export_onnx.py --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
|
| 39 |
+
```
|
| 40 |
+
|
| 41 |
+
The exporter also writes:
|
| 42 |
+
|
| 43 |
+
导出器还会写入:
|
| 44 |
+
|
| 45 |
+
```text
|
| 46 |
+
exports/anime_filename_parser.metadata.json
|
| 47 |
+
```
|
| 48 |
+
|
| 49 |
+
The metadata records the sample filename, output shape, and PyTorch/ONNX max
|
| 50 |
+
absolute logits difference.
|
| 51 |
+
|
| 52 |
+
metadata 会记录样本文件名、输出 shape、PyTorch/ONNX logits 最大绝对误差。
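The parity figure above is a maximum absolute difference over all logits. The sketch below only illustrates that arithmetic and is not taken from `export_onnx.py`; `max_abs_diff` is a hypothetical helper, and the sample values are made up for illustration.

```python
from typing import Sequence


def max_abs_diff(a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]) -> float:
    """Return the largest absolute element-wise difference between two 2-D grids."""
    if len(a) != len(b):
        raise ValueError("row count mismatch")
    worst = 0.0
    for row_a, row_b in zip(a, b):
        if len(row_a) != len(row_b):
            raise ValueError("column count mismatch")
        for x, y in zip(row_a, row_b):
            worst = max(worst, abs(x - y))
    return worst


# Hypothetical logits for two tokens and three labels (not real model output).
pytorch_logits = [[0.10, -1.20, 3.00], [2.50, 0.00, -0.75]]
onnx_logits = [[0.10, -1.25, 3.00], [2.50, 0.00, -0.75]]
parity = max_abs_diff(pytorch_logits, onnx_logits)
```

A small value here suggests the exported graph reproduces the PyTorch forward pass to within floating-point noise; what threshold counts as acceptable is a project decision.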
|
| 53 |
+
|
| 54 |
+
## 3. Local ONNX Inference / 本地 ONNX 推理
|
| 55 |
+
|
| 56 |
+
Use `onnx_inference.py` as the minimal runnable reference.
|
| 57 |
+
|
| 58 |
+
使用 `onnx_inference.py` 作为最小可运行参考实现。
|
| 59 |
+
|
| 60 |
+
```powershell
|
| 61 |
+
uv run python onnx_inference.py "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
|
| 62 |
+
```
|
| 63 |
+
|
| 64 |
+
Expected:
|
| 65 |
+
|
| 66 |
+
期望输出:
|
| 67 |
+
|
| 68 |
+
```json
|
| 69 |
+
{"title":"神印王座","season":null,"episode":200,"group":"GM-Team","resolution":"1080P","source":"GB","special":null}
|
| 70 |
+
```
|
| 71 |
+
|
| 72 |
+
Special-code example:
|
| 73 |
+
|
| 74 |
+
特典编号示例:
|
| 75 |
+
|
| 76 |
+
```powershell
|
| 77 |
+
uv run python onnx_inference.py "[YYDM&VCB-Studio] Shinsekai Yori [NCED02][Ma10p_1080p][x265_flac].mkv"
|
| 78 |
+
```
|
| 79 |
+
|
| 80 |
+
Expected:
|
| 81 |
+
|
| 82 |
+
期望输出:
|
| 83 |
+
|
| 84 |
+
```json
|
| 85 |
+
{"title":"Shinsekai Yori","season":null,"episode":null,"group":"YYDM&VCB-Studio","resolution":"1080p","source":"x265_flac","special":"NCED02"}
|
| 86 |
+
```
|
| 87 |
+
|
| 88 |
+
## 4. Implementation Steps / 实现步骤
|
| 89 |
+
|
| 90 |
+
The runtime parser should do this:
|
| 91 |
+
|
| 92 |
+
运行时解析器应按以下步骤实现:
|
| 93 |
+
|
| 94 |
+
1. Tokenize filename with the custom character tokenizer.
|
| 95 |
+
使用自定义字符 tokenizer 对文件名分词。
|
| 96 |
+
2. Truncate the tokens to `max_length - 2`, then add `[CLS]` and `[SEP]`.
|
| 97 |
+
将 token 截断到 `max_length - 2`,再添加 `[CLS]` 和 `[SEP]`。
|
| 98 |
+
3. Convert tokens to ids with `vocab.json`.
|
| 99 |
+
使用 `vocab.json` 转换 token id。
|
| 100 |
+
4. Pad `input_ids` and `attention_mask` to exactly `128`.
|
| 101 |
+
将 `input_ids` 和 `attention_mask` padding 到固定 `128`。
|
| 102 |
+
5. Run ONNX Runtime.
|
| 103 |
+
执行 ONNX Runtime。
|
| 104 |
+
6. Slice the logits back to the real token count, excluding `[CLS]` and `[SEP]`.
|
| 105 |
+
去掉 `[CLS]` / `[SEP]`,只保留真实 token 的 logits。
|
| 106 |
+
7. Decode labels with constrained BIO transitions.
|
| 107 |
+
使用约束 BIO transition 解码标签。
|
| 108 |
+
8. Aggregate labels into parser fields.
|
| 109 |
+
聚合标签为结构化字段。
|
| 110 |
+
9. Apply high-confidence structural cleanup.
|
| 111 |
+
应用高置信结构修正。
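Steps 2, 4, and 6 are the ones most likely to drift between runtimes, so here is a minimal sketch of them using plain Python lists. It assumes the token ids have already been looked up (steps 1 and 3); `build_fixed_inputs`, `slice_real_rows`, and the special-token ids used in the example are hypothetical and not taken from the repository.

```python
from typing import List, Tuple


def build_fixed_inputs(
    token_ids: List[int],
    cls_id: int,
    sep_id: int,
    pad_id: int,
    max_length: int = 128,
) -> Tuple[List[int], List[int], int]:
    """Truncate, wrap with [CLS]/[SEP], and pad to a fixed length."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    used = token_ids[: max_length - 2]          # leave two slots for special tokens
    input_ids = [cls_id] + used + [sep_id]
    attention_mask = [1] * len(input_ids)       # 1 marks real positions
    pad_len = max_length - len(input_ids)
    input_ids += [pad_id] * pad_len
    attention_mask += [0] * pad_len             # 0 marks padding
    return input_ids, attention_mask, len(used)


def slice_real_rows(rows: List[List[float]], real_count: int) -> List[List[float]]:
    """Drop the [CLS] row and everything from [SEP] onward."""
    return rows[1 : 1 + real_count]


# Hypothetical ids: 2 = [CLS], 3 = [SEP], 0 = [PAD].
ids, mask, n = build_fixed_inputs([11, 12, 13], cls_id=2, sep_id=3, pad_id=0, max_length=8)
```

The returned count `n` is what the logits slice needs later, so it is worth carrying alongside the arrays rather than recomputing it from the mask.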
|
| 112 |
+
|
| 113 |
+
## 5. Android Notes / Android 注意事项
|
| 114 |
+
|
| 115 |
+
Android must bundle these files together:
|
| 116 |
+
|
| 117 |
+
Android 端必须同时打包:
|
| 118 |
+
|
| 119 |
+
```text
|
| 120 |
+
anime_filename_parser.onnx
|
| 121 |
+
vocab.json
|
| 122 |
+
config.json
|
| 123 |
+
```
|
| 124 |
+
|
| 125 |
+
When changing any of them, update all of them in the same commit.
|
| 126 |
+
|
| 127 |
+
只要其中任意一个变化,三者必须在同一次提交中一起更新。
|
| 128 |
+
|
| 129 |
+
## 6. Common Mistakes / 常见错误
|
| 130 |
+
|
| 131 |
+
**Using a standard Hugging Face tokenizer**
|
| 132 |
+
|
| 133 |
+
**误用标准 Hugging Face tokenizer**
|
| 134 |
+
|
| 135 |
+
This model uses `AnimeTokenizer`, not WordPiece/BPE.
|
| 136 |
+
|
| 137 |
+
本模型使用 `AnimeTokenizer`,不是 WordPiece/BPE。
|
| 138 |
+
|
| 139 |
+
**Treating ONNX output as final fields**
|
| 140 |
+
|
| 141 |
+
**把 ONNX 输出当成最终字段**
|
| 142 |
+
|
| 143 |
+
ONNX returns token logits. You still need BIO decode and field aggregation.
|
| 144 |
+
|
| 145 |
+
ONNX 返回 token logits,仍然需要 BIO 解码和字段聚合。
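To make the gap between logits and fields concrete, here is a small greedy sketch of constrained BIO decoding followed by field aggregation. It is an illustration under assumptions: the label names are hypothetical, the repository's `constrained_bio_decode` and `postprocess` may use a different algorithm and extra cleanup rules, and only the first span per field is kept here.

```python
from typing import Dict, List


def greedy_bio_decode(logits: List[List[float]], id2label: Dict[int, str]) -> List[str]:
    """Pick the best label per token, allowing I-X only right after B-X or I-X."""
    labels: List[str] = []
    prev = "O"
    for row in logits:
        best_label, best_score = "O", float("-inf")
        for idx, score in enumerate(row):
            label = id2label[idx]
            if label.startswith("I-") and prev[2:] != label[2:]:
                continue  # I-X may not open a span or follow a different field
            if score > best_score:
                best_label, best_score = label, score
        labels.append(best_label)
        prev = best_label
    return labels


def aggregate_fields(tokens: List[str], labels: List[str]) -> Dict[str, str]:
    """Join character tokens of the first span of each field into one string."""
    fields: Dict[str, str] = {}
    open_field = None
    for token, label in zip(tokens, labels):
        name = label[2:].lower()
        if label.startswith("B-"):
            if name in fields:
                open_field = None        # keep only the first span per field
            else:
                open_field = name
                fields[name] = token
        elif label.startswith("I-") and open_field == name:
            fields[name] += token
        else:
            open_field = None
    return fields


# Hypothetical label map and logits for the characters of "[AB]X1".
id2label = {0: "O", 1: "B-GROUP", 2: "I-GROUP", 3: "B-TITLE", 4: "I-TITLE"}
tokens = list("[AB]X1")
logits = [
    [5, 0, 0, 0, 0],
    [0, 5, 1, 0, 0],
    [0, 1, 5, 0, 0],
    [5, 0, 0, 0, 0],
    [0, 0, 0, 3, 5],  # raw argmax would be I-TITLE, which is illegal after O
    [0, 0, 0, 1, 5],
]
decoded = greedy_bio_decode(logits, id2label)
fields = aggregate_fields(tokens, decoded)
```

The fifth row shows why the constraint matters: an unconstrained argmax would emit an `I-` label with no opening `B-`, and the span would be lost during aggregation.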
|
| 146 |
+
|
| 147 |
+
**Changing max length without updating Android**
|
| 148 |
+
|
| 149 |
+
**改 max length 但没有同步 Android**
|
| 150 |
+
|
| 151 |
+
The exported graph is static. Runtime arrays must match `[1,128]`.
|
| 152 |
+
|
| 153 |
+
导出的图是静态 shape,运行时数组必须匹配 `[1,128]`。
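A cheap guard before calling the runtime can catch a length mismatch early. The check below is a sketch with a hypothetical helper name, `check_static_shape`; it works on nested lists so it does not depend on any array library.

```python
from typing import List, Tuple

EXPECTED_SHAPE: Tuple[int, int] = (1, 128)  # batch size and sequence length stated above


def check_static_shape(batch: List[List[int]], expected: Tuple[int, int] = EXPECTED_SHAPE) -> None:
    """Raise ValueError unless the batch is exactly expected[0] rows of expected[1] items."""
    rows = len(batch)
    lengths = {len(row) for row in batch}
    if rows != expected[0] or lengths != {expected[1]}:
        raise ValueError(
            f"expected shape {list(expected)}, got {rows} row(s) with lengths {sorted(lengths)}"
        )


check_static_shape([[0] * 128])  # passes silently
```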
|
| 154 |
+
|
docs/training.md
ADDED
|
@@ -0,0 +1,233 @@
|
| 1 |
+
# Training Guide / 训练指南
|
| 2 |
+
|
| 3 |
+
This document describes the reproducible training workflow for AniFileBERT.
|
| 4 |
+
|
| 5 |
+
本文档记录 AniFileBERT 的可复现训练流程。
|
| 6 |
+
|
| 7 |
+
## 1. Environment / 环境
|
| 8 |
+
|
| 9 |
+
Use `uv` for dependency management and for running all commands.
|
| 10 |
+
|
| 11 |
+
所有依赖和命令优先使用 `uv`。
|
| 12 |
+
|
| 13 |
+
```powershell
|
| 14 |
+
uv sync
|
| 15 |
+
uv run python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
|
| 16 |
+
```
|
| 17 |
+
|
| 18 |
+
Recommended GPU configuration:
|
| 19 |
+
|
| 20 |
+
推荐 GPU 配置:
|
| 21 |
+
|
| 22 |
+
- RTX 3080 class GPU or better
|
| 23 |
+
- batch size `192` to `256` for full char training
|
| 24 |
+
- `fp16` enabled automatically when CUDA is available
|
| 25 |
+
- `--num-workers 4` or `8` when the local disk can keep up
|
| 26 |
+
|
| 27 |
+
## 2. Dataset / 数据集
|
| 28 |
+
|
| 29 |
+
The authoritative dataset lives in the nested submodule:
|
| 30 |
+
|
| 31 |
+
权威数据集位于嵌套子模块:
|
| 32 |
+
|
| 33 |
+
```text
|
| 34 |
+
datasets/AnimeName/dmhy_weak.jsonl
|
| 35 |
+
datasets/AnimeName/dmhy_weak_char.jsonl
|
| 36 |
+
datasets/AnimeName/vocab.json
|
| 37 |
+
datasets/AnimeName/vocab.char.json
|
| 38 |
+
```
|
| 39 |
+
|
| 40 |
+
Current expected properties:
|
| 41 |
+
|
| 42 |
+
当前期望属性:
|
| 43 |
+
|
| 44 |
+
- rows / 行数: `632002`
|
| 45 |
+
- strict BIO violations / 严格 BIO 违规: `0`
|
| 46 |
+
- character vocab / 字符词表: `6199`
|
| 47 |
+
- character coverage / 字符覆盖率: `100%`
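One way to read the "strict BIO violations" property is that every `I-X` label must directly follow `B-X` or `I-X`. The sketch below counts breaks of that rule for a single label sequence. It is an assumption about the definition, with hypothetical label names; the dataset tooling may check additional conditions.

```python
from typing import Iterable


def count_bio_violations(labels: Iterable[str]) -> int:
    """Count I-X labels that do not directly follow B-X or I-X."""
    violations = 0
    prev = "O"
    for label in labels:
        if label.startswith("I-") and prev[2:] != label[2:]:
            violations += 1
        prev = label
    return violations


clean = count_bio_violations(["B-TITLE", "I-TITLE", "O", "B-EPISODE"])
broken = count_bio_violations(["O", "I-TITLE", "B-GROUP", "I-TITLE"])
```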
|
| 48 |
+
|
| 49 |
+
## 3. Relabel Full Dataset / 全量重标注
|
| 50 |
+
|
| 51 |
+
Use this when weak-label rules changed in `dmhy_dataset.py` or `label_repairs.py`.
|
| 52 |
+
|
| 53 |
+
当 `dmhy_dataset.py` 或 `label_repairs.py` 的弱标注规则改变时,使用此流程。
|
| 54 |
+
|
| 55 |
+
```powershell
|
| 56 |
+
uv run python relabel_dataset_from_filenames.py `
|
| 57 |
+
--input datasets/AnimeName/dmhy_weak.jsonl `
|
| 58 |
+
--output datasets/AnimeName/dmhy_weak.relabel.jsonl `
|
| 59 |
+
--manifest-output datasets/AnimeName/dmhy_weak.relabel.manifest.json `
|
| 60 |
+
--vocab-output datasets/AnimeName/vocab.relabel.json `
|
| 61 |
+
--base-vocab datasets/AnimeName/vocab.json `
|
| 62 |
+
--max-vocab-size 8000 `
|
| 63 |
+
--progress 50000
|
| 64 |
+
```
|
| 65 |
+
|
| 66 |
+
After checking the manifest and sample labels, replace the authoritative files:
|
| 67 |
+
|
| 68 |
+
检查 manifest 和样本标注后,再替换权威文件:
|
| 69 |
+
|
| 70 |
+
```powershell
|
| 71 |
+
Move-Item datasets/AnimeName/dmhy_weak.relabel.jsonl datasets/AnimeName/dmhy_weak.jsonl -Force
|
| 72 |
+
Move-Item datasets/AnimeName/vocab.relabel.json datasets/AnimeName/vocab.json -Force
|
| 73 |
+
Move-Item datasets/AnimeName/dmhy_weak.relabel.manifest.json datasets/AnimeName/dmhy_weak.manifest.json -Force
|
| 74 |
+
```
|
| 75 |
+
|
| 76 |
+
## 4. Convert to Character Dataset / 转换为字符数据集
|
| 77 |
+
|
| 78 |
+
The published checkpoint uses the character tokenizer.
|
| 79 |
+
|
| 80 |
+
当前发布模型使用字符级 tokenizer。
|
| 81 |
+
|
| 82 |
+
```powershell
|
| 83 |
+
uv run python convert_to_char_dataset.py `
|
| 84 |
+
--input datasets/AnimeName/dmhy_weak.jsonl `
|
| 85 |
+
--output datasets/AnimeName/dmhy_weak_char.jsonl `
|
| 86 |
+
--vocab-output datasets/AnimeName/vocab.char.json `
|
| 87 |
+
--manifest-output datasets/AnimeName/dmhy_weak_char.manifest.json `
|
| 88 |
+
--progress 50000
|
| 89 |
+
```
|
| 90 |
+
|
| 91 |
+
## 5. Full Training / 全量训练
|
| 92 |
+
|
| 93 |
+
Recommended RTX 3080 run:
|
| 94 |
+
|
| 95 |
+
推荐 RTX 3080 训练命令:
|
| 96 |
+
|
| 97 |
+
```powershell
|
| 98 |
+
uv run python train.py --tokenizer char `
|
| 99 |
+
--data-file datasets/AnimeName/dmhy_weak_char.jsonl `
|
| 100 |
+
--vocab-file datasets/AnimeName/vocab.char.json `
|
| 101 |
+
--save-dir checkpoints/dmhy-char-full `
|
| 102 |
+
--init-model-dir . `
|
| 103 |
+
--epochs 2 `
|
| 104 |
+
--batch-size 256 `
|
| 105 |
+
--learning-rate 0.00008 `
|
| 106 |
+
--warmup-steps 300 `
|
| 107 |
+
--max-seq-length 128 `
|
| 108 |
+
--train-split 0.98 `
|
| 109 |
+
--num-workers 4 `
|
| 110 |
+
--checkpoint-steps 1000 `
|
| 111 |
+
--save-total-limit 3 `
|
| 112 |
+
--parse-eval-limit 2048 `
|
| 113 |
+
--case-eval-file data/parser_regression_cases.json `
|
| 114 |
+
--seed 52 `
|
| 115 |
+
--experiment-name dmhy-char-full
|
| 116 |
+
```
|
| 117 |
+
|
| 118 |
+
Training outputs:
|
| 119 |
+
|
| 120 |
+
训练输出:
|
| 121 |
+
|
| 122 |
+
- `checkpoints/<run>/checkpoint-*`: resumable checkpoints / 可恢复 checkpoint
|
| 123 |
+
- `checkpoints/<run>/final`: final Hugging Face checkpoint / 最终 checkpoint
|
| 124 |
+
- `final/run_metadata.json`: run configuration / 训练配置
|
| 125 |
+
- `final/trainer_eval_metrics.json`: seqeval metrics / token/entity 指标
|
| 126 |
+
- `final/parse_eval_metrics.json`: held-out parser exact-match / held-out 解析准确率
|
| 127 |
+
- `final/case_metrics.json`: fixed real-world case regression / 固定真实 case 回归
|
| 128 |
+
- TensorBoard logs unless `--no-tensorboard` is set / 默认写 TensorBoard
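For a quick look at the fixed-case result without rerunning training, a metrics file can be summarized in a few lines. The sketch below uses the keys that `train.py` prints (`full_correct`, `case_count`, `full_accuracy`, `failures`); the helper name `summarize_case_metrics` is hypothetical, and an inline sample stands in for reading `final/case_metrics.json`.

```python
import json
from typing import Any, Dict


def summarize_case_metrics(metrics: Dict[str, Any]) -> str:
    """Format the fixed-case regression result as a single line."""
    line = (
        f"full_match: {metrics['full_correct']}/{metrics['case_count']} "
        f"({metrics['full_accuracy']:.4f})"
    )
    failures = metrics.get("failures") or []
    if failures:
        line += f", failures: {len(failures)}"
    return line


# Inline sample; the values mirror the 26/26 figure reported in the README.
sample = json.loads('{"full_correct": 26, "case_count": 26, "full_accuracy": 1.0, "failures": []}')
summary = summarize_case_metrics(sample)
```

In practice the dictionary would come from `json.load` on the file written under `--save-dir`, or on the path passed via `--case-eval-output`.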
|
| 129 |
+
|
| 130 |
+
## 6. Focus Fine-Tuning / 针对性微调
|
| 131 |
+
|
| 132 |
+
Use focus fine-tuning only after a specific real-world failure pattern has been
|
| 133 |
+
confirmed and added to `data/parser_regression_cases.json`.
|
| 134 |
+
|
| 135 |
+
只有在确认某类真实失败样式,并加入 `data/parser_regression_cases.json` 后,才使用针对性微调。
|
| 136 |
+
|
| 137 |
+
```powershell
|
| 138 |
+
uv run python build_repair_focus_dataset.py `
|
| 139 |
+
--input datasets/AnimeName/dmhy_weak_char.jsonl `
|
| 140 |
+
--output data/repair_focus_char.jsonl `
|
| 141 |
+
--context-samples 50000 `
|
| 142 |
+
--repeat-repaired 4 `
|
| 143 |
+
--repeat-manual 24 `
|
| 144 |
+
--seed 75
|
| 145 |
+
|
| 146 |
+
uv run python train.py --tokenizer char `
|
| 147 |
+
--data-file data/repair_focus_char.jsonl `
|
| 148 |
+
--vocab-file datasets/AnimeName/vocab.char.json `
|
| 149 |
+
--save-dir checkpoints/dmhy-char-special-focus `
|
| 150 |
+
--init-model-dir . `
|
| 151 |
+
--epochs 1 `
|
| 152 |
+
--batch-size 64 `
|
| 153 |
+
--learning-rate 0.00003 `
|
| 154 |
+
--warmup-steps 50 `
|
| 155 |
+
--max-seq-length 128 `
|
| 156 |
+
--train-split 0.95 `
|
| 157 |
+
--num-workers 0 `
|
| 158 |
+
--checkpoint-steps 500 `
|
| 159 |
+
--save-total-limit 2 `
|
| 160 |
+
--parse-eval-limit 512 `
|
| 161 |
+
--case-eval-file data/parser_regression_cases.json `
|
| 162 |
+
--seed 75 `
|
| 163 |
+
--experiment-name dmhy-char-special-focus
|
| 164 |
+
```
|
| 165 |
+
|
| 166 |
+
## 7. Publish to Repository Root / 发布到仓库根目录
|
| 167 |
+
|
| 168 |
+
The repository root is the Hugging Face checkpoint surface.
|
| 169 |
+
|
| 170 |
+
仓库根目录就是 Hugging Face checkpoint 发布面。
|
| 171 |
+
|
| 172 |
+
```powershell
|
| 173 |
+
$final = "checkpoints/dmhy-char-full/final"
|
| 174 |
+
Copy-Item "$final/config.json" . -Force
|
| 175 |
+
Copy-Item "$final/model.safetensors" . -Force
|
| 176 |
+
Copy-Item "$final/tokenizer_config.json" . -Force
|
| 177 |
+
Copy-Item "$final/training_args.bin" . -Force
|
| 178 |
+
Copy-Item "$final/vocab.json" . -Force
|
| 179 |
+
Copy-Item "$final/run_metadata.json" . -Force
|
| 180 |
+
Copy-Item "$final/trainer_eval_metrics.json" . -Force
|
| 181 |
+
Copy-Item "$final/parse_eval_metrics.json" . -Force
|
| 182 |
+
Copy-Item "$final/case_metrics.json" . -Force
|
| 183 |
+
Copy-Item datasets/AnimeName/vocab.char.json .\vocab.char.json -Force
|
| 184 |
+
```
|
| 185 |
+
|
| 186 |
+
Then export ONNX:
|
| 187 |
+
|
| 188 |
+
然后导出 ONNX:
|
| 189 |
+
|
| 190 |
+
```powershell
|
| 191 |
+
uv run python export_onnx.py --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
|
| 192 |
+
```
|
| 193 |
+
|
| 194 |
+
## 8. Validation Checklist / 验证清单
|
| 195 |
+
|
| 196 |
+
Run these before committing:
|
| 197 |
+
|
| 198 |
+
提交前执行:
|
| 199 |
+
|
| 200 |
+
```powershell
|
| 201 |
+
uv run python -m py_compile tokenizer.py dataset.py dmhy_dataset.py label_repairs.py train.py inference.py export_onnx.py onnx_inference.py
|
| 202 |
+
uv run python evaluate_parser_cases.py --model-dir . --case-file data/parser_regression_cases.json --output case_metrics.json
|
| 203 |
+
uv run python inference.py --model-dir . "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
|
| 204 |
+
uv run python onnx_inference.py "[YYDM&VCB-Studio] Shinsekai Yori [NCED02][Ma10p_1080p][x265_flac].mkv"
|
| 205 |
+
```
|
| 206 |
+
|
| 207 |
+
## 9. Git and LFS Order / Git 与 LFS 顺序
|
| 208 |
+
|
| 209 |
+
If the dataset submodule changed:
|
| 210 |
+
|
| 211 |
+
如果数据集子模块有变动:
|
| 212 |
+
|
| 213 |
+
```powershell
|
| 214 |
+
git -C datasets/AnimeName add dmhy_weak.jsonl dmhy_weak.manifest.json dmhy_weak_char.jsonl dmhy_weak_char.manifest.json vocab.json vocab.char.json
|
| 215 |
+
git -C datasets/AnimeName commit -m "Update anime filename labels"
|
| 216 |
+
git -C datasets/AnimeName lfs push origin main --all
|
| 217 |
+
git -C datasets/AnimeName push origin main
|
| 218 |
+
```
|
| 219 |
+
|
| 220 |
+
Then commit the model repo:
|
| 221 |
+
|
| 222 |
+
再提交模型仓库:
|
| 223 |
+
|
| 224 |
+
```powershell
|
| 225 |
+
git add README.md MAINTENANCE.md ANDROID.md docs/training.md docs/onnx.md `
|
| 226 |
+
config.json model.safetensors tokenizer_config.json training_args.bin vocab.json vocab.char.json `
|
| 227 |
+
exports/anime_filename_parser.onnx exports/anime_filename_parser.metadata.json `
|
| 228 |
+
train.py inference.py export_onnx.py onnx_inference.py data/parser_regression_cases.json datasets/AnimeName
|
| 229 |
+
git commit -m "Update AniFileBERT model and documentation"
|
| 230 |
+
git lfs push origin main --all
|
| 231 |
+
git push origin main
|
| 232 |
+
```
|
| 233 |
+
|
export_onnx.py
CHANGED
|
@@ -66,9 +66,9 @@ def copy_android_assets(model_dir: Path, onnx_path: Path, assets_dir: Path) -> N
|
|
| 66 |
|
| 67 |
def main() -> None:
|
| 68 |
parser = argparse.ArgumentParser(description="Export anime filename parser to ONNX")
|
| 69 |
-
parser.add_argument("--model-dir", default="
|
| 70 |
parser.add_argument("--output", default="exports/anime_filename_parser.onnx", help="Output ONNX file")
|
| 71 |
-
parser.add_argument("--max-length", type=int, default=
|
| 72 |
parser.add_argument(
|
| 73 |
"--android-assets-dir",
|
| 74 |
help="Optional Android assets directory that receives the ONNX model, vocab, and config",
|
|
|
|
| 66 |
|
| 67 |
def main() -> None:
|
| 68 |
parser = argparse.ArgumentParser(description="Export anime filename parser to ONNX")
|
| 69 |
+
parser.add_argument("--model-dir", default=".", help="HuggingFace checkpoint directory")
|
| 70 |
parser.add_argument("--output", default="exports/anime_filename_parser.onnx", help="Output ONNX file")
|
| 71 |
+
parser.add_argument("--max-length", type=int, default=128, help="Fixed sequence length used on Android")
|
| 72 |
parser.add_argument(
|
| 73 |
"--android-assets-dir",
|
| 74 |
help="Optional Android assets directory that receives the ONNX model, vocab, and config",
|
onnx_inference.py
ADDED
|
@@ -0,0 +1,105 @@
+"""
+Minimal ONNX Runtime inference example for AniFileBERT.
+
+The ONNX file outputs token logits only. End-to-end parsing still needs the
+repository tokenizer, constrained BIO decoding, and the same field aggregation
+used by inference.py.
+
+Usage:
+    python onnx_inference.py "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
+"""
+
+import argparse
+import json
+from pathlib import Path
+from typing import Dict, List, Tuple
+
+import numpy as np
+import onnxruntime as ort
+import torch
+
+from inference import constrained_bio_decode, postprocess
+from tokenizer import AnimeTokenizer, load_tokenizer
+
+
+def encode(
+    filename: str,
+    tokenizer: AnimeTokenizer,
+    max_length: int,
+) -> Tuple[List[str], np.ndarray, np.ndarray, int]:
+    tokens = tokenizer.tokenize(filename)
+    available = min(len(tokens), max_length - 2)
+    used_tokens = tokens[:available]
+
+    input_ids = [tokenizer.cls_token_id]
+    input_ids.extend(tokenizer.convert_tokens_to_ids(used_tokens))
+    input_ids.append(tokenizer.sep_token_id)
+    attention_mask = [1] * len(input_ids)
+
+    pad_len = max_length - len(input_ids)
+    if pad_len > 0:
+        input_ids.extend([tokenizer.pad_token_id] * pad_len)
+        attention_mask.extend([0] * pad_len)
+
+    return (
+        used_tokens,
+        np.asarray([input_ids], dtype=np.int64),
+        np.asarray([attention_mask], dtype=np.int64),
+        available,
+    )
+
+
+def load_id2label(model_dir: Path) -> Dict[int, str]:
+    config = json.loads((model_dir / "config.json").read_text(encoding="utf-8"))
+    return {int(label_id): label for label_id, label in config["id2label"].items()}
+
+
+def parse_with_onnx(
+    filename: str,
+    model_dir: Path,
+    onnx_path: Path,
+    max_length: int,
+    use_rules: bool = True,
+) -> Dict:
+    tokenizer = load_tokenizer(str(model_dir))
+    id2label = load_id2label(model_dir)
+    tokens, input_ids, attention_mask, available = encode(filename, tokenizer, max_length)
+
+    session = ort.InferenceSession(str(onnx_path), providers=["CPUExecutionProvider"])
+    logits = session.run(
+        ["logits"],
+        {
+            "input_ids": input_ids,
+            "attention_mask": attention_mask,
+        },
+    )[0]
+
+    token_logits = torch.from_numpy(logits[0, 1:1 + available, :])
+    label_ids = constrained_bio_decode(token_logits, id2label)
+    labels = [id2label.get(label_id, "O") for label_id in label_ids]
+    result = postprocess(tokens, labels, tokenizer=tokenizer, filename=filename, use_rules=use_rules)
+    result["_input"] = filename
+    return result
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(description="Run AniFileBERT ONNX inference")
+    parser.add_argument("filename", help="Anime filename to parse")
+    parser.add_argument("--model-dir", default=".", help="Directory containing vocab.json and config.json")
+    parser.add_argument("--onnx", default="exports/anime_filename_parser.onnx", help="ONNX model path")
+    parser.add_argument("--max-length", type=int, default=128, help="Static ONNX sequence length")
+    parser.add_argument("--no-rule-assist", action="store_true", help="Disable structural postprocessing")
+    args = parser.parse_args()
+
+    result = parse_with_onnx(
+        filename=args.filename,
+        model_dir=Path(args.model_dir),
+        onnx_path=Path(args.onnx),
+        max_length=args.max_length,
+        use_rules=not args.no_rule_assist,
+    )
+    print(json.dumps(result, ensure_ascii=False))
+
+
+if __name__ == "__main__":
+    main()
train.py
CHANGED
|
@@ -1,11 +1,9 @@
|
|
| 1 |
"""
|
| 2 |
-
|
| 3 |
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
Usage:
|
| 8 |
-
python train.py
|
| 9 |
"""
|
| 10 |
|
| 11 |
import os
|
|
@@ -106,6 +104,12 @@ def parse_args() -> argparse.Namespace:
|
|
| 106 |
help="Optional experiment name written to run_metadata.json")
|
| 107 |
parser.add_argument("--parse-eval-limit", type=int, default=512,
|
| 108 |
help="Run field exact-match evaluation on up to N eval samples after training; 0 disables it")
|
|
| 109 |
parser.add_argument("--hidden-size", type=int, default=None, help="Override BERT hidden size")
|
| 110 |
parser.add_argument("--num-hidden-layers", type=int, default=None, help="Override BERT layer count")
|
| 111 |
parser.add_argument("--num-attention-heads", type=int, default=None, help="Override BERT attention heads")
|
|
@@ -626,6 +630,32 @@ def main():
|
|
| 626 |
total = parse_metrics["field_total"][field]
|
| 627 |
print(f" {field}: {correct}/{total} ({accuracy:.4f})")
|
| 628 |
|
|
| 629 |
|
| 630 |
if __name__ == "__main__":
|
| 631 |
main()
|
|
|
|
| 1 |
"""
|
| 2 |
+
Train AniFileBERT for structured anime filename parsing.
|
| 3 |
|
| 4 |
+
The training loop keeps the existing PyTorch/Transformers stack, writes
|
| 5 |
+
Hugging Face checkpoints, records token/entity metrics, and also evaluates
|
| 6 |
+
end-to-end parser exact-match on held-out filenames and fixed real-world cases.
|
|
| 7 |
"""
|
| 8 |
|
| 9 |
import os
|
|
|
|
| 104 |
help="Optional experiment name written to run_metadata.json")
|
| 105 |
parser.add_argument("--parse-eval-limit", type=int, default=512,
|
| 106 |
help="Run field exact-match evaluation on up to N eval samples after training; 0 disables it")
|
+    parser.add_argument("--case-eval-file", default=os.path.join("data", "parser_regression_cases.json"),
+                        help="Fixed real-world parser regression case file evaluated after training")
+    parser.add_argument("--case-eval-output", default=None,
+                        help="Optional output path for fixed case metrics; defaults to final/case_metrics.json")
+    parser.add_argument("--no-case-eval", action="store_true",
+                        help="Skip fixed real-world parser regression evaluation")
|
| 113 |
parser.add_argument("--hidden-size", type=int, default=None, help="Override BERT hidden size")
|
| 114 |
parser.add_argument("--num-hidden-layers", type=int, default=None, help="Override BERT layer count")
|
| 115 |
parser.add_argument("--num-attention-heads", type=int, default=None, help="Override BERT attention heads")
|
|
|
|
| 630 |
total = parse_metrics["field_total"][field]
|
| 631 |
print(f" {field}: {correct}/{total} ({accuracy:.4f})")
|
| 632 |
|
| 633 |
+
if not args.no_case_eval:
|
| 634 |
+
if args.case_eval_file and os.path.isfile(args.case_eval_file):
|
| 635 |
+
from evaluate_parser_cases import evaluate_cases
|
| 636 |
+
|
| 637 |
+
case_metrics = evaluate_cases(
|
| 638 |
+
model_dir=final_save_path,
|
| 639 |
+
case_file=args.case_eval_file,
|
| 640 |
+
tokenizer_variant=tokenizer_variant,
|
| 641 |
+
max_length=config.max_seq_length,
|
| 642 |
+
use_rules=True,
|
| 643 |
+
constrain_bio=True,
|
| 644 |
+
)
|
| 645 |
+
case_output = args.case_eval_output or os.path.join(final_save_path, "case_metrics.json")
|
| 646 |
+
os.makedirs(os.path.dirname(case_output) or ".", exist_ok=True)
|
| 647 |
+
with open(case_output, "w", encoding="utf-8") as f:
|
| 648 |
+
json.dump(case_metrics, f, ensure_ascii=False, indent=2)
|
| 649 |
+
print("\nFixed case regression evaluation:")
|
| 650 |
+
print(
|
| 651 |
+
f" full_match: {case_metrics['full_correct']}/"
|
| 652 |
+
f"{case_metrics['case_count']} ({case_metrics['full_accuracy']:.4f})"
|
| 653 |
+
)
|
| 654 |
+
if case_metrics["failures"]:
|
| 655 |
+
print(f" failures: {len(case_metrics['failures'])} (see {case_output})")
|
| 656 |
+
elif args.case_eval_file:
|
| 657 |
+
print(f"\nSkipping fixed case regression evaluation; file not found: {args.case_eval_file}")
|
| 658 |
+
|
| 659 |
|
| 660 |
if __name__ == "__main__":
|
| 661 |
main()
|