ModerRAS committed on
Commit
376db19
·
1 Parent(s): e8412e3

Polish Hugging Face repository docs

Files changed (9)
  1. .gitignore +1 -0
  2. ANDROID.md +86 -30
  3. MAINTENANCE.md +127 -61
  4. README.md +190 -145
  5. docs/onnx.md +154 -0
  6. docs/training.md +233 -0
  7. export_onnx.py +2 -2
  8. onnx_inference.py +105 -0
  9. train.py +36 -6
.gitignore CHANGED
@@ -9,6 +9,7 @@ test_checkpoints*/
9
  ab_checkpoints*/
10
  *.log
11
  *.onnx.data
 
12
  data/**/*.jsonl
13
  !data/synthetic_small.jsonl
14
  !data/test_smoke.jsonl
 
9
  ab_checkpoints*/
10
  *.log
11
  *.onnx.data
12
+ docs/training_notes.md
13
  data/**/*.jsonl
14
  !data/synthetic_small.jsonl
15
  !data/test_smoke.jsonl
ANDROID.md CHANGED
@@ -1,58 +1,114 @@
1
- # Android export and runtime
2
 
3
- This repository is used by MiruPlay as a Git submodule at
4
- `tools/anime_parser`. It contains the Python training pipeline plus an ONNX
5
- export path for Android.
6
 
7
- For the full scanner integration notes, file-vs-folder behavior, and device
8
- test procedure, see MiruPlay's `docs/anime-filename-parser.md`.
9
 
10
- ## Export
11
 
12
- From `tools/anime_parser`:
13
 
14
- ```bash
15
- python -m pip install -r requirements.txt
16
- python export_onnx.py --model-dir checkpoints/dmhy-finetune/final --android-assets-dir ../../scraper/src/main/assets/anime_parser
 
 
17
  ```
18
 
19
  The exporter writes:
20
 
 
 
21
  - `exports/anime_filename_parser.onnx`
22
  - `exports/anime_filename_parser.metadata.json`
23
  - `scraper/src/main/assets/anime_parser/anime_filename_parser.onnx`
24
  - `scraper/src/main/assets/anime_parser/vocab.json`
25
  - `scraper/src/main/assets/anime_parser/config.json`
26
 
27
- The ONNX graph uses fixed Android inputs:
 
 
 
 
 
 
28
 
29
- - `input_ids`: `int64[1,64]`
30
- - `attention_mask`: `int64[1,64]`
31
- - `logits`: `float32[1,64,15]`
32
 
33
- The current export was verified against PyTorch with max absolute logits
34
- difference `1.621246337890625e-05`.
35
 
36
- ## Runtime
37
 
38
- Android runs the exported graph through ONNX Runtime Android. Tokenization and
39
- BIO postprocessing are implemented in:
 
40
 
41
- `scraper/src/main/kotlin/com/miruplay/tv/scraper/filename/AnimeFilenameParser.kt`
42
 
43
- The app exposes it through `FilenameMetadataParser` in `core:model`. During a
44
- scan, `ScanCoordinator` passes that parser into `VideoDirectoryClassifier`; the
45
- classifier keeps the existing release/folder regexes first and lazily calls the
46
- model only when those heuristics are missing title, season, or episode data.
 
 
 
 
 
 
 
 
 
 
 
47
 
48
- Example Kotlin usage:
49
 
50
- ```kotlin
51
- val parsed = animeFilenameParser.parse("[ANi] 葬送的芙莉莲 S2 - 03 [1080P][WEB-DL]")
52
  ```
53
 
54
- Expected fields:
 
 
 
 
 
 
 
 
 
 
55
 
56
  ```text
57
- title=葬送的芙莉莲, season=2, episode=3, group=ANi, resolution=1080P, source=WEB-DL
 
 
58
  ```
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Android Export and Runtime / Android 导出与运行时
2
 
3
+ AniFileBERT is used by MiruPlay as a Git submodule at `tools/anime_parser`.
 
 
4
 
5
+ AniFileBERT MiruPlay 中作为 `tools/anime_parser` 子模块使用。
 
6
 
7
+ ## Export / 导出
8
 
9
+ From this repository root, export the published root checkpoint:
10
 
11
+ 在本仓库根目录导出当前发布 checkpoint:
12
+
13
+ ```powershell
14
+ uv sync
15
+ uv run python export_onnx.py --model-dir . --max-length 128 --android-assets-dir ../../scraper/src/main/assets/anime_parser
16
  ```
17
 
18
  The exporter writes:
19
 
20
+ 导出器会写入:
21
+
22
  - `exports/anime_filename_parser.onnx`
23
  - `exports/anime_filename_parser.metadata.json`
24
  - `scraper/src/main/assets/anime_parser/anime_filename_parser.onnx`
25
  - `scraper/src/main/assets/anime_parser/vocab.json`
26
  - `scraper/src/main/assets/anime_parser/config.json`
27
 
28
+ ## Static Graph Shape / 静态图 Shape
29
+
30
+ ```text
31
+ input_ids int64[1,128]
32
+ attention_mask int64[1,128]
33
+ logits float32[1,128,15]
34
+ ```
35
 
36
+ The current export is verified against PyTorch, with max absolute logits
37
+ difference recorded in `exports/anime_filename_parser.metadata.json`.
 
38
 
39
+ 当前导出会和 PyTorch 做数值对齐,最大 logits 误差记录在
40
+ `exports/anime_filename_parser.metadata.json`。
41
 
42
+ ## Local ONNX Smoke Test / 本地 ONNX 冒烟测试
43
 
44
+ ```powershell
45
+ uv run python onnx_inference.py "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
46
+ ```
47
 
48
+ Expected fields / 期望字段:
49
 
50
+ ```text
51
+ title=神印王座, episode=200, group=GM-Team, resolution=1080P, source=GB
52
+ ```
53
+
54
+ Special-code example / 特典编号示例:
55
+
56
+ ```powershell
57
+ uv run python onnx_inference.py "[YYDM&VCB-Studio] Shinsekai Yori [NCED02][Ma10p_1080p][x265_flac].mkv"
58
+ ```
59
+
60
+ Expected fields / 期望字段:
61
+
62
+ ```text
63
+ title=Shinsekai Yori, episode=null, group=YYDM&VCB-Studio, special=NCED02
64
+ ```
65
 
66
+ ## Runtime Contract / 运行时契约
67
 
68
+ The ONNX graph returns token logits only. Android must implement the same:
69
+
70
+ ONNX 图只返回 token logits。Android 必须实现同一套:
71
+
72
+ - custom character tokenizer / 自定义字符 tokenizer
73
+ - token id lookup from `vocab.json` / 使用 `vocab.json` 查 token id
74
+ - fixed-length padding to 128 / padding 到固定长度 128
75
+ - constrained BIO decoding / 约束 BIO 解码
76
+ - field aggregation / 字段聚合
77
+ - high-confidence structural cleanup / 高置信结构修正
78
+
79
+ The Android runtime implementation lives in MiruPlay:
80
+
81
+ Android 运行时实现位于 MiruPlay:
82
+
83
+ ```text
84
+ scraper/src/main/kotlin/com/miruplay/tv/scraper/filename/AnimeFilenameParser.kt
85
  ```
86
 
87
+ The app exposes it through `FilenameMetadataParser` in `core:model`. During a
88
+ scan, `ScanCoordinator` passes that parser into `VideoDirectoryClassifier`.
89
+
90
+ 应用通过 `core:model` 的 `FilenameMetadataParser` 暴露解析能力。扫描时,
91
+ `ScanCoordinator` 会把解析器传给 `VideoDirectoryClassifier`。
92
+
93
+ ## Asset Update Rule / 资产更新规则
94
+
95
+ When updating the parser, keep these files in sync:
96
+
97
+ 更新解析器时,以下文件必须同步:
98
 
99
  ```text
100
+ anime_filename_parser.onnx
101
+ vocab.json
102
+ config.json
103
  ```
104
+
105
+ Do not update only the ONNX file. Token ids, label ids, and max length are part
106
+ of the runtime contract.
107
+
108
+ 不要只更新 ONNX。token id、label id 和 max length 都是运行时契约的一部分。
109
+
110
+ ## More Details / 更多说明
111
+
112
+ See [`docs/onnx.md`](docs/onnx.md) for a minimal Python ONNX Runtime reference.
113
+
114
+ 最小 Python ONNX Runtime 参考见 [`docs/onnx.md`](docs/onnx.md)。
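Reviewer note (illustrative, not part of this commit): the runtime contract in the new ANDROID.md asks every client to turn a filename into two fixed-length `int64[1,128]` arrays before calling the graph. A minimal Python sketch of that step follows. It assumes a `[CLS]`/`[SEP]` frame, a `[PAD]`/`[UNK]` pair in the vocabulary, and truncation of over-long names; `encode` and `toy_vocab` are hypothetical names, and the real behavior is whatever `tokenizer.py` and `vocab.json` define.

```python
# Hypothetical sketch: character-level encoding padded to the static ONNX shape.
# The special-token names and the [CLS]/[SEP] framing are assumptions, not
# confirmed behavior of tokenizer.py.
from typing import Dict, List, Tuple

MAX_LENGTH = 128  # matches the static graph shape int64[1,128]


def encode(filename: str, vocab: Dict[str, int],
           max_length: int = MAX_LENGTH) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), each exactly max_length long."""
    unk = vocab["[UNK]"]
    # Reserve two positions for the assumed [CLS] and [SEP] tokens.
    chars = list(filename)[: max_length - 2]
    ids = [vocab["[CLS]"]] + [vocab.get(ch, unk) for ch in chars] + [vocab["[SEP]"]]
    mask = [1] * len(ids)
    pad = max_length - len(ids)
    # Padding positions get the [PAD] id and a zero in the attention mask.
    return ids + [vocab["[PAD]"]] * pad, mask + [0] * pad


# Toy vocabulary for illustration only; the real ids come from vocab.json.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "a": 4, "b": 5}
input_ids, attention_mask = encode("ab?", toy_vocab)
```

In a real client the ids would be looked up in the shipped `vocab.json`, and both lists would be wrapped into a batch of one before inference.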
MAINTENANCE.md CHANGED
@@ -1,117 +1,183 @@
1
- # AniFileBERT Maintenance
2
 
3
  This repository is the standalone Hugging Face model repo used by MiruPlay as
4
  `tools/anime_parser`.
5
 
6
- ## Related Repositories
7
 
8
- | Repository | URL | Purpose |
9
- |------------|-----|---------|
10
- | AniFileBERT | `https://huggingface.co/ModerRAS/AniFileBERT` | Model, training scripts, ONNX export |
11
- | AnimeName | `https://huggingface.co/datasets/ModerRAS/AnimeName` | Training datasets and manifests |
12
- | MiruPlay | `https://github.com/ModerRAS/MiruPlay` | Android app and runtime integration |
13
 
14
- Nested structure:
 
 
 
 
 
 
15
 
16
  ```text
17
  AniFileBERT
18
  datasets/AnimeName -> ModerRAS/AnimeName
19
  ```
20
 
21
- ## Clone
22
 
23
- ```bash
24
  git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
25
  ```
26
 
27
- After a normal clone:
28
 
29
- ```bash
30
  git submodule update --init --recursive
 
31
  ```
32
 
33
- ## Dataset Waterline
 
 
34
 
35
- Current DMHY snapshot:
36
 
37
  ```text
38
- labeled_samples: 632002
39
- char_vocab_size: 6199
40
- strict_bio_violations: 0
 
 
 
 
 
 
 
41
  ```
42
 
43
- The authoritative dataset files live in `datasets/AnimeName`.
 
44
 
45
- ## Train
46
 
47
- ```bash
48
- uv sync
49
- uv run python train.py \
50
- --tokenizer char \
51
- --data-file datasets/AnimeName/dmhy_weak_char.jsonl \
52
- --vocab-file datasets/AnimeName/vocab.char.json \
53
- --save-dir checkpoints/dmhy-char-guoman-relabel \
54
- --init-model-dir . \
55
- --epochs 2 \
56
- --batch-size 256 \
57
- --learning-rate 0.00008 \
58
- --warmup-steps 300 \
59
- --max-seq-length 128 \
60
- --checkpoint-steps 1000 \
61
- --parse-eval-limit 2048 \
62
- --seed 52
 
 
 
 
 
 
 
 
 
 
 
63
  ```
64
 
65
- ## Publish a New Checkpoint
 
 
66
 
67
- Copy the final checkpoint to the repository root:
68
 
69
  ```powershell
70
- Copy-Item checkpoints/dmhy-char-guoman-relabel/final/config.json . -Force
71
- Copy-Item checkpoints/dmhy-char-guoman-relabel/final/model.safetensors . -Force
72
- Copy-Item checkpoints/dmhy-char-guoman-relabel/final/tokenizer_config.json . -Force
73
- Copy-Item checkpoints/dmhy-char-guoman-relabel/final/training_args.bin . -Force
74
- Copy-Item checkpoints/dmhy-char-guoman-relabel/final/vocab.json . -Force
 
 
 
 
 
75
  Copy-Item datasets/AnimeName/vocab.char.json .\vocab.char.json -Force
76
- Copy-Item checkpoints/dmhy-char-guoman-relabel/final/run_metadata.json . -Force
77
- Copy-Item checkpoints/dmhy-char-guoman-relabel/final/trainer_eval_metrics.json . -Force
78
- Copy-Item checkpoints/dmhy-char-guoman-relabel/final/parse_eval_metrics.json . -Force
79
  ```
80
 
81
- There is no tracked `model/` duplicate. The root checkpoint is the publishing
82
- surface; ignored `checkpoints/` directories are training artifacts.
83
 
84
- Then commit and push:
 
 
85
 
86
- ```bash
87
- git add .
88
- git commit -m "Update AniFileBERT checkpoint"
89
- git push origin main
90
  ```
91
 
92
- ## Update the Dataset Submodule
93
 
94
- After pushing new files to `ModerRAS/AnimeName`, update the nested pointer:
95
 
96
- ```bash
97
- git submodule update --remote datasets/AnimeName
98
  git add datasets/AnimeName
99
  git commit -m "Update AnimeName dataset pointer"
 
 
 
 
 
 
 
 
 
 
 
 
100
  git push origin main
101
  ```
102
 
103
- ## Update MiruPlay
 
 
 
 
 
 
 
104
 
105
- From the MiruPlay root:
106
 
107
- ```bash
 
 
 
 
108
  git submodule update --remote --recursive tools/anime_parser
109
  git add tools/anime_parser
110
  git commit -m "Update AniFileBERT submodule"
111
- git push origin master
112
  ```
113
 
114
- If a new ONNX export changed Android runtime assets, also stage:
 
 
115
 
116
  ```text
117
  scraper/src/main/assets/anime_parser/anime_filename_parser.onnx
 
1
+ # AniFileBERT Maintenance / 维护手册
2
 
3
  This repository is the standalone Hugging Face model repo used by MiruPlay as
4
  `tools/anime_parser`.
5
 
6
+ 本仓库是 MiruPlay 通过 `tools/anime_parser` 引用的独立 Hugging Face 模型仓库。
7
 
8
+ ## Related Repositories / 相关仓库
 
 
 
 
9
 
10
+ | Repository / 仓库 | URL | Purpose / 用途 |
11
+ | --- | --- | --- |
12
+ | AniFileBERT | `https://huggingface.co/ModerRAS/AniFileBERT` | Model, scripts, ONNX export / 模型、脚本、ONNX 导出 |
13
+ | AnimeName | `https://huggingface.co/datasets/ModerRAS/AnimeName` | Dataset snapshot / 数据集快照 |
14
+ | MiruPlay | `https://github.com/ModerRAS/MiruPlay` | Android integration / Android 集成 |
15
+
16
+ Nested structure / 嵌套结构:
17
 
18
  ```text
19
  AniFileBERT
20
  datasets/AnimeName -> ModerRAS/AnimeName
21
  ```
22
 
23
+ ## Clone / 克隆
24
 
25
+ ```powershell
26
  git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
27
  ```
28
 
29
+ After a normal clone / 普通 clone 后:
30
 
31
+ ```powershell
32
  git submodule update --init --recursive
33
+ uv sync
34
  ```
35
 
36
+ ## Publishing Surface / 发布面
37
+
38
+ The repository root is the only published Hugging Face checkpoint location:
39
 
40
+ 仓库根目录是唯一的 Hugging Face checkpoint 发布位置:
41
 
42
  ```text
43
+ config.json
44
+ model.safetensors
45
+ tokenizer_config.json
46
+ training_args.bin
47
+ vocab.json
48
+ vocab.char.json
49
+ run_metadata.json
50
+ trainer_eval_metrics.json
51
+ parse_eval_metrics.json
52
+ case_metrics.json
53
  ```
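Reviewer note (illustrative, not part of this commit): the file list above can be checked mechanically before a push. The sketch below is an assumption about how one might do that in Python; `missing_root_files` is a hypothetical helper, not a script in this repository.

```python
# Hypothetical pre-publish check: report which published checkpoint files
# are absent from the repository root. The file names are copied from the
# "Publishing Surface" list in MAINTENANCE.md.
from pathlib import Path
from typing import List

REQUIRED_ROOT_FILES = [
    "config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "training_args.bin",
    "vocab.json",
    "vocab.char.json",
    "run_metadata.json",
    "trainer_eval_metrics.json",
    "parse_eval_metrics.json",
    "case_metrics.json",
]


def missing_root_files(repo_root: str = ".") -> List[str]:
    """Return the required file names that do not exist under repo_root."""
    root = Path(repo_root)
    return [name for name in REQUIRED_ROOT_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_root_files(".")
    print("missing:", missing if missing else "none")
```

An empty result only shows that the files exist; it does not confirm that they all come from the same training run.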
54
 
55
+ There is no tracked `model/` duplicate. Ignored `checkpoints/` directories are
56
+ local training artifacts only.
57
 
58
+ 仓库不再跟踪旧的 `model/` 副本。被 ignore 的 `checkpoints/` 仅是本地训练产物。
59
 
60
+ ## Standard Training / 标准训练
61
+
62
+ For full details, see [`docs/training.md`](docs/training.md).
63
+
64
+ 完整流程见 [`docs/training.md`](docs/training.md)。
65
+
66
+ Recommended full training command / 推荐全量训练命令:
67
+
68
+ ```powershell
69
+ uv run python train.py --tokenizer char `
70
+ --data-file datasets/AnimeName/dmhy_weak_char.jsonl `
71
+ --vocab-file datasets/AnimeName/vocab.char.json `
72
+ --save-dir checkpoints/dmhy-char-full `
73
+ --init-model-dir . `
74
+ --epochs 2 `
75
+ --batch-size 256 `
76
+ --learning-rate 0.00008 `
77
+ --warmup-steps 300 `
78
+ --max-seq-length 128 `
79
+ --train-split 0.98 `
80
+ --num-workers 4 `
81
+ --checkpoint-steps 1000 `
82
+ --save-total-limit 3 `
83
+ --parse-eval-limit 2048 `
84
+ --case-eval-file data/parser_regression_cases.json `
85
+ --seed 52 `
86
+ --experiment-name dmhy-char-full
87
  ```
88
 
89
+ ## Publish a New Checkpoint / 发布新 checkpoint
90
+
91
+ Copy final files to the repository root:
92
 
93
+ `final` 文件复制到仓库根目录:
94
 
95
  ```powershell
96
+ $final = "checkpoints/dmhy-char-full/final"
97
+ Copy-Item "$final/config.json" . -Force
98
+ Copy-Item "$final/model.safetensors" . -Force
99
+ Copy-Item "$final/tokenizer_config.json" . -Force
100
+ Copy-Item "$final/training_args.bin" . -Force
101
+ Copy-Item "$final/vocab.json" . -Force
102
+ Copy-Item "$final/run_metadata.json" . -Force
103
+ Copy-Item "$final/trainer_eval_metrics.json" . -Force
104
+ Copy-Item "$final/parse_eval_metrics.json" . -Force
105
+ Copy-Item "$final/case_metrics.json" . -Force
106
  Copy-Item datasets/AnimeName/vocab.char.json .\vocab.char.json -Force
 
 
 
107
  ```
108
 
109
+ Export ONNX / 导出 ONNX:
 
110
 
111
+ ```powershell
112
+ uv run python export_onnx.py --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
113
+ ```
114
 
115
+ Validate / 验证:
116
+
117
+ ```powershell
118
+ uv run python evaluate_parser_cases.py --model-dir . --case-file data/parser_regression_cases.json --output case_metrics.json
119
+ uv run python onnx_inference.py "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
120
+ ```
121
+
122
+ ## Dataset Submodule / 数据集子模块
123
+
124
+ If `datasets/AnimeName` changed, commit and push it first:
125
+
126
+ 如果 `datasets/AnimeName` 有变动,先提交并推送它:
127
+
128
+ ```powershell
129
+ git -C datasets/AnimeName add dmhy_weak.jsonl dmhy_weak.manifest.json dmhy_weak_char.jsonl dmhy_weak_char.manifest.json vocab.json vocab.char.json
130
+ git -C datasets/AnimeName commit -m "Update anime filename labels"
131
+ git -C datasets/AnimeName lfs push origin main --all
132
+ git -C datasets/AnimeName push origin main
133
  ```
134
 
135
+ Then commit the submodule pointer in this repo:
136
 
137
+ 然后在本仓库提交 submodule pointer
138
 
139
+ ```powershell
 
140
  git add datasets/AnimeName
141
  git commit -m "Update AnimeName dataset pointer"
142
+ ```
143
+
144
+ ## LFS Push Order / LFS 推送顺序
145
+
146
+ Large model artifacts are tracked with Git LFS. If Hugging Face rejects a push
147
+ because an LFS pointer points to a missing object, upload LFS objects first:
148
+
149
+ 大模型文件通过 Git LFS 跟踪。如果 Hugging Face 因 LFS pointer 缺对象拒绝 push,
150
+ 先上传 LFS 对象:
151
+
152
+ ```powershell
153
+ git lfs push origin main --all
154
  git push origin main
155
  ```
156
 
157
+ For dataset changes:
158
+
159
+ 数据集变动:
160
+
161
+ ```powershell
162
+ git -C datasets/AnimeName lfs push origin main --all
163
+ git -C datasets/AnimeName push origin main
164
+ ```
165
 
166
+ ## Update MiruPlay / 更新 MiruPlay
167
 
168
+ From MiruPlay root:
169
+
170
+ 在 MiruPlay 根目录:
171
+
172
+ ```powershell
173
  git submodule update --remote --recursive tools/anime_parser
174
  git add tools/anime_parser
175
  git commit -m "Update AniFileBERT submodule"
 
176
  ```
177
 
178
+ If Android assets changed, also stage:
179
+
180
+ 如果 Android assets 变化,也要提交:
181
 
182
  ```text
183
  scraper/src/main/assets/anime_parser/anime_filename_parser.onnx
README.md CHANGED
@@ -3,93 +3,100 @@ license: apache-2.0
3
  library_name: transformers
4
  pipeline_tag: token-classification
5
  tags:
6
- - anime
7
- - filename-parsing
8
- - bert
9
- - token-classification
 
10
  datasets:
11
- - ModerRAS/AnimeName
12
  language:
13
- - en
14
- - ja
15
- - zh
 
 
 
 
 
 
 
 
 
 
 
 
 
16
  ---
17
 
18
  # AniFileBERT
19
 
20
- AniFileBERT is a tiny BERT token-classification model for parsing anime release filenames into structured fields such as release group, title, season, episode, resolution, source, and special tags.
21
 
22
- The checkpoint in this repository is the full-relabel DMHY character-token model used by MiruPlay.
23
 
24
- ## Model
25
 
26
- - Architecture: `BertForTokenClassification`
27
- - Hidden size: 256
28
- - Layers: 4
29
- - Attention heads: 8
30
- - Labels: BIO token labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, and `SPECIAL`
31
- - Tokenizer: custom character tokenizer implemented in `tokenizer.py`
32
- - Max sequence length: 128
33
- - Parameters: 4,783,631
34
 
35
- The model files are stored at the repository root so `BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")` can load the weights. Use `inference.py` for end-to-end parsing because the tokenizer is custom rather than a standard WordPiece tokenizer.
 
 
 
 
 
 
 
 
 
 
 
36
 
37
- ## Dataset
38
 
39
- Training data snapshots are published separately in [`ModerRAS/AnimeName`](https://huggingface.co/datasets/ModerRAS/AnimeName), and this repository includes it as a nested git submodule at `datasets/AnimeName`.
40
 
41
- Current DMHY export waterline (from `datasets/AnimeName`):
42
 
43
- - Last exported `files.id`: `1675184`
44
- - Next incremental export: `--min-id 1675185`
45
- - Weak-labeled samples: `632002`
46
- - Mixed training samples: `732002`
47
 
48
- ## Vocabulary
 
 
49
 
50
- The published checkpoint uses a character vocabulary. `vocab.json` at the
51
- repository root is the deployed tokenizer vocab, and `vocab.char.json` is kept
52
- as a mirrored explicit copy for training/data maintenance. The full DMHY weak
53
- dataset has **6195 unique characters**, so the complete character vocab is only
54
- **6199** entries including special tokens and reaches 100% token coverage.
55
 
56
- The regex vocabulary is still maintained in `datasets/AnimeName/vocab.json` for
57
- dataset relabeling and diagnostics, but the root checkpoint loads as `char`.
 
58
 
59
- ## Evaluation
60
 
61
- Final full-relabel char training (`632002` DMHY rows, 2 epochs, batch size 256,
62
- seed 52):
 
63
 
64
- | Metric | Value |
65
- |--------|-------|
66
- | Eval loss | 0.0058 |
67
- | Entity precision | 0.9922 |
68
- | Entity recall | 0.9946 |
69
- | Entity F1 | 0.9934 |
70
- | Token accuracy | 0.9981 |
71
- | Held-out parse full match | 2029/2048 (0.9907) |
72
- | Fixed regression full match | 22/22 (1.0000) |
73
 
74
- The fixed regression set includes second-season aliases such as `Ni`,
75
- `Ni no Sara`, `貳`, and `弐ノ章`, plus GM-Team bilingual Chinese animation
76
- bracket layouts, long-running episode IDs, and dense meta blocks.
77
 
78
- ## Usage
79
 
80
- Install dependencies:
81
 
82
- ```bash
83
- uv sync
84
  ```
85
 
86
- Parse a filename with this repository cloned locally:
87
 
88
- ```bash
89
- python inference.py --model-dir . "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub"
90
  ```
91
 
92
- Load only the model weights from the Hub:
93
 
94
  ```python
95
  from transformers import BertForTokenClassification
@@ -97,114 +104,152 @@ from transformers import BertForTokenClassification
97
  model = BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
98
  ```
99
 
100
- For full parsing, clone this repo and use `load_tokenizer` from `tokenizer.py` or the CLI in `inference.py`.
101
 
102
- ## Clone with Dataset Submodule
103
 
104
- ```bash
105
- git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
106
- # or, after a normal clone:
107
- git submodule update --init --recursive
108
- ```
109
 
110
- ## Training
111
-
112
- ### Character-token DMHY training
113
-
114
- ```bash
115
- uv run python convert_to_char_dataset.py \
116
- --input datasets/AnimeName/dmhy_weak.jsonl \
117
- --output datasets/AnimeName/dmhy_weak_char.jsonl \
118
- --vocab-output datasets/AnimeName/vocab.char.json \
119
- --manifest-output datasets/AnimeName/dmhy_weak_char.manifest.json
120
-
121
- uv run python train.py --tokenizer char \
122
- --data-file datasets/AnimeName/dmhy_weak_char.jsonl \
123
- --vocab-file datasets/AnimeName/vocab.char.json \
124
- --save-dir checkpoints/dmhy-char-guoman-relabel \
125
- --init-model-dir . \
126
- --epochs 2 --batch-size 256 \
127
- --learning-rate 0.00008 --warmup-steps 300 \
128
- --checkpoint-steps 1000 --save-total-limit 3 \
129
- --parse-eval-limit 2048 \
130
- --max-seq-length 128 --seed 52
131
- ```
132
 
133
- The converter keeps source metadata and adds `tokenizer_variant`, source token
134
- count, and character token count fields to each record. The char dataset's
135
- p99 length is 107 characters, so `--max-seq-length 128` covers almost all rows
136
- while leaving room for `[CLS]` and `[SEP]`.
137
-
138
- ### Relabel the full dataset
139
-
140
- ```bash
141
- uv run python relabel_dataset_from_filenames.py \
142
- --input datasets/AnimeName/dmhy_weak.jsonl \
143
- --output datasets/AnimeName/dmhy_weak.relabel.jsonl \
144
- --manifest-output datasets/AnimeName/dmhy_weak.relabel.manifest.json \
145
- --vocab-output datasets/AnimeName/vocab.relabel.json \
146
- --base-vocab datasets/AnimeName/vocab.json \
147
- --max-vocab-size 8000
148
-
149
- Move-Item datasets/AnimeName/dmhy_weak.relabel.jsonl datasets/AnimeName/dmhy_weak.jsonl -Force
150
- Move-Item datasets/AnimeName/vocab.relabel.json datasets/AnimeName/vocab.json -Force
151
- Copy-Item datasets/AnimeName/dmhy_weak.relabel.manifest.json datasets/AnimeName/dmhy_weak.manifest.json -Force
152
- Remove-Item datasets/AnimeName/dmhy_weak.relabel.manifest.json -Force
153
- ```
154
 
155
- ### Rebuild vocabulary (if needed)
156
 
157
- ```bash
158
- python -c "
159
- import json, collections
160
- tokens = collections.Counter()
161
- [ tokens.update(item['tokens']) for item in [json.loads(l) for l in open('datasets/AnimeName/dmhy_weak.jsonl')] if item ]
162
- vocab = {t:i for i,t in enumerate(['[PAD]','[UNK]','[CLS]','[SEP]'] + [t for t,_ in tokens.most_common(7996)])}
163
- json.dump(vocab, open('vocab.json','w'), ensure_ascii=False, indent=2)
164
- "
165
  ```
166
 
167
- ### Export ONNX for MiruPlay Android
 
 
 
 
 
 
 
 
168
 
169
- ```bash
170
- uv run python export_onnx.py --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
171
  ```
172
 
173
- ---
174
 
175
- ## Google Colab Training
176
 
177
- For Codex-controlled short Colab sessions, see [`colab/README.md`](colab/README.md).
178
- Free Colab still has to be started manually, but once `colab_worker.py` is
179
- running Codex can submit jobs through `colab_client.py`, tail logs, and inspect
180
- status. Checkpoints live on Google Drive and default profiles resume from the
181
- latest checkpoint automatically.
182
 
183
- Manual one-shot runs are also supported:
 
 
 
 
 
 
 
184
 
185
- ```bash
186
- python colab_train.py --profile dmhy_regex_finetune
187
  ```
188
 
189
- ## Repository Layout
190
 
191
- - `model.safetensors`, `config.json`, `vocab.json`: default published model
192
- - `train.py`, `dataset.py`, `tokenizer.py`, `model.py`: training pipeline
193
- - `dmhy_dataset.py`, `mix_datasets.py`: weak-label export and dataset mixing
194
- - `convert_to_char_dataset.py`: full character-token projection for weak labels
195
- - `inference.py`: end-to-end filename parser CLI
196
- - `export_onnx.py`: ONNX export for Android integration
197
- - `exports/`: exported ONNX model and metadata
198
- - `datasets/AnimeName/`: nested dataset submodule
199
 
200
- ## Maintenance Notes
201
 
202
- MiruPlay tracks this repository as `tools/anime_parser`, and this repository
203
- tracks `ModerRAS/AnimeName` as `datasets/AnimeName`. After updating either
204
- repo, remember to commit the submodule pointer in the parent repo.
205
 
206
- For the full maintenance workflow, see MiruPlay's
207
- `docs/anifilebert-maintenance.md`.
 
208
 
 
209
 
 
 
 
210
 
 
3
  library_name: transformers
4
  pipeline_tag: token-classification
5
  tags:
6
+ - anime
7
+ - filename-parsing
8
+ - bert
9
+ - token-classification
10
+ - onnx
11
  datasets:
12
+ - ModerRAS/AnimeName
13
  language:
14
+ - en
15
+ - ja
16
+ - zh
17
+ model-index:
18
+ - name: AniFileBERT
19
+ results:
20
+ - task:
21
+ type: token-classification
22
+ name: Anime filename token classification
23
+ dataset:
24
+ name: AniFileBERT fixed parser regression cases
25
+ type: parser-regression
26
+ metrics:
27
+ - type: accuracy
28
+ name: Fixed parser full-match accuracy
29
+ value: 1.0
30
  ---
31
 
32
  # AniFileBERT
33
 
34
+ **中文**:AniFileBERT 是一个面向番剧发布文件名的轻量级 BERT token-classification 解析器。它把常见发布名解析为结构化字段:字幕组、标题、季、集数、分辨率、来源和 special tag。
35
 
36
+ **English**: AniFileBERT is a lightweight BERT token-classification parser for anime release filenames. It extracts structured fields: release group, title, season, episode, resolution, source, and special tags.
37
 
38
+ This repository is the Hugging Face model repo used by MiruPlay as `tools/anime_parser`.
39
 
40
+ ## Model Details / 模型信息
 
 
 
 
 
 
 
41
 
42
+ | Item | Value |
43
+ | --- | --- |
44
+ | Architecture / 架构 | `BertForTokenClassification` |
45
+ | Tokenizer / 分词器 | Custom character tokenizer in `tokenizer.py` |
46
+ | Parameters / 参数量 | 4,783,631 |
47
+ | Hidden size / 隐层维度 | 256 |
48
+ | Layers / 层数 | 4 |
49
+ | Attention heads / 注意力头 | 8 |
50
+ | Max sequence length / 最大长度 | 128 |
51
+ | Labels / 标签 | BIO labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, `SPECIAL` |
52
+ | Default checkpoint / 默认权重 | Repository root files (`config.json`, `model.safetensors`, `vocab.json`, `tokenizer_config.json`) |
53
+ | ONNX export / ONNX 导出 | `exports/anime_filename_parser.onnx` |
54
 
55
+ **中文**:根目录就是发布 checkpoint,不再保留旧的 `model/` 重复副本。完整解析请使用本仓库的 `inference.py` 或复用 `tokenizer.py`、BIO decode 和字段聚合逻辑;直接 `from_pretrained()` 只能加载 token-classification 权重。
56
 
57
+ **English**: The repository root is the published checkpoint. The old duplicate `model/` directory is intentionally not used. For end-to-end parsing, use `inference.py` or reuse this repo's tokenizer, BIO decoder, and field aggregation logic; `from_pretrained()` only loads token-classification weights.
58
 
59
+ ## Intended Use / 使用场景
60
 
61
+ **中文**
 
 
 
62
 
63
+ - 解析番剧/动画发布文件名,用于媒体库刮削、归类、搜索和展示。
64
+ - 覆盖常见结构:`[GROUP] TITLE - EP [META]`、点分隔 `S01E07`、国漫多括号标题、BD 特典 `NCOP/NCED/IV05`、长集数、第二季别名等。
65
+ - 不适合泛化为自然语言 NER;这是结构化文件名解析任务。
66
 
67
+ **English**
 
 
 
 
68
 
69
+ - Parse anime release filenames for media library scraping, classification, search, and display.
70
+ - Covers common layouts: `[GROUP] TITLE - EP [META]`, dotted `S01E07`, Chinese animation bracket layouts, BD extras such as `NCOP/NCED/IV05`, long-running episode numbers, and season aliases.
71
+ - This is not a general natural-language NER model; it is a structured filename parser.
72
 
73
+ ## Install / 安装
74
 
75
+ ```powershell
76
+ uv sync
77
+ ```
78
 
79
+ If the dataset submodule is missing:
 
 
 
 
 
 
 
 
80
 
81
+ ```powershell
82
+ git submodule update --init --recursive
83
+ ```
84
 
85
+ ## Quick Start / 快速使用
86
 
87
+ Run the Python parser:
88
 
89
+ ```powershell
90
+ uv run python inference.py --model-dir . "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
91
  ```
92
 
93
+ Expected output:
94
 
95
+ ```json
96
+ {"title":"神印王座","season":null,"episode":200,"group":"GM-Team","resolution":"1080P","source":"GB","special":null}
97
  ```
98
 
99
+ Load the raw Transformers model:
100
 
101
  ```python
102
  from transformers import BertForTokenClassification
 
104
  model = BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
105
  ```
106
 
107
+ **中文**:如果需要完整字段解析,请 clone 本仓库并使用 `inference.py`,因为分词器和后处理是自定义的。
108
 
109
+ **English**: For complete field parsing, clone this repo and use `inference.py`; the tokenizer and postprocessing are custom.
110
 
111
+ ## ONNX Usage / ONNX 使用
 
 
 
 
112
 
113
+ The ONNX graph outputs token logits only. A complete parser still needs:
114
 
115
+ 1. custom character tokenization,
116
+ 2. constrained BIO decoding,
117
+ 3. field aggregation and high-confidence structural cleanup.
118
 
119
+ 本仓库提供最小可运行示例:
120
 
121
+ ```powershell
122
+ uv run python onnx_inference.py "[YYDM&VCB-Studio] Shinsekai Yori [NCED02][Ma10p_1080p][x265_flac].mkv"
 
 
 
 
 
 
123
  ```
124
 
125
+ Static graph shapes:
126
+
127
+ - `input_ids`: `int64[1,128]`
128
+ - `attention_mask`: `int64[1,128]`
129
+ - `logits`: `float32[1,128,15]`
130
+
131
+ More details: [`docs/onnx.md`](docs/onnx.md) and [`ANDROID.md`](ANDROID.md).
132
+
133
+ ## Evaluation / 评估
134
 
135
+ Current published checkpoint:
136
+
137
+ | Metric / 指标 | Value / 数值 |
138
+ | --- | --- |
139
+ | Fixed real-case regression / 固定真实回归 | 26/26 full match |
140
+ | ONNX parity / ONNX 误差 | max abs diff `2.6703e-05` |
141
+ | Token/entity eval after focus tuning / focus 微调后实体评估 | F1 `0.9666`, token accuracy `0.9904` |
142
+ | Focus parse eval / focus 解析评估 | 385/512 full match |
143
+
144
+ **中文**:当前发布模型是“全量重标注 char 模型 + special-code focus 微调”。固定回归集覆盖真实用户反馈样式;focus eval 是偏向困难样本的评估,不等同于全量随机 DMHY 评估。
145
+
146
+ **English**: The published checkpoint is the full-relabel character model plus a targeted special-code focus fine-tune. The fixed regression set covers real user-reported patterns; focus eval is intentionally biased toward hard examples and is not equivalent to a broad random DMHY evaluation.
147
+
148
+ Run regression:
149
+
150
+ ```powershell
151
+ uv run python evaluate_parser_cases.py --model-dir . --case-file data/parser_regression_cases.json --output case_metrics.json
152
  ```
153
 
154
+ ## Training / 训练
155
+
156
+ Training uses the dataset submodule at `datasets/AnimeName`.
157
+
158
+ Recommended full character-token run:
159
+
160
+ ```powershell
161
+ uv run python train.py --tokenizer char `
162
+ --data-file datasets/AnimeName/dmhy_weak_char.jsonl `
163
+ --vocab-file datasets/AnimeName/vocab.char.json `
164
+ --save-dir checkpoints/dmhy-char-full `
165
+ --init-model-dir . `
166
+ --epochs 2 `
167
+ --batch-size 256 `
168
+ --learning-rate 0.00008 `
169
+ --warmup-steps 300 `
170
+ --max-seq-length 128 `
171
+ --train-split 0.98 `
172
+ --num-workers 4 `
173
+ --checkpoint-steps 1000 `
174
+ --save-total-limit 3 `
175
+ --parse-eval-limit 2048 `
176
+ --seed 52 `
177
+ --experiment-name dmhy-char-full
178
+ ```
179
+
180
+ `train.py` writes:
181
+
182
+ - Hugging Face checkpoints under `--save-dir`,
183
+ - `final/run_metadata.json`,
184
+ - `final/trainer_eval_metrics.json`,
185
+ - `final/parse_eval_metrics.json`,
186
+ - `final/case_metrics.json` unless `--no-case-eval` is used,
187
+ - TensorBoard logs unless `--no-tensorboard` is used.
188
 
189
+ Full workflow: [`docs/training.md`](docs/training.md).
190
 
191
+ ## Dataset / 数据集
 
 
 
 
192
 
193
+ Authoritative dataset snapshot:
194
+
195
+ ```text
196
+ datasets/AnimeName/dmhy_weak.jsonl
197
+ datasets/AnimeName/dmhy_weak_char.jsonl
198
+ datasets/AnimeName/vocab.json
199
+ datasets/AnimeName/vocab.char.json
200
+ ```
201
 
202
+ Current snapshot:
203
+
204
+ - rows / 行数: `632002`
205
+ - failed relabel rows / 重标注失败行: `0`
206
+ - strict BIO violations / 严格 BIO 违规: `0`
207
+ - character vocab / 字符词表: `6199`
208
+ - character coverage / 字符覆盖率: `100%`
209
+
210
+ **中文**:`datasets/AnimeName` 是嵌套数据集仓库。更新数据后需要先提交/推送子仓库,再提交父仓库的 submodule pointer。
211
+
212
+ **English**: `datasets/AnimeName` is a nested dataset repository. Commit and push the dataset repo first, then commit the updated submodule pointer in this model repo.
213
+
214
+ ## Repository Layout / 仓库结构
215
+
216
+ ```text
217
+ config.json
218
+ model.safetensors
219
+ tokenizer_config.json
220
+ vocab.json
221
+ training_args.bin
222
+ inference.py
223
+ onnx_inference.py
224
+ export_onnx.py
225
+ train.py
226
+ dataset.py
227
+ tokenizer.py
228
+ dmhy_dataset.py
229
+ label_repairs.py
230
+ relabel_dataset_from_filenames.py
231
+ convert_to_char_dataset.py
232
+ data/parser_regression_cases.json
233
+ datasets/AnimeName/
234
+ exports/anime_filename_parser.onnx
235
+ docs/
236
  ```
237
 
238
+ ## Maintenance / 维护
239
 
240
+ See [`MAINTENANCE.md`](MAINTENANCE.md) for release steps, LFS order, dataset submodule updates, and MiruPlay integration notes.
 
 
 
 
 
 
 
241
 
242
+ ## Limitations / 局限
243
 
244
+ **中文**
 
 
245
 
246
+ - 发布命名没有统一标准,极端 OCR 噪声、乱码、非动画命名仍可能失败。
247
+ - ONNX 只包含模型 logits,不包含 tokenizer 和后处理;移动端必须保持 tokenizer/vocab/config 一致。
248
+ - `source` 当前是单值字段,复杂文件名里可能同时存在平台、发布源、编码器和语言标签。
249
 
250
+ **English**
251
 
252
+ - Anime release names are not standardized; extreme OCR noise, mojibake, or non-anime names can still fail.
253
+ - ONNX contains logits only. Mobile runtimes must keep tokenizer, vocabulary, config, BIO decode, and postprocessing in sync.
254
+ - `source` is currently a single field, while real filenames may contain platform, release source, codec, and language tags together.
255
 
docs/onnx.md ADDED
@@ -0,0 +1,154 @@
1
+ # ONNX Usage / ONNX 使用说明
2
+
3
+ AniFileBERT exports a static-shape ONNX graph for Android and local inference.
4
+
5
+ AniFileBERT 导出静态 shape 的 ONNX 图,用于 Android 和本地推理。
6
+
7
+ ## 1. What ONNX Contains / ONNX 包含什么
8
+
9
+ The ONNX graph contains only the BERT token-classification forward pass:
10
+
11
+ ONNX 图只包含 BERT token-classification 前向计算:
12
+
13
+ ```text
14
+ input_ids int64[1,128]
15
+ attention_mask int64[1,128]
16
+ logits float32[1,128,15]
17
+ ```
18
+
19
+ It does **not** contain:
20
+
21
+ 它**不包含**:
22
+
23
+ - filename tokenization / 文件名分词
24
+ - token-to-id conversion / token 到 id 的转换
25
+ - constrained BIO decoding / 约束 BIO 解码
26
+ - field aggregation / 字段聚合
27
+ - structural cleanup / 结构化清理
28
+
29
+ Those steps must stay aligned with `tokenizer.py`, `inference.py`, `config.json`,
30
+ and `vocab.json`.
31
+
32
+ 这些步骤必须与 `tokenizer.py`、`inference.py`、`config.json`、`vocab.json`
33
+ 保持一致。
34
+
35
+ ## 2. Export / 导出
36
+
37
+ ```powershell
38
+ uv run python export_onnx.py --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
39
+ ```
40
+
41
+ The exporter also writes:
42
+
43
+ 导出器还会写入:
44
+
45
+ ```text
46
+ exports/anime_filename_parser.metadata.json
47
+ ```
48
+
49
+ The metadata records the sample filename, output shape, and PyTorch/ONNX max
50
+ absolute logits difference.
51
+
52
+ metadata 会记录样本文件名、输出 shape、PyTorch/ONNX logits 最大绝对误差。
53
+
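The max absolute logits difference recorded in the metadata can be reproduced with a few lines of plain Python. This is a minimal sketch, not the exporter's own code: `max_abs_diff`, `within_tolerance`, and the `1e-4` tolerance are hypothetical names and values chosen for illustration.

```python
from typing import Sequence


def max_abs_diff(a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]) -> float:
    """Largest element-wise absolute difference between two [tokens][labels] grids."""
    return max(
        abs(x - y)
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    )


def within_tolerance(a, b, tolerance: float = 1e-4) -> bool:
    # The tolerance is an assumption for this sketch, not a documented threshold.
    return max_abs_diff(a, b) <= tolerance


pytorch_logits = [[0.25, 1.5], [2.0, -0.5]]
onnx_logits = [[0.25, 1.5], [2.0, -0.5]]
print(within_tolerance(pytorch_logits, onnx_logits))
```

In practice the two grids would come from the PyTorch model and from ONNX Runtime for the same sample filename.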
54
+ ## 3. Local ONNX Inference / 本地 ONNX 推理
55
+
56
+ Use `onnx_inference.py` as the minimal runnable reference.
57
+
58
+ 使用 `onnx_inference.py` 作为最小可运行参考实现。
59
+
60
+ ```powershell
61
+ uv run python onnx_inference.py "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
62
+ ```
63
+
64
+ Expected:
65
+
66
+ 期望输出:
67
+
68
+ ```json
69
+ {"title":"神印王座","season":null,"episode":200,"group":"GM-Team","resolution":"1080P","source":"GB","special":null}
70
+ ```
71
+
72
+ Special-code example:
73
+
74
+ 特典编号示例:
75
+
76
+ ```powershell
77
+ uv run python onnx_inference.py "[YYDM&VCB-Studio] Shinsekai Yori [NCED02][Ma10p_1080p][x265_flac].mkv"
78
+ ```
79
+
80
+ Expected:
81
+
82
+ 期望输出:
83
+
84
+ ```json
85
+ {"title":"Shinsekai Yori","season":null,"episode":null,"group":"YYDM&VCB-Studio","resolution":"1080p","source":"x265_flac","special":"NCED02"}
86
+ ```
87
+
88
+ ## 4. Implementation Steps / 实现步骤
89
+
90
+ The runtime parser should do this:
91
+
92
+ 运行时解析器应按以下步骤实现:
93
+
94
+ 1. Tokenize the filename with the custom character tokenizer.
95
+ 使用自定义字符 tokenizer 对文件名分词。
96
+ 2. Truncate the tokens to `max_length - 2`, then add `[CLS]` and `[SEP]`.
97
+ 先截断到 `max_length - 2`,再添加 `[CLS]` 和 `[SEP]`。
98
+ 3. Convert tokens to ids with `vocab.json`.
99
+ 使用 `vocab.json` 转换 token id。
100
+ 4. Pad `input_ids` and `attention_mask` to exactly `128`.
101
+ 将 `input_ids` 和 `attention_mask` padding 到固定 `128`。
102
+ 5. Run ONNX Runtime.
103
+ 执行 ONNX Runtime。
104
+ 6. Slice logits back to real token count, excluding `[CLS]` and `[SEP]`.
105
+ 去掉 `[CLS]` / `[SEP]`,只保留真实 token 的 logits。
106
+ 7. Decode labels with constrained BIO transitions.
107
+ 使用约束 BIO transition 解码标签。
108
+ 8. Aggregate labels into parser fields.
109
+ 聚合标签为结构化字段。
110
+ 9. Apply high-confidence structural cleanup.
111
+ 应用高置信结构修正。
112
+
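Step 7 can be sketched as a greedy decoder that refuses illegal BIO transitions. This is an illustration only: the repository's `constrained_bio_decode` in `inference.py` may use a different algorithm (for example Viterbi), and the helper names and label strings below are assumptions.

```python
from typing import Dict, List, Optional, Sequence


def is_allowed(prev_label: Optional[str], label: str) -> bool:
    """I-X may only follow B-X or I-X; O and B-* are always allowed."""
    if not label.startswith("I-"):
        return True
    if prev_label is None:
        return False
    field = label[2:]
    return prev_label in (f"B-{field}", f"I-{field}")


def greedy_constrained_decode(
    logits: Sequence[Sequence[float]],
    id2label: Dict[int, str],
) -> List[int]:
    """Pick, per token, the highest-scoring label that is a legal BIO transition."""
    decoded: List[int] = []
    prev: Optional[str] = None
    for row in logits:
        candidates = [i for i in range(len(row)) if is_allowed(prev, id2label[i])]
        best = max(candidates, key=lambda i: row[i])
        decoded.append(best)
        prev = id2label[best]
    return decoded


# Hypothetical three-label setup; the real model uses 15 labels from config.json.
id2label = {0: "O", 1: "B-TITLE", 2: "I-TITLE"}
logits = [[0.1, 0.2, 0.9], [0.0, 0.1, 0.8], [0.9, 0.1, 0.2]]
print(greedy_constrained_decode(logits, id2label))
```

The first token cannot be `I-TITLE`, so the decoder falls back to the best legal label even though the raw argmax points elsewhere.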
113
+ ## 5. Android Notes / Android 注意事项
114
+
115
+ Android must bundle these files together:
116
+
117
+ Android 端必须同时打包:
118
+
119
+ ```text
120
+ anime_filename_parser.onnx
121
+ vocab.json
122
+ config.json
123
+ ```
124
+
125
+ When changing any of them, update all of them in the same commit.
126
+
127
+ 只要其中任意一个变化,三者必须在同一次提交中一起更新。
128
+
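A small pre-packaging check can catch a bundle whose files have drifted apart. The sketch below assumes `config.json` exposes `id2label` with string keys (as `onnx_inference.py` reads it) and that `vocab.json` is a token-to-id mapping; the latter is an assumption, and `check_bundle` is a hypothetical helper, not part of the repository.

```python
from typing import Dict, List

EXPECTED_NUM_LABELS = 15  # last dimension of the documented logits shape


def check_bundle(
    config: Dict,
    vocab: Dict[str, int],
    num_labels: int = EXPECTED_NUM_LABELS,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means the bundle looks consistent."""
    problems: List[str] = []
    id2label = config.get("id2label", {})
    if len(id2label) != num_labels:
        problems.append(f"config.json has {len(id2label)} labels, expected {num_labels}")
    label_ids = sorted(int(key) for key in id2label)
    if label_ids != list(range(len(label_ids))):
        problems.append("label ids are not contiguous from 0")
    for token in ("[CLS]", "[SEP]"):
        if token not in vocab:
            problems.append(f"vocab.json is missing {token}")
    return problems


# Toy inputs; real values would be loaded from the bundled JSON files.
config = {"id2label": {str(i): f"LABEL_{i}" for i in range(15)}}
vocab = {"[CLS]": 1, "[SEP]": 2}
print(check_bundle(config, vocab))
```

Running a check like this in the same commit that updates the three assets makes a mismatch visible before it reaches a device.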
129
+ ## 6. Common Mistakes / 常见错误
130
+
131
+ **Using a standard Hugging Face tokenizer**
132
+
133
+ **误用标准 Hugging Face tokenizer**
134
+
135
+ This model uses `AnimeTokenizer`, not WordPiece/BPE.
136
+
137
+ 本模型使用 `AnimeTokenizer`,不是 WordPiece/BPE。
138
+
139
+ **Treating ONNX output as final fields**
140
+
141
+ **把 ONNX 输出当成最终字段**
142
+
143
+ ONNX returns token logits. You still need BIO decoding and field aggregation.
144
+
145
+ ONNX 返回 token logits,仍然需要 BIO 解码和字段聚合。
146
+
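Field aggregation turns decoded labels into parser fields. The following is a simplified sketch under stated assumptions: label names such as `B-GROUP` are guesses at the naming scheme, tokens are joined without separators because the tokenizer is character-level, and only the first span per field is kept. The real `postprocess` in `inference.py` also applies structural cleanup that is not shown here.

```python
from typing import Dict, List, Optional


def aggregate_fields(tokens: List[str], labels: List[str]) -> Dict[str, str]:
    """Group BIO-labelled tokens into {field: text}, keeping the first span per field."""
    fields: Dict[str, str] = {}
    current: Optional[str] = None
    buffer: List[str] = []

    def flush() -> None:
        # Store the span collected so far, unless that field was already filled.
        if current is not None and buffer and current not in fields:
            fields[current] = "".join(buffer)

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            flush()
            current, buffer = label[2:].lower(), [token]
        elif label.startswith("I-") and current == label[2:].lower():
            buffer.append(token)
        else:
            # "O" or an inconsistent I- tag closes the open span.
            flush()
            current, buffer = None, []
    flush()
    return fields


tokens = list("[GM]AB01")
labels = ["O", "B-GROUP", "I-GROUP", "O", "B-TITLE", "I-TITLE", "B-EPISODE", "I-EPISODE"]
print(aggregate_fields(tokens, labels))
```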
147
+ **Changing max length without updating Android**
148
+
149
+ **改 max length 但没有同步 Android**
150
+
151
+ The exported graph is static. Runtime arrays must match `[1,128]`.
152
+
153
+ 导出的图是静态 shape,运行时数组必须匹配 `[1,128]`。
154
+
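Because the graph is static, every request has to be truncated and padded to the exported length before it reaches ONNX Runtime. The sketch below mirrors the logic of `encode` in `onnx_inference.py` with plain lists; the function name and the toy special-token ids are hypothetical.

```python
from typing import List, Tuple

MAX_LENGTH = 128  # static sequence length stated in this document


def pad_to_static_length(
    token_ids: List[int],
    cls_id: int,
    sep_id: int,
    pad_id: int,
    max_length: int = MAX_LENGTH,
) -> Tuple[List[int], List[int], int]:
    """Return (input_ids, attention_mask, real_token_count), each list exactly max_length long."""
    kept = token_ids[: max_length - 2]  # leave room for [CLS] and [SEP]
    input_ids = [cls_id] + kept + [sep_id]
    attention_mask = [1] * len(input_ids)
    pad_len = max_length - len(input_ids)
    input_ids += [pad_id] * pad_len
    attention_mask += [0] * pad_len
    return input_ids, attention_mask, len(kept)


# Toy ids: 2 = [CLS], 3 = [SEP], 0 = [PAD] are placeholders, not the real vocabulary.
ids, mask, real_count = pad_to_static_length(list(range(10, 15)), cls_id=2, sep_id=3, pad_id=0)
print(len(ids), sum(mask), real_count)
```

The returned `real_count` is what step 6 needs in order to slice `logits[0, 1:1 + real_count, :]` and drop the `[CLS]`, `[SEP]`, and padding positions.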
docs/training.md ADDED
@@ -0,0 +1,233 @@
1
+ # Training Guide / 训练指南
2
+
3
+ This document describes the reproducible training workflow for AniFileBERT.
4
+
5
+ 本文档记录 AniFileBERT 的可复现训练流程。
6
+
7
+ ## 1. Environment / 环境
8
+
9
+ Use `uv` to manage dependencies and to run every command.
10
+
11
+ 所有依赖和命令优先使用 `uv`。
12
+
13
+ ```powershell
14
+ uv sync
15
+ uv run python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
16
+ ```
17
+
18
+ Recommended GPU configuration:
19
+
20
+ 推荐 GPU 配置:
21
+
22
+ - RTX 3080 class GPU or better
23
+ - batch size `192` to `256` for full char training
24
+ - `fp16` enabled automatically when CUDA is available
25
+ - `--num-workers 4` or `8` when the local disk can keep up
26
+
27
+ ## 2. Dataset / 数据集
28
+
29
+ The authoritative dataset lives in the nested submodule:
30
+
31
+ 权威数据集位于嵌套子模块:
32
+
33
+ ```text
34
+ datasets/AnimeName/dmhy_weak.jsonl
35
+ datasets/AnimeName/dmhy_weak_char.jsonl
36
+ datasets/AnimeName/vocab.json
37
+ datasets/AnimeName/vocab.char.json
38
+ ```
39
+
40
+ Current expected properties:
41
+
42
+ 当前期望属性:
43
+
44
+ - rows / 行数: `632002`
45
+ - strict BIO violations / 严格 BIO 违规: `0`
46
+ - character vocab / 字符词表: `6199`
47
+ - character coverage / 字符覆盖率: `100%`
48
+
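The "strict BIO violations" figure can be spot-checked on any label sequence. This sketch assumes the strict rule is that `I-X` must directly follow `B-X` or `I-X`; the dataset's own validator may define it differently, and `count_bio_violations` is a hypothetical helper.

```python
from typing import List


def count_bio_violations(labels: List[str]) -> int:
    """Count I-X tags that do not directly follow B-X or I-X of the same field."""
    violations = 0
    prev = "O"
    for label in labels:
        if label.startswith("I-"):
            field = label[2:]
            if prev not in (f"B-{field}", f"I-{field}"):
                violations += 1
        prev = label
    return violations


print(count_bio_violations(["B-TITLE", "I-TITLE", "O", "I-GROUP"]))
```

Summing this count over every row of the character dataset should give `0` for the snapshot described above, if the assumed rule matches the one used to produce that number.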
49
+ ## 3. Relabel Full Dataset / 全量重标注
50
+
51
+ Use this workflow after the weak-label rules in `dmhy_dataset.py` or `label_repairs.py` have changed.
52
+
53
+ 当 `dmhy_dataset.py` 或 `label_repairs.py` 的弱标注规则改变时,使用此流程。
54
+
55
+ ```powershell
56
+ uv run python relabel_dataset_from_filenames.py `
57
+ --input datasets/AnimeName/dmhy_weak.jsonl `
58
+ --output datasets/AnimeName/dmhy_weak.relabel.jsonl `
59
+ --manifest-output datasets/AnimeName/dmhy_weak.relabel.manifest.json `
60
+ --vocab-output datasets/AnimeName/vocab.relabel.json `
61
+ --base-vocab datasets/AnimeName/vocab.json `
62
+ --max-vocab-size 8000 `
63
+ --progress 50000
64
+ ```
65
+
66
+ After checking the manifest and sample labels, replace the authoritative files:
67
+
68
+ 检查 manifest 和样本标注后,再替换权威文件:
69
+
70
+ ```powershell
71
+ Move-Item datasets/AnimeName/dmhy_weak.relabel.jsonl datasets/AnimeName/dmhy_weak.jsonl -Force
72
+ Move-Item datasets/AnimeName/vocab.relabel.json datasets/AnimeName/vocab.json -Force
73
+ Move-Item datasets/AnimeName/dmhy_weak.relabel.manifest.json datasets/AnimeName/dmhy_weak.manifest.json -Force
74
+ ```
75
+
76
+ ## 4. Convert to Character Dataset / 转换为字符数据集
77
+
78
+ The published checkpoint uses the character tokenizer.
79
+
80
+ 当前发布模型使用字符级 tokenizer。
81
+
82
+ ```powershell
83
+ uv run python convert_to_char_dataset.py `
84
+ --input datasets/AnimeName/dmhy_weak.jsonl `
85
+ --output datasets/AnimeName/dmhy_weak_char.jsonl `
86
+ --vocab-output datasets/AnimeName/vocab.char.json `
87
+ --manifest-output datasets/AnimeName/dmhy_weak_char.manifest.json `
88
+ --progress 50000
89
+ ```
90
+
91
+ ## 5. Full Training / 全量训练
92
+
93
+ Recommended RTX 3080 run:
94
+
95
+ 推荐 RTX 3080 训练命令:
96
+
97
+ ```powershell
98
+ uv run python train.py --tokenizer char `
99
+ --data-file datasets/AnimeName/dmhy_weak_char.jsonl `
100
+ --vocab-file datasets/AnimeName/vocab.char.json `
101
+ --save-dir checkpoints/dmhy-char-full `
102
+ --init-model-dir . `
103
+ --epochs 2 `
104
+ --batch-size 256 `
105
+ --learning-rate 0.00008 `
106
+ --warmup-steps 300 `
107
+ --max-seq-length 128 `
108
+ --train-split 0.98 `
109
+ --num-workers 4 `
110
+ --checkpoint-steps 1000 `
111
+ --save-total-limit 3 `
112
+ --parse-eval-limit 2048 `
113
+ --case-eval-file data/parser_regression_cases.json `
114
+ --seed 52 `
115
+ --experiment-name dmhy-char-full
116
+ ```
117
+
118
+ Training outputs:
119
+
120
+ 训练输出:
121
+
122
+ - `checkpoints/<run>/checkpoint-*`: resumable checkpoints / 可恢复 checkpoint
123
+ - `checkpoints/<run>/final`: final Hugging Face checkpoint / 最终 checkpoint
124
+ - `final/run_metadata.json`: run configuration / 训练配置
125
+ - `final/trainer_eval_metrics.json`: seqeval metrics / token/entity 指标
126
+ - `final/parse_eval_metrics.json`: held-out parser exact-match / held-out 解析准确率
127
+ - `final/case_metrics.json`: fixed real-world case regression / 固定真实 case 回归
128
+ - TensorBoard logs unless `--no-tensorboard` is set / 默认写 TensorBoard
129
+
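To judge how `--warmup-steps` and `--checkpoint-steps` relate to the length of a run, it helps to estimate the number of optimizer steps. The arithmetic below is a rough sketch that assumes a single GPU, no gradient accumulation, a floor-based train split, and a kept final partial batch; none of these are confirmed by `train.py`.

```python
import math
from typing import Tuple


def optimizer_steps(
    rows: int,
    train_split_percent: int,
    batch_size: int,
    epochs: int,
) -> Tuple[int, int, int]:
    """Return (train_rows, steps_per_epoch, total_steps) under the stated assumptions."""
    train_rows = rows * train_split_percent // 100
    steps_per_epoch = math.ceil(train_rows / batch_size)
    return train_rows, steps_per_epoch, steps_per_epoch * epochs


# Values from the recommended full run: 632002 rows, 0.98 split, batch 256, 2 epochs.
train_rows, per_epoch, total = optimizer_steps(632002, 98, 256, 2)
print(train_rows, per_epoch, total)
```

Under these assumptions the full run is on the order of a few thousand steps, so 300 warmup steps cover only a small share of training and a checkpoint every 1000 steps yields a handful of resumable checkpoints.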
130
+ ## 6. Focus Fine-Tuning / 针对性微调
131
+
132
+ Use focus fine-tuning only after a specific real-world failure pattern has been
133
+ confirmed and added to `data/parser_regression_cases.json`.
134
+
135
+ 只有在确认某类真实失败样式,并加入 `data/parser_regression_cases.json` 后,才使用针对性微调。
136
+
137
+ ```powershell
138
+ uv run python build_repair_focus_dataset.py `
139
+ --input datasets/AnimeName/dmhy_weak_char.jsonl `
140
+ --output data/repair_focus_char.jsonl `
141
+ --context-samples 50000 `
142
+ --repeat-repaired 4 `
143
+ --repeat-manual 24 `
144
+ --seed 75
145
+
146
+ uv run python train.py --tokenizer char `
147
+ --data-file data/repair_focus_char.jsonl `
148
+ --vocab-file datasets/AnimeName/vocab.char.json `
149
+ --save-dir checkpoints/dmhy-char-special-focus `
150
+ --init-model-dir . `
151
+ --epochs 1 `
152
+ --batch-size 64 `
153
+ --learning-rate 0.00003 `
154
+ --warmup-steps 50 `
155
+ --max-seq-length 128 `
156
+ --train-split 0.95 `
157
+ --num-workers 0 `
158
+ --checkpoint-steps 500 `
159
+ --save-total-limit 2 `
160
+ --parse-eval-limit 512 `
161
+ --case-eval-file data/parser_regression_cases.json `
162
+ --seed 75 `
163
+ --experiment-name dmhy-char-special-focus
164
+ ```
165
+
166
+ ## 7. Publish to Repository Root / 发布到仓库根目录
167
+
168
+ The repository root is the Hugging Face checkpoint surface.
169
+
170
+ 仓库根目录就是 Hugging Face checkpoint 发布面。
171
+
172
+ ```powershell
173
+ $final = "checkpoints/dmhy-char-full/final"
174
+ Copy-Item "$final/config.json" . -Force
175
+ Copy-Item "$final/model.safetensors" . -Force
176
+ Copy-Item "$final/tokenizer_config.json" . -Force
177
+ Copy-Item "$final/training_args.bin" . -Force
178
+ Copy-Item "$final/vocab.json" . -Force
179
+ Copy-Item "$final/run_metadata.json" . -Force
180
+ Copy-Item "$final/trainer_eval_metrics.json" . -Force
181
+ Copy-Item "$final/parse_eval_metrics.json" . -Force
182
+ Copy-Item "$final/case_metrics.json" . -Force
183
+ Copy-Item datasets/AnimeName/vocab.char.json .\vocab.char.json -Force
184
+ ```
185
+
186
+ Then export ONNX:
187
+
188
+ 然后导出 ONNX:
189
+
190
+ ```powershell
191
+ uv run python export_onnx.py --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
192
+ ```
193
+
194
+ ## 8. Validation Checklist / 验证清单
195
+
196
+ Run these before committing:
197
+
198
+ 提交前执行:
199
+
200
+ ```powershell
201
+ uv run python -m py_compile tokenizer.py dataset.py dmhy_dataset.py label_repairs.py train.py inference.py export_onnx.py onnx_inference.py
202
+ uv run python evaluate_parser_cases.py --model-dir . --case-file data/parser_regression_cases.json --output case_metrics.json
203
+ uv run python inference.py --model-dir . "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
204
+ uv run python onnx_inference.py "[YYDM&VCB-Studio] Shinsekai Yori [NCED02][Ma10p_1080p][x265_flac].mkv"
205
+ ```
206
+
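The regression result can also be turned into a pass/fail gate before committing. The sketch below relies on the keys that `train.py` prints from the case metrics (`full_correct`, `case_count`, `full_accuracy`, `failures`); the `check_case_metrics` helper and the default threshold of `1.0` are assumptions for illustration.

```python
from typing import Dict, Tuple


def check_case_metrics(metrics: Dict, min_accuracy: float = 1.0) -> Tuple[bool, str]:
    """Return (passed, summary) for a parsed case_metrics.json dictionary."""
    summary = (
        f"full_match: {metrics['full_correct']}/"
        f"{metrics['case_count']} ({metrics['full_accuracy']:.4f})"
    )
    passed = metrics["full_accuracy"] >= min_accuracy and not metrics["failures"]
    return passed, summary


# Toy metrics; real values come from json.loads(Path("case_metrics.json").read_text(encoding="utf-8")).
example = {"full_correct": 10, "case_count": 10, "full_accuracy": 1.0, "failures": []}
print(check_case_metrics(example))
```

A stricter or looser threshold is a project decision; the point of the sketch is only that the file already carries everything needed for an automated check.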
207
+ ## 9. Git and LFS Order / Git 与 LFS 顺序
208
+
209
+ If the dataset submodule changed:
210
+
211
+ 如果数据集子模块有变动:
212
+
213
+ ```powershell
214
+ git -C datasets/AnimeName add dmhy_weak.jsonl dmhy_weak.manifest.json dmhy_weak_char.jsonl dmhy_weak_char.manifest.json vocab.json vocab.char.json
215
+ git -C datasets/AnimeName commit -m "Update anime filename labels"
216
+ git -C datasets/AnimeName lfs push origin main --all
217
+ git -C datasets/AnimeName push origin main
218
+ ```
219
+
220
+ Then commit the model repo:
221
+
222
+ 再提交模型仓库:
223
+
224
+ ```powershell
225
+ git add README.md MAINTENANCE.md ANDROID.md docs/training.md docs/onnx.md `
226
+ config.json model.safetensors tokenizer_config.json training_args.bin vocab.json vocab.char.json `
227
+ exports/anime_filename_parser.onnx exports/anime_filename_parser.metadata.json `
228
+ train.py inference.py export_onnx.py onnx_inference.py data/parser_regression_cases.json datasets/AnimeName
229
+ git commit -m "Update AniFileBERT model and documentation"
230
+ git lfs push origin main --all
231
+ git push origin main
232
+ ```
233
+
export_onnx.py CHANGED
@@ -66,9 +66,9 @@ def copy_android_assets(model_dir: Path, onnx_path: Path, assets_dir: Path) -> N
66
 
67
  def main() -> None:
68
  parser = argparse.ArgumentParser(description="Export anime filename parser to ONNX")
69
- parser.add_argument("--model-dir", default="checkpoints/final", help="HuggingFace checkpoint directory")
70
  parser.add_argument("--output", default="exports/anime_filename_parser.onnx", help="Output ONNX file")
71
- parser.add_argument("--max-length", type=int, default=64, help="Fixed sequence length used on Android")
72
  parser.add_argument(
73
  "--android-assets-dir",
74
  help="Optional Android assets directory that receives the ONNX model, vocab, and config",
 
66
 
67
  def main() -> None:
68
  parser = argparse.ArgumentParser(description="Export anime filename parser to ONNX")
69
+ parser.add_argument("--model-dir", default=".", help="HuggingFace checkpoint directory")
70
  parser.add_argument("--output", default="exports/anime_filename_parser.onnx", help="Output ONNX file")
71
+ parser.add_argument("--max-length", type=int, default=128, help="Fixed sequence length used on Android")
72
  parser.add_argument(
73
  "--android-assets-dir",
74
  help="Optional Android assets directory that receives the ONNX model, vocab, and config",
onnx_inference.py ADDED
@@ -0,0 +1,105 @@
1
+ """
2
+ Minimal ONNX Runtime inference example for AniFileBERT.
3
+
4
+ The ONNX file outputs token logits only. End-to-end parsing still needs the
5
+ repository tokenizer, constrained BIO decoding, and the same field aggregation
6
+ used by inference.py.
7
+
8
+ Usage:
9
+ python onnx_inference.py "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
10
+ """
11
+
12
+ import argparse
13
+ import json
14
+ from pathlib import Path
15
+ from typing import Dict, List, Tuple
16
+
17
+ import numpy as np
18
+ import onnxruntime as ort
19
+ import torch
20
+
21
+ from inference import constrained_bio_decode, postprocess
22
+ from tokenizer import AnimeTokenizer, load_tokenizer
23
+
24
+
25
+ def encode(
26
+ filename: str,
27
+ tokenizer: AnimeTokenizer,
28
+ max_length: int,
29
+ ) -> Tuple[List[str], np.ndarray, np.ndarray, int]:
30
+ tokens = tokenizer.tokenize(filename)
31
+ available = min(len(tokens), max_length - 2)
32
+ used_tokens = tokens[:available]
33
+
34
+ input_ids = [tokenizer.cls_token_id]
35
+ input_ids.extend(tokenizer.convert_tokens_to_ids(used_tokens))
36
+ input_ids.append(tokenizer.sep_token_id)
37
+ attention_mask = [1] * len(input_ids)
38
+
39
+ pad_len = max_length - len(input_ids)
40
+ if pad_len > 0:
41
+ input_ids.extend([tokenizer.pad_token_id] * pad_len)
42
+ attention_mask.extend([0] * pad_len)
43
+
44
+ return (
45
+ used_tokens,
46
+ np.asarray([input_ids], dtype=np.int64),
47
+ np.asarray([attention_mask], dtype=np.int64),
48
+ available,
49
+ )
50
+
51
+
52
+ def load_id2label(model_dir: Path) -> Dict[int, str]:
53
+ config = json.loads((model_dir / "config.json").read_text(encoding="utf-8"))
54
+ return {int(label_id): label for label_id, label in config["id2label"].items()}
55
+
56
+
57
+ def parse_with_onnx(
58
+ filename: str,
59
+ model_dir: Path,
60
+ onnx_path: Path,
61
+ max_length: int,
62
+ use_rules: bool = True,
63
+ ) -> Dict:
64
+ tokenizer = load_tokenizer(str(model_dir))
65
+ id2label = load_id2label(model_dir)
66
+ tokens, input_ids, attention_mask, available = encode(filename, tokenizer, max_length)
67
+
68
+ session = ort.InferenceSession(str(onnx_path), providers=["CPUExecutionProvider"])
69
+ logits = session.run(
70
+ ["logits"],
71
+ {
72
+ "input_ids": input_ids,
73
+ "attention_mask": attention_mask,
74
+ },
75
+ )[0]
76
+
77
+ token_logits = torch.from_numpy(logits[0, 1:1 + available, :])
78
+ label_ids = constrained_bio_decode(token_logits, id2label)
79
+ labels = [id2label.get(label_id, "O") for label_id in label_ids]
80
+ result = postprocess(tokens, labels, tokenizer=tokenizer, filename=filename, use_rules=use_rules)
81
+ result["_input"] = filename
82
+ return result
83
+
84
+
85
+ def main() -> None:
86
+ parser = argparse.ArgumentParser(description="Run AniFileBERT ONNX inference")
87
+ parser.add_argument("filename", help="Anime filename to parse")
88
+ parser.add_argument("--model-dir", default=".", help="Directory containing vocab.json and config.json")
89
+ parser.add_argument("--onnx", default="exports/anime_filename_parser.onnx", help="ONNX model path")
90
+ parser.add_argument("--max-length", type=int, default=128, help="Static ONNX sequence length")
91
+ parser.add_argument("--no-rule-assist", action="store_true", help="Disable structural postprocessing")
92
+ args = parser.parse_args()
93
+
94
+ result = parse_with_onnx(
95
+ filename=args.filename,
96
+ model_dir=Path(args.model_dir),
97
+ onnx_path=Path(args.onnx),
98
+ max_length=args.max_length,
99
+ use_rules=not args.no_rule_assist,
100
+ )
101
+ print(json.dumps(result, ensure_ascii=False))
102
+
103
+
104
+ if __name__ == "__main__":
105
+ main()
train.py CHANGED
@@ -1,11 +1,9 @@
1
  """
2
- Training script for anime filename parser.
3
 
4
- Trains a Tiny BERT model for token classification on synthetic anime filename data.
5
- Uses HuggingFace Trainer for CPU training.
6
-
7
- Usage:
8
- python train.py
9
  """
10
 
11
  import os
@@ -106,6 +104,12 @@ def parse_args() -> argparse.Namespace:
106
  help="Optional experiment name written to run_metadata.json")
107
  parser.add_argument("--parse-eval-limit", type=int, default=512,
108
  help="Run field exact-match evaluation on up to N eval samples after training; 0 disables it")
 
 
 
 
 
 
109
  parser.add_argument("--hidden-size", type=int, default=None, help="Override BERT hidden size")
110
  parser.add_argument("--num-hidden-layers", type=int, default=None, help="Override BERT layer count")
111
  parser.add_argument("--num-attention-heads", type=int, default=None, help="Override BERT attention heads")
@@ -626,6 +630,32 @@ def main():
626
  total = parse_metrics["field_total"][field]
627
  print(f" {field}: {correct}/{total} ({accuracy:.4f})")
628
 
629
 
630
  if __name__ == "__main__":
631
  main()
 
1
  """
2
+ Train AniFileBERT for structured anime filename parsing.
3
 
4
+ The training loop keeps the existing PyTorch/Transformers stack, writes
5
+ Hugging Face checkpoints, records token/entity metrics, and also evaluates
6
+ end-to-end parser exact-match on held-out filenames and fixed real-world cases.
 
 
7
  """
8
 
9
  import os
 
104
  help="Optional experiment name written to run_metadata.json")
105
  parser.add_argument("--parse-eval-limit", type=int, default=512,
106
  help="Run field exact-match evaluation on up to N eval samples after training; 0 disables it")
107
+ parser.add_argument("--case-eval-file", default=os.path.join("data", "parser_regression_cases.json"),
108
+ help="Fixed real-world parser regression case file evaluated after training")
109
+ parser.add_argument("--case-eval-output", default=None,
110
+ help="Optional output path for fixed case metrics; defaults to final/case_metrics.json")
111
+ parser.add_argument("--no-case-eval", action="store_true",
112
+ help="Skip fixed real-world parser regression evaluation")
113
  parser.add_argument("--hidden-size", type=int, default=None, help="Override BERT hidden size")
114
  parser.add_argument("--num-hidden-layers", type=int, default=None, help="Override BERT layer count")
115
  parser.add_argument("--num-attention-heads", type=int, default=None, help="Override BERT attention heads")
 
630
  total = parse_metrics["field_total"][field]
631
  print(f" {field}: {correct}/{total} ({accuracy:.4f})")
632
 
633
+ if not args.no_case_eval:
634
+ if args.case_eval_file and os.path.isfile(args.case_eval_file):
635
+ from evaluate_parser_cases import evaluate_cases
636
+
637
+ case_metrics = evaluate_cases(
638
+ model_dir=final_save_path,
639
+ case_file=args.case_eval_file,
640
+ tokenizer_variant=tokenizer_variant,
641
+ max_length=config.max_seq_length,
642
+ use_rules=True,
643
+ constrain_bio=True,
644
+ )
645
+ case_output = args.case_eval_output or os.path.join(final_save_path, "case_metrics.json")
646
+ os.makedirs(os.path.dirname(case_output) or ".", exist_ok=True)
647
+ with open(case_output, "w", encoding="utf-8") as f:
648
+ json.dump(case_metrics, f, ensure_ascii=False, indent=2)
649
+ print("\nFixed case regression evaluation:")
650
+ print(
651
+ f" full_match: {case_metrics['full_correct']}/"
652
+ f"{case_metrics['case_count']} ({case_metrics['full_accuracy']:.4f})"
653
+ )
654
+ if case_metrics["failures"]:
655
+ print(f" failures: {len(case_metrics['failures'])} (see {case_output})")
656
+ elif args.case_eval_file:
657
+ print(f"\nSkipping fixed case regression evaluation; file not found: {args.case_eval_file}")
658
+
659
 
660
  if __name__ == "__main__":
661
  main()