# AniFileBERT Maintenance / 维护手册

This repository is the standalone Hugging Face model repo used by MiruPlay as
`tools/anime_parser`.

本仓库是 MiruPlay 通过 `tools/anime_parser` 引用的独立 Hugging Face 模型仓库。

## Related Repositories / 相关仓库

| Repository / 仓库 | URL | Purpose / 用途 |
| --- | --- | --- |
| AniFileBERT | `https://huggingface.co/ModerRAS/AniFileBERT` | Model, scripts, ONNX export / 模型、脚本、ONNX 导出 |
| AnimeName | `https://huggingface.co/datasets/ModerRAS/AnimeName` | Dataset snapshot / 数据集快照 |
| MiruPlay | `https://github.com/ModerRAS/MiruPlay` | Android integration / Android 集成 |

Nested structure / 嵌套结构:

```text
AniFileBERT
  datasets/AnimeName -> ModerRAS/AnimeName
```

## Clone / 克隆

```powershell
git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
```

After a normal clone / 普通 clone 后:

```powershell
git submodule update --init --recursive
uv sync
```

## Publishing Surface / 发布面

The repository root is the only published Hugging Face checkpoint location:

仓库根目录是唯一的 Hugging Face checkpoint 发布位置:

```text
config.json
model.safetensors
tokenizer_config.json
training_args.bin
vocab.json
vocab.char.json
```
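
Before publishing, it can help to confirm that the root actually holds every file in this list. The following is a minimal sketch, not part of the repository's tooling; the function name `missing_published_files` is hypothetical, and the file names are taken from the list above.

```python
from __future__ import annotations

from pathlib import Path

# File names copied from the published-file list in this document.
PUBLISHED_FILES = (
    "config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "training_args.bin",
    "vocab.json",
    "vocab.char.json",
)


def missing_published_files(repo_root: str | Path = ".") -> list[str]:
    """Return the published file names that are absent from repo_root.

    Hypothetical helper: it only checks that each file exists, not that
    its contents are valid.
    """
    root = Path(repo_root)
    return [name for name in PUBLISHED_FILES if not (root / name).is_file()]
```

Calling `missing_published_files(".")` from the repository root returns an empty list when all six files are present.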

Release reports are kept under `reports/`:

发布报告保存在 `reports/`:

```text
reports/run_metadata.json
reports/trainer_eval_metrics.json
reports/parse_eval_metrics.json
reports/case_metrics.json
reports/perf_metrics.json
reports/benchmark_results.json
reports/training_lineage.json
```
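
A quick way to catch a truncated or hand-edited report is to confirm that each file parses as JSON. This is a sketch under the assumption that every report is a single JSON document; the helper name `check_reports` is hypothetical, and the report schemas are not inspected.

```python
from __future__ import annotations

import json
from pathlib import Path

# Report names copied from the list in this document.
REPORT_FILES = (
    "run_metadata.json",
    "trainer_eval_metrics.json",
    "parse_eval_metrics.json",
    "case_metrics.json",
    "perf_metrics.json",
    "benchmark_results.json",
    "training_lineage.json",
)


def check_reports(reports_dir: str | Path = "reports") -> dict[str, str]:
    """Map each problematic report name to a short problem description.

    Hypothetical helper: an empty result means every report exists and
    parses as JSON. It does not validate the contents.
    """
    problems: dict[str, str] = {}
    for name in REPORT_FILES:
        path = Path(reports_dir) / name
        if not path.is_file():
            problems[name] = "missing"
            continue
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems[name] = f"invalid JSON: {exc.msg}"
    return problems
```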

There is no tracked `model/` duplicate. Ignored `checkpoints/` directories are
local training artifacts only.

仓库不再跟踪旧的 `model/` 副本。被 ignore 的 `checkpoints/` 仅是本地训练产物。

## Standard Training / 标准训练

For full details, see [`training.md`](training.md).

完整流程见 [`training.md`](training.md)。

Current release training uses the virtual-shard flow in [`training.md`](training.md):

当前发布训练使用 [`training.md`](training.md) 中的 virtual-shard 流程:

```powershell
uv run python -m compileall -q anifilebert tools
cargo build --release --manifest-path tools/virtual_dataset_generator/Cargo.toml
# Then follow docs/training.md section "Full Training with Virtual BIO Shards".
```

## Publish a New Checkpoint / 发布新 checkpoint

Copy the files from the training run's `final` directory to the repository root:

把 `final` 文件复制到仓库根目录:

```powershell
$final = "checkpoints/dmhy-char-virtual-sps32-10epoch-lightfocus/final"
Copy-Item "$final/config.json" . -Force
Copy-Item "$final/model.safetensors" . -Force
Copy-Item "$final/tokenizer_config.json" . -Force
Copy-Item "$final/training_args.bin" . -Force
Copy-Item "$final/vocab.json" . -Force
New-Item -ItemType Directory -Path reports -Force | Out-Null
Copy-Item "$final/run_metadata.json" reports/run_metadata.json -Force
Copy-Item "$final/trainer_eval_metrics.json" reports/trainer_eval_metrics.json -Force
Copy-Item "$final/parse_eval_metrics.json" reports/parse_eval_metrics.json -Force
Copy-Item "$final/case_metrics.json" reports/case_metrics.json -Force
Copy-Item "$final/perf_metrics.json" reports/perf_metrics.json -Force
Copy-Item datasets/AnimeName/vocab.char.json . -Force
```

Export ONNX / 导出 ONNX:

```powershell
uv run python -m tools.export_onnx --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
```

Validate / 验证:

```powershell
uv run python -m tools.evaluate_parser_cases --model-dir . --case-file data/parser_regression_cases.json --output reports/case_metrics.json
uv run python -m tools.onnx_inference "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
uv run python -m tools.benchmark_inference --model-dir . --onnx exports/anime_filename_parser.onnx --case-file data/parser_regression_cases.json --repeat 20 --warmup 20 --torch-threads 1 --ort-threads 1 --output reports/benchmark_results.json
```

The default parser path is a thin runtime: model logits, constrained BIO
decoding, entity aggregation, and light string/number normalization. Do not add
structural filename regex assists back to the default runtime; parser quality
should come from labels and model training.

默认解析路径是薄层运行时:模型 logits、约束 BIO、实体聚合和轻量字符串/数字规范化。
不要把结构化文件名正则辅助重新加回默认运行时;解析质量应来自标签和模型训练。
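
To illustrate what BIO repair and entity aggregation mean here, the sketch below works on per-character tags that have already been decoded. It is not the repository's implementation. The function names and the `TITLE`/`EPISODE` labels are hypothetical, and the simple rule of rewriting a stray `I-X` as `B-X` stands in for whatever constraint the real runtime applies to the logits.

```python
from __future__ import annotations


def repair_bio(tags: list[str]) -> list[str]:
    """Rewrite an I-X tag as B-X when it does not continue an X span.

    Assumption: this post-hoc repair is a stand-in for constrained
    decoding; the real runtime may enforce the constraint differently.
    """
    repaired: list[str] = []
    prev_type = None  # entity type of the span currently open, if any
    for tag in tags:
        if tag.startswith("I-") and tag[2:] != prev_type:
            tag = "B-" + tag[2:]
        prev_type = tag[2:] if tag[:2] in ("B-", "I-") else None
        repaired.append(tag)
    return repaired


def aggregate_entities(chars: list[str], tags: list[str]) -> list[tuple[str, str]]:
    """Join the characters of each span into (entity_type, text) pairs."""
    entities: list[tuple[str, str]] = []
    for ch, tag in zip(chars, repair_bio(tags)):
        if tag.startswith("B-"):
            # A B- tag always opens a new span.
            entities.append((tag[2:], ch))
        elif tag.startswith("I-"):
            # After repair, an I- tag always extends the most recent span.
            kind, text = entities[-1]
            entities[-1] = (kind, text + ch)
    return entities


# Hypothetical labels, for illustration only.
chars = list("[A]01")
tags = ["O", "B-TITLE", "O", "I-EPISODE", "I-EPISODE"]
print(aggregate_entities(chars, tags))  # [('TITLE', 'A'), ('EPISODE', '01')]
```

Keeping the post-processing this small is the point of the thin runtime: anything the model mislabels shows up directly in the output and is fixed through labels and training rather than patched with regexes.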

## Dataset Submodule / 数据集子模块

If `datasets/AnimeName` changed, commit and push it first:

如果 `datasets/AnimeName` 有变动,先提交并推送它:

```powershell
git -C datasets/AnimeName add dmhy_weak.jsonl dmhy_weak.manifest.json dmhy_weak_char.jsonl dmhy_weak_char.manifest.json vocab.json vocab.char.json
git -C datasets/AnimeName commit -m "Update anime filename labels"
git -C datasets/AnimeName lfs push origin main --all
git -C datasets/AnimeName push origin main
```

Then commit the submodule pointer in this repo:

然后在本仓库提交 submodule pointer:

```powershell
git add datasets/AnimeName
git commit -m "Update AnimeName dataset pointer"
```

## LFS Push Order / LFS 推送顺序

Large model artifacts are tracked with Git LFS. If Hugging Face rejects a push
because an LFS pointer points to a missing object, upload LFS objects first:

大模型文件通过 Git LFS 跟踪。如果 Hugging Face 因 LFS pointer 缺对象拒绝 push,
先上传 LFS 对象:

```powershell
git lfs push origin main --all
git push origin main
```

For dataset changes:

数据集变动:

```powershell
git -C datasets/AnimeName lfs push origin main --all
git -C datasets/AnimeName push origin main
```

## Update MiruPlay / 更新 MiruPlay

From MiruPlay root:

在 MiruPlay 根目录:

```powershell
git submodule update --remote --recursive tools/anime_parser
git add tools/anime_parser
git commit -m "Update AniFileBERT submodule"
```

If Android assets changed, also stage:

如果 Android assets 变化,也要提交:

```text
scraper/src/main/assets/anime_parser/anime_filename_parser.onnx
scraper/src/main/assets/anime_parser/config.json
scraper/src/main/assets/anime_parser/vocab.json
```