# AniFileBERT Maintenance / 维护手册
This repository is the standalone Hugging Face model repo that MiruPlay consumes as
the `tools/anime_parser` submodule.
本仓库是 MiruPlay 通过 `tools/anime_parser` 引用的独立 Hugging Face 模型仓库。
## Related Repositories / 相关仓库
| Repository / 仓库 | URL | Purpose / 用途 |
| --- | --- | --- |
| AniFileBERT | `https://huggingface.co/ModerRAS/AniFileBERT` | Model, scripts, ONNX export / 模型、脚本、ONNX 导出 |
| AnimeName | `https://huggingface.co/datasets/ModerRAS/AnimeName` | Dataset snapshot / 数据集快照 |
| MiruPlay | `https://github.com/ModerRAS/MiruPlay` | Android integration / Android 集成 |
Nested structure / 嵌套结构:
```text
AniFileBERT
datasets/AnimeName -> ModerRAS/AnimeName
```
## Clone / 克隆
```powershell
git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
```
After a normal (non-recursive) clone, fetch the submodule and install dependencies / 普通(非递归)clone 后,拉取子模块并安装依赖:
```powershell
git submodule update --init --recursive
uv sync
```
## Publishing Surface / 发布面
The repository root is the only published Hugging Face checkpoint location:
仓库根目录是唯一的 Hugging Face checkpoint 发布位置:
```text
config.json
model.safetensors
tokenizer_config.json
training_args.bin
vocab.json
vocab.char.json
```
Release reports are kept under `reports/`:
发布报告保存在 `reports/`:
```text
reports/run_metadata.json
reports/trainer_eval_metrics.json
reports/parse_eval_metrics.json
reports/case_metrics.json
reports/perf_metrics.json
reports/benchmark_results.json
reports/training_lineage.json
```
There is no tracked `model/` duplicate. Ignored `checkpoints/` directories are
local training artifacts only.
仓库不再跟踪旧的 `model/` 副本。被 ignore 的 `checkpoints/` 仅是本地训练产物。
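The two file lists above can be checked mechanically before a release. The following is a minimal sketch, not a script that ships with this repository: the helper name `missing_release_files` is hypothetical, and only the file names come from the lists above.

```python
from pathlib import Path

# File names copied from the "Publishing Surface" lists above.
ROOT_FILES = [
    "config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "training_args.bin",
    "vocab.json",
    "vocab.char.json",
]
REPORT_FILES = [
    "run_metadata.json",
    "trainer_eval_metrics.json",
    "parse_eval_metrics.json",
    "case_metrics.json",
    "perf_metrics.json",
    "benchmark_results.json",
    "training_lineage.json",
]


def missing_release_files(repo_root="."):
    """Return the expected release files that do not exist under repo_root.

    Hypothetical helper: it only checks presence, not file contents.
    """
    root = Path(repo_root)
    expected = list(ROOT_FILES) + [f"reports/{name}" for name in REPORT_FILES]
    return [rel for rel in expected if not (root / rel).is_file()]
```

Calling `missing_release_files(".")` from the repository root returns an empty list when every listed file is present, and otherwise the relative paths that still need to be produced or copied.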
## Standard Training / 标准训练
For full details, see [`training.md`](training.md).
完整流程见 [`training.md`](training.md)。
Current release training uses the virtual-shard flow in [`training.md`](training.md):
当前发布训练使用 [`training.md`](training.md) 中的 virtual-shard 流程:
```powershell
uv run python -m compileall -q anifilebert tools
cargo build --release --manifest-path tools/virtual_dataset_generator/Cargo.toml
# Then follow docs/training.md section "Full Training with Virtual BIO Shards".
```
## Publish a New Checkpoint / 发布新 checkpoint
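Copy the checkpoint files from the run's `final/` directory to the repository root, and its metric reports to `reports/`:
把该次训练 `final/` 目录中的 checkpoint 文件复制到仓库根目录,并把指标报告复制到 `reports/`: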
```powershell
$final = "checkpoints/dmhy-char-virtual-sps32-10epoch-lightfocus/final"
Copy-Item "$final/config.json" . -Force
Copy-Item "$final/model.safetensors" . -Force
Copy-Item "$final/tokenizer_config.json" . -Force
Copy-Item "$final/training_args.bin" . -Force
Copy-Item "$final/vocab.json" . -Force
New-Item -ItemType Directory -Path reports -Force | Out-Null
Copy-Item "$final/run_metadata.json" reports/run_metadata.json -Force
Copy-Item "$final/trainer_eval_metrics.json" reports/trainer_eval_metrics.json -Force
Copy-Item "$final/parse_eval_metrics.json" reports/parse_eval_metrics.json -Force
Copy-Item "$final/case_metrics.json" reports/case_metrics.json -Force
Copy-Item "$final/perf_metrics.json" reports/perf_metrics.json -Force
Copy-Item datasets/AnimeName/vocab.char.json .\vocab.char.json -Force
```
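For maintainers who are not on PowerShell, the same copy step could be sketched in Python. This is an illustrative equivalent of the commands above, not a tool from this repository: the function name `publish_checkpoint` and its parameters are assumptions, while the file names and the `datasets/AnimeName/vocab.char.json` source path mirror the PowerShell block.

```python
import shutil
from pathlib import Path

# Files copied to the repository root (same names as the PowerShell block).
CHECKPOINT_FILES = [
    "config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "training_args.bin",
    "vocab.json",
]
# Files copied into reports/.
REPORT_FILES = [
    "run_metadata.json",
    "trainer_eval_metrics.json",
    "parse_eval_metrics.json",
    "case_metrics.json",
    "perf_metrics.json",
]


def publish_checkpoint(final_dir, repo_root=".",
                       char_vocab="datasets/AnimeName/vocab.char.json"):
    """Copy a finished run into the publishing surface (hypothetical helper).

    Existing files are overwritten, matching Copy-Item -Force.
    char_vocab is resolved relative to repo_root.
    """
    final = Path(final_dir)
    root = Path(repo_root)
    reports = root / "reports"
    reports.mkdir(parents=True, exist_ok=True)
    for name in CHECKPOINT_FILES:
        shutil.copy2(final / name, root / name)
    for name in REPORT_FILES:
        shutil.copy2(final / name, reports / name)
    # The character vocabulary comes from the dataset submodule, not the run.
    shutil.copy2(root / char_vocab, root / "vocab.char.json")
```

From the repository root, a call such as `publish_checkpoint("checkpoints/dmhy-char-virtual-sps32-10epoch-lightfocus/final")` would reproduce the copy above. A missing source file raises `FileNotFoundError` rather than being skipped, so an incomplete run is noticed early.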
Export ONNX / 导出 ONNX:
```powershell
uv run python -m tools.export_onnx --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
```
Validate / 验证:
```powershell
uv run python -m tools.evaluate_parser_cases --model-dir . --case-file data/parser_regression_cases.json --output reports/case_metrics.json
uv run python -m tools.onnx_inference "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
uv run python -m tools.benchmark_inference --model-dir . --onnx exports/anime_filename_parser.onnx --case-file data/parser_regression_cases.json --repeat 20 --warmup 20 --torch-threads 1 --ort-threads 1 --output reports/benchmark_results.json
```
The default parser path is a thin runtime: model logits, constrained BIO
decoding, entity aggregation, and light string/number normalization. Do not add
structural filename regex assists back to the default runtime; parser quality
should come from labels and model training.
默认解析路径是薄层运行时:模型 logits、约束 BIO、实体聚合和轻量字符串/数字规范化。
不要把结构化文件名正则辅助重新加回默认运行时;解析质量应来自标签和模型训练。
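To make the "constrained BIO plus entity aggregation" steps concrete, here is a small self-contained sketch. It is not the repository's decoder: it assumes a greedy per-token decode in which an `I-X` tag is only allowed directly after `B-X` or `I-X`, it assumes character-level tokens that can be concatenated without separators, and the label names (`TITLE`, `EPISODE`) and scores are invented for illustration.

```python
def is_allowed(prev, label):
    """An I-X tag may only continue a B-X or I-X on the previous token."""
    if not label.startswith("I-"):
        return True
    kind = label[2:]
    return prev in ("B-" + kind, "I-" + kind)


def constrained_bio_decode(scores, labels):
    """Greedy decode: per token, take the best-scoring label that is allowed."""
    decoded = []
    prev = "O"
    for row in scores:
        allowed = [i for i, label in enumerate(labels) if is_allowed(prev, label)]
        best = max(allowed, key=lambda i: row[i])
        prev = labels[best]
        decoded.append(prev)
    return decoded


def aggregate_entities(tokens, tags):
    """Merge each B-X token with the I-X tokens that follow it into (X, text)."""
    entities = []
    current = None  # [entity_type, text] of the span being built, if any
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            current = [tag[2:], token]
            entities.append(current)
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1] += token
        else:
            current = None
    return [(kind, text) for kind, text in entities]


# Hypothetical label set and hand-written scores for the characters of "[AB]01".
LABELS = ["O", "B-TITLE", "I-TITLE", "B-EPISODE", "I-EPISODE"]
TOKENS = list("[AB]01")
SCORES = [
    [2.0, 0.1, 0.5, 0.0, 0.0],
    [0.1, 1.0, 2.0, 0.0, 0.0],  # raw argmax is I-TITLE, but it cannot follow O
    [0.1, 0.2, 2.0, 0.0, 0.0],
    [2.0, 0.0, 0.5, 0.0, 0.0],
    [0.1, 0.0, 0.0, 1.5, 2.0],  # raw argmax is I-EPISODE, but it cannot follow O
    [0.1, 0.0, 0.0, 0.3, 2.0],
]

TAGS = constrained_bio_decode(SCORES, LABELS)
ENTITIES = aggregate_entities(TOKENS, TAGS)
print(ENTITIES)  # [('TITLE', 'AB'), ('EPISODE', '01')]
```

The point of the sketch is the division of labor: the constraint only repairs tag sequences that are invalid under BIO, and aggregation only joins tokens, so neither step encodes knowledge about filename structure. That knowledge stays in the labels and the trained model.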
## Dataset Submodule / 数据集子模块
If `datasets/AnimeName` changed, commit and push it first:
如果 `datasets/AnimeName` 有变动,先提交并推送它:
```powershell
git -C datasets/AnimeName add dmhy_weak.jsonl dmhy_weak.manifest.json dmhy_weak_char.jsonl dmhy_weak_char.manifest.json vocab.json vocab.char.json
git -C datasets/AnimeName commit -m "Update anime filename labels"
git -C datasets/AnimeName lfs push origin main --all
git -C datasets/AnimeName push origin main
```
Then commit the submodule pointer in this repo:
然后在本仓库提交 submodule pointer:
```powershell
git add datasets/AnimeName
git commit -m "Update AnimeName dataset pointer"
```
## LFS Push Order / LFS 推送顺序
Large model artifacts are tracked with Git LFS. If Hugging Face rejects a push
because an LFS pointer points to a missing object, upload LFS objects first:
大模型文件通过 Git LFS 跟踪。如果 Hugging Face 因 LFS pointer 缺对象拒绝 push,
先上传 LFS 对象:
```powershell
git lfs push origin main --all
git push origin main
```
For dataset changes:
数据集变动:
```powershell
git -C datasets/AnimeName lfs push origin main --all
git -C datasets/AnimeName push origin main
```
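Before pushing, it can help to confirm that a large file in the working tree is the real artifact and not an unresolved LFS pointer, since a pointer-only checkout usually means the object is not available locally to upload. The sketch below relies only on the Git LFS pointer format, whose first line is `version https://git-lfs.github.com/spec/v1`. The helper name `is_lfs_pointer` is hypothetical and not part of this repository.

```python
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file at path starts like a Git LFS pointer file.

    Hypothetical helper: it inspects only the first bytes of the file and
    does not contact the remote or check the local LFS object store.
    """
    with open(path, "rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX
```

For example, `is_lfs_pointer("model.safetensors")` returning `True` suggests running `git lfs pull` before attempting the push order above.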
## Update MiruPlay / 更新 MiruPlay
From the MiruPlay root:
在 MiruPlay 根目录:
```powershell
git submodule update --remote --recursive tools/anime_parser
git add tools/anime_parser
git commit -m "Update AniFileBERT submodule"
```
If Android assets changed, also stage:
如果 Android assets 变化,也要提交:
```text
scraper/src/main/assets/anime_parser/anime_filename_parser.onnx
scraper/src/main/assets/anime_parser/config.json
scraper/src/main/assets/anime_parser/vocab.json
```