How to use ModerRAS/AniFileBERT with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("token-classification", model="ModerRAS/AniFileBERT")

# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("ModerRAS/AniFileBERT")
model = AutoModelForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
```
# AniFileBERT Maintenance / 维护手册

This repository is the standalone Hugging Face model repo used by MiruPlay as
`tools/anime_parser`.

本仓库是 MiruPlay 通过 `tools/anime_parser` 引用的独立 Hugging Face 模型仓库。
## Related Repositories / 相关仓库

| Repository / 仓库 | URL | Purpose / 用途 |
| --- | --- | --- |
| AniFileBERT | `https://huggingface.co/ModerRAS/AniFileBERT` | Model, scripts, ONNX export / 模型、脚本、ONNX 导出 |
| AnimeName | `https://huggingface.co/datasets/ModerRAS/AnimeName` | Dataset snapshot / 数据集快照 |
| MiruPlay | `https://github.com/ModerRAS/MiruPlay` | Android integration / Android 集成 |

Nested structure / 嵌套结构:

```text
AniFileBERT
  datasets/AnimeName -> ModerRAS/AnimeName
```
## Clone / 克隆

```powershell
git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
```

After a normal (non-recursive) clone / 普通 clone 后:

```powershell
git submodule update --init --recursive
uv sync
```
## Publishing Surface / 发布面

The repository root is the only published Hugging Face checkpoint location:

仓库根目录是唯一的 Hugging Face checkpoint 发布位置:

```text
config.json
model.safetensors
tokenizer_config.json
training_args.bin
vocab.json
vocab.char.json
```

Release reports are kept under `reports/`:

发布报告保存在 `reports/`:

```text
reports/run_metadata.json
reports/trainer_eval_metrics.json
reports/parse_eval_metrics.json
reports/case_metrics.json
reports/perf_metrics.json
reports/benchmark_results.json
reports/training_lineage.json
```

The old `model/` duplicate is no longer tracked. Git-ignored `checkpoints/`
directories are local training artifacts only.

仓库不再跟踪旧的 `model/` 副本。被 ignore 的 `checkpoints/` 仅是本地训练产物。
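
Before pushing, it can help to confirm that every file listed above is actually present. The following is a minimal sketch, not part of the repository's tooling: the helper name `missing_release_files` is hypothetical, and the file names are copied from the two lists above.

```python
from pathlib import Path

# File names copied from the "Publishing Surface" lists above.
ROOT_FILES = [
    "config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "training_args.bin",
    "vocab.json",
    "vocab.char.json",
]
REPORT_FILES = [
    "run_metadata.json",
    "trainer_eval_metrics.json",
    "parse_eval_metrics.json",
    "case_metrics.json",
    "perf_metrics.json",
    "benchmark_results.json",
    "training_lineage.json",
]


def missing_release_files(repo_root="."):
    """Return the expected release files that are absent (hypothetical helper)."""
    root = Path(repo_root)
    expected = [root / name for name in ROOT_FILES]
    expected += [root / "reports" / name for name in REPORT_FILES]
    # Report paths relative to the repository root, with forward slashes.
    return [p.relative_to(root).as_posix() for p in expected if not p.is_file()]


if __name__ == "__main__":
    for name in missing_release_files():
        print(f"missing: {name}")
```

Run from the repository root, the script prints one line per missing file and prints nothing when the publishing surface is complete.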
## Standard Training / 标准训练

For full details, see [`training.md`](training.md).

完整流程见 [`training.md`](training.md)。

Current release training uses the virtual-shard flow in [`training.md`](training.md):

当前发布训练使用 [`training.md`](training.md) 中的 virtual-shard 流程:

```powershell
# Byte-compile the Python packages as a quick syntax check.
uv run python -m compileall -q anifilebert tools
# Build the Rust virtual dataset generator in release mode.
cargo build --release --manifest-path tools/virtual_dataset_generator/Cargo.toml
# Then follow docs/training.md section "Full Training with Virtual BIO Shards".
```
## Publish a New Checkpoint / 发布新 checkpoint

Copy the files from the checkpoint's `final/` directory to the repository root:

把 `final` 文件复制到仓库根目录:

```powershell
$final = "checkpoints/dmhy-char-virtual-sps32-10epoch-lightfocus/final"
# Model and tokenizer files go to the repository root.
Copy-Item "$final/config.json" . -Force
Copy-Item "$final/model.safetensors" . -Force
Copy-Item "$final/tokenizer_config.json" . -Force
Copy-Item "$final/training_args.bin" . -Force
Copy-Item "$final/vocab.json" . -Force
# Evaluation reports go under reports/.
New-Item -ItemType Directory -Path reports -Force | Out-Null
Copy-Item "$final/run_metadata.json" reports/run_metadata.json -Force
Copy-Item "$final/trainer_eval_metrics.json" reports/trainer_eval_metrics.json -Force
Copy-Item "$final/parse_eval_metrics.json" reports/parse_eval_metrics.json -Force
Copy-Item "$final/case_metrics.json" reports/case_metrics.json -Force
Copy-Item "$final/perf_metrics.json" reports/perf_metrics.json -Force
# The character vocabulary comes from the dataset submodule.
Copy-Item datasets/AnimeName/vocab.char.json .\vocab.char.json -Force
```
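For maintainers who are not on PowerShell, the same copy step could be scripted in Python. This is a sketch under assumptions, not an existing repository tool: `publish_checkpoint` is a hypothetical name, and the file lists simply mirror the `Copy-Item` lines above.

```python
import shutil
from pathlib import Path

# Mirrors the Copy-Item lines above.
CHECKPOINT_FILES = [
    "config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "training_args.bin",
    "vocab.json",
]
REPORT_FILES = [
    "run_metadata.json",
    "trainer_eval_metrics.json",
    "parse_eval_metrics.json",
    "case_metrics.json",
    "perf_metrics.json",
]


def publish_checkpoint(final_dir, repo_root=".",
                       char_vocab="datasets/AnimeName/vocab.char.json"):
    """Copy a final checkpoint into the publishing surface (hypothetical helper).

    `char_vocab` is resolved relative to `repo_root`. A missing source file
    raises FileNotFoundError, so a partial checkpoint is noticed immediately.
    """
    final_dir = Path(final_dir)
    repo_root = Path(repo_root)
    reports = repo_root / "reports"
    reports.mkdir(parents=True, exist_ok=True)

    copied = []
    for name in CHECKPOINT_FILES:
        copied.append(shutil.copy2(final_dir / name, repo_root / name))
    for name in REPORT_FILES:
        copied.append(shutil.copy2(final_dir / name, reports / name))
    # The character vocabulary comes from the dataset submodule.
    copied.append(shutil.copy2(repo_root / char_vocab, repo_root / "vocab.char.json"))
    return [Path(p) for p in copied]
```

A call such as `publish_checkpoint("checkpoints/dmhy-char-virtual-sps32-10epoch-lightfocus/final")` from the repository root would overwrite the existing files, matching the `-Force` behavior of the PowerShell version.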
Export ONNX / 导出 ONNX:

```powershell
uv run python -m tools.export_onnx --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
```

Validate / 验证:

```powershell
uv run python -m tools.evaluate_parser_cases --model-dir . --case-file data/parser_regression_cases.json --output reports/case_metrics.json
uv run python -m tools.onnx_inference "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
uv run python -m tools.benchmark_inference --model-dir . --onnx exports/anime_filename_parser.onnx --case-file data/parser_regression_cases.json --repeat 20 --warmup 20 --torch-threads 1 --ort-threads 1 --output reports/benchmark_results.json
```
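The `--warmup` and `--repeat` flags in the benchmark command suggest the usual pattern of running untimed warmup passes before the measured ones. How `tools.benchmark_inference` actually measures is not shown in this document, so the sketch below is only one plausible reading: `benchmark` is a hypothetical function, and `fn` stands in for whichever inference call is being timed.

```python
import statistics
import time


def benchmark(fn, inputs, repeat=20, warmup=20):
    """Time `fn` over `inputs`; warmup passes run but are not measured.

    Assumes repeat >= 1. Hypothetical helper, not the repository's tool.
    """
    # Warmup: lets caches, lazy initialization, and thread pools settle.
    for _ in range(warmup):
        for item in inputs:
            fn(item)

    per_pass = []
    for _ in range(repeat):
        start = time.perf_counter()
        for item in inputs:
            fn(item)
        per_pass.append(time.perf_counter() - start)

    n = max(len(inputs), 1)
    return {
        "repeat": repeat,
        "warmup": warmup,
        "mean_ms_per_item": 1000 * statistics.mean(per_pass) / n,
        "median_ms_per_item": 1000 * statistics.median(per_pass) / n,
    }
```

Pinning both runtimes to a single thread, as the command above does, keeps the PyTorch and ONNX Runtime numbers comparable.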
The default parser path is a thin runtime: model logits, constrained BIO
decoding, entity aggregation, and light string/number normalization. Do not add
structural filename regex assists back to the default runtime; parser quality
should come from labels and model training.

默认解析路径是薄层运行时:模型 logits、约束 BIO、实体聚合和轻量字符串/数字规范化。
不要把结构化文件名正则辅助重新加回默认运行时;解析质量应来自标签和模型训练。
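
To make the "constrained BIO" and "entity aggregation" steps concrete, here is a minimal sketch in plain Python. It is an illustration, not the repository's implementation: the function names and the `GROUP`/`EPISODE` labels are hypothetical (in a Transformers checkpoint the real label names are normally listed under `id2label` in `config.json`), the decoder is a simple greedy one, and tokens are assumed to be single characters. Normalization is omitted.

```python
def constrained_bio_decode(logits, labels):
    """Greedy decode that emits I-X only directly after B-X or I-X.

    `logits` is a list of per-token score rows; `labels` maps column index to
    a BIO label string. Assumes `labels` contains "O" or at least one B- label.
    """
    tags = []
    prev = "O"
    for row in logits:
        # Rank label indices from highest to lowest score.
        ranked = sorted(range(len(labels)), key=lambda i: row[i], reverse=True)
        for i in ranked:
            candidate = labels[i]
            if candidate.startswith("I-") and prev[2:] != candidate[2:]:
                continue  # I-X may only continue an open X entity
            prev = candidate
            break
        else:
            prev = "O"  # no valid label found; fall back to outside
        tags.append(prev)
    return tags


def aggregate_entities(tokens, tags):
    """Merge B-/I- runs into (entity_type, text) pairs."""
    entities = []
    current = None  # [entity_type, [token, ...]]
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = [tag[2:], [token]]
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)
        else:
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    # Character-level tokens are joined without separators.
    return [(kind, "".join(parts)) for kind, parts in entities]


if __name__ == "__main__":
    labels = ["O", "B-GROUP", "I-GROUP", "B-EPISODE", "I-EPISODE"]
    logits = [
        [5, 1, 0, 0, 0],
        [0, 2, 3, 0, 0],  # I-GROUP scores highest but is invalid after O
        [0, 1, 4, 0, 0],
        [5, 0, 1, 0, 0],
        [0, 0, 0, 4, 1],
        [0, 0, 3, 0, 2],  # I-GROUP is invalid after B-EPISODE
    ]
    tags = constrained_bio_decode(logits, labels)
    print(aggregate_entities(list("[AB]01"), tags))
```

The constraint is what lets the runtime stay free of regex assists: an invalid `I-` prediction is replaced by the next-best valid label, not patched up afterwards with filename rules.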
## Dataset Submodule / 数据集子模块

If `datasets/AnimeName` changed, commit and push it first:

如果 `datasets/AnimeName` 有变动,先提交并推送它:

```powershell
git -C datasets/AnimeName add dmhy_weak.jsonl dmhy_weak.manifest.json dmhy_weak_char.jsonl dmhy_weak_char.manifest.json vocab.json vocab.char.json
git -C datasets/AnimeName commit -m "Update anime filename labels"
git -C datasets/AnimeName lfs push origin main --all
git -C datasets/AnimeName push origin main
```

Then commit the submodule pointer in this repo:

然后在本仓库提交 submodule pointer:

```powershell
git add datasets/AnimeName
git commit -m "Update AnimeName dataset pointer"
```
## LFS Push Order / LFS 推送顺序

Large model artifacts are tracked with Git LFS. If Hugging Face rejects a push
because an LFS pointer points to a missing object, upload LFS objects first:

大模型文件通过 Git LFS 跟踪。如果 Hugging Face 因 LFS pointer 缺对象拒绝 push,
先上传 LFS 对象:

```powershell
git lfs push origin main --all
git push origin main
```

For dataset changes:

数据集变动:

```powershell
git -C datasets/AnimeName lfs push origin main --all
git -C datasets/AnimeName push origin main
```
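The ordering above can be wrapped in a small script so the LFS upload always runs before the regular push. This is a sketch only: `push_commands` and `push_with_lfs_first` are hypothetical helpers, and the remote and branch names default to the `origin` and `main` used in the commands above.

```python
import subprocess


def push_commands(repo_dir=None, remote="origin", branch="main"):
    """Build the two git commands, LFS objects first, as argument lists."""
    git = ["git"] + (["-C", repo_dir] if repo_dir else [])
    return [
        git + ["lfs", "push", remote, branch, "--all"],
        git + ["push", remote, branch],
    ]


def push_with_lfs_first(repo_dir=None, dry_run=True):
    """Run (or, by default, only print) the push sequence for one repository."""
    commands = push_commands(repo_dir)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the sequence if the LFS upload fails.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Dataset submodule first, then the model repository itself.
    push_with_lfs_first("datasets/AnimeName")
    push_with_lfs_first()
```

The dry-run default prints the commands without contacting the remote; pass `dry_run=False` to execute them.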
## Update MiruPlay / 更新 MiruPlay

From the MiruPlay root:

在 MiruPlay 根目录:

```powershell
git submodule update --remote --recursive tools/anime_parser
git add tools/anime_parser
git commit -m "Update AniFileBERT submodule"
```

If Android assets changed, also stage them:

如果 Android assets 变化,也要提交:

```text
scraper/src/main/assets/anime_parser/anime_filename_parser.onnx
scraper/src/main/assets/anime_parser/config.json
scraper/src/main/assets/anime_parser/vocab.json
```
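One way to tell whether the Android assets are out of date is to compare file hashes against the parser checkout. The sketch below rests on an assumption this document does not state: that each asset is a copy of the same-named file in `tools/anime_parser` (with the ONNX model taken from `exports/`). The mapping and the helper `stale_assets` are hypothetical.

```python
import hashlib
from pathlib import Path

# ASSUMPTION: asset name -> source path inside the parser checkout,
# inferred from file names only.
ASSET_SOURCES = {
    "anime_filename_parser.onnx": "exports/anime_filename_parser.onnx",
    "config.json": "config.json",
    "vocab.json": "vocab.json",
}


def sha256_of(path):
    """Hash a file in chunks so large models do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stale_assets(parser_dir="tools/anime_parser",
                 assets_dir="scraper/src/main/assets/anime_parser"):
    """List assets that are missing or differ from their assumed source."""
    stale = []
    for asset_name, source_rel in ASSET_SOURCES.items():
        source = Path(parser_dir) / source_rel
        asset = Path(assets_dir) / asset_name
        if not source.is_file():
            continue  # nothing to compare against
        if not asset.is_file() or sha256_of(source) != sha256_of(asset):
            stale.append(asset_name)
    return stale


if __name__ == "__main__":
    for name in stale_assets():
        print(f"needs update: {name}")
```

Run from the MiruPlay root, it prints the assets that would need to be refreshed and staged alongside the submodule pointer.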