How to use ModerRAS/AniFileBERT with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("token-classification", model="ModerRAS/AniFileBERT")

# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("ModerRAS/AniFileBERT")
model = AutoModelForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
```

# Training Guide / 训练指南

This document describes the reproducible training workflow for AniFileBERT.

本文档记录 AniFileBERT 的可复现训练流程。

## 1. Environment / 环境

Prefer `uv` for dependency management and for running commands.

所有依赖和命令优先使用 `uv`。

```powershell
uv sync
uv run python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```

Recommended GPU configuration:

推荐 GPU 配置:

- RTX 3080 class GPU or better; current release training used an RTX 5070 Ti
- batch size `1792` with the virtual-sharded dataset path on the 5070 Ti
- `bf16`/TF32 on Ada/Blackwell-class CUDA devices when available
- `--num-workers 4 --persistent-workers` with the virtual-sharded dataset path

## 2. Dataset / 数据集

The authoritative dataset lives in the nested submodule:

权威数据集位于嵌套子模块:

```text
datasets/AnimeName/dmhy_weak.jsonl
datasets/AnimeName/dmhy_weak_char.jsonl
datasets/AnimeName/vocab.json
datasets/AnimeName/vocab.char.json
```

Current expected properties:

当前期望属性:

- rows / 行数: `632002`
- strict BIO violations / 严格 BIO 违规: `0`
- character vocab / 字符词表: `6199`
- character coverage / 字符覆盖率: `100%`

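The properties above can be re-checked with a few lines of Python. This is a minimal sketch, not the project's own validation tool: it assumes each JSONL row has `tokens` and `labels` fields and that the vocab is a mapping keyed by character. Those field names and the helper names are hypothetical.

```python
import json


def bio_violations(labels):
    """Count strict BIO violations: I-X must directly follow B-X or I-X."""
    violations = 0
    prev = "O"
    for tag in labels:
        if tag.startswith("I-"):
            entity = tag[2:]
            if prev not in (f"B-{entity}", f"I-{entity}"):
                violations += 1
        prev = tag
    return violations


def char_coverage(texts, vocab):
    """Fraction of characters in `texts` that are present in `vocab`."""
    total = covered = 0
    for text in texts:
        for ch in text:
            total += 1
            covered += ch in vocab
    return covered / total if total else 1.0


def summarize(jsonl_lines, vocab):
    """Summarize row count, BIO violations, and character coverage.

    Assumed (hypothetical) row schema: {"tokens": [...], "labels": [...]}.
    """
    rows = [json.loads(line) for line in jsonl_lines if line.strip()]
    return {
        "rows": len(rows),
        "bio_violations": sum(bio_violations(r["labels"]) for r in rows),
        "char_coverage": char_coverage(("".join(r["tokens"]) for r in rows), vocab),
    }
```

For the real files, `jsonl_lines` would be the open dataset file and `vocab` the parsed `vocab.char.json`; adjust the field names to the actual schema.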
## 3. Relabel Full Dataset / 全量重标注

Run this step whenever the weak-label rules in `tools/dmhy_dataset.py` or `anifilebert/label_repairs.py` have changed.

当 `tools/dmhy_dataset.py` 或 `anifilebert/label_repairs.py` 的弱标注规则改变时,使用此流程。

```powershell
uv run python -m tools.relabel_dataset_from_filenames `
  --input datasets/AnimeName/dmhy_weak.jsonl `
  --output datasets/AnimeName/dmhy_weak.relabel.jsonl `
  --manifest-output datasets/AnimeName/dmhy_weak.relabel.manifest.json `
  --vocab-output datasets/AnimeName/vocab.relabel.json `
  --base-vocab datasets/AnimeName/vocab.json `
  --max-vocab-size 8000 `
  --progress 50000
```

After checking the manifest and sample labels, replace the authoritative files:

检查 manifest 和样本标注后,再替换权威文件:

```powershell
Move-Item datasets/AnimeName/dmhy_weak.relabel.jsonl datasets/AnimeName/dmhy_weak.jsonl -Force
Move-Item datasets/AnimeName/vocab.relabel.json datasets/AnimeName/vocab.json -Force
Move-Item datasets/AnimeName/dmhy_weak.relabel.manifest.json datasets/AnimeName/dmhy_weak.manifest.json -Force
```

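One way to spot-check the relabeled output before moving it over the authoritative files is to list the rows whose labels changed. The sketch below assumes each JSONL row has `filename` and `labels` fields. Both field names and the helper `label_changes` are hypothetical and may not match the real schema.

```python
import json


def label_changes(old_lines, new_lines, key="filename"):
    """Return keys of rows present in both files whose labels differ.

    Assumed (hypothetical) row schema: {"filename": str, "labels": [...]}.
    """
    old = {row[key]: row["labels"] for row in map(json.loads, old_lines)}
    changed = []
    for row in map(json.loads, new_lines):
        before = old.get(row[key])
        # Rows that exist only in the new file are not counted as changes.
        if before is not None and before != row["labels"]:
            changed.append(row[key])
    return changed
```

Reviewing a random sample of the returned filenames gives a quick view of what the rule change actually affected.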
## 4. Convert to Character Dataset / 转换为字符数据集

The published checkpoint uses the character tokenizer.

当前发布模型使用字符级 tokenizer。

```powershell
uv run python -m tools.convert_to_char_dataset `
  --input datasets/AnimeName/dmhy_weak.jsonl `
  --output datasets/AnimeName/dmhy_weak_char.jsonl `
  --vocab-output datasets/AnimeName/vocab.char.json `
  --manifest-output datasets/AnimeName/dmhy_weak_char.manifest.json `
  --progress 50000
```

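To illustrate what a token-to-character conversion typically involves, the sketch below expands token-level BIO labels to one label per character. It shows the general technique only and is not taken from `tools.convert_to_char_dataset`; the label names and the helper `expand_to_chars` are hypothetical.

```python
def expand_to_chars(tokens, labels):
    """Expand token-level BIO labels to one label per character.

    The first character of a B-X token keeps B-X; the remaining characters
    become I-X. Tokens labeled I-X or O keep that label on every character.
    """
    chars, char_labels = [], []
    for token, label in zip(tokens, labels):
        for i, ch in enumerate(token):
            chars.append(ch)
            if label.startswith("B-") and i > 0:
                char_labels.append("I-" + label[2:])
            else:
                char_labels.append(label)
    return chars, char_labels


chars, char_labels = expand_to_chars(["[", "GM", "]"], ["O", "B-GROUP", "O"])
print(list(zip(chars, char_labels)))
```

Keeping a single `B-` per entity is what preserves strict BIO validity after the expansion.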
## 5. Full Training with Virtual BIO Shards / 虚拟 BIO shard 全量训练

Recommended RTX 5070 Ti run:

推荐 RTX 5070 Ti 训练命令:

```powershell
# 1. Write the seed-105 shuffled 98% train split of the character dataset.
@'
import random
from pathlib import Path

source = Path("datasets/AnimeName/dmhy_weak_char.jsonl")
target = Path("data/generated/virtual_source_train_seed105.jsonl")
rows = [line for line in source.read_text(encoding="utf-8").splitlines() if line]
random.Random(105).shuffle(rows)
target.parent.mkdir(parents=True, exist_ok=True)
target.write_text("\n".join(rows[: int(len(rows) * 0.98)]) + "\n", encoding="utf-8")
'@ | .\.venv\Scripts\python.exe -

# 2. Build the Rust virtual dataset generator.
cargo build --release --manifest-path tools/virtual_dataset_generator/Cargo.toml

# 3. Generate the pre-encoded virtual shards from the train split.
.\tools\virtual_dataset_generator\target\release\anifilebert-virtual-dataset-generator.exe `
  --input data/generated/virtual_source_train_seed105.jsonl `
  --vocab-file datasets/AnimeName/vocab.char.json `
  --output-dir data/generated/virtual_char_sps32_seed105 `
  --max-length 128 `
  --samples-per-source 32 `
  --seed 105 `
  --threads 20 `
  --separator-mode per-gap `
  --bracket-mode per-part

# 4. Train on the virtual shards, starting from the checkpoint in the repository root.
.\.venv\Scripts\python.exe -m anifilebert.train --tokenizer char `
  --data-file datasets/AnimeName/dmhy_weak_char.jsonl `
  --vocab-file datasets/AnimeName/vocab.char.json `
  --virtual-dataset-dir data/generated/virtual_char_sps32_seed105 `
  --save-dir checkpoints/dmhy-char-virtual-sps32-10epoch-lr1e5 `
  --init-model-dir . `
  --epochs 10 `
  --batch-size 1792 `
  --learning-rate 0.00001 `
  --warmup-steps 2000 `
  --max-seq-length 128 `
  --train-split 0.98 `
  --num-workers 4 `
  --prefetch-factor 4 `
  --persistent-workers `
  --checkpoint-steps 5000 `
  --save-total-limit 3 `
  --parse-eval-limit 2048 `
  --case-eval-file data/parser_regression_cases.json `
  --bf16 `
  --no-periodic-eval `
  --perf-log-steps 1000 `
  --perf-sample-interval 0.5 `
  --seed 105 `
  --experiment-name dmhy-char-virtual-sps32-10epoch-lr1e5
```

The Rust generator samples BIO entity block subsets/permutations, separator
variants, bracket styles, incomplete filename fragments, and standalone special
fixtures, and writes them as compact pre-encoded `.npy` shards. The current
release generated `20,439,848` training rows from `619,361` train-split source
rows plus `935` special fixtures, then trained for 10 epochs / `114,070`
optimizer steps.

Rust 生成器会把 BIO 实体块子集/重排、分隔符变体、括号样式、不完整文件名片段、
以及 standalone special fixtures 预编码成紧凑 `.npy` shard。当前发布从 `619,361`
条 train split 源样本和 `935` 条 special fixture 生成了 `20,439,848` 条训练行,
并完整训练 10 epoch / `114,070` 个 optimizer steps。

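The reported counts can be cross-checked with simple arithmetic. The sketch below reproduces them under three assumptions that the guide does not state: each source row yields itself plus 32 virtual samples (33 rows), the final partial batch of each epoch counts as a step, and there is no gradient accumulation.

```python
import math

SOURCE_ROWS = 632_002        # rows in the authoritative dataset
TRAIN_SPLIT = 0.98           # same ratio as the split script and --train-split
SAMPLES_PER_SOURCE = 32      # --samples-per-source
SPECIAL_FIXTURES = 935
BATCH_SIZE = 1792
EPOCHS = 10

# Same truncation as int(len(rows) * 0.98) in the split script.
train_sources = int(SOURCE_ROWS * TRAIN_SPLIT)

# Assumption: one original row plus 32 virtual samples per source row.
rows_per_source = 1 + SAMPLES_PER_SOURCE
training_rows = train_sources * rows_per_source + SPECIAL_FIXTURES

# Assumption: the last partial batch is kept, with no gradient accumulation.
steps_per_epoch = math.ceil(training_rows / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

print(train_sources, training_rows, steps_per_epoch, total_steps)
```

Under those assumptions the script yields 619,361 source rows, 20,439,848 training rows, and 114,070 optimizer steps, matching the figures reported for the release.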
Training outputs:

训练输出:

- `checkpoints/<run>/checkpoint-*`: resumable checkpoints / 可恢复 checkpoint
- `checkpoints/<run>/final`: final Hugging Face checkpoint / 最终 checkpoint
- `final/run_metadata.json`: run configuration / 训练配置
- `final/trainer_eval_metrics.json`: seqeval metrics / token/entity 指标
- `final/parse_eval_metrics.json`: held-out parser exact-match / held-out 解析准确率
- `final/case_metrics.json`: fixed real-world case regression / 固定真实 case 回归
- `final/perf_metrics.json`: training throughput/GPU telemetry when enabled / 启用时记录训练吞吐和 GPU 采样
- TensorBoard logs unless `--no-tensorboard` is set / 默认写 TensorBoard

## 6. Thin Hard-Case Fine-Tuning / 薄层困难样本微调

Use hard-case fine-tuning only after a specific real-world failure pattern has
been confirmed, fixed in the weak labels, and added to
`data/parser_regression_cases.json`.

只有在确认某类真实失败样式、修复弱标注并加入
`data/parser_regression_cases.json` 后,才使用困难样本微调。

```powershell
uv run python -m tools.build_repair_focus_dataset `
  --input datasets/AnimeName/dmhy_weak_char.jsonl `
  --output data/generated/focus_after_virtual_sps32_char.jsonl `
  --context-samples 50000 `
  --repeat-focus 3 `
  --repeat-manual 96 `
  --seed 205

.\.venv\Scripts\python.exe -m anifilebert.train --tokenizer char `
  --data-file data/generated/focus_after_virtual_sps32_char.jsonl `
  --vocab-file datasets/AnimeName/vocab.char.json `
  --save-dir checkpoints/dmhy-char-virtual-sps32-10epoch-lightfocus `
  --init-model-dir checkpoints/dmhy-char-virtual-sps32-10epoch-lr1e5/final `
  --epochs 1 `
  --batch-size 1792 `
  --learning-rate 0.000002 `
  --warmup-steps 20 `
  --max-seq-length 128 `
  --train-split 0.95 `
  --num-workers 0 `
  --checkpoint-steps 300 `
  --save-total-limit 2 `
  --parse-eval-limit 2048 `
  --case-eval-file data/parser_regression_cases.json `
  --bf16 `
  --no-periodic-eval `
  --perf-log-steps 50 `
  --perf-sample-interval 0.5 `
  --seed 208 `
  --experiment-name dmhy-char-virtual-sps32-10epoch-lightfocus
```

The default quality gate is model-led parsing:

默认质量门槛以模型主导解析为准:

- fixed regression `model_only >= 85%`
- held-out parse `model_only >= 75%`
- `normalized_only` is the default thin runtime metric
- structural filename assists are not part of training or release metrics
- 固定回归 `model_only >= 85%`
- held-out 解析 `model_only >= 75%`
- `normalized_only` 是默认薄层运行时指标
- 结构化文件名辅助不属于训练或发布指标

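The two thresholds above can be turned into an automated check. The sketch below is illustrative only: it assumes each metrics dictionary exposes a top-level `model_only` accuracy as a fraction between 0 and 1, which may not match the actual layout of `case_metrics.json` and `parse_eval_metrics.json`. The function name `passes_quality_gate` is hypothetical.

```python
FIXED_REGRESSION_MIN = 0.85  # fixed regression model_only >= 85%
HELD_OUT_MIN = 0.75          # held-out parse model_only >= 75%


def passes_quality_gate(case_metrics, parse_metrics):
    """Return (ok, failures) for the model-led quality gate.

    Assumed (hypothetical) layout: each dict has a top-level "model_only"
    accuracy expressed as a fraction in [0, 1].
    """
    failures = []
    if case_metrics["model_only"] < FIXED_REGRESSION_MIN:
        failures.append("fixed regression model_only below 85%")
    if parse_metrics["model_only"] < HELD_OUT_MIN:
        failures.append("held-out parse model_only below 75%")
    return (not failures, failures)


ok, failures = passes_quality_gate({"model_only": 0.91}, {"model_only": 0.74})
print(ok, failures)
```

Returning the list of failed criteria rather than a bare boolean makes it easier to see which gate blocked a release.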
## 7. Publish to Repository Root / 发布到仓库根目录

The repository root is the Hugging Face checkpoint surface.

仓库根目录就是 Hugging Face checkpoint 发布面。

```powershell
$final = "checkpoints/dmhy-char-virtual-sps32-10epoch-lightfocus/final"
Copy-Item "$final/config.json" . -Force
Copy-Item "$final/model.safetensors" . -Force
Copy-Item "$final/tokenizer_config.json" . -Force
Copy-Item "$final/training_args.bin" . -Force
Copy-Item "$final/vocab.json" . -Force
New-Item -ItemType Directory -Path reports -Force | Out-Null
Copy-Item "$final/run_metadata.json" reports/run_metadata.json -Force
Copy-Item "$final/trainer_eval_metrics.json" reports/trainer_eval_metrics.json -Force
Copy-Item "$final/parse_eval_metrics.json" reports/parse_eval_metrics.json -Force
Copy-Item "$final/case_metrics.json" reports/case_metrics.json -Force
Copy-Item "$final/perf_metrics.json" reports/perf_metrics.json -Force
Copy-Item datasets/AnimeName/vocab.char.json .\vocab.char.json -Force
```

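Before copying, it can help to confirm that the `final` directory contains every file the commands above expect. The sketch below is a hypothetical pre-publish check, not part of the repository's tooling. The file list mirrors the `Copy-Item` commands, and `missing_publish_files` is an illustrative name.

```python
from pathlib import Path

# File names taken from the Copy-Item commands in this section.
REQUIRED_FILES = [
    "config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "training_args.bin",
    "vocab.json",
    "run_metadata.json",
    "trainer_eval_metrics.json",
    "parse_eval_metrics.json",
    "case_metrics.json",
    # Only written when performance logging is enabled during training.
    "perf_metrics.json",
]


def missing_publish_files(final_dir):
    """Return the expected files that are absent from `final_dir`."""
    final = Path(final_dir)
    return [name for name in REQUIRED_FILES if not (final / name).is_file()]
```

An empty list means the publish copy should not fail on a missing source file. For example, a run without performance logging would report `perf_metrics.json` here.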
Then export ONNX:

然后导出 ONNX:

```powershell
uv run python -m tools.export_onnx --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
```

## 8. Validation Checklist / 验证清单

Run these before committing:

提交前执行:

```powershell
uv run python -m compileall -q anifilebert tools
uv run python -m tools.evaluate_parser_cases --model-dir . --case-file data/parser_regression_cases.json --output reports/case_metrics.json
uv run python -m anifilebert.inference --model-dir . "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
uv run python -m tools.onnx_inference "[YYDM&VCB-Studio] Shinsekai Yori [NCED02][Ma10p_1080p][x265_flac].mkv"
uv run python -m tools.benchmark_inference --model-dir . --onnx exports/anime_filename_parser.onnx --case-file data/parser_regression_cases.json --repeat 20 --warmup 20 --torch-threads 1 --ort-threads 1 --output reports/benchmark_results.json
```

## 9. Git and LFS Order / Git 与 LFS 顺序

If the dataset submodule changed, commit and push it first:

如果数据集子模块有变动:

```powershell
git -C datasets/AnimeName add dmhy_weak.jsonl dmhy_weak.manifest.json dmhy_weak_char.jsonl dmhy_weak_char.manifest.json vocab.json vocab.char.json
git -C datasets/AnimeName commit -m "Update anime filename labels"
git -C datasets/AnimeName lfs push origin main --all
git -C datasets/AnimeName push origin main
```

Then commit the model repo:

再提交模型仓库:

```powershell
git add README.md docs/maintenance.md docs/android.md docs/training.md docs/onnx.md `
  config.json model.safetensors tokenizer_config.json training_args.bin vocab.json vocab.char.json `
  exports/anime_filename_parser.onnx exports/anime_filename_parser.metadata.json `
  reports anifilebert tools data/parser_regression_cases.json datasets/AnimeName
git commit -m "Update AniFileBERT model and documentation"
git lfs push origin main --all
git push origin main
```