Training Guide
This document describes the reproducible training workflow for AniFileBERT.
1. Environment
Use uv for all dependency management and command execution.
uv sync
uv run python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
Recommended GPU configuration:
- RTX 3080 class GPU or better; current release training used an RTX 5070 Ti
- batch size 1792 with the virtual-sharded dataset path on the 5070 Ti
- bf16/TF32 on Ada/Blackwell-class CUDA devices when available
- --num-workers 4 --persistent-workers with the virtual-sharded dataset path
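Before choosing --bf16, it can help to confirm what the local device supports. The following is a minimal sketch, not part of the repository: the helper name describe_accelerator is hypothetical, and it relies only on torch.cuda.is_available, torch.cuda.get_device_name, and torch.cuda.is_bf16_supported. It falls back to a "no accelerator" report when torch is not installed.

```python
def describe_accelerator() -> dict:
    """Report CUDA and bf16 availability; degrade gracefully without torch."""
    try:
        import torch
    except ImportError:
        # torch is missing, so nothing can run on a GPU from this environment.
        return {"torch": None, "cuda": False, "device": None, "bf16": False}

    cuda = torch.cuda.is_available()
    return {
        "torch": str(torch.__version__),
        "cuda": cuda,
        # Only query the device when CUDA is actually usable.
        "device": torch.cuda.get_device_name(0) if cuda else None,
        "bf16": bool(cuda and torch.cuda.is_bf16_supported()),
    }


report = describe_accelerator()
print(report)
```

If "bf16" comes back False, dropping the --bf16 flag from the training commands is the conservative choice.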
2. Dataset
The authoritative dataset lives in the nested submodule:
datasets/AnimeName/dmhy_weak.jsonl
datasets/AnimeName/dmhy_weak_char.jsonl
datasets/AnimeName/vocab.json
datasets/AnimeName/vocab.char.json
Current expected properties:
- rows: 632002
- strict BIO violations: 0
- character vocab: 6199
- character coverage: 100%
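The row count and the strict BIO check can be reproduced with a few lines of Python. This is a sketch under stated assumptions: it treats each JSONL row as an object with a "labels" list of BIO tags (the actual key name is not documented here), the entity names in the sample are hypothetical, and "strict" is taken to mean that every I-X tag must directly follow B-X or I-X.

```python
import json
from typing import Iterable


def bio_violations(labels: list[str]) -> int:
    """Count I-X tags that do not directly follow B-X or I-X."""
    violations = 0
    previous = "O"
    for label in labels:
        if label.startswith("I-"):
            entity = label[2:]
            if previous not in (f"B-{entity}", f"I-{entity}"):
                violations += 1
        previous = label
    return violations


def summarize(lines: Iterable[str], label_key: str = "labels") -> dict:
    """Count non-empty rows and total BIO violations; label_key is an assumption."""
    rows = 0
    violations = 0
    for line in lines:
        if not line.strip():
            continue
        rows += 1
        violations += bio_violations(json.loads(line)[label_key])
    return {"rows": rows, "bio_violations": violations}


# Hypothetical rows: the second one starts an entity with I- and is invalid.
sample = [
    json.dumps({"labels": ["B-GROUP", "I-GROUP", "O", "B-TITLE"]}),
    json.dumps({"labels": ["O", "I-TITLE", "B-EPISODE"]}),
]
summary = summarize(sample)
print(summary)  # {'rows': 2, 'bio_violations': 1}
```

To check the real file, pass an open file handle for datasets/AnimeName/dmhy_weak_char.jsonl instead of the sample list, adjusting label_key if the schema differs.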
3. Relabel Full Dataset
Use this when the weak-label rules in tools/dmhy_dataset.py or anifilebert/label_repairs.py have changed.
uv run python -m tools.relabel_dataset_from_filenames `
--input datasets/AnimeName/dmhy_weak.jsonl `
--output datasets/AnimeName/dmhy_weak.relabel.jsonl `
--manifest-output datasets/AnimeName/dmhy_weak.relabel.manifest.json `
--vocab-output datasets/AnimeName/vocab.relabel.json `
--base-vocab datasets/AnimeName/vocab.json `
--max-vocab-size 8000 `
--progress 50000
After checking the manifest and sample labels, replace the authoritative files:
Move-Item datasets/AnimeName/dmhy_weak.relabel.jsonl datasets/AnimeName/dmhy_weak.jsonl -Force
Move-Item datasets/AnimeName/vocab.relabel.json datasets/AnimeName/vocab.json -Force
Move-Item datasets/AnimeName/dmhy_weak.relabel.manifest.json datasets/AnimeName/dmhy_weak.manifest.json -Force
4. Convert to Character Dataset
The published checkpoint uses the character tokenizer.
uv run python -m tools.convert_to_char_dataset `
--input datasets/AnimeName/dmhy_weak.jsonl `
--output datasets/AnimeName/dmhy_weak_char.jsonl `
--vocab-output datasets/AnimeName/vocab.char.json `
--manifest-output datasets/AnimeName/dmhy_weak_char.manifest.json `
--progress 50000
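The internals of tools.convert_to_char_dataset are not shown in this guide. As an illustration only, one common way to move token-level BIO labels to character level is to keep B- on the first character of an entity token and use I- for the rest. The function names, the token/label layout, and the vocab-as-mapping shape below are all assumptions.

```python
def expand_to_chars(tokens: list[str], labels: list[str]) -> tuple[list[str], list[str]]:
    """Expand token-level BIO labels to one label per character."""
    chars: list[str] = []
    char_labels: list[str] = []
    for token, label in zip(tokens, labels):
        for index, char in enumerate(token):
            chars.append(char)
            if label.startswith("B-") and index > 0:
                # Only the first character of a B- token keeps the B- prefix.
                char_labels.append("I-" + label[2:])
            else:
                char_labels.append(label)
    return chars, char_labels


def coverage(chars: list[str], vocab: dict) -> float:
    """Fraction of characters present in the vocabulary."""
    if not chars:
        return 1.0
    return sum(char in vocab for char in chars) / len(chars)


# Hypothetical tokens, labels, and vocabulary for illustration.
tokens = ["[", "VCB", "]", "01"]
labels = ["O", "B-GROUP", "O", "B-EPISODE"]
chars, char_labels = expand_to_chars(tokens, labels)
vocab = {char: index for index, char in enumerate("[]VCB0")}
ratio = coverage(chars, vocab)
print(list(zip(chars, char_labels)), ratio)
```

In this toy vocabulary the character "1" is missing, so coverage is 6/7; the published dataset reports 100% coverage against vocab.char.json.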
5. Full Training with Virtual BIO Shards
Recommended RTX 5070 Ti run:
@'
import random
from pathlib import Path
source = Path("datasets/AnimeName/dmhy_weak_char.jsonl")
target = Path("data/generated/virtual_source_train_seed105.jsonl")
rows = [line for line in source.read_text(encoding="utf-8").splitlines() if line]
random.Random(105).shuffle(rows)
target.parent.mkdir(parents=True, exist_ok=True)
target.write_text("\n".join(rows[: int(len(rows) * 0.98)]) + "\n", encoding="utf-8")
'@ | .\.venv\Scripts\python.exe -
cargo build --release --manifest-path tools/virtual_dataset_generator/Cargo.toml
.\tools\virtual_dataset_generator\target\release\anifilebert-virtual-dataset-generator.exe `
--input data/generated/virtual_source_train_seed105.jsonl `
--vocab-file datasets/AnimeName/vocab.char.json `
--output-dir data/generated/virtual_char_sps32_seed105 `
--max-length 128 `
--samples-per-source 32 `
--seed 105 `
--threads 20 `
--separator-mode per-gap `
--bracket-mode per-part
.\.venv\Scripts\python.exe -m anifilebert.train --tokenizer char `
--data-file datasets/AnimeName/dmhy_weak_char.jsonl `
--vocab-file datasets/AnimeName/vocab.char.json `
--virtual-dataset-dir data/generated/virtual_char_sps32_seed105 `
--save-dir checkpoints/dmhy-char-virtual-sps32-10epoch-lr1e5 `
--init-model-dir . `
--epochs 10 `
--batch-size 1792 `
--learning-rate 0.00001 `
--warmup-steps 2000 `
--max-seq-length 128 `
--train-split 0.98 `
--num-workers 4 `
--prefetch-factor 4 `
--persistent-workers `
--checkpoint-steps 5000 `
--save-total-limit 3 `
--parse-eval-limit 2048 `
--case-eval-file data/parser_regression_cases.json `
--bf16 `
--no-periodic-eval `
--perf-log-steps 1000 `
--perf-sample-interval 0.5 `
--seed 105 `
--experiment-name dmhy-char-virtual-sps32-10epoch-lr1e5
The Rust generator samples BIO entity block subsets/permutations, separator
variants, bracket styles, incomplete filename fragments, and standalone special
fixtures, and writes them as compact pre-encoded .npy shards. The current release
generated 20,439,848 training rows from 619,361 train-split source rows plus 935
special fixtures, then trained for 10 epochs / 114,070 optimizer steps.
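The reported counts can be cross-checked with simple arithmetic. The decomposition below is an inference, not something the guide states: it assumes each source row contributes itself plus 32 virtual samples, that there is no gradient accumulation, and that the final partial batch of each epoch counts as a step. Under those assumptions the numbers line up exactly.

```python
import math

dataset_rows = 632_002       # rows in dmhy_weak_char.jsonl
train_split = 0.98           # fraction kept by the split script
samples_per_source = 32      # --samples-per-source
special_fixtures = 935
batch_size = 1792
epochs = 10

# Same truncation as the split script: int(len(rows) * 0.98).
source_rows = int(dataset_rows * train_split)

# Assumption: original row + 32 virtual samples per source row.
training_rows = source_rows * (samples_per_source + 1) + special_fixtures

# Assumption: no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(training_rows / batch_size)
total_steps = steps_per_epoch * epochs

print(source_rows, training_rows, steps_per_epoch, total_steps)
# 619361 20439848 11407 114070
```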
Training outputs:
- checkpoints/<run>/checkpoint-*: resumable checkpoints
- checkpoints/<run>/final: final Hugging Face checkpoint
- final/run_metadata.json: run configuration
- final/trainer_eval_metrics.json: seqeval token/entity metrics
- final/parse_eval_metrics.json: held-out parser exact-match accuracy
- final/case_metrics.json: fixed real-world case regression
- final/perf_metrics.json: training throughput/GPU telemetry when enabled
- TensorBoard logs unless --no-tensorboard is set
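A quick way to confirm that a run finished cleanly is to check that the expected report files exist in the final directory. This is a small sketch using only the file names listed above; the helper name missing_outputs is hypothetical, and perf_metrics.json is left out of the required set because it is only written when performance logging is enabled.

```python
import tempfile
from pathlib import Path

REQUIRED_FINAL_FILES = [
    "run_metadata.json",
    "trainer_eval_metrics.json",
    "parse_eval_metrics.json",
    "case_metrics.json",
]


def missing_outputs(final_dir: Path) -> list[str]:
    """Return the required report files that are absent from final_dir."""
    return [name for name in REQUIRED_FINAL_FILES if not (final_dir / name).is_file()]


# Demonstration on a scratch directory that contains only one report.
with tempfile.TemporaryDirectory() as scratch:
    final_dir = Path(scratch)
    (final_dir / "run_metadata.json").write_text("{}", encoding="utf-8")
    missing = missing_outputs(final_dir)

print(missing)
```

For a real run, call missing_outputs(Path("checkpoints/<run>/final")) with the actual run name and expect an empty list.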
6. Thin Hard-Case Fine-Tuning
Use hard-case fine-tuning only after a specific real-world failure pattern has
been confirmed, fixed in the weak labels, and added to
data/parser_regression_cases.json.
uv run python -m tools.build_repair_focus_dataset `
--input datasets/AnimeName/dmhy_weak_char.jsonl `
--output data/generated/focus_after_virtual_sps32_char.jsonl `
--context-samples 50000 `
--repeat-focus 3 `
--repeat-manual 96 `
--seed 205
.\.venv\Scripts\python.exe -m anifilebert.train --tokenizer char `
--data-file data/generated/focus_after_virtual_sps32_char.jsonl `
--vocab-file datasets/AnimeName/vocab.char.json `
--save-dir checkpoints/dmhy-char-virtual-sps32-10epoch-lightfocus `
--init-model-dir checkpoints/dmhy-char-virtual-sps32-10epoch-lr1e5/final `
--epochs 1 `
--batch-size 1792 `
--learning-rate 0.000002 `
--warmup-steps 20 `
--max-seq-length 128 `
--train-split 0.95 `
--num-workers 0 `
--checkpoint-steps 300 `
--save-total-limit 2 `
--parse-eval-limit 2048 `
--case-eval-file data/parser_regression_cases.json `
--bf16 `
--no-periodic-eval `
--perf-log-steps 50 `
--perf-sample-interval 0.5 `
--seed 208 `
--experiment-name dmhy-char-virtual-sps32-10epoch-lightfocus
The default quality gate is model-led parsing:
- fixed regression model_only >= 85%
- held-out parse model_only >= 75%
- normalized_only is the default thin runtime metric
- structural filename assists are not part of training or release metrics
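The two model_only thresholds can be expressed as a small gate function. This is a sketch only: the function name is hypothetical, and it takes the two accuracies as fractions in [0, 1] rather than reading them from case_metrics.json or parse_eval_metrics.json, because the JSON layout of those files is not described here.

```python
FIXED_REGRESSION_MIN = 0.85  # fixed regression model_only >= 85%
HELD_OUT_PARSE_MIN = 0.75    # held-out parse model_only >= 75%


def passes_quality_gate(case_model_only: float, parse_model_only: float) -> bool:
    """Return True when both model_only accuracies meet their thresholds."""
    return (
        case_model_only >= FIXED_REGRESSION_MIN
        and parse_model_only >= HELD_OUT_PARSE_MIN
    )


# Hypothetical values for illustration, not measured results.
print(passes_quality_gate(0.90, 0.80))  # True
print(passes_quality_gate(0.90, 0.70))  # False
```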
7. Publish to Repository Root
The repository root is the Hugging Face checkpoint surface.
$final = "checkpoints/dmhy-char-virtual-sps32-10epoch-lightfocus/final"
Copy-Item "$final/config.json" . -Force
Copy-Item "$final/model.safetensors" . -Force
Copy-Item "$final/tokenizer_config.json" . -Force
Copy-Item "$final/training_args.bin" . -Force
Copy-Item "$final/vocab.json" . -Force
New-Item -ItemType Directory -Path reports -Force | Out-Null
Copy-Item "$final/run_metadata.json" reports/run_metadata.json -Force
Copy-Item "$final/trainer_eval_metrics.json" reports/trainer_eval_metrics.json -Force
Copy-Item "$final/parse_eval_metrics.json" reports/parse_eval_metrics.json -Force
Copy-Item "$final/case_metrics.json" reports/case_metrics.json -Force
Copy-Item "$final/perf_metrics.json" reports/perf_metrics.json -Force
Copy-Item datasets/AnimeName/vocab.char.json .\vocab.char.json -Force
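The commands above are PowerShell. For other shells, the same copy step could be sketched in Python with shutil. The helper name publish is hypothetical, the file lists mirror the Copy-Item commands, and the separate vocab.char.json copy from datasets/AnimeName is omitted for brevity. Unlike Copy-Item, shutil.copy2 raises FileNotFoundError as soon as a source file is missing.

```python
import shutil
import tempfile
from pathlib import Path

ROOT_FILES = [
    "config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "training_args.bin",
    "vocab.json",
]
REPORT_FILES = [
    "run_metadata.json",
    "trainer_eval_metrics.json",
    "parse_eval_metrics.json",
    "case_metrics.json",
    "perf_metrics.json",
]


def publish(final_dir: Path, repo_root: Path) -> list[Path]:
    """Copy checkpoint files to repo_root and reports to repo_root/reports."""
    reports_dir = repo_root / "reports"
    reports_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in ROOT_FILES:
        copied.append(Path(shutil.copy2(final_dir / name, repo_root / name)))
    for name in REPORT_FILES:
        copied.append(Path(shutil.copy2(final_dir / name, reports_dir / name)))
    return copied


# Demonstration on scratch directories with placeholder files.
with tempfile.TemporaryDirectory() as scratch:
    final_dir = Path(scratch) / "final"
    repo_root = Path(scratch) / "repo"
    final_dir.mkdir()
    repo_root.mkdir()
    for name in ROOT_FILES + REPORT_FILES:
        (final_dir / name).write_text("{}", encoding="utf-8")
    copied = publish(final_dir, repo_root)
    published = sorted(path.relative_to(repo_root).as_posix() for path in copied)

print(published)
```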
Then export ONNX:
uv run python -m tools.export_onnx --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
8. Validation Checklist
Run these before committing:
uv run python -m compileall -q anifilebert tools
uv run python -m tools.evaluate_parser_cases --model-dir . --case-file data/parser_regression_cases.json --output reports/case_metrics.json
uv run python -m anifilebert.inference --model-dir . "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
uv run python -m tools.onnx_inference "[YYDM&VCB-Studio] Shinsekai Yori [NCED02][Ma10p_1080p][x265_flac].mkv"
uv run python -m tools.benchmark_inference --model-dir . --onnx exports/anime_filename_parser.onnx --case-file data/parser_regression_cases.json --repeat 20 --warmup 20 --torch-threads 1 --ort-threads 1 --output reports/benchmark_results.json
9. Git and LFS Order
If the dataset submodule changed:
git -C datasets/AnimeName add dmhy_weak.jsonl dmhy_weak.manifest.json dmhy_weak_char.jsonl dmhy_weak_char.manifest.json vocab.json vocab.char.json
git -C datasets/AnimeName commit -m "Update anime filename labels"
git -C datasets/AnimeName lfs push origin main --all
git -C datasets/AnimeName push origin main
Then commit the model repo:
git add README.md docs/maintenance.md docs/android.md docs/training.md docs/onnx.md `
config.json model.safetensors tokenizer_config.json training_args.bin vocab.json vocab.char.json `
exports/anime_filename_parser.onnx exports/anime_filename_parser.metadata.json `
reports anifilebert tools data/parser_regression_cases.json datasets/AnimeName
git commit -m "Update AniFileBERT model and documentation"
git lfs push origin main --all
git push origin main