AniFileBERT Maintenance / 维护手册

This repository is the standalone Hugging Face model repo used by MiruPlay as tools/anime_parser.

本仓库是 MiruPlay 通过 tools/anime_parser 引用的独立 Hugging Face 模型仓库。

Related Repositories / 相关仓库

| Repository / 仓库 | URL | Purpose / 用途 |
| --- | --- | --- |
| AniFileBERT | https://huggingface.co/ModerRAS/AniFileBERT | Model, scripts, ONNX export / 模型、脚本、ONNX 导出 |
| AnimeName | https://huggingface.co/datasets/ModerRAS/AnimeName | Dataset snapshot / 数据集快照 |
| MiruPlay | https://github.com/ModerRAS/MiruPlay | Android integration / Android 集成 |

Nested structure / 嵌套结构:

AniFileBERT
  datasets/AnimeName -> ModerRAS/AnimeName

Clone / 克隆

git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT

After a normal (non-recursive) clone / 普通 clone 后:

git submodule update --init --recursive
uv sync

Publishing Surface / 发布面

The repository root is the only published Hugging Face checkpoint location:

仓库根目录是唯一的 Hugging Face checkpoint 发布位置:

config.json
model.safetensors
tokenizer_config.json
training_args.bin
vocab.json
vocab.char.json
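Before pushing, it can help to confirm that every file in the list above is actually present at the repository root. The following is a minimal sketch, not part of the repository's tooling: the helper name `missing_checkpoint_files` is hypothetical, and the file names are simply copied from the list above.

```python
from pathlib import Path

# File names copied from the published-checkpoint list above.
ROOT_CHECKPOINT_FILES = [
    "config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "training_args.bin",
    "vocab.json",
    "vocab.char.json",
]


def missing_checkpoint_files(repo_root: str = ".") -> list[str]:
    """Return the expected root-level checkpoint files that are absent.

    Hypothetical helper: it only checks that each file exists, not that
    its contents are valid.
    """
    root = Path(repo_root)
    return [name for name in ROOT_CHECKPOINT_FILES if not (root / name).is_file()]
```

Calling `missing_checkpoint_files(".")` from the repository root returns an empty list when the publishing surface is complete.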

Release reports are kept under reports/:

发布报告保存在 reports/

reports/run_metadata.json
reports/trainer_eval_metrics.json
reports/parse_eval_metrics.json
reports/case_metrics.json
reports/perf_metrics.json
reports/benchmark_results.json
reports/training_lineage.json

The repository no longer tracks a duplicate model/ directory. Any checkpoints/ directories are git-ignored and hold local training artifacts only.

仓库不再跟踪旧的 model/ 副本。被 ignore 的 checkpoints/ 仅是本地训练产物。

Standard Training / 标准训练

For full details, see training.md.

完整流程见 training.md

Current release training uses the virtual-shard flow in training.md:

当前发布训练使用 training.md 中的 virtual-shard 流程:

uv run python -m compileall -q anifilebert tools
cargo build --release --manifest-path tools/virtual_dataset_generator/Cargo.toml
# Then follow docs/training.md section "Full Training with Virtual BIO Shards".

Publish a New Checkpoint / 发布新 checkpoint

Copy final files to the repository root:

final 文件复制到仓库根目录:

# PowerShell, run from the repository root.
$final = "checkpoints/dmhy-char-virtual-sps32-10epoch-lightfocus/final"

# Checkpoint files go to the repository root.
Copy-Item "$final/config.json" . -Force
Copy-Item "$final/model.safetensors" . -Force
Copy-Item "$final/tokenizer_config.json" . -Force
Copy-Item "$final/training_args.bin" . -Force
Copy-Item "$final/vocab.json" . -Force

# Release reports go under reports/ (created if missing).
New-Item -ItemType Directory -Path reports -Force | Out-Null
Copy-Item "$final/run_metadata.json" reports/run_metadata.json -Force
Copy-Item "$final/trainer_eval_metrics.json" reports/trainer_eval_metrics.json -Force
Copy-Item "$final/parse_eval_metrics.json" reports/parse_eval_metrics.json -Force
Copy-Item "$final/case_metrics.json" reports/case_metrics.json -Force
Copy-Item "$final/perf_metrics.json" reports/perf_metrics.json -Force

# The character vocabulary comes from the dataset submodule, not from the checkpoint.
Copy-Item datasets/AnimeName/vocab.char.json ./vocab.char.json -Force
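The commands above are PowerShell. On other shells, the same copy step could be sketched in Python roughly as follows. This is an illustrative equivalent, not a script shipped in the repository: the function name `publish_checkpoint` and its parameters are hypothetical, while the file names mirror the commands above.

```python
import shutil
from pathlib import Path

# Same files as the Copy-Item commands above.
CHECKPOINT_FILES = [
    "config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "training_args.bin",
    "vocab.json",
]
REPORT_FILES = [
    "run_metadata.json",
    "trainer_eval_metrics.json",
    "parse_eval_metrics.json",
    "case_metrics.json",
    "perf_metrics.json",
]


def publish_checkpoint(final_dir, repo_root=".", dataset_dir="datasets/AnimeName"):
    """Copy a final checkpoint into the repository root and reports/.

    Hypothetical helper. `dataset_dir` is resolved relative to `repo_root`
    unless it is absolute; `final_dir` is used as given.
    """
    final = Path(final_dir)
    root = Path(repo_root)
    reports = root / "reports"
    reports.mkdir(parents=True, exist_ok=True)

    copied = []
    # Checkpoint files go to the repository root.
    for name in CHECKPOINT_FILES:
        copied.append(Path(shutil.copy2(final / name, root / name)))
    # Release reports go under reports/.
    for name in REPORT_FILES:
        copied.append(Path(shutil.copy2(final / name, reports / name)))
    # The character vocabulary comes from the dataset submodule.
    source_vocab = root / dataset_dir / "vocab.char.json"
    copied.append(Path(shutil.copy2(source_vocab, root / "vocab.char.json")))
    return copied
```

From the repository root, `publish_checkpoint("checkpoints/dmhy-char-virtual-sps32-10epoch-lightfocus/final")` would perform the same copies as the PowerShell block. Existing files are overwritten, matching `-Force`.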

Export ONNX / 导出 ONNX:

uv run python -m tools.export_onnx --model-dir . --output exports/anime_filename_parser.onnx --max-length 128

Validate / 验证:

uv run python -m tools.evaluate_parser_cases --model-dir . --case-file data/parser_regression_cases.json --output reports/case_metrics.json
uv run python -m tools.onnx_inference "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
uv run python -m tools.benchmark_inference --model-dir . --onnx exports/anime_filename_parser.onnx --case-file data/parser_regression_cases.json --repeat 20 --warmup 20 --torch-threads 1 --ort-threads 1 --output reports/benchmark_results.json

The default parser path is a thin runtime: model logits, constrained BIO decoding, entity aggregation, and light string/number normalization. Do not add structural filename regex assists back to the default runtime; parser quality should come from labels and model training.

默认解析路径是薄层运行时:模型 logits、约束 BIO、实体聚合和轻量字符串/数字规范化。 不要把结构化文件名正则辅助重新加回默认运行时;解析质量应来自标签和模型训练。
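To make the "constrained BIO" and "entity aggregation" steps concrete, here is a minimal pure-Python sketch. It is an assumption-laden illustration, not the repository's implementation. It assumes greedy per-token decoding, a label set containing `O`, and character-level tokens that can be joined without separators. The label names `TITLE` and `EPISODE` are hypothetical.

```python
def constrained_bio_decode(scores, labels):
    """Greedy decode in which I-X may only follow B-X or I-X.

    `scores` is a list of per-token score rows aligned with `labels`.
    Assumes "O" is in `labels`, so a legal choice always exists.
    """
    decoded = []
    prev = "O"
    for row in scores:
        # Try labels from highest to lowest score; take the first legal one.
        ranked = sorted(range(len(labels)), key=lambda i: row[i], reverse=True)
        choice = "O"
        for idx in ranked:
            label = labels[idx]
            if label.startswith("I-"):
                kind = label[2:]
                if prev not in (f"B-{kind}", f"I-{kind}"):
                    continue  # illegal transition, try the next-best label
            choice = label
            break
        decoded.append(choice)
        prev = choice
    return decoded


def aggregate_entities(tokens, tags):
    """Merge B-/I- runs into (kind, text) pairs, joining tokens directly."""
    entities = []
    kind, parts = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if kind is not None:
                entities.append((kind, "".join(parts)))
            kind, parts = tag[2:], [token]
        elif tag.startswith("I-") and kind == tag[2:]:
            parts.append(token)
        else:
            if kind is not None:
                entities.append((kind, "".join(parts)))
            kind, parts = None, []
    if kind is not None:
        entities.append((kind, "".join(parts)))
    return entities
```

In this sketch the constraint lives entirely in the decoder: an `I-` tag that does not continue the current entity is skipped in favor of the next-best label, so no filename-specific regex is needed to repair the tag sequence afterwards.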

Dataset Submodule / 数据集子模块

If datasets/AnimeName changed, commit and push it first:

如果 datasets/AnimeName 有变动,先提交并推送它:

git -C datasets/AnimeName add dmhy_weak.jsonl dmhy_weak.manifest.json dmhy_weak_char.jsonl dmhy_weak_char.manifest.json vocab.json vocab.char.json
git -C datasets/AnimeName commit -m "Update anime filename labels"
git -C datasets/AnimeName lfs push origin main --all
git -C datasets/AnimeName push origin main

Then commit the submodule pointer in this repo:

然后在本仓库提交 submodule pointer:

git add datasets/AnimeName
git commit -m "Update AnimeName dataset pointer"

LFS Push Order / LFS 推送顺序

Large model artifacts are tracked with Git LFS. If Hugging Face rejects a push because an LFS pointer points to a missing object, upload LFS objects first:

大模型文件通过 Git LFS 跟踪。如果 Hugging Face 因 LFS pointer 缺对象拒绝 push, 先上传 LFS 对象:

git lfs push origin main --all
git push origin main

For dataset changes:

数据集变动:

git -C datasets/AnimeName lfs push origin main --all
git -C datasets/AnimeName push origin main
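For context on the "LFS pointer" mentioned in this section: Git stores a small text file in place of each LFS-tracked artifact, and the real object is uploaded separately. The sketch below parses that pointer text, which can help when checking whether a working-tree file is still a pointer or the real content. The helper name `parse_lfs_pointer` is hypothetical, and only the `version`, `oid`, and `size` keys of the pointer format are handled.

```python
LFS_SPEC_LINE = "version https://git-lfs.github.com/spec/v1"


def parse_lfs_pointer(text):
    """Return {"oid": ..., "size": ...} for Git LFS pointer text, else None.

    Hypothetical helper; it only recognizes sha256 object IDs.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != LFS_SPEC_LINE:
        return None
    # Remaining lines are "key value" pairs.
    fields = dict(line.split(" ", 1) for line in lines[1:] if " " in line)
    oid = fields.get("oid", "")
    size = fields.get("size", "").strip()
    if not oid.startswith("sha256:") or not size.isdigit():
        return None
    return {"oid": oid[len("sha256:"):].strip(), "size": int(size)}
```

A push is rejected in the situation described above when the remote receives such a pointer but has no stored object matching its `oid`, which is why the LFS objects are pushed first.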

Update MiruPlay / 更新 MiruPlay

From MiruPlay root:

在 MiruPlay 根目录:

git submodule update --remote --recursive tools/anime_parser
git add tools/anime_parser
git commit -m "Update AniFileBERT submodule"

If Android assets changed, also stage:

如果 Android assets 变化,也要提交:

scraper/src/main/assets/anime_parser/anime_filename_parser.onnx
scraper/src/main/assets/anime_parser/config.json
scraper/src/main/assets/anime_parser/vocab.json