# AniFileBERT Maintenance

This repository is the standalone Hugging Face model repo that MiruPlay consumes as the `tools/anime_parser` submodule.

## Related Repositories

| Repository | URL | Purpose |
| --- | --- | --- |
| AniFileBERT | `https://huggingface.co/ModerRAS/AniFileBERT` | Model, scripts, ONNX export |
| AnimeName | `https://huggingface.co/datasets/ModerRAS/AnimeName` | Dataset snapshot |
| MiruPlay | `https://github.com/ModerRAS/MiruPlay` | Android integration |

Nested structure:

```text
AniFileBERT
  datasets/AnimeName -> ModerRAS/AnimeName
```

## Clone

```powershell
git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
```

If you cloned without `--recursive`, initialize the submodule and sync the environment:

```powershell
git submodule update --init --recursive
uv sync
```

## Publishing Surface

The repository root is the only published Hugging Face checkpoint location:

```text
config.json
model.safetensors
tokenizer_config.json
training_args.bin
vocab.json
vocab.char.json
run_metadata.json
trainer_eval_metrics.json
parse_eval_metrics.json
case_metrics.json
```

The repository no longer tracks the old `model/` duplicate. Ignored `checkpoints/` directories are local training artifacts only.

## Standard Training

For full details, see [`docs/training.md`](docs/training.md).

Recommended full training command:

```powershell
uv run python train.py --tokenizer char `
  --data-file datasets/AnimeName/dmhy_weak_char.jsonl `
  --vocab-file datasets/AnimeName/vocab.char.json `
  --save-dir checkpoints/dmhy-char-full `
  --init-model-dir . `
  --epochs 2 `
  --batch-size 256 `
  --learning-rate 0.00008 `
  --warmup-steps 300 `
  --max-seq-length 128 `
  --train-split 0.98 `
  --num-workers 4 `
  --checkpoint-steps 1000 `
  --save-total-limit 3 `
  --parse-eval-limit 2048 `
  --case-eval-file data/parser_regression_cases.json `
  --seed 52 `
  --experiment-name dmhy-char-full
```

## Publish a New Checkpoint

Copy the `final` files to the repository root:

```powershell
$final = "checkpoints/dmhy-char-full/final"
Copy-Item "$final/config.json" . -Force
Copy-Item "$final/model.safetensors" . -Force
Copy-Item "$final/tokenizer_config.json" . -Force
Copy-Item "$final/training_args.bin" . -Force
Copy-Item "$final/vocab.json" . -Force
Copy-Item "$final/run_metadata.json" . -Force
Copy-Item "$final/trainer_eval_metrics.json" . -Force
Copy-Item "$final/parse_eval_metrics.json" . -Force
Copy-Item "$final/case_metrics.json" . -Force
Copy-Item datasets/AnimeName/vocab.char.json .\vocab.char.json -Force
```

Export ONNX:

```powershell
uv run python export_onnx.py --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
```

Validate:

```powershell
uv run python evaluate_parser_cases.py --model-dir . --case-file data/parser_regression_cases.json --output case_metrics.json
uv run python onnx_inference.py "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
```

## Dataset Submodule

If `datasets/AnimeName` changed, commit and push it first:

```powershell
git -C datasets/AnimeName add dmhy_weak.jsonl dmhy_weak.manifest.json dmhy_weak_char.jsonl dmhy_weak_char.manifest.json vocab.json vocab.char.json
git -C datasets/AnimeName commit -m "Update anime filename labels"
git -C datasets/AnimeName lfs push origin main --all
git -C datasets/AnimeName push origin main
```

Then commit the submodule pointer in this repo:

```powershell
git add datasets/AnimeName
git commit -m "Update AnimeName dataset pointer"
```

## LFS Push Order

Large model artifacts are tracked with Git LFS. If Hugging Face rejects a push because an LFS pointer refers to a missing object, upload the LFS objects first and then push the commits:

```powershell
git lfs push origin main --all
git push origin main
```

For dataset changes, apply the same order inside the submodule:

```powershell
git -C datasets/AnimeName lfs push origin main --all
git -C datasets/AnimeName push origin main
```

## Update MiruPlay

From the MiruPlay root:

```powershell
git submodule update --remote --recursive tools/anime_parser
git add tools/anime_parser
git commit -m "Update AniFileBERT submodule"
```

If the Android assets changed, also stage:

```text
scraper/src/main/assets/anime_parser/anime_filename_parser.onnx
scraper/src/main/assets/anime_parser/config.json
scraper/src/main/assets/anime_parser/vocab.json
```
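
To decide whether the Android assets listed above need restaging, one option is to compare them against the files in the submodule. The sketch below is only an illustration: it assumes each asset is a byte-for-byte copy of a file under `tools/anime_parser` (with the ONNX file taken from `exports/`). This manual does not state that mapping, so `ASSET_SOURCES` and the `stale_assets` helper are hypothetical and should be adjusted to match how the assets are actually produced.

```python
import hashlib
from pathlib import Path
from typing import List, Optional

# Paths taken from this manual.
ASSET_DIR = "scraper/src/main/assets/anime_parser"
SUBMODULE_DIR = "tools/anime_parser"

# ASSUMPTION: asset name -> source path inside the submodule.
# The manual does not document this mapping; edit it as needed.
ASSET_SOURCES = {
    "anime_filename_parser.onnx": "exports/anime_filename_parser.onnx",
    "config.json": "config.json",
    "vocab.json": "vocab.json",
}


def sha256_of(path: Path) -> Optional[str]:
    """Return the SHA-256 hex digest of a file, or None if it is missing."""
    if not path.is_file():
        return None
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in 1 MiB chunks so large model files are not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stale_assets(miruplay_root: Path) -> List[str]:
    """List asset names that are missing or differ from their assumed source."""
    stale = []
    for name, source in ASSET_SOURCES.items():
        asset_hash = sha256_of(miruplay_root / ASSET_DIR / name)
        source_hash = sha256_of(miruplay_root / SUBMODULE_DIR / source)
        if asset_hash is None or asset_hash != source_hash:
            stale.append(name)
    return stale


if __name__ == "__main__":
    # Run from the MiruPlay root.
    for name in stale_assets(Path(".")):
        print(f"stale or missing: {ASSET_DIR}/{name}")
```

Run from the MiruPlay root, the script prints one line per asset that is missing or no longer matches its assumed source. No output suggests the three assets are current under that assumption.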