# Training Guide

This document describes the reproducible training workflow for AniFileBERT.

## 1. Environment

Use `uv` for all dependency management and command execution.

```powershell
uv sync
uv run python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```

Recommended GPU configuration:

- RTX 3080 class GPU or better; current release training used an RTX 5070 Ti
- batch size `1792` with the virtual-sharded dataset path on the 5070 Ti
- `bf16`/TF32 on Ada/Blackwell-class CUDA devices when available
- `--num-workers 4 --persistent-workers` with the virtual-sharded dataset path

## 2. Dataset

The authoritative dataset lives in the nested submodule:

```text
datasets/AnimeName/dmhy_weak.jsonl
datasets/AnimeName/dmhy_weak_char.jsonl
datasets/AnimeName/vocab.json
datasets/AnimeName/vocab.char.json
```

Current expected properties:

- rows: `632002`
- strict BIO violations: `0`
- character vocab: `6199`
- character coverage: `100%`

## 3. Relabel Full Dataset

Use this workflow when the weak-label rules in `tools/dmhy_dataset.py` or `anifilebert/label_repairs.py` have changed.
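Before relabeling, it can help to record how the current file measures against the expected properties listed in section 2, so the relabeled output has a baseline to compare with. The sketch below is a minimal, hypothetical check rather than project tooling. It assumes each JSONL row is an object with a `labels` list of BIO tag strings (the field name is an assumption; adjust it to the actual schema), and it counts an `I-X` tag as a violation when it does not directly follow `B-X` or `I-X`, which is one common reading of "strict BIO".

```python
import json
from pathlib import Path


def count_bio_violations(labels):
    """Count I-X tags that do not continue a B-X or I-X of the same type."""
    violations = 0
    previous = "O"
    for tag in labels:
        if tag.startswith("I-"):
            entity = tag[2:]
            if previous not in (f"B-{entity}", f"I-{entity}"):
                violations += 1
        previous = tag
    return violations


def summarize(path, label_field="labels"):
    """Return (row_count, total_violations) for a JSONL file.

    `label_field` is an assumed field name, not confirmed by this guide.
    """
    rows = 0
    violations = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            rows += 1
            violations += count_bio_violations(json.loads(line)[label_field])
    return rows, violations


# Only runs when the dataset is actually present in the working tree.
dataset = Path("datasets/AnimeName/dmhy_weak.jsonl")
if dataset.exists():
    rows, violations = summarize(dataset)
    print(f"rows={rows} strict_bio_violations={violations}")
```

If the assumed schema and violation definition match the project's, the printed values should agree with the row count and violation count given in section 2. The same function can be pointed at the relabeled output before it replaces the authoritative file.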
```powershell
uv run python -m tools.relabel_dataset_from_filenames `
  --input datasets/AnimeName/dmhy_weak.jsonl `
  --output datasets/AnimeName/dmhy_weak.relabel.jsonl `
  --manifest-output datasets/AnimeName/dmhy_weak.relabel.manifest.json `
  --vocab-output datasets/AnimeName/vocab.relabel.json `
  --base-vocab datasets/AnimeName/vocab.json `
  --max-vocab-size 8000 `
  --progress 50000
```

After checking the manifest and sample labels, replace the authoritative files:

```powershell
Move-Item datasets/AnimeName/dmhy_weak.relabel.jsonl datasets/AnimeName/dmhy_weak.jsonl -Force
Move-Item datasets/AnimeName/vocab.relabel.json datasets/AnimeName/vocab.json -Force
Move-Item datasets/AnimeName/dmhy_weak.relabel.manifest.json datasets/AnimeName/dmhy_weak.manifest.json -Force
```

## 4. Convert to Character Dataset

The published checkpoint uses the character tokenizer.

```powershell
uv run python -m tools.convert_to_char_dataset `
  --input datasets/AnimeName/dmhy_weak.jsonl `
  --output datasets/AnimeName/dmhy_weak_char.jsonl `
  --vocab-output datasets/AnimeName/vocab.char.json `
  --manifest-output datasets/AnimeName/dmhy_weak_char.manifest.json `
  --progress 50000
```

## 5. Full Training with Virtual BIO Shards

Recommended RTX 5070 Ti run. The path-context switches below are intended for the next path-aware retrain; the currently published checkpoint lineage predates this augmentation.
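When sizing a run like this, the row and step counts follow from the split fraction, the number of generated rows, and the batch size. The sketch below re-derives the figures this guide reports for the current release (stated later in this section) from the dataset size in section 2. It assumes one optimizer step per batch, no gradient accumulation, and a final partial batch that is kept in each epoch; none of these is stated in this guide.

```python
import math

SOURCE_ROWS = 632_002                 # dataset rows from section 2
TRAIN_FRACTION = 0.98                 # same fraction as the shuffle-and-slice step
SPECIAL_FIXTURES = 935                # reported for the current release
REPORTED_TRAINING_ROWS = 20_439_848   # reported for the current release
BATCH_SIZE = 1792
EPOCHS = 10

# The split script truncates with int(), so the same rounding is used here.
train_sources = int(SOURCE_ROWS * TRAIN_FRACTION)

# How many generated rows per source do the reported totals imply?
rows_per_source, leftover = divmod(
    REPORTED_TRAINING_ROWS - SPECIAL_FIXTURES, train_sources
)

# Assumption: the last partial batch of each epoch still counts as a step.
steps_per_epoch = math.ceil(REPORTED_TRAINING_ROWS / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

print(f"train_sources={train_sources}")      # 619361
print(f"rows_per_source={rows_per_source}")  # 33
print(f"leftover={leftover}")                # 0
print(f"total_steps={total_steps}")          # 114070
```

The reported totals work out to exactly 33 generated rows per train-split source. That is consistent with `--samples-per-source 32` plus one further row per source, but how the extra row arises is not stated here, so treat that reading as an inference.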
The commands below create the train-split source file, build the generator, extend the vocabulary, generate the shards, and start training:

```powershell
# 1. Shuffle with seed 105 and keep the first 98% of rows as the train-split source
@'
import random
from pathlib import Path

source = Path("datasets/AnimeName/dmhy_weak_char.jsonl")
target = Path("data/generated/virtual_source_train_seed105.jsonl")
rows = [line for line in source.read_text(encoding="utf-8").splitlines() if line]
random.Random(105).shuffle(rows)
target.parent.mkdir(parents=True, exist_ok=True)
target.write_text("\n".join(rows[: int(len(rows) * 0.98)]) + "\n", encoding="utf-8")
'@ | .\.venv\Scripts\python.exe -

# 2. Build the Rust virtual dataset generator
cargo build --release --manifest-path tools/virtual_dataset_generator/Cargo.toml

# 3. Extend the character vocab so path separators are real tokens
uv run python -m tools.extend_char_vocab `
  --input datasets/AnimeName/vocab.char.json `
  --output data/generated/vocab.char.path.json

# 4. Generate the pre-encoded virtual shards
.\tools\virtual_dataset_generator\target\release\anifilebert-virtual-dataset-generator.exe `
  --input data/generated/virtual_source_train_seed105.jsonl `
  --vocab-file data/generated/vocab.char.path.json `
  --output-dir data/generated/virtual_char_sps32_path4_seed105 `
  --max-length 128 `
  --samples-per-source 32 `
  --path-samples-per-source 4 `
  --seed 105 `
  --threads 20 `
  --separator-mode per-gap `
  --bracket-mode per-part

# 5. Train, initializing from the checkpoint in the current directory
.\.venv\Scripts\python.exe -m anifilebert.train --tokenizer char `
  --data-file datasets/AnimeName/dmhy_weak_char.jsonl `
  --vocab-file data/generated/vocab.char.path.json `
  --virtual-dataset-dir data/generated/virtual_char_sps32_path4_seed105 `
  --save-dir checkpoints/dmhy-char-virtual-sps32-path4-10epoch-lr1e5 `
  --init-model-dir . `
  --epochs 10 `
  --batch-size 1792 `
  --learning-rate 0.00001 `
  --warmup-steps 2000 `
  --max-seq-length 128 `
  --train-split 0.98 `
  --num-workers 4 `
  --prefetch-factor 4 `
  --persistent-workers `
  --checkpoint-steps 5000 `
  --save-total-limit 3 `
  --parse-eval-limit 2048 `
  --case-eval-file data/parser_regression_cases.json `
  --bf16 `
  --no-periodic-eval `
  --perf-log-steps 1000 `
  --perf-sample-interval 0.5 `
  --seed 105 `
  --experiment-name dmhy-char-virtual-sps32-path4-10epoch-lr1e5
```

The Rust generator samples BIO entity block subsets/permutations, separator variants, bracket styles, incomplete filename fragments, and standalone special fixtures into compact pre-encoded `.npy` shards.

When `--path-samples-per-source` is enabled, it also creates synthetic full-path samples such as `O:\115open\影音\动漫\TITLE\Season 01\03 [1080P][WEB-DL].mkv`, with all prefix directories labeled `O` and only the terminal title/season/episode/group/meta segments carrying BIO labels. Some variants produce a complete release filename such as `[Group] Title - 01 ...`; in those, the repeated basename title is labeled `O` so that entity aggregation does not learn to emit the title twice.

Run `tools.extend_char_vocab` before path training so that `/` and `\` are real character tokens instead of `[UNK]`.

The current release generated `20,439,848` training rows from `619,361` train-split source rows plus `935` special fixtures, then trained for 10 epochs / `114,070` optimizer steps.

If you need standalone JSONL focus data instead of pre-encoded virtual shards, run `uv run python -m tools.build_path_prefix_dataset --input datasets/AnimeName/dmhy_weak_char.jsonl --output data/generated/path_prefix_seed105_char.jsonl`.
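The labeling convention for path samples (noise directories tagged `O`, original labels kept on the filename characters) can be sketched in a few lines. This is an illustration only, not the code of the Rust generator or of `tools.build_path_prefix_dataset`: the function name, the sample, and the label names are hypothetical, and it covers only the noise-prefix part, not the labeled title or season directories.

```python
def add_path_prefix(chars, labels, prefix_dirs, separator="\\"):
    """Prepend noise directories to a character-level sample.

    Every character of the prefix, separators included, is labeled "O";
    the original characters keep their original labels.
    """
    if len(chars) != len(labels):
        raise ValueError("chars and labels must have the same length")
    prefix = "".join(directory + separator for directory in prefix_dirs)
    return list(prefix) + list(chars), ["O"] * len(prefix) + list(labels)


# Hypothetical sample: the label names are placeholders, not the project's tag set.
chars = list("03.mkv")
labels = ["B-EPISODE", "I-EPISODE", "O", "O", "O", "O"]

path_chars, path_labels = add_path_prefix(chars, labels, ["O:", "downloads"])
print("".join(path_chars))                  # O:\downloads\03.mkv
print(path_labels[-len(labels):] == labels)  # True
```

Because the prefix introduces `/` or `\` characters, the vocabulary needs those tokens before path training; otherwise every separator would be encoded as `[UNK]`.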
`tools.build_path_prefix_dataset` expands real DMHY rows into `noise/noise/title/Season N/01 ...` path samples with the prefix components labeled `O`. Add the following options when building a short group-boundary focus set:

```powershell
--require-group --max-group-length 4 --group-prefix-prob 1 --basename-title-prob 1
```

Use `data/path_prefix_regression_cases.json` as the focused path-noise regression gate when comparing baseline and path-aware checkpoints.

Optional structure/regularization experiments:

- `--model-head crf` trains a BERT emission classifier with a linear-chain CRF loss. The CRF loss intentionally does not hard-mask BIO transitions, because weak labels may contain imperfect BIO spans; BIO constraints are still applied during CRF Viterbi decoding.
- `--rdrop-alpha ` enables two-pass R-Drop symmetric KL regularization.

Treat CRF/R-Drop runs as experiments until they have passed both `data/parser_regression_cases.json` and `data/path_prefix_regression_cases.json`.

Training outputs:

- `checkpoints//checkpoint-*`: resumable checkpoints
- `checkpoints//final`: final Hugging Face checkpoint
- `final/run_metadata.json`: run configuration
- `final/trainer_eval_metrics.json`: seqeval token/entity metrics
- `final/parse_eval_metrics.json`: held-out parser exact-match accuracy
- `final/case_metrics.json`: fixed real-world case regression
- `final/perf_metrics.json`: training throughput/GPU telemetry when enabled
- TensorBoard logs unless `--no-tensorboard` is set

## 6. Thin Hard-Case Fine-Tuning

Use hard-case fine-tuning only after a specific real-world failure pattern has been confirmed, fixed in the weak labels, and added to `data/parser_regression_cases.json`.
```powershell
uv run python -m tools.build_repair_focus_dataset `
  --input datasets/AnimeName/dmhy_weak_char.jsonl `
  --output data/generated/focus_after_virtual_sps32_char.jsonl `
  --context-samples 50000 `
  --repeat-focus 3 `
  --repeat-manual 96 `
  --seed 205

.\.venv\Scripts\python.exe -m anifilebert.train --tokenizer char `
  --data-file data/generated/focus_after_virtual_sps32_char.jsonl `
  --vocab-file datasets/AnimeName/vocab.char.json `
  --save-dir checkpoints/dmhy-char-virtual-sps32-10epoch-lightfocus `
  --init-model-dir checkpoints/dmhy-char-virtual-sps32-10epoch-lr1e5/final `
  --epochs 1 `
  --batch-size 1792 `
  --learning-rate 0.000002 `
  --warmup-steps 20 `
  --max-seq-length 128 `
  --train-split 0.95 `
  --num-workers 0 `
  --checkpoint-steps 300 `
  --save-total-limit 2 `
  --parse-eval-limit 2048 `
  --case-eval-file data/parser_regression_cases.json `
  --bf16 `
  --no-periodic-eval `
  --perf-log-steps 50 `
  --perf-sample-interval 0.5 `
  --seed 208 `
  --experiment-name dmhy-char-virtual-sps32-10epoch-lightfocus
```

The default quality gate is model-led parsing:

- fixed regression `model_only >= 85%`
- held-out parse `model_only >= 75%`
- `normalized_only` is the default thin runtime metric
- structural filename assists are not part of training or release metrics

## 7. Publish to Repository Root

The repository root is the Hugging Face checkpoint surface.

```powershell
$final = "checkpoints/dmhy-char-virtual-sps32-10epoch-lightfocus/final"
$releaseVocab = "datasets/AnimeName/vocab.char.json"
# For a path-aware run trained with data/generated/vocab.char.path.json:
# $releaseVocab = "data/generated/vocab.char.path.json"

Copy-Item "$final/config.json" . -Force
Copy-Item "$final/model.safetensors" . -Force
Copy-Item "$final/tokenizer_config.json" . -Force
Copy-Item "$final/training_args.bin" . -Force
Copy-Item "$final/vocab.json" . -Force

New-Item -ItemType Directory -Path reports -Force | Out-Null
Copy-Item "$final/run_metadata.json" reports/run_metadata.json -Force
Copy-Item "$final/trainer_eval_metrics.json" reports/trainer_eval_metrics.json -Force
Copy-Item "$final/parse_eval_metrics.json" reports/parse_eval_metrics.json -Force
Copy-Item "$final/case_metrics.json" reports/case_metrics.json -Force
Copy-Item "$final/perf_metrics.json" reports/perf_metrics.json -Force
Copy-Item $releaseVocab .\vocab.char.json -Force
```

Then export ONNX:

```powershell
uv run python -m tools.export_onnx --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
```

## 8. Validation Checklist

Run these before committing:

```powershell
uv run python -m compileall -q anifilebert tools
uv run python -m tools.evaluate_parser_cases --model-dir . --case-file data/parser_regression_cases.json --output reports/case_metrics.json
uv run python -m anifilebert.inference --model-dir . "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
uv run python -m tools.onnx_inference "[YYDM&VCB-Studio] Shinsekai Yori [NCED02][Ma10p_1080p][x265_flac].mkv"
uv run python -m tools.benchmark_inference --model-dir . --onnx exports/anime_filename_parser.onnx --case-file data/parser_regression_cases.json --repeat 20 --warmup 20 --torch-threads 1 --ort-threads 1 --output reports/benchmark_results.json
```

## 9. Recover Interrupted Work

When Codex/session history is lost, use the recovery helper to reconstruct what was running, whether it finished, and where to resume.

Local latest run:

```powershell
uv run python -m tools.recover_training_context
```

Specific local run:

```powershell
uv run python -m tools.recover_training_context --run dmhy-char-virtual-sps32-path4-10epoch-lr1e5
```

Local + remote worker (RTX 5070 Ti):

```powershell
uv run python -m tools.recover_training_context `
  --remote-host adqew@192.168.63.157 `
  --remote-repo "C:\WorkSpace\Python\AniFileBERT"
```

Machine-readable JSON report:

```powershell
uv run python -m tools.recover_training_context `
  --remote-host adqew@192.168.63.157 `
  --format json `
  --output reports/recovery_latest.json
```

The script inspects:

- latest/selected run under `checkpoints/`
- latest `checkpoint-*` step and `trainer_state.json`
- `final/run_metadata.json`
- `logs//combined.log` tail
- remote run state via SSH (optional)

If the run is incomplete, it prints a resume hint with `--resume-from-checkpoint`.

## 10. Git and LFS Order

If the dataset submodule changed, commit and push it first:

```powershell
git -C datasets/AnimeName add dmhy_weak.jsonl dmhy_weak.manifest.json dmhy_weak_char.jsonl dmhy_weak_char.manifest.json vocab.json vocab.char.json
git -C datasets/AnimeName commit -m "Update anime filename labels"
git -C datasets/AnimeName lfs push origin main --all
git -C datasets/AnimeName push origin main
```

Then commit the model repo:

```powershell
git add README.md docs/maintenance.md docs/android.md docs/training.md docs/onnx.md `
  config.json model.safetensors tokenizer_config.json training_args.bin vocab.json vocab.char.json `
  exports/anime_filename_parser.onnx exports/anime_filename_parser.metadata.json `
  reports anifilebert tools data/parser_regression_cases.json datasets/AnimeName
git commit -m "Update AniFileBERT model and documentation"
git lfs push origin main --all
git push origin main
```