# Training Guide

This document describes the reproducible training workflow for AniFileBERT.

## 1. Environment

Use `uv` for all dependency management and command execution.

```powershell
uv sync
uv run python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```

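Recommended GPU configuration: an RTX 5070 Ti, which the full-training command in section 5 targets with `--bf16` and `--batch-size 1792`.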

## 2. Dataset

The authoritative dataset lives in the nested submodule `datasets/AnimeName`:

```text
datasets/AnimeName/dmhy_weak.jsonl
datasets/AnimeName/dmhy_weak_char.jsonl
datasets/AnimeName/vocab.json
datasets/AnimeName/vocab.char.json
```

Current expected properties:

- rows: 632002
- strict BIO violations: 0
- character vocab size: 6199
- character coverage: 100%
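The row count and strict-BIO figure can be re-checked with a short script. The following is a minimal sketch, not the project's own validator: it assumes each JSONL row stores its tag sequence under a `labels` key (a hypothetical field name; adjust it to the real schema) and treats an `I-X` tag as valid only directly after `B-X` or `I-X`.

```python
import json
from pathlib import Path


def count_bio_violations(labels):
    """Count I-X tags that do not directly follow B-X or I-X of the same type."""
    violations = 0
    prev = "O"
    for tag in labels:
        if tag.startswith("I-") and prev not in (f"B-{tag[2:]}", f"I-{tag[2:]}"):
            violations += 1
        prev = tag
    return violations


def summarize(path, label_key="labels"):
    """Return (row_count, total_violations) for a JSONL file.

    label_key is an assumed field name, not confirmed by the dataset docs.
    """
    rows = 0
    violations = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            rows += 1
            violations += count_bio_violations(json.loads(line)[label_key])
    return rows, violations
```

Calling `summarize("datasets/AnimeName/dmhy_weak_char.jsonl")` would then be expected to report the row count listed above and zero violations, provided the field-name assumption holds.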

## 3. Relabel Full Dataset

Use this workflow when the weak-label rules in `tools/dmhy_dataset.py` or `anifilebert/label_repairs.py` have changed.

```powershell
uv run python -m tools.relabel_dataset_from_filenames `
  --input datasets/AnimeName/dmhy_weak.jsonl `
  --output datasets/AnimeName/dmhy_weak.relabel.jsonl `
  --manifest-output datasets/AnimeName/dmhy_weak.relabel.manifest.json `
  --vocab-output datasets/AnimeName/vocab.relabel.json `
  --base-vocab datasets/AnimeName/vocab.json `
  --max-vocab-size 8000 `
  --progress 50000
```

After checking the manifest and a sample of the new labels, replace the authoritative files:

```powershell
Move-Item datasets/AnimeName/dmhy_weak.relabel.jsonl datasets/AnimeName/dmhy_weak.jsonl -Force
Move-Item datasets/AnimeName/vocab.relabel.json datasets/AnimeName/vocab.json -Force
Move-Item datasets/AnimeName/dmhy_weak.relabel.manifest.json datasets/AnimeName/dmhy_weak.manifest.json -Force
```
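Before running the move commands, it can help to look at a random handful of relabeled rows alongside the manifest. The sketch below uses single-pass reservoir sampling so the full file is never held in memory. The function name, sample size, and seed are illustrative, and rows are returned as raw JSON text because no row schema is assumed here.

```python
import random


def sample_lines(path, k=20, seed=0):
    """Reservoir-sample up to k non-empty lines from a text file in one pass."""
    rng = random.Random(seed)
    sample = []
    seen = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            seen += 1
            if len(sample) < k:
                sample.append(line)
            else:
                # Keep the new line with probability k / seen.
                j = rng.randrange(seen)
                if j < k:
                    sample[j] = line
    return sample


# Example (path taken from the relabel command):
# for row in sample_lines("datasets/AnimeName/dmhy_weak.relabel.jsonl"):
#     print(row)
```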

## 4. Convert to Character Dataset

The published checkpoint uses the character tokenizer.

```powershell
uv run python -m tools.convert_to_char_dataset `
  --input datasets/AnimeName/dmhy_weak.jsonl `
  --output datasets/AnimeName/dmhy_weak_char.jsonl `
  --vocab-output datasets/AnimeName/vocab.char.json `
  --manifest-output datasets/AnimeName/dmhy_weak_char.manifest.json `
  --progress 50000
```
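Conceptually, moving from token-level to character-level BIO means each character inherits its token's entity type, and only the first character of an entity keeps the `B-` prefix. The sketch below illustrates that idea under assumed conventions. It is not the logic of `tools.convert_to_char_dataset`, whose handling of separators and edge cases may differ, and the tag names in the example are hypothetical.

```python
def expand_to_chars(tokens, labels):
    """Expand token-level BIO labels to one label per character."""
    chars, char_labels = [], []
    for token, label in zip(tokens, labels):
        for index, ch in enumerate(token):
            chars.append(ch)
            if label.startswith("B-") and index > 0:
                # Only the first character of an entity keeps the B- prefix.
                char_labels.append("I-" + label[2:])
            else:
                char_labels.append(label)
    return chars, char_labels


chars, char_labels = expand_to_chars(
    ["[", "Sub", "]", "01"],
    ["O", "B-GROUP", "O", "B-EPISODE"],
)
print(list(zip(chars, char_labels)))
```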

## 5. Full Training with Virtual BIO Shards

The commands below are the recommended run for an RTX 5070 Ti. The path-context switches are intended for the next path-aware retrain; the currently published checkpoint lineage predates this augmentation.

```powershell
# Step 1: write a seeded 98% train-split copy of the character dataset.
@'
import random
from pathlib import Path

source = Path("datasets/AnimeName/dmhy_weak_char.jsonl")
target = Path("data/generated/virtual_source_train_seed105.jsonl")
# Drop blank lines, then shuffle deterministically with seed 105.
rows = [line for line in source.read_text(encoding="utf-8").splitlines() if line]
random.Random(105).shuffle(rows)
target.parent.mkdir(parents=True, exist_ok=True)
# Keep the first 98% of the shuffled rows as generator input.
target.write_text("\n".join(rows[: int(len(rows) * 0.98)]) + "\n", encoding="utf-8")
'@ | .\.venv\Scripts\python.exe -

# Step 2: build the Rust generator and derive a vocab that includes path separators.
cargo build --release --manifest-path tools/virtual_dataset_generator/Cargo.toml
uv run python -m tools.extend_char_vocab `
  --input datasets/AnimeName/vocab.char.json `
  --output data/generated/vocab.char.path.json

# Step 3: generate the pre-encoded virtual shards.
.\tools\virtual_dataset_generator\target\release\anifilebert-virtual-dataset-generator.exe `
  --input data/generated/virtual_source_train_seed105.jsonl `
  --vocab-file data/generated/vocab.char.path.json `
  --output-dir data/generated/virtual_char_sps32_path4_seed105 `
  --max-length 128 `
  --samples-per-source 32 `
  --path-samples-per-source 4 `
  --seed 105 `
  --threads 20 `
  --separator-mode per-gap `
  --bracket-mode per-part

# Step 4: train, initializing from the checkpoint in the repository root.
.\.venv\Scripts\python.exe -m anifilebert.train --tokenizer char `
  --data-file datasets/AnimeName/dmhy_weak_char.jsonl `
  --vocab-file data/generated/vocab.char.path.json `
  --virtual-dataset-dir data/generated/virtual_char_sps32_path4_seed105 `
  --save-dir checkpoints/dmhy-char-virtual-sps32-path4-10epoch-lr1e5 `
  --init-model-dir . `
  --epochs 10 `
  --batch-size 1792 `
  --learning-rate 0.00001 `
  --warmup-steps 2000 `
  --max-seq-length 128 `
  --train-split 0.98 `
  --num-workers 4 `
  --prefetch-factor 4 `
  --persistent-workers `
  --checkpoint-steps 5000 `
  --save-total-limit 3 `
  --parse-eval-limit 2048 `
  --case-eval-file data/parser_regression_cases.json `
  --bf16 `
  --no-periodic-eval `
  --perf-log-steps 1000 `
  --perf-sample-interval 0.5 `
  --seed 105 `
  --experiment-name dmhy-char-virtual-sps32-path4-10epoch-lr1e5
```

The Rust generator samples BIO entity-block subsets and permutations, separator variants, bracket styles, incomplete filename fragments, and standalone special fixtures, and writes them as compact pre-encoded `.npy` shards. When `--path-samples-per-source` is enabled, it also creates synthetic full-path samples such as `O:\115open\影音\动漫\TITLE\Season 01\03 [1080P][WEB-DL].mkv`, in which all prefix directories are labeled `O` and only the terminal title/season/episode/meta segments carry BIO labels. Run `tools.extend_char_vocab` before path training so that `/` and `\` are real character tokens instead of `[UNK]`.
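The path-labeling rule can be illustrated at character level. This is a minimal sketch, assuming one label per character and a terminal segment that already carries its own label sequence. `label_path_sample` is a hypothetical helper and the `TITLE` tag name is illustrative; neither is taken from the Rust generator.

```python
def label_path_sample(prefix, tail, tail_labels):
    """Prepend an unlabeled directory prefix to an already-labeled tail.

    Every prefix character, including path separators, is labeled "O";
    the tail keeps the labels it already had.
    """
    if len(tail) != len(tail_labels):
        raise ValueError("tail and tail_labels must have the same length")
    chars = list(prefix) + list(tail)
    labels = ["O"] * len(prefix) + list(tail_labels)
    return chars, labels


chars, labels = label_path_sample("O:\\noise\\", "AB", ["B-TITLE", "I-TITLE"])
print(labels)
```

Because the separators become ordinary input characters here, the vocab extension step matters: without it they would all map to `[UNK]`.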

The current release generated 20,439,848 training rows from 619,361 train-split source rows plus 935 special fixtures, then trained for 10 epochs / 114,070 optimizer steps.
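These figures are consistent with the settings shown earlier, which makes them a useful sanity check when reproducing the run. The arithmetic below assumes that the 0.98 split is truncated with `int()` as in the split script, that the release used a batch size of 1792, and that the final partial batch of each epoch counts as an optimizer step. Reading 33 rows per source as 32 virtual samples plus one further row is an inference from the numbers, not something stated here.

```python
import math

total_rows = 632_002           # rows in the authoritative dataset
train_sources = int(total_rows * 0.98)
special_fixtures = 935
generated_rows = 20_439_848    # reported training rows
batch_size = 1792              # assumed to match the command above
epochs = 10

rows_per_source = (generated_rows - special_fixtures) / train_sources
steps_per_epoch = math.ceil(generated_rows / batch_size)

print(train_sources)              # 619361
print(rows_per_source)            # 33.0
print(steps_per_epoch * epochs)   # 114070
```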

In the synthetic path samples, the terminal segments that keep BIO labels also include the group. Some variants end in a complete release filename of the form `[Group] Title - 01 ...`; there the repeated basename title is labeled `O`, so that entity aggregation does not learn to output the title twice.

If you need standalone JSONL focus data instead of pre-encoded virtual shards, run `uv run python -m tools.build_path_prefix_dataset --input datasets/AnimeName/dmhy_weak_char.jsonl --output data/generated/path_prefix_seed105_char.jsonl`. It expands real DMHY rows into `noise/noise/title/Season N/01 ...` path samples with the prefix components labeled `O`. Add the following options when building a short-group boundary focus set:

```text
--require-group --max-group-length 4 --group-prefix-prob 1 --basename-title-prob 1
```

Use `data/path_prefix_regression_cases.json` as the focused path-noise regression gate when comparing baseline and path-aware checkpoints.

Optional structure/regularization experiments:

- `--model-head crf` trains a BERT emission classifier with a linear-chain CRF loss. The CRF loss intentionally does not hard-mask BIO transitions, because weak labels may contain imperfect BIO spans; BIO constraints are still applied during CRF Viterbi decoding.
- `--rdrop-alpha <weight>` enables two-pass R-Drop symmetric KL regularization. Treat CRF/R-Drop runs as experiments until they have passed both `data/parser_regression_cases.json` and `data/path_prefix_regression_cases.json`.
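To make decode-time BIO constraints concrete, here is a minimal Viterbi sketch in plain Python. It is a simplification rather than the project's decoder: a real linear-chain CRF adds learned transition scores, whereas this sketch assumes every allowed transition scores zero and only forbids `I-X` at the start of a sequence or after anything other than `B-X`/`I-X`. The tag names are illustrative.

```python
import math


def is_allowed(prev, curr):
    """An I-X tag may only follow B-X or I-X; every other transition is allowed."""
    if curr.startswith("I-"):
        return prev in ("B-" + curr[2:], "I-" + curr[2:])
    return True


def viterbi_bio(emissions, labels):
    """Best label sequence under hard BIO constraints.

    emissions[t][j] is the score of labels[j] at position t.
    """
    if not emissions:
        return []
    n = len(labels)
    # A sequence may not start with an I- tag.
    score = [
        -math.inf if labels[j].startswith("I-") else emissions[0][j]
        for j in range(n)
    ]
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n):
            candidates = [i for i in range(n) if is_allowed(labels[i], labels[j])]
            best = max(candidates, key=lambda i: score[i])
            new_score.append(score[best] + emissions[t][j])
            pointers.append(best)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path back from the highest-scoring final label.
    j = max(range(n), key=lambda k: score[k])
    path = [j]
    for pointers in reversed(backpointers):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [labels[j] for j in path]


tags = ["O", "B-TITLE", "I-TITLE"]
# Per-position argmax would give I-TITLE twice, which is not valid BIO.
print(viterbi_bio([[0.0, 0.0, 5.0], [0.0, 0.0, 5.0]], tags))
```

In the example, the constrained decoder returns `B-TITLE` followed by `I-TITLE`, the best-scoring sequence that is still valid BIO.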


Training outputs:

- `checkpoints/<run>/checkpoint-*`: resumable checkpoints
- `checkpoints/<run>/final`: final Hugging Face checkpoint
- `final/run_metadata.json`: run configuration
- `final/trainer_eval_metrics.json`: seqeval token/entity metrics
- `final/parse_eval_metrics.json`: held-out parser exact-match accuracy
- `final/case_metrics.json`: regression on fixed real-world cases
- `final/perf_metrics.json`: training throughput and GPU telemetry, when enabled
- TensorBoard logs, unless `--no-tensorboard` is set

## 6. Thin Hard-Case Fine-Tuning

Use hard-case fine-tuning only after a specific real-world failure pattern has been confirmed, fixed in the weak labels, and added to `data/parser_regression_cases.json`.

```powershell
uv run python -m tools.build_repair_focus_dataset `
  --input datasets/AnimeName/dmhy_weak_char.jsonl `
  --output data/generated/focus_after_virtual_sps32_char.jsonl `
  --context-samples 50000 `
  --repeat-focus 3 `
  --repeat-manual 96 `
  --seed 205

.\.venv\Scripts\python.exe -m anifilebert.train --tokenizer char `
  --data-file data/generated/focus_after_virtual_sps32_char.jsonl `
  --vocab-file datasets/AnimeName/vocab.char.json `
  --save-dir checkpoints/dmhy-char-virtual-sps32-10epoch-lightfocus `
  --init-model-dir checkpoints/dmhy-char-virtual-sps32-10epoch-lr1e5/final `
  --epochs 1 `
  --batch-size 1792 `
  --learning-rate 0.000002 `
  --warmup-steps 20 `
  --max-seq-length 128 `
  --train-split 0.95 `
  --num-workers 0 `
  --checkpoint-steps 300 `
  --save-total-limit 2 `
  --parse-eval-limit 2048 `
  --case-eval-file data/parser_regression_cases.json `
  --bf16 `
  --no-periodic-eval `
  --perf-log-steps 50 `
  --perf-sample-interval 0.5 `
  --seed 208 `
  --experiment-name dmhy-char-virtual-sps32-10epoch-lightfocus
```

The default quality gate is based on model-led parsing:

- fixed regression `model_only` >= 85%
- held-out parse `model_only` >= 75%
- `normalized_only` is the default thin runtime metric
- structural filename assists are not part of training or release metrics
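The two numeric gates are simple to automate once the `model_only` accuracies have been read out of the metrics files. The sketch below takes them as plain fractions. How they are stored inside `case_metrics.json` and `parse_eval_metrics.json` is not assumed here, and `gate_failures` is a hypothetical helper name.

```python
FIXED_REGRESSION_MIN = 0.85   # fixed regression model_only gate
HELD_OUT_PARSE_MIN = 0.75     # held-out parse model_only gate


def gate_failures(case_model_only, parse_model_only):
    """Return a list of human-readable gate failures (empty when both pass).

    Both arguments are accuracies expressed as fractions in [0, 1].
    """
    failures = []
    if case_model_only < FIXED_REGRESSION_MIN:
        failures.append(
            f"fixed regression model_only {case_model_only:.1%} < {FIXED_REGRESSION_MIN:.0%}"
        )
    if parse_model_only < HELD_OUT_PARSE_MIN:
        failures.append(
            f"held-out parse model_only {parse_model_only:.1%} < {HELD_OUT_PARSE_MIN:.0%}"
        )
    return failures
```

An empty list means the checkpoint clears both gates; anything else is a reason not to publish yet.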

## 7. Publish to Repository Root

The repository root is the Hugging Face checkpoint surface.

```powershell
$final = "checkpoints/dmhy-char-virtual-sps32-10epoch-lightfocus/final"
$releaseVocab = "datasets/AnimeName/vocab.char.json"
# For a path-aware run trained with data/generated/vocab.char.path.json:
# $releaseVocab = "data/generated/vocab.char.path.json"
Copy-Item "$final/config.json" . -Force
Copy-Item "$final/model.safetensors" . -Force
Copy-Item "$final/tokenizer_config.json" . -Force
Copy-Item "$final/training_args.bin" . -Force
Copy-Item "$final/vocab.json" . -Force
New-Item -ItemType Directory -Path reports -Force | Out-Null
Copy-Item "$final/run_metadata.json" reports/run_metadata.json -Force
Copy-Item "$final/trainer_eval_metrics.json" reports/trainer_eval_metrics.json -Force
Copy-Item "$final/parse_eval_metrics.json" reports/parse_eval_metrics.json -Force
Copy-Item "$final/case_metrics.json" reports/case_metrics.json -Force
Copy-Item "$final/perf_metrics.json" reports/perf_metrics.json -Force
Copy-Item $releaseVocab .\vocab.char.json -Force
```

Then export ONNX:

```powershell
uv run python -m tools.export_onnx --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
```

## 8. Validation Checklist

Run these before committing:

```powershell
uv run python -m compileall -q anifilebert tools
uv run python -m tools.evaluate_parser_cases --model-dir . --case-file data/parser_regression_cases.json --output reports/case_metrics.json
uv run python -m anifilebert.inference --model-dir . "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
uv run python -m tools.onnx_inference "[YYDM&VCB-Studio] Shinsekai Yori [NCED02][Ma10p_1080p][x265_flac].mkv"
uv run python -m tools.benchmark_inference `
  --model-dir . `
  --onnx exports/anime_filename_parser.onnx `
  --case-file data/parser_regression_cases.json `
  --repeat 20 `
  --warmup 20 `
  --torch-threads 1 `
  --ort-threads 1 `
  --output reports/benchmark_results.json
```

## 9. Recover Interrupted Work

When Codex/session history is lost, use the recovery helper to reconstruct what was running, whether it finished, and where to resume.

Local latest run:

```powershell
uv run python -m tools.recover_training_context
```

Specific local run:

```powershell
uv run python -m tools.recover_training_context --run dmhy-char-virtual-sps32-path4-10epoch-lr1e5
```

Local + remote worker (RTX 5070 Ti):

```powershell
uv run python -m tools.recover_training_context `
  --remote-host adqew@192.168.63.157 `
  --remote-repo "C:\WorkSpace\Python\AniFileBERT"
```

Machine-readable JSON report:

```powershell
uv run python -m tools.recover_training_context `
  --remote-host adqew@192.168.63.157 `
  --format json `
  --output reports/recovery_latest.json
```

The script inspects:

- the latest or selected run under `checkpoints/`
- the latest `checkpoint-*` step and `trainer_state.json`
- `final/run_metadata.json`
- the tail of `logs/<run>/combined.log`
- remote run state via SSH (optional)

If the run is incomplete, it prints a resume hint with `--resume-from-checkpoint`.
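The checkpoint inspection can be approximated by hand when the helper is unavailable. The sketch below is not the code of `tools.recover_training_context`. It assumes the `checkpoints/<run>/checkpoint-*` and `final/run_metadata.json` layout listed under training outputs, and it treats the presence of `final/run_metadata.json` as the sign of a finished run, which is an assumption.

```python
import re
from pathlib import Path

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(run_dir):
    """Return (step, path) of the highest-numbered checkpoint-* dir, or None."""
    best = None
    for child in Path(run_dir).iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match:
            # Compare steps numerically so checkpoint-10000 beats checkpoint-5000.
            step = int(match.group(1))
            if best is None or step > best[0]:
                best = (step, child)
    return best


def is_finished(run_dir):
    """Treat a run as finished when final/run_metadata.json exists (assumption)."""
    return (Path(run_dir) / "final" / "run_metadata.json").is_file()
```

If `is_finished` returns `False` and `latest_checkpoint` returns a path, that path is the natural candidate to pass to `--resume-from-checkpoint`.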


## 10. Git and LFS Order

If the dataset submodule changed, commit and push it first:

```powershell
git -C datasets/AnimeName add dmhy_weak.jsonl dmhy_weak.manifest.json dmhy_weak_char.jsonl dmhy_weak_char.manifest.json vocab.json vocab.char.json
git -C datasets/AnimeName commit -m "Update anime filename labels"
git -C datasets/AnimeName lfs push origin main --all
git -C datasets/AnimeName push origin main
```

Then commit the model repo:

```powershell
git add README.md docs/maintenance.md docs/android.md docs/training.md docs/onnx.md `
  config.json model.safetensors tokenizer_config.json training_args.bin vocab.json vocab.char.json `
  exports/anime_filename_parser.onnx exports/anime_filename_parser.metadata.json `
  reports anifilebert tools data/parser_regression_cases.json datasets/AnimeName
git commit -m "Update AniFileBERT model and documentation"
git lfs push origin main --all
git push origin main
```