ModerRAS

Training Guide / 训练指南

This document describes the reproducible training workflow for AniFileBERT.

本文档记录 AniFileBERT 的可复现训练流程。

1. Environment / 环境

Use uv to manage dependencies and to run every command in this guide.

所有依赖和命令优先使用 uv。

uv sync
uv run python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

Recommended GPU configuration:

2. Dataset / 数据集

The authoritative dataset lives in the nested submodule:

权威数据集位于嵌套子模块：

datasets/AnimeName/dmhy_weak.jsonl
datasets/AnimeName/dmhy_weak_char.jsonl
datasets/AnimeName/vocab.json
datasets/AnimeName/vocab.char.json

Current expected properties:

当前期望属性：

rows / 行数: 632002
strict BIO violations / 严格 BIO 违规: 0
character vocab / 字符词表: 6199
character coverage / 字符覆盖率: 100%
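
The properties above can be re-checked with a short script. The following is a minimal sketch, not the project's own validator. It assumes each JSONL row carries parallel `tokens` and `labels` lists (these field names are an assumption), and it counts an `I-X` tag as a strict BIO violation unless the previous tag is `B-X` or `I-X` of the same type.

```python
import json
from typing import Iterable


def count_bio_violations(labels: list[str]) -> int:
    """Count I-X tags that do not continue a B-X or I-X span of the same type."""
    violations = 0
    previous = "O"
    for label in labels:
        if label.startswith("I-"):
            entity = label[2:]
            if previous not in (f"B-{entity}", f"I-{entity}"):
                violations += 1
        previous = label
    return violations


def summarize(jsonl_lines: Iterable[str], vocab: set[str]) -> dict:
    """Aggregate row count, BIO violations, and token coverage.

    Assumption: each row is a JSON object with `tokens` and `labels` lists.
    """
    rows = violations = total = covered = 0
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        rows += 1
        violations += count_bio_violations(row["labels"])
        for token in row["tokens"]:
            total += 1
            covered += int(token in vocab)
    coverage = covered / total if total else 1.0
    return {"rows": rows, "bio_violations": violations, "coverage": coverage}
```

Against the real files, one would pass an open handle on `datasets/AnimeName/dmhy_weak_char.jsonl` and the set of tokens from `datasets/AnimeName/vocab.char.json`, then compare the result with the figures listed above.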

3. Relabel Full Dataset / 全量重标注

Use this workflow when the weak-label rules in dmhy_dataset.py or label_repairs.py have changed.

当 dmhy_dataset.py 或 label_repairs.py 的弱标注规则改变时，使用此流程。

uv run python relabel_dataset_from_filenames.py `
  --input datasets/AnimeName/dmhy_weak.jsonl `
  --output datasets/AnimeName/dmhy_weak.relabel.jsonl `
  --manifest-output datasets/AnimeName/dmhy_weak.relabel.manifest.json `
  --vocab-output datasets/AnimeName/vocab.relabel.json `
  --base-vocab datasets/AnimeName/vocab.json `
  --max-vocab-size 8000 `
  --progress 50000

After checking the manifest and sample labels, replace the authoritative files:

检查 manifest 和样本标注后，再替换权威文件：

Move-Item datasets/AnimeName/dmhy_weak.relabel.jsonl datasets/AnimeName/dmhy_weak.jsonl -Force
Move-Item datasets/AnimeName/vocab.relabel.json datasets/AnimeName/vocab.json -Force
Move-Item datasets/AnimeName/dmhy_weak.relabel.manifest.json datasets/AnimeName/dmhy_weak.manifest.json -Force
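
Before running the move commands above, the sample-label check could be done by drawing a reproducible random sample from the relabeled file. This is a minimal sketch using single-pass reservoir sampling, so the full file never has to be loaded into memory. The helper name `sample_lines` is hypothetical and not part of the repository.

```python
import random
from typing import Iterable


def sample_lines(lines: Iterable[str], k: int, seed: int = 0) -> list[str]:
    """Reservoir-sample up to k non-empty lines in a single pass."""
    rng = random.Random(seed)
    reservoir: list[str] = []
    seen = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        seen += 1
        if len(reservoir) < k:
            reservoir.append(line)
        else:
            # Keep the new line with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = line
    return reservoir
```

For example, opening `datasets/AnimeName/dmhy_weak.relabel.jsonl` with `encoding="utf-8"` and printing `sample_lines(handle, k=20, seed=75)` gives twenty rows to inspect by eye.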

4. Convert to Character Dataset / 转换为字符数据集

The published checkpoint uses the character tokenizer.

当前发布模型使用字符级 tokenizer。

uv run python convert_to_char_dataset.py `
  --input datasets/AnimeName/dmhy_weak.jsonl `
  --output datasets/AnimeName/dmhy_weak_char.jsonl `
  --vocab-output datasets/AnimeName/vocab.char.json `
  --manifest-output datasets/AnimeName/dmhy_weak_char.manifest.json `
  --progress 50000
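
To illustrate what a conversion to character level involves, here is a minimal sketch of one common way to project token-level BIO labels onto characters. It is an assumption about the general technique, not a description of what convert_to_char_dataset.py actually does, and the label names in the example are hypothetical.

```python
def expand_to_char_labels(
    tokens: list[str], labels: list[str]
) -> tuple[list[str], list[str]]:
    """Split tokens into characters and carry BIO labels along.

    The first character of a B-X token keeps B-X; the remaining characters
    become I-X. I-X and O tokens pass their label to every character.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    chars: list[str] = []
    char_labels: list[str] = []
    for token, label in zip(tokens, labels):
        for index, ch in enumerate(token):
            chars.append(ch)
            if label.startswith("B-") and index > 0:
                char_labels.append("I-" + label[2:])
            else:
                char_labels.append(label)
    return chars, char_labels


chars, char_labels = expand_to_char_labels(
    ["[", "VCB", "]", "01"],
    ["O", "B-group", "O", "B-episode"],  # hypothetical label names
)
```

Because every span still starts with exactly one `B-` tag, this projection preserves strict BIO validity when the input labels are valid.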

5. Full Training / 全量训练

Recommended RTX 3080 run:

6. Focus Fine-Tuning / 针对性微调

Use focus fine-tuning only after a specific real-world failure pattern has been confirmed and added to data/parser_regression_cases.json.

只有在确认某类真实失败样式，并加入 data/parser_regression_cases.json 后，才使用针对性微调。

uv run python build_repair_focus_dataset.py `
  --input datasets/AnimeName/dmhy_weak_char.jsonl `
  --output data/repair_focus_char.jsonl `
  --context-samples 50000 `
  --repeat-repaired 4 `
  --repeat-manual 24 `
  --seed 75

uv run python train.py --tokenizer char `
  --data-file data/repair_focus_char.jsonl `
  --vocab-file datasets/AnimeName/vocab.char.json `
  --save-dir checkpoints/dmhy-char-special-focus `
  --init-model-dir . `
  --epochs 1 `
  --batch-size 64 `
  --learning-rate 0.00003 `
  --warmup-steps 50 `
  --max-seq-length 128 `
  --train-split 0.95 `
  --num-workers 0 `
  --checkpoint-steps 500 `
  --save-total-limit 2 `
  --parse-eval-limit 512 `
  --case-eval-file data/parser_regression_cases.json `
  --seed 75 `
  --experiment-name dmhy-char-special-focus
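
The flags passed to build_repair_focus_dataset.py suggest an oversampling scheme: a random slice of ordinary rows for context, plus repeated copies of repaired and manually written rows. The sketch below shows that idea under stated assumptions. The meaning given to `context_samples`, `repeat_repaired`, and `repeat_manual` is inferred from the flag names only, and `build_focus_rows` is a hypothetical helper, not the script's real code.

```python
import random


def build_focus_rows(
    all_rows: list,
    repaired_rows: list,
    manual_rows: list,
    context_samples: int = 50000,
    repeat_repaired: int = 4,
    repeat_manual: int = 24,
    seed: int = 75,
) -> list:
    """Mix random context rows with oversampled repaired and manual rows."""
    rng = random.Random(seed)
    # Assumed: context rows are drawn without replacement from the full set.
    context = rng.sample(all_rows, min(context_samples, len(all_rows)))
    # Assumed: each repaired/manual row is duplicated a fixed number of times.
    focus = context + repaired_rows * repeat_repaired + manual_rows * repeat_manual
    rng.shuffle(focus)
    return focus
```

Keeping a large context sample alongside the repeated rows is a usual guard against the model forgetting general behaviour while it learns the targeted pattern, which fits the single epoch and low learning rate used in the training command above.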

7. Publish to Repository Root / 发布到仓库根目录

The repository root doubles as the published Hugging Face checkpoint, so the final training artifacts are copied into it.

仓库根目录就是 Hugging Face checkpoint 发布面。

$final = "checkpoints/dmhy-char-full/final"
Copy-Item "$final/config.json" . -Force
Copy-Item "$final/model.safetensors" . -Force
Copy-Item "$final/tokenizer_config.json" . -Force
Copy-Item "$final/training_args.bin" . -Force
Copy-Item "$final/vocab.json" . -Force
Copy-Item "$final/run_metadata.json" . -Force
Copy-Item "$final/trainer_eval_metrics.json" . -Force
Copy-Item "$final/parse_eval_metrics.json" . -Force
Copy-Item "$final/case_metrics.json" . -Force
Copy-Item datasets/AnimeName/vocab.char.json .\vocab.char.json -Force
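
For shells other than PowerShell, the same copy step can be sketched in Python. This is an illustrative equivalent of the commands above, with one addition that is not in the original: it checks that every artifact exists before copying anything, so a partially written checkpoint is not published. The function name `publish_checkpoint` is hypothetical.

```python
import shutil
from pathlib import Path

# File names taken from the Copy-Item commands above.
PUBLISHED_FILES = [
    "config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "training_args.bin",
    "vocab.json",
    "run_metadata.json",
    "trainer_eval_metrics.json",
    "parse_eval_metrics.json",
    "case_metrics.json",
]


def publish_checkpoint(final_dir, repo_root, char_vocab) -> list[Path]:
    """Copy checkpoint artifacts and the character vocab into repo_root."""
    final_dir, repo_root, char_vocab = Path(final_dir), Path(repo_root), Path(char_vocab)
    sources = [final_dir / name for name in PUBLISHED_FILES] + [char_vocab]
    missing = [str(path) for path in sources if not path.is_file()]
    if missing:
        raise FileNotFoundError("missing artifacts: " + ", ".join(missing))
    copied = []
    for source in sources:
        target = repo_root / source.name
        shutil.copy2(source, target)  # overwrites, like Copy-Item -Force
        copied.append(target)
    return copied
```

A call matching the commands above would be `publish_checkpoint("checkpoints/dmhy-char-full/final", ".", "datasets/AnimeName/vocab.char.json")`.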

Then export ONNX:

然后导出 ONNX：

uv run python export_onnx.py --model-dir . --output exports/anime_filename_parser.onnx --max-length 128

8. Validation Checklist / 验证清单

Run these before committing:

提交前执行：

uv run python -m py_compile tokenizer.py dataset.py dmhy_dataset.py label_repairs.py train.py inference.py export_onnx.py onnx_inference.py
uv run python evaluate_parser_cases.py --model-dir . --case-file data/parser_regression_cases.json --output case_metrics.json
uv run python inference.py --model-dir . "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
uv run python onnx_inference.py "[YYDM&VCB-Studio] Shinsekai Yori [NCED02][Ma10p_1080p][x265_flac].mkv"

9. Git and LFS Order / Git 与 LFS 顺序

If the dataset submodule changed:

如果数据集子模块有变动：

git -C datasets/AnimeName add dmhy_weak.jsonl dmhy_weak.manifest.json dmhy_weak_char.jsonl dmhy_weak_char.manifest.json vocab.json vocab.char.json
git -C datasets/AnimeName commit -m "Update anime filename labels"
git -C datasets/AnimeName lfs push origin main --all
git -C datasets/AnimeName push origin main

Then commit the model repo:

再提交模型仓库：

git add README.md MAINTENANCE.md ANDROID.md docs/training.md docs/onnx.md `
  config.json model.safetensors tokenizer_config.json training_args.bin vocab.json vocab.char.json `
  exports/anime_filename_parser.onnx exports/anime_filename_parser.metadata.json `
  train.py inference.py export_onnx.py onnx_inference.py data/parser_regression_cases.json datasets/AnimeName
git commit -m "Update AniFileBERT model and documentation"
git lfs push origin main --all
git push origin main