	# Training Guide

	This document describes the reproducible training workflow for AniFileBERT.

	## 1. Environment

	Use `uv` to manage dependencies and to run every command.

	```powershell
	uv sync
	uv run python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
	```

	Recommended GPU configuration:

	- RTX 3080 class GPU or better
	- batch size `192` to `256` for full character-level training
	- `fp16` is enabled automatically when CUDA is available
	- `--num-workers 4` or `8`, provided the local disk can keep up

	## 2. Dataset

	The authoritative dataset lives in the nested submodule:

	```text
	datasets/AnimeName/dmhy_weak.jsonl
	datasets/AnimeName/dmhy_weak_char.jsonl
	datasets/AnimeName/vocab.json
	datasets/AnimeName/vocab.char.json
	```

	Current expected properties:

	- rows: `632002`
	- strict BIO violations: `0`
	- character vocabulary size: `6199`
	- character coverage: `100%`
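
The row count and the strict BIO check can be reproduced with a short script. The sketch below is not the project's validator. It assumes each JSONL row is an object with a `labels` list (the field name is an assumption), it reads "strict" as "every `I-X` tag directly continues a `B-X` or `I-X` tag of the same type", and the entity names in the demo are made up.

```python
import json
from typing import Iterable


def bio_violations(labels: list[str]) -> int:
    """Count I-X tags that do not continue a B-X or I-X span of the same type."""
    violations = 0
    previous = "O"
    for label in labels:
        if label.startswith("I-"):
            entity = label[2:]
            if previous not in (f"B-{entity}", f"I-{entity}"):
                violations += 1
        previous = label
    return violations


def summarize(lines: Iterable[str], label_key: str = "labels") -> dict:
    """Return row and violation counts for an iterable of JSONL lines."""
    rows = 0
    violations = 0
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        rows += 1
        violations += bio_violations(json.loads(line)[label_key])
    return {"rows": rows, "bio_violations": violations}


# Demo rows with hypothetical entity names; the second row opens with a dangling I- tag.
demo_lines = [
    '{"labels": ["B-TITLE", "I-TITLE", "O"]}',
    '{"labels": ["I-TITLE", "O"]}',
]
summary = summarize(demo_lines)
print(summary)
```

To check the real data, pass an open file handle for `datasets/AnimeName/dmhy_weak_char.jsonl` to `summarize` and compare the result with the expected properties listed above.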

	## 3. Relabel Full Dataset

	Run this step whenever the weak-label rules in `dmhy_dataset.py` or `label_repairs.py` change.

	```powershell
	uv run python relabel_dataset_from_filenames.py `
	    --input datasets/AnimeName/dmhy_weak.jsonl `
	    --output datasets/AnimeName/dmhy_weak.relabel.jsonl `
	    --manifest-output datasets/AnimeName/dmhy_weak.relabel.manifest.json `
	    --vocab-output datasets/AnimeName/vocab.relabel.json `
	    --base-vocab datasets/AnimeName/vocab.json `
	    --max-vocab-size 8000 `
	    --progress 50000
	```

	After checking the manifest and a sample of the new labels, replace the authoritative files:

	```powershell
	Move-Item datasets/AnimeName/dmhy_weak.relabel.jsonl datasets/AnimeName/dmhy_weak.jsonl -Force
	Move-Item datasets/AnimeName/vocab.relabel.json datasets/AnimeName/vocab.json -Force
	Move-Item datasets/AnimeName/dmhy_weak.relabel.manifest.json datasets/AnimeName/dmhy_weak.manifest.json -Force
	```
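
Before running the move commands above, the relabeled file can be spot-checked without loading all rows into memory. The sketch below uses reservoir sampling to draw a fixed number of rows in a single pass. It is an illustration only: `sample_rows` is a hypothetical helper, not part of the repository, and it assumes nothing about the row schema beyond one JSON object per line.

```python
import json
import random
from typing import Iterable


def sample_rows(lines: Iterable[str], k: int, seed: int = 0) -> list[dict]:
    """Reservoir-sample up to k parsed rows from a JSONL stream in one pass."""
    rng = random.Random(seed)
    reservoir: list[dict] = []
    seen = 0
    for line in lines:
        if not line.strip():
            continue
        seen += 1
        if len(reservoir) < k:
            reservoir.append(json.loads(line))
        else:
            # Keep the new row with probability k / seen by overwriting a random slot.
            slot = rng.randrange(seen)
            if slot < k:
                reservoir[slot] = json.loads(line)
    return reservoir


# Demo on synthetic rows; in practice iterate over an open handle for
# datasets/AnimeName/dmhy_weak.relabel.jsonl instead.
demo_lines = [json.dumps({"row": i}) for i in range(1000)]
picked = sample_rows(demo_lines, k=5, seed=52)
for row in picked:
    print(row)
```

A fixed seed keeps the sample stable between runs, which makes it easier to compare labels before and after a rule change.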

	## 4. Convert to Character Dataset

	The published checkpoint uses the character tokenizer.

	```powershell
	uv run python convert_to_char_dataset.py `
	    --input datasets/AnimeName/dmhy_weak.jsonl `
	    --output datasets/AnimeName/dmhy_weak_char.jsonl `
	    --vocab-output datasets/AnimeName/vocab.char.json `
	    --manifest-output datasets/AnimeName/dmhy_weak_char.manifest.json `
	    --progress 50000
	```
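
The internals of `convert_to_char_dataset.py` are not shown in this guide. As a rough illustration of what a token-to-character conversion involves, the sketch below projects token-level BIO labels onto individual characters so that the result stays BIO-valid. The parallel `tokens`/`labels` layout, the `expand_to_chars` helper, and the `GROUP` entity name are all assumptions made for the example.

```python
def expand_to_chars(tokens: list[str], labels: list[str]) -> tuple[list[str], list[str]]:
    """Project token-level BIO labels onto individual characters."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    chars: list[str] = []
    char_labels: list[str] = []
    for token, label in zip(tokens, labels):
        for index, char in enumerate(token):
            chars.append(char)
            if label.startswith("B-") and index > 0:
                # Only the first character opens the span; the rest continue it.
                char_labels.append("I-" + label[2:])
            else:
                char_labels.append(label)
    return chars, char_labels


# Hypothetical tokenization of a release-group prefix.
chars, char_labels = expand_to_chars(["[", "VCB", "]"], ["O", "B-GROUP", "O"])
for char, label in zip(chars, char_labels):
    print(char, label)
```

Turning every non-initial `B-` into `I-` is what keeps the strict BIO violation count at zero after the expansion.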

	## 5. Full Training

	Recommended command for an RTX 3080:

	```powershell
	uv run python train.py --tokenizer char `
	    --data-file datasets/AnimeName/dmhy_weak_char.jsonl `
	    --vocab-file datasets/AnimeName/vocab.char.json `
	    --save-dir checkpoints/dmhy-char-full `
	    --init-model-dir . `
	    --epochs 2 `
	    --batch-size 256 `
	    --learning-rate 0.00008 `
	    --warmup-steps 300 `
	    --max-seq-length 128 `
	    --train-split 0.98 `
	    --num-workers 4 `
	    --checkpoint-steps 1000 `
	    --save-total-limit 3 `
	    --parse-eval-limit 2048 `
	    --case-eval-file data/parser_regression_cases.json `
	    --seed 52 `
	    --experiment-name dmhy-char-full
	```
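
To see how the step-based flags in the command above relate to each other, it helps to estimate the number of optimizer steps. The calculation below is a back-of-the-envelope sketch, not output from `train.py`. It assumes the `632002`-row dataset from section 2, a train split that truncates to an integer, a final partial batch that is kept, a single GPU, and no gradient accumulation.

```python
import math


def plan_steps(rows, train_split, batch_size, epochs, checkpoint_steps, warmup_steps):
    """Estimate optimizer steps under the assumptions stated in the text."""
    train_rows = int(rows * train_split)  # assumption: the split truncates
    steps_per_epoch = math.ceil(train_rows / batch_size)  # assumption: partial batch kept
    total_steps = steps_per_epoch * epochs
    return {
        "train_rows": train_rows,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "checkpoints_written": total_steps // checkpoint_steps,
        "warmup_fraction": round(warmup_steps / total_steps, 3),
    }


plan = plan_steps(
    rows=632_002,
    train_split=0.98,
    batch_size=256,
    epochs=2,
    checkpoint_steps=1000,
    warmup_steps=300,
)
print(plan)
```

Under these assumptions the run takes about 4840 steps, so the 300 warmup steps cover roughly 6% of training and `--checkpoint-steps 1000` produces four intermediate checkpoints. If `--save-total-limit 3` prunes the oldest checkpoints, as the flag name suggests, the last three remain on disk.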

	Training outputs:

	- `checkpoints/<run>/checkpoint-*`: resumable checkpoints
	- `checkpoints/<run>/final`: final Hugging Face checkpoint
	- `final/run_metadata.json`: run configuration
	- `final/trainer_eval_metrics.json`: seqeval token-level and entity-level metrics
	- `final/parse_eval_metrics.json`: held-out parser exact-match accuracy
	- `final/case_metrics.json`: regression results on the fixed real-world cases
	- TensorBoard logs, written by default unless `--no-tensorboard` is set
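
The JSON reports listed above can be gathered in one place for a quick look after a run. The sketch below loads whichever of the four files exist. The file names come from the list; the `load_run_reports` helper and the `exact_match` key in the demo are hypothetical, since the report schemas are not documented here.

```python
import json
import tempfile
from pathlib import Path

REPORT_FILES = (
    "run_metadata.json",
    "trainer_eval_metrics.json",
    "parse_eval_metrics.json",
    "case_metrics.json",
)


def load_run_reports(final_dir: str) -> dict:
    """Load each report found under final_dir; missing files map to None."""
    reports = {}
    for name in REPORT_FILES:
        path = Path(final_dir) / name
        if path.is_file():
            reports[name] = json.loads(path.read_text(encoding="utf-8"))
        else:
            reports[name] = None
    return reports


# Demo against a throwaway directory; in practice pass checkpoints/<run>/final.
with tempfile.TemporaryDirectory() as tmp:
    # "exact_match" is a made-up key used only for this demo.
    (Path(tmp) / "parse_eval_metrics.json").write_text('{"exact_match": 0.5}', encoding="utf-8")
    reports = load_run_reports(tmp)

for name, report in reports.items():
    print(name, "missing" if report is None else sorted(report))
```

Mapping missing files to `None` instead of raising makes it easy to spot a run that stopped before the final evaluation step.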

	## 6. Focus Fine-Tuning

	Use focus fine-tuning only after a specific real-world failure pattern has been
	confirmed and added to `data/parser_regression_cases.json`.

	```powershell
	uv run python build_repair_focus_dataset.py `
	    --input datasets/AnimeName/dmhy_weak_char.jsonl `
	    --output data/repair_focus_char.jsonl `
	    --context-samples 50000 `
	    --repeat-repaired 4 `
	    --repeat-manual 24 `
	    --seed 75

	uv run python train.py --tokenizer char `
	    --data-file data/repair_focus_char.jsonl `
	    --vocab-file datasets/AnimeName/vocab.char.json `
	    --save-dir checkpoints/dmhy-char-special-focus `
	    --init-model-dir . `
	    --epochs 1 `
	    --batch-size 64 `
	    --learning-rate 0.00003 `
	    --warmup-steps 50 `
	    --max-seq-length 128 `
	    --train-split 0.95 `
	    --num-workers 0 `
	    --checkpoint-steps 500 `
	    --save-total-limit 2 `
	    --parse-eval-limit 512 `
	    --case-eval-file data/parser_regression_cases.json `
	    --seed 75 `
	    --experiment-name dmhy-char-special-focus
	```
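
The flag names of `build_repair_focus_dataset.py` suggest a mix of randomly sampled context rows and oversampled repair rows. The sketch below shows that general oversampling idea only. The semantics assigned to `--context-samples`, `--repeat-repaired`, and `--repeat-manual` are guesses from the names, and `build_focus_mix` is a hypothetical function, not the script's actual logic.

```python
import random


def build_focus_mix(context, repaired, manual, context_samples,
                    repeat_repaired, repeat_manual, seed):
    """Mix a random context sample with repeated repaired and manual rows."""
    rng = random.Random(seed)
    # Assumed meaning: draw at most context_samples rows from the full dataset.
    picked = rng.sample(context, min(context_samples, len(context)))
    # Assumed meaning: each repaired or manual row appears N times in the mix.
    mix = picked + list(repaired) * repeat_repaired + list(manual) * repeat_manual
    rng.shuffle(mix)
    return mix


# Tiny stand-in data; real rows would be parsed JSONL records.
context = [f"context-{i}" for i in range(10)]
mix = build_focus_mix(
    context,
    repaired=["repaired-a"],
    manual=["manual-a"],
    context_samples=5,
    repeat_repaired=4,
    repeat_manual=24,
    seed=75,
)
print(len(mix), mix.count("repaired-a"), mix.count("manual-a"))
```

Keeping a large share of ordinary context rows in the mix is a common way to limit forgetting while the repeated rows pull the model toward the confirmed failure pattern.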

	## 7. Publish to Repository Root

	The repository root is the published Hugging Face checkpoint, so publishing means copying the final checkpoint files there.

	```powershell
	$final = "checkpoints/dmhy-char-full/final"
	Copy-Item "$final/config.json" . -Force
	Copy-Item "$final/model.safetensors" . -Force
	Copy-Item "$final/tokenizer_config.json" . -Force
	Copy-Item "$final/training_args.bin" . -Force
	Copy-Item "$final/vocab.json" . -Force
	Copy-Item "$final/run_metadata.json" . -Force
	Copy-Item "$final/trainer_eval_metrics.json" . -Force
	Copy-Item "$final/parse_eval_metrics.json" . -Force
	Copy-Item "$final/case_metrics.json" . -Force
	Copy-Item datasets/AnimeName/vocab.char.json .\vocab.char.json -Force
	```
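
Because the copy commands above use `-Force`, an incomplete `final` directory could leave the repository root with a mix of old and new files. A cautious variant checks that every file exists before copying anything. The sketch below is one way to do that in Python. The file list mirrors the commands above, `publish` is a hypothetical helper, and `vocab.char.json` is left out because it is copied from the dataset submodule rather than from `final`.

```python
import shutil
import tempfile
from pathlib import Path

PUBLISHED_FILES = (
    "config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "training_args.bin",
    "vocab.json",
    "run_metadata.json",
    "trainer_eval_metrics.json",
    "parse_eval_metrics.json",
    "case_metrics.json",
)


def publish(final_dir: str, repo_root: str) -> list[Path]:
    """Copy checkpoint files to repo_root, refusing to start if any is missing."""
    final, root = Path(final_dir), Path(repo_root)
    missing = [name for name in PUBLISHED_FILES if not (final / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing from {final}: {', '.join(missing)}")
    for name in PUBLISHED_FILES:
        shutil.copy2(final / name, root / name)
    return [root / name for name in PUBLISHED_FILES]


# Demo with temporary directories standing in for final/ and the repository root.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    for name in PUBLISHED_FILES:
        (Path(src) / name).write_bytes(b"demo")
    copied_names = [path.name for path in publish(src, dst)]
    (Path(src) / "config.json").unlink()  # simulate an incomplete run
    try:
        publish(src, dst)
        incomplete_rejected = False
    except FileNotFoundError:
        incomplete_rejected = True

print(copied_names, incomplete_rejected)
```

In the repository this would be called as `publish("checkpoints/dmhy-char-full/final", ".")`.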

	Then export the ONNX model:

	```powershell
	uv run python export_onnx.py --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
	```

	## 8. Validation Checklist

	Run these before committing:

	```powershell
	uv run python -m py_compile tokenizer.py dataset.py dmhy_dataset.py label_repairs.py train.py inference.py export_onnx.py onnx_inference.py
	uv run python evaluate_parser_cases.py --model-dir . --case-file data/parser_regression_cases.json --output case_metrics.json
	uv run python inference.py --model-dir . "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
	uv run python onnx_inference.py "[YYDM&VCB-Studio] Shinsekai Yori [NCED02][Ma10p_1080p][x265_flac].mkv"
	```
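
When the checklist is run by hand, a failing command is easy to miss among the output of the others. The sketch below runs a list of commands in order and stops at the first non-zero exit code. `run_checks` is a hypothetical helper, and the demo uses stand-in commands so that it runs without the repository.

```python
import subprocess
import sys


def run_checks(commands: list[list[str]]) -> list[int]:
    """Run commands in order and stop after the first non-zero exit code."""
    codes = []
    for command in commands:
        result = subprocess.run(command)
        codes.append(result.returncode)
        if result.returncode != 0:
            break
    return codes


# Stand-in commands; in the repository each entry would be one checklist line,
# for example ["uv", "run", "python", "-m", "py_compile", "train.py"].
demo_commands = [
    [sys.executable, "-c", "pass"],
    [sys.executable, "-c", "raise SystemExit(3)"],
    [sys.executable, "-c", "pass"],  # never reached
]
codes = run_checks(demo_commands)
print(codes)
```

The returned list shows how far the checklist got: its length is the number of commands attempted, and a non-zero last entry marks the command that failed.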

	## 9. Git and LFS Order

	If the dataset submodule changed, commit and push it first:

	```powershell
	git -C datasets/AnimeName add dmhy_weak.jsonl dmhy_weak.manifest.json dmhy_weak_char.jsonl dmhy_weak_char.manifest.json vocab.json vocab.char.json
	git -C datasets/AnimeName commit -m "Update anime filename labels"
	git -C datasets/AnimeName lfs push origin main --all
	git -C datasets/AnimeName push origin main
	```

	Then commit the model repository:

	```powershell
	git add README.md MAINTENANCE.md ANDROID.md docs/training.md docs/onnx.md `
	    config.json model.safetensors tokenizer_config.json training_args.bin vocab.json vocab.char.json `
	    exports/anime_filename_parser.onnx exports/anime_filename_parser.metadata.json `
	    train.py inference.py export_onnx.py onnx_inference.py data/parser_regression_cases.json datasets/AnimeName
	git commit -m "Update AniFileBERT model and documentation"
	git lfs push origin main --all
	git push origin main
	```