	# AniFileBERT Maintenance Manual

	This repository is the standalone Hugging Face model repository that MiruPlay
	references as the `tools/anime_parser` submodule.

	## Related Repositories

	| Repository | URL | Purpose |
	| --- | --- | --- |
	| AniFileBERT | `https://huggingface.co/ModerRAS/AniFileBERT` | Model, scripts, ONNX export |
	| AnimeName | `https://huggingface.co/datasets/ModerRAS/AnimeName` | Dataset snapshot |
	| MiruPlay | `https://github.com/ModerRAS/MiruPlay` | Android integration |

	Nested structure (`datasets/AnimeName` is a Git submodule that points to the dataset repository):

	```text
	AniFileBERT
	datasets/AnimeName -> ModerRAS/AnimeName
	```

	## Clone

	```powershell
	git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
	```

	After a normal (non-recursive) clone, fetch the submodules and install the dependencies:

	```powershell
	git submodule update --init --recursive
	uv sync
	```

	## Publishing Surface

	The repository root is the only location where the Hugging Face checkpoint is
	published:

	```text
	config.json
	model.safetensors
	tokenizer_config.json
	training_args.bin
	vocab.json
	vocab.char.json
	```

	Release reports are kept under `reports/`:

	```text
	reports/run_metadata.json
	reports/trainer_eval_metrics.json
	reports/parse_eval_metrics.json
	reports/case_metrics.json
	reports/perf_metrics.json
	reports/benchmark_results.json
	reports/training_lineage.json
	```

	The old `model/` duplicate is no longer tracked. Git-ignored `checkpoints/`
	directories are local training artifacts only.
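As a pre-publish sanity check, the two file lists above can be verified with a short script. This is a minimal sketch and not part of the repository's tooling: `missing_release_files` is a hypothetical helper name, and the file names are copied from this section.

```python
from pathlib import Path

# File lists copied from the "Publishing Surface" section.
ROOT_FILES = [
    "config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "training_args.bin",
    "vocab.json",
    "vocab.char.json",
]
REPORT_FILES = [
    "run_metadata.json",
    "trainer_eval_metrics.json",
    "parse_eval_metrics.json",
    "case_metrics.json",
    "perf_metrics.json",
    "benchmark_results.json",
    "training_lineage.json",
]


def missing_release_files(repo_root: Path) -> list[str]:
    """Return the expected release files that are absent under repo_root."""
    expected = ROOT_FILES + [f"reports/{name}" for name in REPORT_FILES]
    return [rel for rel in expected if not (repo_root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_release_files(Path("."))
    if missing:
        print("Missing release files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected release files are present.")
```

Run from the repository root, it prints any file that still has to be copied or generated before the push.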

	## Standard Training

	Current release training uses the virtual-shard flow. The full procedure is in
	[`training.md`](training.md); before starting it, byte-compile the Python
	packages as a quick syntax check and build the virtual dataset generator:

	```powershell
	uv run python -m compileall -q anifilebert tools
	cargo build --release --manifest-path tools/virtual_dataset_generator/Cargo.toml
	# Then follow docs/training.md section "Full Training with Virtual BIO Shards".
	```

	## Publish a New Checkpoint

	Copy the checkpoint files from the run's `final` directory to the repository
	root, the reports to `reports/`, and `vocab.char.json` from the dataset submodule:

	```powershell
	$final = "checkpoints/dmhy-char-virtual-sps32-10epoch-lightfocus/final"
	Copy-Item "$final/config.json" . -Force
	Copy-Item "$final/model.safetensors" . -Force
	Copy-Item "$final/tokenizer_config.json" . -Force
	Copy-Item "$final/training_args.bin" . -Force
	Copy-Item "$final/vocab.json" . -Force
	New-Item -ItemType Directory -Path reports -Force | Out-Null
	Copy-Item "$final/run_metadata.json" reports/run_metadata.json -Force
	Copy-Item "$final/trainer_eval_metrics.json" reports/trainer_eval_metrics.json -Force
	Copy-Item "$final/parse_eval_metrics.json" reports/parse_eval_metrics.json -Force
	Copy-Item "$final/case_metrics.json" reports/case_metrics.json -Force
	Copy-Item "$final/perf_metrics.json" reports/perf_metrics.json -Force
	Copy-Item datasets/AnimeName/vocab.char.json .\vocab.char.json -Force
	```
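For maintainers who are not working in PowerShell, the same copy plan can be sketched in Python. This is an illustrative equivalent under stated assumptions, not a script shipped with the repository: `publish_checkpoint` is a hypothetical name, and the source and destination pairs mirror the `Copy-Item` commands above.

```python
import shutil
from pathlib import Path

# Names mirror the Copy-Item commands in the PowerShell block.
MODEL_FILES = [
    "config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "training_args.bin",
    "vocab.json",
]
REPORT_FILES = [
    "run_metadata.json",
    "trainer_eval_metrics.json",
    "parse_eval_metrics.json",
    "case_metrics.json",
    "perf_metrics.json",
]


def publish_checkpoint(final_dir: Path, repo_root: Path) -> list[Path]:
    """Copy checkpoint files to repo_root and report files to repo_root/reports."""
    plan = [(final_dir / name, repo_root / name) for name in MODEL_FILES]
    plan += [(final_dir / name, repo_root / "reports" / name) for name in REPORT_FILES]
    plan.append(
        (
            repo_root / "datasets" / "AnimeName" / "vocab.char.json",
            repo_root / "vocab.char.json",
        )
    )

    # Fail before copying anything if a source file is missing.
    missing = [str(src) for src, _ in plan if not src.is_file()]
    if missing:
        raise FileNotFoundError(f"missing source files: {missing}")

    (repo_root / "reports").mkdir(parents=True, exist_ok=True)
    copied = []
    for src, dst in plan:
        shutil.copy2(src, dst)  # replaces an existing destination file
        copied.append(dst)
    return copied
```

Checking every source before the first copy avoids leaving the repository root with a mix of old and new checkpoint files.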

	Export the checkpoint in the repository root to ONNX:

	```powershell
	uv run python -m tools.export_onnx --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
	```

	Validate the checkpoint and the ONNX export:

	```powershell
	uv run python -m tools.evaluate_parser_cases --model-dir . --case-file data/parser_regression_cases.json --output reports/case_metrics.json
	uv run python -m tools.onnx_inference "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
	uv run python -m tools.benchmark_inference --model-dir . --onnx exports/anime_filename_parser.onnx --case-file data/parser_regression_cases.json --repeat 20 --warmup 20 --torch-threads 1 --ort-threads 1 --output reports/benchmark_results.json
	```

	The default parser path is a thin runtime: model logits, constrained BIO
	decoding, entity aggregation, and light string/number normalization. Do not add
	structural filename regex assists back to the default runtime; parser quality
	should come from labels and model training.
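To make the two middle stages of that runtime concrete, here is a minimal sketch of constrained BIO decoding followed by entity aggregation. It is not the repository's implementation: the function names, the greedy decoding strategy, the label set (`GROUP`, `EPISODE`), and the toy logits are all assumptions chosen for illustration, and the normalization step is omitted.

```python
def allowed(prev: str, label: str) -> bool:
    """BIO constraint: I-X may only continue a B-X or I-X span."""
    if not label.startswith("I-"):
        return True
    return prev in ("B-" + label[2:], "I-" + label[2:])


def constrained_bio_decode(logits, labels):
    """Greedy decode: per token, take the best-scoring label that keeps BIO valid."""
    decoded = []
    prev = "O"
    for row in logits:
        ranked = sorted(range(len(labels)), key=lambda i: row[i], reverse=True)
        choice = next(labels[i] for i in ranked if allowed(prev, labels[i]))
        decoded.append(choice)
        prev = choice
    return decoded


def aggregate_entities(tokens, tags):
    """Merge consecutive B-X / I-X tokens into (type, text) entities."""
    entities = []
    span_type, span_tokens = None, []
    for token, tag in zip(tokens, tags):
        continues = tag.startswith("I-") and tag[2:] == span_type
        if not continues:
            # Close the open span, then start a new one unless the tag is O.
            if span_type is not None:
                entities.append((span_type, "".join(span_tokens)))
            span_type, span_tokens = (tag[2:], []) if tag != "O" else (None, [])
        if span_type is not None:
            span_tokens.append(token)
    if span_type is not None:
        entities.append((span_type, "".join(span_tokens)))
    return entities


if __name__ == "__main__":
    # Hypothetical label set and hand-written logits, one row per character.
    labels = ["O", "B-GROUP", "I-GROUP", "B-EPISODE", "I-EPISODE"]
    tokens = list("[GM]12")
    logits = [
        [5, 0, 0, 0, 0],  # "["
        [0, 2, 3, 0, 0],  # "G": raw argmax is I-GROUP, which is invalid after O
        [0, 0, 4, 0, 0],  # "M"
        [5, 0, 0, 0, 0],  # "]"
        [0, 0, 0, 4, 1],  # "1"
        [0, 0, 1, 0, 3],  # "2"
    ]
    tags = constrained_bio_decode(logits, labels)
    print(aggregate_entities(tokens, tags))
    # [('GROUP', 'GM'), ('EPISODE', '12')]
```

The second row shows why the constraint matters: an unconstrained argmax would emit `I-GROUP` with no opening `B-GROUP`, while the constrained decoder falls back to the next-best valid label.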

	## Dataset Submodule

	If `datasets/AnimeName` changed, commit and push it first:

	```powershell
	git -C datasets/AnimeName add dmhy_weak.jsonl dmhy_weak.manifest.json dmhy_weak_char.jsonl dmhy_weak_char.manifest.json vocab.json vocab.char.json
	git -C datasets/AnimeName commit -m "Update anime filename labels"
	git -C datasets/AnimeName lfs push origin main --all
	git -C datasets/AnimeName push origin main
	```

	Then commit the updated submodule pointer in this repository:

	```powershell
	git add datasets/AnimeName
	git commit -m "Update AnimeName dataset pointer"
	```
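Whether the pointer still needs to be committed can be read from `git submodule status`, which prefixes each line with a state character. The parser below is a small sketch for that output; `parse_submodule_status` is a hypothetical helper, and it assumes submodule paths contain no spaces.

```python
# Prefix characters used by `git submodule status`.
STATES = {
    " ": "in sync",
    "-": "not initialized",
    "+": "checked-out commit differs from the recorded pointer",
    "U": "merge conflict",
}


def parse_submodule_status(output: str) -> dict[str, str]:
    """Map each submodule path to a readable state."""
    result = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        prefix, rest = line[0], line[1:]
        fields = rest.split()  # [sha, path, optional "(describe)"]
        result[fields[1]] = STATES.get(prefix, "unknown")
    return result


if __name__ == "__main__":
    import subprocess

    raw = subprocess.run(
        ["git", "submodule", "status", "--recursive"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    for path, state in parse_submodule_status(raw).items():
        print(f"{path}: {state}")
```

A `+` state for `datasets/AnimeName` means the submodule checkout is ahead of (or otherwise different from) the commit recorded in this repository, so the `git add` and `git commit` above are still pending.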

	## LFS Push Order

	Large model artifacts are tracked with Git LFS. If Hugging Face rejects a push
	because an LFS pointer refers to a missing object, upload the LFS objects first
	and then push the commits:

	```powershell
	git lfs push origin main --all
	git push origin main
	```

	For dataset changes, apply the same order inside the submodule:

	```powershell
	git -C datasets/AnimeName lfs push origin main --all
	git -C datasets/AnimeName push origin main
	```
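For context on the rejection described in this section: Git stores a small text pointer in place of each LFS-tracked file, and the push fails when the object named by that pointer has not been uploaded. The sketch below parses such a pointer so the object ID can be inspected; `parse_lfs_pointer` is a hypothetical helper, and it only handles the standard `version` / `oid sha256:` / `size` layout.

```python
LFS_SPEC_LINE = "version https://git-lfs.github.com/spec/v1"


def parse_lfs_pointer(text: str):
    """Return {'oid': ..., 'size': ...} if text is a Git LFS pointer, else None."""
    lines = text.strip().splitlines()
    if not lines or lines[0] != LFS_SPEC_LINE:
        return None
    # Remaining lines are "key value" pairs such as "oid sha256:<hex>".
    fields = dict(line.split(" ", 1) for line in lines[1:] if " " in line)
    oid = fields.get("oid", "")
    size = fields.get("size", "")
    if not oid.startswith("sha256:") or not size.isdigit():
        return None
    return {"oid": oid[len("sha256:"):], "size": int(size)}
```

The pointer text of a committed file can be printed with `git cat-file -p HEAD:model.safetensors` and passed to this function.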

	## Update MiruPlay

	From the MiruPlay root:

	```powershell
	git submodule update --remote --recursive tools/anime_parser
	git add tools/anime_parser
	git commit -m "Update AniFileBERT submodule"
	```

	If the Android assets changed, also stage and commit them:

	```text
	scraper/src/main/assets/anime_parser/anime_filename_parser.onnx
	scraper/src/main/assets/anime_parser/config.json
	scraper/src/main/assets/anime_parser/vocab.json
	```
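One way to tell whether those assets are out of date is to compare file hashes. The sketch below is hedged accordingly: `stale_assets` is a hypothetical helper, and the mapping from asset names to source files inside `tools/anime_parser` is an assumption (this document does not state where each asset is copied from), so adjust `ASSET_SOURCES` to the real layout.

```python
import hashlib
from pathlib import Path

ASSET_DIR = Path("scraper/src/main/assets/anime_parser")

# ASSUMPTION: source locations inside the submodule; not confirmed by this manual.
ASSET_SOURCES = {
    "anime_filename_parser.onnx": Path("tools/anime_parser/exports/anime_filename_parser.onnx"),
    "config.json": Path("tools/anime_parser/config.json"),
    "vocab.json": Path("tools/anime_parser/vocab.json"),
}


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large model files are not read at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stale_assets(miruplay_root: Path) -> list[str]:
    """Return asset names whose source or copy is missing, or whose hashes differ."""
    stale = []
    for name, source in ASSET_SOURCES.items():
        src = miruplay_root / source
        dst = miruplay_root / ASSET_DIR / name
        if not src.is_file() or not dst.is_file() or sha256_of(src) != sha256_of(dst):
            stale.append(name)
    return stale
```

An empty result from `stale_assets(Path("."))` in the MiruPlay root would indicate that, under the assumed mapping, the bundled assets match the submodule.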