	# ONNX Usage / ONNX 使用说明

	AniFileBERT exports a static-shape ONNX graph for Android and local inference.

	AniFileBERT 导出静态 shape 的 ONNX 图，用于 Android 和本地推理。

	## 1. What ONNX Contains / ONNX 包含什么

	The ONNX graph contains only the BERT token-classification forward pass:

	ONNX 图只包含 BERT token-classification 前向计算：

	```text
	input_ids int64[1,128]
	attention_mask int64[1,128]
	logits float32[1,128,15]
	```

	It does not contain:

	它不包含：

	- filename tokenization / 文件名分词
	- token-to-id conversion / token 到 id 的转换
	- constrained BIO decoding / 约束 BIO 解码
	- field aggregation / 字段聚合
	- thin string and number normalization / 薄字符串和数字规范化

	Those steps must stay aligned with `anifilebert/tokenizer.py`, `anifilebert/inference.py`, `config.json`,
	and `vocab.json`.

	这些步骤必须与 `anifilebert/tokenizer.py`、`anifilebert/inference.py`、`config.json`、`vocab.json`
	保持一致。

	## 2. Export / 导出

	```powershell
	uv run python -m tools.export_onnx --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
	```

	The exporter also writes:

	导出器还会写入：

	```text
	exports/anime_filename_parser.metadata.json
	```

	The metadata records the sample filename, the output shape, and the maximum
	absolute difference between the PyTorch and ONNX logits.

	metadata 会记录样本文件名、输出 shape、PyTorch/ONNX logits 最大绝对误差。
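The parity figure recorded in the metadata can be reproduced by hand. The following is a minimal sketch, assuming both backends' logits are available as nested Python lists of the same shape; `flatten` and `max_abs_diff` are hypothetical helper names, not functions from `tools.export_onnx`.

```python
from typing import Iterator


def flatten(values) -> Iterator[float]:
    """Yield every scalar in an arbitrarily nested list of logits."""
    for v in values:
        if isinstance(v, (list, tuple)):
            yield from flatten(v)
        else:
            yield float(v)


def max_abs_diff(torch_logits, onnx_logits) -> float:
    """Largest element-wise |a - b| between two logit tensors."""
    a, b = list(flatten(torch_logits)), list(flatten(onnx_logits))
    if len(a) != len(b):
        raise ValueError("logit tensors differ in size")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)
```

A value that grows after a re-export is a hint that the exported graph and the PyTorch checkpoint have drifted apart.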

	## 3. Local ONNX Inference / 本地 ONNX 推理

	Use `python -m tools.onnx_inference` as the minimal runnable reference.

	使用 `python -m tools.onnx_inference` 作为最小可运行参考实现。

	```powershell
	uv run python -m tools.onnx_inference "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
	```

	Expected:

	期望输出：

	```json
	{"title":"神印王座","season":null,"episode":200,"group":"GM-Team","resolution":"1080P","source":"GB","special":null}
	```

	Special-code example:

	特典编号示例：

	```powershell
	uv run python -m tools.onnx_inference "[YYDM&VCB-Studio] Shinsekai Yori [NCED02][Ma10p_1080p][x265_flac].mkv"
	```

	Expected:

	期望输出：

	```json
	{"title":"Shinsekai Yori","season":null,"episode":null,"group":"YYDM&VCB-Studio","resolution":"1080p","source":"x265_flac","special":"NCED02"}
	```

	## 4. Implementation Steps / 实现步骤

	The runtime parser should do this:

	运行时解析器应按以下步骤实现：

	1. Tokenize filename with the custom character tokenizer.
	使用自定义字符 tokenizer 对文件名分词。
	2. Truncate to `max_length - 2` tokens, then add `[CLS]` and `[SEP]`.
	先截断到 `max_length - 2` 个 token，再添加 `[CLS]` 和 `[SEP]`。
	3. Convert tokens to ids with `vocab.json`.
	使用 `vocab.json` 转换 token id。
	4. Pad `input_ids` and `attention_mask` to exactly `128`.
	将 `input_ids` 和 `attention_mask` padding 到固定 `128`。
	5. Run ONNX Runtime.
	执行 ONNX Runtime。
	6. Slice logits back to real token count, excluding `[CLS]` and `[SEP]`.
	去掉 `[CLS]` / `[SEP]`，只保留真实 token 的 logits。
	7. Decode labels with constrained BIO transitions.
	使用约束 BIO transition 解码标签。
	8. Aggregate labels into parser fields.
	聚合标签为结构化字段。
	9. Apply thin normalization only: trim brackets, normalize source text, and
	convert numeric fields.
	只做薄层规范化：裁剪括号/扩展名并转换数字字段。
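Steps 2 to 4 and step 6 above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `anifilebert/inference.py`: it assumes `vocab.json` contains `[PAD]` and `[UNK]` entries, and it takes already-tokenized input because `AnimeTokenizer` itself is not reproduced here. `encode` and `slice_real_logits` are hypothetical names.

```python
from typing import Dict, List, Tuple

MAX_LENGTH = 128  # must match the static shape of the exported graph


def encode(tokens: List[str], vocab: Dict[str, int],
           max_length: int = MAX_LENGTH) -> Tuple[List[int], List[int], int]:
    """Return (input_ids, attention_mask, real_token_count)."""
    # Step 2: keep at most max_length - 2 tokens, then wrap with [CLS]/[SEP].
    kept = tokens[: max_length - 2]
    wrapped = ["[CLS]"] + kept + ["[SEP]"]
    # Step 3: map tokens to ids. The [UNK] fallback is an assumption.
    unk_id = vocab["[UNK]"]
    ids = [vocab.get(tok, unk_id) for tok in wrapped]
    # Step 4: pad both arrays to exactly max_length. [PAD] is an assumption.
    pad = max_length - len(ids)
    input_ids = ids + [vocab["[PAD]"]] * pad
    attention_mask = [1] * len(ids) + [0] * pad
    return input_ids, attention_mask, len(kept)


def slice_real_logits(logits: list, real_count: int) -> list:
    """Step 6: drop the [CLS] row (index 0), the [SEP] row, and padding rows."""
    return logits[1 : 1 + real_count]
```

Returning the real token count from `encode` keeps the slicing in step 6 independent of how much padding was added.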

	The ONNX reference runtime intentionally matches the Python thin runtime. It
	does not include structural filename regex assists.

	ONNX 参考运行时有意与 Python 薄层运行时保持一致，不包含结构化文件名正则辅助。
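Steps 7 and 8 can be illustrated with a small Viterbi search that forbids invalid BIO transitions, followed by span aggregation. This is a sketch under assumptions: the only constraint modeled is that `I-X` must follow `B-X` or `I-X`, the label names are toy values, and tokens are joined without separators. The actual transition rules and label order are defined by `anifilebert/inference.py` and `config.json`.

```python
from typing import List, Tuple

NEG_INF = float("-inf")


def allowed(prev: str, cur: str) -> bool:
    """I-X may only follow B-X or I-X; every other transition is allowed."""
    if cur.startswith("I-"):
        return prev in ("B-" + cur[2:], "I-" + cur[2:])
    return True


def constrained_decode(logits: List[List[float]], labels: List[str]) -> List[str]:
    """Viterbi search over per-token logits with hard BIO constraints."""
    if not logits:
        return []
    n = len(labels)
    # A sequence may not start with an I- label.
    score = [NEG_INF if labels[j].startswith("I-") else logits[0][j] for j in range(n)]
    back: List[List[int]] = []
    for row in logits[1:]:
        new_score, pointers = [], []
        for j in range(n):
            candidates = [i for i in range(n) if allowed(labels[i], labels[j])]
            best_i = max(candidates, key=lambda i: score[i])
            new_score.append(score[best_i] + row[j])
            pointers.append(best_i)
        score = new_score
        back.append(pointers)
    # Walk the back-pointers from the best final label.
    j = max(range(n), key=lambda k: score[k])
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    return [labels[k] for k in reversed(path)]


def aggregate(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group B-/I- runs into (field, text) spans."""
    spans, field, buf = [], None, []
    for tok, tag in zip(tokens, tags):
        if tag == "O" or tag.startswith("B-") or tag[2:] != field:
            if field is not None:
                spans.append((field, "".join(buf)))
            field, buf = (None, []) if tag == "O" else (tag[2:], [tok])
        else:
            buf.append(tok)
    if field is not None:
        spans.append((field, "".join(buf)))
    return spans
```

Unlike a per-token argmax, the constrained search cannot emit an `I-` tag without a preceding `B-` or `I-` tag of the same field, so the aggregation step never sees a span that starts in the middle.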

	## 5. Android Notes / Android 注意事项

	Android must bundle these files together:

	Android 端必须同时打包：

	```text
	anime_filename_parser.onnx
	vocab.json
	config.json
	```

	When changing any of them, update all of them in the same commit.

	只要其中任意一个变化，三者必须在同一次提交中一起更新。

	## 6. Common Mistakes / 常见错误

	### Using a standard Hugging Face tokenizer / 误用标准 Hugging Face tokenizer

	This model uses `AnimeTokenizer`, not WordPiece/BPE.

	本模型使用 `AnimeTokenizer`，不是 WordPiece/BPE。

	### Treating ONNX output as final fields / 把 ONNX 输出当成最终字段

	ONNX returns token logits. You still need BIO decode and field aggregation.

	ONNX 返回 token logits，仍然需要 BIO 解码和字段聚合。

	### Changing max length without updating Android / 改 max length 但没有同步 Android

	The exported graph is static. Runtime arrays must match `[1,128]`.

	导出的图是静态 shape，运行时数组必须匹配 `[1,128]`。
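One way to catch a shape mismatch early is to validate the feeds before they reach the runtime. The check below is a hypothetical helper, assuming the feeds are passed as nested lists keyed by the input names `input_ids` and `attention_mask`; it is not part of the repository.

```python
from typing import Dict, List, Sequence

EXPECTED_SHAPE = (1, 128)  # static shape of input_ids and attention_mask


def shape_errors(feeds: Dict[str, Sequence[Sequence[int]]]) -> List[str]:
    """Return one message per input whose shape differs from [1,128]."""
    errors = []
    for name in ("input_ids", "attention_mask"):
        if name not in feeds:
            errors.append(f"{name}: missing")
            continue
        batch = feeds[name]
        # Only the first row is inspected because the batch size is fixed at 1.
        shape = (len(batch), len(batch[0]) if len(batch) else 0)
        if shape != EXPECTED_SHAPE:
            errors.append(
                f"{name}: got {list(shape)}, expected {list(EXPECTED_SHAPE)}"
            )
    return errors
```

An empty list means both arrays match the exported graph. Any message points at the input that needs to be re-padded or re-exported.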

	## 7. Benchmark / 性能基准

	Run:

	运行：

	```powershell
	uv run python -m tools.benchmark_inference --model-dir . --onnx exports/anime_filename_parser.onnx --case-file data/parser_regression_cases.json --repeat 20 --warmup 20 --torch-threads 1 --ort-threads 1 --output reports/benchmark_results.json
	```

	Local single-thread CPU result, measured on 26 real-world regression cases with
	the default thin runtime:

	本地 CPU 单线程结果，使用 26 条真实回归 case 和默认薄层运行时：

	| Backend / 后端 | Load ms / 加载 ms | Avg ms / 平均 ms | P50 ms | P95 ms | P99 ms | files/s |
	| --- | ---: | ---: | ---: | ---: | ---: | ---: |
	| PyTorch | 46.35 | 15.36 | 14.25 | 22.27 | 29.75 | 65.1 |
	| ONNX Runtime | 50.92 | 12.04 | 11.90 | 13.81 | 15.38 | 83.1 |

	The benchmark includes tokenization, model/session forward, constrained BIO
	decode, entity aggregation, and thin normalization. It does not include
	repeatedly constructing the ONNX Runtime session inside the loop.

	该基准包含 tokenizer、模型/session 前向、约束 BIO 解码、实体聚合和薄层规范化；
	循环内不会重复创建 ONNX Runtime session。
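The table columns can be derived from a list of per-file latencies. The sketch below assumes nearest-rank percentiles and a throughput of `1000 / avg_ms`; the throughput relation is consistent with the table (1000 / 15.36 ≈ 65.1 and 1000 / 12.04 ≈ 83.1), but the percentile method used by `tools.benchmark_inference` is not stated, so treat that part as an assumption. `summarize` is a hypothetical name.

```python
import math
from statistics import mean
from typing import Dict, List


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """Summarize per-file latencies (milliseconds) into benchmark columns."""
    if not latencies_ms:
        raise ValueError("no latencies recorded")
    ordered = sorted(latencies_ms)

    def pct(p: float) -> float:
        # Nearest-rank percentile: smallest value covering p% of the samples.
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    avg = mean(ordered)
    return {
        "avg_ms": avg,
        "p50_ms": pct(50),
        "p95_ms": pct(95),
        "p99_ms": pct(99),
        "files_per_s": 1000.0 / avg,
    }
```

Because session construction is excluded from the loop, the latencies passed in should cover only tokenization through normalization for each file.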