	---
	license: apache-2.0
	library_name: transformers
	pipeline_tag: token-classification
	tags:
	- anime
	- filename-parsing
	- bert
	- token-classification
	- onnx
	datasets:
	- ModerRAS/AnimeName
	language:
	- en
	- ja
	- zh
	model-index:
	- name: AniFileBERT
	  results:
	  - task:
	      type: token-classification
	      name: Anime filename token classification
	    dataset:
	      name: AniFileBERT fixed parser regression cases
	      type: parser-regression
	    metrics:
	    - type: accuracy
	      name: Fixed parser model-only full-match accuracy
	      value: 0.9615
	    - type: accuracy
	      name: Fixed parser thin-runtime full-match accuracy
	      value: 1.0
	---

	# AniFileBERT

	中文：AniFileBERT 是一个面向番剧发布文件名的轻量级 BERT token-classification 解析器。它把常见发布名解析为结构化字段：字幕组、标题、季、集数、分辨率、来源和 special tag。

	English: AniFileBERT is a lightweight BERT token-classification parser for anime release filenames. It extracts structured fields: release group, title, season, episode, resolution, source, and special tags.

	This repository is the Hugging Face model repo used by MiruPlay as `tools/anime_parser`.

	## Model Details / 模型信息

	| Item | Value |
	| --- | --- |
	| Architecture / 架构 | `BertForTokenClassification` |
	| Tokenizer / 分词器 | Custom character tokenizer in `tokenizer.py` |
	| Parameters / 参数量 | 4,783,631 |
	| Hidden size / 隐层维度 | 256 |
	| Layers / 层数 | 4 |
	| Attention heads / 注意力头 | 8 |
	| Max sequence length / 最大长度 | 128 |
	| Labels / 标签 | BIO labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, `SPECIAL` |
	| Default checkpoint / 默认权重 | Repository root files (`config.json`, `model.safetensors`, `vocab.json`, `tokenizer_config.json`) |
	| ONNX export / ONNX 导出 | `exports/anime_filename_parser.onnx` |
	| Training lineage / 训练链路 | `training_lineage.json` |

	中文：根目录就是发布 checkpoint，不再保留旧的 `model/` 重复副本。默认解析路径是“模型 logits + 约束 BIO + 薄字段规范化”，不再默认启用重结构规则；直接 `from_pretrained()` 只能加载 token-classification 权重。

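	English: The repository root is the published checkpoint; the old duplicate `model/` copy is no longer kept. The default parser pipeline is model logits + constrained BIO decoding + thin field normalization, and heavy structural rule assists are not enabled by default. Calling `from_pretrained()` directly loads only the token-classification weights, not the custom tokenizer or postprocessing.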
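
The "constrained BIO" step can be pictured as a decoder that never emits an `I-X` tag unless the previous tag was `B-X` or `I-X`. The sketch below is a greedy version written under assumptions: the label order in `LABELS` is hypothetical (the real mapping is in `config.json`), and the repository's own decoder may use a different strategy.

```python
from typing import List, Sequence

ENTITY_TYPES = ["TITLE", "SEASON", "EPISODE", "GROUP", "RESOLUTION", "SOURCE", "SPECIAL"]
# Hypothetical label order: "O" plus B-/I- for each of the 7 types = 15 labels.
LABELS: List[str] = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in ("B", "I")]


def is_allowed(prev: str, cur: str) -> bool:
    """I-X may only follow B-X or I-X; every other transition is allowed."""
    if not cur.startswith("I-"):
        return True
    return prev in (f"B-{cur[2:]}", cur)


def constrained_greedy_decode(logits: Sequence[Sequence[float]]) -> List[str]:
    """Per position, pick the highest-scoring label that keeps the BIO sequence valid."""
    tags: List[str] = []
    prev = "O"  # treat the start of the sequence as following "O"
    for row in logits:
        ranked = sorted(range(len(LABELS)), key=lambda i: row[i], reverse=True)
        # "O" and every B- tag are always allowed, so a candidate always exists.
        best = next(LABELS[i] for i in ranked if is_allowed(prev, LABELS[i]))
        tags.append(best)
        prev = best
    return tags
```

A greedy pass is the simplest way to enforce the constraint; a Viterbi search over the same transition rule would give the globally best valid sequence at slightly higher cost.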

	## Intended Use / 使用场景

	中文

	- 解析番剧/动画发布文件名，用于媒体库刮削、归类、搜索和展示。
	- 覆盖常见结构：`[GROUP] TITLE - EP [META]`、点分隔 `S01E07`、国漫多括号标题、BD 特典 `NCOP/NCED/IV05`、长集数、第二季别名等。
	- 不适合泛化为自然语言 NER；这是结构化文件名解析任务。

	English

	- Parse anime release filenames for media library scraping, classification, search, and display.
	- Covers common layouts: `[GROUP] TITLE - EP [META]`, dotted `S01E07`, Chinese animation bracket layouts, BD extras such as `NCOP/NCED/IV05`, long-running episode numbers, and season aliases.
	- This is not a general natural-language NER model; it is a structured filename parser.

	## Install / 安装

	```powershell
	uv sync
	```

	If the dataset submodule is missing:

	```powershell
	git submodule update --init --recursive
	```

	## Quick Start / 快速使用

	Run the Python parser:

	```powershell
	uv run python inference.py --model-dir . "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
	```

	Expected output:

	```json
	{"title":"神印王座","season":null,"episode":200,"group":"GM-Team","resolution":"1080P","source":"GB","special":null}
	```
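
Downstream code can consume this JSON line directly. The following is a minimal sketch, assuming the seven keys shown above are always present and that `season` and `episode` are integers or `null`; the `ParsedName` class and `parse_output_line` helper are hypothetical names, not part of this repository.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedName:
    """Hypothetical container mirroring the keys of the parser's JSON output."""
    title: Optional[str]
    season: Optional[int]
    episode: Optional[int]
    group: Optional[str]
    resolution: Optional[str]
    source: Optional[str]
    special: Optional[str]


def parse_output_line(line: str) -> ParsedName:
    """Decode one JSON line printed by the parser into a typed record."""
    return ParsedName(**json.loads(line))


line = (
    '{"title":"神印王座","season":null,"episode":200,"group":"GM-Team",'
    '"resolution":"1080P","source":"GB","special":null}'
)
parsed = parse_output_line(line)
print(parsed.title, parsed.episode, parsed.season)
```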

	Load the raw Transformers model:

	```python
	from transformers import BertForTokenClassification

	model = BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
	```

	中文：如果需要完整字段解析，请 clone 本仓库并使用 `inference.py`，因为分词器和后处理是自定义的。

	English: For complete field parsing, clone this repo and use `inference.py`; the tokenizer and postprocessing are custom.

	## ONNX Usage / ONNX 使用

	The ONNX graph outputs token logits only. A complete parser still needs:

	1. custom character tokenization,
	2. constrained BIO decoding,
	3. field aggregation and thin string/number normalization.
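
Step 3 can be sketched as follows: per-character BIO tags are merged into spans, then each field gets a light normalization. This is an illustration under assumptions, not the repository's implementation: the function names are hypothetical, the "first span per field wins" policy is a guess, and the only normalization shown is converting numeric season/episode text to integers.

```python
from typing import Dict, List, Optional, Tuple


def bio_to_spans(chars: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge per-character BIO tags into (entity_type, text) spans."""
    spans: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    buffer: List[str] = []
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_type):
            # A new entity starts here (a stray I- tag is treated like B-).
            if current_type is not None:
                spans.append((current_type, "".join(buffer)))
            current_type, buffer = tag[2:], [ch]
        elif tag.startswith("I-"):
            buffer.append(ch)  # continue the open entity
        else:
            # "O" closes any open entity.
            if current_type is not None:
                spans.append((current_type, "".join(buffer)))
            current_type, buffer = None, []
    if current_type is not None:
        spans.append((current_type, "".join(buffer)))
    return spans


def aggregate_fields(spans: List[Tuple[str, str]]) -> Dict[str, object]:
    """Keep the first span per field and apply light normalization."""
    fields: Dict[str, object] = {}
    for entity_type, text in spans:
        key = entity_type.lower()
        if key in fields:
            continue  # assumption: the first occurrence wins
        text = text.strip()
        if key in ("season", "episode") and text.isdecimal():
            fields[key] = int(text)  # e.g. "07" becomes 7
        else:
            fields[key] = text
    return fields


chars = list("[AB] Xy - 07")
tags = ["O", "B-GROUP", "I-GROUP", "O", "O", "B-TITLE", "I-TITLE",
        "O", "O", "O", "B-EPISODE", "I-EPISODE"]
print(aggregate_fields(bio_to_spans(chars, tags)))
```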

	This repository provides a minimal runnable example:

	```powershell
	uv run python onnx_inference.py "[YYDM&VCB-Studio] Shinsekai Yori [NCED02][Ma10p_1080p][x265_flac].mkv"
	```

	Static graph shapes:

	- `input_ids`: `int64[1,128]`
	- `attention_mask`: `int64[1,128]`
	- `logits`: `float32[1,128,15]`
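
Because the shapes are static, every filename has to be truncated or padded to exactly 128 positions before it reaches the graph. The sketch below shows one way to do that with a character vocabulary. The `encode_chars` helper, the toy vocabulary, and the pad/unknown ids are assumptions for illustration; the real tokenizer in `tokenizer.py` may add special tokens or use different ids.

```python
from typing import Dict, List, Tuple

MAX_LEN = 128  # matches the static ONNX input shape [1, 128]


def encode_chars(
    text: str, vocab: Dict[str, int], pad_id: int = 0, unk_id: int = 1
) -> Tuple[List[int], List[int]]:
    """Map each character to an id, then truncate or pad to MAX_LEN."""
    ids = [vocab.get(ch, unk_id) for ch in text][:MAX_LEN]
    mask = [1] * len(ids)  # 1 marks real characters, 0 marks padding
    padding = MAX_LEN - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding


# Toy vocabulary for illustration only; the real one ships as vocab.json.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[": 2, "]": 3, "A": 4}
input_ids, attention_mask = encode_chars("[A]?", toy_vocab)
print(len(input_ids), input_ids[:5], sum(attention_mask))
```

To feed these lists to an inference session, wrap each one in a batch dimension and cast to 64-bit integers, for example `numpy.asarray([input_ids], dtype=numpy.int64)`.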

	More details: [`docs/onnx.md`](docs/onnx.md) and [`ANDROID.md`](ANDROID.md).

	## Evaluation / 评估

	Current published checkpoint:

	| Metric / 指标 | Value / 数值 |
	| --- | --- |
	| Fixed regression, model-only / 固定回归，纯模型聚合 | 25/26 full match = `96.15%` |
	| Fixed regression, default thin runtime / 固定回归，默认薄层运行时 | 26/26 full match = `100%` |
	| Focus held-out, model-only / 困难抽样，纯模型聚合 | 1014/1024 full match = `99.02%` |
	| Focus held-out, default thin runtime / 困难抽样，默认薄层运行时 | 1017/1024 full match = `99.32%` |
	| Token/entity eval / token/entity 评估 | F1 `0.9972`, token accuracy `0.9995` |
	| ONNX parity / ONNX 误差 | max abs diff `4.0531e-05` |
	| CPU thin-runtime latency / CPU 薄层运行时延迟 | ONNX avg `13.18 ms`, P95 `16.70 ms` |

	中文：当前发布模型是“两阶段训练”产物：先在 `datasets/AnimeName/dmhy_weak_char.jsonl` 上全量 CUDA 重训，再做 thin hard-case focus 微调。细节见 `training_lineage.json`。README 主指标以 `model-only` 和默认薄层 `normalized-only` 为准；旧版结构规则辅助层已移除，不再作为运行时或质量对照。

	English: The published checkpoint was trained in two stages: a full CUDA fine-tune on `datasets/AnimeName/dmhy_weak_char.jsonl`, followed by a thin hard-case focus fine-tune. See `training_lineage.json` for details. README quality numbers prioritize `model-only` and the default thin `normalized-only` runtime; structural filename assists have been removed from the runtime and quality reports.

	Run regression:

	```powershell
	uv run python evaluate_parser_cases.py --model-dir . --case-file data/parser_regression_cases.json --output case_metrics.json
	```
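
The "full match" numbers above can be read as exact agreement on every output field. The sketch below shows that computation under this assumption; the helper names are hypothetical and `evaluate_parser_cases.py` may define the metric differently.

```python
from typing import Dict, List, Optional

FIELDS = ("title", "season", "episode", "group", "resolution", "source", "special")
Parsed = Dict[str, Optional[object]]


def is_full_match(predicted: Parsed, expected: Parsed) -> bool:
    """A case counts only if every field agrees exactly (missing keys compare as None)."""
    return all(predicted.get(f) == expected.get(f) for f in FIELDS)


def full_match_accuracy(predictions: List[Parsed], references: List[Parsed]) -> float:
    """Fraction of cases whose predicted fields all equal the reference fields."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(is_full_match(p, r) for p, r in zip(predictions, references))
    return hits / len(references)
```

With 25 of 26 cases matching, this yields 25/26, which rounds to the `0.9615` reported for the model-only run.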

	## Performance / 性能

	Benchmark command:

	性能测试命令：

	```powershell
	uv run python benchmark_inference.py --model-dir . --onnx exports/anime_filename_parser.onnx --case-file data/parser_regression_cases.json --repeat 20 --warmup 20 --torch-threads 1 --ort-threads 1 --output benchmark_results.json
	```

	Local single-threaded CPU benchmark on the 26 fixed real-world cases, using the
	default thin runtime. Timings cover tokenization, the model/session forward pass,
	constrained BIO decoding, entity aggregation, and light string/number normalization:

	本地 CPU 单线程测试，使用 26 条固定真实 case，默认薄层运行时，包含 tokenizer、
	模型/session 前向、约束 BIO 解码、实体聚合和轻量字符串/数字规范化：

	| Backend / 后端 | Load ms / 加载 ms | Avg ms / 平均 ms | P50 ms | P95 ms | P99 ms | files/s |
	| --- | ---: | ---: | ---: | ---: | ---: | ---: |
	| PyTorch | 76.56 | 16.85 | 16.21 | 22.84 | 28.31 | 59.4 |
	| ONNX Runtime | 49.74 | 13.18 | 12.86 | 16.70 | 18.06 | 75.9 |

	中文：这是完整薄层 parser 的端到端延迟，不是只测模型 forward。移动端实现应复用 ONNX session，并保持 tokenizer/BIO/薄规范化逻辑一致。

	English: This is end-to-end thin-parser latency, not model-forward-only timing. Mobile code should reuse the ONNX session and keep tokenizer, BIO decoding, and thin-normalization behavior consistent with this implementation.
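
For readers reproducing the table from raw per-file timings, the summary statistics could be derived as in the sketch below. It assumes nearest-rank percentiles and takes throughput as 1000 divided by the average latency in milliseconds; `benchmark_inference.py` may use a different percentile method or compute throughput from total wall time, so small differences from the table are expected.

```python
import math
from typing import Dict, List


def percentile(sorted_ms: List[float], q: float) -> float:
    """Nearest-rank percentile of an ascending list, with q in (0, 100]."""
    rank = max(1, math.ceil(q * len(sorted_ms) / 100))
    return sorted_ms[rank - 1]


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """Average, P50/P95/P99, and files per second for end-to-end timings."""
    ordered = sorted(latencies_ms)
    avg = sum(ordered) / len(ordered)
    return {
        "avg_ms": avg,
        "p50_ms": percentile(ordered, 50),
        "p95_ms": percentile(ordered, 95),
        "p99_ms": percentile(ordered, 99),
        "files_per_s": 1000.0 / avg,  # assumption: one file per timed call
    }
```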

	## Training / 训练

	Training uses the dataset submodule at `datasets/AnimeName`.

	Recommended full character-token run:

	```powershell
	uv run python train.py --tokenizer char `
	--data-file datasets/AnimeName/dmhy_weak_char.jsonl `
	--vocab-file datasets/AnimeName/vocab.char.json `
	--save-dir checkpoints/dmhy-char-full `
	--init-model-dir . `
	--epochs 2 `
	--batch-size 256 `
	--learning-rate 0.00008 `
	--warmup-steps 300 `
	--max-seq-length 128 `
	--train-split 0.98 `
	--num-workers 4 `
	--checkpoint-steps 1000 `
	--save-total-limit 3 `
	--parse-eval-limit 2048 `
	--case-eval-file data/parser_regression_cases.json `
	--seed 52 `
	--experiment-name dmhy-char-full
	```

	`train.py` writes:

	- Hugging Face checkpoints under `--save-dir`,
	- `final/run_metadata.json`,
	- `final/trainer_eval_metrics.json`,
	- `final/parse_eval_metrics.json`,
	- `final/case_metrics.json` unless `--no-case-eval` is used,
	- TensorBoard logs unless `--no-tensorboard` is used.

	Full workflow: [`docs/training.md`](docs/training.md).

	## Dataset / 数据集

	Authoritative dataset snapshot:

	```text
	datasets/AnimeName/dmhy_weak.jsonl
	datasets/AnimeName/dmhy_weak_char.jsonl
	datasets/AnimeName/vocab.json
	datasets/AnimeName/vocab.char.json
	```

	Current snapshot:

	- rows / 行数: `632002`
	- failed relabel rows / 重标注失败行: `0`
	- strict BIO violations / 严格 BIO 违规: `0`
	- character vocab / 字符词表: `6199`
	- character coverage / 字符覆盖率: `100%`
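
The two integrity figures in this list can be checked with a few lines of code. The sketch below assumes that a "strict BIO violation" means an `I-X` tag not preceded by `B-X` or `I-X`, and that "character coverage" is the share of characters in the filenames that appear in the character vocabulary. Both definitions and both function names are assumptions, not taken from the dataset tooling.

```python
from typing import Iterable, List, Set


def count_bio_violations(tags: List[str]) -> int:
    """Count I-X tags that do not continue a B-X or I-X of the same type."""
    violations = 0
    prev = "O"  # the sequence start behaves like "O"
    for tag in tags:
        if tag.startswith("I-") and prev not in (f"B-{tag[2:]}", tag):
            violations += 1
        prev = tag
    return violations


def char_coverage(texts: Iterable[str], vocab: Set[str]) -> float:
    """Fraction of all characters in `texts` that are present in `vocab`."""
    total = 0
    covered = 0
    for text in texts:
        for ch in text:
            total += 1
            covered += ch in vocab  # bool counts as 0 or 1
    return covered / total if total else 1.0
```

A snapshot with zero violations and 100% coverage means every tag sequence passes the first check and the second function returns `1.0` over all rows.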

	中文：`datasets/AnimeName` 是嵌套数据集仓库。更新数据后需要先提交/推送子仓库，再提交父仓库的 submodule pointer。

	English: `datasets/AnimeName` is a nested dataset repository. Commit and push the dataset repo first, then commit the updated submodule pointer in this model repo.

	## Repository Layout / 仓库结构

	```text
	config.json
	model.safetensors
	tokenizer_config.json
	vocab.json
	training_args.bin
	inference.py
	onnx_inference.py
	export_onnx.py
	train.py
	dataset.py
	tokenizer.py
	dmhy_dataset.py
	label_repairs.py
	relabel_dataset_from_filenames.py
	convert_to_char_dataset.py
	data/parser_regression_cases.json
	datasets/AnimeName/
	exports/anime_filename_parser.onnx
	docs/
	```

	## Maintenance / 维护

	See [`MAINTENANCE.md`](MAINTENANCE.md) for release steps, LFS order, dataset submodule updates, and MiruPlay integration notes.

	## Limitations / 局限

	中文

	- 发布命名没有统一标准，极端 OCR 噪声、乱码、非动画命名仍可能失败。
	- ONNX 只包含模型 logits，不包含 tokenizer、BIO decode 和薄字段规范化；移动端必须保持 tokenizer/vocab/config 一致。
	- `source` 当前是单值字段，复杂文件名里可能同时存在平台、发布源、编码器和语言标签。

	English

	- Anime release names are not standardized; extreme OCR noise, mojibake, or non-anime names can still fail.
	- The ONNX export produces logits only; it does not include the tokenizer, BIO decoding, or thin field normalization. Mobile runtimes must keep the tokenizer, vocabulary, and config in sync with this repository.
	- `source` is currently a single-valued field, while real filenames may contain platform, release source, codec, and language tags at the same time.