---
license: apache-2.0
library_name: transformers
pipeline_tag: token-classification
tags:
  - anime
  - filename-parsing
  - bert
  - token-classification
  - onnx
datasets:
  - ModerRAS/AnimeName
language:
  - en
  - ja
  - zh
model-index:
  - name: AniFileBERT
    results:
      - task:
          type: token-classification
          name: Anime filename token classification
        dataset:
          name: AniFileBERT fixed parser regression cases
          type: parser-regression
        metrics:
          - type: accuracy
            name: Fixed parser model-only full-match accuracy
            value: 0.9231
          - type: accuracy
            name: Fixed parser thin-runtime full-match accuracy
            value: 1.0
---

# AniFileBERT

**中文**：AniFileBERT 是一个面向番剧发布文件名的轻量级 BERT token-classification 解析器。它把常见发布名解析为结构化字段：字幕组、标题、季、集数、分辨率、来源和 special tag。

**English**: AniFileBERT is a lightweight BERT token-classification parser for anime release filenames. It extracts structured fields: release group, title, season, episode, resolution, source, and special tags.

This repository is the Hugging Face model repo used by MiruPlay as `tools/anime_parser`.

## Model Details / 模型信息

| Item | Value |
| --- | --- |
| Architecture / 架构 | `BertForTokenClassification` |
| Tokenizer / 分词器 | Custom character tokenizer in `anifilebert/tokenizer.py` |
| Parameters / 参数量 | 4,783,631 |
| Hidden size / 隐层维度 | 256 |
| Layers / 层数 | 4 |
| Attention heads / 注意力头 | 8 |
| Max sequence length / 最大长度 | 128 |
| Labels / 标签 | BIO labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, `SPECIAL` |
| Default checkpoint / 默认权重 | Repository root files (`config.json`, `model.safetensors`, `vocab.json`, `tokenizer_config.json`) |
| ONNX export / ONNX 导出 | `exports/anime_filename_parser.onnx` |
| Training lineage / 训练链路 | `reports/training_lineage.json` |

**中文**：根目录就是发布 checkpoint，不再保留旧的 `model/` 重复副本。默认解析路径是“模型 logits + 约束 BIO + 薄字段规范化”，不再默认启用重结构规则；直接 `from_pretrained()` 只能加载 token-classification 权重。

**English**: The repository root is the published checkpoint; the old duplicate copy under `model/` is no longer kept. The default parsing path is model logits + constrained BIO decoding + thin field normalization, and heavy structural rules are not enabled by default. Calling `from_pretrained()` directly loads only the token-classification weights.
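To illustrate what "constrained BIO" can mean in practice, here is a minimal greedy sketch: at each position it picks the highest-scoring label that keeps the sequence valid, where `I-X` may only follow `B-X` or `I-X`. This is an assumption about the general technique, not the repository's decoder, which may use a different strategy (for example Viterbi). The function names and toy logits are hypothetical.

```python
def allowed(prev_label, label):
    """Strict BIO: I-X may only follow B-X or I-X of the same type."""
    if not label.startswith("I-"):
        return True
    entity = label[2:]
    return prev_label in ("B-" + entity, "I-" + entity)


def constrained_greedy_decode(logits, labels):
    """Pick, per position, the best-scoring label that keeps the sequence valid."""
    decoded = []
    prev = "O"  # treat the start of the sequence as if it followed "O"
    for row in logits:
        ranked = sorted(range(len(labels)), key=lambda i: row[i], reverse=True)
        # "O" and every B-X label are always allowed, so a choice always exists.
        choice = next(labels[i] for i in ranked if allowed(prev, labels[i]))
        decoded.append(choice)
        prev = choice
    return decoded


toy_labels = ["O", "B-TITLE", "I-TITLE", "B-EPISODE", "I-EPISODE"]
toy_logits = [
    [0.1, 0.2, 2.0, 0.0, 0.0],  # raw argmax is I-TITLE, invalid at the start
    [0.0, 0.1, 1.5, 0.0, 0.0],
    [2.0, 0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.3, 1.0],  # raw argmax is I-EPISODE after O, also invalid
]
decoded = constrained_greedy_decode(toy_logits, toy_labels)
print(decoded)  # ['B-TITLE', 'I-TITLE', 'O', 'B-EPISODE']
```

A plain argmax would have emitted two orphan `I-` tags here; the constraint turns them into the next-best valid labels.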

## Intended Use / 使用场景

**中文**

- 解析番剧/动画发布文件名，用于媒体库刮削、归类、搜索和展示。
- 覆盖常见结构：`[GROUP] TITLE - EP [META]`、点分隔 `S01E07`、国漫多括号标题、BD 特典 `NCOP/NCED/IV05`、长集数、第二季别名等。
- 不适合泛化为自然语言 NER；这是结构化文件名解析任务。

**English**

- Parse anime release filenames for media library scraping, classification, search, and display.
- Covers common layouts: `[GROUP] TITLE - EP [META]`, dotted `S01E07`, Chinese animation bracket layouts, BD extras such as `NCOP/NCED/IV05`, long-running episode numbers, and season aliases.
- This is not a general natural-language NER model; it is a structured filename parser.

## Install / 安装

```powershell
uv sync
```

If the dataset submodule is missing:

```powershell
git submodule update --init --recursive
```

## Quick Start / 快速使用

Run the Python parser:

```powershell
uv run python -m anifilebert.inference --model-dir . "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
```

Expected output:

```json
{"title":"神印王座","season":null,"episode":200,"group":"GM-Team","resolution":"1080P","source":"GB","special":null}
```
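If you call the CLI from another program, the JSON line can be loaded into a small typed record. The sketch below assumes the output always carries exactly the seven keys shown above; the `ParsedName` class and the field types (inferred from this single example) are hypothetical and not part of the repository.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedName:
    # Types are inferred from the example output and may not hold for every file.
    title: Optional[str]
    season: Optional[int]
    episode: Optional[int]
    group: Optional[str]
    resolution: Optional[str]
    source: Optional[str]
    special: Optional[str]


def parse_cli_output(line: str) -> ParsedName:
    """Load one JSON line; raises TypeError if keys are missing or unexpected."""
    return ParsedName(**json.loads(line))


line = (
    '{"title":"神印王座","season":null,"episode":200,"group":"GM-Team",'
    '"resolution":"1080P","source":"GB","special":null}'
)
result = parse_cli_output(line)
print(result.title, result.episode, result.group)
```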

Load the raw Transformers model:

```python
from transformers import BertForTokenClassification

model = BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
```

**中文**：如果需要完整字段解析，请 clone 本仓库并使用 `python -m anifilebert.inference`，因为分词器和后处理是自定义的。

**English**: For complete field parsing, clone this repo and use `python -m anifilebert.inference`; the tokenizer and postprocessing are custom.

## ONNX Usage / ONNX 使用

The ONNX graph outputs token logits only. A complete parser still needs:

1. custom character tokenization,
2. constrained BIO decoding,
3. field aggregation and thin string/number normalization.
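Step 3 can be sketched as follows, assuming one BIO tag per character. The function names, the "keep the first span per field" policy, and the integer conversion for `season`/`episode` are illustrative assumptions; the repository's own aggregation and normalization may behave differently.

```python
def aggregate_entities(chars, tags):
    """Group characters into (field, text) spans according to their BIO tags."""
    spans = []
    current_type, buffer = None, []

    def flush():
        nonlocal current_type, buffer
        if current_type is not None:
            spans.append((current_type, "".join(buffer)))
        current_type, buffer = None, []

    for ch, tag in zip(chars, tags):
        if tag.startswith("I-") and tag[2:] == current_type:
            buffer.append(ch)  # continue the open span
        elif tag.startswith(("B-", "I-")):
            flush()  # a new span starts (an orphan I- tag is treated like B-)
            current_type, buffer = tag[2:], [ch]
        else:
            flush()  # "O" closes any open span
    flush()
    return spans


def to_fields(spans):
    """Thin normalization: lowercase keys, trimmed text, integer season/episode."""
    fields = {}
    for field, text in spans:
        key = field.lower()
        if key in fields:
            continue  # keep the first span per field (a policy chosen for this sketch)
        text = text.strip()
        is_number = key in ("season", "episode") and text.isdigit()
        fields[key] = int(text) if is_number else text
    return fields


text = "[AB] Show - 07"
tags = (
    ["O", "B-GROUP", "I-GROUP", "O", "O"]
    + ["B-TITLE", "I-TITLE", "I-TITLE", "I-TITLE"]
    + ["O", "O", "O", "B-EPISODE", "I-EPISODE"]
)
fields = to_fields(aggregate_entities(list(text), tags))
print(fields)  # {'group': 'AB', 'title': 'Show', 'episode': 7}
```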

This repository provides a minimal runnable example:

```powershell
uv run python -m tools.onnx_inference "[YYDM&VCB-Studio] Shinsekai Yori [NCED02][Ma10p_1080p][x265_flac].mkv"
```

Static graph shapes:

- `input_ids`: `int64[1,128]`
- `attention_mask`: `int64[1,128]`
- `logits`: `float32[1,128,15]`
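Because the graph shapes are static, every input has to be padded (or truncated) to exactly 128 positions with a batch size of 1. The sketch below shows one way to build the two inputs from a character vocabulary. The pad/unknown ids, the toy vocabulary, and the absence of special tokens are assumptions; the real behavior is defined by `anifilebert/tokenizer.py` and the published `vocab.json`.

```python
MAX_LEN = 128  # static sequence length of the exported graph


def encode(filename, vocab, pad_id=0, unk_id=1):
    """Map characters to ids, then pad to shape [1, MAX_LEN].

    pad_id and unk_id are hypothetical defaults; check the real vocabulary.
    """
    ids = [vocab.get(ch, unk_id) for ch in filename][:MAX_LEN]  # truncate long names
    mask = [1] * len(ids)
    padding = MAX_LEN - len(ids)
    input_ids = [ids + [pad_id] * padding]
    attention_mask = [mask + [0] * padding]
    return input_ids, attention_mask


# Toy vocabulary for illustration only.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "a": 2, "b": 3, ".": 4}
input_ids, attention_mask = encode("ab.x", toy_vocab)
print(input_ids[0][:6], sum(attention_mask[0]))  # [2, 3, 4, 1, 0, 0] 4
```

Before calling the ONNX session, both nested lists would need to be converted to `int64` arrays, for example with `numpy.asarray(input_ids, dtype=numpy.int64)`.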

More details: [`docs/onnx.md`](docs/onnx.md) and [`docs/android.md`](docs/android.md).

## Evaluation / 评估

Current published checkpoint:

| Metric / 指标 | Value / 数值 |
| --- | --- |
| Fixed regression, model-only / 固定回归，纯模型聚合 | 24/26 full match = `92.31%` |
| Fixed regression, default thin runtime / 固定回归，默认薄层运行时 | 26/26 full match = `100%` |
| Held-out parse, model-only / held-out 解析，纯模型聚合 | 1962/2048 full match = `95.80%` |
| Held-out parse, default thin runtime / held-out 解析，默认薄层运行时 | 1988/2048 full match = `97.07%` |
| Token/entity eval / token/entity 评估 | F1 `0.9844`, token accuracy `0.9961` |
| ONNX parity / ONNX 误差 | max abs diff `4.0054e-05` |
| CPU thin-runtime latency / CPU 薄层运行时延迟 | ONNX avg `12.04 ms`, P95 `13.81 ms` |
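The percentages in the table follow directly from the match counts. The sketch below recomputes them and shows one plausible reading of "full match", namely that a case counts only when every predicted field equals the reference. That definition is an assumption; the exact comparison is implemented in `tools.evaluate_parser_cases`.

```python
def full_match_accuracy(predictions, references):
    """Fraction of cases whose predicted field dict equals the reference dict."""
    matches = sum(1 for pred, ref in zip(predictions, references) if pred == ref)
    return matches / len(references)


# Toy data: the second prediction gets the episode wrong, so it does not count.
preds = [{"title": "A", "episode": 1}, {"title": "B", "episode": 2}]
refs = [{"title": "A", "episode": 1}, {"title": "B", "episode": 3}]
print(full_match_accuracy(preds, refs))  # 0.5

# Recompute the reported percentages from the reported counts.
for matched, total in [(24, 26), (26, 26), (1962, 2048), (1988, 2048)]:
    print(f"{matched}/{total} = {matched / total * 100:.2f}%")
```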

**中文**：当前发布模型是“两阶段训练”产物：先用 Rust 预生成 `20,439,848` 行虚拟 BIO shard，在 RTX 5070 Ti 上完整训练 10 epoch / `114,070` optimizer steps；再接 1 epoch light hard-case focus 微调。细节见 `reports/training_lineage.json`。README 主指标以 `model-only` 和默认薄层 `normalized-only` 为准；旧版结构规则辅助层已移除，不再作为运行时或质量对照。

**English**: The published checkpoint was trained in two stages. Stage one is a full 10-epoch CUDA run (`114,070` optimizer steps) over `20,439,848` virtual BIO shard rows pre-generated in Rust, on an RTX 5070 Ti. Stage two is a 1-epoch light hard-case focus fine-tune. See `reports/training_lineage.json` for details. The headline numbers in this README are the `model-only` and default thin `normalized-only` results; the older structural filename assist layer has been removed and is no longer used as a runtime or as a quality baseline.
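The step count is consistent with the row count and the `--batch-size 1792` value used in the training command in the Training section. The check below assumes one optimizer step per batch (no gradient accumulation), that the final partial batch is kept, and that every shard row is seen once per epoch.

```python
import math

rows = 20_439_848   # virtual BIO shard rows
batch_size = 1792   # --batch-size from the training command
epochs = 10

steps_per_epoch = math.ceil(rows / batch_size)  # last partial batch still counts
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 11407 114070
```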

Run regression:

```powershell
uv run python -m tools.evaluate_parser_cases --model-dir . --case-file data/parser_regression_cases.json --output reports/case_metrics.json
```

## Performance / 性能

Benchmark command:

性能测试命令：

```powershell
uv run python -m tools.benchmark_inference --model-dir . --onnx exports/anime_filename_parser.onnx --case-file data/parser_regression_cases.json --repeat 20 --warmup 20 --torch-threads 1 --ort-threads 1 --output reports/benchmark_results.json
```

Local single-threaded CPU benchmark over the 26 fixed real-world cases. Timings
cover the full default thin runtime: tokenization, model/session forward pass,
constrained BIO decoding, entity aggregation, and light string/number normalization:

本地 CPU 单线程测试，使用 26 条固定真实 case，默认薄层运行时，包含 tokenizer、
模型/session 前向、约束 BIO 解码、实体聚合和轻量字符串/数字规范化：

| Backend / 后端 | Load ms / 加载 ms | Avg ms / 平均 ms | P50 ms | P95 ms | P99 ms | files/s |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| PyTorch | 46.35 | 15.36 | 14.25 | 22.27 | 29.75 | 65.1 |
| ONNX Runtime | 50.92 | 12.04 | 11.90 | 13.81 | 15.38 | 83.1 |

**中文**：这是完整薄层 parser 的端到端延迟，不是只测模型 forward。移动端实现应复用 ONNX session，并保持 tokenizer/BIO/薄规范化逻辑一致。

**English**: These are end-to-end latencies for the full thin parser, not forward-pass-only timings. Mobile implementations should reuse a single ONNX session across files and keep the tokenizer, BIO decoding, and thin normalization logic identical to this repository.
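For reference, this is how summary columns like those in the table can be derived from raw per-file timings. It is a sketch using nearest-rank percentiles on made-up numbers; the percentile method actually used by `tools.benchmark_inference` is not documented here, so treat that choice as an assumption.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile on an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(latencies_ms):
    """Return average, P50, P95 and throughput for a list of latencies in ms."""
    ordered = sorted(latencies_ms)
    avg = sum(ordered) / len(ordered)
    return {
        "avg_ms": avg,
        "p50_ms": percentile(ordered, 50),
        "p95_ms": percentile(ordered, 95),
        "files_per_s": 1000.0 / avg,  # one file per measured call
    }


sample = [10.0, 12.0, 11.0, 13.0, 14.0]  # made-up timings, not benchmark data
stats = summarize(sample)
print(stats)

# The files/s column in the table is 1000 / Avg ms.
print(round(1000 / 15.36, 1), round(1000 / 12.04, 1))  # 65.1 83.1
```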

## Training / 训练

Training uses the dataset submodule at `datasets/AnimeName`.

Recommended virtual-shard character-token run on the Windows RTX 5070 Ti worker:

```powershell
@'
import random
from pathlib import Path

source = Path("datasets/AnimeName/dmhy_weak_char.jsonl")
target = Path("data/generated/virtual_source_train_seed105.jsonl")
rows = [line for line in source.read_text(encoding="utf-8").splitlines() if line]
random.Random(105).shuffle(rows)
target.parent.mkdir(parents=True, exist_ok=True)
target.write_text("\n".join(rows[: int(len(rows) * 0.98)]) + "\n", encoding="utf-8")
'@ | .\.venv\Scripts\python.exe -

cargo build --release --manifest-path tools/virtual_dataset_generator/Cargo.toml
.\tools\virtual_dataset_generator\target\release\anifilebert-virtual-dataset-generator.exe `
  --input data/generated/virtual_source_train_seed105.jsonl `
  --vocab-file datasets/AnimeName/vocab.char.json `
  --output-dir data/generated/virtual_char_sps32_seed105 `
  --max-length 128 `
  --samples-per-source 32 `
  --seed 105 `
  --threads 20 `
  --separator-mode per-gap `
  --bracket-mode per-part

.\.venv\Scripts\python.exe -m anifilebert.train --tokenizer char `
  --data-file datasets/AnimeName/dmhy_weak_char.jsonl `
  --vocab-file datasets/AnimeName/vocab.char.json `
  --virtual-dataset-dir data/generated/virtual_char_sps32_seed105 `
  --save-dir checkpoints/dmhy-char-virtual-sps32-10epoch-lr1e5 `
  --init-model-dir . `
  --epochs 10 `
  --batch-size 1792 `
  --learning-rate 0.00001 `
  --warmup-steps 2000 `
  --max-seq-length 128 `
  --train-split 0.98 `
  --num-workers 4 `
  --prefetch-factor 4 `
  --persistent-workers `
  --checkpoint-steps 5000 `
  --save-total-limit 3 `
  --parse-eval-limit 2048 `
  --case-eval-file data/parser_regression_cases.json `
  --bf16 `
  --no-periodic-eval `
  --perf-log-steps 1000 `
  --perf-sample-interval 0.5 `
  --seed 105 `
  --experiment-name dmhy-char-virtual-sps32-10epoch-lr1e5
```

`python -m anifilebert.train` writes:

- Hugging Face checkpoints under `--save-dir`,
- `final/run_metadata.json`,
- `final/trainer_eval_metrics.json`,
- `final/parse_eval_metrics.json`,
- `final/case_metrics.json` unless `--no-case-eval` is used,
- `final/perf_metrics.json` when `--perf-log-steps` is set,
- TensorBoard logs unless `--no-tensorboard` is used.
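After a run finishes, a quick check for the JSON artifacts listed above can catch an interrupted or misconfigured job. This sketch assumes the `final/` directory sits directly under `--save-dir`; the helper name is hypothetical, and `case_metrics.json` and `perf_metrics.json` are legitimately absent when the corresponding flags disable them.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED = [
    "run_metadata.json",
    "trainer_eval_metrics.json",
    "parse_eval_metrics.json",
    "case_metrics.json",   # absent when --no-case-eval is used
    "perf_metrics.json",   # only written when --perf-log-steps is set
]


def missing_artifacts(save_dir):
    """Return the expected file names not found under <save_dir>/final."""
    final_dir = Path(save_dir) / "final"  # assumed location of final/
    return [name for name in EXPECTED if not (final_dir / name).is_file()]


print(missing_artifacts("checkpoints/dmhy-char-virtual-sps32-10epoch-lr1e5"))
```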

Full workflow: [`docs/training.md`](docs/training.md).

## Dataset / 数据集

Authoritative dataset snapshot:

```text
datasets/AnimeName/dmhy_weak.jsonl
datasets/AnimeName/dmhy_weak_char.jsonl
datasets/AnimeName/vocab.json
datasets/AnimeName/vocab.char.json
```

Current snapshot:

- rows / 行数: `632002`
- failed relabel rows / 重标注失败行: `0`
- strict BIO violations / 严格 BIO 违规: `0`
- character vocab / 字符词表: `6199`
- character coverage / 字符覆盖率: `100%`
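The two quality figures above can be checked with a few lines of code. The sketch below shows one reasonable definition of each: a strict BIO violation is an `I-X` tag that does not follow `B-X` or `I-X`, and character coverage is the share of characters found in the vocabulary. Both definitions are assumptions about how the dataset tooling counts them.

```python
def count_bio_violations(tags):
    """Count I-X tags that do not continue a B-X or I-X of the same type."""
    violations = 0
    prev = "O"
    for tag in tags:
        if tag.startswith("I-") and prev not in ("B-" + tag[2:], "I-" + tag[2:]):
            violations += 1
        prev = tag
    return violations


def char_coverage(texts, vocab):
    """Fraction of characters across all texts that appear in the vocabulary."""
    total = covered = 0
    for text in texts:
        for ch in text:
            total += 1
            covered += ch in vocab
    return covered / total if total else 1.0


print(count_bio_violations(["B-TITLE", "I-TITLE", "O", "I-EPISODE"]))  # 1
print(char_coverage(["ab", "c"], {"a", "b"}))  # 2 of 3 characters are covered
```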

**中文**：`datasets/AnimeName` 是嵌套数据集仓库。更新数据后需要先提交/推送子仓库，再提交父仓库的 submodule pointer。

**English**: `datasets/AnimeName` is a nested dataset repository. Commit and push the dataset repo first, then commit the updated submodule pointer in this model repo.

## Repository Layout / 仓库结构

```text
config.json
model.safetensors
tokenizer_config.json
vocab.json
training_args.bin
anifilebert/
tools/
data/parser_regression_cases.json
datasets/AnimeName/
exports/anime_filename_parser.onnx
docs/
reports/
```

## Maintenance / 维护

See [`docs/maintenance.md`](docs/maintenance.md) for release steps, LFS order, dataset submodule updates, and MiruPlay integration notes.

## Limitations / 局限

**中文**

- 发布命名没有统一标准，极端 OCR 噪声、乱码、非动画命名仍可能失败。
- ONNX 只包含模型 logits，不包含 tokenizer、BIO decode 和薄字段规范化；移动端必须保持 tokenizer/vocab/config 一致。
- `source` 当前是单值字段，复杂文件名里可能同时存在平台、发布源、编码器和语言标签。

**English**

- Anime release names are not standardized; extreme OCR noise, mojibake, or non-anime names can still fail.
- The ONNX graph produces logits only; it does not include the tokenizer, BIO decoding, or thin field normalization. Mobile runtimes must implement those steps themselves and keep the tokenizer, vocabulary, and config in sync with this repository.
- `source` is currently a single-valued field, but a real filename may carry platform, release source, codec, and language tags at the same time.