---
license: apache-2.0
library_name: transformers
pipeline_tag: token-classification
tags:
- anime
- filename-parsing
- bert
- token-classification
- onnx
datasets:
- ModerRAS/AnimeName
language:
- en
- ja
- zh
model-index:
- name: AniFileBERT
  results:
  - task:
      type: token-classification
      name: Anime filename token classification
    dataset:
      name: AniFileBERT fixed parser regression cases
      type: parser-regression
    metrics:
    - type: accuracy
      name: Fixed parser model-only full-match accuracy
      value: 0.9231
    - type: accuracy
      name: Fixed parser thin-runtime full-match accuracy
      value: 1.0
---

# AniFileBERT

AniFileBERT is a lightweight BERT token-classification parser for anime release filenames. It extracts structured fields: release group, title, season, episode, resolution, source, and special tags.

This repository is the Hugging Face model repo used by MiruPlay as `tools/anime_parser`.

## Model Details

| Item | Value |
| --- | --- |
| Architecture | `BertForTokenClassification` |
| Tokenizer | Custom character tokenizer in `anifilebert/tokenizer.py` |
| Parameters | 4,783,631 |
| Hidden size | 256 |
| Layers | 4 |
| Attention heads | 8 |
| Max sequence length | 128 |
| Labels | BIO labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, `SPECIAL` |
| Default checkpoint | Repository root files (`config.json`, `model.safetensors`, `vocab.json`, `tokenizer_config.json`) |
| ONNX export | `exports/anime_filename_parser.onnx` |
| Training lineage | `reports/training_lineage.json` |

The repository root is the published checkpoint; the old duplicate copy under `model/` is no longer kept. The default parsing path is model logits + constrained BIO decoding + thin field normalization, and heavy structural rules are not enabled by default. Calling `from_pretrained()` directly loads only the token-classification weights.
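
The constrained BIO step can be pictured as a search for the best label path that never places an `I-X` tag after anything other than `B-X` or `I-X`. The block below is a minimal sketch under stated assumptions, not the repository's implementation: the label order in `LABELS` is hypothetical (the real mapping lives in `config.json`), and Viterbi search is only one common way to enforce the constraint.

```python
from typing import List, Sequence

ENTITIES = ["TITLE", "SEASON", "EPISODE", "GROUP", "RESOLUTION", "SOURCE", "SPECIAL"]
# Hypothetical label order: "O" first, then B-/I- pairs per entity type.
LABELS = ["O"] + [f"{prefix}-{entity}" for entity in ENTITIES for prefix in ("B", "I")]


def allowed(prev: str, cur: str) -> bool:
    """I-X may only follow B-X or I-X; every other transition is legal."""
    if cur.startswith("I-"):
        return prev in (f"B-{cur[2:]}", f"I-{cur[2:]}")
    return True


def constrained_decode(logits: Sequence[Sequence[float]]) -> List[str]:
    """Viterbi search for the best-scoring path without illegal BIO transitions."""
    if not logits:
        return []
    n = len(LABELS)
    # A sequence may not start with an I- label.
    score = [float("-inf") if LABELS[j].startswith("I-") else logits[0][j] for j in range(n)]
    back: List[List[int]] = []
    for row in logits[1:]:
        new_score, pointers = [], []
        for j in range(n):
            # Best legal predecessor for label j at this position.
            best_i = max(
                (i for i in range(n) if allowed(LABELS[i], LABELS[j])),
                key=lambda i: score[i],
            )
            new_score.append(score[best_i] + row[j])
            pointers.append(best_i)
        score = new_score
        back.append(pointers)
    # Trace the best path back from the final position.
    j = max(range(n), key=lambda k: score[k])
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    return [LABELS[k] for k in reversed(path)]
```

Compared with a per-token argmax, this never emits an orphan `I-` tag, which keeps the later span aggregation simple.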

## Intended Use

- Parse anime release filenames for media library scraping, classification, search, and display.
- Covers common layouts: `[GROUP] TITLE - EP [META]`, dotted `S01E07`, Chinese animation titles with multiple brackets, BD extras such as `NCOP/NCED/IV05`, long-running episode numbers, and season aliases such as second-season names.
- This is not a general natural-language NER model; it is a structured filename parser.

## Install

```powershell
uv sync
```

If the dataset submodule is missing:

```powershell
git submodule update --init --recursive
```

## Quick Start

Run the Python parser:

```powershell
uv run python -m anifilebert.inference --model-dir . "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
```

Expected output:

```json
{"title":"神印王座","season":null,"episode":200,"group":"GM-Team","resolution":"1080P","source":"GB","special":null}
```

Load the raw Transformers model:

```python
from transformers import BertForTokenClassification

# Loads the token-classification weights only; the custom tokenizer and
# postprocessing are not part of this object.
model = BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
```

For complete field parsing, clone this repo and use `python -m anifilebert.inference`; the tokenizer and postprocessing are custom.
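
To illustrate why the raw model alone is not enough, here is a rough sketch of the postprocessing idea: merging character-level BIO tags into spans and applying a thin normalization. The function names, the "first span wins" rule, and the integer conversion are assumptions for illustration; the repository's actual logic may differ.

```python
from typing import Dict, List, Optional, Tuple, Union

FIELDS = ("title", "season", "episode", "group", "resolution", "source", "special")


def aggregate(chars: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge character-level BIO tags into (entity_type, text) spans."""
    spans: List[Tuple[str, str]] = []
    current: Optional[str] = None
    buffer: List[str] = []
    for ch, tag in zip(chars, tags):
        starts_span = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current)
        if starts_span or tag == "O":
            # Close the span that was open, if any.
            if current is not None:
                spans.append((current, "".join(buffer)))
            current, buffer = (tag[2:], [ch]) if starts_span else (None, [])
        else:
            buffer.append(ch)  # I- tag continuing the open span
    if current is not None:
        spans.append((current, "".join(buffer)))
    return spans


def to_fields(spans: List[Tuple[str, str]]) -> Dict[str, Union[str, int, None]]:
    """Thin normalization: keep the first span per field, cast plain digits to int."""
    fields: Dict[str, Union[str, int, None]] = {name: None for name in FIELDS}
    for entity, text in spans:
        key = entity.lower()
        if key in fields and fields[key] is None:
            value: Union[str, int] = text.strip()
            if key in ("season", "episode") and value.isascii() and value.isdigit():
                value = int(value)
            fields[key] = value
    return fields
```

Feeding this the characters of a filename plus one predicted tag per character yields a dictionary with the same keys as the JSON output shown in the Quick Start.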

## ONNX Usage

The ONNX graph outputs token logits only. A complete parser still needs:

1. custom character tokenization,
2. constrained BIO decoding,
3. field aggregation and thin string/number normalization.

This repository provides a minimal runnable example:

```powershell
uv run python -m tools.onnx_inference "[YYDM&VCB-Studio] Shinsekai Yori [NCED02][Ma10p_1080p][x265_flac].mkv"
```

Static graph shapes:

- `input_ids`: `int64[1,128]`
- `attention_mask`: `int64[1,128]`
- `logits`: `float32[1,128,15]`
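
Because the graph shapes are static, every input has to be truncated or padded to exactly 128 positions before it is fed to the session. The sketch below shows that preparation step only. It assumes a plain character-to-id vocabulary with hypothetical `pad_id` and `unk_id` values and no special tokens; the real tokenizer in `anifilebert/tokenizer.py` may handle these differently. The label count of 15 is consistent with B-/I- tags for seven entity types plus `O`.

```python
from typing import Dict, List, Tuple

MAX_LEN = 128           # static sequence length of the exported graph
NUM_LABELS = 2 * 7 + 1  # B-/I- tags for seven entity types plus "O"


def encode_fixed(
    text: str,
    vocab: Dict[str, int],
    pad_id: int = 0,  # hypothetical padding id
    unk_id: int = 1,  # hypothetical unknown-character id
) -> Tuple[List[int], List[int]]:
    """Map characters to ids, then truncate or pad to MAX_LEN with a matching mask."""
    ids = [vocab.get(ch, unk_id) for ch in text][:MAX_LEN]
    mask = [1] * len(ids)
    padding = MAX_LEN - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding
```

Wrapping each returned list in one outer list gives the `[1,128]` batch shape; positions where the mask is 0 should be ignored when decoding the logits.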

More details: [`docs/onnx.md`](docs/onnx.md) and [`docs/android.md`](docs/android.md).

## Evaluation

Current published checkpoint:

| Metric | Value |
| --- | --- |
| Fixed regression, model-only aggregation | 24/26 full match = `92.31%` |
| Fixed regression, default thin runtime | 26/26 full match = `100%` |
| Held-out parse, model-only aggregation | 1962/2048 full match = `95.80%` |
| Held-out parse, default thin runtime | 1988/2048 full match = `97.07%` |
| Token/entity eval | F1 `0.9844`, token accuracy `0.9961` |
| ONNX parity | max abs diff `4.0054e-05` |
| CPU thin-runtime latency | ONNX avg `12.04 ms`, P95 `13.81 ms` |
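
The full-match percentages in the table follow directly from the counts. As a small sketch, assuming "full match" means that every parsed field equals the reference for a case (the function name is illustrative, not part of the repository):

```python
from typing import Dict, List


def full_match_accuracy(predictions: List[Dict], references: List[Dict]) -> float:
    """Fraction of cases whose parsed fields all equal the reference fields."""
    if not references:
        raise ValueError("references must not be empty")
    matched = sum(1 for pred, ref in zip(predictions, references) if pred == ref)
    return matched / len(references)


# Reproduce the reported percentages from the raw counts.
fixed_model_only = round(24 / 26 * 100, 2)
held_out_model_only = round(1962 / 2048 * 100, 2)
held_out_thin_runtime = round(1988 / 2048 * 100, 2)
```

A single wrong field makes the whole case count as a miss, which is why this metric is stricter than the token-level accuracy reported in the same table.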

The published checkpoint was trained in two stages. Stage one is a full 10-epoch CUDA fine-tune (`114,070` optimizer steps) on an RTX 5070 Ti over `20,439,848` virtual BIO shard rows pre-generated in Rust. Stage two is a 1-epoch light hard-case focus fine-tune. See `reports/training_lineage.json` for details. The README's headline quality numbers are the `model-only` results and the default thin `normalized-only` runtime results; the older structural-rule assist layer has been removed and is no longer used as a runtime or as a quality baseline.
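
The reported step count can be cross-checked with simple arithmetic. This sketch assumes the `--batch-size 1792` value from the training command in the Training section, no gradient accumulation, and a final partial batch that is kept rather than dropped:

```python
import math

rows = 20_439_848   # virtual BIO shard rows
batch_size = 1792   # assumed from --batch-size in the training command
epochs = 10

# One optimizer step per batch; the last batch of each epoch may be partial.
steps_per_epoch = math.ceil(rows / batch_size)
total_steps = steps_per_epoch * epochs
```

Under these assumptions `steps_per_epoch` is 11,407 and `total_steps` is 114,070, which agrees with the figure quoted above.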

Run regression:

```powershell
uv run python -m tools.evaluate_parser_cases --model-dir . --case-file data/parser_regression_cases.json --output reports/case_metrics.json
```

## Performance

Benchmark command:

```powershell
uv run python -m tools.benchmark_inference --model-dir . --onnx exports/anime_filename_parser.onnx --case-file data/parser_regression_cases.json --repeat 20 --warmup 20 --torch-threads 1 --ort-threads 1 --output reports/benchmark_results.json
```

Local CPU benchmark on the 26 fixed real-world cases, single-threaded, using the default thin runtime. Each measurement covers tokenization, the model/session forward pass, constrained BIO decoding, entity aggregation, and light string/number normalization:

| Backend | Load ms | Avg ms | P50 ms | P95 ms | P99 ms | files/s |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| PyTorch | 46.35 | 15.36 | 14.25 | 22.27 | 29.75 | 65.1 |
| ONNX Runtime | 50.92 | 12.04 | 11.90 | 13.81 | 15.38 | 83.1 |

This is end-to-end latency for the full thin parser, not model-forward-only timing. Mobile implementations should reuse the ONNX session and keep the tokenizer, BIO decoding, and thin normalization logic aligned with the Python parser.
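
For readers who want to time their own port, the following is a generic sketch of how per-file latency statistics like those in the table can be collected. It is not the code of `tools.benchmark_inference`; the `benchmark` function and its parameters are illustrative, and `parse_fn` stands for any end-to-end parse callable.

```python
import statistics
import time
from typing import Callable, Dict, Sequence


def benchmark(
    parse_fn: Callable[[str], object],
    filenames: Sequence[str],
    repeat: int = 20,
    warmup: int = 20,
) -> Dict[str, float]:
    """Time parse_fn per file and summarize average, percentiles, and throughput."""
    for _ in range(warmup):  # warm caches before measuring
        for name in filenames:
            parse_fn(name)
    samples_ms = []
    for _ in range(repeat):
        for name in filenames:
            start = time.perf_counter()
            parse_fn(name)
            samples_ms.append((time.perf_counter() - start) * 1000.0)
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")  # 99 cut points
    avg = statistics.fmean(samples_ms)
    return {
        "avg_ms": avg,
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "files_per_s": 1000.0 / avg if avg > 0 else float("inf"),
    }
```

The `files/s` column in the table is consistent with 1000 divided by the average latency in milliseconds, so throughput here is derived from the mean rather than measured separately.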

## Training

Training uses the dataset submodule at `datasets/AnimeName`.

Recommended virtual-shard character-token run on the Windows RTX 5070 Ti worker:

```powershell
# 1. Shuffle the weak-label rows (seed 105) and keep the first 98% as generator input.
@'
import random
from pathlib import Path

source = Path("datasets/AnimeName/dmhy_weak_char.jsonl")
target = Path("data/generated/virtual_source_train_seed105.jsonl")
rows = [line for line in source.read_text(encoding="utf-8").splitlines() if line]
random.Random(105).shuffle(rows)
target.parent.mkdir(parents=True, exist_ok=True)
target.write_text("\n".join(rows[: int(len(rows) * 0.98)]) + "\n", encoding="utf-8")
'@ | .\.venv\Scripts\python.exe -

# 2. Build the Rust generator and write the virtual BIO shards.
cargo build --release --manifest-path tools/virtual_dataset_generator/Cargo.toml
.\tools\virtual_dataset_generator\target\release\anifilebert-virtual-dataset-generator.exe `
  --input data/generated/virtual_source_train_seed105.jsonl `
  --vocab-file datasets/AnimeName/vocab.char.json `
  --output-dir data/generated/virtual_char_sps32_seed105 `
  --max-length 128 `
  --samples-per-source 32 `
  --seed 105 `
  --threads 20 `
  --separator-mode per-gap `
  --bracket-mode per-part

# 3. Fine-tune on the generated shards, starting from the repository-root checkpoint.
.\.venv\Scripts\python.exe -m anifilebert.train --tokenizer char `
  --data-file datasets/AnimeName/dmhy_weak_char.jsonl `
  --vocab-file datasets/AnimeName/vocab.char.json `
  --virtual-dataset-dir data/generated/virtual_char_sps32_seed105 `
  --save-dir checkpoints/dmhy-char-virtual-sps32-10epoch-lr1e5 `
  --init-model-dir . `
  --epochs 10 `
  --batch-size 1792 `
  --learning-rate 0.00001 `
  --warmup-steps 2000 `
  --max-seq-length 128 `
  --train-split 0.98 `
  --num-workers 4 `
  --prefetch-factor 4 `
  --persistent-workers `
  --checkpoint-steps 5000 `
  --save-total-limit 3 `
  --parse-eval-limit 2048 `
  --case-eval-file data/parser_regression_cases.json `
  --bf16 `
  --no-periodic-eval `
  --perf-log-steps 1000 `
  --perf-sample-interval 0.5 `
  --seed 105 `
  --experiment-name dmhy-char-virtual-sps32-10epoch-lr1e5
```

`python -m anifilebert.train` writes:

- Hugging Face checkpoints under `--save-dir`,
- `final/run_metadata.json`,
- `final/trainer_eval_metrics.json`,
- `final/parse_eval_metrics.json`,
- `final/case_metrics.json` unless `--no-case-eval` is used,
- `final/perf_metrics.json` when `--perf-log-steps` is set,
- TensorBoard logs unless `--no-tensorboard` is used.

Full workflow: [`docs/training.md`](docs/training.md).

## Dataset

Authoritative dataset snapshot:

```text
datasets/AnimeName/dmhy_weak.jsonl
datasets/AnimeName/dmhy_weak_char.jsonl
datasets/AnimeName/vocab.json
datasets/AnimeName/vocab.char.json
```

Current snapshot:

- rows: `632002`
- failed relabel rows: `0`
- strict BIO violations: `0`
- character vocabulary size: `6199`
- character coverage: `100%`
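
A check like the "strict BIO violations" count above can be sketched in a few lines. The rule used here is an assumption about what "strict" means: an `I-X` tag counts as a violation unless the previous tag is `B-X` or `I-X`. The function name is illustrative and the dataset's own validator may apply further rules.

```python
from typing import List


def bio_violations(tags: List[str]) -> List[int]:
    """Return the indices of I-X tags that do not continue a B-X or I-X tag."""
    bad: List[int] = []
    prev = "O"  # treat the sequence start like an "O" tag
    for index, tag in enumerate(tags):
        if tag.startswith("I-") and prev not in (f"B-{tag[2:]}", f"I-{tag[2:]}"):
            bad.append(index)
        prev = tag
    return bad
```

Summing `len(bio_violations(row_tags))` over all rows would give a dataset-level count comparable to the figure in the snapshot list.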

`datasets/AnimeName` is a nested dataset repository. After updating the data, commit and push the dataset repo first, then commit the updated submodule pointer in this model repo.

## Repository Layout

```text
config.json
model.safetensors
tokenizer_config.json
vocab.json
training_args.bin
anifilebert/
tools/
data/parser_regression_cases.json
datasets/AnimeName/
exports/anime_filename_parser.onnx
docs/
reports/
```

## Maintenance

See [`docs/maintenance.md`](docs/maintenance.md) for release steps, LFS order, dataset submodule updates, and MiruPlay integration notes.

## Limitations

- Anime release names are not standardized; extreme OCR noise, mojibake, or non-anime names can still fail.
- The ONNX export contains the model logits only, not the tokenizer, BIO decoding, or thin field normalization. Mobile runtimes must keep tokenizer, vocabulary, config, BIO decoding, and thin normalization in sync.
- `source` is currently a single-valued field, while complex filenames may contain platform, release source, codec, and language tags at the same time.