---
license: apache-2.0
library_name: transformers
pipeline_tag: token-classification
tags:
- anime
- filename-parsing
- bert
- token-classification
- onnx
datasets:
- ModerRAS/AnimeName
language:
- en
- ja
- zh
model-index:
- name: AniFileBERT
  results:
  - task:
      type: token-classification
      name: Anime filename token classification
    dataset:
      name: AniFileBERT fixed parser regression cases
      type: parser-regression
    metrics:
    - type: accuracy
      name: Fixed parser full-match accuracy
      value: 1.0
---

# AniFileBERT

AniFileBERT is a lightweight BERT token-classification parser for anime release filenames. It parses common release names into structured fields: release group, title, season, episode, resolution, source, and special tags.

This repository is the Hugging Face model repo used by MiruPlay as `tools/anime_parser`.

## Model Details

| Item | Value |
| --- | --- |
| Architecture | `BertForTokenClassification` |
| Tokenizer | Custom character tokenizer in `tokenizer.py` |
| Parameters | 4,783,631 |
| Hidden size | 256 |
| Layers | 4 |
| Attention heads | 8 |
| Max sequence length | 128 |
| Labels | BIO labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, `SPECIAL` |
| Default checkpoint | Repository root files (`config.json`, `model.safetensors`, `vocab.json`, `tokenizer_config.json`) |
| ONNX export | `exports/anime_filename_parser.onnx` |

The repository root is the published checkpoint. The old duplicate `model/` directory is no longer kept.
For end-to-end parsing, use `inference.py` or reuse this repo's `tokenizer.py`, BIO decoder, and field aggregation logic; calling `from_pretrained()` directly loads only the token-classification weights.

## Intended Use

- Parse anime release filenames for media library scraping, classification, search, and display.
- Covers common layouts: `[GROUP] TITLE - EP [META]`, dot-separated `S01E07`, multi-bracket Chinese animation titles, BD extras such as `NCOP/NCED/IV05`, long-running episode numbers, and season aliases (for example, second-season names).
- This is not a general natural-language NER model; it is a structured filename parser.

## Install

```powershell
uv sync
```

If the dataset submodule is missing:

```powershell
git submodule update --init --recursive
```

## Quick Start

Run the Python parser:

```powershell
uv run python inference.py --model-dir . "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
```

Expected output:

```json
{"title":"神印王座","season":null,"episode":200,"group":"GM-Team","resolution":"1080P","source":"GB","special":null}
```

Load the raw Transformers model:

```python
from transformers import BertForTokenClassification

# Loads the token-classification weights only; the custom tokenizer and
# postprocessing are not part of this call.
model = BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
```

For complete field parsing, clone this repo and use `inference.py`; the tokenizer and postprocessing are custom.

## ONNX Usage

The ONNX graph outputs token logits only. A complete parser still needs:

1. custom character tokenization,
2. constrained BIO decoding,
3. field aggregation and high-confidence structural cleanup.

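
Steps 2 and 3 above can be illustrated with a small, dependency-free sketch. It is not the logic in `inference.py`: the label order, the greedy decoding strategy, the "first span wins" aggregation rule, and the helper names (`allowed`, `decode_bio`, `aggregate`, `fake_logits`) are all assumptions made for illustration. The real label mapping comes from `config.json`, and padded positions should be dropped before decoding.

```python
from typing import Dict, List, Optional, Sequence

ENTITY_TYPES = ["TITLE", "SEASON", "EPISODE", "GROUP", "RESOLUTION", "SOURCE", "SPECIAL"]
# Assumed label order (hypothetical); the real mapping is id2label in config.json.
# 7 entity types x (B, I) + O = 15 labels, matching the 15 logits per token.
LABELS = ["O"] + [f"{prefix}-{name}" for name in ENTITY_TYPES for prefix in ("B", "I")]


def allowed(prev: str, cur: str) -> bool:
    """BIO constraint: I-X may only follow B-X or I-X."""
    if not cur.startswith("I-"):
        return True
    return prev in (f"B-{cur[2:]}", f"I-{cur[2:]}")


def decode_bio(logits: Sequence[Sequence[float]]) -> List[str]:
    """Greedy constrained decode: best-scoring label that is legal after the previous one."""
    labels: List[str] = []
    prev = "O"
    for row in logits:
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        prev = next(LABELS[i] for i in ranked if allowed(prev, LABELS[i]))
        labels.append(prev)
    return labels


def aggregate(chars: Sequence[str], labels: Sequence[str]) -> Dict[str, str]:
    """Join the characters of each span; keep the first span seen per field."""
    fields: Dict[str, str] = {}
    current: Optional[List[str]] = None  # [entity type, text] of the open span
    for ch, label in zip(chars, labels):
        if label.startswith("I-") and current and current[0] == label[2:]:
            current[1] += ch
            continue
        if current:
            fields.setdefault(current[0].lower(), current[1])
        current = [label[2:], ch] if label.startswith("B-") else None
    if current:
        fields.setdefault(current[0].lower(), current[1])
    return fields


def fake_logits(best: str, second: Optional[str] = None) -> List[float]:
    """Build one logits row for the demo; a real row comes from the model."""
    row = [0.0] * len(LABELS)
    row[LABELS.index(best)] = 5.0
    if second is not None:
        row[LABELS.index(second)] = 4.0
    return row


chars = list("[G] A - 01")
rows = [
    fake_logits("O"), fake_logits("B-GROUP"), fake_logits("O"), fake_logits("O"),
    fake_logits("I-TITLE", "B-TITLE"),  # I-TITLE is illegal after O, so B-TITLE wins
    fake_logits("O"), fake_logits("O"), fake_logits("O"),
    fake_logits("B-EPISODE"), fake_logits("I-EPISODE"),
]
print(aggregate(chars, decode_bio(rows)))
```

The sketch returns every field as a string; converting `episode` or `season` to integers and any structural cleanup are left out on purpose.
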
This repository provides a minimal runnable example:

```powershell
uv run python onnx_inference.py "[YYDM&VCB-Studio] Shinsekai Yori [NCED02][Ma10p_1080p][x265_flac].mkv"
```

Static graph shapes:

- `input_ids`: `int64[1,128]`
- `attention_mask`: `int64[1,128]`
- `logits`: `float32[1,128,15]`

More details: [`docs/onnx.md`](docs/onnx.md) and [`ANDROID.md`](ANDROID.md).

## Evaluation

Current published checkpoint:

| Metric | Value |
| --- | --- |
| Fixed real-case regression | 26/26 full match |
| ONNX parity | max abs diff `2.6703e-05` |
| Token/entity eval after focus tuning | F1 `0.9666`, token accuracy `0.9904` |
| Focus parse eval | 385/512 full match (about 75.2%) |

The published checkpoint is the full-relabel character model plus a targeted special-code focus fine-tune. The fixed regression set covers real user-reported patterns; focus eval is intentionally biased toward hard examples and is not equivalent to a broad random DMHY evaluation.

Run regression:

```powershell
uv run python evaluate_parser_cases.py --model-dir . --case-file data/parser_regression_cases.json --output case_metrics.json
```

## Training

Training uses the dataset submodule at `datasets/AnimeName`.

Recommended full character-token run:

```powershell
uv run python train.py --tokenizer char `
  --data-file datasets/AnimeName/dmhy_weak_char.jsonl `
  --vocab-file datasets/AnimeName/vocab.char.json `
  --save-dir checkpoints/dmhy-char-full `
  --init-model-dir . `
  --epochs 2 `
  --batch-size 256 `
  --learning-rate 0.00008 `
  --warmup-steps 300 `
  --max-seq-length 128 `
  --train-split 0.98 `
  --num-workers 4 `
  --checkpoint-steps 1000 `
  --save-total-limit 3 `
  --parse-eval-limit 2048 `
  --seed 52 `
  --experiment-name dmhy-char-full
```

`train.py` writes:

- Hugging Face checkpoints under `--save-dir`,
- `final/run_metadata.json`,
- `final/trainer_eval_metrics.json`,
- `final/parse_eval_metrics.json`,
- `final/case_metrics.json` unless `--no-case-eval` is used,
- TensorBoard logs unless `--no-tensorboard` is used.

Full workflow: [`docs/training.md`](docs/training.md).

## Dataset

Authoritative dataset snapshot:

```text
datasets/AnimeName/dmhy_weak.jsonl
datasets/AnimeName/dmhy_weak_char.jsonl
datasets/AnimeName/vocab.json
datasets/AnimeName/vocab.char.json
```

Current snapshot:

- rows: `632002`
- failed relabel rows: `0`
- strict BIO violations: `0`
- character vocab: `6199`
- character coverage: `100%`

`datasets/AnimeName` is a nested dataset repository. After updating the data, commit and push the dataset repo first, then commit the updated submodule pointer in this model repo.

## Repository Layout

```text
config.json
model.safetensors
tokenizer_config.json
vocab.json
training_args.bin
inference.py
onnx_inference.py
export_onnx.py
train.py
dataset.py
tokenizer.py
dmhy_dataset.py
label_repairs.py
relabel_dataset_from_filenames.py
convert_to_char_dataset.py
data/parser_regression_cases.json
datasets/AnimeName/
exports/anime_filename_parser.onnx
docs/
```

## Maintenance

See [`MAINTENANCE.md`](MAINTENANCE.md) for release steps, LFS order, dataset submodule updates, and MiruPlay integration notes.

## Limitations

- Anime release names are not standardized; extreme OCR noise, mojibake, or non-anime names can still fail.
- The ONNX export contains only the model logits, not the tokenizer or postprocessing. Mobile runtimes must keep tokenizer, vocabulary, config, BIO decode, and postprocessing in sync.
- `source` is currently a single-valued field, while complex filenames may contain platform, release source, codec, and language tags together.
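
As a rough illustration of what a runtime has to reproduce on the input side, the sketch below pads a character sequence to the static `[1,128]` shape of the exported graph. It is not the code in `tokenizer.py`: the `encode` helper, the `pad_id` and `unk_id` values, the toy vocabulary, and the absence of special tokens are all assumptions, and the real tokenizer may differ on each point.

```python
from typing import Dict, List, Tuple

MAX_LEN = 128  # static sequence length of the exported ONNX graph


def encode(
    filename: str,
    vocab: Dict[str, int],
    pad_id: int = 0,  # hypothetical padding id; check vocab.json
    unk_id: int = 1,  # hypothetical unknown-character id; check vocab.json
) -> Tuple[List[List[int]], List[List[int]]]:
    """Map characters to ids, truncate to MAX_LEN, and pad to a [1, MAX_LEN] batch."""
    ids = [vocab.get(ch, unk_id) for ch in filename][:MAX_LEN]
    mask = [1] * len(ids)
    padding = MAX_LEN - len(ids)
    # Return batch-of-one nested lists; a runtime would convert these to int64 tensors.
    return [ids + [pad_id] * padding], [mask + [0] * padding]


toy_vocab = {"a": 2, "b": 3}  # stand-in for the real character vocabulary
input_ids, attention_mask = encode("abz", toy_vocab)
print(input_ids[0][:4], sum(attention_mask[0]))
```

Only positions where `attention_mask` is 1 correspond to filename characters, so logits at padded positions should be ignored during decoding.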