---
license: apache-2.0
library_name: transformers
pipeline_tag: token-classification
tags:
- anime
- filename-parsing
- bert
- token-classification
- onnx
datasets:
- ModerRAS/AnimeName
language:
- en
- ja
- zh
model-index:
- name: AniFileBERT
  results:
  - task:
      type: token-classification
      name: Anime filename token classification
    dataset:
      name: AniFileBERT fixed parser regression cases
      type: parser-regression
    metrics:
    - type: accuracy
      name: Fixed parser model-only full-match accuracy
      value: 0.9231
    - type: accuracy
      name: Fixed parser thin-runtime full-match accuracy
      value: 1.0
---
# AniFileBERT
**中文**:AniFileBERT 是一个面向番剧发布文件名的轻量级 BERT token-classification 解析器。它把常见发布名解析为结构化字段:字幕组、标题、季、集数、分辨率、来源和 special tag。
**English**: AniFileBERT is a lightweight BERT token-classification parser for anime release filenames. It extracts structured fields: release group, title, season, episode, resolution, source, and special tags.
This repository is the Hugging Face model repo used by MiruPlay as `tools/anime_parser`.
## Model Details / 模型信息
| Item | Value |
| --- | --- |
| Architecture / 架构 | `BertForTokenClassification` |
| Tokenizer / 分词器 | Custom character tokenizer in `anifilebert/tokenizer.py` |
| Parameters / 参数量 | 4,783,631 |
| Hidden size / 隐层维度 | 256 |
| Layers / 层数 | 4 |
| Attention heads / 注意力头 | 8 |
| Max sequence length / 最大长度 | 128 |
| Labels / 标签 | BIO labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, `SPECIAL` |
| Default checkpoint / 默认权重 | Repository root files (`config.json`, `model.safetensors`, `vocab.json`, `tokenizer_config.json`) |
| ONNX export / ONNX 导出 | `exports/anime_filename_parser.onnx` |
| Training lineage / 训练链路 | `reports/training_lineage.json` |
**中文**:根目录就是发布 checkpoint,不再保留旧的 `model/` 重复副本。默认解析路径是“模型 logits + 约束 BIO + 薄字段规范化”,不再默认启用重结构规则;直接 `from_pretrained()` 只能加载 token-classification 权重。
**English**: The repository root is the published checkpoint; the old duplicate copy under `model/` is no longer kept. The default parser path is model logits + constrained BIO decoding + thin field normalization, and heavy structural rules are not enabled by default. Calling `from_pretrained()` directly loads only the token-classification weights.
## Intended Use / 使用场景
**中文**
- 解析番剧/动画发布文件名,用于媒体库刮削、归类、搜索和展示。
- 覆盖常见结构:`[GROUP] TITLE - EP [META]`、点分隔 `S01E07`、国漫多括号标题、BD 特典 `NCOP/NCED/IV05`、长集数、第二季别名等。
- 不适合泛化为自然语言 NER;这是结构化文件名解析任务。
**English**
- Parse anime release filenames for media library scraping, classification, search, and display.
- Covers common layouts: `[GROUP] TITLE - EP [META]`, dotted `S01E07`, Chinese animation bracket layouts, BD extras such as `NCOP/NCED/IV05`, long-running episode numbers, and season aliases.
- This is not a general natural-language NER model; it is a structured filename parser.
## Install / 安装
```powershell
uv sync
```
If the dataset submodule is missing:
```powershell
git submodule update --init --recursive
```
## Quick Start / 快速使用
Run the Python parser:
```powershell
uv run python -m anifilebert.inference --model-dir . "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
```
Expected output:
```json
{"title":"神印王座","season":null,"episode":200,"group":"GM-Team","resolution":"1080P","source":"GB","special":null}
```
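If the parser output is consumed from another Python program, the JSON line can be loaded into a typed record. The following is a minimal sketch: `ParsedFilename` and `load_parsed` are hypothetical names, and the field types (for example, `season` as an optional integer) are assumptions inferred from the sample output above.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedFilename:
    """Hypothetical container mirroring the keys of the sample JSON output."""
    title: Optional[str]
    season: Optional[int]      # assumed integer; the sample only shows null
    episode: Optional[int]
    group: Optional[str]
    resolution: Optional[str]
    source: Optional[str]
    special: Optional[str]


def load_parsed(line: str) -> ParsedFilename:
    # Raises TypeError if the JSON has missing or unexpected keys.
    return ParsedFilename(**json.loads(line))


raw = ('{"title":"神印王座","season":null,"episode":200,"group":"GM-Team",'
       '"resolution":"1080P","source":"GB","special":null}')
parsed = load_parsed(raw)
print(parsed.title, parsed.episode, parsed.group)
```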
Load the raw Transformers model:
```python
from transformers import BertForTokenClassification
model = BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
```
**中文**:如果需要完整字段解析,请 clone 本仓库并使用 `python -m anifilebert.inference`,因为分词器和后处理是自定义的。
**English**: For complete field parsing, clone this repo and use `python -m anifilebert.inference`; the tokenizer and postprocessing are custom.
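To illustrate why the raw model alone is not enough, here is a rough sketch of what a character-level tokenizer does: one id per character, truncated and right-padded to the 128-token limit. This is not the code in `anifilebert/tokenizer.py`; the function name, the `[PAD]`/`[UNK]` token names, and the absence of special tokens or text normalization are all assumptions.

```python
from typing import Dict, List, Tuple

MAX_LEN = 128  # matches the max sequence length in the model details table


def encode_chars(text: str, vocab: Dict[str, int],
                 pad_token: str = "[PAD]", unk_token: str = "[UNK]",
                 max_len: int = MAX_LEN) -> Tuple[List[int], List[int]]:
    """Map each character to an id, truncate, then right-pad to max_len."""
    ids = [vocab.get(ch, vocab[unk_token]) for ch in text[:max_len]]
    mask = [1] * len(ids)
    pad = max_len - len(ids)
    return ids + [vocab[pad_token]] * pad, mask + [0] * pad


# Toy vocabulary for illustration only; the real one lives in vocab.json.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[": 2, "]": 3, "A": 4, "1": 5}
input_ids, attention_mask = encode_chars("[A1]?", toy_vocab)
print(input_ids[:6], sum(attention_mask))
```

Because every character becomes its own token, predicted labels line up one-to-one with filename characters, which is what makes span aggregation straightforward.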
## ONNX Usage / ONNX 使用
The ONNX graph outputs token logits only. A complete parser still needs:
1. custom character tokenization,
2. constrained BIO decoding,
3. field aggregation and thin string/number normalization.
This repository provides a minimal runnable example:
```powershell
uv run python -m tools.onnx_inference "[YYDM&VCB-Studio] Shinsekai Yori [NCED02][Ma10p_1080p][x265_flac].mkv"
```
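Step 2 above, constrained BIO decoding, can be sketched as a Viterbi search that forbids an `I-X` tag unless the previous tag is `B-X` or `I-X`. This is an illustrative stand-in, not the repository's decoder, which may use a different strategy (for example, greedy decoding with masking); `is_allowed` and `constrained_decode` are hypothetical names, and scores are simply summed along the path.

```python
from typing import List, Sequence


def is_allowed(prev: str, cur: str) -> bool:
    """I-X may only follow B-X or I-X; every other transition is allowed."""
    if cur.startswith("I-"):
        return prev in ("B-" + cur[2:], "I-" + cur[2:])
    return True


def constrained_decode(scores: Sequence[Sequence[float]], labels: List[str]) -> List[str]:
    """Viterbi search over per-token scores, skipping invalid BIO transitions."""
    neg_inf = float("-inf")
    n = len(labels)
    # A sequence may not start with an I- tag.
    best = [neg_inf if labels[j].startswith("I-") else scores[0][j] for j in range(n)]
    back: List[List[int]] = []
    for row in scores[1:]:
        new_best, pointers = [], []
        for j in range(n):
            cands = [(best[i], i) for i in range(n) if is_allowed(labels[i], labels[j])]
            prev_score, prev_idx = max(cands)
            new_best.append(prev_score + row[j])
            pointers.append(prev_idx)
        best = new_best
        back.append(pointers)
    # Follow back-pointers from the best final label.
    idx = max(range(n), key=lambda j: best[j])
    path = [idx]
    for pointers in reversed(back):
        idx = pointers[idx]
        path.append(idx)
    return [labels[i] for i in reversed(path)]


toy_labels = ["O", "B-TITLE", "I-TITLE"]
toy_scores = [[2.0, 1.0, 0.0], [0.0, 0.5, 3.0], [1.0, 0.0, 0.2]]
decoded = constrained_decode(toy_scores, toy_labels)
print(decoded)  # ['B-TITLE', 'I-TITLE', 'O']
```

A plain per-token argmax on the toy scores would give `O, I-TITLE, O`, which is not a valid BIO sequence; the constrained search instead opens the span with `B-TITLE`.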
Static graph shapes:
- `input_ids`: `int64[1,128]`
- `attention_mask`: `int64[1,128]`
- `logits`: `float32[1,128,15]`
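The last dimension of `logits` is consistent with the BIO scheme listed under Model Details: seven entity types with a `B-` and an `I-` tag each, plus `O`. A small sketch of that count follows; the ordering produced here is an assumption, and the `id2label` mapping in `config.json` remains the authoritative source.

```python
ENTITY_TYPES = ["TITLE", "SEASON", "EPISODE", "GROUP", "RESOLUTION", "SOURCE", "SPECIAL"]


def build_bio_labels(entity_types):
    """Return 'O' plus a B-/I- pair per entity type (ordering is an assumption)."""
    labels = ["O"]
    for name in entity_types:
        labels += [f"B-{name}", f"I-{name}"]
    return labels


labels = build_bio_labels(ENTITY_TYPES)
print(len(labels))  # 15
```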
More details: [`docs/onnx.md`](docs/onnx.md) and [`docs/android.md`](docs/android.md).
## Evaluation / 评估
Current published checkpoint:
| Metric / 指标 | Value / 数值 |
| --- | --- |
| Fixed regression, model-only / 固定回归,纯模型聚合 | 24/26 full match = `92.31%` |
| Fixed regression, default thin runtime / 固定回归,默认薄层运行时 | 26/26 full match = `100%` |
| Held-out parse, model-only / held-out 解析,纯模型聚合 | 1962/2048 full match = `95.80%` |
| Held-out parse, default thin runtime / held-out 解析,默认薄层运行时 | 1988/2048 full match = `97.07%` |
| Token/entity eval / token/entity 评估 | F1 `0.9844`, token accuracy `0.9961` |
| ONNX parity / ONNX 误差 | max abs diff `4.0054e-05` |
| CPU thin-runtime latency / CPU 薄层运行时延迟 | ONNX avg `12.04 ms`, P95 `13.81 ms` |
**中文**:当前发布模型是“两阶段训练”产物:先用 Rust 预生成 `20,439,848` 行虚拟 BIO shard,在 RTX 5070 Ti 上完整训练 10 epoch / `114,070` optimizer steps;再接 1 epoch light hard-case focus 微调。细节见 `reports/training_lineage.json`。README 主指标以 `model-only` 和默认薄层 `normalized-only` 为准;旧版结构规则辅助层已移除,不再作为运行时或质量对照。
**English**: The published checkpoint was trained in two stages: a 10-epoch CUDA fine-tune over `20,439,848` Rust-generated virtual BIO shard rows (`114,070` optimizer steps) on an RTX 5070 Ti, followed by a 1-epoch light hard-case focus fine-tune. See `reports/training_lineage.json` for details. The README's headline quality numbers are the `model-only` and default thin `normalized-only` results; the older structural-rule assist layer has been removed and is no longer used at runtime or as a quality baseline.
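The step count can be cross-checked against the row count. The sketch below assumes the batch size of `1792` from the training command in the Training section, no gradient accumulation, and that the final partial batch of each epoch is kept; under those assumptions the arithmetic reproduces the reported `114,070` steps.

```python
rows = 20_439_848     # virtual BIO shard rows reported above
batch_size = 1792     # assumed: --batch-size from the training command
epochs = 10

# Ceiling division: the last, smaller batch still counts as one step.
steps_per_epoch = (rows + batch_size - 1) // batch_size
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 11407 114070
```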
Run regression:
```powershell
uv run python -m tools.evaluate_parser_cases --model-dir . --case-file data/parser_regression_cases.json --output reports/case_metrics.json
```
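"Full match" in the table above means that every field of a parsed filename equals the expected value. A minimal sketch of that metric follows, assuming the seven output fields from the Quick Start example are compared with plain equality; the actual comparison and normalization in `tools.evaluate_parser_cases` are not shown here and may differ.

```python
FIELDS = ("title", "season", "episode", "group", "resolution", "source", "special")


def is_full_match(predicted, expected, fields=FIELDS):
    """True only when every compared field is identical."""
    return all(predicted.get(f) == expected.get(f) for f in fields)


def full_match_accuracy(pairs):
    """Return (hits, total, accuracy) over (predicted, expected) pairs."""
    pairs = list(pairs)
    hits = sum(is_full_match(p, e) for p, e in pairs)
    return hits, len(pairs), (hits / len(pairs) if pairs else 0.0)


# Toy cases: the second prediction gets the episode wrong.
expected = {"title": "Shinsekai Yori", "episode": 3, "group": "VCB-Studio"}
cases = [
    (dict(expected), expected),
    ({**expected, "episode": 13}, expected),
]
hits, total, acc = full_match_accuracy(cases)
print(hits, total, acc)  # 1 2 0.5
```

With the reported counts, 24 of 26 gives `24 / 26 ≈ 0.9231`, which is the model-only value recorded in the front matter.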
## Performance / 性能
Benchmark command:
性能测试命令:
```powershell
uv run python -m tools.benchmark_inference --model-dir . --onnx exports/anime_filename_parser.onnx --case-file data/parser_regression_cases.json --repeat 20 --warmup 20 --torch-threads 1 --ort-threads 1 --output reports/benchmark_results.json
```
Local CPU benchmark on the 26 fixed real-world cases, single-threaded, using the
default thin runtime. Each measurement covers tokenization, the model/session
forward pass, constrained BIO decoding, entity aggregation, and light string/number normalization:
本地 CPU 单线程测试,使用 26 条固定真实 case,默认薄层运行时,包含 tokenizer、
模型/session 前向、约束 BIO 解码、实体聚合和轻量字符串/数字规范化:
| Backend / 后端 | Load ms / 加载 ms | Avg ms / 平均 ms | P50 ms | P95 ms | P99 ms | files/s |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| PyTorch | 46.35 | 15.36 | 14.25 | 22.27 | 29.75 | 65.1 |
| ONNX Runtime | 50.92 | 12.04 | 11.90 | 13.81 | 15.38 | 83.1 |
**中文**:这是完整薄层 parser 的端到端延迟,不是只测模型 forward。移动端实现应复用 ONNX session,并保持 tokenizer/BIO/薄规范化逻辑一致。
**English**: This is end-to-end thin-parser latency, not model-forward-only timing. Mobile code should keep the ONNX session reusable and keep tokenizer/BIO/thin-normalization behavior aligned.
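The columns of the table relate to each other in a simple way: throughput is the reciprocal of the average per-file latency. The sketch below summarizes a list of per-file latencies; `summarize` is a hypothetical helper, and the nearest-rank percentile method is an assumption, since the method used by `tools.benchmark_inference` is not described here.

```python
import math
import statistics


def summarize(latencies_ms):
    """Average, nearest-rank percentiles, and throughput for per-file latencies."""
    ordered = sorted(latencies_ms)

    def pct(p):
        # Nearest-rank percentile: smallest value with at least p% of samples at or below it.
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    avg = statistics.fmean(ordered)
    return {"avg_ms": avg, "p50_ms": pct(50), "p95_ms": pct(95),
            "files_per_s": 1000.0 / avg}


stats = summarize([10.0, 12.0, 11.0, 15.0])  # toy measurements
print(stats)

# Consistency check against the table: files/s is about 1000 / avg ms.
print(round(1000 / 12.04, 1), round(1000 / 15.36, 1))  # 83.1 65.1
```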
## Training / 训练
Training uses the dataset submodule at `datasets/AnimeName`.
Recommended virtual-shard character-token run on the Windows RTX 5070 Ti worker:
```powershell
@'
import random
from pathlib import Path
source = Path("datasets/AnimeName/dmhy_weak_char.jsonl")
target = Path("data/generated/virtual_source_train_seed105.jsonl")
rows = [line for line in source.read_text(encoding="utf-8").splitlines() if line]
random.Random(105).shuffle(rows)
target.parent.mkdir(parents=True, exist_ok=True)
target.write_text("\n".join(rows[: int(len(rows) * 0.98)]) + "\n", encoding="utf-8")
'@ | .\.venv\Scripts\python.exe -
cargo build --release --manifest-path tools/virtual_dataset_generator/Cargo.toml
.\tools\virtual_dataset_generator\target\release\anifilebert-virtual-dataset-generator.exe `
--input data/generated/virtual_source_train_seed105.jsonl `
--vocab-file datasets/AnimeName/vocab.char.json `
--output-dir data/generated/virtual_char_sps32_seed105 `
--max-length 128 `
--samples-per-source 32 `
--seed 105 `
--threads 20 `
--separator-mode per-gap `
--bracket-mode per-part
.\.venv\Scripts\python.exe -m anifilebert.train --tokenizer char `
--data-file datasets/AnimeName/dmhy_weak_char.jsonl `
--vocab-file datasets/AnimeName/vocab.char.json `
--virtual-dataset-dir data/generated/virtual_char_sps32_seed105 `
--save-dir checkpoints/dmhy-char-virtual-sps32-10epoch-lr1e5 `
--init-model-dir . `
--epochs 10 `
--batch-size 1792 `
--learning-rate 0.00001 `
--warmup-steps 2000 `
--max-seq-length 128 `
--train-split 0.98 `
--num-workers 4 `
--prefetch-factor 4 `
--persistent-workers `
--checkpoint-steps 5000 `
--save-total-limit 3 `
--parse-eval-limit 2048 `
--case-eval-file data/parser_regression_cases.json `
--bf16 `
--no-periodic-eval `
--perf-log-steps 1000 `
--perf-sample-interval 0.5 `
--seed 105 `
--experiment-name dmhy-char-virtual-sps32-10epoch-lr1e5
```
`python -m anifilebert.train` writes:
- Hugging Face checkpoints under `--save-dir`,
- `final/run_metadata.json`,
- `final/trainer_eval_metrics.json`,
- `final/parse_eval_metrics.json`,
- `final/case_metrics.json` unless `--no-case-eval` is used,
- `final/perf_metrics.json` when `--perf-log-steps` is set,
- TensorBoard logs unless `--no-tensorboard` is used.
Full workflow: [`docs/training.md`](docs/training.md).
## Dataset / 数据集
Authoritative dataset snapshot:
```text
datasets/AnimeName/dmhy_weak.jsonl
datasets/AnimeName/dmhy_weak_char.jsonl
datasets/AnimeName/vocab.json
datasets/AnimeName/vocab.char.json
```
Current snapshot:
- rows / 行数: `632002`
- failed relabel rows / 重标注失败行: `0`
- strict BIO violations / 严格 BIO 违规: `0`
- character vocab / 字符词表: `6199`
- character coverage / 字符覆盖率: `100%`
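As an illustration of what a strict BIO check might look like, the sketch below counts `I-X` tags that do not continue a `B-X` or `I-X` span of the same type. Both the definition of "strict" and the function name are assumptions; the dataset's own validation tooling and row schema are not shown in this README.

```python
def count_bio_violations(tags):
    """Count I-X tags that are not preceded by B-X or I-X of the same type."""
    violations = 0
    prev = "O"  # treat the start of the sequence as outside any span
    for tag in tags:
        if tag.startswith("I-") and prev not in ("B-" + tag[2:], "I-" + tag[2:]):
            violations += 1
        prev = tag
    return violations


clean = ["B-GROUP", "I-GROUP", "O", "B-TITLE", "I-TITLE"]
broken = ["B-GROUP", "I-GROUP", "O", "I-TITLE", "B-EPISODE", "I-TITLE"]
print(count_bio_violations(clean), count_bio_violations(broken))  # 0 2
```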
**中文**:`datasets/AnimeName` 是嵌套数据集仓库。更新数据后需要先提交/推送子仓库,再提交父仓库的 submodule pointer。
**English**: `datasets/AnimeName` is a nested dataset repository. Commit and push the dataset repo first, then commit the updated submodule pointer in this model repo.
## Repository Layout / 仓库结构
```text
config.json
model.safetensors
tokenizer_config.json
vocab.json
training_args.bin
anifilebert/
tools/
data/parser_regression_cases.json
datasets/AnimeName/
exports/anime_filename_parser.onnx
docs/
reports/
```
## Maintenance / 维护
See [`docs/maintenance.md`](docs/maintenance.md) for release steps, LFS order, dataset submodule updates, and MiruPlay integration notes.
## Limitations / 局限
**中文**
- 发布命名没有统一标准,极端 OCR 噪声、乱码、非动画命名仍可能失败。
- ONNX 只包含模型 logits,不包含 tokenizer、BIO decode 和薄字段规范化;移动端必须保持 tokenizer/vocab/config 一致。
- `source` 当前是单值字段,复杂文件名里可能同时存在平台、发布源、编码器和语言标签。
**English**
- Anime release names are not standardized; extreme OCR noise, mojibake, or non-anime names can still fail.
- ONNX contains logits only. Mobile runtimes must keep tokenizer, vocabulary, config, BIO decode, and thin normalization in sync.
- `source` is currently a single-valued field, while real filenames may carry platform, release-source, codec, and language tags at the same time.