---
license: apache-2.0
library_name: transformers
pipeline_tag: token-classification
tags:
- anime
- filename-parsing
- bert
- token-classification
datasets:
- ModerRAS/AnimeName
language:
- en
- ja
- zh
---

# AniFileBERT

AniFileBERT is a tiny BERT token-classification model for parsing anime release filenames into structured fields such as release group, title, season, episode, resolution, source, and special tags.

The checkpoint in this repository is the full-relabel DMHY character-token model used by MiruPlay.

## Model

- Architecture: `BertForTokenClassification`
- Hidden size: 256
- Layers: 4
- Attention heads: 8
- Labels: BIO token labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, and `SPECIAL`
- Tokenizer: custom character tokenizer implemented in `tokenizer.py`
- Max sequence length: 128
- Parameters: 4,783,631

The model files are stored at the repository root so `BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")` can load the weights. Use `inference.py` for end-to-end parsing because the tokenizer is custom rather than a standard WordPiece tokenizer.
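
As an illustration of how the BIO labels listed above map back to structured fields, here is a minimal sketch that groups per-character labels into spans. The `decode_bio` helper, the `B-`/`I-` label spelling, and the hand-written labels are assumptions for illustration; they are not taken from `inference.py` and are not real model output.

```python
from typing import Dict, List, Optional


def decode_bio(chars: List[str], labels: List[str]) -> Dict[str, List[str]]:
    """Group character-level BIO labels into {field: [span text, ...]}."""
    fields: Dict[str, List[str]] = {}
    current_type: Optional[str] = None  # entity type of the span being built
    current_text: List[str] = []        # characters collected for that span

    def flush() -> None:
        # Store the span collected so far, if there is one.
        if current_type is not None and current_text:
            fields.setdefault(current_type, []).append("".join(current_text))

    for ch, label in zip(chars, labels):
        if label.startswith("B-"):
            flush()
            current_type, current_text = label[2:], [ch]
        elif label.startswith("I-") and current_type == label[2:]:
            current_text.append(ch)
        else:
            # "O", or an I- tag that does not continue the open span.
            flush()
            current_type, current_text = None, []
    flush()
    return fields


# Hypothetical filename with hand-written labels (one label per character).
example_chars = list("[Grp] Show - 07")
example_labels = (
    ["O"] + ["B-GROUP", "I-GROUP", "I-GROUP"] + ["O", "O"]
    + ["B-TITLE", "I-TITLE", "I-TITLE", "I-TITLE"] + ["O", "O", "O"]
    + ["B-EPISODE", "I-EPISODE"]
)
parsed = decode_bio(example_chars, example_labels)
print(parsed)
```

This sketch drops an `I-` tag that does not continue an open span of the same type; a real parser may choose to start a new span instead.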

## Dataset

Training data snapshots are published separately in [`ModerRAS/AnimeName`](https://huggingface.co/datasets/ModerRAS/AnimeName), and this repository includes that dataset as a nested Git submodule at `datasets/AnimeName`.

Current DMHY export waterline (from `datasets/AnimeName`):

- Last exported `files.id`: `1675184`
- Next incremental export: `--min-id 1675185`
- Weak-labeled samples: `632002`
- Mixed training samples: `732002`

## Vocabulary

The published checkpoint uses a character vocabulary. `vocab.json` at the
repository root is the deployed tokenizer vocab, and `vocab.char.json` is kept
as a mirrored explicit copy for training/data maintenance. The full DMHY weak
dataset has **6195 unique characters**, so the complete character vocab has only
**6199** entries (6195 characters plus 4 special tokens) and reaches 100% token
coverage on that dataset.

The regex vocabulary is still maintained in `datasets/AnimeName/vocab.json` for
dataset relabeling and diagnostics, but the root checkpoint loads as `char`.
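
To make the vocabulary arithmetic concrete, the following sketch builds a character vocabulary from a couple of made-up filenames and measures character coverage. `build_char_vocab`, `coverage`, the sorted id order, and the sample filenames are assumptions for illustration; the actual builder is `convert_to_char_dataset.py`, which may order and store entries differently.

```python
from typing import Dict, Iterable, List

# Assumed special tokens: four entries, matching 6199 - 6195 = 4.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]


def build_char_vocab(filenames: Iterable[str]) -> Dict[str, int]:
    """Assign ids to the special tokens first, then to every distinct character."""
    chars = sorted({ch for name in filenames for ch in name})
    return {token: idx for idx, token in enumerate(SPECIAL_TOKENS + chars)}


def coverage(vocab: Dict[str, int], filenames: Iterable[str]) -> float:
    """Fraction of characters in the filenames that have a vocabulary entry."""
    chars: List[str] = [ch for name in filenames for ch in name]
    return sum(ch in vocab for ch in chars) / len(chars)


# Hypothetical sample data, not rows from the DMHY dataset.
samples = ["[Grp] Show - 07.mkv", "進撃の巨人 S02E01"]
vocab = build_char_vocab(samples)

unique_chars = len(set("".join(samples)))
print(len(vocab) == unique_chars + len(SPECIAL_TOKENS))
print(coverage(vocab, samples))
```

Because every character seen during the build gets its own id, coverage on the build data is complete by construction; `[UNK]` is only needed for characters that first appear at inference time.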

## Evaluation

Final full-relabel char training (`632002` DMHY rows, 2 epochs, batch size 256,
seed 48):

| Metric | Value |
|--------|-------|
| Eval loss | 0.0163 |
| Entity precision | 0.9800 |
| Entity recall | 0.9867 |
| Entity F1 | 0.9833 |
| Token accuracy | 0.9943 |
| Held-out parse full match | 2008/2048 (0.9805) |
| Fixed regression full match | 21/21 (1.0000) |

The fixed regression set includes second-season aliases such as `Ni`,
`Ni no Sara`, `貳`, and `弐ノ章`, plus long-running episode IDs and dense meta
blocks.
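
As a quick consistency check on the table above, entity F1 is the harmonic mean of precision and recall, and the full-match rates are plain ratios. The sketch below recomputes them from the rounded table values, so agreement to four decimals is approximate rather than guaranteed.

```python
# Values copied from the evaluation table (already rounded to four decimals).
precision, recall = 0.9800, 0.9867

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Full-match rates are matched filenames divided by evaluated filenames.
held_out_full_match = 2008 / 2048
regression_full_match = 21 / 21

print(f"{f1:.4f}")
print(f"{held_out_full_match:.4f}")
print(f"{regression_full_match:.4f}")
```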

## Usage

Install dependencies:

```bash
uv sync
```

Parse a filename with this repository cloned locally:

```bash
python inference.py --model-dir . "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub"
```

Load only the model weights from the Hub:

```python
from transformers import BertForTokenClassification

model = BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
```

For full parsing, clone this repo and use `load_tokenizer` from `tokenizer.py` or the CLI in `inference.py`.

## Clone with Dataset Submodule

```bash
git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
# or, after a normal clone:
git submodule update --init --recursive
```

## Training

### Character-token DMHY training

```bash
uv run python convert_to_char_dataset.py \
  --input datasets/AnimeName/dmhy_weak.jsonl \
  --output datasets/AnimeName/dmhy_weak_char.jsonl \
  --vocab-output datasets/AnimeName/vocab.char.json \
  --manifest-output datasets/AnimeName/dmhy_weak_char.manifest.json

uv run python train.py --tokenizer char \
  --data-file datasets/AnimeName/dmhy_weak_char.jsonl \
  --vocab-file datasets/AnimeName/vocab.char.json \
  --save-dir checkpoints/dmhy-char-full-relabel \
  --init-model-dir . \
  --epochs 2 --batch-size 256 \
  --learning-rate 0.00008 --warmup-steps 300 \
  --checkpoint-steps 1000 --save-total-limit 3 \
  --parse-eval-limit 2048 \
  --max-seq-length 128 --seed 48
```

The converter keeps source metadata and adds `tokenizer_variant`, source token
count, and character token count fields to each record. The char dataset's
p99 length is 107 characters, so `--max-seq-length 128` covers almost all rows
while leaving room for `[CLS]` and `[SEP]`.
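
The length budget described above can be sketched as follows: with `--max-seq-length 128`, two positions go to `[CLS]` and `[SEP]`, leaving 126 for characters. The `encode` function and the toy vocabulary are hypothetical and only illustrate the budget; the real encoding lives in `tokenizer.py` and may truncate or pad differently.

```python
from typing import Dict, List


def encode(text: str, vocab: Dict[str, int], max_seq_length: int = 128) -> List[int]:
    """Encode text as [CLS] + characters + [SEP], truncated and padded to max_seq_length."""
    budget = max_seq_length - 2  # two slots reserved for [CLS] and [SEP]
    body = [vocab.get(ch, vocab["[UNK]"]) for ch in text[:budget]]
    ids = [vocab["[CLS]"]] + body + [vocab["[SEP]"]]
    return ids + [vocab["[PAD]"]] * (max_seq_length - len(ids))


# Toy vocabulary for illustration only.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "a": 4, "b": 5}

short_ids = encode("abz", toy_vocab, max_seq_length=8)  # "z" is out of vocabulary
long_ids = encode("a" * 200, toy_vocab)                 # longer than the budget
print(short_ids)
print(len(long_ids), long_ids[-1])
```

A row at the p99 length of 107 characters needs 109 positions including the two special tokens, which fits within 128; only rows longer than 126 characters are truncated under this scheme.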

### Relabel the full dataset

```bash
uv run python relabel_dataset_from_filenames.py \
  --input datasets/AnimeName/dmhy_weak.jsonl \
  --output datasets/AnimeName/dmhy_weak.relabel.jsonl \
  --manifest-output datasets/AnimeName/dmhy_weak.relabel.manifest.json \
  --vocab-output datasets/AnimeName/vocab.relabel.json \
  --base-vocab datasets/AnimeName/vocab.json \
  --max-vocab-size 8000

# Promote the relabeled files over the originals.
mv -f datasets/AnimeName/dmhy_weak.relabel.jsonl datasets/AnimeName/dmhy_weak.jsonl
mv -f datasets/AnimeName/vocab.relabel.json datasets/AnimeName/vocab.json
mv -f datasets/AnimeName/dmhy_weak.relabel.manifest.json datasets/AnimeName/dmhy_weak.manifest.json
```

### Rebuild vocabulary (if needed)

```bash
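python -c "
import collections, json
# Count token frequencies across the weak-label dataset (one JSON object per line).
tokens = collections.Counter()
with open('datasets/AnimeName/dmhy_weak.jsonl', encoding='utf-8') as f:
    for line in f:
        if line.strip():
            tokens.update(json.loads(line).get('tokens', []))
# Four special tokens plus the 7996 most frequent tokens gives at most 8000 entries.
specials = ['[PAD]', '[UNK]', '[CLS]', '[SEP]']
vocab = {t: i for i, t in enumerate(specials + [t for t, _ in tokens.most_common(7996)])}
with open('vocab.json', 'w', encoding='utf-8') as f:
    json.dump(vocab, f, ensure_ascii=False, indent=2)
"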
```

### Export ONNX for MiruPlay Android

```bash
uv run python export_onnx.py --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
```

---

## Google Colab Training

For Codex-controlled short Colab sessions, see [`colab/README.md`](colab/README.md).
Free Colab still has to be started manually, but once `colab_worker.py` is
running, Codex can submit jobs through `colab_client.py`, tail logs, and inspect
status. Checkpoints live on Google Drive, and the default profiles resume from
the latest checkpoint automatically.

Manual one-shot runs are also supported:

```bash
python colab_train.py --profile dmhy_regex_finetune
```

## Repository Layout

- `model.safetensors`, `config.json`, `vocab.json`: default published model
- `train.py`, `dataset.py`, `tokenizer.py`, `model.py`: training pipeline
- `dmhy_dataset.py`, `mix_datasets.py`: weak-label export and dataset mixing
- `convert_to_char_dataset.py`: full character-token projection for weak labels
- `inference.py`: end-to-end filename parser CLI
- `export_onnx.py`: ONNX export for Android integration
- `exports/`: exported ONNX model and metadata
- `datasets/AnimeName/`: nested dataset submodule

## Maintenance Notes

MiruPlay tracks this repository as `tools/anime_parser`, and this repository
tracks `ModerRAS/AnimeName` as `datasets/AnimeName`. After updating either
repo, remember to commit the submodule pointer in the parent repo.

For the full maintenance workflow, see MiruPlay's
`docs/anifilebert-maintenance.md`.