Token Classification
Transformers
ONNX
Safetensors
English
Japanese
Chinese
bert
anime
filename-parsing
How to use chivehao/AniFileBERT with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("token-classification", model="chivehao/AniFileBERT")

# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("chivehao/AniFileBERT")
model = AutoModelForTokenClassification.from_pretrained("chivehao/AniFileBERT")
```
| # Anime Filename Parser Diagnostics Report | |
| ## Executive Summary | |
| - Dataset: `datasets\AnimeName\dmhy_weak.jsonl` | |
| - Inspected rows: 5,000 | |
| - Dataset tokenizer variant: `regex` | |
| - Diagnosed tokenizer variant: `regex` | |
| - Vocab: `datasets\AnimeName\vocab.json` (8,000 tokens) | |
| - Max sequence length checked: 64 | |
| - O-label ratio: 38.12% | |
| - Truncation risk: 0/5,000 rows (0.00%) | |
| - UNK rate after selected tokenizer: 6.9158% | |
| - BIO warnings collected: 9,711 | |
| Primary finding: this task is structural filename parsing, so keeping the tokenizer and preprocessing identical across dataset construction, training, and inference matters more than lowering token-level loss. | |
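The UNK rate and truncation-risk figures in the summary can be reproduced with a check along the following lines. This is a minimal sketch rather than the script that produced the report: the function names, the choice to pass the vocabulary as a plain set of token strings, and the `special_tokens=2` allowance (for markers such as `[CLS]` and `[SEP]`) are assumptions made for illustration.

```python
from typing import Sequence, Set


def unk_rate(rows: Sequence[Sequence[str]], vocab: Set[str]) -> float:
    """Return the fraction of tokens across all rows that are missing from vocab."""
    total = 0
    unknown = 0
    for tokens in rows:
        total += len(tokens)
        unknown += sum(1 for tok in tokens if tok not in vocab)
    return unknown / total if total else 0.0


def truncation_risk(
    rows: Sequence[Sequence[str]], max_len: int = 64, special_tokens: int = 2
) -> int:
    """Count rows that would exceed max_len once special tokens are added.

    special_tokens is a hypothetical allowance; the report does not say
    whether it reserves positions for them.
    """
    return sum(1 for tokens in rows if len(tokens) + special_tokens > max_len)
```

With the reported maximum of 54 tokens per row and a limit of 64, no row would be truncated even after reserving two positions, which agrees with the 0/5,000 figure.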
| ## Label And Entity Statistics | |
| ### Label distribution | |
| - `O`: 32,517 (38.12%) | |
| - `I-TITLE`: 30,321 (35.54%) | |
| - `B-TITLE`: 5,593 (6.56%) | |
| - `B-EPISODE`: 5,000 (5.86%) | |
| - `B-SOURCE`: 4,032 (4.73%) | |
| - `I-GROUP`: 2,459 (2.88%) | |
| - `B-GROUP`: 2,299 (2.69%) | |
| - `B-RESOLUTION`: 1,765 (2.07%) | |
| - `B-SEASON`: 1,269 (1.49%) | |
| - `B-SPECIAL`: 57 (0.07%) | |
| ### Entity count | |
| - `TITLE`: 6,061 (29.59%) | |
| - `EPISODE`: 5,000 (24.41%) | |
| - `SOURCE`: 4,032 (19.68%) | |
| - `GROUP`: 2,299 (11.22%) | |
| - `RESOLUTION`: 1,765 (8.62%) | |
| - `SEASON`: 1,269 (6.20%) | |
| - `SPECIAL`: 57 (0.28%) | |
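The `TITLE` entity count (6,061) is larger than the `B-TITLE` label count (5,593) by 468, which equals the `ORPHAN_I` count reported under BIO violations. One reading consistent with these numbers, though the report does not state it, is that an `I-` label with no matching predecessor is counted as the start of a new entity. The sketch below counts entities under that assumption; `count_entities` is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable, Sequence


def count_entities(rows: Iterable[Sequence[str]]) -> Counter:
    """Count entities per type in BIO label sequences.

    Assumption: an entity starts at every B-X label and also at an I-X
    label that does not continue an entity of the same type.
    """
    counts: Counter = Counter()
    for labels in rows:
        prev_type = None  # entity type of the previous token, None after "O"
        for label in labels:
            if label == "O":
                prev_type = None
                continue
            prefix, _, ent_type = label.partition("-")
            if prefix == "B" or prev_type != ent_type:
                counts[ent_type] += 1
            prev_type = ent_type
    return counts
```

Under this rule, two adjacent `B-SOURCE` labels count as two separate `SOURCE` entities, which matches how the sample rows label consecutive bracket groups.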
| ### Length distribution | |
| ```json | |
| { | |
| "raw_tokens": { | |
| "min": 3, | |
| "p50": 17, | |
| "p90": 28, | |
| "p95": 31, | |
| "p99": 39, | |
| "max": 54 | |
| }, | |
| "aligned_tokens": { | |
| "min": 3, | |
| "p50": 17, | |
| "p90": 28, | |
| "p95": 31, | |
| "p99": 39, | |
| "max": 54 | |
| } | |
| } | |
| ``` | |
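A summary with the same keys as the length distribution above can be computed from per-row token counts as sketched here. The report does not say which percentile definition it uses, so the nearest-rank method below is an assumption, and `length_summary` is a hypothetical name.

```python
from typing import Dict, Sequence


def _nearest_rank(sorted_vals: Sequence[int], pct: int) -> int:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    n = len(sorted_vals)
    rank = max(1, (pct * n + 99) // 100)  # integer ceil(pct * n / 100)
    return sorted_vals[rank - 1]


def length_summary(lengths: Sequence[int]) -> Dict[str, int]:
    """Return min, p50, p90, p95, p99, and max of a non-empty list of lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    vals = sorted(lengths)
    return {
        "min": vals[0],
        "p50": _nearest_rank(vals, 50),
        "p90": _nearest_rank(vals, 90),
        "p95": _nearest_rank(vals, 95),
        "p99": _nearest_rank(vals, 99),
        "max": vals[-1],
    }
```

Running the same function on raw and aligned token lists is one way to confirm, as the report shows, that alignment did not change any sequence lengths.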
| ### Whitespace labels | |
| - `I-TITLE`: 10,539 (48.98%) | |
| - `O`: 10,484 (48.72%) | |
| - `I-GROUP`: 411 (1.91%) | |
| - `B-TITLE`: 84 (0.39%) | |
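The whitespace tally shows that spaces inside a title mostly carry `I-TITLE`, while spaces between fields carry `O`. A tally of this kind can be sketched as follows, assuming each row is a pair of parallel token and label lists; `whitespace_label_counts` is a hypothetical helper, not part of the report's tooling.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple


def whitespace_label_counts(
    rows: Iterable[Tuple[Sequence[str], Sequence[str]]]
) -> Counter:
    """Count the labels assigned to tokens that consist only of whitespace."""
    counts: Counter = Counter()
    for tokens, labels in rows:
        for token, label in zip(tokens, labels):
            if token.isspace():
                counts[label] += 1
    return counts
```

A `B-TITLE` on a whitespace token, as in the 84 cases listed, means an entity begins on a space, which is usually worth inspecting by hand.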
| ## BIO Violations And Boundary Drift | |
| ### Violation counts | |
| - `B_DIRECT_TO_O`: 9,243 (95.18%) | |
| - `ORPHAN_I`: 468 (4.82%) | |
| ### Boundary drift heuristics | |
| - none | |
| ### Sample violations | |
| ```json | |
| [ | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 8, | |
| "prev_label": "B-EPISODE", | |
| "label": "O", | |
| "token": ".", | |
| "row": 1, | |
| "file_id": 1, | |
| "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub", | |
| "context_tokens": [ | |
| ".", | |
| "Atelier", | |
| ".", | |
| "S01", | |
| "E07", | |
| ".", | |
| "1080p", | |
| ".", | |
| "NF", | |
| ".", | |
| "WEB-DL" | |
| ], | |
| "context_labels": [ | |
| "I-TITLE", | |
| "I-TITLE", | |
| "O", | |
| "B-SEASON", | |
| "B-EPISODE", | |
| "O", | |
| "B-RESOLUTION", | |
| "O", | |
| "B-SOURCE", | |
| "O", | |
| "B-SOURCE" | |
| ] | |
| }, | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 10, | |
| "prev_label": "B-RESOLUTION", | |
| "label": "O", | |
| "token": ".", | |
| "row": 1, | |
| "file_id": 1, | |
| "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub", | |
| "context_tokens": [ | |
| ".", | |
| "S01", | |
| "E07", | |
| ".", | |
| "1080p", | |
| ".", | |
| "NF", | |
| ".", | |
| "WEB-DL", | |
| ".", | |
| "JP" | |
| ], | |
| "context_labels": [ | |
| "O", | |
| "B-SEASON", | |
| "B-EPISODE", | |
| "O", | |
| "B-RESOLUTION", | |
| "O", | |
| "B-SOURCE", | |
| "O", | |
| "B-SOURCE", | |
| "O", | |
| "B-SOURCE" | |
| ] | |
| }, | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 12, | |
| "prev_label": "B-SOURCE", | |
| "label": "O", | |
| "token": ".", | |
| "row": 1, | |
| "file_id": 1, | |
| "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub", | |
| "context_tokens": [ | |
| "E07", | |
| ".", | |
| "1080p", | |
| ".", | |
| "NF", | |
| ".", | |
| "WEB-DL", | |
| ".", | |
| "JP", | |
| "N", | |
| "." | |
| ], | |
| "context_labels": [ | |
| "B-EPISODE", | |
| "O", | |
| "B-RESOLUTION", | |
| "O", | |
| "B-SOURCE", | |
| "O", | |
| "B-SOURCE", | |
| "O", | |
| "B-SOURCE", | |
| "O", | |
| "O" | |
| ] | |
| }, | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 14, | |
| "prev_label": "B-SOURCE", | |
| "label": "O", | |
| "token": ".", | |
| "row": 1, | |
| "file_id": 1, | |
| "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub", | |
| "context_tokens": [ | |
| "1080p", | |
| ".", | |
| "NF", | |
| ".", | |
| "WEB-DL", | |
| ".", | |
| "JP", | |
| "N", | |
| ".", | |
| "AAC", | |
| "2" | |
| ], | |
| "context_labels": [ | |
| "B-RESOLUTION", | |
| "O", | |
| "B-SOURCE", | |
| "O", | |
| "B-SOURCE", | |
| "O", | |
| "B-SOURCE", | |
| "O", | |
| "O", | |
| "B-SOURCE", | |
| "O" | |
| ] | |
| }, | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 16, | |
| "prev_label": "B-SOURCE", | |
| "label": "O", | |
| "token": "N", | |
| "row": 1, | |
| "file_id": 1, | |
| "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub", | |
| "context_tokens": [ | |
| "NF", | |
| ".", | |
| "WEB-DL", | |
| ".", | |
| "JP", | |
| "N", | |
| ".", | |
| "AAC", | |
| "2", | |
| ".", | |
| "0" | |
| ], | |
| "context_labels": [ | |
| "B-SOURCE", | |
| "O", | |
| "B-SOURCE", | |
| "O", | |
| "B-SOURCE", | |
| "O", | |
| "O", | |
| "B-SOURCE", | |
| "O", | |
| "O", | |
| "O" | |
| ] | |
| }, | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 19, | |
| "prev_label": "B-SOURCE", | |
| "label": "O", | |
| "token": "2", | |
| "row": 1, | |
| "file_id": 1, | |
| "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub", | |
| "context_tokens": [ | |
| ".", | |
| "JP", | |
| "N", | |
| ".", | |
| "AAC", | |
| "2", | |
| ".", | |
| "0", | |
| ".", | |
| "H.264", | |
| "." | |
| ], | |
| "context_labels": [ | |
| "O", | |
| "B-SOURCE", | |
| "O", | |
| "O", | |
| "B-SOURCE", | |
| "O", | |
| "O", | |
| "O", | |
| "O", | |
| "B-SOURCE", | |
| "O" | |
| ] | |
| }, | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 24, | |
| "prev_label": "B-SOURCE", | |
| "label": "O", | |
| "token": ".", | |
| "row": 1, | |
| "file_id": 1, | |
| "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub", | |
| "context_tokens": [ | |
| "2", | |
| ".", | |
| "0", | |
| ".", | |
| "H.264", | |
| ".", | |
| "MSubs", | |
| "-", | |
| "ToonsHub" | |
| ], | |
| "context_labels": [ | |
| "O", | |
| "O", | |
| "O", | |
| "O", | |
| "B-SOURCE", | |
| "O", | |
| "B-SOURCE", | |
| "O", | |
| "O" | |
| ] | |
| }, | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 26, | |
| "prev_label": "B-SOURCE", | |
| "label": "O", | |
| "token": "-", | |
| "row": 1, | |
| "file_id": 1, | |
| "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub", | |
| "context_tokens": [ | |
| "0", | |
| ".", | |
| "H.264", | |
| ".", | |
| "MSubs", | |
| "-", | |
| "ToonsHub" | |
| ], | |
| "context_labels": [ | |
| "O", | |
| "O", | |
| "B-SOURCE", | |
| "O", | |
| "B-SOURCE", | |
| "O", | |
| "O" | |
| ] | |
| }, | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 2, | |
| "prev_label": "B-GROUP", | |
| "label": "O", | |
| "token": "]", | |
| "row": 2, | |
| "file_id": 2, | |
| "filename": "[LoliHouse] Maid-san wa Taberu Dake - 07 [WebRip 1080p HEVC-10bit AAC ASSx2]", | |
| "context_tokens": [ | |
| "[", | |
| "LoliHouse", | |
| "]", | |
| " ", | |
| "Maid", | |
| "-", | |
| "san", | |
| " " | |
| ], | |
| "context_labels": [ | |
| "O", | |
| "B-GROUP", | |
| "O", | |
| "O", | |
| "B-TITLE", | |
| "I-TITLE", | |
| "I-TITLE", | |
| "I-TITLE" | |
| ] | |
| }, | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 17, | |
| "prev_label": "B-EPISODE", | |
| "label": "O", | |
| "token": " ", | |
| "row": 2, | |
| "file_id": 2, | |
| "filename": "[LoliHouse] Maid-san wa Taberu Dake - 07 [WebRip 1080p HEVC-10bit AAC ASSx2]", | |
| "context_tokens": [ | |
| "Dake", | |
| " ", | |
| "-", | |
| " ", | |
| "07", | |
| " ", | |
| "[WebRip 1080p HEVC-10bit AAC ASSx2]" | |
| ], | |
| "context_labels": [ | |
| "I-TITLE", | |
| "O", | |
| "O", | |
| "O", | |
| "B-EPISODE", | |
| "O", | |
| "O" | |
| ] | |
| }, | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 2, | |
| "prev_label": "B-GROUP", | |
| "label": "O", | |
| "token": "]", | |
| "row": 3, | |
| "file_id": 3, | |
| "filename": "[ANi] 異世界悠閒農家 2 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]", | |
| "context_tokens": [ | |
| "[", | |
| "ANi", | |
| "]", | |
| " ", | |
| "異", | |
| "世", | |
| "界", | |
| "悠" | |
| ], | |
| "context_labels": [ | |
| "O", | |
| "B-GROUP", | |
| "O", | |
| "O", | |
| "B-TITLE", | |
| "I-TITLE", | |
| "I-TITLE", | |
| "I-TITLE" | |
| ] | |
| }, | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 13, | |
| "prev_label": "B-SEASON", | |
| "label": "O", | |
| "token": " ", | |
| "row": 3, | |
| "file_id": 3, | |
| "filename": "[ANi] 異世界悠閒農家 2 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]", | |
| "context_tokens": [ | |
| "閒", | |
| "農", | |
| "家", | |
| " ", | |
| "2", | |
| " ", | |
| "-", | |
| " ", | |
| "06", | |
| " ", | |
| "[1080P]" | |
| ], | |
| "context_labels": [ | |
| "I-TITLE", | |
| "I-TITLE", | |
| "I-TITLE", | |
| "O", | |
| "B-SEASON", | |
| "O", | |
| "O", | |
| "O", | |
| "B-EPISODE", | |
| "O", | |
| "B-RESOLUTION" | |
| ] | |
| }, | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 17, | |
| "prev_label": "B-EPISODE", | |
| "label": "O", | |
| "token": " ", | |
| "row": 3, | |
| "file_id": 3, | |
| "filename": "[ANi] 異世界悠閒農家 2 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]", | |
| "context_tokens": [ | |
| "2", | |
| " ", | |
| "-", | |
| " ", | |
| "06", | |
| " ", | |
| "[1080P]", | |
| "[Baha]", | |
| "[WEB-DL]", | |
| "[AAC AVC]", | |
| "[CHT]" | |
| ], | |
| "context_labels": [ | |
| "B-SEASON", | |
| "O", | |
| "O", | |
| "O", | |
| "B-EPISODE", | |
| "O", | |
| "B-RESOLUTION", | |
| "B-SOURCE", | |
| "B-SOURCE", | |
| "O", | |
| "B-SOURCE" | |
| ] | |
| }, | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 21, | |
| "prev_label": "B-SOURCE", | |
| "label": "O", | |
| "token": "[AAC AVC]", | |
| "row": 3, | |
| "file_id": 3, | |
| "filename": "[ANi] 異世界悠閒農家 2 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]", | |
| "context_tokens": [ | |
| "06", | |
| " ", | |
| "[1080P]", | |
| "[Baha]", | |
| "[WEB-DL]", | |
| "[AAC AVC]", | |
| "[CHT]" | |
| ], | |
| "context_labels": [ | |
| "B-EPISODE", | |
| "O", | |
| "B-RESOLUTION", | |
| "B-SOURCE", | |
| "B-SOURCE", | |
| "O", | |
| "B-SOURCE" | |
| ] | |
| }, | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 2, | |
| "prev_label": "B-GROUP", | |
| "label": "O", | |
| "token": "]", | |
| "row": 4, | |
| "file_id": 4, | |
| "filename": "[ANi] 木頭風紀委員和迷你裙 JK 的故事 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]", | |
| "context_tokens": [ | |
| "[", | |
| "ANi", | |
| "]", | |
| " ", | |
| "木", | |
| "頭", | |
| "風", | |
| "紀" | |
| ], | |
| "context_labels": [ | |
| "O", | |
| "B-GROUP", | |
| "O", | |
| "O", | |
| "B-TITLE", | |
| "I-TITLE", | |
| "I-TITLE", | |
| "I-TITLE" | |
| ] | |
| }, | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 24, | |
| "prev_label": "B-EPISODE", | |
| "label": "O", | |
| "token": " ", | |
| "row": 4, | |
| "file_id": 4, | |
| "filename": "[ANi] 木頭風紀委員和迷你裙 JK 的故事 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]", | |
| "context_tokens": [ | |
| "事", | |
| " ", | |
| "-", | |
| " ", | |
| "06", | |
| " ", | |
| "[1080P]", | |
| "[Baha]", | |
| "[WEB-DL]", | |
| "[AAC AVC]", | |
| "[CHT]" | |
| ], | |
| "context_labels": [ | |
| "I-TITLE", | |
| "O", | |
| "O", | |
| "O", | |
| "B-EPISODE", | |
| "O", | |
| "B-RESOLUTION", | |
| "B-SOURCE", | |
| "B-SOURCE", | |
| "O", | |
| "B-SOURCE" | |
| ] | |
| }, | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 28, | |
| "prev_label": "B-SOURCE", | |
| "label": "O", | |
| "token": "[AAC AVC]", | |
| "row": 4, | |
| "file_id": 4, | |
| "filename": "[ANi] 木頭風紀委員和迷你裙 JK 的故事 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]", | |
| "context_tokens": [ | |
| "06", | |
| " ", | |
| "[1080P]", | |
| "[Baha]", | |
| "[WEB-DL]", | |
| "[AAC AVC]", | |
| "[CHT]" | |
| ], | |
| "context_labels": [ | |
| "B-EPISODE", | |
| "O", | |
| "B-RESOLUTION", | |
| "B-SOURCE", | |
| "B-SOURCE", | |
| "O", | |
| "B-SOURCE" | |
| ] | |
| }, | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 2, | |
| "prev_label": "B-GROUP", | |
| "label": "O", | |
| "token": "]", | |
| "row": 5, | |
| "file_id": 5, | |
| "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][GB][MP4]", | |
| "context_tokens": [ | |
| "[", | |
| "KissSub", | |
| "]", | |
| "[", | |
| "Shunkashuutou", | |
| " ", | |
| "Daikousha", | |
| " " | |
| ], | |
| "context_labels": [ | |
| "O", | |
| "B-GROUP", | |
| "O", | |
| "O", | |
| "B-TITLE", | |
| "I-TITLE", | |
| "I-TITLE", | |
| "I-TITLE" | |
| ] | |
| }, | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 19, | |
| "prev_label": "B-SOURCE", | |
| "label": "O", | |
| "token": "[MP4]", | |
| "row": 5, | |
| "file_id": 5, | |
| "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][GB][MP4]", | |
| "context_tokens": [ | |
| "Mai", | |
| "]", | |
| "[05]", | |
| "[1080P]", | |
| "[GB]", | |
| "[MP4]" | |
| ], | |
| "context_labels": [ | |
| "I-TITLE", | |
| "O", | |
| "B-EPISODE", | |
| "B-RESOLUTION", | |
| "B-SOURCE", | |
| "O" | |
| ] | |
| }, | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 2, | |
| "prev_label": "B-GROUP", | |
| "label": "O", | |
| "token": "]", | |
| "row": 6, | |
| "file_id": 6, | |
| "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][06][1080P][GB][MP4]", | |
| "context_tokens": [ | |
| "[", | |
| "KissSub", | |
| "]", | |
| "[", | |
| "Shunkashuutou", | |
| " ", | |
| "Daikousha", | |
| " " | |
| ], | |
| "context_labels": [ | |
| "O", | |
| "B-GROUP", | |
| "O", | |
| "O", | |
| "B-TITLE", | |
| "I-TITLE", | |
| "I-TITLE", | |
| "I-TITLE" | |
| ] | |
| }, | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 19, | |
| "prev_label": "B-SOURCE", | |
| "label": "O", | |
| "token": "[MP4]", | |
| "row": 6, | |
| "file_id": 6, | |
| "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][06][1080P][GB][MP4]", | |
| "context_tokens": [ | |
| "Mai", | |
| "]", | |
| "[06]", | |
| "[1080P]", | |
| "[GB]", | |
| "[MP4]" | |
| ], | |
| "context_labels": [ | |
| "I-TITLE", | |
| "O", | |
| "B-EPISODE", | |
| "B-RESOLUTION", | |
| "B-SOURCE", | |
| "O" | |
| ] | |
| }, | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 2, | |
| "prev_label": "B-GROUP", | |
| "label": "O", | |
| "token": "]", | |
| "row": 7, | |
| "file_id": 7, | |
| "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][06][1080P][BIG5][MP4]", | |
| "context_tokens": [ | |
| "[", | |
| "KissSub", | |
| "]", | |
| "[", | |
| "Shunkashuutou", | |
| " ", | |
| "Daikousha", | |
| " " | |
| ], | |
| "context_labels": [ | |
| "O", | |
| "B-GROUP", | |
| "O", | |
| "O", | |
| "B-TITLE", | |
| "I-TITLE", | |
| "I-TITLE", | |
| "I-TITLE" | |
| ] | |
| }, | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 19, | |
| "prev_label": "B-SOURCE", | |
| "label": "O", | |
| "token": "[MP4]", | |
| "row": 7, | |
| "file_id": 7, | |
| "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][06][1080P][BIG5][MP4]", | |
| "context_tokens": [ | |
| "Mai", | |
| "]", | |
| "[06]", | |
| "[1080P]", | |
| "[BIG5]", | |
| "[MP4]" | |
| ], | |
| "context_labels": [ | |
| "I-TITLE", | |
| "O", | |
| "B-EPISODE", | |
| "B-RESOLUTION", | |
| "B-SOURCE", | |
| "O" | |
| ] | |
| }, | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 2, | |
| "prev_label": "B-GROUP", | |
| "label": "O", | |
| "token": "]", | |
| "row": 8, | |
| "file_id": 8, | |
| "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][BIG5][MP4]", | |
| "context_tokens": [ | |
| "[", | |
| "KissSub", | |
| "]", | |
| "[", | |
| "Shunkashuutou", | |
| " ", | |
| "Daikousha", | |
| " " | |
| ], | |
| "context_labels": [ | |
| "O", | |
| "B-GROUP", | |
| "O", | |
| "O", | |
| "B-TITLE", | |
| "I-TITLE", | |
| "I-TITLE", | |
| "I-TITLE" | |
| ] | |
| }, | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 19, | |
| "prev_label": "B-SOURCE", | |
| "label": "O", | |
| "token": "[MP4]", | |
| "row": 8, | |
| "file_id": 8, | |
| "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][BIG5][MP4]", | |
| "context_tokens": [ | |
| "Mai", | |
| "]", | |
| "[05]", | |
| "[1080P]", | |
| "[BIG5]", | |
| "[MP4]" | |
| ], | |
| "context_labels": [ | |
| "I-TITLE", | |
| "O", | |
| "B-EPISODE", | |
| "B-RESOLUTION", | |
| "B-SOURCE", | |
| "O" | |
| ] | |
| }, | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 2, | |
| "prev_label": "B-GROUP", | |
| "label": "O", | |
| "token": "]", | |
| "row": 9, | |
| "file_id": 9, | |
| "filename": "[Airota][Sousou no Frieren][29][1080p AVC AAC][CHT]", | |
| "context_tokens": [ | |
| "[", | |
| "Airota", | |
| "]", | |
| "[", | |
| "Sousou", | |
| " ", | |
| "no", | |
| " " | |
| ], | |
| "context_labels": [ | |
| "O", | |
| "B-GROUP", | |
| "O", | |
| "O", | |
| "B-TITLE", | |
| "I-TITLE", | |
| "I-TITLE", | |
| "I-TITLE" | |
| ] | |
| }, | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 11, | |
| "prev_label": "B-EPISODE", | |
| "label": "O", | |
| "token": "[1080p AVC AAC]", | |
| "row": 9, | |
| "file_id": 9, | |
| "filename": "[Airota][Sousou no Frieren][29][1080p AVC AAC][CHT]", | |
| "context_tokens": [ | |
| "no", | |
| " ", | |
| "Frieren", | |
| "]", | |
| "[29]", | |
| "[1080p AVC AAC]", | |
| "[CHT]" | |
| ], | |
| "context_labels": [ | |
| "I-TITLE", | |
| "I-TITLE", | |
| "I-TITLE", | |
| "O", | |
| "B-EPISODE", | |
| "O", | |
| "B-SOURCE" | |
| ] | |
| }, | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 2, | |
| "prev_label": "B-GROUP", | |
| "label": "O", | |
| "token": "]", | |
| "row": 10, | |
| "file_id": 10, | |
| "filename": "[Airota][Sousou no Frieren][30][1080p AVC AAC][CHT]", | |
| "context_tokens": [ | |
| "[", | |
| "Airota", | |
| "]", | |
| "[", | |
| "Sousou", | |
| " ", | |
| "no", | |
| " " | |
| ], | |
| "context_labels": [ | |
| "O", | |
| "B-GROUP", | |
| "O", | |
| "O", | |
| "B-TITLE", | |
| "I-TITLE", | |
| "I-TITLE", | |
| "I-TITLE" | |
| ] | |
| }, | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 11, | |
| "prev_label": "B-EPISODE", | |
| "label": "O", | |
| "token": "[1080p AVC AAC]", | |
| "row": 10, | |
| "file_id": 10, | |
| "filename": "[Airota][Sousou no Frieren][30][1080p AVC AAC][CHT]", | |
| "context_tokens": [ | |
| "no", | |
| " ", | |
| "Frieren", | |
| "]", | |
| "[30]", | |
| "[1080p AVC AAC]", | |
| "[CHT]" | |
| ], | |
| "context_labels": [ | |
| "I-TITLE", | |
| "I-TITLE", | |
| "I-TITLE", | |
| "O", | |
| "B-EPISODE", | |
| "O", | |
| "B-SOURCE" | |
| ] | |
| }, | |
| { | |
| "type": "B_DIRECT_TO_O", | |
| "index": 2, | |
| "prev_label": "B-GROUP", | |
| "label": "O", | |
| "token": "]", | |
| "row": 11, | |
| "file_id": 11, | |
| "filename": "[Airota][Sousou no Frieren][31][1080p AVC AAC][CHT]", | |
| "context_tokens": [ | |
| "[", | |
| "Airota", | |
| "]", | |
| "[", | |
| "Sousou", | |
| " ", | |
| "no", | |
| " " | |
| ], | |
| "context_labels": [ | |
| "O", | |
| "B-GROUP", | |
| "O", | |
| "O", | |
| "B-TITLE", | |
| "I-TITLE", | |
| "I-TITLE", | |
| "I-TITLE" | |
| ] | |
| } | |
| ] | |
| ``` | |
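Judging from the samples above, `B_DIRECT_TO_O` is recorded at an `O` token whose predecessor is a `B-` label, and each record carries a context window of up to five tokens on either side. The sketch below implements that reading. The `ORPHAN_I` rule (an `I-X` label not preceded by `B-X` or `I-X` of the same type) is the conventional definition and is an assumption here, since no `ORPHAN_I` sample appears in the excerpt; the function name and the `window` parameter are also illustrative.

```python
from typing import Dict, List, Sequence


def find_bio_violations(
    tokens: Sequence[str], labels: Sequence[str], window: int = 5
) -> List[Dict[str, object]]:
    """Return B_DIRECT_TO_O and ORPHAN_I records for one labeled row."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    violations: List[Dict[str, object]] = []
    for i, label in enumerate(labels):
        prev = labels[i - 1] if i > 0 else "O"
        kind = None
        if label == "O" and prev.startswith("B-"):
            # A B- label followed immediately by O (a single-token entity).
            kind = "B_DIRECT_TO_O"
        elif label.startswith("I-") and prev[2:] != label[2:]:
            # An I- label that does not continue an entity of the same type.
            kind = "ORPHAN_I"
        if kind is None:
            continue
        lo, hi = max(0, i - window), i + window + 1
        violations.append(
            {
                "type": kind,
                "index": i,
                "prev_label": prev,
                "label": label,
                "token": tokens[i],
                "context_tokens": list(tokens[lo:hi]),
                "context_labels": list(labels[lo:hi]),
            }
        )
    return violations
```

Under this definition every single-token entity followed by `O` is flagged, even though such a sequence is valid BIO. That would explain why `B_DIRECT_TO_O` dominates the warning count in a dataset where episodes, resolutions, and sources are mostly one token long.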
| ## Tokenizer Split And Alignment | |
| ### Mismatches between dataset tokens and the selected tokenizer | |
| ```json | |
| [ | |
| { | |
| "file_id": 2, | |
| "filename": "[LoliHouse] Maid-san wa Taberu Dake - 07 [WebRip 1080p HEVC-10bit AAC ASSx2]", | |
| "common_prefix": 0, | |
| "dataset_tokens": [ | |
| "[", | |
| "LoliHouse", | |
| "]", | |
| " ", | |
| "Maid", | |
| "-", | |
| "san", | |
| " ", | |
| "wa", | |
| " ", | |
| "Taberu", | |
| " ", | |
| "Dake", | |
| " ", | |
| "-", | |
| " ", | |
| "07", | |
| " ", | |
| "[WebRip 1080p HEVC-10bit AAC ASSx2]" | |
| ], | |
| "tokenizer_tokens": [ | |
| "[LoliHouse]", | |
| " ", | |
| "Maid", | |
| "-", | |
| "san", | |
| " ", | |
| "wa", | |
| " ", | |
| "Taberu", | |
| " ", | |
| "Dake", | |
| " ", | |
| "-", | |
| " ", | |
| "07", | |
| " ", | |
| "[WebRip 1080p HEVC-10bit AAC ASSx2]" | |
| ], | |
| "dataset_len": 19, | |
| "tokenizer_len": 17 | |
| }, | |
| { | |
| "file_id": 3, | |
| "filename": "[ANi] 異世界悠閒農家 2 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]", | |
| "common_prefix": 0, | |
| "dataset_tokens": [ | |
| "[", | |
| "ANi", | |
| "]", | |
| " ", | |
| "異", | |
| "世", | |
| "界", | |
| "悠", | |
| "閒", | |
| "農", | |
| "家", | |
| " ", | |
| "2", | |
| " ", | |
| "-", | |
| " ", | |
| "06", | |
| " ", | |
| "[1080P]", | |
| "[Baha]", | |
| "[WEB-DL]", | |
| "[AAC AVC]", | |
| "[CHT]" | |
| ], | |
| "tokenizer_tokens": [ | |
| "[ANi]", | |
| " ", | |
| "異", | |
| "世", | |
| "界", | |
| "悠", | |
| "閒", | |
| "農", | |
| "家", | |
| " ", | |
| "2", | |
| " ", | |
| "-", | |
| " ", | |
| "06", | |
| " ", | |
| "[1080P]", | |
| "[Baha]", | |
| "[WEB-DL]", | |
| "[AAC AVC]", | |
| "[CHT]" | |
| ], | |
| "dataset_len": 23, | |
| "tokenizer_len": 21 | |
| }, | |
| { | |
| "file_id": 4, | |
| "filename": "[ANi] 木頭風紀委員和迷你裙 JK 的故事 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]", | |
| "common_prefix": 0, | |
| "dataset_tokens": [ | |
| "[", | |
| "ANi", | |
| "]", | |
| " ", | |
| "木", | |
| "頭", | |
| "風", | |
| "紀", | |
| "委", | |
| "員", | |
| "和", | |
| "迷", | |
| "你", | |
| "裙", | |
| " ", | |
| "JK", | |
| " ", | |
| "的", | |
| "故", | |
| "事", | |
| " ", | |
| "-", | |
| " ", | |
| "06", | |
| " ", | |
| "[1080P]", | |
| "[Baha]", | |
| "[WEB-DL]", | |
| "[AAC AVC]", | |
| "[CHT]" | |
| ], | |
| "tokenizer_tokens": [ | |
| "[ANi]", | |
| " ", | |
| "木", | |
| "頭", | |
| "風", | |
| "紀", | |
| "委", | |
| "員", | |
| "和", | |
| "迷", | |
| "你", | |
| "裙", | |
| " ", | |
| "JK", | |
| " ", | |
| "的", | |
| "故", | |
| "事", | |
| " ", | |
| "-", | |
| " ", | |
| "06", | |
| " ", | |
| "[1080P]", | |
| "[Baha]", | |
| "[WEB-DL]", | |
| "[AAC AVC]", | |
| "[CHT]" | |
| ], | |
| "dataset_len": 30, | |
| "tokenizer_len": 28 | |
| }, | |
| { | |
| "file_id": 5, | |
| "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][GB][MP4]", | |
| "common_prefix": 0, | |
| "dataset_tokens": [ | |
| "[", | |
| "KissSub", | |
| "]", | |
| "[", | |
| "Shunkashuutou", | |
| " ", | |
| "Daikousha", | |
| " ", | |
| "-", | |
| " ", | |
| "Haru", | |
| " ", | |
| "no", | |
| " ", | |
| "Mai", | |
| "]", | |
| "[05]", | |
| "[1080P]", | |
| "[GB]", | |
| "[MP4]" | |
| ], | |
| "tokenizer_tokens": [ | |
| "[KissSub]", | |
| "[Shunkashuutou Daikousha - Haru no Mai]", | |
| "[05]", | |
| "[1080P]", | |
| "[GB]", | |
| "[MP4]" | |
| ], | |
| "dataset_len": 20, | |
| "tokenizer_len": 6 | |
| }, | |
| { | |
| "file_id": 6, | |
| "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][06][1080P][GB][MP4]", | |
| "common_prefix": 0, | |
| "dataset_tokens": [ | |
| "[", | |
| "KissSub", | |
| "]", | |
| "[", | |
| "Shunkashuutou", | |
| " ", | |
| "Daikousha", | |
| " ", | |
| "-", | |
| " ", | |
| "Haru", | |
| " ", | |
| "no", | |
| " ", | |
| "Mai", | |
| "]", | |
| "[06]", | |
| "[1080P]", | |
| "[GB]", | |
| "[MP4]" | |
| ], | |
| "tokenizer_tokens": [ | |
| "[KissSub]", | |
| "[Shunkashuutou Daikousha - Haru no Mai]", | |
| "[06]", | |
| "[1080P]", | |
| "[GB]", | |
| "[MP4]" | |
| ], | |
| "dataset_len": 20, | |
| "tokenizer_len": 6 | |
| }, | |
| { | |
| "file_id": 7, | |
| "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][06][1080P][BIG5][MP4]", | |
| "common_prefix": 0, | |
| "dataset_tokens": [ | |
| "[", | |
| "KissSub", | |
| "]", | |
| "[", | |
| "Shunkashuutou", | |
| " ", | |
| "Daikousha", | |
| " ", | |
| "-", | |
| " ", | |
| "Haru", | |
| " ", | |
| "no", | |
| " ", | |
| "Mai", | |
| "]", | |
| "[06]", | |
| "[1080P]", | |
| "[BIG5]", | |
| "[MP4]" | |
| ], | |
| "tokenizer_tokens": [ | |
| "[KissSub]", | |
| "[Shunkashuutou Daikousha - Haru no Mai]", | |
| "[06]", | |
| "[1080P]", | |
| "[BIG5]", | |
| "[MP4]" | |
| ], | |
| "dataset_len": 20, | |
| "tokenizer_len": 6 | |
| }, | |
| { | |
| "file_id": 8, | |
| "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][BIG5][MP4]", | |
| "common_prefix": 0, | |
| "dataset_tokens": [ | |
| "[", | |
| "KissSub", | |
| "]", | |
| "[", | |
| "Shunkashuutou", | |
| " ", | |
| "Daikousha", | |
| " ", | |
| "-", | |
| " ", | |
| "Haru", | |
| " ", | |
| "no", | |
| " ", | |
| "Mai", | |
| "]", | |
| "[05]", | |
| "[1080P]", | |
| "[BIG5]", | |
| "[MP4]" | |
| ], | |
| "tokenizer_tokens": [ | |
| "[KissSub]", | |
| "[Shunkashuutou Daikousha - Haru no Mai]", | |
| "[05]", | |
| "[1080P]", | |
| "[BIG5]", | |
| "[MP4]" | |
| ], | |
| "dataset_len": 20, | |
| "tokenizer_len": 6 | |
| }, | |
| { | |
| "file_id": 9, | |
| "filename": "[Airota][Sousou no Frieren][29][1080p AVC AAC][CHT]", | |
| "common_prefix": 0, | |
| "dataset_tokens": [ | |
| "[", | |
| "Airota", | |
| "]", | |
| "[", | |
| "Sousou", | |
| " ", | |
| "no", | |
| " ", | |
| "Frieren", | |
| "]", | |
| "[29]", | |
| "[1080p AVC AAC]", | |
| "[CHT]" | |
| ], | |
| "tokenizer_tokens": [ | |
| "[Airota]", | |
| "[Sousou no Frieren]", | |
| "[29]", | |
| "[1080p AVC AAC]", | |
| "[CHT]" | |
| ], | |
| "dataset_len": 13, | |
| "tokenizer_len": 5 | |
| }, | |
| { | |
| "file_id": 10, | |
| "filename": "[Airota][Sousou no Frieren][30][1080p AVC AAC][CHT]", | |
| "common_prefix": 0, | |
| "dataset_tokens": [ | |
| "[", | |
| "Airota", | |
| "]", | |
| "[", | |
| "Sousou", | |
| " ", | |
| "no", | |
| " ", | |
| "Frieren", | |
| "]", | |
| "[30]", | |
| "[1080p AVC AAC]", | |
| "[CHT]" | |
| ], | |
| "tokenizer_tokens": [ | |
| "[Airota]", | |
| "[Sousou no Frieren]", | |
| "[30]", | |
| "[1080p AVC AAC]", | |
| "[CHT]" | |
| ], | |
| "dataset_len": 13, | |
| "tokenizer_len": 5 | |
| }, | |
| { | |
| "file_id": 11, | |
| "filename": "[Airota][Sousou no Frieren][31][1080p AVC AAC][CHT]", | |
| "common_prefix": 0, | |
| "dataset_tokens": [ | |
| "[", | |
| "Airota", | |
| "]", | |
| "[", | |
| "Sousou", | |
| " ", | |
| "no", | |
| " ", | |
| "Frieren", | |
| "]", | |
| "[31]", | |
| "[1080p AVC AAC]", | |
| "[CHT]" | |
| ], | |
| "tokenizer_tokens": [ | |
| "[Airota]", | |
| "[Sousou no Frieren]", | |
| "[31]", | |
| "[1080p AVC AAC]", | |
| "[CHT]" | |
| ], | |
| "dataset_len": 13, | |
| "tokenizer_len": 5 | |
| } | |
| ] | |
| ``` | |
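Each mismatch record above compares the tokens stored in the dataset with the tokens the selected tokenizer produces for the same filename. A comparison of that shape can be sketched as below. The `same_text` field is an addition for illustration and does not appear in the report; it checks that both tokenizations cover the same characters, so the disagreement is only about where the boundaries fall.

```python
from typing import Dict, Sequence


def compare_tokenizations(
    dataset_tokens: Sequence[str], tokenizer_tokens: Sequence[str]
) -> Dict[str, object]:
    """Summarize how two tokenizations of the same filename differ."""
    common_prefix = 0
    for a, b in zip(dataset_tokens, tokenizer_tokens):
        if a != b:
            break
        common_prefix += 1
    return {
        "common_prefix": common_prefix,  # leading tokens that are identical
        "dataset_len": len(dataset_tokens),
        "tokenizer_len": len(tokenizer_tokens),
        # Hypothetical extra field: True when both cover the same characters.
        "same_text": "".join(dataset_tokens) == "".join(tokenizer_tokens),
    }
```

In the records listed, `common_prefix` is 0 because the dataset splits the leading bracket group into `[`, the group name, and `]`, while the tokenizer keeps it as a single token. Labels aligned to the dataset tokens therefore cannot be reused position by position with the tokenizer's output.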
| ### Split examples | |
| ```json | |
| [ | |
| { | |
| "file_id": 1, | |
| "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub", | |
| "dataset_tokens": [ | |
| "Witch", | |
| ".", | |
| "Hat", | |
| ".", | |
| "Atelier", | |
| ".", | |
| "S01", | |
| "E07", | |
| ".", | |
| "1080p", | |
| ".", | |
| "NF", | |
| ".", | |
| "WEB-DL", | |
| ".", | |
| "JP", | |
| "N", | |
| ".", | |
| "AAC", | |
| "2", | |
| ".", | |
| "0", | |
| ".", | |
| "H.264", | |
| ".", | |
| "MSubs", | |
| "-", | |
| "ToonsHub" | |
| ], | |
| "diagnosed_tokens": [ | |
| "Witch", | |
| ".", | |
| "Hat", | |
| ".", | |
| "Atelier", | |
| ".", | |
| "S01", | |
| "E07", | |
| ".", | |
| "1080p", | |
| ".", | |
| "NF", | |
| ".", | |
| "WEB-DL", | |
| ".", | |
| "JP", | |
| "N", | |
| ".", | |
| "AAC", | |
| "2", | |
| ".", | |
| "0", | |
| ".", | |
| "H.264", | |
| ".", | |
| "MSubs", | |
| "-", | |
| "ToonsHub" | |
| ], | |
| "regex_tokens": [ | |
| "Witch", | |
| ".", | |
| "Hat", | |
| ".", | |
| "Atelier", | |
| ".", | |
| "S01", | |
| "E07", | |
| ".", | |
| "1080p", | |
| ".", | |
| "NF", | |
| ".", | |
| "WEB-DL", | |
| ".", | |
| "JP", | |
| "N", | |
| ".", | |
| "AAC", | |
| "2", | |
| ".", | |
| "0", | |
| ".", | |
| "H.264", | |
| ".", | |
| "MSubs", | |
| "-", | |
| "ToonsHub" | |
| ], | |
| "char_tokens": [ | |
| "W", | |
| "i", | |
| "t", | |
| "c", | |
| "h", | |
| ".", | |
| "H", | |
| "a", | |
| "t", | |
| ".", | |
| "A", | |
| "t", | |
| "e", | |
| "l", | |
| "i", | |
| "e", | |
| "r", | |
| ".", | |
| "S", | |
| "0", | |
| "1", | |
| "E", | |
| "0", | |
| "7", | |
| ".", | |
| "1", | |
| "0", | |
| "8", | |
| "0", | |
| "p", | |
| ".", | |
| "N", | |
| "F", | |
| ".", | |
| "W", | |
| "E", | |
| "B", | |
| "-", | |
| "D", | |
| "L", | |
| ".", | |
| "J", | |
| "P", | |
| "N", | |
| ".", | |
| "A", | |
| "A", | |
| "C", | |
| "2", | |
| ".", | |
| "0", | |
| ".", | |
| "H", | |
| ".", | |
| "2", | |
| "6", | |
| "4", | |
| ".", | |
| "M", | |
| "S", | |
| "u", | |
| "b", | |
| "s", | |
| "-", | |
| "T", | |
| "o", | |
| "o", | |
| "n", | |
| "s", | |
| "H", | |
| "u", | |
| "b" | |
| ] | |
| }, | |
| { | |
| "file_id": 2, | |
| "filename": "[LoliHouse] Maid-san wa Taberu Dake - 07 [WebRip 1080p HEVC-10bit AAC ASSx2]", | |
| "dataset_tokens": [ | |
| "[", | |
| "LoliHouse", | |
| "]", | |
| " ", | |
| "Maid", | |
| "-", | |
| "san", | |
| " ", | |
| "wa", | |
| " ", | |
| "Taberu", | |
| " ", | |
| "Dake", | |
| " ", | |
| "-", | |
| " ", | |
| "07", | |
| " ", | |
| "[WebRip 1080p HEVC-10bit AAC ASSx2]" | |
| ], | |
| "diagnosed_tokens": [ | |
| "[LoliHouse]", | |
| " ", | |
| "Maid", | |
| "-", | |
| "san", | |
| " ", | |
| "wa", | |
| " ", | |
| "Taberu", | |
| " ", | |
| "Dake", | |
| " ", | |
| "-", | |
| " ", | |
| "07", | |
| " ", | |
| "[WebRip 1080p HEVC-10bit AAC ASSx2]" | |
| ], | |
| "regex_tokens": [ | |
| "[LoliHouse]", | |
| " ", | |
| "Maid", | |
| "-", | |
| "san", | |
| " ", | |
| "wa", | |
| " ", | |
| "Taberu", | |
| " ", | |
| "Dake", | |
| " ", | |
| "-", | |
| " ", | |
| "07", | |
| " ", | |
| "[WebRip 1080p HEVC-10bit AAC ASSx2]" | |
| ], | |
| "char_tokens": [ | |
| "[", | |
| "L", | |
| "o", | |
| "l", | |
| "i", | |
| "H", | |
| "o", | |
| "u", | |
| "s", | |
| "e", | |
| "]", | |
| " ", | |
| "M", | |
| "a", | |
| "i", | |
| "d", | |
| "-", | |
| "s", | |
| "a", | |
| "n", | |
| " ", | |
| "w", | |
| "a", | |
| " ", | |
| "T", | |
| "a", | |
| "b", | |
| "e", | |
| "r", | |
| "u", | |
| " ", | |
| "D", | |
| "a", | |
| "k", | |
| "e", | |
| " ", | |
| "-", | |
| " ", | |
| "0", | |
| "7", | |
| " ", | |
| "[", | |
| "W", | |
| "e", | |
| "b", | |
| "R", | |
| "i", | |
| "p", | |
| " ", | |
| "1", | |
| "0", | |
| "8", | |
| "0", | |
| "p", | |
| " ", | |
| "H", | |
| "E", | |
| "V", | |
| "C", | |
| "-", | |
| "1", | |
| "0", | |
| "b", | |
| "i", | |
| "t", | |
| " ", | |
| "A", | |
| "A", | |
| "C", | |
| " ", | |
| "A", | |
| "S", | |
| "S", | |
| "x", | |
| "2", | |
| "]" | |
| ] | |
| }, | |
| { | |
| "file_id": 3, | |
| "filename": "[ANi] 異世界悠閒農家 2 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]", | |
| "dataset_tokens": [ | |
| "[", | |
| "ANi", | |
| "]", | |
| " ", | |
| "異", | |
| "世", | |
| "界", | |
| "悠", | |
| "閒", | |
| "農", | |
| "家", | |
| " ", | |
| "2", | |
| " ", | |
| "-", | |
| " ", | |
| "06", | |
| " ", | |
| "[1080P]", | |
| "[Baha]", | |
| "[WEB-DL]", | |
| "[AAC AVC]", | |
| "[CHT]" | |
| ], | |
| "diagnosed_tokens": [ | |
| "[ANi]", | |
| " ", | |
| "異", | |
| "世", | |
| "界", | |
| "悠", | |
| "閒", | |
| "農", | |
| "家", | |
| " ", | |
| "2", | |
| " ", | |
| "-", | |
| " ", | |
| "06", | |
| " ", | |
| "[1080P]", | |
| "[Baha]", | |
| "[WEB-DL]", | |
| "[AAC AVC]", | |
| "[CHT]" | |
| ], | |
| "regex_tokens": [ | |
| "[ANi]", | |
| " ", | |
| "異", | |
| "世", | |
| "界", | |
| "悠", | |
| "閒", | |
| "農", | |
| "家", | |
| " ", | |
| "2", | |
| " ", | |
| "-", | |
| " ", | |
| "06", | |
| " ", | |
| "[1080P]", | |
| "[Baha]", | |
| "[WEB-DL]", | |
| "[AAC AVC]", | |
| "[CHT]" | |
| ], | |
| "char_tokens": [ | |
| "[", | |
| "A", | |
| "N", | |
| "i", | |
| "]", | |
| " ", | |
| "異", | |
| "世", | |
| "界", | |
| "悠", | |
| "閒", | |
| "農", | |
| "家", | |
| " ", | |
| "2", | |
| " ", | |
| "-", | |
| " ", | |
| "0", | |
| "6", | |
| " ", | |
| "[", | |
| "1", | |
| "0", | |
| "8", | |
| "0", | |
| "P", | |
| "]", | |
| "[", | |
| "B", | |
| "a", | |
| "h", | |
| "a", | |
| "]", | |
| "[", | |
| "W", | |
| "E", | |
| "B", | |
| "-", | |
| "D", | |
| "L", | |
| "]", | |
| "[", | |
| "A", | |
| "A", | |
| "C", | |
| " ", | |
| "A", | |
| "V", | |
| "C", | |
| "]", | |
| "[", | |
| "C", | |
| "H", | |
| "T", | |
| "]" | |
| ] | |
| }, | |
| { | |
| "file_id": 4, | |
| "filename": "[ANi] 木頭風紀委員和迷你裙 JK 的故事 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]", | |
| "dataset_tokens": [ | |
| "[", | |
| "ANi", | |
| "]", | |
| " ", | |
| "木", | |
| "頭", | |
| "風", | |
| "紀", | |
| "委", | |
| "員", | |
| "和", | |
| "迷", | |
| "你", | |
| "裙", | |
| " ", | |
| "JK", | |
| " ", | |
| "的", | |
| "故", | |
| "事", | |
| " ", | |
| "-", | |
| " ", | |
| "06", | |
| " ", | |
| "[1080P]", | |
| "[Baha]", | |
| "[WEB-DL]", | |
| "[AAC AVC]", | |
| "[CHT]" | |
| ], | |
| "diagnosed_tokens": [ | |
| "[ANi]", | |
| " ", | |
| "木", | |
| "頭", | |
| "風", | |
| "紀", | |
| "委", | |
| "員", | |
| "和", | |
| "迷", | |
| "你", | |
| "裙", | |
| " ", | |
| "JK", | |
| " ", | |
| "的", | |
| "故", | |
| "事", | |
| " ", | |
| "-", | |
| " ", | |
| "06", | |
| " ", | |
| "[1080P]", | |
| "[Baha]", | |
| "[WEB-DL]", | |
| "[AAC AVC]", | |
| "[CHT]" | |
| ], | |
| "regex_tokens": [ | |
| "[ANi]", | |
| " ", | |
| "木", | |
| "頭", | |
| "風", | |
| "紀", | |
| "委", | |
| "員", | |
| "和", | |
| "迷", | |
| "你", | |
| "裙", | |
| " ", | |
| "JK", | |
| " ", | |
| "的", | |
| "故", | |
| "事", | |
| " ", | |
| "-", | |
| " ", | |
| "06", | |
| " ", | |
| "[1080P]", | |
| "[Baha]", | |
| "[WEB-DL]", | |
| "[AAC AVC]", | |
| "[CHT]" | |
| ], | |
| "char_tokens": [ | |
| "[", | |
| "A", | |
| "N", | |
| "i", | |
| "]", | |
| " ", | |
| "木", | |
| "頭", | |
| "風", | |
| "紀", | |
| "委", | |
| "員", | |
| "和", | |
| "迷", | |
| "你", | |
| "裙", | |
| " ", | |
| "J", | |
| "K", | |
| " ", | |
| "的", | |
| "故", | |
| "事", | |
| " ", | |
| "-", | |
| " ", | |
| "0", | |
| "6", | |
| " ", | |
| "[", | |
| "1", | |
| "0", | |
| "8", | |
| "0", | |
| "P", | |
| "]", | |
| "[", | |
| "B", | |
| "a", | |
| "h", | |
| "a", | |
| "]", | |
| "[", | |
| "W", | |
| "E", | |
| "B", | |
| "-", | |
| "D", | |
| "L", | |
| "]", | |
| "[", | |
| "A", | |
| "A", | |
| "C", | |
| " ", | |
| "A", | |
| "V", | |
| "C", | |
| "]", | |
| "[", | |
| "C", | |
| "H", | |
| "T", | |
| "]" | |
| ] | |
| }, | |
| { | |
| "file_id": 5, | |
| "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][GB][MP4]", | |
| "dataset_tokens": [ | |
| "[", | |
| "KissSub", | |
| "]", | |
| "[", | |
| "Shunkashuutou", | |
| " ", | |
| "Daikousha", | |
| " ", | |
| "-", | |
| " ", | |
| "Haru", | |
| " ", | |
| "no", | |
| " ", | |
| "Mai", | |
| "]", | |
| "[05]", | |
| "[1080P]", | |
| "[GB]", | |
| "[MP4]" | |
| ], | |
| "diagnosed_tokens": [ | |
| "[KissSub]", | |
| "[Shunkashuutou Daikousha - Haru no Mai]", | |
| "[05]", | |
| "[1080P]", | |
| "[GB]", | |
| "[MP4]" | |
| ], | |
| "regex_tokens": [ | |
| "[KissSub]", | |
| "[Shunkashuutou Daikousha - Haru no Mai]", | |
| "[05]", | |
| "[1080P]", | |
| "[GB]", | |
| "[MP4]" | |
| ], | |
| "char_tokens": [ | |
| "[", | |
| "K", | |
| "i", | |
| "s", | |
| "s", | |
| "S", | |
| "u", | |
| "b", | |
| "]", | |
| "[", | |
| "S", | |
| "h", | |
| "u", | |
| "n", | |
| "k", | |
| "a", | |
| "s", | |
| "h", | |
| "u", | |
| "u", | |
| "t", | |
| "o", | |
| "u", | |
| " ", | |
| "D", | |
| "a", | |
| "i", | |
| "k", | |
| "o", | |
| "u", | |
| "s", | |
| "h", | |
| "a", | |
| " ", | |
| "-", | |
| " ", | |
| "H", | |
| "a", | |
| "r", | |
| "u", | |
| " ", | |
| "n", | |
| "o", | |
| " ", | |
| "M", | |
| "a", | |
| "i", | |
| "]", | |
| "[", | |
| "0", | |
| "5", | |
| "]", | |
| "[", | |
| "1", | |
| "0", | |
| "8", | |
| "0", | |
| "P", | |
| "]", | |
| "[", | |
| "G", | |
| "B", | |
| "]", | |
| "[", | |
| "M", | |
| "P", | |
| "4", | |
| "]" | |
| ] | |
| }, | |
| { | |
| "file_id": 6, | |
| "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][06][1080P][GB][MP4]", | |
| "dataset_tokens": [ | |
| "[", | |
| "KissSub", | |
| "]", | |
| "[", | |
| "Shunkashuutou", | |
| " ", | |
| "Daikousha", | |
| " ", | |
| "-", | |
| " ", | |
| "Haru", | |
| " ", | |
| "no", | |
| " ", | |
| "Mai", | |
| "]", | |
| "[06]", | |
| "[1080P]", | |
| "[GB]", | |
| "[MP4]" | |
| ], | |
| "diagnosed_tokens": [ | |
| "[KissSub]", | |
| "[Shunkashuutou Daikousha - Haru no Mai]", | |
| "[06]", | |
| "[1080P]", | |
| "[GB]", | |
| "[MP4]" | |
| ], | |
| "regex_tokens": [ | |
| "[KissSub]", | |
| "[Shunkashuutou Daikousha - Haru no Mai]", | |
| "[06]", | |
| "[1080P]", | |
| "[GB]", | |
| "[MP4]" | |
| ], | |
| "char_tokens": [ | |
| "[", | |
| "K", | |
| "i", | |
| "s", | |
| "s", | |
| "S", | |
| "u", | |
| "b", | |
| "]", | |
| "[", | |
| "S", | |
| "h", | |
| "u", | |
| "n", | |
| "k", | |
| "a", | |
| "s", | |
| "h", | |
| "u", | |
| "u", | |
| "t", | |
| "o", | |
| "u", | |
| " ", | |
| "D", | |
| "a", | |
| "i", | |
| "k", | |
| "o", | |
| "u", | |
| "s", | |
| "h", | |
| "a", | |
| " ", | |
| "-", | |
| " ", | |
| "H", | |
| "a", | |
| "r", | |
| "u", | |
| " ", | |
| "n", | |
| "o", | |
| " ", | |
| "M", | |
| "a", | |
| "i", | |
| "]", | |
| "[", | |
| "0", | |
| "6", | |
| "]", | |
| "[", | |
| "1", | |
| "0", | |
| "8", | |
| "0", | |
| "P", | |
| "]", | |
| "[", | |
| "G", | |
| "B", | |
| "]", | |
| "[", | |
| "M", | |
| "P", | |
| "4", | |
| "]" | |
| ] | |
| }, | |
| { | |
| "file_id": 7, | |
| "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][06][1080P][BIG5][MP4]", | |
| "dataset_tokens": [ | |
| "[", | |
| "KissSub", | |
| "]", | |
| "[", | |
| "Shunkashuutou", | |
| " ", | |
| "Daikousha", | |
| " ", | |
| "-", | |
| " ", | |
| "Haru", | |
| " ", | |
| "no", | |
| " ", | |
| "Mai", | |
| "]", | |
| "[06]", | |
| "[1080P]", | |
| "[BIG5]", | |
| "[MP4]" | |
| ], | |
| "diagnosed_tokens": [ | |
| "[KissSub]", | |
| "[Shunkashuutou Daikousha - Haru no Mai]", | |
| "[06]", | |
| "[1080P]", | |
| "[BIG5]", | |
| "[MP4]" | |
| ], | |
| "regex_tokens": [ | |
| "[KissSub]", | |
| "[Shunkashuutou Daikousha - Haru no Mai]", | |
| "[06]", | |
| "[1080P]", | |
| "[BIG5]", | |
| "[MP4]" | |
| ], | |
| "char_tokens": [ | |
| "[", | |
| "K", | |
| "i", | |
| "s", | |
| "s", | |
| "S", | |
| "u", | |
| "b", | |
| "]", | |
| "[", | |
| "S", | |
| "h", | |
| "u", | |
| "n", | |
| "k", | |
| "a", | |
| "s", | |
| "h", | |
| "u", | |
| "u", | |
| "t", | |
| "o", | |
| "u", | |
| " ", | |
| "D", | |
| "a", | |
| "i", | |
| "k", | |
| "o", | |
| "u", | |
| "s", | |
| "h", | |
| "a", | |
| " ", | |
| "-", | |
| " ", | |
| "H", | |
| "a", | |
| "r", | |
| "u", | |
| " ", | |
| "n", | |
| "o", | |
| " ", | |
| "M", | |
| "a", | |
| "i", | |
| "]", | |
| "[", | |
| "0", | |
| "6", | |
| "]", | |
| "[", | |
| "1", | |
| "0", | |
| "8", | |
| "0", | |
| "P", | |
| "]", | |
| "[", | |
| "B", | |
| "I", | |
| "G", | |
| "5", | |
| "]", | |
| "[", | |
| "M", | |
| "P", | |
| "4", | |
| "]" | |
| ] | |
| }, | |
| { | |
| "file_id": 8, | |
| "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][BIG5][MP4]", | |
| "dataset_tokens": [ | |
| "[", | |
| "KissSub", | |
| "]", | |
| "[", | |
| "Shunkashuutou", | |
| " ", | |
| "Daikousha", | |
| " ", | |
| "-", | |
| " ", | |
| "Haru", | |
| " ", | |
| "no", | |
| " ", | |
| "Mai", | |
| "]", | |
| "[05]", | |
| "[1080P]", | |
| "[BIG5]", | |
| "[MP4]" | |
| ], | |
| "diagnosed_tokens": [ | |
| "[KissSub]", | |
| "[Shunkashuutou Daikousha - Haru no Mai]", | |
| "[05]", | |
| "[1080P]", | |
| "[BIG5]", | |
| "[MP4]" | |
| ], | |
| "regex_tokens": [ | |
| "[KissSub]", | |
| "[Shunkashuutou Daikousha - Haru no Mai]", | |
| "[05]", | |
| "[1080P]", | |
| "[BIG5]", | |
| "[MP4]" | |
| ], | |
| "char_tokens": [ | |
| "[", | |
| "K", | |
| "i", | |
| "s", | |
| "s", | |
| "S", | |
| "u", | |
| "b", | |
| "]", | |
| "[", | |
| "S", | |
| "h", | |
| "u", | |
| "n", | |
| "k", | |
| "a", | |
| "s", | |
| "h", | |
| "u", | |
| "u", | |
| "t", | |
| "o", | |
| "u", | |
| " ", | |
| "D", | |
| "a", | |
| "i", | |
| "k", | |
| "o", | |
| "u", | |
| "s", | |
| "h", | |
| "a", | |
| " ", | |
| "-", | |
| " ", | |
| "H", | |
| "a", | |
| "r", | |
| "u", | |
| " ", | |
| "n", | |
| "o", | |
| " ", | |
| "M", | |
| "a", | |
| "i", | |
| "]", | |
| "[", | |
| "0", | |
| "5", | |
| "]", | |
| "[", | |
| "1", | |
| "0", | |
| "8", | |
| "0", | |
| "P", | |
| "]", | |
| "[", | |
| "B", | |
| "I", | |
| "G", | |
| "5", | |
| "]", | |
| "[", | |
| "M", | |
| "P", | |
| "4", | |
| "]" | |
| ] | |
| } | |
| ] | |
| ``` | |
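The samples above contrast three tokenizations of the same filename. The sketch below shows one way to reproduce and compare them. The regular expression is a hypothetical stand-in rather than the project's actual pattern; it is only chosen so that it yields the `regex_tokens` listed for `file_id` 3. The helper names `regex_tokenize`, `char_tokenize`, and `first_mismatch` are likewise illustrative.

```python
import re

# Hypothetical pattern (an assumption, not the project's real regex):
# whole [..] or (..) groups, ASCII alphanumeric runs, single whitespace
# characters, then any other single character (e.g. one CJK character).
TOKEN_RE = re.compile(r"\[[^\[\]]*\]|\([^()]*\)|[A-Za-z0-9]+|\s|.")


def regex_tokenize(filename):
    """Split a filename into structural tokens using the assumed pattern."""
    return TOKEN_RE.findall(filename)


def char_tokenize(filename):
    """Split a filename into single characters."""
    return list(filename)


def first_mismatch(a, b):
    """Return the index of the first differing token, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


filename = "[ANi] 異世界悠閒農家 2 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]"
regex_tokens = regex_tokenize(filename)
char_tokens = char_tokenize(filename)

# First tokens stored in the dataset for the same file (from the report).
dataset_prefix = ["[", "ANi", "]", " "]

print(regex_tokens)
print(first_mismatch(dataset_prefix, regex_tokens))
```

Both tokenizations are lossless (joining the tokens restores the filename), yet the stored `dataset_tokens` already diverge from the regex output at the first token, which is the kind of mismatch this report is meant to surface.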
| ### Vocabulary coverage | |
| ```json | |
| { | |
| "total": 85312, | |
| "unk": 5900, | |
| "unk_rate": 0.06915791447861966, | |
| "top_unk": [ | |
| [ | |
| "(BDRip 720p x264)", | |
| 66 | |
| ], | |
| [ | |
| "Partie", | |
| 59 | |
| ], | |
| [ | |
| "incantevole", | |
| 54 | |
| ], | |
| [ | |
| "Muxed", | |
| 54 | |
| ], | |
| [ | |
| "nonscordarmi", | |
| 54 | |
| ], | |
| [ | |
| "NEET", | |
| 52 | |
| ], | |
| [ | |
| "Dousei", | |
| 52 | |
| ], | |
| [ | |
| "[krikoun68]", | |
| 52 | |
| ], | |
| [ | |
| "[Blu-Ray - MUX - 960p - x264 - AC3 ITA-JAP - SUB ITA]", | |
| 51 | |
| ], | |
| [ | |
| "CTR", | |
| 45 | |
| ], | |
| [ | |
| "joseol", | |
| 45 | |
| ], | |
| [ | |
| "e99", | |
| 45 | |
| ], | |
| [ | |
| "(1440x1080 h264 AC3 AAC)", | |
| 45 | |
| ], | |
| [ | |
| "VERS", | |
| 37 | |
| ], | |
| [ | |
| "脙", | |
| 37 | |
| ], | |
| [ | |
| "Shunkashuutou", | |
| 36 | |
| ], | |
| [ | |
| "Daikousha", | |
| 36 | |
| ], | |
| [ | |
| "houbatsu", | |
| 36 | |
| ], | |
| [ | |
| "DEFINITIVA", | |
| 36 | |
| ], | |
| [ | |
| "Crash", | |
| 35 | |
| ], | |
| [ | |
| "Realm", | |
| 31 | |
| ], | |
| [ | |
| "UHD", | |
| 31 | |
| ], | |
| [ | |
| "[BDrip 1080P HEVC-10bit AAC]", | |
| 29 | |
| ], | |
| [ | |
| "Choroi", | |
| 28 | |
| ], | |
| [ | |
| "완", | |
| 28 | |
| ] | |
| ] | |
| } | |
| ``` | |
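A minimal sketch of how a coverage summary of this shape could be computed, assuming the UNK rate is simply the number of out-of-vocabulary tokens divided by the total token count. The function name `vocab_coverage` and the toy inputs are illustrative; the real tool may normalize tokens differently before the lookup.

```python
from collections import Counter


def vocab_coverage(token_rows, vocab, top_k=25):
    """Count tokens that are missing from `vocab` across all rows."""
    total = 0
    unk = Counter()
    for tokens in token_rows:
        for token in tokens:
            total += 1
            if token not in vocab:
                unk[token] += 1
    n_unk = sum(unk.values())
    return {
        "total": total,
        "unk": n_unk,
        "unk_rate": n_unk / total if total else 0.0,
        "top_unk": unk.most_common(top_k),
    }


# Toy data for illustration only.
rows = [
    ["[KissSub]", "[05]", "[1080P]"],
    ["[KissSub]", "[06]", "[1080P]", "Shunkashuutou"],
]
vocab = {"[05]", "[06]", "[1080P]"}
summary = vocab_coverage(rows, vocab)
print(summary)

# Sanity check against the reported figures: 5,900 UNK out of 85,312 tokens.
reported_rate = 5900 / 85312
print(f"{reported_rate:.4%}")
```

The reported counts are consistent with this definition: 5,900 / 85,312 gives the 6.9158% UNK rate quoted in the summary.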
| ## Train/Inference Tokenizer Comparison | |
| - Model dir: `checkpoints\dmhy-finetune\final` | |
| - Model tokenizer variant: `regex` | |
| - Dataset tokenizer variant: `regex` | |
| - Diagnostic tokenizer variant: `regex` | |
| - Model tokenizer vocab size: 3,000 | |
| - Diagnostic tokenizer vocab size: 8,000 | |
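| If dataset and model tokenizer variants differ, validation loss can be low while real inference sees different token IDs and boundaries. The variants match here (`regex`), but the vocab sizes do not (3,000 vs. 8,000), so token IDs may still differ between the two tokenizers. | |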
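A check like this can be automated before training or inference. The sketch below assumes each side exposes its tokenizer settings as a small dictionary; the keys `variant` and `vocab_size` and the function `check_tokenizer_consistency` are hypothetical names, not part of the project's checkpoint format.

```python
def check_tokenizer_consistency(model_meta, data_meta):
    """Return a list of human-readable mismatches between two tokenizer configs."""
    problems = []
    for key in ("variant", "vocab_size"):
        if model_meta.get(key) != data_meta.get(key):
            problems.append(
                f"{key} mismatch: model={model_meta.get(key)!r}, "
                f"data={data_meta.get(key)!r}"
            )
    return problems


# Values taken from the comparison above; the dictionary layout is assumed.
model_meta = {"variant": "regex", "vocab_size": 3000}
diagnostic_meta = {"variant": "regex", "vocab_size": 8000}

problems = check_tokenizer_consistency(model_meta, diagnostic_meta)
for problem in problems:
    print(problem)
```

With the figures from this report, the variant check passes and only the vocab size is flagged.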
| ## Model Confusion Analysis | |
| - Evaluated samples: 128 | |
| - Entity precision: 0.9568 | |
| - Entity recall: 0.9530 | |
| - Entity F1: 0.9549 | |
| ### Boundary error classes | |
| - `B-boundary`: 26 (56.52%) | |
| - `entity-type`: 20 (43.48%) | |
| ### Top token-label confusions | |
| | true | pred | count | | |
| | --- | --- | --- | | |
| | O | I-TITLE | 17 | | |
| | O | B-EPISODE | 6 | | |
| | B-SOURCE | O | 4 | | |
| | I-TITLE | O | 3 | | |
| | B-EPISODE | O | 3 | | |
| | B-SEASON | O | 2 | | |
| | B-RESOLUTION | B-SOURCE | 2 | | |
| | B-EPISODE | I-TITLE | 2 | | |
| | O | B-TITLE | 2 | | |
| | B-TITLE | I-TITLE | 2 | | |
| | O | B-SOURCE | 1 | | |
| | B-SEASON | I-TITLE | 1 | | |
| | O | B-SEASON | 1 | | |
| ### Top entity-type confusions | |
| | true | pred | count | | |
| | --- | --- | --- | | |
| | O | TITLE | 19 | | |
| | O | EPISODE | 6 | | |
| | SOURCE | O | 4 | | |
| | TITLE | O | 3 | | |
| | EPISODE | O | 3 | | |
| | SEASON | O | 2 | | |
| | RESOLUTION | SOURCE | 2 | | |
| | EPISODE | TITLE | 2 | | |
| | O | SOURCE | 1 | | |
| | SEASON | TITLE | 1 | | |
| | O | SEASON | 1 | | |
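The two tables above can be derived from aligned gold and predicted label sequences. The following is a sketch under the assumption that the entity-type table drops the `B-`/`I-` prefix and ignores errors that stay within the same type; this matches the counts shown (for example, 17 `O`→`I-TITLE` plus 2 `O`→`B-TITLE` gives 19 `O`→`TITLE`, and `B-TITLE`→`I-TITLE` has no entity-type row). The function names are illustrative.

```python
from collections import Counter


def entity_type(label):
    """Strip the BIO prefix: 'B-TITLE' -> 'TITLE', 'O' -> 'O'."""
    return label if label == "O" else label.split("-", 1)[1]


def count_confusions(true_seqs, pred_seqs):
    """Count token-label and entity-type confusions over aligned sequences."""
    token_conf = Counter()
    type_conf = Counter()
    for true_seq, pred_seq in zip(true_seqs, pred_seqs):
        for t, p in zip(true_seq, pred_seq):
            if t == p:
                continue
            token_conf[(t, p)] += 1
            # Same-type errors (e.g. B-TITLE vs I-TITLE) are boundary-only.
            if entity_type(t) != entity_type(p):
                type_conf[(entity_type(t), entity_type(p))] += 1
    return token_conf, type_conf


# Toy example for illustration only.
true_seqs = [["O", "B-TITLE", "I-TITLE", "B-EPISODE"]]
pred_seqs = [["I-TITLE", "I-TITLE", "I-TITLE", "B-EPISODE"]]
token_conf, type_conf = count_confusions(true_seqs, pred_seqs)
print(token_conf.most_common())
print(type_conf.most_common())
```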
| ### Seqeval report | |
| ```text | |
|               precision    recall  f1-score   support | |
|      EPISODE     0.9535    0.9609    0.9572       128 | |
|        GROUP     1.0000    1.0000    1.0000        53 | |
|   RESOLUTION     1.0000    0.9545    0.9767        44 | |
|       SEASON     0.9630    0.8966    0.9286        29 | |
|       SOURCE     0.9703    0.9608    0.9655       102 | |
|      SPECIAL     1.0000    1.0000    1.0000         5 | |
|        TITLE     0.9211    0.9333    0.9272       150 | |
|    micro avg     0.9568    0.9530    0.9549       511 | |
|    macro avg     0.9725    0.9580    0.9650       511 | |
| weighted avg     0.9571    0.9530    0.9550       511 | |
| ``` | |
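The summary rows of the report can be cross-checked from the per-class rows. The sketch below uses the standard definitions (F1 as the harmonic mean of precision and recall, macro average as the unweighted mean over classes, weighted average as the support-weighted mean). Inputs are the rounded values printed above, so results agree with the report only to within rounding.

```python
# Per-class rows copied from the report: (precision, recall, f1, support).
REPORT = {
    "EPISODE": (0.9535, 0.9609, 0.9572, 128),
    "GROUP": (1.0000, 1.0000, 1.0000, 53),
    "RESOLUTION": (1.0000, 0.9545, 0.9767, 44),
    "SEASON": (0.9630, 0.8966, 0.9286, 29),
    "SOURCE": (0.9703, 0.9608, 0.9655, 102),
    "SPECIAL": (1.0000, 1.0000, 1.0000, 5),
    "TITLE": (0.9211, 0.9333, 0.9272, 150),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def macro_avg(rows, col):
    """Unweighted mean of one column over all classes."""
    return sum(row[col] for row in rows.values()) / len(rows)


def weighted_avg(rows, col):
    """Support-weighted mean of one column over all classes."""
    support = sum(row[3] for row in rows.values())
    return sum(row[col] * row[3] for row in rows.values()) / support


total_support = sum(row[3] for row in REPORT.values())
micro_f1 = f1_score(0.9568, 0.9530)
macro_f1 = macro_avg(REPORT, 2)
weighted_f1 = weighted_avg(REPORT, 2)
print(total_support, round(micro_f1, 4), round(macro_f1, 4), round(weighted_f1, 4))
```

The macro average sits above the micro average because small, perfectly predicted classes such as `SPECIAL` (5 entities) count as much as `TITLE` (150 entities), so the micro or weighted figures are the more representative ones here.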
| ## Recommended Pipeline | |
| 1. Use one tokenizer variant end to end and save it in the checkpoint metadata. | |
| 2. Prefer char-level or a deterministic hybrid tokenizer for DMHY filenames; avoid generic subword tokenization for labels. | |
| 3. For char-level runs, use `--tokenizer char --max-seq-length 128` with `vocab.char.json`. | |
| 4. Add CRF decoding or constrained BIO decoding so illegal I-X transitions and impossible boundary jumps are blocked. | |
| 5. Keep rule-assisted post-processing for high-confidence structural anchors: leading group bracket, ` - 07`, `S01E07`, source, and resolution. | |
| 6. Track entity-level F1 and field exact-match on real filenames; do not accept low validation loss alone. | |
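As an illustration of step 4, here is a minimal sketch of constrained BIO decoding using Viterbi search with hard transition constraints. It is not a trained CRF: there are no learned transition scores, only a rule that `I-X` may follow `B-X` or `I-X` and may not start a sequence. The label subset, the score values, and the function names are assumptions for the example; in practice the scores would be the model's per-token logits or log-probabilities.

```python
import math

# Small label subset for illustration; a real run would use the full label set.
LABELS = ["O", "B-TITLE", "I-TITLE", "B-EPISODE"]


def allowed(prev, curr):
    """I-X may only follow B-X or I-X; every other transition is legal."""
    if curr.startswith("I-"):
        return prev in ("B-" + curr[2:], "I-" + curr[2:])
    return True


def constrained_decode(scores, labels=LABELS):
    """Return the highest-scoring label path that respects BIO constraints."""
    if not scores:
        return []
    # best[i][label] = (best total score ending in label at i, previous label)
    best = [{
        lab: (-math.inf if lab.startswith("I-") else scores[0][j], None)
        for j, lab in enumerate(labels)
    }]
    for i in range(1, len(scores)):
        row = {}
        for j, lab in enumerate(labels):
            prev_score, prev_lab = max(
                (best[i - 1][p][0], p) for p in labels if allowed(p, lab)
            )
            row[lab] = (prev_score + scores[i][j], prev_lab)
        best.append(row)
    # Backtrack from the best final label.
    last = max(labels, key=lambda lab: best[-1][lab][0])
    path = [last]
    for i in range(len(scores) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]


# Per-token scores in LABELS order; plain argmax would give O, I-TITLE, I-TITLE.
scores = [
    [2.0, 1.5, 0.0, -1.0],
    [0.0, 0.5, 3.0, -1.0],
    [0.1, 0.0, 2.0, -1.0],
]
argmax_path = [LABELS[row.index(max(row))] for row in scores]
decoded = constrained_decode(scores)
print(argmax_path)
print(decoded)
```

In this toy case the unconstrained argmax opens a `TITLE` span with `I-TITLE`, while the constrained path relabels the first token as `B-TITLE` because that is the best-scoring legal sequence.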