# Anime Filename Parser Diagnostics Report

## Executive Summary

- Dataset: `datasets\AnimeName\dmhy_weak.jsonl`
- Inspected rows: 5,000
- Dataset tokenizer variant: `regex`
- Diagnosed tokenizer variant: `regex`
- Vocab: `datasets\AnimeName\vocab.json` (8,000 tokens)
- Max sequence length checked: 64
- O-label ratio: 38.12%
- Truncation risk: 0/5,000 rows (0.00%)
- UNK rate after selected tokenizer: 6.9158%
- BIO warnings collected: 9,711

Primary finding: this task is structural filename parsing. Keeping the tokenizer and preprocessing identical between training and inference matters more than lowering token-level loss.

## Label And Entity Statistics

### Label distribution

- `O`: 32,517 (38.12%)
- `I-TITLE`: 30,321 (35.54%)
- `B-TITLE`: 5,593 (6.56%)
- `B-EPISODE`: 5,000 (5.86%)
- `B-SOURCE`: 4,032 (4.73%)
- `I-GROUP`: 2,459 (2.88%)
- `B-GROUP`: 2,299 (2.69%)
- `B-RESOLUTION`: 1,765 (2.07%)
- `B-SEASON`: 1,269 (1.49%)
- `B-SPECIAL`: 57 (0.07%)

### Entity count

- `TITLE`: 6,061 (29.59%)
- `EPISODE`: 5,000 (24.41%)
- `SOURCE`: 4,032 (19.68%)
- `GROUP`: 2,299 (11.22%)
- `RESOLUTION`: 1,765 (8.62%)
- `SEASON`: 1,269 (6.20%)
- `SPECIAL`: 57 (0.28%)

### Length distribution

```json
{
  "raw_tokens": { "min": 3, "p50": 17, "p90": 28, "p95": 31, "p99": 39, "max": 54 },
  "aligned_tokens": { "min": 3, "p50": 17, "p90": 28, "p95": 31, "p99": 39, "max": 54 }
}
```

### Whitespace labels

- `I-TITLE`: 10,539 (48.98%)
- `O`: 10,484 (48.72%)
- `I-GROUP`: 411 (1.91%)
- `B-TITLE`: 84 (0.39%)

## BIO Violations And Boundary Drift

### Violation counts

- `B_DIRECT_TO_O`: 9,243 (95.18%)
- `ORPHAN_I`: 468 (4.82%)

### Boundary drift heuristics

- none

### Sample violations

```json
[
{ "type": "B_DIRECT_TO_O", "index": 8, "prev_label": "B-EPISODE", "label": "O", "token": ".", "row": 1, "file_id": 1, "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub", "context_tokens": [ ".", "Atelier", ".", "S01", "E07", ".", "1080p", ".", "NF", ".", "WEB-DL" ], "context_labels": [ "I-TITLE", "I-TITLE", "O", "B-SEASON", 
"B-EPISODE", "O", "B-RESOLUTION", "O", "B-SOURCE", "O", "B-SOURCE" ] }, { "type": "B_DIRECT_TO_O", "index": 10, "prev_label": "B-RESOLUTION", "label": "O", "token": ".", "row": 1, "file_id": 1, "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub", "context_tokens": [ ".", "S01", "E07", ".", "1080p", ".", "NF", ".", "WEB-DL", ".", "JP" ], "context_labels": [ "O", "B-SEASON", "B-EPISODE", "O", "B-RESOLUTION", "O", "B-SOURCE", "O", "B-SOURCE", "O", "B-SOURCE" ] }, { "type": "B_DIRECT_TO_O", "index": 12, "prev_label": "B-SOURCE", "label": "O", "token": ".", "row": 1, "file_id": 1, "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub", "context_tokens": [ "E07", ".", "1080p", ".", "NF", ".", "WEB-DL", ".", "JP", "N", "." ], "context_labels": [ "B-EPISODE", "O", "B-RESOLUTION", "O", "B-SOURCE", "O", "B-SOURCE", "O", "B-SOURCE", "O", "O" ] }, { "type": "B_DIRECT_TO_O", "index": 14, "prev_label": "B-SOURCE", "label": "O", "token": ".", "row": 1, "file_id": 1, "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub", "context_tokens": [ "1080p", ".", "NF", ".", "WEB-DL", ".", "JP", "N", ".", "AAC", "2" ], "context_labels": [ "B-RESOLUTION", "O", "B-SOURCE", "O", "B-SOURCE", "O", "B-SOURCE", "O", "O", "B-SOURCE", "O" ] }, { "type": "B_DIRECT_TO_O", "index": 16, "prev_label": "B-SOURCE", "label": "O", "token": "N", "row": 1, "file_id": 1, "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub", "context_tokens": [ "NF", ".", "WEB-DL", ".", "JP", "N", ".", "AAC", "2", ".", "0" ], "context_labels": [ "B-SOURCE", "O", "B-SOURCE", "O", "B-SOURCE", "O", "O", "B-SOURCE", "O", "O", "O" ] }, { "type": "B_DIRECT_TO_O", "index": 19, "prev_label": "B-SOURCE", "label": "O", "token": "2", "row": 1, "file_id": 1, "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub", "context_tokens": [ ".", "JP", "N", ".", "AAC", "2", ".", "0", 
".", "H.264", "." ], "context_labels": [ "O", "B-SOURCE", "O", "O", "B-SOURCE", "O", "O", "O", "O", "B-SOURCE", "O" ] }, { "type": "B_DIRECT_TO_O", "index": 24, "prev_label": "B-SOURCE", "label": "O", "token": ".", "row": 1, "file_id": 1, "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub", "context_tokens": [ "2", ".", "0", ".", "H.264", ".", "MSubs", "-", "ToonsHub" ], "context_labels": [ "O", "O", "O", "O", "B-SOURCE", "O", "B-SOURCE", "O", "O" ] }, { "type": "B_DIRECT_TO_O", "index": 26, "prev_label": "B-SOURCE", "label": "O", "token": "-", "row": 1, "file_id": 1, "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub", "context_tokens": [ "0", ".", "H.264", ".", "MSubs", "-", "ToonsHub" ], "context_labels": [ "O", "O", "B-SOURCE", "O", "B-SOURCE", "O", "O" ] }, { "type": "B_DIRECT_TO_O", "index": 2, "prev_label": "B-GROUP", "label": "O", "token": "]", "row": 2, "file_id": 2, "filename": "[LoliHouse] Maid-san wa Taberu Dake - 07 [WebRip 1080p HEVC-10bit AAC ASSx2]", "context_tokens": [ "[", "LoliHouse", "]", " ", "Maid", "-", "san", " " ], "context_labels": [ "O", "B-GROUP", "O", "O", "B-TITLE", "I-TITLE", "I-TITLE", "I-TITLE" ] }, { "type": "B_DIRECT_TO_O", "index": 17, "prev_label": "B-EPISODE", "label": "O", "token": " ", "row": 2, "file_id": 2, "filename": "[LoliHouse] Maid-san wa Taberu Dake - 07 [WebRip 1080p HEVC-10bit AAC ASSx2]", "context_tokens": [ "Dake", " ", "-", " ", "07", " ", "[WebRip 1080p HEVC-10bit AAC ASSx2]" ], "context_labels": [ "I-TITLE", "O", "O", "O", "B-EPISODE", "O", "O" ] }, { "type": "B_DIRECT_TO_O", "index": 2, "prev_label": "B-GROUP", "label": "O", "token": "]", "row": 3, "file_id": 3, "filename": "[ANi] 異世界悠閒農家 2 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]", "context_tokens": [ "[", "ANi", "]", " ", "異", "世", "界", "悠" ], "context_labels": [ "O", "B-GROUP", "O", "O", "B-TITLE", "I-TITLE", "I-TITLE", "I-TITLE" ] }, { "type": "B_DIRECT_TO_O", "index": 13, "prev_label": 
"B-SEASON", "label": "O", "token": " ", "row": 3, "file_id": 3, "filename": "[ANi] 異世界悠閒農家 2 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]", "context_tokens": [ "閒", "農", "家", " ", "2", " ", "-", " ", "06", " ", "[1080P]" ], "context_labels": [ "I-TITLE", "I-TITLE", "I-TITLE", "O", "B-SEASON", "O", "O", "O", "B-EPISODE", "O", "B-RESOLUTION" ] }, { "type": "B_DIRECT_TO_O", "index": 17, "prev_label": "B-EPISODE", "label": "O", "token": " ", "row": 3, "file_id": 3, "filename": "[ANi] 異世界悠閒農家 2 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]", "context_tokens": [ "2", " ", "-", " ", "06", " ", "[1080P]", "[Baha]", "[WEB-DL]", "[AAC AVC]", "[CHT]" ], "context_labels": [ "B-SEASON", "O", "O", "O", "B-EPISODE", "O", "B-RESOLUTION", "B-SOURCE", "B-SOURCE", "O", "B-SOURCE" ] }, { "type": "B_DIRECT_TO_O", "index": 21, "prev_label": "B-SOURCE", "label": "O", "token": "[AAC AVC]", "row": 3, "file_id": 3, "filename": "[ANi] 異世界悠閒農家 2 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]", "context_tokens": [ "06", " ", "[1080P]", "[Baha]", "[WEB-DL]", "[AAC AVC]", "[CHT]" ], "context_labels": [ "B-EPISODE", "O", "B-RESOLUTION", "B-SOURCE", "B-SOURCE", "O", "B-SOURCE" ] }, { "type": "B_DIRECT_TO_O", "index": 2, "prev_label": "B-GROUP", "label": "O", "token": "]", "row": 4, "file_id": 4, "filename": "[ANi] 木頭風紀委員和迷你裙 JK 的故事 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]", "context_tokens": [ "[", "ANi", "]", " ", "木", "頭", "風", "紀" ], "context_labels": [ "O", "B-GROUP", "O", "O", "B-TITLE", "I-TITLE", "I-TITLE", "I-TITLE" ] }, { "type": "B_DIRECT_TO_O", "index": 24, "prev_label": "B-EPISODE", "label": "O", "token": " ", "row": 4, "file_id": 4, "filename": "[ANi] 木頭風紀委員和迷你裙 JK 的故事 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]", "context_tokens": [ "事", " ", "-", " ", "06", " ", "[1080P]", "[Baha]", "[WEB-DL]", "[AAC AVC]", "[CHT]" ], "context_labels": [ "I-TITLE", "O", "O", "O", "B-EPISODE", "O", "B-RESOLUTION", "B-SOURCE", "B-SOURCE", "O", "B-SOURCE" ] }, { "type": "B_DIRECT_TO_O", "index": 28, "prev_label": 
"B-SOURCE", "label": "O", "token": "[AAC AVC]", "row": 4, "file_id": 4, "filename": "[ANi] 木頭風紀委員和迷你裙 JK 的故事 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]", "context_tokens": [ "06", " ", "[1080P]", "[Baha]", "[WEB-DL]", "[AAC AVC]", "[CHT]" ], "context_labels": [ "B-EPISODE", "O", "B-RESOLUTION", "B-SOURCE", "B-SOURCE", "O", "B-SOURCE" ] }, { "type": "B_DIRECT_TO_O", "index": 2, "prev_label": "B-GROUP", "label": "O", "token": "]", "row": 5, "file_id": 5, "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][GB][MP4]", "context_tokens": [ "[", "KissSub", "]", "[", "Shunkashuutou", " ", "Daikousha", " " ], "context_labels": [ "O", "B-GROUP", "O", "O", "B-TITLE", "I-TITLE", "I-TITLE", "I-TITLE" ] }, { "type": "B_DIRECT_TO_O", "index": 19, "prev_label": "B-SOURCE", "label": "O", "token": "[MP4]", "row": 5, "file_id": 5, "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][GB][MP4]", "context_tokens": [ "Mai", "]", "[05]", "[1080P]", "[GB]", "[MP4]" ], "context_labels": [ "I-TITLE", "O", "B-EPISODE", "B-RESOLUTION", "B-SOURCE", "O" ] }, { "type": "B_DIRECT_TO_O", "index": 2, "prev_label": "B-GROUP", "label": "O", "token": "]", "row": 6, "file_id": 6, "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][06][1080P][GB][MP4]", "context_tokens": [ "[", "KissSub", "]", "[", "Shunkashuutou", " ", "Daikousha", " " ], "context_labels": [ "O", "B-GROUP", "O", "O", "B-TITLE", "I-TITLE", "I-TITLE", "I-TITLE" ] }, { "type": "B_DIRECT_TO_O", "index": 19, "prev_label": "B-SOURCE", "label": "O", "token": "[MP4]", "row": 6, "file_id": 6, "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][06][1080P][GB][MP4]", "context_tokens": [ "Mai", "]", "[06]", "[1080P]", "[GB]", "[MP4]" ], "context_labels": [ "I-TITLE", "O", "B-EPISODE", "B-RESOLUTION", "B-SOURCE", "O" ] }, { "type": "B_DIRECT_TO_O", "index": 2, "prev_label": "B-GROUP", "label": "O", "token": "]", "row": 7, "file_id": 7, "filename": "[KissSub][Shunkashuutou Daikousha - Haru no 
Mai][06][1080P][BIG5][MP4]", "context_tokens": [ "[", "KissSub", "]", "[", "Shunkashuutou", " ", "Daikousha", " " ], "context_labels": [ "O", "B-GROUP", "O", "O", "B-TITLE", "I-TITLE", "I-TITLE", "I-TITLE" ] },
{ "type": "B_DIRECT_TO_O", "index": 19, "prev_label": "B-SOURCE", "label": "O", "token": "[MP4]", "row": 7, "file_id": 7, "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][06][1080P][BIG5][MP4]", "context_tokens": [ "Mai", "]", "[06]", "[1080P]", "[BIG5]", "[MP4]" ], "context_labels": [ "I-TITLE", "O", "B-EPISODE", "B-RESOLUTION", "B-SOURCE", "O" ] },
{ "type": "B_DIRECT_TO_O", "index": 2, "prev_label": "B-GROUP", "label": "O", "token": "]", "row": 8, "file_id": 8, "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][BIG5][MP4]", "context_tokens": [ "[", "KissSub", "]", "[", "Shunkashuutou", " ", "Daikousha", " " ], "context_labels": [ "O", "B-GROUP", "O", "O", "B-TITLE", "I-TITLE", "I-TITLE", "I-TITLE" ] },
{ "type": "B_DIRECT_TO_O", "index": 19, "prev_label": "B-SOURCE", "label": "O", "token": "[MP4]", "row": 8, "file_id": 8, "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][BIG5][MP4]", "context_tokens": [ "Mai", "]", "[05]", "[1080P]", "[BIG5]", "[MP4]" ], "context_labels": [ "I-TITLE", "O", "B-EPISODE", "B-RESOLUTION", "B-SOURCE", "O" ] },
{ "type": "B_DIRECT_TO_O", "index": 2, "prev_label": "B-GROUP", "label": "O", "token": "]", "row": 9, "file_id": 9, "filename": "[Airota][Sousou no Frieren][29][1080p AVC AAC][CHT]", "context_tokens": [ "[", "Airota", "]", "[", "Sousou", " ", "no", " " ], "context_labels": [ "O", "B-GROUP", "O", "O", "B-TITLE", "I-TITLE", "I-TITLE", "I-TITLE" ] },
{ "type": "B_DIRECT_TO_O", "index": 11, "prev_label": "B-EPISODE", "label": "O", "token": "[1080p AVC AAC]", "row": 9, "file_id": 9, "filename": "[Airota][Sousou no Frieren][29][1080p AVC AAC][CHT]", "context_tokens": [ "no", " ", "Frieren", "]", "[29]", "[1080p AVC AAC]", "[CHT]" ], "context_labels": [ "I-TITLE", "I-TITLE", "I-TITLE", "O", "B-EPISODE", "O", "B-SOURCE" ] },
{ "type": "B_DIRECT_TO_O", "index": 2, "prev_label": "B-GROUP", "label": "O", "token": "]", "row": 10, "file_id": 10, "filename": "[Airota][Sousou no Frieren][30][1080p AVC AAC][CHT]", "context_tokens": [ "[", "Airota", "]", "[", "Sousou", " ", "no", " " ], "context_labels": [ "O", "B-GROUP", "O", "O", "B-TITLE", "I-TITLE", "I-TITLE", "I-TITLE" ] },
{ "type": "B_DIRECT_TO_O", "index": 11, "prev_label": "B-EPISODE", "label": "O", "token": "[1080p AVC AAC]", "row": 10, "file_id": 10, "filename": "[Airota][Sousou no Frieren][30][1080p AVC AAC][CHT]", "context_tokens": [ "no", " ", "Frieren", "]", "[30]", "[1080p AVC AAC]", "[CHT]" ], "context_labels": [ "I-TITLE", "I-TITLE", "I-TITLE", "O", "B-EPISODE", "O", "B-SOURCE" ] },
{ "type": "B_DIRECT_TO_O", "index": 2, "prev_label": "B-GROUP", "label": "O", "token": "]", "row": 11, "file_id": 11, "filename": "[Airota][Sousou no Frieren][31][1080p AVC AAC][CHT]", "context_tokens": [ "[", "Airota", "]", "[", "Sousou", " ", "no", " " ], "context_labels": [ "O", "B-GROUP", "O", "O", "B-TITLE", "I-TITLE", "I-TITLE", "I-TITLE" ] }
]
```

## Tokenizer Split And Alignment

### Dataset tokens vs selected tokenizer mismatches

```json
[
{ "file_id": 2, "filename": "[LoliHouse] Maid-san wa Taberu Dake - 07 [WebRip 1080p HEVC-10bit AAC ASSx2]", "common_prefix": 0, "dataset_tokens": [ "[", "LoliHouse", "]", " ", "Maid", "-", "san", " ", "wa", " ", "Taberu", " ", "Dake", " ", "-", " ", "07", " ", "[WebRip 1080p HEVC-10bit AAC ASSx2]" ], "tokenizer_tokens": [ "[LoliHouse]", " ", "Maid", "-", "san", " ", "wa", " ", "Taberu", " ", "Dake", " ", "-", " ", "07", " ", "[WebRip 1080p HEVC-10bit AAC ASSx2]" ], "dataset_len": 19, "tokenizer_len": 17 },
{ "file_id": 3, "filename": "[ANi] 異世界悠閒農家 2 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]", "common_prefix": 0, "dataset_tokens": [ "[", "ANi", "]", " ", "異", "世", "界", "悠", "閒", "農", "家", " ", "2", " ", "-", " ", "06", " ", "[1080P]", "[Baha]", 
"[WEB-DL]", "[AAC AVC]", "[CHT]" ], "tokenizer_tokens": [ "[ANi]", " ", "異", "世", "界", "悠", "閒", "農", "家", " ", "2", " ", "-", " ", "06", " ", "[1080P]", "[Baha]", "[WEB-DL]", "[AAC AVC]", "[CHT]" ], "dataset_len": 23, "tokenizer_len": 21 }, { "file_id": 4, "filename": "[ANi] 木頭風紀委員和迷你裙 JK 的故事 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]", "common_prefix": 0, "dataset_tokens": [ "[", "ANi", "]", " ", "木", "頭", "風", "紀", "委", "員", "和", "迷", "你", "裙", " ", "JK", " ", "的", "故", "事", " ", "-", " ", "06", " ", "[1080P]", "[Baha]", "[WEB-DL]", "[AAC AVC]", "[CHT]" ], "tokenizer_tokens": [ "[ANi]", " ", "木", "頭", "風", "紀", "委", "員", "和", "迷", "你", "裙", " ", "JK", " ", "的", "故", "事", " ", "-", " ", "06", " ", "[1080P]", "[Baha]", "[WEB-DL]", "[AAC AVC]", "[CHT]" ], "dataset_len": 30, "tokenizer_len": 28 }, { "file_id": 5, "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][GB][MP4]", "common_prefix": 0, "dataset_tokens": [ "[", "KissSub", "]", "[", "Shunkashuutou", " ", "Daikousha", " ", "-", " ", "Haru", " ", "no", " ", "Mai", "]", "[05]", "[1080P]", "[GB]", "[MP4]" ], "tokenizer_tokens": [ "[KissSub]", "[Shunkashuutou Daikousha - Haru no Mai]", "[05]", "[1080P]", "[GB]", "[MP4]" ], "dataset_len": 20, "tokenizer_len": 6 }, { "file_id": 6, "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][06][1080P][GB][MP4]", "common_prefix": 0, "dataset_tokens": [ "[", "KissSub", "]", "[", "Shunkashuutou", " ", "Daikousha", " ", "-", " ", "Haru", " ", "no", " ", "Mai", "]", "[06]", "[1080P]", "[GB]", "[MP4]" ], "tokenizer_tokens": [ "[KissSub]", "[Shunkashuutou Daikousha - Haru no Mai]", "[06]", "[1080P]", "[GB]", "[MP4]" ], "dataset_len": 20, "tokenizer_len": 6 }, { "file_id": 7, "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][06][1080P][BIG5][MP4]", "common_prefix": 0, "dataset_tokens": [ "[", "KissSub", "]", "[", "Shunkashuutou", " ", "Daikousha", " ", "-", " ", "Haru", " ", "no", " ", "Mai", "]", "[06]", "[1080P]", "[BIG5]", "[MP4]" ], 
"tokenizer_tokens": [ "[KissSub]", "[Shunkashuutou Daikousha - Haru no Mai]", "[06]", "[1080P]", "[BIG5]", "[MP4]" ], "dataset_len": 20, "tokenizer_len": 6 }, { "file_id": 8, "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][BIG5][MP4]", "common_prefix": 0, "dataset_tokens": [ "[", "KissSub", "]", "[", "Shunkashuutou", " ", "Daikousha", " ", "-", " ", "Haru", " ", "no", " ", "Mai", "]", "[05]", "[1080P]", "[BIG5]", "[MP4]" ], "tokenizer_tokens": [ "[KissSub]", "[Shunkashuutou Daikousha - Haru no Mai]", "[05]", "[1080P]", "[BIG5]", "[MP4]" ], "dataset_len": 20, "tokenizer_len": 6 }, { "file_id": 9, "filename": "[Airota][Sousou no Frieren][29][1080p AVC AAC][CHT]", "common_prefix": 0, "dataset_tokens": [ "[", "Airota", "]", "[", "Sousou", " ", "no", " ", "Frieren", "]", "[29]", "[1080p AVC AAC]", "[CHT]" ], "tokenizer_tokens": [ "[Airota]", "[Sousou no Frieren]", "[29]", "[1080p AVC AAC]", "[CHT]" ], "dataset_len": 13, "tokenizer_len": 5 }, { "file_id": 10, "filename": "[Airota][Sousou no Frieren][30][1080p AVC AAC][CHT]", "common_prefix": 0, "dataset_tokens": [ "[", "Airota", "]", "[", "Sousou", " ", "no", " ", "Frieren", "]", "[30]", "[1080p AVC AAC]", "[CHT]" ], "tokenizer_tokens": [ "[Airota]", "[Sousou no Frieren]", "[30]", "[1080p AVC AAC]", "[CHT]" ], "dataset_len": 13, "tokenizer_len": 5 }, { "file_id": 11, "filename": "[Airota][Sousou no Frieren][31][1080p AVC AAC][CHT]", "common_prefix": 0, "dataset_tokens": [ "[", "Airota", "]", "[", "Sousou", " ", "no", " ", "Frieren", "]", "[31]", "[1080p AVC AAC]", "[CHT]" ], "tokenizer_tokens": [ "[Airota]", "[Sousou no Frieren]", "[31]", "[1080p AVC AAC]", "[CHT]" ], "dataset_len": 13, "tokenizer_len": 5 } ] ``` ### Split examples ```json [ { "file_id": 1, "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub", "dataset_tokens": [ "Witch", ".", "Hat", ".", "Atelier", ".", "S01", "E07", ".", "1080p", ".", "NF", ".", "WEB-DL", ".", "JP", "N", ".", "AAC", "2", ".", 
"0", ".", "H.264", ".", "MSubs", "-", "ToonsHub" ], "diagnosed_tokens": [ "Witch", ".", "Hat", ".", "Atelier", ".", "S01", "E07", ".", "1080p", ".", "NF", ".", "WEB-DL", ".", "JP", "N", ".", "AAC", "2", ".", "0", ".", "H.264", ".", "MSubs", "-", "ToonsHub" ], "regex_tokens": [ "Witch", ".", "Hat", ".", "Atelier", ".", "S01", "E07", ".", "1080p", ".", "NF", ".", "WEB-DL", ".", "JP", "N", ".", "AAC", "2", ".", "0", ".", "H.264", ".", "MSubs", "-", "ToonsHub" ], "char_tokens": [ "W", "i", "t", "c", "h", ".", "H", "a", "t", ".", "A", "t", "e", "l", "i", "e", "r", ".", "S", "0", "1", "E", "0", "7", ".", "1", "0", "8", "0", "p", ".", "N", "F", ".", "W", "E", "B", "-", "D", "L", ".", "J", "P", "N", ".", "A", "A", "C", "2", ".", "0", ".", "H", ".", "2", "6", "4", ".", "M", "S", "u", "b", "s", "-", "T", "o", "o", "n", "s", "H", "u", "b" ] }, { "file_id": 2, "filename": "[LoliHouse] Maid-san wa Taberu Dake - 07 [WebRip 1080p HEVC-10bit AAC ASSx2]", "dataset_tokens": [ "[", "LoliHouse", "]", " ", "Maid", "-", "san", " ", "wa", " ", "Taberu", " ", "Dake", " ", "-", " ", "07", " ", "[WebRip 1080p HEVC-10bit AAC ASSx2]" ], "diagnosed_tokens": [ "[LoliHouse]", " ", "Maid", "-", "san", " ", "wa", " ", "Taberu", " ", "Dake", " ", "-", " ", "07", " ", "[WebRip 1080p HEVC-10bit AAC ASSx2]" ], "regex_tokens": [ "[LoliHouse]", " ", "Maid", "-", "san", " ", "wa", " ", "Taberu", " ", "Dake", " ", "-", " ", "07", " ", "[WebRip 1080p HEVC-10bit AAC ASSx2]" ], "char_tokens": [ "[", "L", "o", "l", "i", "H", "o", "u", "s", "e", "]", " ", "M", "a", "i", "d", "-", "s", "a", "n", " ", "w", "a", " ", "T", "a", "b", "e", "r", "u", " ", "D", "a", "k", "e", " ", "-", " ", "0", "7", " ", "[", "W", "e", "b", "R", "i", "p", " ", "1", "0", "8", "0", "p", " ", "H", "E", "V", "C", "-", "1", "0", "b", "i", "t", " ", "A", "A", "C", " ", "A", "S", "S", "x", "2", "]" ] }, { "file_id": 3, "filename": "[ANi] 異世界悠閒農家 2 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]", "dataset_tokens": [ "[", "ANi", "]", " ", "異", "世", 
"界", "悠", "閒", "農", "家", " ", "2", " ", "-", " ", "06", " ", "[1080P]", "[Baha]", "[WEB-DL]", "[AAC AVC]", "[CHT]" ], "diagnosed_tokens": [ "[ANi]", " ", "異", "世", "界", "悠", "閒", "農", "家", " ", "2", " ", "-", " ", "06", " ", "[1080P]", "[Baha]", "[WEB-DL]", "[AAC AVC]", "[CHT]" ], "regex_tokens": [ "[ANi]", " ", "異", "世", "界", "悠", "閒", "農", "家", " ", "2", " ", "-", " ", "06", " ", "[1080P]", "[Baha]", "[WEB-DL]", "[AAC AVC]", "[CHT]" ], "char_tokens": [ "[", "A", "N", "i", "]", " ", "異", "世", "界", "悠", "閒", "農", "家", " ", "2", " ", "-", " ", "0", "6", " ", "[", "1", "0", "8", "0", "P", "]", "[", "B", "a", "h", "a", "]", "[", "W", "E", "B", "-", "D", "L", "]", "[", "A", "A", "C", " ", "A", "V", "C", "]", "[", "C", "H", "T", "]" ] }, { "file_id": 4, "filename": "[ANi] 木頭風紀委員和迷你裙 JK 的故事 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]", "dataset_tokens": [ "[", "ANi", "]", " ", "木", "頭", "風", "紀", "委", "員", "和", "迷", "你", "裙", " ", "JK", " ", "的", "故", "事", " ", "-", " ", "06", " ", "[1080P]", "[Baha]", "[WEB-DL]", "[AAC AVC]", "[CHT]" ], "diagnosed_tokens": [ "[ANi]", " ", "木", "頭", "風", "紀", "委", "員", "和", "迷", "你", "裙", " ", "JK", " ", "的", "故", "事", " ", "-", " ", "06", " ", "[1080P]", "[Baha]", "[WEB-DL]", "[AAC AVC]", "[CHT]" ], "regex_tokens": [ "[ANi]", " ", "木", "頭", "風", "紀", "委", "員", "和", "迷", "你", "裙", " ", "JK", " ", "的", "故", "事", " ", "-", " ", "06", " ", "[1080P]", "[Baha]", "[WEB-DL]", "[AAC AVC]", "[CHT]" ], "char_tokens": [ "[", "A", "N", "i", "]", " ", "木", "頭", "風", "紀", "委", "員", "和", "迷", "你", "裙", " ", "J", "K", " ", "的", "故", "事", " ", "-", " ", "0", "6", " ", "[", "1", "0", "8", "0", "P", "]", "[", "B", "a", "h", "a", "]", "[", "W", "E", "B", "-", "D", "L", "]", "[", "A", "A", "C", " ", "A", "V", "C", "]", "[", "C", "H", "T", "]" ] }, { "file_id": 5, "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][GB][MP4]", "dataset_tokens": [ "[", "KissSub", "]", "[", "Shunkashuutou", " ", "Daikousha", " ", "-", " ", "Haru", " ", "no", " 
", "Mai", "]", "[05]", "[1080P]", "[GB]", "[MP4]" ], "diagnosed_tokens": [ "[KissSub]", "[Shunkashuutou Daikousha - Haru no Mai]", "[05]", "[1080P]", "[GB]", "[MP4]" ], "regex_tokens": [ "[KissSub]", "[Shunkashuutou Daikousha - Haru no Mai]", "[05]", "[1080P]", "[GB]", "[MP4]" ], "char_tokens": [ "[", "K", "i", "s", "s", "S", "u", "b", "]", "[", "S", "h", "u", "n", "k", "a", "s", "h", "u", "u", "t", "o", "u", " ", "D", "a", "i", "k", "o", "u", "s", "h", "a", " ", "-", " ", "H", "a", "r", "u", " ", "n", "o", " ", "M", "a", "i", "]", "[", "0", "5", "]", "[", "1", "0", "8", "0", "P", "]", "[", "G", "B", "]", "[", "M", "P", "4", "]" ] }, { "file_id": 6, "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][06][1080P][GB][MP4]", "dataset_tokens": [ "[", "KissSub", "]", "[", "Shunkashuutou", " ", "Daikousha", " ", "-", " ", "Haru", " ", "no", " ", "Mai", "]", "[06]", "[1080P]", "[GB]", "[MP4]" ], "diagnosed_tokens": [ "[KissSub]", "[Shunkashuutou Daikousha - Haru no Mai]", "[06]", "[1080P]", "[GB]", "[MP4]" ], "regex_tokens": [ "[KissSub]", "[Shunkashuutou Daikousha - Haru no Mai]", "[06]", "[1080P]", "[GB]", "[MP4]" ], "char_tokens": [ "[", "K", "i", "s", "s", "S", "u", "b", "]", "[", "S", "h", "u", "n", "k", "a", "s", "h", "u", "u", "t", "o", "u", " ", "D", "a", "i", "k", "o", "u", "s", "h", "a", " ", "-", " ", "H", "a", "r", "u", " ", "n", "o", " ", "M", "a", "i", "]", "[", "0", "6", "]", "[", "1", "0", "8", "0", "P", "]", "[", "G", "B", "]", "[", "M", "P", "4", "]" ] }, { "file_id": 7, "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][06][1080P][BIG5][MP4]", "dataset_tokens": [ "[", "KissSub", "]", "[", "Shunkashuutou", " ", "Daikousha", " ", "-", " ", "Haru", " ", "no", " ", "Mai", "]", "[06]", "[1080P]", "[BIG5]", "[MP4]" ], "diagnosed_tokens": [ "[KissSub]", "[Shunkashuutou Daikousha - Haru no Mai]", "[06]", "[1080P]", "[BIG5]", "[MP4]" ], "regex_tokens": [ "[KissSub]", "[Shunkashuutou Daikousha - Haru no Mai]", "[06]", "[1080P]", "[BIG5]", 
"[MP4]" ], "char_tokens": [ "[", "K", "i", "s", "s", "S", "u", "b", "]", "[", "S", "h", "u", "n", "k", "a", "s", "h", "u", "u", "t", "o", "u", " ", "D", "a", "i", "k", "o", "u", "s", "h", "a", " ", "-", " ", "H", "a", "r", "u", " ", "n", "o", " ", "M", "a", "i", "]", "[", "0", "6", "]", "[", "1", "0", "8", "0", "P", "]", "[", "B", "I", "G", "5", "]", "[", "M", "P", "4", "]" ] }, { "file_id": 8, "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][BIG5][MP4]", "dataset_tokens": [ "[", "KissSub", "]", "[", "Shunkashuutou", " ", "Daikousha", " ", "-", " ", "Haru", " ", "no", " ", "Mai", "]", "[05]", "[1080P]", "[BIG5]", "[MP4]" ], "diagnosed_tokens": [ "[KissSub]", "[Shunkashuutou Daikousha - Haru no Mai]", "[05]", "[1080P]", "[BIG5]", "[MP4]" ], "regex_tokens": [ "[KissSub]", "[Shunkashuutou Daikousha - Haru no Mai]", "[05]", "[1080P]", "[BIG5]", "[MP4]" ], "char_tokens": [ "[", "K", "i", "s", "s", "S", "u", "b", "]", "[", "S", "h", "u", "n", "k", "a", "s", "h", "u", "u", "t", "o", "u", " ", "D", "a", "i", "k", "o", "u", "s", "h", "a", " ", "-", " ", "H", "a", "r", "u", " ", "n", "o", " ", "M", "a", "i", "]", "[", "0", "5", "]", "[", "1", "0", "8", "0", "P", "]", "[", "B", "I", "G", "5", "]", "[", "M", "P", "4", "]" ] } ] ``` ### Vocabulary coverage ```json { "total": 85312, "unk": 5900, "unk_rate": 0.06915791447861966, "top_unk": [ [ "(BDRip 720p x264)", 66 ], [ "Partie", 59 ], [ "incantevole", 54 ], [ "Muxed", 54 ], [ "nonscordarmi", 54 ], [ "NEET", 52 ], [ "Dousei", 52 ], [ "[krikoun68]", 52 ], [ "[Blu-Ray - MUX - 960p - x264 - AC3 ITA-JAP - SUB ITA]", 51 ], [ "CTR", 45 ], [ "joseol", 45 ], [ "e99", 45 ], [ "(1440x1080 h264 AC3 AAC)", 45 ], [ "VERS", 37 ], [ "脙", 37 ], [ "Shunkashuutou", 36 ], [ "Daikousha", 36 ], [ "houbatsu", 36 ], [ "DEFINITIVA", 36 ], [ "Crash", 35 ], [ "Realm", 31 ], [ "UHD", 31 ], [ "[BDrip 1080P HEVC-10bit AAC]", 29 ], [ "Choroi", 28 ], [ "완", 28 ] ] } ``` ## Train Inference Tokenizer Comparison - Model dir: 
`checkpoints\dmhy-finetune\final`
- Model tokenizer variant: `regex`
- Dataset tokenizer variant: `regex`
- Diagnostic tokenizer variant: `regex`
- Model tokenizer vocab size: 3,000
- Diagnostic tokenizer vocab size: 8,000

If dataset and model tokenizer variants differ, validation loss can be low while real inference sees different token IDs and boundaries.

## Model Confusion Analysis

- Evaluated samples: 128
- Entity precision: 0.9568
- Entity recall: 0.9530
- Entity F1: 0.9549

### Boundary error classes

- `B-boundary`: 26 (56.52%)
- `entity-type`: 20 (43.48%)

### Top token-label confusions

| true | pred | count |
| --- | --- | --- |
| O | I-TITLE | 17 |
| O | B-EPISODE | 6 |
| B-SOURCE | O | 4 |
| I-TITLE | O | 3 |
| B-EPISODE | O | 3 |
| B-SEASON | O | 2 |
| B-RESOLUTION | B-SOURCE | 2 |
| B-EPISODE | I-TITLE | 2 |
| O | B-TITLE | 2 |
| B-TITLE | I-TITLE | 2 |
| O | B-SOURCE | 1 |
| B-SEASON | I-TITLE | 1 |
| O | B-SEASON | 1 |

### Top entity-type confusions

| true | pred | count |
| --- | --- | --- |
| O | TITLE | 19 |
| O | EPISODE | 6 |
| SOURCE | O | 4 |
| TITLE | O | 3 |
| EPISODE | O | 3 |
| SEASON | O | 2 |
| RESOLUTION | SOURCE | 2 |
| EPISODE | TITLE | 2 |
| O | SOURCE | 1 |
| SEASON | TITLE | 1 |
| O | SEASON | 1 |

### Seqeval report

```text
              precision    recall  f1-score   support

     EPISODE     0.9535    0.9609    0.9572       128
       GROUP     1.0000    1.0000    1.0000        53
  RESOLUTION     1.0000    0.9545    0.9767        44
      SEASON     0.9630    0.8966    0.9286        29
      SOURCE     0.9703    0.9608    0.9655       102
     SPECIAL     1.0000    1.0000    1.0000         5
       TITLE     0.9211    0.9333    0.9272       150

   micro avg     0.9568    0.9530    0.9549       511
   macro avg     0.9725    0.9580    0.9650       511
weighted avg     0.9571    0.9530    0.9550       511
```

## Recommended Pipeline

1. Use one tokenizer variant end to end and save it in the checkpoint metadata.
2. Prefer char-level or a deterministic hybrid tokenizer for DMHY filenames; avoid generic subword tokenization for labels.
3. For char-level runs, use `--tokenizer char --max-seq-length 128` with `vocab.char.json`.
4. Add CRF decoding or constrained BIO decoding so illegal I-X transitions and impossible boundary jumps are blocked.
5. Keep rule-assisted post-processing for high-confidence structural anchors: leading group bracket, ` - 07`, `S01E07`, source, and resolution.
6. Track entity-level F1 and field exact-match on real filenames; do not accept low validation loss alone.
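
Recommendation 4 can be illustrated with a small constrained decoder. The sketch below is not the project's implementation: the function names, the reduced label set, and the example scores are all hypothetical. It runs a Viterbi search in which an `I-X` label may only follow `B-X` or `I-X`, so an orphan `I-X` can never be emitted. A `B-X` followed directly by `O` is left legal here, on the assumption that single-token entities such as `B-EPISODE` are valid in this dataset.

```python
import math

# Hypothetical, reduced label set for illustration only.
LABELS = ["O", "B-TITLE", "I-TITLE", "B-GROUP", "I-GROUP", "B-EPISODE"]


def allowed(prev, cur):
    """I-X may only follow B-X or I-X of the same type; all else is legal."""
    if cur.startswith("I-"):
        entity = cur[2:]
        return prev in ("B-" + entity, "I-" + entity)
    return True


def greedy(scores, labels=LABELS):
    """Unconstrained per-token argmax, shown only for comparison."""
    return [labels[max(range(len(labels)), key=row.__getitem__)] for row in scores]


def constrained_viterbi(scores, labels=LABELS):
    """Best label path under the BIO constraint.

    scores: list of per-token lists of log-scores, aligned with `labels`.
    """
    if not scores:
        return []
    n = len(labels)
    # An I-X label cannot open a sequence.
    best = [
        -math.inf if labels[j].startswith("I-") else scores[0][j]
        for j in range(n)
    ]
    backpointers = []
    for t in range(1, len(scores)):
        new_best, ptr = [], []
        for j in range(n):
            # Best legal predecessor for label j at position t.
            prev_score, i = max(
                (best[i], i) for i in range(n) if allowed(labels[i], labels[j])
            )
            new_best.append(prev_score + scores[t][j])
            ptr.append(i)
        best = new_best
        backpointers.append(ptr)
    # Trace back from the best final label.
    j = max(range(n), key=best.__getitem__)
    path = [j]
    for ptr in reversed(backpointers):
        j = ptr[j]
        path.append(j)
    return [labels[k] for k in reversed(path)]


# Made-up scores: the per-token argmax would yield an orphan I-TITLE.
example_scores = [
    [0.0, -1.0, -5.0, -5.0, -5.0, -5.0],
    [-2.0, -3.0, 0.0, -5.0, -5.0, -5.0],
]
print(greedy(example_scores))
print(constrained_viterbi(example_scores))
```

With these made-up scores the greedy path contains an `I-TITLE` with no preceding `B-TITLE`, while the constrained path trades a small amount of score on the first token for a legal `B-TITLE`, `I-TITLE` sequence. A trained CRF layer would learn transition scores instead of using a hard mask; the hard mask is the simpler of the two options named in the recommendation.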
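
The structural anchors named in recommendation 5 can be sketched with a few regular expressions. The patterns and the `extract_anchors` helper below are assumptions made for illustration, not the report's actual post-processing rules; they only cover the leading group bracket, `S01E07`, ` - 07`, and a `1080p`-style resolution, and they deliberately return nothing when no anchor matches.

```python
import re

# Hypothetical anchor patterns; real rules would need tuning on DMHY data.
LEADING_GROUP = re.compile(r"^\[([^\]]+)\]")
SEASON_EPISODE = re.compile(r"S(\d{1,2})E(\d{1,3})", re.IGNORECASE)
DASH_EPISODE = re.compile(r" - (\d{1,3})(?=\s|\[|$)")
RESOLUTION = re.compile(r"(\d{3,4})[pP]\b")


def extract_anchors(filename):
    """Return only the fields that a high-confidence pattern matched."""
    fields = {}
    m = LEADING_GROUP.match(filename)
    if m:
        fields["group"] = m.group(1)
    m = SEASON_EPISODE.search(filename)
    if m:
        fields["season"] = int(m.group(1))
        fields["episode"] = int(m.group(2))
    else:
        m = DASH_EPISODE.search(filename)
        if m:
            fields["episode"] = int(m.group(1))
    m = RESOLUTION.search(filename)
    if m:
        fields["resolution"] = m.group(1) + "p"  # normalise 1080P to 1080p
    return fields


# Filenames taken from the sample rows in this report.
samples = [
    "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub",
    "[LoliHouse] Maid-san wa Taberu Dake - 07 [WebRip 1080p HEVC-10bit AAC ASSx2]",
    "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][GB][MP4]",
]
for name in samples:
    print(extract_anchors(name))
```

The third sample shows the limit of this approach: its episode number sits in a bare `[05]` bracket, which none of these patterns treats as an anchor, so that field would still have to come from the model.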