Anime Filename Parser Diagnostics Report

Executive Summary

  • Dataset: datasets\AnimeName\dmhy_weak.jsonl
  • Inspected rows: 5,000
  • Dataset tokenizer variant: regex
  • Diagnosed tokenizer variant: regex
  • Vocab: datasets\AnimeName\vocab.json (8,000 tokens)
  • Max sequence length checked: 64
  • O-label ratio: 38.12%
  • Truncation risk: 0/5,000 rows (0.00%)
  • UNK rate with the selected tokenizer: 6.9158%
  • BIO warnings collected: 9,711

Primary finding: this task is structural filename parsing, so keeping the tokenizer and preprocessing identical between dataset construction and inference matters more than lowering token-level loss.
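As a rough illustration of how the UNK rate in the summary could be measured, the sketch below counts tokens that are missing from a vocabulary. The function name `unk_rate` and the exact-match lookup are assumptions; the report does not say whether the real pipeline normalizes tokens (for example by lowercasing) before the lookup.

```python
def unk_rate(token_rows, vocab):
    """Return the fraction of tokens that are not present in `vocab`.

    Assumption: a token counts as UNK when it has no exact match in the
    vocabulary. `vocab` can be any container that supports `in`.
    """
    total = sum(len(row) for row in token_rows)
    unknown = sum(1 for row in token_rows for tok in row if tok not in vocab)
    return unknown / total if total else 0.0


# Toy data for illustration only; not taken from vocab.json.
toy_vocab = {"[", "]", "ANi", " "}
toy_rows = [["[", "ANi", "]", " "], ["[", "Airota", "]", "[CHT]"]]
toy_rate = unk_rate(toy_rows, toy_vocab)  # 2 unknown tokens out of 8
```

A set or dict is a sensible container here because membership tests stay constant-time even for an 8,000-token vocabulary.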

Label And Entity Statistics

Label distribution

  • O: 32,517 (38.12%)
  • I-TITLE: 30,321 (35.54%)
  • B-TITLE: 5,593 (6.56%)
  • B-EPISODE: 5,000 (5.86%)
  • B-SOURCE: 4,032 (4.73%)
  • I-GROUP: 2,459 (2.88%)
  • B-GROUP: 2,299 (2.69%)
  • B-RESOLUTION: 1,765 (2.07%)
  • B-SEASON: 1,269 (1.49%)
  • B-SPECIAL: 57 (0.07%)
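The percentages above can be cross-checked from the raw counts. The sketch below is one way to do that; `label_distribution` is a hypothetical helper, not a function from the project.

```python
from collections import Counter


def label_distribution(label_rows):
    """Return (label, count, percent) tuples sorted by descending count."""
    counts = Counter(label for row in label_rows for label in row)
    total = sum(counts.values())
    return [(label, n, round(100.0 * n / total, 2))
            for label, n in counts.most_common()]


# Counts copied from the label distribution reported above.
reported_counts = {
    "O": 32517, "I-TITLE": 30321, "B-TITLE": 5593, "B-EPISODE": 5000,
    "B-SOURCE": 4032, "I-GROUP": 2459, "B-GROUP": 2299,
    "B-RESOLUTION": 1765, "B-SEASON": 1269, "B-SPECIAL": 57,
}
total_labels = sum(reported_counts.values())
shares = {label: round(100.0 * n / total_labels, 2)
          for label, n in reported_counts.items()}
```

With these counts the total is 85,312 labels, and the recomputed share for O agrees with the 38.12% O-label ratio quoted in the executive summary.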

Entity count

  • TITLE: 6,061 (29.59%)
  • EPISODE: 5,000 (24.41%)
  • SOURCE: 4,032 (19.68%)
  • GROUP: 2,299 (11.22%)
  • RESOLUTION: 1,765 (8.62%)
  • SEASON: 1,269 (6.20%)
  • SPECIAL: 57 (0.28%)
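The TITLE entity count (6,061) exceeds the B-TITLE label count (5,593) by 468, which equals the ORPHAN_I count reported later in this document. One possible reading, which the report does not confirm, is that an I- tag with no preceding B- tag is counted as opening a new entity. The sketch below extracts spans under that assumption; `extract_entities` is a hypothetical name.

```python
from collections import Counter


def extract_entities(labels):
    """Return (type, start, end_exclusive) spans from a BIO label sequence.

    Assumption: an I-X tag that does not continue an open X span starts a
    new span, so orphan I- tags are still counted as entities.
    """
    spans = []
    current_type, start = None, None
    for i, label in enumerate(labels):
        prefix, _, ent_type = label.partition("-")
        if prefix == "I" and ent_type == current_type:
            continue  # the open span keeps growing
        if current_type is not None:
            spans.append((current_type, start, i))
            current_type, start = None, None
        if prefix in ("B", "I"):
            current_type, start = ent_type, i
    if current_type is not None:
        spans.append((current_type, start, len(labels)))
    return spans


# Leading labels of the "[LoliHouse] Maid-san ..." sample in this report.
sample_labels = ["O", "B-GROUP", "O", "O",
                 "B-TITLE", "I-TITLE", "I-TITLE", "I-TITLE"]
sample_spans = extract_entities(sample_labels)
entity_counts = Counter(ent_type for ent_type, _, _ in sample_spans)
```

A stricter variant could drop orphan I- spans instead of counting them; which behaviour the diagnostics tool uses is not stated in the report.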

Length distribution

{
  "raw_tokens": {
    "min": 3,
    "p50": 17,
    "p90": 28,
    "p95": 31,
    "p99": 39,
    "max": 54
  },
  "aligned_tokens": {
    "min": 3,
    "p50": 17,
    "p90": 28,
    "p95": 31,
    "p99": 39,
    "max": 54
  }
}
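The longest row has 54 tokens, which is below the checked maximum of 64 and is consistent with the 0/5,000 truncation risk in the summary. The sketch below shows one way such a length summary could be produced. The nearest-rank percentile rule and the choice not to count special tokens are assumptions; the report states neither.

```python
import math


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile (assumed method; the report does not name one)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def length_summary(lengths, max_seq_len=64):
    """Summarize token counts and count rows longer than `max_seq_len`."""
    summary = {"min": min(lengths)}
    for pct in (50, 90, 95, 99):
        summary[f"p{pct}"] = nearest_rank_percentile(lengths, pct)
    summary["max"] = max(lengths)
    # Assumption: special tokens such as [CLS]/[SEP] are not counted here.
    summary["over_limit"] = sum(1 for n in lengths if n > max_seq_len)
    return summary


# Toy lengths for illustration only.
toy_summary = length_summary([3, 17, 17, 28, 54])
```

If the model adds special tokens, the effective limit per row would be lower than `max_seq_len`, so the check should subtract them before comparing.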

Whitespace labels

  • I-TITLE: 10,539 (48.98%)
  • O: 10,484 (48.72%)
  • I-GROUP: 411 (1.91%)
  • B-TITLE: 84 (0.39%)
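The 84 whitespace tokens labeled B-TITLE mean that some title spans begin on a space. A minimal sketch for counting whitespace labels and locating such starts follows; both helper names are hypothetical, and `str.isspace` is assumed to match what the report treats as whitespace.

```python
from collections import Counter


def whitespace_label_counts(tokens, labels):
    """Count the labels carried by whitespace-only tokens."""
    return Counter(label for token, label in zip(tokens, labels)
                   if token.isspace())


def whitespace_entity_starts(tokens, labels):
    """Return indices where a whitespace token carries a B- label."""
    return [i for i, (token, label) in enumerate(zip(tokens, labels))
            if token.isspace() and label.startswith("B-")]


# Leading tokens and labels of the "[Airota][Sousou no Frieren]..." sample.
ws_tokens = ["[", "Airota", "]", "[", "Sousou", " ", "no", " "]
ws_labels = ["O", "B-GROUP", "O", "O",
             "B-TITLE", "I-TITLE", "I-TITLE", "I-TITLE"]
ws_counts = whitespace_label_counts(ws_tokens, ws_labels)
ws_starts = whitespace_entity_starts(ws_tokens, ws_labels)
```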

BIO Violations And Boundary Drift

Violation counts

  • B_DIRECT_TO_O: 9,243 (95.18%)
  • ORPHAN_I: 468 (4.82%)
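The report does not define the two warning types. Judging from the samples, B_DIRECT_TO_O appears to fire when an O label immediately follows a B- label, and ORPHAN_I presumably marks an I-X label that does not follow B-X or I-X. Under standard BIO tagging a B- tag followed by O is a legal single-token entity, so the first type reads more like a heuristic warning than a strict scheme violation. The sketch below implements both checks under those assumed definitions.

```python
def bio_warnings(labels):
    """Return (type, index, prev_label, label) tuples for one label sequence.

    Assumed definitions, inferred from the samples in this report:
      B_DIRECT_TO_O: an O label directly after a B- label.
      ORPHAN_I:      an I-X label not preceded by B-X or I-X.
    """
    warnings = []
    prev = "O"  # treat the position before the first token as outside
    for i, label in enumerate(labels):
        if label.startswith("I-"):
            ent_type = label[2:]
            if prev not in (f"B-{ent_type}", f"I-{ent_type}"):
                warnings.append(("ORPHAN_I", i, prev, label))
        elif label == "O" and prev.startswith("B-"):
            warnings.append(("B_DIRECT_TO_O", i, prev, label))
        prev = label
    return warnings


# Labels for file_id 9, reassembled from its two overlapping sample windows.
row9_labels = ["O", "B-GROUP", "O", "O", "B-TITLE", "I-TITLE", "I-TITLE",
               "I-TITLE", "I-TITLE", "O", "B-EPISODE", "O", "B-SOURCE"]
row9_warnings = bio_warnings(row9_labels)
```

On this row the sketch flags indices 2 and 11, the same positions the report lists for file_id 9.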

Boundary drift heuristics

  • none

Sample violations

[
  {
    "type": "B_DIRECT_TO_O",
    "index": 8,
    "prev_label": "B-EPISODE",
    "label": "O",
    "token": ".",
    "row": 1,
    "file_id": 1,
    "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub",
    "context_tokens": [
      ".",
      "Atelier",
      ".",
      "S01",
      "E07",
      ".",
      "1080p",
      ".",
      "NF",
      ".",
      "WEB-DL"
    ],
    "context_labels": [
      "I-TITLE",
      "I-TITLE",
      "O",
      "B-SEASON",
      "B-EPISODE",
      "O",
      "B-RESOLUTION",
      "O",
      "B-SOURCE",
      "O",
      "B-SOURCE"
    ]
  },
  {
    "type": "B_DIRECT_TO_O",
    "index": 10,
    "prev_label": "B-RESOLUTION",
    "label": "O",
    "token": ".",
    "row": 1,
    "file_id": 1,
    "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub",
    "context_tokens": [
      ".",
      "S01",
      "E07",
      ".",
      "1080p",
      ".",
      "NF",
      ".",
      "WEB-DL",
      ".",
      "JP"
    ],
    "context_labels": [
      "O",
      "B-SEASON",
      "B-EPISODE",
      "O",
      "B-RESOLUTION",
      "O",
      "B-SOURCE",
      "O",
      "B-SOURCE",
      "O",
      "B-SOURCE"
    ]
  },
  {
    "type": "B_DIRECT_TO_O",
    "index": 12,
    "prev_label": "B-SOURCE",
    "label": "O",
    "token": ".",
    "row": 1,
    "file_id": 1,
    "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub",
    "context_tokens": [
      "E07",
      ".",
      "1080p",
      ".",
      "NF",
      ".",
      "WEB-DL",
      ".",
      "JP",
      "N",
      "."
    ],
    "context_labels": [
      "B-EPISODE",
      "O",
      "B-RESOLUTION",
      "O",
      "B-SOURCE",
      "O",
      "B-SOURCE",
      "O",
      "B-SOURCE",
      "O",
      "O"
    ]
  },
  {
    "type": "B_DIRECT_TO_O",
    "index": 14,
    "prev_label": "B-SOURCE",
    "label": "O",
    "token": ".",
    "row": 1,
    "file_id": 1,
    "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub",
    "context_tokens": [
      "1080p",
      ".",
      "NF",
      ".",
      "WEB-DL",
      ".",
      "JP",
      "N",
      ".",
      "AAC",
      "2"
    ],
    "context_labels": [
      "B-RESOLUTION",
      "O",
      "B-SOURCE",
      "O",
      "B-SOURCE",
      "O",
      "B-SOURCE",
      "O",
      "O",
      "B-SOURCE",
      "O"
    ]
  },
  {
    "type": "B_DIRECT_TO_O",
    "index": 16,
    "prev_label": "B-SOURCE",
    "label": "O",
    "token": "N",
    "row": 1,
    "file_id": 1,
    "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub",
    "context_tokens": [
      "NF",
      ".",
      "WEB-DL",
      ".",
      "JP",
      "N",
      ".",
      "AAC",
      "2",
      ".",
      "0"
    ],
    "context_labels": [
      "B-SOURCE",
      "O",
      "B-SOURCE",
      "O",
      "B-SOURCE",
      "O",
      "O",
      "B-SOURCE",
      "O",
      "O",
      "O"
    ]
  },
  {
    "type": "B_DIRECT_TO_O",
    "index": 19,
    "prev_label": "B-SOURCE",
    "label": "O",
    "token": "2",
    "row": 1,
    "file_id": 1,
    "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub",
    "context_tokens": [
      ".",
      "JP",
      "N",
      ".",
      "AAC",
      "2",
      ".",
      "0",
      ".",
      "H.264",
      "."
    ],
    "context_labels": [
      "O",
      "B-SOURCE",
      "O",
      "O",
      "B-SOURCE",
      "O",
      "O",
      "O",
      "O",
      "B-SOURCE",
      "O"
    ]
  },
  {
    "type": "B_DIRECT_TO_O",
    "index": 24,
    "prev_label": "B-SOURCE",
    "label": "O",
    "token": ".",
    "row": 1,
    "file_id": 1,
    "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub",
    "context_tokens": [
      "2",
      ".",
      "0",
      ".",
      "H.264",
      ".",
      "MSubs",
      "-",
      "ToonsHub"
    ],
    "context_labels": [
      "O",
      "O",
      "O",
      "O",
      "B-SOURCE",
      "O",
      "B-SOURCE",
      "O",
      "O"
    ]
  },
  {
    "type": "B_DIRECT_TO_O",
    "index": 26,
    "prev_label": "B-SOURCE",
    "label": "O",
    "token": "-",
    "row": 1,
    "file_id": 1,
    "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub",
    "context_tokens": [
      "0",
      ".",
      "H.264",
      ".",
      "MSubs",
      "-",
      "ToonsHub"
    ],
    "context_labels": [
      "O",
      "O",
      "B-SOURCE",
      "O",
      "B-SOURCE",
      "O",
      "O"
    ]
  },
  {
    "type": "B_DIRECT_TO_O",
    "index": 2,
    "prev_label": "B-GROUP",
    "label": "O",
    "token": "]",
    "row": 2,
    "file_id": 2,
    "filename": "[LoliHouse] Maid-san wa Taberu Dake - 07 [WebRip 1080p HEVC-10bit AAC ASSx2]",
    "context_tokens": [
      "[",
      "LoliHouse",
      "]",
      " ",
      "Maid",
      "-",
      "san",
      " "
    ],
    "context_labels": [
      "O",
      "B-GROUP",
      "O",
      "O",
      "B-TITLE",
      "I-TITLE",
      "I-TITLE",
      "I-TITLE"
    ]
  },
  {
    "type": "B_DIRECT_TO_O",
    "index": 17,
    "prev_label": "B-EPISODE",
    "label": "O",
    "token": " ",
    "row": 2,
    "file_id": 2,
    "filename": "[LoliHouse] Maid-san wa Taberu Dake - 07 [WebRip 1080p HEVC-10bit AAC ASSx2]",
    "context_tokens": [
      "Dake",
      " ",
      "-",
      " ",
      "07",
      " ",
      "[WebRip 1080p HEVC-10bit AAC ASSx2]"
    ],
    "context_labels": [
      "I-TITLE",
      "O",
      "O",
      "O",
      "B-EPISODE",
      "O",
      "O"
    ]
  },
  {
    "type": "B_DIRECT_TO_O",
    "index": 2,
    "prev_label": "B-GROUP",
    "label": "O",
    "token": "]",
    "row": 3,
    "file_id": 3,
    "filename": "[ANi] 異世界悠閒農家 2 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]",
    "context_tokens": [
      "[",
      "ANi",
      "]",
      " ",
      "異",
      "世",
      "界",
      "悠"
    ],
    "context_labels": [
      "O",
      "B-GROUP",
      "O",
      "O",
      "B-TITLE",
      "I-TITLE",
      "I-TITLE",
      "I-TITLE"
    ]
  },
  {
    "type": "B_DIRECT_TO_O",
    "index": 13,
    "prev_label": "B-SEASON",
    "label": "O",
    "token": " ",
    "row": 3,
    "file_id": 3,
    "filename": "[ANi] 異世界悠閒農家 2 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]",
    "context_tokens": [
      "閒",
      "農",
      "家",
      " ",
      "2",
      " ",
      "-",
      " ",
      "06",
      " ",
      "[1080P]"
    ],
    "context_labels": [
      "I-TITLE",
      "I-TITLE",
      "I-TITLE",
      "O",
      "B-SEASON",
      "O",
      "O",
      "O",
      "B-EPISODE",
      "O",
      "B-RESOLUTION"
    ]
  },
  {
    "type": "B_DIRECT_TO_O",
    "index": 17,
    "prev_label": "B-EPISODE",
    "label": "O",
    "token": " ",
    "row": 3,
    "file_id": 3,
    "filename": "[ANi] 異世界悠閒農家 2 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]",
    "context_tokens": [
      "2",
      " ",
      "-",
      " ",
      "06",
      " ",
      "[1080P]",
      "[Baha]",
      "[WEB-DL]",
      "[AAC AVC]",
      "[CHT]"
    ],
    "context_labels": [
      "B-SEASON",
      "O",
      "O",
      "O",
      "B-EPISODE",
      "O",
      "B-RESOLUTION",
      "B-SOURCE",
      "B-SOURCE",
      "O",
      "B-SOURCE"
    ]
  },
  {
    "type": "B_DIRECT_TO_O",
    "index": 21,
    "prev_label": "B-SOURCE",
    "label": "O",
    "token": "[AAC AVC]",
    "row": 3,
    "file_id": 3,
    "filename": "[ANi] 異世界悠閒農家 2 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]",
    "context_tokens": [
      "06",
      " ",
      "[1080P]",
      "[Baha]",
      "[WEB-DL]",
      "[AAC AVC]",
      "[CHT]"
    ],
    "context_labels": [
      "B-EPISODE",
      "O",
      "B-RESOLUTION",
      "B-SOURCE",
      "B-SOURCE",
      "O",
      "B-SOURCE"
    ]
  },
  {
    "type": "B_DIRECT_TO_O",
    "index": 2,
    "prev_label": "B-GROUP",
    "label": "O",
    "token": "]",
    "row": 4,
    "file_id": 4,
    "filename": "[ANi] 木頭風紀委員和迷你裙 JK 的故事 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]",
    "context_tokens": [
      "[",
      "ANi",
      "]",
      " ",
      "木",
      "頭",
      "風",
      "紀"
    ],
    "context_labels": [
      "O",
      "B-GROUP",
      "O",
      "O",
      "B-TITLE",
      "I-TITLE",
      "I-TITLE",
      "I-TITLE"
    ]
  },
  {
    "type": "B_DIRECT_TO_O",
    "index": 24,
    "prev_label": "B-EPISODE",
    "label": "O",
    "token": " ",
    "row": 4,
    "file_id": 4,
    "filename": "[ANi] 木頭風紀委員和迷你裙 JK 的故事 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]",
    "context_tokens": [
      "事",
      " ",
      "-",
      " ",
      "06",
      " ",
      "[1080P]",
      "[Baha]",
      "[WEB-DL]",
      "[AAC AVC]",
      "[CHT]"
    ],
    "context_labels": [
      "I-TITLE",
      "O",
      "O",
      "O",
      "B-EPISODE",
      "O",
      "B-RESOLUTION",
      "B-SOURCE",
      "B-SOURCE",
      "O",
      "B-SOURCE"
    ]
  },
  {
    "type": "B_DIRECT_TO_O",
    "index": 28,
    "prev_label": "B-SOURCE",
    "label": "O",
    "token": "[AAC AVC]",
    "row": 4,
    "file_id": 4,
    "filename": "[ANi] 木頭風紀委員和迷你裙 JK 的故事 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]",
    "context_tokens": [
      "06",
      " ",
      "[1080P]",
      "[Baha]",
      "[WEB-DL]",
      "[AAC AVC]",
      "[CHT]"
    ],
    "context_labels": [
      "B-EPISODE",
      "O",
      "B-RESOLUTION",
      "B-SOURCE",
      "B-SOURCE",
      "O",
      "B-SOURCE"
    ]
  },
  {
    "type": "B_DIRECT_TO_O",
    "index": 2,
    "prev_label": "B-GROUP",
    "label": "O",
    "token": "]",
    "row": 5,
    "file_id": 5,
    "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][GB][MP4]",
    "context_tokens": [
      "[",
      "KissSub",
      "]",
      "[",
      "Shunkashuutou",
      " ",
      "Daikousha",
      " "
    ],
    "context_labels": [
      "O",
      "B-GROUP",
      "O",
      "O",
      "B-TITLE",
      "I-TITLE",
      "I-TITLE",
      "I-TITLE"
    ]
  },
  {
    "type": "B_DIRECT_TO_O",
    "index": 19,
    "prev_label": "B-SOURCE",
    "label": "O",
    "token": "[MP4]",
    "row": 5,
    "file_id": 5,
    "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][GB][MP4]",
    "context_tokens": [
      "Mai",
      "]",
      "[05]",
      "[1080P]",
      "[GB]",
      "[MP4]"
    ],
    "context_labels": [
      "I-TITLE",
      "O",
      "B-EPISODE",
      "B-RESOLUTION",
      "B-SOURCE",
      "O"
    ]
  },
  {
    "type": "B_DIRECT_TO_O",
    "index": 2,
    "prev_label": "B-GROUP",
    "label": "O",
    "token": "]",
    "row": 6,
    "file_id": 6,
    "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][06][1080P][GB][MP4]",
    "context_tokens": [
      "[",
      "KissSub",
      "]",
      "[",
      "Shunkashuutou",
      " ",
      "Daikousha",
      " "
    ],
    "context_labels": [
      "O",
      "B-GROUP",
      "O",
      "O",
      "B-TITLE",
      "I-TITLE",
      "I-TITLE",
      "I-TITLE"
    ]
  },
  {
    "type": "B_DIRECT_TO_O",
    "index": 19,
    "prev_label": "B-SOURCE",
    "label": "O",
    "token": "[MP4]",
    "row": 6,
    "file_id": 6,
    "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][06][1080P][GB][MP4]",
    "context_tokens": [
      "Mai",
      "]",
      "[06]",
      "[1080P]",
      "[GB]",
      "[MP4]"
    ],
    "context_labels": [
      "I-TITLE",
      "O",
      "B-EPISODE",
      "B-RESOLUTION",
      "B-SOURCE",
      "O"
    ]
  },
  {
    "type": "B_DIRECT_TO_O",
    "index": 2,
    "prev_label": "B-GROUP",
    "label": "O",
    "token": "]",
    "row": 7,
    "file_id": 7,
    "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][06][1080P][BIG5][MP4]",
    "context_tokens": [
      "[",
      "KissSub",
      "]",
      "[",
      "Shunkashuutou",
      " ",
      "Daikousha",
      " "
    ],
    "context_labels": [
      "O",
      "B-GROUP",
      "O",
      "O",
      "B-TITLE",
      "I-TITLE",
      "I-TITLE",
      "I-TITLE"
    ]
  },
  {
    "type": "B_DIRECT_TO_O",
    "index": 19,
    "prev_label": "B-SOURCE",
    "label": "O",
    "token": "[MP4]",
    "row": 7,
    "file_id": 7,
    "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][06][1080P][BIG5][MP4]",
    "context_tokens": [
      "Mai",
      "]",
      "[06]",
      "[1080P]",
      "[BIG5]",
      "[MP4]"
    ],
    "context_labels": [
      "I-TITLE",
      "O",
      "B-EPISODE",
      "B-RESOLUTION",
      "B-SOURCE",
      "O"
    ]
  },
  {
    "type": "B_DIRECT_TO_O",
    "index": 2,
    "prev_label": "B-GROUP",
    "label": "O",
    "token": "]",
    "row": 8,
    "file_id": 8,
    "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][BIG5][MP4]",
    "context_tokens": [
      "[",
      "KissSub",
      "]",
      "[",
      "Shunkashuutou",
      " ",
      "Daikousha",
      " "
    ],
    "context_labels": [
      "O",
      "B-GROUP",
      "O",
      "O",
      "B-TITLE",
      "I-TITLE",
      "I-TITLE",
      "I-TITLE"
    ]
  },
  {
    "type": "B_DIRECT_TO_O",
    "index": 19,
    "prev_label": "B-SOURCE",
    "label": "O",
    "token": "[MP4]",
    "row": 8,
    "file_id": 8,
    "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][BIG5][MP4]",
    "context_tokens": [
      "Mai",
      "]",
      "[05]",
      "[1080P]",
      "[BIG5]",
      "[MP4]"
    ],
    "context_labels": [
      "I-TITLE",
      "O",
      "B-EPISODE",
      "B-RESOLUTION",
      "B-SOURCE",
      "O"
    ]
  },
  {
    "type": "B_DIRECT_TO_O",
    "index": 2,
    "prev_label": "B-GROUP",
    "label": "O",
    "token": "]",
    "row": 9,
    "file_id": 9,
    "filename": "[Airota][Sousou no Frieren][29][1080p AVC AAC][CHT]",
    "context_tokens": [
      "[",
      "Airota",
      "]",
      "[",
      "Sousou",
      " ",
      "no",
      " "
    ],
    "context_labels": [
      "O",
      "B-GROUP",
      "O",
      "O",
      "B-TITLE",
      "I-TITLE",
      "I-TITLE",
      "I-TITLE"
    ]
  },
  {
    "type": "B_DIRECT_TO_O",
    "index": 11,
    "prev_label": "B-EPISODE",
    "label": "O",
    "token": "[1080p AVC AAC]",
    "row": 9,
    "file_id": 9,
    "filename": "[Airota][Sousou no Frieren][29][1080p AVC AAC][CHT]",
    "context_tokens": [
      "no",
      " ",
      "Frieren",
      "]",
      "[29]",
      "[1080p AVC AAC]",
      "[CHT]"
    ],
    "context_labels": [
      "I-TITLE",
      "I-TITLE",
      "I-TITLE",
      "O",
      "B-EPISODE",
      "O",
      "B-SOURCE"
    ]
  },
  {
    "type": "B_DIRECT_TO_O",
    "index": 2,
    "prev_label": "B-GROUP",
    "label": "O",
    "token": "]",
    "row": 10,
    "file_id": 10,
    "filename": "[Airota][Sousou no Frieren][30][1080p AVC AAC][CHT]",
    "context_tokens": [
      "[",
      "Airota",
      "]",
      "[",
      "Sousou",
      " ",
      "no",
      " "
    ],
    "context_labels": [
      "O",
      "B-GROUP",
      "O",
      "O",
      "B-TITLE",
      "I-TITLE",
      "I-TITLE",
      "I-TITLE"
    ]
  },
  {
    "type": "B_DIRECT_TO_O",
    "index": 11,
    "prev_label": "B-EPISODE",
    "label": "O",
    "token": "[1080p AVC AAC]",
    "row": 10,
    "file_id": 10,
    "filename": "[Airota][Sousou no Frieren][30][1080p AVC AAC][CHT]",
    "context_tokens": [
      "no",
      " ",
      "Frieren",
      "]",
      "[30]",
      "[1080p AVC AAC]",
      "[CHT]"
    ],
    "context_labels": [
      "I-TITLE",
      "I-TITLE",
      "I-TITLE",
      "O",
      "B-EPISODE",
      "O",
      "B-SOURCE"
    ]
  },
  {
    "type": "B_DIRECT_TO_O",
    "index": 2,
    "prev_label": "B-GROUP",
    "label": "O",
    "token": "]",
    "row": 11,
    "file_id": 11,
    "filename": "[Airota][Sousou no Frieren][31][1080p AVC AAC][CHT]",
    "context_tokens": [
      "[",
      "Airota",
      "]",
      "[",
      "Sousou",
      " ",
      "no",
      " "
    ],
    "context_labels": [
      "O",
      "B-GROUP",
      "O",
      "O",
      "B-TITLE",
      "I-TITLE",
      "I-TITLE",
      "I-TITLE"
    ]
  }
]
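Each sample record above carries up to five tokens of context on either side of the flagged index, clipped at the row boundaries. A sketch of how such a record could be assembled is shown below. The `radius=5` default is inferred from the samples, and `context_window` is a hypothetical name.

```python
def context_window(tokens, labels, index, radius=5):
    """Build a report-style record around `index` with clipped context."""
    lo = max(0, index - radius)
    hi = min(len(tokens), index + radius + 1)
    return {
        "index": index,
        "prev_label": labels[index - 1] if index > 0 else None,
        "label": labels[index],
        "token": tokens[index],
        "context_tokens": tokens[lo:hi],
        "context_labels": labels[lo:hi],
    }


# Tokens for file_id 3 as listed in the report; labels reassembled from
# its overlapping sample windows.
row3_tokens = ["[", "ANi", "]", " ", "異", "世", "界", "悠", "閒", "農", "家",
               " ", "2", " ", "-", " ", "06", " ",
               "[1080P]", "[Baha]", "[WEB-DL]", "[AAC AVC]", "[CHT]"]
row3_labels = ["O", "B-GROUP", "O", "O", "B-TITLE", "I-TITLE", "I-TITLE",
               "I-TITLE", "I-TITLE", "I-TITLE", "I-TITLE", "O", "B-SEASON",
               "O", "O", "O", "B-EPISODE", "O", "B-RESOLUTION",
               "B-SOURCE", "B-SOURCE", "O", "B-SOURCE"]
row3_record = context_window(row3_tokens, row3_labels, 13)
```

Clipping with `max` and `min` keeps the window valid near the start or end of a filename, which is why the records for index 2 above contain only eight tokens.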

Tokenizer Split And Alignment

Dataset tokens vs selected tokenizer mismatches

[
  {
    "file_id": 2,
    "filename": "[LoliHouse] Maid-san wa Taberu Dake - 07 [WebRip 1080p HEVC-10bit AAC ASSx2]",
    "common_prefix": 0,
    "dataset_tokens": [
      "[",
      "LoliHouse",
      "]",
      " ",
      "Maid",
      "-",
      "san",
      " ",
      "wa",
      " ",
      "Taberu",
      " ",
      "Dake",
      " ",
      "-",
      " ",
      "07",
      " ",
      "[WebRip 1080p HEVC-10bit AAC ASSx2]"
    ],
    "tokenizer_tokens": [
      "[LoliHouse]",
      " ",
      "Maid",
      "-",
      "san",
      " ",
      "wa",
      " ",
      "Taberu",
      " ",
      "Dake",
      " ",
      "-",
      " ",
      "07",
      " ",
      "[WebRip 1080p HEVC-10bit AAC ASSx2]"
    ],
    "dataset_len": 19,
    "tokenizer_len": 17
  },
  {
    "file_id": 3,
    "filename": "[ANi] 異世界悠閒農家 2 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]",
    "common_prefix": 0,
    "dataset_tokens": [
      "[",
      "ANi",
      "]",
      " ",
      "異",
      "世",
      "界",
      "悠",
      "閒",
      "農",
      "家",
      " ",
      "2",
      " ",
      "-",
      " ",
      "06",
      " ",
      "[1080P]",
      "[Baha]",
      "[WEB-DL]",
      "[AAC AVC]",
      "[CHT]"
    ],
    "tokenizer_tokens": [
      "[ANi]",
      " ",
      "異",
      "世",
      "界",
      "悠",
      "閒",
      "農",
      "家",
      " ",
      "2",
      " ",
      "-",
      " ",
      "06",
      " ",
      "[1080P]",
      "[Baha]",
      "[WEB-DL]",
      "[AAC AVC]",
      "[CHT]"
    ],
    "dataset_len": 23,
    "tokenizer_len": 21
  },
  {
    "file_id": 4,
    "filename": "[ANi] 木頭風紀委員和迷你裙 JK 的故事 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]",
    "common_prefix": 0,
    "dataset_tokens": [
      "[",
      "ANi",
      "]",
      " ",
      "木",
      "頭",
      "風",
      "紀",
      "委",
      "員",
      "和",
      "迷",
      "你",
      "裙",
      " ",
      "JK",
      " ",
      "的",
      "故",
      "事",
      " ",
      "-",
      " ",
      "06",
      " ",
      "[1080P]",
      "[Baha]",
      "[WEB-DL]",
      "[AAC AVC]",
      "[CHT]"
    ],
    "tokenizer_tokens": [
      "[ANi]",
      " ",
      "木",
      "頭",
      "風",
      "紀",
      "委",
      "員",
      "和",
      "迷",
      "你",
      "裙",
      " ",
      "JK",
      " ",
      "的",
      "故",
      "事",
      " ",
      "-",
      " ",
      "06",
      " ",
      "[1080P]",
      "[Baha]",
      "[WEB-DL]",
      "[AAC AVC]",
      "[CHT]"
    ],
    "dataset_len": 30,
    "tokenizer_len": 28
  },
  {
    "file_id": 5,
    "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][GB][MP4]",
    "common_prefix": 0,
    "dataset_tokens": [
      "[",
      "KissSub",
      "]",
      "[",
      "Shunkashuutou",
      " ",
      "Daikousha",
      " ",
      "-",
      " ",
      "Haru",
      " ",
      "no",
      " ",
      "Mai",
      "]",
      "[05]",
      "[1080P]",
      "[GB]",
      "[MP4]"
    ],
    "tokenizer_tokens": [
      "[KissSub]",
      "[Shunkashuutou Daikousha - Haru no Mai]",
      "[05]",
      "[1080P]",
      "[GB]",
      "[MP4]"
    ],
    "dataset_len": 20,
    "tokenizer_len": 6
  },
  {
    "file_id": 6,
    "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][06][1080P][GB][MP4]",
    "common_prefix": 0,
    "dataset_tokens": [
      "[",
      "KissSub",
      "]",
      "[",
      "Shunkashuutou",
      " ",
      "Daikousha",
      " ",
      "-",
      " ",
      "Haru",
      " ",
      "no",
      " ",
      "Mai",
      "]",
      "[06]",
      "[1080P]",
      "[GB]",
      "[MP4]"
    ],
    "tokenizer_tokens": [
      "[KissSub]",
      "[Shunkashuutou Daikousha - Haru no Mai]",
      "[06]",
      "[1080P]",
      "[GB]",
      "[MP4]"
    ],
    "dataset_len": 20,
    "tokenizer_len": 6
  },
  {
    "file_id": 7,
    "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][06][1080P][BIG5][MP4]",
    "common_prefix": 0,
    "dataset_tokens": [
      "[",
      "KissSub",
      "]",
      "[",
      "Shunkashuutou",
      " ",
      "Daikousha",
      " ",
      "-",
      " ",
      "Haru",
      " ",
      "no",
      " ",
      "Mai",
      "]",
      "[06]",
      "[1080P]",
      "[BIG5]",
      "[MP4]"
    ],
    "tokenizer_tokens": [
      "[KissSub]",
      "[Shunkashuutou Daikousha - Haru no Mai]",
      "[06]",
      "[1080P]",
      "[BIG5]",
      "[MP4]"
    ],
    "dataset_len": 20,
    "tokenizer_len": 6
  },
  {
    "file_id": 8,
    "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][BIG5][MP4]",
    "common_prefix": 0,
    "dataset_tokens": [
      "[",
      "KissSub",
      "]",
      "[",
      "Shunkashuutou",
      " ",
      "Daikousha",
      " ",
      "-",
      " ",
      "Haru",
      " ",
      "no",
      " ",
      "Mai",
      "]",
      "[05]",
      "[1080P]",
      "[BIG5]",
      "[MP4]"
    ],
    "tokenizer_tokens": [
      "[KissSub]",
      "[Shunkashuutou Daikousha - Haru no Mai]",
      "[05]",
      "[1080P]",
      "[BIG5]",
      "[MP4]"
    ],
    "dataset_len": 20,
    "tokenizer_len": 6
  },
  {
    "file_id": 9,
    "filename": "[Airota][Sousou no Frieren][29][1080p AVC AAC][CHT]",
    "common_prefix": 0,
    "dataset_tokens": [
      "[",
      "Airota",
      "]",
      "[",
      "Sousou",
      " ",
      "no",
      " ",
      "Frieren",
      "]",
      "[29]",
      "[1080p AVC AAC]",
      "[CHT]"
    ],
    "tokenizer_tokens": [
      "[Airota]",
      "[Sousou no Frieren]",
      "[29]",
      "[1080p AVC AAC]",
      "[CHT]"
    ],
    "dataset_len": 13,
    "tokenizer_len": 5
  },
  {
    "file_id": 10,
    "filename": "[Airota][Sousou no Frieren][30][1080p AVC AAC][CHT]",
    "common_prefix": 0,
    "dataset_tokens": [
      "[",
      "Airota",
      "]",
      "[",
      "Sousou",
      " ",
      "no",
      " ",
      "Frieren",
      "]",
      "[30]",
      "[1080p AVC AAC]",
      "[CHT]"
    ],
    "tokenizer_tokens": [
      "[Airota]",
      "[Sousou no Frieren]",
      "[30]",
      "[1080p AVC AAC]",
      "[CHT]"
    ],
    "dataset_len": 13,
    "tokenizer_len": 5
  },
  {
    "file_id": 11,
    "filename": "[Airota][Sousou no Frieren][31][1080p AVC AAC][CHT]",
    "common_prefix": 0,
    "dataset_tokens": [
      "[",
      "Airota",
      "]",
      "[",
      "Sousou",
      " ",
      "no",
      " ",
      "Frieren",
      "]",
      "[31]",
      "[1080p AVC AAC]",
      "[CHT]"
    ],
    "tokenizer_tokens": [
      "[Airota]",
      "[Sousou no Frieren]",
      "[31]",
      "[1080p AVC AAC]",
      "[CHT]"
    ],
    "dataset_len": 13,
    "tokenizer_len": 5
  }
]
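Every mismatch above has `common_prefix` 0 because the stored dataset tokens open with a bare "[" while the selected tokenizer keeps the whole bracket group as one token. The sketch below shows how such a comparison could be computed, and adds a check that both token lists still concatenate back to the filename. The helper names and the `lossless` fields are assumptions, not part of the report's schema.

```python
def common_prefix_len(a, b):
    """Number of leading positions where both token lists agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def compare_tokenizations(filename, dataset_tokens, tokenizer_tokens):
    """Build a mismatch record similar in shape to the ones above."""
    return {
        "filename": filename,
        "common_prefix": common_prefix_len(dataset_tokens, tokenizer_tokens),
        "dataset_len": len(dataset_tokens),
        "tokenizer_len": len(tokenizer_tokens),
        # Extra fields (not in the report): does each split rebuild the name?
        "dataset_lossless": "".join(dataset_tokens) == filename,
        "tokenizer_lossless": "".join(tokenizer_tokens) == filename,
    }


# Values copied from the file_id 9 record above.
name9 = "[Airota][Sousou no Frieren][29][1080p AVC AAC][CHT]"
dataset9 = ["[", "Airota", "]", "[", "Sousou", " ", "no", " ", "Frieren",
            "]", "[29]", "[1080p AVC AAC]", "[CHT]"]
tokenizer9 = ["[Airota]", "[Sousou no Frieren]", "[29]",
              "[1080p AVC AAC]", "[CHT]"]
record9 = compare_tokenizations(name9, dataset9, tokenizer9)
```

Both splits are lossless here, so the disagreement is purely about token boundaries. That matters because labels stored per dataset token cannot be reused position by position with the tokenizer's output.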

Split examples

[
  {
    "file_id": 1,
    "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub",
    "dataset_tokens": [
      "Witch",
      ".",
      "Hat",
      ".",
      "Atelier",
      ".",
      "S01",
      "E07",
      ".",
      "1080p",
      ".",
      "NF",
      ".",
      "WEB-DL",
      ".",
      "JP",
      "N",
      ".",
      "AAC",
      "2",
      ".",
      "0",
      ".",
      "H.264",
      ".",
      "MSubs",
      "-",
      "ToonsHub"
    ],
    "diagnosed_tokens": [
      "Witch",
      ".",
      "Hat",
      ".",
      "Atelier",
      ".",
      "S01",
      "E07",
      ".",
      "1080p",
      ".",
      "NF",
      ".",
      "WEB-DL",
      ".",
      "JP",
      "N",
      ".",
      "AAC",
      "2",
      ".",
      "0",
      ".",
      "H.264",
      ".",
      "MSubs",
      "-",
      "ToonsHub"
    ],
    "regex_tokens": [
      "Witch",
      ".",
      "Hat",
      ".",
      "Atelier",
      ".",
      "S01",
      "E07",
      ".",
      "1080p",
      ".",
      "NF",
      ".",
      "WEB-DL",
      ".",
      "JP",
      "N",
      ".",
      "AAC",
      "2",
      ".",
      "0",
      ".",
      "H.264",
      ".",
      "MSubs",
      "-",
      "ToonsHub"
    ],
    "char_tokens": [
      "W",
      "i",
      "t",
      "c",
      "h",
      ".",
      "H",
      "a",
      "t",
      ".",
      "A",
      "t",
      "e",
      "l",
      "i",
      "e",
      "r",
      ".",
      "S",
      "0",
      "1",
      "E",
      "0",
      "7",
      ".",
      "1",
      "0",
      "8",
      "0",
      "p",
      ".",
      "N",
      "F",
      ".",
      "W",
      "E",
      "B",
      "-",
      "D",
      "L",
      ".",
      "J",
      "P",
      "N",
      ".",
      "A",
      "A",
      "C",
      "2",
      ".",
      "0",
      ".",
      "H",
      ".",
      "2",
      "6",
      "4",
      ".",
      "M",
      "S",
      "u",
      "b",
      "s",
      "-",
      "T",
      "o",
      "o",
      "n",
      "s",
      "H",
      "u",
      "b"
    ]
  },
  {
    "file_id": 2,
    "filename": "[LoliHouse] Maid-san wa Taberu Dake - 07 [WebRip 1080p HEVC-10bit AAC ASSx2]",
    "dataset_tokens": [
      "[",
      "LoliHouse",
      "]",
      " ",
      "Maid",
      "-",
      "san",
      " ",
      "wa",
      " ",
      "Taberu",
      " ",
      "Dake",
      " ",
      "-",
      " ",
      "07",
      " ",
      "[WebRip 1080p HEVC-10bit AAC ASSx2]"
    ],
    "diagnosed_tokens": [
      "[LoliHouse]",
      " ",
      "Maid",
      "-",
      "san",
      " ",
      "wa",
      " ",
      "Taberu",
      " ",
      "Dake",
      " ",
      "-",
      " ",
      "07",
      " ",
      "[WebRip 1080p HEVC-10bit AAC ASSx2]"
    ],
    "regex_tokens": [
      "[LoliHouse]",
      " ",
      "Maid",
      "-",
      "san",
      " ",
      "wa",
      " ",
      "Taberu",
      " ",
      "Dake",
      " ",
      "-",
      " ",
      "07",
      " ",
      "[WebRip 1080p HEVC-10bit AAC ASSx2]"
    ],
    "char_tokens": [
      "[",
      "L",
      "o",
      "l",
      "i",
      "H",
      "o",
      "u",
      "s",
      "e",
      "]",
      " ",
      "M",
      "a",
      "i",
      "d",
      "-",
      "s",
      "a",
      "n",
      " ",
      "w",
      "a",
      " ",
      "T",
      "a",
      "b",
      "e",
      "r",
      "u",
      " ",
      "D",
      "a",
      "k",
      "e",
      " ",
      "-",
      " ",
      "0",
      "7",
      " ",
      "[",
      "W",
      "e",
      "b",
      "R",
      "i",
      "p",
      " ",
      "1",
      "0",
      "8",
      "0",
      "p",
      " ",
      "H",
      "E",
      "V",
      "C",
      "-",
      "1",
      "0",
      "b",
      "i",
      "t",
      " ",
      "A",
      "A",
      "C",
      " ",
      "A",
      "S",
      "S",
      "x",
      "2",
      "]"
    ]
  },
  {
    "file_id": 3,
    "filename": "[ANi] 異世界悠閒農家 2 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]",
    "dataset_tokens": [
      "[",
      "ANi",
      "]",
      " ",
      "異",
      "世",
      "界",
      "悠",
      "閒",
      "農",
      "家",
      " ",
      "2",
      " ",
      "-",
      " ",
      "06",
      " ",
      "[1080P]",
      "[Baha]",
      "[WEB-DL]",
      "[AAC AVC]",
      "[CHT]"
    ],
    "diagnosed_tokens": [
      "[ANi]",
      " ",
      "異",
      "世",
      "界",
      "悠",
      "閒",
      "農",
      "家",
      " ",
      "2",
      " ",
      "-",
      " ",
      "06",
      " ",
      "[1080P]",
      "[Baha]",
      "[WEB-DL]",
      "[AAC AVC]",
      "[CHT]"
    ],
    "regex_tokens": [
      "[ANi]",
      " ",
      "異",
      "世",
      "界",
      "悠",
      "閒",
      "農",
      "家",
      " ",
      "2",
      " ",
      "-",
      " ",
      "06",
      " ",
      "[1080P]",
      "[Baha]",
      "[WEB-DL]",
      "[AAC AVC]",
      "[CHT]"
    ],
    "char_tokens": [
      "[",
      "A",
      "N",
      "i",
      "]",
      " ",
      "異",
      "世",
      "界",
      "悠",
      "閒",
      "農",
      "家",
      " ",
      "2",
      " ",
      "-",
      " ",
      "0",
      "6",
      " ",
      "[",
      "1",
      "0",
      "8",
      "0",
      "P",
      "]",
      "[",
      "B",
      "a",
      "h",
      "a",
      "]",
      "[",
      "W",
      "E",
      "B",
      "-",
      "D",
      "L",
      "]",
      "[",
      "A",
      "A",
      "C",
      " ",
      "A",
      "V",
      "C",
      "]",
      "[",
      "C",
      "H",
      "T",
      "]"
    ]
  },
  {
    "file_id": 4,
    "filename": "[ANi] 木頭風紀委員和迷你裙 JK 的故事 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]",
    "dataset_tokens": [
      "[",
      "ANi",
      "]",
      " ",
      "木",
      "頭",
      "風",
      "紀",
      "委",
      "員",
      "和",
      "迷",
      "你",
      "裙",
      " ",
      "JK",
      " ",
      "的",
      "故",
      "事",
      " ",
      "-",
      " ",
      "06",
      " ",
      "[1080P]",
      "[Baha]",
      "[WEB-DL]",
      "[AAC AVC]",
      "[CHT]"
    ],
    "diagnosed_tokens": [
      "[ANi]",
      " ",
      "木",
      "頭",
      "風",
      "紀",
      "委",
      "員",
      "和",
      "迷",
      "你",
      "裙",
      " ",
      "JK",
      " ",
      "的",
      "故",
      "事",
      " ",
      "-",
      " ",
      "06",
      " ",
      "[1080P]",
      "[Baha]",
      "[WEB-DL]",
      "[AAC AVC]",
      "[CHT]"
    ],
    "regex_tokens": [
      "[ANi]",
      " ",
      "木",
      "頭",
      "風",
      "紀",
      "委",
      "員",
      "和",
      "迷",
      "你",
      "裙",
      " ",
      "JK",
      " ",
      "的",
      "故",
      "事",
      " ",
      "-",
      " ",
      "06",
      " ",
      "[1080P]",
      "[Baha]",
      "[WEB-DL]",
      "[AAC AVC]",
      "[CHT]"
    ],
    "char_tokens": [
      "[",
      "A",
      "N",
      "i",
      "]",
      " ",
      "木",
      "頭",
      "風",
      "紀",
      "委",
      "員",
      "和",
      "迷",
      "你",
      "裙",
      " ",
      "J",
      "K",
      " ",
      "的",
      "故",
      "事",
      " ",
      "-",
      " ",
      "0",
      "6",
      " ",
      "[",
      "1",
      "0",
      "8",
      "0",
      "P",
      "]",
      "[",
      "B",
      "a",
      "h",
      "a",
      "]",
      "[",
      "W",
      "E",
      "B",
      "-",
      "D",
      "L",
      "]",
      "[",
      "A",
      "A",
      "C",
      " ",
      "A",
      "V",
      "C",
      "]",
      "[",
      "C",
      "H",
      "T",
      "]"
    ]
  },
  {
    "file_id": 5,
    "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][GB][MP4]",
    "dataset_tokens": [
      "[",
      "KissSub",
      "]",
      "[",
      "Shunkashuutou",
      " ",
      "Daikousha",
      " ",
      "-",
      " ",
      "Haru",
      " ",
      "no",
      " ",
      "Mai",
      "]",
      "[05]",
      "[1080P]",
      "[GB]",
      "[MP4]"
    ],
    "diagnosed_tokens": [
      "[KissSub]",
      "[Shunkashuutou Daikousha - Haru no Mai]",
      "[05]",
      "[1080P]",
      "[GB]",
      "[MP4]"
    ],
    "regex_tokens": [
      "[KissSub]",
      "[Shunkashuutou Daikousha - Haru no Mai]",
      "[05]",
      "[1080P]",
      "[GB]",
      "[MP4]"
    ],
    "char_tokens": [
      "[",
      "K",
      "i",
      "s",
      "s",
      "S",
      "u",
      "b",
      "]",
      "[",
      "S",
      "h",
      "u",
      "n",
      "k",
      "a",
      "s",
      "h",
      "u",
      "u",
      "t",
      "o",
      "u",
      " ",
      "D",
      "a",
      "i",
      "k",
      "o",
      "u",
      "s",
      "h",
      "a",
      " ",
      "-",
      " ",
      "H",
      "a",
      "r",
      "u",
      " ",
      "n",
      "o",
      " ",
      "M",
      "a",
      "i",
      "]",
      "[",
      "0",
      "5",
      "]",
      "[",
      "1",
      "0",
      "8",
      "0",
      "P",
      "]",
      "[",
      "G",
      "B",
      "]",
      "[",
      "M",
      "P",
      "4",
      "]"
    ]
  },
  {
    "file_id": 6,
    "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][06][1080P][GB][MP4]",
    "dataset_tokens": [
      "[",
      "KissSub",
      "]",
      "[",
      "Shunkashuutou",
      " ",
      "Daikousha",
      " ",
      "-",
      " ",
      "Haru",
      " ",
      "no",
      " ",
      "Mai",
      "]",
      "[06]",
      "[1080P]",
      "[GB]",
      "[MP4]"
    ],
    "diagnosed_tokens": [
      "[KissSub]",
      "[Shunkashuutou Daikousha - Haru no Mai]",
      "[06]",
      "[1080P]",
      "[GB]",
      "[MP4]"
    ],
    "regex_tokens": [
      "[KissSub]",
      "[Shunkashuutou Daikousha - Haru no Mai]",
      "[06]",
      "[1080P]",
      "[GB]",
      "[MP4]"
    ],
    "char_tokens": [
      "[",
      "K",
      "i",
      "s",
      "s",
      "S",
      "u",
      "b",
      "]",
      "[",
      "S",
      "h",
      "u",
      "n",
      "k",
      "a",
      "s",
      "h",
      "u",
      "u",
      "t",
      "o",
      "u",
      " ",
      "D",
      "a",
      "i",
      "k",
      "o",
      "u",
      "s",
      "h",
      "a",
      " ",
      "-",
      " ",
      "H",
      "a",
      "r",
      "u",
      " ",
      "n",
      "o",
      " ",
      "M",
      "a",
      "i",
      "]",
      "[",
      "0",
      "6",
      "]",
      "[",
      "1",
      "0",
      "8",
      "0",
      "P",
      "]",
      "[",
      "G",
      "B",
      "]",
      "[",
      "M",
      "P",
      "4",
      "]"
    ]
  },
  {
    "file_id": 7,
    "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][06][1080P][BIG5][MP4]",
    "dataset_tokens": [
      "[",
      "KissSub",
      "]",
      "[",
      "Shunkashuutou",
      " ",
      "Daikousha",
      " ",
      "-",
      " ",
      "Haru",
      " ",
      "no",
      " ",
      "Mai",
      "]",
      "[06]",
      "[1080P]",
      "[BIG5]",
      "[MP4]"
    ],
    "diagnosed_tokens": [
      "[KissSub]",
      "[Shunkashuutou Daikousha - Haru no Mai]",
      "[06]",
      "[1080P]",
      "[BIG5]",
      "[MP4]"
    ],
    "regex_tokens": [
      "[KissSub]",
      "[Shunkashuutou Daikousha - Haru no Mai]",
      "[06]",
      "[1080P]",
      "[BIG5]",
      "[MP4]"
    ],
    "char_tokens": [
      "[",
      "K",
      "i",
      "s",
      "s",
      "S",
      "u",
      "b",
      "]",
      "[",
      "S",
      "h",
      "u",
      "n",
      "k",
      "a",
      "s",
      "h",
      "u",
      "u",
      "t",
      "o",
      "u",
      " ",
      "D",
      "a",
      "i",
      "k",
      "o",
      "u",
      "s",
      "h",
      "a",
      " ",
      "-",
      " ",
      "H",
      "a",
      "r",
      "u",
      " ",
      "n",
      "o",
      " ",
      "M",
      "a",
      "i",
      "]",
      "[",
      "0",
      "6",
      "]",
      "[",
      "1",
      "0",
      "8",
      "0",
      "P",
      "]",
      "[",
      "B",
      "I",
      "G",
      "5",
      "]",
      "[",
      "M",
      "P",
      "4",
      "]"
    ]
  },
  {
    "file_id": 8,
    "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][BIG5][MP4]",
    "dataset_tokens": [
      "[",
      "KissSub",
      "]",
      "[",
      "Shunkashuutou",
      " ",
      "Daikousha",
      " ",
      "-",
      " ",
      "Haru",
      " ",
      "no",
      " ",
      "Mai",
      "]",
      "[05]",
      "[1080P]",
      "[BIG5]",
      "[MP4]"
    ],
    "diagnosed_tokens": [
      "[KissSub]",
      "[Shunkashuutou Daikousha - Haru no Mai]",
      "[05]",
      "[1080P]",
      "[BIG5]",
      "[MP4]"
    ],
    "regex_tokens": [
      "[KissSub]",
      "[Shunkashuutou Daikousha - Haru no Mai]",
      "[05]",
      "[1080P]",
      "[BIG5]",
      "[MP4]"
    ],
    "char_tokens": [
      "[",
      "K",
      "i",
      "s",
      "s",
      "S",
      "u",
      "b",
      "]",
      "[",
      "S",
      "h",
      "u",
      "n",
      "k",
      "a",
      "s",
      "h",
      "u",
      "u",
      "t",
      "o",
      "u",
      " ",
      "D",
      "a",
      "i",
      "k",
      "o",
      "u",
      "s",
      "h",
      "a",
      " ",
      "-",
      " ",
      "H",
      "a",
      "r",
      "u",
      " ",
      "n",
      "o",
      " ",
      "M",
      "a",
      "i",
      "]",
      "[",
      "0",
      "5",
      "]",
      "[",
      "1",
      "0",
      "8",
      "0",
      "P",
      "]",
      "[",
      "B",
      "I",
      "G",
      "5",
      "]",
      "[",
      "M",
      "P",
      "4",
      "]"
    ]
  }
]
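
The `regex_tokens` and `char_tokens` variants in the samples above can be approximated with a short sketch. The pattern below is an assumption inferred from the sample output (bracketed or parenthesized groups kept whole, ASCII alphanumeric runs kept together, every other character emitted alone). It is not the project's actual tokenizer, and `regex_tokenize` and `char_tokenize` are hypothetical names.

```python
import re

# Assumed pattern, inferred from the samples: bracketed or parenthesized
# groups stay whole, ASCII alphanumeric runs stay together, and every
# other character (CJK, whitespace, punctuation) becomes its own token.
TOKEN_RE = re.compile(r"\[[^\]]*\]|\([^)]*\)|[A-Za-z0-9]+|.", re.DOTALL)


def regex_tokenize(filename: str) -> list[str]:
    """Hypothetical stand-in for the report's regex tokenizer variant."""
    return TOKEN_RE.findall(filename)


def char_tokenize(filename: str) -> list[str]:
    """Char-level variant: one token per character."""
    return list(filename)


name = "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][GB][MP4]"
print(regex_tokenize(name))      # six bracket-level tokens
print(len(char_tokenize(name)))  # one token per character
```

Note that `dataset_tokens` in the samples split the leading `[`, `KissSub`, `]` apart while the regex variant keeps `[KissSub]` whole, so a sketch like this would reproduce `regex_tokens` but not `dataset_tokens`.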

Vocabulary coverage

{
  "total": 85312,
  "unk": 5900,
  "unk_rate": 0.06915791447861966,
  "top_unk": [
    [
      "(BDRip 720p x264)",
      66
    ],
    [
      "Partie",
      59
    ],
    [
      "incantevole",
      54
    ],
    [
      "Muxed",
      54
    ],
    [
      "nonscordarmi",
      54
    ],
    [
      "NEET",
      52
    ],
    [
      "Dousei",
      52
    ],
    [
      "[krikoun68]",
      52
    ],
    [
      "[Blu-Ray - MUX - 960p - x264 - AC3 ITA-JAP - SUB ITA]",
      51
    ],
    [
      "CTR",
      45
    ],
    [
      "joseol",
      45
    ],
    [
      "e99",
      45
    ],
    [
      "(1440x1080 h264 AC3 AAC)",
      45
    ],
    [
      "VERS",
      37
    ],
    [
      "脙",
      37
    ],
    [
      "Shunkashuutou",
      36
    ],
    [
      "Daikousha",
      36
    ],
    [
      "houbatsu",
      36
    ],
    [
      "DEFINITIVA",
      36
    ],
    [
      "Crash",
      35
    ],
    [
      "Realm",
      31
    ],
    [
      "UHD",
      31
    ],
    [
      "[BDrip 1080P HEVC-10bit AAC]",
      29
    ],
    [
      "Choroi",
      28
    ],
    [
      "완",
      28
    ]
  ]
}
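
The coverage figures above follow from a simple count of tokens that are missing from the vocab. A minimal sketch, assuming the vocab is available as a set of token strings and each row is already tokenized; `unk_stats` is a hypothetical helper, and the toy rows are made up for illustration:

```python
from collections import Counter


def unk_stats(rows, vocab, top_k=25):
    """Count tokens missing from the vocab across all tokenized rows."""
    total = 0
    unk_counter = Counter()
    for tokens in rows:
        total += len(tokens)
        unk_counter.update(t for t in tokens if t not in vocab)
    unk = sum(unk_counter.values())
    return {
        "total": total,
        "unk": unk,
        "unk_rate": unk / total if total else 0.0,
        "top_unk": [[tok, n] for tok, n in unk_counter.most_common(top_k)],
    }


# Toy data (made up for illustration, not taken from the dataset).
rows = [["[ANi]", " ", "Partie", " ", "06"], ["[ANi]", " ", "Partie"]]
vocab = {"[ANi]", " ", "06"}
stats = unk_stats(rows, vocab)
print(stats)

# The report's own figures are consistent with unk / total.
report_rate = 5900 / 85312
print(report_rate)
```

Many of the top UNK entries are whole bracket groups such as `[krikoun68]`, which suggests that keeping brackets as single tokens contributes to the UNK rate.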

Train Inference Tokenizer Comparison

  • Model dir: checkpoints\dmhy-finetune\final
  • Model tokenizer variant: regex
  • Dataset tokenizer variant: regex
  • Diagnostic tokenizer variant: regex
  • Model tokenizer vocab size: 3,000
  • Diagnostic tokenizer vocab size: 8,000

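If the dataset and model tokenizer variants differ, validation loss can look low while real inference sees different token IDs and entity boundaries. Here all three variants match (regex), but the vocab sizes do not (3,000 for the model vs 8,000 for the diagnostic tokenizer), so token IDs and UNK rates measured with the diagnostic vocab are not guaranteed to match what the model sees.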
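
A check like the comparison above can be run before training or inference. The sketch below assumes each side exposes its tokenizer settings as a plain dict; the key names and the `check_tokenizer_consistency` helper are hypothetical, and only the values come from this report.

```python
def check_tokenizer_consistency(model_meta: dict, data_meta: dict) -> list[str]:
    """Return human-readable mismatches between two tokenizer configs."""
    problems = []
    for key in ("tokenizer_variant", "vocab_size"):
        if model_meta.get(key) != data_meta.get(key):
            problems.append(
                f"{key}: model={model_meta.get(key)!r} vs data={data_meta.get(key)!r}"
            )
    return problems


# Values from the comparison above; the dict keys are hypothetical.
model_meta = {"tokenizer_variant": "regex", "vocab_size": 3000}
diag_meta = {"tokenizer_variant": "regex", "vocab_size": 8000}

problems = check_tokenizer_consistency(model_meta, diag_meta)
for problem in problems:
    print("MISMATCH", problem)
```

Failing fast on any mismatch is cheaper than discovering it through degraded entity F1 on real filenames.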

Model Confusion Analysis

  • Evaluated samples: 128
  • Entity precision: 0.9568
  • Entity recall: 0.9530
  • Entity F1: 0.9549

Boundary error classes

  • B-boundary: 26 (56.52%)
  • entity-type: 20 (43.48%)

Top token-label confusions

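| true | pred | count |
| --- | --- | --- |
| O | I-TITLE | 17 |
| O | B-EPISODE | 6 |
| B-SOURCE | O | 4 |
| I-TITLE | O | 3 |
| B-EPISODE | O | 3 |
| B-SEASON | O | 2 |
| B-RESOLUTION | B-SOURCE | 2 |
| B-EPISODE | I-TITLE | 2 |
| O | B-TITLE | 2 |
| B-TITLE | I-TITLE | 2 |
| O | B-SOURCE | 1 |
| B-SEASON | I-TITLE | 1 |
| O | B-SEASON | 1 |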

Top entity-type confusions

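| true | pred | count |
| --- | --- | --- |
| O | TITLE | 19 |
| O | EPISODE | 6 |
| SOURCE | O | 4 |
| TITLE | O | 3 |
| EPISODE | O | 3 |
| SEASON | O | 2 |
| RESOLUTION | SOURCE | 2 |
| EPISODE | TITLE | 2 |
| O | SOURCE | 1 |
| SEASON | TITLE | 1 |
| O | SEASON | 1 |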
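
The entity-type counts appear to be the token-label counts with the `B-`/`I-` prefix stripped and same-type pairs dropped. That rule is an assumption inferred from the numbers, not confirmed from the diagnostics script. The sketch below applies it to the token-label confusions listed above; `entity_type` and `collapse_confusions` are hypothetical helpers.

```python
from collections import Counter


def entity_type(label: str) -> str:
    """Strip the BIO prefix: 'B-TITLE' -> 'TITLE', 'O' -> 'O'."""
    return label.split("-", 1)[1] if "-" in label else label


def collapse_confusions(token_confusions: dict) -> Counter:
    """Merge B-/I- rows by entity type, dropping same-type pairs."""
    collapsed = Counter()
    for (true, pred), count in token_confusions.items():
        t, p = entity_type(true), entity_type(pred)
        if t != p:
            collapsed[(t, p)] += count
    return collapsed


# Token-label confusions copied from the report.
token_confusions = {
    ("O", "I-TITLE"): 17,
    ("O", "B-EPISODE"): 6,
    ("B-SOURCE", "O"): 4,
    ("I-TITLE", "O"): 3,
    ("B-EPISODE", "O"): 3,
    ("B-SEASON", "O"): 2,
    ("B-RESOLUTION", "B-SOURCE"): 2,
    ("B-EPISODE", "I-TITLE"): 2,
    ("O", "B-TITLE"): 2,
    ("B-TITLE", "I-TITLE"): 2,
    ("O", "B-SOURCE"): 1,
    ("B-SEASON", "I-TITLE"): 1,
    ("O", "B-SEASON"): 1,
}

collapsed = collapse_confusions(token_confusions)
for (true, pred), count in collapsed.most_common():
    print(true, pred, count)
```

Under this rule, `O -> I-TITLE` (17) and `O -> B-TITLE` (2) merge into `O -> TITLE` (19), and `B-TITLE -> I-TITLE` drops out because both labels belong to the same entity type.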

Seqeval report

              precision    recall  f1-score   support

     EPISODE     0.9535    0.9609    0.9572       128
       GROUP     1.0000    1.0000    1.0000        53
  RESOLUTION     1.0000    0.9545    0.9767        44
      SEASON     0.9630    0.8966    0.9286        29
      SOURCE     0.9703    0.9608    0.9655       102
     SPECIAL     1.0000    1.0000    1.0000         5
       TITLE     0.9211    0.9333    0.9272       150

   micro avg     0.9568    0.9530    0.9549       511
   macro avg     0.9725    0.9580    0.9650       511
weighted avg     0.9571    0.9530    0.9550       511
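
The three average rows can be cross-checked from the per-entity rows. The sketch below recomputes macro and weighted F1 from the rounded values printed above, and micro F1 from the micro precision and recall, so small rounding differences in the last digit are expected.

```python
# (precision, recall, f1, support) per entity type, copied from the report.
rows = {
    "EPISODE": (0.9535, 0.9609, 0.9572, 128),
    "GROUP": (1.0000, 1.0000, 1.0000, 53),
    "RESOLUTION": (1.0000, 0.9545, 0.9767, 44),
    "SEASON": (0.9630, 0.8966, 0.9286, 29),
    "SOURCE": (0.9703, 0.9608, 0.9655, 102),
    "SPECIAL": (1.0000, 1.0000, 1.0000, 5),
    "TITLE": (0.9211, 0.9333, 0.9272, 150),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


total_support = sum(row[3] for row in rows.values())

# Macro: unweighted mean over entity types (SPECIAL counts as much as TITLE).
macro_f1 = sum(row[2] for row in rows.values()) / len(rows)

# Weighted: mean over entity types, weighted by support.
weighted_f1 = sum(row[2] * row[3] for row in rows.values()) / total_support

# Micro: computed from pooled counts, here via micro precision and recall.
micro_f1 = f1_score(0.9568, 0.9530)

print(total_support, round(macro_f1, 4), round(weighted_f1, 4), round(micro_f1, 4))
```

Macro F1 is pulled up by GROUP and SPECIAL, which score 1.0000 on small supports (53 and 5), so micro or weighted F1 is the more conservative summary here.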

Recommended Pipeline

  1. Use one tokenizer variant end to end and save it in the checkpoint metadata.
  2. Prefer char-level or a deterministic hybrid tokenizer for DMHY filenames; avoid generic subword tokenization for labels.
  3. For char-level runs, use `--tokenizer char --max-seq-length 128` with `vocab.char.json`.
  4. Add CRF decoding or constrained BIO decoding so illegal I-X transitions and impossible boundary jumps are blocked.
  5. Keep rule-assisted post-processing for high-confidence structural anchors: the leading group bracket, ` - 07`, `S01E07`, source, and resolution.
  6. Track entity-level F1 and field exact-match on real filenames; do not accept low validation loss alone.
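
Step 4 can be sketched as Viterbi decoding over per-token label scores, with transitions into `I-X` allowed only from `B-X` or `I-X`. This is a minimal illustration rather than the project's decoder: the label set is a small hypothetical subset, the scores are made up, and `allowed` and `constrained_decode` are hypothetical names.

```python
LABELS = ["O", "B-TITLE", "I-TITLE", "B-EPISODE"]


def allowed(prev: str, cur: str) -> bool:
    """I-X may only follow B-X or I-X; every other transition is legal."""
    if cur.startswith("I-"):
        return prev in ("B-" + cur[2:], "I-" + cur[2:])
    return True


def constrained_decode(scores: list[dict]) -> list[str]:
    """Viterbi over per-token label scores with illegal moves blocked."""
    # best[label] = (score, path) of the best legal path ending in `label`.
    best = {
        lab: (scores[0][lab], [lab])
        for lab in LABELS
        if not lab.startswith("I-")  # a sequence cannot start with I-X
    }
    for tok_scores in scores[1:]:
        nxt = {}
        for cur in LABELS:
            cands = [
                (s + tok_scores[cur], path + [cur])
                for prev, (s, path) in best.items()
                if allowed(prev, cur)
            ]
            if cands:
                nxt[cur] = max(cands, key=lambda c: c[0])
        best = nxt
    return max(best.values(), key=lambda c: c[0])[1]


# Made-up log-scores for three tokens; higher is better.
scores = [
    {"O": 0.0, "B-TITLE": -3.0, "I-TITLE": -4.0, "B-EPISODE": -5.0},
    {"O": -2.0, "B-TITLE": -1.0, "I-TITLE": -0.5, "B-EPISODE": -5.0},
    {"O": -3.0, "B-TITLE": -2.0, "I-TITLE": -0.1, "B-EPISODE": -5.0},
]

greedy = [max(s, key=s.get) for s in scores]
decoded = constrained_decode(scores)
print(greedy)   # ['O', 'I-TITLE', 'I-TITLE'] contains an orphan I-TITLE
print(decoded)  # ['O', 'B-TITLE', 'I-TITLE']
```

Per-token argmax produces an orphan `I-TITLE` after `O`, while the constrained path opens the entity with `B-TITLE`. A CRF layer would learn the transition scores instead of hard-coding them as legal or illegal.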