ModerRAS committed
Commit d92b315 · 1 Parent(s): f484458

Document and tighten DMHY template labels

docs/dmhy_template_labeling.md ADDED
@@ -0,0 +1,103 @@
1
+ # DMHY Template Labeling Requirements
2
+
3
+ This document records the current labeling contract for the DMHY template
4
+ metadata workflow. It is intentionally stricter than the old weak-label export:
5
+ precision is preferred over coverage, especially for `TITLE`, `EPISODE`, and
6
+ `SEASON`.
7
+
8
+ ## Source And Pipeline
9
+
10
+ - Source snapshot: `datasets/AnimeName/dmhy_list.jsonl`.
11
+ - Optional original source: `D:\WorkSpace\Python\dmhy-parser\dmhy_anime.db`.
12
+ - Template recipe generation and application live in
13
+ `tools/rust_dmhy_template_apply`.
14
+ - Generated training JSONL rows must contain at least `filename`, `tokens`,
15
+ `labels`, `template_id`, and `template`.
16
+ - Reports and intermediate audits belong under `reports/`; they are diagnostic
17
+ artifacts, not authoritative dataset files.
18
+
19
+ ## Critical Label Semantics
20
+
21
+ - `TITLE` is the anime/work title. It must be one contiguous span whenever a
22
+ single title is being emitted.
23
+ - `EPISODE` is the episode number or explicit episode marker span.
24
+ - `SEASON` is the season/cour/part marker when the filename explicitly encodes
25
+ season structure, such as `S2`, `2nd Season`, `Second Season`, `第2季`, or
26
+ `Part 5` in series-part naming.
27
+ - `GROUP` is a release group or subtitle group, not the title.
28
+ - `SOURCE` covers release metadata describing media, source, codec, language, or platform,
29
+ such as `BDRip`, `WEB-DL`, `HEVC`, `AAC`, `Baha`, `CR`, `CHS`, `CHT`, `GB`,
30
+ and `BIG5`.
31
+ - `RESOLUTION` covers explicit resolution values such as `720P`, `1080p`, and
32
+ `1920x1080`.
33
+ - `SPECIAL` covers non-episode extras such as `NCOP`, `NCED`, `PV`, `CM`,
34
+ `Menu`, `Trailer`, `Creditless ED`, and movie/special numbering when it is not
35
+ an episodic number.
36
+ - Hash-like suffixes are retained as text in source filenames when useful, but
37
+ they must not become entity labels in generated training data.
38
+
39
+ ## Title Rules
40
+
41
+ - Avoid duplicate titles. If the leaf filename already carries a complete title,
42
+ season, and episode structure, drop redundant parent directory titles.
43
+ - If precision is uncertain, prefer skipping the row/template over producing a
44
+ duplicated or discontinuous `TITLE`.
45
+ - A title may contain punctuation or symbols. Internal title joiners must stay
46
+ inside the title span, including common ASCII separators and known Unicode
47
+ title punctuation such as `‐`, `–`, `—`, `$`, `∽`, `꞉`, and `♥`.
48
+ - Multiple title candidates in one filename should be handled explicitly:
49
+ bilingual title aliases and special-program titles are allowed in rich review
50
+ metadata, but the final weak training row should not emit arbitrary
51
+ non-contiguous titles unless that structure has been reviewed.
52
+ - Generic prefixes such as `TV`, `TVアニメ`, or `アニメ` are not part of the title when a real
53
+ title follows.
54
+
55
+ ## Episode And Season Rules
56
+
57
+ - `TITLE`, `EPISODE`, and `SEASON` are the highest-risk labels; errors here have
58
+ higher training cost than dropping a row.
59
+ - `SxxExx` means season plus episode. `S` identifies season and `E` identifies
60
+ episode. If the tokenizer keeps `S01E02` as one compact token, project it to
61
+ season and episode components during normalization; if split into marker and
62
+ number tokens, the numeric value must carry `SEASON`/`EPISODE` and the marker
63
+ may remain structural `O`.
64
+ - `01v2` means episode `01` version `2`; the episode value must not be treated
65
+ as title.
66
+ - Episode ranges such as `01-13`, `#1-3`, and CJK forms like `第10話` should
67
+ remain episode spans.
68
+ - Decimal episode-like values such as `14.5` may be valid recap or midpoint
69
+ episodes and should not be discarded only because they contain a decimal point.
70
+ - Title-internal numbers stay in `TITLE` when they are part of the work name,
71
+ such as `Eien no 831`, `Zom 100`, or movie titles like `Movie 27 The
72
+ Million-Dollar Pentagram`.
73
+
74
+ ## Path And Noise Rules
75
+
76
+ - BDMV expanded paths such as `BDMV/STREAM/00006` are not useful training
77
+ filenames and should be skipped.
78
+ - Non-anime or abstract path data, including obvious `MTV` paths and tourism /
79
+ railway program dumps, should be skipped.
80
+ - Mojibake and encoding-noise rows should be skipped unless explicitly kept for
81
+ diagnosis.
82
+ - Jellyfin-like paths (`Title/Season 1/E07 - Full Title ...`) are valid, but
83
+ the output should avoid duplicate title spans.
84
+ - Parent directory context is allowed only when the leaf filename is too weak to
85
+ identify the title; otherwise the leaf filename should dominate.
86
+
87
+ ## Review Strategy
88
+
89
+ - High-frequency templates affect training most and must be sampled more heavily.
90
+ - Low-frequency templates are gated conservatively; ambiguous cases are sent to
91
+ review instead of being emitted as generated training data.
92
+ - Middle-frequency templates should be audited by sampling a few examples from
93
+ every template class, then grouping failures by rule rather than patching
94
+ single examples blindly.
95
+ - A template can enter the generated training set only when its `TITLE`,
96
+ `EPISODE`, and `SEASON` behavior is defensible across sampled rows.
97
+
98
+ ## Character Dataset Projection
99
+
100
+ - Regex-token JSONL is converted to character JSONL by projecting BIO labels:
101
+ first character keeps `B-X`, later characters become `I-X`; `O` remains `O`.
102
+ - Punctuation tokens must remain independently represented before character
103
+ projection so the model can learn filename structure boundaries.
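The projection rule in the Character Dataset Projection section can be illustrated with a small standalone sketch. It is not taken from the repository: `project_to_chars` is a hypothetical helper name, and the input is assumed to be parallel token and label slices that use `B-`/`I-`/`O` labels.

```rust
/// Project token-level BIO labels onto characters (illustrative sketch).
/// The first character of a token keeps the token's own label (for
/// example `B-TITLE`); every later character becomes `I-<entity>`.
/// Tokens labeled `O` stay `O` for all of their characters.
fn project_to_chars(tokens: &[&str], labels: &[&str]) -> Vec<(char, String)> {
    let mut out = Vec::new();
    for (token, label) in tokens.iter().zip(labels.iter()) {
        // Strip the BIO prefix to recover the bare entity name, if any.
        let entity = label
            .strip_prefix("B-")
            .or_else(|| label.strip_prefix("I-"));
        for (i, ch) in token.chars().enumerate() {
            let char_label = match entity {
                // Characters after the first continue the entity span.
                Some(name) if i > 0 => format!("I-{name}"),
                // First character, or an `O` token: keep the label as-is.
                _ => label.to_string(),
            };
            out.push((ch, char_label));
        }
    }
    out
}
```

Calling `project_to_chars(&["S", "01"], &["O", "B-SEASON"])` would yield one `O` character followed by a `B-SEASON` and an `I-SEASON` character, which keeps the punctuation and marker tokens visible as their own boundaries.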
tools/rust_dmhy_template_apply/src/main.rs CHANGED
@@ -393,10 +393,18 @@ fn main() -> Result<()> {
393
  "low_frequency_audit_max_count": args.audit_max_count,
394
  "low_frequency_blocking_warnings": [
395
  "ambiguous_no_episode_title",
396
  "hash_labeled",
397
  "multiple_title_spans",
398
  "no_title",
399
- "path_retained"
400
  ],
401
  "expand": args.expand,
402
  "sample_per_template": if args.expand == "sample" { Some(args.sample_per_template) } else { None },
@@ -856,10 +864,18 @@ fn run_verify_generated_output(args: &Args) -> Result<()> {
856
  if !matches!(
857
  warning.as_str(),
858
  "ambiguous_no_episode_title"
859
  | "hash_labeled"
860
  | "multiple_title_spans"
861
  | "no_title"
862
  | "path_retained"
863
  ) {
864
  continue;
865
  }
@@ -1123,18 +1139,30 @@ fn entity_spans(tokens: &[String], labels: &[String]) -> Vec<Value> {
1123
 
1124
  fn audit_warnings(record: &Record) -> Vec<String> {
1125
  let mut warnings = Vec::new();
1126
- let title_spans = entity_spans(&record.tokens, &record.labels)
1127
- .into_iter()
1128
- .filter(|span| span.get("label").and_then(Value::as_str) == Some("TITLE"))
1129
- .count();
1130
  if title_spans == 0 {
1131
  warnings.push("no_title".to_string());
1132
  } else if title_spans > 1 {
1133
  warnings.push("multiple_title_spans".to_string());
1134
  }
1135
  let has_episode = record.labels.iter().any(|label| label.ends_with("EPISODE"));
1136
  if !has_episode {
1137
  warnings.push("no_episode".to_string());
1138
  if record
1139
  .dropped_title_candidate_positions
1140
  .as_ref()
@@ -1143,20 +1171,112 @@ fn audit_warnings(record: &Record) -> Vec<String> {
1143
  warnings.push("ambiguous_no_episode_title".to_string());
1144
  }
1145
  }
1146
  if record.filename.contains('/') || record.filename.contains('\\') {
1147
  warnings.push("path_retained".to_string());
1148
  }
1149
  for (index, token) in record.tokens.iter().enumerate() {
1150
  if HASH_RE.is_match(token) && record.labels.get(index).is_some_and(|label| label != "O") {
1151
  warnings.push("hash_labeled".to_string());
1152
  break;
1153
  }
1154
  }
1155
  warnings.sort();
1156
  warnings.dedup();
1157
  warnings
1158
  }
1159
 
1160
  fn warning_counts(rows: &[Value]) -> HashMap<String, usize> {
1161
  let mut counts = HashMap::new();
1162
  for row in rows {
@@ -1223,10 +1343,7 @@ fn process_filename(
1223
  }
1224
  };
1225
  let warnings = audit_warnings(&record);
1226
- if warnings.iter().any(|warning| warning == "no_title")
1227
- || (recipe.count.unwrap_or(0) <= args.audit_max_count
1228
- && has_blocking_warnings(&warnings))
1229
- {
1230
  return Processed::Skipped {
1231
  reason: "low_frequency_audit_warning",
1232
  trimmed_parent,
@@ -1251,10 +1368,18 @@ fn has_blocking_warnings(warnings: &[String]) -> bool {
1251
  matches!(
1252
  warning.as_str(),
1253
  "ambiguous_no_episode_title"
1254
  | "hash_labeled"
1255
  | "multiple_title_spans"
1256
  | "no_title"
1257
  | "path_retained"
1258
  )
1259
  })
1260
  }
@@ -1773,6 +1898,10 @@ fn has_encoding_noise(value: &str) -> bool {
1773
  "譁", "蜈", "螟", "蟄", "謇", "邱", "荳", "縺", "繧", "莨", "鬆", "髯", "瀛",
1774
  "楀", "箷", "绲", "刔", "鏃", "湪", "鏍", "犲", "儚", "鐗", "吀", "铦", "躲",
1775
  "伄", "椋", "伓", "姘", "帽", "娆", "洖", "浜", "堝", "澶", "湴", "鐒",
1776
  ];
1777
  let marker_hits = markers
1778
  .iter()
@@ -1783,7 +1912,8 @@ fn has_encoding_noise(value: &str) -> bool {
1783
  .filter(|ch| ('\u{ff61}'..='\u{ff9f}').contains(ch))
1784
  .count();
1785
  let latin_mojibake = value.split_whitespace().any(|part| {
1786
- part.contains('帽') && part.chars().any(|ch| ch.is_ascii_alphabetic())
1787
  });
1788
  marker_hits >= 2 || (marker_hits >= 1 && halfwidth_hits >= 1) || latin_mojibake
1789
  }
@@ -2075,6 +2205,9 @@ fn label_for_refined_piece(piece: &str, role: &str, token_class: &str) -> String
2075
  return "O".to_string();
2076
  }
2077
  if role == "SOURCE" || matches!(token_class, "BRACKET_MEDIA_BLOCK" | "MEDIA_BLOCK") {
2078
  if atom_class == "RESOLUTION" {
2079
  return "B-RESOLUTION".to_string();
2080
  }
@@ -2129,6 +2262,24 @@ fn split_sxe_token(token: &str) -> Option<(Vec<String>, Vec<String>)> {
2129
  Some((pieces, labels))
2130
  }
2131
 
2132
  fn split_episode_token(token: &str) -> Option<(Vec<String>, Vec<String>)> {
2133
  if DECIMAL_EPISODE_RE.is_match(token) {
2134
  let pieces = split_generated_token(token);
@@ -3067,6 +3218,16 @@ fn normalize_title_token(token: &str) -> (Vec<String>, Vec<String>) {
3067
  labels.push("O".to_string());
3068
  continue;
3069
  }
3070
  if CJK_SEASON_TOKEN_RE.is_match(&piece) || SEASON_RE.is_match(&piece) {
3071
  output_pieces.push(piece);
3072
  labels.push("B-SEASON".to_string());
@@ -3732,6 +3893,7 @@ fn dmhy_record(filename: &str, template_id: &str, roles: &[String]) -> Option<Re
3732
  let roles = adjust_contextual_roles(&tokens, &groups, roles);
3733
  let (roles, dropped) = enforce_single_title_candidate(&tokens, &groups, &roles);
3734
  let (tokens, labels) = project_refined_tokens(&tokens, &groups, &roles);
3735
  let labels = smooth_title_spans(&tokens, &labels);
3736
  if tokens.len() != labels.len() {
3737
  return None;
@@ -3811,6 +3973,26 @@ mod tests {
3811
  let bracket_sxe = labels_for("[FLsnow.feat.PO][Himitsu_no_Aipri][1080P][S2E01]");
3812
  assert!(bracket_sxe.contains(&("2".to_string(), "B-SEASON".to_string())));
3813
  assert!(bracket_sxe.contains(&("01".to_string(), "B-EPISODE".to_string())));
3814
 
3815
  let cursed = labels_for("[Coalgirls]_C3-Cube_x_Cursed_x_Curious_01_[8E416230]");
3816
  assert!(cursed.contains(&("x".to_string(), "B-TITLE".to_string())));
@@ -4135,6 +4317,13 @@ mod tests {
4135
  "[4K_SDR][DBD-Raws&HKG瀛楀箷绲刔[鏃ュ湪鏍″湌][01][2160P]"
4136
  ));
4137
  assert!(has_encoding_noise("ATRI -My Dear Moments-/娆″洖浜堝憡 EP01 Log01"));
4138
  assert!(has_non_anime_noise(
4139
  "13-[旅游番][花丸字幕组][日本不思议铁路之旅][15.03.19-16.02.03][720&1080][中日双语]/铁道旅 15.03.19 720"
4140
  ));
 
393
  "low_frequency_audit_max_count": args.audit_max_count,
394
  "low_frequency_blocking_warnings": [
395
  "ambiguous_no_episode_title",
396
+ "encoding_noise_survived",
397
+ "episode_version_missing_label",
398
+ "episode_in_title",
399
+ "generic_title_only",
400
  "hash_labeled",
401
  "multiple_title_spans",
402
  "no_title",
403
+ "path_retained",
404
+ "sxe_compact_unexpanded",
405
+ "tech_in_title",
406
+ "template_episode_missing_label",
407
+ "template_sxe_missing_label"
408
  ],
409
  "expand": args.expand,
410
  "sample_per_template": if args.expand == "sample" { Some(args.sample_per_template) } else { None },
 
864
  if !matches!(
865
  warning.as_str(),
866
  "ambiguous_no_episode_title"
867
+ | "encoding_noise_survived"
868
+ | "episode_version_missing_label"
869
+ | "episode_in_title"
870
+ | "generic_title_only"
871
  | "hash_labeled"
872
  | "multiple_title_spans"
873
  | "no_title"
874
  | "path_retained"
875
+ | "sxe_compact_unexpanded"
876
+ | "tech_in_title"
877
+ | "template_episode_missing_label"
878
+ | "template_sxe_missing_label"
879
  ) {
880
  continue;
881
  }
 
1139
 
1140
  fn audit_warnings(record: &Record) -> Vec<String> {
1141
  let mut warnings = Vec::new();
1142
+ let title_texts = entity_texts(&record.tokens, &record.labels, "TITLE");
1143
+ let title_spans = title_texts.len();
1144
  if title_spans == 0 {
1145
  warnings.push("no_title".to_string());
1146
  } else if title_spans > 1 {
1147
  warnings.push("multiple_title_spans".to_string());
1148
  }
1149
+ if !title_texts.is_empty() && title_texts.iter().all(|title| generic_title_text(title)) {
1150
+ warnings.push("generic_title_only".to_string());
1151
+ }
1152
+ if title_texts.iter().any(|title| technical_title_text(title)) {
1153
+ warnings.push("tech_in_title".to_string());
1154
+ }
1155
+ if title_texts.iter().any(|title| episodeish_title_text(title)) {
1156
+ warnings.push("episode_in_title".to_string());
1157
+ }
1158
  let has_episode = record.labels.iter().any(|label| label.ends_with("EPISODE"));
1159
+ let has_season = record.labels.iter().any(|label| label.ends_with("SEASON"));
1160
+ let has_special = record.labels.iter().any(|label| label.ends_with("SPECIAL"));
1161
  if !has_episode {
1162
  warnings.push("no_episode".to_string());
1163
+ if record.template.contains("EPISODE") && !has_special {
1164
+ warnings.push("template_episode_missing_label".to_string());
1165
+ }
1166
  if record
1167
  .dropped_title_candidate_positions
1168
  .as_ref()
 
1171
  warnings.push("ambiguous_no_episode_title".to_string());
1172
  }
1173
  }
1174
+ if record.template.contains("SXE") && (!has_season || !has_episode) {
1175
+ warnings.push("template_sxe_missing_label".to_string());
1176
+ }
1177
  if record.filename.contains('/') || record.filename.contains('\\') {
1178
  warnings.push("path_retained".to_string());
1179
  }
1180
+ if has_encoding_noise(&record.filename)
1181
+ || record
1182
+ .source_filename
1183
+ .as_ref()
1184
+ .is_some_and(|source| has_encoding_noise(source))
1185
+ {
1186
+ warnings.push("encoding_noise_survived".to_string());
1187
+ }
1188
  for (index, token) in record.tokens.iter().enumerate() {
1189
+ let entity = record.labels.get(index).and_then(|label| label_entity(label));
1190
+ let cleaned = strip_wrapper(token);
1191
  if HASH_RE.is_match(token) && record.labels.get(index).is_some_and(|label| label != "O") {
1192
  warnings.push("hash_labeled".to_string());
1193
  break;
1194
  }
1195
+ if EPISODE_VERSION_RE.is_match(&compact_for_classify(&cleaned))
1196
+ && entity != Some("EPISODE")
1197
+ {
1198
+ warnings.push("episode_version_missing_label".to_string());
1199
+ }
1200
+ if SXE_VALUE_RE.is_match(&cleaned) && entity != Some("EPISODE") && entity != Some("SEASON")
1201
+ {
1202
+ warnings.push("sxe_compact_unexpanded".to_string());
1203
+ }
1204
  }
1205
  warnings.sort();
1206
  warnings.dedup();
1207
  warnings
1208
  }
1209
 
1210
+ fn label_entity(label: &str) -> Option<&str> {
1211
+ label
1212
+ .strip_prefix("B-")
1213
+ .or_else(|| label.strip_prefix("I-"))
1214
+ }
1215
+
1216
+ fn entity_texts(tokens: &[String], labels: &[String], target: &str) -> Vec<String> {
1217
+ let mut spans = Vec::new();
1218
+ let mut current = String::new();
1219
+ for (token, label) in tokens.iter().zip(labels.iter()) {
1220
+ let entity = label_entity(label);
1221
+ if entity == Some(target) {
1222
+ current.push_str(token);
1223
+ } else if !current.trim().is_empty() {
1224
+ spans.push(current.trim().to_string());
1225
+ current.clear();
1226
+ } else {
1227
+ current.clear();
1228
+ }
1229
+ }
1230
+ if !current.trim().is_empty() {
1231
+ spans.push(current.trim().to_string());
1232
+ }
1233
+ spans
1234
+ }
1235
+
1236
+ fn generic_title_text(text: &str) -> bool {
1237
+ matches!(
1238
+ text.trim().to_ascii_lowercase().as_str(),
1239
+ "tv"
1240
+ | "movie"
1241
+ | "mov"
1242
+ | "sample"
1243
+ | "commercial"
1244
+ | "commercials"
1245
+ | "cm"
1246
+ | "pv"
1247
+ | "op"
1248
+ | "ed"
1249
+ | "ncop"
1250
+ | "nced"
1251
+ | "menu"
1252
+ | "trailer"
1253
+ | "spot"
1254
+ | "bdmv"
1255
+ | "stream"
1256
+ )
1257
+ }
1258
+
1259
+ fn technical_title_text(text: &str) -> bool {
1260
+ let normalized = text.to_ascii_lowercase();
1261
+ normalized.contains("bdrip")
1262
+ || normalized.contains("webrip")
1263
+ || normalized.contains("web-dl")
1264
+ || normalized.contains("hevc")
1265
+ || normalized.contains("x264")
1266
+ || normalized.contains("x265")
1267
+ || normalized.contains("aac")
1268
+ || normalized.contains("flac")
1269
+ || normalized.contains("sourceunknown")
1270
+ }
1271
+
1272
+ fn episodeish_title_text(text: &str) -> bool {
1273
+ let trimmed = text.trim();
1274
+ EPISODE_VALUE_RE.is_match(trimmed)
1275
+ || EPISODE_CJK_RE.is_match(trimmed)
1276
+ || EPISODE_RANGE_RE.is_match(trimmed)
1277
+ || trimmed.chars().all(|ch| ch.is_ascii_digit())
1278
+ }
1279
+
1280
  fn warning_counts(rows: &[Value]) -> HashMap<String, usize> {
1281
  let mut counts = HashMap::new();
1282
  for row in rows {
 
1343
  }
1344
  };
1345
  let warnings = audit_warnings(&record);
1346
+ if warnings.iter().any(|warning| warning == "no_title") || has_blocking_warnings(&warnings) {
1347
  return Processed::Skipped {
1348
  reason: "low_frequency_audit_warning",
1349
  trimmed_parent,
 
1368
  matches!(
1369
  warning.as_str(),
1370
  "ambiguous_no_episode_title"
1371
+ | "encoding_noise_survived"
1372
+ | "episode_version_missing_label"
1373
+ | "episode_in_title"
1374
+ | "generic_title_only"
1375
  | "hash_labeled"
1376
  | "multiple_title_spans"
1377
  | "no_title"
1378
  | "path_retained"
1379
+ | "sxe_compact_unexpanded"
1380
+ | "tech_in_title"
1381
+ | "template_episode_missing_label"
1382
+ | "template_sxe_missing_label"
1383
  )
1384
  })
1385
  }
 
1898
  "譁", "蜈", "螟", "蟄", "謇", "邱", "荳", "縺", "繧", "莨", "鬆", "髯", "瀛",
1899
  "楀", "箷", "绲", "刔", "鏃", "湪", "鏍", "犲", "儚", "鐗", "吀", "铦", "躲",
1900
  "伄", "椋", "伓", "姘", "帽", "娆", "洖", "浜", "堝", "澶", "湴", "鐒",
1901
+ "銇", "銈", "銉", "偅", "偗", "儱", "儫", "兗", "仧", "鏉变", "鍠靛",
1902
+ "銉熴", "銈︺", "瀵掕", "潐楦", "常涔", "涓歖", "缁堟", "湯鍒",
1903
+ "瀵诲", "線浣", "曟柟", "瓒呴", "绁炪", "偘銉", "兇銈", "銉砡",
1904
+ "銉砕", "杩风", "硦澶", "銇淬", "仧銉", "銉嗐", "偅銈", "銈躲",
1905
  ];
1906
  let marker_hits = markers
1907
  .iter()
 
1912
  .filter(|ch| ('\u{ff61}'..='\u{ff9f}').contains(ch))
1913
  .count();
1914
  let latin_mojibake = value.split_whitespace().any(|part| {
1915
+ part.chars().any(|ch| matches!(ch, '帽' | '茅' | '脳' | '锛'))
1916
+ && part.chars().any(|ch| ch.is_ascii_alphabetic())
1917
  });
1918
  marker_hits >= 2 || (marker_hits >= 1 && halfwidth_hits >= 1) || latin_mojibake
1919
  }
 
2205
  return "O".to_string();
2206
  }
2207
  if role == "SOURCE" || matches!(token_class, "BRACKET_MEDIA_BLOCK" | "MEDIA_BLOCK") {
2208
+ if atom_class == "EPISODE_VERSION" {
2209
+ return "B-EPISODE".to_string();
2210
+ }
2211
  if atom_class == "RESOLUTION" {
2212
  return "B-RESOLUTION".to_string();
2213
  }
 
2262
  Some((pieces, labels))
2263
  }
2264
 
2265
+ fn repair_compact_sxe_tokens(
2266
+ tokens: Vec<String>,
2267
+ labels: Vec<String>,
2268
+ ) -> (Vec<String>, Vec<String>) {
2269
+ let mut output_tokens = Vec::new();
2270
+ let mut output_labels = Vec::new();
2271
+ for (token, label) in tokens.into_iter().zip(labels.into_iter()) {
2272
+ if let Some((pieces, piece_labels)) = split_sxe_token(&token) {
2273
+ output_tokens.extend(pieces);
2274
+ output_labels.extend(piece_labels);
2275
+ } else {
2276
+ output_tokens.push(token);
2277
+ output_labels.push(label);
2278
+ }
2279
+ }
2280
+ (output_tokens, output_labels)
2281
+ }
2282
+
2283
  fn split_episode_token(token: &str) -> Option<(Vec<String>, Vec<String>)> {
2284
  if DECIMAL_EPISODE_RE.is_match(token) {
2285
  let pieces = split_generated_token(token);
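
For reference, the compact `SxxExx` expansion that `repair_compact_sxe_tokens` relies on might look roughly like the sketch below. The body of `split_sxe_token` is not shown in this diff, so `split_compact_sxe` is a hypothetical stand-in. It handles only a single `S<digits>E<digits>` token (no ranges such as `S5E12-S5E14`, no bracket wrappers) and assumes the markers stay `O` while the numbers carry `B-SEASON` and `B-EPISODE`, as the labeling document describes.

```rust
/// Split a compact `SxxEyy` token into marker and number pieces
/// (illustrative sketch, not the repository's `split_sxe_token`).
/// Returns `None` when the token does not have exactly that shape.
fn split_compact_sxe(token: &str) -> Option<(Vec<String>, Vec<String>)> {
    // Expect a leading `S`/`s` season marker.
    let rest = token
        .strip_prefix('S')
        .or_else(|| token.strip_prefix('s'))?;
    // Locate the `E`/`e` marker that separates season and episode digits.
    let e_pos = rest.find(|c: char| c == 'E' || c == 'e')?;
    let season = &rest[..e_pos];
    let episode = &rest[e_pos + 1..];
    let all_digits = |s: &str| !s.is_empty() && s.chars().all(|c| c.is_ascii_digit());
    if !all_digits(season) || !all_digits(episode) {
        return None;
    }
    // Keep the original marker characters so the pieces rejoin to the input.
    let pieces = vec![
        token[..1].to_string(),
        season.to_string(),
        rest[e_pos..e_pos + 1].to_string(),
        episode.to_string(),
    ];
    // Markers are structural `O`; only the numeric values carry entity labels.
    let labels = vec![
        "O".to_string(),
        "B-SEASON".to_string(),
        "O".to_string(),
        "B-EPISODE".to_string(),
    ];
    Some((pieces, labels))
}
```

Under these assumptions, `split_compact_sxe("S01E12")` produces the pieces `S`, `01`, `E`, `12`, which matches the shape the new `bocchi_sxe` test asserts on (`01` as `B-SEASON`, `12` as `B-EPISODE`).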
 
3218
  labels.push("O".to_string());
3219
  continue;
3220
  }
3221
+ if let Some((pieces, piece_labels)) = split_sxe_token(&piece) {
3222
+ output_pieces.extend(pieces);
3223
+ labels.extend(piece_labels);
3224
+ continue;
3225
+ }
3226
+ if EPISODE_VERSION_RE.is_match(&compact_for_classify(&piece)) {
3227
+ output_pieces.push(piece);
3228
+ labels.push("B-EPISODE".to_string());
3229
+ continue;
3230
+ }
3231
  if CJK_SEASON_TOKEN_RE.is_match(&piece) || SEASON_RE.is_match(&piece) {
3232
  output_pieces.push(piece);
3233
  labels.push("B-SEASON".to_string());
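
The `01v2` rule (episode `01`, version `2`) can be sketched without a regex. `EPISODE_VERSION_RE` is not defined in this diff, so the helper below, `parse_episode_version`, is a hypothetical approximation that accepts only `<digits>v<digits>`; the real pattern may be broader.

```rust
/// Parse an episode-version token such as `01v2` into (episode, version).
/// Illustrative sketch only: accepts `<digits>v<digits>` (case-insensitive
/// `v`) and returns `None` for anything else.
fn parse_episode_version(token: &str) -> Option<(&str, &str)> {
    // Find the version marker; its byte index splits the token in two.
    let v_pos = token.find(|c: char| c == 'v' || c == 'V')?;
    let episode = &token[..v_pos];
    let version = &token[v_pos + 1..];
    let all_digits = |s: &str| !s.is_empty() && s.chars().all(|c| c.is_ascii_digit());
    if all_digits(episode) && all_digits(version) {
        Some((episode, version))
    } else {
        None
    }
}
```

A token that passes this check would be labeled `B-EPISODE` as a whole, as in the new `10v2` and `043v2` tests, rather than being absorbed into `TITLE` or `SOURCE`.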
 
3893
  let roles = adjust_contextual_roles(&tokens, &groups, roles);
3894
  let (roles, dropped) = enforce_single_title_candidate(&tokens, &groups, &roles);
3895
  let (tokens, labels) = project_refined_tokens(&tokens, &groups, &roles);
3896
+ let (tokens, labels) = repair_compact_sxe_tokens(tokens, labels);
3897
  let labels = smooth_title_spans(&tokens, &labels);
3898
  if tokens.len() != labels.len() {
3899
  return None;
 
3973
  let bracket_sxe = labels_for("[FLsnow.feat.PO][Himitsu_no_Aipri][1080P][S2E01]");
3974
  assert!(bracket_sxe.contains(&("2".to_string(), "B-SEASON".to_string())));
3975
  assert!(bracket_sxe.contains(&("01".to_string(), "B-EPISODE".to_string())));
3976
+ let bocchi_sxe =
3977
+ labels_for("Bocchi the Rock! 孤獨搖滾!S01E12「早起的日頭光照佇你的身上」");
3978
+ assert!(bocchi_sxe.contains(&("01".to_string(), "B-SEASON".to_string())));
3979
+ assert!(bocchi_sxe.contains(&("12".to_string(), "B-EPISODE".to_string())));
3980
+ assert!(!bocchi_sxe.contains(&("S01E12".to_string(), "O".to_string())));
3981
+ let sxe_range = labels_for(
3982
+ "【CXRAW】【TMNT 2012 TV series】【S5E12-S5E14】【Wanted:Bebop & Rocksteady】【DVDrip】【480p】【AVC Hi10P AAC MP4】",
3983
+ );
3984
+ assert!(sxe_range.contains(&("5".to_string(), "B-SEASON".to_string())));
3985
+ assert!(sxe_range.contains(&("12".to_string(), "B-EPISODE".to_string())));
3986
+ assert!(sxe_range.contains(&("14".to_string(), "B-EPISODE".to_string())));
3987
+ let episode_version_title = labels_for("[DHR][Dumbbell[10v2][BIG5][720P][AVC_AAC]");
3988
+ assert!(episode_version_title.contains(&("10v2".to_string(), "B-EPISODE".to_string())));
3989
+ assert!(!episode_version_title.contains(&("10v2".to_string(), "B-TITLE".to_string())));
3990
+ let episode_version_lang =
3991
+ labels_for("[GalaxyRailroad-888] Yu-Gi-Oh! GO RUSH !! [043v2_GB]");
3992
+ assert!(
3993
+ episode_version_lang.contains(&("043v2".to_string(), "B-EPISODE".to_string()))
3994
+ );
3995
+ assert!(episode_version_lang.contains(&("GB".to_string(), "B-SOURCE".to_string())));
3996
 
3997
  let cursed = labels_for("[Coalgirls]_C3-Cube_x_Cursed_x_Curious_01_[8E416230]");
3998
  assert!(cursed.contains(&("x".to_string(), "B-TITLE".to_string())));
 
4317
  "[4K_SDR][DBD-Raws&HKG瀛楀箷绲刔[鏃ュ湪鏍″湌][01][2160P]"
4318
  ));
4319
  assert!(has_encoding_noise("ATRI -My Dear Moments-/娆″洖浜堝憡 EP01 Log01"));
4320
+ assert!(has_encoding_noise(
4321
+ "[2002-2003] Mew Mew_鏉变含鍠靛柕(鏉变含銉熴儱銈︺儫銉ャ偊)_TV"
4322
+ ));
4323
+ assert!(has_encoding_noise("[DAY][Megami no Caf茅 Terrace][01]"));
4324
+ assert!(has_encoding_noise(
4325
+ "[4K_SDR][DBD-Raws][瀵掕潐楦f常涔嬫椂 涓歖[NCED1]"
4326
+ ));
4327
  assert!(has_non_anime_noise(
4328
  "13-[旅游番][花丸字幕组][日本不思议铁路之旅][15.03.19-16.02.03][720&1080][中日双语]/铁道旅 15.03.19 720"
4329
  ));