Document and tighten DMHY template labels

Files changed (2)

docs/dmhy_template_labeling.md +103 -0
tools/rust_dmhy_template_apply/src/main.rs +199 -10

docs/dmhy_template_labeling.md ADDED

	@@ -0,0 +1,103 @@

+# DMHY Template Labeling Requirements
+This document records the current labeling contract for the DMHY template
+metadata workflow. It is intentionally stricter than the old weak-label export:
+precision is preferred over coverage, especially for `TITLE`, `EPISODE`, and
+`SEASON`.
+## Source And Pipeline
+- Source snapshot: `datasets/AnimeName/dmhy_list.jsonl`.
+- Optional original source: `D:\WorkSpace\Python\dmhy-parser\dmhy_anime.db`.
+- Template recipe generation and application live in
+  `tools/rust_dmhy_template_apply`.
+- Generated training JSONL rows must contain at least `filename`, `tokens`,
+  `labels`, `template_id`, and `template`.
+- Reports and intermediate audits belong under `reports/`; they are diagnostic
+  artifacts, not authoritative dataset files.
+## Critical Label Semantics
+- `TITLE` is the anime/work title. It must be one contiguous span whenever a
+  single title is being emitted.
+- `EPISODE` is the episode number or explicit episode marker span.
+- `SEASON` is the season/cour/part marker when the filename explicitly encodes
+  season structure, such as `S2`, `2nd Season`, `Second Season`, `第2季`, or
+  `Part 5` in series-part naming.
+- `GROUP` is a release group or subtitle group, not the title.
+- `SOURCE` covers release metadata about media, source, codec, language, or
+  platform, such as `BDRip`, `WEB-DL`, `HEVC`, `AAC`, `Baha`, `CR`, `CHS`, `CHT`,
+  `GB`, and `BIG5`.
+- `RESOLUTION` covers explicit resolution values such as `720P`, `1080p`, and
+  `1920x1080`.
+- `SPECIAL` covers non-episode extras such as `NCOP`, `NCED`, `PV`, `CM`,
+  `Menu`, `Trailer`, `Creditless ED`, and movie/special numbering when it is not
+  an episodic number.
+- Hash-like suffixes such as `[8E416230]` may be kept in the filename text when
+  useful, but their tokens must stay `O` and never carry an entity label in
+  generated training data.
+## Title Rules
+- Avoid duplicate titles. If the leaf filename already carries a complete title,
+  season, and episode structure, drop redundant parent directory titles.
+- If precision is uncertain, prefer skipping the row/template over producing a
+  duplicated or discontinuous `TITLE`.
+- A title may contain punctuation or symbols. Internal title joiners must stay
+  inside the title span, including common ASCII separators and known Unicode
+  title punctuation such as `‐`, `–`, `—`, `＄`, `∽`, `꞉`, and `♥`.
+- Multiple title candidates in one filename should be handled explicitly:
+  bilingual title aliases and special-program titles are allowed in rich review
+  metadata, but the final weak training row should not emit arbitrary
+  non-contiguous titles unless that structure has been reviewed.
+- Generic prefixes such as `TV`, `TVアニメ`, or `アニメ` are not part of the title
+  when a real title follows.
+## Episode And Season Rules
+- `TITLE`, `EPISODE`, and `SEASON` are the highest-risk labels; errors here have
+  higher training cost than dropping a row.
+- `SxxExx` means season plus episode. `S` identifies season and `E` identifies
+  episode. If the tokenizer keeps `S01E02` as one compact token, project it to
+  season and episode components during normalization; if split into marker and
+  number tokens, the numeric value must carry `SEASON`/`EPISODE` and the marker
+  may remain structural `O`.
+- `01v2` means episode `01` version `2`; the episode value must not be treated
+  as title.
+- Episode ranges such as `01-13`, `#1-3`, and CJK forms like `第10話` should
+  remain episode spans.
+- Decimal episode-like values such as `14.5` may be valid recap or midpoint
+  episodes and should not be discarded only because they contain a decimal point.
+- Title-internal numbers stay in `TITLE` when they are part of the work name,
+  such as `Eien no 831`, `Zom 100`, or movie titles like `Movie 27 The
+  Million-Dollar Pentagram`.
+## Path And Noise Rules
+- BDMV expanded paths such as `BDMV/STREAM/00006` are not useful training
+  filenames and should be skipped.
+- Non-anime or abstract path data, including obvious `MTV` paths and tourism /
+  railway program dumps, should be skipped.
+- Mojibake and encoding-noise rows should be skipped unless explicitly kept for
+  diagnosis.
+- Jellyfin-like paths (`Title/Season 1/E07 - Full Title ...`) are valid, but
+  the output should avoid duplicate title spans.
+- Parent directory context is allowed only when the leaf filename is too weak to
+  identify the title; otherwise the leaf filename should dominate.
+## Review Strategy
+- High-frequency templates affect training most and must be sampled more heavily.
+- Low-frequency templates are gated conservatively; ambiguous cases are sent to
+  review instead of being emitted as generated training data.
+- Middle-frequency templates should be audited by sampling a few examples from
+  every template class, then grouping failures by rule rather than patching
+  single examples blindly.
+- A template can enter the generated training set only when its `TITLE`,
+  `EPISODE`, and `SEASON` behavior is defensible across sampled rows.
+## Character Dataset Projection
+- Regex-token JSONL is converted to character JSONL by projecting BIO labels:
+  the first character of a `B-X` token keeps `B-X`, its later characters become
+  `I-X`, and `O` remains `O`.
+- Punctuation tokens must remain independently represented before character
+  projection so the model can learn filename structure boundaries.
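
The character projection described in the last section of the document can be sketched as follows. This is a minimal illustration, not code from this change: `project_to_chars` is a hypothetical name, and the sketch assumes that a token already labeled `I-X` keeps `I-X` on every character, which the document does not state explicitly.

```rust
/// Project token-level BIO labels onto individual characters.
/// Hypothetical helper for illustration only; it is not part of the diff.
fn project_to_chars(tokens: &[&str], labels: &[&str]) -> Vec<(char, String)> {
    let mut out = Vec::new();
    for (&token, &label) in tokens.iter().zip(labels.iter()) {
        for (i, ch) in token.chars().enumerate() {
            let char_label = match label.strip_prefix("B-") {
                // Characters after the first one continue the entity.
                Some(entity) if i > 0 => format!("I-{entity}"),
                // First character of a B-X token, or any I-X / O token
                // (assumption: I-X and O are copied unchanged).
                _ => label.to_string(),
            };
            out.push((ch, char_label));
        }
    }
    out
}

fn main() {
    // Punctuation stays a separate token, so brackets get their own `O` label.
    let tokens = ["[", "Zom", " ", "100", "]", "01"];
    let labels = ["O", "B-TITLE", "I-TITLE", "I-TITLE", "O", "B-EPISODE"];
    for (ch, label) in project_to_chars(&tokens, &labels) {
        println!("{ch:?}\t{label}");
    }
}
```

Keeping punctuation as its own token before projection means each bracket or separator maps to exactly one `O` character, which is what lets a character-level model see the filename's structural boundaries.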

tools/rust_dmhy_template_apply/src/main.rs CHANGED

@@ -393,10 +393,18 @@ fn main() -> Result<()> {
         "low_frequency_audit_max_count": args.audit_max_count,
         "low_frequency_blocking_warnings": [
             "ambiguous_no_episode_title",
             "hash_labeled",
             "multiple_title_spans",
             "no_title",
-            "path_retained"
         ],
         "expand": args.expand,
         "sample_per_template": if args.expand == "sample" { Some(args.sample_per_template) } else { None },
@@ -856,10 +864,18 @@ fn run_verify_generated_output(args: &Args) -> Result<()> {
             if !matches!(
                 warning.as_str(),
                 "ambiguous_no_episode_title"
                     | "hash_labeled"
                     | "multiple_title_spans"
                     | "no_title"
                     | "path_retained"
             ) {
                 continue;
             }
@@ -1123,18 +1139,30 @@ fn entity_spans(tokens: &[String], labels: &[String]) -> Vec<Value> {
 fn audit_warnings(record: &Record) -> Vec<String> {
     let mut warnings = Vec::new();
-    let title_spans = entity_spans(&record.tokens, &record.labels)
-        .into_iter()
-        .filter(|span| span.get("label").and_then(Value::as_str) == Some("TITLE"))
-        .count();
     if title_spans == 0 {
         warnings.push("no_title".to_string());
     } else if title_spans > 1 {
         warnings.push("multiple_title_spans".to_string());
     }
     let has_episode = record.labels.iter().any(|label| label.ends_with("EPISODE"));
     if !has_episode {
         warnings.push("no_episode".to_string());
         if record
             .dropped_title_candidate_positions
             .as_ref()
@@ -1143,20 +1171,112 @@ fn audit_warnings(record: &Record) -> Vec<String> {
             warnings.push("ambiguous_no_episode_title".to_string());
         }
     }
     if record.filename.contains('/') || record.filename.contains('\\') {
         warnings.push("path_retained".to_string());
     }
     for (index, token) in record.tokens.iter().enumerate() {
         if HASH_RE.is_match(token) && record.labels.get(index).is_some_and(|label| label != "O") {
             warnings.push("hash_labeled".to_string());
             break;
         }
     }
     warnings.sort();
     warnings.dedup();
     warnings
 }
 fn warning_counts(rows: &[Value]) -> HashMap<String, usize> {
     let mut counts = HashMap::new();
     for row in rows {
@@ -1223,10 +1343,7 @@ fn process_filename(
         }
     };
     let warnings = audit_warnings(&record);
-    if warnings.iter().any(|warning| warning == "no_title")
-        || (recipe.count.unwrap_or(0) <= args.audit_max_count
-            && has_blocking_warnings(&warnings))
-    {
         return Processed::Skipped {
             reason: "low_frequency_audit_warning",
             trimmed_parent,
@@ -1251,10 +1368,18 @@ fn has_blocking_warnings(warnings: &[String]) -> bool {
         matches!(
             warning.as_str(),
             "ambiguous_no_episode_title"
                 | "hash_labeled"
                 | "multiple_title_spans"
                 | "no_title"
                 | "path_retained"
         )
     })
 }
@@ -1773,6 +1898,10 @@ fn has_encoding_noise(value: &str) -> bool {
         "譁", "蜈", "螟", "蟄", "謇", "邱", "荳", "縺", "繧", "莨", "鬆", "髯", "瀛",
         "楀", "箷", "绲", "刔", "鏃", "湪", "鏍", "犲", "儚", "鐗", "吀", "铦", "躲",
         "伄", "椋", "伓", "姘", "帽", "娆", "洖", "浜", "堝", "澶", "湴", "鐒",
     ];
     let marker_hits = markers
         .iter()
@@ -1783,7 +1912,8 @@ fn has_encoding_noise(value: &str) -> bool {
         .filter(|ch| ('\u{ff61}'..='\u{ff9f}').contains(ch))
         .count();
     let latin_mojibake = value.split_whitespace().any(|part| {
-        part.contains('帽') && part.chars().any(|ch| ch.is_ascii_alphabetic())
     });
     marker_hits >= 2 || (marker_hits >= 1 && halfwidth_hits >= 1) || latin_mojibake
 }
@@ -2075,6 +2205,9 @@ fn label_for_refined_piece(piece: &str, role: &str, token_class: &str) -> String
         return "O".to_string();
     }
     if role == "SOURCE" || matches!(token_class, "BRACKET_MEDIA_BLOCK" | "MEDIA_BLOCK") {
         if atom_class == "RESOLUTION" {
             return "B-RESOLUTION".to_string();
         }
@@ -2129,6 +2262,24 @@ fn split_sxe_token(token: &str) -> Option<(Vec<String>, Vec<String>)> {
     Some((pieces, labels))
 }
 fn split_episode_token(token: &str) -> Option<(Vec<String>, Vec<String>)> {
     if DECIMAL_EPISODE_RE.is_match(token) {
         let pieces = split_generated_token(token);
@@ -3067,6 +3218,16 @@ fn normalize_title_token(token: &str) -> (Vec<String>, Vec<String>) {
             labels.push("O".to_string());
             continue;
         }
         if CJK_SEASON_TOKEN_RE.is_match(&piece) || SEASON_RE.is_match(&piece) {
             output_pieces.push(piece);
             labels.push("B-SEASON".to_string());
@@ -3732,6 +3893,7 @@ fn dmhy_record(filename: &str, template_id: &str, roles: &[String]) -> Option<Re
     let roles = adjust_contextual_roles(&tokens, &groups, roles);
     let (roles, dropped) = enforce_single_title_candidate(&tokens, &groups, &roles);
     let (tokens, labels) = project_refined_tokens(&tokens, &groups, &roles);
     let labels = smooth_title_spans(&tokens, &labels);
     if tokens.len() != labels.len() {
         return None;
@@ -3811,6 +3973,26 @@ mod tests {
         let bracket_sxe = labels_for("[FLsnow.feat.PO][Himitsu_no_Aipri][1080P][S2E01]");
         assert!(bracket_sxe.contains(&("2".to_string(), "B-SEASON".to_string())));
         assert!(bracket_sxe.contains(&("01".to_string(), "B-EPISODE".to_string())));
         let cursed = labels_for("[Coalgirls]_C3-Cube_x_Cursed_x_Curious_01_[8E416230]");
         assert!(cursed.contains(&("x".to_string(), "B-TITLE".to_string())));
@@ -4135,6 +4317,13 @@ mod tests {
             "[4K_SDR][DBD-Raws&HKG瀛楀箷绲刔[鏃ュ湪鏍″湌][01][2160P]"
         ));
         assert!(has_encoding_noise("ATRI -My Dear Moments-/娆″洖浜堝憡 EP01 Log01"));
         assert!(has_non_anime_noise(
             "13-[旅游番][花丸字幕组][日本不思议铁路之旅][15.03.19-16.02.03][720&1080][中日双语]/铁道旅 15.03.19 720"
         ));

         "low_frequency_audit_max_count": args.audit_max_count,
         "low_frequency_blocking_warnings": [
             "ambiguous_no_episode_title",
+            "encoding_noise_survived",
+            "episode_version_missing_label",
+            "episode_in_title",
+            "generic_title_only",
             "hash_labeled",
             "multiple_title_spans",
             "no_title",
+            "path_retained",
+            "sxe_compact_unexpanded",
+            "tech_in_title",
+            "template_episode_missing_label",
+            "template_sxe_missing_label"
         ],
         "expand": args.expand,
         "sample_per_template": if args.expand == "sample" { Some(args.sample_per_template) } else { None },
             if !matches!(
                 warning.as_str(),
                 "ambiguous_no_episode_title"
+                    | "encoding_noise_survived"
+                    | "episode_version_missing_label"
+                    | "episode_in_title"
+                    | "generic_title_only"
                     | "hash_labeled"
                     | "multiple_title_spans"
                     | "no_title"
                     | "path_retained"
+                    | "sxe_compact_unexpanded"
+                    | "tech_in_title"
+                    | "template_episode_missing_label"
+                    | "template_sxe_missing_label"
             ) {
                 continue;
             }
 fn audit_warnings(record: &Record) -> Vec<String> {
     let mut warnings = Vec::new();
+    let title_texts = entity_texts(&record.tokens, &record.labels, "TITLE");
+    let title_spans = title_texts.len();
     if title_spans == 0 {
         warnings.push("no_title".to_string());
     } else if title_spans > 1 {
         warnings.push("multiple_title_spans".to_string());
     }
+    if !title_texts.is_empty() && title_texts.iter().all(|title| generic_title_text(title)) {
+        warnings.push("generic_title_only".to_string());
+    }
+    if title_texts.iter().any(|title| technical_title_text(title)) {
+        warnings.push("tech_in_title".to_string());
+    }
+    if title_texts.iter().any(|title| episodeish_title_text(title)) {
+        warnings.push("episode_in_title".to_string());
+    }
     let has_episode = record.labels.iter().any(|label| label.ends_with("EPISODE"));
+    let has_season = record.labels.iter().any(|label| label.ends_with("SEASON"));
+    let has_special = record.labels.iter().any(|label| label.ends_with("SPECIAL"));
     if !has_episode {
         warnings.push("no_episode".to_string());
+        if record.template.contains("EPISODE") && !has_special {
+            warnings.push("template_episode_missing_label".to_string());
+        }
         if record
             .dropped_title_candidate_positions
             .as_ref()
             warnings.push("ambiguous_no_episode_title".to_string());
         }
     }
+    if record.template.contains("SXE") && (!has_season || !has_episode) {
+        warnings.push("template_sxe_missing_label".to_string());
+    }
     if record.filename.contains('/') || record.filename.contains('\\') {
         warnings.push("path_retained".to_string());
     }
+    if has_encoding_noise(&record.filename)
+        || record
+            .source_filename
+            .as_ref()
+            .is_some_and(|source| has_encoding_noise(source))
+    {
+        warnings.push("encoding_noise_survived".to_string());
+    }
     for (index, token) in record.tokens.iter().enumerate() {
+        let entity = record.labels.get(index).and_then(|label| label_entity(label));
+        let cleaned = strip_wrapper(token);
         if HASH_RE.is_match(token) && record.labels.get(index).is_some_and(|label| label != "O") {
             warnings.push("hash_labeled".to_string());
             break;
         }
+        if EPISODE_VERSION_RE.is_match(&compact_for_classify(&cleaned))
+            && entity != Some("EPISODE")
+        {
+            warnings.push("episode_version_missing_label".to_string());
+        }
+        if SXE_VALUE_RE.is_match(&cleaned) && entity != Some("EPISODE") && entity != Some("SEASON")
+        {
+            warnings.push("sxe_compact_unexpanded".to_string());
+        }
     }
     warnings.sort();
     warnings.dedup();
     warnings
 }
+fn label_entity(label: &str) -> Option<&str> {
+    label
+        .strip_prefix("B-")
+        .or_else(|| label.strip_prefix("I-"))
+}
+fn entity_texts(tokens: &[String], labels: &[String], target: &str) -> Vec<String> {
+    let mut spans = Vec::new();
+    let mut current = String::new();
+    for (token, label) in tokens.iter().zip(labels.iter()) {
+        let entity = label_entity(label);
+        if entity == Some(target) {
+            current.push_str(token);
+        } else if !current.trim().is_empty() {
+            spans.push(current.trim().to_string());
+            current.clear();
+        } else {
+            current.clear();
+        }
+    }
+    if !current.trim().is_empty() {
+        spans.push(current.trim().to_string());
+    }
+    spans
+}
+fn generic_title_text(text: &str) -> bool {
+    matches!(
+        text.trim().to_ascii_lowercase().as_str(),
+        "tv"
+            | "movie"
+            | "mov"
+            | "sample"
+            | "commercial"
+            | "commercials"
+            | "cm"
+            | "pv"
+            | "op"
+            | "ed"
+            | "ncop"
+            | "nced"
+            | "menu"
+            | "trailer"
+            | "spot"
+            | "bdmv"
+            | "stream"
+    )
+}
+fn technical_title_text(text: &str) -> bool {
+    let normalized = text.to_ascii_lowercase();
+    normalized.contains("bdrip")
+        || normalized.contains("webrip")
+        || normalized.contains("web-dl")
+        || normalized.contains("hevc")
+        || normalized.contains("x264")
+        || normalized.contains("x265")
+        || normalized.contains("aac")
+        || normalized.contains("flac")
+        || normalized.contains("sourceunknown")
+}
+fn episodeish_title_text(text: &str) -> bool {
+    let trimmed = text.trim();
+    EPISODE_VALUE_RE.is_match(trimmed)
+        || EPISODE_CJK_RE.is_match(trimmed)
+        || EPISODE_RANGE_RE.is_match(trimmed)
+        || trimmed.chars().all(|ch| ch.is_ascii_digit())
+}
 fn warning_counts(rows: &[Value]) -> HashMap<String, usize> {
     let mut counts = HashMap::new();
     for row in rows {
         }
     };
     let warnings = audit_warnings(&record);
+    if warnings.iter().any(|warning| warning == "no_title") || has_blocking_warnings(&warnings) {
         return Processed::Skipped {
             reason: "low_frequency_audit_warning",
             trimmed_parent,
         matches!(
             warning.as_str(),
             "ambiguous_no_episode_title"
+                | "encoding_noise_survived"
+                | "episode_version_missing_label"
+                | "episode_in_title"
+                | "generic_title_only"
                 | "hash_labeled"
                 | "multiple_title_spans"
                 | "no_title"
                 | "path_retained"
+                | "sxe_compact_unexpanded"
+                | "tech_in_title"
+                | "template_episode_missing_label"
+                | "template_sxe_missing_label"
         )
     })
 }
         "譁", "蜈", "螟", "蟄", "謇", "邱", "荳", "縺", "繧", "莨", "鬆", "髯", "瀛",
         "楀", "箷", "绲", "刔", "鏃", "湪", "鏍", "犲", "儚", "鐗", "吀", "铦", "躲",
         "伄", "椋", "伓", "姘", "帽", "娆", "洖", "浜", "堝", "澶", "湴", "鐒",
+        "銇", "銈", "銉", "偅", "偗", "儱", "儫", "兗", "仧", "鏉变", "鍠靛",
+        "銉熴", "銈︺", "瀵掕", "潐楦", "常涔", "涓歖", "缁堟", "湯鍒",
+        "瀵诲", "線浣", "曟柟", "瓒呴", "绁炪", "偘銉", "兇銈", "銉砡",
+        "銉砕", "杩风", "硦澶", "銇淬", "仧銉", "銉嗐", "偅銈", "銈躲",
     ];
     let marker_hits = markers
         .iter()
         .filter(|ch| ('\u{ff61}'..='\u{ff9f}').contains(ch))
         .count();
     let latin_mojibake = value.split_whitespace().any(|part| {
+        part.chars().any(|ch| matches!(ch, '帽' | '茅' | '脳' | '锛'))
+            && part.chars().any(|ch| ch.is_ascii_alphabetic())
     });
     marker_hits >= 2 || (marker_hits >= 1 && halfwidth_hits >= 1) || latin_mojibake
 }
         return "O".to_string();
     }
     if role == "SOURCE" || matches!(token_class, "BRACKET_MEDIA_BLOCK" | "MEDIA_BLOCK") {
+        if atom_class == "EPISODE_VERSION" {
+            return "B-EPISODE".to_string();
+        }
         if atom_class == "RESOLUTION" {
             return "B-RESOLUTION".to_string();
         }
     Some((pieces, labels))
 }
+fn repair_compact_sxe_tokens(
+    tokens: Vec<String>,
+    labels: Vec<String>,
+) -> (Vec<String>, Vec<String>) {
+    let mut output_tokens = Vec::new();
+    let mut output_labels = Vec::new();
+    for (token, label) in tokens.into_iter().zip(labels.into_iter()) {
+        if let Some((pieces, piece_labels)) = split_sxe_token(&token) {
+            output_tokens.extend(pieces);
+            output_labels.extend(piece_labels);
+        } else {
+            output_tokens.push(token);
+            output_labels.push(label);
+        }
+    }
+    (output_tokens, output_labels)
+}
 fn split_episode_token(token: &str) -> Option<(Vec<String>, Vec<String>)> {
     if DECIMAL_EPISODE_RE.is_match(token) {
         let pieces = split_generated_token(token);
             labels.push("O".to_string());
             continue;
         }
+        if let Some((pieces, piece_labels)) = split_sxe_token(&piece) {
+            output_pieces.extend(pieces);
+            labels.extend(piece_labels);
+            continue;
+        }
+        if EPISODE_VERSION_RE.is_match(&compact_for_classify(&piece)) {
+            output_pieces.push(piece);
+            labels.push("B-EPISODE".to_string());
+            continue;
+        }
         if CJK_SEASON_TOKEN_RE.is_match(&piece) || SEASON_RE.is_match(&piece) {
             output_pieces.push(piece);
             labels.push("B-SEASON".to_string());
     let roles = adjust_contextual_roles(&tokens, &groups, roles);
     let (roles, dropped) = enforce_single_title_candidate(&tokens, &groups, &roles);
     let (tokens, labels) = project_refined_tokens(&tokens, &groups, &roles);
+    let (tokens, labels) = repair_compact_sxe_tokens(tokens, labels);
     let labels = smooth_title_spans(&tokens, &labels);
     if tokens.len() != labels.len() {
         return None;
         let bracket_sxe = labels_for("[FLsnow.feat.PO][Himitsu_no_Aipri][1080P][S2E01]");
         assert!(bracket_sxe.contains(&("2".to_string(), "B-SEASON".to_string())));
         assert!(bracket_sxe.contains(&("01".to_string(), "B-EPISODE".to_string())));
+        let bocchi_sxe =
+            labels_for("Bocchi the Rock! 孤獨搖滾！S01E12「早起的日頭光照佇你的身上」");
+        assert!(bocchi_sxe.contains(&("01".to_string(), "B-SEASON".to_string())));
+        assert!(bocchi_sxe.contains(&("12".to_string(), "B-EPISODE".to_string())));
+        assert!(!bocchi_sxe.contains(&("S01E12".to_string(), "O".to_string())));
+        let sxe_range = labels_for(
+            "【CXRAW】【TMNT 2012 TV series】【S5E12-S5E14】【Wanted：Bebop & Rocksteady】【DVDrip】【480p】【AVC Hi10P AAC MP4】",
+        );
+        assert!(sxe_range.contains(&("5".to_string(), "B-SEASON".to_string())));
+        assert!(sxe_range.contains(&("12".to_string(), "B-EPISODE".to_string())));
+        assert!(sxe_range.contains(&("14".to_string(), "B-EPISODE".to_string())));
+        let episode_version_title = labels_for("[DHR][Dumbbell[10v2][BIG5][720P][AVC_AAC]");
+        assert!(episode_version_title.contains(&("10v2".to_string(), "B-EPISODE".to_string())));
+        assert!(!episode_version_title.contains(&("10v2".to_string(), "B-TITLE".to_string())));
+        let episode_version_lang =
+            labels_for("[GalaxyRailroad-888] Yu-Gi-Oh! GO RUSH !! [043v2_GB]");
+        assert!(
+            episode_version_lang.contains(&("043v2".to_string(), "B-EPISODE".to_string()))
+        );
+        assert!(episode_version_lang.contains(&("GB".to_string(), "B-SOURCE".to_string())));
         let cursed = labels_for("[Coalgirls]_C3-Cube_x_Cursed_x_Curious_01_[8E416230]");
         assert!(cursed.contains(&("x".to_string(), "B-TITLE".to_string())));
             "[4K_SDR][DBD-Raws&HKG瀛楀箷绲刔[鏃ュ湪鏍″湌][01][2160P]"
         ));
         assert!(has_encoding_noise("ATRI -My Dear Moments-/娆″洖浜堝憡 EP01 Log01"));
+        assert!(has_encoding_noise(
+            "[2002-2003] Mew Mew_鏉变含鍠靛柕(鏉变含銉熴儱銈︺儫銉ャ偊)_TV"
+        ));
+        assert!(has_encoding_noise("[DAY][Megami no Caf茅 Terrace][01]"));
+        assert!(has_encoding_noise(
+            "[4K_SDR][DBD-Raws][瀵掕潐楦ｆ常涔嬫椂 涓歖[NCED1]"
+        ));
         assert!(has_non_anime_noise(
             "13-[旅游番][花丸字幕组][日本不思议铁路之旅][15.03.19-16.02.03][720&1080][中日双语]/铁道旅 15.03.19 720"
         ));
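
The diff calls `split_sxe_token` from `repair_compact_sxe_tokens` but does not show its body. The sketch below illustrates the behavior the new tests and the labeling document imply for a single compact token such as `S01E12`: the numeric values carry `SEASON` and `EPISODE`, and the `S`/`E` markers stay `O`. The name `split_sxe` and the hand-rolled parsing are assumptions for illustration; the real function may use regexes and also handles ranges such as `S5E12-S5E14`, which this sketch deliberately rejects.

```rust
/// Split a compact `SxxEyy` token into marker and number pieces with labels.
/// Hypothetical stand-in for `split_sxe_token`; single tokens only, no ranges.
fn split_sxe(token: &str) -> Option<(Vec<String>, Vec<String>)> {
    // The token must start with an ASCII `S`/`s` marker.
    let rest = token.strip_prefix('S').or_else(|| token.strip_prefix('s'))?;
    // The first `E`/`e` separates the season digits from the episode digits.
    let e_pos = rest.find(|c: char| c == 'E' || c == 'e')?;
    let (season, tail) = rest.split_at(e_pos);
    let episode = &tail[1..];
    let all_digits = |s: &str| !s.is_empty() && s.chars().all(|c| c.is_ascii_digit());
    if !all_digits(season) || !all_digits(episode) {
        return None;
    }
    let pieces = vec![
        token[..1].to_string(), // season marker, kept as typed
        season.to_string(),
        tail[..1].to_string(), // episode marker, kept as typed
        episode.to_string(),
    ];
    let labels = ["O", "B-SEASON", "O", "B-EPISODE"]
        .iter()
        .map(|label| label.to_string())
        .collect();
    Some((pieces, labels))
}

fn main() {
    for token in ["S01E12", "s2e01", "Season", "S5E12-S5E14"] {
        println!("{token} -> {:?}", split_sxe(token));
    }
}
```

Returning `None` for anything that is not a clean match mirrors how `repair_compact_sxe_tokens` falls back to the original token and label when the split does not apply.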