ModerRAS/AniFileBERT
Document and tighten DMHY template labels
- docs/dmhy_template_labeling.md +103 -0
- tools/rust_dmhy_template_apply/src/main.rs +199 -10
docs/dmhy_template_labeling.md
ADDED
|
@@ -0,0 +1,103 @@
+# DMHY Template Labeling Requirements
+
+This document records the current labeling contract for the DMHY template
+metadata workflow. It is intentionally stricter than the old weak-label export:
+precision is preferred over coverage, especially for `TITLE`, `EPISODE`, and
+`SEASON`.
+
+## Source And Pipeline
+
+- Source snapshot: `datasets/AnimeName/dmhy_list.jsonl`.
+- Optional original source: `D:\WorkSpace\Python\dmhy-parser\dmhy_anime.db`.
+- Template recipe generation and application live in
+  `tools/rust_dmhy_template_apply`.
+- Generated training JSONL rows must contain at least `filename`, `tokens`,
+  `labels`, `template_id`, and `template`.
+- Reports and intermediate audits belong under `reports/`; they are diagnostic
+  artifacts, not authoritative dataset files.
+
+## Critical Label Semantics
+
+- `TITLE` is the anime/work title. It must be one contiguous span whenever a
+  single title is being emitted.
+- `EPISODE` is the episode number or explicit episode marker span.
+- `SEASON` is the season/cour/part marker when the filename explicitly encodes
+  season structure, such as `S2`, `2nd Season`, `Second Season`, `第2季`, or
+  `Part 5` in series-part naming.
+- `GROUP` is a release group or subtitle group, not the title.
+- `SOURCE` covers media/source/codec/language/platform-ish release metadata
+  such as `BDRip`, `WEB-DL`, `HEVC`, `AAC`, `Baha`, `CR`, `CHS`, `CHT`, `GB`,
+  and `BIG5`.
+- `RESOLUTION` covers explicit resolution values such as `720P`, `1080p`, and
+  `1920x1080`.
+- `SPECIAL` covers non-episode extras such as `NCOP`, `NCED`, `PV`, `CM`,
+  `Menu`, `Trailer`, `Creditless ED`, and movie/special numbering when it is not
+  an episodic number.
+- Hash-like suffixes are retained as text in source filenames when useful, but
+  they must not become entity labels in generated training data.
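The hash rule above can be sketched as a small check. This is a minimal illustration, not the tool's implementation: the tool's own `HASH_RE` pattern does not appear in this diff, so `is_hash_like` and `hash_labeled` are hypothetical stand-ins that assume a hash-like suffix is exactly 8 hex digits (CRC32 style), optionally wrapped in square brackets.

```rust
/// Hypothetical stand-in for the tool's `HASH_RE` (pattern assumed, not taken
/// from the source): a token is hash-like when, after stripping one pair of
/// square brackets, it is exactly 8 ASCII hex digits.
fn is_hash_like(token: &str) -> bool {
    let inner = token
        .strip_prefix('[')
        .and_then(|rest| rest.strip_suffix(']'))
        .unwrap_or(token);
    inner.len() == 8 && inner.chars().all(|ch| ch.is_ascii_hexdigit())
}

/// Returns true when any hash-like token carries a label other than `O`,
/// which is the condition the labeling contract forbids.
fn hash_labeled(tokens: &[&str], labels: &[&str]) -> bool {
    tokens
        .iter()
        .zip(labels.iter())
        .any(|(token, label)| is_hash_like(token) && *label != "O")
}
```

Under this assumed pattern an 8-digit date such as `20240101` would also count as hash-like, so a real check would likely need more context than the token alone.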
+
+## Title Rules
+
+- Avoid duplicate titles. If the leaf filename already carries a complete title,
+  season, and episode structure, drop redundant parent directory titles.
+- If precision is uncertain, prefer skipping the row/template over producing a
+  duplicated or discontinuous `TITLE`.
+- A title may contain punctuation or symbols. Internal title joiners must stay
+  inside the title span, including common ASCII separators and known Unicode
+  title punctuation such as `‐`, `–`, `—`, `$`, `∽`, `꞉`, and `♥`.
+- Multiple title candidates in one filename should be handled explicitly:
+  bilingual title aliases and special-program titles are allowed in rich review
+  metadata, but the final weak training row should not emit arbitrary
+  non-contiguous titles unless that structure has been reviewed.
+- Generic prefixes such as `TV`, `TVアニメ`, or `アニメ` are not title when a real
+  title follows.
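The contiguity requirement above amounts to counting runs of `TITLE`-labeled tokens. The sketch below is an illustration under stated assumptions: `label_entity` mirrors the helper of the same name added in this commit, while `count_entity_runs` is a hypothetical name introduced here. A row with zero runs would correspond to `no_title`, and more than one run to `multiple_title_spans`.

```rust
/// Strips a BIO prefix (`B-` or `I-`) and returns the entity name, if any.
fn label_entity(label: &str) -> Option<&str> {
    label.strip_prefix("B-").or_else(|| label.strip_prefix("I-"))
}

/// Counts maximal runs of consecutive tokens labeled with `target`.
/// Adjacent `B-X` tokens are merged into one run, so only a gap (a token
/// with another label) starts a new span.
fn count_entity_runs(labels: &[&str], target: &str) -> usize {
    let mut runs = 0;
    let mut inside = false;
    for &label in labels {
        let hit = label_entity(label) == Some(target);
        if hit && !inside {
            runs += 1;
        }
        inside = hit;
    }
    runs
}
```

Merging adjacent `B-TITLE` tokens is a design choice made for this sketch; a stricter BIO reading would treat every `B-` tag as the start of a new span.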
+
+## Episode And Season Rules
+
+- `TITLE`, `EPISODE`, and `SEASON` are the highest-risk labels; errors here have
+  higher training cost than dropping a row.
+- `SxxExx` means season plus episode. `S` identifies season and `E` identifies
+  episode. If the tokenizer keeps `S01E02` as one compact token, project it to
+  season and episode components during normalization; if split into marker and
+  number tokens, the numeric value must carry `SEASON`/`EPISODE` and the marker
+  may remain structural `O`.
+- `01v2` means episode `01` version `2`; the episode value must not be treated
+  as title.
+- Episode ranges such as `01-13`, `#1-3`, and CJK forms like `第10話` should
+  remain episode spans.
+- Decimal episode-like values such as `14.5` may be valid recap or midpoint
+  episodes and should not be discarded only because they contain a decimal point.
+- Title-internal numbers stay in `TITLE` when they are part of the work name,
+  such as `Eien no 831`, `Zom 100`, or movie titles like `Movie 27 The
+  Million-Dollar Pentagram`.
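The `SxxExx` and `01v2` rules above can be illustrated with two small std-only parsers. This is a sketch, not the tool's `split_sxe_token` (whose body is not shown in this diff); `split_compact_sxe` and `parse_episode_version` are hypothetical names, and the sketch deliberately rejects ranges such as `S5E12-S5E14`, which the real tool handles.

```rust
/// Splits a compact `SxxExx` token into (piece, label) pairs: the `S` and `E`
/// markers stay structural `O`, the numbers carry `B-SEASON` / `B-EPISODE`.
/// Returns `None` when the token does not have exactly that shape.
fn split_compact_sxe(token: &str) -> Option<Vec<(String, String)>> {
    let rest = token.strip_prefix('S').or_else(|| token.strip_prefix('s'))?;
    let e_pos = rest.find(|ch: char| ch == 'E' || ch == 'e')?;
    let (season, tail) = rest.split_at(e_pos);
    let episode = &tail[1..]; // `tail` starts with the ASCII marker `E`/`e`
    let all_digits = |s: &str| !s.is_empty() && s.chars().all(|ch| ch.is_ascii_digit());
    if !all_digits(season) || !all_digits(episode) {
        return None;
    }
    Some(vec![
        (token[..1].to_string(), "O".to_string()),
        (season.to_string(), "B-SEASON".to_string()),
        (tail[..1].to_string(), "O".to_string()),
        (episode.to_string(), "B-EPISODE".to_string()),
    ])
}

/// Parses an episode-with-version token such as `01v2` into (episode, version).
fn parse_episode_version(token: &str) -> Option<(&str, &str)> {
    let v_pos = token.find(|ch: char| ch == 'v' || ch == 'V')?;
    let (episode, tail) = token.split_at(v_pos);
    let version = &tail[1..];
    let all_digits = |s: &str| !s.is_empty() && s.chars().all(|ch| ch.is_ascii_digit());
    if all_digits(episode) && all_digits(version) {
        Some((episode, version))
    } else {
        None
    }
}
```

Note that the tests in this commit label the whole `10v2` token as `B-EPISODE`; the second parser only shows how the episode value can be told apart from the version suffix.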
+
+## Path And Noise Rules
+
+- BDMV expanded paths such as `BDMV/STREAM/00006` are not useful training
+  filenames and should be skipped.
+- Non-anime or abstract path data, including obvious `MTV` paths and tourism /
+  railway program dumps, should be skipped.
+- Mojibake and encoding-noise rows should be skipped unless explicitly kept for
+  diagnosis.
+- Jellyfin-like paths (`Title/Season 1/E07 - Full Title ...`) are valid, but
+  the output should avoid duplicate title spans.
+- Parent directory context is allowed only when the leaf filename is too weak to
+  identify the title; otherwise the leaf filename should dominate.
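Two of the path rules above lend themselves to a short sketch: detecting expanded BDMV paths and isolating the leaf filename so it can dominate over parent directories. Both function names are hypothetical, and the substring test is an assumed heuristic rather than the tool's actual filter.

```rust
/// Returns true for expanded Blu-ray paths such as `BDMV/STREAM/00006`.
/// Separators are normalized and case is ignored (assumed heuristic).
fn is_bdmv_stream_path(path: &str) -> bool {
    let normalized = path.replace('\\', "/").to_ascii_uppercase();
    normalized.contains("BDMV/STREAM/")
}

/// Returns the last path component, accepting both `/` and `\` separators.
/// A string without separators is returned unchanged.
fn leaf_name(path: &str) -> &str {
    path.rsplit(|ch: char| ch == '/' || ch == '\\')
        .next()
        .unwrap_or(path)
}
```

A caller could skip any row where `is_bdmv_stream_path` is true and otherwise label `leaf_name(path)` first, falling back to parent directories only when the leaf is too weak to identify a title.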
+
+## Review Strategy
+
+- High-frequency templates affect training most and must be sampled more heavily.
+- Low-frequency templates are gated conservatively; ambiguous cases are sent to
+  review instead of generated training data.
+- Middle-frequency templates should be audited by sampling a few examples from
+  every template class, then grouping failures by rule rather than patching
+  single examples blindly.
+- A template can enter the generated training set only when its `TITLE`,
+  `EPISODE`, and `SEASON` behavior is defensible across sampled rows.
+
+## Character Dataset Projection
+
+- Regex-token JSONL is converted to character JSONL by projecting BIO labels:
+  first character keeps `B-X`, later characters become `I-X`; `O` remains `O`.
+- Punctuation tokens must remain independently represented before character
+  projection so the model can learn filename structure boundaries.
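The projection rule above can be sketched in a few lines. `project_to_chars` is a hypothetical name, and the sketch assumes every label is either `O` or a `B-`/`I-` prefixed entity; anything else is treated as `O`.

```rust
/// Projects token-level BIO labels to one label per character.
/// For a `B-X` token the first character keeps `B-X` and the rest become
/// `I-X`; an `I-X` token yields `I-X` for every character; `O` stays `O`.
fn project_to_chars(tokens: &[&str], labels: &[&str]) -> Vec<(char, String)> {
    let mut out = Vec::new();
    for (token, label) in tokens.iter().zip(labels.iter()) {
        let entity = label
            .strip_prefix("B-")
            .or_else(|| label.strip_prefix("I-"));
        for (index, ch) in token.chars().enumerate() {
            let char_label = match entity {
                None => "O".to_string(),
                Some(name) if index == 0 && label.starts_with("B-") => {
                    format!("B-{}", name)
                }
                Some(name) => format!("I-{}", name),
            };
            out.push((ch, char_label));
        }
    }
    out
}
```

Because punctuation tokens are kept as separate `O` tokens before projection, their characters come out as `O` and mark the boundaries between entities.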
|
tools/rust_dmhy_template_apply/src/main.rs
CHANGED
|
@@ -393,10 +393,18 @@ fn main() -> Result<()> {
|
|
| 393 |
"low_frequency_audit_max_count": args.audit_max_count,
|
| 394 |
"low_frequency_blocking_warnings": [
|
| 395 |
"ambiguous_no_episode_title",
|
| 396 |
"hash_labeled",
|
| 397 |
"multiple_title_spans",
|
| 398 |
"no_title",
|
| 399 |
-
"path_retained"
|
| 400 |
],
|
| 401 |
"expand": args.expand,
|
| 402 |
"sample_per_template": if args.expand == "sample" { Some(args.sample_per_template) } else { None },
|
|
@@ -856,10 +864,18 @@ fn run_verify_generated_output(args: &Args) -> Result<()> {
|
|
| 856 |
if !matches!(
|
| 857 |
warning.as_str(),
|
| 858 |
"ambiguous_no_episode_title"
|
| 859 |
| "hash_labeled"
|
| 860 |
| "multiple_title_spans"
|
| 861 |
| "no_title"
|
| 862 |
| "path_retained"
|
| 863 |
) {
|
| 864 |
continue;
|
| 865 |
}
|
|
@@ -1123,18 +1139,30 @@ fn entity_spans(tokens: &[String], labels: &[String]) -> Vec<Value> {
|
|
| 1123 |
|
| 1124 |
fn audit_warnings(record: &Record) -> Vec<String> {
|
| 1125 |
let mut warnings = Vec::new();
|
| 1126 |
-
let
|
| 1127 |
-
|
| 1128 |
-
.filter(|span| span.get("label").and_then(Value::as_str) == Some("TITLE"))
|
| 1129 |
-
.count();
|
| 1130 |
if title_spans == 0 {
|
| 1131 |
warnings.push("no_title".to_string());
|
| 1132 |
} else if title_spans > 1 {
|
| 1133 |
warnings.push("multiple_title_spans".to_string());
|
| 1134 |
}
|
| 1135 |
let has_episode = record.labels.iter().any(|label| label.ends_with("EPISODE"));
|
|
|
|
|
|
|
| 1136 |
if !has_episode {
|
| 1137 |
warnings.push("no_episode".to_string());
|
|
|
|
|
|
|
|
|
|
| 1138 |
if record
|
| 1139 |
.dropped_title_candidate_positions
|
| 1140 |
.as_ref()
|
|
@@ -1143,20 +1171,112 @@ fn audit_warnings(record: &Record) -> Vec<String> {
|
|
| 1143 |
warnings.push("ambiguous_no_episode_title".to_string());
|
| 1144 |
}
|
| 1145 |
}
|
|
|
|
|
|
|
|
|
|
| 1146 |
if record.filename.contains('/') || record.filename.contains('\\') {
|
| 1147 |
warnings.push("path_retained".to_string());
|
| 1148 |
}
|
| 1149 |
for (index, token) in record.tokens.iter().enumerate() {
|
|
|
|
|
|
|
| 1150 |
if HASH_RE.is_match(token) && record.labels.get(index).is_some_and(|label| label != "O") {
|
| 1151 |
warnings.push("hash_labeled".to_string());
|
| 1152 |
break;
|
| 1153 |
}
|
| 1154 |
}
|
| 1155 |
warnings.sort();
|
| 1156 |
warnings.dedup();
|
| 1157 |
warnings
|
| 1158 |
}
|
| 1159 |
|
|
| 1160 |
fn warning_counts(rows: &[Value]) -> HashMap<String, usize> {
|
| 1161 |
let mut counts = HashMap::new();
|
| 1162 |
for row in rows {
|
|
@@ -1223,10 +1343,7 @@ fn process_filename(
|
|
| 1223 |
}
|
| 1224 |
};
|
| 1225 |
let warnings = audit_warnings(&record);
|
| 1226 |
-
if warnings.iter().any(|warning| warning == "no_title")
|
| 1227 |
-
|| (recipe.count.unwrap_or(0) <= args.audit_max_count
|
| 1228 |
-
&& has_blocking_warnings(&warnings))
|
| 1229 |
-
{
|
| 1230 |
return Processed::Skipped {
|
| 1231 |
reason: "low_frequency_audit_warning",
|
| 1232 |
trimmed_parent,
|
|
@@ -1251,10 +1368,18 @@ fn has_blocking_warnings(warnings: &[String]) -> bool {
|
|
| 1251 |
matches!(
|
| 1252 |
warning.as_str(),
|
| 1253 |
"ambiguous_no_episode_title"
|
| 1254 |
| "hash_labeled"
|
| 1255 |
| "multiple_title_spans"
|
| 1256 |
| "no_title"
|
| 1257 |
| "path_retained"
|
| 1258 |
)
|
| 1259 |
})
|
| 1260 |
}
|
|
@@ -1773,6 +1898,10 @@ fn has_encoding_noise(value: &str) -> bool {
|
|
| 1773 |
"譁", "蜈", "螟", "蟄", "謇", "邱", "荳", "縺", "繧", "莨", "鬆", "髯", "瀛",
|
| 1774 |
"楀", "箷", "绲", "刔", "鏃", "湪", "鏍", "犲", "儚", "鐗", "吀", "铦", "躲",
|
| 1775 |
"伄", "椋", "伓", "姘", "帽", "娆", "洖", "浜", "堝", "澶", "湴", "鐒",
|
| 1776 |
];
|
| 1777 |
let marker_hits = markers
|
| 1778 |
.iter()
|
|
@@ -1783,7 +1912,8 @@ fn has_encoding_noise(value: &str) -> bool {
|
|
| 1783 |
.filter(|ch| ('\u{ff61}'..='\u{ff9f}').contains(ch))
|
| 1784 |
.count();
|
| 1785 |
let latin_mojibake = value.split_whitespace().any(|part| {
|
| 1786 |
-
part.
|
|
|
|
| 1787 |
});
|
| 1788 |
marker_hits >= 2 || (marker_hits >= 1 && halfwidth_hits >= 1) || latin_mojibake
|
| 1789 |
}
|
|
@@ -2075,6 +2205,9 @@ fn label_for_refined_piece(piece: &str, role: &str, token_class: &str) -> String
|
|
| 2075 |
return "O".to_string();
|
| 2076 |
}
|
| 2077 |
if role == "SOURCE" || matches!(token_class, "BRACKET_MEDIA_BLOCK" | "MEDIA_BLOCK") {
|
|
|
|
|
|
|
|
|
|
| 2078 |
if atom_class == "RESOLUTION" {
|
| 2079 |
return "B-RESOLUTION".to_string();
|
| 2080 |
}
|
|
@@ -2129,6 +2262,24 @@ fn split_sxe_token(token: &str) -> Option<(Vec<String>, Vec<String>)> {
|
|
| 2129 |
Some((pieces, labels))
|
| 2130 |
}
|
| 2131 |
|
|
| 2132 |
fn split_episode_token(token: &str) -> Option<(Vec<String>, Vec<String>)> {
|
| 2133 |
if DECIMAL_EPISODE_RE.is_match(token) {
|
| 2134 |
let pieces = split_generated_token(token);
|
|
@@ -3067,6 +3218,16 @@ fn normalize_title_token(token: &str) -> (Vec<String>, Vec<String>) {
|
|
| 3067 |
labels.push("O".to_string());
|
| 3068 |
continue;
|
| 3069 |
}
|
| 3070 |
if CJK_SEASON_TOKEN_RE.is_match(&piece) || SEASON_RE.is_match(&piece) {
|
| 3071 |
output_pieces.push(piece);
|
| 3072 |
labels.push("B-SEASON".to_string());
|
|
@@ -3732,6 +3893,7 @@ fn dmhy_record(filename: &str, template_id: &str, roles: &[String]) -> Option<Re
|
|
| 3732 |
let roles = adjust_contextual_roles(&tokens, &groups, roles);
|
| 3733 |
let (roles, dropped) = enforce_single_title_candidate(&tokens, &groups, &roles);
|
| 3734 |
let (tokens, labels) = project_refined_tokens(&tokens, &groups, &roles);
|
|
|
|
| 3735 |
let labels = smooth_title_spans(&tokens, &labels);
|
| 3736 |
if tokens.len() != labels.len() {
|
| 3737 |
return None;
|
|
@@ -3811,6 +3973,26 @@ mod tests {
|
|
| 3811 |
let bracket_sxe = labels_for("[FLsnow.feat.PO][Himitsu_no_Aipri][1080P][S2E01]");
|
| 3812 |
assert!(bracket_sxe.contains(&("2".to_string(), "B-SEASON".to_string())));
|
| 3813 |
assert!(bracket_sxe.contains(&("01".to_string(), "B-EPISODE".to_string())));
|
| 3814 |
|
| 3815 |
let cursed = labels_for("[Coalgirls]_C3-Cube_x_Cursed_x_Curious_01_[8E416230]");
|
| 3816 |
assert!(cursed.contains(&("x".to_string(), "B-TITLE".to_string())));
|
|
@@ -4135,6 +4317,13 @@ mod tests {
|
|
| 4135 |
"[4K_SDR][DBD-Raws&HKG瀛楀箷绲刔[鏃ュ湪鏍″湌][01][2160P]"
|
| 4136 |
));
|
| 4137 |
assert!(has_encoding_noise("ATRI -My Dear Moments-/娆″洖浜堝憡 EP01 Log01"));
|
| 4138 |
assert!(has_non_anime_noise(
|
| 4139 |
"13-[旅游番][花丸字幕组][日本不思议铁路之旅][15.03.19-16.02.03][720&1080][中日双语]/铁道旅 15.03.19 720"
|
| 4140 |
));
|
|
|
|
| 393 |
"low_frequency_audit_max_count": args.audit_max_count,
|
| 394 |
"low_frequency_blocking_warnings": [
|
| 395 |
"ambiguous_no_episode_title",
|
| 396 |
+
"encoding_noise_survived",
|
| 397 |
+
"episode_version_missing_label",
|
| 398 |
+
"episode_in_title",
|
| 399 |
+
"generic_title_only",
|
| 400 |
"hash_labeled",
|
| 401 |
"multiple_title_spans",
|
| 402 |
"no_title",
|
| 403 |
+
"path_retained",
|
| 404 |
+
"sxe_compact_unexpanded",
|
| 405 |
+
"tech_in_title",
|
| 406 |
+
"template_episode_missing_label",
|
| 407 |
+
"template_sxe_missing_label"
|
| 408 |
],
|
| 409 |
"expand": args.expand,
|
| 410 |
"sample_per_template": if args.expand == "sample" { Some(args.sample_per_template) } else { None },
|
|
|
|
| 864 |
if !matches!(
|
| 865 |
warning.as_str(),
|
| 866 |
"ambiguous_no_episode_title"
|
| 867 |
+
| "encoding_noise_survived"
|
| 868 |
+
| "episode_version_missing_label"
|
| 869 |
+
| "episode_in_title"
|
| 870 |
+
| "generic_title_only"
|
| 871 |
| "hash_labeled"
|
| 872 |
| "multiple_title_spans"
|
| 873 |
| "no_title"
|
| 874 |
| "path_retained"
|
| 875 |
+
| "sxe_compact_unexpanded"
|
| 876 |
+
| "tech_in_title"
|
| 877 |
+
| "template_episode_missing_label"
|
| 878 |
+
| "template_sxe_missing_label"
|
| 879 |
) {
|
| 880 |
continue;
|
| 881 |
}
|
|
|
|
| 1139 |
|
| 1140 |
fn audit_warnings(record: &Record) -> Vec<String> {
|
| 1141 |
let mut warnings = Vec::new();
|
| 1142 |
+
let title_texts = entity_texts(&record.tokens, &record.labels, "TITLE");
|
| 1143 |
+
let title_spans = title_texts.len();
|
|
|
|
|
|
|
| 1144 |
if title_spans == 0 {
|
| 1145 |
warnings.push("no_title".to_string());
|
| 1146 |
} else if title_spans > 1 {
|
| 1147 |
warnings.push("multiple_title_spans".to_string());
|
| 1148 |
}
|
| 1149 |
+
if !title_texts.is_empty() && title_texts.iter().all(|title| generic_title_text(title)) {
|
| 1150 |
+
warnings.push("generic_title_only".to_string());
|
| 1151 |
+
}
|
| 1152 |
+
if title_texts.iter().any(|title| technical_title_text(title)) {
|
| 1153 |
+
warnings.push("tech_in_title".to_string());
|
| 1154 |
+
}
|
| 1155 |
+
if title_texts.iter().any(|title| episodeish_title_text(title)) {
|
| 1156 |
+
warnings.push("episode_in_title".to_string());
|
| 1157 |
+
}
|
| 1158 |
let has_episode = record.labels.iter().any(|label| label.ends_with("EPISODE"));
|
| 1159 |
+
let has_season = record.labels.iter().any(|label| label.ends_with("SEASON"));
|
| 1160 |
+
let has_special = record.labels.iter().any(|label| label.ends_with("SPECIAL"));
|
| 1161 |
if !has_episode {
|
| 1162 |
warnings.push("no_episode".to_string());
|
| 1163 |
+
if record.template.contains("EPISODE") && !has_special {
|
| 1164 |
+
warnings.push("template_episode_missing_label".to_string());
|
| 1165 |
+
}
|
| 1166 |
if record
|
| 1167 |
.dropped_title_candidate_positions
|
| 1168 |
.as_ref()
|
|
|
|
| 1171 |
warnings.push("ambiguous_no_episode_title".to_string());
|
| 1172 |
}
|
| 1173 |
}
|
| 1174 |
+
if record.template.contains("SXE") && (!has_season || !has_episode) {
|
| 1175 |
+
warnings.push("template_sxe_missing_label".to_string());
|
| 1176 |
+
}
|
| 1177 |
if record.filename.contains('/') || record.filename.contains('\\') {
|
| 1178 |
warnings.push("path_retained".to_string());
|
| 1179 |
}
|
| 1180 |
+
if has_encoding_noise(&record.filename)
|
| 1181 |
+
|| record
|
| 1182 |
+
.source_filename
|
| 1183 |
+
.as_ref()
|
| 1184 |
+
.is_some_and(|source| has_encoding_noise(source))
|
| 1185 |
+
{
|
| 1186 |
+
warnings.push("encoding_noise_survived".to_string());
|
| 1187 |
+
}
|
| 1188 |
for (index, token) in record.tokens.iter().enumerate() {
|
| 1189 |
+
let entity = record.labels.get(index).and_then(|label| label_entity(label));
|
| 1190 |
+
let cleaned = strip_wrapper(token);
|
| 1191 |
if HASH_RE.is_match(token) && record.labels.get(index).is_some_and(|label| label != "O") {
|
| 1192 |
warnings.push("hash_labeled".to_string());
|
| 1193 |
break;
|
| 1194 |
}
|
| 1195 |
+
if EPISODE_VERSION_RE.is_match(&compact_for_classify(&cleaned))
|
| 1196 |
+
&& entity != Some("EPISODE")
|
| 1197 |
+
{
|
| 1198 |
+
warnings.push("episode_version_missing_label".to_string());
|
| 1199 |
+
}
|
| 1200 |
+
if SXE_VALUE_RE.is_match(&cleaned) && entity != Some("EPISODE") && entity != Some("SEASON")
|
| 1201 |
+
{
|
| 1202 |
+
warnings.push("sxe_compact_unexpanded".to_string());
|
| 1203 |
+
}
|
| 1204 |
}
|
| 1205 |
warnings.sort();
|
| 1206 |
warnings.dedup();
|
| 1207 |
warnings
|
| 1208 |
}
|
| 1209 |
|
| 1210 |
+
fn label_entity(label: &str) -> Option<&str> {
|
| 1211 |
+
label
|
| 1212 |
+
.strip_prefix("B-")
|
| 1213 |
+
.or_else(|| label.strip_prefix("I-"))
|
| 1214 |
+
}
|
| 1215 |
+
|
| 1216 |
+
fn entity_texts(tokens: &[String], labels: &[String], target: &str) -> Vec<String> {
|
| 1217 |
+
let mut spans = Vec::new();
|
| 1218 |
+
let mut current = String::new();
|
| 1219 |
+
for (token, label) in tokens.iter().zip(labels.iter()) {
|
| 1220 |
+
let entity = label_entity(label);
|
| 1221 |
+
if entity == Some(target) {
|
| 1222 |
+
current.push_str(token);
|
| 1223 |
+
} else if !current.trim().is_empty() {
|
| 1224 |
+
spans.push(current.trim().to_string());
|
| 1225 |
+
current.clear();
|
| 1226 |
+
} else {
|
| 1227 |
+
current.clear();
|
| 1228 |
+
}
|
| 1229 |
+
}
|
| 1230 |
+
if !current.trim().is_empty() {
|
| 1231 |
+
spans.push(current.trim().to_string());
|
| 1232 |
+
}
|
| 1233 |
+
spans
|
| 1234 |
+
}
|
| 1235 |
+
|
| 1236 |
+
fn generic_title_text(text: &str) -> bool {
|
| 1237 |
+
matches!(
|
| 1238 |
+
text.trim().to_ascii_lowercase().as_str(),
|
| 1239 |
+
"tv"
|
| 1240 |
+
| "movie"
|
| 1241 |
+
| "mov"
|
| 1242 |
+
| "sample"
|
| 1243 |
+
| "commercial"
|
| 1244 |
+
| "commercials"
|
| 1245 |
+
| "cm"
|
| 1246 |
+
| "pv"
|
| 1247 |
+
| "op"
|
| 1248 |
+
| "ed"
|
| 1249 |
+
| "ncop"
|
| 1250 |
+
| "nced"
|
| 1251 |
+
| "menu"
|
| 1252 |
+
| "trailer"
|
| 1253 |
+
| "spot"
|
| 1254 |
+
| "bdmv"
|
| 1255 |
+
| "stream"
|
| 1256 |
+
)
|
| 1257 |
+
}
|
| 1258 |
+
|
| 1259 |
+
fn technical_title_text(text: &str) -> bool {
|
| 1260 |
+
let normalized = text.to_ascii_lowercase();
|
| 1261 |
+
normalized.contains("bdrip")
|
| 1262 |
+
|| normalized.contains("webrip")
|
| 1263 |
+
|| normalized.contains("web-dl")
|
| 1264 |
+
|| normalized.contains("hevc")
|
| 1265 |
+
|| normalized.contains("x264")
|
| 1266 |
+
|| normalized.contains("x265")
|
| 1267 |
+
|| normalized.contains("aac")
|
| 1268 |
+
|| normalized.contains("flac")
|
| 1269 |
+
|| normalized.contains("sourceunknown")
|
| 1270 |
+
}
|
| 1271 |
+
|
| 1272 |
+
fn episodeish_title_text(text: &str) -> bool {
|
| 1273 |
+
let trimmed = text.trim();
|
| 1274 |
+
EPISODE_VALUE_RE.is_match(trimmed)
|
| 1275 |
+
|| EPISODE_CJK_RE.is_match(trimmed)
|
| 1276 |
+
|| EPISODE_RANGE_RE.is_match(trimmed)
|
| 1277 |
+
|| trimmed.chars().all(|ch| ch.is_ascii_digit())
|
| 1278 |
+
}
|
| 1279 |
+
|
| 1280 |
fn warning_counts(rows: &[Value]) -> HashMap<String, usize> {
|
| 1281 |
let mut counts = HashMap::new();
|
| 1282 |
for row in rows {
|
|
|
|
| 1343 |
}
|
| 1344 |
};
|
| 1345 |
let warnings = audit_warnings(&record);
|
| 1346 |
+
if warnings.iter().any(|warning| warning == "no_title") || has_blocking_warnings(&warnings) {
|
|
|
|
|
|
|
|
|
|
| 1347 |
return Processed::Skipped {
|
| 1348 |
reason: "low_frequency_audit_warning",
|
| 1349 |
trimmed_parent,
|
|
|
|
| 1368 |
matches!(
|
| 1369 |
warning.as_str(),
|
| 1370 |
"ambiguous_no_episode_title"
|
| 1371 |
+
| "encoding_noise_survived"
|
| 1372 |
+
| "episode_version_missing_label"
|
| 1373 |
+
| "episode_in_title"
|
| 1374 |
+
| "generic_title_only"
|
| 1375 |
| "hash_labeled"
|
| 1376 |
| "multiple_title_spans"
|
| 1377 |
| "no_title"
|
| 1378 |
| "path_retained"
|
| 1379 |
+
| "sxe_compact_unexpanded"
|
| 1380 |
+
| "tech_in_title"
|
| 1381 |
+
| "template_episode_missing_label"
|
| 1382 |
+
| "template_sxe_missing_label"
|
| 1383 |
)
|
| 1384 |
})
|
| 1385 |
}
|
|
|
|
| 1898 |
"譁", "蜈", "螟", "蟄", "謇", "邱", "荳", "縺", "繧", "莨", "鬆", "髯", "瀛",
|
| 1899 |
"楀", "箷", "绲", "刔", "鏃", "湪", "鏍", "犲", "儚", "鐗", "吀", "铦", "躲",
|
| 1900 |
"伄", "椋", "伓", "姘", "帽", "娆", "洖", "浜", "堝", "澶", "湴", "鐒",
|
| 1901 |
+
"銇", "銈", "銉", "偅", "偗", "儱", "儫", "兗", "仧", "鏉变", "鍠靛",
|
| 1902 |
+
"銉熴", "銈︺", "瀵掕", "潐楦", "常涔", "涓歖", "缁堟", "湯鍒",
|
| 1903 |
+
"瀵诲", "線浣", "曟柟", "瓒呴", "绁炪", "偘銉", "兇銈", "銉砡",
|
| 1904 |
+
"銉砕", "杩风", "硦澶", "銇淬", "仧銉", "銉嗐", "偅銈", "銈躲",
|
| 1905 |
];
|
| 1906 |
let marker_hits = markers
|
| 1907 |
.iter()
|
|
|
|
| 1912 |
.filter(|ch| ('\u{ff61}'..='\u{ff9f}').contains(ch))
|
| 1913 |
.count();
|
| 1914 |
let latin_mojibake = value.split_whitespace().any(|part| {
|
| 1915 |
+
part.chars().any(|ch| matches!(ch, '帽' | '茅' | '脳' | '锛'))
|
| 1916 |
+
&& part.chars().any(|ch| ch.is_ascii_alphabetic())
|
| 1917 |
});
|
| 1918 |
marker_hits >= 2 || (marker_hits >= 1 && halfwidth_hits >= 1) || latin_mojibake
|
| 1919 |
}
|
|
|
|
| 2205 |
return "O".to_string();
|
| 2206 |
}
|
| 2207 |
if role == "SOURCE" || matches!(token_class, "BRACKET_MEDIA_BLOCK" | "MEDIA_BLOCK") {
|
| 2208 |
+
if atom_class == "EPISODE_VERSION" {
|
| 2209 |
+
return "B-EPISODE".to_string();
|
| 2210 |
+
}
|
| 2211 |
if atom_class == "RESOLUTION" {
|
| 2212 |
return "B-RESOLUTION".to_string();
|
| 2213 |
}
|
|
|
|
| 2262 |
Some((pieces, labels))
|
| 2263 |
}
|
| 2264 |
|
| 2265 |
+
fn repair_compact_sxe_tokens(
|
| 2266 |
+
tokens: Vec<String>,
|
| 2267 |
+
labels: Vec<String>,
|
| 2268 |
+
) -> (Vec<String>, Vec<String>) {
|
| 2269 |
+
let mut output_tokens = Vec::new();
|
| 2270 |
+
let mut output_labels = Vec::new();
|
| 2271 |
+
for (token, label) in tokens.into_iter().zip(labels.into_iter()) {
|
| 2272 |
+
if let Some((pieces, piece_labels)) = split_sxe_token(&token) {
|
| 2273 |
+
output_tokens.extend(pieces);
|
| 2274 |
+
output_labels.extend(piece_labels);
|
| 2275 |
+
} else {
|
| 2276 |
+
output_tokens.push(token);
|
| 2277 |
+
output_labels.push(label);
|
| 2278 |
+
}
|
| 2279 |
+
}
|
| 2280 |
+
(output_tokens, output_labels)
|
| 2281 |
+
}
|
| 2282 |
+
|
| 2283 |
fn split_episode_token(token: &str) -> Option<(Vec<String>, Vec<String>)> {
|
| 2284 |
if DECIMAL_EPISODE_RE.is_match(token) {
|
| 2285 |
let pieces = split_generated_token(token);
|
|
|
|
| 3218 |
labels.push("O".to_string());
|
| 3219 |
continue;
|
| 3220 |
}
|
| 3221 |
+
if let Some((pieces, piece_labels)) = split_sxe_token(&piece) {
|
| 3222 |
+
output_pieces.extend(pieces);
|
| 3223 |
+
labels.extend(piece_labels);
|
| 3224 |
+
continue;
|
| 3225 |
+
}
|
| 3226 |
+
if EPISODE_VERSION_RE.is_match(&compact_for_classify(&piece)) {
|
| 3227 |
+
output_pieces.push(piece);
|
| 3228 |
+
labels.push("B-EPISODE".to_string());
|
| 3229 |
+
continue;
|
| 3230 |
+
}
|
| 3231 |
if CJK_SEASON_TOKEN_RE.is_match(&piece) || SEASON_RE.is_match(&piece) {
|
| 3232 |
output_pieces.push(piece);
|
| 3233 |
labels.push("B-SEASON".to_string());
|
|
|
|
| 3893 |
let roles = adjust_contextual_roles(&tokens, &groups, roles);
|
| 3894 |
let (roles, dropped) = enforce_single_title_candidate(&tokens, &groups, &roles);
|
| 3895 |
let (tokens, labels) = project_refined_tokens(&tokens, &groups, &roles);
|
| 3896 |
+
let (tokens, labels) = repair_compact_sxe_tokens(tokens, labels);
|
| 3897 |
let labels = smooth_title_spans(&tokens, &labels);
|
| 3898 |
if tokens.len() != labels.len() {
|
| 3899 |
return None;
|
|
|
|
| 3973 |
let bracket_sxe = labels_for("[FLsnow.feat.PO][Himitsu_no_Aipri][1080P][S2E01]");
|
| 3974 |
assert!(bracket_sxe.contains(&("2".to_string(), "B-SEASON".to_string())));
|
| 3975 |
assert!(bracket_sxe.contains(&("01".to_string(), "B-EPISODE".to_string())));
|
| 3976 |
+
let bocchi_sxe =
|
| 3977 |
+
labels_for("Bocchi the Rock! 孤獨搖滾!S01E12「早起的日頭光照佇你的身上」");
|
| 3978 |
+
assert!(bocchi_sxe.contains(&("01".to_string(), "B-SEASON".to_string())));
|
| 3979 |
+
assert!(bocchi_sxe.contains(&("12".to_string(), "B-EPISODE".to_string())));
|
| 3980 |
+
assert!(!bocchi_sxe.contains(&("S01E12".to_string(), "O".to_string())));
|
| 3981 |
+
let sxe_range = labels_for(
|
| 3982 |
+
"【CXRAW】【TMNT 2012 TV series】【S5E12-S5E14】【Wanted:Bebop & Rocksteady】【DVDrip】【480p】【AVC Hi10P AAC MP4】",
|
| 3983 |
+
);
|
| 3984 |
+
assert!(sxe_range.contains(&("5".to_string(), "B-SEASON".to_string())));
|
| 3985 |
+
assert!(sxe_range.contains(&("12".to_string(), "B-EPISODE".to_string())));
|
| 3986 |
+
assert!(sxe_range.contains(&("14".to_string(), "B-EPISODE".to_string())));
|
| 3987 |
+
let episode_version_title = labels_for("[DHR][Dumbbell[10v2][BIG5][720P][AVC_AAC]");
|
| 3988 |
+
assert!(episode_version_title.contains(&("10v2".to_string(), "B-EPISODE".to_string())));
|
| 3989 |
+
assert!(!episode_version_title.contains(&("10v2".to_string(), "B-TITLE".to_string())));
|
| 3990 |
+
let episode_version_lang =
|
| 3991 |
+
labels_for("[GalaxyRailroad-888] Yu-Gi-Oh! GO RUSH !! [043v2_GB]");
|
| 3992 |
+
assert!(
|
| 3993 |
+
episode_version_lang.contains(&("043v2".to_string(), "B-EPISODE".to_string()))
|
| 3994 |
+
);
|
| 3995 |
+
assert!(episode_version_lang.contains(&("GB".to_string(), "B-SOURCE".to_string())));
|
| 3996 |
|
| 3997 |
let cursed = labels_for("[Coalgirls]_C3-Cube_x_Cursed_x_Curious_01_[8E416230]");
|
| 3998 |
assert!(cursed.contains(&("x".to_string(), "B-TITLE".to_string())));
|
|
|
|
| 4317 |
"[4K_SDR][DBD-Raws&HKG瀛楀箷绲刔[鏃ュ湪鏍″湌][01][2160P]"
|
| 4318 |
));
|
| 4319 |
assert!(has_encoding_noise("ATRI -My Dear Moments-/娆″洖浜堝憡 EP01 Log01"));
|
| 4320 |
+
assert!(has_encoding_noise(
|
| 4321 |
+
"[2002-2003] Mew Mew_鏉变含鍠靛柕(鏉变含銉熴儱銈︺儫銉ャ偊)_TV"
|
| 4322 |
+
));
|
| 4323 |
+
assert!(has_encoding_noise("[DAY][Megami no Caf茅 Terrace][01]"));
|
| 4324 |
+
assert!(has_encoding_noise(
|
| 4325 |
+
"[4K_SDR][DBD-Raws][瀵掕潐楦f常涔嬫椂 涓歖[NCED1]"
|
| 4326 |
+
));
|
| 4327 |
assert!(has_non_anime_noise(
|
| 4328 |
"13-[旅游番][花丸字幕组][日本不思议铁路之旅][15.03.19-16.02.03][720&1080][中日双语]/铁道旅 15.03.19 720"
|
| 4329 |
));
|