ModerRAS committed on May 19

Commit

e63569d

1 Parent(s): b57780c

Improve anime filename parser model

Files changed (36)

.gitignore +5 -0
MAINTENANCE.md +28 -18
README.md +52 -78
build_repair_focus_dataset.py +151 -0
case_metrics.json +459 -0
config.json +4 -4
data/parser_regression_cases.json +232 -0
dataset.py +144 -4
datasets/AnimeName +1 -1
diagnose_pipeline.py +198 -22
dmhy_dataset.py +433 -44
evaluate_parser_cases.py +163 -0
exports/anime_filename_parser.metadata.json +4 -4
exports/anime_filename_parser.onnx +2 -2
inference.py +342 -35
label_repairs.py +513 -0
model.safetensors +2 -2
model/config.json +0 -64
model/model.safetensors +0 -3
model/tokenizer_config.json +0 -44
model/training_args.bin +0 -3
model/vocab.json +0 -0
parse_eval_metrics.json +595 -0
pyproject.toml +36 -0
relabel_dataset_from_filenames.py +157 -0
repair_dataset_labels.py +103 -0
requirements.txt +12 -10
run_metadata.json +23 -0
tokenizer.py +3 -3
tokenizer_config.json +2 -2
train.py +223 -15
trainer_eval_metrics.json +11 -0
training_args.bin +1 -1
uv.lock +0 -0
vocab.char.json +2 -2
vocab.json +0 -0

.gitignore CHANGED

@@ -1,9 +1,14 @@
 __pycache__/
 *.pyc
 logs/
 checkpoints/
 test_checkpoints*/
 ab_checkpoints*/
 data/**/*.jsonl
 !data/synthetic_small.jsonl
 !data/test_smoke.jsonl

 __pycache__/
 *.pyc
+.venv/
+.pytest_cache/
+.ruff_cache/
 logs/
 checkpoints/
 test_checkpoints*/
 ab_checkpoints*/
+*.log
+*.onnx.data
 data/**/*.jsonl
 !data/synthetic_small.jsonl
 !data/test_smoke.jsonl

MAINTENANCE.md CHANGED

@@ -35,10 +35,9 @@ git submodule update --init --recursive
 Current DMHY snapshot:
 ```text
-last_file_id: 689304
-next_min_id: 689305
-labeled_samples: 263042
-mixed_train_samples: 363042
 ```
 The authoritative dataset files live in `datasets/AnimeName`.
@@ -46,17 +45,21 @@ The authoritative dataset files live in `datasets/AnimeName`.
 ## Train
 ```bash
-python -m pip install -r requirements.txt
-python train.py \
-  --data-file datasets/AnimeName/mixed_train.jsonl \
-  --vocab-file datasets/AnimeName/vocab.json \
-  --save-dir checkpoints/dmhy-finetune \
   --init-model-dir . \
-  --epochs 1 \
-  --batch-size 128 \
-  --learning-rate 0.0003 \
   --warmup-steps 300 \
-  --seed 42
 ```
 ## Publish a New Checkpoint
@@ -64,13 +67,20 @@ python train.py \
 Copy the final checkpoint to the repository root:
 ```powershell
-Copy-Item checkpoints/dmhy-finetune/final/config.json . -Force
-Copy-Item checkpoints/dmhy-finetune/final/model.safetensors . -Force
-Copy-Item checkpoints/dmhy-finetune/final/tokenizer_config.json . -Force
-Copy-Item checkpoints/dmhy-finetune/final/training_args.bin . -Force
-Copy-Item checkpoints/dmhy-finetune/final/vocab.json . -Force
 ```
 Then commit and push:
 ```bash

 Current DMHY snapshot:
 ```text
+labeled_samples: 632002
+char_vocab_size: 6199
+strict_bio_violations: 0
 ```
 The authoritative dataset files live in `datasets/AnimeName`.
 ## Train
 ```bash
+uv sync
+uv run python train.py \
+  --tokenizer char \
+  --data-file datasets/AnimeName/dmhy_weak_char.jsonl \
+  --vocab-file datasets/AnimeName/vocab.char.json \
+  --save-dir checkpoints/dmhy-char-full-relabel \
   --init-model-dir . \
+  --epochs 2 \
+  --batch-size 256 \
+  --learning-rate 0.00008 \
   --warmup-steps 300 \
+  --max-seq-length 128 \
+  --checkpoint-steps 1000 \
+  --parse-eval-limit 2048 \
+  --seed 48
 ```
 ## Publish a New Checkpoint
 Copy the final checkpoint to the repository root:
 ```powershell
+Copy-Item checkpoints/dmhy-char-full-relabel/final/config.json . -Force
+Copy-Item checkpoints/dmhy-char-full-relabel/final/model.safetensors . -Force
+Copy-Item checkpoints/dmhy-char-full-relabel/final/tokenizer_config.json . -Force
+Copy-Item checkpoints/dmhy-char-full-relabel/final/training_args.bin . -Force
+Copy-Item checkpoints/dmhy-char-full-relabel/final/vocab.json . -Force
+Copy-Item datasets/AnimeName/vocab.char.json .\vocab.char.json -Force
+Copy-Item checkpoints/dmhy-char-full-relabel/final/run_metadata.json . -Force
+Copy-Item checkpoints/dmhy-char-full-relabel/final/trainer_eval_metrics.json . -Force
+Copy-Item checkpoints/dmhy-char-full-relabel/final/parse_eval_metrics.json . -Force
 ```
+There is no tracked `model/` duplicate. The root checkpoint is the publishing
+surface; ignored `checkpoints/` directories are training artifacts.
 Then commit and push:
 ```bash

README.md CHANGED

@@ -19,7 +19,7 @@ language:
 AniFileBERT is a tiny BERT token-classification model for parsing anime release filenames into structured fields such as release group, title, season, episode, resolution, source, and special tags.
-The checkpoint in this repository is the DMHY weak-label fine-tuned regex-tokenizer model used by MiruPlay.
 ## Model
@@ -28,9 +28,9 @@ The checkpoint in this repository is the DMHY weak-label fine-tuned regex-tokeni
 - Layers: 4
 - Attention heads: 8
 - Labels: BIO token labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, and `SPECIAL`
-- Tokenizer: custom regex/structure tokenizer implemented in `tokenizer.py`
-- Max sequence length: 64
-- Parameters: about 5M
 The model files are stored at the repository root so `BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")` can load the weights. Use `inference.py` for end-to-end parsing because the tokenizer is custom rather than a standard WordPiece tokenizer.
@@ -47,52 +47,40 @@ Current DMHY export waterline (from `datasets/AnimeName`):
 ## Vocabulary
-The default `vocab.json` contains **8000 tokens** (up from 3000) built from frequency
-analysis of the full 632K DMHY weak-label dataset. Tokens not in the vocabulary
-become `[UNK]`, so larger vocabulary directly improves coverage:
-| Vocab size | Coverage | Model params |
-|------------|----------|-------------|
-| 3000 (old) | 90.4% | ~4.0M |
-| 8000 (current) | 96.2% | ~5.3M |
-Common fansub group names (`Snow`, `LoliHouse`, `DMG`, `KTXP`, `Sakurato`, etc.)
-and individual bracket characters (`[`, `]`, `(`, `)`) are included in the new
-vocabulary.
-For character-token training, `vocab.char.json` is mirrored at the repository
-root for plain `git pull` users and also lives at
-`datasets/AnimeName/vocab.char.json` beside the dataset. It is built from the
-full `dmhy_weak_char.jsonl` export. The full DMHY weak dataset has **6195
-unique characters**, so the complete character vocab is only **6199** entries
-including special tokens and reaches 100% token coverage.
 ## Evaluation
-Balanced mixed-data A/B run (`50K` synthetic + `50K` DMHY weak labels, 1 epoch, batch size 128, seed 42):
-| Variant | Max length | Vocab | Params | Eval F1 | Accuracy | Train runtime |
-|---------|------------|-------|--------|---------|----------|---------------|
-| regex | 64 | 3000 | 3.96M | 0.9911 | 0.9951 | 827s |
-| char | 128 | 2654 | 3.88M | 0.8142 | 0.9637 | 1983s |
-Field-level F1 on the same validation split:
-| Field | regex | char |
-|-------|-------|------|
-| GROUP | 0.9962 | 0.9516 |
-| TITLE | 0.9761 | 0.7983 |
-| SEASON | 0.9880 | 0.6290 |
-| EPISODE | 0.9950 | 0.8082 |
-The regex tokenizer remains the default. Both variants can parse simple `S01E07`, but the character tokenizer was weaker on season/episode boundaries and long title spans.
 ## Usage
 Install dependencies:
 ```bash
-pip install -r requirements.txt
 ```
 Parse a filename with this repository cloned locally:
@@ -121,47 +109,25 @@ git submodule update --init --recursive
 ## Training
-### Prerequisites (Windows / Local GPU)
-PyTorch 2.11+ with CUDA 12.6 is required for GPU training:
-```bash
-pip install torch --index-url https://download.pytorch.org/whl/cu126
-pip install -r requirements.txt
-```
-### Fine-tune with rebuilt vocabulary
-```bash
-python train.py --data-file datasets/AnimeName/dmhy_weak.jsonl \
-  --vocab-file datasets/AnimeName/vocab.json \
-  --save-dir checkpoints/dmhy-finetune \
-  --init-model-dir . \
-  --epochs 10 --batch-size 128 \
-  --learning-rate 0.0003 --warmup-steps 300 --seed 42
-```
-The model loads the old 3000-token checkpoint, `resize_token_embeddings()` adds
-5000 new randomly-initialized slots for the new vocabulary, and fine-tuning
-trains the full model. About 96% of token occurrences are now covered (vs 90%
-with the old 3000-token vocabulary).
 ### Character-token DMHY training
 ```bash
-python convert_to_char_dataset.py \
   --input datasets/AnimeName/dmhy_weak.jsonl \
   --output datasets/AnimeName/dmhy_weak_char.jsonl \
-  --vocab-output vocab.char.json \
   --manifest-output datasets/AnimeName/dmhy_weak_char.manifest.json
-python train.py --tokenizer char \
   --data-file datasets/AnimeName/dmhy_weak_char.jsonl \
-  --vocab-file vocab.char.json \
-  --save-dir checkpoints_char/dmhy-weak-char \
-  --epochs 1 --batch-size 64 \
-  --learning-rate 0.0003 --warmup-steps 300 \
-  --max-seq-length 128 --seed 42
 ```
 The converter keeps source metadata and adds `tokenizer_variant`, source token
@@ -169,12 +135,21 @@ count, and character token count fields to each record. The char dataset's
 p99 length is 107 characters, so `--max-seq-length 128` covers almost all rows
 while leaving room for `[CLS]` and `[SEP]`.
-### Regenerate datasets from source
 ```bash
-python data_generator.py --num-samples 100000
-python dmhy_dataset.py --db D:/WorkSpace/Python/dmhy-parser/dmhy_anime.db --output data/dmhy/dmhy_weak.jsonl
-python mix_datasets.py --synthetic data/synthetic.jsonl --dmhy data/dmhy/dmhy_weak.jsonl --output data/dmhy/mixed_train.jsonl
 ```
 ### Rebuild vocabulary (if needed)
@@ -192,7 +167,7 @@ json.dump(vocab, open('vocab.json','w'), ensure_ascii=False, indent=2)
 ### Export ONNX for MiruPlay Android
 ```bash
-python export_onnx.py --model-dir checkpoints/dmhy-finetune/final --output exports/anime_filename_parser.onnx
 ```
 ---
@@ -213,14 +188,13 @@ python colab_train.py --profile dmhy_regex_finetune
 ## Repository Layout
-- `model.safetensors`, `config.json`, `vocab.json`: default fine-tuned model
 - `train.py`, `dataset.py`, `tokenizer.py`, `model.py`: training pipeline
 - `dmhy_dataset.py`, `mix_datasets.py`: weak-label export and dataset mixing
 - `convert_to_char_dataset.py`: full character-token projection for weak labels
 - `inference.py`: end-to-end filename parser CLI
 - `export_onnx.py`: ONNX export for Android integration
 - `exports/`: exported ONNX model and metadata
-- `data/dmhy/*.manifest.json`: dataset waterlines and counts
 - `datasets/AnimeName/`: nested dataset submodule
 ## Maintenance Notes

 AniFileBERT is a tiny BERT token-classification model for parsing anime release filenames into structured fields such as release group, title, season, episode, resolution, source, and special tags.
+The checkpoint in this repository is the full-relabel DMHY character-token model used by MiruPlay.
 ## Model
 - Layers: 4
 - Attention heads: 8
 - Labels: BIO token labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, and `SPECIAL`
+- Tokenizer: custom character tokenizer implemented in `tokenizer.py`
+- Max sequence length: 128
+- Parameters: 4,783,631
 The model files are stored at the repository root so `BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")` can load the weights. Use `inference.py` for end-to-end parsing because the tokenizer is custom rather than a standard WordPiece tokenizer.
 ## Vocabulary
+The published checkpoint uses a character vocabulary. `vocab.json` at the
+repository root is the deployed tokenizer vocab, and `vocab.char.json` is kept
+as a mirrored explicit copy for training/data maintenance. The full DMHY weak
+dataset has **6195 unique characters**, so the complete character vocab is only
+**6199** entries including special tokens and reaches 100% token coverage.
+The regex vocabulary is still maintained in `datasets/AnimeName/vocab.json` for
+dataset relabeling and diagnostics, but the root checkpoint loads as `char`.
 ## Evaluation
+Final full-relabel char training (`632002` DMHY rows, 2 epochs, batch size 256,
+seed 48):
+| Metric | Value |
+|--------|-------|
+| Eval loss | 0.0163 |
+| Entity precision | 0.9800 |
+| Entity recall | 0.9867 |
+| Entity F1 | 0.9833 |
+| Token accuracy | 0.9943 |
+| Held-out parse full match | 2008/2048 (0.9805) |
+| Fixed regression full match | 21/21 (1.0000) |
+The fixed regression set includes second-season aliases such as `Ni`,
+`Ni no Sara`, `貳`, and `弐ノ章`, plus long-running episode IDs and dense meta
+blocks.
 ## Usage
 Install dependencies:
 ```bash
+uv sync
 ```
 Parse a filename with this repository cloned locally:
 ## Training
 ### Character-token DMHY training
 ```bash
+uv run python convert_to_char_dataset.py \
   --input datasets/AnimeName/dmhy_weak.jsonl \
   --output datasets/AnimeName/dmhy_weak_char.jsonl \
+  --vocab-output datasets/AnimeName/vocab.char.json \
   --manifest-output datasets/AnimeName/dmhy_weak_char.manifest.json
+uv run python train.py --tokenizer char \
   --data-file datasets/AnimeName/dmhy_weak_char.jsonl \
+  --vocab-file datasets/AnimeName/vocab.char.json \
+  --save-dir checkpoints/dmhy-char-full-relabel \
+  --init-model-dir . \
+  --epochs 2 --batch-size 256 \
+  --learning-rate 0.00008 --warmup-steps 300 \
+  --checkpoint-steps 1000 --save-total-limit 3 \
+  --parse-eval-limit 2048 \
+  --max-seq-length 128 --seed 48
 ```
 The converter keeps source metadata and adds `tokenizer_variant`, source token
 p99 length is 107 characters, so `--max-seq-length 128` covers almost all rows
 while leaving room for `[CLS]` and `[SEP]`.
+### Relabel the full dataset
 ```bash
+uv run python relabel_dataset_from_filenames.py \
+  --input datasets/AnimeName/dmhy_weak.jsonl \
+  --output datasets/AnimeName/dmhy_weak.relabel.jsonl \
+  --manifest-output datasets/AnimeName/dmhy_weak.relabel.manifest.json \
+  --vocab-output datasets/AnimeName/vocab.relabel.json \
+  --base-vocab datasets/AnimeName/vocab.json \
+  --max-vocab-size 8000
+Move-Item datasets/AnimeName/dmhy_weak.relabel.jsonl datasets/AnimeName/dmhy_weak.jsonl -Force
+Move-Item datasets/AnimeName/vocab.relabel.json datasets/AnimeName/vocab.json -Force
+Copy-Item datasets/AnimeName/dmhy_weak.relabel.manifest.json datasets/AnimeName/dmhy_weak.manifest.json -Force
+Remove-Item datasets/AnimeName/dmhy_weak.relabel.manifest.json -Force
 ```
 ### Rebuild vocabulary (if needed)
 ### Export ONNX for MiruPlay Android
 ```bash
+uv run python export_onnx.py --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
 ```
 ---
 ## Repository Layout
+- `model.safetensors`, `config.json`, `vocab.json`: default published model
 - `train.py`, `dataset.py`, `tokenizer.py`, `model.py`: training pipeline
 - `dmhy_dataset.py`, `mix_datasets.py`: weak-label export and dataset mixing
 - `convert_to_char_dataset.py`: full character-token projection for weak labels
 - `inference.py`: end-to-end filename parser CLI
 - `export_onnx.py`: ONNX export for Android integration
 - `exports/`: exported ONNX model and metadata
 - `datasets/AnimeName/`: nested dataset submodule
 ## Maintenance Notes
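
Note on the README evaluation table in this diff: the reported entity F1 and the held-out full-match rate can be cross-checked from the other numbers in the same table. The snippet below is only a sanity check on the rounded values as printed; it assumes F1 is the harmonic mean of the listed precision and recall, and it does not reproduce how `train.py` actually computes its metrics.

```python
# Values copied from the README evaluation table in this commit.
precision = 0.9800
recall = 0.9867
reported_f1 = 0.9833

# F1 as the harmonic mean of precision and recall (assumed definition).
f1 = 2 * precision * recall / (precision + recall)

# Held-out parse full match: 2008 of 2048 filenames.
full_match = 2008 / 2048

print(round(f1, 4), round(full_match, 4))
```

Both rounded results agree with the table (0.9833 and 0.9805), so the reported figures are internally consistent.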

build_repair_focus_dataset.py ADDED

@@ -0,0 +1,151 @@

+"""Build a small fine-tuning set focused on repaired filename structures."""
+from __future__ import annotations
+import argparse
+import json
+import random
+from pathlib import Path
+from typing import Iterable, List
+from label_repairs import repair_jsonl_item
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Build repair-focused char JSONL fine-tune data")
+    parser.add_argument("--input", required=True, help="Repaired char JSONL dataset")
+    parser.add_argument("--output", required=True, help="Output focus JSONL")
+    parser.add_argument("--context-samples", type=int, default=50000,
+                        help="Random non-repaired rows to include for stability")
+    parser.add_argument("--repeat-repaired", type=int, default=4,
+                        help="Repeat rows that still trigger a repair pass")
+    parser.add_argument("--repeat-manual", type=int, default=24,
+                        help="Repeat hand-labeled hard cases")
+    parser.add_argument("--seed", type=int, default=42)
+    return parser.parse_args()
+def char_item(filename: str, spans: List[tuple[str, str]]) -> dict:
+    tokens = list(filename)
+    labels = ["O"] * len(tokens)
+    cursor = 0
+    for text, entity in spans:
+        start = filename.find(text, cursor)
+        if start < 0:
+            start = filename.find(text)
+        if start < 0:
+            raise ValueError(f"Could not find span {text!r} in {filename!r}")
+        end = start + len(text)
+        labels[start] = f"B-{entity}"
+        for idx in range(start + 1, end):
+            labels[idx] = f"I-{entity}"
+        cursor = end
+    return {
+        "filename": filename,
+        "tokens": tokens,
+        "labels": labels,
+        "tokenizer_variant": "char",
+        "source": "manual_repair_focus",
+    }
+def manual_cases() -> Iterable[dict]:
+    yield char_item(
+        "[AI-Raws] 炎炎の消防隊 弐ノ章 #13 (BD HEVC 1920x1080 yuv444p10le FLAC)[FC74A2D5].mkv",
+        [
+            ("AI-Raws", "GROUP"),
+            ("炎炎の消防隊", "TITLE"),
+            ("弐ノ章", "SEASON"),
+            ("13", "EPISODE"),
+            ("BD", "SOURCE"),
+            ("HEVC", "SOURCE"),
+            ("1920x1080", "RESOLUTION"),
+            ("FLAC", "SOURCE"),
+        ],
+    )
+    yield char_item(
+        "[AI-Raws] 炎炎の消防隊 弐ノ章 #01 (BD HEVC 1920x1080 FLAC).mkv",
+        [
+            ("AI-Raws", "GROUP"),
+            ("炎炎の消防隊", "TITLE"),
+            ("弐ノ章", "SEASON"),
+            ("01", "EPISODE"),
+            ("BD", "SOURCE"),
+            ("HEVC", "SOURCE"),
+            ("1920x1080", "RESOLUTION"),
+            ("FLAC", "SOURCE"),
+        ],
+    )
+    yield char_item(
+        "[DBD-Raws][炎炎消防队 貳之章][01][1080P][BDRip][HEVC-10bit][FLAC]",
+        [
+            ("DBD-Raws", "GROUP"),
+            ("炎炎消防队", "TITLE"),
+            ("貳之章", "SEASON"),
+            ("01", "EPISODE"),
+            ("1080P", "RESOLUTION"),
+            ("BDRip", "SOURCE"),
+            ("FLAC", "SOURCE"),
+        ],
+    )
+def main() -> None:
+    args = parse_args()
+    rng = random.Random(args.seed)
+    input_path = Path(args.input)
+    output_path = Path(args.output)
+    repaired_rows: List[dict] = []
+    reservoir: List[dict] = []
+    seen_filenames = set()
+    total_rows = 0
+    with input_path.open("r", encoding="utf-8") as handle:
+        for line in handle:
+            if not line.strip():
+                continue
+            total_rows += 1
+            item = json.loads(line)
+            _repaired_item, repairs = repair_jsonl_item(item)
+            filename = item.get("filename")
+            if repairs:
+                repaired_rows.append(item)
+                if filename:
+                    seen_filenames.add(filename)
+                continue
+            if filename in seen_filenames:
+                continue
+            if len(reservoir) < args.context_samples:
+                reservoir.append(item)
+            else:
+                index = rng.randrange(total_rows)
+                if index < args.context_samples:
+                    reservoir[index] = item
+    rows: List[dict] = []
+    for item in repaired_rows:
+        rows.extend([item] * max(1, args.repeat_repaired))
+    rows.extend(reservoir)
+    for item in manual_cases():
+        rows.extend([item] * max(1, args.repeat_manual))
+    rng.shuffle(rows)
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    with output_path.open("w", encoding="utf-8") as handle:
+        for item in rows:
+            handle.write(json.dumps(item, ensure_ascii=False, separators=(",", ":")) + "\n")
+    print(json.dumps({
+        "input": str(input_path),
+        "output": str(output_path),
+        "total_rows": total_rows,
+        "repaired_rows": len(repaired_rows),
+        "context_rows": len(reservoir),
+        "manual_rows": len(list(manual_cases())),
+        "written_rows": len(rows),
+    }, ensure_ascii=False, indent=2))
+if __name__ == "__main__":
+    main()
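
Note on the context reservoir in `main()` above: once the reservoir is full, the replacement index is drawn with `rng.randrange(total_rows)`, and `total_rows` also counts repaired rows and rows skipped via `seen_filenames`. Textbook reservoir sampling (Algorithm R) draws from the number of rows actually offered to the reservoir, so the committed loop may under-sample later rows when many rows are repaired. The following is a minimal sketch of the textbook variant for comparison; the function name `reservoir_sample` and the `seen` counter are hypothetical and not part of this commit.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream (Algorithm R)."""
    reservoir: List[T] = []
    seen = 0  # counts only items offered to the reservoir (hypothetical name)
    for item in items:
        seen += 1
        if len(reservoir) < k:
            # Fill phase: keep the first k eligible items.
            reservoir.append(item)
        else:
            # Replacement phase: the new item is kept with probability k / seen.
            index = rng.randrange(seen)
            if index < k:
                reservoir[index] = item
    return reservoir


if __name__ == "__main__":
    sample = reservoir_sample(range(1000), 10, random.Random(42))
    print(len(sample), len(set(sample)))
```

In the script this would amount to keeping a separate counter that is incremented only for rows reaching the reservoir branch, and passing that counter to `randrange` instead of `total_rows`.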

case_metrics.json ADDED

@@ -0,0 +1,459 @@

+{
+  "model_dir": ".",
+  "case_file": "data\\parser_regression_cases.json",
+  "tokenizer_variant": "char",
+  "max_length": 128,
+  "use_rules": true,
+  "constrain_bio": true,
+  "case_count": 21,
+  "full_correct": 21,
+  "full_accuracy": 1.0,
+  "field_correct": {
+    "group": 18,
+    "title": 21,
+    "episode": 21,
+    "resolution": 21,
+    "source": 14,
+    "season": 8,
+    "special": 1
+  },
+  "field_total": {
+    "group": 18,
+    "title": 21,
+    "episode": 21,
+    "resolution": 21,
+    "source": 14,
+    "season": 8,
+    "special": 1
+  },
+  "field_accuracy": {
+    "episode": 1.0,
+    "group": 1.0,
+    "resolution": 1.0,
+    "season": 1.0,
+    "source": 1.0,
+    "special": 1.0,
+    "title": 1.0
+  },
+  "failures": [],
+  "results": [
+    {
+      "id": "lolihouse_dash_episode",
+      "filename": "[LoliHouse] Yomi no Tsugai - 07 [WebRip 1080p HEVC-10bit AAC ASSx2]",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "group": "LoliHouse",
+        "title": "Yomi no Tsugai",
+        "episode": 7,
+        "resolution": "1080p",
+        "source": "WebRip"
+      },
+      "pred": {
+        "episode": 7,
+        "group": "LoliHouse",
+        "resolution": "1080p",
+        "source": "WebRip",
+        "title": "Yomi no Tsugai"
+      }
+    },
+    {
+      "id": "dot_season_episode_no_group",
+      "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "title": "Witch.Hat.Atelier",
+        "season": 1,
+        "episode": 7,
+        "group": null,
+        "resolution": "1080p",
+        "source": "NF"
+      },
+      "pred": {
+        "episode": 7,
+        "group": null,
+        "resolution": "1080p",
+        "season": 1,
+        "source": "NF",
+        "title": "Witch.Hat.Atelier"
+      }
+    },
+    {
+      "id": "ani_cjk_season_dash_episode",
+      "filename": "[ANi] 異世界悠閒農家 2 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "group": "ANi",
+        "title": "異世界悠閒農家",
+        "season": 2,
+        "episode": 6,
+        "resolution": "1080P",
+        "source": "Baha"
+      },
+      "pred": {
+        "episode": 6,
+        "group": "ANi",
+        "resolution": "1080P",
+        "season": 2,
+        "source": "Baha",
+        "title": "異世界悠閒農家"
+      }
+    },
+    {
+      "id": "kisssub_bracket_title_episode",
+      "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][GB][MP4]",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "group": "KissSub",
+        "title": "Shunkashuutou Daikousha - Haru no Mai",
+        "episode": 5,
+        "resolution": "1080P",
+        "source": "GB"
+      },
+      "pred": {
+        "episode": 5,
+        "group": "KissSub",
+        "resolution": "1080P",
+        "source": "GB",
+        "title": "Shunkashuutou Daikousha - Haru no Mai"
+      }
+    },
+    {
+      "id": "airotabracket_title_episode",
+      "filename": "[Airota][Sousou no Frieren][29][1080p AVC AAC][CHT]",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "group": "Airota",
+        "title": "Sousou no Frieren",
+        "episode": 29,
+        "resolution": "1080p",
+        "source": "CHT"
+      },
+      "pred": {
+        "episode": 29,
+        "group": "Airota",
+        "resolution": "1080p",
+        "source": "CHT",
+        "title": "Sousou no Frieren"
+      }
+    },
+    {
+      "id": "subsplease_parenthesized_resolution",
+      "filename": "[SubsPlease] Mushoku Tensei - 12 (1080p) [x265][AAC]",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "group": "SubsPlease",
+        "title": "Mushoku Tensei",
+        "episode": 12,
+        "resolution": "1080p"
+      },
+      "pred": {
+        "episode": 12,
+        "group": "SubsPlease",
+        "resolution": "1080p",
+        "title": "Mushoku Tensei"
+      }
+    },
+    {
+      "id": "vcb_bracket_episode",
+      "filename": "[VCB-Studio] Girls Band Cry [01][Ma10p_1080p][x265_flac]",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "group": "VCB-Studio",
+        "title": "Girls Band Cry",
+        "episode": 1,
+        "resolution": "1080p"
+      },
+      "pred": {
+        "episode": 1,
+        "group": "VCB-Studio",
+        "resolution": "1080p",
+        "title": "Girls Band Cry"
+      }
+    },
+    {
+      "id": "numeric_title_not_episode",
+      "filename": "86 Eighty Six - 01 [1080P][Baha]",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "title": "86 Eighty Six",
+        "episode": 1,
+        "resolution": "1080P",
+        "source": "Baha"
+      },
+      "pred": {
+        "episode": 1,
+        "resolution": "1080P",
+        "source": "Baha",
+        "title": "86 Eighty Six"
+      }
+    },
+    {
+      "id": "erai_raws_dash_episode",
+      "filename": "[Erai-raws] Sousou no Frieren - 01 [1080p][Multiple Subtitle][ENG]",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "group": "Erai-raws",
+        "title": "Sousou no Frieren",
+        "episode": 1,
+        "resolution": "1080p"
+      },
+      "pred": {
+        "episode": 1,
+        "group": "Erai-raws",
+        "resolution": "1080p",
+        "title": "Sousou no Frieren"
+      }
+    },
+    {
+      "id": "nekomoe_space_group",
+      "filename": "[Nekomoe kissaten][Watashi no Shiawase na Kekkon][01][1080p][JPSC]",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "group": "Nekomoe kissaten",
+        "title": "Watashi no Shiawase na Kekkon",
+        "episode": 1,
+        "resolution": "1080p"
+      },
+      "pred": {
+        "episode": 1,
+        "group": "Nekomoe kissaten",
+        "resolution": "1080p",
+        "title": "Watashi no Shiawase na Kekkon"
+      }
+    },
+    {
+      "id": "long_running_episode",
+      "filename": "One.Piece.1110.1080p.WEB-DL.AAC2.0.H.264",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "title": "One.Piece",
+        "episode": 1110,
+        "resolution": "1080p",
+        "source": "WEB-DL"
+      },
+      "pred": {
+        "episode": 1110,
+        "resolution": "1080p",
+        "source": "WEB-DL",
+        "title": "One.Piece"
+      }
+    },
+    {
+      "id": "season_episode_amzn",
+      "filename": "Example.Show.S02E03.2160p.AMZN.WEB-DL.DDP5.1.H.265",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "title": "Example.Show",
+        "season": 2,
+        "episode": 3,
+        "resolution": "2160p",
+        "source": "AMZN"
+      },
+      "pred": {
+        "episode": 3,
+        "resolution": "2160p",
+        "season": 2,
+        "source": "AMZN",
+        "title": "Example.Show"
+      }
+    },
+    {
+      "id": "cjk_group_with_prefix_tag",
+      "filename": "【喵萌奶茶屋】★04月新番★[葬送的芙莉莲][01][1080P][HEVC]",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "group": "喵萌奶茶屋",
+        "title": "葬送的芙莉莲",
+        "episode": 1,
+        "resolution": "1080P"
+      },
+      "pred": {
+        "episode": 1,
+        "group": "喵萌奶茶屋",
+        "resolution": "1080P",
+        "title": "葬送的芙莉莲"
+      }
+    },
+    {
+      "id": "leading_meta_not_group",
+      "filename": "[1080p] Witch Watch - 15 [CHS]",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "group": null,
+        "title": "Witch Watch",
+        "episode": 15,
+        "resolution": "1080p",
+        "source": "CHS"
+      },
+      "pred": {
+        "episode": 15,
+        "group": null,
+        "resolution": "1080p",
+        "source": "CHS",
+        "title": "Witch Watch"
+      }
+    },
+    {
+      "id": "sakurato_group_language_source",
+      "filename": "[Sakurato] Witch Watch - 15 [1080p][CHS]",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "group": "Sakurato",
+        "title": "Witch Watch",
+        "episode": 15,
+        "resolution": "1080p",
+        "source": "CHS"
+      },
+      "pred": {
+        "episode": 15,
+        "group": "Sakurato",
+        "resolution": "1080p",
+        "source": "CHS",
+        "title": "Witch Watch"
+      }
+    },
+    {
+      "id": "billion_meta_lab_search_special",
+      "filename": "[Billion Meta Lab] 魔法姊妹露露莉莉 Mahou no Shimai Rurutto Riryi [07][1080P][CHT&JPN][檢索：魔法姊妹露露特莉莉].mp4",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "group": "Billion Meta Lab",
+        "title": "魔法姊妹露露莉莉 Mahou no Shimai Rurutto Riryi",
+        "episode": 7,
+        "resolution": "1080P",
+        "source": "CHT&JPN",
+        "special": "檢索：魔法姊妹露露特莉莉"
+      },
+      "pred": {
+        "episode": 7,
+        "group": "Billion Meta Lab",
+        "resolution": "1080P",
+        "source": "CHT&JPN",
+        "special": "檢索：魔法姊妹露露特莉莉",
+        "title": "魔法姊妹露露莉莉 Mahou no Shimai Rurutto Riryi"
+      }
+    },
+    {
+      "id": "studio_greentea_s2_bracket_episode",
+      "filename": "[Studio GreenTea] Otonari no Tenshi-sama ni Itsunomanika Dame Ningen ni Sareteita Ken S2 [06][WebRip][HEVC-10bit 1080p AAC][JPSC].mp4",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "group": "Studio GreenTea",
+        "title": "Otonari no Tenshi-sama ni Itsunomanika Dame Ningen ni Sareteita Ken",
+        "season": 2,
+        "episode": 6,
+        "resolution": "1080p",
+        "source": "WebRip"
+      },
+      "pred": {
+        "episode": 6,
+        "group": "Studio GreenTea",
+        "resolution": "1080p",
+        "season": 2,
+        "source": "WebRip",
+        "title": "Otonari no Tenshi-sama ni Itsunomanika Dame Ningen ni Sareteita Ken"
+      }
+    },
+    {
+      "id": "lolihouse_kakuriyo_bare_ni_season",
+      "filename": "[LoliHouse] Kakuriyo no Yadomeshi Ni - 12 [WebRip 1080p HEVC-10bit AAC SRTx2].mkv",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "group": "LoliHouse",
+        "title": "Kakuriyo no Yadomeshi",
+        "season": 2,
+        "episode": 12,
+        "resolution": "1080p",
+        "source": "WebRip"
+      },
+      "pred": {
+        "episode": 12,
+        "group": "LoliHouse",
+        "resolution": "1080p",
+        "season": 2,
+        "source": "WebRip",
+        "title": "Kakuriyo no Yadomeshi"
+      }
+    },
+    {
+      "id": "ani_kakuriyo_traditional_ni",
+      "filename": "[ANi] 妖怪旅館營業中 貳 - 11 [1080P][Baha][WEB-DL][AAC AVC][CHT].mp4",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "group": "ANi",
+        "title": "妖怪旅館營業中",
+        "season": 2,
+        "episode": 11,
+        "resolution": "1080P",
+        "source": "Baha"
+      },
+      "pred": {
+        "episode": 11,
+        "group": "ANi",
+        "resolution": "1080P",
+        "season": 2,
+        "source": "Baha",
+        "title": "妖怪旅館營業中"
+      }
+    },
+    {
+      "id": "jibaketa_shokugeki_ni_no_sara",
+      "filename": "[jibaketa]Shokugeki no Souma Ni no Sara - 13 END [BD 1920x1080 x264 AACx2 SRT TVB CHT].mkv",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "group": "jibaketa",
+        "title": "Shokugeki no Souma",
+        "season": 2,
+        "episode": 13,
+        "resolution": "1920x1080"
+      },
+      "pred": {
+        "episode": 13,
+        "group": "jibaketa",
+        "resolution": "1920x1080",
+        "season": 2,
+        "title": "Shokugeki no Souma"
+      }
+    },
+    {
+      "id": "ai_raws_fire_force_cjk_season_hash_episode",
+      "filename": "[AI-Raws] 炎炎の消防隊 弐ノ章 #13 (BD HEVC 1920x1080 yuv444p10le FLAC)[FC74A2D5].mkv",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "group": "AI-Raws",
+        "title": "炎炎の消防隊",
+        "season": 2,
+        "episode": 13,
+        "resolution": "1920x1080"
+      },
+      "pred": {
+        "episode": 13,
+        "group": "AI-Raws",
+        "resolution": "1920x1080",
+        "season": 2,
+        "title": "炎炎の消防隊"
+      }
+    }
+  ]
+}

config.json CHANGED Viewed

@@ -50,15 +50,15 @@
   },
   "layer_norm_eps": 1e-12,
   "max_position_embeddings": 128,
-  "max_seq_length": 64,
   "model_type": "bert",
   "num_attention_heads": 8,
   "num_hidden_layers": 4,
   "pad_token_id": 0,
   "tie_word_embeddings": true,
-  "tokenizer_variant": "regex",
-  "transformers_version": "5.8.0",
   "type_vocab_size": 2,
   "use_cache": false,
-  "vocab_size": 3000
 }

   },
   "layer_norm_eps": 1e-12,
   "max_position_embeddings": 128,
+  "max_seq_length": 128,
   "model_type": "bert",
   "num_attention_heads": 8,
   "num_hidden_layers": 4,
   "pad_token_id": 0,
   "tie_word_embeddings": true,
+  "tokenizer_variant": "char",
+  "transformers_version": "5.8.1",
   "type_vocab_size": 2,
   "use_cache": false,
+  "vocab_size": 6199
 }
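
Review note: the new side raises `max_seq_length` from 64 to 128, which now equals `max_position_embeddings`. A minimal sketch of a sanity check for that relationship follows. `check_config` is a hypothetical helper, not part of this repository, and it assumes `max_seq_length` counts the `[CLS]` and `[SEP]` tokens, as the `len(aligned_tokens) + 2 > max_length` truncation check in `diagnose_pipeline.py` suggests.

```python
from typing import Dict, List


def check_config(cfg: Dict) -> List[str]:
    """Hypothetical sanity check for the length fields shown in this diff."""
    problems: List[str] = []
    # Assumption: max_seq_length includes the two special tokens, so it must
    # fit inside the learned position-embedding table.
    if cfg["max_seq_length"] > cfg["max_position_embeddings"]:
        problems.append("max_seq_length exceeds max_position_embeddings")
    return problems


# Values copied from the new side of config.json.
new_cfg = {
    "max_position_embeddings": 128,
    "max_seq_length": 128,
    "tokenizer_variant": "char",
    "vocab_size": 6199,
}
print(check_config(new_cfg))
```

With the values above the check reports no problems; any later increase of `max_seq_length` would also require a larger position-embedding table.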

data/parser_regression_cases.json ADDED Viewed

@@ -0,0 +1,232 @@

+[
+  {
+    "id": "lolihouse_dash_episode",
+    "filename": "[LoliHouse] Yomi no Tsugai - 07 [WebRip 1080p HEVC-10bit AAC ASSx2]",
+    "expected": {
+      "group": "LoliHouse",
+      "title": "Yomi no Tsugai",
+      "episode": 7,
+      "resolution": "1080p",
+      "source": "WebRip"
+    }
+  },
+  {
+    "id": "dot_season_episode_no_group",
+    "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub",
+    "expected": {
+      "title": "Witch.Hat.Atelier",
+      "season": 1,
+      "episode": 7,
+      "group": null,
+      "resolution": "1080p",
+      "source": "NF"
+    }
+  },
+  {
+    "id": "ani_cjk_season_dash_episode",
+    "filename": "[ANi] 異世界悠閒農家 2 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]",
+    "expected": {
+      "group": "ANi",
+      "title": "異世界悠閒農家",
+      "season": 2,
+      "episode": 6,
+      "resolution": "1080P",
+      "source": "Baha"
+    }
+  },
+  {
+    "id": "kisssub_bracket_title_episode",
+    "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][GB][MP4]",
+    "expected": {
+      "group": "KissSub",
+      "title": "Shunkashuutou Daikousha - Haru no Mai",
+      "episode": 5,
+      "resolution": "1080P",
+      "source": "GB"
+    }
+  },
+  {
+    "id": "airotabracket_title_episode",
+    "filename": "[Airota][Sousou no Frieren][29][1080p AVC AAC][CHT]",
+    "expected": {
+      "group": "Airota",
+      "title": "Sousou no Frieren",
+      "episode": 29,
+      "resolution": "1080p",
+      "source": "CHT"
+    }
+  },
+  {
+    "id": "subsplease_parenthesized_resolution",
+    "filename": "[SubsPlease] Mushoku Tensei - 12 (1080p) [x265][AAC]",
+    "expected": {
+      "group": "SubsPlease",
+      "title": "Mushoku Tensei",
+      "episode": 12,
+      "resolution": "1080p"
+    }
+  },
+  {
+    "id": "vcb_bracket_episode",
+    "filename": "[VCB-Studio] Girls Band Cry [01][Ma10p_1080p][x265_flac]",
+    "expected": {
+      "group": "VCB-Studio",
+      "title": "Girls Band Cry",
+      "episode": 1,
+      "resolution": "1080p"
+    }
+  },
+  {
+    "id": "numeric_title_not_episode",
+    "filename": "86 Eighty Six - 01 [1080P][Baha]",
+    "expected": {
+      "title": "86 Eighty Six",
+      "episode": 1,
+      "resolution": "1080P",
+      "source": "Baha"
+    }
+  },
+  {
+    "id": "erai_raws_dash_episode",
+    "filename": "[Erai-raws] Sousou no Frieren - 01 [1080p][Multiple Subtitle][ENG]",
+    "expected": {
+      "group": "Erai-raws",
+      "title": "Sousou no Frieren",
+      "episode": 1,
+      "resolution": "1080p"
+    }
+  },
+  {
+    "id": "nekomoe_space_group",
+    "filename": "[Nekomoe kissaten][Watashi no Shiawase na Kekkon][01][1080p][JPSC]",
+    "expected": {
+      "group": "Nekomoe kissaten",
+      "title": "Watashi no Shiawase na Kekkon",
+      "episode": 1,
+      "resolution": "1080p"
+    }
+  },
+  {
+    "id": "long_running_episode",
+    "filename": "One.Piece.1110.1080p.WEB-DL.AAC2.0.H.264",
+    "expected": {
+      "title": "One.Piece",
+      "episode": 1110,
+      "resolution": "1080p",
+      "source": "WEB-DL"
+    }
+  },
+  {
+    "id": "season_episode_amzn",
+    "filename": "Example.Show.S02E03.2160p.AMZN.WEB-DL.DDP5.1.H.265",
+    "expected": {
+      "title": "Example.Show",
+      "season": 2,
+      "episode": 3,
+      "resolution": "2160p",
+      "source": "AMZN"
+    }
+  },
+  {
+    "id": "cjk_group_with_prefix_tag",
+    "filename": "【喵萌奶茶屋】★04月新番★[葬送的芙莉莲][01][1080P][HEVC]",
+    "expected": {
+      "group": "喵萌奶茶屋",
+      "title": "葬送的芙莉莲",
+      "episode": 1,
+      "resolution": "1080P"
+    }
+  },
+  {
+    "id": "leading_meta_not_group",
+    "filename": "[1080p] Witch Watch - 15 [CHS]",
+    "expected": {
+      "group": null,
+      "title": "Witch Watch",
+      "episode": 15,
+      "resolution": "1080p",
+      "source": "CHS"
+    }
+  },
+  {
+    "id": "sakurato_group_language_source",
+    "filename": "[Sakurato] Witch Watch - 15 [1080p][CHS]",
+    "expected": {
+      "group": "Sakurato",
+      "title": "Witch Watch",
+      "episode": 15,
+      "resolution": "1080p",
+      "source": "CHS"
+    }
+  },
+  {
+    "id": "billion_meta_lab_search_special",
+    "filename": "[Billion Meta Lab] 魔法姊妹露露莉莉 Mahou no Shimai Rurutto Riryi [07][1080P][CHT&JPN][檢索：魔法姊妹露露特莉莉].mp4",
+    "expected": {
+      "group": "Billion Meta Lab",
+      "title": "魔法姊妹露露莉莉 Mahou no Shimai Rurutto Riryi",
+      "episode": 7,
+      "resolution": "1080P",
+      "source": "CHT&JPN",
+      "special": "檢索：魔法姊妹露露特莉莉"
+    }
+  },
+  {
+    "id": "studio_greentea_s2_bracket_episode",
+    "filename": "[Studio GreenTea] Otonari no Tenshi-sama ni Itsunomanika Dame Ningen ni Sareteita Ken S2 [06][WebRip][HEVC-10bit 1080p AAC][JPSC].mp4",
+    "expected": {
+      "group": "Studio GreenTea",
+      "title": "Otonari no Tenshi-sama ni Itsunomanika Dame Ningen ni Sareteita Ken",
+      "season": 2,
+      "episode": 6,
+      "resolution": "1080p",
+      "source": "WebRip"
+    }
+  },
+  {
+    "id": "lolihouse_kakuriyo_bare_ni_season",
+    "filename": "[LoliHouse] Kakuriyo no Yadomeshi Ni - 12 [WebRip 1080p HEVC-10bit AAC SRTx2].mkv",
+    "expected": {
+      "group": "LoliHouse",
+      "title": "Kakuriyo no Yadomeshi",
+      "season": 2,
+      "episode": 12,
+      "resolution": "1080p",
+      "source": "WebRip"
+    }
+  },
+  {
+    "id": "ani_kakuriyo_traditional_ni",
+    "filename": "[ANi] 妖怪旅館營業中 貳 - 11 [1080P][Baha][WEB-DL][AAC AVC][CHT].mp4",
+    "expected": {
+      "group": "ANi",
+      "title": "妖怪旅館營業中",
+      "season": 2,
+      "episode": 11,
+      "resolution": "1080P",
+      "source": "Baha"
+    }
+  },
+  {
+    "id": "jibaketa_shokugeki_ni_no_sara",
+    "filename": "[jibaketa]Shokugeki no Souma Ni no Sara - 13 END [BD 1920x1080 x264 AACx2 SRT TVB CHT].mkv",
+    "expected": {
+      "group": "jibaketa",
+      "title": "Shokugeki no Souma",
+      "season": 2,
+      "episode": 13,
+      "resolution": "1920x1080"
+    }
+  },
+  {
+    "id": "ai_raws_fire_force_cjk_season_hash_episode",
+    "filename": "[AI-Raws] 炎炎の消防隊 弐ノ章 #13 (BD HEVC 1920x1080 yuv444p10le FLAC)[FC74A2D5].mkv",
+    "expected": {
+      "group": "AI-Raws",
+      "title": "炎炎の消防隊",
+      "season": 2,
+      "episode": 13,
+      "resolution": "1920x1080"
+    }
+  }
+]
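
Review note: each case pairs a `filename` with the fields a parser is `expected` to extract, and `case_metrics.json` records the same cases with `ok`, `errors`, and `pred`. A minimal sketch of how one case might be scored is shown below. `normalize` and `compare_case` are hypothetical names, and the normalization rules are an assumption; the actual logic in `evaluate_parser_cases.py` is not visible in this section.

```python
from typing import Any, Dict, Optional


def normalize(field: str, value: Any) -> Optional[str]:
    """Hypothetical normalization: integers for season/episode, lowercase text otherwise."""
    if value is None:
        return None
    if field in {"season", "episode"}:
        try:
            return str(int(value))
        except (TypeError, ValueError):
            return str(value).strip().lower()
    return " ".join(str(value).split()).lower()


def compare_case(expected: Dict[str, Any], pred: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Return one entry per expected field whose normalized value differs from pred."""
    errors: Dict[str, Dict[str, Any]] = {}
    for field, want in expected.items():
        got = pred.get(field)  # a missing key is treated like an explicit null
        if normalize(field, want) != normalize(field, got):
            errors[field] = {"expected": want, "pred": got}
    return errors


case = {
    "id": "lolihouse_dash_episode",
    "expected": {
        "group": "LoliHouse",
        "title": "Yomi no Tsugai",
        "episode": 7,
        "resolution": "1080p",
        "source": "WebRip",
    },
}
# A made-up prediction that differs only in formatting.
pred = {
    "group": "LoliHouse",
    "title": "Yomi no Tsugai",
    "episode": "07",
    "resolution": "1080P",
    "source": "WebRip",
}
errors = compare_case(case["expected"], pred)
print({"id": case["id"], "ok": not errors, "errors": errors})
```

Only fields listed under `expected` are compared in this sketch, so a case can leave out a field such as `source` without constraining it, while `"group": null` asserts that no group should be reported.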

dataset.py CHANGED Viewed

@@ -6,11 +6,13 @@ Handles token-ID conversion, label encoding, padding, and truncation.
 """
 import json
 import torch
 from torch.utils.data import Dataset
-from typing import Dict, List, Optional
 from config import Config
 from tokenizer import AnimeTokenizer
@@ -62,9 +64,7 @@ class AnimeDataset(Dataset):
             Dictionary with input_ids, attention_mask, labels as LongTensors.
         """
         item = self.data[idx]
-        tokens: List[str] = item["tokens"]
-        labels: List[str] = item["labels"]
-        tokens, labels = align_tokens_for_tokenizer(tokens, labels, self.tokenizer)
         # Convert tokens to IDs
         input_ids = self.tokenizer.convert_tokens_to_ids(tokens)
@@ -137,6 +137,146 @@ def align_tokens_for_tokenizer(
     return aligned_tokens, aligned_labels
 def create_datasets(
     data_path: str,
     tokenizer: AnimeTokenizer,

 """
 import json
+from collections import Counter
 import torch
 from torch.utils.data import Dataset
+from typing import Dict, List, Optional, Tuple
 from config import Config
+from label_repairs import repair_sequel_season_labels
 from tokenizer import AnimeTokenizer
             Dictionary with input_ids, attention_mask, labels as LongTensors.
         """
         item = self.data[idx]
+        tokens, labels = labels_for_tokenizer(item, self.tokenizer)
         # Convert tokens to IDs
         input_ids = self.tokenizer.convert_tokens_to_ids(tokens)
     return aligned_tokens, aligned_labels
+def labels_for_tokenizer(
+    item: Dict,
+    tokenizer: AnimeTokenizer,
+) -> Tuple[List[str], List[str]]:
+    """
+    Return tokens and labels in the exact tokenizer space used by the model.
+    Older DMHY weak-label files store a post-processed token sequence where
+    group/title brackets may be expanded even though AnimeTokenizer keeps the
+    same bracketed text as one inference token. If the raw filename is present,
+    project those weak labels back to character spans and then onto the current
+    tokenizer output. This keeps train/eval/inference preprocessing identical.
+    """
+    filename = item.get("filename")
+    source_tokens, source_labels, _repairs = repair_sequel_season_labels(item)
+    tokenizer_variant = getattr(tokenizer, "tokenizer_variant", "regex")
+    if not filename:
+        return align_tokens_for_tokenizer(source_tokens, source_labels, tokenizer)
+    # Current char datasets are already in the exact inference token space.
+    # Avoid re-scanning every filename during training.
+    if item.get("tokenizer_variant") == tokenizer_variant:
+        target_tokens = tokenizer.tokenize(filename)
+        if source_tokens == target_tokens:
+            return source_tokens, source_labels
+    projected = project_labels_from_filename(
+        filename=filename,
+        source_tokens=source_tokens,
+        source_labels=source_labels,
+        tokenizer=tokenizer,
+    )
+    if projected is not None:
+        return projected
+    # Fall back to the legacy behavior for synthetic fixtures or malformed rows.
+    return align_tokens_for_tokenizer(source_tokens, source_labels, tokenizer)
+def token_offsets_in_text(text: str, tokens: List[str]) -> Optional[List[Tuple[int, int]]]:
+    """Find token character offsets by scanning left to right."""
+    offsets: List[Tuple[int, int]] = []
+    cursor = 0
+    for token in tokens:
+        if token == "":
+            offsets.append((cursor, cursor))
+            continue
+        start = text.find(token, cursor)
+        if start < 0:
+            return None
+        end = start + len(token)
+        offsets.append((start, end))
+        cursor = end
+    return offsets
+def project_source_labels_to_chars(
+    text: str,
+    source_tokens: List[str],
+    source_labels: List[str],
+) -> Optional[List[str]]:
+    """Project source token BIO labels to per-character entity names."""
+    offsets = token_offsets_in_text(text, source_tokens)
+    if offsets is None or len(source_tokens) != len(source_labels):
+        return None
+    char_entities = ["O"] * len(text)
+    for token, label, (start, end) in zip(source_tokens, source_labels, offsets):
+        if not label.startswith(("B-", "I-")):
+            continue
+        entity = label.split("-", 1)[1]
+        # Bracketed single-token metadata in older data often includes the
+        # brackets in the token. Keep container punctuation as O so a tokenizer
+        # that splits brackets can learn cleaner boundaries.
+        inner_start = start
+        inner_end = end
+        if len(token) >= 2 and token[0] in "[【(《" and token[-1] in "]】)》":
+            inner_start += 1
+            inner_end -= 1
+        for pos in range(inner_start, inner_end):
+            if 0 <= pos < len(char_entities):
+                char_entities[pos] = entity
+    return char_entities
+def labels_from_char_projection(
+    text: str,
+    target_tokens: List[str],
+    char_entities: List[str],
+) -> Optional[List[str]]:
+    """Assign legal IOB2 labels to target tokens from per-character entities."""
+    offsets = token_offsets_in_text(text, target_tokens)
+    if offsets is None:
+        return None
+    labels: List[str] = []
+    active_entity: Optional[str] = None
+    for start, end in offsets:
+        span_entities = [
+            char_entities[pos]
+            for pos in range(start, end)
+            if 0 <= pos < len(char_entities) and char_entities[pos] != "O"
+        ]
+        if not span_entities:
+            labels.append("O")
+            active_entity = None
+            continue
+        entity = Counter(span_entities).most_common(1)[0][0]
+        prefix = "I" if active_entity == entity else "B"
+        labels.append(f"{prefix}-{entity}")
+        active_entity = entity
+    return labels
+def project_labels_from_filename(
+    filename: str,
+    source_tokens: List[str],
+    source_labels: List[str],
+    tokenizer: AnimeTokenizer,
+) -> Optional[Tuple[List[str], List[str]]]:
+    """
+    Re-tokenize filename and project weak BIO labels onto that tokenizer.
+    Returns None when source tokens cannot be aligned to the filename.
+    """
+    char_entities = project_source_labels_to_chars(filename, source_tokens, source_labels)
+    if char_entities is None:
+        return None
+    target_tokens = tokenizer.tokenize(filename)
+    target_labels = labels_from_char_projection(filename, target_tokens, char_entities)
+    if target_labels is None or len(target_tokens) != len(target_labels):
+        return None
+    return target_tokens, target_labels
 def create_datasets(
     data_path: str,
     tokenizer: AnimeTokenizer,
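
Review note: the projection added in this hunk works in two steps. Source BIO labels are painted onto characters of the filename, then each target token takes the majority entity of its characters. The sketch below is a compact, self-contained restatement of that idea for illustration. `offsets` and `project` are hypothetical names, the label names `GROUP` and `EPISODE` are assumed, and the target token list stands in for a tokenizer that splits brackets; it is not the output of `AnimeTokenizer`.

```python
from collections import Counter
from typing import List, Optional, Tuple


def offsets(text: str, tokens: List[str]) -> Optional[List[Tuple[int, int]]]:
    """Locate each token by scanning left to right; None if a token is missing."""
    out: List[Tuple[int, int]] = []
    cursor = 0
    for token in tokens:
        start = text.find(token, cursor)
        if start < 0:
            return None
        out.append((start, start + len(token)))
        cursor = start + len(token)
    return out


def project(
    text: str,
    source_tokens: List[str],
    source_labels: List[str],
    target_tokens: List[str],
) -> Optional[List[str]]:
    """Source BIO labels -> per-character entities -> IOB2 labels on target tokens."""
    src = offsets(text, source_tokens)
    tgt = offsets(text, target_tokens)
    if src is None or tgt is None or len(source_tokens) != len(source_labels):
        return None
    chars = ["O"] * len(text)
    for token, label, (start, end) in zip(source_tokens, source_labels, src):
        if not label.startswith(("B-", "I-")):
            continue
        # Wrapping brackets stay O so that only the inner text carries the entity.
        if len(token) >= 2 and token[0] in "[【(《" and token[-1] in "]】)》":
            start, end = start + 1, end - 1
        chars[start:end] = [label.split("-", 1)[1]] * (end - start)
    labels: List[str] = []
    active: Optional[str] = None
    for start, end in tgt:
        span = [c for c in chars[start:end] if c != "O"]
        if not span:
            labels.append("O")
            active = None
            continue
        entity = Counter(span).most_common(1)[0][0]
        labels.append(("I-" if active == entity else "B-") + entity)
        active = entity
    return labels


text = "[ANi] Witch Watch - 15"
source_tokens = ["[ANi]", " ", "Witch", " ", "Watch", " ", "-", " ", "15"]
source_labels = ["B-GROUP", "O", "B-TITLE", "I-TITLE", "I-TITLE", "O", "O", "O", "B-EPISODE"]
target_tokens = ["[", "ANi", "]", " ", "Witch", " ", "Watch", " ", "-", " ", "15"]
print(project(text, source_tokens, source_labels, target_tokens))
# ['O', 'B-GROUP', 'O', 'O', 'B-TITLE', 'I-TITLE', 'I-TITLE', 'O', 'O', 'O', 'B-EPISODE']
```

Going through characters makes the result independent of how the source and target tokenizers disagree about boundaries, which is why the bracketed group token can be split into three target tokens without losing its label.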

datasets/AnimeName CHANGED Viewed

@@ -1 +1 @@
-Subproject commit 867350a1712e50cc71f5a9e81dd331ca46a7b1dd

+Subproject commit 8d2b6c9e639fde6be0e428e5f34f56fccd5aa2ea

diagnose_pipeline.py CHANGED Viewed

@@ -27,7 +27,8 @@ from seqeval.metrics import classification_report, f1_score, precision_score, re
 from transformers import BertForTokenClassification
 from config import Config
-from dataset import align_tokens_for_tokenizer
 from tokenizer import AnimeTokenizer, create_tokenizer, load_tokenizer
@@ -81,16 +82,6 @@ def bio_violations(tokens: List[str], labels: List[str]) -> List[dict]:
     for idx, label in enumerate(labels):
         token = tokens[idx] if idx < len(tokens) else None
         if label == "O":
-            if previous_label.startswith("B-"):
-                violations.append(
-                    {
-                        "type": "B_DIRECT_TO_O",
-                        "index": idx,
-                        "prev_label": previous_label,
-                        "label": label,
-                        "token": token,
-                    }
-                )
             current_entity = None
         elif label.startswith("B-"):
             current_entity = entity_type(label)
@@ -124,6 +115,24 @@ def bio_violations(tokens: List[str], labels: List[str]) -> List[dict]:
     return violations
 def spans_from_labels(tokens: List[str], labels: List[str]) -> List[dict]:
     spans: List[dict] = []
     start: Optional[int] = None
@@ -241,7 +250,7 @@ def token_id_stats(samples: List[dict], tokenizer: AnimeTokenizer) -> dict:
     unk = 0
     unk_counter: Counter = Counter()
     for sample in samples:
-        tokens, labels = align_tokens_for_tokenizer(sample["tokens"], sample["labels"], tokenizer)
         ids = tokenizer.convert_tokens_to_ids(tokens)
         for token, token_id in zip(tokens, ids):
             total += 1
@@ -257,13 +266,12 @@ def token_id_stats(samples: List[dict], tokenizer: AnimeTokenizer) -> dict:
 def prepare_inputs(
-    tokens: List[str],
-    labels: List[str],
     tokenizer: AnimeTokenizer,
     label2id: Dict[str, int],
     max_length: int,
 ) -> Tuple[List[int], List[int], List[int], List[str]]:
-    tokens, labels = align_tokens_for_tokenizer(tokens, labels, tokenizer)
     input_ids = tokenizer.convert_tokens_to_ids(tokens)
     input_ids = [tokenizer.cls_token_id] + input_ids + [tokenizer.sep_token_id]
     label_ids = [-100] + [label2id.get(label, 0) for label in labels] + [-100]
@@ -283,6 +291,48 @@ def prepare_inputs(
     return input_ids, attention_mask, label_ids, tokens
 def evaluate_model(
     samples: List[dict],
     model_dir: Path,
@@ -313,12 +363,15 @@ def evaluate_model(
     confusion: Counter = Counter()
     entity_confusion: Counter = Counter()
     boundary_errors: Counter = Counter()
     with torch.no_grad():
         for sample in eval_samples:
-            input_ids, attention_mask, label_ids, _tokens = prepare_inputs(
-                sample["tokens"],
-                sample["labels"],
                 tokenizer,
                 label2id,
                 max_length,
@@ -326,13 +379,17 @@ def evaluate_model(
             input_tensor = torch.tensor([input_ids], dtype=torch.long, device=device)
             mask_tensor = torch.tensor([attention_mask], dtype=torch.long, device=device)
             logits = model(input_ids=input_tensor, attention_mask=mask_tensor).logits
-            pred_ids = torch.argmax(logits, dim=-1)[0].detach().cpu().tolist()
             true_labels: List[str] = []
             pred_labels: List[str] = []
-            for pred_id, label_id in zip(pred_ids, label_ids):
                 if label_id == -100:
                     continue
                 true_label = id2label.get(label_id, "O")
                 pred_label = id2label.get(pred_id, "O")
                 true_labels.append(true_label)
@@ -348,6 +405,57 @@ def evaluate_model(
                         boundary_errors["BIO-prefix"] += 1
             true_sequences.append(true_labels)
             pred_sequences.append(pred_labels)
     errors = confusion.copy()
     for label in set(label for pair in confusion for label in pair):
@@ -364,6 +472,10 @@ def evaluate_model(
             {k: v for k, v in entity_confusion.items() if k[0] != k[1]}
         ).most_common(30),
         "boundary_errors": boundary_errors,
     }
@@ -444,6 +556,7 @@ def main() -> None:
     length_values: List[int] = []
     aligned_length_values: List[int] = []
     violations: List[dict] = []
     mismatch_examples: List[dict] = []
     space_label_counter: Counter = Counter()
     boundary_drift_counter: Counter = Counter()
@@ -472,7 +585,7 @@ def main() -> None:
         label_counter.update(labels)
         length_values.append(len(tokens))
-        aligned_tokens, aligned_labels = align_tokens_for_tokenizer(tokens, labels, tokenizer)
         aligned_length_values.append(len(aligned_tokens))
         if len(aligned_tokens) + 2 > max_length:
             truncation_count += 1
@@ -490,6 +603,17 @@ def main() -> None:
                 }
             )
             violations.append(violation)
         for span in spans_from_labels(tokens, labels):
             text = span["text"]
             if span["type"] == "TITLE":
@@ -594,19 +718,26 @@ def main() -> None:
     )
     violation_counter = Counter(v["type"] for v in violations)
     sections.append(
         (
             "BIO Violations And Boundary Drift",
             "\n".join(
                 [
-                    "### Violation counts",
                     format_counter(violation_counter),
                     "",
                     "### Boundary drift heuristics",
                     format_counter(boundary_drift_counter),
                     "",
                     "### Sample violations",
                     markdown_json(violations[:30]),
                 ]
             ),
         )
@@ -659,6 +790,29 @@ def main() -> None:
             [true, pred, f"{count:,}"]
             for (true, pred), count in model_eval["top_entity_confusions"]
         ]
         sections.append(
             (
                 "Model Confusion Analysis",
@@ -678,6 +832,28 @@ def main() -> None:
                         "### Top entity-type confusions",
                         markdown_table(["true", "pred", "count"], entity_rows) if entity_rows else "- none",
                         "",
                         "### Seqeval report",
                         "```text\n" + model_eval["classification_report"] + "\n```",
                     ]

 from transformers import BertForTokenClassification
 from config import Config
+from dataset import labels_for_tokenizer
+from inference import constrained_bio_decode, postprocess
 from tokenizer import AnimeTokenizer, create_tokenizer, load_tokenizer
     for idx, label in enumerate(labels):
         token = tokens[idx] if idx < len(tokens) else None
         if label == "O":
             current_entity = None
         elif label.startswith("B-"):
             current_entity = entity_type(label)
     return violations
+def bio_boundary_warnings(tokens: List[str], labels: List[str]) -> List[dict]:
+    """Collect legal-but-suspicious boundary patterns separately from BIO errors."""
+    warnings: List[dict] = []
+    for idx, label in enumerate(labels[1:], 1):
+        previous_label = labels[idx - 1]
+        if label == "O" and previous_label.startswith("B-"):
+            warnings.append(
+                {
+                    "type": "SINGLE_TOKEN_ENTITY",
+                    "index": idx,
+                    "prev_label": previous_label,
+                    "label": label,
+                    "token": tokens[idx] if idx < len(tokens) else None,
+                }
+            )
+    return warnings
 def spans_from_labels(tokens: List[str], labels: List[str]) -> List[dict]:
     spans: List[dict] = []
     start: Optional[int] = None
     unk = 0
     unk_counter: Counter = Counter()
     for sample in samples:
+        tokens, _labels = labels_for_tokenizer(sample, tokenizer)
         ids = tokenizer.convert_tokens_to_ids(tokens)
         for token, token_id in zip(tokens, ids):
             total += 1
 def prepare_inputs(
+    sample: dict,
     tokenizer: AnimeTokenizer,
     label2id: Dict[str, int],
     max_length: int,
 ) -> Tuple[List[int], List[int], List[int], List[str]]:
+    tokens, labels = labels_for_tokenizer(sample, tokenizer)
     input_ids = tokenizer.convert_tokens_to_ids(tokens)
     input_ids = [tokenizer.cls_token_id] + input_ids + [tokenizer.sep_token_id]
     label_ids = [-100] + [label2id.get(label, 0) for label in labels] + [-100]
     return input_ids, attention_mask, label_ids, tokens
+def normalize_field_value(field: str, value) -> Optional[str]:
+    if value is None:
+        return None
+    if field in {"episode", "season"}:
+        try:
+            return str(int(value))
+        except (TypeError, ValueError):
+            return str(value).strip().lower()
+    text = str(value).strip()
+    if field in {"resolution", "source"}:
+        return text.lower().replace("_", "-")
+    return re.sub(r"\s+", " ", text).strip().lower()
+def update_parse_metrics(counter: Counter, gold: dict, pred: dict) -> None:
+    fields = ["group", "title", "season", "episode", "resolution", "source", "special"]
+    all_match = True
+    for field in fields:
+        gold_value = normalize_field_value(field, gold.get(field))
+        pred_value = normalize_field_value(field, pred.get(field))
+        if gold_value == pred_value:
+            counter[f"{field}_correct"] += 1
+        else:
+            all_match = False
+            counter[(field, gold_value, pred_value)] += 1
+        counter[f"{field}_total"] += 1
+    if all_match:
+        counter["full_match_correct"] += 1
+    counter["full_match_total"] += 1
+def collect_field_failures(gold: dict, pred: dict) -> Dict[str, Dict[str, Optional[str]]]:
+    return {
+        field: {
+            "gold": normalize_field_value(field, gold.get(field)),
+            "pred": normalize_field_value(field, pred.get(field)),
+        }
+        for field in ["group", "title", "season", "episode", "resolution", "source", "special"]
+        if normalize_field_value(field, gold.get(field)) != normalize_field_value(field, pred.get(field))
+    }
 def evaluate_model(
     samples: List[dict],
     model_dir: Path,
     confusion: Counter = Counter()
     entity_confusion: Counter = Counter()
     boundary_errors: Counter = Counter()
+    parse_metrics: Counter = Counter()
+    parse_metrics_no_rules: Counter = Counter()
+    field_failures: List[dict] = []
+    field_failures_no_rules: List[dict] = []
     with torch.no_grad():
         for sample in eval_samples:
+            input_ids, attention_mask, label_ids, sample_tokens = prepare_inputs(
+                sample,
                 tokenizer,
                 label2id,
                 max_length,
             input_tensor = torch.tensor([input_ids], dtype=torch.long, device=device)
             mask_tensor = torch.tensor([attention_mask], dtype=torch.long, device=device)
             logits = model(input_ids=input_tensor, attention_mask=mask_tensor).logits
+            active_count = sum(1 for label_id in label_ids if label_id != -100)
+            pred_ids = constrained_bio_decode(logits[0, 1:1 + active_count, :], id2label)
             true_labels: List[str] = []
             pred_labels: List[str] = []
+            pred_idx = 0
+            for label_id in label_ids:
                 if label_id == -100:
                     continue
+                pred_id = pred_ids[pred_idx]
+                pred_idx += 1
                 true_label = id2label.get(label_id, "O")
                 pred_label = id2label.get(pred_id, "O")
                 true_labels.append(true_label)
                         boundary_errors["BIO-prefix"] += 1
             true_sequences.append(true_labels)
             pred_sequences.append(pred_labels)
+            active_tokens = sample_tokens[:len(true_labels)]
+            gold_parse = postprocess(
+                active_tokens,
+                true_labels,
+                tokenizer=tokenizer,
+                filename=sample.get("filename"),
+                use_rules=True,
+            )
+            pred_parse = postprocess(
+                active_tokens,
+                pred_labels,
+                tokenizer=tokenizer,
+                filename=sample.get("filename"),
+                use_rules=True,
+            )
+            gold_parse_no_rules = postprocess(
+                active_tokens,
+                true_labels,
+                tokenizer=tokenizer,
+                filename=sample.get("filename"),
+                use_rules=False,
+            )
+            pred_parse_no_rules = postprocess(
+                active_tokens,
+                pred_labels,
+                tokenizer=tokenizer,
+                filename=sample.get("filename"),
+                use_rules=False,
+            )
+            update_parse_metrics(parse_metrics, gold_parse, pred_parse)
+            update_parse_metrics(parse_metrics_no_rules, gold_parse_no_rules, pred_parse_no_rules)
+            failures = collect_field_failures(gold_parse, pred_parse)
+            if failures and len(field_failures) < 30:
+                field_failures.append(
+                    {
+                        "filename": sample.get("filename"),
+                        "errors": failures,
+                        "gold": gold_parse,
+                        "pred": pred_parse,
+                    }
+                )
+            failures_no_rules = collect_field_failures(gold_parse_no_rules, pred_parse_no_rules)
+            if failures_no_rules and len(field_failures_no_rules) < 30:
+                field_failures_no_rules.append(
+                    {
+                        "filename": sample.get("filename"),
+                        "errors": failures_no_rules,
+                        "gold": gold_parse_no_rules,
+                        "pred": pred_parse_no_rules,
+                    }
+                )
     errors = confusion.copy()
     for label in set(label for pair in confusion for label in pair):
             {k: v for k, v in entity_confusion.items() if k[0] != k[1]}
         ).most_common(30),
         "boundary_errors": boundary_errors,
+        "parse_metrics": parse_metrics,
+        "parse_metrics_no_rules": parse_metrics_no_rules,
+        "field_failures": field_failures,
+        "field_failures_no_rules": field_failures_no_rules,
     }
     length_values: List[int] = []
     aligned_length_values: List[int] = []
     violations: List[dict] = []
+    boundary_warnings: List[dict] = []
     mismatch_examples: List[dict] = []
     space_label_counter: Counter = Counter()
     boundary_drift_counter: Counter = Counter()
         label_counter.update(labels)
         length_values.append(len(tokens))
+        aligned_tokens, aligned_labels = labels_for_tokenizer(sample, tokenizer)
         aligned_length_values.append(len(aligned_tokens))
         if len(aligned_tokens) + 2 > max_length:
             truncation_count += 1
                 }
             )
             violations.append(violation)
+        for warning in bio_boundary_warnings(tokens, labels):
+            warning.update(
+                {
+                    "row": row_idx,
+                    "file_id": sample.get("file_id"),
+                    "filename": sample.get("filename"),
+                    "context_tokens": tokens[max(0, warning["index"] - 5):warning["index"] + 6],
+                    "context_labels": labels[max(0, warning["index"] - 5):warning["index"] + 6],
+                }
+            )
+            boundary_warnings.append(warning)
         for span in spans_from_labels(tokens, labels):
             text = span["text"]
             if span["type"] == "TITLE":
     )
     violation_counter = Counter(v["type"] for v in violations)
+    warning_counter = Counter(w["type"] for w in boundary_warnings)
     sections.append(
         (
             "BIO Violations And Boundary Drift",
             "\n".join(
                 [
+                    "### True BIO violation counts",
                     format_counter(violation_counter),
                     "",
+                    "### Legal boundary warning counts",
+                    format_counter(warning_counter),
+                    "",
                     "### Boundary drift heuristics",
                     format_counter(boundary_drift_counter),
                     "",
                     "### Sample violations",
                     markdown_json(violations[:30]),
+                    "",
+                    "### Sample boundary warnings",
+                    markdown_json(boundary_warnings[:30]),
                 ]
             ),
         )
             [true, pred, f"{count:,}"]
             for (true, pred), count in model_eval["top_entity_confusions"]
         ]
+        def parse_metric_tables(metrics: Counter) -> Tuple[List[List[str]], str, List[List[str]]]:
+            field_rows = []
+            for field in ["group", "title", "season", "episode", "resolution", "source", "special"]:
+                total = metrics.get(f"{field}_total", 0)
+                correct = metrics.get(f"{field}_correct", 0)
+                acc = correct / total if total else 0.0
+                field_rows.append([field, f"{correct:,}/{total:,}", f"{acc:.4f}"])
+            full_total = metrics.get("full_match_total", 0)
+            full_correct = metrics.get("full_match_correct", 0)
+            full_acc = full_correct / full_total if full_total else 0.0
+            full_line = f"{full_correct:,}/{full_total:,} ({full_acc:.4f})"
+            error_counter = Counter(
+                {key: count for key, count in metrics.items() if isinstance(key, tuple)}
+            )
+            error_rows = [
+                [field, str(gold), str(pred), f"{count:,}"]
+                for (field, gold, pred), count in error_counter.most_common(30)
+            ]
+            return field_rows, full_line, error_rows
+        rule_field_rows, rule_full_line, rule_error_rows = parse_metric_tables(model_eval["parse_metrics"])
+        ner_field_rows, ner_full_line, ner_error_rows = parse_metric_tables(model_eval["parse_metrics_no_rules"])
         sections.append(
             (
                 "Model Confusion Analysis",
                         "### Top entity-type confusions",
                         markdown_table(["true", "pred", "count"], entity_rows) if entity_rows else "- none",
                         "",
+                        "### Field exact-match accuracy (rule-assisted)",
+                        markdown_table(["field", "correct/total", "accuracy"], rule_field_rows),
+                        "",
+                        f"Rule-assisted full parse exact match: {rule_full_line}",
+                        "",
+                        "### Top rule-assisted field parse errors",
+                        markdown_table(["field", "gold", "pred", "count"], rule_error_rows) if rule_error_rows else "- none",
+                        "",
+                        "### Field exact-match accuracy (NER-only, no rules)",
+                        markdown_table(["field", "correct/total", "accuracy"], ner_field_rows),
+                        "",
+                        f"NER-only full parse exact match: {ner_full_line}",
+                        "",
+                        "### Top NER-only field parse errors",
+                        markdown_table(["field", "gold", "pred", "count"], ner_error_rows) if ner_error_rows else "- none",
+                        "",
+                        "### Hardest sampled parse failures (rule-assisted)",
+                        markdown_json(model_eval["field_failures"][:10]) if model_eval["field_failures"] else "- none",
+                        "",
+                        "### Hardest sampled parse failures (NER-only)",
+                        markdown_json(model_eval["field_failures_no_rules"][:10]) if model_eval["field_failures_no_rules"] else "- none",
+                        "",
                         "### Seqeval report",
                         "```text\n" + model_eval["classification_report"] + "\n```",
                     ]
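
Reviewer note: the `parse_metric_tables` helper in this hunk appears to read one `Counter` that mixes string keys (`<field>_total`, `<field>_correct`, `full_match_total`, `full_match_correct`) with `(field, gold, pred)` tuple keys for individual errors. Below is a minimal standalone sketch of that computation, assuming this key layout. The sample counts are invented for illustration and are not results from this commit.

```python
from collections import Counter
from typing import List, Tuple

FIELDS = ["group", "title", "season", "episode", "resolution", "source", "special"]


def parse_metric_tables(metrics: Counter) -> Tuple[List[List[str]], str, List[List[str]]]:
    """Turn a mixed-key metrics Counter into table rows for a Markdown report."""
    field_rows = []
    for field in FIELDS:
        total = metrics.get(f"{field}_total", 0)
        correct = metrics.get(f"{field}_correct", 0)
        acc = correct / total if total else 0.0  # avoid dividing by zero
        field_rows.append([field, f"{correct:,}/{total:,}", f"{acc:.4f}"])

    full_total = metrics.get("full_match_total", 0)
    full_correct = metrics.get("full_match_correct", 0)
    full_acc = full_correct / full_total if full_total else 0.0
    full_line = f"{full_correct:,}/{full_total:,} ({full_acc:.4f})"

    # Tuple keys are assumed to be (field, gold, pred) error records.
    errors = Counter({k: c for k, c in metrics.items() if isinstance(k, tuple)})
    error_rows = [
        [field, str(gold), str(pred), f"{count:,}"]
        for (field, gold, pred), count in errors.most_common(30)
    ]
    return field_rows, full_line, error_rows


# Hypothetical counts, for illustration only.
demo_metrics = Counter(
    {
        "episode_total": 4,
        "episode_correct": 3,
        "full_match_total": 2,
        "full_match_correct": 1,
        ("episode", "7", "07"): 2,
        ("title", "a", "b"): 1,
    }
)
field_rows, full_line, error_rows = parse_metric_tables(demo_metrics)
```

Fields with no `_total` entry report `0/0` with accuracy `0.0000` rather than raising, which keeps the report renderable when a field never appears in the evaluation set.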

dmhy_dataset.py CHANGED

@@ -19,7 +19,8 @@ from datetime import datetime, timezone
 from pathlib import Path
 from typing import Iterable, List, Optional, Sequence
-from data_generator import assign_bio, categorize_meta_token
 from tokenizer import AnimeTokenizer
@@ -35,8 +36,9 @@ NOISE_BRACKETS = {
     "繁中", "简中", "繁日", "简日", "日语", "日文", "外挂", "内封", "字幕",
 }
-SPECIAL_RE = re.compile(r"^(?:ova|oad|sp|movie|the\s*movie|op|ed|pv|cm|ncop|nced|剧场版|劇場版|特别篇|特別篇)$", re.I)
-EPISODE_RE = re.compile(r"^(?:[Ee][Pp]?|#)?(\d{1,4})(?:v\d+)?$", re.I)
 SEASON_RE = re.compile(
     r"^(?:"
     r"[Ss](\d{1,2})|"
@@ -45,16 +47,28 @@ SEASON_RE = re.compile(
     r"(\d+)(?:st|nd|rd|th)\s+[Ss]eason"
     r")$", re.I
 )
 SXE_RE = re.compile(r"^([Ss]\d{1,2})([Ee]\d{1,4})(?:v\d+)?$")
 DATE_RE = re.compile(r"^(?:19|20)\d{2}[.\-_年]?(?:0?[1-9]|1[0-2])?[.\-_月]?(?:0?[1-9]|[12]\d|3[01])?日?$")
 HASH_RE = re.compile(r"^[A-Fa-f0-9]{8,}$")
 DIMENSION_RE = re.compile(r"^\d{3,4}[xX×]\d{3,4}$")
 RESOLUTION_RE = re.compile(r"^(?:\d{3,4}[pP]|\d[Kk]|\d{3,4}[xX×]\d{3,4})$")
 SOURCE_RE = re.compile(
-    r"^(?:WEB[-_ ]?DL|WEB[-_ ]?Rip|BDRip|BluRay|BDMV|DVDRip|DVD|TVRip|HDTV|"
     r"Netflix|NF|AMZN|Baha|CR|ABEMA|DSNP|U[-_ ]?NEXT|Hulu|AT[-_ ]?X|"
     r"x26[45]|h\.?26[45]|HEVC|AVC|AV1|AAC\d*(?:\.\d+)?|AAC|FLAC|MP3|DTS|Opus|"
-    r"CHS|CHT|BIG5|GB|JPN?|简[体體]?|繁[体體]?|简日双语|繁日双语|内封|外挂|MSubs?)$",
     re.I,
 )
 GROUP_HINT_RE = re.compile(
@@ -112,12 +126,20 @@ def cn_number_to_int(text: str) -> Optional[int]:
 def season_number(token: str) -> Optional[int]:
     clean = clean_bracket(token)
     match = SEASON_RE.match(clean)
-    if not match:
-        return None
-    value = next((g for g in match.groups() if g), None)
-    if value is None:
-        return None
-    return cn_number_to_int(value)
 def episode_number(token: str) -> Optional[int]:
@@ -126,7 +148,13 @@ def episode_number(token: str) -> Optional[int]:
         return None
     if DIMENSION_RE.match(clean) or DATE_RE.match(clean) or HASH_RE.match(clean):
         return None
-    if re.match(r"^第\d{1,4}[话話集]$", clean):
         return int(re.search(r"\d+", clean).group())
     match = EPISODE_RE.match(clean)
     if not match:
@@ -137,8 +165,13 @@ def episode_number(token: str) -> Optional[int]:
     return number
 def is_resolution(token: str) -> bool:
-    return bool(RESOLUTION_RE.match(clean_bracket(token)))
 def is_source(token: str) -> bool:
@@ -149,11 +182,17 @@ def is_source(token: str) -> bool:
         is_resolution(clean) or SOURCE_RE.match(clean)
     ):
         return True
-    return bool(SOURCE_RE.match(clean))
 def is_special(token: str) -> bool:
-    return bool(SPECIAL_RE.match(clean_bracket(token)))
 def is_noise_bracket(token: str) -> bool:
@@ -194,7 +233,7 @@ def is_title_token(token: str) -> bool:
         return False
     if is_resolution(clean) or is_source(clean) or is_special(clean):
         return False
-    if season_number(clean) is not None or episode_number(clean) is not None:
         return False
     if DATE_RE.match(clean) or HASH_RE.match(clean):
         return False
@@ -221,9 +260,13 @@ def find_episode_index(tokens: Sequence[str]) -> Optional[int]:
         number = episode_number(token)
         if number is None:
             continue
-        score = 0
         clean = clean_bracket(token)
-        if re.match(r"^(?:[Ee][Pp]?|#|第)", clean, re.I):
             score += 4
         if token.startswith("[") or token.startswith("(") or token.startswith("【"):
             score += 3
@@ -239,12 +282,317 @@ def find_episode_index(tokens: Sequence[str]) -> Optional[int]:
     return max(candidates, key=lambda item: (item[0], item[1]))[1]
 def label_bracket_contents(token: str, category: str, tokenizer: AnimeTokenizer) -> tuple[List[str], List[str]]:
     inner = clean_bracket(token)
     if not inner:
         return [token], [category]
-    open_char = token[0] if token[0] in "[【(《" else ""
-    close_char = token[-1] if token[-1] in "]】)》" else ""
     inner_tokens = tokenizer.tokenize(inner)
     tokens: List[str] = []
     cats: List[str] = []
@@ -259,6 +607,38 @@ def label_bracket_contents(token: str, category: str, tokenizer: AnimeTokenizer)
     return tokens, cats
 def expand_tokens_and_categories(
     tokens: Sequence[str],
     categories: Sequence[str],
@@ -281,15 +661,34 @@ def expand_tokens_and_categories(
             expanded_tokens.extend(split_tokens)
             expanded_categories.extend(split_categories)
             continue
         expanded_tokens.append(token)
         expanded_categories.append(category)
     return expanded_tokens, expanded_categories
 def weak_label_filename(filename: str, tokenizer: AnimeTokenizer) -> Optional[dict]:
     tokens = tokenizer.tokenize(filename)
     if not tokens:
         return None
     categories = ["sep" if token in {" ", "-", "_", "|", "~", "～", "."} else "title" for token in tokens]
@@ -306,15 +705,16 @@ def weak_label_filename(filename: str, tokenizer: AnimeTokenizer) -> Optional[di
             categories[idx] = "source"
         elif is_special(token):
             categories[idx] = "special"
-        elif season_number(token) is not None:
             categories[idx] = "season"
         elif is_noise_bracket(token):
             categories[idx] = "sep"
     episode_idx = find_episode_index(tokens)
     if episode_idx is None:
-        return None
     categories[episode_idx] = "episode"
     # S01E07 is tokenized as S01 + E07 after tokenizer changes. If an older
     # token slips through, expand_tokens_and_categories will split it.
@@ -341,7 +741,11 @@ def weak_label_filename(filename: str, tokenizer: AnimeTokenizer) -> Optional[di
         title_start += 1
     title_start, title_end = trim_title_span(tokens, title_start, title_end)
     if title_start >= title_end:
-        return None
     for idx, token in enumerate(tokens):
         if title_start <= idx < title_end:
@@ -351,28 +755,13 @@ def weak_label_filename(filename: str, tokenizer: AnimeTokenizer) -> Optional[di
             categories[idx] = "sep"
     if not any(cat == "title" for cat in categories) or not any(cat == "episode" for cat in categories):
-        return None
-    # Expand bracket content for group/title tokens (e.g. [剑来 第2季] →
-    # [, 剑, 来,  , 第2季, ]) so that season markers mixed with title text
-    # inside a bracket can be detected as separate tokens.
-    expanded_tokens, expanded_categories = expand_tokens_and_categories(
-        tokens, categories, tokenizer
-    )
-    # Re-detect season markers in expanded tokens (bracket expansion exposes
-    # patterns like 第2季 that were previously hidden inside mixed brackets).
-    for idx in range(len(expanded_tokens)):
-        cat = expanded_categories[idx]
-        if cat not in {"sep", "episode", "group", "source", "resolution",
-                        "special", "season"}:
-            if season_number(expanded_tokens[idx]) is not None:
-                expanded_categories[idx] = "season"
-    labels = assign_bio(expanded_tokens, expanded_categories)
-    if len(expanded_tokens) != len(labels):
-        return None
-    return {"tokens": expanded_tokens, "labels": labels}
 def iter_db_rows(db_path: Path, min_id: int, max_id: int) -> Iterable[tuple[int, str]]:

 from pathlib import Path
 from typing import Iterable, List, Optional, Sequence
+from data_generator import LABEL_MAP, categorize_meta_token
+from label_repairs import season_marker_number
 from tokenizer import AnimeTokenizer
     "繁中", "简中", "繁日", "简日", "日语", "日文", "外挂", "内封", "字幕",
 }
+SPECIAL_RE = re.compile(r"^(?:ova\d*|oad\d*|sp\d*|movie|the\s*movie|op|ed|pv|cm|ncop|nced|剧场版|劇場版|特别篇|特別篇)$", re.I)
+SPECIAL_SEARCH_RE = re.compile(r"^(?:檢索|检索|搜索|搜寻|搜尋|别名|別名|alias|search|keyword)\s*[:：].+", re.I)
+EPISODE_RE = re.compile(r"^(?:[Ee][Pp]?|#)?(\d{1,4})(?:v\d+|END)?$", re.I)
 SEASON_RE = re.compile(
     r"^(?:"
     r"[Ss](\d{1,2})|"
     r"(\d+)(?:st|nd|rd|th)\s+[Ss]eason"
     r")$", re.I
 )
+READING_SEASON_RE = re.compile(
+    r"^(?:Ni\s+no\s+(?:Sara|Shou|Sho|Syo|Shō)|Ni\s+Gakki|Sono\s+Ni|"
+    r"San\s+no\s+(?:Sara|Shou|Sho|Syo)|(?:Yon|Shi|Shin)\s+no\s+Sara|"
+    r"(?:Go|Gou)\s+no\s+Sara)$",
+    re.I,
+)
+CJK_SEQUEL_SEASON_RE = re.compile(
+    r"^(?:[一二三四五六七八九十兩两貳贰弐弍參叁参肆伍陸陆柒捌玖](?:\s*(?:ノ|の|之)\s*(?:章|期|季|部))?|"
+    r"[ⅡⅢⅣⅤⅥⅦⅧⅨ]|II|III|IV|V|VI|VII|VIII|IX)$",
+    re.I,
+)
 SXE_RE = re.compile(r"^([Ss]\d{1,2})([Ee]\d{1,4})(?:v\d+)?$")
 DATE_RE = re.compile(r"^(?:19|20)\d{2}[.\-_年]?(?:0?[1-9]|1[0-2])?[.\-_月]?(?:0?[1-9]|[12]\d|3[01])?日?$")
 HASH_RE = re.compile(r"^[A-Fa-f0-9]{8,}$")
 DIMENSION_RE = re.compile(r"^\d{3,4}[xX×]\d{3,4}$")
 RESOLUTION_RE = re.compile(r"^(?:\d{3,4}[pP]|\d[Kk]|\d{3,4}[xX×]\d{3,4})$")
+RESOLUTION_SEARCH_RE = re.compile(r"(?<![A-Za-z0-9])(?:\d{3,4}[pP]|\d[Kk]|\d{3,4}[xX×]\d{3,4})(?![A-Za-z0-9])")
 SOURCE_RE = re.compile(
+    r"^(?:WEB[-_ ]?DL|WEB[-_ ]?Rip|BDRip|BluRay|BDMV|BD|DVDRip|DVD|TVRip|HDTV|"
     r"Netflix|NF|AMZN|Baha|CR|ABEMA|DSNP|U[-_ ]?NEXT|Hulu|AT[-_ ]?X|"
     r"x26[45]|h\.?26[45]|HEVC|AVC|AV1|AAC\d*(?:\.\d+)?|AAC|FLAC|MP3|DTS|Opus|"
+    r"CHS|CHT|BIG5|GB|JPN?|JPSC|JPTC|简[体體]?|繁[体體]?|简日双语|繁日双语|内封|外挂|MSubs?)$",
     re.I,
 )
 GROUP_HINT_RE = re.compile(
 def season_number(token: str) -> Optional[int]:
     clean = clean_bracket(token)
     match = SEASON_RE.match(clean)
+    if match:
+        value = next((g for g in match.groups() if g), None)
+        if value is None:
+            return None
+        return cn_number_to_int(value)
+    if READING_SEASON_RE.match(clean) or CJK_SEQUEL_SEASON_RE.match(clean):
+        return season_marker_number(clean)
+    return None
+def is_explicit_season(token: str) -> bool:
+    """Return True for unambiguous season syntax such as S02 or 第2季."""
+    clean = clean_bracket(token)
+    return bool(SEASON_RE.match(clean))
 def episode_number(token: str) -> Optional[int]:
         return None
     if DIMENSION_RE.match(clean) or DATE_RE.match(clean) or HASH_RE.match(clean):
         return None
+    if re.match(r"^第\d{1,4}(?:\(\d{1,4}\))?[话話集]$", clean):
+        return int(re.search(r"\d+", clean).group())
+    if re.match(r"^(?:OVA|OAD|SP)\d{1,4}$", clean, re.I):
+        return int(re.search(r"\d+", clean).group())
+    if re.match(r"^\d{1,4}\s*END$", clean, re.I):
+        return int(re.search(r"\d+", clean).group())
+    if re.match(r"^\d{1,4}[._]\d+$", clean):
         return int(re.search(r"\d+", clean).group())
     match = EPISODE_RE.match(clean)
     if not match:
     return number
+def has_wrapping_brackets(token: str) -> bool:
+    return len(token) >= 2 and token[0] in "[【(《" and token[-1] in "]】)》"
 def is_resolution(token: str) -> bool:
+    clean = clean_bracket(token)
+    return bool(RESOLUTION_RE.match(clean) or (has_wrapping_brackets(token) and RESOLUTION_SEARCH_RE.search(clean)))
 def is_source(token: str) -> bool:
         is_resolution(clean) or SOURCE_RE.match(clean)
     ):
         return True
+    if SOURCE_RE.match(clean):
+        return True
+    if has_wrapping_brackets(token):
+        parts = [part for part in re.split(r"[\s&+/,._-]+", clean) if part]
+        return bool(parts) and all(SOURCE_RE.match(part) or is_noise_bracket(part) for part in parts)
+    return False
 def is_special(token: str) -> bool:
+    clean = clean_bracket(token)
+    return bool(SPECIAL_RE.match(clean) or SPECIAL_SEARCH_RE.match(clean))
 def is_noise_bracket(token: str) -> bool:
         return False
     if is_resolution(clean) or is_source(clean) or is_special(clean):
         return False
+    if is_explicit_season(clean) or episode_number(clean) is not None:
         return False
     if DATE_RE.match(clean) or HASH_RE.match(clean):
         return False
         number = episode_number(token)
         if number is None:
             continue
         clean = clean_bracket(token)
+        if idx > 0 and tokens[idx - 1] == "." and re.fullmatch(r"\d+", clean):
+            previous_clean = clean_bracket(tokens[idx - 2]) if idx >= 2 else ""
+            if previous_clean.lower() in VIDEO_EXTENSIONS or f".{clean}".lower() in VIDEO_EXTENSIONS:
+                continue
+        score = 0
+        if re.match(r"^(?:[Ee][Pp]?|#|第|OVA|OAD|SP)", clean, re.I):
             score += 4
         if token.startswith("[") or token.startswith("(") or token.startswith("【"):
             score += 3
     return max(candidates, key=lambda item: (item[0], item[1]))[1]
+def is_separator_token(token: str) -> bool:
+    return token in {" ", "-", "_", "|", "~", "～", ".", "+", "&", "/", ","}
+def has_only_separators_between(tokens: Sequence[str], start: int, end: int) -> bool:
+    return all(is_separator_token(token) for token in tokens[start:end])
+def is_context_season_token(tokens: Sequence[str], idx: int, episode_idx: int) -> bool:
+    """Detect compact season markers only when they structurally lead into an episode."""
+    if idx >= episode_idx:
+        return False
+    token = tokens[idx]
+    clean = clean_bracket(token)
+    if not clean:
+        return False
+    if is_explicit_season(clean):
+        return True
+    if season_number(clean) is None:
+        return False
+    if not has_only_separators_between(tokens, idx + 1, episode_idx):
+        return False
+    # A bare V is often the volume prefix in V02E01, not season five.
+    if clean.upper() == "V":
+        return False
+    return True
+def label_context_season_tokens(
+    tokens: Sequence[str],
+    categories: List[str],
+    episode_idx: int,
+) -> None:
+    if (
+        episode_idx >= 2
+        and clean_bracket(tokens[episode_idx]).upper().startswith("E")
+        and clean_bracket(tokens[episode_idx - 2]).upper() == "V"
+        and clean_bracket(tokens[episode_idx - 1]).isdigit()
+    ):
+        categories[episode_idx - 2] = "season"
+        categories[episode_idx - 1] = "season"
+        return
+    for idx in range(episode_idx):
+        if categories[idx] in {"group", "episode", "resolution", "source", "special"}:
+            continue
+        if is_context_season_token(tokens, idx, episode_idx):
+            categories[idx] = "season"
+def embedded_bracket_episode(token: str) -> Optional[tuple[str, str, str]]:
+    """Split malformed tokens such as '[Group}Title[658]' into title + episode."""
+    if episode_number(token) is not None:
+        return None
+    match = re.match(r"^(?P<prefix>.+?)\[(?P<episode>\d{1,4}(?:v\d+)?)(?P<close>\])?$", token, re.I)
+    if match is None and has_wrapping_brackets(token):
+        match = re.match(r"^(?P<prefix>.+?)(?P<episode>\d{2,4})(?P<close>[\]\)】》])$", token, re.I)
+    if not match:
+        return None
+    prefix = match.group("prefix")
+    episode = match.group("episode")
+    close = match.group("close") or ""
+    if not clean_bracket(prefix):
+        return None
+    number = int(re.search(r"\d+", episode).group())
+    if number == 0 or number > 2000:
+        return None
+    return prefix, episode, close
+def append_tokenized_category(
+    tokens: List[str],
+    categories: List[str],
+    text: str,
+    category: str,
+    tokenizer: AnimeTokenizer,
+) -> None:
+    for piece in tokenizer.tokenize(text):
+        if not piece:
+            continue
+        if is_separator_token(piece) or piece in {"[", "]", "(", ")", "【", "】", "《", "》"}:
+            piece_category = "sep"
+        else:
+            piece_category = category
+        tokens.append(piece)
+        categories.append(piece_category)
+def finalize_weak_sample(
+    tokens: Sequence[str],
+    categories: Sequence[str],
+    tokenizer: AnimeTokenizer,
+    require_episode: bool = True,
+) -> Optional[dict]:
+    expanded_tokens, expanded_categories = expand_tokens_and_categories(tokens, categories, tokenizer)
+    # Only unambiguous season forms are promoted here. Compact sequel markers
+    # such as 貳, II, or Ni no Sara need episode context and are repaired by
+    # label_repairs from character spans; treating every single CJK numeral as
+    # season would corrupt titles like 魯邦三世.
+    for idx, token in enumerate(expanded_tokens):
+        if expanded_categories[idx] in {"sep", "episode", "group", "source", "resolution", "special", "season"}:
+            continue
+        if is_explicit_season(token):
+            expanded_categories[idx] = "season"
+    labels = assign_iob2(expanded_categories)
+    if len(expanded_tokens) != len(labels):
+        return None
+    if not any(label.endswith("TITLE") for label in labels):
+        return None
+    if require_episode and not any(label.endswith("EPISODE") for label in labels):
+        return None
+    return {"tokens": expanded_tokens, "labels": labels}
+def assign_iob2(categories: Sequence[str]) -> List[str]:
+    labels: List[str] = []
+    previous_entity: Optional[str] = None
+    for category in categories:
+        entity = LABEL_MAP.get(category, "O")
+        if entity == "O":
+            labels.append("O")
+            previous_entity = None
+            continue
+        prefix = "I" if previous_entity == entity else "B"
+        labels.append(f"{prefix}-{entity}")
+        previous_entity = entity
+    return labels
+def fallback_embedded_episode_sample(
+    tokens: Sequence[str],
+    tokenizer: AnimeTokenizer,
+) -> Optional[dict]:
+    rebuilt_tokens: List[str] = []
+    rebuilt_categories: List[str] = []
+    used_episode = False
+    for token in tokens:
+        embedded = embedded_bracket_episode(token)
+        if embedded and not used_episode:
+            prefix, episode, close = embedded
+            append_tokenized_category(rebuilt_tokens, rebuilt_categories, prefix, "title", tokenizer)
+            rebuilt_tokens.append(episode)
+            rebuilt_categories.append("episode")
+            if close:
+                rebuilt_tokens.append(close)
+                rebuilt_categories.append("sep")
+            used_episode = True
+            continue
+        if not used_episode:
+            category = "sep" if is_separator_token(token) else "title"
+        elif is_resolution(token):
+            category = "resolution"
+        elif is_source(token):
+            category = "source"
+        elif is_special(token):
+            category = "special"
+        else:
+            category = "sep"
+        rebuilt_tokens.append(token)
+        rebuilt_categories.append(category)
+    if not used_episode:
+        return None
+    return finalize_weak_sample(rebuilt_tokens, rebuilt_categories, tokenizer)
+def has_embedded_episode_candidate(tokens: Sequence[str]) -> bool:
+    return any(embedded_bracket_episode(token) is not None for token in tokens)
+def fallback_episode_first_sample(
+    tokens: Sequence[str],
+    categories: Sequence[str],
+    episode_idx: int,
+    tokenizer: AnimeTokenizer,
+) -> Optional[dict]:
+    fallback_categories = ["sep"] * len(tokens)
+    # V02E01-style catalog rows are episode-first. The tokenizer currently
+    # exposes them as V, 02, E01, so keep V02 together as a season span.
+    if (
+        episode_idx >= 2
+        and clean_bracket(tokens[episode_idx]).upper().startswith("E")
+        and clean_bracket(tokens[episode_idx - 2]).upper() == "V"
+        and clean_bracket(tokens[episode_idx - 1]).isdigit()
+    ):
+        fallback_categories[episode_idx - 2] = "season"
+        fallback_categories[episode_idx - 1] = "season"
+    else:
+        label_context_season_tokens(tokens, fallback_categories, episode_idx)
+    fallback_categories[episode_idx] = "episode"
+    title_indices: List[int] = []
+    for idx in range(episode_idx + 1, len(tokens)):
+        token = tokens[idx]
+        if is_separator_token(token):
+            continue
+        if is_resolution(token) or is_source(token) or is_special(token) or is_noise_bracket(token):
+            fallback_categories[idx] = "resolution" if is_resolution(token) else "source" if is_source(token) else "special" if is_special(token) else "sep"
+            continue
+        title_indices.append(idx)
+    if not title_indices:
+        # Some rows are title-only brackets followed by season/episode,
+        # e.g. [伊蘇] II-01. If the leading bracket was guessed as GROUP but
+        # no real title exists, use it as TITLE to keep the row useful.
+        for idx in range(episode_idx):
+            if categories[idx] == "group" and clean_bracket(tokens[idx]):
+                title_indices.append(idx)
+                break
+    for idx in title_indices:
+        fallback_categories[idx] = "title"
+    if title_indices:
+        for idx in range(title_indices[0], title_indices[-1] + 1):
+            if is_separator_token(tokens[idx]):
+                fallback_categories[idx] = "title"
+    return finalize_weak_sample(tokens, fallback_categories, tokenizer)
+def fallback_minimal_sample(
+    tokens: Sequence[str],
+    episode_idx: int,
+    tokenizer: AnimeTokenizer,
+) -> Optional[dict]:
+    """Keep malformed low-information rows instead of silently dropping them."""
+    categories: List[str] = []
+    title_idx: Optional[int] = None
+    for idx, token in enumerate(tokens):
+        if idx == episode_idx:
+            categories.append("episode")
+        elif is_resolution(token):
+            categories.append("resolution")
+        elif is_source(token):
+            categories.append("source")
+        elif is_special(token):
+            categories.append("special")
+            if title_idx is None:
+                title_idx = idx
+        else:
+            categories.append("sep")
+    if title_idx is None:
+        for idx, token in enumerate(tokens):
+            if idx == episode_idx or is_separator_token(token):
+                continue
+            if categories[idx] not in {"resolution", "source"}:
+                title_idx = idx
+                break
+    if title_idx is None:
+        return None
+    categories[title_idx] = "title"
+    return finalize_weak_sample(tokens, categories, tokenizer)
+def fallback_no_episode_sample(tokens: Sequence[str], tokenizer: AnimeTokenizer) -> Optional[dict]:
+    """Label movies, OP/ED/SP, and malformed rows that have no true episode token."""
+    categories: List[str] = []
+    seen_title = False
+    title_allowed = True
+    for idx, token in enumerate(tokens):
+        if is_separator_token(token):
+            categories.append("title" if seen_title and title_allowed else "sep")
+            continue
+        if idx == 0 and is_group_bracket(token, idx, tokens):
+            categories.append("group")
+            continue
+        if is_resolution(token):
+            categories.append("resolution")
+            title_allowed = False
+            continue
+        if is_source(token):
+            categories.append("source")
+            title_allowed = False
+            continue
+        if is_special(token):
+            categories.append("special")
+            title_allowed = False
+            continue
+        if is_noise_bracket(token):
+            categories.append("sep")
+            continue
+        categories.append("title")
+        seen_title = True
+    return finalize_weak_sample(tokens, categories, tokenizer, require_episode=False)
+def bracket_delimiters(token: str) -> tuple[str, str]:
+    open_char = token[0] if token and token[0] in "[【(《" else ""
+    close_char = token[-1] if token and token[-1] in "]】)》" else ""
+    return open_char, close_char
 def label_bracket_contents(token: str, category: str, tokenizer: AnimeTokenizer) -> tuple[List[str], List[str]]:
     inner = clean_bracket(token)
     if not inner:
         return [token], [category]
+    open_char, close_char = bracket_delimiters(token)
     inner_tokens = tokenizer.tokenize(inner)
     tokens: List[str] = []
     cats: List[str] = []
     return tokens, cats
+def label_meta_bracket_contents(token: str, tokenizer: AnimeTokenizer) -> tuple[List[str], List[str]]:
+    inner = clean_bracket(token)
+    if not inner:
+        return [token], ["sep"]
+    open_char, close_char = bracket_delimiters(token)
+    inner_tokens = tokenizer.tokenize(inner)
+    tokens: List[str] = []
+    cats: List[str] = []
+    if open_char:
+        tokens.append(open_char)
+        cats.append("sep")
+    for inner_token in inner_tokens:
+        if inner_token in {" ", "-", "_", "|", "~", "～", ".", "+", "&", "/", ","}:
+            cat = "sep"
+        elif is_resolution(inner_token) or RESOLUTION_SEARCH_RE.fullmatch(inner_token):
+            cat = "resolution"
+        elif is_source(inner_token):
+            cat = "source"
+        elif is_special(inner_token):
+            cat = "special"
+        elif is_noise_bracket(inner_token):
+            cat = "sep"
+        else:
+            cat = "sep"
+        tokens.append(inner_token)
+        cats.append(cat)
+    if close_char:
+        tokens.append(close_char)
+        cats.append("sep")
+    return tokens, cats
 def expand_tokens_and_categories(
     tokens: Sequence[str],
     categories: Sequence[str],
             expanded_tokens.extend(split_tokens)
             expanded_categories.extend(split_categories)
             continue
+        if category in {"source", "resolution", "special", "sep"} and (
+            token.startswith("[") or token.startswith("(") or token.startswith("【") or token.startswith("《")
+        ):
+            split_tokens, split_categories = label_meta_bracket_contents(token, tokenizer)
+            if any(cat != "sep" for cat in split_categories):
+                expanded_tokens.extend(split_tokens)
+                expanded_categories.extend(split_categories)
+                continue
         expanded_tokens.append(token)
         expanded_categories.append(category)
     return expanded_tokens, expanded_categories
 def weak_label_filename(filename: str, tokenizer: AnimeTokenizer) -> Optional[dict]:
+    basename = normalize_path_basename(str(filename))
+    stem, ext = strip_video_extension(basename)
+    if ext in VIDEO_EXTENSIONS:
+        filename = stem
+    else:
+        filename = basename
     tokens = tokenizer.tokenize(filename)
     if not tokens:
         return None
+    if has_embedded_episode_candidate(tokens):
+        embedded_sample = fallback_embedded_episode_sample(tokens, tokenizer)
+        if embedded_sample is not None:
+            return embedded_sample
     categories = ["sep" if token in {" ", "-", "_", "|", "~", "～", "."} else "title" for token in tokens]
             categories[idx] = "source"
         elif is_special(token):
             categories[idx] = "special"
+        elif is_explicit_season(token):
             categories[idx] = "season"
         elif is_noise_bracket(token):
             categories[idx] = "sep"
     episode_idx = find_episode_index(tokens)
     if episode_idx is None:
+        return fallback_embedded_episode_sample(tokens, tokenizer) or fallback_no_episode_sample(tokens, tokenizer)
     categories[episode_idx] = "episode"
+    label_context_season_tokens(tokens, categories, episode_idx)
     # S01E07 is tokenized as S01 + E07 after tokenizer changes. If an older
     # token slips through, expand_tokens_and_categories will split it.
         title_start += 1
     title_start, title_end = trim_title_span(tokens, title_start, title_end)
     if title_start >= title_end:
+        return fallback_embedded_episode_sample(tokens, tokenizer) or fallback_episode_first_sample(
+            tokens, categories, episode_idx, tokenizer
+        ) or fallback_minimal_sample(
+            tokens, episode_idx, tokenizer
+        )
     for idx, token in enumerate(tokens):
         if title_start <= idx < title_end:
             categories[idx] = "sep"
     if not any(cat == "title" for cat in categories) or not any(cat == "episode" for cat in categories):
+        return fallback_embedded_episode_sample(tokens, tokenizer) or fallback_episode_first_sample(
+            tokens, categories, episode_idx, tokenizer
+        ) or fallback_minimal_sample(
+            tokens, episode_idx, tokenizer
+        )
+    return finalize_weak_sample(tokens, categories, tokenizer)
 def iter_db_rows(db_path: Path, min_id: int, max_id: int) -> Iterable[tuple[int, str]]:
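
Reviewer note: the new `assign_iob2` replaces the imported `assign_bio` and derives B-/I- prefixes purely from the category sequence. The sketch below reproduces that logic in isolation. The `LABEL_MAP` shown here is an assumption (the real one is imported from `data_generator` and is not part of this diff); only the upper-case `TITLE` and `EPISODE` suffixes are confirmed by `finalize_weak_sample`.

```python
from typing import Dict, List, Optional, Sequence

# Assumed mapping for illustration; the real LABEL_MAP lives in data_generator.py.
LABEL_MAP: Dict[str, str] = {
    "group": "GROUP",
    "title": "TITLE",
    "season": "SEASON",
    "episode": "EPISODE",
    "resolution": "RESOLUTION",
    "source": "SOURCE",
    "special": "SPECIAL",
}


def assign_iob2(categories: Sequence[str]) -> List[str]:
    """Convert per-token categories to IOB2 tags; unmapped categories become O."""
    labels: List[str] = []
    previous_entity: Optional[str] = None
    for category in categories:
        entity = LABEL_MAP.get(category, "O")
        if entity == "O":
            labels.append("O")
            previous_entity = None  # an O token always closes the open span
            continue
        prefix = "I" if previous_entity == entity else "B"
        labels.append(f"{prefix}-{entity}")
        previous_entity = entity
    return labels


tagged = assign_iob2(["group", "sep", "title", "title", "sep", "episode"])
split_title = assign_iob2(["title", "sep", "title"])
```

Because any `sep` token resets the span, a separator left inside a title yields two separate `B-TITLE` spans. That is presumably why `fallback_episode_first_sample` relabels separators between the first and last title token as `title` before calling `finalize_weak_sample`.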

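Reviewer note: `episode_number` gains several token shapes in this commit (`第12(24)話`, `OVA2`, `13END`, `12.5`). The sketch below exercises just those patterns, copied from the diff. It is a simplification: the real function first strips brackets via `clean_bracket`, rejects dimension, date, and hash tokens, and applies checks that are elided in this hunk, none of which are reproduced here. The name `episode_number_sketch` is hypothetical.

```python
import re
from typing import Optional

# Patterns copied from the new side of the diff.
EPISODE_RE = re.compile(r"^(?:[Ee][Pp]?|#)?(\d{1,4})(?:v\d+|END)?$", re.I)
SPECIAL_FORMS = [
    r"^第\d{1,4}(?:\(\d{1,4}\))?[话話集]$",  # 第12话, 第12(24)話
    r"^(?:OVA|OAD|SP)\d{1,4}$",              # OVA2, SP3
    r"^\d{1,4}\s*END$",                      # 13END, 13 END
    r"^\d{1,4}[._]\d+$",                     # 12.5, 12_5
]


def episode_number_sketch(clean: str) -> Optional[int]:
    """Return the leading episode number for a bracket-stripped token, else None."""
    for pattern in SPECIAL_FORMS:
        if re.match(pattern, clean, re.I):
            # As in the diff, only the first run of digits is kept.
            return int(re.search(r"\d+", clean).group())
    match = EPISODE_RE.match(clean)
    return int(match.group(1)) if match else None


examples = {
    token: episode_number_sketch(token)
    for token in ["第12话", "第12(24)話", "OVA2", "13 END", "12.5", "EP07v2", "1080p"]
}
```

Note that fractional tokens such as `12.5` collapse to `12` because only the first digit run is read, and that without the date guard a token like `2024.05` would also match the last pattern.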
evaluate_parser_cases.py ADDED

@@ -0,0 +1,163 @@

+"""Evaluate parser checkpoints on fixed real-world filename cases."""
+import argparse
+import json
+import os
+from typing import Dict, List, Optional
+import torch
+from transformers import BertForTokenClassification
+from config import Config
+from inference import parse_filename
+from tokenizer import load_tokenizer
+DEFAULT_CASE_FILE = os.path.join("data", "parser_regression_cases.json")
+def normalize_field_value(field: str, value) -> Optional[str]:
+    if value is None:
+        return None
+    if field in {"episode", "season"}:
+        try:
+            return str(int(value))
+        except (TypeError, ValueError):
+            return str(value).strip().lower()
+    text = str(value).strip()
+    if field in {"resolution", "source"}:
+        return text.lower().replace("_", "-")
+    return " ".join(text.lower().split())
+def load_cases(path: str) -> List[Dict]:
+    with open(path, "r", encoding="utf-8") as f:
+        cases = json.load(f)
+    if not isinstance(cases, list):
+        raise ValueError(f"{path} must contain a JSON list")
+    return cases
+def evaluate_cases(
+    model_dir: str,
+    case_file: str,
+    tokenizer_variant: Optional[str],
+    max_length: Optional[int],
+    use_rules: bool,
+    constrain_bio: bool,
+) -> Dict:
+    cfg = Config()
+    tokenizer = load_tokenizer(model_dir, tokenizer_variant)
+    model = BertForTokenClassification.from_pretrained(model_dir)
+    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    model.to(device)
+    model.eval()
+    id2label = {int(k): v for k, v in getattr(model.config, "id2label", cfg.id2label).items()}
+    resolved_max_length = max_length or int(getattr(model.config, "max_seq_length", 64))
+    cases = load_cases(case_file)
+    field_totals: Dict[str, int] = {}
+    field_correct: Dict[str, int] = {}
+    results = []
+    full_correct = 0
+    for case in cases:
+        expected = case.get("expected", {})
+        pred = parse_filename(
+            case["filename"],
+            model,
+            tokenizer,
+            id2label,
+            max_length=resolved_max_length,
+            debug=False,
+            use_rules=use_rules,
+            constrain_bio=constrain_bio,
+        )
+        errors = {}
+        for field, expected_value in expected.items():
+            field_totals[field] = field_totals.get(field, 0) + 1
+            expected_norm = normalize_field_value(field, expected_value)
+            pred_norm = normalize_field_value(field, pred.get(field))
+            if expected_norm == pred_norm:
+                field_correct[field] = field_correct.get(field, 0) + 1
+            else:
+                errors[field] = {
+                    "expected": expected_value,
+                    "pred": pred.get(field),
+                }
+        if not errors:
+            full_correct += 1
+        results.append(
+            {
+                "id": case.get("id"),
+                "filename": case["filename"],
+                "ok": not errors,
+                "errors": errors,
+                "expected": expected,
+                "pred": {field: pred.get(field) for field in sorted(expected)},
+            }
+        )
+    field_accuracy = {
+        field: field_correct.get(field, 0) / total
+        for field, total in sorted(field_totals.items())
+    }
+    return {
+        "model_dir": model_dir,
+        "case_file": case_file,
+        "tokenizer_variant": getattr(tokenizer, "tokenizer_variant", "regex"),
+        "max_length": resolved_max_length,
+        "use_rules": use_rules,
+        "constrain_bio": constrain_bio,
+        "case_count": len(cases),
+        "full_correct": full_correct,
+        "full_accuracy": full_correct / len(cases) if cases else 0.0,
+        "field_correct": field_correct,
+        "field_total": field_totals,
+        "field_accuracy": field_accuracy,
+        "failures": [result for result in results if not result["ok"]],
+        "results": results,
+    }
+def main() -> None:
+    parser = argparse.ArgumentParser(description="Evaluate parser on fixed filename regression cases")
+    parser.add_argument("--model-dir", required=True)
+    parser.add_argument("--case-file", default=DEFAULT_CASE_FILE)
+    parser.add_argument("--tokenizer", choices=["regex", "char"], default=None)
+    parser.add_argument("--max-length", type=int, default=None)
+    parser.add_argument("--output", default=None, help="Optional JSON output path")
+    parser.add_argument("--no-rule-assist", action="store_true")
+    parser.add_argument("--no-constrained-bio", action="store_true")
+    args = parser.parse_args()
+    metrics = evaluate_cases(
+        model_dir=args.model_dir,
+        case_file=args.case_file,
+        tokenizer_variant=args.tokenizer,
+        max_length=args.max_length,
+        use_rules=not args.no_rule_assist,
+        constrain_bio=not args.no_constrained_bio,
+    )
+    print(
+        f"Full case accuracy: {metrics['full_correct']}/{metrics['case_count']} "
+        f"({metrics['full_accuracy']:.4f})"
+    )
+    for field, total in metrics["field_total"].items():
+        correct = metrics["field_correct"].get(field, 0)
+        print(f"  {field}: {correct}/{total} ({correct / total:.4f})")
+    if metrics["failures"]:
+        print("\nFailures:")
+        for failure in metrics["failures"]:
+            print(json.dumps(failure, ensure_ascii=False))
+    if args.output:
+        os.makedirs(os.path.dirname(args.output) or ".", exist_ok=True)
+        with open(args.output, "w", encoding="utf-8") as f:
+            json.dump(metrics, f, ensure_ascii=False, indent=2)
+if __name__ == "__main__":
+    main()
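
For reference, a minimal sketch of the case-file shape that `load_cases` and `evaluate_cases` appear to expect, plus the effect of `normalize_field_value` on a few values. The function body is copied from the file above; the sample case (its `id`, filename, and expected fields) is hypothetical and is not taken from `data/parser_regression_cases.json`.

```python
import json
from typing import Optional


def normalize_field_value(field: str, value) -> Optional[str]:
    # Copied from evaluate_parser_cases.py above.
    if value is None:
        return None
    if field in {"episode", "season"}:
        try:
            return str(int(value))
        except (TypeError, ValueError):
            return str(value).strip().lower()
    text = str(value).strip()
    if field in {"resolution", "source"}:
        return text.lower().replace("_", "-")
    return " ".join(text.lower().split())


# Hypothetical case list: a JSON list of objects, each with a "filename"
# and an "expected" mapping of field name to expected value.
SAMPLE_CASES = json.loads("""
[
  {
    "id": "example-001",
    "filename": "[Group] Some Title - 03 [1080P][WEB-DL].mkv",
    "expected": {"group": "Group", "title": "Some Title", "episode": 3}
  }
]
""")

# Numeric fields compare as integers, so "03" and 3 are treated as equal.
print(normalize_field_value("episode", "03"))           # -> 3
# Resolution and source are lowercased, with "_" mapped to "-".
print(normalize_field_value("resolution", "1080P"))     # -> 1080p
print(normalize_field_value("source", "WEB_DL"))        # -> web-dl
# Other fields are lowercased with runs of whitespace collapsed.
print(normalize_field_value("title", "  Some   Title "))  # -> some title
```

Because only the fields listed under `expected` are scored, a case can check a single field (for example just `episode`) without asserting anything about the rest of the parse.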

exports/anime_filename_parser.metadata.json CHANGED

@@ -1,12 +1,12 @@
 {
-  "model_dir": "checkpoints\\dmhy-finetune\\final",
   "output": "exports\\anime_filename_parser.onnx",
-  "max_length": 64,
   "sample": "[ANi] 葬送的芙莉莲 S2 - 03 [1080P][WEB-DL]",
   "logits_shape": [
     1,
-    64,
     15
   ],
-  "max_abs_diff": 3.1948089599609375e-05
 }

 {
+  "model_dir": ".",
   "output": "exports\\anime_filename_parser.onnx",
+  "max_length": 128,
   "sample": "[ANi] 葬送的芙莉莲 S2 - 03 [1080P][WEB-DL]",
   "logits_shape": [
     1,
+    128,
     15
   ],
+  "max_abs_diff": 3.3855438232421875e-05
 }

exports/anime_filename_parser.onnx CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:684a9bd25f9e53e01adcf1e3bd60c8c674fa66d94e11167ab807f73517501603
-size 16356487

 version https://git-lfs.github.com/spec/v1
+oid sha256:f9b874fbd4217a190487f512dcc6dd7ce2f0e610147703ca0cddcc0db44fb1c7
+size 19633926

inference.py CHANGED

@@ -20,6 +20,7 @@ import torch
 from transformers import BertForTokenClassification
 from config import Config
 from tokenizer import AnimeTokenizer, load_tokenizer
@@ -37,6 +38,10 @@ def extract_season_number(text: str) -> Optional[int]:
     Examples:
         "S2" → 2, "Season 2" → 2, "第二季" → 2, "1st Season" → 1
     """
     # Arabic digits
     match = re.search(r'(\d+)', text)
     if match:
@@ -261,19 +266,66 @@ def postprocess(
 BRACKET_RE = re.compile(r"\[([^\]]+)\]|\(([^)]+)\)|【([^】]+)】|《([^》]+)》")
-RESOLUTION_RE = re.compile(r"\b(?:\d{3,4}[pP]|\d[Kk]|\d{3,4}[xX×]\d{3,4})\b")
-SOURCE_RE = re.compile(
-    r"\b(?:WEB[-_ ]?DL|WEB[-_ ]?Rip|BDRip|BluRay|BDMV|DVDRip|DVD|TVRip|HDTV|"
-    r"Netflix|NF|AMZN|Baha|CR|ABEMA|DSNP|U[-_ ]?NEXT|Hulu|AT[-_ ]?X)\b",
     re.I,
 )
 EPISODE_PATTERNS = [
-    re.compile(r"(?:^|[\s._\-\[\(【《#])(?:EP?|第)?(?P<ep>\d{1,4})(?:v\d+)?(?:[话話集])?(?=$|[\s._\-\]\)】》])", re.I),
-    re.compile(r"[Ss]\d{1,2}[Ee](?P<ep>\d{1,4})(?:v\d+)?", re.I),
 ]
 SEASON_RE = re.compile(r"(?:^|[\s._\-\[\(【《])(?:[Ss](?P<s1>\d{1,2})|Season\s*(?P<s2>\d{1,2})|第(?P<s3>[一二三四五六七八九十\d]+)[季期部])", re.I)
 NOISE_META_RE = re.compile(
-    r"^(?:\d{3,4}[pP]|\d[Kk]|WEB[-_ ]?DL|WEB[-_ ]?Rip|BDRip|BluRay|BDMV|DVDRip|DVD|TVRip|"
     r"HDTV|Netflix|NF|AMZN|Baha|CR|HEVC|AVC|AV1|x26[45]|h\.?26[45]|AAC.*|FLAC|MP3|DTS|"
     r"Opus|ASS.*|CHS|CHT|BIG5|GB|JPN?|MP4|MKV|繁中|简中|内封|外挂)$",
     re.I,
@@ -316,6 +368,52 @@ def looks_like_group(text: str) -> bool:
     )
 def apply_rule_assists(filename: str, result: Dict) -> Dict:
     """
     Fill high-confidence structural fields from filename conventions.
@@ -327,8 +425,8 @@ def apply_rule_assists(filename: str, result: Dict) -> Dict:
     brackets = bracket_parts(filename)
     if (not repaired.get("group") or (repaired.get("title") and repaired["group"] in repaired["title"])) and brackets:
-        first_text, first_start, _first_end = brackets[0]
-        if first_start == 0 and looks_like_group(first_text):
             repaired["group"] = first_text
     if not repaired.get("resolution"):
@@ -336,10 +434,34 @@ def apply_rule_assists(filename: str, result: Dict) -> Dict:
         if match:
             repaired["resolution"] = match.group(0)
-    if not repaired.get("source"):
-        match = SOURCE_RE.search(filename)
-        if match:
-            repaired["source"] = match.group(0).replace("_", "-")
     if repaired.get("season") is None:
         match = SEASON_RE.search(filename)
@@ -348,52 +470,223 @@ def apply_rule_assists(filename: str, result: Dict) -> Dict:
             season = cn_number_to_int(value)
             if season is not None:
                 repaired["season"] = season
-    if repaired.get("episode") is None:
-        candidates: List[Tuple[int, int, str]] = []
-        for pattern in EPISODE_PATTERNS:
-            for match in pattern.finditer(filename):
-                ep_text = match.group("ep")
-                ep = int(ep_text)
-                if ep == 0 or ep > 2000:
-                    continue
-                score = match.start()
-                if 1 <= ep <= 200:
-                    score += 10000
-                if "-" in filename[max(0, match.start() - 3):match.start() + 1]:
-                    score += 1000
-                if match.start() > len(filename) // 3:
-                    score += 200
-                candidates.append((score, ep, ep_text))
-        if candidates:
-            repaired["episode"] = max(candidates, key=lambda item: item[0])[1]
     title = repaired.get("title")
     group = repaired.get("group")
     if title and group and title.startswith(group):
         title = title[len(group):].lstrip("]】)>}）》 \t-_.")
         repaired["title"] = title or repaired["title"]
-    if (not repaired.get("title") or (group and repaired["title"].startswith(group))) and repaired.get("episode"):
         repaired_title = infer_title_span(filename, group, repaired["episode"])
         if repaired_title:
             repaired["title"] = repaired_title
     return repaired
 def infer_title_span(filename: str, group: Optional[str], episode: Optional[int]) -> Optional[str]:
     start = 0
     if group:
         first = BRACKET_RE.match(filename)
         if first and group in first.group(0):
             start = first.end()
     end = None
     if episode is not None:
         ep_patterns = [
             rf"\s[-_]\s*0*{episode}(?:v\d+)?(?=$|[\s\[\(【《._-])",
             rf"[\[\(【《]0*{episode}(?:v\d+)?[\]\)】》]",
             rf"[Ee]0*{episode}(?:v\d+)?",
         ]
         for pattern in ep_patterns:
@@ -412,7 +705,7 @@ def infer_title_span(filename: str, group: Optional[str], episode: Optional[int]
     if end is None or end <= start:
         return None
-    title = filename[start:end].strip(" \t-_.[]()【】《》（）")
     return title or None
@@ -448,6 +741,16 @@ def parse_filename(
     # Convert to input IDs
     input_ids = tokenizer.convert_tokens_to_ids(tokens)
     unk_token_id = tokenizer.unk_token_id
     unk_tokens = [token for token, token_id in zip(tokens, input_ids) if token_id == unk_token_id]
@@ -516,6 +819,10 @@ def parse_filename(
             "unk_count": len(unk_tokens),
             "unk_rate": len(unk_tokens) / len(tokens) if tokens else 0.0,
             "unk_tokens": unk_tokens[:50],
             "tokens": tokens[:available],
             "labels": label_strings,
             "scores": [round(float(score), 4) for score in selected_scores],
@@ -544,7 +851,7 @@ def main():
     parser.add_argument("filename", nargs="?", type=str, help="Anime filename to parse")
     parser.add_argument("--input-file", type=str, help="File with filenames (one per line)")
     parser.add_argument("--output-file", type=str, help="Output file for results (JSONL)")
-    parser.add_argument("--model-dir", type=str, default="./checkpoints/final",
                         help="Path to trained model directory")
     parser.add_argument("--tokenizer", choices=["regex", "char"], default=None,
                         help="Tokenizer variant override. Defaults to checkpoint metadata")

 from transformers import BertForTokenClassification
 from config import Config
+from label_repairs import season_marker_number
 from tokenizer import AnimeTokenizer, load_tokenizer
     Examples:
         "S2" → 2, "Season 2" → 2, "第二季" → 2, "1st Season" → 1
     """
+    marker_value = season_marker_number(text)
+    if marker_value is not None:
+        return marker_value
     # Arabic digits
     match = re.search(r'(\d+)', text)
     if match:
 BRACKET_RE = re.compile(r"\[([^\]]+)\]|\(([^)]+)\)|【([^】]+)】|《([^》]+)》")
+RESOLUTION_RE = re.compile(r"(?<![A-Za-z0-9])(?:\d{3,4}[pP]|\d[Kk]|\d{3,4}[xX×]\d{3,4})(?![A-Za-z0-9])")
+SOURCE_TOKEN_PATTERN = (
+    r"WEB[-_ ]?DL|WEB[-_ ]?Rip|BDRip|BluRay|BDMV|BD|DVDRip|DVD|TVRip|HDTV|"
+    r"Netflix|NF|AMZN|Baha|CR|ABEMA|DSNP|U[-_ ]?NEXT|Hulu|AT[-_ ]?X|"
+    r"CHS|CHT|GB|BIG5|JPN?|繁中|简中"
+)
+SOURCE_RE = re.compile(rf"\b(?:{SOURCE_TOKEN_PATTERN})\b", re.I)
+SOURCE_TAG_RE = re.compile(
+    rf"^(?:{SOURCE_TOKEN_PATTERN})(?:\s*(?:[&+/]|,\s*)\s*(?:{SOURCE_TOKEN_PATTERN}))*$",
+    re.I,
+)
+SPECIAL_TAG_RE = re.compile(
+    r"^(?:檢索|检索|搜索|搜寻|搜尋|别名|別名|alias|search|keyword)\s*[:：].+",
     re.I,
 )
 EPISODE_PATTERNS = [
+    ("season_episode", re.compile(r"[Ss]\d{1,2}[Ee](?P<ep>\d{1,4})(?:v\d+)?", re.I)),
+    ("dash_episode", re.compile(r"(?:^|[\s._])[-_]\s*(?P<ep>\d{1,4})(?:v\d+)?(?=$|[\s._\-\]\)】》\[])")),
+    ("bracket_episode", re.compile(r"[\[\(【《](?:EP?|#)?(?P<ep>\d{1,4})(?:v\d+)?[\]\)】》]", re.I)),
+    ("explicit_episode", re.compile(r"(?:^|[\s._\-\[\(【《#])(?:EP?|第|#)(?P<ep>\d{1,4})(?:v\d+)?(?:[话話集])?(?=$|[\s._\-\]\)】》])", re.I)),
+    (
+        "long_episode",
+        re.compile(
+            r"(?:^|[\s._\-\[\(【《])(?P<ep>\d{3,4})(?:v\d+)?"
+            r"(?=[\s._\-\]\)】》\[]+(?:\d{3,4}[pP]|WEB|BD|BluRay|HDTV|NF|AMZN|CR|Baha))",
+            re.I,
+        ),
+    ),
+    ("generic_episode", re.compile(r"(?:^|[\s._\-\[\(【《#])(?P<ep>\d{1,3})(?:v\d+)?(?=$|[\s._\-\]\)】》])", re.I)),
 ]
 SEASON_RE = re.compile(r"(?:^|[\s._\-\[\(【《])(?:[Ss](?P<s1>\d{1,2})|Season\s*(?P<s2>\d{1,2})|第(?P<s3>[一二三四五六七八九十\d]+)[季期部])", re.I)
+SEQUEL_MARKER_RE = re.compile(
+    r"(?<![A-Za-z0-9])"
+    r"(?P<marker>"
+    r"Ni\s+no\s+(?:Sara|Shou|Sho|Syo|Shō)|"
+    r"San\s+no\s+(?:Sara|Shou|Sho|Syo)|"
+    r"(?:Yon|Shi|Shin)\s+no\s+Sara|"
+    r"(?:Go|Gou)\s+no\s+Sara|"
+    r"Ni\s+Gakki|Sono\s+Ni|Ni|"
+    r"II|III|IV|V|VI|VII|VIII|IX|[ⅡⅢⅣⅤⅥⅦⅧⅨ]|"
+    r"[一二三四五六七八九十兩两貳贰弐弍參叁参肆伍陸陆柒捌玖](?:\s*(?:ノ|の|之)\s*(?:章|期|季|部))?"
+    r")"
+    r"(?![A-Za-z0-9])",
+    re.I,
+)
+TRAILING_SEQUEL_MARKER_RE = re.compile(
+    r"(?:^|[\s._-])"
+    r"(?P<marker>"
+    r"Ni\s+no\s+(?:Sara|Shou|Sho|Syo|Shō)|"
+    r"San\s+no\s+(?:Sara|Shou|Sho|Syo)|"
+    r"(?:Yon|Shi|Shin)\s+no\s+Sara|"
+    r"(?:Go|Gou)\s+no\s+Sara|"
+    r"Ni\s+Gakki|Sono\s+Ni|Ni|"
+    r"II|III|IV|V|VI|VII|VIII|IX|[ⅡⅢⅣⅤⅥⅦⅧⅨ]|"
+    r"[一二三四五六七八九十兩两貳贰弐弍參叁参肆伍陸陆柒捌玖](?:\s*(?:ノ|の|之)\s*(?:章|期|季|部))?"
+    r")$",
+    re.I,
+)
 NOISE_META_RE = re.compile(
+    r"^(?:\d{3,4}[pP]|\d[Kk]|WEB[-_ ]?DL|WEB[-_ ]?Rip|BDRip|BluRay|BDMV|BD|DVDRip|DVD|TVRip|"
     r"HDTV|Netflix|NF|AMZN|Baha|CR|HEVC|AVC|AV1|x26[45]|h\.?26[45]|AAC.*|FLAC|MP3|DTS|"
     r"Opus|ASS.*|CHS|CHT|BIG5|GB|JPN?|MP4|MKV|繁中|简中|内封|外挂)$",
     re.I,
     )
+def looks_like_episode_or_meta(text: str) -> bool:
+    if not text:
+        return False
+    clean = text.strip()
+    return bool(
+        re.fullmatch(r"(?:EP?|#)?\d{1,4}(?:v\d+)?", clean, re.I)
+        or RESOLUTION_RE.search(clean)
+        or SOURCE_TAG_RE.fullmatch(clean)
+        or SOURCE_RE.search(clean)
+        or SPECIAL_TAG_RE.search(clean)
+        or NOISE_META_RE.search(clean)
+    )
+def looks_like_structural_group(text: str, filename: str, bracket_end: int) -> bool:
+    """Heuristic for short leading release-group brackets not in the name list."""
+    if looks_like_group(text):
+        return True
+    if not text or looks_like_episode_or_meta(text):
+        return False
+    after = filename[bracket_end:].lstrip(" \t._")
+    if after.startswith("-"):
+        return False
+    next_bracket = BRACKET_RE.match(after)
+    if next_bracket:
+        next_text = next(group for group in next_bracket.groups() if group is not None)
+        if looks_like_episode_or_meta(next_text):
+            return False
+    words = re.findall(r"[A-Za-z0-9]+", text)
+    if not words:
+        if re.search(r"[\u3400-\u9fff]", text) and len(text) <= 32:
+            return True
+        return False
+    if len(text) > 32:
+        return False
+    if len(words) == 1:
+        return True
+    if any(sep in text for sep in "-_"):
+        return True
+    if words[0].isupper() and len(words[0]) <= 4 and len(words) <= 3:
+        return True
+    return False
 def apply_rule_assists(filename: str, result: Dict) -> Dict:
     """
     Fill high-confidence structural fields from filename conventions.
     brackets = bracket_parts(filename)
     if (not repaired.get("group") or (repaired.get("title") and repaired["group"] in repaired["title"])) and brackets:
+        first_text, first_start, first_end = brackets[0]
+        if first_start == 0 and looks_like_structural_group(first_text, filename, first_end):
             repaired["group"] = first_text
     if not repaired.get("resolution"):
         if match:
             repaired["resolution"] = match.group(0)
+    source_matches = source_candidates(filename)
+    current_source = repaired.get("source")
+    preferred_source = source_matches[0] if source_matches else None
+    if source_matches and (
+        not current_source
+        or not SOURCE_RE.fullmatch(str(current_source))
+        or (len(str(current_source)) <= 3 and str(current_source).lower() not in {"nf", "cr"})
+        or (
+            preferred_source
+            and str(current_source).lower().replace("_", "-") in {"web-dl", "webdl", "webrip", "web-rip"}
+            and preferred_source.lower().replace("_", "-") not in {"web-dl", "webdl", "webrip", "web-rip"}
+        )
+    ):
+        repaired["source"] = preferred_source
+    if not repaired.get("special"):
+        for text, _start, _end in brackets:
+            clean = text.strip()
+            if SPECIAL_TAG_RE.search(clean):
+                repaired["special"] = clean
+                break
+    episode = best_structural_episode(filename)
+    if episode is not None and (
+        repaired.get("episode") is None
+        or not plausible_episode_context(filename, int(repaired["episode"]))
+    ):
+        repaired["episode"] = episode
     if repaired.get("season") is None:
         match = SEASON_RE.search(filename)
             season = cn_number_to_int(value)
             if season is not None:
                 repaired["season"] = season
+        if repaired.get("season") is None and repaired.get("episode") is not None:
+            sequel = structural_sequel_marker(filename, repaired.get("group"), repaired.get("episode"))
+            if sequel is not None:
+                repaired["season"] = sequel[1]
+    elif repaired.get("episode") == repaired.get("season") and not SEASON_RE.search(filename):
+        repaired["season"] = None
     title = repaired.get("title")
     group = repaired.get("group")
+    if group and (NOISE_META_RE.search(str(group)) or SOURCE_RE.fullmatch(str(group)) or RESOLUTION_RE.fullmatch(str(group))):
+        repaired["group"] = None
+        group = None
     if title and group and title.startswith(group):
         title = title[len(group):].lstrip("]】)>}）》 \t-_.")
         repaired["title"] = title or repaired["title"]
+    if repaired.get("episode"):
         repaired_title = infer_title_span(filename, group, repaired["episode"])
         if repaired_title:
             repaired["title"] = repaired_title
+    if repaired.get("title") and repaired.get("season") is not None:
+        repaired["title"] = strip_trailing_season_from_title(repaired["title"], repaired["season"])
     return repaired
+def structural_sequel_marker(
+    filename: str,
+    group: Optional[str],
+    episode: Optional[int],
+) -> Optional[Tuple[str, int]]:
+    if episode is None:
+        return None
+    title_end = None
+    # `episode` is known to be non-None here because of the early return above.
+    ep_patterns = [
+        rf"[Ss]\d{{1,2}}[Ee]0*{episode}(?:v\d+)?",
+        rf"\s[-_]\s*0*{episode}(?:v\d+)?(?=$|[\s\[\(【《._-])",
+        rf"[\[\(【《]0*{episode}(?:v\d+)?[\]\)】》]",
+        rf"#\s*0*{episode}(?:v\d+)?(?=$|[\s\[\(【《._-])",
+        rf"(?:^|[\s._\-\[\(【《])第0*{episode}(?:[话話集])?(?=$|[\s._\-\]\)】》])",
+    ]
+    start = 0
+    if group:
+        first = BRACKET_RE.match(filename)
+        if first and group in first.group(0):
+            start = first.end()
+    for pattern in ep_patterns:
+        match = re.search(pattern, filename[start:], re.I)
+        if match:
+            title_end = start + match.start()
+            break
+    if title_end is None:
+        return None
+    prefix = filename[:title_end].rstrip(" \t-_.")
+    for match in reversed(list(SEQUEL_MARKER_RE.finditer(prefix))):
+        marker = match.group("marker")
+        value = season_marker_number(marker)
+        if value is None:
+            continue
+        tail = prefix[match.end():].strip(" \t-_.")
+        if tail:
+            continue
+        if marker.lower() == "ni" and "Kakuriyo no Yadomeshi Ni" not in prefix:
+            continue
+        return marker, value
+    return None
+def normalize_source_text(text: str) -> str:
+    text = re.sub(r"\s+", "", text.strip())
+    text = re.sub(r"(?i)WEB[_ ]?DL", "WEB-DL", text)
+    text = re.sub(r"(?i)WEB[_ ]?Rip", "WebRip", text)
+    text = re.sub(r"(?i)U[_ ]?NEXT", "U-NEXT", text)
+    text = re.sub(r"(?i)AT[_ ]?X", "AT-X", text)
+    return text.replace("_", "-")
+def source_priority(source: str) -> int:
+    normalized = source.lower().replace("_", "-").replace(" ", "")
+    parts = re.split(r"[&+/,]", normalized)
+    if any(part in {"nf", "netflix", "amzn", "baha", "cr", "abema", "dsnp", "u-next", "hulu", "at-x"} for part in parts):
+        return 90
+    if any(part in {"web-dl", "webdl", "webrip", "web-rip", "bdrip", "bluray", "bdmv", "bd", "dvdrip", "dvd", "tvrip", "hdtv"} for part in parts):
+        return 60
+    if len(parts) > 1:
+        return 40
+    return 20
+def source_candidates(filename: str) -> List[str]:
+    candidates: List[Tuple[int, int, str]] = []
+    for text, start, _end in bracket_parts(filename):
+        clean = text.strip()
+        if SOURCE_TAG_RE.fullmatch(clean):
+            normalized = normalize_source_text(clean)
+            candidates.append((source_priority(normalized), -start, normalized))
+    for match in SOURCE_RE.finditer(filename):
+        normalized = normalize_source_text(match.group(0))
+        candidates.append((source_priority(normalized), -match.start(), normalized))
+    deduped: Dict[str, Tuple[int, int, str]] = {}
+    for priority, neg_start, value in candidates:
+        key = value.lower()
+        if key not in deduped or (priority, neg_start) > (deduped[key][0], deduped[key][1]):
+            deduped[key] = (priority, neg_start, value)
+    return [value for _priority, _neg_start, value in sorted(deduped.values(), reverse=True)]
+def best_structural_episode(filename: str) -> Optional[int]:
+    priorities = {
+        "season_episode": 1000,
+        "dash_episode": 900,
+        "bracket_episode": 850,
+        "explicit_episode": 800,
+        "long_episode": 750,
+        "generic_episode": 100,
+    }
+    candidates: List[Tuple[int, int, int]] = []
+    for name, pattern in EPISODE_PATTERNS:
+        for match in pattern.finditer(filename):
+            ep_text = match.group("ep")
+            ep = int(ep_text)
+            if ep == 0 or ep > 2000:
+                continue
+            context = filename[max(0, match.start() - 5):match.end() + 5]
+            if RESOLUTION_RE.search(context) or re.search(r"AAC|DDP|AC3|H\.?26[45]|x26[45]", context, re.I):
+                continue
+            priority = priorities[name]
+            if 1 <= ep <= 200:
+                priority += 20
+            candidates.append((priority, match.start(), ep))
+    if not candidates:
+        return None
+    return max(candidates, key=lambda item: (item[0], item[1]))[2]
+def plausible_episode_context(filename: str, episode: int) -> bool:
+    ep_text = str(episode)
+    padded = f"{episode:02d}"
+    if re.search(rf"(?<![A-Za-z0-9])(?:H|x)\.?0*{re.escape(ep_text)}(?!\d)", filename, re.I):
+        return False
+    patterns = [
+        rf"[Ss]\d{{1,2}}[Ee]0*{episode}(?:v\d+)?",
+        rf"(?:^|[\s._])[-_]\s*0*{episode}(?:v\d+)?(?=$|[\s._\-\]\)】》\[])",
+        rf"[\[\(【《](?:EP?|#)?0*{episode}(?:v\d+)?[\]\)】》]",
+        rf"(?:^|[\s._\-\[\(【《#])(?:EP?|第|#)0*{episode}(?:v\d+)?(?:[话話集])?(?=$|[\s._\-\]\)】》])",
+        rf"(?:^|[\s._\-\[\(【《])0*{episode}(?:v\d+)?(?=[\s._\-\]\)】》\[]+(?:\d{{3,4}}[pP]|WEB|BD|BluRay|HDTV|NF|AMZN|CR|Baha))",
+    ]
+    return any(re.search(pattern, filename, re.I) for pattern in patterns) or bool(
+        re.search(rf"(?:^|[\s._\-\[\(【《])(?:{re.escape(ep_text)}|{re.escape(padded)})(?=$|[\s._\-\]\)】》])", filename)
+    )
+def strip_trailing_season_from_title(title: str, season: int) -> str:
+    season_text = str(season)
+    patterns = [
+        rf"\s+[Ss]0*{season_text}$",
+        rf"\s+Season\s*0*{season_text}$",
+        rf"\s+0*{season_text}$",
+    ]
+    cleaned = title
+    for pattern in patterns:
+        cleaned = re.sub(pattern, "", cleaned, flags=re.I).strip(" \t-_.")
+    match = TRAILING_SEQUEL_MARKER_RE.search(cleaned)
+    if match and season_marker_number(match.group("marker")) == season:
+        cleaned = cleaned[:match.start()].strip(" \t-_.")
+    return cleaned or title
+def clean_inferred_title(title: str) -> str:
+    raw_title = title.strip(" \t-_.")
+    bracket_matches = list(BRACKET_RE.finditer(raw_title))
+    if bracket_matches:
+        first = bracket_matches[0]
+        prefix = raw_title[:first.start()].strip(" \t-_.★☆")
+        text = next(group for group in first.groups() if group is not None).strip()
+        if text and not looks_like_episode_or_meta(text) and (
+            not prefix
+            or re.search(r"(?:新番|月|合集|繁|简|字幕|先行|★|☆)", prefix, re.I)
+        ):
+            return text
+    return raw_title.strip("[]()【】《》（）")
 def infer_title_span(filename: str, group: Optional[str], episode: Optional[int]) -> Optional[str]:
     start = 0
     if group:
         first = BRACKET_RE.match(filename)
         if first and group in first.group(0):
             start = first.end()
+    else:
+        # Some releases put leading metadata before the actual title, e.g.
+        # `[1080p] Title - 01`. Do not keep that wrapper as title text.
+        while True:
+            leading = BRACKET_RE.match(filename[start:].lstrip(" \t._-"))
+            if not leading:
+                break
+            skipped_ws = len(filename[start:]) - len(filename[start:].lstrip(" \t._-"))
+            text = next(group for group in leading.groups() if group is not None)
+            if not looks_like_episode_or_meta(text):
+                break
+            start += skipped_ws + leading.end()
     end = None
     if episode is not None:
         ep_patterns = [
+            rf"[Ss]\d{{1,2}}[Ee]0*{episode}(?:v\d+)?",
             rf"\s[-_]\s*0*{episode}(?:v\d+)?(?=$|[\s\[\(【《._-])",
             rf"[\[\(【《]0*{episode}(?:v\d+)?[\]\)】》]",
+            rf"#\s*0*{episode}(?:v\d+)?(?=$|[\s\[\(【《._-])",
+            rf"(?:^|[\s._\-\[\(【《])第0*{episode}(?:[话話集])?(?=$|[\s._\-\]\)】》])",
             rf"[Ee]0*{episode}(?:v\d+)?",
         ]
         for pattern in ep_patterns:
     if end is None or end <= start:
         return None
+    title = clean_inferred_title(filename[start:end])
     return title or None
     # Convert to input IDs
     input_ids = tokenizer.convert_tokens_to_ids(tokens)
+    embedding_size = model.get_input_embeddings().weight.shape[0]
+    out_of_range_tokens = [
+        token for token, token_id in zip(tokens, input_ids)
+        if token_id >= embedding_size
+    ]
+    if out_of_range_tokens:
+        input_ids = [
+            token_id if token_id < embedding_size else tokenizer.unk_token_id
+            for token_id in input_ids
+        ]
     unk_token_id = tokenizer.unk_token_id
     unk_tokens = [token for token, token_id in zip(tokens, input_ids) if token_id == unk_token_id]
             "unk_count": len(unk_tokens),
             "unk_rate": len(unk_tokens) / len(tokens) if tokens else 0.0,
             "unk_tokens": unk_tokens[:50],
+            "vocab_mismatch": bool(out_of_range_tokens),
+            "model_embedding_size": int(embedding_size),
+            "tokenizer_vocab_size": int(tokenizer.vocab_size),
+            "out_of_range_tokens": out_of_range_tokens[:50],
             "tokens": tokens[:available],
             "labels": label_strings,
             "scores": [round(float(score), 4) for score in selected_scores],
     parser.add_argument("filename", nargs="?", type=str, help="Anime filename to parse")
     parser.add_argument("--input-file", type=str, help="File with filenames (one per line)")
     parser.add_argument("--output-file", type=str, help="Output file for results (JSONL)")
+    parser.add_argument("--model-dir", type=str, default=".",
                         help="Path to trained model directory")
     parser.add_argument("--tokenizer", choices=["regex", "char"], default=None,
                         help="Tokenizer variant override. Defaults to checkpoint metadata")
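
One change in this file is easy to miss: `RESOLUTION_RE` switches from `\b` word boundaries to explicit `(?<![A-Za-z0-9])` / `(?![A-Za-z0-9])` lookarounds. The sketch below compares the two patterns, copied from the removed and added lines above. The filenames and the `find_resolution` helper are made up for illustration. In Python's `re`, `_` and CJK characters count as word characters, so `\b` does not fire between them and a digit.

```python
import re

# Old pattern (removed line) and new pattern (added line) from inference.py.
OLD_RESOLUTION_RE = re.compile(r"\b(?:\d{3,4}[pP]|\d[Kk]|\d{3,4}[xX×]\d{3,4})\b")
NEW_RESOLUTION_RE = re.compile(
    r"(?<![A-Za-z0-9])(?:\d{3,4}[pP]|\d[Kk]|\d{3,4}[xX×]\d{3,4})(?![A-Za-z0-9])"
)


def find_resolution(pattern, filename):
    """Hypothetical helper: return the first resolution-like match, or None."""
    match = pattern.search(filename)
    return match.group(0) if match else None


# Hypothetical filenames: underscore-delimited, CJK-adjacent, and bracketed.
samples = [
    "Title_01_1080p_HEVC.mkv",
    "Title 01 简体1080p.mp4",
    "Title - 01 [720P]",
]

for name in samples:
    print(name, find_resolution(OLD_RESOLUTION_RE, name), find_resolution(NEW_RESOLUTION_RE, name))
```

With these inputs the old pattern finds nothing in the first two names, while the new one returns `1080p` for both. The bracketed case matches under either pattern.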

label_repairs.py ADDED

@@ -0,0 +1,513 @@

+"""Deterministic label repairs for known weak-label blind spots."""
+from __future__ import annotations
+import re
+from dataclasses import dataclass
+from typing import Dict, Iterable, List, Optional, Sequence, Tuple
+SEPARATOR_CHARS = set(" \t-_.|~～")
+ROMAN_NUMERAL_VALUES = {
+    "II": 2,
+    "III": 3,
+    "IV": 4,
+    "V": 5,
+    "VI": 6,
+    "VII": 7,
+    "VIII": 8,
+    "IX": 9,
+    "Ⅱ": 2,
+    "Ⅲ": 3,
+    "Ⅳ": 4,
+    "Ⅴ": 5,
+    "Ⅵ": 6,
+    "Ⅶ": 7,
+    "Ⅷ": 8,
+    "Ⅸ": 9,
+}
+CN_NUMERAL_VALUES = {
+    "一": 1,
+    "二": 2,
+    "兩": 2,
+    "两": 2,
+    "貳": 2,
+    "贰": 2,
+    "弐": 2,
+    "弍": 2,
+    "三": 3,
+    "參": 3,
+    "叁": 3,
+    "参": 3,
+    "四": 4,
+    "肆": 4,
+    "五": 5,
+    "伍": 5,
+    "六": 6,
+    "陸": 6,
+    "陆": 6,
+    "七": 7,
+    "柒": 7,
+    "八": 8,
+    "捌": 8,
+    "九": 9,
+    "玖": 9,
+    "十": 10,
+}
+READING_MARKER_VALUES = {
+    "ni no sara": 2,
+    "ni no shou": 2,
+    "ni no sho": 2,
+    "ni no syo": 2,
+    "ni no shō": 2,
+    "ni gakki": 2,
+    "sono ni": 2,
+    "san no sara": 3,
+    "san no shou": 3,
+    "san no sho": 3,
+    "san no syo": 3,
+    "yon no sara": 4,
+    "shi no sara": 4,
+    "shin no sara": 4,
+    "go no sara": 5,
+    "gou no sara": 5,
+}
+# Bare "Ni" is often the Japanese particle に in romanized titles. Only repair
+# it for titles that have been verified as a sequel marker in the release name.
+STANDALONE_NI_SEASON_BASES = {
+    "Kakuriyo no Yadomeshi": 2,
+}
+EPISODE_CONTEXT_RE = re.compile(
+    r"^\s*(?:"
+    r"[-_]\s*(?:\d{1,4}|NCOP|NCED|OP|ED|OVA|OAD|SP|END)\b|"
+    r"#\s*\d{1,4}|"
+    r"[\[\(【《]\s*(?:EP?|#)?\d{1,4}"
+    r")",
+    re.I,
+)
+EPISODE_SPAN_RE = re.compile(
+    r"(?:"
+    r"[Ss]\d{1,2}[Ee]\d{1,4}(?:v\d+)?|"
+    r"(?:^|[\s._])[-_]\s*\d{1,4}(?:v\d+)?(?=$|[\s._\-\]\)】》\[])|"
+    r"[\[\(【《](?:EP?|#)?\d{1,4}(?:v\d+)?[\]\)】》]|"
+    r"(?:^|[\s._\-\[\(【《#])(?:EP?|第|#)\d{1,4}(?:v\d+)?(?:[话話集])?(?=$|[\s._\-\]\)】》])"
+    r")",
+    re.I,
+)
+BRACKET_RE = re.compile(r"\[([^\]]*)\]|\(([^)]*)\)|【([^】]*)】|《([^》]*)》")
+RESOLUTION_RE = re.compile(r"(?<![A-Za-z0-9])(?:\d{3,4}[pP]|\d[Kk]|\d{3,4}[xX×]\d{3,4})(?![A-Za-z0-9])")
+SOURCE_TOKEN_PATTERN = (
+    r"WEB[-_ ]?DL|WEB[-_ ]?Rip|BDRip|BluRay|BDMV|BD|DVDRip|DVD|TVRip|HDTV|"
+    r"Netflix|NF|AMZN|Baha|CR|ABEMA|DSNP|U[-_ ]?NEXT|Hulu|AT[-_ ]?X|"
+    r"x26[45]|h\.?26[45]|HEVC|AVC|AV1|AAC\d*(?:\.\d+)?|AAC|FLAC|MP3|DTS|Opus|"
+    r"CHS|CHT|GB|BIG5|JPN?|JPSC|JPTC|繁中|简中"
+)
+SOURCE_RE = re.compile(rf"(?<![A-Za-z0-9])(?:{SOURCE_TOKEN_PATTERN})(?![A-Za-z0-9])", re.I)
+SOURCE_TAG_RE = re.compile(
+    rf"^(?:{SOURCE_TOKEN_PATTERN})(?:\s*(?:[&+/,_-]|,\s*)\s*(?:{SOURCE_TOKEN_PATTERN}))*$",
+    re.I,
+)
+SPECIAL_TAG_RE = re.compile(
+    r"^(?:檢索|检索|搜索|搜寻|搜尋|别名|別名|alias|search|keyword)\s*[:：].+",
+    re.I,
+)
+READING_MARKER_RE = re.compile(
+    r"(?<![A-Za-z0-9])"
+    r"(?P<marker>"
+    r"Ni\s+no\s+(?:Sara|Shou|Sho|Syo|Shō)|"
+    r"San\s+no\s+(?:Sara|Shou|Sho|Syo)|"
+    r"(?:Yon|Shi|Shin)\s+no\s+Sara|"
+    r"(?:Go|Gou)\s+no\s+Sara|"
+    r"Ni\s+Gakki|"
+    r"Sono\s+Ni"
+    r")"
+    r"(?![A-Za-z0-9])",
+)
+ROMAN_MARKER_RE = re.compile(
+    r"(?<![A-Za-z0-9])"
+    r"(?P<marker>II|III|IV|V|VI|VII|VIII|IX|[ⅡⅢⅣⅤⅥⅦⅧⅨ])"
+    r"(?![A-Za-z0-9])"
+)
+CJK_MARKER_RE = re.compile(
+    r"(?P<marker>"
+    r"[一二三四五六七八九十兩两貳贰弐弍參叁参肆伍陸陆柒捌玖](?:\s*(?:ノ|の|之)\s*(?:章|期|季|部))?|"
+    r"第[一二三四五六七八九十兩两貳贰弐弍參叁参肆伍陸陆柒捌玖\d]+[季期部章]"
+    r")"
+)
+@dataclass(frozen=True)
+class LabelRepair:
+    kind: str
+    marker: str
+    value: int
+    start: int
+    end: int
+def clean_marker_text(text: str) -> str:
+    return text.strip().strip("[]()【】《》（）").strip()
+def cn_number_to_int(text: str) -> Optional[int]:
+    text = text.strip()
+    if text.isdigit():
+        return int(text)
+    if text in CN_NUMERAL_VALUES:
+        return CN_NUMERAL_VALUES[text]
+    values = CN_NUMERAL_VALUES
+    if text.startswith("十") and len(text) == 2:
+        return 10 + values.get(text[1], 0)
+    if text.endswith("十") and len(text) == 2:
+        return values.get(text[0], 0) * 10
+    if len(text) == 3 and text[1] == "十":
+        return values.get(text[0], 0) * 10 + values.get(text[2], 0)
+    return None
+def season_marker_number(text: str) -> Optional[int]:
+    """Return season number for compact sequel markers such as II or Ni no Sara."""
+    clean = clean_marker_text(text)
+    if not clean:
+        return None
+    if clean in ROMAN_NUMERAL_VALUES:
+        return ROMAN_NUMERAL_VALUES[clean]
+    lowered = re.sub(r"\s+", " ", clean.lower()).strip()
+    if lowered in READING_MARKER_VALUES:
+        return READING_MARKER_VALUES[lowered]
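+    # Bare "ni" is ambiguous (see STANDALONE_NI_SEASON_BASES); this branch
+    # assumes the caller has already verified it as a sequel marker.
+    if lowered == "ni":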
+        return 2
+    explicit = re.fullmatch(r"第(.+)[季期部章]", clean)
+    if explicit:
+        return cn_number_to_int(explicit.group(1))
+    cjk = re.fullmatch(r"([一二三四五六七八九十兩两貳贰弐弍參叁参肆伍陸陆柒捌玖])(?:\s*(?:ノ|の|之)\s*(?:章|期|季|部))?", clean)
+    if cjk:
+        return cn_number_to_int(cjk.group(1))
+    return None
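+def token_offsets_in_text(text: str, tokens: Sequence[str]) -> Optional[List[Tuple[int, int]]]:
+    """Return (start, end) offsets of each token in text, or None if a token is not found in order."""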
+    offsets: List[Tuple[int, int]] = []
+    cursor = 0
+    for token in tokens:
+        if token == "":
+            offsets.append((cursor, cursor))
+            continue
+        position = text.find(token, cursor)
+        if position < 0:
+            return None
+        end = position + len(token)
+        offsets.append((position, end))
+        cursor = end
+    return offsets
+def has_episode_context(text: str, marker_end: int) -> bool:
+    """Return True when the text right after marker_end looks like an episode or special marker."""
+    tail = text[marker_end:]
+    if EPISODE_CONTEXT_RE.match(tail):
+        return True
+    # Some releases put a season marker at the end of a title bracket and the
+    # episode in the next bracket: `[Title 貳之章][01]`.
+    tail = tail.lstrip()
+    tail = re.sub(r"^[\]\)】》]\s*", "", tail)
+    tail = re.sub(
+        r"^(?:[\[\(【《]\s*(?:menu|menus|bdmenu|ncop|nced|op|ed|ova|oad|sp)\s*[\]\)】》]\s*){0,2}",
+        "",
+        tail,
+        flags=re.I,
+    )
+    return bool(EPISODE_CONTEXT_RE.match(tail))
+def find_sequel_season_markers(text: str) -> List[LabelRepair]:
+    """Find high-confidence sequel markers that should be labeled as SEASON."""
+    repairs: List[LabelRepair] = []
+    for pattern, kind in (
+        (READING_MARKER_RE, "reading"),
+        (ROMAN_MARKER_RE, "roman"),
+        (CJK_MARKER_RE, "cjk"),
+    ):
+        for match in pattern.finditer(text):
+            marker = match.group("marker")
+            value = season_marker_number(marker)
+            if value is None or not has_episode_context(text, match.end()):
+                continue
+            repairs.append(LabelRepair(kind, marker, value, match.start(), match.end()))
+    for base, value in STANDALONE_NI_SEASON_BASES.items():
+        pattern = re.compile(rf"(?<![A-Za-z0-9]){re.escape(base)}\s+(?P<marker>Ni)(?![A-Za-z0-9])")
+        for match in pattern.finditer(text):
+            if not has_episode_context(text, match.end("marker")):
+                continue
+            repairs.append(
+                LabelRepair(
+                    kind="verified_bare_ni",
+                    marker=match.group("marker"),
+                    value=value,
+                    start=match.start("marker"),
+                    end=match.end("marker"),
+                )
+            )
+    repairs.sort(key=lambda item: (item.start, item.end))
+    deduped: List[LabelRepair] = []
+    for repair in repairs:
+        if deduped and repair.start < deduped[-1].end:
+            previous = deduped[-1]
+            if (repair.end - repair.start) > (previous.end - previous.start):
+                deduped[-1] = repair
+            continue
+        deduped.append(repair)
+    return deduped
+def labels_have_season_before(labels: Sequence[str], offsets: Sequence[Tuple[int, int]], marker_start: int) -> bool:
+    return any(label.endswith("SEASON") and end <= marker_start for label, (_start, end) in zip(labels, offsets))
+def token_indices_for_span(offsets: Sequence[Tuple[int, int]], start: int, end: int) -> List[int]:
+    return [
+        idx for idx, (tok_start, tok_end) in enumerate(offsets)
+        if tok_start < end and tok_end > start
+    ]
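+def label_span(labels: List[str], indices: Sequence[int], entity: str) -> None:
+    """Label indices as one entity span in place, continuing with I- if the previous token has the same entity."""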
+    previous_is_same_entity = bool(indices) and indices[0] > 0 and labels[indices[0] - 1].endswith(entity)
+    first = not previous_is_same_entity
+    for idx in indices:
+        labels[idx] = f"B-{entity}" if first else f"I-{entity}"
+        first = False
+def label_span_if_changed(labels: List[str], indices: Sequence[int], entity: str) -> bool:
+    previous_is_same_entity = bool(indices) and indices[0] > 0 and labels[indices[0] - 1].endswith(entity)
+    first_label = f"I-{entity}" if previous_is_same_entity else f"B-{entity}"
+    expected = [first_label] + [f"I-{entity}"] * max(0, len(indices) - 1)
+    if [labels[idx] for idx in indices] == expected:
+        return False
+    label_span(labels, indices, entity)
+    return True
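+def safe_to_overwrite_meta(labels: Sequence[str], indices: Sequence[int]) -> bool:
+    """Return True when the span is non-empty and no token in it is already GROUP, EPISODE, or SEASON."""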
+    if not indices:
+        return False
+    return not any(
+        labels[idx].endswith(("GROUP", "EPISODE", "SEASON"))
+        for idx in indices
+    )
+def mark_adjacent_title_separators_o(
+    tokens: Sequence[str],
+    labels: List[str],
+    marker_indices: Sequence[int],
+) -> None:
+    if not marker_indices:
+        return
+    idx = marker_indices[0] - 1
+    while idx >= 0 and tokens[idx].strip() == "" and labels[idx].endswith("TITLE"):
+        labels[idx] = "O"
+        idx -= 1
+    idx = marker_indices[-1] + 1
+    while idx < len(tokens) and tokens[idx] in SEPARATOR_CHARS and labels[idx].endswith("TITLE"):
+        labels[idx] = "O"
+        idx += 1
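+def first_episode_end(labels: Sequence[str], offsets: Sequence[Tuple[int, int]], text: str) -> int:
+    """Return the smallest end offset of an EPISODE-labeled token, else the end of the first EPISODE_SPAN_RE match, else 0."""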
+    ends = [
+        end for label, (_start, end) in zip(labels, offsets)
+        if label.endswith("EPISODE")
+    ]
+    if ends:
+        return min(ends)
+    match = EPISODE_SPAN_RE.search(text)
+    return match.end() if match else 0
+def bracket_content_spans(text: str) -> Iterable[Tuple[str, int, int, int, int]]:
+    for match in BRACKET_RE.finditer(text):
+        groups = match.groups()
+        group_index = next((idx for idx, value in enumerate(groups) if value is not None), None)
+        if group_index is None:
+            continue
+        inner = groups[group_index] or ""
+        # The opening delimiter is one code point in all supported bracket forms.
+        inner_start = match.start() + 1
+        inner_end = inner_start + len(inner)
+        yield inner.strip(), inner_start, inner_end, match.start(), match.end()
+def repair_structural_meta_labels(
+    text: str,
+    tokens: Sequence[str],
+    labels: List[str],
+    offsets: Sequence[Tuple[int, int]],
+) -> List[LabelRepair]:
+    repairs: List[LabelRepair] = []
+    episode_end = first_episode_end(labels, offsets, text)
+    for clean, inner_start, inner_end, bracket_start, _bracket_end in bracket_content_spans(text):
+        if bracket_start < episode_end:
+            continue
+        if not clean:
+            continue
+        if SPECIAL_TAG_RE.fullmatch(clean):
+            indices = token_indices_for_span(offsets, inner_start, inner_end)
+            if safe_to_overwrite_meta(labels, indices) and label_span_if_changed(labels, indices, "SPECIAL"):
+                repairs.append(LabelRepair("special", clean, 0, inner_start, inner_end))
+            continue
+        if SOURCE_TAG_RE.fullmatch(clean):
+            indices = token_indices_for_span(offsets, inner_start, inner_end)
+            if safe_to_overwrite_meta(labels, indices) and label_span_if_changed(labels, indices, "SOURCE"):
+                repairs.append(LabelRepair("source", clean, 0, inner_start, inner_end))
+            continue
+        for match in RESOLUTION_RE.finditer(clean):
+            start = inner_start + match.start()
+            end = inner_start + match.end()
+            indices = token_indices_for_span(offsets, start, end)
+            if safe_to_overwrite_meta(labels, indices) and label_span_if_changed(labels, indices, "RESOLUTION"):
+                repairs.append(LabelRepair("resolution", match.group(0), 0, start, end))
+        for match in SOURCE_RE.finditer(clean):
+            start = inner_start + match.start()
+            end = inner_start + match.end()
+            indices = token_indices_for_span(offsets, start, end)
+            if safe_to_overwrite_meta(labels, indices) and label_span_if_changed(labels, indices, "SOURCE"):
+                repairs.append(LabelRepair("source", match.group(0), 0, start, end))
+    # Dot-separated WEB names often carry source/resolution after SxxEyy without
+    # brackets. Repair only after the episode span to avoid touching titles.
+    for pattern, entity in ((RESOLUTION_RE, "RESOLUTION"), (SOURCE_RE, "SOURCE")):
+        for match in pattern.finditer(text):
+            if match.start() < episode_end:
+                continue
+            indices = token_indices_for_span(offsets, match.start(), match.end())
+            if safe_to_overwrite_meta(labels, indices) and label_span_if_changed(labels, indices, entity):
+                repairs.append(LabelRepair(entity.lower(), match.group(0), 0, match.start(), match.end()))
+    return repairs
+def repair_known_label_issues(
+    item: Dict,
+) -> Tuple[List[str], List[str], List[LabelRepair]]:
+    """
+    Repair known weak-label issues.
+    The repair is intentionally conservative:
+    - sequel markers must be immediately before an episode/special context;
+    - sequel marker spans must currently be part of TITLE/O, not group/meta;
+    - rows that already have a season before the marker are left alone;
+    - structural meta repairs only touch spans after the first episode.
+    """
+    source_tokens = [str(token) for token in item.get("tokens", [])]
+    source_labels = [str(label) for label in item.get("labels", [])]
+    if len(source_tokens) != len(source_labels):
+        return source_tokens, source_labels, []
+    filename = str(item.get("filename") or "")
+    text = filename if filename else "".join(source_tokens)
+    offsets = token_offsets_in_text(text, source_tokens)
+    if offsets is None:
+        text = "".join(source_tokens)
+        offsets = token_offsets_in_text(text, source_tokens)
+    if offsets is None:
+        return source_tokens, source_labels, []
+    repaired_labels = list(source_labels)
+    applied: List[LabelRepair] = []
+    quick_text = text.lower()
+    has_sequel_marker_hint = any(
+        needle in text or needle in quick_text
+        for needle in (
+            " II", " III", " IV", " V", " VI", " VII", " VIII", " IX",
+            "Ⅱ", "Ⅲ", "Ⅳ", "Ⅴ", "Ⅵ", "Ⅶ", "Ⅷ", "Ⅸ",
+            "之章", "之期", "之季", "之部", "ノ章", "ノ期", "の章", "の期",
+            "貳", "贰", "弐", "弍", "參", "叁", "参", "肆", "陸", "陆",
+            "Ni ", " ni ", " no Sara", "Gakki",
+        )
+    )
+    if has_sequel_marker_hint:
+        for repair in find_sequel_season_markers(text):
+            if labels_have_season_before(repaired_labels, offsets, repair.start):
+                continue
+            indices = token_indices_for_span(offsets, repair.start, repair.end)
+            if not indices:
+                continue
+            existing = [repaired_labels[idx] for idx in indices]
+            if any(
+                label.endswith(("GROUP", "EPISODE", "RESOLUTION", "SOURCE", "SPECIAL"))
+                for label in existing
+            ):
+                continue
+            if not any(label.endswith("TITLE") for label in existing):
+                continue
+            label_span(repaired_labels, indices, "SEASON")
+            mark_adjacent_title_separators_o(source_tokens, repaired_labels, indices)
+            applied.append(repair)
+    applied.extend(repair_structural_meta_labels(text, source_tokens, repaired_labels, offsets))
+    return source_tokens, repaired_labels, applied
+def repair_sequel_season_labels(
+    item: Dict,
+) -> Tuple[List[str], List[str], List[LabelRepair]]:
+    """Backward-compatible wrapper for callers that repair known label issues."""
+    return repair_known_label_issues(item)
+def repair_jsonl_item(item: Dict) -> Tuple[Dict, List[LabelRepair]]:
+    tokens, labels, repairs = repair_known_label_issues(item)
+    labels = normalize_iob2(labels)
+    if not repairs:
+        if labels == item.get("labels", []):
+            return item, []
+        repaired = dict(item)
+        repaired["labels"] = labels
+        return repaired, []
+    repaired = dict(item)
+    repaired["tokens"] = tokens
+    repaired["labels"] = labels
+    return repaired, repairs
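+def normalize_iob2(labels: Sequence[str]) -> List[str]:
+    """Rewrite B-/I- prefixes so each run of one entity starts with B-; any other label becomes O."""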
+    normalized: List[str] = []
+    previous_entity: Optional[str] = None
+    for label in labels:
+        if not label.startswith(("B-", "I-")):
+            normalized.append("O")
+            previous_entity = None
+            continue
+        entity = label.split("-", 1)[1]
+        prefix = "I" if previous_entity == entity else "B"
+        normalized.append(f"{prefix}-{entity}")
+        previous_entity = entity
+    return normalized
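
The offset helpers in `label_repairs.py` above map a character span found by a regex back to token indices before any label is rewritten, and `normalize_iob2` then tidies the prefixes. A minimal sketch of that flow, assuming an invented filename and token split (the real tokenizer's output may differ); the three functions are copied from the diff in condensed form so the block runs on its own:

```python
from typing import List, Optional, Sequence, Tuple


def token_offsets_in_text(text: str, tokens: Sequence[str]) -> Optional[List[Tuple[int, int]]]:
    # Scan left to right; give up (None) if a token cannot be found in order.
    offsets: List[Tuple[int, int]] = []
    cursor = 0
    for token in tokens:
        if token == "":
            offsets.append((cursor, cursor))
            continue
        position = text.find(token, cursor)
        if position < 0:
            return None
        offsets.append((position, position + len(token)))
        cursor = position + len(token)
    return offsets


def token_indices_for_span(offsets: Sequence[Tuple[int, int]], start: int, end: int) -> List[int]:
    # A token belongs to the span when the two half-open ranges overlap.
    return [i for i, (s, e) in enumerate(offsets) if s < end and e > start]


def normalize_iob2(labels: Sequence[str]) -> List[str]:
    normalized: List[str] = []
    previous: Optional[str] = None
    for label in labels:
        if not label.startswith(("B-", "I-")):
            normalized.append("O")
            previous = None
            continue
        entity = label.split("-", 1)[1]
        normalized.append(("I-" if previous == entity else "B-") + entity)
        previous = entity
    return normalized


# Invented example: the sequel marker "II" sits between the title and the episode.
text = "[Group] Title II - 01"
tokens = ["[", "Group", "]", " ", "Title", " ", "II", " ", "-", " ", "01"]
offsets = token_offsets_in_text(text, tokens)
marker_start = text.index("II")
marker_indices = token_indices_for_span(offsets, marker_start, marker_start + len("II"))

# A deliberately malformed label sequence, to show what normalization repairs.
messy = ["I-TITLE", "I-TITLE", "O", "I-SEASON", "B-EPISODE", "B-EPISODE"]
clean = normalize_iob2(messy)

print(marker_indices, offsets[marker_indices[0]])
print(clean)
```

One behavior worth keeping in mind: `normalize_iob2` merges adjacent spans of the same entity, so two consecutive `B-EPISODE` labels come out as `B-EPISODE`, `I-EPISODE`.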

model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:f2ad5fcbe0fe0e8ce563aa65347368f410e9825d998283e300a446ee2a921cf3
-size 15866796
+oid sha256:697d7491b83ef615994e02f11f0f65362c400f5eb6b4be8f43f02435ad43173f
+size 19142604

model/config.json DELETED

@@ -1,64 +0,0 @@
-{
-  "add_cross_attention": false,
-  "architectures": [
-    "BertForTokenClassification"
-  ],
-  "attention_probs_dropout_prob": 0.1,
-  "bos_token_id": null,
-  "classifier_dropout": null,
-  "dtype": "float32",
-  "eos_token_id": null,
-  "hidden_act": "gelu",
-  "hidden_dropout_prob": 0.1,
-  "hidden_size": 256,
-  "id2label": {
-    "0": "O",
-    "1": "B-TITLE",
-    "2": "I-TITLE",
-    "3": "B-SEASON",
-    "4": "I-SEASON",
-    "5": "B-EPISODE",
-    "6": "I-EPISODE",
-    "7": "B-SPECIAL",
-    "8": "I-SPECIAL",
-    "9": "B-GROUP",
-    "10": "I-GROUP",
-    "11": "B-RESOLUTION",
-    "12": "I-RESOLUTION",
-    "13": "B-SOURCE",
-    "14": "I-SOURCE"
-  },
-  "initializer_range": 0.02,
-  "intermediate_size": 1024,
-  "is_decoder": false,
-  "label2id": {
-    "B-EPISODE": 5,
-    "B-GROUP": 9,
-    "B-RESOLUTION": 11,
-    "B-SEASON": 3,
-    "B-SOURCE": 13,
-    "B-SPECIAL": 7,
-    "B-TITLE": 1,
-    "I-EPISODE": 6,
-    "I-GROUP": 10,
-    "I-RESOLUTION": 12,
-    "I-SEASON": 4,
-    "I-SOURCE": 14,
-    "I-SPECIAL": 8,
-    "I-TITLE": 2,
-    "O": 0
-  },
-  "layer_norm_eps": 1e-12,
-  "max_position_embeddings": 128,
-  "max_seq_length": 64,
-  "model_type": "bert",
-  "num_attention_heads": 8,
-  "num_hidden_layers": 4,
-  "pad_token_id": 0,
-  "tie_word_embeddings": true,
-  "tokenizer_variant": "regex",
-  "transformers_version": "5.8.1",
-  "type_vocab_size": 2,
-  "use_cache": false,
-  "vocab_size": 3000
-}

model/model.safetensors DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:8213677836eed2c4e4f64f81ebeff58e6166c808aee158954055475cbf90601b
-size 15866796

model/tokenizer_config.json DELETED

@@ -1,44 +0,0 @@
-{
-  "added_tokens_decoder": {
-    "0": {
-      "content": "[PAD]",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "1": {
-      "content": "[UNK]",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "2": {
-      "content": "[CLS]",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "3": {
-      "content": "[SEP]",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    }
-  },
-  "backend": "custom",
-  "cls_token": "[CLS]",
-  "model_max_length": 1000000000000000019884624838656,
-  "pad_token": "[PAD]",
-  "sep_token": "[SEP]",
-  "tokenizer_class": "AnimeTokenizer",
-  "tokenizer_variant": "regex",
-  "unk_token": "[UNK]"
-}

model/training_args.bin DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:67f4980e6e5c8a3b151030042cae7449e798e3fc87518f33ed4d557e6fa17e41
-size 5265

model/vocab.json DELETED

The diff for this file is too large to render.

parse_eval_metrics.json ADDED

@@ -0,0 +1,595 @@
+{
+  "sample_count": 2048,
+  "field_accuracy": {
+    "group": 1.0,
+    "title": 0.99658203125,
+    "season": 0.994140625,
+    "episode": 0.99609375,
+    "resolution": 0.998046875,
+    "source": 0.99365234375,
+    "special": 0.998046875
+  },
+  "field_correct": {
+    "group": 2048,
+    "title": 2041,
+    "season": 2036,
+    "episode": 2040,
+    "resolution": 2044,
+    "source": 2035,
+    "special": 2044
+  },
+  "field_total": {
+    "group": 2048,
+    "title": 2048,
+    "season": 2048,
+    "episode": 2048,
+    "resolution": 2048,
+    "source": 2048,
+    "special": 2048
+  },
+  "full_match_accuracy": 0.98046875,
+  "full_match_correct": 2008,
+  "full_match_total": 2048,
+  "failures": [
+    {
+      "filename": "[DBD-Raws][Boruto Naruto Next Generations][menu][S13][D2][02][1080P][BDRip][HEVC-10bit][FLAC]",
+      "errors": {
+        "season": {
+          "gold": null,
+          "pred": "13"
+        }
+      },
+      "gold": {
+        "group": "DBD-Raws",
+        "title": "Boruto Naruto Next Generations",
+        "season": null,
+        "episode": 2,
+        "resolution": "1080P",
+        "source": "BDRip",
+        "special": null
+      },
+      "pred": {
+        "group": "DBD-Raws",
+        "title": "Boruto Naruto Next Generations",
+        "season": 13,
+        "episode": 2,
+        "resolution": "1080P",
+        "source": "BDRip",
+        "special": null
+      }
+    },
+    {
+      "filename": "[アニメ BD] ギャラクシーエンジェル 第1期(無印) 第09話「ロストテクノロジーのローストビーフ」(1424x1072 HEVC 10bit FLAC softSub(chi+eng) chap)",
+      "errors": {
+        "season": {
+          "gold": null,
+          "pred": "1"
+        }
+      },
+      "gold": {
+        "group": "アニメ BD",
+        "title": "ギャラクシーエンジェル 第1期(無印) 第09話「ロストテクノロジーのローストビーフ」",
+        "season": null,
+        "episode": 9,
+        "resolution": "1424x1072",
+        "source": "BD",
+        "special": null
+      },
+      "pred": {
+        "group": "アニメ BD",
+        "title": "ギャラクシーエンジェル 第1期(無印) 第09話「ロストテクノロジーのローストビーフ」",
+        "season": 1,
+        "episode": 9,
+        "resolution": "1424x1072",
+        "source": "BD",
+        "special": null
+      }
+    },
+    {
+      "filename": "コメットさん☆ 第11話 「バトンの力」(DVD DivX4.12 QB95 640x480 24f) [CRC32_C09E1AB0]",
+      "errors": {
+        "source": {
+          "gold": "cr",
+          "pred": "dvd"
+        }
+      },
+      "gold": {
+        "group": null,
+        "title": "コメットさん☆",
+        "season": null,
+        "episode": 11,
+        "resolution": "640x480",
+        "source": "CR",
+        "special": null
+      },
+      "pred": {
+        "group": null,
+        "title": "コメットさん☆",
+        "season": null,
+        "episode": 11,
+        "resolution": "640x480",
+        "source": "DVD",
+        "special": null
+      }
+    },
+    {
+      "filename": "[Kamigami&Mabors&VCB-Studio] Saenai Heroine no Sodatekata Flat [07][Ma10p_1080p][x265_2aac]",
+      "errors": {
+        "source": {
+          "gold": "aac",
+          "pred": "x265-2aac"
+        }
+      },
+      "gold": {
+        "group": "Kamigami&Mabors&VCB-Studio",
+        "title": "Saenai Heroine no Sodatekata Flat",
+        "season": null,
+        "episode": 7,
+        "resolution": "1080p",
+        "source": "aac",
+        "special": null
+      },
+      "pred": {
+        "group": "Kamigami&Mabors&VCB-Studio",
+        "title": "Saenai Heroine no Sodatekata Flat",
+        "season": null,
+        "episode": 7,
+        "resolution": "1080p",
+        "source": "x265_2aac",
+        "special": null
+      }
+    },
+    {
+      "filename": "[Liuyun&VCB-Studio] Hanasaku Iroha [07][Hi10p_1080p][x264_flac_ac3]",
+      "errors": {
+        "source": {
+          "gold": "flac",
+          "pred": "x264-flac"
+        }
+      },
+      "gold": {
+        "group": "Liuyun&VCB-Studio",
+        "title": "Hanasaku Iroha",
+        "season": null,
+        "episode": 7,
+        "resolution": "1080p",
+        "source": "flac",
+        "special": null
+      },
+      "pred": {
+        "group": "Liuyun&VCB-Studio",
+        "title": "Hanasaku Iroha",
+        "season": null,
+        "episode": 7,
+        "resolution": "1080p",
+        "source": "x264_flac",
+        "special": null
+      }
+    },
+    {
+      "filename": "小新外传4[EP02][2017.06.07]出动！妖怪克星",
+      "errors": {
+        "title": {
+          "gold": "小新外传4 ep02 2017 06",
+          "pred": "小新外传 ep02 2"
+        },
+        "season": {
+          "gold": null,
+          "pred": "4"
+        },
+        "episode": {
+          "gold": "7",
+          "pred": "2"
+        }
+      },
+      "gold": {
+        "group": null,
+        "title": "小新外传4 EP02 2017 06",
+        "season": null,
+        "episode": 7,
+        "resolution": null,
+        "source": null,
+        "special": null
+      },
+      "pred": {
+        "group": null,
+        "title": "小新外传 EP02 2",
+        "season": 4,
+        "episode": 2,
+        "resolution": null,
+        "source": null,
+        "special": null
+      }
+    },
+    {
+      "filename": "[GM-Team][国漫][异常生物见闻录][The Record of Unusual Creatures][2019][12][HEVC][GB][3840×2160]",
+      "errors": {
+        "resolution": {
+          "gold": "3840×2160",
+          "pred": "3840×"
+        }
+      },
+      "gold": {
+        "group": "GM-Team",
+        "title": "国漫",
+        "season": null,
+        "episode": 12,
+        "resolution": "3840×2160",
+        "source": "GB",
+        "special": null
+      },
+      "pred": {
+        "group": "GM-Team",
+        "title": "国漫",
+        "season": null,
+        "episode": 12,
+        "resolution": "3840×",
+        "source": "GB",
+        "special": null
+      }
+    },
+    {
+      "filename": "Ⅱ 116 第108次鐘聲已經敲過了嗎？",
+      "errors": {
+        "title": {
+          "gold": "ⅱ 116 第",
+          "pred": "第"
+        }
+      },
+      "gold": {
+        "group": null,
+        "title": "Ⅱ 116 第",
+        "season": null,
+        "episode": 116,
+        "resolution": null,
+        "source": null,
+        "special": null
+      },
+      "pred": {
+        "group": null,
+        "title": "第",
+        "season": null,
+        "episode": 116,
+        "resolution": null,
+        "source": null,
+        "special": null
+      }
+    },
+    {
+      "filename": "EP08 & EP11 NCED",
+      "errors": {
+        "title": {
+          "gold": "&",
+          "pred": "ep"
+        }
+      },
+      "gold": {
+        "group": null,
+        "title": "&",
+        "season": null,
+        "episode": 11,
+        "resolution": null,
+        "source": null,
+        "special": "NCED"
+      },
+      "pred": {
+        "group": null,
+        "title": "EP",
+        "season": null,
+        "episode": 11,
+        "resolution": null,
+        "source": null,
+        "special": "NCED"
+      }
+    },
+    {
+      "filename": "[S1YURICON] Necronomico no Cosmic Horror Show[06][1080p][WebRip][HEVC_AAC][CHS]",
+      "errors": {
+        "season": {
+          "gold": null,
+          "pred": "1"
+        }
+      },
+      "gold": {
+        "group": "S1YURICON",
+        "title": "Necronomico no Cosmic Horror Show",
+        "season": null,
+        "episode": 6,
+        "resolution": "1080p",
+        "source": "WebRip",
+        "special": null
+      },
+      "pred": {
+        "group": "S1YURICON",
+        "title": "Necronomico no Cosmic Horror Show",
+        "season": 1,
+        "episode": 6,
+        "resolution": "1080p",
+        "source": "WebRip",
+        "special": null
+      }
+    },
+    {
+      "filename": "[FZsub]Gate - Jieitai Kanochi nite, Kaku Tatakaeri 2 - 02(14) (MX 1280x720 x264 AAC)_x264",
+      "errors": {
+        "title": {
+          "gold": "gate - jieitai kanochi nite, kaku tatakaeri 2",
+          "pred": "gate - jieitai kanochi nite, kaku tatakaeri 2 - 02"
+        },
+        "season": {
+          "gold": "2",
+          "pred": null
+        }
+      },
+      "gold": {
+        "group": "FZsub",
+        "title": "Gate - Jieitai Kanochi nite, Kaku Tatakaeri 2",
+        "season": 2,
+        "episode": 14,
+        "resolution": "1280x720",
+        "source": "x264",
+        "special": null
+      },
+      "pred": {
+        "group": "FZsub",
+        "title": "Gate - Jieitai Kanochi nite, Kaku Tatakaeri 2 - 02",
+        "season": null,
+        "episode": 14,
+        "resolution": "1280x720",
+        "source": "x264",
+        "special": null
+      }
+    },
+    {
+      "filename": "Mobile Suit Gundam SEED Destiny - HD Remaster 2013 Anime Music Clip2 - Life Goes On [BD 1248x702 23.976fps AVC-yuv420p10 FLAC] v2 - yan04000985",
+      "errors": {
+        "episode": {
+          "gold": null,
+          "pred": "23"
+        }
+      },
+      "gold": {
+        "group": null,
+        "title": "Mobile Suit Gundam SEED Destiny - HD Remaster 2013 Anime Music Clip2 - Life Goes On",
+        "season": null,
+        "episode": null,
+        "resolution": "1248x702",
+        "source": "BD",
+        "special": null
+      },
+      "pred": {
+        "group": null,
+        "title": "Mobile Suit Gundam SEED Destiny - HD Remaster 2013 Anime Music Clip2 - Life Goes On",
+        "season": null,
+        "episode": 23,
+        "resolution": "1248x702",
+        "source": "BD",
+        "special": null
+      }
+    },
+    {
+      "filename": "Mary_E_Il_Giardino_Segreto_-_07_-_Camilla_[DvdMUX_by_Magic_©2008]",
+      "errors": {
+        "source": {
+          "gold": null,
+          "pred": "dvd"
+        }
+      },
+      "gold": {
+        "group": null,
+        "title": "Mary_E_Il_Giardino_Segreto",
+        "season": null,
+        "episode": 7,
+        "resolution": null,
+        "source": null,
+        "special": null
+      },
+      "pred": {
+        "group": null,
+        "title": "Mary_E_Il_Giardino_Segreto",
+        "season": null,
+        "episode": 7,
+        "resolution": null,
+        "source": "Dvd",
+        "special": null
+      }
+    },
+    {
+      "filename": "(アニメ) アイドル伝説えり子 第24話 「心をつなぐ輪舞曲」 (DVD 640x480DivX5.02QB93 48kHz128kbps)",
+      "errors": {
+        "resolution": {
+          "gold": null,
+          "pred": "640x480"
+        }
+      },
+      "gold": {
+        "group": "アニメ",
+        "title": "アイドル伝説えり子",
+        "season": null,
+        "episode": 24,
+        "resolution": null,
+        "source": "DVD",
+        "special": null
+      },
+      "pred": {
+        "group": "アニメ",
+        "title": "アイドル伝説えり子",
+        "season": null,
+        "episode": 24,
+        "resolution": "640x480",
+        "source": "DVD",
+        "special": null
+      }
+    },
+    {
+      "filename": "[DMG] 東京レイヴンズ 第06話「days in nest -休日-」 [BDRip][AVC_AAC][720P][CHS](A8161323)",
+      "errors": {
+        "episode": {
+          "gold": "1323",
+          "pred": "6"
+        }
+      },
+      "gold": {
+        "group": "DMG",
+        "title": "東京レイヴンズ 第06話「days in nest -休日-」",
+        "season": null,
+        "episode": 1323,
+        "resolution": "720P",
+        "source": "BDRip",
+        "special": null
+      },
+      "pred": {
+        "group": "DMG",
+        "title": "東京レイヴンズ 第06話「days in nest -休日-」",
+        "season": null,
+        "episode": 6,
+        "resolution": "720P",
+        "source": "BDRip",
+        "special": null
+      }
+    },
+    {
+      "filename": "[S1YURICON] Necronomico no Cosmic Horror Show[05v2][1080p][WebRip][AVC_AAC][CHS]",
+      "errors": {
+        "season": {
+          "gold": null,
+          "pred": "1"
+        }
+      },
+      "gold": {
+        "group": "S1YURICON",
+        "title": "Necronomico no Cosmic Horror Show",
+        "season": null,
+        "episode": 5,
+        "resolution": "1080p",
+        "source": "WebRip",
+        "special": null
+      },
+      "pred": {
+        "group": "S1YURICON",
+        "title": "Necronomico no Cosmic Horror Show",
+        "season": 1,
+        "episode": 5,
+        "resolution": "1080p",
+        "source": "WebRip",
+        "special": null
+      }
+    },
+    {
+      "filename": "Cardcaptor Sakura - 17 [x264-AAC-BD1440x1080p][Sakura][C-W][E2B50799]",
+      "errors": {
+        "resolution": {
+          "gold": null,
+          "pred": "1080p"
+        },
+        "source": {
+          "gold": null,
+          "pred": "e2b50799"
+        }
+      },
+      "gold": {
+        "group": null,
+        "title": "Cardcaptor Sakura",
+        "season": null,
+        "episode": 17,
+        "resolution": null,
+        "source": null,
+        "special": null
+      },
+      "pred": {
+        "group": null,
+        "title": "Cardcaptor Sakura",
+        "season": null,
+        "episode": 17,
+        "resolution": "1080p",
+        "source": "E2B50799",
+        "special": null
+      }
+    },
+    {
+      "filename": "[Xspitfire911] Tate no Yuusha no Nariagari S01E20 BDRIP 1080p X265 10bit VOSTFR",
+      "errors": {
+        "season": {
+          "gold": null,
+          "pred": "1"
+        }
+      },
+      "gold": {
+        "group": "Xspitfire911",
+        "title": "Tate no Yuusha no Nariagari",
+        "season": null,
+        "episode": 20,
+        "resolution": "1080p",
+        "source": "BDRIP",
+        "special": null
+      },
+      "pred": {
+        "group": "Xspitfire911",
+        "title": "Tate no Yuusha no Nariagari",
+        "season": 1,
+        "episode": 20,
+        "resolution": "1080p",
+        "source": "BDRIP",
+        "special": null
+      }
+    },
+    {
+      "filename": "[KTXP][Dungeon ni Deai wo Motomeru no wa Machigatteiru Darou ka IV][13][BIG5][720P][MP4]",
+      "errors": {
+        "title": {
+          "gold": "dungeon ni deai wo motomeru no wa machigatteiru darou ka",
+          "pred": "dungeon ni deai wo motomeru no wa machigatteiru darou ka iv"
+        },
+        "season": {
+          "gold": "4",
+          "pred": null
+        }
+      },
+      "gold": {
+        "group": "KTXP",
+        "title": "Dungeon ni Deai wo Motomeru no wa Machigatteiru Darou ka",
+        "season": 4,
+        "episode": 13,
+        "resolution": "720P",
+        "source": "BIG5",
+        "special": null
+      },
+      "pred": {
+        "group": "KTXP",
+        "title": "Dungeon ni Deai wo Motomeru no wa Machigatteiru Darou ka IV",
+        "season": null,
+        "episode": 13,
+        "resolution": "720P",
+        "source": "BIG5",
+        "special": null
+      }
+    },
+    {
+      "filename": "[JyFanSub][Fate_Apocrypha][15][GB][1080]p",
+      "errors": {
+        "episode": {
+          "gold": "1080",
+          "pred": "15"
+        }
+      },
+      "gold": {
+        "group": "JyFanSub",
+        "title": "Fate_Apocrypha",
+        "season": null,
+        "episode": 1080,
+        "resolution": null,
+        "source": "GB",
+        "special": null
+      },
+      "pred": {
+        "group": "JyFanSub",
+        "title": "Fate_Apocrypha",
+        "season": null,
+        "episode": 15,
+        "resolution": null,
+        "source": "GB",
+        "special": null
+      }
+    }
+  ]
+}
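
In the failure records above, the `errors` entries hold normalized strings (for example `"e2b50799"` and `"1"`), while `gold` and `pred` keep the raw field values. This looks consistent with the `normalize_field_value` helper added to `train.py` in this commit. The sketch below copies that helper to illustrate the comparison rule; the sample inputs are illustrative only.

```python
from typing import Optional


def normalize_field_value(field: str, value) -> Optional[str]:
    """Normalize a parsed field before exact-match comparison."""
    if value is None:
        return None
    if field in {"episode", "season"}:
        try:
            # 1, "1" and "01" all compare equal as "1".
            return str(int(value))
        except (TypeError, ValueError):
            return str(value).strip().lower()
    text = str(value).strip()
    if field in {"resolution", "source"}:
        # Case-insensitive; underscores are treated as hyphens.
        return text.lower().replace("_", "-")
    # Remaining fields (group, title, special): lowercase, collapse whitespace.
    return " ".join(text.lower().split())


# Illustrative inputs only.
print(normalize_field_value("source", "E2B50799"))
print(normalize_field_value("season", 1))
print(normalize_field_value("title", "Cardcaptor  Sakura"))
```

Because comparison happens on the normalized form, a prediction of `720P` against a gold value of `720p` counts as a match, while a missing value (`None`) only matches another missing value.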

pyproject.toml ADDED

	@@ -0,0 +1,36 @@

+[project]
+name = "anifilebert"
+version = "0.1.0"
+description = "Tiny BERT token-classification model and tooling for parsing anime release filenames."
+readme = "README.md"
+requires-python = ">=3.11"
+license = { text = "Apache-2.0" }
+dependencies = [
+    "accelerate==1.13.0",
+    "datasets==4.8.5",
+    "numpy==2.4.5",
+    "onnx==1.21.0",
+    "onnxruntime==1.26.0",
+    "onnxscript==0.7.0",
+    "seqeval==1.2.2",
+    "tensorboard>=2.14.0",
+    "torch==2.12.0+cu126",
+    "transformers==5.8.1",
+]
+[project.urls]
+Repository = "https://huggingface.co/ModerRAS/AniFileBERT"
+[tool.uv]
+package = false
+environments = ["sys_platform == 'win32'"]
+[tool.uv.sources]
+torch = [
+    { index = "pytorch-cu126", marker = "platform_system == 'Windows'" },
+]
+[[tool.uv.index]]
+name = "pytorch-cu126"
+url = "https://download.pytorch.org/whl/cu126"
+explicit = true

relabel_dataset_from_filenames.py ADDED

	@@ -0,0 +1,157 @@

+"""Rebuild AnimeName weak labels from each stored filename."""
+from __future__ import annotations
+import argparse
+import json
+from collections import Counter
+from datetime import datetime, timezone
+from pathlib import Path
+from statistics import mean
+from typing import Iterable
+from dmhy_dataset import weak_label_filename
+from label_repairs import repair_jsonl_item
+from tokenizer import AnimeTokenizer
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Relabel a JSONL dataset from filename strings")
+    parser.add_argument("--input", required=True, help="Input JSONL containing filename fields")
+    parser.add_argument("--output", required=True, help="Output relabeled regex-token JSONL")
+    parser.add_argument("--manifest-output", default=None, help="Relabel manifest JSON")
+    parser.add_argument("--vocab-output", default=None, help="Optional regex vocab JSON")
+    parser.add_argument("--base-vocab", default=None, help="Optional regex vocab whose IDs should be preserved")
+    parser.add_argument("--max-vocab-size", type=int, default=3000)
+    parser.add_argument("--limit", type=int, default=None)
+    parser.add_argument("--progress", type=int, default=50000)
+    parser.add_argument("--example-count", type=int, default=20)
+    return parser.parse_args()
+def iter_jsonl(path: Path) -> Iterable[dict]:
+    with path.open("r", encoding="utf-8") as handle:
+        for line_no, line in enumerate(handle, 1):
+            line = line.strip()
+            if not line:
+                continue
+            try:
+                yield json.loads(line)
+            except json.JSONDecodeError as exc:
+                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc
+def length_stats(values: list[int]) -> dict:
+    if not values:
+        return {"min": 0, "mean": 0, "p50": 0, "p90": 0, "p95": 0, "p99": 0, "max": 0}
+    ordered = sorted(values)
+    def percentile(pct: float) -> int:
+        index = min(len(ordered) - 1, round((pct / 100) * (len(ordered) - 1)))
+        return ordered[index]
+    return {
+        "min": min(values),
+        "mean": mean(values),
+        "p50": percentile(50),
+        "p90": percentile(90),
+        "p95": percentile(95),
+        "p99": percentile(99),
+        "max": max(values),
+    }
+def main() -> None:
+    args = parse_args()
+    input_path = Path(args.input)
+    output_path = Path(args.output)
+    manifest_path = Path(args.manifest_output) if args.manifest_output else output_path.with_suffix(".manifest.json")
+    vocab_path = Path(args.vocab_output) if args.vocab_output else None
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    manifest_path.parent.mkdir(parents=True, exist_ok=True)
+    if vocab_path:
+        vocab_path.parent.mkdir(parents=True, exist_ok=True)
+    tokenizer = AnimeTokenizer()
+    rows_in = 0
+    rows_written = 0
+    rows_failed = 0
+    rows_repaired_after_relabel = 0
+    label_counter: Counter[str] = Counter()
+    failure_counter: Counter[str] = Counter()
+    token_lists: list[list[str]] = []
+    lengths: list[int] = []
+    examples: list[dict] = []
+    failures: list[dict] = []
+    with output_path.open("w", encoding="utf-8", newline="\n") as out:
+        for item in iter_jsonl(input_path):
+            rows_in += 1
+            filename = item.get("filename")
+            if not filename:
+                rows_failed += 1
+                failure_counter["missing_filename"] += 1
+                continue
+            sample = weak_label_filename(str(filename), tokenizer)
+            if sample is None:
+                rows_failed += 1
+                failure_counter["weak_label_failed"] += 1
+                if len(failures) < args.example_count:
+                    failures.append({"file_id": item.get("file_id"), "filename": filename})
+                continue
+            record = dict(item)
+            record.pop("tokenizer_variant", None)
+            record.pop("source_token_count", None)
+            record.pop("char_token_count", None)
+            record["tokens"] = sample["tokens"]
+            record["labels"] = sample["labels"]
+            repaired, repairs = repair_jsonl_item(record)
+            if repairs:
+                rows_repaired_after_relabel += 1
+                record = repaired
+            out.write(json.dumps(record, ensure_ascii=False, separators=(",", ":")) + "\n")
+            rows_written += 1
+            label_counter.update(record["labels"])
+            token_lists.append(record["tokens"])
+            lengths.append(len(record["tokens"]))
+            if len(examples) < args.example_count:
+                examples.append(record)
+            if args.limit is not None and rows_written >= args.limit:
+                break
+            if args.progress and rows_written % args.progress == 0:
+                print(f"relabeled {rows_written:,} rows; failed={rows_failed:,}")
+    base_vocab = None
+    if args.base_vocab:
+        with Path(args.base_vocab).open("r", encoding="utf-8") as handle:
+            base_vocab = json.load(handle)
+    tokenizer.build_vocab(token_lists, max_size=args.max_vocab_size, base_vocab=base_vocab)
+    if vocab_path:
+        vocab_path.write_text(json.dumps(tokenizer.get_vocab(), ensure_ascii=False, indent=2) + "\n", encoding="utf-8")
+    manifest = {
+        "created_at": datetime.now(timezone.utc).isoformat(),
+        "input": str(input_path),
+        "output": str(output_path),
+        "vocab_output": str(vocab_path) if vocab_path else None,
+        "row_count": rows_written,
+        "input_rows": rows_in,
+        "failed_rows": rows_failed,
+        "repaired_after_relabel_rows": rows_repaired_after_relabel,
+        "failure_counts": dict(failure_counter),
+        "label_counts": dict(label_counter),
+        "token_length": length_stats(lengths),
+        "vocab_size": tokenizer.vocab_size,
+        "examples": examples,
+        "failures": failures,
+    }
+    manifest_path.write_text(json.dumps(manifest, ensure_ascii=False, indent=2) + "\n", encoding="utf-8")
+    print(json.dumps({k: v for k, v in manifest.items() if k not in {"examples", "failures"}}, ensure_ascii=False, indent=2))
+if __name__ == "__main__":
+    main()
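
The `token_length` block of the manifest comes from `length_stats`, which picks percentiles by rounding to the nearest index in the sorted list rather than interpolating. The sketch below reuses that logic on made-up token counts; the input list is hypothetical and only there to show the behaviour.

```python
from statistics import mean


def length_stats(values: list[int]) -> dict:
    """Summarize token counts with nearest-index percentiles."""
    if not values:
        return {"min": 0, "mean": 0, "p50": 0, "p90": 0, "p95": 0, "p99": 0, "max": 0}
    ordered = sorted(values)

    def percentile(pct: float) -> int:
        # Map pct onto an index in [0, len - 1]. Python's round() sends
        # exact halves to the nearest even integer.
        index = min(len(ordered) - 1, round((pct / 100) * (len(ordered) - 1)))
        return ordered[index]

    return {
        "min": min(values),
        "mean": mean(values),
        "p50": percentile(50),
        "p90": percentile(90),
        "p95": percentile(95),
        "p99": percentile(99),
        "max": max(values),
    }


# Hypothetical token counts, not taken from the dataset.
stats = length_stats([12, 18, 20, 25, 31, 40, 44, 57, 63, 90, 128])
print(stats["p50"], stats["p90"], stats["max"])
```

Every reported percentile is an actual observed length, which makes it convenient for choosing a `max_seq_length` cutoff.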

repair_dataset_labels.py ADDED

	@@ -0,0 +1,103 @@

+"""Repair known weak-label mistakes in exported AnimeName JSONL datasets."""
+from __future__ import annotations
+import argparse
+import json
+from collections import Counter, defaultdict
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import Dict, List
+from label_repairs import LabelRepair, repair_jsonl_item
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Repair weak BIO labels in a JSONL dataset")
+    parser.add_argument("--input", required=True, help="Input JSONL")
+    parser.add_argument("--output", required=True, help="Output repaired JSONL")
+    parser.add_argument("--manifest-output", default=None, help="Optional repair manifest JSON")
+    parser.add_argument("--dry-run", action="store_true", help="Scan only; do not write output JSONL")
+    parser.add_argument("--example-limit", type=int, default=40)
+    return parser.parse_args()
+def repair_key(repair: LabelRepair) -> str:
+    return f"{repair.kind}:{repair.marker}"
+def main() -> None:
+    args = parse_args()
+    input_path = Path(args.input)
+    output_path = Path(args.output)
+    manifest_path = Path(args.manifest_output) if args.manifest_output else output_path.with_suffix(".manifest.json")
+    counts: Counter[str] = Counter()
+    marker_counts: Counter[str] = Counter()
+    examples: Dict[str, List[dict]] = defaultdict(list)
+    label_counts: Counter[str] = Counter()
+    row_count = 0
+    repaired_rows = 0
+    output_handle = None
+    if not args.dry_run:
+        output_path.parent.mkdir(parents=True, exist_ok=True)
+        output_handle = output_path.open("w", encoding="utf-8")
+    try:
+        with input_path.open("r", encoding="utf-8") as handle:
+            for line in handle:
+                line = line.strip()
+                if not line:
+                    continue
+                row_count += 1
+                item = json.loads(line)
+                repaired, repairs = repair_jsonl_item(item)
+                if repairs:
+                    repaired_rows += 1
+                    for repair in repairs:
+                        key = repair_key(repair)
+                        counts[repair.kind] += 1
+                        marker_counts[key] += 1
+                        if len(examples[key]) < args.example_limit:
+                            examples[key].append(
+                                {
+                                    "file_id": item.get("file_id"),
+                                    "filename": item.get("filename"),
+                                    "marker": repair.marker,
+                                    "value": repair.value,
+                                    "span": [repair.start, repair.end],
+                                }
+                            )
+                label_counts.update(repaired.get("labels", []))
+                if output_handle is not None:
+                    output_handle.write(json.dumps(repaired, ensure_ascii=False, separators=(",", ":")) + "\n")
+    finally:
+        if output_handle is not None:
+            output_handle.close()
+    manifest = {
+        "created_at": datetime.now(timezone.utc).isoformat(),
+        "input": str(input_path),
+        "output": None if args.dry_run else str(output_path),
+        "dry_run": args.dry_run,
+        "row_count": row_count,
+        "repaired_rows": repaired_rows,
+        "repair_counts": dict(counts),
+        "marker_counts": dict(marker_counts),
+        "label_counts": dict(label_counts),
+        "examples": examples,
+    }
+    manifest_path.parent.mkdir(parents=True, exist_ok=True)
+    manifest_path.write_text(json.dumps(manifest, ensure_ascii=False, indent=2), encoding="utf-8")
+    print(json.dumps({
+        "row_count": row_count,
+        "repaired_rows": repaired_rows,
+        "repair_counts": dict(counts),
+        "manifest": str(manifest_path),
+        "output": None if args.dry_run else str(output_path),
+    }, ensure_ascii=False, indent=2))
+if __name__ == "__main__":
+    main()
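
The manifest written by this script groups repairs two ways: by `kind` alone and by a combined `kind:marker` key, with a capped list of examples per key. Here is a minimal sketch of that bookkeeping. The `LabelRepair` dataclass is a hypothetical stand-in whose fields are inferred from how the script uses them (the real class lives in `label_repairs.py` and may differ), and the `summarize` helper and sample rows are invented for illustration.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class LabelRepair:
    # Hypothetical stand-in; the real definition is in label_repairs.py.
    kind: str
    marker: str
    value: str
    start: int
    end: int


def repair_key(repair: LabelRepair) -> str:
    return f"{repair.kind}:{repair.marker}"


def summarize(rows, example_limit=2):
    """Aggregate (filename, repairs) pairs the way the script's main loop does."""
    counts, marker_counts = Counter(), Counter()
    examples = defaultdict(list)
    repaired_rows = 0
    for filename, repairs in rows:
        if repairs:
            repaired_rows += 1
        for repair in repairs:
            key = repair_key(repair)
            counts[repair.kind] += 1
            marker_counts[key] += 1
            # Keep only the first few examples per key to bound manifest size.
            if len(examples[key]) < example_limit:
                examples[key].append({"filename": filename, "span": [repair.start, repair.end]})
    return repaired_rows, counts, marker_counts, examples


# Invented rows: kinds and markers here are placeholders.
rows = [
    ("a.mkv", [LabelRepair("season", "S01", "1", 3, 4)]),
    ("b.mkv", []),
    ("c.mkv", [LabelRepair("season", "S01", "1", 5, 6), LabelRepair("season", "IV", "4", 9, 10)]),
    ("d.mkv", [LabelRepair("season", "S01", "1", 2, 3)]),
]
repaired_rows, counts, marker_counts, examples = summarize(rows)
print(repaired_rows, dict(counts), dict(marker_counts))
```

Note that `repaired_rows` counts rows, while `counts` and `marker_counts` count individual repairs, so the two totals can differ when one filename needs several fixes.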

requirements.txt CHANGED

@@ -1,10 +1,12 @@
-torch>=2.0.0
-transformers>=4.30.0
-datasets>=2.12.0
-accelerate>=1.1.0
-seqeval>=1.2.2
-numpy>=1.24.0
-tqdm>=4.65.0
-onnx>=1.16.0
-onnxruntime>=1.18.0
-onnxscript>=0.1.0

+--extra-index-url https://download.pytorch.org/whl/cu126
+accelerate==1.13.0
+datasets==4.8.5
+numpy==2.4.5
+onnx==1.21.0
+onnxruntime==1.26.0
+onnxscript==0.7.0
+seqeval==1.2.2
+tensorboard>=2.14.0
+torch==2.12.0+cu126
+transformers==5.8.1

run_metadata.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "experiment_name": "dmhy-char-full-relabel",
+  "data_file": "datasets/AnimeName/dmhy_weak_char.jsonl",
+  "tokenizer_variant": "char",
+  "vocab_file": "datasets/AnimeName/vocab.char.json",
+  "vocab_size": 6199,
+  "max_seq_length": 128,
+  "hidden_size": 256,
+  "num_hidden_layers": 4,
+  "num_attention_heads": 8,
+  "intermediate_size": 1024,
+  "train_samples": 619361,
+  "eval_samples": 12641,
+  "epochs": 2.0,
+  "batch_size": 256,
+  "learning_rate": 8e-05,
+  "warmup_steps": 300,
+  "seed": 48,
+  "device": "cuda",
+  "fp16": true,
+  "gradient_accumulation_steps": 1,
+  "dataloader_num_workers": 4
+}
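
The recorded values allow a rough estimate of the schedule length. The arithmetic below assumes a single device and that the last partial batch is kept; neither assumption is stated in the metadata, so treat the result as an approximation.

```python
import math

# Values copied from run_metadata.json above.
train_samples = 619_361
batch_size = 256
gradient_accumulation_steps = 1
epochs = 2.0
warmup_steps = 300

# Assumption: one device, last partial batch is not dropped.
batches_per_epoch = math.ceil(train_samples / batch_size)
steps_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
total_steps = int(steps_per_epoch * epochs)
warmup_fraction = warmup_steps / total_steps

print(steps_per_epoch, total_steps, round(warmup_fraction, 3))
```

Under those assumptions the run takes about 2,420 optimizer steps per epoch and 4,840 in total, so the 300 warmup steps cover roughly 6% of training.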

tokenizer.py CHANGED

@@ -45,9 +45,9 @@ class AnimeTokenizer(PreTrainedTokenizer):
     # Layer 2: Individual format token patterns
     FORMAT_PATTERNS: List[str] = [
         # Resolution
-        r'\d{3,4}[pP]',
-        r'\d{3,4}[xX×]\d{3,4}',
-        r'\d[Kk]',
         # Codec
         r'[xX]26[45]',

     # Layer 2: Individual format token patterns
     FORMAT_PATTERNS: List[str] = [
         # Resolution
+        r'(?<![A-Za-z0-9])\d{3,4}[pP](?![A-Za-z0-9])',
+        r'(?<![A-Za-z0-9])\d{3,4}[xX×]\d{3,4}(?![A-Za-z0-9])',
+        r'(?<![A-Za-z0-9])\d[Kk](?![A-Za-z0-9])',
         # Codec
         r'[xX]26[45]',
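
The new resolution patterns wrap the old ones in a negative lookbehind and a negative lookahead, so a resolution only matches when it is not glued to other ASCII letters or digits. A small comparison of the old and new `\d{3,4}[pP]` pattern on a few illustrative strings (the `find` helper and the sample strings are made up for this sketch):

```python
import re

# Old and new patterns as they appear in the diff above.
OLD_RESOLUTION = re.compile(r'\d{3,4}[pP]')
NEW_RESOLUTION = re.compile(r'(?<![A-Za-z0-9])\d{3,4}[pP](?![A-Za-z0-9])')


def find(pattern: re.Pattern, text: str):
    """Return the first matched substring, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(0) if match else None


for text in ["[1080p]", "720P.mkv", "BD1440x1080p"]:
    print(f"{text!r}: old={find(OLD_RESOLUTION, text)!r} new={find(NEW_RESOLUTION, text)!r}")
```

Bracketed or punctuation-delimited resolutions still match under the new pattern, while `1080p` embedded in a run such as `BD1440x1080p` no longer matches this particular pattern because the preceding `x` is alphanumeric.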

tokenizer_config.json CHANGED

@@ -38,7 +38,7 @@
   "model_max_length": 1000000000000000019884624838656,
   "pad_token": "[PAD]",
   "sep_token": "[SEP]",
-  "tokenizer_class": "AnimeTokenizer",
-  "tokenizer_variant": "regex",
   "unk_token": "[UNK]"
 }

   "model_max_length": 1000000000000000019884624838656,
   "pad_token": "[PAD]",
   "sep_token": "[SEP]",
+  "tokenizer_class": "CharAnimeTokenizer",
+  "tokenizer_variant": "char",
   "unk_token": "[UNK]"
 }

train.py CHANGED

@@ -14,6 +14,7 @@ import json
 import tempfile
 import argparse
 import random
 from typing import Dict, List, Optional
 import numpy as np
@@ -29,7 +30,8 @@ from seqeval.metrics import classification_report, accuracy_score, f1_score, pre
 from config import Config
 from tokenizer import AnimeTokenizer, create_tokenizer, load_tokenizer
 from model import create_model, print_model_summary, count_parameters
-from dataset import AnimeDataset, align_tokens_for_tokenizer
 def compute_metrics(p):
@@ -88,10 +90,27 @@ def parse_args() -> argparse.Namespace:
                         help="Save resumable checkpoints every N steps instead of only at epoch end")
     parser.add_argument("--save-total-limit", type=int, default=2,
                         help="Maximum number of checkpoints to keep")
     parser.add_argument("--cpu", action="store_true", help="Force CPU training")
     parser.add_argument("--no-shuffle", action="store_true", help="Do not shuffle before train/eval split")
     parser.add_argument("--resume-from-checkpoint", default=None,
                         help="Resume Trainer state from a checkpoint directory, or 'auto' for the latest checkpoint")
     return parser.parse_args()
@@ -172,6 +191,118 @@ def validate_dataset_tokenizer_metadata(data: List[Dict], tokenizer_variant: str
         )
 def remap_token_embeddings(
     model: BertForTokenClassification,
     old_vocab: Dict[str, int],
@@ -220,7 +351,7 @@ def build_vocab_from_data(data: List[Dict], tokenizer: AnimeTokenizer, vocab_pat
                          max_size: Optional[int] = None) -> None:
     token_lists: List[List[str]] = []
     for item in data:
-        tokens, labels = align_tokens_for_tokenizer(item["tokens"], item["labels"], tokenizer)
         token_lists.append(tokens)
     tokenizer.build_vocab(token_lists, max_size=max_size)
@@ -250,20 +381,35 @@ def main():
         config.warmup_steps = args.warmup_steps
     if args.train_split is not None:
         config.train_split = args.train_split
     if args.max_seq_length is not None:
         config.max_seq_length = args.max_seq_length
     elif tokenizer_variant == "char":
         config.max_seq_length = max(config.max_seq_length, 128)
     random.seed(args.seed)
     np.random.seed(args.seed)
     torch.manual_seed(args.seed)
     print("Loading dataset...")
-    with open(config.data_file, 'r', encoding='utf-8') as f:
-        all_data = [json.loads(line) for line in f if line.strip()]
-    if args.limit_samples is not None:
-        all_data = all_data[:args.limit_samples]
     if not args.no_shuffle:
         random.shuffle(all_data)
     validate_dataset_tokenizer_metadata(all_data, tokenizer_variant)
@@ -280,6 +426,9 @@ def main():
     print(f"  Variant: {tokenizer_variant}")
     print(f"  Vocab size: {tokenizer.vocab_size}")
     print(f"  Max sequence length: {config.max_seq_length}")
     # Update config with actual vocab size
     config.vocab_size = tokenizer.vocab_size
@@ -288,15 +437,22 @@ def main():
     if args.init_model_dir:
         print(f"Loading model for fine-tuning: {args.init_model_dir}")
         model = BertForTokenClassification.from_pretrained(args.init_model_dir)
-        init_tokenizer = load_tokenizer(args.init_model_dir)
         init_variant = getattr(init_tokenizer, "tokenizer_variant", None)
         if init_variant != tokenizer_variant:
             print(f"  WARNING: tokenizer variant changes during fine-tune: {init_variant} -> {tokenizer_variant}")
             print("  Token embeddings will be remapped by token string; unmatched tokens are newly initialized.")
-        if model.config.vocab_size != config.vocab_size or init_tokenizer.get_vocab() != tokenizer.get_vocab():
             copied = remap_token_embeddings(
                 model=model,
-                old_vocab=init_tokenizer.get_vocab(),
                 new_vocab=tokenizer.get_vocab(),
                 pad_token_id=tokenizer.pad_token_id,
             )
@@ -316,6 +472,7 @@ def main():
         print("WARNING: Model exceeds the historical 5M target; continuing because vocab size is configurable.")
     split_idx = int(len(all_data) * config.train_split)
     train_data = all_data[:split_idx]
     eval_data = all_data[split_idx:]
@@ -350,8 +507,7 @@ def main():
     use_cpu = args.cpu or not torch.cuda.is_available()
     use_fp16 = not use_cpu
     print(f"  Device: {'CPU' if use_cpu else 'CUDA'}")
-    save_strategy = "steps" if args.checkpoint_steps else "epoch"
-    load_best_model_at_end = args.checkpoint_steps is None
     # Training arguments
     training_args = TrainingArguments(
@@ -359,20 +515,23 @@ def main():
         num_train_epochs=config.num_epochs,
         per_device_train_batch_size=config.batch_size,
         per_device_eval_batch_size=config.batch_size,
-        eval_strategy="epoch",
-        save_strategy=save_strategy,
         save_steps=args.checkpoint_steps,
         logging_steps=config.log_interval,
         learning_rate=config.learning_rate,
         weight_decay=config.weight_decay,
         warmup_steps=config.warmup_steps,
         use_cpu=use_cpu,
-        report_to="none",
         save_total_limit=args.save_total_limit,
-        load_best_model_at_end=load_best_model_at_end,
         metric_for_best_model="f1",
         greater_is_better=True,
         dataloader_num_workers=config.num_workers,
         fp16=use_fp16,
     )
@@ -410,6 +569,31 @@ def main():
     final_save_path = os.path.join(config.save_dir, "final")
     trainer.save_model(final_save_path)
     tokenizer.save_pretrained(final_save_path)
     print(f"Model saved to: {final_save_path}")
     # Final evaluation
@@ -417,6 +601,30 @@ def main():
     eval_results = trainer.evaluate()
     for key, value in eval_results.items():
         print(f"  {key}: {value:.4f}")
 if __name__ == "__main__":

 import tempfile
 import argparse
 import random
+from collections import Counter
 from typing import Dict, List, Optional
 import numpy as np
 from config import Config
 from tokenizer import AnimeTokenizer, create_tokenizer, load_tokenizer
 from model import create_model, print_model_summary, count_parameters
+from dataset import AnimeDataset, labels_for_tokenizer
+from inference import parse_filename, postprocess
 def compute_metrics(p):
                         help="Save resumable checkpoints every N steps instead of only at epoch end")
     parser.add_argument("--save-total-limit", type=int, default=2,
                         help="Maximum number of checkpoints to keep")
+    parser.add_argument("--gradient-accumulation-steps", type=int, default=1,
+                        help="Accumulate gradients across this many steps")
+    parser.add_argument("--num-workers", type=int, default=None,
+                        help="DataLoader worker count. Defaults to config.num_workers")
     parser.add_argument("--cpu", action="store_true", help="Force CPU training")
     parser.add_argument("--no-shuffle", action="store_true", help="Do not shuffle before train/eval split")
     parser.add_argument("--resume-from-checkpoint", default=None,
                         help="Resume Trainer state from a checkpoint directory, or 'auto' for the latest checkpoint")
+    parser.add_argument("--tensorboard", dest="tensorboard", action="store_true",
+                        help="Log metrics to TensorBoard in addition to stdout/checkpoints")
+    parser.add_argument("--no-tensorboard", dest="tensorboard", action="store_false",
+                        help="Disable TensorBoard logging")
+    parser.add_argument("--experiment-name", default=None,
+                        help="Optional experiment name written to run_metadata.json")
+    parser.add_argument("--parse-eval-limit", type=int, default=512,
+                        help="Run field exact-match evaluation on up to N eval samples after training; 0 disables it")
+    parser.add_argument("--hidden-size", type=int, default=None, help="Override BERT hidden size")
+    parser.add_argument("--num-hidden-layers", type=int, default=None, help="Override BERT layer count")
+    parser.add_argument("--num-attention-heads", type=int, default=None, help="Override BERT attention heads")
+    parser.add_argument("--intermediate-size", type=int, default=None, help="Override BERT FFN intermediate size")
+    parser.set_defaults(tensorboard=True)
     return parser.parse_args()
         )
+def load_jsonl(data_file: str, limit: Optional[int] = None) -> List[Dict]:
+    """Load JSONL rows, stopping early for smoke runs."""
+    data: List[Dict] = []
+    with open(data_file, "r", encoding="utf-8") as f:
+        for line in f:
+            line = line.strip()
+            if not line:
+                continue
+            data.append(json.loads(line))
+            if limit is not None and len(data) >= limit:
+                break
+    return data
+def normalize_field_value(field: str, value) -> Optional[str]:
+    if value is None:
+        return None
+    if field in {"episode", "season"}:
+        try:
+            return str(int(value))
+        except (TypeError, ValueError):
+            return str(value).strip().lower()
+    text = str(value).strip()
+    if field in {"resolution", "source"}:
+        return text.lower().replace("_", "-")
+    return " ".join(text.lower().split())
+def parse_exact_metrics(
+    samples: List[Dict],
+    model: BertForTokenClassification,
+    tokenizer: AnimeTokenizer,
+    id2label: Dict[int, str],
+    max_length: int,
+    limit: Optional[int],
+) -> Dict:
+    """Evaluate end-to-end field exact match on filenames, not just token loss."""
+    fields = ["group", "title", "season", "episode", "resolution", "source", "special"]
+    selected = [sample for sample in samples if sample.get("filename")]
+    if limit is not None and limit > 0:
+        selected = selected[:limit]
+    counter: Counter = Counter()
+    failures: List[Dict] = []
+    model.eval()
+    for sample in selected:
+        filename = sample["filename"]
+        tokens, gold_labels = labels_for_tokenizer(sample, tokenizer)
+        available = max(0, max_length - 2)
+        tokens = tokens[:available]
+        gold_labels = gold_labels[:available]
+        gold = postprocess(tokens, gold_labels, tokenizer=tokenizer, filename=filename, use_rules=True)
+        gold_entities = {label.split("-", 1)[1] for label in gold_labels if label.startswith(("B-", "I-"))}
+        for optional_field, entity in (("episode", "EPISODE"), ("season", "SEASON")):
+            if entity not in gold_entities:
+                gold[optional_field] = None
+        pred = parse_filename(
+            filename,
+            model,
+            tokenizer,
+            id2label,
+            max_length=max_length,
+            debug=False,
+            use_rules=True,
+            constrain_bio=True,
+        )
+        full_match = True
+        field_errors: Dict[str, Dict[str, Optional[str]]] = {}
+        for field in fields:
+            gold_value = normalize_field_value(field, gold.get(field))
+            pred_value = normalize_field_value(field, pred.get(field))
+            counter[f"{field}_total"] += 1
+            if gold_value == pred_value:
+                counter[f"{field}_correct"] += 1
+            else:
+                full_match = False
+                field_errors[field] = {"gold": gold_value, "pred": pred_value}
+        counter["full_total"] += 1
+        if full_match:
+            counter["full_correct"] += 1
+        elif len(failures) < 20:
+            failures.append(
+                {
+                    "filename": filename,
+                    "errors": field_errors,
+                    "gold": {field: gold.get(field) for field in fields},
+                    "pred": {field: pred.get(field) for field in fields},
+                }
+            )
+    field_accuracy = {}
+    for field in fields:
+        total = counter.get(f"{field}_total", 0)
+        correct = counter.get(f"{field}_correct", 0)
+        field_accuracy[field] = correct / total if total else 0.0
+    total = counter.get("full_total", 0)
+    correct = counter.get("full_correct", 0)
+    return {
+        "sample_count": total,
+        "field_accuracy": field_accuracy,
+        "field_correct": {field: counter.get(f"{field}_correct", 0) for field in fields},
+        "field_total": {field: counter.get(f"{field}_total", 0) for field in fields},
+        "full_match_accuracy": correct / total if total else 0.0,
+        "full_match_correct": correct,
+        "full_match_total": total,
+        "failures": failures,
+    }
 def remap_token_embeddings(
     model: BertForTokenClassification,
     old_vocab: Dict[str, int],
                          max_size: Optional[int] = None) -> None:
     token_lists: List[List[str]] = []
     for item in data:
+        tokens, _labels = labels_for_tokenizer(item, tokenizer)
         token_lists.append(tokens)
     tokenizer.build_vocab(token_lists, max_size=max_size)
         config.warmup_steps = args.warmup_steps
     if args.train_split is not None:
         config.train_split = args.train_split
+    if args.num_workers is not None:
+        config.num_workers = args.num_workers
     if args.max_seq_length is not None:
         config.max_seq_length = args.max_seq_length
     elif tokenizer_variant == "char":
         config.max_seq_length = max(config.max_seq_length, 128)
+    if args.hidden_size is not None:
+        config.hidden_size = args.hidden_size
+    if args.num_hidden_layers is not None:
+        config.num_hidden_layers = args.num_hidden_layers
+    if args.num_attention_heads is not None:
+        config.num_attention_heads = args.num_attention_heads
+    if args.intermediate_size is not None:
+        config.intermediate_size = args.intermediate_size
+    if config.hidden_size % config.num_attention_heads != 0:
+        raise ValueError(
+            f"hidden_size ({config.hidden_size}) must be divisible by "
+            f"num_attention_heads ({config.num_attention_heads})."
+        )
+    config.max_position_embeddings = max(config.max_position_embeddings, config.max_seq_length)
     random.seed(args.seed)
     np.random.seed(args.seed)
     torch.manual_seed(args.seed)
     print("Loading dataset...")
+    all_data = load_jsonl(config.data_file, args.limit_samples)
+    if len(all_data) < 2:
+        raise ValueError("Need at least two samples so train/eval split is non-empty.")
     if not args.no_shuffle:
         random.shuffle(all_data)
     validate_dataset_tokenizer_metadata(all_data, tokenizer_variant)
     print(f"  Variant: {tokenizer_variant}")
     print(f"  Vocab size: {tokenizer.vocab_size}")
     print(f"  Max sequence length: {config.max_seq_length}")
+    if torch.cuda.is_available() and not args.cpu:
+        print(f"  CUDA device: {torch.cuda.get_device_name(0)}")
+        print("  Mixed precision: fp16")
     # Update config with actual vocab size
     config.vocab_size = tokenizer.vocab_size
     if args.init_model_dir:
         print(f"Loading model for fine-tuning: {args.init_model_dir}")
         model = BertForTokenClassification.from_pretrained(args.init_model_dir)
+        init_tokenizer = load_tokenizer(args.init_model_dir, tokenizer_variant)
+        init_vocab = init_tokenizer.get_vocab()
+        embedding_size = model.get_input_embeddings().weight.shape[0]
+        if len(init_vocab) != embedding_size:
+            print(
+                "  WARNING: init checkpoint tokenizer vocab length does not match model embedding size "
+                f"({len(init_vocab):,} vs {embedding_size:,}). Prefer a self-consistent checkpoint."
+            )
         init_variant = getattr(init_tokenizer, "tokenizer_variant", None)
         if init_variant != tokenizer_variant:
             print(f"  WARNING: tokenizer variant changes during fine-tune: {init_variant} -> {tokenizer_variant}")
             print("  Token embeddings will be remapped by token string; unmatched tokens are newly initialized.")
+        if model.config.vocab_size != config.vocab_size or init_vocab != tokenizer.get_vocab():
             copied = remap_token_embeddings(
                 model=model,
+                old_vocab=init_vocab,
                 new_vocab=tokenizer.get_vocab(),
                 pad_token_id=tokenizer.pad_token_id,
             )
         print("WARNING: Model exceeds the historical 5M target; continuing because vocab size is configurable.")
     split_idx = int(len(all_data) * config.train_split)
+    split_idx = max(1, min(len(all_data) - 1, split_idx))
     train_data = all_data[:split_idx]
     eval_data = all_data[split_idx:]
     use_cpu = args.cpu or not torch.cuda.is_available()
     use_fp16 = not use_cpu
     print(f"  Device: {'CPU' if use_cpu else 'CUDA'}")
+    eval_save_strategy = "steps" if args.checkpoint_steps else "epoch"
     # Training arguments
     training_args = TrainingArguments(
         num_train_epochs=config.num_epochs,
         per_device_train_batch_size=config.batch_size,
         per_device_eval_batch_size=config.batch_size,
+        eval_strategy=eval_save_strategy,
+        save_strategy=eval_save_strategy,
+        eval_steps=args.checkpoint_steps,
         save_steps=args.checkpoint_steps,
         logging_steps=config.log_interval,
         learning_rate=config.learning_rate,
         weight_decay=config.weight_decay,
         warmup_steps=config.warmup_steps,
+        gradient_accumulation_steps=args.gradient_accumulation_steps,
         use_cpu=use_cpu,
+        report_to=["tensorboard"] if args.tensorboard else "none",
         save_total_limit=args.save_total_limit,
+        load_best_model_at_end=True,
         metric_for_best_model="f1",
         greater_is_better=True,
         dataloader_num_workers=config.num_workers,
+        dataloader_pin_memory=not use_cpu,
         fp16=use_fp16,
     )
     final_save_path = os.path.join(config.save_dir, "final")
     trainer.save_model(final_save_path)
     tokenizer.save_pretrained(final_save_path)
+    metadata = {
+        "experiment_name": args.experiment_name,
+        "data_file": config.data_file,
+        "tokenizer_variant": tokenizer_variant,
+        "vocab_file": vocab_path,
+        "vocab_size": tokenizer.vocab_size,
+        "max_seq_length": config.max_seq_length,
+        "hidden_size": config.hidden_size,
+        "num_hidden_layers": config.num_hidden_layers,
+        "num_attention_heads": config.num_attention_heads,
+        "intermediate_size": config.intermediate_size,
+        "train_samples": len(train_dataset),
+        "eval_samples": len(eval_dataset),
+        "epochs": config.num_epochs,
+        "batch_size": config.batch_size,
+        "learning_rate": config.learning_rate,
+        "warmup_steps": config.warmup_steps,
+        "seed": args.seed,
+        "device": "cpu" if use_cpu else "cuda",
+        "fp16": use_fp16,
+        "gradient_accumulation_steps": training_args.gradient_accumulation_steps,
+        "dataloader_num_workers": config.num_workers,
+    }
+    with open(os.path.join(final_save_path, "run_metadata.json"), "w", encoding="utf-8") as f:
+        json.dump(metadata, f, ensure_ascii=False, indent=2)
     print(f"Model saved to: {final_save_path}")
     # Final evaluation
     eval_results = trainer.evaluate()
     for key, value in eval_results.items():
         print(f"  {key}: {value:.4f}")
+    with open(os.path.join(final_save_path, "trainer_eval_metrics.json"), "w", encoding="utf-8") as f:
+        json.dump({key: float(value) for key, value in eval_results.items()}, f, ensure_ascii=False, indent=2)
+    if args.parse_eval_limit != 0:
+        parse_limit = args.parse_eval_limit if args.parse_eval_limit and args.parse_eval_limit > 0 else None
+        parse_metrics = parse_exact_metrics(
+            eval_data,
+            trainer.model,
+            tokenizer,
+            config.id2label,
+            config.max_seq_length,
+            parse_limit,
+        )
+        with open(os.path.join(final_save_path, "parse_eval_metrics.json"), "w", encoding="utf-8") as f:
+            json.dump(parse_metrics, f, ensure_ascii=False, indent=2)
+        print("\nParse exact-match evaluation:")
+        print(
+            f"  full_match: {parse_metrics['full_match_correct']}/"
+            f"{parse_metrics['full_match_total']} ({parse_metrics['full_match_accuracy']:.4f})"
+        )
+        for field, accuracy in parse_metrics["field_accuracy"].items():
+            correct = parse_metrics["field_correct"][field]
+            total = parse_metrics["field_total"][field]
+            print(f"  {field}: {correct}/{total} ({accuracy:.4f})")
 if __name__ == "__main__":
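
Note on the `split_idx` clamp added in the train.py diff above: `max(1, min(len(all_data) - 1, split_idx))` keeps at least one sample on each side of the split whenever the dataset has two or more samples. With a single sample, the outer `max` still returns 1, so the eval slice is empty. A minimal sketch of that behavior follows; `clamp_split_index` is a hypothetical helper written for illustration, not a function in the repository.

```python
def clamp_split_index(n_samples: int, train_split: float) -> int:
    """Mirror the two split_idx lines from the diff for a dataset of n_samples."""
    split_idx = int(n_samples * train_split)
    # Same clamp as the added line: at least one sample on each side
    # of the split whenever n_samples >= 2.
    return max(1, min(n_samples - 1, split_idx))


if __name__ == "__main__":
    # (n_samples, train_split) pairs chosen to hit the edge cases.
    for n, ratio in [(10, 0.9), (10, 1.0), (10, 0.0), (2, 0.95), (1, 0.9)]:
        idx = clamp_split_index(n, ratio)
        print(f"n={n} ratio={ratio}: train={idx} eval={n - idx}")
```

If single-sample datasets can reach this code path, an explicit length check before the split would presumably be needed; the diff does not show one.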

trainer_eval_metrics.json ADDED

@@ -0,0 +1,11 @@

+{
+  "eval_loss": 0.01631847210228443,
+  "eval_precision": 0.9799749533444652,
+  "eval_recall": 0.986698478236683,
+  "eval_f1": 0.9833252228334185,
+  "eval_accuracy": 0.9943065860243627,
+  "eval_runtime": 39.3604,
+  "eval_samples_per_second": 321.161,
+  "eval_steps_per_second": 1.27,
+  "epoch": 2.0
+}
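
A quick consistency check on the metrics above, assuming `eval_f1` is the harmonic mean of `eval_precision` and `eval_recall` (the file does not say how it was computed, but the numbers agree with that reading). The sketch also estimates the number of evaluated samples from `eval_runtime` and `eval_samples_per_second`; that figure is an approximation derived from rounded values, not something reported in the file. `harmonic_f1` is a hypothetical helper name.

```python
def harmonic_f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from trainer_eval_metrics.json in the diff above.
metrics = {
    "eval_precision": 0.9799749533444652,
    "eval_recall": 0.986698478236683,
    "eval_f1": 0.9833252228334185,
    "eval_runtime": 39.3604,
    "eval_samples_per_second": 321.161,
}

if __name__ == "__main__":
    recomputed = harmonic_f1(metrics["eval_precision"], metrics["eval_recall"])
    print(f"reported f1:   {metrics['eval_f1']:.6f}")
    print(f"recomputed f1: {recomputed:.6f}")
    # Approximate sample count: runtime (s) * throughput (samples/s).
    estimated = metrics["eval_runtime"] * metrics["eval_samples_per_second"]
    print(f"estimated eval samples: {estimated:.0f}")
```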

training_args.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:d71d921b0df7747e0ef56e0c8d857b27141dc8dfa47a8c93c20f39216b35e0db
 size 5265

 version https://git-lfs.github.com/spec/v1
+oid sha256:b5aa0df615ce731796aa9934b0505e00a685611be134c071d7b2487d8112dde1
 size 5265

uv.lock ADDED

The diff for this file is too large to render. See raw diff

vocab.char.json CHANGED

@@ -56,8 +56,8 @@
   "N": 54,
   "3": 55,
   "(": 56,
-  ")": 57,
-  "K": 58,
   "g": 59,
   "y": 60,
   "O": 61,

   "N": 54,
   "3": 55,
   "(": 56,
+  "K": 57,
+  ")": 58,
   "g": 59,
   "y": 60,
   "O": 61,
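
The hunk above swaps the ids of `K` and `)`. A checkpoint trained against the old vocab.char.json would hold those two embedding rows in the opposite order, which is the kind of mismatch the train.py change addresses by remapping embeddings by token string. The sketch below illustrates that idea with plain Python lists and toy ids. `remap_rows` is a hypothetical helper; it is not the repository's `remap_token_embeddings`, whose implementation is not shown in this diff.

```python
def remap_rows(old_vocab, new_vocab, old_rows, init_row):
    """Build rows for new_vocab by copying each token's row from old_rows.

    Tokens missing from old_vocab get a fresh copy of init_row.
    Returns (new_rows, copied_count).
    """
    new_rows = [list(init_row) for _ in range(max(new_vocab.values()) + 1)]
    copied = 0
    for token, new_id in new_vocab.items():
        old_id = old_vocab.get(token)
        if old_id is not None and old_id < len(old_rows):
            # Match by token string, not by position.
            new_rows[new_id] = list(old_rows[old_id])
            copied += 1
    return new_rows, copied


if __name__ == "__main__":
    # Toy ids standing in for the 57/58 swap in the diff.
    old_vocab = {"(": 0, ")": 1, "K": 2}
    new_vocab = {"(": 0, "K": 1, ")": 2}
    old_rows = [[0.1], [0.2], [0.3]]
    new_rows, copied = remap_rows(old_vocab, new_vocab, old_rows, init_row=[0.0])
    print(new_rows, copied)
```

Matching by string means an id reshuffle like this one costs nothing: every token keeps its learned row, and only tokens absent from the old vocabulary start from the initial row.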

vocab.json CHANGED

The diff for this file is too large to render. See raw diff