ModerRAS committed on
Commit
e63569d
·
1 Parent(s): b57780c

Improve anime filename parser model

.gitignore CHANGED
@@ -1,9 +1,14 @@
  __pycache__/
  *.pyc
+ .venv/
+ .pytest_cache/
+ .ruff_cache/
  logs/
  checkpoints/
  test_checkpoints*/
  ab_checkpoints*/
+ *.log
+ *.onnx.data
  data/**/*.jsonl
  !data/synthetic_small.jsonl
  !data/test_smoke.jsonl
MAINTENANCE.md CHANGED
@@ -35,10 +35,9 @@ git submodule update --init --recursive
  Current DMHY snapshot:
 
  ```text
- last_file_id: 689304
- next_min_id: 689305
- labeled_samples: 263042
- mixed_train_samples: 363042
+ labeled_samples: 632002
+ char_vocab_size: 6199
+ strict_bio_violations: 0
  ```
 
  The authoritative dataset files live in `datasets/AnimeName`.
@@ -46,17 +45,21 @@ The authoritative dataset files live in `datasets/AnimeName`.
  ## Train
 
  ```bash
- python -m pip install -r requirements.txt
- python train.py \
- --data-file datasets/AnimeName/mixed_train.jsonl \
- --vocab-file datasets/AnimeName/vocab.json \
- --save-dir checkpoints/dmhy-finetune \
+ uv sync
+ uv run python train.py \
+ --tokenizer char \
+ --data-file datasets/AnimeName/dmhy_weak_char.jsonl \
+ --vocab-file datasets/AnimeName/vocab.char.json \
+ --save-dir checkpoints/dmhy-char-full-relabel \
  --init-model-dir . \
- --epochs 1 \
- --batch-size 128 \
- --learning-rate 0.0003 \
+ --epochs 2 \
+ --batch-size 256 \
+ --learning-rate 0.00008 \
  --warmup-steps 300 \
- --seed 42
+ --max-seq-length 128 \
+ --checkpoint-steps 1000 \
+ --parse-eval-limit 2048 \
+ --seed 48
  ```
 
  ## Publish a New Checkpoint
@@ -64,13 +67,20 @@ python train.py \
  Copy the final checkpoint to the repository root:
 
  ```powershell
- Copy-Item checkpoints/dmhy-finetune/final/config.json . -Force
- Copy-Item checkpoints/dmhy-finetune/final/model.safetensors . -Force
- Copy-Item checkpoints/dmhy-finetune/final/tokenizer_config.json . -Force
- Copy-Item checkpoints/dmhy-finetune/final/training_args.bin . -Force
- Copy-Item checkpoints/dmhy-finetune/final/vocab.json . -Force
+ Copy-Item checkpoints/dmhy-char-full-relabel/final/config.json . -Force
+ Copy-Item checkpoints/dmhy-char-full-relabel/final/model.safetensors . -Force
+ Copy-Item checkpoints/dmhy-char-full-relabel/final/tokenizer_config.json . -Force
+ Copy-Item checkpoints/dmhy-char-full-relabel/final/training_args.bin . -Force
+ Copy-Item checkpoints/dmhy-char-full-relabel/final/vocab.json . -Force
+ Copy-Item datasets/AnimeName/vocab.char.json .\vocab.char.json -Force
+ Copy-Item checkpoints/dmhy-char-full-relabel/final/run_metadata.json . -Force
+ Copy-Item checkpoints/dmhy-char-full-relabel/final/trainer_eval_metrics.json . -Force
+ Copy-Item checkpoints/dmhy-char-full-relabel/final/parse_eval_metrics.json . -Force
  ```
 
+ There is no tracked `model/` duplicate. The root checkpoint is the publishing
+ surface; ignored `checkpoints/` directories are training artifacts.
+
  Then commit and push:
 
  ```bash
README.md CHANGED
@@ -19,7 +19,7 @@ language:
 
  AniFileBERT is a tiny BERT token-classification model for parsing anime release filenames into structured fields such as release group, title, season, episode, resolution, source, and special tags.
 
- The checkpoint in this repository is the DMHY weak-label fine-tuned regex-tokenizer model used by MiruPlay.
+ The checkpoint in this repository is the full-relabel DMHY character-token model used by MiruPlay.
 
  ## Model
 
@@ -28,9 +28,9 @@ The checkpoint in this repository is the DMHY weak-label fine-tuned regex-tokeni
  - Layers: 4
  - Attention heads: 8
  - Labels: BIO token labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, and `SPECIAL`
- - Tokenizer: custom regex/structure tokenizer implemented in `tokenizer.py`
- - Max sequence length: 64
- - Parameters: about 5M
+ - Tokenizer: custom character tokenizer implemented in `tokenizer.py`
+ - Max sequence length: 128
+ - Parameters: 4,783,631
 
  The model files are stored at the repository root so `BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")` can load the weights. Use `inference.py` for end-to-end parsing because the tokenizer is custom rather than a standard WordPiece tokenizer.
 
@@ -47,52 +47,40 @@ Current DMHY export waterline (from `datasets/AnimeName`):
 
  ## Vocabulary
 
- The default `vocab.json` contains **8000 tokens** (up from 3000) built from frequency
- analysis of the full 632K DMHY weak-label dataset. Tokens not in the vocabulary
- become `[UNK]`, so larger vocabulary directly improves coverage:
+ The published checkpoint uses a character vocabulary. `vocab.json` at the
+ repository root is the deployed tokenizer vocab, and `vocab.char.json` is kept
+ as a mirrored explicit copy for training/data maintenance. The full DMHY weak
+ dataset has **6195 unique characters**, so the complete character vocab is only
+ **6199** entries including special tokens and reaches 100% token coverage.
 
- | Vocab size | Coverage | Model params |
- |------------|----------|-------------|
- | 3000 (old) | 90.4% | ~4.0M |
- | 8000 (current) | 96.2% | ~5.3M |
-
- Common fansub group names (`Snow`, `LoliHouse`, `DMG`, `KTXP`, `Sakurato`, etc.)
- and individual bracket characters (`[`, `]`, `(`, `)`) are included in the new
- vocabulary.
-
- For character-token training, `vocab.char.json` is mirrored at the repository
- root for plain `git pull` users and also lives at
- `datasets/AnimeName/vocab.char.json` beside the dataset. It is built from the
- full `dmhy_weak_char.jsonl` export. The full DMHY weak dataset has **6195
- unique characters**, so the complete character vocab is only **6199** entries
- including special tokens and reaches 100% token coverage.
+ The regex vocabulary is still maintained in `datasets/AnimeName/vocab.json` for
+ dataset relabeling and diagnostics, but the root checkpoint loads as `char`.
 
  ## Evaluation
 
- Balanced mixed-data A/B run (`50K` synthetic + `50K` DMHY weak labels, 1 epoch, batch size 128, seed 42):
-
- | Variant | Max length | Vocab | Params | Eval F1 | Accuracy | Train runtime |
- |---------|------------|-------|--------|---------|----------|---------------|
- | regex | 64 | 3000 | 3.96M | 0.9911 | 0.9951 | 827s |
- | char | 128 | 2654 | 3.88M | 0.8142 | 0.9637 | 1983s |
+ Final full-relabel char training (`632002` DMHY rows, 2 epochs, batch size 256,
+ seed 48):
 
- Field-level F1 on the same validation split:
+ | Metric | Value |
+ |--------|-------|
+ | Eval loss | 0.0163 |
+ | Entity precision | 0.9800 |
+ | Entity recall | 0.9867 |
+ | Entity F1 | 0.9833 |
+ | Token accuracy | 0.9943 |
+ | Held-out parse full match | 2008/2048 (0.9805) |
+ | Fixed regression full match | 21/21 (1.0000) |
 
- | Field | regex | char |
- |-------|-------|------|
- | GROUP | 0.9962 | 0.9516 |
- | TITLE | 0.9761 | 0.7983 |
- | SEASON | 0.9880 | 0.6290 |
- | EPISODE | 0.9950 | 0.8082 |
-
- The regex tokenizer remains the default. Both variants can parse simple `S01E07`, but the character tokenizer was weaker on season/episode boundaries and long title spans.
+ The fixed regression set includes second-season aliases such as `Ni`,
+ `Ni no Sara`, `貳`, and `弐ノ章`, plus long-running episode IDs and dense meta
+ blocks.
 
  ## Usage
 
  Install dependencies:
 
  ```bash
- pip install -r requirements.txt
+ uv sync
  ```
 
  Parse a filename with this repository cloned locally:
@@ -121,47 +109,25 @@ git submodule update --init --recursive
 
  ## Training
 
- ### Prerequisites (Windows / Local GPU)
-
- PyTorch 2.11+ with CUDA 12.6 is required for GPU training:
-
- ```bash
- pip install torch --index-url https://download.pytorch.org/whl/cu126
- pip install -r requirements.txt
- ```
-
- ### Fine-tune with rebuilt vocabulary
-
- ```bash
- python train.py --data-file datasets/AnimeName/dmhy_weak.jsonl \
- --vocab-file datasets/AnimeName/vocab.json \
- --save-dir checkpoints/dmhy-finetune \
- --init-model-dir . \
- --epochs 10 --batch-size 128 \
- --learning-rate 0.0003 --warmup-steps 300 --seed 42
- ```
-
- The model loads the old 3000-token checkpoint, `resize_token_embeddings()` adds
- 5000 new randomly-initialized slots for the new vocabulary, and fine-tuning
- trains the full model. About 96% of token occurrences are now covered (vs 90%
- with the old 3000-token vocabulary).
-
  ### Character-token DMHY training
 
  ```bash
- python convert_to_char_dataset.py \
+ uv run python convert_to_char_dataset.py \
  --input datasets/AnimeName/dmhy_weak.jsonl \
  --output datasets/AnimeName/dmhy_weak_char.jsonl \
- --vocab-output vocab.char.json \
+ --vocab-output datasets/AnimeName/vocab.char.json \
  --manifest-output datasets/AnimeName/dmhy_weak_char.manifest.json
 
- python train.py --tokenizer char \
+ uv run python train.py --tokenizer char \
  --data-file datasets/AnimeName/dmhy_weak_char.jsonl \
- --vocab-file vocab.char.json \
- --save-dir checkpoints_char/dmhy-weak-char \
- --epochs 1 --batch-size 64 \
- --learning-rate 0.0003 --warmup-steps 300 \
- --max-seq-length 128 --seed 42
+ --vocab-file datasets/AnimeName/vocab.char.json \
+ --save-dir checkpoints/dmhy-char-full-relabel \
+ --init-model-dir . \
+ --epochs 2 --batch-size 256 \
+ --learning-rate 0.00008 --warmup-steps 300 \
+ --checkpoint-steps 1000 --save-total-limit 3 \
+ --parse-eval-limit 2048 \
+ --max-seq-length 128 --seed 48
  ```
 
  The converter keeps source metadata and adds `tokenizer_variant`, source token
@@ -169,12 +135,21 @@ count, and character token count fields to each record. The char dataset's
  p99 length is 107 characters, so `--max-seq-length 128` covers almost all rows
  while leaving room for `[CLS]` and `[SEP]`.
 
- ### Regenerate datasets from source
+ ### Relabel the full dataset
 
  ```bash
- python data_generator.py --num-samples 100000
- python dmhy_dataset.py --db D:/WorkSpace/Python/dmhy-parser/dmhy_anime.db --output data/dmhy/dmhy_weak.jsonl
- python mix_datasets.py --synthetic data/synthetic.jsonl --dmhy data/dmhy/dmhy_weak.jsonl --output data/dmhy/mixed_train.jsonl
+ uv run python relabel_dataset_from_filenames.py \
+ --input datasets/AnimeName/dmhy_weak.jsonl \
+ --output datasets/AnimeName/dmhy_weak.relabel.jsonl \
+ --manifest-output datasets/AnimeName/dmhy_weak.relabel.manifest.json \
+ --vocab-output datasets/AnimeName/vocab.relabel.json \
+ --base-vocab datasets/AnimeName/vocab.json \
+ --max-vocab-size 8000
+
+ Move-Item datasets/AnimeName/dmhy_weak.relabel.jsonl datasets/AnimeName/dmhy_weak.jsonl -Force
+ Move-Item datasets/AnimeName/vocab.relabel.json datasets/AnimeName/vocab.json -Force
+ Copy-Item datasets/AnimeName/dmhy_weak.relabel.manifest.json datasets/AnimeName/dmhy_weak.manifest.json -Force
+ Remove-Item datasets/AnimeName/dmhy_weak.relabel.manifest.json -Force
  ```
 
  ### Rebuild vocabulary (if needed)
@@ -192,7 +167,7 @@ json.dump(vocab, open('vocab.json','w'), ensure_ascii=False, indent=2)
  ### Export ONNX for MiruPlay Android
 
  ```bash
- python export_onnx.py --model-dir checkpoints/dmhy-finetune/final --output exports/anime_filename_parser.onnx
+ uv run python export_onnx.py --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
  ```
 
  ---
@@ -213,14 +188,13 @@ python colab_train.py --profile dmhy_regex_finetune
 
  ## Repository Layout
 
- - `model.safetensors`, `config.json`, `vocab.json`: default fine-tuned model
+ - `model.safetensors`, `config.json`, `vocab.json`: default published model
  - `train.py`, `dataset.py`, `tokenizer.py`, `model.py`: training pipeline
  - `dmhy_dataset.py`, `mix_datasets.py`: weak-label export and dataset mixing
  - `convert_to_char_dataset.py`: full character-token projection for weak labels
  - `inference.py`: end-to-end filename parser CLI
  - `export_onnx.py`: ONNX export for Android integration
  - `exports/`: exported ONNX model and metadata
- - `data/dmhy/*.manifest.json`: dataset waterlines and counts
  - `datasets/AnimeName/`: nested dataset submodule
 
  ## Maintenance Notes
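Review note on the README's new Evaluation table: the rounded figures are mutually consistent, which a few lines of arithmetic can confirm. This is a minimal sketch that uses only the values quoted in the table. The helper name `f1_score` is illustrative and not part of the repository, and the training code presumably derives F1 from entity counts rather than from rounded precision and recall.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the Evaluation table in the README diff.
entity_f1 = f1_score(0.9800, 0.9867)
parse_full_match = 2008 / 2048

print(f"{entity_f1:.4f}")         # 0.9833
print(f"{parse_full_match:.4f}")  # 0.9805
```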
build_repair_focus_dataset.py ADDED
@@ -0,0 +1,151 @@
+ """Build a small fine-tuning set focused on repaired filename structures."""
+
+ from __future__ import annotations
+
+ import argparse
+ import json
+ import random
+ from pathlib import Path
+ from typing import Iterable, List
+
+ from label_repairs import repair_jsonl_item
+
+
+ def parse_args() -> argparse.Namespace:
+     parser = argparse.ArgumentParser(description="Build repair-focused char JSONL fine-tune data")
+     parser.add_argument("--input", required=True, help="Repaired char JSONL dataset")
+     parser.add_argument("--output", required=True, help="Output focus JSONL")
+     parser.add_argument("--context-samples", type=int, default=50000,
+                         help="Random non-repaired rows to include for stability")
+     parser.add_argument("--repeat-repaired", type=int, default=4,
+                         help="Repeat rows that still trigger a repair pass")
+     parser.add_argument("--repeat-manual", type=int, default=24,
+                         help="Repeat hand-labeled hard cases")
+     parser.add_argument("--seed", type=int, default=42)
+     return parser.parse_args()
+
+
+ def char_item(filename: str, spans: List[tuple[str, str]]) -> dict:
+     tokens = list(filename)
+     labels = ["O"] * len(tokens)
+     cursor = 0
+     for text, entity in spans:
+         start = filename.find(text, cursor)
+         if start < 0:
+             start = filename.find(text)
+         if start < 0:
+             raise ValueError(f"Could not find span {text!r} in {filename!r}")
+         end = start + len(text)
+         labels[start] = f"B-{entity}"
+         for idx in range(start + 1, end):
+             labels[idx] = f"I-{entity}"
+         cursor = end
+     return {
+         "filename": filename,
+         "tokens": tokens,
+         "labels": labels,
+         "tokenizer_variant": "char",
+         "source": "manual_repair_focus",
+     }
+
+
+ def manual_cases() -> Iterable[dict]:
+     yield char_item(
+         "[AI-Raws] 炎炎の消防隊 弐ノ章 #13 (BD HEVC 1920x1080 yuv444p10le FLAC)[FC74A2D5].mkv",
+         [
+             ("AI-Raws", "GROUP"),
+             ("炎炎の消防隊", "TITLE"),
+             ("弐ノ章", "SEASON"),
+             ("13", "EPISODE"),
+             ("BD", "SOURCE"),
+             ("HEVC", "SOURCE"),
+             ("1920x1080", "RESOLUTION"),
+             ("FLAC", "SOURCE"),
+         ],
+     )
+     yield char_item(
+         "[AI-Raws] 炎炎の消防隊 弐ノ章 #01 (BD HEVC 1920x1080 FLAC).mkv",
+         [
+             ("AI-Raws", "GROUP"),
+             ("炎炎の消防隊", "TITLE"),
+             ("弐ノ章", "SEASON"),
+             ("01", "EPISODE"),
+             ("BD", "SOURCE"),
+             ("HEVC", "SOURCE"),
+             ("1920x1080", "RESOLUTION"),
+             ("FLAC", "SOURCE"),
+         ],
+     )
+     yield char_item(
+         "[DBD-Raws][炎炎消防队 貳之章][01][1080P][BDRip][HEVC-10bit][FLAC]",
+         [
+             ("DBD-Raws", "GROUP"),
+             ("炎炎消防队", "TITLE"),
+             ("貳之章", "SEASON"),
+             ("01", "EPISODE"),
+             ("1080P", "RESOLUTION"),
+             ("BDRip", "SOURCE"),
+             ("FLAC", "SOURCE"),
+         ],
+     )
+
+
+ def main() -> None:
+     args = parse_args()
+     rng = random.Random(args.seed)
+     input_path = Path(args.input)
+     output_path = Path(args.output)
+
+     repaired_rows: List[dict] = []
+     reservoir: List[dict] = []
+     seen_filenames = set()
+     total_rows = 0
+
+     with input_path.open("r", encoding="utf-8") as handle:
+         for line in handle:
+             if not line.strip():
+                 continue
+             total_rows += 1
+             item = json.loads(line)
+             _repaired_item, repairs = repair_jsonl_item(item)
+             filename = item.get("filename")
+             if repairs:
+                 repaired_rows.append(item)
+                 if filename:
+                     seen_filenames.add(filename)
+                 continue
+             if filename in seen_filenames:
+                 continue
+             if len(reservoir) < args.context_samples:
+                 reservoir.append(item)
+             else:
+                 index = rng.randrange(total_rows)
+                 if index < args.context_samples:
+                     reservoir[index] = item
+
+     rows: List[dict] = []
+     for item in repaired_rows:
+         rows.extend([item] * max(1, args.repeat_repaired))
+     rows.extend(reservoir)
+     for item in manual_cases():
+         rows.extend([item] * max(1, args.repeat_manual))
+
+     rng.shuffle(rows)
+     output_path.parent.mkdir(parents=True, exist_ok=True)
+     with output_path.open("w", encoding="utf-8") as handle:
+         for item in rows:
+             handle.write(json.dumps(item, ensure_ascii=False, separators=(",", ":")) + "\n")
+
+     print(json.dumps({
+         "input": str(input_path),
+         "output": str(output_path),
+         "total_rows": total_rows,
+         "repaired_rows": len(repaired_rows),
+         "context_rows": len(reservoir),
+         "manual_rows": len(list(manual_cases())),
+         "written_rows": len(rows),
+     }, ensure_ascii=False, indent=2))
+
+
+ if __name__ == "__main__":
+     main()
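Review note on the context-row selection in `main()`: the non-repaired rows are kept in a fixed-size reservoir. For comparison, here is a minimal standalone sketch of textbook reservoir sampling (Algorithm R). `reservoir_sample` is a hypothetical helper, not part of this commit. It draws the replacement index over the number of candidate rows seen so far, whereas the script draws over `total_rows`, which also counts repaired rows and rows skipped as duplicates, so the two need not yield the same distribution.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream."""
    reservoir: List[T] = []
    for seen, item in enumerate(items, start=1):
        if len(reservoir) < k:
            # Fill phase: the first k candidates are always kept.
            reservoir.append(item)
            continue
        # Replacement phase: draw over the candidates seen so far,
        # including the current one, and replace only if the draw lands
        # inside the reservoir.
        index = rng.randrange(seen)
        if index < k:
            reservoir[index] = item
    return reservoir


sample = reservoir_sample(range(1000), 10, random.Random(42))
short_stream = reservoir_sample(range(3), 10, random.Random(0))
```

When the stream is shorter than `k`, every item is kept in order, which matches the fill phase of the script.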
case_metrics.json ADDED
@@ -0,0 +1,459 @@
1
+ {
2
+ "model_dir": ".",
3
+ "case_file": "data\\parser_regression_cases.json",
4
+ "tokenizer_variant": "char",
5
+ "max_length": 128,
6
+ "use_rules": true,
7
+ "constrain_bio": true,
8
+ "case_count": 21,
9
+ "full_correct": 21,
10
+ "full_accuracy": 1.0,
11
+ "field_correct": {
12
+ "group": 18,
13
+ "title": 21,
14
+ "episode": 21,
15
+ "resolution": 21,
16
+ "source": 14,
17
+ "season": 8,
18
+ "special": 1
19
+ },
20
+ "field_total": {
21
+ "group": 18,
22
+ "title": 21,
23
+ "episode": 21,
24
+ "resolution": 21,
25
+ "source": 14,
26
+ "season": 8,
27
+ "special": 1
28
+ },
29
+ "field_accuracy": {
30
+ "episode": 1.0,
31
+ "group": 1.0,
32
+ "resolution": 1.0,
33
+ "season": 1.0,
34
+ "source": 1.0,
35
+ "special": 1.0,
36
+ "title": 1.0
37
+ },
38
+ "failures": [],
39
+ "results": [
40
+ {
41
+ "id": "lolihouse_dash_episode",
42
+ "filename": "[LoliHouse] Yomi no Tsugai - 07 [WebRip 1080p HEVC-10bit AAC ASSx2]",
43
+ "ok": true,
44
+ "errors": {},
45
+ "expected": {
46
+ "group": "LoliHouse",
47
+ "title": "Yomi no Tsugai",
48
+ "episode": 7,
49
+ "resolution": "1080p",
50
+ "source": "WebRip"
51
+ },
52
+ "pred": {
53
+ "episode": 7,
54
+ "group": "LoliHouse",
55
+ "resolution": "1080p",
56
+ "source": "WebRip",
57
+ "title": "Yomi no Tsugai"
58
+ }
59
+ },
60
+ {
61
+ "id": "dot_season_episode_no_group",
62
+ "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub",
63
+ "ok": true,
64
+ "errors": {},
65
+ "expected": {
66
+ "title": "Witch.Hat.Atelier",
67
+ "season": 1,
68
+ "episode": 7,
69
+ "group": null,
70
+ "resolution": "1080p",
71
+ "source": "NF"
72
+ },
73
+ "pred": {
74
+ "episode": 7,
75
+ "group": null,
76
+ "resolution": "1080p",
77
+ "season": 1,
78
+ "source": "NF",
79
+ "title": "Witch.Hat.Atelier"
80
+ }
81
+ },
82
+ {
83
+ "id": "ani_cjk_season_dash_episode",
84
+ "filename": "[ANi] 異世界悠閒農家 2 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]",
85
+ "ok": true,
86
+ "errors": {},
87
+ "expected": {
88
+ "group": "ANi",
89
+ "title": "異世界悠閒農家",
90
+ "season": 2,
91
+ "episode": 6,
92
+ "resolution": "1080P",
93
+ "source": "Baha"
94
+ },
95
+ "pred": {
96
+ "episode": 6,
97
+ "group": "ANi",
98
+ "resolution": "1080P",
99
+ "season": 2,
100
+ "source": "Baha",
101
+ "title": "異世界悠閒農家"
102
+ }
103
+ },
104
+ {
105
+ "id": "kisssub_bracket_title_episode",
106
+ "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][GB][MP4]",
107
+ "ok": true,
108
+ "errors": {},
109
+ "expected": {
110
+ "group": "KissSub",
111
+ "title": "Shunkashuutou Daikousha - Haru no Mai",
112
+ "episode": 5,
113
+ "resolution": "1080P",
114
+ "source": "GB"
115
+ },
116
+ "pred": {
117
+ "episode": 5,
118
+ "group": "KissSub",
119
+ "resolution": "1080P",
120
+ "source": "GB",
121
+ "title": "Shunkashuutou Daikousha - Haru no Mai"
122
+ }
123
+ },
124
+ {
125
+ "id": "airotabracket_title_episode",
126
+ "filename": "[Airota][Sousou no Frieren][29][1080p AVC AAC][CHT]",
127
+ "ok": true,
128
+ "errors": {},
129
+ "expected": {
130
+ "group": "Airota",
131
+ "title": "Sousou no Frieren",
132
+ "episode": 29,
133
+ "resolution": "1080p",
134
+ "source": "CHT"
135
+ },
136
+ "pred": {
137
+ "episode": 29,
138
+ "group": "Airota",
139
+ "resolution": "1080p",
140
+ "source": "CHT",
141
+ "title": "Sousou no Frieren"
142
+ }
143
+ },
144
+ {
145
+ "id": "subsplease_parenthesized_resolution",
146
+ "filename": "[SubsPlease] Mushoku Tensei - 12 (1080p) [x265][AAC]",
147
+ "ok": true,
148
+ "errors": {},
149
+ "expected": {
150
+ "group": "SubsPlease",
151
+ "title": "Mushoku Tensei",
152
+ "episode": 12,
153
+ "resolution": "1080p"
154
+ },
155
+ "pred": {
156
+ "episode": 12,
157
+ "group": "SubsPlease",
158
+ "resolution": "1080p",
159
+ "title": "Mushoku Tensei"
160
+ }
161
+ },
162
+ {
163
+ "id": "vcb_bracket_episode",
164
+ "filename": "[VCB-Studio] Girls Band Cry [01][Ma10p_1080p][x265_flac]",
165
+ "ok": true,
166
+ "errors": {},
167
+ "expected": {
168
+ "group": "VCB-Studio",
169
+ "title": "Girls Band Cry",
170
+ "episode": 1,
171
+ "resolution": "1080p"
172
+ },
173
+ "pred": {
174
+ "episode": 1,
175
+ "group": "VCB-Studio",
176
+ "resolution": "1080p",
177
+ "title": "Girls Band Cry"
178
+ }
179
+ },
180
+ {
181
+ "id": "numeric_title_not_episode",
182
+ "filename": "86 Eighty Six - 01 [1080P][Baha]",
183
+ "ok": true,
184
+ "errors": {},
185
+ "expected": {
186
+ "title": "86 Eighty Six",
187
+ "episode": 1,
188
+ "resolution": "1080P",
189
+ "source": "Baha"
190
+ },
191
+ "pred": {
192
+ "episode": 1,
193
+ "resolution": "1080P",
194
+ "source": "Baha",
195
+ "title": "86 Eighty Six"
196
+ }
197
+ },
198
+ {
199
+ "id": "erai_raws_dash_episode",
200
+ "filename": "[Erai-raws] Sousou no Frieren - 01 [1080p][Multiple Subtitle][ENG]",
201
+ "ok": true,
202
+ "errors": {},
203
+ "expected": {
204
+ "group": "Erai-raws",
205
+ "title": "Sousou no Frieren",
206
+ "episode": 1,
207
+ "resolution": "1080p"
208
+ },
209
+ "pred": {
210
+ "episode": 1,
211
+ "group": "Erai-raws",
212
+ "resolution": "1080p",
213
+ "title": "Sousou no Frieren"
214
+ }
215
+ },
216
+ {
217
+ "id": "nekomoe_space_group",
218
+ "filename": "[Nekomoe kissaten][Watashi no Shiawase na Kekkon][01][1080p][JPSC]",
219
+ "ok": true,
220
+ "errors": {},
221
+ "expected": {
222
+ "group": "Nekomoe kissaten",
223
+ "title": "Watashi no Shiawase na Kekkon",
224
+ "episode": 1,
225
+ "resolution": "1080p"
226
+ },
227
+ "pred": {
228
+ "episode": 1,
229
+ "group": "Nekomoe kissaten",
230
+ "resolution": "1080p",
231
+ "title": "Watashi no Shiawase na Kekkon"
232
+ }
233
+ },
234
+ {
235
+ "id": "long_running_episode",
236
+ "filename": "One.Piece.1110.1080p.WEB-DL.AAC2.0.H.264",
237
+ "ok": true,
238
+ "errors": {},
239
+ "expected": {
240
+ "title": "One.Piece",
241
+ "episode": 1110,
242
+ "resolution": "1080p",
243
+ "source": "WEB-DL"
244
+ },
245
+ "pred": {
246
+ "episode": 1110,
247
+ "resolution": "1080p",
248
+ "source": "WEB-DL",
249
+ "title": "One.Piece"
250
+ }
251
+ },
252
+ {
253
+ "id": "season_episode_amzn",
254
+ "filename": "Example.Show.S02E03.2160p.AMZN.WEB-DL.DDP5.1.H.265",
255
+ "ok": true,
256
+ "errors": {},
257
+ "expected": {
258
+ "title": "Example.Show",
259
+ "season": 2,
260
+ "episode": 3,
261
+ "resolution": "2160p",
262
+ "source": "AMZN"
263
+ },
264
+ "pred": {
265
+ "episode": 3,
266
+ "resolution": "2160p",
267
+ "season": 2,
268
+ "source": "AMZN",
269
+ "title": "Example.Show"
270
+ }
271
+ },
272
+ {
273
+ "id": "cjk_group_with_prefix_tag",
274
+ "filename": "【喵萌奶茶屋】★04月新番★[葬送的芙莉莲][01][1080P][HEVC]",
275
+ "ok": true,
276
+ "errors": {},
277
+ "expected": {
278
+ "group": "喵萌奶茶屋",
279
+ "title": "葬送的芙莉莲",
280
+ "episode": 1,
281
+ "resolution": "1080P"
282
+ },
283
+ "pred": {
284
+ "episode": 1,
285
+ "group": "喵萌奶茶屋",
286
+ "resolution": "1080P",
287
+ "title": "葬送的芙莉莲"
288
+ }
289
+ },
290
+ {
291
+ "id": "leading_meta_not_group",
292
+ "filename": "[1080p] Witch Watch - 15 [CHS]",
293
+ "ok": true,
294
+ "errors": {},
295
+ "expected": {
296
+ "group": null,
297
+ "title": "Witch Watch",
298
+ "episode": 15,
299
+ "resolution": "1080p",
300
+ "source": "CHS"
301
+ },
302
+ "pred": {
303
+ "episode": 15,
304
+ "group": null,
305
+ "resolution": "1080p",
306
+ "source": "CHS",
307
+ "title": "Witch Watch"
308
+ }
309
+ },
310
+ {
311
+ "id": "sakurato_group_language_source",
312
+ "filename": "[Sakurato] Witch Watch - 15 [1080p][CHS]",
313
+ "ok": true,
314
+ "errors": {},
315
+ "expected": {
316
+ "group": "Sakurato",
317
+ "title": "Witch Watch",
318
+ "episode": 15,
319
+ "resolution": "1080p",
320
+ "source": "CHS"
321
+ },
322
+ "pred": {
323
+ "episode": 15,
324
+ "group": "Sakurato",
325
+ "resolution": "1080p",
326
+ "source": "CHS",
327
+ "title": "Witch Watch"
328
+ }
329
+ },
330
+ {
331
+ "id": "billion_meta_lab_search_special",
332
+ "filename": "[Billion Meta Lab] 魔法姊妹露露莉莉 Mahou no Shimai Rurutto Riryi [07][1080P][CHT&JPN][檢索:魔法姊妹露露特莉莉].mp4",
333
+ "ok": true,
334
+ "errors": {},
335
+ "expected": {
336
+ "group": "Billion Meta Lab",
337
+ "title": "魔法姊妹露露莉莉 Mahou no Shimai Rurutto Riryi",
338
+ "episode": 7,
339
+ "resolution": "1080P",
340
+ "source": "CHT&JPN",
341
+ "special": "檢索:魔法姊妹露露特莉莉"
342
+ },
343
+ "pred": {
344
+ "episode": 7,
345
+ "group": "Billion Meta Lab",
346
+ "resolution": "1080P",
347
+ "source": "CHT&JPN",
348
+ "special": "檢索:魔法姊妹露露特莉莉",
349
+ "title": "魔法姊妹露露莉莉 Mahou no Shimai Rurutto Riryi"
350
+ }
351
+ },
352
+ {
353
+ "id": "studio_greentea_s2_bracket_episode",
354
+ "filename": "[Studio GreenTea] Otonari no Tenshi-sama ni Itsunomanika Dame Ningen ni Sareteita Ken S2 [06][WebRip][HEVC-10bit 1080p AAC][JPSC].mp4",
355
+ "ok": true,
356
+ "errors": {},
357
+ "expected": {
358
+ "group": "Studio GreenTea",
359
+ "title": "Otonari no Tenshi-sama ni Itsunomanika Dame Ningen ni Sareteita Ken",
360
+ "season": 2,
361
+ "episode": 6,
362
+ "resolution": "1080p",
363
+ "source": "WebRip"
364
+ },
365
+ "pred": {
366
+ "episode": 6,
367
+ "group": "Studio GreenTea",
368
+ "resolution": "1080p",
369
+ "season": 2,
370
+ "source": "WebRip",
371
+ "title": "Otonari no Tenshi-sama ni Itsunomanika Dame Ningen ni Sareteita Ken"
372
+ }
373
+ },
374
+ {
375
+ "id": "lolihouse_kakuriyo_bare_ni_season",
376
+ "filename": "[LoliHouse] Kakuriyo no Yadomeshi Ni - 12 [WebRip 1080p HEVC-10bit AAC SRTx2].mkv",
377
+ "ok": true,
378
+ "errors": {},
379
+ "expected": {
380
+ "group": "LoliHouse",
381
+ "title": "Kakuriyo no Yadomeshi",
382
+ "season": 2,
383
+ "episode": 12,
384
+ "resolution": "1080p",
385
+ "source": "WebRip"
386
+ },
387
+ "pred": {
388
+ "episode": 12,
389
+ "group": "LoliHouse",
390
+ "resolution": "1080p",
391
+ "season": 2,
392
+ "source": "WebRip",
393
+ "title": "Kakuriyo no Yadomeshi"
394
+ }
395
+ },
396
+ {
397
+ "id": "ani_kakuriyo_traditional_ni",
398
+ "filename": "[ANi] 妖怪旅館營業中 貳 - 11 [1080P][Baha][WEB-DL][AAC AVC][CHT].mp4",
399
+ "ok": true,
400
+ "errors": {},
401
+ "expected": {
402
+ "group": "ANi",
403
+ "title": "妖怪旅館營業中",
404
+ "season": 2,
405
+ "episode": 11,
406
+ "resolution": "1080P",
407
+ "source": "Baha"
408
+ },
409
+ "pred": {
410
+ "episode": 11,
411
+ "group": "ANi",
412
+ "resolution": "1080P",
413
+ "season": 2,
414
+ "source": "Baha",
415
+ "title": "妖怪旅館營業中"
416
+ }
417
+ },
418
+ {
419
+ "id": "jibaketa_shokugeki_ni_no_sara",
420
+ "filename": "[jibaketa]Shokugeki no Souma Ni no Sara - 13 END [BD 1920x1080 x264 AACx2 SRT TVB CHT].mkv",
421
+ "ok": true,
422
+ "errors": {},
423
+ "expected": {
424
+ "group": "jibaketa",
425
+ "title": "Shokugeki no Souma",
426
+ "season": 2,
427
+ "episode": 13,
428
+ "resolution": "1920x1080"
429
+ },
430
+ "pred": {
431
+ "episode": 13,
432
+ "group": "jibaketa",
433
+ "resolution": "1920x1080",
434
+ "season": 2,
435
+ "title": "Shokugeki no Souma"
436
+ }
437
+ },
438
+ {
439
+ "id": "ai_raws_fire_force_cjk_season_hash_episode",
440
+ "filename": "[AI-Raws] 炎炎の消防隊 弐ノ章 #13 (BD HEVC 1920x1080 yuv444p10le FLAC)[FC74A2D5].mkv",
441
+ "ok": true,
442
+ "errors": {},
443
+ "expected": {
444
+ "group": "AI-Raws",
445
+ "title": "炎炎の消防隊",
446
+ "season": 2,
447
+ "episode": 13,
448
+ "resolution": "1920x1080"
449
+ },
450
+ "pred": {
451
+ "episode": 13,
452
+ "group": "AI-Raws",
453
+ "resolution": "1920x1080",
454
+ "season": 2,
455
+ "title": "炎炎の消防隊"
456
+ }
457
+ }
458
+ ]
459
+ }
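Each entry in the report above pairs an `expected` block with a `pred` block and records `ok` and `errors`. The code that builds these entries is not part of this diff, so the following is only a minimal sketch of how such an entry could be derived. It assumes that only fields listed in `expected` are compared and that a missing predicted field counts as `null`. The function name `compare_case` and the shape of the `errors` values are hypothetical.

```python
from typing import Any, Dict


def compare_case(expected: Dict[str, Any], pred: Dict[str, Any]) -> Dict[str, Any]:
    """Compare the fields named in `expected` against `pred`.

    Assumption: predicted fields that `expected` does not mention are ignored,
    and a field missing from `pred` is treated as None.
    """
    errors: Dict[str, Dict[str, Any]] = {}
    for field, want in expected.items():
        got = pred.get(field)
        if got != want:
            # Hypothetical error shape; the real report only shows empty dicts.
            errors[field] = {"expected": want, "pred": got}
    return {"ok": not errors, "errors": errors}


if __name__ == "__main__":
    expected = {"group": "jibaketa", "title": "Shokugeki no Souma", "season": 2}
    print(compare_case(expected, {"group": "jibaketa", "title": "Shokugeki no Souma", "season": 1}))
```

Treating a missing prediction as `None` lets cases such as `"group": null` pass when the parser simply omits the field.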
config.json CHANGED
@@ -50,15 +50,15 @@
50
  },
51
  "layer_norm_eps": 1e-12,
52
  "max_position_embeddings": 128,
53
- "max_seq_length": 64,
54
  "model_type": "bert",
55
  "num_attention_heads": 8,
56
  "num_hidden_layers": 4,
57
  "pad_token_id": 0,
58
  "tie_word_embeddings": true,
59
- "tokenizer_variant": "regex",
60
- "transformers_version": "5.8.0",
61
  "type_vocab_size": 2,
62
  "use_cache": false,
63
- "vocab_size": 3000
64
  }
 
50
  },
51
  "layer_norm_eps": 1e-12,
52
  "max_position_embeddings": 128,
53
+ "max_seq_length": 128,
54
  "model_type": "bert",
55
  "num_attention_heads": 8,
56
  "num_hidden_layers": 4,
57
  "pad_token_id": 0,
58
  "tie_word_embeddings": true,
59
+ "tokenizer_variant": "char",
60
+ "transformers_version": "5.8.1",
61
  "type_vocab_size": 2,
62
  "use_cache": false,
63
+ "vocab_size": 6199
64
  }
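The config change above raises `max_seq_length` to 128, which now equals `max_position_embeddings`. A BERT model with learned position embeddings cannot accept sequences longer than that table, so a small sanity check before training may be useful. The sketch below is an assumption, not code from this repository: `check_config` and `KNOWN_TOKENIZER_VARIANTS` are hypothetical names, and the variant set only contains the two values visible in this diff.

```python
from typing import Any, Dict, List

# Only the two values that appear in this diff; the real set may be larger.
KNOWN_TOKENIZER_VARIANTS = {"regex", "char"}


def check_config(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems found in a config dict."""
    problems: List[str] = []
    max_seq = cfg.get("max_seq_length")
    max_pos = cfg.get("max_position_embeddings")
    if max_seq is not None and max_pos is not None and max_seq > max_pos:
        problems.append(
            f"max_seq_length {max_seq} exceeds max_position_embeddings {max_pos}"
        )
    variant = cfg.get("tokenizer_variant")
    if variant not in KNOWN_TOKENIZER_VARIANTS:
        problems.append(f"unexpected tokenizer_variant: {variant!r}")
    return problems


if __name__ == "__main__":
    new_cfg = {
        "max_position_embeddings": 128,
        "max_seq_length": 128,
        "tokenizer_variant": "char",
        "vocab_size": 6199,
    }
    print(check_config(new_cfg))
```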
data/parser_regression_cases.json ADDED
@@ -0,0 +1,232 @@
1
+ [
2
+ {
3
+ "id": "lolihouse_dash_episode",
4
+ "filename": "[LoliHouse] Yomi no Tsugai - 07 [WebRip 1080p HEVC-10bit AAC ASSx2]",
5
+ "expected": {
6
+ "group": "LoliHouse",
7
+ "title": "Yomi no Tsugai",
8
+ "episode": 7,
9
+ "resolution": "1080p",
10
+ "source": "WebRip"
11
+ }
12
+ },
13
+ {
14
+ "id": "dot_season_episode_no_group",
15
+ "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub",
16
+ "expected": {
17
+ "title": "Witch.Hat.Atelier",
18
+ "season": 1,
19
+ "episode": 7,
20
+ "group": null,
21
+ "resolution": "1080p",
22
+ "source": "NF"
23
+ }
24
+ },
25
+ {
26
+ "id": "ani_cjk_season_dash_episode",
27
+ "filename": "[ANi] 異世界悠閒農家 2 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]",
28
+ "expected": {
29
+ "group": "ANi",
30
+ "title": "異世界悠閒農家",
31
+ "season": 2,
32
+ "episode": 6,
33
+ "resolution": "1080P",
34
+ "source": "Baha"
35
+ }
36
+ },
37
+ {
38
+ "id": "kisssub_bracket_title_episode",
39
+ "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][GB][MP4]",
40
+ "expected": {
41
+ "group": "KissSub",
42
+ "title": "Shunkashuutou Daikousha - Haru no Mai",
43
+ "episode": 5,
44
+ "resolution": "1080P",
45
+ "source": "GB"
46
+ }
47
+ },
48
+ {
49
+ "id": "airotabracket_title_episode",
50
+ "filename": "[Airota][Sousou no Frieren][29][1080p AVC AAC][CHT]",
51
+ "expected": {
52
+ "group": "Airota",
53
+ "title": "Sousou no Frieren",
54
+ "episode": 29,
55
+ "resolution": "1080p",
56
+ "source": "CHT"
57
+ }
58
+ },
59
+ {
60
+ "id": "subsplease_parenthesized_resolution",
61
+ "filename": "[SubsPlease] Mushoku Tensei - 12 (1080p) [x265][AAC]",
62
+ "expected": {
63
+ "group": "SubsPlease",
64
+ "title": "Mushoku Tensei",
65
+ "episode": 12,
66
+ "resolution": "1080p"
67
+ }
68
+ },
69
+ {
70
+ "id": "vcb_bracket_episode",
71
+ "filename": "[VCB-Studio] Girls Band Cry [01][Ma10p_1080p][x265_flac]",
72
+ "expected": {
73
+ "group": "VCB-Studio",
74
+ "title": "Girls Band Cry",
75
+ "episode": 1,
76
+ "resolution": "1080p"
77
+ }
78
+ },
79
+ {
80
+ "id": "numeric_title_not_episode",
81
+ "filename": "86 Eighty Six - 01 [1080P][Baha]",
82
+ "expected": {
83
+ "title": "86 Eighty Six",
84
+ "episode": 1,
85
+ "resolution": "1080P",
86
+ "source": "Baha"
87
+ }
88
+ },
89
+ {
90
+ "id": "erai_raws_dash_episode",
91
+ "filename": "[Erai-raws] Sousou no Frieren - 01 [1080p][Multiple Subtitle][ENG]",
92
+ "expected": {
93
+ "group": "Erai-raws",
94
+ "title": "Sousou no Frieren",
95
+ "episode": 1,
96
+ "resolution": "1080p"
97
+ }
98
+ },
99
+ {
100
+ "id": "nekomoe_space_group",
101
+ "filename": "[Nekomoe kissaten][Watashi no Shiawase na Kekkon][01][1080p][JPSC]",
102
+ "expected": {
103
+ "group": "Nekomoe kissaten",
104
+ "title": "Watashi no Shiawase na Kekkon",
105
+ "episode": 1,
106
+ "resolution": "1080p"
107
+ }
108
+ },
109
+ {
110
+ "id": "long_running_episode",
111
+ "filename": "One.Piece.1110.1080p.WEB-DL.AAC2.0.H.264",
112
+ "expected": {
113
+ "title": "One.Piece",
114
+ "episode": 1110,
115
+ "resolution": "1080p",
116
+ "source": "WEB-DL"
117
+ }
118
+ },
119
+ {
120
+ "id": "season_episode_amzn",
121
+ "filename": "Example.Show.S02E03.2160p.AMZN.WEB-DL.DDP5.1.H.265",
122
+ "expected": {
123
+ "title": "Example.Show",
124
+ "season": 2,
125
+ "episode": 3,
126
+ "resolution": "2160p",
127
+ "source": "AMZN"
128
+ }
129
+ },
130
+ {
131
+ "id": "cjk_group_with_prefix_tag",
132
+ "filename": "【喵萌奶茶屋】★04月新番★[葬送的芙莉莲][01][1080P][HEVC]",
133
+ "expected": {
134
+ "group": "喵萌奶茶屋",
135
+ "title": "葬送的芙莉莲",
136
+ "episode": 1,
137
+ "resolution": "1080P"
138
+ }
139
+ },
140
+ {
141
+ "id": "leading_meta_not_group",
142
+ "filename": "[1080p] Witch Watch - 15 [CHS]",
143
+ "expected": {
144
+ "group": null,
145
+ "title": "Witch Watch",
146
+ "episode": 15,
147
+ "resolution": "1080p",
148
+ "source": "CHS"
149
+ }
150
+ },
151
+ {
152
+ "id": "sakurato_group_language_source",
153
+ "filename": "[Sakurato] Witch Watch - 15 [1080p][CHS]",
154
+ "expected": {
155
+ "group": "Sakurato",
156
+ "title": "Witch Watch",
157
+ "episode": 15,
158
+ "resolution": "1080p",
159
+ "source": "CHS"
160
+ }
161
+ },
162
+ {
163
+ "id": "billion_meta_lab_search_special",
164
+ "filename": "[Billion Meta Lab] 魔法姊妹露露莉莉 Mahou no Shimai Rurutto Riryi [07][1080P][CHT&JPN][檢索:魔法姊妹露露特莉莉].mp4",
165
+ "expected": {
166
+ "group": "Billion Meta Lab",
167
+ "title": "魔法姊妹露露莉莉 Mahou no Shimai Rurutto Riryi",
168
+ "episode": 7,
169
+ "resolution": "1080P",
170
+ "source": "CHT&JPN",
171
+ "special": "檢索:魔法姊妹露露特莉莉"
172
+ }
173
+ },
174
+ {
175
+ "id": "studio_greentea_s2_bracket_episode",
176
+ "filename": "[Studio GreenTea] Otonari no Tenshi-sama ni Itsunomanika Dame Ningen ni Sareteita Ken S2 [06][WebRip][HEVC-10bit 1080p AAC][JPSC].mp4",
177
+ "expected": {
178
+ "group": "Studio GreenTea",
179
+ "title": "Otonari no Tenshi-sama ni Itsunomanika Dame Ningen ni Sareteita Ken",
180
+ "season": 2,
181
+ "episode": 6,
182
+ "resolution": "1080p",
183
+ "source": "WebRip"
184
+ }
185
+ },
186
+ {
187
+ "id": "lolihouse_kakuriyo_bare_ni_season",
188
+ "filename": "[LoliHouse] Kakuriyo no Yadomeshi Ni - 12 [WebRip 1080p HEVC-10bit AAC SRTx2].mkv",
189
+ "expected": {
190
+ "group": "LoliHouse",
191
+ "title": "Kakuriyo no Yadomeshi",
192
+ "season": 2,
193
+ "episode": 12,
194
+ "resolution": "1080p",
195
+ "source": "WebRip"
196
+ }
197
+ },
198
+ {
199
+ "id": "ani_kakuriyo_traditional_ni",
200
+ "filename": "[ANi] 妖怪旅館營業中 貳 - 11 [1080P][Baha][WEB-DL][AAC AVC][CHT].mp4",
201
+ "expected": {
202
+ "group": "ANi",
203
+ "title": "妖怪旅館營業中",
204
+ "season": 2,
205
+ "episode": 11,
206
+ "resolution": "1080P",
207
+ "source": "Baha"
208
+ }
209
+ },
210
+ {
211
+ "id": "jibaketa_shokugeki_ni_no_sara",
212
+ "filename": "[jibaketa]Shokugeki no Souma Ni no Sara - 13 END [BD 1920x1080 x264 AACx2 SRT TVB CHT].mkv",
213
+ "expected": {
214
+ "group": "jibaketa",
215
+ "title": "Shokugeki no Souma",
216
+ "season": 2,
217
+ "episode": 13,
218
+ "resolution": "1920x1080"
219
+ }
220
+ },
221
+ {
222
+ "id": "ai_raws_fire_force_cjk_season_hash_episode",
223
+ "filename": "[AI-Raws] 炎炎の消防隊 弐ノ章 #13 (BD HEVC 1920x1080 yuv444p10le FLAC)[FC74A2D5].mkv",
224
+ "expected": {
225
+ "group": "AI-Raws",
226
+ "title": "炎炎の消防隊",
227
+ "season": 2,
228
+ "episode": 13,
229
+ "resolution": "1920x1080"
230
+ }
231
+ }
232
+ ]
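Several of the cases above expect `season: 2` from a trailing sequel marker in the title span (`S2`, a bare `2`, `Ni`, `Ni no Sara`, `貳`, `弐ノ章`). How the parser actually resolves these is not shown in this diff. The sketch below only illustrates one plausible suffix-table approach: `SEQUEL_MARKERS` and `split_sequel_marker` are hypothetical names, and the table contains only markers taken from the filenames in these cases.

```python
from typing import Optional, Tuple

# Hypothetical marker table, drawn only from the regression filenames above.
SEQUEL_MARKERS = {
    "Ni no Sara": 2,
    "弐ノ章": 2,
    "S2": 2,
    "Ni": 2,
    "貳": 2,
    "2": 2,
}


def split_sequel_marker(text: str) -> Tuple[str, Optional[int]]:
    """Strip one trailing, space-separated sequel marker from a title span."""
    stripped = text.strip()
    # Try longer markers first so "Ni no Sara" wins over "Ni".
    for marker in sorted(SEQUEL_MARKERS, key=len, reverse=True):
        suffix = " " + marker
        if stripped.endswith(suffix):
            return stripped[: -len(suffix)].rstrip(), SEQUEL_MARKERS[marker]
    return stripped, None


if __name__ == "__main__":
    for span in ["Kakuriyo no Yadomeshi Ni", "妖怪旅館營業中 貳", "86 Eighty Six"]:
        print(span, "->", split_sequel_marker(span))
```

A plain suffix table like this would misfire on titles that legitimately end in one of the markers, which is one reason a learned tagger plus targeted label repairs may be preferable to rules alone.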
dataset.py CHANGED
@@ -6,11 +6,13 @@ Handles token-ID conversion, label encoding, padding, and truncation.
6
  """
7
 
8
  import json
 
9
  import torch
10
  from torch.utils.data import Dataset
11
- from typing import Dict, List, Optional
12
 
13
  from config import Config
 
14
  from tokenizer import AnimeTokenizer
15
 
16
 
@@ -62,9 +64,7 @@ class AnimeDataset(Dataset):
62
  Dictionary with input_ids, attention_mask, labels as LongTensors.
63
  """
64
  item = self.data[idx]
65
- tokens: List[str] = item["tokens"]
66
- labels: List[str] = item["labels"]
67
- tokens, labels = align_tokens_for_tokenizer(tokens, labels, self.tokenizer)
68
 
69
  # Convert tokens to IDs
70
  input_ids = self.tokenizer.convert_tokens_to_ids(tokens)
@@ -137,6 +137,146 @@ def align_tokens_for_tokenizer(
137
  return aligned_tokens, aligned_labels
138
 
139
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
140
  def create_datasets(
141
  data_path: str,
142
  tokenizer: AnimeTokenizer,
 
6
  """
7
 
8
  import json
9
+ from collections import Counter
10
  import torch
11
  from torch.utils.data import Dataset
12
+ from typing import Dict, List, Optional, Tuple
13
 
14
  from config import Config
15
+ from label_repairs import repair_sequel_season_labels
16
  from tokenizer import AnimeTokenizer
17
 
18
 
 
64
  Dictionary with input_ids, attention_mask, labels as LongTensors.
65
  """
66
  item = self.data[idx]
67
+ tokens, labels = labels_for_tokenizer(item, self.tokenizer)
 
 
68
 
69
  # Convert tokens to IDs
70
  input_ids = self.tokenizer.convert_tokens_to_ids(tokens)
 
137
  return aligned_tokens, aligned_labels
138
 
139
 
140
+ def labels_for_tokenizer(
141
+ item: Dict,
142
+ tokenizer: AnimeTokenizer,
143
+ ) -> Tuple[List[str], List[str]]:
144
+ """
145
+ Return tokens and labels in the exact tokenizer space used by the model.
146
+
147
+ Older DMHY weak-label files store a post-processed token sequence where
148
+ group/title brackets may be expanded even though AnimeTokenizer keeps the
149
+ same bracketed text as one inference token. If the raw filename is present,
150
+ project those weak labels back to character spans and then onto the current
151
+ tokenizer output. This keeps train/eval/inference preprocessing identical.
152
+ """
153
+ filename = item.get("filename")
154
+ source_tokens, source_labels, _repairs = repair_sequel_season_labels(item)
155
+ tokenizer_variant = getattr(tokenizer, "tokenizer_variant", "regex")
156
+
157
+ if not filename:
158
+ return align_tokens_for_tokenizer(source_tokens, source_labels, tokenizer)
159
+
160
+ # Current char datasets are already in the exact inference token space.
161
+ # Avoid re-scanning every filename during training.
162
+ if item.get("tokenizer_variant") == tokenizer_variant:
163
+ target_tokens = tokenizer.tokenize(filename)
164
+ if source_tokens == target_tokens:
165
+ return source_tokens, source_labels
166
+
167
+ projected = project_labels_from_filename(
168
+ filename=filename,
169
+ source_tokens=source_tokens,
170
+ source_labels=source_labels,
171
+ tokenizer=tokenizer,
172
+ )
173
+ if projected is not None:
174
+ return projected
175
+
176
+ # Fall back to the legacy behavior for synthetic fixtures or malformed rows.
177
+ return align_tokens_for_tokenizer(source_tokens, source_labels, tokenizer)
178
+
179
+
180
+ def token_offsets_in_text(text: str, tokens: List[str]) -> Optional[List[Tuple[int, int]]]:
181
+ """Find token character offsets by scanning left to right."""
182
+ offsets: List[Tuple[int, int]] = []
183
+ cursor = 0
184
+ for token in tokens:
185
+ if token == "":
186
+ offsets.append((cursor, cursor))
187
+ continue
188
+ start = text.find(token, cursor)
189
+ if start < 0:
190
+ return None
191
+ end = start + len(token)
192
+ offsets.append((start, end))
193
+ cursor = end
194
+ return offsets
195
+
196
+
197
+ def project_source_labels_to_chars(
198
+ text: str,
199
+ source_tokens: List[str],
200
+ source_labels: List[str],
201
+ ) -> Optional[List[str]]:
202
+ """Project source token BIO labels to per-character entity names."""
203
+ offsets = token_offsets_in_text(text, source_tokens)
204
+ if offsets is None or len(source_tokens) != len(source_labels):
205
+ return None
206
+
207
+ char_entities = ["O"] * len(text)
208
+ for token, label, (start, end) in zip(source_tokens, source_labels, offsets):
209
+ if not label.startswith(("B-", "I-")):
210
+ continue
211
+ entity = label.split("-", 1)[1]
212
+
213
+ # Bracketed single-token metadata in older data often includes the
214
+ # brackets in the token. Keep container punctuation as O so a tokenizer
215
+ # that splits brackets can learn cleaner boundaries.
216
+ inner_start = start
217
+ inner_end = end
218
+ if len(token) >= 2 and token[0] in "[【(《" and token[-1] in "]】)》":
219
+ inner_start += 1
220
+ inner_end -= 1
221
+
222
+ for pos in range(inner_start, inner_end):
223
+ if 0 <= pos < len(char_entities):
224
+ char_entities[pos] = entity
225
+ return char_entities
226
+
227
+
228
+ def labels_from_char_projection(
229
+ text: str,
230
+ target_tokens: List[str],
231
+ char_entities: List[str],
232
+ ) -> Optional[List[str]]:
233
+ """Assign legal IOB2 labels to target tokens from per-character entities."""
234
+ offsets = token_offsets_in_text(text, target_tokens)
235
+ if offsets is None:
236
+ return None
237
+
238
+ labels: List[str] = []
239
+ active_entity: Optional[str] = None
240
+ for start, end in offsets:
241
+ span_entities = [
242
+ char_entities[pos]
243
+ for pos in range(start, end)
244
+ if 0 <= pos < len(char_entities) and char_entities[pos] != "O"
245
+ ]
246
+ if not span_entities:
247
+ labels.append("O")
248
+ active_entity = None
249
+ continue
250
+
251
+ entity = Counter(span_entities).most_common(1)[0][0]
252
+ prefix = "I" if active_entity == entity else "B"
253
+ labels.append(f"{prefix}-{entity}")
254
+ active_entity = entity
255
+ return labels
256
+
257
+
258
+ def project_labels_from_filename(
259
+ filename: str,
260
+ source_tokens: List[str],
261
+ source_labels: List[str],
262
+ tokenizer: AnimeTokenizer,
263
+ ) -> Optional[Tuple[List[str], List[str]]]:
264
+ """
265
+ Re-tokenize filename and project weak BIO labels onto that tokenizer.
266
+
267
+ Returns None when source tokens cannot be aligned to the filename.
268
+ """
269
+ char_entities = project_source_labels_to_chars(filename, source_tokens, source_labels)
270
+ if char_entities is None:
271
+ return None
272
+
273
+ target_tokens = tokenizer.tokenize(filename)
274
+ target_labels = labels_from_char_projection(filename, target_tokens, char_entities)
275
+ if target_labels is None or len(target_tokens) != len(target_labels):
276
+ return None
277
+ return target_tokens, target_labels
278
+
279
+
280
  def create_datasets(
281
  data_path: str,
282
  tokenizer: AnimeTokenizer,
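The new helpers in `dataset.py` project weak token labels onto character positions and then back onto the current tokenizer's tokens. The condensed sketch below walks through the same idea on a toy input so the resulting IOB2 labels can be inspected. It is an illustration under assumptions, not the repository code: `find_offsets` and `project_labels` are hypothetical names, and `list(text)` stands in for a per-character tokenizer, which may differ from what `AnimeTokenizer` does in its `char` variant.

```python
from collections import Counter
from typing import List, Optional, Tuple


def find_offsets(text: str, tokens: List[str]) -> Optional[List[Tuple[int, int]]]:
    """Locate each token in text by scanning left to right."""
    offsets, cursor = [], 0
    for token in tokens:
        start = text.find(token, cursor)
        if start < 0:
            return None
        cursor = start + len(token)
        offsets.append((start, cursor))
    return offsets


def project_labels(
    text: str, src_tokens: List[str], src_labels: List[str], tgt_tokens: List[str]
) -> Optional[List[str]]:
    """Map BIO labels from one tokenization of text onto another."""
    src = find_offsets(text, src_tokens)
    tgt = find_offsets(text, tgt_tokens)
    if src is None or tgt is None or len(src_tokens) != len(src_labels):
        return None
    # Step 1: paint each character with its entity name, leaving brackets as "O".
    chars = ["O"] * len(text)
    for token, label, (start, end) in zip(src_tokens, src_labels, src):
        if not label.startswith(("B-", "I-")):
            continue
        if len(token) >= 2 and token[0] in "[【(《" and token[-1] in "]】)》":
            start, end = start + 1, end - 1
        chars[start:end] = [label.split("-", 1)[1]] * (end - start)
    # Step 2: give each target token the majority entity of its characters.
    labels: List[str] = []
    active: Optional[str] = None
    for start, end in tgt:
        entities = [c for c in chars[start:end] if c != "O"]
        if not entities:
            labels.append("O")
            active = None
            continue
        entity = Counter(entities).most_common(1)[0][0]
        labels.append(("I-" if entity == active else "B-") + entity)
        active = entity
    return labels


if __name__ == "__main__":
    text = "[ANi] Frieren - 01"
    src_tokens = ["[ANi]", " ", "Frieren", " ", "-", " ", "01"]
    src_labels = ["B-GROUP", "O", "B-TITLE", "O", "O", "O", "B-EPISODE"]
    for ch, label in zip(text, project_labels(text, src_tokens, src_labels, list(text))):
        print(repr(ch), label)
```

Going through character spans means the result does not depend on how the older weak-label files split brackets, which is the mismatch the docstring of `labels_for_tokenizer` describes.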
datasets/AnimeName CHANGED
@@ -1 +1 @@
1
- Subproject commit 867350a1712e50cc71f5a9e81dd331ca46a7b1dd
 
1
+ Subproject commit 8d2b6c9e639fde6be0e428e5f34f56fccd5aa2ea
diagnose_pipeline.py CHANGED
@@ -27,7 +27,8 @@ from seqeval.metrics import classification_report, f1_score, precision_score, re
27
  from transformers import BertForTokenClassification
28
 
29
  from config import Config
30
- from dataset import align_tokens_for_tokenizer
 
31
  from tokenizer import AnimeTokenizer, create_tokenizer, load_tokenizer
32
 
33
 
@@ -81,16 +82,6 @@ def bio_violations(tokens: List[str], labels: List[str]) -> List[dict]:
81
  for idx, label in enumerate(labels):
82
  token = tokens[idx] if idx < len(tokens) else None
83
  if label == "O":
84
- if previous_label.startswith("B-"):
85
- violations.append(
86
- {
87
- "type": "B_DIRECT_TO_O",
88
- "index": idx,
89
- "prev_label": previous_label,
90
- "label": label,
91
- "token": token,
92
- }
93
- )
94
  current_entity = None
95
  elif label.startswith("B-"):
96
  current_entity = entity_type(label)
@@ -124,6 +115,24 @@ def bio_violations(tokens: List[str], labels: List[str]) -> List[dict]:
124
  return violations
125
 
126
 
127
  def spans_from_labels(tokens: List[str], labels: List[str]) -> List[dict]:
128
  spans: List[dict] = []
129
  start: Optional[int] = None
@@ -241,7 +250,7 @@ def token_id_stats(samples: List[dict], tokenizer: AnimeTokenizer) -> dict:
241
  unk = 0
242
  unk_counter: Counter = Counter()
243
  for sample in samples:
244
- tokens, labels = align_tokens_for_tokenizer(sample["tokens"], sample["labels"], tokenizer)
245
  ids = tokenizer.convert_tokens_to_ids(tokens)
246
  for token, token_id in zip(tokens, ids):
247
  total += 1
@@ -257,13 +266,12 @@ def token_id_stats(samples: List[dict], tokenizer: AnimeTokenizer) -> dict:
257
 
258
 
259
  def prepare_inputs(
260
- tokens: List[str],
261
- labels: List[str],
262
  tokenizer: AnimeTokenizer,
263
  label2id: Dict[str, int],
264
  max_length: int,
265
  ) -> Tuple[List[int], List[int], List[int], List[str]]:
266
- tokens, labels = align_tokens_for_tokenizer(tokens, labels, tokenizer)
267
  input_ids = tokenizer.convert_tokens_to_ids(tokens)
268
  input_ids = [tokenizer.cls_token_id] + input_ids + [tokenizer.sep_token_id]
269
  label_ids = [-100] + [label2id.get(label, 0) for label in labels] + [-100]
@@ -283,6 +291,48 @@ def prepare_inputs(
283
  return input_ids, attention_mask, label_ids, tokens
284
 
285
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
286
  def evaluate_model(
287
  samples: List[dict],
288
  model_dir: Path,
@@ -313,12 +363,15 @@ def evaluate_model(
313
  confusion: Counter = Counter()
314
  entity_confusion: Counter = Counter()
315
  boundary_errors: Counter = Counter()
316
 
317
  with torch.no_grad():
318
  for sample in eval_samples:
319
- input_ids, attention_mask, label_ids, _tokens = prepare_inputs(
320
- sample["tokens"],
321
- sample["labels"],
322
  tokenizer,
323
  label2id,
324
  max_length,
@@ -326,13 +379,17 @@ def evaluate_model(
326
  input_tensor = torch.tensor([input_ids], dtype=torch.long, device=device)
327
  mask_tensor = torch.tensor([attention_mask], dtype=torch.long, device=device)
328
  logits = model(input_ids=input_tensor, attention_mask=mask_tensor).logits
329
- pred_ids = torch.argmax(logits, dim=-1)[0].detach().cpu().tolist()
 
330
 
331
  true_labels: List[str] = []
332
  pred_labels: List[str] = []
333
- for pred_id, label_id in zip(pred_ids, label_ids):
 
334
  if label_id == -100:
335
  continue
 
 
336
  true_label = id2label.get(label_id, "O")
337
  pred_label = id2label.get(pred_id, "O")
338
  true_labels.append(true_label)
@@ -348,6 +405,57 @@ def evaluate_model(
348
  boundary_errors["BIO-prefix"] += 1
349
  true_sequences.append(true_labels)
350
  pred_sequences.append(pred_labels)
351
 
352
  errors = confusion.copy()
353
  for label in set(label for pair in confusion for label in pair):
@@ -364,6 +472,10 @@ def evaluate_model(
364
  {k: v for k, v in entity_confusion.items() if k[0] != k[1]}
365
  ).most_common(30),
366
  "boundary_errors": boundary_errors,
 
 
 
 
367
  }
368
 
369
 
@@ -444,6 +556,7 @@ def main() -> None:
444
  length_values: List[int] = []
445
  aligned_length_values: List[int] = []
446
  violations: List[dict] = []
 
447
  mismatch_examples: List[dict] = []
448
  space_label_counter: Counter = Counter()
449
  boundary_drift_counter: Counter = Counter()
@@ -472,7 +585,7 @@ def main() -> None:
472
 
473
  label_counter.update(labels)
474
  length_values.append(len(tokens))
475
- aligned_tokens, aligned_labels = align_tokens_for_tokenizer(tokens, labels, tokenizer)
476
  aligned_length_values.append(len(aligned_tokens))
477
  if len(aligned_tokens) + 2 > max_length:
478
  truncation_count += 1
@@ -490,6 +603,17 @@ def main() -> None:
490
  }
491
  )
492
  violations.append(violation)
493
  for span in spans_from_labels(tokens, labels):
494
  text = span["text"]
495
  if span["type"] == "TITLE":
@@ -594,19 +718,26 @@ def main() -> None:
594
  )
595
 
596
  violation_counter = Counter(v["type"] for v in violations)
 
597
  sections.append(
598
  (
599
  "BIO Violations And Boundary Drift",
600
  "\n".join(
601
  [
602
- "### Violation counts",
603
  format_counter(violation_counter),
604
  "",
 
 
 
605
  "### Boundary drift heuristics",
606
  format_counter(boundary_drift_counter),
607
  "",
608
  "### Sample violations",
609
  markdown_json(violations[:30]),
 
 
 
610
  ]
611
  ),
612
  )
@@ -659,6 +790,29 @@ def main() -> None:
659
  [true, pred, f"{count:,}"]
660
  for (true, pred), count in model_eval["top_entity_confusions"]
661
  ]
662
  sections.append(
663
  (
664
  "Model Confusion Analysis",
@@ -678,6 +832,28 @@ def main() -> None:
678
  "### Top entity-type confusions",
679
  markdown_table(["true", "pred", "count"], entity_rows) if entity_rows else "- none",
680
  "",
681
  "### Seqeval report",
682
  "```text\n" + model_eval["classification_report"] + "\n```",
683
  ]
 
27
  from transformers import BertForTokenClassification
28
 
29
  from config import Config
30
+ from dataset import labels_for_tokenizer
31
+ from inference import constrained_bio_decode, postprocess
32
  from tokenizer import AnimeTokenizer, create_tokenizer, load_tokenizer
33
 
34
 
 
82
  for idx, label in enumerate(labels):
83
  token = tokens[idx] if idx < len(tokens) else None
84
  if label == "O":
 
 
 
 
 
 
 
 
 
 
85
  current_entity = None
86
  elif label.startswith("B-"):
87
  current_entity = entity_type(label)
 
115
  return violations
116
 
117
 
118
+ def bio_boundary_warnings(tokens: List[str], labels: List[str]) -> List[dict]:
119
+ """Collect legal-but-suspicious boundary patterns separately from BIO errors."""
120
+ warnings: List[dict] = []
121
+ for idx, label in enumerate(labels[1:], 1):
122
+ previous_label = labels[idx - 1]
123
+ if label == "O" and previous_label.startswith("B-"):
124
+ warnings.append(
125
+ {
126
+ "type": "SINGLE_TOKEN_ENTITY",
127
+ "index": idx,
128
+ "prev_label": previous_label,
129
+ "label": label,
130
+ "token": tokens[idx] if idx < len(tokens) else None,
131
+ }
132
+ )
133
+ return warnings
134
+
135
+
136
  def spans_from_labels(tokens: List[str], labels: List[str]) -> List[dict]:
137
  spans: List[dict] = []
138
  start: Optional[int] = None
 
250
  unk = 0
251
  unk_counter: Counter = Counter()
252
  for sample in samples:
253
+ tokens, _labels = labels_for_tokenizer(sample, tokenizer)
254
  ids = tokenizer.convert_tokens_to_ids(tokens)
255
  for token, token_id in zip(tokens, ids):
256
  total += 1
 
266
 
267
 
268
  def prepare_inputs(
269
+ sample: dict,
 
270
  tokenizer: AnimeTokenizer,
271
  label2id: Dict[str, int],
272
  max_length: int,
273
  ) -> Tuple[List[int], List[int], List[int], List[str]]:
274
+ tokens, labels = labels_for_tokenizer(sample, tokenizer)
275
  input_ids = tokenizer.convert_tokens_to_ids(tokens)
276
  input_ids = [tokenizer.cls_token_id] + input_ids + [tokenizer.sep_token_id]
277
  label_ids = [-100] + [label2id.get(label, 0) for label in labels] + [-100]
 
291
  return input_ids, attention_mask, label_ids, tokens
292
 
293
 
294
+ def normalize_field_value(field: str, value) -> Optional[str]:
295
+ if value is None:
296
+ return None
297
+ if field in {"episode", "season"}:
298
+ try:
299
+ return str(int(value))
300
+ except (TypeError, ValueError):
301
+ return str(value).strip().lower()
302
+ text = str(value).strip()
303
+ if field in {"resolution", "source"}:
304
+ return text.lower().replace("_", "-")
305
+ return re.sub(r"\s+", " ", text).strip().lower()
306
+
307
+
308
+ def update_parse_metrics(counter: Counter, gold: dict, pred: dict) -> None:
309
+ fields = ["group", "title", "season", "episode", "resolution", "source", "special"]
310
+ all_match = True
311
+ for field in fields:
312
+ gold_value = normalize_field_value(field, gold.get(field))
313
+ pred_value = normalize_field_value(field, pred.get(field))
314
+ if gold_value == pred_value:
315
+ counter[f"{field}_correct"] += 1
316
+ else:
317
+ all_match = False
318
+ counter[(field, gold_value, pred_value)] += 1
319
+ counter[f"{field}_total"] += 1
320
+ if all_match:
321
+ counter["full_match_correct"] += 1
322
+ counter["full_match_total"] += 1
323
+
324
+
325
+ def collect_field_failures(gold: dict, pred: dict) -> Dict[str, Dict[str, Optional[str]]]:
326
+ return {
327
+ field: {
328
+ "gold": normalize_field_value(field, gold.get(field)),
329
+ "pred": normalize_field_value(field, pred.get(field)),
330
+ }
331
+ for field in ["group", "title", "season", "episode", "resolution", "source", "special"]
332
+ if normalize_field_value(field, gold.get(field)) != normalize_field_value(field, pred.get(field))
333
+ }
334
+
335
+
336
  def evaluate_model(
337
  samples: List[dict],
338
  model_dir: Path,
 
363
  confusion: Counter = Counter()
364
  entity_confusion: Counter = Counter()
365
  boundary_errors: Counter = Counter()
366
+ parse_metrics: Counter = Counter()
367
+ parse_metrics_no_rules: Counter = Counter()
368
+ field_failures: List[dict] = []
369
+ field_failures_no_rules: List[dict] = []
370
 
371
  with torch.no_grad():
372
  for sample in eval_samples:
373
+ input_ids, attention_mask, label_ids, sample_tokens = prepare_inputs(
374
+ sample,
 
375
  tokenizer,
376
  label2id,
377
  max_length,
 
379
  input_tensor = torch.tensor([input_ids], dtype=torch.long, device=device)
380
  mask_tensor = torch.tensor([attention_mask], dtype=torch.long, device=device)
381
  logits = model(input_ids=input_tensor, attention_mask=mask_tensor).logits
382
+ active_count = sum(1 for label_id in label_ids if label_id != -100)
383
+ pred_ids = constrained_bio_decode(logits[0, 1:1 + active_count, :], id2label)
384
 
385
  true_labels: List[str] = []
386
  pred_labels: List[str] = []
387
+ pred_idx = 0
388
+ for label_id in label_ids:
389
  if label_id == -100:
390
  continue
391
+ pred_id = pred_ids[pred_idx]
392
+ pred_idx += 1
393
  true_label = id2label.get(label_id, "O")
394
  pred_label = id2label.get(pred_id, "O")
395
  true_labels.append(true_label)
 
405
  boundary_errors["BIO-prefix"] += 1
406
  true_sequences.append(true_labels)
407
  pred_sequences.append(pred_labels)
408
+ active_tokens = sample_tokens[:len(true_labels)]
409
+ gold_parse = postprocess(
410
+ active_tokens,
411
+ true_labels,
412
+ tokenizer=tokenizer,
413
+ filename=sample.get("filename"),
414
+ use_rules=True,
415
+ )
416
+ pred_parse = postprocess(
417
+ active_tokens,
418
+ pred_labels,
419
+ tokenizer=tokenizer,
420
+ filename=sample.get("filename"),
421
+ use_rules=True,
422
+ )
423
+ gold_parse_no_rules = postprocess(
424
+ active_tokens,
425
+ true_labels,
426
+ tokenizer=tokenizer,
427
+ filename=sample.get("filename"),
428
+ use_rules=False,
429
+ )
430
+ pred_parse_no_rules = postprocess(
431
+ active_tokens,
432
+ pred_labels,
433
+ tokenizer=tokenizer,
434
+ filename=sample.get("filename"),
435
+ use_rules=False,
436
+ )
437
+ update_parse_metrics(parse_metrics, gold_parse, pred_parse)
438
+ update_parse_metrics(parse_metrics_no_rules, gold_parse_no_rules, pred_parse_no_rules)
439
+ failures = collect_field_failures(gold_parse, pred_parse)
440
+ if failures and len(field_failures) < 30:
441
+ field_failures.append(
442
+ {
443
+ "filename": sample.get("filename"),
444
+ "errors": failures,
445
+ "gold": gold_parse,
446
+ "pred": pred_parse,
447
+ }
448
+ )
449
+ failures_no_rules = collect_field_failures(gold_parse_no_rules, pred_parse_no_rules)
450
+ if failures_no_rules and len(field_failures_no_rules) < 30:
451
+ field_failures_no_rules.append(
452
+ {
453
+ "filename": sample.get("filename"),
454
+ "errors": failures_no_rules,
455
+ "gold": gold_parse_no_rules,
456
+ "pred": pred_parse_no_rules,
457
+ }
458
+ )
459
 
460
  errors = confusion.copy()
461
  for label in set(label for pair in confusion for label in pair):
 
472
  {k: v for k, v in entity_confusion.items() if k[0] != k[1]}
473
  ).most_common(30),
474
  "boundary_errors": boundary_errors,
475
+ "parse_metrics": parse_metrics,
476
+ "parse_metrics_no_rules": parse_metrics_no_rules,
477
+ "field_failures": field_failures,
478
+ "field_failures_no_rules": field_failures_no_rules,
479
  }
480
 
481
 
 
556
  length_values: List[int] = []
557
  aligned_length_values: List[int] = []
558
  violations: List[dict] = []
559
+ boundary_warnings: List[dict] = []
560
  mismatch_examples: List[dict] = []
561
  space_label_counter: Counter = Counter()
562
  boundary_drift_counter: Counter = Counter()
 
585
 
586
  label_counter.update(labels)
587
  length_values.append(len(tokens))
588
+ aligned_tokens, aligned_labels = labels_for_tokenizer(sample, tokenizer)
589
  aligned_length_values.append(len(aligned_tokens))
590
  if len(aligned_tokens) + 2 > max_length:
591
  truncation_count += 1
 
603
  }
604
  )
605
  violations.append(violation)
606
+ for warning in bio_boundary_warnings(tokens, labels):
607
+ warning.update(
608
+ {
609
+ "row": row_idx,
610
+ "file_id": sample.get("file_id"),
611
+ "filename": sample.get("filename"),
612
+ "context_tokens": tokens[max(0, warning["index"] - 5):warning["index"] + 6],
613
+ "context_labels": labels[max(0, warning["index"] - 5):warning["index"] + 6],
614
+ }
615
+ )
616
+ boundary_warnings.append(warning)
617
  for span in spans_from_labels(tokens, labels):
618
  text = span["text"]
619
  if span["type"] == "TITLE":
 
718
  )
719
 
720
  violation_counter = Counter(v["type"] for v in violations)
721
+ warning_counter = Counter(w["type"] for w in boundary_warnings)
722
  sections.append(
723
  (
724
  "BIO Violations And Boundary Drift",
725
  "\n".join(
726
  [
727
+ "### True BIO violation counts",
728
  format_counter(violation_counter),
729
  "",
730
+ "### Legal boundary warning counts",
731
+ format_counter(warning_counter),
732
+ "",
733
  "### Boundary drift heuristics",
734
  format_counter(boundary_drift_counter),
735
  "",
736
  "### Sample violations",
737
  markdown_json(violations[:30]),
738
+ "",
739
+ "### Sample boundary warnings",
740
+ markdown_json(boundary_warnings[:30]),
741
  ]
742
  ),
743
  )
 
790
  [true, pred, f"{count:,}"]
791
  for (true, pred), count in model_eval["top_entity_confusions"]
792
  ]
793
+ def parse_metric_tables(metrics: Counter) -> Tuple[List[List[str]], str, List[List[str]]]:
794
+ field_rows = []
795
+ for field in ["group", "title", "season", "episode", "resolution", "source", "special"]:
796
+ total = metrics.get(f"{field}_total", 0)
797
+ correct = metrics.get(f"{field}_correct", 0)
798
+ acc = correct / total if total else 0.0
799
+ field_rows.append([field, f"{correct:,}/{total:,}", f"{acc:.4f}"])
800
+ full_total = metrics.get("full_match_total", 0)
801
+ full_correct = metrics.get("full_match_correct", 0)
802
+ full_acc = full_correct / full_total if full_total else 0.0
803
+ full_line = f"{full_correct:,}/{full_total:,} ({full_acc:.4f})"
804
+ error_rows = [
805
+ [field, str(gold), str(pred), f"{count:,}"]
806
+ for key, count in Counter(
807
+ {key: count for key, count in metrics.items() if isinstance(key, tuple)}
808
+ ).most_common(30)
809
+ if isinstance(key, tuple)
810
+ for field, gold, pred in [key]
811
+ ]
812
+ return field_rows, full_line, error_rows
813
+
814
+ rule_field_rows, rule_full_line, rule_error_rows = parse_metric_tables(model_eval["parse_metrics"])
815
+ ner_field_rows, ner_full_line, ner_error_rows = parse_metric_tables(model_eval["parse_metrics_no_rules"])
816
  sections.append(
817
  (
818
  "Model Confusion Analysis",
 
832
  "### Top entity-type confusions",
833
  markdown_table(["true", "pred", "count"], entity_rows) if entity_rows else "- none",
834
  "",
835
+ "### Field exact-match accuracy (rule-assisted)",
836
+ markdown_table(["field", "correct/total", "accuracy"], rule_field_rows),
837
+ "",
838
+ f"Rule-assisted full parse exact match: {rule_full_line}",
839
+ "",
840
+ "### Top rule-assisted field parse errors",
841
+ markdown_table(["field", "gold", "pred", "count"], rule_error_rows) if rule_error_rows else "- none",
842
+ "",
843
+ "### Field exact-match accuracy (NER-only, no rules)",
844
+ markdown_table(["field", "correct/total", "accuracy"], ner_field_rows),
845
+ "",
846
+ f"NER-only full parse exact match: {ner_full_line}",
847
+ "",
848
+ "### Top NER-only field parse errors",
849
+ markdown_table(["field", "gold", "pred", "count"], ner_error_rows) if ner_error_rows else "- none",
850
+ "",
851
+ "### Hardest sampled parse failures (rule-assisted)",
852
+ markdown_json(model_eval["field_failures"][:10]) if model_eval["field_failures"] else "- none",
853
+ "",
854
+ "### Hardest sampled parse failures (NER-only)",
855
+ markdown_json(model_eval["field_failures_no_rules"][:10]) if model_eval["field_failures_no_rules"] else "- none",
856
+ "",
857
  "### Seqeval report",
858
  "```text\n" + model_eval["classification_report"] + "\n```",
859
  ]
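Note on the field accuracy tables: the sketch below shows, in isolation, how rows like those built by `parse_metric_tables` can be derived from a flat counter. It assumes only the `{field}_total` / `{field}_correct` key convention visible in the diff above; the helper name `field_accuracy_rows` and the sample counts are hypothetical and are not results from the real evaluation.

```python
from collections import Counter
from typing import List

FIELDS = ["group", "title", "season", "episode", "resolution", "source", "special"]


def field_accuracy_rows(metrics: Counter) -> List[List[str]]:
    """Build [field, correct/total, accuracy] rows for a markdown table."""
    rows = []
    for field in FIELDS:
        total = metrics.get(f"{field}_total", 0)
        correct = metrics.get(f"{field}_correct", 0)
        # Fields that never occur report 0.0 instead of raising ZeroDivisionError.
        acc = correct / total if total else 0.0
        rows.append([field, f"{correct:,}/{total:,}", f"{acc:.4f}"])
    return rows


# Illustrative counts only (assumption, not measured data).
sample_metrics = Counter(
    {"title_total": 1200, "title_correct": 900, "episode_total": 4, "episode_correct": 4}
)
rows = field_accuracy_rows(sample_metrics)
for row in rows:
    print(" | ".join(row))
```

Running the same helper once on the rule-assisted counter and once on the NER-only counter is what makes the two report tables directly comparable.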
dmhy_dataset.py CHANGED
@@ -19,7 +19,8 @@ from datetime import datetime, timezone
19
  from pathlib import Path
20
  from typing import Iterable, List, Optional, Sequence
21
 
22
- from data_generator import assign_bio, categorize_meta_token
 
23
  from tokenizer import AnimeTokenizer
24
 
25
 
@@ -35,8 +36,9 @@ NOISE_BRACKETS = {
35
  "繁中", "简中", "繁日", "简日", "日语", "日文", "外挂", "内封", "字幕",
36
  }
37
 
38
- SPECIAL_RE = re.compile(r"^(?:ova|oad|sp|movie|the\s*movie|op|ed|pv|cm|ncop|nced|剧场版|劇場版|特别篇|特別篇)$", re.I)
39
- EPISODE_RE = re.compile(r"^(?:[Ee][Pp]?|#)?(\d{1,4})(?:v\d+)?$", re.I)
 
40
  SEASON_RE = re.compile(
41
  r"^(?:"
42
  r"[Ss](\d{1,2})|"
@@ -45,16 +47,28 @@ SEASON_RE = re.compile(
45
  r"(\d+)(?:st|nd|rd|th)\s+[Ss]eason"
46
  r")$", re.I
47
  )
48
  SXE_RE = re.compile(r"^([Ss]\d{1,2})([Ee]\d{1,4})(?:v\d+)?$")
49
  DATE_RE = re.compile(r"^(?:19|20)\d{2}[.\-_年]?(?:0?[1-9]|1[0-2])?[.\-_月]?(?:0?[1-9]|[12]\d|3[01])?日?$")
50
  HASH_RE = re.compile(r"^[A-Fa-f0-9]{8,}$")
51
  DIMENSION_RE = re.compile(r"^\d{3,4}[xX×]\d{3,4}$")
52
  RESOLUTION_RE = re.compile(r"^(?:\d{3,4}[pP]|\d[Kk]|\d{3,4}[xX×]\d{3,4})$")
 
53
  SOURCE_RE = re.compile(
54
- r"^(?:WEB[-_ ]?DL|WEB[-_ ]?Rip|BDRip|BluRay|BDMV|DVDRip|DVD|TVRip|HDTV|"
55
  r"Netflix|NF|AMZN|Baha|CR|ABEMA|DSNP|U[-_ ]?NEXT|Hulu|AT[-_ ]?X|"
56
  r"x26[45]|h\.?26[45]|HEVC|AVC|AV1|AAC\d*(?:\.\d+)?|AAC|FLAC|MP3|DTS|Opus|"
57
- r"CHS|CHT|BIG5|GB|JPN?|简[体體]?|繁[体體]?|简日双语|繁日双语|内封|外挂|MSubs?)$",
58
  re.I,
59
  )
60
  GROUP_HINT_RE = re.compile(
@@ -112,12 +126,20 @@ def cn_number_to_int(text: str) -> Optional[int]:
112
  def season_number(token: str) -> Optional[int]:
113
  clean = clean_bracket(token)
114
  match = SEASON_RE.match(clean)
115
- if not match:
116
- return None
117
- value = next((g for g in match.groups() if g), None)
118
- if value is None:
119
- return None
120
- return cn_number_to_int(value)
121
 
122
 
123
  def episode_number(token: str) -> Optional[int]:
@@ -126,7 +148,13 @@ def episode_number(token: str) -> Optional[int]:
126
  return None
127
  if DIMENSION_RE.match(clean) or DATE_RE.match(clean) or HASH_RE.match(clean):
128
  return None
129
- if re.match(r"^第\d{1,4}[话話集]$", clean):
130
  return int(re.search(r"\d+", clean).group())
131
  match = EPISODE_RE.match(clean)
132
  if not match:
@@ -137,8 +165,13 @@ def episode_number(token: str) -> Optional[int]:
137
  return number
138
 
139
 
140
  def is_resolution(token: str) -> bool:
141
- return bool(RESOLUTION_RE.match(clean_bracket(token)))
 
142
 
143
 
144
  def is_source(token: str) -> bool:
@@ -149,11 +182,17 @@ def is_source(token: str) -> bool:
149
  is_resolution(clean) or SOURCE_RE.match(clean)
150
  ):
151
  return True
152
- return bool(SOURCE_RE.match(clean))
153
 
154
 
155
  def is_special(token: str) -> bool:
156
- return bool(SPECIAL_RE.match(clean_bracket(token)))
 
157
 
158
 
159
  def is_noise_bracket(token: str) -> bool:
@@ -194,7 +233,7 @@ def is_title_token(token: str) -> bool:
194
  return False
195
  if is_resolution(clean) or is_source(clean) or is_special(clean):
196
  return False
197
- if season_number(clean) is not None or episode_number(clean) is not None:
198
  return False
199
  if DATE_RE.match(clean) or HASH_RE.match(clean):
200
  return False
@@ -221,9 +260,13 @@ def find_episode_index(tokens: Sequence[str]) -> Optional[int]:
221
  number = episode_number(token)
222
  if number is None:
223
  continue
224
- score = 0
225
  clean = clean_bracket(token)
226
- if re.match(r"^(?:[Ee][Pp]?|#|第)", clean, re.I):
227
  score += 4
228
  if token.startswith("[") or token.startswith("(") or token.startswith("【"):
229
  score += 3
@@ -239,12 +282,317 @@ def find_episode_index(tokens: Sequence[str]) -> Optional[int]:
239
  return max(candidates, key=lambda item: (item[0], item[1]))[1]
240
 
241
 
242
  def label_bracket_contents(token: str, category: str, tokenizer: AnimeTokenizer) -> tuple[List[str], List[str]]:
243
  inner = clean_bracket(token)
244
  if not inner:
245
  return [token], [category]
246
- open_char = token[0] if token[0] in "[【(《" else ""
247
- close_char = token[-1] if token[-1] in "]】)》" else ""
248
  inner_tokens = tokenizer.tokenize(inner)
249
  tokens: List[str] = []
250
  cats: List[str] = []
@@ -259,6 +607,38 @@ def label_bracket_contents(token: str, category: str, tokenizer: AnimeTokenizer)
259
  return tokens, cats
260
 
261
262
  def expand_tokens_and_categories(
263
  tokens: Sequence[str],
264
  categories: Sequence[str],
@@ -281,15 +661,34 @@ def expand_tokens_and_categories(
281
  expanded_tokens.extend(split_tokens)
282
  expanded_categories.extend(split_categories)
283
  continue
284
  expanded_tokens.append(token)
285
  expanded_categories.append(category)
286
  return expanded_tokens, expanded_categories
287
 
288
 
289
  def weak_label_filename(filename: str, tokenizer: AnimeTokenizer) -> Optional[dict]:
290
  tokens = tokenizer.tokenize(filename)
291
  if not tokens:
292
  return None
293
 
294
  categories = ["sep" if token in {" ", "-", "_", "|", "~", "~", "."} else "title" for token in tokens]
295
 
@@ -306,15 +705,16 @@ def weak_label_filename(filename: str, tokenizer: AnimeTokenizer) -> Optional[di
306
  categories[idx] = "source"
307
  elif is_special(token):
308
  categories[idx] = "special"
309
- elif season_number(token) is not None:
310
  categories[idx] = "season"
311
  elif is_noise_bracket(token):
312
  categories[idx] = "sep"
313
 
314
  episode_idx = find_episode_index(tokens)
315
  if episode_idx is None:
316
- return None
317
  categories[episode_idx] = "episode"
 
318
 
319
  # S01E07 is tokenized as S01 + E07 after tokenizer changes. If an older
320
  # token slips through, expand_tokens_and_categories will split it.
@@ -341,7 +741,11 @@ def weak_label_filename(filename: str, tokenizer: AnimeTokenizer) -> Optional[di
341
  title_start += 1
342
  title_start, title_end = trim_title_span(tokens, title_start, title_end)
343
  if title_start >= title_end:
344
- return None
345
 
346
  for idx, token in enumerate(tokens):
347
  if title_start <= idx < title_end:
@@ -351,28 +755,13 @@ def weak_label_filename(filename: str, tokenizer: AnimeTokenizer) -> Optional[di
351
  categories[idx] = "sep"
352
 
353
  if not any(cat == "title" for cat in categories) or not any(cat == "episode" for cat in categories):
354
- return None
355
 
356
- # Expand bracket content for group/title tokens (e.g. [剑来 第2季] →
357
- # [, 剑, 来, , 第2季, ]) so that season markers mixed with title text
358
- # inside a bracket can be detected as separate tokens.
359
- expanded_tokens, expanded_categories = expand_tokens_and_categories(
360
- tokens, categories, tokenizer
361
- )
362
-
363
- # Re-detect season markers in expanded tokens (bracket expansion exposes
364
- # patterns like 第2季 that were previously hidden inside mixed brackets).
365
- for idx in range(len(expanded_tokens)):
366
- cat = expanded_categories[idx]
367
- if cat not in {"sep", "episode", "group", "source", "resolution",
368
- "special", "season"}:
369
- if season_number(expanded_tokens[idx]) is not None:
370
- expanded_categories[idx] = "season"
371
-
372
- labels = assign_bio(expanded_tokens, expanded_categories)
373
- if len(expanded_tokens) != len(labels):
374
- return None
375
- return {"tokens": expanded_tokens, "labels": labels}
376
 
377
 
378
  def iter_db_rows(db_path: Path, min_id: int, max_id: int) -> Iterable[tuple[int, str]]:
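Note on the pattern changes in dmhy_dataset.py: the widened `EPISODE_RE` and `SPECIAL_RE` can be checked in isolation. The sketch below copies the old and new patterns from this diff; the `OLD_`/`NEW_` names and the helper `episode_from` are hypothetical, and the real `episode_number` applies further checks (dates, hashes, dimensions) that are not reproduced here.

```python
import re

# Patterns copied from the removed (-) and added (+) lines of the diff.
OLD_EPISODE_RE = re.compile(r"^(?:[Ee][Pp]?|#)?(\d{1,4})(?:v\d+)?$", re.I)
NEW_EPISODE_RE = re.compile(r"^(?:[Ee][Pp]?|#)?(\d{1,4})(?:v\d+|END)?$", re.I)
OLD_SPECIAL_RE = re.compile(
    r"^(?:ova|oad|sp|movie|the\s*movie|op|ed|pv|cm|ncop|nced|剧场版|劇場版|特别篇|特別篇)$", re.I
)
NEW_SPECIAL_RE = re.compile(
    r"^(?:ova\d*|oad\d*|sp\d*|movie|the\s*movie|op|ed|pv|cm|ncop|nced|剧场版|劇場版|特别篇|特別篇)$", re.I
)


def episode_from(pattern, token):
    """Return the captured episode number, or None when the token does not match."""
    match = pattern.match(token)
    return int(match.group(1)) if match else None


for token in ["12END", "EP07v2", "#103"]:
    print(token, episode_from(OLD_EPISODE_RE, token), episode_from(NEW_EPISODE_RE, token))

for token in ["OVA", "OVA2", "SP3"]:
    print(token, bool(OLD_SPECIAL_RE.match(token)), bool(NEW_SPECIAL_RE.match(token)))
```

A final-episode token such as `12END` is rejected by the old pattern and accepted by the new one, while numbered specials such as `OVA2` only match the new `SPECIAL_RE`.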
 
19
  from pathlib import Path
20
  from typing import Iterable, List, Optional, Sequence
21
 
22
+ from data_generator import LABEL_MAP, categorize_meta_token
23
+ from label_repairs import season_marker_number
24
  from tokenizer import AnimeTokenizer
25
 
26
 
 
36
  "繁中", "简中", "繁日", "简日", "日语", "日文", "外挂", "内封", "字幕",
37
  }
38
 
39
+ SPECIAL_RE = re.compile(r"^(?:ova\d*|oad\d*|sp\d*|movie|the\s*movie|op|ed|pv|cm|ncop|nced|剧场版|劇場版|特别篇|特別篇)$", re.I)
40
+ SPECIAL_SEARCH_RE = re.compile(r"^(?:檢索|检索|搜索|搜寻|搜尋|别名|別名|alias|search|keyword)\s*[::].+", re.I)
41
+ EPISODE_RE = re.compile(r"^(?:[Ee][Pp]?|#)?(\d{1,4})(?:v\d+|END)?$", re.I)
42
  SEASON_RE = re.compile(
43
  r"^(?:"
44
  r"[Ss](\d{1,2})|"
 
47
  r"(\d+)(?:st|nd|rd|th)\s+[Ss]eason"
48
  r")$", re.I
49
  )
50
+ READING_SEASON_RE = re.compile(
51
+ r"^(?:Ni\s+no\s+(?:Sara|Shou|Sho|Syo|Shō)|Ni\s+Gakki|Sono\s+Ni|"
52
+ r"San\s+no\s+(?:Sara|Shou|Sho|Syo)|(?:Yon|Shi|Shin)\s+no\s+Sara|"
53
+ r"(?:Go|Gou)\s+no\s+Sara)$",
54
+ re.I,
55
+ )
56
+ CJK_SEQUEL_SEASON_RE = re.compile(
57
+ r"^(?:[一二三四五六七八九十兩两貳贰弐弍參叁参肆伍陸陆柒捌玖](?:\s*(?:ノ|の|之)\s*(?:章|期|季|部))?|"
58
+ r"[ⅡⅢⅣⅤⅥⅦⅧⅨ]|II|III|IV|V|VI|VII|VIII|IX)$",
59
+ re.I,
60
+ )
61
  SXE_RE = re.compile(r"^([Ss]\d{1,2})([Ee]\d{1,4})(?:v\d+)?$")
62
  DATE_RE = re.compile(r"^(?:19|20)\d{2}[.\-_年]?(?:0?[1-9]|1[0-2])?[.\-_月]?(?:0?[1-9]|[12]\d|3[01])?日?$")
63
  HASH_RE = re.compile(r"^[A-Fa-f0-9]{8,}$")
64
  DIMENSION_RE = re.compile(r"^\d{3,4}[xX×]\d{3,4}$")
65
  RESOLUTION_RE = re.compile(r"^(?:\d{3,4}[pP]|\d[Kk]|\d{3,4}[xX×]\d{3,4})$")
66
+ RESOLUTION_SEARCH_RE = re.compile(r"(?<![A-Za-z0-9])(?:\d{3,4}[pP]|\d[Kk]|\d{3,4}[xX×]\d{3,4})(?![A-Za-z0-9])")
67
  SOURCE_RE = re.compile(
68
+ r"^(?:WEB[-_ ]?DL|WEB[-_ ]?Rip|BDRip|BluRay|BDMV|BD|DVDRip|DVD|TVRip|HDTV|"
69
  r"Netflix|NF|AMZN|Baha|CR|ABEMA|DSNP|U[-_ ]?NEXT|Hulu|AT[-_ ]?X|"
70
  r"x26[45]|h\.?26[45]|HEVC|AVC|AV1|AAC\d*(?:\.\d+)?|AAC|FLAC|MP3|DTS|Opus|"
71
+ r"CHS|CHT|BIG5|GB|JPN?|JPSC|JPTC|简[体體]?|繁[体體]?|简日双语|繁日双语|内封|外挂|MSubs?)$",
72
  re.I,
73
  )
74
  GROUP_HINT_RE = re.compile(
 
126
  def season_number(token: str) -> Optional[int]:
127
  clean = clean_bracket(token)
128
  match = SEASON_RE.match(clean)
129
+ if match:
130
+ value = next((g for g in match.groups() if g), None)
131
+ if value is None:
132
+ return None
133
+ return cn_number_to_int(value)
134
+ if READING_SEASON_RE.match(clean) or CJK_SEQUEL_SEASON_RE.match(clean):
135
+ return season_marker_number(clean)
136
+ return None
137
+
138
+
139
+ def is_explicit_season(token: str) -> bool:
140
+ """Return True for unambiguous season syntax such as S02 or 第2季."""
141
+ clean = clean_bracket(token)
142
+ return bool(SEASON_RE.match(clean))
143
 
144
 
145
  def episode_number(token: str) -> Optional[int]:
 
148
  return None
149
  if DIMENSION_RE.match(clean) or DATE_RE.match(clean) or HASH_RE.match(clean):
150
  return None
151
+ if re.match(r"^第\d{1,4}(?:\(\d{1,4}\))?[话話集]$", clean):
152
+ return int(re.search(r"\d+", clean).group())
153
+ if re.match(r"^(?:OVA|OAD|SP)\d{1,4}$", clean, re.I):
154
+ return int(re.search(r"\d+", clean).group())
155
+ if re.match(r"^\d{1,4}\s*END$", clean, re.I):
156
+ return int(re.search(r"\d+", clean).group())
157
+ if re.match(r"^\d{1,4}[._]\d+$", clean):
158
  return int(re.search(r"\d+", clean).group())
159
  match = EPISODE_RE.match(clean)
160
  if not match:
 
165
  return number
166
 
167
 
168
+ def has_wrapping_brackets(token: str) -> bool:
169
+ return len(token) >= 2 and token[0] in "[【(《" and token[-1] in "]】)》"
170
+
171
+
172
  def is_resolution(token: str) -> bool:
173
+ clean = clean_bracket(token)
174
+ return bool(RESOLUTION_RE.match(clean) or (has_wrapping_brackets(token) and RESOLUTION_SEARCH_RE.search(clean)))
175
 
176
 
177
  def is_source(token: str) -> bool:
 
182
  is_resolution(clean) or SOURCE_RE.match(clean)
183
  ):
184
  return True
185
+ if SOURCE_RE.match(clean):
186
+ return True
187
+ if has_wrapping_brackets(token):
188
+ parts = [part for part in re.split(r"[\s&+/,._-]+", clean) if part]
189
+ return bool(parts) and all(SOURCE_RE.match(part) or is_noise_bracket(part) for part in parts)
190
+ return False
191
 
192
 
193
  def is_special(token: str) -> bool:
194
+ clean = clean_bracket(token)
195
+ return bool(SPECIAL_RE.match(clean) or SPECIAL_SEARCH_RE.match(clean))
196
 
197
 
198
  def is_noise_bracket(token: str) -> bool:
 
233
  return False
234
  if is_resolution(clean) or is_source(clean) or is_special(clean):
235
  return False
236
+ if is_explicit_season(clean) or episode_number(clean) is not None:
237
  return False
238
  if DATE_RE.match(clean) or HASH_RE.match(clean):
239
  return False
 
260
  number = episode_number(token)
261
  if number is None:
262
  continue
 
263
  clean = clean_bracket(token)
264
+ if idx > 0 and tokens[idx - 1] == "." and re.fullmatch(r"\d+", clean):
265
+ previous_clean = clean_bracket(tokens[idx - 2]) if idx >= 2 else ""
266
+ if previous_clean.lower() in VIDEO_EXTENSIONS or f".{clean}".lower() in VIDEO_EXTENSIONS:
267
+ continue
268
+ score = 0
269
+ if re.match(r"^(?:[Ee][Pp]?|#|第|OVA|OAD|SP)", clean, re.I):
270
  score += 4
271
  if token.startswith("[") or token.startswith("(") or token.startswith("【"):
272
  score += 3
 
282
  return max(candidates, key=lambda item: (item[0], item[1]))[1]
283
 
284
 
285
+ def is_separator_token(token: str) -> bool:
286
+ return token in {" ", "-", "_", "|", "~", "~", ".", "+", "&", "/", ","}
287
+
288
+
289
+ def has_only_separators_between(tokens: Sequence[str], start: int, end: int) -> bool:
290
+ return all(is_separator_token(token) for token in tokens[start:end])
291
+
292
+
293
+ def is_context_season_token(tokens: Sequence[str], idx: int, episode_idx: int) -> bool:
294
+ """Detect compact season markers only when they structurally lead into an episode."""
295
+ if idx >= episode_idx:
296
+ return False
297
+
298
+ token = tokens[idx]
299
+ clean = clean_bracket(token)
300
+ if not clean:
301
+ return False
302
+ if is_explicit_season(clean):
303
+ return True
304
+
305
+ if season_number(clean) is None:
306
+ return False
307
+ if not has_only_separators_between(tokens, idx + 1, episode_idx):
308
+ return False
309
+
310
+ # A bare V is often the volume prefix in V02E01, not season five.
311
+ if clean.upper() == "V":
312
+ return False
313
+ return True
314
+
315
+
316
+ def label_context_season_tokens(
317
+ tokens: Sequence[str],
318
+ categories: List[str],
319
+ episode_idx: int,
320
+ ) -> None:
321
+ if (
322
+ episode_idx >= 2
323
+ and clean_bracket(tokens[episode_idx]).upper().startswith("E")
324
+ and clean_bracket(tokens[episode_idx - 2]).upper() == "V"
325
+ and clean_bracket(tokens[episode_idx - 1]).isdigit()
326
+ ):
327
+ categories[episode_idx - 2] = "season"
328
+ categories[episode_idx - 1] = "season"
329
+ return
330
+
331
+ for idx in range(episode_idx):
332
+ if categories[idx] in {"group", "episode", "resolution", "source", "special"}:
333
+ continue
334
+ if is_context_season_token(tokens, idx, episode_idx):
335
+ categories[idx] = "season"
336
+
337
+
338
+ def embedded_bracket_episode(token: str) -> Optional[tuple[str, str, str]]:
339
+ """Split malformed tokens such as '[Group}Title[658]' into title + episode."""
340
+ if episode_number(token) is not None:
341
+ return None
342
+ match = re.match(r"^(?P<prefix>.+?)\[(?P<episode>\d{1,4}(?:v\d+)?)(?P<close>\])?$", token, re.I)
343
+ if match is None and has_wrapping_brackets(token):
344
+ match = re.match(r"^(?P<prefix>.+?)(?P<episode>\d{2,4})(?P<close>[\]\)】》])$", token, re.I)
345
+ if not match:
346
+ return None
347
+ prefix = match.group("prefix")
348
+ episode = match.group("episode")
349
+ close = match.group("close") or ""
350
+ if not clean_bracket(prefix):
351
+ return None
352
+ number = int(re.search(r"\d+", episode).group())
353
+ if number == 0 or number > 2000:
354
+ return None
355
+ return prefix, episode, close
356
+
357
+
358
+ def append_tokenized_category(
359
+ tokens: List[str],
360
+ categories: List[str],
361
+ text: str,
362
+ category: str,
363
+ tokenizer: AnimeTokenizer,
364
+ ) -> None:
365
+ for piece in tokenizer.tokenize(text):
366
+ if not piece:
367
+ continue
368
+ if is_separator_token(piece) or piece in {"[", "]", "(", ")", "【", "】", "《", "》"}:
369
+ piece_category = "sep"
370
+ else:
371
+ piece_category = category
372
+ tokens.append(piece)
373
+ categories.append(piece_category)
374
+
375
+
376
+ def finalize_weak_sample(
377
+ tokens: Sequence[str],
378
+ categories: Sequence[str],
379
+ tokenizer: AnimeTokenizer,
380
+ require_episode: bool = True,
381
+ ) -> Optional[dict]:
382
+ expanded_tokens, expanded_categories = expand_tokens_and_categories(tokens, categories, tokenizer)
383
+
384
+ # Only unambiguous season forms are promoted here. Compact sequel markers
385
+ # such as 貳, II, or Ni no Sara need episode context and are repaired by
386
+ # label_repairs from character spans; treating every single CJK numeral as
387
+ # season would corrupt titles like 魯邦三世.
388
+ for idx, token in enumerate(expanded_tokens):
389
+ if expanded_categories[idx] in {"sep", "episode", "group", "source", "resolution", "special", "season"}:
390
+ continue
391
+ if is_explicit_season(token):
392
+ expanded_categories[idx] = "season"
393
+
394
+ labels = assign_iob2(expanded_categories)
395
+ if len(expanded_tokens) != len(labels):
396
+ return None
397
+ if not any(label.endswith("TITLE") for label in labels):
398
+ return None
399
+ if require_episode and not any(label.endswith("EPISODE") for label in labels):
400
+ return None
401
+ return {"tokens": expanded_tokens, "labels": labels}
402
+
403
+
404
+ def assign_iob2(categories: Sequence[str]) -> List[str]:
405
+ labels: List[str] = []
406
+ previous_entity: Optional[str] = None
407
+ for category in categories:
408
+ entity = LABEL_MAP.get(category, "O")
409
+ if entity == "O":
410
+ labels.append("O")
411
+ previous_entity = None
412
+ continue
413
+ prefix = "I" if previous_entity == entity else "B"
414
+ labels.append(f"{prefix}-{entity}")
415
+ previous_entity = entity
416
+ return labels
417
+
418
+
419
+ def fallback_embedded_episode_sample(
420
+ tokens: Sequence[str],
421
+ tokenizer: AnimeTokenizer,
422
+ ) -> Optional[dict]:
423
+ rebuilt_tokens: List[str] = []
424
+ rebuilt_categories: List[str] = []
425
+ used_episode = False
426
+
427
+ for token in tokens:
428
+ embedded = embedded_bracket_episode(token)
429
+ if embedded and not used_episode:
430
+ prefix, episode, close = embedded
431
+ append_tokenized_category(rebuilt_tokens, rebuilt_categories, prefix, "title", tokenizer)
432
+ rebuilt_tokens.append(episode)
433
+ rebuilt_categories.append("episode")
434
+ if close:
435
+ rebuilt_tokens.append(close)
436
+ rebuilt_categories.append("sep")
437
+ used_episode = True
438
+ continue
439
+
440
+ if not used_episode:
441
+ category = "sep" if is_separator_token(token) else "title"
442
+ elif is_resolution(token):
443
+ category = "resolution"
444
+ elif is_source(token):
445
+ category = "source"
446
+ elif is_special(token):
447
+ category = "special"
448
+ else:
449
+ category = "sep"
450
+ rebuilt_tokens.append(token)
451
+ rebuilt_categories.append(category)
452
+
453
+ if not used_episode:
454
+ return None
455
+ return finalize_weak_sample(rebuilt_tokens, rebuilt_categories, tokenizer)
456
+
457
+
458
+ def has_embedded_episode_candidate(tokens: Sequence[str]) -> bool:
459
+ return any(embedded_bracket_episode(token) is not None for token in tokens)
460
+
461
+
462
+ def fallback_episode_first_sample(
463
+ tokens: Sequence[str],
464
+ categories: Sequence[str],
465
+ episode_idx: int,
466
+ tokenizer: AnimeTokenizer,
467
+ ) -> Optional[dict]:
468
+ fallback_categories = ["sep"] * len(tokens)
469
+
470
+ # V02E01-style catalog rows are episode-first. The tokenizer currently
471
+ # exposes them as V, 02, E01, so keep V02 together as a season span.
472
+ if (
473
+ episode_idx >= 2
474
+ and clean_bracket(tokens[episode_idx]).upper().startswith("E")
475
+ and clean_bracket(tokens[episode_idx - 2]).upper() == "V"
476
+ and clean_bracket(tokens[episode_idx - 1]).isdigit()
477
+ ):
478
+ fallback_categories[episode_idx - 2] = "season"
479
+ fallback_categories[episode_idx - 1] = "season"
480
+ else:
481
+ label_context_season_tokens(tokens, fallback_categories, episode_idx)
482
+
483
+ fallback_categories[episode_idx] = "episode"
484
+
485
+ title_indices: List[int] = []
486
+ for idx in range(episode_idx + 1, len(tokens)):
487
+ token = tokens[idx]
488
+ if is_separator_token(token):
489
+ continue
490
+ if is_resolution(token) or is_source(token) or is_special(token) or is_noise_bracket(token):
491
+ fallback_categories[idx] = "resolution" if is_resolution(token) else "source" if is_source(token) else "special" if is_special(token) else "sep"
492
+ continue
493
+ title_indices.append(idx)
494
+
495
+ if not title_indices:
496
+ # Some rows are title-only brackets followed by season/episode,
497
+ # e.g. [伊蘇] II-01. If the leading bracket was guessed as GROUP but
498
+ # no real title exists, use it as TITLE to keep the row useful.
499
+ for idx in range(episode_idx):
500
+ if categories[idx] == "group" and clean_bracket(tokens[idx]):
501
+ title_indices.append(idx)
502
+ break
503
+
504
+ for idx in title_indices:
505
+ fallback_categories[idx] = "title"
506
+ if title_indices:
507
+ for idx in range(title_indices[0], title_indices[-1] + 1):
508
+ if is_separator_token(tokens[idx]):
509
+ fallback_categories[idx] = "title"
510
+
511
+ return finalize_weak_sample(tokens, fallback_categories, tokenizer)
512
+
513
+
514
+ def fallback_minimal_sample(
515
+ tokens: Sequence[str],
516
+ episode_idx: int,
517
+ tokenizer: AnimeTokenizer,
518
+ ) -> Optional[dict]:
519
+ """Keep malformed low-information rows instead of silently dropping them."""
520
+ categories: List[str] = []
521
+ title_idx: Optional[int] = None
522
+
523
+ for idx, token in enumerate(tokens):
524
+ if idx == episode_idx:
525
+ categories.append("episode")
526
+ elif is_resolution(token):
527
+ categories.append("resolution")
528
+ elif is_source(token):
529
+ categories.append("source")
530
+ elif is_special(token):
531
+ categories.append("special")
532
+ if title_idx is None:
533
+ title_idx = idx
534
+ else:
535
+ categories.append("sep")
536
+
537
+ if title_idx is None:
538
+ for idx, token in enumerate(tokens):
539
+ if idx == episode_idx or is_separator_token(token):
540
+ continue
541
+ if categories[idx] not in {"resolution", "source"}:
542
+ title_idx = idx
543
+ break
544
+ if title_idx is None:
545
+ return None
546
+
547
+ categories[title_idx] = "title"
548
+ return finalize_weak_sample(tokens, categories, tokenizer)
549
+
550
+
551
+ def fallback_no_episode_sample(tokens: Sequence[str], tokenizer: AnimeTokenizer) -> Optional[dict]:
552
+ """Label movies, OP/ED/SP, and malformed rows that have no true episode token."""
553
+ categories: List[str] = []
554
+ seen_title = False
555
+ title_allowed = True
556
+
557
+ for idx, token in enumerate(tokens):
558
+ if is_separator_token(token):
559
+ categories.append("title" if seen_title and title_allowed else "sep")
560
+ continue
561
+ if idx == 0 and is_group_bracket(token, idx, tokens):
562
+ categories.append("group")
563
+ continue
564
+ if is_resolution(token):
565
+ categories.append("resolution")
566
+ title_allowed = False
567
+ continue
568
+ if is_source(token):
569
+ categories.append("source")
570
+ title_allowed = False
571
+ continue
572
+ if is_special(token):
573
+ categories.append("special")
574
+ title_allowed = False
575
+ continue
576
+ if is_noise_bracket(token):
577
+ categories.append("sep")
578
+ continue
579
+ categories.append("title")
580
+ seen_title = True
581
+
582
+ return finalize_weak_sample(tokens, categories, tokenizer, require_episode=False)
583
+
584
+
585
+ def bracket_delimiters(token: str) -> tuple[str, str]:
586
+ open_char = token[0] if token and token[0] in "[【(《" else ""
587
+ close_char = token[-1] if token and token[-1] in "]】)》" else ""
588
+ return open_char, close_char
589
+
590
+
591
  def label_bracket_contents(token: str, category: str, tokenizer: AnimeTokenizer) -> tuple[List[str], List[str]]:
592
  inner = clean_bracket(token)
593
  if not inner:
594
  return [token], [category]
595
+ open_char, close_char = bracket_delimiters(token)
 
596
  inner_tokens = tokenizer.tokenize(inner)
597
  tokens: List[str] = []
598
  cats: List[str] = []
 
607
  return tokens, cats
608
 
609
 
610
+ def label_meta_bracket_contents(token: str, tokenizer: AnimeTokenizer) -> tuple[List[str], List[str]]:
611
+ inner = clean_bracket(token)
612
+ if not inner:
613
+ return [token], ["sep"]
614
+ open_char, close_char = bracket_delimiters(token)
615
+ inner_tokens = tokenizer.tokenize(inner)
616
+ tokens: List[str] = []
617
+ cats: List[str] = []
618
+ if open_char:
619
+ tokens.append(open_char)
620
+ cats.append("sep")
621
+ for inner_token in inner_tokens:
622
+ if is_separator_token(inner_token):
623
+ cat = "sep"
624
+ elif is_resolution(inner_token) or RESOLUTION_SEARCH_RE.fullmatch(inner_token):
625
+ cat = "resolution"
626
+ elif is_source(inner_token):
627
+ cat = "source"
628
+ elif is_special(inner_token):
629
+ cat = "special"
630
+ elif is_noise_bracket(inner_token):
631
+ cat = "sep"
632
+ else:
633
+ cat = "sep"
634
+ tokens.append(inner_token)
635
+ cats.append(cat)
636
+ if close_char:
637
+ tokens.append(close_char)
638
+ cats.append("sep")
639
+ return tokens, cats
640
+
641
+
642
  def expand_tokens_and_categories(
643
  tokens: Sequence[str],
644
  categories: Sequence[str],
 
661
  expanded_tokens.extend(split_tokens)
662
  expanded_categories.extend(split_categories)
663
  continue
664
+ if category in {"source", "resolution", "special", "sep"} and (
665
+ token.startswith("[") or token.startswith("(") or token.startswith("【") or token.startswith("《")
666
+ ):
667
+ split_tokens, split_categories = label_meta_bracket_contents(token, tokenizer)
668
+ if any(cat != "sep" for cat in split_categories):
669
+ expanded_tokens.extend(split_tokens)
670
+ expanded_categories.extend(split_categories)
671
+ continue
672
  expanded_tokens.append(token)
673
  expanded_categories.append(category)
674
  return expanded_tokens, expanded_categories
675
 
676
 
677
  def weak_label_filename(filename: str, tokenizer: AnimeTokenizer) -> Optional[dict]:
678
+ basename = normalize_path_basename(str(filename))
679
+ stem, ext = strip_video_extension(basename)
680
+ if ext in VIDEO_EXTENSIONS:
681
+ filename = stem
682
+ else:
683
+ filename = basename
684
+
685
  tokens = tokenizer.tokenize(filename)
686
  if not tokens:
687
  return None
688
+ if has_embedded_episode_candidate(tokens):
689
+ embedded_sample = fallback_embedded_episode_sample(tokens, tokenizer)
690
+ if embedded_sample is not None:
691
+ return embedded_sample
692
 
693
  categories = ["sep" if token in {" ", "-", "_", "|", "~", "~", "."} else "title" for token in tokens]
694
 
 
705
  categories[idx] = "source"
706
  elif is_special(token):
707
  categories[idx] = "special"
708
+ elif is_explicit_season(token):
709
  categories[idx] = "season"
710
  elif is_noise_bracket(token):
711
  categories[idx] = "sep"
712
 
713
  episode_idx = find_episode_index(tokens)
714
  if episode_idx is None:
715
+ return fallback_embedded_episode_sample(tokens, tokenizer) or fallback_no_episode_sample(tokens, tokenizer)
716
  categories[episode_idx] = "episode"
717
+ label_context_season_tokens(tokens, categories, episode_idx)
718
 
719
  # S01E07 is tokenized as S01 + E07 after tokenizer changes. If an older
720
  # token slips through, expand_tokens_and_categories will split it.
 
741
  title_start += 1
742
  title_start, title_end = trim_title_span(tokens, title_start, title_end)
743
  if title_start >= title_end:
744
+ return fallback_embedded_episode_sample(tokens, tokenizer) or fallback_episode_first_sample(
745
+ tokens, categories, episode_idx, tokenizer
746
+ ) or fallback_minimal_sample(
747
+ tokens, episode_idx, tokenizer
748
+ )
749
 
750
  for idx, token in enumerate(tokens):
751
  if title_start <= idx < title_end:
 
755
  categories[idx] = "sep"
756
 
757
  if not any(cat == "title" for cat in categories) or not any(cat == "episode" for cat in categories):
758
+ return fallback_embedded_episode_sample(tokens, tokenizer) or fallback_episode_first_sample(
759
+ tokens, categories, episode_idx, tokenizer
760
+ ) or fallback_minimal_sample(
761
+ tokens, episode_idx, tokenizer
762
+ )
763
 
764
+ return finalize_weak_sample(tokens, categories, tokenizer)
765
 
766
 
767
  def iter_db_rows(db_path: Path, min_id: int, max_id: int) -> Iterable[tuple[int, str]]:
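Note on `assign_iob2`: the new helper converts per-token categories into IOB2 tags, starting a new span whenever the entity changes or an `O` token intervenes. The sketch below reproduces that logic so it can be run standalone. `LABEL_MAP` here is a hypothetical stand-in for `data_generator.LABEL_MAP`, whose real contents are not shown in this diff; only the `TITLE` and `EPISODE` entity names are confirmed by `finalize_weak_sample`.

```python
from typing import List, Optional, Sequence

# Hypothetical mapping (assumption); the real one lives in data_generator.py.
LABEL_MAP = {"group": "GROUP", "title": "TITLE", "season": "SEASON", "episode": "EPISODE"}


def assign_iob2(categories: Sequence[str]) -> List[str]:
    """Tag each category with B-/I- prefixes; unmapped categories become O."""
    labels: List[str] = []
    previous_entity: Optional[str] = None
    for category in categories:
        entity = LABEL_MAP.get(category, "O")
        if entity == "O":
            labels.append("O")
            previous_entity = None  # an O token always closes the current span
            continue
        prefix = "I" if previous_entity == entity else "B"
        labels.append(f"{prefix}-{entity}")
        previous_entity = entity
    return labels


labels = assign_iob2(["group", "sep", "title", "title", "sep", "title", "episode"])
print(labels)
```

Because a `sep` category closes the open span, a separator left inside a title splits it into two `B-TITLE` spans. This is consistent with `fallback_episode_first_sample` relabeling separators between title tokens as `title` before the labels are assigned.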
evaluate_parser_cases.py ADDED
@@ -0,0 +1,163 @@
+ """Evaluate parser checkpoints on fixed real-world filename cases."""
+
+ import argparse
+ import json
+ import os
+ from typing import Dict, List, Optional
+
+ import torch
+ from transformers import BertForTokenClassification
+
+ from config import Config
+ from inference import parse_filename
+ from tokenizer import load_tokenizer
+
+
+ DEFAULT_CASE_FILE = os.path.join("data", "parser_regression_cases.json")
+
+
+ def normalize_field_value(field: str, value) -> Optional[str]:
+     if value is None:
+         return None
+     if field in {"episode", "season"}:
+         try:
+             return str(int(value))
+         except (TypeError, ValueError):
+             return str(value).strip().lower()
+     text = str(value).strip()
+     if field in {"resolution", "source"}:
+         return text.lower().replace("_", "-")
+     return " ".join(text.lower().split())
+
+
+ def load_cases(path: str) -> List[Dict]:
+     with open(path, "r", encoding="utf-8") as f:
+         cases = json.load(f)
+     if not isinstance(cases, list):
+         raise ValueError(f"{path} must contain a JSON list")
+     return cases
+
+
+ def evaluate_cases(
+     model_dir: str,
+     case_file: str,
+     tokenizer_variant: Optional[str],
+     max_length: Optional[int],
+     use_rules: bool,
+     constrain_bio: bool,
+ ) -> Dict:
+     cfg = Config()
+     tokenizer = load_tokenizer(model_dir, tokenizer_variant)
+     model = BertForTokenClassification.from_pretrained(model_dir)
+     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+     model.to(device)
+     model.eval()
+
+     id2label = {int(k): v for k, v in getattr(model.config, "id2label", cfg.id2label).items()}
+     resolved_max_length = max_length or int(getattr(model.config, "max_seq_length", 64))
+     cases = load_cases(case_file)
+
+     field_totals: Dict[str, int] = {}
+     field_correct: Dict[str, int] = {}
+     results = []
+     full_correct = 0
+
+     for case in cases:
+         expected = case.get("expected", {})
+         pred = parse_filename(
+             case["filename"],
+             model,
+             tokenizer,
+             id2label,
+             max_length=resolved_max_length,
+             debug=False,
+             use_rules=use_rules,
+             constrain_bio=constrain_bio,
+         )
+         errors = {}
+         for field, expected_value in expected.items():
+             field_totals[field] = field_totals.get(field, 0) + 1
+             expected_norm = normalize_field_value(field, expected_value)
+             pred_norm = normalize_field_value(field, pred.get(field))
+             if expected_norm == pred_norm:
+                 field_correct[field] = field_correct.get(field, 0) + 1
+             else:
+                 errors[field] = {
+                     "expected": expected_value,
+                     "pred": pred.get(field),
+                 }
+         if not errors:
+             full_correct += 1
+         results.append(
+             {
+                 "id": case.get("id"),
+                 "filename": case["filename"],
+                 "ok": not errors,
+                 "errors": errors,
+                 "expected": expected,
+                 "pred": {field: pred.get(field) for field in sorted(expected)},
+             }
+         )
+
+     field_accuracy = {
+         field: field_correct.get(field, 0) / total
+         for field, total in sorted(field_totals.items())
+     }
+     return {
+         "model_dir": model_dir,
+         "case_file": case_file,
+         "tokenizer_variant": getattr(tokenizer, "tokenizer_variant", "regex"),
+         "max_length": resolved_max_length,
+         "use_rules": use_rules,
+         "constrain_bio": constrain_bio,
+         "case_count": len(cases),
+         "full_correct": full_correct,
+         "full_accuracy": full_correct / len(cases) if cases else 0.0,
+         "field_correct": field_correct,
+         "field_total": field_totals,
+         "field_accuracy": field_accuracy,
+         "failures": [result for result in results if not result["ok"]],
+         "results": results,
+     }
+
+
+ def main() -> None:
+     parser = argparse.ArgumentParser(description="Evaluate parser on fixed filename regression cases")
+     parser.add_argument("--model-dir", required=True)
+     parser.add_argument("--case-file", default=DEFAULT_CASE_FILE)
+     parser.add_argument("--tokenizer", choices=["regex", "char"], default=None)
+     parser.add_argument("--max-length", type=int, default=None)
+     parser.add_argument("--output", default=None, help="Optional JSON output path")
+     parser.add_argument("--no-rule-assist", action="store_true")
+     parser.add_argument("--no-constrained-bio", action="store_true")
+     args = parser.parse_args()
+
+     metrics = evaluate_cases(
+         model_dir=args.model_dir,
+         case_file=args.case_file,
+         tokenizer_variant=args.tokenizer,
+         max_length=args.max_length,
+         use_rules=not args.no_rule_assist,
+         constrain_bio=not args.no_constrained_bio,
+     )
+
+     print(
+         f"Full case accuracy: {metrics['full_correct']}/{metrics['case_count']} "
+         f"({metrics['full_accuracy']:.4f})"
+     )
+     for field, total in metrics["field_total"].items():
+         correct = metrics["field_correct"].get(field, 0)
+         print(f" {field}: {correct}/{total} ({correct / total:.4f})")
+     if metrics["failures"]:
+         print("\nFailures:")
+         for failure in metrics["failures"]:
+             print(json.dumps(failure, ensure_ascii=False))
+
+     if args.output:
+         os.makedirs(os.path.dirname(args.output) or ".", exist_ok=True)
+         with open(args.output, "w", encoding="utf-8") as f:
+             json.dump(metrics, f, ensure_ascii=False, indent=2)
+
+
+ if __name__ == "__main__":
+     main()
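
Note on the comparison in evaluate_parser_cases.py: the script compares normalized strings, not raw values, so `"03"` and `3` count as the same episode and `WEB_DL` matches `web-dl`. The sketch below copies `normalize_field_value` from the diff above and runs it on a few made-up expected/predicted pairs. The pairs are hypothetical and are not taken from `data/parser_regression_cases.json`.

```python
from typing import Optional


def normalize_field_value(field: str, value) -> Optional[str]:
    """Copy of the helper in evaluate_parser_cases.py, reproduced for illustration."""
    if value is None:
        return None
    if field in {"episode", "season"}:
        try:
            # Numeric fields compare as canonical integers: "03" -> "3".
            return str(int(value))
        except (TypeError, ValueError):
            # Non-numeric values such as "12v2" fall back to a lowercase string.
            return str(value).strip().lower()
    text = str(value).strip()
    if field in {"resolution", "source"}:
        return text.lower().replace("_", "-")
    # Free-text fields: lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


# Hypothetical (field, expected, predicted) triples.
pairs = [
    ("episode", "03", 3),
    ("source", "WEB_DL", "web-dl"),
    ("title", "Sousou  no Frieren", "sousou no frieren"),
    ("season", None, 2),
]
for field, expected, predicted in pairs:
    same = normalize_field_value(field, expected) == normalize_field_value(field, predicted)
    print(field, same)
```

One consequence worth keeping in mind: an expected value of `None` only matches a prediction that is also `None`, so a case can assert that a field must be absent.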
exports/anime_filename_parser.metadata.json CHANGED
@@ -1,12 +1,12 @@
  {
-   "model_dir": "checkpoints\\dmhy-finetune\\final",
+   "model_dir": ".",
    "output": "exports\\anime_filename_parser.onnx",
-   "max_length": 64,
+   "max_length": 128,
    "sample": "[ANi] 葬送的芙莉莲 S2 - 03 [1080P][WEB-DL]",
    "logits_shape": [
      1,
-     64,
+     128,
      15
    ],
-   "max_abs_diff": 3.1948089599609375e-05
+   "max_abs_diff": 3.3855438232421875e-05
  }
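
Note on `max_abs_diff`: the export script is not part of this section, so the following is only a sketch under the assumption that the field records the largest element-wise absolute difference between the PyTorch logits and the ONNX logits for the sample filename. The helper names `flatten` and `max_abs_diff` and the toy tensors are hypothetical.

```python
def flatten(values):
    """Yield every scalar from an arbitrarily nested list."""
    for item in values:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item


def max_abs_diff(reference, exported) -> float:
    """Largest element-wise absolute difference between two nested lists."""
    ref = list(flatten(reference))
    exp = list(flatten(exported))
    if len(ref) != len(exp):
        raise ValueError("logit tensors must have the same number of elements")
    return max((abs(r - e) for r, e in zip(ref, exp)), default=0.0)


# Toy logits shaped [batch][sequence][labels]; real ones would be [1][128][15].
torch_logits = [[[1.0, 2.0], [3.0, 4.0]]]
onnx_logits = [[[1.0, 2.5], [3.0, 4.0]]]
print(max_abs_diff(torch_logits, onnx_logits))
```

A value around 3e-05, as in the metadata above, is the kind of small float32 discrepancy one would typically expect between two runtimes.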
exports/anime_filename_parser.onnx CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:684a9bd25f9e53e01adcf1e3bd60c8c674fa66d94e11167ab807f73517501603
+ oid sha256:f9b874fbd4217a190487f512dcc6dd7ce2f0e610147703ca0cddcc0db44fb1c7
- size 16356487
+ size 19633926
inference.py CHANGED
@@ -20,6 +20,7 @@ import torch
20
  from transformers import BertForTokenClassification
21
 
22
  from config import Config
 
23
  from tokenizer import AnimeTokenizer, load_tokenizer
24
 
25
 
@@ -37,6 +38,10 @@ def extract_season_number(text: str) -> Optional[int]:
37
  Examples:
38
  "S2" → 2, "Season 2" → 2, "第二季" → 2, "1st Season" → 1
39
  """
40
  # Arabic digits
41
  match = re.search(r'(\d+)', text)
42
  if match:
@@ -261,19 +266,66 @@ def postprocess(
261
 
262
 
263
  BRACKET_RE = re.compile(r"\[([^\]]+)\]|\(([^)]+)\)|【([^】]+)】|《([^》]+)》")
264
- RESOLUTION_RE = re.compile(r"\b(?:\d{3,4}[pP]|\d[Kk]|\d{3,4}[xX×]\d{3,4})\b")
265
- SOURCE_RE = re.compile(
266
- r"\b(?:WEB[-_ ]?DL|WEB[-_ ]?Rip|BDRip|BluRay|BDMV|DVDRip|DVD|TVRip|HDTV|"
267
- r"Netflix|NF|AMZN|Baha|CR|ABEMA|DSNP|U[-_ ]?NEXT|Hulu|AT[-_ ]?X)\b",
268
  re.I,
269
  )
270
  EPISODE_PATTERNS = [
271
- re.compile(r"(?:^|[\s._\-\[\(【《#])(?:EP?|第)?(?P<ep>\d{1,4})(?:v\d+)?(?:[话話集])?(?=$|[\s._\-\]\)】》])", re.I),
272
- re.compile(r"[Ss]\d{1,2}[Ee](?P<ep>\d{1,4})(?:v\d+)?", re.I),
273
  ]
274
  SEASON_RE = re.compile(r"(?:^|[\s._\-\[\(【《])(?:[Ss](?P<s1>\d{1,2})|Season\s*(?P<s2>\d{1,2})|第(?P<s3>[一二三四五六七八九十\d]+)[季期部])", re.I)
275
  NOISE_META_RE = re.compile(
276
- r"^(?:\d{3,4}[pP]|\d[Kk]|WEB[-_ ]?DL|WEB[-_ ]?Rip|BDRip|BluRay|BDMV|DVDRip|DVD|TVRip|"
277
  r"HDTV|Netflix|NF|AMZN|Baha|CR|HEVC|AVC|AV1|x26[45]|h\.?26[45]|AAC.*|FLAC|MP3|DTS|"
278
  r"Opus|ASS.*|CHS|CHT|BIG5|GB|JPN?|MP4|MKV|繁中|简中|内封|外挂)$",
279
  re.I,
@@ -316,6 +368,52 @@ def looks_like_group(text: str) -> bool:
316
  )
317
 
318
 
319
  def apply_rule_assists(filename: str, result: Dict) -> Dict:
320
  """
321
  Fill high-confidence structural fields from filename conventions.
@@ -327,8 +425,8 @@ def apply_rule_assists(filename: str, result: Dict) -> Dict:
327
  brackets = bracket_parts(filename)
328
 
329
  if (not repaired.get("group") or (repaired.get("title") and repaired["group"] in repaired["title"])) and brackets:
330
- first_text, first_start, _first_end = brackets[0]
331
- if first_start == 0 and looks_like_group(first_text):
332
  repaired["group"] = first_text
333
 
334
  if not repaired.get("resolution"):
@@ -336,10 +434,34 @@ def apply_rule_assists(filename: str, result: Dict) -> Dict:
336
  if match:
337
  repaired["resolution"] = match.group(0)
338
 
339
- if not repaired.get("source"):
340
- match = SOURCE_RE.search(filename)
341
- if match:
342
- repaired["source"] = match.group(0).replace("_", "-")
343
 
344
  if repaired.get("season") is None:
345
  match = SEASON_RE.search(filename)
@@ -348,52 +470,223 @@ def apply_rule_assists(filename: str, result: Dict) -> Dict:
348
  season = cn_number_to_int(value)
349
  if season is not None:
350
  repaired["season"] = season
351
-
352
- if repaired.get("episode") is None:
353
- candidates: List[Tuple[int, int, str]] = []
354
- for pattern in EPISODE_PATTERNS:
355
- for match in pattern.finditer(filename):
356
- ep_text = match.group("ep")
357
- ep = int(ep_text)
358
- if ep == 0 or ep > 2000:
359
- continue
360
- score = match.start()
361
- if 1 <= ep <= 200:
362
- score += 10000
363
- if "-" in filename[max(0, match.start() - 3):match.start() + 1]:
364
- score += 1000
365
- if match.start() > len(filename) // 3:
366
- score += 200
367
- candidates.append((score, ep, ep_text))
368
- if candidates:
369
- repaired["episode"] = max(candidates, key=lambda item: item[0])[1]
370
 
371
  title = repaired.get("title")
372
  group = repaired.get("group")
373
  if title and group and title.startswith(group):
374
  title = title[len(group):].lstrip("]】)>})》 \t-_.")
375
  repaired["title"] = title or repaired["title"]
376
 
377
- if (not repaired.get("title") or (group and repaired["title"].startswith(group))) and repaired.get("episode"):
378
  repaired_title = infer_title_span(filename, group, repaired["episode"])
379
  if repaired_title:
380
  repaired["title"] = repaired_title
381
 
382
  return repaired
383
 
384
 
385
  def infer_title_span(filename: str, group: Optional[str], episode: Optional[int]) -> Optional[str]:
386
  start = 0
387
  if group:
388
  first = BRACKET_RE.match(filename)
389
  if first and group in first.group(0):
390
  start = first.end()
391
 
392
  end = None
393
  if episode is not None:
394
  ep_patterns = [
 
395
  rf"\s[-_]\s*0*{episode}(?:v\d+)?(?=$|[\s\[\(【《._-])",
396
  rf"[\[\(【《]0*{episode}(?:v\d+)?[\]\)】》]",
 
 
397
  rf"[Ee]0*{episode}(?:v\d+)?",
398
  ]
399
  for pattern in ep_patterns:
@@ -412,7 +705,7 @@ def infer_title_span(filename: str, group: Optional[str], episode: Optional[int]
412
 
413
  if end is None or end <= start:
414
  return None
415
- title = filename[start:end].strip(" \t-_.[]()【】《》()")
416
  return title or None
417
 
418
 
@@ -448,6 +741,16 @@ def parse_filename(
448
 
449
  # Convert to input IDs
450
  input_ids = tokenizer.convert_tokens_to_ids(tokens)
451
  unk_token_id = tokenizer.unk_token_id
452
  unk_tokens = [token for token, token_id in zip(tokens, input_ids) if token_id == unk_token_id]
453
 
@@ -516,6 +819,10 @@ def parse_filename(
516
  "unk_count": len(unk_tokens),
517
  "unk_rate": len(unk_tokens) / len(tokens) if tokens else 0.0,
518
  "unk_tokens": unk_tokens[:50],
519
  "tokens": tokens[:available],
520
  "labels": label_strings,
521
  "scores": [round(float(score), 4) for score in selected_scores],
@@ -544,7 +851,7 @@ def main():
544
  parser.add_argument("filename", nargs="?", type=str, help="Anime filename to parse")
545
  parser.add_argument("--input-file", type=str, help="File with filenames (one per line)")
546
  parser.add_argument("--output-file", type=str, help="Output file for results (JSONL)")
547
- parser.add_argument("--model-dir", type=str, default="./checkpoints/final",
548
  help="Path to trained model directory")
549
  parser.add_argument("--tokenizer", choices=["regex", "char"], default=None,
550
  help="Tokenizer variant override. Defaults to checkpoint metadata")
 
20
  from transformers import BertForTokenClassification
21
 
22
  from config import Config
23
+ from label_repairs import season_marker_number
24
  from tokenizer import AnimeTokenizer, load_tokenizer
25
 
26
 
 
38
  Examples:
39
  "S2" → 2, "Season 2" → 2, "第二季" → 2, "1st Season" → 1
40
  """
41
+ marker_value = season_marker_number(text)
42
+ if marker_value is not None:
43
+ return marker_value
44
+
45
  # Arabic digits
46
  match = re.search(r'(\d+)', text)
47
  if match:
 
266
 
267
 
268
  BRACKET_RE = re.compile(r"\[([^\]]+)\]|\(([^)]+)\)|【([^】]+)】|《([^》]+)》")
269
+ RESOLUTION_RE = re.compile(r"(?<![A-Za-z0-9])(?:\d{3,4}[pP]|\d[Kk]|\d{3,4}[xX×]\d{3,4})(?![A-Za-z0-9])")
270
+ SOURCE_TOKEN_PATTERN = (
271
+ r"WEB[-_ ]?DL|WEB[-_ ]?Rip|BDRip|BluRay|BDMV|BD|DVDRip|DVD|TVRip|HDTV|"
272
+ r"Netflix|NF|AMZN|Baha|CR|ABEMA|DSNP|U[-_ ]?NEXT|Hulu|AT[-_ ]?X|"
273
+ r"CHS|CHT|GB|BIG5|JPN?|繁中|简中"
274
+ )
275
+ SOURCE_RE = re.compile(rf"\b(?:{SOURCE_TOKEN_PATTERN})\b", re.I)
276
+ SOURCE_TAG_RE = re.compile(
277
+ rf"^(?:{SOURCE_TOKEN_PATTERN})(?:\s*(?:[&+/]|,\s*)\s*(?:{SOURCE_TOKEN_PATTERN}))*$",
278
+ re.I,
279
+ )
280
+ SPECIAL_TAG_RE = re.compile(
281
+ r"^(?:檢索|检索|搜索|搜寻|搜尋|别名|別名|alias|search|keyword)\s*[::].+",
282
  re.I,
283
  )
284
  EPISODE_PATTERNS = [
285
+ ("season_episode", re.compile(r"[Ss]\d{1,2}[Ee](?P<ep>\d{1,4})(?:v\d+)?", re.I)),
286
+ ("dash_episode", re.compile(r"(?:^|[\s._])[-_]\s*(?P<ep>\d{1,4})(?:v\d+)?(?=$|[\s._\-\]\)】》\[])")),
287
+ ("bracket_episode", re.compile(r"[\[\(【《](?:EP?|#)?(?P<ep>\d{1,4})(?:v\d+)?[\]\)】》]", re.I)),
288
+ ("explicit_episode", re.compile(r"(?:^|[\s._\-\[\(【《#])(?:EP?|第|#)(?P<ep>\d{1,4})(?:v\d+)?(?:[话話集])?(?=$|[\s._\-\]\)】》])", re.I)),
289
+ (
290
+ "long_episode",
291
+ re.compile(
292
+ r"(?:^|[\s._\-\[\(【《])(?P<ep>\d{3,4})(?:v\d+)?"
293
+ r"(?=[\s._\-\]\)】》\[]+(?:\d{3,4}[pP]|WEB|BD|BluRay|HDTV|NF|AMZN|CR|Baha))",
294
+ re.I,
295
+ ),
296
+ ),
297
+ ("generic_episode", re.compile(r"(?:^|[\s._\-\[\(【《#])(?P<ep>\d{1,3})(?:v\d+)?(?=$|[\s._\-\]\)】》])", re.I)),
298
  ]
299
  SEASON_RE = re.compile(r"(?:^|[\s._\-\[\(【《])(?:[Ss](?P<s1>\d{1,2})|Season\s*(?P<s2>\d{1,2})|第(?P<s3>[一二三四五六七八九十\d]+)[季期部])", re.I)
300
+ SEQUEL_MARKER_RE = re.compile(
301
+ r"(?<![A-Za-z0-9])"
302
+ r"(?P<marker>"
303
+ r"Ni\s+no\s+(?:Sara|Shou|Sho|Syo|Shō)|"
304
+ r"San\s+no\s+(?:Sara|Shou|Sho|Syo)|"
305
+ r"(?:Yon|Shi|Shin)\s+no\s+Sara|"
306
+ r"(?:Go|Gou)\s+no\s+Sara|"
307
+ r"Ni\s+Gakki|Sono\s+Ni|Ni|"
308
+ r"II|III|IV|V|VI|VII|VIII|IX|[ⅡⅢⅣⅤⅥⅦⅧⅨ]|"
309
+ r"[一二三四五六七八九十兩两貳贰弐弍參叁参肆伍陸陆柒捌玖](?:\s*(?:ノ|の|之)\s*(?:章|期|季|部))?"
310
+ r")"
311
+ r"(?![A-Za-z0-9])",
312
+ re.I,
313
+ )
314
+ TRAILING_SEQUEL_MARKER_RE = re.compile(
315
+ r"(?:^|[\s._-])"
316
+ r"(?P<marker>"
317
+ r"Ni\s+no\s+(?:Sara|Shou|Sho|Syo|Shō)|"
318
+ r"San\s+no\s+(?:Sara|Shou|Sho|Syo)|"
319
+ r"(?:Yon|Shi|Shin)\s+no\s+Sara|"
320
+ r"(?:Go|Gou)\s+no\s+Sara|"
321
+ r"Ni\s+Gakki|Sono\s+Ni|Ni|"
322
+ r"II|III|IV|V|VI|VII|VIII|IX|[ⅡⅢⅣⅤⅥⅦⅧⅨ]|"
323
+ r"[一二三四五六七八九十兩两貳贰弐弍參叁参肆伍陸陆柒捌玖](?:\s*(?:ノ|の|之)\s*(?:章|期|季|部))?"
324
+ r")$",
325
+ re.I,
326
+ )
327
  NOISE_META_RE = re.compile(
328
+ r"^(?:\d{3,4}[pP]|\d[Kk]|WEB[-_ ]?DL|WEB[-_ ]?Rip|BDRip|BluRay|BDMV|BD|DVDRip|DVD|TVRip|"
329
  r"HDTV|Netflix|NF|AMZN|Baha|CR|HEVC|AVC|AV1|x26[45]|h\.?26[45]|AAC.*|FLAC|MP3|DTS|"
330
  r"Opus|ASS.*|CHS|CHT|BIG5|GB|JPN?|MP4|MKV|繁中|简中|内封|外挂)$",
331
  re.I,
 
368
  )
369
 
370
 
371
+ def looks_like_episode_or_meta(text: str) -> bool:
372
+ if not text:
373
+ return False
374
+ clean = text.strip()
375
+ return bool(
376
+ re.fullmatch(r"(?:EP?|#)?\d{1,4}(?:v\d+)?", clean, re.I)
377
+ or RESOLUTION_RE.search(clean)
378
+ or SOURCE_TAG_RE.fullmatch(clean)
379
+ or SOURCE_RE.search(clean)
380
+ or SPECIAL_TAG_RE.search(clean)
381
+ or NOISE_META_RE.search(clean)
382
+ )
383
+
384
+
385
+ def looks_like_structural_group(text: str, filename: str, bracket_end: int) -> bool:
386
+ """Heuristic for short leading release-group brackets not in the name list."""
387
+ if looks_like_group(text):
388
+ return True
389
+ if not text or looks_like_episode_or_meta(text):
390
+ return False
391
+
392
+ after = filename[bracket_end:].lstrip(" \t._")
393
+ if after.startswith("-"):
394
+ return False
395
+ next_bracket = BRACKET_RE.match(after)
396
+ if next_bracket:
397
+ next_text = next(group for group in next_bracket.groups() if group is not None)
398
+ if looks_like_episode_or_meta(next_text):
399
+ return False
400
+
401
+ words = re.findall(r"[A-Za-z0-9]+", text)
402
+ if not words:
403
+ if re.search(r"[\u3400-\u9fff]", text) and len(text) <= 32:
404
+ return True
405
+ return False
406
+ if len(text) > 32:
407
+ return False
408
+ if len(words) == 1:
409
+ return True
410
+ if any(sep in text for sep in "-_"):
411
+ return True
412
+ if words[0].isupper() and len(words[0]) <= 4 and len(words) <= 3:
413
+ return True
414
+ return False
415
+
416
+
417
  def apply_rule_assists(filename: str, result: Dict) -> Dict:
418
  """
419
  Fill high-confidence structural fields from filename conventions.
 
425
  brackets = bracket_parts(filename)
426
 
427
  if (not repaired.get("group") or (repaired.get("title") and repaired["group"] in repaired["title"])) and brackets:
428
+ first_text, first_start, first_end = brackets[0]
429
+ if first_start == 0 and looks_like_structural_group(first_text, filename, first_end):
430
  repaired["group"] = first_text
431
 
432
  if not repaired.get("resolution"):
 
434
  if match:
435
  repaired["resolution"] = match.group(0)
436
 
437
+ source_matches = source_candidates(filename)
438
+ current_source = repaired.get("source")
439
+ preferred_source = source_matches[0] if source_matches else None
440
+ if source_matches and (
441
+ not current_source
442
+ or not SOURCE_RE.fullmatch(str(current_source))
443
+ or len(str(current_source)) <= 3 and str(current_source).lower() not in {"nf", "cr"}
444
+ or (
445
+ preferred_source
446
+ and str(current_source).lower().replace("_", "-") in {"web-dl", "webdl", "webrip", "web-rip"}
447
+ and preferred_source.lower().replace("_", "-") not in {"web-dl", "webdl", "webrip", "web-rip"}
448
+ )
449
+ ):
450
+ repaired["source"] = preferred_source
451
+
452
+ if not repaired.get("special"):
453
+ for text, _start, _end in brackets:
454
+ clean = text.strip()
455
+ if SPECIAL_TAG_RE.search(clean):
456
+ repaired["special"] = clean
457
+ break
458
+
459
+ episode = best_structural_episode(filename)
460
+ if episode is not None and (
461
+ repaired.get("episode") is None
462
+ or not plausible_episode_context(filename, int(repaired["episode"]))
463
+ ):
464
+ repaired["episode"] = episode
465
 
466
  if repaired.get("season") is None:
467
  match = SEASON_RE.search(filename)
 
470
  season = cn_number_to_int(value)
471
  if season is not None:
472
  repaired["season"] = season
473
+ if repaired.get("season") is None and repaired.get("episode") is not None:
474
+ sequel = structural_sequel_marker(filename, repaired.get("group"), repaired.get("episode"))
475
+ if sequel is not None:
476
+ repaired["season"] = sequel[1]
477
+ elif repaired.get("episode") == repaired.get("season") and not SEASON_RE.search(filename):
478
+ repaired["season"] = None
479
 
480
  title = repaired.get("title")
481
  group = repaired.get("group")
482
+ if group and (NOISE_META_RE.search(str(group)) or SOURCE_RE.fullmatch(str(group)) or RESOLUTION_RE.fullmatch(str(group))):
483
+ repaired["group"] = None
484
+ group = None
485
+
486
  if title and group and title.startswith(group):
487
  title = title[len(group):].lstrip("]】)>})》 \t-_.")
488
  repaired["title"] = title or repaired["title"]
489
 
490
+ if repaired.get("episode"):
491
  repaired_title = infer_title_span(filename, group, repaired["episode"])
492
  if repaired_title:
493
  repaired["title"] = repaired_title
494
 
495
+ if repaired.get("title") and repaired.get("season") is not None:
496
+ repaired["title"] = strip_trailing_season_from_title(repaired["title"], repaired["season"])
497
+
498
  return repaired
499
 
500
 
501
+ def structural_sequel_marker(
502
+ filename: str,
503
+ group: Optional[str],
504
+ episode: Optional[int],
505
+ ) -> Optional[Tuple[str, int]]:
506
+ if episode is None:
507
+ return None
508
+ title_end = None
509
+ if episode is not None:
510
+ ep_patterns = [
511
+ rf"[Ss]\d{{1,2}}[Ee]0*{episode}(?:v\d+)?",
512
+ rf"\s[-_]\s*0*{episode}(?:v\d+)?(?=$|[\s\[\(【《._-])",
513
+ rf"[\[\(【《]0*{episode}(?:v\d+)?[\]\)】》]",
514
+ rf"#\s*0*{episode}(?:v\d+)?(?=$|[\s\[\(【《._-])",
515
+ rf"(?:^|[\s._\-\[\(【《])第0*{episode}(?:[话話集])?(?=$|[\s._\-\]\)】》])",
516
+ ]
517
+ start = 0
518
+ if group:
519
+ first = BRACKET_RE.match(filename)
520
+ if first and group in first.group(0):
521
+ start = first.end()
522
+ for pattern in ep_patterns:
523
+ match = re.search(pattern, filename[start:], re.I)
524
+ if match:
525
+ title_end = start + match.start()
526
+ break
527
+ if title_end is None:
528
+ return None
529
+
530
+ prefix = filename[:title_end].rstrip(" \t-_.")
531
+ for match in reversed(list(SEQUEL_MARKER_RE.finditer(prefix))):
532
+ marker = match.group("marker")
533
+ value = season_marker_number(marker)
534
+ if value is None:
535
+ continue
536
+ tail = prefix[match.end():].strip(" \t-_.")
537
+ if tail:
538
+ continue
539
+ if marker.lower() == "ni" and "Kakuriyo no Yadomeshi Ni" not in prefix:
540
+ continue
541
+ return marker, value
542
+ return None
543
+
544
+
545
+ def normalize_source_text(text: str) -> str:
546
+ text = re.sub(r"\s+", "", text.strip())
547
+ text = re.sub(r"(?i)WEB[_ ]?DL", "WEB-DL", text)
548
+ text = re.sub(r"(?i)WEB[_ ]?Rip", "WebRip", text)
549
+ text = re.sub(r"(?i)U[_ ]?NEXT", "U-NEXT", text)
550
+ text = re.sub(r"(?i)AT[_ ]?X", "AT-X", text)
551
+ return text.replace("_", "-")
552
+
553
+
554
+ def source_priority(source: str) -> int:
555
+ normalized = source.lower().replace("_", "-").replace(" ", "")
556
+ parts = re.split(r"[&+/,]", normalized)
557
+ if any(part in {"nf", "netflix", "amzn", "baha", "cr", "abema", "dsnp", "u-next", "hulu", "at-x"} for part in parts):
558
+ return 90
559
+ if any(part in {"web-dl", "webdl", "webrip", "web-rip", "bdrip", "bluray", "bdmv", "bd", "dvdrip", "dvd", "tvrip", "hdtv"} for part in parts):
560
+ return 60
561
+ if len(parts) > 1:
562
+ return 40
563
+ return 20
564
+
565
+
566
+ def source_candidates(filename: str) -> List[str]:
567
+ candidates: List[Tuple[int, int, str]] = []
568
+ for text, start, _end in bracket_parts(filename):
569
+ clean = text.strip()
570
+ if SOURCE_TAG_RE.fullmatch(clean):
571
+ normalized = normalize_source_text(clean)
572
+ candidates.append((source_priority(normalized), -start, normalized))
573
+
574
+ for match in SOURCE_RE.finditer(filename):
575
+ normalized = normalize_source_text(match.group(0))
576
+ candidates.append((source_priority(normalized), -match.start(), normalized))
577
+
578
+ deduped: Dict[str, Tuple[int, int, str]] = {}
579
+ for priority, neg_start, value in candidates:
580
+ key = value.lower()
581
+ if key not in deduped or (priority, neg_start) > (deduped[key][0], deduped[key][1]):
582
+ deduped[key] = (priority, neg_start, value)
583
+
584
+ return [value for _priority, _neg_start, value in sorted(deduped.values(), reverse=True)]
585
+
586
+
587
+ def best_structural_episode(filename: str) -> Optional[int]:
588
+ priorities = {
589
+ "season_episode": 1000,
590
+ "dash_episode": 900,
591
+ "bracket_episode": 850,
592
+ "explicit_episode": 800,
593
+ "long_episode": 750,
594
+ "generic_episode": 100,
595
+ }
596
+ candidates: List[Tuple[int, int, int]] = []
597
+ for name, pattern in EPISODE_PATTERNS:
598
+ for match in pattern.finditer(filename):
599
+ ep_text = match.group("ep")
600
+ ep = int(ep_text)
601
+ if ep == 0 or ep > 2000:
602
+ continue
603
+ context = filename[max(0, match.start() - 5):match.end() + 5]
604
+ if RESOLUTION_RE.search(context) or re.search(r"AAC|DDP|AC3|H\.?26[45]|x26[45]", context, re.I):
605
+ continue
606
+ priority = priorities[name]
607
+ if 1 <= ep <= 200:
608
+ priority += 20
609
+ candidates.append((priority, match.start(), ep))
610
+ if not candidates:
611
+ return None
612
+ return max(candidates, key=lambda item: (item[0], item[1]))[2]
613
+
614
+
615
+ def plausible_episode_context(filename: str, episode: int) -> bool:
616
+ ep_text = str(episode)
617
+ padded = f"{episode:02d}"
618
+ if re.search(rf"(?<![A-Za-z0-9])(?:H|x)\.?0*{re.escape(ep_text)}(?!\d)", filename, re.I):
619
+ return False
620
+ patterns = [
621
+ rf"[Ss]\d{{1,2}}[Ee]0*{episode}(?:v\d+)?",
622
+ rf"(?:^|[\s._])[-_]\s*0*{episode}(?:v\d+)?(?=$|[\s._\-\]\)】》\[])",
623
+ rf"[\[\(【《](?:EP?|#)?0*{episode}(?:v\d+)?[\]\)】》]",
624
+ rf"(?:^|[\s._\-\[\(【《#])(?:EP?|第|#)0*{episode}(?:v\d+)?(?:[话話集])?(?=$|[\s._\-\]\)】》])",
625
+ rf"(?:^|[\s._\-\[\(【《])0*{episode}(?:v\d+)?(?=[\s._\-\]\)】》\[]+(?:\d{{3,4}}[pP]|WEB|BD|BluRay|HDTV|NF|AMZN|CR|Baha))",
626
+ ]
627
+ return any(re.search(pattern, filename, re.I) for pattern in patterns) or bool(
628
+ re.search(rf"(?:^|[\s._\-\[\(【《])(?:{re.escape(ep_text)}|{re.escape(padded)})(?=$|[\s._\-\]\)】》])", filename)
629
+ )
630
+
631
+
632
+ def strip_trailing_season_from_title(title: str, season: int) -> str:
633
+ season_text = str(season)
634
+ patterns = [
635
+ rf"\s+[Ss]0*{season_text}$",
636
+ rf"\s+Season\s*0*{season_text}$",
637
+ rf"\s+0*{season_text}$",
638
+ ]
639
+ cleaned = title
640
+ for pattern in patterns:
641
+ cleaned = re.sub(pattern, "", cleaned, flags=re.I).strip(" \t-_.")
642
+ match = TRAILING_SEQUEL_MARKER_RE.search(cleaned)
643
+ if match and season_marker_number(match.group("marker")) == season:
644
+ cleaned = cleaned[:match.start()].strip(" \t-_.")
645
+ return cleaned or title
646
+
647
+
648
+ def clean_inferred_title(title: str) -> str:
649
+ raw_title = title.strip(" \t-_.")
650
+ bracket_matches = list(BRACKET_RE.finditer(raw_title))
651
+ if bracket_matches:
652
+ first = bracket_matches[0]
653
+ prefix = raw_title[:first.start()].strip(" \t-_.★☆")
654
+ text = next(group for group in first.groups() if group is not None).strip()
655
+ if text and not looks_like_episode_or_meta(text) and (
656
+ not prefix
657
+ or re.search(r"(?:新番|月|合集|繁|简|字幕|先行|合集|★|☆)", prefix, re.I)
658
+ ):
659
+ return text
660
+ return raw_title.strip("[]()【】《》()")
661
+
662
+
663
  def infer_title_span(filename: str, group: Optional[str], episode: Optional[int]) -> Optional[str]:
664
  start = 0
665
  if group:
666
  first = BRACKET_RE.match(filename)
667
  if first and group in first.group(0):
668
  start = first.end()
669
+ else:
670
+ # Some releases put leading metadata before the actual title, e.g.
671
+ # `[1080p] Title - 01`. Do not keep that wrapper as title text.
672
+ while True:
673
+ leading = BRACKET_RE.match(filename[start:].lstrip(" \t._-"))
674
+ if not leading:
675
+ break
676
+ skipped_ws = len(filename[start:]) - len(filename[start:].lstrip(" \t._-"))
677
+ text = next(group for group in leading.groups() if group is not None)
678
+ if not looks_like_episode_or_meta(text):
679
+ break
680
+ start += skipped_ws + leading.end()
681
 
682
  end = None
683
  if episode is not None:
684
  ep_patterns = [
685
+ rf"[Ss]\d{{1,2}}[Ee]0*{episode}(?:v\d+)?",
686
  rf"\s[-_]\s*0*{episode}(?:v\d+)?(?=$|[\s\[\(【《._-])",
687
  rf"[\[\(【《]0*{episode}(?:v\d+)?[\]\)】》]",
688
+ rf"#\s*0*{episode}(?:v\d+)?(?=$|[\s\[\(【《._-])",
689
+ rf"(?:^|[\s._\-\[\(【《])第0*{episode}(?:[话話集])?(?=$|[\s._\-\]\)】》])",
690
  rf"[Ee]0*{episode}(?:v\d+)?",
691
  ]
692
  for pattern in ep_patterns:
 
705
 
706
  if end is None or end <= start:
707
  return None
708
+ title = clean_inferred_title(filename[start:end])
709
  return title or None
710
 
711
 
 
741
 
742
  # Convert to input IDs
743
  input_ids = tokenizer.convert_tokens_to_ids(tokens)
744
+ embedding_size = model.get_input_embeddings().weight.shape[0]
745
+ out_of_range_tokens = [
746
+ token for token, token_id in zip(tokens, input_ids)
747
+ if token_id >= embedding_size
748
+ ]
749
+ if out_of_range_tokens:
750
+ input_ids = [
751
+ token_id if token_id < embedding_size else tokenizer.unk_token_id
752
+ for token_id in input_ids
753
+ ]
754
  unk_token_id = tokenizer.unk_token_id
755
  unk_tokens = [token for token, token_id in zip(tokens, input_ids) if token_id == unk_token_id]
756
 
 
819
  "unk_count": len(unk_tokens),
820
  "unk_rate": len(unk_tokens) / len(tokens) if tokens else 0.0,
821
  "unk_tokens": unk_tokens[:50],
822
+ "vocab_mismatch": bool(out_of_range_tokens),
823
+ "model_embedding_size": int(embedding_size),
824
+ "tokenizer_vocab_size": int(tokenizer.vocab_size),
825
+ "out_of_range_tokens": out_of_range_tokens[:50],
826
  "tokens": tokens[:available],
827
  "labels": label_strings,
828
  "scores": [round(float(score), 4) for score in selected_scores],
 
851
  parser.add_argument("filename", nargs="?", type=str, help="Anime filename to parse")
852
  parser.add_argument("--input-file", type=str, help="File with filenames (one per line)")
853
  parser.add_argument("--output-file", type=str, help="Output file for results (JSONL)")
854
+ parser.add_argument("--model-dir", type=str, default=".",
855
  help="Path to trained model directory")
856
  parser.add_argument("--tokenizer", choices=["regex", "char"], default=None,
857
  help="Tokenizer variant override. Defaults to checkpoint metadata")
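
Note on the episode heuristic in inference.py: `best_structural_episode` replaces the old position-based score with named patterns ranked by fixed priority, so an `S01E05` or ` - 03` form beats a bare number elsewhere in the filename. The sketch below is a simplified illustration, not the committed function. It keeps four of the six patterns and their priorities from the diff, and it omits the surrounding-context filter that skips numbers near resolution or codec tokens. The name `best_episode` is hypothetical.

```python
import re
from typing import List, Optional, Tuple

# (name, priority, pattern) triples; patterns and priorities copied from the diff.
PATTERNS = [
    ("season_episode", 1000, re.compile(r"[Ss]\d{1,2}[Ee](?P<ep>\d{1,4})(?:v\d+)?")),
    (
        "dash_episode",
        900,
        re.compile(r"(?:^|[\s._])[-_]\s*(?P<ep>\d{1,4})(?:v\d+)?(?=$|[\s._\-\]\)】》\[])"),
    ),
    (
        "bracket_episode",
        850,
        re.compile(r"[\[\(【《](?:EP?|#)?(?P<ep>\d{1,4})(?:v\d+)?[\]\)】》]", re.I),
    ),
    (
        "generic_episode",
        100,
        re.compile(r"(?:^|[\s._\-\[\(【《#])(?P<ep>\d{1,3})(?:v\d+)?(?=$|[\s._\-\]\)】》])"),
    ),
]


def best_episode(filename: str) -> Optional[int]:
    """Return the episode number from the highest-priority structural match."""
    candidates: List[Tuple[int, int, int]] = []
    for _name, priority, pattern in PATTERNS:
        for match in pattern.finditer(filename):
            ep = int(match.group("ep"))
            if ep == 0 or ep > 2000:
                continue
            # Small bonus for typical TV episode numbers, as in the diff.
            score = priority + 20 if 1 <= ep <= 200 else priority
            candidates.append((score, match.start(), ep))
    if not candidates:
        return None
    # Highest score wins; on a tie the later match position wins.
    return max(candidates, key=lambda item: (item[0], item[1]))[2]


for name in [
    "[ANi] 葬送的芙莉莲 S2 - 03 [1080P][WEB-DL]",
    "Show.S01E05.1080p.WEB-DL.mkv",
]:
    print(name, "->", best_episode(name))
```

Ranking by pattern class first makes the choice easier to reason about than the previous additive score, where a late-positioned generic number could outscore an explicit dash form.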
label_repairs.py ADDED
@@ -0,0 +1,513 @@
+ """Deterministic label repairs for known weak-label blind spots."""
+
+ from __future__ import annotations
+
+ import re
+ from dataclasses import dataclass
+ from typing import Dict, Iterable, List, Optional, Sequence, Tuple
+
+
+ SEPARATOR_CHARS = set(" \t-_.|~~")
+
+ ROMAN_NUMERAL_VALUES = {
+     "II": 2,
+     "III": 3,
+     "IV": 4,
+     "V": 5,
+     "VI": 6,
+     "VII": 7,
+     "VIII": 8,
+     "IX": 9,
+     "Ⅱ": 2,
+     "Ⅲ": 3,
+     "Ⅳ": 4,
+     "Ⅴ": 5,
+     "Ⅵ": 6,
+     "Ⅶ": 7,
+     "Ⅷ": 8,
+     "Ⅸ": 9,
+ }
+
+ CN_NUMERAL_VALUES = {
+     "一": 1,
+     "二": 2,
+     "兩": 2,
+     "两": 2,
+     "貳": 2,
+     "贰": 2,
+     "弐": 2,
+     "弍": 2,
+     "三": 3,
+     "參": 3,
+     "叁": 3,
+     "参": 3,
+     "四": 4,
+     "肆": 4,
+     "五": 5,
+     "伍": 5,
+     "六": 6,
+     "陸": 6,
+     "陆": 6,
+     "七": 7,
+     "柒": 7,
+     "八": 8,
+     "捌": 8,
+     "九": 9,
+     "玖": 9,
+     "十": 10,
+ }
+
+ READING_MARKER_VALUES = {
+     "ni no sara": 2,
+     "ni no shou": 2,
+     "ni no sho": 2,
+     "ni no syo": 2,
+     "ni no shō": 2,
+     "ni gakki": 2,
+     "sono ni": 2,
+     "san no sara": 3,
+     "san no shou": 3,
+     "san no sho": 3,
+     "san no syo": 3,
+     "yon no sara": 4,
+     "shi no sara": 4,
+     "shin no sara": 4,
+     "go no sara": 5,
+     "gou no sara": 5,
+ }
+
+ # Bare "Ni" is often the Japanese particle に in romanized titles. Only repair
+ # it for titles that have been verified as a sequel marker in the release name.
+ STANDALONE_NI_SEASON_BASES = {
+     "Kakuriyo no Yadomeshi": 2,
+ }
+
+ EPISODE_CONTEXT_RE = re.compile(
+     r"^\s*(?:"
+     r"[-_]\s*(?:\d{1,4}|NCOP|NCED|OP|ED|OVA|OAD|SP|END)\b|"
+     r"#\s*\d{1,4}|"
+     r"[\[\(【《]\s*(?:EP?|#)?\d{1,4}"
+     r")",
+     re.I,
+ )
+
+ EPISODE_SPAN_RE = re.compile(
+     r"(?:"
+     r"[Ss]\d{1,2}[Ee]\d{1,4}(?:v\d+)?|"
+     r"(?:^|[\s._])[-_]\s*\d{1,4}(?:v\d+)?(?=$|[\s._\-\]\)】》\[])|"
+     r"[\[\(【《](?:EP?|#)?\d{1,4}(?:v\d+)?[\]\)】》]|"
+     r"(?:^|[\s._\-\[\(【《#])(?:EP?|第|#)\d{1,4}(?:v\d+)?(?:[话話集])?(?=$|[\s._\-\]\)】》])"
+     r")",
+     re.I,
+ )
+ BRACKET_RE = re.compile(r"\[([^\]]*)\]|\(([^)]*)\)|【([^】]*)】|《([^》]*)》")
+ RESOLUTION_RE = re.compile(r"(?<![A-Za-z0-9])(?:\d{3,4}[pP]|\d[Kk]|\d{3,4}[xX×]\d{3,4})(?![A-Za-z0-9])")
+ SOURCE_TOKEN_PATTERN = (
+     r"WEB[-_ ]?DL|WEB[-_ ]?Rip|BDRip|BluRay|BDMV|BD|DVDRip|DVD|TVRip|HDTV|"
+     r"Netflix|NF|AMZN|Baha|CR|ABEMA|DSNP|U[-_ ]?NEXT|Hulu|AT[-_ ]?X|"
+     r"x26[45]|h\.?26[45]|HEVC|AVC|AV1|AAC\d*(?:\.\d+)?|AAC|FLAC|MP3|DTS|Opus|"
+     r"CHS|CHT|GB|BIG5|JPN?|JPSC|JPTC|繁中|简中"
+ )
+ SOURCE_RE = re.compile(rf"(?<![A-Za-z0-9])(?:{SOURCE_TOKEN_PATTERN})(?![A-Za-z0-9])", re.I)
+ SOURCE_TAG_RE = re.compile(
+     rf"^(?:{SOURCE_TOKEN_PATTERN})(?:\s*(?:[&+/,_-]|,\s*)\s*(?:{SOURCE_TOKEN_PATTERN}))*$",
+     re.I,
+ )
+ SPECIAL_TAG_RE = re.compile(
+     r"^(?:檢索|检索|搜索|搜寻|搜尋|别名|別名|alias|search|keyword)\s*[::].+",
+     re.I,
+ )
+
+ READING_MARKER_RE = re.compile(
+     r"(?<![A-Za-z0-9])"
+     r"(?P<marker>"
+     r"Ni\s+no\s+(?:Sara|Shou|Sho|Syo|Shō)|"
+     r"San\s+no\s+(?:Sara|Shou|Sho|Syo)|"
+     r"(?:Yon|Shi|Shin)\s+no\s+Sara|"
+     r"(?:Go|Gou)\s+no\s+Sara|"
+     r"Ni\s+Gakki|"
+     r"Sono\s+Ni"
+     r")"
+     r"(?![A-Za-z0-9])",
+ )
+
+ ROMAN_MARKER_RE = re.compile(
+     r"(?<![A-Za-z0-9])"
+     r"(?P<marker>II|III|IV|V|VI|VII|VIII|IX|[ⅡⅢⅣⅤⅥⅦⅧⅨ])"
+     r"(?![A-Za-z0-9])"
+ )
+
+ CJK_MARKER_RE = re.compile(
+     r"(?P<marker>"
+     r"[一二三四五六七八九十兩两貳贰弐弍參叁参肆伍陸陆柒捌玖](?:\s*(?:ノ|の|之)\s*(?:章|期|季|部))?|"
+     r"第[一二三四五六七八九十兩两貳贰弐弍參叁参肆伍陸陆柒捌玖\d]+[季期部章]"
+     r")"
+ )
+
+
+ @dataclass(frozen=True)
+ class LabelRepair:
+     kind: str
+     marker: str
+     value: int
+     start: int
+     end: int
+
+
+ def clean_marker_text(text: str) -> str:
+     return text.strip().strip("[]()【】《》()").strip()
+
+
+ def cn_number_to_int(text: str) -> Optional[int]:
+     text = text.strip()
+     if text.isdigit():
+         return int(text)
+     if text in CN_NUMERAL_VALUES:
+         return CN_NUMERAL_VALUES[text]
+     values = CN_NUMERAL_VALUES
+     if text.startswith("十") and len(text) == 2:
+         return 10 + values.get(text[1], 0)
+     if text.endswith("十") and len(text) == 2:
+         return values.get(text[0], 0) * 10
+     if "十" in text and len(text) == 3:
+         return values.get(text[0], 0) * 10 + values.get(text[2], 0)
+     return None
+
+
+ def season_marker_number(text: str) -> Optional[int]:
+     """Return season number for compact sequel markers such as II or Ni no Sara."""
+     clean = clean_marker_text(text)
+     if not clean:
+         return None
+
+     if clean in ROMAN_NUMERAL_VALUES:
+         return ROMAN_NUMERAL_VALUES[clean]
+
+     lowered = re.sub(r"\s+", " ", clean.lower()).strip()
+     if lowered in READING_MARKER_VALUES:
+         return READING_MARKER_VALUES[lowered]
+     if lowered == "ni":
+         return 2
+
+     explicit = re.fullmatch(r"第(.+)[季期部章]", clean)
+     if explicit:
+         return cn_number_to_int(explicit.group(1))
+
+     cjk = re.fullmatch(r"([一二三四五六七八九十兩两貳贰弐弍參叁参肆伍陸陆柒捌玖])(?:\s*(?:ノ|の|之)\s*(?:章|期|季|部))?", clean)
+     if cjk:
+         return cn_number_to_int(cjk.group(1))
+
+     return None
+
+
+ def token_offsets_in_text(text: str, tokens: Sequence[str]) -> Optional[List[Tuple[int, int]]]:
+     offsets: List[Tuple[int, int]] = []
+     cursor = 0
+     for token in tokens:
+         if token == "":
+             offsets.append((cursor, cursor))
+             continue
+         position = text.find(token, cursor)
+         if position < 0:
+             return None
+         end = position + len(token)
+         offsets.append((position, end))
+         cursor = end
+     return offsets
+
+
+ def has_episode_context(text: str, marker_end: int) -> bool:
+     tail = text[marker_end:]
+     if EPISODE_CONTEXT_RE.match(tail):
+         return True
+
+     # Some releases put a season marker at the end of a title bracket and the
+     # episode in the next bracket: `[Title 貳之章][01]`.
+     tail = tail.lstrip()
+     tail = re.sub(r"^[\]\)】》]\s*", "", tail)
+     tail = re.sub(
+         r"^(?:[\[\(【《]\s*(?:menu|menus|bdmenu|ncop|nced|op|ed|ova|oad|sp)\s*[\]\)】》]\s*){0,2}",
+         "",
+         tail,
+         flags=re.I,
+     )
+     return bool(EPISODE_CONTEXT_RE.match(tail))
+
+
+ def find_sequel_season_markers(text: str) -> List[LabelRepair]:
+     """Find high-confidence sequel markers that should be labeled as SEASON."""
+     repairs: List[LabelRepair] = []
+
+     for pattern, kind in (
+         (READING_MARKER_RE, "reading"),
+         (ROMAN_MARKER_RE, "roman"),
+         (CJK_MARKER_RE, "cjk"),
+     ):
+         for match in pattern.finditer(text):
+             marker = match.group("marker")
+             value = season_marker_number(marker)
+             if value is None or not has_episode_context(text, match.end()):
+                 continue
+             repairs.append(LabelRepair(kind, marker, value, match.start(), match.end()))
+
+     for base, value in STANDALONE_NI_SEASON_BASES.items():
+         pattern = re.compile(rf"(?<![A-Za-z0-9]){re.escape(base)}\s+(?P<marker>Ni)(?![A-Za-z0-9])")
+         for match in pattern.finditer(text):
+             if not has_episode_context(text, match.end("marker")):
+                 continue
+             repairs.append(
+                 LabelRepair(
+                     kind="verified_bare_ni",
+                     marker=match.group("marker"),
+                     value=value,
+                     start=match.start("marker"),
+                     end=match.end("marker"),
+                 )
+             )
+
+     repairs.sort(key=lambda item: (item.start, item.end))
+     deduped: List[LabelRepair] = []
+     for repair in repairs:
+         if deduped and repair.start < deduped[-1].end:
+             previous = deduped[-1]
+             if (repair.end - repair.start) > (previous.end - previous.start):
+                 deduped[-1] = repair
+             continue
+         deduped.append(repair)
+     return deduped
+
+
+ def labels_have_season_before(labels: Sequence[str], offsets: Sequence[Tuple[int, int]], marker_start: int) -> bool:
+     return any(label.endswith("SEASON") and end <= marker_start for label, (_start, end) in zip(labels, offsets))
+
+
+ def token_indices_for_span(offsets: Sequence[Tuple[int, int]], start: int, end: int) -> List[int]:
+     return [
+         idx for idx, (tok_start, tok_end) in enumerate(offsets)
+         if tok_start < end and tok_end > start
+     ]
+
+
+ def label_span(labels: List[str], indices: Sequence[int], entity: str) -> None:
+     previous_is_same_entity = bool(indices) and indices[0] > 0 and labels[indices[0] - 1].endswith(entity)
+     first = not previous_is_same_entity
+     for idx in indices:
+         labels[idx] = f"B-{entity}" if first else f"I-{entity}"
+         first = False
+
+
+ def label_span_if_changed(labels: List[str], indices: Sequence[int], entity: str) -> bool:
+     previous_is_same_entity = bool(indices) and indices[0] > 0 and labels[indices[0] - 1].endswith(entity)
+     first_label = f"I-{entity}" if previous_is_same_entity else f"B-{entity}"
+     expected = [first_label] + [f"I-{entity}"] * max(0, len(indices) - 1)
+     if [labels[idx] for idx in indices] == expected:
+         return False
+     label_span(labels, indices, entity)
+     return True
+
+
+ def safe_to_overwrite_meta(labels: Sequence[str], indices: Sequence[int]) -> bool:
+     if not indices:
+         return False
+     return not any(
+         labels[idx].endswith(("GROUP", "EPISODE", "SEASON"))
+         for idx in indices
+     )
+
+
+ def mark_adjacent_title_separators_o(
+     tokens: Sequence[str],
+     labels: List[str],
+     marker_indices: Sequence[int],
+ ) -> None:
+     if not marker_indices:
+         return
+
+     idx = marker_indices[0] - 1
+     while idx >= 0 and "".join(tokens[idx]).strip() == "" and labels[idx].endswith("TITLE"):
+         labels[idx] = "O"
+         idx -= 1
+
+     idx = marker_indices[-1] + 1
+     while idx < len(tokens) and tokens[idx] in SEPARATOR_CHARS and labels[idx].endswith("TITLE"):
+         labels[idx] = "O"
+         idx += 1
+
+
+ def first_episode_end(labels: Sequence[str], offsets: Sequence[Tuple[int, int]], text: str) -> int:
+     ends = [
+         end for label, (_start, end) in zip(labels, offsets)
+         if label.endswith("EPISODE")
+     ]
+     if ends:
+         return min(ends)
+     match = EPISODE_SPAN_RE.search(text)
+     return match.end() if match else 0
+
+
+ def bracket_content_spans(text: str) -> Iterable[Tuple[str, int, int, int, int]]:
+     for match in BRACKET_RE.finditer(text):
+         groups = match.groups()
+         group_index = next((idx for idx, value in enumerate(groups) if value is not None), None)
+         if group_index is None:
+             continue
+         inner = groups[group_index] or ""
+         # The opening delimiter is one code point in all supported bracket forms.
+         inner_start = match.start() + 1
+         inner_end = inner_start + len(inner)
+         yield inner.strip(), inner_start, inner_end, match.start(), match.end()
+
+
+ def repair_structural_meta_labels(
+     text: str,
+     tokens: Sequence[str],
+     labels: List[str],
+     offsets: Sequence[Tuple[int, int]],
+ ) -> List[LabelRepair]:
+     repairs: List[LabelRepair] = []
+     episode_end = first_episode_end(labels, offsets, text)
+
+     for clean, inner_start, inner_end, bracket_start, _bracket_end in bracket_content_spans(text):
+         if bracket_start < episode_end:
+             continue
+         if not clean:
+             continue
+
+         if SPECIAL_TAG_RE.fullmatch(clean):
+             indices = token_indices_for_span(offsets, inner_start, inner_end)
+             if safe_to_overwrite_meta(labels, indices) and label_span_if_changed(labels, indices, "SPECIAL"):
+                 repairs.append(LabelRepair("special", clean, 0, inner_start, inner_end))
+             continue
+
+         if SOURCE_TAG_RE.fullmatch(clean):
+             indices = token_indices_for_span(offsets, inner_start, inner_end)
+             if safe_to_overwrite_meta(labels, indices) and label_span_if_changed(labels, indices, "SOURCE"):
+                 repairs.append(LabelRepair("source", clean, 0, inner_start, inner_end))
+             continue
+
+         for match in RESOLUTION_RE.finditer(clean):
+             start = inner_start + match.start()
+             end = inner_start + match.end()
+             indices = token_indices_for_span(offsets, start, end)
+             if safe_to_overwrite_meta(labels, indices) and label_span_if_changed(labels, indices, "RESOLUTION"):
+                 repairs.append(LabelRepair("resolution", match.group(0), 0, start, end))
+
+         for match in SOURCE_RE.finditer(clean):
+             start = inner_start + match.start()
+             end = inner_start + match.end()
+             indices = token_indices_for_span(offsets, start, end)
+             if safe_to_overwrite_meta(labels, indices) and label_span_if_changed(labels, indices, "SOURCE"):
+                 repairs.append(LabelRepair("source", match.group(0), 0, start, end))
+
+     # Dot-separated WEB names often carry source/resolution after SxxEyy without
+     # brackets. Repair only after the episode span to avoid touching titles.
+     for pattern, entity in ((RESOLUTION_RE, "RESOLUTION"), (SOURCE_RE, "SOURCE")):
+         for match in pattern.finditer(text):
+             if match.start() < episode_end:
+                 continue
+             indices = token_indices_for_span(offsets, match.start(), match.end())
+             if safe_to_overwrite_meta(labels, indices) and label_span_if_changed(labels, indices, entity):
+                 repairs.append(LabelRepair(entity.lower(), match.group(0), 0, match.start(), match.end()))
+
+     return repairs
+
+
+ def repair_known_label_issues(
+     item: Dict,
+ ) -> Tuple[List[str], List[str], List[LabelRepair]]:
+     """
+     Repair known weak-label issues.
+
+     The repair is intentionally conservative:
+     - sequel markers must be immediately before an episode/special context;
+     - sequel marker spans must currently be part of TITLE/O, not group/meta;
+     - rows that already have a season before the marker are left alone;
+     - structural meta repairs only touch spans after the first episode.
+     """
+     source_tokens = [str(token) for token in item.get("tokens", [])]
+     source_labels = [str(label) for label in item.get("labels", [])]
+     if len(source_tokens) != len(source_labels):
+         return source_tokens, source_labels, []
+
+     filename = str(item.get("filename") or "")
+     text = filename if filename else "".join(source_tokens)
+     offsets = token_offsets_in_text(text, source_tokens)
+     if offsets is None:
+         text = "".join(source_tokens)
+         offsets = token_offsets_in_text(text, source_tokens)
+     if offsets is None:
+         return source_tokens, source_labels, []
+
+     repaired_labels = list(source_labels)
+     applied: List[LabelRepair] = []
+
+     quick_text = text.lower()
+     has_sequel_marker_hint = any(
+         needle in text or needle in quick_text
+         for needle in (
+             " II", " III", " IV", " V", " VI", " VII", " VIII", " IX",
+             "Ⅱ", "Ⅲ", "Ⅳ", "Ⅴ", "Ⅵ", "Ⅶ", "Ⅷ", "Ⅸ",
+             "之章", "之期", "之季", "之部", "ノ章", "ノ期", "の章", "の期",
+             "貳", "贰", "弐", "弍", "參", "叁", "参", "肆", "陸", "陆",
+             "Ni ", " ni ", " no Sara", "Gakki",
+         )
+     )
+     if has_sequel_marker_hint:
+         for repair in find_sequel_season_markers(text):
+             if labels_have_season_before(repaired_labels, offsets, repair.start):
+                 continue
+             indices = token_indices_for_span(offsets, repair.start, repair.end)
+             if not indices:
+                 continue
+             existing = [repaired_labels[idx] for idx in indices]
+             if any(
+                 label.endswith(("GROUP", "EPISODE", "RESOLUTION", "SOURCE", "SPECIAL"))
+                 for label in existing
+             ):
+                 continue
+             if not any(label.endswith("TITLE") for label in existing):
+                 continue
+
+             label_span(repaired_labels, indices, "SEASON")
+             mark_adjacent_title_separators_o(source_tokens, repaired_labels, indices)
+             applied.append(repair)
+
+     applied.extend(repair_structural_meta_labels(text, source_tokens, repaired_labels, offsets))
+     return source_tokens, repaired_labels, applied
+
+
+ def repair_sequel_season_labels(
+     item: Dict,
+ ) -> Tuple[List[str], List[str], List[LabelRepair]]:
+     """Backward-compatible wrapper for callers that repair known label issues."""
+     return repair_known_label_issues(item)
+
+
+ def repair_jsonl_item(item: Dict) -> Tuple[Dict, List[LabelRepair]]:
+     tokens, labels, repairs = repair_known_label_issues(item)
+     labels = normalize_iob2(labels)
+     if not repairs:
+         if labels == item.get("labels", []):
+             return item, []
+         repaired = dict(item)
+         repaired["labels"] = labels
+         return repaired, []
+     repaired = dict(item)
+     repaired["tokens"] = tokens
+     repaired["labels"] = labels
+     return repaired, repairs
+
+
+ def normalize_iob2(labels: Sequence[str]) -> List[str]:
+     normalized: List[str] = []
+     previous_entity: Optional[str] = None
+     for label in labels:
+         if not label.startswith(("B-", "I-")):
+             normalized.append("O")
+             previous_entity = None
+             continue
+         entity = label.split("-", 1)[1]
+         prefix = "I" if previous_entity == entity else "B"
+         normalized.append(f"{prefix}-{entity}")
+         previous_entity = entity
+     return normalized
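The final step of `repair_jsonl_item` re-derives the `B-`/`I-` prefixes after labels have been overwritten. As a minimal sketch, the snippet below copies `normalize_iob2` from the diff above into a standalone script and runs it on a made-up label sequence. The `example` list is an illustrative assumption, not data from the repository.

```python
from typing import List, Optional, Sequence


def normalize_iob2(labels: Sequence[str]) -> List[str]:
    """Rewrite prefixes so every run of one entity starts with B- and continues with I-."""
    normalized: List[str] = []
    previous_entity: Optional[str] = None
    for label in labels:
        # Anything that is not B-/I- prefixed is treated as outside ("O").
        if not label.startswith(("B-", "I-")):
            normalized.append("O")
            previous_entity = None
            continue
        entity = label.split("-", 1)[1]
        # Continue the run only if the previous token had the same entity.
        prefix = "I" if previous_entity == entity else "B"
        normalized.append(f"{prefix}-{entity}")
        previous_entity = entity
    return normalized


# Hypothetical input: a dangling I-TITLE at the start and a B-SEASON
# directly after another SEASON token.
example = ["I-TITLE", "I-TITLE", "O", "I-SEASON", "B-SEASON", "B-EPISODE"]
result = normalize_iob2(example)
print(result)
```

Note that two adjacent spans of the same entity are merged into one, because a `B-` that directly follows the same entity is rewritten to `I-`. That appears to be what the repair code relies on when it relabels tokens next to an existing span.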
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:f2ad5fcbe0fe0e8ce563aa65347368f410e9825d998283e300a446ee2a921cf3
- size 15866796
+ oid sha256:697d7491b83ef615994e02f11f0f65362c400f5eb6b4be8f43f02435ad43173f
+ size 19142604
model/config.json DELETED
@@ -1,64 +0,0 @@
- {
-   "add_cross_attention": false,
-   "architectures": [
-     "BertForTokenClassification"
-   ],
-   "attention_probs_dropout_prob": 0.1,
-   "bos_token_id": null,
-   "classifier_dropout": null,
-   "dtype": "float32",
-   "eos_token_id": null,
-   "hidden_act": "gelu",
-   "hidden_dropout_prob": 0.1,
-   "hidden_size": 256,
-   "id2label": {
-     "0": "O",
-     "1": "B-TITLE",
-     "2": "I-TITLE",
-     "3": "B-SEASON",
-     "4": "I-SEASON",
-     "5": "B-EPISODE",
-     "6": "I-EPISODE",
-     "7": "B-SPECIAL",
-     "8": "I-SPECIAL",
-     "9": "B-GROUP",
-     "10": "I-GROUP",
-     "11": "B-RESOLUTION",
-     "12": "I-RESOLUTION",
-     "13": "B-SOURCE",
-     "14": "I-SOURCE"
-   },
-   "initializer_range": 0.02,
-   "intermediate_size": 1024,
-   "is_decoder": false,
-   "label2id": {
-     "B-EPISODE": 5,
-     "B-GROUP": 9,
-     "B-RESOLUTION": 11,
-     "B-SEASON": 3,
-     "B-SOURCE": 13,
-     "B-SPECIAL": 7,
-     "B-TITLE": 1,
-     "I-EPISODE": 6,
-     "I-GROUP": 10,
-     "I-RESOLUTION": 12,
-     "I-SEASON": 4,
-     "I-SOURCE": 14,
-     "I-SPECIAL": 8,
-     "I-TITLE": 2,
-     "O": 0
-   },
-   "layer_norm_eps": 1e-12,
-   "max_position_embeddings": 128,
-   "max_seq_length": 64,
-   "model_type": "bert",
-   "num_attention_heads": 8,
-   "num_hidden_layers": 4,
-   "pad_token_id": 0,
-   "tie_word_embeddings": true,
-   "tokenizer_variant": "regex",
-   "transformers_version": "5.8.1",
-   "type_vocab_size": 2,
-   "use_cache": false,
-   "vocab_size": 3000
- }
model/model.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:8213677836eed2c4e4f64f81ebeff58e6166c808aee158954055475cbf90601b
- size 15866796
model/tokenizer_config.json DELETED
@@ -1,44 +0,0 @@
- {
-   "added_tokens_decoder": {
-     "0": {
-       "content": "[PAD]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "1": {
-       "content": "[UNK]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "2": {
-       "content": "[CLS]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "3": {
-       "content": "[SEP]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     }
-   },
-   "backend": "custom",
-   "cls_token": "[CLS]",
-   "model_max_length": 1000000000000000019884624838656,
-   "pad_token": "[PAD]",
-   "sep_token": "[SEP]",
-   "tokenizer_class": "AnimeTokenizer",
-   "tokenizer_variant": "regex",
-   "unk_token": "[UNK]"
- }
model/training_args.bin DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:67f4980e6e5c8a3b151030042cae7449e798e3fc87518f33ed4d557e6fa17e41
- size 5265
model/vocab.json DELETED
The diff for this file is too large to render. See raw diff
 
parse_eval_metrics.json ADDED
@@ -0,0 +1,595 @@
+ {
+   "sample_count": 2048,
+   "field_accuracy": {
+     "group": 1.0,
+     "title": 0.99658203125,
+     "season": 0.994140625,
+     "episode": 0.99609375,
+     "resolution": 0.998046875,
+     "source": 0.99365234375,
+     "special": 0.998046875
+   },
+   "field_correct": {
+     "group": 2048,
+     "title": 2041,
+     "season": 2036,
+     "episode": 2040,
+     "resolution": 2044,
+     "source": 2035,
+     "special": 2044
+   },
+   "field_total": {
+     "group": 2048,
+     "title": 2048,
+     "season": 2048,
+     "episode": 2048,
+     "resolution": 2048,
+     "source": 2048,
+     "special": 2048
+   },
+   "full_match_accuracy": 0.98046875,
+   "full_match_correct": 2008,
+   "full_match_total": 2048,
+   "failures": [
+     {
+       "filename": "[DBD-Raws][Boruto Naruto Next Generations][menu][S13][D2][02][1080P][BDRip][HEVC-10bit][FLAC]",
+       "errors": {
+         "season": {
+           "gold": null,
+           "pred": "13"
+         }
+       },
+       "gold": {
+         "group": "DBD-Raws",
+         "title": "Boruto Naruto Next Generations",
+         "season": null,
+         "episode": 2,
+         "resolution": "1080P",
+         "source": "BDRip",
+         "special": null
+       },
+       "pred": {
+         "group": "DBD-Raws",
+         "title": "Boruto Naruto Next Generations",
+         "season": 13,
+         "episode": 2,
+         "resolution": "1080P",
+         "source": "BDRip",
+         "special": null
+       }
+     },
+     {
+       "filename": "[アニメ BD] ギャラクシーエンジェル 第1期(無印) 第09話「ロストテクノロジーのローストビーフ」(1424x1072 HEVC 10bit FLAC softSub(chi+eng) chap)",
+       "errors": {
+         "season": {
+           "gold": null,
+           "pred": "1"
+         }
+       },
+       "gold": {
+         "group": "アニメ BD",
+         "title": "ギャラクシーエンジェル 第1期(無印) 第09話「ロストテクノロジーのローストビーフ」",
+         "season": null,
+         "episode": 9,
+         "resolution": "1424x1072",
+         "source": "BD",
+         "special": null
+       },
+       "pred": {
+         "group": "アニメ BD",
+         "title": "ギャラクシーエンジェル 第1期(無印) 第09話「ロストテクノロジーのローストビーフ」",
+         "season": 1,
+         "episode": 9,
+         "resolution": "1424x1072",
+         "source": "BD",
+         "special": null
+       }
+     },
+     {
+       "filename": "コメットさん☆ 第11話 「バトンの力」(DVD DivX4.12 QB95 640x480 24f) [CRC32_C09E1AB0]",
+       "errors": {
+         "source": {
+           "gold": "cr",
+           "pred": "dvd"
+         }
+       },
+       "gold": {
+         "group": null,
+         "title": "コメットさん☆",
+         "season": null,
+         "episode": 11,
+         "resolution": "640x480",
+         "source": "CR",
+         "special": null
+       },
+       "pred": {
+         "group": null,
+         "title": "コメットさん☆",
+         "season": null,
+         "episode": 11,
+         "resolution": "640x480",
+         "source": "DVD",
+         "special": null
+       }
+     },
+     {
+       "filename": "[Kamigami&Mabors&VCB-Studio] Saenai Heroine no Sodatekata Flat [07][Ma10p_1080p][x265_2aac]",
+       "errors": {
+         "source": {
+           "gold": "aac",
+           "pred": "x265-2aac"
+         }
+       },
+       "gold": {
+         "group": "Kamigami&Mabors&VCB-Studio",
+         "title": "Saenai Heroine no Sodatekata Flat",
+         "season": null,
+         "episode": 7,
+         "resolution": "1080p",
+         "source": "aac",
+         "special": null
+       },
+       "pred": {
+         "group": "Kamigami&Mabors&VCB-Studio",
+         "title": "Saenai Heroine no Sodatekata Flat",
+         "season": null,
+         "episode": 7,
+         "resolution": "1080p",
+         "source": "x265_2aac",
+         "special": null
+       }
+     },
+     {
+       "filename": "[Liuyun&VCB-Studio] Hanasaku Iroha [07][Hi10p_1080p][x264_flac_ac3]",
+       "errors": {
+         "source": {
+           "gold": "flac",
+           "pred": "x264-flac"
+         }
+       },
+       "gold": {
+         "group": "Liuyun&VCB-Studio",
+         "title": "Hanasaku Iroha",
+         "season": null,
+         "episode": 7,
+         "resolution": "1080p",
+         "source": "flac",
+         "special": null
+       },
+       "pred": {
+         "group": "Liuyun&VCB-Studio",
+         "title": "Hanasaku Iroha",
+         "season": null,
+         "episode": 7,
+         "resolution": "1080p",
+         "source": "x264_flac",
+         "special": null
+       }
+     },
+     {
+       "filename": "小新外传4[EP02][2017.06.07]出动!妖怪克星",
+       "errors": {
+         "title": {
+           "gold": "小新外传4 ep02 2017 06",
+           "pred": "小新外传 ep02 2"
+         },
+         "season": {
+           "gold": null,
+           "pred": "4"
+         },
+         "episode": {
+           "gold": "7",
+           "pred": "2"
+         }
+       },
+       "gold": {
+         "group": null,
+         "title": "小新外传4 EP02 2017 06",
+         "season": null,
+         "episode": 7,
+         "resolution": null,
+         "source": null,
+         "special": null
+       },
+       "pred": {
+         "group": null,
+         "title": "小新外传 EP02 2",
+         "season": 4,
+         "episode": 2,
+         "resolution": null,
+         "source": null,
+         "special": null
+       }
+     },
+     {
+       "filename": "[GM-Team][国漫][异常生物见闻录][The Record of Unusual Creatures][2019][12][HEVC][GB][3840×2160]",
+       "errors": {
+         "resolution": {
+           "gold": "3840×2160",
+           "pred": "3840×"
+         }
+       },
+       "gold": {
+         "group": "GM-Team",
+         "title": "国漫",
+         "season": null,
+         "episode": 12,
+         "resolution": "3840×2160",
+         "source": "GB",
+         "special": null
+       },
+       "pred": {
+         "group": "GM-Team",
+         "title": "国漫",
+         "season": null,
+         "episode": 12,
+         "resolution": "3840×",
+         "source": "GB",
+         "special": null
+       }
+     },
+     {
+       "filename": "Ⅱ 116 第108次鐘聲已經敲過了嗎?",
+       "errors": {
+         "title": {
+           "gold": "ⅱ 116 第",
+           "pred": "第"
+         }
+       },
+       "gold": {
+         "group": null,
+         "title": "Ⅱ 116 第",
+         "season": null,
+         "episode": 116,
+         "resolution": null,
+         "source": null,
+         "special": null
+       },
+       "pred": {
+         "group": null,
+         "title": "第",
+         "season": null,
+         "episode": 116,
+         "resolution": null,
+         "source": null,
+         "special": null
+       }
+     },
+     {
+       "filename": "EP08 & EP11 NCED",
+       "errors": {
+         "title": {
+           "gold": "&",
+           "pred": "ep"
+         }
+       },
+       "gold": {
+         "group": null,
+         "title": "&",
+         "season": null,
+         "episode": 11,
+         "resolution": null,
+         "source": null,
+         "special": "NCED"
+       },
+       "pred": {
+         "group": null,
+         "title": "EP",
+         "season": null,
+         "episode": 11,
+         "resolution": null,
+         "source": null,
+         "special": "NCED"
+       }
+     },
+     {
+       "filename": "[S1YURICON] Necronomico no Cosmic Horror Show[06][1080p][WebRip][HEVC_AAC][CHS]",
+       "errors": {
+         "season": {
+           "gold": null,
+           "pred": "1"
+         }
+       },
+       "gold": {
+         "group": "S1YURICON",
+         "title": "Necronomico no Cosmic Horror Show",
+         "season": null,
+         "episode": 6,
+         "resolution": "1080p",
+         "source": "WebRip",
+         "special": null
+       },
+       "pred": {
+         "group": "S1YURICON",
+         "title": "Necronomico no Cosmic Horror Show",
+         "season": 1,
+         "episode": 6,
+         "resolution": "1080p",
+         "source": "WebRip",
+         "special": null
+       }
+     },
+     {
+       "filename": "[FZsub]Gate - Jieitai Kanochi nite, Kaku Tatakaeri 2 - 02(14) (MX 1280x720 x264 AAC)_x264",
+       "errors": {
+         "title": {
+           "gold": "gate - jieitai kanochi nite, kaku tatakaeri 2",
+           "pred": "gate - jieitai kanochi nite, kaku tatakaeri 2 - 02"
+         },
+         "season": {
+           "gold": "2",
+           "pred": null
+         }
+       },
+       "gold": {
+         "group": "FZsub",
+         "title": "Gate - Jieitai Kanochi nite, Kaku Tatakaeri 2",
+         "season": 2,
+         "episode": 14,
+         "resolution": "1280x720",
+         "source": "x264",
+         "special": null
+       },
+       "pred": {
+         "group": "FZsub",
+         "title": "Gate - Jieitai Kanochi nite, Kaku Tatakaeri 2 - 02",
+         "season": null,
+         "episode": 14,
+         "resolution": "1280x720",
+         "source": "x264",
+         "special": null
+       }
+     },
+     {
+       "filename": "Mobile Suit Gundam SEED Destiny - HD Remaster 2013 Anime Music Clip2 - Life Goes On [BD 1248x702 23.976fps AVC-yuv420p10 FLAC] v2 - yan04000985",
+       "errors": {
+         "episode": {
+           "gold": null,
+           "pred": "23"
+         }
+       },
+       "gold": {
+         "group": null,
+         "title": "Mobile Suit Gundam SEED Destiny - HD Remaster 2013 Anime Music Clip2 - Life Goes On",
+         "season": null,
+         "episode": null,
+         "resolution": "1248x702",
+         "source": "BD",
+         "special": null
+       },
+       "pred": {
+         "group": null,
+         "title": "Mobile Suit Gundam SEED Destiny - HD Remaster 2013 Anime Music Clip2 - Life Goes On",
+         "season": null,
+         "episode": 23,
+         "resolution": "1248x702",
+         "source": "BD",
+         "special": null
+       }
+     },
+     {
+       "filename": "Mary_E_Il_Giardino_Segreto_-_07_-_Camilla_[DvdMUX_by_Magic_©2008]",
+       "errors": {
+         "source": {
+           "gold": null,
+           "pred": "dvd"
+         }
+       },
+       "gold": {
+         "group": null,
+         "title": "Mary_E_Il_Giardino_Segreto",
+         "season": null,
+         "episode": 7,
+         "resolution": null,
+         "source": null,
+         "special": null
+       },
+       "pred": {
+         "group": null,
+         "title": "Mary_E_Il_Giardino_Segreto",
+         "season": null,
+         "episode": 7,
+         "resolution": null,
+         "source": "Dvd",
+         "special": null
+       }
+     },
+     {
+       "filename": "(アニメ) アイドル伝説えり子 第24話 「心をつなぐ輪舞曲」 (DVD 640x480DivX5.02QB93 48kHz128kbps)",
+       "errors": {
+         "resolution": {
+           "gold": null,
+           "pred": "640x480"
+         }
+       },
+       "gold": {
+         "group": "アニメ",
+         "title": "アイドル伝説えり子",
+         "season": null,
+         "episode": 24,
+         "resolution": null,
+         "source": "DVD",
+         "special": null
+       },
+       "pred": {
+         "group": "アニメ",
+         "title": "アイドル伝説えり子",
+         "season": null,
+         "episode": 24,
+         "resolution": "640x480",
+         "source": "DVD",
+         "special": null
+       }
+     },
+     {
+       "filename": "[DMG] 東京レイヴンズ 第06話「days in nest -休日-」 [BDRip][AVC_AAC][720P][CHS](A8161323)",
+       "errors": {
+         "episode": {
+           "gold": "1323",
+           "pred": "6"
+         }
+       },
+       "gold": {
+         "group": "DMG",
+         "title": "東京レイヴンズ 第06話「days in nest -休日-」",
+         "season": null,
+         "episode": 1323,
+         "resolution": "720P",
+         "source": "BDRip",
+         "special": null
+       },
+       "pred": {
+         "group": "DMG",
+         "title": "東京レイヴンズ 第06話「days in nest -休日-」",
+         "season": null,
+         "episode": 6,
+         "resolution": "720P",
+         "source": "BDRip",
+         "special": null
+       }
+     },
+     {
+       "filename": "[S1YURICON] Necronomico no Cosmic Horror Show[05v2][1080p][WebRip][AVC_AAC][CHS]",
+       "errors": {
+         "season": {
+           "gold": null,
+           "pred": "1"
+         }
+       },
+       "gold": {
+         "group": "S1YURICON",
+         "title": "Necronomico no Cosmic Horror Show",
+         "season": null,
+         "episode": 5,
+         "resolution": "1080p",
+         "source": "WebRip",
+         "special": null
+       },
+       "pred": {
+         "group": "S1YURICON",
+         "title": "Necronomico no Cosmic Horror Show",
+         "season": 1,
+         "episode": 5,
+         "resolution": "1080p",
+         "source": "WebRip",
+         "special": null
+       }
+     },
+     {
+       "filename": "Cardcaptor Sakura - 17 [x264-AAC-BD1440x1080p][Sakura][C-W][E2B50799]",
+       "errors": {
+         "resolution": {
+           "gold": null,
+           "pred": "1080p"
+         },
+         "source": {
+           "gold": null,
+           "pred": "e2b50799"
+         }
+       },
+       "gold": {
+         "group": null,
+         "title": "Cardcaptor Sakura",
+         "season": null,
+         "episode": 17,
+         "resolution": null,
+         "source": null,
+         "special": null
+       },
+       "pred": {
+         "group": null,
+         "title": "Cardcaptor Sakura",
+         "season": null,
+         "episode": 17,
+         "resolution": "1080p",
+         "source": "E2B50799",
+         "special": null
+       }
+     },
+     {
+       "filename": "[Xspitfire911] Tate no Yuusha no Nariagari S01E20 BDRIP 1080p X265 10bit VOSTFR",
+       "errors": {
+         "season": {
+           "gold": null,
+           "pred": "1"
+         }
+       },
+       "gold": {
+         "group": "Xspitfire911",
+         "title": "Tate no Yuusha no Nariagari",
+         "season": null,
+         "episode": 20,
+         "resolution": "1080p",
+         "source": "BDRIP",
+         "special": null
+       },
+       "pred": {
+         "group": "Xspitfire911",
+         "title": "Tate no Yuusha no Nariagari",
+         "season": 1,
+         "episode": 20,
+         "resolution": "1080p",
+         "source": "BDRIP",
+         "special": null
+       }
+     },
+     {
+       "filename": "[KTXP][Dungeon ni Deai wo Motomeru no wa Machigatteiru Darou ka IV][13][BIG5][720P][MP4]",
+       "errors": {
+         "title": {
+           "gold": "dungeon ni deai wo motomeru no wa machigatteiru darou ka",
+           "pred": "dungeon ni deai wo motomeru no wa machigatteiru darou ka iv"
+         },
+         "season": {
+           "gold": "4",
+           "pred": null
+         }
+       },
+       "gold": {
+         "group": "KTXP",
+         "title": "Dungeon ni Deai wo Motomeru no wa Machigatteiru Darou ka",
+         "season": 4,
+         "episode": 13,
+         "resolution": "720P",
+         "source": "BIG5",
555
+ "special": null
556
+ },
557
+ "pred": {
558
+ "group": "KTXP",
559
+ "title": "Dungeon ni Deai wo Motomeru no wa Machigatteiru Darou ka IV",
560
+ "season": null,
561
+ "episode": 13,
562
+ "resolution": "720P",
563
+ "source": "BIG5",
564
+ "special": null
565
+ }
566
+ },
567
+ {
568
+ "filename": "[JyFanSub][Fate_Apocrypha][15][GB][1080]p",
569
+ "errors": {
570
+ "episode": {
571
+ "gold": "1080",
572
+ "pred": "15"
573
+ }
574
+ },
575
+ "gold": {
576
+ "group": "JyFanSub",
577
+ "title": "Fate_Apocrypha",
578
+ "season": null,
579
+ "episode": 1080,
580
+ "resolution": null,
581
+ "source": "GB",
582
+ "special": null
583
+ },
584
+ "pred": {
585
+ "group": "JyFanSub",
586
+ "title": "Fate_Apocrypha",
587
+ "season": null,
588
+ "episode": 15,
589
+ "resolution": null,
590
+ "source": "GB",
591
+ "special": null
592
+ }
593
+ }
594
+ ]
595
+ }
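The `errors` entries in the listing above hold normalized strings, which is why the KTXP title error appears lowercased and why episode values show as `"1323"` rather than `1323`. A minimal sketch of that comparison, reusing the body of the `normalize_field_value` helper that this commit adds to train.py; the sample inputs are taken from the failures above, and the two summary variables are only illustrative:

```python
from typing import Optional


def normalize_field_value(field: str, value) -> Optional[str]:
    """Normalize a parsed field before exact-match comparison."""
    if value is None:
        return None
    if field in {"episode", "season"}:
        # Numeric fields compare as canonical integer strings ("06" -> "6").
        try:
            return str(int(value))
        except (TypeError, ValueError):
            return str(value).strip().lower()
    text = str(value).strip()
    if field in {"resolution", "source"}:
        return text.lower().replace("_", "-")
    # Free-text fields: lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


# Case differences alone are not counted as errors.
same_source = normalize_field_value("source", "Dvd") == normalize_field_value("source", "DVD")
# A missing season never equals a predicted one.
season_mismatch = normalize_field_value("season", None) != normalize_field_value("season", 1)
print(same_source, season_mismatch)
```

Under this normalization, the `season` failures above (gold `null`, prediction `1`) are genuine mismatches, while a `Dvd` versus `DVD` difference would not be reported.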
pyproject.toml ADDED
@@ -0,0 +1,36 @@
+[project]
+name = "anifilebert"
+version = "0.1.0"
+description = "Tiny BERT token-classification model and tooling for parsing anime release filenames."
+readme = "README.md"
+requires-python = ">=3.11"
+license = { text = "Apache-2.0" }
+dependencies = [
+    "accelerate==1.13.0",
+    "datasets==4.8.5",
+    "numpy==2.4.5",
+    "onnx==1.21.0",
+    "onnxruntime==1.26.0",
+    "onnxscript==0.7.0",
+    "seqeval==1.2.2",
+    "tensorboard>=2.14.0",
+    "torch==2.12.0+cu126",
+    "transformers==5.8.1",
+]
+
+[project.urls]
+Repository = "https://huggingface.co/ModerRAS/AniFileBERT"
+
+[tool.uv]
+package = false
+environments = ["sys_platform == 'win32'"]
+
+[tool.uv.sources]
+torch = [
+    { index = "pytorch-cu126", marker = "platform_system == 'Windows'" },
+]
+
+[[tool.uv.index]]
+name = "pytorch-cu126"
+url = "https://download.pytorch.org/whl/cu126"
+explicit = true
relabel_dataset_from_filenames.py ADDED
@@ -0,0 +1,157 @@
+"""Rebuild AnimeName weak labels from each stored filename."""
+
+from __future__ import annotations
+
+import argparse
+import json
+from collections import Counter
+from datetime import datetime, timezone
+from pathlib import Path
+from statistics import mean
+from typing import Iterable
+
+from dmhy_dataset import weak_label_filename
+from label_repairs import repair_jsonl_item
+from tokenizer import AnimeTokenizer
+
+
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Relabel a JSONL dataset from filename strings")
+    parser.add_argument("--input", required=True, help="Input JSONL containing filename fields")
+    parser.add_argument("--output", required=True, help="Output relabeled regex-token JSONL")
+    parser.add_argument("--manifest-output", default=None, help="Relabel manifest JSON")
+    parser.add_argument("--vocab-output", default=None, help="Optional regex vocab JSON")
+    parser.add_argument("--base-vocab", default=None, help="Optional regex vocab whose IDs should be preserved")
+    parser.add_argument("--max-vocab-size", type=int, default=3000)
+    parser.add_argument("--limit", type=int, default=None)
+    parser.add_argument("--progress", type=int, default=50000)
+    parser.add_argument("--example-count", type=int, default=20)
+    return parser.parse_args()
+
+
+def iter_jsonl(path: Path) -> Iterable[dict]:
+    with path.open("r", encoding="utf-8") as handle:
+        for line_no, line in enumerate(handle, 1):
+            line = line.strip()
+            if not line:
+                continue
+            try:
+                yield json.loads(line)
+            except json.JSONDecodeError as exc:
+                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc
+
+
+def length_stats(values: list[int]) -> dict:
+    if not values:
+        return {"min": 0, "mean": 0, "p50": 0, "p90": 0, "p95": 0, "p99": 0, "max": 0}
+    ordered = sorted(values)
+
+    def percentile(pct: float) -> int:
+        index = min(len(ordered) - 1, round((pct / 100) * (len(ordered) - 1)))
+        return ordered[index]
+
+    return {
+        "min": min(values),
+        "mean": mean(values),
+        "p50": percentile(50),
+        "p90": percentile(90),
+        "p95": percentile(95),
+        "p99": percentile(99),
+        "max": max(values),
+    }
+
+
+def main() -> None:
+    args = parse_args()
+    input_path = Path(args.input)
+    output_path = Path(args.output)
+    manifest_path = Path(args.manifest_output) if args.manifest_output else output_path.with_suffix(".manifest.json")
+    vocab_path = Path(args.vocab_output) if args.vocab_output else None
+
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    manifest_path.parent.mkdir(parents=True, exist_ok=True)
+    if vocab_path:
+        vocab_path.parent.mkdir(parents=True, exist_ok=True)
+
+    tokenizer = AnimeTokenizer()
+    rows_in = 0
+    rows_written = 0
+    rows_failed = 0
+    rows_repaired_after_relabel = 0
+    label_counter: Counter[str] = Counter()
+    failure_counter: Counter[str] = Counter()
+    token_lists: list[list[str]] = []
+    lengths: list[int] = []
+    examples: list[dict] = []
+    failures: list[dict] = []
+
+    with output_path.open("w", encoding="utf-8", newline="\n") as out:
+        for item in iter_jsonl(input_path):
+            rows_in += 1
+            filename = item.get("filename")
+            if not filename:
+                rows_failed += 1
+                failure_counter["missing_filename"] += 1
+                continue
+            sample = weak_label_filename(str(filename), tokenizer)
+            if sample is None:
+                rows_failed += 1
+                failure_counter["weak_label_failed"] += 1
+                if len(failures) < args.example_count:
+                    failures.append({"file_id": item.get("file_id"), "filename": filename})
+                continue
+            record = dict(item)
+            record.pop("tokenizer_variant", None)
+            record.pop("source_token_count", None)
+            record.pop("char_token_count", None)
+            record["tokens"] = sample["tokens"]
+            record["labels"] = sample["labels"]
+
+            repaired, repairs = repair_jsonl_item(record)
+            if repairs:
+                rows_repaired_after_relabel += 1
+                record = repaired
+
+            out.write(json.dumps(record, ensure_ascii=False, separators=(",", ":")) + "\n")
+            rows_written += 1
+            label_counter.update(record["labels"])
+            token_lists.append(record["tokens"])
+            lengths.append(len(record["tokens"]))
+            if len(examples) < args.example_count:
+                examples.append(record)
+
+            if args.limit is not None and rows_written >= args.limit:
+                break
+            if args.progress and rows_written % args.progress == 0:
+                print(f"relabeled {rows_written:,} rows; failed={rows_failed:,}")
+
+    base_vocab = None
+    if args.base_vocab:
+        with Path(args.base_vocab).open("r", encoding="utf-8") as handle:
+            base_vocab = json.load(handle)
+    tokenizer.build_vocab(token_lists, max_size=args.max_vocab_size, base_vocab=base_vocab)
+    if vocab_path:
+        vocab_path.write_text(json.dumps(tokenizer.get_vocab(), ensure_ascii=False, indent=2) + "\n", encoding="utf-8")
+
+    manifest = {
+        "created_at": datetime.now(timezone.utc).isoformat(),
+        "input": str(input_path),
+        "output": str(output_path),
+        "vocab_output": str(vocab_path) if vocab_path else None,
+        "row_count": rows_written,
+        "input_rows": rows_in,
+        "failed_rows": rows_failed,
+        "repaired_after_relabel_rows": rows_repaired_after_relabel,
+        "failure_counts": dict(failure_counter),
+        "label_counts": dict(label_counter),
+        "token_length": length_stats(lengths),
+        "vocab_size": tokenizer.vocab_size,
+        "examples": examples,
+        "failures": failures,
+    }
+    manifest_path.write_text(json.dumps(manifest, ensure_ascii=False, indent=2) + "\n", encoding="utf-8")
+    print(json.dumps({k: v for k, v in manifest.items() if k not in {"examples", "failures"}}, ensure_ascii=False, indent=2))
+
+
+if __name__ == "__main__":
+    main()
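`length_stats` in the script above picks percentiles by rounding a fractional position into the sorted list rather than interpolating, so every reported value is a token length that actually occurs in the data. A small sketch of the same index arithmetic follows; the standalone `percentile` signature and the sample lengths are illustrative assumptions, not part of the commit:

```python
def percentile(ordered: list[int], pct: float) -> int:
    """Return the element at the rounded rank position, as length_stats does."""
    index = min(len(ordered) - 1, round((pct / 100) * (len(ordered) - 1)))
    return ordered[index]


# Hypothetical token counts, sorted ascending as length_stats requires.
lengths = sorted([12, 18, 21, 25, 30, 34, 41, 47, 52, 90])
summary = {f"p{p}": percentile(lengths, p) for p in (50, 90, 99)}
print(summary)
```

Python's `round` resolves exact ties to the nearest even integer, so with ten values the p50 position 4.5 rounds down to index 4 rather than up to 5. For large datasets the difference is negligible, but it can matter when comparing manifests built from small samples.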
repair_dataset_labels.py ADDED
@@ -0,0 +1,103 @@
+"""Repair known weak-label mistakes in exported AnimeName JSONL datasets."""
+
+from __future__ import annotations
+
+import argparse
+import json
+from collections import Counter, defaultdict
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import Dict, List
+
+from label_repairs import LabelRepair, repair_jsonl_item
+
+
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Repair weak BIO labels in a JSONL dataset")
+    parser.add_argument("--input", required=True, help="Input JSONL")
+    parser.add_argument("--output", required=True, help="Output repaired JSONL")
+    parser.add_argument("--manifest-output", default=None, help="Optional repair manifest JSON")
+    parser.add_argument("--dry-run", action="store_true", help="Scan only; do not write output JSONL")
+    parser.add_argument("--example-limit", type=int, default=40)
+    return parser.parse_args()
+
+
+def repair_key(repair: LabelRepair) -> str:
+    return f"{repair.kind}:{repair.marker}"
+
+
+def main() -> None:
+    args = parse_args()
+    input_path = Path(args.input)
+    output_path = Path(args.output)
+    manifest_path = Path(args.manifest_output) if args.manifest_output else output_path.with_suffix(".manifest.json")
+
+    counts: Counter[str] = Counter()
+    marker_counts: Counter[str] = Counter()
+    examples: Dict[str, List[dict]] = defaultdict(list)
+    label_counts: Counter[str] = Counter()
+    row_count = 0
+    repaired_rows = 0
+
+    output_handle = None
+    if not args.dry_run:
+        output_path.parent.mkdir(parents=True, exist_ok=True)
+        output_handle = output_path.open("w", encoding="utf-8")
+
+    try:
+        with input_path.open("r", encoding="utf-8") as handle:
+            for line in handle:
+                line = line.strip()
+                if not line:
+                    continue
+                row_count += 1
+                item = json.loads(line)
+                repaired, repairs = repair_jsonl_item(item)
+                if repairs:
+                    repaired_rows += 1
+                    for repair in repairs:
+                        key = repair_key(repair)
+                        counts[repair.kind] += 1
+                        marker_counts[key] += 1
+                        if len(examples[key]) < args.example_limit:
+                            examples[key].append(
+                                {
+                                    "file_id": item.get("file_id"),
+                                    "filename": item.get("filename"),
+                                    "marker": repair.marker,
+                                    "value": repair.value,
+                                    "span": [repair.start, repair.end],
+                                }
+                            )
+                label_counts.update(repaired.get("labels", []))
+                if output_handle is not None:
+                    output_handle.write(json.dumps(repaired, ensure_ascii=False, separators=(",", ":")) + "\n")
+    finally:
+        if output_handle is not None:
+            output_handle.close()
+
+    manifest = {
+        "created_at": datetime.now(timezone.utc).isoformat(),
+        "input": str(input_path),
+        "output": None if args.dry_run else str(output_path),
+        "dry_run": args.dry_run,
+        "row_count": row_count,
+        "repaired_rows": repaired_rows,
+        "repair_counts": dict(counts),
+        "marker_counts": dict(marker_counts),
+        "label_counts": dict(label_counts),
+        "examples": examples,
+    }
+    manifest_path.parent.mkdir(parents=True, exist_ok=True)
+    manifest_path.write_text(json.dumps(manifest, ensure_ascii=False, indent=2), encoding="utf-8")
+    print(json.dumps({
+        "row_count": row_count,
+        "repaired_rows": repaired_rows,
+        "repair_counts": dict(counts),
+        "manifest": str(manifest_path),
+        "output": None if args.dry_run else str(output_path),
+    }, ensure_ascii=False, indent=2))
+
+
+if __name__ == "__main__":
+    main()
requirements.txt CHANGED
@@ -1,10 +1,12 @@
-torch>=2.0.0
-transformers>=4.30.0
-datasets>=2.12.0
-accelerate>=1.1.0
-seqeval>=1.2.2
-numpy>=1.24.0
-tqdm>=4.65.0
-onnx>=1.16.0
-onnxruntime>=1.18.0
-onnxscript>=0.1.0
+--extra-index-url https://download.pytorch.org/whl/cu126
+
+accelerate==1.13.0
+datasets==4.8.5
+numpy==2.4.5
+onnx==1.21.0
+onnxruntime==1.26.0
+onnxscript==0.7.0
+seqeval==1.2.2
+tensorboard>=2.14.0
+torch==2.12.0+cu126
+transformers==5.8.1
run_metadata.json ADDED
@@ -0,0 +1,23 @@
+{
+  "experiment_name": "dmhy-char-full-relabel",
+  "data_file": "datasets/AnimeName/dmhy_weak_char.jsonl",
+  "tokenizer_variant": "char",
+  "vocab_file": "datasets/AnimeName/vocab.char.json",
+  "vocab_size": 6199,
+  "max_seq_length": 128,
+  "hidden_size": 256,
+  "num_hidden_layers": 4,
+  "num_attention_heads": 8,
+  "intermediate_size": 1024,
+  "train_samples": 619361,
+  "eval_samples": 12641,
+  "epochs": 2.0,
+  "batch_size": 256,
+  "learning_rate": 8e-05,
+  "warmup_steps": 300,
+  "seed": 48,
+  "device": "cuda",
+  "fp16": true,
+  "gradient_accumulation_steps": 1,
+  "dataloader_num_workers": 4
+}
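The values in run_metadata.json imply the approximate length of the run. A rough sketch of that arithmetic, assuming a single device and that the final partial batch is kept (neither is stated in the file); the 0.98 split ratio is likewise an inference from the two sample counts, not a recorded setting:

```python
import math

# Values copied from run_metadata.json.
train_samples = 619_361
eval_samples = 12_641
batch_size = 256
grad_accum = 1
epochs = 2
warmup_steps = 300

# Optimizer steps per epoch, assuming one device and no dropped partial batch.
steps_per_epoch = math.ceil(train_samples / (batch_size * grad_accum))
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

# The sample counts are consistent with a 0.98 train split (an inference).
total_samples = train_samples + eval_samples
implied_train = int(total_samples * 0.98)

print(steps_per_epoch, total_steps, round(warmup_fraction, 3), implied_train)
```

Under these assumptions the 300 warmup steps cover roughly 6% of the schedule, which is a useful sanity check when changing `batch_size` or `epochs`.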
tokenizer.py CHANGED
@@ -45,9 +45,9 @@ class AnimeTokenizer(PreTrainedTokenizer):
     # Layer 2: Individual format token patterns
     FORMAT_PATTERNS: List[str] = [
         # Resolution
-        r'\d{3,4}[pP]',
-        r'\d{3,4}[xX×]\d{3,4}',
-        r'\d[Kk]',
+        r'(?<![A-Za-z0-9])\d{3,4}[pP](?![A-Za-z0-9])',
+        r'(?<![A-Za-z0-9])\d{3,4}[xX×]\d{3,4}(?![A-Za-z0-9])',
+        r'(?<![A-Za-z0-9])\d[Kk](?![A-Za-z0-9])',

         # Codec
         r'[xX]26[45]',
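The added lookarounds stop a resolution pattern from matching inside a longer alphanumeric run, such as the `BD1440x1080p` tag in the Cardcaptor Sakura filename listed among the parse failures. A quick sketch comparing the old and new `\d{3,4}[pP]` patterns with Python's `re` module; the `find` helper and the third sample string are illustrative, not part of the tokenizer:

```python
import re

OLD_RES = re.compile(r'\d{3,4}[pP]')
NEW_RES = re.compile(r'(?<![A-Za-z0-9])\d{3,4}[pP](?![A-Za-z0-9])')


def find(pattern: re.Pattern, text: str):
    """Return the first matched substring, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(0) if match else None


for text in ("[1080p]", "x264-AAC-BD1440x1080p", "[A1080P9F]"):
    # Columns: input, old-pattern match, new-pattern match.
    print(text, find(OLD_RES, text), find(NEW_RES, text))
```

Both patterns accept a standalone `[1080p]` tag. Only the old pattern fires on `1080p` glued to `x` in `BD1440x1080p`, or on digits embedded in a hash-like token.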
tokenizer_config.json CHANGED
@@ -38,7 +38,7 @@
   "model_max_length": 1000000000000000019884624838656,
   "pad_token": "[PAD]",
   "sep_token": "[SEP]",
-  "tokenizer_class": "AnimeTokenizer",
-  "tokenizer_variant": "regex",
+  "tokenizer_class": "CharAnimeTokenizer",
+  "tokenizer_variant": "char",
   "unk_token": "[UNK]"
 }
train.py CHANGED
@@ -14,6 +14,7 @@ import json
 import tempfile
 import argparse
 import random
 from typing import Dict, List, Optional

 import numpy as np
@@ -29,7 +30,8 @@ from seqeval.metrics import classification_report, accuracy_score, f1_score, pre
 from config import Config
 from tokenizer import AnimeTokenizer, create_tokenizer, load_tokenizer
 from model import create_model, print_model_summary, count_parameters
-from dataset import AnimeDataset, align_tokens_for_tokenizer


 def compute_metrics(p):
@@ -88,10 +90,27 @@ def parse_args() -> argparse.Namespace:
                         help="Save resumable checkpoints every N steps instead of only at epoch end")
     parser.add_argument("--save-total-limit", type=int, default=2,
                         help="Maximum number of checkpoints to keep")
     parser.add_argument("--cpu", action="store_true", help="Force CPU training")
     parser.add_argument("--no-shuffle", action="store_true", help="Do not shuffle before train/eval split")
     parser.add_argument("--resume-from-checkpoint", default=None,
                         help="Resume Trainer state from a checkpoint directory, or 'auto' for the latest checkpoint")
     return parser.parse_args()


@@ -172,6 +191,118 @@ def validate_dataset_tokenizer_metadata(data: List[Dict], tokenizer_variant: str
         )


 def remap_token_embeddings(
     model: BertForTokenClassification,
     old_vocab: Dict[str, int],
@@ -220,7 +351,7 @@ def build_vocab_from_data(data: List[Dict], tokenizer: AnimeTokenizer, vocab_pat
                           max_size: Optional[int] = None) -> None:
     token_lists: List[List[str]] = []
     for item in data:
-        tokens, labels = align_tokens_for_tokenizer(item["tokens"], item["labels"], tokenizer)
         token_lists.append(tokens)

     tokenizer.build_vocab(token_lists, max_size=max_size)
@@ -250,20 +381,35 @@ def main():
         config.warmup_steps = args.warmup_steps
     if args.train_split is not None:
         config.train_split = args.train_split
     if args.max_seq_length is not None:
         config.max_seq_length = args.max_seq_length
     elif tokenizer_variant == "char":
         config.max_seq_length = max(config.max_seq_length, 128)

     random.seed(args.seed)
     np.random.seed(args.seed)
     torch.manual_seed(args.seed)

     print("Loading dataset...")
-    with open(config.data_file, 'r', encoding='utf-8') as f:
-        all_data = [json.loads(line) for line in f if line.strip()]
-    if args.limit_samples is not None:
-        all_data = all_data[:args.limit_samples]
     if not args.no_shuffle:
         random.shuffle(all_data)
     validate_dataset_tokenizer_metadata(all_data, tokenizer_variant)
@@ -280,6 +426,9 @@ def main():
     print(f" Variant: {tokenizer_variant}")
     print(f" Vocab size: {tokenizer.vocab_size}")
     print(f" Max sequence length: {config.max_seq_length}")

     # Update config with actual vocab size
     config.vocab_size = tokenizer.vocab_size
@@ -288,15 +437,22 @@ def main():
     if args.init_model_dir:
         print(f"Loading model for fine-tuning: {args.init_model_dir}")
         model = BertForTokenClassification.from_pretrained(args.init_model_dir)
-        init_tokenizer = load_tokenizer(args.init_model_dir)
         init_variant = getattr(init_tokenizer, "tokenizer_variant", None)
         if init_variant != tokenizer_variant:
             print(f" WARNING: tokenizer variant changes during fine-tune: {init_variant} -> {tokenizer_variant}")
             print(" Token embeddings will be remapped by token string; unmatched tokens are newly initialized.")
-        if model.config.vocab_size != config.vocab_size or init_tokenizer.get_vocab() != tokenizer.get_vocab():
             copied = remap_token_embeddings(
                 model=model,
-                old_vocab=init_tokenizer.get_vocab(),
                 new_vocab=tokenizer.get_vocab(),
                 pad_token_id=tokenizer.pad_token_id,
             )
@@ -316,6 +472,7 @@ def main():
         print("WARNING: Model exceeds the historical 5M target; continuing because vocab size is configurable.")

     split_idx = int(len(all_data) * config.train_split)
     train_data = all_data[:split_idx]
     eval_data = all_data[split_idx:]

@@ -350,8 +507,7 @@ def main():
     use_cpu = args.cpu or not torch.cuda.is_available()
     use_fp16 = not use_cpu
     print(f" Device: {'CPU' if use_cpu else 'CUDA'}")
-    save_strategy = "steps" if args.checkpoint_steps else "epoch"
-    load_best_model_at_end = args.checkpoint_steps is None

     # Training arguments
     training_args = TrainingArguments(
@@ -359,20 +515,23 @@ def main():
         num_train_epochs=config.num_epochs,
         per_device_train_batch_size=config.batch_size,
         per_device_eval_batch_size=config.batch_size,
-        eval_strategy="epoch",
-        save_strategy=save_strategy,
         save_steps=args.checkpoint_steps,
         logging_steps=config.log_interval,
         learning_rate=config.learning_rate,
         weight_decay=config.weight_decay,
         warmup_steps=config.warmup_steps,
         use_cpu=use_cpu,
-        report_to="none",
         save_total_limit=args.save_total_limit,
-        load_best_model_at_end=load_best_model_at_end,
         metric_for_best_model="f1",
         greater_is_better=True,
         dataloader_num_workers=config.num_workers,
         fp16=use_fp16,
     )

@@ -410,6 +569,31 @@ def main():
     final_save_path = os.path.join(config.save_dir, "final")
     trainer.save_model(final_save_path)
     tokenizer.save_pretrained(final_save_path)
     print(f"Model saved to: {final_save_path}")

     # Final evaluation
@@ -417,6 +601,30 @@ def main():
     eval_results = trainer.evaluate()
     for key, value in eval_results.items():
         print(f" {key}: {value:.4f}")


 if __name__ == "__main__":
 
 import tempfile
 import argparse
 import random
+from collections import Counter
 from typing import Dict, List, Optional

 import numpy as np
 
 from config import Config
 from tokenizer import AnimeTokenizer, create_tokenizer, load_tokenizer
 from model import create_model, print_model_summary, count_parameters
+from dataset import AnimeDataset, labels_for_tokenizer
+from inference import parse_filename, postprocess


 def compute_metrics(p):
 
                         help="Save resumable checkpoints every N steps instead of only at epoch end")
     parser.add_argument("--save-total-limit", type=int, default=2,
                         help="Maximum number of checkpoints to keep")
+    parser.add_argument("--gradient-accumulation-steps", type=int, default=1,
+                        help="Accumulate gradients across this many steps")
+    parser.add_argument("--num-workers", type=int, default=None,
+                        help="DataLoader worker count. Defaults to config.num_workers")
     parser.add_argument("--cpu", action="store_true", help="Force CPU training")
     parser.add_argument("--no-shuffle", action="store_true", help="Do not shuffle before train/eval split")
     parser.add_argument("--resume-from-checkpoint", default=None,
                         help="Resume Trainer state from a checkpoint directory, or 'auto' for the latest checkpoint")
+    parser.add_argument("--tensorboard", dest="tensorboard", action="store_true",
+                        help="Log metrics to TensorBoard in addition to stdout/checkpoints")
+    parser.add_argument("--no-tensorboard", dest="tensorboard", action="store_false",
+                        help="Disable TensorBoard logging")
+    parser.add_argument("--experiment-name", default=None,
+                        help="Optional experiment name written to run_metadata.json")
+    parser.add_argument("--parse-eval-limit", type=int, default=512,
+                        help="Run field exact-match evaluation on up to N eval samples after training; 0 disables it")
+    parser.add_argument("--hidden-size", type=int, default=None, help="Override BERT hidden size")
+    parser.add_argument("--num-hidden-layers", type=int, default=None, help="Override BERT layer count")
+    parser.add_argument("--num-attention-heads", type=int, default=None, help="Override BERT attention heads")
+    parser.add_argument("--intermediate-size", type=int, default=None, help="Override BERT FFN intermediate size")
+    parser.set_defaults(tensorboard=True)
     return parser.parse_args()


 
         )


+def load_jsonl(data_file: str, limit: Optional[int] = None) -> List[Dict]:
+    """Load JSONL rows, stopping early for smoke runs."""
+    data: List[Dict] = []
+    with open(data_file, "r", encoding="utf-8") as f:
+        for line in f:
+            line = line.strip()
+            if not line:
+                continue
+            data.append(json.loads(line))
+            if limit is not None and len(data) >= limit:
+                break
+    return data
+
+
+def normalize_field_value(field: str, value) -> Optional[str]:
+    if value is None:
+        return None
+    if field in {"episode", "season"}:
+        try:
+            return str(int(value))
+        except (TypeError, ValueError):
+            return str(value).strip().lower()
+    text = str(value).strip()
+    if field in {"resolution", "source"}:
+        return text.lower().replace("_", "-")
+    return " ".join(text.lower().split())
+
+
+def parse_exact_metrics(
+    samples: List[Dict],
+    model: BertForTokenClassification,
+    tokenizer: AnimeTokenizer,
+    id2label: Dict[int, str],
+    max_length: int,
+    limit: Optional[int],
+) -> Dict:
+    """Evaluate end-to-end field exact match on filenames, not just token loss."""
+    fields = ["group", "title", "season", "episode", "resolution", "source", "special"]
+    selected = [sample for sample in samples if sample.get("filename")]
+    if limit is not None and limit > 0:
+        selected = selected[:limit]
+
+    counter: Counter = Counter()
+    failures: List[Dict] = []
+    model.eval()
+
+    for sample in selected:
+        filename = sample["filename"]
+        tokens, gold_labels = labels_for_tokenizer(sample, tokenizer)
+        available = max(0, max_length - 2)
+        tokens = tokens[:available]
+        gold_labels = gold_labels[:available]
+        gold = postprocess(tokens, gold_labels, tokenizer=tokenizer, filename=filename, use_rules=True)
+        gold_entities = {label.split("-", 1)[1] for label in gold_labels if label.startswith(("B-", "I-"))}
+        for optional_field, entity in (("episode", "EPISODE"), ("season", "SEASON")):
+            if entity not in gold_entities:
+                gold[optional_field] = None
+        pred = parse_filename(
+            filename,
+            model,
+            tokenizer,
+            id2label,
+            max_length=max_length,
+            debug=False,
+            use_rules=True,
+            constrain_bio=True,
+        )
+
+        full_match = True
+        field_errors: Dict[str, Dict[str, Optional[str]]] = {}
+        for field in fields:
+            gold_value = normalize_field_value(field, gold.get(field))
+            pred_value = normalize_field_value(field, pred.get(field))
+            counter[f"{field}_total"] += 1
+            if gold_value == pred_value:
+                counter[f"{field}_correct"] += 1
+            else:
+                full_match = False
+                field_errors[field] = {"gold": gold_value, "pred": pred_value}
+        counter["full_total"] += 1
+        if full_match:
+            counter["full_correct"] += 1
+        elif len(failures) < 20:
+            failures.append(
+                {
+                    "filename": filename,
+                    "errors": field_errors,
+                    "gold": {field: gold.get(field) for field in fields},
+                    "pred": {field: pred.get(field) for field in fields},
+                }
+            )
+
+    field_accuracy = {}
+    for field in fields:
+        total = counter.get(f"{field}_total", 0)
+        correct = counter.get(f"{field}_correct", 0)
+        field_accuracy[field] = correct / total if total else 0.0
+
+    total = counter.get("full_total", 0)
+    correct = counter.get("full_correct", 0)
+    return {
+        "sample_count": total,
+        "field_accuracy": field_accuracy,
+        "field_correct": {field: counter.get(f"{field}_correct", 0) for field in fields},
+        "field_total": {field: counter.get(f"{field}_total", 0) for field in fields},
+        "full_match_accuracy": correct / total if total else 0.0,
+        "full_match_correct": correct,
+        "full_match_total": total,
+        "failures": failures,
+    }
+
+
 def remap_token_embeddings(
     model: BertForTokenClassification,
     old_vocab: Dict[str, int],
 
                           max_size: Optional[int] = None) -> None:
     token_lists: List[List[str]] = []
     for item in data:
+        tokens, _labels = labels_for_tokenizer(item, tokenizer)
         token_lists.append(tokens)

     tokenizer.build_vocab(token_lists, max_size=max_size)
 
         config.warmup_steps = args.warmup_steps
     if args.train_split is not None:
         config.train_split = args.train_split
+    if args.num_workers is not None:
+        config.num_workers = args.num_workers
     if args.max_seq_length is not None:
         config.max_seq_length = args.max_seq_length
     elif tokenizer_variant == "char":
         config.max_seq_length = max(config.max_seq_length, 128)
+    if args.hidden_size is not None:
+        config.hidden_size = args.hidden_size
+    if args.num_hidden_layers is not None:
+        config.num_hidden_layers = args.num_hidden_layers
+    if args.num_attention_heads is not None:
+        config.num_attention_heads = args.num_attention_heads
+    if args.intermediate_size is not None:
+        config.intermediate_size = args.intermediate_size
+    if config.hidden_size % config.num_attention_heads != 0:
+        raise ValueError(
+            f"hidden_size ({config.hidden_size}) must be divisible by "
+            f"num_attention_heads ({config.num_attention_heads})."
+        )
+    config.max_position_embeddings = max(config.max_position_embeddings, config.max_seq_length)

     random.seed(args.seed)
     np.random.seed(args.seed)
     torch.manual_seed(args.seed)

     print("Loading dataset...")
+    all_data = load_jsonl(config.data_file, args.limit_samples)
+    if len(all_data) < 2:
+        raise ValueError("Need at least two samples so train/eval split is non-empty.")
     if not args.no_shuffle:
         random.shuffle(all_data)
     validate_dataset_tokenizer_metadata(all_data, tokenizer_variant)
 
     print(f" Variant: {tokenizer_variant}")
     print(f" Vocab size: {tokenizer.vocab_size}")
     print(f" Max sequence length: {config.max_seq_length}")
+    if torch.cuda.is_available() and not args.cpu:
+        print(f" CUDA device: {torch.cuda.get_device_name(0)}")
+        print(" Mixed precision: fp16")

     # Update config with actual vocab size
     config.vocab_size = tokenizer.vocab_size
 
     if args.init_model_dir:
         print(f"Loading model for fine-tuning: {args.init_model_dir}")
         model = BertForTokenClassification.from_pretrained(args.init_model_dir)
+        init_tokenizer = load_tokenizer(args.init_model_dir, tokenizer_variant)
+        init_vocab = init_tokenizer.get_vocab()
+        embedding_size = model.get_input_embeddings().weight.shape[0]
+        if len(init_vocab) != embedding_size:
+            print(
+                " WARNING: init checkpoint tokenizer vocab length does not match model embedding size "
+                f"({len(init_vocab):,} vs {embedding_size:,}). Prefer a self-consistent checkpoint."
+            )
         init_variant = getattr(init_tokenizer, "tokenizer_variant", None)
         if init_variant != tokenizer_variant:
             print(f" WARNING: tokenizer variant changes during fine-tune: {init_variant} -> {tokenizer_variant}")
             print(" Token embeddings will be remapped by token string; unmatched tokens are newly initialized.")
+        if model.config.vocab_size != config.vocab_size or init_vocab != tokenizer.get_vocab():
             copied = remap_token_embeddings(
                 model=model,
+                old_vocab=init_vocab,
                 new_vocab=tokenizer.get_vocab(),
                 pad_token_id=tokenizer.pad_token_id,
             )
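The body of `remap_token_embeddings` is not part of this hunk. As a rough illustration of what "remapped by token string" could mean, here is a dependency-free sketch over plain lists. The name `remap_rows_by_token`, the Gaussian initialisation, and the zeroed padding row are all assumptions, and the sketch further assumes that new vocab ids are contiguous from 0:

```python
import random
from typing import Dict, List, Optional, Tuple


def remap_rows_by_token(
    old_rows: List[List[float]],
    old_vocab: Dict[str, int],
    new_vocab: Dict[str, int],
    hidden_size: int,
    pad_token_id: Optional[int] = None,
    seed: int = 0,
) -> Tuple[List[List[float]], int]:
    """Build a new embedding table, copying rows for tokens found in both vocabs."""
    rng = random.Random(seed)
    # Assumption: unmatched tokens start from small random values.
    new_rows = [
        [rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
        for _ in range(len(new_vocab))
    ]
    copied = 0
    for token, new_id in new_vocab.items():
        old_id = old_vocab.get(token)
        # Skip ids that point past the old table (inconsistent checkpoint).
        if old_id is not None and old_id < len(old_rows):
            new_rows[new_id] = list(old_rows[old_id])
            copied += 1
    if pad_token_id is not None:
        # Assumption: the padding row is kept as a zero vector.
        new_rows[pad_token_id] = [0.0] * hidden_size
    return new_rows, copied
```

Matching on the token string rather than the id is what keeps a learned embedding attached to its character when two ids trade places between vocab versions.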
 
         print("WARNING: Model exceeds the historical 5M target; continuing because vocab size is configurable.")
 
     split_idx = int(len(all_data) * config.train_split)
+    split_idx = max(1, min(len(all_data) - 1, split_idx))
     train_data = all_data[:split_idx]
     eval_data = all_data[split_idx:]
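The clamp added to `split_idx` guarantees that neither side of the split is empty, even with extreme `train_split` values or a small `--limit-samples` run. A self-contained sketch of the same arithmetic; the helper name `split_train_eval` is hypothetical:

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_eval(samples: Sequence[T], train_split: float) -> Tuple[List[T], List[T]]:
    # Mirrors the guard in the diff: two samples are the minimum for a split.
    if len(samples) < 2:
        raise ValueError("Need at least two samples so train/eval split is non-empty.")
    split_idx = int(len(samples) * train_split)
    # Clamp into [1, len - 1] so both slices keep at least one element.
    split_idx = max(1, min(len(samples) - 1, split_idx))
    return list(samples[:split_idx]), list(samples[split_idx:])
```

With three samples and `train_split=1.0` the unclamped index would be 3 and the eval slice would be empty; the clamp moves it to 2.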
 
 
     use_cpu = args.cpu or not torch.cuda.is_available()
     use_fp16 = not use_cpu
     print(f" Device: {'CPU' if use_cpu else 'CUDA'}")
+    eval_save_strategy = "steps" if args.checkpoint_steps else "epoch"
 
     # Training arguments
     training_args = TrainingArguments(
 
         num_train_epochs=config.num_epochs,
         per_device_train_batch_size=config.batch_size,
         per_device_eval_batch_size=config.batch_size,
+        eval_strategy=eval_save_strategy,
+        save_strategy=eval_save_strategy,
+        eval_steps=args.checkpoint_steps,
         save_steps=args.checkpoint_steps,
         logging_steps=config.log_interval,
         learning_rate=config.learning_rate,
         weight_decay=config.weight_decay,
         warmup_steps=config.warmup_steps,
+        gradient_accumulation_steps=args.gradient_accumulation_steps,
         use_cpu=use_cpu,
+        report_to=["tensorboard"] if args.tensorboard else "none",
         save_total_limit=args.save_total_limit,
+        load_best_model_at_end=True,
         metric_for_best_model="f1",
         greater_is_better=True,
         dataloader_num_workers=config.num_workers,
+        dataloader_pin_memory=not use_cpu,
         fp16=use_fp16,
     )
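Two details of this `TrainingArguments` call are easy to check by hand. `load_best_model_at_end=True` requires the evaluation and save strategies to match, which is why one variable feeds both. Gradient accumulation also multiplies the number of samples per optimizer step. A small sketch of both calculations; the helper names are hypothetical and no Transformers import is needed:

```python
from typing import Dict, Optional, Union


def checkpoint_schedule(
    checkpoint_steps: Optional[int],
) -> Dict[str, Union[str, Optional[int]]]:
    # One strategy drives both evaluation and saving so the best checkpoint
    # can be identified and reloaded at the end of training.
    strategy = "steps" if checkpoint_steps else "epoch"
    return {
        "eval_strategy": strategy,
        "save_strategy": strategy,
        "eval_steps": checkpoint_steps,
        "save_steps": checkpoint_steps,
    }


def effective_batch_size(
    per_device_batch_size: int,
    gradient_accumulation_steps: int,
    num_devices: int = 1,
) -> int:
    # Samples that contribute to a single optimizer update.
    return per_device_batch_size * gradient_accumulation_steps * num_devices
```

When comparing runs, the effective batch size rather than the per-device value is the number to hold constant alongside the learning rate.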
 
 
     final_save_path = os.path.join(config.save_dir, "final")
     trainer.save_model(final_save_path)
     tokenizer.save_pretrained(final_save_path)
+    metadata = {
+        "experiment_name": args.experiment_name,
+        "data_file": config.data_file,
+        "tokenizer_variant": tokenizer_variant,
+        "vocab_file": vocab_path,
+        "vocab_size": tokenizer.vocab_size,
+        "max_seq_length": config.max_seq_length,
+        "hidden_size": config.hidden_size,
+        "num_hidden_layers": config.num_hidden_layers,
+        "num_attention_heads": config.num_attention_heads,
+        "intermediate_size": config.intermediate_size,
+        "train_samples": len(train_dataset),
+        "eval_samples": len(eval_dataset),
+        "epochs": config.num_epochs,
+        "batch_size": config.batch_size,
+        "learning_rate": config.learning_rate,
+        "warmup_steps": config.warmup_steps,
+        "seed": args.seed,
+        "device": "cpu" if use_cpu else "cuda",
+        "fp16": use_fp16,
+        "gradient_accumulation_steps": training_args.gradient_accumulation_steps,
+        "dataloader_num_workers": config.num_workers,
+    }
+    with open(os.path.join(final_save_path, "run_metadata.json"), "w", encoding="utf-8") as f:
+        json.dump(metadata, f, ensure_ascii=False, indent=2)
     print(f"Model saved to: {final_save_path}")
 
     # Final evaluation
 
     eval_results = trainer.evaluate()
     for key, value in eval_results.items():
         print(f" {key}: {value:.4f}")
+    with open(os.path.join(final_save_path, "trainer_eval_metrics.json"), "w", encoding="utf-8") as f:
+        json.dump({key: float(value) for key, value in eval_results.items()}, f, ensure_ascii=False, indent=2)
+
+    if args.parse_eval_limit != 0:
+        parse_limit = args.parse_eval_limit if args.parse_eval_limit and args.parse_eval_limit > 0 else None
+        parse_metrics = parse_exact_metrics(
+            eval_data,
+            trainer.model,
+            tokenizer,
+            config.id2label,
+            config.max_seq_length,
+            parse_limit,
+        )
+        with open(os.path.join(final_save_path, "parse_eval_metrics.json"), "w", encoding="utf-8") as f:
+            json.dump(parse_metrics, f, ensure_ascii=False, indent=2)
+        print("\nParse exact-match evaluation:")
+        print(
+            f" full_match: {parse_metrics['full_match_correct']}/"
+            f"{parse_metrics['full_match_total']} ({parse_metrics['full_match_accuracy']:.4f})"
+        )
+        for field, accuracy in parse_metrics["field_accuracy"].items():
+            correct = parse_metrics["field_correct"][field]
+            total = parse_metrics["field_total"][field]
+            print(f" {field}: {correct}/{total} ({accuracy:.4f})")
 
 
 if __name__ == "__main__":
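`parse_exact_metrics` is defined outside this hunk, so only the shape of its return value is visible here. The sketch below produces a dictionary with the same keys from already-parsed predictions and references. The function name `exact_match_metrics` and the field names used in the usage example are hypothetical; the real function also runs the model, which this sketch does not attempt:

```python
from collections import Counter
from typing import Dict, List


def exact_match_metrics(
    predictions: List[Dict[str, str]],
    references: List[Dict[str, str]],
) -> Dict[str, object]:
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    field_correct: Counter = Counter()
    field_total: Counter = Counter()
    full_correct = 0
    for pred, gold in zip(predictions, references):
        all_fields_ok = True
        for field, expected in gold.items():
            field_total[field] += 1
            if pred.get(field) == expected:
                field_correct[field] += 1
            else:
                all_fields_ok = False
        # A sample counts as a full match only if every gold field matches.
        if all_fields_ok:
            full_correct += 1
    total = len(references)
    return {
        "full_match_correct": full_correct,
        "full_match_total": total,
        "full_match_accuracy": full_correct / total if total else 0.0,
        # Include zero counts so every field in field_total has an entry.
        "field_correct": {f: field_correct[f] for f in field_total},
        "field_total": dict(field_total),
        "field_accuracy": {f: field_correct[f] / n for f, n in field_total.items()},
    }
```

For example, with two samples where only one `episode` value is wrong:

```python
preds = [{"title": "A", "episode": "01"}, {"title": "B", "episode": "03"}]
golds = [{"title": "A", "episode": "01"}, {"title": "B", "episode": "02"}]
metrics = exact_match_metrics(preds, golds)
```

Full-match accuracy is stricter than token-level F1, because a single wrong field fails the whole filename.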
trainer_eval_metrics.json ADDED
@@ -0,0 +1,11 @@
+{
+  "eval_loss": 0.01631847210228443,
+  "eval_precision": 0.9799749533444652,
+  "eval_recall": 0.986698478236683,
+  "eval_f1": 0.9833252228334185,
+  "eval_accuracy": 0.9943065860243627,
+  "eval_runtime": 39.3604,
+  "eval_samples_per_second": 321.161,
+  "eval_steps_per_second": 1.27,
+  "epoch": 2.0
+}
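The reported F1 can be sanity-checked against the reported precision and recall, since F1 is their harmonic mean when all three are aggregated the same way. A short check using only the numbers from `trainer_eval_metrics.json`; the helper name `f1_from` is hypothetical:

```python
def f1_from(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


reported = {
    "eval_precision": 0.9799749533444652,
    "eval_recall": 0.986698478236683,
    "eval_f1": 0.9833252228334185,
}

recomputed = f1_from(reported["eval_precision"], reported["eval_recall"])
print(f"recomputed={recomputed:.10f} reported={reported['eval_f1']:.10f}")
```

A mismatch here would suggest the metrics were averaged differently (for example macro versus micro) or written from different evaluation runs.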
training_args.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:d71d921b0df7747e0ef56e0c8d857b27141dc8dfa47a8c93c20f39216b35e0db
+oid sha256:b5aa0df615ce731796aa9934b0505e00a685611be134c071d7b2487d8112dde1
 size 5265
uv.lock ADDED
The diff for this file is too large to render. See raw diff
 
vocab.char.json CHANGED
@@ -56,8 +56,8 @@
   "N": 54,
   "3": 55,
   "(": 56,
-  ")": 57,
-  "K": 58,
+  "K": 57,
+  ")": 58,
   "g": 59,
   "y": 60,
   "O": 61,
vocab.json CHANGED
The diff for this file is too large to render. See raw diff
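The `vocab.char.json` hunk swaps the ids of `)` and `K`. When a vocab diff is too large to render, reassignments like this can be listed programmatically. A minimal sketch, assuming both versions are loaded as token-to-id dictionaries; the helper name `changed_token_ids` is hypothetical and the sample data is just the excerpt shown in the hunk:

```python
from typing import Dict, Tuple


def changed_token_ids(
    old_vocab: Dict[str, int],
    new_vocab: Dict[str, int],
) -> Dict[str, Tuple[int, int]]:
    # Tokens present in both versions whose id moved: token -> (old_id, new_id).
    shared = old_vocab.keys() & new_vocab.keys()
    return {
        token: (old_vocab[token], new_vocab[token])
        for token in shared
        if old_vocab[token] != new_vocab[token]
    }


old_excerpt = {"(": 56, ")": 57, "K": 58, "g": 59}
new_excerpt = {"(": 56, "K": 57, ")": 58, "g": 59}
moved = changed_token_ids(old_excerpt, new_excerpt)
```

Any token in the result needs its embedding row moved when fine-tuning from a checkpoint trained with the old vocab, which is the case the string-based remapping in `train.py` is meant to handle.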