Improve anime filename parser model
- .gitignore +5 -0
- MAINTENANCE.md +28 -18
- README.md +52 -78
- build_repair_focus_dataset.py +151 -0
- case_metrics.json +459 -0
- config.json +4 -4
- data/parser_regression_cases.json +232 -0
- dataset.py +144 -4
- datasets/AnimeName +1 -1
- diagnose_pipeline.py +198 -22
- dmhy_dataset.py +433 -44
- evaluate_parser_cases.py +163 -0
- exports/anime_filename_parser.metadata.json +4 -4
- exports/anime_filename_parser.onnx +2 -2
- inference.py +342 -35
- label_repairs.py +513 -0
- model.safetensors +2 -2
- model/config.json +0 -64
- model/model.safetensors +0 -3
- model/tokenizer_config.json +0 -44
- model/training_args.bin +0 -3
- model/vocab.json +0 -0
- parse_eval_metrics.json +595 -0
- pyproject.toml +36 -0
- relabel_dataset_from_filenames.py +157 -0
- repair_dataset_labels.py +103 -0
- requirements.txt +12 -10
- run_metadata.json +23 -0
- tokenizer.py +3 -3
- tokenizer_config.json +2 -2
- train.py +223 -15
- trainer_eval_metrics.json +11 -0
- training_args.bin +1 -1
- uv.lock +0 -0
- vocab.char.json +2 -2
- vocab.json +0 -0
.gitignore
CHANGED
@@ -1,9 +1,14 @@
 __pycache__/
 *.pyc
+.venv/
+.pytest_cache/
+.ruff_cache/
 logs/
 checkpoints/
 test_checkpoints*/
 ab_checkpoints*/
+*.log
+*.onnx.data
 data/**/*.jsonl
 !data/synthetic_small.jsonl
 !data/test_smoke.jsonl
MAINTENANCE.md
CHANGED
@@ -35,10 +35,9 @@ git submodule update --init --recursive
 Current DMHY snapshot:
 
 ```text
-
-
-
-mixed_train_samples: 363042
+labeled_samples: 632002
+char_vocab_size: 6199
+strict_bio_violations: 0
 ```
 
 The authoritative dataset files live in `datasets/AnimeName`.
@@ -46,17 +45,21 @@ The authoritative dataset files live in `datasets/AnimeName`.
 ## Train
 
 ```bash
-
-python train.py \
-  --
-  --
-  --
+uv sync
+uv run python train.py \
+  --tokenizer char \
+  --data-file datasets/AnimeName/dmhy_weak_char.jsonl \
+  --vocab-file datasets/AnimeName/vocab.char.json \
+  --save-dir checkpoints/dmhy-char-full-relabel \
   --init-model-dir . \
-  --epochs
-  --batch-size
-  --learning-rate 0.
+  --epochs 2 \
+  --batch-size 256 \
+  --learning-rate 0.00008 \
   --warmup-steps 300 \
-  --
+  --max-seq-length 128 \
+  --checkpoint-steps 1000 \
+  --parse-eval-limit 2048 \
+  --seed 48
 ```
 
 ## Publish a New Checkpoint
@@ -64,13 +67,20 @@ python train.py \
 Copy the final checkpoint to the repository root:
 
 ```powershell
-Copy-Item checkpoints/dmhy-
-Copy-Item checkpoints/dmhy-
-Copy-Item checkpoints/dmhy-
-Copy-Item checkpoints/dmhy-
-Copy-Item checkpoints/dmhy-
+Copy-Item checkpoints/dmhy-char-full-relabel/final/config.json . -Force
+Copy-Item checkpoints/dmhy-char-full-relabel/final/model.safetensors . -Force
+Copy-Item checkpoints/dmhy-char-full-relabel/final/tokenizer_config.json . -Force
+Copy-Item checkpoints/dmhy-char-full-relabel/final/training_args.bin . -Force
+Copy-Item checkpoints/dmhy-char-full-relabel/final/vocab.json . -Force
+Copy-Item datasets/AnimeName/vocab.char.json .\vocab.char.json -Force
+Copy-Item checkpoints/dmhy-char-full-relabel/final/run_metadata.json . -Force
+Copy-Item checkpoints/dmhy-char-full-relabel/final/trainer_eval_metrics.json . -Force
+Copy-Item checkpoints/dmhy-char-full-relabel/final/parse_eval_metrics.json . -Force
 ```
 
+There is no tracked `model/` duplicate. The root checkpoint is the publishing
+surface; ignored `checkpoints/` directories are training artifacts.
+
 Then commit and push:
 
 ```bash
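The publish step in the MAINTENANCE.md change is a list of PowerShell `Copy-Item` calls. For readers on other platforms, the same step could be sketched in Python. This is only an illustration and not part of the commit: `publish_checkpoint` and its parameters are hypothetical names, and the file list is copied from the `Copy-Item` lines in the diff.

```python
import shutil
from pathlib import Path

# File names taken from the Copy-Item list in the MAINTENANCE.md diff.
CHECKPOINT_FILES = [
    "config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "training_args.bin",
    "vocab.json",
    "run_metadata.json",
    "trainer_eval_metrics.json",
    "parse_eval_metrics.json",
]


def publish_checkpoint(final_dir: Path, vocab_char: Path, repo_root: Path) -> list[str]:
    """Copy checkpoint files into repo_root and return the names that were missing.

    Hypothetical helper: existing files in repo_root are overwritten,
    mirroring the -Force flag used in the PowerShell commands.
    """
    missing: list[str] = []
    for name in CHECKPOINT_FILES:
        src = final_dir / name
        if src.is_file():
            shutil.copy2(src, repo_root / name)
        else:
            missing.append(name)
    # The character vocab comes from the dataset submodule, not the checkpoint.
    if vocab_char.is_file():
        shutil.copy2(vocab_char, repo_root / "vocab.char.json")
    else:
        missing.append(vocab_char.name)
    return missing
```

Returning the missing names instead of raising lets a caller decide whether an incomplete checkpoint should block the publish.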
README.md
CHANGED
@@ -19,7 +19,7 @@ language:
 
 AniFileBERT is a tiny BERT token-classification model for parsing anime release filenames into structured fields such as release group, title, season, episode, resolution, source, and special tags.
 
-The checkpoint in this repository is the
+The checkpoint in this repository is the full-relabel DMHY character-token model used by MiruPlay.
 
 ## Model
 
@@ -28,9 +28,9 @@ The checkpoint in this repository is the DMHY weak-label fine-tuned regex-tokeni
 - Layers: 4
 - Attention heads: 8
 - Labels: BIO token labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, and `SPECIAL`
-- Tokenizer: custom
-- Max sequence length:
-- Parameters:
+- Tokenizer: custom character tokenizer implemented in `tokenizer.py`
+- Max sequence length: 128
+- Parameters: 4,783,631
 
 The model files are stored at the repository root so `BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")` can load the weights. Use `inference.py` for end-to-end parsing because the tokenizer is custom rather than a standard WordPiece tokenizer.
 
@@ -47,52 +47,40 @@ Current DMHY export waterline (from `datasets/AnimeName`):
 
 ## Vocabulary
 
-The
-
-
+The published checkpoint uses a character vocabulary. `vocab.json` at the
+repository root is the deployed tokenizer vocab, and `vocab.char.json` is kept
+as a mirrored explicit copy for training/data maintenance. The full DMHY weak
+dataset has **6195 unique characters**, so the complete character vocab is only
+**6199** entries including special tokens and reaches 100% token coverage.
 
-
-
-| 3000 (old) | 90.4% | ~4.0M |
-| 8000 (current) | 96.2% | ~5.3M |
-
-Common fansub group names (`Snow`, `LoliHouse`, `DMG`, `KTXP`, `Sakurato`, etc.)
-and individual bracket characters (`[`, `]`, `(`, `)`) are included in the new
-vocabulary.
-
-For character-token training, `vocab.char.json` is mirrored at the repository
-root for plain `git pull` users and also lives at
-`datasets/AnimeName/vocab.char.json` beside the dataset. It is built from the
-full `dmhy_weak_char.jsonl` export. The full DMHY weak dataset has **6195
-unique characters**, so the complete character vocab is only **6199** entries
-including special tokens and reaches 100% token coverage.
+The regex vocabulary is still maintained in `datasets/AnimeName/vocab.json` for
+dataset relabeling and diagnostics, but the root checkpoint loads as `char`.
 
 ## Evaluation
 
-
-
-| Variant | Max length | Vocab | Params | Eval F1 | Accuracy | Train runtime |
-|---------|------------|-------|--------|---------|----------|---------------|
-| regex | 64 | 3000 | 3.96M | 0.9911 | 0.9951 | 827s |
-| char | 128 | 2654 | 3.88M | 0.8142 | 0.9637 | 1983s |
+Final full-relabel char training (`632002` DMHY rows, 2 epochs, batch size 256,
+seed 48):
 
-
+| Metric | Value |
+|--------|-------|
+| Eval loss | 0.0163 |
+| Entity precision | 0.9800 |
+| Entity recall | 0.9867 |
+| Entity F1 | 0.9833 |
+| Token accuracy | 0.9943 |
+| Held-out parse full match | 2008/2048 (0.9805) |
+| Fixed regression full match | 21/21 (1.0000) |
 
-
-
-
-| TITLE | 0.9761 | 0.7983 |
-| SEASON | 0.9880 | 0.6290 |
-| EPISODE | 0.9950 | 0.8082 |
-
-The regex tokenizer remains the default. Both variants can parse simple `S01E07`, but the character tokenizer was weaker on season/episode boundaries and long title spans.
+The fixed regression set includes second-season aliases such as `Ni`,
+`Ni no Sara`, `貳`, and `弐ノ章`, plus long-running episode IDs and dense meta
+blocks.
 
 ## Usage
 
 Install dependencies:
 
 ```bash
-
+uv sync
 ```
 
 Parse a filename with this repository cloned locally:
@@ -121,47 +109,25 @@ git submodule update --init --recursive
 
 ## Training
 
-### Prerequisites (Windows / Local GPU)
-
-PyTorch 2.11+ with CUDA 12.6 is required for GPU training:
-
-```bash
-pip install torch --index-url https://download.pytorch.org/whl/cu126
-pip install -r requirements.txt
-```
-
-### Fine-tune with rebuilt vocabulary
-
-```bash
-python train.py --data-file datasets/AnimeName/dmhy_weak.jsonl \
-  --vocab-file datasets/AnimeName/vocab.json \
-  --save-dir checkpoints/dmhy-finetune \
-  --init-model-dir . \
-  --epochs 10 --batch-size 128 \
-  --learning-rate 0.0003 --warmup-steps 300 --seed 42
-```
-
-The model loads the old 3000-token checkpoint, `resize_token_embeddings()` adds
-5000 new randomly-initialized slots for the new vocabulary, and fine-tuning
-trains the full model. About 96% of token occurrences are now covered (vs 90%
-with the old 3000-token vocabulary).
-
 ### Character-token DMHY training
 
 ```bash
-python convert_to_char_dataset.py \
+uv run python convert_to_char_dataset.py \
   --input datasets/AnimeName/dmhy_weak.jsonl \
   --output datasets/AnimeName/dmhy_weak_char.jsonl \
-  --vocab-output vocab.char.json \
+  --vocab-output datasets/AnimeName/vocab.char.json \
   --manifest-output datasets/AnimeName/dmhy_weak_char.manifest.json
 
-python train.py --tokenizer char \
+uv run python train.py --tokenizer char \
   --data-file datasets/AnimeName/dmhy_weak_char.jsonl \
-  --vocab-file vocab.char.json \
-  --save-dir
-  --
-  --
-  --
+  --vocab-file datasets/AnimeName/vocab.char.json \
+  --save-dir checkpoints/dmhy-char-full-relabel \
+  --init-model-dir . \
+  --epochs 2 --batch-size 256 \
+  --learning-rate 0.00008 --warmup-steps 300 \
+  --checkpoint-steps 1000 --save-total-limit 3 \
+  --parse-eval-limit 2048 \
+  --max-seq-length 128 --seed 48
 ```
 
 The converter keeps source metadata and adds `tokenizer_variant`, source token
@@ -169,12 +135,21 @@ count, and character token count fields to each record. The char dataset's
 p99 length is 107 characters, so `--max-seq-length 128` covers almost all rows
 while leaving room for `[CLS]` and `[SEP]`.
 
-###
+### Relabel the full dataset
 
 ```bash
-python
-
-
+uv run python relabel_dataset_from_filenames.py \
+  --input datasets/AnimeName/dmhy_weak.jsonl \
+  --output datasets/AnimeName/dmhy_weak.relabel.jsonl \
+  --manifest-output datasets/AnimeName/dmhy_weak.relabel.manifest.json \
+  --vocab-output datasets/AnimeName/vocab.relabel.json \
+  --base-vocab datasets/AnimeName/vocab.json \
+  --max-vocab-size 8000
+
+Move-Item datasets/AnimeName/dmhy_weak.relabel.jsonl datasets/AnimeName/dmhy_weak.jsonl -Force
+Move-Item datasets/AnimeName/vocab.relabel.json datasets/AnimeName/vocab.json -Force
+Copy-Item datasets/AnimeName/dmhy_weak.relabel.manifest.json datasets/AnimeName/dmhy_weak.manifest.json -Force
+Remove-Item datasets/AnimeName/dmhy_weak.relabel.manifest.json -Force
 ```
 
 ### Rebuild vocabulary (if needed)
@@ -192,7 +167,7 @@ json.dump(vocab, open('vocab.json','w'), ensure_ascii=False, indent=2)
 ### Export ONNX for MiruPlay Android
 
 ```bash
-python export_onnx.py --model-dir
+uv run python export_onnx.py --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
 ```
 
 ---
@@ -213,14 +188,13 @@ python colab_train.py --profile dmhy_regex_finetune
 
 ## Repository Layout
 
-- `model.safetensors`, `config.json`, `vocab.json`: default
+- `model.safetensors`, `config.json`, `vocab.json`: default published model
 - `train.py`, `dataset.py`, `tokenizer.py`, `model.py`: training pipeline
 - `dmhy_dataset.py`, `mix_datasets.py`: weak-label export and dataset mixing
 - `convert_to_char_dataset.py`: full character-token projection for weak labels
 - `inference.py`: end-to-end filename parser CLI
 - `export_onnx.py`: ONNX export for Android integration
 - `exports/`: exported ONNX model and metadata
-- `data/dmhy/*.manifest.json`: dataset waterlines and counts
 - `datasets/AnimeName/`: nested dataset submodule
 
 ## Maintenance Notes
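The new Evaluation table in the README reports precision, recall, F1, and two match rates. As a quick sanity check, the reported figures can be recomputed from one another with the standard F1 definition. This is a minimal sketch, not code from the repository; `f1_score` and `match_rate` are hypothetical helper names.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def match_rate(correct: int, total: int) -> float:
    """Fraction of cases whose parsed fields all matched."""
    return correct / total


# Values copied from the README table in the diff.
print(round(f1_score(0.9800, 0.9867), 4))  # 0.9833, matches "Entity F1"
print(round(match_rate(2008, 2048), 4))    # 0.9805, matches the held-out rate
print(round(match_rate(21, 21), 4))        # 1.0, matches the regression rate
```

The recomputed values agree with the table to four decimal places, so the reported F1 is consistent with the reported precision and recall.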
build_repair_focus_dataset.py
ADDED
@@ -0,0 +1,151 @@
+"""Build a small fine-tuning set focused on repaired filename structures."""
+
+from __future__ import annotations
+
+import argparse
+import json
+import random
+from pathlib import Path
+from typing import Iterable, List
+
+from label_repairs import repair_jsonl_item
+
+
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Build repair-focused char JSONL fine-tune data")
+    parser.add_argument("--input", required=True, help="Repaired char JSONL dataset")
+    parser.add_argument("--output", required=True, help="Output focus JSONL")
+    parser.add_argument("--context-samples", type=int, default=50000,
+                        help="Random non-repaired rows to include for stability")
+    parser.add_argument("--repeat-repaired", type=int, default=4,
+                        help="Repeat rows that still trigger a repair pass")
+    parser.add_argument("--repeat-manual", type=int, default=24,
+                        help="Repeat hand-labeled hard cases")
+    parser.add_argument("--seed", type=int, default=42)
+    return parser.parse_args()
+
+
+def char_item(filename: str, spans: List[tuple[str, str]]) -> dict:
+    tokens = list(filename)
+    labels = ["O"] * len(tokens)
+    cursor = 0
+    for text, entity in spans:
+        start = filename.find(text, cursor)
+        if start < 0:
+            start = filename.find(text)
+        if start < 0:
+            raise ValueError(f"Could not find span {text!r} in {filename!r}")
+        end = start + len(text)
+        labels[start] = f"B-{entity}"
+        for idx in range(start + 1, end):
+            labels[idx] = f"I-{entity}"
+        cursor = end
+    return {
+        "filename": filename,
+        "tokens": tokens,
+        "labels": labels,
+        "tokenizer_variant": "char",
+        "source": "manual_repair_focus",
+    }
+
+
+def manual_cases() -> Iterable[dict]:
+    yield char_item(
+        "[AI-Raws] 炎炎の消防隊 弐ノ章 #13 (BD HEVC 1920x1080 yuv444p10le FLAC)[FC74A2D5].mkv",
+        [
+            ("AI-Raws", "GROUP"),
+            ("炎炎の消防隊", "TITLE"),
+            ("弐ノ章", "SEASON"),
+            ("13", "EPISODE"),
+            ("BD", "SOURCE"),
+            ("HEVC", "SOURCE"),
+            ("1920x1080", "RESOLUTION"),
+            ("FLAC", "SOURCE"),
+        ],
+    )
+    yield char_item(
+        "[AI-Raws] 炎炎の消防隊 弐ノ章 #01 (BD HEVC 1920x1080 FLAC).mkv",
+        [
+            ("AI-Raws", "GROUP"),
+            ("炎炎の消防隊", "TITLE"),
+            ("弐ノ章", "SEASON"),
+            ("01", "EPISODE"),
+            ("BD", "SOURCE"),
+            ("HEVC", "SOURCE"),
+            ("1920x1080", "RESOLUTION"),
+            ("FLAC", "SOURCE"),
+        ],
+    )
+    yield char_item(
+        "[DBD-Raws][炎炎消防队 貳之章][01][1080P][BDRip][HEVC-10bit][FLAC]",
+        [
+            ("DBD-Raws", "GROUP"),
+            ("炎炎消防队", "TITLE"),
+            ("貳之章", "SEASON"),
+            ("01", "EPISODE"),
+            ("1080P", "RESOLUTION"),
+            ("BDRip", "SOURCE"),
+            ("FLAC", "SOURCE"),
+        ],
+    )
+
+
+def main() -> None:
+    args = parse_args()
+    rng = random.Random(args.seed)
+    input_path = Path(args.input)
+    output_path = Path(args.output)
+
+    repaired_rows: List[dict] = []
+    reservoir: List[dict] = []
+    seen_filenames = set()
+    total_rows = 0
+
+    with input_path.open("r", encoding="utf-8") as handle:
+        for line in handle:
+            if not line.strip():
+                continue
+            total_rows += 1
+            item = json.loads(line)
+            _repaired_item, repairs = repair_jsonl_item(item)
+            filename = item.get("filename")
+            if repairs:
+                repaired_rows.append(item)
+                if filename:
+                    seen_filenames.add(filename)
+                continue
+            if filename in seen_filenames:
+                continue
+            if len(reservoir) < args.context_samples:
+                reservoir.append(item)
+            else:
+                index = rng.randrange(total_rows)
+                if index < args.context_samples:
+                    reservoir[index] = item
+
+    rows: List[dict] = []
+    for item in repaired_rows:
+        rows.extend([item] * max(1, args.repeat_repaired))
+    rows.extend(reservoir)
+    for item in manual_cases():
+        rows.extend([item] * max(1, args.repeat_manual))
+
+    rng.shuffle(rows)
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    with output_path.open("w", encoding="utf-8") as handle:
+        for item in rows:
+            handle.write(json.dumps(item, ensure_ascii=False, separators=(",", ":")) + "\n")
+
+    print(json.dumps({
+        "input": str(input_path),
+        "output": str(output_path),
+        "total_rows": total_rows,
+        "repaired_rows": len(repaired_rows),
+        "context_rows": len(reservoir),
+        "manual_rows": len(list(manual_cases())),
+        "written_rows": len(rows),
+    }, ensure_ascii=False, indent=2))
+
+
+if __name__ == "__main__":
+    main()
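The context-row loop in `build_repair_focus_dataset.py` follows the shape of reservoir sampling, but it draws its index with `rng.randrange(total_rows)`, and `total_rows` also counts repaired rows and skipped duplicates. Textbook reservoir sampling (Algorithm R) counts only the items eligible for the reservoir, so the script's version appears to make later rows somewhat less likely to be kept than a uniform sample would. For comparison, here is a minimal standalone sketch of the textbook variant. It is not the commit's code, and `reservoir_sample` is a hypothetical name.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Uniformly sample up to k items from a stream in one pass (Algorithm R)."""
    reservoir: List[T] = []
    # `seen` counts only the items offered to the reservoir, starting at 1.
    for seen, item in enumerate(items, start=1):
        if len(reservoir) < k:
            # Fill phase: the first k items are always kept.
            reservoir.append(item)
        else:
            # Replace phase: keep the new item with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    sample = reservoir_sample(range(100_000), 5, random.Random(42))
    print(len(sample))  # 5
```

To apply the same idea in the script, a separate counter incremented only for rows that reach the reservoir branch would take the place of `total_rows` in the `randrange` call.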
case_metrics.json
ADDED
@@ -0,0 +1,459 @@
+{
+  "model_dir": ".",
+  "case_file": "data\\parser_regression_cases.json",
+  "tokenizer_variant": "char",
+  "max_length": 128,
+  "use_rules": true,
+  "constrain_bio": true,
+  "case_count": 21,
+  "full_correct": 21,
+  "full_accuracy": 1.0,
+  "field_correct": {
+    "group": 18,
+    "title": 21,
+    "episode": 21,
+    "resolution": 21,
+    "source": 14,
+    "season": 8,
+    "special": 1
+  },
+  "field_total": {
+    "group": 18,
+    "title": 21,
+    "episode": 21,
+    "resolution": 21,
+    "source": 14,
+    "season": 8,
+    "special": 1
+  },
+  "field_accuracy": {
+    "episode": 1.0,
+    "group": 1.0,
+    "resolution": 1.0,
+    "season": 1.0,
+    "source": 1.0,
+    "special": 1.0,
+    "title": 1.0
+  },
+  "failures": [],
+  "results": [
+    {
+      "id": "lolihouse_dash_episode",
+      "filename": "[LoliHouse] Yomi no Tsugai - 07 [WebRip 1080p HEVC-10bit AAC ASSx2]",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "group": "LoliHouse",
+        "title": "Yomi no Tsugai",
+        "episode": 7,
+        "resolution": "1080p",
+        "source": "WebRip"
+      },
+      "pred": {
+        "episode": 7,
+        "group": "LoliHouse",
+        "resolution": "1080p",
+        "source": "WebRip",
+        "title": "Yomi no Tsugai"
+      }
+    },
+    {
+      "id": "dot_season_episode_no_group",
+      "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "title": "Witch.Hat.Atelier",
+        "season": 1,
+        "episode": 7,
+        "group": null,
+        "resolution": "1080p",
+        "source": "NF"
+      },
+      "pred": {
+        "episode": 7,
+        "group": null,
+        "resolution": "1080p",
+        "season": 1,
+        "source": "NF",
+        "title": "Witch.Hat.Atelier"
+      }
+    },
+    {
+      "id": "ani_cjk_season_dash_episode",
+      "filename": "[ANi] 異世界悠閒農家 2 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "group": "ANi",
+        "title": "異世界悠閒農家",
+        "season": 2,
+        "episode": 6,
+        "resolution": "1080P",
+        "source": "Baha"
+      },
+      "pred": {
+        "episode": 6,
+        "group": "ANi",
+        "resolution": "1080P",
+        "season": 2,
+        "source": "Baha",
+        "title": "異世界悠閒農家"
+      }
+    },
+    {
+      "id": "kisssub_bracket_title_episode",
+      "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][GB][MP4]",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "group": "KissSub",
+        "title": "Shunkashuutou Daikousha - Haru no Mai",
+        "episode": 5,
+        "resolution": "1080P",
+        "source": "GB"
+      },
+      "pred": {
+        "episode": 5,
+        "group": "KissSub",
+        "resolution": "1080P",
+        "source": "GB",
+        "title": "Shunkashuutou Daikousha - Haru no Mai"
+      }
+    },
+    {
+      "id": "airotabracket_title_episode",
+      "filename": "[Airota][Sousou no Frieren][29][1080p AVC AAC][CHT]",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "group": "Airota",
+        "title": "Sousou no Frieren",
+        "episode": 29,
+        "resolution": "1080p",
+        "source": "CHT"
+      },
+      "pred": {
+        "episode": 29,
+        "group": "Airota",
+        "resolution": "1080p",
+        "source": "CHT",
+        "title": "Sousou no Frieren"
+      }
+    },
+    {
+      "id": "subsplease_parenthesized_resolution",
+      "filename": "[SubsPlease] Mushoku Tensei - 12 (1080p) [x265][AAC]",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "group": "SubsPlease",
+        "title": "Mushoku Tensei",
+        "episode": 12,
+        "resolution": "1080p"
+      },
+      "pred": {
+        "episode": 12,
+        "group": "SubsPlease",
+        "resolution": "1080p",
+        "title": "Mushoku Tensei"
+      }
+    },
+    {
+      "id": "vcb_bracket_episode",
+      "filename": "[VCB-Studio] Girls Band Cry [01][Ma10p_1080p][x265_flac]",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "group": "VCB-Studio",
+        "title": "Girls Band Cry",
+        "episode": 1,
+        "resolution": "1080p"
+      },
+      "pred": {
+        "episode": 1,
+        "group": "VCB-Studio",
+        "resolution": "1080p",
+        "title": "Girls Band Cry"
+      }
+    },
+    {
+      "id": "numeric_title_not_episode",
+      "filename": "86 Eighty Six - 01 [1080P][Baha]",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "title": "86 Eighty Six",
+        "episode": 1,
+        "resolution": "1080P",
+        "source": "Baha"
+      },
+      "pred": {
+        "episode": 1,
+        "resolution": "1080P",
+        "source": "Baha",
+        "title": "86 Eighty Six"
+      }
+    },
+    {
+      "id": "erai_raws_dash_episode",
+      "filename": "[Erai-raws] Sousou no Frieren - 01 [1080p][Multiple Subtitle][ENG]",
+      "ok": true,
+      "errors": {},
+      "expected": {
+        "group": "Erai-raws",
+        "title": "Sousou no Frieren",
+        "episode": 1,
+        "resolution": "1080p"
+      },
+      "pred": {
+        "episode": 1,
+        "group": "Erai-raws",
+        "resolution": "1080p",
+        "title": "Sousou no Frieren"
+      }
+    },
+    {
+      "id": "nekomoe_space_group",
+      "filename": "[Nekomoe kissaten][Watashi no Shiawase na Kekkon][01][1080p][JPSC]",
|
| 219 |
+
"ok": true,
|
| 220 |
+
"errors": {},
|
| 221 |
+
"expected": {
|
| 222 |
+
"group": "Nekomoe kissaten",
|
| 223 |
+
"title": "Watashi no Shiawase na Kekkon",
|
| 224 |
+
"episode": 1,
|
| 225 |
+
"resolution": "1080p"
|
| 226 |
+
},
|
| 227 |
+
"pred": {
|
| 228 |
+
"episode": 1,
|
| 229 |
+
"group": "Nekomoe kissaten",
|
| 230 |
+
"resolution": "1080p",
|
| 231 |
+
"title": "Watashi no Shiawase na Kekkon"
|
| 232 |
+
}
|
| 233 |
+
},
|
| 234 |
+
{
|
| 235 |
+
"id": "long_running_episode",
|
| 236 |
+
"filename": "One.Piece.1110.1080p.WEB-DL.AAC2.0.H.264",
|
| 237 |
+
"ok": true,
|
| 238 |
+
"errors": {},
|
| 239 |
+
"expected": {
|
| 240 |
+
"title": "One.Piece",
|
| 241 |
+
"episode": 1110,
|
| 242 |
+
"resolution": "1080p",
|
| 243 |
+
"source": "WEB-DL"
|
| 244 |
+
},
|
| 245 |
+
"pred": {
|
| 246 |
+
"episode": 1110,
|
| 247 |
+
"resolution": "1080p",
|
| 248 |
+
"source": "WEB-DL",
|
| 249 |
+
"title": "One.Piece"
|
| 250 |
+
}
|
| 251 |
+
},
|
| 252 |
+
{
|
| 253 |
+
"id": "season_episode_amzn",
|
| 254 |
+
"filename": "Example.Show.S02E03.2160p.AMZN.WEB-DL.DDP5.1.H.265",
|
| 255 |
+
"ok": true,
|
| 256 |
+
"errors": {},
|
| 257 |
+
"expected": {
|
| 258 |
+
"title": "Example.Show",
|
| 259 |
+
"season": 2,
|
| 260 |
+
"episode": 3,
|
| 261 |
+
"resolution": "2160p",
|
| 262 |
+
"source": "AMZN"
|
| 263 |
+
},
|
| 264 |
+
"pred": {
|
| 265 |
+
"episode": 3,
|
| 266 |
+
"resolution": "2160p",
|
| 267 |
+
"season": 2,
|
| 268 |
+
"source": "AMZN",
|
| 269 |
+
"title": "Example.Show"
|
| 270 |
+
}
|
| 271 |
+
},
|
| 272 |
+
{
|
| 273 |
+
"id": "cjk_group_with_prefix_tag",
|
| 274 |
+
"filename": "【喵萌奶茶屋】★04月新番★[葬送的芙莉莲][01][1080P][HEVC]",
|
| 275 |
+
"ok": true,
|
| 276 |
+
"errors": {},
|
| 277 |
+
"expected": {
|
| 278 |
+
"group": "喵萌奶茶屋",
|
| 279 |
+
"title": "葬送的芙莉莲",
|
| 280 |
+
"episode": 1,
|
| 281 |
+
"resolution": "1080P"
|
| 282 |
+
},
|
| 283 |
+
"pred": {
|
| 284 |
+
"episode": 1,
|
| 285 |
+
"group": "喵萌奶茶屋",
|
| 286 |
+
"resolution": "1080P",
|
| 287 |
+
"title": "葬送的芙莉莲"
|
| 288 |
+
}
|
| 289 |
+
},
|
| 290 |
+
{
|
| 291 |
+
"id": "leading_meta_not_group",
|
| 292 |
+
"filename": "[1080p] Witch Watch - 15 [CHS]",
|
| 293 |
+
"ok": true,
|
| 294 |
+
"errors": {},
|
| 295 |
+
"expected": {
|
| 296 |
+
"group": null,
|
| 297 |
+
"title": "Witch Watch",
|
| 298 |
+
"episode": 15,
|
| 299 |
+
"resolution": "1080p",
|
| 300 |
+
"source": "CHS"
|
| 301 |
+
},
|
| 302 |
+
"pred": {
|
| 303 |
+
"episode": 15,
|
| 304 |
+
"group": null,
|
| 305 |
+
"resolution": "1080p",
|
| 306 |
+
"source": "CHS",
|
| 307 |
+
"title": "Witch Watch"
|
| 308 |
+
}
|
| 309 |
+
},
|
| 310 |
+
{
|
| 311 |
+
"id": "sakurato_group_language_source",
|
| 312 |
+
"filename": "[Sakurato] Witch Watch - 15 [1080p][CHS]",
|
| 313 |
+
"ok": true,
|
| 314 |
+
"errors": {},
|
| 315 |
+
"expected": {
|
| 316 |
+
"group": "Sakurato",
|
| 317 |
+
"title": "Witch Watch",
|
| 318 |
+
"episode": 15,
|
| 319 |
+
"resolution": "1080p",
|
| 320 |
+
"source": "CHS"
|
| 321 |
+
},
|
| 322 |
+
"pred": {
|
| 323 |
+
"episode": 15,
|
| 324 |
+
"group": "Sakurato",
|
| 325 |
+
"resolution": "1080p",
|
| 326 |
+
"source": "CHS",
|
| 327 |
+
"title": "Witch Watch"
|
| 328 |
+
}
|
| 329 |
+
},
|
| 330 |
+
{
|
| 331 |
+
"id": "billion_meta_lab_search_special",
|
| 332 |
+
"filename": "[Billion Meta Lab] 魔法姊妹露露莉莉 Mahou no Shimai Rurutto Riryi [07][1080P][CHT&JPN][檢索:魔法姊妹露露特莉莉].mp4",
|
| 333 |
+
"ok": true,
|
| 334 |
+
"errors": {},
|
| 335 |
+
"expected": {
|
| 336 |
+
"group": "Billion Meta Lab",
|
| 337 |
+
"title": "魔法姊妹露露莉莉 Mahou no Shimai Rurutto Riryi",
|
| 338 |
+
"episode": 7,
|
| 339 |
+
"resolution": "1080P",
|
| 340 |
+
"source": "CHT&JPN",
|
| 341 |
+
"special": "檢索:魔法姊妹露露特莉莉"
|
| 342 |
+
},
|
| 343 |
+
"pred": {
|
| 344 |
+
"episode": 7,
|
| 345 |
+
"group": "Billion Meta Lab",
|
| 346 |
+
"resolution": "1080P",
|
| 347 |
+
"source": "CHT&JPN",
|
| 348 |
+
"special": "檢索:魔法姊妹露露特莉莉",
|
| 349 |
+
"title": "魔法姊妹露露莉莉 Mahou no Shimai Rurutto Riryi"
|
| 350 |
+
}
|
| 351 |
+
},
|
| 352 |
+
{
|
| 353 |
+
"id": "studio_greentea_s2_bracket_episode",
|
| 354 |
+
"filename": "[Studio GreenTea] Otonari no Tenshi-sama ni Itsunomanika Dame Ningen ni Sareteita Ken S2 [06][WebRip][HEVC-10bit 1080p AAC][JPSC].mp4",
|
| 355 |
+
"ok": true,
|
| 356 |
+
"errors": {},
|
| 357 |
+
"expected": {
|
| 358 |
+
"group": "Studio GreenTea",
|
| 359 |
+
"title": "Otonari no Tenshi-sama ni Itsunomanika Dame Ningen ni Sareteita Ken",
|
| 360 |
+
"season": 2,
|
| 361 |
+
"episode": 6,
|
| 362 |
+
"resolution": "1080p",
|
| 363 |
+
"source": "WebRip"
|
| 364 |
+
},
|
| 365 |
+
"pred": {
|
| 366 |
+
"episode": 6,
|
| 367 |
+
"group": "Studio GreenTea",
|
| 368 |
+
"resolution": "1080p",
|
| 369 |
+
"season": 2,
|
| 370 |
+
"source": "WebRip",
|
| 371 |
+
"title": "Otonari no Tenshi-sama ni Itsunomanika Dame Ningen ni Sareteita Ken"
|
| 372 |
+
}
|
| 373 |
+
},
|
| 374 |
+
{
|
| 375 |
+
"id": "lolihouse_kakuriyo_bare_ni_season",
|
| 376 |
+
"filename": "[LoliHouse] Kakuriyo no Yadomeshi Ni - 12 [WebRip 1080p HEVC-10bit AAC SRTx2].mkv",
|
| 377 |
+
"ok": true,
|
| 378 |
+
"errors": {},
|
| 379 |
+
"expected": {
|
| 380 |
+
"group": "LoliHouse",
|
| 381 |
+
"title": "Kakuriyo no Yadomeshi",
|
| 382 |
+
"season": 2,
|
| 383 |
+
"episode": 12,
|
| 384 |
+
"resolution": "1080p",
|
| 385 |
+
"source": "WebRip"
|
| 386 |
+
},
|
| 387 |
+
"pred": {
|
| 388 |
+
"episode": 12,
|
| 389 |
+
"group": "LoliHouse",
|
| 390 |
+
"resolution": "1080p",
|
| 391 |
+
"season": 2,
|
| 392 |
+
"source": "WebRip",
|
| 393 |
+
"title": "Kakuriyo no Yadomeshi"
|
| 394 |
+
}
|
| 395 |
+
},
|
| 396 |
+
{
|
| 397 |
+
"id": "ani_kakuriyo_traditional_ni",
|
| 398 |
+
"filename": "[ANi] 妖怪旅館營業中 貳 - 11 [1080P][Baha][WEB-DL][AAC AVC][CHT].mp4",
|
| 399 |
+
"ok": true,
|
| 400 |
+
"errors": {},
|
| 401 |
+
"expected": {
|
| 402 |
+
"group": "ANi",
|
| 403 |
+
"title": "妖怪旅館營業中",
|
| 404 |
+
"season": 2,
|
| 405 |
+
"episode": 11,
|
| 406 |
+
"resolution": "1080P",
|
| 407 |
+
"source": "Baha"
|
| 408 |
+
},
|
| 409 |
+
"pred": {
|
| 410 |
+
"episode": 11,
|
| 411 |
+
"group": "ANi",
|
| 412 |
+
"resolution": "1080P",
|
| 413 |
+
"season": 2,
|
| 414 |
+
"source": "Baha",
|
| 415 |
+
"title": "妖怪旅館營業中"
|
| 416 |
+
}
|
| 417 |
+
},
|
| 418 |
+
{
|
| 419 |
+
"id": "jibaketa_shokugeki_ni_no_sara",
|
| 420 |
+
"filename": "[jibaketa]Shokugeki no Souma Ni no Sara - 13 END [BD 1920x1080 x264 AACx2 SRT TVB CHT].mkv",
|
| 421 |
+
"ok": true,
|
| 422 |
+
"errors": {},
|
| 423 |
+
"expected": {
|
| 424 |
+
"group": "jibaketa",
|
| 425 |
+
"title": "Shokugeki no Souma",
|
| 426 |
+
"season": 2,
|
| 427 |
+
"episode": 13,
|
| 428 |
+
"resolution": "1920x1080"
|
| 429 |
+
},
|
| 430 |
+
"pred": {
|
| 431 |
+
"episode": 13,
|
| 432 |
+
"group": "jibaketa",
|
| 433 |
+
"resolution": "1920x1080",
|
| 434 |
+
"season": 2,
|
| 435 |
+
"title": "Shokugeki no Souma"
|
| 436 |
+
}
|
| 437 |
+
},
|
| 438 |
+
{
|
| 439 |
+
"id": "ai_raws_fire_force_cjk_season_hash_episode",
|
| 440 |
+
"filename": "[AI-Raws] 炎炎の消防隊 弐ノ章 #13 (BD HEVC 1920x1080 yuv444p10le FLAC)[FC74A2D5].mkv",
|
| 441 |
+
"ok": true,
|
| 442 |
+
"errors": {},
|
| 443 |
+
"expected": {
|
| 444 |
+
"group": "AI-Raws",
|
| 445 |
+
"title": "炎炎の消防隊",
|
| 446 |
+
"season": 2,
|
| 447 |
+
"episode": 13,
|
| 448 |
+
"resolution": "1920x1080"
|
| 449 |
+
},
|
| 450 |
+
"pred": {
|
| 451 |
+
"episode": 13,
|
| 452 |
+
"group": "AI-Raws",
|
| 453 |
+
"resolution": "1920x1080",
|
| 454 |
+
"season": 2,
|
| 455 |
+
"title": "炎炎の消防隊"
|
| 456 |
+
}
|
| 457 |
+
}
|
| 458 |
+
]
|
| 459 |
+
}
|
config.json
CHANGED
|
@@ -50,15 +50,15 @@
|
|
| 50 |
},
|
| 51 |
"layer_norm_eps": 1e-12,
|
| 52 |
"max_position_embeddings": 128,
|
| 53 |
-
"max_seq_length":
|
| 54 |
"model_type": "bert",
|
| 55 |
"num_attention_heads": 8,
|
| 56 |
"num_hidden_layers": 4,
|
| 57 |
"pad_token_id": 0,
|
| 58 |
"tie_word_embeddings": true,
|
| 59 |
-
"tokenizer_variant": "
|
| 60 |
-
"transformers_version": "5.8.
|
| 61 |
"type_vocab_size": 2,
|
| 62 |
"use_cache": false,
|
| 63 |
-
"vocab_size":
|
| 64 |
}
|
|
|
|
| 50 |
},
|
| 51 |
"layer_norm_eps": 1e-12,
|
| 52 |
"max_position_embeddings": 128,
|
| 53 |
+
"max_seq_length": 128,
|
| 54 |
"model_type": "bert",
|
| 55 |
"num_attention_heads": 8,
|
| 56 |
"num_hidden_layers": 4,
|
| 57 |
"pad_token_id": 0,
|
| 58 |
"tie_word_embeddings": true,
|
| 59 |
+
"tokenizer_variant": "char",
|
| 60 |
+
"transformers_version": "5.8.1",
|
| 61 |
"type_vocab_size": 2,
|
| 62 |
"use_cache": false,
|
| 63 |
+
"vocab_size": 6199
|
| 64 |
}
|
data/parser_regression_cases.json
ADDED
|
@@ -0,0 +1,232 @@
|
|
| 1 |
+
[
|
| 2 |
+
{
|
| 3 |
+
"id": "lolihouse_dash_episode",
|
| 4 |
+
"filename": "[LoliHouse] Yomi no Tsugai - 07 [WebRip 1080p HEVC-10bit AAC ASSx2]",
|
| 5 |
+
"expected": {
|
| 6 |
+
"group": "LoliHouse",
|
| 7 |
+
"title": "Yomi no Tsugai",
|
| 8 |
+
"episode": 7,
|
| 9 |
+
"resolution": "1080p",
|
| 10 |
+
"source": "WebRip"
|
| 11 |
+
}
|
| 12 |
+
},
|
| 13 |
+
{
|
| 14 |
+
"id": "dot_season_episode_no_group",
|
| 15 |
+
"filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub",
|
| 16 |
+
"expected": {
|
| 17 |
+
"title": "Witch.Hat.Atelier",
|
| 18 |
+
"season": 1,
|
| 19 |
+
"episode": 7,
|
| 20 |
+
"group": null,
|
| 21 |
+
"resolution": "1080p",
|
| 22 |
+
"source": "NF"
|
| 23 |
+
}
|
| 24 |
+
},
|
| 25 |
+
{
|
| 26 |
+
"id": "ani_cjk_season_dash_episode",
|
| 27 |
+
"filename": "[ANi] 異世界悠閒農家 2 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]",
|
| 28 |
+
"expected": {
|
| 29 |
+
"group": "ANi",
|
| 30 |
+
"title": "異世界悠閒農家",
|
| 31 |
+
"season": 2,
|
| 32 |
+
"episode": 6,
|
| 33 |
+
"resolution": "1080P",
|
| 34 |
+
"source": "Baha"
|
| 35 |
+
}
|
| 36 |
+
},
|
| 37 |
+
{
|
| 38 |
+
"id": "kisssub_bracket_title_episode",
|
| 39 |
+
"filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][GB][MP4]",
|
| 40 |
+
"expected": {
|
| 41 |
+
"group": "KissSub",
|
| 42 |
+
"title": "Shunkashuutou Daikousha - Haru no Mai",
|
| 43 |
+
"episode": 5,
|
| 44 |
+
"resolution": "1080P",
|
| 45 |
+
"source": "GB"
|
| 46 |
+
}
|
| 47 |
+
},
|
| 48 |
+
{
|
| 49 |
+
"id": "airotabracket_title_episode",
|
| 50 |
+
"filename": "[Airota][Sousou no Frieren][29][1080p AVC AAC][CHT]",
|
| 51 |
+
"expected": {
|
| 52 |
+
"group": "Airota",
|
| 53 |
+
"title": "Sousou no Frieren",
|
| 54 |
+
"episode": 29,
|
| 55 |
+
"resolution": "1080p",
|
| 56 |
+
"source": "CHT"
|
| 57 |
+
}
|
| 58 |
+
},
|
| 59 |
+
{
|
| 60 |
+
"id": "subsplease_parenthesized_resolution",
|
| 61 |
+
"filename": "[SubsPlease] Mushoku Tensei - 12 (1080p) [x265][AAC]",
|
| 62 |
+
"expected": {
|
| 63 |
+
"group": "SubsPlease",
|
| 64 |
+
"title": "Mushoku Tensei",
|
| 65 |
+
"episode": 12,
|
| 66 |
+
"resolution": "1080p"
|
| 67 |
+
}
|
| 68 |
+
},
|
| 69 |
+
{
|
| 70 |
+
"id": "vcb_bracket_episode",
|
| 71 |
+
"filename": "[VCB-Studio] Girls Band Cry [01][Ma10p_1080p][x265_flac]",
|
| 72 |
+
"expected": {
|
| 73 |
+
"group": "VCB-Studio",
|
| 74 |
+
"title": "Girls Band Cry",
|
| 75 |
+
"episode": 1,
|
| 76 |
+
"resolution": "1080p"
|
| 77 |
+
}
|
| 78 |
+
},
|
| 79 |
+
{
|
| 80 |
+
"id": "numeric_title_not_episode",
|
| 81 |
+
"filename": "86 Eighty Six - 01 [1080P][Baha]",
|
| 82 |
+
"expected": {
|
| 83 |
+
"title": "86 Eighty Six",
|
| 84 |
+
"episode": 1,
|
| 85 |
+
"resolution": "1080P",
|
| 86 |
+
"source": "Baha"
|
| 87 |
+
}
|
| 88 |
+
},
|
| 89 |
+
{
|
| 90 |
+
"id": "erai_raws_dash_episode",
|
| 91 |
+
"filename": "[Erai-raws] Sousou no Frieren - 01 [1080p][Multiple Subtitle][ENG]",
|
| 92 |
+
"expected": {
|
| 93 |
+
"group": "Erai-raws",
|
| 94 |
+
"title": "Sousou no Frieren",
|
| 95 |
+
"episode": 1,
|
| 96 |
+
"resolution": "1080p"
|
| 97 |
+
}
|
| 98 |
+
},
|
| 99 |
+
{
|
| 100 |
+
"id": "nekomoe_space_group",
|
| 101 |
+
"filename": "[Nekomoe kissaten][Watashi no Shiawase na Kekkon][01][1080p][JPSC]",
|
| 102 |
+
"expected": {
|
| 103 |
+
"group": "Nekomoe kissaten",
|
| 104 |
+
"title": "Watashi no Shiawase na Kekkon",
|
| 105 |
+
"episode": 1,
|
| 106 |
+
"resolution": "1080p"
|
| 107 |
+
}
|
| 108 |
+
},
|
| 109 |
+
{
|
| 110 |
+
"id": "long_running_episode",
|
| 111 |
+
"filename": "One.Piece.1110.1080p.WEB-DL.AAC2.0.H.264",
|
| 112 |
+
"expected": {
|
| 113 |
+
"title": "One.Piece",
|
| 114 |
+
"episode": 1110,
|
| 115 |
+
"resolution": "1080p",
|
| 116 |
+
"source": "WEB-DL"
|
| 117 |
+
}
|
| 118 |
+
},
|
| 119 |
+
{
|
| 120 |
+
"id": "season_episode_amzn",
|
| 121 |
+
"filename": "Example.Show.S02E03.2160p.AMZN.WEB-DL.DDP5.1.H.265",
|
| 122 |
+
"expected": {
|
| 123 |
+
"title": "Example.Show",
|
| 124 |
+
"season": 2,
|
| 125 |
+
"episode": 3,
|
| 126 |
+
"resolution": "2160p",
|
| 127 |
+
"source": "AMZN"
|
| 128 |
+
}
|
| 129 |
+
},
|
| 130 |
+
{
|
| 131 |
+
"id": "cjk_group_with_prefix_tag",
|
| 132 |
+
"filename": "【喵萌奶茶屋】★04月新番★[葬送的芙莉莲][01][1080P][HEVC]",
|
| 133 |
+
"expected": {
|
| 134 |
+
"group": "喵萌奶茶屋",
|
| 135 |
+
"title": "葬送的芙莉莲",
|
| 136 |
+
"episode": 1,
|
| 137 |
+
"resolution": "1080P"
|
| 138 |
+
}
|
| 139 |
+
},
|
| 140 |
+
{
|
| 141 |
+
"id": "leading_meta_not_group",
|
| 142 |
+
"filename": "[1080p] Witch Watch - 15 [CHS]",
|
| 143 |
+
"expected": {
|
| 144 |
+
"group": null,
|
| 145 |
+
"title": "Witch Watch",
|
| 146 |
+
"episode": 15,
|
| 147 |
+
"resolution": "1080p",
|
| 148 |
+
"source": "CHS"
|
| 149 |
+
}
|
| 150 |
+
},
|
| 151 |
+
{
|
| 152 |
+
"id": "sakurato_group_language_source",
|
| 153 |
+
"filename": "[Sakurato] Witch Watch - 15 [1080p][CHS]",
|
| 154 |
+
"expected": {
|
| 155 |
+
"group": "Sakurato",
|
| 156 |
+
"title": "Witch Watch",
|
| 157 |
+
"episode": 15,
|
| 158 |
+
"resolution": "1080p",
|
| 159 |
+
"source": "CHS"
|
| 160 |
+
}
|
| 161 |
+
},
|
| 162 |
+
{
|
| 163 |
+
"id": "billion_meta_lab_search_special",
|
| 164 |
+
"filename": "[Billion Meta Lab] 魔法姊妹露露莉莉 Mahou no Shimai Rurutto Riryi [07][1080P][CHT&JPN][檢索:魔法姊妹露露特莉莉].mp4",
|
| 165 |
+
"expected": {
|
| 166 |
+
"group": "Billion Meta Lab",
|
| 167 |
+
"title": "魔法姊妹露露莉莉 Mahou no Shimai Rurutto Riryi",
|
| 168 |
+
"episode": 7,
|
| 169 |
+
"resolution": "1080P",
|
| 170 |
+
"source": "CHT&JPN",
|
| 171 |
+
"special": "檢索:魔法姊妹露露特莉莉"
|
| 172 |
+
}
|
| 173 |
+
},
|
| 174 |
+
{
|
| 175 |
+
"id": "studio_greentea_s2_bracket_episode",
|
| 176 |
+
"filename": "[Studio GreenTea] Otonari no Tenshi-sama ni Itsunomanika Dame Ningen ni Sareteita Ken S2 [06][WebRip][HEVC-10bit 1080p AAC][JPSC].mp4",
|
| 177 |
+
"expected": {
|
| 178 |
+
"group": "Studio GreenTea",
|
| 179 |
+
"title": "Otonari no Tenshi-sama ni Itsunomanika Dame Ningen ni Sareteita Ken",
|
| 180 |
+
"season": 2,
|
| 181 |
+
"episode": 6,
|
| 182 |
+
"resolution": "1080p",
|
| 183 |
+
"source": "WebRip"
|
| 184 |
+
}
|
| 185 |
+
},
|
| 186 |
+
{
|
| 187 |
+
"id": "lolihouse_kakuriyo_bare_ni_season",
|
| 188 |
+
"filename": "[LoliHouse] Kakuriyo no Yadomeshi Ni - 12 [WebRip 1080p HEVC-10bit AAC SRTx2].mkv",
|
| 189 |
+
"expected": {
|
| 190 |
+
"group": "LoliHouse",
|
| 191 |
+
"title": "Kakuriyo no Yadomeshi",
|
| 192 |
+
"season": 2,
|
| 193 |
+
"episode": 12,
|
| 194 |
+
"resolution": "1080p",
|
| 195 |
+
"source": "WebRip"
|
| 196 |
+
}
|
| 197 |
+
},
|
| 198 |
+
{
|
| 199 |
+
"id": "ani_kakuriyo_traditional_ni",
|
| 200 |
+
"filename": "[ANi] 妖怪旅館營業中 貳 - 11 [1080P][Baha][WEB-DL][AAC AVC][CHT].mp4",
|
| 201 |
+
"expected": {
|
| 202 |
+
"group": "ANi",
|
| 203 |
+
"title": "妖怪旅館營業中",
|
| 204 |
+
"season": 2,
|
| 205 |
+
"episode": 11,
|
| 206 |
+
"resolution": "1080P",
|
| 207 |
+
"source": "Baha"
|
| 208 |
+
}
|
| 209 |
+
},
|
| 210 |
+
{
|
| 211 |
+
"id": "jibaketa_shokugeki_ni_no_sara",
|
| 212 |
+
"filename": "[jibaketa]Shokugeki no Souma Ni no Sara - 13 END [BD 1920x1080 x264 AACx2 SRT TVB CHT].mkv",
|
| 213 |
+
"expected": {
|
| 214 |
+
"group": "jibaketa",
|
| 215 |
+
"title": "Shokugeki no Souma",
|
| 216 |
+
"season": 2,
|
| 217 |
+
"episode": 13,
|
| 218 |
+
"resolution": "1920x1080"
|
| 219 |
+
}
|
| 220 |
+
},
|
| 221 |
+
{
|
| 222 |
+
"id": "ai_raws_fire_force_cjk_season_hash_episode",
|
| 223 |
+
"filename": "[AI-Raws] 炎炎の消防隊 弐ノ章 #13 (BD HEVC 1920x1080 yuv444p10le FLAC)[FC74A2D5].mkv",
|
| 224 |
+
"expected": {
|
| 225 |
+
"group": "AI-Raws",
|
| 226 |
+
"title": "炎炎の消防隊",
|
| 227 |
+
"season": 2,
|
| 228 |
+
"episode": 13,
|
| 229 |
+
"resolution": "1920x1080"
|
| 230 |
+
}
|
| 231 |
+
}
|
| 232 |
+
]
|
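Each regression case above is an object with an `id`, a `filename`, and an `expected` mapping whose keys are drawn from `group`, `title`, `season`, `episode`, `resolution`, `source`, and `special`. A small sanity check over that shape might look like the sketch below. The function `validate_cases` and its rules (unique ids, integer `season` and `episode`) are assumptions inferred from the cases shown here and are not part of the repository.

```python
import json
from typing import Any, Dict, List

# Field names observed in the regression cases; treated as the allowed set (assumption).
ALLOWED_FIELDS = {"group", "title", "season", "episode", "resolution", "source", "special"}
INT_FIELDS = {"season", "episode"}


def validate_cases(cases: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems; an empty list means the file looks consistent."""
    problems: List[str] = []
    seen_ids = set()
    for index, case in enumerate(cases):
        case_id = case.get("id", f"<index {index}>")
        if case_id in seen_ids:
            problems.append(f"{case_id}: duplicate id")
        seen_ids.add(case_id)
        if not case.get("filename"):
            problems.append(f"{case_id}: missing filename")
        for field, value in case.get("expected", {}).items():
            if field not in ALLOWED_FIELDS:
                problems.append(f"{case_id}: unknown field {field!r}")
            elif field in INT_FIELDS and not isinstance(value, int):
                problems.append(f"{case_id}: {field} should be an integer")
    return problems


# One case copied from the file above, parsed from a JSON string.
cases = json.loads("""
[
  {
    "id": "leading_meta_not_group",
    "filename": "[1080p] Witch Watch - 15 [CHS]",
    "expected": {"group": null, "title": "Witch Watch", "episode": 15,
                 "resolution": "1080p", "source": "CHS"}
  }
]
""")
problems = validate_cases(cases)
```

In practice the list would come from `json.load` on `data/parser_regression_cases.json`. The inline string only keeps the sketch self-contained.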
dataset.py
CHANGED
|
@@ -6,11 +6,13 @@ Handles token-ID conversion, label encoding, padding, and truncation.
|
|
| 6 |
"""
|
| 7 |
|
| 8 |
import json
|
|
|
|
| 9 |
import torch
|
| 10 |
from torch.utils.data import Dataset
|
| 11 |
-
from typing import Dict, List, Optional
|
| 12 |
|
| 13 |
from config import Config
|
|
|
|
| 14 |
from tokenizer import AnimeTokenizer
|
| 15 |
|
| 16 |
|
|
@@ -62,9 +64,7 @@ class AnimeDataset(Dataset):
|
|
| 62 |
Dictionary with input_ids, attention_mask, labels as LongTensors.
|
| 63 |
"""
|
| 64 |
item = self.data[idx]
|
| 65 |
-
tokens
|
| 66 |
-
labels: List[str] = item["labels"]
|
| 67 |
-
tokens, labels = align_tokens_for_tokenizer(tokens, labels, self.tokenizer)
|
| 68 |
|
| 69 |
# Convert tokens to IDs
|
| 70 |
input_ids = self.tokenizer.convert_tokens_to_ids(tokens)
|
|
@@ -137,6 +137,146 @@ def align_tokens_for_tokenizer(
|
|
| 137 |
return aligned_tokens, aligned_labels
|
| 138 |
|
| 139 |
|
| 140 |
def create_datasets(
|
| 141 |
data_path: str,
|
| 142 |
tokenizer: AnimeTokenizer,
|
|
|
|
| 6 |
"""
|
| 7 |
|
| 8 |
import json
|
| 9 |
+
from collections import Counter
|
| 10 |
import torch
|
| 11 |
from torch.utils.data import Dataset
|
| 12 |
+
from typing import Dict, List, Optional, Tuple
|
| 13 |
|
| 14 |
from config import Config
|
| 15 |
+
from label_repairs import repair_sequel_season_labels
|
| 16 |
from tokenizer import AnimeTokenizer
|
| 17 |
|
| 18 |
|
|
|
|
| 64 |
Dictionary with input_ids, attention_mask, labels as LongTensors.
|
| 65 |
"""
|
| 66 |
item = self.data[idx]
|
| 67 |
+
tokens, labels = labels_for_tokenizer(item, self.tokenizer)
|
|
|
|
|
|
|
| 68 |
|
| 69 |
# Convert tokens to IDs
|
| 70 |
input_ids = self.tokenizer.convert_tokens_to_ids(tokens)
|
|
|
|
| 137 |
return aligned_tokens, aligned_labels
|
| 138 |
|
| 139 |
|
| 140 |
+
def labels_for_tokenizer(
|
| 141 |
+
item: Dict,
|
| 142 |
+
tokenizer: AnimeTokenizer,
|
| 143 |
+
) -> Tuple[List[str], List[str]]:
|
| 144 |
+
"""
|
| 145 |
+
Return tokens and labels in the exact tokenizer space used by the model.
|
| 146 |
+
|
| 147 |
+
Older DMHY weak-label files store a post-processed token sequence where
|
| 148 |
+
group/title brackets may be expanded even though AnimeTokenizer keeps the
|
| 149 |
+
same bracketed text as one inference token. If the raw filename is present,
|
| 150 |
+
project those weak labels back to character spans and then onto the current
|
| 151 |
+
tokenizer output. This keeps train/eval/inference preprocessing identical.
|
| 152 |
+
"""
|
| 153 |
+
filename = item.get("filename")
|
| 154 |
+
source_tokens, source_labels, _repairs = repair_sequel_season_labels(item)
|
| 155 |
+
tokenizer_variant = getattr(tokenizer, "tokenizer_variant", "regex")
|
| 156 |
+
|
| 157 |
+
if not filename:
|
| 158 |
+
return align_tokens_for_tokenizer(source_tokens, source_labels, tokenizer)
|
| 159 |
+
|
| 160 |
+
# Current char datasets are already in the exact inference token space.
|
| 161 |
+
# Avoid re-scanning every filename during training.
|
| 162 |
+
if item.get("tokenizer_variant") == tokenizer_variant:
|
| 163 |
+
target_tokens = tokenizer.tokenize(filename)
|
| 164 |
+
if source_tokens == target_tokens:
|
| 165 |
+
return source_tokens, source_labels
|
| 166 |
+
|
| 167 |
+
projected = project_labels_from_filename(
|
| 168 |
+
filename=filename,
|
| 169 |
+
source_tokens=source_tokens,
|
| 170 |
+
source_labels=source_labels,
|
| 171 |
+
tokenizer=tokenizer,
|
| 172 |
+
)
|
| 173 |
+
if projected is not None:
|
| 174 |
+
return projected
|
| 175 |
+
|
| 176 |
+
# Fall back to the legacy behavior for synthetic fixtures or malformed rows.
|
| 177 |
+
return align_tokens_for_tokenizer(source_tokens, source_labels, tokenizer)
|
| 178 |
+
|
| 179 |
+
|
| 180 |
+
def token_offsets_in_text(text: str, tokens: List[str]) -> Optional[List[Tuple[int, int]]]:
|
| 181 |
+
"""Find token character offsets by scanning left to right."""
|
| 182 |
+
offsets: List[Tuple[int, int]] = []
|
| 183 |
+
cursor = 0
|
| 184 |
+
for token in tokens:
|
| 185 |
+
if token == "":
|
| 186 |
+
offsets.append((cursor, cursor))
|
| 187 |
+
continue
|
| 188 |
+
start = text.find(token, cursor)
|
| 189 |
+
if start < 0:
|
| 190 |
+
return None
|
| 191 |
+
end = start + len(token)
|
| 192 |
+
offsets.append((start, end))
|
| 193 |
+
cursor = end
|
| 194 |
+
return offsets
|
| 195 |
+
|
| 196 |
+
|
| 197 |
+
def project_source_labels_to_chars(
|
| 198 |
+
text: str,
|
| 199 |
+
source_tokens: List[str],
|
| 200 |
+
source_labels: List[str],
|
| 201 |
+
) -> Optional[List[str]]:
|
| 202 |
+
"""Project source token BIO labels to per-character entity names."""
|
| 203 |
+
offsets = token_offsets_in_text(text, source_tokens)
|
| 204 |
+
if offsets is None or len(source_tokens) != len(source_labels):
|
| 205 |
+
return None
|
| 206 |
+
|
| 207 |
+
char_entities = ["O"] * len(text)
|
| 208 |
+
for token, label, (start, end) in zip(source_tokens, source_labels, offsets):
|
| 209 |
+
if not label.startswith(("B-", "I-")):
|
| 210 |
+
continue
|
| 211 |
+
entity = label.split("-", 1)[1]
|
| 212 |
+
|
| 213 |
+
# Bracketed single-token metadata in older data often includes the
|
| 214 |
+
# brackets in the token. Keep container punctuation as O so a tokenizer
|
| 215 |
+
# that splits brackets can learn cleaner boundaries.
|
| 216 |
+
inner_start = start
|
| 217 |
+
inner_end = end
|
| 218 |
+
if len(token) >= 2 and token[0] in "[【(《" and token[-1] in "]】)》":
|
| 219 |
+
inner_start += 1
|
| 220 |
+
inner_end -= 1
|
| 221 |
+
|
| 222 |
+
for pos in range(inner_start, inner_end):
|
| 223 |
+
if 0 <= pos < len(char_entities):
|
| 224 |
+
char_entities[pos] = entity
|
| 225 |
+
return char_entities
|
| 226 |
+
|
| 227 |
+
|
| 228 |
+
def labels_from_char_projection(
|
| 229 |
+
text: str,
|
| 230 |
+
target_tokens: List[str],
|
| 231 |
+
char_entities: List[str],
|
| 232 |
+
) -> Optional[List[str]]:
|
| 233 |
+
"""Assign legal IOB2 labels to target tokens from per-character entities."""
|
| 234 |
+
offsets = token_offsets_in_text(text, target_tokens)
|
| 235 |
+
if offsets is None:
|
| 236 |
+
return None
|
| 237 |
+
|
| 238 |
+
labels: List[str] = []
|
| 239 |
+
active_entity: Optional[str] = None
|
| 240 |
+
for start, end in offsets:
|
| 241 |
+
span_entities = [
|
| 242 |
+
char_entities[pos]
|
| 243 |
+
for pos in range(start, end)
|
| 244 |
+
if 0 <= pos < len(char_entities) and char_entities[pos] != "O"
|
| 245 |
+
]
|
| 246 |
+
if not span_entities:
|
| 247 |
+
labels.append("O")
|
| 248 |
+
active_entity = None
|
| 249 |
+
continue
|
| 250 |
+
|
| 251 |
+
entity = Counter(span_entities).most_common(1)[0][0]
|
| 252 |
+
prefix = "I" if active_entity == entity else "B"
|
| 253 |
+
labels.append(f"{prefix}-{entity}")
|
| 254 |
+
active_entity = entity
|
| 255 |
+
return labels
|
| 256 |
+
|
| 257 |
+
|
| 258 |
+
def project_labels_from_filename(
|
| 259 |
+
filename: str,
|
| 260 |
+
source_tokens: List[str],
|
| 261 |
+
source_labels: List[str],
|
| 262 |
+
tokenizer: AnimeTokenizer,
|
| 263 |
+
) -> Optional[Tuple[List[str], List[str]]]:
|
| 264 |
+
"""
|
| 265 |
+
Re-tokenize filename and project weak BIO labels onto that tokenizer.
|
| 266 |
+
|
| 267 |
+
Returns None when source tokens cannot be aligned to the filename.
|
| 268 |
+
"""
|
| 269 |
+
char_entities = project_source_labels_to_chars(filename, source_tokens, source_labels)
|
| 270 |
+
if char_entities is None:
|
| 271 |
+
return None
|
| 272 |
+
|
| 273 |
+
target_tokens = tokenizer.tokenize(filename)
|
| 274 |
+
target_labels = labels_from_char_projection(filename, target_tokens, char_entities)
|
| 275 |
+
if target_labels is None or len(target_tokens) != len(target_labels):
|
| 276 |
+
return None
|
| 277 |
+
return target_tokens, target_labels
|
| 278 |
+
|
| 279 |
+
|
| 280 |
def create_datasets(
|
| 281 |
data_path: str,
|
| 282 |
tokenizer: AnimeTokenizer,
|
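The projection added to `dataset.py` above works in two passes: source token labels are spread onto characters (skipping the enclosing brackets of a bracketed token), and each target token then takes the majority entity of the characters it covers. The condensed sketch below illustrates that idea on a toy filename. It is not the repository code: the function names are shortened, the label names `GROUP`, `TITLE`, and `EPISODE` are assumed for illustration, and the target token list is written by hand rather than produced by `AnimeTokenizer`.

```python
from collections import Counter
from typing import List, Optional, Tuple

OPENERS, CLOSERS = "[【(《", "]】)》"


def offsets(text: str, tokens: List[str]) -> Optional[List[Tuple[int, int]]]:
    """Locate each token in text by scanning left to right; None if one is missing."""
    spans, cursor = [], 0
    for token in tokens:
        start = text.find(token, cursor)
        if start < 0:
            return None
        cursor = start + len(token)
        spans.append((start, cursor))
    return spans


def to_char_entities(text: str, tokens: List[str], labels: List[str]) -> Optional[List[str]]:
    """Spread BIO token labels onto characters, leaving enclosing brackets as O."""
    spans = offsets(text, tokens)
    if spans is None or len(tokens) != len(labels):
        return None
    chars = ["O"] * len(text)
    for token, label, (start, end) in zip(tokens, labels, spans):
        if not label.startswith(("B-", "I-")):
            continue
        if len(token) >= 2 and token[0] in OPENERS and token[-1] in CLOSERS:
            start, end = start + 1, end - 1
        for pos in range(start, end):
            chars[pos] = label.split("-", 1)[1]
    return chars


def to_token_labels(text: str, tokens: List[str], chars: List[str]) -> Optional[List[str]]:
    """Give each target token the majority entity of its characters, as IOB2."""
    spans = offsets(text, tokens)
    if spans is None:
        return None
    labels, active = [], None
    for start, end in spans:
        inside = [entity for entity in chars[start:end] if entity != "O"]
        if not inside:
            labels.append("O")
            active = None
            continue
        entity = Counter(inside).most_common(1)[0][0]
        labels.append(("I-" if entity == active else "B-") + entity)
        active = entity
    return labels


text = "[ANi] Witch Watch - 15"
source_tokens = ["[ANi]", "Witch", "Watch", "-", "15"]  # bracket kept as one token
source_labels = ["B-GROUP", "B-TITLE", "I-TITLE", "O", "B-EPISODE"]
target_tokens = ["[", "ANi", "]", "Witch", "Watch", "-", "15"]  # brackets split out

chars = to_char_entities(text, source_tokens, source_labels)
target_labels = to_token_labels(text, target_tokens, chars)
# target_labels == ["O", "B-GROUP", "O", "B-TITLE", "I-TITLE", "O", "B-EPISODE"]
```

In this sketch an `O` token always resets the active entity. If the target tokenizer emitted the space between `Witch` and `Watch` as its own token, that token would be labeled `O` and `Watch` would start a new `B-TITLE` span.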
datasets/AnimeName
CHANGED
|
@@ -1 +1 @@
|
|
| 1 |
-
Subproject commit
|
|
|
|
| 1 |
+
Subproject commit 8d2b6c9e639fde6be0e428e5f34f56fccd5aa2ea
|
diagnose_pipeline.py
CHANGED
|
@@ -27,7 +27,8 @@ from seqeval.metrics import classification_report, f1_score, precision_score, re
|
|
| 27 |
from transformers import BertForTokenClassification
|
| 28 |
|
| 29 |
from config import Config
|
| 30 |
-
from dataset import
|
|
|
|
| 31 |
from tokenizer import AnimeTokenizer, create_tokenizer, load_tokenizer
|
| 32 |
|
| 33 |
|
|
@@ -81,16 +82,6 @@ def bio_violations(tokens: List[str], labels: List[str]) -> List[dict]:
|
|
| 81 |
for idx, label in enumerate(labels):
|
| 82 |
token = tokens[idx] if idx < len(tokens) else None
|
| 83 |
if label == "O":
|
| 84 |
-
if previous_label.startswith("B-"):
|
| 85 |
-
violations.append(
|
| 86 |
-
{
|
| 87 |
-
"type": "B_DIRECT_TO_O",
|
| 88 |
-
"index": idx,
|
| 89 |
-
"prev_label": previous_label,
|
| 90 |
-
"label": label,
|
| 91 |
-
"token": token,
|
| 92 |
-
}
|
| 93 |
-
)
|
| 94 |
current_entity = None
|
| 95 |
elif label.startswith("B-"):
|
| 96 |
current_entity = entity_type(label)
|
|
@@ -124,6 +115,24 @@ def bio_violations(tokens: List[str], labels: List[str]) -> List[dict]:
|
|
| 124 |
return violations
|
| 125 |
|
| 126 |
|
| 127 |
def spans_from_labels(tokens: List[str], labels: List[str]) -> List[dict]:
|
| 128 |
spans: List[dict] = []
|
| 129 |
start: Optional[int] = None
|
|
@@ -241,7 +250,7 @@ def token_id_stats(samples: List[dict], tokenizer: AnimeTokenizer) -> dict:
|
|
| 241 |
unk = 0
|
| 242 |
unk_counter: Counter = Counter()
|
| 243 |
for sample in samples:
|
| 244 |
-
tokens,
|
| 245 |
ids = tokenizer.convert_tokens_to_ids(tokens)
|
| 246 |
for token, token_id in zip(tokens, ids):
|
| 247 |
total += 1
|
|
@@ -257,13 +266,12 @@ def token_id_stats(samples: List[dict], tokenizer: AnimeTokenizer) -> dict:
|
|
| 257 |
|
| 258 |
|
| 259 |
def prepare_inputs(
|
| 260 |
-
|
| 261 |
-
labels: List[str],
|
| 262 |
tokenizer: AnimeTokenizer,
|
| 263 |
label2id: Dict[str, int],
|
| 264 |
max_length: int,
|
| 265 |
) -> Tuple[List[int], List[int], List[int], List[str]]:
|
| 266 |
-
tokens, labels =
|
| 267 |
input_ids = tokenizer.convert_tokens_to_ids(tokens)
|
| 268 |
input_ids = [tokenizer.cls_token_id] + input_ids + [tokenizer.sep_token_id]
|
| 269 |
label_ids = [-100] + [label2id.get(label, 0) for label in labels] + [-100]
|
|
@@ -283,6 +291,48 @@ def prepare_inputs(
|
|
| 283 |
return input_ids, attention_mask, label_ids, tokens
|
| 284 |
|
| 285 |
|
| 286 |
def evaluate_model(
|
| 287 |
samples: List[dict],
|
| 288 |
model_dir: Path,
|
|
@@ -313,12 +363,15 @@ def evaluate_model(
|
|
| 313 |
confusion: Counter = Counter()
|
| 314 |
entity_confusion: Counter = Counter()
|
| 315 |
boundary_errors: Counter = Counter()
|
| 316 |
|
| 317 |
with torch.no_grad():
|
| 318 |
for sample in eval_samples:
|
| 319 |
-
input_ids, attention_mask, label_ids,
|
| 320 |
-
sample
|
| 321 |
-
sample["labels"],
|
| 322 |
tokenizer,
|
| 323 |
label2id,
|
| 324 |
max_length,
|
|
@@ -326,13 +379,17 @@ def evaluate_model(
|
|
| 326 |
input_tensor = torch.tensor([input_ids], dtype=torch.long, device=device)
|
| 327 |
mask_tensor = torch.tensor([attention_mask], dtype=torch.long, device=device)
|
| 328 |
logits = model(input_ids=input_tensor, attention_mask=mask_tensor).logits
|
| 329 |
-
|
| 330 |
|
| 331 |
true_labels: List[str] = []
|
| 332 |
pred_labels: List[str] = []
|
| 333 |
-
|
| 334 |
if label_id == -100:
|
| 335 |
continue
|
| 336 |
true_label = id2label.get(label_id, "O")
|
| 337 |
pred_label = id2label.get(pred_id, "O")
|
| 338 |
true_labels.append(true_label)
|
|
@@ -348,6 +405,57 @@ def evaluate_model(
|
|
| 348 |
boundary_errors["BIO-prefix"] += 1
|
| 349 |
true_sequences.append(true_labels)
|
| 350 |
pred_sequences.append(pred_labels)
|
| 351 |
|
| 352 |
errors = confusion.copy()
|
| 353 |
for label in set(label for pair in confusion for label in pair):
|
|
@@ -364,6 +472,10 @@ def evaluate_model(
|
|
| 364 |
{k: v for k, v in entity_confusion.items() if k[0] != k[1]}
|
| 365 |
).most_common(30),
|
| 366 |
"boundary_errors": boundary_errors,
|
| 367 |
}
|
| 368 |
|
| 369 |
|
|
@@ -444,6 +556,7 @@ def main() -> None:
|
|
| 444 |
length_values: List[int] = []
|
| 445 |
aligned_length_values: List[int] = []
|
| 446 |
violations: List[dict] = []
|
| 447 |
mismatch_examples: List[dict] = []
|
| 448 |
space_label_counter: Counter = Counter()
|
| 449 |
boundary_drift_counter: Counter = Counter()
|
|
@@ -472,7 +585,7 @@ def main() -> None:
|
|
| 472 |
|
| 473 |
label_counter.update(labels)
|
| 474 |
length_values.append(len(tokens))
|
| 475 |
-
aligned_tokens, aligned_labels =
|
| 476 |
aligned_length_values.append(len(aligned_tokens))
|
| 477 |
if len(aligned_tokens) + 2 > max_length:
|
| 478 |
truncation_count += 1
|
|
@@ -490,6 +603,17 @@ def main() -> None:
|
|
| 490 |
}
|
| 491 |
)
|
| 492 |
violations.append(violation)
|
| 493 |
for span in spans_from_labels(tokens, labels):
|
| 494 |
text = span["text"]
|
| 495 |
if span["type"] == "TITLE":
|
|
@@ -594,19 +718,26 @@ def main() -> None:
|
|
| 594 |
)
|
| 595 |
|
| 596 |
violation_counter = Counter(v["type"] for v in violations)
|
| 597 |
sections.append(
|
| 598 |
(
|
| 599 |
"BIO Violations And Boundary Drift",
|
| 600 |
"\n".join(
|
| 601 |
[
|
| 602 |
-
"###
|
| 603 |
format_counter(violation_counter),
|
| 604 |
"",
|
| 605 |
"### Boundary drift heuristics",
|
| 606 |
format_counter(boundary_drift_counter),
|
| 607 |
"",
|
| 608 |
"### Sample violations",
|
| 609 |
markdown_json(violations[:30]),
|
| 610 |
]
|
| 611 |
),
|
| 612 |
)
|
|
@@ -659,6 +790,29 @@ def main() -> None:
|
|
| 659 |
[true, pred, f"{count:,}"]
|
| 660 |
for (true, pred), count in model_eval["top_entity_confusions"]
|
| 661 |
]
|
| 662 |
sections.append(
|
| 663 |
(
|
| 664 |
"Model Confusion Analysis",
|
|
@@ -678,6 +832,28 @@ def main() -> None:
|
|
| 678 |
"### Top entity-type confusions",
|
| 679 |
markdown_table(["true", "pred", "count"], entity_rows) if entity_rows else "- none",
|
| 680 |
"",
|
| 681 |
"### Seqeval report",
|
| 682 |
"```text\n" + model_eval["classification_report"] + "\n```",
|
| 683 |
]
|
|
|
|
| 27 |
from transformers import BertForTokenClassification
|
| 28 |
|
| 29 |
from config import Config
|
| 30 |
+
from dataset import labels_for_tokenizer
|
| 31 |
+
from inference import constrained_bio_decode, postprocess
|
| 32 |
from tokenizer import AnimeTokenizer, create_tokenizer, load_tokenizer
|
| 33 |
|
| 34 |
|
|
|
|
| 82 |
for idx, label in enumerate(labels):
|
| 83 |
token = tokens[idx] if idx < len(tokens) else None
|
| 84 |
if label == "O":
|
| 85 |
current_entity = None
|
| 86 |
elif label.startswith("B-"):
|
| 87 |
current_entity = entity_type(label)
|
|
|
|
| 115 |
return violations
|
| 116 |
|
| 117 |
|
| 118 |
+
def bio_boundary_warnings(tokens: List[str], labels: List[str]) -> List[dict]:
|
| 119 |
+
"""Collect legal-but-suspicious boundary patterns separately from BIO errors."""
|
| 120 |
+
warnings: List[dict] = []
|
| 121 |
+
for idx, label in enumerate(labels[1:], 1):
|
| 122 |
+
previous_label = labels[idx - 1]
|
| 123 |
+
if label == "O" and previous_label.startswith("B-"):
|
| 124 |
+
warnings.append(
|
| 125 |
+
{
|
| 126 |
+
"type": "SINGLE_TOKEN_ENTITY",
|
| 127 |
+
"index": idx,
|
| 128 |
+
"prev_label": previous_label,
|
| 129 |
+
"label": label,
|
| 130 |
+
"token": tokens[idx] if idx < len(tokens) else None,
|
| 131 |
+
}
|
| 132 |
+
)
|
| 133 |
+
return warnings
|
| 134 |
+
|
| 135 |
+
|
| 136 |
def spans_from_labels(tokens: List[str], labels: List[str]) -> List[dict]:
|
| 137 |
spans: List[dict] = []
|
| 138 |
start: Optional[int] = None
|
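The `bio_boundary_warnings` helper added in this hunk can be tried on its own. The sketch below copies the function body from the diff; the sample tokens and the label names `B-GROUP`, `B-TITLE`, and `B-EPISODE` are assumptions made up for illustration and are not taken from the dataset.

```python
from typing import List


def bio_boundary_warnings(tokens: List[str], labels: List[str]) -> List[dict]:
    """Flag a B-* label directly followed by O (legal BIO, but worth a look)."""
    warnings: List[dict] = []
    for idx, label in enumerate(labels[1:], 1):
        previous_label = labels[idx - 1]
        if label == "O" and previous_label.startswith("B-"):
            warnings.append(
                {
                    "type": "SINGLE_TOKEN_ENTITY",
                    "index": idx,  # position of the O token, not of the entity
                    "prev_label": previous_label,
                    "label": label,
                    "token": tokens[idx] if idx < len(tokens) else None,
                }
            )
    return warnings


# Hypothetical sample: the token split and label names are illustrative only.
tokens = ["[", "ANi", "]", "Frieren", "-", "07"]
labels = ["O", "B-GROUP", "O", "B-TITLE", "O", "B-EPISODE"]

found = bio_boundary_warnings(tokens, labels)
for item in found:
    print(item["index"], item["prev_label"], repr(item["token"]))
```

Note that the reported `index` and `token` belong to the `O` token that follows the entity, and that a single-token entity at the very end of the sequence is not reported because no `O` follows it.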
|
|
|
| 250 |
unk = 0
|
| 251 |
unk_counter: Counter = Counter()
|
| 252 |
for sample in samples:
|
| 253 |
+
tokens, _labels = labels_for_tokenizer(sample, tokenizer)
|
| 254 |
ids = tokenizer.convert_tokens_to_ids(tokens)
|
| 255 |
for token, token_id in zip(tokens, ids):
|
| 256 |
total += 1
|
|
|
|
| 266 |
|
| 267 |
|
| 268 |
def prepare_inputs(
|
| 269 |
+
sample: dict,
|
| 270 |
tokenizer: AnimeTokenizer,
|
| 271 |
label2id: Dict[str, int],
|
| 272 |
max_length: int,
|
| 273 |
) -> Tuple[List[int], List[int], List[int], List[str]]:
|
| 274 |
+
tokens, labels = labels_for_tokenizer(sample, tokenizer)
|
| 275 |
input_ids = tokenizer.convert_tokens_to_ids(tokens)
|
| 276 |
input_ids = [tokenizer.cls_token_id] + input_ids + [tokenizer.sep_token_id]
|
| 277 |
label_ids = [-100] + [label2id.get(label, 0) for label in labels] + [-100]
|
|
|
|
| 291 |
return input_ids, attention_mask, label_ids, tokens
|
| 292 |
|
| 293 |
|
| 294 |
+
def normalize_field_value(field: str, value) -> Optional[str]:
|
| 295 |
+
if value is None:
|
| 296 |
+
return None
|
| 297 |
+
if field in {"episode", "season"}:
|
| 298 |
+
try:
|
| 299 |
+
return str(int(value))
|
| 300 |
+
except (TypeError, ValueError):
|
| 301 |
+
return str(value).strip().lower()
|
| 302 |
+
text = str(value).strip()
|
| 303 |
+
if field in {"resolution", "source"}:
|
| 304 |
+
return text.lower().replace("_", "-")
|
| 305 |
+
return re.sub(r"\s+", " ", text).strip().lower()
|
| 306 |
+
|
| 307 |
+
|
| 308 |
+
def update_parse_metrics(counter: Counter, gold: dict, pred: dict) -> None:
|
| 309 |
+
fields = ["group", "title", "season", "episode", "resolution", "source", "special"]
|
| 310 |
+
all_match = True
|
| 311 |
+
for field in fields:
|
| 312 |
+
gold_value = normalize_field_value(field, gold.get(field))
|
| 313 |
+
pred_value = normalize_field_value(field, pred.get(field))
|
| 314 |
+
if gold_value == pred_value:
|
| 315 |
+
counter[f"{field}_correct"] += 1
|
| 316 |
+
else:
|
| 317 |
+
all_match = False
|
| 318 |
+
counter[(field, gold_value, pred_value)] += 1
|
| 319 |
+
counter[f"{field}_total"] += 1
|
| 320 |
+
if all_match:
|
| 321 |
+
counter["full_match_correct"] += 1
|
| 322 |
+
counter["full_match_total"] += 1
|
| 323 |
+
|
| 324 |
+
|
| 325 |
+
def collect_field_failures(gold: dict, pred: dict) -> Dict[str, Dict[str, Optional[str]]]:
|
| 326 |
+
return {
|
| 327 |
+
field: {
|
| 328 |
+
"gold": normalize_field_value(field, gold.get(field)),
|
| 329 |
+
"pred": normalize_field_value(field, pred.get(field)),
|
| 330 |
+
}
|
| 331 |
+
for field in ["group", "title", "season", "episode", "resolution", "source", "special"]
|
| 332 |
+
if normalize_field_value(field, gold.get(field)) != normalize_field_value(field, pred.get(field))
|
| 333 |
+
}
|
| 334 |
+
|
| 335 |
+
|
| 336 |
def evaluate_model(
|
| 337 |
samples: List[dict],
|
| 338 |
model_dir: Path,
|
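To see how the field-level exact-match counters behave, the two helpers added above can be run on a hand-written pair of parses. The function bodies are copied from the diff; the `gold` and `pred` dictionaries are hypothetical inputs chosen to exercise the normalization rules, not real evaluation data.

```python
import re
from collections import Counter
from typing import Optional

FIELDS = ["group", "title", "season", "episode", "resolution", "source", "special"]


def normalize_field_value(field: str, value) -> Optional[str]:
    if value is None:
        return None
    if field in {"episode", "season"}:
        try:
            return str(int(value))  # "07" and 7 both become "7"
        except (TypeError, ValueError):
            return str(value).strip().lower()
    text = str(value).strip()
    if field in {"resolution", "source"}:
        return text.lower().replace("_", "-")
    return re.sub(r"\s+", " ", text).strip().lower()


def update_parse_metrics(counter: Counter, gold: dict, pred: dict) -> None:
    all_match = True
    for field in FIELDS:
        gold_value = normalize_field_value(field, gold.get(field))
        pred_value = normalize_field_value(field, pred.get(field))
        if gold_value == pred_value:
            counter[f"{field}_correct"] += 1
        else:
            all_match = False
            # Mismatches are stored under a (field, gold, pred) tuple key.
            counter[(field, gold_value, pred_value)] += 1
        counter[f"{field}_total"] += 1
    if all_match:
        counter["full_match_correct"] += 1
    counter["full_match_total"] += 1


# Hypothetical parses: only the season differs after normalization.
gold = {"title": "Sousou  no Frieren", "episode": "07", "resolution": "1080P"}
pred = {"title": "sousou no frieren", "episode": 7, "resolution": "1080p", "season": 2}

metrics: Counter = Counter()
update_parse_metrics(metrics, gold, pred)
print(metrics["title_correct"], metrics[("season", None, "2")], metrics["full_match_correct"])
```

Two properties are worth noticing. A field that is missing on both sides counts as correct, because `None == None`, so per-field accuracy includes samples where the field does not occur. And the same `Counter` mixes string keys with tuple keys, which is why `parse_metric_tables` filters on `isinstance(key, tuple)` when building the error table.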
|
|
|
| 363 |
confusion: Counter = Counter()
|
| 364 |
entity_confusion: Counter = Counter()
|
| 365 |
boundary_errors: Counter = Counter()
|
| 366 |
+
parse_metrics: Counter = Counter()
|
| 367 |
+
parse_metrics_no_rules: Counter = Counter()
|
| 368 |
+
field_failures: List[dict] = []
|
| 369 |
+
field_failures_no_rules: List[dict] = []
|
| 370 |
|
| 371 |
with torch.no_grad():
|
| 372 |
for sample in eval_samples:
|
| 373 |
+
input_ids, attention_mask, label_ids, sample_tokens = prepare_inputs(
|
| 374 |
+
sample,
|
| 375 |
tokenizer,
|
| 376 |
label2id,
|
| 377 |
max_length,
|
|
|
|
| 379 |
input_tensor = torch.tensor([input_ids], dtype=torch.long, device=device)
|
| 380 |
mask_tensor = torch.tensor([attention_mask], dtype=torch.long, device=device)
|
| 381 |
logits = model(input_ids=input_tensor, attention_mask=mask_tensor).logits
|
| 382 |
+
active_count = sum(1 for label_id in label_ids if label_id != -100)
|
| 383 |
+
pred_ids = constrained_bio_decode(logits[0, 1:1 + active_count, :], id2label)
|
| 384 |
|
| 385 |
true_labels: List[str] = []
|
| 386 |
pred_labels: List[str] = []
|
| 387 |
+
pred_idx = 0
|
| 388 |
+
for label_id in label_ids:
|
| 389 |
if label_id == -100:
|
| 390 |
continue
|
| 391 |
+
pred_id = pred_ids[pred_idx]
|
| 392 |
+
pred_idx += 1
|
| 393 |
true_label = id2label.get(label_id, "O")
|
| 394 |
pred_label = id2label.get(pred_id, "O")
|
| 395 |
true_labels.append(true_label)
|
|
|
|
| 405 |
boundary_errors["BIO-prefix"] += 1
|
| 406 |
true_sequences.append(true_labels)
|
| 407 |
pred_sequences.append(pred_labels)
|
| 408 |
+
active_tokens = sample_tokens[:len(true_labels)]
|
| 409 |
+
gold_parse = postprocess(
|
| 410 |
+
active_tokens,
|
| 411 |
+
true_labels,
|
| 412 |
+
tokenizer=tokenizer,
|
| 413 |
+
filename=sample.get("filename"),
|
| 414 |
+
use_rules=True,
|
| 415 |
+
)
|
| 416 |
+
pred_parse = postprocess(
|
| 417 |
+
active_tokens,
|
| 418 |
+
pred_labels,
|
| 419 |
+
tokenizer=tokenizer,
|
| 420 |
+
filename=sample.get("filename"),
|
| 421 |
+
use_rules=True,
|
| 422 |
+
)
|
| 423 |
+
gold_parse_no_rules = postprocess(
|
| 424 |
+
active_tokens,
|
| 425 |
+
true_labels,
|
| 426 |
+
tokenizer=tokenizer,
|
| 427 |
+
filename=sample.get("filename"),
|
| 428 |
+
use_rules=False,
|
| 429 |
+
)
|
| 430 |
+
pred_parse_no_rules = postprocess(
|
| 431 |
+
active_tokens,
|
| 432 |
+
pred_labels,
|
| 433 |
+
tokenizer=tokenizer,
|
| 434 |
+
filename=sample.get("filename"),
|
| 435 |
+
use_rules=False,
|
| 436 |
+
)
|
| 437 |
+
update_parse_metrics(parse_metrics, gold_parse, pred_parse)
|
| 438 |
+
update_parse_metrics(parse_metrics_no_rules, gold_parse_no_rules, pred_parse_no_rules)
|
| 439 |
+
failures = collect_field_failures(gold_parse, pred_parse)
|
| 440 |
+
if failures and len(field_failures) < 30:
|
| 441 |
+
field_failures.append(
|
| 442 |
+
{
|
| 443 |
+
"filename": sample.get("filename"),
|
| 444 |
+
"errors": failures,
|
| 445 |
+
"gold": gold_parse,
|
| 446 |
+
"pred": pred_parse,
|
| 447 |
+
}
|
| 448 |
+
)
|
| 449 |
+
failures_no_rules = collect_field_failures(gold_parse_no_rules, pred_parse_no_rules)
|
| 450 |
+
if failures_no_rules and len(field_failures_no_rules) < 30:
|
| 451 |
+
field_failures_no_rules.append(
|
| 452 |
+
{
|
| 453 |
+
"filename": sample.get("filename"),
|
| 454 |
+
"errors": failures_no_rules,
|
| 455 |
+
"gold": gold_parse_no_rules,
|
| 456 |
+
"pred": pred_parse_no_rules,
|
| 457 |
+
}
|
| 458 |
+
)
|
| 459 |
|
| 460 |
errors = confusion.copy()
|
| 461 |
for label in set(label for pair in confusion for label in pair):
|
|
|
|
| 472 |
{k: v for k, v in entity_confusion.items() if k[0] != k[1]}
|
| 473 |
).most_common(30),
|
| 474 |
"boundary_errors": boundary_errors,
|
| 475 |
+
"parse_metrics": parse_metrics,
|
| 476 |
+
"parse_metrics_no_rules": parse_metrics_no_rules,
|
| 477 |
+
"field_failures": field_failures,
|
| 478 |
+
"field_failures_no_rules": field_failures_no_rules,
|
| 479 |
}
|
| 480 |
|
| 481 |
|
|
|
|
| 556 |
length_values: List[int] = []
|
| 557 |
aligned_length_values: List[int] = []
|
| 558 |
violations: List[dict] = []
|
| 559 |
+
boundary_warnings: List[dict] = []
|
| 560 |
mismatch_examples: List[dict] = []
|
| 561 |
space_label_counter: Counter = Counter()
|
| 562 |
boundary_drift_counter: Counter = Counter()
|
|
|
|
| 585 |
|
| 586 |
label_counter.update(labels)
|
| 587 |
length_values.append(len(tokens))
|
| 588 |
+
aligned_tokens, aligned_labels = labels_for_tokenizer(sample, tokenizer)
|
| 589 |
aligned_length_values.append(len(aligned_tokens))
|
| 590 |
if len(aligned_tokens) + 2 > max_length:
|
| 591 |
truncation_count += 1
|
|
|
|
| 603 |
}
|
| 604 |
)
|
| 605 |
violations.append(violation)
|
| 606 |
+
for warning in bio_boundary_warnings(tokens, labels):
|
| 607 |
+
warning.update(
|
| 608 |
+
{
|
| 609 |
+
"row": row_idx,
|
| 610 |
+
"file_id": sample.get("file_id"),
|
| 611 |
+
"filename": sample.get("filename"),
|
| 612 |
+
"context_tokens": tokens[max(0, warning["index"] - 5):warning["index"] + 6],
|
| 613 |
+
"context_labels": labels[max(0, warning["index"] - 5):warning["index"] + 6],
|
| 614 |
+
}
|
| 615 |
+
)
|
| 616 |
+
boundary_warnings.append(warning)
|
| 617 |
for span in spans_from_labels(tokens, labels):
|
| 618 |
text = span["text"]
|
| 619 |
if span["type"] == "TITLE":
|
|
|
|
| 718 |
)
|
| 719 |
|
| 720 |
violation_counter = Counter(v["type"] for v in violations)
|
| 721 |
+
warning_counter = Counter(w["type"] for w in boundary_warnings)
|
| 722 |
sections.append(
|
| 723 |
(
|
| 724 |
"BIO Violations And Boundary Drift",
|
| 725 |
"\n".join(
|
| 726 |
[
|
| 727 |
+
"### True BIO violation counts",
|
| 728 |
format_counter(violation_counter),
|
| 729 |
"",
|
| 730 |
+
"### Legal boundary warning counts",
|
| 731 |
+
format_counter(warning_counter),
|
| 732 |
+
"",
|
| 733 |
"### Boundary drift heuristics",
|
| 734 |
format_counter(boundary_drift_counter),
|
| 735 |
"",
|
| 736 |
"### Sample violations",
|
| 737 |
markdown_json(violations[:30]),
|
| 738 |
+
"",
|
| 739 |
+
"### Sample boundary warnings",
|
| 740 |
+
markdown_json(boundary_warnings[:30]),
|
| 741 |
]
|
| 742 |
),
|
| 743 |
)
|
|
|
|
| 790 |
[true, pred, f"{count:,}"]
|
| 791 |
for (true, pred), count in model_eval["top_entity_confusions"]
|
| 792 |
]
|
| 793 |
+
def parse_metric_tables(metrics: Counter) -> Tuple[List[List[str]], str, List[List[str]]]:
|
| 794 |
+
field_rows = []
|
| 795 |
+
for field in ["group", "title", "season", "episode", "resolution", "source", "special"]:
|
| 796 |
+
total = metrics.get(f"{field}_total", 0)
|
| 797 |
+
correct = metrics.get(f"{field}_correct", 0)
|
| 798 |
+
acc = correct / total if total else 0.0
|
| 799 |
+
field_rows.append([field, f"{correct:,}/{total:,}", f"{acc:.4f}"])
|
| 800 |
+
full_total = metrics.get("full_match_total", 0)
|
| 801 |
+
full_correct = metrics.get("full_match_correct", 0)
|
| 802 |
+
full_acc = full_correct / full_total if full_total else 0.0
|
| 803 |
+
full_line = f"{full_correct:,}/{full_total:,} ({full_acc:.4f})"
|
| 804 |
+
error_rows = [
|
| 805 |
+
[field, str(gold), str(pred), f"{count:,}"]
|
| 806 |
+
for key, count in Counter(
|
| 807 |
+
{key: count for key, count in metrics.items() if isinstance(key, tuple)}
|
| 808 |
+
).most_common(30)
|
| 809 |
+
if isinstance(key, tuple)
|
| 810 |
+
for field, gold, pred in [key]
|
| 811 |
+
]
|
| 812 |
+
return field_rows, full_line, error_rows
|
| 813 |
+
|
| 814 |
+
rule_field_rows, rule_full_line, rule_error_rows = parse_metric_tables(model_eval["parse_metrics"])
|
| 815 |
+
ner_field_rows, ner_full_line, ner_error_rows = parse_metric_tables(model_eval["parse_metrics_no_rules"])
|
| 816 |
sections.append(
|
| 817 |
(
|
| 818 |
"Model Confusion Analysis",
|
|
|
|
| 832 |
"### Top entity-type confusions",
|
| 833 |
markdown_table(["true", "pred", "count"], entity_rows) if entity_rows else "- none",
|
| 834 |
"",
|
| 835 |
+
"### Field exact-match accuracy (rule-assisted)",
|
| 836 |
+
markdown_table(["field", "correct/total", "accuracy"], rule_field_rows),
|
| 837 |
+
"",
|
| 838 |
+
f"Rule-assisted full parse exact match: {rule_full_line}",
|
| 839 |
+
"",
|
| 840 |
+
"### Top rule-assisted field parse errors",
|
| 841 |
+
markdown_table(["field", "gold", "pred", "count"], rule_error_rows) if rule_error_rows else "- none",
|
| 842 |
+
"",
|
| 843 |
+
"### Field exact-match accuracy (NER-only, no rules)",
|
| 844 |
+
markdown_table(["field", "correct/total", "accuracy"], ner_field_rows),
|
| 845 |
+
"",
|
| 846 |
+
f"NER-only full parse exact match: {ner_full_line}",
|
| 847 |
+
"",
|
| 848 |
+
"### Top NER-only field parse errors",
|
| 849 |
+
markdown_table(["field", "gold", "pred", "count"], ner_error_rows) if ner_error_rows else "- none",
|
| 850 |
+
"",
|
| 851 |
+
"### Hardest sampled parse failures (rule-assisted)",
|
| 852 |
+
markdown_json(model_eval["field_failures"][:10]) if model_eval["field_failures"] else "- none",
|
| 853 |
+
"",
|
| 854 |
+
"### Hardest sampled parse failures (NER-only)",
|
| 855 |
+
markdown_json(model_eval["field_failures_no_rules"][:10]) if model_eval["field_failures_no_rules"] else "- none",
|
| 856 |
+
"",
|
| 857 |
"### Seqeval report",
|
| 858 |
"```text\n" + model_eval["classification_report"] + "\n```",
|
| 859 |
]
|
dmhy_dataset.py
CHANGED
|
@@ -19,7 +19,8 @@ from datetime import datetime, timezone
|
|
| 19 |
from pathlib import Path
|
| 20 |
from typing import Iterable, List, Optional, Sequence
|
| 21 |
|
| 22 |
-
from data_generator import
|
| 23 |
from tokenizer import AnimeTokenizer
|
| 24 |
|
| 25 |
|
|
@@ -35,8 +36,9 @@ NOISE_BRACKETS = {
|
|
| 35 |
"繁中", "简中", "繁日", "简日", "日语", "日文", "外挂", "内封", "字幕",
|
| 36 |
}
|
| 37 |
|
| 38 |
-
SPECIAL_RE = re.compile(r"^(?:ova|oad|sp|movie|the\s*movie|op|ed|pv|cm|ncop|nced|剧场版|劇場版|特别篇|特別篇)$", re.I)
|
| 39 |
-
|
| 40 |
SEASON_RE = re.compile(
|
| 41 |
r"^(?:"
|
| 42 |
r"[Ss](\d{1,2})|"
|
|
@@ -45,16 +47,28 @@ SEASON_RE = re.compile(
|
|
| 45 |
r"(\d+)(?:st|nd|rd|th)\s+[Ss]eason"
|
| 46 |
r")$", re.I
|
| 47 |
)
|
| 48 |
SXE_RE = re.compile(r"^([Ss]\d{1,2})([Ee]\d{1,4})(?:v\d+)?$")
|
| 49 |
DATE_RE = re.compile(r"^(?:19|20)\d{2}[.\-_年]?(?:0?[1-9]|1[0-2])?[.\-_月]?(?:0?[1-9]|[12]\d|3[01])?日?$")
|
| 50 |
HASH_RE = re.compile(r"^[A-Fa-f0-9]{8,}$")
|
| 51 |
DIMENSION_RE = re.compile(r"^\d{3,4}[xX×]\d{3,4}$")
|
| 52 |
RESOLUTION_RE = re.compile(r"^(?:\d{3,4}[pP]|\d[Kk]|\d{3,4}[xX×]\d{3,4})$")
|
| 53 |
SOURCE_RE = re.compile(
|
| 54 |
-
r"^(?:WEB[-_ ]?DL|WEB[-_ ]?Rip|BDRip|BluRay|BDMV|DVDRip|DVD|TVRip|HDTV|"
|
| 55 |
r"Netflix|NF|AMZN|Baha|CR|ABEMA|DSNP|U[-_ ]?NEXT|Hulu|AT[-_ ]?X|"
|
| 56 |
r"x26[45]|h\.?26[45]|HEVC|AVC|AV1|AAC\d*(?:\.\d+)?|AAC|FLAC|MP3|DTS|Opus|"
|
| 57 |
-
r"CHS|CHT|BIG5|GB|JPN?|简[体體]?|繁[体體]?|简日双语|繁日双语|内封|外挂|MSubs?)$",
|
| 58 |
re.I,
|
| 59 |
)
|
| 60 |
GROUP_HINT_RE = re.compile(
|
|
@@ -112,12 +126,20 @@ def cn_number_to_int(text: str) -> Optional[int]:
|
|
| 112 |
def season_number(token: str) -> Optional[int]:
|
| 113 |
clean = clean_bracket(token)
|
| 114 |
match = SEASON_RE.match(clean)
|
| 115 |
-
if
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
|
| 119 |
-
return
|
| 120 |
-
|
| 121 |
|
| 122 |
|
| 123 |
def episode_number(token: str) -> Optional[int]:
|
|
@@ -126,7 +148,13 @@ def episode_number(token: str) -> Optional[int]:
|
|
| 126 |
return None
|
| 127 |
if DIMENSION_RE.match(clean) or DATE_RE.match(clean) or HASH_RE.match(clean):
|
| 128 |
return None
|
| 129 |
-
if re.match(r"^第\d{1,4}[话話集]$", clean):
|
| 130 |
return int(re.search(r"\d+", clean).group())
|
| 131 |
match = EPISODE_RE.match(clean)
|
| 132 |
if not match:
|
|
@@ -137,8 +165,13 @@ def episode_number(token: str) -> Optional[int]:
|
|
| 137 |
return number
|
| 138 |
|
| 139 |
|
| 140 |
def is_resolution(token: str) -> bool:
|
| 141 |
-
|
| 142 |
|
| 143 |
|
| 144 |
def is_source(token: str) -> bool:
|
|
@@ -149,11 +182,17 @@ def is_source(token: str) -> bool:
|
|
| 149 |
is_resolution(clean) or SOURCE_RE.match(clean)
|
| 150 |
):
|
| 151 |
return True
|
| 152 |
-
|
| 153 |
|
| 154 |
|
| 155 |
def is_special(token: str) -> bool:
|
| 156 |
-
|
| 157 |
|
| 158 |
|
| 159 |
def is_noise_bracket(token: str) -> bool:
|
|
@@ -194,7 +233,7 @@ def is_title_token(token: str) -> bool:
|
|
| 194 |
return False
|
| 195 |
if is_resolution(clean) or is_source(clean) or is_special(clean):
|
| 196 |
return False
|
| 197 |
-
if
|
| 198 |
return False
|
| 199 |
if DATE_RE.match(clean) or HASH_RE.match(clean):
|
| 200 |
return False
|
|
@@ -221,9 +260,13 @@ def find_episode_index(tokens: Sequence[str]) -> Optional[int]:
|
|
| 221 |
number = episode_number(token)
|
| 222 |
if number is None:
|
| 223 |
continue
|
| 224 |
-
score = 0
|
| 225 |
clean = clean_bracket(token)
|
| 226 |
-
if re.
|
| 227 |
score += 4
|
| 228 |
if token.startswith("[") or token.startswith("(") or token.startswith("【"):
|
| 229 |
score += 3
|
|
@@ -239,12 +282,317 @@ def find_episode_index(tokens: Sequence[str]) -> Optional[int]:
|
|
| 239 |
return max(candidates, key=lambda item: (item[0], item[1]))[1]
|
| 240 |
|
| 241 |
|
| 242 |
def label_bracket_contents(token: str, category: str, tokenizer: AnimeTokenizer) -> tuple[List[str], List[str]]:
|
| 243 |
inner = clean_bracket(token)
|
| 244 |
if not inner:
|
| 245 |
return [token], [category]
|
| 246 |
-
open_char =
|
| 247 |
-
close_char = token[-1] if token[-1] in "]】)》" else ""
|
| 248 |
inner_tokens = tokenizer.tokenize(inner)
|
| 249 |
tokens: List[str] = []
|
| 250 |
cats: List[str] = []
|
|
@@ -259,6 +607,38 @@ def label_bracket_contents(token: str, category: str, tokenizer: AnimeTokenizer)
|
|
| 259 |
return tokens, cats
|
| 260 |
|
| 261 |
|
| 262 |
def expand_tokens_and_categories(
|
| 263 |
tokens: Sequence[str],
|
| 264 |
categories: Sequence[str],
|
|
@@ -281,15 +661,34 @@ def expand_tokens_and_categories(
|
|
| 281 |
expanded_tokens.extend(split_tokens)
|
| 282 |
expanded_categories.extend(split_categories)
|
| 283 |
continue
|
| 284 |
expanded_tokens.append(token)
|
| 285 |
expanded_categories.append(category)
|
| 286 |
return expanded_tokens, expanded_categories
|
| 287 |
|
| 288 |
|
| 289 |
def weak_label_filename(filename: str, tokenizer: AnimeTokenizer) -> Optional[dict]:
|
| 290 |
tokens = tokenizer.tokenize(filename)
|
| 291 |
if not tokens:
|
| 292 |
return None
|
| 293 |
|
| 294 |
categories = ["sep" if token in {" ", "-", "_", "|", "~", "~", "."} else "title" for token in tokens]
|
| 295 |
|
|
@@ -306,15 +705,16 @@ def weak_label_filename(filename: str, tokenizer: AnimeTokenizer) -> Optional[di
|
|
| 306 |
categories[idx] = "source"
|
| 307 |
elif is_special(token):
|
| 308 |
categories[idx] = "special"
|
| 309 |
-
elif
|
| 310 |
categories[idx] = "season"
|
| 311 |
elif is_noise_bracket(token):
|
| 312 |
categories[idx] = "sep"
|
| 313 |
|
| 314 |
episode_idx = find_episode_index(tokens)
|
| 315 |
if episode_idx is None:
|
| 316 |
-
return
|
| 317 |
categories[episode_idx] = "episode"
|
| 318 |
|
| 319 |
# S01E07 is tokenized as S01 + E07 after tokenizer changes. If an older
|
| 320 |
# token slips through, expand_tokens_and_categories will split it.
|
|
@@ -341,7 +741,11 @@ def weak_label_filename(filename: str, tokenizer: AnimeTokenizer) -> Optional[di
|
|
| 341 |
title_start += 1
|
| 342 |
title_start, title_end = trim_title_span(tokens, title_start, title_end)
|
| 343 |
if title_start >= title_end:
|
| 344 |
-
return
|
| 345 |
|
| 346 |
for idx, token in enumerate(tokens):
|
| 347 |
if title_start <= idx < title_end:
|
|
@@ -351,28 +755,13 @@ def weak_label_filename(filename: str, tokenizer: AnimeTokenizer) -> Optional[di
|
|
| 351 |
categories[idx] = "sep"
|
| 352 |
|
| 353 |
if not any(cat == "title" for cat in categories) or not any(cat == "episode" for cat in categories):
|
| 354 |
-
return
|
| 355 |
|
| 356 |
-
|
| 357 |
-
# [, 剑, 来, , 第2季, ]) so that season markers mixed with title text
|
| 358 |
-
# inside a bracket can be detected as separate tokens.
|
| 359 |
-
expanded_tokens, expanded_categories = expand_tokens_and_categories(
|
| 360 |
-
tokens, categories, tokenizer
|
| 361 |
-
)
|
| 362 |
-
|
| 363 |
-
# Re-detect season markers in expanded tokens (bracket expansion exposes
|
| 364 |
-
# patterns like 第2季 that were previously hidden inside mixed brackets).
|
| 365 |
-
for idx in range(len(expanded_tokens)):
|
| 366 |
-
cat = expanded_categories[idx]
|
| 367 |
-
if cat not in {"sep", "episode", "group", "source", "resolution",
|
| 368 |
-
"special", "season"}:
|
| 369 |
-
if season_number(expanded_tokens[idx]) is not None:
|
| 370 |
-
expanded_categories[idx] = "season"
|
| 371 |
-
|
| 372 |
-
labels = assign_bio(expanded_tokens, expanded_categories)
|
| 373 |
-
if len(expanded_tokens) != len(labels):
|
| 374 |
-
return None
|
| 375 |
-
return {"tokens": expanded_tokens, "labels": labels}
|
| 376 |
|
| 377 |
|
| 378 |
def iter_db_rows(db_path: Path, min_id: int, max_id: int) -> Iterable[tuple[int, str]]:
|
|
|
|
| 19 |
from pathlib import Path
|
| 20 |
from typing import Iterable, List, Optional, Sequence
|
| 21 |
|
| 22 |
+
from data_generator import LABEL_MAP, categorize_meta_token
|
| 23 |
+
from label_repairs import season_marker_number
|
| 24 |
from tokenizer import AnimeTokenizer
|
| 25 |
|
| 26 |
|
|
|
|
| 36 |
"繁中", "简中", "繁日", "简日", "日语", "日文", "外挂", "内封", "字幕",
|
| 37 |
}
|
| 38 |
|
| 39 |
+
SPECIAL_RE = re.compile(r"^(?:ova\d*|oad\d*|sp\d*|movie|the\s*movie|op|ed|pv|cm|ncop|nced|剧场版|劇場版|特别篇|特別篇)$", re.I)
|
| 40 |
+
SPECIAL_SEARCH_RE = re.compile(r"^(?:檢索|检索|搜索|搜寻|搜尋|别名|別名|alias|search|keyword)\s*[::].+", re.I)
|
| 41 |
+
EPISODE_RE = re.compile(r"^(?:[Ee][Pp]?|#)?(\d{1,4})(?:v\d+|END)?$", re.I)
|
| 42 |
SEASON_RE = re.compile(
|
| 43 |
r"^(?:"
|
| 44 |
r"[Ss](\d{1,2})|"
|
|
|
|
| 47 |
r"(\d+)(?:st|nd|rd|th)\s+[Ss]eason"
|
| 48 |
r")$", re.I
|
| 49 |
)
|
| 50 |
+
READING_SEASON_RE = re.compile(
|
| 51 |
+
r"^(?:Ni\s+no\s+(?:Sara|Shou|Sho|Syo|Shō)|Ni\s+Gakki|Sono\s+Ni|"
|
| 52 |
+
r"San\s+no\s+(?:Sara|Shou|Sho|Syo)|(?:Yon|Shi|Shin)\s+no\s+Sara|"
|
| 53 |
+
r"(?:Go|Gou)\s+no\s+Sara)$",
|
| 54 |
+
re.I,
|
| 55 |
+
)
|
| 56 |
+
CJK_SEQUEL_SEASON_RE = re.compile(
|
| 57 |
+
r"^(?:[一二三四五六七八九十兩两貳贰弐弍參叁参肆伍陸陆柒捌玖](?:\s*(?:ノ|の|之)\s*(?:章|期|季|部))?|"
|
| 58 |
+
r"[ⅡⅢⅣⅤⅥⅦⅧⅨ]|II|III|IV|V|VI|VII|VIII|IX)$",
|
| 59 |
+
re.I,
|
| 60 |
+
)
|
| 61 |
SXE_RE = re.compile(r"^([Ss]\d{1,2})([Ee]\d{1,4})(?:v\d+)?$")
|
| 62 |
DATE_RE = re.compile(r"^(?:19|20)\d{2}[.\-_年]?(?:0?[1-9]|1[0-2])?[.\-_月]?(?:0?[1-9]|[12]\d|3[01])?日?$")
|
| 63 |
HASH_RE = re.compile(r"^[A-Fa-f0-9]{8,}$")
|
| 64 |
DIMENSION_RE = re.compile(r"^\d{3,4}[xX×]\d{3,4}$")
|
| 65 |
RESOLUTION_RE = re.compile(r"^(?:\d{3,4}[pP]|\d[Kk]|\d{3,4}[xX×]\d{3,4})$")
|
| 66 |
+
RESOLUTION_SEARCH_RE = re.compile(r"(?<![A-Za-z0-9])(?:\d{3,4}[pP]|\d[Kk]|\d{3,4}[xX×]\d{3,4})(?![A-Za-z0-9])")
|
| 67 |
SOURCE_RE = re.compile(
|
| 68 |
+
r"^(?:WEB[-_ ]?DL|WEB[-_ ]?Rip|BDRip|BluRay|BDMV|BD|DVDRip|DVD|TVRip|HDTV|"
|
| 69 |
r"Netflix|NF|AMZN|Baha|CR|ABEMA|DSNP|U[-_ ]?NEXT|Hulu|AT[-_ ]?X|"
|
| 70 |
r"x26[45]|h\.?26[45]|HEVC|AVC|AV1|AAC\d*(?:\.\d+)?|AAC|FLAC|MP3|DTS|Opus|"
|
| 71 |
+
r"CHS|CHT|BIG5|GB|JPN?|JPSC|JPTC|简[体體]?|繁[体體]?|简日双语|繁日双语|内封|外挂|MSubs?)$",
|
| 72 |
re.I,
|
| 73 |
)
|
| 74 |
GROUP_HINT_RE = re.compile(
|
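The two sequel-style season patterns introduced in this hunk can be checked in isolation. The regular expressions below are copied from the diff; the wrapper `is_sequel_marker` and the sample strings are hypothetical and exist only for this sketch. Mapping a match to a season number is done by `season_marker_number` from `label_repairs`, which is not shown here and is not reproduced.

```python
import re

READING_SEASON_RE = re.compile(
    r"^(?:Ni\s+no\s+(?:Sara|Shou|Sho|Syo|Shō)|Ni\s+Gakki|Sono\s+Ni|"
    r"San\s+no\s+(?:Sara|Shou|Sho|Syo)|(?:Yon|Shi|Shin)\s+no\s+Sara|"
    r"(?:Go|Gou)\s+no\s+Sara)$",
    re.I,
)
CJK_SEQUEL_SEASON_RE = re.compile(
    r"^(?:[一二三四五六七八九十兩两貳贰弐弍參叁参肆伍陸陆柒捌玖](?:\s*(?:ノ|の|之)\s*(?:章|期|季|部))?|"
    r"[ⅡⅢⅣⅤⅥⅦⅧⅨ]|II|III|IV|V|VI|VII|VIII|IX)$",
    re.I,
)


def is_sequel_marker(text: str) -> bool:
    """Hypothetical helper: True when either sequel-style pattern matches."""
    return bool(READING_SEASON_RE.match(text) or CJK_SEQUEL_SEASON_RE.match(text))


# Illustrative inputs, not taken from the training data.
samples = ["Ni no Sara", "ni  no shou", "二ノ章", "弐", "III", "Ⅱ", "V", "S02", "Season 2"]
results = {text: is_sequel_marker(text) for text in samples}
for text, matched in results.items():
    print(f"{text!r}: {matched}")
```

Bare matches such as `V` or a single kanji numeral are ambiguous on their own, which fits the diff keeping them apart from `is_explicit_season`: that function accepts only `SEASON_RE` forms such as `S02` or `第2季`.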
|
|
|
| 126 |
def season_number(token: str) -> Optional[int]:
|
| 127 |
clean = clean_bracket(token)
|
| 128 |
match = SEASON_RE.match(clean)
|
| 129 |
+
if match:
|
| 130 |
+
value = next((g for g in match.groups() if g), None)
|
| 131 |
+
if value is None:
|
| 132 |
+
return None
|
| 133 |
+
return cn_number_to_int(value)
|
| 134 |
+
if READING_SEASON_RE.match(clean) or CJK_SEQUEL_SEASON_RE.match(clean):
|
| 135 |
+
return season_marker_number(clean)
|
| 136 |
+
return None
|
| 137 |
+
|
| 138 |
+
|
| 139 |
+
def is_explicit_season(token: str) -> bool:
|
| 140 |
+
"""Return True for unambiguous season syntax such as S02 or 第2季."""
|
| 141 |
+
clean = clean_bracket(token)
|
| 142 |
+
return bool(SEASON_RE.match(clean))
|
| 143 |
|
| 144 |
|
| 145 |
def episode_number(token: str) -> Optional[int]:
|
|
|
|
| 148 |
return None
|
| 149 |
if DIMENSION_RE.match(clean) or DATE_RE.match(clean) or HASH_RE.match(clean):
|
| 150 |
return None
|
| 151 |
+
if re.match(r"^第\d{1,4}(?:\(\d{1,4}\))?[话話集]$", clean):
|
| 152 |
+
return int(re.search(r"\d+", clean).group())
|
| 153 |
+
if re.match(r"^(?:OVA|OAD|SP)\d{1,4}$", clean, re.I):
|
| 154 |
+
return int(re.search(r"\d+", clean).group())
|
| 155 |
+
if re.match(r"^\d{1,4}\s*END$", clean, re.I):
|
| 156 |
+
return int(re.search(r"\d+", clean).group())
|
| 157 |
+
if re.match(r"^\d{1,4}[._]\d+$", clean):
|
| 158 |
return int(re.search(r"\d+", clean).group())
|
| 159 |
match = EPISODE_RE.match(clean)
|
| 160 |
if not match:
|
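The episode branches added in this hunk can be sketched as a standalone function. The patterns are copied from the diff, but `episode_from_clean` is a hypothetical simplification: it skips `clean_bracket`, skips the checks at the top of `episode_number` that the hunk does not show, and assumes that a plain `EPISODE_RE` match returns its first capture group without further filtering.

```python
import re
from typing import Optional

# Patterns copied from the diff.
DIMENSION_RE = re.compile(r"^\d{3,4}[xX×]\d{3,4}$")
DATE_RE = re.compile(
    r"^(?:19|20)\d{2}[.\-_年]?(?:0?[1-9]|1[0-2])?[.\-_月]?(?:0?[1-9]|[12]\d|3[01])?日?$"
)
HASH_RE = re.compile(r"^[A-Fa-f0-9]{8,}$")
EPISODE_RE = re.compile(r"^(?:[Ee][Pp]?|#)?(\d{1,4})(?:v\d+|END)?$", re.I)

# The four explicit shapes tried before EPISODE_RE, in the order used by the diff.
EXPLICIT_EPISODE_PATTERNS = [
    r"^第\d{1,4}(?:\(\d{1,4}\))?[话話集]$",  # 第07话, 第12(24)話
    r"^(?:OVA|OAD|SP)\d{1,4}$",              # OVA2, SP1
    r"^\d{1,4}\s*END$",                      # 12END
    r"^\d{1,4}[._]\d+$",                     # 12.5 (only the leading number is kept)
]


def episode_from_clean(clean: str) -> Optional[int]:
    """Hypothetical, simplified stand-in for episode_number on a cleaned token."""
    if DIMENSION_RE.match(clean) or DATE_RE.match(clean) or HASH_RE.match(clean):
        return None
    for pattern in EXPLICIT_EPISODE_PATTERNS:
        # re.I matters only for the OVA/OAD/SP and END shapes; it is harmless elsewhere.
        if re.match(pattern, clean, re.I):
            return int(re.search(r"\d+", clean).group())
    match = EPISODE_RE.match(clean)
    return int(match.group(1)) if match else None  # assumption: no extra filtering


samples = ["第07话", "第12(24)話", "OVA2", "12END", "12.5", "EP03v2", "1920x1080", "2024"]
parsed = {text: episode_from_clean(text) for text in samples}
for text, number in parsed.items():
    print(f"{text!r}: {number}")
```

Because every explicit branch takes the first run of digits, `12.5` is reduced to `12` and `第12(24)話` to `12`. A bare year such as `2024` is rejected earlier by `DATE_RE`.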
|
|
|
| 165 |
return number
|
| 166 |
|
| 167 |
|
| 168 |
+
def has_wrapping_brackets(token: str) -> bool:
|
| 169 |
+
return len(token) >= 2 and token[0] in "[【(《" and token[-1] in "]】)》"
|
| 170 |
+
|
| 171 |
+
|
| 172 |
def is_resolution(token: str) -> bool:
|
| 173 |
+
clean = clean_bracket(token)
|
| 174 |
+
return bool(RESOLUTION_RE.match(clean) or (has_wrapping_brackets(token) and RESOLUTION_SEARCH_RE.search(clean)))
|
| 175 |
|
| 176 |
|
| 177 |
def is_source(token: str) -> bool:
|
|
|
|
| 182 |
is_resolution(clean) or SOURCE_RE.match(clean)
|
| 183 |
):
|
| 184 |
return True
|
| 185 |
+
if SOURCE_RE.match(clean):
|
| 186 |
+
return True
|
| 187 |
+
if has_wrapping_brackets(token):
|
| 188 |
+
parts = [part for part in re.split(r"[\s&+/,._-]+", clean) if part]
|
| 189 |
+
return bool(parts) and all(SOURCE_RE.match(part) or is_noise_bracket(part) for part in parts)
|
| 190 |
+
return False
|
| 191 |
|
| 192 |
|
| 193 |
def is_special(token: str) -> bool:
|
| 194 |
+
clean = clean_bracket(token)
|
| 195 |
+
return bool(SPECIAL_RE.match(clean) or SPECIAL_SEARCH_RE.match(clean))
|
| 196 |
|
| 197 |
|
| 198 |
def is_noise_bracket(token: str) -> bool:
|
|
|
|
| 233 |
return False
|
| 234 |
if is_resolution(clean) or is_source(clean) or is_special(clean):
|
| 235 |
return False
|
| 236 |
+
if is_explicit_season(clean) or episode_number(clean) is not None:
|
| 237 |
return False
|
| 238 |
if DATE_RE.match(clean) or HASH_RE.match(clean):
|
| 239 |
return False
|
|
|
|
| 260 |
number = episode_number(token)
|
| 261 |
if number is None:
|
| 262 |
continue
|
|
|
|
| 263 |
clean = clean_bracket(token)
|
| 264 |
+
if idx > 0 and tokens[idx - 1] == "." and re.fullmatch(r"\d+", clean):
|
| 265 |
+
previous_clean = clean_bracket(tokens[idx - 2]) if idx >= 2 else ""
|
| 266 |
+
if previous_clean.lower() in VIDEO_EXTENSIONS or f".{clean}".lower() in VIDEO_EXTENSIONS:
|
| 267 |
+
continue
|
| 268 |
+
score = 0
|
| 269 |
+
if re.match(r"^(?:[Ee][Pp]?|#|第|OVA|OAD|SP)", clean, re.I):
|
| 270 |
score += 4
|
| 271 |
if token.startswith("[") or token.startswith("(") or token.startswith("【"):
|
| 272 |
score += 3
|
|
|
|
| 282 |
return max(candidates, key=lambda item: (item[0], item[1]))[1]
|
| 283 |
|
| 284 |
|
| 285 |
+
def is_separator_token(token: str) -> bool:
|
| 286 |
+
return token in {" ", "-", "_", "|", "~", "~", ".", "+", "&", "/", ","}
|
| 287 |
+
|
| 288 |
+
|
| 289 |
+
def has_only_separators_between(tokens: Sequence[str], start: int, end: int) -> bool:
|
| 290 |
+
return all(is_separator_token(token) for token in tokens[start:end])
|
| 291 |
+
|
| 292 |
+
|
| 293 |
+
def is_context_season_token(tokens: Sequence[str], idx: int, episode_idx: int) -> bool:
|
| 294 |
+
"""Detect compact season markers only when they structurally lead into an episode."""
|
| 295 |
+
if idx >= episode_idx:
|
| 296 |
+
return False
|
| 297 |
+
|
| 298 |
+
token = tokens[idx]
|
| 299 |
+
clean = clean_bracket(token)
|
| 300 |
+
if not clean:
|
| 301 |
+
return False
|
| 302 |
+
if is_explicit_season(clean):
|
| 303 |
+
return True
|
| 304 |
+
|
| 305 |
+
if season_number(clean) is None:
|
| 306 |
+
return False
|
| 307 |
+
if not has_only_separators_between(tokens, idx + 1, episode_idx):
|
| 308 |
+
return False
|
| 309 |
+
|
| 310 |
+
# A bare V is often the volume prefix in V02E01, not season five.
|
| 311 |
+
if clean.upper() == "V":
|
| 312 |
+
return False
|
| 313 |
+
return True
|
| 314 |
+
|
| 315 |
+
|
| 316 |
+
def label_context_season_tokens(
|
| 317 |
+
tokens: Sequence[str],
|
| 318 |
+
categories: List[str],
|
| 319 |
+
episode_idx: int,
|
| 320 |
+
) -> None:
|
| 321 |
+
if (
|
| 322 |
+
episode_idx >= 2
|
| 323 |
+
and clean_bracket(tokens[episode_idx]).upper().startswith("E")
|
| 324 |
+
and clean_bracket(tokens[episode_idx - 2]).upper() == "V"
|
| 325 |
+
and clean_bracket(tokens[episode_idx - 1]).isdigit()
|
| 326 |
+
):
|
| 327 |
+
categories[episode_idx - 2] = "season"
|
| 328 |
+
categories[episode_idx - 1] = "season"
|
| 329 |
+
return
|
| 330 |
+
|
| 331 |
+
for idx in range(episode_idx):
|
| 332 |
+
if categories[idx] in {"group", "episode", "resolution", "source", "special"}:
|
| 333 |
+
continue
|
| 334 |
+
if is_context_season_token(tokens, idx, episode_idx):
|
| 335 |
+
categories[idx] = "season"
|
| 336 |
+
|
| 337 |
+
|
| 338 |
+
def embedded_bracket_episode(token: str) -> Optional[tuple[str, str, str]]:
|
| 339 |
+
"""Split malformed tokens such as '[Group}Title[658]' into title + episode."""
|
| 340 |
+
if episode_number(token) is not None:
|
| 341 |
+
return None
|
| 342 |
+
match = re.match(r"^(?P<prefix>.+?)\[(?P<episode>\d{1,4}(?:v\d+)?)(?P<close>\])?$", token, re.I)
|
| 343 |
+
if match is None and has_wrapping_brackets(token):
|
| 344 |
+
match = re.match(r"^(?P<prefix>.+?)(?P<episode>\d{2,4})(?P<close>[\]\)】》])$", token, re.I)
|
| 345 |
+
if not match:
|
| 346 |
+
return None
|
| 347 |
+
prefix = match.group("prefix")
|
| 348 |
+
episode = match.group("episode")
|
| 349 |
+
close = match.group("close") or ""
|
| 350 |
+
if not clean_bracket(prefix):
|
| 351 |
+
return None
|
| 352 |
+
number = int(re.search(r"\d+", episode).group())
|
| 353 |
+
if number == 0 or number > 2000:
|
| 354 |
+
return None
|
| 355 |
+
return prefix, episode, close
|
| 356 |
+
|
| 357 |
+
|
| 358 |
+
def append_tokenized_category(
|
| 359 |
+
tokens: List[str],
|
| 360 |
+
categories: List[str],
|
| 361 |
+
text: str,
|
| 362 |
+
category: str,
|
| 363 |
+
tokenizer: AnimeTokenizer,
|
| 364 |
+
) -> None:
|
| 365 |
+
for piece in tokenizer.tokenize(text):
|
| 366 |
+
if not piece:
|
| 367 |
+
continue
|
| 368 |
+
if is_separator_token(piece) or piece in {"[", "]", "(", ")", "【", "】", "《", "》"}:
|
| 369 |
+
piece_category = "sep"
|
| 370 |
+
else:
|
| 371 |
+
piece_category = category
|
| 372 |
+
tokens.append(piece)
|
| 373 |
+
categories.append(piece_category)
|
| 374 |
+
|
| 375 |
+
|
| 376 |
+
def finalize_weak_sample(
|
| 377 |
+
tokens: Sequence[str],
|
| 378 |
+
categories: Sequence[str],
|
| 379 |
+
tokenizer: AnimeTokenizer,
|
| 380 |
+
require_episode: bool = True,
|
| 381 |
+
) -> Optional[dict]:
|
| 382 |
+
expanded_tokens, expanded_categories = expand_tokens_and_categories(tokens, categories, tokenizer)
|
| 383 |
+
|
| 384 |
+
# Only unambiguous season forms are promoted here. Compact sequel markers
|
| 385 |
+
# such as 貳, II, or Ni no Sara need episode context and are repaired by
|
| 386 |
+
# label_repairs from character spans; treating every single CJK numeral as
|
| 387 |
+
# season would corrupt titles like 魯邦三世.
|
| 388 |
+
for idx, token in enumerate(expanded_tokens):
|
| 389 |
+
if expanded_categories[idx] in {"sep", "episode", "group", "source", "resolution", "special", "season"}:
|
| 390 |
+
continue
|
| 391 |
+
if is_explicit_season(token):
|
| 392 |
+
expanded_categories[idx] = "season"
|
| 393 |
+
|
| 394 |
+
labels = assign_iob2(expanded_categories)
|
| 395 |
+
if len(expanded_tokens) != len(labels):
|
| 396 |
+
return None
|
| 397 |
+
if not any(label.endswith("TITLE") for label in labels):
|
| 398 |
+
return None
|
| 399 |
+
if require_episode and not any(label.endswith("EPISODE") for label in labels):
|
| 400 |
+
return None
|
| 401 |
+
return {"tokens": expanded_tokens, "labels": labels}
|
| 402 |
+
|
| 403 |
+
|
| 404 |
+
def assign_iob2(categories: Sequence[str]) -> List[str]:
|
| 405 |
+
labels: List[str] = []
|
| 406 |
+
previous_entity: Optional[str] = None
|
| 407 |
+
for category in categories:
|
| 408 |
+
entity = LABEL_MAP.get(category, "O")
|
| 409 |
+
if entity == "O":
|
| 410 |
+
labels.append("O")
|
| 411 |
+
previous_entity = None
|
| 412 |
+
continue
|
| 413 |
+
prefix = "I" if previous_entity == entity else "B"
|
| 414 |
+
labels.append(f"{prefix}-{entity}")
|
| 415 |
+
previous_entity = entity
|
| 416 |
+
return labels
|
| 417 |
+
|
| 418 |
+
|
| 419 |
+
def fallback_embedded_episode_sample(
|
| 420 |
+
tokens: Sequence[str],
|
| 421 |
+
tokenizer: AnimeTokenizer,
|
| 422 |
+
) -> Optional[dict]:
|
| 423 |
+
rebuilt_tokens: List[str] = []
|
| 424 |
+
rebuilt_categories: List[str] = []
|
| 425 |
+
used_episode = False
|
| 426 |
+
|
| 427 |
+
for token in tokens:
|
| 428 |
+
embedded = embedded_bracket_episode(token)
|
| 429 |
+
if embedded and not used_episode:
|
| 430 |
+
prefix, episode, close = embedded
|
| 431 |
+
append_tokenized_category(rebuilt_tokens, rebuilt_categories, prefix, "title", tokenizer)
|
| 432 |
+
rebuilt_tokens.append(episode)
|
| 433 |
+
rebuilt_categories.append("episode")
|
| 434 |
+
if close:
|
| 435 |
+
rebuilt_tokens.append(close)
|
| 436 |
+
rebuilt_categories.append("sep")
|
| 437 |
+
used_episode = True
|
| 438 |
+
continue
|
| 439 |
+
|
| 440 |
+
if not used_episode:
|
| 441 |
+
category = "sep" if is_separator_token(token) else "title"
|
| 442 |
+
elif is_resolution(token):
|
| 443 |
+
category = "resolution"
|
| 444 |
+
elif is_source(token):
|
| 445 |
+
category = "source"
|
| 446 |
+
elif is_special(token):
|
| 447 |
+
category = "special"
|
| 448 |
+
else:
|
| 449 |
+
category = "sep"
|
| 450 |
+
rebuilt_tokens.append(token)
|
| 451 |
+
rebuilt_categories.append(category)
|
| 452 |
+
|
| 453 |
+
if not used_episode:
|
| 454 |
+
return None
|
| 455 |
+
return finalize_weak_sample(rebuilt_tokens, rebuilt_categories, tokenizer)
|
| 456 |
+
|
| 457 |
+
|
| 458 |
+
def has_embedded_episode_candidate(tokens: Sequence[str]) -> bool:
|
| 459 |
+
return any(embedded_bracket_episode(token) is not None for token in tokens)
|
| 460 |
+
|
| 461 |
+
|
| 462 |
+
def fallback_episode_first_sample(
|
| 463 |
+
tokens: Sequence[str],
|
| 464 |
+
categories: Sequence[str],
|
| 465 |
+
episode_idx: int,
|
| 466 |
+
tokenizer: AnimeTokenizer,
|
| 467 |
+
) -> Optional[dict]:
|
| 468 |
+
fallback_categories = ["sep"] * len(tokens)
|
| 469 |
+
|
| 470 |
+
# V02E01-style catalog rows are episode-first. The tokenizer currently
|
| 471 |
+
# exposes them as V, 02, E01, so keep V02 together as a season span.
|
| 472 |
+
if (
|
| 473 |
+
episode_idx >= 2
|
| 474 |
+
and clean_bracket(tokens[episode_idx]).upper().startswith("E")
|
| 475 |
+
and clean_bracket(tokens[episode_idx - 2]).upper() == "V"
|
| 476 |
+
and clean_bracket(tokens[episode_idx - 1]).isdigit()
|
| 477 |
+
):
|
| 478 |
+
fallback_categories[episode_idx - 2] = "season"
|
| 479 |
+
fallback_categories[episode_idx - 1] = "season"
|
| 480 |
+
else:
|
| 481 |
+
label_context_season_tokens(tokens, fallback_categories, episode_idx)
|
| 482 |
+
|
| 483 |
+
fallback_categories[episode_idx] = "episode"
|
| 484 |
+
|
| 485 |
+
title_indices: List[int] = []
|
| 486 |
+
for idx in range(episode_idx + 1, len(tokens)):
|
| 487 |
+
token = tokens[idx]
|
| 488 |
+
if is_separator_token(token):
|
| 489 |
+
continue
|
| 490 |
+
if is_resolution(token) or is_source(token) or is_special(token) or is_noise_bracket(token):
|
| 491 |
+
fallback_categories[idx] = "resolution" if is_resolution(token) else "source" if is_source(token) else "special" if is_special(token) else "sep"
|
| 492 |
+
continue
|
| 493 |
+
title_indices.append(idx)
|
| 494 |
+
|
| 495 |
+
if not title_indices:
|
| 496 |
+
# Some rows are title-only brackets followed by season/episode,
|
| 497 |
+
# e.g. [伊蘇] II-01. If the leading bracket was guessed as GROUP but
|
| 498 |
+
# no real title exists, use it as TITLE to keep the row useful.
|
| 499 |
+
for idx in range(episode_idx):
|
| 500 |
+
if categories[idx] == "group" and clean_bracket(tokens[idx]):
|
| 501 |
+
title_indices.append(idx)
|
| 502 |
+
break
|
| 503 |
+
|
| 504 |
+
for idx in title_indices:
|
| 505 |
+
fallback_categories[idx] = "title"
|
| 506 |
+
if title_indices:
|
| 507 |
+
for idx in range(title_indices[0], title_indices[-1] + 1):
|
| 508 |
+
if is_separator_token(tokens[idx]):
|
| 509 |
+
fallback_categories[idx] = "title"
|
| 510 |
+
|
| 511 |
+
return finalize_weak_sample(tokens, fallback_categories, tokenizer)
|
| 512 |
+
|
| 513 |
+
|
| 514 |
+
def fallback_minimal_sample(
|
| 515 |
+
tokens: Sequence[str],
|
| 516 |
+
episode_idx: int,
|
| 517 |
+
tokenizer: AnimeTokenizer,
|
| 518 |
+
) -> Optional[dict]:
|
| 519 |
+
"""Keep malformed low-information rows instead of silently dropping them."""
|
| 520 |
+
categories: List[str] = []
|
| 521 |
+
title_idx: Optional[int] = None
|
| 522 |
+
|
| 523 |
+
for idx, token in enumerate(tokens):
|
| 524 |
+
if idx == episode_idx:
|
| 525 |
+
categories.append("episode")
|
| 526 |
+
elif is_resolution(token):
|
| 527 |
+
categories.append("resolution")
|
| 528 |
+
elif is_source(token):
|
| 529 |
+
categories.append("source")
|
| 530 |
+
elif is_special(token):
|
| 531 |
+
categories.append("special")
|
| 532 |
+
if title_idx is None:
|
| 533 |
+
title_idx = idx
|
| 534 |
+
else:
|
| 535 |
+
categories.append("sep")
|
| 536 |
+
|
| 537 |
+
if title_idx is None:
|
| 538 |
+
for idx, token in enumerate(tokens):
|
| 539 |
+
if idx == episode_idx or is_separator_token(token):
|
| 540 |
+
continue
|
| 541 |
+
if categories[idx] not in {"resolution", "source"}:
|
| 542 |
+
title_idx = idx
|
| 543 |
+
break
|
| 544 |
+
if title_idx is None:
|
| 545 |
+
return None
|
| 546 |
+
|
| 547 |
+
categories[title_idx] = "title"
|
| 548 |
+
return finalize_weak_sample(tokens, categories, tokenizer)
|
| 549 |
+
|
| 550 |
+
|
| 551 |
+
def fallback_no_episode_sample(tokens: Sequence[str], tokenizer: AnimeTokenizer) -> Optional[dict]:
|
| 552 |
+
"""Label movies, OP/ED/SP, and malformed rows that have no true episode token."""
|
| 553 |
+
categories: List[str] = []
|
| 554 |
+
seen_title = False
|
| 555 |
+
title_allowed = True
|
| 556 |
+
|
| 557 |
+
for idx, token in enumerate(tokens):
|
| 558 |
+
if is_separator_token(token):
|
| 559 |
+
categories.append("title" if seen_title and title_allowed else "sep")
|
| 560 |
+
continue
|
| 561 |
+
if idx == 0 and is_group_bracket(token, idx, tokens):
|
| 562 |
+
categories.append("group")
|
| 563 |
+
continue
|
| 564 |
+
if is_resolution(token):
|
| 565 |
+
categories.append("resolution")
|
| 566 |
+
title_allowed = False
|
| 567 |
+
continue
|
| 568 |
+
if is_source(token):
|
| 569 |
+
categories.append("source")
|
| 570 |
+
title_allowed = False
|
| 571 |
+
continue
|
| 572 |
+
if is_special(token):
|
| 573 |
+
categories.append("special")
|
| 574 |
+
title_allowed = False
|
| 575 |
+
continue
|
| 576 |
+
if is_noise_bracket(token):
|
| 577 |
+
categories.append("sep")
|
| 578 |
+
continue
|
| 579 |
+
categories.append("title")
|
| 580 |
+
seen_title = True
|
| 581 |
+
|
| 582 |
+
return finalize_weak_sample(tokens, categories, tokenizer, require_episode=False)
|
| 583 |
+
|
| 584 |
+
|
| 585 |
+
def bracket_delimiters(token: str) -> tuple[str, str]:
|
| 586 |
+
open_char = token[0] if token and token[0] in "[【(《" else ""
|
| 587 |
+
close_char = token[-1] if token and token[-1] in "]】)》" else ""
|
| 588 |
+
return open_char, close_char
|
| 589 |
+
|
| 590 |
+
|
| 591 |
def label_bracket_contents(token: str, category: str, tokenizer: AnimeTokenizer) -> tuple[List[str], List[str]]:
|
| 592 |
inner = clean_bracket(token)
|
| 593 |
if not inner:
|
| 594 |
return [token], [category]
|
| 595 |
+
open_char, close_char = bracket_delimiters(token)
|
|
|
|
| 596 |
inner_tokens = tokenizer.tokenize(inner)
|
| 597 |
tokens: List[str] = []
|
| 598 |
cats: List[str] = []
|
|
|
|
| 607 |
return tokens, cats
|
| 608 |
|
| 609 |
|
| 610 |
+
def label_meta_bracket_contents(token: str, tokenizer: AnimeTokenizer) -> tuple[List[str], List[str]]:
|
| 611 |
+
inner = clean_bracket(token)
|
| 612 |
+
if not inner:
|
| 613 |
+
return [token], ["sep"]
|
| 614 |
+
open_char, close_char = bracket_delimiters(token)
|
| 615 |
+
inner_tokens = tokenizer.tokenize(inner)
|
| 616 |
+
tokens: List[str] = []
|
| 617 |
+
cats: List[str] = []
|
| 618 |
+
if open_char:
|
| 619 |
+
tokens.append(open_char)
|
| 620 |
+
cats.append("sep")
|
| 621 |
+
for inner_token in inner_tokens:
|
| 622 |
+
if inner_token in {" ", "-", "_", "|", "~", "~", ".", "+", "&", "/", ","}:
|
| 623 |
+
cat = "sep"
|
| 624 |
+
elif is_resolution(inner_token) or RESOLUTION_SEARCH_RE.fullmatch(inner_token):
|
| 625 |
+
cat = "resolution"
|
| 626 |
+
elif is_source(inner_token):
|
| 627 |
+
cat = "source"
|
| 628 |
+
elif is_special(inner_token):
|
| 629 |
+
cat = "special"
|
| 630 |
+
elif is_noise_bracket(inner_token):
|
| 631 |
+
cat = "sep"
|
| 632 |
+
else:
|
| 633 |
+
cat = "sep"
|
| 634 |
+
tokens.append(inner_token)
|
| 635 |
+
cats.append(cat)
|
| 636 |
+
if close_char:
|
| 637 |
+
tokens.append(close_char)
|
| 638 |
+
cats.append("sep")
|
| 639 |
+
return tokens, cats
|
| 640 |
+
|
| 641 |
+
|
| 642 |
def expand_tokens_and_categories(
|
| 643 |
tokens: Sequence[str],
|
| 644 |
categories: Sequence[str],
|
|
|
|
| 661 |
expanded_tokens.extend(split_tokens)
|
| 662 |
expanded_categories.extend(split_categories)
|
| 663 |
continue
|
| 664 |
+
if category in {"source", "resolution", "special", "sep"} and (
|
| 665 |
+
token.startswith("[") or token.startswith("(") or token.startswith("【") or token.startswith("《")
|
| 666 |
+
):
|
| 667 |
+
split_tokens, split_categories = label_meta_bracket_contents(token, tokenizer)
|
| 668 |
+
if any(cat != "sep" for cat in split_categories):
|
| 669 |
+
expanded_tokens.extend(split_tokens)
|
| 670 |
+
expanded_categories.extend(split_categories)
|
| 671 |
+
continue
|
| 672 |
expanded_tokens.append(token)
|
| 673 |
expanded_categories.append(category)
|
| 674 |
return expanded_tokens, expanded_categories
|
| 675 |
|
| 676 |
|
| 677 |
def weak_label_filename(filename: str, tokenizer: AnimeTokenizer) -> Optional[dict]:
|
| 678 |
+
basename = normalize_path_basename(str(filename))
|
| 679 |
+
stem, ext = strip_video_extension(basename)
|
| 680 |
+
if ext in VIDEO_EXTENSIONS:
|
| 681 |
+
filename = stem
|
| 682 |
+
else:
|
| 683 |
+
filename = basename
|
| 684 |
+
|
| 685 |
tokens = tokenizer.tokenize(filename)
|
| 686 |
if not tokens:
|
| 687 |
return None
|
| 688 |
+
if has_embedded_episode_candidate(tokens):
|
| 689 |
+
embedded_sample = fallback_embedded_episode_sample(tokens, tokenizer)
|
| 690 |
+
if embedded_sample is not None:
|
| 691 |
+
return embedded_sample
|
| 692 |
|
| 693 |
categories = ["sep" if token in {" ", "-", "_", "|", "~", "~", "."} else "title" for token in tokens]
|
| 694 |
|
|
|
|
| 705 |
categories[idx] = "source"
|
| 706 |
elif is_special(token):
|
| 707 |
categories[idx] = "special"
|
| 708 |
+
elif is_explicit_season(token):
|
| 709 |
categories[idx] = "season"
|
| 710 |
elif is_noise_bracket(token):
|
| 711 |
categories[idx] = "sep"
|
| 712 |
|
| 713 |
episode_idx = find_episode_index(tokens)
|
| 714 |
if episode_idx is None:
|
| 715 |
+
return fallback_embedded_episode_sample(tokens, tokenizer) or fallback_no_episode_sample(tokens, tokenizer)
|
| 716 |
categories[episode_idx] = "episode"
|
| 717 |
+
label_context_season_tokens(tokens, categories, episode_idx)
|
| 718 |
|
| 719 |
# S01E07 is tokenized as S01 + E07 after tokenizer changes. If an older
|
| 720 |
# token slips through, expand_tokens_and_categories will split it.
|
|
|
|
| 741 |
title_start += 1
|
| 742 |
title_start, title_end = trim_title_span(tokens, title_start, title_end)
|
| 743 |
if title_start >= title_end:
|
| 744 |
+
return fallback_embedded_episode_sample(tokens, tokenizer) or fallback_episode_first_sample(
|
| 745 |
+
tokens, categories, episode_idx, tokenizer
|
| 746 |
+
) or fallback_minimal_sample(
|
| 747 |
+
tokens, episode_idx, tokenizer
|
| 748 |
+
)
|
| 749 |
|
| 750 |
for idx, token in enumerate(tokens):
|
| 751 |
if title_start <= idx < title_end:
|
|
|
|
| 755 |
categories[idx] = "sep"
|
| 756 |
|
| 757 |
if not any(cat == "title" for cat in categories) or not any(cat == "episode" for cat in categories):
|
| 758 |
+
return fallback_embedded_episode_sample(tokens, tokenizer) or fallback_episode_first_sample(
|
| 759 |
+
tokens, categories, episode_idx, tokenizer
|
| 760 |
+
) or fallback_minimal_sample(
|
| 761 |
+
tokens, episode_idx, tokenizer
|
| 762 |
+
)
|
| 763 |
|
| 764 |
+
return finalize_weak_sample(tokens, categories, tokenizer)
|
| 765 |
|
| 766 |
|
| 767 |
def iter_db_rows(db_path: Path, min_id: int, max_id: int) -> Iterable[tuple[int, str]]:
|
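Note on the IOB2 step in dmhy_dataset.py: `finalize_weak_sample` turns per-token categories into IOB2 tags through `assign_iob2`. The sketch below reproduces that function so the tagging rule can be tried in isolation. The `LABEL_MAP` shown here is an assumption, since the real mapping is defined outside this hunk; only the `TITLE` and `EPISODE` suffixes are confirmed by the checks in `finalize_weak_sample`.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical mapping for illustration only; the real LABEL_MAP is not shown
# in this diff. Categories missing from the map (such as "sep") become "O".
LABEL_MAP: Dict[str, str] = {
    "group": "GROUP",
    "title": "TITLE",
    "season": "SEASON",
    "episode": "EPISODE",
    "resolution": "RESOLUTION",
    "source": "SOURCE",
    "special": "SPECIAL",
}


def assign_iob2(categories: Sequence[str]) -> List[str]:
    """Tag the first token of each entity run with B- and the rest with I-."""
    labels: List[str] = []
    previous_entity: Optional[str] = None
    for category in categories:
        entity = LABEL_MAP.get(category, "O")
        if entity == "O":
            labels.append("O")
            # An outside token ends the current run, so the next entity token
            # starts a new span even if it has the same type.
            previous_entity = None
            continue
        prefix = "I" if previous_entity == entity else "B"
        labels.append(f"{prefix}-{entity}")
        previous_entity = entity
    return labels


if __name__ == "__main__":
    cats = ["group", "sep", "title", "title", "sep", "episode"]
    print(assign_iob2(cats))
```

Because a separator resets `previous_entity`, two title tokens split by a `sep` category would become two `B-TITLE` spans. That appears to be why the fallback paths relabel separators inside a title range as `title` before calling `finalize_weak_sample`.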
evaluate_parser_cases.py
ADDED
|
@@ -0,0 +1,163 @@
|
|
| 1 |
+
"""Evaluate parser checkpoints on fixed real-world filename cases."""
|
| 2 |
+
|
| 3 |
+
import argparse
|
| 4 |
+
import json
|
| 5 |
+
import os
|
| 6 |
+
from typing import Dict, List, Optional
|
| 7 |
+
|
| 8 |
+
import torch
|
| 9 |
+
from transformers import BertForTokenClassification
|
| 10 |
+
|
| 11 |
+
from config import Config
|
| 12 |
+
from inference import parse_filename
|
| 13 |
+
from tokenizer import load_tokenizer
|
| 14 |
+
|
| 15 |
+
|
| 16 |
+
DEFAULT_CASE_FILE = os.path.join("data", "parser_regression_cases.json")
|
| 17 |
+
|
| 18 |
+
|
| 19 |
+
def normalize_field_value(field: str, value) -> Optional[str]:
|
| 20 |
+
if value is None:
|
| 21 |
+
return None
|
| 22 |
+
if field in {"episode", "season"}:
|
| 23 |
+
try:
|
| 24 |
+
return str(int(value))
|
| 25 |
+
except (TypeError, ValueError):
|
| 26 |
+
return str(value).strip().lower()
|
| 27 |
+
text = str(value).strip()
|
| 28 |
+
if field in {"resolution", "source"}:
|
| 29 |
+
return text.lower().replace("_", "-")
|
| 30 |
+
return " ".join(text.lower().split())
|
| 31 |
+
|
| 32 |
+
|
| 33 |
+
def load_cases(path: str) -> List[Dict]:
|
| 34 |
+
with open(path, "r", encoding="utf-8") as f:
|
| 35 |
+
cases = json.load(f)
|
| 36 |
+
if not isinstance(cases, list):
|
| 37 |
+
raise ValueError(f"{path} must contain a JSON list")
|
| 38 |
+
return cases
|
| 39 |
+
|
| 40 |
+
|
| 41 |
+
def evaluate_cases(
|
| 42 |
+
model_dir: str,
|
| 43 |
+
case_file: str,
|
| 44 |
+
tokenizer_variant: Optional[str],
|
| 45 |
+
max_length: Optional[int],
|
| 46 |
+
use_rules: bool,
|
| 47 |
+
constrain_bio: bool,
|
| 48 |
+
) -> Dict:
|
| 49 |
+
cfg = Config()
|
| 50 |
+
tokenizer = load_tokenizer(model_dir, tokenizer_variant)
|
| 51 |
+
model = BertForTokenClassification.from_pretrained(model_dir)
|
| 52 |
+
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
| 53 |
+
model.to(device)
|
| 54 |
+
model.eval()
|
| 55 |
+
|
| 56 |
+
id2label = {int(k): v for k, v in getattr(model.config, "id2label", cfg.id2label).items()}
|
| 57 |
+
resolved_max_length = max_length or int(getattr(model.config, "max_seq_length", 64))
|
| 58 |
+
cases = load_cases(case_file)
|
| 59 |
+
|
| 60 |
+
field_totals: Dict[str, int] = {}
|
| 61 |
+
field_correct: Dict[str, int] = {}
|
| 62 |
+
results = []
|
| 63 |
+
full_correct = 0
|
| 64 |
+
|
| 65 |
+
for case in cases:
|
| 66 |
+
expected = case.get("expected", {})
|
| 67 |
+
pred = parse_filename(
|
| 68 |
+
case["filename"],
|
| 69 |
+
model,
|
| 70 |
+
tokenizer,
|
| 71 |
+
id2label,
|
| 72 |
+
max_length=resolved_max_length,
|
| 73 |
+
debug=False,
|
| 74 |
+
use_rules=use_rules,
|
| 75 |
+
constrain_bio=constrain_bio,
|
| 76 |
+
)
|
| 77 |
+
errors = {}
|
| 78 |
+
for field, expected_value in expected.items():
|
| 79 |
+
field_totals[field] = field_totals.get(field, 0) + 1
|
| 80 |
+
expected_norm = normalize_field_value(field, expected_value)
|
| 81 |
+
pred_norm = normalize_field_value(field, pred.get(field))
|
| 82 |
+
if expected_norm == pred_norm:
|
| 83 |
+
field_correct[field] = field_correct.get(field, 0) + 1
|
| 84 |
+
else:
|
| 85 |
+
errors[field] = {
|
| 86 |
+
"expected": expected_value,
|
| 87 |
+
"pred": pred.get(field),
|
| 88 |
+
}
|
| 89 |
+
if not errors:
|
| 90 |
+
full_correct += 1
|
| 91 |
+
results.append(
|
| 92 |
+
{
|
| 93 |
+
"id": case.get("id"),
|
| 94 |
+
"filename": case["filename"],
|
| 95 |
+
"ok": not errors,
|
| 96 |
+
"errors": errors,
|
| 97 |
+
"expected": expected,
|
| 98 |
+
"pred": {field: pred.get(field) for field in sorted(expected)},
|
| 99 |
+
}
|
| 100 |
+
)
|
| 101 |
+
|
| 102 |
+
field_accuracy = {
|
| 103 |
+
field: field_correct.get(field, 0) / total
|
| 104 |
+
for field, total in sorted(field_totals.items())
|
| 105 |
+
}
|
| 106 |
+
return {
|
| 107 |
+
"model_dir": model_dir,
|
| 108 |
+
"case_file": case_file,
|
| 109 |
+
"tokenizer_variant": getattr(tokenizer, "tokenizer_variant", "regex"),
|
| 110 |
+
"max_length": resolved_max_length,
|
| 111 |
+
"use_rules": use_rules,
|
| 112 |
+
"constrain_bio": constrain_bio,
|
| 113 |
+
"case_count": len(cases),
|
| 114 |
+
"full_correct": full_correct,
|
| 115 |
+
"full_accuracy": full_correct / len(cases) if cases else 0.0,
|
| 116 |
+
"field_correct": field_correct,
|
| 117 |
+
"field_total": field_totals,
|
| 118 |
+
"field_accuracy": field_accuracy,
|
| 119 |
+
"failures": [result for result in results if not result["ok"]],
|
| 120 |
+
"results": results,
|
| 121 |
+
}
|
| 122 |
+
|
| 123 |
+
|
| 124 |
+
def main() -> None:
|
| 125 |
+
parser = argparse.ArgumentParser(description="Evaluate parser on fixed filename regression cases")
|
| 126 |
+
parser.add_argument("--model-dir", required=True)
|
| 127 |
+
parser.add_argument("--case-file", default=DEFAULT_CASE_FILE)
|
| 128 |
+
parser.add_argument("--tokenizer", choices=["regex", "char"], default=None)
|
| 129 |
+
parser.add_argument("--max-length", type=int, default=None)
|
| 130 |
+
parser.add_argument("--output", default=None, help="Optional JSON output path")
|
| 131 |
+
parser.add_argument("--no-rule-assist", action="store_true")
|
| 132 |
+
parser.add_argument("--no-constrained-bio", action="store_true")
|
| 133 |
+
args = parser.parse_args()
|
| 134 |
+
|
| 135 |
+
metrics = evaluate_cases(
|
| 136 |
+
model_dir=args.model_dir,
|
| 137 |
+
case_file=args.case_file,
|
| 138 |
+
tokenizer_variant=args.tokenizer,
|
| 139 |
+
max_length=args.max_length,
|
| 140 |
+
use_rules=not args.no_rule_assist,
|
| 141 |
+
constrain_bio=not args.no_constrained_bio,
|
| 142 |
+
)
|
| 143 |
+
|
| 144 |
+
print(
|
| 145 |
+
f"Full case accuracy: {metrics['full_correct']}/{metrics['case_count']} "
|
| 146 |
+
f"({metrics['full_accuracy']:.4f})"
|
| 147 |
+
)
|
| 148 |
+
for field, total in metrics["field_total"].items():
|
| 149 |
+
correct = metrics["field_correct"].get(field, 0)
|
| 150 |
+
print(f" {field}: {correct}/{total} ({correct / total:.4f})")
|
| 151 |
+
if metrics["failures"]:
|
| 152 |
+
print("\nFailures:")
|
| 153 |
+
for failure in metrics["failures"]:
|
| 154 |
+
print(json.dumps(failure, ensure_ascii=False))
|
| 155 |
+
|
| 156 |
+
if args.output:
|
| 157 |
+
os.makedirs(os.path.dirname(args.output) or ".", exist_ok=True)
|
| 158 |
+
with open(args.output, "w", encoding="utf-8") as f:
|
| 159 |
+
json.dump(metrics, f, ensure_ascii=False, indent=2)
|
| 160 |
+
|
| 161 |
+
|
| 162 |
+
if __name__ == "__main__":
|
| 163 |
+
main()
|
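Note on the comparison rule in evaluate_parser_cases.py: a case counts as fully correct only when every expected field matches the prediction after normalization. The sketch below copies `normalize_field_value` from the file above and adds a small helper, `field_errors`, which is a hypothetical name that mirrors the inner loop of `evaluate_cases` without loading a model. The sample values are illustrative and do not come from `data/parser_regression_cases.json`.

```python
from typing import Dict, Optional


def normalize_field_value(field: str, value) -> Optional[str]:
    """Normalize a field so that cosmetic differences do not count as errors."""
    if value is None:
        return None
    if field in {"episode", "season"}:
        try:
            return str(int(value))  # "03" and 3 both become "3"
        except (TypeError, ValueError):
            return str(value).strip().lower()
    text = str(value).strip()
    if field in {"resolution", "source"}:
        return text.lower().replace("_", "-")
    # Titles, groups, etc.: case-insensitive with collapsed whitespace.
    return " ".join(text.lower().split())


def field_errors(expected: Dict, pred: Dict) -> Dict:
    """Hypothetical helper: return the mismatching fields of one case."""
    errors = {}
    for field, expected_value in expected.items():
        if normalize_field_value(field, expected_value) != normalize_field_value(
            field, pred.get(field)
        ):
            errors[field] = {"expected": expected_value, "pred": pred.get(field)}
    return errors


if __name__ == "__main__":
    expected = {"episode": 3, "resolution": "1080p", "source": "WEB-DL"}
    pred = {"episode": "03", "resolution": "1080P", "source": "web_dl"}
    print(field_errors(expected, pred))
```

Only the fields listed under `expected` are scored, so a case file can leave out fields whose ground truth is ambiguous without lowering the accuracy figures.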
exports/anime_filename_parser.metadata.json
CHANGED
|
@@ -1,12 +1,12 @@
|
|
| 1 |
{
|
| 2 |
-
"model_dir": "
|
| 3 |
"output": "exports\\anime_filename_parser.onnx",
|
| 4 |
-
"max_length":
|
| 5 |
"sample": "[ANi] 葬送的芙莉莲 S2 - 03 [1080P][WEB-DL]",
|
| 6 |
"logits_shape": [
|
| 7 |
1,
|
| 8 |
-
|
| 9 |
15
|
| 10 |
],
|
| 11 |
-
"max_abs_diff": 3.
|
| 12 |
}
|
|
|
|
| 1 |
{
|
| 2 |
+
"model_dir": ".",
|
| 3 |
"output": "exports\\anime_filename_parser.onnx",
|
| 4 |
+
"max_length": 128,
|
| 5 |
"sample": "[ANi] 葬送的芙莉莲 S2 - 03 [1080P][WEB-DL]",
|
| 6 |
"logits_shape": [
|
| 7 |
1,
|
| 8 |
+
128,
|
| 9 |
15
|
| 10 |
],
|
| 11 |
+
"max_abs_diff": 3.3855438232421875e-05
|
| 12 |
}
|
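Note on the export metadata: the updated file records `max_length` 128, a `logits_shape` of `[1, 128, 15]` and a `max_abs_diff` of about 3.4e-05. A minimal sanity check over those fields could look like the sketch below. It assumes that `max_abs_diff` is the largest absolute difference between the PyTorch and ONNX logits, which the field name suggests but this diff does not confirm. The function name `check_export_metadata` and the `1e-4` tolerance are hypothetical choices, not part of the repository.

```python
import math
from typing import Dict, List


def check_export_metadata(metadata: Dict, tolerance: float = 1e-4) -> List[str]:
    """Return a list of human-readable problems; an empty list means it looks fine."""
    problems: List[str] = []
    shape = metadata.get("logits_shape") or []
    max_length = metadata.get("max_length")
    if len(shape) != 3:
        problems.append(f"expected 3-D logits, got shape {shape}")
    elif shape[1] != max_length:
        # The sequence axis of the logits should match the exported max_length.
        problems.append(f"sequence axis {shape[1]} != max_length {max_length}")
    diff = metadata.get("max_abs_diff", math.inf)
    if diff > tolerance:
        problems.append(f"max_abs_diff {diff} exceeds tolerance {tolerance}")
    return problems


if __name__ == "__main__":
    # Values copied from the new side of the metadata diff above.
    metadata = {
        "max_length": 128,
        "logits_shape": [1, 128, 15],
        "max_abs_diff": 3.3855438232421875e-05,
    }
    print(check_export_metadata(metadata))
```

A check like this could run after each export so that a mismatch between the configured sequence length and the exported graph is caught before the ONNX file is published.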
exports/anime_filename_parser.onnx
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:f9b874fbd4217a190487f512dcc6dd7ce2f0e610147703ca0cddcc0db44fb1c7
|
| 3 |
+
size 19633926
|
inference.py
CHANGED
|
@@ -20,6 +20,7 @@ import torch
|
|
| 20 |
from transformers import BertForTokenClassification
|
| 21 |
|
| 22 |
from config import Config
|
|
|
|
| 23 |
from tokenizer import AnimeTokenizer, load_tokenizer
|
| 24 |
|
| 25 |
|
|
@@ -37,6 +38,10 @@ def extract_season_number(text: str) -> Optional[int]:
|
|
| 37 |
Examples:
|
| 38 |
"S2" → 2, "Season 2" → 2, "第二季" → 2, "1st Season" → 1
|
| 39 |
"""
|
|
|
|
|
|
|
|
|
|
|
|
|
| 40 |
# Arabic digits
|
| 41 |
match = re.search(r'(\d+)', text)
|
| 42 |
if match:
|
|
@@ -261,19 +266,66 @@ def postprocess(
|
|
| 261 |
|
| 262 |
|
| 263 |
BRACKET_RE = re.compile(r"\[([^\]]+)\]|\(([^)]+)\)|【([^】]+)】|《([^》]+)》")
|
| 264 |
-
RESOLUTION_RE = re.compile(r"
|
| 265 |
-
|
| 266 |
-
r"
|
| 267 |
-
r"Netflix|NF|AMZN|Baha|CR|ABEMA|DSNP|U[-_ ]?NEXT|Hulu|AT[-_ ]?X
|
|
|
|
| 268 |
re.I,
|
| 269 |
)
|
| 270 |
EPISODE_PATTERNS = [
|
| 271 |
-
re.compile(r"
|
| 272 |
-
re.compile(r"[
|
|
|
|
| 273 |
]
|
| 274 |
SEASON_RE = re.compile(r"(?:^|[\s._\-\[\(【《])(?:[Ss](?P<s1>\d{1,2})|Season\s*(?P<s2>\d{1,2})|第(?P<s3>[一二三四五六七八九十\d]+)[季期部])", re.I)
|
|
|
|
| 275 |
NOISE_META_RE = re.compile(
|
| 276 |
-
r"^(?:\d{3,4}[pP]|\d[Kk]|WEB[-_ ]?DL|WEB[-_ ]?Rip|BDRip|BluRay|BDMV|DVDRip|DVD|TVRip|"
|
| 277 |
r"HDTV|Netflix|NF|AMZN|Baha|CR|HEVC|AVC|AV1|x26[45]|h\.?26[45]|AAC.*|FLAC|MP3|DTS|"
|
| 278 |
r"Opus|ASS.*|CHS|CHT|BIG5|GB|JPN?|MP4|MKV|繁中|简中|内封|外挂)$",
|
| 279 |
re.I,
|
|
@@ -316,6 +368,52 @@ def looks_like_group(text: str) -> bool:
|
|
| 316 |
)
|
| 317 |
|
| 318 |
|
|
|
|
| 319 |
def apply_rule_assists(filename: str, result: Dict) -> Dict:
|
| 320 |
"""
|
| 321 |
Fill high-confidence structural fields from filename conventions.
|
|
@@ -327,8 +425,8 @@ def apply_rule_assists(filename: str, result: Dict) -> Dict:
|
|
| 327 |
brackets = bracket_parts(filename)
|
| 328 |
|
| 329 |
if (not repaired.get("group") or (repaired.get("title") and repaired["group"] in repaired["title"])) and brackets:
|
| 330 |
-
first_text, first_start,
|
| 331 |
-
if first_start == 0 and
|
| 332 |
repaired["group"] = first_text
|
| 333 |
|
| 334 |
if not repaired.get("resolution"):
|
|
@@ -336,10 +434,34 @@ def apply_rule_assists(filename: str, result: Dict) -> Dict:
|
|
| 336 |
if match:
|
| 337 |
repaired["resolution"] = match.group(0)
|
| 338 |
|
| 339 |
-
|
| 340 |
-
|
| 341 |
-
|
| 342 |
-
|
|
|
|
| 343 |
|
| 344 |
if repaired.get("season") is None:
|
| 345 |
match = SEASON_RE.search(filename)
|
|
@@ -348,52 +470,223 @@ def apply_rule_assists(filename: str, result: Dict) -> Dict:
|
|
| 348 |
season = cn_number_to_int(value)
|
| 349 |
if season is not None:
|
| 350 |
repaired["season"] = season
|
| 351 |
-
|
| 352 |
-
|
| 353 |
-
|
| 354 |
-
|
| 355 |
-
|
| 356 |
-
|
| 357 |
-
ep = int(ep_text)
|
| 358 |
-
if ep == 0 or ep > 2000:
|
| 359 |
-
continue
|
| 360 |
-
score = match.start()
|
| 361 |
-
if 1 <= ep <= 200:
|
| 362 |
-
score += 10000
|
| 363 |
-
if "-" in filename[max(0, match.start() - 3):match.start() + 1]:
|
| 364 |
-
score += 1000
|
| 365 |
-
if match.start() > len(filename) // 3:
|
| 366 |
-
score += 200
|
| 367 |
-
candidates.append((score, ep, ep_text))
|
| 368 |
-
if candidates:
|
| 369 |
-
repaired["episode"] = max(candidates, key=lambda item: item[0])[1]
|
| 370 |
|
| 371 |
title = repaired.get("title")
|
| 372 |
group = repaired.get("group")
|
| 373 |
if title and group and title.startswith(group):
|
| 374 |
title = title[len(group):].lstrip("]】)>})》 \t-_.")
|
| 375 |
repaired["title"] = title or repaired["title"]
|
| 376 |
|
| 377 |
-
if
|
| 378 |
repaired_title = infer_title_span(filename, group, repaired["episode"])
|
| 379 |
if repaired_title:
|
| 380 |
repaired["title"] = repaired_title
|
| 381 |
|
|
|
|
|
|
|
|
|
|
| 382 |
return repaired
|
| 383 |
|
| 384 |
|
| 385 |
def infer_title_span(filename: str, group: Optional[str], episode: Optional[int]) -> Optional[str]:
|
| 386 |
start = 0
|
| 387 |
if group:
|
| 388 |
first = BRACKET_RE.match(filename)
|
| 389 |
if first and group in first.group(0):
|
| 390 |
start = first.end()
|
| 391 |
|
| 392 |
end = None
|
| 393 |
if episode is not None:
|
| 394 |
ep_patterns = [
|
|
|
|
| 395 |
rf"\s[-_]\s*0*{episode}(?:v\d+)?(?=$|[\s\[\(【《._-])",
|
| 396 |
rf"[\[\(【《]0*{episode}(?:v\d+)?[\]\)】》]",
|
|
|
|
|
|
|
| 397 |
rf"[Ee]0*{episode}(?:v\d+)?",
|
| 398 |
]
|
| 399 |
for pattern in ep_patterns:
|
|
@@ -412,7 +705,7 @@ def infer_title_span(filename: str, group: Optional[str], episode: Optional[int]
|
|
| 412 |
|
| 413 |
if end is None or end <= start:
|
| 414 |
return None
|
| 415 |
-
title = filename[start:end]
|
| 416 |
return title or None
|
| 417 |
|
| 418 |
|
|
@@ -448,6 +741,16 @@ def parse_filename(
|
|
| 448 |
|
| 449 |
# Convert to input IDs
|
| 450 |
input_ids = tokenizer.convert_tokens_to_ids(tokens)
|
| 451 |
unk_token_id = tokenizer.unk_token_id
|
| 452 |
unk_tokens = [token for token, token_id in zip(tokens, input_ids) if token_id == unk_token_id]
|
| 453 |
|
|
@@ -516,6 +819,10 @@ def parse_filename(
|
|
| 516 |
"unk_count": len(unk_tokens),
|
| 517 |
"unk_rate": len(unk_tokens) / len(tokens) if tokens else 0.0,
|
| 518 |
"unk_tokens": unk_tokens[:50],
|
| 519 |
"tokens": tokens[:available],
|
| 520 |
"labels": label_strings,
|
| 521 |
"scores": [round(float(score), 4) for score in selected_scores],
|
|
@@ -544,7 +851,7 @@ def main():
|
|
| 544 |
parser.add_argument("filename", nargs="?", type=str, help="Anime filename to parse")
|
| 545 |
parser.add_argument("--input-file", type=str, help="File with filenames (one per line)")
|
| 546 |
parser.add_argument("--output-file", type=str, help="Output file for results (JSONL)")
|
| 547 |
-
parser.add_argument("--model-dir", type=str, default=".
|
| 548 |
help="Path to trained model directory")
|
| 549 |
parser.add_argument("--tokenizer", choices=["regex", "char"], default=None,
|
| 550 |
help="Tokenizer variant override. Defaults to checkpoint metadata")
|
|
|
|
| 20 |
from transformers import BertForTokenClassification
|
| 21 |
|
| 22 |
from config import Config
|
| 23 |
+
from label_repairs import season_marker_number
|
| 24 |
from tokenizer import AnimeTokenizer, load_tokenizer
|
| 25 |
|
| 26 |
|
|
|
|
| 38 |
Examples:
|
| 39 |
"S2" → 2, "Season 2" → 2, "第二季" → 2, "1st Season" → 1
|
| 40 |
"""
|
| 41 |
+
marker_value = season_marker_number(text)
|
| 42 |
+
if marker_value is not None:
|
| 43 |
+
return marker_value
|
| 44 |
+
|
| 45 |
# Arabic digits
|
| 46 |
match = re.search(r'(\d+)', text)
|
| 47 |
if match:
|
|
|
|
| 266 |
|
| 267 |
|
| 268 |
BRACKET_RE = re.compile(r"\[([^\]]+)\]|\(([^)]+)\)|【([^】]+)】|《([^》]+)》")
|
| 269 |
+
RESOLUTION_RE = re.compile(r"(?<![A-Za-z0-9])(?:\d{3,4}[pP]|\d[Kk]|\d{3,4}[xX×]\d{3,4})(?![A-Za-z0-9])")
|
| 270 |
+
SOURCE_TOKEN_PATTERN = (
|
| 271 |
+
r"WEB[-_ ]?DL|WEB[-_ ]?Rip|BDRip|BluRay|BDMV|BD|DVDRip|DVD|TVRip|HDTV|"
|
| 272 |
+
r"Netflix|NF|AMZN|Baha|CR|ABEMA|DSNP|U[-_ ]?NEXT|Hulu|AT[-_ ]?X|"
|
| 273 |
+
r"CHS|CHT|GB|BIG5|JPN?|繁中|简中"
|
| 274 |
+
)
|
| 275 |
+
SOURCE_RE = re.compile(rf"\b(?:{SOURCE_TOKEN_PATTERN})\b", re.I)
|
| 276 |
+
SOURCE_TAG_RE = re.compile(
|
| 277 |
+
rf"^(?:{SOURCE_TOKEN_PATTERN})(?:\s*(?:[&+/]|,\s*)\s*(?:{SOURCE_TOKEN_PATTERN}))*$",
|
| 278 |
+
re.I,
|
| 279 |
+
)
|
| 280 |
+
SPECIAL_TAG_RE = re.compile(
|
| 281 |
+
r"^(?:檢索|检索|搜索|搜寻|搜尋|别名|別名|alias|search|keyword)\s*[::].+",
|
| 282 |
re.I,
|
| 283 |
)
|
| 284 |
EPISODE_PATTERNS = [
|
| 285 |
+
("season_episode", re.compile(r"[Ss]\d{1,2}[Ee](?P<ep>\d{1,4})(?:v\d+)?", re.I)),
|
| 286 |
+
("dash_episode", re.compile(r"(?:^|[\s._])[-_]\s*(?P<ep>\d{1,4})(?:v\d+)?(?=$|[\s._\-\]\)】》\[])")),
|
| 287 |
+
("bracket_episode", re.compile(r"[\[\(【《](?:EP?|#)?(?P<ep>\d{1,4})(?:v\d+)?[\]\)】》]", re.I)),
|
| 288 |
+
("explicit_episode", re.compile(r"(?:^|[\s._\-\[\(【《#])(?:EP?|第|#)(?P<ep>\d{1,4})(?:v\d+)?(?:[话話集])?(?=$|[\s._\-\]\)】》])", re.I)),
|
| 289 |
+
(
|
| 290 |
+
"long_episode",
|
| 291 |
+
re.compile(
|
| 292 |
+
r"(?:^|[\s._\-\[\(【《])(?P<ep>\d{3,4})(?:v\d+)?"
|
| 293 |
+
r"(?=[\s._\-\]\)】》\[]+(?:\d{3,4}[pP]|WEB|BD|BluRay|HDTV|NF|AMZN|CR|Baha))",
|
| 294 |
+
re.I,
|
| 295 |
+
),
|
| 296 |
+
),
|
| 297 |
+
("generic_episode", re.compile(r"(?:^|[\s._\-\[\(【《#])(?P<ep>\d{1,3})(?:v\d+)?(?=$|[\s._\-\]\)】》])", re.I)),
|
| 298 |
]
|
| 299 |
SEASON_RE = re.compile(r"(?:^|[\s._\-\[\(【《])(?:[Ss](?P<s1>\d{1,2})|Season\s*(?P<s2>\d{1,2})|第(?P<s3>[一二三四五六七八九十\d]+)[季期部])", re.I)
|
| 300 |
+
SEQUEL_MARKER_RE = re.compile(
|
| 301 |
+
r"(?<![A-Za-z0-9])"
|
| 302 |
+
r"(?P<marker>"
|
| 303 |
+
r"Ni\s+no\s+(?:Sara|Shou|Sho|Syo|Shō)|"
|
| 304 |
+
r"San\s+no\s+(?:Sara|Shou|Sho|Syo)|"
|
| 305 |
+
r"(?:Yon|Shi|Shin)\s+no\s+Sara|"
|
| 306 |
+
r"(?:Go|Gou)\s+no\s+Sara|"
|
| 307 |
+
r"Ni\s+Gakki|Sono\s+Ni|Ni|"
|
| 308 |
+
r"II|III|IV|V|VI|VII|VIII|IX|[ⅡⅢⅣⅤⅥⅦⅧⅨ]|"
|
| 309 |
+
r"[一二三四五六七八九十兩两貳贰弐弍參叁参肆伍陸陆柒捌玖](?:\s*(?:ノ|の|之)\s*(?:章|期|季|部))?"
|
| 310 |
+
r")"
|
| 311 |
+
r"(?![A-Za-z0-9])",
|
| 312 |
+
re.I,
|
| 313 |
+
)
|
| 314 |
+
TRAILING_SEQUEL_MARKER_RE = re.compile(
|
| 315 |
+
r"(?:^|[\s._-])"
|
| 316 |
+
r"(?P<marker>"
|
| 317 |
+
r"Ni\s+no\s+(?:Sara|Shou|Sho|Syo|Shō)|"
|
| 318 |
+
r"San\s+no\s+(?:Sara|Shou|Sho|Syo)|"
|
| 319 |
+
r"(?:Yon|Shi|Shin)\s+no\s+Sara|"
|
| 320 |
+
r"(?:Go|Gou)\s+no\s+Sara|"
|
| 321 |
+
r"Ni\s+Gakki|Sono\s+Ni|Ni|"
|
| 322 |
+
r"II|III|IV|V|VI|VII|VIII|IX|[ⅡⅢⅣⅤⅥⅦⅧⅨ]|"
|
| 323 |
+
r"[一二三四五六七八九十兩两貳贰弐弍參叁参肆伍陸陆柒捌玖](?:\s*(?:ノ|の|之)\s*(?:章|期|季|部))?"
|
| 324 |
+
r")$",
|
| 325 |
+
re.I,
|
| 326 |
+
)
|
| 327 |
NOISE_META_RE = re.compile(
|
| 328 |
+
r"^(?:\d{3,4}[pP]|\d[Kk]|WEB[-_ ]?DL|WEB[-_ ]?Rip|BDRip|BluRay|BDMV|BD|DVDRip|DVD|TVRip|"
|
| 329 |
r"HDTV|Netflix|NF|AMZN|Baha|CR|HEVC|AVC|AV1|x26[45]|h\.?26[45]|AAC.*|FLAC|MP3|DTS|"
|
| 330 |
r"Opus|ASS.*|CHS|CHT|BIG5|GB|JPN?|MP4|MKV|繁中|简中|内封|外挂)$",
|
| 331 |
re.I,
|
|
|
|
| 368 |
)
|
| 369 |
|
| 370 |
|
| 371 |
+
def looks_like_episode_or_meta(text: str) -> bool:
|
| 372 |
+
if not text:
|
| 373 |
+
return False
|
| 374 |
+
clean = text.strip()
|
| 375 |
+
return bool(
|
| 376 |
+
re.fullmatch(r"(?:EP?|#)?\d{1,4}(?:v\d+)?", clean, re.I)
|
| 377 |
+
or RESOLUTION_RE.search(clean)
|
| 378 |
+
or SOURCE_TAG_RE.fullmatch(clean)
|
| 379 |
+
or SOURCE_RE.search(clean)
|
| 380 |
+
or SPECIAL_TAG_RE.search(clean)
|
| 381 |
+
or NOISE_META_RE.search(clean)
|
| 382 |
+
)
|
| 383 |
+
|
| 384 |
+
|
| 385 |
+
def looks_like_structural_group(text: str, filename: str, bracket_end: int) -> bool:
|
| 386 |
+
"""Heuristic for short leading release-group brackets not in the name list."""
|
| 387 |
+
if looks_like_group(text):
|
| 388 |
+
return True
|
| 389 |
+
if not text or looks_like_episode_or_meta(text):
|
| 390 |
+
return False
|
| 391 |
+
|
| 392 |
+
after = filename[bracket_end:].lstrip(" \t._")
|
| 393 |
+
if after.startswith("-"):
|
| 394 |
+
return False
|
| 395 |
+
next_bracket = BRACKET_RE.match(after)
|
| 396 |
+
if next_bracket:
|
| 397 |
+
next_text = next(group for group in next_bracket.groups() if group is not None)
|
| 398 |
+
if looks_like_episode_or_meta(next_text):
|
| 399 |
+
return False
|
| 400 |
+
|
| 401 |
+
words = re.findall(r"[A-Za-z0-9]+", text)
|
| 402 |
+
if not words:
|
| 403 |
+
if re.search(r"[\u3400-\u9fff]", text) and len(text) <= 32:
|
| 404 |
+
return True
|
| 405 |
+
return False
|
| 406 |
+
if len(text) > 32:
|
| 407 |
+
return False
|
| 408 |
+
if len(words) == 1:
|
| 409 |
+
return True
|
| 410 |
+
if any(sep in text for sep in "-_"):
|
| 411 |
+
return True
|
| 412 |
+
if words[0].isupper() and len(words[0]) <= 4 and len(words) <= 3:
|
| 413 |
+
return True
|
| 414 |
+
return False
|
| 415 |
+
|
| 416 |
+
|
| 417 |
def apply_rule_assists(filename: str, result: Dict) -> Dict:
|
| 418 |
"""
|
| 419 |
Fill high-confidence structural fields from filename conventions.
|
|
|
|
| 425 |
brackets = bracket_parts(filename)
|
| 426 |
|
| 427 |
if (not repaired.get("group") or (repaired.get("title") and repaired["group"] in repaired["title"])) and brackets:
|
| 428 |
+
first_text, first_start, first_end = brackets[0]
|
| 429 |
+
if first_start == 0 and looks_like_structural_group(first_text, filename, first_end):
|
| 430 |
repaired["group"] = first_text
|
| 431 |
|
| 432 |
if not repaired.get("resolution"):
|
|
|
|
| 434 |
if match:
|
| 435 |
repaired["resolution"] = match.group(0)
|
| 436 |
|
| 437 |
+
source_matches = source_candidates(filename)
|
| 438 |
+
current_source = repaired.get("source")
|
| 439 |
+
preferred_source = source_matches[0] if source_matches else None
|
| 440 |
+
if source_matches and (
|
| 441 |
+
not current_source
|
| 442 |
+
or not SOURCE_RE.fullmatch(str(current_source))
|
| 443 |
+
or len(str(current_source)) <= 3 and str(current_source).lower() not in {"nf", "cr"}
|
| 444 |
+
or (
|
| 445 |
+
preferred_source
|
| 446 |
+
and str(current_source).lower().replace("_", "-") in {"web-dl", "webdl", "webrip", "web-rip"}
|
| 447 |
+
and preferred_source.lower().replace("_", "-") not in {"web-dl", "webdl", "webrip", "web-rip"}
|
| 448 |
+
)
|
| 449 |
+
):
|
| 450 |
+
repaired["source"] = preferred_source
|
| 451 |
+
|
| 452 |
+
if not repaired.get("special"):
|
| 453 |
+
for text, _start, _end in brackets:
|
| 454 |
+
clean = text.strip()
|
| 455 |
+
if SPECIAL_TAG_RE.search(clean):
|
| 456 |
+
repaired["special"] = clean
|
| 457 |
+
break
|
| 458 |
+
|
| 459 |
+
episode = best_structural_episode(filename)
|
| 460 |
+
if episode is not None and (
|
| 461 |
+
repaired.get("episode") is None
|
| 462 |
+
or not plausible_episode_context(filename, int(repaired["episode"]))
|
| 463 |
+
):
|
| 464 |
+
repaired["episode"] = episode
|
| 465 |
|
| 466 |
if repaired.get("season") is None:
|
| 467 |
match = SEASON_RE.search(filename)
|
|
|
|
| 470 |
season = cn_number_to_int(value)
|
| 471 |
if season is not None:
|
| 472 |
repaired["season"] = season
|
| 473 |
+
if repaired.get("season") is None and repaired.get("episode") is not None:
|
| 474 |
+
sequel = structural_sequel_marker(filename, repaired.get("group"), repaired.get("episode"))
|
| 475 |
+
if sequel is not None:
|
| 476 |
+
repaired["season"] = sequel[1]
|
| 477 |
+
elif repaired.get("episode") == repaired.get("season") and not SEASON_RE.search(filename):
|
| 478 |
+
repaired["season"] = None
|
| 479 |
|
| 480 |
title = repaired.get("title")
|
| 481 |
group = repaired.get("group")
|
| 482 |
+
if group and (NOISE_META_RE.search(str(group)) or SOURCE_RE.fullmatch(str(group)) or RESOLUTION_RE.fullmatch(str(group))):
|
| 483 |
+
repaired["group"] = None
|
| 484 |
+
group = None
|
| 485 |
+
|
| 486 |
if title and group and title.startswith(group):
|
| 487 |
title = title[len(group):].lstrip("]】)>})》 \t-_.")
|
| 488 |
repaired["title"] = title or repaired["title"]
|
| 489 |
|
| 490 |
+
if repaired.get("episode"):
|
| 491 |
repaired_title = infer_title_span(filename, group, repaired["episode"])
|
| 492 |
if repaired_title:
|
| 493 |
repaired["title"] = repaired_title
|
| 494 |
|
| 495 |
+
if repaired.get("title") and repaired.get("season") is not None:
|
| 496 |
+
repaired["title"] = strip_trailing_season_from_title(repaired["title"], repaired["season"])
|
| 497 |
+
|
| 498 |
return repaired
|
| 499 |
|
| 500 |
|
| 501 |
+
def structural_sequel_marker(
|
| 502 |
+
filename: str,
|
| 503 |
+
group: Optional[str],
|
| 504 |
+
episode: Optional[int],
|
| 505 |
+
) -> Optional[Tuple[str, int]]:
|
| 506 |
+
if episode is None:
|
| 507 |
+
return None
|
| 508 |
+
title_end = None
|
| 509 |
+
if episode is not None:
|
| 510 |
+
ep_patterns = [
|
| 511 |
+
rf"[Ss]\d{{1,2}}[Ee]0*{episode}(?:v\d+)?",
|
| 512 |
+
rf"\s[-_]\s*0*{episode}(?:v\d+)?(?=$|[\s\[\(【《._-])",
|
| 513 |
+
rf"[\[\(【《]0*{episode}(?:v\d+)?[\]\)】》]",
|
| 514 |
+
rf"#\s*0*{episode}(?:v\d+)?(?=$|[\s\[\(【《._-])",
|
| 515 |
+
rf"(?:^|[\s._\-\[\(【《])第0*{episode}(?:[话話集])?(?=$|[\s._\-\]\)】》])",
|
| 516 |
+
]
|
| 517 |
+
start = 0
|
| 518 |
+
if group:
|
| 519 |
+
first = BRACKET_RE.match(filename)
|
| 520 |
+
if first and group in first.group(0):
|
| 521 |
+
start = first.end()
|
| 522 |
+
for pattern in ep_patterns:
|
| 523 |
+
match = re.search(pattern, filename[start:], re.I)
|
| 524 |
+
if match:
|
| 525 |
+
title_end = start + match.start()
|
| 526 |
+
break
|
| 527 |
+
if title_end is None:
|
| 528 |
+
return None
|
| 529 |
+
|
| 530 |
+
prefix = filename[:title_end].rstrip(" \t-_.")
|
| 531 |
+
for match in reversed(list(SEQUEL_MARKER_RE.finditer(prefix))):
|
| 532 |
+
marker = match.group("marker")
|
| 533 |
+
value = season_marker_number(marker)
|
| 534 |
+
if value is None:
|
| 535 |
+
continue
|
| 536 |
+
tail = prefix[match.end():].strip(" \t-_.")
|
| 537 |
+
if tail:
|
| 538 |
+
continue
|
| 539 |
+
if marker.lower() == "ni" and "Kakuriyo no Yadomeshi Ni" not in prefix:
|
| 540 |
+
continue
|
| 541 |
+
return marker, value
|
| 542 |
+
return None
|
| 543 |
+
|
| 544 |
+
|
| 545 |
+
def normalize_source_text(text: str) -> str:
|
| 546 |
+
text = re.sub(r"\s+", "", text.strip())
|
| 547 |
+
text = re.sub(r"(?i)WEB[_ ]?DL", "WEB-DL", text)
|
| 548 |
+
text = re.sub(r"(?i)WEB[_ ]?Rip", "WebRip", text)
|
| 549 |
+
text = re.sub(r"(?i)U[_ ]?NEXT", "U-NEXT", text)
|
| 550 |
+
text = re.sub(r"(?i)AT[_ ]?X", "AT-X", text)
|
| 551 |
+
return text.replace("_", "-")
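Review note: the normalization above can be exercised on its own. The block below copies `normalize_source_text` from this hunk and runs it on a few sample strings; the sample inputs are assumptions chosen for illustration, not cases taken from the repository's test data.

```python
import re


def normalize_source_text(text: str) -> str:
    # Copied from the hunk: drop all whitespace, then canonicalise separators.
    text = re.sub(r"\s+", "", text.strip())
    text = re.sub(r"(?i)WEB[_ ]?DL", "WEB-DL", text)
    text = re.sub(r"(?i)WEB[_ ]?Rip", "WebRip", text)
    text = re.sub(r"(?i)U[_ ]?NEXT", "U-NEXT", text)
    text = re.sub(r"(?i)AT[_ ]?X", "AT-X", text)
    return text.replace("_", "-")


# Hypothetical sample inputs.
samples = ["WEB DL", "web_rip", "U NEXT", "WEB-DL", "NF"]
normalized = [normalize_source_text(sample) for sample in samples]
print(normalized)
```

Because whitespace is removed first, the `[_ ]?` alternatives only ever see the underscore form or no separator at all. A value that is already hyphenated, such as `WEB-DL`, passes through without a case change.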
|
| 552 |
+
|
| 553 |
+
|
| 554 |
+
def source_priority(source: str) -> int:
|
| 555 |
+
normalized = source.lower().replace("_", "-").replace(" ", "")
|
| 556 |
+
parts = re.split(r"[&+/,]", normalized)
|
| 557 |
+
if any(part in {"nf", "netflix", "amzn", "baha", "cr", "abema", "dsnp", "u-next", "hulu", "at-x"} for part in parts):
|
| 558 |
+
return 90
|
| 559 |
+
if any(part in {"web-dl", "webdl", "webrip", "web-rip", "bdrip", "bluray", "bdmv", "bd", "dvdrip", "dvd", "tvrip", "hdtv"} for part in parts):
|
| 560 |
+
return 60
|
| 561 |
+
if len(parts) > 1:
|
| 562 |
+
return 40
|
| 563 |
+
return 20
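Review note: a small standalone check of the tiering in `source_priority`. The function body follows the hunk; the two set names (`STREAMING_PARTS`, `RELEASE_PARTS`) are introduced here only for readability and are not identifiers from the repository.

```python
import re

# Hypothetical names for the two literal sets used in the hunk.
STREAMING_PARTS = {"nf", "netflix", "amzn", "baha", "cr", "abema", "dsnp", "u-next", "hulu", "at-x"}
RELEASE_PARTS = {"web-dl", "webdl", "webrip", "web-rip", "bdrip", "bluray", "bdmv", "bd", "dvdrip", "dvd", "tvrip", "hdtv"}


def source_priority(source: str) -> int:
    normalized = source.lower().replace("_", "-").replace(" ", "")
    parts = re.split(r"[&+/,]", normalized)
    if any(part in STREAMING_PARTS for part in parts):
        return 90  # streaming platform tags rank highest
    if any(part in RELEASE_PARTS for part in parts):
        return 60  # rip / disc / broadcast tags
    if len(parts) > 1:
        return 40  # compound tag with no known part, e.g. language pairs
    return 20


for tag in ["NF", "WEB-DL", "CHS&CHT", "CHS", "Baha+WEB-DL"]:
    print(tag, source_priority(tag))
```

A compound tag takes the tier of its best part, so `Baha+WEB-DL` ranks with the streaming platforms rather than with plain `WEB-DL`.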
|
| 564 |
+
|
| 565 |
+
|
| 566 |
+
def source_candidates(filename: str) -> List[str]:
|
| 567 |
+
candidates: List[Tuple[int, int, str]] = []
|
| 568 |
+
for text, start, _end in bracket_parts(filename):
|
| 569 |
+
clean = text.strip()
|
| 570 |
+
if SOURCE_TAG_RE.fullmatch(clean):
|
| 571 |
+
normalized = normalize_source_text(clean)
|
| 572 |
+
candidates.append((source_priority(normalized), -start, normalized))
|
| 573 |
+
|
| 574 |
+
for match in SOURCE_RE.finditer(filename):
|
| 575 |
+
normalized = normalize_source_text(match.group(0))
|
| 576 |
+
candidates.append((source_priority(normalized), -match.start(), normalized))
|
| 577 |
+
|
| 578 |
+
deduped: Dict[str, Tuple[int, int, str]] = {}
|
| 579 |
+
for priority, neg_start, value in candidates:
|
| 580 |
+
key = value.lower()
|
| 581 |
+
if key not in deduped or (priority, neg_start) > (deduped[key][0], deduped[key][1]):
|
| 582 |
+
deduped[key] = (priority, neg_start, value)
|
| 583 |
+
|
| 584 |
+
return [value for _priority, _neg_start, value in sorted(deduped.values(), reverse=True)]
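Review note: the dedupe-then-sort step at the end of `source_candidates` relies on tuple ordering. The sketch below isolates that step; `rank_sources` is a hypothetical helper name and the candidate tuples are made-up inputs in the same `(priority, -start, value)` shape the hunk builds.

```python
from typing import Dict, List, Tuple


def rank_sources(candidates: List[Tuple[int, int, str]]) -> List[str]:
    # Keep one entry per case-insensitive value: the one with the highest
    # (priority, -start), i.e. best tier first, then earliest position.
    deduped: Dict[str, Tuple[int, int, str]] = {}
    for priority, neg_start, value in candidates:
        key = value.lower()
        if key not in deduped or (priority, neg_start) > (deduped[key][0], deduped[key][1]):
            deduped[key] = (priority, neg_start, value)
    # Reverse tuple sort: highest priority first, ties broken by earliest start.
    return [value for _priority, _neg_start, value in sorted(deduped.values(), reverse=True)]


ranked = rank_sources([(60, -20, "WEB-DL"), (90, -30, "NF"), (60, -5, "web-dl")])
print(ranked)
```

Storing the negated start offset lets a single `reverse=True` sort prefer both higher priority and earlier position.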
|
| 585 |
+
|
| 586 |
+
|
| 587 |
+
def best_structural_episode(filename: str) -> Optional[int]:
|
| 588 |
+
priorities = {
|
| 589 |
+
"season_episode": 1000,
|
| 590 |
+
"dash_episode": 900,
|
| 591 |
+
"bracket_episode": 850,
|
| 592 |
+
"explicit_episode": 800,
|
| 593 |
+
"long_episode": 750,
|
| 594 |
+
"generic_episode": 100,
|
| 595 |
+
}
|
| 596 |
+
candidates: List[Tuple[int, int, int]] = []
|
| 597 |
+
for name, pattern in EPISODE_PATTERNS:
|
| 598 |
+
for match in pattern.finditer(filename):
|
| 599 |
+
ep_text = match.group("ep")
|
| 600 |
+
ep = int(ep_text)
|
| 601 |
+
if ep == 0 or ep > 2000:
|
| 602 |
+
continue
|
| 603 |
+
context = filename[max(0, match.start() - 5):match.end() + 5]
|
| 604 |
+
if RESOLUTION_RE.search(context) or re.search(r"AAC|DDP|AC3|H\.?26[45]|x26[45]", context, re.I):
|
| 605 |
+
continue
|
| 606 |
+
priority = priorities[name]
|
| 607 |
+
if 1 <= ep <= 200:
|
| 608 |
+
priority += 20
|
| 609 |
+
candidates.append((priority, match.start(), ep))
|
| 610 |
+
if not candidates:
|
| 611 |
+
return None
|
| 612 |
+
return max(candidates, key=lambda item: (item[0], item[1]))[2]
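Review note: a reduced sketch of the selection logic in `best_structural_episode`, assuming only two of the six patterns (`dash_episode` and `generic_episode`) and omitting both the audio/codec context filter and the +20 bonus for episodes 1 to 200. `pick_episode` and `PATTERNS` are hypothetical names; the regexes are copied from `EPISODE_PATTERNS` and `RESOLUTION_RE` in this diff.

```python
import re
from typing import List, Optional, Tuple

RESOLUTION_RE = re.compile(r"(?<![A-Za-z0-9])(?:\d{3,4}[pP]|\d[Kk]|\d{3,4}[xX×]\d{3,4})(?![A-Za-z0-9])")
PATTERNS = [
    (900, re.compile(r"(?:^|[\s._])[-_]\s*(?P<ep>\d{1,4})(?:v\d+)?(?=$|[\s._\-\]\)】》\[])")),
    (100, re.compile(r"(?:^|[\s._\-\[\(【《#])(?P<ep>\d{1,3})(?:v\d+)?(?=$|[\s._\-\]\)】》])", re.I)),
]


def pick_episode(filename: str) -> Optional[int]:
    candidates: List[Tuple[int, int, int]] = []
    for priority, pattern in PATTERNS:
        for match in pattern.finditer(filename):
            ep = int(match.group("ep"))
            if ep == 0 or ep > 2000:
                continue
            # Skip numbers whose +/-5 character window contains a resolution.
            context = filename[max(0, match.start() - 5):match.end() + 5]
            if RESOLUTION_RE.search(context):
                continue
            candidates.append((priority, match.start(), ep))
    if not candidates:
        return None
    # Highest pattern priority wins; ties go to the right-most match.
    return max(candidates, key=lambda item: (item[0], item[1]))[2]


print(pick_episode("[Group] Title 2 - 05 [1080p].mkv"))
print(pick_episode("Title 1080p"))
```

In the first filename the bare `2` is a low-priority generic candidate, so the dash-delimited `05` is chosen. The second filename produces no candidate at all.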
|
| 613 |
+
|
| 614 |
+
|
| 615 |
+
def plausible_episode_context(filename: str, episode: int) -> bool:
|
| 616 |
+
ep_text = str(episode)
|
| 617 |
+
padded = f"{episode:02d}"
|
| 618 |
+
if re.search(rf"(?<![A-Za-z0-9])(?:H|x)\.?0*{re.escape(ep_text)}(?!\d)", filename, re.I):
|
| 619 |
+
return False
|
| 620 |
+
patterns = [
|
| 621 |
+
rf"[Ss]\d{{1,2}}[Ee]0*{episode}(?:v\d+)?",
|
| 622 |
+
rf"(?:^|[\s._])[-_]\s*0*{episode}(?:v\d+)?(?=$|[\s._\-\]\)】》\[])",
|
| 623 |
+
rf"[\[\(【《](?:EP?|#)?0*{episode}(?:v\d+)?[\]\)】》]",
|
| 624 |
+
rf"(?:^|[\s._\-\[\(【《#])(?:EP?|第|#)0*{episode}(?:v\d+)?(?:[话話集])?(?=$|[\s._\-\]\)】》])",
|
| 625 |
+
rf"(?:^|[\s._\-\[\(【《])0*{episode}(?:v\d+)?(?=[\s._\-\]\)】》\[]+(?:\d{{3,4}}[pP]|WEB|BD|BluRay|HDTV|NF|AMZN|CR|Baha))",
|
| 626 |
+
]
|
| 627 |
+
return any(re.search(pattern, filename, re.I) for pattern in patterns) or bool(
|
| 628 |
+
re.search(rf"(?:^|[\s._\-\[\(【《])(?:{re.escape(ep_text)}|{re.escape(padded)})(?=$|[\s._\-\]\)】》])", filename)
|
| 629 |
+
)
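Review note: the first guard in `plausible_episode_context` rejects numbers that belong to a codec tag. The sketch isolates that single check; `looks_like_codec_number` is a hypothetical name and the filenames are illustrative only.

```python
import re


def looks_like_codec_number(filename: str, episode: int) -> bool:
    # True when the digits follow "H", "H.", or "x", as in H.265 or x264.
    ep_text = str(episode)
    pattern = rf"(?<![A-Za-z0-9])(?:H|x)\.?0*{re.escape(ep_text)}(?!\d)"
    return bool(re.search(pattern, filename, re.I))


print(looks_like_codec_number("[Sub] Title - 03 [x264].mkv", 264))
print(looks_like_codec_number("[Sub] Title - 264 [1080p].mkv", 264))
print(looks_like_codec_number("Title H.265 - 12", 265))
```

Note that in the hunk this guard returns `False` whenever the codec form appears anywhere in the filename, even if the same number also occurs as a genuine episode elsewhere.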
|
| 630 |
+
|
| 631 |
+
|
| 632 |
+
def strip_trailing_season_from_title(title: str, season: int) -> str:
|
| 633 |
+
season_text = str(season)
|
| 634 |
+
patterns = [
|
| 635 |
+
rf"\s+[Ss]0*{season_text}$",
|
| 636 |
+
rf"\s+Season\s*0*{season_text}$",
|
| 637 |
+
rf"\s+0*{season_text}$",
|
| 638 |
+
]
|
| 639 |
+
cleaned = title
|
| 640 |
+
for pattern in patterns:
|
| 641 |
+
cleaned = re.sub(pattern, "", cleaned, flags=re.I).strip(" \t-_.")
|
| 642 |
+
match = TRAILING_SEQUEL_MARKER_RE.search(cleaned)
|
| 643 |
+
if match and season_marker_number(match.group("marker")) == season:
|
| 644 |
+
cleaned = cleaned[:match.start()].strip(" \t-_.")
|
| 645 |
+
return cleaned or title
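Review note: a sketch of the numeric half of `strip_trailing_season_from_title`. It deliberately omits the `TRAILING_SEQUEL_MARKER_RE` step, which depends on `season_marker_number`. `strip_numeric_season_suffix` is a hypothetical name and the titles are illustrative.

```python
import re


def strip_numeric_season_suffix(title: str, season: int) -> str:
    season_text = str(season)
    patterns = [
        rf"\s+[Ss]0*{season_text}$",        # "Title S2", "Title S02"
        rf"\s+Season\s*0*{season_text}$",   # "Title Season 2"
        rf"\s+0*{season_text}$",            # "Title 2"
    ]
    cleaned = title
    for pattern in patterns:
        cleaned = re.sub(pattern, "", cleaned, flags=re.I).strip(" \t-_.")
    # Fall back to the original title rather than returning an empty string.
    return cleaned or title


for title, season in [("Title S2", 2), ("Title Season 02", 2), ("Title 2", 2), ("Mob Psycho 100", 1)]:
    print(repr(strip_numeric_season_suffix(title, season)))
```

The `$` anchor together with the exact season value keeps numbers that are part of the title, such as the `100` in the last sample, untouched.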
|
| 646 |
+
|
| 647 |
+
|
| 648 |
+
def clean_inferred_title(title: str) -> str:
|
| 649 |
+
raw_title = title.strip(" \t-_.")
|
| 650 |
+
bracket_matches = list(BRACKET_RE.finditer(raw_title))
|
| 651 |
+
if bracket_matches:
|
| 652 |
+
first = bracket_matches[0]
|
| 653 |
+
prefix = raw_title[:first.start()].strip(" \t-_.★☆")
|
| 654 |
+
text = next(group for group in first.groups() if group is not None).strip()
|
| 655 |
+
if text and not looks_like_episode_or_meta(text) and (
|
| 656 |
+
not prefix
|
| 657 |
+
or re.search(r"(?:新番|月|合集|繁|简|字幕|先行|合集|★|☆)", prefix, re.I)
|
| 658 |
+
):
|
| 659 |
+
return text
|
| 660 |
+
return raw_title.strip("[]()【】《》()")
|
| 661 |
+
|
| 662 |
+
|
| 663 |
def infer_title_span(filename: str, group: Optional[str], episode: Optional[int]) -> Optional[str]:
|
| 664 |
start = 0
|
| 665 |
if group:
|
| 666 |
first = BRACKET_RE.match(filename)
|
| 667 |
if first and group in first.group(0):
|
| 668 |
start = first.end()
|
| 669 |
+
else:
|
| 670 |
+
# Some releases put leading metadata before the actual title, e.g.
|
| 671 |
+
# `[1080p] Title - 01`. Do not keep that wrapper as title text.
|
| 672 |
+
while True:
|
| 673 |
+
leading = BRACKET_RE.match(filename[start:].lstrip(" \t._-"))
|
| 674 |
+
if not leading:
|
| 675 |
+
break
|
| 676 |
+
skipped_ws = len(filename[start:]) - len(filename[start:].lstrip(" \t._-"))
|
| 677 |
+
text = next(group for group in leading.groups() if group is not None)
|
| 678 |
+
if not looks_like_episode_or_meta(text):
|
| 679 |
+
break
|
| 680 |
+
start += skipped_ws + leading.end()
|
| 681 |
|
| 682 |
end = None
|
| 683 |
if episode is not None:
|
| 684 |
ep_patterns = [
|
| 685 |
+
rf"[Ss]\d{{1,2}}[Ee]0*{episode}(?:v\d+)?",
|
| 686 |
rf"\s[-_]\s*0*{episode}(?:v\d+)?(?=$|[\s\[\(【《._-])",
|
| 687 |
rf"[\[\(【《]0*{episode}(?:v\d+)?[\]\)】》]",
|
| 688 |
+
rf"#\s*0*{episode}(?:v\d+)?(?=$|[\s\[\(【《._-])",
|
| 689 |
+
rf"(?:^|[\s._\-\[\(【《])第0*{episode}(?:[话話集])?(?=$|[\s._\-\]\)】》])",
|
| 690 |
rf"[Ee]0*{episode}(?:v\d+)?",
|
| 691 |
]
|
| 692 |
for pattern in ep_patterns:
|
|
|
|
| 705 |
|
| 706 |
if end is None or end <= start:
|
| 707 |
return None
|
| 708 |
+
title = clean_inferred_title(filename[start:end])
|
| 709 |
return title or None
|
| 710 |
|
| 711 |
|
|
|
|
| 741 |
|
| 742 |
# Convert to input IDs
|
| 743 |
input_ids = tokenizer.convert_tokens_to_ids(tokens)
|
| 744 |
+
embedding_size = model.get_input_embeddings().weight.shape[0]
|
| 745 |
+
out_of_range_tokens = [
|
| 746 |
+
token for token, token_id in zip(tokens, input_ids)
|
| 747 |
+
if token_id >= embedding_size
|
| 748 |
+
]
|
| 749 |
+
if out_of_range_tokens:
|
| 750 |
+
input_ids = [
|
| 751 |
+
token_id if token_id < embedding_size else tokenizer.unk_token_id
|
| 752 |
+
for token_id in input_ids
|
| 753 |
+
]
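Review note: the clamp above protects against a tokenizer vocabulary that is larger than the checkpoint's embedding table. The sketch reproduces the idea with plain lists so it runs without a model. `clamp_to_embedding` is a hypothetical helper, and the ids, `embedding_size`, and `unk_token_id` values are made up.

```python
from typing import List, Tuple


def clamp_to_embedding(
    tokens: List[str],
    input_ids: List[int],
    embedding_size: int,
    unk_token_id: int,
) -> Tuple[List[int], List[str]]:
    # Tokens whose id would index past the end of the embedding matrix.
    out_of_range = [token for token, token_id in zip(tokens, input_ids) if token_id >= embedding_size]
    # Replace those ids with the unknown-token id so the lookup stays valid.
    safe_ids = [token_id if token_id < embedding_size else unk_token_id for token_id in input_ids]
    return safe_ids, out_of_range


safe_ids, out_of_range = clamp_to_embedding(["[", "Sub", "]", "新"], [5, 17, 6, 4096], 4000, 1)
print(safe_ids, out_of_range)
```

Since clamped ids become `unk_token_id`, they are also counted by the `unk_tokens` statistics computed right after this block in the hunk.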
|
| 754 |
unk_token_id = tokenizer.unk_token_id
|
| 755 |
unk_tokens = [token for token, token_id in zip(tokens, input_ids) if token_id == unk_token_id]
|
| 756 |
|
|
|
|
| 819 |
"unk_count": len(unk_tokens),
|
| 820 |
"unk_rate": len(unk_tokens) / len(tokens) if tokens else 0.0,
|
| 821 |
"unk_tokens": unk_tokens[:50],
|
| 822 |
+
"vocab_mismatch": bool(out_of_range_tokens),
|
| 823 |
+
"model_embedding_size": int(embedding_size),
|
| 824 |
+
"tokenizer_vocab_size": int(tokenizer.vocab_size),
|
| 825 |
+
"out_of_range_tokens": out_of_range_tokens[:50],
|
| 826 |
"tokens": tokens[:available],
|
| 827 |
"labels": label_strings,
|
| 828 |
"scores": [round(float(score), 4) for score in selected_scores],
|
|
|
|
| 851 |
parser.add_argument("filename", nargs="?", type=str, help="Anime filename to parse")
|
| 852 |
parser.add_argument("--input-file", type=str, help="File with filenames (one per line)")
|
| 853 |
parser.add_argument("--output-file", type=str, help="Output file for results (JSONL)")
|
| 854 |
+
parser.add_argument("--model-dir", type=str, default=".",
|
| 855 |
help="Path to trained model directory")
|
| 856 |
parser.add_argument("--tokenizer", choices=["regex", "char"], default=None,
|
| 857 |
help="Tokenizer variant override. Defaults to checkpoint metadata")
|
label_repairs.py
ADDED
|
@@ -0,0 +1,513 @@
|
|
| 1 |
+
"""Deterministic label repairs for known weak-label blind spots."""
|
| 2 |
+
|
| 3 |
+
from __future__ import annotations
|
| 4 |
+
|
| 5 |
+
import re
|
| 6 |
+
from dataclasses import dataclass
|
| 7 |
+
from typing import Dict, Iterable, List, Optional, Sequence, Tuple
|
| 8 |
+
|
| 9 |
+
|
| 10 |
+
SEPARATOR_CHARS = set(" \t-_.|~~")
|
| 11 |
+
|
| 12 |
+
ROMAN_NUMERAL_VALUES = {
|
| 13 |
+
"II": 2,
|
| 14 |
+
"III": 3,
|
| 15 |
+
"IV": 4,
|
| 16 |
+
"V": 5,
|
| 17 |
+
"VI": 6,
|
| 18 |
+
"VII": 7,
|
| 19 |
+
"VIII": 8,
|
| 20 |
+
"IX": 9,
|
| 21 |
+
"Ⅱ": 2,
|
| 22 |
+
"Ⅲ": 3,
|
| 23 |
+
"Ⅳ": 4,
|
| 24 |
+
"Ⅴ": 5,
|
| 25 |
+
"Ⅵ": 6,
|
| 26 |
+
"Ⅶ": 7,
|
| 27 |
+
"Ⅷ": 8,
|
| 28 |
+
"Ⅸ": 9,
|
| 29 |
+
}
|
| 30 |
+
|
| 31 |
+
CN_NUMERAL_VALUES = {
|
| 32 |
+
"一": 1,
|
| 33 |
+
"二": 2,
|
| 34 |
+
"兩": 2,
|
| 35 |
+
"两": 2,
|
| 36 |
+
"貳": 2,
|
| 37 |
+
"贰": 2,
|
| 38 |
+
"弐": 2,
|
| 39 |
+
"弍": 2,
|
| 40 |
+
"三": 3,
|
| 41 |
+
"參": 3,
|
| 42 |
+
"叁": 3,
|
| 43 |
+
"参": 3,
|
| 44 |
+
"四": 4,
|
| 45 |
+
"肆": 4,
|
| 46 |
+
"五": 5,
|
| 47 |
+
"伍": 5,
|
| 48 |
+
"六": 6,
|
| 49 |
+
"陸": 6,
|
| 50 |
+
"陆": 6,
|
| 51 |
+
"七": 7,
|
| 52 |
+
"柒": 7,
|
| 53 |
+
"八": 8,
|
| 54 |
+
"捌": 8,
|
| 55 |
+
"九": 9,
|
| 56 |
+
"玖": 9,
|
| 57 |
+
"十": 10,
|
| 58 |
+
}
|
| 59 |
+
|
| 60 |
+
READING_MARKER_VALUES = {
|
| 61 |
+
"ni no sara": 2,
|
| 62 |
+
"ni no shou": 2,
|
| 63 |
+
"ni no sho": 2,
|
| 64 |
+
"ni no syo": 2,
|
| 65 |
+
"ni no shō": 2,
|
| 66 |
+
"ni gakki": 2,
|
| 67 |
+
"sono ni": 2,
|
| 68 |
+
"san no sara": 3,
|
| 69 |
+
"san no shou": 3,
|
| 70 |
+
"san no sho": 3,
|
| 71 |
+
"san no syo": 3,
|
| 72 |
+
"yon no sara": 4,
|
| 73 |
+
"shi no sara": 4,
|
| 74 |
+
"shin no sara": 4,
|
| 75 |
+
"go no sara": 5,
|
| 76 |
+
"gou no sara": 5,
|
| 77 |
+
}
|
| 78 |
+
|
| 79 |
+
# Bare "Ni" is often the Japanese particle に in romanized titles. Only repair
|
| 80 |
+
# it for titles that have been verified as a sequel marker in the release name.
|
| 81 |
+
STANDALONE_NI_SEASON_BASES = {
|
| 82 |
+
"Kakuriyo no Yadomeshi": 2,
|
| 83 |
+
}
|
| 84 |
+
|
| 85 |
+
EPISODE_CONTEXT_RE = re.compile(
|
| 86 |
+
r"^\s*(?:"
|
| 87 |
+
r"[-_]\s*(?:\d{1,4}|NCOP|NCED|OP|ED|OVA|OAD|SP|END)\b|"
|
| 88 |
+
r"#\s*\d{1,4}|"
|
| 89 |
+
r"[\[\(【《]\s*(?:EP?|#)?\d{1,4}"
|
| 90 |
+
r")",
|
| 91 |
+
re.I,
|
| 92 |
+
)
|
| 93 |
+
|
| 94 |
+
EPISODE_SPAN_RE = re.compile(
|
| 95 |
+
r"(?:"
|
| 96 |
+
r"[Ss]\d{1,2}[Ee]\d{1,4}(?:v\d+)?|"
|
| 97 |
+
r"(?:^|[\s._])[-_]\s*\d{1,4}(?:v\d+)?(?=$|[\s._\-\]\)】》\[])|"
|
| 98 |
+
r"[\[\(【《](?:EP?|#)?\d{1,4}(?:v\d+)?[\]\)】》]|"
|
| 99 |
+
r"(?:^|[\s._\-\[\(【《#])(?:EP?|第|#)\d{1,4}(?:v\d+)?(?:[话話集])?(?=$|[\s._\-\]\)】》])"
|
| 100 |
+
r")",
|
| 101 |
+
re.I,
|
| 102 |
+
)
|
| 103 |
+
BRACKET_RE = re.compile(r"\[([^\]]*)\]|\(([^)]*)\)|【([^】]*)】|《([^》]*)》")
|
| 104 |
+
RESOLUTION_RE = re.compile(r"(?<![A-Za-z0-9])(?:\d{3,4}[pP]|\d[Kk]|\d{3,4}[xX×]\d{3,4})(?![A-Za-z0-9])")
|
| 105 |
+
SOURCE_TOKEN_PATTERN = (
|
| 106 |
+
r"WEB[-_ ]?DL|WEB[-_ ]?Rip|BDRip|BluRay|BDMV|BD|DVDRip|DVD|TVRip|HDTV|"
|
| 107 |
+
r"Netflix|NF|AMZN|Baha|CR|ABEMA|DSNP|U[-_ ]?NEXT|Hulu|AT[-_ ]?X|"
|
| 108 |
+
r"x26[45]|h\.?26[45]|HEVC|AVC|AV1|AAC\d*(?:\.\d+)?|AAC|FLAC|MP3|DTS|Opus|"
|
| 109 |
+
r"CHS|CHT|GB|BIG5|JPN?|JPSC|JPTC|繁中|简中"
|
| 110 |
+
)
|
| 111 |
+
SOURCE_RE = re.compile(rf"(?<![A-Za-z0-9])(?:{SOURCE_TOKEN_PATTERN})(?![A-Za-z0-9])", re.I)
|
| 112 |
+
SOURCE_TAG_RE = re.compile(
|
| 113 |
+
rf"^(?:{SOURCE_TOKEN_PATTERN})(?:\s*(?:[&+/,_-]|,\s*)\s*(?:{SOURCE_TOKEN_PATTERN}))*$",
|
| 114 |
+
re.I,
|
| 115 |
+
)
|
| 116 |
+
SPECIAL_TAG_RE = re.compile(
|
| 117 |
+
r"^(?:檢索|检索|搜索|搜寻|搜尋|别名|別名|alias|search|keyword)\s*[::].+",
|
| 118 |
+
re.I,
|
| 119 |
+
)
|
| 120 |
+
|
| 121 |
+
READING_MARKER_RE = re.compile(
|
| 122 |
+
r"(?<![A-Za-z0-9])"
|
| 123 |
+
r"(?P<marker>"
|
| 124 |
+
r"Ni\s+no\s+(?:Sara|Shou|Sho|Syo|Shō)|"
|
| 125 |
+
r"San\s+no\s+(?:Sara|Shou|Sho|Syo)|"
|
| 126 |
+
r"(?:Yon|Shi|Shin)\s+no\s+Sara|"
|
| 127 |
+
r"(?:Go|Gou)\s+no\s+Sara|"
|
| 128 |
+
r"Ni\s+Gakki|"
|
| 129 |
+
r"Sono\s+Ni"
|
| 130 |
+
r")"
|
| 131 |
+
r"(?![A-Za-z0-9])",
|
| 132 |
+
)
|
| 133 |
+
|
| 134 |
+
ROMAN_MARKER_RE = re.compile(
|
| 135 |
+
r"(?<![A-Za-z0-9])"
|
| 136 |
+
r"(?P<marker>II|III|IV|V|VI|VII|VIII|IX|[ⅡⅢⅣⅤⅥⅦⅧⅨ])"
|
| 137 |
+
r"(?![A-Za-z0-9])"
|
| 138 |
+
)
|
| 139 |
+
|
| 140 |
+
CJK_MARKER_RE = re.compile(
|
| 141 |
+
r"(?P<marker>"
|
| 142 |
+
r"[一二三四五六七八九十兩两貳贰弐弍參叁参肆伍陸陆柒捌玖](?:\s*(?:ノ|の|之)\s*(?:章|期|季|部))?|"
|
| 143 |
+
r"第[一二三四五六七八九十兩两貳贰弐弍參叁参肆伍陸陆柒捌玖\d]+[季期部章]"
|
| 144 |
+
r")"
|
| 145 |
+
)
|
| 146 |
+
|
| 147 |
+
|
| 148 |
+
@dataclass(frozen=True)
|
| 149 |
+
class LabelRepair:
|
| 150 |
+
kind: str
|
| 151 |
+
marker: str
|
| 152 |
+
value: int
|
| 153 |
+
start: int
|
| 154 |
+
end: int
|
| 155 |
+
|
| 156 |
+
|
| 157 |
+
def clean_marker_text(text: str) -> str:
|
| 158 |
+
return text.strip().strip("[]()【】《》()").strip()
|
| 159 |
+
|
| 160 |
+
|
| 161 |
+
def cn_number_to_int(text: str) -> Optional[int]:
|
| 162 |
+
text = text.strip()
|
| 163 |
+
if text.isdigit():
|
| 164 |
+
return int(text)
|
| 165 |
+
if text in CN_NUMERAL_VALUES:
|
| 166 |
+
return CN_NUMERAL_VALUES[text]
|
| 167 |
+
values = CN_NUMERAL_VALUES
|
| 168 |
+
if text.startswith("十") and len(text) == 2:
|
| 169 |
+
return 10 + values.get(text[1], 0)
|
| 170 |
+
if text.endswith("十") and len(text) == 2:
|
| 171 |
+
return values.get(text[0], 0) * 10
|
| 172 |
+
if "十" in text and len(text) == 3:
|
| 173 |
+
return values.get(text[0], 0) * 10 + values.get(text[2], 0)
|
| 174 |
+
return None
|
| 175 |
+
|
| 176 |
+
|
| 177 |
+
def season_marker_number(text: str) -> Optional[int]:
|
| 178 |
+
"""Return season number for compact sequel markers such as II or Ni no Sara."""
|
| 179 |
+
clean = clean_marker_text(text)
|
| 180 |
+
if not clean:
|
| 181 |
+
return None
|
| 182 |
+
|
| 183 |
+
if clean in ROMAN_NUMERAL_VALUES:
|
| 184 |
+
return ROMAN_NUMERAL_VALUES[clean]
|
| 185 |
+
|
| 186 |
+
lowered = re.sub(r"\s+", " ", clean.lower()).strip()
|
| 187 |
+
if lowered in READING_MARKER_VALUES:
|
| 188 |
+
return READING_MARKER_VALUES[lowered]
|
| 189 |
+
if lowered == "ni":
|
| 190 |
+
return 2
|
| 191 |
+
|
| 192 |
+
explicit = re.fullmatch(r"第(.+)[季期部章]", clean)
|
| 193 |
+
if explicit:
|
| 194 |
+
return cn_number_to_int(explicit.group(1))
|
| 195 |
+
|
| 196 |
+
cjk = re.fullmatch(r"([一二三四五六七八九十兩两貳贰弐弍參叁参肆伍陸陆柒捌玖])(?:\s*(?:ノ|の|之)\s*(?:章|期|季|部))?", clean)
|
| 197 |
+
if cjk:
|
| 198 |
+
return cn_number_to_int(cjk.group(1))
|
| 199 |
+
|
| 200 |
+
return None
|
| 201 |
+
|
| 202 |
+
|
| 203 |
+
def token_offsets_in_text(text: str, tokens: Sequence[str]) -> Optional[List[Tuple[int, int]]]:
|
| 204 |
+
offsets: List[Tuple[int, int]] = []
|
| 205 |
+
cursor = 0
|
| 206 |
+
for token in tokens:
|
| 207 |
+
if token == "":
|
| 208 |
+
offsets.append((cursor, cursor))
|
| 209 |
+
continue
|
| 210 |
+
position = text.find(token, cursor)
|
| 211 |
+
if position < 0:
|
| 212 |
+
return None
|
| 213 |
+
end = position + len(token)
|
| 214 |
+
offsets.append((position, end))
|
| 215 |
+
cursor = end
|
| 216 |
+
return offsets
|
| 217 |
+
|
| 218 |
+
|
| 219 |
+
def has_episode_context(text: str, marker_end: int) -> bool:
|
| 220 |
+
tail = text[marker_end:]
|
| 221 |
+
if EPISODE_CONTEXT_RE.match(tail):
|
| 222 |
+
return True
|
| 223 |
+
|
| 224 |
+
# Some releases put a season marker at the end of a title bracket and the
|
| 225 |
+
# episode in the next bracket: `[Title 貳之章][01]`.
|
| 226 |
+
tail = tail.lstrip()
|
| 227 |
+
tail = re.sub(r"^[\]\)】》]\s*", "", tail)
|
| 228 |
+
tail = re.sub(
|
| 229 |
+
r"^(?:[\[\(【《]\s*(?:menu|menus|bdmenu|ncop|nced|op|ed|ova|oad|sp)\s*[\]\)】》]\s*){0,2}",
|
| 230 |
+
"",
|
| 231 |
+
tail,
|
| 232 |
+
flags=re.I,
|
| 233 |
+
)
|
| 234 |
+
return bool(EPISODE_CONTEXT_RE.match(tail))
|
| 235 |
+
|
| 236 |
+
|
| 237 |
+
def find_sequel_season_markers(text: str) -> List[LabelRepair]:
|
| 238 |
+
"""Find high-confidence sequel markers that should be labeled as SEASON."""
|
| 239 |
+
repairs: List[LabelRepair] = []
|
| 240 |
+
|
| 241 |
+
for pattern, kind in (
|
| 242 |
+
(READING_MARKER_RE, "reading"),
|
| 243 |
+
(ROMAN_MARKER_RE, "roman"),
|
| 244 |
+
(CJK_MARKER_RE, "cjk"),
|
| 245 |
+
):
|
| 246 |
+
for match in pattern.finditer(text):
|
| 247 |
+
marker = match.group("marker")
|
| 248 |
+
value = season_marker_number(marker)
|
| 249 |
+
if value is None or not has_episode_context(text, match.end()):
|
| 250 |
+
continue
|
| 251 |
+
repairs.append(LabelRepair(kind, marker, value, match.start(), match.end()))
|
| 252 |
+
|
| 253 |
+
for base, value in STANDALONE_NI_SEASON_BASES.items():
|
| 254 |
+
pattern = re.compile(rf"(?<![A-Za-z0-9]){re.escape(base)}\s+(?P<marker>Ni)(?![A-Za-z0-9])")
|
| 255 |
+
for match in pattern.finditer(text):
|
| 256 |
+
if not has_episode_context(text, match.end("marker")):
|
| 257 |
+
continue
|
| 258 |
+
repairs.append(
|
| 259 |
+
LabelRepair(
|
| 260 |
+
kind="verified_bare_ni",
|
| 261 |
+
marker=match.group("marker"),
|
| 262 |
+
value=value,
|
| 263 |
+
start=match.start("marker"),
|
| 264 |
+
end=match.end("marker"),
|
| 265 |
+
)
|
| 266 |
+
)
|
| 267 |
+
|
| 268 |
+
repairs.sort(key=lambda item: (item.start, item.end))
|
| 269 |
+
deduped: List[LabelRepair] = []
|
| 270 |
+
for repair in repairs:
|
| 271 |
+
if deduped and repair.start < deduped[-1].end:
|
| 272 |
+
previous = deduped[-1]
|
| 273 |
+
if (repair.end - repair.start) > (previous.end - previous.start):
|
| 274 |
+
deduped[-1] = repair
|
| 275 |
+
continue
|
| 276 |
+
deduped.append(repair)
|
| 277 |
+
return deduped
|
| 278 |
+
|
| 279 |
+
|
| 280 |
+
def labels_have_season_before(labels: Sequence[str], offsets: Sequence[Tuple[int, int]], marker_start: int) -> bool:
|
| 281 |
+
return any(label.endswith("SEASON") and end <= marker_start for label, (_start, end) in zip(labels, offsets))
|
| 282 |
+
|
| 283 |
+
|
| 284 |
+
def token_indices_for_span(offsets: Sequence[Tuple[int, int]], start: int, end: int) -> List[int]:
|
| 285 |
+
return [
|
| 286 |
+
idx for idx, (tok_start, tok_end) in enumerate(offsets)
|
| 287 |
+
if tok_start < end and tok_end > start
|
| 288 |
+
]
|
| 289 |
+
|
| 290 |
+
|
| 291 |
+
def label_span(labels: List[str], indices: Sequence[int], entity: str) -> None:
|
| 292 |
+
previous_is_same_entity = bool(indices) and indices[0] > 0 and labels[indices[0] - 1].endswith(entity)
|
| 293 |
+
first = not previous_is_same_entity
|
| 294 |
+
for idx in indices:
|
| 295 |
+
labels[idx] = f"B-{entity}" if first else f"I-{entity}"
|
| 296 |
+
first = False
|
| 297 |
+
|
| 298 |
+
|
| 299 |
+
def label_span_if_changed(labels: List[str], indices: Sequence[int], entity: str) -> bool:
|
| 300 |
+
previous_is_same_entity = bool(indices) and indices[0] > 0 and labels[indices[0] - 1].endswith(entity)
|
| 301 |
+
first_label = f"I-{entity}" if previous_is_same_entity else f"B-{entity}"
|
| 302 |
+
expected = [first_label] + [f"I-{entity}"] * max(0, len(indices) - 1)
|
| 303 |
+
if [labels[idx] for idx in indices] == expected:
|
| 304 |
+
return False
|
| 305 |
+
label_span(labels, indices, entity)
|
| 306 |
+
return True
|
| 307 |
+
|
| 308 |
+
|
| 309 |
+
def safe_to_overwrite_meta(labels: Sequence[str], indices: Sequence[int]) -> bool:
|
| 310 |
+
if not indices:
|
| 311 |
+
return False
|
| 312 |
+
return not any(
|
| 313 |
+
labels[idx].endswith(("GROUP", "EPISODE", "SEASON"))
|
| 314 |
+
for idx in indices
|
| 315 |
+
)
|
| 316 |
+
|
| 317 |
+
|
| 318 |
+
def mark_adjacent_title_separators_o(
|
| 319 |
+
tokens: Sequence[str],
|
| 320 |
+
labels: List[str],
|
| 321 |
+
marker_indices: Sequence[int],
|
| 322 |
+
) -> None:
|
| 323 |
+
if not marker_indices:
|
| 324 |
+
return
|
| 325 |
+
|
| 326 |
+
idx = marker_indices[0] - 1
|
| 327 |
+
while idx >= 0 and tokens[idx].strip() == "" and labels[idx].endswith("TITLE"):
|
| 328 |
+
labels[idx] = "O"
|
| 329 |
+
idx -= 1
|
| 330 |
+
|
| 331 |
+
idx = marker_indices[-1] + 1
|
| 332 |
+
while idx < len(tokens) and tokens[idx] in SEPARATOR_CHARS and labels[idx].endswith("TITLE"):
|
| 333 |
+
labels[idx] = "O"
|
| 334 |
+
idx += 1
|
| 335 |
+
|
| 336 |
+
|
| 337 |
+
def first_episode_end(labels: Sequence[str], offsets: Sequence[Tuple[int, int]], text: str) -> int:
|
| 338 |
+
ends = [
|
| 339 |
+
end for label, (_start, end) in zip(labels, offsets)
|
| 340 |
+
if label.endswith("EPISODE")
|
| 341 |
+
]
|
| 342 |
+
if ends:
|
| 343 |
+
return min(ends)
|
| 344 |
+
match = EPISODE_SPAN_RE.search(text)
|
| 345 |
+
return match.end() if match else 0
|
| 346 |
+
|
| 347 |
+
|
| 348 |
+
def bracket_content_spans(text: str) -> Iterable[Tuple[str, int, int, int, int]]:
|
| 349 |
+
for match in BRACKET_RE.finditer(text):
|
| 350 |
+
groups = match.groups()
|
| 351 |
+
group_index = next((idx for idx, value in enumerate(groups) if value is not None), None)
|
| 352 |
+
if group_index is None:
|
| 353 |
+
continue
|
| 354 |
+
inner = groups[group_index] or ""
|
| 355 |
+
# The opening delimiter is one code point in all supported bracket forms.
|
| 356 |
+
inner_start = match.start() + 1
|
| 357 |
+
inner_end = inner_start + len(inner)
|
| 358 |
+
yield inner.strip(), inner_start, inner_end, match.start(), match.end()
|
| 359 |
+
|
| 360 |
+
|
| 361 |
+
def repair_structural_meta_labels(
|
| 362 |
+
text: str,
|
| 363 |
+
tokens: Sequence[str],
|
| 364 |
+
labels: List[str],
|
| 365 |
+
offsets: Sequence[Tuple[int, int]],
|
| 366 |
+
) -> List[LabelRepair]:
|
| 367 |
+
repairs: List[LabelRepair] = []
|
| 368 |
+
episode_end = first_episode_end(labels, offsets, text)
|
| 369 |
+
|
| 370 |
+
for clean, inner_start, inner_end, bracket_start, _bracket_end in bracket_content_spans(text):
|
| 371 |
+
if bracket_start < episode_end:
|
| 372 |
+
continue
|
| 373 |
+
if not clean:
|
| 374 |
+
continue
|
| 375 |
+
|
| 376 |
+
if SPECIAL_TAG_RE.fullmatch(clean):
|
| 377 |
+
indices = token_indices_for_span(offsets, inner_start, inner_end)
|
| 378 |
+
if safe_to_overwrite_meta(labels, indices) and label_span_if_changed(labels, indices, "SPECIAL"):
|
| 379 |
+
repairs.append(LabelRepair("special", clean, 0, inner_start, inner_end))
|
| 380 |
+
continue
|
| 381 |
+
|
| 382 |
+
if SOURCE_TAG_RE.fullmatch(clean):
|
| 383 |
+
indices = token_indices_for_span(offsets, inner_start, inner_end)
|
| 384 |
+
if safe_to_overwrite_meta(labels, indices) and label_span_if_changed(labels, indices, "SOURCE"):
|
| 385 |
+
repairs.append(LabelRepair("source", clean, 0, inner_start, inner_end))
|
| 386 |
+
continue
|
| 387 |
+
|
| 388 |
+
for match in RESOLUTION_RE.finditer(clean):
|
| 389 |
+
start = inner_start + match.start()
|
| 390 |
+
end = inner_start + match.end()
|
| 391 |
+
indices = token_indices_for_span(offsets, start, end)
|
| 392 |
+
if safe_to_overwrite_meta(labels, indices) and label_span_if_changed(labels, indices, "RESOLUTION"):
|
| 393 |
+
repairs.append(LabelRepair("resolution", match.group(0), 0, start, end))
|
| 394 |
+
|
| 395 |
+
for match in SOURCE_RE.finditer(clean):
|
| 396 |
+
start = inner_start + match.start()
|
| 397 |
+
end = inner_start + match.end()
|
| 398 |
+
indices = token_indices_for_span(offsets, start, end)
|
| 399 |
+
if safe_to_overwrite_meta(labels, indices) and label_span_if_changed(labels, indices, "SOURCE"):
|
| 400 |
+
repairs.append(LabelRepair("source", match.group(0), 0, start, end))
|
| 401 |
+
|
| 402 |
+
# Dot-separated WEB names often carry source/resolution after SxxEyy without
|
| 403 |
+
# brackets. Repair only after the episode span to avoid touching titles.
|
| 404 |
+
for pattern, entity in ((RESOLUTION_RE, "RESOLUTION"), (SOURCE_RE, "SOURCE")):
|
| 405 |
+
for match in pattern.finditer(text):
|
| 406 |
+
if match.start() < episode_end:
|
| 407 |
+
continue
|
| 408 |
+
indices = token_indices_for_span(offsets, match.start(), match.end())
|
| 409 |
+
if safe_to_overwrite_meta(labels, indices) and label_span_if_changed(labels, indices, entity):
|
| 410 |
+
repairs.append(LabelRepair(entity.lower(), match.group(0), 0, match.start(), match.end()))
|
| 411 |
+
|
| 412 |
+
return repairs
|
| 413 |
+
|
| 414 |
+
|
| 415 |
+
def repair_known_label_issues(
|
| 416 |
+
item: Dict,
|
| 417 |
+
) -> Tuple[List[str], List[str], List[LabelRepair]]:
|
| 418 |
+
"""
|
| 419 |
+
Repair known weak-label issues.
|
| 420 |
+
|
| 421 |
+
The repair is intentionally conservative:
|
| 422 |
+
- sequel markers must be immediately before an episode/special context;
|
| 423 |
+
- sequel marker spans must currently be part of TITLE/O, not group/meta;
|
| 424 |
+
- rows that already have a season before the marker are left alone;
|
| 425 |
+
- structural meta repairs only touch spans after the first episode.
|
| 426 |
+
"""
|
| 427 |
+
source_tokens = [str(token) for token in item.get("tokens", [])]
|
| 428 |
+
source_labels = [str(label) for label in item.get("labels", [])]
|
| 429 |
+
if len(source_tokens) != len(source_labels):
|
| 430 |
+
return source_tokens, source_labels, []
|
| 431 |
+
|
| 432 |
+
filename = str(item.get("filename") or "")
|
| 433 |
+
text = filename if filename else "".join(source_tokens)
|
| 434 |
+
offsets = token_offsets_in_text(text, source_tokens)
|
| 435 |
+
if offsets is None:
|
| 436 |
+
text = "".join(source_tokens)
|
| 437 |
+
offsets = token_offsets_in_text(text, source_tokens)
|
| 438 |
+
if offsets is None:
|
| 439 |
+
return source_tokens, source_labels, []
|
| 440 |
+
|
| 441 |
+
repaired_labels = list(source_labels)
|
| 442 |
+
applied: List[LabelRepair] = []
|
| 443 |
+
|
| 444 |
+
quick_text = text.lower()
|
| 445 |
+
has_sequel_marker_hint = any(
|
| 446 |
+
needle in text or needle in quick_text
|
| 447 |
+
for needle in (
|
| 448 |
+
" II", " III", " IV", " V", " VI", " VII", " VIII", " IX",
|
| 449 |
+
"Ⅱ", "Ⅲ", "Ⅳ", "Ⅴ", "Ⅵ", "Ⅶ", "Ⅷ", "Ⅸ",
|
| 450 |
+
"之章", "之期", "之季", "之部", "ノ章", "ノ期", "の章", "の期",
|
| 451 |
+
"貳", "贰", "弐", "弍", "參", "叁", "参", "肆", "陸", "陆",
|
| 452 |
+
"Ni ", " ni ", " no Sara", "Gakki",
|
| 453 |
+
)
|
| 454 |
+
)
|
| 455 |
+
if has_sequel_marker_hint:
|
| 456 |
+
for repair in find_sequel_season_markers(text):
|
| 457 |
+
if labels_have_season_before(repaired_labels, offsets, repair.start):
|
| 458 |
+
continue
|
| 459 |
+
indices = token_indices_for_span(offsets, repair.start, repair.end)
|
| 460 |
+
if not indices:
|
| 461 |
+
continue
|
| 462 |
+
existing = [repaired_labels[idx] for idx in indices]
|
| 463 |
+
if any(
|
| 464 |
+
label.endswith(("GROUP", "EPISODE", "RESOLUTION", "SOURCE", "SPECIAL"))
|
| 465 |
+
for label in existing
|
| 466 |
+
):
|
| 467 |
+
continue
|
| 468 |
+
if not any(label.endswith("TITLE") for label in existing):
|
| 469 |
+
continue
|
| 470 |
+
|
| 471 |
+
label_span(repaired_labels, indices, "SEASON")
|
| 472 |
+
mark_adjacent_title_separators_o(source_tokens, repaired_labels, indices)
|
| 473 |
+
applied.append(repair)
|
| 474 |
+
|
| 475 |
+
applied.extend(repair_structural_meta_labels(text, source_tokens, repaired_labels, offsets))
|
| 476 |
+
return source_tokens, repaired_labels, applied
|
| 477 |
+
|
| 478 |
+
|
| 479 |
+
def repair_sequel_season_labels(
|
| 480 |
+
item: Dict,
|
| 481 |
+
) -> Tuple[List[str], List[str], List[LabelRepair]]:
|
| 482 |
+
"""Backward-compatible wrapper for callers that repair known label issues."""
|
| 483 |
+
return repair_known_label_issues(item)
|
| 484 |
+
|
| 485 |
+
|
| 486 |
+
def repair_jsonl_item(item: Dict) -> Tuple[Dict, List[LabelRepair]]:
|
| 487 |
+
tokens, labels, repairs = repair_known_label_issues(item)
|
| 488 |
+
labels = normalize_iob2(labels)
|
| 489 |
+
if not repairs:
|
| 490 |
+
if labels == item.get("labels", []):
|
| 491 |
+
return item, []
|
| 492 |
+
repaired = dict(item)
|
| 493 |
+
repaired["labels"] = labels
|
| 494 |
+
return repaired, []
|
| 495 |
+
repaired = dict(item)
|
| 496 |
+
repaired["tokens"] = tokens
|
| 497 |
+
repaired["labels"] = labels
|
| 498 |
+
return repaired, repairs
|
| 499 |
+
|
| 500 |
+
|
| 501 |
+
def normalize_iob2(labels: Sequence[str]) -> List[str]:
|
| 502 |
+
normalized: List[str] = []
|
| 503 |
+
previous_entity: Optional[str] = None
|
| 504 |
+
for label in labels:
|
| 505 |
+
if not label.startswith(("B-", "I-")):
|
| 506 |
+
normalized.append("O")
|
| 507 |
+
previous_entity = None
|
| 508 |
+
continue
|
| 509 |
+
entity = label.split("-", 1)[1]
|
| 510 |
+
prefix = "I" if previous_entity == entity else "B"
|
| 511 |
+
normalized.append(f"{prefix}-{entity}")
|
| 512 |
+
previous_entity = entity
|
| 513 |
+
return normalized
|
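The final `normalize_iob2` pass in `label_repairs.py` can be exercised on its own. The sketch below copies the function body from the diff above and runs it on a hypothetical label sequence (the `raw` list is made up for illustration and is not taken from the dataset), to show how prefixes are rewritten after a repair has relabeled part of a span.

```python
from typing import List, Optional, Sequence


def normalize_iob2(labels: Sequence[str]) -> List[str]:
    """Rewrite prefixes so each entity run starts with B- and continues with I-."""
    normalized: List[str] = []
    previous_entity: Optional[str] = None
    for label in labels:
        if not label.startswith(("B-", "I-")):
            # Any label without a B-/I- prefix collapses to "O" and ends the run.
            normalized.append("O")
            previous_entity = None
            continue
        entity = label.split("-", 1)[1]
        # The prefix depends only on the previous entity, not on the incoming prefix.
        prefix = "I" if previous_entity == entity else "B"
        normalized.append(f"{prefix}-{entity}")
        previous_entity = entity
    return normalized


# Hypothetical input: a run that starts with I-, and an I-/B- pair in the wrong order.
raw = ["I-TITLE", "I-TITLE", "B-SEASON", "O", "I-EPISODE", "B-EPISODE"]
fixed = normalize_iob2(raw)
print(fixed)
# ['B-TITLE', 'I-TITLE', 'B-SEASON', 'O', 'B-EPISODE', 'I-EPISODE']
```

Because the prefix is derived only from the previous entity, two adjacent spans of the same entity type with no `O` between them are merged into one span: `["B-GROUP", "B-GROUP"]` becomes `["B-GROUP", "I-GROUP"]`. Callers that need to keep such spans separate would have to handle that case before this pass.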
model.safetensors
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:697d7491b83ef615994e02f11f0f65362c400f5eb6b4be8f43f02435ad43173f
|
| 3 |
+
size 19142604
|
model/config.json
DELETED
|
@@ -1,64 +0,0 @@
|
|
| 1 |
-
{
|
| 2 |
-
"add_cross_attention": false,
|
| 3 |
-
"architectures": [
|
| 4 |
-
"BertForTokenClassification"
|
| 5 |
-
],
|
| 6 |
-
"attention_probs_dropout_prob": 0.1,
|
| 7 |
-
"bos_token_id": null,
|
| 8 |
-
"classifier_dropout": null,
|
| 9 |
-
"dtype": "float32",
|
| 10 |
-
"eos_token_id": null,
|
| 11 |
-
"hidden_act": "gelu",
|
| 12 |
-
"hidden_dropout_prob": 0.1,
|
| 13 |
-
"hidden_size": 256,
|
| 14 |
-
"id2label": {
|
| 15 |
-
"0": "O",
|
| 16 |
-
"1": "B-TITLE",
|
| 17 |
-
"2": "I-TITLE",
|
| 18 |
-
"3": "B-SEASON",
|
| 19 |
-
"4": "I-SEASON",
|
| 20 |
-
"5": "B-EPISODE",
|
| 21 |
-
"6": "I-EPISODE",
|
| 22 |
-
"7": "B-SPECIAL",
|
| 23 |
-
"8": "I-SPECIAL",
|
| 24 |
-
"9": "B-GROUP",
|
| 25 |
-
"10": "I-GROUP",
|
| 26 |
-
"11": "B-RESOLUTION",
|
| 27 |
-
"12": "I-RESOLUTION",
|
| 28 |
-
"13": "B-SOURCE",
|
| 29 |
-
"14": "I-SOURCE"
|
| 30 |
-
},
|
| 31 |
-
"initializer_range": 0.02,
|
| 32 |
-
"intermediate_size": 1024,
|
| 33 |
-
"is_decoder": false,
|
| 34 |
-
"label2id": {
|
| 35 |
-
"B-EPISODE": 5,
|
| 36 |
-
"B-GROUP": 9,
|
| 37 |
-
"B-RESOLUTION": 11,
|
| 38 |
-
"B-SEASON": 3,
|
| 39 |
-
"B-SOURCE": 13,
|
| 40 |
-
"B-SPECIAL": 7,
|
| 41 |
-
"B-TITLE": 1,
|
| 42 |
-
"I-EPISODE": 6,
|
| 43 |
-
"I-GROUP": 10,
|
| 44 |
-
"I-RESOLUTION": 12,
|
| 45 |
-
"I-SEASON": 4,
|
| 46 |
-
"I-SOURCE": 14,
|
| 47 |
-
"I-SPECIAL": 8,
|
| 48 |
-
"I-TITLE": 2,
|
| 49 |
-
"O": 0
|
| 50 |
-
},
|
| 51 |
-
"layer_norm_eps": 1e-12,
|
| 52 |
-
"max_position_embeddings": 128,
|
| 53 |
-
"max_seq_length": 64,
|
| 54 |
-
"model_type": "bert",
|
| 55 |
-
"num_attention_heads": 8,
|
| 56 |
-
"num_hidden_layers": 4,
|
| 57 |
-
"pad_token_id": 0,
|
| 58 |
-
"tie_word_embeddings": true,
|
| 59 |
-
"tokenizer_variant": "regex",
|
| 60 |
-
"transformers_version": "5.8.1",
|
| 61 |
-
"type_vocab_size": 2,
|
| 62 |
-
"use_cache": false,
|
| 63 |
-
"vocab_size": 3000
|
| 64 |
-
}
|
model/model.safetensors
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:8213677836eed2c4e4f64f81ebeff58e6166c808aee158954055475cbf90601b
|
| 3 |
-
size 15866796
|
model/tokenizer_config.json
DELETED
|
@@ -1,44 +0,0 @@
|
|
| 1 |
-
{
|
| 2 |
-
"added_tokens_decoder": {
|
| 3 |
-
"0": {
|
| 4 |
-
"content": "[PAD]",
|
| 5 |
-
"lstrip": false,
|
| 6 |
-
"normalized": false,
|
| 7 |
-
"rstrip": false,
|
| 8 |
-
"single_word": false,
|
| 9 |
-
"special": true
|
| 10 |
-
},
|
| 11 |
-
"1": {
|
| 12 |
-
"content": "[UNK]",
|
| 13 |
-
"lstrip": false,
|
| 14 |
-
"normalized": false,
|
| 15 |
-
"rstrip": false,
|
| 16 |
-
"single_word": false,
|
| 17 |
-
"special": true
|
| 18 |
-
},
|
| 19 |
-
"2": {
|
| 20 |
-
"content": "[CLS]",
|
| 21 |
-
"lstrip": false,
|
| 22 |
-
"normalized": false,
|
| 23 |
-
"rstrip": false,
|
| 24 |
-
"single_word": false,
|
| 25 |
-
"special": true
|
| 26 |
-
},
|
| 27 |
-
"3": {
|
| 28 |
-
"content": "[SEP]",
|
| 29 |
-
"lstrip": false,
|
| 30 |
-
"normalized": false,
|
| 31 |
-
"rstrip": false,
|
| 32 |
-
"single_word": false,
|
| 33 |
-
"special": true
|
| 34 |
-
}
|
| 35 |
-
},
|
| 36 |
-
"backend": "custom",
|
| 37 |
-
"cls_token": "[CLS]",
|
| 38 |
-
"model_max_length": 1000000000000000019884624838656,
|
| 39 |
-
"pad_token": "[PAD]",
|
| 40 |
-
"sep_token": "[SEP]",
|
| 41 |
-
"tokenizer_class": "AnimeTokenizer",
|
| 42 |
-
"tokenizer_variant": "regex",
|
| 43 |
-
"unk_token": "[UNK]"
|
| 44 |
-
}
|
model/training_args.bin
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:67f4980e6e5c8a3b151030042cae7449e798e3fc87518f33ed4d557e6fa17e41
|
| 3 |
-
size 5265
|
model/vocab.json
DELETED
|
The diff for this file is too large to render.
See raw diff
|
parse_eval_metrics.json
ADDED
|
@@ -0,0 +1,595 @@
|
|
| 1 |
+
{
|
| 2 |
+
"sample_count": 2048,
|
| 3 |
+
"field_accuracy": {
|
| 4 |
+
"group": 1.0,
|
| 5 |
+
"title": 0.99658203125,
|
| 6 |
+
"season": 0.994140625,
|
| 7 |
+
"episode": 0.99609375,
|
| 8 |
+
"resolution": 0.998046875,
|
| 9 |
+
"source": 0.99365234375,
|
| 10 |
+
"special": 0.998046875
|
| 11 |
+
},
|
| 12 |
+
"field_correct": {
|
| 13 |
+
"group": 2048,
|
| 14 |
+
"title": 2041,
|
| 15 |
+
"season": 2036,
|
| 16 |
+
"episode": 2040,
|
| 17 |
+
"resolution": 2044,
|
| 18 |
+
"source": 2035,
|
| 19 |
+
"special": 2044
|
| 20 |
+
},
|
| 21 |
+
"field_total": {
|
| 22 |
+
"group": 2048,
|
| 23 |
+
"title": 2048,
|
| 24 |
+
"season": 2048,
|
| 25 |
+
"episode": 2048,
|
| 26 |
+
"resolution": 2048,
|
| 27 |
+
"source": 2048,
|
| 28 |
+
"special": 2048
|
| 29 |
+
},
|
| 30 |
+
"full_match_accuracy": 0.98046875,
|
| 31 |
+
"full_match_correct": 2008,
|
| 32 |
+
"full_match_total": 2048,
|
| 33 |
+
"failures": [
|
| 34 |
+
{
|
| 35 |
+
"filename": "[DBD-Raws][Boruto Naruto Next Generations][menu][S13][D2][02][1080P][BDRip][HEVC-10bit][FLAC]",
|
| 36 |
+
"errors": {
|
| 37 |
+
"season": {
|
| 38 |
+
"gold": null,
|
| 39 |
+
"pred": "13"
|
| 40 |
+
}
|
| 41 |
+
},
|
| 42 |
+
"gold": {
|
| 43 |
+
"group": "DBD-Raws",
|
| 44 |
+
"title": "Boruto Naruto Next Generations",
|
| 45 |
+
"season": null,
|
| 46 |
+
"episode": 2,
|
| 47 |
+
"resolution": "1080P",
|
| 48 |
+
"source": "BDRip",
|
| 49 |
+
"special": null
|
| 50 |
+
},
|
| 51 |
+
"pred": {
|
| 52 |
+
"group": "DBD-Raws",
|
| 53 |
+
"title": "Boruto Naruto Next Generations",
|
| 54 |
+
"season": 13,
|
| 55 |
+
"episode": 2,
|
| 56 |
+
"resolution": "1080P",
|
| 57 |
+
"source": "BDRip",
|
| 58 |
+
"special": null
|
| 59 |
+
}
|
| 60 |
+
},
|
| 61 |
+
{
|
| 62 |
+
"filename": "[アニメ BD] ギャラクシーエンジェル 第1期(無印) 第09話「ロストテクノロジーのローストビーフ」(1424x1072 HEVC 10bit FLAC softSub(chi+eng) chap)",
|
| 63 |
+
"errors": {
|
| 64 |
+
"season": {
|
| 65 |
+
"gold": null,
|
| 66 |
+
"pred": "1"
|
| 67 |
+
}
|
| 68 |
+
},
|
| 69 |
+
"gold": {
|
| 70 |
+
"group": "アニメ BD",
|
| 71 |
+
"title": "ギャラクシーエンジェル 第1期(無印) 第09話「ロストテクノロジーのローストビーフ」",
|
| 72 |
+
"season": null,
|
| 73 |
+
"episode": 9,
|
| 74 |
+
"resolution": "1424x1072",
|
| 75 |
+
"source": "BD",
|
| 76 |
+
"special": null
|
| 77 |
+
},
|
| 78 |
+
"pred": {
|
| 79 |
+
"group": "アニメ BD",
|
| 80 |
+
"title": "ギャラクシーエンジェル 第1期(無印) 第09話「ロストテクノロジーのローストビーフ」",
|
| 81 |
+
"season": 1,
|
| 82 |
+
"episode": 9,
|
| 83 |
+
"resolution": "1424x1072",
|
| 84 |
+
"source": "BD",
|
| 85 |
+
"special": null
|
| 86 |
+
}
|
| 87 |
+
},
|
| 88 |
+
{
|
| 89 |
+
"filename": "コメットさん☆ 第11話 「バトンの力」(DVD DivX4.12 QB95 640x480 24f) [CRC32_C09E1AB0]",
|
| 90 |
+
"errors": {
|
| 91 |
+
"source": {
|
| 92 |
+
"gold": "cr",
|
| 93 |
+
"pred": "dvd"
|
| 94 |
+
}
|
| 95 |
+
},
|
| 96 |
+
"gold": {
|
| 97 |
+
"group": null,
|
| 98 |
+
"title": "コメットさん☆",
|
| 99 |
+
"season": null,
|
| 100 |
+
"episode": 11,
|
| 101 |
+
"resolution": "640x480",
|
| 102 |
+
"source": "CR",
|
| 103 |
+
"special": null
|
| 104 |
+
},
|
| 105 |
+
"pred": {
|
| 106 |
+
"group": null,
|
| 107 |
+
"title": "コメットさん☆",
|
| 108 |
+
"season": null,
|
| 109 |
+
"episode": 11,
|
| 110 |
+
"resolution": "640x480",
|
| 111 |
+
"source": "DVD",
|
| 112 |
+
"special": null
|
| 113 |
+
}
|
| 114 |
+
},
|
| 115 |
+
{
|
| 116 |
+
"filename": "[Kamigami&Mabors&VCB-Studio] Saenai Heroine no Sodatekata Flat [07][Ma10p_1080p][x265_2aac]",
|
| 117 |
+
"errors": {
|
| 118 |
+
"source": {
|
| 119 |
+
"gold": "aac",
|
| 120 |
+
"pred": "x265-2aac"
|
| 121 |
+
}
|
| 122 |
+
},
|
| 123 |
+
"gold": {
|
| 124 |
+
"group": "Kamigami&Mabors&VCB-Studio",
|
| 125 |
+
"title": "Saenai Heroine no Sodatekata Flat",
|
| 126 |
+
"season": null,
|
| 127 |
+
"episode": 7,
|
| 128 |
+
"resolution": "1080p",
|
| 129 |
+
"source": "aac",
|
| 130 |
+
"special": null
|
| 131 |
+
},
|
| 132 |
+
"pred": {
|
| 133 |
+
"group": "Kamigami&Mabors&VCB-Studio",
|
| 134 |
+
"title": "Saenai Heroine no Sodatekata Flat",
|
| 135 |
+
"season": null,
|
| 136 |
+
"episode": 7,
|
| 137 |
+
"resolution": "1080p",
|
| 138 |
+
"source": "x265_2aac",
|
| 139 |
+
"special": null
|
| 140 |
+
}
|
| 141 |
+
},
|
| 142 |
+
{
|
| 143 |
+
"filename": "[Liuyun&VCB-Studio] Hanasaku Iroha [07][Hi10p_1080p][x264_flac_ac3]",
|
| 144 |
+
"errors": {
|
| 145 |
+
"source": {
|
| 146 |
+
"gold": "flac",
|
| 147 |
+
"pred": "x264-flac"
|
| 148 |
+
}
|
| 149 |
+
},
|
| 150 |
+
"gold": {
|
| 151 |
+
"group": "Liuyun&VCB-Studio",
|
| 152 |
+
"title": "Hanasaku Iroha",
|
| 153 |
+
"season": null,
|
| 154 |
+
"episode": 7,
|
| 155 |
+
"resolution": "1080p",
|
| 156 |
+
"source": "flac",
|
| 157 |
+
"special": null
|
| 158 |
+
},
|
| 159 |
+
"pred": {
|
| 160 |
+
"group": "Liuyun&VCB-Studio",
|
| 161 |
+
"title": "Hanasaku Iroha",
|
| 162 |
+
"season": null,
|
| 163 |
+
"episode": 7,
|
| 164 |
+
"resolution": "1080p",
|
| 165 |
+
"source": "x264_flac",
|
| 166 |
+
"special": null
|
| 167 |
+
}
|
| 168 |
+
},
|
| 169 |
+
{
|
| 170 |
+
"filename": "小新外传4[EP02][2017.06.07]出动!妖怪克星",
|
| 171 |
+
"errors": {
|
| 172 |
+
"title": {
|
| 173 |
+
"gold": "小新外传4 ep02 2017 06",
|
| 174 |
+
"pred": "小新外传 ep02 2"
|
| 175 |
+
},
|
| 176 |
+
"season": {
|
| 177 |
+
"gold": null,
|
| 178 |
+
"pred": "4"
|
| 179 |
+
},
|
| 180 |
+
"episode": {
|
| 181 |
+
"gold": "7",
|
| 182 |
+
"pred": "2"
|
| 183 |
+
}
|
| 184 |
+
},
|
| 185 |
+
"gold": {
|
| 186 |
+
"group": null,
|
| 187 |
+
"title": "小新外传4 EP02 2017 06",
|
| 188 |
+
"season": null,
|
| 189 |
+
"episode": 7,
|
| 190 |
+
"resolution": null,
|
| 191 |
+
"source": null,
|
| 192 |
+
"special": null
|
| 193 |
+
},
|
| 194 |
+
"pred": {
|
| 195 |
+
"group": null,
|
| 196 |
+
"title": "小新外传 EP02 2",
|
| 197 |
+
"season": 4,
|
| 198 |
+
"episode": 2,
|
| 199 |
+
"resolution": null,
|
| 200 |
+
"source": null,
|
| 201 |
+
"special": null
|
| 202 |
+
}
|
| 203 |
+
},
|
| 204 |
+
{
|
| 205 |
+
"filename": "[GM-Team][国漫][异常生物见闻录][The Record of Unusual Creatures][2019][12][HEVC][GB][3840×2160]",
|
| 206 |
+
"errors": {
|
| 207 |
+
"resolution": {
|
| 208 |
+
"gold": "3840×2160",
|
| 209 |
+
"pred": "3840×"
|
| 210 |
+
}
|
| 211 |
+
},
|
| 212 |
+
"gold": {
|
| 213 |
+
"group": "GM-Team",
|
| 214 |
+
"title": "国漫",
|
| 215 |
+
"season": null,
|
| 216 |
+
"episode": 12,
|
| 217 |
+
"resolution": "3840×2160",
|
| 218 |
+
"source": "GB",
|
| 219 |
+
"special": null
|
| 220 |
+
},
|
| 221 |
+
"pred": {
|
| 222 |
+
"group": "GM-Team",
|
| 223 |
+
"title": "国漫",
|
| 224 |
+
"season": null,
|
| 225 |
+
"episode": 12,
|
| 226 |
+
"resolution": "3840×",
|
| 227 |
+
"source": "GB",
|
| 228 |
+
"special": null
|
| 229 |
+
}
|
| 230 |
+
},
|
| 231 |
+
{
|
| 232 |
+
"filename": "Ⅱ 116 第108次鐘聲已經敲過了嗎?",
|
| 233 |
+
"errors": {
|
| 234 |
+
"title": {
|
| 235 |
+
"gold": "ⅱ 116 第",
|
| 236 |
+
"pred": "第"
|
| 237 |
+
}
|
| 238 |
+
},
|
| 239 |
+
"gold": {
|
| 240 |
+
"group": null,
|
| 241 |
+
"title": "Ⅱ 116 第",
|
| 242 |
+
"season": null,
|
| 243 |
+
"episode": 116,
|
| 244 |
+
"resolution": null,
|
| 245 |
+
"source": null,
|
| 246 |
+
"special": null
|
| 247 |
+
},
|
| 248 |
+
"pred": {
|
| 249 |
+
"group": null,
|
| 250 |
+
"title": "第",
|
| 251 |
+
"season": null,
|
| 252 |
+
"episode": 116,
|
| 253 |
+
"resolution": null,
|
| 254 |
+
"source": null,
|
| 255 |
+
"special": null
|
| 256 |
+
}
|
| 257 |
+
},
|
| 258 |
+
{
|
| 259 |
+
"filename": "EP08 & EP11 NCED",
|
| 260 |
+
"errors": {
|
| 261 |
+
"title": {
|
| 262 |
+
"gold": "&",
|
| 263 |
+
"pred": "ep"
|
| 264 |
+
}
|
| 265 |
+
},
|
| 266 |
+
"gold": {
|
| 267 |
+
"group": null,
|
| 268 |
+
"title": "&",
|
| 269 |
+
"season": null,
|
| 270 |
+
"episode": 11,
|
| 271 |
+
"resolution": null,
|
| 272 |
+
"source": null,
|
| 273 |
+
"special": "NCED"
|
| 274 |
+
},
|
| 275 |
+
"pred": {
|
| 276 |
+
"group": null,
|
| 277 |
+
"title": "EP",
|
| 278 |
+
"season": null,
|
| 279 |
+
"episode": 11,
|
| 280 |
+
"resolution": null,
|
| 281 |
+
"source": null,
|
| 282 |
+
"special": "NCED"
|
| 283 |
+
}
|
| 284 |
+
},
|
| 285 |
+
{
|
| 286 |
+
"filename": "[S1YURICON] Necronomico no Cosmic Horror Show[06][1080p][WebRip][HEVC_AAC][CHS]",
|
| 287 |
+
"errors": {
|
| 288 |
+
"season": {
|
| 289 |
+
"gold": null,
|
| 290 |
+
"pred": "1"
|
| 291 |
+
}
|
| 292 |
+
},
|
| 293 |
+
"gold": {
|
| 294 |
+
"group": "S1YURICON",
|
| 295 |
+
"title": "Necronomico no Cosmic Horror Show",
|
| 296 |
+
"season": null,
|
| 297 |
+
"episode": 6,
|
| 298 |
+
"resolution": "1080p",
|
| 299 |
+
"source": "WebRip",
|
| 300 |
+
"special": null
|
| 301 |
+
},
|
| 302 |
+
"pred": {
|
| 303 |
+
"group": "S1YURICON",
|
| 304 |
+
"title": "Necronomico no Cosmic Horror Show",
|
| 305 |
+
"season": 1,
|
| 306 |
+
"episode": 6,
|
| 307 |
+
"resolution": "1080p",
|
| 308 |
+
"source": "WebRip",
|
| 309 |
+
"special": null
|
| 310 |
+
}
|
| 311 |
+
},
|
| 312 |
+
{
|
| 313 |
+
"filename": "[FZsub]Gate - Jieitai Kanochi nite, Kaku Tatakaeri 2 - 02(14) (MX 1280x720 x264 AAC)_x264",
|
| 314 |
+
"errors": {
|
| 315 |
+
"title": {
|
| 316 |
+
"gold": "gate - jieitai kanochi nite, kaku tatakaeri 2",
|
| 317 |
+
"pred": "gate - jieitai kanochi nite, kaku tatakaeri 2 - 02"
|
| 318 |
+
},
|
| 319 |
+
"season": {
|
| 320 |
+
"gold": "2",
|
| 321 |
+
"pred": null
|
| 322 |
+
}
|
| 323 |
+
},
|
| 324 |
+
"gold": {
|
| 325 |
+
"group": "FZsub",
|
| 326 |
+
"title": "Gate - Jieitai Kanochi nite, Kaku Tatakaeri 2",
|
| 327 |
+
"season": 2,
|
| 328 |
+
"episode": 14,
|
| 329 |
+
"resolution": "1280x720",
|
| 330 |
+
"source": "x264",
|
| 331 |
+
"special": null
|
| 332 |
+
},
|
| 333 |
+
"pred": {
|
| 334 |
+
"group": "FZsub",
|
| 335 |
+
"title": "Gate - Jieitai Kanochi nite, Kaku Tatakaeri 2 - 02",
|
| 336 |
+
"season": null,
|
| 337 |
+
"episode": 14,
|
| 338 |
+
"resolution": "1280x720",
|
| 339 |
+
"source": "x264",
|
| 340 |
+
"special": null
|
| 341 |
+
}
|
| 342 |
+
},
|
| 343 |
+
{
|
| 344 |
+
"filename": "Mobile Suit Gundam SEED Destiny - HD Remaster 2013 Anime Music Clip2 - Life Goes On [BD 1248x702 23.976fps AVC-yuv420p10 FLAC] v2 - yan04000985",
|
| 345 |
+
"errors": {
|
| 346 |
+
"episode": {
|
| 347 |
+
"gold": null,
|
| 348 |
+
"pred": "23"
|
| 349 |
+
}
|
| 350 |
+
},
|
| 351 |
+
"gold": {
|
| 352 |
+
"group": null,
|
| 353 |
+
"title": "Mobile Suit Gundam SEED Destiny - HD Remaster 2013 Anime Music Clip2 - Life Goes On",
|
| 354 |
+
"season": null,
|
| 355 |
+
"episode": null,
|
| 356 |
+
"resolution": "1248x702",
|
| 357 |
+
"source": "BD",
|
| 358 |
+
"special": null
|
| 359 |
+
},
|
| 360 |
+
"pred": {
|
| 361 |
+
"group": null,
|
| 362 |
+
"title": "Mobile Suit Gundam SEED Destiny - HD Remaster 2013 Anime Music Clip2 - Life Goes On",
|
| 363 |
+
"season": null,
|
| 364 |
+
"episode": 23,
|
| 365 |
+
"resolution": "1248x702",
|
| 366 |
+
"source": "BD",
|
| 367 |
+
"special": null
|
| 368 |
+
}
|
| 369 |
+
},
|
| 370 |
+
{
|
| 371 |
+
"filename": "Mary_E_Il_Giardino_Segreto_-_07_-_Camilla_[DvdMUX_by_Magic_©2008]",
|
| 372 |
+
"errors": {
|
| 373 |
+
"source": {
|
| 374 |
+
"gold": null,
|
| 375 |
+
"pred": "dvd"
|
| 376 |
+
}
|
| 377 |
+
},
|
| 378 |
+
"gold": {
|
| 379 |
+
"group": null,
|
| 380 |
+
"title": "Mary_E_Il_Giardino_Segreto",
|
| 381 |
+
"season": null,
|
| 382 |
+
"episode": 7,
|
| 383 |
+
"resolution": null,
|
| 384 |
+
"source": null,
|
| 385 |
+
"special": null
|
| 386 |
+
},
|
| 387 |
+
"pred": {
|
| 388 |
+
"group": null,
|
| 389 |
+
"title": "Mary_E_Il_Giardino_Segreto",
|
| 390 |
+
"season": null,
|
| 391 |
+
"episode": 7,
|
| 392 |
+
"resolution": null,
|
| 393 |
+
"source": "Dvd",
|
| 394 |
+
"special": null
|
| 395 |
+
}
|
| 396 |
+
},
|
| 397 |
+
{
|
| 398 |
+
"filename": "(アニメ) アイドル伝説えり子 第24話 「心をつなぐ輪舞曲」 (DVD 640x480DivX5.02QB93 48kHz128kbps)",
|
| 399 |
+
"errors": {
|
| 400 |
+
"resolution": {
|
| 401 |
+
"gold": null,
|
| 402 |
+
"pred": "640x480"
|
| 403 |
+
}
|
| 404 |
+
},
|
| 405 |
+
"gold": {
|
| 406 |
+
"group": "アニメ",
|
| 407 |
+
"title": "アイドル伝説えり子",
|
| 408 |
+
"season": null,
|
| 409 |
+
"episode": 24,
|
| 410 |
+
"resolution": null,
|
| 411 |
+
"source": "DVD",
|
| 412 |
+
"special": null
|
| 413 |
+
},
|
| 414 |
+
"pred": {
|
| 415 |
+
"group": "アニメ",
|
| 416 |
+
"title": "アイドル伝説えり子",
|
| 417 |
+
"season": null,
|
| 418 |
+
"episode": 24,
|
| 419 |
+
"resolution": "640x480",
|
| 420 |
+
"source": "DVD",
|
| 421 |
+
"special": null
|
| 422 |
+
}
|
| 423 |
+
},
|
| 424 |
+
{
|
| 425 |
+
"filename": "[DMG] 東京レイヴンズ 第06話「days in nest -休日-」 [BDRip][AVC_AAC][720P][CHS](A8161323)",
|
| 426 |
+
"errors": {
|
| 427 |
+
"episode": {
|
| 428 |
+
"gold": "1323",
|
| 429 |
+
"pred": "6"
|
| 430 |
+
}
|
| 431 |
+
},
|
| 432 |
+
"gold": {
|
| 433 |
+
"group": "DMG",
|
| 434 |
+
"title": "東京レイヴンズ 第06話「days in nest -休日-」",
|
| 435 |
+
"season": null,
|
| 436 |
+
"episode": 1323,
|
| 437 |
+
"resolution": "720P",
|
| 438 |
+
"source": "BDRip",
|
| 439 |
+
"special": null
|
| 440 |
+
},
|
| 441 |
+
"pred": {
|
| 442 |
+
"group": "DMG",
|
| 443 |
+
"title": "東京レイヴンズ 第06話「days in nest -休日-」",
|
| 444 |
+
"season": null,
|
| 445 |
+
"episode": 6,
|
| 446 |
+
"resolution": "720P",
|
| 447 |
+
"source": "BDRip",
|
| 448 |
+
"special": null
|
| 449 |
+
}
|
| 450 |
+
},
|
| 451 |
+
{
|
| 452 |
+
"filename": "[S1YURICON] Necronomico no Cosmic Horror Show[05v2][1080p][WebRip][AVC_AAC][CHS]",
|
| 453 |
+
"errors": {
|
| 454 |
+
"season": {
|
| 455 |
+
"gold": null,
|
| 456 |
+
"pred": "1"
|
| 457 |
+
}
|
| 458 |
+
},
|
| 459 |
+
"gold": {
|
| 460 |
+
"group": "S1YURICON",
|
| 461 |
+
"title": "Necronomico no Cosmic Horror Show",
|
| 462 |
+
"season": null,
|
| 463 |
+
"episode": 5,
|
| 464 |
+
"resolution": "1080p",
|
| 465 |
+
"source": "WebRip",
|
| 466 |
+
"special": null
|
| 467 |
+
},
|
| 468 |
+
"pred": {
|
| 469 |
+
"group": "S1YURICON",
|
| 470 |
+
"title": "Necronomico no Cosmic Horror Show",
|
| 471 |
+
"season": 1,
|
| 472 |
+
"episode": 5,
|
| 473 |
+
"resolution": "1080p",
|
| 474 |
+
"source": "WebRip",
|
| 475 |
+
"special": null
|
| 476 |
+
}
|
| 477 |
+
},
|
| 478 |
+
{
|
| 479 |
+
"filename": "Cardcaptor Sakura - 17 [x264-AAC-BD1440x1080p][Sakura][C-W][E2B50799]",
|
| 480 |
+
"errors": {
|
| 481 |
+
"resolution": {
|
| 482 |
+
"gold": null,
|
| 483 |
+
"pred": "1080p"
|
| 484 |
+
},
|
| 485 |
+
"source": {
|
| 486 |
+
"gold": null,
|
| 487 |
+
"pred": "e2b50799"
|
| 488 |
+
}
|
| 489 |
+
},
|
| 490 |
+
"gold": {
|
| 491 |
+
"group": null,
|
| 492 |
+
"title": "Cardcaptor Sakura",
|
| 493 |
+
"season": null,
|
| 494 |
+
"episode": 17,
|
| 495 |
+
"resolution": null,
|
| 496 |
+
"source": null,
|
| 497 |
+
"special": null
|
| 498 |
+
},
|
| 499 |
+
"pred": {
|
| 500 |
+
"group": null,
|
| 501 |
+
"title": "Cardcaptor Sakura",
|
| 502 |
+
"season": null,
|
| 503 |
+
"episode": 17,
|
| 504 |
+
"resolution": "1080p",
|
| 505 |
+
"source": "E2B50799",
|
| 506 |
+
"special": null
|
| 507 |
+
}
|
| 508 |
+
},
|
| 509 |
+
{
|
| 510 |
+
"filename": "[Xspitfire911] Tate no Yuusha no Nariagari S01E20 BDRIP 1080p X265 10bit VOSTFR",
|
| 511 |
+
"errors": {
|
| 512 |
+
"season": {
|
| 513 |
+
"gold": null,
|
| 514 |
+
"pred": "1"
|
| 515 |
+
}
|
| 516 |
+
},
|
| 517 |
+
"gold": {
|
| 518 |
+
"group": "Xspitfire911",
|
| 519 |
+
"title": "Tate no Yuusha no Nariagari",
|
| 520 |
+
"season": null,
|
| 521 |
+
"episode": 20,
|
| 522 |
+
"resolution": "1080p",
|
| 523 |
+
"source": "BDRIP",
|
| 524 |
+
"special": null
|
| 525 |
+
},
|
| 526 |
+
"pred": {
|
| 527 |
+
"group": "Xspitfire911",
|
| 528 |
+
"title": "Tate no Yuusha no Nariagari",
|
| 529 |
+
"season": 1,
|
| 530 |
+
"episode": 20,
|
| 531 |
+
"resolution": "1080p",
|
| 532 |
+
"source": "BDRIP",
|
| 533 |
+
"special": null
|
| 534 |
+
}
|
| 535 |
+
},
|
| 536 |
+
{
|
| 537 |
+
"filename": "[KTXP][Dungeon ni Deai wo Motomeru no wa Machigatteiru Darou ka IV][13][BIG5][720P][MP4]",
|
| 538 |
+
"errors": {
|
| 539 |
+
"title": {
|
| 540 |
+
"gold": "dungeon ni deai wo motomeru no wa machigatteiru darou ka",
|
| 541 |
+
"pred": "dungeon ni deai wo motomeru no wa machigatteiru darou ka iv"
|
| 542 |
+
},
|
| 543 |
+
"season": {
|
| 544 |
+
"gold": "4",
|
| 545 |
+
"pred": null
|
| 546 |
+
}
|
| 547 |
+
},
|
| 548 |
+
"gold": {
|
| 549 |
+
"group": "KTXP",
|
| 550 |
+
"title": "Dungeon ni Deai wo Motomeru no wa Machigatteiru Darou ka",
|
| 551 |
+
"season": 4,
|
| 552 |
+
"episode": 13,
|
| 553 |
+
"resolution": "720P",
|
| 554 |
+
"source": "BIG5",
|
| 555 |
+
"special": null
|
| 556 |
+
},
|
| 557 |
+
"pred": {
|
| 558 |
+
"group": "KTXP",
|
| 559 |
+
"title": "Dungeon ni Deai wo Motomeru no wa Machigatteiru Darou ka IV",
|
| 560 |
+
"season": null,
|
| 561 |
+
"episode": 13,
|
| 562 |
+
"resolution": "720P",
|
| 563 |
+
"source": "BIG5",
|
| 564 |
+
"special": null
|
| 565 |
+
}
|
| 566 |
+
},
|
| 567 |
+
{
|
| 568 |
+
"filename": "[JyFanSub][Fate_Apocrypha][15][GB][1080]p",
|
| 569 |
+
"errors": {
|
| 570 |
+
"episode": {
|
| 571 |
+
"gold": "1080",
|
| 572 |
+
"pred": "15"
|
| 573 |
+
}
|
| 574 |
+
},
|
| 575 |
+
"gold": {
|
| 576 |
+
"group": "JyFanSub",
|
| 577 |
+
"title": "Fate_Apocrypha",
|
| 578 |
+
"season": null,
|
| 579 |
+
"episode": 1080,
|
| 580 |
+
"resolution": null,
|
| 581 |
+
"source": "GB",
|
| 582 |
+
"special": null
|
| 583 |
+
},
|
| 584 |
+
"pred": {
|
| 585 |
+
"group": "JyFanSub",
|
| 586 |
+
"title": "Fate_Apocrypha",
|
| 587 |
+
"season": null,
|
| 588 |
+
"episode": 15,
|
| 589 |
+
"resolution": null,
|
| 590 |
+
"source": "GB",
|
| 591 |
+
"special": null
|
| 592 |
+
}
|
| 593 |
+
}
|
| 594 |
+
]
|
| 595 |
+
}
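The `errors` entries in this file look like a field-by-field comparison of lowercased string forms (for example, predicted source `"Dvd"` is reported as `"dvd"`, and predicted season `1` as `"1"`). The following is a minimal sketch of such a comparison under that assumption; `normalize` and `field_errors` are hypothetical names and are not taken from `evaluate_parser_cases.py`.

```python
from typing import Any, Optional

# Fields that appear in every "gold" / "pred" record of the metrics file.
FIELDS = ("group", "title", "season", "episode", "resolution", "source", "special")


def normalize(value: Any) -> Optional[str]:
    """Assumed normalization: keep None, otherwise compare lowercased strings."""
    if value is None:
        return None
    return str(value).strip().lower()


def field_errors(gold: dict, pred: dict) -> dict:
    """Return {field: {"gold": ..., "pred": ...}} for every field that differs."""
    errors = {}
    for field in FIELDS:
        g = normalize(gold.get(field))
        p = normalize(pred.get(field))
        if g != p:
            errors[field] = {"gold": g, "pred": p}
    return errors


gold = {
    "group": None,
    "title": "Mary_E_Il_Giardino_Segreto",
    "season": None,
    "episode": 7,
    "resolution": None,
    "source": None,
    "special": None,
}
pred = dict(gold, source="Dvd")  # the model labeled "Dvd" from "DvdMUX" as a source

print(field_errors(gold, pred))
```

Comparing normalized strings keeps casing differences from counting as errors, while a missing or spurious field still shows up as a `null` on one side.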
|
pyproject.toml
ADDED
|
@@ -0,0 +1,36 @@
|
|
| 1 |
+
[project]
|
| 2 |
+
name = "anifilebert"
|
| 3 |
+
version = "0.1.0"
|
| 4 |
+
description = "Tiny BERT token-classification model and tooling for parsing anime release filenames."
|
| 5 |
+
readme = "README.md"
|
| 6 |
+
requires-python = ">=3.11"
|
| 7 |
+
license = { text = "Apache-2.0" }
|
| 8 |
+
dependencies = [
|
| 9 |
+
"accelerate==1.13.0",
|
| 10 |
+
"datasets==4.8.5",
|
| 11 |
+
"numpy==2.4.5",
|
| 12 |
+
"onnx==1.21.0",
|
| 13 |
+
"onnxruntime==1.26.0",
|
| 14 |
+
"onnxscript==0.7.0",
|
| 15 |
+
"seqeval==1.2.2",
|
| 16 |
+
"tensorboard>=2.14.0",
|
| 17 |
+
"torch==2.12.0+cu126",
|
| 18 |
+
"transformers==5.8.1",
|
| 19 |
+
]
|
| 20 |
+
|
| 21 |
+
[project.urls]
|
| 22 |
+
Repository = "https://huggingface.co/ModerRAS/AniFileBERT"
|
| 23 |
+
|
| 24 |
+
[tool.uv]
|
| 25 |
+
package = false
|
| 26 |
+
environments = ["sys_platform == 'win32'"]
|
| 27 |
+
|
| 28 |
+
[tool.uv.sources]
|
| 29 |
+
torch = [
|
| 30 |
+
{ index = "pytorch-cu126", marker = "platform_system == 'Windows'" },
|
| 31 |
+
]
|
| 32 |
+
|
| 33 |
+
[[tool.uv.index]]
|
| 34 |
+
name = "pytorch-cu126"
|
| 35 |
+
url = "https://download.pytorch.org/whl/cu126"
|
| 36 |
+
explicit = true
|
relabel_dataset_from_filenames.py
ADDED
|
@@ -0,0 +1,157 @@
|
|
| 1 |
+
"""Rebuild AnimeName weak labels from each stored filename."""
|
| 2 |
+
|
| 3 |
+
from __future__ import annotations
|
| 4 |
+
|
| 5 |
+
import argparse
|
| 6 |
+
import json
|
| 7 |
+
from collections import Counter
|
| 8 |
+
from datetime import datetime, timezone
|
| 9 |
+
from pathlib import Path
|
| 10 |
+
from statistics import mean
|
| 11 |
+
from typing import Iterable
|
| 12 |
+
|
| 13 |
+
from dmhy_dataset import weak_label_filename
|
| 14 |
+
from label_repairs import repair_jsonl_item
|
| 15 |
+
from tokenizer import AnimeTokenizer
|
| 16 |
+
|
| 17 |
+
|
| 18 |
+
def parse_args() -> argparse.Namespace:
|
| 19 |
+
parser = argparse.ArgumentParser(description="Relabel a JSONL dataset from filename strings")
|
| 20 |
+
parser.add_argument("--input", required=True, help="Input JSONL containing filename fields")
|
| 21 |
+
parser.add_argument("--output", required=True, help="Output relabeled regex-token JSONL")
|
| 22 |
+
parser.add_argument("--manifest-output", default=None, help="Relabel manifest JSON")
|
| 23 |
+
parser.add_argument("--vocab-output", default=None, help="Optional regex vocab JSON")
|
| 24 |
+
parser.add_argument("--base-vocab", default=None, help="Optional regex vocab whose IDs should be preserved")
|
| 25 |
+
parser.add_argument("--max-vocab-size", type=int, default=3000)
|
| 26 |
+
parser.add_argument("--limit", type=int, default=None)
|
| 27 |
+
parser.add_argument("--progress", type=int, default=50000)
|
| 28 |
+
parser.add_argument("--example-count", type=int, default=20)
|
| 29 |
+
return parser.parse_args()
|
| 30 |
+
|
| 31 |
+
|
| 32 |
+
def iter_jsonl(path: Path) -> Iterable[dict]:
|
| 33 |
+
with path.open("r", encoding="utf-8") as handle:
|
| 34 |
+
for line_no, line in enumerate(handle, 1):
|
| 35 |
+
line = line.strip()
|
| 36 |
+
if not line:
|
| 37 |
+
continue
|
| 38 |
+
try:
|
| 39 |
+
yield json.loads(line)
|
| 40 |
+
except json.JSONDecodeError as exc:
|
| 41 |
+
raise ValueError(f"{path}:{line_no}: invalid JSON") from exc
|
| 42 |
+
|
| 43 |
+
|
| 44 |
+
def length_stats(values: list[int]) -> dict:
|
| 45 |
+
if not values:
|
| 46 |
+
return {"min": 0, "mean": 0, "p50": 0, "p90": 0, "p95": 0, "p99": 0, "max": 0}
|
| 47 |
+
ordered = sorted(values)
|
| 48 |
+
|
| 49 |
+
def percentile(pct: float) -> int:
|
| 50 |
+
index = min(len(ordered) - 1, round((pct / 100) * (len(ordered) - 1)))
|
| 51 |
+
return ordered[index]
|
| 52 |
+
|
| 53 |
+
return {
|
| 54 |
+
"min": min(values),
|
| 55 |
+
"mean": mean(values),
|
| 56 |
+
"p50": percentile(50),
|
| 57 |
+
"p90": percentile(90),
|
| 58 |
+
"p95": percentile(95),
|
| 59 |
+
"p99": percentile(99),
|
| 60 |
+
"max": max(values),
|
| 61 |
+
}
|
| 62 |
+
|
| 63 |
+
|
| 64 |
+
def main() -> None:
|
| 65 |
+
args = parse_args()
|
| 66 |
+
input_path = Path(args.input)
|
| 67 |
+
output_path = Path(args.output)
|
| 68 |
+
manifest_path = Path(args.manifest_output) if args.manifest_output else output_path.with_suffix(".manifest.json")
|
| 69 |
+
vocab_path = Path(args.vocab_output) if args.vocab_output else None
|
| 70 |
+
|
| 71 |
+
output_path.parent.mkdir(parents=True, exist_ok=True)
|
| 72 |
+
manifest_path.parent.mkdir(parents=True, exist_ok=True)
|
| 73 |
+
if vocab_path:
|
| 74 |
+
vocab_path.parent.mkdir(parents=True, exist_ok=True)
|
| 75 |
+
|
| 76 |
+
tokenizer = AnimeTokenizer()
|
| 77 |
+
rows_in = 0
|
| 78 |
+
rows_written = 0
|
| 79 |
+
rows_failed = 0
|
| 80 |
+
rows_repaired_after_relabel = 0
|
| 81 |
+
label_counter: Counter[str] = Counter()
|
| 82 |
+
failure_counter: Counter[str] = Counter()
|
| 83 |
+
token_lists: list[list[str]] = []
|
| 84 |
+
lengths: list[int] = []
|
| 85 |
+
examples: list[dict] = []
|
| 86 |
+
failures: list[dict] = []
|
| 87 |
+
|
| 88 |
+
with output_path.open("w", encoding="utf-8", newline="\n") as out:
|
| 89 |
+
for item in iter_jsonl(input_path):
|
| 90 |
+
rows_in += 1
|
| 91 |
+
filename = item.get("filename")
|
| 92 |
+
if not filename:
|
| 93 |
+
rows_failed += 1
|
| 94 |
+
failure_counter["missing_filename"] += 1
|
| 95 |
+
continue
|
| 96 |
+
sample = weak_label_filename(str(filename), tokenizer)
|
| 97 |
+
if sample is None:
|
| 98 |
+
rows_failed += 1
|
| 99 |
+
failure_counter["weak_label_failed"] += 1
|
| 100 |
+
if len(failures) < args.example_count:
|
| 101 |
+
failures.append({"file_id": item.get("file_id"), "filename": filename})
|
| 102 |
+
continue
|
| 103 |
+
record = dict(item)
|
| 104 |
+
record.pop("tokenizer_variant", None)
|
| 105 |
+
record.pop("source_token_count", None)
|
| 106 |
+
record.pop("char_token_count", None)
|
| 107 |
+
record["tokens"] = sample["tokens"]
|
| 108 |
+
record["labels"] = sample["labels"]
|
| 109 |
+
|
| 110 |
+
repaired, repairs = repair_jsonl_item(record)
|
| 111 |
+
if repairs:
|
| 112 |
+
rows_repaired_after_relabel += 1
|
| 113 |
+
record = repaired
|
| 114 |
+
|
| 115 |
+
out.write(json.dumps(record, ensure_ascii=False, separators=(",", ":")) + "\n")
|
| 116 |
+
rows_written += 1
|
| 117 |
+
label_counter.update(record["labels"])
|
| 118 |
+
token_lists.append(record["tokens"])
|
| 119 |
+
lengths.append(len(record["tokens"]))
|
| 120 |
+
if len(examples) < args.example_count:
|
| 121 |
+
examples.append(record)
|
| 122 |
+
|
| 123 |
+
if args.limit is not None and rows_written >= args.limit:
|
| 124 |
+
break
|
| 125 |
+
if args.progress and rows_written % args.progress == 0:
|
| 126 |
+
print(f"relabeled {rows_written:,} rows; failed={rows_failed:,}")
|
| 127 |
+
|
| 128 |
+
base_vocab = None
|
| 129 |
+
if args.base_vocab:
|
| 130 |
+
with Path(args.base_vocab).open("r", encoding="utf-8") as handle:
|
| 131 |
+
base_vocab = json.load(handle)
|
| 132 |
+
tokenizer.build_vocab(token_lists, max_size=args.max_vocab_size, base_vocab=base_vocab)
|
| 133 |
+
if vocab_path:
|
| 134 |
+
vocab_path.write_text(json.dumps(tokenizer.get_vocab(), ensure_ascii=False, indent=2) + "\n", encoding="utf-8")
|
| 135 |
+
|
| 136 |
+
manifest = {
|
| 137 |
+
"created_at": datetime.now(timezone.utc).isoformat(),
|
| 138 |
+
"input": str(input_path),
|
| 139 |
+
"output": str(output_path),
|
| 140 |
+
"vocab_output": str(vocab_path) if vocab_path else None,
|
| 141 |
+
"row_count": rows_written,
|
| 142 |
+
"input_rows": rows_in,
|
| 143 |
+
"failed_rows": rows_failed,
|
| 144 |
+
"repaired_after_relabel_rows": rows_repaired_after_relabel,
|
| 145 |
+
"failure_counts": dict(failure_counter),
|
| 146 |
+
"label_counts": dict(label_counter),
|
| 147 |
+
"token_length": length_stats(lengths),
|
| 148 |
+
"vocab_size": tokenizer.vocab_size,
|
| 149 |
+
"examples": examples,
|
| 150 |
+
"failures": failures,
|
| 151 |
+
}
|
| 152 |
+
manifest_path.write_text(json.dumps(manifest, ensure_ascii=False, indent=2) + "\n", encoding="utf-8")
|
| 153 |
+
print(json.dumps({k: v for k, v in manifest.items() if k not in {"examples", "failures"}}, ensure_ascii=False, indent=2))
|
| 154 |
+
|
| 155 |
+
|
| 156 |
+
if __name__ == "__main__":
|
| 157 |
+
main()
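The `percentile` helper inside `length_stats` picks an element by rounding a fractional index, so it always returns a value that occurs in the data rather than an interpolated one. A small standalone sketch of the same index rule follows; `nearest_rank` is a hypothetical name used only for illustration, and the empty-input guard is an addition for the standalone version.

```python
def nearest_rank(values: list[int], pct: float) -> int:
    """Pick the element at round(pct% of the last index), as in length_stats."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Python's round() rounds exact halves to the nearest even integer,
    # so a fractional index of 4.5 becomes 4, not 5.
    index = min(len(ordered) - 1, round((pct / 100) * (len(ordered) - 1)))
    return ordered[index]


lengths = list(range(1, 11))  # token counts 1..10
for pct in (50, 90, 99):
    print(pct, nearest_rank(lengths, pct))
```

Because of the round-half-to-even behavior, the reported `p50` on an even-length list can be the lower of the two middle values; for corpus-level token length statistics this difference is usually negligible.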
|
repair_dataset_labels.py
ADDED
|
@@ -0,0 +1,103 @@
|
|
| 1 |
+
"""Repair known weak-label mistakes in exported AnimeName JSONL datasets."""
|
| 2 |
+
|
| 3 |
+
from __future__ import annotations
|
| 4 |
+
|
| 5 |
+
import argparse
|
| 6 |
+
import json
|
| 7 |
+
from collections import Counter, defaultdict
|
| 8 |
+
from datetime import datetime, timezone
|
| 9 |
+
from pathlib import Path
|
| 10 |
+
from typing import Dict, List
|
| 11 |
+
|
| 12 |
+
from label_repairs import LabelRepair, repair_jsonl_item
|
| 13 |
+
|
| 14 |
+
|
| 15 |
+
def parse_args() -> argparse.Namespace:
|
| 16 |
+
parser = argparse.ArgumentParser(description="Repair weak BIO labels in a JSONL dataset")
|
| 17 |
+
parser.add_argument("--input", required=True, help="Input JSONL")
|
| 18 |
+
parser.add_argument("--output", required=True, help="Output repaired JSONL")
|
| 19 |
+
parser.add_argument("--manifest-output", default=None, help="Optional repair manifest JSON")
|
| 20 |
+
parser.add_argument("--dry-run", action="store_true", help="Scan only; do not write output JSONL")
|
| 21 |
+
parser.add_argument("--example-limit", type=int, default=40)
|
| 22 |
+
return parser.parse_args()
|
| 23 |
+
|
| 24 |
+
|
| 25 |
+
def repair_key(repair: LabelRepair) -> str:
|
| 26 |
+
return f"{repair.kind}:{repair.marker}"
|
| 27 |
+
|
| 28 |
+
|
| 29 |
+
def main() -> None:
|
| 30 |
+
args = parse_args()
|
| 31 |
+
input_path = Path(args.input)
|
| 32 |
+
output_path = Path(args.output)
|
| 33 |
+
manifest_path = Path(args.manifest_output) if args.manifest_output else output_path.with_suffix(".manifest.json")
|
| 34 |
+
|
| 35 |
+
counts: Counter[str] = Counter()
|
| 36 |
+
marker_counts: Counter[str] = Counter()
|
| 37 |
+
examples: Dict[str, List[dict]] = defaultdict(list)
|
| 38 |
+
label_counts: Counter[str] = Counter()
|
| 39 |
+
row_count = 0
|
| 40 |
+
repaired_rows = 0
|
| 41 |
+
|
| 42 |
+
output_handle = None
|
| 43 |
+
if not args.dry_run:
|
| 44 |
+
output_path.parent.mkdir(parents=True, exist_ok=True)
|
| 45 |
+
output_handle = output_path.open("w", encoding="utf-8")
|
| 46 |
+
|
| 47 |
+
try:
|
| 48 |
+
with input_path.open("r", encoding="utf-8") as handle:
|
| 49 |
+
for line in handle:
|
| 50 |
+
line = line.strip()
|
| 51 |
+
if not line:
|
| 52 |
+
continue
|
| 53 |
+
row_count += 1
|
| 54 |
+
item = json.loads(line)
|
| 55 |
+
repaired, repairs = repair_jsonl_item(item)
|
| 56 |
+
if repairs:
|
| 57 |
+
repaired_rows += 1
|
| 58 |
+
for repair in repairs:
|
| 59 |
+
key = repair_key(repair)
|
| 60 |
+
counts[repair.kind] += 1
|
| 61 |
+
marker_counts[key] += 1
|
| 62 |
+
if len(examples[key]) < args.example_limit:
|
| 63 |
+
examples[key].append(
|
| 64 |
+
{
|
| 65 |
+
"file_id": item.get("file_id"),
|
| 66 |
+
"filename": item.get("filename"),
|
| 67 |
+
"marker": repair.marker,
|
| 68 |
+
"value": repair.value,
|
| 69 |
+
"span": [repair.start, repair.end],
|
| 70 |
+
}
|
| 71 |
+
)
|
| 72 |
+
label_counts.update(repaired.get("labels", []))
|
| 73 |
+
if output_handle is not None:
|
| 74 |
+
output_handle.write(json.dumps(repaired, ensure_ascii=False, separators=(",", ":")) + "\n")
|
| 75 |
+
finally:
|
| 76 |
+
if output_handle is not None:
|
| 77 |
+
output_handle.close()
|
| 78 |
+
|
| 79 |
+
manifest = {
|
| 80 |
+
"created_at": datetime.now(timezone.utc).isoformat(),
|
| 81 |
+
"input": str(input_path),
|
| 82 |
+
"output": None if args.dry_run else str(output_path),
|
| 83 |
+
"dry_run": args.dry_run,
|
| 84 |
+
"row_count": row_count,
|
| 85 |
+
"repaired_rows": repaired_rows,
|
| 86 |
+
"repair_counts": dict(counts),
|
| 87 |
+
"marker_counts": dict(marker_counts),
|
| 88 |
+
"label_counts": dict(label_counts),
|
| 89 |
+
"examples": examples,
|
| 90 |
+
}
|
| 91 |
+
manifest_path.parent.mkdir(parents=True, exist_ok=True)
|
| 92 |
+
manifest_path.write_text(json.dumps(manifest, ensure_ascii=False, indent=2), encoding="utf-8")
|
| 93 |
+
print(json.dumps({
|
| 94 |
+
"row_count": row_count,
|
| 95 |
+
"repaired_rows": repaired_rows,
|
| 96 |
+
"repair_counts": dict(counts),
|
| 97 |
+
"manifest": str(manifest_path),
|
| 98 |
+
"output": None if args.dry_run else str(output_path),
|
| 99 |
+
}, ensure_ascii=False, indent=2))
|
| 100 |
+
|
| 101 |
+
|
| 102 |
+
if __name__ == "__main__":
|
| 103 |
+
main()
|
requirements.txt
CHANGED
|
@@ -1,10 +1,12 @@
|
|
| 1 |
-
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
|
|
|
|
|
|
|
|
| 1 |
+
--extra-index-url https://download.pytorch.org/whl/cu126
|
| 2 |
+
|
| 3 |
+
accelerate==1.13.0
|
| 4 |
+
datasets==4.8.5
|
| 5 |
+
numpy==2.4.5
|
| 6 |
+
onnx==1.21.0
|
| 7 |
+
onnxruntime==1.26.0
|
| 8 |
+
onnxscript==0.7.0
|
| 9 |
+
seqeval==1.2.2
|
| 10 |
+
tensorboard>=2.14.0
|
| 11 |
+
torch==2.12.0+cu126
|
| 12 |
+
transformers==5.8.1
|
run_metadata.json
ADDED
|
@@ -0,0 +1,23 @@
|
|
| 1 |
+
{
|
| 2 |
+
"experiment_name": "dmhy-char-full-relabel",
|
| 3 |
+
"data_file": "datasets/AnimeName/dmhy_weak_char.jsonl",
|
| 4 |
+
"tokenizer_variant": "char",
|
| 5 |
+
"vocab_file": "datasets/AnimeName/vocab.char.json",
|
| 6 |
+
"vocab_size": 6199,
|
| 7 |
+
"max_seq_length": 128,
|
| 8 |
+
"hidden_size": 256,
|
| 9 |
+
"num_hidden_layers": 4,
|
| 10 |
+
"num_attention_heads": 8,
|
| 11 |
+
"intermediate_size": 1024,
|
| 12 |
+
"train_samples": 619361,
|
| 13 |
+
"eval_samples": 12641,
|
| 14 |
+
"epochs": 2.0,
|
| 15 |
+
"batch_size": 256,
|
| 16 |
+
"learning_rate": 8e-05,
|
| 17 |
+
"warmup_steps": 300,
|
| 18 |
+
"seed": 48,
|
| 19 |
+
"device": "cuda",
|
| 20 |
+
"fp16": true,
|
| 21 |
+
"gradient_accumulation_steps": 1,
|
| 22 |
+
"dataloader_num_workers": 4
|
| 23 |
+
}
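The numbers in `run_metadata.json` can be cross-checked with a little arithmetic. The sketch below assumes single-device training and that the last partial batch is kept; it derives the optimizer step count, the share of steps spent in warmup, and the implied train/eval split.

```python
import math

# Values copied from run_metadata.json.
train_samples = 619_361
eval_samples = 12_641
batch_size = 256
gradient_accumulation_steps = 1
epochs = 2
warmup_steps = 300

# Assumption: one device, last partial batch is not dropped.
steps_per_epoch = math.ceil(train_samples / (batch_size * gradient_accumulation_steps))
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps
train_fraction = train_samples / (train_samples + eval_samples)

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"warmup fraction: {warmup_fraction:.3f}")
print(f"train fraction:  {train_fraction:.4f}")
```

Under these assumptions the run takes 2,420 steps per epoch (4,840 in total), warmup covers roughly 6% of training, and the sample counts are consistent with a 98/2 split computed as `int(len(all_data) * train_split)`.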
|
tokenizer.py
CHANGED
|
@@ -45,9 +45,9 @@ class AnimeTokenizer(PreTrainedTokenizer):
|
|
| 45 |
# Layer 2: Individual format token patterns
|
| 46 |
FORMAT_PATTERNS: List[str] = [
|
| 47 |
# Resolution
|
| 48 |
-
r'\d{3,4}[pP]',
|
| 49 |
-
r'\d{3,4}[xX×]\d{3,4}',
|
| 50 |
-
r'\d[Kk]',
|
| 51 |
|
| 52 |
# Codec
|
| 53 |
r'[xX]26[45]',
|
|
|
|
| 45 |
# Layer 2: Individual format token patterns
|
| 46 |
FORMAT_PATTERNS: List[str] = [
|
| 47 |
# Resolution
|
| 48 |
+
r'(?<![A-Za-z0-9])\d{3,4}[pP](?![A-Za-z0-9])',
|
| 49 |
+
r'(?<![A-Za-z0-9])\d{3,4}[xX×]\d{3,4}(?![A-Za-z0-9])',
|
| 50 |
+
r'(?<![A-Za-z0-9])\d[Kk](?![A-Za-z0-9])',
|
| 51 |
|
| 52 |
# Codec
|
| 53 |
r'[xX]26[45]',
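The resolution patterns in this hunk gain a lookbehind and a lookahead so that a resolution only matches when it is not glued to other letters or digits. The sketch below compares the old and new `\d{3,4}[pP]` patterns using plain `re.search`; how `AnimeTokenizer` actually applies `FORMAT_PATTERNS` is not shown in the diff, so this is an illustration of the regex behavior only.

```python
import re

# Old and new resolution patterns, copied from the diff.
OLD_RES = re.compile(r'\d{3,4}[pP]')
NEW_RES = re.compile(r'(?<![A-Za-z0-9])\d{3,4}[pP](?![A-Za-z0-9])')


def first_match(pattern: re.Pattern, text: str):
    """Return the first matched substring, or None when nothing matches."""
    found = pattern.search(text)
    return found.group(0) if found else None


for text in ("[1080p]", "720P", "BD1440x1080p", "1080p60"):
    print(f"{text!r}: old={first_match(OLD_RES, text)!r} new={first_match(NEW_RES, text)!r}")
```

Standalone tokens such as `[1080p]` still match, while fragments embedded in a longer alphanumeric run (for example the `1080p` inside `BD1440x1080p`) no longer do.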
|
tokenizer_config.json
CHANGED
|
@@ -38,7 +38,7 @@
|
|
| 38 |
"model_max_length": 1000000000000000019884624838656,
|
| 39 |
"pad_token": "[PAD]",
|
| 40 |
"sep_token": "[SEP]",
|
| 41 |
-
"tokenizer_class": "
|
| 42 |
-
"tokenizer_variant": "
|
| 43 |
"unk_token": "[UNK]"
|
| 44 |
}
|
|
|
|
| 38 |
"model_max_length": 1000000000000000019884624838656,
|
| 39 |
"pad_token": "[PAD]",
|
| 40 |
"sep_token": "[SEP]",
|
| 41 |
+
"tokenizer_class": "CharAnimeTokenizer",
|
| 42 |
+
"tokenizer_variant": "char",
|
| 43 |
"unk_token": "[UNK]"
|
| 44 |
}
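The `model_max_length` value in this config equals `int(1e30)`, the large placeholder that Transformers stores when a tokenizer has no explicit length limit. A short sketch, assuming the effective limit is therefore supplied by the training configuration (the run metadata lists `max_seq_length` as 128); `effective_max_length` is a hypothetical helper, not part of the repository.

```python
# Value copied from tokenizer_config.json.
MODEL_MAX_LENGTH = 1000000000000000019884624838656

# The float 1e30 is not exactly 10**30, which explains the odd trailing digits.
print(MODEL_MAX_LENGTH == int(1e30))


def effective_max_length(model_max_length: int, configured: int) -> int:
    """Hypothetical helper: the smaller of the tokenizer and run limits."""
    return min(model_max_length, configured)


print(effective_max_length(MODEL_MAX_LENGTH, 128))
```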
|
train.py
CHANGED
|
@@ -14,6 +14,7 @@ import json
|
|
| 14 |
import tempfile
|
| 15 |
import argparse
|
| 16 |
import random
|
|
|
|
| 17 |
from typing import Dict, List, Optional
|
| 18 |
|
| 19 |
import numpy as np
|
|
@@ -29,7 +30,8 @@ from seqeval.metrics import classification_report, accuracy_score, f1_score, pre
|
|
| 29 |
from config import Config
|
| 30 |
from tokenizer import AnimeTokenizer, create_tokenizer, load_tokenizer
|
| 31 |
from model import create_model, print_model_summary, count_parameters
|
| 32 |
-
from dataset import AnimeDataset,
|
|
|
|
| 33 |
|
| 34 |
|
| 35 |
def compute_metrics(p):
|
|
@@ -88,10 +90,27 @@ def parse_args() -> argparse.Namespace:
|
|
| 88 |
help="Save resumable checkpoints every N steps instead of only at epoch end")
|
| 89 |
parser.add_argument("--save-total-limit", type=int, default=2,
|
| 90 |
help="Maximum number of checkpoints to keep")
|
| 91 |
parser.add_argument("--cpu", action="store_true", help="Force CPU training")
|
| 92 |
parser.add_argument("--no-shuffle", action="store_true", help="Do not shuffle before train/eval split")
|
| 93 |
parser.add_argument("--resume-from-checkpoint", default=None,
|
| 94 |
help="Resume Trainer state from a checkpoint directory, or 'auto' for the latest checkpoint")
|
| 95 |
return parser.parse_args()
|
| 96 |
|
| 97 |
|
|
@@ -172,6 +191,118 @@ def validate_dataset_tokenizer_metadata(data: List[Dict], tokenizer_variant: str
|
|
| 172 |
)
|
| 173 |
|
| 174 |
|
| 175 |
def remap_token_embeddings(
|
| 176 |
model: BertForTokenClassification,
|
| 177 |
old_vocab: Dict[str, int],
|
|
@@ -220,7 +351,7 @@ def build_vocab_from_data(data: List[Dict], tokenizer: AnimeTokenizer, vocab_pat
|
|
| 220 |
max_size: Optional[int] = None) -> None:
|
| 221 |
token_lists: List[List[str]] = []
|
| 222 |
for item in data:
|
| 223 |
-
tokens,
|
| 224 |
token_lists.append(tokens)
|
| 225 |
|
| 226 |
tokenizer.build_vocab(token_lists, max_size=max_size)
|
|
@@ -250,20 +381,35 @@ def main():
|
|
| 250 |
config.warmup_steps = args.warmup_steps
|
| 251 |
if args.train_split is not None:
|
| 252 |
config.train_split = args.train_split
|
|
|
|
|
|
|
| 253 |
if args.max_seq_length is not None:
|
| 254 |
config.max_seq_length = args.max_seq_length
|
| 255 |
elif tokenizer_variant == "char":
|
| 256 |
config.max_seq_length = max(config.max_seq_length, 128)
|
| 257 |
|
| 258 |
random.seed(args.seed)
|
| 259 |
np.random.seed(args.seed)
|
| 260 |
torch.manual_seed(args.seed)
|
| 261 |
|
| 262 |
print("Loading dataset...")
|
| 263 |
-
|
| 264 |
-
|
| 265 |
-
|
| 266 |
-
all_data = all_data[:args.limit_samples]
|
| 267 |
if not args.no_shuffle:
|
| 268 |
random.shuffle(all_data)
|
| 269 |
validate_dataset_tokenizer_metadata(all_data, tokenizer_variant)
|
|
@@ -280,6 +426,9 @@ def main():
|
|
| 280 |
print(f" Variant: {tokenizer_variant}")
|
| 281 |
print(f" Vocab size: {tokenizer.vocab_size}")
|
| 282 |
print(f" Max sequence length: {config.max_seq_length}")
|
|
|
|
|
|
|
|
|
|
| 283 |
|
| 284 |
# Update config with actual vocab size
|
| 285 |
config.vocab_size = tokenizer.vocab_size
|
|
@@ -288,15 +437,22 @@ def main():
|
|
| 288 |
if args.init_model_dir:
|
| 289 |
print(f"Loading model for fine-tuning: {args.init_model_dir}")
|
| 290 |
model = BertForTokenClassification.from_pretrained(args.init_model_dir)
|
| 291 |
-
init_tokenizer = load_tokenizer(args.init_model_dir)
|
| 292 |
init_variant = getattr(init_tokenizer, "tokenizer_variant", None)
|
| 293 |
if init_variant != tokenizer_variant:
|
| 294 |
print(f" WARNING: tokenizer variant changes during fine-tune: {init_variant} -> {tokenizer_variant}")
|
| 295 |
print(" Token embeddings will be remapped by token string; unmatched tokens are newly initialized.")
|
| 296 |
-
if model.config.vocab_size != config.vocab_size or
|
| 297 |
copied = remap_token_embeddings(
|
| 298 |
model=model,
|
| 299 |
-
old_vocab=
|
| 300 |
new_vocab=tokenizer.get_vocab(),
|
| 301 |
pad_token_id=tokenizer.pad_token_id,
|
| 302 |
)
|
|
@@ -316,6 +472,7 @@ def main():
|
|
| 316 |
print("WARNING: Model exceeds the historical 5M target; continuing because vocab size is configurable.")
|
| 317 |
|
| 318 |
split_idx = int(len(all_data) * config.train_split)
|
|
|
|
| 319 |
train_data = all_data[:split_idx]
|
| 320 |
eval_data = all_data[split_idx:]
|
| 321 |
|
|
@@ -350,8 +507,7 @@ def main():
|
|
| 350 |
use_cpu = args.cpu or not torch.cuda.is_available()
|
| 351 |
use_fp16 = not use_cpu
|
| 352 |
print(f" Device: {'CPU' if use_cpu else 'CUDA'}")
|
| 353 |
-
|
| 354 |
-
load_best_model_at_end = args.checkpoint_steps is None
|
| 355 |
|
| 356 |
# Training arguments
|
| 357 |
training_args = TrainingArguments(
|
|
@@ -359,20 +515,23 @@ def main():
|
|
| 359 |
num_train_epochs=config.num_epochs,
|
| 360 |
per_device_train_batch_size=config.batch_size,
|
| 361 |
per_device_eval_batch_size=config.batch_size,
|
| 362 |
-
eval_strategy=
|
| 363 |
-
save_strategy=
|
|
|
|
| 364 |
save_steps=args.checkpoint_steps,
|
| 365 |
logging_steps=config.log_interval,
|
| 366 |
learning_rate=config.learning_rate,
|
| 367 |
weight_decay=config.weight_decay,
|
| 368 |
warmup_steps=config.warmup_steps,
|
|
|
|
| 369 |
use_cpu=use_cpu,
|
| 370 |
-
report_to="none",
|
| 371 |
save_total_limit=args.save_total_limit,
|
| 372 |
-
load_best_model_at_end=
|
| 373 |
metric_for_best_model="f1",
|
| 374 |
greater_is_better=True,
|
| 375 |
dataloader_num_workers=config.num_workers,
|
|
|
|
| 376 |
fp16=use_fp16,
|
| 377 |
)
|
| 378 |
|
|
@@ -410,6 +569,31 @@ def main():
|
|
| 410 |
final_save_path = os.path.join(config.save_dir, "final")
|
| 411 |
trainer.save_model(final_save_path)
|
| 412 |
tokenizer.save_pretrained(final_save_path)
|
|
| 413 |
print(f"Model saved to: {final_save_path}")
|
| 414 |
|
| 415 |
# Final evaluation
|
|
@@ -417,6 +601,30 @@ def main():
|
|
| 417 |
eval_results = trainer.evaluate()
|
| 418 |
for key, value in eval_results.items():
|
| 419 |
print(f" {key}: {value:.4f}")
|
|
| 420 |
|
| 421 |
|
| 422 |
if __name__ == "__main__":
|
|
|
|
| 14 |
import tempfile
|
| 15 |
import argparse
|
| 16 |
import random
|
| 17 |
+
from collections import Counter
|
| 18 |
from typing import Dict, List, Optional
|
| 19 |
|
| 20 |
import numpy as np
|
|
|
|
| 30 |
from config import Config
|
| 31 |
from tokenizer import AnimeTokenizer, create_tokenizer, load_tokenizer
|
| 32 |
from model import create_model, print_model_summary, count_parameters
|
| 33 |
+
from dataset import AnimeDataset, labels_for_tokenizer
|
| 34 |
+
from inference import parse_filename, postprocess
|
| 35 |
|
| 36 |
|
| 37 |
def compute_metrics(p):
|
|
|
|
| 90 |
help="Save resumable checkpoints every N steps instead of only at epoch end")
|
| 91 |
parser.add_argument("--save-total-limit", type=int, default=2,
|
| 92 |
help="Maximum number of checkpoints to keep")
|
| 93 |
+
parser.add_argument("--gradient-accumulation-steps", type=int, default=1,
|
| 94 |
+
help="Accumulate gradients across this many steps")
|
| 95 |
+
parser.add_argument("--num-workers", type=int, default=None,
|
| 96 |
+
help="DataLoader worker count. Defaults to config.num_workers")
|
| 97 |
parser.add_argument("--cpu", action="store_true", help="Force CPU training")
|
| 98 |
parser.add_argument("--no-shuffle", action="store_true", help="Do not shuffle before train/eval split")
|
| 99 |
parser.add_argument("--resume-from-checkpoint", default=None,
|
| 100 |
help="Resume Trainer state from a checkpoint directory, or 'auto' for the latest checkpoint")
|
| 101 |
+
parser.add_argument("--tensorboard", dest="tensorboard", action="store_true",
|
| 102 |
+
help="Log metrics to TensorBoard in addition to stdout/checkpoints")
|
| 103 |
+
parser.add_argument("--no-tensorboard", dest="tensorboard", action="store_false",
|
| 104 |
+
help="Disable TensorBoard logging")
|
| 105 |
+
parser.add_argument("--experiment-name", default=None,
|
| 106 |
+
help="Optional experiment name written to run_metadata.json")
|
| 107 |
+
parser.add_argument("--parse-eval-limit", type=int, default=512,
|
| 108 |
+
help="Run field exact-match evaluation on up to N eval samples after training; 0 disables it")
|
| 109 |
+
parser.add_argument("--hidden-size", type=int, default=None, help="Override BERT hidden size")
|
| 110 |
+
parser.add_argument("--num-hidden-layers", type=int, default=None, help="Override BERT layer count")
|
| 111 |
+
parser.add_argument("--num-attention-heads", type=int, default=None, help="Override BERT attention heads")
|
| 112 |
+
parser.add_argument("--intermediate-size", type=int, default=None, help="Override BERT FFN intermediate size")
|
| 113 |
+
parser.set_defaults(tensorboard=True)
|
| 114 |
return parser.parse_args()
|
| 115 |
|
| 116 |
|
|
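The `--tensorboard` / `--no-tensorboard` pair in the hunk above writes to a single destination and takes its default from `set_defaults`. A minimal standalone sketch of that pattern, using only the standard library; the `build_parser` name is illustrative and not part of train.py:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with an on/off flag pair that shares one destination."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--tensorboard", dest="tensorboard", action="store_true")
    parser.add_argument("--no-tensorboard", dest="tensorboard", action="store_false")
    # Parser-level defaults override the per-argument defaults of both flags.
    parser.set_defaults(tensorboard=True)
    return parser


parser = build_parser()
default_args = parser.parse_args([])
disabled_args = parser.parse_args(["--no-tensorboard"])

# Same expression the diff uses to build the report_to value.
report_to = ["tensorboard"] if disabled_args.tensorboard else "none"
```

With no flags the value stays `True`; passing `--no-tensorboard` flips it to `False`, which is what lets the diff map it to `report_to="none"`.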
|
|
| 191 |
)
|
| 192 |
|
| 193 |
|
| 194 |
+
def load_jsonl(data_file: str, limit: Optional[int] = None) -> List[Dict]:
|
| 195 |
+
"""Load JSONL rows, stopping early for smoke runs."""
|
| 196 |
+
data: List[Dict] = []
|
| 197 |
+
with open(data_file, "r", encoding="utf-8") as f:
|
| 198 |
+
for line in f:
|
| 199 |
+
line = line.strip()
|
| 200 |
+
if not line:
|
| 201 |
+
continue
|
| 202 |
+
data.append(json.loads(line))
|
| 203 |
+
if limit is not None and len(data) >= limit:
|
| 204 |
+
break
|
| 205 |
+
return data
|
| 206 |
+
|
| 207 |
+
|
| 208 |
+
def normalize_field_value(field: str, value) -> Optional[str]:
|
| 209 |
+
if value is None:
|
| 210 |
+
return None
|
| 211 |
+
if field in {"episode", "season"}:
|
| 212 |
+
try:
|
| 213 |
+
return str(int(value))
|
| 214 |
+
except (TypeError, ValueError):
|
| 215 |
+
return str(value).strip().lower()
|
| 216 |
+
text = str(value).strip()
|
| 217 |
+
if field in {"resolution", "source"}:
|
| 218 |
+
return text.lower().replace("_", "-")
|
| 219 |
+
return " ".join(text.lower().split())
|
| 220 |
+
|
| 221 |
+
|
| 222 |
+
def parse_exact_metrics(
|
| 223 |
+
samples: List[Dict],
|
| 224 |
+
model: BertForTokenClassification,
|
| 225 |
+
tokenizer: AnimeTokenizer,
|
| 226 |
+
id2label: Dict[int, str],
|
| 227 |
+
max_length: int,
|
| 228 |
+
limit: Optional[int],
|
| 229 |
+
) -> Dict:
|
| 230 |
+
"""Evaluate end-to-end field exact match on filenames, not just token loss."""
|
| 231 |
+
fields = ["group", "title", "season", "episode", "resolution", "source", "special"]
|
| 232 |
+
selected = [sample for sample in samples if sample.get("filename")]
|
| 233 |
+
if limit is not None and limit > 0:
|
| 234 |
+
selected = selected[:limit]
|
| 235 |
+
|
| 236 |
+
counter: Counter = Counter()
|
| 237 |
+
failures: List[Dict] = []
|
| 238 |
+
model.eval()
|
| 239 |
+
|
| 240 |
+
for sample in selected:
|
| 241 |
+
filename = sample["filename"]
|
| 242 |
+
tokens, gold_labels = labels_for_tokenizer(sample, tokenizer)
|
| 243 |
+
available = max(0, max_length - 2)
|
| 244 |
+
tokens = tokens[:available]
|
| 245 |
+
gold_labels = gold_labels[:available]
|
| 246 |
+
gold = postprocess(tokens, gold_labels, tokenizer=tokenizer, filename=filename, use_rules=True)
|
| 247 |
+
gold_entities = {label.split("-", 1)[1] for label in gold_labels if label.startswith(("B-", "I-"))}
|
| 248 |
+
for optional_field, entity in (("episode", "EPISODE"), ("season", "SEASON")):
|
| 249 |
+
if entity not in gold_entities:
|
| 250 |
+
gold[optional_field] = None
|
| 251 |
+
pred = parse_filename(
|
| 252 |
+
filename,
|
| 253 |
+
model,
|
| 254 |
+
tokenizer,
|
| 255 |
+
id2label,
|
| 256 |
+
max_length=max_length,
|
| 257 |
+
debug=False,
|
| 258 |
+
use_rules=True,
|
| 259 |
+
constrain_bio=True,
|
| 260 |
+
)
|
| 261 |
+
|
| 262 |
+
full_match = True
|
| 263 |
+
field_errors: Dict[str, Dict[str, Optional[str]]] = {}
|
| 264 |
+
for field in fields:
|
| 265 |
+
gold_value = normalize_field_value(field, gold.get(field))
|
| 266 |
+
pred_value = normalize_field_value(field, pred.get(field))
|
| 267 |
+
counter[f"{field}_total"] += 1
|
| 268 |
+
if gold_value == pred_value:
|
| 269 |
+
counter[f"{field}_correct"] += 1
|
| 270 |
+
else:
|
| 271 |
+
full_match = False
|
| 272 |
+
field_errors[field] = {"gold": gold_value, "pred": pred_value}
|
| 273 |
+
counter["full_total"] += 1
|
| 274 |
+
if full_match:
|
| 275 |
+
counter["full_correct"] += 1
|
| 276 |
+
elif len(failures) < 20:
|
| 277 |
+
failures.append(
|
| 278 |
+
{
|
| 279 |
+
"filename": filename,
|
| 280 |
+
"errors": field_errors,
|
| 281 |
+
"gold": {field: gold.get(field) for field in fields},
|
| 282 |
+
"pred": {field: pred.get(field) for field in fields},
|
| 283 |
+
}
|
| 284 |
+
)
|
| 285 |
+
|
| 286 |
+
field_accuracy = {}
|
| 287 |
+
for field in fields:
|
| 288 |
+
total = counter.get(f"{field}_total", 0)
|
| 289 |
+
correct = counter.get(f"{field}_correct", 0)
|
| 290 |
+
field_accuracy[field] = correct / total if total else 0.0
|
| 291 |
+
|
| 292 |
+
total = counter.get("full_total", 0)
|
| 293 |
+
correct = counter.get("full_correct", 0)
|
| 294 |
+
return {
|
| 295 |
+
"sample_count": total,
|
| 296 |
+
"field_accuracy": field_accuracy,
|
| 297 |
+
"field_correct": {field: counter.get(f"{field}_correct", 0) for field in fields},
|
| 298 |
+
"field_total": {field: counter.get(f"{field}_total", 0) for field in fields},
|
| 299 |
+
"full_match_accuracy": correct / total if total else 0.0,
|
| 300 |
+
"full_match_correct": correct,
|
| 301 |
+
"full_match_total": total,
|
| 302 |
+
"failures": failures,
|
| 303 |
+
}
|
| 304 |
+
|
| 305 |
+
|
| 306 |
def remap_token_embeddings(
|
| 307 |
model: BertForTokenClassification,
|
| 308 |
old_vocab: Dict[str, int],
|
|
|
|
| 351 |
max_size: Optional[int] = None) -> None:
|
| 352 |
token_lists: List[List[str]] = []
|
| 353 |
for item in data:
|
| 354 |
+
tokens, _labels = labels_for_tokenizer(item, tokenizer)
|
| 355 |
token_lists.append(tokens)
|
| 356 |
|
| 357 |
tokenizer.build_vocab(token_lists, max_size=max_size)
|
|
|
|
| 381 |
config.warmup_steps = args.warmup_steps
|
| 382 |
if args.train_split is not None:
|
| 383 |
config.train_split = args.train_split
|
| 384 |
+
if args.num_workers is not None:
|
| 385 |
+
config.num_workers = args.num_workers
|
| 386 |
if args.max_seq_length is not None:
|
| 387 |
config.max_seq_length = args.max_seq_length
|
| 388 |
elif tokenizer_variant == "char":
|
| 389 |
config.max_seq_length = max(config.max_seq_length, 128)
|
| 390 |
+
if args.hidden_size is not None:
|
| 391 |
+
config.hidden_size = args.hidden_size
|
| 392 |
+
if args.num_hidden_layers is not None:
|
| 393 |
+
config.num_hidden_layers = args.num_hidden_layers
|
| 394 |
+
if args.num_attention_heads is not None:
|
| 395 |
+
config.num_attention_heads = args.num_attention_heads
|
| 396 |
+
if args.intermediate_size is not None:
|
| 397 |
+
config.intermediate_size = args.intermediate_size
|
| 398 |
+
if config.hidden_size % config.num_attention_heads != 0:
|
| 399 |
+
raise ValueError(
|
| 400 |
+
f"hidden_size ({config.hidden_size}) must be divisible by "
|
| 401 |
+
f"num_attention_heads ({config.num_attention_heads})."
|
| 402 |
+
)
|
| 403 |
+
config.max_position_embeddings = max(config.max_position_embeddings, config.max_seq_length)
|
| 404 |
|
| 405 |
random.seed(args.seed)
|
| 406 |
np.random.seed(args.seed)
|
| 407 |
torch.manual_seed(args.seed)
|
| 408 |
|
| 409 |
print("Loading dataset...")
|
| 410 |
+
all_data = load_jsonl(config.data_file, args.limit_samples)
|
| 411 |
+
if len(all_data) < 2:
|
| 412 |
+
raise ValueError("Need at least two samples so train/eval split is non-empty.")
|
|
|
|
| 413 |
if not args.no_shuffle:
|
| 414 |
random.shuffle(all_data)
|
| 415 |
validate_dataset_tokenizer_metadata(all_data, tokenizer_variant)
|
|
|
|
| 426 |
print(f" Variant: {tokenizer_variant}")
|
| 427 |
print(f" Vocab size: {tokenizer.vocab_size}")
|
| 428 |
print(f" Max sequence length: {config.max_seq_length}")
|
| 429 |
+
if torch.cuda.is_available() and not args.cpu:
|
| 430 |
+
print(f" CUDA device: {torch.cuda.get_device_name(0)}")
|
| 431 |
+
print(" Mixed precision: fp16")
|
| 432 |
|
| 433 |
# Update config with actual vocab size
|
| 434 |
config.vocab_size = tokenizer.vocab_size
|
|
|
|
| 437 |
if args.init_model_dir:
|
| 438 |
print(f"Loading model for fine-tuning: {args.init_model_dir}")
|
| 439 |
model = BertForTokenClassification.from_pretrained(args.init_model_dir)
|
| 440 |
+
init_tokenizer = load_tokenizer(args.init_model_dir, tokenizer_variant)
|
| 441 |
+
init_vocab = init_tokenizer.get_vocab()
|
| 442 |
+
embedding_size = model.get_input_embeddings().weight.shape[0]
|
| 443 |
+
if len(init_vocab) != embedding_size:
|
| 444 |
+
print(
|
| 445 |
+
" WARNING: init checkpoint tokenizer vocab length does not match model embedding size "
|
| 446 |
+
f"({len(init_vocab):,} vs {embedding_size:,}). Prefer a self-consistent checkpoint."
|
| 447 |
+
)
|
| 448 |
init_variant = getattr(init_tokenizer, "tokenizer_variant", None)
|
| 449 |
if init_variant != tokenizer_variant:
|
| 450 |
print(f" WARNING: tokenizer variant changes during fine-tune: {init_variant} -> {tokenizer_variant}")
|
| 451 |
print(" Token embeddings will be remapped by token string; unmatched tokens are newly initialized.")
|
| 452 |
+
if model.config.vocab_size != config.vocab_size or init_vocab != tokenizer.get_vocab():
|
| 453 |
copied = remap_token_embeddings(
|
| 454 |
model=model,
|
| 455 |
+
old_vocab=init_vocab,
|
| 456 |
new_vocab=tokenizer.get_vocab(),
|
| 457 |
pad_token_id=tokenizer.pad_token_id,
|
| 458 |
)
|
|
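The body of `remap_token_embeddings` is not shown in this diff, so the following is only a rough sketch of what "remapped by token string" could mean. It assumes the new vocabulary ids run from 0 to n-1, uses plain Python lists in place of tensors, and ignores `pad_token_id`. The function name `remap_rows_by_token` and the 0.02 standard deviation are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def remap_rows_by_token(
    old_rows: List[List[float]],
    old_vocab: Dict[str, int],
    new_vocab: Dict[str, int],
    dim: int,
    seed: int = 0,
) -> Tuple[List[List[float]], int]:
    """Copy embedding rows for tokens present in both vocabularies.

    Tokens that only exist in the new vocabulary keep a freshly
    initialized row. Returns the new rows and the number of copied rows.
    """
    rng = random.Random(seed)
    # Start from newly initialized rows for every id in the new vocabulary.
    new_rows = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(len(new_vocab))]
    copied = 0
    for token, new_id in new_vocab.items():
        old_id = old_vocab.get(token)
        if old_id is not None and old_id < len(old_rows):
            new_rows[new_id] = list(old_rows[old_id])
            copied += 1
    return new_rows, copied


old_vocab = {"[PAD]": 0, "a": 1, "b": 2}
old_rows = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
new_vocab = {"[PAD]": 0, "b": 1, "c": 2}
new_rows, copied = remap_rows_by_token(old_rows, old_vocab, new_vocab, dim=2)
```

In this toy case the row for `"b"` moves from id 2 to id 1, and `"c"` keeps its random initialization.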
|
|
| 472 |
print("WARNING: Model exceeds the historical 5M target; continuing because vocab size is configurable.")
|
| 473 |
|
| 474 |
split_idx = int(len(all_data) * config.train_split)
|
| 475 |
+
split_idx = max(1, min(len(all_data) - 1, split_idx))
|
| 476 |
train_data = all_data[:split_idx]
|
| 477 |
eval_data = all_data[split_idx:]
|
| 478 |
|
|
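The added `max(1, min(len(all_data) - 1, split_idx))` line keeps both splits non-empty even for extreme `train_split` values. A small sketch of the same arithmetic in isolation; the helper name `clamp_split_index` is illustrative:

```python
def clamp_split_index(n_samples: int, train_split: float) -> int:
    """Return a split index that leaves at least one train and one eval sample."""
    if n_samples < 2:
        raise ValueError("Need at least two samples so train/eval split is non-empty.")
    split_idx = int(n_samples * train_split)
    return max(1, min(n_samples - 1, split_idx))


typical = clamp_split_index(10, 0.5)      # ordinary case, no clamping needed
all_train = clamp_split_index(3, 1.0)     # would be 3, clamped to leave one eval sample
all_eval = clamp_split_index(5, 0.0)      # would be 0, clamped to leave one train sample
```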
|
|
| 507 |
use_cpu = args.cpu or not torch.cuda.is_available()
|
| 508 |
use_fp16 = not use_cpu
|
| 509 |
print(f" Device: {'CPU' if use_cpu else 'CUDA'}")
|
| 510 |
+
eval_save_strategy = "steps" if args.checkpoint_steps else "epoch"
|
|
|
|
| 511 |
|
| 512 |
# Training arguments
|
| 513 |
training_args = TrainingArguments(
|
|
|
|
| 515 |
num_train_epochs=config.num_epochs,
|
| 516 |
per_device_train_batch_size=config.batch_size,
|
| 517 |
per_device_eval_batch_size=config.batch_size,
|
| 518 |
+
eval_strategy=eval_save_strategy,
|
| 519 |
+
save_strategy=eval_save_strategy,
|
| 520 |
+
eval_steps=args.checkpoint_steps,
|
| 521 |
save_steps=args.checkpoint_steps,
|
| 522 |
logging_steps=config.log_interval,
|
| 523 |
learning_rate=config.learning_rate,
|
| 524 |
weight_decay=config.weight_decay,
|
| 525 |
warmup_steps=config.warmup_steps,
|
| 526 |
+
gradient_accumulation_steps=args.gradient_accumulation_steps,
|
| 527 |
use_cpu=use_cpu,
|
| 528 |
+
report_to=["tensorboard"] if args.tensorboard else "none",
|
| 529 |
save_total_limit=args.save_total_limit,
|
| 530 |
+
load_best_model_at_end=True,
|
| 531 |
metric_for_best_model="f1",
|
| 532 |
greater_is_better=True,
|
| 533 |
dataloader_num_workers=config.num_workers,
|
| 534 |
+
dataloader_pin_memory=not use_cpu,
|
| 535 |
fp16=use_fp16,
|
| 536 |
)
|
| 537 |
|
|
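The hunk above derives one strategy string and passes it to both `eval_strategy` and `save_strategy`. That matters because `load_best_model_at_end=True` is now unconditional, and Transformers' `TrainingArguments` requires the evaluation and save strategies to match in that mode. A dependency-free sketch of the selection logic; `checkpoint_schedule` is a hypothetical helper that only builds the keyword values:

```python
from typing import Any, Dict, Optional


def checkpoint_schedule(checkpoint_steps: Optional[int]) -> Dict[str, Any]:
    """Mirror the diff: step-based eval/save when a step count is given, else per epoch."""
    strategy = "steps" if checkpoint_steps else "epoch"
    return {
        "eval_strategy": strategy,
        "save_strategy": strategy,
        "eval_steps": checkpoint_steps,
        "save_steps": checkpoint_steps,
        "load_best_model_at_end": True,
    }


per_epoch = checkpoint_schedule(None)
per_steps = checkpoint_schedule(500)
```

Because both keys come from the same variable, the two strategies cannot drift apart when `--checkpoint-steps` is set.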
|
|
| 569 |
final_save_path = os.path.join(config.save_dir, "final")
|
| 570 |
trainer.save_model(final_save_path)
|
| 571 |
tokenizer.save_pretrained(final_save_path)
|
| 572 |
+
metadata = {
|
| 573 |
+
"experiment_name": args.experiment_name,
|
| 574 |
+
"data_file": config.data_file,
|
| 575 |
+
"tokenizer_variant": tokenizer_variant,
|
| 576 |
+
"vocab_file": vocab_path,
|
| 577 |
+
"vocab_size": tokenizer.vocab_size,
|
| 578 |
+
"max_seq_length": config.max_seq_length,
|
| 579 |
+
"hidden_size": config.hidden_size,
|
| 580 |
+
"num_hidden_layers": config.num_hidden_layers,
|
| 581 |
+
"num_attention_heads": config.num_attention_heads,
|
| 582 |
+
"intermediate_size": config.intermediate_size,
|
| 583 |
+
"train_samples": len(train_dataset),
|
| 584 |
+
"eval_samples": len(eval_dataset),
|
| 585 |
+
"epochs": config.num_epochs,
|
| 586 |
+
"batch_size": config.batch_size,
|
| 587 |
+
"learning_rate": config.learning_rate,
|
| 588 |
+
"warmup_steps": config.warmup_steps,
|
| 589 |
+
"seed": args.seed,
|
| 590 |
+
"device": "cpu" if use_cpu else "cuda",
|
| 591 |
+
"fp16": use_fp16,
|
| 592 |
+
"gradient_accumulation_steps": training_args.gradient_accumulation_steps,
|
| 593 |
+
"dataloader_num_workers": config.num_workers,
|
| 594 |
+
}
|
| 595 |
+
with open(os.path.join(final_save_path, "run_metadata.json"), "w", encoding="utf-8") as f:
|
| 596 |
+
json.dump(metadata, f, ensure_ascii=False, indent=2)
|
| 597 |
print(f"Model saved to: {final_save_path}")
|
| 598 |
|
| 599 |
# Final evaluation
|
|
|
|
| 601 |
eval_results = trainer.evaluate()
|
| 602 |
for key, value in eval_results.items():
|
| 603 |
print(f" {key}: {value:.4f}")
|
| 604 |
+
with open(os.path.join(final_save_path, "trainer_eval_metrics.json"), "w", encoding="utf-8") as f:
|
| 605 |
+
json.dump({key: float(value) for key, value in eval_results.items()}, f, ensure_ascii=False, indent=2)
|
| 606 |
+
|
| 607 |
+
if args.parse_eval_limit != 0:
|
| 608 |
+
parse_limit = args.parse_eval_limit if args.parse_eval_limit and args.parse_eval_limit > 0 else None
|
| 609 |
+
parse_metrics = parse_exact_metrics(
|
| 610 |
+
eval_data,
|
| 611 |
+
trainer.model,
|
| 612 |
+
tokenizer,
|
| 613 |
+
config.id2label,
|
| 614 |
+
config.max_seq_length,
|
| 615 |
+
parse_limit,
|
| 616 |
+
)
|
| 617 |
+
with open(os.path.join(final_save_path, "parse_eval_metrics.json"), "w", encoding="utf-8") as f:
|
| 618 |
+
json.dump(parse_metrics, f, ensure_ascii=False, indent=2)
|
| 619 |
+
print("\nParse exact-match evaluation:")
|
| 620 |
+
print(
|
| 621 |
+
f" full_match: {parse_metrics['full_match_correct']}/"
|
| 622 |
+
f"{parse_metrics['full_match_total']} ({parse_metrics['full_match_accuracy']:.4f})"
|
| 623 |
+
)
|
| 624 |
+
for field, accuracy in parse_metrics["field_accuracy"].items():
|
| 625 |
+
correct = parse_metrics["field_correct"][field]
|
| 626 |
+
total = parse_metrics["field_total"][field]
|
| 627 |
+
print(f" {field}: {correct}/{total} ({accuracy:.4f})")
|
| 628 |
|
| 629 |
|
| 630 |
if __name__ == "__main__":
|
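The `normalize_field_value` helper added in train.py decides what counts as an exact match in `parse_exact_metrics`. The function below is copied from the diff with its indentation reconstructed (the rendered diff dropped leading whitespace), followed by a few illustrative inputs that are not taken from the dataset:

```python
from typing import Optional


def normalize_field_value(field: str, value) -> Optional[str]:
    if value is None:
        return None
    if field in {"episode", "season"}:
        try:
            # Numeric fields compare as integers, so "03" and 3 are equal.
            return str(int(value))
        except (TypeError, ValueError):
            return str(value).strip().lower()
    text = str(value).strip()
    if field in {"resolution", "source"}:
        return text.lower().replace("_", "-")
    # Other fields: lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


episode_padded = normalize_field_value("episode", "03")
episode_versioned = normalize_field_value("episode", "12v2")
source_value = normalize_field_value("source", "WEB_DL")
title_value = normalize_field_value("title", "  My   Show ")
missing_value = normalize_field_value("season", None)
```

Under these rules a prediction of `3` matches a gold value of `"03"`, while a non-numeric episode such as `"12v2"` falls back to a case-insensitive string comparison.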
trainer_eval_metrics.json
ADDED
|
@@ -0,0 +1,11 @@
+{
+  "eval_loss": 0.01631847210228443,
+  "eval_precision": 0.9799749533444652,
+  "eval_recall": 0.986698478236683,
+  "eval_f1": 0.9833252228334185,
+  "eval_accuracy": 0.9943065860243627,
+  "eval_runtime": 39.3604,
+  "eval_samples_per_second": 321.161,
+  "eval_steps_per_second": 1.27,
+  "epoch": 2.0
+}
|
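As a quick sanity check on trainer_eval_metrics.json, the reported F1 can be recomputed from the reported precision and recall. This sketch assumes `eval_f1` is the harmonic mean of `eval_precision` and `eval_recall`, which the numbers bear out:

```python
# Values copied from trainer_eval_metrics.json.
precision = 0.9799749533444652
recall = 0.986698478236683
reported_f1 = 0.9833252228334185

# F1 is the harmonic mean of precision and recall.
recomputed_f1 = 2 * precision * recall / (precision + recall)
difference = abs(recomputed_f1 - reported_f1)
```

The difference is well below rounding noise, so the three figures are mutually consistent.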
training_args.bin
CHANGED
|
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:b5aa0df615ce731796aa9934b0505e00a685611be134c071d7b2487d8112dde1
 size 5265
|
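training_args.bin is stored as a Git LFS pointer, a small text file of `key value` lines. A minimal sketch for reading such a pointer, assuming only the three keys visible in this diff (`version`, `oid`, `size`); `parse_lfs_pointer` is an illustrative name, not a Git LFS API:

```python
from typing import Dict, Union


def parse_lfs_pointer(text: str) -> Dict[str, Union[str, int]]:
    """Split a Git LFS pointer file into its version, hash algorithm, digest and size."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "oid_algorithm": algorithm,
        "oid": digest,
        "size": int(fields["size"]),
    }


pointer_text = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:b5aa0df615ce731796aa9934b0505e00a685611be134c071d7b2487d8112dde1\n"
    "size 5265\n"
)
pointer = parse_lfs_pointer(pointer_text)
```

The size is unchanged at 5265 bytes in this commit; only the SHA-256 digest differs, which is what one would expect when the serialized arguments change but their layout does not.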
uv.lock
ADDED
|
The diff for this file is too large to render.
|
|
|
vocab.char.json
CHANGED
|
@@ -56,8 +56,8 @@
   "N": 54,
   "3": 55,
   "(": 56,
-  "
-  "
+  "K": 57,
+  ")": 58,
   "g": 59,
   "y": 60,
   "O": 61,
|
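The vocab.char.json hunk restores ids 57 and 58, so a simple consistency check on the id mapping is a natural follow-up. The sketch below looks for duplicate or missing ids in a vocabulary dictionary; it runs on the eight entries visible in the new side of the hunk, and `find_vocab_problems` is a hypothetical helper, not something from the repository:

```python
import json
from collections import Counter
from typing import Dict, List


def find_vocab_problems(vocab: Dict[str, int]) -> Dict[str, List[int]]:
    """Report ids used more than once and gaps in the id range covered by vocab."""
    counts = Counter(vocab.values())
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    ids = sorted(counts)
    missing = sorted(set(range(ids[0], ids[-1] + 1)) - set(ids)) if ids else []
    return {"duplicates": duplicates, "missing": missing}


# Entries taken from the new side of the hunk above.
fragment = json.loads(
    '{"N": 54, "3": 55, "(": 56, "K": 57, ")": 58, "g": 59, "y": 60, "O": 61}'
)
problems = find_vocab_problems(fragment)
```

For the visible fragment both lists come back empty: ids 54 through 61 are each used exactly once.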
vocab.json
CHANGED
|
The diff for this file is too large to render.
|
|
|