Token Classification
Transformers
ONNX
Safetensors
English
Japanese
Chinese
bert
anime
filename-parsing
Eval Results (legacy)
How to use ModerRAS/AniFileBERT with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("token-classification", model="ModerRAS/AniFileBERT")

# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("ModerRAS/AniFileBERT")
model = AutoModelForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
Remove structural parser rule assists
- MAINTENANCE.md +4 -3
- README.md +5 -5
- benchmark_inference.py +1 -5
- benchmark_results.json +16 -17
- case_metrics.json +0 -569
- diagnose_pipeline.py +8 -53
- docs/onnx.md +5 -6
- docs/training.md +2 -2
- evaluate_parser_cases.py +4 -18
- inference.py +10 -661
- onnx_inference.py +3 -7
- parse_eval_metrics.json +1 -624
- train.py +3 -8
MAINTENANCE.md
CHANGED
|
@@ -121,11 +121,12 @@ uv run python benchmark_inference.py --model-dir . --onnx exports/anime_filename
|
|
| 121 |
```
|
| 122 |
|
| 123 |
The default parser path is thin runtime: model logits, constrained BIO, entity
|
| 124 |
-
aggregation, and light string/number normalization.
|
| 125 |
-
|
|
|
|
| 126 |
|
| 127 |
默认解析路径是薄层运行时:模型 logits、约束 BIO、实体聚合和轻量字符串/数字规范化。
|
| 128 |
-
|
| 129 |
|
| 130 |
## Dataset Submodule / 数据集子模块
|
| 131 |
|
|
|
|
| 121 |
```
|
| 122 |
|
| 123 |
The default parser path is thin runtime: model logits, constrained BIO, entity
|
| 124 |
+
aggregation, and light string/number normalization. Do not add structural
|
| 125 |
+
filename regex assists back to the default runtime; parser quality should come
|
| 126 |
+
from labels and model training.
|
| 127 |
|
| 128 |
默认解析路径是薄层运行时:模型 logits、约束 BIO、实体聚合和轻量字符串/数字规范化。
|
| 129 |
+
不要把结构化文件名正则辅助重新加回默认运行时;解析质量应来自标签和模型训练。
|
| 130 |
|
| 131 |
## Dataset Submodule / 数据集子模块
|
| 132 |
|
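The MAINTENANCE.md note above lists "constrained BIO" as one step of the thin runtime but does not show how it works. The following is a minimal sketch of one common way to do it: a Viterbi search over per-token logits that forbids an `I-X` label unless it follows `B-X` or `I-X`. The function names `allowed` and `constrained_bio_decode`, the label names, and the logits are illustrative assumptions; the repository's actual decoder is not shown in this diff and may differ.

```python
from typing import Dict, List, Sequence


def allowed(prev: str, cur: str) -> bool:
    """BIO rule: I-X may only follow B-X or I-X; O and B-* may follow anything."""
    if cur.startswith("I-"):
        entity = cur[2:]
        return prev in ("B-" + entity, "I-" + entity)
    return True


def constrained_bio_decode(
    logits: Sequence[Sequence[float]], id2label: Dict[int, str]
) -> List[str]:
    """Return the highest-scoring label path that obeys the BIO rule (hypothetical helper)."""
    if not logits:
        return []
    labels = [id2label[i] for i in range(len(id2label))]
    n = len(labels)
    neg_inf = float("-inf")

    # A sequence cannot start inside an entity, so I-* is masked at position 0.
    score = [neg_inf if labels[j].startswith("I-") else logits[0][j] for j in range(n)]
    backpointers: List[List[int]] = []

    for t in range(1, len(logits)):
        new_score, ptr = [], []
        for j in range(n):
            best_i, best = -1, neg_inf
            for i in range(n):
                if score[i] > best and allowed(labels[i], labels[j]):
                    best_i, best = i, score[i]
            new_score.append(best + logits[t][j] if best_i >= 0 else neg_inf)
            ptr.append(best_i)
        score = new_score
        backpointers.append(ptr)

    # Backtrack from the best final state.
    j = max(range(n), key=lambda k: score[k])
    path = [j]
    for ptr in reversed(backpointers):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [labels[k] for k in path]


if __name__ == "__main__":
    demo_labels = {0: "O", 1: "B-title", 2: "I-title"}
    demo_logits = [[0.1, 0.2, 2.0], [0.0, 0.1, 1.5]]
    print(constrained_bio_decode(demo_logits, demo_labels))
```

In the demo, a greedy argmax would pick `I-title` for the first token, which is not a valid BIO start; the constrained search falls back to `B-title` followed by `I-title`.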
README.md
CHANGED
|
@@ -146,11 +146,11 @@ Current published checkpoint:
|
|
| 146 |
| Focus held-out, default thin runtime / 困难抽样,默认薄层运行时 | 1017/1024 full match = `99.32%` |
|
| 147 |
| Token/entity eval / token/entity 评估 | F1 `0.9972`, token accuracy `0.9995` |
|
| 148 |
| ONNX parity / ONNX 误差 | max abs diff `4.0531e-05` |
|
| 149 |
-
| CPU thin-runtime latency / CPU 薄层运行时延迟 | ONNX avg `13.
|
| 150 |
|
| 151 |
-
**中文**:当前发布模型是“两阶段训练”产物:先在 `datasets/AnimeName/dmhy_weak_char.jsonl` 上全量 CUDA 重训,再做 thin hard-case focus 微调。细节见 `training_lineage.json`。README 主指标以 `model-only` 和默认薄层 `normalized-only` 为准;
|
| 152 |
|
| 153 |
-
**English**: The published checkpoint was trained in two stages: a full CUDA fine-tune on `datasets/AnimeName/dmhy_weak_char.jsonl`, followed by a thin hard-case focus fine-tune. See `training_lineage.json` for details. README quality numbers prioritize `model-only` and the default thin `normalized-only` runtime;
|
| 154 |
|
| 155 |
Run regression:
|
| 156 |
|
|
@@ -177,8 +177,8 @@ decoding, entity aggregation, and light string/number normalization:
|
|
| 177 |
|
| 178 |
| Backend / 后端 | Load ms / 加载 ms | Avg ms / 平均 ms | P50 ms | P95 ms | P99 ms | files/s |
|
| 179 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
|
| 180 |
-
| PyTorch |
|
| 181 |
-
| ONNX Runtime |
|
| 182 |
|
| 183 |
**中文**:这是完整薄层 parser 的端到端延迟,不是只测模型 forward。移动端实现应复用 ONNX session,并保持 tokenizer/BIO/薄规范化逻辑一致。
|
| 184 |
|
|
|
|
| 146 |
| Focus held-out, default thin runtime / 困难抽样,默认薄层运行时 | 1017/1024 full match = `99.32%` |
|
| 147 |
| Token/entity eval / token/entity 评估 | F1 `0.9972`, token accuracy `0.9995` |
|
| 148 |
| ONNX parity / ONNX 误差 | max abs diff `4.0531e-05` |
|
| 149 |
+
| CPU thin-runtime latency / CPU 薄层运行时延迟 | ONNX avg `13.18 ms`, P95 `16.70 ms` |
|
| 150 |
|
| 151 |
+
**中文**:当前发布模型是“两阶段训练”产物:先在 `datasets/AnimeName/dmhy_weak_char.jsonl` 上全量 CUDA 重训,再做 thin hard-case focus 微调。细节见 `training_lineage.json`。README 主指标以 `model-only` 和默认薄层 `normalized-only` 为准;旧版结构规则辅助层已移除,不再作为运行时或质量对照。
|
| 152 |
|
| 153 |
+
**English**: The published checkpoint was trained in two stages: a full CUDA fine-tune on `datasets/AnimeName/dmhy_weak_char.jsonl`, followed by a thin hard-case focus fine-tune. See `training_lineage.json` for details. README quality numbers prioritize `model-only` and the default thin `normalized-only` runtime; structural filename assists have been removed from the runtime and quality reports.
|
| 154 |
|
| 155 |
Run regression:
|
| 156 |
|
|
|
|
| 177 |
|
| 178 |
| Backend / 后端 | Load ms / 加载 ms | Avg ms / 平均 ms | P50 ms | P95 ms | P99 ms | files/s |
|
| 179 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
|
| 180 |
+
| PyTorch | 76.56 | 16.85 | 16.21 | 22.84 | 28.31 | 59.4 |
|
| 181 |
+
| ONNX Runtime | 49.74 | 13.18 | 12.86 | 16.70 | 18.06 | 75.9 |
|
| 182 |
|
| 183 |
**中文**:这是完整薄层 parser 的端到端延迟,不是只测模型 forward。移动端实现应复用 ONNX session,并保持 tokenizer/BIO/薄规范化逻辑一致。
|
| 184 |
|
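The "ONNX parity" row in the README table above reports a maximum absolute difference between outputs. As a hedged illustration of what that metric measures, the sketch below compares two logits matrices given as nested lists. The function name `max_abs_diff` and the sample values are assumptions for illustration; the diff does not show how the repository computes its figure.

```python
from typing import Sequence


def max_abs_diff(a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]) -> float:
    """Largest element-wise |a - b| between two equally shaped logits matrices."""
    if len(a) != len(b):
        raise ValueError("row count mismatch")
    worst = 0.0
    for row_a, row_b in zip(a, b):
        if len(row_a) != len(row_b):
            raise ValueError("column count mismatch")
        for x, y in zip(row_a, row_b):
            worst = max(worst, abs(x - y))
    return worst


if __name__ == "__main__":
    # Illustrative values only, not real model outputs.
    torch_logits = [[1.25, -0.5], [0.0, 3.0]]
    onnx_logits = [[1.25, -0.5], [0.0, 3.5]]
    print(max_abs_diff(torch_logits, onnx_logits))
```

A small value on real outputs indicates that the exported ONNX graph reproduces the PyTorch logits closely enough that decoding is unlikely to change.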
benchmark_inference.py
CHANGED
|
@@ -95,8 +95,6 @@ def main() -> None:
|
|
| 95 |
parser.add_argument("--torch-threads", type=int, default=1, help="torch intra-op thread count")
|
| 96 |
parser.add_argument("--ort-threads", type=int, default=1, help="ONNX Runtime intra/inter-op thread count")
|
| 97 |
parser.add_argument("--no-constrained-bio", action="store_true", help="Use greedy labels for PyTorch backend")
|
| 98 |
-
parser.add_argument("--rule-assist", action="store_true", help="Enable legacy structural postprocessing")
|
| 99 |
-
parser.add_argument("--no-rule-assist", action="store_true", help=argparse.SUPPRESS)
|
| 100 |
parser.add_argument("--output", default=None, help="Optional JSON output path")
|
| 101 |
args = parser.parse_args()
|
| 102 |
|
|
@@ -128,7 +126,6 @@ def main() -> None:
|
|
| 128 |
id2label,
|
| 129 |
max_length=resolved_max_length,
|
| 130 |
debug=False,
|
| 131 |
-
use_rules=args.rule_assist and not args.no_rule_assist,
|
| 132 |
constrain_bio=not args.no_constrained_bio,
|
| 133 |
)
|
| 134 |
|
|
@@ -150,7 +147,7 @@ def main() -> None:
|
|
| 150 |
load_ms = (time.perf_counter() - load_start) * 1000.0
|
| 151 |
|
| 152 |
def parse_onnx(filename: str) -> Dict:
|
| 153 |
-
return onnx_parser.parse(filename
|
| 154 |
|
| 155 |
raw = run_benchmark("onnxruntime", parse_onnx, filenames, args.warmup, args.repeat)
|
| 156 |
results.append(summarize(raw["name"], load_ms, raw["latencies_ms"]))
|
|
@@ -164,7 +161,6 @@ def main() -> None:
|
|
| 164 |
"warmup": args.warmup,
|
| 165 |
"torch_threads": args.torch_threads,
|
| 166 |
"ort_threads": args.ort_threads,
|
| 167 |
-
"use_rules": args.rule_assist and not args.no_rule_assist,
|
| 168 |
"constrain_bio": not args.no_constrained_bio,
|
| 169 |
"results": results,
|
| 170 |
}
|
|
|
|
| 95 |
parser.add_argument("--torch-threads", type=int, default=1, help="torch intra-op thread count")
|
| 96 |
parser.add_argument("--ort-threads", type=int, default=1, help="ONNX Runtime intra/inter-op thread count")
|
| 97 |
parser.add_argument("--no-constrained-bio", action="store_true", help="Use greedy labels for PyTorch backend")
|
|
|
|
|
|
|
| 98 |
parser.add_argument("--output", default=None, help="Optional JSON output path")
|
| 99 |
args = parser.parse_args()
|
| 100 |
|
|
|
|
| 126 |
id2label,
|
| 127 |
max_length=resolved_max_length,
|
| 128 |
debug=False,
|
|
|
|
| 129 |
constrain_bio=not args.no_constrained_bio,
|
| 130 |
)
|
| 131 |
|
|
|
|
| 147 |
load_ms = (time.perf_counter() - load_start) * 1000.0
|
| 148 |
|
| 149 |
def parse_onnx(filename: str) -> Dict:
|
| 150 |
+
return onnx_parser.parse(filename)
|
| 151 |
|
| 152 |
raw = run_benchmark("onnxruntime", parse_onnx, filenames, args.warmup, args.repeat)
|
| 153 |
results.append(summarize(raw["name"], load_ms, raw["latencies_ms"]))
|
|
|
|
| 161 |
"warmup": args.warmup,
|
| 162 |
"torch_threads": args.torch_threads,
|
| 163 |
"ort_threads": args.ort_threads,
|
|
|
|
| 164 |
"constrain_bio": not args.no_constrained_bio,
|
| 165 |
"results": results,
|
| 166 |
}
|
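The hunks above call `run_benchmark(name, parse_fn, filenames, warmup, repeat)` and read `raw["name"]` and `raw["latencies_ms"]` from its result, but its body is not part of this diff. The sketch below is one plausible shape for such a function, assuming warmup calls are untimed and every file is timed once per repeat; the real implementation may order or count runs differently.

```python
import time
from typing import Callable, Dict, List


def run_benchmark(
    name: str,
    parse_fn: Callable[[str], Dict],
    filenames: List[str],
    warmup: int,
    repeat: int,
) -> Dict:
    """Time parse_fn per filename; assumed body, not the repository's code."""
    # Untimed warmup calls let caches and lazy initialisation settle.
    for i in range(warmup):
        parse_fn(filenames[i % len(filenames)])

    latencies_ms: List[float] = []
    for _ in range(repeat):
        for filename in filenames:
            start = time.perf_counter()
            parse_fn(filename)
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return {"name": name, "latencies_ms": latencies_ms}


if __name__ == "__main__":
    # A stand-in parser; a real run would pass the PyTorch or ONNX parse function.
    raw = run_benchmark("dummy", lambda f: {"title": f}, ["a.mkv", "b.mkv"], 1, 3)
    print(raw["name"], len(raw["latencies_ms"]))
```

Timing each call individually, rather than the whole loop, is what makes per-call percentiles such as P95 and P99 available afterwards.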
benchmark_results.json
CHANGED
|
@@ -7,32 +7,31 @@
|
|
| 7 |
"warmup": 20,
|
| 8 |
"torch_threads": 1,
|
| 9 |
"ort_threads": 1,
|
| 10 |
-
"use_rules": false,
|
| 11 |
"constrain_bio": true,
|
| 12 |
"results": [
|
| 13 |
{
|
| 14 |
"name": "pytorch",
|
| 15 |
-
"load_ms":
|
| 16 |
"runs": 520,
|
| 17 |
-
"avg_ms":
|
| 18 |
-
"p50_ms":
|
| 19 |
-
"p95_ms":
|
| 20 |
-
"p99_ms":
|
| 21 |
-
"min_ms": 11.
|
| 22 |
-
"max_ms":
|
| 23 |
-
"throughput_fps":
|
| 24 |
},
|
| 25 |
{
|
| 26 |
"name": "onnxruntime",
|
| 27 |
-
"load_ms":
|
| 28 |
"runs": 520,
|
| 29 |
-
"avg_ms": 13.
|
| 30 |
-
"p50_ms": 12.
|
| 31 |
-
"p95_ms":
|
| 32 |
-
"p99_ms":
|
| 33 |
-
"min_ms":
|
| 34 |
-
"max_ms":
|
| 35 |
-
"throughput_fps":
|
| 36 |
}
|
| 37 |
]
|
| 38 |
}
|
|
|
|
| 7 |
"warmup": 20,
|
| 8 |
"torch_threads": 1,
|
| 9 |
"ort_threads": 1,
|
|
|
|
| 10 |
"constrain_bio": true,
|
| 11 |
"results": [
|
| 12 |
{
|
| 13 |
"name": "pytorch",
|
| 14 |
+
"load_ms": 76.55749993864447,
|
| 15 |
"runs": 520,
|
| 16 |
+
"avg_ms": 16.846879808312785,
|
| 17 |
+
"p50_ms": 16.207700013183057,
|
| 18 |
+
"p95_ms": 22.843200032366425,
|
| 19 |
+
"p99_ms": 28.308318012859665,
|
| 20 |
+
"min_ms": 11.152399936690927,
|
| 21 |
+
"max_ms": 34.10990000702441,
|
| 22 |
+
"throughput_fps": 59.35817263363916
|
| 23 |
},
|
| 24 |
{
|
| 25 |
"name": "onnxruntime",
|
| 26 |
+
"load_ms": 49.74160005804151,
|
| 27 |
"runs": 520,
|
| 28 |
+
"avg_ms": 13.178169615835381,
|
| 29 |
+
"p50_ms": 12.862899922765791,
|
| 30 |
+
"p95_ms": 16.696884995326396,
|
| 31 |
+
"p99_ms": 18.06362595874816,
|
| 32 |
+
"min_ms": 9.811799973249435,
|
| 33 |
+
"max_ms": 20.784800057299435,
|
| 34 |
+
"throughput_fps": 75.88307247148819
|
| 35 |
}
|
| 36 |
]
|
| 37 |
}
|
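The fields in benchmark_results.json above can be derived from a list of per-call latencies. The sketch below shows one way to do that, using the same key names as the JSON. The percentile method (linear interpolation between ranks) is an assumption, since the repository's `summarize` body is not shown. The throughput formula `1000 / avg_ms` is consistent with the published numbers: `1000 / 16.8469` is about `59.36` for PyTorch and `1000 / 13.1782` is about `75.88` for ONNX Runtime.

```python
from typing import Dict, List


def percentile(sorted_vals: List[float], q: float) -> float:
    """q-th percentile with linear interpolation (assumed method)."""
    if len(sorted_vals) == 1:
        return sorted_vals[0]
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def summarize(name: str, load_ms: float, latencies_ms: List[float]) -> Dict:
    """Build one entry of the "results" list from raw latencies (sketch)."""
    if not latencies_ms:
        raise ValueError("no latencies recorded")
    vals = sorted(latencies_ms)
    avg_ms = sum(vals) / len(vals)
    return {
        "name": name,
        "load_ms": load_ms,
        "runs": len(vals),
        "avg_ms": avg_ms,
        "p50_ms": percentile(vals, 50),
        "p95_ms": percentile(vals, 95),
        "p99_ms": percentile(vals, 99),
        "min_ms": vals[0],
        "max_ms": vals[-1],
        # One file per call, so files per second is 1000 ms divided by the mean latency.
        "throughput_fps": 1000.0 / avg_ms,
    }


if __name__ == "__main__":
    print(summarize("demo", 5.0, [10.0, 20.0, 30.0, 40.0]))
```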
case_metrics.json
CHANGED
|
@@ -6,7 +6,6 @@
|
|
| 6 |
"case_file": "data/parser_regression_cases.json",
|
| 7 |
"tokenizer_variant": "char",
|
| 8 |
"max_length": 128,
|
| 9 |
-
"use_rules": false,
|
| 10 |
"constrain_bio": false,
|
| 11 |
"case_count": 26,
|
| 12 |
"full_correct": 25,
|
|
@@ -606,574 +605,6 @@
|
|
| 606 |
"case_file": "data/parser_regression_cases.json",
|
| 607 |
"tokenizer_variant": "char",
|
| 608 |
"max_length": 128,
|
| 609 |
-
"use_rules": false,
|
| 610 |
-
"constrain_bio": true,
|
| 611 |
-
"case_count": 26,
|
| 612 |
-
"full_correct": 26,
|
| 613 |
-
"full_accuracy": 1.0,
|
| 614 |
-
"field_correct": {
|
| 615 |
-
"group": 22,
|
| 616 |
-
"title": 26,
|
| 617 |
-
"episode": 26,
|
| 618 |
-
"resolution": 26,
|
| 619 |
-
"source": 19,
|
| 620 |
-
"season": 9,
|
| 621 |
-
"special": 5
|
| 622 |
-
},
|
| 623 |
-
"field_total": {
|
| 624 |
-
"group": 22,
|
| 625 |
-
"title": 26,
|
| 626 |
-
"episode": 26,
|
| 627 |
-
"resolution": 26,
|
| 628 |
-
"source": 19,
|
| 629 |
-
"season": 9,
|
| 630 |
-
"special": 5
|
| 631 |
-
},
|
| 632 |
-
"field_accuracy": {
|
| 633 |
-
"episode": 1.0,
|
| 634 |
-
"group": 1.0,
|
| 635 |
-
"resolution": 1.0,
|
| 636 |
-
"season": 1.0,
|
| 637 |
-
"source": 1.0,
|
| 638 |
-
"special": 1.0,
|
| 639 |
-
"title": 1.0
|
| 640 |
-
},
|
| 641 |
-
"failures": [],
|
| 642 |
-
"results": [
|
| 643 |
-
{
|
| 644 |
-
"id": "lolihouse_dash_episode",
|
| 645 |
-
"filename": "[LoliHouse] Yomi no Tsugai - 07 [WebRip 1080p HEVC-10bit AAC ASSx2]",
|
| 646 |
-
"ok": true,
|
| 647 |
-
"errors": {},
|
| 648 |
-
"expected": {
|
| 649 |
-
"group": "LoliHouse",
|
| 650 |
-
"title": "Yomi no Tsugai",
|
| 651 |
-
"episode": 7,
|
| 652 |
-
"resolution": "1080p",
|
| 653 |
-
"source": "WebRip"
|
| 654 |
-
},
|
| 655 |
-
"pred": {
|
| 656 |
-
"episode": 7,
|
| 657 |
-
"group": "LoliHouse",
|
| 658 |
-
"resolution": "1080p",
|
| 659 |
-
"source": "WebRip",
|
| 660 |
-
"title": "Yomi no Tsugai"
|
| 661 |
-
}
|
| 662 |
-
},
|
| 663 |
-
{
|
| 664 |
-
"id": "dot_season_episode_no_group",
|
| 665 |
-
"filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub",
|
| 666 |
-
"ok": true,
|
| 667 |
-
"errors": {},
|
| 668 |
-
"expected": {
|
| 669 |
-
"title": "Witch.Hat.Atelier",
|
| 670 |
-
"season": 1,
|
| 671 |
-
"episode": 7,
|
| 672 |
-
"group": null,
|
| 673 |
-
"resolution": "1080p",
|
| 674 |
-
"source": "NF"
|
| 675 |
-
},
|
| 676 |
-
"pred": {
|
| 677 |
-
"episode": 7,
|
| 678 |
-
"group": null,
|
| 679 |
-
"resolution": "1080p",
|
| 680 |
-
"season": 1,
|
| 681 |
-
"source": "NF",
|
| 682 |
-
"title": "Witch.Hat.Atelier"
|
| 683 |
-
}
|
| 684 |
-
},
|
| 685 |
-
{
|
| 686 |
-
"id": "ani_cjk_season_dash_episode",
|
| 687 |
-
"filename": "[ANi] 異世界悠閒農家 2 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]",
|
| 688 |
-
"ok": true,
|
| 689 |
-
"errors": {},
|
| 690 |
-
"expected": {
|
| 691 |
-
"group": "ANi",
|
| 692 |
-
"title": "異世界悠閒農家",
|
| 693 |
-
"season": 2,
|
| 694 |
-
"episode": 6,
|
| 695 |
-
"resolution": "1080P",
|
| 696 |
-
"source": "Baha"
|
| 697 |
-
},
|
| 698 |
-
"pred": {
|
| 699 |
-
"episode": 6,
|
| 700 |
-
"group": "ANi",
|
| 701 |
-
"resolution": "1080P",
|
| 702 |
-
"season": 2,
|
| 703 |
-
"source": "Baha",
|
| 704 |
-
"title": "異世界悠閒農家"
|
| 705 |
-
}
|
| 706 |
-
},
|
| 707 |
-
{
|
| 708 |
-
"id": "kisssub_bracket_title_episode",
|
| 709 |
-
"filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][GB][MP4]",
|
| 710 |
-
"ok": true,
|
| 711 |
-
"errors": {},
|
| 712 |
-
"expected": {
|
| 713 |
-
"group": "KissSub",
|
| 714 |
-
"title": "Shunkashuutou Daikousha - Haru no Mai",
|
| 715 |
-
"episode": 5,
|
| 716 |
-
"resolution": "1080P",
|
| 717 |
-
"source": "GB"
|
| 718 |
-
},
|
| 719 |
-
"pred": {
|
| 720 |
-
"episode": 5,
|
| 721 |
-
"group": "KissSub",
|
| 722 |
-
"resolution": "1080P",
|
| 723 |
-
"source": "GB",
|
| 724 |
-
"title": "Shunkashuutou Daikousha - Haru no Mai"
|
| 725 |
-
}
|
| 726 |
-
},
|
| 727 |
-
{
|
| 728 |
-
"id": "airotabracket_title_episode",
|
| 729 |
-
"filename": "[Airota][Sousou no Frieren][29][1080p AVC AAC][CHT]",
|
| 730 |
-
"ok": true,
|
| 731 |
-
"errors": {},
|
| 732 |
-
"expected": {
|
| 733 |
-
"group": "Airota",
|
| 734 |
-
"title": "Sousou no Frieren",
|
| 735 |
-
"episode": 29,
|
| 736 |
-
"resolution": "1080p",
|
| 737 |
-
"source": "CHT"
|
| 738 |
-
},
|
| 739 |
-
"pred": {
|
| 740 |
-
"episode": 29,
|
| 741 |
-
"group": "Airota",
|
| 742 |
-
"resolution": "1080p",
|
| 743 |
-
"source": "CHT",
|
| 744 |
-
"title": "Sousou no Frieren"
|
| 745 |
-
}
|
| 746 |
-
},
|
| 747 |
-
{
|
| 748 |
-
"id": "subsplease_parenthesized_resolution",
|
| 749 |
-
"filename": "[SubsPlease] Mushoku Tensei - 12 (1080p) [x265][AAC]",
|
| 750 |
-
"ok": true,
|
| 751 |
-
"errors": {},
|
| 752 |
-
"expected": {
|
| 753 |
-
"group": "SubsPlease",
|
| 754 |
-
"title": "Mushoku Tensei",
|
| 755 |
-
"episode": 12,
|
| 756 |
-
"resolution": "1080p"
|
| 757 |
-
},
|
| 758 |
-
"pred": {
|
| 759 |
-
"episode": 12,
|
| 760 |
-
"group": "SubsPlease",
|
| 761 |
-
"resolution": "1080p",
|
| 762 |
-
"title": "Mushoku Tensei"
|
| 763 |
-
}
|
| 764 |
-
},
|
| 765 |
-
{
|
| 766 |
-
"id": "vcb_bracket_episode",
|
| 767 |
-
"filename": "[VCB-Studio] Girls Band Cry [01][Ma10p_1080p][x265_flac]",
|
| 768 |
-
"ok": true,
|
| 769 |
-
"errors": {},
|
| 770 |
-
"expected": {
|
| 771 |
-
"group": "VCB-Studio",
|
| 772 |
-
"title": "Girls Band Cry",
|
| 773 |
-
"episode": 1,
|
| 774 |
-
"resolution": "1080p"
|
| 775 |
-
},
|
| 776 |
-
"pred": {
|
| 777 |
-
"episode": 1,
|
| 778 |
-
"group": "VCB-Studio",
|
| 779 |
-
"resolution": "1080p",
|
| 780 |
-
"title": "Girls Band Cry"
|
| 781 |
-
}
|
| 782 |
-
},
|
| 783 |
-
{
|
| 784 |
-
"id": "numeric_title_not_episode",
|
| 785 |
-
"filename": "86 Eighty Six - 01 [1080P][Baha]",
|
| 786 |
-
"ok": true,
|
| 787 |
-
"errors": {},
|
| 788 |
-
"expected": {
|
| 789 |
-
"title": "86 Eighty Six",
|
| 790 |
-
"episode": 1,
|
| 791 |
-
"resolution": "1080P",
|
| 792 |
-
"source": "Baha"
|
| 793 |
-
},
|
| 794 |
-
"pred": {
|
| 795 |
-
"episode": 1,
|
| 796 |
-
"resolution": "1080P",
|
| 797 |
-
"source": "Baha",
|
| 798 |
-
"title": "86 Eighty Six"
|
| 799 |
-
}
|
| 800 |
-
},
|
| 801 |
-
{
|
| 802 |
-
"id": "erai_raws_dash_episode",
|
| 803 |
-
"filename": "[Erai-raws] Sousou no Frieren - 01 [1080p][Multiple Subtitle][ENG]",
|
| 804 |
-
"ok": true,
|
| 805 |
-
"errors": {},
|
| 806 |
-
"expected": {
|
| 807 |
-
"group": "Erai-raws",
|
| 808 |
-
"title": "Sousou no Frieren",
|
| 809 |
-
"episode": 1,
|
| 810 |
-
"resolution": "1080p"
|
| 811 |
-
},
|
| 812 |
-
"pred": {
|
| 813 |
-
"episode": 1,
|
| 814 |
-
"group": "Erai-raws",
|
| 815 |
-
"resolution": "1080p",
|
| 816 |
-
"title": "Sousou no Frieren"
|
| 817 |
-
}
|
| 818 |
-
},
|
| 819 |
-
{
|
| 820 |
-
"id": "nekomoe_space_group",
|
| 821 |
-
"filename": "[Nekomoe kissaten][Watashi no Shiawase na Kekkon][01][1080p][JPSC]",
|
| 822 |
-
"ok": true,
|
| 823 |
-
"errors": {},
|
| 824 |
-
"expected": {
|
| 825 |
-
"group": "Nekomoe kissaten",
|
| 826 |
-
"title": "Watashi no Shiawase na Kekkon",
|
| 827 |
-
"episode": 1,
|
| 828 |
-
"resolution": "1080p"
|
| 829 |
-
},
|
| 830 |
-
"pred": {
|
| 831 |
-
"episode": 1,
|
| 832 |
-
"group": "Nekomoe kissaten",
|
| 833 |
-
"resolution": "1080p",
|
| 834 |
-
"title": "Watashi no Shiawase na Kekkon"
|
| 835 |
-
}
|
| 836 |
-
},
|
| 837 |
-
{
|
| 838 |
-
"id": "long_running_episode",
|
| 839 |
-
"filename": "One.Piece.1110.1080p.WEB-DL.AAC2.0.H.264",
|
| 840 |
-
"ok": true,
|
| 841 |
-
"errors": {},
|
| 842 |
-
"expected": {
|
| 843 |
-
"title": "One.Piece",
|
| 844 |
-
"episode": 1110,
|
| 845 |
-
"resolution": "1080p",
|
| 846 |
-
"source": "WEB-DL"
|
| 847 |
-
},
|
| 848 |
-
"pred": {
|
| 849 |
-
"episode": 1110,
|
| 850 |
-
"resolution": "1080p",
|
| 851 |
-
"source": "WEB-DL",
|
| 852 |
-
"title": "One.Piece"
|
| 853 |
-
}
|
| 854 |
-
},
|
| 855 |
-
{
|
| 856 |
-
"id": "season_episode_amzn",
|
| 857 |
-
"filename": "Example.Show.S02E03.2160p.AMZN.WEB-DL.DDP5.1.H.265",
|
| 858 |
-
"ok": true,
|
| 859 |
-
"errors": {},
|
| 860 |
-
"expected": {
|
| 861 |
-
"title": "Example.Show",
|
| 862 |
-
"season": 2,
|
| 863 |
-
"episode": 3,
|
| 864 |
-
"resolution": "2160p",
|
| 865 |
-
"source": "AMZN"
|
| 866 |
-
},
|
| 867 |
-
"pred": {
|
| 868 |
-
"episode": 3,
|
| 869 |
-
"resolution": "2160p",
|
| 870 |
-
"season": 2,
|
| 871 |
-
"source": "AMZN",
|
| 872 |
-
"title": "Example.Show"
|
| 873 |
-
}
|
| 874 |
-
},
|
| 875 |
-
{
|
| 876 |
-
"id": "cjk_group_with_prefix_tag",
|
| 877 |
-
"filename": "【喵萌奶茶屋】★04月新番★[葬送的芙莉莲][01][1080P][HEVC]",
|
| 878 |
-
"ok": true,
|
| 879 |
-
"errors": {},
|
| 880 |
-
"expected": {
|
| 881 |
-
"group": "喵萌奶茶屋",
|
| 882 |
-
"title": "葬送的芙莉莲",
|
| 883 |
-
"episode": 1,
|
| 884 |
-
"resolution": "1080P"
|
| 885 |
-
},
|
| 886 |
-
"pred": {
|
| 887 |
-
"episode": 1,
|
| 888 |
-
"group": "喵萌奶茶屋",
|
| 889 |
-
"resolution": "1080P",
|
| 890 |
-
"title": "葬送的芙莉莲"
|
| 891 |
-
}
|
| 892 |
-
},
|
| 893 |
-
{
|
| 894 |
-
"id": "leading_meta_not_group",
|
| 895 |
-
"filename": "[1080p] Witch Watch - 15 [CHS]",
|
| 896 |
-
"ok": true,
|
| 897 |
-
"errors": {},
|
| 898 |
-
"expected": {
|
| 899 |
-
"group": null,
|
| 900 |
-
"title": "Witch Watch",
|
| 901 |
-
"episode": 15,
|
| 902 |
-
"resolution": "1080p",
|
| 903 |
-
"source": "CHS"
|
| 904 |
-
},
|
| 905 |
-
"pred": {
|
| 906 |
-
"episode": 15,
|
| 907 |
-
"group": null,
|
| 908 |
-
"resolution": "1080p",
|
| 909 |
-
"source": "CHS",
|
| 910 |
-
"title": "Witch Watch"
|
| 911 |
-
}
|
| 912 |
-
},
|
| 913 |
-
{
|
| 914 |
-
"id": "sakurato_group_language_source",
|
| 915 |
-
"filename": "[Sakurato] Witch Watch - 15 [1080p][CHS]",
|
| 916 |
-
"ok": true,
|
| 917 |
-
"errors": {},
|
| 918 |
-
"expected": {
|
| 919 |
-
"group": "Sakurato",
|
| 920 |
-
"title": "Witch Watch",
|
| 921 |
-
"episode": 15,
|
| 922 |
-
"resolution": "1080p",
|
| 923 |
-
"source": "CHS"
|
| 924 |
-
},
|
| 925 |
-
"pred": {
|
| 926 |
-
"episode": 15,
|
| 927 |
-
"group": "Sakurato",
|
| 928 |
-
"resolution": "1080p",
|
| 929 |
-
"source": "CHS",
|
| 930 |
-
"title": "Witch Watch"
|
| 931 |
-
}
|
| 932 |
-
},
|
| 933 |
-
{
|
| 934 |
-
"id": "billion_meta_lab_search_special",
|
| 935 |
-
"filename": "[Billion Meta Lab] 魔法姊妹露露莉莉 Mahou no Shimai Rurutto Riryi [07][1080P][CHT&JPN][檢索:魔法姊妹露露特莉莉].mp4",
|
| 936 |
-
"ok": true,
|
| 937 |
-
"errors": {},
|
| 938 |
-
"expected": {
|
| 939 |
-
"group": "Billion Meta Lab",
|
| 940 |
-
"title": "魔法姊妹露露莉莉 Mahou no Shimai Rurutto Riryi",
|
| 941 |
-
"episode": 7,
|
| 942 |
-
"resolution": "1080P",
|
| 943 |
-
"source": "CHT&JPN",
|
| 944 |
-
"special": "檢索:魔法姊妹露露特莉莉"
|
| 945 |
-
},
|
| 946 |
-
"pred": {
|
| 947 |
-
"episode": 7,
|
| 948 |
-
"group": "Billion Meta Lab",
|
| 949 |
-
"resolution": "1080P",
|
| 950 |
-
"source": "CHT&JPN",
|
| 951 |
-
"special": "檢索:魔法姊妹露露特莉莉",
|
| 952 |
-
"title": "魔法姊妹露露莉莉 Mahou no Shimai Rurutto Riryi"
|
| 953 |
-
}
|
| 954 |
-
},
|
| 955 |
-
{
|
| 956 |
-
"id": "studio_greentea_s2_bracket_episode",
|
| 957 |
-
"filename": "[Studio GreenTea] Otonari no Tenshi-sama ni Itsunomanika Dame Ningen ni Sareteita Ken S2 [06][WebRip][HEVC-10bit 1080p AAC][JPSC].mp4",
|
| 958 |
-
"ok": true,
|
| 959 |
-
"errors": {},
|
| 960 |
-
"expected": {
|
| 961 |
-
"group": "Studio GreenTea",
|
| 962 |
-
"title": "Otonari no Tenshi-sama ni Itsunomanika Dame Ningen ni Sareteita Ken",
|
| 963 |
-
"season": 2,
|
| 964 |
-
"episode": 6,
|
| 965 |
-
"resolution": "1080p",
|
| 966 |
-
"source": "WebRip"
|
| 967 |
-
},
|
| 968 |
-
"pred": {
|
| 969 |
-
"episode": 6,
|
| 970 |
-
"group": "Studio GreenTea",
|
| 971 |
-
"resolution": "1080p",
|
| 972 |
-
"season": 2,
|
| 973 |
-
"source": "WebRip",
|
| 974 |
-
"title": "Otonari no Tenshi-sama ni Itsunomanika Dame Ningen ni Sareteita Ken"
|
| 975 |
-
}
|
| 976 |
-
},
|
| 977 |
-
{
|
| 978 |
-
"id": "lolihouse_kakuriyo_bare_ni_season",
|
| 979 |
-
"filename": "[LoliHouse] Kakuriyo no Yadomeshi Ni - 12 [WebRip 1080p HEVC-10bit AAC SRTx2].mkv",
|
| 980 |
-
"ok": true,
|
| 981 |
-
"errors": {},
|
| 982 |
-
"expected": {
|
| 983 |
-
"group": "LoliHouse",
|
| 984 |
-
"title": "Kakuriyo no Yadomeshi",
|
| 985 |
-
"season": 2,
|
| 986 |
-
"episode": 12,
|
| 987 |
-
"resolution": "1080p",
|
| 988 |
-
"source": "WebRip"
|
| 989 |
-
},
|
| 990 |
-
"pred": {
|
| 991 |
-
"episode": 12,
|
| 992 |
-
"group": "LoliHouse",
|
| 993 |
-
"resolution": "1080p",
|
| 994 |
-
"season": 2,
|
| 995 |
-
"source": "WebRip",
|
| 996 |
-
"title": "Kakuriyo no Yadomeshi"
|
| 997 |
-
}
|
| 998 |
-
},
|
| 999 |
-
{
|
| 1000 |
-
"id": "ani_kakuriyo_traditional_ni",
|
| 1001 |
-
"filename": "[ANi] 妖怪旅館營業中 貳 - 11 [1080P][Baha][WEB-DL][AAC AVC][CHT].mp4",
|
| 1002 |
-
"ok": true,
|
| 1003 |
-
"errors": {},
|
| 1004 |
-
"expected": {
|
| 1005 |
-
"group": "ANi",
|
| 1006 |
-
"title": "妖怪旅館營業中",
|
| 1007 |
-
"season": 2,
|
| 1008 |
-
"episode": 11,
|
| 1009 |
-
"resolution": "1080P",
|
| 1010 |
-
"source": "Baha"
|
| 1011 |
-
},
|
| 1012 |
-
"pred": {
|
| 1013 |
-
"episode": 11,
|
| 1014 |
-
"group": "ANi",
|
| 1015 |
-
"resolution": "1080P",
|
| 1016 |
-
"season": 2,
|
| 1017 |
-
"source": "Baha",
|
| 1018 |
-
"title": "妖怪旅館營業中"
|
| 1019 |
-
}
|
| 1020 |
-
},
|
| 1021 |
-
{
|
| 1022 |
-
"id": "jibaketa_shokugeki_ni_no_sara",
|
| 1023 |
-
"filename": "[jibaketa]Shokugeki no Souma Ni no Sara - 13 END [BD 1920x1080 x264 AACx2 SRT TVB CHT].mkv",
|
| 1024 |
-
"ok": true,
|
| 1025 |
-
"errors": {},
|
| 1026 |
-
"expected": {
|
| 1027 |
-
"group": "jibaketa",
|
| 1028 |
-
"title": "Shokugeki no Souma",
|
| 1029 |
-
"season": 2,
|
| 1030 |
-
"episode": 13,
|
| 1031 |
-
"resolution": "1920x1080"
|
| 1032 |
-
},
|
| 1033 |
-
"pred": {
|
| 1034 |
-
"episode": 13,
|
| 1035 |
-
"group": "jibaketa",
|
| 1036 |
-
"resolution": "1920x1080",
|
| 1037 |
-
"season": 2,
|
| 1038 |
-
"title": "Shokugeki no Souma"
|
| 1039 |
-
}
|
| 1040 |
-
},
|
| 1041 |
-
{
|
| 1042 |
-
"id": "ai_raws_fire_force_cjk_season_hash_episode",
|
| 1043 |
-
"filename": "[AI-Raws] 炎炎の消防隊 弐ノ章 #13 (BD HEVC 1920x1080 yuv444p10le FLAC)[FC74A2D5].mkv",
|
| 1044 |
-
"ok": true,
|
| 1045 |
-
"errors": {},
|
| 1046 |
-
"expected": {
|
| 1047 |
-
"group": "AI-Raws",
|
| 1048 |
-
"title": "炎炎の消防隊",
|
| 1049 |
-
"season": 2,
|
| 1050 |
-
"episode": 13,
|
| 1051 |
-
"resolution": "1920x1080"
|
| 1052 |
-
},
|
| 1053 |
-
"pred": {
|
| 1054 |
-
"episode": 13,
|
| 1055 |
-
"group": "AI-Raws",
|
| 1056 |
-
"resolution": "1920x1080",
|
| 1057 |
-
"season": 2,
|
| 1058 |
-
"title": "炎炎の消防隊"
|
| 1059 |
-
}
|
| 1060 |
-
},
|
| 1061 |
-
{
|
| 1062 |
-
"id": "gm_team_guoman_bilingual_s2",
|
| 1063 |
-
"filename": "[GM-Team][国漫][逆天邪神 第2季][Against the Gods Ⅱ][2026][04][HEVC][GB][4K].mp4",
|
| 1064 |
-
"ok": true,
|
| 1065 |
-
"errors": {},
|
| 1066 |
-
"expected": {
|
| 1067 |
-
"group": "GM-Team",
|
| 1068 |
-
"title": "逆天邪神",
|
| 1069 |
-
"season": 2,
|
| 1070 |
-
"episode": 4,
|
| 1071 |
-
"resolution": "4K",
|
| 1072 |
-
"source": "GB"
|
| 1073 |
-
},
|
| 1074 |
-
"pred": {
|
| 1075 |
-
"episode": 4,
|
| 1076 |
-
"group": "GM-Team",
|
| 1077 |
-
"resolution": "4K",
|
| 1078 |
-
"season": 2,
|
| 1079 |
-
"source": "GB",
|
| 1080 |
-
"title": "逆天邪神"
|
| 1081 |
-
}
|
| 1082 |
-
},
|
| 1083 |
-
{
|
| 1084 |
-
"id": "vcb_special_iv_not_episode",
|
| 1085 |
-
"filename": "[YYDM&VCB-Studio] Shinsekai Yori [IV05][Ma10p_1080p][x265_aac].mkv",
|
| 1086 |
-
"ok": true,
|
| 1087 |
-
"errors": {},
|
| 1088 |
-
"expected": {
|
| 1089 |
-
"group": "YYDM&VCB-Studio",
|
| 1090 |
-
"title": "Shinsekai Yori",
|
| 1091 |
-
"episode": null,
|
| 1092 |
-
"resolution": "1080p",
|
| 1093 |
-
"source": "x265_aac",
|
| 1094 |
-
"special": "IV05"
|
| 1095 |
-
},
|
| 1096 |
-
"pred": {
|
| 1097 |
-
"episode": null,
|
| 1098 |
-
"group": "YYDM&VCB-Studio",
|
| 1099 |
-
"resolution": "1080p",
|
| 1100 |
-
"source": "x265-aac",
|
| 1101 |
-
"special": "IV05",
|
| 1102 |
-
"title": "Shinsekai Yori"
|
| 1103 |
-
}
|
| 1104 |
-
},
|
| 1105 |
-
{
|
| 1106 |
-
"id": "vcb_nced_not_episode",
|
| 1107 |
-
"filename": "[YYDM&VCB-Studio] Shinsekai Yori [NCED02][Ma10p_1080p][x265_flac].mkv",
|
| 1108 |
-
"ok": true,
|
| 1109 |
-
"errors": {},
|
| 1110 |
-
"expected": {
|
| 1111 |
-
"group": "YYDM&VCB-Studio",
|
| 1112 |
-
"title": "Shinsekai Yori",
|
| 1113 |
-
"episode": null,
|
| 1114 |
-
"resolution": "1080p",
|
| 1115 |
-
"source": "x265_flac",
|
| 1116 |
-
"special": "NCED02"
|
| 1117 |
-
},
|
| 1118 |
-
"pred": {
|
| 1119 |
-
"episode": null,
|
| 1120 |
-
"group": "YYDM&VCB-Studio",
|
| 1121 |
-
"resolution": "1080p",
|
| 1122 |
-
"source": "x265-flac",
|
| 1123 |
-
"special": "NCED02",
|
| 1124 |
-
"title": "Shinsekai Yori"
|
| 1125 |
-
}
|
| 1126 |
-
},
|
| 1127 |
-
{
|
| 1128 |
-
"id": "dot_nced_suffix_not_episode",
|
| 1129 |
-
"filename": "InuYasha.2000.NCED02.BDrip.AV1.10Bit.DTS.1080p-CalChi",
|
| 1130 |
-
"ok": true,
|
| 1131 |
-
"errors": {},
|
| 1132 |
-
"expected": {
|
| 1133 |
-
"title": "InuYasha",
|
| 1134 |
-
"episode": null,
|
| 1135 |
-
"resolution": "1080p",
|
| 1136 |
-
"source": "BDrip",
|
| 1137 |
-
"special": "NCED02"
|
| 1138 |
-
},
|
| 1139 |
-
"pred": {
|
| 1140 |
-
"episode": null,
|
| 1141 |
-
"resolution": "1080p",
|
| 1142 |
-
"source": "BDrip",
|
| 1143 |
-
"special": "NCED02",
|
| 1144 |
-
"title": "InuYasha"
|
| 1145 |
-
}
|
| 1146 |
-
},
|
| 1147 |
-
{
|
| 1148 |
-
"id": "vcb_numeric_title_nced",
|
| 1149 |
-
"filename": "[VCB-Studio] Yamada-kun to 7-nin no Majo [NCED][Ma10p_1080p][x265_flac]",
|
| 1150 |
-
"ok": true,
|
| 1151 |
-
"errors": {},
|
| 1152 |
-
"expected": {
|
| 1153 |
-
"group": "VCB-Studio",
|
| 1154 |
-
"title": "Yamada-kun to 7-nin no Majo",
|
| 1155 |
-
"episode": null,
|
| 1156 |
-
"resolution": "1080p",
|
| 1157 |
-
"source": "x265_flac",
|
| 1158 |
-
"special": "NCED"
|
| 1159 |
-
},
|
| 1160 |
-
"pred": {
|
| 1161 |
-
"episode": null,
|
| 1162 |
-
"group": "VCB-Studio",
|
| 1163 |
-
"resolution": "1080p",
|
| 1164 |
-
"source": "x265-flac",
|
| 1165 |
-
"special": "NCED",
|
| 1166 |
-
"title": "Yamada-kun to 7-nin no Majo"
|
| 1167 |
-
}
|
| 1168 |
-
}
|
| 1169 |
-
]
|
| 1170 |
-
},
|
| 1171 |
-
"rule_assisted": {
|
| 1172 |
-
"model_dir": ".",
|
| 1173 |
-
"case_file": "data/parser_regression_cases.json",
|
| 1174 |
-
"tokenizer_variant": "char",
|
| 1175 |
-
"max_length": 128,
|
| 1176 |
-
"use_rules": true,
|
| 1177 |
"constrain_bio": true,
|
| 1178 |
"case_count": 26,
|
| 1179 |
"full_correct": 26,
|
|
|
|
| 6 |
"case_file": "data/parser_regression_cases.json",
|
| 7 |
"tokenizer_variant": "char",
|
| 8 |
"max_length": 128,
|
|
|
|
| 9 |
"constrain_bio": false,
|
| 10 |
"case_count": 26,
|
| 11 |
"full_correct": 25,
|
|
|
|
| 605 |
"case_file": "data/parser_regression_cases.json",
|
| 606 |
"tokenizer_variant": "char",
|
| 607 |
"max_length": 128,
|
| 608 |
"constrain_bio": true,
|
| 609 |
"case_count": 26,
|
| 610 |
"full_correct": 26,
|
diagnose_pipeline.py
CHANGED
|
@@ -364,9 +364,7 @@ def evaluate_model(
|
|
| 364 |
entity_confusion: Counter = Counter()
|
| 365 |
boundary_errors: Counter = Counter()
|
| 366 |
parse_metrics: Counter = Counter()
|
| 367 |
-
parse_metrics_no_rules: Counter = Counter()
|
| 368 |
field_failures: List[dict] = []
|
| 369 |
-
field_failures_no_rules: List[dict] = []
|
| 370 |
|
| 371 |
with torch.no_grad():
|
| 372 |
for sample in eval_samples:
|
|
@@ -410,32 +408,13 @@ def evaluate_model(
|
|
| 410 |
active_tokens,
|
| 411 |
true_labels,
|
| 412 |
tokenizer=tokenizer,
|
| 413 |
-
filename=sample.get("filename"),
|
| 414 |
-
use_rules=True,
|
| 415 |
)
|
| 416 |
pred_parse = postprocess(
|
| 417 |
active_tokens,
|
| 418 |
pred_labels,
|
| 419 |
tokenizer=tokenizer,
|
| 420 |
-
filename=sample.get("filename"),
|
| 421 |
-
use_rules=True,
|
| 422 |
-
)
|
| 423 |
-
gold_parse_no_rules = postprocess(
|
| 424 |
-
active_tokens,
|
| 425 |
-
true_labels,
|
| 426 |
-
tokenizer=tokenizer,
|
| 427 |
-
filename=sample.get("filename"),
|
| 428 |
-
use_rules=False,
|
| 429 |
-
)
|
| 430 |
-
pred_parse_no_rules = postprocess(
|
| 431 |
-
active_tokens,
|
| 432 |
-
pred_labels,
|
| 433 |
-
tokenizer=tokenizer,
|
| 434 |
-
filename=sample.get("filename"),
|
| 435 |
-
use_rules=False,
|
| 436 |
)
|
| 437 |
update_parse_metrics(parse_metrics, gold_parse, pred_parse)
|
| 438 |
-
update_parse_metrics(parse_metrics_no_rules, gold_parse_no_rules, pred_parse_no_rules)
|
| 439 |
failures = collect_field_failures(gold_parse, pred_parse)
|
| 440 |
if failures and len(field_failures) < 30:
|
| 441 |
field_failures.append(
|
|
@@ -446,16 +425,6 @@ def evaluate_model(
|
|
| 446 |
"pred": pred_parse,
|
| 447 |
}
|
| 448 |
)
|
| 449 |
-
failures_no_rules = collect_field_failures(gold_parse_no_rules, pred_parse_no_rules)
|
| 450 |
-
if failures_no_rules and len(field_failures_no_rules) < 30:
|
| 451 |
-
field_failures_no_rules.append(
|
| 452 |
-
{
|
| 453 |
-
"filename": sample.get("filename"),
|
| 454 |
-
"errors": failures_no_rules,
|
| 455 |
-
"gold": gold_parse_no_rules,
|
| 456 |
-
"pred": pred_parse_no_rules,
|
| 457 |
-
}
|
| 458 |
-
)
|
| 459 |
|
| 460 |
errors = confusion.copy()
|
| 461 |
for label in set(label for pair in confusion for label in pair):
|
|
@@ -473,9 +442,7 @@ def evaluate_model(
|
|
| 473 |
).most_common(30),
|
| 474 |
"boundary_errors": boundary_errors,
|
| 475 |
"parse_metrics": parse_metrics,
|
| 476 |
-
"parse_metrics_no_rules": parse_metrics_no_rules,
|
| 477 |
"field_failures": field_failures,
|
| 478 |
-
"field_failures_no_rules": field_failures_no_rules,
|
| 479 |
}
|
| 480 |
|
| 481 |
|
|
@@ -811,8 +778,7 @@ def main() -> None:
|
|
| 811 |
]
|
| 812 |
return field_rows, full_line, error_rows
|
| 813 |
|
| 814 |
-
|
| 815 |
-
ner_field_rows, ner_full_line, ner_error_rows = parse_metric_tables(model_eval["parse_metrics_no_rules"])
|
| 816 |
sections.append(
|
| 817 |
(
|
| 818 |
"Model Confusion Analysis",
|
|
@@ -832,28 +798,17 @@ def main() -> None:
|
|
| 832 |
"### Top entity-type confusions",
|
| 833 |
markdown_table(["true", "pred", "count"], entity_rows) if entity_rows else "- none",
|
| 834 |
"",
|
| 835 |
-
"### Field exact-match accuracy (
|
| 836 |
-
markdown_table(["field", "correct/total", "accuracy"],
|
| 837 |
"",
|
| 838 |
-
f"
|
| 839 |
"",
|
| 840 |
-
"### Top
|
| 841 |
-
markdown_table(["field", "gold", "pred", "count"],
|
| 842 |
"",
|
| 843 |
-
"###
|
| 844 |
-
markdown_table(["field", "correct/total", "accuracy"], ner_field_rows),
|
| 845 |
-
"",
|
| 846 |
-
f"NER-only full parse exact match: {ner_full_line}",
|
| 847 |
-
"",
|
| 848 |
-
"### Top NER-only field parse errors",
|
| 849 |
-
markdown_table(["field", "gold", "pred", "count"], ner_error_rows) if ner_error_rows else "- none",
|
| 850 |
-
"",
|
| 851 |
-
"### Hardest sampled parse failures (rule-assisted)",
|
| 852 |
markdown_json(model_eval["field_failures"][:10]) if model_eval["field_failures"] else "- none",
|
| 853 |
"",
|
| 854 |
-
"### Hardest sampled parse failures (NER-only)",
|
| 855 |
-
markdown_json(model_eval["field_failures_no_rules"][:10]) if model_eval["field_failures_no_rules"] else "- none",
|
| 856 |
-
"",
|
| 857 |
"### Seqeval report",
|
| 858 |
"```text\n" + model_eval["classification_report"] + "\n```",
|
| 859 |
]
|
|
@@ -870,7 +825,7 @@ def main() -> None:
|
|
| 870 |
"2. Prefer char-level or a deterministic hybrid tokenizer for DMHY filenames; avoid generic subword tokenization for labels.",
|
| 871 |
"3. For char-level runs, use `--tokenizer char --max-seq-length 128` with `vocab.char.json`.",
|
| 872 |
"4. Add CRF decoding or constrained BIO decoding so illegal I-X transitions and impossible boundary jumps are blocked.",
|
| 873 |
-
"5. Keep
|
| 874 |
"6. Track entity-level F1 and field exact-match on real filenames; do not accept low validation loss alone.",
|
| 875 |
]
|
| 876 |
),
|
|
|
|
| 364 |
entity_confusion: Counter = Counter()
|
| 365 |
boundary_errors: Counter = Counter()
|
| 366 |
parse_metrics: Counter = Counter()
|
|
|
|
| 367 |
field_failures: List[dict] = []
|
|
|
|
| 368 |
|
| 369 |
with torch.no_grad():
|
| 370 |
for sample in eval_samples:
|
|
|
|
| 408 |
active_tokens,
|
| 409 |
true_labels,
|
| 410 |
tokenizer=tokenizer,
|
|
|
|
|
|
|
| 411 |
)
|
| 412 |
pred_parse = postprocess(
|
| 413 |
active_tokens,
|
| 414 |
pred_labels,
|
| 415 |
tokenizer=tokenizer,
|
| 416 |
)
|
| 417 |
update_parse_metrics(parse_metrics, gold_parse, pred_parse)
|
|
|
|
| 418 |
failures = collect_field_failures(gold_parse, pred_parse)
|
| 419 |
if failures and len(field_failures) < 30:
|
| 420 |
field_failures.append(
|
|
|
|
| 425 |
"pred": pred_parse,
|
| 426 |
}
|
| 427 |
)
| 428 |
|
| 429 |
errors = confusion.copy()
|
| 430 |
for label in set(label for pair in confusion for label in pair):
|
|
|
|
| 442 |
).most_common(30),
|
| 443 |
"boundary_errors": boundary_errors,
|
| 444 |
"parse_metrics": parse_metrics,
|
|
|
|
| 445 |
"field_failures": field_failures,
|
|
|
|
| 446 |
}
|
| 447 |
|
| 448 |
|
|
|
|
| 778 |
]
|
| 779 |
return field_rows, full_line, error_rows
|
| 780 |
|
| 781 |
+
parse_field_rows, parse_full_line, parse_error_rows = parse_metric_tables(model_eval["parse_metrics"])
|
|
|
|
| 782 |
sections.append(
|
| 783 |
(
|
| 784 |
"Model Confusion Analysis",
|
|
|
|
| 798 |
"### Top entity-type confusions",
|
| 799 |
markdown_table(["true", "pred", "count"], entity_rows) if entity_rows else "- none",
|
| 800 |
"",
|
| 801 |
+
"### Field exact-match accuracy (thin runtime)",
|
| 802 |
+
markdown_table(["field", "correct/total", "accuracy"], parse_field_rows),
|
| 803 |
"",
|
| 804 |
+
f"Thin-runtime full parse exact match: {parse_full_line}",
|
| 805 |
"",
|
| 806 |
+
"### Top thin-runtime field parse errors",
|
| 807 |
+
markdown_table(["field", "gold", "pred", "count"], parse_error_rows) if parse_error_rows else "- none",
|
| 808 |
"",
|
| 809 |
+
"### Hardest sampled parse failures",
|
| 810 |
markdown_json(model_eval["field_failures"][:10]) if model_eval["field_failures"] else "- none",
|
| 811 |
"",
|
|
|
|
|
|
|
|
|
|
| 812 |
"### Seqeval report",
|
| 813 |
"```text\n" + model_eval["classification_report"] + "\n```",
|
| 814 |
]
|
|
|
|
| 825 |
"2. Prefer char-level or a deterministic hybrid tokenizer for DMHY filenames; avoid generic subword tokenization for labels.",
|
| 826 |
"3. For char-level runs, use `--tokenizer char --max-seq-length 128` with `vocab.char.json`.",
|
| 827 |
"4. Add CRF decoding or constrained BIO decoding so illegal I-X transitions and impossible boundary jumps are blocked.",
|
| 828 |
+
"5. Keep runtime post-processing thin: BIO aggregation plus string/number normalization.",
|
| 829 |
"6. Track entity-level F1 and field exact-match on real filenames; do not accept low validation loss alone.",
|
| 830 |
]
|
| 831 |
),
|
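Recommendation 4 in the diagnose_pipeline.py diff mentions constrained BIO decoding that blocks illegal I-X transitions. The sketch below illustrates the idea with a greedy left-to-right pass. It is not the repository's decoder, which this diff does not show: the function name `constrained_bio_decode` and the label names are hypothetical, and it assumes the label set always contains `O`.

```python
from typing import List, Sequence


def constrained_bio_decode(
    scores: Sequence[Sequence[float]],
    labels: Sequence[str],
) -> List[str]:
    """Greedy decoding that skips I-X labels not preceded by B-X or I-X.

    Hypothetical helper: `scores` holds one row of per-label scores per token,
    aligned with `labels`. Assumes "O" is present so a legal label always exists.
    """
    decoded: List[str] = []
    prev = "O"
    for row in scores:
        best_label = "O"
        best_score = float("-inf")
        for label, score in zip(labels, row):
            if label.startswith("I-"):
                entity = label[2:]
                # I-X is only legal directly after B-X or I-X of the same type.
                if prev not in (f"B-{entity}", f"I-{entity}"):
                    continue
            if score > best_score:
                best_label, best_score = label, score
        decoded.append(best_label)
        prev = best_label
    return decoded
```

A greedy pass is the simplest variant. A Viterbi search over the same transition mask would pick the globally best legal sequence instead, at the cost of more code.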
docs/onnx.md
CHANGED
|
@@ -107,15 +107,14 @@ The runtime parser should do this:
|
|
| 107 |
使用约束 BIO transition 解码标签。
|
| 108 |
8. Aggregate labels into parser fields.
|
| 109 |
聚合标签为结构化字段。
|
| 110 |
-
9. Apply thin normalization only: trim brackets
|
| 111 |
-
fields.
|
| 112 |
只做薄层规范化:裁剪括号/扩展名并转换数字字段。
|
| 113 |
|
| 114 |
-
The
|
| 115 |
-
|
| 116 |
|
| 117 |
-
|
| 118 |
-
参考运行时。
|
| 119 |
|
| 120 |
## 5. Android Notes / Android 注意事项
|
| 121 |
|
|
|
|
| 107 |
使用约束 BIO transition 解码标签。
|
| 108 |
8. Aggregate labels into parser fields.
|
| 109 |
聚合标签为结构化字段。
|
| 110 |
+
9. Apply thin normalization only: trim brackets, normalize source text, and
|
| 111 |
+
convert numeric fields.
|
| 112 |
只做薄层规范化:裁剪括号/扩展名并转换数字字段。
|
| 113 |
|
| 114 |
+
The ONNX reference runtime intentionally matches the Python thin runtime. It
|
| 115 |
+
does not include structural filename regex assists.
|
| 116 |
|
| 117 |
+
ONNX 参考运行时有意与 Python 薄层运行时保持一致,不包含结构化文件名正则辅助。
|
|
|
|
| 118 |
|
| 119 |
## 5. Android Notes / Android 注意事项
|
| 120 |
|
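Steps 8 and 9 of the docs/onnx.md list (aggregate labels into fields, then apply thin normalization only) can be sketched as follows. This is a minimal illustration, not the code in inference.py: the helper names `aggregate_entities` and `thin_normalize` are hypothetical, the entity names `GROUP`, `TITLE`, `EPISODE`, and `RESOLUTION` are assumed, and tokens are joined without a separator, which suits char-level tokens only.

```python
from typing import Dict, List, Optional


def aggregate_entities(tokens: List[str], labels: List[str]) -> Dict[str, List[str]]:
    """Group BIO-labeled tokens into entity strings keyed by entity type."""
    grouped: Dict[str, List[str]] = {}
    current_type: Optional[str] = None
    current: List[str] = []
    # The trailing ("", "O") sentinel flushes an entity that ends the sequence.
    for token, label in list(zip(tokens, labels)) + [("", "O")]:
        if label.startswith("I-") and label[2:] == current_type:
            current.append(token)
            continue
        if current_type is not None:
            grouped.setdefault(current_type, []).append("".join(current))
        if label.startswith("B-"):
            current_type, current = label[2:], [token]
        else:
            current_type, current = None, []
    return grouped


def thin_normalize(grouped: Dict[str, List[str]]) -> Dict[str, object]:
    """Trim brackets and convert numeric fields; no filename regex is consulted."""

    def first(name: str) -> Optional[str]:
        values = grouped.get(name, [])
        return values[0].strip(" []()【】") if values else None

    def to_int(text: Optional[str]) -> Optional[int]:
        return int(text) if text is not None and text.isdigit() else None

    return {
        "group": first("GROUP"),
        "title": first("TITLE"),
        "episode": to_int(first("EPISODE")),
        "resolution": first("RESOLUTION"),
    }
```

Both helpers see only tokens and labels, never the raw filename, so a wrong label stays visible in the output instead of being patched by a structural rule.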
docs/training.md
CHANGED
|
@@ -172,12 +172,12 @@ The default quality gate is model-led parsing:
|
|
| 172 |
- fixed regression `model_only >= 85%`
|
| 173 |
- held-out parse `model_only >= 75%`
|
| 174 |
- `normalized_only` is the default thin runtime metric
|
| 175 |
-
-
|
| 176 |
|
| 177 |
- 固定回归 `model_only >= 85%`
|
| 178 |
- held-out 解析 `model_only >= 75%`
|
| 179 |
- `normalized_only` 是默认薄层运行时指标
|
| 180 |
-
-
|
| 181 |
|
| 182 |
## 7. Publish to Repository Root / 发布到仓库根目录
|
| 183 |
|
|
|
|
| 172 |
- fixed regression `model_only >= 85%`
|
| 173 |
- held-out parse `model_only >= 75%`
|
| 174 |
- `normalized_only` is the default thin runtime metric
|
| 175 |
+
- structural filename assists are not part of training or release metrics
|
| 176 |
|
| 177 |
- 固定回归 `model_only >= 85%`
|
| 178 |
- held-out 解析 `model_only >= 75%`
|
| 179 |
- `normalized_only` 是默认薄层运行时指标
|
| 180 |
+
- 结构化文件名辅助不属于训练或发布指标
|
| 181 |
|
| 182 |
## 7. Publish to Repository Root / 发布到仓库根目录
|
| 183 |
|
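The quality gate listed in docs/training.md can be expressed as a small check. The sketch below is an assumption-laden illustration: it treats the thresholds as full exact-match rates computed from the `full_correct` and `case_count` keys that appear in the metrics JSON, and the function names `full_match_rate` and `passes_quality_gate` are hypothetical.

```python
from typing import Mapping


def full_match_rate(metrics: Mapping[str, int]) -> float:
    """Fraction of cases parsed fully correctly (0.0 when there are no cases)."""
    total = metrics["case_count"]
    return metrics["full_correct"] / total if total else 0.0


def passes_quality_gate(
    fixed_regression_model_only: Mapping[str, int],
    held_out_model_only: Mapping[str, int],
) -> bool:
    """Apply the documented thresholds: fixed regression >= 85%, held-out >= 75%."""
    return (
        full_match_rate(fixed_regression_model_only) >= 0.85
        and full_match_rate(held_out_model_only) >= 0.75
    )
```

Only `model_only` results feed the gate here, matching the two thresholds above; `normalized_only` is described as the default runtime metric rather than as a gate.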
evaluate_parser_cases.py
CHANGED
|
@@ -43,7 +43,6 @@ def evaluate_cases(
|
|
| 43 |
case_file: str,
|
| 44 |
tokenizer_variant: Optional[str],
|
| 45 |
max_length: Optional[int],
|
| 46 |
-
use_rules: bool,
|
| 47 |
constrain_bio: bool,
|
| 48 |
) -> Dict:
|
| 49 |
cfg = Config()
|
|
@@ -71,7 +70,6 @@ def evaluate_cases(
|
|
| 71 |
id2label,
|
| 72 |
max_length=resolved_max_length,
|
| 73 |
debug=False,
|
| 74 |
-
use_rules=use_rules,
|
| 75 |
constrain_bio=constrain_bio,
|
| 76 |
)
|
| 77 |
errors = {}
|
|
@@ -108,7 +106,6 @@ def evaluate_cases(
|
|
| 108 |
"case_file": case_file,
|
| 109 |
"tokenizer_variant": getattr(tokenizer, "tokenizer_variant", "regex"),
|
| 110 |
"max_length": resolved_max_length,
|
| 111 |
-
"use_rules": use_rules,
|
| 112 |
"constrain_bio": constrain_bio,
|
| 113 |
"case_count": len(cases),
|
| 114 |
"full_correct": full_correct,
|
|
@@ -128,9 +125,8 @@ def evaluate_case_modes(
|
|
| 128 |
max_length: Optional[int],
|
| 129 |
) -> Dict:
|
| 130 |
modes = {
|
| 131 |
-
"model_only": {"
|
| 132 |
-
"normalized_only": {"
|
| 133 |
-
"rule_assisted": {"use_rules": True, "constrain_bio": True},
|
| 134 |
}
|
| 135 |
results = {
|
| 136 |
name: evaluate_cases(
|
|
@@ -138,7 +134,6 @@ def evaluate_case_modes(
|
|
| 138 |
case_file=case_file,
|
| 139 |
tokenizer_variant=tokenizer_variant,
|
| 140 |
max_length=max_length,
|
| 141 |
-
use_rules=settings["use_rules"],
|
| 142 |
constrain_bio=settings["constrain_bio"],
|
| 143 |
)
|
| 144 |
for name, settings in modes.items()
|
|
@@ -170,17 +165,10 @@ def main() -> None:
|
|
| 170 |
parser.add_argument("--tokenizer", choices=["regex", "char"], default=None)
|
| 171 |
parser.add_argument("--max-length", type=int, default=None)
|
| 172 |
parser.add_argument("--output", default=None, help="Optional JSON output path")
|
| 173 |
-
parser.add_argument("--mode", choices=["all", "model-only", "normalized-only"
|
| 174 |
-
parser.add_argument("--rule-assist", action="store_true", help="Shortcut for --mode rule-assisted")
|
| 175 |
-
parser.add_argument("--no-rule-assist", action="store_true", help=argparse.SUPPRESS)
|
| 176 |
parser.add_argument("--no-constrained-bio", action="store_true")
|
| 177 |
args = parser.parse_args()
|
| 178 |
|
| 179 |
-
if args.rule_assist:
|
| 180 |
-
args.mode = "rule-assisted"
|
| 181 |
-
if args.no_rule_assist and args.mode == "rule-assisted":
|
| 182 |
-
args.mode = "normalized-only"
|
| 183 |
-
|
| 184 |
if args.mode == "all" and not args.no_constrained_bio:
|
| 185 |
metrics = evaluate_case_modes(
|
| 186 |
model_dir=args.model_dir,
|
|
@@ -188,18 +176,16 @@ def main() -> None:
|
|
| 188 |
tokenizer_variant=args.tokenizer,
|
| 189 |
max_length=args.max_length,
|
| 190 |
)
|
| 191 |
-
for name in ("model_only", "normalized_only"
|
| 192 |
print_metrics(name, metrics["modes"][name])
|
| 193 |
print()
|
| 194 |
else:
|
| 195 |
-
use_rules = args.mode == "rule-assisted"
|
| 196 |
constrain_bio = not args.no_constrained_bio and args.mode != "model-only"
|
| 197 |
metrics = evaluate_cases(
|
| 198 |
model_dir=args.model_dir,
|
| 199 |
case_file=args.case_file,
|
| 200 |
tokenizer_variant=args.tokenizer,
|
| 201 |
max_length=args.max_length,
|
| 202 |
-
use_rules=use_rules,
|
| 203 |
constrain_bio=constrain_bio,
|
| 204 |
)
|
| 205 |
print_metrics(args.mode, metrics)
|
|
|
|
| 43 |
case_file: str,
|
| 44 |
tokenizer_variant: Optional[str],
|
| 45 |
max_length: Optional[int],
|
|
|
|
| 46 |
constrain_bio: bool,
|
| 47 |
) -> Dict:
|
| 48 |
cfg = Config()
|
|
|
|
| 70 |
id2label,
|
| 71 |
max_length=resolved_max_length,
|
| 72 |
debug=False,
|
|
|
|
| 73 |
constrain_bio=constrain_bio,
|
| 74 |
)
|
| 75 |
errors = {}
|
|
|
|
| 106 |
"case_file": case_file,
|
| 107 |
"tokenizer_variant": getattr(tokenizer, "tokenizer_variant", "regex"),
|
| 108 |
"max_length": resolved_max_length,
|
|
|
|
| 109 |
"constrain_bio": constrain_bio,
|
| 110 |
"case_count": len(cases),
|
| 111 |
"full_correct": full_correct,
|
|
|
|
| 125 |
max_length: Optional[int],
|
| 126 |
) -> Dict:
|
| 127 |
modes = {
|
| 128 |
+
"model_only": {"constrain_bio": False},
|
| 129 |
+
"normalized_only": {"constrain_bio": True},
|
|
|
|
| 130 |
}
|
| 131 |
results = {
|
| 132 |
name: evaluate_cases(
|
|
|
|
| 134 |
case_file=case_file,
|
| 135 |
tokenizer_variant=tokenizer_variant,
|
| 136 |
max_length=max_length,
|
|
|
|
| 137 |
constrain_bio=settings["constrain_bio"],
|
| 138 |
)
|
| 139 |
for name, settings in modes.items()
|
|
|
|
| 165 |
parser.add_argument("--tokenizer", choices=["regex", "char"], default=None)
|
| 166 |
parser.add_argument("--max-length", type=int, default=None)
|
| 167 |
parser.add_argument("--output", default=None, help="Optional JSON output path")
|
| 168 |
+
parser.add_argument("--mode", choices=["all", "model-only", "normalized-only"], default="all")
|
|
|
|
|
|
|
| 169 |
parser.add_argument("--no-constrained-bio", action="store_true")
|
| 170 |
args = parser.parse_args()
|
| 171 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 172 |
if args.mode == "all" and not args.no_constrained_bio:
|
| 173 |
metrics = evaluate_case_modes(
|
| 174 |
model_dir=args.model_dir,
|
|
|
|
| 176 |
tokenizer_variant=args.tokenizer,
|
| 177 |
max_length=args.max_length,
|
| 178 |
)
|
| 179 |
+
for name in ("model_only", "normalized_only"):
|
| 180 |
print_metrics(name, metrics["modes"][name])
|
| 181 |
print()
|
| 182 |
else:
|
|
|
|
| 183 |
constrain_bio = not args.no_constrained_bio and args.mode != "model-only"
|
| 184 |
metrics = evaluate_cases(
|
| 185 |
model_dir=args.model_dir,
|
| 186 |
case_file=args.case_file,
|
| 187 |
tokenizer_variant=args.tokenizer,
|
| 188 |
max_length=args.max_length,
|
|
|
|
| 189 |
constrain_bio=constrain_bio,
|
| 190 |
)
|
| 191 |
print_metrics(args.mode, metrics)
|
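After this change, `--mode` and `--no-constrained-bio` are the only switches that decide how evaluate_parser_cases.py runs. The sketch below restates that resolution logic in isolation. The `resolve_runs` helper is hypothetical and the parser is cut down to the two relevant flags; the conditions mirror the ones in the diff.

```python
import argparse
from typing import Dict, List


def build_parser() -> argparse.ArgumentParser:
    """Reduced parser holding only the two flags that select evaluation modes."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", choices=["all", "model-only", "normalized-only"], default="all")
    parser.add_argument("--no-constrained-bio", action="store_true")
    return parser


def resolve_runs(argv: List[str]) -> Dict[str, bool]:
    """Map each run name to its constrain_bio setting for the given CLI args."""
    args = build_parser().parse_args(argv)
    if args.mode == "all" and not args.no_constrained_bio:
        # Default: evaluate both modes side by side.
        return {"model_only": False, "normalized_only": True}
    # Single run: constrained BIO is off for model-only or when disabled explicitly.
    constrain_bio = not args.no_constrained_bio and args.mode != "model-only"
    return {args.mode: constrain_bio}
```

Note that `--mode all --no-constrained-bio` falls through to a single unconstrained run reported under the name `all`, which follows from the `else` branch in the diff.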
inference.py
CHANGED
|
@@ -11,7 +11,6 @@ Usage:
|
|
| 11 |
|
| 12 |
import argparse
|
| 13 |
import json
|
| 14 |
-
import os
|
| 15 |
import re
|
| 16 |
import sys
|
| 17 |
from typing import Dict, List, Optional, Tuple
|
|
@@ -98,6 +97,15 @@ def thin_source_priority(source: str) -> int:
|
|
| 98 |
return 40 if re.search(r"[&+/,]", source) else 30
|
| 99 |
|
| 100 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 101 |
def choose_thin_source(sources: List[str]) -> Optional[str]:
|
| 102 |
cleaned = [normalize_source_text(source) for source in sources if normalize_field_text(source)]
|
| 103 |
if not cleaned:
|
|
@@ -239,8 +247,6 @@ def postprocess(
|
|
| 239 |
tokens: List[str],
|
| 240 |
labels: List[str],
|
| 241 |
tokenizer: Optional[AnimeTokenizer] = None,
|
| 242 |
-
filename: Optional[str] = None,
|
| 243 |
-
use_rules: bool = False,
|
| 244 |
) -> Dict:
|
| 245 |
"""
|
| 246 |
Convert BIO-labeled tokens into structured metadata.
|
|
@@ -298,658 +304,9 @@ def postprocess(
|
|
| 298 |
|
| 299 |
result["source"] = choose_thin_source(grouped_entities.get("SOURCE", []))
|
| 300 |
|
| 301 |
-
if use_rules and filename:
|
| 302 |
-
result = apply_rule_assists(filename, result)
|
| 303 |
-
|
| 304 |
return result
|
| 305 |
|
| 306 |
|
| 307 |
-
BRACKET_RE = re.compile(r"\[([^\]]+)\]|\(([^)]+)\)|【([^】]+)】|《([^》]+)》")
|
| 308 |
-
RESOLUTION_RE = re.compile(r"(?<![A-Za-z0-9])(?:\d{3,4}[pP]|\d[Kk]|\d{3,4}[xX×]\d{3,4})(?![A-Za-z0-9])")
|
| 309 |
-
SOURCE_TOKEN_PATTERN = (
|
| 310 |
-
r"WEB[-_ ]?DL|WEB[-_ ]?Rip|BDRip|BluRay|BDMV|BD|DVDRip|DVD|TVRip|HDTV|"
|
| 311 |
-
r"Netflix|NF|AMZN|Baha|CR|ABEMA|DSNP|U[-_ ]?NEXT|Hulu|AT[-_ ]?X|"
|
| 312 |
-
r"x26[45]|h\.?26[45]|HEVC|AVC|AV1|AAC\d*(?:\.\d+)?|AAC|FLAC|MP3|DTS|Opus|"
|
| 313 |
-
r"SDR|HDR10?|UHD|REMUX|10bit|8bit|Hi10p|Ma10p|ASSx?\d*|SRTx?\d*|"
|
| 314 |
-
r"CHS|CHT|GB|BIG5|JPN?|JPSC|JPTC|繁中|简中"
|
| 315 |
-
)
|
| 316 |
-
SOURCE_RE = re.compile(rf"\b(?:{SOURCE_TOKEN_PATTERN})\b", re.I)
|
| 317 |
-
SOURCE_TAG_RE = re.compile(
|
| 318 |
-
rf"^(?:{SOURCE_TOKEN_PATTERN})(?:\s*(?:[&+/]|,\s*)\s*(?:{SOURCE_TOKEN_PATTERN}))*$",
|
| 319 |
-
re.I,
|
| 320 |
-
)
|
| 321 |
-
SPECIAL_TAG_RE = re.compile(
|
| 322 |
-
r"^(?:檢索|检索|搜索|搜寻|搜尋|别名|別名|alias|search|keyword)\s*[::].+",
|
| 323 |
-
re.I,
|
| 324 |
-
)
|
| 325 |
-
SPECIAL_CODE_RE = re.compile(
|
| 326 |
-
r"^(?:NCOP|NCED|OP|ED|PV|CM)\d*$|^IV\d+$|^(?:OVA|OAD|SP)\d*$",
|
| 327 |
-
re.I,
|
| 328 |
-
)
|
| 329 |
-
SPECIAL_CODE_INLINE_RE = re.compile(
|
| 330 |
-
r"(?<![A-Za-z0-9])"
|
| 331 |
-
r"(?P<code>(?:NCOP|NCED)(?:[\s._-]*\d{1,4})?|(?:OP|ED|PV|CM)\d{1,4}|IV\d{1,4})"
|
| 332 |
-
r"(?![A-Za-z0-9])",
|
| 333 |
-
re.I,
|
| 334 |
-
)
|
| 335 |
-
EPISODE_PATTERNS = [
|
| 336 |
-
("season_episode", re.compile(r"[Ss]\d{1,2}[Ee](?P<ep>\d{1,4})(?:v\d+)?", re.I)),
|
| 337 |
-
("dash_episode", re.compile(r"(?:^|[\s._])[-_]\s*(?P<ep>\d{1,4})(?:v\d+)?(?=$|[\s._\-\]\)】》\[])")),
|
| 338 |
-
("bracket_episode", re.compile(r"[\[\(【《](?:EP?|#)?(?P<ep>\d{1,4})(?:v\d+)?[\]\)】》]", re.I)),
|
| 339 |
-
("explicit_episode", re.compile(r"(?:^|[\s._\-\[\(【《#])(?:EP?|第|#)(?P<ep>\d{1,4})(?:v\d+)?(?:[话話集])?(?=$|[\s._\-\]\)】》])", re.I)),
|
| 340 |
-
(
|
| 341 |
-
"long_episode",
|
| 342 |
-
re.compile(
|
| 343 |
-
r"(?:^|[\s._\-\[\(【《])(?P<ep>\d{3,4})(?:v\d+)?"
|
| 344 |
-
r"(?=[\s._\-\]\)】》\[]+(?:\d{3,4}[pP]|WEB|BD|BluRay|HDTV|NF|AMZN|CR|Baha))",
|
| 345 |
-
re.I,
|
| 346 |
-
),
|
| 347 |
-
),
|
| 348 |
-
("generic_episode", re.compile(r"(?:^|[\s._\-\[\(【《#])(?P<ep>\d{1,3})(?:v\d+)?(?=$|[\s._\-\]\)】》])", re.I)),
|
| 349 |
-
]
|
| 350 |
-
SEASON_RE = re.compile(r"(?:^|[\s._\-\[\(【《])(?:[Ss](?P<s1>\d{1,2})|Season\s*(?P<s2>\d{1,2})|第(?P<s3>[一二三四五六七八九十\d]+)[季期部])", re.I)
|
| 351 |
-
SEQUEL_MARKER_RE = re.compile(
|
| 352 |
-
r"(?<![A-Za-z0-9])"
|
| 353 |
-
r"(?P<marker>"
|
| 354 |
-
r"Ni\s+no\s+(?:Sara|Shou|Sho|Syo|Shō)|"
|
| 355 |
-
r"San\s+no\s+(?:Sara|Shou|Sho|Syo)|"
|
| 356 |
-
r"(?:Yon|Shi|Shin)\s+no\s+Sara|"
|
| 357 |
-
r"(?:Go|Gou)\s+no\s+Sara|"
|
| 358 |
-
r"Ni\s+Gakki|Sono\s+Ni|Ni|"
|
| 359 |
-
r"II|III|IV|V|VI|VII|VIII|IX|[ⅡⅢⅣⅤⅥⅦⅧⅨ]|"
|
| 360 |
-
r"[一二三四五六七八九十兩两貳贰弐弍參叁参肆伍陸陆柒捌玖](?:\s*(?:ノ|の|之)\s*(?:章|期|季|部))?"
|
| 361 |
-
r")"
|
| 362 |
-
r"(?![A-Za-z0-9])",
|
| 363 |
-
re.I,
|
| 364 |
-
)
|
| 365 |
-
TRAILING_SEQUEL_MARKER_RE = re.compile(
|
| 366 |
-
r"(?:^|[\s._-])"
|
| 367 |
-
r"(?P<marker>"
|
| 368 |
-
r"Ni\s+no\s+(?:Sara|Shou|Sho|Syo|Shō)|"
|
| 369 |
-
r"San\s+no\s+(?:Sara|Shou|Sho|Syo)|"
|
| 370 |
-
r"(?:Yon|Shi|Shin)\s+no\s+Sara|"
|
| 371 |
-
r"(?:Go|Gou)\s+no\s+Sara|"
|
| 372 |
-
r"Ni\s+Gakki|Sono\s+Ni|Ni|"
|
| 373 |
-
r"II|III|IV|V|VI|VII|VIII|IX|[ⅡⅢⅣⅤⅥⅦⅧⅨ]|"
|
| 374 |
-
r"[一二三四五六七八九十兩两貳贰弐弍參叁参肆伍陸陆柒捌玖](?:\s*(?:ノ|の|之)\s*(?:章|期|季|部))?"
|
| 375 |
-
r")$",
|
| 376 |
-
re.I,
|
| 377 |
-
)
|
| 378 |
-
NOISE_META_RE = re.compile(
|
| 379 |
-
r"^(?:\d{3,4}[pP]|\d[Kk]|WEB[-_ ]?DL|WEB[-_ ]?Rip|BDRip|BluRay|BDMV|BD|DVDRip|DVD|TVRip|"
|
| 380 |
-
r"HDTV|Netflix|NF|AMZN|Baha|CR|HEVC|AVC|AV1|x26[45]|h\.?26[45]|AAC.*|FLAC|MP3|DTS|"
|
| 381 |
-
r"Opus|SDR|HDR10?|UHD|REMUX|10bit|8bit|Hi10p|Ma10p|ASS.*|SRT.*|CHS|CHT|BIG5|GB|JPN?|"
|
| 382 |
-
r"JPSC|JPTC|MP4|MKV|繁中|简中|内封|外挂)$",
|
| 383 |
-
re.I,
|
| 384 |
-
)
|
| 385 |
-
DATE_RE = re.compile(r"^(?:19|20)\d{2}(?:[.\-_年]?(?:0?[1-9]|1[0-2]))?(?:[.\-_月]?(?:0?[1-9]|[12]\d|3[01]))?日?$")
|
| 386 |
-
CATEGORY_BRACKETS = {
|
| 387 |
-
"国漫", "國漫", "国产", "國產", "国产动漫", "國產動漫", "国产动画", "國產動畫",
|
| 388 |
-
"国创", "國創", "中国动漫", "中國動漫", "中国动画", "中國動畫",
|
| 389 |
-
}
|
| 390 |
-
|
| 391 |
-
|
| 392 |
-
def cn_number_to_int(text: str) -> Optional[int]:
|
| 393 |
-
if text.isdigit():
|
| 394 |
-
return int(text)
|
| 395 |
-
values = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
|
| 396 |
-
if text == "十":
|
| 397 |
-
return 10
|
| 398 |
-
if text.startswith("十") and len(text) == 2:
|
| 399 |
-
return 10 + values.get(text[1], 0)
|
| 400 |
-
if text.endswith("十") and len(text) == 2:
|
| 401 |
-
return values.get(text[0], 0) * 10
|
| 402 |
-
if "十" in text and len(text) == 3:
|
| 403 |
-
return values.get(text[0], 0) * 10 + values.get(text[2], 0)
|
| 404 |
-
return values.get(text)
|
| 405 |
-
|
| 406 |
-
|
| 407 |
-
def bracket_parts(filename: str) -> List[Tuple[str, int, int]]:
|
| 408 |
-
parts: List[Tuple[str, int, int]] = []
|
| 409 |
-
for match in BRACKET_RE.finditer(filename):
|
| 410 |
-
text = next(group for group in match.groups() if group is not None)
|
| 411 |
-
parts.append((text.strip(), match.start(), match.end()))
|
| 412 |
-
return parts
|
| 413 |
-
|
| 414 |
-
|
| 415 |
-
def looks_like_group(text: str) -> bool:
|
| 416 |
-
if not text or NOISE_META_RE.search(text):
|
| 417 |
-
return False
|
| 418 |
-
return bool(
|
| 419 |
-
re.search(
|
| 420 |
-
r"(?:字幕|字幕组|字幕組|sub|subs|raws?|fansub|studio|house|team|project|"
|
| 421 |
-
r"loli|ani|vcb|airota|kiss|dmhy|erai|subsplease)",
|
| 422 |
-
text,
|
| 423 |
-
re.I,
|
| 424 |
-
)
|
| 425 |
-
)
|
| 426 |
-
|
| 427 |
-
|
| 428 |
-
def looks_like_episode_or_meta(text: str) -> bool:
|
| 429 |
-
if not text:
|
| 430 |
-
return False
|
| 431 |
-
clean = text.strip()
|
| 432 |
-
normalized = re.sub(r"[\s._-]+", "", clean)
|
| 433 |
-
return bool(
|
| 434 |
-
re.fullmatch(r"(?:EP?|#)?\d{1,4}(?:v\d+)?", clean, re.I)
|
| 435 |
-
or DATE_RE.fullmatch(clean)
|
| 436 |
-
or normalized in CATEGORY_BRACKETS
|
| 437 |
-
or RESOLUTION_RE.search(clean)
|
| 438 |
-
or SOURCE_TAG_RE.fullmatch(clean)
|
| 439 |
-
or SOURCE_RE.search(clean)
|
| 440 |
-
or SPECIAL_TAG_RE.search(clean)
|
| 441 |
-
or SPECIAL_CODE_RE.fullmatch(normalized)
|
| 442 |
-
or NOISE_META_RE.search(clean)
|
| 443 |
-
)
|
| 444 |
-
|
| 445 |
-
|
| 446 |
-
def normalize_special_code(text: str) -> str:
|
| 447 |
-
return re.sub(r"[\s._-]+", "", text.strip())
|
| 448 |
-
|
| 449 |
-
|
| 450 |
-
def special_code_spans(filename: str) -> List[Tuple[str, int, int]]:
|
| 451 |
-
spans: List[Tuple[str, int, int]] = []
|
| 452 |
-
for text, start, end in bracket_parts(filename):
|
| 453 |
-
normalized = normalize_special_code(text)
|
| 454 |
-
if SPECIAL_CODE_RE.fullmatch(normalized):
|
| 455 |
-
spans.append((normalized, start, end))
|
| 456 |
-
for match in SPECIAL_CODE_INLINE_RE.finditer(filename):
|
| 457 |
-
normalized = normalize_special_code(match.group("code"))
|
| 458 |
-
if SPECIAL_CODE_RE.fullmatch(normalized):
|
| 459 |
-
spans.append((normalized, match.start("code"), match.end("code")))
|
| 460 |
-
|
| 461 |
-
deduped: List[Tuple[str, int, int]] = []
|
| 462 |
-
seen: set[Tuple[str, int, int]] = set()
|
| 463 |
-
for value, start, end in sorted(spans, key=lambda item: (item[1], item[2])):
|
| 464 |
-
key = (value.lower(), start, end)
|
| 465 |
-
if key in seen:
|
| 466 |
-
continue
|
| 467 |
-
seen.add(key)
|
| 468 |
-
deduped.append((value, start, end))
|
| 469 |
-
return deduped
|
| 470 |
-
|
| 471 |
-
|
| 472 |
-
def special_code_brackets(filename: str) -> List[Tuple[str, int, int]]:
|
| 473 |
-
return [
|
| 474 |
-
(text.strip(), start, end)
|
| 475 |
-
for text, start, end in bracket_parts(filename)
|
| 476 |
-
if SPECIAL_CODE_RE.fullmatch(normalize_special_code(text))
|
| 477 |
-
]
|
| 478 |
-
|
| 479 |
-
|
| 480 |
-
def span_is_inside_special_code(filename: str, start: int, end: int) -> bool:
|
| 481 |
-
return any(special_start <= start and end <= special_end for _code, special_start, special_end in special_code_spans(filename))
|
| 482 |
-
|
| 483 |
-
|
| 484 |
-
def has_non_special_episode_context(filename: str, episode: int) -> bool:
|
| 485 |
-
masked = filename
|
| 486 |
-
for _text, start, end in reversed(special_code_brackets(filename)):
|
| 487 |
-
masked = masked[:start] + (" " * (end - start)) + masked[end:]
|
| 488 |
-
return plausible_episode_context(masked, episode) and best_structural_episode(masked) == episode
|
| 489 |
-
|
| 490 |
-
|
| 491 |
-
def episode_comes_only_from_special_code(filename: str, episode: Optional[int]) -> bool:
|
| 492 |
-
if episode is None:
|
| 493 |
-
return False
|
| 494 |
-
specials = special_code_spans(filename)
|
| 495 |
-
if not specials:
|
| 496 |
-
return False
|
| 497 |
-
ep_text = str(int(episode))
|
| 498 |
-
for normalized, _start, _end in specials:
|
| 499 |
-
if re.search(rf"0*{re.escape(ep_text)}$", normalized):
|
| 500 |
-
return not has_non_special_episode_context(filename, int(episode))
|
| 501 |
-
return False
|
| 502 |
-
|
| 503 |
-
|
| 504 |
-
def strip_title_special_codes(title: str, special: Optional[str] = None) -> str:
|
| 505 |
-
cleaned = title.strip()
|
| 506 |
-
while True:
|
| 507 |
-
next_cleaned = re.sub(
|
| 508 |
-
r"\s*[\[\(【《]\s*(?:(?:NCOP|NCED|OP|ED|PV|CM)\d*|IV\d+|(?:OVA|OAD|SP)\d*)\s*[\]\)】》]\s*$",
|
| 509 |
-
"",
|
| 510 |
-
cleaned,
|
| 511 |
-
flags=re.I,
|
| 512 |
-
).strip(" \t-_.")
|
| 513 |
-
if next_cleaned == cleaned:
|
| 514 |
-
break
|
| 515 |
-
cleaned = next_cleaned
|
| 516 |
-
cleaned = re.sub(r"\s+(?:NCOP|NCED|OP|ED|PV|CM)\d*$", "", cleaned, flags=re.I).strip(" \t-_.")
|
| 517 |
-
if special:
|
| 518 |
-
normalized = re.sub(r"[\s._-]+", "", str(special).strip())
|
| 519 |
-
match = re.fullmatch(r"([A-Za-z]+)\d+", normalized)
|
| 520 |
-
if match and SPECIAL_CODE_RE.fullmatch(normalized):
|
| 521 |
-
prefix = re.escape(match.group(1))
|
| 522 |
-
cleaned = re.sub(rf"\s+{prefix}$", "", cleaned, flags=re.I).strip(" \t-_.")
|
| 523 |
-
return cleaned or title
|
| 524 |
-
|
| 525 |
-
|
| 526 |
-
def looks_like_structural_group(text: str, filename: str, bracket_end: int) -> bool:
|
| 527 |
-
"""Heuristic for short leading release-group brackets not in the name list."""
|
| 528 |
-
if looks_like_group(text):
|
| 529 |
-
return True
|
| 530 |
-
if not text or looks_like_episode_or_meta(text):
|
| 531 |
-
return False
|
| 532 |
-
|
| 533 |
-
after = filename[bracket_end:].lstrip(" \t._")
|
| 534 |
-
if after.startswith("-"):
|
| 535 |
-
return False
|
| 536 |
-
next_bracket = BRACKET_RE.match(after)
|
| 537 |
-
if next_bracket:
|
| 538 |
-
next_text = next(group for group in next_bracket.groups() if group is not None)
|
| 539 |
-
if looks_like_episode_or_meta(next_text):
|
| 540 |
-
return False
|
| 541 |
-
|
| 542 |
-
words = re.findall(r"[A-Za-z0-9]+", text)
|
| 543 |
-
if not words:
|
| 544 |
-
if re.search(r"[\u3400-\u9fff]", text) and len(text) <= 32:
|
| 545 |
-
return True
|
| 546 |
-
return False
|
| 547 |
-
if len(text) > 32:
|
| 548 |
-
return False
|
| 549 |
-
if len(words) == 1:
|
| 550 |
-
return True
|
| 551 |
-
if any(sep in text for sep in "-_"):
|
| 552 |
-
return True
|
| 553 |
-
if words[0].isupper() and len(words[0]) <= 4 and len(words) <= 3:
|
| 554 |
-
return True
|
| 555 |
-
return False
|
| 556 |
-
|
| 557 |
-
|
| 558 |
-
def apply_rule_assists(filename: str, result: Dict) -> Dict:
|
| 559 |
-
"""
|
| 560 |
-
Fill high-confidence structural fields from filename conventions.
|
| 561 |
-
|
| 562 |
-
The model remains the primary tagger; rules only fill missing obvious fields
|
| 563 |
-
or repair common boundary drift around leading group brackets and episodes.
|
| 564 |
-
"""
|
| 565 |
-
repaired = dict(result)
|
| 566 |
-
brackets = bracket_parts(filename)
|
| 567 |
-
|
| 568 |
-
if (not repaired.get("group") or (repaired.get("title") and repaired["group"] in repaired["title"])) and brackets:
|
| 569 |
-
first_text, first_start, first_end = brackets[0]
|
| 570 |
-
if first_start == 0 and looks_like_structural_group(first_text, filename, first_end):
|
| 571 |
-
repaired["group"] = first_text
|
| 572 |
-
|
| 573 |
-
if not repaired.get("resolution"):
|
| 574 |
-
match = RESOLUTION_RE.search(filename)
|
| 575 |
-
if match:
|
| 576 |
-
repaired["resolution"] = match.group(0)
|
| 577 |
-
|
| 578 |
-
source_matches = source_candidates(filename)
|
| 579 |
-
current_source = repaired.get("source")
|
| 580 |
-
preferred_source = source_matches[0] if source_matches else None
|
| 581 |
-
if preferred_source and (
|
| 582 |
-
not current_source
|
| 583 |
-
or source_priority(preferred_source) > source_priority(str(current_source))
|
| 584 |
-
or (
|
| 585 |
-
source_priority(preferred_source) == source_priority(str(current_source))
|
| 586 |
-
and preferred_source.lower() != str(current_source).lower()
|
| 587 |
-
)
|
| 588 |
-
):
|
| 589 |
-
repaired["source"] = preferred_source
|
| 590 |
-
|
| 591 |
-
special_spans = special_code_spans(filename)
|
| 592 |
-
current_special = repaired.get("special")
|
| 593 |
-
if special_spans:
|
| 594 |
-
preferred_special = special_spans[0][0]
|
| 595 |
-
current_normalized = normalize_special_code(str(current_special)) if current_special else ""
|
| 596 |
-
if not current_special or preferred_special.lower().startswith(current_normalized.lower()):
|
| 597 |
-
repaired["special"] = preferred_special
|
| 598 |
-
if not repaired.get("special"):
|
| 599 |
-
for text, _start, _end in brackets:
|
| 600 |
-
clean = text.strip()
|
| 601 |
-
if SPECIAL_TAG_RE.search(clean):
|
| 602 |
-
repaired["special"] = clean
|
| 603 |
-
break
|
| 604 |
-
|
| 605 |
-
episode = best_structural_episode(filename)
|
| 606 |
-
if episode is not None and (
|
| 607 |
-
repaired.get("episode") is None
|
| 608 |
-
or not plausible_episode_context(filename, int(repaired["episode"]))
|
| 609 |
-
):
|
| 610 |
-
repaired["episode"] = episode
|
| 611 |
-
|
| 612 |
-
if repaired.get("episode") is not None and not plausible_episode_context(filename, int(repaired["episode"])):
|
| 613 |
-
repaired["episode"] = episode
|
| 614 |
-
if episode_comes_only_from_special_code(filename, repaired.get("episode")):
|
| 615 |
-
repaired["episode"] = None
|
| 616 |
-
|
| 617 |
-
if repaired.get("season") is None:
|
| 618 |
-
match = SEASON_RE.search(filename)
|
| 619 |
-
if match:
|
| 620 |
-
value = next(group for group in match.groups() if group)
|
| 621 |
-
season = cn_number_to_int(value)
|
| 622 |
-
if season is not None:
|
| 623 |
-
repaired["season"] = season
|
| 624 |
-
if repaired.get("season") is None and repaired.get("episode") is not None:
|
| 625 |
-
sequel = structural_sequel_marker(filename, repaired.get("group"), repaired.get("episode"))
|
| 626 |
-
if sequel is not None:
|
| 627 |
-
repaired["season"] = sequel[1]
|
| 628 |
-
elif repaired.get("episode") == repaired.get("season") and not SEASON_RE.search(filename):
|
| 629 |
-
repaired["season"] = None
|
| 630 |
-
|
| 631 |
-
title = repaired.get("title")
|
| 632 |
-
group = repaired.get("group")
|
| 633 |
-
if group and (NOISE_META_RE.search(str(group)) or SOURCE_RE.fullmatch(str(group)) or RESOLUTION_RE.fullmatch(str(group))):
|
| 634 |
-
repaired["group"] = None
|
| 635 |
-
group = None
|
| 636 |
-
|
| 637 |
-
if title and group and title.startswith(group):
|
| 638 |
-
title = title[len(group):].lstrip("]】)>})》 \t-_.")
|
| 639 |
-
repaired["title"] = title or repaired["title"]
|
| 640 |
-
|
| 641 |
-
if repaired.get("episode"):
|
| 642 |
-
repaired_title = infer_title_span(filename, group, repaired["episode"])
|
| 643 |
-
if repaired_title:
|
| 644 |
-
repaired["title"] = repaired_title
|
| 645 |
-
|
| 646 |
-
structured_title = infer_structured_bracket_title(filename, group, repaired.get("episode"))
|
| 647 |
-
if structured_title:
|
| 648 |
-
repaired["title"] = structured_title
|
| 649 |
-
|
| 650 |
-
if repaired.get("title") and repaired.get("season") is not None:
|
| 651 |
-
repaired["title"] = strip_trailing_season_from_title(repaired["title"], repaired["season"])
|
| 652 |
-
if repaired.get("episode") is None and repaired.get("group") and repaired.get("special"):
|
| 653 |
-
inferred_title = infer_title_span(filename, repaired.get("group"), None)
|
| 654 |
-
if inferred_title:
|
| 655 |
-
repaired["title"] = inferred_title
|
| 656 |
-
if repaired.get("title"):
|
| 657 |
-
repaired["title"] = strip_title_special_codes(repaired["title"], repaired.get("special"))
|
| 658 |
-
|
| 659 |
-
return repaired
|
| 660 |
-
|
| 661 |
-
|
| 662 |
-
def structural_sequel_marker(
|
| 663 |
-
filename: str,
|
| 664 |
-
group: Optional[str],
|
| 665 |
-
episode: Optional[int],
|
| 666 |
-
) -> Optional[Tuple[str, int]]:
|
| 667 |
-
if episode is None:
|
| 668 |
-
return None
|
| 669 |
-
title_end = None
|
| 670 |
-
if episode is not None:
|
| 671 |
-
ep_patterns = [
|
| 672 |
-
rf"[Ss]\d{{1,2}}[Ee]0*{episode}(?:v\d+)?",
|
| 673 |
-
rf"\s[-_]\s*0*{episode}(?:v\d+)?(?=$|[\s\[\(【《._-])",
|
| 674 |
-
rf"[\[\(【《]0*{episode}(?:v\d+)?[\]\)】》]",
|
| 675 |
-
rf"#\s*0*{episode}(?:v\d+)?(?=$|[\s\[\(【《._-])",
|
| 676 |
-
rf"(?:^|[\s._\-\[\(【《])第0*{episode}(?:[话話集])?(?=$|[\s._\-\]\)】》])",
|
| 677 |
-
]
|
| 678 |
-
start = 0
|
| 679 |
-
if group:
|
| 680 |
-
first = BRACKET_RE.match(filename)
|
| 681 |
-
if first and group in first.group(0):
|
| 682 |
-
start = first.end()
|
| 683 |
-
for pattern in ep_patterns:
|
| 684 |
-
match = re.search(pattern, filename[start:], re.I)
|
| 685 |
-
if match:
|
| 686 |
-
title_end = start + match.start()
|
| 687 |
-
break
|
| 688 |
-
if title_end is None:
|
| 689 |
-
return None
|
| 690 |
-
|
| 691 |
-
prefix = filename[:title_end].rstrip(" \t-_.")
|
| 692 |
-
for match in reversed(list(SEQUEL_MARKER_RE.finditer(prefix))):
|
| 693 |
-
marker = match.group("marker")
|
| 694 |
-
value = season_marker_number(marker)
|
| 695 |
-
if value is None:
|
| 696 |
-
continue
|
| 697 |
-
tail = prefix[match.end():].strip(" \t-_.")
|
| 698 |
-
if tail:
|
| 699 |
-
continue
|
| 700 |
-
if marker.lower() == "ni" and "Kakuriyo no Yadomeshi Ni" not in prefix:
|
| 701 |
-
continue
|
| 702 |
-
return marker, value
|
| 703 |
-
|
| 704 |
-
numeric_tail = re.search(r"(?:^|[\s._-])(?P<season>[2-9])$", prefix)
|
| 705 |
-
if numeric_tail:
|
| 706 |
-
return numeric_tail.group("season"), int(numeric_tail.group("season"))
|
| 707 |
-
return None
|
| 708 |
-
|
| 709 |
-
|
| 710 |
-
def normalize_source_text(text: str) -> str:
|
| 711 |
-
text = re.sub(r"\s+", "", text.strip())
|
| 712 |
-
text = re.sub(r"(?i)WEB[_ ]?DL", "WEB-DL", text)
|
| 713 |
-
text = re.sub(r"(?i)WEB[_ ]?Rip", "WebRip", text)
|
| 714 |
-
text = re.sub(r"(?i)U[_ ]?NEXT", "U-NEXT", text)
|
| 715 |
-
text = re.sub(r"(?i)AT[_ ]?X", "AT-X", text)
|
| 716 |
-
return text.replace("_", "-")
|
| 717 |
-
|
| 718 |
-
|
| 719 |
-
def source_priority(source: str) -> int:
|
| 720 |
-
normalized = source.lower().replace("_", "-").replace(" ", "")
|
| 721 |
-
parts = re.split(r"[&+/,]", normalized)
|
| 722 |
-
if any(part in {"nf", "netflix", "amzn", "baha", "cr", "abema", "dsnp", "u-next", "hulu", "at-x", "web-dl", "webdl", "webrip", "web-rip", "bdrip", "bluray", "bdmv", "bd", "dvdrip", "dvd", "tvrip", "hdtv"} for part in parts):
|
| 723 |
-
return 90
|
| 724 |
-
if any(part in {"chs", "cht", "gb", "big5", "jpn", "jpsc", "jptc", "繁中", "简中"} for part in parts):
|
| 725 |
-
return 70
|
| 726 |
-
if any(part in {"x264", "x265", "h.264", "h264", "h.265", "h265", "hevc", "avc", "av1", "aac", "flac", "mp3", "dts", "opus", "10bit", "8bit", "hi10p", "ma10p", "srt", "srtx2", "ass", "assx2"} for part in parts):
|
| 727 |
-
return 20
|
| 728 |
-
if len(parts) > 1:
|
| 729 |
-
return 40
|
| 730 |
-
return 20
|
| 731 |
-
|
| 732 |
-
|
| 733 |
-
def source_candidates(filename: str) -> List[str]:
|
| 734 |
-
candidates: List[Tuple[int, int, str]] = []
|
| 735 |
-
for text, start, _end in bracket_parts(filename):
|
| 736 |
-
clean = text.strip()
|
| 737 |
-
if SOURCE_TAG_RE.fullmatch(clean):
|
| 738 |
-
normalized = normalize_source_text(clean)
|
| 739 |
-
candidates.append((source_priority(normalized), -start, normalized))
|
| 740 |
-
|
| 741 |
-
for match in SOURCE_RE.finditer(filename):
|
| 742 |
-
normalized = normalize_source_text(match.group(0))
|
| 743 |
-
candidates.append((source_priority(normalized), -match.start(), normalized))
|
| 744 |
-
|
| 745 |
-
deduped: Dict[str, Tuple[int, int, str]] = {}
|
| 746 |
-
for priority, neg_start, value in candidates:
|
| 747 |
-
key = value.lower()
|
| 748 |
-
if key not in deduped or (priority, neg_start) > (deduped[key][0], deduped[key][1]):
|
| 749 |
-
deduped[key] = (priority, neg_start, value)
|
| 750 |
-
|
| 751 |
-
return [value for _priority, _neg_start, value in sorted(deduped.values(), reverse=True)]
|
| 752 |
-
|
| 753 |
-
|
| 754 |
-
def is_category_text(text: str) -> bool:
|
| 755 |
-
normalized = re.sub(r"[\s._-]+", "", text.strip())
|
| 756 |
-
return normalized in CATEGORY_BRACKETS
|
| 757 |
-
|
| 758 |
-
|
| 759 |
-
def infer_structured_bracket_title(
|
| 760 |
-
filename: str,
|
| 761 |
-
group: Optional[str],
|
| 762 |
-
episode: Optional[int],
|
| 763 |
-
) -> Optional[str]:
|
| 764 |
-
"""Pick the primary title from [group][category][title][alias][year][episode] rows."""
|
| 765 |
-
brackets = bracket_parts(filename)
|
| 766 |
-
if len(brackets) < 4 or episode is None:
|
| 767 |
-
return None
|
| 768 |
-
|
| 769 |
-
start_index = 0
|
| 770 |
-
if group and brackets and brackets[0][0] == group:
|
| 771 |
-
start_index = 1
|
| 772 |
-
|
| 773 |
-
search = brackets[start_index:]
|
| 774 |
-
if not search or not any(is_category_text(text) for text, _start, _end in search[:2]):
|
| 775 |
-
return None
|
| 776 |
-
|
| 777 |
-
episode_index = None
|
| 778 |
-
for idx, (text, _start, _end) in enumerate(brackets):
|
| 779 |
-
if re.fullmatch(rf"(?:EP?|#)?0*{episode}(?:v\d+)?", text.strip(), re.I):
|
| 780 |
-
episode_index = idx
|
| 781 |
-
break
|
| 782 |
-
if episode_index is None:
|
| 783 |
-
return None
|
| 784 |
-
|
| 785 |
-
candidates: List[Tuple[int, str]] = []
|
| 786 |
-
for idx in range(start_index, episode_index):
|
| 787 |
-
text = brackets[idx][0].strip()
|
| 788 |
-
if not text or looks_like_episode_or_meta(text):
|
| 789 |
-
continue
|
| 790 |
-
score = 0
|
| 791 |
-
if SEASON_RE.search(text) or TRAILING_SEQUEL_MARKER_RE.search(text):
|
| 792 |
-
score += 50
|
| 793 |
-
if re.search(r"[\u3400-\u9fff]", text):
|
| 794 |
-
score += 20
|
| 795 |
-
if idx > start_index:
|
| 796 |
-
score += 10
|
| 797 |
-
candidates.append((score, text))
|
| 798 |
-
|
| 799 |
-
if not candidates:
|
| 800 |
-
return None
|
| 801 |
-
return max(candidates, key=lambda item: item[0])[1]
|
| 802 |
-
|
| 803 |
-
|
| 804 |
-
def best_structural_episode(filename: str) -> Optional[int]:
|
| 805 |
-
priorities = {
|
| 806 |
-
"season_episode": 1000,
|
| 807 |
-
"dash_episode": 900,
|
| 808 |
-
"bracket_episode": 850,
|
| 809 |
-
"explicit_episode": 800,
|
| 810 |
-
"long_episode": 750,
|
| 811 |
-
"generic_episode": 100,
|
| 812 |
-
}
|
| 813 |
-
candidates: List[Tuple[int, int, int]] = []
|
| 814 |
-
for name, pattern in EPISODE_PATTERNS:
|
| 815 |
-
for match in pattern.finditer(filename):
|
| 816 |
-
ep_text = match.group("ep")
|
| 817 |
-
ep = int(ep_text)
|
| 818 |
-
if ep == 0 or ep > 2000:
|
| 819 |
-
continue
|
| 820 |
-
ep_start = match.start("ep")
|
| 821 |
-
ep_end = match.end("ep")
|
| 822 |
-
if span_is_inside_special_code(filename, ep_start, ep_end):
|
| 823 |
-
continue
|
| 824 |
-
if name == "generic_episode":
|
| 825 |
-
tail = filename[ep_end:]
|
| 826 |
-
if re.match(r"[-_][A-Za-z]", tail):
|
| 827 |
-
continue
|
| 828 |
-
if not re.match(
|
| 829 |
-
r"(?:$|[\]\)】》]|[\s._-]+(?:"
|
| 830 |
-
r"\[[^\]]*(?:\d{3,4}[pP]|WEB|BD|BluRay|HDTV|NF|AMZN|CR|Baha|Ma10p|x26|HEVC|AVC)|"
|
| 831 |
-
r"\d{3,4}[pP]|WEB|BD|BluRay|HDTV|NF|AMZN|CR|Baha|Ma10p|x26|HEVC|AVC|mkv|mp4|avi"
|
| 832 |
-
r"))",
|
| 833 |
-
tail,
|
| 834 |
-
re.I,
|
| 835 |
-
):
|
| 836 |
-
continue
|
| 837 |
-
context = filename[max(0, ep_start - 5):ep_end + 5]
|
| 838 |
-
if RESOLUTION_RE.search(context) or re.search(r"AAC|DDP|AC3|H\.?26[45]|x26[45]", context, re.I):
|
| 839 |
-
continue
|
| 840 |
-
priority = priorities[name]
|
| 841 |
-
if 1 <= ep <= 200:
|
| 842 |
-
priority += 20
|
| 843 |
-
candidates.append((priority, ep_start, ep))
|
| 844 |
-
if not candidates:
|
| 845 |
-
return None
|
| 846 |
-
return max(candidates, key=lambda item: (item[0], item[1]))[2]
|
| 847 |
-
|
| 848 |
-
|
| 849 |
-
def plausible_episode_context(filename: str, episode: int) -> bool:
|
| 850 |
-
ep_text = str(episode)
|
| 851 |
-
padded = f"{episode:02d}"
|
| 852 |
-
if re.search(rf"(?<![A-Za-z0-9])(?:H|x)\.?0*{re.escape(ep_text)}(?!\d)", filename, re.I):
|
| 853 |
-
return False
|
| 854 |
-
patterns = [
|
| 855 |
-
rf"[Ss]\d{{1,2}}[Ee]0*{episode}(?:v\d+)?",
|
| 856 |
-
rf"(?:^|[\s._])[-_]\s*0*{episode}(?:v\d+)?(?=$|[\s._\-\]\)】》\[])",
|
| 857 |
-
rf"[\[\(【《](?:EP?|#)?0*{episode}(?:v\d+)?[\]\)】》]",
|
| 858 |
-
rf"(?:^|[\s._\-\[\(【《#])(?:EP?|第|#)0*{episode}(?:v\d+)?(?:[话話集])?(?=$|[\s._\-\]\)】》])",
|
| 859 |
-
rf"(?:^|[\s._\-\[\(【《])0*{episode}(?:v\d+)?(?=[\s._\-\]\)】》\[]+(?:\d{{3,4}}[pP]|WEB|BD|BluRay|HDTV|NF|AMZN|CR|Baha))",
|
| 860 |
-
]
|
| 861 |
-
if any(re.search(pattern, filename, re.I) for pattern in patterns):
|
| 862 |
-
return True
|
| 863 |
-
return bool(re.search(rf"(?:^|[\s._-])(?:{re.escape(ep_text)}|{re.escape(padded)})(?:v\d+)?$", filename, re.I))
|
| 864 |
-
|
| 865 |
-
|
| 866 |
-
def strip_trailing_season_from_title(title: str, season: int) -> str:
|
| 867 |
-
season_text = str(season)
|
| 868 |
-
patterns = [
|
| 869 |
-
rf"\s+[Ss]0*{season_text}$",
|
| 870 |
-
rf"\s+Season\s*0*{season_text}$",
|
| 871 |
-
rf"\s+0*{season_text}$",
|
| 872 |
-
rf"\s+第(?:0*{season_text}|{season_text})[季期部章]$",
|
| 873 |
-
]
|
| 874 |
-
cleaned = title
|
| 875 |
-
for pattern in patterns:
|
| 876 |
-
cleaned = re.sub(pattern, "", cleaned, flags=re.I).strip(" \t-_.")
|
| 877 |
-
match = TRAILING_SEQUEL_MARKER_RE.search(cleaned)
|
| 878 |
-
if match and season_marker_number(match.group("marker")) == season:
|
| 879 |
-
cleaned = cleaned[:match.start()].strip(" \t-_.")
|
| 880 |
-
return cleaned or title
|
| 881 |
-
|
| 882 |
-
|
| 883 |
-
def clean_inferred_title(title: str) -> str:
|
| 884 |
-
raw_title = title.strip(" \t-_.")
|
| 885 |
-
bracket_matches = list(BRACKET_RE.finditer(raw_title))
|
| 886 |
-
if bracket_matches:
|
| 887 |
-
first = bracket_matches[0]
|
| 888 |
-
prefix = raw_title[:first.start()].strip(" \t-_.★☆")
|
| 889 |
-
text = next(group for group in first.groups() if group is not None).strip()
|
| 890 |
-
if text and not looks_like_episode_or_meta(text) and (
|
| 891 |
-
not prefix
|
| 892 |
-
or re.search(r"(?:新番|月|合集|繁|简|字幕|先行|合集|★|☆)", prefix, re.I)
|
| 893 |
-
):
|
| 894 |
-
return text
|
| 895 |
-
return raw_title.strip("[]()【】《》()")
|
| 896 |
-
|
| 897 |
-
|
| 898 |
-
def infer_title_span(filename: str, group: Optional[str], episode: Optional[int]) -> Optional[str]:
|
| 899 |
-
start = 0
|
| 900 |
-
if group:
|
| 901 |
-
first = BRACKET_RE.match(filename)
|
| 902 |
-
if first and group in first.group(0):
|
| 903 |
-
start = first.end()
|
| 904 |
-
else:
|
| 905 |
-
# Some releases put leading metadata before the actual title, e.g.
|
| 906 |
-
# `[1080p] Title - 01`. Do not keep that wrapper as title text.
|
| 907 |
-
while True:
|
| 908 |
-
leading = BRACKET_RE.match(filename[start:].lstrip(" \t._-"))
|
| 909 |
-
if not leading:
|
| 910 |
-
break
|
| 911 |
-
skipped_ws = len(filename[start:]) - len(filename[start:].lstrip(" \t._-"))
|
| 912 |
-
text = next(group for group in leading.groups() if group is not None)
|
| 913 |
-
if not looks_like_episode_or_meta(text):
|
| 914 |
-
break
|
| 915 |
-
start += skipped_ws + leading.end()
|
| 916 |
-
|
| 917 |
-
end = None
|
| 918 |
-
if episode is not None:
|
| 919 |
-
ep_patterns = [
|
| 920 |
-
rf"[Ss]\d{{1,2}}[Ee]0*{episode}(?:v\d+)?",
|
| 921 |
-
rf"\s[-_]\s*0*{episode}(?:v\d+)?(?=$|[\s\[\(【《._-])",
|
| 922 |
-
rf"[\[\(【《]0*{episode}(?:v\d+)?[\]\)】》]",
|
| 923 |
-
rf"#\s*0*{episode}(?:v\d+)?(?=$|[\s\[\(【《._-])",
|
| 924 |
-
rf"(?:^|[\s._\-\[\(【《])第0*{episode}(?:[话話集])?(?=$|[\s._\-\]\)】》])",
|
| 925 |
-
rf"[Ee]0*{episode}(?:v\d+)?",
|
| 926 |
-
]
|
| 927 |
-
for pattern in ep_patterns:
|
| 928 |
-
match = re.search(pattern, filename[start:], re.I)
|
| 929 |
-
if match:
|
| 930 |
-
end = start + match.start()
|
| 931 |
-
break
|
| 932 |
-
|
| 933 |
-
if end is None:
|
| 934 |
-
for text, bracket_start, _bracket_end in bracket_parts(filename):
|
| 935 |
-
if bracket_start <= start:
|
| 936 |
-
continue
|
| 937 |
-
if (
|
| 938 |
-
NOISE_META_RE.search(text)
|
| 939 |
-
or RESOLUTION_RE.search(text)
|
| 940 |
-
or SOURCE_RE.search(text)
|
| 941 |
-
or SPECIAL_TAG_RE.search(text)
|
| 942 |
-
or SPECIAL_CODE_RE.fullmatch(re.sub(r"[\s._-]+", "", text.strip()))
|
| 943 |
-
):
|
| 944 |
-
end = bracket_start
|
| 945 |
-
break
|
| 946 |
-
|
| 947 |
-
if end is None or end <= start:
|
| 948 |
-
return None
|
| 949 |
-
title = clean_inferred_title(filename[start:end])
|
| 950 |
-
return title or None
|
| 951 |
-
|
| 952 |
-
|
| 953 |
def parse_filename(
|
| 954 |
filename: str,
|
| 955 |
model: BertForTokenClassification,
|
|
@@ -957,7 +314,6 @@ def parse_filename(
|
|
| 957 |
id2label: Dict[int, str],
|
| 958 |
max_length: int = 64,
|
| 959 |
debug: bool = False,
|
| 960 |
-
use_rules: bool = False,
|
| 961 |
constrain_bio: bool = True,
|
| 962 |
) -> Dict:
|
| 963 |
"""
|
|
@@ -1046,14 +402,12 @@ def parse_filename(
|
|
| 1046 |
tokens[:available],
|
| 1047 |
label_strings,
|
| 1048 |
tokenizer=tokenizer,
|
| 1049 |
-
filename=filename,
|
| 1050 |
-
use_rules=use_rules,
|
| 1051 |
)
|
| 1052 |
if debug:
|
| 1053 |
result["_debug"] = {
|
| 1054 |
"tokenizer_variant": getattr(tokenizer, "tokenizer_variant", "regex"),
|
| 1055 |
"decoder": "constrained_bio" if constrain_bio else "greedy",
|
| 1056 |
-
"postprocess": "
|
| 1057 |
"max_length": max_length,
|
| 1058 |
"token_count": len(tokens),
|
| 1059 |
"available_token_count": available,
|
|
@@ -1101,10 +455,6 @@ def main():
|
|
| 1101 |
help="Maximum sequence length")
|
| 1102 |
parser.add_argument("--debug", action="store_true",
|
| 1103 |
help="Include tokenizer, labels, scores, and entity spans in JSON output")
|
| 1104 |
-
parser.add_argument("--rule-assist", action="store_true",
|
| 1105 |
-
help="Enable legacy structural post-processing rules")
|
| 1106 |
-
parser.add_argument("--no-rule-assist", action="store_true",
|
| 1107 |
-
help=argparse.SUPPRESS)
|
| 1108 |
parser.add_argument("--no-constrained-bio", action="store_true",
|
| 1109 |
help="Use greedy per-token decoding instead of constrained BIO Viterbi")
|
| 1110 |
args = parser.parse_args()
|
|
@@ -1152,7 +502,6 @@ def main():
|
|
| 1152 |
id2label,
|
| 1153 |
max_length,
|
| 1154 |
debug=args.debug,
|
| 1155 |
-
use_rules=args.rule_assist and not args.no_rule_assist,
|
| 1156 |
constrain_bio=not args.no_constrained_bio,
|
| 1157 |
)
|
| 1158 |
result["_input"] = fn
|
|
|
|
| 11 |
|
| 12 |
import argparse
|
| 13 |
import json
|
|
|
|
| 14 |
import re
|
| 15 |
import sys
|
| 16 |
from typing import Dict, List, Optional, Tuple
|
|
|
|
| 97 |
return 40 if re.search(r"[&+/,]", source) else 30
|
| 98 |
|
| 99 |
|
| 100 |
+
def normalize_source_text(text: str) -> str:
|
| 101 |
+
text = re.sub(r"\s+", "", text.strip())
|
| 102 |
+
text = re.sub(r"(?i)WEB[_ ]?DL", "WEB-DL", text)
|
| 103 |
+
text = re.sub(r"(?i)WEB[_ ]?Rip", "WebRip", text)
|
| 104 |
+
text = re.sub(r"(?i)U[_ ]?NEXT", "U-NEXT", text)
|
| 105 |
+
text = re.sub(r"(?i)AT[_ ]?X", "AT-X", text)
|
| 106 |
+
return text.replace("_", "-")
|
| 107 |
+
|
| 108 |
+
|
| 109 |
def choose_thin_source(sources: List[str]) -> Optional[str]:
|
| 110 |
cleaned = [normalize_source_text(source) for source in sources if normalize_field_text(source)]
|
| 111 |
if not cleaned:
|
|
|
|
| 247 |
tokens: List[str],
|
| 248 |
labels: List[str],
|
| 249 |
tokenizer: Optional[AnimeTokenizer] = None,
|
| 250 |
) -> Dict:
|
| 251 |
"""
|
| 252 |
Convert BIO-labeled tokens into structured metadata.
|
|
|
|
| 304 |
|
| 305 |
result["source"] = choose_thin_source(grouped_entities.get("SOURCE", []))
|
| 306 |
|
| 307 |
return result
|
| 308 |
|
| 309 |
|
| 310 |
def parse_filename(
|
| 311 |
filename: str,
|
| 312 |
model: BertForTokenClassification,
|
|
|
|
| 314 |
id2label: Dict[int, str],
|
| 315 |
max_length: int = 64,
|
| 316 |
debug: bool = False,
|
|
|
|
| 317 |
constrain_bio: bool = True,
|
| 318 |
) -> Dict:
|
| 319 |
"""
|
|
|
|
| 402 |
tokens[:available],
|
| 403 |
label_strings,
|
| 404 |
tokenizer=tokenizer,
|
| 405 |
)
|
| 406 |
if debug:
|
| 407 |
result["_debug"] = {
|
| 408 |
"tokenizer_variant": getattr(tokenizer, "tokenizer_variant", "regex"),
|
| 409 |
"decoder": "constrained_bio" if constrain_bio else "greedy",
|
| 410 |
+
"postprocess": "thin_normalize",
|
| 411 |
"max_length": max_length,
|
| 412 |
"token_count": len(tokens),
|
| 413 |
"available_token_count": available,
|
|
|
|
| 455 |
help="Maximum sequence length")
|
| 456 |
parser.add_argument("--debug", action="store_true",
|
| 457 |
help="Include tokenizer, labels, scores, and entity spans in JSON output")
|
| 458 |
parser.add_argument("--no-constrained-bio", action="store_true",
|
| 459 |
help="Use greedy per-token decoding instead of constrained BIO Viterbi")
|
| 460 |
args = parser.parse_args()
|
|
|
|
| 502 |
id2label,
|
| 503 |
max_length,
|
| 504 |
debug=args.debug,
|
|
|
|
| 505 |
constrain_bio=not args.no_constrained_bio,
|
| 506 |
)
|
| 507 |
result["_input"] = fn
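
For reference, the `normalize_source_text` helper that this diff moves next to `choose_thin_source` can be exercised on its own. The function body below is copied from the added lines of `inference.py`; the sample inputs and the `samples`/`normalized` names are illustrative assumptions, not part of the repository.

```python
import re


def normalize_source_text(text: str) -> str:
    # Drop all whitespace, then canonicalize a few known source spellings.
    text = re.sub(r"\s+", "", text.strip())
    text = re.sub(r"(?i)WEB[_ ]?DL", "WEB-DL", text)
    text = re.sub(r"(?i)WEB[_ ]?Rip", "WebRip", text)
    text = re.sub(r"(?i)U[_ ]?NEXT", "U-NEXT", text)
    text = re.sub(r"(?i)AT[_ ]?X", "AT-X", text)
    # Any remaining underscore becomes a hyphen (for example GB_JP -> GB-JP).
    return text.replace("_", "-")


# Hypothetical inputs chosen to touch each substitution once.
samples = ["web dl", "WEB_Rip", "U_NEXT", "GB_JP"]
normalized = [normalize_source_text(s) for s in samples]
print(normalized)
```

This is string normalization only: it rewrites spellings of a source the model has already tagged and does not search the filename for new candidates, which is consistent with the "thin runtime" goal stated in MAINTENANCE.md.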
|
onnx_inference.py
CHANGED
|
@@ -59,10 +59,9 @@ def parse_with_onnx(
|
|
| 59 |
model_dir: Path,
|
| 60 |
onnx_path: Path,
|
| 61 |
max_length: int,
|
| 62 |
-
use_rules: bool = False,
|
| 63 |
) -> Dict:
|
| 64 |
parser = OnnxFilenameParser(model_dir, onnx_path, max_length)
|
| 65 |
-
return parser.parse(filename
|
| 66 |
|
| 67 |
|
| 68 |
class OnnxFilenameParser:
|
|
@@ -87,7 +86,7 @@ class OnnxFilenameParser:
|
|
| 87 |
providers=providers or ["CPUExecutionProvider"],
|
| 88 |
)
|
| 89 |
|
| 90 |
-
def parse(self, filename: str
|
| 91 |
tokens, input_ids, attention_mask, available = encode(filename, self.tokenizer, self.max_length)
|
| 92 |
logits = self.session.run(
|
| 93 |
["logits"],
|
|
@@ -100,7 +99,7 @@ class OnnxFilenameParser:
|
|
| 100 |
token_logits = torch.from_numpy(logits[0, 1:1 + available, :])
|
| 101 |
label_ids = constrained_bio_decode(token_logits, self.id2label)
|
| 102 |
labels = [self.id2label.get(label_id, "O") for label_id in label_ids]
|
| 103 |
-
result = postprocess(tokens, labels, tokenizer=self.tokenizer
|
| 104 |
result["_input"] = filename
|
| 105 |
return result
|
| 106 |
|
|
@@ -111,8 +110,6 @@ def main() -> None:
|
|
| 111 |
parser.add_argument("--model-dir", default=".", help="Directory containing vocab.json and config.json")
|
| 112 |
parser.add_argument("--onnx", default="exports/anime_filename_parser.onnx", help="ONNX model path")
|
| 113 |
parser.add_argument("--max-length", type=int, default=128, help="Static ONNX sequence length")
|
| 114 |
-
parser.add_argument("--rule-assist", action="store_true", help="Enable legacy structural postprocessing")
|
| 115 |
-
parser.add_argument("--no-rule-assist", action="store_true", help=argparse.SUPPRESS)
|
| 116 |
args = parser.parse_args()
|
| 117 |
|
| 118 |
result = parse_with_onnx(
|
|
@@ -120,7 +117,6 @@ def main() -> None:
|
|
| 120 |
model_dir=Path(args.model_dir),
|
| 121 |
onnx_path=Path(args.onnx),
|
| 122 |
max_length=args.max_length,
|
| 123 |
-
use_rules=args.rule_assist and not args.no_rule_assist,
|
| 124 |
)
|
| 125 |
print(json.dumps(result, ensure_ascii=False))
|
| 126 |
|
|
|
|
| 59 |
model_dir: Path,
|
| 60 |
onnx_path: Path,
|
| 61 |
max_length: int,
|
|
|
|
| 62 |
) -> Dict:
|
| 63 |
parser = OnnxFilenameParser(model_dir, onnx_path, max_length)
|
| 64 |
+
return parser.parse(filename)
|
| 65 |
|
| 66 |
|
| 67 |
class OnnxFilenameParser:
|
|
|
|
| 86 |
providers=providers or ["CPUExecutionProvider"],
|
| 87 |
)
|
| 88 |
|
| 89 |
+
def parse(self, filename: str) -> Dict:
|
| 90 |
tokens, input_ids, attention_mask, available = encode(filename, self.tokenizer, self.max_length)
|
| 91 |
logits = self.session.run(
|
| 92 |
["logits"],
|
|
|
|
| 99 |
token_logits = torch.from_numpy(logits[0, 1:1 + available, :])
|
| 100 |
label_ids = constrained_bio_decode(token_logits, self.id2label)
|
| 101 |
labels = [self.id2label.get(label_id, "O") for label_id in label_ids]
|
| 102 |
+
result = postprocess(tokens, labels, tokenizer=self.tokenizer)
|
| 103 |
result["_input"] = filename
|
| 104 |
return result
|
| 105 |
|
|
|
|
| 110 |
parser.add_argument("--model-dir", default=".", help="Directory containing vocab.json and config.json")
|
| 111 |
parser.add_argument("--onnx", default="exports/anime_filename_parser.onnx", help="ONNX model path")
|
| 112 |
parser.add_argument("--max-length", type=int, default=128, help="Static ONNX sequence length")
|
|
|
|
|
|
|
| 113 |
args = parser.parse_args()
|
| 114 |
|
| 115 |
result = parse_with_onnx(
|
|
|
|
| 117 |
model_dir=Path(args.model_dir),
|
| 118 |
onnx_path=Path(args.onnx),
|
| 119 |
max_length=args.max_length,
|
|
|
|
| 120 |
)
|
| 121 |
print(json.dumps(result, ensure_ascii=False))
|
| 122 |
|
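
The ONNX path above calls `constrained_bio_decode` on the per-token logits, and the CLI help in `inference.py` describes the default decoder as constrained BIO Viterbi. The implementation of that function is not shown in this diff, so the following is only a minimal sketch of the general technique, assuming hard constraints: an `I-X` tag may follow only `B-X` or `I-X`, and may not start a sequence. The name `constrained_bio_decode_sketch`, the plain-list inputs, and the three-label toy set are hypothetical.

```python
from typing import Dict, List

NEG_INF = float("-inf")


def transition_allowed(prev: str, cur: str) -> bool:
    # BIO rule: I-X must continue an entity opened by B-X (or extended by I-X).
    if cur.startswith("I-"):
        entity = cur[2:]
        return prev in ("B-" + entity, "I-" + entity)
    return True


def constrained_bio_decode_sketch(
    scores: List[List[float]], id2label: Dict[int, str]
) -> List[int]:
    """Viterbi over per-token scores with forbidden BIO transitions."""
    if not scores:
        return []
    n = len(id2label)
    # A sequence cannot start inside an entity.
    best = [
        NEG_INF if id2label[j].startswith("I-") else scores[0][j]
        for j in range(n)
    ]
    backpointers: List[List[int]] = []
    for row in scores[1:]:
        new_best: List[float] = []
        pointers: List[int] = []
        for j in range(n):
            # Best allowed predecessor for label j at this position.
            prev_score, prev_i = max(
                (best[i], i)
                for i in range(n)
                if transition_allowed(id2label[i], id2label[j])
            )
            new_best.append(prev_score + row[j])
            pointers.append(prev_i)
        best = new_best
        backpointers.append(pointers)
    # Trace the highest-scoring path backwards.
    last = max(range(n), key=lambda j: best[j])
    path = [last]
    for pointers in reversed(backpointers):
        last = pointers[last]
        path.append(last)
    path.reverse()
    return path


toy_labels = {0: "O", 1: "B-TITLE", 2: "I-TITLE"}
# Greedy argmax would pick I-TITLE twice, which is not a valid BIO sequence.
toy_scores = [[0.1, 0.2, 0.9], [0.1, 0.2, 0.9]]
decoded = [toy_labels[i] for i in constrained_bio_decode_sketch(toy_scores, toy_labels)]
print(decoded)
```

In the toy case the decoder is forced to open the entity with `B-TITLE` before continuing it, which is the kind of structural repair the runtime keeps after the regex-based rule assists are removed.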
parse_eval_metrics.json
CHANGED
|
@@ -2,7 +2,6 @@
|
|
| 2 |
"primary_metric": "normalized_only",
|
| 3 |
"modes": {
|
| 4 |
"model_only": {
|
| 5 |
-
"use_rules": false,
|
| 6 |
"constrain_bio": false,
|
| 7 |
"sample_count": 1024,
|
| 8 |
"field_accuracy": {
|
|
@@ -309,7 +308,6 @@
|
|
| 309 |
]
|
| 310 |
},
|
| 311 |
"normalized_only": {
|
| 312 |
-
"use_rules": false,
|
| 313 |
"constrain_bio": true,
|
| 314 |
"sample_count": 1024,
|
| 315 |
"field_accuracy": {
|
|
@@ -533,627 +531,6 @@
|
|
| 533 |
}
|
| 534 |
}
|
| 535 |
]
|
| 536 |
-
},
|
| 537 |
-
"rule_assisted": {
|
| 538 |
-
"use_rules": true,
|
| 539 |
-
"constrain_bio": true,
|
| 540 |
-
"sample_count": 1024,
|
| 541 |
-
"field_accuracy": {
|
| 542 |
-
"group": 0.9873046875,
|
| 543 |
-
"title": 0.7265625,
|
| 544 |
-
"season": 0.9912109375,
|
| 545 |
-
"episode": 0.7021484375,
|
| 546 |
-
"resolution": 1.0,
|
| 547 |
-
"source": 0.98046875,
|
| 548 |
-
"special": 0.951171875
|
| 549 |
-
},
|
| 550 |
-
"field_correct": {
|
| 551 |
-
"group": 1011,
|
| 552 |
-
"title": 744,
|
| 553 |
-
"season": 1015,
|
| 554 |
-
"episode": 719,
|
| 555 |
-
"resolution": 1024,
|
| 556 |
-
"source": 1004,
|
| 557 |
-
"special": 974
|
| 558 |
-
},
|
| 559 |
-
"field_total": {
|
| 560 |
-
"group": 1024,
|
| 561 |
-
"title": 1024,
|
| 562 |
-
"season": 1024,
|
| 563 |
-
"episode": 1024,
|
| 564 |
-
"resolution": 1024,
|
| 565 |
-
"source": 1024,
|
| 566 |
-
"special": 1024
|
| 567 |
-
},
|
| 568 |
-
"full_match_accuracy": 0.5068359375,
|
| 569 |
-
"full_match_correct": 519,
|
| 570 |
-
"full_match_total": 1024,
|
| 571 |
-
"failures": [
|
| 572 |
-
{
|
| 573 |
-
"filename": "[DBD-Raws][Tokidoki Bosotto Russia-go de Dereru Tonari no Alya-san][PV][20][1080P][BDRip][HEVC-10bit][FLAC]",
|
| 574 |
-
"errors": {
|
| 575 |
-
"episode": {
|
| 576 |
-
"gold": null,
|
| 577 |
-
"pred": "20"
|
| 578 |
-
}
|
| 579 |
-
},
|
| 580 |
-
"gold": {
|
| 581 |
-
"group": "DBD-Raws",
|
| 582 |
-
"title": "Tokidoki Bosotto Russia-go de Dereru Tonari no Alya-san",
|
| 583 |
-
"season": null,
|
| 584 |
-
"episode": null,
|
| 585 |
-
"resolution": "1080P",
|
| 586 |
-
"source": "BDRip",
|
| 587 |
-
"special": "20"
|
| 588 |
-
},
|
| 589 |
-
"pred": {
|
| 590 |
-
"group": "DBD-Raws",
|
| 591 |
-
"title": "Tokidoki Bosotto Russia-go de Dereru Tonari no Alya-san",
|
| 592 |
-
"season": null,
|
| 593 |
-
"episode": 20,
|
| 594 |
-
"resolution": "1080P",
|
| 595 |
-
"source": "BDRip",
|
| 596 |
-
"special": "20"
|
| 597 |
-
}
|
| 598 |
-
},
|
| 599 |
-
{
|
| 600 |
-
"filename": "[DBD-Raws][我的英雄学院 第三季][PV][02][1080P][BDRip][HEVC-10bit][FLAC]",
|
| 601 |
-
"errors": {
|
| 602 |
-
"title": {
|
| 603 |
-
"gold": "我的英雄学院",
|
| 604 |
-
"pred": "我的英雄学院 第三季"
|
| 605 |
-
},
|
| 606 |
-
"episode": {
|
| 607 |
-
"gold": null,
|
| 608 |
-
"pred": "2"
|
| 609 |
-
}
|
| 610 |
-
},
|
| 611 |
-
"gold": {
|
| 612 |
-
"group": "DBD-Raws",
|
| 613 |
-
"title": "我的英雄学院",
|
| 614 |
-
"season": 3,
|
| 615 |
-
"episode": null,
|
| 616 |
-
"resolution": "1080P",
|
| 617 |
-
"source": "BDRip",
|
| 618 |
-
"special": "02"
|
| 619 |
-
},
|
| 620 |
-
"pred": {
|
| 621 |
-
"group": "DBD-Raws",
|
| 622 |
-
"title": "我的英雄学院 第三季",
|
| 623 |
-
"season": 3,
|
| 624 |
-
"episode": 2,
|
| 625 |
-
"resolution": "1080P",
|
| 626 |
-
"source": "BDRip",
|
| 627 |
-
"special": "02"
|
| 628 |
-
}
|
| 629 |
-
},
|
| 630 |
-
{
|
| 631 |
-
"filename": "[Moozzi2] Katanagatari [SP01] NCOP - 02 (BD 1920x1080 x.264 Flac)",
|
| 632 |
-
"errors": {
|
| 633 |
-
"episode": {
|
| 634 |
-
"gold": "1",
|
| 635 |
-
"pred": null
|
| 636 |
-
}
|
| 637 |
-
},
|
| 638 |
-
"gold": {
|
| 639 |
-
"group": "Moozzi2",
|
| 640 |
-
"title": "Katanagatari",
|
| 641 |
-
"season": null,
|
| 642 |
-
"episode": 1,
|
| 643 |
-
"resolution": "1920x1080",
|
| 644 |
-
"source": "BD",
|
| 645 |
-
"special": "NCOP - 02"
|
| 646 |
-
},
|
| 647 |
-
"pred": {
|
| 648 |
-
"group": "Moozzi2",
|
| 649 |
-
"title": "Katanagatari",
|
| 650 |
-
"season": null,
|
| 651 |
-
"episode": null,
|
| 652 |
-
"resolution": "1920x1080",
|
| 653 |
-
"source": "BD",
|
| 654 |
-
"special": "NCOP - 02"
|
| 655 |
-
}
|
| 656 |
-
},
|
| 657 |
-
{
|
| 658 |
-
"filename": "[DBD-Raws][Ijiranaide, Nagatoro-san 2nd Attack][PV][06][1080P][BDRip][HEVC-10bit][FLAC]",
|
| 659 |
-
"errors": {
|
| 660 |
-
"episode": {
|
| 661 |
-
"gold": null,
|
| 662 |
-
"pred": "6"
|
| 663 |
-
}
|
| 664 |
-
},
|
| 665 |
-
"gold": {
|
| 666 |
-
"group": "DBD-Raws",
|
| 667 |
-
"title": "Ijiranaide, Nagatoro-san 2nd Attack",
|
| 668 |
-
"season": null,
|
| 669 |
-
"episode": null,
|
| 670 |
-
"resolution": "1080P",
|
| 671 |
-
"source": "BDRip",
|
| 672 |
-
"special": "06"
|
| 673 |
-
},
|
| 674 |
-
"pred": {
|
| 675 |
-
"group": "DBD-Raws",
|
| 676 |
-
"title": "Ijiranaide, Nagatoro-san 2nd Attack",
|
| 677 |
-
"season": null,
|
| 678 |
-
"episode": 6,
|
| 679 |
-
"resolution": "1080P",
|
| 680 |
-
"source": "BDRip",
|
| 681 |
-
"special": "06"
|
| 682 |
-
}
|
| 683 |
-
},
|
| 684 |
-
{
|
| 685 |
-
"filename": "【枫叶字幕组】宠物小精灵XY&Z[第30(122)话][720P][MP4][GB_JP].mp4",
|
| 686 |
-
"errors": {
|
| 687 |
-
"title": {
|
| 688 |
-
"gold": "宠物小精灵xy&z",
|
| 689 |
-
"pred": "宠物小精灵xy&z[第30"
|
| 690 |
-
},
|
| 691 |
-
"episode": {
|
| 692 |
-
"gold": "30",
|
| 693 |
-
"pred": "122"
|
| 694 |
-
}
|
| 695 |
-
},
|
| 696 |
-
"gold": {
|
| 697 |
-
"group": "枫叶字幕组",
|
| 698 |
-
"title": "宠物小精灵XY&Z",
|
| 699 |
-
"season": null,
|
| 700 |
-
"episode": 30,
|
| 701 |
-
"resolution": "720P",
|
| 702 |
-
"source": "GB-JP",
|
| 703 |
-
"special": null
|
| 704 |
-
},
|
| 705 |
-
"pred": {
|
| 706 |
-
"group": "枫叶字幕组",
|
| 707 |
-
"title": "宠物小精灵XY&Z[第30",
|
| 708 |
-
"season": null,
|
| 709 |
-
"episode": 122,
|
| 710 |
-
"resolution": "720P",
|
| 711 |
-
"source": "GB-JP",
|
| 712 |
-
"special": null
|
| 713 |
-
}
|
| 714 |
-
},
|
| 715 |
-
{
|
| 716 |
-
"filename": "[Snow-Raws] グランベルム CM&PV10 (BD 1920x1080 HEVC-YUV420P10 FLAC)",
|
| 717 |
-
"errors": {
|
| 718 |
-
"title": {
|
| 719 |
-
"gold": "グランベルム",
|
| 720 |
-
"pred": "グランベルム cm&pv10"
|
| 721 |
-
}
|
| 722 |
-
},
|
| 723 |
-
"gold": {
|
| 724 |
-
"group": "Snow-Raws",
|
| 725 |
-
"title": "グランベルム",
|
| 726 |
-
"season": null,
|
| 727 |
-
"episode": null,
|
| 728 |
-
"resolution": "1920x1080",
|
| 729 |
-
"source": "BD",
|
| 730 |
-
"special": "PV10"
|
| 731 |
-
},
|
| 732 |
-
"pred": {
|
| 733 |
-
"group": "Snow-Raws",
|
| 734 |
-
"title": "グランベルム CM&PV10",
|
| 735 |
-
"season": null,
|
| 736 |
-
"episode": null,
|
| 737 |
-
"resolution": "1920x1080",
|
| 738 |
-
"source": "BD",
|
| 739 |
-
"special": "PV10"
|
| 740 |
-
}
|
| 741 |
-
},
|
| 742 |
-
{
|
| 743 |
-
"filename": "[Moozzi2] High School D×D New [SP02] NCED - 01 (BD 1920x1080 x.264 Flac)",
|
| 744 |
-
"errors": {
|
| 745 |
-
"episode": {
|
| 746 |
-
"gold": "2",
|
| 747 |
-
"pred": null
|
| 748 |
-
}
|
| 749 |
-
},
|
| 750 |
-
"gold": {
|
| 751 |
-
"group": "Moozzi2",
|
| 752 |
-
"title": "High School D×D New",
|
| 753 |
-
"season": null,
|
| 754 |
-
"episode": 2,
|
| 755 |
-
"resolution": "1920x1080",
|
| 756 |
-
"source": "BD",
|
| 757 |
-
"special": "NCED - 01"
|
| 758 |
-
},
|
| 759 |
-
"pred": {
|
| 760 |
-
"group": "Moozzi2",
|
| 761 |
-
"title": "High School D×D New",
|
| 762 |
-
"season": null,
|
| 763 |
-
"episode": null,
|
| 764 |
-
"resolution": "1920x1080",
|
| 765 |
-
"source": "BD",
|
| 766 |
-
"special": "NCED - 01"
|
| 767 |
-
}
|
| 768 |
-
},
|
| 769 |
-
{
|
| 770 |
-
"filename": "[SFEO-Raws] Koimonogatari - CM_01 (BD 720P x264 10bit AAC)[783E6EF2]",
|
| 771 |
-
"errors": {
|
| 772 |
-
"title": {
|
| 773 |
-
"gold": "koimonogatari",
|
| 774 |
-
"pred": "koimonogatari - cm_01"
|
| 775 |
-
}
|
| 776 |
-
},
|
| 777 |
-
"gold": {
|
| 778 |
-
"group": "SFEO-Raws",
|
| 779 |
-
"title": "Koimonogatari",
|
| 780 |
-
"season": null,
|
| 781 |
-
"episode": null,
|
| 782 |
-
"resolution": "720P",
|
| 783 |
-
"source": "BD",
|
| 784 |
-
"special": "CM_01"
|
| 785 |
-
},
|
| 786 |
-
"pred": {
|
| 787 |
-
"group": "SFEO-Raws",
|
| 788 |
-
"title": "Koimonogatari - CM_01",
|
| 789 |
-
"season": null,
|
| 790 |
-
"episode": null,
|
| 791 |
-
"resolution": "720P",
|
| 792 |
-
"source": "BD",
|
| 793 |
-
"special": "CM_01"
|
| 794 |
-
}
|
| 795 |
-
},
|
| 796 |
-
{
|
| 797 |
-
"filename": "[H720] Sangatsu no Lion CM01 (BD 1208x720 HEVC AAC)",
|
| 798 |
-
"errors": {
|
| 799 |
-
"group": {
|
| 800 |
-
"gold": null,
|
| 801 |
-
"pred": "h720"
|
| 802 |
-
},
|
| 803 |
-
"title": {
|
| 804 |
-
"gold": "h",
|
| 805 |
-
"pred": "sangatsu no lion"
|
| 806 |
-
},
|
| 807 |
-
"episode": {
|
| 808 |
-
"gold": "720",
|
| 809 |
-
"pred": null
|
| 810 |
-
},
|
| 811 |
-
"special": {
|
| 812 |
-
"gold": "cm",
|
| 813 |
-
"pred": "cm01"
|
| 814 |
-
}
|
| 815 |
-
},
|
| 816 |
-
"gold": {
|
| 817 |
-
"group": null,
|
| 818 |
-
"title": "H",
|
| 819 |
-
"season": null,
|
| 820 |
-
"episode": 720,
|
| 821 |
-
"resolution": "1208x720",
|
| 822 |
-
"source": "BD",
|
| 823 |
-
"special": "CM"
|
| 824 |
-
},
|
| 825 |
-
"pred": {
|
| 826 |
-
"group": "H720",
|
| 827 |
-
"title": "Sangatsu no Lion",
|
| 828 |
-
"season": null,
|
| 829 |
-
"episode": null,
|
| 830 |
-
"resolution": "1208x720",
|
| 831 |
-
"source": "BD",
|
| 832 |
-
"special": "CM01"
|
| 833 |
-
}
|
| 834 |
-
},
|
| 835 |
-
{
|
| 836 |
-
"filename": "[FZSD&DBD-Raws][King of Prism Dramatic Prism.1][PV][08][1080P][BDRip][HEVC-10bit][FLAC]",
|
| 837 |
-
"errors": {
|
| 838 |
-
"episode": {
|
| 839 |
-
"gold": null,
|
| 840 |
-
"pred": "8"
|
| 841 |
-
}
|
| 842 |
-
},
|
| 843 |
-
"gold": {
|
| 844 |
-
"group": "FZSD&DBD-Raws",
|
| 845 |
-
"title": "King of Prism Dramatic Prism.1",
|
| 846 |
-
"season": null,
|
| 847 |
-
"episode": null,
|
| 848 |
-
"resolution": "1080P",
|
| 849 |
-
"source": "BDRip",
|
| 850 |
-
"special": "08"
|
| 851 |
-
},
|
| 852 |
-
"pred": {
|
| 853 |
-
"group": "FZSD&DBD-Raws",
|
| 854 |
-
"title": "King of Prism Dramatic Prism.1",
|
| 855 |
-
"season": null,
|
| 856 |
-
"episode": 8,
|
| 857 |
-
"resolution": "1080P",
|
| 858 |
-
"source": "BDRip",
|
| 859 |
-
"special": "08"
|
| 860 |
-
}
|
| 861 |
-
},
|
| 862 |
-
{
|
| 863 |
-
"filename": "Robin Hood no Daibouken 49",
|
| 864 |
-
"errors": {
|
| 865 |
-
"episode": {
|
| 866 |
-
"gold": null,
|
| 867 |
-
"pred": "49"
|
| 868 |
-
}
|
| 869 |
-
},
|
| 870 |
-
"gold": {
|
| 871 |
-
"group": null,
|
| 872 |
-
"title": "Robin Hood no Daibouken 49",
|
| 873 |
-
"season": null,
|
| 874 |
-
"episode": null,
|
| 875 |
-
"resolution": null,
|
| 876 |
-
"source": null,
|
| 877 |
-
"special": null
|
| 878 |
-
},
|
| 879 |
-
"pred": {
|
| 880 |
-
"group": null,
|
| 881 |
-
"title": "Robin Hood no Daibouken 49",
|
| 882 |
-
"season": null,
|
| 883 |
-
"episode": 49,
|
| 884 |
-
"resolution": null,
|
| 885 |
-
"source": null,
|
| 886 |
-
"special": null
|
| 887 |
-
}
|
| 888 |
-
},
|
| 889 |
-
{
|
| 890 |
-
"filename": "[Moozzi2] Paniponi Dash! [SP02] NCED - 07 [ EP.07 ] (BD 1920x1080 x.264 Flac)",
|
| 891 |
-
"errors": {
|
| 892 |
-
"episode": {
|
| 893 |
-
"gold": "2",
|
| 894 |
-
"pred": null
|
| 895 |
-
}
|
| 896 |
-
},
|
| 897 |
-
"gold": {
|
| 898 |
-
"group": "Moozzi2",
|
| 899 |
-
"title": "Paniponi Dash!",
|
| 900 |
-
"season": null,
|
| 901 |
-
"episode": 2,
|
| 902 |
-
"resolution": "1920x1080",
|
| 903 |
-
"source": "BD",
|
| 904 |
-
"special": "NCED - 07"
|
| 905 |
-
},
|
| 906 |
-
"pred": {
|
| 907 |
-
"group": "Moozzi2",
|
| 908 |
-
"title": "Paniponi Dash!",
|
| 909 |
-
"season": null,
|
| 910 |
-
"episode": null,
|
| 911 |
-
"resolution": "1920x1080",
|
| 912 |
-
"source": "BD",
|
| 913 |
-
"special": "NCED - 07"
|
| 914 |
-
}
|
| 915 |
-
},
|
| 916 |
-
{
|
| 917 |
-
"filename": "[Moozzi2] Onegai My Melody [SP10] Kuromi Naration TV-CM - 01 [ 30Sec. ] (BD 1024x768 x.264 AAC)",
|
| 918 |
-
"errors": {
|
| 919 |
-
"title": {
|
| 920 |
-
"gold": "onegai my melody",
|
| 921 |
-
"pred": "onegai my melody [sp10] kuromi naration tv-cm"
|
| 922 |
-
},
|
| 923 |
-
"episode": {
|
| 924 |
-
"gold": "10",
|
| 925 |
-
"pred": "1"
|
| 926 |
-
}
|
| 927 |
-
},
|
| 928 |
-
"gold": {
|
| 929 |
-
"group": "Moozzi2",
|
| 930 |
-
"title": "Onegai My Melody",
|
| 931 |
-
"season": null,
|
| 932 |
-
"episode": 10,
|
| 933 |
-
"resolution": "1024x768",
|
| 934 |
-
"source": "BD",
|
| 935 |
-
"special": "CM - 01"
|
| 936 |
-
},
|
| 937 |
-
"pred": {
|
| 938 |
-
"group": "Moozzi2",
|
| 939 |
-
"title": "Onegai My Melody [SP10] Kuromi Naration TV-CM",
|
| 940 |
-
"season": null,
|
| 941 |
-
"episode": 1,
|
| 942 |
-
"resolution": "1024x768",
|
| 943 |
-
"source": "BD",
|
| 944 |
-
"special": "CM - 01"
|
| 945 |
-
}
|
| 946 |
-
},
|
| 947 |
-
{
|
| 948 |
-
"filename": "[DBD-Raws][Kuzu no Honkai][PV][02][1080P][BDRip][HEVC-10bit][FLAC]",
|
| 949 |
-
"errors": {
|
| 950 |
-
"episode": {
|
| 951 |
-
"gold": null,
|
| 952 |
-
"pred": "2"
|
| 953 |
-
}
|
| 954 |
-
},
|
| 955 |
-
"gold": {
|
| 956 |
-
"group": "DBD-Raws",
|
| 957 |
-
"title": "Kuzu no Honkai",
|
| 958 |
-
"season": null,
|
| 959 |
-
"episode": null,
|
| 960 |
-
"resolution": "1080P",
|
| 961 |
-
"source": "BDRip",
|
| 962 |
-
"special": "02"
|
| 963 |
-
},
|
| 964 |
-
"pred": {
|
| 965 |
-
"group": "DBD-Raws",
|
| 966 |
-
"title": "Kuzu no Honkai",
|
| 967 |
-
"season": null,
|
| 968 |
-
"episode": 2,
|
| 969 |
-
"resolution": "1080P",
|
| 970 |
-
"source": "BDRip",
|
| 971 |
-
"special": "02"
|
| 972 |
-
}
|
| 973 |
-
},
|
| 974 |
-
{
|
| 975 |
-
"filename": "[DBD-Raws][One Piece Wano Arc][Soushuuhen][03][1080P][BDRip][HEVC-10bit][FLAC]",
|
| 976 |
-
"errors": {
|
| 977 |
-
"title": {
|
| 978 |
-
"gold": "one piece wano arc soushuuhen",
|
| 979 |
-
"pred": "one piece wano arc"
|
| 980 |
-
}
|
| 981 |
-
},
|
| 982 |
-
"gold": {
|
| 983 |
-
"group": "DBD-Raws",
|
| 984 |
-
"title": "One Piece Wano Arc Soushuuhen",
|
| 985 |
-
"season": null,
|
| 986 |
-
"episode": 3,
|
| 987 |
-
"resolution": "1080P",
|
| 988 |
-
"source": "BDRip",
|
| 989 |
-
"special": null
|
| 990 |
-
},
|
| 991 |
-
"pred": {
|
| 992 |
-
"group": "DBD-Raws",
|
| 993 |
-
"title": "One Piece Wano Arc",
|
| 994 |
-
"season": null,
|
| 995 |
-
"episode": 3,
|
| 996 |
-
"resolution": "1080P",
|
| 997 |
-
"source": "BDRip",
|
| 998 |
-
"special": null
|
| 999 |
-
}
|
| 1000 |
-
},
|
| 1001 |
-
{
|
| 1002 |
-
"filename": "[LAC][Gintama][196][GB][R10]",
|
| 1003 |
-
"errors": {
|
| 1004 |
-
"group": {
|
| 1005 |
-
"gold": null,
|
| 1006 |
-
"pred": "lac"
|
| 1007 |
-
},
|
| 1008 |
-
"title": {
|
| 1009 |
-
"gold": "lac gintama 196 gb r",
|
| 1010 |
-
"pred": "gintama"
|
| 1011 |
-
},
|
| 1012 |
-
"episode": {
|
| 1013 |
-
"gold": "10",
|
| 1014 |
-
"pred": "196"
|
| 1015 |
-
},
|
| 1016 |
-
"source": {
|
| 1017 |
-
"gold": null,
|
| 1018 |
-
"pred": "gb"
|
| 1019 |
-
}
|
| 1020 |
-
},
|
| 1021 |
-
"gold": {
|
| 1022 |
-
"group": null,
|
| 1023 |
-
"title": "LAC Gintama 196 GB R",
|
| 1024 |
-
"season": null,
|
| 1025 |
-
"episode": 10,
|
| 1026 |
-
"resolution": null,
|
| 1027 |
-
"source": null,
|
| 1028 |
-
"special": null
|
| 1029 |
-
},
|
| 1030 |
-
"pred": {
|
| 1031 |
-
"group": "LAC",
|
| 1032 |
-
"title": "Gintama",
|
| 1033 |
-
"season": null,
|
| 1034 |
-
"episode": 196,
|
| 1035 |
-
"resolution": null,
|
| 1036 |
-
"source": "GB",
|
| 1037 |
-
"special": null
|
| 1038 |
-
}
|
| 1039 |
-
},
|
| 1040 |
-
{
|
| 1041 |
-
"filename": "[DBD-Raws][Date a Live][Director's Cut][PV][07][1080P][BDRip][HEVC-10bit][FLAC]",
|
| 1042 |
-
"errors": {
|
| 1043 |
-
"title": {
|
| 1044 |
-
"gold": "date a live director's cut",
|
| 1045 |
-
"pred": "date a live"
|
| 1046 |
-
},
|
| 1047 |
-
"episode": {
|
| 1048 |
-
"gold": null,
|
| 1049 |
-
"pred": "7"
|
| 1050 |
-
}
|
| 1051 |
-
},
|
| 1052 |
-
"gold": {
|
| 1053 |
-
"group": "DBD-Raws",
|
| 1054 |
-
"title": "Date a Live Director's Cut",
|
| 1055 |
-
"season": null,
|
| 1056 |
-
"episode": null,
|
| 1057 |
-
"resolution": "1080P",
|
| 1058 |
-
"source": "BDRip",
|
| 1059 |
-
"special": "07"
|
| 1060 |
-
},
|
| 1061 |
-
"pred": {
|
| 1062 |
-
"group": "DBD-Raws",
|
| 1063 |
-
"title": "Date a Live",
|
| 1064 |
-
"season": null,
|
| 1065 |
-
"episode": 7,
|
| 1066 |
-
"resolution": "1080P",
|
| 1067 |
-
"source": "BDRip",
|
| 1068 |
-
"special": "07"
|
| 1069 |
-
}
|
| 1070 |
-
},
|
| 1071 |
-
{
|
| 1072 |
-
"filename": "[DBD-Raws][Nageki no Bourei wa Intai Shitai][PV][09][1080P][BDRip][HEVC-10bit][FLAC]",
|
| 1073 |
-
"errors": {
|
| 1074 |
-
"episode": {
|
| 1075 |
-
"gold": null,
|
| 1076 |
-
"pred": "9"
|
| 1077 |
-
}
|
| 1078 |
-
},
|
| 1079 |
-
"gold": {
|
| 1080 |
-
"group": "DBD-Raws",
|
| 1081 |
-
"title": "Nageki no Bourei wa Intai Shitai",
|
| 1082 |
-
"season": null,
|
| 1083 |
-
"episode": null,
|
| 1084 |
-
"resolution": "1080P",
|
| 1085 |
-
"source": "BDRip",
|
| 1086 |
-
"special": "09"
|
| 1087 |
-
},
|
| 1088 |
-
"pred": {
|
| 1089 |
-
"group": "DBD-Raws",
|
| 1090 |
-
"title": "Nageki no Bourei wa Intai Shitai",
|
| 1091 |
-
"season": null,
|
| 1092 |
-
"episode": 9,
|
| 1093 |
-
"resolution": "1080P",
|
| 1094 |
-
"source": "BDRip",
|
| 1095 |
-
"special": "09"
|
| 1096 |
-
}
|
| 1097 |
-
},
|
| 1098 |
-
{
|
| 1099 |
-
"filename": "[RUELL-Next] Fruits Basket NCOP 1 (DVD 768x576 x264 AC3 384K) [FF1CA8EF]",
|
| 1100 |
-
"errors": {
|
| 1101 |
-
"title": {
|
| 1102 |
-
"gold": "fruits basket",
|
| 1103 |
-
"pred": "fruits basket ncop 1"
|
| 1104 |
-
},
|
| 1105 |
-
"special": {
|
| 1106 |
-
"gold": "ncop 1",
|
| 1107 |
-
"pred": "ncop1"
|
| 1108 |
-
}
|
| 1109 |
-
},
|
| 1110 |
-
"gold": {
|
| 1111 |
-
"group": "RUELL-Next",
|
| 1112 |
-
"title": "Fruits Basket",
|
| 1113 |
-
"season": null,
|
| 1114 |
-
"episode": null,
|
| 1115 |
-
"resolution": "768x576",
|
| 1116 |
-
"source": "DVD",
|
| 1117 |
-
"special": "NCOP 1"
|
| 1118 |
-
},
|
| 1119 |
-
"pred": {
|
| 1120 |
-
"group": "RUELL-Next",
|
| 1121 |
-
"title": "Fruits Basket NCOP 1",
|
| 1122 |
-
"season": null,
|
| 1123 |
-
"episode": null,
|
| 1124 |
-
"resolution": "768x576",
|
| 1125 |
-
"source": "DVD",
|
| 1126 |
-
"special": "NCOP1"
|
| 1127 |
-
}
|
| 1128 |
-
},
|
| 1129 |
-
{
|
| 1130 |
-
"filename": "[アニメ DVD] ミスター味っ子 第69話 「島巡り磯鍋競争!7包丁人・大石老師登場」 (640x480 WMV9)",
|
| 1131 |
-
"errors": {
|
| 1132 |
-
"source": {
|
| 1133 |
-
"gold": null,
|
| 1134 |
-
"pred": "dvd"
|
| 1135 |
-
}
|
| 1136 |
-
},
|
| 1137 |
-
"gold": {
|
| 1138 |
-
"group": "アニメ DVD",
|
| 1139 |
-
"title": "ミスター味っ子",
|
| 1140 |
-
"season": null,
|
| 1141 |
-
"episode": 69,
|
| 1142 |
-
"resolution": "640x480",
|
| 1143 |
-
"source": null,
|
| 1144 |
-
"special": null
|
| 1145 |
-
},
|
| 1146 |
-
"pred": {
|
| 1147 |
-
"group": "アニメ DVD",
|
| 1148 |
-
"title": "ミスター味っ子",
|
| 1149 |
-
"season": null,
|
| 1150 |
-
"episode": 69,
|
| 1151 |
-
"resolution": "640x480",
|
| 1152 |
-
"source": "DVD",
|
| 1153 |
-
"special": null
|
| 1154 |
-
}
|
| 1155 |
-
}
|
| 1156 |
-
]
|
| 1157 |
}
|
| 1158 |
}
|
| 1159 |
-
}
|
|
|
|
| 2 |
"primary_metric": "normalized_only",
|
| 3 |
"modes": {
|
| 4 |
"model_only": {
|
|
|
|
| 5 |
"constrain_bio": false,
|
| 6 |
"sample_count": 1024,
|
| 7 |
"field_accuracy": {
|
|
|
|
| 308 |
]
|
| 309 |
},
|
| 310 |
"normalized_only": {
|
|
|
|
| 311 |
"constrain_bio": true,
|
| 312 |
"sample_count": 1024,
|
| 313 |
"field_accuracy": {
|
|
|
|
| 531 |
}
|
| 532 |
}
|
| 533 |
]
|
| 534 |
}
|
| 535 |
}
|
| 536 |
+
}
|
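For readers of the removed metrics files: each case record above pairs a `gold` and a `pred` field dictionary with an `errors` map that holds only the fields that disagree, shown in lowercased string form (for example `"gold": null, "pred": "2"` for `episode`). The sketch below is one way such a record could be derived. It is not taken from the repository: the helper names `normalize` and `field_errors` are hypothetical, and the normalization (stringify, strip, lowercase) is an assumption inferred from the records shown.

```python
from typing import Any, Dict, Optional

# Field names as they appear in the "gold" / "pred" objects above.
FIELDS = ("group", "title", "season", "episode", "resolution", "source", "special")


def normalize(value: Any) -> Optional[str]:
    """Assumed normalization: keep None, otherwise compare lowercased strings."""
    if value is None:
        return None
    return str(value).strip().lower()


def field_errors(gold: Dict[str, Any], pred: Dict[str, Any]) -> Dict[str, Dict[str, Optional[str]]]:
    """Return only the fields whose normalized gold and pred values differ."""
    errors: Dict[str, Dict[str, Optional[str]]] = {}
    for field in FIELDS:
        g = normalize(gold.get(field))
        p = normalize(pred.get(field))
        if g != p:
            errors[field] = {"gold": g, "pred": p}
    return errors


# Values copied from the "[DBD-Raws][Kuzu no Honkai][PV][02]..." record above.
gold = {
    "group": "DBD-Raws",
    "title": "Kuzu no Honkai",
    "season": None,
    "episode": None,
    "resolution": "1080P",
    "source": "BDRip",
    "special": "02",
}
pred = dict(gold, episode=2)

print(field_errors(gold, pred))
```

Comparing normalized strings rather than raw values keeps case and int/str differences from counting as errors, while a spacing difference such as `NCOP 1` versus `NCOP1` is still reported.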
train.py
CHANGED
|
@@ -230,7 +230,6 @@ def parse_exact_metrics(
|
|
| 230 |
id2label: Dict[int, str],
|
| 231 |
max_length: int,
|
| 232 |
limit: Optional[int],
|
| 233 |
-
use_rules: bool = False,
|
| 234 |
constrain_bio: bool = True,
|
| 235 |
) -> Dict:
|
| 236 |
"""Evaluate end-to-end field exact match on filenames, not just token loss."""
|
|
@@ -249,7 +248,7 @@ def parse_exact_metrics(
|
|
| 249 |
available = max(0, max_length - 2)
|
| 250 |
tokens = tokens[:available]
|
| 251 |
gold_labels = gold_labels[:available]
|
| 252 |
-
gold = postprocess(tokens, gold_labels, tokenizer=tokenizer
|
| 253 |
gold_entities = {label.split("-", 1)[1] for label in gold_labels if label.startswith(("B-", "I-"))}
|
| 254 |
for optional_field, entity in (("episode", "EPISODE"), ("season", "SEASON")):
|
| 255 |
if entity not in gold_entities:
|
|
@@ -261,7 +260,6 @@ def parse_exact_metrics(
|
|
| 261 |
id2label,
|
| 262 |
max_length=max_length,
|
| 263 |
debug=False,
|
| 264 |
-
use_rules=use_rules,
|
| 265 |
constrain_bio=constrain_bio,
|
| 266 |
)
|
| 267 |
|
|
@@ -298,7 +296,6 @@ def parse_exact_metrics(
|
|
| 298 |
total = counter.get("full_total", 0)
|
| 299 |
correct = counter.get("full_correct", 0)
|
| 300 |
return {
|
| 301 |
-
"use_rules": use_rules,
|
| 302 |
"constrain_bio": constrain_bio,
|
| 303 |
"sample_count": total,
|
| 304 |
"field_accuracy": field_accuracy,
|
|
@@ -320,9 +317,8 @@ def parse_exact_metrics_all_modes(
|
|
| 320 |
limit: Optional[int],
|
| 321 |
) -> Dict:
|
| 322 |
modes = {
|
| 323 |
-
"model_only": {"
|
| 324 |
-
"normalized_only": {"
|
| 325 |
-
"rule_assisted": {"use_rules": True, "constrain_bio": True},
|
| 326 |
}
|
| 327 |
return {
|
| 328 |
"primary_metric": "normalized_only",
|
|
@@ -334,7 +330,6 @@ def parse_exact_metrics_all_modes(
|
|
| 334 |
id2label,
|
| 335 |
max_length,
|
| 336 |
limit,
|
| 337 |
-
use_rules=settings["use_rules"],
|
| 338 |
constrain_bio=settings["constrain_bio"],
|
| 339 |
)
|
| 340 |
for name, settings in modes.items()
|
|
|
|
| 230 |
id2label: Dict[int, str],
|
| 231 |
max_length: int,
|
| 232 |
limit: Optional[int],
|
|
|
|
| 233 |
constrain_bio: bool = True,
|
| 234 |
) -> Dict:
|
| 235 |
"""Evaluate end-to-end field exact match on filenames, not just token loss."""
|
|
|
|
| 248 |
available = max(0, max_length - 2)
|
| 249 |
tokens = tokens[:available]
|
| 250 |
gold_labels = gold_labels[:available]
|
| 251 |
+
gold = postprocess(tokens, gold_labels, tokenizer=tokenizer)
|
| 252 |
gold_entities = {label.split("-", 1)[1] for label in gold_labels if label.startswith(("B-", "I-"))}
|
| 253 |
for optional_field, entity in (("episode", "EPISODE"), ("season", "SEASON")):
|
| 254 |
if entity not in gold_entities:
|
|
|
|
| 260 |
id2label,
|
| 261 |
max_length=max_length,
|
| 262 |
debug=False,
|
|
|
|
| 263 |
constrain_bio=constrain_bio,
|
| 264 |
)
|
| 265 |
|
|
|
|
| 296 |
total = counter.get("full_total", 0)
|
| 297 |
correct = counter.get("full_correct", 0)
|
| 298 |
return {
|
|
|
|
| 299 |
"constrain_bio": constrain_bio,
|
| 300 |
"sample_count": total,
|
| 301 |
"field_accuracy": field_accuracy,
|
|
|
|
| 317 |
limit: Optional[int],
|
| 318 |
) -> Dict:
|
| 319 |
modes = {
|
| 320 |
+
"model_only": {"constrain_bio": False},
|
| 321 |
+
"normalized_only": {"constrain_bio": True},
|
|
|
|
| 322 |
}
|
| 323 |
return {
|
| 324 |
"primary_metric": "normalized_only",
|
|
|
|
| 330 |
id2label,
|
| 331 |
max_length,
|
| 332 |
limit,
|
|
|
|
| 333 |
constrain_bio=settings["constrain_bio"],
|
| 334 |
)
|
| 335 |
for name, settings in modes.items()
|
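After this change, `parse_exact_metrics_all_modes` evaluates two modes, `model_only` and `normalized_only`, which differ only in `constrain_bio`. The sketch below shows the shape of that loop in isolation. It is a simplified illustration, not the code in `train.py`: `evaluate_all_modes`, `toy_parse`, and the `full_accuracy` key are hypothetical names, and the parser is a stand-in that takes no model or tokenizer.

```python
from typing import Any, Callable, Dict, List, Tuple

# The two modes that remain after the rule-assisted mode was removed.
MODES: Dict[str, Dict[str, bool]] = {
    "model_only": {"constrain_bio": False},
    "normalized_only": {"constrain_bio": True},
}

Sample = Tuple[str, Dict[str, Any]]  # (filename, gold fields)


def evaluate_all_modes(
    samples: List[Sample],
    parse_fn: Callable[..., Dict[str, Any]],
) -> Dict[str, Any]:
    """Run the same samples through every mode and report full exact match."""
    results: Dict[str, Any] = {}
    for name, settings in MODES.items():
        correct = sum(
            1 for filename, gold in samples
            if parse_fn(filename, constrain_bio=settings["constrain_bio"]) == gold
        )
        total = len(samples)
        results[name] = {
            "constrain_bio": settings["constrain_bio"],
            "sample_count": total,
            # Hypothetical key; the real report also carries per-field accuracy.
            "full_accuracy": correct / total if total else 0.0,
        }
    return {"primary_metric": "normalized_only", "modes": results}


def toy_parse(filename: str, constrain_bio: bool) -> Dict[str, Any]:
    """Stand-in parser: pretends unconstrained decoding garbles the title case."""
    title = filename if constrain_bio else filename.lower()
    return {"title": title}


report = evaluate_all_modes([("Gintama", {"title": "Gintama"})], toy_parse)
print(report["modes"]["normalized_only"]["full_accuracy"])
```

Keeping the mode table as plain data means dropping a mode, as this commit does with `rule_assisted`, is a one-line change and the report keys follow automatically.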