ModerRAS committed on
Commit 116c87c · 1 Parent(s): f2ec095

Remove structural parser rule assists

MAINTENANCE.md CHANGED
@@ -121,11 +121,12 @@ uv run python benchmark_inference.py --model-dir . --onnx exports/anime_filename
121
  ```
122
 
123
  The default parser path is thin runtime: model logits, constrained BIO, entity
124
- aggregation, and light string/number normalization. `--rule-assist` is a
125
- compatibility/diagnostic mode only; do not use it as the primary quality metric.
 
126
 
127
  默认解析路径是薄层运行时:模型 logits、约束 BIO、实体聚合和轻量字符串/数字规范化。
128
- `--rule-assist` 只是兼容/诊断模式,不要把它作为主质量指标。
129
 
130
  ## Dataset Submodule / 数据集子模块
131
 
 
121
  ```
122
 
123
  The default parser path is thin runtime: model logits, constrained BIO, entity
124
+ aggregation, and light string/number normalization. Do not add structural
125
+ filename regex assists back to the default runtime; parser quality should come
126
+ from labels and model training.
127
 
128
  默认解析路径是薄层运行时:模型 logits、约束 BIO、实体聚合和轻量字符串/数字规范化。
129
+ 不要把结构化文件名正则辅助重新加回默认运行时;解析质量应来自标签和模型训练。
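
Reviewer note: the thin runtime described above (entity aggregation plus light string/number normalization) can be pictured with the minimal sketch below. It is an illustration under stated assumptions, not the repository's code: the helper names `aggregate_bio` and `normalize_fields`, the `B-<field>`/`I-<field>` label spelling, and the choice of `episode` and `season` as numeric fields are all hypothetical.

```python
from typing import Dict, List, Optional, Union

NUMERIC_FIELDS = ("episode", "season")  # assumed numeric field names


def aggregate_bio(chars: List[str], labels: List[str]) -> Dict[str, str]:
    """Join the characters of each BIO span; keep the first span per field."""
    fields: Dict[str, str] = {}
    current: Optional[str] = None
    buffer: List[str] = []

    for ch, label in zip(chars, labels):
        if label.startswith("I-") and current == label[2:]:
            buffer.append(ch)  # continue the open span
            continue
        if current is not None:
            fields.setdefault(current, "".join(buffer))  # close the open span
        if label.startswith("B-"):
            current, buffer = label[2:], [ch]
        else:
            current, buffer = None, []  # "O" or a stray I- tag
    if current is not None:
        fields.setdefault(current, "".join(buffer))
    return fields


def normalize_fields(fields: Dict[str, str]) -> Dict[str, Union[str, int]]:
    """Thin normalization only: trim whitespace/brackets, convert numbers."""
    out: Dict[str, Union[str, int]] = {}
    for name, text in fields.items():
        text = text.strip().strip("[]()【】").strip()
        if name in NUMERIC_FIELDS and text.isdecimal():
            out[name] = int(text)
        else:
            out[name] = text
    return out
```

Note that nothing here inspects the filename with structural regexes: the fields come only from the labels, which is the property this commit asks maintainers to preserve.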
130
 
131
  ## Dataset Submodule / 数据集子模块
132
 
README.md CHANGED
@@ -146,11 +146,11 @@ Current published checkpoint:
146
  | Focus held-out, default thin runtime / 困难抽样,默认薄层运行时 | 1017/1024 full match = `99.32%` |
147
  | Token/entity eval / token/entity 评估 | F1 `0.9972`, token accuracy `0.9995` |
148
  | ONNX parity / ONNX 误差 | max abs diff `4.0531e-05` |
149
- | CPU thin-runtime latency / CPU 薄层运行时延迟 | ONNX avg `13.08 ms`, P95 `15.95 ms` |
150
 
151
- **中文**:当前发布模型是“两阶段训练”产物:先在 `datasets/AnimeName/dmhy_weak_char.jsonl` 上全量 CUDA 重训,再做 thin hard-case focus 微调。细节见 `training_lineage.json`。README 主指标以 `model-only` 和默认薄层 `normalized-only` 为准;`--rule-assist` 只保留为兼容/诊断对照,不再作为模型质量标准
152
 
153
- **English**: The published checkpoint was trained in two stages: a full CUDA fine-tune on `datasets/AnimeName/dmhy_weak_char.jsonl`, followed by a thin hard-case focus fine-tune. See `training_lineage.json` for details. README quality numbers prioritize `model-only` and the default thin `normalized-only` runtime; `--rule-assist` is retained only for compatibility/diagnostics.
154
 
155
  Run regression:
156
 
@@ -177,8 +177,8 @@ decoding, entity aggregation, and light string/number normalization:
177
 
178
  | Backend / 后端 | Load ms / 加载 ms | Avg ms / 平均 ms | P50 ms | P95 ms | P99 ms | files/s |
179
  | --- | ---: | ---: | ---: | ---: | ---: | ---: |
180
- | PyTorch | 49.07 | 15.16 | 14.87 | 18.50 | 21.91 | 66.0 |
181
- | ONNX Runtime | 568.85 | 13.08 | 12.82 | 15.95 | 20.19 | 76.5 |
182
 
183
  **中文**:这是完整薄层 parser 的端到端延迟,不是只测模型 forward。移动端实现应复用 ONNX session,并保持 tokenizer/BIO/薄规范化逻辑一致。
184
 
 
146
  | Focus held-out, default thin runtime / 困难抽样,默认薄层运行时 | 1017/1024 full match = `99.32%` |
147
  | Token/entity eval / token/entity 评估 | F1 `0.9972`, token accuracy `0.9995` |
148
  | ONNX parity / ONNX 误差 | max abs diff `4.0531e-05` |
149
+ | CPU thin-runtime latency / CPU 薄层运行时延迟 | ONNX avg `13.18 ms`, P95 `16.70 ms` |
150
 
151
+ **中文**:当前发布模型是“两阶段训练”产物:先在 `datasets/AnimeName/dmhy_weak_char.jsonl` 上全量 CUDA 重训,再做 thin hard-case focus 微调。细节见 `training_lineage.json`。README 主指标以 `model-only` 和默认薄层 `normalized-only` 为准;旧版结构规则辅助层已移除,不再作为运行时或质量对照
152
 
153
+ **English**: The published checkpoint was trained in two stages: a full CUDA fine-tune on `datasets/AnimeName/dmhy_weak_char.jsonl`, followed by a thin hard-case focus fine-tune. See `training_lineage.json` for details. README quality numbers prioritize `model-only` and the default thin `normalized-only` runtime; structural filename assists have been removed from the runtime and quality reports.
154
 
155
  Run regression:
156
 
 
177
 
178
  | Backend / 后端 | Load ms / 加载 ms | Avg ms / 平均 ms | P50 ms | P95 ms | P99 ms | files/s |
179
  | --- | ---: | ---: | ---: | ---: | ---: | ---: |
180
+ | PyTorch | 76.56 | 16.85 | 16.21 | 22.84 | 28.31 | 59.4 |
181
+ | ONNX Runtime | 49.74 | 13.18 | 12.86 | 16.70 | 18.06 | 75.9 |
182
 
183
  **中文**:这是完整薄层 parser 的端到端延迟,不是只测模型 forward。移动端实现应复用 ONNX session,并保持 tokenizer/BIO/薄规范化逻辑一致。
184
 
benchmark_inference.py CHANGED
@@ -95,8 +95,6 @@ def main() -> None:
95
  parser.add_argument("--torch-threads", type=int, default=1, help="torch intra-op thread count")
96
  parser.add_argument("--ort-threads", type=int, default=1, help="ONNX Runtime intra/inter-op thread count")
97
  parser.add_argument("--no-constrained-bio", action="store_true", help="Use greedy labels for PyTorch backend")
98
- parser.add_argument("--rule-assist", action="store_true", help="Enable legacy structural postprocessing")
99
- parser.add_argument("--no-rule-assist", action="store_true", help=argparse.SUPPRESS)
100
  parser.add_argument("--output", default=None, help="Optional JSON output path")
101
  args = parser.parse_args()
102
 
@@ -128,7 +126,6 @@ def main() -> None:
128
  id2label,
129
  max_length=resolved_max_length,
130
  debug=False,
131
- use_rules=args.rule_assist and not args.no_rule_assist,
132
  constrain_bio=not args.no_constrained_bio,
133
  )
134
 
@@ -150,7 +147,7 @@ def main() -> None:
150
  load_ms = (time.perf_counter() - load_start) * 1000.0
151
 
152
  def parse_onnx(filename: str) -> Dict:
153
- return onnx_parser.parse(filename, use_rules=args.rule_assist and not args.no_rule_assist)
154
 
155
  raw = run_benchmark("onnxruntime", parse_onnx, filenames, args.warmup, args.repeat)
156
  results.append(summarize(raw["name"], load_ms, raw["latencies_ms"]))
@@ -164,7 +161,6 @@ def main() -> None:
164
  "warmup": args.warmup,
165
  "torch_threads": args.torch_threads,
166
  "ort_threads": args.ort_threads,
167
- "use_rules": args.rule_assist and not args.no_rule_assist,
168
  "constrain_bio": not args.no_constrained_bio,
169
  "results": results,
170
  }
 
95
  parser.add_argument("--torch-threads", type=int, default=1, help="torch intra-op thread count")
96
  parser.add_argument("--ort-threads", type=int, default=1, help="ONNX Runtime intra/inter-op thread count")
97
  parser.add_argument("--no-constrained-bio", action="store_true", help="Use greedy labels for PyTorch backend")
 
 
98
  parser.add_argument("--output", default=None, help="Optional JSON output path")
99
  args = parser.parse_args()
100
 
 
126
  id2label,
127
  max_length=resolved_max_length,
128
  debug=False,
 
129
  constrain_bio=not args.no_constrained_bio,
130
  )
131
 
 
147
  load_ms = (time.perf_counter() - load_start) * 1000.0
148
 
149
  def parse_onnx(filename: str) -> Dict:
150
+ return onnx_parser.parse(filename)
151
 
152
  raw = run_benchmark("onnxruntime", parse_onnx, filenames, args.warmup, args.repeat)
153
  results.append(summarize(raw["name"], load_ms, raw["latencies_ms"]))
 
161
  "warmup": args.warmup,
162
  "torch_threads": args.torch_threads,
163
  "ort_threads": args.ort_threads,
 
164
  "constrain_bio": not args.no_constrained_bio,
165
  "results": results,
166
  }
benchmark_results.json CHANGED
@@ -7,32 +7,31 @@
7
  "warmup": 20,
8
  "torch_threads": 1,
9
  "ort_threads": 1,
10
- "use_rules": false,
11
  "constrain_bio": true,
12
  "results": [
13
  {
14
  "name": "pytorch",
15
- "load_ms": 49.07089995685965,
16
  "runs": 520,
17
- "avg_ms": 15.156135000646687,
18
- "p50_ms": 14.874850050546229,
19
- "p95_ms": 18.50034496746957,
20
- "p99_ms": 21.91202303394671,
21
- "min_ms": 11.207600007764995,
22
- "max_ms": 26.899200049228966,
23
- "throughput_fps": 65.97988207134152
24
  },
25
  {
26
  "name": "onnxruntime",
27
- "load_ms": 568.8452000031248,
28
  "runs": 520,
29
- "avg_ms": 13.076459232475967,
30
- "p50_ms": 12.81869993545115,
31
- "p95_ms": 15.947990084532643,
32
- "p99_ms": 20.187044028425575,
33
- "min_ms": 10.0586999906227,
34
- "max_ms": 22.88920001592487,
35
- "throughput_fps": 76.4733007782761
36
  }
37
  ]
38
  }
 
7
  "warmup": 20,
8
  "torch_threads": 1,
9
  "ort_threads": 1,
 
10
  "constrain_bio": true,
11
  "results": [
12
  {
13
  "name": "pytorch",
14
+ "load_ms": 76.55749993864447,
15
  "runs": 520,
16
+ "avg_ms": 16.846879808312785,
17
+ "p50_ms": 16.207700013183057,
18
+ "p95_ms": 22.843200032366425,
19
+ "p99_ms": 28.308318012859665,
20
+ "min_ms": 11.152399936690927,
21
+ "max_ms": 34.10990000702441,
22
+ "throughput_fps": 59.35817263363916
23
  },
24
  {
25
  "name": "onnxruntime",
26
+ "load_ms": 49.74160005804151,
27
  "runs": 520,
28
+ "avg_ms": 13.178169615835381,
29
+ "p50_ms": 12.862899922765791,
30
+ "p95_ms": 16.696884995326396,
31
+ "p99_ms": 18.06362595874816,
32
+ "min_ms": 9.811799973249435,
33
+ "max_ms": 20.784800057299435,
34
+ "throughput_fps": 75.88307247148819
35
  }
36
  ]
37
  }
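
Reviewer note: the fields in `benchmark_results.json` can be reproduced from a list of per-file latencies roughly as sketched below. The recorded `throughput_fps` values are consistent with `1000 / avg_ms` (for example, 1000 / 13.178 ≈ 75.88). The function names and the linear-interpolation percentile are assumptions; the percentile method used by `benchmark_inference.py` is not shown in this diff.

```python
import statistics
from typing import Dict, List


def percentile(sorted_values: List[float], q: float) -> float:
    """Linear-interpolation percentile for q in [0, 100] (assumed method)."""
    if not sorted_values:
        raise ValueError("no latency samples")
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def summarize_latencies(name: str, load_ms: float, latencies_ms: List[float]) -> Dict:
    """Build one entry of the "results" list from raw per-file latencies."""
    ordered = sorted(latencies_ms)
    avg = statistics.fmean(ordered)
    return {
        "name": name,
        "load_ms": load_ms,
        "runs": len(ordered),
        "avg_ms": avg,
        "p50_ms": percentile(ordered, 50),
        "p95_ms": percentile(ordered, 95),
        "p99_ms": percentile(ordered, 99),
        "min_ms": ordered[0],
        "max_ms": ordered[-1],
        "throughput_fps": 1000.0 / avg,  # files per second at the mean latency
    }
```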
case_metrics.json CHANGED
@@ -6,7 +6,6 @@
6
  "case_file": "data/parser_regression_cases.json",
7
  "tokenizer_variant": "char",
8
  "max_length": 128,
9
- "use_rules": false,
10
  "constrain_bio": false,
11
  "case_count": 26,
12
  "full_correct": 25,
@@ -606,574 +605,6 @@
606
  "case_file": "data/parser_regression_cases.json",
607
  "tokenizer_variant": "char",
608
  "max_length": 128,
609
- "use_rules": false,
610
- "constrain_bio": true,
611
- "case_count": 26,
612
- "full_correct": 26,
613
- "full_accuracy": 1.0,
614
- "field_correct": {
615
- "group": 22,
616
- "title": 26,
617
- "episode": 26,
618
- "resolution": 26,
619
- "source": 19,
620
- "season": 9,
621
- "special": 5
622
- },
623
- "field_total": {
624
- "group": 22,
625
- "title": 26,
626
- "episode": 26,
627
- "resolution": 26,
628
- "source": 19,
629
- "season": 9,
630
- "special": 5
631
- },
632
- "field_accuracy": {
633
- "episode": 1.0,
634
- "group": 1.0,
635
- "resolution": 1.0,
636
- "season": 1.0,
637
- "source": 1.0,
638
- "special": 1.0,
639
- "title": 1.0
640
- },
641
- "failures": [],
642
- "results": [
643
- {
644
- "id": "lolihouse_dash_episode",
645
- "filename": "[LoliHouse] Yomi no Tsugai - 07 [WebRip 1080p HEVC-10bit AAC ASSx2]",
646
- "ok": true,
647
- "errors": {},
648
- "expected": {
649
- "group": "LoliHouse",
650
- "title": "Yomi no Tsugai",
651
- "episode": 7,
652
- "resolution": "1080p",
653
- "source": "WebRip"
654
- },
655
- "pred": {
656
- "episode": 7,
657
- "group": "LoliHouse",
658
- "resolution": "1080p",
659
- "source": "WebRip",
660
- "title": "Yomi no Tsugai"
661
- }
662
- },
663
- {
664
- "id": "dot_season_episode_no_group",
665
- "filename": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub",
666
- "ok": true,
667
- "errors": {},
668
- "expected": {
669
- "title": "Witch.Hat.Atelier",
670
- "season": 1,
671
- "episode": 7,
672
- "group": null,
673
- "resolution": "1080p",
674
- "source": "NF"
675
- },
676
- "pred": {
677
- "episode": 7,
678
- "group": null,
679
- "resolution": "1080p",
680
- "season": 1,
681
- "source": "NF",
682
- "title": "Witch.Hat.Atelier"
683
- }
684
- },
685
- {
686
- "id": "ani_cjk_season_dash_episode",
687
- "filename": "[ANi] 異世界悠閒農家 2 - 06 [1080P][Baha][WEB-DL][AAC AVC][CHT]",
688
- "ok": true,
689
- "errors": {},
690
- "expected": {
691
- "group": "ANi",
692
- "title": "異世界悠閒農家",
693
- "season": 2,
694
- "episode": 6,
695
- "resolution": "1080P",
696
- "source": "Baha"
697
- },
698
- "pred": {
699
- "episode": 6,
700
- "group": "ANi",
701
- "resolution": "1080P",
702
- "season": 2,
703
- "source": "Baha",
704
- "title": "異世界悠閒農家"
705
- }
706
- },
707
- {
708
- "id": "kisssub_bracket_title_episode",
709
- "filename": "[KissSub][Shunkashuutou Daikousha - Haru no Mai][05][1080P][GB][MP4]",
710
- "ok": true,
711
- "errors": {},
712
- "expected": {
713
- "group": "KissSub",
714
- "title": "Shunkashuutou Daikousha - Haru no Mai",
715
- "episode": 5,
716
- "resolution": "1080P",
717
- "source": "GB"
718
- },
719
- "pred": {
720
- "episode": 5,
721
- "group": "KissSub",
722
- "resolution": "1080P",
723
- "source": "GB",
724
- "title": "Shunkashuutou Daikousha - Haru no Mai"
725
- }
726
- },
727
- {
728
- "id": "airotabracket_title_episode",
729
- "filename": "[Airota][Sousou no Frieren][29][1080p AVC AAC][CHT]",
730
- "ok": true,
731
- "errors": {},
732
- "expected": {
733
- "group": "Airota",
734
- "title": "Sousou no Frieren",
735
- "episode": 29,
736
- "resolution": "1080p",
737
- "source": "CHT"
738
- },
739
- "pred": {
740
- "episode": 29,
741
- "group": "Airota",
742
- "resolution": "1080p",
743
- "source": "CHT",
744
- "title": "Sousou no Frieren"
745
- }
746
- },
747
- {
748
- "id": "subsplease_parenthesized_resolution",
749
- "filename": "[SubsPlease] Mushoku Tensei - 12 (1080p) [x265][AAC]",
750
- "ok": true,
751
- "errors": {},
752
- "expected": {
753
- "group": "SubsPlease",
754
- "title": "Mushoku Tensei",
755
- "episode": 12,
756
- "resolution": "1080p"
757
- },
758
- "pred": {
759
- "episode": 12,
760
- "group": "SubsPlease",
761
- "resolution": "1080p",
762
- "title": "Mushoku Tensei"
763
- }
764
- },
765
- {
766
- "id": "vcb_bracket_episode",
767
- "filename": "[VCB-Studio] Girls Band Cry [01][Ma10p_1080p][x265_flac]",
768
- "ok": true,
769
- "errors": {},
770
- "expected": {
771
- "group": "VCB-Studio",
772
- "title": "Girls Band Cry",
773
- "episode": 1,
774
- "resolution": "1080p"
775
- },
776
- "pred": {
777
- "episode": 1,
778
- "group": "VCB-Studio",
779
- "resolution": "1080p",
780
- "title": "Girls Band Cry"
781
- }
782
- },
783
- {
784
- "id": "numeric_title_not_episode",
785
- "filename": "86 Eighty Six - 01 [1080P][Baha]",
786
- "ok": true,
787
- "errors": {},
788
- "expected": {
789
- "title": "86 Eighty Six",
790
- "episode": 1,
791
- "resolution": "1080P",
792
- "source": "Baha"
793
- },
794
- "pred": {
795
- "episode": 1,
796
- "resolution": "1080P",
797
- "source": "Baha",
798
- "title": "86 Eighty Six"
799
- }
800
- },
801
- {
802
- "id": "erai_raws_dash_episode",
803
- "filename": "[Erai-raws] Sousou no Frieren - 01 [1080p][Multiple Subtitle][ENG]",
804
- "ok": true,
805
- "errors": {},
806
- "expected": {
807
- "group": "Erai-raws",
808
- "title": "Sousou no Frieren",
809
- "episode": 1,
810
- "resolution": "1080p"
811
- },
812
- "pred": {
813
- "episode": 1,
814
- "group": "Erai-raws",
815
- "resolution": "1080p",
816
- "title": "Sousou no Frieren"
817
- }
818
- },
819
- {
820
- "id": "nekomoe_space_group",
821
- "filename": "[Nekomoe kissaten][Watashi no Shiawase na Kekkon][01][1080p][JPSC]",
822
- "ok": true,
823
- "errors": {},
824
- "expected": {
825
- "group": "Nekomoe kissaten",
826
- "title": "Watashi no Shiawase na Kekkon",
827
- "episode": 1,
828
- "resolution": "1080p"
829
- },
830
- "pred": {
831
- "episode": 1,
832
- "group": "Nekomoe kissaten",
833
- "resolution": "1080p",
834
- "title": "Watashi no Shiawase na Kekkon"
835
- }
836
- },
837
- {
838
- "id": "long_running_episode",
839
- "filename": "One.Piece.1110.1080p.WEB-DL.AAC2.0.H.264",
840
- "ok": true,
841
- "errors": {},
842
- "expected": {
843
- "title": "One.Piece",
844
- "episode": 1110,
845
- "resolution": "1080p",
846
- "source": "WEB-DL"
847
- },
848
- "pred": {
849
- "episode": 1110,
850
- "resolution": "1080p",
851
- "source": "WEB-DL",
852
- "title": "One.Piece"
853
- }
854
- },
855
- {
856
- "id": "season_episode_amzn",
857
- "filename": "Example.Show.S02E03.2160p.AMZN.WEB-DL.DDP5.1.H.265",
858
- "ok": true,
859
- "errors": {},
860
- "expected": {
861
- "title": "Example.Show",
862
- "season": 2,
863
- "episode": 3,
864
- "resolution": "2160p",
865
- "source": "AMZN"
866
- },
867
- "pred": {
868
- "episode": 3,
869
- "resolution": "2160p",
870
- "season": 2,
871
- "source": "AMZN",
872
- "title": "Example.Show"
873
- }
874
- },
875
- {
876
- "id": "cjk_group_with_prefix_tag",
877
- "filename": "【喵萌奶茶屋】★04月新番★[葬送的芙莉莲][01][1080P][HEVC]",
878
- "ok": true,
879
- "errors": {},
880
- "expected": {
881
- "group": "喵萌奶茶屋",
882
- "title": "葬送的芙莉莲",
883
- "episode": 1,
884
- "resolution": "1080P"
885
- },
886
- "pred": {
887
- "episode": 1,
888
- "group": "喵萌奶茶屋",
889
- "resolution": "1080P",
890
- "title": "葬送的芙莉莲"
891
- }
892
- },
893
- {
894
- "id": "leading_meta_not_group",
895
- "filename": "[1080p] Witch Watch - 15 [CHS]",
896
- "ok": true,
897
- "errors": {},
898
- "expected": {
899
- "group": null,
900
- "title": "Witch Watch",
901
- "episode": 15,
902
- "resolution": "1080p",
903
- "source": "CHS"
904
- },
905
- "pred": {
906
- "episode": 15,
907
- "group": null,
908
- "resolution": "1080p",
909
- "source": "CHS",
910
- "title": "Witch Watch"
911
- }
912
- },
913
- {
914
- "id": "sakurato_group_language_source",
915
- "filename": "[Sakurato] Witch Watch - 15 [1080p][CHS]",
916
- "ok": true,
917
- "errors": {},
918
- "expected": {
919
- "group": "Sakurato",
920
- "title": "Witch Watch",
921
- "episode": 15,
922
- "resolution": "1080p",
923
- "source": "CHS"
924
- },
925
- "pred": {
926
- "episode": 15,
927
- "group": "Sakurato",
928
- "resolution": "1080p",
929
- "source": "CHS",
930
- "title": "Witch Watch"
931
- }
932
- },
933
- {
934
- "id": "billion_meta_lab_search_special",
935
- "filename": "[Billion Meta Lab] 魔法姊妹露露莉莉 Mahou no Shimai Rurutto Riryi [07][1080P][CHT&JPN][檢索:魔法姊妹露露特莉莉].mp4",
936
- "ok": true,
937
- "errors": {},
938
- "expected": {
939
- "group": "Billion Meta Lab",
940
- "title": "魔法姊妹露露莉莉 Mahou no Shimai Rurutto Riryi",
941
- "episode": 7,
942
- "resolution": "1080P",
943
- "source": "CHT&JPN",
944
- "special": "檢索:魔法姊妹露露特莉莉"
945
- },
946
- "pred": {
947
- "episode": 7,
948
- "group": "Billion Meta Lab",
949
- "resolution": "1080P",
950
- "source": "CHT&JPN",
951
- "special": "檢索:魔法姊妹露露特莉莉",
952
- "title": "魔法姊妹露露莉莉 Mahou no Shimai Rurutto Riryi"
953
- }
954
- },
955
- {
956
- "id": "studio_greentea_s2_bracket_episode",
957
- "filename": "[Studio GreenTea] Otonari no Tenshi-sama ni Itsunomanika Dame Ningen ni Sareteita Ken S2 [06][WebRip][HEVC-10bit 1080p AAC][JPSC].mp4",
958
- "ok": true,
959
- "errors": {},
960
- "expected": {
961
- "group": "Studio GreenTea",
962
- "title": "Otonari no Tenshi-sama ni Itsunomanika Dame Ningen ni Sareteita Ken",
963
- "season": 2,
964
- "episode": 6,
965
- "resolution": "1080p",
966
- "source": "WebRip"
967
- },
968
- "pred": {
969
- "episode": 6,
970
- "group": "Studio GreenTea",
971
- "resolution": "1080p",
972
- "season": 2,
973
- "source": "WebRip",
974
- "title": "Otonari no Tenshi-sama ni Itsunomanika Dame Ningen ni Sareteita Ken"
975
- }
976
- },
977
- {
978
- "id": "lolihouse_kakuriyo_bare_ni_season",
979
- "filename": "[LoliHouse] Kakuriyo no Yadomeshi Ni - 12 [WebRip 1080p HEVC-10bit AAC SRTx2].mkv",
980
- "ok": true,
981
- "errors": {},
982
- "expected": {
983
- "group": "LoliHouse",
984
- "title": "Kakuriyo no Yadomeshi",
985
- "season": 2,
986
- "episode": 12,
987
- "resolution": "1080p",
988
- "source": "WebRip"
989
- },
990
- "pred": {
991
- "episode": 12,
992
- "group": "LoliHouse",
993
- "resolution": "1080p",
994
- "season": 2,
995
- "source": "WebRip",
996
- "title": "Kakuriyo no Yadomeshi"
997
- }
998
- },
999
- {
1000
- "id": "ani_kakuriyo_traditional_ni",
1001
- "filename": "[ANi] 妖怪旅館營業中 貳 - 11 [1080P][Baha][WEB-DL][AAC AVC][CHT].mp4",
1002
- "ok": true,
1003
- "errors": {},
1004
- "expected": {
1005
- "group": "ANi",
1006
- "title": "妖怪旅館營業中",
1007
- "season": 2,
1008
- "episode": 11,
1009
- "resolution": "1080P",
1010
- "source": "Baha"
1011
- },
1012
- "pred": {
1013
- "episode": 11,
1014
- "group": "ANi",
1015
- "resolution": "1080P",
1016
- "season": 2,
1017
- "source": "Baha",
1018
- "title": "妖怪旅館營業中"
1019
- }
1020
- },
1021
- {
1022
- "id": "jibaketa_shokugeki_ni_no_sara",
1023
- "filename": "[jibaketa]Shokugeki no Souma Ni no Sara - 13 END [BD 1920x1080 x264 AACx2 SRT TVB CHT].mkv",
1024
- "ok": true,
1025
- "errors": {},
1026
- "expected": {
1027
- "group": "jibaketa",
1028
- "title": "Shokugeki no Souma",
1029
- "season": 2,
1030
- "episode": 13,
1031
- "resolution": "1920x1080"
1032
- },
1033
- "pred": {
1034
- "episode": 13,
1035
- "group": "jibaketa",
1036
- "resolution": "1920x1080",
1037
- "season": 2,
1038
- "title": "Shokugeki no Souma"
1039
- }
1040
- },
1041
- {
1042
- "id": "ai_raws_fire_force_cjk_season_hash_episode",
1043
- "filename": "[AI-Raws] 炎炎の消防隊 弐ノ章 #13 (BD HEVC 1920x1080 yuv444p10le FLAC)[FC74A2D5].mkv",
1044
- "ok": true,
1045
- "errors": {},
1046
- "expected": {
1047
- "group": "AI-Raws",
1048
- "title": "炎炎の消防隊",
1049
- "season": 2,
1050
- "episode": 13,
1051
- "resolution": "1920x1080"
1052
- },
1053
- "pred": {
1054
- "episode": 13,
1055
- "group": "AI-Raws",
1056
- "resolution": "1920x1080",
1057
- "season": 2,
1058
- "title": "炎炎の消防隊"
1059
- }
1060
- },
1061
- {
1062
- "id": "gm_team_guoman_bilingual_s2",
1063
- "filename": "[GM-Team][国漫][逆天邪神 第2季][Against the Gods Ⅱ][2026][04][HEVC][GB][4K].mp4",
1064
- "ok": true,
1065
- "errors": {},
1066
- "expected": {
1067
- "group": "GM-Team",
1068
- "title": "逆天邪神",
1069
- "season": 2,
1070
- "episode": 4,
1071
- "resolution": "4K",
1072
- "source": "GB"
1073
- },
1074
- "pred": {
1075
- "episode": 4,
1076
- "group": "GM-Team",
1077
- "resolution": "4K",
1078
- "season": 2,
1079
- "source": "GB",
1080
- "title": "逆天邪神"
1081
- }
1082
- },
1083
- {
1084
- "id": "vcb_special_iv_not_episode",
1085
- "filename": "[YYDM&VCB-Studio] Shinsekai Yori [IV05][Ma10p_1080p][x265_aac].mkv",
1086
- "ok": true,
1087
- "errors": {},
1088
- "expected": {
1089
- "group": "YYDM&VCB-Studio",
1090
- "title": "Shinsekai Yori",
1091
- "episode": null,
1092
- "resolution": "1080p",
1093
- "source": "x265_aac",
1094
- "special": "IV05"
1095
- },
1096
- "pred": {
1097
- "episode": null,
1098
- "group": "YYDM&VCB-Studio",
1099
- "resolution": "1080p",
1100
- "source": "x265-aac",
1101
- "special": "IV05",
1102
- "title": "Shinsekai Yori"
1103
- }
1104
- },
1105
- {
1106
- "id": "vcb_nced_not_episode",
1107
- "filename": "[YYDM&VCB-Studio] Shinsekai Yori [NCED02][Ma10p_1080p][x265_flac].mkv",
1108
- "ok": true,
1109
- "errors": {},
1110
- "expected": {
1111
- "group": "YYDM&VCB-Studio",
1112
- "title": "Shinsekai Yori",
1113
- "episode": null,
1114
- "resolution": "1080p",
1115
- "source": "x265_flac",
1116
- "special": "NCED02"
1117
- },
1118
- "pred": {
1119
- "episode": null,
1120
- "group": "YYDM&VCB-Studio",
1121
- "resolution": "1080p",
1122
- "source": "x265-flac",
1123
- "special": "NCED02",
1124
- "title": "Shinsekai Yori"
1125
- }
1126
- },
1127
- {
1128
- "id": "dot_nced_suffix_not_episode",
1129
- "filename": "InuYasha.2000.NCED02.BDrip.AV1.10Bit.DTS.1080p-CalChi",
1130
- "ok": true,
1131
- "errors": {},
1132
- "expected": {
1133
- "title": "InuYasha",
1134
- "episode": null,
1135
- "resolution": "1080p",
1136
- "source": "BDrip",
1137
- "special": "NCED02"
1138
- },
1139
- "pred": {
1140
- "episode": null,
1141
- "resolution": "1080p",
1142
- "source": "BDrip",
1143
- "special": "NCED02",
1144
- "title": "InuYasha"
1145
- }
1146
- },
1147
- {
1148
- "id": "vcb_numeric_title_nced",
1149
- "filename": "[VCB-Studio] Yamada-kun to 7-nin no Majo [NCED][Ma10p_1080p][x265_flac]",
1150
- "ok": true,
1151
- "errors": {},
1152
- "expected": {
1153
- "group": "VCB-Studio",
1154
- "title": "Yamada-kun to 7-nin no Majo",
1155
- "episode": null,
1156
- "resolution": "1080p",
1157
- "source": "x265_flac",
1158
- "special": "NCED"
1159
- },
1160
- "pred": {
1161
- "episode": null,
1162
- "group": "VCB-Studio",
1163
- "resolution": "1080p",
1164
- "source": "x265-flac",
1165
- "special": "NCED",
1166
- "title": "Yamada-kun to 7-nin no Majo"
1167
- }
1168
- }
1169
- ]
1170
- },
1171
- "rule_assisted": {
1172
- "model_dir": ".",
1173
- "case_file": "data/parser_regression_cases.json",
1174
- "tokenizer_variant": "char",
1175
- "max_length": 128,
1176
- "use_rules": true,
1177
  "constrain_bio": true,
1178
  "case_count": 26,
1179
  "full_correct": 26,
 
6
  "case_file": "data/parser_regression_cases.json",
7
  "tokenizer_variant": "char",
8
  "max_length": 128,
 
9
  "constrain_bio": false,
10
  "case_count": 26,
11
  "full_correct": 25,
 
605
  "case_file": "data/parser_regression_cases.json",
606
  "tokenizer_variant": "char",
607
  "max_length": 128,
608
  "constrain_bio": true,
609
  "case_count": 26,
610
  "full_correct": 26,
diagnose_pipeline.py CHANGED
@@ -364,9 +364,7 @@ def evaluate_model(
364
  entity_confusion: Counter = Counter()
365
  boundary_errors: Counter = Counter()
366
  parse_metrics: Counter = Counter()
367
- parse_metrics_no_rules: Counter = Counter()
368
  field_failures: List[dict] = []
369
- field_failures_no_rules: List[dict] = []
370
 
371
  with torch.no_grad():
372
  for sample in eval_samples:
@@ -410,32 +408,13 @@ def evaluate_model(
410
  active_tokens,
411
  true_labels,
412
  tokenizer=tokenizer,
413
- filename=sample.get("filename"),
414
- use_rules=True,
415
  )
416
  pred_parse = postprocess(
417
  active_tokens,
418
  pred_labels,
419
  tokenizer=tokenizer,
420
- filename=sample.get("filename"),
421
- use_rules=True,
422
- )
423
- gold_parse_no_rules = postprocess(
424
- active_tokens,
425
- true_labels,
426
- tokenizer=tokenizer,
427
- filename=sample.get("filename"),
428
- use_rules=False,
429
- )
430
- pred_parse_no_rules = postprocess(
431
- active_tokens,
432
- pred_labels,
433
- tokenizer=tokenizer,
434
- filename=sample.get("filename"),
435
- use_rules=False,
436
  )
437
  update_parse_metrics(parse_metrics, gold_parse, pred_parse)
438
- update_parse_metrics(parse_metrics_no_rules, gold_parse_no_rules, pred_parse_no_rules)
439
  failures = collect_field_failures(gold_parse, pred_parse)
440
  if failures and len(field_failures) < 30:
441
  field_failures.append(
@@ -446,16 +425,6 @@ def evaluate_model(
446
  "pred": pred_parse,
447
  }
448
  )
449
- failures_no_rules = collect_field_failures(gold_parse_no_rules, pred_parse_no_rules)
450
- if failures_no_rules and len(field_failures_no_rules) < 30:
451
- field_failures_no_rules.append(
452
- {
453
- "filename": sample.get("filename"),
454
- "errors": failures_no_rules,
455
- "gold": gold_parse_no_rules,
456
- "pred": pred_parse_no_rules,
457
- }
458
- )
459
 
460
  errors = confusion.copy()
461
  for label in set(label for pair in confusion for label in pair):
@@ -473,9 +442,7 @@ def evaluate_model(
473
  ).most_common(30),
474
  "boundary_errors": boundary_errors,
475
  "parse_metrics": parse_metrics,
476
- "parse_metrics_no_rules": parse_metrics_no_rules,
477
  "field_failures": field_failures,
478
- "field_failures_no_rules": field_failures_no_rules,
479
  }
480
 
481
 
@@ -811,8 +778,7 @@ def main() -> None:
811
  ]
812
  return field_rows, full_line, error_rows
813
 
814
- rule_field_rows, rule_full_line, rule_error_rows = parse_metric_tables(model_eval["parse_metrics"])
815
- ner_field_rows, ner_full_line, ner_error_rows = parse_metric_tables(model_eval["parse_metrics_no_rules"])
816
  sections.append(
817
  (
818
  "Model Confusion Analysis",
@@ -832,28 +798,17 @@ def main() -> None:
832
  "### Top entity-type confusions",
833
  markdown_table(["true", "pred", "count"], entity_rows) if entity_rows else "- none",
834
  "",
835
- "### Field exact-match accuracy (rule-assisted)",
836
- markdown_table(["field", "correct/total", "accuracy"], rule_field_rows),
837
  "",
838
- f"Rule-assisted full parse exact match: {rule_full_line}",
839
  "",
840
- "### Top rule-assisted field parse errors",
841
- markdown_table(["field", "gold", "pred", "count"], rule_error_rows) if rule_error_rows else "- none",
842
  "",
843
- "### Field exact-match accuracy (NER-only, no rules)",
844
- markdown_table(["field", "correct/total", "accuracy"], ner_field_rows),
845
- "",
846
- f"NER-only full parse exact match: {ner_full_line}",
847
- "",
848
- "### Top NER-only field parse errors",
849
- markdown_table(["field", "gold", "pred", "count"], ner_error_rows) if ner_error_rows else "- none",
850
- "",
851
- "### Hardest sampled parse failures (rule-assisted)",
852
  markdown_json(model_eval["field_failures"][:10]) if model_eval["field_failures"] else "- none",
853
  "",
854
- "### Hardest sampled parse failures (NER-only)",
855
- markdown_json(model_eval["field_failures_no_rules"][:10]) if model_eval["field_failures_no_rules"] else "- none",
856
- "",
857
  "### Seqeval report",
858
  "```text\n" + model_eval["classification_report"] + "\n```",
859
  ]
@@ -870,7 +825,7 @@ def main() -> None:
870
  "2. Prefer char-level or a deterministic hybrid tokenizer for DMHY filenames; avoid generic subword tokenization for labels.",
871
  "3. For char-level runs, use `--tokenizer char --max-seq-length 128` with `vocab.char.json`.",
872
  "4. Add CRF decoding or constrained BIO decoding so illegal I-X transitions and impossible boundary jumps are blocked.",
873
- "5. Keep rule-assisted post-processing for high-confidence structural anchors: leading group bracket, ` - 07`, `S01E07`, source, and resolution.",
874
  "6. Track entity-level F1 and field exact-match on real filenames; do not accept low validation loss alone.",
875
  ]
876
  ),
 
364
  entity_confusion: Counter = Counter()
365
  boundary_errors: Counter = Counter()
366
  parse_metrics: Counter = Counter()
 
367
  field_failures: List[dict] = []
 
368
 
369
  with torch.no_grad():
370
  for sample in eval_samples:
 
408
  active_tokens,
409
  true_labels,
410
  tokenizer=tokenizer,
 
 
411
  )
412
  pred_parse = postprocess(
413
  active_tokens,
414
  pred_labels,
415
  tokenizer=tokenizer,
416
  )
417
  update_parse_metrics(parse_metrics, gold_parse, pred_parse)
 
418
  failures = collect_field_failures(gold_parse, pred_parse)
419
  if failures and len(field_failures) < 30:
420
  field_failures.append(
 
425
  "pred": pred_parse,
426
  }
427
  )
428
 
429
  errors = confusion.copy()
430
  for label in set(label for pair in confusion for label in pair):
 
442
  ).most_common(30),
443
  "boundary_errors": boundary_errors,
444
  "parse_metrics": parse_metrics,
 
445
  "field_failures": field_failures,
 
446
  }
447
 
448
 
 
778
  ]
779
  return field_rows, full_line, error_rows
780
 
781
+ parse_field_rows, parse_full_line, parse_error_rows = parse_metric_tables(model_eval["parse_metrics"])
 
782
  sections.append(
783
  (
784
  "Model Confusion Analysis",
 
798
  "### Top entity-type confusions",
799
  markdown_table(["true", "pred", "count"], entity_rows) if entity_rows else "- none",
800
  "",
801
+ "### Field exact-match accuracy (thin runtime)",
802
+ markdown_table(["field", "correct/total", "accuracy"], parse_field_rows),
803
  "",
804
+ f"Thin-runtime full parse exact match: {parse_full_line}",
805
  "",
806
+ "### Top thin-runtime field parse errors",
807
+ markdown_table(["field", "gold", "pred", "count"], parse_error_rows) if parse_error_rows else "- none",
808
  "",
809
+ "### Hardest sampled parse failures",
810
  markdown_json(model_eval["field_failures"][:10]) if model_eval["field_failures"] else "- none",
811
  "",
 
 
 
812
  "### Seqeval report",
813
  "```text\n" + model_eval["classification_report"] + "\n```",
814
  ]
 
825
  "2. Prefer char-level or a deterministic hybrid tokenizer for DMHY filenames; avoid generic subword tokenization for labels.",
826
  "3. For char-level runs, use `--tokenizer char --max-seq-length 128` with `vocab.char.json`.",
827
  "4. Add CRF decoding or constrained BIO decoding so illegal I-X transitions and impossible boundary jumps are blocked.",
828
+ "5. Keep runtime post-processing thin: BIO aggregation plus string/number normalization.",
829
  "6. Track entity-level F1 and field exact-match on real filenames; do not accept low validation loss alone.",
830
  ]
831
  ),
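Recommendation 4 in the report text above asks for constrained BIO decoding. A minimal sketch of what that could look like, assuming per-token label scores are already available as plain Python lists. The names `allowed` and `constrained_bio_decode`, and the label set used in the usage note, are illustrative and are not taken from this repository:

```python
from typing import List

NEG_INF = float("-inf")


def allowed(prev: str, curr: str) -> bool:
    """I-X may only follow B-X or I-X; every other transition is allowed."""
    if curr.startswith("I-"):
        entity = curr[2:]
        return prev in ("B-" + entity, "I-" + entity)
    return True


def constrained_bio_decode(scores: List[List[float]], labels: List[str]) -> List[str]:
    """Viterbi search over label scores that never emits an illegal I-X."""
    if not scores:
        return []
    n = len(labels)
    # A sequence may not start with an I-X label.
    best = [
        NEG_INF if labels[j].startswith("I-") else score
        for j, score in enumerate(scores[0])
    ]
    back: List[List[int]] = []
    for row in scores[1:]:
        new_best: List[float] = []
        pointers: List[int] = []
        for j in range(n):
            # Best legal predecessor for label j.
            candidates = [(best[i], i) for i in range(n) if allowed(labels[i], labels[j])]
            prev_score, prev_idx = max(candidates)
            new_best.append(prev_score + row[j])
            pointers.append(prev_idx)
        best = new_best
        back.append(pointers)
    # Backtrack from the best final label.
    idx = max(range(n), key=lambda j: best[j])
    path = [idx]
    for pointers in reversed(back):
        idx = pointers[idx]
        path.append(idx)
    return [labels[i] for i in reversed(path)]
```

With labels `["O", "B-TITLE", "I-TITLE", "B-EP", "I-EP"]`, a plain per-token argmax over `[[5, 0, 0, 0, 0], [0, 0, 0, 2, 3]]` would pick `I-EP` directly after `O`; the constrained search rejects that transition and falls back to the best legal path.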
docs/onnx.md CHANGED
@@ -107,15 +107,14 @@ The runtime parser should do this:
  使用约束 BIO transition 解码标签。
  8. Aggregate labels into parser fields.
  聚合标签为结构化字段。
- 9. Apply thin normalization only: trim brackets/extensions and convert numeric
- fields.
+ 9. Apply thin normalization only: trim brackets, normalize source text, and
+ convert numeric fields.
  只做薄层规范化:裁剪括号/扩展名并转换数字字段。

- The legacy structural assist layer is available only behind `--rule-assist` in
- the Python tools. It is not part of the default ONNX reference runtime.
+ The ONNX reference runtime intentionally matches the Python thin runtime. It
+ does not include structural filename regex assists.

- 旧结构辅助层只在 Python 工具的 `--rule-assist` 下显式启用,不属于默认 ONNX
- 参考运行时。
+ ONNX 参考运行时有意与 Python 薄层运行时保持一致,不包含结构化文件名正则辅助。

  ## 5. Android Notes / Android 注意事项

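Steps 8 and 9 of the `docs/onnx.md` hunk above (aggregate labels into fields, then apply thin normalization) can be sketched roughly as follows. This is only an illustration under stated assumptions: tokens are character-level, the entity names `GROUP`, `TITLE`, `EPISODE`, and `RESOLUTION` are hypothetical (only `SOURCE` appears in the diff), and `aggregate_bio` and `thin_normalize` are not the repository's functions:

```python
import re
from typing import Dict, List, Optional


def aggregate_bio(tokens: List[str], labels: List[str]) -> Dict[str, List[str]]:
    """Group B-X/I-X runs into raw text spans per entity type."""
    entities: Dict[str, List[str]] = {}
    current_type: Optional[str] = None
    current: List[str] = []

    def flush() -> None:
        if current_type is not None and current:
            entities.setdefault(current_type, []).append("".join(current))

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            flush()
            current_type, current = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current.append(token)
        else:
            # "O" or an I-X that does not continue the open span.
            flush()
            current_type, current = None, []
    flush()
    return entities


def thin_normalize(entities: Dict[str, List[str]]) -> Dict[str, object]:
    """Trim wrapper brackets and convert numeric fields; no filename regexes."""

    def first_text(key: str) -> Optional[str]:
        for value in entities.get(key, []):
            text = value.strip(" \t[]()【】《》")
            if text:
                return text
        return None

    def first_int(key: str) -> Optional[int]:
        text = first_text(key)
        match = re.search(r"\d+", text) if text else None
        return int(match.group(0)) if match else None

    return {
        "group": first_text("GROUP"),
        "title": first_text("TITLE"),
        "episode": first_int("EPISODE"),
        "resolution": first_text("RESOLUTION"),
    }
```

The point of the design is that `thin_normalize` only ever sees spans the model already labeled; it never inspects the original filename, which is the distinction this commit draws between thin normalization and the removed structural assists.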
docs/training.md CHANGED
@@ -172,12 +172,12 @@ The default quality gate is model-led parsing:
  - fixed regression `model_only >= 85%`
  - held-out parse `model_only >= 75%`
  - `normalized_only` is the default thin runtime metric
- - `rule_assisted` is compatibility/diagnostic only
+ - structural filename assists are not part of training or release metrics

  - 固定回归 `model_only >= 85%`
  - held-out 解析 `model_only >= 75%`
  - `normalized_only` 是默认薄层运行时指标
- - `rule_assisted` 只作为兼容/诊断对照
+ - 结构化文件名辅助不属于训练或发布指标

  ## 7. Publish to Repository Root / 发布到仓库根目录

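The quality gate in the `docs/training.md` hunk above could be checked mechanically along these lines. This is a hedged sketch: the `modes`, `case_count`, and `full_correct` keys mirror the metrics dictionaries visible in `evaluate_parser_cases.py`, while the suite names `fixed_regression` and `held_out` and the helper `check_gates` are assumptions made for illustration:

```python
from typing import Dict, List

# Thresholds copied from the gate list; suite names are hypothetical.
GATES = {"fixed_regression": 0.85, "held_out": 0.75}


def full_match_rate(result: Dict) -> float:
    """Fraction of cases whose whole parse matched the gold parse."""
    count = result["case_count"]
    return result["full_correct"] / count if count else 0.0


def check_gates(reports: Dict[str, Dict]) -> List[str]:
    """Return one message per suite whose model_only rate is below its gate."""
    failures: List[str] = []
    for suite, threshold in GATES.items():
        rate = full_match_rate(reports[suite]["modes"]["model_only"])
        if rate < threshold:
            failures.append(f"{suite}: model_only {rate:.2%} < {threshold:.0%}")
    return failures
```

An empty return value means both gates pass; `normalized_only` is deliberately not gated here because the documentation lists it as the default runtime metric rather than as a threshold.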
evaluate_parser_cases.py CHANGED
@@ -43,7 +43,6 @@ def evaluate_cases(
  case_file: str,
  tokenizer_variant: Optional[str],
  max_length: Optional[int],
- use_rules: bool,
  constrain_bio: bool,
  ) -> Dict:
  cfg = Config()
@@ -71,7 +70,6 @@ def evaluate_cases(
  id2label,
  max_length=resolved_max_length,
  debug=False,
- use_rules=use_rules,
  constrain_bio=constrain_bio,
  )
  errors = {}
@@ -108,7 +106,6 @@ def evaluate_cases(
  "case_file": case_file,
  "tokenizer_variant": getattr(tokenizer, "tokenizer_variant", "regex"),
  "max_length": resolved_max_length,
- "use_rules": use_rules,
  "constrain_bio": constrain_bio,
  "case_count": len(cases),
  "full_correct": full_correct,
@@ -128,9 +125,8 @@ def evaluate_case_modes(
  max_length: Optional[int],
  ) -> Dict:
  modes = {
- "model_only": {"use_rules": False, "constrain_bio": False},
- "normalized_only": {"use_rules": False, "constrain_bio": True},
- "rule_assisted": {"use_rules": True, "constrain_bio": True},
+ "model_only": {"constrain_bio": False},
+ "normalized_only": {"constrain_bio": True},
  }
  results = {
  name: evaluate_cases(
@@ -138,7 +134,6 @@ def evaluate_case_modes(
  case_file=case_file,
  tokenizer_variant=tokenizer_variant,
  max_length=max_length,
- use_rules=settings["use_rules"],
  constrain_bio=settings["constrain_bio"],
  )
  for name, settings in modes.items()
@@ -170,17 +165,10 @@ def main() -> None:
  parser.add_argument("--tokenizer", choices=["regex", "char"], default=None)
  parser.add_argument("--max-length", type=int, default=None)
  parser.add_argument("--output", default=None, help="Optional JSON output path")
- parser.add_argument("--mode", choices=["all", "model-only", "normalized-only", "rule-assisted"], default="all")
- parser.add_argument("--rule-assist", action="store_true", help="Shortcut for --mode rule-assisted")
- parser.add_argument("--no-rule-assist", action="store_true", help=argparse.SUPPRESS)
+ parser.add_argument("--mode", choices=["all", "model-only", "normalized-only"], default="all")
  parser.add_argument("--no-constrained-bio", action="store_true")
  args = parser.parse_args()

- if args.rule_assist:
- args.mode = "rule-assisted"
- if args.no_rule_assist and args.mode == "rule-assisted":
- args.mode = "normalized-only"
-
  if args.mode == "all" and not args.no_constrained_bio:
  metrics = evaluate_case_modes(
  model_dir=args.model_dir,
@@ -188,18 +176,16 @@ def main() -> None:
  tokenizer_variant=args.tokenizer,
  max_length=args.max_length,
  )
- for name in ("model_only", "normalized_only", "rule_assisted"):
+ for name in ("model_only", "normalized_only"):
  print_metrics(name, metrics["modes"][name])
  print()
  else:
- use_rules = args.mode == "rule-assisted"
  constrain_bio = not args.no_constrained_bio and args.mode != "model-only"
  metrics = evaluate_cases(
  model_dir=args.model_dir,
  case_file=args.case_file,
  tokenizer_variant=args.tokenizer,
  max_length=args.max_length,
- use_rules=use_rules,
  constrain_bio=constrain_bio,
  )
  print_metrics(args.mode, metrics)
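After this change the CLI resolves `--mode` and `--no-constrained-bio` into at most two evaluation passes. The standalone sketch below reproduces only that decision logic so it can be read in isolation; `build_parser` and `resolve_runs` are hypothetical helpers, and the real script calls `evaluate_cases` / `evaluate_case_modes` instead of returning a dictionary:

```python
import argparse
from typing import Dict, List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # "rule-assisted" is no longer an accepted choice after this commit.
    parser.add_argument("--mode", choices=["all", "model-only", "normalized-only"], default="all")
    parser.add_argument("--no-constrained-bio", action="store_true")
    return parser


def resolve_runs(argv: List[str]) -> Dict[str, bool]:
    """Map CLI arguments to {reported name: constrain_bio} for each pass."""
    args = build_parser().parse_args(argv)
    if args.mode == "all" and not args.no_constrained_bio:
        # Both remaining modes are evaluated and reported side by side.
        return {"model_only": False, "normalized_only": True}
    constrain_bio = not args.no_constrained_bio and args.mode != "model-only"
    return {args.mode: constrain_bio}
```

One consequence worth noting: `--mode all --no-constrained-bio` takes the single-pass branch, so it produces one unconstrained run reported under the name `all` rather than two runs.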
inference.py CHANGED
@@ -11,7 +11,6 @@ Usage:
11
 
12
  import argparse
13
  import json
14
- import os
15
  import re
16
  import sys
17
  from typing import Dict, List, Optional, Tuple
@@ -98,6 +97,15 @@ def thin_source_priority(source: str) -> int:
98
  return 40 if re.search(r"[&+/,]", source) else 30
99
 
100
 
 
 
 
 
 
 
 
 
 
101
  def choose_thin_source(sources: List[str]) -> Optional[str]:
102
  cleaned = [normalize_source_text(source) for source in sources if normalize_field_text(source)]
103
  if not cleaned:
@@ -239,8 +247,6 @@ def postprocess(
239
  tokens: List[str],
240
  labels: List[str],
241
  tokenizer: Optional[AnimeTokenizer] = None,
242
- filename: Optional[str] = None,
243
- use_rules: bool = False,
244
  ) -> Dict:
245
  """
246
  Convert BIO-labeled tokens into structured metadata.
@@ -298,658 +304,9 @@ def postprocess(
298
 
299
  result["source"] = choose_thin_source(grouped_entities.get("SOURCE", []))
300
 
301
- if use_rules and filename:
302
- result = apply_rule_assists(filename, result)
303
-
304
  return result
305
 
306
 
307
- BRACKET_RE = re.compile(r"\[([^\]]+)\]|\(([^)]+)\)|【([^】]+)】|《([^》]+)》")
308
- RESOLUTION_RE = re.compile(r"(?<![A-Za-z0-9])(?:\d{3,4}[pP]|\d[Kk]|\d{3,4}[xX×]\d{3,4})(?![A-Za-z0-9])")
309
- SOURCE_TOKEN_PATTERN = (
310
- r"WEB[-_ ]?DL|WEB[-_ ]?Rip|BDRip|BluRay|BDMV|BD|DVDRip|DVD|TVRip|HDTV|"
311
- r"Netflix|NF|AMZN|Baha|CR|ABEMA|DSNP|U[-_ ]?NEXT|Hulu|AT[-_ ]?X|"
312
- r"x26[45]|h\.?26[45]|HEVC|AVC|AV1|AAC\d*(?:\.\d+)?|AAC|FLAC|MP3|DTS|Opus|"
313
- r"SDR|HDR10?|UHD|REMUX|10bit|8bit|Hi10p|Ma10p|ASSx?\d*|SRTx?\d*|"
314
- r"CHS|CHT|GB|BIG5|JPN?|JPSC|JPTC|繁中|简中"
315
- )
316
- SOURCE_RE = re.compile(rf"\b(?:{SOURCE_TOKEN_PATTERN})\b", re.I)
317
- SOURCE_TAG_RE = re.compile(
318
- rf"^(?:{SOURCE_TOKEN_PATTERN})(?:\s*(?:[&+/]|,\s*)\s*(?:{SOURCE_TOKEN_PATTERN}))*$",
319
- re.I,
320
- )
321
- SPECIAL_TAG_RE = re.compile(
322
- r"^(?:檢索|检索|搜索|搜寻|搜尋|别名|別名|alias|search|keyword)\s*[::].+",
323
- re.I,
324
- )
325
- SPECIAL_CODE_RE = re.compile(
326
- r"^(?:NCOP|NCED|OP|ED|PV|CM)\d*$|^IV\d+$|^(?:OVA|OAD|SP)\d*$",
327
- re.I,
328
- )
329
- SPECIAL_CODE_INLINE_RE = re.compile(
330
- r"(?<![A-Za-z0-9])"
331
- r"(?P<code>(?:NCOP|NCED)(?:[\s._-]*\d{1,4})?|(?:OP|ED|PV|CM)\d{1,4}|IV\d{1,4})"
332
- r"(?![A-Za-z0-9])",
333
- re.I,
334
- )
335
- EPISODE_PATTERNS = [
336
- ("season_episode", re.compile(r"[Ss]\d{1,2}[Ee](?P<ep>\d{1,4})(?:v\d+)?", re.I)),
337
- ("dash_episode", re.compile(r"(?:^|[\s._])[-_]\s*(?P<ep>\d{1,4})(?:v\d+)?(?=$|[\s._\-\]\)】》\[])")),
338
- ("bracket_episode", re.compile(r"[\[\(【《](?:EP?|#)?(?P<ep>\d{1,4})(?:v\d+)?[\]\)】》]", re.I)),
339
- ("explicit_episode", re.compile(r"(?:^|[\s._\-\[\(【《#])(?:EP?|第|#)(?P<ep>\d{1,4})(?:v\d+)?(?:[话話集])?(?=$|[\s._\-\]\)】》])", re.I)),
340
- (
341
- "long_episode",
342
- re.compile(
343
- r"(?:^|[\s._\-\[\(【《])(?P<ep>\d{3,4})(?:v\d+)?"
344
- r"(?=[\s._\-\]\)】》\[]+(?:\d{3,4}[pP]|WEB|BD|BluRay|HDTV|NF|AMZN|CR|Baha))",
345
- re.I,
346
- ),
347
- ),
348
- ("generic_episode", re.compile(r"(?:^|[\s._\-\[\(【《#])(?P<ep>\d{1,3})(?:v\d+)?(?=$|[\s._\-\]\)】》])", re.I)),
349
- ]
350
- SEASON_RE = re.compile(r"(?:^|[\s._\-\[\(【《])(?:[Ss](?P<s1>\d{1,2})|Season\s*(?P<s2>\d{1,2})|第(?P<s3>[一二三四五六七八九十\d]+)[季期部])", re.I)
351
- SEQUEL_MARKER_RE = re.compile(
352
- r"(?<![A-Za-z0-9])"
353
- r"(?P<marker>"
354
- r"Ni\s+no\s+(?:Sara|Shou|Sho|Syo|Shō)|"
355
- r"San\s+no\s+(?:Sara|Shou|Sho|Syo)|"
356
- r"(?:Yon|Shi|Shin)\s+no\s+Sara|"
357
- r"(?:Go|Gou)\s+no\s+Sara|"
358
- r"Ni\s+Gakki|Sono\s+Ni|Ni|"
359
- r"II|III|IV|V|VI|VII|VIII|IX|[ⅡⅢⅣⅤⅥⅦⅧⅨ]|"
360
- r"[一二三四五六七八九十兩两貳贰弐弍參叁参肆伍陸陆柒捌玖](?:\s*(?:ノ|の|之)\s*(?:章|期|季|部))?"
361
- r")"
362
- r"(?![A-Za-z0-9])",
363
- re.I,
364
- )
365
- TRAILING_SEQUEL_MARKER_RE = re.compile(
366
- r"(?:^|[\s._-])"
367
- r"(?P<marker>"
368
- r"Ni\s+no\s+(?:Sara|Shou|Sho|Syo|Shō)|"
369
- r"San\s+no\s+(?:Sara|Shou|Sho|Syo)|"
370
- r"(?:Yon|Shi|Shin)\s+no\s+Sara|"
371
- r"(?:Go|Gou)\s+no\s+Sara|"
372
- r"Ni\s+Gakki|Sono\s+Ni|Ni|"
373
- r"II|III|IV|V|VI|VII|VIII|IX|[ⅡⅢⅣⅤⅥⅦⅧⅨ]|"
374
- r"[一二三四五六七八九十兩两貳贰弐弍參叁参肆伍陸陆柒捌玖](?:\s*(?:ノ|の|之)\s*(?:章|期|季|部))?"
375
- r")$",
376
- re.I,
377
- )
378
- NOISE_META_RE = re.compile(
379
- r"^(?:\d{3,4}[pP]|\d[Kk]|WEB[-_ ]?DL|WEB[-_ ]?Rip|BDRip|BluRay|BDMV|BD|DVDRip|DVD|TVRip|"
380
- r"HDTV|Netflix|NF|AMZN|Baha|CR|HEVC|AVC|AV1|x26[45]|h\.?26[45]|AAC.*|FLAC|MP3|DTS|"
381
- r"Opus|SDR|HDR10?|UHD|REMUX|10bit|8bit|Hi10p|Ma10p|ASS.*|SRT.*|CHS|CHT|BIG5|GB|JPN?|"
382
- r"JPSC|JPTC|MP4|MKV|繁中|简中|内封|外挂)$",
383
- re.I,
384
- )
385
- DATE_RE = re.compile(r"^(?:19|20)\d{2}(?:[.\-_年]?(?:0?[1-9]|1[0-2]))?(?:[.\-_月]?(?:0?[1-9]|[12]\d|3[01]))?日?$")
386
- CATEGORY_BRACKETS = {
387
- "国漫", "國漫", "国产", "國產", "国产动漫", "國產動漫", "国产动画", "國產動畫",
388
- "国创", "國創", "中国动漫", "中國動漫", "中国动画", "中國動畫",
389
- }
390
-
391
-
392
- def cn_number_to_int(text: str) -> Optional[int]:
393
- if text.isdigit():
394
- return int(text)
395
- values = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
396
- if text == "十":
397
- return 10
398
- if text.startswith("十") and len(text) == 2:
399
- return 10 + values.get(text[1], 0)
400
- if text.endswith("十") and len(text) == 2:
401
- return values.get(text[0], 0) * 10
402
- if "十" in text and len(text) == 3:
403
- return values.get(text[0], 0) * 10 + values.get(text[2], 0)
404
- return values.get(text)
405
-
406
-
407
- def bracket_parts(filename: str) -> List[Tuple[str, int, int]]:
408
- parts: List[Tuple[str, int, int]] = []
409
- for match in BRACKET_RE.finditer(filename):
410
- text = next(group for group in match.groups() if group is not None)
411
- parts.append((text.strip(), match.start(), match.end()))
412
- return parts
413
-
414
-
415
- def looks_like_group(text: str) -> bool:
416
- if not text or NOISE_META_RE.search(text):
417
- return False
418
- return bool(
419
- re.search(
420
- r"(?:字幕|字幕组|字幕組|sub|subs|raws?|fansub|studio|house|team|project|"
421
- r"loli|ani|vcb|airota|kiss|dmhy|erai|subsplease)",
422
- text,
423
- re.I,
424
- )
425
- )
426
-
427
-
428
- def looks_like_episode_or_meta(text: str) -> bool:
429
- if not text:
430
- return False
431
- clean = text.strip()
432
- normalized = re.sub(r"[\s._-]+", "", clean)
433
- return bool(
434
- re.fullmatch(r"(?:EP?|#)?\d{1,4}(?:v\d+)?", clean, re.I)
435
- or DATE_RE.fullmatch(clean)
436
- or normalized in CATEGORY_BRACKETS
437
- or RESOLUTION_RE.search(clean)
438
- or SOURCE_TAG_RE.fullmatch(clean)
439
- or SOURCE_RE.search(clean)
440
- or SPECIAL_TAG_RE.search(clean)
441
- or SPECIAL_CODE_RE.fullmatch(normalized)
442
- or NOISE_META_RE.search(clean)
443
- )
444
-
445
-
446
- def normalize_special_code(text: str) -> str:
447
- return re.sub(r"[\s._-]+", "", text.strip())
448
-
449
-
450
- def special_code_spans(filename: str) -> List[Tuple[str, int, int]]:
451
- spans: List[Tuple[str, int, int]] = []
452
- for text, start, end in bracket_parts(filename):
453
- normalized = normalize_special_code(text)
454
- if SPECIAL_CODE_RE.fullmatch(normalized):
455
- spans.append((normalized, start, end))
456
- for match in SPECIAL_CODE_INLINE_RE.finditer(filename):
457
- normalized = normalize_special_code(match.group("code"))
458
- if SPECIAL_CODE_RE.fullmatch(normalized):
459
- spans.append((normalized, match.start("code"), match.end("code")))
460
-
461
- deduped: List[Tuple[str, int, int]] = []
462
- seen: set[Tuple[str, int, int]] = set()
463
- for value, start, end in sorted(spans, key=lambda item: (item[1], item[2])):
464
- key = (value.lower(), start, end)
465
- if key in seen:
466
- continue
467
- seen.add(key)
468
- deduped.append((value, start, end))
469
- return deduped
470
-
471
-
472
- def special_code_brackets(filename: str) -> List[Tuple[str, int, int]]:
473
- return [
474
- (text.strip(), start, end)
475
- for text, start, end in bracket_parts(filename)
476
- if SPECIAL_CODE_RE.fullmatch(normalize_special_code(text))
477
- ]
478
-
479
-
480
- def span_is_inside_special_code(filename: str, start: int, end: int) -> bool:
481
- return any(special_start <= start and end <= special_end for _code, special_start, special_end in special_code_spans(filename))
482
-
483
-
484
- def has_non_special_episode_context(filename: str, episode: int) -> bool:
485
- masked = filename
486
- for _text, start, end in reversed(special_code_brackets(filename)):
487
- masked = masked[:start] + (" " * (end - start)) + masked[end:]
488
- return plausible_episode_context(masked, episode) and best_structural_episode(masked) == episode
489
-
490
-
491
- def episode_comes_only_from_special_code(filename: str, episode: Optional[int]) -> bool:
492
- if episode is None:
493
- return False
494
- specials = special_code_spans(filename)
495
- if not specials:
496
- return False
497
- ep_text = str(int(episode))
498
- for normalized, _start, _end in specials:
499
- if re.search(rf"0*{re.escape(ep_text)}$", normalized):
500
- return not has_non_special_episode_context(filename, int(episode))
501
- return False
502
-
503
-
504
- def strip_title_special_codes(title: str, special: Optional[str] = None) -> str:
505
- cleaned = title.strip()
506
- while True:
507
- next_cleaned = re.sub(
508
- r"\s*[\[\(【《]\s*(?:(?:NCOP|NCED|OP|ED|PV|CM)\d*|IV\d+|(?:OVA|OAD|SP)\d*)\s*[\]\)】》]\s*$",
509
- "",
510
- cleaned,
511
- flags=re.I,
512
- ).strip(" \t-_.")
513
- if next_cleaned == cleaned:
514
- break
515
- cleaned = next_cleaned
516
- cleaned = re.sub(r"\s+(?:NCOP|NCED|OP|ED|PV|CM)\d*$", "", cleaned, flags=re.I).strip(" \t-_.")
517
- if special:
518
- normalized = re.sub(r"[\s._-]+", "", str(special).strip())
519
- match = re.fullmatch(r"([A-Za-z]+)\d+", normalized)
520
- if match and SPECIAL_CODE_RE.fullmatch(normalized):
521
- prefix = re.escape(match.group(1))
522
- cleaned = re.sub(rf"\s+{prefix}$", "", cleaned, flags=re.I).strip(" \t-_.")
523
- return cleaned or title
524
-
525
-
526
- def looks_like_structural_group(text: str, filename: str, bracket_end: int) -> bool:
527
- """Heuristic for short leading release-group brackets not in the name list."""
528
- if looks_like_group(text):
529
- return True
530
- if not text or looks_like_episode_or_meta(text):
531
- return False
532
-
533
- after = filename[bracket_end:].lstrip(" \t._")
534
- if after.startswith("-"):
535
- return False
536
- next_bracket = BRACKET_RE.match(after)
537
- if next_bracket:
538
- next_text = next(group for group in next_bracket.groups() if group is not None)
539
- if looks_like_episode_or_meta(next_text):
540
- return False
541
-
542
- words = re.findall(r"[A-Za-z0-9]+", text)
543
- if not words:
544
- if re.search(r"[\u3400-\u9fff]", text) and len(text) <= 32:
545
- return True
546
- return False
547
- if len(text) > 32:
548
- return False
549
- if len(words) == 1:
550
- return True
551
- if any(sep in text for sep in "-_"):
552
- return True
553
- if words[0].isupper() and len(words[0]) <= 4 and len(words) <= 3:
554
- return True
555
- return False
556
-
557
-
558
- def apply_rule_assists(filename: str, result: Dict) -> Dict:
559
- """
560
- Fill high-confidence structural fields from filename conventions.
561
-
562
- The model remains the primary tagger; rules only fill missing obvious fields
563
- or repair common boundary drift around leading group brackets and episodes.
564
- """
565
- repaired = dict(result)
566
- brackets = bracket_parts(filename)
567
-
568
- if (not repaired.get("group") or (repaired.get("title") and repaired["group"] in repaired["title"])) and brackets:
569
- first_text, first_start, first_end = brackets[0]
570
- if first_start == 0 and looks_like_structural_group(first_text, filename, first_end):
571
- repaired["group"] = first_text
572
-
573
- if not repaired.get("resolution"):
574
- match = RESOLUTION_RE.search(filename)
575
- if match:
576
- repaired["resolution"] = match.group(0)
577
-
578
- source_matches = source_candidates(filename)
579
- current_source = repaired.get("source")
580
- preferred_source = source_matches[0] if source_matches else None
581
- if preferred_source and (
582
- not current_source
583
- or source_priority(preferred_source) > source_priority(str(current_source))
584
- or (
585
- source_priority(preferred_source) == source_priority(str(current_source))
586
- and preferred_source.lower() != str(current_source).lower()
587
- )
588
- ):
589
- repaired["source"] = preferred_source
590
-
591
- special_spans = special_code_spans(filename)
592
- current_special = repaired.get("special")
593
- if special_spans:
594
- preferred_special = special_spans[0][0]
595
- current_normalized = normalize_special_code(str(current_special)) if current_special else ""
596
- if not current_special or preferred_special.lower().startswith(current_normalized.lower()):
597
- repaired["special"] = preferred_special
598
- if not repaired.get("special"):
599
- for text, _start, _end in brackets:
600
- clean = text.strip()
601
- if SPECIAL_TAG_RE.search(clean):
602
- repaired["special"] = clean
603
- break
604
-
605
- episode = best_structural_episode(filename)
606
- if episode is not None and (
607
- repaired.get("episode") is None
608
- or not plausible_episode_context(filename, int(repaired["episode"]))
609
- ):
610
- repaired["episode"] = episode
611
-
612
- if repaired.get("episode") is not None and not plausible_episode_context(filename, int(repaired["episode"])):
613
- repaired["episode"] = episode
614
- if episode_comes_only_from_special_code(filename, repaired.get("episode")):
615
- repaired["episode"] = None
616
-
617
- if repaired.get("season") is None:
618
- match = SEASON_RE.search(filename)
619
- if match:
620
- value = next(group for group in match.groups() if group)
621
- season = cn_number_to_int(value)
622
- if season is not None:
623
- repaired["season"] = season
624
- if repaired.get("season") is None and repaired.get("episode") is not None:
625
- sequel = structural_sequel_marker(filename, repaired.get("group"), repaired.get("episode"))
626
- if sequel is not None:
627
- repaired["season"] = sequel[1]
628
- elif repaired.get("episode") == repaired.get("season") and not SEASON_RE.search(filename):
629
- repaired["season"] = None
630
-
631
- title = repaired.get("title")
632
- group = repaired.get("group")
633
- if group and (NOISE_META_RE.search(str(group)) or SOURCE_RE.fullmatch(str(group)) or RESOLUTION_RE.fullmatch(str(group))):
634
- repaired["group"] = None
635
- group = None
636
-
637
- if title and group and title.startswith(group):
638
- title = title[len(group):].lstrip("]】)>})》 \t-_.")
639
- repaired["title"] = title or repaired["title"]
640
-
641
- if repaired.get("episode"):
642
- repaired_title = infer_title_span(filename, group, repaired["episode"])
643
- if repaired_title:
644
- repaired["title"] = repaired_title
645
-
646
- structured_title = infer_structured_bracket_title(filename, group, repaired.get("episode"))
647
- if structured_title:
648
- repaired["title"] = structured_title
649
-
650
- if repaired.get("title") and repaired.get("season") is not None:
651
- repaired["title"] = strip_trailing_season_from_title(repaired["title"], repaired["season"])
652
- if repaired.get("episode") is None and repaired.get("group") and repaired.get("special"):
653
- inferred_title = infer_title_span(filename, repaired.get("group"), None)
654
- if inferred_title:
655
- repaired["title"] = inferred_title
656
- if repaired.get("title"):
657
- repaired["title"] = strip_title_special_codes(repaired["title"], repaired.get("special"))
658
-
659
- return repaired
660
-
661
-
662
- def structural_sequel_marker(
663
- filename: str,
664
- group: Optional[str],
665
- episode: Optional[int],
666
- ) -> Optional[Tuple[str, int]]:
667
- if episode is None:
668
- return None
669
- title_end = None
670
- if episode is not None:
671
- ep_patterns = [
672
- rf"[Ss]\d{{1,2}}[Ee]0*{episode}(?:v\d+)?",
673
- rf"\s[-_]\s*0*{episode}(?:v\d+)?(?=$|[\s\[\(【《._-])",
674
- rf"[\[\(【《]0*{episode}(?:v\d+)?[\]\)】》]",
675
- rf"#\s*0*{episode}(?:v\d+)?(?=$|[\s\[\(【《._-])",
676
- rf"(?:^|[\s._\-\[\(【《])第0*{episode}(?:[话話集])?(?=$|[\s._\-\]\)】》])",
677
- ]
678
- start = 0
679
- if group:
680
- first = BRACKET_RE.match(filename)
681
- if first and group in first.group(0):
682
- start = first.end()
683
- for pattern in ep_patterns:
684
- match = re.search(pattern, filename[start:], re.I)
685
- if match:
686
- title_end = start + match.start()
687
- break
688
- if title_end is None:
689
- return None
690
-
691
- prefix = filename[:title_end].rstrip(" \t-_.")
692
- for match in reversed(list(SEQUEL_MARKER_RE.finditer(prefix))):
693
- marker = match.group("marker")
694
- value = season_marker_number(marker)
695
- if value is None:
696
- continue
697
- tail = prefix[match.end():].strip(" \t-_.")
698
- if tail:
699
- continue
700
- if marker.lower() == "ni" and "Kakuriyo no Yadomeshi Ni" not in prefix:
701
- continue
702
- return marker, value
703
-
704
- numeric_tail = re.search(r"(?:^|[\s._-])(?P<season>[2-9])$", prefix)
705
- if numeric_tail:
706
- return numeric_tail.group("season"), int(numeric_tail.group("season"))
707
- return None
708
-
709
-
710
- def normalize_source_text(text: str) -> str:
711
- text = re.sub(r"\s+", "", text.strip())
712
- text = re.sub(r"(?i)WEB[_ ]?DL", "WEB-DL", text)
713
- text = re.sub(r"(?i)WEB[_ ]?Rip", "WebRip", text)
714
- text = re.sub(r"(?i)U[_ ]?NEXT", "U-NEXT", text)
715
- text = re.sub(r"(?i)AT[_ ]?X", "AT-X", text)
716
- return text.replace("_", "-")
717
-
718
-
719
- def source_priority(source: str) -> int:
720
- normalized = source.lower().replace("_", "-").replace(" ", "")
721
- parts = re.split(r"[&+/,]", normalized)
722
- if any(part in {"nf", "netflix", "amzn", "baha", "cr", "abema", "dsnp", "u-next", "hulu", "at-x", "web-dl", "webdl", "webrip", "web-rip", "bdrip", "bluray", "bdmv", "bd", "dvdrip", "dvd", "tvrip", "hdtv"} for part in parts):
723
- return 90
724
- if any(part in {"chs", "cht", "gb", "big5", "jpn", "jpsc", "jptc", "繁中", "简中"} for part in parts):
725
- return 70
726
- if any(part in {"x264", "x265", "h.264", "h264", "h.265", "h265", "hevc", "avc", "av1", "aac", "flac", "mp3", "dts", "opus", "10bit", "8bit", "hi10p", "ma10p", "srt", "srtx2", "ass", "assx2"} for part in parts):
727
- return 20
728
- if len(parts) > 1:
729
- return 40
730
- return 20
731
-
732
-
733
- def source_candidates(filename: str) -> List[str]:
734
- candidates: List[Tuple[int, int, str]] = []
735
- for text, start, _end in bracket_parts(filename):
736
- clean = text.strip()
737
- if SOURCE_TAG_RE.fullmatch(clean):
738
- normalized = normalize_source_text(clean)
739
- candidates.append((source_priority(normalized), -start, normalized))
740
-
741
- for match in SOURCE_RE.finditer(filename):
742
- normalized = normalize_source_text(match.group(0))
743
- candidates.append((source_priority(normalized), -match.start(), normalized))
744
-
745
- deduped: Dict[str, Tuple[int, int, str]] = {}
746
- for priority, neg_start, value in candidates:
747
- key = value.lower()
748
- if key not in deduped or (priority, neg_start) > (deduped[key][0], deduped[key][1]):
749
- deduped[key] = (priority, neg_start, value)
750
-
751
- return [value for _priority, _neg_start, value in sorted(deduped.values(), reverse=True)]
752
-
753
-
754
- def is_category_text(text: str) -> bool:
755
- normalized = re.sub(r"[\s._-]+", "", text.strip())
756
- return normalized in CATEGORY_BRACKETS
757
-
758
-
759
- def infer_structured_bracket_title(
760
- filename: str,
761
- group: Optional[str],
762
- episode: Optional[int],
763
- ) -> Optional[str]:
764
- """Pick the primary title from [group][category][title][alias][year][episode] rows."""
765
- brackets = bracket_parts(filename)
766
- if len(brackets) < 4 or episode is None:
767
- return None
768
-
769
- start_index = 0
770
- if group and brackets and brackets[0][0] == group:
771
- start_index = 1
772
-
773
- search = brackets[start_index:]
774
- if not search or not any(is_category_text(text) for text, _start, _end in search[:2]):
775
- return None
776
-
777
- episode_index = None
778
- for idx, (text, _start, _end) in enumerate(brackets):
779
- if re.fullmatch(rf"(?:EP?|#)?0*{episode}(?:v\d+)?", text.strip(), re.I):
780
- episode_index = idx
781
- break
782
- if episode_index is None:
783
- return None
784
-
785
- candidates: List[Tuple[int, str]] = []
786
- for idx in range(start_index, episode_index):
787
- text = brackets[idx][0].strip()
788
- if not text or looks_like_episode_or_meta(text):
789
- continue
790
- score = 0
791
- if SEASON_RE.search(text) or TRAILING_SEQUEL_MARKER_RE.search(text):
792
- score += 50
793
- if re.search(r"[\u3400-\u9fff]", text):
794
- score += 20
795
- if idx > start_index:
796
- score += 10
797
- candidates.append((score, text))
798
-
799
- if not candidates:
800
- return None
801
- return max(candidates, key=lambda item: item[0])[1]
802
-
803
-
804
- def best_structural_episode(filename: str) -> Optional[int]:
805
- priorities = {
806
- "season_episode": 1000,
807
- "dash_episode": 900,
808
- "bracket_episode": 850,
809
- "explicit_episode": 800,
810
- "long_episode": 750,
811
- "generic_episode": 100,
812
- }
813
- candidates: List[Tuple[int, int, int]] = []
814
- for name, pattern in EPISODE_PATTERNS:
815
- for match in pattern.finditer(filename):
816
- ep_text = match.group("ep")
817
- ep = int(ep_text)
818
- if ep == 0 or ep > 2000:
819
- continue
820
- ep_start = match.start("ep")
821
- ep_end = match.end("ep")
822
- if span_is_inside_special_code(filename, ep_start, ep_end):
823
- continue
824
- if name == "generic_episode":
825
- tail = filename[ep_end:]
826
- if re.match(r"[-_][A-Za-z]", tail):
827
- continue
828
- if not re.match(
829
- r"(?:$|[\]\)】》]|[\s._-]+(?:"
830
- r"\[[^\]]*(?:\d{3,4}[pP]|WEB|BD|BluRay|HDTV|NF|AMZN|CR|Baha|Ma10p|x26|HEVC|AVC)|"
831
- r"\d{3,4}[pP]|WEB|BD|BluRay|HDTV|NF|AMZN|CR|Baha|Ma10p|x26|HEVC|AVC|mkv|mp4|avi"
832
- r"))",
833
- tail,
834
- re.I,
835
- ):
836
- continue
837
- context = filename[max(0, ep_start - 5):ep_end + 5]
838
- if RESOLUTION_RE.search(context) or re.search(r"AAC|DDP|AC3|H\.?26[45]|x26[45]", context, re.I):
839
- continue
840
- priority = priorities[name]
841
- if 1 <= ep <= 200:
842
- priority += 20
843
- candidates.append((priority, ep_start, ep))
844
- if not candidates:
845
- return None
846
- return max(candidates, key=lambda item: (item[0], item[1]))[2]
847
-
848
-
849
- def plausible_episode_context(filename: str, episode: int) -> bool:
850
- ep_text = str(episode)
851
- padded = f"{episode:02d}"
852
- if re.search(rf"(?<![A-Za-z0-9])(?:H|x)\.?0*{re.escape(ep_text)}(?!\d)", filename, re.I):
853
- return False
854
- patterns = [
855
- rf"[Ss]\d{{1,2}}[Ee]0*{episode}(?:v\d+)?",
856
- rf"(?:^|[\s._])[-_]\s*0*{episode}(?:v\d+)?(?=$|[\s._\-\]\)】》\[])",
857
- rf"[\[\(【《](?:EP?|#)?0*{episode}(?:v\d+)?[\]\)】》]",
858
- rf"(?:^|[\s._\-\[\(【《#])(?:EP?|第|#)0*{episode}(?:v\d+)?(?:[话話集])?(?=$|[\s._\-\]\)】》])",
859
- rf"(?:^|[\s._\-\[\(【《])0*{episode}(?:v\d+)?(?=[\s._\-\]\)】》\[]+(?:\d{{3,4}}[pP]|WEB|BD|BluRay|HDTV|NF|AMZN|CR|Baha))",
860
- ]
861
- if any(re.search(pattern, filename, re.I) for pattern in patterns):
862
- return True
863
- return bool(re.search(rf"(?:^|[\s._-])(?:{re.escape(ep_text)}|{re.escape(padded)})(?:v\d+)?$", filename, re.I))
864
-
865
-
866
- def strip_trailing_season_from_title(title: str, season: int) -> str:
867
- season_text = str(season)
868
- patterns = [
869
- rf"\s+[Ss]0*{season_text}$",
870
- rf"\s+Season\s*0*{season_text}$",
871
- rf"\s+0*{season_text}$",
872
- rf"\s+第(?:0*{season_text}|{season_text})[季期部章]$",
873
- ]
874
- cleaned = title
875
- for pattern in patterns:
876
- cleaned = re.sub(pattern, "", cleaned, flags=re.I).strip(" \t-_.")
877
- match = TRAILING_SEQUEL_MARKER_RE.search(cleaned)
878
- if match and season_marker_number(match.group("marker")) == season:
879
- cleaned = cleaned[:match.start()].strip(" \t-_.")
880
- return cleaned or title
881
-
882
-
883
- def clean_inferred_title(title: str) -> str:
884
- raw_title = title.strip(" \t-_.")
885
- bracket_matches = list(BRACKET_RE.finditer(raw_title))
886
- if bracket_matches:
887
- first = bracket_matches[0]
888
- prefix = raw_title[:first.start()].strip(" \t-_.★☆")
889
- text = next(group for group in first.groups() if group is not None).strip()
890
- if text and not looks_like_episode_or_meta(text) and (
891
- not prefix
892
- or re.search(r"(?:新番|月|合集|繁|简|字幕|先行|合集|★|☆)", prefix, re.I)
893
- ):
894
- return text
895
- return raw_title.strip("[]()【】《》()")
896
-
897
-
898
- def infer_title_span(filename: str, group: Optional[str], episode: Optional[int]) -> Optional[str]:
899
- start = 0
900
- if group:
901
- first = BRACKET_RE.match(filename)
902
- if first and group in first.group(0):
903
- start = first.end()
904
- else:
905
- # Some releases put leading metadata before the actual title, e.g.
906
- # `[1080p] Title - 01`. Do not keep that wrapper as title text.
907
- while True:
908
- leading = BRACKET_RE.match(filename[start:].lstrip(" \t._-"))
909
- if not leading:
910
- break
911
- skipped_ws = len(filename[start:]) - len(filename[start:].lstrip(" \t._-"))
912
- text = next(group for group in leading.groups() if group is not None)
913
- if not looks_like_episode_or_meta(text):
914
- break
915
- start += skipped_ws + leading.end()
916
-
917
- end = None
918
- if episode is not None:
919
- ep_patterns = [
920
- rf"[Ss]\d{{1,2}}[Ee]0*{episode}(?:v\d+)?",
921
- rf"\s[-_]\s*0*{episode}(?:v\d+)?(?=$|[\s\[\(【《._-])",
922
- rf"[\[\(【《]0*{episode}(?:v\d+)?[\]\)】》]",
923
- rf"#\s*0*{episode}(?:v\d+)?(?=$|[\s\[\(【《._-])",
924
- rf"(?:^|[\s._\-\[\(【《])第0*{episode}(?:[话話集])?(?=$|[\s._\-\]\)】》])",
925
- rf"[Ee]0*{episode}(?:v\d+)?",
926
- ]
927
- for pattern in ep_patterns:
928
- match = re.search(pattern, filename[start:], re.I)
929
- if match:
930
- end = start + match.start()
931
- break
932
-
933
- if end is None:
934
- for text, bracket_start, _bracket_end in bracket_parts(filename):
935
- if bracket_start <= start:
936
- continue
937
- if (
938
- NOISE_META_RE.search(text)
939
- or RESOLUTION_RE.search(text)
940
- or SOURCE_RE.search(text)
941
- or SPECIAL_TAG_RE.search(text)
942
- or SPECIAL_CODE_RE.fullmatch(re.sub(r"[\s._-]+", "", text.strip()))
943
- ):
944
- end = bracket_start
945
- break
946
-
947
- if end is None or end <= start:
948
- return None
949
- title = clean_inferred_title(filename[start:end])
950
- return title or None
951
-
952
-
953
  def parse_filename(
954
  filename: str,
955
  model: BertForTokenClassification,
@@ -957,7 +314,6 @@ def parse_filename(
957
  id2label: Dict[int, str],
958
  max_length: int = 64,
959
  debug: bool = False,
960
- use_rules: bool = False,
961
  constrain_bio: bool = True,
962
  ) -> Dict:
963
  """
@@ -1046,14 +402,12 @@ def parse_filename(
1046
  tokens[:available],
1047
  label_strings,
1048
  tokenizer=tokenizer,
1049
- filename=filename,
1050
- use_rules=use_rules,
1051
  )
1052
  if debug:
1053
  result["_debug"] = {
1054
  "tokenizer_variant": getattr(tokenizer, "tokenizer_variant", "regex"),
1055
  "decoder": "constrained_bio" if constrain_bio else "greedy",
1056
- "postprocess": "rule_assisted" if use_rules else "thin_normalize",
1057
  "max_length": max_length,
1058
  "token_count": len(tokens),
1059
  "available_token_count": available,
@@ -1101,10 +455,6 @@ def main():
1101
  help="Maximum sequence length")
1102
  parser.add_argument("--debug", action="store_true",
1103
  help="Include tokenizer, labels, scores, and entity spans in JSON output")
1104
- parser.add_argument("--rule-assist", action="store_true",
1105
- help="Enable legacy structural post-processing rules")
1106
- parser.add_argument("--no-rule-assist", action="store_true",
1107
- help=argparse.SUPPRESS)
1108
  parser.add_argument("--no-constrained-bio", action="store_true",
1109
  help="Use greedy per-token decoding instead of constrained BIO Viterbi")
1110
  args = parser.parse_args()
@@ -1152,7 +502,6 @@ def main():
1152
  id2label,
1153
  max_length,
1154
  debug=args.debug,
1155
- use_rules=args.rule_assist and not args.no_rule_assist,
1156
  constrain_bio=not args.no_constrained_bio,
1157
  )
1158
  result["_input"] = fn
 
11
 
12
  import argparse
13
  import json
 
14
  import re
15
  import sys
16
  from typing import Dict, List, Optional, Tuple
 
97
  return 40 if re.search(r"[&+/,]", source) else 30
98
 
99
 
100
+ def normalize_source_text(text: str) -> str:
101
+ text = re.sub(r"\s+", "", text.strip())
102
+ text = re.sub(r"(?i)WEB[_ ]?DL", "WEB-DL", text)
103
+ text = re.sub(r"(?i)WEB[_ ]?Rip", "WebRip", text)
104
+ text = re.sub(r"(?i)U[_ ]?NEXT", "U-NEXT", text)
105
+ text = re.sub(r"(?i)AT[_ ]?X", "AT-X", text)
106
+ return text.replace("_", "-")
107
+
108
+
109
  def choose_thin_source(sources: List[str]) -> Optional[str]:
110
  cleaned = [normalize_source_text(source) for source in sources if normalize_field_text(source)]
111
  if not cleaned:
 
247
  tokens: List[str],
248
  labels: List[str],
249
  tokenizer: Optional[AnimeTokenizer] = None,
 
 
250
  ) -> Dict:
251
  """
252
  Convert BIO-labeled tokens into structured metadata.
 
304
 
305
  result["source"] = choose_thin_source(grouped_entities.get("SOURCE", []))
306
 
 
 
 
307
  return result
308
 
309
 
310
  def parse_filename(
311
  filename: str,
312
  model: BertForTokenClassification,
 
314
  id2label: Dict[int, str],
315
  max_length: int = 64,
316
  debug: bool = False,
 
317
  constrain_bio: bool = True,
318
  ) -> Dict:
319
  """
 
402
  tokens[:available],
403
  label_strings,
404
  tokenizer=tokenizer,
 
 
405
  )
406
  if debug:
407
  result["_debug"] = {
408
  "tokenizer_variant": getattr(tokenizer, "tokenizer_variant", "regex"),
409
  "decoder": "constrained_bio" if constrain_bio else "greedy",
410
+ "postprocess": "thin_normalize",
411
  "max_length": max_length,
412
  "token_count": len(tokens),
413
  "available_token_count": available,
 
455
  help="Maximum sequence length")
456
  parser.add_argument("--debug", action="store_true",
457
  help="Include tokenizer, labels, scores, and entity spans in JSON output")
 
 
 
 
458
  parser.add_argument("--no-constrained-bio", action="store_true",
459
  help="Use greedy per-token decoding instead of constrained BIO Viterbi")
460
  args = parser.parse_args()
 
502
  id2label,
503
  max_length,
504
  debug=args.debug,
 
505
  constrain_bio=not args.no_constrained_bio,
506
  )
507
  result["_input"] = fn
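The `normalize_source_text` helper added on the new side of this file can be exercised in isolation. The block below copies the function body from the diff; the sample inputs are made up for illustration and are not taken from the repository's data.

```python
import re


def normalize_source_text(text: str) -> str:
    # Copied from the added lines in this diff: drop all whitespace, then
    # canonicalize a few well-known source tags.
    text = re.sub(r"\s+", "", text.strip())
    text = re.sub(r"(?i)WEB[_ ]?DL", "WEB-DL", text)
    text = re.sub(r"(?i)WEB[_ ]?Rip", "WebRip", text)
    text = re.sub(r"(?i)U[_ ]?NEXT", "U-NEXT", text)
    text = re.sub(r"(?i)AT[_ ]?X", "AT-X", text)
    # Any remaining underscore becomes a hyphen.
    return text.replace("_", "-")


# Illustrative inputs only (assumed, not from the dataset).
samples = [" WEB DL ", "web_rip", "u_next", "at x", "GB_JP"]
normalized = [normalize_source_text(s) for s in samples]
print(normalized)  # ['WEB-DL', 'WebRip', 'U-NEXT', 'AT-X', 'GB-JP']
```

The patterns are unanchored, so `AT[_ ]?X` would also rewrite a longer string that happens to contain "atx". Whether that matters depends on what the model labels as SOURCE.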
onnx_inference.py CHANGED
@@ -59,10 +59,9 @@ def parse_with_onnx(
59
  model_dir: Path,
60
  onnx_path: Path,
61
  max_length: int,
62
- use_rules: bool = False,
63
  ) -> Dict:
64
  parser = OnnxFilenameParser(model_dir, onnx_path, max_length)
65
- return parser.parse(filename, use_rules=use_rules)
66
 
67
 
68
  class OnnxFilenameParser:
@@ -87,7 +86,7 @@ class OnnxFilenameParser:
87
  providers=providers or ["CPUExecutionProvider"],
88
  )
89
 
90
- def parse(self, filename: str, use_rules: bool = False) -> Dict:
91
  tokens, input_ids, attention_mask, available = encode(filename, self.tokenizer, self.max_length)
92
  logits = self.session.run(
93
  ["logits"],
@@ -100,7 +99,7 @@ class OnnxFilenameParser:
100
  token_logits = torch.from_numpy(logits[0, 1:1 + available, :])
101
  label_ids = constrained_bio_decode(token_logits, self.id2label)
102
  labels = [self.id2label.get(label_id, "O") for label_id in label_ids]
103
- result = postprocess(tokens, labels, tokenizer=self.tokenizer, filename=filename, use_rules=use_rules)
104
  result["_input"] = filename
105
  return result
106
 
@@ -111,8 +110,6 @@ def main() -> None:
111
  parser.add_argument("--model-dir", default=".", help="Directory containing vocab.json and config.json")
112
  parser.add_argument("--onnx", default="exports/anime_filename_parser.onnx", help="ONNX model path")
113
  parser.add_argument("--max-length", type=int, default=128, help="Static ONNX sequence length")
114
- parser.add_argument("--rule-assist", action="store_true", help="Enable legacy structural postprocessing")
115
- parser.add_argument("--no-rule-assist", action="store_true", help=argparse.SUPPRESS)
116
  args = parser.parse_args()
117
 
118
  result = parse_with_onnx(
@@ -120,7 +117,6 @@ def main() -> None:
120
  model_dir=Path(args.model_dir),
121
  onnx_path=Path(args.onnx),
122
  max_length=args.max_length,
123
- use_rules=args.rule_assist and not args.no_rule_assist,
124
  )
125
  print(json.dumps(result, ensure_ascii=False))
126
 
 
59
  model_dir: Path,
60
  onnx_path: Path,
61
  max_length: int,
 
62
  ) -> Dict:
63
  parser = OnnxFilenameParser(model_dir, onnx_path, max_length)
64
+ return parser.parse(filename)
65
 
66
 
67
  class OnnxFilenameParser:
 
86
  providers=providers or ["CPUExecutionProvider"],
87
  )
88
 
89
+ def parse(self, filename: str) -> Dict:
90
  tokens, input_ids, attention_mask, available = encode(filename, self.tokenizer, self.max_length)
91
  logits = self.session.run(
92
  ["logits"],
 
99
  token_logits = torch.from_numpy(logits[0, 1:1 + available, :])
100
  label_ids = constrained_bio_decode(token_logits, self.id2label)
101
  labels = [self.id2label.get(label_id, "O") for label_id in label_ids]
102
+ result = postprocess(tokens, labels, tokenizer=self.tokenizer)
103
  result["_input"] = filename
104
  return result
105
 
 
110
  parser.add_argument("--model-dir", default=".", help="Directory containing vocab.json and config.json")
111
  parser.add_argument("--onnx", default="exports/anime_filename_parser.onnx", help="ONNX model path")
112
  parser.add_argument("--max-length", type=int, default=128, help="Static ONNX sequence length")
 
 
113
  args = parser.parse_args()
114
 
115
  result = parse_with_onnx(
 
117
  model_dir=Path(args.model_dir),
118
  onnx_path=Path(args.onnx),
119
  max_length=args.max_length,
 
120
  )
121
  print(json.dumps(result, ensure_ascii=False))
122
 
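Both runtimes decode with `constrained_bio_decode`, whose body is not part of this diff. As a rough illustration of what a constrained BIO Viterbi pass does, here is a pure-Python sketch. The function name `constrained_bio_sketch`, the transition rule, and the toy scores are assumptions for illustration, not the repository's implementation.

```python
from typing import Dict, List


def allowed(prev: str, cur: str) -> bool:
    # Assumed BIO rule: I-X may only follow B-X or I-X of the same entity.
    if cur.startswith("I-"):
        entity = cur[2:]
        return prev in (f"B-{entity}", f"I-{entity}")
    return True


def constrained_bio_sketch(scores: List[List[float]], id2label: Dict[int, str]) -> List[int]:
    """Return the highest-scoring label path that never violates `allowed`."""
    if not scores:
        return []
    n = len(id2label)
    neg_inf = float("-inf")
    # A sequence may not start inside an entity.
    best = [neg_inf if id2label[j].startswith("I-") else scores[0][j] for j in range(n)]
    back: List[List[int]] = []
    for row in scores[1:]:
        new_best, pointers = [], []
        for j in range(n):
            prev_score, prev_i = max(
                (best[i], i) for i in range(n) if allowed(id2label[i], id2label[j])
            )
            new_best.append(prev_score + row[j])
            pointers.append(prev_i)
        best = new_best
        back.append(pointers)
    # Backtrack from the best final label.
    path = [max(range(n), key=lambda j: best[j])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = {0: "O", 1: "B-TITLE", 2: "I-TITLE"}
toy_scores = [[0.1, 0.0, 2.0], [0.0, 0.0, 1.0]]
greedy = [max(range(3), key=lambda j: row[j]) for row in toy_scores]
decoded = constrained_bio_sketch(toy_scores, labels)
print([labels[i] for i in greedy])   # ['I-TITLE', 'I-TITLE']
print([labels[i] for i in decoded])  # ['B-TITLE', 'I-TITLE']
```

In the toy input, greedy per-token decoding opens with `I-TITLE`, which is not a valid BIO sequence; the constrained pass repairs it to `B-TITLE`, `I-TITLE`.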
parse_eval_metrics.json CHANGED
@@ -2,7 +2,6 @@
2
  "primary_metric": "normalized_only",
3
  "modes": {
4
  "model_only": {
5
- "use_rules": false,
6
  "constrain_bio": false,
7
  "sample_count": 1024,
8
  "field_accuracy": {
@@ -309,7 +308,6 @@
309
  ]
310
  },
311
  "normalized_only": {
312
- "use_rules": false,
313
  "constrain_bio": true,
314
  "sample_count": 1024,
315
  "field_accuracy": {
@@ -533,627 +531,6 @@
533
  }
534
  }
535
  ]
536
- },
537
- "rule_assisted": {
538
- "use_rules": true,
539
- "constrain_bio": true,
540
- "sample_count": 1024,
541
- "field_accuracy": {
542
- "group": 0.9873046875,
543
- "title": 0.7265625,
544
- "season": 0.9912109375,
545
- "episode": 0.7021484375,
546
- "resolution": 1.0,
547
- "source": 0.98046875,
548
- "special": 0.951171875
549
- },
550
- "field_correct": {
551
- "group": 1011,
552
- "title": 744,
553
- "season": 1015,
554
- "episode": 719,
555
- "resolution": 1024,
556
- "source": 1004,
557
- "special": 974
558
- },
559
- "field_total": {
560
- "group": 1024,
561
- "title": 1024,
562
- "season": 1024,
563
- "episode": 1024,
564
- "resolution": 1024,
565
- "source": 1024,
566
- "special": 1024
567
- },
568
- "full_match_accuracy": 0.5068359375,
569
- "full_match_correct": 519,
570
- "full_match_total": 1024,
571
- "failures": [
572
- {
573
- "filename": "[DBD-Raws][Tokidoki Bosotto Russia-go de Dereru Tonari no Alya-san][PV][20][1080P][BDRip][HEVC-10bit][FLAC]",
574
- "errors": {
575
- "episode": {
576
- "gold": null,
577
- "pred": "20"
578
- }
579
- },
580
- "gold": {
581
- "group": "DBD-Raws",
582
- "title": "Tokidoki Bosotto Russia-go de Dereru Tonari no Alya-san",
583
- "season": null,
584
- "episode": null,
585
- "resolution": "1080P",
586
- "source": "BDRip",
587
- "special": "20"
588
- },
589
- "pred": {
590
- "group": "DBD-Raws",
591
- "title": "Tokidoki Bosotto Russia-go de Dereru Tonari no Alya-san",
592
- "season": null,
593
- "episode": 20,
594
- "resolution": "1080P",
595
- "source": "BDRip",
596
- "special": "20"
597
- }
598
- },
599
- {
600
- "filename": "[DBD-Raws][我的英雄学院 第三季][PV][02][1080P][BDRip][HEVC-10bit][FLAC]",
601
- "errors": {
602
- "title": {
603
- "gold": "我的英雄学院",
604
- "pred": "我的英雄学院 第三季"
605
- },
606
- "episode": {
607
- "gold": null,
608
- "pred": "2"
609
- }
610
- },
611
- "gold": {
612
- "group": "DBD-Raws",
613
- "title": "我的英雄学院",
614
- "season": 3,
615
- "episode": null,
616
- "resolution": "1080P",
617
- "source": "BDRip",
618
- "special": "02"
619
- },
620
- "pred": {
621
- "group": "DBD-Raws",
622
- "title": "我的英雄学院 第三季",
623
- "season": 3,
624
- "episode": 2,
625
- "resolution": "1080P",
626
- "source": "BDRip",
627
- "special": "02"
628
- }
629
- },
630
- {
631
- "filename": "[Moozzi2] Katanagatari [SP01] NCOP - 02 (BD 1920x1080 x.264 Flac)",
632
- "errors": {
633
- "episode": {
634
- "gold": "1",
635
- "pred": null
636
- }
637
- },
638
- "gold": {
639
- "group": "Moozzi2",
640
- "title": "Katanagatari",
641
- "season": null,
642
- "episode": 1,
643
- "resolution": "1920x1080",
644
- "source": "BD",
645
- "special": "NCOP - 02"
646
- },
647
- "pred": {
648
- "group": "Moozzi2",
649
- "title": "Katanagatari",
650
- "season": null,
651
- "episode": null,
652
- "resolution": "1920x1080",
653
- "source": "BD",
654
- "special": "NCOP - 02"
655
- }
656
- },
657
- {
658
- "filename": "[DBD-Raws][Ijiranaide, Nagatoro-san 2nd Attack][PV][06][1080P][BDRip][HEVC-10bit][FLAC]",
659
- "errors": {
660
- "episode": {
661
- "gold": null,
662
- "pred": "6"
663
- }
664
- },
665
- "gold": {
666
- "group": "DBD-Raws",
667
- "title": "Ijiranaide, Nagatoro-san 2nd Attack",
668
- "season": null,
669
- "episode": null,
670
- "resolution": "1080P",
671
- "source": "BDRip",
672
- "special": "06"
673
- },
674
- "pred": {
675
- "group": "DBD-Raws",
676
- "title": "Ijiranaide, Nagatoro-san 2nd Attack",
677
- "season": null,
678
- "episode": 6,
679
- "resolution": "1080P",
680
- "source": "BDRip",
681
- "special": "06"
682
- }
683
- },
684
- {
685
- "filename": "【枫叶字幕组】宠物小精灵XY&Z[第30(122)话][720P][MP4][GB_JP].mp4",
686
- "errors": {
687
- "title": {
688
- "gold": "宠物小精灵xy&z",
689
- "pred": "宠物小精灵xy&z[第30"
690
- },
691
- "episode": {
692
- "gold": "30",
693
- "pred": "122"
694
- }
695
- },
696
- "gold": {
697
- "group": "枫叶字幕组",
698
- "title": "宠物小精灵XY&Z",
699
- "season": null,
700
- "episode": 30,
701
- "resolution": "720P",
702
- "source": "GB-JP",
703
- "special": null
704
- },
705
- "pred": {
706
- "group": "枫叶字幕组",
707
- "title": "宠物小精灵XY&Z[第30",
708
- "season": null,
709
- "episode": 122,
710
- "resolution": "720P",
711
- "source": "GB-JP",
712
- "special": null
713
- }
714
- },
715
- {
716
- "filename": "[Snow-Raws] グランベルム CM&PV10 (BD 1920x1080 HEVC-YUV420P10 FLAC)",
717
- "errors": {
718
- "title": {
719
- "gold": "グランベルム",
720
- "pred": "グランベルム cm&pv10"
721
- }
722
- },
723
- "gold": {
724
- "group": "Snow-Raws",
725
- "title": "グランベルム",
726
- "season": null,
727
- "episode": null,
728
- "resolution": "1920x1080",
729
- "source": "BD",
730
- "special": "PV10"
731
- },
732
- "pred": {
733
- "group": "Snow-Raws",
734
- "title": "グランベルム CM&PV10",
735
- "season": null,
736
- "episode": null,
737
- "resolution": "1920x1080",
738
- "source": "BD",
739
- "special": "PV10"
740
- }
741
- },
742
- {
743
- "filename": "[Moozzi2] High School D×D New [SP02] NCED - 01 (BD 1920x1080 x.264 Flac)",
744
- "errors": {
745
- "episode": {
746
- "gold": "2",
747
- "pred": null
748
- }
749
- },
750
- "gold": {
751
- "group": "Moozzi2",
752
- "title": "High School D×D New",
753
- "season": null,
754
- "episode": 2,
755
- "resolution": "1920x1080",
756
- "source": "BD",
757
- "special": "NCED - 01"
758
- },
759
- "pred": {
760
- "group": "Moozzi2",
761
- "title": "High School D×D New",
762
- "season": null,
763
- "episode": null,
764
- "resolution": "1920x1080",
765
- "source": "BD",
766
- "special": "NCED - 01"
767
- }
768
- },
769
- {
770
- "filename": "[SFEO-Raws] Koimonogatari - CM_01 (BD 720P x264 10bit AAC)[783E6EF2]",
771
- "errors": {
772
- "title": {
773
- "gold": "koimonogatari",
774
- "pred": "koimonogatari - cm_01"
775
- }
776
- },
777
- "gold": {
778
- "group": "SFEO-Raws",
779
- "title": "Koimonogatari",
780
- "season": null,
781
- "episode": null,
782
- "resolution": "720P",
783
- "source": "BD",
784
- "special": "CM_01"
785
- },
786
- "pred": {
787
- "group": "SFEO-Raws",
788
- "title": "Koimonogatari - CM_01",
789
- "season": null,
790
- "episode": null,
791
- "resolution": "720P",
792
- "source": "BD",
793
- "special": "CM_01"
794
- }
795
- },
796
- {
797
- "filename": "[H720] Sangatsu no Lion CM01 (BD 1208x720 HEVC AAC)",
798
- "errors": {
799
- "group": {
800
- "gold": null,
801
- "pred": "h720"
802
- },
803
- "title": {
804
- "gold": "h",
805
- "pred": "sangatsu no lion"
806
- },
807
- "episode": {
808
- "gold": "720",
809
- "pred": null
810
- },
811
- "special": {
812
- "gold": "cm",
813
- "pred": "cm01"
814
- }
815
- },
816
- "gold": {
817
- "group": null,
818
- "title": "H",
819
- "season": null,
820
- "episode": 720,
821
- "resolution": "1208x720",
822
- "source": "BD",
823
- "special": "CM"
824
- },
825
- "pred": {
826
- "group": "H720",
827
- "title": "Sangatsu no Lion",
828
- "season": null,
829
- "episode": null,
830
- "resolution": "1208x720",
831
- "source": "BD",
832
- "special": "CM01"
833
- }
834
- },
835
- {
836
- "filename": "[FZSD&DBD-Raws][King of Prism Dramatic Prism.1][PV][08][1080P][BDRip][HEVC-10bit][FLAC]",
837
- "errors": {
838
- "episode": {
839
- "gold": null,
840
- "pred": "8"
841
- }
842
- },
843
- "gold": {
844
- "group": "FZSD&DBD-Raws",
845
- "title": "King of Prism Dramatic Prism.1",
846
- "season": null,
847
- "episode": null,
848
- "resolution": "1080P",
849
- "source": "BDRip",
850
- "special": "08"
851
- },
852
- "pred": {
853
- "group": "FZSD&DBD-Raws",
854
- "title": "King of Prism Dramatic Prism.1",
855
- "season": null,
856
- "episode": 8,
857
- "resolution": "1080P",
858
- "source": "BDRip",
859
- "special": "08"
860
- }
861
- },
862
- {
863
- "filename": "Robin Hood no Daibouken 49",
864
- "errors": {
865
- "episode": {
866
- "gold": null,
867
- "pred": "49"
868
- }
869
- },
870
- "gold": {
871
- "group": null,
872
- "title": "Robin Hood no Daibouken 49",
873
- "season": null,
874
- "episode": null,
875
- "resolution": null,
876
- "source": null,
877
- "special": null
878
- },
879
- "pred": {
880
- "group": null,
881
- "title": "Robin Hood no Daibouken 49",
882
- "season": null,
883
- "episode": 49,
884
- "resolution": null,
885
- "source": null,
886
- "special": null
887
- }
888
- },
889
- {
890
- "filename": "[Moozzi2] Paniponi Dash! [SP02] NCED - 07 [ EP.07 ] (BD 1920x1080 x.264 Flac)",
891
- "errors": {
892
- "episode": {
893
- "gold": "2",
894
- "pred": null
895
- }
896
- },
897
- "gold": {
898
- "group": "Moozzi2",
899
- "title": "Paniponi Dash!",
900
- "season": null,
901
- "episode": 2,
902
- "resolution": "1920x1080",
903
- "source": "BD",
904
- "special": "NCED - 07"
905
- },
906
- "pred": {
907
- "group": "Moozzi2",
908
- "title": "Paniponi Dash!",
909
- "season": null,
910
- "episode": null,
911
- "resolution": "1920x1080",
912
- "source": "BD",
913
- "special": "NCED - 07"
914
- }
915
- },
916
- {
917
- "filename": "[Moozzi2] Onegai My Melody [SP10] Kuromi Naration TV-CM - 01 [ 30Sec. ] (BD 1024x768 x.264 AAC)",
918
- "errors": {
919
- "title": {
920
- "gold": "onegai my melody",
921
- "pred": "onegai my melody [sp10] kuromi naration tv-cm"
922
- },
923
- "episode": {
924
- "gold": "10",
925
- "pred": "1"
926
- }
927
- },
928
- "gold": {
929
- "group": "Moozzi2",
930
- "title": "Onegai My Melody",
931
- "season": null,
932
- "episode": 10,
933
- "resolution": "1024x768",
934
- "source": "BD",
935
- "special": "CM - 01"
936
- },
937
- "pred": {
938
- "group": "Moozzi2",
939
- "title": "Onegai My Melody [SP10] Kuromi Naration TV-CM",
940
- "season": null,
941
- "episode": 1,
942
- "resolution": "1024x768",
943
- "source": "BD",
944
- "special": "CM - 01"
945
- }
946
- },
947
- {
948
- "filename": "[DBD-Raws][Kuzu no Honkai][PV][02][1080P][BDRip][HEVC-10bit][FLAC]",
949
- "errors": {
950
- "episode": {
951
- "gold": null,
952
- "pred": "2"
953
- }
954
- },
955
- "gold": {
956
- "group": "DBD-Raws",
957
- "title": "Kuzu no Honkai",
958
- "season": null,
959
- "episode": null,
960
- "resolution": "1080P",
961
- "source": "BDRip",
962
- "special": "02"
963
- },
964
- "pred": {
965
- "group": "DBD-Raws",
966
- "title": "Kuzu no Honkai",
967
- "season": null,
968
- "episode": 2,
969
- "resolution": "1080P",
970
- "source": "BDRip",
971
- "special": "02"
972
- }
973
- },
974
- {
975
- "filename": "[DBD-Raws][One Piece Wano Arc][Soushuuhen][03][1080P][BDRip][HEVC-10bit][FLAC]",
976
- "errors": {
977
- "title": {
978
- "gold": "one piece wano arc soushuuhen",
979
- "pred": "one piece wano arc"
980
- }
981
- },
982
- "gold": {
983
- "group": "DBD-Raws",
984
- "title": "One Piece Wano Arc Soushuuhen",
985
- "season": null,
986
- "episode": 3,
987
- "resolution": "1080P",
988
- "source": "BDRip",
989
- "special": null
990
- },
991
- "pred": {
992
- "group": "DBD-Raws",
993
- "title": "One Piece Wano Arc",
994
- "season": null,
995
- "episode": 3,
996
- "resolution": "1080P",
997
- "source": "BDRip",
998
- "special": null
999
- }
1000
- },
1001
- {
1002
- "filename": "[LAC][Gintama][196][GB][R10]",
1003
- "errors": {
1004
- "group": {
1005
- "gold": null,
1006
- "pred": "lac"
1007
- },
1008
- "title": {
1009
- "gold": "lac gintama 196 gb r",
1010
- "pred": "gintama"
1011
- },
1012
- "episode": {
1013
- "gold": "10",
1014
- "pred": "196"
1015
- },
1016
- "source": {
1017
- "gold": null,
1018
- "pred": "gb"
1019
- }
1020
- },
1021
- "gold": {
1022
- "group": null,
1023
- "title": "LAC Gintama 196 GB R",
1024
- "season": null,
1025
- "episode": 10,
1026
- "resolution": null,
1027
- "source": null,
1028
- "special": null
1029
- },
1030
- "pred": {
1031
- "group": "LAC",
1032
- "title": "Gintama",
1033
- "season": null,
1034
- "episode": 196,
1035
- "resolution": null,
1036
- "source": "GB",
1037
- "special": null
1038
- }
1039
- },
1040
- {
1041
- "filename": "[DBD-Raws][Date a Live][Director's Cut][PV][07][1080P][BDRip][HEVC-10bit][FLAC]",
1042
- "errors": {
1043
- "title": {
1044
- "gold": "date a live director's cut",
1045
- "pred": "date a live"
1046
- },
1047
- "episode": {
1048
- "gold": null,
1049
- "pred": "7"
1050
- }
1051
- },
1052
- "gold": {
1053
- "group": "DBD-Raws",
1054
- "title": "Date a Live Director's Cut",
1055
- "season": null,
1056
- "episode": null,
1057
- "resolution": "1080P",
1058
- "source": "BDRip",
1059
- "special": "07"
1060
- },
1061
- "pred": {
1062
- "group": "DBD-Raws",
1063
- "title": "Date a Live",
1064
- "season": null,
1065
- "episode": 7,
1066
- "resolution": "1080P",
1067
- "source": "BDRip",
1068
- "special": "07"
1069
- }
1070
- },
1071
- {
1072
- "filename": "[DBD-Raws][Nageki no Bourei wa Intai Shitai][PV][09][1080P][BDRip][HEVC-10bit][FLAC]",
1073
- "errors": {
1074
- "episode": {
1075
- "gold": null,
1076
- "pred": "9"
1077
- }
1078
- },
1079
- "gold": {
1080
- "group": "DBD-Raws",
1081
- "title": "Nageki no Bourei wa Intai Shitai",
1082
- "season": null,
1083
- "episode": null,
1084
- "resolution": "1080P",
1085
- "source": "BDRip",
1086
- "special": "09"
1087
- },
1088
- "pred": {
1089
- "group": "DBD-Raws",
1090
- "title": "Nageki no Bourei wa Intai Shitai",
1091
- "season": null,
1092
- "episode": 9,
1093
- "resolution": "1080P",
1094
- "source": "BDRip",
1095
- "special": "09"
1096
- }
1097
- },
1098
- {
1099
- "filename": "[RUELL-Next] Fruits Basket NCOP 1 (DVD 768x576 x264 AC3 384K) [FF1CA8EF]",
1100
- "errors": {
1101
- "title": {
1102
- "gold": "fruits basket",
1103
- "pred": "fruits basket ncop 1"
1104
- },
1105
- "special": {
1106
- "gold": "ncop 1",
1107
- "pred": "ncop1"
1108
- }
1109
- },
1110
- "gold": {
1111
- "group": "RUELL-Next",
1112
- "title": "Fruits Basket",
1113
- "season": null,
1114
- "episode": null,
1115
- "resolution": "768x576",
1116
- "source": "DVD",
1117
- "special": "NCOP 1"
1118
- },
1119
- "pred": {
1120
- "group": "RUELL-Next",
1121
- "title": "Fruits Basket NCOP 1",
1122
- "season": null,
1123
- "episode": null,
1124
- "resolution": "768x576",
1125
- "source": "DVD",
1126
- "special": "NCOP1"
1127
- }
1128
- },
1129
- {
1130
- "filename": "[アニメ DVD] ミスター味っ子 第69話 「島巡り磯鍋競争!7包丁人・大石老師登場」 (640x480 WMV9)",
1131
- "errors": {
1132
- "source": {
1133
- "gold": null,
1134
- "pred": "dvd"
1135
- }
1136
- },
1137
- "gold": {
1138
- "group": "アニメ DVD",
1139
- "title": "ミスター味っ子",
1140
- "season": null,
1141
- "episode": 69,
1142
- "resolution": "640x480",
1143
- "source": null,
1144
- "special": null
1145
- },
1146
- "pred": {
1147
- "group": "アニメ DVD",
1148
- "title": "ミスター味っ子",
1149
- "season": null,
1150
- "episode": 69,
1151
- "resolution": "640x480",
1152
- "source": "DVD",
1153
- "special": null
1154
- }
1155
- }
1156
- ]
1157
  }
1158
  }
1159
- }
 
2
  "primary_metric": "normalized_only",
3
  "modes": {
4
  "model_only": {
 
5
  "constrain_bio": false,
6
  "sample_count": 1024,
7
  "field_accuracy": {
 
308
  ]
309
  },
310
  "normalized_only": {
 
311
  "constrain_bio": true,
312
  "sample_count": 1024,
313
  "field_accuracy": {
 
531
  }
532
  }
533
  ]
534
  }
535
  }
536
+ }
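Each mode entry in `parse_eval_metrics.json` has the same shape: per-field counts plus a full-match count over the same samples. A minimal sketch of how such numbers could be tallied follows. `FIELDS` mirrors the keys in the JSON; `tally_exact_match` and the case-insensitive string comparison are assumptions (the lowercased values in the `errors` entries suggest some normalization, but the real comparison lives in `train.py` and is only partly visible in this diff).

```python
from typing import Dict, List, Optional, Tuple

FIELDS = ["group", "title", "season", "episode", "resolution", "source", "special"]


def norm(value) -> Optional[str]:
    # Assumed normalization: compare as lowercase strings, keep None as None.
    return None if value is None else str(value).strip().lower()


def tally_exact_match(pairs: List[Tuple[Dict, Dict]]) -> Dict:
    field_correct = {field: 0 for field in FIELDS}
    full_correct = 0
    for gold, pred in pairs:
        hits = [norm(gold.get(f)) == norm(pred.get(f)) for f in FIELDS]
        for field, hit in zip(FIELDS, hits):
            field_correct[field] += int(hit)
        full_correct += int(all(hits))
    total = len(pairs)
    return {
        "sample_count": total,
        "field_accuracy": {f: (field_correct[f] / total if total else 0.0) for f in FIELDS},
        "field_correct": field_correct,
        "full_match_correct": full_correct,
        "full_match_total": total,
        "full_match_accuracy": full_correct / total if total else 0.0,
    }


# One pair copied from the removed rule_assisted failures, plus a perfect match.
gold = {"group": "DBD-Raws", "title": "Kuzu no Honkai", "season": None, "episode": None,
        "resolution": "1080P", "source": "BDRip", "special": "02"}
bad_pred = dict(gold, episode=2)
report = tally_exact_match([(gold, bad_pred), (gold, dict(gold))])
print(report["field_correct"]["episode"], report["full_match_accuracy"])  # 1 0.5
```

With this shape, the removed `rule_assisted` block reads directly: 519 of 1024 samples matched on every field, which is the `0.5068359375` full-match accuracy recorded there.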
train.py CHANGED
@@ -230,7 +230,6 @@ def parse_exact_metrics(
230
  id2label: Dict[int, str],
231
  max_length: int,
232
  limit: Optional[int],
233
- use_rules: bool = False,
234
  constrain_bio: bool = True,
235
  ) -> Dict:
236
  """Evaluate end-to-end field exact match on filenames, not just token loss."""
@@ -249,7 +248,7 @@ def parse_exact_metrics(
249
  available = max(0, max_length - 2)
250
  tokens = tokens[:available]
251
  gold_labels = gold_labels[:available]
252
- gold = postprocess(tokens, gold_labels, tokenizer=tokenizer, filename=filename, use_rules=False)
253
  gold_entities = {label.split("-", 1)[1] for label in gold_labels if label.startswith(("B-", "I-"))}
254
  for optional_field, entity in (("episode", "EPISODE"), ("season", "SEASON")):
255
  if entity not in gold_entities:
@@ -261,7 +260,6 @@ def parse_exact_metrics(
261
  id2label,
262
  max_length=max_length,
263
  debug=False,
264
- use_rules=use_rules,
265
  constrain_bio=constrain_bio,
266
  )
267
 
@@ -298,7 +296,6 @@ def parse_exact_metrics(
298
  total = counter.get("full_total", 0)
299
  correct = counter.get("full_correct", 0)
300
  return {
301
- "use_rules": use_rules,
302
  "constrain_bio": constrain_bio,
303
  "sample_count": total,
304
  "field_accuracy": field_accuracy,
@@ -320,9 +317,8 @@ def parse_exact_metrics_all_modes(
320
  limit: Optional[int],
321
  ) -> Dict:
322
  modes = {
323
- "model_only": {"use_rules": False, "constrain_bio": False},
324
- "normalized_only": {"use_rules": False, "constrain_bio": True},
325
- "rule_assisted": {"use_rules": True, "constrain_bio": True},
326
  }
327
  return {
328
  "primary_metric": "normalized_only",
@@ -334,7 +330,6 @@ def parse_exact_metrics_all_modes(
334
  id2label,
335
  max_length,
336
  limit,
337
- use_rules=settings["use_rules"],
338
  constrain_bio=settings["constrain_bio"],
339
  )
340
  for name, settings in modes.items()
 
230
  id2label: Dict[int, str],
231
  max_length: int,
232
  limit: Optional[int],
 
233
  constrain_bio: bool = True,
234
  ) -> Dict:
235
  """Evaluate end-to-end field exact match on filenames, not just token loss."""
 
248
  available = max(0, max_length - 2)
249
  tokens = tokens[:available]
250
  gold_labels = gold_labels[:available]
251
+ gold = postprocess(tokens, gold_labels, tokenizer=tokenizer)
252
  gold_entities = {label.split("-", 1)[1] for label in gold_labels if label.startswith(("B-", "I-"))}
253
  for optional_field, entity in (("episode", "EPISODE"), ("season", "SEASON")):
254
  if entity not in gold_entities:
 
260
  id2label,
261
  max_length=max_length,
262
  debug=False,
 
263
  constrain_bio=constrain_bio,
264
  )
265
 
 
296
  total = counter.get("full_total", 0)
297
  correct = counter.get("full_correct", 0)
298
  return {
 
299
  "constrain_bio": constrain_bio,
300
  "sample_count": total,
301
  "field_accuracy": field_accuracy,
 
317
  limit: Optional[int],
318
  ) -> Dict:
319
  modes = {
320
+ "model_only": {"constrain_bio": False},
321
+ "normalized_only": {"constrain_bio": True},
 
322
  }
323
  return {
324
  "primary_metric": "normalized_only",
 
330
  id2label,
331
  max_length,
332
  limit,
 
333
  constrain_bio=settings["constrain_bio"],
334
  )
335
  for name, settings in modes.items()