"""
Synthetic training data generator for anime filename parser.

Generates labeled anime filenames using template filling with content pools.
Each sample is a filename split into tokens, with one BIO label per token.

Output format: JSONL (one JSON object per line)
  {"tokens": [...], "labels": [...]}
"""

import json
import os
import random
import re
from typing import Dict, List, Optional, Tuple

from config import Config
from tokenizer import AnimeTokenizer, create_tokenizer


# ══════════════════════════════════════════════════════════════
# Content Pools
# ══════════════════════════════════════════════════════════════

# ---- TITLES (200+ mixed CHS/CHT/EN/JP) ----
TITLES: List[str] = [
    # Chinese (100+)
    "葬送的芙莉莲", "葬送的芙莉蓮", "咒术回战", "咒術迴戰",
    "鬼灭之刃", "鬼滅之刃", "间谍过家家", "SPY×FAMILY",
    "葬送のフリーレン", "进击的巨人", "進擊的巨人",
    "钢之炼金术师", "鋼之煉金術師", "新世纪福音战士",
    "新世纪エヴァンゲリオン", "死亡笔记", "DEATH NOTE",
    "命运石之门", "Steins;Gate", "魔法少女小圆",
    "魔法少女まどか☆マギカ", "反叛的鲁路修", "コードギアス",
    "未闻花名", "あの日見た花の名前を僕達はまだ知らない",
    "Clannad", "Angel Beats!", "輕音少女", "K-ON!",
    "紫罗兰永恒花园", "ヴァイオレット・エヴァーガーデン",
    "来自深渊", "メイドインアビス", "无职转生",
    "無職転生", "转生成史莱姆", "転生したらスライムだった件",
    "关于我转生变成史莱姆这档事", "Re:从零开始的异世界生活",
    "Re:ゼロから始める異世界生活", "辉夜大小姐想让我告白",
    "かぐや様は告らせたい", "我的青春恋爱物语果然有问题",
    "やはり俺の青春ラブコメはまちがっている",
    "刀剑神域", "ソードアート・オンライン",
    "OVERLORD", "为美好的世界献上祝福",
    "この素晴らしい世界に祝福を", "实力至上主义的教室",
    "ようこそ実力至上主義の教室へ", "86-不存在的战区",
    "86-エイティシックス-", "孤独摇滚", "ぼっち・ざ・ろっく",
    "Girls Band Cry", "我心里危险的东西",
    "僕の心のヤバイやつ", "药屋少女的呢喃",
    "薬屋のひとりごと", "迷宫饭", "ダンジョン飯",
    "我推的孩子", "【推しの子】", "葬送的芙莉莲 第二季",
    "死神", "BLEACH", "海贼王", "ONE PIECE",
    "火影忍者", "NARUTO", "猎人", "HUNTER×HUNTER",
    "龙珠", "DRAGON BALL", "灌篮高手", "SLAM DUNK",
    "银魂", "GIN TAMA", "Fate/stay night",
    "Fate/Grand Order", "Fate/Zero", "攻壳机动队",
    "攻殻機動隊", "星际牛仔", "カウボーイビバップ",
    "混沌武士", "サムライチャンプルー", "虫师",
    "蟲師", "三月的狮子", "3月のライオン",
    "昭和元禄落语心中", "昭和元禄落語心中",
    "白箱", "SHIROBAKO", "比宇宙更远的地方",
    "宇宙よりも遠い場所", "摇曳露营", "ゆるキャン△",
    "赛马娘", "ウマ娘", "偶像大师",
    "アイドルマスター", "Love Live!", "lovelive!",
    "BanG Dream!", "少女歌剧", "Revue Starlight",
    "奇蛋物语", "ワンダーエッグ・プライオリティ",
    "莉可丽丝", "リコリス・リコイル", "夏日重现",
    "サマータイムレンダ", "边缘行者", "CYBERPUNK EDGERUNNERS",

    # English/Romanized (50+)
    "Sousou no Frieren", "Jujutsu Kaisen", "Kimetsu no Yaiba",
    "Attack on Titan", "Shingeki no Kyojin", "Fullmetal Alchemist",
    "Neon Genesis Evangelion", "Steins Gate",
    "Puella Magi Madoka Magica", "Code Geass",
    "Violet Evergarden", "Made in Abyss", "Mushoku Tensei",
    "That Time I Got Reincarnated as a Slime",
    "Re Zero Starting Life in Another World",
    "Kaguya-sama Love is War", "Sword Art Online",
    "Konosuba God's Blessing on this Wonderful World",
    "Classroom of the Elite", "Solo Leveling",
    "Bocchi the Rock", "Dungeon Meshi", "Delicious in Dungeon",
    "Oshi no Ko", "My Hero Academia", "Demon Slayer",
    "Chainsaw Man", "Hell's Paradise", "Jigokuraku",
    "Vinland Saga", "Ranking of Kings", "Ousama Ranking",
    "Spy x Family", "Cyberpunk Edgerunners",
    "Lycoris Recoil", "Summer Time Rendering",
    "Wonder Egg Priority", "Odd Taxi",
    "Sonny Boy", "Wonder Egg Priority",
    "Super Cub", "Yuru Camp", "Laid-Back Camp",

    # Numbers in title (20+)
    "86 Eighty Six", "3-gatsu no Lion",
    "5-toubun no Hanayome", "5等分の花嫁",
    "7 Seeds", "7-seeds",
    "91 Days", "91Days",
    "100-man no Inochi no Ue ni Ore wa Tatteiru",
    "100万の命の上に俺は立っている",
    "300-en no Otsuki Samurai",
    "5000兆円欲しい！",
    "2.43 清陰高校男子バレー部",
    "22/7", "24 2",
    "8 Girls", "80万再生",

    # With punctuation (20+)
    "K-ON!", "NEW GAME!", "GO! GO! 575",
    "Wake Up, Girls!", "Show By Rock!!",
    "Hello!! KINMOZA", "Hi☆sCoool! セハガール",
    "AKB0048", "C³", "WIXOSS",
    "√Letter", "√3 (ルートスリー)",
    "DOG DAYS'", "DOG DAYS''",
    "RAIL WARS!", "M3～ソノ黒キ鋼～",
    "D.C.III ~Da Capo III~",
    "B-Project", "Fate/Extra",
    "DIABOLIK LOVERS", "B-PROJECT",
]

# ---- GROUPS (50+) ----
GROUPS_EN_BRACKET: List[str] = [
    "[ANi]", "[Baha]", "[VCB-Studio]", "[Lilith-Raws]",
    "[SubsPlease]", "[Erai-raws]", "[DBD-Raws]", "[AI-Raws]",
    "[Ohys-Raws]", "[Moozzi2]", "[NT-Raws]", "[Ember]",
    "[Judas]", "[Leopard-Raws]", "[m.3.3.w]", "[Kagura]",
    "[HorribleSubs]", "[DeadFish]", "[CBM]", "[FFF]",
    "[SSA]", "[C1]", "[WOLF]", "[CKJ]",
    "[Zero-Raws]", "[dHD]", "[UCCUSS]", "[Tk]",
    "[ReinForce]", "[Kuroi-Raws]", "[Kamigami]", "[DIY]",
    "[QTS]", "[XEI]", "[Snow-Raws]", "[Lv.1]",
    "[NAOKI]", "[Hakata]", "[PHZ]", "[Sakurato]",
    "[YYQ]", "[Beatrice]", "[Rally]", "[SweetSub]",
    "[DHR]", "[HR]", "[Hakugetsu]", "[DMG]",
    "[HYSUB]", "[POPGO]", "[SumiSora]", "[KPDM]",
    "[CASO]", "[KTXP]", "[Snow-Raws]", "[philosophy-raws]",
    "[Coalgirls]", "[Elysium]", "[FFF]", "[B-MXT]", "[ANK-Raws]",
]

GROUPS_CN_BRACKET: List[str] = [
    "【喵萌奶茶屋】", "【桜都字幕组】", "【幻樱字幕组】",
    "【极影字幕社】", "【动漫国字幕组】", "【澄空学园】",
    "【华盟字幕社】", "【千夏字幕组】", "【铃风字幕组】",
    "【白月字幕组】", "【风之圣殿】", "【诸神字幕组】",
    "【雪飘工作室】", "【茉语月译】", "【爱恋字幕社】",
    "【天月动工】", "【星空字幕组】", "【蓝调动漫】",
    "【森罗万千】", "【轻之国度】",
]

GROUPS_NO_BRACKET: List[str] = [
    "ANi", "Baha", "Nekomoe kissaten",
    "SubsPlease", "Erai-raws",
    "VCB-Studio", "Moozzi2",
    "HorribleSubs", "DeadFish",
    "Kamigami", "ReinForce",
    "Lilith-Raws", "Ohys-Raws",
]

GROUPS_PAREN: List[str] = [
    "(喵萌奶茶屋)", "(桜都字幕组)", "(幻樱字幕组)",
    "(极影字幕社)", "(动漫国字幕组)", "(澄空学园)",
    "(VCB-Studio)", "(Erai-raws)",
]

# ---- SEASONS (20+ variations) ----
SEASONS: List[str] = [
    "S1", "S2", "S3", "S4", "S5",
    "S01", "S02", "S03", "S04",
    "Season 1", "Season 2", "Season 3",
    "第一季", "第二季", "第三季", "第四季",
    "1st Season", "2nd Season", "3rd Season",
    "Seasons 1", "Seasons 2",
    "S1Season", "S2Season",
]

# ---- EPISODES (15+ variations) ----
EPISODES: List[str] = [f"{i:02d}" for i in range(1, 100)]  # 01-99
EPISODE_PREFIXES: List[str] = [
    "EP", "Ep", "ep", "E",
]
EPISODE_CN: List[str] = [f"第{i}话" for i in range(1, 100)] + [f"第{i}話" for i in range(1, 100)]
EPISODE_HASH: List[str] = [f"#{i:02d}" for i in range(1, 100)]

# ---- META: RESOLUTION ----
RESOLUTIONS: List[str] = [
    "[1080P]", "[1080p]", "[720P]", "[720p]",
    "[4K]", "[2160P]", "[2160p]",
    "[480P]", "[480p]", "[360P]", "[360p]",
    "1080P", "1080p", "720P", "720p",
    "1920x1080", "1280x720", "3840x2160",
]

# ---- META: SOURCE ----
SOURCES: List[str] = [
    "[WEB-DL]", "[WEBDL]", "[BDRip]", "[BDMV]",
    "[DVD]", "[TVRip]", "[CR]", "[Netflix]",
    "[AMZN]", "[Baha]", "[WebRip]",
    "WEB-DL", "BDRip", "Baha",
]

# ---- META: CODEC ----
CODECS: List[str] = [
    "[x265]", "[x264]", "[HEVC]", "[AVC]", "[AV1]",
    "[H264]", "[H265]", "[h264]", "[h265]",
    "x265", "x264", "HEVC",
]

# ---- META: AUDIO ----
AUDIO: List[str] = [
    "[FLAC]", "[AAC]", "[MP3]", "[DTS]",
    "FLAC", "AAC",
]

# ---- META: LANGUAGE ----
LANGUAGES: List[str] = [
    "[CHT]", "[GB]", "[JP]", "[简日双语]",
    "[CHS]", "[BIG5]",
    "CHT", "GB", "JP",
]

# ---- COMBINED META ----
ALL_METAS: List[str] = RESOLUTIONS + SOURCES + CODECS + AUDIO + LANGUAGES
ALL_METAS_BRACKET: List[str] = [m for m in ALL_METAS if m.startswith("[") or m.startswith("【") or m.startswith("(")]

# ---- SPECIAL ----
SPECIALS: List[str] = [
    "[Movie]", "[OVA]", "[OAD]", "[SP]",
    "[剧场版]", "[特別篇]", "[特别篇]", "[NC]",
    "[OP]", "[ED]", "[PV]", "[CM]",
    "Movie", "OVA", "OAD", "SP",
]

# ---- SEPARATORS ----
SEPARATORS: List[str] = [" - ", " ", "_", " | ", "～", "~", "-", " |"]


# ══════════════════════════════════════════════════════════════
# Templates
# ══════════════════════════════════════════════════════════════

TEMPLATES: List[str] = [
    # Standard: GROUP + TITLE + SEASON + SEP + EPISODE + META
    "{group} {title} {season} {sep} {episode} {meta1} {meta2}",
    "{group} {title} {season} {episode} {meta1} {meta2} {meta3}",
    "{group} {title} {episode} {meta1} {meta2}",
    "{group} {title} {season} {sep} {episode} {meta1}",

    # No GROUP
    "{title} {season} {sep} {episode} {meta1} {meta2}",
    "{title} {episode} {meta1} {meta2} {meta3}",

    # GROUP at end
    "{title} {season} {episode} {meta1} {group}",

    # META before title
    "{group} {meta1} {meta2} {title} {season} {episode}",

    # Special type
    "{group} {title} {special} {sep} {episode} {meta1}",
    "{group} {title} {special} {meta1} {meta2}",

    # CN bracket GROUP
    "【{group_cn}】{title} {season} {episode} {meta1} {meta2}",
    "【{group_cn}】{title} {episode} {meta1}",

    # CN decorative
    "【{group_cn}】★新番★{title} {episode} {meta1}",

    # Paren GROUP
    "({group_cn_paren}) {title} {season} {episode} {meta1}",

    # No bracket GROUP
    "{group_no_bracket} {title} {season} {sep} {episode} {meta1}",

    # OVA/Movie
    "{group} {title} {special} {meta1} {meta2}",

    # Season with composite episode
    "{group} {title} {season} {sep} {episode} {meta1} {meta2} {meta3} {meta4}",

    # Minimal
    "{title} {episode}",

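    # Title first, meta after. Each placeholder name is filled only once per
    # template, so a repeated {meta_bracket} slot would be left unfilled.
    "{title} {sep} {episode} [{meta_bracket}]",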
]


# ══════════════════════════════════════════════════════════════
# Label mapping
# ══════════════════════════════════════════════════════════════

LABEL_MAP: Dict[str, str] = {
    "title": "TITLE",
    "season": "SEASON",
    "episode": "EPISODE",
    "group": "GROUP",
    "special": "SPECIAL",
    "resolution": "RESOLUTION",
    "source": "SOURCE",
    "codec": "SOURCE",      # CODEC merged into SOURCE
    "audio": "SOURCE",
    "language": "SOURCE",
    "sep": "O",
    "decoration": "O",
    "noise": "O",
}

# Additional meta tokens to categorize
META_RESOLUTION_TOKENS: List[str] = [
    "1080P", "1080p", "720P", "720p", "4K", "2160P", "2160p",
    "480P", "480p", "360P", "360p",
    "1920x1080", "1280x720", "3840x2160",
]

META_SOURCE_TOKENS: List[str] = [
    "WEB-DL", "WEBDL", "BDRip", "BDMV", "DVD", "TVRip",
    "CR", "Netflix", "AMZN", "Baha", "WebRip",
]

META_CODEC_TOKENS: List[str] = [
    "x265", "x264", "HEVC", "AVC", "AV1", "H264", "H265", "h264", "h265",
]

META_AUDIO_TOKENS: List[str] = [
    "FLAC", "AAC", "MP3", "DTS",
]

META_LANG_TOKENS: List[str] = [
    "CHT", "GB", "JP", "CHS", "BIG5", "简日双语",
]


def categorize_meta_token(token: str) -> str:
    """Determine the entity type for a meta token (RESOLUTION or SOURCE)."""
    # Strip brackets for matching
    clean = token.strip("[]()【】")
    if clean in META_RESOLUTION_TOKENS:
        return "RESOLUTION"
    # Source, codec, audio, and language tokens are all merged into SOURCE,
    # which is also the default for unrecognized meta tokens.
    return "SOURCE"


def assign_bio(tokens: List[str], token_category: List[str]) -> List[str]:
    """
    Assign BIO labels to tokens based on their categories.

    Handles multi-token entities (TITLE, GROUP) that may span across
    separator tokens (spaces, etc.). For example, "Attack on Titan"
    should have B-TITLE for "Attack", I-TITLE for "on", I-TITLE for "Titan"
    even though there are O-labeled spaces between them.

    Args:
        tokens: List of token strings
        token_category: Category for each token (title, season, episode, etc.)

    Returns:
        List of BIO label strings (B-TITLE, I-TITLE, O, etc.)
    """
    labels: List[str] = []
    active_entity: Optional[str] = None  # tracks the current entity across O tokens

    for token, cat in zip(tokens, token_category):
        entity = LABEL_MAP.get(cat, "O")

        if entity == "O":
            labels.append("O")
            # Don't reset active_entity: this lets multi-word entities
            # span across separator tokens (spaces, punctuation)
        elif entity in ("SEASON", "EPISODE", "SPECIAL", "RESOLUTION", "SOURCE"):
            # Single-token or always-B entities
            labels.append(f"B-{entity}")
            active_entity = None
        else:
            # Multi-token entities (TITLE, GROUP)
            if entity == active_entity:
                labels.append(f"I-{entity}")
            else:
                labels.append(f"B-{entity}")
                active_entity = entity

    return labels
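

# Illustrative sketch (an addition, not part of the original generator): a
# sanity check for the convention described in assign_bio, where an I-X label
# may follow B-X or I-X with only "O" labels in between. The helper name
# find_bio_violations is hypothetical.
def find_bio_violations(labels):
    """Return indices of I-X labels that do not continue an open X entity."""
    violations = []
    open_entity = None  # entity type that an I- label is allowed to continue
    for i, label in enumerate(labels):
        if label == "O":
            continue  # O tokens do not close the open entity
        prefix, _, entity = label.partition("-")
        if prefix == "B":
            open_entity = entity
        elif prefix == "I" and entity != open_entity:
            violations.append(i)
    return violations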


# ══════════════════════════════════════════════════════════════
# Sample Generation
# ══════════════════════════════════════════════════════════════

def pick_random(pool: list):
    """Pick a random item from a list."""
    return random.choice(pool)


# ---- Category tracking markers ----
# Using Unicode Private Use Area characters that NEVER appear in anime filenames.
# These are single characters that the tokenizer treats as "Other" → single-char tokens.
# They cannot be merged into bracket content, making them robust markers.
_CAT_PUA_BASE = '\uE100'  # Start of PUA region for category markers
_CAT_MARKER_END_CHAR = '\uE000'  # End marker character
_CAT_INDEX: Dict[str, int] = {
    "title": 0, "season": 1, "episode": 2, "special": 3,
    "group": 4, "resolution": 5, "source": 6, "sep": 7, "decoration": 8,
}
_CAT_FROM_INDEX: Dict[int, str] = {v: k for k, v in _CAT_INDEX.items()}
# Pre-compute marker characters
_CAT_MARKER_CHARS: Dict[str, str] = {
    cat: chr(ord(_CAT_PUA_BASE) + idx)
    for cat, idx in _CAT_INDEX.items()
}


def _cat_marker(category: str) -> str:
    """Get a category start marker character."""
    return _CAT_MARKER_CHARS.get(category, _CAT_MARKER_CHARS["title"])


# Regex to detect bracket-wrapped placeholders: 【{placeholder}】, ({placeholder}), etc.
_BRACKET_WRAP_RE = re.compile(r'([\[（【《\(])\{(\w+)\}([\]）】》\)])')


def generate_template_filled(template: str) -> Tuple[str, Dict[str, str]]:
    """
    Fill a template with random content from pools.

    Returns:
        (filled_string, category_map) where each placeholder's value
        is wrapped with category marker characters for tracking.

    For bracket-wrapped placeholders (e.g., 【{group_cn}】), markers
    are placed OUTSIDE the brackets to prevent marker-bracket merging.
    """
    fields: Dict[str, str] = {}
    marker_placeholders: List[str] = []

    for placeholder in ["group", "group_cn", "group_cn_paren", "group_no_bracket",
                        "title", "season", "episode", "special",
                        "meta1", "meta2", "meta3", "meta4",
                        "sep", "meta_bracket", "decoration"]:
        if "{" + placeholder + "}" not in template:
            continue

        if placeholder == "title":
            val = pick_random(TITLES)
            cat = "title"
        elif placeholder == "season":
            val = pick_random(SEASONS)
            cat = "season"
        elif placeholder == "episode":
            choice = random.random()
            if choice < 0.6:
                val = pick_random(EPISODES)
            elif choice < 0.8:
                prefix = pick_random(EPISODE_PREFIXES)
                val = prefix + pick_random(EPISODES)
            else:
                val = pick_random(EPISODE_CN)
            cat = "episode"
        elif placeholder == "group":
            val = pick_random(GROUPS_EN_BRACKET)
            cat = "group"
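        elif placeholder == "group_cn":
            # Pool entries carry their own 【】, and the template supplies the
            # brackets, so strip them to avoid doubled brackets.
            val = pick_random(GROUPS_CN_BRACKET).strip("【】")
            cat = "group"
        elif placeholder == "group_cn_paren":
            # Same for "(...)" entries: the template adds the parentheses.
            val = pick_random(GROUPS_PAREN).strip("()")
            cat = "group"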
        elif placeholder == "group_no_bracket":
            val = pick_random(GROUPS_NO_BRACKET)
            cat = "group"
        elif placeholder == "special":
            val = pick_random(SPECIALS)
            cat = "special"
        elif placeholder == "meta_bracket":
            # Checked before the generic "meta" prefix test below, which would
            # otherwise shadow this branch. The template supplies the brackets,
            # so the pool entry's own brackets are stripped.
            val = pick_random(ALL_METAS_BRACKET).strip("[]()【】")
            cat = "resolution" if val in META_RESOLUTION_TOKENS else "source"
        elif placeholder.startswith("meta"):
            meta_type = random.random()
            if meta_type < 0.3:
                val = pick_random(RESOLUTIONS)
                cat = "resolution"
            elif meta_type < 0.5:
                val = pick_random(SOURCES)
                cat = "source"
            elif meta_type < 0.65:
                val = pick_random(CODECS)
                cat = "source"
            elif meta_type < 0.8:
                val = pick_random(AUDIO)
                cat = "source"
            else:
                val = pick_random(LANGUAGES)
                cat = "source"
        elif placeholder == "sep":
            val = pick_random(SEPARATORS)
            cat = "sep"
        elif placeholder == "decoration":
            decos = ["★04月新番★", "★07月新番★", "★10月新番★", "★01月新番★",
                     "★2024★", "★2025★", "★2026★",
                     "[完]", "[合集]", "【完结】"]
            val = pick_random(decos)
            cat = "decoration"
        else:
            val = placeholder
            cat = "O"

        fields[placeholder] = cat
        placeholder_slot = "{" + placeholder + "}"

        # Check if placeholder is wrapped in template brackets: 【{x}】, ({x}), etc.
        # If so, place markers OUTSIDE the brackets to prevent merging.
        bracket_match = _BRACKET_WRAP_RE.search(template)
        if bracket_match and bracket_match.group(2) == placeholder:
            open_bracket = bracket_match.group(1)
            close_bracket = bracket_match.group(3)
            replacement = f"{_cat_marker(cat)}{open_bracket}{val}{close_bracket}{_CAT_MARKER_END_CHAR}"
            template = template.replace(
                f"{open_bracket}{placeholder_slot}{close_bracket}",
                replacement,
                1
            )
        else:
            # Normal non-wrapped placeholder
            template = template.replace(
                placeholder_slot,
                f"{_cat_marker(cat)}{val}{_CAT_MARKER_END_CHAR}",
                1
            )

    return template, fields
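

# Illustrative sketch (an addition, not the generator's own logic): filling
# every "{name}" slot with re.sub and a callback, so a name that appears twice
# in a template gets an independent draw each time. demo_fill_template, pools,
# and rng are hypothetical names, and no category markers are emitted here.
def demo_fill_template(template, pools, rng):
    """Replace each {name} with a random pick from pools[name]."""
    import re  # local import keeps the sketch self-contained

    def pick(match):
        pool = pools.get(match.group(1))
        # Slots without a pool are left untouched
        return rng.choice(pool) if pool else match.group(0)

    return re.sub(r"\{(\w+)\}", pick, template)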


def generate_sample(tokenizer: AnimeTokenizer, templates: List[str]) -> Dict:
    """
    Generate one labeled training sample.

    Placeholder values are wrapped with category marker characters
    (a PUA start marker before the value and U+E000 after it) so that
    assign_token_categories can track which token belongs to which category.

    Returns:
        {"tokens": [...], "labels": [...]} where labels are in BIO format.
    """
    template = pick_random(templates)
    filled_text, category_map = generate_template_filled(template)

    # Add noise: random decoration
    if random.random() < 0.05:
        deco = pick_random(["★04月新番★", "★07月新番★", "★10月新番★", "★01月新番★",
                           "[完]", "【完结】", "★2024★", "★2025★"])
        if random.random() < 0.5:
            filled_text = _cat_marker("decoration") + deco + _CAT_MARKER_END_CHAR + filled_text
        else:
            filled_text = filled_text + _cat_marker("decoration") + deco + _CAT_MARKER_END_CHAR

    # Tokenize
    tokens = tokenizer.tokenize(filled_text)
    if not tokens:
        return generate_sample(tokenizer, templates)  # retry on empty

    # Assign categories using marker tokens (also filters out markers)
    filtered_tokens, token_categories = assign_token_categories(tokens, filled_text, category_map)

    # Retry if all tokens were filtered out (shouldn't happen, but safety)
    if not filtered_tokens:
        return generate_sample(tokenizer, templates)

    # Generate BIO labels
    labels = assign_bio(filtered_tokens, token_categories)

    assert len(filtered_tokens) == len(labels), f"Token/label mismatch: {len(filtered_tokens)} vs {len(labels)}"

    return {
        "tokens": filtered_tokens,
        "labels": labels,
    }


def assign_token_categories(
    tokens: List[str],
    filled_text: str,
    category_map: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """
    Assign categories to tokens using embedded Unicode PUA marker chars.

    Category start markers are PUA code points U+E100 to U+E108, and U+E000 is
    the end marker. The tokenizer emits each as a single-character token, so
    they bracket each placeholder's content and cannot be merged into it.

    Returns:
        (filtered_tokens, categories) with marker chars removed.
    """
    filtered_tokens: List[str] = []
    categories: List[str] = []
    current_category: Optional[str] = None
    markers_encountered = 0

    for token in tokens:
        # Check for end marker
        if len(token) == 1 and token == _CAT_MARKER_END_CHAR:
            current_category = None
            markers_encountered += 1
            continue

        # Check for category start marker (PUA characters)
        if len(token) == 1 and _CAT_PUA_BASE <= token <= chr(ord(_CAT_PUA_BASE) + 8):
            idx = ord(token) - ord(_CAT_PUA_BASE)
            current_category = _CAT_FROM_INDEX.get(idx, None)
            markers_encountered += 1
            continue

        filtered_tokens.append(token)
        if current_category is not None:
            categories.append(current_category)
        else:
            categories.append(_heuristic_category(token))

    # If no markers were found, use pure heuristics as fallback
    if markers_encountered == 0:
        categories = [_heuristic_category(t) for t in filtered_tokens]

    return filtered_tokens, categories
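

# Illustrative sketch (an addition, not the original pipeline): the marker
# scheme above, reduced to per-character "tokens" so it runs without
# AnimeTokenizer. demo_strip_markers and its local constants are hypothetical
# and only mirror the U+E100 start / U+E000 end convention.
def demo_strip_markers(marked_text):
    """Split marked text into (chars, categories), dropping marker chars."""
    start_base, end_char = 0xE100, "\uE000"
    names = ["title", "season", "episode"]  # first three categories, by index
    chars, cats, current = [], [], None
    for ch in marked_text:
        if ch == end_char:
            current = None  # end marker closes the current category
        elif start_base <= ord(ch) < start_base + len(names):
            current = names[ord(ch) - start_base]
        else:
            chars.append(ch)
            cats.append(current)  # None means "outside any marker pair"
    return chars, cats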


def _heuristic_category(token: str) -> str:
    """
    Fallback heuristic category assignment for tokens not covered by markers.

    This is used only when a token appears outside the marker system
    (e.g., literal template text such as the spaces between placeholders).
    Kept conservative to avoid mislabeling.
    """
    if token in SEPARATORS or token in " -_|～~.":
        return "sep"

    if token.startswith("[") or token.startswith("(") or token.startswith("【"):
        clean = token.strip("[]()【】")
        # Check group
        if any(g.strip("[]()【】") == clean for g in GROUPS_EN_BRACKET + GROUPS_CN_BRACKET + GROUPS_PAREN):
            return "group"
        # Check special
        if any(s.strip("[]()【】") == clean or s == clean for s in SPECIALS):
            return "special"
        # Otherwise meta
        cat = categorize_meta_token(token)
        return cat.lower()

    # Season: only if exact known patterns
    if re.match(r'^[Ss]\d+$', token) or token.startswith("Season") or "季" in token:
        return "season"

    # Episode: only if strong patterns
    if re.match(r'^[Ee][Pp]?\d{1,3}$', token):   # E01, EP01
        return "episode"
    if re.match(r'^#\d{1,3}$', token):            # #01
        return "episode"
    if re.match(r'^第\d+[话話]$', token):          # 第7话
        return "episode"
    if re.match(r'^\d{1,2}[Vv]\d*$', token):      # 01v2
        return "episode"

    # Meta tokens (without brackets): resolution tokens keep their own
    # category, everything else is merged into "source"
    clean = token.strip("[]()【】")
    if token in ALL_METAS or clean in META_RESOLUTION_TOKENS + META_SOURCE_TOKENS + META_CODEC_TOKENS + META_AUDIO_TOKENS + META_LANG_TOKENS:
        return categorize_meta_token(token).lower()

    # Default: title
    return "title"



# ══════════════════════════════════════════════════════════════
# Main script
# ══════════════════════════════════════════════════════════════

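def generate_dataset(num_samples: int, tokenizer: AnimeTokenizer, output_path: str) -> List[List[str]]:
    """
    Generate a synthetic dataset and save to JSONL.

    Args:
        num_samples: Number of samples to generate
        tokenizer: AnimeTokenizer instance
        output_path: Path to output JSONL file

    Returns:
        The token list of every generated sample, for vocabulary building.
    """
    # dirname is "" for a bare filename, and os.makedirs("") raises
    os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)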

    all_token_lists: List[List[str]] = []
    with open(output_path, 'w', encoding='utf-8') as f:
        for i in range(num_samples):
            sample = generate_sample(tokenizer, TEMPLATES)
            f.write(json.dumps(sample, ensure_ascii=False) + '\n')
            all_token_lists.append(sample["tokens"])

            if (i + 1) % 10000 == 0:
                print(f"Generated {i + 1}/{num_samples} samples...")

    print(f"Total samples generated: {num_samples}")
    return all_token_lists
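

# Illustrative sketch (an addition): reading back the JSONL written by
# generate_dataset, one {"tokens": [...], "labels": [...]} object per line.
# The name iter_jsonl_samples is hypothetical.
def iter_jsonl_samples(path):
    """Yield (tokens, labels) pairs from a JSONL file, skipping blank lines."""
    import json  # local import keeps the sketch self-contained

    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            sample = json.loads(line)
            yield sample["tokens"], sample["labels"]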


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Generate synthetic anime filename dataset")
    parser.add_argument("--num-samples", type=int, default=100_000,
                        help="Number of samples to generate (default: 100000)")
    parser.add_argument("--output", type=str, default="data/synthetic.jsonl",
                        help="Output path (default: data/synthetic.jsonl)")
    parser.add_argument("--tokenizer", choices=["regex", "char"], default="regex",
                        help="Tokenizer variant used to generate the JSONL data")
    parser.add_argument("--vocab-output", type=str, default=None,
                        help="Vocab path (default: vocab.json or vocab.char.json in the output directory)")
    parser.add_argument("--seed", type=int, default=42,
                        help="Random seed (default: 42)")
    args = parser.parse_args()

    random.seed(args.seed)

    print(f"Generating {args.num_samples} synthetic samples...")
    print(f"Output: {args.output}")

    tokenizer = create_tokenizer(args.tokenizer)

    token_lists = generate_dataset(args.num_samples, tokenizer, args.output)

    # Build tokenizer vocabulary from generated data
    tokenizer.build_vocab(token_lists)

    # Save tokenizer vocab alongside data
    vocab_path = args.vocab_output or os.path.join(
        os.path.dirname(args.output),
        "vocab.json" if args.tokenizer == "regex" else "vocab.char.json",
    )
    vocab_dir = os.path.dirname(vocab_path) or "."
    os.makedirs(vocab_dir, exist_ok=True)
    with open(vocab_path, "w", encoding="utf-8") as f:
        json.dump(tokenizer.get_vocab(), f, ensure_ascii=False, indent=2)
    print(f"Tokenizer vocab saved to {vocab_path}")
    print(f"Vocab size: {tokenizer.vocab_size}")