yahyaabd committed on
Commit
4463ad1
·
verified ·
1 Parent(s): 6f6c413

Upload folder using huggingface_hub

Files changed (6)
  1. README.md +133 -7
  2. added_tokens.json +54 -12
  3. config.json +1 -1
  4. model.safetensors +2 -2
  5. tokenizer.json +402 -12
  6. tokenizer_config.json +346 -10
README.md CHANGED
@@ -1,14 +1,140 @@
1
- # Custom BPS SentenceTransformer
2
 
3
- This model is based on `paraphrase-multilingual-MiniLM-L12-v2`, with additional custom tokens for the statistical context of Indonesia's Badan Pusat Statistik (BPS). The new tokens cover terms such as `PDRB`, `SP2020`, and `SAKERNAS`, among others.
4
 
5
- ## Usage
6
  ```python
7
  from sentence_transformers import SentenceTransformer
8
 
9
- model = SentenceTransformer('yahyaabd/paraphrase-multilingual-MiniLM-L12-v2-bps-custom-tokenizer')
10
- embeddings = model.encode(['PDRB meningkat di tahun 2023.', 'BPS merilis ST2023.'])
11
  ```
12
 
13
- ## Contact
14
- Contact [yahyaabd] on Hugging Face with questions or for support.
1
+ ---
2
+ tags:
3
+ - sentence-transformers
4
+ - sentence-similarity
5
+ - feature-extraction
6
+ pipeline_tag: sentence-similarity
7
+ library_name: sentence-transformers
8
+ ---
9
 
10
+ # SentenceTransformer
11
 
12
+ This is a trained [sentence-transformers](https://www.SBERT.net) model. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
13
+
14
+ ## Model Details
15
+
16
+ ### Model Description
17
+ - **Model Type:** Sentence Transformer
18
+ <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
19
+ - **Maximum Sequence Length:** 512 tokens
20
+ - **Output Dimensionality:** 1024 dimensions
21
+ - **Similarity Function:** Cosine Similarity
22
+ <!-- - **Training Dataset:** Unknown -->
23
+ <!-- - **Language:** Unknown -->
24
+ <!-- - **License:** Unknown -->
25
+
26
+ ### Model Sources
27
+
28
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
29
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
30
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
31
+
32
+ ### Full Model Architecture
33
+
34
+ ```
35
+ SentenceTransformer(
36
+ (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
37
+ (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
38
+ )
39
+ ```
40
+
41
+ ## Usage
42
+
43
+ ### Direct Usage (Sentence Transformers)
44
+
45
+ First install the Sentence Transformers library:
46
+
47
+ ```bash
48
+ pip install -U sentence-transformers
49
+ ```
50
+
51
+ Then you can load this model and run inference.
52
  ```python
53
  from sentence_transformers import SentenceTransformer
54
 
55
+ # Download from the 🤗 Hub
56
+ model = SentenceTransformer("sentence_transformers_model_id")
57
+ # Run inference
58
+ sentences = [
59
+ 'The weather is lovely today.',
60
+ "It's so sunny outside!",
61
+ 'He drove to the stadium.',
62
+ ]
63
+ embeddings = model.encode(sentences)
64
+ print(embeddings.shape)
65
+ # [3, 1024]
66
+
67
+ # Get the similarity scores for the embeddings
68
+ similarities = model.similarity(embeddings, embeddings)
69
+ print(similarities.shape)
70
+ # [3, 3]
71
  ```
72
 
73
+ <!--
74
+ ### Direct Usage (Transformers)
75
+
76
+ <details><summary>Click to see the direct usage in Transformers</summary>
77
+
78
+ </details>
79
+ -->
80
+
81
+ <!--
82
+ ### Downstream Usage (Sentence Transformers)
83
+
84
+ You can finetune this model on your own dataset.
85
+
86
+ <details><summary>Click to expand</summary>
87
+
88
+ </details>
89
+ -->
90
+
91
+ <!--
92
+ ### Out-of-Scope Use
93
+
94
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
95
+ -->
96
+
97
+ <!--
98
+ ## Bias, Risks and Limitations
99
+
100
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
101
+ -->
102
+
103
+ <!--
104
+ ### Recommendations
105
+
106
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
107
+ -->
108
+
109
+ ## Training Details
110
+
111
+ ### Framework Versions
112
+ - Python: 3.11.12
113
+ - Sentence Transformers: 3.4.1
114
+ - Transformers: 4.51.3
115
+ - PyTorch: 2.6.0+cu124
116
+ - Accelerate: 1.6.0
117
+ - Datasets:
118
+ - Tokenizers: 0.21.1
119
+
120
+ ## Citation
121
+
122
+ ### BibTeX
123
+
124
+ <!--
125
+ ## Glossary
126
+
127
+ *Clearly define terms in order to be accessible across audiences.*
128
+ -->
129
+
130
+ <!--
131
+ ## Model Card Authors
132
+
133
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
134
+ -->
135
+
136
+ <!--
137
+ ## Model Card Contact
138
+
139
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
140
+ -->
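The model card lists mean-token pooling (`pooling_mode_mean_tokens: True`) and cosine similarity. As a rough illustration of what those two steps compute, here is a minimal pure-Python sketch. The helper names `mean_pool` and `cosine_similarity` and the toy 3-dimensional vectors are assumptions for illustration only, not part of the Sentence Transformers API; the actual model works with 1024-dimensional vectors.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention-mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy token embeddings for two "sentences"; the last token of each is padding.
sentence_a = mean_pool([[1.0, 0.0, 2.0], [3.0, 0.0, 0.0], [9.0, 9.0, 9.0]], [1, 1, 0])
sentence_b = mean_pool([[2.0, 0.0, 1.0], [2.0, 0.0, 1.0], [0.0, 0.0, 0.0]], [1, 1, 0])

print(sentence_a)  # [2.0, 0.0, 1.0]
print(cosine_similarity(sentence_a, sentence_b))
```

Because the padding vectors are masked out, both toy sentences pool to the same vector, so their cosine similarity is 1 up to floating-point rounding.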
added_tokens.json CHANGED
@@ -1,10 +1,11 @@
1
  {
2
- "BADAN PUSAT STATISTIK": 30522,
3
  "BPP": 30551,
4
  "BPS": 30521,
5
  "BRS": 30566,
 
6
  "CIF": 30556,
7
- "EKSPOR": 30552,
 
8
  "FOB": 30555,
9
  "HLS": 30543,
10
  "HS": 30554,
@@ -12,44 +13,85 @@
12
  "IHP": 30527,
13
  "IHPB": 30528,
14
  "IMK": 30532,
15
- "IMPOR": 30553,
16
  "IPH": 30526,
17
  "IPM": 30537,
18
  "ITB": 30531,
19
  "ITK": 30530,
20
- "KATALOG": 30572,
 
21
  "KB": 30542,
22
  "KCI": 30562,
23
- "KEGIATAN": 30568,
24
  "KKI": 30561,
25
  "KKL": 30560,
26
  "KRT": 30563,
 
27
  "LPE": 30533,
28
  "LTN": 30549,
29
  "LTT": 30548,
30
- "METADATA": 30571,
31
  "NTP": 30529,
32
  "NTUP": 30550,
33
  "PDB": 30524,
34
  "PDRB": 30523,
35
  "PKL": 30559,
36
- "PUBLIKASI": 30567,
 
37
  "RLS": 30544,
38
  "RT": 30564,
39
  "RW": 30565,
40
- "SAKERNAS": 30557,
41
  "SDGI": 30541,
42
  "SDKI": 30540,
43
- "SEKTORAL": 30570,
44
  "SP2020": 30538,
45
  "ST2013": 30547,
46
  "ST2023": 30546,
47
- "STATISTIK": 30569,
48
  "SUPAS": 30539,
49
- "SURVEI": 30573,
50
  "SUTAS": 30545,
 
 
 
51
  "TPAK": 30558,
52
  "TPK": 30534,
53
  "TPT": 30535,
54
- "UMP": 30536
55
  }
 
1
  {
 
2
  "BPP": 30551,
3
  "BPS": 30521,
4
  "BRS": 30566,
5
+ "Badan Pusat Statistik": 30522,
6
  "CIF": 30556,
7
+ "EKSPOR": 30573,
8
+ "Ekspor": 30552,
9
  "FOB": 30555,
10
  "HLS": 30543,
11
  "HS": 30554,
 
13
  "IHP": 30527,
14
  "IHPB": 30528,
15
  "IMK": 30532,
16
+ "IMPOR": 30574,
17
  "IPH": 30526,
18
  "IPM": 30537,
19
  "ITB": 30531,
20
  "ITK": 30530,
21
+ "Impor": 30553,
22
+ "KATALOG": 30580,
23
  "KB": 30542,
24
  "KCI": 30562,
25
+ "KEGIATAN": 30577,
26
  "KKI": 30561,
27
  "KKL": 30560,
28
  "KRT": 30563,
29
+ "Katalog": 30571,
30
  "LPE": 30533,
31
  "LTN": 30549,
32
  "LTT": 30548,
33
+ "Metadata": 30570,
34
  "NTP": 30529,
35
  "NTUP": 30550,
36
  "PDB": 30524,
37
  "PDRB": 30523,
38
  "PKL": 30559,
39
+ "PUBLIKASI": 30576,
40
+ "Publikasi": 30567,
41
  "RLS": 30544,
42
  "RT": 30564,
43
  "RW": 30565,
44
+ "SAKERNAS": 30575,
45
  "SDGI": 30541,
46
  "SDKI": 30540,
47
+ "SEKTORAL": 30579,
48
  "SP2020": 30538,
49
  "ST2013": 30547,
50
  "ST2023": 30546,
51
+ "STATISTIK": 30578,
52
  "SUPAS": 30539,
53
+ "SURVEI": 30581,
54
  "SUTAS": 30545,
55
+ "Sakernas": 30557,
56
+ "Sektoral": 30569,
57
+ "Survei": 30572,
58
  "TPAK": 30558,
59
  "TPK": 30534,
60
  "TPT": 30535,
61
+ "UMP": 30536,
62
+ "_Statistik": 30568,
63
+ "bpp": 30604,
64
+ "brs": 30613,
65
+ "cif": 30606,
66
+ "fob": 30605,
67
+ "hls": 30596,
68
+ "ihk": 30583,
69
+ "imk": 30587,
70
+ "iph": 30584,
71
+ "ipm": 30591,
72
+ "itk": 30586,
73
+ "kci": 30611,
74
+ "kki": 30610,
75
+ "kkl": 30609,
76
+ "krt": 30612,
77
+ "lpe": 30588,
78
+ "ltn": 30602,
79
+ "ltt": 30601,
80
+ "metadata": 30615,
81
+ "ntp": 30585,
82
+ "ntup": 30603,
83
+ "pdrb": 30582,
84
+ "rls": 30597,
85
+ "sakernas": 30607,
86
+ "sdgi": 30595,
87
+ "sdki": 30594,
88
+ "sektoral": 30614,
89
+ "sp2020": 30592,
90
+ "st2013": 30600,
91
+ "st2023": 30599,
92
+ "supas": 30593,
93
+ "sutas": 30598,
94
+ "tpak": 30608,
95
+ "tpk": 30589,
96
+ "tpt": 30590
97
  }
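The updated `added_tokens.json` registers several terms in more than one casing (for example `EKSPOR` and `Ekspor`). A small sketch for listing such case variants follows. It uses a hand-copied excerpt of the mapping shown in this diff rather than the full file, and the function name `group_case_variants` is hypothetical.

```python
from collections import defaultdict

# Excerpt of the new added_tokens.json from this diff (token -> id).
added_tokens = {
    "EKSPOR": 30573,
    "Ekspor": 30552,
    "SAKERNAS": 30575,
    "Sakernas": 30557,
    "sakernas": 30607,
    "PDRB": 30523,
    "pdrb": 30582,
    "UMP": 30536,
}


def group_case_variants(tokens):
    """Group tokens that differ only by letter case, keyed by their lowercase form."""
    groups = defaultdict(list)
    for token, token_id in tokens.items():
        groups[token.lower()].append((token, token_id))
    # Keep only terms that have more than one casing, each sorted by id.
    return {
        key: sorted(items, key=lambda item: item[1])
        for key, items in groups.items()
        if len(items) > 1
    }


variants = group_case_variants(added_tokens)
for term, items in sorted(variants.items()):
    print(term, items)
```

In this excerpt `UMP` has a single casing and is therefore not reported, while `sakernas` appears with three.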
config.json CHANGED
@@ -42,5 +42,5 @@
42
  "transformers_version": "4.51.3",
43
  "type_vocab_size": 2,
44
  "use_cache": true,
45
- "vocab_size": 30574
46
  }
 
42
  "transformers_version": "4.51.3",
43
  "type_vocab_size": 2,
44
  "use_cache": true,
45
+ "vocab_size": 30616
46
  }
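The `vocab_size` change lines up with the added-token IDs in this commit: the highest ID in the old mapping is 30573 and the highest in the new one is 30615, so each `vocab_size` equals the highest ID plus one. A short consistency check along those lines, with values copied from this diff and assuming IDs are contiguous from 0; `expected_vocab_size` is an illustrative helper, not a library function.

```python
def expected_vocab_size(highest_token_id):
    """Token ids are zero-based, so the embedding table needs highest id + 1 rows."""
    return highest_token_id + 1


old_highest_id = 30573   # "SURVEI" in the old added_tokens.json
new_highest_id = 30615   # "metadata" in the new added_tokens.json

old_vocab_size = expected_vocab_size(old_highest_id)
new_vocab_size = expected_vocab_size(new_highest_id)

print(old_vocab_size, new_vocab_size)    # 30574 30616
print(new_vocab_size - old_vocab_size)   # 42 additional token ids
```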
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:8e60913acacea8cbbe7e23c51dc87a3861ab07d76a450979f1656e8409e32f6b
3
- size 1340825424
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:876577415f46625890d3d6bc9feda17ea98563b4a355527d299bb5d1a460b3ad
3
+ size 1340997456
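The weights file grows by 172,032 bytes, which is consistent with 42 new embedding rows of 1024 values each if the weights are stored as 4-byte floats. The dtype is an assumption (it is not stated in this diff); the other numbers come from the LFS pointers, `config.json`, and the README in this commit.

```python
old_size = 1340825424      # bytes, from the old LFS pointer
new_size = 1340997456      # bytes, from the new LFS pointer

new_rows = 30616 - 30574   # vocab_size increase in config.json
hidden_size = 1024         # embedding dimension reported in the README
bytes_per_value = 4        # assumption: float32 weights

growth = new_size - old_size
expected_growth = new_rows * hidden_size * bytes_per_value

print(growth)              # 172032
print(expected_growth)     # 172032
```

The match suggests the size change comes from the enlarged token-embedding matrix alone, though the diff itself does not confirm this.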
tokenizer.json CHANGED
@@ -1,7 +1,19 @@
1
  {
2
  "version": "1.0",
3
- "truncation": null,
4
- "padding": null,
5
  "added_tokens": [
6
  {
7
  "id": 0,
@@ -59,7 +71,7 @@
59
  },
60
  {
61
  "id": 30522,
62
- "content": "BADAN PUSAT STATISTIK",
63
  "single_word": false,
64
  "lstrip": false,
65
  "rstrip": false,
@@ -329,7 +341,7 @@
329
  },
330
  {
331
  "id": 30552,
332
- "content": "EKSPOR",
333
  "single_word": false,
334
  "lstrip": false,
335
  "rstrip": false,
@@ -338,7 +350,7 @@
338
  },
339
  {
340
  "id": 30553,
341
- "content": "IMPOR",
342
  "single_word": false,
343
  "lstrip": false,
344
  "rstrip": false,
@@ -374,7 +386,7 @@
374
  },
375
  {
376
  "id": 30557,
377
- "content": "SAKERNAS",
378
  "single_word": false,
379
  "lstrip": false,
380
  "rstrip": false,
@@ -464,7 +476,7 @@
464
  },
465
  {
466
  "id": 30567,
467
- "content": "PUBLIKASI",
468
  "single_word": false,
469
  "lstrip": false,
470
  "rstrip": false,
@@ -473,7 +485,7 @@
473
  },
474
  {
475
  "id": 30568,
476
- "content": "KEGIATAN",
477
  "single_word": false,
478
  "lstrip": false,
479
  "rstrip": false,
@@ -482,7 +494,7 @@
482
  },
483
  {
484
  "id": 30569,
485
- "content": "STATISTIK",
486
  "single_word": false,
487
  "lstrip": false,
488
  "rstrip": false,
@@ -491,7 +503,7 @@
491
  },
492
  {
493
  "id": 30570,
494
- "content": "SEKTORAL",
495
  "single_word": false,
496
  "lstrip": false,
497
  "rstrip": false,
@@ -500,7 +512,7 @@
500
  },
501
  {
502
  "id": 30571,
503
- "content": "METADATA",
504
  "single_word": false,
505
  "lstrip": false,
506
  "rstrip": false,
@@ -509,7 +521,7 @@
509
  },
510
  {
511
  "id": 30572,
512
- "content": "KATALOG",
513
  "single_word": false,
514
  "lstrip": false,
515
  "rstrip": false,
@@ -518,12 +530,390 @@
518
  },
519
  {
520
  "id": 30573,
521
  "content": "SURVEI",
522
  "single_word": false,
523
  "lstrip": false,
524
  "rstrip": false,
525
  "normalized": true,
526
  "special": false
527
  }
528
  ],
529
  "normalizer": {
 
1
  {
2
  "version": "1.0",
3
+ "truncation": {
4
+ "direction": "Right",
5
+ "max_length": 256,
6
+ "strategy": "LongestFirst",
7
+ "stride": 0
8
+ },
9
+ "padding": {
10
+ "strategy": "BatchLongest",
11
+ "direction": "Right",
12
+ "pad_to_multiple_of": null,
13
+ "pad_id": 0,
14
+ "pad_type_id": 0,
15
+ "pad_token": "[PAD]"
16
+ },
17
  "added_tokens": [
18
  {
19
  "id": 0,
 
71
  },
72
  {
73
  "id": 30522,
74
+ "content": "Badan Pusat Statistik",
75
  "single_word": false,
76
  "lstrip": false,
77
  "rstrip": false,
 
341
  },
342
  {
343
  "id": 30552,
344
+ "content": "Ekspor",
345
  "single_word": false,
346
  "lstrip": false,
347
  "rstrip": false,
 
350
  },
351
  {
352
  "id": 30553,
353
+ "content": "Impor",
354
  "single_word": false,
355
  "lstrip": false,
356
  "rstrip": false,
 
386
  },
387
  {
388
  "id": 30557,
389
+ "content": "Sakernas",
390
  "single_word": false,
391
  "lstrip": false,
392
  "rstrip": false,
 
476
  },
477
  {
478
  "id": 30567,
479
+ "content": "Publikasi",
480
  "single_word": false,
481
  "lstrip": false,
482
  "rstrip": false,
 
485
  },
486
  {
487
  "id": 30568,
488
+ "content": "_Statistik",
489
  "single_word": false,
490
  "lstrip": false,
491
  "rstrip": false,
 
494
  },
495
  {
496
  "id": 30569,
497
+ "content": "Sektoral",
498
  "single_word": false,
499
  "lstrip": false,
500
  "rstrip": false,
 
503
  },
504
  {
505
  "id": 30570,
506
+ "content": "Metadata",
507
  "single_word": false,
508
  "lstrip": false,
509
  "rstrip": false,
 
512
  },
513
  {
514
  "id": 30571,
515
+ "content": "Katalog",
516
  "single_word": false,
517
  "lstrip": false,
518
  "rstrip": false,
 
521
  },
522
  {
523
  "id": 30572,
524
+ "content": "Survei",
525
  "single_word": false,
526
  "lstrip": false,
527
  "rstrip": false,
 
530
  },
531
  {
532
  "id": 30573,
533
+ "content": "EKSPOR",
534
+ "single_word": false,
535
+ "lstrip": false,
536
+ "rstrip": false,
537
+ "normalized": true,
538
+ "special": false
539
+ },
540
+ {
541
+ "id": 30574,
542
+ "content": "IMPOR",
543
+ "single_word": false,
544
+ "lstrip": false,
545
+ "rstrip": false,
546
+ "normalized": true,
547
+ "special": false
548
+ },
549
+ {
550
+ "id": 30575,
551
+ "content": "SAKERNAS",
552
+ "single_word": false,
553
+ "lstrip": false,
554
+ "rstrip": false,
555
+ "normalized": true,
556
+ "special": false
557
+ },
558
+ {
559
+ "id": 30576,
560
+ "content": "PUBLIKASI",
561
+ "single_word": false,
562
+ "lstrip": false,
563
+ "rstrip": false,
564
+ "normalized": true,
565
+ "special": false
566
+ },
567
+ {
568
+ "id": 30577,
569
+ "content": "KEGIATAN",
570
+ "single_word": false,
571
+ "lstrip": false,
572
+ "rstrip": false,
573
+ "normalized": true,
574
+ "special": false
575
+ },
576
+ {
577
+ "id": 30578,
578
+ "content": "STATISTIK",
579
+ "single_word": false,
580
+ "lstrip": false,
581
+ "rstrip": false,
582
+ "normalized": true,
583
+ "special": false
584
+ },
585
+ {
586
+ "id": 30579,
587
+ "content": "SEKTORAL",
588
+ "single_word": false,
589
+ "lstrip": false,
590
+ "rstrip": false,
591
+ "normalized": true,
592
+ "special": false
593
+ },
594
+ {
595
+ "id": 30580,
596
+ "content": "KATALOG",
597
+ "single_word": false,
598
+ "lstrip": false,
599
+ "rstrip": false,
600
+ "normalized": true,
601
+ "special": false
602
+ },
603
+ {
604
+ "id": 30581,
605
  "content": "SURVEI",
606
  "single_word": false,
607
  "lstrip": false,
608
  "rstrip": false,
609
  "normalized": true,
610
  "special": false
611
+ },
612
+ {
613
+ "id": 30582,
614
+ "content": "pdrb",
615
+ "single_word": false,
616
+ "lstrip": false,
617
+ "rstrip": false,
618
+ "normalized": true,
619
+ "special": false
620
+ },
621
+ {
622
+ "id": 30583,
623
+ "content": "ihk",
624
+ "single_word": false,
625
+ "lstrip": false,
626
+ "rstrip": false,
627
+ "normalized": true,
628
+ "special": false
629
+ },
630
+ {
631
+ "id": 30584,
632
+ "content": "iph",
633
+ "single_word": false,
634
+ "lstrip": false,
635
+ "rstrip": false,
636
+ "normalized": true,
637
+ "special": false
638
+ },
639
+ {
640
+ "id": 30585,
641
+ "content": "ntp",
642
+ "single_word": false,
643
+ "lstrip": false,
644
+ "rstrip": false,
645
+ "normalized": true,
646
+ "special": false
647
+ },
648
+ {
649
+ "id": 30586,
650
+ "content": "itk",
651
+ "single_word": false,
652
+ "lstrip": false,
653
+ "rstrip": false,
654
+ "normalized": true,
655
+ "special": false
656
+ },
657
+ {
658
+ "id": 30587,
659
+ "content": "imk",
660
+ "single_word": false,
661
+ "lstrip": false,
662
+ "rstrip": false,
663
+ "normalized": true,
664
+ "special": false
665
+ },
666
+ {
667
+ "id": 30588,
668
+ "content": "lpe",
669
+ "single_word": false,
670
+ "lstrip": false,
671
+ "rstrip": false,
672
+ "normalized": true,
673
+ "special": false
674
+ },
675
+ {
676
+ "id": 30589,
677
+ "content": "tpk",
678
+ "single_word": false,
679
+ "lstrip": false,
680
+ "rstrip": false,
681
+ "normalized": true,
682
+ "special": false
683
+ },
684
+ {
685
+ "id": 30590,
686
+ "content": "tpt",
687
+ "single_word": false,
688
+ "lstrip": false,
689
+ "rstrip": false,
690
+ "normalized": true,
691
+ "special": false
692
+ },
693
+ {
694
+ "id": 30591,
695
+ "content": "ipm",
696
+ "single_word": false,
697
+ "lstrip": false,
698
+ "rstrip": false,
699
+ "normalized": true,
700
+ "special": false
701
+ },
702
+ {
703
+ "id": 30592,
704
+ "content": "sp2020",
705
+ "single_word": false,
706
+ "lstrip": false,
707
+ "rstrip": false,
708
+ "normalized": true,
709
+ "special": false
710
+ },
711
+ {
712
+ "id": 30593,
713
+ "content": "supas",
714
+ "single_word": false,
715
+ "lstrip": false,
716
+ "rstrip": false,
717
+ "normalized": true,
718
+ "special": false
719
+ },
720
+ {
721
+ "id": 30594,
722
+ "content": "sdki",
723
+ "single_word": false,
724
+ "lstrip": false,
725
+ "rstrip": false,
726
+ "normalized": true,
727
+ "special": false
728
+ },
729
+ {
730
+ "id": 30595,
731
+ "content": "sdgi",
732
+ "single_word": false,
733
+ "lstrip": false,
734
+ "rstrip": false,
735
+ "normalized": true,
736
+ "special": false
737
+ },
738
+ {
739
+ "id": 30596,
740
+ "content": "hls",
741
+ "single_word": false,
742
+ "lstrip": false,
743
+ "rstrip": false,
744
+ "normalized": true,
745
+ "special": false
746
+ },
747
+ {
748
+ "id": 30597,
749
+ "content": "rls",
750
+ "single_word": false,
751
+ "lstrip": false,
752
+ "rstrip": false,
753
+ "normalized": true,
754
+ "special": false
755
+ },
756
+ {
757
+ "id": 30598,
758
+ "content": "sutas",
759
+ "single_word": false,
760
+ "lstrip": false,
761
+ "rstrip": false,
762
+ "normalized": true,
763
+ "special": false
764
+ },
765
+ {
766
+ "id": 30599,
767
+ "content": "st2023",
768
+ "single_word": false,
769
+ "lstrip": false,
770
+ "rstrip": false,
771
+ "normalized": true,
772
+ "special": false
773
+ },
774
+ {
775
+ "id": 30600,
776
+ "content": "st2013",
777
+ "single_word": false,
778
+ "lstrip": false,
779
+ "rstrip": false,
780
+ "normalized": true,
781
+ "special": false
782
+ },
783
+ {
784
+ "id": 30601,
785
+ "content": "ltt",
786
+ "single_word": false,
787
+ "lstrip": false,
788
+ "rstrip": false,
789
+ "normalized": true,
790
+ "special": false
791
+ },
792
+ {
793
+ "id": 30602,
794
+ "content": "ltn",
795
+ "single_word": false,
796
+ "lstrip": false,
797
+ "rstrip": false,
798
+ "normalized": true,
799
+ "special": false
800
+ },
801
+ {
802
+ "id": 30603,
803
+ "content": "ntup",
804
+ "single_word": false,
805
+ "lstrip": false,
806
+ "rstrip": false,
807
+ "normalized": true,
808
+ "special": false
809
+ },
810
+ {
811
+ "id": 30604,
812
+ "content": "bpp",
813
+ "single_word": false,
814
+ "lstrip": false,
815
+ "rstrip": false,
816
+ "normalized": true,
817
+ "special": false
818
+ },
819
+ {
820
+ "id": 30605,
821
+ "content": "fob",
822
+ "single_word": false,
823
+ "lstrip": false,
824
+ "rstrip": false,
825
+ "normalized": true,
826
+ "special": false
827
+ },
828
+ {
829
+ "id": 30606,
830
+ "content": "cif",
831
+ "single_word": false,
832
+ "lstrip": false,
833
+ "rstrip": false,
834
+ "normalized": true,
835
+ "special": false
836
+ },
837
+ {
838
+ "id": 30607,
839
+ "content": "sakernas",
840
+ "single_word": false,
841
+ "lstrip": false,
842
+ "rstrip": false,
843
+ "normalized": true,
844
+ "special": false
845
+ },
846
+ {
847
+ "id": 30608,
848
+ "content": "tpak",
849
+ "single_word": false,
850
+ "lstrip": false,
851
+ "rstrip": false,
852
+ "normalized": true,
853
+ "special": false
854
+ },
855
+ {
856
+ "id": 30609,
857
+ "content": "kkl",
858
+ "single_word": false,
859
+ "lstrip": false,
860
+ "rstrip": false,
861
+ "normalized": true,
862
+ "special": false
863
+ },
864
+ {
865
+ "id": 30610,
866
+ "content": "kki",
867
+ "single_word": false,
868
+ "lstrip": false,
869
+ "rstrip": false,
870
+ "normalized": true,
871
+ "special": false
872
+ },
873
+ {
874
+ "id": 30611,
875
+ "content": "kci",
876
+ "single_word": false,
877
+ "lstrip": false,
878
+ "rstrip": false,
879
+ "normalized": true,
880
+ "special": false
881
+ },
882
+ {
883
+ "id": 30612,
884
+ "content": "krt",
885
+ "single_word": false,
886
+ "lstrip": false,
887
+ "rstrip": false,
888
+ "normalized": true,
889
+ "special": false
890
+ },
891
+ {
892
+ "id": 30613,
893
+ "content": "brs",
894
+ "single_word": false,
895
+ "lstrip": false,
896
+ "rstrip": false,
897
+ "normalized": true,
898
+ "special": false
899
+ },
900
+ {
901
+ "id": 30614,
902
+ "content": "sektoral",
903
+ "single_word": false,
904
+ "lstrip": false,
905
+ "rstrip": false,
906
+ "normalized": true,
907
+ "special": false
908
+ },
909
+ {
910
+ "id": 30615,
911
+ "content": "metadata",
912
+ "single_word": false,
913
+ "lstrip": false,
914
+ "rstrip": false,
915
+ "normalized": true,
916
+ "special": false
917
  }
918
  ],
919
  "normalizer": {
tokenizer_config.json CHANGED
@@ -49,7 +49,7 @@
49
  "special": false
50
  },
51
  "30522": {
52
- "content": "BADAN PUSAT STATISTIK",
53
  "lstrip": false,
54
  "normalized": true,
55
  "rstrip": false,
@@ -289,7 +289,7 @@
289
  "special": false
290
  },
291
  "30552": {
292
- "content": "EKSPOR",
293
  "lstrip": false,
294
  "normalized": true,
295
  "rstrip": false,
@@ -297,7 +297,7 @@
297
  "special": false
298
  },
299
  "30553": {
300
- "content": "IMPOR",
301
  "lstrip": false,
302
  "normalized": true,
303
  "rstrip": false,
@@ -329,7 +329,7 @@
329
  "special": false
330
  },
331
  "30557": {
332
- "content": "SAKERNAS",
333
  "lstrip": false,
334
  "normalized": true,
335
  "rstrip": false,
@@ -409,7 +409,7 @@
409
  "special": false
410
  },
411
  "30567": {
412
- "content": "PUBLIKASI",
413
  "lstrip": false,
414
  "normalized": true,
415
  "rstrip": false,
@@ -417,7 +417,7 @@
417
  "special": false
418
  },
419
  "30568": {
420
- "content": "KEGIATAN",
421
  "lstrip": false,
422
  "normalized": true,
423
  "rstrip": false,
@@ -425,7 +425,7 @@
425
  "special": false
426
  },
427
  "30569": {
428
- "content": "STATISTIK",
429
  "lstrip": false,
430
  "normalized": true,
431
  "rstrip": false,
@@ -433,7 +433,7 @@
433
  "special": false
434
  },
435
  "30570": {
436
- "content": "SEKTORAL",
437
  "lstrip": false,
438
  "normalized": true,
439
  "rstrip": false,
@@ -441,7 +441,7 @@
441
  "special": false
442
  },
443
  "30571": {
444
- "content": "METADATA",
445
  "lstrip": false,
446
  "normalized": true,
447
  "rstrip": false,
@@ -449,7 +449,7 @@
449
  "special": false
450
  },
451
  "30572": {
452
- "content": "KATALOG",
453
  "lstrip": false,
454
  "normalized": true,
455
  "rstrip": false,
@@ -457,12 +457,348 @@
457
  "special": false
458
  },
459
  "30573": {
460
  "content": "SURVEI",
461
  "lstrip": false,
462
  "normalized": true,
463
  "rstrip": false,
464
  "single_word": false,
465
  "special": false
466
  }
467
  },
468
  "clean_up_tokenization_spaces": true,
 
49
  "special": false
50
  },
51
  "30522": {
52
+ "content": "Badan Pusat Statistik",
53
  "lstrip": false,
54
  "normalized": true,
55
  "rstrip": false,
 
289
  "special": false
290
  },
291
  "30552": {
292
+ "content": "Ekspor",
293
  "lstrip": false,
294
  "normalized": true,
295
  "rstrip": false,
 
297
  "special": false
298
  },
299
  "30553": {
300
+ "content": "Impor",
301
  "lstrip": false,
302
  "normalized": true,
303
  "rstrip": false,
 
329
  "special": false
330
  },
331
  "30557": {
332
+ "content": "Sakernas",
333
  "lstrip": false,
334
  "normalized": true,
335
  "rstrip": false,
 
409
  "special": false
410
  },
411
  "30567": {
412
+ "content": "Publikasi",
413
  "lstrip": false,
414
  "normalized": true,
415
  "rstrip": false,
 
417
  "special": false
418
  },
419
  "30568": {
420
+ "content": "_Statistik",
421
  "lstrip": false,
422
  "normalized": true,
423
  "rstrip": false,
 
425
  "special": false
426
  },
427
  "30569": {
428
+ "content": "Sektoral",
429
  "lstrip": false,
430
  "normalized": true,
431
  "rstrip": false,
 
433
  "special": false
434
  },
435
  "30570": {
436
+ "content": "Metadata",
437
  "lstrip": false,
438
  "normalized": true,
439
  "rstrip": false,
 
441
  "special": false
442
  },
443
  "30571": {
444
+ "content": "Katalog",
445
  "lstrip": false,
446
  "normalized": true,
447
  "rstrip": false,
 
449
  "special": false
450
  },
451
  "30572": {
452
+ "content": "Survei",
453
  "lstrip": false,
454
  "normalized": true,
455
  "rstrip": false,
 
457
  "special": false
458
  },
459
  "30573": {
460
+ "content": "EKSPOR",
461
+ "lstrip": false,
462
+ "normalized": true,
463
+ "rstrip": false,
464
+ "single_word": false,
465
+ "special": false
466
+ },
467
+ "30574": {
468
+ "content": "IMPOR",
469
+ "lstrip": false,
470
+ "normalized": true,
471
+ "rstrip": false,
472
+ "single_word": false,
473
+ "special": false
474
+ },
475
+ "30575": {
476
+ "content": "SAKERNAS",
477
+ "lstrip": false,
478
+ "normalized": true,
479
+ "rstrip": false,
480
+ "single_word": false,
481
+ "special": false
482
+ },
483
+ "30576": {
484
+ "content": "PUBLIKASI",
485
+ "lstrip": false,
486
+ "normalized": true,
487
+ "rstrip": false,
488
+ "single_word": false,
489
+ "special": false
490
+ },
491
+ "30577": {
492
+ "content": "KEGIATAN",
493
+ "lstrip": false,
494
+ "normalized": true,
495
+ "rstrip": false,
496
+ "single_word": false,
497
+ "special": false
498
+ },
499
+ "30578": {
500
+ "content": "STATISTIK",
501
+ "lstrip": false,
502
+ "normalized": true,
503
+ "rstrip": false,
504
+ "single_word": false,
505
+ "special": false
506
+ },
507
+ "30579": {
508
+ "content": "SEKTORAL",
509
+ "lstrip": false,
510
+ "normalized": true,
511
+ "rstrip": false,
512
+ "single_word": false,
513
+ "special": false
514
+ },
515
+ "30580": {
516
+ "content": "KATALOG",
517
+ "lstrip": false,
518
+ "normalized": true,
519
+ "rstrip": false,
520
+ "single_word": false,
521
+ "special": false
522
+ },
523
+ "30581": {
524
  "content": "SURVEI",
525
  "lstrip": false,
526
  "normalized": true,
527
  "rstrip": false,
528
  "single_word": false,
529
  "special": false
530
+ },
531
+ "30582": {
532
+ "content": "pdrb",
533
+ "lstrip": false,
534
+ "normalized": true,
535
+ "rstrip": false,
536
+ "single_word": false,
537
+ "special": false
538
+ },
539
+ "30583": {
540
+ "content": "ihk",
541
+ "lstrip": false,
542
+ "normalized": true,
543
+ "rstrip": false,
544
+ "single_word": false,
545
+ "special": false
546
+ },
547
+ "30584": {
548
+ "content": "iph",
549
+ "lstrip": false,
550
+ "normalized": true,
551
+ "rstrip": false,
552
+ "single_word": false,
553
+ "special": false
554
+ },
555
+ "30585": {
556
+ "content": "ntp",
557
+ "lstrip": false,
558
+ "normalized": true,
559
+ "rstrip": false,
560
+ "single_word": false,
561
+ "special": false
562
+ },
563
+ "30586": {
564
+ "content": "itk",
565
+ "lstrip": false,
566
+ "normalized": true,
567
+ "rstrip": false,
568
+ "single_word": false,
569
+ "special": false
570
+ },
571
+ "30587": {
572
+ "content": "imk",
573
+ "lstrip": false,
574
+ "normalized": true,
575
+ "rstrip": false,
576
+ "single_word": false,
577
+ "special": false
578
+ },
579
+ "30588": {
580
+ "content": "lpe",
581
+ "lstrip": false,
582
+ "normalized": true,
583
+ "rstrip": false,
584
+ "single_word": false,
585
+ "special": false
586
+ },
587
+ "30589": {
588
+ "content": "tpk",
589
+ "lstrip": false,
590
+ "normalized": true,
591
+ "rstrip": false,
592
+ "single_word": false,
593
+ "special": false
594
+ },
595
+ "30590": {
596
+ "content": "tpt",
597
+ "lstrip": false,
598
+ "normalized": true,
599
+ "rstrip": false,
600
+ "single_word": false,
601
+ "special": false
602
+ },
603
+ "30591": {
604
+ "content": "ipm",
605
+ "lstrip": false,
606
+ "normalized": true,
607
+ "rstrip": false,
608
+ "single_word": false,
609
+ "special": false
610
+ },
611
+ "30592": {
612
+ "content": "sp2020",
613
+ "lstrip": false,
614
+ "normalized": true,
615
+ "rstrip": false,
616
+ "single_word": false,
617
+ "special": false
618
+ },
619
+ "30593": {
620
+ "content": "supas",
621
+ "lstrip": false,
622
+ "normalized": true,
623
+ "rstrip": false,
624
+ "single_word": false,
625
+ "special": false
626
+ },
627
+ "30594": {
628
+ "content": "sdki",
629
+ "lstrip": false,
630
+ "normalized": true,
631
+ "rstrip": false,
632
+ "single_word": false,
633
+ "special": false
634
+ },
635
+ "30595": {
636
+ "content": "sdgi",
637
+ "lstrip": false,
638
+ "normalized": true,
639
+ "rstrip": false,
640
+ "single_word": false,
641
+ "special": false
642
+ },
643
+ "30596": {
644
+ "content": "hls",
645
+ "lstrip": false,
646
+ "normalized": true,
647
+ "rstrip": false,
648
+ "single_word": false,
649
+ "special": false
650
+ },
651
+ "30597": {
652
+ "content": "rls",
653
+ "lstrip": false,
654
+ "normalized": true,
655
+ "rstrip": false,
656
+ "single_word": false,
657
+ "special": false
658
+ },
659
+ "30598": {
660
+ "content": "sutas",
661
+ "lstrip": false,
662
+ "normalized": true,
663
+ "rstrip": false,
664
+ "single_word": false,
665
+ "special": false
666
+ },
667
+ "30599": {
668
+ "content": "st2023",
669
+ "lstrip": false,
670
+ "normalized": true,
671
+ "rstrip": false,
672
+ "single_word": false,
673
+ "special": false
674
+ },
675
+ "30600": {
676
+ "content": "st2013",
677
+ "lstrip": false,
678
+ "normalized": true,
679
+ "rstrip": false,
680
+ "single_word": false,
681
+ "special": false
682
+ },
683
+ "30601": {
684
+ "content": "ltt",
685
+ "lstrip": false,
686
+ "normalized": true,
687
+ "rstrip": false,
688
+ "single_word": false,
689
+ "special": false
690
+ },
691
+ "30602": {
692
+ "content": "ltn",
693
+ "lstrip": false,
694
+ "normalized": true,
695
+ "rstrip": false,
696
+ "single_word": false,
697
+ "special": false
698
+ },
699
+ "30603": {
700
+ "content": "ntup",
701
+ "lstrip": false,
702
+ "normalized": true,
703
+ "rstrip": false,
704
+ "single_word": false,
705
+ "special": false
706
+ },
707
+ "30604": {
708
+ "content": "bpp",
709
+ "lstrip": false,
710
+ "normalized": true,
711
+ "rstrip": false,
712
+ "single_word": false,
713
+ "special": false
714
+ },
715
+ "30605": {
716
+ "content": "fob",
717
+ "lstrip": false,
718
+ "normalized": true,
719
+ "rstrip": false,
720
+ "single_word": false,
721
+ "special": false
722
+ },
723
+ "30606": {
724
+ "content": "cif",
725
+ "lstrip": false,
726
+ "normalized": true,
727
+ "rstrip": false,
728
+ "single_word": false,
729
+ "special": false
730
+ },
731
+ "30607": {
732
+ "content": "sakernas",
733
+ "lstrip": false,
734
+ "normalized": true,
735
+ "rstrip": false,
736
+ "single_word": false,
737
+ "special": false
738
+ },
739
+ "30608": {
740
+ "content": "tpak",
741
+ "lstrip": false,
742
+ "normalized": true,
743
+ "rstrip": false,
744
+ "single_word": false,
745
+ "special": false
746
+ },
747
+ "30609": {
748
+ "content": "kkl",
749
+ "lstrip": false,
750
+ "normalized": true,
751
+ "rstrip": false,
752
+ "single_word": false,
753
+ "special": false
754
+ },
755
+ "30610": {
756
+ "content": "kki",
757
+ "lstrip": false,
758
+ "normalized": true,
759
+ "rstrip": false,
760
+ "single_word": false,
761
+ "special": false
762
+ },
763
+ "30611": {
764
+ "content": "kci",
765
+ "lstrip": false,
766
+ "normalized": true,
767
+ "rstrip": false,
768
+ "single_word": false,
769
+ "special": false
770
+ },
771
+ "30612": {
772
+ "content": "krt",
773
+ "lstrip": false,
774
+ "normalized": true,
775
+ "rstrip": false,
776
+ "single_word": false,
777
+ "special": false
778
+ },
779
+ "30613": {
780
+ "content": "brs",
781
+ "lstrip": false,
782
+ "normalized": true,
783
+ "rstrip": false,
784
+ "single_word": false,
785
+ "special": false
786
+ },
787
+ "30614": {
788
+ "content": "sektoral",
789
+ "lstrip": false,
790
+ "normalized": true,
791
+ "rstrip": false,
792
+ "single_word": false,
793
+ "special": false
794
+ },
795
+ "30615": {
796
+ "content": "metadata",
797
+ "lstrip": false,
798
+ "normalized": true,
799
+ "rstrip": false,
800
+ "single_word": false,
801
+ "special": false
802
  }
803
  },
804
  "clean_up_tokenization_spaces": true,