sewoong committed
Commit d4d7d55 · verified · 1 parent: fa98e65

Korean neural sparse encoder: SPLADE-max with ModernBERT backbone

Files changed (6)
  1. README.md +104 -38
  2. config.json +33 -15
  3. model.safetensors +2 -2
  4. special_tokens_map.json +43 -7
  5. tokenizer.json +2 -2
  6. tokenizer_config.json +278 -11
README.md CHANGED
@@ -1,60 +1,126 @@
  ---
- language: ko
  tags:
  - neural-sparse
  - opensearch
  - korean
- - xlm-roberta
- - sparse-retrieval
  - information-retrieval
- license: apache-2.0
  library_name: transformers
  pipeline_tag: feature-extraction
  ---

- # Korean Neural Sparse Encoder V28

- Korean-optimized neural sparse retrieval model based on XLM-RoBERTa with Context Gate architecture.

  ## Model Description

- - **Architecture**: SPLADEDocContextGated (XLM-RoBERTa-base + Context Gate)
- - **Parameters**: 345M
- - **Training Data**: 8M+ Korean text pairs (V29.0 dataset)
- - **Training**: 25 epochs, 8x NVIDIA B200 GPUs (DDP), BF16
- - **Teacher**: BAAI/bge-m3 (knowledge distillation)
-
- ## Ko-StrategyQA Benchmark (592 queries, 9,251 documents)
-
- | Method | Recall@1 | Recall@5 | Recall@10 | MRR | P50 (ms) |
- |--------|----------|----------|-----------|-----|----------|
- | **semantic** (BGE-M3) | 73.5% | 87.3% | 89.4% | 0.795 | 16.1 |
- | hybrid_linear_0.3 | 70.3% | 86.0% | 88.7% | 0.772 | 96.6 |
- | bm25_semantic_rrf | 67.4% | 85.5% | 87.8% | 0.751 | 67.7 |
- | bm25 | 53.7% | 75.3% | 81.9% | 0.626 | 15.2 |
- | **neural_sparse** (this model) | 16.2% | 40.2% | 54.9% | 0.265 | 18.1 |

  ## Usage with OpenSearch

-
- ## Usage with Transformers
-

  ## Training Details

- - **Version**: V28 (Context-Gated SPLADE)
- - **Base Model**: xlm-roberta-base
- - **Loss**: InfoNCE + FLOPS + KD (BGE-M3) + Language Penalty
- - **Curriculum**: 2-phase (Foundation -> Balanced with hard negatives)
- - **Final Train Loss**: 1.8255
- - **Final Val Loss**: 1.9558

- ## Version History

- | Version | Recall@1 | Architecture |
- |---------|----------|--------------|
- | V28 | 16.2% | SPLADEDocContextGated |
- | V26 | 30.4% | SPLADEDocXLMR + IDF |
- | V25 | 21.0% | SPLADEDocXLMR |

  ---
+ language:
+ - ko
+ license: apache-2.0
  tags:
  - neural-sparse
+ - splade
  - opensearch
  - korean
  - information-retrieval
  library_name: transformers
  pipeline_tag: feature-extraction
  ---

+ # Korean Neural Sparse Encoder

+ A SPLADE-max neural sparse encoder optimized for Korean text retrieval with OpenSearch.

  ## Model Description

+ - **Architecture**: SPLADE-max (MLM head → log(1+ReLU) max pooling)
+ - **Base Model**: [skt/A.X-Encoder-base](https://huggingface.co/skt/A.X-Encoder-base) (ModernBERT)
+ - **Vocabulary**: 50,000 tokens (48.4% Korean)
+ - **Parameters**: 149M
+ - **Hidden Size**: 768
+ - **Layers**: 22
+ - **Training**: InfoNCE + FLOPS regularization with quadratic lambda warmup
+ - **Training Data**: 4.6M Korean triplets (query, positive, negative)
+
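The SPLADE-max pooling named in the first bullet can be sketched in a few lines. This is a minimal illustration with mock MLM logits rather than the real checkpoint (loading the actual weights, e.g. via `transformers.AutoModelForMaskedLM`, is omitted); the shapes and the toy vocabulary are assumptions for the example only:

```python
import numpy as np

def splade_max(logits, attention_mask):
    """SPLADE-max pooling: log(1 + ReLU(logit)), then max over sequence positions.

    logits: [seq_len, vocab_size] MLM logits; attention_mask: [seq_len] of 0/1.
    Returns a [vocab_size] sparse term-weight vector.
    """
    sat = np.log1p(np.maximum(logits, 0.0))  # log(1 + ReLU(x)) saturates large logits
    sat = sat * attention_mask[:, None]      # zero out padding positions
    return sat.max(axis=0)                   # max pool over token positions

# toy example: 3 token positions, vocabulary of 5 (the real vocab is 50,000)
logits = np.array([[2.0, -1.0, 0.0, 0.5, -3.0],
                   [0.0,  4.0, -2.0, 0.0, 1.0],
                   [1.0, -0.5, 0.0, 0.0, 0.0]])
mask = np.array([1.0, 1.0, 0.0])             # third position is padding
weights = splade_max(logits, mask)
print(int((weights > 0).sum()))              # → 4 active dimensions
```

Negative logits contribute nothing (ReLU), which is what makes the output vector sparse enough to index as `rank_features`.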
+ ## Benchmark Results
+
+ Evaluated on Korean retrieval benchmarks using OpenSearch `neural_sparse` search:
+
+ | Benchmark | Queries | BM25 R@1 | Neural Sparse R@1 | Semantic (BGE-M3) R@1 |
+ |-----------|---------|----------|-------------------|-----------------------|
+ | Ko-StrategyQA | 592 | 53.7% | **62.2%** | 73.5% |
+ | MIRACL-ko | 213 | 44.1% | **62.0%** | 70.9% |
+ | Mr.TyDi-ko | 421 | 55.6% | **73.4%** | 84.1% |
+
+ Neural sparse consistently outperforms BM25 across all benchmarks while maintaining sparse, interpretable representations.
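The Recall@k and MRR columns follow their standard definitions; a small sketch for readers who want to re-score their own runs (function names are illustrative, not from this repository):

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of queries whose top-k ranking contains at least one relevant doc."""
    hits = sum(1 for ranking, rel in zip(ranked_ids, relevant_ids)
               if set(ranking[:k]) & set(rel))
    return hits / len(ranked_ids)

def mrr(ranked_ids, relevant_ids):
    """Mean reciprocal rank of the first relevant doc (0 if none is retrieved)."""
    total = 0.0
    for ranking, rel in zip(ranked_ids, relevant_ids):
        for i, doc in enumerate(ranking, start=1):
            if doc in rel:
                total += 1.0 / i
                break
    return total / len(ranked_ids)

# two toy queries: first relevant docs at ranks 2 and 3
rankings = [["d1", "d2", "d3"], ["d4", "d5", "d6"]]
relevant = [{"d2"}, {"d6"}]
print(recall_at_k(rankings, relevant, 1))  # → 0.0
print(mrr(rankings, relevant))             # → (1/2 + 1/3) / 2
```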
+
+ ### Detailed Metrics
+
+ | Benchmark | Method | R@1 | R@5 | R@10 | MRR | NDCG@10 |
+ |-----------|--------|-----|-----|------|-----|---------|
+ | Ko-StrategyQA | BM25 | 53.7% | 75.3% | 81.9% | 0.626 | 0.673 |
+ | Ko-StrategyQA | Neural Sparse | 62.2% | 80.6% | 83.6% | 0.700 | 0.734 |
+ | Ko-StrategyQA | Semantic | 73.5% | 87.3% | 89.4% | 0.795 | 0.819 |
+ | MIRACL-ko | BM25 | 44.1% | 80.8% | 90.6% | 0.589 | — |
+ | MIRACL-ko | Neural Sparse | 62.0% | 89.7% | 93.4% | 0.733 | — |
+ | MIRACL-ko | Semantic | 70.9% | 93.9% | 97.7% | 0.810 | — |
+ | Mr.TyDi-ko | BM25 | 55.6% | 79.1% | 85.7% | 0.656 | — |
+ | Mr.TyDi-ko | Neural Sparse | 73.4% | 92.4% | 94.8% | 0.816 | — |
+ | Mr.TyDi-ko | Semantic | 84.1% | 95.7% | 96.9% | 0.894 | — |

  ## Usage with OpenSearch

+ ### 1. Register the model
+
+ ```json
+ POST /_plugins/_ml/models/_register
+ {
+   "name": "korean-neural-sparse-encoder",
+   "version": "1.0.0",
+   "model_format": "TORCH_SCRIPT",
+   "model_config": {
+     "model_type": "bert",
+     "embedding_dimension": 1,
+     "framework_type": "huggingface_transformers"
+   },
+   "url": "https://huggingface.co/sewoong/korean-neural-sparse-encoder"
+ }
+ ```
+
+ ### 2. Create a sparse index
+
+ ```json
+ PUT /my-korean-index
+ {
+   "settings": {
+     "index.knn": true
+   },
+   "mappings": {
+     "properties": {
+       "content": { "type": "text" },
+       "sparse_embedding": { "type": "rank_features" }
+     }
+   }
+ }
+ ```
+
+ ### 3. Search with neural_sparse
+
+ ```json
+ GET /my-korean-index/_search
+ {
+   "query": {
+     "neural_sparse": {
+       "sparse_embedding": {
+         "query_text": "서울 여행 추천 장소",
+         "model_id": "<model_id>"
+       }
+     }
+   }
+ }
+ ```
+
+ ## Sparsity Characteristics
+
+ After training, the model produces ultra-sparse representations:
+ - **Query tokens**: ~35 active (out of the 50,000-token vocabulary)
+ - **Document tokens**: ~58 active
+ - Activation sparsity > 99.9%
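With only a few dozen active dimensions, a document's sparse vector collapses into a small token→weight map, which is exactly the payload a `rank_features` field takes at ingestion. A sketch (toy vocabulary and weights invented for illustration):

```python
def to_rank_features(weights, id_to_token, threshold=1e-6):
    """Keep only the active dimensions of a SPLADE vector as a token→weight map."""
    return {id_to_token[i]: round(float(w), 4)
            for i, w in enumerate(weights) if w > threshold}

# toy vocabulary and an almost-entirely-zero weight vector
vocab = {0: "서울", 1: "여행", 2: "추천", 3: "장소", 4: "##의"}
weights = [1.73, 0.0, 0.42, 0.0, 0.0]
features = to_rank_features(weights, vocab)
print(features)  # → {'서울': 1.73, '추천': 0.42}
```

At >99.9% sparsity a 50,000-dimension vector yields a dict of this size, which is why indexing cost stays close to BM25.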

  ## Training Details

+ - **Hardware**: NVIDIA B200 (183GB VRAM each)
+ - **Effective Batch Size**: 2,048 (64 per GPU × 4 gradient accumulation × 8 GPUs)
+ - **Epochs**: 25
+ - **Learning Rate**: 5e-5 with cosine decay
+ - **FLOPS Regularization**: λ_q=0.01, λ_d=0.003 with 20K-step quadratic warmup
+ - **Mixed Precision**: BF16
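The quadratic lambda warmup ramps the FLOPS regularizer as (step/T)², so early training is dominated by the ranking loss before sparsity pressure kicks in. A minimal sketch of such a schedule (the exact function in the training code may differ):

```python
def flops_lambda(step, lambda_max, warmup_steps=20_000):
    """Quadratically ramp the FLOPS regularization weight, then hold it flat."""
    ratio = min(step / warmup_steps, 1.0)
    return lambda_max * ratio ** 2

# lambda_q ramps from 0 to 0.01 over the first 20K steps
print(flops_lambda(0, 0.01))        # → 0.0
print(flops_lambda(10_000, 0.01))   # → 0.0025  (halfway through warmup, 0.5**2)
print(flops_lambda(50_000, 0.01))   # → 0.01    (held at lambda_max after warmup)
```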

+ ## License
+
+ Apache 2.0

config.json CHANGED
@@ -1,27 +1,45 @@
  {
  "architectures": [
- "XLMRobertaForMaskedLM"
  ],
- "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
- "classifier_dropout": null,
  "dtype": "float32",
- "eos_token_id": 2,
- "hidden_act": "gelu",
- "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
- "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
- "max_position_embeddings": 514,
- "model_type": "xlm-roberta",
  "num_attention_heads": 12,
- "num_hidden_layers": 12,
- "output_past": true,
- "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.57.6",
- "type_vocab_size": 1,
- "use_cache": true,
- "vocab_size": 250002
  }
 
  {
  "architectures": [
+ "ModernBertForMaskedLM"
  ],
+ "attention_bias": false,
+ "attention_dropout": 0.0,
  "bos_token_id": 0,
+ "classifier_activation": "gelu",
+ "classifier_bias": false,
+ "classifier_dropout": 0.0,
+ "classifier_pooling": "mean",
+ "cls_token_id": 0,
+ "decoder_bias": true,
+ "deterministic_flash_attn": false,
  "dtype": "float32",
+ "embedding_dropout": 0.0,
+ "eos_token_id": 1,
+ "global_attn_every_n_layers": 3,
+ "global_rope_theta": 160000,
+ "gradient_checkpointing": false,
+ "hidden_activation": "gelu",
  "hidden_size": 768,
+ "initializer_cutoff_factor": 2.0,
  "initializer_range": 0.02,
+ "intermediate_size": 1152,
  "layer_norm_eps": 1e-05,
+ "local_attention": 128,
+ "local_rope_theta": 10000.0,
+ "max_position_embeddings": 16384,
+ "mlp_bias": false,
+ "mlp_dropout": 0.0,
+ "model_type": "modernbert",
+ "norm_bias": false,
+ "norm_eps": 1e-05,
  "num_attention_heads": 12,
+ "num_hidden_layers": 22,
+ "pad_token_id": 49999,
  "position_embedding_type": "absolute",
+ "repad_logits_with_grad": false,
+ "sep_token_id": 1,
+ "sparse_pred_ignore_index": -100,
+ "sparse_prediction": false,
  "transformers_version": "4.57.6",
+ "vocab_size": 50000
  }
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:b29121fc71db41f44f9c2f36c235caadbb4996961c63e980f309b43e16324166
- size 1113205088

  version https://git-lfs.github.com/spec/v1
+ oid sha256:ab9f9a35738765142d146b112b3a90e73044e858c73a277c01c504b2e70eb3ff
+ size 597503064
special_tokens_map.json CHANGED
@@ -1,15 +1,51 @@
  {
- "bos_token": "<s>",
- "cls_token": "<s>",
- "eos_token": "</s>",
  "mask_token": {
  "content": "<mask>",
- "lstrip": true,
  "normalized": false,
  "rstrip": false,
  "single_word": false
  },
- "pad_token": "<pad>",
- "sep_token": "</s>",
- "unk_token": "<unk>"
  }

  {
+ "bos_token": {
+ "content": "<s>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "cls_token": {
+ "content": "<cls>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "eos_token": {
+ "content": "<\\s>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
  "mask_token": {
  "content": "<mask>",
+ "lstrip": false,
  "normalized": false,
  "rstrip": false,
  "single_word": false
  },
+ "pad_token": {
+ "content": "<pad>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "sep_token": {
+ "content": "<sep>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "unk_token": {
+ "content": "<unk>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ }
  }
tokenizer.json CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:3a56def25aa40facc030ea8b0b87f3688e4b3c39eb8b45d5702b3a1300fe2a20
- size 17082734

  version https://git-lfs.github.com/spec/v1
+ oid sha256:ef9bebca9c6529bdefa19909059e07dfdfd7c2f8afeefbf4d230a784a3847d64
+ size 1087185
tokenizer_config.json CHANGED
@@ -9,7 +9,7 @@
  "special": true
  },
  "1": {
- "content": "<pad>",
  "lstrip": false,
  "normalized": false,
  "rstrip": false,
@@ -17,7 +17,7 @@
  "special": true
  },
  "2": {
- "content": "</s>",
  "lstrip": false,
  "normalized": false,
  "rstrip": false,
@@ -25,16 +25,280 @@
  "special": true
  },
  "3": {
- "content": "<unk>",
  "lstrip": false,
  "normalized": false,
  "rstrip": false,
  "single_word": false,
  "special": true
  },
- "250001": {
  "content": "<mask>",
- "lstrip": true,
  "normalized": false,
  "rstrip": false,
  "single_word": false,
@@ -42,14 +306,17 @@
  }
  },
  "bos_token": "<s>",
- "clean_up_tokenization_spaces": false,
- "cls_token": "<s>",
- "eos_token": "</s>",
  "extra_special_tokens": {},
  "mask_token": "<mask>",
- "model_max_length": 512,
  "pad_token": "<pad>",
- "sep_token": "</s>",
- "tokenizer_class": "XLMRobertaTokenizer",
  "unk_token": "<unk>"
  }
 
  "special": true
  },
  "1": {
+ "content": "<\\s>",
  "lstrip": false,
  "normalized": false,
  "rstrip": false,

  "special": true
  },
  "2": {
+ "content": "<unk>",
  "lstrip": false,
  "normalized": false,
  "rstrip": false,

  "special": true
  },
  "3": {
+ "content": "<sep>",
  "lstrip": false,
  "normalized": false,
  "rstrip": false,
  "single_word": false,
  "special": true
  },
+ "4": {
  "content": "<mask>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "5": {
+ "content": "<cls>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "6": {
+ "content": "<unused0>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "7": {
+ "content": "<unused1>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "8": {
+ "content": "<unused2>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "9": {
+ "content": "<unused3>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "10": {
+ "content": "<unused4>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "11": {
+ "content": "<unused5>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "12": {
+ "content": "<unused6>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "13": {
+ "content": "<unused7>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "14": {
+ "content": "<unused8>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "15": {
+ "content": "<unused9>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "16": {
+ "content": "<unused10>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "17": {
+ "content": "<unused11>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "18": {
+ "content": "<unused12>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "19": {
+ "content": "<unused13>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "20": {
+ "content": "<unused14>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "21": {
+ "content": "<unused15>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "22": {
+ "content": "<unused16>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "23": {
+ "content": "<unused17>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "24": {
+ "content": "<unused18>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "25": {
+ "content": "<unused19>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "26": {
+ "content": "<unused20>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "27": {
+ "content": "<unused21>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "28": {
+ "content": "<unused22>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "29": {
+ "content": "<unused23>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "30": {
+ "content": "<unused24>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "31": {
+ "content": "<unused25>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "32": {
+ "content": "<unused26>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "33": {
+ "content": "<unused27>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "34": {
+ "content": "<unused28>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "35": {
+ "content": "<unused29>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "36": {
+ "content": "<unused30>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49999": {
+ "content": "<pad>",
+ "lstrip": false,
  "normalized": false,
  "rstrip": false,
  "single_word": false,

  }
  },
  "bos_token": "<s>",
+ "clean_up_tokenization_spaces": true,
+ "cls_token": "<cls>",
+ "do_lower_case": false,
+ "eos_token": "<\\s>",
  "extra_special_tokens": {},
  "mask_token": "<mask>",
+ "model_max_length": 16384,
  "pad_token": "<pad>",
+ "sep_token": "<sep>",
+ "strip_accents": null,
+ "tokenize_chinese_chars": true,
+ "tokenizer_class": "BertTokenizer",
  "unk_token": "<unk>"
  }