MossaabDev committed on
Commit
5055cb8
·
verified ·
1 Parent(s): af47b50

Upload 10 files

.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
37
+ unigram.json filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,358 @@
1
  ---
2
- license: apache-2.0
3
  ---
1
  ---
2
+ tags:
3
+ - sentence-transformers
4
+ - sentence-similarity
5
+ - feature-extraction
6
+ - dense
7
+ - generated_from_trainer
8
+ - dataset_size:217
9
+ - loss:CosineSimilarityLoss
10
+ base_model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
11
+ widget:
12
+ - source_sentence: my teacher died last year, I miss him
13
+ sentences:
14
+ - Every soul will taste death, then to Us you will ˹all˺ be returned.
15
+ - And live with them in kindness. For if you dislike them - perhaps you dislike
16
+ a thing and Allah makes therein much good
17
+ - And live with them in kindness. For if you dislike them perhaps you dislike a
18
+ thing and Allah makes therein much good
19
+ - source_sentence: I am sad I saw kids in gaza are dying
20
+ sentences:
21
+ - And never think that Allah is unaware of what the wrongdoers do. He only delays
22
+ them for a Day when eyes will stare [in horror]
23
+ - Bid your people to pray, and be diligent in ˹observing˺ it. We do not ask you
24
+ to provide. It is We Who provide for you. And the ultimate outcome is ˹only˺ for
25
+ ˹the people of˺ righteousness.
26
+ - And We will surely test you with something of fear and hunger and a loss of wealth
27
+ and lives and fruits, but give good tidings to the patient
28
+ - source_sentence: is prayers mandatory, I want my brother and may family to pray
29
+ everyday
30
+ sentences:
31
+ - Every soul will taste death. And you will only receive your full reward on the
32
+ Day of Judgment. Whoever is spared from the Fire and is admitted into Paradise
33
+ will ˹indeed˺ triumph, whereas the life of this world is no more than the delusion
34
+ of enjoyment.
35
+ - And seek help through patience and prayer. Indeed, it is a burden except for the
36
+ humble
37
+ - Bid your people to pray, and be diligent in ˹observing˺ it. We do not ask you
38
+ to provide. It is We Who provide for you. And the ultimate outcome is ˹only˺ for
39
+ ˹the people of˺ righteousness.
40
+ - source_sentence: I feel jhopeless
41
+ sentences:
42
+ - And when I am ill, it is He who cures me
43
+ - And seek help through patience and prayer. Indeed, it is a burden except for the
44
+ humble
45
+ - And when I am ill, it is He who cures me
46
+ - source_sentence: I failed in exams
47
+ sentences:
48
+ - But perhaps you hate a thing and it is good for you; and perhaps you love a thing
49
+ and it is bad for you. And Allah knows, while you know not
50
+ - O humanity! Indeed, there has come to you a warning from your Lord, a cure for
51
+ what is in the hearts, a guide, and a mercy for the believers.
52
+ - 'Give good news to those who patiently endure who say, when struck by a disaster, Surely
53
+ to Allah we belong and to Him we will ˹all˺ return. '
54
+ pipeline_tag: sentence-similarity
55
+ library_name: sentence-transformers
56
  ---
57
+
58
+ # SentenceTransformer based on sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
59
+
60
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2). It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
61
+
62
+ ## Model Details
63
+
64
+ ### Model Description
65
+ - **Model Type:** Sentence Transformer
66
+ - **Base model:** [sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) <!-- at revision 86741b4e3f5cb7765a600d3a3d55a0f6a6cb443d -->
67
+ - **Maximum Sequence Length:** 128 tokens
68
+ - **Output Dimensionality:** 384 dimensions
69
+ - **Similarity Function:** Cosine Similarity
70
+ <!-- - **Training Dataset:** Unknown -->
71
+ <!-- - **Language:** Unknown -->
72
+ <!-- - **License:** Unknown -->
73
+
74
+ ### Model Sources
75
+
76
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
77
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/huggingface/sentence-transformers)
78
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
79
+
80
+ ### Full Model Architecture
81
+
82
+ ```
83
+ SentenceTransformer(
84
+ (0): Transformer({'max_seq_length': 128, 'do_lower_case': False, 'architecture': 'BertModel'})
85
+ (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
86
+ )
87
+ ```
88
+
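The `Pooling` module above sets `pooling_mode_mean_tokens` to `True`, so a sentence embedding is the average of its token embeddings, with padded positions excluded. A minimal pure-Python sketch of that idea follows; the function name `mean_pool` and the toy vectors are illustrative assumptions, not the library's own implementation.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against an all-padding input to avoid dividing by zero.
    count = max(count, 1)
    return [value / count for value in summed]


# Toy example: three token positions of dimension 2; the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

In the real model each token vector has 384 entries and a sequence is truncated at 128 tokens, per the architecture listing.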
89
+ ## Usage
90
+
91
+ ### Direct Usage (Sentence Transformers)
92
+
93
+ First install the Sentence Transformers library:
94
+
95
+ ```bash
96
+ pip install -U sentence-transformers
97
+ ```
98
+
99
+ Then you can load this model and run inference.
100
+ ```python
101
+ from sentence_transformers import SentenceTransformer
102
+
103
+ # Download from the 🤗 Hub
104
+ model = SentenceTransformer("sentence_transformers_model_id")
105
+ # Run inference
106
+ sentences = [
107
+ 'I failed in exams',
108
+ 'But perhaps you hate a thing and it is good for you; and perhaps you love a thing and it is bad for you. And Allah knows, while you know not',
109
+ 'O humanity! Indeed, there has come to you a warning from your Lord, a cure for what is in the hearts, a guide, and a mercy for the believers.',
110
+ ]
111
+ embeddings = model.encode(sentences)
112
+ print(embeddings.shape)
113
+ # (3, 384)
114
+
115
+ # Get the similarity scores for the embeddings
116
+ similarities = model.similarity(embeddings, embeddings)
117
+ print(similarities)
118
+ # tensor([[1.0000, 0.8458, 0.7432],
119
+ # [0.8458, 1.0000, 0.7996],
120
+ # [0.7432, 0.7996, 1.0000]])
121
+ ```
122
+
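Since the card lists cosine similarity as the model's similarity function, the scores in the usage example can be read as cosines between embedding vectors. The sketch below shows how such scores could rank candidate passages for a query. It uses hand-written 3-dimensional vectors in place of real 384-dimensional embeddings, and the helper names `cosine_similarity` and `rank_candidates` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(query_vec, candidate_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, c)) for i, c in enumerate(candidate_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for the output of model.encode(...).
query = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
ranking = rank_candidates(query, candidates)
print([i for i, _ in ranking])  # [1, 2, 0]
```

Candidate 1 ranks first even though it is twice as long as the query, because cosine similarity ignores vector magnitude.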
123
+ <!--
124
+ ### Direct Usage (Transformers)
125
+
126
+ <details><summary>Click to see the direct usage in Transformers</summary>
127
+
128
+ </details>
129
+ -->
130
+
131
+ <!--
132
+ ### Downstream Usage (Sentence Transformers)
133
+
134
+ You can finetune this model on your own dataset.
135
+
136
+ <details><summary>Click to expand</summary>
137
+
138
+ </details>
139
+ -->
140
+
141
+ <!--
142
+ ### Out-of-Scope Use
143
+
144
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
145
+ -->
146
+
147
+ <!--
148
+ ## Bias, Risks and Limitations
149
+
150
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
151
+ -->
152
+
153
+ <!--
154
+ ### Recommendations
155
+
156
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
157
+ -->
158
+
159
+ ## Training Details
160
+
161
+ ### Training Dataset
162
+
163
+ #### Unnamed Dataset
164
+
165
+ * Size: 217 training samples
166
+ * Columns: <code>sentence_0</code>, <code>sentence_1</code>, and <code>label</code>
167
+ * Approximate statistics based on the first 217 samples:
168
+ | | sentence_0 | sentence_1 | label |
169
+ |:--------|:----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:---------------------------------------------------------------|
170
+ | type | string | string | float |
171
+ | details | <ul><li>min: 5 tokens</li><li>mean: 11.92 tokens</li><li>max: 34 tokens</li></ul> | <ul><li>min: 3 tokens</li><li>mean: 37.53 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 0.0</li><li>mean: 0.88</li><li>max: 1.0</li></ul> |
172
+ * Samples:
173
+ | sentence_0 | sentence_1 | label |
174
+ |:-------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------|
175
+ | <code>how to avoid fights, and spread love?</code> | <code>Repel evil with that which is better, and then the one whom there is enmity between you and him will become as though he was a close friend</code> | <code>1.0</code> |
176
+ | <code>I can not provide for my family</code> | <code>Bid your people to pray, and be diligent in ˹observing˺ it. We do not ask you to provide. It is We Who provide for you. And the ultimate outcome is ˹only˺ for ˹the people of˺ righteousness.</code> | <code>0.0</code> |
177
+ | <code>is allah testing me or turtoring me, it is really difficult</code> | <code>And We will surely test you with something of fear and hunger and a loss of wealth and lives and fruits, but give good tidings to the patient</code> | <code>1.0</code> |
178
+ * Loss: [<code>CosineSimilarityLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss) with these parameters:
179
+ ```json
180
+ {
181
+ "loss_fct": "torch.nn.modules.loss.MSELoss"
182
+ }
183
+ ```
184
+
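`CosineSimilarityLoss` with an `MSELoss` loss function compares the cosine similarity of each embedded sentence pair against its float `label` and penalizes the squared difference. The following is a rough standard-library sketch of that objective on toy 2-dimensional vectors; the function names are illustrative, and the real loss operates on batched tensors produced by the model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_similarity_mse(pairs, labels):
    """Mean squared error between cos(u, v) and the target label of each pair."""
    errors = [(cosine_similarity(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)


# Two toy pairs, both labelled 1.0 (i.e. "should be similar").
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical direction: cosine 1.0, error 0.0
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal: cosine 0.0, error 1.0
]
labels = [1.0, 1.0]
loss = cosine_similarity_mse(pairs, labels)
print(loss)  # 0.5
```

The dataset statistics above report a mean label of 0.88, so most of the 217 pairs pull the two embeddings together, and relatively few, labelled 0.0 or otherwise low, push them apart.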
185
+ ### Training Hyperparameters
186
+ #### Non-Default Hyperparameters
187
+
188
+ - `num_train_epochs`: 10
189
+ - `multi_dataset_batch_sampler`: round_robin
190
+
191
+ #### All Hyperparameters
192
+ <details><summary>Click to expand</summary>
193
+
194
+ - `overwrite_output_dir`: False
195
+ - `do_predict`: False
196
+ - `eval_strategy`: no
197
+ - `prediction_loss_only`: True
198
+ - `per_device_train_batch_size`: 8
199
+ - `per_device_eval_batch_size`: 8
200
+ - `per_gpu_train_batch_size`: None
201
+ - `per_gpu_eval_batch_size`: None
202
+ - `gradient_accumulation_steps`: 1
203
+ - `eval_accumulation_steps`: None
204
+ - `torch_empty_cache_steps`: None
205
+ - `learning_rate`: 5e-05
206
+ - `weight_decay`: 0.0
207
+ - `adam_beta1`: 0.9
208
+ - `adam_beta2`: 0.999
209
+ - `adam_epsilon`: 1e-08
210
+ - `max_grad_norm`: 1
211
+ - `num_train_epochs`: 10
212
+ - `max_steps`: -1
213
+ - `lr_scheduler_type`: linear
214
+ - `lr_scheduler_kwargs`: {}
215
+ - `warmup_ratio`: 0.0
216
+ - `warmup_steps`: 0
217
+ - `log_level`: passive
218
+ - `log_level_replica`: warning
219
+ - `log_on_each_node`: True
220
+ - `logging_nan_inf_filter`: True
221
+ - `save_safetensors`: True
222
+ - `save_on_each_node`: False
223
+ - `save_only_model`: False
224
+ - `restore_callback_states_from_checkpoint`: False
225
+ - `no_cuda`: False
226
+ - `use_cpu`: False
227
+ - `use_mps_device`: False
228
+ - `seed`: 42
229
+ - `data_seed`: None
230
+ - `jit_mode_eval`: False
231
+ - `bf16`: False
232
+ - `fp16`: False
233
+ - `fp16_opt_level`: O1
234
+ - `half_precision_backend`: auto
235
+ - `bf16_full_eval`: False
236
+ - `fp16_full_eval`: False
237
+ - `tf32`: None
238
+ - `local_rank`: 0
239
+ - `ddp_backend`: None
240
+ - `tpu_num_cores`: None
241
+ - `tpu_metrics_debug`: False
242
+ - `debug`: []
243
+ - `dataloader_drop_last`: False
244
+ - `dataloader_num_workers`: 0
245
+ - `dataloader_prefetch_factor`: None
246
+ - `past_index`: -1
247
+ - `disable_tqdm`: False
248
+ - `remove_unused_columns`: True
249
+ - `label_names`: None
250
+ - `load_best_model_at_end`: False
251
+ - `ignore_data_skip`: False
252
+ - `fsdp`: []
253
+ - `fsdp_min_num_params`: 0
254
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
255
+ - `fsdp_transformer_layer_cls_to_wrap`: None
256
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
257
+ - `parallelism_config`: None
258
+ - `deepspeed`: None
259
+ - `label_smoothing_factor`: 0.0
260
+ - `optim`: adamw_torch_fused
261
+ - `optim_args`: None
262
+ - `adafactor`: False
263
+ - `group_by_length`: False
264
+ - `length_column_name`: length
265
+ - `project`: huggingface
266
+ - `trackio_space_id`: trackio
267
+ - `ddp_find_unused_parameters`: None
268
+ - `ddp_bucket_cap_mb`: None
269
+ - `ddp_broadcast_buffers`: False
270
+ - `dataloader_pin_memory`: True
271
+ - `dataloader_persistent_workers`: False
272
+ - `skip_memory_metrics`: True
273
+ - `use_legacy_prediction_loop`: False
274
+ - `push_to_hub`: False
275
+ - `resume_from_checkpoint`: None
276
+ - `hub_model_id`: None
277
+ - `hub_strategy`: every_save
278
+ - `hub_private_repo`: None
279
+ - `hub_always_push`: False
280
+ - `hub_revision`: None
281
+ - `gradient_checkpointing`: False
282
+ - `gradient_checkpointing_kwargs`: None
283
+ - `include_inputs_for_metrics`: False
284
+ - `include_for_metrics`: []
285
+ - `eval_do_concat_batches`: True
286
+ - `fp16_backend`: auto
287
+ - `push_to_hub_model_id`: None
288
+ - `push_to_hub_organization`: None
289
+ - `mp_parameters`:
290
+ - `auto_find_batch_size`: False
291
+ - `full_determinism`: False
292
+ - `torchdynamo`: None
293
+ - `ray_scope`: last
294
+ - `ddp_timeout`: 1800
295
+ - `torch_compile`: False
296
+ - `torch_compile_backend`: None
297
+ - `torch_compile_mode`: None
298
+ - `include_tokens_per_second`: False
299
+ - `include_num_input_tokens_seen`: no
300
+ - `neftune_noise_alpha`: None
301
+ - `optim_target_modules`: None
302
+ - `batch_eval_metrics`: False
303
+ - `eval_on_start`: False
304
+ - `use_liger_kernel`: False
305
+ - `liger_kernel_config`: None
306
+ - `eval_use_gather_object`: False
307
+ - `average_tokens_across_devices`: True
308
+ - `prompts`: None
309
+ - `batch_sampler`: batch_sampler
310
+ - `multi_dataset_batch_sampler`: round_robin
311
+ - `router_mapping`: {}
312
+ - `learning_rate_mapping`: {}
313
+
314
+ </details>
315
+
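The hyperparameters above are enough to estimate the length of the run and the shape of the learning-rate curve. The sketch below assumes a single device, which the card does not state; the batch size, epoch count, `gradient_accumulation_steps` of 1, and the `linear` scheduler with zero warmup are taken from the list. `linear_lr` is a hypothetical helper that mirrors the usual linear-decay formula.

```python
import math

num_samples = 217   # "Size: 217 training samples"
batch_size = 8      # per_device_train_batch_size (single device assumed)
num_epochs = 10     # num_train_epochs
base_lr = 5e-05     # learning_rate
warmup_steps = 0    # warmup_steps

# dataloader_drop_last is False, so the final partial batch still counts as a step.
steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * num_epochs


def linear_lr(step):
    """Learning rate after `step` optimizer updates under linear warmup and decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


print(steps_per_epoch, total_steps)  # 28 280
print(linear_lr(0), linear_lr(total_steps // 2), linear_lr(total_steps))
```

Under these assumptions the learning rate starts at its full value, is halved midway through the 280 updates, and reaches zero at the end.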
316
+ ### Framework Versions
317
+ - Python: 3.12.4
318
+ - Sentence Transformers: 5.1.2
319
+ - Transformers: 4.57.1
320
+ - PyTorch: 2.8.0+cu126
321
+ - Accelerate: 1.11.0
322
+ - Datasets: 4.4.1
323
+ - Tokenizers: 0.22.1
324
+
325
+ ## Citation
326
+
327
+ ### BibTeX
328
+
329
+ #### Sentence Transformers
330
+ ```bibtex
331
+ @inproceedings{reimers-2019-sentence-bert,
332
+ title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
333
+ author = "Reimers, Nils and Gurevych, Iryna",
334
+ booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
335
+ month = "11",
336
+ year = "2019",
337
+ publisher = "Association for Computational Linguistics",
338
+ url = "https://arxiv.org/abs/1908.10084",
339
+ }
340
+ ```
341
+
342
+ <!--
343
+ ## Glossary
344
+
345
+ *Clearly define terms in order to be accessible across audiences.*
346
+ -->
347
+
348
+ <!--
349
+ ## Model Card Authors
350
+
351
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
352
+ -->
353
+
354
+ <!--
355
+ ## Model Card Contact
356
+
357
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
358
+ -->
config.json ADDED
@@ -0,0 +1,25 @@
1
+ {
2
+ "architectures": [
3
+ "BertModel"
4
+ ],
5
+ "attention_probs_dropout_prob": 0.1,
6
+ "classifier_dropout": null,
7
+ "dtype": "float32",
8
+ "gradient_checkpointing": false,
9
+ "hidden_act": "gelu",
10
+ "hidden_dropout_prob": 0.1,
11
+ "hidden_size": 384,
12
+ "initializer_range": 0.02,
13
+ "intermediate_size": 1536,
14
+ "layer_norm_eps": 1e-12,
15
+ "max_position_embeddings": 512,
16
+ "model_type": "bert",
17
+ "num_attention_heads": 12,
18
+ "num_hidden_layers": 12,
19
+ "pad_token_id": 0,
20
+ "position_embedding_type": "absolute",
21
+ "transformers_version": "4.57.1",
22
+ "type_vocab_size": 2,
23
+ "use_cache": true,
24
+ "vocab_size": 250037
25
+ }
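The values in this `config.json` allow a rough parameter count. The sketch below assumes the standard BERT layout (word, position, and token-type embeddings; per-layer self-attention and feed-forward blocks, each followed by a LayerNorm; and a pooler layer). That layout is an assumption about the checkpoint rather than something the diff confirms, and the helper names are hypothetical.

```python
config = {
    "vocab_size": 250037,
    "hidden_size": 384,
    "intermediate_size": 1536,
    "max_position_embeddings": 512,
    "type_vocab_size": 2,
    "num_hidden_layers": 12,
}


def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * n


def count_bert_parameters(cfg):
    h = cfg["hidden_size"]
    ffn = cfg["intermediate_size"]
    rows = cfg["vocab_size"] + cfg["max_position_embeddings"] + cfg["type_vocab_size"]
    embeddings = rows * h + layer_norm(h)
    attention = 4 * linear(h, h) + layer_norm(h)  # query, key, value, output
    feed_forward = linear(h, ffn) + linear(ffn, h) + layer_norm(h)
    encoder = cfg["num_hidden_layers"] * (attention + feed_forward)
    pooler = linear(h, h)
    return embeddings + encoder + pooler


total = count_bert_parameters(config)
print(total)      # 117653760
print(total * 4)  # 470615040 bytes at 4 bytes per float32 value
```

The estimate of 470,615,040 bytes is within about 22 KB of the 470,637,416-byte `model.safetensors` recorded in this commit. That gap would be consistent with file header metadata, though this sketch does not verify it. Roughly 96 million of the parameters sit in the 250,037-row word-embedding table.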
config_sentence_transformers.json ADDED
@@ -0,0 +1,14 @@
1
+ {
2
+ "__version__": {
3
+ "sentence_transformers": "5.1.2",
4
+ "transformers": "4.57.1",
5
+ "pytorch": "2.8.0+cu126"
6
+ },
7
+ "model_type": "SentenceTransformer",
8
+ "prompts": {
9
+ "query": "",
10
+ "document": ""
11
+ },
12
+ "default_prompt_name": null,
13
+ "similarity_fn_name": "cosine"
14
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d668e566c7d1d7cf382a2eae91bf2b9e212f33b0d5083a99607d1f24fcd0bbcd
3
+ size 470637416
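The three lines above are a Git LFS pointer rather than the model weights themselves: a spec version, the SHA-256 object ID of the real file, and its size in bytes. A small sketch of reading those fields follows; `parse_lfs_pointer` is a hypothetical helper, not part of any Git LFS tooling.

```python
def parse_lfs_pointer(text):
    """Split each 'key value' line of a pointer file into a dict entry."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:d668e566c7d1d7cf382a2eae91bf2b9e212f33b0d5083a99607d1f24fcd0bbcd
size 470637416
"""
pointer = parse_lfs_pointer(pointer_text)
print(pointer["hash_algorithm"], len(pointer["digest"]))  # sha256 64
print(round(pointer["size_bytes"] / 2**20, 1))  # size in MiB
```

The `tokenizer.json` and `unigram.json` entries later in this commit follow the same pointer format, which is why `.gitattributes` gained LFS rules for both files.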
modules.json ADDED
@@ -0,0 +1,14 @@
1
+ [
2
+ {
3
+ "idx": 0,
4
+ "name": "0",
5
+ "path": "",
6
+ "type": "sentence_transformers.models.Transformer"
7
+ },
8
+ {
9
+ "idx": 1,
10
+ "name": "1",
11
+ "path": "1_Pooling",
12
+ "type": "sentence_transformers.models.Pooling"
13
+ }
14
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "max_seq_length": 128,
3
+ "do_lower_case": false
4
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<s>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "cls_token": {
10
+ "content": "<s>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "eos_token": {
17
+ "content": "</s>",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "mask_token": {
24
+ "content": "<mask>",
25
+ "lstrip": true,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "pad_token": {
31
+ "content": "<pad>",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ },
37
+ "sep_token": {
38
+ "content": "</s>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false
43
+ },
44
+ "unk_token": {
45
+ "content": "<unk>",
46
+ "lstrip": false,
47
+ "normalized": false,
48
+ "rstrip": false,
49
+ "single_word": false
50
+ }
51
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:cad551d5600a84242d0973327029452a1e3672ba6313c2a3c3d69c4310e12719
3
+ size 17082987
tokenizer_config.json ADDED
@@ -0,0 +1,65 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "<s>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "1": {
12
+ "content": "<pad>",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "2": {
20
+ "content": "</s>",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "3": {
28
+ "content": "<unk>",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "250001": {
36
+ "content": "<mask>",
37
+ "lstrip": true,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ }
43
+ },
44
+ "bos_token": "<s>",
45
+ "clean_up_tokenization_spaces": false,
46
+ "cls_token": "<s>",
47
+ "do_lower_case": true,
48
+ "eos_token": "</s>",
49
+ "extra_special_tokens": {},
50
+ "mask_token": "<mask>",
51
+ "max_length": 128,
52
+ "model_max_length": 128,
53
+ "pad_to_multiple_of": null,
54
+ "pad_token": "<pad>",
55
+ "pad_token_type_id": 0,
56
+ "padding_side": "right",
57
+ "sep_token": "</s>",
58
+ "stride": 0,
59
+ "strip_accents": null,
60
+ "tokenize_chinese_chars": true,
61
+ "tokenizer_class": "BertTokenizer",
62
+ "truncation_side": "right",
63
+ "truncation_strategy": "longest_first",
64
+ "unk_token": "<unk>"
65
+ }
unigram.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:da145b5e7700ae40f16691ec32a0b1fdc1ee3298db22a31ea55f57a966c4a65d
3
+ size 14763260