Hariprasath5128 committed on
Commit 96b6914 · verified · 1 Parent(s): bdc3669

Upload folder using huggingface_hub

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
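Note on the pooling config: the only flag set to `true` is `pooling_mode_mean_tokens`, so the sentence vector is the average of the token vectors, with padding positions left out. A minimal pure-Python sketch of that idea follows. The helper name `mean_pool` and the toy 3-dimensional vectors are assumptions for illustration (the real vectors have 768 dimensions, and the library works on tensors rather than lists).

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0."""
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                sums[i] += value
    # Guard against an all-padding input to avoid division by zero.
    count = max(count, 1)
    return [s / count for s in sums]


tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]  # the last position is padding and must not count
pooled = mean_pool(tokens, mask)
print(pooled)
```

Only the first two vectors contribute to the average here; the padded third vector is skipped.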
README.md ADDED
@@ -0,0 +1,369 @@
+ ---
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - dense
+ - generated_from_trainer
+ - dataset_size:68
+ - loss:MultipleNegativesRankingLoss
+ base_model: sentence-transformers/all-mpnet-base-v2
+ widget:
+ - source_sentence: The Atlantic spotted dolphin is a dolphin found in warm temperate
+     and tropical waters of the Atlantic Ocean. Older members of the species have a
+     very distinctive spotted coloration all over their bodies.
+   sentences:
+   - baikal_seal
+   - southern_right_whale
+   - atlantic_spotted_dolphin
+ - source_sentence: The burmeisters porpoise is a marine mammal belonging to the cetaceans
+     group. It inhabits ocean and coastal habitats worldwide and plays an important
+     role in marine ecosystems.
+   sentences:
+   - false_killer_whale
+   - burmeisters_porpoise
+   - south_asian_river_dolphin
+ - source_sentence: Dall's porpoise is a species of porpoise endemic to the North Pacific.
+     It is the largest of porpoises and the only member of the genus Phocoenoides.
+     The species is named after American naturalist W. H. Dall.
+   sentences:
+   - dalls_porpoise
+   - burrunan_dolphin
+   - bolivian_river_dolphin
+ - source_sentence: The hourglass dolphin is a small dolphin in the family Delphinidae
+     that inhabits offshore Antarctic and sub-Antarctic waters. It is commonly seen
+     from ships crossing the Drake Passage but has a circumpolar distribution.
+   sentences:
+   - common_dolphin
+   - hourglass_dolphin
+   - harbour_porpoise
+ - source_sentence: The harp seal, also known as the saddleback seal or Greenland seal,
+     is a species of earless seal, or true seal, native to the northernmost Atlantic
+     Ocean and Arctic Ocean. Originally in the genus Phoca with a number of other species,
+     it was reclassified into the monotypic genus Pagophilus in 1844. In Greek, its
+     scientific name translates to "Greenlandic ice-lover", and its taxonomic synonym,
+     Phoca groenlandica translates to "Greenlandic seal". This is the only species
+     in the genus Pagophilus.
+   sentences:
+   - harp_seal
+   - amazon_river_dolphin
+   - ringed_seal
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
+ ---
+
+ # SentenceTransformer based on sentence-transformers/all-mpnet-base-v2
+
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ - **Base model:** [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) <!-- at revision e8c3b32edf5434bc2275fc9bab85f82640a19130 -->
+ - **Maximum Sequence Length:** 384 tokens
+ - **Output Dimensionality:** 768 dimensions
+ - **Similarity Function:** Cosine Similarity
+ <!-- - **Training Dataset:** Unknown -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+
+ ### Model Sources
+
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/huggingface/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+
+ ### Full Model Architecture
+
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 384, 'do_lower_case': False, 'architecture': 'MPNetModel'})
+   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+   (2): Normalize()
+ )
+ ```
+
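Note on the architecture: the last module, `Normalize()`, scales every embedding to unit length, which is why cosine similarity and a plain dot product give the same score for this model's outputs. The sketch below illustrates that equivalence on toy 2-dimensional vectors. The helper names (`l2_normalize`, `dot`, `cosine`) are assumptions for illustration, not library functions.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        return list(vector)
    return [v / norm for v in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]
unit_a, unit_b = l2_normalize(a), l2_normalize(b)
# After normalization, the dot product matches the cosine of the raw vectors.
print(dot(unit_a, unit_b), cosine(a, b))
```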
+ ## Usage
+
+ ### Direct Usage (Sentence Transformers)
+
+ First install the Sentence Transformers library:
+
+ ```bash
+ pip install -U sentence-transformers
+ ```
+
+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("sentence_transformers_model_id")
+ # Run inference
+ sentences = [
+     'The harp seal, also known as the saddleback seal or Greenland seal, is a species of earless seal, or true seal, native to the northernmost Atlantic Ocean and Arctic Ocean. Originally in the genus Phoca with a number of other species, it was reclassified into the monotypic genus Pagophilus in 1844. In Greek, its scientific name translates to "Greenlandic ice-lover", and its taxonomic synonym, Phoca groenlandica translates to "Greenlandic seal". This is the only species in the genus Pagophilus.',
+     'harp_seal',
+     'ringed_seal',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # (3, 768)
+
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities)
+ # tensor([[1.0000, 0.7737, 0.2011],
+ #         [0.7737, 1.0000, 0.4141],
+ #         [0.2011, 0.4141, 1.0000]])
+ ```
+
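Note on the usage example: the training pairs map a species description to a label string, so one plausible way to use the scores is to pick the label most similar to a description. The sketch below reuses the first row of the similarity matrix printed above instead of loading the model. The helper name `pick_label` is an assumption for illustration.

```python
def pick_label(scores_by_label):
    """Return the (label, score) pair with the highest similarity."""
    return max(scores_by_label.items(), key=lambda item: item[1])


# First row of the printed matrix: the harp seal description vs. each label.
scores = {"harp_seal": 0.7737, "ringed_seal": 0.2011}
best_label, best_score = pick_label(scores)
print(best_label, best_score)
```

With a loaded model, the same dictionary could be filled from one row of `model.similarity(...)`, one entry per candidate label.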
+ <!--
+ ### Direct Usage (Transformers)
+
+ <details><summary>Click to see the direct usage in Transformers</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+
+ You can finetune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Training Dataset
+
+ #### Unnamed Dataset
+
+ * Size: 68 training samples
+ * Columns: <code>sentence_0</code> and <code>sentence_1</code>
+ * Approximate statistics based on the first 68 samples:
+   |         | sentence_0 | sentence_1 |
+   |:--------|:-----------|:-----------|
+   | type    | string | string |
+   | details | <ul><li>min: 11 tokens</li><li>mean: 101.24 tokens</li><li>max: 226 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 6.79 tokens</li><li>max: 12 tokens</li></ul> |
+ * Samples:
+   | sentence_0 | sentence_1 |
+   |:-----------|:-----------|
+   | <code>Dall's porpoise is a species of porpoise endemic to the North Pacific. It is the largest of porpoises and the only member of the genus Phocoenoides. The species is named after American naturalist W. H. Dall.</code> | <code>dalls_porpoise</code> |
+   | <code>The Caspian seal is one of the smallest members of the earless seal family and unique in that it is found exclusively in the brackish Caspian Sea. It lives along the shorelines, but also on the many rocky islands and floating blocks of ice that dot the Caspian Sea. In winter and cooler parts of the spring and autumn season, it populates the northern Caspian coastline. As the ice melts in the summer and warmer parts of the spring and autumn season, it also occurs in the deltas of the Volga and Ural Rivers, as well as the southern latitudes of the Caspian where the water is cooler due to greater depth.</code> | <code>caspian_seal</code> |
+   | <code>The Weddell seal is a relatively large and abundant true seal with a circumpolar distribution surrounding Antarctica. The Weddell seal was discovered and named in the 1820s during expeditions led by British sealing captain James Weddell to the area of the Southern Ocean now known as the Weddell Sea. The life history of this species is well documented since it occupies fast ice environments close to the Antarctic continent and often adjacent to Antarctic bases. It is the only species in the genus Leptonychotes.</code> | <code>weddell_seal</code> |
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
+   ```json
+   {
+       "scale": 20.0,
+       "similarity_fct": "cos_sim",
+       "gather_across_devices": false
+   }
+   ```
+
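Note on the loss: MultipleNegativesRankingLoss treats, for each description in a batch, its own label as the positive and the labels of the other pairs in the batch as negatives. The cosine similarities are multiplied by `scale` (20.0 here) and passed through a softmax cross-entropy. The sketch below works on a precomputed similarity matrix. It is a simplified illustration under that assumption, not the library's implementation, and `mnrl_loss` is a hypothetical name.

```python
import math


def mnrl_loss(similarities, scale=20.0):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(similarities):
        logits = [scale * s for s in row]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_norm - logits[i]
    return total / len(similarities)


# Toy 2x2 batch: cosine similarity of description i with label j.
well_separated = [[0.9, 0.1], [0.2, 0.8]]
confused = [[0.5, 0.5], [0.5, 0.5]]
print(mnrl_loss(well_separated), mnrl_loss(confused))
```

When every label looks equally similar, the loss equals the log of the batch size; it approaches zero as the matching pairs pull ahead of the in-batch negatives.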
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+
+ - `num_train_epochs`: 5
+ - `multi_dataset_batch_sampler`: round_robin
+
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: no
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 8
+ - `per_device_eval_batch_size`: 8
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 5e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1
+ - `num_train_epochs`: 5
+ - `max_steps`: -1
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.0
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: False
+ - `fp16`: False
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `parallelism_config`: None
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: None
+ - `hub_always_push`: False
+ - `hub_revision`: None
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `include_for_metrics`: []
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`:
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `use_liger_kernel`: False
+ - `liger_kernel_config`: None
+ - `eval_use_gather_object`: False
+ - `average_tokens_across_devices`: False
+ - `prompts`: None
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: round_robin
+ - `router_mapping`: {}
+ - `learning_rate_mapping`: {}
+
+ </details>
+
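Note on the hyperparameters: the listed values imply a very short run. The arithmetic below assumes a single training device (the card does not state the device count) and uses only values from the list above: 68 samples, `per_device_train_batch_size` 8, `gradient_accumulation_steps` 1, `dataloader_drop_last` False, 5 epochs, and a linear schedule with no warmup starting at 5e-05. The function name `linear_lr` is hypothetical.

```python
import math

num_samples = 68
batch_size = 8   # per_device_train_batch_size, one device assumed
epochs = 5
base_lr = 5e-05

# drop_last is False, so the final partial batch still counts as a step.
steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * epochs


def linear_lr(step, total=total_steps, lr=base_lr):
    """Learning rate after `step` optimizer updates with linear decay and no warmup."""
    return lr * max(0.0, (total - step) / total)


print(steps_per_epoch, total_steps, linear_lr(0), linear_lr(total_steps))
```

Under these assumptions the model sees 9 batches per epoch and 45 optimizer updates in total, with the learning rate falling linearly to zero by the last step.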
+ ### Framework Versions
+ - Python: 3.10.11
+ - Sentence Transformers: 5.2.3
+ - Transformers: 4.56.1
+ - PyTorch: 2.5.1+cu121
+ - Accelerate: 1.10.1
+ - Datasets: 4.0.0
+ - Tokenizers: 0.22.0
+
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ #### MultipleNegativesRankingLoss
+ ```bibtex
+ @misc{henderson2017efficient,
+     title={Efficient Natural Language Response Suggestion for Smart Reply},
+     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+     year={2017},
+     eprint={1705.00652},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "architectures": [
+     "MPNetModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "dtype": "float32",
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "mpnet",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 1,
+   "relative_attention_num_buckets": 32,
+   "transformers_version": "4.56.1",
+   "vocab_size": 30527
+ }
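Note on the encoder config: a few derived quantities follow directly from these values and can serve as a quick sanity check. The sketch below copies four fields from the file above; the variable names `head_dim` and `ffn_ratio` are chosen for illustration only.

```python
config = {
    "hidden_size": 768,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "intermediate_size": 3072,
}

# The hidden size must split evenly across attention heads.
assert config["hidden_size"] % config["num_attention_heads"] == 0
head_dim = config["hidden_size"] // config["num_attention_heads"]
# Ratio of the feed-forward width to the hidden width.
ffn_ratio = config["intermediate_size"] // config["hidden_size"]
print(head_dim, ffn_ratio)
```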
config_sentence_transformers.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "__version__": {
+     "sentence_transformers": "5.2.3",
+     "transformers": "4.56.1",
+     "pytorch": "2.5.1+cu121"
+   },
+   "model_type": "SentenceTransformer",
+   "prompts": {
+     "query": "",
+     "document": ""
+   },
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f8a6a0f39e541e49ef9d5de138f296cc57e16159928f36805ca0d3071e4c4727
+ size 437967672
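Note on the weights file: what is committed here is a Git LFS pointer, three `key value` lines that stand in for the actual weights. The sketch below parses the pointer and turns the `size` field into rough figures. The helper name `parse_lfs_pointer` is hypothetical, and the parameter estimate assumes all weights are stored as 4-byte values, in line with `"dtype": "float32"` in `config.json`; it slightly overestimates because the file also carries a header.

```python
def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:f8a6a0f39e541e49ef9d5de138f296cc57e16159928f36805ca0d3071e4c4727
size 437967672
"""

fields = parse_lfs_pointer(pointer)
size_bytes = int(fields["size"])
size_mib = size_bytes / 2**20
# Upper bound on the number of float32 values the file can hold.
approx_params = size_bytes // 4
print(f"{size_mib:.1f} MiB, about {approx_params / 1e6:.0f}M float32 values")
```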
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
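Note on the module list: this file describes the pipeline order, with each entry naming a module class and the folder holding its files (an empty `path` points at the repository root). The sketch below only reads that structure and prints the order; it does not import or instantiate anything from the library.

```python
import json

modules_json = """[
  {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
  {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"},
  {"idx": 2, "name": "2", "path": "2_Normalize", "type": "sentence_transformers.models.Normalize"}
]"""

# Sort by idx so the pipeline order does not depend on file order.
modules = sorted(json.loads(modules_json), key=lambda m: m["idx"])
for module in modules:
    class_name = module["type"].rsplit(".", 1)[-1]
    folder = module["path"] or "(repository root)"
    print(f'{module["idx"]}: {class_name} <- {folder}')

order = [m["type"].rsplit(".", 1)[-1] for m in modules]
```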
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 384,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,73 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "104": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "30526": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "<s>",
+   "do_lower_case": true,
+   "eos_token": "</s>",
+   "extra_special_tokens": {},
+   "mask_token": "<mask>",
+   "max_length": 128,
+   "model_max_length": 384,
+   "pad_to_multiple_of": null,
+   "pad_token": "<pad>",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "</s>",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "MPNetTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
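Note on the tokenizer config: the special-token ids and the `padding_side`, `truncation_side`, and `model_max_length` settings together determine how a single text is laid out as model input. The sketch below mimics that layout on made-up token ids. It is an illustration under the assumption of single-sequence input, not the tokenizer's code, and `build_inputs` is a hypothetical helper.

```python
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2  # <s>, <pad>, </s> from added_tokens_decoder
MODEL_MAX_LENGTH = 384


def build_inputs(token_ids, pad_to, max_length=MODEL_MAX_LENGTH):
    """Wrap ids as <s> ... </s>, truncate on the right, then pad on the right."""
    body = token_ids[: max_length - 2]  # leave room for <s> and </s>
    ids = [BOS_ID] + body + [EOS_ID]
    mask = [1] * len(ids)
    padding = max(0, pad_to - len(ids))
    return ids + [PAD_ID] * padding, mask + [0] * padding


ids, mask = build_inputs([2000, 2001, 2002], pad_to=8)  # body ids are made up
print(ids)
print(mask)
```

The attention mask produced here is the same kind of mask that mean pooling uses to skip padded positions.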
vocab.txt ADDED
The diff for this file is too large to render. See raw diff