dpshade22 committed
Commit 0d35d7a · verified · 1 Parent(s): b10702b

Upload hf-e5-bible-200 embedding model

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
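
Only `pooling_mode_mean_tokens` is enabled above, so sentence embeddings are the attention-masked mean of the token embeddings. A minimal sketch of that computation (illustrative only; the actual implementation is `sentence_transformers.models.Pooling`, and the tensor names here are my own):

```python
import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean of token embeddings over real (non-padding) positions.

    token_embeddings: (batch, seq_len, 768); attention_mask: (batch, seq_len).
    """
    mask = attention_mask.unsqueeze(-1).float()       # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)     # zero out padding, sum real tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)          # tokens per sequence, no divide-by-zero
    return summed / counts                            # (batch, 768)
```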
README.md ADDED
@@ -0,0 +1,386 @@
+ ---
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - dense
+ - generated_from_trainer
+ - dataset_size:262023
+ - loss:MultipleNegativesRankingLoss
+ base_model: intfloat/e5-base-v2
+ widget:
+ - source_sentence: 'query: Sin in the Bible'
+   sentences:
+   - 'passage: but each person is tempted when they are dragged away by their own evil
+     desire and enticed.'
+   - 'passage: If they want to inquire about something, they should ask their own husbands
+     at home; for it is disgraceful for a woman to speak in the church.'
+   - 'passage: The crowds that went ahead of him and those that followed shouted,
+
+     “Hosanna to the Son of David!”
+
+     “Blessed is he who comes in the name of the Lord!”
+
+     “Hosanna in the highest heaven!”'
+ - source_sentence: 'query: Naphtali in the Bible'
+   sentences:
+   - "passage: About Naphtali he said:\n“Naphtali is abounding with the favor of the\
+     \ Lord\n and is full of his blessing;\n he will inherit southward to the\
+     \ lake.”"
+   - "passage: You have enlarged the nation, Lord;\n you have enlarged the nation.\n\
+     You have gained glory for yourself;\n you have extended all the borders of\
+     \ the land."
+   - 'passage: For Herod himself had given orders to have John arrested, and he had
+     him bound and put in prison. He did this because of Herodias, his brother Philip’s
+     wife, whom he had married.'
+ - source_sentence: 'query: Ten Commandments Given in the Bible'
+   sentences:
+   - 'passage: As they were shouting and throwing off their cloaks and flinging dust
+     into the air,'
+   - 'passage: On the first day of the third month after the Israelites left Egypt—on
+     that very day—they came to the Desert of Sinai.'
+   - "passage: Blessed are the meek,\n for they will inherit the earth."
+ - source_sentence: 'query: But Nahash the Ammonite replied, “I will make a treaty
+     with you only on the condition that I gouge out the right eye of every one of
+     you and so bring disgrace on all Israel.”'
+   sentences:
+   - 'passage: A certain man in Maon, who had property there at Carmel, was very wealthy.
+     He had a thousand goats and three thousand sheep, which he was shearing in Carmel.'
+   - 'passage: Nor did Asher drive out those living in Akko or Sidon or Ahlab or Akzib
+     or Helbah or Aphek or Rehob.'
+   - 'passage: The elders of Jabesh said to him, “Give us seven days so we can send
+     messengers throughout Israel; if no one comes to rescue us, we will surrender
+     to you.”'
+ - source_sentence: 'query: who is Ephraim'
+   sentences:
+   - "passage: Those who guide this people mislead them,\n and those who are guided\
+     \ are led astray."
+   - "passage: No longer will they teach their neighbor,\n or say to one another,\
+     \ ‘Know the Lord,’\nbecause they will all know me,\n from the least of them\
+     \ to the greatest."
+   - 'passage: But a man of God came to him and said, “Your Majesty, these troops from
+     Israel must not march with you, for the Lord is not with Israel—not with any of
+     the people of Ephraim.'
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
+ ---
+
+ # SentenceTransformer based on intfloat/e5-base-v2
+
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [intfloat/e5-base-v2](https://huggingface.co/intfloat/e5-base-v2). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ - **Base model:** [intfloat/e5-base-v2](https://huggingface.co/intfloat/e5-base-v2) <!-- at revision f52bf8ec8c7124536f0efb74aca902b2995e5bcd -->
+ - **Maximum Sequence Length:** 256 tokens
+ - **Output Dimensionality:** 768 dimensions
+ - **Similarity Function:** Cosine Similarity
+ <!-- - **Training Dataset:** Unknown -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+
+ ### Model Sources
+
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/huggingface/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+
+ ### Full Model Architecture
+
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
+   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+   (2): Normalize()
+ )
+ ```
+
+ ## Usage
+
+ ### Direct Usage (Sentence Transformers)
+
+ First install the Sentence Transformers library:
+
+ ```bash
+ pip install -U sentence-transformers
+ ```
+
+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("sentence_transformers_model_id")
+ # Run inference
+ sentences = [
+     'query: who is Ephraim',
+     'passage: But a man of God came to him and said, “Your Majesty, these troops from Israel must not march with you, for the Lord is not with Israel—not with any of the people of Ephraim.',
+     'passage: Those who guide this people mislead them,\n and those who are guided are led astray.',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # [3, 768]
+
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities)
+ # tensor([[1.0000, 0.7104, 0.2667],
+ #         [0.7104, 1.0000, 0.3225],
+ #         [0.2667, 0.3225, 1.0000]])
+ ```
+
+ <!--
+ ### Direct Usage (Transformers)
+
+ <details><summary>Click to see the direct usage in Transformers</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+
+ You can finetune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Training Dataset
+
+ #### Unnamed Dataset
+
+ * Size: 262,023 training samples
+ * Columns: <code>sentence_0</code>, <code>sentence_1</code>, and <code>label</code>
+ * Approximate statistics based on the first 1000 samples:
+   |         | sentence_0 | sentence_1 | label |
+   |:--------|:-----------|:-----------|:------|
+   | type    | string | string | float |
+   | details | <ul><li>min: 5 tokens</li><li>mean: 26.25 tokens</li><li>max: 256 tokens</li></ul> | <ul><li>min: 10 tokens</li><li>mean: 35.61 tokens</li><li>max: 94 tokens</li></ul> | <ul><li>min: 1.0</li><li>mean: 1.0</li><li>max: 1.0</li></ul> |
+ * Samples:
+   | sentence_0 | sentence_1 | label |
+   |:-----------|:-----------|:------|
+   | <code>query: God: (A.S. and Dutch God; Dan. Gud; Ger. Gott), the name of the Divine Being. It is the rendering (1) of the Hebrew <i> 'El</i> , from a word meaning to be strong; (2) of <i> 'Eloah_, plural _'Elohim</i> . The singular form, <i> Eloah</i> , is used only in poetry. The plural form is more commonly used in all parts of the Bible, The Hebrew word Jehovah (q.v.), the only other word generally employed to denote the Supreme Being, is uniformly rendered in the Authorized Version by "LORD," printed in small capitals. The existence of God is taken for granted in the Bible. There is nowhere any argument to prove it. He who disbelieves this truth is spoken of as one devoid of understanding ( Psalms 14:1 ). The arguments generally adduced by theologians in proof of the being of God are: <li> The a priori argument, which is the testimony afforded by reason. <li> The a posteriori argument, by which we proceed logically from the facts of experience to causes. These arguments are, ...</code> | <code>passage: But the Lord forbid that I should lay a hand on the Lord’s anointed. Now get the spear and water jug that are near his head, and let’s go.”</code> | <code>1.0</code> |
+   | <code>query: Predestination meaning</code> | <code>passage: From one man he made all the nations, that they should inhabit the whole earth; and he marked out their appointed times in history and the boundaries of their lands.</code> | <code>1.0</code> |
+   | <code>query: Noah and God</code> | <code>passage: And Noah did all that the Lord commanded him.</code> | <code>1.0</code> |
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
+   ```json
+   {
+       "scale": 20.0,
+       "similarity_fct": "cos_sim",
+       "gather_across_devices": false
+   }
+   ```
+
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+
+ - `per_device_train_batch_size`: 32
+ - `per_device_eval_batch_size`: 32
+ - `num_train_epochs`: 1
+ - `max_steps`: 200
+ - `multi_dataset_batch_sampler`: round_robin
+
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: no
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 32
+ - `per_device_eval_batch_size`: 32
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 5e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1
+ - `num_train_epochs`: 1
+ - `max_steps`: 200
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: None
+ - `warmup_ratio`: 0.0
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `bf16`: False
+ - `fp16`: False
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `parallelism_config`: None
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch_fused
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `project`: huggingface
+ - `trackio_space_id`: trackio
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: None
+ - `hub_always_push`: False
+ - `hub_revision`: None
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `include_for_metrics`: []
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`: 
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: no
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `use_liger_kernel`: False
+ - `liger_kernel_config`: None
+ - `eval_use_gather_object`: False
+ - `average_tokens_across_devices`: True
+ - `prompts`: None
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: round_robin
+ - `router_mapping`: {}
+ - `learning_rate_mapping`: {}
+
+ </details>
+
+ ### Framework Versions
+ - Python: 3.11.14
+ - Sentence Transformers: 5.2.0
+ - Transformers: 4.57.6
+ - PyTorch: 2.10.0+cpu
+ - Accelerate: 1.12.0
+ - Datasets: 4.5.0
+ - Tokenizers: 0.22.2
+
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ #### MultipleNegativesRankingLoss
+ ```bibtex
+ @misc{henderson2017efficient,
+     title={Efficient Natural Language Response Suggestion for Smart Reply},
+     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+     year={2017},
+     eprint={1705.00652},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
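
The card above documents fine-tuning with `MultipleNegativesRankingLoss` (scale 20, cosine similarity, in-batch negatives): within each batch of 32 (query, passage) pairs, a query's paired passage is its positive and the other 31 passages act as negatives. A minimal sketch of an equivalent setup with the Sentence Transformers trainer; the two-row dataset is a toy stand-in for the real 262,023 pairs:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

# Toy stand-in for the 262,023 (query, passage) training pairs listed in the card.
train_dataset = Dataset.from_dict({
    "sentence_0": ["query: Noah and God", "query: Predestination meaning"],
    "sentence_1": [
        "passage: And Noah did all that the Lord commanded him.",
        "passage: From one man he made all the nations, that they should inhabit the whole earth; ...",
    ],
})

model = SentenceTransformer("intfloat/e5-base-v2")
# In-batch negatives: every non-paired passage in the batch serves as a negative.
loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```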
config.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "dtype": "float32",
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "transformers_version": "4.57.6",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
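
This is a standard BERT-base encoder: 12 layers, 12 heads, hidden size 768, roughly 109M parameters, which lines up with the ~438 MB float32 `model.safetensors` below. A quick sanity check of the architecture from this config; the repo id is assumed from the commit message, not confirmed by the diff:

```python
from transformers import AutoConfig, AutoModel

cfg = AutoConfig.from_pretrained("dpshade22/hf-e5-bible-200")  # repo id assumed
model = AutoModel.from_config(cfg)  # random init; real weights come via from_pretrained
print(cfg.num_hidden_layers, cfg.hidden_size)        # 12 768
print(sum(p.numel() for p in model.parameters()))    # ~109M parameters
```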
config_sentence_transformers.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "model_type": "SentenceTransformer",
+   "__version__": {
+     "sentence_transformers": "5.2.0",
+     "transformers": "4.57.6",
+     "pytorch": "2.10.0+cpu"
+   },
+   "prompts": {
+     "query": "",
+     "document": ""
+   },
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
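
Note that `prompts` maps both `query` and `document` to empty strings, so the E5-style prefixes are not prepended automatically; they must be written into the input text, exactly as the model card's examples do. A minimal sketch (repo id assumed from the commit message):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("dpshade22/hf-e5-bible-200")  # repo id assumed

# Prefixes go in the text itself, since the configured prompts are empty strings.
q_emb = model.encode(["query: who is Ephraim"])
p_emb = model.encode(["passage: But a man of God came to him and said, ..."])
print(model.similarity(q_emb, p_emb))  # cosine scores, per similarity_fn_name above
```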
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ac4248984f432256f5ea91eaa38a0ad8254ead22a53eb97a75052934504e5b58
+ size 437951328
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
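
`modules.json` wires up the three-stage pipeline (Transformer → Pooling → Normalize) that `SentenceTransformer(...)` assembles automatically when loading this repo. For illustration, a hand-built equivalent of that pipeline, starting from the base model:

```python
from sentence_transformers import SentenceTransformer, models

# Manual equivalent of the three modules listed in modules.json.
word = models.Transformer("intfloat/e5-base-v2", max_seq_length=256)
pool = models.Pooling(word.get_word_embedding_dimension(), pooling_mode="mean")
norm = models.Normalize()  # unit-length vectors, so dot product == cosine
model = SentenceTransformer(modules=[word, pool, norm])
```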
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 256,
+   "do_lower_case": false
+ }
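
This file caps inputs at 256 tokens; anything longer is silently truncated before pooling, which matters for long passages. The limit is exposed at runtime (repo id again assumed from the commit message):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("dpshade22/hf-e5-bible-200")  # repo id assumed
print(model.max_seq_length)  # 256; longer inputs are truncated, not split
```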
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,56 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
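
The tokenizer is the standard lowercasing BERT WordPiece tokenizer (`do_lower_case: true` here; the `do_lower_case: false` in `sentence_bert_config.json` is a Sentence Transformers-level flag, and the Hugging Face tokenizer still lowercases). A quick check, assuming the uploaded tokenizer is unchanged from the `intfloat/e5-base-v2` base:

```python
from transformers import AutoTokenizer

# Assumes the uploaded tokenizer matches the e5-base-v2 base model's tokenizer.
tok = AutoTokenizer.from_pretrained("intfloat/e5-base-v2")
ids = tok("query: Noah and God")["input_ids"]
print(tok.convert_ids_to_tokens(ids))
# e.g. ['[CLS]', 'query', ':', 'noah', 'and', 'god', '[SEP]'] — note the lowercasing
```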
vocab.txt ADDED
The diff for this file is too large to render. See raw diff