dpshade22 committed (verified)
Commit fba69df · 1 Parent(s): bb9593d

Upload hf-e5-bible-25 embedding model

1_Pooling/config.json ADDED
```json
{
  "word_embedding_dimension": 768,
  "pooling_mode_cls_token": false,
  "pooling_mode_mean_tokens": true,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false,
  "pooling_mode_weightedmean_tokens": false,
  "pooling_mode_lasttoken": false,
  "include_prompt": true
}
```
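This config selects masked mean pooling (`pooling_mode_mean_tokens: true`) over the 768-dimensional token embeddings. As a rough sketch of what this module computes with plain `transformers` — shown against the base `intfloat/e5-base-v2` weights, since loading this repository's fine-tuned weights works the same way:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Base model id used for illustration; this repo's weights load the same way.
model_id = "intfloat/e5-base-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

texts = ["query: Law meaning"]
batch = tokenizer(texts, padding=True, truncation=True, max_length=256, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**batch).last_hidden_state  # (batch, seq_len, 768)

# pooling_mode_mean_tokens: average the token embeddings, ignoring padding.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# The full pipeline also applies a Normalize() module (see modules.json below).
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings.shape)  # torch.Size([1, 768])
```
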
README.md ADDED
---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- generated_from_trainer
- dataset_size:262023
- loss:MultipleNegativesRankingLoss
base_model: intfloat/e5-base-v2
widget:
- source_sentence: "query: A discerning person keeps wisdom in view,\n but a fool’s\
    \ eyes wander to the ends of the earth."
  sentences:
  - "passage: A foolish son brings grief to his father\n and bitterness to the\
    \ mother who bore him."
  - 'passage: But whoever lives by the truth comes into the light, so that it may
    be seen plainly that what they have done has been done in the sight of God.'
  - 'passage: In the past, while Saul was king over us, you were the one who led Israel
    on their military campaigns. And the Lord said to you, ‘You will shepherd my people
    Israel, and you will become their ruler.’”'
- source_sentence: 'query: Who was Joanna in the Bible?'
  sentences:
  - 'passage: Joanna the wife of Chuza, the manager of Herod’s household; Susanna;
    and many others. These women were helping to support them out of their own means.'
  - 'passage: Meanwhile, Horam king of Gezer had come up to help Lachish, but Joshua
    defeated him and his army—until no survivors were left.'
  - 'passage: As they were going out, they met a man from Cyrene, named Simon, and
    they forced him to carry the cross.'
- source_sentence: 'query: Girdle meaning'
  sentences:
  - 'passage: But Joseph said, “Far be it from me to do such a thing! Only the man
    who was found to have the cup will become my slave. The rest of you, go back to
    your father in peace.”'
  - "passage: He takes off the shackles put on by kings\n and ties a loincloth\
    \ around their waist."
  - 'passage: In the tent of meeting, outside the curtain that shields the ark of
    the covenant law, Aaron and his sons are to keep the lamps burning before the
    Lord from evening till morning. This is to be a lasting ordinance among the Israelites
    for the generations to come.'
- source_sentence: 'query: The event ''Blind Man Healed'' as recorded in Scripture,
    involving Jesus.'
  sentences:
  - 'passage: Then he said:

    “Praise be to the Lord, the God of Israel, who with his own hand has fulfilled
    what he promised with his own mouth to my father David. For he said,'
  - 'passage: After Terah had lived 70 years, he became the father of Abram, Nahor
    and Haran.'
  - 'passage: Jesus said, “For judgment I have come into this world, so that the blind
    will see and those who see will become blind.”'
- source_sentence: 'query: Law meaning'
  sentences:
  - "passage: “I will record Rahab and Babylon\n among those who acknowledge me—\n\
    Philistia too, and Tyre, along with Cush—\n and will say, ‘This one was born\
    \ in Zion.’”"
  - "passage: Your plunder, O nations, is harvested as by young locusts;\n like\
    \ a swarm of locusts people pounce on it."
  - 'passage: For truly I tell you, until heaven and earth disappear, not the smallest
    letter, not the least stroke of a pen, will by any means disappear from the Law
    until everything is accomplished.'
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---

# SentenceTransformer based on intfloat/e5-base-v2

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [intfloat/e5-base-v2](https://huggingface.co/intfloat/e5-base-v2). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [intfloat/e5-base-v2](https://huggingface.co/intfloat/e5-base-v2) <!-- at revision f52bf8ec8c7124536f0efb74aca902b2995e5bcd -->
- **Maximum Sequence Length:** 256 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
<!-- - **Training Dataset:** Unknown -->
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/huggingface/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub (replace the placeholder with this repo's model id)
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'query: Law meaning',
    'passage: For truly I tell you, until heaven and earth disappear, not the smallest letter, not the least stroke of a pen, will by any means disappear from the Law until everything is accomplished.',
    'passage: “I will record Rahab and Babylon\n among those who acknowledge me—\nPhilistia too, and Tyre, along with Cush—\n and will say, ‘This one was born in Zion.’”',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.7034, 0.5718],
#         [0.7034, 1.0000, 0.6188],
#         [0.5718, 0.6188, 1.0000]])
```
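
The model follows the E5 convention of literal `query: ` / `passage: ` prefixes written directly into the text (the configured prompts in `config_sentence_transformers.json` are empty strings, so no prefix is added automatically). A minimal retrieval sketch — the model id stays the placeholder from above, and the two-passage corpus is made up for illustration:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence_transformers_model_id")  # placeholder id

# Hypothetical corpus; documents must carry the "passage: " prefix.
corpus = [
    "passage: Joanna the wife of Chuza, the manager of Herod’s household; Susanna; and many others.",
    "passage: After Terah had lived 70 years, he became the father of Abram, Nahor and Haran.",
]
corpus_embeddings = model.encode(corpus)

# Queries must carry the "query: " prefix to match the training format.
query_embedding = model.encode(["query: Who was Joanna in the Bible?"])

# Cosine similarity; embeddings are already L2-normalized by the Normalize module.
scores = model.similarity(query_embedding, corpus_embeddings)
best = scores.argmax().item()
print(corpus[best])
```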

<!--
### Direct Usage (Transformers)

<details><summary>Click to see the direct usage in Transformers</summary>

</details>
-->

<!--
### Downstream Usage (Sentence Transformers)

You can finetune this model on your own dataset.

<details><summary>Click to expand</summary>

</details>
-->

<!--
### Out-of-Scope Use

*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->

<!--
## Bias, Risks and Limitations

*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->

<!--
### Recommendations

*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->

## Training Details

### Training Dataset

#### Unnamed Dataset

* Size: 262,023 training samples
* Columns: <code>sentence_0</code>, <code>sentence_1</code>, and <code>label</code>
* Approximate statistics based on the first 1000 samples:
  |         | sentence_0 | sentence_1 | label |
  |:--------|:-----------|:-----------|:------|
  | type    | string     | string     | float |
  | details | <ul><li>min: 5 tokens</li><li>mean: 29.07 tokens</li><li>max: 256 tokens</li></ul> | <ul><li>min: 8 tokens</li><li>mean: 34.62 tokens</li><li>max: 94 tokens</li></ul> | <ul><li>min: 1.0</li><li>mean: 1.0</li><li>max: 1.0</li></ul> |
* Samples:
  | sentence_0 | sentence_1 | label |
  |:-----------|:-----------|:------|
  | <code>query: Messiah: (Heb. mashiah), in all the thirty-nine instances of its occurring in the Old Testament, is rendered by the LXX. “Christos.” It means anointed. Thus priests (Ex. 28:41; 40:15; Num. 3:3), prophets (1 Kings 19:16), and kings (1 Sam. 9:16; 16:3; 2 Sam. 12:7) were anointed with oil, and so consecrated to their respective offices. The great Messiah is anointed “above his fellows” (Ps. 45:7); i.e., he embraces in himself all the three offices.</code> | <code>passage: Anoint them just as you anointed their father, so they may serve me as priests. Their anointing will be to a priesthood that will continue throughout their generations.”</code> | <code>1.0</code> |
  | <code>query: who was Toi</code> | <code>passage: he sent his son Joram to King David to greet him and congratulate him on his victory in battle over Hadadezer, who had been at war with Tou. Joram brought with him articles of silver, of gold and of bronze.</code> | <code>1.0</code> |
  | <code>query: God</code> | <code>passage: Bring the grain offering made of these things to the Lord; present it to the priest, who shall take it to the altar.</code> | <code>1.0</code> |
* Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim",
      "gather_across_devices": false
  }
  ```
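
These parameters correspond to constructing the loss roughly as follows (a sketch, not the repository's actual training script):

```python
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.util import cos_sim

model = SentenceTransformer("intfloat/e5-base-v2")

# scale=20.0 multiplies the cosine similarities before the cross-entropy over
# in-batch negatives; cos_sim matches "similarity_fct": "cos_sim".
# "gather_across_devices": false is the default for single-process training.
loss = losses.MultipleNegativesRankingLoss(model, scale=20.0, similarity_fct=cos_sim)
```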

### Training Hyperparameters
#### Non-Default Hyperparameters

- `per_device_train_batch_size`: 32
- `per_device_eval_batch_size`: 32
- `num_train_epochs`: 1
- `max_steps`: 25
- `multi_dataset_batch_sampler`: round_robin
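
A sketch of how these non-default values could map onto a `SentenceTransformerTrainer` run; the one-row dataset and `output_dir` below are illustrative stand-ins, not the actual training data or script:

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)

model = SentenceTransformer("intfloat/e5-base-v2")

# One-row stand-in for the 262,023 (query, passage) training pairs.
train_dataset = Dataset.from_dict({
    "sentence_0": ["query: Law meaning"],
    "sentence_1": ["passage: For truly I tell you, until heaven and earth disappear, ..."],
})

args = SentenceTransformerTrainingArguments(
    output_dir="hf-e5-bible-25",  # illustrative
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=1,
    max_steps=25,
    multi_dataset_batch_sampler="round_robin",
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=losses.MultipleNegativesRankingLoss(model, scale=20.0),
)
trainer.train()
```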

#### All Hyperparameters
<details><summary>Click to expand</summary>

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: no
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 32
- `per_device_eval_batch_size`: 32
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 5e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1
- `num_train_epochs`: 1
- `max_steps`: 25
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: None
- `warmup_ratio`: 0.0
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `bf16`: False
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `parallelism_config`: None
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch_fused
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `project`: huggingface
- `trackio_space_id`: trackio
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `hub_revision`: None
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`: 
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: no
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `liger_kernel_config`: None
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: True
- `prompts`: None
- `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: round_robin
- `router_mapping`: {}
- `learning_rate_mapping`: {}

</details>

### Framework Versions
- Python: 3.11.14
- Sentence Transformers: 5.2.0
- Transformers: 4.57.6
- PyTorch: 2.10.0+cpu
- Accelerate: 1.12.0
- Datasets: 4.5.0
- Tokenizers: 0.22.2

## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```

<!--
## Glossary

*Clearly define terms in order to be accessible across audiences.*
-->

<!--
## Model Card Authors

*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
-->

<!--
## Model Card Contact

*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
-->

config.json ADDED
```json
{
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "dtype": "float32",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.57.6",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
```

config_sentence_transformers.json ADDED
```json
{
  "model_type": "SentenceTransformer",
  "__version__": {
    "sentence_transformers": "5.2.0",
    "transformers": "4.57.6",
    "pytorch": "2.10.0+cpu"
  },
  "prompts": {
    "query": "",
    "document": ""
  },
  "default_prompt_name": null,
  "similarity_fn_name": "cosine"
}
```

model.safetensors ADDED
```
version https://git-lfs.github.com/spec/v1
oid sha256:820d6deece2017988d040f784efad064aaac7939d381a17d439e4376cb5a5875
size 437951328
```
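
A quick sanity check on the checkpoint size, assuming (per `config.json`'s `"dtype": "float32"`) four bytes per weight:

```python
# 437,951,328 bytes / 4 bytes per float32 ≈ 109.5M parameters, consistent
# with a BERT-base-sized encoder (plus a small safetensors header overhead).
print(437_951_328 / 4)  # 109487832.0
```
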
modules.json ADDED
```json
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  },
  {
    "idx": 2,
    "name": "2",
    "path": "2_Normalize",
    "type": "sentence_transformers.models.Normalize"
  }
]
```
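
`SentenceTransformer(...)` reassembles this Transformer → Pooling → Normalize pipeline at load time. A sketch of composing the same three modules by hand, again using the base model id since this card leaves the repo id as a placeholder:

```python
from sentence_transformers import SentenceTransformer, models

# Rebuild the three-module pipeline described in modules.json.
transformer = models.Transformer("intfloat/e5-base-v2", max_seq_length=256)
pooling = models.Pooling(
    transformer.get_word_embedding_dimension(),  # 768
    pooling_mode="mean",                         # matches 1_Pooling/config.json
)
normalize = models.Normalize()

model = SentenceTransformer(modules=[transformer, pooling, normalize])
print(model)
```
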
sentence_bert_config.json ADDED
```json
{
  "max_seq_length": 256,
  "do_lower_case": false
}
```

special_tokens_map.json ADDED
```json
{
  "cls_token": {
    "content": "[CLS]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "[MASK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "[PAD]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "[SEP]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "[UNK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
```

tokenizer.json ADDED
The diff for this file is too large to render.
 
tokenizer_config.json ADDED
```json
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_lower_case": true,
  "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
```

vocab.txt ADDED
The diff for this file is too large to render.