potsu-potsu committed on
Commit
29c9b3a
·
verified ·
1 Parent(s): 04cabc3

Add new SentenceTransformer model

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
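The only pooling mode enabled in this config is `pooling_mode_mean_tokens`, so the sentence embedding is the average of the token embeddings. A minimal pure-Python sketch of that idea follows; the function name `mean_pool` and the toy vectors are illustrative assumptions, not the library's implementation, which works on batched tensors.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the embeddings of the tokens whose mask value is 1."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    # Guard against an all-padding input to avoid division by zero.
    count = max(count, 1)
    return [value / count for value in total]


# Three toy token vectors; the last one is padding and is ignored.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)
```

Masking matters here: without it, padding tokens would pull the average toward values unrelated to the input text.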
README.md ADDED
@@ -0,0 +1,880 @@
1
+ ---
2
+ language:
3
+ - en
4
+ license: apache-2.0
5
+ tags:
6
+ - sentence-transformers
7
+ - sentence-similarity
8
+ - feature-extraction
9
+ - generated_from_trainer
10
+ - dataset_size:4012
11
+ - loss:MatryoshkaLoss
12
+ - loss:MultipleNegativesRankingLoss
13
+ widget:
14
+ - source_sentence: Do cephalopods use RNA editing less frequently than other species?
15
+ sentences:
16
+ - 'Extensive messenger RNA editing generates transcript and protein diversity in
17
+ genes involved in neural excitability, as previously described, as well as in
18
+ genes participating in a broad range of other cellular functions. '
19
+ - GV1001 is a 16-amino-acid vaccine peptide derived from the human telomerase reverse
20
+ transcriptase sequence. It has been developed as a vaccine against various cancers.
21
+ - Using acetyl-specific K516 antibodies, we show that acetylation of endogenous
22
+ S6K1 at this site is potently induced upon growth factor stimulation. We propose
23
+ that K516 acetylation may serve to modulate important kinase-independent functions
24
+ of S6K1 in response to growth factor signalling. Following mitogen stimulation,
25
+ S6Ks interact with the p300 and p300/CBP-associated factor (PCAF) acetyltransferases.
26
+ S6Ks can be acetylated by p300 and PCAF in vitro and S6K acetylation is detected
27
+ in cells expressing p300
28
+ - source_sentence: Can pets affect infant microbiomed?
29
+ sentences:
30
+ - Yes, exposure to household furry pets influences the gut microbiota of infants.
31
+ - Thiazovivin is a selective small molecule that directly targets Rho-associated
32
+ kinase (ROCK) and increases expression of pluripotency factors.
33
+ - ' Here, we present evidence that the calcium/calmodulin-dependent protein kinase
34
+ IV (CaMK4) is increased and required during Th17 cell differentiation. Inhibition
35
+ of CaMK4 reduced Il17 transcription through decreased activation of the cAMP response
36
+ element modulator a (CREM-a) and reduced activation of the AKT/mTOR pathway, which
37
+ is known to enhance Th17 differentiation. CAMK4 knockdown and kinase-dead mutant
38
+ inhibited crocin-mediated HO-1 expression, Nrf2 activation, and phosphorylation
39
+ of Akt, indicating that HO-1 expression is mediated by CAMK4 and that Akt is a
40
+ downstream mediator of CAMK4 in crocin signaling'
41
+ - source_sentence: In what proportion of children with heart failure has Enalapril
42
+ been shown to be safe and effective?
43
+ sentences:
44
+ - 5-HT2A (5-hydroxytryptamine type 2a) receptor can be evaluated with the [18F]altanserin.
45
+ - "In children with heart failure evidence of the effect of enalapril is empirical.\
46
+ \ Enalapril was clinically safe and effective in 50% to 80% of for children with\
47
+ \ cardiac failure secondary to congenital heart malformations before and after\
48
+ \ cardiac surgery, impaired ventricular function , valvar regurgitation, congestive\
49
+ \ cardiomyopathy, , arterial hypertension, life-threatening arrhythmias coexisting\
50
+ \ with circulatory insufficiency. \nACE inhibitors have shown a transient beneficial\
51
+ \ effect on heart failure due to anticancer drugs and possibly a beneficial effect\
52
+ \ in muscular dystrophy-associated cardiomyopathy, which deserves further studies."
53
+ - "necroptosis\napoptosis \npro-survival/inflammation NF-κB activation"
54
+ - source_sentence: How are SAHFS created?
55
+ sentences:
56
+ - In particular, up to 17% of neutrophil nuclei of healthy women exhibit a drumstick-shaped
57
+ appendage that contains the inactive X chromosome.
58
+ - miR-1, miR-133, miR-208a, miR-206, miR-494, miR-146a, miR-222, miR-21, miR-221,
59
+ miR-20a, miR-133a, miR-133b, miR-23, miR-107 and miR-181 are involved in exercise
60
+ adaptation
61
+ - Cellular senescence-associated heterochromatic foci (SAHFS) are a novel type of
62
+ chromatin condensation involving alterations of linker histone H1 and linker DNA-binding
63
+ proteins. SAHFS can be formed by a variety of cell types, but their mechanism
64
+ of action remains unclear.
65
+ - source_sentence: What are the effects of the deletion of all three Pcdh clusters
66
+ (tricluster deletion) in mice?
67
+ sentences:
68
+ - Multicluster Pcdh diversity is required for mouse olfactory neural circuit assembly.
69
+ The vertebrate clustered protocadherin (Pcdh) cell surface proteins are encoded
70
+ by three closely linked gene clusters (Pcdhα, Pcdhβ, and Pcdhγ). Although deletion
71
+ of individual Pcdh clusters had subtle phenotypic consequences, the loss of all
72
+ three clusters (tricluster deletion) led to a severe axonal arborization defect
73
+ and loss of self-avoidance.
74
+ - The myocyte enhancer factor-2 (MEF2) proteins are MADS-box transcription factors
75
+ that are essential for differentiation of all muscle lineages but their mechanisms
76
+ of action remain largely undefined. MEF2C expression initiates cardiomyogenesis,
77
+ resulting in the up-regulation of Brachyury T, bone morphogenetic protein-4, Nkx2-5,
78
+ GATA-4, cardiac alpha-actin, and myosin heavy chain expression. Inactivation of
79
+ the MEF2C gene causes cardiac developmental arrest and severe downregulation of
80
+ a number of cardiac markers including atrial natriuretic factor (ANF). BMP-2,
81
+ a regulator of cardiac development during embryogenesis, was shown to increase
82
+ PI 3-kinase activity in cardiac precursor cells, resulting in increased expression
83
+ of sarcomeric myosin heavy chain (MHC) and MEF-2A. Furthermore, expression of
84
+ MEF-2A increased MHC expression in a PI 3-kinase-dependent manner. Other studies
85
+ showed that Gli2 and MEF2C proteins form a complex, capable of synergizing on
86
+ cardiomyogenesis-related promoters. Dominant interference of calcineurin/mAKAP
87
+ binding blunts the increase in MEF2 transcriptional activity seen during myoblast
88
+ differentiation, as well as the expression of endogenous MEF2-target genes. These
89
+ findings show that MEF-2 can direct early stages of cell differentiation into
90
+ a cardiomyogenic pathway.
91
+ - Investigators proposed that there have been three extended periods in the evolution
92
+ of gene regulatory elements. Early vertebrate evolution was characterized by regulatory
93
+ gains near transcription factors and developmental genes, but this trend was replaced
94
+ by innovations near extracellular signaling genes, and then innovations near posttranslational
95
+ protein modifiers.
96
+ pipeline_tag: sentence-similarity
97
+ library_name: sentence-transformers
98
+ metrics:
99
+ - cosine_accuracy@1
100
+ - cosine_accuracy@3
101
+ - cosine_accuracy@5
102
+ - cosine_accuracy@10
103
+ - cosine_precision@1
104
+ - cosine_precision@3
105
+ - cosine_precision@5
106
+ - cosine_precision@10
107
+ - cosine_recall@1
108
+ - cosine_recall@3
109
+ - cosine_recall@5
110
+ - cosine_recall@10
111
+ - cosine_ndcg@10
112
+ - cosine_mrr@10
113
+ - cosine_map@100
114
+ model-index:
115
+ - name: Biomedical MRL
116
+ results:
117
+ - task:
118
+ type: information-retrieval
119
+ name: Information Retrieval
120
+ dataset:
121
+ name: dim 768
122
+ type: dim_768
123
+ metrics:
124
+ - type: cosine_accuracy@1
125
+ value: 0.8062234794908062
126
+ name: Cosine Accuracy@1
127
+ - type: cosine_accuracy@3
128
+ value: 0.9292786421499293
129
+ name: Cosine Accuracy@3
130
+ - type: cosine_accuracy@5
131
+ value: 0.9533239038189534
132
+ name: Cosine Accuracy@5
133
+ - type: cosine_accuracy@10
134
+ value: 0.9660537482319661
135
+ name: Cosine Accuracy@10
136
+ - type: cosine_precision@1
137
+ value: 0.8062234794908062
138
+ name: Cosine Precision@1
139
+ - type: cosine_precision@3
140
+ value: 0.30975954738330974
141
+ name: Cosine Precision@3
142
+ - type: cosine_precision@5
143
+ value: 0.1906647807637906
144
+ name: Cosine Precision@5
145
+ - type: cosine_precision@10
146
+ value: 0.09660537482319659
147
+ name: Cosine Precision@10
148
+ - type: cosine_recall@1
149
+ value: 0.8062234794908062
150
+ name: Cosine Recall@1
151
+ - type: cosine_recall@3
152
+ value: 0.9292786421499293
153
+ name: Cosine Recall@3
154
+ - type: cosine_recall@5
155
+ value: 0.9533239038189534
156
+ name: Cosine Recall@5
157
+ - type: cosine_recall@10
158
+ value: 0.9660537482319661
159
+ name: Cosine Recall@10
160
+ - type: cosine_ndcg@10
161
+ value: 0.8940734682586426
162
+ name: Cosine Ndcg@10
163
+ - type: cosine_mrr@10
164
+ value: 0.8700764464201525
165
+ name: Cosine Mrr@10
166
+ - type: cosine_map@100
167
+ value: 0.8709063298425341
168
+ name: Cosine Map@100
169
+ - task:
170
+ type: information-retrieval
171
+ name: Information Retrieval
172
+ dataset:
173
+ name: dim 512
174
+ type: dim_512
175
+ metrics:
176
+ - type: cosine_accuracy@1
177
+ value: 0.809052333804809
178
+ name: Cosine Accuracy@1
179
+ - type: cosine_accuracy@3
180
+ value: 0.9292786421499293
181
+ name: Cosine Accuracy@3
182
+ - type: cosine_accuracy@5
183
+ value: 0.9519094766619519
184
+ name: Cosine Accuracy@5
185
+ - type: cosine_accuracy@10
186
+ value: 0.9660537482319661
187
+ name: Cosine Accuracy@10
188
+ - type: cosine_precision@1
189
+ value: 0.809052333804809
190
+ name: Cosine Precision@1
191
+ - type: cosine_precision@3
192
+ value: 0.30975954738330974
193
+ name: Cosine Precision@3
194
+ - type: cosine_precision@5
195
+ value: 0.19038189533239033
196
+ name: Cosine Precision@5
197
+ - type: cosine_precision@10
198
+ value: 0.09660537482319659
199
+ name: Cosine Precision@10
200
+ - type: cosine_recall@1
201
+ value: 0.809052333804809
202
+ name: Cosine Recall@1
203
+ - type: cosine_recall@3
204
+ value: 0.9292786421499293
205
+ name: Cosine Recall@3
206
+ - type: cosine_recall@5
207
+ value: 0.9519094766619519
208
+ name: Cosine Recall@5
209
+ - type: cosine_recall@10
210
+ value: 0.9660537482319661
211
+ name: Cosine Recall@10
212
+ - type: cosine_ndcg@10
213
+ value: 0.8941424934364455
214
+ name: Cosine Ndcg@10
215
+ - type: cosine_mrr@10
216
+ value: 0.8702515659729241
217
+ name: Cosine Mrr@10
218
+ - type: cosine_map@100
219
+ value: 0.8710899601617035
220
+ name: Cosine Map@100
221
+ - task:
222
+ type: information-retrieval
223
+ name: Information Retrieval
224
+ dataset:
225
+ name: dim 256
226
+ type: dim_256
227
+ metrics:
228
+ - type: cosine_accuracy@1
229
+ value: 0.801980198019802
230
+ name: Cosine Accuracy@1
231
+ - type: cosine_accuracy@3
232
+ value: 0.9207920792079208
233
+ name: Cosine Accuracy@3
234
+ - type: cosine_accuracy@5
235
+ value: 0.9519094766619519
236
+ name: Cosine Accuracy@5
237
+ - type: cosine_accuracy@10
238
+ value: 0.9632248939179632
239
+ name: Cosine Accuracy@10
240
+ - type: cosine_precision@1
241
+ value: 0.801980198019802
242
+ name: Cosine Precision@1
243
+ - type: cosine_precision@3
244
+ value: 0.3069306930693069
245
+ name: Cosine Precision@3
246
+ - type: cosine_precision@5
247
+ value: 0.19038189533239033
248
+ name: Cosine Precision@5
249
+ - type: cosine_precision@10
250
+ value: 0.09632248939179631
251
+ name: Cosine Precision@10
252
+ - type: cosine_recall@1
253
+ value: 0.801980198019802
254
+ name: Cosine Recall@1
255
+ - type: cosine_recall@3
256
+ value: 0.9207920792079208
257
+ name: Cosine Recall@3
258
+ - type: cosine_recall@5
259
+ value: 0.9519094766619519
260
+ name: Cosine Recall@5
261
+ - type: cosine_recall@10
262
+ value: 0.9632248939179632
263
+ name: Cosine Recall@10
264
+ - type: cosine_ndcg@10
265
+ value: 0.8888633341416707
266
+ name: Cosine Ndcg@10
267
+ - type: cosine_mrr@10
268
+ value: 0.8641695291978178
269
+ name: Cosine Mrr@10
270
+ - type: cosine_map@100
271
+ value: 0.8651249924605939
272
+ name: Cosine Map@100
273
+ - task:
274
+ type: information-retrieval
275
+ name: Information Retrieval
276
+ dataset:
277
+ name: dim 128
278
+ type: dim_128
279
+ metrics:
280
+ - type: cosine_accuracy@1
281
+ value: 0.7878359264497878
282
+ name: Cosine Accuracy@1
283
+ - type: cosine_accuracy@3
284
+ value: 0.9123055162659123
285
+ name: Cosine Accuracy@3
286
+ - type: cosine_accuracy@5
287
+ value: 0.9405940594059405
288
+ name: Cosine Accuracy@5
289
+ - type: cosine_accuracy@10
290
+ value: 0.9575671852899575
291
+ name: Cosine Accuracy@10
292
+ - type: cosine_precision@1
293
+ value: 0.7878359264497878
294
+ name: Cosine Precision@1
295
+ - type: cosine_precision@3
296
+ value: 0.3041018387553041
297
+ name: Cosine Precision@3
298
+ - type: cosine_precision@5
299
+ value: 0.1881188118811881
300
+ name: Cosine Precision@5
301
+ - type: cosine_precision@10
302
+ value: 0.09575671852899574
303
+ name: Cosine Precision@10
304
+ - type: cosine_recall@1
305
+ value: 0.7878359264497878
306
+ name: Cosine Recall@1
307
+ - type: cosine_recall@3
308
+ value: 0.9123055162659123
309
+ name: Cosine Recall@3
310
+ - type: cosine_recall@5
311
+ value: 0.9405940594059405
312
+ name: Cosine Recall@5
313
+ - type: cosine_recall@10
314
+ value: 0.9575671852899575
315
+ name: Cosine Recall@10
316
+ - type: cosine_ndcg@10
317
+ value: 0.8776845224261977
318
+ name: Cosine Ndcg@10
319
+ - type: cosine_mrr@10
320
+ value: 0.8513476347634764
321
+ name: Cosine Mrr@10
322
+ - type: cosine_map@100
323
+ value: 0.8524929782022361
324
+ name: Cosine Map@100
325
+ - task:
326
+ type: information-retrieval
327
+ name: Information Retrieval
328
+ dataset:
329
+ name: dim 64
330
+ type: dim_64
331
+ metrics:
332
+ - type: cosine_accuracy@1
333
+ value: 0.7609618104667609
334
+ name: Cosine Accuracy@1
335
+ - type: cosine_accuracy@3
336
+ value: 0.884016973125884
337
+ name: Cosine Accuracy@3
338
+ - type: cosine_accuracy@5
339
+ value: 0.9123055162659123
340
+ name: Cosine Accuracy@5
341
+ - type: cosine_accuracy@10
342
+ value: 0.9377652050919377
343
+ name: Cosine Accuracy@10
344
+ - type: cosine_precision@1
345
+ value: 0.7609618104667609
346
+ name: Cosine Precision@1
347
+ - type: cosine_precision@3
348
+ value: 0.29467232437529467
349
+ name: Cosine Precision@3
350
+ - type: cosine_precision@5
351
+ value: 0.18246110325318246
352
+ name: Cosine Precision@5
353
+ - type: cosine_precision@10
354
+ value: 0.09377652050919376
355
+ name: Cosine Precision@10
356
+ - type: cosine_recall@1
357
+ value: 0.7609618104667609
358
+ name: Cosine Recall@1
359
+ - type: cosine_recall@3
360
+ value: 0.884016973125884
361
+ name: Cosine Recall@3
362
+ - type: cosine_recall@5
363
+ value: 0.9123055162659123
364
+ name: Cosine Recall@5
365
+ - type: cosine_recall@10
366
+ value: 0.9377652050919377
367
+ name: Cosine Recall@10
368
+ - type: cosine_ndcg@10
369
+ value: 0.8544495237634239
370
+ name: Cosine Ndcg@10
371
+ - type: cosine_mrr@10
372
+ value: 0.8271598078175169
373
+ name: Cosine Mrr@10
374
+ - type: cosine_map@100
375
+ value: 0.8287981789570508
376
+ name: Cosine Map@100
377
+ ---
378
+
379
+ # Biomedical MRL
380
+
381
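+ This is a [sentence-transformers](https://www.SBERT.net) model trained on the `json` dataset (4,012 anchor/positive pairs). It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more. Because it was trained with `MatryoshkaLoss`, it was also evaluated with embeddings truncated to 512, 256, 128, and 64 dimensions.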
382
+
383
+ ## Model Details
384
+
385
+ ### Model Description
386
+ - **Model Type:** Sentence Transformer
387
+ <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
388
+ - **Maximum Sequence Length:** 512 tokens
389
+ - **Output Dimensionality:** 768 dimensions
390
+ - **Similarity Function:** Cosine Similarity
391
+ - **Training Dataset:**
392
+ - json
393
+ - **Language:** en
394
+ - **License:** apache-2.0
395
+
396
+ ### Model Sources
397
+
398
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
399
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
400
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
401
+
402
+ ### Full Model Architecture
403
+
404
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
+   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+ )
+ ```
410
+
411
+ ## Usage
412
+
413
+ ### Direct Usage (Sentence Transformers)
414
+
415
+ First install the Sentence Transformers library:
416
+
417
+ ```bash
+ pip install -U sentence-transformers
+ ```
+
+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("potsu-potsu/pubmedbert-base-mrl")
+ # Run inference
+ sentences = [
+     'What are the effects of the deletion of all three Pcdh clusters (tricluster deletion) in mice?',
+     'Multicluster Pcdh diversity is required for mouse olfactory neural circuit assembly. The vertebrate clustered protocadherin (Pcdh) cell surface proteins are encoded by three closely linked gene clusters (Pcdhα, Pcdhβ, and Pcdhγ). Although deletion of individual Pcdh clusters had subtle phenotypic consequences, the loss of all three clusters (tricluster deletion) led to a severe axonal arborization defect and loss of self-avoidance.',
+     'Investigators proposed that there have been three extended periods in the evolution of gene regulatory elements. Early vertebrate evolution was characterized by regulatory gains near transcription factors and developmental genes, but this trend was replaced by innovations near extracellular signaling genes, and then innovations near posttranslational protein modifiers.',
+ ]
+ # encode() returns a NumPy array by default: one row per input sentence
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # (3, 768)
+
+ # Get the pairwise similarity scores for the embeddings (a torch tensor)
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities.shape)
+ # torch.Size([3, 3])
+ ```
442
+
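Since the model was trained with `MatryoshkaLoss`, its embeddings are meant to stay useful when only the leading dimensions are kept. The sketch below illustrates the mechanics on toy 8-dimensional vectors: truncate, rescale to unit length, then compare with cosine similarity. The helper names and vectors are assumptions made for illustration. With the library itself, passing `truncate_dim` when constructing `SentenceTransformer` should give the same effect, but check the documentation for your installed version.

```python
import math
from typing import List, Sequence


def truncate_and_normalize(vector: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale the result to unit length."""
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for a query embedding and a passage embedding.
query = [0.9, 0.1, 0.3, 0.2, 0.05, 0.0, 0.1, 0.02]
passage = [0.8, 0.2, 0.25, 0.1, 0.0, 0.05, 0.2, 0.01]

for dim in (8, 4, 2):
    q = truncate_and_normalize(query, dim)
    p = truncate_and_normalize(passage, dim)
    print(dim, round(cosine(q, p), 4))
```

Renormalizing after truncation is only needed if downstream code uses dot products; cosine similarity normalizes on its own.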
443
+ <!--
444
+ ### Direct Usage (Transformers)
445
+
446
+ <details><summary>Click to see the direct usage in Transformers</summary>
447
+
448
+ </details>
449
+ -->
450
+
451
+ <!--
452
+ ### Downstream Usage (Sentence Transformers)
453
+
454
+ You can finetune this model on your own dataset.
455
+
456
+ <details><summary>Click to expand</summary>
457
+
458
+ </details>
459
+ -->
460
+
461
+ <!--
462
+ ### Out-of-Scope Use
463
+
464
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
465
+ -->
466
+
467
+ ## Evaluation
468
+
469
+ ### Metrics
470
+
471
+ #### Information Retrieval
472
+
473
+ * Dataset: `dim_768`
474
+ * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator) with these parameters:
475
+   ```json
+   {
+       "truncate_dim": 768
+   }
+   ```
+
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | cosine_accuracy@1   | 0.8062     |
+ | cosine_accuracy@3   | 0.9293     |
+ | cosine_accuracy@5   | 0.9533     |
+ | cosine_accuracy@10  | 0.9661     |
+ | cosine_precision@1  | 0.8062     |
+ | cosine_precision@3  | 0.3098     |
+ | cosine_precision@5  | 0.1907     |
+ | cosine_precision@10 | 0.0966     |
+ | cosine_recall@1     | 0.8062     |
+ | cosine_recall@3     | 0.9293     |
+ | cosine_recall@5     | 0.9533     |
+ | cosine_recall@10    | 0.9661     |
+ | **cosine_ndcg@10**  | **0.8941** |
+ | cosine_mrr@10       | 0.8701     |
+ | cosine_map@100      | 0.8709     |
498
+
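In the table above, accuracy@k equals recall@k at every cutoff, and precision@k equals recall@k divided by k (for example 0.9293 / 3 ≈ 0.3098). That pattern is what one would expect if each query has exactly one relevant passage. This is an inference from the numbers, not something the card states. The sketch below uses the standard definitions of the three metrics on a single toy query; the document IDs are hypothetical.

```python
from typing import List, Set


def accuracy_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant document appears in the top k, else 0.0."""
    return 1.0 if any(doc_id in relevant for doc_id in ranked[:k]) else 0.0


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled by relevant documents."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


# One query whose single relevant passage is retrieved at rank 3.
ranked_ids = ["d7", "d2", "d9", "d4", "d1"]
relevant_ids = {"d9"}

for k in (1, 3, 5):
    print(
        k,
        accuracy_at_k(ranked_ids, relevant_ids, k),
        round(precision_at_k(ranked_ids, relevant_ids, k), 4),
        recall_at_k(ranked_ids, relevant_ids, k),
    )
```

The reported values are these per-query scores averaged over all evaluation queries.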
499
+ #### Information Retrieval
500
+
501
+ * Dataset: `dim_512`
502
+ * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator) with these parameters:
503
+   ```json
+   {
+       "truncate_dim": 512
+   }
+   ```
+
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | cosine_accuracy@1   | 0.8091     |
+ | cosine_accuracy@3   | 0.9293     |
+ | cosine_accuracy@5   | 0.9519     |
+ | cosine_accuracy@10  | 0.9661     |
+ | cosine_precision@1  | 0.8091     |
+ | cosine_precision@3  | 0.3098     |
+ | cosine_precision@5  | 0.1904     |
+ | cosine_precision@10 | 0.0966     |
+ | cosine_recall@1     | 0.8091     |
+ | cosine_recall@3     | 0.9293     |
+ | cosine_recall@5     | 0.9519     |
+ | cosine_recall@10    | 0.9661     |
+ | **cosine_ndcg@10**  | **0.8941** |
+ | cosine_mrr@10       | 0.8703     |
+ | cosine_map@100      | 0.8711     |
526
+
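The two rank-sensitive metrics in these tables, MRR@10 and NDCG@10, reward placing the relevant passage near the top rather than merely inside the cutoff. The sketch below uses the common binary-relevance definitions, which is an assumption about how the evaluator scores hits. Function names and document IDs are illustrative only.

```python
import math
from typing import List, Set


def reciprocal_rank_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """1 / rank of the first relevant document in the top k, or 0.0 if none."""
    for position, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(position + 1)
        for position, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 1) for position in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# The single relevant passage is retrieved at rank 3.
ranked_ids = ["d7", "d2", "d9", "d4", "d1"]
relevant_ids = {"d9"}

rr = reciprocal_rank_at_k(ranked_ids, relevant_ids, 10)
ndcg = ndcg_at_k(ranked_ids, relevant_ids, 10)
print(rr, ndcg)
```

With a single relevant passage at rank r, the per-query NDCG reduces to 1 / log2(r + 1), so it decays more slowly than the reciprocal rank 1 / r. This is consistent with NDCG@10 sitting above MRR@10 in every table here.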
527
+ #### Information Retrieval
528
+
529
+ * Dataset: `dim_256`
530
+ * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator) with these parameters:
531
+   ```json
+   {
+       "truncate_dim": 256
+   }
+   ```
+
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | cosine_accuracy@1   | 0.802      |
+ | cosine_accuracy@3   | 0.9208     |
+ | cosine_accuracy@5   | 0.9519     |
+ | cosine_accuracy@10  | 0.9632     |
+ | cosine_precision@1  | 0.802      |
+ | cosine_precision@3  | 0.3069     |
+ | cosine_precision@5  | 0.1904     |
+ | cosine_precision@10 | 0.0963     |
+ | cosine_recall@1     | 0.802      |
+ | cosine_recall@3     | 0.9208     |
+ | cosine_recall@5     | 0.9519     |
+ | cosine_recall@10    | 0.9632     |
+ | **cosine_ndcg@10**  | **0.8889** |
+ | cosine_mrr@10       | 0.8642     |
+ | cosine_map@100      | 0.8651     |
554
+
555
+ #### Information Retrieval
556
+
557
+ * Dataset: `dim_128`
558
+ * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator) with these parameters:
559
+   ```json
+   {
+       "truncate_dim": 128
+   }
+   ```
+
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | cosine_accuracy@1   | 0.7878     |
+ | cosine_accuracy@3   | 0.9123     |
+ | cosine_accuracy@5   | 0.9406     |
+ | cosine_accuracy@10  | 0.9576     |
+ | cosine_precision@1  | 0.7878     |
+ | cosine_precision@3  | 0.3041     |
+ | cosine_precision@5  | 0.1881     |
+ | cosine_precision@10 | 0.0958     |
+ | cosine_recall@1     | 0.7878     |
+ | cosine_recall@3     | 0.9123     |
+ | cosine_recall@5     | 0.9406     |
+ | cosine_recall@10    | 0.9576     |
+ | **cosine_ndcg@10**  | **0.8777** |
+ | cosine_mrr@10       | 0.8513     |
+ | cosine_map@100      | 0.8525     |
582
+
583
+ #### Information Retrieval
584
+
585
+ * Dataset: `dim_64`
586
+ * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator) with these parameters:
587
+   ```json
+   {
+       "truncate_dim": 64
+   }
+   ```
+
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | cosine_accuracy@1   | 0.761      |
+ | cosine_accuracy@3   | 0.884      |
+ | cosine_accuracy@5   | 0.9123     |
+ | cosine_accuracy@10  | 0.9378     |
+ | cosine_precision@1  | 0.761      |
+ | cosine_precision@3  | 0.2947     |
+ | cosine_precision@5  | 0.1825     |
+ | cosine_precision@10 | 0.0938     |
+ | cosine_recall@1     | 0.761      |
+ | cosine_recall@3     | 0.884      |
+ | cosine_recall@5     | 0.9123     |
+ | cosine_recall@10    | 0.9378     |
+ | **cosine_ndcg@10**  | **0.8544** |
+ | cosine_mrr@10       | 0.8272     |
+ | cosine_map@100      | 0.8288     |
610
+
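Read together, the five tables show how much retrieval quality survives as the embedding is truncated. The short calculation below uses only the rounded NDCG@10 values reported above and expresses each as a share of the 768-dimensional score; the variable names are illustrative.

```python
# NDCG@10 values copied from the evaluation tables (rounded to 4 decimals).
ndcg_at_10 = {768: 0.8941, 512: 0.8941, 256: 0.8889, 128: 0.8777, 64: 0.8544}

full_dim = 768
retained = {dim: score / ndcg_at_10[full_dim] for dim, score in ndcg_at_10.items()}

for dim, share in retained.items():
    shrink = full_dim / dim
    print(f"dim={dim:>3}  retained={share:.1%}  vector size reduced {shrink:.1f}x")
```

On these figures, the 64-dimensional embeddings keep roughly 95.6% of the full-size NDCG@10 while being 12 times smaller. That matters for storage and search latency in large indexes.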
611
+ <!--
612
+ ## Bias, Risks and Limitations
613
+
614
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
615
+ -->
616
+
617
+ <!--
618
+ ### Recommendations
619
+
620
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
621
+ -->
622
+
623
+ ## Training Details
624
+
625
+ ### Training Dataset
626
+
627
+ #### json
628
+
629
+ * Dataset: json
630
+ * Size: 4,012 training samples
631
+ * Columns: <code>anchor</code> and <code>positive</code>
632
+ * Approximate statistics based on the first 1000 samples:
633
+   |         | anchor | positive |
+   |:--------|:-------|:---------|
+   | type    | string | string |
+   | details | <ul><li>min: 5 tokens</li><li>mean: 13.95 tokens</li><li>max: 44 tokens</li></ul> | <ul><li>min: 3 tokens</li><li>mean: 52.76 tokens</li><li>max: 428 tokens</li></ul> |
+ * Samples:
+   | anchor | positive |
+   |:-------|:---------|
+   | <code>What is the implication of histone lysine methylation in medulloblastoma?</code> | <code>Aberrant patterns of H3K4, H3K9, and H3K27 histone lysine methylation were shown to result in histone code alterations, which induce changes in gene expression, and affect the proliferation rate of cells in medulloblastoma.</code> |
+   | <code>What is the role of STAG1/STAG2 proteins in differentiation?</code> | <code>STAG1/STAG2 proteins are tumour suppressor proteins that suppress cell proliferation and are essential for differentiation.</code> |
+   | <code>What is the association between cell phone use and glioblastoma?</code> | <code>The association between cell phone use and incident glioblastoma remains unclear. Some studies have reported that cell phone use was associated with incident glioblastoma, and with reduced survival of patients diagnosed with glioblastoma. However, other studies have repeatedly replicated to find an association between cell phone use and glioblastoma.</code> |
643
+ * Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
644
+   ```json
+   {
+       "loss": "MultipleNegativesRankingLoss",
+       "matryoshka_dims": [
+           768,
+           512,
+           256,
+           128,
+           64
+       ],
+       "matryoshka_weights": [
+           1,
+           1,
+           1,
+           1,
+           1
+       ],
+       "n_dims_per_step": -1
+   }
+   ```
664
+
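The parameters above combine two ideas: an in-batch ranking loss, where each anchor should score its own positive above every other positive in the batch, and a Matryoshka wrapper that applies that loss to several truncated views of the same embeddings and sums the results. The pure-Python sketch below shows the structure only and is not the library's implementation. The function names, the toy 4-dimensional vectors, and the similarity scale of 20 are assumptions; the scale is chosen to mirror what I understand to be the library default.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ranking_loss(anchors: List[Vector], positives: List[Vector], scale: float = 20.0) -> float:
    """Softmax cross-entropy where positive i is the correct class for anchor i."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * _cosine(anchor, positive) for positive in positives]
        # Numerically stable log-sum-exp over all in-batch candidates.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


def matryoshka_loss(
    anchors: List[Vector],
    positives: List[Vector],
    dims: Sequence[int],
    weights: Sequence[float],
) -> float:
    """Weighted sum of the ranking loss over each truncated embedding size."""
    total = 0.0
    for dim, weight in zip(dims, weights):
        truncated_anchors = [a[:dim] for a in anchors]
        truncated_positives = [p[:dim] for p in positives]
        total += weight * ranking_loss(truncated_anchors, truncated_positives)
    return total


# Toy batch of two (anchor, positive) pairs with 4-dimensional embeddings.
anchors = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3]]
positives = [[0.9, 0.1, 0.3, 0.0], [0.1, 0.8, 0.0, 0.4]]

loss = matryoshka_loss(anchors, positives, dims=(4, 2), weights=(1.0, 1.0))
print(round(loss, 6))
```

Because every truncated prefix contributes to the objective with equal weight, the leading dimensions are pushed to carry most of the ranking signal. This is why the shorter embeddings remain competitive in the evaluation.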
665
+ ### Training Hyperparameters
666
+ #### Non-Default Hyperparameters
667
+
668
+ - `eval_strategy`: epoch
669
+ - `per_device_train_batch_size`: 32
670
+ - `per_device_eval_batch_size`: 16
671
+ - `gradient_accumulation_steps`: 16
672
+ - `learning_rate`: 2e-05
673
+ - `num_train_epochs`: 4
674
+ - `lr_scheduler_type`: cosine
675
+ - `warmup_ratio`: 0.1
676
+ - `bf16`: True
677
+ - `tf32`: True
678
+ - `load_best_model_at_end`: True
679
+ - `optim`: adamw_torch_fused
680
+ - `batch_sampler`: no_duplicates
681
+
682
+ #### All Hyperparameters
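Two of these settings interact in ways that are easy to miss: the batch size and gradient accumulation together set the effective batch per optimizer update, and `warmup_ratio` with the `cosine` scheduler shapes the learning rate over training. The sketch below works through both. It assumes a single device and a hypothetical total of 100 optimizer steps, since the card reports neither, and the schedule formula is the usual linear-warmup-then-cosine-decay form, not code taken from the trainer.

```python
import math

per_device_train_batch_size = 32
gradient_accumulation_steps = 16
num_devices = 1  # assumption: the card does not say how many devices were used

# Number of training pairs that contribute to each optimizer update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
print("effective batch size:", effective_batch_size)


def lr_at_step(step: int, total_steps: int, base_lr: float = 2e-5, warmup_ratio: float = 0.1) -> float:
    """Linear warmup from 0 to base_lr, then cosine decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 100  # hypothetical value, for illustration only
for step in (0, 5, 10, 55, 100):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```

With 4,012 training pairs and an effective batch of 512 under the single-device assumption, each epoch amounts to fewer than eight full effective batches, so the four-epoch run involves only a few dozen optimizer updates.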
683
+ <details><summary>Click to expand</summary>
684
+
685
+ - `overwrite_output_dir`: False
686
+ - `do_predict`: False
687
+ - `eval_strategy`: epoch
688
+ - `prediction_loss_only`: True
689
+ - `per_device_train_batch_size`: 32
690
+ - `per_device_eval_batch_size`: 16
691
+ - `per_gpu_train_batch_size`: None
692
+ - `per_gpu_eval_batch_size`: None
693
+ - `gradient_accumulation_steps`: 16
694
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 2e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1.0
+ - `num_train_epochs`: 4
+ - `max_steps`: -1
+ - `lr_scheduler_type`: cosine
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.1
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: True
+ - `fp16`: False
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: True
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: True
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch_fused
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: None
+ - `hub_always_push`: False
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `include_for_metrics`: []
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`:
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `use_liger_kernel`: False
+ - `eval_use_gather_object`: False
+ - `average_tokens_across_devices`: False
+ - `prompts`: None
+ - `batch_sampler`: no_duplicates
+ - `multi_dataset_batch_sampler`: proportional
+
+ </details>
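The hyperparameters above combine `learning_rate: 2e-05`, `lr_scheduler_type: cosine`, and `warmup_ratio: 0.1`, and the training log ends at step 32. The sketch below shows roughly what learning rate that implies at each step, using the common linear-warmup plus half-cosine-decay formula. It is an illustration, not the trainer's code: the function name `cosine_lr` is hypothetical, and rounding the warmup length up to 4 steps (`ceil(32 * 0.1)`) is an assumption.

```python
import math


def cosine_lr(step, total_steps=32, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at `step` under linear warmup followed by cosine decay.

    Assumption: warmup length is ceil(total_steps * warmup_ratio).
    """
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half a cosine period: base_lr at progress 0, 0 at progress 1.
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Steps at which the training log above reports a loss or an evaluation.
    for step in (0, 4, 8, 10, 16, 20, 24, 30, 32):
        print(f"step {step:2d}: lr = {cosine_lr(step):.3e}")
```

Under these assumptions the rate peaks at step 4 and decays to zero by step 32, so the last epoch runs at a small fraction of the peak rate.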
+
+ ### Training Logs
+ | Epoch | Step | Training Loss | dim_768_cosine_ndcg@10 | dim_512_cosine_ndcg@10 | dim_256_cosine_ndcg@10 | dim_128_cosine_ndcg@10 | dim_64_cosine_ndcg@10 |
+ |:-------:|:------:|:-------------:|:----------------------:|:----------------------:|:----------------------:|:----------------------:|:---------------------:|
+ | 1.0 | 8 | - | 0.8813 | 0.8827 | 0.8776 | 0.8563 | 0.8169 |
+ | 1.2540 | 10 | 23.4731 | - | - | - | - | - |
+ | 2.0 | 16 | - | 0.8932 | 0.8913 | 0.8858 | 0.8712 | 0.8514 |
+ | 2.5079 | 20 | 8.7062 | - | - | - | - | - |
+ | 3.0 | 24 | - | 0.8943 | 0.8934 | 0.8888 | 0.8771 | 0.8550 |
+ | 3.7619 | 30 | 6.6704 | - | - | - | - | - |
+ | **4.0** | **32** | **-** | **0.8941** | **0.8941** | **0.8889** | **0.8777** | **0.8544** |
+
+ * The bold row denotes the saved checkpoint.
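The `dim_768` to `dim_64` columns report retrieval quality when only the first N components of each embedding are kept, which is what a model trained with MatryoshkaLoss is meant to support. A minimal sketch of that truncation step follows, in plain Python so it runs without downloading the model. The vectors are synthetic stand-ins rather than real model output, and the helper names `truncate_and_normalize` and `cosine` are hypothetical.

```python
import math

# Dimensions evaluated in the training log above.
MATRYOSHKA_DIMS = (768, 512, 256, 128, 64)


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    # Synthetic 768-dimensional vectors standing in for a query and a document.
    query = [math.sin(i) for i in range(768)]
    document = [math.sin(i + 0.1) for i in range(768)]
    for dim in MATRYOSHKA_DIMS:
        q = truncate_and_normalize(query, dim)
        d = truncate_and_normalize(document, dim)
        print(dim, round(cosine(q, d), 4))
```

Renormalizing after truncation matters because a prefix of a unit vector is generally shorter than unit length. In practice, Sentence Transformers can do the truncation at load time through the `truncate_dim` argument of `SentenceTransformer`, which avoids slicing by hand.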
+
+ ### Framework Versions
+ - Python: 3.12.6
+ - Sentence Transformers: 4.1.0
+ - Transformers: 4.52.4
+ - PyTorch: 2.6.0+cu124
+ - Accelerate: 1.7.0
+ - Datasets: 3.6.0
+ - Tokenizers: 0.21.1
+
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ #### MatryoshkaLoss
+ ```bibtex
+ @misc{kusupati2024matryoshka,
+     title={Matryoshka Representation Learning},
+     author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
+     year={2024},
+     eprint={2205.13147},
+     archivePrefix={arXiv},
+     primaryClass={cs.LG}
+ }
+ ```
+
+ #### MultipleNegativesRankingLoss
+ ```bibtex
+ @misc{henderson2017efficient,
+     title={Efficient Natural Language Response Suggestion for Smart Reply},
+     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+     year={2017},
+     eprint={1705.00652},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.52.4",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "__version__": {
+     "sentence_transformers": "4.1.0",
+     "transformers": "4.52.4",
+     "pytorch": "2.6.0+cu124"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0f5ca9935aa45c2cbb9c25ad6ae487b89fa76de098cd3ddf713d92eee87c0afc
+ size 437951328
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   }
+ ]
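`modules.json` chains a Transformer module into a Pooling module, and `1_Pooling/config.json` in this commit enables only `pooling_mode_mean_tokens`. The sketch below illustrates what mean-token pooling computes: the average of the per-token vectors, skipping padding positions. It is a plain-Python illustration with toy 2-dimensional vectors, not the library's implementation, and the name `mean_pool` is hypothetical.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors whose attention-mask entry is 1.

    token_embeddings: one vector (list of floats) per token.
    attention_mask:   1 for real tokens, 0 for padding.
    """
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    count = max(count, 1)  # guard against an all-padding input
    return [value / count for value in summed]


if __name__ == "__main__":
    tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # third token is padding
    mask = [1, 1, 0]
    print(mean_pool(tokens, mask))  # [2.0, 3.0]
```

With this model the token vectors would be 768-dimensional (`word_embedding_dimension: 768`), so the pooled sentence embedding has the same size.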
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
tokenizer_config.json ADDED
@@ -0,0 +1,66 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [],
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "max_length": 512,
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_to_multiple_of": null,
+   "pad_token": "[PAD]",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "[SEP]",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff