DannyAI committed
Commit 0625858 · verified · 1 Parent(s): 80cd510

Add new SentenceTransformer model

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
{
  "word_embedding_dimension": 1024,
  "pooling_mode_cls_token": true,
  "pooling_mode_mean_tokens": false,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false,
  "pooling_mode_weightedmean_tokens": false,
  "pooling_mode_lasttoken": false,
  "include_prompt": true
}
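This config enables only `pooling_mode_cls_token`, so the sentence embedding is simply the final hidden state of the `[CLS]` token. A minimal PyTorch sketch of what the Pooling module computes under this config (illustrative only, not the library code):

```python
import torch

def cls_pooling(token_embeddings: torch.Tensor) -> torch.Tensor:
    """CLS pooling: keep the first token's hidden state as the sentence embedding.

    token_embeddings: (batch, seq_len, 1024), the transformer output.
    Returns: (batch, 1024) sentence embeddings.
    """
    return token_embeddings[:, 0]

# Dummy data matching word_embedding_dimension=1024 from the config above.
dummy = torch.randn(2, 16, 1024)
print(cls_pooling(dummy).shape)  # torch.Size([2, 1024])
```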
README.md ADDED
@@ -0,0 +1,712 @@
---
language:
- en
license: mit
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- generated_from_trainer
- dataset_size:64147
- loss:CachedMultipleNegativesRankingLoss
base_model: BAAI/bge-large-en-v1.5
widget:
- source_sentence: who is the second prime minister of india
  sentences:
  - List of Prime Ministers of India Since 1947, India has had fourteen Prime Ministers, fifteen including Gulzarilal Nanda who twice acted in the role. The first was Jawaharlal Nehru of the Indian National Congress party, who was sworn-in on 15 August 1947, when India gained independence from the British. Serving until his death in May 1964, Nehru remains India's longest-serving prime minister. He was succeeded by fellow Congressman Lal Bahadur Shastri, whose 19-month term also ended in death. Indira Gandhi, Nehru's daughter, succeeded Shastri in 1966 to become the country's first woman premier. Eleven years later, she was voted out of power in favour of the Janata Party, whose leader Morarji Desai became the first non-Congress prime minister. After he resigned in 1979, his former deputy Charan Singh briefly held office until Indira Gandhi was voted back six months later. Indira Gandhi's second stint as Prime Minister ended five years later on the morning of 31 October 1984, when she was gunned down by her own bodyguards. That evening, her son Rajiv Gandhi was sworn-in as India's youngest premier, and the third from his family. Thus far, members of Nehru–Gandhi dynasty have been Prime Minister for a total of 37 years and 303 days.[1]
  - Can You Feel the Love Tonight The song was performed in the film by Kristle Edwards, Joseph Williams, Sally Dworsky, Nathan Lane, and Ernie Sabella, while the end title version was performed by Elton John. It won the 1994 Academy Award for Best Original Song,[1] and the Golden Globe Award for Best Original Song. It also earned Elton John the Grammy Award for Best Male Pop Vocal Performance.
  - 'Sam Worthington Samuel Henry John Worthington[1] (born 2 August 1976) is an English born, Australian actor and writer. He portrayed Jake Sully in the 2009 film Avatar, Marcus Wright in Terminator Salvation, and Perseus in Clash of the Titans as well as its sequel Wrath of the Titans before transitioning to more dramatic roles in Everest (2015), Hacksaw Ridge (2016), The Shack, and Manhunt: Unabomber (both in 2017). He also played the main protagonist, Captain Alex Mason, in Call of Duty: Black Ops.'
- source_sentence: who drafted most of the declaration of independence
  sentences:
  - 'United States Declaration of Independence John Adams persuaded the committee to select Thomas Jefferson to compose the original draft of the document,[3] which Congress would edit to produce the final version. The Declaration was ultimately a formal explanation of why Congress had voted on July 2 to declare independence from Great Britain, more than a year after the outbreak of the American Revolutionary War. The next day, Adams wrote to his wife Abigail: "The Second Day of July 1776, will be the most memorable Epocha, in the History of America."[4] But Independence Day is actually celebrated on July 4, the date that the Declaration of Independence was approved.'
  - Luke Cage (season 2) The season is set to premiere in 2018.
  - Politics of the European Union The competencies of the European Union stem from the original Coal and Steel Community, which had as its goal an integrated market. The original competencies were regulatory in nature, restricted to matters of maintaining a healthy business environment. Rulings were confined to laws covering trade, currency, and competition. Increases in the number of EU competencies result from a process known as functional spillover. Functional spillover resulted in, first, the integration of banking and insurance industries to manage finance and investment. The size of the bureaucracies increased, requiring modifications to the treaty system as the scope of competencies integrated more and more functions. While member states hold their sovereignty inviolate, they remain within a system to which they have delegated the tasks of managing the marketplace. These tasks have expanded to include the competencies of free movement of persons, employment, transportation, and environmental regulation.
- source_sentence: is there a difference between 300 blackout and 300 aac blackout
  sentences:
  - 'Call of Duty: World at War Call of Duty: World at War is a 2008 first-person shooter video game developed by Treyarch and published by Activision for Microsoft Windows, PlayStation 3, Wii, and Xbox 360. The game is the fifth mainstream game of the Call of Duty series and returns the setting to World War II for the last time until Call of Duty: WWII almost nine years later. The game is also the first title in the Black Ops story line. The game was released in North America on November 11, 2008, and in Europe on November 14, 2008. A Windows Mobile version was also made available by Glu Mobile and different storyline versions for the Nintendo DS and PlayStation 2 were also produced, but remain in the World War II setting. The game is based on an enhanced version of the Call of Duty 4: Modern Warfare game engine developed by Infinity Ward with increased development on audio and visual effects.'
  - Vincent van Gogh Van Gogh suffered from psychotic episodes and delusions and though he worried about his mental stability, he often neglected his physical health, did not eat properly and drank heavily. His friendship with Gauguin ended after a confrontation with a razor, when in a rage, he severed part of his own left ear. He spent time in psychiatric hospitals, including a period at Saint-Rémy. After he discharged himself and moved to the Auberge Ravoux in Auvers-sur-Oise near Paris, he came under the care of the homoeopathic doctor Paul Gachet. His depression continued and on 27 July 1890, Van Gogh shot himself in the chest with a revolver. He died from his injuries two days later.
  - .300 AAC Blackout The .300 AAC Blackout (designated as the 300 BLK by the SAAMI[1] and 300 AAC Blackout by the C.I.P.[2]), also known as 7.62×35mm is a carbine cartridge developed in the United States by Advanced Armament Corporation (AAC) for use in the M4 carbine. Its purpose is to achieve ballistics similar to the 7.62×39mm Soviet cartridge in an AR-15 while using standard AR-15 magazines at their normal capacity. It can be seen as a SAAMI-certified copy of J. D. Jones' wildcat .300 Whisper. Care should be taken not to use 300 BLK ammunition in a rifle chambered for 7.62×40mm Wilson Tactical.[3]
- source_sentence: when does the new army uniform come out
  sentences:
  - United States v. Paramount Pictures, Inc. The case reached the U.S. Supreme Court in 1948; their verdict went against the movie studios, forcing all of them to divest themselves of their movie theater chains.[8] This, coupled with the advent of television and the attendant drop in movie ticket sales, brought about a severe slump in the movie business, a slump that would not be reversed until 1972, with the release of The Godfather, the first modern blockbuster.
  - 'E. L. James James says the idea for the Fifty Shades trilogy began as a response to the vampire novel series Twilight. In late 2008 James saw the movie Twilight, and then became intensely absorbed with the novels that the movie was based on. She read the novels several times over in a period of a few days, and then, for the first time in her life, sat down to write a book: basically a sequel to the Twilight novels. Between January and August 2009 she wrote two such books in quick succession. She says she then discovered the phenomenon of fan fiction, and this inspired her to publish her novels as Kindle books under the pen name "Snowqueens Icedragon". Beginning in August 2009 she then began to write the Fifty Shades books.[12][13]'
  - Army Combat Uniform In May 2014, the Army unofficially announced that the Operational Camouflage Pattern (OCP) would replace UCP on the ACU. The original "Scorpion" pattern was developed at United States Army Soldier Systems Center by Crye Precision in 2002 for the Objective Force Warrior program. Crye later modified and trademarked their version of the pattern as MultiCam, which was selected for use by U.S. soldiers in Afghanistan in 2010. After talks to officially adopt MultiCam broke down over costs in late 2013, the Army began experimenting with the original Scorpion pattern, creating a variant code named "Scorpion W2", noting that while a pattern can be copyrighted, a color palette cannot and that beyond 50 meters the actual pattern is "not that relevant." The pattern resembles MultiCam with muted greens, light beige, and dark brown colors, but uses fewer beige and brown patches and no vertical twig and branch elements.[12] On 31 July 2014, the Army formally announced that the pattern would begin being issued in uniforms in summer 2015. The official name is intended to emphasize its use beyond Afghanistan to all combatant commands.[13] The UCP pattern is planned to be fully replaced by the OCP on the ACU by 1 October 2019.[14] ACUs printed in OCP first became available for purchase on 1 July 2015, with deployed soldiers already being issued uniforms and equipment in the new pattern.[15]
- source_sentence: what was agenda 21 of earth summit of rio de janeiro
  sentences:
  - 'Jab Harry Met Sejal Jab Harry Met Sejal (English: When Harry Met Sejal) is a 2017 Indian romantic comedy film written and directed by Imtiaz Ali. It features Shah Rukh Khan and Anushka Sharma in the lead roles,[1] their third collaboration after Rab Ne Bana Di Jodi (2008) and Jab Tak Hai Jaan (2012). Pre-production of the film begun in April 2015 and principal photography commenced in August 2016 in Prague, Amsterdam, Vienna, Lisbon and Budapest.'
  - Agenda 21 Agenda 21 is a non-binding, action plan of the United Nations with regard to sustainable development.[1] It is a product of the Earth Summit (UN Conference on Environment and Development) held in Rio de Janeiro, Brazil, in 1992. It is an action agenda for the UN, other multilateral organizations, and individual governments around the world that can be executed at local, national, and global levels.
  - Pencil Most manufacturers, and almost all in Europe, designate their pencils with the letters H (commonly interpreted as "hardness") to B (commonly "blackness"), as well as F (usually taken to mean "fineness", although F pencils are no more fine or more easily sharpened than any other grade. also known as "firm" in Japan[68]). The standard writing pencil is graded HB.[69] This designation might have been first used in the early 20th century by Brookman, an English pencil maker. It used B for black and H for hard; a pencil's grade was described by a sequence or successive Hs or Bs such as BB and BBB for successively softer leads, and HH and HHH for successively harder ones.[70] The Koh-i-Noor Hardtmuth pencil manufacturers claim to have first used the HB designations, with H standing for Hardtmuth, B for the company's location of Budějovice, and F for Franz Hardtmuth, who was responsible for technological improvements in pencil manufacture.[71][72]
datasets:
- sentence-transformers/natural-questions
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- cosine_accuracy@1
- cosine_accuracy@3
- cosine_accuracy@5
- cosine_accuracy@10
- cosine_precision@1
- cosine_precision@3
- cosine_precision@5
- cosine_precision@10
- cosine_recall@1
- cosine_recall@3
- cosine_recall@5
- cosine_recall@10
- cosine_ndcg@10
- cosine_mrr@10
- cosine_map@100
model-index:
- name: bge-large-en-v1.5
  results:
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: NanoQuoraRetrieval
      type: NanoQuoraRetrieval
    metrics:
    - type: cosine_accuracy@1
      value: 0.88
      name: Cosine Accuracy@1
    - type: cosine_accuracy@3
      value: 0.96
      name: Cosine Accuracy@3
    - type: cosine_accuracy@5
      value: 0.98
      name: Cosine Accuracy@5
    - type: cosine_accuracy@10
      value: 1.0
      name: Cosine Accuracy@10
    - type: cosine_precision@1
      value: 0.88
      name: Cosine Precision@1
    - type: cosine_precision@3
      value: 0.3999999999999999
      name: Cosine Precision@3
    - type: cosine_precision@5
      value: 0.25999999999999995
      name: Cosine Precision@5
    - type: cosine_precision@10
      value: 0.13599999999999998
      name: Cosine Precision@10
    - type: cosine_recall@1
      value: 0.7673333333333332
      name: Cosine Recall@1
    - type: cosine_recall@3
      value: 0.922
      name: Cosine Recall@3
    - type: cosine_recall@5
      value: 0.966
      name: Cosine Recall@5
    - type: cosine_recall@10
      value: 0.9933333333333334
      name: Cosine Recall@10
    - type: cosine_ndcg@10
      value: 0.9311833586321692
      name: Cosine Ndcg@10
    - type: cosine_mrr@10
      value: 0.9228888888888889
      name: Cosine Mrr@10
    - type: cosine_map@100
      value: 0.9056754689754689
      name: Cosine Map@100
---

# bge-large-en-v1.5

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) on the [natural-questions](https://huggingface.co/datasets/sentence-transformers/natural-questions) dataset. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) <!-- at revision d4aa6901d3a41ba39fb536a557fa166f842b0e09 -->
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 1024 dimensions
- **Similarity Function:** Cosine Similarity
- **Training Dataset:**
  - [natural-questions](https://huggingface.co/datasets/sentence-transformers/natural-questions)
- **Language:** en
- **License:** mit

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
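The same three-module stack can be assembled by hand, which makes the listing above concrete. A sketch using the `sentence_transformers.models` API (loading the fine-tuned repository directly, as in Usage below, is the normal route):

```python
from sentence_transformers import SentenceTransformer, models

# Module (0): the BERT backbone, truncating inputs at 512 tokens.
transformer = models.Transformer("BAAI/bge-large-en-v1.5", max_seq_length=512)

# Module (1): CLS-token pooling over the 1024-dim word embeddings.
pooling = models.Pooling(
    transformer.get_word_embedding_dimension(),  # 1024
    pooling_mode="cls",
)

# Module (2): L2-normalize so dot product equals cosine similarity.
normalize = models.Normalize()

model = SentenceTransformer(modules=[transformer, pooling, normalize])
print(model)
```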

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference:
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("DannyAI/embedding_fine_tuning_with_prompts_bge_large_en_v1.5")
# Run inference
queries = [
    "what was agenda 21 of earth summit of rio de janeiro",
]
documents = [
    'Agenda 21 Agenda 21 is a non-binding, action plan of the United Nations with regard to sustainable development.[1] It is a product of the Earth Summit (UN Conference on Environment and Development) held in Rio de Janeiro, Brazil, in 1992. It is an action agenda for the UN, other multilateral organizations, and individual governments around the world that can be executed at local, national, and global levels.',
    'Jab Harry Met Sejal Jab Harry Met Sejal (English: When Harry Met Sejal) is a 2017 Indian romantic comedy film written and directed by Imtiaz Ali. It features Shah Rukh Khan and Anushka Sharma in the lead roles,[1] their third collaboration after Rab Ne Bana Di Jodi (2008) and Jab Tak Hai Jaan (2012). Pre-production of the film begun in April 2015 and principal photography commenced in August 2016 in Prague, Amsterdam, Vienna, Lisbon and Budapest.',
    'Pencil Most manufacturers, and almost all in Europe, designate their pencils with the letters H (commonly interpreted as "hardness") to B (commonly "blackness"), as well as F (usually taken to mean "fineness", although F pencils are no more fine or more easily sharpened than any other grade. also known as "firm" in Japan[68]). The standard writing pencil is graded HB.[69] This designation might have been first used in the early 20th century by Brookman, an English pencil maker. It used B for black and H for hard; a pencil\'s grade was described by a sequence or successive Hs or Bs such as BB and BBB for successively softer leads, and HH and HHH for successively harder ones.[70] The Koh-i-Noor Hardtmuth pencil manufacturers claim to have first used the HB designations, with H standing for Hardtmuth, B for the company\'s location of Budějovice, and F for Franz Hardtmuth, who was responsible for technological improvements in pencil manufacture.[71][72]',
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# [1, 1024] [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.9017, 0.2307, 0.2148]])
```
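Note that training used the prefixes `query: ` and `document: ` (see Training Hyperparameters below), while the prompts shipped in `config_sentence_transformers.json` are empty strings, so nothing is prepended automatically by `encode_query`/`encode_document`. If you want to reproduce the prompted setup, one option is to pass the prefix yourself; a sketch continuing the snippet above (the `prompt` argument is available in recent sentence-transformers releases):

```python
# Continuing the snippet above: prepend the training-time prefixes manually.
query_embeddings = model.encode(queries, prompt="query: ")
document_embeddings = model.encode(documents, prompt="document: ")
print(model.similarity(query_embeddings, document_embeddings))
```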

## Evaluation

### Metrics

#### Information Retrieval

* Dataset: `NanoQuoraRetrieval`
* Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator) with these parameters:
  ```json
  {
      "query_prompt": "query: ",
      "corpus_prompt": "document: "
  }
  ```

| Metric              | Value      |
|:--------------------|:-----------|
| cosine_accuracy@1   | 0.88       |
| cosine_accuracy@3   | 0.96       |
| cosine_accuracy@5   | 0.98       |
| cosine_accuracy@10  | 1.0        |
| cosine_precision@1  | 0.88       |
| cosine_precision@3  | 0.4        |
| cosine_precision@5  | 0.26       |
| cosine_precision@10 | 0.136      |
| cosine_recall@1     | 0.7673     |
| cosine_recall@3     | 0.922      |
| cosine_recall@5     | 0.966      |
| cosine_recall@10    | 0.9933     |
| **cosine_ndcg@10**  | **0.9312** |
| cosine_mrr@10       | 0.9229     |
| cosine_map@100      | 0.9057     |
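The evaluator prompts above mirror the training prompts. A toy sketch of running `InformationRetrievalEvaluator` with the same parameters (the table above comes from the NanoQuoraRetrieval dataset, not this stand-in corpus):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("DannyAI/embedding_fine_tuning_with_prompts_bge_large_en_v1.5")

# Tiny stand-in corpus; queries, corpus, and relevant_docs are keyed by string ids.
queries = {"q1": "what was agenda 21 of earth summit of rio de janeiro"}
corpus = {
    "d1": "Agenda 21 is a non-binding action plan of the United Nations on sustainable development.",
    "d2": "The pencil grading scale runs from H (hard) to B (black).",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="toy-ir",
    query_prompt="query: ",
    corpus_prompt="document: ",
)
print(evaluator(model))  # dict of accuracy@k, precision@k, recall@k, ndcg@10, mrr@10, map@100
```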

## Training Details

### Training Dataset

#### natural-questions

* Dataset: [natural-questions](https://huggingface.co/datasets/sentence-transformers/natural-questions) at [f9e894e](https://huggingface.co/datasets/sentence-transformers/natural-questions/tree/f9e894e1081e206e577b4eaa9ee6de2b06ae6f17)
* Size: 64,147 training samples
* Columns: <code>query</code> and <code>answer</code>
* Approximate statistics based on the first 1000 samples:
  |         | query | answer |
  |:--------|:------|:-------|
  | type    | string | string |
  | details | <ul><li>min: 10 tokens</li><li>mean: 11.81 tokens</li><li>max: 26 tokens</li></ul> | <ul><li>min: 21 tokens</li><li>mean: 137.28 tokens</li><li>max: 512 tokens</li></ul> |
* Samples:
  | query | answer |
  |:------|:-------|
  | <code>the internal revenue code is part of federal statutory law. true false</code> | <code>Internal Revenue Code The Internal Revenue Code (IRC), formally the Internal Revenue Code of 1986, is the domestic portion of federal statutory tax law in the United States, published in various volumes of the United States Statutes at Large, and separately as Title 26 of the United States Code (USC).[1] It is organized topically, into subtitles and sections, covering income tax (see Income tax in the United States), payroll taxes, estate taxes, gift taxes, and excise taxes; as well as procedure and administration. Its implementing agency is the Internal Revenue Service.</code> |
  | <code>where is the pyramid temple at borobudur located</code> | <code>Borobudur Approximately 40 kilometres (25 mi) northwest of Yogyakarta and 86 kilometres (53 mi) west of Surakarta, Borobudur is located in an elevated area between two twin volcanoes, Sundoro-Sumbing and Merbabu-Merapi, and two rivers, the Progo and the Elo. According to local myth, the area known as Kedu Plain is a Javanese "sacred" place and has been dubbed "the garden of Java" due to its high agricultural fertility.[19] During the restoration in the early 20th century, it was discovered that three Buddhist temples in the region, Borobudur, Pawon and Mendut, are positioned along a straight line.[20] A ritual relationship between the three temples must have existed, although the exact ritual process is unknown.[14]</code> |
  | <code>what does uncle stand for in the show man from uncle</code> | <code>The Man from U.N.C.L.E. Originally, co-creator Sam Rolfe wanted to leave the meaning of U.N.C.L.E. ambiguous so it could refer to either "Uncle Sam" or the United Nations.[2]:14 Concerns by Metro-Goldwyn-Mayer's (MGM) legal department about using "U.N." for commercial purposes resulted in the producers' clarification that U.N.C.L.E. was an acronym for the United Network Command for Law and Enforcement.[3] Each episode had an "acknowledgement" to the U.N.C.L.E. in the end titles.</code> |
* Loss: [<code>CachedMultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cachedmultiplenegativesrankingloss) with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim",
      "mini_batch_size": 16,
      "gather_across_devices": false
  }
  ```
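A sketch of constructing the loss with exactly these parameters (the actual training script is not part of this commit):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss
from sentence_transformers.util import cos_sim

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# scale=20.0 multiplies the cosine similarities before the cross-entropy;
# mini_batch_size=16 bounds memory for the gradient-cached embedding passes.
loss = CachedMultipleNegativesRankingLoss(
    model=model,
    scale=20.0,
    similarity_fct=cos_sim,
    mini_batch_size=16,
)
```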

### Evaluation Dataset

#### natural-questions

* Dataset: [natural-questions](https://huggingface.co/datasets/sentence-transformers/natural-questions) at [f9e894e](https://huggingface.co/datasets/sentence-transformers/natural-questions/tree/f9e894e1081e206e577b4eaa9ee6de2b06ae6f17)
* Size: 16,037 evaluation samples
* Columns: <code>query</code> and <code>answer</code>
* Approximate statistics based on the first 1000 samples:
  |         | query | answer |
  |:--------|:------|:-------|
  | type    | string | string |
  | details | <ul><li>min: 10 tokens</li><li>mean: 11.67 tokens</li><li>max: 22 tokens</li></ul> | <ul><li>min: 12 tokens</li><li>mean: 134.64 tokens</li><li>max: 512 tokens</li></ul> |
* Samples:
  | query | answer |
  |:------|:-------|
  | <code>when did last harry potter movie come out</code> | <code>Harry Potter (film series) Harry Potter is a British-American film series based on the Harry Potter novels by author J. K. Rowling. The series is distributed by Warner Bros. and consists of eight fantasy films, beginning with Harry Potter and the Philosopher's Stone (2001) and culminating with Harry Potter and the Deathly Hallows – Part 2 (2011).[2][3] A spin-off prequel series will consist of five films, starting with Fantastic Beasts and Where to Find Them (2016). The Fantastic Beasts films mark the beginning of a shared media franchise known as J. K. Rowling's Wizarding World.[4]</code> |
  | <code>where did the saying debbie downer come from</code> | <code>Debbie Downer The character's name, Debbie Downer, is a slang phrase which refers to someone who frequently adds bad news and negative feelings to a gathering, thus bringing down the mood of everyone around them. Dratch's character would usually appear at social gatherings and interrupt the conversation to voice negative opinions and pronouncements. She is especially concerned about the rate of feline AIDS, a subject that she would bring up on more than one occasion, saying it was the number one killer of domestic cats.</code> |
  | <code>the financial crisis of 2008 was caused by</code> | <code>Financial crisis of 2007–2008 It began in 2007 with a crisis in the subprime mortgage market in the United States, and developed into a full-blown international banking crisis with the collapse of the investment bank Lehman Brothers on September 15, 2008.[5] Excessive risk-taking by banks such as Lehman Brothers helped to magnify the financial impact globally.[6] Massive bail-outs of financial institutions and other palliative monetary and fiscal policies were employed to prevent a possible collapse of the world financial system. The crisis was nonetheless followed by a global economic downturn, the Great Recession. The European debt crisis, a crisis in the banking system of the European countries using the euro, followed later.</code> |
* Loss: [<code>CachedMultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cachedmultiplenegativesrankingloss) with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim",
      "mini_batch_size": 16,
      "gather_across_devices": false
  }
  ```

### Training Hyperparameters
#### Non-Default Hyperparameters

- `eval_strategy`: steps
- `per_device_train_batch_size`: 5
- `per_device_eval_batch_size`: 5
- `learning_rate`: 2e-05
- `max_steps`: 100
- `warmup_ratio`: 0.1
- `seed`: 30
- `bf16`: True
- `load_best_model_at_end`: True
- `prompts`: {'query': 'query: ', 'answer': 'document: '}
- `batch_sampler`: no_duplicates
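Assembled into a runnable script, these non-default values would look roughly like the following sketch. This is hypothetical: the output directory, the 80/20 split, and the eval/save step values are assumptions (the card only reports 64,147 train and 16,037 eval samples, with evaluation logged at step 100).

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

dataset = load_dataset("sentence-transformers/natural-questions", split="train")
split = dataset.train_test_split(test_size=0.2, seed=30)  # assumed split

args = SentenceTransformerTrainingArguments(
    output_dir="out",                      # assumed
    eval_strategy="steps",
    eval_steps=100,                        # assumed, so eval lines up with max_steps
    save_steps=100,                        # assumed, required by load_best_model_at_end
    per_device_train_batch_size=5,
    per_device_eval_batch_size=5,
    learning_rate=2e-5,
    max_steps=100,
    warmup_ratio=0.1,
    seed=30,
    bf16=True,
    load_best_model_at_end=True,
    prompts={"query": "query: ", "answer": "document: "},
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    loss=CachedMultipleNegativesRankingLoss(model, mini_batch_size=16),
)
trainer.train()
```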

#### All Hyperparameters
<details><summary>Click to expand</summary>

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: steps
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 5
- `per_device_eval_batch_size`: 5
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 2e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
- `num_train_epochs`: 3.0
- `max_steps`: 100
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.1
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 30
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: True
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: True
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `parallelism_config`: None
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch_fused
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `hub_revision`: None
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`: 
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `liger_kernel_config`: None
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: {'query': 'query: ', 'answer': 'document: '}
- `batch_sampler`: no_duplicates
- `multi_dataset_batch_sampler`: proportional
- `router_mapping`: {}
- `learning_rate_mapping`: {}

</details>

### Training Logs
| Epoch      | Step    | Training Loss | Validation Loss | NanoQuoraRetrieval_cosine_ndcg@10 |
|:----------:|:-------:|:-------------:|:---------------:|:---------------------------------:|
| -1         | -1      | -             | -               | 0.9583                            |
| **0.0078** | **100** | **0.0063**    | **0.0029**      | **0.9312**                        |
| -1         | -1      | -             | -               | 0.9312                            |

* The bold row denotes the saved checkpoint.

### Framework Versions
- Python: 3.12.11
- Sentence Transformers: 5.1.0
- Transformers: 4.56.1
- PyTorch: 2.8.0+cu126
- Accelerate: 1.10.1
- Datasets: 4.0.0
- Tokenizers: 0.22.0

## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### CachedMultipleNegativesRankingLoss
```bibtex
@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```
config.json ADDED
@@ -0,0 +1,31 @@
{
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "dtype": "float32",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.56.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
config_sentence_transformers.json ADDED
@@ -0,0 +1,14 @@
{
  "__version__": {
    "sentence_transformers": "5.1.0",
    "transformers": "4.56.1",
    "pytorch": "2.8.0+cu126"
  },
  "model_type": "SentenceTransformer",
  "prompts": {
    "query": "",
    "document": ""
  },
  "default_prompt_name": null,
  "similarity_fn_name": "cosine"
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b1fe310985cdd22d996028fa139271c157e1a4e5aa3115e9f1575cdf8b003097
size 1340612432
modules.json ADDED
@@ -0,0 +1,20 @@
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  },
  {
    "idx": 2,
    "name": "2",
    "path": "2_Normalize",
    "type": "sentence_transformers.models.Normalize"
  }
]
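These three modules are what `SentenceTransformer` chains at inference time. For reference, a sketch of the equivalent forward pass using plain `transformers` (CLS pooling per `1_Pooling`, then L2 normalization per `2_Normalize`); this should track `model.encode` output, modulo batching details:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

repo = "DannyAI/embedding_fine_tuning_with_prompts_bge_large_en_v1.5"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModel.from_pretrained(repo)

sentences = ["what was agenda 21 of earth summit of rio de janeiro"]
batch = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, 1024)

# Module (1): CLS pooling; Module (2): L2 normalization.
embeddings = F.normalize(hidden[:, 0], p=2, dim=1)
print(embeddings.shape)  # torch.Size([1, 1024])
```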
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
{
  "max_seq_length": 512,
  "do_lower_case": true
}
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
{
  "cls_token": {
    "content": "[CLS]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "[MASK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "[PAD]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "[SEP]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "[UNK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,58 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": true,
  "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
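The tokenizer is a standard lowercasing WordPiece `BertTokenizer` that wraps inputs in `[CLS]`/`[SEP]` (ids 101/102 per the map above). A quick sketch of its behavior:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("DannyAI/embedding_fine_tuning_with_prompts_bge_large_en_v1.5")

print(tok.tokenize("Hello World"))  # ['hello', 'world'] -- do_lower_case is true
enc = tok("Hello World")
print(enc["input_ids"])             # starts with 101 ([CLS]) and ends with 102 ([SEP])
print(tok.convert_ids_to_tokens(enc["input_ids"]))  # ['[CLS]', 'hello', 'world', '[SEP]']
```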
vocab.txt ADDED
The diff for this file is too large to render. See raw diff