amentaphd committed · Commit fb1b084 · verified · 1 Parent(s): 3a521e6

Upload folder using huggingface_hub

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 1024,
+   "pooling_mode_cls_token": true,
+   "pooling_mode_mean_tokens": false,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
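Note on the pooling configuration: only `pooling_mode_cls_token` is enabled, so the sentence embedding is taken from the first ([CLS]) token's hidden state, and the README added in this commit lists a `Normalize()` module after pooling. The following is a minimal pure-Python sketch of those two steps on made-up 4-dimensional token vectors (the real model uses 1024 dimensions). The helper names `cls_pool` and `l2_normalize` are illustrative assumptions, not library functions.

```python
import math

def cls_pool(token_embeddings):
    """Return the first token's vector, i.e. CLS-token pooling."""
    return list(token_embeddings[0])

def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

# Toy token embeddings for one sentence: [CLS], two word pieces, [SEP].
tokens = [
    [3.0, 4.0, 0.0, 0.0],  # [CLS]
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 0.2, 0.1, 0.9],
    [0.0, 0.0, 2.0, 0.0],  # [SEP]
]

sentence_embedding = l2_normalize(cls_pool(tokens))
print(sentence_embedding)
```

Because the result has unit length, the cosine similarity between two such embeddings reduces to a plain dot product.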
README.md ADDED
@@ -0,0 +1,756 @@
+ ---
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - generated_from_trainer
+ - dataset_size:46338
+ - loss:MatryoshkaLoss
+ - loss:MultipleNegativesRankingLoss
+ base_model: Snowflake/snowflake-arctic-embed-l
+ widget:
+ - source_sentence: What criteria must Member States consider when establishing penalties
+     for infringements of the specified Regulation, and what is the deadline for notifying
+     the Commission about these rules?
+   sentences:
+   - 'Enforcement
+
+
+     1.
+
+
+     Member States shall lay down the rules on penalties applicable to infringements
+     of this Regulation and shall take all measures necessary to ensure that they are
+     implemented. The penalties provided for must be effective, proportionate and dissuasive
+     taking into account, in particular, the nature, duration, recurrence and gravity
+     of the infringement. Member States shall, by 31 December 2024, notify the Commission
+     of those rules and of those measures and shall notify it without delay of any
+     subsequent amendment affecting them.
+
+
+     2.'
+   - Within the transitional periods established, Member States shall progressively
+     reduce their respective gaps with regard to the new minimum levels of taxation.
+     However, where the difference between the national level and the minimum level
+     does not exceed 3 % of that minimum level, the Member State concerned may wait
+     until the end of the period to adjust its national level.
+   - 'AR 10. ‘Indirect political contribution’ refers to those political contributions
+     made through an intermediary organisation such as a lobbyist or charity, or support
+     given to an organisation such as a think tank or trade association linked to or
+     supporting particular political parties or causes.
+
+
+     AR 11. When determining ‘comparable position’ in this standard, the undertaking
+     shall consider various factors, including level of responsibility and scope of
+     activities undertaken.
+
+
+     AR 12. The undertaking may provide the following information on its financial
+     or in-kind contributions with regard to its lobbying expenses:
+
+
+     (a)
+
+
+     the total monetary amount of such internal and external expenses; and
+
+
+     (b)'
+ - source_sentence: How does the use of AI systems impact access to essential public
+     assistance benefits and services?
+   sentences:
+   - X. Among these substances there are ‘priority hazardous substances’ which means
+     substances identified in accordance with Article 16(3) and (6) for which measures
+     have to be taken in accordance with Article 16(1) and (8). --- --- 31. ‘Pollutant’means
+     any substance liable to cause pollution, in particular those listed in Annex VIII.
+     --- --- 32. ‘Direct discharge to groundwater’means discharge of pollutants into
+     groundwater without percolation throughout the soil or subsoil. --- --- 33. ‘Pollution’means
+     the direct or indirect introduction, as a result of human activity, of substances
+     or heat into the air, water or land which may be harmful to human health or the
+     quality of aquatic ecosystems or terrestrial ecosystems directly depending on
+   - 'The competent authorities shall inform the requesting competent authorities of
+     any decision taken under the first subparagraph, stating the reasons therefor.
+
+
+     4.
+
+
+     In order to ensure uniform application of this Article, ESMA may develop draft
+     implementing technical standards to establish common procedures for competent
+     authorities to cooperate in on-the-spot verifications and investigations.
+
+
+     Power is conferred on the Commission to adopt the implementing technical standards
+     referred to in the first subparagraph in accordance with Article 15 of Regulation
+     (EU) No 1095/2010.
+
+
+     Article 55
+
+
+     Dispute settlement'
+   - (58) Another area in which the use of AI systems deserves special consideration
+     is the access to and enjoyment of certain essential private and public services
+     and benefits necessary for people to fully participate in society or to improve
+     one’s standard of living. In particular, natural persons applying for or receiving
+     essential public assistance benefits and services from public authorities namely
+     healthcare services, social security benefits, social services providing protection
+     in cases such as maternity, illness, industrial accidents, dependency or old age
+     and loss of employment and social and housing assistance, are typically dependent
+     on those benefits and services and in a vulnerable position in relation to the
+     responsible
+ - source_sentence: How does the context suggest promoting vulnerable customers' active
+     engagement in the energy market?
+   sentences:
+   - energy efficiency improvement measures as priority actions; --- --- (c) carry
+     out early, forward-looking investments in energy efficiency improvement measures
+     before distributional impacts from other policies and measures show their effect;
+     --- --- (d) foster technical assistance and the roll-out of enabling funding and
+     financial tools, such as on-bill schemes, local loan-loss reserve, guarantee funds,
+     funds targeting deep renovations and renovations with minimum energy gains; ---
+     --- (e) foster technical assistance for social actors to promote vulnerable customer’s
+     active engagement in the energy market, and positive changes in their energy consumption
+     behaviour; --- --- (f) ensure access to finance, grants or subsidies bound to minimum
+   - '4.
+
+
+     To the extent that the tasks relating to the implementation of the Innovation
+     Fund are not delegated to an implementing body, the Commission shall carry out
+     those tasks.
+
+
+     Article 18
+
+
+     Tasks of the implementing body
+
+
+     ►M2 The implementing body designated in accordance with Article 17(1) of this
+     Regulation to implement the Innovation Fund in accordance with Article 17(2) may
+     be entrusted with the overall management of the calls for proposals, the disbursement
+     of the Innovation Fund support and the monitoring of the implementation of selected
+     projects. ◄ For that purpose, the implementing body may be entrusted with the
+     following tasks:
+
+
+     (a)
+
+
+     organising the call for proposals;
+
+
+     (b)'
+   - 'Calculation
+
+
+     Calculations of emissions shall be performed using the formula:
+
+
+     Activity data × Emission factor × Oxidation factor
+
+
+     Activity data (fuel used, production rate etc.) shall be monitored on the basis
+     of supply data or measurement.'
+ - source_sentence: What is the purpose of Directive 2004/109/EC of the European Parliament
+     and of the Council of 15 December 2004?
+   sentences:
+   - '3.7. Uses advised against ►M7 (see Section 1 of the safety data sheet) ◄
+
+
+     Where applicable, an indication of the uses which the registrant advises against
+     and why (i.e. non-statutory recommendations by supplier). This need not be an
+     exhaustive list.
+
+
+     4. CLASSIFICATION AND LABELLING
+
+
+     ▼M3
+
+
+     4.1 The hazard classification of the substance(s), resulting from the application
+     of Title I and II of Regulation (EC) No 1272/2008 for all hazard classes and categories
+     in that Regulation,
+
+
+     In addition, for each entry, the reasons why no classification is given for a
+     hazard class or differentiation of a hazard class should be provided (i.e. if
+     data are lacking, inconclusive, or conclusive but not sufficient for classification),'
+   - '(b)
+
+
+     operations by which the user of an energy product makes its reuse possible in
+     his own undertaking provided that the taxation already paid on such product is
+     not less than the taxation which would be due if the reused energy product were
+     again to be liable to taxation;
+
+
+     (c)
+
+
+     an operation consisting of mixing, outside a production establishment or a tax
+     warehouse, energy products with other energy products or other materials, provided
+     that:
+
+
+     (i)
+
+
+     taxation on the components has been paid previously; and
+
+
+     (ii)
+
+
+     the amount paid is not less than the amount of the tax which would be chargeable
+     on the mixture.
+
+
+     The condition under (i) shall not apply where the mixture is exempted for a specific
+     use.
+
+
+     Article 22'
+   - '( 15 ) Directive 2004/109/EC of the European Parliament and of the Council of
+     15 December 2004 on the harmonisation of transparency requirements in relation
+     to information about issuers whose securities are admitted to trading on a regulated
+     market and amending Directive 2001/34/EC (OJ L 390, 31.12.2004, p. 38).
+
+
+     ( 16 ) Regulation (EU) 2020/852 of the European Parliament and of the Council
+     of 18 June 2020 on the establishment of a framework to facilitate sustainable
+     investment, and amending Regulation (EU) 2019/2088 (OJ L 198, 22.6.2020, p. 13).
+
+
+     ( 17 ) OJ L 142, 30.4.2004, p. 12.
+
+
+     ( 18 ) OJ L 340, 22.12.2007, p. 66.'
+ - source_sentence: What are the main objectives of the directives mentioned in the
+     text regarding greenhouse gas emissions and carbon dioxide storage, and how do
+     they relate to environmental protection and sustainability within the European
+     Union?
+   sentences:
+   - '(24) Directive 2003/87/EC of the European Parliament and of the Council of 13
+     October 2003 establishing a scheme for greenhouse gas emission allowance trading
+     within the Union and amending Council Directive 96/61/EC (OJ L 275, 25.10.2003,
+     p. 32).
+
+
+     (25) Directive 2009/31/EC of the European Parliament and of the Council of 23
+     April 2009 on the geological storage of carbon dioxide and amending Council Directive
+     85/337/EEC, European Parliament and Council Directives 2000/60/EC, 2001/80/EC,
+     2004/35/EC, 2006/12/EC, 2008/1/EC and Regulation (EC) No 1013/2006 (OJ L 140,
+     5.6.2009, p. 114).
+
+
+     (26) Directive 2014/23/EU of the European Parliament and of the Council of 26
+     February 2014 on the award of concession contracts (OJ L 94, 28.3.2014, p. 1).'
+   - 'Article 33
+
+
+     Responsibility and liability for drawing up and publishing the financial statements
+     and the management report
+
+
+     ▼M4
+
+
+     1.'
+   - '(b)
+
+
+     risks related to the undertaking’s dependencies on consumers and/or end-users
+     may include the loss of business continuity where an economic crisis makes consumers
+     unable to afford certain products or services;
+
+
+     (c)
+
+
+     ►C1 opportunities related to the undertaking’s impacts on consumers and/or end-
+     users may include market differentiation and greater customer appeal from offering
+     safe products or privacy-respecting services; and ◄
+
+
+     (d)'
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
+ metrics:
+ - cosine_accuracy@1
+ - cosine_accuracy@3
+ - cosine_accuracy@5
+ - cosine_accuracy@10
+ - cosine_precision@1
+ - cosine_precision@3
+ - cosine_precision@5
+ - cosine_precision@10
+ - cosine_recall@1
+ - cosine_recall@3
+ - cosine_recall@5
+ - cosine_recall@10
+ - cosine_ndcg@10
+ - cosine_mrr@10
+ - cosine_map@100
+ model-index:
+ - name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
+   results:
+   - task:
+       type: information-retrieval
+       name: Information Retrieval
+     dataset:
+       name: Unknown
+       type: unknown
+     metrics:
+     - type: cosine_accuracy@1
+       value: 0.6659761781460383
+       name: Cosine Accuracy@1
+     - type: cosine_accuracy@3
+       value: 0.8841705506645952
+       name: Cosine Accuracy@3
+     - type: cosine_accuracy@5
+       value: 0.9312963921974797
+       name: Cosine Accuracy@5
+     - type: cosine_accuracy@10
+       value: 0.9672017952701536
+       name: Cosine Accuracy@10
+     - type: cosine_precision@1
+       value: 0.6659761781460383
+       name: Cosine Precision@1
+     - type: cosine_precision@3
+       value: 0.29472351688819837
+       name: Cosine Precision@3
+     - type: cosine_precision@5
+       value: 0.1862592784394959
+       name: Cosine Precision@5
+     - type: cosine_precision@10
+       value: 0.09672017952701535
+       name: Cosine Precision@10
+     - type: cosine_recall@1
+       value: 0.6659761781460383
+       name: Cosine Recall@1
+     - type: cosine_recall@3
+       value: 0.8841705506645952
+       name: Cosine Recall@3
+     - type: cosine_recall@5
+       value: 0.9312963921974797
+       name: Cosine Recall@5
+     - type: cosine_recall@10
+       value: 0.9672017952701536
+       name: Cosine Recall@10
+     - type: cosine_ndcg@10
+       value: 0.8278291318026204
+       name: Cosine Ndcg@10
+     - type: cosine_mrr@10
+       value: 0.7818480980055302
+       name: Cosine Mrr@10
+     - type: cosine_map@100
+       value: 0.783515504381956
+       name: Cosine Map@100
+ ---
+
+ # SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
+
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Snowflake/snowflake-arctic-embed-l](https://huggingface.co/Snowflake/snowflake-arctic-embed-l). It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ - **Base model:** [Snowflake/snowflake-arctic-embed-l](https://huggingface.co/Snowflake/snowflake-arctic-embed-l) <!-- at revision d8fb21ca8d905d2832ee8b96c894d3298964346b -->
+ - **Maximum Sequence Length:** 512 tokens
+ - **Output Dimensionality:** 1024 dimensions
+ - **Similarity Function:** Cosine Similarity
+ <!-- - **Training Dataset:** Unknown -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+
+ ### Model Sources
+
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+
+ ### Full Model Architecture
+
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
+   (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+   (2): Normalize()
+ )
+ ```
+
+ ## Usage
+
+ ### Direct Usage (Sentence Transformers)
+
+ First install the Sentence Transformers library:
+
+ ```bash
+ pip install -U sentence-transformers
+ ```
+
+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("sentence_transformers_model_id")
+ # Run inference
+ sentences = [
+     'What are the main objectives of the directives mentioned in the text regarding greenhouse gas emissions and carbon dioxide storage, and how do they relate to environmental protection and sustainability within the European Union?',
+     '(24) Directive 2003/87/EC of the European Parliament and of the Council of 13 October 2003 establishing a scheme for greenhouse gas emission allowance trading within the Union and amending Council Directive 96/61/EC (OJ L 275, 25.10.2003, p. 32).\n\n(25) Directive 2009/31/EC of the European Parliament and of the Council of 23 April 2009 on the geological storage of carbon dioxide and amending Council Directive 85/337/EEC, European Parliament and Council Directives 2000/60/EC, 2001/80/EC, 2004/35/EC, 2006/12/EC, 2008/1/EC and Regulation (EC) No 1013/2006 (OJ L 140, 5.6.2009, p. 114).\n\n(26) Directive 2014/23/EU of the European Parliament and of the Council of 26 February 2014 on the award of concession contracts (OJ L 94, 28.3.2014, p. 1).',
+     'Article 33\n\nResponsibility and liability for drawing up and publishing the financial statements and the management report\n\n▼M4\n\n1.',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # [3, 1024]
+
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities.shape)
+ # [3, 3]
+ ```
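Note on shorter embeddings: the front matter tags this model with `loss:MatryoshkaLoss`, and the training section lists dimensions 1024, 768, 512, 256, 128 and 64. With that kind of training, the leading components of an embedding are meant to remain useful when the rest is dropped. The sketch below illustrates the idea on toy 8-dimensional vectors in plain Python: keep the first `dim` values, re-normalize, then compare. The helper names and the vectors are made up for illustration. Recent Sentence Transformers releases also accept a `truncate_dim` argument when constructing `SentenceTransformer`; check the library documentation for the installed version before relying on it.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

def similarity_at(a, b, dim):
    """Cosine similarity of two embeddings after truncating both to `dim`."""
    a_t = truncate_and_normalize(a, dim)
    b_t = truncate_and_normalize(b, dim)
    return sum(x * y for x, y in zip(a_t, b_t))

# Toy stand-ins for a query and a passage embedding (not real model output).
query = [0.5, 0.5, 0.5, 0.5, 0.1, -0.1, 0.0, 0.2]
passage = [0.4, 0.6, 0.5, 0.5, -0.2, 0.1, 0.1, 0.0]

for dim in (8, 4, 2):
    print(dim, round(similarity_at(query, passage, dim), 4))
```

Truncation trades some retrieval quality for smaller indexes and faster search. This commit reports metrics only for the full embedding, so the effect at smaller sizes would need to be measured separately.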
+
+ <!--
+ ### Direct Usage (Transformers)
+
+ <details><summary>Click to see the direct usage in Transformers</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+
+ You can finetune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ ## Evaluation
+
+ ### Metrics
+
+ #### Information Retrieval
+
+ * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
+
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | cosine_accuracy@1   | 0.666      |
+ | cosine_accuracy@3   | 0.8842     |
+ | cosine_accuracy@5   | 0.9313     |
+ | cosine_accuracy@10  | 0.9672     |
+ | cosine_precision@1  | 0.666      |
+ | cosine_precision@3  | 0.2947     |
+ | cosine_precision@5  | 0.1863     |
+ | cosine_precision@10 | 0.0967     |
+ | cosine_recall@1     | 0.666      |
+ | cosine_recall@3     | 0.8842     |
+ | cosine_recall@5     | 0.9313     |
+ | cosine_recall@10    | 0.9672     |
+ | **cosine_ndcg@10**  | **0.8278** |
+ | cosine_mrr@10       | 0.7818     |
+ | cosine_map@100      | 0.7835     |
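Note on reading the table: recall@k equals accuracy@k in every row, and precision@k equals accuracy@k divided by k (for example 0.8842 / 3 ≈ 0.2947). That pattern is what one would expect if every evaluation query has exactly one relevant passage. The commit does not say so, so treat it as an assumption. Under that assumption, the metrics can be sketched in plain Python as follows. The function name `ir_metrics` and the toy rankings are illustrative, not part of the evaluator's API.

```python
def ir_metrics(rankings, relevant, k):
    """Compute accuracy/precision/recall/MRR at k.

    rankings: one ranked list of document ids per query (best first).
    relevant: the single relevant document id for each query (assumption).
    """
    hits = 0
    reciprocal_rank_sum = 0.0
    for ranked, rel in zip(rankings, relevant):
        top_k = ranked[:k]
        if rel in top_k:
            hits += 1
            reciprocal_rank_sum += 1.0 / (top_k.index(rel) + 1)
    n = len(rankings)
    accuracy = hits / n
    return {
        "accuracy": accuracy,       # fraction of queries with a hit in the top k
        "precision": accuracy / k,  # at most one of the k results can be relevant
        "recall": accuracy,         # the only relevant document was found or not
        "mrr": reciprocal_rank_sum / n,
    }

# Three toy queries: hit at rank 1, hit at rank 2, and a miss.
rankings = [["d1", "d2", "d3"], ["d5", "d4", "d6"], ["d9", "d8", "d7"]]
relevant = ["d1", "d4", "d0"]

metrics = ir_metrics(rankings, relevant, k=3)
print(metrics)
```

With a single relevant passage per query, precision@k is bounded by 1/k, which is why the precision values in the table shrink as k grows even though more queries are answered.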
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
478
+ ## Training Details
479
+
480
+ ### Training Dataset
481
+
482
+ #### Unnamed Dataset
483
+
484
+ * Size: 46,338 training samples
485
+ * Columns: <code>sentence_0</code> and <code>sentence_1</code>
486
+ * Approximate statistics based on the first 1000 samples:
487
+ | | sentence_0 | sentence_1 |
488
+ |:--------|:------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|
489
+ | type | string | string |
490
+ | details | <ul><li>min: 11 tokens</li><li>mean: 35.24 tokens</li><li>max: 206 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 193.39 tokens</li><li>max: 512 tokens</li></ul> |
491
+ * Samples:
492
+ | sentence_0 | sentence_1 |
493
+ |:-------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
494
+ | <code>How is materiality defined in the context of an entity's sustainability reporting as per QC 4?</code> | <code>QC 4. Materiality is an entity-specific aspect of relevance based on the nature or magnitude, or both, of the items to which the information relates, as assessed in the context of the undertaking’s sustainability reporting (see chapter 3 of this Standard).<br><br>Faithful representation<br><br>QC 5. To be useful, the information must not only represent relevant phenomena, it must also faithfully represent the substance of the phenomena that it purports to represent. Faithful representation requires information to be (i) complete, (ii) neutral and (iii) accurate.</code> |
495
+ | <code>What procedure must be followed for the adoption of implementing acts as mentioned in the text?</code> | <code>Those implementing acts shall be adopted in accordance with the examination procedure referred to in Article 22a(2).<br><br>3.<br><br>Articles 9, 9a and 10 shall apply to maritime transport activities in the same manner as they apply to other activities covered by the EU ETS with the following exception with regard to the application of Article 10.</code> |
496
+ | <code>How should monitoring points be distributed for groundwater bodies that flow across Member State boundaries to effectively estimate groundwater flow?</code> | <code>The network shall include sufficient representative monitoring points to estimate the groundwater level in each groundwater body or group of bodies taking into account short and long-term variations in recharge and in particular:<br><br>— for groundwater bodies identified as being at risk of failing to achieve environmental objectives under Article 4, ensure sufficient density of monitoring points to assess the impact of abstractions and discharges on the groundwater level,<br><br>— for groundwater bodies within which groundwater flows across a Member State boundary, ensure sufficient monitoring points are provided to estimate the direction and rate of groundwater flow across the Member State boundary.<br><br>2.2.3. Monitoring frequency</code> |
497
+ * Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
498
+ ```json
499
+ {
500
+ "loss": "MultipleNegativesRankingLoss",
501
+ "matryoshka_dims": [
502
+ 1024,
503
+ 768,
504
+ 512,
505
+ 256,
506
+ 128,
507
+ 64
508
+ ],
509
+ "matryoshka_weights": [
510
+ 1,
511
+ 1,
512
+ 1,
513
+ 1,
514
+ 1,
515
+ 1
516
+ ],
517
+ "n_dims_per_step": -1
518
+ }
519
+ ```
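Note on the loss configuration: `MatryoshkaLoss` wraps `MultipleNegativesRankingLoss`, so the ranking loss is evaluated on embeddings truncated to each listed dimension and the results are combined using the listed weights (all 1 here). The sketch below is a simplified plain-Python illustration of that idea, not the library's implementation. It assumes cosine similarity scaled by 20 (believed to be the Sentence Transformers default for this loss, but unverified here), treats the other passages in the batch as negatives, and uses toy 4-dimensional vectors with dimensions `[4, 2]`.

```python
import math

def normalize(v):
    """Rescale a vector to unit length (left unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def ranking_loss(queries, passages, scale=20.0):
    """In-batch-negatives loss: passage i is the positive for query i."""
    q = [normalize(v) for v in queries]
    p = [normalize(v) for v in passages]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [scale * sum(a * b for a, b in zip(qi, pj)) for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]  # cross-entropy with target index i
    return total / len(q)

def matryoshka_loss(queries, passages, dims, weights):
    """Weighted sum of the ranking loss over truncated embeddings."""
    loss = 0.0
    for dim, weight in zip(dims, weights):
        loss += weight * ranking_loss(
            [v[:dim] for v in queries], [v[:dim] for v in passages]
        )
    return loss

# Toy batch of two (query, passage) pairs; not real model output.
queries = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3]]
passages = [[0.9, 0.1, 0.0, 0.2], [0.1, 0.8, 0.3, 0.0]]

print(matryoshka_loss(queries, passages, dims=[4, 2], weights=[1, 1]))
```

Because every truncation length contributes to the objective, the model is pushed to place the most discriminative information in the leading components.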
+
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+
+ - `eval_strategy`: steps
+ - `multi_dataset_batch_sampler`: round_robin
+
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: steps
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 8
+ - `per_device_eval_batch_size`: 8
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 5e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1
+ - `num_train_epochs`: 3
+ - `max_steps`: -1
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.0
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: False
+ - `fp16`: False
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: None
+ - `hub_always_push`: False
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `include_for_metrics`: []
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`:
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `dispatch_batches`: None
+ - `split_batches`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `use_liger_kernel`: False
+ - `eval_use_gather_object`: False
+ - `average_tokens_across_devices`: False
+ - `prompts`: None
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: round_robin
+
+ </details>
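Note on the learning-rate settings: with `lr_scheduler_type: linear`, `warmup_steps: 0` and `learning_rate: 5e-05`, the learning rate would be expected to start at its peak and decay linearly to zero over the run. The sketch below shows that schedule in plain Python. The default of 17379 total steps is taken from the final step in this README's training log, and the function name `linear_lr` is illustrative rather than a library call.

```python
def linear_lr(step, base_lr=5e-5, total_steps=17379, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Learning rate at the start, at the end of each epoch, and at the final step.
for step in (0, 5793, 11586, 17379):
    print(step, linear_lr(step))
```

Under these assumptions the learning rate has dropped to about two thirds of its peak after the first epoch and reaches zero at the last step.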
+
+ ### Training Logs
+ | Epoch  | Step  | Training Loss | cosine_ndcg@10 |
+ |:------:|:-----:|:-------------:|:--------------:|
+ | 0.0863 | 500   | 0.938         | -              |
+ | 0.1726 | 1000  | 0.2188        | -              |
+ | 0.2589 | 1500  | 0.1998        | -              |
+ | 0.3452 | 2000  | 0.2162        | 0.7843         |
+ | 0.4316 | 2500  | 0.1921        | -              |
+ | 0.5179 | 3000  | 0.1749        | -              |
+ | 0.6042 | 3500  | 0.1741        | -              |
+ | 0.6905 | 4000  | 0.2007        | 0.7779         |
+ | 0.7768 | 4500  | 0.1456        | -              |
+ | 0.8631 | 5000  | 0.1034        | -              |
+ | 0.9494 | 5500  | 0.1285        | -              |
+ | 1.0    | 5793  | -             | 0.7806         |
+ | 1.0357 | 6000  | 0.1011        | 0.7879         |
+ | 1.1220 | 6500  | 0.065         | -              |
+ | 1.2084 | 7000  | 0.0754        | -              |
+ | 1.2947 | 7500  | 0.067         | -              |
+ | 1.3810 | 8000  | 0.059         | 0.7953         |
+ | 1.4673 | 8500  | 0.0644        | -              |
+ | 1.5536 | 9000  | 0.0705        | -              |
+ | 1.6399 | 9500  | 0.0425        | -              |
+ | 1.7262 | 10000 | 0.0515        | 0.8171         |
+ | 1.8125 | 10500 | 0.0358        | -              |
+ | 1.8988 | 11000 | 0.0515        | -              |
+ | 1.9852 | 11500 | 0.043         | -              |
+ | 2.0    | 11586 | -             | 0.8201         |
+ | 2.0715 | 12000 | 0.0257        | 0.8208         |
+ | 2.1578 | 12500 | 0.0343        | -              |
+ | 2.2441 | 13000 | 0.0307        | -              |
+ | 2.3304 | 13500 | 0.0324        | -              |
+ | 2.4167 | 14000 | 0.0225        | 0.8236         |
+ | 2.5030 | 14500 | 0.0362        | -              |
+ | 2.5893 | 15000 | 0.0255        | -              |
+ | 2.6756 | 15500 | 0.0203        | -              |
+ | 2.7620 | 16000 | 0.0244        | 0.8240         |
+ | 2.8483 | 16500 | 0.0461        | -              |
+ | 2.9346 | 17000 | 0.0226        | -              |
+ | 3.0    | 17379 | -             | 0.8278         |
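Note on the step counts: the epoch boundaries in the log (5793, 11586, 17379) follow from the dataset size and batch size reported in this README, assuming training on a single device with no gradient accumulation and with the last partial batch kept (`dataloader_drop_last: False`). A short arithmetic check:

```python
import math

dataset_size = 46338   # "Size: 46,338 training samples"
batch_size = 8         # per_device_train_batch_size
num_epochs = 3         # num_train_epochs

# The last, partial batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(dataset_size / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)        # 5793 17379
print(round(500 / steps_per_epoch, 4))     # 0.0863, the epoch value logged at step 500
```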
+
+
+ ### Framework Versions
+ - Python: 3.10.15
+ - Sentence Transformers: 3.4.1
+ - Transformers: 4.49.0
+ - PyTorch: 2.6.0+cu126
+ - Accelerate: 1.5.2
+ - Datasets: 3.4.1
+ - Tokenizers: 0.21.1
+
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ #### MatryoshkaLoss
+ ```bibtex
+ @misc{kusupati2024matryoshka,
+     title={Matryoshka Representation Learning},
+     author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
+     year={2024},
+     eprint={2205.13147},
+     archivePrefix={arXiv},
+     primaryClass={cs.LG}
+ }
+ ```
+
+ #### MultipleNegativesRankingLoss
+ ```bibtex
+ @misc{henderson2017efficient,
+     title={Efficient Natural Language Response Suggestion for Smart Reply},
+     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+     year={2017},
+     eprint={1705.00652},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
739
+
740
+ <!--
741
+ ## Glossary
742
+
743
+ *Clearly define terms in order to be accessible across audiences.*
744
+ -->
745
+
746
+ <!--
747
+ ## Model Card Authors
748
+
749
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
750
+ -->
751
+
752
+ <!--
753
+ ## Model Card Contact
754
+
755
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
756
+ -->
config.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "_name_or_path": "Snowflake/snowflake-arctic-embed-l",
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 24,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.49.0",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "__version__": {
+     "sentence_transformers": "3.4.1",
+     "transformers": "4.49.0",
+     "pytorch": "2.6.0+cu126"
+   },
+   "prompts": {
+     "query": "Represent this sentence for searching relevant passages: "
+   },
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
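Because `default_prompt_name` is null, the `query` prompt is only applied when it is requested by name; passages are encoded without a prefix. A minimal sketch of that prefixing behaviour, in plain Python (the helper `apply_prompt` and the example strings are hypothetical and only mimic the string concatenation, not the library internals):

```python
# Values copied from config_sentence_transformers.json above.
PROMPTS = {"query": "Represent this sentence for searching relevant passages: "}
DEFAULT_PROMPT_NAME = None


def apply_prompt(text, prompt_name=DEFAULT_PROMPT_NAME):
    """Hypothetical helper: prepend the named prompt, or leave text unchanged."""
    if prompt_name is None:
        return text
    return PROMPTS[prompt_name] + text


query = apply_prompt("what is CLS pooling?", prompt_name="query")
passage = apply_prompt("CLS pooling uses the first token's embedding.")
print(query)
print(passage)
```

With Sentence Transformers itself, the corresponding call would be `model.encode(queries, prompt_name="query")` for queries and a plain `model.encode(passages)` for documents.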
eval/Information-Retrieval_evaluation_results.csv ADDED
@@ -0,0 +1,4 @@
+ epoch,steps,cosine-Accuracy@1,cosine-Accuracy@3,cosine-Accuracy@5,cosine-Accuracy@10,cosine-Precision@1,cosine-Recall@1,cosine-Precision@3,cosine-Recall@3,cosine-Precision@5,cosine-Recall@5,cosine-Precision@10,cosine-Recall@10,cosine-MRR@10,cosine-NDCG@10,cosine-MAP@100
+ 1.0,5793,0.6067667875021577,0.835663732090454,0.891765924391507,0.935957189711721,0.6067667875021577,0.6067667875021577,0.27855457736348466,0.835663732090454,0.17835318487830137,0.891765924391507,0.0935957189711721,0.935957189711721,0.7297235442885384,0.7806328668890359,0.7328065287048572
+ 2.0,11586,0.659243915069912,0.871569135163128,0.9214569307785259,0.9615052649749698,0.659243915069912,0.659243915069912,0.290523045054376,0.871569135163128,0.18429138615570514,0.9214569307785259,0.09615052649749696,0.9615052649749698,0.7736913598513812,0.8201268727574789,0.7756028742712455
+ 3.0,17379,0.6659761781460383,0.8841705506645952,0.9312963921974797,0.9672017952701536,0.6659761781460383,0.6659761781460383,0.29472351688819837,0.8841705506645952,0.1862592784394959,0.9312963921974797,0.09672017952701535,0.9672017952701536,0.7818480980055302,0.8278291318026204,0.783515504381956
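In every row, Accuracy@k equals Recall@k and Precision@k equals Recall@k divided by k. That pattern is what one would expect if each evaluation query has exactly one relevant passage, although the CSV does not state this directly. A short check on the epoch 3.0 row, with the values copied from the file (the function name is illustrative):

```python
import math

# (k, precision@k, recall@k) taken from the epoch 3.0 row of the CSV above.
EPOCH_3 = [
    (1, 0.6659761781460383, 0.6659761781460383),
    (3, 0.29472351688819837, 0.8841705506645952),
    (5, 0.1862592784394959, 0.9312963921974797),
    (10, 0.09672017952701535, 0.9672017952701536),
]


def consistent_with_single_relevant(k, precision, recall):
    """With one relevant passage per query, precision@k * k equals recall@k."""
    return math.isclose(precision * k, recall, rel_tol=1e-9)


print(all(consistent_with_single_relevant(*row) for row in EPOCH_3))
```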
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c4714f59abe642537ff6370e30d2d97e5807b3222c2ac015a42456e4eeb7f6b4
+ size 1336413848
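The pointer's `size` can be cross-checked against `config.json`. The sketch below counts float32 parameters for a BERT encoder with the dimensions listed there, assuming the standard BERT layout (embeddings plus 24 identical layers) and assuming no pooler head is stored. The leftover bytes are presumably the safetensors header, but that is an inference rather than something the commit states.

```python
# Values copied from config.json and the model.safetensors pointer above.
VOCAB, HIDDEN, FFN, LAYERS = 30522, 1024, 4096, 24
MAX_POS, TYPE_VOCAB = 512, 2
FILE_SIZE_BYTES = 1336413848
BYTES_PER_FLOAT32 = 4


def linear(n_in, n_out):
    """Parameter count of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # scale and shift vectors

# Word, position, and token-type embeddings, followed by one LayerNorm.
embeddings = (VOCAB + MAX_POS + TYPE_VOCAB) * HIDDEN + layer_norm
per_layer = (
    4 * linear(HIDDEN, HIDDEN)  # query, key, value, attention output
    + layer_norm
    + linear(HIDDEN, FFN)
    + linear(FFN, HIDDEN)
    + layer_norm
)
total_params = embeddings + LAYERS * per_layer  # assumption: no pooler head
remainder = FILE_SIZE_BYTES - total_params * BYTES_PER_FLOAT32

print(total_params, remainder)
```

Under these assumptions the weights account for all but a few tens of kilobytes of the file.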
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
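The three modules run in order: the Transformer produces per-token embeddings, the Pooling module reduces them to one vector (the `1_Pooling` config earlier in this commit sets `pooling_mode_cls_token` to true), and Normalize scales that vector to unit length. A toy sketch of the last two stages in plain Python, with hypothetical helper names and a 4-dimensional example in place of the real 1024 dimensions:

```python
import math


def cls_pool(token_embeddings):
    """Return the first ([CLS]) token vector of one sequence."""
    return token_embeddings[0]


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


# Toy 3-token sequence; the first row plays the role of the [CLS] embedding.
tokens = [[3.0, 4.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0], [0.0, 2.0, 0.0, 0.0]]
sentence_embedding = l2_normalize(cls_pool(tokens))
print(sentence_embedding)
```

Because the output has unit length, the cosine similarity named in `similarity_fn_name` reduces to a dot product between embeddings.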
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,63 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "max_length": 512,
+   "model_max_length": 512,
+   "pad_to_multiple_of": null,
+   "pad_token": "[PAD]",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "[SEP]",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff