IrvinTopi committed
Commit ac6f9a2 · verified · 1 Parent(s): 2f467ca

Update README.md

Files changed (1)
  1. README.md +66 -784
README.md CHANGED
@@ -1,805 +1,87 @@
1
  ---
2
  tags:
3
- - sentence-transformers
4
  - sentence-similarity
5
- - feature-extraction
6
- - dense
7
- - generated_from_trainer
8
- - dataset_size:12753278
9
- - loss:MarginMSELoss
10
- base_model: PaDaS-Lab/xlm-roberta-base-msmarco
11
- widget:
12
- - source_sentence: Hur gammal måste jag vara för att betta på Melodifestivalen?
13
- sentences:
14
- - För att registrera ett spelkonto och betta online måste du vara över 18 år i Sverige.
15
- - Ja, för att delta i PrisPicks plattformen måste man vara minst 18 år gammal. Denna
16
- åldersgräns kan vara högre i vissa jurisdiktioner, så det rekommenderas att potentiella
17
- användare kontrollerar de specifika ålderskraven för deras plats.
18
- - Bu sorunun cevabı barındırdığınız sistemin büyüklüğüne, trafiğine ve optimizasyonuna
19
- göre değişmektedir. Çok iyi optimize edilmiş bir sisteminiz (script vb.) ve gelen
20
- trafiği optimum düzeyde karşılayacak donanımınız varsa minimum düzeyde bir sunucu
21
- yeterli olacaktır. Ancak iyi optimize edilmemiş bir sistem ve sunucu için farklı
22
- alternatifler aramanız gerekebilir. En iyi sunucu nedir sorusunun cevabı, sisteminize
23
- ve trafiğinize göre değişebilir.
24
- - source_sentence: क्या लॉजिकल रीजनिंग यूजीसी नेट परीक्षा का हिस्सा है?
25
- sentences:
26
- - हां, लॉजिकल रीजनिंग यूजीसी नेट परीक्षा का हिस्सा है।
27
- - 'Per il momento, non sono ancora entrate in vigore sul massimale minimo per le
28
- polizze rc professionale medici. Teniamo conto però di una cosa: se si lavora
29
- (e si è lavorato nei dieci anni precedenti) esclusivamente come dipendenti o specializzandi
30
- presso l’SSN, dobbiamo sapere che la rivalsa massima dell’SSN sarà plafonata al
31
- triplo del reddito annuo lordo del medico.
32
-
33
- Se invece si lavora in libera professione, non c’è alcun limite. Consigliamo comunque
34
- di scegliere massimali non inferiori al milione di euro.'
35
- - यूजीसी नेट की परीक्षा साल में दो बार आयोजित की जाती है। प्रथम परीक्षा जून में
36
- और द्वितीय परीक्षा दिसंबर महीने में नेशनल टेस्टिंग एजेंसी द्वारा आयोजित की जाती
37
- है।
38
- - source_sentence: Car Accident Lawyer in Denver, CO
39
- sentences:
40
- - A Vinsa Telêmaco Borba é uma empresa com sede no município de e fica localizada
41
- na Al. Washington Luiz, 490 – Alto das Oliveiras – Telêmaco Borba – PR.
42
- - When you are in need of a skilled car accident lawyer, a lawyer in the Denver,
43
- CO area, don’t wait to talk to a lawyer from The Law Offices of Cliff Enten. With
44
- years of legal experience, they have provided excellent results for their clients
45
- in many types of personal injury areas. They are knowledgeable about the tactics
46
- the other party will use to get the highest possible compensation amount. For
47
- more information set up an appointment with them now.
48
- - After seeking immediate medical attention for your injuries, you may have your
49
- case evaluated by a professional Denver personal injury lawyer at Mintz Law Firm
50
- and obtain expert legal advice. Regardless of the type of accident or event and
51
- the apparent extent of your injuries, a lawyer can help you pursue the compensation
52
- owed to you because the negligent parties are responsible for the damage. After
53
- any car accident, slip and fall, or animal bite, don’t delay contacting our legal
54
- team to help you strategize and take care of your case.
55
- - source_sentence: Wie wird Blood Suckers gespielt?
56
- sentences:
57
- - Ja, die erste Version von Blood Suckers war so erfolgreich, dass es mittlerweile
58
- sogar eine zweite Variante gibt. Der Automat Blood Suckers 2 wurde 2017 veröffentlicht.
59
- Hier gibt es das gleiche Motto und 25 Gewinnlinien. Zwar wird die Geschichte diesmal
60
- weiter erzählt.
61
- - La bozza del prodotto ordinato ti arriverà entro 24 ore dal tuo acquisto tramite
62
- e-mail. Se non riesci a visualizzarla ti consigliamo di controllare nello spam
63
- della tua posta elettronica. Non ti preoccupare puoi trovare la tua bozza anche
64
- nella sezione i miei ordini. In corrispondenza del prodotto acquistato troverai
65
- un pulsante
66
- - Sie spielen auf einem Raster von 5 Walzen und drei Reihen mit 25 Gewinnlinien.
67
- Beim Blood Suckers Spiel gewinnen Sie von links nach rechts.
68
- - source_sentence: Kiedy rozpoczyna się kurs?
69
- sentences:
70
- - 'Para convertirte en representante de ventas debes entender lo que son los productos
71
- de tecnología o del sector turístico para vender de manera efectiva. Utiliza tus
72
- conexiones y conocimiento, obtén clientes y genera ingresos. Debes familiarizarte
73
- con los procesos de ventas y marketing de productos de tecnología y/o de viaje,
74
- y estar activamente preparado para promover la plataforma moonstride.
75
-
76
- Para unirte a nuestro programa, tu negocio debe ser legítimo y estar bien establecido.
77
-
78
- moonstride realizará un proceso de selección para asegurar la calidad de nuestros
79
- representantes'
80
- - Nowe kursy grupowe zaczynają się w połowie lutego. Dokładna data jest przekazana
81
- kursantom tydzień przed rozpoczęciem kursu.
82
- - Kurs startuje dla Ciebie w momencie, kiedy się na niego zapiszesz. Chwilę później
83
- otrzymasz pierwszą lekcję oraz obiecany do niej BONUS w postaci e-booka „100+
84
- nieoczywistych pomysłów na e-sklep”. Kolejne lekcje będziemy wysyłać co 3 dni.
85
- pipeline_tag: sentence-similarity
86
- library_name: sentence-transformers
87
  ---
88
 
89
- # SentenceTransformer based on PaDaS-Lab/xlm-roberta-base-msmarco
90
-
91
- This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [PaDaS-Lab/xlm-roberta-base-msmarco](https://huggingface.co/PaDaS-Lab/xlm-roberta-base-msmarco). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
92
-
93
- ## Model Details
94
-
95
- ### Model Description
96
- - **Model Type:** Sentence Transformer
97
- - **Base model:** [PaDaS-Lab/xlm-roberta-base-msmarco](https://huggingface.co/PaDaS-Lab/xlm-roberta-base-msmarco) <!-- at revision cd02f4c38b71baa0dc6b3fcdd86a3b6bd407ef55 -->
98
- - **Maximum Sequence Length:** 512 tokens
99
- - **Output Dimensionality:** 768 dimensions
100
- - **Similarity Function:** Cosine Similarity
101
- <!-- - **Training Dataset:** Unknown -->
102
- <!-- - **Language:** Unknown -->
103
- <!-- - **License:** Unknown -->
104
-
105
- ### Model Sources
106
-
107
- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
108
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/huggingface/sentence-transformers)
109
- - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
110
-
111
- ### Full Model Architecture
112
-
113
- ```
114
- SentenceTransformer(
115
- (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'XLMRobertaModel'})
116
- (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
117
- )
118
- ```
119
-
120
- ## Usage
121
-
122
- ### Direct Usage (Sentence Transformers)
123
-
124
- First install the Sentence Transformers library:
125
-
126
- ```bash
127
- pip install -U sentence-transformers
128
- ```
129
-
130
- Then you can load this model and run inference.
131
- ```python
132
- from sentence_transformers import SentenceTransformer
133
-
134
- # Download from the 🤗 Hub
135
- model = SentenceTransformer("sentence_transformers_model_id")
136
- # Run inference
137
- sentences = [
138
- 'Kiedy rozpoczyna się kurs?',
139
- 'Kurs startuje dla Ciebie w momencie, kiedy się na niego zapiszesz. Chwilę później otrzymasz pierwszą lekcję oraz obiecany do niej BONUS w postaci e-booka „100+ nieoczywistych pomysłów na e-sklep”. Kolejne lekcje będziemy wysyłać co 3 dni.',
140
- 'Nowe kursy grupowe zaczynają się w połowie lutego. Dokładna data jest przekazana kursantom tydzień przed rozpoczęciem kursu.',
141
- ]
142
- embeddings = model.encode(sentences)
143
- print(embeddings.shape)
144
- # [3, 768]
145
-
146
- # Get the similarity scores for the embeddings
147
- similarities = model.similarity(embeddings, embeddings)
148
- print(similarities)
149
- # tensor([[1.0000, 0.9920, 0.9929],
150
- # [0.9920, 1.0000, 0.9964],
151
- # [0.9929, 0.9964, 1.0000]])
152
- ```
153
-
154
- <!--
155
- ### Direct Usage (Transformers)
156
-
157
- <details><summary>Click to see the direct usage in Transformers</summary>
158
-
159
- </details>
160
- -->
161
 
162
- <!--
163
- ### Downstream Usage (Sentence Transformers)
164
 
165
- You can finetune this model on your own dataset.
 
 
166
 
167
- <details><summary>Click to expand</summary>
168
 
169
- </details>
170
- -->
171
 
172
- <!--
173
- ### Out-of-Scope Use
 
174
 
175
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
176
- -->
 
177
 
178
- <!--
179
- ## Bias, Risks and Limitations
 
 
 
180
 
181
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
182
- -->
183
 
184
- <!--
185
- ### Recommendations
186
 
187
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
188
- -->
 
 
 
 
189
 
190
- ## Training Details
191
-
192
- ### Training Dataset
193
-
194
- #### Unnamed Dataset
195
-
196
- * Size: 12,753,278 training samples
197
- * Columns: <code>sentence_0</code>, <code>sentence_1</code>, <code>sentence_2</code>, and <code>label</code>
198
- * Approximate statistics based on the first 1000 samples:
199
- | | sentence_0 | sentence_1 | sentence_2 | label |
200
- |:--------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|:-----------------------------------------------------------------|
201
- | type | string | string | string | float |
202
- | details | <ul><li>min: 6 tokens</li><li>mean: 15.49 tokens</li><li>max: 119 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 74.63 tokens</li><li>max: 512 tokens</li></ul> | <ul><li>min: 10 tokens</li><li>mean: 102.27 tokens</li><li>max: 512 tokens</li></ul> | <ul><li>min: -0.97</li><li>mean: 0.43</li><li>max: 1.0</li></ul> |
203
- * Samples:
204
- | sentence_0 | sentence_1 | sentence_2 | label |
205
- |:-----------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------|
206
- | <code>Как найти актуальное зеркало Betwinner?</code> | <code>Букмекер старается обеспечить доступ к сайту, поэтому ссылки на зеркала обновляются ежедневно. Чтобы быть в курсе всех новостей, рекомендуется подписаться на почтовую рассылку и соцсети.</code> | <code>Зеркало BetWinner можно найти в интернете</code> | <code>-0.9661979675292969</code> |
207
- | <code>Jakie są minimalne zakłady w bakaracie?</code> | <code>Minimalny zakład w bakaracie zależy od konkretnej gry, w którą grasz. Mini Baccarat ma zwykle niskie limity zakładów, co czyni go atrakcyjnym dla nowych graczy. Istnieją też wersje gry w bakarata dla high-rollerów, które nakładają wyższy minimalny zakład.</code> | <code>Bakarat to jedna z popularniejszych gier hazardowych. To także najprostsza karciana gra kasynowa. Bakarat charakteryzuje się także niską przewagą kasyna nad graczem. Co za tym idzie stawki wygranych w niej nie są wysokie (najlepiej wyceniany jest zakład na remis, ale wtedy przewaga kasyna bardzo wzrasta - 14,44 proc. przy grze 6 taliami kart i 14,36 proc. przy grze 8 taliami kart). Przy zakładzie na gracza kasyno ma przewagę na poziomie 1,24 proc., a przy zakładzie na bankiera - na poziomie 1,06 proc.</code> | <code>0.579345703125</code> |
208
- | <code>Come scegliere il massimale assicurazione professionale medici?</code> | <code>Per il momento, non sono ancora entrate in vigore sul massimale minimo per le polizze rc professionale medici. Teniamo conto però di una cosa: se si lavora (e si è lavorato nei dieci anni precedenti) esclusivamente come dipendenti o specializzandi presso l’SSN, dobbiamo sapere che la rivalsa massima dell’SSN sarà plafonata al triplo del reddito annuo lordo del medico.<br>Se invece si lavora in libera professione, non c’è alcun limite. Consigliamo comunque di scegliere massimali non inferiori al milione di euro.</code> | <code>È un modello alternativo al modello standard dell’assicurazione malattia di base. Usufruite delle stesse prestazioni del modello standard dell’assicurazione malattia di base, ma pagate un premio inferiore. In cambio, accettate di consultare in primo luogo il medico di famiglia che avete scelto. Il medico di famiglia, chiamato anche «medico di primo ricorso (MPR)», vi cura e, se necessario, vi indirizza verso uno specialista. Ciò permette di evitare inutili consulti e contribuisce a ridurre i costi sanitari. Se il vostro medico di famiglia o un altro medico vi indirizza verso uno specialista, dovete chiedere al medico di rilasciarvi un attestato, chiamato anche «buono di delega». Alcuni medici ce lo inviano elettronicamente. In caso contrario, potete chiederlo al medico che vi ha raccomandato lo specialista (basta una semplice annotazione firmata, con indicati il tipo di specialista raccomandato e la durata di validità dell’attestato). Potete inviarci tale documento per posta o tramite ...</code> | <code>0.9451904296875</code> |
209
- * Loss: [<code>MarginMSELoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#marginmseloss)
210
-
211
- ### Training Hyperparameters
212
- #### Non-Default Hyperparameters
213
-
214
- - `per_device_train_batch_size`: 64
215
- - `per_device_eval_batch_size`: 64
216
- - `num_train_epochs`: 1
217
- - `fp16`: True
218
- - `multi_dataset_batch_sampler`: round_robin
219
-
220
- #### All Hyperparameters
221
- <details><summary>Click to expand</summary>
222
-
223
- - `overwrite_output_dir`: False
224
- - `do_predict`: False
225
- - `eval_strategy`: no
226
- - `prediction_loss_only`: True
227
- - `per_device_train_batch_size`: 64
228
- - `per_device_eval_batch_size`: 64
229
- - `per_gpu_train_batch_size`: None
230
- - `per_gpu_eval_batch_size`: None
231
- - `gradient_accumulation_steps`: 1
232
- - `eval_accumulation_steps`: None
233
- - `torch_empty_cache_steps`: None
234
- - `learning_rate`: 5e-05
235
- - `weight_decay`: 0.0
236
- - `adam_beta1`: 0.9
237
- - `adam_beta2`: 0.999
238
- - `adam_epsilon`: 1e-08
239
- - `max_grad_norm`: 1
240
- - `num_train_epochs`: 1
241
- - `max_steps`: -1
242
- - `lr_scheduler_type`: linear
243
- - `lr_scheduler_kwargs`: {}
244
- - `warmup_ratio`: 0.0
245
- - `warmup_steps`: 0
246
- - `log_level`: passive
247
- - `log_level_replica`: warning
248
- - `log_on_each_node`: True
249
- - `logging_nan_inf_filter`: True
250
- - `save_safetensors`: True
251
- - `save_on_each_node`: False
252
- - `save_only_model`: False
253
- - `restore_callback_states_from_checkpoint`: False
254
- - `no_cuda`: False
255
- - `use_cpu`: False
256
- - `use_mps_device`: False
257
- - `seed`: 42
258
- - `data_seed`: None
259
- - `jit_mode_eval`: False
260
- - `bf16`: False
261
- - `fp16`: True
262
- - `fp16_opt_level`: O1
263
- - `half_precision_backend`: auto
264
- - `bf16_full_eval`: False
265
- - `fp16_full_eval`: False
266
- - `tf32`: None
267
- - `local_rank`: 0
268
- - `ddp_backend`: None
269
- - `tpu_num_cores`: None
270
- - `tpu_metrics_debug`: False
271
- - `debug`: []
272
- - `dataloader_drop_last`: False
273
- - `dataloader_num_workers`: 0
274
- - `dataloader_prefetch_factor`: None
275
- - `past_index`: -1
276
- - `disable_tqdm`: False
277
- - `remove_unused_columns`: True
278
- - `label_names`: None
279
- - `load_best_model_at_end`: False
280
- - `ignore_data_skip`: False
281
- - `fsdp`: []
282
- - `fsdp_min_num_params`: 0
283
- - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
284
- - `fsdp_transformer_layer_cls_to_wrap`: None
285
- - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
286
- - `parallelism_config`: None
287
- - `deepspeed`: None
288
- - `label_smoothing_factor`: 0.0
289
- - `optim`: adamw_torch_fused
290
- - `optim_args`: None
291
- - `adafactor`: False
292
- - `group_by_length`: False
293
- - `length_column_name`: length
294
- - `project`: huggingface
295
- - `trackio_space_id`: trackio
296
- - `ddp_find_unused_parameters`: None
297
- - `ddp_bucket_cap_mb`: None
298
- - `ddp_broadcast_buffers`: False
299
- - `dataloader_pin_memory`: True
300
- - `dataloader_persistent_workers`: False
301
- - `skip_memory_metrics`: True
302
- - `use_legacy_prediction_loop`: False
303
- - `push_to_hub`: False
304
- - `resume_from_checkpoint`: None
305
- - `hub_model_id`: None
306
- - `hub_strategy`: every_save
307
- - `hub_private_repo`: None
308
- - `hub_always_push`: False
309
- - `hub_revision`: None
310
- - `gradient_checkpointing`: False
311
- - `gradient_checkpointing_kwargs`: None
312
- - `include_inputs_for_metrics`: False
313
- - `include_for_metrics`: []
314
- - `eval_do_concat_batches`: True
315
- - `fp16_backend`: auto
316
- - `push_to_hub_model_id`: None
317
- - `push_to_hub_organization`: None
318
- - `mp_parameters`:
319
- - `auto_find_batch_size`: False
320
- - `full_determinism`: False
321
- - `torchdynamo`: None
322
- - `ray_scope`: last
323
- - `ddp_timeout`: 1800
324
- - `torch_compile`: False
325
- - `torch_compile_backend`: None
326
- - `torch_compile_mode`: None
327
- - `include_tokens_per_second`: False
328
- - `include_num_input_tokens_seen`: no
329
- - `neftune_noise_alpha`: None
330
- - `optim_target_modules`: None
331
- - `batch_eval_metrics`: False
332
- - `eval_on_start`: False
333
- - `use_liger_kernel`: False
334
- - `liger_kernel_config`: None
335
- - `eval_use_gather_object`: False
336
- - `average_tokens_across_devices`: True
337
- - `prompts`: None
338
- - `batch_sampler`: batch_sampler
339
- - `multi_dataset_batch_sampler`: round_robin
340
- - `router_mapping`: {}
341
- - `learning_rate_mapping`: {}
342
-
343
- </details>
344
-
345
- ### Training Logs
346
- <details><summary>Click to expand</summary>
347
-
348
- | Epoch | Step | Training Loss |
349
- |:------:|:------:|:-------------:|
350
- | 0.0025 | 500 | 3.2897 |
351
- | 0.0050 | 1000 | 0.1515 |
352
- | 0.0075 | 1500 | 0.1374 |
353
- | 0.0100 | 2000 | 0.1319 |
354
- | 0.0125 | 2500 | 0.1322 |
355
- | 0.0151 | 3000 | 0.1294 |
356
- | 0.0176 | 3500 | 0.1254 |
357
- | 0.0201 | 4000 | 0.1234 |
358
- | 0.0226 | 4500 | 0.1201 |
359
- | 0.0251 | 5000 | 0.1196 |
360
- | 0.0276 | 5500 | 0.1215 |
361
- | 0.0301 | 6000 | 0.1174 |
362
- | 0.0326 | 6500 | 0.1184 |
363
- | 0.0351 | 7000 | 0.1176 |
364
- | 0.0376 | 7500 | 0.1152 |
365
- | 0.0401 | 8000 | 0.1141 |
366
- | 0.0427 | 8500 | 0.1137 |
367
- | 0.0452 | 9000 | 0.1144 |
368
- | 0.0477 | 9500 | 0.1132 |
369
- | 0.0502 | 10000 | 0.1123 |
370
- | 0.0527 | 10500 | 0.1117 |
371
- | 0.0552 | 11000 | 0.1117 |
372
- | 0.0577 | 11500 | 0.1102 |
373
- | 0.0602 | 12000 | 0.109 |
374
- | 0.0627 | 12500 | 0.1101 |
375
- | 0.0652 | 13000 | 0.1079 |
376
- | 0.0677 | 13500 | 0.1106 |
377
- | 0.0703 | 14000 | 0.1097 |
378
- | 0.0728 | 14500 | 0.1075 |
379
- | 0.0753 | 15000 | 0.1046 |
380
- | 0.0778 | 15500 | 0.1078 |
381
- | 0.0803 | 16000 | 0.1061 |
382
- | 0.0828 | 16500 | 0.1057 |
383
- | 0.0853 | 17000 | 0.1054 |
384
- | 0.0878 | 17500 | 0.1067 |
385
- | 0.0903 | 18000 | 0.1048 |
386
- | 0.0928 | 18500 | 0.1033 |
387
- | 0.0953 | 19000 | 0.104 |
388
- | 0.0979 | 19500 | 0.102 |
389
- | 0.1004 | 20000 | 0.1023 |
390
- | 0.1029 | 20500 | 0.101 |
391
- | 0.1054 | 21000 | 0.1035 |
392
- | 0.1079 | 21500 | 0.102 |
393
- | 0.1104 | 22000 | 0.1018 |
394
- | 0.1129 | 22500 | 0.1015 |
395
- | 0.1154 | 23000 | 0.1003 |
396
- | 0.1179 | 23500 | 0.1005 |
397
- | 0.1204 | 24000 | 0.0998 |
398
- | 0.1229 | 24500 | 0.099 |
399
- | 0.1255 | 25000 | 0.1001 |
400
- | 0.1280 | 25500 | 0.0979 |
401
- | 0.1305 | 26000 | 0.1001 |
402
- | 0.1330 | 26500 | 0.0995 |
403
- | 0.1355 | 27000 | 0.0992 |
404
- | 0.1380 | 27500 | 0.098 |
405
- | 0.1405 | 28000 | 0.0986 |
406
- | 0.1430 | 28500 | 0.0987 |
407
- | 0.1455 | 29000 | 0.0972 |
408
- | 0.1480 | 29500 | 0.0964 |
409
- | 0.1505 | 30000 | 0.0967 |
410
- | 0.1531 | 30500 | 0.0969 |
411
- | 0.1556 | 31000 | 0.0954 |
412
- | 0.1581 | 31500 | 0.0972 |
413
- | 0.1606 | 32000 | 0.0973 |
414
- | 0.1631 | 32500 | 0.096 |
415
- | 0.1656 | 33000 | 0.0952 |
416
- | 0.1681 | 33500 | 0.0974 |
417
- | 0.1706 | 34000 | 0.0945 |
418
- | 0.1731 | 34500 | 0.0936 |
419
- | 0.1756 | 35000 | 0.0945 |
420
- | 0.1782 | 35500 | 0.0946 |
421
- | 0.1807 | 36000 | 0.0942 |
422
- | 0.1832 | 36500 | 0.0955 |
423
- | 0.1857 | 37000 | 0.0948 |
424
- | 0.1882 | 37500 | 0.0925 |
425
- | 0.1907 | 38000 | 0.0929 |
426
- | 0.1932 | 38500 | 0.0934 |
427
- | 0.1957 | 39000 | 0.0939 |
428
- | 0.1982 | 39500 | 0.0933 |
429
- | 0.2007 | 40000 | 0.0937 |
430
- | 0.2032 | 40500 | 0.0916 |
431
- | 0.2058 | 41000 | 0.0932 |
432
- | 0.2083 | 41500 | 0.0921 |
433
- | 0.2108 | 42000 | 0.0912 |
434
- | 0.2133 | 42500 | 0.0906 |
435
- | 0.2158 | 43000 | 0.0905 |
436
- | 0.2183 | 43500 | 0.09 |
437
- | 0.2208 | 44000 | 0.0906 |
438
- | 0.2233 | 44500 | 0.092 |
439
- | 0.2258 | 45000 | 0.0906 |
440
- | 0.2283 | 45500 | 0.0908 |
441
- | 0.2308 | 46000 | 0.0916 |
442
- | 0.2334 | 46500 | 0.0907 |
443
- | 0.2359 | 47000 | 0.0899 |
444
- | 0.2384 | 47500 | 0.089 |
445
- | 0.2409 | 48000 | 0.0909 |
446
- | 0.2434 | 48500 | 0.0889 |
447
- | 0.2459 | 49000 | 0.0896 |
448
- | 0.2484 | 49500 | 0.088 |
449
- | 0.2509 | 50000 | 0.09 |
450
- | 0.2534 | 50500 | 0.0879 |
451
- | 0.2559 | 51000 | 0.0885 |
452
- | 0.2584 | 51500 | 0.0886 |
453
- | 0.2610 | 52000 | 0.0896 |
454
- | 0.2635 | 52500 | 0.0886 |
455
- | 0.2660 | 53000 | 0.0876 |
456
- | 0.2685 | 53500 | 0.0881 |
457
- | 0.2710 | 54000 | 0.0886 |
458
- | 0.2735 | 54500 | 0.0865 |
459
- | 0.2760 | 55000 | 0.0874 |
460
- | 0.2785 | 55500 | 0.0878 |
461
- | 0.2810 | 56000 | 0.0874 |
462
- | 0.2835 | 56500 | 0.0872 |
463
- | 0.2860 | 57000 | 0.0866 |
464
- | 0.2886 | 57500 | 0.0875 |
465
- | 0.2911 | 58000 | 0.0876 |
466
- | 0.2936 | 58500 | 0.0872 |
467
- | 0.2961 | 59000 | 0.0857 |
468
- | 0.2986 | 59500 | 0.0867 |
469
- | 0.3011 | 60000 | 0.0862 |
470
- | 0.3036 | 60500 | 0.0849 |
471
- | 0.3061 | 61000 | 0.0863 |
472
- | 0.3086 | 61500 | 0.0849 |
473
- | 0.3111 | 62000 | 0.0857 |
474
- | 0.3136 | 62500 | 0.084 |
475
- | 0.3162 | 63000 | 0.0857 |
476
- | 0.3187 | 63500 | 0.0853 |
477
- | 0.3212 | 64000 | 0.0849 |
478
- | 0.3237 | 64500 | 0.0842 |
479
- | 0.3262 | 65000 | 0.0851 |
480
- | 0.3287 | 65500 | 0.085 |
481
- | 0.3312 | 66000 | 0.0837 |
482
- | 0.3337 | 66500 | 0.0839 |
483
- | 0.3362 | 67000 | 0.0836 |
484
- | 0.3387 | 67500 | 0.0845 |
485
- | 0.3412 | 68000 | 0.0844 |
486
- | 0.3438 | 68500 | 0.0844 |
487
- | 0.3463 | 69000 | 0.0839 |
488
- | 0.3488 | 69500 | 0.084 |
489
- | 0.3513 | 70000 | 0.083 |
490
- | 0.3538 | 70500 | 0.0843 |
491
- | 0.3563 | 71000 | 0.082 |
492
- | 0.3588 | 71500 | 0.0834 |
493
- | 0.3613 | 72000 | 0.0826 |
494
- | 0.3638 | 72500 | 0.0833 |
495
- | 0.3663 | 73000 | 0.0843 |
496
- | 0.3688 | 73500 | 0.0821 |
497
- | 0.3714 | 74000 | 0.0822 |
498
- | 0.3739 | 74500 | 0.0823 |
499
- | 0.3764 | 75000 | 0.0818 |
500
- | 0.3789 | 75500 | 0.0836 |
501
- | 0.3814 | 76000 | 0.0813 |
502
- | 0.3839 | 76500 | 0.0829 |
503
- | 0.3864 | 77000 | 0.0828 |
504
- | 0.3889 | 77500 | 0.0799 |
505
- | 0.3914 | 78000 | 0.0819 |
506
- | 0.3939 | 78500 | 0.0815 |
507
- | 0.3964 | 79000 | 0.0812 |
508
- | 0.3990 | 79500 | 0.0803 |
509
- | 0.4015 | 80000 | 0.0819 |
510
- | 0.4040 | 80500 | 0.081 |
511
- | 0.4065 | 81000 | 0.0798 |
512
- | 0.4090 | 81500 | 0.0811 |
513
- | 0.4115 | 82000 | 0.0806 |
514
- | 0.4140 | 82500 | 0.0812 |
515
- | 0.4165 | 83000 | 0.0801 |
516
- | 0.4190 | 83500 | 0.0803 |
517
- | 0.4215 | 84000 | 0.0812 |
518
- | 0.4240 | 84500 | 0.0809 |
519
- | 0.4266 | 85000 | 0.0802 |
520
- | 0.4291 | 85500 | 0.0801 |
521
- | 0.4316 | 86000 | 0.08 |
522
- | 0.4341 | 86500 | 0.079 |
523
- | 0.4366 | 87000 | 0.0803 |
524
- | 0.4391 | 87500 | 0.08 |
525
- | 0.4416 | 88000 | 0.0802 |
526
- | 0.4441 | 88500 | 0.0799 |
527
- | 0.4466 | 89000 | 0.0795 |
528
- | 0.4491 | 89500 | 0.0787 |
529
- | 0.4516 | 90000 | 0.0784 |
530
- | 0.4542 | 90500 | 0.0781 |
531
- | 0.4567 | 91000 | 0.0802 |
532
- | 0.4592 | 91500 | 0.0781 |
533
- | 0.4617 | 92000 | 0.0796 |
534
- | 0.4642 | 92500 | 0.0774 |
535
- | 0.4667 | 93000 | 0.0794 |
536
- | 0.4692 | 93500 | 0.0786 |
537
- | 0.4717 | 94000 | 0.079 |
538
- | 0.4742 | 94500 | 0.0786 |
539
- | 0.4767 | 95000 | 0.0778 |
540
- | 0.4792 | 95500 | 0.0782 |
541
- | 0.4818 | 96000 | 0.0777 |
542
- | 0.4843 | 96500 | 0.0773 |
543
- | 0.4868 | 97000 | 0.0762 |
544
- | 0.4893 | 97500 | 0.0774 |
545
- | 0.4918 | 98000 | 0.0796 |
546
- | 0.4943 | 98500 | 0.0764 |
547
- | 0.4968 | 99000 | 0.0781 |
548
- | 0.4993 | 99500 | 0.0778 |
549
- | 0.5018 | 100000 | 0.0774 |
550
- | 0.5043 | 100500 | 0.0767 |
551
- | 0.5069 | 101000 | 0.0769 |
552
- | 0.5094 | 101500 | 0.0784 |
553
- | 0.5119 | 102000 | 0.0769 |
554
- | 0.5144 | 102500 | 0.0773 |
555
- | 0.5169 | 103000 | 0.0776 |
556
- | 0.5194 | 103500 | 0.0761 |
557
- | 0.5219 | 104000 | 0.0768 |
558
- | 0.5244 | 104500 | 0.0763 |
559
- | 0.5269 | 105000 | 0.0772 |
560
- | 0.5294 | 105500 | 0.076 |
561
- | 0.5319 | 106000 | 0.0776 |
562
- | 0.5345 | 106500 | 0.0768 |
563
- | 0.5370 | 107000 | 0.0754 |
564
- | 0.5395 | 107500 | 0.0759 |
565
- | 0.5420 | 108000 | 0.0764 |
566
- | 0.5445 | 108500 | 0.0764 |
567
- | 0.5470 | 109000 | 0.0766 |
568
- | 0.5495 | 109500 | 0.0762 |
569
- | 0.5520 | 110000 | 0.0749 |
570
- | 0.5545 | 110500 | 0.075 |
571
- | 0.5570 | 111000 | 0.0754 |
572
- | 0.5595 | 111500 | 0.0755 |
573
- | 0.5621 | 112000 | 0.0753 |
574
- | 0.5646 | 112500 | 0.0747 |
575
- | 0.5671 | 113000 | 0.0754 |
576
- | 0.5696 | 113500 | 0.0756 |
577
- | 0.5721 | 114000 | 0.074 |
578
- | 0.5746 | 114500 | 0.0759 |
579
- | 0.5771 | 115000 | 0.0755 |
580
- | 0.5796 | 115500 | 0.0757 |
581
- | 0.5821 | 116000 | 0.0744 |
582
- | 0.5846 | 116500 | 0.0732 |
583
- | 0.5871 | 117000 | 0.0745 |
584
- | 0.5897 | 117500 | 0.0748 |
585
- | 0.5922 | 118000 | 0.0724 |
586
- | 0.5947 | 118500 | 0.0739 |
587
- | 0.5972 | 119000 | 0.0749 |
588
- | 0.5997 | 119500 | 0.0755 |
589
- | 0.6022 | 120000 | 0.0735 |
590
- | 0.6047 | 120500 | 0.0742 |
591
- | 0.6072 | 121000 | 0.0738 |
592
- | 0.6097 | 121500 | 0.0733 |
593
- | 0.6122 | 122000 | 0.0728 |
594
- | 0.6147 | 122500 | 0.0745 |
595
- | 0.6173 | 123000 | 0.0741 |
596
- | 0.6198 | 123500 | 0.0726 |
597
- | 0.6223 | 124000 | 0.0744 |
598
- | 0.6248 | 124500 | 0.0743 |
599
- | 0.6273 | 125000 | 0.0732 |
600
- | 0.6298 | 125500 | 0.0731 |
601
- | 0.6323 | 126000 | 0.0729 |
602
- | 0.6348 | 126500 | 0.0737 |
603
- | 0.6373 | 127000 | 0.0735 |
604
- | 0.6398 | 127500 | 0.0738 |
605
- | 0.6423 | 128000 | 0.0731 |
606
- | 0.6449 | 128500 | 0.0736 |
607
- | 0.6474 | 129000 | 0.0728 |
608
- | 0.6499 | 129500 | 0.073 |
609
- | 0.6524 | 130000 | 0.0733 |
610
- | 0.6549 | 130500 | 0.073 |
611
- | 0.6574 | 131000 | 0.073 |
612
- | 0.6599 | 131500 | 0.0732 |
613
- | 0.6624 | 132000 | 0.0723 |
614
- | 0.6649 | 132500 | 0.0732 |
615
- | 0.6674 | 133000 | 0.0724 |
616
- | 0.6699 | 133500 | 0.0722 |
617
- | 0.6725 | 134000 | 0.0724 |
618
- | 0.6750 | 134500 | 0.0726 |
619
- | 0.6775 | 135000 | 0.0728 |
620
- | 0.6800 | 135500 | 0.0717 |
621
- | 0.6825 | 136000 | 0.0722 |
622
- | 0.6850 | 136500 | 0.0729 |
623
- | 0.6875 | 137000 | 0.0715 |
624
- | 0.6900 | 137500 | 0.072 |
625
- | 0.6925 | 138000 | 0.072 |
626
- | 0.6950 | 138500 | 0.0722 |
627
- | 0.6975 | 139000 | 0.0718 |
628
- | 0.7001 | 139500 | 0.0728 |
629
- | 0.7026 | 140000 | 0.0718 |
630
- | 0.7051 | 140500 | 0.0726 |
631
- | 0.7076 | 141000 | 0.0707 |
632
- | 0.7101 | 141500 | 0.072 |
633
- | 0.7126 | 142000 | 0.0706 |
634
- | 0.7151 | 142500 | 0.0706 |
635
- | 0.7176 | 143000 | 0.0708 |
636
- | 0.7201 | 143500 | 0.0717 |
637
- | 0.7226 | 144000 | 0.0713 |
638
- | 0.7251 | 144500 | 0.0723 |
639
- | 0.7277 | 145000 | 0.0709 |
640
- | 0.7302 | 145500 | 0.0709 |
641
- | 0.7327 | 146000 | 0.0706 |
642
- | 0.7352 | 146500 | 0.0713 |
643
- | 0.7377 | 147000 | 0.0709 |
644
- | 0.7402 | 147500 | 0.0703 |
645
- | 0.7427 | 148000 | 0.0709 |
646
- | 0.7452 | 148500 | 0.0702 |
647
- | 0.7477 | 149000 | 0.0705 |
648
- | 0.7502 | 149500 | 0.0707 |
649
- | 0.7527 | 150000 | 0.0702 |
650
- | 0.7553 | 150500 | 0.0696 |
651
- | 0.7578 | 151000 | 0.0701 |
652
- | 0.7603 | 151500 | 0.0707 |
653
- | 0.7628 | 152000 | 0.0703 |
654
- | 0.7653 | 152500 | 0.0703 |
655
- | 0.7678 | 153000 | 0.0711 |
656
- | 0.7703 | 153500 | 0.0706 |
657
- | 0.7728 | 154000 | 0.0701 |
658
- | 0.7753 | 154500 | 0.0699 |
659
- | 0.7778 | 155000 | 0.0704 |
660
- | 0.7803 | 155500 | 0.07 |
661
- | 0.7829 | 156000 | 0.0701 |
662
- | 0.7854 | 156500 | 0.0697 |
663
- | 0.7879 | 157000 | 0.0698 |
664
- | 0.7904 | 157500 | 0.0699 |
665
- | 0.7929 | 158000 | 0.069 |
666
- | 0.7954 | 158500 | 0.0703 |
667
- | 0.7979 | 159000 | 0.0696 |
668
- | 0.8004 | 159500 | 0.0701 |
669
- | 0.8029 | 160000 | 0.069 |
670
- | 0.8054 | 160500 | 0.0687 |
671
- | 0.8079 | 161000 | 0.069 |
672
- | 0.8105 | 161500 | 0.0692 |
673
- | 0.8130 | 162000 | 0.069 |
674
- | 0.8155 | 162500 | 0.0688 |
675
- | 0.8180 | 163000 | 0.0681 |
676
- | 0.8205 | 163500 | 0.0688 |
677
- | 0.8230 | 164000 | 0.0699 |
678
- | 0.8255 | 164500 | 0.0677 |
679
- | 0.8280 | 165000 | 0.0687 |
680
- | 0.8305 | 165500 | 0.0696 |
681
- | 0.8330 | 166000 | 0.0686 |
682
- | 0.8355 | 166500 | 0.069 |
683
- | 0.8381 | 167000 | 0.0692 |
684
- | 0.8406 | 167500 | 0.0698 |
685
- | 0.8431 | 168000 | 0.0684 |
686
- | 0.8456 | 168500 | 0.0681 |
687
- | 0.8481 | 169000 | 0.0683 |
688
- | 0.8506 | 169500 | 0.0701 |
689
- | 0.8531 | 170000 | 0.0697 |
690
- | 0.8556 | 170500 | 0.0688 |
691
- | 0.8581 | 171000 | 0.0689 |
692
- | 0.8606 | 171500 | 0.0689 |
693
- | 0.8632 | 172000 | 0.0687 |
694
- | 0.8657 | 172500 | 0.0693 |
695
- | 0.8682 | 173000 | 0.0678 |
696
- | 0.8707 | 173500 | 0.0688 |
697
- | 0.8732 | 174000 | 0.0686 |
698
- | 0.8757 | 174500 | 0.0695 |
699
- | 0.8782 | 175000 | 0.0679 |
700
- | 0.8807 | 175500 | 0.0686 |
701
- | 0.8832 | 176000 | 0.0683 |
702
- | 0.8857 | 176500 | 0.068 |
703
- | 0.8882 | 177000 | 0.0688 |
704
- | 0.8908 | 177500 | 0.0696 |
705
- | 0.8933 | 178000 | 0.0682 |
706
- | 0.8958 | 178500 | 0.0686 |
707
- | 0.8983 | 179000 | 0.0679 |
708
- | 0.9008 | 179500 | 0.0687 |
709
- | 0.9033 | 180000 | 0.0677 |
710
- | 0.9058 | 180500 | 0.0693 |
711
- | 0.9083 | 181000 | 0.0685 |
712
- | 0.9108 | 181500 | 0.0682 |
713
- | 0.9133 | 182000 | 0.0689 |
714
- | 0.9158 | 182500 | 0.0682 |
715
- | 0.9184 | 183000 | 0.0679 |
716
- | 0.9209 | 183500 | 0.0682 |
717
- | 0.9234 | 184000 | 0.0678 |
718
- | 0.9259 | 184500 | 0.0685 |
719
- | 0.9284 | 185000 | 0.0673 |
720
- | 0.9309 | 185500 | 0.0676 |
721
- | 0.9334 | 186000 | 0.068 |
722
- | 0.9359 | 186500 | 0.0678 |
723
- | 0.9384 | 187000 | 0.0679 |
724
- | 0.9409 | 187500 | 0.0674 |
725
- | 0.9434 | 188000 | 0.068 |
726
- | 0.9460 | 188500 | 0.0679 |
727
- | 0.9485 | 189000 | 0.0673 |
728
- | 0.9510 | 189500 | 0.0663 |
729
- | 0.9535 | 190000 | 0.068 |
730
- | 0.9560 | 190500 | 0.0672 |
731
- | 0.9585 | 191000 | 0.0668 |
732
- | 0.9610 | 191500 | 0.0665 |
733
- | 0.9635 | 192000 | 0.0679 |
734
- | 0.9660 | 192500 | 0.0678 |
735
- | 0.9685 | 193000 | 0.0667 |
736
- | 0.9710 | 193500 | 0.068 |
737
- | 0.9736 | 194000 | 0.0669 |
738
- | 0.9761 | 194500 | 0.0686 |
739
- | 0.9786 | 195000 | 0.0682 |
740
- | 0.9811 | 195500 | 0.0673 |
741
- | 0.9836 | 196000 | 0.0682 |
742
- | 0.9861 | 196500 | 0.0675 |
743
- | 0.9886 | 197000 | 0.0669 |
744
- | 0.9911 | 197500 | 0.0669 |
745
- | 0.9936 | 198000 | 0.0686 |
746
- | 0.9961 | 198500 | 0.068 |
747
- | 0.9986 | 199000 | 0.0667 |
748
-
749
- </details>
750
-
751
- ### Framework Versions
752
- - Python: 3.10.4
753
- - Sentence Transformers: 5.2.0
754
- - Transformers: 4.57.3
755
- - PyTorch: 2.9.1+cu128
756
- - Accelerate: 1.12.0
757
- - Datasets: 2.21.0
758
- - Tokenizers: 0.22.1
759
 
760
  ## Citation
761
 
762
- ### BibTeX
763
 
764
- #### Sentence Transformers
765
  ```bibtex
766
- @inproceedings{reimers-2019-sentence-bert,
767
- title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
768
- author = "Reimers, Nils and Gurevych, Iryna",
769
- booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
770
- month = "11",
771
- year = "2019",
772
- publisher = "Association for Computational Linguistics",
773
- url = "https://arxiv.org/abs/1908.10084",
774
- }
775
- ```
776
-
777
- #### MarginMSELoss
778
- ```bibtex
779
- @misc{hofstätter2021improving,
780
- title={Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation},
781
- author={Sebastian Hofstätter and Sophia Althammer and Michael Schröder and Mete Sertkan and Allan Hanbury},
782
- year={2021},
783
- eprint={2010.02666},
784
- archivePrefix={arXiv},
785
- primaryClass={cs.IR}
786
- }
787
- ```
788
-
789
- <!--
790
- ## Glossary
791
-
792
- *Clearly define terms in order to be accessible across audiences.*
793
- -->
794
-
795
- <!--
796
- ## Model Card Authors
797
-
798
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
799
- -->
800
-
801
- <!--
802
- ## Model Card Contact
803
-
804
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
805
- -->
 
  ---
+ language:
+ - ara
+ - dan
+ - deu
+ - eng
+ - fas
+ - fra
+ - hin
+ - ind
+ - ita
+ - jpn
+ - kor
+ - nld
+ - pol
+ - por
+ - rus
+ - spa
+ - swe
+ - tur
+ - vie
+ - zho
+ multilingual: true
  tags:
+ - dense-retrieval
+ - hard-negatives
+ - knowledge-distillation
+ - webfaq
+ license: cc-by-4.0
+ task_categories:
  - sentence-similarity
+ - text-retrieval
  ---

+ # WebFAQ 2.0: Multilingual Hard Negatives

+ This dataset contains **mined hard negatives** derived from the **WebFAQ 2.0** corpus. It covers roughly **1.3 million** samples across **20 languages**.

+ The dataset is designed to support robust training of dense retrieval models, specifically enabling:
+ 1. **Contrastive Learning:** Using strict hard negatives to improve discrimination.
+ 2. **Knowledge Distillation:** Using the provided cross-encoder scores to train with soft labels (e.g., MarginMSE); see the training sketch below.
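
A minimal MarginMSE distillation setup could look roughly like the following sketch. It is not the authors' released training script: the Hub dataset id is a placeholder, the base model is borrowed from the previous card, and the positive-side teacher score (not shipped with this dataset) is an assumption.

```python
# A minimal distillation sketch (not the authors' released training script).
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MarginMSELoss

raw = load_dataset("your-org/webfaq-hard-negatives", split="train")  # placeholder Hub id

def add_margin(example):
    # MarginMSELoss learns the teacher margin score(q, pos) - score(q, neg).
    # Only the negative-side BGE-M3 score ships with the dataset, so the
    # positive-side score here is an assumption (e.g. re-score positives
    # with the same teacher and store them as `positive_score`).
    positive_score = example.get("positive_score", 1.0)
    return {"label": positive_score - example["score"]}

train_dataset = raw.map(add_margin).select_columns(["query", "positive", "negative", "label"])

model = SentenceTransformer("PaDaS-Lab/xlm-roberta-base-msmarco")  # any multilingual base encoder works
loss = MarginMSELoss(model)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```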

+ ## Dataset Creation & Mining Process

+ To ensure high-quality training signals, we employed a **two-stage mining pipeline** that balances difficulty with correctness.

+ ### 1. Lexical Retrieval (Recall)
+ For every query in WebFAQ, we first retrieved the **top-200 candidate answers** from the monolingual corpus using **BM25**.
+ * **Goal:** Identify candidates with high lexical overlap (shared keywords) that are likely to be "hard" for a dense retriever to distinguish.
 
51
+ ### 2. Semantic Reranking (Precision)
52
+ We reranked the top-200 candidates using the state-of-the-art cross-encoder model: **[BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)**.
53
+ * **Goal:** Assess the true semantic relevance of each candidate.
54
 
55
+ ### 3. Filtering & Scoring
56
+ We applied a rigorous filtering strategy to curate the final dataset:
57
+ * **False Negative Removal:** Candidates with extremely high cross-encoder scores (semantic matches) were discarded to prevent "poisoning" the training data with valid answers labeled as negatives.
58
+ * **Easy Negative Removal:** Candidates with very low scores were discarded to ensure training efficiency.
59
+ * **Score Retention:** We retained the BGE-M3 relevance scores for every negative, enabling knowledge distillation workflows.
60
 
61
+ ## Dataset Structure
 
62
 
63
+ Each sample in the dataset contains the following fields:
 
64
 
65
+ | Field | Description |
66
+ | :--- | :--- |
67
+ | `query` | The user question. |
68
+ | `positive` | The ground-truth correct answer. |
69
+ | `negative` | The mined hard negative (non-relevant but similar). |
70
+ | `score` | The **BGE-M3 cross-encoder score** for the `(query, negative)` pair. |
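
For a quick look at the records, the fields can be inspected as in the sketch below; the repository id is a placeholder for this dataset's actual Hub id.

```python
# Loading and inspecting one record. The Hub id is a placeholder -- replace it
# with this dataset's actual repository id.
from datasets import load_dataset

ds = load_dataset("your-org/webfaq-hard-negatives", split="train")

sample = ds[0]
print(sample["query"])     # the user question
print(sample["positive"])  # the ground-truth answer
print(sample["negative"])  # the mined hard negative
print(sample["score"])     # BGE-M3 relevance score of the (query, negative) pair
```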

+ ### Code & Reproduction
+ The code used for mining, filtering, and training is available in the official repository:
+ * **GitHub Repository:** [Link to your GitHub Repo]
+ * **WebFAQ Project:** [OpenWebSearch.EU](https://openwebsearch.eu)

  ## Citation

+ If you use this dataset, please cite the WebFAQ 2.0 paper:

  ```bibtex
+ @inproceedings{dinzinger2025webfaq,
+   title={WebFAQ: A Multilingual Collection of Natural QA Datasets for Dense Retrieval},
+   author={Dinzinger, Michael and Caspari, Laura and Dastidar, Kanishka Ghosh and Mitrović, Jelena and Granitzer, Michael},
+   booktitle={Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval},
+   year={2025}
+ }