xtr-replicability committed on
Commit
7d3982e
·
verified ·
1 Parent(s): 94978f9

Upload folder using huggingface_hub

1_Dense/config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "in_features": 384,
+   "out_features": 128,
+   "bias": false,
+   "activation_function": "torch.nn.modules.linear.Identity",
+   "use_residual": false
+ }
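The config above describes a bias-free linear projection from the 384-dimensional encoder output to 128-dimensional token vectors, with an identity activation. A minimal pure-Python sketch of that mapping follows. The helper name `dense_project` and the tiny 2×4 weight matrix are hypothetical stand-ins for illustration only; the real layer holds a 128×384 matrix and is applied by PyLate on tensors.

```python
def dense_project(token_embeddings, weight):
    """Apply y = W x to each token embedding (no bias, identity activation).

    `weight` has shape (out_features, in_features). Both arguments are plain
    nested lists here, purely for illustration.
    """
    return [
        [sum(w * x for w, x in zip(row, token)) for row in weight]
        for token in token_embeddings
    ]


# Toy stand-in: 4 input features -> 2 output features (the real layer is 384 -> 128).
toy_weight = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
]
toy_tokens = [[0.5, 1.0, 2.0, 3.0], [1.0, 1.0, 1.0, 1.0]]
projected = dense_project(toy_tokens, toy_weight)
print(projected)
```

Each input token yields one output vector, so a document of n tokens becomes n vectors of `out_features` values rather than a single pooled embedding.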
1_Dense/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:481e27c1e100415da70af1c72f2c888fad20fc38b8c4ab46409f11df305926d1
+ size 196696
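As a sanity check, the pointer size can be compared with the size of a 384×128 weight matrix. The sketch below assumes the weights are stored as float32 (4 bytes each), which this diff does not state for the Dense module. A safetensors file begins with a length-prefixed JSON header, so the small remainder is plausibly that header.

```python
IN_FEATURES = 384
OUT_FEATURES = 128
BYTES_PER_FLOAT32 = 4      # assumption: weights stored as float32
POINTER_SIZE = 196_696     # "size" field of the LFS pointer above

weight_bytes = IN_FEATURES * OUT_FEATURES * BYTES_PER_FLOAT32
remainder = POINTER_SIZE - weight_bytes  # bytes not accounted for by the weight matrix

print(weight_bytes, remainder)
```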
README.md ADDED
@@ -0,0 +1,419 @@
+ ---
+ tags:
+ - ColBERT
+ - PyLate
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - generated_from_trainer
+ - dataset_size:9998000
+ - loss:Contrastive
+ base_model: BAAI/bge-small-en-v1.5
+ datasets:
+ - bclavie/msmarco-10m-triplets
+ pipeline_tag: sentence-similarity
+ library_name: PyLate
+ metrics:
+ - accuracy
+ model-index:
+ - name: PyLate model based on BAAI/bge-small-en-v1.5
+   results:
+   - task:
+       type: col-berttriplet
+       name: Col BERTTriplet
+     dataset:
+       name: Unknown
+       type: unknown
+     metrics:
+     - type: accuracy
+       value: 0.9910000562667847
+       name: Accuracy
+ ---
+
+ # PyLate model based on BAAI/bge-small-en-v1.5
+
+ This is a [PyLate](https://github.com/lightonai/pylate) model finetuned from [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) on the [msmarco-10m-triplets](https://huggingface.co/datasets/bclavie/msmarco-10m-triplets) dataset. It maps sentences & paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** PyLate model
+ - **Base model:** [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) <!-- at revision 5c38ec7c405ec4b44b94cc5a9bb96e735b38267a -->
+ - **Document Length:** 300 tokens
+ - **Query Length:** 32 tokens
+ - **Output Dimensionality:** 128 dimensions
+ - **Similarity Function:** MaxSim
+ - **Training Dataset:**
+     - [msmarco-10m-triplets](https://huggingface.co/datasets/bclavie/msmarco-10m-triplets)
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+
+ ### Model Sources
+
+ - **Documentation:** [PyLate Documentation](https://lightonai.github.io/pylate/)
+ - **Repository:** [PyLate on GitHub](https://github.com/lightonai/pylate)
+ - **Hugging Face:** [PyLate models on Hugging Face](https://huggingface.co/models?library=PyLate)
+
+ ### Full Model Architecture
+
+ ```
+ ColBERT(
+   (0): Transformer({'max_seq_length': 300, 'do_lower_case': True, 'architecture': 'BertModel'})
+   (1): Dense({'in_features': 384, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity', 'use_residual': False})
+ )
+ ```
+
+ ## Usage
+ First install the PyLate library:
+
+ ```bash
+ pip install -U pylate
+ ```
+
+ ### Retrieval
+
+ Use this model with PyLate to index and retrieve documents. The index uses [FastPLAID](https://github.com/lightonai/fast-plaid) for efficient similarity search.
+
+ #### Indexing documents
+
+ Load the ColBERT model and initialize the PLAID index, then encode and index your documents:
+
+ ```python
+ from pylate import indexes, models, retrieve
+
+ # Step 1: Load the ColBERT model
+ model = models.ColBERT(
+     model_name_or_path="pylate_model_id",
+ )
+
+ # Step 2: Initialize the PLAID index
+ index = indexes.PLAID(
+     index_folder="pylate-index",
+     index_name="index",
+     override=True,  # This overwrites the existing index if any
+ )
+
+ # Step 3: Encode the documents
+ documents_ids = ["1", "2", "3"]
+ documents = ["document 1 text", "document 2 text", "document 3 text"]
+
+ documents_embeddings = model.encode(
+     documents,
+     batch_size=32,
+     is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
+     show_progress_bar=True,
+ )
+
+ # Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
+ index.add_documents(
+     documents_ids=documents_ids,
+     documents_embeddings=documents_embeddings,
+ )
+ ```
+
+ Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:
+
+ ```python
+ # To load an index, simply instantiate it with the correct folder/name and without overriding it
+ index = indexes.PLAID(
+     index_folder="pylate-index",
+     index_name="index",
+ )
+ ```
+
+ #### Retrieving top-k documents for queries
+
+ Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries.
+ To do so, initialize the ColBERT retriever with the index you want to search, encode the queries, and then retrieve the top-k documents to get the IDs and relevance scores of the top matches:
+
+ ```python
+ # Step 1: Initialize the ColBERT retriever
+ retriever = retrieve.ColBERT(index=index)
+
+ # Step 2: Encode the queries
+ queries_embeddings = model.encode(
+     ["query for document 3", "query for document 1"],
+     batch_size=32,
+     is_query=True,  # Ensure that it is set to True to indicate that these are queries
+     show_progress_bar=True,
+ )
+
+ # Step 3: Retrieve top-k documents
+ scores = retriever.retrieve(
+     queries_embeddings=queries_embeddings,
+     k=10,  # Retrieve the top 10 matches for each query
+ )
+ ```
+
+ ### Reranking
+ If you only want to use the ColBERT model to rerank the output of a first-stage retrieval pipeline, without building an index, you can use the `rank.rerank` function and pass it the queries and documents to rerank:
+
+ ```python
+ from pylate import rank, models
+
+ queries = [
+     "query A",
+     "query B",
+ ]
+
+ documents = [
+     ["document A", "document B"],
+     ["document 1", "document C", "document B"],
+ ]
+
+ documents_ids = [
+     [1, 2],
+     [1, 3, 2],
+ ]
+
+ model = models.ColBERT(
+     model_name_or_path="pylate_model_id",
+ )
+
+ queries_embeddings = model.encode(
+     queries,
+     is_query=True,
+ )
+
+ documents_embeddings = model.encode(
+     documents,
+     is_query=False,
+ )
+
+ reranked_documents = rank.rerank(
+     documents_ids=documents_ids,
+     queries_embeddings=queries_embeddings,
+     documents_embeddings=documents_embeddings,
+ )
+ ```
+
+ <!--
+ ### Direct Usage (Transformers)
+
+ <details><summary>Click to see the direct usage in Transformers</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+
+ You can finetune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ ## Evaluation
+
+ ### Metrics
+
+ #### ColBERT Triplet
+
+ * Evaluated with <code>pylate.evaluation.colbert_triplet.ColBERTTripletEvaluator</code>
+
+ | Metric       | Value     |
+ |:-------------|:----------|
+ | **accuracy** | **0.991** |
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Training Dataset
+
+ #### msmarco-10m-triplets
+
+ * Dataset: [msmarco-10m-triplets](https://huggingface.co/datasets/bclavie/msmarco-10m-triplets) at [8c5139a](https://huggingface.co/datasets/bclavie/msmarco-10m-triplets/tree/8c5139a245a5997992605792faa49ec12a6eb5f2)
+ * Size: 9,998,000 training samples
+ * Columns: <code>query</code>, <code>positive</code>, and <code>negative</code>
+ * Approximate statistics based on the first 1000 samples:
+   |         | query | positive | negative |
+   |:--------|:------|:---------|:---------|
+   | type    | string | string | string |
+   | details | <ul><li>min: 32 tokens</li><li>mean: 32.0 tokens</li><li>max: 32 tokens</li></ul> | <ul><li>min: 32 tokens</li><li>mean: 32.0 tokens</li><li>max: 32 tokens</li></ul> | <ul><li>min: 32 tokens</li><li>mean: 32.0 tokens</li><li>max: 32 tokens</li></ul> |
+ * Samples:
+   | query | positive | negative |
+   |:------|:---------|:---------|
+   | <code>what kind of carbohydrates can i eat in a gluten free diet?</code> | <code>What Can I Eat That is Gluten-Free? Even though going gluten-free can be difficult, you still have many food choices! Focus on eating a variety of fruits, vegetables, low-fat dairy products (those that do not have gluten-containing additives), beans, eggs, nuts, and lean meat, poultry, and fish. There are still many healthy whole grains and starchy carbohydrate foods to choose from that do not contain gluten: Amaranth. Arrowroot.</code> | <code>Gluten-free crust option upon request. While we try hard to maintain the integrity of our gluten free crust, please be aware that it does run the risk of exposure to wheat-based products. Due to the risk of cross contamination, MOD DOES NOT RECOMMEND this pizza for those with celiac disease or other gluten allergies. Feeling Inspired? Express Yourself Through Pizza</code> |
+   | <code>remsen area code</code> | <code>Remsen, NY Area Codes are. Remsen, NY is currently using two area codes which are area codes 315 and 680. In addition to Remsen, NY area code information read more details about area code 315, area code 680 and New York area codes. Remsen, NY is located in Oneida County and observes the Eastern Time Zone.</code> | <code>313 Area Code. AreaCode.org is an area code finder with detailed information on the 313 area code including 313 area code map. Major cities like Dearborn within area code 313 are also listed on this page.</code> |
+   | <code>when was betsy ross born</code> | <code>Early Life. Betsy Ross, best known for making the first American flag, was born Elizabeth Griscom in Philadelphia, Pennsylvania, on January 1, 1752. A fourth-generation American, and the great-granddaughter of a carpenter who had arrived in New Jersey in 1680 from England, Betsy was the eighth of 17 children.ynopsis. Betsy Ross, a fourth-generation America born in 1752 in Philadelphia, Pennsylvania, apprenticed with an upholsterer before irrevocably splitting with her family to marry outside the Quaker religion. She and her husband John Ross started their own upholstery business.</code> | <code>Katharine Ross (I) Katharine Juliet Ross was born on January 29, 1940 in Hollywood, California, to Katharine W. (Hall) and Dudley T. Ross. Her father, who also worked for the Associated Press, was away in the US Navy when she was born.</code> |
+ * Loss: <code>pylate.losses.contrastive.Contrastive</code>
+
+ ### Evaluation Dataset
+
+ #### msmarco-10m-triplets
+
+ * Dataset: [msmarco-10m-triplets](https://huggingface.co/datasets/bclavie/msmarco-10m-triplets) at [8c5139a](https://huggingface.co/datasets/bclavie/msmarco-10m-triplets/tree/8c5139a245a5997992605792faa49ec12a6eb5f2)
+ * Size: 2,000 evaluation samples
+ * Columns: <code>query</code>, <code>positive</code>, and <code>negative</code>
+ * Approximate statistics based on the first 1000 samples:
+   |         | query | positive | negative |
+   |:--------|:------|:---------|:---------|
+   | type    | string | string | string |
+   | details | <ul><li>min: 32 tokens</li><li>mean: 32.0 tokens</li><li>max: 32 tokens</li></ul> | <ul><li>min: 32 tokens</li><li>mean: 32.0 tokens</li><li>max: 32 tokens</li></ul> | <ul><li>min: 32 tokens</li><li>mean: 32.0 tokens</li><li>max: 32 tokens</li></ul> |
+ * Samples:
+   | query | positive | negative |
+   |:------|:---------|:---------|
+   | <code>pikas are closely related to which typeb of animal</code> | <code>The pika is a small-sized mammal that is found across the Northern Hemisphere. Despite their rodent-like appearance, pikas are actually closely related to rabbits and hares.Pikas are most commonly identified by their small, rounded body and lack of tail. Pikas prefer the colder climates and are generally found in mountainous regions and rocky areas where there tend to be fewer predators.ikas defend their territory by whistling to one another, and their large, rounded ears come in useful to hear the calls from competing pikas. Pikas are herbivorous animals and the pika therefore has a diet based on vegetation.</code> | <code>Alpacas are very closely related to llamas. They are both from a group of four species known as South American Camelids. The llama is approximately twice the size of an alpaca with banana shaped ears and is principally used as a pack animal. Alpacas are exclusively bred as fleece animals in Australia.</code> |
+   | <code>when can we see northern lights in norway</code> | <code>The Northern Lights can appear at any time, but they usually grace the sky between 6 o’clock in the evening and 1 o’clock in the morning. 1 It is rare to see the Northern Lights before 18. 00/6pm, even during the dark months. 2 The highest frequency is around 22. 00–23. 3 If you see the Northern Lights at 19.</code> | <code>Transfer points on the Northern lights & Norway in a nutshell® trip. Oslo: Arrival/departure by plane to/from Oslo Airport Gardermoen, 28 mi./45 km north of city center. Transport by airport train or airport bus. Tromsø: Arrival/departure by plane to/from Tromsø Airport, 1.8 mi./3 km west of city center.</code> |
+   | <code>what games do markiplier play</code> | <code>List of Games. Markiplier is a professional gamer, who is best known for playing horror-themed video games. Along with many other types of games, including, but not limited to: flash games, indie point-and-click games and adventure games.</code> | <code>Stop wasting your time for playing games when you can play games and be paid for it. Be the one of the game testers and start earning money from something that makes you happy. Visit http://goo.gl/pT87xF, become a game tester today and get paid to play video games. Felisha · 1 year ago.</code> |
+ * Loss: <code>pylate.losses.contrastive.Contrastive</code>
+
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+
+ - `eval_strategy`: steps
+ - `per_device_train_batch_size`: 196
+ - `per_device_eval_batch_size`: 196
+ - `learning_rate`: 3e-05
+ - `max_grad_norm`: 10.0
+ - `num_train_epochs`: 0
+ - `max_steps`: 50000
+ - `warmup_ratio`: 0.01
+ - `bf16`: True
+ - `torch_compile`: True
+ - `torch_compile_backend`: inductor
+ - `eval_on_start`: True
+
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: steps
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 196
+ - `per_device_eval_batch_size`: 196
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 3e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 10.0
+ - `num_train_epochs`: 0
+ - `max_steps`: 50000
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.01
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: True
+ - `fp16`: False
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `parallelism_config`: None
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch_fused
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: None
+ - `hub_always_push`: False
+ - `hub_revision`: None
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `include_for_metrics`: []
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`:
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: True
+ - `torch_compile_backend`: inductor
+ - `torch_compile_mode`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: True
+ - `use_liger_kernel`: False
+ - `liger_kernel_config`: None
+ - `eval_use_gather_object`: False
+ - `average_tokens_across_devices`: False
+ - `prompts`: None
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: proportional
+ - `router_mapping`: {}
+ - `learning_rate_mapping`: {}
+
+ </details>
+
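The model card names MaxSim as the similarity function. For readers unfamiliar with it, here is a minimal pure-Python sketch of the late-interaction score: each query token vector is matched with its best-scoring document token vector, and the maxima are summed. The function name `maxsim` and the toy vectors are illustrative assumptions; PyLate computes this on batched tensors and its internals are not shown in this commit.

```python
def maxsim(query_vectors, document_vectors):
    """Late-interaction score: for each query token vector, take the maximum
    dot product over all document token vectors, then sum over query tokens."""

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    return sum(
        max(dot(q, d) for d in document_vectors)
        for q in query_vectors
    )


# Toy 2-dimensional vectors (the model emits 128-dimensional ones).
toy_query = [[1.0, 0.0], [0.0, 1.0]]
toy_document = [[0.5, 0.5], [1.0, 0.0], [0.0, 0.2]]
score = maxsim(toy_query, toy_document)
print(score)
```

Because the score sums over query tokens only, it is asymmetric: swapping the query and document arguments generally gives a different value.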
config.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "dtype": "float32",
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 384,
+   "id2label": {
+     "0": "LABEL_0"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 1536,
+   "label2id": {
+     "LABEL_0": 0
+   },
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "transformers_version": "4.56.2",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,53 @@
+ {
+   "__version__": {
+     "sentence_transformers": "5.1.1",
+     "transformers": "4.56.2",
+     "pytorch": "2.9.0+cu128"
+   },
+   "prompts": {
+     "query": "",
+     "document": ""
+   },
+   "default_prompt_name": null,
+   "similarity_fn_name": "MaxSim",
+   "query_prefix": null,
+   "document_prefix": null,
+   "query_length": 32,
+   "document_length": 300,
+   "attend_to_expansion_tokens": true,
+   "skiplist_words": [
+     "!",
+     "\"",
+     "#",
+     "$",
+     "%",
+     "&",
+     "'",
+     "(",
+     ")",
+     "*",
+     "+",
+     ",",
+     "-",
+     ".",
+     "/",
+     ":",
+     ";",
+     "<",
+     "=",
+     ">",
+     "?",
+     "@",
+     "[",
+     "\\",
+     "]",
+     "^",
+     "_",
+     "`",
+     "{",
+     "|",
+     "}",
+     "~"
+   ],
+   "do_query_expansion": true
+ }
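The 32 entries of `skiplist_words` coincide, in order, with Python's `string.punctuation`. In ColBERT-style models a skiplist is typically used to ignore punctuation tokens on the document side. How PyLate applies it internally is not shown in this diff, so the helper `drop_skiplisted` below is a hypothetical illustration that works on plain token strings.

```python
import string

# Same 32 characters, in the same order, as the "skiplist_words" array above.
skiplist_words = list(string.punctuation)


def drop_skiplisted(tokens, skiplist=frozenset(skiplist_words)):
    """Return the tokens that are not in the skiplist (hypothetical helper)."""
    return [t for t in tokens if t not in skiplist]


kept = drop_skiplisted(["what", "is", "gluten", "-", "free", "?"])
print(kept)
```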
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e20f8864f44458387be1bda33765ad91c33417cea2911128995a2dbab5ec4ea0
+ size 133462128
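The pointer size can be cross-checked against the dimensions in `config.json`. The sketch below counts parameters under the assumption of the standard Hugging Face `BertModel` layout (embeddings, 12 encoder layers, and a pooler, all stored as float32). Neither the tensor list nor the presence of the pooler is confirmed by this diff, so treat the breakdown as an estimate.

```python
# Values copied from config.json in this commit.
VOCAB_SIZE = 30522
HIDDEN = 384
INTERMEDIATE = 1536
LAYERS = 12
MAX_POSITIONS = 512
TYPE_VOCAB = 2


def linear(n_in, n_out):
    """Parameter count of a dense layer with bias."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # weight + bias

embeddings = (VOCAB_SIZE + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN + layer_norm
encoder_layer = (
    4 * linear(HIDDEN, HIDDEN)        # query, key, value, attention output
    + layer_norm
    + linear(HIDDEN, INTERMEDIATE)    # feed-forward up-projection
    + linear(INTERMEDIATE, HIDDEN)    # feed-forward down-projection
    + layer_norm
)
pooler = linear(HIDDEN, HIDDEN)       # assumption: the pooler weights are saved

total_params = embeddings + LAYERS * encoder_layer + pooler
weight_bytes = 4 * total_params       # float32
remainder = 133_462_128 - weight_bytes  # pointer size minus weight payload

print(total_params, weight_bytes, remainder)
```

Under these assumptions the weights account for all but about 22 KB of the file, a remainder small enough to be the safetensors header.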
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Dense",
+     "type": "pylate.models.Dense.Dense"
+   }
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 300,
+   "do_lower_case": true
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "[MASK]",
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
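Note that `pad_token` is set to `[MASK]` rather than `[PAD]`. Together with `query_length: 32` and `do_query_expansion: true` in `config_sentence_transformers.json`, this matches the ColBERT practice of padding queries to a fixed length with mask tokens. The sketch below illustrates the idea on plain token strings. The helper `expand_query` is hypothetical, and the real tokenizer pipeline (including any prefix tokens) is not shown in this commit.

```python
QUERY_LENGTH = 32      # "query_length" from config_sentence_transformers.json
PAD_TOKEN = "[MASK]"   # "pad_token" from special_tokens_map.json


def expand_query(tokens, length=QUERY_LENGTH, pad=PAD_TOKEN):
    """Truncate or pad a token list to exactly `length` items (illustrative)."""
    return tokens[:length] + [pad] * max(0, length - len(tokens))


expanded = expand_query(["[CLS]", "remsen", "area", "code", "[SEP]"])
print(len(expanded), expanded.count(PAD_TOKEN))
```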
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,58 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_token": "[MASK]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff