Commit 7a50d08 by souvickdascmsa019 · verified · 1 parent: 42339aa

Upload folder using huggingface_hub

1_Dense/config.json ADDED
@@ -0,0 +1 @@
1
+ {"in_features": 768, "out_features": 128, "bias": false, "activation_function": "torch.nn.modules.linear.Identity"}
1_Dense/model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:281514cbc08c1c7498f03557d572f6520cfed1dbc9b1d96ff0e7d282f7eefecb
3
+ size 393304
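As a quick sanity check on the LFS pointer above, the file size is consistent with a single 768×128 weight matrix and no bias. This is a sketch under two assumptions that the diff does not state for the `1_Dense` module: the weights are stored as float32 (4 bytes each), and whatever remains of the file is the safetensors header.

```python
# Sanity check: does 393304 bytes match a 768 -> 128 projection with no bias?
in_features, out_features = 768, 128   # from 1_Dense/config.json
bytes_per_value = 4                    # assumption: float32 storage
file_size = 393304                     # from the LFS pointer

payload = in_features * out_features * bytes_per_value
overhead = file_size - payload         # header and metadata, if the assumption holds

print(payload, overhead)
```

A small positive overhead is what one would expect from a file holding one tensor plus its header.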
README.md ADDED
@@ -0,0 +1,475 @@
1
+ ---
2
+ language:
3
+ - en
4
+ tags:
5
+ - ColBERT
6
+ - PyLate
7
+ - sentence-transformers
8
+ - sentence-similarity
9
+ - feature-extraction
10
+ - generated_from_trainer
11
+ - dataset_size:497901
12
+ - loss:Contrastive
13
+ base_model: colbert-ir/colbertv2.0
14
+ datasets:
15
+ - sentence-transformers/msmarco-bm25
16
+ pipeline_tag: sentence-similarity
17
+ library_name: PyLate
18
+ ---
19
+
20
+ # PyLate model based on colbert-ir/colbertv2.0
21
+
22
+ This is a [PyLate](https://github.com/lightonai/pylate) model finetuned from [colbert-ir/colbertv2.0](https://huggingface.co/colbert-ir/colbertv2.0) on the [msmarco-bm25](https://huggingface.co/datasets/sentence-transformers/msmarco-bm25) dataset. It maps sentences & paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.
23
+
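To make the MaxSim operator concrete, here is a minimal pure-Python sketch. For each query token vector it takes the best dot product against any document token vector, then sums those maxima. The toy 2-dimensional vectors and the helper names `dot` and `maxsim` are illustrative only; the model itself emits 128-dimensional vectors, and PyLate computes this with batched tensor operations.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_vectors, document_vectors):
    """Sum over query tokens of the best dot product against any document token."""
    return sum(max(dot(q, d) for d in document_vectors) for q in query_vectors)

# Toy 2-dimensional token vectors (illustrative values, not model output).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_b = [[0.5, 0.5]]

score_a = maxsim(query, doc_a)  # each query token finds an exact match
score_b = maxsim(query, doc_b)  # each query token only finds a partial match
```

Because every query token picks its own best match, a document that covers all query tokens (`doc_a`) outscores one that matches them only partially (`doc_b`).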
24
+ ## Model Details
25
+
26
+ ### Model Description
27
+ - **Model Type:** PyLate model
28
+ - **Base model:** [colbert-ir/colbertv2.0](https://huggingface.co/colbert-ir/colbertv2.0) <!-- at revision c1e84128e85ef755c096a95bdb06b47793b13acf -->
29
+ - **Document Length:** 180 tokens
30
+ - **Query Length:** 32 tokens
31
+ - **Output Dimensionality:** 128 dimensions
32
+ - **Similarity Function:** MaxSim
33
+ - **Training Dataset:**
34
+ - [msmarco-bm25](https://huggingface.co/datasets/sentence-transformers/msmarco-bm25)
35
+ - **Language:** en
36
+ <!-- - **License:** Unknown -->
37
+
38
+ ### Model Sources
39
+
40
+ - **Documentation:** [PyLate Documentation](https://lightonai.github.io/pylate/)
41
+ - **Repository:** [PyLate on GitHub](https://github.com/lightonai/pylate)
42
+ - **Hugging Face:** [PyLate models on Hugging Face](https://huggingface.co/models?library=PyLate)
43
+
44
+ ### Full Model Architecture
45
+
46
+ ```
47
+ ColBERT(
48
+ (0): Transformer({'max_seq_length': 179, 'do_lower_case': False}) with Transformer model: BertModel
49
+ (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
50
+ )
51
+ ```
52
+
53
+ ## Usage
54
+ First install the PyLate library:
55
+
56
+ ```bash
57
+ pip install -U pylate
58
+ ```
59
+
60
+ ### Retrieval
61
+
62
+ PyLate provides a streamlined interface to index and retrieve documents using ColBERT models. The index leverages the Voyager HNSW index to efficiently handle document embeddings and enable fast retrieval.
63
+
64
+ #### Indexing documents
65
+
66
+ First, load the ColBERT model and initialize the Voyager index, then encode and index your documents:
67
+
68
+ ```python
69
+ from pylate import indexes, models, retrieve
70
+
71
+ # Step 1: Load the ColBERT model
72
+ model = models.ColBERT(
73
+ model_name_or_path=pylate_model_id,  # pylate_model_id is a placeholder: set it to this model's Hub ID or a local path
74
+ )
75
+
76
+ # Step 2: Initialize the Voyager index
77
+ index = indexes.Voyager(
78
+ index_folder="pylate-index",
79
+ index_name="index",
80
+ override=True, # This overwrites the existing index if any
81
+ )
82
+
83
+ # Step 3: Encode the documents
84
+ documents_ids = ["1", "2", "3"]
85
+ documents = ["document 1 text", "document 2 text", "document 3 text"]
86
+
87
+ documents_embeddings = model.encode(
88
+ documents,
89
+ batch_size=32,
90
+ is_query=False, # Ensure that it is set to False to indicate that these are documents, not queries
91
+ show_progress_bar=True,
92
+ )
93
+
94
+ # Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
95
+ index.add_documents(
96
+ documents_ids=documents_ids,
97
+ documents_embeddings=documents_embeddings,
98
+ )
99
+ ```
100
+
101
+ Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:
102
+
103
+ ```python
104
+ # To load an index, simply instantiate it with the correct folder/name and without overriding it
105
+ index = indexes.Voyager(
106
+ index_folder="pylate-index",
107
+ index_name="index",
108
+ )
109
+ ```
110
+
111
+ #### Retrieving top-k documents for queries
112
+
113
+ Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries.
114
+ To do so, initialize the ColBERT retriever with the index you want to search, encode the queries, and then retrieve the top-k documents to get the ids and relevance scores of the best matches:
115
+
116
+ ```python
117
+ # Step 1: Initialize the ColBERT retriever
118
+ retriever = retrieve.ColBERT(index=index)
119
+
120
+ # Step 2: Encode the queries
121
+ queries_embeddings = model.encode(
122
+ ["query for document 3", "query for document 1"],
123
+ batch_size=32,
124
+ is_query=True, # Ensure that it is set to True to indicate that these are queries
125
+ show_progress_bar=True,
126
+ )
127
+
128
+ # Step 3: Retrieve top-k documents
129
+ scores = retriever.retrieve(
130
+ queries_embeddings=queries_embeddings,
131
+ k=10, # Retrieve the top 10 matches for each query
132
+ )
133
+ ```
134
+
135
+ ### Reranking
136
+ If you only want to use the ColBERT model to rerank the output of a first-stage retrieval pipeline without building an index, you can use the `rank.rerank` function and pass it the queries and documents to rerank:
137
+
138
+ ```python
139
+ from pylate import rank, models
140
+
141
+ queries = [
142
+ "query A",
143
+ "query B",
144
+ ]
145
+
146
+ documents = [
147
+ ["document A", "document B"],
148
+ ["document 1", "document C", "document B"],
149
+ ]
150
+
151
+ documents_ids = [
152
+ [1, 2],
153
+ [1, 3, 2],
154
+ ]
155
+
156
+ model = models.ColBERT(
157
+ model_name_or_path=pylate_model_id,  # pylate_model_id is a placeholder: set it to this model's Hub ID or a local path
158
+ )
159
+
160
+ queries_embeddings = model.encode(
161
+ queries,
162
+ is_query=True,
163
+ )
164
+
165
+ documents_embeddings = model.encode(
166
+ documents,
167
+ is_query=False,
168
+ )
169
+
170
+ reranked_documents = rank.rerank(
171
+ documents_ids=documents_ids,
172
+ queries_embeddings=queries_embeddings,
173
+ documents_embeddings=documents_embeddings,
174
+ )
175
+ ```
176
+
177
+ <!--
178
+ ### Direct Usage (Transformers)
179
+
180
+ <details><summary>Click to see the direct usage in Transformers</summary>
181
+
182
+ </details>
183
+ -->
184
+
185
+ <!--
186
+ ### Downstream Usage (Sentence Transformers)
187
+
188
+ You can finetune this model on your own dataset.
189
+
190
+ <details><summary>Click to expand</summary>
191
+
192
+ </details>
193
+ -->
194
+
195
+ <!--
196
+ ### Out-of-Scope Use
197
+
198
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
199
+ -->
200
+
201
+ <!--
202
+ ## Bias, Risks and Limitations
203
+
204
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
205
+ -->
206
+
207
+ <!--
208
+ ### Recommendations
209
+
210
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
211
+ -->
212
+
213
+ ## Training Details
214
+
215
+ ### Training Dataset
216
+
217
+ #### msmarco-bm25
218
+
219
+ * Dataset: [msmarco-bm25](https://huggingface.co/datasets/sentence-transformers/msmarco-bm25) at [ce8a493](https://huggingface.co/datasets/sentence-transformers/msmarco-bm25/tree/ce8a493a65af5e872c3c92f72a89e2e99e175f02)
220
+ * Size: 497,901 training samples
221
+ * Columns: <code>query</code>, <code>positive</code>, and <code>negative</code>
222
+ * Approximate statistics based on the first 1000 samples:
223
+ | | query | positive | negative |
224
+ |:--------|:----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|
225
+ | type | string | string | string |
226
+ | details | <ul><li>min: 5 tokens</li><li>mean: 10.14 tokens</li><li>max: 20 tokens</li></ul> | <ul><li>min: 17 tokens</li><li>mean: 31.91 tokens</li><li>max: 32 tokens</li></ul> | <ul><li>min: 17 tokens</li><li>mean: 31.84 tokens</li><li>max: 32 tokens</li></ul> |
227
+ * Samples:
228
+ | query | positive | negative |
229
+ |:---------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
230
+ | <code>what is null hypothesis and why is it used in experimental research</code> | <code>A null hypothesis is one that is assumed to be true unless it has been contradicted. It is used to compare to another hypothesis. The experimental hypothesis is what you are observing, and you expect it to differ from the control. erm i know that a null hypothesis is when nothing happens at all i think.</code> | <code>A null hypothesis is one that is assumed to be true unless it has been contradicted. It is used to compare to another hypothesis. The experimental hypothesis is what you are observing, and you expect it to differ from the control. erm i know that a null hypothesis is when nothing happens at all i think.</code> |
231
+ | <code>number of students per instructor</code> | <code>The article posited that students preferred classes of 10-20 students, and instructors suggested that the ideal class would have 19 students. Instructors reported that at 39 students problems began to arise, and that a class of 51 students was impossible. They also reported that an uncomfortably small class begins at 7 students, and an impossibly small class has 4 or less.</code> | <code>The ratio of instructors to students isn’t as important here as in the lab setting. One to two instructors per 10 students will suffice. Once the students are divided into groups, the instructor should begin to methodically teach ECG interpretation. The instructor should start with waveform definition and recognition.</code> |
232
+ | <code>when should exclamation marks be used?</code> | <code>The exclamation mark (British English) or exclamation point (American English) is a punctuation mark usually used after an interjection or exclamation to indicate strong feelings or high volume (shouting), and often marks the end of a sentence.</code> | <code>1 Question marks and exclamation marks go inside the quotation marks when the quoted material is a question or an exclamation and outside the quotation marks when the whole sentence is a question or an exclamation. Question marks and exclamation marks go inside the quotation marks when the quoted material is a question or an exclamation and outside the quotation marks when the whole sentence is a question or an exclamation.</code> |
233
+ * Loss: <code>pylate.losses.contrastive.Contrastive</code>
234
+
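The loss is only named here. As a rough illustration of what a contrastive objective over (query, positive, negative) triplets typically computes, the sketch below applies softmax cross-entropy to relevance scores so that the positive is pushed above the negatives. This is a generic formulation, not PyLate's implementation, which may differ (for example in how it treats other documents in the batch). The function name and the scores are made up for illustration.

```python
import math

def contrastive_loss(positive_score, negative_scores):
    """Cross-entropy of the positive against all candidates (log-sum-exp form)."""
    scores = [positive_score] + list(negative_scores)
    peak = max(scores)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_sum_exp - positive_score

easy = contrastive_loss(12.0, [3.0])    # negative is far below the positive
hard = contrastive_loss(12.0, [11.5])   # negative is almost as good as the positive
```

A hard negative that scores close to the positive yields a larger loss than an easy one, which is the reason for training on BM25-mined negatives.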
235
+ ### Evaluation Dataset
236
+
237
+ #### msmarco-bm25
238
+
239
+ * Dataset: [msmarco-bm25](https://huggingface.co/datasets/sentence-transformers/msmarco-bm25) at [ce8a493](https://huggingface.co/datasets/sentence-transformers/msmarco-bm25/tree/ce8a493a65af5e872c3c92f72a89e2e99e175f02)
240
+ * Size: 5,030 evaluation samples
241
+ * Columns: <code>query</code>, <code>positive</code>, and <code>negative</code>
242
+ * Approximate statistics based on the first 1000 samples:
243
+ | | query | positive | negative |
244
+ |:--------|:----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|
245
+ | type | string | string | string |
246
+ | details | <ul><li>min: 5 tokens</li><li>mean: 10.17 tokens</li><li>max: 32 tokens</li></ul> | <ul><li>min: 20 tokens</li><li>mean: 31.92 tokens</li><li>max: 32 tokens</li></ul> | <ul><li>min: 16 tokens</li><li>mean: 31.93 tokens</li><li>max: 32 tokens</li></ul> |
247
+ * Samples:
248
+ | query | positive | negative |
249
+ |:---------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
250
+ | <code>what is a hypermarket</code> | <code>By definition a hypermarket is the combination of a supermarket and a department store which has at least 150,000 square feet of floor space, and at least 35% of that space is used for the sale of nonfood merchandise. Generally the terms hypermarket, and superstore are used interchangeably.</code> | <code>hypermarket meaning, definition, what is hypermarket: a very large shop, usually outside the centre of town. Learn more.</code> |
251
+ | <code>what is fd&c yellow #6 lake.</code> | <code>FD&C Yellow No. 6 Lake is a color additive used for drug dosage forms such as tablets and capsules. It is also approved for use in foods and cosmetics. FD&C Yellow No. 6 Lake imparts a reddish-yellow color to medicinal dosage forms. FDA performs regulatory review for color additives used in foods, drugs, cosmetics, and medical devices. FD&C specifies the color is approved for use in food, drugs and cosmetics. FD&C Yellow No. 6 Lake may be safely used as a color additive when following FDA specifications. To form lake colors, straight dyes (such as FD&C Yellow No. 6) are mixed with precipitants and salts. Aluminum may be a component. Lakes may be used as color additives for tablet coatings due to their stability.</code> | <code>Coumadin: 6 mg [scored; contains fd&c blue #1 aluminum lake, fd&c yellow #6 aluminum lake] Coumadin: 7.5 mg [scored; contains fd&c yellow #10 aluminum lake, fd&c yellow #6 aluminum lake] Coumadin: 10 mg [scored; dye free] Jantoven: 1 mg [scored; contains fd&c red #40 aluminum lake]</code> |
252
+ | <code>how long can ringworm live on clothes</code> | <code>-Sometimes the ringworm on the scalp can causes patches of hair loss. Ringworm in dogs can be spread many of the same ways. Even sharing clothes, towels, or combs may result in spreading the infection. Ringworm is caused by different kinds of fungus on the skin, hair, or nails caused by an infection.he fungus that causes ringworm can typically live up to 7 days on surfaces such as counter tops, carpets, and floors, but it has been reported that some types can live up to one year.</code> | <code>What Causes Ringworm? Ringworm is more common in unsanitary and crowded places. That's because it can live on both skin and surfaces like shower floors, and can be transferred by sharing clothes, sheets, and towels. Even other mammals, including cats and dogs, can easily transfer ringworm to humans. What Are the Types of Ringworm?</code> |
253
+ * Loss: <code>pylate.losses.contrastive.Contrastive</code>
254
+
255
+ ### Training Hyperparameters
256
+ #### Non-Default Hyperparameters
257
+
258
+ - `per_device_train_batch_size`: 32
259
+ - `per_device_eval_batch_size`: 32
260
+ - `learning_rate`: 3e-06
261
+ - `num_train_epochs`: 1
262
+ - `fp16`: True
263
+
264
+ #### All Hyperparameters
265
+ <details><summary>Click to expand</summary>
266
+
267
+ - `overwrite_output_dir`: False
268
+ - `do_predict`: False
269
+ - `eval_strategy`: no
270
+ - `prediction_loss_only`: True
271
+ - `per_device_train_batch_size`: 32
272
+ - `per_device_eval_batch_size`: 32
273
+ - `per_gpu_train_batch_size`: None
274
+ - `per_gpu_eval_batch_size`: None
275
+ - `gradient_accumulation_steps`: 1
276
+ - `eval_accumulation_steps`: None
277
+ - `torch_empty_cache_steps`: None
278
+ - `learning_rate`: 3e-06
279
+ - `weight_decay`: 0.0
280
+ - `adam_beta1`: 0.9
281
+ - `adam_beta2`: 0.999
282
+ - `adam_epsilon`: 1e-08
283
+ - `max_grad_norm`: 1.0
284
+ - `num_train_epochs`: 1
285
+ - `max_steps`: -1
286
+ - `lr_scheduler_type`: linear
287
+ - `lr_scheduler_kwargs`: {}
288
+ - `warmup_ratio`: 0.0
289
+ - `warmup_steps`: 0
290
+ - `log_level`: passive
291
+ - `log_level_replica`: warning
292
+ - `log_on_each_node`: True
293
+ - `logging_nan_inf_filter`: True
294
+ - `save_safetensors`: True
295
+ - `save_on_each_node`: False
296
+ - `save_only_model`: False
297
+ - `restore_callback_states_from_checkpoint`: False
298
+ - `no_cuda`: False
299
+ - `use_cpu`: False
300
+ - `use_mps_device`: False
301
+ - `seed`: 42
302
+ - `data_seed`: None
303
+ - `jit_mode_eval`: False
304
+ - `use_ipex`: False
305
+ - `bf16`: False
306
+ - `fp16`: True
307
+ - `fp16_opt_level`: O1
308
+ - `half_precision_backend`: auto
309
+ - `bf16_full_eval`: False
310
+ - `fp16_full_eval`: False
311
+ - `tf32`: None
312
+ - `local_rank`: 0
313
+ - `ddp_backend`: None
314
+ - `tpu_num_cores`: None
315
+ - `tpu_metrics_debug`: False
316
+ - `debug`: []
317
+ - `dataloader_drop_last`: False
318
+ - `dataloader_num_workers`: 0
319
+ - `dataloader_prefetch_factor`: None
320
+ - `past_index`: -1
321
+ - `disable_tqdm`: False
322
+ - `remove_unused_columns`: True
323
+ - `label_names`: None
324
+ - `load_best_model_at_end`: False
325
+ - `ignore_data_skip`: False
326
+ - `fsdp`: []
327
+ - `fsdp_min_num_params`: 0
328
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
329
+ - `fsdp_transformer_layer_cls_to_wrap`: None
330
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
331
+ - `deepspeed`: None
332
+ - `label_smoothing_factor`: 0.0
333
+ - `optim`: adamw_torch
334
+ - `optim_args`: None
335
+ - `adafactor`: False
336
+ - `group_by_length`: False
337
+ - `length_column_name`: length
338
+ - `ddp_find_unused_parameters`: None
339
+ - `ddp_bucket_cap_mb`: None
340
+ - `ddp_broadcast_buffers`: False
341
+ - `dataloader_pin_memory`: True
342
+ - `dataloader_persistent_workers`: False
343
+ - `skip_memory_metrics`: True
344
+ - `use_legacy_prediction_loop`: False
345
+ - `push_to_hub`: False
346
+ - `resume_from_checkpoint`: None
347
+ - `hub_model_id`: None
348
+ - `hub_strategy`: every_save
349
+ - `hub_private_repo`: None
350
+ - `hub_always_push`: False
351
+ - `gradient_checkpointing`: False
352
+ - `gradient_checkpointing_kwargs`: None
353
+ - `include_inputs_for_metrics`: False
354
+ - `include_for_metrics`: []
355
+ - `eval_do_concat_batches`: True
356
+ - `fp16_backend`: auto
357
+ - `push_to_hub_model_id`: None
358
+ - `push_to_hub_organization`: None
359
+ - `mp_parameters`:
360
+ - `auto_find_batch_size`: False
361
+ - `full_determinism`: False
362
+ - `torchdynamo`: None
363
+ - `ray_scope`: last
364
+ - `ddp_timeout`: 1800
365
+ - `torch_compile`: False
366
+ - `torch_compile_backend`: None
367
+ - `torch_compile_mode`: None
368
+ - `dispatch_batches`: None
369
+ - `split_batches`: None
370
+ - `include_tokens_per_second`: False
371
+ - `include_num_input_tokens_seen`: False
372
+ - `neftune_noise_alpha`: None
373
+ - `optim_target_modules`: None
374
+ - `batch_eval_metrics`: False
375
+ - `eval_on_start`: False
376
+ - `use_liger_kernel`: False
377
+ - `eval_use_gather_object`: False
378
+ - `average_tokens_across_devices`: False
379
+ - `prompts`: None
380
+ - `batch_sampler`: batch_sampler
381
+ - `multi_dataset_batch_sampler`: proportional
382
+
383
+ </details>
384
+
385
+ ### Training Logs
386
+ | Epoch | Step | Training Loss |
387
+ |:------:|:-----:|:-------------:|
388
+ | 0.0321 | 500 | 0.4976 |
389
+ | 0.0643 | 1000 | 0.3532 |
390
+ | 0.0964 | 1500 | 0.3195 |
391
+ | 0.1285 | 2000 | 0.3079 |
392
+ | 0.1607 | 2500 | 0.3067 |
393
+ | 0.1928 | 3000 | 0.2957 |
394
+ | 0.2249 | 3500 | 0.3086 |
395
+ | 0.2571 | 4000 | 0.2927 |
396
+ | 0.2892 | 4500 | 0.2922 |
397
+ | 0.3213 | 5000 | 0.2931 |
398
+ | 0.3535 | 5500 | 0.2957 |
399
+ | 0.3856 | 6000 | 0.2809 |
400
+ | 0.4177 | 6500 | 0.2773 |
401
+ | 0.4499 | 7000 | 0.2728 |
402
+ | 0.4820 | 7500 | 0.2888 |
403
+ | 0.5141 | 8000 | 0.2863 |
404
+ | 0.5463 | 8500 | 0.2813 |
405
+ | 0.5784 | 9000 | 0.2695 |
406
+ | 0.6105 | 9500 | 0.2834 |
407
+ | 0.6427 | 10000 | 0.2739 |
408
+ | 0.6748 | 10500 | 0.2744 |
409
+ | 0.7069 | 11000 | 0.2849 |
410
+ | 0.7391 | 11500 | 0.2808 |
411
+ | 0.7712 | 12000 | 0.2796 |
412
+ | 0.8033 | 12500 | 0.2772 |
413
+ | 0.8355 | 13000 | 0.2813 |
414
+ | 0.8676 | 13500 | 0.2756 |
415
+ | 0.8997 | 14000 | 0.2771 |
416
+ | 0.9319 | 14500 | 0.2830 |
417
+ | 0.9640 | 15000 | 0.2731 |
418
+ | 0.9961 | 15500 | 0.2865 |
419
+
420
+
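The Epoch column in the training log follows from the dataset size and the batch size. The check below is a sketch that assumes single-device training; the card lists `gradient_accumulation_steps`: 1 and `dataloader_drop_last`: False but does not state the device count.

```python
import math

train_samples = 497_901      # "Size" of the training dataset
batch_size = 32              # per_device_train_batch_size
devices = 1                  # assumption: single-device training

# The last partial batch is kept (dataloader_drop_last is False), hence ceil.
steps_per_epoch = math.ceil(train_samples / (batch_size * devices))

def epoch_at(step):
    """Fraction of an epoch completed after `step` optimizer steps."""
    return step / steps_per_epoch

print(steps_per_epoch, round(epoch_at(500), 4), round(epoch_at(15500), 4))
```

Under that assumption the computed values agree with the first and last rows of the table.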
421
+ ### Framework Versions
422
+ - Python: 3.12.4
423
+ - Sentence Transformers: 4.0.2
424
+ - PyLate: 1.2.0
425
+ - Transformers: 4.48.2
426
+ - PyTorch: 2.6.0+cu124
427
+ - Accelerate: 1.7.0
428
+ - Datasets: 3.6.0
429
+ - Tokenizers: 0.21.1
430
+
431
+
432
+ ## Citation
433
+
434
+ ### BibTeX
435
+
436
+ #### Sentence Transformers
437
+ ```bibtex
438
+ @inproceedings{reimers-2019-sentence-bert,
439
+ title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
440
+ author = "Reimers, Nils and Gurevych, Iryna",
441
+ booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
442
+ month = "11",
443
+ year = "2019",
444
+ publisher = "Association for Computational Linguistics",
445
+ url = "https://arxiv.org/abs/1908.10084"
446
+ }
447
+ ```
448
+
449
+ #### PyLate
450
+ ```bibtex
451
+ @misc{PyLate,
452
+ title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
453
+ author={Chaffin, Antoine and Sourty, Raphaël},
454
+ url={https://github.com/lightonai/pylate},
455
+ year={2024}
456
+ }
457
+ ```
458
+
459
+ <!--
460
+ ## Glossary
461
+
462
+ *Clearly define terms in order to be accessible across audiences.*
463
+ -->
464
+
465
+ <!--
466
+ ## Model Card Authors
467
+
468
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
469
+ -->
470
+
471
+ <!--
472
+ ## Model Card Contact
473
+
474
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
475
+ -->
added_tokens.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "[D] ": 30523,
3
+ "[Q] ": 30522
4
+ }
config.json ADDED
@@ -0,0 +1,26 @@
1
+ {
2
+ "_name_or_path": "colbert-ir/colbertv2.0",
3
+ "architectures": [
4
+ "BertModel"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.1,
7
+ "classifier_dropout": null,
8
+ "gradient_checkpointing": false,
9
+ "hidden_act": "gelu",
10
+ "hidden_dropout_prob": 0.1,
11
+ "hidden_size": 768,
12
+ "initializer_range": 0.02,
13
+ "intermediate_size": 3072,
14
+ "layer_norm_eps": 1e-12,
15
+ "max_position_embeddings": 512,
16
+ "model_type": "bert",
17
+ "num_attention_heads": 12,
18
+ "num_hidden_layers": 12,
19
+ "pad_token_id": 0,
20
+ "position_embedding_type": "absolute",
21
+ "torch_dtype": "float32",
22
+ "transformers_version": "4.48.2",
23
+ "type_vocab_size": 2,
24
+ "use_cache": true,
25
+ "vocab_size": 30524
26
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,49 @@
1
+ {
2
+ "__version__": {
3
+ "sentence_transformers": "4.0.2",
4
+ "transformers": "4.48.2",
5
+ "pytorch": "2.6.0+cu124"
6
+ },
7
+ "prompts": {},
8
+ "default_prompt_name": null,
9
+ "similarity_fn_name": "MaxSim",
10
+ "query_prefix": "[Q] ",
11
+ "document_prefix": "[D] ",
12
+ "query_length": 32,
13
+ "document_length": 180,
14
+ "attend_to_expansion_tokens": false,
15
+ "skiplist_words": [
16
+ "!",
17
+ "\"",
18
+ "#",
19
+ "$",
20
+ "%",
21
+ "&",
22
+ "'",
23
+ "(",
24
+ ")",
25
+ "*",
26
+ "+",
27
+ ",",
28
+ "-",
29
+ ".",
30
+ "/",
31
+ ":",
32
+ ";",
33
+ "<",
34
+ "=",
35
+ ">",
36
+ "?",
37
+ "@",
38
+ "[",
39
+ "\\",
40
+ "]",
41
+ "^",
42
+ "_",
43
+ "`",
44
+ "{",
45
+ "|",
46
+ "}",
47
+ "~"
48
+ ]
49
+ }
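The `skiplist_words` list above is the same 32 symbols, in the same order, as Python's `string.punctuation`. The sketch below shows one way such a configuration could be applied to a document: prepend the document prefix, then drop skiplisted tokens so that punctuation does not take part in scoring. The whitespace `split` is a stand-in for the real BERT tokenizer, and the helper `document_tokens_to_score` is hypothetical, not a PyLate function.

```python
import string

skiplist = list(string.punctuation)   # same 32 symbols as "skiplist_words"
document_prefix = "[D] "              # from config_sentence_transformers.json

def document_tokens_to_score(text):
    """Prefix the text, split on whitespace, and drop skiplisted tokens."""
    tokens = (document_prefix + text).split()
    return [t for t in tokens if t not in skiplist]

kept = document_tokens_to_score("late interaction , explained !")
```

In a real pipeline the filtering would operate on token ids and embeddings rather than on strings.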
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1adf0745524e4cb22cf93c06912624fee218aa4f520797622bb4a9f09f899801
3
+ size 437957472
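The file size in this pointer can be cross-checked against `config.json`. The sketch below counts the parameters of a standard BERT encoder with those settings; note that the vocabulary size of 30524 is the 30522-entry base vocabulary plus the two added tokens `[Q] ` and `[D] `. It assumes float32 storage (as `torch_dtype` suggests) and that the checkpoint includes the pooler layer, neither of which this diff confirms directly.

```python
# Parameter count for a BERT encoder with the values in config.json.
vocab, hidden, inter, layers, positions, types = 30524, 768, 3072, 12, 512, 2

layer_norm = 2 * hidden                                  # weight + bias
embeddings = (vocab + positions + types) * hidden + layer_norm
attention = 4 * (hidden * hidden + hidden) + layer_norm  # Q, K, V, output projection
feed_forward = (hidden * inter + inter) + (inter * hidden + hidden) + layer_norm
pooler = hidden * hidden + hidden                        # assumption: pooler is saved

total = embeddings + layers * (attention + feed_forward) + pooler
payload_bytes = total * 4            # assumption: float32 storage
file_size = 437957472                # from the LFS pointer

print(total, payload_bytes, file_size - payload_bytes)
```

Under these assumptions the tensor payload accounts for all but a few kilobytes of the file, which is plausible for the header of a file holding a couple of hundred tensors.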
modules.json ADDED
@@ -0,0 +1,14 @@
1
+ [
2
+ {
3
+ "idx": 0,
4
+ "name": "0",
5
+ "path": "",
6
+ "type": "sentence_transformers.models.Transformer"
7
+ },
8
+ {
9
+ "idx": 1,
10
+ "name": "1",
11
+ "path": "1_Dense",
12
+ "type": "pylate.models.Dense.Dense"
13
+ }
14
+ ]
optimizer.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6606b030dcf8392882879051b69901f32433b8cbf0ed064760a894655d01fc63
3
+ size 872097594
rng_state.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0a81960726957c865f710e0fae56235c0206117db68a842694c4d869ee94467f
3
+ size 14244
scheduler.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c28bb2e8575066a5c01e339540f590e3541d3814d4b5cac22687bd40d09c53b5
3
+ size 1064
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "max_seq_length": 179,
3
+ "do_lower_case": false
4
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
1
+ {
2
+ "cls_token": {
3
+ "content": "[CLS]",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "mask_token": {
10
+ "content": "[MASK]",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": "[MASK]",
17
+ "sep_token": {
18
+ "content": "[SEP]",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ },
24
+ "unk_token": {
25
+ "content": "[UNK]",
26
+ "lstrip": false,
27
+ "normalized": false,
28
+ "rstrip": false,
29
+ "single_word": false
30
+ }
31
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
tokenizer_config.json ADDED
@@ -0,0 +1,72 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "[PAD]",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "100": {
12
+ "content": "[UNK]",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "101": {
20
+ "content": "[CLS]",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "102": {
28
+ "content": "[SEP]",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "103": {
36
+ "content": "[MASK]",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ },
43
+ "30522": {
44
+ "content": "[Q] ",
45
+ "lstrip": false,
46
+ "normalized": true,
47
+ "rstrip": false,
48
+ "single_word": false,
49
+ "special": false
50
+ },
51
+ "30523": {
52
+ "content": "[D] ",
53
+ "lstrip": false,
54
+ "normalized": true,
55
+ "rstrip": false,
56
+ "single_word": false,
57
+ "special": false
58
+ }
59
+ },
60
+ "clean_up_tokenization_spaces": false,
61
+ "cls_token": "[CLS]",
62
+ "do_lower_case": true,
63
+ "extra_special_tokens": {},
64
+ "mask_token": "[MASK]",
65
+ "model_max_length": 512,
66
+ "pad_token": "[MASK]",
67
+ "sep_token": "[SEP]",
68
+ "strip_accents": null,
69
+ "tokenize_chinese_chars": true,
70
+ "tokenizer_class": "BertTokenizer",
71
+ "unk_token": "[UNK]"
72
+ }
trainer_state.json ADDED
@@ -0,0 +1,250 @@
1
+ {
2
+ "best_metric": null,
3
+ "best_model_checkpoint": null,
4
+ "epoch": 0.9961439588688946,
5
+ "eval_steps": 500,
6
+ "global_step": 15500,
7
+ "is_hyper_param_search": false,
8
+ "is_local_process_zero": true,
9
+ "is_world_process_zero": true,
10
+ "log_history": [
+ {
+ "epoch": 0.032133676092544985,
+ "grad_norm": 7.696083068847656,
+ "learning_rate": 2.9039845758354757e-06,
+ "loss": 0.4976,
+ "step": 500
+ },
+ {
+ "epoch": 0.06426735218508997,
+ "grad_norm": 4.741117000579834,
+ "learning_rate": 2.807583547557841e-06,
+ "loss": 0.3532,
+ "step": 1000
+ },
+ {
+ "epoch": 0.09640102827763496,
+ "grad_norm": 6.354885101318359,
+ "learning_rate": 2.711182519280206e-06,
+ "loss": 0.3195,
+ "step": 1500
+ },
+ {
+ "epoch": 0.12853470437017994,
+ "grad_norm": 4.841858386993408,
+ "learning_rate": 2.614781491002571e-06,
+ "loss": 0.3079,
+ "step": 2000
+ },
+ {
+ "epoch": 0.16066838046272494,
+ "grad_norm": 4.776275634765625,
+ "learning_rate": 2.518573264781491e-06,
+ "loss": 0.3067,
+ "step": 2500
+ },
+ {
+ "epoch": 0.1928020565552699,
+ "grad_norm": 5.285233974456787,
+ "learning_rate": 2.422172236503856e-06,
+ "loss": 0.2957,
+ "step": 3000
+ },
+ {
+ "epoch": 0.2249357326478149,
+ "grad_norm": 5.628823757171631,
+ "learning_rate": 2.3257712082262213e-06,
+ "loss": 0.3086,
+ "step": 3500
+ },
+ {
+ "epoch": 0.2570694087403599,
+ "grad_norm": 4.082389831542969,
+ "learning_rate": 2.229370179948586e-06,
+ "loss": 0.2927,
+ "step": 4000
+ },
+ {
+ "epoch": 0.2892030848329049,
+ "grad_norm": 5.4696478843688965,
+ "learning_rate": 2.1331619537275066e-06,
+ "loss": 0.2922,
+ "step": 4500
+ },
+ {
+ "epoch": 0.3213367609254499,
+ "grad_norm": 4.862800598144531,
+ "learning_rate": 2.0367609254498712e-06,
+ "loss": 0.2931,
+ "step": 5000
+ },
+ {
+ "epoch": 0.35347043701799485,
+ "grad_norm": 4.961813449859619,
+ "learning_rate": 1.9403598971722367e-06,
+ "loss": 0.2957,
+ "step": 5500
+ },
+ {
+ "epoch": 0.3856041131105398,
+ "grad_norm": 4.734184741973877,
+ "learning_rate": 1.8439588688946016e-06,
+ "loss": 0.2809,
+ "step": 6000
+ },
+ {
+ "epoch": 0.41773778920308485,
+ "grad_norm": 4.716980934143066,
+ "learning_rate": 1.7477506426735218e-06,
+ "loss": 0.2773,
+ "step": 6500
+ },
+ {
+ "epoch": 0.4498714652956298,
+ "grad_norm": 4.844335079193115,
+ "learning_rate": 1.651349614395887e-06,
+ "loss": 0.2728,
+ "step": 7000
+ },
+ {
+ "epoch": 0.4820051413881748,
+ "grad_norm": 5.491813659667969,
+ "learning_rate": 1.554948586118252e-06,
+ "loss": 0.2888,
+ "step": 7500
+ },
+ {
+ "epoch": 0.5141388174807198,
+ "grad_norm": 4.701641082763672,
+ "learning_rate": 1.458547557840617e-06,
+ "loss": 0.2863,
+ "step": 8000
+ },
+ {
+ "epoch": 0.5462724935732648,
+ "grad_norm": 5.017972469329834,
+ "learning_rate": 1.3623393316195374e-06,
+ "loss": 0.2813,
+ "step": 8500
+ },
+ {
+ "epoch": 0.5784061696658098,
+ "grad_norm": 5.8628764152526855,
+ "learning_rate": 1.2659383033419025e-06,
+ "loss": 0.2695,
+ "step": 9000
+ },
+ {
+ "epoch": 0.6105398457583547,
+ "grad_norm": 5.396206378936768,
+ "learning_rate": 1.1695372750642673e-06,
+ "loss": 0.2834,
+ "step": 9500
+ },
+ {
+ "epoch": 0.6426735218508998,
+ "grad_norm": 4.796625137329102,
+ "learning_rate": 1.0731362467866324e-06,
+ "loss": 0.2739,
+ "step": 10000
+ },
+ {
+ "epoch": 0.6748071979434447,
+ "grad_norm": 3.604219436645508,
+ "learning_rate": 9.769280205655526e-07,
+ "loss": 0.2744,
+ "step": 10500
+ },
+ {
+ "epoch": 0.7069408740359897,
+ "grad_norm": 4.8642048835754395,
+ "learning_rate": 8.80719794344473e-07,
+ "loss": 0.2849,
+ "step": 11000
+ },
+ {
+ "epoch": 0.7390745501285347,
+ "grad_norm": 4.076746940612793,
+ "learning_rate": 7.84318766066838e-07,
+ "loss": 0.2808,
+ "step": 11500
+ },
+ {
+ "epoch": 0.7712082262210797,
+ "grad_norm": 2.8937087059020996,
+ "learning_rate": 6.879177377892031e-07,
+ "loss": 0.2796,
+ "step": 12000
+ },
+ {
+ "epoch": 0.8033419023136247,
+ "grad_norm": 4.379210948944092,
+ "learning_rate": 5.915167095115681e-07,
+ "loss": 0.2772,
+ "step": 12500
+ },
+ {
+ "epoch": 0.8354755784061697,
+ "grad_norm": 6.368309020996094,
+ "learning_rate": 4.951156812339331e-07,
+ "loss": 0.2813,
+ "step": 13000
+ },
+ {
+ "epoch": 0.8676092544987146,
+ "grad_norm": 5.409502029418945,
+ "learning_rate": 3.9871465295629823e-07,
+ "loss": 0.2756,
+ "step": 13500
+ },
+ {
+ "epoch": 0.8997429305912596,
+ "grad_norm": 2.8725733757019043,
+ "learning_rate": 3.0231362467866326e-07,
+ "loss": 0.2771,
+ "step": 14000
+ },
+ {
+ "epoch": 0.9318766066838047,
+ "grad_norm": 8.13409423828125,
+ "learning_rate": 2.059125964010283e-07,
+ "loss": 0.283,
+ "step": 14500
+ },
+ {
+ "epoch": 0.9640102827763496,
+ "grad_norm": 5.422169208526611,
+ "learning_rate": 1.0970437017994858e-07,
+ "loss": 0.2731,
+ "step": 15000
+ },
+ {
+ "epoch": 0.9961439588688946,
+ "grad_norm": 4.597813606262207,
+ "learning_rate": 1.3303341902313626e-08,
+ "loss": 0.2865,
+ "step": 15500
+ }
+ ],
+ "logging_steps": 500,
+ "max_steps": 15560,
+ "num_input_tokens_seen": 0,
+ "num_train_epochs": 1,
+ "save_steps": 500,
+ "stateful_callbacks": {
+ "TrainerControl": {
+ "args": {
+ "should_epoch_stop": false,
+ "should_evaluate": false,
+ "should_log": false,
+ "should_save": true,
+ "should_training_stop": false
+ },
+ "attributes": {}
+ }
+ },
+ "total_flos": 0.0,
+ "train_batch_size": 32,
+ "trial_name": null,
+ "trial_params": null
+ }
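The step and epoch figures in `trainer_state.json` can be cross-checked with a little arithmetic. The sketch below is only an illustration: it takes the dataset size from the model card tag `dataset_size:497901` and assumes a single training process with no gradient accumulation, neither of which is stated in this file. The helper name `epoch_at` is hypothetical.

```python
import math

# Values copied from this commit; the single-process, no-accumulation
# setup is an assumption, not something trainer_state.json records.
dataset_size = 497_901      # model card tag "dataset_size:497901"
train_batch_size = 32       # "train_batch_size" in trainer_state.json
max_steps = 15_560          # "max_steps" in trainer_state.json

# One optimizer step per batch, with a final partial batch rounded up.
steps_per_epoch = math.ceil(dataset_size / train_batch_size)


def epoch_at(step: int) -> float:
    """Fraction of an epoch completed after `step` optimizer steps (hypothetical helper)."""
    return step / steps_per_epoch


print(steps_per_epoch == max_steps)  # one epoch accounts for every step
print(epoch_at(500))                 # compare with the first log entry
print(epoch_at(15_500))              # compare with the top-level "epoch"
```

Under these assumptions the computed steps per epoch equals `max_steps`, and the saved state at `global_step` 15500 sits 60 steps short of the end of the single epoch.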
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bc79f90fbd4e4af766b84ef25beec473aea593f8e5f76062a7a882750e66962b
+ size 5560
vocab.txt ADDED
The diff for this file is too large to render. See raw diff