Files changed (1)
  1. README.md +63 -54
README.md CHANGED
@@ -7,24 +7,43 @@ tags:
  - feature-extraction
  pipeline_tag: sentence-similarity
  library_name: PyLate
  ---

- # PyLate

- This is a [PyLate](https://github.com/lightonai/pylate) model trained. It maps sentences & paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.

  ## Model Details

  ### Model Description
- - **Model Type:** PyLate model
- <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
  - **Document Length:** 300 tokens
  - **Query Length:** 32 tokens
  - **Output Dimensionality:** 128 tokens
  - **Similarity Function:** MaxSim
- <!-- - **Training Dataset:** Unknown -->
- <!-- - **Language:** Unknown -->
- <!-- - **License:** Unknown -->

  ### Model Sources

@@ -114,7 +133,7 @@ retriever = retrieve.ColBERT(index=index)
  queries_embeddings = model.encode(
      ["query for document 3", "query for document 1"],
      batch_size=32,
-     is_query=True, # # Ensure that it is set to False to indicate that these are queries
      show_progress_bar=True,
  )

@@ -167,63 +186,35 @@ reranked_documents = rank.rerank(
  )
  ```

- <!--
- ### Direct Usage (Transformers)
-
- <details><summary>Click to see the direct usage in Transformers</summary>
-
- </details>
- -->
-
- <!--
- ### Downstream Usage (Sentence Transformers)
-
- You can finetune this model on your own dataset.
-
- <details><summary>Click to expand</summary>
-
- </details>
- -->
-
- <!--
- ### Out-of-Scope Use
-
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->
-
- <!--
- ## Bias, Risks and Limitations
-
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->

- <!--
- ### Recommendations
-
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->

  ## Training Details

  ### Framework Versions
- - Python: 3.12.9
- - Sentence Transformers: 5.2.0
- - PyLate: 1.4.0
- - Transformers: 4.57.3
- - PyTorch: 2.7.0+cu126
- - Accelerate: 1.6.0
- - Datasets: 4.4.1
- - Tokenizers: 0.22.2
-

  ## Citation

  ### BibTeX

  ```bibtex
- @misc{sourty2026denseonlateon,
- title={DenseOn with the LateOn: Open State-of-the-Art Single and Multi-Vector Models},
- author={Sourty, Raphael and Chaffin, Antoine and Weller, Orion and Demoura, Paulo and Chatelain, Amelie},
  year={2026},
  howpublished={\url{https://huggingface.co/blog/lightonai/denseon-lateon}},
  }
@@ -255,6 +246,24 @@ You can finetune this model on your own dataset.
  }
  ```

  <!--
  ## Glossary

 
  - feature-extraction
  pipeline_tag: sentence-similarity
  library_name: PyLate
+ license: apache-2.0
+ language:
+ - en
  ---

+ <p align="center">
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/609bbe2f4932693ca2009d6a/kbQOAarw0eaApow3M9HIl.png" alt="LightOn" width="512">
+ </p>

+ <h1 align="center">LateOn-unsupervised</h1>
+
+ <h3 align="center">Unsupervised contrastive pre-training checkpoint by LightOn</h3>
+
+ <p align="center">
+ <a href="https://huggingface.co/lightonai/LateOn">LateOn</a> |
+ <a href="https://huggingface.co/lightonai/DenseOn">DenseOn</a> |
+ <a href="https://github.com/lightonai/pylate">PyLate</a> |
+ <a href="https://github.com/lightonai/fast-plaid">FastPLAID</a>
+ </p>
+
+ ---
+
+ **LateOn-unsupervised** is an unsupervised contrastive pre-training checkpoint built on ModernBERT (149M parameters), trained by [LightOn](https://lighton.ai) using [PyLate](https://github.com/lightonai/pylate). It serves as the foundation for building [LateOn](https://huggingface.co/lightonai/LateOn), a ColBERT retrieval model that encodes queries and documents independently into multi-vector representations, using `[Q]`/`[D]` prefixes and token-level similarity with MaxSim scoring.
+
+ For the final late-interaction retrieval model, use [LateOn](https://huggingface.co/lightonai/LateOn), which adds supervised fine-tuning with mined hard negatives on top of this checkpoint. See our [blog post](TODO) for full results and analysis.

  ## Model Details

  ### Model Description
+ - **Model Type:** PyLate ColBERT model
+ - **Base model:** [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) (149M parameters)
  - **Document Length:** 300 tokens
  - **Query Length:** 32 tokens
  - **Output Dimensionality:** 128 tokens
  - **Similarity Function:** MaxSim
+ - **Language:** English
+ - **License:** Apache 2.0

  ### Model Sources

 
  queries_embeddings = model.encode(
      ["query for document 3", "query for document 1"],
      batch_size=32,
+     is_query=True, # Ensure that it is set to True to indicate that these are queries
      show_progress_bar=True,
  )

 
  )
  ```

+ ## Related Models

+ | Model | Description | Link |
+ |-------|-------------|------|
+ | **LateOn** | Supervised ColBERT model | [lightonai/LateOn](https://huggingface.co/lightonai/LateOn) |
+ | **LateOn-unsupervised** | Pre-training-only checkpoint (this model)| [lightonai/LateOn-unsupervised](https://huggingface.co/lightonai/LateOn-unsupervised) |
+ | **DenseOn** | Supervised dense (single-vector) model | [lightonai/DenseOn](https://huggingface.co/lightonai/DenseOn) |
+ | **DenseOn-unsupervised** | Pre-training-only checkpoint | [lightonai/DenseOn-unsupervised](https://huggingface.co/lightonai/DenseOn-unsupervised) |

  ## Training Details

  ### Framework Versions
+ - Python: 3.11.10
+ - Sentence Transformers: 5.1.1
+ - PyLate: 1.3.4
+ - Transformers: 4.57.5
+ - PyTorch: 2.9.0+cu128
+ - Accelerate: 1.12.0
+ - Datasets: 3.6.0
+ - Tokenizers: 0.22.1

  ## Citation

  ### BibTeX

  ```bibtex
+ @misc{sourty2025denseonlateon,
+ title={DenseOn with LateOn: Open State-of-the-Art Single and Multi-Vector Models},
+ author={Sourty, Raphael and Chaffin, Antoine and Weller, Orion and Moura Junior, Paulo Roberto and Chatelain, Amelie},
  year={2026},
  howpublished={\url{https://huggingface.co/blog/lightonai/denseon-lateon}},
  }
 
  }
  ```

+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
  <!--
  ## Glossary