lightonai/LateOn-unsupervised

@@ -7,24 +7,43 @@ tags:
 - feature-extraction
 pipeline_tag: sentence-similarity
 library_name: PyLate
 ---
-# PyLate
-This is a [PyLate](https://github.com/lightonai/pylate) model trained. It maps sentences & paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.
 ## Model Details
 ### Model Description
-- **Model Type:** PyLate model
-<!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
 - **Document Length:** 300 tokens
 - **Query Length:** 32 tokens
 - **Output Dimensionality:** 128 dimensions
 - **Similarity Function:** MaxSim
-<!-- - **Training Dataset:** Unknown -->
-<!-- - **Language:** Unknown -->
-<!-- - **License:** Unknown -->
 ### Model Sources
@@ -114,7 +133,7 @@ retriever = retrieve.ColBERT(index=index)
 queries_embeddings = model.encode(
     ["query for document 3", "query for document 1"],
     batch_size=32,
-    is_query=True,  #  # Ensure that it is set to False to indicate that these are queries
     show_progress_bar=True,
 )
@@ -167,63 +186,35 @@ reranked_documents = rank.rerank(
 )
 ```
-<!--
-### Direct Usage (Transformers)
-<details><summary>Click to see the direct usage in Transformers</summary>
-</details>
--->
-<!--
-### Downstream Usage (Sentence Transformers)
-You can finetune this model on your own dataset.
-<details><summary>Click to expand</summary>
-</details>
--->
-<!--
-### Out-of-Scope Use
-*List how the model may foreseeably be misused and address what users ought not to do with the model.*
--->
-<!--
-## Bias, Risks and Limitations
-*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
--->
-<!--
-### Recommendations
-*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
--->
 ## Training Details
 ### Framework Versions
-- Python: 3.12.9
-- Sentence Transformers: 5.2.0
-- PyLate: 1.4.0
-- Transformers: 4.57.3
-- PyTorch: 2.7.0+cu126
-- Accelerate: 1.6.0
-- Datasets: 4.4.1
-- Tokenizers: 0.22.2
 ## Citation
 ### BibTeX
 ```bibtex
-@misc{sourty2026denseonlateon,
-  title={DenseOn with the LateOn: Open State-of-the-Art Single and Multi-Vector Models},
-  author={Sourty, Raphael and Chaffin, Antoine and Weller, Orion and Demoura, Paulo and Chatelain, Amelie},
   year={2026},
   howpublished={\url{https://huggingface.co/blog/lightonai/denseon-lateon}},
 }
@@ -255,6 +246,24 @@ You can finetune this model on your own dataset.
 }
 ```
 <!--
 ## Glossary

 - feature-extraction
 pipeline_tag: sentence-similarity
 library_name: PyLate
+license: apache-2.0
+language:
+- en
 ---
+<p align="center">
+<img src="https://cdn-uploads.huggingface.co/production/uploads/609bbe2f4932693ca2009d6a/kbQOAarw0eaApow3M9HIl.png" alt="LightOn" width="512">
+</p>
+<h1 align="center">LateOn-unsupervised</h1>
+<h3 align="center">Unsupervised contrastive pre-training checkpoint by LightOn</h3>
+<p align="center">
+<a href="https://huggingface.co/lightonai/LateOn">LateOn</a> |
+<a href="https://huggingface.co/lightonai/DenseOn">DenseOn</a> |
+<a href="https://github.com/lightonai/pylate">PyLate</a> |
+<a href="https://github.com/lightonai/fast-plaid">FastPLAID</a>
+</p>
+---
+**LateOn-unsupervised** is an unsupervised contrastive pre-training checkpoint built on ModernBERT (149M parameters), trained by [LightOn](https://lighton.ai) using [PyLate](https://github.com/lightonai/pylate). It serves as the foundation for building [LateOn](https://huggingface.co/lightonai/LateOn), a ColBERT retrieval model that encodes queries and documents independently into multi-vector representations, using `[Q]`/`[D]` prefixes and token-level similarity with MaxSim scoring.
+For the final late-interaction retrieval model, use [LateOn](https://huggingface.co/lightonai/LateOn), which adds supervised fine-tuning with mined hard negatives on top of this checkpoint. See our [blog post](TODO) for full results and analysis.
 ## Model Details
 ### Model Description
+- **Model Type:** PyLate ColBERT model
+- **Base model:** [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) (149M parameters)
 - **Document Length:** 300 tokens
 - **Query Length:** 32 tokens
 - **Output Dimensionality:** 128 dimensions
 - **Similarity Function:** MaxSim
+- **Language:** English
+- **License:** Apache 2.0
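
The MaxSim similarity function listed above can be illustrated with a small plain-Python sketch. This is not PyLate's implementation: the helper names `dot` and `maxsim` and the toy 2-dimensional vectors are hypothetical, chosen only to show how per-token embeddings are scored (for each query token, take the best-matching document token, then sum over query tokens).

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the highest dot product with any document token."""
    if not query_vecs or not doc_vecs:
        raise ValueError("need at least one query vector and one document vector")
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-dimensional token embeddings (hypothetical values);
# the model described here emits 128-dimensional vectors per token.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_b = [[0.5, 0.5], [0.0, 0.0]]

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
print(score_a, score_b)  # doc_a matches every query token exactly, so it scores higher
```

Because each query token only keeps its single best document match, a document is rewarded for covering every query token somewhere in its text rather than for overall vector closeness.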
 ### Model Sources
 queries_embeddings = model.encode(
     ["query for document 3", "query for document 1"],
     batch_size=32,
+    is_query=True,  # Ensure that it is set to True to indicate that these are queries
     show_progress_bar=True,
 )
 )
 ```
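
The `is_query` flag in the snippet above decides whether an input is treated as a query or as a document. A rough sketch of what that choice implies, using the `[Q]`/`[D]` prefixes and the 32/300 token limits stated in this card, might look as follows. The function `prepare_tokens` and the whitespace "tokenizer" are hypothetical stand-ins, and counting the prefix toward the length limit is an assumption; the real tokenization is handled inside PyLate.

```python
QUERY_LENGTH = 32       # query length from the model description
DOCUMENT_LENGTH = 300   # document length from the model description


def prepare_tokens(text: str, is_query: bool) -> list[str]:
    """Prepend the role prefix and truncate to the role-specific length.

    Hypothetical helper: splits on whitespace instead of using the model's
    tokenizer, and assumes the prefix counts toward the length limit.
    """
    prefix = "[Q]" if is_query else "[D]"
    max_length = QUERY_LENGTH if is_query else DOCUMENT_LENGTH
    tokens = [prefix] + text.split()
    return tokens[:max_length]


long_text = " ".join(["word"] * 500)
as_query = prepare_tokens(long_text, is_query=True)
as_document = prepare_tokens(long_text, is_query=False)
print(as_query[0], len(as_query))        # query prefix and truncated query length
print(as_document[0], len(as_document))  # document prefix and truncated document length
```

Passing the wrong flag would therefore apply the wrong prefix and the wrong truncation length, which is why the corrected comment matters.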
+## Related Models
+| Model | Description | Link |
+|-------|-------------|------|
+| **LateOn** | Supervised ColBERT model | [lightonai/LateOn](https://huggingface.co/lightonai/LateOn) |
+| **LateOn-unsupervised** | Pre-training-only checkpoint (this model) | [lightonai/LateOn-unsupervised](https://huggingface.co/lightonai/LateOn-unsupervised) |
+| **DenseOn** | Supervised dense (single-vector) model | [lightonai/DenseOn](https://huggingface.co/lightonai/DenseOn) |
+| **DenseOn-unsupervised** | Pre-training-only checkpoint | [lightonai/DenseOn-unsupervised](https://huggingface.co/lightonai/DenseOn-unsupervised) |
 ## Training Details
 ### Framework Versions
+- Python: 3.11.10
+- Sentence Transformers: 5.1.1
+- PyLate: 1.3.4
+- Transformers: 4.57.5
+- PyTorch: 2.9.0+cu128
+- Accelerate: 1.12.0
+- Datasets: 3.6.0
+- Tokenizers: 0.22.1
 ## Citation
 ### BibTeX
 ```bibtex
+@misc{sourty2025denseonlateon,
+  title={DenseOn with LateOn: Open State-of-the-Art Single and Multi-Vector Models},
+  author={Sourty, Raphael and Chaffin, Antoine and Weller, Orion and Moura Junior, Paulo Roberto and Chatelain, Amelie},
   year={2026},
   howpublished={\url{https://huggingface.co/blog/lightonai/denseon-lateon}},
 }
 }
 ```
+<!--
+### Out-of-Scope Use
+*List how the model may foreseeably be misused and address what users ought not to do with the model.*
+-->
+<!--
+## Bias, Risks and Limitations
+*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+-->
+<!--
+### Recommendations
+*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+-->
 <!--
 ## Glossary