raphaelsty committed on
Commit 8f73eca · verified · 1 Parent(s): ec27cc8

Update README with LightOn logo, BEIR scores, and improved documentation

  1. README.md +50 -64
README.md CHANGED
@@ -6,23 +6,44 @@ tags:
6
  - dense
7
  pipeline_tag: sentence-similarity
8
  library_name: sentence-transformers
 
 
 
9
  ---
10
 
11
- # SentenceTransformer
 
 
12
 
13
- This is a [sentence-transformers](https://www.SBERT.net) model trained. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
14
 
15
  ## Model Details
16
 
17
  ### Model Description
18
  - **Model Type:** Sentence Transformer
19
- <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
20
  - **Maximum Sequence Length:** 512 tokens
21
  - **Output Dimensionality:** 768 dimensions
22
  - **Similarity Function:** Cosine Similarity
23
- <!-- - **Training Dataset:** Unknown -->
24
- <!-- - **Language:** Unknown -->
25
- <!-- - **License:** Unknown -->
 
26
 
27
  ### Model Sources
28
 
@@ -53,63 +74,37 @@ Then you can load this model and run inference.
53
  ```python
54
  from sentence_transformers import SentenceTransformer
55
 
56
- # Download from the 🤗 Hub
57
- model = SentenceTransformer("lightonai/LateOn-supervised")
 
58
  # Run inference
59
  queries = [
60
  "Which planet is known as the Red Planet?",
61
  ]
62
  documents = [
63
  "Venus is often called Earth's twin because of its similar size and proximity.",
64
- 'Mars, known for its reddish appearance, is often referred to as the Red Planet.',
65
- 'Saturn, famous for its rings, is sometimes mistaken for the Red Planet.',
66
  ]
67
- query_embeddings = model.encode_query(queries)
68
- document_embeddings = model.encode_document(documents)
 
69
  print(query_embeddings.shape, document_embeddings.shape)
70
  # [1, 768] [3, 768]
71
 
72
  # Get the similarity scores for the embeddings
73
  similarities = model.similarity(query_embeddings, document_embeddings)
74
  print(similarities)
75
- # tensor([[0.2046, 0.5422, 0.4971]])
76
  ```
77
 
78
- <!--
79
- ### Direct Usage (Transformers)
80
-
81
- <details><summary>Click to see the direct usage in Transformers</summary>
82
-
83
- </details>
84
- -->
85
-
86
- <!--
87
- ### Downstream Usage (Sentence Transformers)
88
-
89
- You can finetune this model on your own dataset.
90
-
91
- <details><summary>Click to expand</summary>
92
 
93
- </details>
94
- -->
95
-
96
- <!--
97
- ### Out-of-Scope Use
98
-
99
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
100
- -->
101
-
102
- <!--
103
- ## Bias, Risks and Limitations
104
-
105
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
106
- -->
107
-
108
- <!--
109
- ### Recommendations
110
-
111
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
112
- -->
113
 
114
  ## Training Details
115
 
@@ -126,20 +121,11 @@ You can finetune this model on your own dataset.
126
 
127
  ### BibTeX
128
 
129
- <!--
130
- ## Glossary
131
-
132
- *Clearly define terms in order to be accessible across audiences.*
133
- -->
134
-
135
- <!--
136
- ## Model Card Authors
137
-
138
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
139
- -->
140
-
141
- <!--
142
- ## Model Card Contact
143
-
144
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
145
- -->
 
6
  - dense
7
  pipeline_tag: sentence-similarity
8
  library_name: sentence-transformers
9
+ license: apache-2.0
10
+ language:
11
+ - en
12
  ---
13
 
14
+ <p align="center">
15
+ <img src="https://cdn-avatars.huggingface.co/v1/production/uploads/1651597775471-62715572ab9243b5d40cbb1d.png" alt="LightOn" width="120">
16
+ </p>
17
 
18
+ <h1 align="center">DenseOn</h1>
19
+
20
+ <h3 align="center">State-of-the-Art Dense Retrieval Model by LightOn</h3>
21
+
22
+ <p align="center">
23
+ <a href="https://huggingface.co/lightonai/DenseOn">DenseOn</a> |
24
+ <a href="https://huggingface.co/lightonai/LateOn">LateOn</a> |
25
+ <a href="https://github.com/lightonai/pylate">PyLate</a> |
26
+ <a href="https://github.com/lightonai/fast-plaid">FastPLAID</a>
27
+ </p>
28
+
29
+ ---
30
+
31
+ **DenseOn** is a dense (single-vector) retrieval model built on ModernBERT (149M parameters) and trained by [LightOn](https://lighton.ai). It encodes queries and documents independently, prepending a `query:` or `document:` prefix and using CLS pooling, and scores query-document pairs with cosine similarity.
32
+
33
+ DenseOn achieves **56.75** average NDCG@10 on BEIR (14 datasets) and **57.71** on decontaminated BEIR (12 datasets), topping all base-size dense models and outperforming models up to 4x larger. See our [blog post](TODO) for full results and analysis.
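
For readers unfamiliar with the metric, NDCG@10 can be sketched roughly as below. This is a minimal illustration using the linear-gain formulation; the function names and the toy relevance labels are hypothetical, and it is not the evaluation code behind the reported BEIR scores.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance labels (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the ranking divided by the DCG of the ideal ordering.

    Simplification: the ideal ordering is built only from the labels in the
    retrieved list; standard evaluators use every judged document for the query.
    """
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance labels of retrieved documents, in ranked order.
perfect = ndcg_at_k([1, 1, 0, 0])  # relevant documents ranked first
swapped = ndcg_at_k([0, 1, 1, 0])  # an irrelevant document ranked first
print(perfect, swapped)
```

A benchmark score such as the BEIR average is then the mean of this per-query value over all queries of a dataset, averaged again across datasets.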
34
 
35
  ## Model Details
36
 
37
  ### Model Description
38
  - **Model Type:** Sentence Transformer
39
+ - **Base model:** [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) (149M parameters)
40
  - **Maximum Sequence Length:** 512 tokens
41
  - **Output Dimensionality:** 768 dimensions
42
  - **Similarity Function:** Cosine Similarity
43
+ - **Pooling:** CLS token
44
+ - **Prompts:** `query:` for queries, `document:` for documents
45
+ - **Language:** English
46
+ - **License:** Apache 2.0
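
The pooling and similarity settings listed above can be illustrated with a small sketch. It uses plain Python and toy 4-dimensional vectors (the model itself outputs 768 dimensions); the helper names and the numbers are hypothetical and only show the idea, not the library's internals.

```python
import math


def cls_pool(token_embeddings):
    """CLS pooling: use the first token's embedding as the sentence vector."""
    return token_embeddings[0]


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy per-token embeddings (made-up values); row 0 stands in for the CLS token.
query_tokens = [[1.0, 0.0, 0.0, 0.0], [0.2, 0.1, 0.0, 0.3], [0.0, 0.5, 0.5, 0.0]]
doc_tokens = [[2.0, 0.0, 0.0, 0.0], [0.9, 0.9, 0.1, 0.0]]

score = cosine_similarity(cls_pool(query_tokens), cls_pool(doc_tokens))
print(score)
```

Because cosine similarity normalizes by vector length, the two CLS vectors here score as identical in direction even though their magnitudes differ.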
47
 
48
  ### Model Sources
49
 
 
74
  ```python
75
  from sentence_transformers import SentenceTransformer
76
 
77
+ # Download from the Hub
78
+ model = SentenceTransformer("lightonai/DenseOn")
79
+
80
  # Run inference
81
  queries = [
82
  "Which planet is known as the Red Planet?",
83
  ]
84
  documents = [
85
  "Venus is often called Earth's twin because of its similar size and proximity.",
86
+ "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
87
+ "Saturn, famous for its rings, is sometimes mistaken for the Red Planet.",
88
  ]
89
+
90
+ query_embeddings = model.encode(queries, prompt_name="query")
91
+ document_embeddings = model.encode(documents, prompt_name="document")
92
  print(query_embeddings.shape, document_embeddings.shape)
93
  # [1, 768] [3, 768]
94
 
95
  # Get the similarity scores for the embeddings
96
  similarities = model.similarity(query_embeddings, document_embeddings)
97
  print(similarities)
 
98
  ```
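
As a rough sketch of what happens around the snippet above: selecting a prompt by name prepends the corresponding prefix to each text, and the similarity scores can then be sorted to rank documents. The exact prompt strings (including the trailing space) and the scores below are assumptions for illustration, not values read from the model.

```python
# Assumed prompt strings: the model card only states `query:` / `document:`
# prefixes, so the trailing space is a guess.
PROMPTS = {"query": "query: ", "document": "document: "}


def apply_prompt(texts, prompt_name):
    """Prepend the named prompt to every text before encoding."""
    prefix = PROMPTS[prompt_name]
    return [prefix + text for text in texts]


def rank_documents(scores):
    """Return document indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


prefixed = apply_prompt(["Which planet is known as the Red Planet?"], "query")

# Hypothetical similarity scores for three documents (not real model output).
order = rank_documents([0.21, 0.58, 0.47])
print(prefixed[0])
print(order)
```

With these made-up scores, the document at index 1 would be returned first.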
99
 
100
+ ## Related Models
101
 
102
+ | Model | Description | Link |
103
+ |-------|-------------|------|
104
+ | **DenseOn** | Supervised dense model (this model) | [lightonai/DenseOn](https://huggingface.co/lightonai/DenseOn) |
105
+ | **DenseOn-unsupervised** | Pre-training-only checkpoint | [lightonai/DenseOn-unsupervised](https://huggingface.co/lightonai/DenseOn-unsupervised) |
106
+ | **LateOn** | Supervised ColBERT model | [lightonai/LateOn](https://huggingface.co/lightonai/LateOn) |
107
+ | **LateOn-unsupervised** | Pre-training-only checkpoint | [lightonai/LateOn-unsupervised](https://huggingface.co/lightonai/LateOn-unsupervised) |
108
 
109
  ## Training Details
110
 
 
121
 
122
  ### BibTeX
123
 
124
+ ```bibtex
125
+ @inproceedings{chaffin2025pylate,
126
+ title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
127
+ author={Chaffin, Antoine and Sourty, Raphael},
128
+ booktitle={Proceedings of CIKM},
129
+ year={2025}
130
+ }
131
+ ```