NohTow committed on
Commit
eb13084
·
verified ·
1 Parent(s): 119de45

Upload README.md

Files changed (2)
  1. .gitattributes +1 -0
  2. README.md +3 -237
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ README.md filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,237 +1,3 @@
- ---
- tags:
- - ColBERT
- - PyLate
- - sentence-transformers
- - sentence-similarity
- - feature-extraction
- pipeline_tag: sentence-similarity
- library_name: PyLate
- ---
-
- # PyLate
-
- This is a [PyLate](https://github.com/lightonai/pylate) model. It maps sentences and paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.
-
- ## Model Details
-
- ### Model Description
- - **Model Type:** PyLate model
- <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
- - **Document Length:** 2048 tokens
- - **Query Length:** 256 tokens
- - **Output Dimensionality:** 128 dimensions
- - **Similarity Function:** MaxSim
- <!-- - **Training Dataset:** Unknown -->
- <!-- - **Language:** Unknown -->
- <!-- - **License:** Unknown -->
-
- ### Model Sources
-
- - **Documentation:** [PyLate Documentation](https://lightonai.github.io/pylate/)
- - **Repository:** [PyLate on GitHub](https://github.com/lightonai/pylate)
- - **Hugging Face:** [PyLate models on Hugging Face](https://huggingface.co/models?library=PyLate)
-
- ### Full Model Architecture
-
- ```
- ColBERT(
-   (0): Transformer({'max_seq_length': 2047, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
-   (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity', 'use_residual': False})
- )
- ```
-
- ## Usage
- First install the PyLate library:
-
- ```bash
- pip install -U pylate
- ```
-
- ### Retrieval
-
- Use this model with PyLate to index and retrieve documents. The index uses [FastPLAID](https://github.com/lightonai/fast-plaid) for efficient similarity search.
-
- #### Indexing documents
-
- Load the ColBERT model and initialize the PLAID index, then encode and index your documents:
-
- ```python
- from pylate import indexes, models, retrieve
-
- # Step 1: Load the ColBERT model
- model = models.ColBERT(
-     model_name_or_path="lightonai/LateOn-Code-v0",
- )
-
- # Step 2: Initialize the PLAID index
- index = indexes.PLAID(
-     index_folder="pylate-index",
-     index_name="index",
-     override=True,  # This overwrites the existing index if any
- )
-
- # Step 3: Encode the documents
- documents_ids = ["1", "2", "3"]
- documents = ["document 1 text", "document 2 text", "document 3 text"]
-
- documents_embeddings = model.encode(
-     documents,
-     batch_size=32,
-     is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
-     show_progress_bar=True,
- )
-
- # Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
- index.add_documents(
-     documents_ids=documents_ids,
-     documents_embeddings=documents_embeddings,
- )
- ```
-
- Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:
-
- ```python
- # To load an index, simply instantiate it with the correct folder/name and without overriding it
- index = indexes.PLAID(
-     index_folder="pylate-index",
-     index_name="index",
- )
- ```
-
- #### Retrieving top-k documents for queries
-
- Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries.
- To do so, initialize the ColBERT retriever with the index you want to search in, encode the queries, and then retrieve the top-k documents to get the IDs and relevance scores of the top matches:
-
- ```python
- # Step 1: Initialize the ColBERT retriever
- retriever = retrieve.ColBERT(index=index)
-
- # Step 2: Encode the queries
- queries_embeddings = model.encode(
-     ["query for document 3", "query for document 1"],
-     batch_size=32,
-     is_query=True,  # Ensure that it is set to True to indicate that these are queries
-     show_progress_bar=True,
- )
-
- # Step 3: Retrieve top-k documents
- scores = retriever.retrieve(
-     queries_embeddings=queries_embeddings,
-     k=10,  # Retrieve the top 10 matches for each query
- )
- ```
-
- ### Reranking
- If you only want to use the ColBERT model to rerank the results of your first-stage retrieval pipeline without building an index, you can use the `rank.rerank` function and pass it the queries and documents to rerank:
-
- ```python
- from pylate import rank, models
-
- queries = [
-     "query A",
-     "query B",
- ]
-
- documents = [
-     ["document A", "document B"],
-     ["document 1", "document C", "document B"],
- ]
-
- documents_ids = [
-     [1, 2],
-     [1, 3, 2],
- ]
-
- model = models.ColBERT(
-     model_name_or_path="lightonai/LateOn-Code-v0",
- )
-
- queries_embeddings = model.encode(
-     queries,
-     is_query=True,
- )
-
- documents_embeddings = model.encode(
-     documents,
-     is_query=False,
- )
-
- reranked_documents = rank.rerank(
-     documents_ids=documents_ids,
-     queries_embeddings=queries_embeddings,
-     documents_embeddings=documents_embeddings,
- )
- ```
-
- <!--
- ### Direct Usage (Transformers)
-
- <details><summary>Click to see the direct usage in Transformers</summary>
-
- </details>
- -->
-
- <!--
- ### Downstream Usage (Sentence Transformers)
-
- You can finetune this model on your own dataset.
-
- <details><summary>Click to expand</summary>
-
- </details>
- -->
-
- <!--
- ### Out-of-Scope Use
-
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->
-
- <!--
- ## Bias, Risks and Limitations
-
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->
-
- <!--
- ### Recommendations
-
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->
-
- ## Training Details
-
- ### Framework Versions
- - Python: 3.12.9
- - Sentence Transformers: 5.2.0
- - PyLate: 1.3.4
- - Transformers: 4.49.0
- - PyTorch: 2.7.0+cu126
- - Accelerate: 1.10.1
- - Datasets: 4.4.1
- - Tokenizers: 0.21.1
-
-
- ## Citation
-
- ### BibTeX
-
- <!--
- ## Glossary
-
- *Clearly define terms in order to be accessible across audiences.*
- -->
-
- <!--
- ## Model Card Authors
-
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
- -->
-
- <!--
- ## Model Card Contact
-
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
- -->
 
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:38ccb8bc55065524bb30cd50451776d0aea691280a02f120142e53b83e9d33a2
+ size 50328126