DavidGF committed
Commit 013759e · verified · 0 Parent(s)

Initial commit.
.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
1_Dense/config.json ADDED
@@ -0,0 +1 @@
+ {"in_features": 384, "out_features": 128, "bias": false, "activation_function": "torch.nn.modules.linear.Identity"}
1_Dense/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2f7099bc3cd07dea9d8ddc87820d38cc70aee52f2b76185ac8fd64d5d22c7167
+ size 196696
README.md ADDED
@@ -0,0 +1,374 @@
+ ---
+ tags:
+ - ColBERT
+ - PyLate
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - multilingual
+ - late-interaction
+ - retrieval
+ - pretrained
+ - loss:Distillation
+ pipeline_tag: sentence-similarity
+ library_name: PyLate
+ license: apache-2.0
+ ---
+
+ <img src="https://vago-solutions.ai/wp-content/uploads/2025/08/SauerkrautLM-Multi-ColBERT-33M.png" width="500" height="auto">
+
+ # SauerkrautLM-Multi-ColBERT-33m
+
+ This model is a compact Late Interaction retriever that leverages:
+
+ - **Pretraining** on over 8.2 billion tokens in a two-phase approach (4.6B multilingual + 3.6B English tokens).
+ - **Knowledge distillation** from state-of-the-art reranker models during pretraining.
+ - **An efficient architecture** with 33M parameters – optimized for edge deployment while maintaining high performance.
+
+ ### 🎯 Core Features and Innovations
+
+ - **Two-Phase Pretraining Strategy**:
+   - Phase 1: 4,641,714,000 tokens of multilingual data covering 7 European languages
+   - Phase 2: 3,620,166,317 tokens of high-quality English data for enhanced performance
+   - Total: over **8.2 billion tokens** of pretraining data
+
+ - **Advanced Knowledge Distillation**: learning from powerful reranker models throughout the pretraining process
+
+ - **Balanced Efficiency**: at 33M parameters, a sweet spot between performance and deployability
+
+ ### 💪 The Foundation Model: Compact yet Powerful
+
+ At **33 million parameters** – **less than 1/200th the size** of some competing models – SauerkrautLM-Multi-ColBERT-33m represents efficient pretraining at scale:
+ - **200× smaller** than 7B+ parameter models
+ - **More than 3× smaller** than typical BERT models (110M)
+ - **2× larger** than the ultra-compact 15M variant
+ - Trained on **8.2 billion tokens** – roughly 248 tokens per parameter
+
+ This balanced architecture, combined with large-scale pretraining, creates a powerful foundation for downstream applications, offering superior performance to the 15M variant while remaining highly efficient.
+
+ ## Model Overview
+
+ **Model:** `VAGOsolutions/SauerkrautLM-Multi-ColBERT-33m`\
+ **Type:** Pretrained foundation model for Late Interaction retrieval\
+ **Architecture:** PyLate / ColBERT (Late Interaction)\
+ **Languages:** Multilingual (optimized for 7 European languages: German, English, Spanish, French, Italian, Dutch, Portuguese)\
+ **License:** Apache 2.0\
+ **Model Size:** 33M parameters\
+ **Training Data:** 8.2B tokens (4.6B multilingual + 3.6B English)
+
+ ### Model Description
+ - **Model Type:** PyLate model with a Late Interaction (ColBERT) architecture
+ - **Document Length:** **8192 tokens** (16× longer than traditional 512-token BERT models)
+ - **Query Length:** 256 tokens (optimized for complex, multi-part queries)
+ - **Output Dimensionality:** 128 dimensions per token vector (efficient representation)
+ - **Similarity Function:** MaxSim (precise token-level matching; see the sketch after this list)
+ - **Training Method:** Two-phase knowledge distillation from reranker models
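+
+ To make MaxSim concrete: for each query token vector, take the maximum dot product over all document token vectors, then sum those maxima across the query. A minimal sketch with hypothetical random tensors:
+
+ ```python
+ import torch
+
+ def maxsim(q: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
+     """Late-interaction score: q is (n_query_tokens, 128), d is (n_doc_tokens, 128)."""
+     sim = q @ d.T                       # token-level similarity matrix
+     return sim.max(dim=1).values.sum()  # best document token per query token, summed
+
+ q = torch.randn(256, 128)   # up to 256 query tokens
+ d = torch.randn(8192, 128)  # up to 8192 document tokens
+ score = maxsim(q, d)
+ ```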
+
+ ### Architecture
+
+ ```
+ ColBERT(
+   (0): Transformer(CompressedModernBertModel)
+   (1): Dense(384 -> 128 dim, no bias)
+ )
+ ```
+
+ ## 🔬 Technical Innovations in Detail
+
+ ### Two-Phase Pretraining: Multilingual Foundation, then English Enhancement
+
+ Our 33M-parameter model undergoes a two-phase pretraining regime:
+
+ #### Phase 1: Multilingual Foundation (4.6B tokens)
+ - **Data Volume**: 4,641,714,000 tokens across 7 European languages
+ - **Languages**: Balanced representation of German, English, Spanish, French, Italian, Dutch, and Portuguese
+ - **Objective**: Build robust multilingual understanding and cross-lingual capabilities
+
+ #### Phase 2: English Enhancement (3.6B tokens)
+ - **Data Volume**: 3,620,166,317 high-quality English tokens
+ - **Focus**: Enhance English performance while maintaining multilingual capabilities
+ - **Result**: Strong English retrieval without sacrificing other languages
+
+ ### Knowledge Distillation Throughout Pretraining
+
+ Unlike typical pretraining, we apply continuous knowledge distillation (an illustrative loss sketch follows this list):
+ - **Teacher Models**: State-of-the-art reranker models guide the learning process
+ - **Distillation Objective**: Learn ranking patterns from the ground up
+ - **Efficiency Gain**: Competitive performance with up to 200× fewer parameters
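+
+ The model card does not publish the exact training loss; purely as an illustration, a common formulation of reranker distillation minimizes the KL divergence between teacher and student score distributions over a query's candidate documents (all numbers below are made up):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ # Hypothetical scores for one query over four candidate documents
+ teacher_scores = torch.tensor([[8.1, 3.2, 0.5, -1.0]])  # reranker teacher
+ student_scores = torch.tensor([[5.0, 2.9, 1.1, -0.4]])  # ColBERT MaxSim student
+
+ loss = F.kl_div(
+     F.log_softmax(student_scores, dim=-1),  # student log-distribution
+     F.softmax(teacher_scores, dim=-1),      # teacher target distribution
+     reduction="batchmean",
+ )
+ ```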
+
+ ### Compact Yet Capable Design
+
+ SauerkrautLM-Multi-ColBERT-33m achieves its balance through:
+
+ - Compact architecture – ~33M parameters
+ - Balanced BERT design – 12 layers, hidden_size = 384
+ - Multi-head attention – 24 attention heads (16-dim each) for nuanced understanding
+ - Production-ready – deployable on standard infrastructure
+ - Intermediate size – 1152 (3× hidden size) for sufficient expressiveness
+
+ This architecture enables Late Interaction retrieval with significantly better performance than the 15M variant while maintaining excellent efficiency.
+
+ ---
+
+ ## 🔬 Benchmarks: Foundation Model Performance
+
+ SauerkrautLM-Multi-ColBERT-33m delivers strong multilingual retrieval performance, demonstrating the effectiveness of our two-phase pretraining approach at this parameter scale.
+
+ ### NanoBEIR Europe (multilingual retrieval)
+
+ Average nDCG@10 across seven European languages, showing the multilingual capabilities gained from two-phase pretraining:
+
+ | Language | nDCG@10 | Performance Notes |
+ | -------- | ------- | ----------------- |
+ | en | **51.74** | Enhanced by Phase 2 English pretraining |
+ | de | 38.46 | Strong German-language performance |
+ | es | 43.10 | Excellent Spanish-language capabilities |
+ | fr | 40.96 | Consistent cross-lingual transfer |
+ | it | 40.44 | Balanced multilingual representation |
+ | nl | 37.51 | Effective on lower-resource languages |
+ | pt | 39.55 | Maintains quality across language families |
+
+ **Key Observations:**
+ - **English Excellence**: The two-phase training strategy yields strong English performance (51.74) while maintaining solid multilingual capabilities
+ - **Significant Improvement over 15M**: All languages show substantial gains over the 15M variant (5–7 points on average)
+ - **Balanced Multilingual**: Non-English languages perform consistently well (37–43 nDCG@10), demonstrating effective multilingual pretraining
+ - **Token Efficiency**: With 8.2B training tokens on 33M parameters, the model achieves excellent data efficiency (roughly 248 tokens per parameter)
+
+ ---
+
+ ### Why SauerkrautLM-Multi-ColBERT-33m Matters as a Foundation Model
+
+ - **Optimal Balance**: A sweet spot between the ultra-compact 15M variant and larger models
+ - **Superior Performance**: Significant improvements over the 15M variant across all languages
+ - **Production Ready**: Deployable on standard GPUs and cloud infrastructure
+ - **High Context Length**: Handles long documents of up to 8192 tokens
+ - **True Multilingual Foundation**: Native support for 7 European languages from pretraining
+ - **Ideal for Fine-tuning**: Strong base model for task-specific adaptations
+ - **Cost-Effective**: Train specialized models without massive compute requirements
+
+ This pretrained model serves as an ideal foundation for:
+ - High-performance retrieval systems
+ - Multilingual search applications
+ - Standard deployment scenarios
+ - Rapid prototyping with better accuracy
+ - Production systems requiring reliability
+
+ ---
+
+ ### Real-World Applications
+
+ The combination of large-scale pretraining and balanced efficiency enables:
+
+ 1. **Production Search Systems**: Deploy on standard infrastructure with confidence
+ 2. **Multilingual Products**: A single model serving users across 7 languages with high quality
+ 3. **Hybrid Deployments**: Run on-premise or in the cloud with reasonable resource requirements
+ 4. **Enhanced Accuracy**: Better performance than the 15M variant for critical applications
+ 5. **Scalable Solutions**: Handle larger workloads without exponential resource growth
+
+ ## 📈 Summary: The Power of Balanced Pretraining
+
+ SauerkrautLM-Multi-ColBERT-33m demonstrates that thoughtful parameter scaling combined with strong pretraining creates capable foundation models. By training on 8.2 billion tokens across two phases, we have created a model that:
+
+ - **Delivers superior performance** compared to ultra-compact variants
+ - **Maintains excellent efficiency** with just 33M parameters (roughly 248 tokens per parameter)
+ - **Achieves strong multilingual results** across 7 European languages
+ - **Provides strong English retrieval** (51.74 nDCG@10) through targeted enhancement
+ - **Enables practical deployments** on standard infrastructure
+ - **Offers an ideal foundation** for diverse downstream applications
+
+ This model strikes a practical balance between performance and efficiency for production-grade multilingual retrieval systems.
+
+ ---
+
+ # PyLate
+
+ This is a [PyLate](https://github.com/lightonai/pylate) model. It maps sentences and paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity via the MaxSim operator.
+
+ ## Usage
+ First install the PyLate library:
+
+ ```bash
+ pip install -U pylate
+ ```
+
+ ### Retrieval
+
+ PyLate provides a streamlined interface to index and retrieve documents using ColBERT models. The index uses the Voyager HNSW index to efficiently handle document embeddings and enable fast retrieval.
+
+ #### Indexing documents
+
+ First, load the ColBERT model and initialize the Voyager index, then encode and index your documents:
+
+ ```python
+ from pylate import indexes, models, retrieve
+
+ # Step 1: Load the ColBERT model
+ model = models.ColBERT(
+     model_name_or_path="VAGOsolutions/SauerkrautLM-Multi-ColBERT-33m",
+ )
+
+ # Step 2: Initialize the Voyager index
+ index = indexes.Voyager(
+     index_folder="pylate-index",
+     index_name="index",
+     override=True,  # This overwrites the existing index, if any
+ )
+
+ # Step 3: Encode the documents
+ documents_ids = ["1", "2", "3"]
+ documents = ["document 1 text", "document 2 text", "document 3 text"]
+
+ documents_embeddings = model.encode(
+     documents,
+     batch_size=32,
+     is_query=False,  # Set to False to indicate that these are documents, not queries
+     show_progress_bar=True,
+ )
+
+ # Step 4: Add the document embeddings to the index with their corresponding ids
+ index.add_documents(
+     documents_ids=documents_ids,
+     documents_embeddings=documents_embeddings,
+ )
+ ```
+
+ Note that you do not have to recreate the index and re-encode the documents every time. Once you have created an index and added documents, you can reuse it later by loading it:
+
+ ```python
+ # To load an index, simply instantiate it with the correct folder/name and without overriding it
+ index = indexes.Voyager(
+     index_folder="pylate-index",
+     index_name="index",
+ )
+ ```
+
+ #### Retrieving top-k documents for queries
+
+ Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries.
+ To do so, initialize the ColBERT retriever with the index you want to search, encode the queries, and then retrieve the top-k documents to get the top match ids and relevance scores:
+
+ ```python
+ # Step 1: Initialize the ColBERT retriever
+ retriever = retrieve.ColBERT(index=index)
+
+ # Step 2: Encode the queries
+ queries_embeddings = model.encode(
+     ["query for document 3", "query for document 1"],
+     batch_size=32,
+     is_query=True,  # Set to True to indicate that these are queries
+     show_progress_bar=True,
+ )
+
+ # Step 3: Retrieve the top-k documents
+ scores = retriever.retrieve(
+     queries_embeddings=queries_embeddings,
+     k=10,  # Retrieve the top 10 matches for each query
+ )
+ ```
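+
+ A note on consuming the results: `retrieve` returns one ranked hit list per query; per the PyLate documentation, each hit carries `id` and `score` keys, so a minimal loop looks like:
+
+ ```python
+ for query, hits in zip(["query for document 3", "query for document 1"], scores):
+     print(query)
+     for hit in hits:  # hits are sorted by descending MaxSim score
+         print(f"  id={hit['id']} score={hit['score']:.2f}")
+ ```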
+
+ ### Reranking
+ If you only want to use the ColBERT model to rerank the output of a first-stage retrieval pipeline without building an index, simply use the `rank.rerank` function and pass the queries and documents to rerank:
+
+ ```python
+ from pylate import rank, models
+
+ queries = [
+     "query A",
+     "query B",
+ ]
+
+ documents = [
+     ["document A", "document B"],
+     ["document 1", "document C", "document B"],
+ ]
+
+ documents_ids = [
+     [1, 2],
+     [1, 3, 2],
+ ]
+
+ model = models.ColBERT(
+     model_name_or_path="VAGOsolutions/SauerkrautLM-Multi-ColBERT-33m",
+ )
+
+ queries_embeddings = model.encode(
+     queries,
+     is_query=True,
+ )
+
+ documents_embeddings = model.encode(
+     documents,
+     is_query=False,
+ )
+
+ reranked_documents = rank.rerank(
+     documents_ids=documents_ids,
+     queries_embeddings=queries_embeddings,
+     documents_embeddings=documents_embeddings,
+ )
+ ```
+
+ ## Citation
+
+ ### BibTeX
+
+ #### SauerkrautLM-Multi-ColBERT-33m
+
+ ```bibtex
+ @misc{SauerkrautLM-Multi-ColBERT-33m,
+     title={SauerkrautLM-Multi-ColBERT-33m},
+     author={David Golchinfar},
+     url={https://huggingface.co/VAGOsolutions/SauerkrautLM-Multi-ColBERT-33m},
+     year={2025}
+ }
+ ```
+
+ #### Sentence Transformers
+
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = {Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
+     author = {Reimers, Nils and Gurevych, Iryna},
+     booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing},
+     month = {11},
+     year = {2019},
+     publisher = {Association for Computational Linguistics},
+     url = {https://arxiv.org/abs/1908.10084}
+ }
+ ```
+
+ #### PyLate
+
+ ```bibtex
+ @misc{PyLate,
+     title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
+     author={Chaffin, Antoine and Sourty, Raphaël},
+     url={https://github.com/lightonai/pylate},
+     year={2024}
+ }
+ ```
+
+ ## Acknowledgements
+ We thank the PyLate team for providing the training framework that made this work possible.
added_tokens.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "[D] ": 30523,
+   "[Q] ": 30522
+ }
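These two entries are the ColBERT marker tokens: PyLate prepends `[Q] ` to queries and `[D] ` to documents before encoding (see `query_prefix` and `document_prefix` in `config_sentence_transformers.json` below). An illustrative sketch only; PyLate applies the prefixes internally when you pass `is_query=...`:

```python
# Illustrative: what the encoder effectively sees (the prefixes map to ids 30522/30523)
query_prefix, document_prefix = "[Q] ", "[D] "
query_text = query_prefix + "what is late interaction retrieval?"
document_text = document_prefix + "ColBERT compares queries and documents token by token."
```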
config.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 384,
+   "initializer_range": 0.02,
+   "intermediate_size": 1152,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 8192,
+   "model_type": "bert",
+   "num_attention_heads": 24,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.51.1",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30524
+ }
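Since the backbone is registered as a plain `BertModel`, the raw encoder (without the ColBERT Dense head) can presumably be inspected directly with the `transformers` library; a hedged sketch:

```python
from transformers import AutoModel, AutoTokenizer

# Loads only the 384-dim encoder; retrieval use should go through PyLate instead
tokenizer = AutoTokenizer.from_pretrained("VAGOsolutions/SauerkrautLM-Multi-ColBERT-33m")
encoder = AutoModel.from_pretrained("VAGOsolutions/SauerkrautLM-Multi-ColBERT-33m")

inputs = tokenizer("a quick smoke test", return_tensors="pt")
hidden = encoder(**inputs).last_hidden_state  # shape: (1, seq_len, 384)
```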
config_sentence_transformers.json ADDED
@@ -0,0 +1,49 @@
+ {
+   "__version__": {
+     "sentence_transformers": "4.1.0",
+     "transformers": "4.51.1",
+     "pytorch": "2.8.0.dev20250319+cu128"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": "MaxSim",
+   "query_prefix": "[Q] ",
+   "document_prefix": "[D] ",
+   "query_length": 32,
+   "document_length": 300,
+   "attend_to_expansion_tokens": false,
+   "skiplist_words": [
+     "!",
+     "\"",
+     "#",
+     "$",
+     "%",
+     "&",
+     "'",
+     "(",
+     ")",
+     "*",
+     "+",
+     ",",
+     "-",
+     ".",
+     "/",
+     ":",
+     ";",
+     "<",
+     "=",
+     ">",
+     "?",
+     "@",
+     "[",
+     "\\",
+     "]",
+     "^",
+     "_",
+     "`",
+     "{",
+     "|",
+     "}",
+     "~"
+   ]
+ }
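The `skiplist_words` above are punctuation tokens whose embeddings ColBERT excludes from document-side scoring. An illustrative sketch only; PyLate handles the skiplist internally during encoding:

```python
# Hypothetical tokenized document; skiplisted punctuation contributes no match vectors
skiplist = {"!", "\"", "#", ",", ".", "?"}  # excerpt of the full list above
doc_tokens = ["[D]", "berlin", "is", "the", "capital", "of", "germany", "."]
kept = [t for t in doc_tokens if t not in skiplist]
print(kept)  # ['[D]', 'berlin', 'is', 'the', 'capital', 'of', 'germany']
```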
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:13b4a955fce6dc3285322e5c89cfbdd32e17f0b92b910b4715091513b07fb501
+ size 131087504
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Dense",
+     "type": "pylate.models.Dense.Dense"
+   }
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 299,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "[MASK]",
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render.
 
tokenizer_config.json ADDED
@@ -0,0 +1,81 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "30522": {
+       "content": "[Q] ",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "30523": {
+       "content": "[D] ",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "max_length": 299,
+   "model_max_length": 299,
+   "never_split": null,
+   "pad_to_multiple_of": null,
+   "pad_token": "[MASK]",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "[SEP]",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
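Note the unusual `"pad_token": "[MASK]"`: ColBERT pads each query with [MASK] tokens up to the fixed query length, a form of query expansion in which the mask positions contribute extra matching vectors. A sketch of just the padding step (token strings are hypothetical):

```python
# Pad a short query to query_length=32 with [MASK] expansion tokens
query_tokens = ["[CLS]", "[Q]", "what", "is", "colbert", "[SEP]"]
query_length = 32
padded = query_tokens + ["[MASK]"] * (query_length - len(query_tokens))
```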
vocab.txt ADDED
The diff for this file is too large to render.