Commit 013759e (verified)
Parent(s): none (initial commit)

Initial commit.

Files changed:
- .gitattributes +35 -0
- 1_Dense/config.json +1 -0
- 1_Dense/model.safetensors +3 -0
- README.md +374 -0
- added_tokens.json +4 -0
- config.json +24 -0
- config_sentence_transformers.json +49 -0
- model.safetensors +3 -0
- modules.json +14 -0
- sentence_bert_config.json +4 -0
- special_tokens_map.json +31 -0
- tokenizer.json +0 -0
- tokenizer_config.json +81 -0
- vocab.txt +0 -0
.gitattributes
ADDED
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
1_Dense/config.json
ADDED
@@ -0,0 +1 @@
{"in_features": 384, "out_features": 128, "bias": false, "activation_function": "torch.nn.modules.linear.Identity"}
1_Dense/model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2f7099bc3cd07dea9d8ddc87820d38cc70aee52f2b76185ac8fd64d5d22c7167
size 196696
README.md
ADDED
@@ -0,0 +1,374 @@
---
tags:
- ColBERT
- PyLate
- sentence-transformers
- sentence-similarity
- feature-extraction
- multilingual
- late-interaction
- retrieval
- pretrained
- loss:Distillation
pipeline_tag: sentence-similarity
library_name: PyLate
license: apache-2.0
---

<img src="https://vago-solutions.ai/wp-content/uploads/2025/08/SauerkrautLM-Multi-ColBERT-33M.png" width="500" height="auto">

# SauerkrautLM-Multi-ColBERT-33m

This model is a compact Late Interaction retriever that leverages:

**Pretraining** with over 8.2 billion tokens in a two-phase approach (4.6B multilingual + 3.6B English tokens).
**Knowledge Distillation** from state-of-the-art reranker models during pretraining.
**Efficient architecture** with 33M parameters – optimized for edge deployment while maintaining high performance.

### 🎯 Core Features and Innovations:

- **Two-Phase Pretraining Strategy**:
  - Phase 1: 4,641,714,000 tokens of multilingual data covering 7 European languages
  - Phase 2: 3,620,166,317 tokens of high-quality English data for enhanced performance
  - Total: Over **8.2 billion tokens** of pretrained knowledge

- **Advanced Knowledge Distillation**: Learning from powerful reranker models throughout the pretraining process

- **Balanced Efficiency**: With 33M parameters, achieving the sweet spot between performance and deployability

### 💪 The Foundation Model: Compact yet Powerful

With **33 million parameters** – that's **less than 1/200th the size** of some competing models – SauerkrautLM-Multi-ColBERT-33m represents efficient pretraining at scale:
- **200× smaller** than 7B+ parameter models
- **4× smaller** than typical BERT models (110M)
- **2× larger** than the ultra-compact 15M variant
- Trained on **8.2 billion tokens** – roughly 248 tokens per parameter

This balanced architecture combined with pretraining creates a powerful foundation for downstream applications, offering superior performance compared to the 15M variant while remaining highly efficient.

## Model Overview

**Model:** `VAGOsolutions/SauerkrautLM-Multi-ColBERT-33m`\
**Type:** Pretrained foundation model for Late Interaction retrieval\
**Architecture:** PyLate / ColBERT (Late Interaction)\
**Languages:** Multilingual (optimized for 7 European languages: German, English, Spanish, French, Italian, Dutch, Portuguese)\
**License:** Apache 2.0\
**Model Size:** 33M parameters\
**Training Data:** 8.2B tokens (4.6B multilingual + 3.6B English)

### Model Description
- **Model Type:** PyLate model with innovative Late Interaction architecture
- **Document Length:** **8192 tokens** (32× longer than traditional BERT models)
- **Query Length:** 256 tokens (optimized for complex, multi-part queries)
- **Output Dimensionality:** 128 dimensions per token (efficient vector representation)
- **Similarity Function:** MaxSim (enables precise token-level matching; see the sketch below)
- **Training Method:** Two-phase knowledge distillation from reranker models

### Architecture

```
ColBERT(
  (0): Transformer(CompressedModernBertModel)
  (1): Dense(384 -> 128 dim, no bias)
)
```
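
To make the MaxSim scoring concrete, here is a minimal sketch of the late-interaction score (an illustration, not the PyLate implementation): every query token is matched against its most similar document token and the maxima are summed. It assumes L2-normalized token embeddings from the model's 128-dimensional projection.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) relevance score.

    query_emb: (num_query_tokens, 128), doc_emb: (num_doc_tokens, 128);
    both are assumed L2-normalized, so dot products are cosine similarities.
    """
    token_sims = query_emb @ doc_emb.T          # (num_query_tokens, num_doc_tokens)
    return token_sims.max(dim=1).values.sum()   # best document token per query token, summed
```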

## 🔬 Technical Innovations in Detail

### Two-Phase Pretraining: Building Multilingual then English Excellence

Our 33M parameter model undergoes sophisticated two-phase pretraining:

#### Phase 1: Multilingual Foundation (4.6B tokens)
- **Data Volume**: 4,641,714,000 tokens across 7 European languages
- **Languages**: Balanced representation of German, English, Spanish, French, Italian, Dutch, and Portuguese
- **Objective**: Build robust multilingual understanding and cross-lingual capabilities

#### Phase 2: English Enhancement (3.6B tokens)
- **Data Volume**: 3,620,166,317 high-quality English tokens
- **Focus**: Enhance English performance while maintaining multilingual capabilities
- **Result**: State-of-the-art English retrieval without sacrificing other languages

### Knowledge Distillation Throughout Pretraining

Unlike typical pretraining, we leverage continuous knowledge distillation:
- **Teacher Models**: State-of-the-art reranker models guide the learning process
- **Distillation Objective**: Learn optimal ranking patterns from the ground up (see the sketch below)
- **Efficiency Gain**: Achieves superior performance with 200× fewer parameters
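
The exact distillation recipe is not spelled out in this card. As a rough illustration of score distillation from a reranker teacher, here is one common formulation (a KL divergence between softmax-normalized candidate scores); it sketches the general technique under that assumption, not necessarily the objective used for this model.

```python
import torch.nn.functional as F
from torch import Tensor

def score_distillation_loss(student_scores: Tensor, teacher_scores: Tensor) -> Tensor:
    """KL divergence between student and teacher score distributions.

    Both tensors have shape (batch_size, num_candidate_docs): student_scores are the
    ColBERT model's MaxSim scores, teacher_scores come from the reranker teacher.
    """
    return F.kl_div(
        F.log_softmax(student_scores, dim=-1),
        F.softmax(teacher_scores, dim=-1),
        reduction="batchmean",
    )
```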

### Compact Yet Capable Design

SauerkrautLM-Multi-ColBERT-33m achieves optimal balance through:

- Compact architecture (~33M params)
- Balanced BERT design — 12 layers, hidden_size = 384
- Multi-head attention — 24 attention heads (16-dim each) for nuanced understanding
- Production-ready — deployable on standard infrastructure
- Intermediate size — 1152 (3× hidden size) for sufficient expressiveness

This architecture enables Late Interaction retrieval with significantly better performance than the 15M variant while maintaining excellent efficiency.
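
For reference, these dimensions match the `config.json` shipped in this commit. Below is a minimal sketch of instantiating an equivalently sized backbone with Hugging Face `transformers`; it builds a randomly initialized model purely to illustrate the parameter budget and does not load this checkpoint.

```python
from transformers import BertConfig, BertModel

# Dimensions as listed above and in config.json: 12 layers, hidden size 384,
# 24 attention heads, intermediate size 1152, 8192 position embeddings.
config = BertConfig(
    vocab_size=30524,
    hidden_size=384,
    num_hidden_layers=12,
    num_attention_heads=24,
    intermediate_size=1152,
    max_position_embeddings=8192,
)
backbone = BertModel(config)
num_params = sum(p.numel() for p in backbone.parameters())
print(f"{num_params / 1e6:.1f}M parameters")  # roughly 33M
```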

---

## 🔬 Benchmarks: Foundation Model Performance

SauerkrautLM-Multi-ColBERT-33m delivers strong multilingual retrieval performance, demonstrating the effectiveness of our two-phase pretraining approach at this parameter scale.

### NanoBEIR Europe (multilingual retrieval)

Average nDCG@10 across seven European languages, showing excellent multilingual capabilities from our two-phase pretraining:

| Language | nDCG@10 | Performance Notes |
| -------- | -------- | ----------------- |
| en | **51.74** | Enhanced by Phase 2 English pretraining |
| de | 38.46 | Strong German-language performance |
| es | 43.10 | Excellent Spanish-language capabilities |
| fr | 40.96 | Consistent cross-lingual transfer |
| it | 40.44 | Balanced multilingual representation |
| nl | 37.51 | Effective on lower-resource languages |
| pt | 39.55 | Maintains quality across language families |

**Key Observations:**
- **English Excellence**: The two-phase training strategy yields exceptional English performance (51.74) while maintaining strong multilingual capabilities
- **Significant Improvement over 15M**: All languages show substantial gains compared to the 15M variant (5-7 points improvement on average)
- **Balanced Multilingual**: Non-English languages show strong performance (37-43 nDCG@10), demonstrating effective multilingual pretraining
- **Token Efficiency**: With 8.2B training tokens on 33M parameters, the model achieves excellent data efficiency (248 tokens per parameter)

---

### Why SauerkrautLM-Multi-ColBERT-33m Matters as a Foundation Model

- **Optimal Balance**: Perfect sweet spot between the ultra-compact 15M and larger models
- **Superior Performance**: Significant improvements over the 15M variant across all languages
- **Production Ready**: Deployable on standard GPUs and cloud infrastructure
- **High Context Length**: Suitable for long documents of up to 8192 tokens
- **True Multilingual Foundation**: Native support for 7 European languages from pretraining
- **Ideal for Fine-tuning**: Strong base model for task-specific adaptations
- **Cost-Effective**: Train specialized models without massive compute requirements

This pretrained model serves as an ideal foundation for:
- High-performance retrieval systems
- Multilingual search applications
- Standard deployment scenarios
- Rapid prototyping with better accuracy
- Production systems requiring reliability

---

### Real-World Applications

The combination of massive pretraining and balanced efficiency enables:

1. **Production Search Systems**: Deploy on standard infrastructure with confidence
2. **Multilingual Products**: A single model serving users across 7 languages with high quality
3. **Hybrid Deployments**: Run on-premise or in the cloud with reasonable resource requirements
4. **Enhanced Accuracy**: Better performance for critical applications compared to the 15M variant
5. **Scalable Solutions**: Handle larger workloads without exponential resource growth

## 📈 Summary: The Power of Balanced Pretraining

SauerkrautLM-Multi-ColBERT-33m demonstrates that thoughtful parameter scaling combined with strong pretraining creates optimal foundation models. By training on 8.2 billion tokens across two phases, we've created a model that:

- **Delivers superior performance** compared to ultra-compact variants
- **Maintains excellent efficiency** with just 33M parameters (248 tokens per parameter)
- **Achieves strong multilingual results** across 7 European languages
- **Provides exceptional English retrieval** (51.74 nDCG@10) through targeted enhancement
- **Enables practical deployments** on standard infrastructure
- **Offers an ideal foundation** for diverse downstream applications

This model represents the optimal balance between performance and efficiency for production-grade multilingual retrieval systems.

---

# PyLate

This is a [PyLate](https://github.com/lightonai/pylate) model. It maps sentences and paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.
## Usage
First install the PyLate library:

```bash
pip install -U pylate
```

### Retrieval

PyLate provides a streamlined interface to index and retrieve documents using ColBERT models. The index leverages the Voyager HNSW index to efficiently handle document embeddings and enable fast retrieval.

#### Indexing documents

First, load the ColBERT model and initialize the Voyager index, then encode and index your documents:

```python
from pylate import indexes, models, retrieve

# Step 1: Load the ColBERT model
model = models.ColBERT(
    model_name_or_path="VAGOsolutions/SauerkrautLM-Multi-ColBERT-33m",
)

# Step 2: Initialize the Voyager index
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
    override=True,  # This overwrites the existing index if any
)

# Step 3: Encode the documents
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]

documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
    show_progress_bar=True,
)

# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)
```
Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:

```python
# To load an index, simply instantiate it with the correct folder/name and without overriding it
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
)
```

#### Retrieving top-k documents for queries

Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries.
To do so, initialize the ColBERT retriever with the index you want to search in, encode the queries and then retrieve the top-k documents to get the ids and relevance scores of the top matches:

```python
# Step 1: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)

# Step 2: Encode the queries
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,  # Ensure that it is set to True to indicate that these are queries
    show_progress_bar=True,
)

# Step 3: Retrieve top-k documents
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,  # Retrieve the top 10 matches for each query
)
```
### Reranking
If you only want to use the ColBERT model to perform reranking on top of your first-stage retrieval pipeline without building an index, you can simply use the rank function and pass the queries and documents to rerank:

```python
from pylate import rank, models

queries = [
    "query A",
    "query B",
]

documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

documents_ids = [
    [1, 2],
    [1, 3, 2],
]

model = models.ColBERT(
    model_name_or_path="VAGOsolutions/SauerkrautLM-Multi-ColBERT-33m",
)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```
## Citation

### BibTeX

#### SauerkrautLM-Multi-ColBERT-33m

```bibtex
@misc{SauerkrautLM-Multi-ColBERT-33m,
    title={SauerkrautLM-Multi-ColBERT-33m},
    author={David Golchinfar},
    url={https://huggingface.co/VAGOsolutions/SauerkrautLM-Multi-ColBERT-33m},
    year={2025}
}
```

#### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = {Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
    author = {Reimers, Nils and Gurevych, Iryna},
    booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing},
    month = {11},
    year = {2019},
    publisher = {Association for Computational Linguistics},
    url = {https://arxiv.org/abs/1908.10084}
}
```

#### PyLate

```bibtex
@misc{PyLate,
    title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
    author={Chaffin, Antoine and Sourty, Raphaël},
    url={https://github.com/lightonai/pylate},
    year={2024}
}
```

## Acknowledgements
We thank the PyLate team for providing the training framework that made this work possible.

<!--
## Glossary

*Clearly define terms in order to be accessible across audiences.*
-->

<!--
## Model Card Authors

*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
-->

<!--
## Model Card Contact

*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
-->
added_tokens.json
ADDED
@@ -0,0 +1,4 @@
{
  "[D] ": 30523,
  "[Q] ": 30522
}
config.json
ADDED
@@ -0,0 +1,24 @@
{
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 384,
  "initializer_range": 0.02,
  "intermediate_size": 1152,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 8192,
  "model_type": "bert",
  "num_attention_heads": 24,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.51.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30524
}
config_sentence_transformers.json
ADDED
@@ -0,0 +1,49 @@
{
  "__version__": {
    "sentence_transformers": "4.1.0",
    "transformers": "4.51.1",
    "pytorch": "2.8.0.dev20250319+cu128"
  },
  "prompts": {},
  "default_prompt_name": null,
  "similarity_fn_name": "MaxSim",
  "query_prefix": "[Q] ",
  "document_prefix": "[D] ",
  "query_length": 32,
  "document_length": 300,
  "attend_to_expansion_tokens": false,
  "skiplist_words": [
    "!",
    "\"",
    "#",
    "$",
    "%",
    "&",
    "'",
    "(",
    ")",
    "*",
    "+",
    ",",
    "-",
    ".",
    "/",
    ":",
    ";",
    "<",
    "=",
    ">",
    "?",
    "@",
    "[",
    "\\",
    "]",
    "^",
    "_",
    "`",
    "{",
    "|",
    "}",
    "~"
  ]
}
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:13b4a955fce6dc3285322e5c89cfbdd32e17f0b92b910b4715091513b07fb501
size 131087504
modules.json
ADDED
@@ -0,0 +1,14 @@
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Dense",
    "type": "pylate.models.Dense.Dense"
  }
]
sentence_bert_config.json
ADDED
@@ -0,0 +1,4 @@
{
  "max_seq_length": 299,
  "do_lower_case": false
}
special_tokens_map.json
ADDED
@@ -0,0 +1,31 @@
{
  "cls_token": {
    "content": "[CLS]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "[MASK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "[MASK]",
  "sep_token": {
    "content": "[SEP]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "[UNK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
@@ -0,0 +1,81 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "30522": {
      "content": "[Q] ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "30523": {
      "content": "[D] ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    }
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": true,
  "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "max_length": 299,
  "model_max_length": 299,
  "never_split": null,
  "pad_to_multiple_of": null,
  "pad_token": "[MASK]",
  "pad_token_type_id": 0,
  "padding_side": "right",
  "sep_token": "[SEP]",
  "stride": 0,
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "truncation_side": "right",
  "truncation_strategy": "longest_first",
  "unk_token": "[UNK]"
}
vocab.txt
ADDED
The diff for this file is too large to render.