Update README.md
README.md (CHANGED)
@@ -6,9 +6,12 @@ tags:
 - PyLate
 - sentence-transformers
 - sentence-similarity
-- feature-extraction
 pipeline_tag: sentence-similarity
 library_name: PyLate
+datasets:
+- samheym/ger-dpr-collection
+base_model:
+- deepset/gbert-base
 ---

 # GerColBERT

@@ -19,29 +22,16 @@ This is a [PyLate](https://github.com/lightonai/pylate) model trained. It maps s

 ### Model Description
 - **Model Type:** PyLate model
-
+- **Base model:** [deepset/gbert-base](https://huggingface.co/deepset/gbert-base)
 - **Document Length:** 180 tokens
 - **Query Length:** 32 tokens
 - **Output Dimensionality:** 128 dimensions
 - **Similarity Function:** MaxSim
-
+- **Training Dataset:** samheym/ger-dpr-collection
 - **Language:** de
 <!-- - **License:** Unknown -->

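The MaxSim function listed above is the heart of ColBERT-style late interaction: rather than comparing one pooled vector per text, each query token embedding is matched against its most similar document token embedding, and those per-token maxima are summed into the relevance score. Below is a minimal sketch of that operator, assuming the shapes stated in this card (32 query tokens, 180 document tokens, 128 dimensions); `maxsim_score` and the random normalized tensors are illustrative stand-ins, not PyLate's API:

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    # query_emb: (query_tokens, dim); doc_emb: (doc_tokens, dim)
    similarities = query_emb @ doc_emb.T  # (query_tokens, doc_tokens)
    # For each query token, keep its best-matching document token, then sum.
    return similarities.max(dim=1).values.sum()

# Illustrative shapes from this card: 32 query tokens, 180 document tokens, 128 dims.
query_emb = torch.nn.functional.normalize(torch.randn(32, 128), dim=-1)
doc_emb = torch.nn.functional.normalize(torch.randn(180, 128), dim=-1)
print(maxsim_score(query_emb, doc_emb))  # scalar relevance score
```
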
-### Model Sources

-- **Documentation:** [PyLate Documentation](https://lightonai.github.io/pylate/)
-- **Repository:** [PyLate on GitHub](https://github.com/lightonai/pylate)
-- **Hugging Face:** [PyLate models on Hugging Face](https://huggingface.co/models?library=PyLate)
-
-### Full Model Architecture
-
-```
-ColBERT(
-  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
-  (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
-)
-```
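
To read the architecture block above: the gbert-base BERT encoder emits one 768-dimensional hidden state per token, and the bias-free Dense layer projects each state down to 128 dimensions, giving the per-token vectors that MaxSim compares. A minimal sketch of that wiring, using a freshly initialized projection for illustration rather than this model's trained weights:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepset/gbert-base")
encoder = AutoModel.from_pretrained("deepset/gbert-base")
projection = torch.nn.Linear(768, 128, bias=False)  # mirrors the Dense module above

inputs = tokenizer("Ein kurzer Beispielsatz.", return_tensors="pt")
hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 768), one state per token
token_embeddings = projection(hidden)         # (1, seq_len, 128), ColBERT-style vectors
```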

 ## Usage
 First install the PyLate library:

@@ -54,10 +44,6 @@ pip install -U pylate

 PyLate provides a streamlined interface to index and retrieve documents using ColBERT models. The index leverages the Voyager HNSW index to efficiently handle document embeddings and enable fast retrieval.

-#### Indexing documents
-
-First, load the ColBERT model and initialize the Voyager index, then encode and index your documents:
-
 ```python
 from pylate import indexes, models, retrieve

@@ -65,143 +51,9 @@ from pylate import indexes, models, retrieve
 model = models.ColBERT(
     model_name_or_path="samheym/GerColBERT",
 )
-
-# Step 2: Initialize the Voyager index
-index = indexes.Voyager(
-    index_folder="pylate-index",
-    index_name="index",
-    override=True,  # This overwrites the existing index if any
-)
-
-# Step 3: Encode the documents
-documents_ids = ["1", "2", "3"]
-documents = ["document 1 text", "document 2 text", "document 3 text"]
-
-documents_embeddings = model.encode(
-    documents,
-    batch_size=32,
-    is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
-    show_progress_bar=True,
-)
-
-# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
-index.add_documents(
-    documents_ids=documents_ids,
-    documents_embeddings=documents_embeddings,
-)
-```
-
-Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:
-
-```python
-# To load an index, simply instantiate it with the correct folder/name and without overriding it
-index = indexes.Voyager(
-    index_folder="pylate-index",
-    index_name="index",
-)
 ```

-#### Retrieving top-k documents for queries
-
-Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries.
-To do so, initialize the ColBERT retriever with the index you want to search in, encode the queries and then retrieve the top-k documents to get the ids and relevance scores of the top matches:
-
-```python
-# Step 1: Initialize the ColBERT retriever
-retriever = retrieve.ColBERT(index=index)
-
-# Step 2: Encode the queries
-queries_embeddings = model.encode(
-    ["query for document 3", "query for document 1"],
-    batch_size=32,
-    is_query=True,  # Ensure that it is set to True to indicate that these are queries
-    show_progress_bar=True,
-)
-
-# Step 3: Retrieve top-k documents
-scores = retriever.retrieve(
-    queries_embeddings=queries_embeddings,
-    k=10,  # Retrieve the top 10 matches for each query
-)
-```

-### Reranking
-If you only want to use the ColBERT model to perform reranking on top of your first-stage retrieval pipeline without building an index, you can simply use the rank function and pass the queries and documents to rerank:
-
-```python
-from pylate import rank, models
-
-queries = [
-    "query A",
-    "query B",
-]
-
-documents = [
-    ["document A", "document B"],
-    ["document 1", "document C", "document B"],
-]
-
-documents_ids = [
-    [1, 2],
-    [1, 3, 2],
-]
-
-model = models.ColBERT(
-    model_name_or_path="samheym/GerColBERT",
-)
-
-queries_embeddings = model.encode(
-    queries,
-    is_query=True,
-)
-
-documents_embeddings = model.encode(
-    documents,
-    is_query=False,
-)
-
-reranked_documents = rank.rerank(
-    documents_ids=documents_ids,
-    queries_embeddings=queries_embeddings,
-    documents_embeddings=documents_embeddings,
-)
-```
-
-<!--
-### Direct Usage (Transformers)
-
-<details><summary>Click to see the direct usage in Transformers</summary>
-
-</details>
--->
-
-<!--
-### Downstream Usage (Sentence Transformers)
-
-You can finetune this model on your own dataset.
-
-<details><summary>Click to expand</summary>
-
-</details>
--->
-
-<!--
-### Out-of-Scope Use
-
-*List how the model may foreseeably be misused and address what users ought not to do with the model.*
--->
-
-<!--
-## Bias, Risks and Limitations
-
-*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
--->
-
-<!--
-### Recommendations
-
-*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
--->

 ## Training Details


@@ -215,7 +67,7 @@ You can finetune this model on your own dataset.
 - Datasets: 2.21.0
 - Tokenizers: 0.21.0

-
+<!--
 ## Citation

 ### BibTeX