---
tags:
- ColBERT
- PyLate
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- loss:Distillation
- turkish
pipeline_tag: sentence-similarity
library_name: PyLate
---

# PyLate

This is a [PyLate](https://github.com/lightonai/pylate) model. It maps sentences & paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.
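
For intuition, the MaxSim operator scores a query against a document by summing, over the query token embeddings, each token's maximum similarity to any document token embedding. A minimal NumPy sketch (illustrative only, not PyLate's internal implementation):

```python
import numpy as np

def maxsim(query_embeddings: np.ndarray, document_embeddings: np.ndarray) -> float:
    """query_embeddings: (n_query_tokens, 128); document_embeddings: (n_doc_tokens, 128).
    Both are assumed L2-normalized, so dot products are cosine similarities."""
    similarities = query_embeddings @ document_embeddings.T  # (n_query_tokens, n_doc_tokens)
    # For each query token, keep its best-matching document token, then sum.
    return float(similarities.max(axis=1).sum())
```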

## Model Details

### Model Description
- **Model Type:** PyLate model
- **Document Length:** 8192 tokens
- **Query Length:** 32 tokens
- **Output Dimensionality:** 128 dimensions
- **Similarity Function:** MaxSim

### Model Sources

- **Documentation:** [PyLate Documentation](https://lightonai.github.io/pylate/)
- **Repository:** [PyLate on GitHub](https://github.com/lightonai/pylate)
- **Hugging Face:** [PyLate models on Hugging Face](https://huggingface.co/models?library=PyLate)

### Full Model Architecture

```
ColBERT(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel
  (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
```
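
The Dense module projects each 768-dimensional ModernBERT token representation down to 128 dimensions, so encoding a text yields one 128-dimensional vector per token. A quick shape check (a sketch; model loading as in the Usage section below):

```python
from pylate import models

model = models.ColBERT(model_name_or_path="99eren99/TrColBERT-Long")
embeddings = model.encode(["Ankara Türkiye'nin başkentidir."], is_query=False)
print(embeddings[0].shape)  # (n_document_tokens, 128): one 128-d vector per token
```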

## Evaluation

nDCG and Recall scores of this model (out-of-domain predictions) and of other multilingual late interaction retrieval models on [Tr-NanoBEIR](https://huggingface.co/datasets/99eren99/Tr-NanoBEIR):

<img src="https://huggingface.co/99eren99/TrColBERT-Long/resolve/main/assets/scores.png" alt="nDCG and Recall scores on Tr-NanoBEIR"/>

## Usage

First install the required libraries. A GPU supporting Flash Attention 2 is a must for consistent results; without it, you need to manually mask the query expansion tokens in the output layer.

```bash
pip install -U einops flash_attn
pip install -U pylate
```

Then normalize your text before encoding, e.g. with `lambda x: x.replace("İ", "i").replace("I", "ı").lower()`.
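
Python's default `str.lower()` maps `"I"` to `"i"`, which is wrong for Turkish, so the dotted/dotless "i" pairs must be handled first. A minimal helper (the function name is illustrative):

```python
def normalize(text: str) -> str:
    # Map the Turkish capital dotted/dotless "i" before lowercasing,
    # since str.lower() would otherwise turn "I" into "i" instead of "ı".
    return text.replace("İ", "i").replace("I", "ı").lower()

print(normalize("İstanbul VE ILIK"))  # -> "istanbul ve ılık"
```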

### Retrieval

PyLate provides a streamlined interface to index and retrieve documents using ColBERT models. The index leverages the Voyager HNSW index to efficiently handle document embeddings and enable fast retrieval.

#### Indexing documents

First, load the ColBERT model and initialize the Voyager index, then encode and index your documents:

```python
from pylate import indexes, models, retrieve

# Step 1: Load the ColBERT model
document_length = 8192  # in [1, 8192]; documents longer than this are truncated
model = models.ColBERT(
    model_name_or_path="99eren99/TrColBERT-Long",
    document_length=document_length,
)
try:
    model.tokenizer.model_input_names.remove("token_type_ids")
except ValueError:  # "token_type_ids" is already absent
    pass
model.eval()
model.to("cuda")

# Step 2: Initialize the Voyager index
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
    override=True,  # This overwrites the existing index if any
)

# Step 3: Encode the documents
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]

documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,  # Set to False to indicate that these are documents, not queries
    show_progress_bar=True,
)

# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)
```

Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can reuse the index later by loading it:

```python
# To load an index, simply instantiate it with the correct folder/name and without overriding it
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
)
```

#### Retrieving top-k documents for queries

Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries.
To do so, initialize the ColBERT retriever with the index you want to search, encode the queries, and then retrieve the top-k documents to get the ids and relevance scores of the top matches:

```python
# Step 1: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)

# Step 2: Encode the queries
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,  # Set to True to indicate that these are queries
    show_progress_bar=True,
)

# Step 3: Retrieve top-k documents
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,  # Retrieve the top 10 matches for each query
)
```
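
The retriever returns one ranked list of matches per query. A minimal sketch of reading the results, assuming each match is a dict with `"id"` and `"score"` keys as in the PyLate documentation examples:

```python
queries = ["query for document 3", "query for document 1"]
for query, matches in zip(queries, scores):
    for match in matches:
        print(f"{query!r} -> document {match['id']} (score: {match['score']:.2f})")
```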

### Reranking

If you only want to use the ColBERT model to perform reranking on top of your first-stage retrieval pipeline without building an index, you can simply use the `rank.rerank` function and pass the queries and documents to rerank:

```python
from pylate import rank, models

queries = [
    "query A",
    "query B",
]

documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

documents_ids = [
    [1, 2],
    [1, 3, 2],
]

model = models.ColBERT(
    model_name_or_path="99eren99/TrColBERT-Long",
)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```

### Framework Versions
- Python: 3.10.16
- Sentence Transformers: 4.0.2
- PyLate: 1.1.7
- Transformers: 4.48.2
- PyTorch: 2.5.1+cu124
- Accelerate: 1.2.1
- Datasets: 2.21.0
- Tokenizers: 0.21.0

## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### PyLate
```bibtex
@misc{PyLate,
    title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
    author={Chaffin, Antoine and Sourty, Raphaël},
    url={https://github.com/lightonai/pylate},
    year={2024},
}
```