PyLate model based on jinaai/jina-colbert-v2

This is a PyLate model finetuned from jinaai/jina-colbert-v2. It maps sentences & paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.
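
For intuition, the MaxSim operator scores a query against a document by taking, for each query token embedding, its maximum similarity over all document token embeddings and summing these maxima. Below is a minimal sketch of that computation (the function name and tensors are illustrative, not part of the PyLate API):

import torch

def maxsim_score(query_embeddings: torch.Tensor, document_embeddings: torch.Tensor) -> torch.Tensor:
    # query_embeddings: (num_query_tokens, 128), document_embeddings: (num_document_tokens, 128)
    similarities = query_embeddings @ document_embeddings.T  # token-to-token similarity matrix
    # For each query token, keep only its best-matching document token, then sum over query tokens
    return similarities.max(dim=1).values.sum()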

Model Details

Model Description

  • Model Type: PyLate model
  • Base model: jinaai/jina-colbert-v2
  • Document Length: 300 tokens
  • Query Length: 32 tokens
  • Output Dimensionality: 128 dimensions
  • Similarity Function: MaxSim

Model Sources

  • Repository: https://github.com/lightonai/pylate

Full Model Architecture

ColBERT(
  (0): Transformer({'max_seq_length': 299, 'do_lower_case': False, 'architecture': 'XLMRobertaModel'})
  (1): Dense({'in_features': 1024, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity', 'use_residual': False})
)
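
To see the per-token output shape concretely, the snippet below (with a hypothetical passage, not taken from the training data) encodes one document and checks that every token is mapped to a 128-dimensional vector:

from pylate import models

model = models.ColBERT(model_name_or_path="pylate_model_id")

# encode returns one token-level embedding matrix per input document
embeddings = model.encode(["a short example passage"], is_query=False)
print(embeddings[0].shape)  # (number_of_document_tokens, 128)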

Usage

First install the PyLate library:

pip install -U pylate

Retrieval

Use this model with PyLate to index and retrieve documents. The index uses FastPLAID for efficient similarity search.

Indexing documents

Load the ColBERT model and initialize the PLAID index, then encode and index your documents:

from pylate import indexes, models, retrieve

# Step 1: Load the ColBERT model
model = models.ColBERT(
    model_name_or_path="pylate_model_id",
)

# Step 2: Initialize the PLAID index
index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
    override=True,  # This overwrites the existing index if any
)

# Step 3: Encode the documents
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]

documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
    show_progress_bar=True,
)

# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)

Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:

# To load an index, simply instantiate it with the correct folder/name and without overriding it
index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
)

Retrieving top-k documents for queries

Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries. To do so, initialize the ColBERT retriever with the index you want to search in, encode the queries, and then retrieve the top-k documents to get the ids and relevance scores of the top matches:

# Step 1: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)

# Step 2: Encode the queries
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,  # Ensure that it is set to True to indicate that these are queries
    show_progress_bar=True,
)

# Step 3: Retrieve top-k documents
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,  # Retrieve the top 10 matches for each query
)
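
A small sketch of how the results could be inspected, assuming (as in the PyLate documentation) that scores holds one list per query of dictionaries with id and score fields:

for query_index, query_results in enumerate(scores):
    print(f"Query {query_index}")
    for result in query_results:
        # Each result is assumed to carry the document id and its MaxSim relevance score
        print(f"  id={result['id']} score={result['score']:.4f}")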

Reranking

If you only want to use the ColBERT model to perform reranking on top of your first-stage retrieval pipeline without building an index, you can simply use the rank function and pass the queries and documents to rerank:

from pylate import rank, models

queries = [
    "query A",
    "query B",
]

documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

documents_ids = [
    [1, 2],
    [1, 3, 2],
]

model = models.ColBERT(
    model_name_or_path="pylate_model_id",
)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
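
A minimal sketch for consuming the output, assuming rerank returns, for each query, a score-sorted list of entries with id and score fields:

for query, results in zip(queries, reranked_documents):
    print(query)
    for result in results:
        # Each entry is assumed to pair a document id with its reranking score
        print(f"  id={result['id']} score={result['score']:.4f}")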

Training Details

Training Dataset

Unnamed Dataset

  • Size: 256,886 training samples
  • Columns: query, positive, and negative
  • Approximate statistics based on the first 1000 samples:
    • query: string, min: 32 tokens, mean: 32.0 tokens, max: 32 tokens
    • positive: string, min: 32 tokens, mean: 32.0 tokens, max: 32 tokens
    • negative: string, min: 32 tokens, mean: 32.0 tokens, max: 32 tokens
  • Samples:
    • query: There was no Mughal tradition of primogeniture, the systematic passing of rule, upon an emperor's death, to his eldest son.
      positive: Sanskrit: चक्रवर्तिनः मृत्योः अनन्तरं तस्य शासनस्य व्यवस्थितरूपेण सङ्क्रमणस्य, मुघलपरम्परायाः ज्येष्ठपुत्राधिकारपद्धतिः नासीत्। (English: There was no Mughal tradition of primogeniture, the systematic passing of rule, upon an emperor's death, to his eldest son.)
      negative: Sanskrit: येऽरक्ष्यमाणा हीयन्ते दैवेनाभ्याहता नृप। तस्करैश्चापि हीयन्ते सर्वं तद् राजकिल्बिषम्॥ (English: If the subjects of a king, O monarch, die from want of protection and are afflicted by the gods and oppressed by robbers, the sin of all this affects the king himself.)
    • query: The four sons of Shah Jahan all held governorships during their father's reign.
      positive: Sanskrit: शाह्-जहाँ-नामकस्य चत्वारः पुत्राः, सर्वे पितुः शासनकाले शासकपदम् अधारयन्। (English: The four sons of Shah Jahan all held governorships during their father's reign.)
      negative: Sanskrit: आयेन वासव्ययस्य तुलने कृते सति वासः व्यययोग्यः न वेति ज्ञायते। (English: Comparing the price of housing to income tells if housing is affordable.)
    • query: In this regard he discusses the correlation between social opportunities of education and health and how both of these complement economic and political freedoms as a healthy and well-educated person is better suited to make informed economic decisions and be involved in fruitful political demonstrations etc.
      positive: Sanskrit: अस्मिन् विषये सः शिक्षणस्य स्वास्थ्यस्य च सामाजिकावकाशानाम् अन्योन्य-सम्बन्धस्य, तथा च एतद्द्वयम् अपि आर्थिक-राजनैतिक-स्वातन्त्र्ययोः कथं पूरकं भवतः इति च चर्चां करोति, यतोहि स्वस्था सुशिक्षिता च व्यक्तिः ज्ञानपूर्वम् आर्थिकविषयान् निर्णेतुं तथा फलप्रदेषु राजनैतिकेषु प्रतिपादनादिषु संलग्नः भवितुं च अधिकारी भवति इति। (English: In this regard he discusses the correlation between social opportunities of education and health and how both of these complement economic and political freedoms as a healthy and well-educated person is better suited to make informed economic decisions and be involved in fruitful political demonstrations etc.)
      negative: Sanskrit: स्पर्धायां दलानां विश्रामस्थानत्रयेषु अन्तिमम् अस्ति वैट्-मौण्टन्। (English: White Mountain is the last of three rest stops for teams in the race.)
  • Loss: pylate.losses.contrastive.Contrastive

Evaluation Dataset

Unnamed Dataset

  • Size: 2,000 evaluation samples
  • Columns: query, positive, and negative
  • Approximate statistics based on the first 1000 samples:
    • query: string, min: 32 tokens, mean: 32.0 tokens, max: 32 tokens
    • positive: string, min: 32 tokens, mean: 32.0 tokens, max: 32 tokens
    • negative: string, min: 32 tokens, mean: 32.0 tokens, max: 32 tokens
  • Samples:
    • query: You two were eloquent speakers.
      positive: Sanskrit: युवां वाक्पटू आस्तम् । (English: You two were eloquent speakers.)
      negative: Sanskrit: एतत् 'URL' इन्स्टलेषन् काले दत्तं 'पोर्ट् नम्बर्' तथा 'डोमैन्' नाम आधारीकृत्य वर्तते । (English: This URL is based on- the port number and domain name given at the time of installation.)
    • query: """And James the son of Zebedee, and John the brother of James; and he surnamed them Boanerges, which is, The sons of thunder:"""
      positive: Sanskrit: """याकूब् तस्य भ्राता योहन् च आन्द्रियः फिलिपो बर्थलमयः,""" (English: """And James the son of Zebedee, and John the brother of James; and he surnamed them Boanerges, which is, The sons of thunder:""")
      negative: Sanskrit: "पश्यामः यत्, Animal interface इत्यस्य सर्वाणि मेथड्स्, नाम - talk(), see() अपि च move() इतीमानि क्लास् मध्ये इम्प्लिमेण्ट् जातानि ।" (English: "We can see that all the methods of the Animal interface- talk(), see() and move() are implemented inside this class.")
    • query: The heart of a healthy adult person beats 60-80 times per minute
      positive: Sanskrit: स्वस्थस्य कस्यचन प्रौढस्य प्रति मिनट्-काले षष्टिवारात अधिकाधिकतया अशीतिवारं कम्पते। (English: The heart of a healthy adult person beats 60-80 times per minute)
      negative: Sanskrit: यतस्तेन बहवो यिहूदीया गत्वा यीशौ व्यश्वसन्। (English: """Because that by reason of him many of the Jews went away, and believed on Jesus.""")
  • Loss: pylate.losses.contrastive.Contrastive

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 2
  • num_train_epochs: 2
  • fp16: True
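
For reference, a minimal sketch of how these non-default hyperparameters could be combined with the Contrastive loss in a PyLate training run, following the contrastive training pattern from the PyLate documentation; the dataset file and output directory below are placeholders, not the exact setup used for this model:

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from pylate import losses, models, utils

# Placeholder triplet dataset with "query", "positive" and "negative" columns
train_dataset = load_dataset("json", data_files="train.jsonl", split="train")

model = models.ColBERT(model_name_or_path="jinaai/jina-colbert-v2")

args = SentenceTransformerTrainingArguments(
    output_dir="output",            # placeholder output path
    per_device_train_batch_size=2,  # non-default values reported above
    num_train_epochs=2,
    fp16=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=losses.Contrastive(model=model),
    data_collator=utils.ColBERTCollator(model.tokenize),
)
trainer.train()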

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 2
  • per_device_eval_batch_size: 8
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 2
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • tp_size: 0
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch Step Training Loss
0.0156 500 0.0421
0.0311 1000 0.0622
0.0467 1500 0.0062
0.0623 2000 0.0024
0.0779 2500 0.0002

Framework Versions

  • Python: 3.10.18
  • Sentence Transformers: 5.1.1
  • PyLate: 1.3.4
  • Transformers: 4.51.3
  • PyTorch: 2.8.0+cu128
  • Accelerate: 1.10.1
  • Datasets: 3.3.2
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084"
}

PyLate

@misc{PyLate,
    title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
    author={Chaffin, Antoine and Sourty, Raphaël},
    url={https://github.com/lightonai/pylate},
    year={2024}
}