PyLate model based on BAAI/bge-small-en-v1.5

This is a PyLate model finetuned from BAAI/bge-small-en-v1.5 on the msmarco-10m-triplets dataset. It maps sentences & paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.
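For intuition, the MaxSim operator scores a query against a document by taking, for each query token vector, its maximum similarity over all document token vectors and summing those maxima. A minimal NumPy sketch of this scoring (an illustration, not PyLate's own batched implementation):

import numpy as np

def maxsim(query_embeddings: np.ndarray, document_embeddings: np.ndarray) -> float:
    # query_embeddings: (num_query_tokens, 128); document_embeddings: (num_document_tokens, 128)
    # Token vectors are assumed to be L2-normalized, so dot products are cosine similarities.
    similarities = query_embeddings @ document_embeddings.T  # (num_query_tokens, num_document_tokens)
    return float(similarities.max(axis=1).sum())  # best document token per query token, summed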

Model Details

Model Description

  • Model Type: PyLate model
  • Base model: BAAI/bge-small-en-v1.5
  • Document Length: 300 tokens
  • Query Length: 32 tokens
  • Output Dimensionality: 128 dimensions
  • Similarity Function: MaxSim
  • Training Dataset: msmarco-10m-triplets

Model Sources

Full Model Architecture

ColBERT(
  (0): Transformer({'max_seq_length': 300, 'do_lower_case': True, 'architecture': 'BertModel'})
  (1): Dense({'in_features': 384, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity', 'use_residual': False})
)
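Concretely, each encoded text becomes a matrix of per-token 128-dimensional vectors rather than a single vector. A minimal sketch, reusing the placeholder model id from the Usage section below:

from pylate import models

model = models.ColBERT(model_name_or_path="pylate_model_id")

document_embedding = model.encode(["a sample document"], is_query=False)[0]
query_embedding = model.encode(["a sample query"], is_query=True)[0]

print(document_embedding.shape)  # (num_document_tokens, 128); documents are truncated at 300 tokens
print(query_embedding.shape)     # (num_query_tokens, 128); queries are padded/truncated to 32 tokens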

Usage

First install the PyLate library:

pip install -U pylate

Retrieval

Use this model with PyLate to index and retrieve documents. The index uses FastPLAID for efficient similarity search.

Indexing documents

Load the ColBERT model and initialize the PLAID index, then encode and index your documents:

from pylate import indexes, models, retrieve

# Step 1: Load the ColBERT model
model = models.ColBERT(
    model_name_or_path="pylate_model_id",
)

# Step 2: Initialize the PLAID index
index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
    override=True,  # This overwrites the existing index if any
)

# Step 3: Encode the documents
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]

documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
    show_progress_bar=True,
)

# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)

Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:

# To load an index, simply instantiate it with the correct folder/name and without overriding it
index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
)

Retrieving top-k documents for queries

Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries. To do so, initialize the ColBERT retriever with the index you want to search in, encode the queries, and then retrieve the top-k documents to get the ids and relevance scores of the top matches:

# Step 1: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)

# Step 2: Encode the queries
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,  # Ensure that it is set to True to indicate that these are queries, not documents
    show_progress_bar=True,
)

# Step 3: Retrieve top-k documents
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,  # Retrieve the top 10 matches for each query
)
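The returned scores pair each query with its top-k matches. A hedged way to inspect them, assuming each match exposes a document id and a relevance score (the exact result structure may vary across PyLate versions):

queries = ["query for document 3", "query for document 1"]
for query, query_results in zip(queries, scores):
    for result in query_results:
        # Each result is assumed to hold the matched document id and its MaxSim relevance score.
        print(query, result["id"], result["score"])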

Reranking

If you only want to use the ColBERT model to perform reranking on top of your first-stage retrieval pipeline without building an index, you can simply use the rank.rerank function and pass the queries and documents to rerank:

from pylate import rank, models

queries = [
    "query A",
    "query B",
]

documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

documents_ids = [
    [1, 2],
    [1, 3, 2],
]

model = models.ColBERT(
    model_name_or_path="pylate_model_id",
)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)

Evaluation

Metrics

ColBERTTriplet

  • Evaluated with pylate.evaluation.colbert_triplet.ColBERTTripletEvaluator
  • accuracy: 0.9925
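A hedged sketch of reproducing this kind of triplet-accuracy evaluation with the evaluator named above; the constructor arguments are assumed to mirror Sentence Transformers' TripletEvaluator and may differ, so check the PyLate documentation for the exact signature:

from pylate import models
from pylate.evaluation.colbert_triplet import ColBERTTripletEvaluator

model = models.ColBERT(model_name_or_path="pylate_model_id")

# Assumed argument names (anchors / positives / negatives); the strings are illustrative only.
evaluator = ColBERTTripletEvaluator(
    anchors=["what is dst"],
    positives=["Daylight Saving Time (DST) is the practice of turning the clock ahead in warmer months."],
    negatives=["Franklin Park time zone: UTC-05:00 or CDT."],
)
print(evaluator(model))  # triplet accuracy; the exact return format depends on the library version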

Training Details

Training Dataset

msmarco-10m-triplets

  • Dataset: msmarco-10m-triplets at 8c5139a
  • Size: 9,998,000 training samples
  • Columns: query, positive, and negative
  • Approximate statistics based on the first 1000 samples:
    query: type string; min: 32 tokens, mean: 32.0 tokens, max: 32 tokens
    positive: type string; min: 32 tokens, mean: 32.0 tokens, max: 32 tokens
    negative: type string; min: 32 tokens, mean: 32.0 tokens, max: 32 tokens
  • Samples:
    query: what can black mold do
    positive: Two of the better-known toxic molds include Stachybotrys chartarum (black mold), which can cause everything from headaches to cancer, and Aspergillus, which can cause severe lung infections, or progress to whole-body infections. Mold is particularly dangerous for infants and children.
    negative: Learn more about mold and health effects in a A Brief Guide to Mold, Moisture and Your Home The entire booklet: A Brief Guide to Mold, Moisture and Your Home (Web Version) A Brief Guide to Mold, Moisture and Your Home (Print Version) Top of Page

    query: what factors increase a population growth
    positive: Quick Answer. Factors that cause population growth include increased food production, improved health care services, immigration and high birth rate. These factors have led to overpopulation, which has more negative effects than positive impacts.
    negative: The increase in crawfish size during molting, and the length of time between molts, can vary greatly and are affected by factors such as water temperature, water quality, food quality and quantity, population density, oxygen levels and to a lesser extent by genetic influences.

    query: is herpes spread through saliva
    positive: This type of herpes is transmittable through contact with the saliva or the herpes blisters (cold sores) of an infected person. This said – yes, it is entirely possible to get herpes from kissing.It is also possible, though less common, that herpes type 1 might spread to genital regions through oral sex.enital herpes can spread to the mouth through oral sex. Once you have contracted either type of herpes virus you will be a carrier for life. However, both types tend to become less severe with the passing of time and though they may still be contagious to others, many times people stop having breakouts at all.
    negative: Introduction. Herpes simplex virus (HSV) infections are very common worldwide. HSV-1 is the main cause of herpes infections on the mouth and lips, including cold sores and fever blisters. It is transmitted through kissing or sharing drinking glasses and utensils. HSV-1 can also cause genital herpes, although HSV-2 is the main cause of genital herpes. HSV-2 is spread through sexual contact. You may be infected with HSV-1 or HSV-2 but not show any symptoms. Often symptoms are triggered by exposure to the sun, fever, menstruation, emotional stress, a weakened immune system, or an illness. There is no cure for herpes, and once you have it, it is likely to come back.ntroduction. Herpes simplex virus (HSV) infections are very common worldwide. HSV-1 is the main cause of herpes infections on the mouth and lips, including cold sores and fever blisters. It is transmitted through kissing or sharing drinking glasses and utensils.
  • Loss: pylate.losses.xtr_primeqa.XTRPrimeQA
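A hedged sketch of loading these triplets with the Hugging Face datasets library; the repository id below is a placeholder, since this card only names the dataset and its revision (8c5139a):

from datasets import load_dataset

# "org/msmarco-10m-triplets" is a placeholder repository id; substitute the actual dataset repo.
train_dataset = load_dataset("org/msmarco-10m-triplets", revision="8c5139a", split="train")
print(train_dataset.column_names)  # ['query', 'positive', 'negative']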

Evaluation Dataset

msmarco-10m-triplets

  • Dataset: msmarco-10m-triplets at 8c5139a
  • Size: 2,000 evaluation samples
  • Columns: query, positive, and negative
  • Approximate statistics based on the first 1000 samples:
    query: type string; min: 32 tokens, mean: 32.0 tokens, max: 32 tokens
    positive: type string; min: 32 tokens, mean: 32.0 tokens, max: 32 tokens
    negative: type string; min: 32 tokens, mean: 32.0 tokens, max: 32 tokens
  • Samples:
    query: what school district is siefert elementary
    positive: Siefert Elementary is a public elementary school located in Milwaukee, WI in the Milwaukee School District. It enrolls 307 students in grades 1st through 12th. Siefert Elementary is the 756th largest public school in Wisconsin and the 40,598th largest nationally. It has 17.5 students to every teacher.
    negative: Due to the hazardous road conditions, there will be a two hour delay today, Wednesday, March 2nd, 2016 for Black River HS/MS, Cavendish Town Elementary School, Chester-Andover Elementary School, Green Mountain Union HS, Ludlow Elementary School and Mount Holly School.

    query: what is dst
    positive: Daylight Saving Time (DST) is the practice of turning the clock ahead as warmer weather approaches and back as it becomes colder again so that people will have one more hour of daylight in the afternoon and evening during the warmer season of the year.
    negative: Franklin Park time zone: UTC-05:00 or CDT. Daylight saving time is in effect in Franklin Park. See recent and expected DST changes in Franklin Park in the table below.

    query: who played sister sister mom
    positive: The main cast of Sister, Sister (from left to right), Tia Mowry with Jackée Harry as Tia and Lisa Landry and Tim Reid with Tamera Mowry as Ray and Tamera Campbell. Sister, Sister is an American television sitcom starring fraternal twins Tia and Tamera Mowry. It aired from 1994 to 1999.
    negative: The boy, 16, was killed by his sister, 15, on January 5 after she escaped the room her brother had locked her in and shot him in the neck. The older sister asked her younger sister to keep watch as she went outside and cut out the air conditioner of her parents' locked bedroom window to retrieve a pistol.
  • Loss: pylate.losses.xtr_primeqa.XTRPrimeQA

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 196
  • per_device_eval_batch_size: 196
  • learning_rate: 3e-05
  • max_grad_norm: 10.0
  • num_train_epochs: 0
  • max_steps: 50000
  • warmup_ratio: 0.01
  • bf16: True
  • torch_compile: True
  • torch_compile_backend: inductor
  • eval_on_start: True

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 196
  • per_device_eval_batch_size: 196
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 3e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 10.0
  • num_train_epochs: 0
  • max_steps: 50000
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.01
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: True
  • torch_compile_backend: inductor
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: True
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}