# PyLate model based on BAAI/bge-small-en-v1.5
This is a PyLate model finetuned from BAAI/bge-small-en-v1.5 on the msmarco-10m-triplets dataset. It maps sentences & paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.
## Model Details
### Model Description
- Model Type: PyLate model
- Base model: BAAI/bge-small-en-v1.5
- Document Length: 300 tokens
- Query Length: 32 tokens
- Output Dimensionality: 128 dimensions per token
- Similarity Function: MaxSim
- Training Dataset: msmarco-10m-triplets
### Model Sources
- Documentation: [PyLate Documentation](https://lightonai.github.io/pylate/)
- Repository: [PyLate on GitHub](https://github.com/lightonai/pylate)
- Hugging Face: [PyLate models on Hugging Face](https://huggingface.co/models?library=PyLate)
### Full Model Architecture

```
ColBERT(
  (0): Transformer({'max_seq_length': 300, 'do_lower_case': True, 'architecture': 'BertModel'})
  (1): Dense({'in_features': 384, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity', 'use_residual': False})
)
```
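As the architecture above shows, every text is encoded into one 128-dimensional vector per token, and relevance between a query and a document is computed with the MaxSim (late interaction) operator: for each query token, take the maximum dot product against all document token vectors, then sum over the query tokens. A minimal NumPy sketch for intuition (the function name and toy vectors are illustrative, not part of the PyLate API):

```python
import numpy as np

def maxsim(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """MaxSim late-interaction score.

    query_tokens: (num_query_tokens, dim) matrix of query token embeddings.
    doc_tokens:   (num_doc_tokens, dim) matrix of document token embeddings.
    """
    sim = query_tokens @ doc_tokens.T     # (num_query_tokens, num_doc_tokens) dot products
    return float(sim.max(axis=1).sum())   # best document token per query token, summed

# Toy example with 2-dimensional token embeddings
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc = np.array([[1.0, 0.0], [0.0, 2.0]])
print(maxsim(query, doc))  # 3.0: 1.0 for the first query token + 2.0 for the second
```

Because the score is a sum of per-query-token maxima, each query token can match a different part of the document, which is what distinguishes late interaction from single-vector similarity.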
## Usage

First install the PyLate library:

```bash
pip install -U pylate
```
### Retrieval

Use this model with PyLate to index and retrieve documents. The index uses FastPLAID for efficient similarity search.

#### Indexing documents
Load the ColBERT model and initialize the PLAID index, then encode and index your documents:
```python
from pylate import indexes, models, retrieve

# Step 1: Load the ColBERT model
model = models.ColBERT(
    model_name_or_path="pylate_model_id",
)

# Step 2: Initialize the PLAID index
index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
    override=True,  # This overwrites the existing index if any
)

# Step 3: Encode the documents
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]
documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,  # Set to False to indicate that these are documents, not queries
    show_progress_bar=True,
)

# Step 4: Add the document embeddings and their corresponding ids to the index
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)
```
Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:
```python
# To load an index, simply instantiate it with the correct folder/name and without overriding it
index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
)
```
#### Retrieving top-k documents for queries

Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries. To do so, initialize the ColBERT retriever with the index you want to search, encode the queries, and then retrieve the top-k documents to get the matching ids and relevance scores:
```python
# Step 1: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)

# Step 2: Encode the queries
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,  # Set to True to indicate that these are queries, not documents
    show_progress_bar=True,
)

# Step 3: Retrieve top-k documents
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,  # Retrieve the top 10 matches for each query
)
```
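Conceptually, the PLAID index approximates an exhaustive MaxSim search over all indexed documents. For intuition only, a brute-force equivalent over pre-encoded token matrices could look like this sketch (plain NumPy, not the PyLate API; all names are illustrative):

```python
import numpy as np

def maxsim(query_tokens, doc_tokens):
    # Sum over query tokens of the best-matching document-token dot product
    return float((query_tokens @ doc_tokens.T).max(axis=1).sum())

def brute_force_topk(query_tokens, documents, k=10):
    """Score every (id, token-matrix) pair and return the k best (id, score) tuples."""
    scored = [(doc_id, maxsim(query_tokens, d)) for doc_id, d in documents]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy corpus: two documents with 2-dimensional token embeddings
docs = [("1", np.array([[1.0, 0.0]])), ("2", np.array([[0.0, 1.0], [2.0, 0.0]]))]
query = np.array([[1.0, 0.0]])
print(brute_force_topk(query, docs, k=2))  # [('2', 2.0), ('1', 1.0)]
```

This linear scan is O(corpus size) per query; PLAID avoids it by clustering token embeddings and pruning candidates, which is why an index is worth building for large corpora.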
### Reranking

If you only want to use the ColBERT model to rerank the output of a first-stage retrieval pipeline without building an index, simply use the rank function and pass the queries and documents to rerank:
```python
from pylate import rank, models

queries = [
    "query A",
    "query B",
]

documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

documents_ids = [
    [1, 2],
    [1, 3, 2],
]

model = models.ColBERT(
    model_name_or_path="pylate_model_id",
)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```
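Conceptually, reranking just re-scores each query's candidate list with MaxSim and sorts it. A rough NumPy equivalent for a single query (illustrative only, not the actual rank.rerank implementation):

```python
import numpy as np

def maxsim(query_tokens, doc_tokens):
    # Late-interaction score: per-query-token max dot product, summed
    return float((query_tokens @ doc_tokens.T).max(axis=1).sum())

def rerank_candidates(doc_ids, query_tokens, doc_token_matrices):
    """Return one query's candidates as (id, score) pairs, best first."""
    pairs = [(doc_id, maxsim(query_tokens, d))
             for doc_id, d in zip(doc_ids, doc_token_matrices)]
    return sorted(pairs, key=lambda p: p[1], reverse=True)

# Toy example: candidate 2 matches the query token, candidate 1 does not
query = np.array([[0.0, 1.0]])
candidates = [np.array([[1.0, 0.0]]), np.array([[0.0, 3.0]])]
print(rerank_candidates([1, 2], query, candidates))  # [(2, 3.0), (1, 0.0)]
```

Because no index is involved, this path only pays the cost of encoding the queries and the candidate documents, which is why it suits small first-stage candidate sets.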
## Evaluation
### Metrics
#### ColBERTTriplet

- Evaluated with `pylate.evaluation.colbert_triplet.ColBERTTripletEvaluator`

| Metric | Value |
|---|---|
| accuracy | 0.989 |
## Training Details
### Training Dataset
#### msmarco-10m-triplets

- Dataset: msmarco-10m-triplets at 8c5139a
- Size: 9,998,000 training samples
- Columns: `query`, `positive`, and `negative`
- Approximate statistics based on the first 1000 samples:

| | query | positive | negative |
|---|---|---|---|
| type | string | string | string |
| details | min: 32 tokens, mean: 32.0 tokens, max: 32 tokens | min: 32 tokens, mean: 32.0 tokens, max: 32 tokens | min: 32 tokens, mean: 32.0 tokens, max: 32 tokens |
- Samples:
query positive negative what is the intent of the encomienda systemEncomienda system established. The encomienda system was created by the Spanish to control and regulate American Indian labor and behavior during the colonization of the Americas. Under the encomienda system, conquistadors and other leaders (encomenderos) received grants of a number of Indians, from whom they could exact âtributeâ in the form of gold or labor.SearchNetworking. 1 Intent-based networking needed to run more complex networks Experts address why intent-based networking systems are needed to manage networks of the future that connect data center, public ...current weather in el cajon caEl Cajon, CA Weather Forecast. TODAY - Patchy fog in the morning. Cloudy with a chance of rain. Highs around 63 in the western valleys to 56 to 61 near the foothills. Areas of winds southwest 15 mph. Chance of measurable precipitation 50 percent.The two weather patterns alternate every five years and affect the weather of the entire continent. Summers in North America during the El Nino niño weather pattern are wetter than. Average winters In North america During El nino niño are warmer on average and there is less. snow falll Nino niño causes higher ocean water temperatures While La nina niña causes colder. Water temperatures the two weather patterns alternate every five years and affect the weather of the. Entire continent Summers In north america During The el nino niño weather pattern are. wetter than averagewhat is meant by gathering natural gasOur Gathering and Processing segment consists of gathering, compressing, dehydrating, treating, conditioning, processing, and marketing natural gas and gathering crude oil. The gathering of natural gas consists of aggregating natural gas produced from various wells through small diameter gathering lines to processing plants. 
Natural gas has a widely varying composition depending on the field, the formation and the reservoir from which it is produced.Propane is a byproduct of the oil refining process and therefore fluctuates in price more so than natural gas. Since the USA is the Saudi Arabia of natural gas, it should be cheaper per BTU than propane. - Loss:
pylate.losses.xtr_primeqa.XTRPrimeQA
### Evaluation Dataset
#### msmarco-10m-triplets

- Dataset: msmarco-10m-triplets at 8c5139a
- Size: 2,000 evaluation samples
- Columns: `query`, `positive`, and `negative`
- Approximate statistics based on the first 1000 samples:

| | query | positive | negative |
|---|---|---|---|
| type | string | string | string |
| details | min: 32 tokens, mean: 32.0 tokens, max: 32 tokens | min: 32 tokens, mean: 32.0 tokens, max: 32 tokens | min: 32 tokens, mean: 32.0 tokens, max: 32 tokens |
- Samples:
query positive negative vegetables good for eye blood flowA diet high in saturated fats and sugar lacks the antioxidants needed for good eye health. A diet high in saturated fats also creates substances that put the health of your eyes at risk, such as arterial plaque, which can cause restricted blood flow in the blood vessels in your eyes.ome Excellent Choices for Feeding Healthy Eyes. The following list of foods provides good to excellent antioxidant support for your eyes: 1 Green vegetables like broccoli, green beans, spinach, peas and Brussels sprouts. 2 Carrots, celery and corn. 3 Leafy lettuce. 4 Sweet potatoes and yams.For this reason, a bag of frozen vegetables should be wrapped in a towel if used on extremities with very low blood flow, such as toes or hands. If the injury occurs in an area with little fat or muscle beneath the skin, such as fingers, take the compress off after 10 minutes maximum, wait 5 minutes and reapply.is battery fully chargedIn both Living on 12 Volts with Ample Power, and Wiring 12 Volts for Ample Power the authors explain that a battery is fully charged when the voltage is about 14.4 Volts and current through the battery has declined to less than 2% of the capacity of the battery in Amp-hours ...2 Amps for a 100 Ah battery.1 Do not fully charge or fully discharge your deviceâs battery â charge it to around 50%. 2 If you store a device when its battery is fully discharged, the battery could fall into a deep discharge state, which renders it incapable of holding a charge.what does the name adalina representAdalina /ada-li-na/ [4 sylls.] as a girls' name is of Old German derivation, and the name Adalina means noble. Adalina is a version of Adeline (Old German): French version of Adela. Andolina is a popular surname. Kreatif forms: Adalipa, Adalira, Adilina.However, licensing as a broker or salesperson authorizes the licensee to represent parties on either side of a transaction. 
The choice of which side to represent is a business decision for the licensee. In the U.S., real estate brokers and salespersons are licensed by each state, not by the federal government. - Loss:
pylate.losses.xtr_primeqa.XTRPrimeQA
### Training Hyperparameters
#### Non-Default Hyperparameters

- `eval_strategy`: steps
- `per_device_train_batch_size`: 196
- `per_device_eval_batch_size`: 196
- `learning_rate`: 3e-05
- `max_grad_norm`: 10.0
- `num_train_epochs`: 0
- `max_steps`: 50000
- `warmup_ratio`: 0.01
- `bf16`: True
- `torch_compile`: True
- `torch_compile_backend`: inductor
- `eval_on_start`: True
#### All Hyperparameters
<details><summary>Click to expand</summary>

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: steps
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 196
- `per_device_eval_batch_size`: 196
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 3e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 10.0
- `num_train_epochs`: 0
- `max_steps`: 50000
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.01
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: True
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `parallelism_config`: None
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch_fused
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `hub_revision`: None
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`:
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: True
- `torch_compile_backend`: inductor
- `torch_compile_mode`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: True
- `use_liger_kernel`: False
- `liger_kernel_config`: None
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: proportional
- `router_mapping`: {}
- `learning_rate_mapping`: {}

</details>