PyLate model based on LiquidAI/LFM2-ColBERT-350M
This is a PyLate model finetuned from LiquidAI/LFM2-ColBERT-350M. It maps sentences & paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.
Model Details
Model Description
- Model Type: PyLate model
- Base model: LiquidAI/LFM2-ColBERT-350M
- Document Length: 8192 tokens
- Query Length: 64 tokens
- Output Dimensionality: 128 dimensions per token
- Similarity Function: MaxSim
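To make the MaxSim similarity function concrete, here is a minimal pure-Python sketch. It assumes the usual late-interaction definition: for each query token vector, take the best dot product against any document token vector, then sum over query tokens. The 2-dimensional toy vectors and the helper names `dot` and `maxsim` are illustrative only and are not part of PyLate.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_vectors, document_vectors):
    """Sum, over query tokens, of the best dot product against any document token."""
    return sum(
        max(dot(q, d) for d in document_vectors)
        for q in query_vectors
    )


# Toy 2-dimensional token embeddings (the real model emits 128 dimensions per token).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_b = [[0.5, 0.5]]

score_a = maxsim(query, doc_a)  # each query token finds an exact match
score_b = maxsim(query, doc_b)  # each query token only finds a partial match
print(score_a, score_b)
```

Because every query token is matched independently, a document scores highly when it covers all query tokens somewhere in its text, not necessarily in one place.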
Model Sources
- Documentation: PyLate Documentation
- Repository: PyLate on GitHub
- Hugging Face: PyLate models on Hugging Face
Full Model Architecture
ColBERT(
(0): Transformer({'max_seq_length': 8191, 'do_lower_case': False, 'architecture': 'Lfm2Model'})
(1): Dense({'in_features': 1024, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity', 'use_residual': False})
)
Usage
First install the PyLate library:
pip install -U pylate
Retrieval
Use this model with PyLate to index and retrieve documents. The index uses FastPLAID for efficient similarity search.
Indexing documents
Load the ColBERT model and initialize the PLAID index, then encode and index your documents:
from pylate import indexes, models, retrieve
# Step 1: Load the ColBERT model
model = models.ColBERT(
model_name_or_path="yNilay/colbert-personal-context",
)
# Step 2: Initialize the PLAID index
index = indexes.PLAID(
index_folder="pylate-index",
index_name="index",
override=True, # This overwrites the existing index if any
)
# Step 3: Encode the documents
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]
documents_embeddings = model.encode(
documents,
batch_size=32,
is_query=False, # Ensure that it is set to False to indicate that these are documents, not queries
show_progress_bar=True,
)
# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
index.add_documents(
documents_ids=documents_ids,
documents_embeddings=documents_embeddings,
)
Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:
# To load an index, simply instantiate it with the correct folder/name and without overriding it
index = indexes.PLAID(
index_folder="pylate-index",
index_name="index",
)
Retrieving top-k documents for queries
Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries. To do so, initialize the ColBERT retriever with the index you want to search, encode the queries, and then retrieve the top-k documents to get the IDs and relevance scores of the best matches:
# Step 1: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)
# Step 2: Encode the queries
queries_embeddings = model.encode(
["query for document 3", "query for document 1"],
batch_size=32,
is_query=True, # Ensure that it is set to True to indicate that these are queries
show_progress_bar=True,
)
# Step 3: Retrieve top-k documents
scores = retriever.retrieve(
queries_embeddings=queries_embeddings,
k=10, # Retrieve the top 10 matches for each query
)
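As a rough sketch of how the retrieval output might be consumed, the snippet below assumes the result is a list with one entry per query, each entry being a list of dictionaries with `id` and `score` keys. That shape is an assumption to verify against your PyLate version. The `example_scores` values are made up for illustration and `top_match_per_query` is a hypothetical helper.

```python
# Hypothetical stand-in for the value returned by retriever.retrieve(...).
example_scores = [
    [{"id": "3", "score": 11.2}, {"id": "1", "score": 8.4}],
    [{"id": "1", "score": 10.9}, {"id": "2", "score": 7.1}],
]
example_queries = ["query for document 3", "query for document 1"]


def top_match_per_query(queries, scores):
    """Map each query to the (id, score) pair of its highest-scoring match."""
    best = {}
    for query, matches in zip(queries, scores):
        if matches:  # a query may have no matches at all
            top = max(matches, key=lambda match: match["score"])
            best[query] = (top["id"], top["score"])
    return best


best = top_match_per_query(example_queries, example_scores)
for query, (doc_id, score) in best.items():
    print(f"{query!r} -> document {doc_id} (score {score})")
```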
Reranking
If you only want to use the ColBERT model to rerank the results of a first-stage retrieval pipeline without building an index, you can use the rank.rerank function and pass it the queries and documents to rerank:
from pylate import rank, models
queries = [
"query A",
"query B",
]
documents = [
["document A", "document B"],
["document 1", "document C", "document B"],
]
documents_ids = [
[1, 2],
[1, 3, 2],
]
model = models.ColBERT(
model_name_or_path="yNilay/colbert-personal-context",
)
queries_embeddings = model.encode(
queries,
is_query=True,
)
documents_embeddings = model.encode(
documents,
is_query=False,
)
reranked_documents = rank.rerank(
documents_ids=documents_ids,
queries_embeddings=queries_embeddings,
documents_embeddings=documents_embeddings,
)
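Reranked results refer to documents by ID, so a common follow-up is to map those IDs back to the candidate texts. The sketch below assumes the reranker returns, per query, a list of dictionaries with `id` and `score` keys sorted by decreasing score. The `example_reranked` values are invented for illustration and `attach_texts` is a hypothetical helper, not a PyLate function.

```python
documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]
documents_ids = [
    [1, 2],
    [1, 3, 2],
]

# Hypothetical stand-in for the value returned by rank.rerank(...).
example_reranked = [
    [{"id": 2, "score": 9.5}, {"id": 1, "score": 4.0}],
    [{"id": 3, "score": 8.0}, {"id": 1, "score": 6.5}, {"id": 2, "score": 2.0}],
]


def attach_texts(reranked, ids, texts):
    """Replace each match's id with the corresponding text, query by query."""
    results = []
    for matches, query_ids, query_texts in zip(reranked, ids, texts):
        # IDs are only unique within one query's candidate list,
        # so build a separate lookup per query.
        lookup = dict(zip(query_ids, query_texts))
        results.append([(lookup[m["id"]], m["score"]) for m in matches])
    return results


ranked_texts = attach_texts(example_reranked, documents_ids, documents)
print(ranked_texts[0][0])
```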
Evaluation
Metrics
ColBERTTriplet
- Evaluated with pylate.evaluation.colbert_triplet.ColBERTTripletEvaluator
| Metric | Value |
|---|---|
| accuracy | 0.9757 |
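For intuition about the accuracy figure, the sketch below assumes triplet accuracy means the fraction of (query, positive, negative) triplets in which the positive document receives a strictly higher MaxSim score than the negative one. This is an illustrative reimplementation on toy vectors, not the evaluator's actual code, and `maxsim` and `triplet_accuracy` are hypothetical helper names.

```python
def maxsim(query_vectors, document_vectors):
    """Sum over query tokens of the best dot product against any document token."""
    return sum(
        max(sum(a * b for a, b in zip(q, d)) for d in document_vectors)
        for q in query_vectors
    )


def triplet_accuracy(triplets):
    """Fraction of triplets where the positive outscores the negative."""
    correct = sum(
        1
        for query, positive, negative in triplets
        if maxsim(query, positive) > maxsim(query, negative)
    )
    return correct / len(triplets)


# Four toy triplets of 2-dimensional token embeddings: (query, positive, negative).
toy_triplets = [
    ([[1.0, 0.0]], [[1.0, 0.0]], [[0.0, 1.0]]),              # positive wins
    ([[0.0, 1.0]], [[0.2, 0.1]], [[0.0, 1.0]]),              # negative wins
    ([[1.0, 1.0]], [[1.0, 1.0]], [[0.5, 0.0]]),              # positive wins
    ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]]),  # positive wins
]

accuracy = triplet_accuracy(toy_triplets)
print(accuracy)
```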
Training Details
Training Dataset
Unnamed Dataset
- Size: 5,191 training samples
- Columns: query, positive, and negative
- Approximate statistics based on the first 1000 samples:
| | query | positive | negative |
|---|---|---|---|
| type | string | string | string |
| details | min: 23 tokens, mean: 54.59 tokens, max: 64 tokens | min: 47 tokens, mean: 63.91 tokens, max: 64 tokens | min: 5 tokens, mean: 39.07 tokens, max: 64 tokens |
- Samples:
query positive negative
Should I tell Sultan to accept Tom Barrack's inauguration invitation? I know Tom's the closest Trump confidant outside the family now, but from what I remember discussing with Michael about going public versus keeping my head down, I'm worried about the optics. Is it even worth it given what I told Sultan about the crowds? I need to decide tonight.
Subject: Re: Presidential inauguration
---
From: Sultan Bin Sulayem
Date: Fri, Jan 6, 2017 at 4:08 AM
Should I accept the invitation sent by Tom barrack
---
From: jeffrey E. [jeevacation@gmail.com]
To: Sultan Bin Sulayem
Date: 1/6/2017 2:49:57 PM
http://www.cnn.com/2017/01/06/politics/tom-barrack-donald-trump-inauguration/index.html
Your reservation at The Four Seasons New York is confirmed for Friday.
What's the latest on my inroads to the Trump administration? Last I checked, Thiel was getting close to them and there was talk about him getting a role. Where do things stand now and who's my best access point at this point?
Subject: Fwd: fact checking questions for New York Magazine story
---
From: Yablon, Alex
To: jeevacation@gmail.com
Date: Tue, Mar 31, 2015 at 11:23 AM
Hi Jeffrey,
Sorry for the phone tag. In case it's easier to respond by email, I have put my questions below. I'm at my desk and should be here until 5, and should be free on my cell phone after 6:30.
Best,
Alex
-what is the square footage of your Manhattan home?
-do you work on a laptop from your dining room, with a large white board for notes and several pairs of reading glasses close at hand? Is the dining room windowless?
-do paparazzi often camp outside your home?
-did Michael once visit on the same day as a head of state who'd had a police escort?
-did you tell Michael about a dinner you'd hosted for six tech entrepreneurs who had a combined worth of several hundred billion dollars? When was this meal?
-Do you believe that there are now more people who possess Roosevelt or Carnegie-levels of wealth, the sort that can ri...
Subject: Re:
---
From: jeffrey E. jeevacation@gmail.com
Date: Sep 20, 2018, at 10:40 AM
The Guardian has learned that Rubenfeld is currently the subject of an internal investigation at Yale. The investigation is focused on Rubenfeld's conduct, particularly with female law students. Students have also raised related concerns to Yale authorities about Chua's powerful influence in the clerkships process. The investigation was initiated before Kavanaugh was nominated by Donald Trump to serve on the high court. Rubenfeld said in a statement to the Guardian: "In June, Yale University informed me that it would conduct what it terms an 'informal review' of certain allegations, but that to preserve anonymity, I was not entitled to know any specifics. As a result, I do not know what I am alleged to have said or done. I was further advised that the allegations were not of the kind that would jeopardize my position as a long-tenured member of the faculty."
---
From: Lawrence Krauss
To: jef...
What's the current status with Lawrence and those harassment allegations from BuzzFeed? Did he ever get around to writing that point by point refutation we talked about? And how's he doing with the Bulletin now that he took that leave?
Subject: URGENT: BuzzFeed News inquiry re allegations of sexual harassment
---
case, that the University would remove the allegation from my record after 5 years, which makes me surprised that someone violated
that written agreement with you.
Re item 6: You report on ASU’s response to item #6 , without including the fact that the University specifically stated there were never
any allegations of sexual misconduct or harassment by me at the University, and the outside complaints were in fact related specifically
to your item #6. Further you neglect to mention that this complaint was by an anonymous third party, not the individual who was
allegedly harassed, who never lodged a complain, and that no specific evidence was provided of the alleged transgression. | was
surprised and dismayed that both ASU and ANU launched investigations on the basis of this but was told by both Universities that
because of my high profile even such unsubstantiated third party complaints at private event...
Subject: Re:
---
From: jeffrey E. jeevacation@gmail.com
To: Joi Ito
Date: Nov 22, 2017, at 10:44
all good?
---
From: Joi Ito
To: jeffrey E. jeevacation@gmail.com
Date: Wed, Nov 22, 2017 at 10:51 AM
Pretty good. Had a pinched nerve in my neck that screwed me up for awhile. How about you? Any plans to come to Boston?
---
From: jeffrey E. jeevacation@gmail.com
To: Joi Ito
Date: Nov 22, 2017, at 10:57
maybe week of 3rd not seeing anything that exciting ornew ? you? with all these guys getting busted for harassment , 1 have moved slightly up on the repuation ladder and have been asked everday for advice etc. this morning I have Ken Starr coming to point out how if clinton cigar lewinsky were to be outed today the world would be a different place
---
From: Joi
To: jeffrey E. [jeevacation@gmail.com]
Date: 11/22/2017 6:14:38 PM
Lots of stuff going on that week but I’m in town.
#metoo is quite amazing...
Madars is doing well. His PhD paper won an award and he found a vulner...
- Loss: pylate.losses.contrastive.Contrastive
Evaluation Dataset
Unnamed Dataset
- Size: 577 evaluation samples
- Columns: query, positive, and negative
- Approximate statistics based on the first 577 samples:
| | query | positive | negative |
|---|---|---|---|
| type | string | string | string |
| details | min: 23 tokens, mean: 54.08 tokens, max: 64 tokens | min: 47 tokens, mean: 63.95 tokens, max: 64 tokens | min: 5 tokens, mean: 36.6 tokens, max: 64 tokens |
- Samples:
query positive negative
A reporter from BuzzFeed just reached out about those old allegations and they're publishing soon. What advice did I give that physicist friend Lawrence a few years back when he was dealing with something similar? I remember telling him something about how these reporters operate and what they do with your responses.
Subject: Re:
---
From: R. Couri Hay
To: jeevacation@gmail.com
Date: Friday, March 4 2011 07:03 PM
Good morning Jeffery,
The Newsweek story is being written without your input as you read this. Lloyd Grove is not writing the story, he made a few calls but he
was taken off the project. This is for Newsweek, the magazine that is on the stands, not the website. It's still 1200 to 1500 words.
Alexandra Wolfe needs to turn in a draft of the story by Monday. I've been subtly guiding Alexandra on your behalf, but would really like
to formalize this job with your attorneys and then I would really need to give names and numbers of pro Jeffery power brokers for
Alexandra to call. She has already called Donald Trump, Leon Black, and Les Wexner among others. I've given her Jonathans number,
please let me focus on this before its too late to spin this in your direction. I'm in all afternoon my office number is My cell
is: My email i _________
Cheers,
Couri
---
From: jeffrey epstein
Subject: Re:
---
From: J [jeevacation@gmail.com]
To: Michael Wolff
Date: On Thu, May 30, 2019 at 5:29 PM
is it a coincidence that the russian that bought the house in palm beach and knows all , is the same guy
that sold a painting last year to mbs for 450 million dollars. that was only worth 1. 5m?
---
From: Michael Wolff
To: J jeevacation@gmail.com
Date: On Thu, May 30, 2019 at 5:33 PM
So MBS was paying him off? Why? Ideas?
---
From: J jeevacation@gmail.com
To: Michael Wolff
Date: On Thu, May 30, 2019 at 5:35 PM
reminder trump overuled congress on yemen.
---
From: Michael Wolff
To: J jeevacation@gmail.com
Date: On Thu, May 30, 2019 at 5:37 PM
Starting to smell sweet, the way you put it!
---
From: J jeevacation@gmail.com
To: Michael Wolff
Date: On Thu, May 30, 2019 at 5:40 PM
In addition my art guyd said the painting wasn't very good
---
From: Michael Wolff
To: J jeevacation@gmail.com
Date: On Thu, May 30, 2019 at 5:41 PM
You have an art guy?
---
From: ...
Michael Wolff wants me to do something with him again - he mentioned some big TV interview opportunity. Can you remind me what advice he gave me last time about going public and dealing with the media? I think it was something about going on Charlie Rose and becoming an anti-Trump voice to get political cover. And should I reach out to Kathy about this since she's been helping me understand Washington?
Subject: Re:
---
From: jeffrey E. jeevacation@gmail.com
To: Thorbjørn Jagland
Date: Feb 17, 2017 9:27 PM
Im in paris until thurs? you?
---
From: Thorbjørn Jagland
To: jeffrey E. jeevacation@gmail.com
Date: Feb 19, 2017 8:54 AM
Is It possible for you to pass by Strasbourg, it would be great. I really need to understand more about Trump
and what's going on in the American society.
---
From: jeffrey E. jeevacation@gmail.com
To: Thorbjørn Jagland
Date: Feb 19, 2017 11:14
yes, that should be possible. remind me how long is the fast train? otherwise ill fly. what days are good
for you
---
From: Thorbjørn Jagland
To: jeffrey E. [jeevacation@gmail.com]
Date: 2/19/2017 9:11:12 PM
Train, 1h45, I'll pick you up at the train station. Tuesday afternoon is ok. Also Wednesday, but only after 6
Please find attached the Q3 board meeting minutes for your review.
Where do things stand with the media and legal situation now? I have that meeting next week and need to refresh my memory on what we've covered with reporters, our current search results status, and where Michael Wolff left off.
Subject: Re: Shears Update
---
From: Tyler Shears
To: jeffrey E. [jeevacation@gmail.com]
Date: 7/16/2014 5:56:14 PM
yes agree
result got worse with recent negative press, clinton, as we discussed
we were down to 1 negative (forbes) before that - Christina can confirm this.
It will get back to 1 and then 0 so long as negative things stop coming out. if new negative keeps coming out it
really dismantles much of our effort... especially when it involves an ex-president
no excuses here i'm not pleased with where it is at and am still working to make it happen
---
From: jeffrey E. jeevacation@gmail.com
Date: Wed, Jul 16, 2014 at 1:25 PM
Results still very bad
---
From: Christina Galbraith
To: Tyler Shears
Date: Wednesday, July 16, 2014
Hi Tyler,
The social media sites are constantly updated: LinkedIn, Facebook, Twitter, google +.
I'll be in touch later today re: feature article.
Could you put a site map into the Net site.? This helps rankings. Also could you add it to google ana...
Subject: Re:
---
From: jeffrey E. [jeevacation@gmail.com]
To: Jonathan Farkas
Date: 12/7/2016 12:25:22 P.M. Eastern Standard Time
plenty left
---
From: Jonathan Farkas
Date: Wed, Dec 7, 2016 at 1:01 PM
Hi jeffrey hope all is well | think you are going to have a winderful life from now on in your opinion how much is left in
this market it's been a trump triumph | gave him some money through woody best jonathan
---
From: jeffrey E. [jeevacation@gmail.com]
To: Jonathan Farkas
Date: 12/7/2016 5:30:18 PM
oy
- Loss: pylate.losses.contrastive.Contrastive
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: epoch
- per_device_train_batch_size: 2
- gradient_accumulation_steps: 32
- learning_rate: 3e-06
- warmup_ratio: 0.1
- bf16: True
- load_best_model_at_end: True
- gradient_checkpointing: True
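The batch-size and accumulation settings above determine the optimizer schedule. The arithmetic below is a sketch that assumes training on a single device (the device count is not stated in this card) and that the warmup step count is rounded up. Under those assumptions it reproduces the 82 optimizer steps per epoch and the fractional epoch values that appear in the training logs.

```python
import math

# Values reported in this card.
train_samples = 5191
per_device_train_batch_size = 2
gradient_accumulation_steps = 32
num_train_epochs = 3
warmup_ratio = 0.1

# Assumption: a single device was used (not stated in the card).
num_devices = 1

# Number of samples contributing to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Micro-batches per epoch, then optimizer steps per epoch (last partial group counts).
batches_per_epoch = math.ceil(train_samples / (per_device_train_batch_size * num_devices))
steps_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
total_steps = steps_per_epoch * num_train_epochs

# Assumption: warmup steps are the ceiling of ratio * total steps.
warmup_steps = math.ceil(total_steps * warmup_ratio)

# Fractional epoch reached after 10 optimizer steps.
epoch_at_step_10 = round(10 * gradient_accumulation_steps / batches_per_epoch, 4)

print(effective_batch_size, steps_per_epoch, total_steps, warmup_steps, epoch_at_step_10)
```

With these numbers, the learning rate would ramp up over roughly the first 25 of 246 optimizer steps before decaying linearly.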
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: epoch
- prediction_loss_only: True
- per_device_train_batch_size: 2
- per_device_eval_batch_size: 8
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 32
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 3e-06
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 3
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: True
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: True
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- parallelism_config: None
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch_fused
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- hub_revision: None
- gradient_checkpointing: True
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- liger_kernel_config: None
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: proportional
- router_mapping: {}
- learning_rate_mapping: {}
Training Logs
| Epoch | Step | Training Loss | Validation Loss | accuracy |
|---|---|---|---|---|
| 0.1233 | 10 | 3.3802 | - | - |
| 0.2465 | 20 | 1.1614 | - | - |
| 0.3698 | 30 | 0.5549 | - | - |
| 0.4931 | 40 | 0.4642 | - | - |
| 0.6163 | 50 | 0.3749 | - | - |
| 0.7396 | 60 | 0.3587 | - | - |
| 0.8629 | 70 | 0.349 | - | - |
| 0.9861 | 80 | 0.367 | - | - |
| 0 | 0 | - | - | 0.9723 |
| 1.0 | 82 | - | 1.0230 | - |
| 1.0986 | 90 | 0.2812 | - | - |
| 1.2219 | 100 | 0.3016 | - | - |
| 1.3451 | 110 | 0.2187 | - | - |
| 1.4684 | 120 | 0.2104 | - | - |
| 1.5917 | 130 | 0.2656 | - | - |
| 1.7149 | 140 | 0.2199 | - | - |
| 1.8382 | 150 | 0.2356 | - | - |
| 1.9615 | 160 | 0.2607 | - | - |
| 0 | 0 | - | - | 0.9757 |
| 2.0 | 164 | - | 0.9950 | - |
| 2.0740 | 170 | 0.1836 | - | - |
| 2.1972 | 180 | 0.1789 | - | - |
| 2.3205 | 190 | 0.2342 | - | - |
| 2.4438 | 200 | 0.1904 | - | - |
| 2.5670 | 210 | 0.1312 | - | - |
| 2.6903 | 220 | 0.1813 | - | - |
| 2.8136 | 230 | 0.1826 | - | - |
| 2.9368 | 240 | 0.1456 | - | - |
| 0 | 0 | - | - | 0.9757 |
| 3.0 | 246 | - | 0.873 | - |
- The bold row denotes the saved checkpoint.
Framework Versions
- Python: 3.11.12
- Sentence Transformers: 5.1.1
- PyLate: 1.4.0
- Transformers: 4.56.2
- PyTorch: 2.9.0+cu128
- Accelerate: 1.13.0
- Datasets: 4.8.4
- Tokenizers: 0.22.2
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084"
}
PyLate
@inproceedings{DBLP:conf/cikm/ChaffinS25,
author = {Antoine Chaffin and
Rapha{\"{e}}l Sourty},
editor = {Meeyoung Cha and
Chanyoung Park and
Noseong Park and
Carl Yang and
Senjuti Basu Roy and
Jessie Li and
Jaap Kamps and
Kijung Shin and
Bryan Hooi and
Lifang He},
title = {PyLate: Flexible Training and Retrieval for Late Interaction Models},
booktitle = {Proceedings of the 34th {ACM} International Conference on Information
and Knowledge Management, {CIKM} 2025, Seoul, Republic of Korea, November
10-14, 2025},
pages = {6334--6339},
publisher = {{ACM}},
year = {2025},
url = {https://github.com/lightonai/pylate},
doi = {10.1145/3746252.3761608},
}