---
language:
- en
library_name: colbert
pipeline_tag: sentence-similarity
tags:
- information-retrieval
- retrieval
- late-interaction
- ColBERT
license: mit
base_model: colbert-ir/colbertv1.9
---

# Colbert-Finetuned

**ColBERT** (Contextualized Late Interaction over BERT) is a retrieval model that scores queries against passages using fine-grained, token-level interactions (“late interaction”). This repo hosts a **fine-tuned ColBERT checkpoint** for neural information retrieval.

- **Base model:** `colbert-ir/colbertv1.9`
- **Library:** [`colbert`](https://github.com/stanford-futuredata/ColBERT) (with Hugging Face backbones)
- **Intended use:** passage/document retrieval in RAG and search systems

> ℹ️ ColBERT encodes queries and passages into token-level embedding matrices and uses `MaxSim` to compute relevance at search time: each query token is matched to its most similar passage token, and those maxima are summed. It typically outperforms single-vector embedding retrievers while remaining scalable.

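
The `MaxSim` scoring mentioned in the note can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the `colbert` library's implementation: the helper names `l2_normalize` and `maxsim_score` and the 2-dimensional embeddings are made up for this example, whereas real token embeddings come from the model's encoder. The sketch assumes both inputs contain at least one non-zero vector.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_embs, doc_embs):
    """Late-interaction score: for each query token, take the best-matching
    passage token (max cosine similarity), then sum over query tokens."""
    q = [l2_normalize(v) for v in query_embs]
    d = [l2_normalize(v) for v in doc_embs]
    score = 0.0
    for qv in q:
        score += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return score


# Hypothetical toy embeddings: one row per token, two dimensions each.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # has an exact match for each query token
doc_b = [[1.0, 1.0]]                          # only a partial match for each query token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a > score_b)  # doc_a ranks above doc_b
```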

---

## ✨ What’s in this checkpoint

- Fine-tuned ColBERT weights, starting from `colbert-ir/colbertv1.9`.
- Trained on **triples JSONL** (one `[qid, pid+, pid-]` triple per line), with query and passage text supplied as **TSV** files, `queries.tsv` and `collection.tsv` (ID, tab, text).
- Default training hyperparameters are listed below (batch size, lr, doc_maxlen, dim, etc.).
- This checkpoint and the associated contrastive training data are part of the work [`NLKI: A lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA Tasks`](https://arxiv.org/pdf/2508.19724).
- All copyrights for the training data are retained by their original owners; we do not claim ownership.

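
To make the three training inputs concrete, here is a minimal sketch of how they might be read and cross-checked before training. The helper names `read_tsv` and `read_triples`, the use of integer IDs, and every sample row are hypothetical and invented for illustration; they are not taken from the actual training data. In-memory strings stand in for `queries.tsv`, `collection.tsv`, and the triples file.

```python
import io
import json


def read_tsv(handle):
    """Map each ID to its text; expects one 'id<TAB>text' record per line."""
    records = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        item_id, text = line.split("\t", 1)
        records[int(item_id)] = text  # assumption: IDs are integers
    return records


def read_triples(handle):
    """Return (qid, positive pid, negative pid) tuples, one JSON list per line."""
    triples = []
    for line in handle:
        line = line.strip()
        if line:
            qid, pos_pid, neg_pid = json.loads(line)
            triples.append((qid, pos_pid, neg_pid))
    return triples


# Hypothetical sample data standing in for the real files.
queries = read_tsv(io.StringIO("0\twhat do cats eat\n"))
collection = read_tsv(io.StringIO(
    "0\tCats are obligate carnivores.\n"
    "1\tParis is the capital of France.\n"
))
triples = read_triples(io.StringIO("[0, 0, 1]\n"))

# Collect triples that reference an ID missing from the TSV files.
dangling = [
    t for t in triples
    if t[0] not in queries or t[1] not in collection or t[2] not in collection
]
print(len(triples), "triples,", len(dangling), "with missing IDs")
```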

---

## 🔧 Quickstart

### Option A — Use with the ColBERT library (recommended)

```python
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Indexer, Searcher
from colbert.data import Queries

# 1) Index your collection (pid \t passage)
with Run().context(RunConfig(nranks=1, experiment="my-exp")):
    cfg = ColBERTConfig(root="/path/to/experiments")
    indexer = Indexer(checkpoint="dutta18/Colbert-Finetuned", config=cfg)
    indexer.index(
        name="my.index",
        collection="/path/to/collection.tsv",  # "pid \t passage text"
    )

# 2) Search with queries (qid \t query)
with Run().context(RunConfig(nranks=1, experiment="my-exp")):
    cfg = ColBERTConfig(root="/path/to/experiments")
    searcher = Searcher(index="my.index", config=cfg)
    queries = Queries("/path/to/queries.tsv")  # "qid \t query text"
    ranking = searcher.search_all(queries, k=20)
    ranking.save("my.index.top20.tsv")