	---
	language:
	- en
	library_name: colbert
	pipeline_tag: sentence-similarity
	tags:
	- information-retrieval
	- retrieval
	- late-interaction
	- ColBERT
	license: mit # ← change if needed
	base_model: colbert-ir/colbertv1.9
	---

	# Colbert-Finetuned

	ColBERT (Contextualized Late Interaction over BERT) is a retrieval model that scores queries against passages using fine-grained, token-level interactions ("late interaction"). This repo hosts a fine-tuned ColBERT checkpoint for neural information retrieval.

	- Base model: `colbert-ir/colbertv1.9`
	- Library: [`colbert`](https://github.com/stanford-futuredata/ColBERT) (with Hugging Face backbones)
	- Intended use: passage/document retrieval in RAG and search systems

	> ℹ️ ColBERT encodes queries and passages into token-level embedding matrices. At search time, `MaxSim` matches each query token to its most similar passage token and sums those maxima into a relevance score. This often gives better retrieval quality than single-vector embedding retrievers while remaining scalable.
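
To make the `MaxSim` scoring concrete, here is a minimal sketch in plain Python. It is an illustration only, not the `colbert` library's implementation: the function names and the toy 2-dimensional embeddings are hypothetical, and real ColBERT embeddings come from the BERT encoder and are scored in batched tensor operations.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def maxsim_score(query_emb, passage_emb):
    """Late-interaction score: for each query token, take the maximum cosine
    similarity over all passage tokens, then sum those maxima."""
    q = [normalize(v) for v in query_emb]
    p = [normalize(v) for v in passage_emb]
    score = 0.0
    for q_tok in q:
        score += max(sum(a * b for a, b in zip(q_tok, p_tok)) for p_tok in p)
    return score


# Toy embeddings (hypothetical): one row per token.
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # has a close match for each query token
passage_b = [[1.0, 1.0]]                           # one token, partial match for both

score_a = maxsim_score(query, passage_a)
score_b = maxsim_score(query, passage_b)
print(score_a, score_b)
```

In this toy case `passage_a` scores higher because every query token finds a near-exact match among its tokens, which is the behaviour a single pooled vector cannot express.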

	---

	## ✨ What’s in this checkpoint

	- Fine-tuned ColBERT weights starting from `colbert-ir/colbertv1.9`.
	- Trained on triples in JSONL format (`[qid, pid+, pid-]`), with the query and passage text supplied as TSV files (`queries.tsv` and `collection.tsv`, one ID and its text per line).
	- Default training hyperparameters are listed below (batch size, lr, doc_maxlen, dim, etc.).
	- This checkpoint and the associated contrastive training data are part of the work [NLKI: A lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA Tasks](https://arxiv.org/pdf/2508.19724).
	- All copyrights for the training data are retained by their original owners; we do not claim ownership.
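
The three training inputs described above can be read with the standard library alone. The sketch below assumes integer IDs and uses small in-memory strings in place of real files; the sample rows and helper names are hypothetical and are not taken from the actual training data.

```python
import csv
import io
import json

# Hypothetical sample data; real files would be opened from disk instead.
QUERIES_TSV = "0\twhat is used to cut paper\n1\twhere do fish live\n"
COLLECTION_TSV = (
    "0\tScissors are a hand tool for cutting paper.\n"
    "1\tFish live in water such as rivers and oceans.\n"
    "2\tA hammer is used to drive nails.\n"
)
TRIPLES_JSONL = "[0, 0, 2]\n[1, 1, 2]\n"  # one [qid, pid+, pid-] list per line


def read_tsv(text):
    """Map integer ID -> text for lines of the form 'id<TAB>text'."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return {int(row[0]): row[1] for row in reader if row}


def read_triples(text):
    """Parse one JSON list per line into (qid, positive_pid, negative_pid)."""
    return [tuple(json.loads(line)) for line in text.splitlines() if line.strip()]


queries = read_tsv(QUERIES_TSV)
collection = read_tsv(COLLECTION_TSV)
triples = read_triples(TRIPLES_JSONL)

# Resolve each triple to text: query, relevant passage, non-relevant passage.
for qid, pos_pid, neg_pid in triples:
    print(queries[qid], "|", collection[pos_pid], "|", collection[neg_pid])
```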
	---

	## 🔧 Quickstart

	### Option A — Use with the ColBERT library (recommended)

	```python
	from colbert.infra import Run, RunConfig, ColBERTConfig
	from colbert import Indexer, Searcher
	from colbert.data import Queries

	# 1) Index your collection (pid \t passage)
	with Run().context(RunConfig(nranks=1, experiment="my-exp")):
	    cfg = ColBERTConfig(root="/path/to/experiments")
	    indexer = Indexer(checkpoint="dutta18/Colbert-Finetuned", config=cfg)
	    indexer.index(
	        name="my.index",
	        collection="/path/to/collection.tsv",  # "pid \t passage text"
	    )

	# 2) Search with queries (qid \t query)
	with Run().context(RunConfig(nranks=1, experiment="my-exp")):
	    cfg = ColBERTConfig(root="/path/to/experiments")
	    searcher = Searcher(index="my.index", config=cfg)
	    queries = Queries("/path/to/queries.tsv")  # "qid \t query text"
	    ranking = searcher.search_all(queries, k=20)  # top-20 passages per query
	    ranking.save("my.index.top20.tsv")