---
language:
- en
library_name: colbert
pipeline_tag: sentence-similarity
tags:
- information-retrieval
- retrieval
- late-interaction
- ColBERT
license: mit # ← change if needed
base_model: colbert-ir/colbertv1.9
---

# Colbert-Finetuned

**ColBERT** (Contextualized Late Interaction over BERT) is a retrieval model that scores queries against passages using fine-grained token-level interactions (“late interaction”). This repo hosts a **fine-tuned ColBERT checkpoint** for neural information retrieval.

- **Base model:** `colbert-ir/colbertv1.9`
- **Library:** [`colbert`](https://github.com/stanford-futuredata/ColBERT) (with Hugging Face backbones)
- **Intended use:** passage/document retrieval in RAG and search systems

> ℹ️ ColBERT encodes queries and passages into token-level embedding matrices and uses `MaxSim` to compute relevance at search time. It typically outperforms single-vector embedding retrievers while remaining scalable.
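The `MaxSim` operator can be sketched in a few lines of plain Python: each query token picks its best-matching (maximum dot-product) passage token, and those per-token maxima are summed. The vectors below are toy 2-D values for readability, not real model output.

```python
# Late-interaction ("MaxSim") scoring sketch over token embedding matrices.
def maxsim_score(query_vecs, doc_vecs):
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    # For each query token, keep its best doc-token similarity; sum over query tokens.
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy unit-length embeddings (hypothetical, dim=2)
Q = [[1.0, 0.0], [0.0, 1.0]]
D = [[1.0, 0.0], [0.6, 0.8]]
print(maxsim_score(Q, D))  # 1.0 + 0.8 = 1.8
```

In the real model the embeddings come from the BERT backbone and are L2-normalized, so each dot product is a cosine similarity.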

---

## ✨ What’s in this checkpoint

- Fine-tuned ColBERT weights starting from `colbert-ir/colbertv1.9`.
- Trained with **triples JSONL** (`[qid, pid+, pid-]`) using **TSV** `queries.tsv` and `collection.tsv` (IDs + text).
- Default training hyperparameters are listed below (batch size, lr, doc_maxlen, dim, etc.).

---
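For concreteness, the TSV/JSONL layouts above can be mocked up as below (the IDs and text are hypothetical):

```python
import json

# collection.tsv: one passage per line, "pid \t passage text"
collection_tsv = "0\tColBERT uses late interaction.\n1\tBM25 is a lexical ranker.\n"
# queries.tsv: one query per line, "qid \t query text"
queries_tsv = "0\twhat is late interaction\n"
# triples JSONL: one JSON array per line, [qid, positive pid, negative pid]
triples_jsonl = "[0, 0, 1]\n"

passages = dict(line.split("\t", 1) for line in collection_tsv.splitlines())
qid, pos_pid, neg_pid = json.loads(triples_jsonl.splitlines()[0])
print(passages[str(pos_pid)])  # the positive passage for this training example
```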

## 🔧 Quickstart

### Option A — Use with the ColBERT library (recommended)

```python
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Indexer, Searcher
from colbert.data import Queries

# 1) Index your collection (pid \t passage)
with Run().context(RunConfig(nranks=1, experiment="my-exp")):
    cfg = ColBERTConfig(root="/path/to/experiments")
    indexer = Indexer(checkpoint="dutta18/Colbert-Finetuned", config=cfg)
    indexer.index(
        name="my.index",
        collection="/path/to/collection.tsv",  # "pid \t passage text"
    )

# 2) Search with queries (qid \t query)
with Run().context(RunConfig(nranks=1, experiment="my-exp")):
    cfg = ColBERTConfig(root="/path/to/experiments")
    searcher = Searcher(index="my.index", config=cfg)
    queries = Queries("/path/to/queries.tsv")  # "qid \t query text"
    ranking = searcher.search_all(queries, k=20)
    ranking.save("my.index.top20.tsv")
```
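The saved ranking is a plain TSV. A minimal reader, assuming the usual `qid \t pid \t rank \t score` columns (check your output file, since the exact layout can vary by ColBERT version):

```python
from collections import defaultdict

def read_ranking(lines):
    # Group (rank, pid, score) tuples by query ID, in file order.
    results = defaultdict(list)
    for line in lines:
        qid, pid, rank, score = line.rstrip("\n").split("\t")
        results[qid].append((int(rank), pid, float(score)))
    return results

# Hypothetical rows from my.index.top20.tsv
rows = ["0\t17\t1\t24.5", "0\t42\t2\t22.1"]
top = read_ranking(rows)
print(top["0"][0])  # best-ranked passage for query 0
```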
|