---
language:
- en
library_name: colbert
pipeline_tag: sentence-similarity
tags:
- information-retrieval
- retrieval
- late-interaction
- ColBERT
license: mit # ← change if needed
base_model: colbert-ir/colbertv1.9
---

# Colbert-Finetuned

**ColBERT** (Contextualized Late Interaction over BERT) is a retrieval model that scores queries against passages using fine-grained token-level interactions ("late interaction"). This repo hosts a **fine-tuned ColBERT checkpoint** for neural information retrieval.

- **Base model:** `colbert-ir/colbertv1.9`
- **Library:** [`colbert`](https://github.com/stanford-futuredata/ColBERT) (with Hugging Face backbones)
- **Intended use:** passage/document retrieval in RAG and search systems

> ℹ️ ColBERT encodes queries and passages into token-level embedding matrices and uses `MaxSim` to compute relevance at search time. It typically outperforms single-vector embedding retrievers while remaining scalable.

---

## ✨ What's in this checkpoint

- Fine-tuned ColBERT weights starting from `colbert-ir/colbertv1.9`.
- Trained with **triples JSONL** (`[qid, pid+, pid-]`) alongside **TSV** files `queries.tsv` and `collection.tsv` (IDs + text).
- Default training hyperparameters are listed below (batch size, learning rate, doc_maxlen, dim, etc.).
- This checkpoint and the associated contrastive training data are part of the work [`NLKI: A lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA Tasks`](https://arxiv.org/pdf/2508.19724).
- All copyrights for the training data are retained by their original owners; we do not claim ownership.
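To make the `MaxSim` idea above concrete, here is a toy sketch of late-interaction scoring: for each query token, take the maximum similarity over all passage tokens, then sum across query tokens. This is an illustration only (the library's real implementation operates on normalized BERT token embeddings); the 2-D vectors below are made up.

```python
# Toy illustration of ColBERT-style MaxSim scoring over token embeddings.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(Q, D):
    """For each query token embedding q, take the max similarity over all
    passage token embeddings d, then sum across query tokens."""
    return sum(max(dot(q, d) for d in D) for q in Q)

# Made-up 2-D "token embeddings" for one query and two passages.
Q = [[1.0, 0.0], [0.0, 1.0]]
D_relevant = [[0.9, 0.1], [0.1, 0.9]]
D_irrelevant = [[-1.0, 0.0], [0.0, -1.0]]

print(maxsim_score(Q, D_relevant))    # → 1.8
print(maxsim_score(Q, D_irrelevant))  # → 0.0
```

Because each query token is matched against its best passage token independently, a passage scores well as long as it covers every query token somewhere, which is what distinguishes late interaction from pooling everything into a single vector.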
---

## 🔧 Quickstart

### Option A — Use with the ColBERT library (recommended)

```python
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Indexer, Searcher
from colbert.data import Queries

# 1) Index your collection ("pid \t passage" TSV)
with Run().context(RunConfig(nranks=1, experiment="my-exp")):
    cfg = ColBERTConfig(root="/path/to/experiments")
    indexer = Indexer(checkpoint="dutta18/Colbert-Finetuned", config=cfg)
    indexer.index(
        name="my.index",
        collection="/path/to/collection.tsv",  # "pid \t passage text"
    )

# 2) Search with queries ("qid \t query" TSV)
with Run().context(RunConfig(nranks=1, experiment="my-exp")):
    cfg = ColBERTConfig(root="/path/to/experiments")
    searcher = Searcher(index="my.index", config=cfg)
    queries = Queries("/path/to/queries.tsv")  # "qid \t query text"
    ranking = searcher.search_all(queries, k=20)
    ranking.save("my.index.top20.tsv")
```
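The quickstart assumes `collection.tsv` and `queries.tsv` already exist in ColBERT's tab-separated `id \t text` format. A minimal sketch of producing them (`write_tsv` is a hypothetical helper, not part of the library; the sample rows are made up):

```python
import csv

def write_tsv(path, rows):
    # Write (id, text) pairs as "id<TAB>text", one record per line.
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t").writerows(rows)

# Toy data; for the collection, pids are conventionally the 0-based line index.
collection = [
    (0, "ColBERT scores queries with token-level late interaction."),
    (1, "Single-vector retrievers pool all tokens into one embedding."),
]
queries = [(0, "what is late interaction retrieval")]

write_tsv("collection.tsv", collection)
write_tsv("queries.tsv", queries)
```

Passage and query text should be kept on a single line (no embedded tabs or newlines), since the loader treats each line as one record.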