---
language:
- en
library_name: colbert
pipeline_tag: sentence-similarity
tags:
- information-retrieval
- retrieval
- late-interaction
- ColBERT
license: mit # ← change if needed
base_model: colbert-ir/colbertv1.9
---

# Colbert-Finetuned

**ColBERT** (Contextualized Late Interaction over BERT) is a retrieval model that scores queries against passages using fine-grained token-level interactions (“late interaction”). This repo hosts a **fine-tuned ColBERT checkpoint** for neural information retrieval.

- **Base model:** `colbert-ir/colbertv1.9`
- **Library:** [`colbert`](https://github.com/stanford-futuredata/ColBERT) (with Hugging Face backbones)
- **Intended use:** passage/document retrieval in RAG and search systems

> ℹ️ ColBERT encodes queries and passages into token-level embedding matrices and uses `MaxSim` to compute relevance at search time. It typically outperforms single-vector embedding retrievers while remaining scalable.
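The `MaxSim` operator can be sketched in a few lines of plain Python: each query token picks its best-matching (maximum dot-product) passage token, and those per-token maxima are summed. The vectors below are toy 2-D values for readability, not real model output.

```python
# Late-interaction ("MaxSim") scoring sketch over token embedding matrices.
def maxsim_score(query_vecs, doc_vecs):
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    # For each query token, keep its best doc-token similarity; sum over query tokens.
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy unit-length embeddings (hypothetical, dim=2)
Q = [[1.0, 0.0], [0.0, 1.0]]
D = [[1.0, 0.0], [0.6, 0.8]]
print(maxsim_score(Q, D))  # 1.0 + 0.8 = 1.8
```

In the real model the embeddings come from the BERT backbone and are L2-normalized, so each dot product is a cosine similarity.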

---

## ✨ What’s in this checkpoint

- Fine-tuned ColBERT weights starting from `colbert-ir/colbertv1.9`.
- Trained with **triples JSONL** (`[qid, pid+, pid-]`) using **TSV** `queries.tsv` and `collection.tsv` (IDs + text).
- Default training hyperparameters are listed below (batch size, lr, doc_maxlen, dim, etc.).

---
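For concreteness, the TSV/JSONL layouts above can be mocked up as below (the IDs and text are hypothetical):

```python
import json

# collection.tsv: one passage per line, "pid \t passage text"
collection_tsv = "0\tColBERT uses late interaction.\n1\tBM25 is a lexical ranker.\n"
# queries.tsv: one query per line, "qid \t query text"
queries_tsv = "0\twhat is late interaction\n"
# triples JSONL: one JSON array per line, [qid, positive pid, negative pid]
triples_jsonl = "[0, 0, 1]\n"

passages = dict(line.split("\t", 1) for line in collection_tsv.splitlines())
qid, pos_pid, neg_pid = json.loads(triples_jsonl.splitlines()[0])
print(passages[str(pos_pid)])  # the positive passage for this training example
```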

## 🔧 Quickstart

### Option A — Use with the ColBERT library (recommended)

```python
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Indexer, Searcher
from colbert.data import Queries

# 1) Index your collection (pid \t passage)
with Run().context(RunConfig(nranks=1, experiment="my-exp")):
    cfg = ColBERTConfig(root="/path/to/experiments")
    indexer = Indexer(checkpoint="dutta18/Colbert-Finetuned", config=cfg)
    indexer.index(
        name="my.index",
        collection="/path/to/collection.tsv",  # "pid \t passage text"
    )

# 2) Search with queries (qid \t query)
with Run().context(RunConfig(nranks=1, experiment="my-exp")):
    cfg = ColBERTConfig(root="/path/to/experiments")
    searcher = Searcher(index="my.index", config=cfg)
    queries = Queries("/path/to/queries.tsv")  # "qid \t query text"
    ranking = searcher.search_all(queries, k=20)
    ranking.save("my.index.top20.tsv")
```
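The saved ranking is a plain TSV. A minimal reader, assuming the usual `qid \t pid \t rank \t score` columns (check your output file, since the exact layout can vary by ColBERT version):

```python
from collections import defaultdict

def read_ranking(lines):
    # Group (rank, pid, score) tuples by query ID, in file order.
    results = defaultdict(list)
    for line in lines:
        qid, pid, rank, score = line.rstrip("\n").split("\t")
        results[qid].append((int(rank), pid, float(score)))
    return results

# Hypothetical rows from my.index.top20.tsv
rows = ["0\t17\t1\t24.5", "0\t42\t2\t22.1"]
top = read_ranking(rows)
print(top["0"][0])  # best-ranked passage for query 0
```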
|