Stffens committed on
Commit 8cb04ff · verified · 1 Parent(s): ce697f4

Update README.md

Files changed (1)
  1. README.md +170 -43
README.md CHANGED
@@ -1,52 +1,179 @@
- # bge-small-rrf-v2: A 33M Parameter Model That Beats ColBERTv2 on 3/5 BEIR Datasets
-
- **Trained with zero human labels using hybrid retrieval disagreement.**
-
  ## Key Result

- | Dataset | ColBERTv2 (110M) | BGE-small base (33M) | **This model (33M)** | vs ColBERTv2 |
- |---------|:-:|:-:|:-:|:-:|
- | SciFact | 0.693 | 0.646 | **0.695** | **+0.2%** |
- | NFCorpus | 0.344 | 0.330 | **0.395** | **+14.8%** |
- | SciDocs | 0.154 | 0.178 | **0.188** | **+21.8%** |
- | FiQA | 0.356 | 0.328 | 0.328 | -7.8% |
- | ArguAna | 0.463 | 0.419 | 0.424 | -8.4% |
-
- ## Why This Matters
-
- Most embedding improvements need larger models, human labels, or teacher distillation. This model needs
- none. The signal comes from observing where vector search and keyword search disagree. **The system
- improves itself.**
-
- ## Training Signal: 82% of queries produce disagreement between vector and keyword search
-
- - **Vector blind spots** (51%): ranked high by vector but keywords ignore
- - **Keyword blind spots** (49%): found by keywords but vector misses
-
- 76K (query, positive, hard_negative) triples. Zero human labels. $0 cost.
-
- ## Training
-
- - Base: BAAI/bge-small-en-v1.5 (33M params, 384d)
- - Loss: MNRL + explicit hard negatives
- - TripletLoss destroyed the model (-84%). MNRL preserves knowledge.
- - 2 epochs, lr=3e-6, batch 64, ~30 min on T4 GPU
-
- ## Usage
-
  ```python
  from sentence_transformers import SentenceTransformer
- model = SentenceTransformer("Stffens/bge-small-rrf-v2")
- embeddings = model.encode(["query", "document"])
-
- Train on your own data

- pip install vstash sentence-transformers torch
- vstash retrain
  vstash reindex --model ~/.vstash/models/retrained

- Links

- - vstash
- - Paper
- - Base model

+ ---
+ language: en
+ license: apache-2.0
+ library_name: sentence-transformers
+ tags:
+ - sentence-transformers
+ - embedding
+ - retrieval
+ - hybrid-search
+ - self-supervised
+ - fine-tuned
+ - BEIR
+ - information-retrieval
+ base_model: BAAI/bge-small-en-v1.5
+ datasets:
+ - BeIR/scifact
+ - BeIR/nfcorpus
+ - BeIR/fiqa
+ pipeline_tag: feature-extraction
+ model-index:
+ - name: bge-small-rrf-v2
+   results:
+   - task:
+       type: Retrieval
+     dataset:
+       name: BEIR SciFact
+       type: BeIR/scifact
+     metrics:
+     - type: ndcg_at_10
+       value: 0.6945
+   - task:
+       type: Retrieval
+     dataset:
+       name: BEIR NFCorpus
+       type: BeIR/nfcorpus
+     metrics:
+     - type: ndcg_at_10
+       value: 0.3949
+   - task:
+       type: Retrieval
+     dataset:
+       name: BEIR SciDocs
+       type: BeIR/scidocs
+     metrics:
+     - type: ndcg_at_10
+       value: 0.1875
+ ---
+
+ # bge-small-rrf-v2: A 33M Parameter Model That Beats ColBERTv2 on 3/5 BEIR Datasets
+
+ **Trained with zero human labels using a novel self-supervised signal: hybrid retrieval disagreement.**
+
+ When vector search and keyword search disagree on what's relevant for a query, that disagreement reveals where the embedding model fails. We exploit this signal to fine-tune BGE-small, producing a model that better distinguishes "semantically close" from "actually relevant."

  ## Key Result

+ A 33M parameter model, fine-tuned for $0 with zero human labels, **surpasses ColBERTv2 (110M parameters) on 3 out of 5 standard BEIR benchmarks**:
+
+ | Dataset | Docs | ColBERTv2 (110M) | BGE-small base (33M) | **This model (33M)** | vs ColBERTv2 |
+ |---------|:----:|:-:|:-:|:-:|:-:|
+ | SciFact | 5K | 0.693 | 0.646 | **0.695** | **+0.2%** |
+ | NFCorpus | 3.6K | 0.344 | 0.330 | **0.395** | **+14.8%** |
+ | SciDocs | 25K | 0.154 | 0.178 | **0.188** | **+21.8%** |
+ | FiQA | 57K | 0.356 | 0.328 | 0.328 | -7.8% |
+ | ArguAna | 8.6K | 0.463 | 0.419 | 0.424 | -8.4% |
+
+ **Up to +19.5% NDCG@10 improvement over the base model, with zero additional inference cost.**
+
+ ## Why This Matters
+
+ Most embedding improvements require one of the following:
+ - a larger model (more compute, more latency)
+ - human-labeled training data (expensive, slow)
+ - a teacher model for distillation (added complexity)
+
+ This model needs none of them. The training signal comes from running the existing hybrid retrieval pipeline and observing where its two components, vector search and keyword search, disagree. **The system improves itself.**
+
+ ## The Training Signal: Hybrid Retrieval Disagreement
+
+ We discovered that **82% of queries produce disagreement** between vector and keyword search within the top-5 results. These disagreements fall into two categories:
+
+ - **Vector blind spots** (51%): chunks that vector search ranks high but keyword search ignores; semantically similar, yet not actually relevant.
+ - **Keyword blind spots** (49%): chunks that keyword search finds but vector search misses; they contain the relevant terms, but the embedding fails to recognize their relevance.
+
+ Fine-tuning on these disagreement pairs teaches the model to fix both types of blind spot.
+
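The mining step described above can be sketched in a few lines. This is a minimal illustration with toy ranked lists standing in for the real vector and keyword (BM25) retrievers; the function names, the top-5 cutoff, and the "keyword hit as positive / vector-only hit as hard negative" pairing are assumptions for the sketch, not the actual vstash code.

```python
TOP_K = 5  # disagreement is measured on the top-5 results, per the text above

def disagreement(vector_top: list, keyword_top: list) -> dict:
    """Split two top-K result lists into the two blind-spot categories."""
    v, k = set(vector_top[:TOP_K]), set(keyword_top[:TOP_K])
    return {
        # ranked high by vector search but ignored by keywords
        "vector_blind_spots": [d for d in vector_top[:TOP_K] if d not in k],
        # found by keywords but missed by vector search
        "keyword_blind_spots": [d for d in keyword_top[:TOP_K] if d not in v],
    }

def make_triples(query: str, vector_top: list, keyword_top: list) -> list:
    """Turn one disagreement into (query, positive, hard_negative) triples.

    Sketch assumption: a keyword-confirmed hit plays the positive, a
    vector-only hit plays the hard negative.
    """
    d = disagreement(vector_top, keyword_top)
    return [(query, pos, neg)
            for pos in d["keyword_blind_spots"]
            for neg in d["vector_blind_spots"]]

# Toy example: the two retrievers agree only on doc_a.
vec = ["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"]
kw = ["doc_a", "doc_f", "doc_g", "doc_h", "doc_i"]
triples = make_triples("what causes scurvy", vec, kw)
print(len(triples))  # 4 keyword-only docs x 4 vector-only docs = 16 triples
```

No labels are involved at any point: the triples fall out of comparing the two result lists.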
+ ## Training Details
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Base model | [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) |
+ | Parameters | 33M (unchanged) |
+ | Embedding dimension | 384 (unchanged) |
+ | Loss function | MultipleNegativesRankingLoss with explicit hard negatives |
+ | Training data | 76K (query, positive, hard_negative) triples |
+ | Data source | RRF disagreement signal on SciFact, NFCorpus, FiQA |
+ | Human labels | **Zero** |
+ | Epochs | 2 |
+ | Learning rate | 3e-6 |
+ | Batch size | 64 |
+ | Training time | ~30 min on a T4 GPU |
+ | Training cost | $0 (Colab free tier) |
+
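For reference, Reciprocal Rank Fusion (the RRF in the table above) combines the two retrievers' rankings by summing reciprocal-rank scores. A minimal sketch, assuming the common `k = 60` constant from the original RRF formulation (not necessarily vstash's setting):

```python
def rrf_fuse(rankings: list, k: int = 60) -> list:
    """Fuse several ranked lists: each retriever contributes 1 / (k + rank)
    for every document it returns; documents are sorted by the summed score."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: doc_a and doc_c appear in both lists, so they float to the top.
vector_top = ["doc_a", "doc_b", "doc_c"]
keyword_top = ["doc_c", "doc_a", "doc_d"]
fused = rrf_fuse([vector_top, keyword_top])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents the two retrievers agree on dominate the fused ranking; the disagreement signal used for training lives in the documents only one side returns.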
+ ### Why MNRL, not TripletLoss?
+
+ We tested TripletLoss first. It **destroyed the model** (-84% NDCG after 3 epochs). TripletLoss pushes each individual negative away by brute force, distorting the embedding space. MNRL instead adjusts relationships across 64 documents per batch simultaneously, preserving the model's general knowledge while still learning from the disagreements.
+
+ | Loss function | NDCG@10 on SciFact | Result |
+ |---|:-:|---|
+ | TripletLoss (3 epochs, lr=2e-5) | 0.055 | -84% (destroyed) |
+ | TripletLoss (1 epoch, lr=1e-6) | 0.347 | -0.03% (no effect) |
+ | MNRL, in-batch negatives only (v1) | 0.683 | +5.6% |
+ | **MNRL + explicit hard negatives (this model)** | **0.695** | **+7.4%** |
+
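The mechanics behind MNRL with explicit hard negatives can be made concrete. A minimal NumPy sketch (independent of the sentence-transformers implementation): each query is scored against every positive in the batch plus all appended hard negatives, and the loss is cross-entropy with the query's own positive as the target class. The `scale=20.0` default mirrors sentence-transformers' MultipleNegativesRankingLoss; the array shapes are assumptions for illustration.

```python
import numpy as np

def mnrl_loss(q, p, hn, scale=20.0):
    """MNRL with explicit hard negatives (sketch, not the library code).

    q:  (B, D) query embeddings      p:  (B, D) positive embeddings
    hn: (B, D) one hard negative per query; every row is appended as an
        extra candidate column, so each query sees B positives + B negatives.
    """
    cands = np.concatenate([p, hn], axis=0)                  # (2B, D)
    # cosine similarity: normalize, then scaled dot products
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    cands = cands / np.linalg.norm(cands, axis=1, keepdims=True)
    logits = scale * q @ cands.T                             # (B, 2B)
    # cross-entropy: target for query i is its own positive, column i
    target = logits[np.arange(len(q)), np.arange(len(q))]
    logsumexp = np.log(np.exp(logits).sum(axis=1))
    return float(np.mean(logsumexp - target))

rng = np.random.default_rng(0)
B, D = 4, 8
loss = mnrl_loss(rng.normal(size=(B, D)),
                 rng.normal(size=(B, D)),
                 rng.normal(size=(B, D)))
print(round(loss, 3))
```

Because every batch row acts as a negative for every other row, one gradient step rebalances 2B-1 relationships per query rather than a single anchor-negative pair, which is the "preserves knowledge" behavior described above.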
+ ## Usage
+
+ ### With sentence-transformers
  ```python
  from sentence_transformers import SentenceTransformer

+ model = SentenceTransformer("Stffens/bge-small-rrf-v2")
+ embeddings = model.encode(["your query", "your document"], normalize_embeddings=True)
+ similarity = embeddings[0] @ embeddings[1]  # cosine similarity
+ ```
+
+ ### With vstash (hybrid retrieval system)
+ ```bash
+ pip install vstash
+ vstash reindex --model Stffens/bge-small-rrf-v2
+ vstash search "your query"
+ ```
+
+ ### Train your own version on your data
+ ```bash
+ pip install vstash sentence-transformers torch
+ vstash retrain  # generates disagreement pairs from YOUR corpus and fine-tunes
  vstash reindex --model ~/.vstash/models/retrained
+ ```
+
+ ## Reproduce From Scratch
+
+ ```bash
+ git clone https://github.com/stffns/vstash
+ cd vstash
+ pip install -e . sentence-transformers torch
+
+ # Generate disagreement triples
+ python -m experiments.rrf_training_pairs --datasets scifact nfcorpus fiqa
+
+ # Train (GPU recommended)
+ python -m experiments.finetune_rrf --epochs 2 --lr 3e-6 --batch-size 64
+
+ # Evaluate
+ python -m experiments.finetune_rrf --evaluate-only
+ ```
+
+ ## Limitations
+
+ - **ArguAna regression**: queries of 200+ words score -8.4% vs ColBERTv2. Long argumentative queries produce only 1.1% disagreement, leaving almost no training signal.
+ - **FiQA neutral**: financial queries score +0.1% vs the base model but -7.8% vs ColBERTv2. The disagreement signal exists (86.7% of queries) but doesn't translate into NDCG gains on this dataset.
+ - **English only**: inherited from BGE-small-en-v1.5.
+ - **Not tested beyond BEIR**: performance on domain-specific corpora may vary.
+
+ ## Citation
+
+ ```bibtex
+ @software{vstash2026,
+   author = {Steffens, Jayson},
+   title  = {vstash: Local-First Hybrid Retrieval with Adaptive Fusion for LLM Agents},
+   url    = {https://github.com/stffns/vstash},
+   year   = {2026}
+ }
+ ```
+
+ ## Related
+
+ - [vstash paper](https://github.com/stffns/vstash/blob/main/paper/vstash-paper.md) (Section 8.10: Self-Supervised Embedding Refinement)
+ - [vstash GitHub](https://github.com/stffns/vstash)
+ - [Base model: BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)