---
language: en
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- embedding
- retrieval
- hybrid-search
- self-supervised
- fine-tuned
- BEIR
- information-retrieval
base_model: BAAI/bge-small-en-v1.5
datasets:
- BeIR/scifact
- BeIR/nfcorpus
- BeIR/fiqa
pipeline_tag: feature-extraction
model-index:
- name: bge-small-rrf-v2
  results:
  - task:
      type: Retrieval
    dataset:
      name: BEIR SciFact
      type: BeIR/scifact
    metrics:
    - type: ndcg_at_10
      value: 0.6945
  - task:
      type: Retrieval
    dataset:
      name: BEIR NFCorpus
      type: BeIR/nfcorpus
    metrics:
    - type: ndcg_at_10
      value: 0.3949
  - task:
      type: Retrieval
    dataset:
      name: BEIR SciDocs
      type: BeIR/scidocs
    metrics:
    - type: ndcg_at_10
      value: 0.1875
---

# bge-small-rrf-v2: A 33M Parameter Model That Beats ColBERTv2 on 3/5 BEIR Datasets

**Trained with zero human labels using a novel self-supervised signal: hybrid retrieval disagreement.**

When vector search and keyword search disagree on what's relevant for a query, that disagreement reveals where the embedding model fails. We exploit this signal to fine-tune BGE-small, producing a model that better distinguishes "semantically close" from "actually relevant."
|
| 54 |
|
|
|
|
|
|
|
| 55 |
## Key Result
|
| 56 |
|
| 57 |
+
A 33M parameter model, fine-tuned for $0 with zero human labels, **surpasses ColBERTv2 (110M parameters) on 3 out of 5 standard BEIR benchmarks**:
|
| 58 |
+
|
| 59 |
+
| Dataset | Docs | ColBERTv2 (110M) | BGE-small base (33M) | **This model (33M)** | vs ColBERTv2 |
|
| 60 |
+
|---------|:----:|:-:|:-:|:-:|:-:|
|
| 61 |
+
| SciFact | 5K | 0.693 | 0.646 | **0.695** | **+0.2%** |
|
| 62 |
+
| NFCorpus | 3.6K | 0.344 | 0.330 | **0.395** | **+14.8%** |
|
| 63 |
+
| SciDocs | 25K | 0.154 | 0.178 | **0.188** | **+21.8%** |
|
| 64 |
+
| FiQA | 57K | 0.356 | 0.328 | 0.328 | -7.8% |
|
| 65 |
+
| ArguAna | 8.6K | 0.463 | 0.419 | 0.424 | -8.4% |
|
| 66 |
+
|
| 67 |
+
**Up to +19.5% NDCG improvement over the base model, with zero additional inference cost.**
|
| 68 |
+
|
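All numbers in this card are NDCG@10. For reference, here is a minimal sketch of the metric, simplified to binary relevance over the returned list (real BEIR evaluation scores against all judged documents, typically via `pytrec_eval`):

```python
import math

def ndcg_at_k(ranked_relevance, k=10):
    """NDCG@k over a returned ranking, given 0/1 relevance labels in rank order.
    Simplified: the ideal ranking is computed from the returned list only."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_relevance[:k]))
    ideal = sorted(ranked_relevance, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

ndcg_at_k([1, 1, 0])  # perfect ordering of the relevant docs -> 1.0
ndcg_at_k([0, 1])     # relevant doc demoted to rank 2 -> ~0.63
```

The log-discount is what makes the metric sensitive to the re-ranking the fine-tune produces: promoting a relevant chunk from rank 3 to rank 1 moves NDCG more than any change further down the list.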

## Why This Matters

Most embedding improvements require at least one of:

- A larger model (more compute, more latency)
- Human-labeled training data (expensive, slow)
- A teacher model for distillation (adds complexity)

This model needs none of these. The training signal comes from running the existing hybrid retrieval pipeline and observing where its two components (vector search and keyword search) disagree. **The system improves itself.**
| 77 |
+
|
| 78 |
+
## The Training Signal: Hybrid Retrieval Disagreement
|
| 79 |
+
|
| 80 |
+
We discovered that **82% of queries produce disagreement** between vector and keyword search in the top-5 results. These disagreements fall into two categories:
|
| 81 |
+
|
| 82 |
+
- **Vector blind spots** (51%): chunks the vector search ranks high but keyword search ignores. These are semantically similar but not actually relevant.
|
| 83 |
+
- **Keyword blind spots** (49%): chunks keyword search finds but vector search misses. These contain relevant terms but the embedding doesn't recognize their relevance.
|
| 84 |
+
|
| 85 |
+
Fine-tuning on these disagreement pairs teaches the model to fix both types of blind spots.
|
| 86 |
+
|
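In code, the mining step reduces to a set difference over the two top-k lists. A minimal sketch (the helper name and return shape are illustrative, not the actual vstash implementation):

```python
def disagreement_pairs(vector_top, keyword_top, k=5):
    """Split two top-k rankings into the two kinds of blind spots.

    vector_top / keyword_top: doc ids in rank order from each retriever.
    Vector-only hits are candidate hard negatives (semantically close,
    possibly not relevant); keyword-only hits are candidate positives
    the embedding missed.
    """
    v, kw = set(vector_top[:k]), set(keyword_top[:k])
    return {
        "vector_blind_spots": [d for d in vector_top[:k] if d not in kw],
        "keyword_blind_spots": [d for d in keyword_top[:k] if d not in v],
    }

pairs = disagreement_pairs(
    vector_top=["doc3", "doc7", "doc1", "doc9", "doc4"],
    keyword_top=["doc1", "doc2", "doc3", "doc8", "doc5"],
)
# pairs["vector_blind_spots"]  -> ["doc7", "doc9", "doc4"]
# pairs["keyword_blind_spots"] -> ["doc2", "doc8", "doc5"]
```

A query with empty lists on both sides is one where the retrievers agree, so it contributes no training triple.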

## Training Details

| Parameter | Value |
|-----------|-------|
| Base model | [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) |
| Parameters | 33M (unchanged) |
| Embedding dimension | 384 (unchanged) |
| Loss function | MultipleNegativesRankingLoss with explicit hard negatives |
| Training data | 76K (query, positive, hard_negative) triples |
| Data source | RRF signal disagreement on SciFact, NFCorpus, FiQA |
| Human labels | **Zero** |
| Epochs | 2 |
| Learning rate | 3e-6 |
| Batch size | 64 |
| Training time | ~30 min on T4 GPU |
| Training cost | $0 (Colab free tier) |
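The "RRF signal" above refers to Reciprocal Rank Fusion, the standard way hybrid pipelines merge a vector ranking with a keyword ranking. A minimal sketch (the constant `k=60` is the conventional default from the RRF literature, an assumption here rather than a documented vstash setting):

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: fuse several rankings (lists of doc ids,
    best first) by summing 1 / (k + rank) per document, then re-sorting."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf([["a", "b", "c"],   # e.g. vector ranking
             ["b", "c", "d"]])  # e.g. keyword (BM25) ranking
# fused -> ["b", "c", "a", "d"]: docs found by both retrievers rise to the top
```

Documents that appear in only one list ("a", "d" above) end up below documents both retrievers agree on, which is exactly the asymmetry the disagreement mining exploits.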

### Why MNRL, not TripletLoss?

We tested TripletLoss first. It **destroyed the model** (-84% NDCG after 3 epochs). TripletLoss pushes each individual negative away by brute force, distorting the embedding space. MNRL instead adjusts relationships across 64 documents simultaneously per batch, preserving the model's general knowledge while learning from disagreements.

| Loss Function | NDCG@10 on SciFact | Result |
|---|:-:|---|
| TripletLoss (3 epochs, lr=2e-5) | 0.055 | -84% (destroyed) |
| TripletLoss (1 epoch, lr=1e-6) | 0.347 | -0.03% (no effect) |
| MNRL batch-only negatives (v1) | 0.683 | +5.6% |
| **MNRL + explicit hard negatives (this model)** | **0.695** | **+7.4%** |
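For intuition, here is the MNRL objective with explicit hard negatives written out in NumPy. This is an illustrative sketch only; actual training uses `sentence_transformers.losses.MultipleNegativesRankingLoss`, and the scale factor of 20 is that loss's default similarity scaling:

```python
import numpy as np

def mnrl_loss(q, p, hn, scale=20.0):
    """MNRL with in-batch + explicit hard negatives.

    q, p, hn: (batch, dim) L2-normalized embeddings of queries, positives,
    and hard negatives. Each query's target is its own positive; every other
    positive AND every hard negative in the batch serves as a negative.
    Returns the mean cross-entropy over the batch.
    """
    candidates = np.concatenate([p, hn], axis=0)          # (2*batch, dim)
    scores = scale * (q @ candidates.T)                   # (batch, 2*batch)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                   # target col i for query i
```

Because one batch of 64 triples yields 127 negatives per query, each gradient step reshapes many pairwise relationships at once instead of shoving a single negative away, which is why it degrades the base model far less than TripletLoss did.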

## Usage

### With sentence-transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Stffens/bge-small-rrf-v2")
# Normalize so the dot product below is cosine similarity
embeddings = model.encode(["your query", "your document"], normalize_embeddings=True)
similarity = embeddings[0] @ embeddings[1]
```

### With vstash (hybrid retrieval system)

```bash
pip install vstash
vstash reindex --model Stffens/bge-small-rrf-v2
vstash search "your query"
```

### Train your own version on your data

```bash
pip install vstash sentence-transformers torch
vstash retrain   # generates disagreement pairs from YOUR corpus and fine-tunes
vstash reindex --model ~/.vstash/models/retrained
```

## Reproduce From Scratch

```bash
git clone https://github.com/stffns/vstash
cd vstash
pip install -e . sentence-transformers torch

# Generate disagreement triples
python -m experiments.rrf_training_pairs --datasets scifact nfcorpus fiqa

# Train (GPU recommended)
python -m experiments.finetune_rrf --epochs 2 --lr 3e-6 --batch-size 64

# Evaluate
python -m experiments.finetune_rrf --evaluate-only
```

## Limitations

- **ArguAna regression**: queries with 200+ words show -8.4% vs ColBERTv2. Long argumentative queries produce only 1.1% signal disagreement, leaving no training signal.
- **FiQA neutral**: financial queries show +0.1% vs base but -7.8% vs ColBERTv2. The disagreement signal exists (86.7%) but doesn't translate into NDCG gains on this dataset.
- **English only**: inherited from BGE-small-en-v1.5.
- **Not tested beyond BEIR**: performance on domain-specific corpora may vary.

## Citation

```bibtex
@software{vstash2026,
  author = {Steffens, Jayson},
  title  = {vstash: Local-First Hybrid Retrieval with Adaptive Fusion for LLM Agents},
  url    = {https://github.com/stffns/vstash},
  year   = {2026}
}
```

## Related

- [vstash paper](https://github.com/stffns/vstash/blob/main/paper/vstash-paper.md) (Section 8.10: Self-Supervised Embedding Refinement)
- [vstash GitHub](https://github.com/stffns/vstash)
- [Base model: BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)