---
title: Embedding Bench
emoji: πŸ“
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: "1.56.0"
app_file: app.py
pinned: false
license: mit
---

# embedding-bench

Compare text embedding models on quality, speed, and memory. Includes a Streamlit web UI and a CLI.

## Features

- **40+ pre-configured models**: sentence-transformers, BGE, E5, GTE, Nomic, Jina, Arctic, and more
- **4 backends**: sbert (PyTorch), fastembed (ONNX), gguf (llama-cpp), libembedding
- **7 built-in datasets**: STS Benchmark, Natural Questions, MS MARCO, SQuAD, TriviaQA, GooAQ, HotpotQA
- **Custom datasets**: upload your own CSV/TSV or load any HuggingFace dataset
- **Custom models**: add any HuggingFace embedding model from the UI
- **11 retrieval metrics**: MRR, MAP@k, NDCG@k, Precision@k, Recall@k (all configurable)
- **LLM as a Judge**: use OpenAI or Anthropic to rate retrieval relevance
- **Interactive charts**: Plotly-powered, with hover, zoom, and PNG export

## Setup

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

## Web UI

```bash
streamlit run app.py
```

The sidebar has three sections:

1. **Models**: select from the registry or add a custom HuggingFace model
2. **Datasets**: pick built-in presets, upload a CSV/TSV, or add any HuggingFace dataset
3. **Evaluation**: configure metrics, speed/memory benchmarks, LLM judge, and max pairs

### Custom datasets

You can add datasets in two ways from the sidebar:

- **Upload file**: CSV or TSV (max 50 MB, 50k rows) with a query column and a passage column. Optionally include a numeric score column to evaluate with Spearman correlation; without one, retrieval metrics (MRR, Recall@k, etc.) are used.
- **HuggingFace Hub**: provide the dataset ID (e.g. `mteb/stsbenchmark-sts`), config, split, and column names. The dataset is validated when you add it.
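
As a rough illustration of the upload format, the sketch below writes a small query/passage CSV and checks it against the limits stated above. The column names (`query`, `passage`, `score`), the helper names, and the reading of "50k rows" as data rows excluding the header are assumptions made for this example, not the app's actual validation code.

```python
import csv
import io

MAX_BYTES = 50 * 1024 * 1024  # 50 MB upload limit stated above
MAX_ROWS = 50_000             # 50k row limit (assumed to exclude the header)


def build_pairs_csv(rows, with_score=False):
    """Serialize (query, passage[, score]) tuples to CSV text (hypothetical helper)."""
    header = ["query", "passage"] + (["score"] if with_score else [])
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def check_upload(text):
    """Return the parsed data rows, or raise ValueError if a limit is exceeded."""
    if len(text.encode("utf-8")) > MAX_BYTES:
        raise ValueError("file exceeds 50 MB")
    rows = list(csv.DictReader(io.StringIO(text)))
    if len(rows) > MAX_ROWS:
        raise ValueError("file exceeds 50k rows")
    return rows


text = build_pairs_csv(
    [("what is an embedding?", "An embedding maps text to a vector.", 4.5)],
    with_score=True,
)
rows = check_upload(text)
```

Leaving out the score column (`with_score=False`) produces the pairs-only layout that is evaluated with retrieval metrics.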

### LLM as a Judge

Enable this in the Evaluation section and provide your OpenAI or Anthropic API key. For each sampled query, the LLM rates the top-5 retrieved passages for relevance on a 1–5 scale. The report includes judge_avg@1, judge_avg@5, and judge_nDCG@5.
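
To make the three reported numbers concrete, here is a minimal sketch of one way to aggregate per-query ratings. It assumes linear gain with a log2 rank discount for nDCG, and an ideal ordering built from the rated top-k only; `evals/llm_judge.py` may use a different formulation, and the function names are illustrative.

```python
import math


def dcg(ratings):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(ratings))


def judge_metrics(ratings_per_query, k=5):
    """Aggregate judge ratings (one list per query, in retrieval rank order)."""
    n = len(ratings_per_query)
    avg_at_1 = sum(r[0] for r in ratings_per_query) / n
    avg_at_k = sum(sum(r[:k]) / len(r[:k]) for r in ratings_per_query) / n
    ndcg_total = 0.0
    for r in ratings_per_query:
        top = r[:k]
        ideal = dcg(sorted(top, reverse=True))  # best possible order of the same ratings
        ndcg_total += dcg(top) / ideal if ideal > 0 else 0.0
    return {
        "judge_avg@1": avg_at_1,
        f"judge_avg@{k}": avg_at_k,
        f"judge_nDCG@{k}": ndcg_total / n,
    }


# Two queries: one ranked perfectly, one ranked in reverse order.
scores = judge_metrics([[5, 4, 3, 2, 1], [1, 2, 3, 4, 5]])
```

With this definition a perfectly ordered result list scores an nDCG of 1.0, and the same ratings in reverse order score strictly less.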

### Metrics

| Dimension | Metrics | Method |
|-----------|---------|--------|
| Quality (scored) | Spearman | Cosine similarity vs gold scores |
| Quality (pairs) | MRR, MAP@5/10, NDCG@5/10, Precision@1/5/10, Recall@1/5/10 | Retrieval ranking of positive passages |
| LLM Judge | Avg@1, Avg@5, nDCG@5 | LLM relevance ratings on retrieved passages |
| Speed | Median encode time, sent/s | Wall-clock over N runs with warmup |
| Memory | Peak RSS delta (MB) | Isolated subprocess via `psutil` |
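
For the pairs-based quality row, the sketch below shows how MRR and Recall@k can be derived from cosine-similarity rankings. It assumes exactly one positive passage per query, stored at the same index, and uses plain Python for clarity; the function names are illustrative and the implementation in `evals/quality.py` may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def positive_ranks(query_embs, passage_embs):
    """1-based rank of passage i among all passages for query i."""
    ranks = []
    for i, q in enumerate(query_embs):
        sims = [cosine(q, p) for p in passage_embs]
        # Rank = 1 + number of passages scoring strictly higher than the positive.
        ranks.append(1 + sum(s > sims[i] for s in sims))
    return ranks


def mrr(ranks):
    """Mean reciprocal rank of the positive passage."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def recall_at_k(ranks, k):
    """Fraction of queries whose positive passage appears in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


# Toy 2-D embeddings where each query's positive is the *other* passage's twin.
queries = [[1.0, 0.0], [0.0, 1.0]]
passages = [[0.0, 1.0], [1.0, 0.0]]
ranks = positive_ranks(queries, passages)
```

In this toy case both positives land at rank 2, so MRR is 0.5, Recall@1 is 0, and Recall@2 is 1.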

## CLI

```bash
# Full benchmark (quality + speed + memory)
python bench.py

# Specific models
python bench.py --models mpnet bge-small

# Compare backends
python bench.py --models bge-small bge-small-fe

# Skip expensive evals
python bench.py --skip-quality
python bench.py --skip-memory

# Multiple datasets with pair limit
python bench.py --models mpnet bge-small \
  --datasets sts natural-questions squad \
  --max-pairs 1000 --skip-speed --skip-memory

# Custom HF dataset
python bench.py --dataset my-org/my-pairs \
  --query-col query --passage-col passage --score-col none

# Export
python bench.py --csv results.csv --charts ./results
```

### Built-in dataset presets

| Preset | HF Dataset | Type |
|--------|-----------|------|
| `sts` | `mteb/stsbenchmark-sts` | Scored (Spearman) |
| `natural-questions` | `sentence-transformers/natural-questions` | Retrieval |
| `msmarco` | `sentence-transformers/msmarco-bm25` | Retrieval |
| `squad` | `sentence-transformers/squad` | Retrieval |
| `trivia-qa` | `sentence-transformers/trivia-qa` | Retrieval |
| `gooaq` | `sentence-transformers/gooaq` | Retrieval |
| `hotpotqa` | `sentence-transformers/hotpotqa` | Retrieval |
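
The `sts` preset is the only scored one, so it is evaluated with Spearman correlation rather than retrieval ranking. A minimal dependency-free sketch of that statistic is shown below, under the assumption that the inputs are per-pair cosine similarities and gold scores; the helper names are illustrative and the project may rely on a library routine instead.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


cosine_sims = [0.12, 0.55, 0.91, 0.40]  # made-up model similarities
gold_scores = [0.5, 3.0, 5.0, 3.5]      # made-up gold scores on a 0-5 scale
rho = spearman(cosine_sims, gold_scores)
```

Because Spearman depends only on rank order, dividing the gold scores by a positive constant such as 5.0 leaves the coefficient unchanged.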

### CLI flags

```
--models            Models to benchmark (default: all)
--corpus-size       Sentences for speed/memory tests (default: 1000)
--batch-size        Encoding batch size (default: 64)
--num-runs          Speed benchmark runs (default: 3)
--skip-quality      Skip quality evaluation
--skip-speed        Skip speed measurement
--skip-memory       Skip memory measurement
--datasets          Dataset presets (default: sts)
--max-pairs         Limit pairs per dataset
--dataset           Custom HF dataset (overrides --datasets)
--config            Dataset config/subset name (e.g. 'triplet')
--split             Dataset split (default: test)
--query-col         Query column name (default: sentence1)
--passage-col       Passage column name (default: sentence2)
--score-col         Score column (default: score, 'none' for pairs)
--score-scale       Score normalization divisor (default: 5.0)
--csv               Export results to CSV
--charts            Save charts to directory
```
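
To show how `--corpus-size`, `--batch-size`, and `--num-runs` relate to the reported speed numbers, here is a hedged sketch of a warmup-then-median timing loop. The `encode` callable, the single warmup pass, and the function name are assumptions for illustration; `evals/speed.py` is the authoritative implementation.

```python
import statistics
import time


def time_encode(encode, sentences, batch_size=64, num_runs=3, warmup=1):
    """Return (median_seconds, sentences_per_second) for a batch encoder."""

    def one_pass():
        for i in range(0, len(sentences), batch_size):
            encode(sentences[i:i + batch_size])

    # Warmup passes are not timed, so one-off setup costs do not skew the result.
    for _ in range(warmup):
        one_pass()

    timings = []
    for _ in range(num_runs):
        start = time.perf_counter()
        one_pass()
        timings.append(time.perf_counter() - start)

    median = statistics.median(timings)
    throughput = len(sentences) / median if median > 0 else float("inf")
    return median, throughput


# Stand-in encoder that only counts calls; a real model's encode goes here.
calls = []
corpus = [f"sentence {i}" for i in range(10)]
median_s, sent_per_s = time_encode(lambda batch: calls.append(len(batch)), corpus, batch_size=4)
```

Taking the median rather than the mean keeps a single slow run, for example one hit by a background process, from dominating the result.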

## Adding a model

From the web UI, click **Add Custom Model** in the sidebar and provide a display name and a HuggingFace model ID.

Or edit `models.py` directly:

```python
"e5-small": ModelConfig(
    name="e5-small-v2",
    model_id="intfloat/e5-small-v2",
),
```

## Project structure

```
embedding-bench/
├── app.py               # Streamlit web UI
├── bench.py             # CLI entry point
├── models.py            # Model registry (40+ models)
├── wrapper.py           # Backend wrappers (sbert, fastembed, gguf, libembedding)
├── corpus.py            # Sentence corpus builder
├── dataset_config.py    # Dataset presets and configuration
├── report.py            # Table formatting, CSV export, charts (CLI)
├── evals/
│   ├── quality.py       # Quality evaluation (Spearman + retrieval metrics)
│   ├── speed.py         # Latency measurement
│   ├── memory.py        # Memory measurement
│   └── llm_judge.py     # LLM-as-a-Judge evaluation
└── requirements.txt
```