---
title: Embedding Bench
emoji: π
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: "1.56.0"
app_file: app.py
pinned: false
license: mit
---
# embedding-bench
Compare text embedding models on quality, speed, and memory. Includes a Streamlit web UI and a CLI.
## Features
- **40+ pre-configured models**: sentence-transformers, BGE, E5, GTE, Nomic, Jina, Arctic, and more
- **4 backends**: sbert (PyTorch), fastembed (ONNX), gguf (llama-cpp), libembedding
- **7 built-in datasets**: STS Benchmark, Natural Questions, MS MARCO, SQuAD, TriviaQA, GooAQ, HotpotQA
- **Custom datasets**: upload your own CSV/TSV or load any HuggingFace dataset
- **Custom models**: add any HuggingFace embedding model from the UI
- **11 retrieval metrics**: MRR, MAP@k, NDCG@k, Precision@k, Recall@k (all configurable)
- **LLM as a Judge**: use OpenAI or Anthropic to rate retrieval relevance
- **Interactive charts**: Plotly-powered, with hover, zoom, and PNG export
## Setup
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
## Web UI
```bash
streamlit run app.py
```
The sidebar has three sections:
1. **Models**: select from the registry or add a custom HuggingFace model
2. **Datasets**: pick built-in presets, upload a CSV/TSV, or add any HuggingFace dataset
3. **Evaluation**: configure metrics, speed/memory benchmarks, LLM judge, and max pairs
### Custom datasets
You can add datasets from the sidebar in two ways:
- **Upload file**: a CSV or TSV file (max 50 MB, 50k rows) with a query column and a passage column. If you include a numeric score column, quality is reported as Spearman correlation; otherwise retrieval metrics (MRR, Recall@k, etc.) are used.
- **HuggingFace Hub**: provide the dataset ID (e.g. `mteb/stsbenchmark-sts`), config, split, and column names. The dataset is validated when you add it.
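The upload rule above (score column present means Spearman, absent means retrieval metrics) can be sketched with the standard library. This is a minimal illustration, not the app's actual loader: the function name `load_pairs` and the default column names `query`, `passage`, and `score` are assumptions made for the example.

```python
import csv
import io


def load_pairs(text, query_col="query", passage_col="passage", score_col="score"):
    """Parse CSV/TSV text and report which evaluation mode would apply.

    Hypothetical helper: the column names are illustrative defaults,
    not the names the app requires.
    """
    if not text.strip():
        raise ValueError("file is empty")
    # Guess the delimiter from the header line: a tab means TSV, otherwise CSV.
    header = text.splitlines()[0]
    delimiter = "\t" if "\t" in header else ","
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    fieldnames = reader.fieldnames or []
    missing = {query_col, passage_col} - set(fieldnames)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    has_score = score_col in fieldnames
    rows = []
    for row in reader:
        item = {"query": row[query_col], "passage": row[passage_col]}
        if has_score:
            # A numeric score column switches evaluation to Spearman correlation.
            item["score"] = float(row[score_col])
        rows.append(item)
    mode = "scored" if has_score else "retrieval"
    return rows, mode
```

A file with only the two text columns comes back with `mode == "retrieval"`; adding a `score` column flips it to `"scored"`.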
### LLM as a Judge
Enable it in the Evaluation section and provide your OpenAI or Anthropic API key. For each sampled query, the LLM rates the top-5 retrieved passages for relevance on a 1-5 scale. The report includes `judge_avg@1`, `judge_avg@5`, and `judge_nDCG@5`.
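To make the three judge scores concrete, here is one plausible way to aggregate per-query ratings. It is a sketch under stated assumptions, not the code in `evals/llm_judge.py`: ratings are used directly as gains, the discount is log2, and the ideal ordering is the same ratings sorted in descending order. The function names are hypothetical.

```python
import math


def avg_at_k(ratings, k):
    """Mean rating of the top-k retrieved passages for one query."""
    top = ratings[:k]
    return sum(top) / len(top)


def ndcg_at_k(ratings, k):
    """nDCG over the top-k ratings (assumption: linear gain, log2 discount)."""
    def dcg(values):
        return sum(v / math.log2(i + 2) for i, v in enumerate(values[:k]))

    # Assumed ideal ordering: the same ratings sorted best-first.
    ideal = dcg(sorted(ratings, reverse=True))
    return dcg(ratings) / ideal if ideal > 0 else 0.0


def summarize(per_query_ratings):
    """Average each per-query score over all sampled queries."""
    n = len(per_query_ratings)
    return {
        "judge_avg@1": sum(avg_at_k(r, 1) for r in per_query_ratings) / n,
        "judge_avg@5": sum(avg_at_k(r, 5) for r in per_query_ratings) / n,
        "judge_nDCG@5": sum(ndcg_at_k(r, 5) for r in per_query_ratings) / n,
    }


# Each inner list holds the LLM's 1-5 ratings for one query, in retrieval order.
summary = summarize([[5, 3, 1, 1, 1], [2, 4, 4, 1, 1]])
```

Under these assumptions, a model that places its most relevant passages first scores an nDCG of 1.0 for that query, even if the absolute ratings are low; the average scores capture absolute relevance.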
### Metrics
| Dimension | Metrics | Method |
|-----------|---------|--------|
| Quality (scored) | Spearman | Cosine similarity vs gold scores |
| Quality (pairs) | MRR, MAP@5/10, NDCG@5/10, Precision@1/5/10, Recall@1/5/10 | Retrieval ranking of positive passages |
| LLM Judge | Avg@1, Avg@5, nDCG@5 | LLM relevance ratings on retrieved passages |
| Speed | Median encode time, sent/s | Wall-clock over N runs with warmup |
| Memory | Peak RSS delta (MB) | Isolated subprocess via `psutil` |
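As an illustration of the "Quality (pairs)" row, the sketch below ranks passages by cosine similarity and derives MRR, Recall@k, and NDCG@k from the rank of the positive passage. It assumes exactly one positive passage per query and uses plain Python lists as embeddings; the function names are hypothetical and the real logic in `evals/quality.py` may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_of_positive(query_vec, passage_vecs, positive_idx):
    """1-based rank of the positive passage when sorted by cosine similarity."""
    scores = [cosine(query_vec, p) for p in passage_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order.index(positive_idx) + 1


def retrieval_metrics(ranks, k):
    """Aggregate metrics, assuming a single positive passage per query."""
    n = len(ranks)
    return {
        "mrr": sum(1.0 / r for r in ranks) / n,
        f"recall@{k}": sum(r <= k for r in ranks) / n,
        # With one positive, the ideal DCG is 1, so NDCG reduces to 1 / log2(rank + 1).
        f"ndcg@{k}": sum(1.0 / math.log2(r + 1) if r <= k else 0.0 for r in ranks) / n,
    }


# Toy 2-D "embeddings": three passages, two queries with known positives.
passages = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranks = [
    rank_of_positive([1.0, 0.1], passages, positive_idx=0),
    rank_of_positive([0.2, 1.0], passages, positive_idx=2),
]
metrics = retrieval_metrics(ranks, k=5)
```

MAP@k and Precision@k follow the same pattern once the rank is known; with a single positive per query, average precision at a cutoff is simply the reciprocal rank when the positive falls inside the cutoff.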
## CLI
```bash
# Full benchmark (quality + speed + memory)
python bench.py

# Specific models
python bench.py --models mpnet bge-small

# Compare backends
python bench.py --models bge-small bge-small-fe

# Skip expensive evals
python bench.py --skip-quality
python bench.py --skip-memory

# Multiple datasets with pair limit
python bench.py --models mpnet bge-small \
    --datasets sts natural-questions squad \
    --max-pairs 1000 --skip-speed --skip-memory

# Custom HF dataset
python bench.py --dataset my-org/my-pairs \
    --query-col query --passage-col passage --score-col none

# Export
python bench.py --csv results.csv --charts ./results
```
### Built-in dataset presets
| Preset | HF Dataset | Type |
|--------|-----------|------|
| `sts` | `mteb/stsbenchmark-sts` | Scored (Spearman) |
| `natural-questions` | `sentence-transformers/natural-questions` | Retrieval |
| `msmarco` | `sentence-transformers/msmarco-bm25` | Retrieval |
| `squad` | `sentence-transformers/squad` | Retrieval |
| `trivia-qa` | `sentence-transformers/trivia-qa` | Retrieval |
| `gooaq` | `sentence-transformers/gooaq` | Retrieval |
| `hotpotqa` | `sentence-transformers/hotpotqa` | Retrieval |
### CLI flags
```
--models         Models to benchmark (default: all)
--corpus-size    Sentences for speed/memory tests (default: 1000)
--batch-size     Encoding batch size (default: 64)
--num-runs       Speed benchmark runs (default: 3)
--skip-quality   Skip quality evaluation
--skip-speed     Skip speed measurement
--skip-memory    Skip memory measurement
--datasets       Dataset presets (default: sts)
--max-pairs      Limit pairs per dataset
--dataset        Custom HF dataset (overrides --datasets)
--config         Dataset config/subset name (e.g. 'triplet')
--split          Dataset split (default: test)
--query-col      Query column name (default: sentence1)
--passage-col    Passage column name (default: sentence2)
--score-col      Score column (default: score, 'none' for pairs)
--score-scale    Score normalization divisor (default: 5.0)
--csv            Export results to CSV
--charts         Save charts to directory
```
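The `--corpus-size`, `--batch-size`, and `--num-runs` flags correspond to a timing loop of roughly the following shape: encode the corpus in batches, discard warmup passes, and report the median wall-clock time. This is a hedged sketch rather than the contents of `evals/speed.py`; `benchmark_encode`, the `warmup` parameter, and the stand-in encoder are assumptions.

```python
import statistics
import time


def benchmark_encode(encode, sentences, batch_size=64, num_runs=3, warmup=1):
    """Time `encode` over the corpus; return median seconds and sentences/s.

    `encode` is any callable that accepts a list of strings (hypothetical
    stand-in for a backend wrapper's encode method).
    """
    def run_once():
        for start in range(0, len(sentences), batch_size):
            encode(sentences[start:start + batch_size])

    # Warmup passes are not timed, so one-off costs do not skew the median.
    for _ in range(warmup):
        run_once()

    timings = []
    for _ in range(num_runs):
        t0 = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - t0)

    median_s = statistics.median(timings)
    throughput = len(sentences) / median_s if median_s > 0 else float("inf")
    return {"median_s": median_s, "sent_per_s": throughput}


# Stand-in encoder that only records batch sizes, so the sketch runs without a model.
seen_batches = []
corpus = [f"sentence {i}" for i in range(130)]
result = benchmark_encode(lambda batch: seen_batches.append(len(batch)), corpus)
```

Using the median rather than the mean keeps a single slow run (for example, one interrupted by garbage collection) from dominating the reported time.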
## Adding a model
From the web UI, click **Add Custom Model** in the sidebar and provide a display name and a HuggingFace model ID.
Or edit `models.py` directly:
```python
"e5-small": ModelConfig(
    name="e5-small-v2",
    model_id="intfloat/e5-small-v2",
),
```
## Project structure
```
embedding-bench/
├── app.py              # Streamlit web UI
├── bench.py            # CLI entry point
├── models.py           # Model registry (40+ models)
├── wrapper.py          # Backend wrappers (sbert, fastembed, gguf, libembedding)
├── corpus.py           # Sentence corpus builder
├── dataset_config.py   # Dataset presets and configuration
├── report.py           # Table formatting, CSV export, charts (CLI)
├── evals/
│   ├── quality.py      # Quality evaluation (Spearman + retrieval metrics)
│   ├── speed.py        # Latency measurement
│   ├── memory.py       # Memory measurement
│   └── llm_judge.py    # LLM-as-a-Judge evaluation
└── requirements.txt
```