---
title: Embedding Bench
emoji: π
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: "1.56.0"
app_file: app.py
pinned: false
license: mit
---
# embedding-bench

Compare text embedding models on quality, speed, and memory. Includes a Streamlit web UI and a CLI.

## Features

- **40+ pre-configured models**: sentence-transformers, BGE, E5, GTE, Nomic, Jina, Arctic, and more
- **4 backends**: sbert (PyTorch), fastembed (ONNX), gguf (llama-cpp), libembedding
- **7 built-in datasets**: STS Benchmark, Natural Questions, MS MARCO, SQuAD, TriviaQA, GooAQ, HotpotQA
- **Custom datasets**: upload your own CSV/TSV or load any HuggingFace dataset
- **Custom models**: add any HuggingFace embedding model from the UI
- **11 retrieval metrics**: MRR, MAP@k, NDCG@k, Precision@k, Recall@k (all configurable)
- **LLM as a Judge**: use OpenAI or Anthropic to rate retrieval relevance
- **Interactive charts**: Plotly-powered, with hover, zoom, and PNG export
## Setup

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

## Web UI

```bash
streamlit run app.py
```

The sidebar has three sections:

1. **Models**: select from the registry or add a custom HuggingFace model
2. **Datasets**: pick built-in presets, upload a CSV/TSV, or add any HuggingFace dataset
3. **Evaluation**: configure metrics, speed/memory benchmarks, LLM judge, and max pairs
### Custom datasets

You can add datasets from the sidebar in two ways:

- **Upload file**: CSV or TSV (max 50 MB, 50k rows) with a query column and a passage column. Optionally include a numeric score column to evaluate with Spearman correlation; without one, retrieval metrics (MRR, Recall@k, etc.) are used.
- **HuggingFace Hub**: provide the dataset ID (e.g. `mteb/stsbenchmark-sts`), config, split, and column names. The dataset is validated when you add it.
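
The upload rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `load_pairs` and `evaluation_mode` are hypothetical helpers, not the app's actual loader, the default column names are placeholders, and the 50 MB size check is left out.

```python
import csv
import io

MAX_ROWS = 50_000  # row cap stated above; the 50 MB size check is omitted


def load_pairs(text, query_col="query", passage_col="passage", score_col=None):
    """Parse CSV/TSV text into (query, passage, score) tuples.

    Hypothetical helper for illustration, not the app's real loader.
    """
    # Guess the delimiter from the header line: a tab means TSV.
    header = text.splitlines()[0] if text else ""
    delimiter = "\t" if "\t" in header else ","
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)

    fields = reader.fieldnames or []
    required = [query_col, passage_col] + ([score_col] if score_col else [])
    for col in required:
        if col not in fields:
            raise ValueError(f"missing required column: {col!r}")

    rows = []
    for i, row in enumerate(reader):
        if i >= MAX_ROWS:
            raise ValueError(f"file exceeds {MAX_ROWS} rows")
        # float() raises ValueError if the score column is not numeric.
        score = float(row[score_col]) if score_col else None
        rows.append((row[query_col], row[passage_col], score))
    return rows


def evaluation_mode(rows):
    """Scored rows are evaluated with Spearman, plain pairs with retrieval."""
    scored = bool(rows) and all(score is not None for _, _, score in rows)
    return "spearman" if scored else "retrieval"
```

For example, `load_pairs(tsv_text)` on a two-column TSV yields tuples whose score is `None`, so `evaluation_mode` returns `"retrieval"`.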
### LLM as a Judge

Enable it in the **Evaluation** section and provide an OpenAI or Anthropic API key. For each sampled query, the LLM rates the top 5 retrieved passages for relevance on a scale of 1 to 5. The report includes `judge_avg@1`, `judge_avg@5`, and `judge_nDCG@5`.
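
As a rough sketch of how three such numbers could be derived from the ratings, assume each query comes with its ratings in retrieval order. The exact formulas the project uses are not stated here, so the linear gain, the log2 discount, and the choice to build the ideal ordering from the same five ratings are all assumptions, and `judge_summary` is a hypothetical name.

```python
import math


def dcg(ratings):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(r / math.log2(rank + 2) for rank, r in enumerate(ratings))


def judge_summary(ratings_per_query, k=5):
    """Aggregate per-query LLM ratings (1 to 5, in retrieval order).

    Hypothetical helper: assumes every query has at least one rating.
    """
    if not ratings_per_query:
        raise ValueError("need at least one rated query")

    avg_at_1, avg_at_k, ndcg_at_k = [], [], []
    for ratings in ratings_per_query:
        top = ratings[:k]
        avg_at_1.append(top[0])              # rating of the first result
        avg_at_k.append(sum(top) / len(top))  # mean rating over the top k
        # Ideal ordering: the same ratings sorted from best to worst.
        ideal = dcg(sorted(top, reverse=True))
        ndcg_at_k.append(dcg(top) / ideal if ideal > 0 else 0.0)

    n = len(ratings_per_query)
    return {
        "judge_avg@1": sum(avg_at_1) / n,
        f"judge_avg@{k}": sum(avg_at_k) / n,
        f"judge_nDCG@{k}": sum(ndcg_at_k) / n,
    }
```

Under these assumptions, a query whose best-rated passages already appear first scores an nDCG of 1.0, while the same ratings in reverse order score lower.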
### Metrics

| Dimension | Metrics | Method |
|-----------|---------|--------|
| Quality (scored) | Spearman | Cosine similarity vs gold scores |
| Quality (pairs) | MRR, MAP@5/10, NDCG@5/10, Precision@1/5/10, Recall@1/5/10 | Retrieval ranking of positive passages |
| LLM Judge | Avg@1, Avg@5, nDCG@5 | LLM relevance ratings on retrieved passages |
| Speed | Median encode time, sent/s | Wall-clock over N runs with warmup |
| Memory | Peak RSS delta (MB) | Isolated subprocess via `psutil` |
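
To make the "Quality (pairs)" row concrete, here is a minimal sketch of ranking-based metrics. It assumes a square similarity matrix in which passage `i` is the only positive for query `i`; that single-positive setup and the function names are assumptions for illustration, not the code in `evals/quality.py`.

```python
import math


def positive_ranks(sims):
    """Return the 1-based rank of passage i for query i, given sims[i][j]."""
    ranks = []
    for i, row in enumerate(sims):
        # Every passage scoring strictly higher pushes the positive down.
        ranks.append(1 + sum(1 for s in row if s > row[i]))
    return ranks


def retrieval_metrics(ranks, k=5):
    """Retrieval metrics when each query has exactly one positive passage."""
    if not ranks:
        raise ValueError("need at least one query")
    n = len(ranks)
    hits = [r for r in ranks if r <= k]  # positives found in the top k
    return {
        "mrr": sum(1.0 / r for r in ranks) / n,
        f"recall@{k}": len(hits) / n,
        f"precision@{k}": len(hits) / (k * n),
        # With a single positive, average precision reduces to 1 / rank.
        f"map@{k}": sum(1.0 / r for r in hits) / n,
        # The ideal DCG is 1, so NDCG reduces to 1 / log2(rank + 1).
        f"ndcg@{k}": sum(1.0 / math.log2(r + 1) for r in hits) / n,
    }
```

With several positives per query, MAP and NDCG no longer reduce to these closed forms and would need the full per-rank computation.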
## CLI

```bash
# Full benchmark (quality + speed + memory)
python bench.py

# Specific models
python bench.py --models mpnet bge-small

# Compare backends
python bench.py --models bge-small bge-small-fe

# Skip expensive evals
python bench.py --skip-quality
python bench.py --skip-memory

# Multiple datasets with pair limit
python bench.py --models mpnet bge-small \
  --datasets sts natural-questions squad \
  --max-pairs 1000 --skip-speed --skip-memory

# Custom HF dataset
python bench.py --dataset my-org/my-pairs \
  --query-col query --passage-col passage --score-col none

# Export
python bench.py --csv results.csv --charts ./results
```
### Built-in dataset presets

| Preset | HF Dataset | Type |
|--------|-----------|------|
| `sts` | `mteb/stsbenchmark-sts` | Scored (Spearman) |
| `natural-questions` | `sentence-transformers/natural-questions` | Retrieval |
| `msmarco` | `sentence-transformers/msmarco-bm25` | Retrieval |
| `squad` | `sentence-transformers/squad` | Retrieval |
| `trivia-qa` | `sentence-transformers/trivia-qa` | Retrieval |
| `gooaq` | `sentence-transformers/gooaq` | Retrieval |
| `hotpotqa` | `sentence-transformers/hotpotqa` | Retrieval |
### CLI flags

```
--models         Models to benchmark (default: all)
--corpus-size    Sentences for speed/memory tests (default: 1000)
--batch-size     Encoding batch size (default: 64)
--num-runs       Speed benchmark runs (default: 3)
--skip-quality   Skip quality evaluation
--skip-speed     Skip speed measurement
--skip-memory    Skip memory measurement
--datasets       Dataset presets (default: sts)
--max-pairs      Limit pairs per dataset
--dataset        Custom HF dataset (overrides --datasets)
--config         Dataset config/subset name (e.g. 'triplet')
--split          Dataset split (default: test)
--query-col      Query column name (default: sentence1)
--passage-col    Passage column name (default: sentence2)
--score-col      Score column (default: score, 'none' for pairs)
--score-scale    Score normalization divisor (default: 5.0)
--csv            Export results to CSV
--charts         Save charts to directory
```
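
The speed-related flags (`--corpus-size`, `--batch-size`, `--num-runs`) map naturally onto a timing loop like the one below. Treat it as a sketch rather than the contents of `evals/speed.py`: `benchmark_encode` is a hypothetical name, `encode` stands for any callable that takes a list of sentences, and the single warmup run is an assumption.

```python
import statistics
import time


def benchmark_encode(encode, sentences, batch_size=64, num_runs=3, warmup=1):
    """Time `encode` over the corpus and report median seconds and sent/s."""

    def run_once():
        start = time.perf_counter()
        # Feed the corpus to the encoder one batch at a time.
        for i in range(0, len(sentences), batch_size):
            encode(sentences[i:i + batch_size])
        return time.perf_counter() - start

    # Warmup runs absorb one-off costs (lazy loading, caches) and are not timed.
    for _ in range(warmup):
        run_once()

    timings = [run_once() for _ in range(num_runs)]
    median_s = statistics.median(timings)
    throughput = len(sentences) / median_s if median_s > 0 else float("inf")
    return {"median_s": median_s, "sent_per_s": throughput}
```

Reporting the median rather than the mean keeps a single slow run, for example one interrupted by another process, from skewing the result.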
## Adding a model

From the web UI, click **Add Custom Model** in the sidebar and provide a display name and a HuggingFace model ID.

Or edit `models.py` directly:

```python
# New entry in the model registry: short key -> ModelConfig
"e5-small": ModelConfig(
    name="e5-small-v2",
    model_id="intfloat/e5-small-v2",
),
```
## Project structure

```
embedding-bench/
├── app.py               # Streamlit web UI
├── bench.py             # CLI entry point
├── models.py            # Model registry (40+ models)
├── wrapper.py           # Backend wrappers (sbert, fastembed, gguf, libembedding)
├── corpus.py            # Sentence corpus builder
├── dataset_config.py    # Dataset presets and configuration
├── report.py            # Table formatting, CSV export, charts (CLI)
├── evals/
│   ├── quality.py       # Quality evaluation (Spearman + retrieval metrics)
│   ├── speed.py         # Latency measurement
│   ├── memory.py        # Memory measurement
│   └── llm_judge.py     # LLM-as-a-Judge evaluation
└── requirements.txt
```