Lev Israel

Refactor to use gr.Progress API and upgrade to Gradio 5

1a6f495 12 days ago


	---
	title: Rabbinic Embedding Benchmark
	emoji: 📚
	colorFrom: blue
	colorTo: purple
	sdk: gradio
	sdk_version: 5.9.1
	app_file: app.py
	pinned: false
	license: mit
	datasets:
	- Sefaria/Rabbinic-Hebrew-English-Pairs
	- Sefaria/Rabbinic-Embedding-Leaderboard
	---

	# Rabbinic Hebrew/Aramaic Embedding Benchmark

	Evaluate embedding models on cross-lingual retrieval between Hebrew/Aramaic source texts and their English translations from Sefaria.

	## How It Works

	Given a Hebrew/Aramaic text, can the model find its correct English translation from a pool of candidates? Models that do well at this task are likely to produce useful embeddings for Rabbinic literature.
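
The retrieval step can be sketched in a few lines. This is a minimal illustration rather than the Space's actual implementation: it assumes cosine similarity as the scoring function, and the toy vectors and the `rank_candidates` helper are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(query_vec, candidate_vecs):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine(query_vec, c) for c in candidate_vecs]
    return sorted(range(len(candidate_vecs)), key=lambda i: scores[i], reverse=True)


# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only).
hebrew_query = [0.9, 0.1, 0.0]
english_pool = [
    [0.0, 1.0, 0.0],  # unrelated passage
    [0.8, 0.2, 0.1],  # the correct translation
    [0.1, 0.0, 1.0],  # unrelated passage
]

ranking = rank_candidates(hebrew_query, english_pool)
print(ranking)  # the correct translation (index 1) should come first
```

In a real run the vectors would come from the embedding model under evaluation, and the pool would hold many English candidates per query.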

	## Metrics

	| Metric | Description |
	|--------|-------------|
	| MRR | Mean Reciprocal Rank (average of 1/rank of the correct answer) |
	| Recall@k | Percentage of queries where the correct translation is in the top k results |
	| Bitext Accuracy | Accuracy at classifying true translation pairs versus random pairs |
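
As a rough sketch of how the two ranking metrics can be computed from per-query rankings, assuming each ranking is a list of candidate ids ordered best first. The function names and the toy data are hypothetical, and Recall@k is returned as a fraction rather than a percentage. Bitext Accuracy is left out because its exact procedure is not described here.

```python
def reciprocal_rank(ranking, correct_id):
    """1/rank of the correct candidate, with ranks starting at 1."""
    return 1.0 / (ranking.index(correct_id) + 1)


def mean_reciprocal_rank(rankings, correct_ids):
    """Average reciprocal rank over all queries."""
    total = sum(reciprocal_rank(r, c) for r, c in zip(rankings, correct_ids))
    return total / len(rankings)


def recall_at_k(rankings, correct_ids, k):
    """Fraction of queries whose correct candidate is among the top k."""
    hits = sum(1 for r, c in zip(rankings, correct_ids) if c in r[:k])
    return hits / len(rankings)


# Toy rankings for three queries (hypothetical values).
rankings = [
    [2, 0, 1],  # correct id 2 is at rank 1
    [0, 1, 2],  # correct id 1 is at rank 2
    [1, 0, 2],  # correct id 2 is at rank 3
]
correct_ids = [2, 1, 2]

mrr = mean_reciprocal_rank(rankings, correct_ids)  # (1 + 1/2 + 1/3) / 3
recall_1 = recall_at_k(rankings, correct_ids, k=1)
recall_2 = recall_at_k(rankings, correct_ids, k=2)
print(f"MRR={mrr:.3f} Recall@1={recall_1:.3f} Recall@2={recall_2:.3f}")
```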

	## Corpus

	The benchmark uses the [Sefaria/Rabbinic-Hebrew-English-Pairs](https://huggingface.co/datasets/Sefaria/Rabbinic-Hebrew-English-Pairs) dataset, which includes diverse texts with English translations:

	- Talmud: Bavli & Yerushalmi
	- Mishnah: Selected tractates
	- Midrash: Midrash Rabbah
	- Commentary: Rashi, Ramban, Radak, Rabbeinu Behaye
	- Philosophy: Guide for the Perplexed, Sefer HaIkkarim
	- Hasidic/Kabbalistic: Likutei Moharan, Tomer Devorah, Kalach Pitchei Chokhmah
	- Mussar: Chafetz Chaim, Kav HaYashar, Iggeret HaRamban
	- Halacha: Sefer HaChinukh, Mishneh Torah

	All texts sourced from [Sefaria](https://www.sefaria.org).
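
To inspect the corpus locally, something like the following should work with the Hugging Face `datasets` library. It is a sketch under assumptions: the split and column names are not documented here, so the hypothetical `load_pairs` helper simply returns whichever split comes first, and it assumes the dataset has a single default configuration.

```python
REPO_ID = "Sefaria/Rabbinic-Hebrew-English-Pairs"


def load_pairs(repo_id=REPO_ID):
    """Download the dataset from the Hub and return its first split.

    The import is deferred so that this module can be imported
    even when `datasets` is not installed.
    """
    from datasets import load_dataset  # pip install datasets

    dataset_dict = load_dataset(repo_id)  # mapping of split name -> Dataset
    first_split = next(iter(dataset_dict))  # split names are not assumed
    return dataset_dict[first_split]
```

With network access, `pairs = load_pairs()` followed by `print(pairs.column_names)` shows the actual schema, which is safer than guessing field names.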

	## Leaderboard

	Results are stored persistently in the [Sefaria/Rabbinic-Embedding-Leaderboard](https://huggingface.co/datasets/Sefaria/Rabbinic-Embedding-Leaderboard) dataset.

	## Configuration (Space Secrets)

	The following environment variables can be set in Space settings:

	### Required for Leaderboard Persistence

	| Secret | Description |
	|--------|-------------|
	| `HF_TOKEN` | Hugging Face token with write access to `Sefaria/Rabbinic-Embedding-Leaderboard`. Without this, evaluations will run but results won't be saved to the leaderboard. |

	### Optional for API-based Models

	| Secret | Description |
	|--------|-------------|
	| `OPENAI_API_KEY` | For OpenAI embedding models |
	| `VOYAGE_API_KEY` | For Voyage AI embedding models |
	| `GEMINI_API_KEY` | For Google Gemini embedding models |

	Users can also enter API keys directly in the interface; keys entered this way are not stored.
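
One way an app could resolve these settings is sketched below. The environment variable names come from the tables above; everything else is an assumption for illustration, including the `resolve_api_key` and `leaderboard_is_writable` helpers and the rule that a key typed into the interface takes precedence over a Space secret.

```python
import os

# Environment variable names taken from the configuration tables.
API_KEY_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "voyage": "VOYAGE_API_KEY",
    "gemini": "GEMINI_API_KEY",
}


def resolve_api_key(provider, user_supplied=None, env=os.environ):
    """Prefer a key entered in the interface, else fall back to the Space secret.

    The precedence order is an assumption, not documented behavior.
    Returns None when no key is available.
    """
    if user_supplied and user_supplied.strip():
        return user_supplied.strip()
    return env.get(API_KEY_ENV_VARS[provider]) or None


def leaderboard_is_writable(env=os.environ):
    """Results can only be saved when HF_TOKEN is set."""
    return bool(env.get("HF_TOKEN"))


if not leaderboard_is_writable():
    print("HF_TOKEN is not set: results will not be saved to the leaderboard.")
```

Passing `env` explicitly keeps the helpers easy to test without touching the real process environment.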

	## Local Development

	```bash
	# Clone and install dependencies
	git clone https://huggingface.co/spaces/Sefaria/Rabbinic-Embedding-Bench
	cd Rabbinic-Embedding-Bench
	pip install -r requirements.txt

	# Run locally (leaderboard will be read-only without HF_TOKEN)
	python app.py

	# Or with write access to leaderboard
	export HF_TOKEN=your_token_here
	python app.py
	```

	## Related

	- [Benchmark Dataset](https://huggingface.co/datasets/Sefaria/Rabbinic-Hebrew-English-Pairs)
	- [Leaderboard Dataset](https://huggingface.co/datasets/Sefaria/Rabbinic-Embedding-Leaderboard)
	- [Sefaria](https://www.sefaria.org)