Space: Sefaria/Rabbinic-Embedding-Bench

Lev Israel committed on Jan 12

Commit 112e258 (1 parent: 5990acd)

Setup HF space

Files changed (4)

README.md +66 -22
app.py +30 -51
data_loader.py +29 -6
requirements.txt +2 -0

README.md CHANGED

@@ -1,47 +1,91 @@
-# Rabbinic Hebrew/Aramaic Embedding Evaluation
-A Hugging Face Space for evaluating embedding models on Rabbinic Hebrew and Aramaic texts using cross-lingual retrieval benchmarks.
-## Overview
-This tool helps identify which embedding models best capture the semantics of Rabbinic Hebrew and Aramaic by measuring how well they align source texts with their English translations. Models that excel at this task are likely to produce high-quality embeddings for untranslated texts.
-## Evaluation Approach
-Given a Hebrew/Aramaic text, the benchmark tests whether the embedding model can find its correct English translation from a pool of candidates. This cross-lingual retrieval task measures semantic alignment across languages.
-### Metrics
 | Metric | Description |
 |--------|-------------|
-| **Recall@1** | % of queries where correct translation is the top result |
-| **Recall@5** | % where correct translation is in top 5 results |
-| **Recall@10** | % where correct translation is in top 10 results |
 | **MRR** | Mean Reciprocal Rank (average of 1/rank of correct answer) |
 ## Corpus
-The benchmark includes diverse texts from Sefaria with English translations:
-Representative Segment pairs from Talmud Bavli, Yerushalmi, Mishnah, Midrash, Tanakh Commentary, Halacha, Hassidic texts, Works of Philosophy, and Kabbalah.
-## Usage
-1. Select a model from the curated list or enter any Hugging Face model ID
-2. Click "Run Evaluation"
-3. View results and compare with the leaderboard
-## Models
-Support for OpenAI, Google, and Voyage embedding APIs, and any sentence-transformer compatible model from Hugging Face Hub.
 ## Local Development
 ```bash
 pip install -r requirements.txt
 python app.py
-```
-## License
-MIT

+---
+title: Rabbinic Embedding Benchmark
+emoji: 📚
+colorFrom: blue
+colorTo: purple
+sdk: gradio
+sdk_version: 4.44.0
+app_file: app.py
+pinned: false
+license: mit
+datasets:
+  - Sefaria/Rabbinic-Hebrew-English-Pairs
+  - Sefaria/Rabbinic-Embedding-Leaderboard
+---
+# Rabbinic Hebrew/Aramaic Embedding Benchmark
+Evaluate embedding models on cross-lingual retrieval between Hebrew/Aramaic source texts and their English translations from Sefaria.
+## How It Works
+Given a Hebrew/Aramaic text, can the model find its correct English translation from a pool of candidates? Models that excel at this task produce high-quality embeddings for Rabbinic literature.
+## Metrics
 | Metric | Description |
 |--------|-------------|
 | **MRR** | Mean Reciprocal Rank (average of 1/rank of correct answer) |
+| **Recall@k** | % of queries where correct translation is in top k results |
+| **Bitext Accuracy** | True pair vs random pair classification |
 ## Corpus
+The benchmark uses the [Sefaria/Rabbinic-Hebrew-English-Pairs](https://huggingface.co/datasets/Sefaria/Rabbinic-Hebrew-English-Pairs) dataset, which includes diverse texts with English translations:
+- **Talmud**: Bavli & Yerushalmi
+- **Mishnah**: Selected tractates
+- **Midrash**: Midrash Rabbah
+- **Commentary**: Rashi, Ramban, Radak, Rabbeinu Behaye
+- **Philosophy**: Guide for the Perplexed, Sefer HaIkkarim
+- **Hasidic/Kabbalistic**: Likutei Moharan, Tomer Devorah, Kalach Pitchei Chokhmah
+- **Mussar**: Chafetz Chaim, Kav HaYashar, Iggeret HaRamban
+- **Halacha**: Sefer HaChinukh, Mishneh Torah
+All texts sourced from [Sefaria](https://www.sefaria.org).
+## Leaderboard
+Results are stored persistently in the [Sefaria/Rabbinic-Embedding-Leaderboard](https://huggingface.co/datasets/Sefaria/Rabbinic-Embedding-Leaderboard) dataset.
+## Configuration (Space Secrets)
+The following environment variables can be set in Space settings:
+### Required for Leaderboard Persistence
+| Secret | Description |
+|--------|-------------|
+| `HF_TOKEN` | HuggingFace token with write access to `Sefaria/Rabbinic-Embedding-Leaderboard`. Without this, evaluations will run but results won't be saved to the leaderboard. |
+### Optional for API-based Models
+| Secret | Description |
+|--------|-------------|
+| `OPENAI_API_KEY` | For OpenAI embedding models |
+| `VOYAGE_API_KEY` | For Voyage AI embedding models |
+| `GEMINI_API_KEY` | For Google Gemini embedding models |
+Users can also enter API keys directly in the interface (they are not stored).
 ## Local Development
 ```bash
+# Clone and install dependencies
+git clone https://huggingface.co/spaces/Sefaria/Rabbinic-Embedding-Benchmark
+cd Rabbinic-Embedding-Benchmark
 pip install -r requirements.txt
+# Run locally (leaderboard will be read-only without HF_TOKEN)
 python app.py
+# Or with write access to leaderboard
+export HF_TOKEN=your_token_here
+python app.py
+```
+## Related
+- [Benchmark Dataset](https://huggingface.co/datasets/Sefaria/Rabbinic-Hebrew-English-Pairs)
+- [Leaderboard Dataset](https://huggingface.co/datasets/Sefaria/Rabbinic-Embedding-Leaderboard)
+- [Sefaria](https://www.sefaria.org)
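
The MRR and Recall@k rows in the new README metrics table can be illustrated with a small, dependency-free sketch. It is not the Space's `evaluation` module, which this diff does not show. The function name `retrieval_metrics` and the convention that the correct translation of source `i` sits at candidate index `i` are assumptions made for illustration; the output keys follow the `mrr` and `recall_at_k` names that `app.py` reads from leaderboard entries.

```python
def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute MRR and Recall@k from a square similarity matrix.

    Assumption (not confirmed by the diff): sim[i][j] is the similarity
    between source text i and candidate translation j, and the correct
    translation of source i is candidate i.
    """
    ranks = []
    for i, row in enumerate(sim):
        correct_score = row[i]
        # Rank = 1 + number of candidates scoring strictly higher than the correct one.
        ranks.append(1 + sum(1 for score in row if score > correct_score))

    n = len(ranks)
    metrics = {"mrr": sum(1.0 / r for r in ranks) / n}
    for k in ks:
        metrics[f"recall_at_{k}"] = sum(1 for r in ranks if r <= k) / n
    return metrics


if __name__ == "__main__":
    toy_sim = [
        [0.9, 0.1, 0.2],  # correct translation ranked 1st
        [0.8, 0.5, 0.3],  # correct translation ranked 2nd
        [0.7, 0.6, 0.1],  # correct translation ranked 3rd
    ]
    print(retrieval_metrics(toy_sim, ks=(1, 2)))
```

In the toy matrix the correct translations land at ranks 1, 2, and 3, so MRR is the mean of 1, 1/2, and 1/3. Ties are resolved in favor of the correct candidate here; a real implementation may choose differently.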

app.py CHANGED

@@ -5,10 +5,8 @@ A Hugging Face Space for evaluating embedding models on cross-lingual
 retrieval between Hebrew/Aramaic source texts and English translations.
 """
-import json
 import os
 from datetime import datetime
-from pathlib import Path
 import gradio as gr
 import pandas as pd
@@ -37,28 +35,31 @@ from evaluation import (
     compute_similarity_matrix,
     get_rank_distribution,
 )
-# Paths
-BENCHMARK_PATH = "benchmark_data/benchmark.json"
-LEADERBOARD_PATH = "benchmark_data/leaderboard.json"
 # Global state
 _benchmark_data = None
-_leaderboard = []
 def load_benchmark():
-    """Load benchmark data, with fallback to sample data."""
     global _benchmark_data
     if _benchmark_data is not None:
         return _benchmark_data
     try:
-        _benchmark_data = load_benchmark_dataset(BENCHMARK_PATH)
-        print(f"Loaded {len(_benchmark_data)} benchmark pairs")
-    except FileNotFoundError:
-        print("Benchmark not found, using sample data")
         # Create minimal sample data for testing
         _benchmark_data = [
             {
@@ -79,56 +80,34 @@ def load_benchmark():
 def load_leaderboard():
-    """Load saved leaderboard results."""
-    global _leaderboard
-    try:
-        with open(LEADERBOARD_PATH, "r") as f:
-            _leaderboard = json.load(f)
-    except FileNotFoundError:
-        _leaderboard = []
-    return _leaderboard
-def save_leaderboard():
-    """Save leaderboard to file."""
-    global _leaderboard
-    Path(LEADERBOARD_PATH).parent.mkdir(parents=True, exist_ok=True)
-    with open(LEADERBOARD_PATH, "w") as f:
-        json.dump(_leaderboard, f, indent=2)
 def add_to_leaderboard(results: EvaluationResults):
-    """Add evaluation results to leaderboard."""
-    global _leaderboard
     entry = results.to_dict()
     entry["timestamp"] = datetime.now().isoformat()
-    # Remove existing entry for same model
-    _leaderboard = [e for e in _leaderboard if e["model_id"] != results.model_id]
-    _leaderboard.append(entry)
-    # Sort by MRR descending
-    _leaderboard.sort(key=lambda x: x["mrr"], reverse=True)
-    save_leaderboard()
 def format_leaderboard_df():
     """Format leaderboard as pandas DataFrame for display."""
-    load_leaderboard()
-    if not _leaderboard:
         return pd.DataFrame(columns=[
             "#", "Model", "MRR", "R@1", "R@5", "R@10",
             "Bitext", "TrueSim", "RandSim", "N"
         ])
     rows = []
-    for i, entry in enumerate(_leaderboard, 1):
         rows.append({
             "#": i,
             "Model": entry.get("model_name", entry["model_id"]),
@@ -261,17 +240,17 @@ def run_evaluation(
 def create_leaderboard_comparison():
     """Create comparison chart of all models on leaderboard."""
-    load_leaderboard()
-    if len(_leaderboard) < 2:
         return None
-    models = [e.get("model_name", e["model_id"]) for e in _leaderboard]
-    mrr = [e["mrr"] for e in _leaderboard]
-    r1 = [e["recall_at_1"] for e in _leaderboard]
-    r5 = [e["recall_at_5"] for e in _leaderboard]
-    r10 = [e["recall_at_10"] for e in _leaderboard]
-    bitext = [e["bitext_accuracy"] for e in _leaderboard]
     fig = go.Figure()

 retrieval between Hebrew/Aramaic source texts and English translations.
 """
 import os
 from datetime import datetime
 import gradio as gr
 import pandas as pd
     compute_similarity_matrix,
     get_rank_distribution,
 )
+from leaderboard import (
+    load_leaderboard as load_leaderboard_from_hub,
+    add_result as add_result_to_hub,
+)
+# HuggingFace Dataset ID for benchmark data
+BENCHMARK_DATASET_ID = "Sefaria/Rabbinic-Hebrew-English-Pairs"
 # Global state
 _benchmark_data = None
 def load_benchmark():
+    """Load benchmark data from HuggingFace Hub, with fallback to sample data."""
     global _benchmark_data
     if _benchmark_data is not None:
         return _benchmark_data
     try:
+        _benchmark_data = load_benchmark_dataset(BENCHMARK_DATASET_ID)
+        print(f"Loaded {len(_benchmark_data)} benchmark pairs from {BENCHMARK_DATASET_ID}")
+    except Exception as e:
+        print(f"Failed to load benchmark: {e}")
+        print("Using sample data for testing")
         # Create minimal sample data for testing
         _benchmark_data = [
             {
 def load_leaderboard():
+    """Load leaderboard from HuggingFace Hub."""
+    return load_leaderboard_from_hub()
 def add_to_leaderboard(results: EvaluationResults):
+    """Add evaluation results to leaderboard on HuggingFace Hub."""
     entry = results.to_dict()
     entry["timestamp"] = datetime.now().isoformat()
+    # Add to Hub (handles deduplication and sorting internally)
+    success = add_result_to_hub(entry)
+    if not success:
+        print("Note: Results saved locally but not persisted to Hub (no HF_TOKEN)")
 def format_leaderboard_df():
     """Format leaderboard as pandas DataFrame for display."""
+    leaderboard = load_leaderboard()
+    if not leaderboard:
         return pd.DataFrame(columns=[
             "#", "Model", "MRR", "R@1", "R@5", "R@10",
             "Bitext", "TrueSim", "RandSim", "N"
         ])
     rows = []
+    for i, entry in enumerate(leaderboard, 1):
         rows.append({
             "#": i,
             "Model": entry.get("model_name", entry["model_id"]),
 def create_leaderboard_comparison():
     """Create comparison chart of all models on leaderboard."""
+    leaderboard = load_leaderboard()
+    if len(leaderboard) < 2:
         return None
+    models = [e.get("model_name", e["model_id"]) for e in leaderboard]
+    mrr = [e["mrr"] for e in leaderboard]
+    r1 = [e["recall_at_1"] for e in leaderboard]
+    r5 = [e["recall_at_5"] for e in leaderboard]
+    r10 = [e["recall_at_10"] for e in leaderboard]
+    bitext = [e["bitext_accuracy"] for e in leaderboard]
     fig = go.Figure()
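
The new `add_to_leaderboard` delegates deduplication and sorting to `leaderboard.add_result`, whose body is not part of this diff. As a rough sketch of what that step presumably involves, the following mirrors the logic in the removed lines (drop any existing entry for the same `model_id`, append, sort by MRR descending). The helper name `upsert_entry` is hypothetical and the Hub upload is deliberately left out.

```python
def upsert_entry(leaderboard: list[dict], entry: dict) -> list[dict]:
    """Return a new leaderboard with `entry` replacing any row for the same model.

    Hypothetical helper: mirrors the removed in-file logic, not the actual
    `leaderboard.add_result` implementation.
    """
    # Remove the existing entry for the same model, if any.
    kept = [e for e in leaderboard if e["model_id"] != entry["model_id"]]
    kept.append(entry)
    # Sort by MRR, best model first.
    return sorted(kept, key=lambda e: e["mrr"], reverse=True)


if __name__ == "__main__":
    board = [
        {"model_id": "model-a", "mrr": 0.5},
        {"model_id": "model-b", "mrr": 0.7},
    ]
    updated = upsert_entry(board, {"model_id": "model-a", "mrr": 0.9})
    print([e["model_id"] for e in updated])
```

Returning a new list rather than mutating a module-level global keeps the function easy to test, which fits the commit's removal of the `_leaderboard` global.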

data_loader.py CHANGED

@@ -719,18 +719,41 @@ def build_benchmark_dataset(
     return all_pairs
-def load_benchmark_dataset(path: str = "benchmark_data/benchmark.json") -> list[dict]:
     """
-    Load the pre-cached benchmark dataset.
     Args:
-        path: Path to the benchmark JSON file
     Returns:
-        List of benchmark pairs
     """
-    with open(path, "r", encoding="utf-8") as f:
-        return json.load(f)
 def get_benchmark_stats(pairs: list[dict]) -> dict:

     return all_pairs
+def load_benchmark_dataset(
+    source: str = "Sefaria/Rabbinic-Hebrew-English-Pairs",
+    use_local: bool = False,
+) -> list[dict]:
     """
+    Load the benchmark dataset from HuggingFace Hub or local file.
     Args:
+        source: HuggingFace dataset ID or local file path
+        use_local: If True, load from local JSON file instead of HuggingFace
     Returns:
+        List of benchmark pairs with keys: ref, he, en, category
     """
+    if use_local or source.endswith(".json"):
+        # Load from local JSON file
+        with open(source, "r", encoding="utf-8") as f:
+            return json.load(f)
+    # Load from HuggingFace Hub
+    try:
+        from datasets import load_dataset
+        print(f"Loading benchmark from HuggingFace: {source}")
+        ds = load_dataset(source, split="train")
+        return ds.to_list()
+    except Exception as e:
+        print(f"Failed to load from HuggingFace: {e}")
+        # Fallback to local file if it exists
+        local_path = "benchmark_data/benchmark.json"
+        if Path(local_path).exists():
+            print(f"Falling back to local file: {local_path}")
+            with open(local_path, "r", encoding="utf-8") as f:
+                return json.load(f)
+        raise
 def get_benchmark_stats(pairs: list[dict]) -> dict:
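
The loading order in the new `load_benchmark_dataset` (a local `.json` path first, then the Hub, then a local fallback file, re-raising if nothing works) can be exercised offline with the sketch below. The function `load_pairs` and its injectable `remote_loader` parameter are assumptions for illustration; the real function calls `datasets.load_dataset` directly. Note that the hunk relies on `json` and `Path` being imported elsewhere in `data_loader.py`, which this diff does not show.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable


def load_pairs(source: str, remote_loader: Callable[[str], list], fallback_path: str) -> list:
    """Load pairs from a local .json path, else remotely, else from a fallback file."""
    if source.endswith(".json"):
        return json.loads(Path(source).read_text(encoding="utf-8"))
    try:
        return remote_loader(source)
    except Exception:
        # Fall back to the local file only if it exists; otherwise re-raise.
        if Path(fallback_path).exists():
            return json.loads(Path(fallback_path).read_text(encoding="utf-8"))
        raise


if __name__ == "__main__":
    def offline_loader(dataset_id: str) -> list:
        # Stand-in for a Hub download that fails (hypothetical).
        raise ConnectionError(f"cannot reach hub for {dataset_id}")

    with tempfile.TemporaryDirectory() as tmp:
        local = os.path.join(tmp, "benchmark.json")
        sample = [{"ref": "Sample 1:1", "he": "...", "en": "...", "category": "Sample"}]
        Path(local).write_text(json.dumps(sample), encoding="utf-8")
        print(load_pairs("some-org/some-dataset", offline_loader, local))
```

Passing the remote loader in as a parameter is only a testing convenience; it lets the fallback branch run without network access.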

requirements.txt CHANGED

@@ -3,6 +3,8 @@ gradio>=4.0.0
 transformers>=4.36.0
 sentence-transformers>=2.2.2
 torch>=2.0.0
 # Data processing
 numpy>=1.24.0

 transformers>=4.36.0
 sentence-transformers>=2.2.2
 torch>=2.0.0
+datasets>=2.14.0
+huggingface_hub>=0.19.0
 # Data processing
 numpy>=1.24.0