Lev Israel committed on
Commit 112e258 · 1 Parent(s): 5990acd

Setup HF space

Files changed (4)
  1. README.md +66 -22
  2. app.py +30 -51
  3. data_loader.py +29 -6
  4. requirements.txt +2 -0
README.md CHANGED
@@ -1,47 +1,91 @@
-# Rabbinic Hebrew/Aramaic Embedding Evaluation
-
-A Hugging Face Space for evaluating embedding models on Rabbinic Hebrew and Aramaic texts using cross-lingual retrieval benchmarks.
-
-## Overview
-
-This tool helps identify which embedding models best capture the semantics of Rabbinic Hebrew and Aramaic by measuring how well they align source texts with their English translations. Models that excel at this task are likely to produce high-quality embeddings for untranslated texts.
-
-## Evaluation Approach
-
-Given a Hebrew/Aramaic text, the benchmark tests whether the embedding model can find its correct English translation from a pool of candidates. This cross-lingual retrieval task measures semantic alignment across languages.
-
-### Metrics
+---
+title: Rabbinic Embedding Benchmark
+emoji: 📚
+colorFrom: blue
+colorTo: purple
+sdk: gradio
+sdk_version: 4.44.0
+app_file: app.py
+pinned: false
+license: mit
+datasets:
+- Sefaria/Rabbinic-Hebrew-English-Pairs
+- Sefaria/Rabbinic-Embedding-Leaderboard
+---
+
+# Rabbinic Hebrew/Aramaic Embedding Benchmark
+
+Evaluate embedding models on cross-lingual retrieval between Hebrew/Aramaic source texts and their English translations from Sefaria.
+
+## How It Works
+
+Given a Hebrew/Aramaic text, can the model find its correct English translation from a pool of candidates? Models that excel at this task produce high-quality embeddings for Rabbinic literature.
+
+## Metrics
 
 | Metric | Description |
 |--------|-------------|
-| **Recall@1** | % of queries where correct translation is the top result |
-| **Recall@5** | % where correct translation is in top 5 results |
-| **Recall@10** | % where correct translation is in top 10 results |
 | **MRR** | Mean Reciprocal Rank (average of 1/rank of correct answer) |
+| **Recall@k** | % of queries where correct translation is in top k results |
+| **Bitext Accuracy** | True pair vs. random pair classification |
 
 ## Corpus
 
-The benchmark includes diverse texts from Sefaria with English translations:
-Representative Segment pairs from Talmud Bavli, Yerushalmi, Mishnah, Midrash, Tanakh Commentary, Halacha, Hassidic texts, Works of Philosophy, and Kabbalah.
-
-## Usage
-
-1. Select a model from the curated list or enter any Hugging Face model ID
-2. Click "Run Evaluation"
-3. View results and compare with the leaderboard
-
-## Models
-
-Support for OpenAI, Google, and Voyage embedding APIs, and any sentence-transformer compatible model from Hugging Face Hub.
+The benchmark uses the [Sefaria/Rabbinic-Hebrew-English-Pairs](https://huggingface.co/datasets/Sefaria/Rabbinic-Hebrew-English-Pairs) dataset, which includes diverse texts with English translations:
+
+- **Talmud**: Bavli & Yerushalmi
+- **Mishnah**: Selected tractates
+- **Midrash**: Midrash Rabbah
+- **Commentary**: Rashi, Ramban, Radak, Rabbeinu Behaye
+- **Philosophy**: Guide for the Perplexed, Sefer HaIkkarim
+- **Hasidic/Kabbalistic**: Likutei Moharan, Tomer Devorah, Kalach Pitchei Chokhmah
+- **Mussar**: Chafetz Chaim, Kav HaYashar, Iggeret HaRamban
+- **Halacha**: Sefer HaChinukh, Mishneh Torah
+
+All texts are sourced from [Sefaria](https://www.sefaria.org).
+
+## Leaderboard
+
+Results are stored persistently in the [Sefaria/Rabbinic-Embedding-Leaderboard](https://huggingface.co/datasets/Sefaria/Rabbinic-Embedding-Leaderboard) dataset.
+
+## Configuration (Space Secrets)
+
+The following environment variables can be set in the Space settings:
+
+### Required for Leaderboard Persistence
+
+| Secret | Description |
+|--------|-------------|
+| `HF_TOKEN` | HuggingFace token with write access to `Sefaria/Rabbinic-Embedding-Leaderboard`. Without this, evaluations will run but results won't be saved to the leaderboard. |
+
+### Optional for API-based Models
+
+| Secret | Description |
+|--------|-------------|
+| `OPENAI_API_KEY` | For OpenAI embedding models |
+| `VOYAGE_API_KEY` | For Voyage AI embedding models |
+| `GEMINI_API_KEY` | For Google Gemini embedding models |
+
+Users can also enter API keys directly in the interface (they are not stored).
 
 ## Local Development
 
 ```bash
+# Clone and install dependencies
+git clone https://huggingface.co/spaces/Sefaria/Rabbinic-Embedding-Benchmark
+cd Rabbinic-Embedding-Benchmark
 pip install -r requirements.txt
+
+# Run locally (leaderboard will be read-only without HF_TOKEN)
 python app.py
-```
 
-## License
-
-MIT
+# Or with write access to the leaderboard
+export HF_TOKEN=your_token_here
+python app.py
+```
+
+## Related
+
+- [Benchmark Dataset](https://huggingface.co/datasets/Sefaria/Rabbinic-Hebrew-English-Pairs)
+- [Leaderboard Dataset](https://huggingface.co/datasets/Sefaria/Rabbinic-Embedding-Leaderboard)
+- [Sefaria](https://www.sefaria.org)
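
The metrics in the new README all reduce to operations on a single query-by-candidate similarity matrix. The sketch below is a minimal NumPy illustration of MRR, Recall@k, and bitext accuracy, with hypothetical function names; the Space's real logic lives in `evaluation.py` (e.g. `compute_similarity_matrix`), which this commit does not modify.

```python
import numpy as np

def retrieval_metrics(sim: np.ndarray, ks=(1, 5, 10)) -> dict:
    """MRR and Recall@k from an (n x n) similarity matrix where
    sim[i, j] scores query i against candidate j and the diagonal
    holds the true Hebrew/Aramaic-English pairs."""
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)  # candidate indices, best first
    # 1-based rank of the true translation for each query
    ranks = np.array([int(np.where(order[i] == i)[0][0]) + 1 for i in range(n)])
    metrics = {"mrr": float(np.mean(1.0 / ranks))}
    for k in ks:
        metrics[f"recall_at_{k}"] = float(np.mean(ranks <= k))
    return metrics

def bitext_accuracy(sim: np.ndarray, seed: int = 0) -> float:
    """Fraction of queries whose true pair outscores one random non-pair."""
    rng = np.random.default_rng(seed)
    n = sim.shape[0]
    # Random offset in [1, n-1] guarantees a wrong candidate
    rand = (np.arange(n) + rng.integers(1, n, size=n)) % n
    return float(np.mean(sim[np.arange(n), np.arange(n)] > sim[np.arange(n), rand]))
```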
app.py CHANGED
@@ -5,10 +5,8 @@ A Hugging Face Space for evaluating embedding models on cross-lingual
 retrieval between Hebrew/Aramaic source texts and English translations.
 """
 
-import json
 import os
 from datetime import datetime
-from pathlib import Path
 
 import gradio as gr
 import pandas as pd
@@ -37,28 +35,31 @@ from evaluation import (
     compute_similarity_matrix,
     get_rank_distribution,
 )
+from leaderboard import (
+    load_leaderboard as load_leaderboard_from_hub,
+    add_result as add_result_to_hub,
+)
 
-# Paths
-BENCHMARK_PATH = "benchmark_data/benchmark.json"
-LEADERBOARD_PATH = "benchmark_data/leaderboard.json"
+# HuggingFace Dataset ID for benchmark data
+BENCHMARK_DATASET_ID = "Sefaria/Rabbinic-Hebrew-English-Pairs"
 
 # Global state
 _benchmark_data = None
-_leaderboard = []
 
 
 def load_benchmark():
-    """Load benchmark data, with fallback to sample data."""
+    """Load benchmark data from HuggingFace Hub, with fallback to sample data."""
     global _benchmark_data
 
     if _benchmark_data is not None:
        return _benchmark_data
 
    try:
-        _benchmark_data = load_benchmark_dataset(BENCHMARK_PATH)
-        print(f"Loaded {len(_benchmark_data)} benchmark pairs")
-    except FileNotFoundError:
-        print("Benchmark not found, using sample data")
+        _benchmark_data = load_benchmark_dataset(BENCHMARK_DATASET_ID)
+        print(f"Loaded {len(_benchmark_data)} benchmark pairs from {BENCHMARK_DATASET_ID}")
+    except Exception as e:
+        print(f"Failed to load benchmark: {e}")
+        print("Using sample data for testing")
         # Create minimal sample data for testing
         _benchmark_data = [
             {
@@ -79,56 +80,34 @@ def load_benchmark():
 
 
 def load_leaderboard():
-    """Load saved leaderboard results."""
-    global _leaderboard
-
-    try:
-        with open(LEADERBOARD_PATH, "r") as f:
-            _leaderboard = json.load(f)
-    except FileNotFoundError:
-        _leaderboard = []
-
-    return _leaderboard
-
-
-def save_leaderboard():
-    """Save leaderboard to file."""
-    global _leaderboard
-
-    Path(LEADERBOARD_PATH).parent.mkdir(parents=True, exist_ok=True)
-    with open(LEADERBOARD_PATH, "w") as f:
-        json.dump(_leaderboard, f, indent=2)
+    """Load leaderboard from HuggingFace Hub."""
+    return load_leaderboard_from_hub()
 
 
 def add_to_leaderboard(results: EvaluationResults):
-    """Add evaluation results to leaderboard."""
-    global _leaderboard
-
+    """Add evaluation results to leaderboard on HuggingFace Hub."""
     entry = results.to_dict()
     entry["timestamp"] = datetime.now().isoformat()
 
-    # Remove existing entry for same model
-    _leaderboard = [e for e in _leaderboard if e["model_id"] != results.model_id]
-    _leaderboard.append(entry)
+    # Add to Hub (handles deduplication and sorting internally)
+    success = add_result_to_hub(entry)
 
-    # Sort by MRR descending
-    _leaderboard.sort(key=lambda x: x["mrr"], reverse=True)
-
-    save_leaderboard()
+    if not success:
+        print("Note: Results saved locally but not persisted to Hub (no HF_TOKEN)")
 
 
 def format_leaderboard_df():
     """Format leaderboard as pandas DataFrame for display."""
-    load_leaderboard()
+    leaderboard = load_leaderboard()
 
-    if not _leaderboard:
+    if not leaderboard:
         return pd.DataFrame(columns=[
             "#", "Model", "MRR", "R@1", "R@5", "R@10",
             "Bitext", "TrueSim", "RandSim", "N"
         ])
 
     rows = []
-    for i, entry in enumerate(_leaderboard, 1):
+    for i, entry in enumerate(leaderboard, 1):
         rows.append({
             "#": i,
             "Model": entry.get("model_name", entry["model_id"]),
@@ -261,17 +240,17 @@ def run_evaluation(
 
 def create_leaderboard_comparison():
     """Create comparison chart of all models on leaderboard."""
-    load_leaderboard()
+    leaderboard = load_leaderboard()
 
-    if len(_leaderboard) < 2:
+    if len(leaderboard) < 2:
         return None
 
-    models = [e.get("model_name", e["model_id"]) for e in _leaderboard]
-    mrr = [e["mrr"] for e in _leaderboard]
-    r1 = [e["recall_at_1"] for e in _leaderboard]
-    r5 = [e["recall_at_5"] for e in _leaderboard]
-    r10 = [e["recall_at_10"] for e in _leaderboard]
-    bitext = [e["bitext_accuracy"] for e in _leaderboard]
+    models = [e.get("model_name", e["model_id"]) for e in leaderboard]
+    mrr = [e["mrr"] for e in leaderboard]
+    r1 = [e["recall_at_1"] for e in leaderboard]
+    r5 = [e["recall_at_5"] for e in leaderboard]
+    r10 = [e["recall_at_10"] for e in leaderboard]
+    bitext = [e["bitext_accuracy"] for e in leaderboard]
 
     fig = go.Figure()
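
The refactor above replaces file-based persistence with a new `leaderboard` module exposing `load_leaderboard` and `add_result`, which is not part of this diff. The following is a speculative sketch of what that interface might look like, inferred only from the call sites and the README; the dataset ID, sorting, and dedup-by-`model_id` behavior are assumptions, not the module's confirmed implementation.

```python
# leaderboard.py -- hypothetical sketch, inferred from app.py's imports above
import os

from datasets import Dataset, load_dataset

LEADERBOARD_DATASET_ID = "Sefaria/Rabbinic-Embedding-Leaderboard"

def load_leaderboard() -> list[dict]:
    """Fetch all leaderboard entries from the Hub, best MRR first."""
    try:
        ds = load_dataset(LEADERBOARD_DATASET_ID, split="train")
        return sorted(ds.to_list(), key=lambda e: e["mrr"], reverse=True)
    except Exception:
        return []  # empty leaderboard when the dataset is missing or unreachable

def add_result(entry: dict) -> bool:
    """Push one entry, replacing any prior result for the same model.
    Returns False when no write token is available."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        return False
    entries = [e for e in load_leaderboard() if e["model_id"] != entry["model_id"]]
    entries.append(entry)
    entries.sort(key=lambda e: e["mrr"], reverse=True)
    Dataset.from_list(entries).push_to_hub(LEADERBOARD_DATASET_ID, token=token)
    return True
```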
data_loader.py CHANGED
@@ -719,18 +719,41 @@ def build_benchmark_dataset(
     return all_pairs
 
 
-def load_benchmark_dataset(path: str = "benchmark_data/benchmark.json") -> list[dict]:
+def load_benchmark_dataset(
+    source: str = "Sefaria/Rabbinic-Hebrew-English-Pairs",
+    use_local: bool = False,
+) -> list[dict]:
     """
-    Load the pre-cached benchmark dataset.
+    Load the benchmark dataset from HuggingFace Hub or a local file.
 
     Args:
-        path: Path to the benchmark JSON file
+        source: HuggingFace dataset ID or local file path
+        use_local: If True, load from a local JSON file instead of HuggingFace
 
     Returns:
-        List of benchmark pairs
+        List of benchmark pairs with keys: ref, he, en, category
     """
-    with open(path, "r", encoding="utf-8") as f:
-        return json.load(f)
+    if use_local or source.endswith(".json"):
+        # Load from local JSON file
+        with open(source, "r", encoding="utf-8") as f:
+            return json.load(f)
+
+    # Load from HuggingFace Hub
+    try:
+        from datasets import load_dataset
+
+        print(f"Loading benchmark from HuggingFace: {source}")
+        ds = load_dataset(source, split="train")
+        return ds.to_list()
+    except Exception as e:
+        print(f"Failed to load from HuggingFace: {e}")
+        # Fall back to a local file if it exists
+        local_path = "benchmark_data/benchmark.json"
+        if Path(local_path).exists():
+            print(f"Falling back to local file: {local_path}")
+            with open(local_path, "r", encoding="utf-8") as f:
+                return json.load(f)
+        raise
 
 
 def get_benchmark_stats(pairs: list[dict]) -> dict:
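
A short usage sketch for the new loader, following the signature and docstring above; the printed `ref` and `category` fields are the keys the docstring documents.

```python
from data_loader import load_benchmark_dataset

# Default: pull the published pairs from the HuggingFace Hub
pairs = load_benchmark_dataset()

# Or force a local snapshot (either the flag or a .json path triggers the local branch)
local_pairs = load_benchmark_dataset("benchmark_data/benchmark.json", use_local=True)

print(len(pairs), pairs[0]["ref"], pairs[0]["category"])
```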
requirements.txt CHANGED
@@ -3,6 +3,8 @@ gradio>=4.0.0
 transformers>=4.36.0
 sentence-transformers>=2.2.2
 torch>=2.0.0
+datasets>=2.14.0
+huggingface_hub>=0.19.0
 
 # Data processing
 numpy>=1.24.0