Lev Israel committed on
Commit a9dad42 · 1 Parent(s): 018c4c5

Upload dataset to HF

Files changed (2)
  1. dataset/README.md +97 -0
  2. upload_dataset.py +116 -0
dataset/README.md ADDED
@@ -0,0 +1,97 @@
---
language:
- he
- arc
- en
license: cc-by-4.0
task_categories:
- sentence-similarity
- text-retrieval
tags:
- rabbinic
- hebrew
- aramaic
- talmud
- cross-lingual
- bitext
- sefaria
size_categories:
- 1K<n<10K
---

# Rabbinic Hebrew/Aramaic - English Parallel Corpus

A benchmark dataset for evaluating embedding models on Rabbinic Hebrew and Aramaic texts, with parallel English translations sourced from [Sefaria](https://www.sefaria.org).

## Dataset Description

This dataset contains 3,708 parallel text pairs spanning diverse Rabbinic literature across multiple centuries and genres. It is designed for evaluating cross-lingual embedding models on their ability to align Hebrew/Aramaic source texts with English translations.

### Languages

- **Source**: Rabbinic Hebrew, Jewish Babylonian Aramaic, Jewish Palestinian Aramaic
- **Target**: English

### Dataset Structure

Each example contains the following fields (see the loading sketch below):
- `ref`: Sefaria reference string (e.g., "Berakhot.2a:1")
- `he`: Hebrew/Aramaic source text
- `en`: English translation
- `category`: Text category

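A minimal loading sketch, assuming the layout produced by `upload_dataset.py` below (`data/benchmark.json` inside the dataset repo); the repo id shown is a placeholder:

```python
# Sketch: fetch the raw JSON from the Hub and inspect one record.
import json

from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="username/rabbinic-benchmark",  # placeholder; substitute the real repo id
    filename="data/benchmark.json",
    repo_type="dataset",
)
with open(path, encoding="utf-8") as f:
    pairs = json.load(f)

print(len(pairs))                             # 3708
print(pairs[0]["ref"], pairs[0]["category"])  # e.g. a Sefaria ref and its category
```
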
### Categories

| Category | Count | Description |
|----------|-------|-------------|
| Mishnah | 789 | Tannaitic legal compilation (~200 CE) |
| Tanakh Commentary | 674 | Rashi, Ramban, Radak, Rabbeinu Behaye on Torah |
| Jerusalem Talmud | 520 | Palestinian Talmud (~400 CE) |
| Talmud | 480 | Babylonian Talmud (~500 CE) |
| Midrash Rabbah | 393 | Midrashic compilations |
| Hasidic/Kabbalistic | 304 | Likutei Moharan, Tomer Devorah, Kalach Pitchei Chokhmah |
| Philosophy | 240 | Guide for the Perplexed, Sefer HaIkkarim |
| Halacha | 160 | Sefer HaChinukh, Mishneh Torah |
| Mussar/Ethics | 108 | Chafetz Chaim, Kav HaYashar, Iggeret HaRamban |
| Targum | 40 | Aramaic Targum to Song of Songs |

## Intended Use

### Primary Use Case

Evaluating embedding models for cross-lingual retrieval (a setup sketch follows this list):
- Given a Hebrew/Aramaic text, can the model find its English translation from a pool of candidates?
- Models that excel at this task likely capture the semantics of Rabbinic literature well.

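A hedged setup sketch, continuing from the loading snippet above; `sentence-transformers` and the `intfloat/multilingual-e5-base` checkpoint are illustrative choices, not part of the benchmark:

```python
# Sketch: embed both sides and score every Hebrew/Aramaic query against
# every English candidate. Any multilingual encoder could be substituted.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")  # example model choice

he_texts = [p["he"] for p in pairs]  # `pairs` as loaded in the snippet above
en_texts = [p["en"] for p in pairs]

he_emb = model.encode(he_texts, normalize_embeddings=True)
en_emb = model.encode(en_texts, normalize_embeddings=True)

# With unit-normalized embeddings, the dot product is cosine similarity;
# sims[i, j] scores query i against candidate j, so sims[i, i] is the true pair.
sims = he_emb @ en_emb.T
```
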
### Evaluation Metrics

- **Recall@k**: percentage of queries for which the correct translation appears in the top k results
- **MRR**: mean reciprocal rank of the correct translation
- **Bitext Accuracy**: accuracy at distinguishing a true pair from a random pair (all three metrics are sketched below)

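A minimal sketch of the three metrics, assuming the `sims` matrix from the setup sketch above (row i's true match is column i):

```python
# Sketch: rank-based metrics over an (n x n) query-by-candidate score matrix.
import numpy as np

def recall_at_k(sims: np.ndarray, k: int) -> float:
    # fraction of queries whose true candidate appears in the top-k ranking
    ranks = (-sims).argsort(axis=1)
    return float(np.mean([i in ranks[i, :k] for i in range(sims.shape[0])]))

def mrr(sims: np.ndarray) -> float:
    # mean of 1 / (rank of the true candidate), with ranks starting at 1
    ranks = (-sims).argsort(axis=1)
    positions = [int(np.where(ranks[i] == i)[0][0]) + 1 for i in range(sims.shape[0])]
    return float(np.mean([1.0 / p for p in positions]))

def bitext_accuracy(sims: np.ndarray, seed: int = 0) -> float:
    # true pair vs. one random non-matching pair per query
    rng = np.random.default_rng(seed)
    n = sims.shape[0]
    wins = [sims[i, i] > sims[i, (i + rng.integers(1, n)) % n] for i in range(n)]
    return float(np.mean(wins))

print(recall_at_k(sims, 10), mrr(sims), bitext_accuracy(sims))
```
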
## Source

All texts and translations are from [Sefaria](https://www.sefaria.org), a free library of Jewish texts.

### Translations

Translations come from various sources, including:
- William Davidson Talmud (Steinsaltz)
- Sefaria Community translations
- Historical translations (e.g., Friedlander's Guide for the Perplexed)

## Citation

If you use this dataset, please cite Sefaria:

```bibtex
@misc{sefaria,
  title = {Sefaria: A Living Library of Jewish Texts},
  url = {https://www.sefaria.org},
  year = {2024}
}
```

## License

The dataset is released under CC-BY 4.0, following Sefaria's licensing for their open texts.
upload_dataset.py ADDED
@@ -0,0 +1,116 @@
#!/usr/bin/env python3
"""
Upload the Rabbinic benchmark dataset to Hugging Face Hub.

Usage:
    python upload_dataset.py --repo-id YOUR_USERNAME/rabbinic-benchmark
"""

import argparse
import json
import shutil
import sys
import tempfile
from pathlib import Path


def main():
    parser = argparse.ArgumentParser(
        description="Upload Rabbinic benchmark dataset to Hugging Face Hub"
    )
    parser.add_argument(
        "--repo-id",
        type=str,
        required=True,
        help="HuggingFace repo ID (e.g., 'username/rabbinic-benchmark')",
    )
    parser.add_argument(
        "--benchmark-path",
        type=str,
        default="benchmark_data/benchmark.json",
        help="Path to benchmark JSON file",
    )
    parser.add_argument(
        "--private",
        action="store_true",
        help="Make the dataset private",
    )

    args = parser.parse_args()

    # Check that huggingface_hub is installed
    try:
        from huggingface_hub import HfApi, login, upload_folder, whoami
    except ImportError:
        print("Required packages not installed. Run:")
        print("  pip install huggingface_hub")
        return 1

    # Check current auth status; prompt for login if no valid token is found
    try:
        user_info = whoami()
        print(f"Logged in as: {user_info['name']}")
        if "orgs" in user_info:
            orgs = [org["name"] for org in user_info.get("orgs", [])]
            print(f"Organizations: {orgs}")
    except Exception as e:
        print(f"Not logged in or token issue: {e}")
        print("Running login...")
        login()

    # Load the benchmark data to verify it parses before uploading
    print(f"Loading benchmark from {args.benchmark_path}...")
    with open(args.benchmark_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    print(f"Loaded {len(data)} pairs")

    # Stage the files to upload in a temporary folder
    with tempfile.TemporaryDirectory() as tmpdir:
        tmpdir = Path(tmpdir)

        # Copy benchmark data
        data_dir = tmpdir / "data"
        data_dir.mkdir()
        shutil.copy(args.benchmark_path, data_dir / "benchmark.json")
        print("Prepared data/benchmark.json")

        # Copy README (the dataset card)
        readme_src = Path("dataset/README.md")
        if readme_src.exists():
            shutil.copy(readme_src, tmpdir / "README.md")
            print("Prepared README.md")

        # Create the repo if it does not exist yet
        api = HfApi()
        try:
            api.create_repo(
                repo_id=args.repo_id,
                repo_type="dataset",
                private=args.private,
                exist_ok=True,
            )
            print(f"Repository verified: {args.repo_id}")
        except Exception as e:
            print(f"Note: {e}")

        # Upload the folder (as a PR, in case we lack direct write access)
        print(f"\nUploading to HuggingFace Hub: {args.repo_id}...")
        commit_info = upload_folder(
            folder_path=str(tmpdir),
            repo_id=args.repo_id,
            repo_type="dataset",
            create_pr=True,  # create a PR instead of committing directly
            commit_message="Add Rabbinic Hebrew-English parallel corpus",
        )

    if commit_info.pr_url:
        print(f"\n📝 Pull Request created: {commit_info.pr_url}")
        print("   Ask an org admin to merge it.")

    print("\n✅ Dataset uploaded successfully!")
    print(f"   View at: https://huggingface.co/datasets/{args.repo_id}")

    return 0


if __name__ == "__main__":
    sys.exit(main())