Lev Israel committed on
Commit a9dad42 · 1 Parent(s): 018c4c5

Upload dataset to HF

Files changed (2)
  1. dataset/README.md +97 -0
  2. upload_dataset.py +116 -0
dataset/README.md ADDED
@@ -0,0 +1,97 @@
---
language:
- he
- arc
- en
license: cc-by-4.0
task_categories:
- sentence-similarity
- text-retrieval
tags:
- rabbinic
- hebrew
- aramaic
- talmud
- cross-lingual
- bitext
- sefaria
size_categories:
- 1K<n<10K
---

# Rabbinic Hebrew/Aramaic - English Parallel Corpus

A benchmark dataset for evaluating embedding models on Rabbinic Hebrew and Aramaic texts, with parallel English translations sourced from [Sefaria](https://www.sefaria.org).

## Dataset Description

This dataset contains 3,708 parallel text pairs spanning diverse Rabbinic literature across multiple centuries and genres. It is designed for evaluating cross-lingual embedding models on their ability to align Hebrew/Aramaic source texts with English translations.

### Languages

- **Source**: Rabbinic Hebrew, Jewish Babylonian Aramaic, Jewish Palestinian Aramaic
- **Target**: English

### Dataset Structure

Each example contains the following fields (see the loading sketch below):
- `ref`: Sefaria reference string (e.g., "Berakhot.2a:1")
- `he`: Hebrew/Aramaic source text
- `en`: English translation
- `category`: Text category

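A minimal loading sketch, assuming the layout produced by `upload_dataset.py` below (`data/benchmark.json` inside the dataset repo); the repo id shown is a placeholder:

```python
# Sketch: fetch the raw JSON from the Hub and inspect one record.
import json

from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="username/rabbinic-benchmark",  # placeholder; substitute the real repo id
    filename="data/benchmark.json",
    repo_type="dataset",
)
with open(path, encoding="utf-8") as f:
    pairs = json.load(f)

print(len(pairs))                             # 3708
print(pairs[0]["ref"], pairs[0]["category"])  # e.g. a Sefaria ref and its category
```
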
### Categories

| Category | Count | Description |
|----------|-------|-------------|
| Mishnah | 789 | Tannaitic legal compilation (~200 CE) |
| Tanakh Commentary | 674 | Rashi, Ramban, Radak, Rabbeinu Behaye on Torah |
| Jerusalem Talmud | 520 | Palestinian Talmud (~400 CE) |
| Talmud | 480 | Babylonian Talmud (~500 CE) |
| Midrash Rabbah | 393 | Midrashic compilations |
| Hasidic/Kabbalistic | 304 | Likutei Moharan, Tomer Devorah, Kalach Pitchei Chokhmah |
| Philosophy | 240 | Guide for the Perplexed, Sefer HaIkkarim |
| Halacha | 160 | Sefer HaChinukh, Mishneh Torah |
| Mussar/Ethics | 108 | Chafetz Chaim, Kav HaYashar, Iggeret HaRamban |
| Targum | 40 | Aramaic Targum to Song of Songs |

## Intended Use

### Primary Use Case

Evaluating embedding models for cross-lingual retrieval (a setup sketch follows this list):
- Given a Hebrew/Aramaic text, can the model find its English translation from a pool of candidates?
- Models that excel at this task likely capture the semantics of Rabbinic literature well.

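A hedged setup sketch, continuing from the loading snippet above; `sentence-transformers` and the `intfloat/multilingual-e5-base` checkpoint are illustrative choices, not part of the benchmark:

```python
# Sketch: embed both sides and score every Hebrew/Aramaic query against
# every English candidate. Any multilingual encoder could be substituted.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")  # example model choice

he_texts = [p["he"] for p in pairs]  # `pairs` as loaded in the snippet above
en_texts = [p["en"] for p in pairs]

he_emb = model.encode(he_texts, normalize_embeddings=True)
en_emb = model.encode(en_texts, normalize_embeddings=True)

# With unit-normalized embeddings, the dot product is cosine similarity;
# sims[i, j] scores query i against candidate j, so sims[i, i] is the true pair.
sims = he_emb @ en_emb.T
```
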
### Evaluation Metrics

- **Recall@k**: percentage of queries for which the correct translation appears in the top k results
- **MRR**: mean reciprocal rank of the correct translation
- **Bitext Accuracy**: accuracy at distinguishing a true pair from a random pair (all three metrics are sketched below)

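A minimal sketch of the three metrics, assuming the `sims` matrix from the setup sketch above (row i's true match is column i):

```python
# Sketch: rank-based metrics over an (n x n) query-by-candidate score matrix.
import numpy as np

def recall_at_k(sims: np.ndarray, k: int) -> float:
    # fraction of queries whose true candidate appears in the top-k ranking
    ranks = (-sims).argsort(axis=1)
    return float(np.mean([i in ranks[i, :k] for i in range(sims.shape[0])]))

def mrr(sims: np.ndarray) -> float:
    # mean of 1 / (rank of the true candidate), with ranks starting at 1
    ranks = (-sims).argsort(axis=1)
    positions = [int(np.where(ranks[i] == i)[0][0]) + 1 for i in range(sims.shape[0])]
    return float(np.mean([1.0 / p for p in positions]))

def bitext_accuracy(sims: np.ndarray, seed: int = 0) -> float:
    # true pair vs. one random non-matching pair per query
    rng = np.random.default_rng(seed)
    n = sims.shape[0]
    wins = [sims[i, i] > sims[i, (i + rng.integers(1, n)) % n] for i in range(n)]
    return float(np.mean(wins))

print(recall_at_k(sims, 10), mrr(sims), bitext_accuracy(sims))
```
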
## Source

All texts and translations are from [Sefaria](https://www.sefaria.org), a free library of Jewish texts.

### Translations

Translations come from various sources, including:
- William Davidson Talmud (Steinsaltz)
- Sefaria Community translations
- Historical translations (e.g., Friedlander's Guide for the Perplexed)

## Citation

If you use this dataset, please cite Sefaria:

```bibtex
@misc{sefaria,
  title = {Sefaria: A Living Library of Jewish Texts},
  url = {https://www.sefaria.org},
  year = {2024}
}
```

## License

The dataset is released under CC-BY 4.0, following Sefaria's licensing for their open texts.
upload_dataset.py ADDED
@@ -0,0 +1,116 @@
#!/usr/bin/env python3
"""
Upload the Rabbinic benchmark dataset to Hugging Face Hub.

Usage:
    python upload_dataset.py --repo-id YOUR_USERNAME/rabbinic-benchmark
"""

import argparse
import json
import shutil
import sys
import tempfile
from pathlib import Path


def main():
    parser = argparse.ArgumentParser(
        description="Upload Rabbinic benchmark dataset to Hugging Face Hub"
    )
    parser.add_argument(
        "--repo-id",
        type=str,
        required=True,
        help="HuggingFace repo ID (e.g., 'username/rabbinic-benchmark')",
    )
    parser.add_argument(
        "--benchmark-path",
        type=str,
        default="benchmark_data/benchmark.json",
        help="Path to benchmark JSON file",
    )
    parser.add_argument(
        "--private",
        action="store_true",
        help="Make the dataset private",
    )

    args = parser.parse_args()

    # Check that huggingface_hub is installed
    try:
        from huggingface_hub import HfApi, login, upload_folder, whoami
    except ImportError:
        print("Required packages not installed. Run:")
        print("  pip install huggingface_hub")
        return 1

    # Check current auth status; prompt for login if no valid token is found
    try:
        user_info = whoami()
        print(f"Logged in as: {user_info['name']}")
        if "orgs" in user_info:
            orgs = [org["name"] for org in user_info.get("orgs", [])]
            print(f"Organizations: {orgs}")
    except Exception as e:
        print(f"Not logged in or token issue: {e}")
        print("Running login...")
        login()

    # Load the benchmark data to verify it parses before uploading
    print(f"Loading benchmark from {args.benchmark_path}...")
    with open(args.benchmark_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    print(f"Loaded {len(data)} pairs")

    # Stage the files to upload in a temporary folder
    with tempfile.TemporaryDirectory() as tmpdir:
        tmpdir = Path(tmpdir)

        # Copy benchmark data
        data_dir = tmpdir / "data"
        data_dir.mkdir()
        shutil.copy(args.benchmark_path, data_dir / "benchmark.json")
        print("Prepared data/benchmark.json")

        # Copy README (the dataset card)
        readme_src = Path("dataset/README.md")
        if readme_src.exists():
            shutil.copy(readme_src, tmpdir / "README.md")
            print("Prepared README.md")

        # Create the repo if it does not exist yet
        api = HfApi()
        try:
            api.create_repo(
                repo_id=args.repo_id,
                repo_type="dataset",
                private=args.private,
                exist_ok=True,
            )
            print(f"Repository verified: {args.repo_id}")
        except Exception as e:
            print(f"Note: {e}")

        # Upload the folder (as a PR, in case we lack direct write access)
        print(f"\nUploading to HuggingFace Hub: {args.repo_id}...")
        commit_info = upload_folder(
            folder_path=str(tmpdir),
            repo_id=args.repo_id,
            repo_type="dataset",
            create_pr=True,  # create a PR instead of committing directly
            commit_message="Add Rabbinic Hebrew-English parallel corpus",
        )

    if commit_info.pr_url:
        print(f"\n📝 Pull Request created: {commit_info.pr_url}")
        print("   Ask an org admin to merge it.")

    print("\n✅ Dataset uploaded successfully!")
    print(f"   View at: https://huggingface.co/datasets/{args.repo_id}")

    return 0


if __name__ == "__main__":
    sys.exit(main())