Joshua2 committed
Commit a4ba7c3 · verified · 1 Parent(s): 37f7b9f

Model B: Steinsaltz 15K sample, 3 epochs, loss 0.2148 (preliminary)

Files changed (5)
  1. README.md +57 -0
  2. labels.json +1 -0
  3. model.pt +3 -0
  4. tokenizer.json +0 -0
  5. tokenizer_config.json +15 -0
README.md ADDED
@@ -0,0 +1,57 @@
+ ---
+ language:
+ - he
+ - arc
+ license: mit
+ tags:
+ - punctuation
+ - talmud
+ - hebrew
+ - aramaic
+ - token-classification
+ base_model: dicta-il/BEREL_3.0
+ pipeline_tag: token-classification
+ ---
+
+ # Talmud Punctuator — Model B (Steinsaltz)
+
+ Fine-tuned [BEREL 3.0](https://huggingface.co/dicta-il/BEREL_3.0) for predicting punctuation
+ in Talmudic Aramaic/Hebrew text.
+
+ **Model B** reflects the Steinsaltz/William Davidson Edition punctuation style,
+ trained on a 15,000-sentence sample across 36 masekhtot of the Babylonian Talmud.
+
+ This is a preliminary model. For better accuracy, retrain on the full 80K-sentence
+ dataset using the included Colab notebook (`train_on_colab.ipynb`).
+
+ ## Training details
+
+ - **Base model**: BEREL 3.0 (`dicta-il/BEREL_3.0`)
+ - **Head**: Linear classification (768 → 8 labels)
+ - **Data**: 15,000 sentences sampled from 36 masekhtot (~80K available)
+ - **Epochs**: 3
+ - **Batch size**: 16
+ - **Learning rate**: 2e-5
+ - **Final loss**: 0.2148
+
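The details above frame punctuation restoration as token classification: each token is labeled with the mark that follows it (or `O` for none). A minimal, hypothetical sketch of that labeling scheme — this is not the repo's actual preprocessing code, and real inputs are Hebrew/Aramaic rather than the placeholder words used here:

```python
# Illustrative only: convert a punctuated sentence into (word, label)
# pairs, where each word's label is the punctuation mark that follows
# it, or "O" if none follows.

PUNCT = {",", ".", ":", ";", "?", "!"}

def to_token_labels(punctuated: str) -> list[tuple[str, str]]:
    """Split a punctuated sentence into (word, label) pairs."""
    pairs = []
    for raw in punctuated.split():
        word, label = raw, "O"
        # Peel one trailing punctuation mark off the word, if present.
        if raw and raw[-1] in PUNCT:
            word, label = raw[:-1], raw[-1]
        if word:  # skip tokens that were punctuation only
            pairs.append((word, label))
    return pairs

pairs = to_token_labels("one two, three. four")
# → [("one", "O"), ("two", ","), ("three", "."), ("four", "O")]
```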
+ ## Labels
+
+ | Label | Meaning |
+ |-------|---------|
+ | `O` | No punctuation |
+ | `,` | Comma |
+ | `.` | Period |
+ | `:` | Colon |
+ | `;` | Semicolon |
+ | `?` | Question mark |
+ | `!` | Exclamation mark |
+ | `—` | Em-dash |
+
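The `labels.json` added in this commit stores the index↔label mapping (it contains seven tags, while the table above lists eight). A minimal sketch of decoding the classification head's argmax indices back to marks with that file's contents:

```python
import json

# The exact contents of labels.json as added in this commit.
labels = json.loads(
    '{"label2idx": {"!": 0, ",": 1, ".": 2, ":": 3, ";": 4, "?": 5, "O": 6}, '
    '"idx2label": {"0": "!", "1": ",", "2": ".", "3": ":", "4": ";", "5": "?", "6": "O"}, '
    '"num_tags": 7}'
)

# JSON object keys are strings, so cast indices back to int for lookup.
idx2label = {int(i): mark for i, mark in labels["idx2label"].items()}

def decode(indices: list[int]) -> list[str]:
    """Map prediction indices from the head back to punctuation labels."""
    return [idx2label[i] for i in indices]

print(decode([6, 1, 6, 2]))  # ['O', ',', 'O', '.']
```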
+ ## Usage
+
+ Use with the `punctuator.py` script from [mivami](https://github.com/joshuawaxman/mivami).
+
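`punctuator.py` is the supported entry point. As an illustration of the final step such a script must perform — reattaching each predicted label to its token — here is a hedged sketch; the function name and signature are hypothetical, not the script's actual API:

```python
def apply_punctuation(words: list[str], labels: list[str]) -> str:
    """Reattach predicted punctuation labels ("O" = none) to their words.

    Illustrative only; not the interface of the repo's punctuator.py.
    """
    out = []
    for word, label in zip(words, labels):
        out.append(word if label == "O" else word + label)
    return " ".join(out)

text = apply_punctuation(["one", "two", "three"], [",", "O", "."])
# → "one, two three."
```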
+ ## Upgrading to full dataset
+
+ Use the Colab notebook to train on all 80K sentences with a free GPU:
+
+ 1. Set `CONTINUE_FROM = "Joshua2/talmud-punctuator-B"` to start from this checkpoint.
+ 2. Upload `Steinsaltz_combined.txt`.
+ 3. Run on a T4 GPU (~3-5 hours).
labels.json ADDED
@@ -0,0 +1 @@
+ {"label2idx": {"!": 0, ",": 1, ".": 2, ":": 3, ";": 4, "?": 5, "O": 6}, "idx2label": {"0": "!", "1": ",", "2": ".", "3": ":", "4": ";", "5": "?", "6": "O"}, "num_tags": 7}
model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ca386256395d2c1c2163e86ad5f57ec2812d509e81e654b916423eb3a2c7a208
+ size 737468879
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "backend": "tokenizers",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_lower_case": true,
+   "is_local": false,
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }