Model B: Steinsaltz 15K sample, 3 epochs, loss 0.2148 (preliminary)
Files changed:
- README.md (+57)
- labels.json (+1)
- model.pt (+3)
- tokenizer.json
- tokenizer_config.json (+15)
README.md
ADDED
---
language:
- he
- arc
license: mit
tags:
- punctuation
- talmud
- hebrew
- aramaic
- token-classification
base_model: dicta-il/BEREL_3.0
pipeline_tag: token-classification
---

# Talmud Punctuator — Model B (Steinsaltz)

Fine-tuned [BEREL 3.0](https://huggingface.co/dicta-il/BEREL_3.0) for predicting punctuation in Talmudic Aramaic/Hebrew text.

**Model B** reflects the Steinsaltz/William Davidson Edition punctuation style, trained on a 15,000-sentence sample across 36 masekhtot of the Babylonian Talmud.

This is a preliminary model. For better accuracy, retrain on the full 80K-sentence dataset using the included Colab notebook (`train_on_colab.ipynb`).

## Training details

- **Base model**: BEREL 3.0 (`dicta-il/BEREL_3.0`)
- **Head**: linear classification (768 → 8 labels)
- **Data**: 15,000 sentences sampled from 36 masekhtot (~80K available)
- **Epochs**: 3, batch size 16, learning rate 2e-5
- **Final loss**: 0.2148
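The architecture described above (a BEREL encoder feeding a linear 768 → num-labels token-classification head) can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the repository's actual training code; the class name `PunctuationTagger` and the `num_labels` default are invented for the example.

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class PunctuationTagger(nn.Module):
    """Sketch of a BERT-style encoder with a linear token-classification head."""

    def __init__(self, base_model: str = "dicta-il/BEREL_3.0", num_labels: int = 8):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base_model)
        hidden = self.encoder.config.hidden_size  # 768 for BEREL 3.0
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        # One hidden vector per subword token, then one score per label per token.
        hidden_states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        return self.classifier(hidden_states)  # shape: (batch, seq_len, num_labels)
```

At inference time, an argmax over the last dimension yields one label index per subword token, which then has to be aligned back to whole words.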
## Labels

| Label | Meaning |
|-------|---------|
| `O` | No punctuation |
| `,` | Comma |
| `.` | Period |
| `:` | Colon |
| `;` | Semicolon |
| `?` | Question mark |
| `!` | Exclamation mark |
| `—` | Em-dash |
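For programmatic decoding, the `labels.json` shipped in this commit maps class indices to labels. Note that it lists seven tags and omits the em-dash, while the table above shows eight. A minimal decoding sketch, inlining the shipped mapping; the `decode` helper is illustrative, not part of the repo:

```python
import json

# idx2label exactly as shipped in this commit's labels.json (seven tags, no em-dash).
labels = json.loads(
    '{"label2idx": {"!": 0, ",": 1, ".": 2, ":": 3, ";": 4, "?": 5, "O": 6}, '
    '"idx2label": {"0": "!", "1": ",", "2": ".", "3": ":", "4": ";", "5": "?", "6": "O"}, '
    '"num_tags": 7}'
)
# JSON object keys are strings, so convert them back to ints for indexing.
idx2label = {int(k): v for k, v in labels["idx2label"].items()}


def decode(pred_indices):
    """Map per-token argmax indices to punctuation labels."""
    return [idx2label[i] for i in pred_indices]


print(decode([6, 1, 6, 2]))  # ['O', ',', 'O', '.']
```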
## Usage

Use with the `punctuator.py` script from [mivami](https://github.com/joshuawaxman/mivami).
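To see how per-token predictions turn back into punctuated text, the core step is appending each non-`O` label to its word. A toy sketch (the real `punctuator.py` presumably also handles tokenization, subword alignment, and batching; `apply_punctuation` is a hypothetical helper):

```python
def apply_punctuation(words, labels):
    """Re-insert predicted punctuation after each word; 'O' means no mark."""
    out = []
    for word, label in zip(words, labels):
        out.append(word if label == "O" else word + label)
    return " ".join(out)


print(apply_punctuation(["אמר", "רבא"], ["O", "."]))  # אמר רבא.
```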
## Upgrading to the full dataset

Use the Colab notebook to train on all 80K sentences with a free GPU:

1. Set `CONTINUE_FROM = "Joshua2/talmud-punctuator-B"` to start from this checkpoint.
2. Upload `Steinsaltz_combined.txt`.
3. Run on a T4 GPU (~3-5 hours).
labels.json
ADDED

{"label2idx": {"!": 0, ",": 1, ".": 2, ":": 3, ";": 4, "?": 5, "O": 6}, "idx2label": {"0": "!", "1": ",", "2": ".", "3": ":", "4": ";", "5": "?", "6": "O"}, "num_tags": 7}
model.pt
ADDED

version https://git-lfs.github.com/spec/v1
oid sha256:ca386256395d2c1c2163e86ad5f57ec2812d509e81e654b916423eb3a2c7a208
size 737468879
tokenizer.json
ADDED

(The diff for this file is too large to render; see the raw file.)
tokenizer_config.json
ADDED

{
  "backend": "tokenizers",
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_lower_case": true,
  "is_local": false,
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}