Model B: Steinsaltz 15K sample, 3 epochs, loss 0.2148 (preliminary)
Files changed:
- README.md (+57)
- labels.json (+1)
- model.pt (+3)
- tokenizer.json
- tokenizer_config.json (+15)
README.md
ADDED
---
language:
- he
- arc
license: mit
tags:
- punctuation
- talmud
- hebrew
- aramaic
- token-classification
base_model: dicta-il/BEREL_3.0
pipeline_tag: token-classification
---

# Talmud Punctuator — Model B (Steinsaltz)

Fine-tuned [BEREL 3.0](https://huggingface.co/dicta-il/BEREL_3.0) for predicting punctuation in Talmudic Aramaic/Hebrew text.

**Model B** reflects the Steinsaltz/William Davidson Edition punctuation style, trained on a 15,000-sentence sample across 36 masekhtot of the Babylonian Talmud.

This is a preliminary model. For better accuracy, retrain on the full 80K-sentence dataset using the included Colab notebook (`train_on_colab.ipynb`).

## Training details

- **Base model**: BEREL 3.0 (`dicta-il/BEREL_3.0`)
- **Head**: linear classification (768 → 8 labels)
- **Data**: 15,000 sentences sampled from 36 masekhtot (~80K available)
- **Epochs**: 3, batch size 16, learning rate 2e-5
- **Final loss**: 0.2148
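The architecture described above (a BEREL encoder feeding a linear 768 → num-labels token-classification head) can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the repository's actual training code; the class name `PunctuationTagger` and the `num_labels` default are invented for the example.

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class PunctuationTagger(nn.Module):
    """Sketch of a BERT-style encoder with a linear token-classification head."""

    def __init__(self, base_model: str = "dicta-il/BEREL_3.0", num_labels: int = 8):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base_model)
        hidden = self.encoder.config.hidden_size  # 768 for BEREL 3.0
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        # One hidden vector per subword token, then one score per label per token.
        hidden_states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        return self.classifier(hidden_states)  # shape: (batch, seq_len, num_labels)
```

At inference time, an argmax over the last dimension yields one label index per subword token, which then has to be aligned back to whole words.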
## Labels

| Label | Meaning |
|-------|---------|
| `O` | No punctuation |
| `,` | Comma |
| `.` | Period |
| `:` | Colon |
| `;` | Semicolon |
| `?` | Question mark |
| `!` | Exclamation mark |
| `—` | Em-dash |
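For programmatic decoding, the `labels.json` shipped in this commit maps class indices to labels. Note that it lists seven tags and omits the em-dash, while the table above shows eight. A minimal decoding sketch, inlining the shipped mapping; the `decode` helper is illustrative, not part of the repo:

```python
import json

# idx2label exactly as shipped in this commit's labels.json (seven tags, no em-dash).
labels = json.loads(
    '{"label2idx": {"!": 0, ",": 1, ".": 2, ":": 3, ";": 4, "?": 5, "O": 6}, '
    '"idx2label": {"0": "!", "1": ",", "2": ".", "3": ":", "4": ";", "5": "?", "6": "O"}, '
    '"num_tags": 7}'
)
# JSON object keys are strings, so convert them back to ints for indexing.
idx2label = {int(k): v for k, v in labels["idx2label"].items()}


def decode(pred_indices):
    """Map per-token argmax indices to punctuation labels."""
    return [idx2label[i] for i in pred_indices]


print(decode([6, 1, 6, 2]))  # ['O', ',', 'O', '.']
```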
## Usage

Use with the `punctuator.py` script from [mivami](https://github.com/joshuawaxman/mivami).
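To see how per-token predictions turn back into punctuated text, the core step is appending each non-`O` label to its word. A toy sketch (the real `punctuator.py` presumably also handles tokenization, subword alignment, and batching; `apply_punctuation` is a hypothetical helper):

```python
def apply_punctuation(words, labels):
    """Re-insert predicted punctuation after each word; 'O' means no mark."""
    out = []
    for word, label in zip(words, labels):
        out.append(word if label == "O" else word + label)
    return " ".join(out)


print(apply_punctuation(["אמר", "רבא"], ["O", "."]))  # אמר רבא.
```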
## Upgrading to the full dataset

Use the Colab notebook to train on all 80K sentences with a free GPU:

1. Set `CONTINUE_FROM = "Joshua2/talmud-punctuator-B"` to start from this checkpoint.
2. Upload `Steinsaltz_combined.txt`.
3. Run on a T4 GPU (~3-5 hours).
labels.json
ADDED

{"label2idx": {"!": 0, ",": 1, ".": 2, ":": 3, ";": 4, "?": 5, "O": 6}, "idx2label": {"0": "!", "1": ",", "2": ".", "3": ":", "4": ";", "5": "?", "6": "O"}, "num_tags": 7}
model.pt
ADDED

version https://git-lfs.github.com/spec/v1
oid sha256:ca386256395d2c1c2163e86ad5f57ec2812d509e81e654b916423eb3a2c7a208
size 737468879
tokenizer.json
ADDED

(The diff for this file is too large to render; see the raw file.)
tokenizer_config.json
ADDED

{
  "backend": "tokenizers",
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_lower_case": true,
  "is_local": false,
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}