Talmud Punctuator: Model A
Fine-tuned BEREL 3.0 for predicting punctuation in Talmudic Aramaic/Hebrew text.
This is Model A, one punctuation style among several possible annotator conventions. Different annotators produce different gold-standard punctuation, reflecting legitimate stylistic variation in how Talmudic text is punctuated. Each annotator's data yields a distinct model with its own punctuation preferences.
Task
For each word in the Talmud, the model predicts the trailing punctuation mark:
| Label | Meaning |
|---|---|
| `O` | No punctuation |
| `,` | Comma: pause within a clause |
| `.` | Period: end of statement |
| `:` | Colon: introduces speech or explanation |
| `;` | Semicolon: separates related clauses |
| `?` | Question mark |
| `!` | Exclamation mark / rhetorical challenge |
| `—` | Em-dash |
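As a minimal sketch of how these labels could be applied, the snippet below attaches a predicted mark to each word. The label order in `LABELS` and the helper `apply_labels` are assumptions for illustration; the real id-to-label mapping is whatever the saved model's configuration defines.

```python
# Hypothetical label inventory; the order is an assumption, not the model's actual id mapping.
LABELS = ["O", ",", ".", ":", ";", "?", "!", "—"]


def apply_labels(words, labels):
    """Append each predicted trailing mark to its word; "O" leaves the word bare."""
    if len(words) != len(labels):
        raise ValueError("expected exactly one label per word")
    out = []
    for word, label in zip(words, labels):
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        out.append(word if label == "O" else word + label)
    return " ".join(out)


# Placeholder tokens stand in for Talmudic words.
print(apply_labels(["w1", "w2", "w3"], ["O", ",", "."]))  # prints: w1 w2, w3.
```

Because every word receives exactly one label, the task is a per-word classification over eight classes rather than free-form text generation.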
Architecture
- Base model: BEREL 3.0 (`dicta-il/BEREL_3.0`), a BERT model pre-trained on historical Hebrew texts
- Head: Linear classification layer (768 → 8 labels)
- Training: 5 epochs, AdamW optimizer, learning rate 2e-5, batch size 16
- Parameters: ~184M total
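For a sense of scale, the arithmetic below counts the parameters in the classification head. It assumes a standard linear layer with a bias term, which this card does not state explicitly; the variable names are illustrative.

```python
HIDDEN_SIZE = 768   # encoder output width, from the card
NUM_LABELS = 8      # punctuation classes, from the card

# Assumption: a plain linear layer with bias (one weight per input-label pair, one bias per label).
head_params = HIDDEN_SIZE * NUM_LABELS + NUM_LABELS
print(head_params)  # prints: 6152

# Share of the ~184M total reported above (approximate, since that total is rounded).
share = head_params / 184_000_000
print(f"{share:.6%}")
```

Under that assumption, nearly all of the model's capacity sits in the pre-trained encoder, and fine-tuning adds only a few thousand new weights on top of it.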
Usage
This model is designed to be used with the `punctuator.py` script from the mivami project, which handles the Al HaTorah markup encoding (nikud, abbreviation tags, daf markers, etc.).
# Download and install
pip install torch transformers numpy scikit-learn
# Apply to an Al HaTorah encoded text file
python punctuator.py predict \
--input YourMasechet.txt \
--model-dir saved_model \
--output YourMasechet_predicted.txt
The script preserves all encoding in the output; only punctuation marks are modified.
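To process several files, the documented command can be driven from Python. This is a sketch only: `build_command` and `predict_all` are hypothetical helpers, the `_predicted` output naming simply mirrors the example above, and only the flags shown in that example are used.

```python
import subprocess
import sys
from pathlib import Path


def build_command(input_path, model_dir="saved_model", script="punctuator.py"):
    """Build the documented `predict` invocation for one input file."""
    src = Path(input_path)
    # Output naming mirrors the example above: <name>_predicted.txt next to the input.
    dst = src.with_name(f"{src.stem}_predicted{src.suffix}")
    return [
        sys.executable, script, "predict",
        "--input", str(src),
        "--model-dir", str(model_dir),
        "--output", str(dst),
    ]


def predict_all(input_paths, **kwargs):
    """Run the script once per file, stopping at the first failure."""
    for path in input_paths:
        subprocess.run(build_command(path, **kwargs), check=True)


# Show the arguments that would be passed for a single file.
print(build_command("YourMasechet.txt")[1:])
```

Calling `predict_all(["A.txt", "B.txt"])` would then write `A_predicted.txt` and `B_predicted.txt`, provided `punctuator.py` and the model directory are present in the working directory.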
Input Format
The model expects Al HaTorah encoded Talmud text with:
- Daf markers: `{דף ב.}`
- Abbreviation tags: `<abb>...</abb><openabb>...</openabb>`
- Note markers: `<EMNM>...</EMNM>`
- Formatting tags: `<b>`, `<h2>`, `<dots>`
- Full nikud (vowel points)
The preprocessing pipeline:
- Expands abbreviations (keeps `<openabb>` content, drops `<abb>` content)
- Strips nikud for model input
- Predicts punctuation per word
- Projects predictions back onto the original encoded text
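The first two steps listed above can be sketched in a few lines of standard-library Python. This is not the code in `punctuator.py`: the function names are hypothetical, the regular expressions assume tags are written exactly as shown in the input format, and other markup (daf markers, note markers, formatting tags) is left untouched here.

```python
import re
import unicodedata

ABB_RE = re.compile(r"<abb>.*?</abb>", re.DOTALL)  # abbreviated form: dropped entirely
OPENABB_TAG_RE = re.compile(r"</?openabb>")         # expansion: content kept, tags removed


def expand_abbreviations(text):
    """Keep <openabb> content and drop <abb> content."""
    return OPENABB_TAG_RE.sub("", ABB_RE.sub("", text))


def strip_nikud(text):
    """Remove combining marks (Unicode category Mn).

    This covers Hebrew vowel points and also removes cantillation marks.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def model_input_words(text):
    """Expand abbreviations, strip nikud, then split on whitespace."""
    return strip_nikud(expand_abbreviations(text)).split()


sample = '<abb>ר"ש</abb><openabb>רַבִּי שִׁמְעוֹן</openabb> אָמַר'
print(model_input_words(sample))
```

Projecting predictions back requires remembering where each cleaned word came from in the encoded text, which this sketch does not attempt.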
Limitations
- Trained on a single masechet; performance on other masekhtot may vary
- Punctuation is inherently subjective: this model reflects one annotator's conventions
- The model sometimes removes exclamation marks from Talmudic challenges and drops commas from enumerated lists
- Best results on Babylonian Talmud text in Al HaTorah's encoding format