Talmud Punctuator — Model A

Fine-tuned BEREL 3.0 for predicting punctuation in Talmudic Aramaic/Hebrew text.

This is Model A — one punctuation style among several possible annotator conventions. Different annotators produce different gold-standard punctuation, reflecting legitimate stylistic variation in how Talmudic text is punctuated. Each annotator's data yields a distinct model with its own punctuation preferences.

Task

For each word in the Talmud, the model predicts the trailing punctuation mark:

Label	Meaning
`O`	No punctuation
`,`	Comma — pause within a clause
`.`	Period — end of statement
`:`	Colon — introduces speech or explanation
`;`	Semicolon — separates related clauses
`?`	Question mark
`!`	Exclamation mark / rhetorical challenge
`—`	Em-dash
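As a minimal sketch of how per-word labels turn back into punctuated text, the snippet below attaches each predicted mark to its word. The index order of `LABELS` and the helper name `apply_labels` are assumptions made for illustration; the label order stored in the model's own config is not stated here.

```python
# Hypothetical label inventory: the marks come from the table above,
# but the index order is an assumption, not read from the model config.
LABELS = ["O", ",", ".", ":", ";", "?", "!", "—"]


def apply_labels(words, labels):
    """Append each word's predicted trailing mark; 'O' leaves the word bare."""
    if len(words) != len(labels):
        raise ValueError("need exactly one label per word")
    out = []
    for word, label in zip(words, labels):
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        out.append(word if label == "O" else word + label)
    return " ".join(out)


if __name__ == "__main__":
    print(apply_labels(["אמר", "רבא", "מאי", "טעמא"], ["O", ":", "O", "?"]))
```

Because every word receives exactly one label, the task is a plain per-word classification problem with 8 classes.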

Architecture

Base model: BEREL 3.0 (dicta-il/BEREL_3.0) — a BERT model pre-trained on historical Hebrew texts
Head: Linear classification layer (768 → 8 labels)
Training: 5 epochs, AdamW optimizer, learning rate 2e-5, batch size 16
Parameters: ~184M total
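To make the head size concrete: a linear layer from 768 hidden units to 8 labels has 768 × 8 weights plus 8 biases. The sketch below computes that figure and collects the hyperparameters listed above in a plain dataclass. The class and field names are illustrative assumptions and are not taken from punctuator.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    # Values as listed above; the field names themselves are hypothetical.
    base_model: str = "dicta-il/BEREL_3.0"
    hidden_size: int = 768
    num_labels: int = 8
    epochs: int = 5
    learning_rate: float = 2e-5
    batch_size: int = 16


def head_parameter_count(cfg: TrainingConfig) -> int:
    """Weights plus biases of one linear layer hidden_size -> num_labels."""
    return cfg.hidden_size * cfg.num_labels + cfg.num_labels


if __name__ == "__main__":
    cfg = TrainingConfig()
    print(head_parameter_count(cfg))  # 768 * 8 + 8 = 6152
```

The head therefore accounts for a negligible share of the ~184M total; almost all parameters sit in the BEREL encoder.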

Usage

This model is designed to be used with the punctuator.py script from the mivami project, which handles the Al HaTorah markup encoding (nikud, abbreviation tags, daf markers, etc.).

# Install dependencies
pip install torch transformers numpy scikit-learn

# Apply to an Al HaTorah encoded text file
python punctuator.py predict \
  --input YourMasechet.txt \
  --model-dir saved_model \
  --output YourMasechet_predicted.txt

The script preserves all encoding in the output — only punctuation marks are modified.

Input Format

The model expects Al HaTorah encoded Talmud text with:

Daf markers: {דף ב.}
Abbreviation tags: <abb>...</abb><openabb>...</openabb>
Note markers: <EMNM>...</EMNM>
Formatting tags: <b>, <h2>, <dots>
Full nikud (vowel points)
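For readers who want to inspect a file before running the script, here is a small sketch that locates daf markers and lists the tag names used in a text. Both regular expressions are inferred only from the examples above and are assumptions; the real Al HaTorah encoding may include cases they do not cover.

```python
import re

# Hypothetical patterns inferred from the examples listed above.
DAF_MARKER = re.compile(r"\{דף [^}]+\}")
TAG = re.compile(r"</?([A-Za-z0-9]+)>")


def find_daf_markers(text):
    """Return every daf marker such as {דף ב.} in order of appearance."""
    return DAF_MARKER.findall(text)


def tag_names(text):
    """Return the sorted set of tag names (opening or closing) found in text."""
    return sorted(set(TAG.findall(text)))


if __name__ == "__main__":
    sample = "{דף ב.} <b>text</b> <abb>a</abb><openabb>b</openabb> <dots>"
    print(find_daf_markers(sample))
    print(tag_names(sample))
```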

The preprocessing pipeline:

Expands abbreviations (keeps <openabb> content, drops <abb> content)
Strips nikud for model input
Predicts punctuation per word
Projects predictions back onto the original encoded text
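The first two steps above can be sketched roughly as follows. This is not the code from punctuator.py: it assumes each abbreviation appears as an `<abb>` element immediately followed by its `<openabb>` expansion, and it treats nikud as Unicode combining marks (category Mn). The function names are hypothetical.

```python
import re
import unicodedata

# Assumption: <abb>short</abb> is directly followed by <openabb>expanded</openabb>.
ABBREVIATION = re.compile(r"<abb>.*?</abb>\s*<openabb>(.*?)</openabb>", re.DOTALL)


def expand_abbreviations(text: str) -> str:
    """Keep the <openabb> content and drop the <abb> content."""
    return ABBREVIATION.sub(r"\1", text)


def strip_nikud(text: str) -> str:
    """Drop combining marks (category Mn), which include Hebrew vowel points."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def model_input(text: str) -> str:
    """Apply both steps in the order listed above."""
    return strip_nikud(expand_abbreviations(text))


if __name__ == "__main__":
    print(model_input("<abb>ר'</abb><openabb>רַבִּי</openabb> אָמַר"))
```

Filtering on category Mn also removes cantillation marks, while leaving letters and the maqaf untouched. Since the original text is never modified by these steps, predictions can later be projected back onto it word by word.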

Limitations

Trained on a single masechet; performance on other masechtot may vary
Punctuation is inherently subjective — this model reflects one annotator's conventions
The model sometimes removes exclamation marks from Talmudic challenges and drops commas from enumerated lists
Best results on Babylonian Talmud text in Al HaTorah's encoding format


Model repository: Joshua2/talmud-punctuator-A (fine-tuned from dicta-il/BEREL_3.0)