Talmud Punctuator: Model A

Fine-tuned BEREL 3.0 for predicting punctuation in Talmudic Aramaic/Hebrew text.

This is Model A, one punctuation style among several possible annotator conventions. Different annotators produce different gold-standard punctuation, reflecting legitimate stylistic variation in how Talmudic text is punctuated. Each annotator's data yields a distinct model with its own punctuation preferences.

Task

For each word in the Talmud, the model predicts the trailing punctuation mark:

Label   Meaning
O       No punctuation
,       Comma: pause within a clause
.       Period: end of statement
:       Colon: introduces speech or explanation
;       Semicolon: separates related clauses
?       Question mark
!       Exclamation mark / rhetorical challenge
—       Em-dash
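
To make the label set concrete, here is a minimal Python sketch of a label mapping and of attaching predicted marks to words. The index order of the labels and the helper name attach_punctuation are assumptions for illustration; the trained model's own configuration defines the real id-to-label mapping, and punctuator.py may attach marks differently.

```python
# The eight labels from the table, in the order they are listed.
# ASSUMPTION: this index order is illustrative only; the trained model's
# configuration defines the real id-to-label mapping.
LABELS = ["O", ",", ".", ":", ";", "?", "!", "\u2014"]  # \u2014 is the em-dash label
ID2LABEL = dict(enumerate(LABELS))
LABEL2ID = {label: i for i, label in ID2LABEL.items()}


def attach_punctuation(words, label_ids):
    """Append each predicted mark to its word; the "O" label adds nothing."""
    if len(words) != len(label_ids):
        raise ValueError("need exactly one label id per word")
    out = []
    for word, label_id in zip(words, label_ids):
        mark = ID2LABEL[label_id]
        out.append(word if mark == "O" else word + mark)
    return " ".join(out)


# Hypothetical predictions for a four-word phrase.
words = ["אמר", "רבא", "מנא", "לן"]
print(attach_punctuation(words, [0, 3, 0, 5]))
```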

Architecture

  • Base model: BEREL 3.0 (dicta-il/BEREL_3.0), a BERT model pre-trained on historical Hebrew texts
  • Head: Linear classification layer (768 → 8 labels)
  • Training: 5 epochs, AdamW optimizer, learning rate 2e-5, batch size 16
  • Parameters: ~184M total
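
As a rough illustration of the classification head, the sketch below applies a 768-to-8 linear layer to one word vector in plain Python and counts the head's parameters. It assumes the layer has a bias term and uses random weights; how the real model turns subword tokens into one vector per word is not stated here, so word_vector is a stand-in.

```python
import random

HIDDEN_SIZE = 768  # width of the encoder output
NUM_LABELS = 8     # one logit per punctuation label


def head_parameter_count(hidden=HIDDEN_SIZE, labels=NUM_LABELS, bias=True):
    """Weights plus (optionally) biases of a single linear layer."""
    return hidden * labels + (labels if bias else 0)


def linear_head(vector, weights, bias):
    """logits[j] = sum_i weights[j][i] * vector[i] + bias[j]"""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def predict_label_id(logits):
    """Index of the largest logit (argmax)."""
    return max(range(len(logits)), key=logits.__getitem__)


# Random stand-ins for trained weights and for an encoder output vector.
rng = random.Random(0)
weights = [[rng.uniform(-0.05, 0.05) for _ in range(HIDDEN_SIZE)]
           for _ in range(NUM_LABELS)]
bias = [0.0] * NUM_LABELS
word_vector = [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN_SIZE)]

logits = linear_head(word_vector, weights, bias)
print(head_parameter_count())  # 768 * 8 + 8 = 6152
print(len(logits), predict_label_id(logits))
```

Under the bias assumption the head contributes 6,152 parameters, so nearly all of the ~184M total sits in the BEREL encoder.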

Usage

This model is designed to be used with the punctuator.py script from the mivami project, which handles the Al HaTorah markup encoding (nikud, abbreviation tags, daf markers, etc.).

# Install dependencies
pip install torch transformers numpy scikit-learn

# Apply to an Al HaTorah encoded text file
python punctuator.py predict \
  --input YourMasechet.txt \
  --model-dir saved_model \
  --output YourMasechet_predicted.txt

The script preserves all encoding in the output; only punctuation marks are modified.
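
One way to spot-check this property on your own files is to strip the punctuation marks from both the input and the output and compare what remains. The sketch below is a hypothetical helper, not part of punctuator.py. It assumes the marks of interest are the seven non-O labels, and it also removes those characters where they occur inside markup (for example the period in a daf marker), so it cannot detect changes there.

```python
# The seven punctuation labels the model can emit ("\u2014" is the em-dash).
PUNCTUATION = set(",.:;?!\u2014")


def strip_punctuation(text):
    """Remove the seven marks and collapse whitespace runs.

    Collapsing whitespace means spacing and line-break changes are ignored.
    """
    kept = "".join(ch for ch in text if ch not in PUNCTUATION)
    return " ".join(kept.split())


def only_punctuation_changed(original, predicted):
    """True when the two texts are identical apart from punctuation marks."""
    return strip_punctuation(original) == strip_punctuation(predicted)


original = "<b>abc</b> def, ghi"
predicted = "<b>abc</b> def ghi."
print(only_punctuation_changed(original, predicted))  # True
```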

Input Format

The model expects Al HaTorah encoded Talmud text with:

  • Daf markers: {דף ב.}
  • Abbreviation tags: <abb>...</abb><openabb>...</openabb>
  • Note markers: <EMNM>...</EMNM>
  • Formatting tags: <b>, <h2>, <dots>
  • Full nikud (vowel points)
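
The elements listed above can be told apart from running text with a small tokenizer. This is a sketch under assumptions: it treats every {...} span as a daf marker and every <name> or </name> as a tag, which may be broader than the real encoding, and it identifies nikud as Unicode nonspacing marks. The function names are illustrative and do not come from punctuator.py.

```python
import re
import unicodedata

# ASSUMPTION: any {...} span is a daf marker and any <name> / </name> is a tag.
MARKUP_RE = re.compile(r"(\{[^{}]*\}|</?[A-Za-z][A-Za-z0-9]*>)")


def split_markup(text):
    """Split text into (kind, value) pairs: "daf", "tag" or "text"."""
    tokens = []
    # With a capturing group, re.split puts the delimiters at odd indices.
    for i, piece in enumerate(MARKUP_RE.split(text)):
        if not piece:
            continue
        if i % 2 == 0:
            kind = "text"
        elif piece.startswith("{"):
            kind = "daf"
        else:
            kind = "tag"
        tokens.append((kind, piece))
    return tokens


def count_nikud(text):
    """Count nonspacing marks (Unicode category Mn), which covers vowel points."""
    return sum(1 for ch in text if unicodedata.category(ch) == "Mn")


sample = "{דף ב.} <b>\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd</b>"
for kind, value in split_markup(sample):
    print(kind, repr(value))
print(count_nikud(sample))
```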

The preprocessing pipeline:

  1. Expands abbreviations (keeps <openabb> content, drops <abb> content)
  2. Strips nikud for model input
  3. Predicts punctuation per word
  4. Projects predictions back onto the original encoded text
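
Steps 1 and 2 can be sketched with regular expressions and Unicode categories. This is an illustrative approximation rather than the code in punctuator.py: the function names are made up, tags are assumed not to nest, and nikud is taken to mean nonspacing marks in the Hebrew block. Prediction and projection (steps 3 and 4) are left out.

```python
import re
import unicodedata

# ASSUMPTION: <abb> and <openabb> spans do not nest.
ABB_RE = re.compile(r"<abb>.*?</abb>", re.DOTALL)
OPENABB_RE = re.compile(r"<openabb>(.*?)</openabb>", re.DOTALL)


def expand_abbreviations(text):
    """Step 1: drop <abb> content, keep <openabb> content without its tags."""
    text = ABB_RE.sub("", text)
    return OPENABB_RE.sub(r"\1", text)


def strip_nikud(text):
    """Step 2: remove Hebrew-block nonspacing marks (vowel points, accents)."""
    return "".join(
        ch for ch in text
        if not ("\u0591" <= ch <= "\u05c7" and unicodedata.category(ch) == "Mn")
    )


def prepare_model_input(text):
    """Apply steps 1 and 2, then split on whitespace into words."""
    return strip_nikud(expand_abbreviations(text)).split()


encoded = "<abb>א\u05f4ר</abb><openabb>אָמַר רַבִּי</openabb>"
print(prepare_model_input(encoded))
```

Keeping the maqaf and other Hebrew punctuation characters intact is deliberate: they are not nonspacing marks, so the category test leaves them alone.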

Limitations

  • Trained on a single masechet; performance on other masechtot may vary
  • Punctuation is inherently subjective; this model reflects one annotator's conventions
  • The model sometimes removes exclamation marks from Talmudic challenges and drops commas from enumerated lists
  • Best results on Babylonian Talmud text in Al HaTorah's encoding format
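
To see how often marks such as exclamation marks or commas are dropped on your own data, you can tally the label changes between a reference punctuation and the model's output. The sketch below assumes both are already available as per-word label lists of equal length; the function name and the sample labels are hypothetical.

```python
from collections import Counter


def label_changes(reference, predicted):
    """Count (reference, predicted) label pairs that differ, word by word."""
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same length")
    return Counter((r, p) for r, p in zip(reference, predicted) if r != p)


# Hypothetical per-word labels for a five-word passage.
reference = ["O", "!", ",", ",", "."]
predicted = ["O", "O", ",", "O", "."]

changes = label_changes(reference, predicted)
print(changes[("!", "O")])  # exclamation marks removed: 1
print(changes[(",", "O")])  # commas dropped: 1
```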