---
language:
- he
- el
license: mit
tags:
- biblical-hebrew
- biblical-greek
- morphology
- parsing
- mt5
- seq2seq
datasets:
- LoveJesus/biblical-tutor-dataset-chirho
pipeline_tag: text2text-generation
model-index:
- name: biblical-parser-chirho
  results:
  - task:
      type: text2text-generation
      name: Morphological Parsing
    dataset:
      type: LoveJesus/biblical-tutor-dataset-chirho
      name: Biblical Tutor Dataset (Chirho)
    metrics:
    - type: exact_match
      value: 0.525
      name: Exact Match
    - type: f1
      value: 0.886
      name: Average Tag F1
---

# Biblical Morphological Parser (mT5-small)
> *"For God so loved the world that he gave his only begotten Son, that whoever believes in him should not perish but have eternal life."* (John 3:16)
## What This Does
This model parses biblical Hebrew and Greek words into their morphological components: part of speech, stem, lemma, tense, person, gender, number, and English gloss.
## Usage
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LoveJesus/biblical-parser-chirho")
model = AutoModelForSeq2SeqLM.from_pretrained("LoveJesus/biblical-parser-chirho")

# Parse a Hebrew word
input_text = 'parse [hebrew]: בָּרָא [GEN 1:1] context: בְּרֵאשִׁית אֱלֹהִים'
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Expected: "class:verb | stem:qal | lemma:ברא | morph:... | person:3 | gender:m | number:s | gloss:he created"

# Parse a Greek word
input_text = 'parse [greek]: λόγος [JHN 1:1] context: ἐν ἀρχῇ ἦν'
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Input Format

```text
parse [{language}]: {word} [{verse_ref}] context: {surrounding_words}
```

- `{language}`: `hebrew` or `greek`
- `{word}`: the biblical word in its original script
- `{verse_ref}`: book chapter:verse reference
- `{surrounding_words}`: roughly two words before and after, for disambiguation (see the helper sketch below)
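As referenced above, a small helper can assemble this prompt. This is a minimal sketch only; the function name `build_parse_input` and its signature are illustrative, not part of the model's API:

```python
def build_parse_input(word: str, language: str, verse_ref: str, context: str) -> str:
    """Build the prompt string the parser expects.

    `language` is "hebrew" or "greek"; `context` holds the surrounding
    words (roughly two on each side) used for disambiguation.
    """
    return f"parse [{language}]: {word} [{verse_ref}] context: {context}"

# Matches the Hebrew example in the Usage section:
prompt = build_parse_input("בָּרָא", "hebrew", "GEN 1:1", "בְּרֵאשִׁית אֱלֹהִים")
```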
## Output Format

Pipe-separated morphological tags:

```text
class:{pos} | stem:{stem} | lemma:{lemma} | morph:{code} | person:{p} | gender:{g} | number:{n} | gloss:{english}
```
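A small helper can turn this pipe-separated string into a dictionary. This is an illustrative sketch (the name `parse_output` is not part of the model's API); tags the model omits for a given word will simply be absent from the result:

```python
def parse_output(text: str) -> dict:
    """Split 'class:verb | stem:qal | ...' into {'class': 'verb', 'stem': 'qal', ...}."""
    tags = {}
    for field in text.split("|"):
        if ":" in field:
            key, value = field.split(":", 1)
            tags[key.strip()] = value.strip()
    return tags

# parse_output("class:verb | stem:qal | lemma:ברא | gloss:he created")
# -> {'class': 'verb', 'stem': 'qal', 'lemma': 'ברא', 'gloss': 'he created'}
```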
## Training Data
- Macula Hebrew (Clear-Bible): ~425K OT words with morphology and glosses
- Macula Greek SBLGNT (Clear-Bible): ~138K NT words with morphology and glosses
- Subsampled to ~200K words (100K per language), stratified by book; see the sketch below
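The per-book stratified subsampling could be reproduced along these lines. This is a sketch under stated assumptions, not the actual preprocessing script: it assumes a pandas DataFrame with a `book` column and uses a per-language target of ~100K words.

```python
import pandas as pd

def subsample_by_book(df: pd.DataFrame, target: int, seed: int = 42) -> pd.DataFrame:
    """Sample about `target` rows while preserving each book's share of the corpus."""
    frac = min(target / len(df), 1.0)
    return df.groupby("book", group_keys=False).apply(
        lambda g: g.sample(frac=frac, random_state=seed)
    )

# Hypothetical usage: ~100K words per language
# hebrew_sample = subsample_by_book(hebrew_words, 100_000)
# greek_sample = subsample_by_book(greek_words, 100_000)
```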
## Model Details
| Property | Value |
|---|---|
| Base model | google/mt5-small (300M params) |
| Architecture | Encoder-decoder (Seq2Seq) |
| Languages | Biblical Hebrew, Koine Greek |
| Training | 5 epochs, lr=3e-4, batch=32 |
| Hardware | NVIDIA A100/H200 GPU |
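A fine-tuning setup matching the hyperparameters in the table above would look roughly like this sketch using `Seq2SeqTrainer`. The output directory and the `train_dataset`/`eval_dataset` variables (tokenized prompt/target pairs) are placeholders, not the actual training script:

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

args = Seq2SeqTrainingArguments(
    output_dir="biblical-parser-chirho",  # placeholder
    num_train_epochs=5,
    learning_rate=3e-4,
    per_device_train_batch_size=32,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # tokenized "parse [...]" -> tag-string pairs (placeholder)
    eval_dataset=eval_dataset,    # placeholder
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```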
## Limitations
- Trained on Macula morphological annotations — may not match all scholarly traditions
- Handles individual words, not full syntactic analysis
- Performance may vary on words not well-represented in training data
## Evaluation Results

Evaluated on a held-out test set of ~20K word-level parsing examples.

### Overall Metrics
| Metric | Score |
|---|---|
| Exact Match (all tags correct) | 0.525 |
| Average Tag F1 (across all tags) | 0.886 |
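Exact match requires every tag in an output to be correct, while tag F1 is averaged over individual tags. A sketch of how such metrics could be computed from the pipe-separated outputs, reusing the `parse_output` helper sketched above (this is not the exact evaluation script):

```python
def exact_match(pred: str, gold: str) -> bool:
    """True only when every tag value matches the reference parse."""
    return parse_output(pred) == parse_output(gold)

def tag_f1(preds: list[str], golds: list[str], tag: str) -> float:
    """Micro-averaged F1 for a single tag (e.g. 'gender') over a set of examples."""
    tp = fp = fn = 0
    for p, g in zip(preds, golds):
        pv = parse_output(p).get(tag)
        gv = parse_output(g).get(tag)
        if pv is not None and pv == gv:
            tp += 1
        elif pv is not None:
            fp += 1
        if gv is not None and pv != gv:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```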
### Per-Tag F1
| Tag | F1 |
|---|---|
| class (POS) | 0.963 |
| number | 0.966 |
| POS | 0.958 |
| lemma | 0.935 |
| person | 0.933 |
| gender | 0.928 |
| type | 0.900 |
| morph | 0.890 |
| state | 0.878 |
| stem | 0.859 |
| gloss | 0.539 |
### Per-Language Exact Match
| Language | Exact Match |
|---|---|
| Hebrew | 0.514 |
| Greek | 0.559 |
The `gloss` tag (English translation) is the hardest to predict exactly, which pulls down the overall exact-match rate. The model achieves strong F1 on the structural/morphological tags (class, number, POS, person, and gender are all above 0.92).
Built with love for Jesus. Published by LoveJesus. Part of the bible.systems project.