---
language:
- he
- el
license: mit
tags:
- biblical-hebrew
- biblical-greek
- morphology
- parsing
- mt5
- seq2seq
datasets:
- LoveJesus/biblical-tutor-dataset-chirho
pipeline_tag: text2text-generation
model-index:
- name: biblical-parser-chirho
results:
- task:
type: text2text-generation
name: Morphological Parsing
dataset:
type: LoveJesus/biblical-tutor-dataset-chirho
name: Biblical Tutor Dataset (Chirho)
metrics:
- type: exact_match
value: 0.525
name: Exact Match
- type: f1
value: 0.886
name: Average Tag F1
---
# Biblical Morphological Parser (mT5-small)
*For God so loved the world that he gave his only begotten Son, that whoever believes in him should not perish but have eternal life. - John 3:16*
## What This Does
This model parses biblical Hebrew and Greek words into their morphological components: part of speech, stem, lemma, tense, person, gender, number, and English gloss.
## Usage
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("LoveJesus/biblical-parser-chirho")
model = AutoModelForSeq2SeqLM.from_pretrained("LoveJesus/biblical-parser-chirho")
# Parse a Hebrew word
input_text = 'parse [hebrew]: בָּרָא [GEN 1:1] context: בְּרֵאשִׁית אֱלֹהִים'
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Expected: "class:verb | stem:qal | lemma:ברא | morph:... | person:3 | gender:m | number:s | gloss:he created"
# Parse a Greek word
input_text = 'parse [greek]: λόγος [JHN 1:1] context: ἐν ἀρχῇ ἦν'
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Input Format
```
parse [{language}]: {word} [{verse_ref}] context: {surrounding_words}
```
- `{language}`: `hebrew` or `greek`
- `{word}`: The biblical word in original script
- `{verse_ref}`: Book chapter:verse reference
- `{surrounding_words}`: 2 words before and after for disambiguation
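The fields above can be assembled with a small helper. This is an illustrative sketch, not part of the model's published API; the function name and signature are assumptions for this example.

```python
def build_parse_input(language: str, word: str, verse_ref: str, context: str) -> str:
    """Assemble a model input string in the documented format.

    language: "hebrew" or "greek"
    word: the biblical word in original script
    verse_ref: book chapter:verse reference, e.g. "JHN 1:1"
    context: surrounding words for disambiguation
    """
    return f"parse [{language}]: {word} [{verse_ref}] context: {context}"

print(build_parse_input("greek", "λόγος", "JHN 1:1", "ἐν ἀρχῇ ἦν"))
# → parse [greek]: λόγος [JHN 1:1] context: ἐν ἀρχῇ ἦν
```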
## Output Format
Pipe-separated morphological tags:
```
class:{pos} | stem:{stem} | lemma:{lemma} | morph:{code} | person:{p} | gender:{g} | number:{n} | gloss:{english}
```
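Since the output is a flat pipe-separated string, it is straightforward to convert into a dictionary for downstream use. A minimal sketch (the helper name is an assumption, not part of the model):

```python
def parse_tags(output: str) -> dict:
    """Split a pipe-separated tag string like 'class:verb | stem:qal | ...'
    into a {tag: value} dict. Splits on the first ':' only, so values
    containing spaces (e.g. glosses) are preserved."""
    tags = {}
    for field in output.split("|"):
        key, _, value = field.strip().partition(":")
        tags[key] = value
    return tags

example = "class:verb | stem:qal | lemma:ברא | person:3 | gender:m | number:s | gloss:he created"
print(parse_tags(example)["gloss"])  # → he created
```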
## Training Data
- **Macula Hebrew** (Clear-Bible): ~425K OT words with morphology and glosses
- **Macula Greek SBLGNT** (Clear-Bible): ~138K NT words with morphology and glosses
- Subsampled to ~200K words (100K per language), stratified by book
## Model Details
| Property | Value |
|----------|-------|
| Base model | google/mt5-small (300M params) |
| Architecture | Encoder-decoder (Seq2Seq) |
| Languages | Biblical Hebrew, Koine Greek |
| Training | 5 epochs, lr=3e-4, batch=32 |
| Hardware | NVIDIA A100/H200 GPU |
## Limitations
- Trained on Macula morphological annotations — may not match all scholarly traditions
- Handles individual words, not full syntactic analysis
- Performance may vary on words not well-represented in training data
## Evaluation Results
Evaluated on a held-out test set of ~20K word-level parsing examples.
### Overall Metrics
| Metric | Score |
|--------|-------|
| **Exact Match** (all tags correct) | **0.525** |
| **Average Tag F1** (across all tags) | **0.886** |
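To make the two metrics concrete, here is a simplified scoring sketch. Exact match requires every tag to agree between prediction and reference; the per-tag score below is a plain correctness check per tag (the card reports tag-level F1; the actual evaluation script is not reproduced here, so treat this as an approximation).

```python
def to_tags(s: str) -> dict:
    """Parse a pipe-separated tag string into a {tag: value} dict."""
    return dict(f.strip().split(":", 1) for f in s.split("|"))

def exact_match(pred: str, gold: str) -> bool:
    """True only if every tag value matches the reference."""
    return to_tags(pred) == to_tags(gold)

pred = "class:verb | stem:qal | number:s"
gold = "class:verb | stem:qal | number:p"

print(exact_match(pred, gold))  # → False

# Tag-level correctness: one wrong tag lowers the tag score only slightly,
# but any wrong tag makes exact match fail — which is why Average Tag F1
# (0.886) is much higher than Exact Match (0.525).
tags_p, tags_g = to_tags(pred), to_tags(gold)
per_tag = {k: tags_p.get(k) == v for k, v in tags_g.items()}
print(per_tag)  # → {'class': True, 'stem': True, 'number': False}
```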
### Per-Tag F1
| Tag | F1 |
|-----|-----|
| class (POS) | 0.963 |
| number | 0.966 |
| POS | 0.958 |
| lemma | 0.935 |
| person | 0.933 |
| gender | 0.928 |
| type | 0.900 |
| morph | 0.890 |
| state | 0.878 |
| stem | 0.859 |
| gloss | 0.539 |
### Per-Language Exact Match
| Language | Exact Match |
|----------|-------------|
| Hebrew | 0.514 |
| Greek | 0.559 |
> The `gloss` tag (English translation) is the hardest to predict exactly, pulling down the overall exact match rate. The model achieves strong F1 on structural/morphological tags (class, number, POS, person, gender all > 0.92).
---
Built with love for Jesus. Published by [LoveJesus](https://huggingface.co/LoveJesus).
Part of the [bible.systems](https://bible.systems) project.