---
language:
  - he
  - el
license: mit
tags:
  - biblical-hebrew
  - biblical-greek
  - morphology
  - parsing
  - mt5
  - seq2seq
datasets:
  - LoveJesus/biblical-tutor-dataset-chirho
pipeline_tag: text2text-generation
model-index:
  - name: biblical-parser-chirho
    results:
      - task:
          type: text2text-generation
          name: Morphological Parsing
        dataset:
          type: LoveJesus/biblical-tutor-dataset-chirho
          name: Biblical Tutor Dataset (Chirho)
        metrics:
          - type: exact_match
            value: 0.525
            name: Exact Match
          - type: f1
            value: 0.886
            name: Average Tag F1
---

# Biblical Morphological Parser (mT5-small)

*For God so loved the world that he gave his only begotten Son, that whoever believes in him should not perish but have eternal life. - John 3:16*

## What This Does

This model parses biblical Hebrew and Greek words into their morphological components: part of speech, stem, lemma, tense, person, gender, number, and English gloss.

## Usage

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LoveJesus/biblical-parser-chirho")
model = AutoModelForSeq2SeqLM.from_pretrained("LoveJesus/biblical-parser-chirho")

# Parse a Hebrew word
input_text = 'parse [hebrew]: בָּרָא [GEN 1:1] context: בְּרֵאשִׁית אֱלֹהִים'
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Expected: "class:verb | stem:qal | lemma:ברא | morph:... | person:3 | gender:m | number:s | gloss:he created"

# Parse a Greek word
input_text = 'parse [greek]: λόγος [JHN 1:1] context: ἐν ἀρχῇ ἦν'
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Input Format

```
parse [{language}]: {word} [{verse_ref}] context: {surrounding_words}
```

- `{language}`: `hebrew` or `greek`
- `{word}`: The biblical word in original script
- `{verse_ref}`: Book chapter:verse reference
- `{surrounding_words}`: Two words before and two words after the target word, for disambiguation
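The template above can be assembled with a small helper (`build_input` is an illustrative name, not part of the model's API):

```python
def build_input(language: str, word: str, verse_ref: str, context: str) -> str:
    """Assemble the model's expected input string.

    `language` is "hebrew" or "greek"; `context` holds the
    surrounding words used for disambiguation.
    """
    return f"parse [{language}]: {word} [{verse_ref}] context: {context}"

print(build_input("hebrew", "בָּרָא", "GEN 1:1", "בְּרֵאשִׁית אֱלֹהִים"))
# parse [hebrew]: בָּרָא [GEN 1:1] context: בְּרֵאשִׁית אֱלֹהִים
```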

## Output Format

Pipe-separated morphological tags:
```
class:{pos} | stem:{stem} | lemma:{lemma} | morph:{code} | person:{p} | gender:{g} | number:{n} | gloss:{english}
```
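A decoded output string can be turned into a dictionary with a small helper (`parse_output` is illustrative and not shipped with the model):

```python
def parse_output(text: str) -> dict:
    """Split the pipe-separated tag string into a dict.

    Splits on " | " and then on the first ":" of each field,
    so glosses that themselves contain ":" stay intact.
    """
    tags = {}
    for field in text.split(" | "):
        key, _, value = field.partition(":")
        tags[key.strip()] = value.strip()
    return tags

example = "class:verb | stem:qal | lemma:ברא | person:3 | gender:m | number:s | gloss:he created"
print(parse_output(example)["gloss"])  # he created
```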

## Training Data

- **Macula Hebrew** (Clear-Bible): ~425K OT words with morphology and glosses
- **Macula Greek SBLGNT** (Clear-Bible): ~138K NT words with morphology and glosses
- Subsampled to ~200K words (100K per language), stratified by book

## Model Details

| Property | Value |
|----------|-------|
| Base model | google/mt5-small (300M params) |
| Architecture | Encoder-decoder (Seq2Seq) |
| Languages | Biblical Hebrew, Koine Greek |
| Training | 5 epochs, lr=3e-4, batch=32 |
| Hardware | NVIDIA A100/H200 GPU |

## Limitations

- Trained on Macula morphological annotations — may not match all scholarly traditions
- Handles individual words, not full syntactic analysis
- Performance may vary on words not well-represented in training data

## Evaluation Results

Evaluated on a held-out test set of ~20K word-level parsing examples.

### Overall Metrics

| Metric | Score |
|--------|-------|
| **Exact Match** (all tags correct) | **0.525** |
| **Average Tag F1** (across all tags) | **0.886** |
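For reference, the two overall metrics can be computed roughly as sketched below. This is an assumption about the evaluation procedure (the actual evaluation script is not published here, and the reported per-tag F1 may be averaged differently):

```python
def exact_match(preds: list, refs: list) -> float:
    """Fraction of examples whose full tag string matches the reference exactly."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

def tag_f1(preds: list, refs: list, tag: str) -> float:
    """Micro-F1 for a single tag across all examples."""
    tp = fp = fn = 0
    for p, r in zip(preds, refs):
        pv = dict(f.split(":", 1) for f in p.split(" | ") if ":" in f).get(tag)
        rv = dict(f.split(":", 1) for f in r.split(" | ") if ":" in f).get(tag)
        if pv is not None and pv == rv:
            tp += 1
        else:
            if pv is not None:
                fp += 1  # predicted a value that is wrong or spurious
            if rv is not None:
                fn += 1  # missed or mislabeled a reference value
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
```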

### Per-Tag F1

| Tag | F1 |
|-----|-----|
| class | 0.963 |
| number | 0.966 |
| POS | 0.958 |
| lemma | 0.935 |
| person | 0.933 |
| gender | 0.928 |
| type | 0.900 |
| morph | 0.890 |
| state | 0.878 |
| stem | 0.859 |
| gloss | 0.539 |

### Per-Language Exact Match

| Language | Exact Match |
|----------|-------------|
| Hebrew | 0.514 |
| Greek | 0.559 |

> The `gloss` tag (English translation) is the hardest to predict exactly, and it pulls down the overall exact-match rate. The model achieves strong F1 on the structural/morphological tags: class, number, POS, lemma, person, and gender all score above 0.92.


---

Built with love for Jesus. Published by [LoveJesus](https://huggingface.co/LoveJesus).
Part of the [bible.systems](https://bible.systems) project.