---
language:
- he
- el
license: mit
tags:
- biblical-hebrew
- biblical-greek
- morphology
- parsing
- mt5
- seq2seq
datasets:
- LoveJesus/biblical-tutor-dataset-chirho
pipeline_tag: text2text-generation
model-index:
- name: biblical-parser-chirho
  results:
  - task:
      type: text2text-generation
      name: Morphological Parsing
    dataset:
      type: LoveJesus/biblical-tutor-dataset-chirho
      name: Biblical Tutor Dataset (Chirho)
    metrics:
    - type: exact_match
      value: 0.525
      name: Exact Match
    - type: f1
      value: 0.886
      name: Average Tag F1
---

# Biblical Morphological Parser (mT5-small)
> *"For God so loved the world that he gave his only begotten Son, that whoever believes in him should not perish but have eternal life."* (John 3:16)
## What This Does
This model parses biblical Hebrew and Greek words into their morphological components: part of speech, stem, lemma, tense, person, gender, number, and English gloss.
## Usage
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LoveJesus/biblical-parser-chirho")
model = AutoModelForSeq2SeqLM.from_pretrained("LoveJesus/biblical-parser-chirho")

# Parse a Hebrew word
input_text = 'parse [hebrew]: בָּרָא [GEN 1:1] context: בְּרֵאשִׁית אֱלֹהִים'
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Expected: "class:verb | stem:qal | lemma:ברא | morph:... | person:3 | gender:m | number:s | gloss:he created"

# Parse a Greek word
input_text = 'parse [greek]: λόγος [JHN 1:1] context: ἐν ἀρχῇ ἦν'
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Input Format

```text
parse [{language}]: {word} [{verse_ref}] context: {surrounding_words}
```

- `{language}`: `hebrew` or `greek`
- `{word}`: the biblical word in its original script
- `{verse_ref}`: book chapter:verse reference
- `{surrounding_words}`: roughly two words before and after, for disambiguation (see the helper sketch below)
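As referenced above, a small helper can assemble this prompt. This is a minimal sketch only; the function name `build_parse_input` and its signature are illustrative, not part of the model's API:

```python
def build_parse_input(word: str, language: str, verse_ref: str, context: str) -> str:
    """Build the prompt string the parser expects.

    `language` is "hebrew" or "greek"; `context` holds the surrounding
    words (roughly two on each side) used for disambiguation.
    """
    return f"parse [{language}]: {word} [{verse_ref}] context: {context}"

# Matches the Hebrew example in the Usage section:
prompt = build_parse_input("בָּרָא", "hebrew", "GEN 1:1", "בְּרֵאשִׁית אֱלֹהִים")
```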
## Output Format

Pipe-separated morphological tags:

```text
class:{pos} | stem:{stem} | lemma:{lemma} | morph:{code} | person:{p} | gender:{g} | number:{n} | gloss:{english}
```
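A small helper can turn this pipe-separated string into a dictionary. This is an illustrative sketch (the name `parse_output` is not part of the model's API); tags the model omits for a given word will simply be absent from the result:

```python
def parse_output(text: str) -> dict:
    """Split 'class:verb | stem:qal | ...' into {'class': 'verb', 'stem': 'qal', ...}."""
    tags = {}
    for field in text.split("|"):
        if ":" in field:
            key, value = field.split(":", 1)
            tags[key.strip()] = value.strip()
    return tags

# parse_output("class:verb | stem:qal | lemma:ברא | gloss:he created")
# -> {'class': 'verb', 'stem': 'qal', 'lemma': 'ברא', 'gloss': 'he created'}
```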
## Training Data
- Macula Hebrew (Clear-Bible): ~425K OT words with morphology and glosses
- Macula Greek SBLGNT (Clear-Bible): ~138K NT words with morphology and glosses
- Subsampled to ~200K words (100K per language), stratified by book; see the sketch below
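The per-book stratified subsampling could be reproduced along these lines. This is a sketch under stated assumptions, not the actual preprocessing script: it assumes a pandas DataFrame with a `book` column and uses a per-language target of ~100K words.

```python
import pandas as pd

def subsample_by_book(df: pd.DataFrame, target: int, seed: int = 42) -> pd.DataFrame:
    """Sample about `target` rows while preserving each book's share of the corpus."""
    frac = min(target / len(df), 1.0)
    return df.groupby("book", group_keys=False).apply(
        lambda g: g.sample(frac=frac, random_state=seed)
    )

# Hypothetical usage: ~100K words per language
# hebrew_sample = subsample_by_book(hebrew_words, 100_000)
# greek_sample = subsample_by_book(greek_words, 100_000)
```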
## Model Details
| Property | Value |
|---|---|
| Base model | google/mt5-small (300M params) |
| Architecture | Encoder-decoder (Seq2Seq) |
| Languages | Biblical Hebrew, Koine Greek |
| Training | 5 epochs, lr=3e-4, batch=32 |
| Hardware | NVIDIA A100/H200 GPU |
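A fine-tuning setup matching the hyperparameters in the table above would look roughly like this sketch using `Seq2SeqTrainer`. The output directory and the `train_dataset`/`eval_dataset` variables (tokenized prompt/target pairs) are placeholders, not the actual training script:

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

args = Seq2SeqTrainingArguments(
    output_dir="biblical-parser-chirho",  # placeholder
    num_train_epochs=5,
    learning_rate=3e-4,
    per_device_train_batch_size=32,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # tokenized "parse [...]" -> tag-string pairs (placeholder)
    eval_dataset=eval_dataset,    # placeholder
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```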
## Limitations
- Trained on Macula morphological annotations — may not match all scholarly traditions
- Handles individual words, not full syntactic analysis
- Performance may vary on words not well-represented in training data
## Evaluation Results

Evaluated on a held-out test set of ~20K word-level parsing examples.

### Overall Metrics
| Metric | Score |
|---|---|
| Exact Match (all tags correct) | 0.525 |
| Average Tag F1 (across all tags) | 0.886 |
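Exact match requires every tag in an output to be correct, while tag F1 is averaged over individual tags. A sketch of how such metrics could be computed from the pipe-separated outputs, reusing the `parse_output` helper sketched above (this is not the exact evaluation script):

```python
def exact_match(pred: str, gold: str) -> bool:
    """True only when every tag value matches the reference parse."""
    return parse_output(pred) == parse_output(gold)

def tag_f1(preds: list[str], golds: list[str], tag: str) -> float:
    """Micro-averaged F1 for a single tag (e.g. 'gender') over a set of examples."""
    tp = fp = fn = 0
    for p, g in zip(preds, golds):
        pv = parse_output(p).get(tag)
        gv = parse_output(g).get(tag)
        if pv is not None and pv == gv:
            tp += 1
        elif pv is not None:
            fp += 1
        if gv is not None and pv != gv:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```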
### Per-Tag F1
| Tag | F1 |
|---|---|
| class (POS) | 0.963 |
| number | 0.966 |
| POS | 0.958 |
| lemma | 0.935 |
| person | 0.933 |
| gender | 0.928 |
| type | 0.900 |
| morph | 0.890 |
| state | 0.878 |
| stem | 0.859 |
| gloss | 0.539 |
### Per-Language Exact Match
| Language | Exact Match |
|---|---|
| Hebrew | 0.514 |
| Greek | 0.559 |
The `gloss` tag (English translation) is the hardest to predict exactly, which pulls down the overall exact-match rate. The model achieves strong F1 on the structural/morphological tags (class, number, POS, person, and gender are all above 0.92).
Built with love for Jesus. Published by LoveJesus. Part of the bible.systems project.