---
language:
- he
- el
license: mit
tags:
- biblical-hebrew
- biblical-greek
- morphology
- parsing
- mt5
- seq2seq
datasets:
- LoveJesus/biblical-tutor-dataset-chirho
pipeline_tag: text2text-generation
model-index:
- name: biblical-parser-chirho
  results:
  - task:
      type: text2text-generation
      name: Morphological Parsing
    dataset:
      type: LoveJesus/biblical-tutor-dataset-chirho
      name: Biblical Tutor Dataset (Chirho)
    metrics:
    - type: exact_match
      value: 0.525
      name: Exact Match
    - type: f1
      value: 0.886
      name: Average Tag F1
---

# Biblical Morphological Parser (mT5-small)

*For God so loved the world that he gave his only begotten Son, that whoever believes in him should not perish but have eternal life. - John 3:16*

## What This Does

This model parses Biblical Hebrew and Greek words into their morphological components: part of speech, stem, lemma, tense, person, gender, number, and English gloss.

## Usage

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LoveJesus/biblical-parser-chirho")
model = AutoModelForSeq2SeqLM.from_pretrained("LoveJesus/biblical-parser-chirho")

# Parse a Hebrew word
input_text = 'parse [hebrew]: בָּרָא [GEN 1:1] context: בְּרֵאשִׁית אֱלֹהִים'
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Expected: "class:verb | stem:qal | lemma:ברא | morph:... | person:3 | gender:m | number:s | gloss:he created"

# Parse a Greek word
input_text = 'parse [greek]: λόγος [JHN 1:1] context: ἐν ἀρχῇ ἦν'
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Input Format

```
parse [{language}]: {word} [{verse_ref}] context: {surrounding_words}
```

- `{language}`: `hebrew` or `greek`
- `{word}`: the biblical word in its original script
- `{verse_ref}`: book chapter:verse reference
- `{surrounding_words}`: two words before and after, for disambiguation

## Output Format

Pipe-separated morphological tags:

```
class:{pos} | stem:{stem} | lemma:{lemma} | morph:{code} | person:{p} | gender:{g} | number:{n} | gloss:{english}
```

## Training Data

- **Macula Hebrew** (Clear-Bible): ~425K OT words with morphology and glosses
- **Macula Greek SBLGNT** (Clear-Bible): ~138K NT words with morphology and glosses
- Subsampled to ~200K words (~100K per language), stratified by book

## Model Details

| Property | Value |
|----------|-------|
| Base model | google/mt5-small (300M params) |
| Architecture | Encoder-decoder (seq2seq) |
| Languages | Biblical Hebrew, Koine Greek |
| Training | 5 epochs, lr=3e-4, batch size 32 |
| Hardware | NVIDIA A100/H200 GPU |

## Limitations

- Trained on Macula morphological annotations, which may not match all scholarly traditions
- Parses individual words only; it does not perform full syntactic analysis
- Performance may vary on words under-represented in the training data

## Evaluation Results

Evaluated on a held-out test set of ~20K word-level parsing examples.
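The input and output formats can be wrapped in two small helpers. This is a minimal sketch; `build_prompt` and `parse_tags` are hypothetical names for illustration, not part of the model's API:

```python
def build_prompt(language: str, word: str, verse_ref: str, context: str) -> str:
    """Assemble the model's expected input string from its four fields."""
    return f"parse [{language}]: {word} [{verse_ref}] context: {context}"


def parse_tags(output: str) -> dict:
    """Split the pipe-separated tag string into a {tag: value} dict."""
    fields = {}
    for part in output.split("|"):
        key, _, value = part.strip().partition(":")
        if key:
            fields[key] = value
    return fields


prompt = build_prompt("hebrew", "בָּרָא", "GEN 1:1", "בְּרֵאשִׁית אֱלֹהִים")
# -> 'parse [hebrew]: בָּרָא [GEN 1:1] context: בְּרֵאשִׁית אֱלֹהִים'

tags = parse_tags("class:verb | stem:qal | lemma:ברא | gloss:he created")
# -> {'class': 'verb', 'stem': 'qal', 'lemma': 'ברא', 'gloss': 'he created'}
```

Note that `partition(":")` splits only on the first colon, so multi-word gloss values such as `he created` survive intact.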
### Overall Metrics

| Metric | Score |
|--------|-------|
| **Exact Match** (all tags correct) | **0.525** |
| **Average Tag F1** (across all tags) | **0.886** |

### Per-Tag F1

| Tag | F1 |
|-----|-----|
| class (POS) | 0.963 |
| number | 0.966 |
| POS | 0.958 |
| lemma | 0.935 |
| person | 0.933 |
| gender | 0.928 |
| type | 0.900 |
| morph | 0.890 |
| state | 0.878 |
| stem | 0.859 |
| gloss | 0.539 |

### Per-Language Exact Match

| Language | Exact Match |
|----------|-------------|
| Hebrew | 0.514 |
| Greek | 0.559 |

> The `gloss` tag (English translation) is the hardest to predict exactly, pulling down the overall exact-match rate. The model achieves strong F1 on structural/morphological tags (class, number, POS, person, gender are all above 0.92).

---

Built with love for Jesus. Published by [LoveJesus](https://huggingface.co/LoveJesus). Part of the [bible.systems](https://bible.systems) project.
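For reference, exact match and a per-tag F1 over pipe-separated gold/predicted strings could be computed along these lines. This is an illustrative sketch only; the card does not include the actual evaluation script, and the reported numbers may use a different F1 definition:

```python
def to_dict(s: str) -> dict:
    """Parse 'tag:value | tag:value | ...' into a dict."""
    out = {}
    for part in s.split("|"):
        key, _, value = part.strip().partition(":")
        if key:
            out[key] = value
    return out


def evaluate(golds: list[str], preds: list[str]):
    """Return (exact_match, {tag: f1}) over parallel gold/pred tag strings."""
    exact = 0
    tp, gold_n, pred_n = {}, {}, {}
    for g_str, p_str in zip(golds, preds):
        g, p = to_dict(g_str), to_dict(p_str)
        exact += g == p  # all tags must agree for an exact match
        for tag, val in g.items():
            gold_n[tag] = gold_n.get(tag, 0) + 1
            if p.get(tag) == val:
                tp[tag] = tp.get(tag, 0) + 1
        for tag in p:
            pred_n[tag] = pred_n.get(tag, 0) + 1
    f1 = {}
    for tag in gold_n:
        pred_count = pred_n.get(tag, 0)
        prec = tp.get(tag, 0) / pred_count if pred_count else 0.0
        rec = tp.get(tag, 0) / gold_n[tag]
        f1[tag] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return exact / len(golds), f1
```

Here a tag counts as a true positive only when the predicted value equals the gold value, so a tag emitted with the wrong value hurts both precision and recall.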