---
license: mit
language:
  - en
tags:
  - text2text-generation
  - flan-t5
  - bible
  - simplification
  - readability
  - difficulty-scoring
  - multi-task
  - seq2seq
datasets:
  - LoveJesus/passage-difficulty-simplifier-dataset-chirho
pipeline_tag: text2text-generation
base_model: google/flan-t5-base
model-index:
  - name: passage-difficulty-simplifier-chirho
    results:
    - task:
        type: text2text-generation
        name: Text Generation
      metrics:
      - name: Eval Loss
        type: eval_loss
        value: 2.228
      - name: Difficulty Accuracy
        type: accuracy
        value: 0.9377
      - name: Combined Score
        type: combined_score
        value: 0.3781
---

<!-- For God so loved the world that he gave his only begotten Son,
that whoever believes in him should not perish but have eternal life. - John 3:16 -->

# Passage Difficulty Scorer & Plain-Language Simplifier (Model 8)

A fine-tuned **google/flan-t5-base** (248M parameters) for dual-task Bible passage processing: (1) reading difficulty scoring and (2) archaic-to-modern English simplification. Both tasks are learned jointly through multi-task training on the same model. Upgraded from flan-t5-small (80M) for improved accuracy.

## Model Description

This model takes Bible passages as input and performs one of two tasks, selected by a natural language prefix:

### Task 1: Difficulty Scoring

Analyzes a Bible passage and produces a structured difficulty assessment.

- **Prefix**: `rate difficulty:`
- **Output format**: `reading_level: [1-12] | vocab_complexity: [low/medium/high] | archaic_forms: [count] | difficulty: [easy/medium/hard]`

### Task 2: Simplification

Converts archaic or complex Bible passages into plain modern English.

- **Prefix**: `simplify:`
- **Output**: Plain-language paraphrase of the input verse

## Training Details

| Parameter | Value |
|---|---|
| **Base model** | `google/flan-t5-base` (248M params) |
| **Architecture** | Encoder-Decoder (T5) |
| **Training approach** | Full fine-tuning, multi-task |
| **Trainer** | `Seq2SeqTrainer` with `DataCollatorForSeq2Seq` |
| **Epochs** | 5 |
| **Batch size** | 32 (H200 GPU) |
| **Effective batch size** | 32 (gradient accumulation = 1 on H200) |
| **Learning rate** | 2e-4 |
| **LR scheduler** | Cosine with 10% warmup |
| **Weight decay** | 0.01 |
| **Label smoothing** | 0.1 |
| **Mixed precision** | bf16 (H200) |
| **Max input length** | 256 tokens |
| **Max target length** | 256 tokens |
| **Early stopping** | Patience = 2, monitoring `eval_loss` |
| **Best model selection** | Lowest `eval_loss` |
| **Generation (eval)** | `predict_with_generate=True`, beam search |
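
The hyperparameters above map onto `Seq2SeqTrainingArguments` roughly as follows (a sketch, not the actual training script: `output_dir` is illustrative, and argument names follow recent `transformers` releases):

```python
# Sketch of the training configuration described in the table above.
from transformers import Seq2SeqTrainingArguments

training_args_chirho = Seq2SeqTrainingArguments(
    output_dir="models-chirho/simplifier-chirho",  # illustrative path
    num_train_epochs=5,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,        # effective batch size 32
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                     # 10% warmup
    weight_decay=0.01,
    label_smoothing_factor=0.1,
    bf16=True,
    predict_with_generate=True,
    generation_num_beams=4,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
```

Early stopping (patience = 2) would be attached separately via `EarlyStoppingCallback` when constructing the `Seq2SeqTrainer`.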

### Dataset

Trained on approximately **120K examples** combining both tasks, split 80/10/10 by Bible book so that no verse from a held-out book appears in training:

| Task | Target Count | Description |
|---|---|---|
| Difficulty scoring | ~64K | Verses from 6 translations with algorithmically computed labels |
| Simplification | ~96K | Cross-translation pairs mapping complex to simple English |
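
The by-book split can be sketched like this (the `book` field and helper name are illustrative; the actual dataset builder is a Bun/TypeScript script, but the logic is the same):

```python
import random

def split_by_book_chirho(examples_chirho, seed_chirho=42):
    """Assign whole Bible books to train/val/test (80/10/10) so no
    book -- and therefore no verse -- leaks across splits."""
    books_chirho = sorted({ex["book"] for ex in examples_chirho})
    random.Random(seed_chirho).shuffle(books_chirho)
    n_chirho = len(books_chirho)
    train_books_chirho = set(books_chirho[: int(n_chirho * 0.8)])
    val_books_chirho = set(books_chirho[int(n_chirho * 0.8): int(n_chirho * 0.9)])
    splits_chirho = {"train": [], "val": [], "test": []}
    for ex in examples_chirho:
        if ex["book"] in train_books_chirho:
            splits_chirho["train"].append(ex)
        elif ex["book"] in val_books_chirho:
            splits_chirho["val"].append(ex)
        else:
            splits_chirho["test"].append(ex)
    return splits_chirho
```

Splitting by book rather than by verse matters because the same verse appears in up to six translations; a random verse-level split would put near-duplicates in both train and test.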

#### Translations Used

| Translation | Style | Role |
|---|---|---|
| KJV (King James Version) | Formal, archaic | Complex source |
| ASV (American Standard Version) | Formal, dated | Complex source |
| YLT (Young's Literal Translation) | Ultra-literal | Complex source |
| Darby Bible | Literal, dated | Complex source / Difficulty scoring |
| BBE (Bible in Basic English) | 850-word vocabulary, ~Grade 4 | Simple target |
| OEB (Open English Bible) | Modern, public domain | Simple target |

#### Simplification Pairs

| Complex Source | Simple Target |
|---|---|
| KJV | BBE |
| KJV | OEB |
| ASV | BBE |
| YLT | OEB |
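
Each pairing above is a verse-reference join: verses sharing the same (book, chapter, verse) key in the complex and simple translations become one `simplify:` training example. A minimal sketch (the verse-dict shape and helper name are illustrative):

```python
def build_pairs_chirho(complex_verses_chirho, simple_verses_chirho):
    """Join two translations on (book, chapter, verse) and emit
    'simplify:' input/target training examples."""
    simple_by_ref_chirho = {
        (v["book"], v["chapter"], v["verse"]): v["text"]
        for v in simple_verses_chirho
    }
    pairs_chirho = []
    for v in complex_verses_chirho:
        ref_chirho = (v["book"], v["chapter"], v["verse"])
        # Skip verses absent from either edition (verse numbering differs
        # slightly between translations).
        if ref_chirho in simple_by_ref_chirho:
            pairs_chirho.append({
                "input": "simplify: " + v["text"],
                "target": simple_by_ref_chirho[ref_chirho],
            })
    return pairs_chirho
```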

#### Data Source

Bible text sourced from **ScrollMapper Bible Databases** (public domain translations on GitHub).

#### Difficulty Scoring Labels

Labels are computed algorithmically from textual features:

- **Reading level** (1-12): Flesch-Kincaid-style grade estimate, adjusted upward for archaic vocabulary and a high ratio of uncommon words
- **Vocabulary complexity** (low/medium/high): Ratio of words outside a ~3,000-word common English vocabulary
- **Archaic forms** (count): Number of archaic English words detected (thee, thou, hath, doth, -eth/-est verb endings, etc.)
- **Difficulty** (easy/medium/hard): Composite score from reading level, vocabulary complexity, and archaic form count
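
As an illustration of the archaic-form count, a crude detector can combine a fixed word list with -eth/-est suffix matching. This is an approximation for exposition only: the word list below is a hypothetical subset, and naive suffix matching will occasionally misfire on modern words like "interest".

```python
import re

ARCHAIC_WORDS_CHIRHO = {  # illustrative subset, not the actual list used
    "thee", "thou", "thy", "thine", "ye", "hath", "doth",
    "shalt", "wilt", "art", "unto", "wherefore", "verily",
}
# Verbs ending in -eth/-est with at least three preceding letters.
SUFFIX_RE_CHIRHO = re.compile(r"\b[a-z]{3,}(?:eth|est)\b")

def count_archaic_forms_chirho(text_chirho):
    """Count archaic word tokens plus -eth/-est verb endings."""
    lowered_chirho = text_chirho.lower()
    words_chirho = re.findall(r"[a-z]+", lowered_chirho)
    word_hits_chirho = sum(1 for w in words_chirho if w in ARCHAIC_WORDS_CHIRHO)
    suffix_hits_chirho = len(SUFFIX_RE_CHIRHO.findall(lowered_chirho))
    return word_hits_chirho + suffix_hits_chirho
```

For example, "Verily, verily, I say unto thee" yields a count of 4 under this sketch (verily x2, unto, thee).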

## Usage

### Quick Start: Simplification

```python
# For God so loved the world that he gave his only begotten Son,
# that whoever believes in him should not perish but have eternal life. - John 3:16

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer_chirho = AutoTokenizer.from_pretrained("LoveJesus/passage-difficulty-simplifier-chirho")
model_chirho = AutoModelForSeq2SeqLM.from_pretrained("LoveJesus/passage-difficulty-simplifier-chirho")

input_text_chirho = "simplify: And the LORD God formed man of the dust of the ground, and breathed into his nostrils the breath of life; and man became a living soul."

inputs_chirho = tokenizer_chirho(input_text_chirho, return_tensors="pt", max_length=256, truncation=True)
outputs_chirho = model_chirho.generate(**inputs_chirho, max_length=256, num_beams=4, early_stopping=True)
result_chirho = tokenizer_chirho.decode(outputs_chirho[0], skip_special_tokens=True)

print(result_chirho)
# Expected: A simplified, modern English version of the verse
```

### Quick Start: Difficulty Scoring

```python
# For God so loved the world that he gave his only begotten Son,
# that whoever believes in him should not perish but have eternal life. - John 3:16

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import re

tokenizer_chirho = AutoTokenizer.from_pretrained("LoveJesus/passage-difficulty-simplifier-chirho")
model_chirho = AutoModelForSeq2SeqLM.from_pretrained("LoveJesus/passage-difficulty-simplifier-chirho")

input_text_chirho = "rate difficulty: For God so loved the world, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life."

inputs_chirho = tokenizer_chirho(input_text_chirho, return_tensors="pt", max_length=256, truncation=True)
outputs_chirho = model_chirho.generate(**inputs_chirho, max_length=256, num_beams=4, early_stopping=True)
raw_output_chirho = tokenizer_chirho.decode(outputs_chirho[0], skip_special_tokens=True)

print(raw_output_chirho)
# Expected: "reading_level: X | vocab_complexity: Y | archaic_forms: Z | difficulty: W"

# Parse structured output
reading_level_chirho = re.search(r"reading_level:\s*(\d+)", raw_output_chirho)
difficulty_chirho = re.search(r"difficulty:\s*(\w+)", raw_output_chirho)
vocab_chirho = re.search(r"vocab_complexity:\s*(\w+)", raw_output_chirho)
archaic_chirho = re.search(r"archaic_forms:\s*(\d+)", raw_output_chirho)

if reading_level_chirho:
    print(f"Reading Level: Grade {reading_level_chirho.group(1)}")
if difficulty_chirho:
    print(f"Difficulty: {difficulty_chirho.group(1)}")
if vocab_chirho:
    print(f"Vocabulary Complexity: {vocab_chirho.group(1)}")
if archaic_chirho:
    print(f"Archaic Forms: {archaic_chirho.group(1)}")
```

### Batch Inference

```python
# For God so loved the world that he gave his only begotten Son,
# that whoever believes in him should not perish but have eternal life. - John 3:16

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer_chirho = AutoTokenizer.from_pretrained("LoveJesus/passage-difficulty-simplifier-chirho")
model_chirho = AutoModelForSeq2SeqLM.from_pretrained("LoveJesus/passage-difficulty-simplifier-chirho")
model_chirho.eval()

verses_chirho = [
    "simplify: Verily, verily, I say unto thee, Except a man be born again, he cannot see the kingdom of God.",
    "simplify: Wherefore, as by one man sin entered into the world, and death by sin; and so death passed upon all men, for that all have sinned:",
    "rate difficulty: In the beginning God created the heaven and the earth.",
    "rate difficulty: Jesus wept.",
]

inputs_chirho = tokenizer_chirho(verses_chirho, return_tensors="pt", max_length=256, truncation=True, padding=True)

with torch.no_grad():
    outputs_chirho = model_chirho.generate(**inputs_chirho, max_length=256, num_beams=4, early_stopping=True)

results_chirho = tokenizer_chirho.batch_decode(outputs_chirho, skip_special_tokens=True)

for verse_chirho, result_chirho in zip(verses_chirho, results_chirho):
    print(f"Input:  {verse_chirho}")
    print(f"Output: {result_chirho}\n")
```

## Evaluation

### Metrics

| Task | Metric | Description |
|---|---|---|
| Difficulty Scoring | `difficulty_accuracy_chirho` | Exact match on easy/medium/hard label |
| Difficulty Scoring | Reading level MAE | Mean absolute error on grade level (1-12) |
| Difficulty Scoring | Vocab complexity accuracy | Exact match on low/medium/high |
| Simplification | BLEU | Corpus-level BLEU score (sacrebleu) |
| Simplification | BERTScore F1 | Semantic similarity to reference simplifications |
| Simplification | Exact match | Proportion of predictions matching reference exactly |
| Combined | `combined_score_chirho` | 0.4 * difficulty_accuracy + 0.6 * simplification_exact_match |
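
Plugging in the v2 numbers reproduces the reported combined score:

```python
# Combined score = 0.4 * difficulty accuracy + 0.6 * simplification exact match
difficulty_accuracy_chirho = 0.9377
simplification_exact_match_chirho = 0.0050  # 0.50%
combined_score_chirho = (0.4 * difficulty_accuracy_chirho
                         + 0.6 * simplification_exact_match_chirho)
print(round(combined_score_chirho, 4))  # 0.3781
```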

### Results (v2 - flan-t5-base upgrade)

| Metric | Score |
|---|---|
| **Eval loss** | **2.228** (best at epoch 3) |
| **Difficulty accuracy** | **93.8%** |
| **Simplification exact match** | 0.50% |
| **Combined score** | **0.378** |
| Train loss | 1.964 |
| Hardware | NVIDIA H200 (143GB), ~64 min |

### Training Trajectory

| Epoch | Eval Loss | Difficulty Acc | Combined Score |
|-------|-----------|----------------|----------------|
| 1 | 2.282 | 87.1% | 0.351 |
| 2 | 2.244 | 91.9% | 0.370 |
| **3** | **2.228** | 93.8% | 0.378 |
| 4 | 2.236 | 94.7% | 0.382 |
| 5 | 2.241 | 94.8% | 0.382 |

Best model selected by lowest eval_loss (epoch 3). Difficulty accuracy continued improving through epoch 5 but loss began increasing at epoch 4, indicating mild overfitting on the simplification task.

## Try It Live

**[Interactive Demo on HuggingFace Spaces](https://huggingface.co/spaces/LoveJesus/passage-difficulty-simplifier-chirho)**

The Gradio-powered demo provides two tabs:
- **Simplify**: Enter any Bible verse and receive a plain-language version
- **Difficulty**: Enter a verse and get reading level, vocabulary complexity, archaic form count, and overall difficulty

## Limitations

- Trained exclusively on Bible text; does not generalize to other literary or domain-specific texts
- Simplification quality varies by verse length and complexity; very long passages may be truncated
- Difficulty scoring labels are algorithmically generated (not human-annotated), which introduces systematic biases
- The base model size (248M params) was chosen to balance accuracy against inference cost; larger models would likely produce more fluent simplifications
- Simplification targets (BBE, OEB) have their own translation biases; outputs reflect those stylistic choices
- Archaic form detection relies on a fixed word list and may miss uncommon archaic constructions
- The model does not preserve verse references or theological nuance; it is a readability tool, not a study Bible

## Intended Use

- Bible study tools that need plain-language paraphrasing of archaic translations
- Reading level assessment for curriculum planning or children's ministry materials
- Accessibility applications that present Bible text at appropriate reading levels
- Research into text simplification for historical English

## Out-of-Scope Use

- Replacing authoritative Bible translations for doctrinal study
- General-purpose text simplification outside of biblical literature
- Machine translation between languages (this model operates only in English)

## Model Architecture

```
google/flan-t5-base (Encoder-Decoder)
  Encoder: 12 layers, 12 heads, d_model=768
  Decoder: 12 layers, 12 heads, d_model=768
  Total parameters: ~248M (all trainable, full fine-tuning)
  Vocabulary: SentencePiece, 32,128 tokens
```

## Repository Structure

```
passage-difficulty-simplifier-chirho/
  src-chirho/
    train-chirho/train-simplifier-chirho.py    # Training script
    eval-chirho/evaluate-chirho.py             # Evaluation script
    data-chirho/build-simplifier-dataset-chirho.ts  # Dataset builder (Bun/TS)
    data-chirho/download-translations-chirho.ts     # Translation downloader
    upload-hf-chirho.py                        # HuggingFace upload script
  space-chirho/
    app.py                                     # Gradio demo application
  data-chirho/
    raw-chirho/                                # Raw Bible CSVs
    processed-chirho/                          # JSONL train/val/test splits
  models-chirho/
    simplifier-chirho/best-chirho/             # Best checkpoint
  cards-chirho/
    simplifier-card-chirho.md                  # This model card
  config-chirho.yaml                           # Training configuration
  spec-chirho/
    progress-chirho.sqlite                     # Agent progress log
```

## Training Reproducibility

```bash
# 1. Download Bible translations
cd passage-difficulty-simplifier-chirho
bun run src-chirho/data-chirho/download-translations-chirho.ts

# 2. Build dual-task dataset
bun run src-chirho/data-chirho/build-simplifier-dataset-chirho.ts

# 3. Train model
python src-chirho/train-chirho/train-simplifier-chirho.py

# 4. Evaluate
python src-chirho/eval-chirho/evaluate-chirho.py

# 5. Upload to HuggingFace
python src-chirho/upload-hf-chirho.py
```

## License

MIT

## Citation

```bibtex
@misc{lovejesus2026passagedifficultysimplifier,
  title={Passage Difficulty Scorer & Plain-Language Simplifier: Multi-Task Flan-T5 for Bible Readability},
  author={loveJesus},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/LoveJesus/passage-difficulty-simplifier-chirho}
}
```

---

Built with love for Jesus. Published by [loveJesus](https://huggingface.co/LoveJesus).