
Koine-Greek-BERT v0.1

A BERT-base model for biblical Koine Greek, produced by continued masked-language-model pre-training of Greek-BERT on a curated corpus of New Testament, Septuagint, Apocrypha, Apostolic Fathers, and other early Koine sources.

Quick start

from transformers import AutoTokenizer, AutoModelForMaskedLM

REPO = "ABeZet/Koine-Greek-BERT"
tok = AutoTokenizer.from_pretrained(REPO)
mdl = AutoModelForMaskedLM.from_pretrained(REPO)

# IMPORTANT: pre-normalize input (see "Input requirements" below)
import unicodedata
def normalize(s):
    s = unicodedata.normalize("NFD", s)
    s = "".join(c for c in s if unicodedata.category(c) != "Mn")  # strip combining marks
    return s.lower()

text = normalize("Ἐν ἀρχῇ ἦν ὁ") + " " + tok.mask_token + "."  # keep [MASK] outside normalize(), which would lowercase it
enc = tok(text, return_tensors="pt")
out = mdl(**enc)
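
The forward pass returns masked-LM logits. A minimal way to read off the model's top guess for the masked slot, reusing `tok`, `enc`, and `out` from above (the cloze test in the Evaluation section reports λογος as the top-1 completion for this John 1:1 prompt):

mask_idx = (enc["input_ids"][0] == tok.mask_token_id).nonzero(as_tuple=True)[0]
top_id = out.logits[0, mask_idx].argmax(dim=-1)
print(tok.decode(top_id))  # e.g. "λογος"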

Input requirements

The model was trained on accent-stripped (unaccented), lowercase Greek text. Inputs MUST be pre-normalized to this canonical form before tokenization, because:

  1. The vocabulary inherited from Greek-BERT contains only unaccented Greek characters (α-ω, no diacritics).
  2. In transformers >= 5.0, the Greek-BERT tokenizer no longer applies do_lower_case/strip_accents automatically. Without pre-normalization, accented Greek words encode to [UNK] and the model produces meaningless output.

The normalization pipeline (matches the training preprocessing exactly):

import unicodedata

def normalize_line(text: str) -> str:
    # Normalize apostrophe variants to ASCII (preserves Greek elision forms)
    apostrophe_variants = "ʼ’ʹ᾽᾿ʾʿ"
    text = "".join("'" if ch in apostrophe_variants else ch for ch in text)
    # NFD + strip combining marks + lowercase
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    return text.lower()
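
For example, applying this to a polytonic verse yields the canonical form (output shown for illustration):

>>> normalize_line("Ἐν ἀρχῇ ἦν ὁ λόγος.")
'εν αρχη ην ο λογος.'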

Training corpus

~2.85M tokens of biblical and early Christian Koine Greek:

Source                   | License      | Notes
MorphGNT (SBLGNT text)   | CC BY-SA 3.0 | New Testament
Byzantine Majority Text  | CC BY 4.0    | NT, second textual witness
Swete LXX                | CC BY-SA 4.0 | Greek Septuagint
First1KGreek (selected)  | CC BY-SA 4.0 | Apostolic Fathers, NT Apocrypha, Pseudepigrapha
Perseus (selected)       | CC BY-SA 4.0 | Philo, Origen, Clement, Josephus

All sources publicly available; per-work license metadata in data/processed/manifest.json of the source repository.

Training details

  • Base: nlpaueb/bert-base-greek-uncased-v1 (vocab 35,000, hidden 768, 12 layers)
  • Strategy: continued MLM pre-training, no vocab extension, no architectural changes
  • Hyperparameters (see the Trainer sketch after this list):
    • learning rate: 5e-5 (½ of Greek-BERT's original 1e-4)
    • epochs: 5
    • effective batch size: 64 (32 × 2 grad accumulation)
    • max sequence length: 128
    • warmup ratio: 0.06
    • weight decay: 0.01
    • mlm probability: 0.15
    • optimizer: AdamW (Trainer default), fp16 mixed precision
    • hardware: Google Colab T4 GPU
    • total steps: 5,920 (~5 epochs over ~76k training lines)
    • seed: 42
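
For reference, this setup corresponds roughly to the following Hugging Face Trainer sketch; the file names and dataset handling are illustrative assumptions, not the exact training script.

from transformers import (
    AutoTokenizer, AutoModelForMaskedLM,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)
from datasets import load_dataset

BASE = "nlpaueb/bert-base-greek-uncased-v1"
tok = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForMaskedLM.from_pretrained(BASE)

# Pre-normalized, one-line-per-sentence corpus files (hypothetical paths)
ds = load_dataset("text", data_files={"train": "train.txt", "validation": "eval.txt"})
ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=128),
            batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tok, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="koine-bert-v0.1",
    num_train_epochs=5,
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=2,   # effective batch size 64
    warmup_ratio=0.06,
    weight_decay=0.01,
    fp16=True,
    seed=42,
)

trainer = Trainer(model=model, args=args, train_dataset=ds["train"],
                  eval_dataset=ds["validation"], data_collator=collator)
trainer.train()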

Evaluation

Masked-LM perplexity on a held-out 5% of the corpus (3,988 lines); both models are evaluated with Greek-BERT's tokenizer.

Environment                        | Greek-BERT (baseline) | Koine-BERT v0.1 | Δ
Colab T4, fp16, transformers 5.0   | 174.52                | 7.79            | ~22× reduction
Local CPU, fp32, transformers 4.57 | 179.70                | 10.49           | ~17× reduction

A 17–22× perplexity reduction holds across both environments. A cloze test on six well-known biblical sentences shows direct top-1 hits for John 1:1 (λογος) and the Didache (πλησιον), and substantial syntactic improvements on Rom 3:23 and 1 Cor 13:13.
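
For context, one standard way to obtain such perplexity figures is to exponentiate the mean masked-LM cross-entropy on the held-out split; treating this as the protocol behind the numbers above is an assumption.

import math

# `trainer` is the object from the Training details sketch, with the held-out
# 5% as its eval_dataset.
metrics = trainer.evaluate()
print(f"perplexity: {math.exp(metrics['eval_loss']):.2f}")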

Known limitations

  • Polytonic input is not supported. Inputs must be pre-normalized as described above. We did not extend the vocabulary to include accented forms in v0.1 — that's planned for v1.0.
  • Vocab unchanged from Greek-BERT. Some biblical-specific terms (rare proper nouns, technical theological vocabulary) still subword-split heavily.
  • No downstream task evaluation in v0.1. POS / lemma / NER probe evaluation using MorphGNT labels is planned for v1.0.
  • Corpus size is at the lower end of the effective range for domain-adaptive pre-training (DAPT). ~2.85M tokens is sufficient for adapting a 110M-parameter BERT to a closely related domain (Modern Greek → biblical Koine), but rare biblical terms may not be fully captured.
  • No catastrophic-forgetting evaluation on Modern Greek. Use this model for biblical/Koine text; for Modern Greek, use the original Greek-BERT.

Intended use

  • Fill-mask predictions on monotonic biblical Koine text.
  • Feature extraction (sentence/token embeddings) for downstream Koine NLP tasks, e.g. semantic similarity, clustering, or as a frozen encoder for classifiers/probes (see the sketch after this list).
  • Research applications on the New Testament, Septuagint, Apostolic Fathers, and other early Christian Greek texts.
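
A minimal sketch of the feature-extraction use above (mean-pooled sentence embeddings from the frozen encoder; the pooling choice is illustrative, not something the card prescribes):

import torch
from transformers import AutoTokenizer, AutoModel

REPO = "ABeZet/Koine-Greek-BERT"
tok = AutoTokenizer.from_pretrained(REPO)
encoder = AutoModel.from_pretrained(REPO)   # encoder only, no MLM head
encoder.eval()

def embed(sentences):
    # Sentences must already be normalized (see "Input requirements").
    batch = tok(sentences, padding=True, truncation=True, max_length=128,
                return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state        # (batch, seq, 768)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # zero out padding
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # mean pooling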

Out of scope

  • Modern Greek tasks — use nlpaueb/bert-base-greek-uncased-v1.
  • Polytonic Greek without pre-normalization (will produce [UNK]-only output in transformers 5.0+).
  • Generative tasks — this is an encoder-only MLM, not a generative model.
  • Authoritative theological or textual-critical claims; the model reflects patterns in its training corpus and should not be treated as a substitute for scholarly judgment.

Roadmap (v1.0)

  • Polytonic vocab extension (so the model handles accented input natively).
  • Larger Tier-3 corpus (more Philo, Josephus, more Fathers).
  • Downstream evaluation suite: POS/lemma probe via MorphGNT, NER, semantic similarity benchmarks.

Citation

If you use this model in research, please cite both Greek-BERT and this adaptation:

@inproceedings{koutsikakis-etal-2020-greek,
  title     = "{GREEK-BERT}: The Greeks visiting Sesame Street",
  author    = "Koutsikakis, John and Chalkidis, Ilias and Malakasiotis, Prodromos and Androutsopoulos, Ion",
  booktitle = "11th Hellenic Conference on Artificial Intelligence",
  year      = "2020",
}

@misc{koine-bert-v0.1,
  title  = {Koine-Greek-BERT v0.1: Continued MLM pre-training of Greek-BERT on biblical Koine},
  author = {Ziemińska, Agnieszka B.},
  year   = {2026},
  note   = {Domain-adapted from \texttt{nlpaueb/bert-base-greek-uncased-v1}.},
}

License

CC BY-SA 4.0, matching the most restrictive corpus source license. The Greek-BERT base model is released by AUEB; please also attribute it.
