# Koine-Greek-BERT v0.1
A BERT-base model for biblical Koine Greek, produced by continued masked-language-model pre-training of Greek-BERT on a curated corpus of New Testament, Septuagint, Apocrypha, Apostolic Fathers, and other early Koine sources.
## Quick start

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import unicodedata

REPO = "ABeZet/Koine-Greek-BERT"
tok = AutoTokenizer.from_pretrained(REPO)
mdl = AutoModelForMaskedLM.from_pretrained(REPO)

# IMPORTANT: pre-normalize input (see "Input requirements" below)
def normalize(s):
    s = unicodedata.normalize("NFD", s)
    s = "".join(c for c in s if unicodedata.category(c) != "Mn")  # strip combining marks
    return s.lower()

# Normalize only the Greek text; lowercasing "[MASK]" would break the special token.
text = normalize("Ἐν ἀρχῇ ἦν ὁ") + f" {tok.mask_token}."
out = mdl(**tok(text, return_tensors="pt"))
```
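To read off the model's guesses for the masked slot, a usage sketch (exact predictions depend on the checkpoint you download):

```python
import torch

enc = tok(text, return_tensors="pt")
with torch.no_grad():
    logits = mdl(**enc).logits
# Position of the [MASK] token, then its top-5 vocabulary candidates.
mask_pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero(as_tuple=True)[0]
top5 = torch.topk(logits[0, mask_pos[0]], k=5).indices
print(tok.convert_ids_to_tokens(top5.tolist()))  # the card reports λογος as top-1 here
```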
## Input requirements
The model was trained on monotonic, accent-stripped, lowercase Greek text. Inputs MUST be pre-normalized to this canonical form before tokenization, because:
- The vocabulary inherited from Greek-BERT contains only unaccented Greek characters (α-ω, no diacritics).
- In `transformers >= 5.0`, the Greek-BERT tokenizer no longer applies `do_lower_case`/`strip_accents` automatically. Without pre-normalization, accented Greek words encode to `[UNK]` and the model produces meaningless output.
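A quick way to see the failure mode (illustrative; the exact token output depends on your installed `transformers` version):

```python
print(tok.tokenize("Ἐν ἀρχῇ"))             # accented input: may come back as ['[UNK]', '[UNK]']
print(tok.tokenize(normalize("Ἐν ἀρχῇ")))  # normalized: e.g. ['εν', 'αρχη']
```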
The normalization pipeline (matches the training preprocessing exactly):

```python
import unicodedata

def normalize_line(text: str) -> str:
    # Normalize apostrophe variants to ASCII (preserves Greek elision forms)
    apostrophe_variants = "ʼ’ʹ᾽᾿ʾʿ"
    text = "".join("'" if ch in apostrophe_variants else ch for ch in text)
    # NFD + strip combining marks + lowercase
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    return text.lower()
```
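Applied to John 1:1, for example:

```python
print(normalize_line("Ἐν ἀρχῇ ἦν ὁ λόγος"))
# -> εν αρχη ην ο λογος
```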
## Training corpus
~2.85M tokens of biblical and early Christian Koine Greek:
| Source | License | Notes |
|---|---|---|
| MorphGNT (SBLGNT text) | CC BY-SA 3.0 | New Testament |
| Byzantine Majority Text | CC BY 4.0 | NT, second textual witness |
| Swete LXX | CC BY-SA 4.0 | Greek Septuagint |
| First1KGreek (selected) | CC BY-SA 4.0 | Apostolic Fathers, NT Apocrypha, Pseudepigrapha |
| Perseus (selected) | CC BY-SA 4.0 | Philo, Origen, Clement, Josephus |
All sources are publicly available; per-work license metadata is in `data/processed/manifest.json` of the source repository.
## Training details
- Base: `nlpaueb/bert-base-greek-uncased-v1` (vocab 35,000, hidden 768, 12 layers)
- Strategy: continued MLM pre-training, no vocab extension, no architectural changes
- Hyperparameters (see the `TrainingArguments` sketch after this list):
  - learning rate: 5e-5 (half of Greek-BERT's original 1e-4)
  - epochs: 5
  - effective batch size: 64 (32 × 2 gradient accumulation)
  - max sequence length: 128
  - warmup ratio: 0.06
  - weight decay: 0.01
  - MLM probability: 0.15
  - optimizer: AdamW (Trainer default), fp16 mixed precision
  - hardware: Google Colab T4 GPU
  - total steps: 5,920 (~5 epochs over ~76k training lines)
  - seed: 42
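A minimal sketch of an equivalent setup, assuming the standard `Trainer` + `DataCollatorForLanguageModeling` recipe (the `output_dir` and the `train_ds` dataset are hypothetical; this is not the exact training script):

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "nlpaueb/bert-base-greek-uncased-v1"
tok = AutoTokenizer.from_pretrained(base)
mdl = AutoModelForMaskedLM.from_pretrained(base)

args = TrainingArguments(
    output_dir="koine-bert-v0.1",      # hypothetical path
    learning_rate=5e-5,
    num_train_epochs=5,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=2,     # effective batch size 64
    warmup_ratio=0.06,
    weight_decay=0.01,
    fp16=True,
    seed=42,
)
collator = DataCollatorForLanguageModeling(tok, mlm_probability=0.15)
# train_ds: pre-normalized lines tokenized to max_length=128 (not shown)
# trainer = Trainer(model=mdl, args=args, data_collator=collator,
#                   train_dataset=train_ds)
# trainer.train()
```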
## Evaluation

Masked-LM perplexity on a held-out 5% split of the corpus (3,988 lines). Both models were evaluated with Greek-BERT's tokenizer.
| Environment | Greek-BERT perplexity (baseline) | Koine-BERT v0.1 perplexity | Δ |
|---|---|---|---|
| Colab T4, fp16, transformers 5.0 | 174.52 | 7.79 | ~22× reduction |
| Local CPU, fp32, transformers 4.57 | 179.70 | 10.49 | ~17× reduction |
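Perplexity here is presumably the standard exp of mean masked-token cross-entropy; a minimal evaluation sketch, assuming a dataloader whose batches were masked by the same 15% collator (the function and loop are illustrative, not the exact evaluation script):

```python
import math
import torch

def mlm_perplexity(model, dataloader):
    """exp(mean cross-entropy over masked positions) across the held-out set."""
    total_loss, total_masked = 0.0, 0
    model.eval()
    with torch.no_grad():
        for batch in dataloader:  # each batch: masked input_ids, attention_mask, labels
            out = model(**batch)
            n = (batch["labels"] != -100).sum().item()  # count masked positions
            total_loss += out.loss.item() * n
            total_masked += n
    return math.exp(total_loss / total_masked)
```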
Across environments, this is a 17–22× perplexity reduction. A cloze test over six iconic biblical sentences shows exact top-1 hits on John 1:1 (λογος) and the Didache (πλησιον), and substantial syntactic improvements on Rom 3:23 and 1 Cor 13:13.
## Known limitations
- Polytonic input is not supported. Inputs must be pre-normalized as described above. We did not extend the vocabulary to include accented forms in v0.1; that is planned for v1.0.
- Vocab unchanged from Greek-BERT. Some biblical-specific terms (rare proper nouns, technical theological vocabulary) are still split into many subword pieces; see the probe sketch after this list.
- No downstream task evaluation in v0.1. POS / lemma / NER probe evaluation using MorphGNT labels is planned for v1.0.
- Corpus size is at the lower end of the effective DAPT range. ~2.85M tokens is sufficient for adapting a 110M-parameter BERT to a closely related domain (Modern Greek → biblical Koine), but rare biblical terms may not be fully captured.
- No catastrophic-forgetting evaluation on Modern Greek. Use this model for biblical/Koine text; for Modern Greek, use the original Greek-BERT.
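To check how heavily a given term is segmented (illustrative; the probe words are hypothetical examples, and the actual splits depend on the inherited vocabulary):

```python
for word in ["λογος", "μελχισεδεκ"]:  # already normalized
    print(word, "->", tok.tokenize(word))
```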
## Intended use
- Fill-mask predictions on monotonic biblical Koine text.
- Feature extraction (sentence/token embeddings) for downstream Koine NLP tasks, e.g. semantic similarity, clustering, or as a frozen encoder for classifiers/probes (see the pooling sketch after this list).
- Research applications on the New Testament, Septuagint, Apostolic Fathers, and other early Christian Greek texts.
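As one illustration of the feature-extraction use case (a sketch; mean pooling over non-padding tokens is a common choice, not a recommendation stated by this card):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("ABeZet/Koine-Greek-BERT")
encoder = AutoModel.from_pretrained("ABeZet/Koine-Greek-BERT")

def embed(texts):
    # Texts must already be normalized (see "Input requirements").
    batch = tok(texts, padding=True, truncation=True, max_length=128,
                return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state       # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)           # mean over real tokens
```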
## Out of scope

- Modern Greek tasks: use `nlpaueb/bert-base-greek-uncased-v1`.
- Polytonic Greek without pre-normalization (will produce `[UNK]`-only output in `transformers` 5.0+).
- Generative tasks: this is an encoder-only MLM, not a generative model.
- Authoritative theological or textual-critical claims; the model reflects patterns in its training corpus and should not be treated as a substitute for scholarly judgment.
## Roadmap (v1.0)
- Polytonic vocab extension (so the model handles accented input natively).
- Larger Tier-3 corpus (more Philo, Josephus, more Fathers).
- Downstream evaluation suite: POS/lemma probe via MorphGNT, NER, semantic similarity benchmarks.
## Citation
If you use this model in research, please cite both Greek-BERT and this adaptation:
```bibtex
@inproceedings{koutsikakis-etal-2020-greek,
  title     = "{GREEK-BERT}: The Greeks visiting Sesame Street",
  author    = "Koutsikakis, John and Chalkidis, Ilias and Malakasiotis, Prodromos and Androutsopoulos, Ion",
  booktitle = "11th Hellenic Conference on Artificial Intelligence",
  year      = "2020",
}

@misc{koine-bert-v0.1,
  title  = {Koine-Greek-BERT v0.1: Continued MLM pre-training of Greek-BERT on biblical Koine},
  author = {Ziemińska, Agnieszka B.},
  year   = {2026},
  note   = {Domain-adapted from \texttt{nlpaueb/bert-base-greek-uncased-v1}.},
}
```
## License
CC BY-SA 4.0, matching the most restrictive corpus source license. The Greek-BERT base model is released by AUEB; please also attribute it.