Koine-Greek-BERT (v1.0)
Domain-adapted BERT model for Polytonic Koine Greek, fine-tuned from pranaydeeps/Ancient-Greek-BERT. This model has been explicitly adapted to correctly tokenize and embed biblical and extra-biblical Koine Greek texts (e.g., the Septuagint, the New Testament, Apostolic Fathers, and Hellenistic historians) with full polytonic accentuation.
Model Description
- Base Model:
pranaydeeps/Ancient-Greek-BERT - Architecture: BERT (124M parameters)
- Vocabulary: Extended from 35,000 (Monotonic) to 50,000 (Polytonic).
- Language: Koine Greek (
grc) - Primary Objective: Masked Language Modeling (MLM)
Version History
- v1.0 (Current): Base model switched to
pranaydeeps/Ancient-Greek-BERT. Full 50K polytonic vocabulary extension with smart-initialized embeddings. Two-phase domain-adaptive pre-training (Phase 1: embedding warm-up; Phase 2: full model tuning). - v0.1 (Deprecated): Initial proof-of-concept based on
nlpaueb/bert-base-greek-uncased-v1(Modern Greek BERT). 35K monotonic vocabulary.
Performance & Evaluation
The model was evaluated on a held-out Koine validation set (val_v1.txt, 8,881 sequences) using single-GPU FP32 math.
| Metric | Phase 1 (Warmup) | Phase 2 (Final v1.0) |
|---|---|---|
| Eval Loss | 2.5946 | 2.3423 |
| Perplexity | 13.39 | 10.41 |
Zero-Shot Cloze Tests
The model demonstrates a strong grasp of biblical vocabulary and polytonic forms:
- John 1:1 (
ἐν ἀρχῇ ἦν ὁ [MASK]): Predicts λόγος (Rank 1) - Mt 22:39 (
ἀγαπήσεις τὸν [MASK] σου ὡς σεαυτόν): Predicts πλησίον (Rank 1)
(Note: The model occasionally favors alternative correct accentuation variants, e.g., οὐρανὸν vs οὐρανόν depending on underlying textual traditions).
Usage
You can use this model directly for Masked Language Modeling or fine-tune it for downstream tasks like authorship attribution (scribal fingerprinting) or text classification.
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("ABeZet/Koine-Greek-BERT")
model = AutoModelForMaskedLM.from_pretrained("ABeZet/Koine-Greek-BERT")
from transformers import pipeline
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("ἐν ἀρχῇ ἦν ὁ [MASK]"))
Training Methodology (DAPT)
The model was trained using a custom two-phase Domain-Adaptive Pre-Training (DAPT) pipeline:
- Vocabulary Extension: 15,000 new polytonic tokens were extracted from a Koine corpus. Their embeddings were smart-initialized by averaging the embeddings of their sub-word components from the base model.
- Phase 1 (Warm-up): The base model weights were frozen, and only the new token embeddings were trained for 4 epochs (learning rate 5e-4) to align them with the existing latent space.
- Phase 2 (Full Tuning): The entire model was unfrozen and trained for 4 epochs (learning rate 5e-5, FP32 to prevent gradient overflow) over the full Koine corpus.
Citation
If you use this model in research, please cite both the original Greek-BERT authors and this Koine-specific adaptation:
@inproceedings{koutsikakis-etal-2020-greek,
title = "{GREEK-BERT}: The Greeks visiting Sesame Street",
author = "Koutsikakis, John and Chalkidis, Ilias and Malakasiotis, Prodromos and Androutsopoulos, Ion",
booktitle = "11th Hellenic Conference on Artificial Intelligence",
year = "2020",
}
@misc{koine-bert-v1.0,
title = {Koine-Greek-BERT v1.0: Polytonic domain-adaptive pre-training of Greek-BERT on biblical Koine},
author = {Ziemińska, Agnieszka B.},
year = {2026},
note = {Domain-adapted from \texttt{nlpaueb/bert-base-greek-uncased-v1}.},
}
- Downloads last month
- 30