Koine-Greek-BERT (v1.0)

Domain-adapted BERT model for polytonic Koine Greek, built on pranaydeeps/Ancient-Greek-BERT through continued pre-training. It is designed to tokenize and embed biblical and extra-biblical Koine Greek texts (e.g., the Septuagint, the New Testament, the Apostolic Fathers, and Hellenistic historians) with full polytonic accentuation.

Model Description

  • Base Model: pranaydeeps/Ancient-Greek-BERT
  • Architecture: BERT (124M parameters)
  • Vocabulary: Extended from 35,000 (Monotonic) to 50,000 (Polytonic).
  • Language: Koine Greek (grc)
  • Primary Objective: Masked Language Modeling (MLM)

Version History

  • v1.0 (Current): Base model switched to pranaydeeps/Ancient-Greek-BERT. Full 50K polytonic vocabulary extension with smart-initialized embeddings. Two-phase domain-adaptive pre-training (Phase 1: embedding warm-up; Phase 2: full model tuning).
  • v0.1 (Deprecated): Initial proof-of-concept based on nlpaueb/bert-base-greek-uncased-v1 (Modern Greek BERT). 35K monotonic vocabulary.

Performance & Evaluation

The model was evaluated on a held-out Koine validation set (val_v1.txt, 8,881 sequences) in FP32 on a single GPU.

| Metric     | Phase 1 (Warm-up) | Phase 2 (Final v1.0) |
|------------|-------------------|----------------------|
| Eval Loss  | 2.5946            | 2.3423               |
| Perplexity | 13.39             | 10.41                |
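
The reported perplexities match the exponential of the evaluation loss, which is the usual relationship when the loss is the mean token-level cross-entropy in nats. The card does not state how the loss was computed, so the check below is a sketch under that assumption; the function name `perplexity_from_loss` is illustrative.

```python
import math


def perplexity_from_loss(eval_loss: float) -> float:
    """Convert a mean cross-entropy loss (assumed to be in nats) to perplexity."""
    return math.exp(eval_loss)


# Eval losses taken from the table above.
phase1_ppl = perplexity_from_loss(2.5946)
phase2_ppl = perplexity_from_loss(2.3423)

print(f"Phase 1 perplexity: {phase1_ppl:.2f}")
print(f"Phase 2 perplexity: {phase2_ppl:.2f}")
```

Both values agree with the table to two decimal places, so the Phase 2 gain corresponds to a loss reduction of roughly 0.25 nats per masked token.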

Zero-Shot Cloze Tests

In zero-shot cloze tests on two well-known verses, the model ranks the expected polytonic form first:

  • Jn 1:1 (ἐν ἀρχῇ ἦν ὁ [MASK]): Predicts λόγος (Rank 1)
  • Mt 22:39 (ἀγαπήσεις τὸν [MASK] σου ὡς σεαυτόν): Predicts πλησίον (Rank 1)

(Note: The model occasionally favors alternative correct accentuation variants, e.g., οὐρανὸν vs οὐρανόν depending on underlying textual traditions).
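
Because grave and acute variants of the same word can both be acceptable, a cloze check may want to treat them as equal when looking up the rank of an expected token. The sketch below is one possible way to do that and is not part of the released model. It assumes predictions shaped like the output of the transformers fill-mask pipeline (a list of dicts with a `token_str` key), and the helper names `fold_grave` and `rank_of` are hypothetical.

```python
import unicodedata

GRAVE = "\u0300"  # combining grave accent (varia)
ACUTE = "\u0301"  # combining acute accent (oxia / tonos)


def fold_grave(word: str) -> str:
    """Map grave accents to acute so that e.g. οὐρανὸν and οὐρανόν compare equal."""
    decomposed = unicodedata.normalize("NFD", word)
    return unicodedata.normalize("NFC", decomposed.replace(GRAVE, ACUTE))


def rank_of(predictions, expected, fold=True):
    """Return the 1-based rank of `expected` among predictions, or None if absent."""
    key = fold_grave if fold else (lambda s: s)
    target = key(expected)
    for rank, pred in enumerate(predictions, start=1):
        if key(pred["token_str"].strip()) == target:
            return rank
    return None


# Hand-written stand-in for pipeline output; the scores are illustrative only.
example_predictions = [
    {"token_str": "οὐρανὸν", "score": 0.5},
    {"token_str": "λόγος", "score": 0.1},
]
print(rank_of(example_predictions, "οὐρανόν"))              # accent-folded match
print(rank_of(example_predictions, "οὐρανόν", fold=False))  # strict match
```

With real predictions, `fill_mask(text, top_k=...)` would supply the list in place of `example_predictions`.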

Usage

You can use this model directly for Masked Language Modeling or fine-tune it for downstream tasks like authorship attribution (scribal fingerprinting) or text classification.

from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Load the extended polytonic tokenizer and the masked-LM weights from the Hub.
tokenizer = AutoTokenizer.from_pretrained("ABeZet/Koine-Greek-BERT")
model = AutoModelForMaskedLM.from_pretrained("ABeZet/Koine-Greek-BERT")

# Rank candidate fillers for the masked token (John 1:1).
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("ἐν ἀρχῇ ἦν ὁ [MASK]"))

Training Methodology (DAPT)

The model was trained using a custom two-phase Domain-Adaptive Pre-Training (DAPT) pipeline:

  1. Vocabulary Extension: 15,000 new polytonic tokens were extracted from a Koine corpus. Their embeddings were smart-initialized by averaging the embeddings of their sub-word components from the base model.
  2. Phase 1 (Warm-up): The base model weights were frozen, and only the new token embeddings were trained for 4 epochs (learning rate 5e-4) to align them with the existing latent space.
  3. Phase 2 (Full Tuning): The entire model was unfrozen and trained for 4 epochs (learning rate 5e-5, FP32 to prevent gradient overflow) over the full Koine corpus.
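
The averaging used in the vocabulary-extension step can be illustrated with a toy example. This is a minimal sketch, not the author's pipeline: plain Python lists stand in for the embedding matrix, `toy_split` is a hypothetical stand-in for the base tokenizer, and falling back to the `[UNK]` vector when no piece is known is an assumption made here for completeness.

```python
from statistics import fmean


def init_new_embeddings(base_vocab, base_embeddings, new_tokens, split_into_pieces):
    """Return one embedding row per new token: the mean of its sub-word piece vectors."""
    unk_vector = base_embeddings[base_vocab["[UNK]"]]
    new_rows = []
    for token in new_tokens:
        # Keep only the pieces the base vocabulary actually contains.
        pieces = [p for p in split_into_pieces(token) if p in base_vocab]
        vectors = [base_embeddings[base_vocab[p]] for p in pieces] or [unk_vector]
        # Average dimension by dimension across the piece vectors.
        new_rows.append([fmean(dim) for dim in zip(*vectors)])
    return new_rows


# Toy base model: three tokens with two-dimensional embeddings.
base_vocab = {"[UNK]": 0, "λογ": 1, "##ος": 2}
base_embeddings = [[0.0, 0.0], [1.0, 3.0], [3.0, 5.0]]


def toy_split(token):
    """Hypothetical sub-word segmentation, standing in for the base tokenizer."""
    return {"λόγος": ["λογ", "##ος"]}.get(token, [])


rows = init_new_embeddings(base_vocab, base_embeddings, ["λόγος"], toy_split)
print(rows)  # [[2.0, 4.0]]
```

With the transformers library, the analogous steps would typically go through `tokenizer.add_tokens` and `model.resize_token_embeddings` before the new rows are overwritten. The card does not say whether the training pipeline did so.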

Citation

If you use this model in research, please cite both the original Greek-BERT authors and this Koine-specific adaptation:

@inproceedings{koutsikakis-etal-2020-greek,
  title     = "{GREEK-BERT}: The Greeks visiting Sesame Street",
  author    = "Koutsikakis, John and Chalkidis, Ilias and Malakasiotis, Prodromos and Androutsopoulos, Ion",
  booktitle = "11th Hellenic Conference on Artificial Intelligence",
  year      = "2020",
}

@misc{koine-bert-v1.0,
  title  = {Koine-Greek-BERT v1.0: Polytonic domain-adaptive pre-training of Greek-BERT on biblical Koine},
  author = {Ziemińska, Agnieszka B.},
  year   = {2026},
  note   = {Domain-adapted from \texttt{pranaydeeps/Ancient-Greek-BERT}.},
}