Script Reproduction: Hieroglyphic-to-German Translation

This model is part of the paper "Data Contamination in Neural Machine Translation of Ancient Egyptian Hieroglyphics" (NLP4DH 2026).

Model Description

This model is M2M-100 (418M) retrained using the original train.py script from the hiero-transformer repository with default hyperparameters (epochs=20, batch_size=16, lr=3e-5). It represents the closest replication of the original training procedure.
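The original train.py is not reproduced here. As an illustration only, a roughly equivalent fine-tuning configuration could be sketched with the transformers Seq2SeqTrainer API; the output directory and dataset handling below are assumptions, not the paper's actual code.

```python
# Hypothetical sketch of an equivalent fine-tuning setup; the hyperparameter
# values mirror those stated above (epochs=20, batch_size=16, lr=3e-5).
from transformers import (
    M2M100ForConditionalGeneration,
    M2M100Tokenizer,
    Seq2SeqTrainingArguments,
)

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

args = Seq2SeqTrainingArguments(
    output_dir="hiero-m2m100-script-reproduction",  # assumed name
    num_train_epochs=20,
    per_device_train_batch_size=16,
    learning_rate=3e-5,
    save_strategy="epoch",
)
# A Seq2SeqTrainer would then be constructed with the TLA ea->de pairs
# tokenized into input_ids/labels and passed alongside `model` and `args`.
```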

Task: Ancient Egyptian hieroglyphics (Gardiner notation) → German translation

Performance

| Subset              | BLEU |
|---------------------|------|
| All (n=50)          | 42.2 |
| Contaminated (n=16) | 77.5 |
| Clean (n=34)        | 33.8 |

Important: The "All" and "Contaminated" BLEU scores are inflated due to target-side data contamination (32% of test targets appear in training). The Clean score represents genuine translation quality on decontaminated samples.
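The clean/contaminated split can in principle be recomputed by checking whether each test target occurs verbatim among the training targets. A minimal sketch, assuming exact string matching as the contamination criterion (the paper's actual decontamination procedure may differ):

```python
def split_by_contamination(test_pairs, train_targets):
    """Partition (source, target) test pairs by whether the target
    string appears verbatim among the training targets."""
    train_set = set(train_targets)
    contaminated = [p for p in test_pairs if p[1] in train_set]
    clean = [p for p in test_pairs if p[1] not in train_set]
    return contaminated, clean

# Toy example: one of two test targets also occurs in training.
train = ["der Mann geht", "die Frau spricht"]
test = [("D36 N35", "der Mann geht"), ("G17 D21", "der Koenig opfert")]
cont, clean = split_by_contamination(test, train)
print(len(cont), len(clean))  # -> 1 1
```

BLEU would then be computed separately on each partition (e.g. with sacrebleu) to obtain the subset scores reported above.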

Usage

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("bumblelbee/hiero-m2m100-script-reproduction")
tokenizer = M2M100Tokenizer.from_pretrained("bumblelbee/hiero-m2m100-script-reproduction")

# Gardiner notation input (hieroglyphic transliteration)
source = "D36 N35 G17 D21 X1 O34"

tokenizer.src_lang = "ea"
inputs = tokenizer(source, return_tensors="pt")
generated = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("de"))
output = tokenizer.decode(generated[0], skip_special_tokens=True)
print(output)
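Inputs are sequences of Gardiner sign codes separated by spaces. A lightweight sanity check before calling the model might look like the following; the regex is an assumption based on the common Gardiner code shape (a category letter or "Aa", a sign number, and an optional lowercase variant letter), and the tokenizer itself does not require it.

```python
import re

# Assumed Gardiner code shape: category letter (or the "Aa" category),
# a sign number, and an optional variant letter, e.g. "D21", "Aa15", "M17a".
GARDINER_CODE = re.compile(r"^(?:Aa|[A-Z])\d{1,3}[a-z]?$")

def is_gardiner_sequence(text):
    """Return True if every whitespace-separated token looks like a Gardiner code."""
    tokens = text.split()
    return bool(tokens) and all(GARDINER_CODE.fullmatch(t) for t in tokens)

print(is_gardiner_sequence("D36 N35 G17 D21 X1 O34"))  # -> True
print(is_gardiner_sequence("hello world"))             # -> False
```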

Training Data

Fine-tuned on 18,669 ea→de pairs from the Thesaurus Linguae Aegyptiae (TLA), maintained by the Berlin-Brandenburg Academy of Sciences and Humanities.

Citation

@inproceedings{contamination2026nlp4dh,
  title={Data Contamination in Neural Machine Translation of Ancient Egyptian Hieroglyphics},
  booktitle={Proceedings of the Workshop on Natural Language Processing for Digital Humanities (NLP4DH 2026)},
  year={2026}
}

Paper Repository

See the full paper, scripts, and results: GitHub repository

