# M2M-100 Conservative → Hieroglyphic-to-German Translation
This model is part of the paper "Data Contamination in Neural Machine Translation of Ancient Egyptian Hieroglyphics" (NLP4DH 2026).
## Model Description
M2M-100 (418M) fine-tuned with a conservative learning rate (lr=1e-5, AdamW optimizer, cosine schedule with warmup, effective batch size 288). Best checkpoint at step 11,000.

Task: Ancient Egyptian hieroglyphics (Gardiner notation) → German translation
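The schedule named above (cosine decay with linear warmup, peaking at lr=1e-5) can be sketched in plain Python. The peak learning rate matches the card; the warmup-step and total-step counts below are illustrative assumptions, not values reported for this model:

```python
import math

def lr_at_step(step, peak_lr=1e-5, warmup_steps=500, total_steps=11000):
    """Cosine learning-rate schedule with linear warmup.

    peak_lr matches the card (1e-5); warmup_steps and total_steps
    are hypothetical placeholders, not reported training values.
    """
    if step < warmup_steps:
        # Linear warmup from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at_step(500))    # peak of the schedule
print(lr_at_step(11000))  # decayed to ~0
```

The learning rate rises linearly to its peak at the end of warmup, then follows a half-cosine down to zero at the final step.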
## Performance
| Subset | BLEU |
|---|---|
| All (n=50) | 47.3 |
| Contaminated (n=16) | 77.9 |
| Clean (n=34) | 39.2 |
Important: The "All" and "Contaminated" BLEU scores are inflated due to target-side data contamination (32% of test targets appear in training). The Clean score represents genuine translation quality on decontaminated samples.
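The contaminated/clean split described above rests on an exact-match check of test targets against training targets. A minimal sketch of that idea, using invented toy strings rather than TLA data (the paper's actual decontamination procedure may normalize text before matching):

```python
# Minimal sketch of target-side contamination detection:
# a test example counts as "contaminated" if its German reference
# appears verbatim among the training targets.
train_targets = {
    "der König opfert dem Gott",   # toy example, not real TLA text
    "es lebe der gute Gott",
}

test_pairs = [
    ("D36 N35 G17", "der König opfert dem Gott"),    # target seen in training
    ("D21 X1 O34", "der Schreiber zählt das Korn"),  # target unseen
]

contaminated = [p for p in test_pairs if p[1] in train_targets]
clean = [p for p in test_pairs if p[1] not in train_targets]

print(len(contaminated), len(clean))  # 1 1
```

Scoring the two groups separately, as in the table above, exposes how much of the aggregate BLEU is owed to memorized targets rather than genuine translation.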
## Usage
```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("bumblelbee/hiero-m2m100-conservative")
tokenizer = M2M100Tokenizer.from_pretrained("bumblelbee/hiero-m2m100-conservative")

# Gardiner notation input (hieroglyphic transliteration)
source = "D36 N35 G17 D21 X1 O34"
tokenizer.src_lang = "ea"  # source-language code used by this model

inputs = tokenizer(source, return_tensors="pt")
generated = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("de"))
output = tokenizer.decode(generated[0], skip_special_tokens=True)
print(output)
```
## Training Data
Fine-tuned on 18,669 ea→de pairs from the Thesaurus Linguae Aegyptiae (TLA), maintained by the Berlin-Brandenburg Academy of Sciences and Humanities.
## Citation
```bibtex
@inproceedings{contamination2026nlp4dh,
  title={Data Contamination in Neural Machine Translation of Ancient Egyptian Hieroglyphics},
  booktitle={Proceedings of the Workshop on Natural Language Processing for Digital Humanities (NLP4DH 2026)},
  year={2026}
}
```
## Paper Repository
See the full paper, scripts, and results: GitHub repository
## Base Model
facebook/m2m100_418M