cuneiformBase-400m

Introducing cuneiformBase-400m, a multilingual model capable of handling translation, transliteration, and script conversion tasks across multiple ancient languages: Akkadian, Sumerian, Hittite, Linear B, and Elamite.

1. Model Description

This is an instruct model based on Google's umt5-base (768 hidden dimensions, 12 encoder layers, 12 decoder layers). Unlike the original UMT5 architecture, which uses untied input/output embeddings, this model ties them, giving roughly 396M parameters. It supports translation to and from English (and German for Hittite), transliteration between cuneiform signs and Latin characters, and script conversion across five ancient writing systems.

Three styles of transliteration are supported where applicable:

  • Plain transliteration -- standard scholarly transliteration following CDLI notation style
  • Complex transliteration -- includes special symbols, subscript numbers, and determinatives
  • Simple transliteration -- stripped of all special symbols and diacritics, syllables merged to form words
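The relationship between the styles can be illustrated with a small normalizer. This is only a sketch under assumptions: the helper name `simplify_transliteration` is hypothetical, and the rules it applies (determinatives in braces dropped, index digits removed, diacritics stripped, syllables merged) are a reading of the description above, not the preprocessing actually used to build the training data.

```python
import re
import unicodedata

# Subscript digits (U+2080..U+2089) sometimes used as sign indices.
_SUBSCRIPTS = "₀₁₂₃₄₅₆₇₈₉"


def simplify_transliteration(text: str) -> str:
    """Hypothetical normalizer: plain/complex transliteration -> 'simple' style."""
    # Drop determinatives written in braces, e.g. {d} or {ki}.
    text = re.sub(r"\{[^}]*\}", "", text)
    # Remove index digits attached to a syllable (e.g. u2, ša₂).
    text = re.sub(rf"(?<=[^\W\d_])[0-9{_SUBSCRIPTS}]+", "", text)
    # Strip diacritics: decompose, then discard combining marks (š -> s).
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Merge syllables into words by deleting sign separators.
    text = re.sub(r"[-.]", "", text)
    # Collapse whitespace left behind by removed tokens.
    return " ".join(text.split())
```

For example, `simplify_transliteration("a-na {d}UTU be-lí-ia")` returns `"ana UTU beliia"` under these rules.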

Akkadian Instructions

Translation:

| Prompt | Input | Output |
|---|---|---|
| Translate Akkadian cuneiform to English: | cuneiform signs | English |
| Translate Akkadian transliteration to English: | transliteration | English |
| Translate complex Akkadian transliteration to English: | complex transliteration | English |
| Translate simple Akkadian transliteration to English: | simple transliteration | English |
| Translate English to Akkadian cuneiform: | English | cuneiform signs |
| Translate English to Akkadian transliteration: | English | transliteration |
| Translate English to complex Akkadian transliteration: | English | complex transliteration |
| Translate English to simple Akkadian transliteration: | English | simple transliteration |

Transliteration:

| Prompt | Input | Output |
|---|---|---|
| Transliterate Akkadian cuneiform to Latin characters: | cuneiform signs | transliteration |
| Transliterate Akkadian cuneiform to complex Latin characters: | cuneiform signs | complex transliteration |
| Transliterate Akkadian cuneiform to simple Latin characters: | cuneiform signs | simple transliteration |

Script Conversion:

| Prompt | Input | Output |
|---|---|---|
| Convert transliterated Latin characters to Akkadian cuneiform: | transliteration | cuneiform signs |
| Convert complex transliterated Latin characters to Akkadian cuneiform: | complex transliteration | cuneiform signs |
| Convert simple transliterated Latin characters to Akkadian cuneiform: | simple transliteration | cuneiform signs |

Sumerian Instructions

Translation:

| Prompt | Input | Output |
|---|---|---|
| Translate Sumerian cuneiform to English: | cuneiform signs | English |
| Translate Sumerian transliteration to English: | transliteration | English |
| Translate complex Sumerian transliteration to English: | complex transliteration | English |
| Translate simple Sumerian transliteration to English: | simple transliteration | English |
| Translate English to Sumerian cuneiform: | English | cuneiform signs |
| Translate English to Sumerian transliteration: | English | transliteration |

Transliteration:

| Prompt | Input | Output |
|---|---|---|
| Transliterate Sumerian cuneiform to Latin characters: | cuneiform signs | transliteration |
| Transliterate Sumerian cuneiform to complex Latin characters: | cuneiform signs | complex transliteration |

Script Conversion:

| Prompt | Input | Output |
|---|---|---|
| Convert transliterated Latin characters to Sumerian cuneiform: | transliteration | cuneiform signs |

Hittite Instructions

Translation:

| Prompt | Input | Output |
|---|---|---|
| Translate Hittite transliteration to English: | transliteration | English |
| Translate complex Hittite transliteration to English: | complex transliteration | English |
| Translate simple Hittite transliteration to English: | simple transliteration | English |
| Translate Hittite transliteration to German: | transliteration | German |
| Translate complex Hittite transliteration to German: | complex transliteration | German |
| Translate simple Hittite transliteration to German: | simple transliteration | German |
| Translate English to Hittite transliteration: | English | transliteration |
| Translate German to Hittite transliteration: | German | transliteration |

Linear B Instructions

Translation:

| Prompt | Input | Output |
|---|---|---|
| Translate Linear B cuneiform to English: | Linear B signs | English |
| Translate Linear B transliteration to English: | transliteration | English |
| Translate complex Linear B transliteration to English: | complex transliteration | English |
| Translate simple Linear B transliteration to English: | simple transliteration | English |
| Translate English to Linear B cuneiform: | English | Linear B signs |
| Translate English to Linear B transliteration: | English | transliteration |

Transliteration:

| Prompt | Input | Output |
|---|---|---|
| Transliterate Linear B cuneiform to Latin characters: | Linear B signs | transliteration |

Script Conversion:

| Prompt | Input | Output |
|---|---|---|
| Convert transliterated Latin characters to Linear B cuneiform: | transliteration | Linear B signs |
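Because the prompts follow a regular pattern, they can be assembled programmatically rather than typed by hand. The sketch below covers only the "transliteration to modern language" prompts listed in the instruction tables above; the helper name `to_modern_prompt` and its argument names are hypothetical and not part of the released model.

```python
# Styles and languages taken from the instruction tables above.
_STYLES = {"plain": "", "complex": "complex ", "simple": "simple "}
_LANGUAGES = {"Akkadian", "Sumerian", "Hittite", "Linear B"}


def to_modern_prompt(language: str, style: str = "plain",
                     target: str = "English") -> str:
    """Build a 'transliteration -> modern language' prompt (hypothetical helper)."""
    if language not in _LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    if style not in _STYLES:
        raise ValueError(f"unsupported style: {style!r}")
    if target not in {"English", "German"}:
        raise ValueError(f"unsupported target: {target!r}")
    # German is listed as a target for Hittite only.
    if target == "German" and language != "Hittite":
        raise ValueError("German output is listed for Hittite only")
    # The trailing space matters: the input text is appended directly.
    return f"Translate {_STYLES[style]}{language} transliteration to {target}: "
```

For instance, `to_modern_prompt("Hittite", "complex", "German")` yields `"Translate complex Hittite transliteration to German: "`, matching the corresponding table row.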

Elamite

Elamite was included in training on a limited corpus. Due to insufficient validation data, no evaluation metrics are reported for Elamite at this time. Use with caution and expect lower accuracy than the other supported languages.


Base Model

This is a fine-tuned version of Google's umt5-base, modified to use tied embeddings.

2. Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_path = "Thalesian/cuneiformBase-400m"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

# Example: Translate Akkadian cuneiform to English
prompt = "Translate Akkadian cuneiform to English: "
input_text = "𒅆 𒁹 𒀭 𒉺 𒉽 𒀀 𒁹 𒄿 𒌋 𒐊 𒀴 𒃻 𒀀 𒌋 𒌋 𒀀 𒌋 𒌋"

inputs = tokenizer(prompt + input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=64)

prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Prediction:", prediction)
> "witness  Nabu-naṣir  son  Na  di-Issar  servant  of  son  king"
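When several lines share the same instruction, they can be sent through the model as one batch. The helper below is a minimal sketch, not part of the model's own tooling: the name `run_prompt` is hypothetical, and it assumes `tokenizer` and `model` were loaded as in the example above. The truncation length of 512 reflects the context limit stated under Limitations.

```python
def run_prompt(lines, prompt, tokenizer, model, max_new_tokens=64):
    """Apply one instruction prompt to several input lines in a single batch."""
    batch = [prompt + line for line in lines]
    # Pad to the longest item and truncate at the 512-token context limit.
    inputs = tokenizer(batch, return_tensors="pt", padding=True,
                       truncation=True, max_length=512)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # One decoded string per input line, in the same order.
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

A call such as `run_prompt(my_lines, "Translate Akkadian transliteration to English: ", tokenizer, model)` would return a list of predictions, where `my_lines` is a placeholder for your own list of transliterated lines.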

3. Training and Evaluation Data

Training data came from the Akkademia project, previously published in PNAS Nexus. Additional data for pre-training and training came from CDLI (Akkadian and Sumerian), the OARE dataset (Akkadian), the HPM corpus (Hittite), published syllabary resources (Linear B), and a limited Elamite corpus. More information on the training data and on the test and validation splits can be found in the GitHub repository and the published methodology.

Training Procedure

The model was trained in multiple stages with different datasets and collators across all supported languages.

Framework Versions

  • Transformers 5.0.0.dev0
  • PyTorch 2.6.0+cu126
  • Tokenizers 0.21.1

4. Evaluation Metrics

4.1 Akkadian

4.1.1 Akkadian Metrics by Line

| From Language | From Script | To Language | To Script | BLEU | CHRF | METEOR |
|---|---|---|---|---|---|---|
| Akkadian | Transliteration | Akkadian | Cuneiform | 95.78 | 95.33 | - |
| Akkadian | Cuneiform | English | Latin | 66.86 | 78.26 | 0.78 |
| Akkadian | Transliteration | English | Latin | 69.78 | 80.56 | 0.80 |
| Akkadian | Complex Transliteration | English | Latin | 69.78 | 80.58 | 0.80 |
| Akkadian | Simple Transliteration | English | Latin | 67.65 | 78.70 | 0.78 |
| English | Latin | Akkadian | Cuneiform | 45.61 | 45.48 | - |
| English | Latin | Akkadian | Transliteration | 43.54 | 65.42 | - |
| Akkadian | Cuneiform | Akkadian | Transliteration | 83.63 | 92.22 | - |

4.1.2 Akkadian Metrics by Document

| From Language | From Script | To Language | To Script | BLEU | CHRF | METEOR |
|---|---|---|---|---|---|---|
| Akkadian | Transliteration | Akkadian | Cuneiform | 39.47 | 50.44 | - |
| Akkadian | Cuneiform | English | Latin | 34.14 | 51.54 | 0.49 |
| Akkadian | Transliteration | English | Latin | 35.94 | 53.19 | 0.50 |
| Akkadian | Complex Transliteration | English | Latin | 36.17 | 53.46 | 0.51 |
| Akkadian | Simple Transliteration | English | Latin | 33.95 | 51.61 | 0.49 |
| English | Latin | Akkadian | Cuneiform | 24.14 | 29.94 | - |
| English | Latin | Akkadian | Transliteration | 18.84 | 35.57 | - |
| Akkadian | Cuneiform | Akkadian | Transliteration | 26.13 | 41.81 | - |

4.1.3 Akkadian Metrics by Line (CDLI Test Set)

| From Language | From Script | To Language | To Script | BLEU | CHRF | METEOR |
|---|---|---|---|---|---|---|
| Akkadian | Transliteration | Akkadian | Cuneiform | 29.79 | 46.17 | - |
| Akkadian | Complex Transliteration | Akkadian | Cuneiform | 32.05 | 47.47 | - |
| Akkadian | Simple Transliteration | Akkadian | Cuneiform | 16.97 | 24.64 | - |
| Akkadian | Cuneiform | English | Latin | 20.74 | 46.63 | 0.41 |
| Akkadian | Transliteration | English | Latin | 24.10 | 51.15 | 0.45 |
| Akkadian | Complex Transliteration | English | Latin | 23.99 | 51.30 | 0.45 |
| Akkadian | Simple Transliteration | English | Latin | 20.74 | 46.85 | 0.40 |
| English | Latin | Akkadian | Transliteration | 29.45 | 60.35 | - |
| English | Latin | Akkadian | Complex Transliteration | 18.06 | 47.32 | - |
| English | Latin | Akkadian | Simple Transliteration | 2.02 | 25.48 | - |
| Akkadian | Cuneiform | Akkadian | Transliteration | 36.74 | 74.18 | - |
| Akkadian | Cuneiform | Akkadian | Complex Transliteration | 29.49 | 68.63 | - |
| Akkadian | Cuneiform | Akkadian | Simple Transliteration | 1.80 | 30.50 | - |

4.1.4 Akkadian Metrics by Document (CDLI Test Set)

| From Language | From Script | To Language | To Script | BLEU | CHRF | METEOR |
|---|---|---|---|---|---|---|
| Akkadian | Transliteration | Akkadian | Cuneiform | 68.01 | 75.51 | - |
| Akkadian | Complex Transliteration | Akkadian | Cuneiform | 67.40 | 74.77 | - |
| Akkadian | Simple Transliteration | Akkadian | Cuneiform | 38.09 | 40.73 | - |
| Akkadian | Cuneiform | English | Latin | 23.39 | 47.78 | 0.44 |
| Akkadian | Transliteration | English | Latin | 26.23 | 50.91 | 0.47 |
| Akkadian | Complex Transliteration | English | Latin | 25.41 | 50.86 | 0.47 |
| Akkadian | Simple Transliteration | English | Latin | 22.48 | 47.68 | 0.44 |
| English | Latin | Akkadian | Transliteration | 28.88 | 51.57 | - |
| English | Latin | Akkadian | Complex Transliteration | 15.19 | 38.29 | - |
| English | Latin | Akkadian | Simple Transliteration | 1.75 | 23.31 | - |
| Akkadian | Cuneiform | Akkadian | Transliteration | 34.48 | 57.21 | - |
| Akkadian | Cuneiform | Akkadian | Complex Transliteration | 28.99 | 53.30 | - |
| Akkadian | Cuneiform | Akkadian | Simple Transliteration | 1.42 | 24.87 | - |

4.1.5 Akkadian Metrics by Document (OARE Test Set)

| From Language | From Script | To Language | To Script | BLEU | CHRF | METEOR |
|---|---|---|---|---|---|---|
| Akkadian | Transliteration | Akkadian | Cuneiform | 15.57 | 21.40 | - |
| Akkadian | Complex Transliteration | Akkadian | Cuneiform | 18.39 | 28.34 | - |
| Akkadian | Simple Transliteration | Akkadian | Cuneiform | 10.35 | 15.89 | - |
| Akkadian | Cuneiform | English | Latin | 1.72 | 17.53 | 0.13 |
| Akkadian | Transliteration | English | Latin | 0.78 | 17.67 | 0.10 |
| Akkadian | Complex Transliteration | English | Latin | 0.86 | 17.52 | 0.11 |
| Akkadian | Simple Transliteration | English | Latin | 1.53 | 18.32 | 0.14 |
| English | Latin | Akkadian | Transliteration | 0.77 | 16.17 | - |
| English | Latin | Akkadian | Complex Transliteration | 0.53 | 12.87 | - |
| English | Latin | Akkadian | Simple Transliteration | 0.33 | 10.94 | - |
| Akkadian | Cuneiform | Akkadian | Transliteration | 0.96 | 19.75 | - |
| Akkadian | Cuneiform | Akkadian | Complex Transliteration | 1.39 | 19.28 | - |
| Akkadian | Cuneiform | Akkadian | Simple Transliteration | 0.38 | 11.88 | - |

4.2 Sumerian

4.2.1 Sumerian Metrics by Line

| From Language | From Script | To Language | To Script | BLEU | CHRF | METEOR |
|---|---|---|---|---|---|---|
| Sumerian | Transliteration | Sumerian | Cuneiform | 98.85 | 98.87 | - |
| Sumerian | Cuneiform | English | Latin | 19.40 | 40.43 | 0.38 |
| Sumerian | Transliteration | English | Latin | 23.81 | 46.00 | 0.45 |
| Sumerian | Complex Transliteration | English | Latin | 23.96 | 45.88 | 0.45 |
| Sumerian | Simple Transliteration | English | Latin | 21.53 | 43.43 | 0.41 |
| English | Latin | Sumerian | Cuneiform | 52.28 | 55.05 | - |
| English | Latin | Sumerian | Transliteration | 42.02 | 62.72 | - |
| Sumerian | Cuneiform | Sumerian | Transliteration | 39.08 | 64.61 | - |
| Sumerian | Cuneiform | Sumerian | Complex Transliteration | 37.66 | 63.61 | - |

4.2.2 Sumerian Metrics by Document

| From Language | From Script | To Language | To Script | BLEU | CHRF | METEOR |
|---|---|---|---|---|---|---|
| Sumerian | Transliteration | Sumerian | Cuneiform | 78.74 | 83.74 | - |
| Sumerian | Cuneiform | English | Latin | 24.99 | 45.43 | 0.43 |
| Sumerian | Transliteration | English | Latin | 30.34 | 50.82 | 0.51 |
| Sumerian | Complex Transliteration | English | Latin | 30.43 | 50.58 | 0.50 |
| Sumerian | Simple Transliteration | English | Latin | 26.15 | 47.91 | 0.45 |
| English | Latin | Sumerian | Cuneiform | 52.58 | 55.17 | - |
| English | Latin | Sumerian | Transliteration | 48.35 | 62.48 | - |
| Sumerian | Cuneiform | Sumerian | Transliteration | 39.88 | 58.59 | - |
| Sumerian | Cuneiform | Sumerian | Complex Transliteration | 37.78 | 57.10 | - |

4.3 Hittite

4.3.1 Hittite Metrics by Line

| From Language | From Script | To Language | To Script | BLEU | CHRF | METEOR |
|---|---|---|---|---|---|---|
| Hittite | Transliteration | English | Latin | 95.62 | 97.41 | 0.97 |
| Hittite | Complex Transliteration | English | Latin | 95.08 | 97.05 | 0.97 |
| Hittite | Simple Transliteration | English | Latin | 93.45 | 96.19 | 0.96 |
| Hittite | Transliteration | German | Latin | 86.88 | 94.25 | 0.93 |
| Hittite | Complex Transliteration | German | Latin | 86.64 | 94.07 | 0.92 |
| Hittite | Simple Transliteration | German | Latin | 79.82 | 91.13 | 0.89 |
| English | Latin | Hittite | Transliteration | 55.89 | 84.47 | - |
| German | Latin | Hittite | Transliteration | 49.18 | 83.33 | - |

4.3.2 Hittite Metrics by Document

| From Language | From Script | To Language | To Script | BLEU | CHRF | METEOR |
|---|---|---|---|---|---|---|
| Hittite | Transliteration | English | Latin | 65.23 | 72.79 | 0.70 |
| Hittite | Complex Transliteration | English | Latin | 65.39 | 73.06 | 0.71 |
| Hittite | Simple Transliteration | English | Latin | 63.61 | 71.94 | 0.69 |
| Hittite | Transliteration | German | Latin | 56.01 | 68.22 | 0.65 |
| Hittite | Complex Transliteration | German | Latin | 56.47 | 68.87 | 0.66 |
| Hittite | Simple Transliteration | German | Latin | 49.32 | 64.91 | 0.62 |
| English | Latin | Hittite | Transliteration | 28.38 | 47.49 | - |
| German | Latin | Hittite | Transliteration | 24.24 | 45.63 | - |

The Hittite validation scripts were based on CTH numbers. However, the line-level English BLEU score (95.62) is implausibly high, and we believe there was data leakage from a manually generated training set. This may affect the German scores as well, although they are consistent with earlier models deployed before the additional English set was introduced.
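One simple way to probe for this kind of leakage is to look for evaluation lines that also occur in the training data. The sketch below is illustrative only: the function name `find_leaked_lines` is hypothetical, the normalization (case-folding and whitespace collapsing) is an assumption, and it is not the procedure used to diagnose the issue described above.

```python
def find_leaked_lines(train_lines, eval_lines):
    """Return evaluation lines that also occur in training after light normalization."""
    def normalize(line):
        # Case-fold and collapse whitespace so trivial variants still match.
        return " ".join(line.split()).casefold()

    seen = {normalize(line) for line in train_lines}
    return [line for line in eval_lines if normalize(line) in seen]
```

Exact matching of this kind only catches verbatim duplicates. Near-duplicates, such as the same passage in a different transliteration style, would need a fuzzier comparison.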


4.4 Linear B

Note: Line-level and document-level metrics are identical for Linear B, as the validation set consists of single-line documents.
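The note above can be checked with a toy scorer. The sketch below uses a simplified character n-gram F-score in the spirit of chrF; it is not numerically identical to any standard implementation, and the way scores are aggregated (a plain mean over lines or over concatenated documents) is an assumption made for illustration, not a description of how the reported metrics were computed. The names `simple_chrf` and `line_and_document_scores` are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: F-beta over averaged character n-gram precision/recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # text too short for this n-gram order
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


def line_and_document_scores(documents):
    """documents: list of documents, each a list of (hypothesis, reference) pairs."""
    line_scores = [simple_chrf(h, r) for doc in documents for h, r in doc]
    doc_scores = [
        simple_chrf(" ".join(h for h, _ in doc), " ".join(r for _, r in doc))
        for doc in documents
    ]
    return (sum(line_scores) / len(line_scores),
            sum(doc_scores) / len(doc_scores))
```

When every document contains exactly one line, the two returned values coincide, which is the situation described for the Linear B validation set.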

4.4.1 Linear B Metrics

| From Language | From Script | To Language | To Script | BLEU | CHRF | METEOR |
|---|---|---|---|---|---|---|
| Linear B | Transliteration | Linear B | Syllabary | 86.24 | 88.29 | - |
| Linear B | Syllabary | English | Latin | 50.41 | 62.82 | 0.67 |
| Linear B | Transliteration | English | Latin | 56.51 | 66.23 | 0.70 |
| Linear B | Complex Transliteration | English | Latin | 68.33 | 73.42 | 0.78 |
| Linear B | Simple Transliteration | English | Latin | 28.24 | 44.50 | 0.50 |
| English | Latin | Linear B | Syllabary | 50.76 | 52.18 | - |
| English | Latin | Linear B | Transliteration | 52.21 | 64.18 | - |
| Linear B | Syllabary | Linear B | Transliteration | 50.98 | 73.45 | - |

4.5 Elamite

Elamite was included during training on a limited corpus. Due to insufficient validation data, no evaluation metrics are available. Results should be treated as experimental.

5. Intended Uses

  • Translation of short cuneiform lines across Akkadian, Sumerian, Hittite, and Linear B
  • Transliteration pipelines converting between cuneiform signs and Latin-script representations
  • Reverse translation from English/German back to ancient language transliterations or cuneiform
  • Comparative studies across multiple ancient writing systems
  • Educational and research applications in digital Assyriology, Sumerology, Hittitology, and Aegean scripts

6. Limitations

  • Context window is limited to 512 tokens; longer texts should be split into individual lines.
  • Sumerian translation quality is notably lower than for other languages, owing to the complexity of the language and the limited parallel data available.
  • Elamite support is experimental with minimal training data.
  • Performance degrades sharply on the out-of-domain OARE Akkadian test set (BLEU below 2 for translation into English), indicating domain sensitivity.
  • The model was trained on scholarly transliterations and may not generalize well to non-standard input formats.
  • Linear B prompts use the term "cuneiform" for the syllabary script for consistency with the prompt format; Linear B is a syllabic script, not cuneiform.
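To stay within the 512-token limit noted above, longer texts can be grouped line by line before inference. The following is a minimal sketch, assuming a caller-supplied `count_tokens` function (for example `lambda s: len(tokenizer(s)["input_ids"])` with the model's tokenizer); the helper name `chunk_lines` is hypothetical.

```python
def chunk_lines(lines, count_tokens, prompt, limit=512):
    """Group consecutive lines so that prompt + chunk stays within `limit` tokens."""
    chunks, current = [], []
    for line in lines:
        candidate = current + [line]
        # Start a new chunk when adding this line would exceed the limit.
        if current and count_tokens(prompt + " ".join(candidate)) > limit:
            chunks.append(" ".join(current))
            current = [line]
        else:
            current = candidate
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A single line that already exceeds the limit is kept as its own chunk and would still be truncated by the tokenizer. Setting a small `limit` approximates one line per chunk, which follows the recommendation above to split longer texts into individual lines.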

7. How to Cite

@misc{drake2025cuneiformBase400m,
  title        = {{cuneiformBase-400m}: A Multilingual T5 Model for Ancient Script Translation and Transliteration},
  author       = {Drake, B. Lee},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/Thalesian/cuneiformBase-400m}}
}