---
language: la
library_name: transformers
license: cc-by-sa-4.0
base_model: google/byt5-large
pipeline_tag: text2text-generation
tags:
- latin
- medieval-latin
- normalization
- legal-history
- digital-humanities
- ocr-postprocessing
widget:
- text: "viiii vt in sabbato sancto ieiunium ante noctis initium non soluatur"
  example_title: "Medieval Legal Latin"
---

# Medieval Latin Normalizer (ByT5-Large)

This model is a **ByT5-Large** transformer fine-tuned to normalize medieval Latin text. It transforms diplomatic transcriptions or noisy HTR/OCR output into standardized orthography, facilitating downstream processing such as POS tagging, lemmatization, and linguistic analysis.

The model was developed as part of the research projects **"Embedding the Past"** (LOEWE-Exploration, TU Darmstadt) and **"Burchards Dekret Digital"** (Academy of Sciences and Literature | Mainz).

## Model Logic

Medieval Latin normalization involves handling inconsistent orthography (e.g., `u/v`, `i/j`, or `ae/e` variation) and resolving phonetic spellings common in legal and ecclesiastical manuscripts.

Because it builds on **ByT5-Large**, the model operates directly on **UTF-8 bytes**. This is a significant advantage for medieval Latin, as it allows the model to process non-standard characters without the information loss typical of subword tokenizers (as used by BERT or standard T5).

- **Input:** Raw/diplomatic medieval Latin text.
- **Output:** Standardized/normalized Latin text.

## Technical Specifications

- **Architecture:** [ByT5-Large](https://huggingface.co/google/byt5-large) (~1.2B parameters).
- **Hardware:** Trained on NVIDIA Blackwell GPUs using `bf16` precision and the `adamw_torch_fused` optimizer.
- **Training parameters:**
  - **Learning rate:** 2e-4
  - **Epochs:** 20
  - **Label smoothing:** 0.1 (to improve robustness against transcription noise)
  - **Batch size:** 48
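To illustrate the byte-level view described above, the following is a minimal, model-free sketch: it shows only how text decomposes into the UTF-8 bytes a ByT5-style model consumes (note that the actual ByT5 tokenizer additionally offsets each byte value by 3 to reserve ids for `<pad>`, `</s>`, and `<unk>`). The example characters are illustrative choices, not taken from the training data.

```python
# Sketch: why byte-level input avoids out-of-vocabulary problems.
# A subword tokenizer may map a rare medieval spelling to <unk>;
# a UTF-8 byte sequence never loses information.

def to_bytes(text: str) -> list[int]:
    """Return the raw UTF-8 byte values of a string.
    (The real ByT5 tokenizer offsets these by 3 for special tokens.)"""
    return list(text.encode("utf-8"))

# An e-caudata ('ę'), common in manuscripts, is simply two bytes:
print(to_bytes("ę"))   # [196, 153]
# Plain Latin letters are one byte each:
print(to_bytes("vt"))  # [118, 116]
```

Because every possible input reduces to at most 256 byte values, no manuscript abbreviation or special character can fall outside the model's vocabulary.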
## Performance (Test Set)

The model was evaluated on a held-out test set (85 samples) from medieval legal corpora:

| Metric | Value |
| :--- | :--- |
| **Character Error Rate (CER)** | **1.62%** |
| **Word-Level F1-Score** | **94.12%** |
| **Evaluation Loss** | 0.143 |

## Usage

You can use this model through the Hugging Face `pipeline` API:

```python
from transformers import pipeline

# Initialize the normalizer
normalizer = pipeline("text2text-generation", model="mschonhardt/latin-normalizer")

# Example input (diplomatic transcription)
raw_text = "viiii vt in sabbato sancto ieiunium ante noctis initium non soluatur"

result = normalizer(raw_text, max_length=128)
print(f"Normalized: {result[0]['generated_text']}")
```

## Citation

If you use this model in your research, please cite:

```bibtex
@software{schonhardt_michael_2026_normalization,
  author    = "Schonhardt, Michael",
  title     = "Medieval Latin Normalizer",
  year      = "2026",
  publisher = "Zenodo",
  doi       = "10.5281/zenodo.18416639",
  url       = "https://doi.org/10.5281/zenodo.18416639"
}

@article{xue-etal-2022-byt5,
  title     = "{B}y{T}5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models",
  author    = "Xue, Linting and Barua, Aditya and Constant, Noah and Al-Rfou, Rami and Narang, Sharan and Kale, Mihir and Roberts, Adam and Raffel, Colin",
  editor    = "Roark, Brian and Nenkova, Ani",
  journal   = "Transactions of the Association for Computational Linguistics",
  volume    = "10",
  year      = "2022",
  address   = "Cambridge, MA",
  publisher = "MIT Press",
  url       = "https://aclanthology.org/2022.tacl-1.17/",
  doi       = "10.1162/tacl_a_00461",
  pages     = "291--306"
}
```
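For readers reproducing the evaluation, Character Error Rate is the Levenshtein edit distance between model output and reference, divided by the reference length. The following is a minimal stdlib-only sketch; the actual evaluation may have used a dedicated library such as `jiwer` (an assumption, not stated above), and the example strings are illustrative, not from the test set.

```python
# Sketch: computing Character Error Rate (CER) by hand.
# CER = edit distance(reference, hypothesis) / len(reference)

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate via dynamic-programming edit distance."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))  # distances for the empty reference prefix
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[n] / m if m else 0.0

ref = "ut in sabbato sancto ieiunium"   # normalized reference (illustrative)
hyp = "vt in sabbato sancto ieiunium"   # model output with one u/v error
print(f"CER: {cer(ref, hyp):.4f}")      # one substitution over 29 characters
```

A CER of 1.62% therefore means that, on average, fewer than 2 of every 100 output characters differ from the gold normalization.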