---
language: la
library_name: transformers
license: cc-by-sa-4.0
base_model: google/byt5-base
pipeline_tag: text2text-generation
tags:
- latin
- medieval-latin
- legal-history
- digital-humanities
- ocr-postprocessing
- expansion
- pagexml
- htr
widget:
- text: "Vt ep̅i conꝓuinciales peregrina iu¬"
---

# Medieval Latin Abbreviation Expander (abbreviationes-v2)

This model is a specialized Seq2Seq transformer designed to expand medieval scribal abbreviations (brevigraphs and suspensions) into their full forms. It was trained specifically to handle the complexities of Latin manuscripts, based on a fixed set of special characters used in [Burchards Dekret Digital](https://www.burchards-dekret-digital.de).

The model was developed as part of the projects **"Embedding the Past"** (LOEWE-Exploration, TU Darmstadt) and **"Burchards Dekret Digital"** (Akademie der Wissenschaften und der Literatur | Mainz).

## Model Logic

Unlike traditional token-based models, this model uses **ByT5**, which operates on raw UTF-8 bytes. This is critical for Medieval Latin, as it allows the model to see and correctly process non-standard Unicode characters such as macrons, brevigraphs, Tironian notes, and special ligatures that are often discarded by standard subword tokenizers.

- **Input:** Abbreviated text lines extracted from PageXML (e.g., `ep̅i`, `conꝓuinciales`).
- **Output:** Fully expanded Unicode text (e.g., `episcopi`, `conprouinciales`).

## Training & Technical Details

- **Architecture:** [ByT5-Base](https://huggingface.co/google/byt5-base) (encoder-decoder).
- **Data Source:** ~32,800 paired lines (Abbr/Expan) from the *Decretum Burchardi*.
- **Hardware:** Optimized for NVIDIA Blackwell (TF32/BF16 training).
- **Training Regime:** 15 epochs with a cosine learning rate scheduler (LR 2e-4).
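The byte-level behavior described above can be illustrated without loading the model at all. ByT5 encodes text directly as UTF-8 bytes; in the Hugging Face implementation, each byte value is offset by 3 to leave room for the special tokens (`<pad>`, `</s>`, `<unk>`). A minimal standard-library sketch of that mapping (the offset-by-3 convention is an assumption about the HF `ByT5Tokenizer`, not part of this model card) shows how a combining macron such as the one in `ep̅i` survives as ordinary bytes instead of collapsing to an unknown token:

```python
# Sketch of ByT5-style byte tokenization, no transformers required.
# Assumption: HF ByT5 maps UTF-8 byte b to token ID b + 3, reserving
# IDs 0-2 for <pad>, </s>, and <unk>.

def byt5_ids(text: str) -> list[int]:
    """Encode text as UTF-8 bytes, offset by 3 as in ByT5."""
    return [b + 3 for b in text.encode("utf-8")]

abbr = "ep̅i"  # the 'p' carries U+0305 COMBINING OVERLINE
print(list(abbr.encode("utf-8")))  # the overline is two bytes: 0xCC 0x85
print(byt5_ids(abbr))
```

Because every possible byte has an ID, no character in the manuscript transcription can ever be out-of-vocabulary, which is exactly the property the "Model Logic" section relies on.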
### Performance (Test Set)

| Metric | Value |
| :--- | :--- |
| **Character Error Rate (CER)** | **0.45%** |
| **Word-Level F1-Score** | **98.75%** |
| **Eval Loss** | 0.00064 |

## Usage

You can use this model via the Hugging Face `pipeline` interface for quick inference:

```python
from transformers import pipeline

# Load the expander
expander = pipeline("text2text-generation", model="mschonhardt/abbreviationes-v2")

# Example: abbreviated line "Vt ep̅i conꝓuinciales peregrina iu¬"
text = "Vt ep̅i conꝓuinciales peregrina iu¬"
result = expander(text, max_length=512)

print(f"Source: {text}")
print(f"Expanded: {result[0]['generated_text']}")
```

## Citation

If you use this model in your research, please cite the project and the underlying architecture:

```bibtex
@software{schonhardt_michael_2026_expansion,
  author    = "Schonhardt, Michael",
  title     = "Medieval Latin Abbreviation Expander (abbreviationes-v2)",
  year      = "2026",
  publisher = "Zenodo",
  doi       = "10.5281/zenodo.18411989",
  url       = "https://doi.org/10.5281/zenodo.18411989"
}

@article{xue-etal-2022-byt5,
  title     = "{B}y{T}5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models",
  author    = "Xue, Linting and Barua, Aditya and Constant, Noah and Al-Rfou, Rami and Narang, Sharan and Kale, Mihir and Roberts, Adam and Raffel, Colin",
  editor    = "Roark, Brian and Nenkova, Ani",
  journal   = "Transactions of the Association for Computational Linguistics",
  volume    = "10",
  year      = "2022",
  address   = "Cambridge, MA",
  publisher = "MIT Press",
  url       = "https://aclanthology.org/2022.tacl-1.17/",
  doi       = "10.1162/tacl_a_00461",
  pages     = "291--306"
}
```
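Since the model expects line-level input extracted from PageXML, a small standard-library helper can bridge the gap between HTR output and the pipeline. This is a sketch, not project code: it assumes the 2013-07-15 PAGE namespace (adjust it to match your documents) and simply collects the `Unicode` content of every `TextLine` so each line can be passed to the expander individually:

```python
import xml.etree.ElementTree as ET

# Hypothetical helper: pull line-level text out of a PAGE XML document.
# Assumption: the 2013-07-15 PAGE content namespace; other exports may
# use a different schema version.
PAGE_NS = "{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}"

def extract_lines(xml_text: str) -> list[str]:
    """Return the Unicode content of every TextLine, in document order."""
    root = ET.fromstring(xml_text)
    lines = []
    for text_line in root.iter(f"{PAGE_NS}TextLine"):
        unicode_el = text_line.find(f"{PAGE_NS}TextEquiv/{PAGE_NS}Unicode")
        if unicode_el is not None and unicode_el.text:
            lines.append(unicode_el.text)
    return lines

sample = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page><TextRegion>
    <TextLine><TextEquiv><Unicode>Vt ep̅i conꝓuinciales peregrina iu¬</Unicode></TextEquiv></TextLine>
  </TextRegion></Page>
</PcGts>"""

for line in extract_lines(sample):
    print(line)  # feed each line to the expander, e.g. expander(line, max_length=512)
```

Processing line by line mirrors the training data (paired abbreviated/expanded lines), so it keeps inference inputs within the distribution the model was trained on.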