Latin Contextual Line-Break Detector

This model is a specialized sequence tagger for Latin, designed to support OCR post-processing and editorial workflows. For each line-break marker (<lb>), it predicts whether the break separates two words (NB) or splits a single word that should be rejoined (WB).

The model was trained as part of the projects "Embedding the Past" (LOEWE-Exploration, TU Darmstadt) and "Burchards Dekret Digital" (Langzeitvorhaben, Akademie der Wissenschaften und der Literatur | Mainz).

Model Logic

The tagger evaluates the context around a <lb> token using bidirectional embeddings:

NB (No Break, KEEP): Predicted when the line break falls between two distinct words, so the words stay separate.
WB (Word Break, JOIN): Predicted when the line break splits a single word, so the two parts should be joined.
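
The KEEP/JOIN decision can be turned into reconstructed text once a label is available for every <lb> token. The following is a minimal sketch, not part of the model itself: the helper name `apply_linebreak_labels` is hypothetical, and the labels are passed in as a plain list so that the logic can be read (and run) without loading the tagger.

```python
from typing import List

LB = "<lb>"  # line-break marker, expected as a separate token


def apply_linebreak_labels(tokens: List[str], labels: List[str]) -> str:
    """Rebuild running text from tokens and one NB/WB label per <lb> token.

    NB (KEEP): the break lies between two words, so they stay separate.
    WB (JOIN): the break splits one word, so its two parts are glued together.
    """
    if tokens.count(LB) != len(labels):
        raise ValueError("expected exactly one label per <lb> token")

    label_iter = iter(labels)
    words: List[str] = []
    join_next = False  # True immediately after a WB break
    for token in tokens:
        if token == LB:
            # Only join if there is a preceding word to attach to
            join_next = next(label_iter) == "WB" and bool(words)
            continue
        if join_next:
            words[-1] += token
            join_next = False
        else:
            words.append(token)
    return " ".join(words)


tokens = "res em <lb> pta de pecunia".split()
print(apply_linebreak_labels(tokens, ["WB"]))  # res empta de pecunia
print(apply_linebreak_labels(tokens, ["NB"]))  # res em pta de pecunia
```

In an editorial workflow, low-confidence predictions could be routed to manual review instead of being joined automatically; that policy is left out of this sketch.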

Technical Details

Architecture: Bi-LSTM + CRF Sequence Tagger.
Base Embeddings: Stacked Latin Legal Forward and Backward contextual string embeddings.
Data Source: ~1.47M sentences from medieval and early modern legal corpora, including the Decretum Burchardi and documents from the School of Salamanca.
Accuracy: 97.82% (Micro F1-score).
Robustness: Handles both normalized and diplomatic transcriptions, although it was trained primarily on expanded data.
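
The accuracy figure above is labelled as a micro F1-score. When every evaluated item receives exactly one gold label and exactly one predicted label, micro-averaged F1 and accuracy are the same number, which the sketch below illustrates. The toy label lists are made up for illustration and are not the model's evaluation data; how the reported score was computed in detail is not stated here.

```python
from typing import List


def accuracy(gold: List[str], pred: List[str]) -> float:
    """Share of items whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def micro_f1(gold: List[str], pred: List[str]) -> float:
    """Micro-averaged F1: pool TP/FP/FN over all labels, then compute F1."""
    tp = fp = fn = 0
    for label in set(gold) | set(pred):
        for g, p in zip(gold, pred):
            if p == label and g == label:
                tp += 1
            elif p == label and g != label:
                fp += 1
            elif p != label and g == label:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative toy data only
gold = ["NB", "NB", "WB", "NB", "WB"]
pred = ["NB", "WB", "WB", "NB", "NB"]
print(accuracy(gold, pred), micro_f1(gold, pred))
```

Each misclassified item counts once as a false positive (for the predicted label) and once as a false negative (for the gold label), so pooled precision and recall both reduce to the share of correct items.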

Usage

Note: For the model to recognize the line-break marker, <lb> must be a separate token, i.e. surrounded by whitespace.
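
If the input marks line breaks without surrounding whitespace, the marker can be padded before tokenization. This is a minimal sketch under stated assumptions: the helper name `isolate_lb` is hypothetical, and treating the self-closing forms `<lb/>` and `<lb />` as equivalent to `<lb>` is an assumption about the input, not something the model card specifies. Markers carrying attributes are not handled.

```python
import re

# Matches <lb>, <lb/> and <lb /> together with any adjacent whitespace.
# Accepting the self-closing variants is an assumption about the input format.
LB_PATTERN = re.compile(r"\s*<lb\s*/?>\s*")


def isolate_lb(text: str) -> str:
    """Return text in which every line-break marker is the token ' <lb> '."""
    return LB_PATTERN.sub(" <lb> ", text).strip()


print(isolate_lb("quia res em<lb/>pta de pecunia"))
# quia res em <lb> pta de pecunia
```

The result can then be split on whitespace, which yields `<lb>` as a token of its own.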

from flair.models import SequenceTagger
from flair.data import Sentence

# Load the model
tagger = SequenceTagger.load('mschonhardt/flair-latin-linebreak-detector')

def process_latin_text(text_with_lb):
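    # Split on whitespace and pass the token list to Sentence; this bypasses
    # Flair's tokenizer, so <lb> remains a single, distinct token
    tokens = text_with_lb.split()
    sentence = Sentence(tokens)

    # Predict NB/WB labels for the sentence
    tagger.predict(sentence)
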
    for token in sentence:
        if token.text == "<lb>":
            label = token.get_label("lb")
            # NB = KEEP, WB = JOIN
            action = "KEEP" if label.value == "NB" else "JOIN"
            print(f"Token: {token.text} -> Suggestion: {action} ({label.score:.2%})")

# Example usage
process_latin_text("Et videtur, quod sic, quia res em <lb> pta de pecunia pupilli efficitur")

Citation

If you use this model, please cite the specific model DOI and the Flair framework:

@software{schonhardt_michael_2026_llbd,
  author = "Schonhardt, Michael",
  title = "Latin Contextual Line-Break Detector",
  year = "2026",
  publisher = "Zenodo",
  doi = "10.5281/zenodo.18390269",
  url = "https://doi.org/10.5281/zenodo.18390269"
}

@inproceedings{akbik-etal-2018-contextual,
    title = "Contextual String Embeddings for Sequence Labeling",
    author = "Akbik, Alan  and
      Blythe, Duncan  and
      Vollgraf, Roland",
    editor = "Bender, Emily M.  and
      Derczynski, Leon  and
      Isabelle, Pierre",
    booktitle = "Proceedings of the 27th International Conference on Computational Linguistics",
    month = aug,
    year = "2018",
    address = "Santa Fe, New Mexico, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/C18-1139/",
    pages = "1638--1649",
}
