Latin Contextual Line-Break Detector

This model is a specialized sequence tagger for Latin, designed to support OCR post-processing and editorial workflows. It predicts whether a line-break marker (<lb>) separates two words (NB) or splits a single word that should be rejoined (WB).

The model was trained as part of the projects "Embedding the Past" (LOEWE-Exploration, TU Darmstadt) and "Burchards Dekret Digital" (Langzeitvorhaben, Akademie der Wissenschaften und der Literatur | Mainz).

Model Logic

The tagger evaluates the context around a <lb> token using bidirectional embeddings:

  • NB (KEEP): Predicted if the line break occurs between two distinct words (No Break).
  • WB (JOIN): Predicted if the line break splits a word (Word Break).
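
The effect of the two labels on a text can be sketched in plain Python. This is a minimal illustration, not part of the model or of Flair: the helper name `apply_linebreak_labels` and its arguments are hypothetical, and it assumes that one NB/WB label per <lb> token has already been obtained, in reading order.

```python
def apply_linebreak_labels(tokens, labels, marker="<lb>"):
    """Rebuild running text from tokens and one NB/WB label per marker.

    Hypothetical helper for illustration; not part of the model or Flair.
    """
    label_iter = iter(labels)
    words = []
    join_next = False  # True when the previous marker was labelled WB
    for token in tokens:
        if token == marker:
            # WB: the marker splits a word, so the next token is glued
            # onto the previous one. NB: the marker is simply dropped.
            join_next = next(label_iter) == "WB" and bool(words)
            continue
        if join_next:
            words[-1] += token
            join_next = False
        else:
            words.append(token)
    return " ".join(words)


tokens = "quia res em <lb> pta de pecunia <lb> pupilli".split()
print(apply_linebreak_labels(tokens, ["WB", "NB"]))
# prints: quia res empta de pecunia pupilli
```

In this sketch the first marker is joined ("em" + "pta") and the second is treated as an ordinary word boundary.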

Technical Details

  • Architecture: Bi-LSTM + CRF Sequence Tagger.
  • Base Embeddings: Stacked Latin Legal Forward and Backward contextual string embeddings.
  • Data Source: ~1.47M sentences from medieval and early modern legal corpora, including the Decretum Burchardi and documents from the School of Salamanca.
  • Performance: 97.82% micro-averaged F1-score.
  • Robustness: Handles both normalized and diplomatic transcriptions, although it was trained primarily on expanded data.

Usage

Note: For the model to recognize the line-break marker, <lb> must be a separate token (surrounded by whitespace).

from flair.models import SequenceTagger
from flair.data import Sentence

# Load the model
tagger = SequenceTagger.load('mschonhardt/flair-latin-linebreak-detector')

def process_latin_text(text_with_lb):
    # Split on whitespace so that <lb> stays a distinct token;
    # passing a list of tokens bypasses Flair's own tokenizer
    tokens = text_with_lb.split()
    sentence = Sentence(tokens)

    # Predict
    tagger.predict(sentence)
    
    for token in sentence:
        if token.text == "<lb>":
            label = token.get_label("lb")
            # NB = KEEP, WB = JOIN
            action = "KEEP" if label.value == "NB" else "JOIN"
            print(f"Token: {token.text} -> Suggestion: {action} ({label.score:.2%})")

# Example usage
process_latin_text("Et videtur, quod sic, quia res em <lb> pta de pecunia pupilli efficitur")
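
If the input does not already separate the marker from the surrounding text, it can be normalized before tagging. The following is a minimal sketch under stated assumptions: the function name `isolate_lb_markers` is hypothetical, and the handling of the `<lb/>` and `<lb />` spellings is an assumption about possible input, since the model itself only expects the plain <lb> token. Markers carrying attributes are not handled here.

```python
import re

# Matches <lb>, <lb/> and <lb />, together with any adjacent whitespace.
# The variant spellings are an assumption about the input data.
_LB_PATTERN = re.compile(r"\s*<lb\s*/?>\s*")


def isolate_lb_markers(text):
    """Rewrite every line-break marker as "<lb>" separated by whitespace.

    Hypothetical helper for illustration; not part of the model or Flair.
    """
    return _LB_PATTERN.sub(" <lb> ", text).strip()


print(isolate_lb_markers("res em<lb/>pta de pecunia"))
# prints: res em <lb> pta de pecunia
```

The normalized string can then be passed to process_latin_text, which splits it on whitespace.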

Citation

If you use this model, please cite the specific model DOI and the Flair framework:

@software{schonhardt_michael_2026_llbd,
  author    = "Schonhardt, Michael",
  title     = "Latin Contextual Line-Break Detector",
  year      = "2026",
  publisher = "Zenodo",
  doi       = "10.5281/zenodo.18390269",
  url       = "https://doi.org/10.5281/zenodo.18390269"
}

@inproceedings{akbik-etal-2018-contextual,
  title     = "Contextual String Embeddings for Sequence Labeling",
  author    = "Akbik, Alan and Blythe, Duncan and Vollgraf, Roland",
  editor    = "Bender, Emily M. and Derczynski, Leon and Isabelle, Pierre",
  booktitle = "Proceedings of the 27th International Conference on Computational Linguistics",
  month     = aug,
  year      = "2018",
  address   = "Santa Fe, New Mexico, USA",
  publisher = "Association for Computational Linguistics",
  url       = "https://aclanthology.org/C18-1139/",
  pages     = "1638--1649",
}
