Latin Contextual Line-Break Detector
This model is a specialized sequence tagger for Latin, designed to support OCR post-processing and editorial workflows. For each line-break marker (<lb>), it predicts whether the break falls between two separate words (NB) or splits a single word that should be rejoined (WB).
The model was trained as part of the projects "Embedding the Past" (LOEWE-Exploration, TU Darmstadt) and "Burchards Dekret Digital" (Langzeitvorhaben, Akademie der Wissenschaften und der Literatur | Mainz).
Model Logic
The tagger evaluates the context around a <lb> token using bidirectional embeddings:
- NB (KEEP): Predicted if the line break occurs between two distinct words (No Break).
- WB (JOIN): Predicted if the line break splits a word (Word Break).
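The two labels translate into a simple reconstruction rule. The following is a minimal sketch and not part of the model: the helper name `apply_linebreak_labels` is hypothetical, it assumes one predicted label per <lb> token in reading order, and it assumes the marker itself is dropped from the output (an edition may prefer to keep it).

```python
from typing import List

LB = "<lb>"


def apply_linebreak_labels(tokens: List[str], labels: List[str]) -> str:
    """Rebuild running text from tokens, consuming one label per <lb> marker.

    NB: the marker is dropped and its neighbours stay separate words.
    WB: the marker is dropped and its neighbours are glued into one word.
    (Dropping the marker is an assumption of this sketch.)
    """
    if tokens.count(LB) != len(labels):
        raise ValueError("need exactly one label per <lb> token")

    label_iter = iter(labels)
    words: List[str] = []
    join_next = False
    for token in tokens:
        if token == LB:
            # Only join if there is a preceding word to attach to.
            join_next = next(label_iter) == "WB" and bool(words)
            continue
        if join_next:
            words[-1] += token
        else:
            words.append(token)
        join_next = False
    return " ".join(words)


tokens = "quia res em <lb> pta de pecunia".split()
print(apply_linebreak_labels(tokens, ["WB"]))
print(apply_linebreak_labels(tokens, ["NB"]))
```

With the label WB the fragments "em" and "pta" are merged into "empta"; with NB they remain two separate tokens.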
Technical Details
- Architecture: Bi-LSTM + CRF Sequence Tagger.
- Base Embeddings: Stacked Latin Legal Forward and Backward contextual string embeddings.
- Data Source: ~1.47M sentences from medieval and early modern legal corpora, including the Decretum Burchardi and documents from the School of Salamanca.
- Performance: 97.82% micro F1-score.
- Robustness: Handles both normalized and diplomatic transcriptions, although it was trained primarily on expanded data.
Usage
Note: For the model to recognize the line-break marker, <lb> must be a separate token, that is, surrounded by whitespace.
from flair.models import SequenceTagger
from flair.data import Sentence

# Load the model
tagger = SequenceTagger.load('mschonhardt/flair-latin-linebreak-detector')


def process_latin_text(text_with_lb):
    # Split manually to ensure <lb> remains a distinct token
    tokens = text_with_lb.split()
    sentence = Sentence(tokens)

    # Predict
    tagger.predict(sentence)

    for token in sentence:
        if token.text == "<lb>":
            label = token.get_label("lb")
            # NB = KEEP, WB = JOIN
            action = "KEEP" if label.value == "NB" else "JOIN"
            print(f"Token: {token.text} -> Suggestion: {action} ({label.score:.2%})")


# Example usage
process_latin_text("Et videtur, quod sic, quia res em <lb> pta de pecunia pupilli efficitur")
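Transcriptions often have the marker attached directly to a neighbouring word, which violates the whitespace requirement in the note above. The following is a minimal sketch of a pre-tokenization step, under stated assumptions: the helper name `tokenize_with_lb` is hypothetical, and treating the self-closing form `<lb/>` as equivalent to `<lb>` is an assumption of this sketch, not documented behaviour of the model. Markers that carry attributes are not handled.

```python
import re
from typing import List

# Matches "<lb>" or "<lb/>" together with any whitespace around it.
# Mapping "<lb/>" to "<lb>" is an assumption of this sketch.
_LB_PATTERN = re.compile(r"\s*<lb\s*/?>\s*")


def tokenize_with_lb(text: str) -> List[str]:
    """Pad every line-break marker with spaces, then split on whitespace."""
    padded = _LB_PATTERN.sub(" <lb> ", text)
    return padded.split()


print(tokenize_with_lb("quia res em<lb/>pta de pecunia"))
```

The resulting token list can be passed to `Sentence(...)` in place of `text_with_lb.split()`.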
Citation
If you use this model, please cite the specific model DOI and the Flair framework:
@software{schonhardt_michael_2026_llbd,
  author    = "Schonhardt, Michael",
  title     = "Latin Contextual Line-Break Detector",
  year      = "2026",
  publisher = "Zenodo",
  doi       = "10.5281/zenodo.18390269",
  url       = "https://doi.org/10.5281/zenodo.18390269"
}

@inproceedings{akbik-etal-2018-contextual,
  title     = "Contextual String Embeddings for Sequence Labeling",
  author    = "Akbik, Alan and Blythe, Duncan and Vollgraf, Roland",
  editor    = "Bender, Emily M. and Derczynski, Leon and Isabelle, Pierre",
  booktitle = "Proceedings of the 27th International Conference on Computational Linguistics",
  month     = aug,
  year      = "2018",
  address   = "Santa Fe, New Mexico, USA",
  publisher = "Association for Computational Linguistics",
  url       = "https://aclanthology.org/C18-1139/",
  pages     = "1638--1649"
}