---
license: cc-by-nc-sa-4.0
language:
- la
- fr
- en
- pt
- ca
- es
- it
pipeline_tag: token-classification
library_name: transformers
tags:
- medieval-texts
- phrase-segmentation
- multilingual
---

# Aquilign Multilingual Segmenter

**Aquilign Multilingual Segmenter** is a token-classification model for phrase-level segmentation of medieval and historical texts.

The model is designed to detect custom segmentation delimiters in multilingual historical corpora and is used as part of the [Aquilign](https://github.com/ProMeText/Aquilign) alignment workflow.

## Model Description

The segmenter is based on a trainable `BertForTokenClassification` model from Hugging Face’s `transformers` library. It was fine-tuned on historical prose from the [Multilingual Segmentation Dataset](https://github.com/ProMeText/multilingual-segmentation-dataset) to identify phrase-level segmentation boundaries.

## Supported Languages

- Latin
- French
- Castilian
- Portuguese
- Catalan
- English
- Italian

## Intended Use

This model is intended for:

- phrase-level segmentation of **medieval texts**
- preprocessing parallel corpora before alignment
- multilingual medieval text alignment workflows
- digital philology and computational humanities research

It is especially designed to be used with [Aquilign](https://github.com/ProMeText/Aquilign).

## Related Resources

- [Aquilign alignment tool](https://github.com/ProMeText/Aquilign)
- [Multilingual Segmentation Dataset](https://github.com/ProMeText/multilingual-segmentation-dataset)
- [ProMeTEXT GitHub organization](https://github.com/ProMeText)

## Citation

If you use this model, please cite the related dataset and publication.

### Dataset

```bibtex
@dataset{ing2025multilingual,
  author    = {Ing, L. and Gille Levenson, M. and Macedo, C.},
  title     = {Multilingual Segmentation Dataset for Historical Prose (13th--16th c.)},
  year      = {2025},
  publisher = {Zenodo},
  version   = {1.0},
  doi       = {10.5281/zenodo.16992629},
  url       = {https://doi.org/10.5281/zenodo.16992629},
  license   = {Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International}
}
```

### Related Publication

```bibtex
@inproceedings{ing-etal-2026-phrase,
  title     = {Phrase-Level Segmentation on Medieval Corpora for Aligning Multilingual Texts},
  author    = {Ing, Lucence and Gille Levenson, Matthias and Macedo, Carolina},
  booktitle = {Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)},
  month     = {May},
  year      = {2026},
  pages     = {936--946},
  address   = {Palma, Mallorca, Spain},
  publisher = {European Language Resources Association (ELRA)},
  doi       = {10.63317/32huzuuokpfr}
}
```
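
## Usage Sketch

The model description above says the segmenter labels tokens so that phrase-level boundaries can be recovered. The following is a minimal sketch of that post-processing step only, not the Aquilign implementation. It assumes, hypothetically, that each word has already received one integer label from the model and that label `1` marks a word opening a new segment; the actual label scheme, the function name `split_on_boundaries`, and the invented Latin example are illustrative assumptions and should be checked against the model configuration and the Aquilign repository.

```python
from typing import List, Sequence


def split_on_boundaries(
    tokens: Sequence[str],
    labels: Sequence[int],
    boundary_label: int = 1,  # assumption: 1 marks a segment-opening word
) -> List[str]:
    """Group words into segments, opening a new one at each boundary word."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")

    segments: List[List[str]] = []
    for token, label in zip(tokens, labels):
        # Open a new segment at a boundary word, or for the very first word.
        if label == boundary_label or not segments:
            segments.append([])
        segments[-1].append(token)
    return [" ".join(segment) for segment in segments]


# Invented example: labels are written by hand, not produced by the model.
tokens = "et dixit ei quod veniret ad regem".split()
labels = [0, 0, 0, 1, 0, 0, 0]
print(split_on_boundaries(tokens, labels))
# → ['et dixit ei', 'quod veniret ad regem']
```

In practice the labels would come from running the fine-tuned token-classification model on the text and mapping sub-word predictions back to words; that step is deliberately left out here because it depends on the published checkpoint and its label mapping.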