	---
	license: cc-by-nc-sa-4.0
	language:
	- la
	- fr
	- en
	- pt
	- ca
	- es
	- it
	pipeline_tag: token-classification
	library_name: transformers
	tags:
	- medieval-texts
	- phrase-segmentation
	- multilingual
	---


	# Aquilign Multilingual Segmenter

	Aquilign Multilingual Segmenter is a token-classification model for phrase-level segmentation of medieval and historical texts.

	The model is designed to detect custom segmentation delimiters in multilingual historical corpora and is used as part of the [Aquilign](https://github.com/ProMeText/Aquilign) alignment workflow.

	## Model Description

	The segmenter is built on the `BertForTokenClassification` architecture from Hugging Face’s `transformers` library.

	It was fine-tuned on historical prose from the [Multilingual Segmentation Dataset](https://github.com/ProMeText/multilingual-segmentation-dataset) to identify phrase-level segmentation boundaries.
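As a rough illustration of how a `BertForTokenClassification` checkpoint such as this one is usually loaded, the sketch below wraps the `transformers` token-classification pipeline. The repository id is an assumption inferred from the organization and model name, and `load_segmenter` and `describe_labels` are hypothetical helper names. This card does not document the label scheme, so the second helper only lists whatever labels the checkpoint's configuration declares.

```python
from __future__ import annotations

from typing import Any

# Assumption: Hub repository id inferred from the organization and model name.
MODEL_ID = "ProMeText/aquilign-multilingual-segmenter"


def load_segmenter(model_id: str = MODEL_ID) -> Any:
    """Build a token-classification pipeline (downloads the weights on first call)."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import pipeline

    return pipeline("token-classification", model=model_id)


def describe_labels(id2label: dict[int, str]) -> list[str]:
    """Return the label names ordered by class index."""
    return [id2label[index] for index in sorted(id2label)]
```

With `transformers` installed, `segmenter = load_segmenter()` followed by `describe_labels(segmenter.model.config.id2label)` should show which labels the model emits before any downstream code relies on a specific label name.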

	## Supported Languages

	- Latin
	- French
	- Castilian
	- Portuguese
	- Catalan
	- English
	- Italian

	## Intended Use

	This model is intended for:

	- phrase-level segmentation of medieval texts
	- preprocessing parallel corpora before alignment
	- multilingual medieval text alignment workflows
	- digital philology and computational humanities research

	It is designed primarily for use with [Aquilign](https://github.com/ProMeText/Aquilign).
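The preprocessing use case above can be sketched as follows: given one predicted label per word, group the words into phrase-level segments before alignment. This is a minimal sketch, not Aquilign's actual implementation. The label name `"B"` (marking the first word of a new segment), the `"|"` delimiter, and both function names are hypothetical, chosen for illustration only; the real label scheme should be read from the model configuration.

```python
from __future__ import annotations


def split_on_boundaries(
    tokens: list[str], labels: list[str], boundary_label: str = "B"
) -> list[str]:
    """Group tokens into segments; a token tagged boundary_label opens a new segment."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    segments: list[list[str]] = []
    current: list[str] = []
    for token, label in zip(tokens, labels):
        # Close the running segment when a boundary token is reached.
        if label == boundary_label and current:
            segments.append(current)
            current = []
        current.append(token)
    if current:
        segments.append(current)
    return [" ".join(segment) for segment in segments]


def mark_boundaries(
    tokens: list[str],
    labels: list[str],
    delimiter: str = "|",  # hypothetical delimiter, for illustration only
    boundary_label: str = "B",
) -> str:
    """Return the text as one string with the delimiter between segments."""
    return f" {delimiter} ".join(split_on_boundaries(tokens, labels, boundary_label))


# Hand-written example labels, not model output.
tokens = ["in", "principio", "erat", "verbum", "et", "verbum", "erat", "apud", "deum"]
labels = ["B", "O", "O", "O", "B", "O", "O", "O", "O"]
segments = split_on_boundaries(tokens, labels)
marked = mark_boundaries(tokens, labels)
```

Keeping the grouping step separate from the model call makes it easy to test on hand-labelled examples and to swap in a different label name once the checkpoint's labels are known.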


	## Related Resources

	- [Aquilign alignment tool](https://github.com/ProMeText/Aquilign)
	- [Multilingual Segmentation Dataset](https://github.com/ProMeText/multilingual-segmentation-dataset)
	- [ProMeText GitHub organization](https://github.com/ProMeText)

	## Citation

	If you use this model, please cite the related dataset and publication.

	### Dataset

	```bibtex
	@dataset{ing2025multilingual,
	author = {Ing, L. and Gille Levenson, M. and Macedo, C.},
	title = {Multilingual Segmentation Dataset for Historical Prose (13th--16th c.)},
	year = {2025},
	publisher = {Zenodo},
	version = {1.0},
	doi = {10.5281/zenodo.16992629},
	url = {https://doi.org/10.5281/zenodo.16992629},
	license = {Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International}
	}
	```

	### Related Publication

	```bibtex
	@inproceedings{ing-etal-2026-phrase,
	title = {Phrase-Level Segmentation on Medieval Corpora for Aligning Multilingual Texts},
	author = {Ing, Lucence and Gille Levenson, Matthias and Macedo, Carolina},
	booktitle = {Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)},
	month = {May},
	year = {2026},
	pages = {936--946},
	address = {Palma, Mallorca, Spain},
	publisher = {European Language Resources Association (ELRA)},
	doi = {10.63317/32huzuuokpfr}
	}
	```