---
license: cc-by-nc-sa-4.0
language:
- la
- fr
- en
- pt
- ca
- es
- it
pipeline_tag: token-classification
library_name: transformers
tags:
- medieval-texts
- phrase-segmentation
- multilingual
---
# Aquilign Multilingual Segmenter
**Aquilign Multilingual Segmenter** is a token-classification model for phrase-level segmentation of medieval and historical texts.
The model is designed to detect custom segmentation delimiters in multilingual historical corpora and is used as part of the [Aquilign](https://github.com/ProMeText/Aquilign) alignment workflow.
## Model Description
The segmenter is built on the `BertForTokenClassification` architecture from Hugging Face’s `transformers` library.
It was fine-tuned on historical prose from the [Multilingual Segmentation Dataset](https://github.com/ProMeText/multilingual-segmentation-dataset) to identify phrase-level segmentation boundaries.
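Because the model is a standard token classifier, it can in principle be loaded with the generic `transformers` auto classes. The following is a minimal sketch, not the Aquilign pipeline itself. It assumes a fast tokenizer is available for the checkpoint, that the input is already split into words, and that the label names come from the checkpoint's own `id2label` mapping. `model_id` is a placeholder for this repository's Hub identifier, and both function names are illustrative.

```python
from typing import Dict, List, Optional, Tuple


def first_subword_labels(
    words: List[str],
    word_ids: List[Optional[int]],
    predicted_ids: List[int],
    id2label: Dict[int, str],
) -> List[Tuple[str, str]]:
    """Keep one label per word: the prediction made on its first subword."""
    labelled = []
    seen = set()
    for position, word_index in enumerate(word_ids):
        # Special tokens have no word index; later subwords of a word are skipped.
        if word_index is None or word_index in seen:
            continue
        seen.add(word_index)
        labelled.append((words[word_index], id2label[predicted_ids[position]]))
    return labelled


def predict_word_labels(words: List[str], model_id: str) -> List[Tuple[str, str]]:
    """Return (word, predicted_label) pairs for a pre-split passage."""
    # Imported lazily so the helper above can be used without torch installed.
    import torch
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(model_id)
    model.eval()

    encoding = tokenizer(
        words, is_split_into_words=True, return_tensors="pt", truncation=True
    )
    with torch.no_grad():
        logits = model(**encoding).logits[0]
    predicted_ids = logits.argmax(dim=-1).tolist()

    # word_ids() is only available on fast tokenizers (assumed here).
    word_ids = encoding.word_ids(batch_index=0)
    return first_subword_labels(words, word_ids, predicted_ids, model.config.id2label)
```

Taking the label of the first subword is one common convention for word-level token classification; whether it matches how this model was trained is not stated here, so treat it as an assumption to check against the Aquilign code.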
## Supported Languages
- Latin
- French
- Castilian
- Portuguese
- Catalan
- English
- Italian
## Intended Use
This model is intended for:
- phrase-level segmentation of **medieval texts**
- preprocessing parallel corpora before alignment
- multilingual medieval text alignment workflows
- digital philology and computational humanities research
It is designed primarily for use with [Aquilign](https://github.com/ProMeText/Aquilign).
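For preprocessing outside Aquilign, per-word predictions have to be turned into segments before alignment. The sketch below shows one way to do that in plain Python. It assumes you already have `(word, label)` pairs and that a single label marks the word that opens a new segment; the label name `"B"` is hypothetical, so check the checkpoint's `id2label` mapping for the real one.

```python
from typing import List, Tuple


def split_into_segments(
    labelled_words: List[Tuple[str, str]],
    boundary_label: str = "B",  # hypothetical label name, not taken from the model
) -> List[str]:
    """Group words into segments, starting a new one at each boundary label."""
    segments: List[str] = []
    current: List[str] = []
    for word, label in labelled_words:
        # Close the running segment when a boundary word is reached.
        if label == boundary_label and current:
            segments.append(" ".join(current))
            current = []
        current.append(word)
    if current:
        segments.append(" ".join(current))
    return segments
```

Each returned string can then be written on its own line, or joined with whatever delimiter your alignment tool expects.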
## Related Resources
- [Aquilign alignment tool](https://github.com/ProMeText/Aquilign)
- [Multilingual Segmentation Dataset](https://github.com/ProMeText/multilingual-segmentation-dataset)
- [ProMeText GitHub organization](https://github.com/ProMeText)
## Citation
If you use this model, please cite the related dataset and publication.
### Dataset
```bibtex
@dataset{ing2025multilingual,
author = {Ing, L. and Gille Levenson, M. and Macedo, C.},
title = {Multilingual Segmentation Dataset for Historical Prose (13th--16th c.)},
year = {2025},
publisher = {Zenodo},
version = {1.0},
doi = {10.5281/zenodo.16992629},
url = {https://doi.org/10.5281/zenodo.16992629},
license = {Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International}
}
```
### Related Publication
```bibtex
@inproceedings{ing-etal-2026-phrase,
title = {Phrase-Level Segmentation on Medieval Corpora for Aligning Multilingual Texts},
author = {Ing, Lucence and Gille Levenson, Matthias and Macedo, Carolina},
booktitle = {Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)},
month = {May},
year = {2026},
pages = {936--946},
address = {Palma, Mallorca, Spain},
publisher = {European Language Resources Association (ELRA)},
doi = {10.63317/32huzuuokpfr}
}
```