---
license: cc-by-nc-sa-4.0
language:
- la
- fr
- en
- pt
- ca
- es
- it
pipeline_tag: token-classification
library_name: transformers
tags:
- medieval-texts
- phrase-segmentation
- multilingual
---


# Aquilign Multilingual Segmenter

**Aquilign Multilingual Segmenter** is a token-classification model for phrase-level segmentation of medieval and historical texts.

The model is designed to detect custom segmentation delimiters in multilingual historical corpora and is used as part of the [Aquilign](https://github.com/ProMeText/Aquilign) alignment workflow.

## Model Description

The segmenter is built on the `BertForTokenClassification` class from Hugging Face’s `transformers` library.

It was fine-tuned on historical prose from the [Multilingual Segmentation Dataset](https://github.com/ProMeText/multilingual-segmentation-dataset) to identify phrase-level segmentation boundaries.
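As a rough illustration of how token-classification output can be turned into phrase segments, here is a minimal sketch. It rests on several assumptions that this card does not confirm: that the checkpoint loads through the `Auto*` classes with a fast tokenizer, that label `1` marks the first word of a new segment and `0` everything else, and that `model_id` is whatever repository id or local path you use for this model. The helper names `split_on_boundaries` and `predict_word_labels` are hypothetical.

```python
from typing import List


def split_on_boundaries(words: List[str], labels: List[int]) -> List[str]:
    """Group words into segments; a label of 1 opens a new segment.

    The 0/1 label scheme is an assumption made for this sketch.
    """
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    segments: List[List[str]] = []
    for word, label in zip(words, labels):
        # Open a new segment at the very start or whenever a boundary is flagged.
        if label == 1 or not segments:
            segments.append([])
        segments[-1].append(word)
    return [" ".join(segment) for segment in segments]


def predict_word_labels(words: List[str], model_id: str) -> List[int]:
    """Run a token-classification checkpoint and keep one label per word."""
    # Imported lazily so the helper above works without these packages installed.
    import torch
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(model_id)
    encoding = tokenizer(
        words, is_split_into_words=True, return_tensors="pt", truncation=True
    )
    with torch.no_grad():
        logits = model(**encoding).logits
    token_labels = logits.argmax(dim=-1)[0].tolist()

    labels = [0] * len(words)  # words lost to truncation keep label 0
    seen = set()
    for position, word_index in enumerate(encoding.word_ids(batch_index=0)):
        # Skip special tokens and keep only the first sub-token of each word.
        if word_index is None or word_index in seen:
            continue
        seen.add(word_index)
        labels[word_index] = token_labels[position]
    return labels
```

With a whitespace-tokenized passage in `words`, `split_on_boundaries(words, predict_word_labels(words, model_id))` would return the list of segments. Check the checkpoint's `id2label` mapping before relying on the 0/1 convention used here.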

## Supported Languages

- Latin
- French
- Castilian
- Portuguese
- Catalan
- English
- Italian

## Intended Use

This model is intended for:

- phrase-level segmentation of **medieval texts**
- preprocessing parallel corpora before alignment
- multilingual medieval text alignment workflows
- digital philology and computational humanities research

It is especially designed to be used with [Aquilign](https://github.com/ProMeText/Aquilign).
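For the preprocessing use case, the sketch below shows one way to write segmented parallel texts with an explicit boundary marker and read them back. The `|` delimiter, the helper names, and the toy sentences are all invented for illustration; they do not describe the format Aquilign itself expects.

```python
from typing import Dict, List

DELIMITER = "|"  # hypothetical marker; use whatever your alignment tool expects


def mark_segments(segments: List[str], delimiter: str = DELIMITER) -> str:
    """Join segments into one line of text with a visible boundary marker."""
    return f" {delimiter} ".join(segment.strip() for segment in segments)


def read_segments(text: str, delimiter: str = DELIMITER) -> List[str]:
    """Recover the segment list from delimited text, dropping empty pieces."""
    return [piece.strip() for piece in text.split(delimiter) if piece.strip()]


# Toy parallel witnesses, already segmented (invented for illustration).
witnesses: Dict[str, List[str]] = {
    "la": ["In principio", "creavit Deus caelum et terram"],
    "fr": ["Au commencement", "Dieu crea le ciel et la terre"],
}

# One delimited string per language, ready to be written to disk.
delimited = {lang: mark_segments(segs) for lang, segs in witnesses.items()}
```

Keeping the marker as a single configurable constant makes it easy to switch to a character that never occurs in the source texts.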


## Related Resources

- [Aquilign alignment tool](https://github.com/ProMeText/Aquilign)
- [Multilingual Segmentation Dataset](https://github.com/ProMeText/multilingual-segmentation-dataset)
- [ProMeText GitHub organization](https://github.com/ProMeText)

## Citation

If you use this model, please cite the related dataset and publication.

### Dataset

```bibtex
@dataset{ing2025multilingual,
  author       = {Ing, L. and Gille Levenson, M. and Macedo, C.},
  title        = {Multilingual Segmentation Dataset for Historical Prose (13th--16th c.)},
  year         = {2025},
  publisher    = {Zenodo},
  version      = {1.0},
  doi          = {10.5281/zenodo.16992629},
  url          = {https://doi.org/10.5281/zenodo.16992629},
  license      = {Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International}
}
```

### Related Publication

```bibtex
@inproceedings{ing-etal-2026-phrase,
  title = {Phrase-Level Segmentation on Medieval Corpora for Aligning Multilingual Texts},
  author = {Ing, Lucence and Gille Levenson, Matthias and Macedo, Carolina},
  booktitle = {Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)},
  month = {May},
  year = {2026},
  pages = {936--946},
  address = {Palma, Mallorca, Spain},
  publisher = {European Language Resources Association (ELRA)},
  doi = {10.63317/32huzuuokpfr}
}
```