| **Estienne** is a text-segmentation model built on DeBERTa. | |
| In contrast with most text-segmentation approaches, Estienne is based on token classification: editorial structures are identified in much the same way as entities in named-entity recognition. | |
| Estienne was trained on 2,000 examples of manually annotated texts, excerpted at random from three very large datasets collected by Pleias: Common Corpus (cultural heritage texts in the public domain), Marianne-OpenData (French/English administrative documents) and OpenScientificPile (scientific publications under free licenses, indexed on OpenAlex). | |
| Given the diversity of the corpus, Estienne should work on a wide range of document formats in European languages. | |
| As DeBERTa removes newlines by default and its tokenizer has no support for them, newlines should be replaced by pilcrows (¶) before the text is passed to the model. | |
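
The newline replacement described above can be sketched in a few lines of Python. This is a minimal illustration, not the preprocessing used for training: the helper names `newlines_to_pilcrows` and `pilcrows_to_newlines` are hypothetical, and the choice to emit one pilcrow per newline with no surrounding spaces is an assumption.

```python
PILCROW = "¶"


def newlines_to_pilcrows(text: str) -> str:
    """Replace every line break with a pilcrow (hypothetical helper).

    Assumption: one pilcrow per newline, with no padding spaces.
    """
    # Normalize Windows (\r\n) and old Mac (\r) line endings first,
    # so that each line break yields exactly one pilcrow.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalized.replace("\n", PILCROW)


def pilcrows_to_newlines(text: str) -> str:
    """Restore newlines after segmentation (hypothetical helper)."""
    return text.replace(PILCROW, "\n")


encoded = newlines_to_pilcrows("Title\r\nBody\n\nEnd")
print(encoded)
```

Note that the round trip is only lossless if the original text contains no pilcrows of its own.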
| Estienne supports the following segmentations: | |
| * **Text** | |
| * **Separator** - a boundary between segments. Separators generally fall on newlines (encoded as ¶), with some variation depending on how the model interprets the structure of the text. | |
| * **Title** | |
| * **Table** | |
| * **Dialog** - any kind of speaker-attributed utterance. | |
| * **Bibliography** - statement of a specific bibliographic reference, either in a bibliography section or a footnote. | |
| * **Contact** - personal information; especially useful in the context of PII removal. | |
| * **Paratext** - any text in standard documents that carries no content of its own, such as headers, page numbers, repeated section titles, etc. | |
| * **Author** - author names and signatures. | |
| * **Date** - statement of date and time, common in letters and newspaper articles. | |
| * **Keyword** - list of keywords, especially common in scientific publications. | |
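
Because Estienne assigns one label per token, segments can be recovered by merging consecutive tokens that share a label, much as entity spans are assembled in named-entity recognition. The sketch below illustrates that post-processing step only: the function `group_segments` is hypothetical, and the tokens and labels are hand-written stand-ins rather than real model output.

```python
from itertools import groupby


def group_segments(tokens, labels):
    """Merge consecutive tokens with the same label into (label, text) pairs.

    Hypothetical helper: assumes one predicted label per token.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    segments = []
    # groupby only merges *adjacent* items, so two Text runs split by a
    # Separator stay distinct segments.
    for label, group in groupby(zip(tokens, labels), key=lambda pair: pair[1]):
        text = " ".join(token for token, _ in group)
        segments.append((label, text))
    return segments


# Hand-written example input, not actual predictions.
tokens = ["Chapter", "One", "¶", "It", "was", "late", "."]
labels = ["Title", "Title", "Separator", "Text", "Text", "Text", "Text"]

for label, text in group_segments(tokens, labels):
    print(f"{label}: {text}")
```

Joining tokens with a space is a simplification; a real pipeline would more likely use the character offsets returned by the tokenizer to slice the original string.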
| The model is named after the humanist Henri Estienne, who introduced many text-segmentation practices still in use in scholarly editions today. |