	Estienne is a text-segmentation model based on DeBERTa.

	In contrast with most text-segmentation approaches, Estienne is based on token classification. Editorial structures are identified in much the same way as entities in named-entity recognition.
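
Because the model labels tokens rather than predicting boundaries directly, segments have to be rebuilt by merging runs of consecutive tokens that share a label. The following is a minimal sketch of that grouping step, assuming the per-token predictions are already available as `(token, label)` pairs. The function name `group_tokens` and the sample tokens are hypothetical, and the model's actual output format (for example any B-/I- prefixes) is not documented here.

```python
from itertools import groupby

def group_tokens(tagged_tokens):
    """Merge consecutive (token, label) pairs that share a label into (label, text) segments."""
    segments = []
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        # Joining with spaces is a simplification; real subword tokens need proper detokenization.
        text = " ".join(token for token, _ in run)
        segments.append((label, text))
    return segments

# Hypothetical predictions, for illustration only.
tagged = [
    ("Annual", "Title"), ("Report", "Title"),
    ("¶", "Separator"),
    ("Sales", "Text"), ("rose", "Text"), ("steadily.", "Text"),
]
segments = group_tokens(tagged)
for label, text in segments:
    print(f"{label}: {text}")
```

With a real tokenizer, character offsets would be a more reliable way to recover the segment text than joining tokens.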

	Estienne was trained on 2,000 examples of manually annotated texts, excerpted at random from three very large datasets collected by Pleias: Common Corpus (cultural heritage texts in the public domain), Marianne-OpenData (French/English administrative documents) and OpenScientificPile (scientific publications under free licenses, indexed on OpenAlex).

	Given the diversity of the corpus, Estienne should work on a wide range of document formats in European languages.

	As DeBERTa removes newlines by default and has no support for them in the tokenizer, they should be replaced by pilcrows (¶) before the text is passed to the model.
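
The newline replacement can be done with plain string operations before tokenization. This is a minimal sketch; the helper names `newlines_to_pilcrows` and `pilcrows_to_newlines` are hypothetical, and the choice to map each line break to exactly one pilcrow, with no added spacing, is an assumption.

```python
PILCROW = "¶"

def newlines_to_pilcrows(text: str) -> str:
    """Replace every line break with a pilcrow so it survives tokenization."""
    # Normalize Windows (\r\n) and bare carriage-return (\r) line endings first.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalized.replace("\n", PILCROW)

def pilcrows_to_newlines(text: str) -> str:
    """Inverse step for display; assumes the original text contained no pilcrows."""
    return text.replace(PILCROW, "\n")

sample = "Title\nFirst paragraph.\r\nSecond paragraph."
encoded = newlines_to_pilcrows(sample)
print(encoded)
```

If the input may already contain literal pilcrows, the inverse step is no longer exact and a different placeholder handling would be needed.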

	Estienne supports the following segmentations:
	* Text
	* Separator - a boundary between segments. Separators are generally based on newlines (in practice ¶), with some variation depending on how the model interprets the text structure.
	* Title
	* Table
	* Dialog - any kind of speaker-attributed utterance.
	* Bibliography - statement of a specific bibliographic reference, either in a bibliography section or a footnote.
	* Contact - personal information; especially useful in the context of PII removal.
	* Paratext - any non-meaningful text included in standard documents, such as headers, page numbers, repeated section titles, etc.
	* Author - author names and signatures.
	* Date - statement of date and time, common in letters and newspaper articles.
	* Keyword - list of keywords, especially common in scientific publications.
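
Once a document has been split into labeled segments, the labels can drive simple clean-up steps such as dropping Contact segments for PII removal or Paratext segments for noise reduction. This is a minimal sketch, assuming the segments are available as `(label, text)` pairs. The helper `drop_labels` and the sample data are hypothetical, and the exact label strings emitted by the model may differ from the names in the list above.

```python
def drop_labels(segments, unwanted):
    """Keep only the (label, text) segments whose label is not in `unwanted`."""
    return [(label, text) for label, text in segments if label not in unwanted]

# Hypothetical segmented document, for illustration only.
segments = [
    ("Paratext", "Page 3"),
    ("Title", "Minutes of the meeting"),
    ("Author", "J. Doe"),
    ("Contact", "j.doe@example.org"),
    ("Text", "The session opened at noon."),
]

cleaned = drop_labels(segments, {"Contact", "Paratext"})
for label, text in cleaned:
    print(f"{label}: {text}")
```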

	The model is named after the humanist Henri Estienne, who introduced many practices of text segmentation still in use in scholarly editions today.
