---
license: mit
language:
- la
- el
- fr
- en
- de
- it
base_model:
- FacebookAI/xlm-roberta-base
---

# Model Description

<!-- Provide a quick summary of what the model is/does. -->

This model checkpoint was created by further pre-training XLM-RoBERTa-base on a 1.4B-token corpus of classical texts written mainly in Ancient Greek, Latin, French, German, English, and Italian.
The corpus notably contains data from [Brill-KIEM](https://github.com/kiem-group/pdfParser), various ancient sources from the Internet Archive, the [Corpus Thomisticum](https://www.corpusthomisticum.org/), [Open Greek and Latin](https://www.opengreekandlatin.org/), [JSTOR](https://about.jstor.org/whats-in-jstor/text-mining-support/), [Persée](https://www.persee.fr/), Propylaeum, [Remacle](https://remacle.org/), and Wikipedia.
The model can be used as a checkpoint for further pre-training or as a base model for fine-tuning.
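As a minimal usage sketch, the checkpoint can be loaded with the Hugging Face `transformers` library for masked-language-model inference or as a starting point for fine-tuning. The model identifier below is a placeholder, since the repository id is not stated here; replace it with this model's actual Hub id.

```python
# Hypothetical usage sketch: loading this checkpoint with Hugging Face transformers.
# "your-org/your-model-id" is a placeholder -- substitute this repository's Hub id.
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("your-org/your-model-id")
model = AutoModelForMaskedLM.from_pretrained("your-org/your-model-id")

# Fill-mask inference on a Latin sentence; XLM-RoBERTa-based
# tokenizers use "<mask>" as the mask token.
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for pred in fill("Gallia est omnis divisa in partes <mask>."):
    print(pred["token_str"], pred["score"])
```

For fine-tuning (e.g. on named-entity recognition), the same checkpoint can instead be loaded with a task-specific head such as `AutoModelForTokenClassification`.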
The model was evaluated on classics-related named-entity recognition and part-of-speech tagging and surpassed XLM-RoBERTa-base on all tasks.
It also performed significantly better than comparable models trained from scratch on the same corpus.