---
language: "cs"
tags:
- Czech
- KKY
- FAV
- RoBERTa
license: "cc-by-nc-sa-4.0"
---

# FERNET-C5-RoBERTa
FERNET-C5-RoBERTa (FERNET stands for **F**lexible **E**mbedding **R**epresentation **NET**work) is a monolingual Czech RoBERTa-base model pretrained on the Czech Colossal Clean Crawled Corpus (C5).
It is the successor to the BERT model [fav-kky/FERNET-C5](https://huggingface.co/fav-kky/FERNET-C5).
See our paper for details.

## How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='fav-kky/FERNET-C5-RoBERTa')
>>> unmasker("Ahoj, jsem jazykový model a hodím se třeba pro práci s <mask>.")

[{'score': 0.13343162834644318,
  'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s textem.',
  'token': 33582,
  'token_str': ' textem'},
 {'score': 0.12583224475383759,
  'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s '
              'počítačem.',
  'token': 32837,
  'token_str': ' počítačem'},
 {'score': 0.0796666219830513,
  'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s obrázky.',
  'token': 15876,
  'token_str': ' obrázky'},
 {'score': 0.06347835063934326,
  'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s lidmi.',
  'token': 5426,
  'token_str': ' lidmi'},
 {'score': 0.050984010100364685,
  'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s dětmi.',
  'token': 5468,
  'token_str': ' dětmi'}]
```
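
The pipeline returns a list of dictionaries, one per candidate token. As a minimal sketch of post-processing that output, the snippet below works on the predictions printed above, hard-coded here so that it runs without downloading the model. The helper name `top_candidates` and the `min_score` threshold are hypothetical and not part of `transformers`.

```python
# Predictions copied from the example output above ('sequence' omitted).
predictions = [
    {'score': 0.13343162834644318, 'token': 33582, 'token_str': ' textem'},
    {'score': 0.12583224475383759, 'token': 32837, 'token_str': ' počítačem'},
    {'score': 0.0796666219830513, 'token': 15876, 'token_str': ' obrázky'},
    {'score': 0.06347835063934326, 'token': 5426, 'token_str': ' lidmi'},
    {'score': 0.050984010100364685, 'token': 5468, 'token_str': ' dětmi'},
]


def top_candidates(preds, min_score=0.0):
    """Hypothetical helper: return (word, score) pairs with score >= min_score."""
    # 'token_str' carries a leading space in the output above,
    # so strip it before displaying the word.
    return [(p['token_str'].strip(), round(p['score'], 3))
            for p in preds if p['score'] >= min_score]


candidates = top_candidates(predictions, min_score=0.07)
# Probability mass that the five listed candidates account for together.
covered_mass = sum(p['score'] for p in predictions)

print(candidates)
print(f"Probability mass covered by the top 5: {covered_mass:.3f}")
```

The five candidates shown account for a bit less than half of the probability mass, so the remaining mass is spread over tokens the pipeline did not list.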

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import RobertaTokenizer, RobertaModel

# Load the tokenizer and the encoder without the pooling layer.
tokenizer = RobertaTokenizer.from_pretrained('fav-kky/FERNET-C5-RoBERTa')
model = RobertaModel.from_pretrained('fav-kky/FERNET-C5-RoBERTa', add_pooling_layer=False)

text = "Libovolný text."  # Czech for "Arbitrary text."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
# output.last_hidden_state holds one feature vector per input token.
```
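
With `add_pooling_layer=False`, the model returns one vector per token rather than a single vector for the whole text. One common way to get a fixed-size representation is to average the token vectors while ignoring padding positions. The sketch below illustrates that idea in plain Python on toy 3-dimensional vectors. The function name `masked_mean` is hypothetical, and using mean pooling at all is an assumption here, not something this model card prescribes.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the token vectors whose attention-mask entry is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy stand-in for the hidden states of one sequence: 4 tokens, 3 dimensions.
# Real hidden states from the model have many more dimensions per token.
hidden_states = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [5.0, 6.0, 7.0],
    [9.0, 9.0, 9.0],  # padding position, masked out below
]
attention_mask = [1, 1, 1, 0]

sentence_vector = masked_mean(hidden_states, attention_mask)
print(sentence_vector)  # [3.0, 4.0, 5.0]
```

With the PyTorch objects from the example above, the same computation would be applied to `output.last_hidden_state` and `encoded_input['attention_mask']`.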

## Training data

The model was pretrained on a mix of three text sources:
- Czech web pages extracted from the Common Crawl project (93GB),
- self-crawled Czech news dataset (20GB),
- the Czech part of Wikipedia (1GB).
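
For a rough sense of the mix, the share of each source can be computed from the sizes listed above. This is only arithmetic over the stated figures. It assumes the three sizes are measured the same way and says nothing about how the sources were sampled during training. The dictionary keys are shorthand labels chosen for this sketch.

```python
# Corpus sizes in GB, as listed above.
sources_gb = {
    "Common Crawl web pages": 93,
    "Czech news": 20,
    "Czech Wikipedia": 1,
}

total_gb = sum(sources_gb.values())
# Fraction of the total corpus size contributed by each source.
shares = {name: size / total_gb for name, size in sources_gb.items()}

print(f"Total: {total_gb} GB")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

By size, web pages make up roughly four fifths of the corpus, and Wikipedia contributes under one percent.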

The model was pretrained for 500k steps (more than 15 epochs over the full dataset) with a peak learning rate of 4e-4.
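
A "peak" learning rate implies a schedule that rises to that value and then falls. The sketch below shows one common shape, linear warmup followed by linear decay to zero, using the step count and peak rate stated above. The schedule shape and the `WARMUP_STEPS` value are assumptions made for illustration; this model card does not state them.

```python
TOTAL_STEPS = 500_000   # from the text above
PEAK_LR = 4e-4          # from the text above
WARMUP_STEPS = 10_000   # ASSUMPTION: not stated in this model card


def learning_rate(step):
    """Linear warmup to PEAK_LR, then linear decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(TOTAL_STEPS - step, 0)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


# Print the rate at the start, mid-warmup, the peak, mid-decay, and the end.
for step in (0, 5_000, 10_000, 255_000, 500_000):
    print(f"step {step:>7}: lr = {learning_rate(step):.2e}")
```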

## Paper
https://link.springer.com/chapter/10.1007/978-3-030-89579-2_3

The preprint of our paper is available at https://arxiv.org/abs/2107.10042.

## Citation
If you find this model useful, please cite our related paper:
```
@inproceedings{FERNETC5,
    title        = {Comparison of Czech Transformers on Text Classification Tasks},
    author       = {Lehe{\v{c}}ka, Jan and {\v{S}}vec, Jan},
    year         = 2021,
    booktitle    = {Statistical Language and Speech Processing},
    publisher    = {Springer International Publishing},
    address      = {Cham},
    pages        = {27--37},
    doi          = {10.1007/978-3-030-89579-2_3},
    isbn         = {978-3-030-89579-2},
    editor       = {Espinosa-Anke, Luis and Mart{\'i}n-Vide, Carlos and Spasi{\'{c}}, Irena}
}
```