	---
	language: "cs"
	tags:
	- Czech
	- KKY
	- FAV
	- RoBERTa
	license: "cc-by-nc-sa-4.0"
	---

	# FERNET-C5-RoBERTa
	FERNET-C5-RoBERTa (FERNET stands for Flexible Embedding Representation NETwork) is a monolingual Czech RoBERTa-base model pre-trained on the Czech Colossal Clean Crawled Corpus (C5).
	It is a successor to the BERT model [fav-kky/FERNET-C5](https://huggingface.co/fav-kky/FERNET-C5).
	See our paper for details.

	## How to use

	You can use this model directly with a pipeline for masked language modeling:

	```python
	>>> from transformers import pipeline
	>>> unmasker = pipeline('fill-mask', model='fav-kky/FERNET-C5-RoBERTa')
	>>> unmasker("Ahoj, jsem jazykový model a hodím se třeba pro práci s <mask>.")

	[{'score': 0.13343162834644318,
	'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s textem.',
	'token': 33582,
	'token_str': ' textem'},
	{'score': 0.12583224475383759,
	'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s '
	'počítačem.',
	'token': 32837,
	'token_str': ' počítačem'},
	{'score': 0.0796666219830513,
	'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s obrázky.',
	'token': 15876,
	'token_str': ' obrázky'},
	{'score': 0.06347835063934326,
	'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s lidmi.',
	'token': 5426,
	'token_str': ' lidmi'},
	{'score': 0.050984010100364685,
	'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s dětmi.',
	'token': 5468,
	'token_str': ' dětmi'}]
	```
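The pipeline returns a list of dictionaries, and each `token_str` carries a leading space. As a minimal sketch of post-processing that output, the snippet below keeps only predictions above a score threshold and strips the whitespace. The helper name `top_tokens` and the threshold value are hypothetical choices for illustration, not part of the `transformers` API; the scores are copied from the example output above so the snippet runs without downloading the model.

```python
def top_tokens(predictions, min_score=0.05):
    """Return (token, score) pairs whose score is at least min_score.

    `predictions` is assumed to have the shape of the fill-mask output
    shown above: a list of dicts with 'score' and 'token_str' keys.
    """
    return [
        (p["token_str"].strip(), round(p["score"], 4))
        for p in predictions
        if p["score"] >= min_score
    ]


# Predictions copied from the example output above ('sequence' omitted for brevity).
predictions = [
    {"score": 0.13343162834644318, "token": 33582, "token_str": " textem"},
    {"score": 0.12583224475383759, "token": 32837, "token_str": " počítačem"},
    {"score": 0.0796666219830513, "token": 15876, "token_str": " obrázky"},
    {"score": 0.06347835063934326, "token": 5426, "token_str": " lidmi"},
    {"score": 0.050984010100364685, "token": 5468, "token_str": " dětmi"},
]

# Keep only candidates with a score of at least 0.07 (an arbitrary cut-off).
candidates = top_tokens(predictions, min_score=0.07)
print(candidates)
```

With a live pipeline, the same helper could be applied directly to the return value of `unmasker(...)`.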

	Here is how to use this model to get the features of a given text in PyTorch:

	```python
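	from transformers import RobertaTokenizer, RobertaModel

	# Load the tokenizer and the encoder; the pooling layer is skipped because only token-level features are needed.
	tokenizer = RobertaTokenizer.from_pretrained('fav-kky/FERNET-C5-RoBERTa')
	model = RobertaModel.from_pretrained('fav-kky/FERNET-C5-RoBERTa', add_pooling_layer=False)

	text = "Libovolný text."  # "Arbitrary text."
	encoded_input = tokenizer(text, return_tensors='pt')
	output = model(**encoded_input)
	# output.last_hidden_state holds one feature vector per input token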

	## Training data

	The model was pretrained on a mix of three text sources:
	- Czech web pages extracted from the Common Crawl project (93 GB),
	- a self-crawled Czech news dataset (20 GB),
	- the Czech part of Wikipedia (1 GB).

	The model was pretrained for 500k steps (over 15 epochs over the full dataset) with a peak learning rate of 4e-4.
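The figures above allow a quick back-of-the-envelope check of the corpus composition and training schedule. The sketch below uses only the numbers stated in this section; reading "15 epochs" as an approximate count is an assumption, so the steps-per-epoch value is a rough estimate rather than a reported figure.

```python
# Corpus sizes in GB, as listed in the training data section.
sources_gb = {"Common Crawl": 93, "news": 20, "Wikipedia": 1}
total_gb = sum(sources_gb.values())

# Fraction of the corpus contributed by each source.
shares = {name: size / total_gb for name, size in sources_gb.items()}

total_steps = 500_000
approx_epochs = 15  # assumption: treated as an approximate epoch count

# Rough number of optimizer steps needed for one pass over the corpus.
steps_per_epoch = total_steps / approx_epochs

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"total: {total_gb} GB, roughly {steps_per_epoch:,.0f} steps per epoch")
```

The web-crawled data dominates the mix at roughly four fifths of the corpus, while Wikipedia contributes under one percent.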

	## Paper
	The published version of our paper is available at https://link.springer.com/chapter/10.1007/978-3-030-89579-2_3.

	The preprint of our paper is available at https://arxiv.org/abs/2107.10042.

	## Citation
	If you find this model useful, please cite our related paper:
	```bibtex
	@inproceedings{FERNETC5,
	title = {Comparison of Czech Transformers on Text Classification Tasks},
	author = {Lehe{\v{c}}ka, Jan and {\v{S}}vec, Jan},
	year = 2021,
	booktitle = {Statistical Language and Speech Processing},
	publisher = {Springer International Publishing},
	address = {Cham},
	pages = {27--37},
	doi = {10.1007/978-3-030-89579-2_3},
	isbn = {978-3-030-89579-2},
	editor = {Espinosa-Anke, Luis and Mart{\'i}n-Vide, Carlos and Spasi{\'{c}}, Irena}
	}
	```