---
language: pl
tags:
- herbert
license: cc-by-4.0
---

# HerBERT
**[HerBERT](https://en.wikipedia.org/wiki/Zbigniew_Herbert)** is a BERT-based Language Model trained on Polish corpora using Masked Language Modelling (MLM) and Sentence Structural Objective (SSO) with dynamic masking of whole words. For more details, please refer to: [HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish](https://www.aclweb.org/anthology/2021.bsnlp-1.1/).
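
Dynamic whole-word masking means the masked positions are re-sampled during training, and every subword of a chosen word is masked together. The toy sketch below only illustrates the idea; it is not the project's training code, and ``whole_word_mask`` is a hypothetical helper:

```python
import random

def whole_word_mask(words, tokenize, mask_token="<mask>", prob=0.15):
    """Toy whole-word masking: mask all subwords of a word, or none of them."""
    masked = []
    for word in words:
        subwords = tokenize(word)
        if random.random() < prob:
            # The whole word is masked, one mask token per subword.
            masked.extend([mask_token] * len(subwords))
        else:
            masked.extend(subwords)
    return masked

# Trivial demo with an identity "tokenizer" (one subword per word);
# re-running it yields a different masking, which is the "dynamic" part.
print(whole_word_mask("Ala ma kota".split(), lambda w: [w], prob=0.5))
```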

Model training and experiments were conducted with [transformers](https://github.com/huggingface/transformers) in version 2.9.

## Corpus
HerBERT was trained on six different corpora available for the Polish language:

| Corpus | Tokens | Documents |
| :------ | ------: | ------: |
| [CCNet Middle](https://github.com/facebookresearch/cc_net) | 3243M | 7.9M |
| [CCNet Head](https://github.com/facebookresearch/cc_net) | 2641M | 7.0M |
| [National Corpus of Polish](http://nkjp.pl/index.php?page=14&lang=1) | 1357M | 3.9M |
| [Open Subtitles](http://opus.nlpl.eu/OpenSubtitles-v2018.php) | 1056M | 1.1M |
| [Wikipedia](https://dumps.wikimedia.org/) | 260M | 1.4M |
| [Wolne Lektury](https://wolnelektury.pl/) | 41M | 5.5k |

## Tokenizer
The training dataset was tokenized into subwords using a character-level byte-pair encoding (``CharBPETokenizer``) with a vocabulary size of 50k tokens. The tokenizer itself was trained with the [tokenizers](https://github.com/huggingface/tokenizers) library.

We kindly encourage you to use the ``Fast`` version of the tokenizer, namely ``HerbertTokenizerFast``, for example:
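
A minimal sketch of loading the fast tokenizer directly (this assumes a transformers release that ships ``HerbertTokenizerFast``; the sample sentence is arbitrary):

```python
from transformers import HerbertTokenizerFast

# Load the Rust-backed fast tokenizer published with this checkpoint.
tokenizer = HerbertTokenizerFast.from_pretrained("allegro/herbert-base-cased")

# Inspect the character-level BPE segmentation of a Polish sentence.
print(tokenizer.tokenize("Jaskółka leci nad polem."))
```
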
## Usage
Example code:
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
model = AutoModel.from_pretrained("allegro/herbert-base-cased")

# Encode a single sentence pair (each tuple is one pair) and run it
# through the encoder in one call.
output = model(
    **tokenizer.batch_encode_plus(
        [
            (
                "A potem szedł środkiem drogi w kurzawie, bo zamiatał nogami, ślepy dziad prowadzony przez tłustego kundla na sznurku.",
                "A potem leciał od lasu chłopak z butelką, ale ten ujrzawszy księdza przy drodze okrążył go z dala i biegł na przełaj pól do karczmy."
            )
        ],
        padding='longest',
        add_special_tokens=True,
        return_tensors='pt'
    )
)
```
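
The first element of ``output`` holds the final hidden states. A minimal sketch of reading them out, assuming a recent transformers version with named model outputs (on older releases such as the 2.9 used for training, index the returned tuple instead):

```python
# Final hidden states, shaped (batch_size, sequence_length, hidden_size).
last_hidden = output.last_hidden_state  # use output[0] on older transformers

# One simple sentence representation: mean-pool over the token dimension.
sentence_embedding = last_hidden.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768]) for the base model
```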

## License
CC BY 4.0

## Citation
If you use this model, please cite the following paper:
```
@inproceedings{mroczkowski-etal-2021-herbert,
    title = "{H}er{BERT}: Efficiently Pretrained Transformer-based Language Model for {P}olish",
    author = "Mroczkowski, Robert and
      Rybak, Piotr and
      Wr{\'o}blewska, Alina and
      Gawlik, Ireneusz",
    booktitle = "Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing",
    month = apr,
    year = "2021",
    address = "Kyiv, Ukraine",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.bsnlp-1.1",
    pages = "1--10",
}
```

## Authors
The model was trained by the [**Machine Learning Research Team at Allegro**](https://ml.allegro.tech/) and the [**Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences**](http://zil.ipipan.waw.pl/).

You can contact us at: <a href="mailto:klejbenchmark@allegro.pl">klejbenchmark@allegro.pl</a>