---
language: pl
tags:
- herbert
license: cc-by-4.0
---

# HerBERT
**[HerBERT](https://en.wikipedia.org/wiki/Zbigniew_Herbert)** is a BERT-based Language Model trained on Polish corpora using Masked Language Modelling (MLM) and Sentence Structural Objective (SSO) with dynamic masking of whole words. For more details, please refer to: [HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish](https://www.aclweb.org/anthology/2021.bsnlp-1.1/).
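
Dynamic whole-word masking means the masked positions are re-sampled during training, and every subword of a chosen word is masked together. The toy sketch below only illustrates the idea; it is not the project's training code, and ``whole_word_mask`` is a hypothetical helper:

```python
import random

def whole_word_mask(words, tokenize, mask_token="<mask>", prob=0.15):
    """Toy whole-word masking: mask all subwords of a word, or none of them."""
    masked = []
    for word in words:
        subwords = tokenize(word)
        if random.random() < prob:
            # The whole word is masked, one mask token per subword.
            masked.extend([mask_token] * len(subwords))
        else:
            masked.extend(subwords)
    return masked

# Trivial demo with an identity "tokenizer" (one subword per word);
# re-running it yields a different masking, which is the "dynamic" part.
print(whole_word_mask("Ala ma kota".split(), lambda w: [w], prob=0.5))
```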

Model training and experiments were conducted with [transformers](https://github.com/huggingface/transformers) in version 2.9.

## Corpus
HerBERT was trained on six different corpora available for the Polish language:

| Corpus | Tokens | Documents |
| :------ | ------: | ------: |
| [CCNet Middle](https://github.com/facebookresearch/cc_net) | 3243M | 7.9M |
| [CCNet Head](https://github.com/facebookresearch/cc_net) | 2641M | 7.0M |
| [National Corpus of Polish](http://nkjp.pl/index.php?page=14&lang=1) | 1357M | 3.9M |
| [Open Subtitles](http://opus.nlpl.eu/OpenSubtitles-v2018.php) | 1056M | 1.1M |
| [Wikipedia](https://dumps.wikimedia.org/) | 260M | 1.4M |
| [Wolne Lektury](https://wolnelektury.pl/) | 41M | 5.5k |

## Tokenizer
The training dataset was tokenized into subwords using a character-level byte-pair encoding (``CharBPETokenizer``) with a vocabulary size of 50k tokens. The tokenizer itself was trained with the [tokenizers](https://github.com/huggingface/tokenizers) library.

We kindly encourage you to use the ``Fast`` version of the tokenizer, namely ``HerbertTokenizerFast``, for example:
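
A minimal sketch of loading the fast tokenizer directly (this assumes a transformers release that ships ``HerbertTokenizerFast``; the sample sentence is arbitrary):

```python
from transformers import HerbertTokenizerFast

# Load the Rust-backed fast tokenizer published with this checkpoint.
tokenizer = HerbertTokenizerFast.from_pretrained("allegro/herbert-base-cased")

# Inspect the character-level BPE segmentation of a Polish sentence.
print(tokenizer.tokenize("Jaskółka leci nad polem."))
```
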
## Usage
Example code:
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
model = AutoModel.from_pretrained("allegro/herbert-base-cased")

# Encode a single sentence pair (each tuple is one pair) and run it
# through the encoder in one call.
output = model(
    **tokenizer.batch_encode_plus(
        [
            (
                "A potem szedł środkiem drogi w kurzawie, bo zamiatał nogami, ślepy dziad prowadzony przez tłustego kundla na sznurku.",
                "A potem leciał od lasu chłopak z butelką, ale ten ujrzawszy księdza przy drodze okrążył go z dala i biegł na przełaj pól do karczmy."
            )
        ],
        padding='longest',
        add_special_tokens=True,
        return_tensors='pt'
    )
)
```
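
The first element of ``output`` holds the final hidden states. A minimal sketch of reading them out, assuming a recent transformers version with named model outputs (on older releases such as the 2.9 used for training, index the returned tuple instead):

```python
# Final hidden states, shaped (batch_size, sequence_length, hidden_size).
last_hidden = output.last_hidden_state  # use output[0] on older transformers

# One simple sentence representation: mean-pool over the token dimension.
sentence_embedding = last_hidden.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768]) for the base model
```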

## License
CC BY 4.0

## Citation
If you use this model, please cite the following paper:
```
@inproceedings{mroczkowski-etal-2021-herbert,
    title = "{H}er{BERT}: Efficiently Pretrained Transformer-based Language Model for {P}olish",
    author = "Mroczkowski, Robert and
      Rybak, Piotr and
      Wr{\'o}blewska, Alina and
      Gawlik, Ireneusz",
    booktitle = "Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing",
    month = apr,
    year = "2021",
    address = "Kyiv, Ukraine",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.bsnlp-1.1",
    pages = "1--10",
}
```

## Authors
The model was trained by the [**Machine Learning Research Team at Allegro**](https://ml.allegro.tech/) and the [**Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences**](http://zil.ipipan.waw.pl/).

You can contact us at: <a href="mailto:klejbenchmark@allegro.pl">klejbenchmark@allegro.pl</a>