---
datasets:
- oscar
language:
- he
- ar
---
# HeArBERT
A bilingual BERT for Arabic and Hebrew, pretrained on the respective parts of the OSCAR corpus.

To process Arabic text with this model, it must first be transliterated into Hebrew script. The code for doing so is available in the [preprocessing file](./preprocessing.py) and can be used as follows:
```python
from transformers import AutoTokenizer

from preprocessing import transliterate_arabic_to_hebrew

tokenizer = AutoTokenizer.from_pretrained("aviadrom/HeArBERT")

# Transliterate the Arabic input into Hebrew script before tokenizing.
text_ar = "مرحبا"
text_he = transliterate_arabic_to_hebrew(text_ar)
tokenizer(text_he)
```
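For intuition, the transliteration is a character-level mapping between the two Semitic scripts. The sketch below is a hypothetical, illustrative subset of such a mapping, not the actual table used in `preprocessing.py`; unmapped characters (spaces, punctuation) pass through unchanged:

```python
# Hypothetical, partial Arabic-to-Hebrew character table for illustration only.
# The real mapping shipped in preprocessing.py is more complete.
ARABIC_TO_HEBREW = {
    "ا": "א",  # alif  -> alef
    "ب": "ב",  # ba    -> bet
    "ح": "ח",  # ha    -> het
    "م": "מ",  # mim   -> mem
    "ر": "ר",  # ra    -> resh
}

def transliterate_arabic_to_hebrew_sketch(text: str) -> str:
    # Map each Arabic character to its Hebrew counterpart,
    # leaving characters without an entry unchanged.
    return "".join(ARABIC_TO_HEBREW.get(ch, ch) for ch in text)

print(transliterate_arabic_to_hebrew_sketch("مرحبا"))
```

Because both scripts descend from the same abjad, most letters have a natural one-to-one counterpart, which is what makes a shared Hebrew-script character space feasible.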
# Citation
If you find our work useful in your research, please consider citing:
```
@article{rom2024training,
  title={Training a Bilingual Language Model by Mapping Tokens onto a Shared Character Space},
  author={Rom, Aviad and Bar, Kfir},
  journal={arXiv preprint arXiv:2402.16065},
  year={2024}
}
```