+---
+datasets:
+- oscar
+language:
+- he
+- ar
+---
+# HeArBERT
+A bilingual BERT for Arabic and Hebrew, pretrained on the respective parts of the OSCAR corpus.
+To process Arabic text with this model, first transliterate it into Hebrew script. The transliteration code is available in the [preprocessing](./preprocessing.py) file and can be used as follows:
+```python
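+from transformers import AutoTokenizer
+# preprocessing.py ships with this repository and must be importable locally
+from preprocessing import transliterate_arabic_to_hebrew
+
+tokenizer = AutoTokenizer.from_pretrained("aviadrom/HeArBERT")
+
+# Arabic input must be transliterated to Hebrew script before tokenization
+text_ar = "مرحبا"
+text_he = transliterate_arabic_to_hebrew(text_ar)
+tokenizer(text_he)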
+```
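
To give a rough idea of what transliterating Arabic into Hebrew script involves, here is a minimal sketch that maps each Arabic letter to its cognate Hebrew letter. The `ARABIC_TO_HEBREW` table and the `toy_transliterate` function are hypothetical names introduced for illustration only. They are not the mapping implemented in `preprocessing.py`, which may handle more letters, diacritics, and final letter forms differently, so use `transliterate_arabic_to_hebrew` for real inputs to the model.

```python
# Illustrative only: a hypothetical letter-by-letter mapping, NOT the one
# implemented in preprocessing.py. It covers a subset of Arabic letters
# and ignores diacritics and Hebrew final letter forms.
ARABIC_TO_HEBREW = {
    "ا": "א", "ب": "ב", "ج": "ג", "د": "ד", "ه": "ה", "و": "ו",
    "ز": "ז", "ح": "ח", "ط": "ט", "ي": "י", "ك": "כ", "ل": "ל",
    "م": "מ", "ن": "נ", "س": "ס", "ع": "ע", "ف": "פ", "ص": "צ",
    "ق": "ק", "ر": "ר", "ش": "ש", "ت": "ת",
}


def toy_transliterate(text: str) -> str:
    """Replace each mapped Arabic letter; leave all other characters unchanged."""
    return "".join(ARABIC_TO_HEBREW.get(ch, ch) for ch in text)


print(toy_transliterate("مرحبا"))
```

Characters without an entry in the table, such as Latin letters, digits, and punctuation, pass through unchanged, so the sketch can be applied to mixed-script text.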