---
language:
- he
tags:
- language model
license: apache-2.0
datasets:
- oscar
- wikipedia
- twitter
---

# AlephBERT

## Hebrew Language Model

State-of-the-art language model for Hebrew, based on Google's BERT architecture [(Devlin et al. 2018)](https://arxiv.org/abs/1810.04805).

#### How to use

```python
from transformers import BertModel, BertTokenizerFast

alephbert_tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')
alephbert = BertModel.from_pretrained('onlplab/alephbert-base')

# if not fine-tuning, disable dropout
alephbert.eval()
```
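
The encoder returns one hidden vector per token (`last_hidden_state`, shape `(batch, seq_len, 768)`); a common way to reduce these to a single sentence vector is mean pooling over non-padding tokens. That pooling step is our illustration, not part of this model card, so the sketch below runs the arithmetic on dummy values without downloading the model:

```python
# Mean-pooling of per-token hidden states into one sentence vector.
# The dummy lists stand in for the model's last_hidden_state tensor.

def mean_pool(hidden_states, attention_mask):
    """Average the vectors of tokens whose attention-mask value is 1."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / count for t in total]

# Three "tokens", the last one padding; hidden size of 2 for readability.
states = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(states, mask))  # [2.0, 3.0]
```

With the real model, the same pooling would be applied to `alephbert(**encoded).last_hidden_state` using the tokenizer's attention mask.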

## Training data
1. OSCAR [(Ortiz, 2019)](https://oscar-corpus.com/) Hebrew section (10 GB of text, 20 million sentences).
2. Hebrew dump of [Wikipedia](https://dumps.wikimedia.org/hewiki/latest/) (650 MB of text, 3 million sentences).
3. Hebrew tweets collected from the Twitter sample stream (7 GB of text, 70 million sentences).

## Training procedure

Trained on a DGX machine (8 V100 GPUs) using the standard Hugging Face training procedure.

Since the larger part of our training data is based on tweets, we decided to start by optimizing using the Masked Language Model loss only.

To optimize training time, we split the data into 4 sections based on the maximum number of tokens per sentence:

1. num tokens < 32 (70M sentences)
2. 32 <= num tokens < 64 (12M sentences)
3. 64 <= num tokens < 128 (10M sentences)
4. 128 <= num tokens < 512 (1.5M sentences)
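
The split above amounts to a simple bucketing pass; a minimal sketch (the boundaries come from the list above, the helper name is ours):

```python
# Assign each sentence to a length section so batches contain
# similarly sized sequences. Boundaries follow the four sections above.
BUCKET_UPPER_BOUNDS = [32, 64, 128, 512]  # exclusive upper bounds

def bucket_index(num_tokens):
    """Return 0-3 for the section a sentence falls into, or None if too long."""
    for i, bound in enumerate(BUCKET_UPPER_BOUNDS):
        if num_tokens < bound:
            return i
    return None  # 512 tokens or more: not covered by the four sections

print([bucket_index(n) for n in (10, 32, 100, 511, 600)])  # [0, 1, 2, 3, None]
```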

Each section was first trained for 5 epochs with an initial learning rate of 1e-4, and then for another 5 epochs with an initial learning rate of 1e-5, for a total of 10 epochs.

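The two-stage schedule can be sketched as follows (the epoch numbering and function name are ours):

```python
# Initial learning rate per epoch: epochs 0-4 at 1e-4, epochs 5-9 at 1e-5.
def initial_learning_rate(epoch):
    if not 0 <= epoch < 10:
        raise ValueError("the schedule covers 10 epochs")
    return 1e-4 if epoch < 5 else 1e-5

print([initial_learning_rate(e) for e in range(10)])
```
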
Total training time was 8 days.