Commit
·
590a87e
1
Parent(s):
c655777
Update README.md
README.md
CHANGED
|
@@ -46,9 +46,9 @@ All models are available in the `HuggingFace` model page under the [aubmindlab](
|
|
| 46 |
|
| 47 |
We identified an issue with AraBERTv1's wordpiece vocabulary. The issue came from punctuations and numbers that were still attached to words when learned the wordpiece vocab. We now insert a space between numbers and characters and around punctuation characters.
|
| 48 |
|
| 49 |
-
The new vocabulary was
|
| 50 |
|
| 51 |
-
**P.S.**: All the old BERT codes should work with the new BERT, just change the model name and check the new preprocessing
|
| 52 |
**Please read the section on how to use the [preprocessing function](#Preprocessing)**
|
| 53 |
|
| 54 |
## Bigger Dataset and More Compute
|
|
@@ -86,7 +86,7 @@ It is recommended to apply our preprocessing function before training/testing on
|
|
| 86 |
```python
|
| 87 |
from arabert.preprocess import ArabertPreprocessor
|
| 88 |
|
| 89 |
-
model_name="bert-
|
| 90 |
arabert_prep = ArabertPreprocessor(model_name=model_name)
|
| 91 |
|
| 92 |
text = "ولن نبالغ إذا قلنا: إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
|
|
|
|
| 46 |
|
| 47 |
We identified an issue with AraBERTv1's wordpiece vocabulary. The issue came from punctuations and numbers that were still attached to words when learned the wordpiece vocab. We now insert a space between numbers and characters and around punctuation characters.
|
| 48 |
|
| 49 |
+
The new vocabulary was learned using the `BertWordpieceTokenizer` from the `tokenizers` library, and should now support the Fast tokenizer implementation from the `transformers` library.
|
| 50 |
|
| 51 |
+
**P.S.**: All the old BERT codes should work with the new BERT, just change the model name and check the new preprocessing function
|
| 52 |
**Please read the section on how to use the [preprocessing function](#Preprocessing)**
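
The vocabulary fix described above (inserting a space between numbers and characters and around punctuation) can be illustrated with a minimal sketch. This is not the `ArabertPreprocessor` implementation: the function name `separate_digits_and_punctuation` is hypothetical, and the punctuation set (ASCII punctuation plus a few Arabic marks) is an assumption chosen for the example.

```python
import re
import string

# Assumption: ASCII punctuation plus common Arabic punctuation marks.
# The real preprocessing function may use a different character set.
_PUNCT = re.escape(string.punctuation + "،؛؟")


def separate_digits_and_punctuation(text: str) -> str:
    """Hypothetical helper illustrating the spacing rule; not the library's code."""
    # Surround every punctuation character with spaces.
    text = re.sub(f"([{_PUNCT}])", r" \1 ", text)
    # Insert a space where a digit is directly followed by a letter...
    text = re.sub(r"(\d)([^\W\d_])", r"\1 \2", text)
    # ...and where a letter is directly followed by a digit.
    text = re.sub(r"([^\W\d_])(\d)", r"\1 \2", text)
    # Collapse the extra whitespace introduced by the substitutions.
    return re.sub(r"\s+", " ", text).strip()


print(separate_digits_and_punctuation("abc123def"))  # abc 123 def
print(separate_digits_and_punctuation("قلنا: إن"))    # قلنا : إن
```

With digits and punctuation split off before the wordpiece vocabulary is learned, tokens such as `123` or `:` no longer get fused into neighbouring words. For actual model input, use the preprocessing function referenced above rather than this sketch.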
|
| 53 |
|
| 54 |
## Bigger Dataset and More Compute
|
|
|
|
| 86 |
```python
|
| 87 |
from arabert.preprocess import ArabertPreprocessor
|
| 88 |
|
| 89 |
+
model_name="aubmindlab/bert-large-arabertv02"
|
| 90 |
arabert_prep = ArabertPreprocessor(model_name=model_name)
|
| 91 |
|
| 92 |
text = "ولن نبالغ إذا قلنا: إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
|