Commit 85bafc6 · Parent(s): 73ec04c
Update README.md

README.md CHANGED
@@ -14,12 +14,12 @@ widget:
 
 <img src="https://raw.githubusercontent.com/aub-mind/arabert/master/arabert_logo.png" width="100" align="left"/>
 
-**AraBERT** is an Arabic pretrained
 
-There are two versions of the model, AraBERTv0.1 and AraBERTv1, with the difference being that AraBERTv1 uses pre-segmented text where prefixes and suffixes were
 
-We
 
 # AraBERTv2
@@ -46,9 +46,9 @@ All models are available in the `HuggingFace` model page under the [aubmindlab](
 
 We identified an issue with AraBERTv1's wordpiece vocabulary. The issue came from punctuation and numbers that were still attached to words when the wordpiece vocabulary was learned. We now insert a space between numbers and characters and around punctuation characters.
 
-The new vocabulary was
 
-**P.S.**: All the old BERT codes should work with the new BERT, just change the model name and check the new preprocessing
 **Please read the section on how to use the [preprocessing function](#Preprocessing)**
 
 ## Bigger Dataset and More Compute
@@ -125,7 +125,7 @@ Google Scholar has our Bibtex wrong (missing name), use this instead
 }
 ```
 # Acknowledgments
-Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs, couldn't have done it without this program, and to the [AUB MIND Lab](https://sites.aub.edu.lb/mindlab/) Members for the
 
 # Contacts
 **Wissam Antoun**: [Linkedin](https://www.linkedin.com/in/wissam-antoun-622142b4/) | [Twitter](https://twitter.com/wissam_antoun) | [Github](https://github.com/WissamAntoun) | <wfa07@mail.aub.edu> | <wissam.antoun@gmail.com>
 
 <img src="https://raw.githubusercontent.com/aub-mind/arabert/master/arabert_logo.png" width="100" align="left"/>
 
+**AraBERT** is an Arabic pretrained language model based on [Google's BERT architecture](https://github.com/google-research/bert). AraBERT uses the same BERT-Base config. More details are available in the [AraBERT Paper](https://arxiv.org/abs/2003.00104) and in the [AraBERT Meetup](https://github.com/WissamAntoun/pydata_khobar_meetup).
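For orientation, "the same BERT-Base config" corresponds roughly to the hyperparameters below. The architecture values are from the original BERT paper; the `vocab_size` shown is an assumption about AraBERT's wordpiece vocabulary, not a value stated here.

```python
# Sketch of the BERT-Base hyperparameters AraBERT shares.
# Architecture values follow the BERT paper; vocab_size is an
# assumed AraBERT-specific value, included only for illustration.
ARABERT_BASE_CONFIG = {
    "num_hidden_layers": 12,
    "hidden_size": 768,
    "num_attention_heads": 12,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "vocab_size": 64000,  # assumption: model-specific, unlike the rest
}
```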
 
+There are two versions of the model, AraBERTv0.1 and AraBERTv1, with the difference being that AraBERTv1 uses pre-segmented text where prefixes and suffixes were split using the [Farasa Segmenter](http://alt.qcri.org/farasa/segmenter.html).
 
+We evaluate AraBERT models on different downstream tasks and compare them to [mBERT](https://github.com/google-research/bert/blob/master/multilingual.md) and other state-of-the-art models (*to the extent of our knowledge*). The tasks were Sentiment Analysis on 6 different datasets ([HARD](https://github.com/elnagara/HARD-Arabic-Dataset), [ASTD-Balanced](https://www.aclweb.org/anthology/D15-1299), [ArsenTD-Lev](https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf), [LABR](https://github.com/mohamedadaly/LABR)), Named Entity Recognition with the [ANERcorp](http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp), and Arabic Question Answering on [Arabic-SQuAD and ARCD](https://github.com/husseinmozannar/SOQAL).
 
 # AraBERTv2
 
 We identified an issue with AraBERTv1's wordpiece vocabulary. The issue came from punctuation and numbers that were still attached to words when the wordpiece vocabulary was learned. We now insert a space between numbers and characters and around punctuation characters.
 
+The new vocabulary was learned using the `BertWordPieceTokenizer` from the `tokenizers` library, and should now support the Fast tokenizer implementation from the `transformers` library.
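Learning a wordpiece vocabulary with `BertWordPieceTokenizer` looks roughly like the sketch below. This is a minimal illustration, assuming the `tokenizers` package is installed; the tiny stand-in corpus and the hyperparameter values are placeholders, not the ones used for AraBERT.

```python
import tempfile
from tokenizers import BertWordPieceTokenizer

# Tiny stand-in corpus written to a temp file; the real vocabulary
# was learned on the full Arabic pretraining corpus.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("هذا نص تجريبي لتدريب المفردات .\n" * 100)
    corpus = f.name

# Arabic has no case and accents are meaningful, so disable both.
tokenizer = BertWordPieceTokenizer(lowercase=False, strip_accents=False)
tokenizer.train(
    files=[corpus],
    vocab_size=500,  # placeholder: AraBERT uses a much larger vocabulary
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# The learned vocabulary can now tokenize unseen text into wordpieces.
print(tokenizer.encode("هذا نص").tokens)
```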
 
+**P.S.**: All the old BERT code should work with the new BERT, just change the model name and check the new preprocessing function
 **Please read the section on how to use the [preprocessing function](#Preprocessing)**
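The cleanup described above, inserting a space between numbers and characters and around punctuation, can be sketched with plain regular expressions. This is a hypothetical illustration of the idea, not the library's actual preprocessing code.

```python
import re

def insert_spaces(text: str) -> str:
    """Illustrative sketch: separate digits from letters and pad
    punctuation with spaces (not the actual AraBERT preprocessing)."""
    # space between a letter and a following digit, and vice versa
    text = re.sub(r"(?<=[^\W\d_])(?=\d)", " ", text)
    text = re.sub(r"(?<=\d)(?=[^\W\d_])", " ", text)
    # space around punctuation characters
    text = re.sub(r"([^\w\s])", r" \1 ", text)
    # collapse any runs of whitespace introduced above
    return re.sub(r"\s+", " ", text).strip()

print(insert_spaces("AraBERTv2, released2020!"))
# → "AraBERTv 2 , released 2020 !"
```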
 
 ## Bigger Dataset and More Compute
 
 
 }
 ```
 # Acknowledgments
+Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs, we couldn't have done it without this program, and to the [AUB MIND Lab](https://sites.aub.edu.lb/mindlab/) members for the continuous support. Also thanks to [Yakshof](https://www.yakshof.com/#/) and Assafir for data and storage access. Another thanks to [Habib Rahal](https://www.behance.net/rahalhabib) for putting a face to AraBERT.
 
 # Contacts
 **Wissam Antoun**: [Linkedin](https://www.linkedin.com/in/wissam-antoun-622142b4/) | [Twitter](https://twitter.com/wissam_antoun) | [Github](https://github.com/WissamAntoun) | <wfa07@mail.aub.edu> | <wissam.antoun@gmail.com>