Fill-Mask
Transformers
PyTorch
TensorFlow
JAX
Arabic
bert
Arabic BERT
MSA
Twitter
Masked Langauge Model
Instructions to use UBC-NLP/MARBERT with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use UBC-NLP/MARBERT with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("fill-mask", model="UBC-NLP/MARBERT")# Load model directly from transformers import AutoTokenizer, AutoModelForMaskedLM tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERT") model = AutoModelForMaskedLM.from_pretrained("UBC-NLP/MARBERT") - Inference
- Notebooks
- Google Colab
- Kaggle
Update README.md
Browse files
README.md
CHANGED
|
@@ -8,8 +8,11 @@ tags:
|
|
| 8 |
- Masked Langauge Model
|
| 9 |
widget:
|
| 10 |
- text: "اللغة العربية هي لغة [MASK]."
|
|
|
|
| 11 |
---
|
|
|
|
| 12 |
<img src="https://raw.githubusercontent.com/UBC-NLP/marbert/main/ARBERT_MARBERT.jpg" alt="drawing" width="200" height="200" align="right"/>
|
|
|
|
| 13 |
**MARBERT** is one of three models described in our **ACL 2021 paper** **["ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic"](https://aclanthology.org/2021.acl-long.551.pdf)**. MARBERT is a large-scale pre-trained masked language model focused on both Dialectal Arabic (DA) and MSA. Arabic has multiple varieties. To train MARBERT, we randomly sample 1B Arabic tweets from a large in-house dataset of about 6B tweets. We only include tweets with at least 3 Arabic words, based on character string matching, regardless whether the tweet has non-Arabic string or not. That is, we do not remove non-Arabic so long as the tweet meets the 3 Arabic word criterion. The dataset makes up **128GB of text** (**15.6B tokens**). We use the same network architecture as ARBERT (BERT-base), but without the next sentence prediction (NSP) objective since tweets are short. See our [repo](https://github.com/UBC-NLP/LMBERT) for modifying BERT code to remove NSP. For more information about MARBERT, please visit our own GitHub [repo](https://github.com/UBC-NLP/marbert).
|
| 14 |
|
| 15 |
|
|
|
|
| 8 |
- Masked Langauge Model
|
| 9 |
widget:
|
| 10 |
- text: "اللغة العربية هي لغة [MASK]."
|
| 11 |
+
|
| 12 |
---
|
| 13 |
+
|
| 14 |
<img src="https://raw.githubusercontent.com/UBC-NLP/marbert/main/ARBERT_MARBERT.jpg" alt="drawing" width="200" height="200" align="right"/>
|
| 15 |
+
|
| 16 |
**MARBERT** is one of three models described in our **ACL 2021 paper** **["ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic"](https://aclanthology.org/2021.acl-long.551.pdf)**. MARBERT is a large-scale pre-trained masked language model focused on both Dialectal Arabic (DA) and MSA. Arabic has multiple varieties. To train MARBERT, we randomly sample 1B Arabic tweets from a large in-house dataset of about 6B tweets. We only include tweets with at least 3 Arabic words, based on character string matching, regardless whether the tweet has non-Arabic string or not. That is, we do not remove non-Arabic so long as the tweet meets the 3 Arabic word criterion. The dataset makes up **128GB of text** (**15.6B tokens**). We use the same network architecture as ARBERT (BERT-base), but without the next sentence prediction (NSP) objective since tweets are short. See our [repo](https://github.com/UBC-NLP/LMBERT) for modifying BERT code to remove NSP. For more information about MARBERT, please visit our own GitHub [repo](https://github.com/UBC-NLP/marbert).
|
| 17 |
|
| 18 |
|