---
license: apache-2.0
language:
- en
- yo
- ha
- ig
- pcm
---

# naija-bert-large

NaijaBERT was created by pre-training a [BERT model with token dropping](https://aclanthology.org/2022.acl-long.262/) on texts in five Nigerian languages (English, Hausa, Igbo, Naija, and Yoruba) for about 100K steps.
It was trained using the BERT-base architecture with the [Tensorflow Model Garden](https://github.com/tensorflow/models/tree/master/official/projects).

### Pre-training corpus

A mix of WURA, Wikipedia, and MT560 data.

#### How to use

You can use this model with the Transformers *pipeline* for masked token prediction.

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='Davlan/naija-bert-large')
>>> unmasker("Ọjọ kẹsan-an, [MASK] Kẹjọ ni wọn ri oku Baba")
```
```
[{'score': 0.9981744289398193, 'token': 3785, 'token_str': 'osu', 'sequence': 'ojo kesan - an, osu kejo ni won ri oku baba'},
 {'score': 0.0015279919607564807, 'token': 3355, 'token_str': 'ojo', 'sequence': 'ojo kesan - an, ojo kejo ni won ri oku baba'},
 {'score': 0.0001734074903652072, 'token': 11780, 'token_str': 'osun', 'sequence': 'ojo kesan - an, osun kejo ni won ri oku baba'},
 {'score': 9.066923666978255e-05, 'token': 21579, 'token_str': 'oṣu', 'sequence': 'ojo kesan - an, oṣu kejo ni won ri oku baba'},
 {'score': 1.816015355871059e-05, 'token': 3387, 'token_str': 'odun', 'sequence': 'ojo kesan - an, odun kejo ni won ri oku baba'}]
```
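If you need more control than the *pipeline* helper offers, the model can also be loaded directly with the Transformers Auto classes. The sketch below is illustrative, not part of the original card; it assumes `transformers` and `torch` are installed, and simply takes the top-scoring token at the `[MASK]` position:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('Davlan/naija-bert-large')
model = AutoModelForMaskedLM.from_pretrained('Davlan/naija-bert-large')

text = "Ọjọ kẹsan-an, [MASK] Kẹjọ ni wọn ri oku Baba"
inputs = tokenizer(text, return_tensors='pt')

with torch.no_grad():
    logits = model(**inputs).logits

# locate the [MASK] position and take the highest-scoring token id there
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```

This mirrors what the `fill-mask` pipeline does internally for the top prediction, without the score normalization over the full candidate list.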

### Acknowledgment

We thank [@stefan-it](https://github.com/stefan-it) for providing the pre-processing and pre-training scripts. Finally, we would like to thank Google Cloud for giving us access to a TPU v3-8 through free cloud credits. The model was trained using Flax before being converted to PyTorch.

### BibTeX entry and citation info

```bibtex
@misc{david_adelani_2025,
  author    = { David Adelani },
  title     = { naija-bert-large (Revision 1f7243d) },
  year      = 2025,
  url       = { https://huggingface.co/Davlan/naija-bert-large },
  doi       = { 10.57967/hf/5863 },
  publisher = { Hugging Face }
}
```