---
license: apache-2.0
language:
- yo
---


# oyo-bert-base

OYO-BERT (Oyo-dialect of Yoruba BERT) was created by pre-training a [BERT model with token dropping](https://aclanthology.org/2022.acl-long.262/) on Yoruba language texts for about 100K steps.
It uses the BERT-base architecture and was trained with the [TensorFlow Model Garden](https://github.com/tensorflow/models/tree/master/official/projects).

### Pre-training corpus
A mix of WURA, Wikipedia, and MT560 Yoruba data.

#### How to use
You can use this model with the Transformers *fill-mask* pipeline for masked token prediction.
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='Davlan/oyo-bert-base')
>>> unmasker("Ọjọ kẹsan-an, [MASK] Kẹjọ ni wọn ri oku Baba")
```
```
[{'score': 0.9981744289398193, 'token': 3785, 'token_str': 'osu', 'sequence': 'ojo kesan - an, osu kejo ni won ri oku baba'},
 {'score': 0.0015279919607564807, 'token': 3355, 'token_str': 'ojo', 'sequence': 'ojo kesan - an, ojo kejo ni won ri oku baba'},
 {'score': 0.0001734074903652072, 'token': 11780, 'token_str': 'osun', 'sequence': 'ojo kesan - an, osun kejo ni won ri oku baba'},
 {'score': 9.066923666978255e-05, 'token': 21579, 'token_str': 'oṣu', 'sequence': 'ojo kesan - an, oṣu kejo ni won ri oku baba'},
 {'score': 1.816015355871059e-05, 'token': 3387, 'token_str': 'odun', 'sequence': 'ojo kesan - an, odun kejo ni won ri oku baba'}]
```
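If you prefer more control than the pipeline offers, you can load the tokenizer and model directly and inspect the mask predictions yourself. The sketch below uses the standard `AutoTokenizer`/`AutoModelForMaskedLM` classes from Transformers and assumes PyTorch is installed; the top-5 extraction logic is an illustrative reimplementation of what the pipeline does, not part of this model's official API.

```python
# Minimal sketch: masked-token prediction without the pipeline.
# Assumes `transformers` and `torch` are installed.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Davlan/oyo-bert-base")
model = AutoModelForMaskedLM.from_pretrained("Davlan/oyo-bert-base")
model.eval()

text = "Ọjọ kẹsan-an, [MASK] Kẹjọ ni wọn ri oku Baba"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the five highest-scoring tokens.
mask_positions = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top5 = logits[0, mask_positions[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top5.tolist()))
```

This is equivalent to the pipeline call above but exposes the raw logits, which is useful if you want scores over the full vocabulary rather than just the top predictions.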

### Acknowledgment
We thank [@stefan-it](https://github.com/stefan-it) for providing the pre-processing and pre-training scripts. Finally, we would like to thank Google Cloud for giving us access to a TPU v3-8 through free cloud credits. The model was trained using Flax before being converted to PyTorch.


### BibTeX entry and citation info
```
@misc{david_adelani_2023,
	author       = { David Adelani },
	title        = { oyo-bert-base (Revision f9d07fb) },
	year         = 2023,
	url          = { https://huggingface.co/Davlan/oyo-bert-base },
	doi          = { 10.57967/hf/5857 },
	publisher    = { Hugging Face }
}
```