---
language: eo
license: mit
---
|
|
|
|
|
# EsperBERTo: A RoBERTa-like model for Esperanto |
|
|
|
|
|
This is a RoBERTa-like masked language model trained from scratch on Esperanto text.
|
|
|
|
|
## Model description |
|
|
|
|
|
The model has 6 layers, a hidden size of 768, 12 attention heads, and a total of 84 million parameters. It is based on the RoBERTa architecture. The tokenizer is a byte-level Byte-Pair Encoding (BPE) tokenizer trained from scratch on the same Esperanto corpus (a training sketch follows the list below).
|
|
|
|
|
- **Model:** RoBERTa-like |
|
|
- **Layers:** 6 |
|
|
- **Hidden size:** 768 |
|
|
- **Heads:** 12 |
|
|
- **Parameters:** 84M |
|
|
- **Tokenizer:** Byte-level BPE |
|
|
- **Vocabulary size:** 52,000 |
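
A tokenizer with these properties can be trained with the `tokenizers` library. The following is a minimal sketch, not the exact notebook code: the corpus file name comes from the Training data section, while `min_frequency` and the special-token list are assumptions based on the usual RoBERTa setup.

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer from scratch on the Esperanto corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["./oscar.eo.txt"],
    vocab_size=52_000,
    min_frequency=2,  # assumption: typical cutoff for rare merges
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt into the model directory.
tokenizer.save_model("./EsperBERTo")
```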
|
|
|
|
|
## Training data |
|
|
|
|
|
The model was trained on the Esperanto portion of the OSCAR corpus (`oscar.eo.txt`), which is approximately 3GB in size. |
|
|
|
|
|
## Training procedure |
|
|
|
|
|
The model was trained for one epoch on the OSCAR corpus on a single GPU, using the `Trainer` API from the `transformers` library.
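
The notebook's exact data pipeline is not reproduced here, but a typical setup for this kind of run looks like the sketch below. The corpus file name comes from the Training data section; `block_size` and the 15% masking rate are assumptions matching RoBERTa defaults.

```python
from transformers import (
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    RobertaTokenizerFast,
)

# Load the byte-level BPE tokenizer from the model directory.
tokenizer = RobertaTokenizerFast.from_pretrained("./EsperBERTo", model_max_length=512)

# LineByLineTextDataset treats each line of the corpus as one example.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./oscar.eo.txt",
    block_size=128,  # assumption: a common choice for this setup
)

# Dynamic masking for the MLM objective; 15% is the standard RoBERTa rate.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```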
|
|
|
|
|
### Hyperparameters |
|
|
- `output_dir`: "./EsperBERTo" |
|
|
- `overwrite_output_dir`: `True` |
|
|
- `num_train_epochs`: 1 |
|
|
- `per_gpu_train_batch_size`: 64 (the now-deprecated name for `per_device_train_batch_size`)
|
|
- `save_steps`: 10_000 |
|
|
- `save_total_limit`: 2 |
|
|
- `prediction_loss_only`: `True` |
|
|
|
|
|
The final training loss was `6.1178`. |
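
Putting the pieces together, the run roughly corresponds to the following sketch, reusing `dataset` and `data_collator` from the data-preparation sketch above. The `RobertaConfig` values come from the Model description; `max_position_embeddings` is an assumption.

```python
from transformers import RobertaConfig, RobertaForMaskedLM, Trainer, TrainingArguments

# Architecture from the Model description section.
config = RobertaConfig(
    vocab_size=52_000,
    num_hidden_layers=6,
    hidden_size=768,
    num_attention_heads=12,
    max_position_embeddings=514,  # assumption: RoBERTa's usual value
)
model = RobertaForMaskedLM(config=config)

# Hyperparameters from the list above; per_device_train_batch_size is the
# current name for the notebook's per_gpu_train_batch_size.
training_args = TrainingArguments(
    output_dir="./EsperBERTo",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,  # from the data-preparation sketch
    train_dataset=dataset,
)
trainer.train()
```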
|
|
|
|
|
## Evaluation results |
|
|
|
|
|
The model was not evaluated on a downstream task in the notebook, but its masked-language-modeling capabilities can be probed with the `fill-mask` pipeline.
|
|
|
|
|
Example 1: |
|
|
```python |
|
|
from transformers import pipeline |
|
|
|
|
|
fill_mask = pipeline( |
|
|
"fill-mask", |
|
|
model="./EsperBERTo", |
|
|
tokenizer="./EsperBERTo" |
|
|
) |
|
|
|
|
|
fill_mask("La suno <mask>.") |
|
|
``` |
|
|
Output: |
|
|
``` |
|
|
[{'score': 0.013023526407778263, 'token': 316, 'token_str': ' estas', 'sequence': 'La suno estas.'}, |
|
|
{'score': 0.008523152209818363, 'token': 607, 'token_str': ' min', 'sequence': 'La suno min.'}, |
|
|
{'score': 0.007405377924442291, 'token': 2575, 'token_str': ' okuloj', 'sequence': 'La suno okuloj.'}, |
|
|
{'score': 0.007219308987259865, 'token': 1635, 'token_str': ' tago', 'sequence': 'La suno tago.'}, |
|
|
{'score': 0.006888304837048054, 'token': 394, 'token_str': ' estis', 'sequence': 'La suno estis.'}] |
|
|
``` |
|
|
|
|
|
Example 2: |
|
|
```python |
|
|
fill_mask("Jen la komenco de bela <mask>.") |
|
|
``` |
|
|
Output: |
|
|
``` |
|
|
[{'score': 0.016247423365712166, 'token': 1635, 'token_str': ' tago', 'sequence': 'Jen la komenco de bela tago.'}, |
|
|
{'score': 0.009718689136207104, 'token': 1021, 'token_str': ' tempo', 'sequence': 'Jen la komenco de bela tempo.'}, |
|
|
{'score': 0.007543196901679039, 'token': 2257, 'token_str': ' kongreso', 'sequence': 'Jen la komenco de bela kongreso.'}, |
|
|
{'score': 0.0071307034231722355, 'token': 1161, 'token_str': ' vivo', 'sequence': 'Jen la komenco de bela vivo.'}, |
|
|
{'score': 0.006644904613494873, 'token': 758, 'token_str': ' jaroj', 'sequence': 'Jen la komenco de bela jaroj.'}] |
|
|
``` |
|
|
|
|
|
## Intended uses & limitations |
|
|
|
|
|
This model is intended to be a general-purpose language model for Esperanto. It can be used for masked language modeling and can be fine-tuned for various downstream tasks such as: |
|
|
- Text Classification |
|
|
- Token Classification (Part-of-Speech Tagging, Named Entity Recognition) |
|
|
- Question Answering |
|
|
|
|
|
Since the model was trained for only one epoch on a relatively small corpus (~3GB of Esperanto text), its performance may be limited. For better results on a specific task, fine-tuning on a relevant labeled dataset is recommended; a minimal loading sketch follows.
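
As an illustration, the checkpoint can be loaded as the encoder for a token-classification task. This is a hedged sketch: `num_labels` depends entirely on your dataset, and the fine-tuning loop itself is omitted.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Load the pretrained EsperBERTo checkpoint and its tokenizer.
tokenizer = AutoTokenizer.from_pretrained("./EsperBERTo")

# Attach a fresh token-classification head; num_labels is task-specific
# (e.g. the number of POS tags or NER labels in your dataset).
model = AutoModelForTokenClassification.from_pretrained(
    "./EsperBERTo", num_labels=10  # assumption: placeholder label count
)
```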