---
language: eo
license: mit
---

# EsperBERTo: A RoBERTa-like model for Esperanto

This is a RoBERTa-like model trained from scratch on the Esperanto language. 

## Model description

The model has 6 layers, a hidden size of 768, 12 attention heads, and roughly 84 million parameters in total. It is based on the RoBERTa architecture. The tokenizer is a byte-level Byte-Pair Encoding (BPE) tokenizer trained from scratch on the same Esperanto corpus.

- **Model:** RoBERTa-like
- **Layers:** 6
- **Hidden size:** 768
- **Heads:** 12
- **Parameters:** 84M
- **Tokenizer:** Byte-level BPE
- **Vocabulary size:** 52,000
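
For illustration, this configuration could be expressed with `RobertaConfig` roughly as follows. This is a sketch: `max_position_embeddings` and `type_vocab_size` are assumptions typical of RoBERTa-from-scratch setups, not values stated above.

```python
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,            # byte-level BPE vocabulary
    num_hidden_layers=6,
    hidden_size=768,
    num_attention_heads=12,
    max_position_embeddings=514,  # assumption: RoBERTa's usual 512 + 2 offset tokens
    type_vocab_size=1,            # assumption: RoBERTa uses a single token type
)
```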

## Training data

The model was trained on the Esperanto portion of the OSCAR corpus (`oscar.eo.txt`), which is approximately 3 GB in size.
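
A minimal sketch of how the byte-level BPE tokenizer can be trained on this file with the `tokenizers` library is shown below; the `min_frequency` value and the special-token list are assumptions based on common RoBERTa setups.

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer from scratch on the Esperanto corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["oscar.eo.txt"],
    vocab_size=52_000,
    min_frequency=2,  # assumption
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Write vocab.json and merges.txt so the tokenizer can be reloaded later.
tokenizer.save_model("./EsperBERTo")
```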

## Training procedure

The model was trained for one epoch on the OSCAR corpus using the `Trainer` API from the `transformers` library. The training was performed on a single GPU.

### Hyperparameters
- `output_dir`: "./EsperBERTo"
- `overwrite_output_dir`: `True`
- `num_train_epochs`: 1
- `per_gpu_train_batch_size`: 64
- `save_steps`: 10_000
- `save_total_limit`: 2
- `prediction_loss_only`: `True`

The final training loss was `6.1178`.
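
For reference, a run with these hyperparameters could be set up roughly as follows. This is a sketch against the older `transformers` interface that the hyperparameter names suggest (`per_gpu_train_batch_size`, `prediction_loss_only` as a `TrainingArguments` field, `LineByLineTextDataset`); `block_size` and `mlm_probability` are assumptions.

```python
from transformers import (
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Model configuration as in the sketch above.
config = RobertaConfig(
    vocab_size=52_000, num_hidden_layers=6, hidden_size=768,
    num_attention_heads=12, max_position_embeddings=514, type_vocab_size=1,
)
tokenizer = RobertaTokenizerFast.from_pretrained("./EsperBERTo", max_len=512)
model = RobertaForMaskedLM(config=config)

# One training example per line of the corpus; block_size is an assumption.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path="oscar.eo.txt", block_size=128,
)
# Dynamic masking for the MLM objective; 15% is the standard RoBERTa rate.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15,
)

training_args = TrainingArguments(
    output_dir="./EsperBERTo",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_gpu_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("./EsperBERTo")
```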

## Evaluation results

The model was not evaluated on a downstream task in the training notebook, but its masked-language-modeling behavior can be probed with the `fill-mask` pipeline.

Example 1:
```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./EsperBERTo",
    tokenizer="./EsperBERTo"
)

fill_mask("La suno <mask>.")
```
Output:
```
[{'score': 0.013023526407778263, 'token': 316, 'token_str': ' estas', 'sequence': 'La suno estas.'},
 {'score': 0.008523152209818363, 'token': 607, 'token_str': ' min', 'sequence': 'La suno min.'},
 {'score': 0.007405377924442291, 'token': 2575, 'token_str': ' okuloj', 'sequence': 'La suno okuloj.'},
 {'score': 0.007219308987259865, 'token': 1635, 'token_str': ' tago', 'sequence': 'La suno tago.'},
 {'score': 0.006888304837048054, 'token': 394, 'token_str': ' estis', 'sequence': 'La suno estis.'}]
```

Example 2:
```python
fill_mask("Jen la komenco de bela <mask>.")
```
Output:
```
[{'score': 0.016247423365712166, 'token': 1635, 'token_str': ' tago', 'sequence': 'Jen la komenco de bela tago.'},
 {'score': 0.009718689136207104, 'token': 1021, 'token_str': ' tempo', 'sequence': 'Jen la komenco de bela tempo.'},
 {'score': 0.007543196901679039, 'token': 2257, 'token_str': ' kongreso', 'sequence': 'Jen la komenco de bela kongreso.'},
 {'score': 0.0071307034231722355, 'token': 1161, 'token_str': ' vivo', 'sequence': 'Jen la komenco de bela vivo.'},
 {'score': 0.006644904613494873, 'token': 758, 'token_str': ' jaroj', 'sequence': 'Jen la komenco de bela jaroj.'}]
```

## Intended uses & limitations

This model is intended to be a general-purpose language model for Esperanto. It can be used for masked language modeling and can be fine-tuned for various downstream tasks such as:
- Text Classification
- Token Classification (Part-of-Speech Tagging, Named Entity Recognition)
- Question Answering

Since the model was trained for only a single epoch on a relatively small corpus, its performance is likely limited. For better results on a specific task, fine-tuning on a relevant dataset is recommended, as in the sketch below.
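
As an illustration, loading the checkpoint for one of these tasks (here, token classification) could look like the following sketch; `num_labels=17` is a hypothetical label count, e.g. for the Universal POS tag set.

```python
from transformers import RobertaForTokenClassification, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./EsperBERTo")
# The classification head is freshly initialized and must be fine-tuned.
model = RobertaForTokenClassification.from_pretrained(
    "./EsperBERTo", num_labels=17,  # hypothetical label count
)
```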