---
license: apache-2.0
---
# NB-ROBERTA Training Code
This repository contains the training code for the planned nb-roberta models.
We currently plan to run the following experiments; a short sketch of the batch-size and learning-rate arithmetic follows the table:
| Name | Corpus | Pod size | Batch size | Learning rate | Number of steps |
|---|---|---|---|---|---|
| nb-roberta-base-old (C) | NbAiLab/nb_bert | v4-64 | 62 × 4 × 8 = 1984 ≈ 2k | 3e-4 (the RoBERTa paper uses 6e-4 at bs = 8k) | 250k |
| nb-roberta-base-ext (B) | NbAiLab/nbailab_extended | v4-64 | 62 × 4 × 8 = 1984 ≈ 2k | 3e-4 (the RoBERTa paper uses 6e-4 at bs = 8k) | 250k |
| nb-roberta-large-ext | NbAiLab/nbailab_extended | v4-64 | 32 × 4 × 8 = 1024 ≈ 1k | 2e-4 (the RoBERTa paper uses 4e-4 at bs = 8k) | 500k |
| nb-roberta-base-scandi | NbAiLab/scandinavian | v4-64 | 62 × 4 × 8 = 1984 ≈ 2k | 3e-4 (the RoBERTa paper uses 6e-4 at bs = 8k) | 250k |
| nb-roberta-large-scandi | NbAiLab/scandinavian | v4-64 | 32 × 4 × 8 = 1024 ≈ 1k | 2e-4 (the RoBERTa paper uses 4e-4 at bs = 8k) | 500k |
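The batch sizes above are effective (global) batch sizes. Below is a minimal sketch of the arithmetic, assuming the three factors decompose as per-device batch size × gradient-accumulation steps × device count (our reading; the table only gives the product), together with a note on how the learning rates relate to the RoBERTa paper's reference values:

```python
# A minimal sketch, assuming 62 x 4 x 8 decomposes as
# per-device batch x gradient-accumulation steps x devices.
# The factor names are our assumption; the table only gives the product.

def effective_batch(per_device: int, grad_accum: int, devices: int) -> int:
    """Global batch size per optimizer update."""
    return per_device * grad_accum * devices

base_bs = effective_batch(62, 4, 8)   # 1984 (~2k, base models)
large_bs = effective_batch(32, 4, 8)  # 1024 (~1k, large models)

# Reference points from the RoBERTa paper: 6e-4 at bs=8k (base),
# 4e-4 at bs=8k (large). The table halves those learning rates while
# the batch size shrinks 4-8x, i.e. a more conservative reduction
# than strict linear batch-size/LR scaling would give.
print(base_bs, large_bs)  # 1984 1024
```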
## Calculations
Some basic calculations we used when estimating the number of training steps:
* The Scandinavian corpus is 85GB
* The Scandinavian corpus contains 13B words
* With a words-to-tokens conversion factor of 2.3, this corresponds to roughly 30B tokens
* 30B tokens / (512 sequence length × 3000 batch size) ≈ 20,000 steps per pass over the corpus (see the sketch below)
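
For reference, here is the same estimate as a runnable snippet; the constants are taken directly from the bullets above, and only the variable names are ours:

```python
# A minimal sketch of the step estimate above; all constants come
# from the bullet list, the variable names are ours.

words = 13e9            # words in the Scandinavian corpus
words_to_tokens = 2.3   # assumed words-to-tokens conversion factor
tokens = words * words_to_tokens  # ~3.0e10, i.e. ~30B tokens

seq_length = 512
batch_size = 3000       # approximate effective batch size used in the estimate
tokens_per_step = seq_length * batch_size  # 1,536,000 tokens per step

steps_per_pass = tokens / tokens_per_step  # ~19,500, rounded to ~20k above
print(f"{steps_per_pass:,.0f} steps for one pass over ~{tokens / 1e9:.0f}B tokens")
```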