---
license: apache-2.0
---

# NB-ROBERTA Training Code

This is the current training code for the planned nb-roberta models.

We are currently planning to run the following experiments:
| Name | nb-roberta-base-old (C) |
| --- | --- |
| Corpus | NbAiLab/nb_bert |
| Pod size | v4-64 |
| Batch size | 62 × 4 × 8 = 1984 ≈ 2k |
| Learning rate | 3e-4 (the RoBERTa paper uses 6e-4 with a batch size of 8k) |
| Number of steps | 250k |

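The batch sizes above are written as a product of three factors. As a minimal sketch, assuming the factors are per-device batch size × devices per host × hosts (a v4-64 slice exposes 32 JAX devices, 8 hosts with 4 devices each), the numbers can be reproduced like this:

```python
# Sketch only: our assumed reading of the batch-size factors in the tables,
# not code taken from the training scripts.
per_device_batch = 62    # 32 for the large models
devices_per_host = 4     # JAX devices per v4 host
hosts = 8                # hosts in a v4-64 slice

global_batch = per_device_batch * devices_per_host * hosts
print(global_batch)      # 1984 (~2k); with per_device_batch = 32 this gives 1024 (~1k)
```
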
| Name | nb-roberta-base-ext (B) |
| --- | --- |
| Corpus | NbAiLab/nbailab_extended |
| Pod size | v4-64 |
| Batch size | 62 × 4 × 8 = 1984 ≈ 2k |
| Learning rate | 3e-4 (the RoBERTa paper uses 6e-4 with a batch size of 8k) |
| Number of steps | 250k |

| Name | nb-roberta-large-ext |
| --- | --- |
| Corpus | NbAiLab/nbailab_extended |
| Pod size | v4-64 |
| Batch size | 32 × 4 × 8 = 1024 ≈ 1k |
| Learning rate | 2e-4 (the RoBERTa paper uses 4e-4 with a batch size of 8k) |
| Number of steps | 500k |

| Name | nb-roberta-base-scandi |
| --- | --- |
| Corpus | NbAiLab/scandinavian |
| Pod size | v4-64 |
| Batch size | 62 × 4 × 8 = 1984 ≈ 2k |
| Learning rate | 3e-4 (the RoBERTa paper uses 6e-4 with a batch size of 8k) |
| Number of steps | 250k |

| Name | nb-roberta-large-scandi |
| --- | --- |
| Corpus | NbAiLab/scandinavian |
| Pod size | v4-64 |
| Batch size | 32 × 4 × 8 = 1024 ≈ 1k |
| Learning rate | 2e-4 (the RoBERTa paper uses 4e-4 with a batch size of 8k) |
| Number of steps | 500k |

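All experiments run on a v4-64 pod slice. When adapting these settings, a quick sanity check of the device count and the resulting per-device batch size in a JAX setup could look like the sketch below (illustrative only, not part of the training scripts):

```python
import jax

# A v4-64 slice should report 32 JAX devices in total (across all hosts).
num_devices = jax.device_count()

global_batch = 1984   # base models above; 1024 for the large models
per_device_batch, remainder = divmod(global_batch, num_devices)
assert remainder == 0, "global batch size must divide evenly across devices"
print(num_devices, per_device_batch)   # expected: 32 62
```
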
## Calculations

Some basic calculations we used when estimating the number of training steps:

* The Scandinavian Corpus is 85GB
* The Scandinavian Corpus contains 13B words
* With a conversion factor of 2.3 tokens per word, this corresponds to roughly 30B tokens
* 30B tokens / (512 sequence length × 3000 batch size) ≈ 20,000 steps

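The same estimate written out, using the figures from the list above:

```python
# Rough number of optimizer steps needed for one pass over the
# Scandinavian Corpus, using the figures from the list above.
words = 13e9
tokens_per_word = 2.3                      # assumed word-to-token conversion factor
tokens = words * tokens_per_word           # ~30B tokens

seq_length = 512
batch_size = 3000
tokens_per_step = seq_length * batch_size  # 1,536,000 tokens per step

steps = tokens / tokens_per_step
print(round(steps))                        # ~19,500, i.e. roughly 20,000 steps
```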