Fill-Mask
Transformers
Safetensors
xlm-roberta
holocaust
speech
historical

XLM-RoBERTa-malach

XLM-RoBERTa-large with continued pretraining on speech transcripts of the Visual History Archive.
Part 1 of the training data consists of ASR transcripts produced by domain-specific Wav2Vec 2.0 and general-domain Zipformer models deployed at UWebASR.
Part 2 is machine-translated from Part 1 using MADLAD-400-3B-MT.
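A minimal usage sketch with the Transformers fill-mask pipeline (the example sentence is illustrative, not taken from the training data):

```python
from transformers import pipeline

# Load the model for masked-token prediction; XLM-RoBERTa uses <mask> as its mask token.
fill_mask = pipeline("fill-mask", model="ufal/xlm-roberta-malach")

# Illustrative input; any of the covered languages can be used.
print(fill_mask("In 1944 the family was deported to a <mask> camp."))
```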

Training Data

ASR data: cs, de, en, hu, nl, pl
MT data: cs, da, de, en, hu, nl, pl

Total tokens: 4.9B
Training tokens: 4.4B
Test tokens: 490M

The same documents are used in all 7 languages, but their proportions in terms of token counts may differ. A random 10% split is used as the test dataset, preserving the language proportions of the training data. The test set has been masked with 15% probability.
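A minimal sketch of how such a statically masked test set can be produced, assuming the standard 15% MLM masking scheme via DataCollatorForLanguageModeling (the exact masking procedure used here is not specified):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

# Standard MLM masking: 15% of tokens are selected; of those, 80% become <mask>,
# 10% are replaced by random tokens, and 10% are left unchanged.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# Applying the collator once to the held-out examples and saving the result
# yields a fixed (static) masked test set.
example = tokenizer("An illustrative test sentence.", return_special_tokens_mask=True)
batch = collator([example])
print(batch["input_ids"])   # ids with <mask> inserted
print(batch["labels"])      # original ids at masked positions, -100 elsewhere
```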

The data preprocessing (reading, tokenization, concatenation, splitting, and masking of the test dataset) takes around 2.5 hours per language using 8 CPUs.
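A rough sketch of the tokenize-and-concatenate step, following the usual Datasets pattern (file names and the block size are illustrative assumptions):

```python
from itertools import chain
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
block_size = 512  # assumption: the model's usual maximum sequence length

# Illustrative: load plain-text transcripts for one language.
dataset = load_dataset("text", data_files={"train": "transcripts_cs.txt"})["train"]

def tokenize(examples):
    return tokenizer(examples["text"])

def group_texts(examples):
    # Concatenate all token sequences and split them into fixed-length blocks.
    concatenated = {k: list(chain(*examples[k])) for k in examples}
    total = (len(concatenated["input_ids"]) // block_size) * block_size
    return {k: [v[i:i + block_size] for i in range(0, total, block_size)]
            for k, v in concatenated.items()}

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"], num_proc=8)
blocks = tokenized.map(group_texts, batched=True, num_proc=8)
split = blocks.train_test_split(test_size=0.1, seed=42)  # 10% held out per language
```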

Training Details

Parameters are mostly replicated from [1] Appendix B:
AdamW with eps=1e-6, beta1=0.9, beta2=0.98, weight decay=0.01, and learning rate=1e-5 with a linear schedule and linear warmup over the first 6% of training steps. Trained with dynamic masking on 4 L40 GPUs with per-device batch size 8 and 64 gradient accumulation steps, for an effective batch size of 2048, for 1 epoch (34k steps) on an MLM objective.
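Expressed as Transformers TrainingArguments, this configuration would look roughly as follows (a sketch; the output path is a placeholder):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlm-roberta-malach",   # placeholder path
    optim="adamw_torch",
    adam_epsilon=1e-6,
    adam_beta1=0.9,
    adam_beta2=0.98,
    weight_decay=0.01,
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.06,                 # linear warmup over the first 6% of steps
    per_device_train_batch_size=8,     # 4 GPUs x 8 x 64 accumulation = effective batch size 2048
    gradient_accumulation_steps=64,
    num_train_epochs=1,
)
```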

Main differences from XLM-RoBERTa-large:
AdamW instead of Adam, effective batch size 2048 instead of 8192, and 34k steps instead of 500k due to the smaller dataset. The learning rate is also smaller, since larger ones led to overfitting. This broadly aligns with [2] and [3], which also continue pretraining on small datasets.

Training takes around 24 hours; the wall-clock time can be reduced significantly with more GPUs.

Evaluation

Since the model sees translations of the evaluation samples during training, an additional domain-specific dataset has been prepared for unbiased evaluation. For this dataset, sentences have been extracted from the EHRI-NER dataset, which is based on the EHRI Online Editions in 9 languages, not including Danish [4]. It is split into two evaluation datasets, EHRI-6 (714k tokens) and EHRI-9 (877k tokens), the latter including 3 languages unseen during continued pretraining (French, Slovak, Yiddish).

Perplexity (VHA): 2.5257 -> 1.9064
Perplexity (EHRI-6): 3.1897 -> 2.9683
Perplexity (EHRI-9): 3.1806 -> 3.0340

The arrows show the improvement from the XLM-RoBERTa-large checkpoint to XLM-RoBERTa-malach. The 490M test set is split from the dataset used to train this model and has a greater proportion of machine translations than the 42M test set.

Perplexity per language in the EHRI data, number of tokens given in parentheses:

| Model | cs (195k) | de (356k) | en (81k) | fr (3.5k) | hu (45k) | nl (2.5k) | pl (34k) | sk (6k) | yi (151k) |
|---|---|---|---|---|---|---|---|---|---|
| XLM-RoBERTa-large | 3.1553 | 3.4038 | 3.0588 | 2.0579 | 2.8928 | 2.9133 | 2.5284 | 2.6245 | 4.0217 |
| XLM-RoBERTa-malach | 2.8023 | 3.1704 | 2.9022 | 2.0254 | 2.8285 | 2.8797 | 2.4003 | 2.5914 | 4.0910 |
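Perplexity here corresponds to the exponential of the mean masked-LM cross-entropy on the pre-masked test set; a minimal sketch of computing it (the evaluation dataset is a placeholder, assumed to provide fixed-length examples):

```python
import math
from transformers import AutoModelForMaskedLM, Trainer, TrainingArguments

def mlm_perplexity(masked_dataset, model_name="ufal/xlm-roberta-malach"):
    """Perplexity = exp(mean masked-LM cross-entropy) over a pre-masked dataset.

    `masked_dataset` is a placeholder: it is assumed to contain fixed-length
    `input_ids` with <mask> tokens and `labels` holding the original ids
    (-100 at unmasked positions).
    """
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    trainer = Trainer(model=model, args=TrainingArguments(output_dir="eval_tmp"))
    metrics = trainer.evaluate(eval_dataset=masked_dataset)
    return math.exp(metrics["eval_loss"])
```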

References

[1] RoBERTa: A Robustly Optimized BERT Pretraining Approach
[2] Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks
[3] The ParlaSent Multilingual Training Dataset for Sentiment Identification in Parliamentary Proceedings
[4] Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools
