Fill-Mask
Transformers
Safetensors
xlm-roberta
holocaust
speech
historical

XLM-RoBERTa-malach

XLM-RoBERTa-large with continued pretraining on speech transcripts of the Visual History Archive.
Part 1 of the training data consists of ASR transcripts produced by domain-specific Wav2Vec 2.0 and general-domain Zipformer models deployed at UWebASR.
Part 2 is machine-translated from Part 1 using MADLAD-400-3B-MT.
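A minimal usage sketch with the Transformers fill-mask pipeline (the example sentence is illustrative, not taken from the training data):

```python
from transformers import pipeline

# Load the model for masked-token prediction; XLM-RoBERTa uses <mask> as its mask token.
fill_mask = pipeline("fill-mask", model="ufal/xlm-roberta-malach")

# Illustrative input; any of the covered languages can be used.
print(fill_mask("In 1944 the family was deported to a <mask> camp."))
```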

Training Data

ASR data: cs, de, en, hu, nl, pl
MT data: cs, da, de, en, hu, nl, pl

Total tokens: 4.9B
Training tokens: 4.4B
Test tokens: 490M

The same documents are used in all 7 languages, but their proportions in terms of token counts may differ. A random 10% split is used as the test dataset, preserving the language proportions of the training data. The test set has been masked with 15% probability.
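A minimal sketch of how such a statically masked test set can be produced, assuming the standard 15% MLM masking scheme via DataCollatorForLanguageModeling (the exact masking procedure used here is not specified):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

# Standard MLM masking: 15% of tokens are selected; of those, 80% become <mask>,
# 10% are replaced by random tokens, and 10% are left unchanged.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# Applying the collator once to the held-out examples and saving the result
# yields a fixed (static) masked test set.
example = tokenizer("An illustrative test sentence.", return_special_tokens_mask=True)
batch = collator([example])
print(batch["input_ids"])   # ids with <mask> inserted
print(batch["labels"])      # original ids at masked positions, -100 elsewhere
```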

The data preprocessing (reading, tokenization, concatenation, splitting, and masking of the test dataset) takes around 2.5 hours per language using 8 CPUs.
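A rough sketch of the tokenize-and-concatenate step, following the usual Datasets pattern (file names and the block size are illustrative assumptions):

```python
from itertools import chain
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
block_size = 512  # assumption: the model's usual maximum sequence length

# Illustrative: load plain-text transcripts for one language.
dataset = load_dataset("text", data_files={"train": "transcripts_cs.txt"})["train"]

def tokenize(examples):
    return tokenizer(examples["text"])

def group_texts(examples):
    # Concatenate all token sequences and split them into fixed-length blocks.
    concatenated = {k: list(chain(*examples[k])) for k in examples}
    total = (len(concatenated["input_ids"]) // block_size) * block_size
    return {k: [v[i:i + block_size] for i in range(0, total, block_size)]
            for k, v in concatenated.items()}

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"], num_proc=8)
blocks = tokenized.map(group_texts, batched=True, num_proc=8)
split = blocks.train_test_split(test_size=0.1, seed=42)  # 10% held out per language
```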

Training Details

Parameters are mostly replicated from [1] Appendix B:
AdamW with eps=1e-6, beta1=0.9, beta2=0.98, weight decay=0.01, and learning rate=1e-5 with a linear schedule and linear warmup over the first 6% of training steps. Trained with dynamic masking on 4 L40 GPUs with per-device batch size 8 and 64 gradient accumulation steps, for an effective batch size of 2048, for 1 epoch (34k steps) on an MLM objective.
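Expressed as Transformers TrainingArguments, this configuration would look roughly as follows (a sketch; the output path is a placeholder):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlm-roberta-malach",   # placeholder path
    optim="adamw_torch",
    adam_epsilon=1e-6,
    adam_beta1=0.9,
    adam_beta2=0.98,
    weight_decay=0.01,
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.06,                 # linear warmup over the first 6% of steps
    per_device_train_batch_size=8,     # 4 GPUs x 8 x 64 accumulation = effective batch size 2048
    gradient_accumulation_steps=64,
    num_train_epochs=1,
)
```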

Main differences from XLM-RoBERTa-large:
AdamW instead of Adam, effective batch size 2048 instead of 8192, and 34k steps instead of 500k due to the smaller dataset. The learning rate is also smaller, since larger ones led to overfitting. This broadly aligns with [2] and [3], which also continue pretraining on small datasets.

Training takes around 24 hours; the wall-clock time can be reduced significantly with more GPUs.

Evaluation

Since the model sees translations of the evaluation samples during training, an additional domain-specific dataset has been prepared for unbiased evaluation. For this dataset, sentences have been extracted from the EHRI-NER dataset, which is based on the EHRI Online Editions in 9 languages, not including Danish [4]. It is split into two evaluation datasets, EHRI-6 (714k tokens) and EHRI-9 (877k tokens), the latter including 3 languages unseen during continued pretraining (French, Slovak, Yiddish).

Perplexity (VHA): 2.5257 -> 1.9064
Perplexity (EHRI-6): 3.1897 -> 2.9683
Perplexity (EHRI-9): 3.1806 -> 3.0340

The arrows show the improvement from the XLM-RoBERTa-large checkpoint to XLM-RoBERTa-malach. The 490M test set is split from the dataset used to train this model and has a greater proportion of machine translations than the 42M test set.

Perplexity per language in the EHRI data, number of tokens given in parentheses:

| Model | cs (195k) | de (356k) | en (81k) | fr (3.5k) | hu (45k) | nl (2.5k) | pl (34k) | sk (6k) | yi (151k) |
|---|---|---|---|---|---|---|---|---|---|
| XLM-RoBERTa-large | 3.1553 | 3.4038 | 3.0588 | 2.0579 | 2.8928 | 2.9133 | 2.5284 | 2.6245 | 4.0217 |
| XLM-RoBERTa-malach | 2.8023 | 3.1704 | 2.9022 | 2.0254 | 2.8285 | 2.8797 | 2.4003 | 2.5914 | 4.0910 |
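Perplexity here corresponds to the exponential of the mean masked-LM cross-entropy on the pre-masked test set; a minimal sketch of computing it (the evaluation dataset is a placeholder, assumed to provide fixed-length examples):

```python
import math
from transformers import AutoModelForMaskedLM, Trainer, TrainingArguments

def mlm_perplexity(masked_dataset, model_name="ufal/xlm-roberta-malach"):
    """Perplexity = exp(mean masked-LM cross-entropy) over a pre-masked dataset.

    `masked_dataset` is a placeholder: it is assumed to contain fixed-length
    `input_ids` with <mask> tokens and `labels` holding the original ids
    (-100 at unmasked positions).
    """
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    trainer = Trainer(model=model, args=TrainingArguments(output_dir="eval_tmp"))
    metrics = trainer.evaluate(eval_dataset=masked_dataset)
    return math.exp(metrics["eval_loss"])
```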

References

[1] RoBERTa: A Robustly Optimized BERT Pretraining Approach
[2] Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks
[3] The ParlaSent Multilingual Training Dataset for Sentiment Identification in Parliamentary Proceedings
[4] Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools
