Does a RoBERTa trained on one English variety generalise to the others?
Ryan fine-tuned roberta-base separately on each English variety in the BESSTIE dataset
(en-AU, en-IN, en-UK) and evaluated every checkpoint across all three
test sets, producing full 3×3 cross-variety evaluation matrices.
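The train-once, test-everywhere protocol above can be sketched as a small loop. This is a minimal illustration, not the code used for these experiments: `train_fn` and `eval_fn` are hypothetical callables standing in for the actual roberta-base fine-tuning and scoring steps, which the write-up does not show.

```python
from typing import Callable, Dict, Sequence

VARIETIES = ("en-AU", "en-IN", "en-UK")


def cross_variety_matrix(
    train_fn: Callable[[str], object],
    eval_fn: Callable[[object, str], float],
    varieties: Sequence[str] = VARIETIES,
) -> Dict[str, Dict[str, float]]:
    """Train one model per variety, then score it on every test set.

    Rows are keyed by training variety, columns by test variety.
    `train_fn` and `eval_fn` are placeholders (assumptions), e.g. a
    fine-tuning routine and a Macro-F1 scorer.
    """
    matrix: Dict[str, Dict[str, float]] = {}
    for train_variety in varieties:
        # Hypothetical step: fine-tune a fresh model on this variety only.
        model = train_fn(train_variety)
        # Evaluate the same checkpoint on all test sets, including its own.
        matrix[train_variety] = {
            test_variety: eval_fn(model, test_variety)
            for test_variety in varieties
        }
    return matrix
```

The diagonal of the resulting matrix holds the in-variety scores and the six off-diagonal cells hold the cross-variety scores.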
A mixed model was also trained for Sentiment, to compare against the cross-variety models.
Macro-F1 / Macro-P / Macro-R are reported as mean ± std over 3 seeds. Rows are the variety the model was trained on; the test variety changes within each row.
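For reference, here is one way the macro-averaged metrics and the mean ± std aggregation could be computed. It is a dependency-free sketch under two assumptions that the text does not confirm: macro-F1 is taken as the unweighted mean of per-class F1, and the spread is the sample standard deviation. The function names are illustrative.

```python
from statistics import mean, stdev
from typing import Dict, List, Sequence, Tuple


def macro_prf(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Macro precision, recall and F1: unweighted means of per-class scores."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return {
        "macro_p": mean(precisions),
        "macro_r": mean(recalls),
        "macro_f1": mean(f1s),
    }


def summarise_over_seeds(
    per_seed: List[Dict[str, float]],
) -> Dict[str, Tuple[float, float]]:
    """Collapse one metrics dict per seed into (mean, std) per metric.

    Assumption: sample standard deviation (n - 1 in the denominator).
    """
    return {
        name: (mean(run[name] for run in per_seed),
               stdev(run[name] for run in per_seed))
        for name in per_seed[0]
    }
```

With three seeds, `summarise_over_seeds` receives a list of three dicts and returns one `(mean, std)` pair per metric, which is the form each table cell takes.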
Sentiment generalises surprisingly well across varieties — even the worst cross-variety pair
(en-AU → en-IN) still hits Macro-F1 ≈ 0.82. The mixed model matches or
slightly beats the per-variety champions on every test set, peaking at 0.9523 on en-UK.
Sarcasm is a different story: scores collapse the moment a model is tested on a variety it wasn't trained on.
en-AU → en-IN drops from 0.76 to 0.49, and the highest cross-variety score (en-UK
→ en-AU) is only 0.61. Sarcasm cues clearly travel less well between dialects than sentiment cues.
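Comparisons like these (the worst and best cross-variety pair, and the drop relative to the in-variety score) can be read off a matrix mechanically. The sketch below assumes the matrix is stored as a nested dict keyed by training variety and then test variety; that layout and the function names are illustrative, not taken from the original code.

```python
from typing import Dict, Tuple

Matrix = Dict[str, Dict[str, float]]
Pair = Tuple[str, str]  # (train variety, test variety)


def off_diagonal(matrix: Matrix) -> Dict[Pair, float]:
    """Keep only the cross-variety cells (train variety != test variety)."""
    return {
        (train, test): score
        for train, row in matrix.items()
        for test, score in row.items()
        if train != test
    }


def cross_variety_extremes(matrix: Matrix) -> Dict[str, Tuple[Pair, float]]:
    """Return the lowest- and highest-scoring cross-variety pairs."""
    cells = off_diagonal(matrix)
    worst = min(cells, key=cells.get)
    best = max(cells, key=cells.get)
    return {"worst": (worst, cells[worst]), "best": (best, cells[best])}


def generalisation_gaps(matrix: Matrix) -> Dict[Pair, float]:
    """In-variety score minus cross-variety score for each train -> test pair."""
    return {
        (train, test): matrix[train][train] - score
        for (train, test), score in off_diagonal(matrix).items()
    }
```

A large positive gap for a pair means the model loses most of its in-variety performance when moved to that test variety, which is the pattern described for Sarcasm.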