{% extends "base.html" %} {% block title %}Task 2 — Cross-Variety Fine-Tuning{% endblock %} {% block content %}
02

Cross-Variety Fine-Tuning

Does a RoBERTa trained on one English variety generalise to the others?

What Ryan Did

Ryan fine-tuned roberta-base separately on each English variety in the BESSTIE dataset (en-AU, en-IN, en-UK) and evaluated every checkpoint across all three test sets, producing full 3×3 cross-variety evaluation matrices.
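The 3×3 protocol described above can be sketched as a pair of nested loops over training and test varieties. This is a minimal illustration, not Ryan's actual code: `train_fn` and `eval_fn` are hypothetical callables standing in for the fine-tuning and scoring steps, and the seed list is the one quoted on this page.

```python
from statistics import mean
from typing import Callable, Dict, Tuple

VARIETIES = ("en-AU", "en-IN", "en-UK")
SEEDS = (42, 7, 123)


def cross_variety_matrix(
    train_fn: Callable[[str, int], object],   # hypothetical: (variety, seed) -> model
    eval_fn: Callable[[object, str], float],  # hypothetical: (model, variety) -> Macro-F1
) -> Dict[Tuple[str, str], float]:
    """Return mean Macro-F1 for every (train variety, test variety) pair."""
    matrix: Dict[Tuple[str, str], float] = {}
    for train_variety in VARIETIES:
        scores = {test_variety: [] for test_variety in VARIETIES}
        for seed in SEEDS:
            # One checkpoint per (training variety, seed).
            model = train_fn(train_variety, seed)
            # The same checkpoint is scored on all three test sets.
            for test_variety in VARIETIES:
                scores[test_variety].append(eval_fn(model, test_variety))
        for test_variety, per_seed in scores.items():
            matrix[(train_variety, test_variety)] = mean(per_seed)
    return matrix
```

Passing the training and evaluation steps in as callables keeps the loop independent of any particular training library; the diagonal of the returned matrix holds the in-variety scores and the off-diagonal cells hold the cross-variety ones.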

  • Core task: Sentiment Analysis — three per-variety RoBERTa models evaluated cross-variety.
  • Bonus 1: the same 3×3 cross-variety experiment repeated for Sarcasm Detection.
  • Bonus 2: an additional RoBERTa fine-tuned on the three varieties combined (mixed) for Sentiment, to compare against the cross-variety models.
  • Every score is averaged over 3 random seeds [42, 7, 123] for statistical reliability.
  • Training recipe: max-len 128, batch 16, up to 20 epochs with early stopping on Val Macro-F1, AdamW LR 2e-5, warmup ratio 0.1.
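Two parts of the training recipe, the warmup schedule and early stopping on validation Macro-F1, can be sketched in plain Python. The learning rate, warmup ratio, and epoch cap come from the recipe above; the linear decay after warmup and the patience value are assumptions, since the page does not state them.

```python
from dataclasses import dataclass

BASE_LR = 2e-5       # from the recipe
WARMUP_RATIO = 0.1   # from the recipe
MAX_EPOCHS = 20      # from the recipe


def learning_rate(step: int, total_steps: int) -> float:
    """Linear warmup to BASE_LR, then linear decay to zero (decay shape is assumed)."""
    warmup_steps = max(1, int(WARMUP_RATIO * total_steps))
    if step < warmup_steps:
        return BASE_LR * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    return BASE_LR * max(0.0, (total_steps - step) / decay_steps)


@dataclass
class EarlyStopping:
    patience: int = 3  # assumed; the recipe does not give a patience value
    best_score: float = float("-inf")
    best_epoch: int = -1
    bad_epochs: int = 0

    def update(self, epoch: int, val_macro_f1: float) -> bool:
        """Record one epoch's validation score; return True when training should stop."""
        if val_macro_f1 > self.best_score:
            self.best_score = val_macro_f1
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def best_checkpoint(val_scores, patience=3):
    """Replay a list of per-epoch validation scores and return (best epoch, best score)."""
    stopper = EarlyStopping(patience=patience)
    for epoch, score in enumerate(val_scores[:MAX_EPOCHS]):
        if stopper.update(epoch, score):
            break
    return stopper.best_epoch, stopper.best_score
```

With a patience of 2, for example, a run whose validation Macro-F1 stops improving after the second epoch would halt two epochs later and keep the second-epoch checkpoint, even if a later epoch would have scored higher.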

Cross-Variety Evaluation

Macro-F1 / Macro-P / Macro-R reported as mean ± std over 3 seeds. Rows are the variety the model was trained on; the test variety changes within each row.
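As an illustration of how the reported numbers can be derived, the sketch below computes macro-averaged precision, recall, and F1 from label lists and formats a set of per-seed scores as mean ± std. It is a standard-library sketch rather than the evaluation code behind these tables; in particular, using the sample standard deviation (`statistics.stdev`) is an assumption, as the page does not say which estimator was used.

```python
from statistics import mean, stdev


def macro_prf(y_true, y_pred):
    """Macro-averaged (precision, recall, F1): per-class scores averaged with equal weight."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return mean(precisions), mean(recalls), mean(f1s)


def mean_std(scores):
    """Format per-seed scores as 'mean ± std' (sample std; an assumption)."""
    return f"{mean(scores):.4f} ± {stdev(scores):.4f}"
```

Because every class contributes equally to the macro average, a model that ignores the minority class is penalised even when its accuracy looks high, which matters for an imbalanced task such as sarcasm detection.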

Sentiment Analysis

{% set rows = eval_tables.sentiment %} {% include "partials/cross_variety_table.html" %}

Sarcasm Detection

{% set rows = eval_tables.sarcasm %} {% include "partials/cross_variety_table.html" %}

Visualisations

Sentiment Analysis

Figure: Cross-variety sentiment macro-F1 heatmap (mean ± std over 3 seeds).
Figure: Sentiment confusion matrices, seed 7.
Figure: Sentiment confusion matrices, seed 42.
Figure: Sentiment confusion matrices, seed 123.

Sarcasm Detection

Figure: Cross-variety sarcasm macro-F1 heatmap (mean ± std over 3 seeds).
Figure: Sarcasm confusion matrices, seed 7.
Figure: Sarcasm confusion matrices, seed 42.
Figure: Sarcasm confusion matrices, seed 123.

Takeaway

Sentiment generalises well across varieties: even the worst cross-variety pair (en-AU → en-IN) still reaches Macro-F1 ≈ 0.82. The mixed model matches or slightly beats the best per-variety model on every test set, peaking at 0.9523 on en-UK.

Sarcasm is a different story: scores collapse as soon as a model is tested on a variety it was not trained on. en-AU → en-IN drops from 0.76 to 0.49, and the highest cross-variety score (en-UK → en-AU) is only 0.61. Sarcasm cues travel far less well between dialects than sentiment cues.

{% endblock %} {% block scripts %} {% endblock %}