{% extends "base.html" %}
{% block title %}Task 2: Cross-Variety Fine-Tuning{% endblock %}

{% block content %}
<section class="task-page task1-page task2-page">
  <div class="task-header">
    <span class="task-number">02</span>
    <h1>Cross-Variety Fine-Tuning</h1>
    <p class="task-subtitle">Does a RoBERTa trained on one English variety generalise to the others?</p>
  </div>

  <div class="task-body">
    <section class="task-card">
      <h2><i class="fa-solid fa-book-open"></i> What Ryan Did</h2>
      <p>
        Ryan fine-tuned <code>roberta-base</code> separately on each English variety in the BESSTIE dataset
        (<code>en-AU</code>, <code>en-IN</code>, <code>en-UK</code>) and evaluated every checkpoint across all three
        test sets, producing full 3×3 cross-variety evaluation matrices.
      </p>
      <ul class="task-list">
        <li><strong>Core task:</strong> Sentiment Analysis, with three per-variety RoBERTa models evaluated cross-variety.</li>
        <li><strong>Bonus 1:</strong> the same 3×3 cross-variety experiment repeated for Sarcasm Detection.</li>
        <li><strong>Bonus 2:</strong> an additional RoBERTa fine-tuned on the three varieties combined (<code>mixed</code>) for Sentiment, to compare against the per-variety models.</li>
        <li>Every score is <strong>averaged over 3 random seeds</strong> (42, 7, 123) to reduce seed-to-seed variance.</li>
        <li>Training recipe: maximum sequence length 128, batch size 16, up to 20 epochs with early stopping on validation Macro-F1, AdamW with learning rate 2e-5, warmup ratio 0.1.</li>
      </ul>
    </section>

    <section class="chat-section task1-chat-section task2-chat-section">
      <div class="chat-container">
        <div class="controls-row">
          <div class="control-group">
            <label for="task2TaskSelect"><i class="fa-solid fa-list-check"></i> Task</label>
            <select id="task2TaskSelect">
              <option value="sentiment" selected>Sentiment Analysis</option>
              <option value="sarcasm">Sarcasm Detection</option>
            </select>
          </div>
        </div>

        <div class="model-desc" id="task2ModelDesc">
          Type once and every per-variety RoBERTa for the selected task will reply independently.
        </div>

        <div class="chat-log" id="task2ChatLog">
          <div class="msg bot">
            <div class="msg-avatar"><i class="fa-solid fa-robot"></i></div>
            <div class="msg-bubble">Pick a task, send a message, and compare what each variety-trained RoBERTa predicts.</div>
          </div>
        </div>

        <div class="chat-input-bar">
          <textarea id="task2UserInput" placeholder="Type your text here..." rows="1"></textarea>
          <button id="task2SendBtn" title="Run all Task 2 models"><i class="fa-solid fa-paper-plane"></i></button>
        </div>
      </div>
    </section>

    <section class="task-card">
      <h2><i class="fa-solid fa-table"></i> Cross-Variety Evaluation</h2>
      <p style="margin-bottom: 12px;">
        Macro-F1 / Macro-P / Macro-R reported as <em>mean ± std</em> over 3 seeds. Rows are the variety the model was
        trained on; the test variety changes within each row.
      </p>
      <div class="eval-grid">
        <div>
          <h3>Sentiment Analysis</h3>
          {% set rows = eval_tables.sentiment %}
          {% include "partials/cross_variety_table.html" %}
        </div>
        <div>
          <h3>Sarcasm Detection</h3>
          {% set rows = eval_tables.sarcasm %}
          {% include "partials/cross_variety_table.html" %}
        </div>
      </div>
    </section>

    <section class="task-card">
      <h2><i class="fa-solid fa-chart-column"></i> Visualisations</h2>

      <h3 class="figures-subhead">Sentiment Analysis</h3>
      <div class="task-figures">
        <figure>
          <img src="{{ url_for('static', filename='images/task2/Sentiment/heatmap_Sentiment_avg.png') }}" alt="Cross-variety sentiment macro-F1 heatmap (mean over 3 seeds)" />
          <figcaption>Cross-variety macro-F1 heatmap (mean ± std over 3 seeds).</figcaption>
        </figure>
        <figure>
          <img src="{{ url_for('static', filename='images/task2/Sentiment/confusion_matrices_Sentiment_seed7.png') }}" alt="Sentiment confusion matrices, seed 7" />
          <figcaption>Confusion matrices, seed 7.</figcaption>
        </figure>
        <figure>
          <img src="{{ url_for('static', filename='images/task2/Sentiment/confusion_matrices_Sentiment_seed42.png') }}" alt="Sentiment confusion matrices, seed 42" />
          <figcaption>Confusion matrices, seed 42.</figcaption>
        </figure>
        <figure>
          <img src="{{ url_for('static', filename='images/task2/Sentiment/confusion_matrices_Sentiment_seed123.png') }}" alt="Sentiment confusion matrices, seed 123" />
          <figcaption>Confusion matrices, seed 123.</figcaption>
        </figure>
      </div>

      <h3 class="figures-subhead">Sarcasm Detection</h3>
      <div class="task-figures">
        <figure>
          <img src="{{ url_for('static', filename='images/task2/Sarcasm/heatmap_Sarcasm_avg.png') }}" alt="Cross-variety sarcasm macro-F1 heatmap (mean over 3 seeds)" />
          <figcaption>Cross-variety macro-F1 heatmap (mean ± std over 3 seeds).</figcaption>
        </figure>
        <figure>
          <img src="{{ url_for('static', filename='images/task2/Sarcasm/confusion_matrices_Sarcasm_seed7.png') }}" alt="Sarcasm confusion matrices, seed 7" />
          <figcaption>Confusion matrices, seed 7.</figcaption>
        </figure>
        <figure>
          <img src="{{ url_for('static', filename='images/task2/Sarcasm/confusion_matrices_Sarcasm_seed42.png') }}" alt="Sarcasm confusion matrices, seed 42" />
          <figcaption>Confusion matrices, seed 42.</figcaption>
        </figure>
        <figure>
          <img src="{{ url_for('static', filename='images/task2/Sarcasm/confusion_matrices_Sarcasm_seed123.png') }}" alt="Sarcasm confusion matrices, seed 123" />
          <figcaption>Confusion matrices, seed 123.</figcaption>
        </figure>
      </div>
    </section>

    <section class="task-card">
      <h2><i class="fa-solid fa-lightbulb"></i> Takeaway</h2>
      <p>
        Sentiment generalises surprisingly well across varieties: even the worst cross-variety pair
        (<code>en-AU</code> → <code>en-IN</code>) still reaches Macro-F1 ≈ 0.82. The <code>mixed</code> model matches or
        slightly beats the per-variety champions on every test set, peaking at <strong>0.9523</strong> on <code>en-UK</code>.
      </p>
      <p>
        Sarcasm is a different story: scores collapse as soon as a model is tested on a variety it wasn't trained on.
        The <code>en-AU</code> model drops from 0.76 in-variety to 0.49 on <code>en-IN</code>, and the highest cross-variety score
        (<code>en-UK</code> → <code>en-AU</code>) is only 0.61. Sarcasm cues travel less well between varieties than sentiment cues.
      </p>
    </section>
  </div>
</section>
{% endblock %}

{% block scripts %}
<script>
const TASK2_MODELS = {{ models | tojson }};
</script>
<script src="{{ url_for('static', filename='js/task2_chat.js') }}"></script>
{% endblock %}