{% extends "base.html" %}
{% block title %}Task 2: Cross-Variety Fine-Tuning{% endblock %}
{% block content %}
<section class="task-page task1-page task2-page">
<div class="task-header">
<span class="task-number">02</span>
<h1>Cross-Variety Fine-Tuning</h1>
<p class="task-subtitle">Does a RoBERTa trained on one English variety generalise to the others?</p>
</div>
<div class="task-body">
<section class="task-card">
<h2><i class="fa-solid fa-book-open"></i> What Ryan Did</h2>
<p>
Ryan fine-tuned <code>roberta-base</code> separately on each English variety in the BESSTIE dataset
(<code>en-AU</code>, <code>en-IN</code>, <code>en-UK</code>) and evaluated every checkpoint across all three
test sets, producing full 3×3 cross-variety evaluation matrices.
</p>
<ul class="task-list">
<li><strong>Core task:</strong> Sentiment Analysis, with three per-variety RoBERTa models evaluated cross-variety.</li>
<li><strong>Bonus 1:</strong> the same 3×3 cross-variety experiment repeated for Sarcasm Detection.</li>
<li><strong>Bonus 2:</strong> an additional RoBERTa fine-tuned on the three varieties combined (<code>mixed</code>) for Sentiment, to compare against the cross-variety models.</li>
<li>Every score is <strong>averaged over 3 random seeds</strong> (42, 7, 123) to smooth out run-to-run variance.</li>
<li>Training recipe: max sequence length 128, batch size 16, up to 20 epochs with early stopping on validation Macro-F1, AdamW with learning rate 2e-5, warmup ratio 0.1.</li>
</ul>
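{#
The experiment grid described in the list above can be sketched as follows.
This is a minimal illustration under stated assumptions, not the actual training
script: the Recipe field names and both helper functions are hypothetical, and
only the values (model, lengths, seeds, varieties) come from the list.

```python
from dataclasses import dataclass
from itertools import product

VARIETIES = ("en-AU", "en-IN", "en-UK")
SEEDS = (42, 7, 123)


@dataclass(frozen=True)
class Recipe:
    # Values are taken from the recipe in the list; field names are illustrative.
    model_name: str = "roberta-base"
    max_len: int = 128
    batch_size: int = 16
    max_epochs: int = 20
    learning_rate: float = 2e-5
    warmup_ratio: float = 0.1
    early_stopping_metric: str = "val_macro_f1"  # hypothetical metric key


def training_runs():
    """One fine-tuning run per (training variety, seed) pair."""
    return list(product(VARIETIES, SEEDS))


def evaluation_cells():
    """One evaluation per (training variety, test variety, seed) triple."""
    return list(product(VARIETIES, VARIETIES, SEEDS))
```

With three varieties and three seeds this gives 9 fine-tuning runs and 27
evaluations per task; the mixed model from Bonus 2 would add its own runs on top.
#}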
</section>
<section class="chat-section task1-chat-section task2-chat-section">
<div class="chat-container">
<div class="controls-row">
<div class="control-group">
<label for="task2TaskSelect"><i class="fa-solid fa-list-check"></i> Task</label>
<select id="task2TaskSelect">
<option value="sentiment" selected>Sentiment Analysis</option>
<option value="sarcasm">Sarcasm Detection</option>
</select>
</div>
</div>
<div class="model-desc" id="task2ModelDesc">
Type once and every per-variety RoBERTa for the selected task will reply independently.
</div>
<div class="chat-log" id="task2ChatLog">
<div class="msg bot">
<div class="msg-avatar"><i class="fa-solid fa-robot"></i></div>
<div class="msg-bubble">Pick a task, send a message, and compare what each variety-trained RoBERTa predicts.</div>
</div>
</div>
<div class="chat-input-bar">
<textarea id="task2UserInput" placeholder="Type your text here..." rows="1"></textarea>
<button id="task2SendBtn" title="Run all Task 2 models"><i class="fa-solid fa-paper-plane"></i></button>
</div>
</div>
</section>
<section class="task-card">
<h2><i class="fa-solid fa-table"></i> Cross-Variety Evaluation</h2>
<p style="margin-bottom: 12px;">
Macro-F1 / Macro-P / Macro-R reported as <em>mean ± std</em> over 3 seeds. Rows are the variety the model was
trained on; the test variety changes within each row.
</p>
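{#
A minimal sketch of how one "mean ± std" cell could be computed from per-seed
predictions. The function names are hypothetical, and the use of the population
standard deviation is an assumption: the page does not say whether the sample or
population form was used.

```python
from statistics import mean, pstdev


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def summarise(per_seed_scores):
    """Return (mean, std) across seeds; population std is an assumption."""
    return mean(per_seed_scores), pstdev(per_seed_scores)
```

Macro averaging weights every class equally, so a minority class (for example
the sarcastic label) counts as much as the majority class.
#}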
<div class="eval-grid">
<div>
<h3>Sentiment Analysis</h3>
{% set rows = eval_tables.sentiment %}
{% include "partials/cross_variety_table.html" %}
</div>
<div>
<h3>Sarcasm Detection</h3>
{% set rows = eval_tables.sarcasm %}
{% include "partials/cross_variety_table.html" %}
</div>
</div>
</section>
<section class="task-card">
<h2><i class="fa-solid fa-chart-column"></i> Visualisations</h2>
<h3 class="figures-subhead">Sentiment Analysis</h3>
<div class="task-figures">
<figure>
<img src="{{ url_for('static', filename='images/task2/Sentiment/heatmap_Sentiment_avg.png') }}" alt="Cross-variety sentiment macro-F1 heatmap (mean over 3 seeds)" />
<figcaption>Cross-variety macro-F1 heatmap (mean ± std over 3 seeds).</figcaption>
</figure>
<figure>
<img src="{{ url_for('static', filename='images/task2/Sentiment/confusion_matrices_Sentiment_seed7.png') }}" alt="Sentiment confusion matrices, seed 7" />
<figcaption>Confusion matrices, seed 7.</figcaption>
</figure>
<figure>
<img src="{{ url_for('static', filename='images/task2/Sentiment/confusion_matrices_Sentiment_seed42.png') }}" alt="Sentiment confusion matrices, seed 42" />
<figcaption>Confusion matrices, seed 42.</figcaption>
</figure>
<figure>
<img src="{{ url_for('static', filename='images/task2/Sentiment/confusion_matrices_Sentiment_seed123.png') }}" alt="Sentiment confusion matrices, seed 123" />
<figcaption>Confusion matrices, seed 123.</figcaption>
</figure>
</div>
<h3 class="figures-subhead">Sarcasm Detection</h3>
<div class="task-figures">
<figure>
<img src="{{ url_for('static', filename='images/task2/Sarcasm/heatmap_Sarcasm_avg.png') }}" alt="Cross-variety sarcasm macro-F1 heatmap (mean over 3 seeds)" />
<figcaption>Cross-variety macro-F1 heatmap (mean ± std over 3 seeds).</figcaption>
</figure>
<figure>
<img src="{{ url_for('static', filename='images/task2/Sarcasm/confusion_matrices_Sarcasm_seed7.png') }}" alt="Sarcasm confusion matrices, seed 7" />
<figcaption>Confusion matrices, seed 7.</figcaption>
</figure>
<figure>
<img src="{{ url_for('static', filename='images/task2/Sarcasm/confusion_matrices_Sarcasm_seed42.png') }}" alt="Sarcasm confusion matrices, seed 42" />
<figcaption>Confusion matrices, seed 42.</figcaption>
</figure>
<figure>
<img src="{{ url_for('static', filename='images/task2/Sarcasm/confusion_matrices_Sarcasm_seed123.png') }}" alt="Sarcasm confusion matrices, seed 123" />
<figcaption>Confusion matrices, seed 123.</figcaption>
</figure>
</div>
</section>
<section class="task-card">
<h2><i class="fa-solid fa-lightbulb"></i> Takeaway</h2>
<p>
Sentiment generalises surprisingly well across varieties: even the worst cross-variety pair
(<code>en-AU</code> → <code>en-IN</code>) still hits Macro-F1 ≈ 0.82. The <code>mixed</code> model matches or
slightly beats the per-variety champions on every test set, peaking at <strong>0.9523</strong> on <code>en-UK</code>.
</p>
<p>
Sarcasm is a different story: scores collapse the moment a model is tested on a variety it wasn't trained on.
<code>en-AU</code> → <code>en-IN</code> drops from 0.76 to 0.49, and the highest cross-variety score (<code>en-UK</code>
→ <code>en-AU</code>) is only 0.61. Sarcasm cues clearly travel less well between dialects than sentiment cues.
</p>
</section>
</div>
</section>
{% endblock %}
{% block scripts %}
<script>
</script>
<script src="{{ url_for('static', filename='js/task2_chat.js') }}"></script>
{% endblock %}