| {% extends "base.html" %} | |
| {% block title %}Task 2: Cross-Variety Fine-Tuning{% endblock %} | |
| {% block content %} | |
| <section class="task-page task1-page task2-page"> | |
| <div class="task-header"> | |
| <span class="task-number">02</span> | |
| <h1>Cross-Variety Fine-Tuning</h1> | |
| <p class="task-subtitle">Does a RoBERTa trained on one English variety generalise to the others?</p> | |
| </div> | |
| <div class="task-body"> | |
| <section class="task-card"> | |
| <h2><i class="fa-solid fa-book-open"></i> What Ryan Did</h2> | |
| <p> | |
| Ryan fine-tuned <code>roberta-base</code> separately on each English variety in the BESSTIE dataset | |
| (<code>en-AU</code>, <code>en-IN</code>, <code>en-UK</code>) and evaluated every checkpoint across all three | |
| test sets, producing full 3×3 cross-variety evaluation matrices. | |
| </p> | |
| <ul class="task-list"> | |
| <li><strong>Core task:</strong> Sentiment Analysis, with three per-variety RoBERTa models evaluated cross-variety.</li> | |
| <li><strong>Bonus 1:</strong> the same 3×3 cross-variety experiment repeated for Sarcasm Detection.</li> | |
| <li><strong>Bonus 2:</strong> an additional RoBERTa fine-tuned on the three varieties combined (<code>mixed</code>) for Sentiment, to compare against the cross-variety models.</li> | |
| <li>Every score is <strong>averaged over 3 random seeds</strong> (42, 7, 123) to smooth out run-to-run variance.</li> | |
| <li>Training recipe: maximum sequence length 128, batch size 16, up to 20 epochs with early stopping on validation Macro-F1, AdamW with learning rate 2e-5, warmup ratio 0.1.</li> | |
| </ul> | |
| </section> | |
| <section class="chat-section task1-chat-section task2-chat-section"> | |
| <div class="chat-container"> | |
| <div class="controls-row"> | |
| <div class="control-group"> | |
| <label for="task2TaskSelect"><i class="fa-solid fa-list-check"></i> Task</label> | |
| <select id="task2TaskSelect"> | |
| <option value="sentiment" selected>Sentiment Analysis</option> | |
| <option value="sarcasm">Sarcasm Detection</option> | |
| </select> | |
| </div> | |
| </div> | |
| <div class="model-desc" id="task2ModelDesc"> | |
| Type once and every per-variety RoBERTa for the selected task will reply independently. | |
| </div> | |
| <div class="chat-log" id="task2ChatLog"> | |
| <div class="msg bot"> | |
| <div class="msg-avatar"><i class="fa-solid fa-robot"></i></div> | |
| <div class="msg-bubble">Pick a task, send a message, and compare what each variety-trained RoBERTa predicts.</div> | |
| </div> | |
| </div> | |
| <div class="chat-input-bar"> | |
| <textarea id="task2UserInput" placeholder="Type your text here..." rows="1"></textarea> | |
| <button id="task2SendBtn" title="Run all Task 2 models"><i class="fa-solid fa-paper-plane"></i></button> | |
| </div> | |
| </div> | |
| </section> | |
| <section class="task-card"> | |
| <h2><i class="fa-solid fa-table"></i> Cross-Variety Evaluation</h2> | |
| <p style="margin-bottom: 12px;"> | |
| Macro-F1 / Macro-P / Macro-R reported as <em>mean ± std</em> over 3 seeds. Rows are the variety the model was | |
| trained on; the test variety changes within each row. | |
| </p> | |
| <div class="eval-grid"> | |
| <div> | |
| <h3>Sentiment Analysis</h3> | |
| {% set rows = eval_tables.sentiment %} | |
| {% include "partials/cross_variety_table.html" %} | |
| </div> | |
| <div> | |
| <h3>Sarcasm Detection</h3> | |
| {% set rows = eval_tables.sarcasm %} | |
| {% include "partials/cross_variety_table.html" %} | |
| </div> | |
| </div> | |
| </section> | |
| <section class="task-card"> | |
| <h2><i class="fa-solid fa-chart-column"></i> Visualisations</h2> | |
| <h3 class="figures-subhead">Sentiment Analysis</h3> | |
| <div class="task-figures"> | |
| <figure> | |
| <img src="{{ url_for('static', filename='images/task2/Sentiment/heatmap_Sentiment_avg.png') }}" alt="Cross-variety sentiment macro-F1 heatmap (mean over 3 seeds)" /> | |
| <figcaption>Cross-variety macro-F1 heatmap (mean ± std over 3 seeds).</figcaption> | |
| </figure> | |
| <figure> | |
| <img src="{{ url_for('static', filename='images/task2/Sentiment/confusion_matrices_Sentiment_seed7.png') }}" alt="Sentiment confusion matrices, seed 7" /> | |
| <figcaption>Confusion matrices, seed 7.</figcaption> | |
| </figure> | |
| <figure> | |
| <img src="{{ url_for('static', filename='images/task2/Sentiment/confusion_matrices_Sentiment_seed42.png') }}" alt="Sentiment confusion matrices, seed 42" /> | |
| <figcaption>Confusion matrices, seed 42.</figcaption> | |
| </figure> | |
| <figure> | |
| <img src="{{ url_for('static', filename='images/task2/Sentiment/confusion_matrices_Sentiment_seed123.png') }}" alt="Sentiment confusion matrices, seed 123" /> | |
| <figcaption>Confusion matrices, seed 123.</figcaption> | |
| </figure> | |
| </div> | |
| <h3 class="figures-subhead">Sarcasm Detection</h3> | |
| <div class="task-figures"> | |
| <figure> | |
| <img src="{{ url_for('static', filename='images/task2/Sarcasm/heatmap_Sarcasm_avg.png') }}" alt="Cross-variety sarcasm macro-F1 heatmap (mean over 3 seeds)" /> | |
| <figcaption>Cross-variety macro-F1 heatmap (mean ± std over 3 seeds).</figcaption> | |
| </figure> | |
| <figure> | |
| <img src="{{ url_for('static', filename='images/task2/Sarcasm/confusion_matrices_Sarcasm_seed7.png') }}" alt="Sarcasm confusion matrices, seed 7" /> | |
| <figcaption>Confusion matrices, seed 7.</figcaption> | |
| </figure> | |
| <figure> | |
| <img src="{{ url_for('static', filename='images/task2/Sarcasm/confusion_matrices_Sarcasm_seed42.png') }}" alt="Sarcasm confusion matrices, seed 42" /> | |
| <figcaption>Confusion matrices, seed 42.</figcaption> | |
| </figure> | |
| <figure> | |
| <img src="{{ url_for('static', filename='images/task2/Sarcasm/confusion_matrices_Sarcasm_seed123.png') }}" alt="Sarcasm confusion matrices, seed 123" /> | |
| <figcaption>Confusion matrices, seed 123.</figcaption> | |
| </figure> | |
| </div> | |
| </section> | |
| <section class="task-card"> | |
| <h2><i class="fa-solid fa-lightbulb"></i> Takeaway</h2> | |
| <p> | |
| Sentiment generalises surprisingly well across varieties: even the worst cross-variety pair | |
| (<code>en-AU</code> → <code>en-IN</code>) still reaches Macro-F1 ≈ 0.82. The <code>mixed</code> model matches or | |
| slightly beats the best per-variety model on every test set, peaking at <strong>0.9523</strong> on <code>en-UK</code>. | |
| </p> | |
| <p> | |
| Sarcasm is a different story: scores collapse the moment a model is tested on a variety it wasn't trained on. | |
| <code>en-AU</code> → <code>en-IN</code> drops from 0.76 to 0.49, and the highest cross-variety score (<code>en-UK</code> | |
| → <code>en-AU</code>) is only 0.61. Sarcasm cues clearly travel less well between dialects than sentiment cues. | |
| </p> | |
| </section> | |
| </div> | |
| </section> | |
| {% endblock %} | |
| {% block scripts %} | |
| <script> | |
| </script> | |
| <script src="{{ url_for('static', filename='js/task2_chat.js') }}"></script> | |
| {% endblock %} | |