{% extends "base.html" %}
{% block title %}Task 3 — Dialect-Aware LoRA Adapters{% endblock %}

{% block content %}
<section class="task-page task1-page task2-page task3-page">
  <div class="task-header">
    <span class="task-number">03</span>
    <h1>Dialect-Aware LoRA Sarcasm Detection</h1>
    <p class="task-subtitle">Can a single 1.1B-parameter LLM be cheaply specialised per English variety with LoRA adapters?</p>
  </div>

  <div class="task-body">
    <section class="task-card">
      <h2><i class="fa-solid fa-book-open"></i> What Omkar Did</h2>
      <p>
        Omkar fine-tuned <code>TinyLlama-1.1B-Chat-v1.0</code> with a separate LoRA adapter for each English variety in the
        BESSTIE sarcasm dataset (<code>en-AU</code>, <code>en-IN</code>, <code>en-UK</code>) and evaluated every adapter
        across all three test sets, producing a full 3×3 cross-variety evaluation matrix.
      </p>
      <ul class="task-list">
        <li><strong>Base model:</strong> TinyLlama-1.1B-Chat-v1.0 (frozen) with PEFT LoRA adapters &mdash; <code>r=16</code>, <code>alpha=32</code>, <code>dropout=0.05</code>, all-linear target modules.</li>
        <li><strong>Task framing:</strong> the assistant answers <code>"yes"</code> or <code>"no"</code> to a single sarcasm question, with completion-only loss applied to that one label token.</li>
        <li><strong>Class balance:</strong> a <code>WeightedRandomSampler</code> rebalances training exposure without duplicating rows; evaluation uses a validation-tuned threshold on <code>logit_yes − logit_no</code>.</li>
        <li><strong>Training recipe:</strong> max sequence length 512, batch size 8 with 2 gradient-accumulation steps, 5 epochs, AdamW at LR 2e-4 with a cosine schedule, warmup ratio 0.05, weight decay 0.01, bf16 precision, best checkpoint selected by validation Macro-F1.</li>
        <li><strong>Seeds:</strong> every score is averaged over 3 random seeds (42, 123, 2024). The best-performing adapter per variety is published on the Hugging Face Hub.</li>
      </ul>
    </section>

    <section class="chat-section task1-chat-section task2-chat-section task3-chat-section">
      <div class="chat-container">
        <div class="model-desc" id="task3ModelDesc">
          Type once and every dialect-tuned LoRA adapter will reply independently.
        </div>

        <div class="chat-log" id="task3ChatLog">
          <div class="msg bot">
            <div class="msg-avatar"><i class="fa-solid fa-robot"></i></div>
            <div class="msg-bubble">Send a message to compare what each variety-tuned LoRA adapter predicts. The first call per dialect lazy-loads its adapter, so expect a short delay.</div>
          </div>
        </div>

        <div class="chat-input-bar">
          <textarea id="task3UserInput" placeholder="Type your text here..." rows="1"></textarea>
          <button id="task3SendBtn" title="Run all Task 3 models"><i class="fa-solid fa-paper-plane"></i></button>
        </div>
      </div>
    </section>

    <section class="task-card">
      <h2><i class="fa-solid fa-table"></i> Cross-Variety Evaluation</h2>
      <p style="margin-bottom: 12px;">
        Macro-F1, Macro-P (precision) and Macro-R (recall) are reported as <em>mean ± std</em> over 3 seeds. Each row
        corresponds to the variety the adapter was trained on; the test variety changes within each row.
      </p>
      <div class="eval-grid">
        <div>
          <h3>Sarcasm Detection</h3>
          {% set rows = eval_tables.sarcasm %}
          {% include "partials/cross_variety_table.html" %}
        </div>
      </div>
    </section>

    <section class="task-card">
      <h2><i class="fa-solid fa-chart-column"></i> Visualisations</h2>
      <div class="task-figures">
        <figure>
          <img src="{{ url_for('static', filename='images/task3/CrossVarietyMeanF1Matrix.png') }}" alt="Cross-variety mean Macro-F1 heatmap" />
          <figcaption>Cross-variety Macro-F1 heatmap (mean over 3 seeds).</figcaption>
        </figure>
        <figure>
          <img src="{{ url_for('static', filename='images/task3/EvaluationF1overseeds.png') }}" alt="Macro-F1 by train/test variety with seed std-dev" />
          <figcaption>Macro-F1 by train/test variety with seed standard deviation.</figcaption>
        </figure>
        <figure>
          <img src="{{ url_for('static', filename='images/task3/seedAvgF1LineGraph.png') }}" alt="Seed-averaged Macro-F1 with seed standard deviation" />
          <figcaption>Seed-averaged Macro-F1 across test varieties.</figcaption>
        </figure>
        <figure>
          <img src="{{ url_for('static', filename='images/task3/confusionAcrossSeeds.png') }}" alt="Mean confusion matrices across seeds" />
          <figcaption>Mean confusion matrices across seeds (3×3 grid).</figcaption>
        </figure>
        <figure>
          <img src="{{ url_for('static', filename='images/task3/variety-valMacroF1.png') }}" alt="Validation Macro-F1 vs training step" />
          <figcaption>Validation Macro-F1 vs. training step.</figcaption>
        </figure>
      </div>
    </section>

    <section class="task-card">
      <h2><i class="fa-solid fa-lightbulb"></i> Takeaway</h2>
      <p>
        On its own dialect, the <code>en-UK</code> adapter leads with Macro-F1 <strong>0.7724 ± 0.0088</strong>, closely
        followed by <code>en-AU</code> at <strong>0.7603 ± 0.0291</strong>. The <code>en-IN</code> adapter lags at
        <strong>0.5964 ± 0.0817</strong> with a much wider seed spread &mdash; sarcasm in Indian English is the hardest of
        the three for this 1.1B-parameter base.
      </p>
      <p>
        Cross-variety transfer collapses just as it does in Ryan's RoBERTa sarcasm matrix: every off-diagonal cell drops
        well below the same-dialect score, with the worst pair (<code>en-AU</code> &rarr; <code>en-IN</code>) falling to
        <strong>0.50</strong>. Sarcasm cues clearly remain dialect-specific even when the underlying model is an LLM
        much larger than RoBERTa.
      </p>
    </section>
  </div>
</section>
{% endblock %}

{% block scripts %}
<script>
const TASK3_MODELS = {{ models | tojson }};
</script>
<script src="{{ url_for('static', filename='js/task3_chat.js') }}"></script>
{% endblock %}