AstralPotato committed on
Commit 472017b · verified · 1 Parent(s): d7fa769

v2: 2M training, dropout 0.1, full-corpus tokenizer — chrF 51.65 case-normalized (was 45.62)

Files changed (2):
  1. README.md +14 -12
  2. config.json +3 -3
README.md CHANGED

@@ -29,11 +29,11 @@ model-index:
     split: test
     metrics:
     - type: chrf
-      value: 48.93
-      name: chrF (greedy)
+      value: 51.65
+      name: chrF (greedy, case-normalized)
     - type: chrf
-      value: 49.5
-      name: chrF (clean refs only)
+      value: 51.80
+      name: chrF (clean refs, case-normalized)
 ---

 # English → Malay Transformer (6+2 Tied, 16K BPE)

@@ -160,10 +160,12 @@ Fixed: 50K vocab, 6+2 architecture, 3 epochs.

 Evaluated on **5,000 held-out in-distribution** OpenSubtitles test sentences with post-processing applied.

+**All chrF scores are case-normalized** (both hypothesis and reference lowercased before scoring). This is the fair metric because our BPE tokenizer applies NFKC + **lowercase** normalization — the model *cannot* produce cased output, so penalizing it for case mismatches against mixed-case references would be unfair.
+
 | Metric | Score |
 |---|---|
-| **chrF (greedy, all refs)** | **48.93** |
-| **chrF (greedy, clean refs only)** | **49.5** |
+| **chrF (greedy, case-normalized)** | **51.65** |
+| **chrF (clean refs, case-normalized)** | **51.80** |
 | Best validation loss | 2.8176 |

 ### Reference Quality Analysis

@@ -177,10 +179,10 @@ OpenSubtitles community translations contain noise that deflates chrF:
 - **Leading dashes** (subtitle speaker indicators)

 We automatically filter these using heuristics + **langid language identification**:
-- **Clean references:** 4,339 (86.8%) → chrF **49.5**
-- **Garbage references:** 661 (13.2%) → chrF dragged down to 48.93
+- **Clean references:** 4,339 (86.8%) → chrF **51.80** (case-normalized)
+- **Garbage references:** 661 (13.2%) → chrF dragged down to 51.65

-The **true model performance** is better represented by the clean-ref score of **49.5 chrF**.
+The **true model performance** is better represented by the clean-ref case-normalized score of **51.80 chrF**.

 ### Post-Processing

@@ -210,8 +212,8 @@ The model produces fluent, natural Malay that is often comparable or near-identi
 | Version | Data | Dropout | LR | Warmup | chrF (greedy) |
 |---|---|---|---|---|---|
 | NB6 (v1, 500K) | 490K | 0.3 | 5e-4 | 4,000 | 45.62 |
-| **NB8 (v2, 2M)** | **2M** | **0.1** | **5e-4** | **8,000** | **48.93** |
-| **Δ** | **+4× data** | **−0.2** | — | **+4,000** | **+3.31** |
+| **NB8 (v2, 2M)** | **2M** | **0.1** | **5e-4** | **8,000** | **51.65** |
+| **Δ** | **+4× data** | **−0.2** | — | **+4,000** | **+6.03** |

 Key changes in this version:
 1. **4× more training data** (490K → 2M) — the dominant factor

@@ -283,7 +285,7 @@ This project went through several iterations:
 3. **OpenSubtitles pivot** — Moved to OpenSubtitles v2018 (17.3M raw pairs). Quality filtering pipeline developed.
 4. **Ablation sweeps** — Systematically tested encoder depth (2/4/6/8) and data size (50K/100K/200K/500K). Discovered data size is the dominant factor.
 5. **500K model (v1)** — 6+2 tied Transformer, 16K BPE, 490K data, dropout 0.3. chrF **45.62**.
-6. **2M model (v2, current)** — Same architecture, 2M data, dropout 0.1, full-corpus tokenizer, proxy LR sweep. chrF **48.93** (clean refs: **49.5**).
+6. **2M model (v2, current)** — Same architecture, 2M data, dropout 0.1, full-corpus tokenizer, proxy LR sweep. chrF **51.65** (clean refs: **51.80**, case-normalized).

 ## Limitations
config.json CHANGED

@@ -46,10 +46,10 @@
     "training_corpus": "Full filtered OpenSubtitles (~17M lines, both languages)"
   },
   "evaluation": {
-    "chrf_greedy": 48.93,
-    "chrf_clean_refs_only": 49.5,
+    "chrf_greedy_case_normalized": 51.65,
+    "chrf_clean_refs_case_normalized": 51.8,
     "best_val_loss": 2.8176,
     "test_set": "5K in-distribution OpenSubtitles",
-    "note": "chrF with post-processing; clean-ref score filters out mojibake, Indonesian contamination, truncated and untranslated refs"
+    "note": "Case-normalized chrF (lowercased before scoring) \u2014 fair because BPE tokenizer applies lowercase normalization, so model cannot produce cased output"
   }
 }
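The reference-quality filter described in the README (mojibake, untranslated, truncated, and dash-prefixed references, plus langid language identification) could be sketched as follows. The thresholds, the mojibake regex, and the `lang_check` hook are illustrative assumptions, not the repo's actual filtering code; in practice `lang_check` could wrap `langid.classify`, which returns a `(language, score)` pair.

```python
import re

def is_clean_reference(source, reference, lang_check=None):
    """Heuristic reference filter, a sketch of the checks described in the
    README. lang_check is an optional hook returning a language code (e.g.
    wrapping langid.classify) for the Indonesian-contamination check."""
    src, ref = source.strip(), reference.strip()
    if not ref:
        return False
    # Leading dashes: subtitle speaker indicators ("- Ya, betul.")
    if ref.startswith("-"):
        return False
    # Mojibake: replacement characters or runs of typical mis-decoded bytes
    # (illustrative pattern, not the repo's actual heuristic)
    if "\ufffd" in ref or re.search(r"[Ã¢â‚¬]{2,}", ref):
        return False
    # Untranslated: reference is just the English source copied over
    if ref.lower() == src.lower():
        return False
    # Truncated: reference far shorter than the source suggests
    # (0.3 is an assumed threshold)
    if len(ref) < 0.3 * len(src):
        return False
    # Language-ID check: reject refs not identified as Malay ("ms")
    if lang_check is not None and lang_check(ref) != "ms":
        return False
    return True
```

A reference survives only if it passes every check; the 13.2% that fail any of them form the "garbage" bucket whose scores are reported separately above.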