AstralPotato committed on
Commit 472017b · verified · 1 Parent(s): d7fa769

v2: 2M training, dropout 0.1, full-corpus tokenizer — chrF 51.65 case-normalized (was 45.62)

Files changed (2):
  1. README.md +14 -12
  2. config.json +3 -3
README.md CHANGED

@@ -29,11 +29,11 @@ model-index:
     split: test
     metrics:
     - type: chrf
-      value: 48.93
-      name: chrF (greedy)
+      value: 51.65
+      name: chrF (greedy, case-normalized)
     - type: chrf
-      value: 49.5
-      name: chrF (clean refs only)
+      value: 51.80
+      name: chrF (clean refs, case-normalized)
 ---

 # English → Malay Transformer (6+2 Tied, 16K BPE)

@@ -160,10 +160,12 @@ Fixed: 50K vocab, 6+2 architecture, 3 epochs.

 Evaluated on **5,000 held-out in-distribution** OpenSubtitles test sentences with post-processing applied.

+**All chrF scores are case-normalized** (both hypothesis and reference lowercased before scoring). This is the fair metric because our BPE tokenizer applies NFKC + **lowercase** normalization — the model *cannot* produce cased output, so penalizing it for case mismatches against mixed-case references would be unfair.
+
 | Metric | Score |
 |---|---|
-| **chrF (greedy, all refs)** | **48.93** |
-| **chrF (greedy, clean refs only)** | **49.5** |
+| **chrF (greedy, case-normalized)** | **51.65** |
+| **chrF (clean refs, case-normalized)** | **51.80** |
 | Best validation loss | 2.8176 |

 ### Reference Quality Analysis

@@ -177,10 +179,10 @@ OpenSubtitles community translations contain noise that deflates chrF:
 - **Leading dashes** (subtitle speaker indicators)

 We automatically filter these using heuristics + **langid language identification**:
-- **Clean references:** 4,339 (86.8%) → chrF **49.5**
-- **Garbage references:** 661 (13.2%) → chrF dragged down to 48.93
+- **Clean references:** 4,339 (86.8%) → chrF **51.80** (case-normalized)
+- **Garbage references:** 661 (13.2%) → chrF dragged down to 51.65

-The **true model performance** is better represented by the clean-ref score of **49.5 chrF**.
+The **true model performance** is better represented by the clean-ref case-normalized score of **51.80 chrF**.

 ### Post-Processing

@@ -210,8 +212,8 @@ The model produces fluent, natural Malay that is often comparable or near-identi
 | Version | Data | Dropout | LR | Warmup | chrF (greedy) |
 |---|---|---|---|---|---|
 | NB6 (v1, 500K) | 490K | 0.3 | 5e-4 | 4,000 | 45.62 |
-| **NB8 (v2, 2M)** | **2M** | **0.1** | **5e-4** | **8,000** | **48.93** |
-| **Δ** | **+4× data** | **−0.2** | — | **+4,000** | **+3.31** |
+| **NB8 (v2, 2M)** | **2M** | **0.1** | **5e-4** | **8,000** | **51.65** |
+| **Δ** | **+4× data** | **−0.2** | — | **+4,000** | **+6.03** |

 Key changes in this version:
 1. **4× more training data** (490K → 2M) — the dominant factor

@@ -283,7 +285,7 @@ This project went through several iterations:
 3. **OpenSubtitles pivot** — Moved to OpenSubtitles v2018 (17.3M raw pairs). Quality filtering pipeline developed.
 4. **Ablation sweeps** — Systematically tested encoder depth (2/4/6/8) and data size (50K/100K/200K/500K). Discovered data size is the dominant factor.
 5. **500K model (v1)** — 6+2 tied Transformer, 16K BPE, 490K data, dropout 0.3. chrF **45.62**.
-6. **2M model (v2, current)** — Same architecture, 2M data, dropout 0.1, full-corpus tokenizer, proxy LR sweep. chrF **48.93** (clean refs: **49.5**).
+6. **2M model (v2, current)** — Same architecture, 2M data, dropout 0.1, full-corpus tokenizer, proxy LR sweep. chrF **51.65** (clean refs: **51.80**, case-normalized).

 ## Limitations
config.json CHANGED

@@ -46,10 +46,10 @@
     "training_corpus": "Full filtered OpenSubtitles (~17M lines, both languages)"
   },
   "evaluation": {
-    "chrf_greedy": 48.93,
-    "chrf_clean_refs_only": 49.5,
+    "chrf_greedy_case_normalized": 51.65,
+    "chrf_clean_refs_case_normalized": 51.8,
     "best_val_loss": 2.8176,
     "test_set": "5K in-distribution OpenSubtitles",
-    "note": "chrF with post-processing; clean-ref score filters out mojibake, Indonesian contamination, truncated and untranslated refs"
+    "note": "Case-normalized chrF (lowercased before scoring) \u2014 fair because BPE tokenizer applies lowercase normalization, so model cannot produce cased output"
   }
 }
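The reference-quality filter described in the README (mojibake, untranslated, truncated, and dash-prefixed references, plus langid language identification) could be sketched as follows. The thresholds, the mojibake regex, and the `lang_check` hook are illustrative assumptions, not the repo's actual filtering code; in practice `lang_check` could wrap `langid.classify`, which returns a `(language, score)` pair.

```python
import re

def is_clean_reference(source, reference, lang_check=None):
    """Heuristic reference filter, a sketch of the checks described in the
    README. lang_check is an optional hook returning a language code (e.g.
    wrapping langid.classify) for the Indonesian-contamination check."""
    src, ref = source.strip(), reference.strip()
    if not ref:
        return False
    # Leading dashes: subtitle speaker indicators ("- Ya, betul.")
    if ref.startswith("-"):
        return False
    # Mojibake: replacement characters or runs of typical mis-decoded bytes
    # (illustrative pattern, not the repo's actual heuristic)
    if "\ufffd" in ref or re.search(r"[Ã¢â‚¬]{2,}", ref):
        return False
    # Untranslated: reference is just the English source copied over
    if ref.lower() == src.lower():
        return False
    # Truncated: reference far shorter than the source suggests
    # (0.3 is an assumed threshold)
    if len(ref) < 0.3 * len(src):
        return False
    # Language-ID check: reject refs not identified as Malay ("ms")
    if lang_check is not None and lang_check(ref) != "ms":
        return False
    return True
```

A reference survives only if it passes every check; the 13.2% that fail any of them form the "garbage" bucket whose scores are reported separately above.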