v2: 2M training, dropout 0.1, full-corpus tokenizer — chrF 51.65 case-normalized (was 45.62)
- README.md +14 -12
- config.json +3 -3
README.md
CHANGED
@@ -29,11 +29,11 @@ model-index:
29    split: test
30    metrics:
31    - type: chrf
32  - value:
33  - name: chrF (greedy)
34    - type: chrf
35  - value:
36  - name: chrF (clean refs
37    ---
38
39    # English → Malay Transformer (6+2 Tied, 16K BPE)
@@ -160,10 +160,12 @@ Fixed: 50K vocab, 6+2 architecture, 3 epochs.
160
161    Evaluated on **5,000 held-out in-distribution** OpenSubtitles test sentences with post-processing applied.
162
163    | Metric | Score |
164    |---|---|
165  - | **chrF (greedy,
166  - | **chrF (
167    | Best validation loss | 2.8176 |
168
169    ### Reference Quality Analysis
@@ -177,10 +179,10 @@ OpenSubtitles community translations contain noise that deflates chrF:
177    - **Leading dashes** (subtitle speaker indicators)
178
179    We automatically filter these using heuristics + **langid language identification**:
180  - - **Clean references:** 4,339 (86.8%) → chrF **
181  - - **Garbage references:** 661 (13.2%) → chrF dragged down to
182
183  - The **true model performance** is better represented by the clean-ref score of **
184
185    ### Post-Processing
186
@@ -210,8 +212,8 @@ The model produces fluent, natural Malay that is often comparable or near-identi
210    | Version | Data | Dropout | LR | Warmup | chrF (greedy) |
211    |---|---|---|---|---|---|
212    | NB6 (v1, 500K) | 490K | 0.3 | 5e-4 | 4,000 | 45.62 |
213  - | **NB8 (v2, 2M)** | **2M** | **0.1** | **5e-4** | **8,000** | **
214  - | **Δ** | **+4× data** | **−0.2** | — | **+4,000** | **+
215
216    Key changes in this version:
217    1. **4× more training data** (490K → 2M) — the dominant factor
@@ -283,7 +285,7 @@ This project went through several iterations:
283    3. **OpenSubtitles pivot** — Moved to OpenSubtitles v2018 (17.3M raw pairs). Quality filtering pipeline developed.
284    4. **Ablation sweeps** — Systematically tested encoder depth (2/4/6/8) and data size (50K/100K/200K/500K). Discovered data size is the dominant factor.
285    5. **500K model (v1)** — 6+2 tied Transformer, 16K BPE, 490K data, dropout 0.3. chrF **45.62**.
286  - 6. **2M model (v2, current)** — Same architecture, 2M data, dropout 0.1, full-corpus tokenizer, proxy LR sweep. chrF **
287
288    ## Limitations
289

29    split: test
30    metrics:
31    - type: chrf
32  + value: 51.65
33  + name: chrF (greedy, case-normalized)
34    - type: chrf
35  + value: 51.80
36  + name: chrF (clean refs, case-normalized)
37    ---
38
39    # English → Malay Transformer (6+2 Tied, 16K BPE)

160
161    Evaluated on **5,000 held-out in-distribution** OpenSubtitles test sentences with post-processing applied.
162
163  + **All chrF scores are case-normalized** (both hypothesis and reference lowercased before scoring). This is the fair metric because our BPE tokenizer applies NFKC + **lowercase** normalization — the model *cannot* produce cased output, so penalizing it for case mismatches against mixed-case references would be unfair.
164  +
165    | Metric | Score |
166    |---|---|
167  + | **chrF (greedy, case-normalized)** | **51.65** |
168  + | **chrF (clean refs, case-normalized)** | **51.80** |
169    | Best validation loss | 2.8176 |
170
171    ### Reference Quality Analysis

179    - **Leading dashes** (subtitle speaker indicators)
180
181    We automatically filter these using heuristics + **langid language identification**:
182  + - **Clean references:** 4,339 (86.8%) → chrF **51.80** (case-normalized)
183  + - **Garbage references:** 661 (13.2%) → chrF dragged down to 51.65
184
185  + The **true model performance** is better represented by the clean-ref case-normalized score of **51.80 chrF**.
186
187    ### Post-Processing
188

212    | Version | Data | Dropout | LR | Warmup | chrF (greedy) |
213    |---|---|---|---|---|---|
214    | NB6 (v1, 500K) | 490K | 0.3 | 5e-4 | 4,000 | 45.62 |
215  + | **NB8 (v2, 2M)** | **2M** | **0.1** | **5e-4** | **8,000** | **51.65** |
216  + | **Δ** | **+4× data** | **−0.2** | — | **+4,000** | **+6.03** |
217
218    Key changes in this version:
219    1. **4× more training data** (490K → 2M) — the dominant factor

285    3. **OpenSubtitles pivot** — Moved to OpenSubtitles v2018 (17.3M raw pairs). Quality filtering pipeline developed.
286    4. **Ablation sweeps** — Systematically tested encoder depth (2/4/6/8) and data size (50K/100K/200K/500K). Discovered data size is the dominant factor.
287    5. **500K model (v1)** — 6+2 tied Transformer, 16K BPE, 490K data, dropout 0.3. chrF **45.62**.
288  + 6. **2M model (v2, current)** — Same architecture, 2M data, dropout 0.1, full-corpus tokenizer, proxy LR sweep. chrF **51.65** (clean refs: **51.80**, case-normalized).
289
290    ## Limitations
291

config.json
CHANGED
@@ -46,10 +46,10 @@
46    "training_corpus": "Full filtered OpenSubtitles (~17M lines, both languages)"
47    },
48    "evaluation": {
49  - "
50  - "
51    "best_val_loss": 2.8176,
52    "test_set": "5K in-distribution OpenSubtitles",
53  - "note": "chrF
54    }
55    }

46    "training_corpus": "Full filtered OpenSubtitles (~17M lines, both languages)"
47    },
48    "evaluation": {
49  + "chrf_greedy_case_normalized": 51.65,
50  + "chrf_clean_refs_case_normalized": 51.8,
51    "best_val_loss": 2.8176,
52    "test_set": "5K in-distribution OpenSubtitles",
53  + "note": "Case-normalized chrF (lowercased before scoring) \u2014 fair because BPE tokenizer applies lowercase normalization, so model cannot produce cased output"
54    }
55    }

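The case-normalized scoring that this commit introduces (hypothesis and reference both lowercased before chrF is computed) can be illustrated with a minimal sketch. This is a simplified, pure-Python, sentence-level chrF written only for illustration. It is not the scoring script behind the reported 51.65 / 51.80, and its values will not exactly match a reference implementation. The function names `char_ngrams`, `simple_chrf`, `normalize`, and `case_normalized_chrf` are hypothetical. Applying NFKC to the reference side is also an assumption: the README states only that both sides are lowercased and that the tokenizer applies NFKC + lowercase.

```python
import unicodedata
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after dropping whitespace (a simplification)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF in [0, 100].

    Averages n-gram precision and recall over orders 1..max_n,
    then combines them with an F-beta score (beta=2 weights recall more).
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if not h or not r:
            continue  # one side is too short for this n-gram order
        overlap = sum((h & r).values())  # clipped n-gram matches
        precisions.append(overlap / sum(h.values()))
        recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


def normalize(text: str) -> str:
    """Assumed normalization: NFKC followed by lowercasing."""
    return unicodedata.normalize("NFKC", text).lower()


def case_normalized_chrf(hyp: str, ref: str) -> float:
    """Score hypothesis against reference after normalizing both sides."""
    return simple_chrf(normalize(hyp), normalize(ref))


if __name__ == "__main__":
    hyp = "saya tidak tahu"   # lowercase-only model output
    ref = "Saya Tidak Tahu"   # mixed-case reference
    print("raw:", round(simple_chrf(hyp, ref), 2))
    print("case-normalized:", round(case_normalized_chrf(hyp, ref), 2))
```

With a lowercase-only hypothesis and a mixed-case reference that otherwise match, the raw score is penalized for the capital letters while the case-normalized score is not, which is the effect the README change describes.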