English → Malay Transformer (6+2 Tied, 16K BPE)
A custom encoder-decoder Transformer for English-to-Malay translation, built entirely from scratch in PyTorch. This model was developed as part of IT3103 Advanced Topics in AI — Assignment 2, 2025 Semester 2.
The project encompasses the full NMT pipeline: dataset curation, tokenizer training, architecture design with ablation studies, training with mixed-precision, and evaluation — all without using any pretrained models or high-level frameworks like Fairseq or OpenNMT.
Model Description
| Component | Details |
|---|---|
| Architecture | 6-layer encoder + 2-layer decoder, pre-norm Transformer |
| d_model / n_head / d_ff | 512 / 8 / 2048 |
| Vocab | 16,000 shared BPE (English + Malay, joint) |
| Dropout | 0.1 |
| Parameters | ~27M |
| Tied embeddings | Yes — encoder input, decoder input, and output projection share the same weight matrix (Press & Wolf, 2017) |
| Normalisation | Pre-norm (LayerNorm before attention/FFN, not after) |
Design Decisions and Rationale
Why 6+2 (Deep Encoder, Shallow Decoder)?
The asymmetric 6+2 architecture is grounded in Kasai et al. (2021) — "Deep Encoder, Shallow Decoder: Reevaluating the Speed-Quality Tradeoff in Machine Translation". The core insight is that encoder depth contributes more to translation quality (richer source representations), while the decoder can be kept shallow without significant degradation. This was empirically validated by our own Ablation Sweep 1 (see below), which showed that encoder depths of 2, 4, 6, and 8 all produced similar chrF scores (22–25 range), indicating the model hits diminishing returns quickly. We chose 6 as a safe operating point.
The shallow 2-layer decoder provides a practical speed advantage: ~2× faster inference compared to a symmetric 6+6, since autoregressive decoding must run the decoder once per output token.
Why 16K shared vocabulary?
We initially trained with 50K vocabulary but found it too sparse for our data — most tokens appeared very infrequently, leaving embeddings under-trained. Reducing to 16K shared BPE produced denser embeddings and faster training. English and Malay share the Latin script with substantial lexical overlap (loanwords like "teknologi", "universiti"; numbers; proper nouns), making a joint vocabulary highly effective.
The tokenizer was trained on the full filtered OpenSubtitles corpus (~17M lines), not just the 2M training split. BPE only needs raw text frequency statistics — more text = better merge rules — and noisy translations don't hurt tokenizer training since it just counts character n-grams.
Why tied embeddings?
With a shared source-target vocabulary, tying the encoder embedding, decoder embedding, and output projection matrix (Press & Wolf, 2017) reduces the parameter count by ~8M while acting as a strong regulariser. The model learns a single semantic space for both languages.
Why dropout 0.1?
With 2M training sentences, aggressive dropout (0.3) would over-regularise. Dropout 0.1 is the standard Transformer default and was confirmed appropriate — the train-val gap remained small throughout training.
Training Data
- Dataset: OpenSubtitles v2018 (English-Malay aligned parallel corpus)
- Raw corpus size: ~17.3M parallel sentence pairs
- After filtering: 2,010,000 pairs selected
- Split: 2,000,000 train / 5,000 validation / 5,000 test (all in-distribution)
Data Preprocessing Pipeline
The raw OpenSubtitles corpus is notoriously noisy (subtitle artifacts, music symbols, HTML tags, near-duplicate lines). We applied the following quality filters:
- Length filter: 3–80 words per side (removes fragments and overly long lines)
- Length ratio filter: max(len_en, len_ms) / min(len_en, len_ms) ≤ 3.0 (removes misaligned pairs)
- Character length filter: 10–400 characters per side
- Junk pattern removal: Regex filter for music symbols (♪♫), HTML tags, bracket-only lines (e.g.
[music playing]), ellipsis-only lines, dash-only lines - Deduplication: Case-insensitive exact match on the English side
Why OpenSubtitles over TED Talks?
We initially experimented with the IWSLT TED Talks dataset (~5K en-ms pairs) and achieved a chrF of only 6.76 — the dataset was far too small. We then moved to OpenSubtitles which provides orders of magnitude more data. Importantly, we evaluate on in-distribution OpenSubtitles test data rather than using TED Talks as an out-of-distribution test set, which would unfairly penalise the model for domain mismatch (conversational subtitles vs. formal TED lectures).
Proxy LR Sweep
Before the full 2M training, we ran a proxy LR sweep on a 200K subset (8 epochs, no early stopping) to select the learning rate without needing multiple expensive full-scale runs:
| LR | Val Loss | chrF (greedy) | Best Epoch |
|---|---|---|---|
| 3e-4 | 3.4254 | 43.46 | 8 |
| 5e-4 | 3.3471 | 44.17 | 7 |
| 7e-4 | 3.3375 | 43.81 | 8 |
Winner: LR = 5e-4 — highest chrF on the 5K test set.
Rationale: LR transfers well across data scales (Kaplan et al., 2020). Running the sweep on 200K avoids training on 2M multiple times, saving ~7 hours of GPU time.
Ablation Studies
We conducted two systematic ablation sweeps to guide architecture and data decisions. All sweeps used a 50K vocabulary baseline with 3 training epochs for efficiency.
Sweep 1: Encoder Depth
Fixed: 50K vocab, 500K data, 2-layer decoder, 3 epochs.
| Encoder Layers | chrF (TED test) | Val Loss | Params |
|---|---|---|---|
| 2 | 24.42 | 3.92 | 48.5M |
| 4 | 22.37 | 3.84 | 61.5M |
| 6 | 24.65 | 3.80 | 74.6M |
| 8 | 22.91 | 3.76 | 87.6M |
Finding: Encoder depth has flat returns on downstream chrF despite steadily decreasing validation loss. This suggests the TED Talks OOD test set was the bottleneck (confirmed later), not model capacity. We selected 6 layers as the sweet spot.
Sweep 2: Training Data Size
Fixed: 50K vocab, 6+2 architecture, 3 epochs.
| Train Size | chrF (TED test) | Val Loss |
|---|---|---|
| 50K | 16.67 | 4.50 |
| 100K | 19.60 | 4.11 |
| 200K | 22.47 | 3.93 |
| 500K | 26.50 | 3.75 |
Finding: chrF scales approximately linearly with log(data size) — a ~3.3 chrF improvement per doubling. This confirmed that data volume is the dominant factor for translation quality at this scale, motivating our final model to use 2M sentences.
Training Details
| Setting | Value |
|---|---|
| Optimizer | AdamW (lr=5e-4, β₁=0.9, β₂=0.98, ε=1e-9) |
| Schedule | Linear warmup (8,000 steps, ~0.5 epochs) → cosine decay to 0 |
| Batch size | 128 |
| Max sequence length | 128 tokens |
| Epochs | 17 of 20 max (early stopping, patience=3) |
| Best epoch | 17 (val loss 2.8176) |
| Label smoothing | 0.1 |
| Gradient clipping | max_norm=1.0 |
| Dropout | 0.1 |
| AMP | fp16 mixed precision (PyTorch GradScaler) |
Evaluation Results
Evaluated on 5,000 held-out in-distribution OpenSubtitles test sentences with post-processing applied.
All chrF scores are case-normalized (both hypothesis and reference lowercased before scoring). This is the fair metric because our BPE tokenizer applies NFKC + lowercase normalization — the model cannot produce cased output, so penalizing it for case mismatches against mixed-case references would be unfair.
| Metric | Score |
|---|---|
| chrF (greedy, case-normalized) | 51.65 |
| chrF (clean refs, case-normalized) | 51.80 |
| chrF (clean refs, case-normalized, beam=5) | 52.14 |
| Best validation loss | 2.8176 |
Reference Quality Analysis
OpenSubtitles community translations contain noise that deflates chrF:
- UTF-8 corruption / mojibake (’, Â, �, etc.)
- Truncated references that drop half the source sentence
- Untranslated references left in English
- Indonesian contamination (OpenSubtitles "ms" is heavily mixed with Bahasa Indonesia)
- ALL-CAPS (burnt-in subtitle OCR artifacts — chrF is case-sensitive)
- Leading dashes (subtitle speaker indicators)
We automatically filter these using heuristics + langid language identification:
- Clean references: 4,339 (86.8%) → chrF 51.80 (case-normalized)
- Garbage references: 661 (13.2%) → chrF dragged down to 51.65
With beam search, it increases to 52.14
The true model performance is better represented by the clean-ref case-normalized score of 52.14 chrF.
Post-Processing
The BPE tokenizer uses a Whitespace pre-tokenizer without continuation markers, so raw decode() output contains spurious spaces before punctuation. We apply lightweight regex-based post-processing:
- Removes spaces before punctuation marks (
. , ? ! ; :) - Removes spaces after opening brackets/quotes
- Collapses spaced hyphens in compound words
- Capitalises the first character
Sample Translations
| # | English (Source) | Reference (Malay) | Model Output |
|---|---|---|---|
| 1 | Heather, you in here? | Heather, awak ada di sini? | Heather, awak di sini? |
| 2 | Hey, dude, why do you run? | Hey, dude, mengapa anda menjalankan? | Hei, kawan, kenapa kau lari? |
| 3 | What about your wife and daughter? | Bagaimana dengan isteri dan anak perempuan awak? | Bagaimana dengan isteri dan anak perempuan awak? |
| 4 | Thank you, gentlemen. | Terima kasih, tuan-tuan semua. | Terima kasih, tuan-tuan. |
| 5 | We'll be ready for the shipment. | Kami akan bersedia untuk penghantaran. | Kami akan bersedia untuk penghantaran. |
| 6 | You're at home. | Awak ada di rumah. | Awak di rumah. |
| 7 | She may be dying and it's all my fault. | Dia mungkin akan mati dan semuanya salah saya. | Dia mungkin akan mati dan semuanya salah saya. |
The model produces fluent, natural Malay that is often comparable or near-identical to the reference translations. Note that in some cases (e.g., #2), the model output is arguably better Malay than the reference — "kenapa kau lari?" is more natural than "mengapa anda menjalankan?".
Improvement over Previous Version
| Version | Data | Dropout | LR | Warmup | chrF (greedy) |
|---|---|---|---|---|---|
| NB6 (v1, 500K) | 490K | 0.3 | 5e-4 | 4,000 | 45.62 |
| NB8 (v2, 2M) | 2M | 0.1 | 5e-4 | 8,000 | 51.65 |
| Δ | +4× data | −0.2 | — | +4,000 | +6.03 |
Key changes in this version:
- 4× more training data (490K → 2M) — the dominant factor
- Reduced dropout (0.3 → 0.1) — less regularisation with more data
- Full-corpus tokenizer — BPE trained on all ~17M filtered lines instead of just 490K
- Proxy LR sweep — systematic LR selection instead of default
- Longer warmup (4,000 → 8,000 steps) — scaled proportionally to data
Tokenizer
- Type: Byte-Pair Encoding (BPE) via HuggingFace
tokenizerslibrary (Rust backend) - Vocab size: 16,000 (shared joint vocabulary for both English and Malay)
- Normalization: NFKC Unicode normalisation + lowercase
- Pre-tokenization: Whitespace splitting
- Post-processing:
[BOS] $A [EOS]template (auto-wraps encoded sequences) - Special tokens:
[PAD]=0, [UNK]=1, [CLS]=2, [SEP]=3, [MASK]=4, [BOS]=5, [EOS]=6 - Trained on: Full filtered OpenSubtitles corpus (~17M lines, both languages)
Why Shared BPE for en-ms?
English and Malay both use the Latin script with significant lexical overlap (loanwords: "teknologi", "matematik", "universiti"; numbers; proper nouns; punctuation). A joint BPE vocabulary captures cross-lingual subword patterns and directly enables tied embeddings. Malay's morphological affixes (me-, ber-, di-, -kan, -an, -i) are naturally learned as subword units by BPE, providing good coverage without an explicitly morphological tokenizer.
Usage
import torch
from tokenizers import Tokenizer
# Load tokenizer
tokenizer = Tokenizer.from_file("tokenizer_shared_16k.json")
# Load model (requires model.py from src/)
from src.model import build_model
model = build_model(
vocab_size=16000, pad_idx=0, device=torch.device("cpu"),
d_model=512, n_head=8, num_encoder_layers=6, num_decoder_layers=2,
d_ff=2048, dropout=0.1, max_len=144,
)
model.load_state_dict(torch.load("best_model.pt", map_location="cpu", weights_only=True))
model.eval()
# Translate (requires eval.py from src/)
from src.eval import translate
result = translate(model, "Hello, how are you?", tokenizer, tokenizer,
bos_id=5, eos_id=6, pad_id=0, max_len=128,
device=torch.device("cpu"), beam_width=1)
print(result)
Repository Structure
| File | Description |
|---|---|
best_model.pt |
Model weights (state_dict format) |
tokenizer_shared_16k.json |
Shared BPE tokenizer (16K vocab, trained on full corpus) |
config.json |
Full model configuration and training hyperparameters |
src/model.py |
TransformerTranslator — complete encoder-decoder architecture |
src/tokenizer.py |
BPE tokenizer training, saving, loading, encoding, decoding |
src/training.py |
Full training loop with early stopping, warmup, cosine decay, AMP |
src/eval.py |
Greedy/beam decoding, chrF scoring, post-processing |
Experimental Journey
This project went through several iterations:
- TED Talks baseline — IWSLT TED Talks en-ms (~5K pairs). chrF 6.76. Dataset far too small.
- OPUS-100 pivot — Switched to OPUS-100 en-ms. chrF 26.39 with 10+2 architecture. Significant improvement but still limited by data quality.
- OpenSubtitles pivot — Moved to OpenSubtitles v2018 (17.3M raw pairs). Quality filtering pipeline developed.
- Ablation sweeps — Systematically tested encoder depth (2/4/6/8) and data size (50K/100K/200K/500K). Discovered data size is the dominant factor.
- 500K model (v1) — 6+2 tied Transformer, 16K BPE, 490K data, dropout 0.3. chrF 45.62.
- 2M model (v2, current) — Same architecture, 2M data, dropout 0.1, full-corpus tokenizer, proxy LR sweep. chrF on cleaned, case-normalized and beam=5: 52.14).
Limitations
- Domain specificity: Trained exclusively on movie/TV subtitles — performance degrades on formal, academic, or technical text.
- Subword fragmentation: Rare proper nouns and domain-specific terms get split into character-level fragments.
- No backtranslation or data augmentation: The model trains on natural parallel data only.
- Reference noise: OpenSubtitles contains ~13% garbage references (Indonesian instead of Malay, mojibake, truncated). True performance is higher than raw chrF suggests.
References
- Vaswani, A. et al. (2017). Attention is All You Need. NeurIPS.
- Kasai, J. et al. (2021). Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation. ICLR.
- Press, O. & Wolf, L. (2017). Using the Output Embedding to Improve Language Models. EACL.
- Xiong, R. et al. (2020). On Layer Normalization in the Transformer Architecture. ICML.
- Popović, M. (2015). chrF: character n-gram F-score for automatic MT evaluation. WMT.
- Sennrich, R. et al. (2016). Neural Machine Translation of Rare Words with Subword Units. ACL.
- Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models. arXiv.
- Lison, P. & Tiedemann, J. (2016). OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. LREC.
Citation
@misc{astralpotato2025enms,
title={English-Malay Neural Machine Translation with Deep Encoder, Shallow Decoder Transformer},
author={AstralPotato},
year={2025},
howpublished={IT3103 Advanced Topics in AI, Assignment 2, 2025S2},
}
- Downloads last month
- 37
Dataset used to train AstralPotato/en-ms-transformer
Space using AstralPotato/en-ms-transformer 1
Papers for AstralPotato/en-ms-transformer
On Layer Normalization in the Transformer Architecture
Scaling Laws for Neural Language Models
Attention Is All You Need
Using the Output Embedding to Improve Language Models
Evaluation results
- chrF (greedy, case-normalized) on OpenSubtitles v2018 en-mstest set self-reported51.650
- chrF (clean refs, case-normalized) on OpenSubtitles v2018 en-mstest set self-reported51.800
- chrF (clean refs, case-normalized, beam=5) on OpenSubtitles v2018 en-mstest set self-reported52.140