English → Malay Transformer (6+2 Tied, 16K BPE)

A custom encoder-decoder Transformer for English-to-Malay translation, built entirely from scratch in PyTorch. This model was developed as part of IT3103 Advanced Topics in AI — Assignment 2, 2025 Semester 2.

The project encompasses the full NMT pipeline: dataset curation, tokenizer training, architecture design with ablation studies, training with mixed-precision, and evaluation — all without using any pretrained models or high-level frameworks like Fairseq or OpenNMT.

Model Description

Component	Details
Architecture	6-layer encoder + 2-layer decoder, pre-norm Transformer
d_model / n_head / d_ff	512 / 8 / 2048
Vocab	16,000 shared BPE (English + Malay, joint)
Dropout	0.1
Parameters	~27M
Tied embeddings	Yes — encoder input, decoder input, and output projection share the same weight matrix (Press & Wolf, 2017)
Normalisation	Pre-norm (LayerNorm before attention/FFN, not after)

Design Decisions and Rationale

Why 6+2 (Deep Encoder, Shallow Decoder)?

The asymmetric 6+2 architecture is grounded in Kasai et al. (2021) — "Deep Encoder, Shallow Decoder: Reevaluating the Speed-Quality Tradeoff in Machine Translation". The core insight is that encoder depth contributes more to translation quality (richer source representations), while the decoder can be kept shallow without significant degradation. This was empirically validated by our own Ablation Sweep 1 (see below), which showed that encoder depths of 2, 4, 6, and 8 all produced similar chrF scores (22–25 range), indicating the model hits diminishing returns quickly. We chose 6 as a safe operating point.

The shallow 2-layer decoder provides a practical speed advantage: ~2× faster inference compared to a symmetric 6+6, since autoregressive decoding must run the decoder once per output token.

Why 16K shared vocabulary?

We initially trained with 50K vocabulary but found it too sparse for our data — most tokens appeared very infrequently, leaving embeddings under-trained. Reducing to 16K shared BPE produced denser embeddings and faster training. English and Malay share the Latin script with substantial lexical overlap (loanwords like "teknologi", "universiti"; numbers; proper nouns), making a joint vocabulary highly effective.

The tokenizer was trained on the full filtered OpenSubtitles corpus (~17M lines), not just the 2M training split. BPE only needs raw text frequency statistics — more text = better merge rules — and noisy translations don't hurt tokenizer training since it just counts character n-grams.

Why tied embeddings?

With a shared source-target vocabulary, tying the encoder embedding, decoder embedding, and output projection matrix (Press & Wolf, 2017) reduces the parameter count by ~8M while acting as a strong regulariser. The model learns a single semantic space for both languages.

Why dropout 0.1?

With 2M training sentences, aggressive dropout (0.3) would over-regularise. Dropout 0.1 is the standard Transformer default and was confirmed appropriate — the train-val gap remained small throughout training.

Training Data

Dataset: OpenSubtitles v2018 (English-Malay aligned parallel corpus)
Raw corpus size: ~17.3M parallel sentence pairs
After filtering: 2,010,000 pairs selected
Split: 2,000,000 train / 5,000 validation / 5,000 test (all in-distribution)

Data Preprocessing Pipeline

The raw OpenSubtitles corpus is notoriously noisy (subtitle artifacts, music symbols, HTML tags, near-duplicate lines). We applied the following quality filters:

Length filter: 3–80 words per side (removes fragments and overly long lines)
Length ratio filter: max(len_en, len_ms) / min(len_en, len_ms) ≤ 3.0 (removes misaligned pairs)
Character length filter: 10–400 characters per side
Junk pattern removal: Regex filter for music symbols (♪♫), HTML tags, bracket-only lines (e.g. [music playing]), ellipsis-only lines, dash-only lines
Deduplication: Case-insensitive exact match on the English side

Why OpenSubtitles over TED Talks?

We initially experimented with the IWSLT TED Talks dataset (~5K en-ms pairs) and achieved a chrF of only 6.76 — the dataset was far too small. We then moved to OpenSubtitles which provides orders of magnitude more data. Importantly, we evaluate on in-distribution OpenSubtitles test data rather than using TED Talks as an out-of-distribution test set, which would unfairly penalise the model for domain mismatch (conversational subtitles vs. formal TED lectures).

Proxy LR Sweep

Before the full 2M training, we ran a proxy LR sweep on a 200K subset (8 epochs, no early stopping) to select the learning rate without needing multiple expensive full-scale runs:

LR	Val Loss	chrF (greedy)	Best Epoch
3e-4	3.4254	43.46	8
5e-4	3.3471	44.17	7
7e-4	3.3375	43.81	8

Winner: LR = 5e-4 — highest chrF on the 5K test set.

Rationale: LR transfers well across data scales (Kaplan et al., 2020). Running the sweep on 200K avoids training on 2M multiple times, saving ~7 hours of GPU time.

Ablation Studies

We conducted two systematic ablation sweeps to guide architecture and data decisions. All sweeps used a 50K vocabulary baseline with 3 training epochs for efficiency.

Sweep 1: Encoder Depth

Fixed: 50K vocab, 500K data, 2-layer decoder, 3 epochs.

Encoder Layers	chrF (TED test)	Val Loss	Params
2	24.42	3.92	48.5M
4	22.37	3.84	61.5M
6	24.65	3.80	74.6M
8	22.91	3.76	87.6M

Finding: Encoder depth has flat returns on downstream chrF despite steadily decreasing validation loss. This suggests the TED Talks OOD test set was the bottleneck (confirmed later), not model capacity. We selected 6 layers as the sweet spot.

Sweep 2: Training Data Size

Fixed: 50K vocab, 6+2 architecture, 3 epochs.

Train Size	chrF (TED test)	Val Loss
50K	16.67	4.50
100K	19.60	4.11
200K	22.47	3.93
500K	26.50	3.75

Finding: chrF scales approximately linearly with log(data size) — a ~3.3 chrF improvement per doubling. This confirmed that data volume is the dominant factor for translation quality at this scale, motivating our final model to use 2M sentences.

Training Details

Setting	Value
Optimizer	AdamW (lr=5e-4, β₁=0.9, β₂=0.98, ε=1e-9)
Schedule	Linear warmup (8,000 steps, ~0.5 epochs) → cosine decay to 0
Batch size	128
Max sequence length	128 tokens
Epochs	17 of 20 max (early stopping, patience=3)
Best epoch	17 (val loss 2.8176)
Label smoothing	0.1
Gradient clipping	max_norm=1.0
Dropout	0.1
AMP	fp16 mixed precision (PyTorch GradScaler)

Evaluation Results

Evaluated on 5,000 held-out in-distribution OpenSubtitles test sentences with post-processing applied.

All chrF scores are case-normalized (both hypothesis and reference lowercased before scoring). This is the fair metric because our BPE tokenizer applies NFKC + lowercase normalization — the model cannot produce cased output, so penalizing it for case mismatches against mixed-case references would be unfair.

Metric	Score
chrF (greedy, case-normalized)	51.65
chrF (clean refs, case-normalized)	51.80
chrF (clean refs, case-normalized, beam=5)	52.14
Best validation loss	2.8176

Reference Quality Analysis

OpenSubtitles community translations contain noise that deflates chrF:

UTF-8 corruption / mojibake (â€™, Â, ï¿½, etc.)
Truncated references that drop half the source sentence
Untranslated references left in English
Indonesian contamination (OpenSubtitles "ms" is heavily mixed with Bahasa Indonesia)
ALL-CAPS (burnt-in subtitle OCR artifacts — chrF is case-sensitive)
Leading dashes (subtitle speaker indicators)

We automatically filter these using heuristics + langid language identification:

Clean references: 4,339 (86.8%) → chrF 51.80 (case-normalized)
Garbage references: 661 (13.2%) → chrF dragged down to 51.65

With beam search, it increases to 52.14

The true model performance is better represented by the clean-ref case-normalized score of 52.14 chrF.

Post-Processing

The BPE tokenizer uses a Whitespace pre-tokenizer without continuation markers, so raw decode() output contains spurious spaces before punctuation. We apply lightweight regex-based post-processing:

Removes spaces before punctuation marks (. , ? ! ; :)
Removes spaces after opening brackets/quotes
Collapses spaced hyphens in compound words
Capitalises the first character

Sample Translations

#	English (Source)	Reference (Malay)	Model Output
1	Heather, you in here?	Heather, awak ada di sini?	Heather, awak di sini?
2	Hey, dude, why do you run?	Hey, dude, mengapa anda menjalankan?	Hei, kawan, kenapa kau lari?
3	What about your wife and daughter?	Bagaimana dengan isteri dan anak perempuan awak?	Bagaimana dengan isteri dan anak perempuan awak?
4	Thank you, gentlemen.	Terima kasih, tuan-tuan semua.	Terima kasih, tuan-tuan.
5	We'll be ready for the shipment.	Kami akan bersedia untuk penghantaran.	Kami akan bersedia untuk penghantaran.
6	You're at home.	Awak ada di rumah.	Awak di rumah.
7	She may be dying and it's all my fault.	Dia mungkin akan mati dan semuanya salah saya.	Dia mungkin akan mati dan semuanya salah saya.

The model produces fluent, natural Malay that is often comparable or near-identical to the reference translations. Note that in some cases (e.g., #2), the model output is arguably better Malay than the reference — "kenapa kau lari?" is more natural than "mengapa anda menjalankan?".

Improvement over Previous Version

Version	Data	Dropout	LR	Warmup	chrF (greedy)
NB6 (v1, 500K)	490K	0.3	5e-4	4,000	45.62
NB8 (v2, 2M)	2M	0.1	5e-4	8,000	51.65
Δ	+4× data	−0.2	—	+4,000	+6.03

Key changes in this version:

4× more training data (490K → 2M) — the dominant factor
Reduced dropout (0.3 → 0.1) — less regularisation with more data
Full-corpus tokenizer — BPE trained on all ~17M filtered lines instead of just 490K
Proxy LR sweep — systematic LR selection instead of default
Longer warmup (4,000 → 8,000 steps) — scaled proportionally to data

Tokenizer

Type: Byte-Pair Encoding (BPE) via HuggingFace tokenizers library (Rust backend)
Vocab size: 16,000 (shared joint vocabulary for both English and Malay)
Normalization: NFKC Unicode normalisation + lowercase
Pre-tokenization: Whitespace splitting
Post-processing: [BOS] $A [EOS] template (auto-wraps encoded sequences)
Special tokens: [PAD]=0, [UNK]=1, [CLS]=2, [SEP]=3, [MASK]=4, [BOS]=5, [EOS]=6
Trained on: Full filtered OpenSubtitles corpus (~17M lines, both languages)

Why Shared BPE for en-ms?

English and Malay both use the Latin script with significant lexical overlap (loanwords: "teknologi", "matematik", "universiti"; numbers; proper nouns; punctuation). A joint BPE vocabulary captures cross-lingual subword patterns and directly enables tied embeddings. Malay's morphological affixes (me-, ber-, di-, -kan, -an, -i) are naturally learned as subword units by BPE, providing good coverage without an explicitly morphological tokenizer.

Usage

import torch
from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_file("tokenizer_shared_16k.json")

# Load model (requires model.py from src/)
from src.model import build_model

model = build_model(
    vocab_size=16000, pad_idx=0, device=torch.device("cpu"),
    d_model=512, n_head=8, num_encoder_layers=6, num_decoder_layers=2,
    d_ff=2048, dropout=0.1, max_len=144,
)
model.load_state_dict(torch.load("best_model.pt", map_location="cpu", weights_only=True))
model.eval()

# Translate (requires eval.py from src/)
from src.eval import translate
result = translate(model, "Hello, how are you?", tokenizer, tokenizer,
                   bos_id=5, eos_id=6, pad_id=0, max_len=128,
                   device=torch.device("cpu"), beam_width=1)
print(result)

Repository Structure

File	Description
`best_model.pt`	Model weights (`state_dict` format)
`tokenizer_shared_16k.json`	Shared BPE tokenizer (16K vocab, trained on full corpus)
`config.json`	Full model configuration and training hyperparameters
`src/model.py`	`TransformerTranslator` — complete encoder-decoder architecture
`src/tokenizer.py`	BPE tokenizer training, saving, loading, encoding, decoding
`src/training.py`	Full training loop with early stopping, warmup, cosine decay, AMP
`src/eval.py`	Greedy/beam decoding, chrF scoring, post-processing

Experimental Journey

This project went through several iterations:

TED Talks baseline — IWSLT TED Talks en-ms (~5K pairs). chrF 6.76. Dataset far too small.
OPUS-100 pivot — Switched to OPUS-100 en-ms. chrF 26.39 with 10+2 architecture. Significant improvement but still limited by data quality.
OpenSubtitles pivot — Moved to OpenSubtitles v2018 (17.3M raw pairs). Quality filtering pipeline developed.
Ablation sweeps — Systematically tested encoder depth (2/4/6/8) and data size (50K/100K/200K/500K). Discovered data size is the dominant factor.
500K model (v1) — 6+2 tied Transformer, 16K BPE, 490K data, dropout 0.3. chrF 45.62.
2M model (v2, current) — Same architecture, 2M data, dropout 0.1, full-corpus tokenizer, proxy LR sweep. chrF on cleaned, case-normalized and beam=5: 52.14).

Limitations

Domain specificity: Trained exclusively on movie/TV subtitles — performance degrades on formal, academic, or technical text.
Subword fragmentation: Rare proper nouns and domain-specific terms get split into character-level fragments.
No backtranslation or data augmentation: The model trains on natural parallel data only.
Reference noise: OpenSubtitles contains ~13% garbage references (Indonesian instead of Malay, mojibake, truncated). True performance is higher than raw chrF suggests.

References

Vaswani, A. et al. (2017). Attention is All You Need. NeurIPS.
Kasai, J. et al. (2021). Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation. ICLR.
Press, O. & Wolf, L. (2017). Using the Output Embedding to Improve Language Models. EACL.
Xiong, R. et al. (2020). On Layer Normalization in the Transformer Architecture. ICML.
Popović, M. (2015). chrF: character n-gram F-score for automatic MT evaluation. WMT.
Sennrich, R. et al. (2016). Neural Machine Translation of Rare Words with Subword Units. ACL.
Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models. arXiv.
Lison, P. & Tiedemann, J. (2016). OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. LREC.

Citation

@misc{astralpotato2025enms,
  title={English-Malay Neural Machine Translation with Deep Encoder, Shallow Decoder Transformer},
  author={AstralPotato},
  year={2025},
  howpublished={IT3103 Advanced Topics in AI, Assignment 2, 2025S2},
}

Downloads last month: 6

Dataset used to train AstralPotato/en-ms-transformer

Space using AstralPotato/en-ms-transformer 1

Papers for AstralPotato/en-ms-transformer

Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation

Paper • 2006.10369 • Published Jun 18, 2020

Evaluation results

chrF (greedy, case-normalized) on OpenSubtitles v2018 en-ms
test set self-reported

51.650
chrF (clean refs, case-normalized) on OpenSubtitles v2018 en-ms
test set self-reported

51.800
chrF (clean refs, case-normalized, beam=5) on OpenSubtitles v2018 en-ms
test set self-reported

52.140