AstralPotato committed on
Commit
e7f17a4
·
verified ·
1 Parent(s): fa33ab2

Upload en-ms Transformer (6+2 Tied, 16K BPE, chrF 45.62)

Files changed (8)
  1. README.md +292 -0
  2. best_model.pt +3 -0
  3. config.json +36 -0
  4. src/eval.py +315 -0
  5. src/model.py +287 -0
  6. src/tokenizer.py +360 -0
  7. src/training.py +370 -0
  8. tokenizer_shared_16k.json +0 -0
README.md ADDED
@@ -0,0 +1,292 @@
---
language:
- en
- ms
tags:
- translation
- transformer
- en-ms
- pytorch
- bpe
- encoder-decoder
- tied-embeddings
- deep-encoder-shallow-decoder
license: mit
datasets:
- open_subtitles
metrics:
- chrf
pipeline_tag: translation
model-index:
- name: en-ms-transformer-6+2-tied
  results:
  - task:
      type: translation
      name: Translation
    dataset:
      name: OpenSubtitles v2018 en-ms
      type: open_subtitles
      split: test
    metrics:
    - type: chrf
      value: 45.62
      name: chrF (greedy)
    - type: chrf
      value: 44.99
      name: chrF (beam=5)
---

# English → Malay Transformer (6+2 Tied, 16K BPE)

A custom encoder-decoder Transformer for English-to-Malay translation, built entirely from scratch in PyTorch. This model was developed as part of **IT3103 Advanced Topics in AI, Assignment 2, 2025 Semester 2**.

The project covers the full NMT pipeline: dataset curation, tokenizer training, architecture design with ablation studies, mixed-precision training, and evaluation, all without any pretrained models or high-level frameworks such as Fairseq or OpenNMT.

## Model Description

| Component | Details |
|---|---|
| **Architecture** | 6-layer encoder + 2-layer decoder, pre-norm Transformer |
| **d_model / n_head / d_ff** | 512 / 8 / 2048 |
| **Vocab** | 16,000 shared BPE (English + Malay, joint) |
| **Dropout** | 0.3 |
| **Parameters** | ~36.6M |
| **Tied embeddings** | Yes: encoder input, decoder input, and output projection share one weight matrix (Press & Wolf, 2017) |
| **Normalisation** | Pre-norm (LayerNorm before each attention/FFN sublayer, not after) |

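The ~36.6M figure can be sanity-checked with a back-of-envelope count of the weight matrices. This sketch counts only the large matrices; biases, LayerNorm parameters, and `nn.Transformer` implementation details make up the small remainder:

```python
# Back-of-envelope parameter count for the 6+2 tied configuration.
d_model, d_ff, vocab = 512, 2048, 16_000

emb = vocab * d_model            # one tied embedding matrix (~8.2M)
attn = 4 * d_model * d_model     # Q, K, V, and output projections
ffn = 2 * d_model * d_ff         # two feed-forward linear layers
enc_layer = attn + ffn           # ~3.1M per encoder layer
dec_layer = 2 * attn + ffn       # self-attn + cross-attn + FFN, ~4.2M per decoder layer

total = emb + 6 * enc_layer + 2 * dec_layer
print(f"{total / 1e6:.1f}M")     # 35.5M
```

This lands within a few percent of the reported ~36.6M; without tying, two extra 16,000 × 512 matrices would add another ~16.4M.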
### Design Decisions and Rationale

**Why 6+2 (Deep Encoder, Shallow Decoder)?**

The asymmetric 6+2 architecture follows [Kasai et al. (2021)](https://arxiv.org/abs/2006.10369), *"Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation"*. The core insight is that encoder depth contributes more to translation quality (richer source representations), while the decoder can be kept shallow without significant degradation. Our own **Ablation Sweep 1** (see below) supports this empirically: encoder depths of 2, 4, 6, and 8 all produced similar chrF scores (22–25 range), indicating the model hits diminishing returns quickly. We chose 6 as a safe operating point.

The shallow 2-layer decoder also brings a practical speed advantage: roughly 2× faster inference than a symmetric 6+6, since autoregressive decoding must run the decoder once per output token.

**Why 16K shared vocabulary?**

We initially trained with a 50K vocabulary but found it too sparse for 500K training sentences: most tokens appeared very infrequently, leaving their embeddings under-trained. Reducing to 16K shared BPE produced denser embeddings and a 2.5× speedup per epoch (7.8 min vs. an estimated ~20 min at 50K). English and Malay share the Latin script with substantial lexical overlap (loanwords like "teknologi" and "universiti", numbers, proper nouns), making a joint vocabulary highly effective.

**Why tied embeddings?**

With a shared source-target vocabulary, tying the encoder embedding, decoder embedding, and output projection matrix (Press & Wolf, 2017) cuts the parameter count by ~16M while acting as a strong regulariser. The model learns a single semantic space for both languages.

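In PyTorch the three-way tie amounts to sharing one `Parameter`. A minimal sketch (variable names are illustrative, not taken from `src/model.py`):

```python
import torch.nn as nn

d_model, vocab = 512, 16_000

# One embedding matrix serves three roles: encoder input, decoder input,
# and (transposed) the decoder's output projection.
shared = nn.Embedding(vocab, d_model)
generator = nn.Linear(d_model, vocab, bias=False)
generator.weight = shared.weight  # tie: both modules hold the same Parameter object

# Counting unique parameters confirms a single 16000x512 matrix is stored
# instead of three, saving 2 * 16000 * 512 ≈ 16.4M parameters.
unique = {id(p): p for m in (shared, generator) for p in m.parameters()}
print(sum(p.numel() for p in unique.values()))  # 8192000
```

Gradients from both the input lookup and the output projection flow into the same matrix, which is the regularising effect mentioned above.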
**Why dropout 0.3?**

490K training sentences is relatively small for a Transformer, so dropout 0.3 was chosen as aggressive regularisation against overfitting. The training curves confirm this was appropriate: the gap between train loss (3.17) and val loss (3.21) stayed small throughout, with no sign of overfitting even at epoch 20.

## Training Data

- **Dataset:** [OpenSubtitles v2018](https://opus.nlpl.eu/OpenSubtitles-v2018.php) (English-Malay aligned parallel corpus)
- **Raw corpus size:** ~17.3M parallel sentence pairs
- **After filtering:** 500,000 pairs selected
- **Split:** 490,000 train / 5,000 validation / 5,000 test (all in-distribution)

### Data Preprocessing Pipeline

The raw OpenSubtitles corpus is notoriously noisy (subtitle artifacts, music symbols, HTML tags, near-duplicate lines). We applied the following quality filters:

1. **Length filter:** 3–80 words per side (removes fragments and overly long lines)
2. **Length ratio filter:** max(len_en, len_ms) / min(len_en, len_ms) ≤ 3.0 (removes misaligned pairs)
3. **Character length filter:** 10–400 characters per side
4. **Junk pattern removal:** regex filter for music symbols (♪♫), HTML tags, bracket-only lines (e.g. `[music playing]`), ellipsis-only lines, dash-only lines
5. **Deduplication:** case-insensitive exact match on the English side

This pipeline retains ~500K high-quality pairs from the first ~2.7M lines scanned.

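The filters are simple enough to sketch in a few lines. The regexes below are illustrative stand-ins, not the repo's exact patterns:

```python
import re

# Junk: music symbols, HTML tags, bracket-only / ellipsis-only / dash-only lines
JUNK = re.compile(r'♪|♫|<[^>]+>|^\s*\[[^\]]*\]\s*$|^[\s.…]+$|^[-\s]+$')

def keep_pair(en: str, ms: str) -> bool:
    """Apply quality filters 1-4 from the pipeline above (sketch)."""
    for side in (en, ms):
        if not (3 <= len(side.split()) <= 80):   # 1. word-length filter
            return False
        if not (10 <= len(side) <= 400):         # 3. character-length filter
            return False
        if JUNK.search(side):                    # 4. junk pattern removal
            return False
    n_en, n_ms = len(en.split()), len(ms.split())
    return max(n_en, n_ms) / min(n_en, n_ms) <= 3.0  # 2. length-ratio filter

seen = set()
def dedup(en: str) -> bool:
    """5. Case-insensitive exact-match dedup on the English side."""
    key = en.lower()
    if key in seen:
        return False
    seen.add(key)
    return True
```

Streaming the raw corpus through `keep_pair` and `dedup` until 500K pairs survive reproduces the selection step described above.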
### Why OpenSubtitles over TED Talks?

We initially experimented with the [IWSLT TED Talks](https://huggingface.co/datasets/IWSLT/ted_talks_iwslt) dataset (~5K en-ms pairs) and reached a chrF of only **6.76**: the dataset was far too small. We then moved to OpenSubtitles, which provides orders of magnitude more data. Importantly, we evaluate on **in-distribution** OpenSubtitles test data rather than using TED Talks as an out-of-distribution test set, which would unfairly penalise the model for domain mismatch (conversational subtitles vs. formal TED lectures).

## Ablation Studies

We conducted two systematic ablation sweeps to guide architecture and data decisions. All sweeps used a 50K-vocabulary baseline trained for 3 epochs for efficiency.

### Sweep 1: Encoder Depth

Fixed: 50K vocab, 500K data, 2-layer decoder, 3 epochs.

| Encoder Layers | chrF (TED test) | Val Loss | Params |
|---|---|---|---|
| 2 | 24.42 | 3.92 | 48.5M |
| 4 | 22.37 | 3.84 | 61.5M |
| 6 | 24.65 | 3.80 | 74.6M |
| 8 | 22.91 | 3.76 | 87.6M |

**Finding:** Encoder depth has **flat returns** on downstream chrF despite steadily decreasing validation loss. This suggests the TED Talks OOD test set was the bottleneck (confirmed later), not model capacity. We selected 6 layers as the sweet spot: the best sweep chrF (24.65) at moderate cost, and well-supported by the Kasai et al. finding.

### Sweep 2: Training Data Size

Fixed: 50K vocab, 6+2 architecture, 3 epochs.

| Train Size | chrF (TED test) | Val Loss |
|---|---|---|
| 50K | 16.67 | 4.50 |
| 100K | 19.60 | 4.11 |
| 200K | 22.47 | 3.93 |
| 500K | 26.50 | 3.75 |

**Finding:** chrF scales **approximately linearly with log(data size)**, gaining roughly 3 chrF per doubling. This confirmed that **data volume is the dominant factor** for translation quality at this scale, far more impactful than architectural changes, and it motivated using the maximum feasible data (490K after filtering) for the final model.

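The per-doubling figure can be recomputed from the Sweep 2 table with a least-squares fit of chrF against log₂(train size):

```python
import math

# (train size, chrF) pairs from Sweep 2
points = [(50_000, 16.67), (100_000, 19.60), (200_000, 22.47), (500_000, 26.50)]

# Slope of chrF vs. log2(size) = chrF gained per data doubling.
xs = [math.log2(n) for n, _ in points]
ys = [c for _, c in points]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
print(f"{slope:.2f} chrF per doubling")  # 2.95 chrF per doubling
```

Extrapolating a log-linear trend this far past the measured range is of course optimistic, but it sets the expectation behind the "scale data to 2M+" item in Future Work.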
## Training Details

| Setting | Value |
|---|---|
| Optimizer | AdamW (lr=5e-4, β₁=0.9, β₂=0.98, ε=1e-9) |
| Schedule | Linear warmup (4,000 steps) → cosine decay to 0 |
| Batch size | 128 |
| Max sequence length | 128 tokens |
| Epochs | 20 (early stopping patience=3, did not trigger) |
| Label smoothing | 0.1 |
| Gradient clipping | max_norm=1.0 |
| AMP | fp16 mixed precision (PyTorch GradScaler) |
| Hardware | NVIDIA RTX 5070 Ti (16GB VRAM), CUDA 13.1 |
| Training time | **2.62 hours** (157 min, ~7.85 min/epoch) |

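The optimizer and schedule rows translate to roughly the following. This is a sketch, not the repo's code: `total_steps` is estimated from batch size and epochs (490,000 / 128 ≈ 3,829 steps/epoch × 20), and `src/training.py` is the authoritative implementation:

```python
import math
import torch

warmup, total_steps = 4_000, 76_580  # illustrative estimate, not a logged value

def lr_lambda(step: int) -> float:
    """Linear warmup to the peak LR, then cosine decay to 0."""
    if step < warmup:
        return step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

model = torch.nn.Linear(8, 8)  # stand-in for the real model
opt = torch.optim.AdamW(model.parameters(), lr=5e-4, betas=(0.9, 0.98), eps=1e-9)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)  # call sched.step() per batch
```

The multiplier rises from 0 to 1 over the first 4,000 steps and then decays to 0, matching the LR column in the progression table below (peak ~3.7e-4 mid-training, near-zero by epoch 20).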
### Training Progression

| Epoch | Train Loss | Val Loss | LR |
|---|---|---|---|
| 1 | 5.4036 | 4.2485 | 6.5e-5 |
| 5 | 3.5888 | 3.4519 | 3.7e-4 |
| 10 | 3.3605 | 3.2986 | 3.7e-4 |
| 15 | 3.2268 | 3.2346 | 1.8e-4 |
| 20 | 3.1683 | 3.2110 | 4.9e-6 |

The model converged smoothly with no overfitting: the train-val gap stayed under 0.05 throughout. The cosine LR decay let the final epochs squeeze out the last bits of improvement (val loss 3.23 at epoch 15 → 3.21 at epoch 20).

## Evaluation Results

Evaluated on **5,000 held-out in-distribution** OpenSubtitles test sentences with post-processing applied.

| Decoding Strategy | chrF |
|---|---|
| Greedy | **45.62** |
| Beam search (beam=5, length_penalty=0.6) | 44.99 |

### Post-Processing

The BPE tokenizer uses a `Whitespace` pre-tokenizer without continuation markers, so raw `decode()` output contains spurious spaces before punctuation (e.g., `"mendarat , tuan ."` instead of `"mendarat, tuan."`). We apply a lightweight regex-based post-processing step that:

1. Removes spaces before punctuation marks (`. , ? ! ; :`)
2. Removes spaces after opening brackets/quotes
3. Collapses spaced hyphens in compound words
4. Capitalises the first character

This post-processing improved chrF by **+1.05 points** (greedy: 44.57 → 45.62), a free gain with zero retraining.

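A condensed version of those four steps (the full implementation lives in `src/eval.py`):

```python
import re

def clean(text: str) -> str:
    """Spacing cleanup mirroring the four post-processing steps (sketch)."""
    text = re.sub(r'\s+([.,?!;:])', r'\1', text)      # 1. no space before punctuation
    text = re.sub(r'([(\[{"\'])\s+', r'\1', text)     # 2. no space after opening bracket/quote
    text = re.sub(r'\s*-\s*', '-', text)              # 3. collapse spaced hyphens
    text = text.strip()
    return text[0].upper() + text[1:] if text else text  # 4. capitalise first char

print(clean("mendarat , tuan ."))   # Mendarat, tuan.
print(clean("brother - in - arms !"))  # Brother-in-arms!
```

Because chrF matches character n-grams, removing these spurious spaces recovers n-grams that the raw decode output was breaking, which is where the +1.05 comes from.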
### Why Greedy > Beam Search?

Interestingly, greedy decoding outperforms beam search here. This is a known phenomenon in NMT: beam search with a length penalty can produce outputs slightly too long or too short for chrF's character n-gram matching, while greedy decoding produces more "natural length" outputs that happen to align better with the reference lengths in this corpus.

### Sample Translations

| # | English (Source) | Reference (Malay) | Model Output |
|---|---|---|---|
| 1 | Skywalker has just landed, lord. | Skywalker baru sahaja mendarat, tuan. | Skywalker baru mendarat, tuan. |
| 2 | Raymond, you like me? | Raymond, awak suka saya? | Raymond, awak suka saya? |
| 3 | She may be dying and it's all my fault. | Dia mungkin akan mati dan semuanya salah saya. | Dia mungkin akan mati dan semuanya salah saya. |
| 4 | He always remembers the cards. | Ia ingat kad. | Dia selalu ingat kad. |
| 5 | Hey, you wanna see something? | Hei, awak nak tengok sesuatu? | Hei, awak nak lihat sesuatu? |
| 6 | Why don't you just go talk to her? | Mengapa awak tidak bercakap dengannya? | Apa kata awak cakap dengan dia? |
| 7 | We still got that meat-lovers' pizza in the trunk. | Kita masih ada piza daging dalam but. | Kita masih ada piza daging di dalam but kereta. |

The model produces fluent, natural Malay that is often comparable or near-identical to the reference translations. Errors tend to occur on rare proper nouns (subword fragmentation) and idiomatic expressions.

## Tokenizer

- **Type:** Byte-Pair Encoding (BPE) via the HuggingFace `tokenizers` library (Rust backend)
- **Vocab size:** 16,000 (shared joint vocabulary for English and Malay)
- **Normalization:** NFKC Unicode normalisation + lowercase
- **Pre-tokenization:** whitespace splitting
- **Post-processing:** `[BOS] $A [EOS]` template (auto-wraps encoded sequences)
- **Special tokens:** `[PAD]=0, [UNK]=1, [CLS]=2, [SEP]=3, [MASK]=4, [BOS]=5, [EOS]=6`
- **Trained on:** the 490K training pairs only (980K sentences total), so no data leakage from val/test

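With the `tokenizers` library this recipe is a few lines. A sketch using a tiny stand-in corpus (the real run trained on the 980K train-split sentences and saved `tokenizer_shared_16k.json`):

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, processors, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence([normalizers.NFKC(), normalizers.Lowercase()])
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Special tokens are assigned IDs 0-6 in listing order.
trainer = trainers.BpeTrainer(
    vocab_size=16_000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "[BOS]", "[EOS]"],
)
corpus = ["hello world", "apa khabar dunia"]  # stand-in for the 980K train sentences
tokenizer.train_from_iterator(corpus, trainer)

# Auto-wrap every encoded sequence as [BOS] ... [EOS]
tokenizer.post_processor = processors.TemplateProcessing(
    single="[BOS] $A [EOS]",
    special_tokens=[("[BOS]", 5), ("[EOS]", 6)],
)
# tokenizer.save("tokenizer_shared_16k.json")
```

After this, `tokenizer.encode("hello world").ids` starts with 5 and ends with 6, which is why the decoding code below hard-codes `bos_id=5, eos_id=6`.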
### Why Shared BPE for en-ms?

English and Malay both use the Latin script with significant lexical overlap (loanwords such as "teknologi", "matematik", "universiti"; numbers; proper nouns; punctuation). A joint BPE vocabulary captures cross-lingual subword patterns and directly enables tied embeddings. Malay's morphological affixes (me-, ber-, di-, -kan, -an, -i) are naturally learned as subword units by BPE, giving good coverage without an explicitly morphological tokenizer.

## Usage

```python
import torch
from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_file("tokenizer_shared_16k.json")

# Load model (requires model.py from src/)
from src.model import build_model

model = build_model(
    vocab_size=16000, pad_idx=0, device=torch.device("cpu"),
    d_model=512, n_head=8, num_encoder_layers=6, num_decoder_layers=2,
    d_ff=2048, dropout=0.3, max_len=144,
)
model.load_state_dict(torch.load("best_model.pt", map_location="cpu", weights_only=True))
model.eval()

# Translate (requires eval.py from src/); use beam_width=1 for greedy decoding
from src.eval import translate
result = translate(model, "Hello, how are you?", tokenizer, tokenizer,
                   bos_id=5, eos_id=6, pad_id=0, max_len=128,
                   device=torch.device("cpu"), beam_width=5)
print(result)  # → "Hai, apa khabar?"
```

## Repository Structure

| File | Description |
|---|---|
| `best_model.pt` | Model weights (`state_dict` format, ~140MB) |
| `tokenizer_shared_16k.json` | Shared BPE tokenizer (16K vocab) |
| `config.json` | Full model configuration and training hyperparameters |
| `src/model.py` | `TransformerTranslator` — complete encoder-decoder architecture |
| `src/tokenizer.py` | BPE tokenizer training, saving, loading, encoding, decoding |
| `src/training.py` | Full training loop with early stopping, warmup, cosine decay, AMP |
| `src/eval.py` | Greedy/beam decoding, chrF scoring, post-processing |

## Experimental Journey

This project went through several iterations:

1. **TED Talks baseline:** IWSLT TED Talks en-ms (~5K pairs). chrF **6.76**. Dataset far too small.
2. **OPUS-100 pivot:** switched to OPUS-100 en-ms. chrF **26.39** with a 10+2 architecture. A significant improvement, but still limited by data quality.
3. **OpenSubtitles pivot:** moved to OpenSubtitles v2018 (17.3M raw pairs) and developed the quality-filtering pipeline.
4. **Ablation sweeps:** systematically tested encoder depth (2/4/6/8) and data size (50K/100K/200K/500K); discovered data size is the dominant factor.
5. **Final model:** 6+2 tied Transformer, 16K BPE, 490K data, dropout 0.3. chrF **44.57** (greedy, no post-processing).
6. **Post-processing fix:** added punctuation cleanup. chrF **45.62** (greedy), a free +1.05 improvement.

## Limitations and Future Work

### Current Limitations
- **Domain specificity:** trained exclusively on movie/TV subtitles, so performance degrades significantly on formal, academic, or technical text (the TED Talks test set gave chrF ~6–26 depending on configuration).
- **Subword fragmentation:** rare proper nouns and domain-specific terms get split into character-level fragments (e.g., "Burgundy" → "bur gun dy", "android" → "an dro id"). A larger vocabulary or byte-level fallback could mitigate this.
- **16K vocab trade-off:** the compact vocabulary gives dense embeddings but over-segments rare words; a 32K vocabulary might strike a better balance.
- **No backtranslation or data augmentation:** the model trains on natural parallel data only.

### Future Improvements
- **Scale data to 2M+:** our sweep shows roughly 3 chrF gained per data doubling, so ~2M sentences could push chrF toward ~50.
- **Reduce dropout to 0.1:** with more data, the aggressive 0.3 dropout likely over-regularises.
- **Byte-level fallback:** handle rare words more gracefully.
- **Ensemble decoding:** combine checkpoints from different training stages.

## References

- Vaswani, A. et al. (2017). [Attention Is All You Need](https://arxiv.org/abs/1706.03762). *NeurIPS*.
- Kasai, J. et al. (2021). [Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation](https://arxiv.org/abs/2006.10369). *ICLR*.
- Press, O. & Wolf, L. (2017). [Using the Output Embedding to Improve Language Models](https://arxiv.org/abs/1608.05859). *EACL*.
- Popović, M. (2015). [chrF: character n-gram F-score for automatic MT evaluation](https://aclanthology.org/W15-3049/). *WMT*.
- Sennrich, R. et al. (2016). [Neural Machine Translation of Rare Words with Subword Units](https://arxiv.org/abs/1508.07909). *ACL*.
- Lison, P. & Tiedemann, J. (2016). [OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles](http://www.lrec-conf.org/proceedings/lrec2016/pdf/947_Paper.pdf). *LREC*.

## Citation

```bibtex
@misc{astralpotato2025enms,
  title={English-Malay Neural Machine Translation with Deep Encoder, Shallow Decoder Transformer},
  author={AstralPotato},
  year={2025},
  howpublished={IT3103 Advanced Topics in AI, Assignment 2, 2025S2},
}
```
best_model.pt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:053df05b5a8c77434507d745eee6fff4c52cddfc72abf27330d71f1e8688c3e3
size 142469700
config.json ADDED
@@ -0,0 +1,36 @@
{
  "architecture": "TransformerTranslator (6+2 Tied)",
  "vocab_size": 16000,
  "d_model": 512,
  "n_head": 8,
  "num_encoder_layers": 6,
  "num_decoder_layers": 2,
  "d_ff": 2048,
  "dropout": 0.3,
  "max_len": 144,
  "pad_idx": 0,
  "bos_id": 5,
  "eos_id": 6,
  "tied_embeddings": true,
  "pre_norm": true,
  "label_smoothing": 0.1,
  "training": {
    "dataset": "OpenSubtitles v2018 en-ms",
    "train_size": 490000,
    "val_size": 5000,
    "test_size": 5000,
    "epochs_trained": 20,
    "batch_size": 128,
    "lr": 0.0005,
    "warmup_steps": 4000,
    "optimizer": "AdamW",
    "scheduler": "linear warmup + cosine decay",
    "amp": true
  },
  "evaluation": {
    "chrf_greedy": 45.62,
    "chrf_beam5_lp06": 44.99,
    "test_set": "5K in-distribution OpenSubtitles",
    "note": "chrF with post-processing (punctuation cleanup)"
  }
}
src/eval.py ADDED
@@ -0,0 +1,315 @@
"""
Evaluation module – greedy / beam-search decoding + chrF scoring.
=================================================================
Provides:
    • ``greedy_decode``      – auto-regressive greedy decoding.
    • ``beam_search_decode`` – beam search with length normalisation.
    • ``translate``          – end-to-end: raw English string → Malay string.
    • ``compute_chrf``       – corpus-level chrF score via *sacrebleu*.
    • ``evaluate``           – decode the full validation set, compute chrF,
                               and print sample translations.
"""

from __future__ import annotations

import re
from typing import List, Optional

import torch
import torch.nn as nn
from tokenizers import Tokenizer

import sacrebleu


# ──────────────────────────────────────────────────────────────────────
# 0. Post-processing: fix tokenizer spacing artefacts
# ──────────────────────────────────────────────────────────────────────
def postprocess_translation(text: str) -> str:
    """
    Clean up raw tokenizer decode output:
        1. Remove spaces before punctuation   (", tuan ." → ", tuan.")
        2. Remove spaces after opening brackets/quotes
        3. Remove spaces before closing brackets/quotes
        4. Capitalise the first letter
        5. Collapse multiple spaces
    """
    # Remove space before punctuation: . , ? ! ; : ) ] } ' " ...
    text = re.sub(r'\s+([.,?!;:)\]}"\'…])', r'\1', text)
    # Remove space after opening brackets/quotes
    text = re.sub(r'([(\[{"\'])\s+', r'\1', text)
    # Fix spaced hyphens in compound words (e.g. "brother - in - arms" → "brother-in-arms")
    text = re.sub(r'\s*-\s*', '-', text)
    # Collapse multiple spaces
    text = re.sub(r'\s{2,}', ' ', text)
    # Strip and capitalise
    text = text.strip()
    if text:
        text = text[0].upper() + text[1:]
    return text


# ──────────────────────────────────────────────────────────────────────
# 1. Greedy decoding
# ──────────────────────────────────────────────────────────────────────
@torch.no_grad()
def greedy_decode(
    model: nn.Module,
    src: torch.Tensor,
    bos_id: int,
    eos_id: int,
    pad_id: int = 0,
    max_len: int = 128,
) -> torch.Tensor:
    """
    Auto-regressive greedy decoding for a single source sequence.

    Parameters
    ----------
    model   : TransformerTranslator
    src     : (1, src_len) source token IDs.
    bos_id  : beginning-of-sentence token ID.
    eos_id  : end-of-sentence token ID.
    pad_id  : padding token ID.
    max_len : maximum decoding steps.

    Returns
    -------
    (1, out_len) generated token IDs (including [BOS], up to [EOS]).
    """
    device = src.device
    model.eval()

    # Encode source once
    src_pad_mask = (src == pad_id)
    memory = model.encode(src, src_key_padding_mask=src_pad_mask)

    # Start with [BOS]
    ys = torch.tensor([[bos_id]], dtype=torch.long, device=device)

    for _ in range(max_len - 1):
        logits = model.decode(
            ys, memory,
            memory_key_padding_mask=src_pad_mask,
        )  # (1, cur_len, vocab)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # (1, 1)
        ys = torch.cat([ys, next_token], dim=1)

        if next_token.item() == eos_id:
            break

    return ys


# ──────────────────────────────────────────────────────────────────────
# 1b. Beam-search decoding
# ──────────────────────────────────────────────────────────────────────
@torch.no_grad()
def beam_search_decode(
    model: nn.Module,
    src: torch.Tensor,
    bos_id: int,
    eos_id: int,
    pad_id: int = 0,
    max_len: int = 128,
    beam_width: int = 5,
    length_penalty: float = 0.6,
) -> torch.Tensor:
    """
    Beam-search decoding for a single source sequence.

    Parameters
    ----------
    model      : TransformerTranslator
    src        : (1, src_len) source token IDs.
    bos_id, eos_id, pad_id : special token IDs.
    max_len    : maximum decoding steps.
    beam_width : number of beams to keep at each step.
    length_penalty : α for length normalisation: score / len^α.

    Returns
    -------
    (1, out_len) best hypothesis token IDs (including [BOS], up to [EOS]).
    """
    device = src.device
    model.eval()

    # Encode source once
    src_pad_mask = (src == pad_id)
    memory = model.encode(src, src_key_padding_mask=src_pad_mask)

    # Each beam: (log_prob, token_ids_list)
    beams = [(0.0, [bos_id])]
    completed = []

    for _ in range(max_len - 1):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == eos_id:
                completed.append((score, tokens))
                continue

            ys = torch.tensor([tokens], dtype=torch.long, device=device)
            logits = model.decode(
                ys, memory,
                memory_key_padding_mask=src_pad_mask,
            )  # (1, cur_len, vocab)
            log_probs = torch.log_softmax(logits[:, -1, :], dim=-1).squeeze(0)

            topk_log_probs, topk_ids = log_probs.topk(beam_width)
            for k in range(beam_width):
                new_score = score + topk_log_probs[k].item()
                new_tokens = tokens + [topk_ids[k].item()]
                candidates.append((new_score, new_tokens))

        if not candidates:
            break

        # Keep top beam_width by length-normalised score
        candidates.sort(
            key=lambda x: x[0] / (len(x[1]) ** length_penalty),
            reverse=True,
        )
        beams = candidates[:beam_width]

        # Early exit if all beams have finished
        if all(b[1][-1] == eos_id for b in beams):
            completed.extend(beams)
            break

    # Add any remaining beams
    completed.extend(beams)

    # Pick best by length-normalised score
    best = max(
        completed,
        key=lambda x: x[0] / (len(x[1]) ** length_penalty),
    )
    return torch.tensor([best[1]], dtype=torch.long, device=device)


# ──────────────────────────────────────────────────────────────────────
# 2. Translate a raw string
# ──────────────────────────────────────────────────────────────────────
def translate(
    model: nn.Module,
    sentence: str,
    src_tokenizer: Tokenizer,
    tgt_tokenizer: Tokenizer,
    bos_id: int,
    eos_id: int,
    pad_id: int = 0,
    max_len: int = 128,
    device: Optional[torch.device] = None,
    beam_width: int = 1,
    length_penalty: float = 0.6,
) -> str:
    """Translate a single English sentence to Malay.

    Set beam_width=1 for greedy, >1 for beam search.
    """
    if device is None:
        device = next(model.parameters()).device

    # Tokenise source
    src_ids = src_tokenizer.encode(sentence).ids
    src = torch.tensor([src_ids], dtype=torch.long, device=device)

    # Decode
    if beam_width > 1:
        out_ids = beam_search_decode(
            model, src, bos_id, eos_id, pad_id, max_len,
            beam_width=beam_width, length_penalty=length_penalty,
        )
    else:
        out_ids = greedy_decode(model, src, bos_id, eos_id, pad_id, max_len)

    # Convert IDs → string (skip special tokens) + clean up spacing
    raw = tgt_tokenizer.decode(out_ids.squeeze(0).tolist(), skip_special_tokens=True)
    return postprocess_translation(raw)


# ──────────────────────────────────────────────────────────────────────
# 3. Corpus-level chrF
# ──────────────────────────────────────────────────────────────────────
def compute_chrf(hypotheses: List[str], references: List[str]) -> sacrebleu.CHRFScore:
    """
    Compute the corpus-level chrF score.

    Parameters
    ----------
    hypotheses : list[str]
        System outputs (decoded translations).
    references : list[str]
        Gold reference translations.

    Returns
    -------
    sacrebleu.CHRFScore – has a ``.score`` attribute (0–100 scale).
    """
    return sacrebleu.corpus_chrf(hypotheses, [references])


# ──────────────────────────────────────────────────────────────────────
# 4. Full evaluation driver
# ──────────────────────────────────────────────────────────────────────
def evaluate(
    model: nn.Module,
    hf_dataset,
    src_tokenizer: Tokenizer,
    tgt_tokenizer: Tokenizer,
    src_lang: str = "en",
    tgt_lang: str = "ms",
    bos_id: int = 5,
    eos_id: int = 6,
    pad_id: int = 0,
    max_len: int = 128,
    device: Optional[torch.device] = None,
    num_samples: int = 5,
    beam_width: int = 1,
    length_penalty: float = 0.6,
) -> float:
    """
    Decode every example in *hf_dataset*, compute corpus chrF, and
    print ``num_samples`` side-by-side translations.

    Set beam_width=1 for greedy, >1 for beam search.

    Returns
    -------
    chrf_score : float (0–100)
    """
    if device is None:
        device = next(model.parameters()).device

    model.eval()
    hypotheses: List[str] = []
    references: List[str] = []

    for example in hf_dataset:
        src_text = example["translation"][src_lang]
        ref_text = example["translation"][tgt_lang]

        hyp_text = translate(
            model, src_text,
            src_tokenizer, tgt_tokenizer,
            bos_id, eos_id, pad_id, max_len, device,
            beam_width=beam_width,
            length_penalty=length_penalty,
        )

        hypotheses.append(hyp_text)
        references.append(ref_text)

    chrf = compute_chrf(hypotheses, references)

    # Print samples
    print(f"\n{'='*60}")
    print(f"chrF Score: {chrf.score:.2f}")
    print(f"{'='*60}")
    for i in range(min(num_samples, len(hypotheses))):
        src_text = hf_dataset[i]["translation"][src_lang]
        print(f"\n[{i}] SRC: {src_text[:120]}")
        print(f"    REF: {references[i][:120]}")
        print(f"    HYP: {hypotheses[i][:120]}")

    return chrf.score
src/model.py ADDED
@@ -0,0 +1,287 @@
"""
6+2 Tied Transformer for English → Malay Translation
====================================================
An asymmetric encoder-decoder Transformer built on ``torch.nn.Transformer``.

Architecture (as shipped in this upload; see config.json):
    d_model           = 512   (embedding dimension, head_dim = 64)
    n_head            = 8     (attention heads)
    encoder layers    = 6     (deep encoder for source understanding)
    decoder layers    = 2     (shallow decoder for fast generation)
    d_ff              = 2048  (feed-forward inner dimension)
    dropout           = 0.3
    norm_first        = True  (pre-norm for training stability)
    shared embeddings = True  (single vocab, en+ms share Latin script)
    tied output proj. = True  (output reuses embedding weights)

Key design choices (see architecture_report.md for full rationale):
    • **Asymmetric depth (Kasai et al., 2021):** Encoder depth drives
      translation quality; decoder depth can be aggressively reduced
      with minimal quality loss and ~2× faster inference.
    • **Shared vocabulary:** English and Malay both use Latin script with
      significant lexical overlap (loanwords, numbers, proper nouns).
      A joint BPE naturally captures cross-lingual subword patterns.
    • **Tied output projection (Press & Wolf, 2017):** The decoder's output
      linear layer reuses the shared embedding matrix, saving ~16M params
      at the 16K vocabulary and acting as a regulariser.
    • **Pre-layer normalisation (Xiong et al., 2020):** Places LayerNorm
      before each sublayer, stabilising training of the deep encoder.
    • Uses PyTorch's native ``nn.Transformer`` to keep FlashAttention /
      SDPA fused kernels active (PyTorch 2.0+).
"""
32
+
33
+ from __future__ import annotations
34
+
35
+ import math
36
+ from typing import Optional
37
+
38
+ import torch
39
+ import torch.nn as nn
40
+
41
+
42
+ # ---------------------------------------------------------------------------
43
+ # Positional Encoding (sinusoidal, from "Attention Is All You Need")
44
+ # ---------------------------------------------------------------------------
45
+ class PositionalEncoding(nn.Module):
46
+ """
47
+ Inject positional information via fixed sinusoidal signals.
48
+
49
+ PE(pos, 2i) = sin(pos / 10000^{2i / d_model})
50
+ PE(pos, 2i+1) = cos(pos / 10000^{2i / d_model})
51
+ """
52
+
53
+ def __init__(self, d_model: int, max_len: int = 5000, dropout: float = 0.1):
54
+ super().__init__()
55
+ self.dropout = nn.Dropout(p=dropout)
56
+
57
+ pe = torch.zeros(max_len, d_model) # (max_len, d_model)
58
+ position = torch.arange(0, max_len).unsqueeze(1).float() # (max_len, 1)
59
+ div_term = torch.exp(
60
+ torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
61
+ ) # (d_model/2,)
62
+
63
+ pe[:, 0::2] = torch.sin(position * div_term)
64
+ pe[:, 1::2] = torch.cos(position * div_term)
65
+ pe = pe.unsqueeze(0) # (1, max_len, d_model)
66
+ self.register_buffer("pe", pe)
67
+
68
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
69
+ """
70
+ Args:
71
+ x: (batch, seq_len, d_model)
72
+ Returns:
73
+ (batch, seq_len, d_model) with positional encoding added.
74
+ """
75
+ x = x + self.pe[:, : x.size(1)]
76
+ return self.dropout(x)
77
+
78
+
79
+ # ---------------------------------------------------------------------------
80
+ # Full Transformer Model (10+2 Tied)
81
+ # ---------------------------------------------------------------------------
82
+ class TransformerTranslator(nn.Module):
83
+ """
84
+ Asymmetric encoder-decoder Transformer with shared/tied embeddings.
85
+
86
+ Parameters
87
+ ----------
88
+ vocab_size : int
89
+ Size of the shared source+target vocabulary.
90
+ d_model : int
91
+ Embedding / hidden dimension.
92
+ n_head : int
93
+ Number of attention heads.
94
+ num_encoder_layers : int
95
+ Number of encoder blocks (default 10).
96
+ num_decoder_layers : int
97
+ Number of decoder blocks (default 2).
98
+ d_ff : int
99
+ Feed-forward inner dimension.
100
+ dropout : float
101
+ Dropout rate.
102
+ max_len : int
103
+ Maximum sequence length for positional encoding.
104
+ pad_idx : int
105
+ Padding token ID (used to create padding masks).
106
+ """
107
+
108
+ def __init__(
109
+ self,
110
+ vocab_size: int,
111
+ d_model: int = 512,
112
+ n_head: int = 8,
113
+ num_encoder_layers: int = 10,
114
+ num_decoder_layers: int = 2,
115
+ d_ff: int = 2048,
116
+ dropout: float = 0.1,
117
+ max_len: int = 512,
118
+ pad_idx: int = 0,
119
+ ):
120
+ super().__init__()
121
+ self.pad_idx = pad_idx
122
+ self.d_model = d_model
123
+
124
+ # --- Shared embedding (one matrix for both enc & dec) -------------
125
+ self.shared_embedding = nn.Embedding(vocab_size, d_model, padding_idx=pad_idx)
126
+ self.pos_encoding = PositionalEncoding(d_model, max_len, dropout)
127
+ self.embed_scale = math.sqrt(d_model)
128
+
129
+ # --- Core Transformer (asymmetric, pre-norm) ----------------------
130
+ self.transformer = nn.Transformer(
131
+ d_model=d_model,
132
+ nhead=n_head,
133
+ num_encoder_layers=num_encoder_layers,
134
+ num_decoder_layers=num_decoder_layers,
135
+ dim_feedforward=d_ff,
136
+ dropout=dropout,
137
+ batch_first=True,
138
+ norm_first=True, # pre-layer norm for stability
139
+ )
140
+
141
+ # --- Tied output projection (reuses embedding weights) ------------
142
+ # No separate nn.Linear β€” forward() uses F.linear with shared weights
143
+ self.output_bias = nn.Parameter(torch.zeros(vocab_size))
144
+
145
+ # --- Initialize weights -------------------------------------------
146
+ self._init_weights()
147
+
148
+ def _embed(self, tokens: torch.Tensor) -> torch.Tensor:
149
+ """Shared embedding + scale + positional encoding."""
150
+ return self.pos_encoding(self.shared_embedding(tokens) * self.embed_scale)
151
+
152
+ def _init_weights(self):
153
+ """Xavier-uniform initialization for embeddings."""
154
+ nn.init.normal_(self.shared_embedding.weight, mean=0, std=self.d_model ** -0.5)
155
+ # Zero out padding embedding
156
+ with torch.no_grad():
157
+ self.shared_embedding.weight[self.pad_idx].zero_()
158
+
159
+ # ------------------------------------------------------------------
160
+ # Mask utilities
161
+ # ------------------------------------------------------------------
162
+ @staticmethod
163
+ def generate_square_subsequent_mask(sz: int, device: torch.device) -> torch.Tensor:
164
+ """
165
+ Causal mask for the decoder: prevents attending to future positions.
166
+ Returns a (sz, sz) boolean mask where True = blocked.
167
+ """
168
+ return torch.triu(torch.ones(sz, sz, device=device, dtype=torch.bool), diagonal=1)
169
+
170
+ def _make_pad_mask(self, x: torch.Tensor) -> torch.Tensor:
171
+ """
172
+ Create a padding mask: True where token == pad_idx.
173
+ Shape: (batch, seq_len)
174
+ """
175
+ return x == self.pad_idx
176
+
177
+ # ------------------------------------------------------------------
178
+ # Forward
179
+ # ------------------------------------------------------------------
180
+ def forward(
181
+ self,
182
+ src: torch.Tensor,
183
+ tgt: torch.Tensor,
184
+ src_key_padding_mask: Optional[torch.Tensor] = None,
185
+ tgt_key_padding_mask: Optional[torch.Tensor] = None,
186
+ ) -> torch.Tensor:
187
+ """
188
+ Args:
189
+ src: (batch, src_len) source token IDs.
190
+ tgt: (batch, tgt_len) target token IDs (teacher-forced).
191
+
192
+ Returns:
193
+ logits: (batch, tgt_len, vocab_size)
194
+ """
195
+ # Build masks if not provided
196
+ if src_key_padding_mask is None:
197
+ src_key_padding_mask = self._make_pad_mask(src)
198
+ if tgt_key_padding_mask is None:
199
+ tgt_key_padding_mask = self._make_pad_mask(tgt)
200
+
201
+ # Causal mask for decoder
202
+ tgt_len = tgt.size(1)
203
+ tgt_mask = self.generate_square_subsequent_mask(tgt_len, tgt.device)
204
+
205
+ # Shared embeddings for both encoder and decoder
206
+ src_emb = self._embed(src)
207
+ tgt_emb = self._embed(tgt)
208
+
209
+ # Transformer forward
210
+ out = self.transformer(
211
+ src=src_emb,
212
+ tgt=tgt_emb,
213
+ tgt_mask=tgt_mask,
214
+ src_key_padding_mask=src_key_padding_mask,
215
+ tgt_key_padding_mask=tgt_key_padding_mask,
216
+ memory_key_padding_mask=src_key_padding_mask,
217
+ ) # (batch, tgt_len, d_model)
218
+
219
+ # Tied output projection: logits = out @ embedding_weights.T + bias
220
+ logits = torch.nn.functional.linear(out, self.shared_embedding.weight, self.output_bias)
221
+ return logits
222
+
223
+ # ------------------------------------------------------------------
224
+ # Inference helpers
225
+ # ------------------------------------------------------------------
226
+ def encode(self, src: torch.Tensor, src_key_padding_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
227
+ """Run only the encoder. Returns memory: (batch, src_len, d_model)."""
228
+ if src_key_padding_mask is None:
229
+ src_key_padding_mask = self._make_pad_mask(src)
230
+ src_emb = self._embed(src)
231
+ return self.transformer.encoder(src_emb, src_key_padding_mask=src_key_padding_mask)
232
+
233
+ def decode(
234
+ self,
235
+ tgt: torch.Tensor,
236
+ memory: torch.Tensor,
237
+ tgt_key_padding_mask: Optional[torch.Tensor] = None,
238
+ memory_key_padding_mask: Optional[torch.Tensor] = None,
239
+ ) -> torch.Tensor:
240
+ """Run only the decoder given encoder memory. Returns logits."""
241
+ if tgt_key_padding_mask is None:
242
+ tgt_key_padding_mask = self._make_pad_mask(tgt)
243
+ tgt_len = tgt.size(1)
244
+ tgt_mask = self.generate_square_subsequent_mask(tgt_len, tgt.device)
245
+ tgt_emb = self._embed(tgt)
246
+ out = self.transformer.decoder(
247
+ tgt_emb,
248
+ memory,
249
+ tgt_mask=tgt_mask,
250
+ tgt_key_padding_mask=tgt_key_padding_mask,
251
+ memory_key_padding_mask=memory_key_padding_mask,
252
+ )
253
+ return torch.nn.functional.linear(out, self.shared_embedding.weight, self.output_bias)
254
+
255
+
256
+ # ---------------------------------------------------------------------------
257
+ # Helper: count parameters
258
+ # ---------------------------------------------------------------------------
259
+ def count_parameters(model: nn.Module) -> int:
260
+ """Return the number of trainable parameters."""
261
+ return sum(p.numel() for p in model.parameters() if p.requires_grad)
262
+
263
+
264
+ # ---------------------------------------------------------------------------
265
+ # Helper: build model
266
+ # ---------------------------------------------------------------------------
267
+ def build_model(
268
+ vocab_size: int,
269
+ pad_idx: int = 0,
270
+ device: Optional[torch.device] = None,
271
+ **kwargs,
272
+ ) -> TransformerTranslator:
273
+ """
274
+ Build and return a TransformerTranslator with default hyperparameters.
275
+
276
+ Any kwarg (d_model, n_head, etc.) overrides the default.
277
+ """
278
+ if device is None:
279
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
280
+
281
+ model = TransformerTranslator(
282
+ vocab_size=vocab_size,
283
+ pad_idx=pad_idx,
284
+ **kwargs,
285
+ ).to(device)
286
+
287
+ return model
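The "~26M params" figure quoted for the tied output projection can be checked with back-of-the-envelope arithmetic: an untied model would carry a second `vocab_size × d_model` output matrix on top of the shared embedding. A quick sketch, assuming the 50K default vocabulary from `tokenizer.py` (the uploaded 16K-vocab checkpoint saves proportionally less, roughly 8M):

```python
# Back-of-the-envelope check for the "tied projection saves ~26M params" note.
# vocab_size = 50_000 is tokenizer.py's DEFAULT_VOCAB_SIZE; d_model = 512.
vocab_size = 50_000
d_model = 512

# Without tying, the decoder would need its own (vocab_size x d_model)
# output matrix in addition to the shared embedding matrix.
untied_extra = vocab_size * d_model
print(f"Extra parameters without weight tying: {untied_extra:,}")  # 25,600,000
```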
src/tokenizer.py ADDED
@@ -0,0 +1,360 @@
+ """
+ Byte-Pair Encoding (BPE) Tokenizer for English-Malay Translation
+ =================================================================
+ We support two modes:
+ 1. **Shared tokenizer** (preferred for 10+2 Tied Transformer):
+    A single BPE tokenizer trained on the concatenated en+ms corpus.
+    Both encoder and decoder share the same vocabulary.
+ 2. **Separate tokenizers** (legacy):
+    Two independent BPE tokenizers, one per language.
+
+ Why BPE?
+ • Handles subword units, so rare / unseen words are decomposed into
+   known subword pieces instead of mapping to [UNK].
+ • Malay is morphologically rich (prefixes: me-, ber-, di-; suffixes:
+   -kan, -an, -i). BPE naturally learns these affixes as subword units,
+   giving much better coverage than a word-level tokenizer.
+ • Keeps vocabulary compact while still reaching high coverage on both
+   English and Malay.
+
+ Why shared vocabulary for en-ms?
+ • Both languages use the Latin script with significant lexical overlap
+   (loanwords: "teknologi", "matematik", "universiti"; numbers; proper nouns).
+ • A joint BPE captures cross-lingual subword patterns and enables
+   tied embeddings in the model (Press & Wolf, 2017), saving ~26M params.
+
+ Design choices:
+ • NFKC normalisation + lowercase – ensures consistent encoding of
+   Unicode characters and removes casing noise.
+ • Whitespace pre-tokeniser – splits on spaces before BPE merges; simple
+   and effective for Latin-script languages.
+ • Special tokens:
+     [PAD]  – padding for uniform sequence lengths in batches
+     [UNK]  – fallback for unknown characters
+     [CLS]  – beginning-of-sequence / classification token
+     [SEP]  – separator (unused in basic seq2seq but reserved)
+     [MASK] – reserved for masked-LM pretraining objectives
+     [BOS]  – beginning of sentence (fed to decoder at step 0)
+     [EOS]  – end of sentence (signals the decoder to stop)
+ """
+
+ from __future__ import annotations
+
+ import os
+ import tempfile
+ from pathlib import Path
+ from typing import Iterator, List, Optional, Union
+
+ from tokenizers import Tokenizer
+ from tokenizers.models import BPE
+ from tokenizers.trainers import BpeTrainer
+ from tokenizers.pre_tokenizers import Whitespace
+ from tokenizers.normalizers import Sequence, NFKC, Lowercase
+ from tokenizers.processors import TemplateProcessing
+
+ # ---------------------------------------------------------------------------
+ # Constants
+ # ---------------------------------------------------------------------------
+ SPECIAL_TOKENS: List[str] = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "[BOS]", "[EOS]"]
+ PAD_TOKEN = "[PAD]"
+ UNK_TOKEN = "[UNK]"
+ CLS_TOKEN = "[CLS]"
+ SEP_TOKEN = "[SEP]"
+ MASK_TOKEN = "[MASK]"
+ BOS_TOKEN = "[BOS]"
+ EOS_TOKEN = "[EOS]"
+
+ DEFAULT_VOCAB_SIZE = 50_000
+ DEFAULT_MIN_FREQUENCY = 2
+
+
+ # ---------------------------------------------------------------------------
+ # Helper: write an iterator of strings to a temporary file (needed by the
+ # HuggingFace `tokenizers` training API which expects file paths).
+ # ---------------------------------------------------------------------------
+ def _write_texts_to_tmpfile(texts: Iterator[str]) -> str:
+     """Write an iterable of strings to a temp file, one per line. Returns path."""
+     tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False, encoding="utf-8")
+     for line in texts:
+         line = line.strip()
+         if line:
+             tmp.write(line + "\n")
+     tmp.close()
+     return tmp.name
+
+
+ # ---------------------------------------------------------------------------
+ # Core: build & train a BPE tokenizer
+ # ---------------------------------------------------------------------------
+ def build_tokenizer(
+     vocab_size: int = DEFAULT_VOCAB_SIZE,
+     min_frequency: int = DEFAULT_MIN_FREQUENCY,
+ ) -> tuple[Tokenizer, BpeTrainer]:
+     """
+     Create an *untrained* BPE tokenizer and its trainer.
+
+     Returns
+     -------
+     tokenizer : Tokenizer
+         Ready to call ``tokenizer.train(files, trainer)``.
+     trainer : BpeTrainer
+         Configured trainer instance.
+     """
+     tokenizer = Tokenizer(BPE(unk_token=UNK_TOKEN))
+
+     # --- Normalisation: NFKC (canonical Unicode) + lowercase -------------
+     tokenizer.normalizer = Sequence([NFKC(), Lowercase()])
+
+     # --- Pre-tokenisation: split on whitespace ---------------------------
+     tokenizer.pre_tokenizer = Whitespace()
+
+     # --- Trainer ---------------------------------------------------------
+     trainer = BpeTrainer(
+         vocab_size=vocab_size,
+         min_frequency=min_frequency,
+         special_tokens=SPECIAL_TOKENS,
+         show_progress=True,
+     )
+
+     return tokenizer, trainer
+
+
+ def train_tokenizer(
+     texts: Union[List[str], Iterator[str]],
+     vocab_size: int = DEFAULT_VOCAB_SIZE,
+     min_frequency: int = DEFAULT_MIN_FREQUENCY,
+     files: Optional[List[str]] = None,
+ ) -> Tokenizer:
+     """
+     Train a BPE tokenizer on the given texts **or** files.
+
+     Parameters
+     ----------
+     texts : list[str] or iterator of str
+         Raw sentences. Ignored when *files* is provided.
+     vocab_size : int
+         Target vocabulary size (default 50 000).
+     min_frequency : int
+         Minimum frequency for a pair to be merged.
+     files : list[str], optional
+         Paths to plain-text files (one sentence per line).
+
+     Returns
+     -------
+     Tokenizer
+         Trained tokenizer ready for encoding / decoding.
+     """
+     tokenizer, trainer = build_tokenizer(vocab_size, min_frequency)
+
+     if files is not None:
+         tokenizer.train(files, trainer)
+     else:
+         # Write texts to a temporary file so we can use the fast Rust trainer
+         tmp_path = _write_texts_to_tmpfile(iter(texts))
+         try:
+             tokenizer.train([tmp_path], trainer)
+         finally:
+             os.remove(tmp_path)
+
+     # --- Post-processing: wrap every encoded sequence with [BOS] … [EOS] -
+     bos_id = tokenizer.token_to_id(BOS_TOKEN)
+     eos_id = tokenizer.token_to_id(EOS_TOKEN)
+     tokenizer.post_processor = TemplateProcessing(
+         single="[BOS]:0 $A:0 [EOS]:0",
+         pair="[BOS]:0 $A:0 [EOS]:0 [BOS]:1 $B:1 [EOS]:1",
+         special_tokens=[
+             ("[BOS]", bos_id),
+             ("[EOS]", eos_id),
+         ],
+     )
+
+     return tokenizer
+
+
+ # ---------------------------------------------------------------------------
+ # Convenience wrappers for saving / loading
+ # ---------------------------------------------------------------------------
+ def save_tokenizer(tokenizer: Tokenizer, path: Union[str, Path]) -> None:
+     """Save a trained tokenizer to a JSON file."""
+     path = Path(path)
+     path.parent.mkdir(parents=True, exist_ok=True)
+     tokenizer.save(str(path))
+     print(f"[✓] Tokenizer saved → {path}")
+
+
+ def load_tokenizer(path: Union[str, Path]) -> Tokenizer:
+     """Load a previously saved tokenizer from a JSON file."""
+     tokenizer = Tokenizer.from_file(str(path))
+     print(f"[✓] Tokenizer loaded ← {path}")
+     return tokenizer
+
+
+ # ---------------------------------------------------------------------------
+ # Encoding / decoding helpers
+ # ---------------------------------------------------------------------------
+ def encode(tokenizer: Tokenizer, text: str) -> List[int]:
+     """Encode a single string and return token IDs (includes [BOS]/[EOS])."""
+     return tokenizer.encode(text).ids
+
+
+ def decode(tokenizer: Tokenizer, ids: List[int]) -> str:
+     """Decode token IDs back to a string, skipping special tokens."""
+     return tokenizer.decode(ids, skip_special_tokens=True)
+
+
+ def get_vocab_size(tokenizer: Tokenizer) -> int:
+     """Return the size of the tokenizer's vocabulary."""
+     return tokenizer.get_vocab_size()
+
+
+ def token_to_id(tokenizer: Tokenizer, token: str) -> Optional[int]:
+     """Look up the integer ID for a single token string."""
+     return tokenizer.token_to_id(token)
+
+
+ def id_to_token(tokenizer: Tokenizer, token_id: int) -> Optional[str]:
+     """Look up the token string for a single integer ID."""
+     return tokenizer.id_to_token(token_id)
+
+
+ # ---------------------------------------------------------------------------
+ # High-level: train a SHARED tokenizer on both languages (for tied embeddings)
+ # ---------------------------------------------------------------------------
+ def train_shared_tokenizer_from_dataset(
+     dataset,
+     src_lang: str = "en",
+     tgt_lang: str = "ms",
+     vocab_size: int = DEFAULT_VOCAB_SIZE,
+     save_dir: Union[str, Path] = "tokenizer",
+ ) -> Tokenizer:
+     """
+     Train a single shared BPE tokenizer on the concatenated en+ms corpus.
+
+     This is used with the 10+2 Tied Transformer architecture, where both
+     encoder and decoder share the same vocabulary and embedding matrix.
+
+     Parameters
+     ----------
+     dataset : datasets.Dataset
+         A HuggingFace dataset split where each example has a ``'translation'``
+         dict with keys for each language code.
+     src_lang : str
+         Source language code (default ``'en'``).
+     tgt_lang : str
+         Target language code (default ``'ms'``).
+     vocab_size : int
+         Vocabulary size for the shared tokenizer.
+     save_dir : str or Path
+         Directory to save the trained tokenizer JSON file.
+
+     Returns
+     -------
+     Tokenizer
+         A single shared tokenizer for both languages.
+     """
+     save_dir = Path(save_dir)
+
+     # Concatenate all source and target sentences into one corpus
+     src_texts = [example["translation"][src_lang] for example in dataset]
+     tgt_texts = [example["translation"][tgt_lang] for example in dataset]
+     all_texts = src_texts + tgt_texts
+
+     print(f"Training shared BPE tokenizer on {len(all_texts):,} sentences "
+           f"({len(src_texts):,} {src_lang} + {len(tgt_texts):,} {tgt_lang}) …")
+     shared_tokenizer = train_tokenizer(all_texts, vocab_size=vocab_size)
+     save_tokenizer(shared_tokenizer, save_dir / "tokenizer_shared.json")
+
+     # Sanity check
+     for name, sample in [(src_lang, src_texts[0]), (tgt_lang, tgt_texts[0])]:
+         enc = shared_tokenizer.encode(sample)
+         print(f"\n[{name}] Sample: {sample[:80]}…")
+         print(f"  Tokens : {enc.tokens[:15]}…")
+         print(f"  IDs    : {enc.ids[:15]}…")
+         print(f"  Decoded: {shared_tokenizer.decode(enc.ids, skip_special_tokens=True)[:80]}…")
+
+     print(f"\n[✓] Shared tokenizer trained and saved to {save_dir}/tokenizer_shared.json")
+     return shared_tokenizer
+
+
+ # ---------------------------------------------------------------------------
+ # High-level: train source (English) & target (Malay) tokenizers from a
+ # HuggingFace dataset split.
+ # ---------------------------------------------------------------------------
+ def train_tokenizers_from_dataset(
+     dataset,
+     src_lang: str = "en",
+     tgt_lang: str = "ms",
+     vocab_size: int = DEFAULT_VOCAB_SIZE,
+     save_dir: Union[str, Path] = "tokenizer",
+ ) -> tuple[Tokenizer, Tokenizer]:
+     """
+     Train separate BPE tokenizers for source and target languages.
+
+     Parameters
+     ----------
+     dataset : datasets.Dataset
+         A HuggingFace dataset split (e.g. ``dataset['train']``) where each
+         example has a ``'translation'`` dict with keys for each language code.
+     src_lang : str
+         Source language code (default ``'en'``).
+     tgt_lang : str
+         Target language code (default ``'ms'``).
+     vocab_size : int
+         Vocabulary size for each tokenizer.
+     save_dir : str or Path
+         Directory to save the trained tokenizer JSON files.
+
+     Returns
+     -------
+     (src_tokenizer, tgt_tokenizer)
+     """
+     save_dir = Path(save_dir)
+
+     # Extract raw sentences from the dataset
+     src_texts = [example["translation"][src_lang] for example in dataset]
+     tgt_texts = [example["translation"][tgt_lang] for example in dataset]
+
+     print(f"Training source tokenizer ({src_lang}) on {len(src_texts):,} sentences …")
+     src_tokenizer = train_tokenizer(src_texts, vocab_size=vocab_size)
+     save_tokenizer(src_tokenizer, save_dir / f"tokenizer_{src_lang}.json")
+
+     print(f"Training target tokenizer ({tgt_lang}) on {len(tgt_texts):,} sentences …")
+     tgt_tokenizer = train_tokenizer(tgt_texts, vocab_size=vocab_size)
+     save_tokenizer(tgt_tokenizer, save_dir / f"tokenizer_{tgt_lang}.json")
+
+     # Quick sanity check
+     for name, tok, sample in [
+         (src_lang, src_tokenizer, src_texts[0]),
+         (tgt_lang, tgt_tokenizer, tgt_texts[0]),
+     ]:
+         enc = tok.encode(sample)
+         print(f"\n[{name}] Sample: {sample[:80]}…")
+         print(f"  Tokens : {enc.tokens[:15]}…")
+         print(f"  IDs    : {enc.ids[:15]}…")
+         print(f"  Decoded: {tok.decode(enc.ids, skip_special_tokens=True)[:80]}…")
+
+     print(f"\n[✓] Both tokenizers trained and saved to {save_dir}/")
+     return src_tokenizer, tgt_tokenizer
+
+
+ # ---------------------------------------------------------------------------
+ # Standalone usage
+ # ---------------------------------------------------------------------------
+ if __name__ == "__main__":
+     from datasets import load_from_disk
+
+     print("Loading TED Talks IWSLT dataset (en ↔ ms, 2016) …")
+     ds = load_from_disk("dataset/en_ms_2016")
+
+     src_tok, tgt_tok = train_tokenizers_from_dataset(
+         ds,
+         src_lang="en",
+         tgt_lang="ms",
+         vocab_size=DEFAULT_VOCAB_SIZE,
+         save_dir="tokenizer",
+     )
+
+     print(f"\nEnglish vocab size : {get_vocab_size(src_tok):,}")
+     print(f"Malay vocab size   : {get_vocab_size(tgt_tok):,}")
+     print(f"[PAD] id (en)      : {token_to_id(src_tok, PAD_TOKEN)}")
+     print(f"[EOS] id (ms)      : {token_to_id(tgt_tok, EOS_TOKEN)}")
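The merge rule that `BpeTrainer` applies can be illustrated with a self-contained toy (an illustrative re-implementation, not the library's Rust code): repeatedly fuse the most frequent adjacent symbol pair. On a tiny Malay-flavoured corpus, the shared prefix "me-" is merged first, which is exactly how the affixes mentioned in the module docstring become single subword units. The corpus and frequencies below are made up for the example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of (symbol-tuple -> freq)."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Apply a single BPE merge to every word in the corpus."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # fuse the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: the prefix "me-" recurs, so ('m', 'e') wins the first merge.
corpus = {tuple("makan"): 5, tuple("memakan"): 3, tuple("melihat"): 4, tuple("me"): 2}
pair = most_frequent_pair(corpus)
print(pair)                        # ('m', 'e')
print(merge_pair(corpus, pair))    # "melihat" becomes ('me', 'l', 'i', 'h', 'a', 't')
```

The real trainer iterates this until `vocab_size` is reached and stores the merge order; this toy shows only a single step.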
src/training.py ADDED
@@ -0,0 +1,370 @@
1
+ """
2
+ Training loop for the Transformer translator.
3
+ ===============================================
4
+ Provides:
5
+ β€’ ``TranslationDataset`` – a PyTorch Dataset that tokenises and pads
6
+ source/target sentence pairs.
7
+ β€’ ``create_dataloaders`` – builds train / validation DataLoaders with
8
+ an 90/10 split.
9
+ β€’ ``train_one_epoch`` – one full pass over the training set.
10
+ β€’ ``evaluate_loss`` – average loss on the validation set.
11
+ β€’ ``train`` – full training driver with logging, LR
12
+ scheduling, checkpointing, and early stopping.
13
+
14
+ Design choices:
15
+ β€’ Label-smoothed cross-entropy (smoothing = 0.1) for better
16
+ generalisation.
17
+ β€’ AdamW with a linear-warmup + cosine-decay schedule (stable for
18
+ small datasets).
19
+ β€’ Mixed-precision (AMP) with ``torch.amp`` for speed / memory on T4.
20
+ β€’ Gradient clipping at max_norm = 1.0 to avoid exploding gradients.
21
+ """
22
+
23
+ from __future__ import annotations
24
+
25
+ import math
26
+ import os
27
+ import time
28
+ from dataclasses import dataclass, field
29
+ from pathlib import Path
30
+ from typing import List, Optional, Tuple
31
+
32
+ import torch
33
+ import torch.nn as nn
34
+ from torch.utils.data import Dataset, DataLoader, random_split
35
+ from tokenizers import Tokenizer
36
+
37
+
38
+ # ──────────────────────────────────────────────────────────────────────
39
+ # 1. Translation Dataset
40
+ # ──────────────────────────────────────────────────────────────────────
41
+ class TranslationDataset(Dataset):
42
+ """
43
+ Wraps a HuggingFace dataset of translation pairs into a PyTorch
44
+ Dataset that returns padded token-ID tensors.
45
+
46
+ Each ``__getitem__`` returns::
47
+ {
48
+ "src": LongTensor[max_len], # source token IDs (padded)
49
+ "tgt": LongTensor[max_len], # target input (with [BOS], no final [EOS])
50
+ "label": LongTensor[max_len], # target labels (no [BOS], with [EOS])
51
+ }
52
+
53
+ The *tgt* / *label* split implements **teacher forcing**: the decoder
54
+ receives ``[BOS] w1 w2 …`` and must predict ``w1 w2 … [EOS]``.
55
+ """
56
+
57
+ def __init__(
58
+ self,
59
+ hf_dataset,
60
+ src_tokenizer: Tokenizer,
61
+ tgt_tokenizer: Tokenizer,
62
+ src_lang: str = "en",
63
+ tgt_lang: str = "ms",
64
+ max_len: int = 128,
65
+ pad_id: int = 0,
66
+ ):
67
+ self.data = hf_dataset
68
+ self.src_tok = src_tokenizer
69
+ self.tgt_tok = tgt_tokenizer
70
+ self.src_lang = src_lang
71
+ self.tgt_lang = tgt_lang
72
+ self.max_len = max_len
73
+ self.pad_id = pad_id
74
+
75
+ def __len__(self) -> int:
76
+ return len(self.data)
77
+
78
+ def _pad(self, ids: List[int]) -> List[int]:
79
+ """Truncate to max_len, then right-pad with pad_id."""
80
+ ids = ids[: self.max_len]
81
+ return ids + [self.pad_id] * (self.max_len - len(ids))
82
+
83
+ def __getitem__(self, idx: int) -> dict:
84
+ pair = self.data[idx]["translation"]
85
+
86
+ # Encode (includes [BOS] … [EOS] from post-processor)
87
+ src_ids = self.src_tok.encode(pair[self.src_lang]).ids
88
+ tgt_ids = self.tgt_tok.encode(pair[self.tgt_lang]).ids
89
+
90
+ # Teacher-forcing split:
91
+ # tgt_input = [BOS] w1 w2 … wN (drop last token)
92
+ # tgt_label = w1 w2 … wN [EOS] (drop first token)
93
+ tgt_input = tgt_ids[:-1]
94
+ tgt_label = tgt_ids[1:]
95
+
96
+ return {
97
+ "src": torch.tensor(self._pad(src_ids), dtype=torch.long),
98
+ "tgt": torch.tensor(self._pad(tgt_input), dtype=torch.long),
99
+ "label": torch.tensor(self._pad(tgt_label), dtype=torch.long),
100
+ }
101
+
102
+
103
+ # ──────────────────────────────────────────────────────────────────────
104
+ # 2. DataLoader factory
105
+ # ──────────────────────────────────────────────────────────────────────
106
+ def create_dataloaders(
107
+ hf_dataset,
108
+ src_tokenizer: Tokenizer,
109
+ tgt_tokenizer: Tokenizer,
110
+ src_lang: str = "en",
111
+ tgt_lang: str = "ms",
112
+ max_len: int = 128,
113
+ batch_size: int = 32,
114
+ val_ratio: float = 0.1,
115
+ pad_id: int = 0,
116
+ seed: int = 42,
117
+ ) -> Tuple[DataLoader, DataLoader, TranslationDataset]:
118
+ """
119
+ Build training and validation DataLoaders from a HuggingFace dataset.
120
+
121
+ Returns
122
+ -------
123
+ train_loader, val_loader, full_dataset
124
+ """
125
+ full_ds = TranslationDataset(
126
+ hf_dataset, src_tokenizer, tgt_tokenizer,
127
+ src_lang, tgt_lang, max_len, pad_id,
128
+ )
129
+
+     val_size = max(1, int(len(full_ds) * val_ratio))
+     train_size = len(full_ds) - val_size
+
+     generator = torch.Generator().manual_seed(seed)
+     train_ds, val_ds = random_split(full_ds, [train_size, val_size], generator=generator)
+
+     train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True, drop_last=False)
+     val_loader = DataLoader(val_ds, batch_size=batch_size, shuffle=False, drop_last=False)
+
+     print(f"Train: {train_size} | Val: {val_size} | Batch size: {batch_size}")
+     return train_loader, val_loader, full_ds
+
+
+ # ──────────────────────────────────────────────────────────────────────
+ # 3. Training configuration dataclass
+ # ──────────────────────────────────────────────────────────────────────
+ @dataclass
+ class TrainConfig:
+     """All tuneable knobs in one place."""
+     epochs: int = 50
+     batch_size: int = 32
+     max_len: int = 128
+     lr: float = 5e-4
+     warmup_steps: int = 200
+     label_smoothing: float = 0.1
+     grad_clip: float = 1.0
+     use_amp: bool = True
+     val_ratio: float = 0.1
+     checkpoint_dir: str = "training/checkpoints"
+     log_every: int = 10   # print loss every N steps
+     patience: int = 10    # early-stopping patience (epochs)
+     seed: int = 42
+
+
+ # ──────────────────────────────────────────────────────────────────────
+ # 4. LR scheduler with linear warmup + cosine decay
+ # ──────────────────────────────────────────────────────────────────────
+ def _build_scheduler(optimizer, warmup_steps: int, total_steps: int):
+     """Linear warmup for `warmup_steps`, then cosine decay to 0."""
+
+     def lr_lambda(step):
+         if step < warmup_steps:
+             return step / max(1, warmup_steps)
+         progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
+         return 0.5 * (1.0 + math.cos(math.pi * progress))
+
+     return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
+
+
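The warmup-plus-cosine shape of `lr_lambda` can be sanity-checked in isolation. A minimal sketch; the step counts below are illustrative, not the model's actual settings:

```python
import math

def lr_lambda(step, warmup_steps=200, total_steps=10_000):
    # Multiplier applied to the base LR: linear ramp over `warmup_steps`,
    # then cosine decay down to 0 at `total_steps`.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_lambda(0))    # 0.0 (start of warmup)
print(lr_lambda(100))  # 0.5 (halfway through warmup)
print(lr_lambda(200))  # 1.0 (warmup done, peak LR)
print(round(lr_lambda(10_000), 6))  # 0.0 (fully decayed)
```

Because `LambdaLR` multiplies the optimiser's base LR by this factor, the peak learning rate is exactly `cfg.lr`, reached at step `warmup_steps`.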
+ # ──────────────────────────────────────────────────────────────────────
+ # 5. Single-epoch training
+ # ──────────────────────────────────────────────────────────────────────
+ def train_one_epoch(
+     model: nn.Module,
+     loader: DataLoader,
+     optimizer: torch.optim.Optimizer,
+     scheduler,
+     criterion: nn.Module,
+     device: torch.device,
+     scaler: Optional[torch.amp.GradScaler],
+     grad_clip: float = 1.0,
+     log_every: int = 10,
+     epoch: int = 0,
+ ) -> float:
+     """Train for one epoch. Returns average loss."""
+     model.train()
+     total_loss = 0.0
+     n_tokens = 0
+
+     for step, batch in enumerate(loader):
+         src = batch["src"].to(device)
+         tgt = batch["tgt"].to(device)
+         label = batch["label"].to(device)
+
+         optimizer.zero_grad()
+
+         amp_enabled = scaler is not None
+         with torch.amp.autocast("cuda", enabled=amp_enabled):
+             logits = model(src, tgt)  # (B, T, V)
+             loss = criterion(logits.reshape(-1, logits.size(-1)), label.reshape(-1))
+
+         if scaler is not None:
+             scaler.scale(loss).backward()
+             scaler.unscale_(optimizer)
+             nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
+             scaler.step(optimizer)
+             scaler.update()
+         else:
+             loss.backward()
+             nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
+             optimizer.step()
+
+         scheduler.step()
+
+         # Accumulate loss (ignore padding contribution)
+         non_pad = (label != model.pad_idx).sum().item()
+         total_loss += loss.item() * non_pad
+         n_tokens += non_pad
+
+         if (step + 1) % log_every == 0:
+             avg = total_loss / max(n_tokens, 1)
+             lr = scheduler.get_last_lr()[0]
+             print(f"  Epoch {epoch+1} | Step {step+1}/{len(loader)} | Loss {avg:.4f} | LR {lr:.2e}")
+
+     return total_loss / max(n_tokens, 1)
+
+
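Note that `train_one_epoch` weights each batch by its non-PAD token count rather than averaging batch means directly. A short sketch with made-up loss/token values shows the difference:

```python
# (mean loss over tokens, non-PAD token count) per batch — hypothetical values.
batches = [(2.0, 100), (4.0, 50)]

total_loss = sum(loss * n for loss, n in batches)  # 2.0*100 + 4.0*50 = 400.0
n_tokens = sum(n for _, n in batches)              # 150

token_avg = total_loss / max(n_tokens, 1)              # per-token average
batch_avg = sum(l for l, _ in batches) / len(batches)  # naive mean of batch means

print(token_avg)  # ~2.667 — the short high-loss batch counts for less
print(batch_avg)  # 3.0
```

The token-weighted form is the standard choice for NMT, since variable-length batches contain very different numbers of real (non-padding) targets.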
237
+ # ──────────────────────────────────────────────────────────────────────
238
+ # 6. Validation loss
239
+ # ──────────────────────────────────────────────────────────────────────
240
+ @torch.no_grad()
241
+ def evaluate_loss(
242
+ model: nn.Module,
243
+ loader: DataLoader,
244
+ criterion: nn.Module,
245
+ device: torch.device,
246
+ use_amp: bool = False,
247
+ ) -> float:
248
+ """Compute average loss over a validation set (with AMP to match training)."""
249
+ model.eval()
250
+ total_loss = 0.0
251
+ n_tokens = 0
252
+ n_batches = len(loader)
253
+
254
+ for step, batch in enumerate(loader):
255
+ src = batch["src"].to(device)
256
+ tgt = batch["tgt"].to(device)
257
+ label = batch["label"].to(device)
258
+
259
+ with torch.amp.autocast("cuda", enabled=use_amp):
260
+ logits = model(src, tgt)
261
+ loss = criterion(logits.reshape(-1, logits.size(-1)), label.reshape(-1))
262
+
263
+ non_pad = (label != model.pad_idx).sum().item()
264
+ total_loss += loss.item() * non_pad
265
+ n_tokens += non_pad
266
+
267
+ if (step + 1) % max(1, n_batches // 4) == 0 or (step + 1) == n_batches:
268
+ print(f" Val {step+1}/{n_batches}", end="\r")
269
+
270
+ return total_loss / max(n_tokens, 1)
271
+
272
+
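The value returned by `evaluate_loss` is a token-averaged cross-entropy in nats, so it maps to perplexity via `exp()`. With label smoothing enabled the reported loss, and hence this perplexity, is slightly inflated relative to plain cross-entropy. The value below is illustrative, not from the actual run:

```python
import math

val_loss = 2.0                 # hypothetical token-averaged CE in nats
perplexity = math.exp(val_loss)
print(round(perplexity, 2))    # 7.39
```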
+ # ──────────────────────────────────────────────────────────────────────
+ # 7. Full training driver
+ # ──────────────────────────────────────────────────────────────────────
+ def train(
+     model: nn.Module,
+     train_loader: DataLoader,
+     val_loader: DataLoader,
+     cfg: TrainConfig,
+     device: torch.device,
+     trial=None,
+ ) -> dict:
+     """
+     Full training loop with logging, checkpointing, and early stopping.
+
+     Parameters
+     ----------
+     trial : optuna.trial.Trial, optional
+         If provided, reports val_loss after each epoch for ASHA pruning.
+
+     Returns
+     -------
+     history : dict
+         ``{"train_loss": [...], "val_loss": [...], "lr": [...]}``
+     """
+     # --- Loss function (label-smoothed CE, ignoring PAD) ---------------
+     criterion = nn.CrossEntropyLoss(
+         ignore_index=model.pad_idx,
+         label_smoothing=cfg.label_smoothing,
+     )
+
+     # --- Optimiser -----------------------------------------------------
+     optimizer = torch.optim.AdamW(model.parameters(), lr=cfg.lr, betas=(0.9, 0.98), eps=1e-9)
+
+     # --- LR schedule ---------------------------------------------------
+     total_steps = cfg.epochs * len(train_loader)
+     scheduler = _build_scheduler(optimizer, cfg.warmup_steps, total_steps)
+
+     # --- AMP scaler ----------------------------------------------------
+     scaler = torch.amp.GradScaler("cuda") if (cfg.use_amp and device.type == "cuda") else None
+
+     # --- Checkpoint dir ------------------------------------------------
+     ckpt_dir = Path(cfg.checkpoint_dir)
+     ckpt_dir.mkdir(parents=True, exist_ok=True)
+
+     history: dict = {"train_loss": [], "val_loss": [], "lr": []}
+     best_val = float("inf")
+     patience_ctr = 0
+
+     print(f"\n{'='*60}")
+     print(f"Starting training: {cfg.epochs} epochs, lr={cfg.lr}, AMP={cfg.use_amp}")
+     print(f"{'='*60}\n")
+
+     for epoch in range(cfg.epochs):
+         t0 = time.time()
+
+         train_loss = train_one_epoch(
+             model, train_loader, optimizer, scheduler, criterion,
+             device, scaler, cfg.grad_clip, cfg.log_every, epoch,
+         )
+         use_amp = cfg.use_amp and device.type == "cuda"
+         val_loss = evaluate_loss(model, val_loader, criterion, device, use_amp=use_amp)
+         lr = scheduler.get_last_lr()[0]
+
+         elapsed = time.time() - t0
+         history["train_loss"].append(train_loss)
+         history["val_loss"].append(val_loss)
+         history["lr"].append(lr)
+
+         print(
+             f"Epoch {epoch+1}/{cfg.epochs} | "
+             f"Train {train_loss:.4f} | Val {val_loss:.4f} | "
+             f"LR {lr:.2e} | {elapsed:.1f}s"
+         )
+
+         # --- Optuna ASHA pruning (if trial provided) ------------------
+         if trial is not None:
+             import optuna
+             trial.report(val_loss, epoch)
+             if trial.should_prune():
+                 print(f"\n✂ Optuna pruned this trial at epoch {epoch+1}.")
+                 raise optuna.TrialPruned()
+
+         # --- Checkpoint best model ------------------------------------
+         if val_loss < best_val:
+             best_val = val_loss
+             patience_ctr = 0
+             torch.save(model.state_dict(), ckpt_dir / "best_model.pt")
+             print("  ↳ New best val loss — checkpoint saved.")
+         else:
+             patience_ctr += 1
+             if patience_ctr >= cfg.patience:
+                 print(f"\n⏹ Early stopping after {cfg.patience} epochs without improvement.")
+                 break
+
+     # Load best checkpoint
+     model.load_state_dict(torch.load(ckpt_dir / "best_model.pt", map_location=device, weights_only=True))
+     print(f"\n✓ Training complete. Best val loss: {best_val:.4f}")
+     return history
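The checkpoint/patience logic in `train` can be exercised without a model. A pure-Python mock of the early-stopping branch, run on a hypothetical loss curve:

```python
def stopping_epoch(val_losses, patience=10):
    # Mirrors the patience counter in train(): reset on improvement,
    # stop once `patience` consecutive epochs fail to beat best_val.
    best_val = float("inf")
    patience_ctr = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_val:
            best_val = val_loss
            patience_ctr = 0          # a checkpoint would also be saved here
        else:
            patience_ctr += 1
            if patience_ctr >= patience:
                return epoch + 1      # stopped early (1-based epoch)
    return len(val_losses)            # ran the full schedule

# Improves for 3 epochs, then plateaus: stops 3 epochs after the best one,
# even though the curve would have improved again later.
print(stopping_epoch([3.0, 2.5, 2.2, 2.3, 2.3, 2.4, 2.1, 2.0], patience=3))  # 6
```

Since the best checkpoint is reloaded after the loop, a premature stop still returns the best weights seen so far.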
tokenizer_shared_16k.json ADDED
The diff for this file is too large to render. See raw diff