Transformer Base — WMT14 en→fr (from scratch)
A from-scratch PyTorch implementation of the Transformer (Vaswani et al., 2017), trained on WMT14 English→French without any pretrained weights. This is the strongest checkpoint from the parent project and the one worth sharing externally.
| Metric | Value |
|---|---|
| Test BLEU (newstest2014) | 35.31 |
| Valid BLEU (newstest2013) | 30.52 |
| Tokenizer | sacrebleu 13a (detokenized) |
| Parameters | 93,554,688 |
| Training compute | single RTX 5090, ~2h 5m |
BLEU is computed with sacrebleu using the 13a tokenizer on detokenized output, the modern standard. Vaswani et al. (2017) report 38.1 in historical tokenized BLEU for Base on WMT14 en-fr, which is roughly equivalent to 35-36 sacrebleu, so this checkpoint lands ~1-1.5 BLEU below paper Base. The gap is attributable to training on 9.3M strict-filtered pairs (vs the paper's ~36M full corpus).
See the parent repository for full training logs, ablation studies against larger variants, and why a 9.3M clean corpus outperformed a 30M noisy one (data quality > data quantity > capacity).
Architecture
Standard Transformer Base from the paper, no architectural modifications:
| Hyperparameter | Value |
|---|---|
| d_model | 512 |
| n_heads | 8 |
| encoder layers | 6 |
| decoder layers | 6 |
| FFN dim | 2048 |
| dropout | 0.1 |
| max seq len | 256 |
| vocab size | 32000 (shared SentencePiece BPE) |
| shared embeddings | True |
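As a sanity check, the reported parameter count can be reproduced by hand from the table above. The sketch below is one decomposition that happens to match 93,554,688 exactly. It assumes biases on every linear layer, two LayerNorms per encoder layer and three per decoder layer, one final LayerNorm per stack, a 256 x 512 positional table on each side, and the 32000 x 512 embedding matrix counted three times (source embedding, target embedding, output projection), as happens when tied tensors are summed entry by entry. None of these details are confirmed by this card, so treat the model code in the parent repo as the authority; all names in the snippet are illustrative.

```python
# Hyperparameters taken from the architecture table.
d_model, d_ff, vocab, max_len = 512, 2048, 32000, 256
n_enc, n_dec = 6, 6


def linear(n_in: int, n_out: int) -> int:
    """Weights plus bias of a dense layer (bias is an assumption)."""
    return n_in * n_out + n_out


attention = 4 * linear(d_model, d_model)                # Q, K, V, output projections
ffn = linear(d_model, d_ff) + linear(d_ff, d_model)     # two-layer feed-forward block
layer_norm = 2 * d_model                                # scale + shift

enc_layer = attention + ffn + 2 * layer_norm            # self-attention + FFN
dec_layer = 2 * attention + ffn + 3 * layer_norm        # self-attn + cross-attn + FFN
stacks = n_enc * enc_layer + n_dec * dec_layer + 2 * layer_norm  # + assumed final norms

embedding = vocab * d_model
positions = 2 * max_len * d_model                       # assumed table on each side

# Assumption: the tied embedding matrix is counted once per place it is used.
counted_total = 3 * embedding + positions + stacks
# Same model, counting the tied matrix only once.
unique_total = embedding + positions + stacks

print(f"{counted_total:,}")
print(f"{unique_total:,}")
```

Under the same assumptions, counting the tied embedding matrix only once gives 60,786,688 unique parameters.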
Files in this repo
| File | Purpose |
|---|---|
| pytorch_model.bin | Model weights only (averaged over the last 5 step-checkpoints, Vaswani trick) |
| sentencepiece.model | Shared 32K BPE tokenizer (SentencePiece) |
| config.json | Architecture config, sufficient to instantiate the model |
| example.py | Minimal self-contained inference script |
| README.md | This file |
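Checkpoint averaging takes the element-wise mean of each parameter across several saved checkpoints. The sketch below shows the idea with plain Python lists standing in for tensors. `average_checkpoints` is a hypothetical helper, not a function from the parent repo; a real implementation would apply the same mean to the tensors in each state dict.

```python
from typing import Dict, List

StateDict = Dict[str, List[float]]  # parameter name -> flattened values (stand-in for tensors)


def average_checkpoints(state_dicts: List[StateDict]) -> StateDict:
    """Return the element-wise mean of every parameter across checkpoints."""
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    keys = state_dicts[0].keys()
    for sd in state_dicts[1:]:
        if sd.keys() != keys:
            raise ValueError("checkpoints have different parameter names")
    n = len(state_dicts)
    return {
        # zip(*...) walks the same position of this parameter in every checkpoint
        k: [sum(vals) / n for vals in zip(*(sd[k] for sd in state_dicts))]
        for k in keys
    }


# Five toy checkpoints, mirroring the "last 5" averaging described in the table.
checkpoints = [{"w": [float(i), 2.0 * i]} for i in range(5)]
averaged = average_checkpoints(checkpoints)
print(averaged)
```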
Usage
# 1. Clone the parent repo for model definition + beam search code
git clone https://github.com/Euswbnix/Machine_translation
cd Machine_translation
pip install -r requirements.txt
# 2. Download the weights + tokenizer from this HF repo
pip install huggingface_hub
hf download euswbnix/transformer-wmt14-enfr-base \
    pytorch_model.bin sentencepiece.model config.json --local-dir hf_model
# 3. Translate
python examples/load_and_translate.py \
    --weights hf_model/pytorch_model.bin \
    --spm hf_model/sentencepiece.model \
    --config hf_model/config.json \
    --text "Machine learning is transforming the world."
Or use the bundled example.py:
import json
import sentencepiece as spm
import torch
from src.model import Transformer  # model definition from the parent repo
# Load the model (shapes come from config.json)
with open("config.json") as f:
    cfg = json.load(f)
model = Transformer(
    vocab_size=cfg["vocab_size"], d_model=cfg["d_model"],
    n_heads=cfg["n_heads"], n_encoder_layers=cfg["n_encoder_layers"],
    n_decoder_layers=cfg["n_decoder_layers"], d_ff=cfg["d_ff"],
    dropout=0.0, max_seq_len=cfg["max_seq_len"],
    share_embeddings=cfg["share_embeddings"], pad_idx=0,
)
# Load the averaged weights on CPU and switch to inference mode
model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
model.eval()
Training data
- Source: WMT14 en-fr parallel corpus (via HuggingFace datasets, wmt14 config)
- Cleaning: strict filter: length ratio [0.5, 2.0], min 3 / max 200 tokens per side, Latin-script ratio ≥ 0.9, no tgt line appearing > 50×
- Post-clean size: 9.3M pairs (of 10M subsampled from raw)
- BPE: 32K shared vocab, SentencePiece, character coverage 0.9995
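The cleaning rules above can be sketched as a small filter. This is an illustration, not the parent project's actual script: it assumes whitespace tokens, measures the Latin-script ratio over alphabetic characters via their Unicode names, and reads the 50× rule as keeping at most 50 pairs per distinct target line (dropping such lines entirely is another possible reading). The names `latin_ratio`, `keep_pair`, and `strict_filter` are hypothetical.

```python
import unicodedata
from collections import Counter
from typing import Iterable, List, Tuple


def latin_ratio(text: str) -> float:
    """Fraction of alphabetic characters whose Unicode name marks them as Latin."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    latin = sum(1 for c in letters if "LATIN" in unicodedata.name(c, ""))
    return latin / len(letters)


def keep_pair(src: str, tgt: str, min_len: int = 3, max_len: int = 200,
              min_latin: float = 0.9) -> bool:
    """Apply the per-pair rules: length bounds, length ratio, script ratio."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
        return False
    if not (0.5 <= n_src / n_tgt <= 2.0):
        return False
    return latin_ratio(src) >= min_latin and latin_ratio(tgt) >= min_latin


def strict_filter(pairs: Iterable[Tuple[str, str]],
                  max_tgt_repeats: int = 50) -> List[Tuple[str, str]]:
    """Keep pairs passing keep_pair, capping how often one target line may occur."""
    seen = Counter()
    kept = []
    for src, tgt in pairs:
        if not keep_pair(src, tgt):
            continue
        seen[tgt] += 1
        if seen[tgt] <= max_tgt_repeats:
            kept.append((src, tgt))
    return kept


sample = [("The cat sat down.", "Le chat s'est assis.")] * 60
print(len(strict_filter(sample)))  # the cap keeps 50 of the 60 duplicates
```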
Intended use & limitations
- Translates English → French news / general prose
- Trained only on WMT14 (≈2014 news + Europarl + Common Crawl)
- Does not handle code, tables, or long documents
- Output may reflect biases present in WMT14 training data
- Not benchmarked on low-resource domains (medical, legal, etc.)
Citation
If you use this checkpoint, please cite the original Transformer paper:
@inproceedings{vaswani2017attention,
title={Attention is all you need},
author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and others},
booktitle={NeurIPS},
year={2017}
}
And link back to this repo (euswbnix/transformer-wmt14-enfr-base) and the GitHub project (https://github.com/Euswbnix/Machine_translation).