Transformer Base — WMT14 en→fr (from scratch)

A from-scratch PyTorch implementation of the Transformer (Vaswani et al., 2017), trained on WMT14 English→French without any pretrained weights. This is the strongest checkpoint from the parent project and the one worth sharing externally.

Metric	Value
Test BLEU (newstest2014)	35.31
Valid BLEU (newstest2013)	30.52
Tokenizer	sacrebleu `13a` (detokenized)
Parameters	93,554,688
Training compute	single RTX 5090, ~2h 5m

BLEU is reported with sacrebleu's 13a tokenizer on detokenized output, the current standard. Vaswani et al. (2017) report 38.1 in historical tokenized BLEU for Base on WMT14 en-fr, which is roughly equivalent to 35-36 sacrebleu, so this checkpoint lands ~1-1.5 BLEU below paper Base. The gap is attributed to training on 9.3M strict-filtered pairs (vs the paper's ~36M full corpus).

See the parent repository for full training logs, ablation studies against larger variants, and why a 9.3M clean corpus outperformed a 30M noisy one (data quality > data quantity > capacity).

Architecture

Standard Transformer Base from the paper, no architectural modifications:


d_model	512
n_heads	8
encoder layers	6
decoder layers	6
FFN dim	2048
dropout	0.1
max seq len	256
vocab size	32000 (shared SentencePiece BPE)
shared embeddings	True
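
As a sanity check on the reported parameter count, the arithmetic below rebuilds it from the table values. It is a sketch under assumptions that this card does not state: biases on every linear layer, one final LayerNorm after each stack, two learned positional tables of max seq len × d_model, and the shared embedding matrix counted three times (source embedding, target embedding, output projection), as can happen when tied tensors are summed per name. All variable names are hypothetical. Under those assumptions the total matches 93,554,688 exactly, and counting the tied matrix once gives roughly 60.8M unique parameters.

```python
# Hypothetical decomposition of the reported parameter count.
# Every structural choice below is an assumption, not taken from the model code.
d_model, d_ff, vocab, max_len, n_layers = 512, 2048, 32000, 256, 6


def linear(n_in, n_out):
    """Weights plus bias of one fully connected layer."""
    return n_in * n_out + n_out


attention = 4 * linear(d_model, d_model)              # Q, K, V and output projections
ffn = linear(d_model, d_ff) + linear(d_ff, d_model)   # position-wise feed-forward
layer_norm = 2 * d_model                              # scale and shift

encoder_layer = attention + ffn + 2 * layer_norm
decoder_layer = 2 * attention + ffn + 3 * layer_norm  # self-attention + cross-attention
final_norms = 2 * layer_norm                          # assumed: one after each stack
stacks = n_layers * (encoder_layer + decoder_layer) + final_norms

embedding = vocab * d_model
positional = 2 * max_len * d_model                    # assumed: learned, one per side

counted_per_name = stacks + positional + 3 * embedding  # tied matrix counted 3 times
unique = stacks + positional + embedding                # tied matrix counted once
print(counted_per_name, unique)
```

If the model actually uses fixed sinusoidal positions or a different bias layout, the breakdown differs even though the total happens to agree.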

Files in this repo

File	Purpose
`pytorch_model.bin`	Model weights only (average of the last 5 step checkpoints, following Vaswani et al.)
`sentencepiece.model`	Shared 32K BPE tokenizer (SentencePiece)
`config.json`	Architecture config — sufficient to instantiate the model
`example.py`	Minimal self-contained inference script
`README.md`	This file
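
The weights file is described as an average over the last 5 step checkpoints. A minimal sketch of that kind of averaging is shown below. The function name `average_state_dicts` is hypothetical, and the parent project's own averaging script may differ. Values only need to support `+` and `/`, which holds for floats, NumPy arrays, and PyTorch tensors, so the toy demonstration uses plain floats.

```python
def average_state_dicts(state_dicts):
    """Element-wise mean of several checkpoints that share the same keys."""
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    keys = state_dicts[0].keys()
    for sd in state_dicts[1:]:
        if sd.keys() != keys:
            raise ValueError("checkpoints have different parameter names")
    n = len(state_dicts)
    # sum() starts from 0, which adds cleanly to floats and to tensors alike.
    return {k: sum(sd[k] for sd in state_dicts) / n for k in keys}


# Toy demonstration with scalars standing in for weight tensors.
toy = [{"w": 1.0, "b": 0.0}, {"w": 3.0, "b": 2.0}]
averaged = average_state_dicts(toy)
print(averaged)
```

With real checkpoints, each entry would be a state dict loaded with `torch.load(path, map_location="cpu")`. Integer buffers, if any, would be promoted to floating point by the division and may need separate handling.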

Usage

# 1. Clone the parent repo for model definition + beam search code
git clone https://github.com/Euswbnix/Machine_translation
cd Machine_translation
pip install -r requirements.txt

# 2. Download the weights + tokenizer from this HF repo
pip install huggingface_hub
hf download euswbnix/transformer-wmt14-enfr-base \
    pytorch_model.bin sentencepiece.model config.json --local-dir hf_model

# 3. Translate
python examples/load_and_translate.py \
    --weights hf_model/pytorch_model.bin \
    --spm hf_model/sentencepiece.model \
    --config hf_model/config.json \
    --text "Machine learning is transforming the world."

Or use the bundled `example.py`:

import json

import sentencepiece as spm
import torch
from src.model import Transformer

# Load the model (shapes come from config.json)
cfg = json.load(open("config.json"))
model = Transformer(
    vocab_size=cfg["vocab_size"], d_model=cfg["d_model"],
    n_heads=cfg["n_heads"], n_encoder_layers=cfg["n_encoder_layers"],
    n_decoder_layers=cfg["n_decoder_layers"], d_ff=cfg["d_ff"],
    dropout=0.0, max_seq_len=cfg["max_seq_len"],
    share_embeddings=cfg["share_embeddings"], pad_idx=0,
)
model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
model.eval()

Training data

Source: WMT14 en-fr parallel corpus (via the Hugging Face `datasets` library, `wmt14` dataset)
Cleaning: strict filter with length ratio in [0.5, 2.0], min 3 / max 200 tokens per side, Latin-script ratio ≥ 0.9, and no target line appearing more than 50×
Post-clean size: 9.3M pairs (of 10M subsampled from raw)
BPE: 32K shared vocab, SentencePiece, character coverage 0.9995
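
The cleaning rules above can be sketched as a small filter. This is an illustration under several assumptions that the card does not confirm: tokens are whitespace-separated words, the Latin-script ratio is measured over alphabetic characters on both sides, and pairs whose target occurs more than 50 times are dropped entirely rather than capped. The names `latin_ratio`, `keep_pair`, and `clean` are hypothetical.

```python
import unicodedata
from collections import Counter


def latin_ratio(text):
    """Fraction of alphabetic characters whose Unicode name starts with LATIN."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    latin = sum(unicodedata.name(ch, "").startswith("LATIN") for ch in letters)
    return latin / len(letters)


def keep_pair(src, tgt, min_len=3, max_len=200, min_latin=0.9):
    """Apply the per-pair length, length-ratio, and script checks."""
    n_src, n_tgt = len(src.split()), len(tgt.split())  # assumed: whitespace tokens
    if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
        return False
    if not (0.5 <= n_src / n_tgt <= 2.0):
        return False
    return latin_ratio(src) >= min_latin and latin_ratio(tgt) >= min_latin


def clean(pairs, max_tgt_repeats=50):
    """Drop over-repeated targets, then apply the per-pair checks."""
    pairs = list(pairs)
    tgt_counts = Counter(tgt for _, tgt in pairs)
    return [(s, t) for s, t in pairs
            if tgt_counts[t] <= max_tgt_repeats and keep_pair(s, t)]


sample = [
    ("The cat sat on the mat.", "Le chat est assis sur le tapis."),
    ("Hello", "Bonjour"),                                                   # too short
    ("Это не английский текст совсем", "Ce n'est pas de l'anglais du tout"),  # non-Latin source
]
kept = clean(sample)
print(kept)
```

Only the first pair survives in this toy sample. The second fails the minimum length and the third fails the script check on the source side.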

Intended use & limitations

Translates English → French news / general prose
Trained only on WMT14 (≈2014 news + Europarl + Common Crawl)
Does not handle code, tables, or long documents
Output may reflect biases present in WMT14 training data
Not benchmarked on low-resource domains (medical, legal, etc.)

Citation

If you use this checkpoint, please cite the original Transformer paper:

@inproceedings{vaswani2017attention,
  title={Attention is all you need},
  author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and others},
  booktitle={NeurIPS},
  year={2017}
}

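And link back to this repo (euswbnix/transformer-wmt14-enfr-base on Hugging Face) and the GitHub project (https://github.com/Euswbnix/Machine_translation).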
