Transformer Base — WMT14 en→fr (from scratch)

A from-scratch PyTorch implementation of the Transformer (Vaswani et al., 2017), trained on WMT14 English→French without any pretrained weights. This is the strongest checkpoint from the parent project and the one worth sharing externally.

| Metric | Value |
| --- | --- |
| Test BLEU (newstest2014) | 35.31 |
| Valid BLEU (newstest2013) | 30.52 |
| Tokenizer | sacrebleu 13a (detokenized) |
| Parameters | 93,554,688 |
| Training compute | single RTX 5090, ~2h 5m |

BLEU is reported as sacrebleu 13a, the modern detokenized standard. Vaswani et al. (2017) report 38.1 in historical tokenized BLEU for Base on WMT14 en-fr, which is roughly equivalent to 35-36 sacrebleu, so at 35.31 this checkpoint lands at most ~0.7 BLEU below paper Base. Any remaining gap is attributable to training on 9.3M strict-filtered pairs (vs the paper's ~36M full corpus).

See the parent repository for full training logs, ablation studies against larger variants, and why a 9.3M clean corpus outperformed a 30M noisy one (data quality > data quantity > capacity).

Architecture

Standard Transformer Base from the paper, no architectural modifications:

| Hyperparameter | Value |
| --- | --- |
| d_model | 512 |
| n_heads | 8 |
| encoder layers | 6 |
| decoder layers | 6 |
| FFN dim | 2048 |
| dropout | 0.1 |
| max seq len | 256 |
| vocab size | 32000 (shared SentencePiece BPE) |
| shared embeddings | True |
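
As a sanity check on the parameter count listed in the metrics, the hyperparameters can be turned into a tensor-by-tensor tally. The sketch below is a reconstruction under assumptions, not the project's own counting code: it assumes a bias on every linear layer, two LayerNorms per encoder layer and three per decoder layer, one final LayerNorm per stack, and a 256×512 position table per stack. Under those assumptions, counting the source embedding, target embedding, and output projection as three separate 32000×512 matrices reproduces 93,554,688 exactly, while counting a tied matrix once and leaving out the position tables gives about 60.5M.

```python
# Hypothetical tally of Transformer Base tensors from the hyperparameters above.
# The layer layout (biases, LayerNorm placement, position tables) is assumed,
# not read from the project's model code.

D_MODEL, D_FF, VOCAB, MAX_LEN = 512, 2048, 32000, 256
N_ENC, N_DEC = 6, 6


def linear(n_in, n_out):
    """Weight matrix plus bias vector."""
    return n_in * n_out + n_out


def layer_norm(d):
    """Scale and shift vectors."""
    return 2 * d


attention = 4 * linear(D_MODEL, D_MODEL)  # Q, K, V and output projections
ffn = linear(D_MODEL, D_FF) + linear(D_FF, D_MODEL)

enc_layer = attention + ffn + 2 * layer_norm(D_MODEL)
dec_layer = 2 * attention + ffn + 3 * layer_norm(D_MODEL)  # self + cross attention

# Both stacks plus one final LayerNorm each.
body = N_ENC * enc_layer + N_DEC * dec_layer + 2 * layer_norm(D_MODEL)
vocab_matrix = VOCAB * D_MODEL
position_table = MAX_LEN * D_MODEL

# Every tensor counted separately: 3 vocabulary matrices, 2 position tables.
untied_total = body + 3 * vocab_matrix + 2 * position_table

# Tied vocabulary matrix counted once, position tables left out.
tied_unique = body + vocab_matrix

print(f"{untied_total:,}")  # 93,554,688
print(f"{tied_unique:,}")   # 60,524,544
```

One plausible reading, which the source does not confirm, is that the listed figure sums every entry of the saved state dict, where a tied matrix can appear under several keys.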

Files in this repo

| File | Purpose |
| --- | --- |
| pytorch_model.bin | Model weights only (averaged over the last 5 step-checkpoints, Vaswani trick) |
| sentencepiece.model | Shared 32K BPE tokenizer (SentencePiece) |
| config.json | Architecture config, sufficient to instantiate the model |
| example.py | Minimal self-contained inference script |
| README.md | This file |
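
The checkpoint averaging mentioned for pytorch_model.bin amounts to an element-wise mean over the saved weights. A minimal sketch follows, assuming each checkpoint is a flat mapping from parameter name to value. `average_checkpoints` is a hypothetical helper, not a function from the parent repo, and it is written against plain Python numbers so that it runs without PyTorch; the same arithmetic applies to floating-point tensors.

```python
def average_checkpoints(state_dicts):
    """Return the element-wise mean of several checkpoints.

    Each checkpoint is a dict mapping parameter name -> value. Values only
    need to support `+` and division by an int (floats here, tensors in practice).
    """
    if not state_dicts:
        raise ValueError("need at least one checkpoint")

    keys = state_dicts[0].keys()
    for sd in state_dicts[1:]:
        if sd.keys() != keys:
            raise ValueError("checkpoints have different parameter names")

    n = len(state_dicts)
    averaged = {}
    for name in keys:
        total = state_dicts[0][name]
        for sd in state_dicts[1:]:
            total = total + sd[name]  # builds a new value, inputs stay untouched
        averaged[name] = total / n
    return averaged


# Toy example: five "checkpoints" with two scalar parameters each.
checkpoints = [{"w": float(i), "b": 10.0 * i} for i in range(1, 6)]
print(average_checkpoints(checkpoints))  # {'w': 3.0, 'b': 30.0}
```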

Usage

# 1. Clone the parent repo for model definition + beam search code
git clone https://github.com/Euswbnix/Machine_translation
cd Machine_translation
pip install -r requirements.txt

# 2. Download the weights + tokenizer from this HF repo
pip install huggingface_hub
hf download euswbnix/transformer-wmt14-enfr-base \
    pytorch_model.bin sentencepiece.model config.json --local-dir hf_model

# 3. Translate
python examples/load_and_translate.py \
    --weights hf_model/pytorch_model.bin \
    --spm hf_model/sentencepiece.model \
    --config hf_model/config.json \
    --text "Machine learning is transforming the world."

Or use the bundled example.py:

import json

import sentencepiece as spm
import torch
from src.model import Transformer

# Load the model (shapes come from config.json)
with open("config.json") as f:
    cfg = json.load(f)
model = Transformer(
    vocab_size=cfg["vocab_size"], d_model=cfg["d_model"],
    n_heads=cfg["n_heads"], n_encoder_layers=cfg["n_encoder_layers"],
    n_decoder_layers=cfg["n_decoder_layers"], d_ff=cfg["d_ff"],
    dropout=0.0, max_seq_len=cfg["max_seq_len"],
    share_embeddings=cfg["share_embeddings"], pad_idx=0,
)
model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
model.eval()
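
For reference, the keys read by the loading snippet suggest what config.json holds. The sketch below rebuilds such a config in Python: the key names come from the snippet, the values from the Architecture section, and the exact contents of the real file are an assumption.

```python
import json

# Keys are the ones the loading snippet reads; values are copied from the
# Architecture section. The real config.json may hold additional fields.
config = {
    "vocab_size": 32000,
    "d_model": 512,
    "n_heads": 8,
    "n_encoder_layers": 6,
    "n_decoder_layers": 6,
    "d_ff": 2048,
    "max_seq_len": 256,
    "share_embeddings": True,
}

# Multi-head attention splits d_model evenly across heads.
assert config["d_model"] % config["n_heads"] == 0
head_dim = config["d_model"] // config["n_heads"]

print(json.dumps(config, indent=2))
print("per-head dimension:", head_dim)  # per-head dimension: 64
```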

Training data

  • Source: WMT14 en-fr parallel corpus (via HuggingFace datasets wmt14 config)
  • Cleaning: strict filter with length ratio in [0.5, 2.0], min 3 / max 200 tokens per side, Latin-script ratio ≥ 0.9, and no target line appearing more than 50 times
  • Post-clean size: 9.3M pairs (of 10M subsampled from raw)
  • BPE: 32K shared vocab, SentencePiece, character coverage 0.9995
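
The cleaning rules above can be sketched as a single pass over sentence pairs. This is an illustration under stated assumptions rather than the project's actual filter: tokens are whitespace-separated words (the source does not say whether lengths are counted in words or subword pieces), the Latin-script ratio is taken over alphabetic characters using Unicode character names, both sides must pass it, and target lines occurring more than 50 times are dropped entirely (capping them at 50 copies is another possible reading). `latin_ratio`, `keep_pair`, and `clean_corpus` are hypothetical names.

```python
import unicodedata
from collections import Counter


def latin_ratio(text):
    """Fraction of alphabetic characters whose Unicode name marks them as Latin."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    latin = sum("LATIN" in unicodedata.name(ch, "") for ch in letters)
    return latin / len(letters)


def keep_pair(src, tgt, min_len=3, max_len=200, max_ratio=2.0, min_latin=0.9):
    """Apply the length, length-ratio, and script checks to one pair."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
        return False
    ratio = n_src / n_tgt
    if not (1.0 / max_ratio <= ratio <= max_ratio):
        return False
    return latin_ratio(src) >= min_latin and latin_ratio(tgt) >= min_latin


def clean_corpus(pairs, max_tgt_repeats=50):
    """Drop pairs that fail keep_pair or whose target line occurs too often."""
    pairs = list(pairs)
    tgt_counts = Counter(tgt for _, tgt in pairs)
    return [
        (src, tgt)
        for src, tgt in pairs
        if tgt_counts[tgt] <= max_tgt_repeats and keep_pair(src, tgt)
    ]


sample = [
    ("The cat sat on the mat .", "Le chat était assis sur le tapis ."),
    ("Yes .", "Oui ."),  # too short on both sides
    ("This is a test sentence .", "Это тестовое предложение для проверки ."),  # non-Latin target
]
print(clean_corpus(sample))  # only the first pair survives
```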

Intended use & limitations

  • Translates English → French news / general prose
  • Trained only on WMT14 (≈2014 news + Europarl + Common Crawl)
  • Does not handle code, tables, or long documents
  • Output may reflect biases present in WMT14 training data
  • Not benchmarked on low-resource domains (medical, legal, etc.)

Citation

If you use this checkpoint, please cite the original Transformer paper:

@inproceedings{vaswani2017attention,
  title={Attention is all you need},
  author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and others},
  booktitle={NeurIPS},
  year={2017}
}

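And link back to this repo (euswbnix/transformer-wmt14-enfr-base on the Hugging Face Hub) and the GitHub project (https://github.com/Euswbnix/Machine_translation).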
