This project implements the Transformer architecture from scratch, with machine translation as the use case. It is intended mainly as an educational resource: a working implementation of the architecture together with its training and inference logic.
This repository hosts the weights of the trained small-size Transformer and the pretrained tokenizers.
pip install tfs-mt
import torch
from tfs_mt.architecture import build_model
from tfs_mt.data_utils import WordTokenizer
from tfs_mt.decoding_utils import greedy_decoding
base_url = "https://huggingface.co/giovo17/tfs-mt/resolve/main/"
src_tokenizer = WordTokenizer.from_pretrained(base_url + "src_tokenizer_word.json")
tgt_tokenizer = WordTokenizer.from_pretrained(base_url + "tgt_tokenizer_word.json")
model = build_model(
    config="https://huggingface.co/giovo17/tfs-mt/resolve/main/config-lock.yaml",
    from_pretrained=True,
    model_path="https://huggingface.co/giovo17/tfs-mt/resolve/main/model.safetensors",
)
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()
input_tokens, input_mask = src_tokenizer.encode("Hi, how are you?")
output = greedy_decoding(model, tgt_tokenizer, input_tokens, input_mask)[0]
print(output)
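To make the decoding step above more concrete, here is a minimal sketch of the general greedy decoding idea: at each step, append the highest-scoring next token and stop at the end-of-sequence token or a length cap. It is not the `greedy_decoding` function from `tfs_mt`; the names `greedy_decode`, `next_token_scores` and `toy_scores` are hypothetical, and a toy scoring function stands in for the Transformer.

```python
from typing import Callable, List


def greedy_decode(
    next_token_scores: Callable[[List[int]], List[float]],
    sos_idx: int,
    eos_idx: int,
    max_len: int,
) -> List[int]:
    """Append the highest-scoring next token until EOS or max_len is reached."""
    tokens = [sos_idx]
    while len(tokens) < max_len:
        scores = next_token_scores(tokens)
        # Greedy choice: index of the largest score (first one on ties).
        best = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(best)
        if best == eos_idx:
            break
    return tokens


# Hypothetical stand-in for a model: a 5-token vocabulary where the
# preferred next token is always (last token + 1) modulo 5.
def toy_scores(prefix: List[int]) -> List[float]:
    scores = [0.0] * 5
    scores[(prefix[-1] + 1) % 5] = 1.0
    return scores


decoded = greedy_decode(toy_scores, sos_idx=0, eos_idx=3, max_len=10)
print(decoded)  # [0, 1, 2, 3]
```

Greedy decoding is cheap because it keeps a single hypothesis, but it can miss sequences that a wider search such as beam search would find.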
Model Size: small
Dataset: Helsinki-NLP/europarl (en-it)
seed: 42
log_every_iters: 1000
save_every_iters: 10000
eval_every_iters: 10000
update_pbar_every_iters: 100
time_limit_sec: -1
checkpoints_retain_n: 5
model_base_name: tfs_mt
model_parameters:
dropout: 0.1
model_configs:
pretrained_word_embeddings: GloVe
positional_embeddings: sinusoidal
nano:
num_encoder_layers: 4
num_decoder_layers: 4
d_model: 50
num_heads: 4
d_ff: 200
norm_type: postnorm
glove_version: glove.2024.wikigiga.50d
glove_filename: wiki_giga_2024_50_MFT20_vectors_seed_123_alpha_0.75_eta_0.075_combined
small:
num_encoder_layers: 6
num_decoder_layers: 6
d_model: 100
num_heads: 6
d_ff: 400
norm_type: postnorm
glove_version: glove.2024.wikigiga.100d
glove_filename: wiki_giga_2024_100_MFT20_vectors_seed_2024_alpha_0.75_eta_0.05.050_combined
base:
num_encoder_layers: 8
num_decoder_layers: 8
d_model: 300
num_heads: 8
d_ff: 800
norm_type: postnorm
glove_version: glove.2024.wikigiga.300d
glove_filename: wiki_giga_2024_300_MFT20_vectors_seed_2024_alpha_0.75_eta_0.05_combined
original:
num_encoder_layers: 6
num_decoder_layers: 6
d_model: 512
num_heads: 8
d_ff: 2048
norm_type: postnorm
training_hp:
num_epochs: 2
use_amp: true
amp_dtype: bfloat16
torch_compile_mode: max-autotune
loss:
type: crossentropy
label_smoothing: 0.1
optimizer:
type: AdamW
weight_decay: 0.0001
beta1: 0.9
beta2: 0.999
eps: 1.0e-08
lr_scheduler:
type: original
min_lr: 0.0003
max_lr: 0.001
warmup_iters: 25000
stable_iters_prop: 0.7
max_gradient_norm: 5.0
early_stopping:
enabled: false
patience: 40000
min_delta: 1.0e-05
tokenizer:
type: word
sos_token: <s>
eos_token: </s>
pad_token: <PAD>
unk_token: <UNK>
max_seq_len: 131
max_vocab_size: 70000
vocab_min_freq: 2
src_sos_token_idx: 60932
src_eos_token_idx: 60854
src_pad_token_idx: 18895
src_unk_token_idx: 3358
tgt_sos_token_idx: 60933
tgt_eos_token_idx: 60860
tgt_pad_token_idx: 18800
tgt_unk_token_idx: 3289
dataset:
dataset_task: machine-translation
dataset_id: Helsinki-NLP/europarl
dataset_name: en-it
train_split: 0.95
src_lang: en
tgt_lang: it
max_len: -1
train_dataloader:
batch_size: 64
num_workers: 4
shuffle: true
drop_last: true
prefetch_factor: 2
pad_all_to_max_len: true
test_dataloader:
batch_size: 128
num_workers: 4
shuffle: false
drop_last: false
prefetch_factor: 2
pad_all_to_max_len: true
chosen_model_size: small
model_name: tfs_mt_small_260207-0915
exec_mode: dev
src_tokenizer_vocab_size: 70000
tgt_tokenizer_vocab_size: 70000
num_train_iters_per_epoch: 28889
num_test_iters_per_epoch: 761
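As a rough sanity check on the configuration above, the following sketch derives the training budget from the listed values. It assumes a single training process (so that one iteration consumes exactly one batch of `batch_size` examples) and that `warmup_iters` is counted in the same iterations as `num_train_iters_per_epoch`; neither is stated in the config itself.

```python
# Values copied from the configuration above.
train_batch_size = 64
test_batch_size = 128
num_epochs = 2
num_train_iters_per_epoch = 28889
num_test_iters_per_epoch = 761
warmup_iters = 25000
max_seq_len = 131

# drop_last: true for training, so every training batch is full.
train_examples_per_epoch = num_train_iters_per_epoch * train_batch_size
total_train_iters = num_epochs * num_train_iters_per_epoch
warmup_share = warmup_iters / total_train_iters

# drop_last: false for testing, so the last test batch may be partial.
max_test_examples = num_test_iters_per_epoch * test_batch_size
min_test_examples = (num_test_iters_per_epoch - 1) * test_batch_size + 1

# pad_all_to_max_len: true, so each training batch holds a fixed number
# of token positions per side (source or target).
token_positions_per_batch = train_batch_size * max_seq_len

print(f"training examples per epoch: {train_examples_per_epoch}")
print(f"total training iterations:   {total_train_iters}")
print(f"warmup share of training:    {warmup_share:.1%}")
print(f"test examples:               {min_test_examples} to {max_test_examples}")
print(f"token positions per batch:   {token_positions_per_batch}")
```

Under these assumptions, each epoch covers 1,848,896 sentence pairs, the full run is 57,778 iterations, and the warmup takes roughly 43% of them.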
The model weights are licensed under the MIT License.
The base weights used for training were sourced from GloVe, which is licensed under the ODC Public Domain Dedication and License (PDDL).
If you use tfs-mt in your research or project, please cite:
@software{Spadaro_tfs-mt,
author = {Spadaro, Giovanni},
licenses = {MIT, CC BY-SA 4.0},
title = {{tfs-mt}},
url = {https://github.com/Giovo17/tfs-mt}
}