---
language:
- en
- it
license: mit
tags:
- pytorch
- nlp
- machine-translation
pipeline_tag: translation
datasets:
- Helsinki-NLP/europarl
---

<h1 align="center">tfs-mt<br>
Transformer from scratch for Machine Translation</h1>

<div align="center">
<a href="https://github.com/Giovo17/tfs-mt/releases" alt="Release">
<img src="https://img.shields.io/github/v/release/Giovo17/tfs-mt"/>
</a>
<a href="https://github.com/Giovo17/tfs-mt/actions/workflows/main.yml?query=branch%3Amain" alt="Build status">
<img src="https://img.shields.io/github/actions/workflow/status/Giovo17/tfs-mt/main.yml?branch=main"/>
</a>
<a href="https://huggingface.co/giovo17/tfs-mt/blob/main/LICENSE" alt="License">
<img src="https://img.shields.io/badge/license-MIT-green.svg"/>
</a>
<br>
<a href="https://github.com/Giovo17/tfs-mt">
🏠 Homepage
</a>
•
<a href="https://giovo17.github.io/tfs-mt">
📖 Documentation
</a>
•
<a href="https://huggingface.co/spaces/giovo17/tfs-mt-demo">
🎬 Demo
</a>
•
<a href="https://pypi.org/project/tfs-mt">
📦 PyPI
</a>

</div>


---

This project implements the Transformer architecture from scratch, with machine translation as the use case. It is intended primarily as an educational resource, paired with a working implementation of the architecture and the training/inference logic.


This repository hosts the weights of the trained `small` Transformer and the pretrained tokenizers.

## Quick Start


```bash
pip install tfs-mt
```

```python
import torch

from tfs_mt.architecture import build_model
from tfs_mt.data_utils import WordTokenizer
from tfs_mt.decoding_utils import greedy_decoding

# All artifacts are downloaded from this model repository.
base_url = "https://huggingface.co/giovo17/tfs-mt/resolve/main/"
src_tokenizer = WordTokenizer.from_pretrained(base_url + "src_tokenizer_word.json")
tgt_tokenizer = WordTokenizer.from_pretrained(base_url + "tgt_tokenizer_word.json")

model = build_model(
    config=base_url + "config-lock.yaml",
    from_pretrained=True,
    model_path=base_url + "model.safetensors",
)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

input_tokens, input_mask = src_tokenizer.encode("Hi, how are you?")

output = greedy_decoding(model, tgt_tokenizer, input_tokens, input_mask)[0]
print(output)
```

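Conceptually, `greedy_decoding` picks the highest-scoring next token at every step and stops at the end-of-sequence token. The actual implementation lives in `tfs_mt.decoding_utils`; the loop below is only an illustrative sketch, with a toy scoring function standing in for the model:

```python
def greedy_decode(step_fn, sos_id, eos_id, max_len):
    """Greedy decoding sketch: at each step, append the argmax token
    and stop once the end-of-sequence token is produced."""
    tokens = [sos_id]
    for _ in range(max_len):
        logits = step_fn(tokens)  # scores over the vocabulary
        next_id = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

# Toy "model": prefers token 2 until the prefix has 3 tokens, then EOS (id 3).
def toy_step(tokens):
    return [0.0, 0.1, 0.9, 0.2] if len(tokens) < 3 else [0.0, 0.0, 0.1, 0.9]

print(greedy_decode(toy_step, sos_id=1, eos_id=3, max_len=10))  # [1, 2, 2, 3]
```

Beam search would keep the `k` best prefixes instead of a single one; greedy decoding is the `k = 1` special case.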
## Model Architecture


**Model Size**: `small`


- **Encoder Layers**: 6
- **Decoder Layers**: 6
- **Model Dimension**: 100
- **Attention Heads**: 6
- **FFN Dimension**: 400
- **Normalization Type**: postnorm
- **Dropout**: 0.1
- **Pretrained Embeddings**: GloVe
- **Positional Embeddings**: sinusoidal
- **GloVe Version**: glove.2024.wikigiga.100d

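The sinusoidal positional embeddings follow the standard construction from the original Transformer paper: `PE[pos, 2i] = sin(pos / 10000^(2i/d))` and `PE[pos, 2i+1] = cos(pos / 10000^(2i/d))`. A dependency-free sketch (the in-repo implementation may differ in details such as vectorization):

```python
import math

def sinusoidal_positions(max_len, d_model):
    """Build the sinusoidal position table: sine on even dimensions,
    cosine on odd ones, with geometrically increasing wavelengths."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

# Matches this card's config: max_seq_len=131, d_model=100.
pe = sinusoidal_positions(131, 100)
print(len(pe), len(pe[0]))  # 131 100
```

These embeddings are fixed (not learned) and are simply added to the token embeddings before the first encoder/decoder layer.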
### Tokenizer


- **Type**: word
- **Max Sequence Length**: 131
- **Max Vocabulary Size**: 70000
- **Minimum Frequency**: 2

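A word-level tokenizer with these settings keeps only words that occur at least twice, capped at 70,000 vocabulary entries, with everything else mapped to the unknown token. An illustrative sketch with `collections.Counter` (the special-token names match the config below; the rest is an assumption, not the repo's exact code):

```python
from collections import Counter

def build_vocab(sentences, max_vocab_size=70000, min_freq=2,
                specials=("<s>", "</s>", "<PAD>", "<UNK>")):
    """Word-level vocab: most frequent words with count >= min_freq,
    preceded by the special tokens, truncated to max_vocab_size."""
    counts = Counter(w for s in sentences for w in s.split())
    kept = [w for w, c in counts.most_common(max_vocab_size - len(specials))
            if c >= min_freq]
    return {w: i for i, w in enumerate(list(specials) + kept)}

vocab = build_vocab(["the cat sat", "the cat ran", "a dog ran"])
# "sat", "a", "dog" appear only once, so they fall below min_freq
# and would be encoded as <UNK>.
```

Out-of-vocabulary handling is the main weakness of word-level tokenization compared to subword schemes such as BPE, which is why `max_vocab_size` is set fairly high here.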
## Dataset


- **Task**: machine-translation
- **Dataset ID**: `Helsinki-NLP/europarl`
- **Dataset Name**: `en-it`
- **Source Language**: en
- **Target Language**: it
- **Train Split**: 0.95

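A 0.95/0.05 train/test split with a fixed seed (the config uses `seed: 42`) is straightforward to reproduce on indices. This is only an illustrative sketch, not necessarily how the repo splits the Hugging Face dataset:

```python
import random

# Deterministic split: shuffle indices with a fixed seed,
# then take the first 95% for training.
rng = random.Random(42)
indices = list(range(100))
rng.shuffle(indices)
cut = int(0.95 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]
print(len(train_idx), len(test_idx))  # 95 5
```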
## Full training configuration


<details>
<summary>Click to expand complete config-lock.yaml</summary>

```yaml
seed: 42
log_every_iters: 1000
save_every_iters: 10000
eval_every_iters: 10000
update_pbar_every_iters: 100
time_limit_sec: -1
checkpoints_retain_n: 5
model_base_name: tfs_mt
model_parameters:
  dropout: 0.1
  model_configs:
    pretrained_word_embeddings: GloVe
    positional_embeddings: sinusoidal
    nano:
      num_encoder_layers: 4
      num_decoder_layers: 4
      d_model: 50
      num_heads: 4
      d_ff: 200
      norm_type: postnorm
      glove_version: glove.2024.wikigiga.50d
      glove_filename: wiki_giga_2024_50_MFT20_vectors_seed_123_alpha_0.75_eta_0.075_combined
    small:
      num_encoder_layers: 6
      num_decoder_layers: 6
      d_model: 100
      num_heads: 6
      d_ff: 400
      norm_type: postnorm
      glove_version: glove.2024.wikigiga.100d
      glove_filename: wiki_giga_2024_100_MFT20_vectors_seed_2024_alpha_0.75_eta_0.05.050_combined
    base:
      num_encoder_layers: 8
      num_decoder_layers: 8
      d_model: 300
      num_heads: 8
      d_ff: 800
      norm_type: postnorm
      glove_version: glove.2024.wikigiga.300d
      glove_filename: wiki_giga_2024_300_MFT20_vectors_seed_2024_alpha_0.75_eta_0.05_combined
    original:
      num_encoder_layers: 6
      num_decoder_layers: 6
      d_model: 512
      num_heads: 8
      d_ff: 2048
      norm_type: postnorm
training_hp:
  num_epochs: 2
  use_amp: true
  amp_dtype: bfloat16
  torch_compile_mode: max-autotune
  loss:
    type: crossentropy
    label_smoothing: 0.1
  optimizer:
    type: AdamW
    weight_decay: 0.0001
    beta1: 0.9
    beta2: 0.999
    eps: 1.0e-08
  lr_scheduler:
    type: original
    min_lr: 0.0003
    max_lr: 0.001
    warmup_iters: 25000
    stable_iters_prop: 0.7
  max_gradient_norm: 5.0
  early_stopping:
    enabled: false
    patience: 40000
    min_delta: 1.0e-05
tokenizer:
  type: word
  sos_token: <s>
  eos_token: </s>
  pad_token: <PAD>
  unk_token: <UNK>
  max_seq_len: 131
  max_vocab_size: 70000
  vocab_min_freq: 2
  src_sos_token_idx: 60932
  src_eos_token_idx: 60854
  src_pad_token_idx: 18895
  src_unk_token_idx: 3358
  tgt_sos_token_idx: 60933
  tgt_eos_token_idx: 60860
  tgt_pad_token_idx: 18800
  tgt_unk_token_idx: 3289
dataset:
  dataset_task: machine-translation
  dataset_id: Helsinki-NLP/europarl
  dataset_name: en-it
  train_split: 0.95
  src_lang: en
  tgt_lang: it
  max_len: -1
train_dataloader:
  batch_size: 64
  num_workers: 4
  shuffle: true
  drop_last: true
  prefetch_factor: 2
  pad_all_to_max_len: true
test_dataloader:
  batch_size: 128
  num_workers: 4
  shuffle: false
  drop_last: false
  prefetch_factor: 2
  pad_all_to_max_len: true
chosen_model_size: small
model_name: tfs_mt_small_260207-0915
exec_mode: dev
src_tokenizer_vocab_size: 70000
tgt_tokenizer_vocab_size: 70000
num_train_iters_per_epoch: 28889
num_test_iters_per_epoch: 761
```

</details>

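The `lr_scheduler` settings (`warmup_iters`, `stable_iters_prop`, `min_lr`/`max_lr`) suggest a warmup-stable-decay shape: ramp up to `max_lr`, hold it for a proportion of the remaining iterations, then decay toward `min_lr`. The sketch below assumes linear warmup and linear decay; the repo's actual `original` scheduler may differ:

```python
def lr_at(it, total_iters, warmup_iters=25000,
          min_lr=0.0003, max_lr=0.001, stable_prop=0.7):
    """Warmup-stable-decay schedule sketched from the config values.
    Assumed shape (linear warmup/decay), not the repo's exact code."""
    if it < warmup_iters:                       # linear warmup to max_lr
        return max_lr * it / warmup_iters
    stable_iters = int((total_iters - warmup_iters) * stable_prop)
    if it < warmup_iters + stable_iters:        # hold at max_lr
        return max_lr
    decay_iters = total_iters - warmup_iters - stable_iters
    frac = (it - warmup_iters - stable_iters) / max(decay_iters, 1)
    return max_lr - (max_lr - min_lr) * min(frac, 1.0)  # decay to min_lr

# With 2 epochs of 28889 iterations (per the config), total_iters = 57778.
print(lr_at(25000, 57778))  # 0.001
```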
## License


These model weights are licensed under the **MIT License**.


The base embedding weights used for training were sourced from GloVe. They are licensed under the
[ODC Public Domain Dedication and License (PDDL)](https://opendatacommons.org/licenses/pddl/1-0/).

## Citation


If you use `tfs-mt` in your research or project, please cite:


```bibtex
@software{Spadaro_tfs-mt,
  author = {Spadaro, Giovanni},
  license = {MIT, CC BY-SA 4.0},
  title = {{tfs-mt}},
  url = {https://github.com/Giovo17/tfs-mt}
}
```