---
language:
- en
- it
license: mit
tags:
- pytorch
- nlp
- machine-translation
pipeline_tag: translation
datasets:
- Helsinki-NLP/europarl
---
<h1 align="center">tfs-mt<br>
Transformer from scratch for Machine Translation</h1>
<div align="center">
<a href="https://github.com/Giovo17/tfs-mt/releases" alt="Release">
<img src="https://img.shields.io/github/v/release/Giovo17/tfs-mt"/>
</a>
<a href="https://github.com/Giovo17/tfs-mt/actions/workflows/main.yml?query=branch%3Amain" alt="Build status">
<img src="https://img.shields.io/github/actions/workflow/status/Giovo17/tfs-mt/main.yml?branch=main"/>
</a>
<a href="https://huggingface.co/giovo17/tfs-mt/blob/main/LICENSE" alt="License">
<img src="https://img.shields.io/badge/license-MIT-green.svg"/>
</a>
<br>
<a href="https://github.com/Giovo17/tfs-mt">
🏠 Homepage
</a>
<a href="https://giovo17.github.io/tfs-mt">
📖 Documentation
</a>
<a href="https://huggingface.co/spaces/giovo17/tfs-mt-demo">
🎬 Demo
</a>
<a href="https://pypi.org/project/tfs-mt">
📦 PyPi
</a>
</div>
---
This project implements the Transformer architecture from scratch, with machine translation as the use case. It is mainly intended as an educational resource and as a functional implementation of the architecture and of the training/inference logic.
This repository hosts the weights of the trained `small`-size Transformer and the pretrained tokenizers.
## Quick Start
```bash
pip install tfs-mt
```
```python
import torch
from tfs_mt.architecture import build_model
from tfs_mt.data_utils import WordTokenizer
from tfs_mt.decoding_utils import greedy_decoding
base_url = "https://huggingface.co/giovo17/tfs-mt/resolve/main/"
src_tokenizer = WordTokenizer.from_pretrained(base_url + "src_tokenizer_word.json")
tgt_tokenizer = WordTokenizer.from_pretrained(base_url + "tgt_tokenizer_word.json")
model = build_model(
config="https://huggingface.co/giovo17/tfs-mt/resolve/main/config-lock.yaml",
from_pretrained=True,
model_path="https://huggingface.co/giovo17/tfs-mt/resolve/main/model.safetensors",
)
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()
input_tokens, input_mask = src_tokenizer.encode("Hi, how are you?")
output = greedy_decoding(model, tgt_tokenizer, input_tokens, input_mask)[0]
print(output)
```
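Conceptually, `greedy_decoding` feeds the decoder its own output token by token and keeps the highest-probability candidate at each step. A minimal, library-agnostic sketch of the idea (illustrative only; the real `tfs_mt` signature differs, and `step_fn` here is a hypothetical stand-in for a model forward pass):

```python
def greedy_decode(step_fn, sos_idx, eos_idx, max_len):
    """Greedy decoding: at each step, append the argmax token.

    step_fn(tokens) -> list of scores over the vocabulary for the
    next position (a hypothetical stand-in for a model forward pass).
    """
    tokens = [sos_idx]
    for _ in range(max_len):
        scores = step_fn(tokens)
        next_token = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_token)
        if next_token == eos_idx:  # stop once end-of-sequence is emitted
            break
    return tokens

# Toy "model": prefers token 2, then switches to EOS (token 1) after 3 steps.
def toy_step(tokens):
    return [0.0, 1.0, 0.0] if len(tokens) >= 4 else [0.0, 0.0, 1.0]

print(greedy_decode(toy_step, sos_idx=0, eos_idx=1, max_len=10))  # [0, 2, 2, 2, 1]
```

Greedy search is the fastest decoding strategy but can miss globally better translations; beam search trades speed for quality by keeping several candidates alive.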
## Model Architecture
**Model Size**: `small`
- **Encoder Layers**: 6
- **Decoder Layers**: 6
- **Model Dimension**: 100
- **Attention Heads**: 6
- **FFN Dimension**: 400
- **Normalization Type**: postnorm
- **Dropout**: 0.1
- **Pretrained Embeddings**: GloVe
- **Positional Embeddings**: sinusoidal
- **GloVe Version**: glove.2024.wikigiga.100d
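The sinusoidal positional embeddings follow the fixed formulation from the original Transformer paper: even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions the matching cosine. A self-contained sketch for this model's d_model = 100 and max sequence length 131 (illustrative only; the package's own implementation may differ):

```python
import math

def sinusoidal_embeddings(max_seq_len=131, d_model=100):
    """Fixed sinusoidal positional encodings (Vaswani et al., 2017).

    Even dimensions use sin, odd dimensions use cos, with wavelengths
    forming a geometric progression up to 10000 * 2*pi.
    """
    pe = [[0.0] * d_model for _ in range(max_seq_len)]
    for pos in range(max_seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            pe[pos][i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_embeddings()
print(len(pe), len(pe[0]))  # 131 100
print(pe[0][:4])            # position 0: [0.0, 1.0, 0.0, 1.0]
```

Because these encodings are deterministic functions of position, they add no trainable parameters and extrapolate to any position seen at inference time.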
### Tokenizer
- **Type**: word
- **Max Sequence Length**: 131
- **Max Vocabulary Size**: 70000
- **Minimum Frequency**: 2
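A word-level tokenizer with these settings counts whitespace-split tokens and keeps only the frequent ones. A minimal sketch of how a vocabulary with `max_vocab_size` and `vocab_min_freq` could be built (not the actual `WordTokenizer` internals; special-token placement here is illustrative):

```python
from collections import Counter

def build_vocab(sentences, max_vocab_size=70000, min_freq=2,
                specials=("<s>", "</s>", "<PAD>", "<UNK>")):
    """Keep words seen at least min_freq times, most frequent first,
    plus the special tokens, capped at max_vocab_size entries."""
    counts = Counter(w for s in sentences for w in s.split())
    kept = [w for w, c in counts.most_common() if c >= min_freq]
    vocab = list(specials) + kept
    return {w: i for i, w in enumerate(vocab[:max_vocab_size])}

vocab = build_vocab(["the cat sat", "the cat ran", "a dog ran"])
# "the", "cat", "ran" appear twice; "sat", "a", "dog" only once.
print(sorted(w for w in vocab if w not in ("<s>", "</s>", "<PAD>", "<UNK>")))
# ['cat', 'ran', 'the']
```

Any word below the frequency cutoff or outside the size cap maps to `<UNK>` at encoding time, which is the main quality/size trade-off of word-level tokenization compared to subword schemes like BPE.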
## Dataset
- **Task**: machine-translation
- **Dataset ID**: `Helsinki-NLP/europarl`
- **Dataset Name**: `en-it`
- **Source Language**: en
- **Target Language**: it
- **Train Split**: 0.95
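A 0.95 train split means 95% of the parallel sentence pairs go to training and the remainder to evaluation. A quick sketch of one deterministic way to do this, assuming a seeded shuffle (the project's actual split logic may differ):

```python
import random

def train_test_split(pairs, train_frac=0.95, seed=42):
    """Shuffle deterministically, then cut at the train fraction."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * train_frac)
    return pairs[:cut], pairs[cut:]

# Hypothetical placeholder pairs standing in for Europarl en-it sentences.
pairs = [(f"en_{i}", f"it_{i}") for i in range(1000)]
train, test = train_test_split(pairs)
print(len(train), len(test))  # 950 50
```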
## Full training configuration
<details>
<summary>Click to expand complete config-lock.yaml</summary>
```yaml
seed: 42
log_every_iters: 1000
save_every_iters: 10000
eval_every_iters: 10000
update_pbar_every_iters: 100
time_limit_sec: -1
checkpoints_retain_n: 5
model_base_name: tfs_mt
model_parameters:
  dropout: 0.1
  model_configs:
    pretrained_word_embeddings: GloVe
    positional_embeddings: sinusoidal
    nano:
      num_encoder_layers: 4
      num_decoder_layers: 4
      d_model: 50
      num_heads: 4
      d_ff: 200
      norm_type: postnorm
      glove_version: glove.2024.wikigiga.50d
      glove_filename: wiki_giga_2024_50_MFT20_vectors_seed_123_alpha_0.75_eta_0.075_combined
    small:
      num_encoder_layers: 6
      num_decoder_layers: 6
      d_model: 100
      num_heads: 6
      d_ff: 400
      norm_type: postnorm
      glove_version: glove.2024.wikigiga.100d
      glove_filename: wiki_giga_2024_100_MFT20_vectors_seed_2024_alpha_0.75_eta_0.05.050_combined
    base:
      num_encoder_layers: 8
      num_decoder_layers: 8
      d_model: 300
      num_heads: 8
      d_ff: 800
      norm_type: postnorm
      glove_version: glove.2024.wikigiga.300d
      glove_filename: wiki_giga_2024_300_MFT20_vectors_seed_2024_alpha_0.75_eta_0.05_combined
    original:
      num_encoder_layers: 6
      num_decoder_layers: 6
      d_model: 512
      num_heads: 8
      d_ff: 2048
      norm_type: postnorm
training_hp:
  num_epochs: 2
  use_amp: true
  amp_dtype: bfloat16
  torch_compile_mode: max-autotune
  loss:
    type: crossentropy
    label_smoothing: 0.1
  optimizer:
    type: AdamW
    weight_decay: 0.0001
    beta1: 0.9
    beta2: 0.999
    eps: 1.0e-08
  lr_scheduler:
    type: original
    min_lr: 0.0003
    max_lr: 0.001
    warmup_iters: 25000
    stable_iters_prop: 0.7
  max_gradient_norm: 5.0
  early_stopping:
    enabled: false
    patience: 40000
    min_delta: 1.0e-05
tokenizer:
  type: word
  sos_token: <s>
  eos_token: </s>
  pad_token: <PAD>
  unk_token: <UNK>
  max_seq_len: 131
  max_vocab_size: 70000
  vocab_min_freq: 2
  src_sos_token_idx: 60932
  src_eos_token_idx: 60854
  src_pad_token_idx: 18895
  src_unk_token_idx: 3358
  tgt_sos_token_idx: 60933
  tgt_eos_token_idx: 60860
  tgt_pad_token_idx: 18800
  tgt_unk_token_idx: 3289
dataset:
  dataset_task: machine-translation
  dataset_id: Helsinki-NLP/europarl
  dataset_name: en-it
  train_split: 0.95
  src_lang: en
  tgt_lang: it
  max_len: -1
train_dataloader:
  batch_size: 64
  num_workers: 4
  shuffle: true
  drop_last: true
  prefetch_factor: 2
  pad_all_to_max_len: true
test_dataloader:
  batch_size: 128
  num_workers: 4
  shuffle: false
  drop_last: false
  prefetch_factor: 2
  pad_all_to_max_len: true
chosen_model_size: small
model_name: tfs_mt_small_260207-0915
exec_mode: dev
src_tokenizer_vocab_size: 70000
tgt_tokenizer_vocab_size: 70000
num_train_iters_per_epoch: 28889
num_test_iters_per_epoch: 761
```
</details>
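The `lr_scheduler` fields (`min_lr`, `max_lr`, `warmup_iters`, `stable_iters_prop`) suggest a warmup-stable-decay shape: linear warmup to `max_lr`, a constant plateau for a proportion of the remaining iterations, then decay to `min_lr`. A hedged sketch of that reading of the config (an assumption about the shape, not the package's exact formula):

```python
def wsd_lr(step, total_steps, warmup_iters=25000,
           min_lr=3e-4, max_lr=1e-3, stable_prop=0.7):
    """Warmup-stable-decay schedule (one plausible reading of the
    config's lr_scheduler fields; not the package's exact formula)."""
    if step < warmup_iters:                      # linear warmup to max_lr
        return max_lr * (step + 1) / warmup_iters
    stable_end = warmup_iters + int((total_steps - warmup_iters) * stable_prop)
    if step < stable_end:                        # constant plateau
        return max_lr
    # linear decay from max_lr down to min_lr over the remaining tail
    frac = (step - stable_end) / max(1, total_steps - stable_end)
    return max_lr + (min_lr - max_lr) * frac

total = 2 * 28889  # num_epochs * num_train_iters_per_epoch from the config
print(wsd_lr(0, total), wsd_lr(30000, total), wsd_lr(total, total))
```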
## License
These model weights are licensed under the **MIT License**.
The pretrained GloVe embeddings used for training are licensed under the
[ODC Public Domain Dedication and License (PDDL)](https://opendatacommons.org/licenses/pddl/1-0/).
## Citation
If you use `tfs-mt` in your research or project, please cite:
```bibtex
@software{Spadaro_tfs-mt,
author = {Spadaro, Giovanni},
licenses = {MIT, CC BY-SA 4.0},
title = {{tfs-mt}},
url = {https://github.com/Giovo17/tfs-mt}
}
```