---
language:
- en
- it
license: mit
tags:
- pytorch
- nlp
- machine-translation
pipeline_tag: translation
datasets:
- Helsinki-NLP/europarl
---

<h1 align="center">tfs-mt<br>
Transformer from scratch for Machine Translation</h1>

<div align="center">
<a href="https://img.shields.io/github/v/release/Giovo17/tfs-mt" alt="Release">
<img src="https://img.shields.io/github/v/release/Giovo17/tfs-mt"/>
</a>
<a href="https://github.com/Giovo17/tfs-mt/actions/workflows/main.yml?query=branch%3Amain" alt="Build status">
<img src="https://img.shields.io/github/actions/workflow/status/Giovo17/tfs-mt/main.yml?branch=main"/>
</a>
<a href="https://huggingface.co/giovo17/tfs-mt/blob/main/LICENSE" alt="License">
<img src="https://img.shields.io/badge/license-MIT-green.svg"/>
</a>
<br>
<a href="https://github.com/Giovo17/tfs-mt">
🏠 Homepage
</a>

<a href="https://giovo17.github.io/tfs-mt">
📖 Documentation
</a>

<a href="https://huggingface.co/spaces/giovo17/tfs-mt-demo">
🎬 Demo
</a>

<a href="https://pypi.org/project/tfs-mt">
📦 PyPi
</a>

</div>

---

This project implements the Transformer architecture from scratch, with Machine Translation as the use case. It is mainly intended as an educational resource, as well as a functional implementation of the architecture and the training/inference logic.

This repository hosts the weights of the trained `small` Transformer and the pretrained tokenizers.

## Quick Start

```bash
pip install tfs-mt
```

```python
import torch

from tfs_mt.architecture import build_model
from tfs_mt.data_utils import WordTokenizer
from tfs_mt.decoding_utils import greedy_decoding

base_url = "https://huggingface.co/giovo17/tfs-mt/resolve/main/"
src_tokenizer = WordTokenizer.from_pretrained(base_url + "src_tokenizer_word.json")
tgt_tokenizer = WordTokenizer.from_pretrained(base_url + "tgt_tokenizer_word.json")

model = build_model(
    config=base_url + "config-lock.yaml",
    from_pretrained=True,
    model_path=base_url + "model.safetensors",
)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

input_tokens, input_mask = src_tokenizer.encode("Hi, how are you?")

output = greedy_decoding(model, tgt_tokenizer, input_tokens, input_mask)[0]
print(output)
```
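Greedy decoding simply appends the highest-scoring token at each step until the end-of-sequence token appears. A minimal, self-contained sketch of that loop — the `step_logits` callable is a toy stand-in for a decoder forward pass, not the `tfs_mt` API:

```python
def greedy_decode(step_logits, sos_id, eos_id, max_len=10):
    """Greedy decoding: append the argmax token at each step.

    `step_logits(prefix)` stands in for a decoder forward pass and
    returns a list of scores over the vocabulary (illustrative only).
    """
    tokens = [sos_id]
    for _ in range(max_len):
        scores = step_logits(tokens)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:  # stop as soon as </s> is produced
            break
    return tokens

# Toy "model" over vocab {0: <s>, 1: </s>, 2: "ciao"}:
# emit token 2 after the start token, then end the sequence.
def toy_logits(prefix):
    return [0.0, 0.1, 1.0] if len(prefix) == 1 else [0.0, 1.0, 0.1]

print(greedy_decode(toy_logits, sos_id=0, eos_id=1))  # [0, 2, 1]
```

Beam search would instead keep the k best prefixes at each step; greedy decoding is the k=1 special case.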

## Model Architecture

**Model Size**: `small`

- **Encoder Layers**: 6
- **Decoder Layers**: 6
- **Model Dimension**: 100
- **Attention Heads**: 6
- **FFN Dimension**: 400
- **Normalization Type**: postnorm
- **Dropout**: 0.1
- **Pretrained Embeddings**: GloVe
- **Positional Embeddings**: sinusoidal
- **GloVe Version**: glove.2024.wikigiga.100d

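The hyperparameters above give a rough sense of the model's size. A back-of-the-envelope estimate (ignoring biases, layer norms, and any embedding tying, so not the exact `tfs_mt` count):

```python
# Rough parameter estimate for the `small` config. Exact totals depend
# on implementation details (biases, tied embeddings, layer norm), so
# this is a sketch, not the authoritative tfs_mt number.
d_model, d_ff, n_enc, n_dec, vocab = 100, 400, 6, 6, 70_000

attn = 4 * d_model * d_model      # Q, K, V and output projections
ffn = 2 * d_model * d_ff          # two linear layers
enc_layer = attn + ffn            # self-attention + FFN
dec_layer = 2 * attn + ffn        # self- and cross-attention + FFN

embeddings = 2 * vocab * d_model  # source and target embedding tables
total = embeddings + n_enc * enc_layer + n_dec * dec_layer
print(f"~{total / 1e6:.1f}M parameters")  # ~15.7M; embeddings dominate
```

At d_model = 100 with a 70k vocabulary, the two embedding tables account for roughly 14M of the ~15.7M weights, which is why the GloVe initialization matters so much at this scale.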
### Tokenizer

- **Type**: word
- **Max Sequence Length**: 131
- **Max Vocabulary Size**: 70000
- **Minimum Frequency**: 2

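Conceptually, a word-level tokenizer of this kind maps whitespace-split words to vocabulary ids, falls back to `<UNK>` for out-of-vocabulary words, wraps the sequence in `<s>`/`</s>`, and pads to a fixed length. A minimal illustrative sketch with a toy vocabulary (the actual `WordTokenizer` may differ in details):

```python
# Toy word-level tokenizer sketch: vocabulary lookup with <UNK> fallback,
# <s>/</s> wrapping, and padding to a fixed length. Illustrative only.
SOS, EOS, PAD, UNK = "<s>", "</s>", "<PAD>", "<UNK>"
vocab = {PAD: 0, UNK: 1, SOS: 2, EOS: 3, "hi": 4, "how": 5, "are": 6, "you": 7}

def encode(text, max_seq_len=10):
    ids = [vocab[SOS]]
    ids += [vocab.get(w, vocab[UNK]) for w in text.lower().split()]
    ids.append(vocab[EOS])
    ids = ids[:max_seq_len]
    mask = [1] * len(ids) + [0] * (max_seq_len - len(ids))  # 1 = real token
    ids = ids + [vocab[PAD]] * (max_seq_len - len(ids))
    return ids, mask

ids, mask = encode("hi how are you")
print(ids)   # [2, 4, 5, 6, 7, 3, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```

The padding mask is what the Quick Start passes to `greedy_decoding` alongside the token ids, so attention can ignore `<PAD>` positions.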

## Dataset

- **Task**: machine-translation
- **Dataset ID**: `Helsinki-NLP/europarl`
- **Dataset Name**: `en-it`
- **Source Language**: en
- **Target Language**: it
- **Train Split**: 0.95

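The 0.95 train split means 95% of the sentence pairs go to training and the remaining 5% to evaluation. A minimal sketch of such a split on a toy pair list, seeded like the config (`seed: 42`) for reproducibility — the real pipeline loads `Helsinki-NLP/europarl` via the `datasets` library:

```python
# Sketch of a deterministic 0.95/0.05 split on toy sentence pairs.
import random

pairs = [(f"en sentence {i}", f"it frase {i}") for i in range(1000)]
rng = random.Random(42)   # seed from config-lock.yaml, for reproducibility
rng.shuffle(pairs)

split = int(0.95 * len(pairs))
train, test = pairs[:split], pairs[split:]
print(len(train), len(test))  # 950 50
```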
## Full training configuration

<details>
<summary>Click to expand the complete config-lock.yaml</summary>

```yaml
seed: 42
log_every_iters: 1000
save_every_iters: 10000
eval_every_iters: 10000
update_pbar_every_iters: 100
time_limit_sec: -1
checkpoints_retain_n: 5
model_base_name: tfs_mt
model_parameters:
  dropout: 0.1
  model_configs:
    pretrained_word_embeddings: GloVe
    positional_embeddings: sinusoidal
    nano:
      num_encoder_layers: 4
      num_decoder_layers: 4
      d_model: 50
      num_heads: 4
      d_ff: 200
      norm_type: postnorm
      glove_version: glove.2024.wikigiga.50d
      glove_filename: wiki_giga_2024_50_MFT20_vectors_seed_123_alpha_0.75_eta_0.075_combined
    small:
      num_encoder_layers: 6
      num_decoder_layers: 6
      d_model: 100
      num_heads: 6
      d_ff: 400
      norm_type: postnorm
      glove_version: glove.2024.wikigiga.100d
      glove_filename: wiki_giga_2024_100_MFT20_vectors_seed_2024_alpha_0.75_eta_0.05.050_combined
    base:
      num_encoder_layers: 8
      num_decoder_layers: 8
      d_model: 300
      num_heads: 8
      d_ff: 800
      norm_type: postnorm
      glove_version: glove.2024.wikigiga.300d
      glove_filename: wiki_giga_2024_300_MFT20_vectors_seed_2024_alpha_0.75_eta_0.05_combined
    original:
      num_encoder_layers: 6
      num_decoder_layers: 6
      d_model: 512
      num_heads: 8
      d_ff: 2048
      norm_type: postnorm
training_hp:
  num_epochs: 2
  use_amp: true
  amp_dtype: bfloat16
  torch_compile_mode: max-autotune
  loss:
    type: crossentropy
    label_smoothing: 0.1
  optimizer:
    type: AdamW
    weight_decay: 0.0001
    beta1: 0.9
    beta2: 0.999
    eps: 1.0e-08
  lr_scheduler:
    type: original
    min_lr: 0.0003
    max_lr: 0.001
    warmup_iters: 25000
    stable_iters_prop: 0.7
  max_gradient_norm: 5.0
  early_stopping:
    enabled: false
    patience: 40000
    min_delta: 1.0e-05
tokenizer:
  type: word
  sos_token: <s>
  eos_token: </s>
  pad_token: <PAD>
  unk_token: <UNK>
  max_seq_len: 131
  max_vocab_size: 70000
  vocab_min_freq: 2
  src_sos_token_idx: 60932
  src_eos_token_idx: 60854
  src_pad_token_idx: 18895
  src_unk_token_idx: 3358
  tgt_sos_token_idx: 60933
  tgt_eos_token_idx: 60860
  tgt_pad_token_idx: 18800
  tgt_unk_token_idx: 3289
dataset:
  dataset_task: machine-translation
  dataset_id: Helsinki-NLP/europarl
  dataset_name: en-it
  train_split: 0.95
  src_lang: en
  tgt_lang: it
  max_len: -1
train_dataloader:
  batch_size: 64
  num_workers: 4
  shuffle: true
  drop_last: true
  prefetch_factor: 2
  pad_all_to_max_len: true
test_dataloader:
  batch_size: 128
  num_workers: 4
  shuffle: false
  drop_last: false
  prefetch_factor: 2
  pad_all_to_max_len: true
chosen_model_size: small
model_name: tfs_mt_small_260207-0915
exec_mode: dev
src_tokenizer_vocab_size: 70000
tgt_tokenizer_vocab_size: 70000
num_train_iters_per_epoch: 28889
num_test_iters_per_epoch: 761
```

</details>
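The `lr_scheduler` fields (`min_lr`, `max_lr`, `warmup_iters`, `stable_iters_prop`) suggest a warmup–stable–decay shape. The exact curve behind `type: original` is defined inside `tfs_mt`; as a hedged sketch only, one common way to drive a schedule from those fields is:

```python
# Hedged sketch of a warmup-stable-decay schedule built from the
# lr_scheduler fields above. Not the tfs_mt implementation of
# `type: original` -- purely illustrative.
def lr_at(step, total_iters, warmup_iters=25_000,
          min_lr=3e-4, max_lr=1e-3, stable_prop=0.7):
    stable_end = warmup_iters + int(stable_prop * (total_iters - warmup_iters))
    if step < warmup_iters:                   # linear warmup to the peak
        return min_lr + (max_lr - min_lr) * step / warmup_iters
    if step < stable_end:                     # hold at the peak rate
        return max_lr
    frac = (step - stable_end) / max(1, total_iters - stable_end)
    return max_lr - (max_lr - min_lr) * frac  # linear decay back to min_lr

# Total iterations from the config: num_epochs * num_train_iters_per_epoch.
total = 2 * 28_889
print(lr_at(0, total), lr_at(30_000, total))  # warmup start, stable plateau
```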

## License

These model weights are licensed under the **MIT License**.

The base word embeddings used for training were sourced from GloVe. They are licensed under the
[ODC Public Domain Dedication and License (PDDL)](https://opendatacommons.org/licenses/pddl/1-0/).

## Citation

If you use `tfs-mt` in your research or project, please cite:

```bibtex
@software{Spadaro_tfs-mt,
  author = {Spadaro, Giovanni},
  licenses = {MIT, CC BY-SA 4.0},
  title = {{tfs-mt}},
  url = {https://github.com/Giovo17/tfs-mt}
}
```