Upload folder using huggingface_hub
- .gitignore +1 -0
- CODE_OF_CONDUCT.md +128 -0
- LICENSE +21 -0
- README.md +8 -3
- images/google.png +0 -0
- images/inference.png +0 -0
- pyproject.toml +5 -0
- src/__pycache__/config.cpython-39.pyc +0 -0
- src/__pycache__/dataset.cpython-39.pyc +0 -0
- src/__pycache__/model.cpython-39.pyc +0 -0
- src/__pycache__/train.cpython-39.pyc +0 -0
- src/config.py +24 -0
- src/dataset.py +111 -0
- src/inference.py +36 -0
- src/model.py +433 -0
- src/train.py +294 -0
- tokenizer_en.json +0 -0
- tokenizer_it.json +0 -0
- uv.lock +0 -0
- weights/tmodel_19.pt +3 -0

.gitignore
ADDED
@@ -0,0 +1 @@
.venv

CODE_OF_CONDUCT.md
ADDED
@@ -0,0 +1,128 @@
# Contributor Covenant Code of Conduct

## Our Pledge

We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, religion, or sexual identity
and orientation.

We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.

## Our Standards

Examples of behavior that contributes to a positive environment for our
community include:

* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
  and learning from the experience
* Focusing on what is best not just for us as individuals, but for the
  overall community

Examples of unacceptable behavior include:

* The use of sexualized language or imagery, and sexual attention or
  advances of any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email
  address, without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
  professional setting

## Enforcement Responsibilities

Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.

Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.

## Scope

This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official e-mail address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement at
shubh622005@gmail.com.
All complaints will be reviewed and investigated promptly and fairly.

All community leaders are obligated to respect the privacy and security of the
reporter of any incident.

## Enforcement Guidelines

Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:

### 1. Correction

**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.

**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.

### 2. Warning

**Community Impact**: A violation through a single incident or series
of actions.

**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or
permanent ban.

### 3. Temporary Ban

**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.

**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.

### 4. Permanent Ban

**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.

**Consequence**: A permanent ban from any sort of public interaction within
the community.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 2.0, available at
https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.

Community Impact Guidelines were inspired by [Mozilla's code of conduct
enforcement ladder](https://github.com/mozilla/diversity).

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see the FAQ at
https://www.contributor-covenant.org/faq. Translations are available at
https://www.contributor-covenant.org/translations.

LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 Shubham

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md
CHANGED
@@ -1,3 +1,8 @@
Implementation of the transformer architecture based on [Attention is All You Need](https://arxiv.org/abs/1706.03762)

### Inference outputs

<div align="center">
<img src="/images/inference.png" alt="Inference" width="400" height="300" style="margin: 0 10px;">
<img src="/images/google.png" alt="Google" width="400" height="300" style="margin: 0 10px;">
</div>

images/google.png
ADDED

images/inference.png
ADDED

pyproject.toml
ADDED
@@ -0,0 +1,5 @@
[project]
name = "transformers"
version = "0.1.0"
requires-python = ">=3.9"
dependencies = ["torch==2.8.0", "datasets==4.0.0", "tokenizers==0.22.0"]

src/__pycache__/config.cpython-39.pyc
ADDED
Binary file (789 Bytes).

src/__pycache__/dataset.cpython-39.pyc
ADDED
Binary file (2.72 kB).

src/__pycache__/model.cpython-39.pyc
ADDED
Binary file (15.9 kB).

src/__pycache__/train.cpython-39.pyc
ADDED
Binary file (6.4 kB).

src/config.py
ADDED
@@ -0,0 +1,24 @@
from pathlib import Path


def get_config():
    return {
        "batch_size": 8,
        "num_epochs": 20,
        "lr": 10**-4,
        "seq_len": 350,
        "d_model": 512,
        "lang_src": "en",
        "lang_target": "it",
        "model_folder": "weights",
        "model_basename": "tmodel_",
        "preload": None,
        "tokenizer_file": "tokenizer_{0}.json",
        "experiment_name": "runs/tmodel",
    }


def get_weights_file_path(config, epoch: str):
    model_folder = config["model_folder"]
    model_filename = f"{config['model_basename']}{epoch}.pt"
    return str(Path(".") / model_folder / model_filename)

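A quick sketch of how these two helpers compose (not part of the commit; assumes the defaults above):

    from config import get_config, get_weights_file_path

    config = get_config()
    # model_folder="weights" and model_basename="tmodel_" mean that epoch "19"
    # resolves to the checkpoint shipped in this commit:
    print(get_weights_file_path(config, "19"))  # weights/tmodel_19.pt
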
src/dataset.py
ADDED
@@ -0,0 +1,111 @@
import torch
from torch.utils.data import Dataset


class BilingualDataset(Dataset):
    def __init__(
        self, dataset, tokenizer_src, tokenizer_target, src_lang, target_lang, seq_len
    ):
        """
        Initializes a new instance of this Dataset for one language pair of
        https://huggingface.co/datasets/Helsinki-NLP/opus_books
        """
        super().__init__()
        self.seq_len = seq_len
        self.src_lang = src_lang
        self.tokenizer_target = tokenizer_target
        self.tokenizer_src = tokenizer_src
        self.target_lang = target_lang
        self.dataset = dataset

        self.start_of_sentence_token = torch.tensor(
            [tokenizer_target.token_to_id("[SOS]")], dtype=torch.int64
        )
        self.end_of_sentence_token = torch.tensor(
            [tokenizer_target.token_to_id("[EOS]")], dtype=torch.int64
        )
        self.padding_token = torch.tensor(
            [tokenizer_target.token_to_id("[PAD]")], dtype=torch.int64
        )

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        """
        Takes the text of a sentence pair from the dataset, tokenizes it with
        tokenizer_src and tokenizer_target respectively, and constructs the tensors passed to the transformer
        """
        src_target_pair = self.dataset[index]
        src_text = src_target_pair["translation"][self.src_lang]
        target_text = src_target_pair["translation"][self.target_lang]

        encoder_input_tokens = self.tokenizer_src.encode(src_text).ids
        decoder_input_tokens = self.tokenizer_target.encode(target_text).ids

        enc_num_padding_tokens = self.seq_len - len(encoder_input_tokens) - 2
        dec_num_padding_tokens = self.seq_len - len(decoder_input_tokens) - 1

        if enc_num_padding_tokens < 0 or dec_num_padding_tokens < 0:
            raise ValueError("Sentence is too long")

        encoder_input = torch.cat(
            [
                self.start_of_sentence_token,
                torch.tensor(encoder_input_tokens, dtype=torch.int64),
                self.end_of_sentence_token,
                torch.tensor(
                    [self.padding_token] * enc_num_padding_tokens, dtype=torch.int64
                ),
            ],
            dim=0,
        )

        decoder_input = torch.cat(
            [
                self.start_of_sentence_token,
                torch.tensor(decoder_input_tokens, dtype=torch.int64),
                torch.tensor(
                    [self.padding_token] * dec_num_padding_tokens, dtype=torch.int64
                ),
            ],
            dim=0,
        )

        label = torch.cat(
            [
                torch.tensor(decoder_input_tokens, dtype=torch.int64),
                self.end_of_sentence_token,
                torch.tensor(
                    [self.padding_token] * dec_num_padding_tokens, dtype=torch.int64
                ),
            ],
            dim=0,
        )

        assert encoder_input.size(0) == self.seq_len
        assert decoder_input.size(0) == self.seq_len
        assert label.size(0) == self.seq_len

        return {
            "encoder_input": encoder_input,  # (seq_len)
            "decoder_input": decoder_input,  # (seq_len)
            "encoder_mask": (encoder_input != self.padding_token)
            .unsqueeze(0)
            .unsqueeze(0)
            .int(),  # (1, 1, seq_len) adding the sequence dimension and batch dimension
            "decoder_mask": (decoder_input != self.padding_token).unsqueeze(0).int()
            & causal_mask(
                decoder_input.size(0)
            ),  # (1, seq_len) & (1, seq_len, seq_len)
            "label": label,  # (seq_len)
            "src_text": src_text,
            "tgt_text": target_text,
        }


def causal_mask(size):
    # torch.triu keeps everything above the diagonal, so we invert it with
    # mask == 0 in the return, since we need everything on and below the diagonal
    mask = torch.triu(torch.ones((1, size, size)), diagonal=1).type(torch.int)
    return mask == 0

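For intuition, a small check of what causal_mask produces (a sketch, not part of the commit):

    from dataset import causal_mask

    print(causal_mask(3))
    # tensor([[[ True, False, False],
    #          [ True,  True, False],
    #          [ True,  True,  True]]])
    # Position i may only attend to positions <= i; this is what decoder_mask
    # combines with the padding mask above.
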
src/inference.py
ADDED
@@ -0,0 +1,36 @@
import torch
from train import get_model, greedy_decode, get_or_build_tokenizer
from config import get_config

INPUT_TEXT = "sun rises in the night"


def inference():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    device = torch.device(device)

    config = get_config()

    tokenizer_src = get_or_build_tokenizer(config, None, config["lang_src"])
    tokenizer_tgt = get_or_build_tokenizer(config, None, config["lang_target"])

    model = get_model(config, tokenizer_src.get_vocab_size(), tokenizer_tgt.get_vocab_size()).to(device)

    model_filename = "weights/tmodel_19.pt"
    state = torch.load(model_filename, map_location=device)
    model.load_state_dict(state["model_state_dict"])
    model.eval()

    tokens = tokenizer_src.encode(INPUT_TEXT).ids
    tokens = [tokenizer_src.token_to_id("[SOS]")] + tokens + [tokenizer_src.token_to_id("[EOS]")]
    encoder_input = torch.tensor(tokens, dtype=torch.long).unsqueeze(0).to(device)
    encoder_mask = (encoder_input != tokenizer_src.token_to_id("[PAD]")).unsqueeze(0).unsqueeze(0).to(device)

    model_out = greedy_decode(model, encoder_input, encoder_mask, tokenizer_src, tokenizer_tgt, config["seq_len"], device)
    output_text = tokenizer_tgt.decode(model_out.detach().cpu().numpy())

    print("Source:", INPUT_TEXT)
    print("Predicted:", output_text)


if __name__ == "__main__":
    inference()

src/model.py
ADDED
@@ -0,0 +1,433 @@
# In PyTorch, the forward function of each nn.Module is called automatically when the module is called, so we do not need to invoke it explicitly.

import torch
import torch.nn as nn
import math


class InputEmbeddings(nn.Module):
    def __init__(self, d_model: int, vocab_size: int) -> None:
        """
        vocab_size: number of words in the vocabulary
        d_model: dimension of the model
        1. Creates an embedding of size d_model for each word in the vocab
        """
        super().__init__()
        self.d_model = d_model
        self.vocab_size = vocab_size
        self.embeddings = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        """
        x: (batch_size, seq_len)
        return: (batch_size, seq_len, d_model)
        Convert the input words to their corresponding embeddings
        """
        # multiplying by sqrt(self.d_model) to scale the embeddings
        return self.embeddings(x) * math.sqrt(self.d_model)


class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, seq_len: int, dropout: float) -> None:
        """
        seq_len: maximum length of the input sentence
        d_model: dimension of the model
        dropout: dropout rate
        1. Create a matrix of shape (seq_len, d_model) with all values set to 0
        2. Create a position vector of shape (seq_len, 1) with values from 0 to seq_len-1
        3. Create a denominator vector of shape (d_model/2) with values from 0 to d_model/2-1
           and apply the formula: exp(-log(10000) * (2i/d_model))
        4. Apply the sine function to the even indices of the positional encoding matrix
           and the cosine function to the odd indices
        5. Add a batch dimension to the positional encoding matrix and register it as a buffer
        """
        super().__init__()
        self.d_model = d_model
        self.seq_len = seq_len
        # dropout prevents overfitting of the model, randomly zeroes some values
        self.dropout = nn.Dropout(dropout)

        positional_encoding = torch.zeros(seq_len, d_model)  # (seq_len, d_model)
        position_vector = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(
            1
        )  # (seq_len, 1)
        denominator = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10_000.0) / d_model)
        )  # (d_model/2, )

        positional_encoding[:, 0::2] = torch.sin(position_vector * denominator)
        positional_encoding[:, 1::2] = torch.cos(position_vector * denominator)

        # we unsqueeze to make it broadcastable over the batch dimension (batch_size, seq_len, d_model) + (1, seq_len, d_model)
        positional_encoding = positional_encoding.unsqueeze(0)  # (1, seq_len, d_model)
        self.register_buffer("positional_encoding", positional_encoding)

    def forward(self, x):
        """
        x: (batch_size, seq_len, d_model)
        return: (batch_size, seq_len, d_model)
        Add positional encoding to the input embeddings
        """
        x = x + (self.positional_encoding[:, : x.shape[1], :]).requires_grad_(False)
        return self.dropout(x)


class LayerNormalization(nn.Module):
    def __init__(self, features: int, epsilon: float = 10**-6) -> None:
        """
        features: number of features over which we perform layer normalization, i.e. d_model
        epsilon: a very small number to prevent division by a very small number or 0
        """
        super().__init__()
        self.epsilon = epsilon

        self.alpha = nn.Parameter(torch.ones(features))
        self.beta = nn.Parameter(torch.zeros(features))

    def forward(self, x):
        """
        x: (batch_size, seq_len, features)
        return: (batch_size, seq_len, features)
        Implements the layer normalization formula
        """
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True)
        return self.alpha * (x - mean) / (std + self.epsilon) + self.beta


class FeedForwardBlock(nn.Module):
    def __init__(self, d_model: int, d_ff: int, dropout: float) -> None:
        """
        d_model: dimension of the model. It is the input dimension of the first layer of our feed forward network.
        d_ff: dimension of the hidden layer. It is usually larger than the input dimension, i.e. d_model

        Architecture:
        Input (batch_size, seq_len, d_model)
            -> Linear(d_model → d_ff)
            -> ReLU (non-linearity)
            -> Dropout
            -> Linear(d_ff → d_model)
        Output (batch_size, seq_len, d_model)
        """
        super().__init__()
        self.layer_1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.layer_2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.layer_2(self.dropout(torch.relu(self.layer_1(x))))


class MultiHeadAttentionBlock(nn.Module):
    def __init__(self, d_model: int, head: int, dropout: float) -> None:
        """
        d_model: dimension of the model.
        head: number of parts we have to break the multihead attention block into
        Initialize four linear layers of size d_model by d_model which we will use later
        """
        super().__init__()
        self.d_model = d_model
        self.heads = head
        assert d_model % head == 0, "Head should completely divide the model dimensions"

        self.d_k = d_model // head
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)

        self.w_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    @staticmethod
    def attention(query, key, value, mask, dropout: nn.Dropout):
        """
        query, key and value are the input matrices used to calculate the attention.
        mask is used where we need to ignore the interactions between certain values.
        For example, in a decoder we mask all the keys ahead of the current word.
        Similarly, we ignore all the padded elements in a sentence.

        This function implements the attention calculation logic.
        """
        d_k = query.shape[-1]
        attention_scores = (query @ key.transpose(-2, -1)) / math.sqrt(
            d_k
        )  # "@" represents matrix multiplication in pytorch

        if mask is not None:
            attention_scores.masked_fill_(mask == 0, float("-inf"))
        attention_scores = attention_scores.softmax(dim=-1)

        if dropout is not None:
            attention_scores = dropout(attention_scores)

        return (attention_scores @ value), attention_scores

    def forward(self, query, key, value, mask):
        query = self.w_q(query)
        key = self.w_k(key)
        value = self.w_v(value)

        # We now divide the matrices into `heads` parts.
        # (batch_size, seq_len, d_model) --> (batch_size, seq_len, head, (d_model // head)) --> (batch_size, head, seq_len, (d_model // head))
        query = query.view(
            query.shape[0], query.shape[1], self.heads, self.d_k
        ).transpose(1, 2)
        key = key.view(key.shape[0], key.shape[1], self.heads, self.d_k).transpose(1, 2)
        value = value.view(
            value.shape[0], value.shape[1], self.heads, self.d_k
        ).transpose(1, 2)

        # Calculate the attention values and the final output after multiplying it with `value`
        x, self.attention_scores = MultiHeadAttentionBlock.attention(
            query, key, value, mask, self.dropout
        )
        # (batch_size, head, seq_len, (d_model // head)) --> (batch_size, seq_len, head, (d_model // head)) --> (batch_size, seq_len, d_model)
        x = x.transpose(1, 2).contiguous().view(x.shape[0], -1, self.heads * self.d_k)

        return self.w_o(x)


class ResidualConnection(nn.Module):
    def __init__(self, features: int, dropout: float) -> None:
        """
        This class is basically a wrapper around all the blocks that we'll use in the transformer.
        It passes the input through the given sublayer and automatically applies dropout and layer normalization to keep values from going out of bounds.

        [LayerNorm -> Sublayer -> Dropout] + Input
        """
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.norm = LayerNormalization(features=features)

    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))


class EncoderBlock(nn.Module):
    def __init__(
        self,
        features: int,
        self_attention_block: MultiHeadAttentionBlock,
        feed_forward_block: FeedForwardBlock,
        dropout: float,
    ) -> None:
        """
        This defines the structure of the encoder block.
        First is the multihead self attention block and second is the feed forward block
        """
        super().__init__()
        self.self_attention_block = self_attention_block
        self.feed_forward_block = feed_forward_block
        self.dropout = dropout
        self.residual_connections = nn.ModuleList(
            [ResidualConnection(features, dropout) for _ in range(2)]
        )

    def forward(self, x, src_mask):
        x = self.residual_connections[0](
            x, lambda x: self.self_attention_block(x, x, x, src_mask)
        )
        x = self.residual_connections[1](x, self.feed_forward_block)
        return x


class Encoder(nn.Module):
    def __init__(self, features: int, layers: nn.ModuleList) -> None:
        """
        This is the main Encoder class built up of multiple "EncoderBlock" classes
        """
        super().__init__()
        self.layers = layers
        self.norm = LayerNormalization(features=features)

    def forward(self, x, mask):
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)


class DecoderBlock(nn.Module):
    def __init__(
        self,
        self_attention_block: MultiHeadAttentionBlock,
        cross_attention_block: MultiHeadAttentionBlock,
        feed_forward_layer: FeedForwardBlock,
        features: int,
        dropout: float,
    ) -> None:
        """
        This class defines the structure of the decoder block.
        First is the masked multihead self attention layer, which takes in the target embeddings.
        Second is the cross multihead attention layer, which takes the query from the decoder but the key and value from the encoder.
        Third is the feed forward layer, which takes the output of the cross multihead attention.
        """
        super().__init__()
        self.self_attention_block = self_attention_block
        self.cross_attention_block = cross_attention_block
        self.feed_forward_layer = feed_forward_layer
        self.residual_connections = nn.ModuleList(
            [ResidualConnection(features, dropout) for _ in range(3)]
        )

    def forward(self, x, encoder_output, target_mask, src_mask):
        x = self.residual_connections[0](
            x, lambda x: self.self_attention_block(x, x, x, target_mask)
        )
        x = self.residual_connections[1](
            x,
            lambda x: self.cross_attention_block(
                x, encoder_output, encoder_output, src_mask
            ),
        )
        x = self.residual_connections[2](x, self.feed_forward_layer)
        return x


class Decoder(nn.Module):
    def __init__(self, layers: nn.ModuleList, features: int) -> None:
        """
        This is the main "Decoder" class built up of multiple "DecoderBlock" classes
        """
        super().__init__()
        self.layers = layers
        self.norm = LayerNormalization(features=features)

    def forward(self, x, encoder_output, target_mask, src_mask):
        for layer in self.layers:
            x = layer(x, encoder_output, target_mask, src_mask)
        return self.norm(x)


class ProjectionLayer(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        """
        The output of the decoder block is passed through a linear layer and then a softmax to convert the vector embedding back to the vocabulary
        """
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        return torch.log_softmax(self.proj(x), dim=-1)


class Transformer(nn.Module):
    def __init__(
        self,
        encoder: Encoder,
        decoder: Decoder,
        src_embedding: InputEmbeddings,
        target_embedding: InputEmbeddings,
        src_position: PositionalEncoding,
        target_position: PositionalEncoding,
        projection_layer: ProjectionLayer,
    ) -> None:
        """
        This is the main transformer class that encompasses the encoder, the decoder and the projection layer.
        """
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embedding = src_embedding
        self.target_embedding = target_embedding
        self.src_position = src_position
        self.target_position = target_position
        self.projection_layer = projection_layer

    def encode(self, src, src_mask):
        src = self.src_embedding(src)
        src = self.src_position(src)
        return self.encoder(src, src_mask)

    def decode(self, encoder_output, src_mask, target, target_mask):
        target = self.target_embedding(target)
        target = self.target_position(target)
        return self.decoder(target, encoder_output, target_mask, src_mask)

    def projection(self, x):
        return self.projection_layer(x)


def build_transformer(
    src_vocab_size: int,
    target_vocab_size: int,
    src_seq_len: int,
    target_seq_len: int,
    d_model: int = 512,
    N: int = 6,
    head: int = 8,
    dropout: float = 0.1,
    d_ff: int = 2048,
) -> Transformer:
    """
    src_vocab_size: number of words in the source vocab
    target_vocab_size: number of words in the target vocab
    src_seq_len: the maximum number of words in a source sentence
    target_seq_len: the maximum number of words in a target sentence, usually equal to src_seq_len
    d_model: the size of the model, i.e. the size of the embedding vector
    N: number of times the encoder/decoder blocks are repeated in the architecture
    head: number of splits to make in multihead attention
    dropout: dropout applied after each step
    d_ff: neurons in the inner layer of the feed forward network
    """
    src_embeddings = InputEmbeddings(d_model, src_vocab_size)
    target_embeddings = InputEmbeddings(d_model, target_vocab_size)

    src_positional_embeddings = PositionalEncoding(d_model, src_seq_len, dropout)
    target_positional_embeddings = PositionalEncoding(d_model, target_seq_len, dropout)

    encoder_blocks = []
    for i in range(N):
        encoder_self_multi_head_attention_block = MultiHeadAttentionBlock(
            d_model, head, dropout
        )
        feed_forward_layer = FeedForwardBlock(d_model, d_ff, dropout)
        encoder_blocks.append(
            EncoderBlock(
                d_model,
                encoder_self_multi_head_attention_block,
                feed_forward_layer,
                dropout,
            )
        )

    decoder_blocks = []
    for i in range(N):
        decoder_masked_multi_head_attention_block = MultiHeadAttentionBlock(
            d_model, head, dropout
        )
        cross_multihead_attention_block = MultiHeadAttentionBlock(
            d_model, head, dropout
        )
        feed_forward_layer = FeedForwardBlock(d_model, d_ff, dropout)
        decoder_blocks.append(
            DecoderBlock(
                decoder_masked_multi_head_attention_block,
                cross_multihead_attention_block,
                feed_forward_layer,
                d_model,
                dropout,
            )
        )

    encoder = Encoder(d_model, nn.ModuleList(encoder_blocks))
    decoder = Decoder(nn.ModuleList(decoder_blocks), d_model)

    projection_layer = ProjectionLayer(d_model, target_vocab_size)

    transformer = Transformer(
        encoder,
        decoder,
        src_embeddings,
        target_embeddings,
        src_positional_embeddings,
        target_positional_embeddings,
        projection_layer,
    )

    # Initialize the parameters with sensible defaults (Xavier uniform)
    for p in transformer.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)

    return transformer

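A minimal shape check for the builder (a sketch with made-up vocabulary sizes, not part of the commit):

    import torch
    from model import build_transformer

    model = build_transformer(
        src_vocab_size=100, target_vocab_size=100,
        src_seq_len=10, target_seq_len=10,
        d_model=64, N=2, head=4, d_ff=128,
    )
    src = torch.randint(0, 100, (1, 10))  # (batch, seq_len) of token ids
    src_mask = torch.ones(1, 1, 1, 10)    # nothing masked out
    print(model.encode(src, src_mask).shape)  # torch.Size([1, 10, 64])
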
src/train.py
ADDED
@@ -0,0 +1,294 @@
import torch
import torch.nn as nn
from config import get_config, get_weights_file_path
from torch.utils.data import random_split, DataLoader
from datasets import load_dataset
from tokenizers import Tokenizer
from dataset import BilingualDataset, causal_mask
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace
from pathlib import Path
from model import build_transformer, Transformer
from tqdm import tqdm
import warnings


def greedy_decode(
    model, source, source_mask, tokenizer_src, tokenizer_tgt, max_len, device
):
    """
    Inference -
    Start with just the SOS token in the target.
    Every iteration gives us a new next word, which we concatenate onto the decoder input before rerunning the cycle.
    Loop until we get EOS.
    """
    sos_idx = tokenizer_tgt.token_to_id("[SOS]")
    eos_idx = tokenizer_tgt.token_to_id("[EOS]")

    # Only calculate the encoder output once
    encoder_output = model.encode(source, source_mask)
    decoder_input = torch.empty(1, 1).fill_(sos_idx).type_as(source).to(device)
    while True:
        if decoder_input.size(1) == max_len:
            break

        # build the causal mask for the current decoder input length
        decoder_mask = (
            causal_mask(decoder_input.size(1)).type_as(source_mask).to(device)
        )

        out = model.decode(encoder_output, source_mask, decoder_input, decoder_mask)

        prob = model.projection(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        decoder_input = torch.cat(
            [
                decoder_input,
                torch.empty(1, 1).type_as(source).fill_(next_word.item()).to(device),
            ],
            dim=1,
        )

        if next_word == eos_idx:
            break

    return decoder_input.squeeze(0)


def run_validation(
    model,
    validation_dataset,
    tokenizer_src,
    tokenizer_target,
    max_len,
    device,
    print_msg,
    num_examples=2,
):
    model.eval()
    count = 0

    console_width = 80
    with torch.no_grad():
        for batch in validation_dataset:
            count += 1
            encoder_input = batch["encoder_input"].to(device)  # (b, seq_len)
            encoder_mask = batch["encoder_mask"].to(device)  # (b, 1, 1, seq_len)

            # check that the batch size is 1
            assert encoder_input.size(0) == 1, "Batch size must be 1 for validation"

            model_out = greedy_decode(
                model,
                encoder_input,
                encoder_mask,
                tokenizer_src,
                tokenizer_target,
                max_len,
                device,
            )

            source_text = batch["src_text"][0]
            target_text = batch["tgt_text"][0]
            model_out_text = tokenizer_target.decode(model_out.detach().cpu().numpy())

            print_msg("-" * console_width)
            print_msg(f"{'SOURCE: ':>12}{source_text}")
            print_msg(f"{'TARGET: ':>12}{target_text}")
            print_msg(f"{'PREDICTED: ':>12}{model_out_text}")

            if count == num_examples:
                print_msg("-" * console_width)
                break


def get_all_sentences(dataset, lang):
    for item in dataset:
        yield item["translation"][lang]


def get_or_build_tokenizer(config, dataset, lang):
    """
    Takes in the dataset and splits all the sentences into tokens.
    Adds four extra tokens to the token list -> "[UNK]", "[SOS]", "[EOS]" and "[PAD]".
    The minimum frequency for a word to be in our tokenizer is 2, i.e. each word should appear at least 2 times
    to be included
    """
    tokenizer_path = Path(config["tokenizer_file"].format(lang))
    if not Path.exists(tokenizer_path):
        tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
        tokenizer.pre_tokenizer = Whitespace()
        trainer = WordLevelTrainer(
            special_tokens=["[UNK]", "[SOS]", "[EOS]", "[PAD]"], min_frequency=2
        )
        tokenizer.train_from_iterator(get_all_sentences(dataset, lang), trainer=trainer)
        tokenizer.save(str(tokenizer_path))
    else:
        tokenizer = Tokenizer.from_file(str(tokenizer_path))
    return tokenizer


def get_dataset(config):
    dataset_raw = load_dataset(
        "opus_books", f"{config['lang_src']}-{config['lang_target']}", split="train"
    )

    tokenizer_src = get_or_build_tokenizer(config, dataset_raw, config["lang_src"])
    tokenizer_target = get_or_build_tokenizer(
        config, dataset_raw, config["lang_target"]
    )

    # Split the dataset into training and validation
    train_dataset_size = int(0.9 * len(dataset_raw))
    validation_dataset_size = len(dataset_raw) - train_dataset_size

    train_dataset_raw, validation_dataset_raw = random_split(
        dataset_raw, [train_dataset_size, validation_dataset_size]
    )

    # Initialize the classes
    train_dataset = BilingualDataset(
        train_dataset_raw,
        tokenizer_src,
        tokenizer_target,
        config["lang_src"],
        config["lang_target"],
        config["seq_len"],
    )

    validation_dataset = BilingualDataset(
        validation_dataset_raw,
        tokenizer_src,
        tokenizer_target,
        config["lang_src"],
        config["lang_target"],
        config["seq_len"],
    )

    # Calculate the max_len
    max_len_src = 0
    max_len_target = 0

    for item in dataset_raw:
        src_ids = tokenizer_src.encode(item["translation"][config["lang_src"]]).ids
        target_ids = tokenizer_src.encode(
            item["translation"][config["lang_target"]]
        ).ids

        max_len_src = max(len(src_ids), max_len_src)
        max_len_target = max(len(target_ids), max_len_target)

    train_dataloader = DataLoader(
        train_dataset, batch_size=config["batch_size"], shuffle=True
    )
    validation_dataloader = DataLoader(validation_dataset, batch_size=1, shuffle=True)

    return train_dataloader, validation_dataloader, tokenizer_src, tokenizer_target


def get_model(config, vocab_src_len, vocab_target_length) -> Transformer:
    model = build_transformer(
        vocab_src_len,
        vocab_target_length,
        config["seq_len"],
        config["seq_len"],
        d_model=config["d_model"],
        N=4,
        head=4,
        dropout=0.1,
        d_ff=256,
    )

    return model


def train_model(config) -> None:
    device = "cuda" if torch.cuda.is_available() else "cpu"
    device = torch.device(device)

    Path(config["model_folder"]).mkdir(parents=True, exist_ok=True)

    train_dataloader, validation_dataloader, tokenizer_src, tokenizer_target = (
        get_dataset(config)
    )
    model = get_model(
        config, tokenizer_src.get_vocab_size(), tokenizer_target.get_vocab_size()
    ).to(device)

    # Adam optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"], eps=1e-9)
    initial_epoch = 0
    global_step = 0

    if config["preload"]:
        model_filename = get_weights_file_path(config, config["preload"])
        state = torch.load(model_filename)
        initial_epoch = state["epoch"] + 1
        optimizer.load_state_dict(state["optimizer_state_dict"])
        global_step = state["global_step"]

    # Loss function
    loss_fn = nn.CrossEntropyLoss(
        ignore_index=tokenizer_src.token_to_id("[PAD]"), label_smoothing=0.1
    ).to(device)

    for epoch in range(initial_epoch, config["num_epochs"]):
        batch_iterator = tqdm(train_dataloader, desc=f"Processing epoch : {epoch:02d}")
        for batch in batch_iterator:
            model.train()
            encoder_input = batch["encoder_input"].to(device)  # (B, seq_len)
            decoder_input = batch["decoder_input"].to(device)  # (B, seq_len)
            encoder_mask = batch["encoder_mask"].to(device)  # (B, 1, 1, seq_len)
            decoder_mask = batch["decoder_mask"].to(device)  # (B, 1, seq_len, seq_len)

            encoder_output = model.encode(
                encoder_input, encoder_mask
            )  # (B, seq_len, d_model)
            decoder_output = model.decode(
                encoder_output, encoder_mask, decoder_input, decoder_mask
            )  # (B, seq_len, d_model)
            proj_output = model.projection(decoder_output)  # (B, seq_len, vocab_size)

            label = batch["label"].to(device)  # (B, seq_len)

            # Compare the predicted output with the label
            loss = loss_fn(
                proj_output.view(-1, tokenizer_target.get_vocab_size()), label.view(-1)
            )
            batch_iterator.set_postfix({"loss": f"{loss.item():6.3f}"})

            # Backpropagation
            loss.backward()
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)

            global_step += 1

        # Inference after each epoch to see the results
        run_validation(
            model,
            validation_dataloader,
            tokenizer_src,
            tokenizer_target,
            config["seq_len"],
            device,
            lambda msg: batch_iterator.write(msg),
        )

        model_filename = get_weights_file_path(config, f"{epoch:02d}")
        torch.save(
            {
                "epoch": epoch,
                "model_state_dict": model.state_dict(),
                "optimizer_state_dict": optimizer.state_dict(),
                "global_step": global_step,
            },
            model_filename,
        )


if __name__ == "__main__":
    warnings.filterwarnings("ignore")
    config = get_config()
    train_model(config)

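Assuming the layout in this commit (tokenizer_*.json and weights/ at the repository root), the intended workflow appears to be to run both scripts from the root:

    python src/train.py      # trains for num_epochs, saving weights/tmodel_XX.pt after every epoch
    python src/inference.py  # loads weights/tmodel_19.pt and greedily decodes INPUT_TEXT
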
tokenizer_en.json
ADDED
The diff for this file is too large to render.

tokenizer_it.json
ADDED
The diff for this file is too large to render.

uv.lock
ADDED
The diff for this file is too large to render.

weights/tmodel_19.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fb36d9b55db46492ea5335b961e4008878aa71496e89b29b6147fc329e89d1fb
size 551199243