---
license: apache-2.0
---
```text
V1 (baseline): NLP Getting Started Tutorial with LogisticRegression. Performance: 0.78363
V2: NLP to monitor Twitter for natural disasters; NLP techniques including LSTM and GRU are compared. Performance: 0.82653
V3: Intro to 🤗 Transformers. Performance: 0.83481
```
## 1 - Forward Propagation for the Basic Recurrent Neural Network
Later this week, you'll get a chance to generate music using an RNN! The basic RNN that you'll implement has the following structure:
In this example, Tx = Ty.
The above image shows a language translation model from French to English. In practice, we can use a stack of encoders (one on top of another) and a stack of decoders, as below:
Before going further, let us look at a full-fledged image of our attention model.
We know that the transformer has an encoder-decoder architecture for language translation. Before getting into the encoder or decoder, let us discuss some common components.
```
pos -> refers to order in the sentence
i -> refers to position along embedding vector dimension
```
Positional embedding generates a matrix similar to the embedding matrix, of dimension sequence length x embedding dimension. For each token (word) in the sequence, we find the embedding vector, which is of dimension 1 x 512, and add it to the corresponding positional vector, also of dimension 1 x 512, to get a 1 x 512 output for each word/token.
For example, if we have a batch size of 32, a sequence length of 10, and an embedding dimension of 512, then the embedding output has dimension 32 x 10 x 512. Similarly, the positional encoding has dimension 32 x 10 x 512. We then add the two.
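To make the shapes concrete, here is a minimal NumPy sketch of the positional matrix. It assumes the sinusoidal formulation from the original Transformer paper (sine on even indices and cosine on odd indices of the embedding dimension); the function name `positional_encoding` and the random stand-in embeddings are illustrative only.

```python
import numpy as np

def positional_encoding(seq_len, embed_dim):
    """Return a (seq_len, embed_dim) matrix of sinusoidal position vectors."""
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1): order in the sentence
    two_i = np.arange(0, embed_dim, 2)[None, :]   # (1, embed_dim / 2): even dimension indices 2i
    angle = pos / np.power(10000.0, two_i / embed_dim)
    pe = np.zeros((seq_len, embed_dim))
    pe[:, 0::2] = np.sin(angle)                   # even dimensions use sine
    pe[:, 1::2] = np.cos(angle)                   # odd dimensions use cosine
    return pe

# Assumption: random values stand in for the output of a learned embedding layer.
embeddings = np.random.randn(32, 10, 512)         # batch x sequence length x embedding dim
pe = positional_encoding(10, 512)                 # 10 x 512
out = embeddings + pe                             # broadcasts over the batch: 32 x 10 x 512
```

The positional matrix is computed once per sequence length and broadcast over the batch, so it does not need to be stored as a 32 x 10 x 512 tensor.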
In the encoder section -
**Step 1:** First, the input (padded tokens corresponding to the sentence) passes through the embedding layer and the positional encoding layer.
```
code hint
Suppose we have an input of 32x10 (batch size = 32 and sequence length = 10). Once it passes through the embedding layer it becomes 32x10x512. It is then added to the corresponding positional encoding vector, producing an output of 32x10x512. This is passed to the multi-head attention.
```
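The embedding lookup in this step can be sketched as plain array indexing. This is only an illustration: `vocab_size = 1000` is an assumed toy value, and random matrices stand in for the learned embedding weights and for the positional encodings.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 1000, 512                  # vocab_size is an assumed toy value
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

# Padded token ids: a batch of 32 sentences, each 10 tokens long.
token_ids = rng.integers(0, vocab_size, size=(32, 10))

# Indexing with an integer array picks one 512-dim row per token.
embedded = embedding_matrix[token_ids]             # 32 x 10 x 512

# Stand-in positional encodings; any 10 x 512 matrix broadcasts the same way.
positional = rng.normal(size=(10, embed_dim))
encoder_input = embedded + positional              # 32 x 10 x 512
```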
**Step 2:** As discussed above, it is passed through the multi-head attention layer, which creates a useful representational matrix as output.
```
code hint
The input to multi-head attention is 32x10x512, from which the key, query, and value vectors are generated as above, finally producing a 32x10x512 output.
```
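A minimal NumPy sketch of how a 32x10x512 input can flow through multi-head self-attention and come out as 32x10x512. It assumes 8 heads, full 512x512 projection matrices, and scaled dot-product attention; the names `multi_head_attention`, `split_heads`, and the `w_*` weights are hypothetical, and the random weights stand in for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)        # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(query, key, value, w_q, w_k, w_v, w_o, n_heads=8):
    batch, q_len, embed_dim = query.shape
    k_len = key.shape[1]
    head_dim = embed_dim // n_heads                # 512 / 8 = 64

    def split_heads(x, length):
        # (batch, length, embed_dim) -> (batch, n_heads, length, head_dim)
        return x.reshape(batch, length, n_heads, head_dim).transpose(0, 2, 1, 3)

    q = split_heads(query @ w_q, q_len)
    k = split_heads(key @ w_k, k_len)
    v = split_heads(value @ w_v, k_len)

    # Scaled dot product between every query and every key, per head.
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim)   # (batch, heads, q_len, k_len)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                         # (batch, heads, q_len, head_dim)

    # Concatenate the heads back into one 512-dim vector per token.
    concat = heads.transpose(0, 2, 1, 3).reshape(batch, q_len, embed_dim)
    return concat @ w_o

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 10, 512))
w_q, w_k, w_v, w_o = (rng.normal(size=(512, 512)) * 0.02 for _ in range(4))
out = multi_head_attention(x, x, x, w_q, w_k, w_v, w_o)   # self-attention: query = key = value = x
```

Each head works on a 64-dimensional slice, so the 8 heads together keep the 512-dimensional width unchanged.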
**Step 3:** Next we have a residual connection and normalization. The output from the multi-head attention is added to its input and then normalized.
```
code hint
The output of multi-head attention, which is 32x10x512, is added to the 32x10x512 input (the output of the embedding layer), and the result is layer-normalized.
```
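The add and norm step can be sketched as below. This simplified `layer_norm` omits the learnable scale and shift that a real layer normalization usually has, and random arrays stand in for the actual embedding and attention outputs.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector (last axis) to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
embedded = rng.normal(size=(32, 10, 512))          # input to the attention layer
attention_out = rng.normal(size=(32, 10, 512))     # stand-in for the attention output

normed = layer_norm(attention_out + embedded)      # residual add, then normalize
```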
**Step 4:** Next we have a feed-forward layer followed by a normalization layer with a residual connection from the feed-forward layer's input. We pass the normalized output of Step 3 through it and finally get the output of the encoder.
```
code hint
The normalized output has dimension 32x10x512. It is passed through 2 linear layers: 32x10x512 -> 32x10x2048 -> 32x10x512. Finally, the residual connection is added to the output and the result is layer-normalized. Thus a 32x10x512 tensor is created as the output of the encoder.
```
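A rough sketch of the feed-forward block with its residual connection. The ReLU between the two linear layers is an assumption (the text above only mentions two linear layers), the weights are random stand-ins, and `layer_norm` is the simplified version without learnable parameters.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, w1, b1, w2, b2):
    hidden = np.maximum(0.0, x @ w1 + b1)          # 512 -> 2048, followed by ReLU (assumed)
    return hidden @ w2 + b2                        # 2048 -> 512

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 10, 512))                 # normalized output of the attention block
w1, b1 = rng.normal(size=(512, 2048)) * 0.02, np.zeros(2048)
w2, b2 = rng.normal(size=(2048, 512)) * 0.02, np.zeros(512)

hidden_shape = (x @ w1 + b1).shape                 # (32, 10, 2048)
encoder_out = layer_norm(feed_forward(x, w1, b1, w2, b2) + x)   # 32 x 10 x 512
```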
Now we have gone through most parts of the encoder. Let us get into the components of the decoder. We will use the output of the encoder to generate key and value vectors for the decoder. There are two kinds of multi-head attention in the decoder: one is the decoder attention and the other is the encoder-decoder attention. Don't worry, we will go step by step.
Let us explain with respect to the training phase.
**Step 1:**
First, the target (output) sequence is passed through the embedding and positional encoding layers to create an embedding vector of dimension 1x512 for each word in the target sequence.
```
code hint
Suppose we have a sequence length of 10, a batch size of 32, and an embedding dimension of 512. The input of size 32x10 goes to the embedding matrix, which produces an output of dimension 32x10x512. This is added to the positional encoding of the same dimension, producing a 32x10x512 output.
```
**Step 2:**
The embedding output is passed through a multi-head attention layer as before (creating key, query, and value matrices from the target input) and produces an output vector. This time the major difference is that we use a mask with the multi-head attention.
**Why mask?**
A mask is used because, while computing attention for the target words, we do not want a word to look at future words to check the dependency. As we already learned, we compute attention because we need to know the contribution of each word to every other word. Since we are computing attention for words in the target sequence, a particular word should not see the words that come after it. For example, in the sentence "I am a student", we do not want the word "a" to look at the word "student".
```
code hint
To build the mask, we create a lower-triangular matrix of 1s and 0s. For example, the triangular matrix for sequence length 5 looks as below:
1 0 0 0 0
1 1 0 0 0
1 1 1 0 0
1 1 1 1 0
1 1 1 1 1
After the query is multiplied with the key, we fill all positions where the mask is zero with negative infinity. In code we fill them with a very large negative number (-1e20) to avoid numerical errors.
```
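The masking described above can be sketched in NumPy as follows. Random scores stand in for the real query-key products, and the variable names are illustrative.

```python
import numpy as np

seq_len = 5
mask = np.tril(np.ones((seq_len, seq_len)))        # lower-triangular matrix of 1s and 0s

rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))       # stand-in for query-key products

# Fill positions where the mask is 0 with a very large negative number.
masked_scores = np.where(mask == 0, -1e20, scores)

# Softmax over each row: masked (future) positions end up with weight 0.
shifted = masked_scores - masked_scores.max(axis=-1, keepdims=True)
weights = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
```

Row `t` of `weights` only spreads attention over positions `0..t`, so the first word attends only to itself.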
**Step 3:**
As before, we have an add and norm layer, where we add the embedding output to the attention output and normalize it.
**Step 4:**
Next we have another multi-head attention and then an add and norm layer. This multi-head attention is called encoder-decoder multi-head attention. For it, we create the key and value vectors from the encoder output. The query is created from the output of the previous decoder layer.
```
code hint:
Thus we have a 32x10x512 output from the encoder. Keys and values for all words are generated from it. Similarly, the query matrix is generated from the output of the previous layer of the decoder (32x10x512).
```
Thus it is passed through a multi-head attention (we used number of heads = 8) and then through an add and norm layer. Here the output from the previous decoder sub-layer (i.e. the previous add and norm layer) is added to the encoder-decoder attention output and then normalized.
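To show where the query, key, and value come from, here is a simplified single-head sketch of encoder-decoder attention (the text uses 8 heads; one head keeps the example short). The weight matrices are random stand-ins, and the names `encoder_out` and `decoder_x` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
embed_dim = 512
encoder_out = rng.normal(size=(32, 10, embed_dim))   # output of the encoder
decoder_x = rng.normal(size=(32, 10, embed_dim))     # output of the decoder's previous add and norm

w_q, w_k, w_v = (rng.normal(size=(embed_dim, embed_dim)) * 0.02 for _ in range(3))

query = decoder_x @ w_q          # the query comes from the decoder
key = encoder_out @ w_k          # the key and value come from the encoder
value = encoder_out @ w_v

scores = query @ key.transpose(0, 2, 1) / np.sqrt(embed_dim)   # 32 x 10 (target) x 10 (source)
weights = softmax(scores)
context = weights @ value        # 32 x 10 x 512
```

No triangular mask is applied here, since every target word is allowed to look at the whole source sentence.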
**Step 5:**
Next we have a feed-forward layer (linear layers) with add and norm, similar to the one in the encoder.
**Step 6:**
Finally, we create a linear layer whose output size equals the number of words in the total target corpus (the target vocabulary size), followed by a softmax function to get the probability of each word.
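A small sketch of this last step, assuming a toy target vocabulary of 1000 words and random stand-in weights; picking the most probable word with `argmax` is one simple way to read off a prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
target_vocab_size = 1000                             # assumed toy vocabulary size
decoder_out = rng.normal(size=(32, 10, 512))         # stand-in for the decoder output
w_out = rng.normal(size=(512, target_vocab_size)) * 0.02
b_out = np.zeros(target_vocab_size)

logits = decoder_out @ w_out + b_out                 # 32 x 10 x vocabulary size
logits = logits - logits.max(axis=-1, keepdims=True) # stabilize before exponentiating
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

predicted_ids = probs.argmax(axis=-1)                # most probable word id per position: 32 x 10
```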