Upload 11 files

Files changed (12)

.gitattributes +6 -0
Transformer Mechanism/Transformer_Implementation/home/jovyan/work/W4A1/.ipynb_checkpoints/C5_W4_A1_Transformer_Subclass_v1-checkpoint.ipynb +1303 -0
Transformer Mechanism/Transformer_Implementation/home/jovyan/work/W4A1/C5_W4_A1_Transformer_Subclass_v1.ipynb +0 -0
Transformer Mechanism/Transformer_Implementation/home/jovyan/work/W4A1/__pycache__/public_tests.cpython-37.pyc +0 -0
Transformer Mechanism/Transformer_Implementation/home/jovyan/work/W4A1/decoder.png +3 -0
Transformer Mechanism/Transformer_Implementation/home/jovyan/work/W4A1/decoder_layer.png +3 -0
Transformer Mechanism/Transformer_Implementation/home/jovyan/work/W4A1/encoder.png +3 -0
Transformer Mechanism/Transformer_Implementation/home/jovyan/work/W4A1/encoder_layer.png +3 -0
Transformer Mechanism/Transformer_Implementation/home/jovyan/work/W4A1/ner.json +0 -0
Transformer Mechanism/Transformer_Implementation/home/jovyan/work/W4A1/public_tests.py +315 -0
Transformer Mechanism/Transformer_Implementation/home/jovyan/work/W4A1/self-attention.png +3 -0
Transformer Mechanism/Transformer_Implementation/home/jovyan/work/W4A1/transformer.png +3 -0

.gitattributes CHANGED

@@ -103,3 +103,9 @@ Seq2Seq/Trigger_detection/home/jovyan/work/W3A2/raw_data/negatives/5_1.wav filte
 Seq2Seq/Trigger_detection/home/jovyan/work/W3A2/tmp.wav filter=lfs diff=lfs merge=lfs -text
 Seq2Seq/Trigger_detection/home/jovyan/work/W3A2/train.wav filter=lfs diff=lfs merge=lfs -text
 Seq2Seq/Trigger_detection/home/jovyan/work/W3A2/Trigger_word_detection_v2a.ipynb filter=lfs diff=lfs merge=lfs -text
+Transformer[[:space:]]Mechanism/Transformer_Implementation/home/jovyan/work/W4A1/decoder_layer.png filter=lfs diff=lfs merge=lfs -text
+Transformer[[:space:]]Mechanism/Transformer_Implementation/home/jovyan/work/W4A1/decoder.png filter=lfs diff=lfs merge=lfs -text
+Transformer[[:space:]]Mechanism/Transformer_Implementation/home/jovyan/work/W4A1/encoder_layer.png filter=lfs diff=lfs merge=lfs -text
+Transformer[[:space:]]Mechanism/Transformer_Implementation/home/jovyan/work/W4A1/encoder.png filter=lfs diff=lfs merge=lfs -text
+Transformer[[:space:]]Mechanism/Transformer_Implementation/home/jovyan/work/W4A1/self-attention.png filter=lfs diff=lfs merge=lfs -text
+Transformer[[:space:]]Mechanism/Transformer_Implementation/home/jovyan/work/W4A1/transformer.png filter=lfs diff=lfs merge=lfs -text

Transformer Mechanism/Transformer_Implementation/home/jovyan/work/W4A1/.ipynb_checkpoints/C5_W4_A1_Transformer_Subclass_v1-checkpoint.ipynb ADDED

	@@ -0,0 +1,1303 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "AbzZLqIPv6b7",
+    "outputId": "19f2fc2b-6f1d-4b43-fd50-4c513e3936fd"
+   },
+   "source": [
+    "# Transformer Network\n",
+    "\n",
+    "Welcome to Week 4's assignment, the last assignment of Course 5 of the Deep Learning Specialization! And congratulations on making it to the last assignment of the entire Deep Learning Specialization - you're almost done!\n",
+    "\n",
+    "Earlier in the course, you've implemented sequential neural networks such as RNNs, GRUs, and LSTMs. In this notebook you'll explore the Transformer architecture, a neural network that takes advantage of parallel processing and allows you to substantially speed up the training process. \n",
+    "\n",
+    "**After this assignment you'll be able to**:\n",
+    "\n",
+    "* Create positional encodings to capture sequential relationships in data\n",
+    "* Calculate scaled dot-product self-attention with word embeddings\n",
+    "* Implement masked multi-head attention\n",
+    "* Build and train a Transformer model\n",
+    "\n",
+    "For the last time, let's get started!\n",
+    "\n",
+    "## Important Note on Submission to the AutoGrader\n",
+    "\n",
+    "Before submitting your assignment to the AutoGrader, please make sure that all of the following are true:\n",
+    "\n",
+    "1. You have not added any _extra_ `print` statement(s) in the assignment.\n",
+    "2. You have not added any _extra_ code cell(s) in the assignment.\n",
+    "3. You have not changed any of the function parameters.\n",
+    "4. You are not using any global variables inside your graded exercises. Unless specifically instructed to do so, please refrain from it and use the local variables instead.\n",
+    "5. You are not changing the assignment code where it is not required, like creating _extra_ variables.\n",
+    "\n",
+    "If any of these conditions is not met, you will get an error such as `Grader Error: Grader feedback not found` (or a similarly unexpected error) upon submitting your assignment. Before asking for help or debugging the errors in your assignment, check for these first. If this is the case, and you don't remember the changes you have made, you can get a fresh copy of the assignment by following these [instructions](https://www.coursera.org/learn/nlp-sequence-models/supplement/X39s5/optional-downloading-your-notebook-downloading-your-workspace-and-refreshing)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Table of Contents\n",
+    "\n",
+    "- [Packages](#0)\n",
+    "- [1 - Positional Encoding](#1)\n",
+    "    - [1.1 - Sine and Cosine Angles](#1-1)\n",
+    "        - [Exercise 1 - get_angles](#ex-1)\n",
+    "    - [1.2 - Sine and Cosine Positional Encodings](#1-2)\n",
+    "        - [Exercise 2 - positional_encoding](#ex-2)\n",
+    "- [2 - Masking](#2)\n",
+    "    - [2.1 - Padding Mask](#2-1)\n",
+    "    - [2.2 - Look-ahead Mask](#2-2)\n",
+    "- [3 - Self-Attention](#3)\n",
+    "    - [Exercise 3 - scaled_dot_product_attention](#ex-3)\n",
+    "- [4 - Encoder](#4)\n",
+    "    - [4.1 Encoder Layer](#4-1)\n",
+    "        - [Exercise 4 - EncoderLayer](#ex-4)\n",
+    "    - [4.2 - Full Encoder](#4-2)\n",
+    "        - [Exercise 5 - Encoder](#ex-5)\n",
+    "- [5 - Decoder](#5)\n",
+    "    - [5.1 - Decoder Layer](#5-1)\n",
+    "        - [Exercise 6 - DecoderLayer](#ex-6)\n",
+    "    - [5.2 - Full Decoder](#5-2)\n",
+    "        - [Exercise 7 - Decoder](#ex-7)\n",
+    "- [6 - Transformer](#6)\n",
+    "    - [Exercise 8 - Transformer](#ex-8)\n",
+    "- [7 - References](#7)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<a name='0'></a>\n",
+    "## Packages\n",
+    "\n",
+    "Run the following cell to load the packages you'll need."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "deletable": false,
+    "editable": false
+   },
+   "outputs": [],
+   "source": [
+    "### v1.6"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "_OpwqWL2QH5G"
+   },
+   "outputs": [],
+   "source": [
+    "import tensorflow as tf\n",
+    "import time\n",
+    "import numpy as np\n",
+    "import matplotlib.pyplot as plt\n",
+    "\n",
+    "from tensorflow.keras.layers import Embedding, MultiHeadAttention, Dense, Input, Dropout, LayerNormalization"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<a name='1'></a>\n",
+    "## 1 - Positional Encoding\n",
+    "\n",
+    "In sequence to sequence tasks, the relative order of your data is extremely important to its meaning. When you were training sequential neural networks such as RNNs, you fed your inputs into the network in order. Information about the order of your data was automatically fed into your model.  However, when you train a Transformer network using multi-head attention, you feed your data into the model all at once. While this dramatically reduces training time, there is no information about the order of your data. This is where positional encoding is useful - you can specifically encode the positions of your inputs and pass them into the network using these sine and cosine formulas:\n",
+    "    \n",
+    "$$\n",
+    "PE_{(pos, 2i)}= sin\\left(\\frac{pos}{{10000}^{\\frac{2i}{d}}}\\right)\n",
+    "\\tag{1}$$\n",
+    "<br>\n",
+    "$$\n",
+    "PE_{(pos, 2i+1)}= cos\\left(\\frac{pos}{{10000}^{\\frac{2i}{d}}}\\right)\n",
+    "\\tag{2}$$\n",
+    "\n",
+    "* $d$ is the dimension of the word embedding and positional encoding\n",
+    "* $pos$ is the position of the word.\n",
+    "* $k$ refers to each of the different dimensions in the positional encodings, with $i$ equal to $k$ $//$ $2$.\n",
+    "\n",
+    "To develop some intuition about positional encodings, you can think of them broadly as a feature that contains the information about the relative positions of words. The sum of the positional encoding and word embedding is ultimately what is fed into the model. If you just hard code the positions in, say by adding a matrix of 1's or whole numbers to the word embedding, the semantic meaning is distorted. Conversely, the values of the sine and cosine equations are small enough (between -1 and 1) that when you add the positional encoding to a word embedding, the word embedding is not significantly distorted, and is instead enriched with positional information. Using a combination of these two equations helps your Transformer network attend to the relative positions of your input data. This was a short discussion on positional encodings, but to develop further intuition, check out the *Positional Encoding Ungraded Lab*. \n",
+    "\n",
+    "**Note:** In the lectures Andrew uses vertical vectors, but in this assignment all vectors are horizontal. All matrix multiplications should be adjusted accordingly.\n",
+    "\n",
+    "<a name='1-1'></a>\n",
+    "### 1.1 - Sine and Cosine Angles\n",
+    "\n",
+    "Notice that even though the sine and cosine positional encoding equations take in different arguments (`2i` versus `2i+1`, or even versus odd numbers) the inner terms for both equations are the same: $$\\theta(pos, i, d) = \\frac{pos}{10000^{\\frac{2i}{d}}} \\tag{3}$$\n",
+    "\n",
+    "Consider the inner term as you calculate the positional encoding for a word in a sequence.<br> \n",
+    "$PE_{(pos, 0)}= sin\\left(\\frac{pos}{{10000}^{\\frac{0}{d}}}\\right)$, since solving `2i = 0` gives `i = 0` <br>\n",
+    "$PE_{(pos, 1)}= cos\\left(\\frac{pos}{{10000}^{\\frac{0}{d}}}\\right)$, since solving `2i + 1 = 1` gives `i = 0`\n",
+    "\n",
+    "The angle is the same for both! The angles for $PE_{(pos, 2)}$ and $PE_{(pos, 3)}$ are the same as well, since for both, `i = 1` and therefore the inner term is $\\left(\\frac{pos}{{10000}^{\\frac{2}{d}}}\\right)$. This relationship holds true for all paired sine and cosine curves:\n",
+    "\n",
+    "|      k         | <code>       0      </code>|<code>       1      </code>|<code>       2      </code>|<code>       3      </code>| <code> ... </code> |<code>      d - 2     </code>|<code>      d - 1     </code>| \n",
+    "| ---------------- | :------: | ----------------- | ----------------- | ----------------- | ----- | ----------------- | ----------------- |\n",
+    "| encoding(0) = | [$sin(\\theta(0, 0, d))$| $cos(\\theta(0, 0, d))$| $sin(\\theta(0, 1, d))$| $cos(\\theta(0, 1, d))$|... |$sin(\\theta(0, d//2 - 1, d))$| $cos(\\theta(0, d//2 - 1, d))$]|\n",
+    "| encoding(1) = | [$sin(\\theta(1, 0, d))$| $cos(\\theta(1, 0, d))$| $sin(\\theta(1, 1, d))$| $cos(\\theta(1, 1, d))$|... |$sin(\\theta(1, d//2 - 1, d))$| $cos(\\theta(1, d//2 - 1, d))$]|\n",
+    "| ... | ... | ... | ... | ... | ... | ... | ... |\n",
+    "| encoding(pos) = | [$sin(\\theta(pos, 0, d))$| $cos(\\theta(pos, 0, d))$| $sin(\\theta(pos, 1, d))$| $cos(\\theta(pos, 1, d))$|... |$sin(\\theta(pos, d//2 - 1, d))$| $cos(\\theta(pos, d//2 - 1, d))$]|\n",
+    "\n",
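+    "As a hedged worked example (the choice of $d = 4$ and $pos = 1$ is an illustrative assumption, not part of the assignment), equation (3) gives:\n",
+    "\n",
+    "$$\n",
+    "\\theta(1, 0, 4) = \\frac{1}{10000^{0/4}} = 1, \\qquad \\theta(1, 1, 4) = \\frac{1}{10000^{2/4}} = \\frac{1}{100} = 0.01\n",
+    "$$\n",
+    "\n",
+    "so the encoding of position 1 would be approximately:\n",
+    "\n",
+    "$$\n",
+    "\\left[\\sin(1),\\ \\cos(1),\\ \\sin(0.01),\\ \\cos(0.01)\\right] \\approx \\left[0.8415,\\ 0.5403,\\ 0.0100,\\ 0.99995\\right]\n",
+    "$$\n",
+    "\n",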
+    "\n",
+    "<a name='ex-1'></a>\n",
+    "### Exercise 1 - get_angles\n",
+    "\n",
+    "Implement the function `get_angles()` to calculate the possible angles for the sine and cosine positional encodings.\n",
+    "\n",
+    "**Hints**\n",
+    "\n",
+    "- If `k = [0, 1, 2, 3, 4, 5]`, then, `i` must be `i = [0, 0, 1, 1, 2, 2]`\n",
+    "- `i = k//2`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "bPzwMVfcQpT-"
+   },
+   "outputs": [],
+   "source": [
+    "# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n",
+    "# GRADED FUNCTION get_angles\n",
+    "def get_angles(pos, k, d):\n",
+    "    \"\"\"\n",
+    "    Get the angles for the positional encoding\n",
+    "    \n",
+    "    Arguments:\n",
+    "        pos -- Column vector containing the positions [[0], [1], ...,[N-1]]\n",
+    "        k --   Row vector containing the dimension span [[0, 1, 2, ..., d-1]]\n",
+    "        d(integer) -- Encoding size\n",
+    "    \n",
+    "    Returns:\n",
+    "        angles -- (pos, d) numpy array \n",
+    "    \"\"\"\n",
+    "    \n",
+    "    # START CODE HERE\n",
+    "    # Get i from dimension span k\n",
+    "    i = None\n",
+    "    # Calculate the angles using pos, i and d\n",
+    "    angles = None\n",
+    "    # END CODE HERE\n",
+    "    \n",
+    "    return angles"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from public_tests import *\n",
+    "\n",
+    "get_angles_test(get_angles)\n",
+    "\n",
+    "# Example\n",
+    "position = 4\n",
+    "d_model = 8\n",
+    "pos_m = np.arange(position)[:, np.newaxis]\n",
+    "dims = np.arange(d_model)[np.newaxis, :]\n",
+    "get_angles(pos_m, dims, d_model)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<a name='1-2'></a>\n",
+    "### 1.2 - Sine and Cosine Positional Encodings\n",
+    "\n",
+    "Now you can use the angles you computed to calculate the sine and cosine positional encodings.\n",
+    "\n",
+    "$$\n",
+    "PE_{(pos, 2i)}= sin\\left(\\frac{pos}{{10000}^{\\frac{2i}{d}}}\\right)\n",
+    "$$\n",
+    "<br>\n",
+    "$$\n",
+    "PE_{(pos, 2i+1)}= cos\\left(\\frac{pos}{{10000}^{\\frac{2i}{d}}}\\right)\n",
+    "$$\n",
+    "\n",
+    "<a name='ex-2'></a>\n",
+    "### Exercise 2 - positional_encoding\n",
+    "\n",
+    "Implement the function `positional_encoding()` to calculate the sine and cosine positional encodings.\n",
+    "\n",
+    "**Reminder:** Use the sine equation when the dimension index $k$ is an even number ($k = 2i$) and the cosine equation when $k$ is an odd number ($k = 2i + 1$).\n",
+    "\n",
+    "#### Additional Hints\n",
+    "* You may find \n",
+    "[np.newaxis](https://numpy.org/doc/stable/user/basics.indexing.html#dimensional-indexing-tools) useful depending on the implementation you choose. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "y78txxoHQtwG"
+   },
+   "outputs": [],
+   "source": [
+    "# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n",
+    "# GRADED FUNCTION positional_encoding\n",
+    "def positional_encoding(positions, d):\n",
+    "    \"\"\"\n",
+    "    Precomputes a matrix with all the positional encodings \n",
+    "    \n",
+    "    Arguments:\n",
+    "        positions (int) -- Maximum number of positions to be encoded \n",
+    "        d (int) -- Encoding size \n",
+    "    \n",
+    "    Returns:\n",
+    "        pos_encoding -- (1, positions, d) A matrix with the positional encodings\n",
+    "    \"\"\"\n",
+    "    # START CODE HERE\n",
+    "    # initialize a matrix angle_rads of all the angles \n",
+    "    angle_rads = get_angles(None,\n",
+    "                            None,\n",
+    "                            None)\n",
+    "  \n",
+    "    # apply sin to even indices in the array; 2i\n",
+    "    angle_rads[:, 0::2] = None\n",
+    "  \n",
+    "    # apply cos to odd indices in the array; 2i+1\n",
+    "    angle_rads[:, 1::2] = None\n",
+    "    # END CODE HERE\n",
+    "    \n",
+    "    pos_encoding = angle_rads[np.newaxis, ...]\n",
+    "    \n",
+    "    return tf.cast(pos_encoding, dtype=tf.float32)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/",
+     "height": 300
+    },
+    "id": "jYiWrawRQvuv",
+    "outputId": "cfccc7c9-428e-4b08-d969-e3090fafc1ad"
+   },
+   "outputs": [],
+   "source": [
+    "# UNIT TEST    \n",
+    "positional_encoding_test(positional_encoding, get_angles)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Nice work calculating the positional encodings! Now you can visualize them."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pos_encoding = positional_encoding(50, 512)\n",
+    "\n",
+    "print(pos_encoding.shape)\n",
+    "\n",
+    "plt.pcolormesh(pos_encoding[0], cmap='RdBu')\n",
+    "plt.xlabel('d')\n",
+    "plt.xlim((0, 512))\n",
+    "plt.ylabel('Position')\n",
+    "plt.colorbar()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Each row represents a positional encoding - notice how none of the rows are identical! You have created a unique positional encoding for each of the words."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<a name='2'></a>\n",
+    "## 2 - Masking\n",
+    "\n",
+    "There are two types of masks that are useful when building your Transformer network: the *padding mask* and the *look-ahead mask*. Both help the softmax computation give the appropriate weights to the words in your input sentence. \n",
+    "\n",
+    "<a name='2-1'></a>\n",
+    "### 2.1 - Padding Mask\n",
+    "\n",
+    "Oftentimes your input sequence will exceed the maximum length of a sequence your network can process. Let's say the maximum length of your model is five, and it is fed the following sequences:\n",
+    "\n",
+    "    [[\"Do\", \"you\", \"know\", \"when\", \"Jane\", \"is\", \"going\", \"to\", \"visit\", \"Africa\"], \n",
+    "     [\"Jane\", \"visits\", \"Africa\", \"in\", \"September\" ],\n",
+    "     [\"Exciting\", \"!\"]\n",
+    "    ]\n",
+    "\n",
+    "which might get vectorized as:\n",
+    "\n",
+    "    [[ 71, 121, 4, 56, 99, 2344, 345, 1284, 15],\n",
+    "     [ 56, 1285, 15, 181, 545],\n",
+    "     [ 87, 600]\n",
+    "    ]\n",
+    "    \n",
+    "When passing sequences into a transformer model, it is important that they are of uniform length. You can achieve this by padding the sequence with zeros, and truncating sentences that exceed the maximum length of your model:\n",
+    "\n",
+    "    [[ 71, 121, 4, 56, 99],\n",
+    "     [ 2344, 345, 1284, 15, 0],\n",
+    "     [ 56, 1285, 15, 181, 545],\n",
+    "     [ 87, 600, 0, 0, 0],\n",
+    "    ]\n",
+    "    \n",
+    "Sequences longer than the maximum length of five will be truncated, and zeros will be added to the truncated sequence to achieve uniform length. Similarly, for sequences shorter than the maximum length, zeros will also be added for padding. However, these zeros will affect the softmax calculation - this is when a padding mask comes in handy! You will need to define a boolean mask that specifies to which elements you must attend(1) and which elements you must ignore(0). Later you will use that mask to set all the zeros in the sequence to a value close to negative infinity (-1e9). We'll implement this for you so you can get to the fun of building the Transformer network! 😇 Just make sure you go through the code so you can correctly implement padding when building your model. \n",
+    "\n",
+    "After masking, your input should go from `[87, 600, 0, 0, 0]` to `[87, 600, -1e9, -1e9, -1e9]`, so that when you take the softmax, the zeros don't affect the score.\n",
+    "\n",
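+    "To see the effect numerically, here is a small worked sketch (the scores $[2, 1, 0, 0]$ and the mask $[1, 1, 0, 0]$ are made-up values for illustration). Without masking:\n",
+    "\n",
+    "$$\n",
+    "\\operatorname{softmax}([2, 1, 0, 0]) \\approx [0.610,\\ 0.225,\\ 0.083,\\ 0.083]\n",
+    "$$\n",
+    "\n",
+    "With $(1 - mask) \\times (-10^{9})$ added to the scores first:\n",
+    "\n",
+    "$$\n",
+    "\\operatorname{softmax}([2, 1, -10^{9}, -10^{9}]) \\approx [0.731,\\ 0.269,\\ 0,\\ 0]\n",
+    "$$\n",
+    "\n",
+    "The padded positions receive essentially zero weight, and the remaining weights are renormalized over the real tokens.\n",
+    "\n",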
+    "The [MultiHeadAttention](https://keras.io/api/layers/attention_layers/multi_head_attention/) layer implemented in Keras uses this masking logic.\n",
+    "\n",
+    "**Note:** The below function only creates the mask of an _already padded sequence_. Later in this week, you’ll go through some Labs on Transformer applications, where you’ll be introduced to [TensorFlow Tokenizer](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer) and [Hugging Face Tokenizer](https://huggingface.co/docs/tokenizers/api/tokenizer), which internally handle padding (and truncating) the input sequence."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "JOL9XWsFQxxo"
+   },
+   "outputs": [],
+   "source": [
+    "def create_padding_mask(decoder_token_ids):\n",
+    "    \"\"\"\n",
+    "    Creates a matrix mask for the padding cells\n",
+    "    \n",
+    "    Arguments:\n",
+    "        decoder_token_ids -- (n, m) matrix\n",
+    "    \n",
+    "    Returns:\n",
+    "        mask -- (n, 1, m) binary tensor\n",
+    "    \"\"\"    \n",
+    "    seq = 1 - tf.cast(tf.math.equal(decoder_token_ids, 0), tf.float32)\n",
+    "  \n",
+    "    # add extra dimensions to add the padding\n",
+    "    # to the attention logits. \n",
+    "    # this will allow for broadcasting later when comparing sequences\n",
+    "    return seq[:, tf.newaxis, :] "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "5J5FFjklQ1Fz",
+    "outputId": "8319446f-3ed4-406a-cf38-ca2b08142ff4"
+   },
+   "outputs": [],
+   "source": [
+    "x = tf.constant([[7., 6., 0., 0., 1.], [1., 2., 3., 0., 0.], [0., 0., 0., 4., 5.]])\n",
+    "print(create_padding_mask(x))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If we multiply (1 - mask) by -1e9 and add it to the sample input sequences, the zeros are essentially set to negative infinity. Notice the difference when taking the softmax of the original sequence and the masked sequence:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(tf.keras.activations.softmax(x))\n",
+    "print(tf.keras.activations.softmax(x + (1 - create_padding_mask(x)) * -1.0e9))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<a name='2-2'></a>\n",
+    "### 2.2 - Look-ahead Mask\n",
+    "\n",
+    "The look-ahead mask follows similar intuition. In training, you will have access to the complete correct output of your training example. The look-ahead mask helps your model pretend that it correctly predicted a part of the output and see if, *without looking ahead*, it can correctly predict the next output. \n",
+    "\n",
+    "For example, if the expected correct output is `[1, 2, 3]` and you wanted to see whether the model, given that it correctly predicted the first value, could predict the second value, you would mask out the second and third values. So you would input the masked sequence `[1, -1e9, -1e9]` and see if it could generate `[1, 2, -1e9]`.\n",
+    "\n",
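+    "As an illustrative sketch for a sequence of length 3 (the length is an arbitrary choice), a lower triangular mask of ones and the additive term $(1 - mask) \\times (-10^{9})$ derived from it look like:\n",
+    "\n",
+    "$$\n",
+    "mask = \\begin{bmatrix} 1 & 0 & 0 \\\\ 1 & 1 & 0 \\\\ 1 & 1 & 1 \\end{bmatrix}, \\qquad (1 - mask) \\times (-10^{9}) = \\begin{bmatrix} 0 & -10^{9} & -10^{9} \\\\ 0 & 0 & -10^{9} \\\\ 0 & 0 & 0 \\end{bmatrix}\n",
+    "$$\n",
+    "\n",
+    "Row $t$ keeps positions up to and including $t$ and suppresses every later position.\n",
+    "\n",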
+    "Just because you've worked so hard, we'll also implement this mask for you 😇😇. Again, take a close look at the code so you can effectively implement it later."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "9O9UbM31Q3hK"
+   },
+   "outputs": [],
+   "source": [
+    "def create_look_ahead_mask(sequence_length):\n",
+    "    \"\"\"\n",
+    "    Returns a lower triangular matrix filled with ones\n",
+    "    \n",
+    "    Arguments:\n",
+    "        sequence_length -- matrix size\n",
+    "    \n",
+    "    Returns:\n",
+    "        mask -- (1, sequence_length, sequence_length) tensor\n",
+    "    \"\"\"\n",
+    "    mask = tf.linalg.band_part(tf.ones((1, sequence_length, sequence_length)), -1, 0)\n",
+    "    return mask "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "nfzHoVj9Q5nG",
+    "outputId": "300e76ec-77d0-460a-b6df-71e40de86606"
+   },
+   "outputs": [],
+   "source": [
+    "x = tf.random.uniform((1, 3))\n",
+    "temp = create_look_ahead_mask(x.shape[1])\n",
+    "temp"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "VG0gPyv0oDBi"
+   },
+   "source": [
+    "<a name='3'></a>\n",
+    "## 3 - Self-Attention\n",
+    "\n",
+    "As the authors of the Transformers paper state, \"Attention is All You Need\". \n",
+    "\n",
+    "<img src=\"self-attention.png\" alt=\"Encoder\" width=\"600\"/>\n",
+    "<caption><center><font color='purple'><b>Figure 1: Self-Attention calculation visualization</b></font></center></caption>\n",
+    "    \n",
+    "The use of self-attention, paired with a convolutional-network style of processing inputs in parallel, allows for parallelization, which speeds up training. You will implement **scaled dot product attention**, which takes in a query, key, value, and a mask as inputs to return rich, attention-based vector representations of the words in your sequence. This type of self-attention can be mathematically expressed as:\n",
+    "$$\n",
+    "\\text{Attention}(Q, K, V)=\\operatorname{softmax}\\left(\\frac{Q K^{T}}{\\sqrt{d_{k}}}+{M}\\right) V\\tag{4}\n",
+    "$$\n",
+    "\n",
+    "* $Q$ is the matrix of queries \n",
+    "* $K$ is the matrix of keys\n",
+    "* $V$ is the matrix of values\n",
+    "* $M$ is the optional mask you choose to apply \n",
+    "* ${d_k}$ is the dimension of the keys, which is used to scale everything down so the softmax doesn't explode\n",
+    "\n",
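+    "A small worked sketch of equation (4) with no mask, using made-up values (one query, two keys, $d_k = 2$):\n",
+    "\n",
+    "$$\n",
+    "Q = \\begin{bmatrix} 1 & 0 \\end{bmatrix}, \\quad K = \\begin{bmatrix} 1 & 0 \\\\ 0 & 1 \\end{bmatrix}, \\quad V = \\begin{bmatrix} 1 & 2 \\\\ 3 & 4 \\end{bmatrix}\n",
+    "$$\n",
+    "\n",
+    "$$\n",
+    "\\frac{Q K^{T}}{\\sqrt{d_k}} = \\begin{bmatrix} \\frac{1}{\\sqrt{2}} & 0 \\end{bmatrix} \\approx \\begin{bmatrix} 0.707 & 0 \\end{bmatrix}, \\qquad \\operatorname{softmax} \\approx \\begin{bmatrix} 0.670 & 0.330 \\end{bmatrix}\n",
+    "$$\n",
+    "\n",
+    "$$\n",
+    "\\operatorname{Attention}(Q, K, V) \\approx 0.670 \\begin{bmatrix} 1 & 2 \\end{bmatrix} + 0.330 \\begin{bmatrix} 3 & 4 \\end{bmatrix} = \\begin{bmatrix} 1.660 & 2.660 \\end{bmatrix}\n",
+    "$$\n",
+    "\n",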
+    "<a name='ex-3'></a>\n",
+    "### Exercise 3 - scaled_dot_product_attention \n",
+    "\n",
+    "Implement the function `scaled_dot_product_attention()` to create attention-based representations.\n",
+    "\n",
+    "**Reminder**: The boolean mask parameter can be passed in as `None` or as either a padding mask or a look-ahead mask. \n",
+    "    \n",
+    "    Multiply (1. - mask) by -1e9 before applying the softmax. \n",
+    "\n",
+    "**Additional Hints**\n",
+    "* You may find [tf.matmul](https://www.tensorflow.org/api_docs/python/tf/linalg/matmul) useful for matrix multiplication."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "CSysk_rjQ7lp"
+   },
+   "outputs": [],
+   "source": [
+    "# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n",
+    "# GRADED FUNCTION scaled_dot_product_attention\n",
+    "def scaled_dot_product_attention(q, k, v, mask):\n",
+    "    \"\"\"\n",
+    "    Calculate the attention weights.\n",
+    "      q, k, v must have matching leading dimensions.\n",
+    "      k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.\n",
+    "      The mask has different shapes depending on its type(padding or look ahead) \n",
+    "      but it must be broadcastable for addition.\n",
+    "\n",
+    "    Arguments:\n",
+    "        q -- query shape == (..., seq_len_q, depth)\n",
+    "        k -- key shape == (..., seq_len_k, depth)\n",
+    "        v -- value shape == (..., seq_len_v, depth_v)\n",
+    "        mask -- Float tensor with shape broadcastable \n",
+    "              to (..., seq_len_q, seq_len_k). May be None.\n",
+    "\n",
+    "    Returns:\n",
+    "        output, attention_weights -- the attention output and the attention weights\n",
+    "    \"\"\"\n",
+    "    # START CODE HERE\n",
+    "    \n",
+    "    matmul_qk = None  # (..., seq_len_q, seq_len_k)\n",
+    "\n",
+    "    # scale matmul_qk\n",
+    "    dk = None\n",
+    "    scaled_attention_logits = None\n",
+    "\n",
+    "    # add the mask to the scaled tensor.\n",
+    "    if mask is not None: # Don't replace this None\n",
+    "        scaled_attention_logits += None \n",
+    "\n",
+    "    # softmax is normalized on the last axis (seq_len_k) so that the scores\n",
+    "    # add up to 1.\n",
+    "    attention_weights = None  # (..., seq_len_q, seq_len_k)\n",
+    "\n",
+    "    output = None  # (..., seq_len_q, depth_v)\n",
+    "    \n",
+    "    # END CODE HERE\n",
+    "\n",
+    "    return output, attention_weights"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# UNIT TEST\n",
+    "scaled_dot_product_attention_test(scaled_dot_product_attention)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Excellent work! You can now implement self-attention. With that, you can start building the encoder block! "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "blS0pEpTqRVI"
+   },
+   "source": [
+    "<a name='4'></a>\n",
+    "## 4 - Encoder\n",
+    "\n",
+    "The Transformer Encoder layer pairs self-attention and convolutional neural network style of processing to improve the speed of training and passes K and V matrices to the Decoder, which you'll build later in the assignment. In this section of the assignment, you will implement the Encoder by pairing multi-head attention and a feed forward neural network (Figure 2a). \n",
+    "<img src=\"encoder_layer.png\" alt=\"Encoder\" width=\"400\"/>\n",
+    "<caption><center><font color='purple'><b>Figure 2a: Transformer encoder layer</b></font></center></caption>\n",
+    "\n",
+    "* You can think of `MultiHeadAttention` as computing self-attention several times to detect different features. \n",
+    "* The feed forward neural network contains two Dense layers, which we'll implement as the function `FullyConnected`.\n",
+    "\n",
+    "Your input sentence first passes through a *multi-head attention layer*, where the encoder looks at other words in the input sentence as it encodes a specific word. The outputs of the multi-head attention layer are then fed to a *feed forward neural network*. The exact same feed forward network is independently applied to each position.\n",
+    "   \n",
+    "* For the `MultiHeadAttention` layer, you will use the [Keras implementation](https://www.tensorflow.org/api_docs/python/tf/keras/layers/MultiHeadAttention). If you're curious about how to split the query matrix Q, key matrix K, and value matrix V into different heads, you can look through the implementation. \n",
+    "* You will also use the [Sequential API](https://keras.io/api/models/sequential/) with two dense layers to build the feed forward neural network layers.\n",
+    "    \n",
+    "**Note:** In Python, the `__call__` method allows you to call an object like a function. TensorFlow leverages this by providing a `call` function within Keras layers.  This means you can use a layer like `a_Layer(arg)` and TensorFlow will internally execute `a_Layer.call(arg)` to process your data.  This design makes it intuitive to apply layers within your TensorFlow models.    "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "sC5vJhz29vZR"
+   },
+   "outputs": [],
+   "source": [
+    "def FullyConnected(embedding_dim, fully_connected_dim):\n",
+    "    return tf.keras.Sequential([\n",
+    "        tf.keras.layers.Dense(fully_connected_dim, activation='relu'),  # (batch_size, seq_len, dff)\n",
+    "        tf.keras.layers.Dense(embedding_dim)  # (batch_size, seq_len, embedding_dim)\n",
+    "    ])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "R65WbX5wqYYH"
+   },
+   "source": [
+    "<a name='4-1'></a>\n",
+    "### 4.1 Encoder Layer\n",
+    "\n",
+    "Now you can pair multi-head attention and feed forward neural network together in an encoder layer! You will also use residual connections and layer normalization to help speed up training (Figure 2a).\n",
+    "\n",
+    "<a name='ex-4'></a>\n",
+    "### Exercise 4 - EncoderLayer\n",
+    "\n",
+    "Implement `EncoderLayer()` using the `call()` method\n",
+    "\n",
+    "In this exercise, you will implement one encoder block (Figure 2) using the `call()` method. The function should perform the following steps: \n",
+    "1. You will pass the Q, V, K matrices and a boolean mask to a multi-head attention layer. Remember that to compute *self*-attention Q, V and K should be the same. Set the default values for `return_attention_scores` and `training`. You will also perform Dropout in this multi-head attention layer during training. \n",
+    "2. Now add a skip connection by adding your original input `x` and the output of your multi-head attention layer. \n",
+    "3. After adding the skip connection, pass the output through the first normalization layer.\n",
+    "4. Finally, repeat steps 1-3 but with the feed forward neural network with a dropout layer instead of the multi-head attention layer. \n",
+    "\n",
+    "<details>\n",
+    "  <summary><font size=\"2\" color=\"darkgreen\"><b>Additional Hints (Click to expand)</b></font></summary>\n",
+    "    \n",
+    "* The `__init__` method creates all the layers that will be accessed by the `call` method. Wherever you want to use a layer defined inside the `__init__` method, you will have to use the syntax `self.[insert layer name]`. \n",
+    "* You will find the documentation of [MultiHeadAttention](https://www.tensorflow.org/api_docs/python/tf/keras/layers/MultiHeadAttention) helpful. *Note that if query, key and value are the same, then this function performs self-attention.*\n",
+    "* The call arguments for `self.mha` are (where B is the batch size, T is the target (query) sequence length, and S is the source (key/value) sequence length):\n",
+    " - `query`: Query Tensor of shape (B, T, dim).\n",
+    " - `value`: Value Tensor of shape (B, S, dim).\n",
+    " - `key`: Optional key Tensor of shape (B, S, dim). If not given, will use value for both key and value, which is the most common case.\n",
+    " - `attention_mask`: a boolean mask of shape (B, T, S), that prevents attention to certain positions. The boolean mask specifies which query elements can attend to which key elements, 1 indicates attention and 0 indicates no attention. Broadcasting can happen for the missing batch dimensions and the head dimension.\n",
+    " - `return_attention_scores`: A boolean to indicate whether the output should be (attention_output, attention_scores) if True, or attention_output if False. Defaults to False.\n",
+    " - `training`: Python boolean indicating whether the layer should behave in training mode (adding dropout) or in inference mode (no dropout). Defaults to either using the training mode of the parent layer/model, or False (inference) if there is no parent layer. Take a look at [tf.keras.layers.Dropout](https://www.tensorflow.org/versions/r2.4/api_docs/python/tf/keras/layers/Dropout) for more details (Additional reading in [Keras FAQ](https://keras.io/getting_started/faq/#whats-the-difference-between-the-training-argument-in-call-and-the-trainable-attribute))"
+   ]
+  },
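+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The add-and-normalize pattern from steps 2 and 3 above can be sketched on a single token in plain Python. This is only a rough illustration under stated assumptions: `toy_layer_norm` and `toy_add_and_norm` are hypothetical helpers, the numbers are made up, and the learnable scale and offset of a real `LayerNormalization` layer are left out. It is not the TensorFlow code the exercise asks for."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import math\n",
+    "\n",
+    "\n",
+    "def toy_layer_norm(vector, eps=1e-6):\n",
+    "    # Normalize one position's features to zero mean and (almost) unit variance\n",
+    "    mean = sum(vector) / len(vector)\n",
+    "    variance = sum((v - mean) ** 2 for v in vector) / len(vector)\n",
+    "    return [(v - mean) / math.sqrt(variance + eps) for v in vector]\n",
+    "\n",
+    "\n",
+    "def toy_add_and_norm(x, sublayer_out):\n",
+    "    # Skip connection: add the sub-layer output to its input, then normalize\n",
+    "    return toy_layer_norm([a + b for a, b in zip(x, sublayer_out)])\n",
+    "\n",
+    "\n",
+    "toy_x = [1.0, 2.0, 3.0, 4.0]            # one token's embedding (made-up values)\n",
+    "toy_attn_out = [0.5, -0.5, 0.5, -0.5]   # pretend attention output for that token\n",
+    "toy_out = toy_add_and_norm(toy_x, toy_attn_out)\n",
+    "print([round(v, 3) for v in toy_out])"
+   ]
+  },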
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "tIufbrc-9_2u"
+   },
+   "outputs": [],
+   "source": [
+    "# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n",
+    "# GRADED FUNCTION EncoderLayer\n",
+    "class EncoderLayer(tf.keras.layers.Layer):\n",
+    "    \"\"\"\n",
+    "    The encoder layer is composed of a multi-head self-attention mechanism,\n",
+    "    followed by a simple, positionwise fully connected feed-forward network. \n",
+    "    This architecture includes a residual connection around each of the two \n",
+    "    sub-layers, followed by layer normalization.\n",
+    "    \"\"\"\n",
+    "    def __init__(self, embedding_dim, num_heads, fully_connected_dim,\n",
+    "                 dropout_rate=0.1, layernorm_eps=1e-6):\n",
+    "        super(EncoderLayer, self).__init__()\n",
+    "\n",
+    "        self.mha = MultiHeadAttention(num_heads=num_heads,\n",
+    "                                      key_dim=embedding_dim,\n",
+    "                                      dropout=dropout_rate)\n",
+    "\n",
+    "        self.ffn = FullyConnected(embedding_dim=embedding_dim,\n",
+    "                                  fully_connected_dim=fully_connected_dim)\n",
+    "\n",
+    "        self.layernorm1 = LayerNormalization(epsilon=layernorm_eps)\n",
+    "        self.layernorm2 = LayerNormalization(epsilon=layernorm_eps)\n",
+    "\n",
+    "        self.dropout_ffn = Dropout(dropout_rate)\n",
+    "    \n",
+    "    def call(self, x, training, mask):\n",
+    "        \"\"\"\n",
+    "        Forward pass for the Encoder Layer\n",
+    "        \n",
+    "        Arguments:\n",
+    "            x -- Tensor of shape (batch_size, input_seq_len, embedding_dim)\n",
+    "            training -- Boolean, set to true to activate\n",
+    "                        the training mode for dropout layers\n",
+    "            mask -- Boolean mask to ensure that the padding is not \n",
+    "                    treated as part of the input\n",
+    "        Returns:\n",
+    "            encoder_layer_out -- Tensor of shape (batch_size, input_seq_len, embedding_dim)\n",
+    "        \"\"\"\n",
+    "        # START CODE HERE\n",
+    "        # calculate self-attention using mha(~1 line).\n",
+    "        # Dropout is added by Keras automatically if the dropout parameter is non-zero during training\n",
+    "        self_mha_output = None  # Self attention (batch_size, input_seq_len, embedding_dim)\n",
+    "        \n",
+    "        # skip connection\n",
+    "        # apply layer normalization on sum of the input and the attention output to get the  \n",
+    "        # output of the multi-head attention layer (~1 line)\n",
+    "        skip_x_attention = None  # (batch_size, input_seq_len, embedding_dim)\n",
+    "\n",
+    "        # pass the output of the multi-head attention layer through a ffn (~1 line)\n",
+    "        ffn_output = None  # (batch_size, input_seq_len, embedding_dim)\n",
+    "        \n",
+    "        # apply dropout layer to ffn output during training (~1 line)\n",
+    "        # use `training=training` \n",
+    "        ffn_output = None\n",
+    "        \n",
+    "        # apply layer normalization on sum of the output from multi-head attention (skip connection) and ffn output to get the\n",
+    "        # output of the encoder layer (~1 line)\n",
+    "        encoder_layer_out = None  # (batch_size, input_seq_len, embedding_dim)\n",
+    "        # END CODE HERE\n",
+    "        \n",
+    "        return encoder_layer_out\n",
+    "    "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# UNIT TEST\n",
+    "EncoderLayer_test(EncoderLayer)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<a name='4-2'></a>\n",
+    "### 4.2 - Full Encoder\n",
+    "\n",
+    "Awesome job! You have now successfully implemented positional encoding, self-attention, and an encoder layer - give yourself a pat on the back. Now you're ready to build the full Transformer Encoder (Figure 2b), where you will embed your input and add the positional encodings you calculated. You will then feed your encoded embeddings to a stack of Encoder layers. \n",
+    "\n",
+    "<img src=\"encoder.png\" alt=\"Encoder\" width=\"330\"/>\n",
+    "<caption><center><font color='purple'><b>Figure 2b: Transformer Encoder</b></font></center></caption>\n",
+    "\n",
+    "\n",
+    "<a name='ex-5'></a>\n",
+    "### Exercise 5 - Encoder\n",
+    "\n",
+    "Complete the `Encoder()` function using the `call()` method to embed your input, add positional encoding, and implement multiple encoder layers. \n",
+    "\n",
+    "In this exercise, you will initialize your Encoder with an Embedding layer, positional encoding, and multiple EncoderLayers. Your `call()` method will perform the following steps: \n",
+    "1. Pass your input through the Embedding layer.\n",
+    "2. Scale your embedding by multiplying it by the square root of your embedding dimension. Remember to cast the embedding dimension to data type `tf.float32` before computing the square root.\n",
+    "3. Add the position encoding `self.pos_encoding[:, :seq_len, :]` to your embedding.\n",
+    "4. Pass the encoded embedding through a dropout layer, remembering to use the `training` parameter to set the model training mode. \n",
+    "5. Pass the output of the dropout layer through the stack of encoding layers using a for loop."
+   ]
+  },
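+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Steps 2 and 3 above (scaling the embedding and adding the positional encoding) can be checked by hand on a tiny example. The sketch below uses plain Python lists and made-up numbers; all `toy_` names are hypothetical and exist only for this illustration. It mirrors the arithmetic, not the TensorFlow calls you will write in the exercise."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import math\n",
+    "\n",
+    "toy_embedding_dim = 4\n",
+    "toy_seq_len = 2\n",
+    "\n",
+    "# Made-up embeddings for a sequence of 2 tokens\n",
+    "toy_embeddings = [[0.1, 0.2, 0.3, 0.4],\n",
+    "                  [0.5, 0.6, 0.7, 0.8]]\n",
+    "\n",
+    "# Made-up positional encodings for up to 3 positions\n",
+    "toy_pos_encoding = [[0.0, 1.0, 0.0, 1.0],\n",
+    "                    [0.8, 0.5, 0.0, 1.0],\n",
+    "                    [0.9, -0.4, 0.0, 1.0]]\n",
+    "\n",
+    "toy_scale = math.sqrt(float(toy_embedding_dim))  # sqrt(4.0) = 2.0\n",
+    "\n",
+    "# Scale each embedding, then add the encoding of the same position;\n",
+    "# only the first toy_seq_len rows of the positional encoding are used\n",
+    "toy_encoded = [[toy_scale * e + p for e, p in zip(emb_row, pos_row)]\n",
+    "               for emb_row, pos_row in zip(toy_embeddings, toy_pos_encoding[:toy_seq_len])]\n",
+    "print(toy_encoded)"
+   ]
+  },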
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "7j2Tjr0K0t0I"
+   },
+   "outputs": [],
+   "source": [
+    " # UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n",
+    "# GRADED FUNCTION\n",
+    "class Encoder(tf.keras.layers.Layer):\n",
+    "    \"\"\"\n",
+    "    The entire Encoder starts by passing the input to an embedding layer \n",
+    "    and using positional encoding to then pass the output through a stack of\n",
+    "    encoder Layers\n",
+    "        \n",
+    "    \"\"\"  \n",
+    "    def __init__(self, num_layers, embedding_dim, num_heads, fully_connected_dim, input_vocab_size,\n",
+    "               maximum_position_encoding, dropout_rate=0.1, layernorm_eps=1e-6):\n",
+    "        super(Encoder, self).__init__()\n",
+    "\n",
+    "        self.embedding_dim = embedding_dim\n",
+    "        self.num_layers = num_layers\n",
+    "\n",
+    "        self.embedding = Embedding(input_vocab_size, self.embedding_dim)\n",
+    "        self.pos_encoding = positional_encoding(maximum_position_encoding, \n",
+    "                                                self.embedding_dim)\n",
+    "\n",
+    "\n",
+    "        self.enc_layers = [EncoderLayer(embedding_dim=self.embedding_dim,\n",
+    "                                        num_heads=num_heads,\n",
+    "                                        fully_connected_dim=fully_connected_dim,\n",
+    "                                        dropout_rate=dropout_rate,\n",
+    "                                        layernorm_eps=layernorm_eps) \n",
+    "                           for _ in range(self.num_layers)]\n",
+    "\n",
+    "        self.dropout = Dropout(dropout_rate)\n",
+    "        \n",
+    "    def call(self, x, training, mask):\n",
+    "        \"\"\"\n",
+    "        Forward pass for the Encoder\n",
+    "        \n",
+    "        Arguments:\n",
+    "            x -- Tensor of shape (batch_size, input_seq_len)\n",
+    "            training -- Boolean, set to true to activate\n",
+    "                        the training mode for dropout layers\n",
+    "            mask -- Boolean mask to ensure that the padding is not \n",
+    "                    treated as part of the input\n",
+    "        Returns:\n",
+    "            x -- Tensor of shape (batch_size, input_seq_len, embedding_dim)\n",
+    "        \"\"\"\n",
+    "        seq_len = tf.shape(x)[1]\n",
+    "        \n",
+    "        # START CODE HERE\n",
+    "        # Pass input through the Embedding layer\n",
+    "        x = None  # (batch_size, input_seq_len, embedding_dim)\n",
+    "        # Scale embedding by multiplying it by the square root of the embedding dimension\n",
+    "        x *= None\n",
+    "        # Add the position encoding to embedding\n",
+    "        x += None\n",
+    "        # Pass the encoded embedding through a dropout layer\n",
+    "        # use `training=training`\n",
+    "        x = None\n",
+    "        # Pass the output through the stack of encoding layers \n",
+    "        for i in range(self.num_layers):\n",
+    "            x = None\n",
+    "        # END CODE HERE\n",
+    "\n",
+    "        return x  # (batch_size, input_seq_len, embedding_dim)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# UNIT TEST    \n",
+    "Encoder_test(Encoder)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<a name='5'></a>\n",
+    "## 5 - Decoder\n",
+    "\n",
+    "The Decoder layer takes the K and V matrices generated by the Encoder and computes the second multi-head attention layer with the Q matrix from the output (Figure 3a).\n",
+    "\n",
+    "<img src=\"decoder_layer.png\" alt=\"Decoder layer\" width=\"250\"/>\n",
+    "<caption><center><font color='purple'><b>Figure 3a: Transformer Decoder layer</b></font></center></caption>\n",
+    "\n",
+    "<a name='5-1'></a>    \n",
+    "### 5.1 - Decoder Layer\n",
+    "Again, you'll pair multi-head attention with a feed forward neural network, but this time you'll implement two multi-head attention layers. You will also use residual connections and layer normalization to help speed up training (Figure 3a).\n",
+    "\n",
+    "<a name='ex-6'></a>    \n",
+    "### Exercise 6 - DecoderLayer\n",
+    "    \n",
+    "Implement `DecoderLayer()` using the `call()` method\n",
+    "    \n",
+    "1. Block 1 is a multi-head attention layer with a residual connection, and look-ahead mask. Like in the `EncoderLayer`, Dropout is defined within the multi-head attention layer.\n",
+    "2. Block 2 will take into account the output of the Encoder, so the multi-head attention layer will receive K and V from the encoder, and Q from Block 1. You will then apply a normalization layer and a residual connection, just like you did before with the `EncoderLayer`.\n",
+    "3. Finally, Block 3 is a feed forward neural network with dropout and normalization layers and a residual connection.\n",
+    "    \n",
+    "**Additional Hints:**\n",
+    "* The first two blocks are fairly similar to the EncoderLayer except you will return `attention_scores` when computing self-attention"
+   ]
+  },
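+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The look-ahead mask used in Block 1 can be pictured as a lower-triangular matrix. The sketch below builds one with plain Python lists. `toy_look_ahead_mask` is a hypothetical helper written for this illustration; it is not the notebook's `create_look_ahead_mask`, and it assumes the convention stated earlier for `attention_mask`, where 1 marks a position that may be attended to and 0 a position that may not."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def toy_look_ahead_mask(size):\n",
+    "    # Row i is the query position, column j is the key position;\n",
+    "    # position i may attend to position j only when j <= i\n",
+    "    return [[1 if j <= i else 0 for j in range(size)] for i in range(size)]\n",
+    "\n",
+    "\n",
+    "for toy_row in toy_look_ahead_mask(4):\n",
+    "    print(toy_row)"
+   ]
+  },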
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "wEouNFvCzMeT"
+   },
+   "outputs": [],
+   "source": [
+    "# UNQ_C6 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n",
+    "# GRADED FUNCTION DecoderLayer\n",
+    "class DecoderLayer(tf.keras.layers.Layer):\n",
+    "    \"\"\"\n",
+    "    The decoder layer is composed of two multi-head attention blocks, \n",
+    "    one that takes the new input and uses self-attention, and the other \n",
+    "    one that combines it with the output of the encoder, followed by a\n",
+    "    fully connected block. \n",
+    "    \"\"\"\n",
+    "    def __init__(self, embedding_dim, num_heads, fully_connected_dim, dropout_rate=0.1, layernorm_eps=1e-6):\n",
+    "        super(DecoderLayer, self).__init__()\n",
+    "\n",
+    "        self.mha1 = MultiHeadAttention(num_heads=num_heads,\n",
+    "                                      key_dim=embedding_dim,\n",
+    "                                      dropout=dropout_rate)\n",
+    "\n",
+    "        self.mha2 = MultiHeadAttention(num_heads=num_heads,\n",
+    "                                      key_dim=embedding_dim,\n",
+    "                                      dropout=dropout_rate)\n",
+    "\n",
+    "        self.ffn = FullyConnected(embedding_dim=embedding_dim,\n",
+    "                                  fully_connected_dim=fully_connected_dim)\n",
+    "\n",
+    "        self.layernorm1 = LayerNormalization(epsilon=layernorm_eps)\n",
+    "        self.layernorm2 = LayerNormalization(epsilon=layernorm_eps)\n",
+    "        self.layernorm3 = LayerNormalization(epsilon=layernorm_eps)\n",
+    "\n",
+    "        self.dropout_ffn = Dropout(dropout_rate)\n",
+    "    \n",
+    "    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):\n",
+    "        \"\"\"\n",
+    "        Forward pass for the Decoder Layer\n",
+    "        \n",
+    "        Arguments:\n",
+    "            x -- Tensor of shape (batch_size, target_seq_len, embedding_dim)\n",
+    "            enc_output -- Tensor of shape (batch_size, input_seq_len, embedding_dim)\n",
+    "            training -- Boolean, set to true to activate\n",
+    "                        the training mode for dropout layers\n",
+    "            look_ahead_mask -- Boolean mask for the target_input\n",
+    "            padding_mask -- Boolean mask for the second multihead attention layer\n",
+    "        Returns:\n",
+    "            out3 -- Tensor of shape (batch_size, target_seq_len, embedding_dim)\n",
+    "            attn_weights_block1 -- Tensor of shape (batch_size, num_heads, target_seq_len, target_seq_len)\n",
+    "            attn_weights_block2 -- Tensor of shape (batch_size, num_heads, target_seq_len, input_seq_len)\n",
+    "        \"\"\"\n",
+    "        \n",
+    "        # START CODE HERE\n",
+    "        # enc_output.shape == (batch_size, input_seq_len, embedding_dim)\n",
+    "        \n",
+    "        # BLOCK 1\n",
+    "        # calculate self-attention and return attention scores as attn_weights_block1.\n",
+    "        # Dropout will be applied during training (~1 line).\n",
+    "        mult_attn_out1, attn_weights_block1 = self.mha1(None, None, None, None, return_attention_scores=True)  # (batch_size, target_seq_len, embedding_dim)\n",
+    "        \n",
+    "        # apply layer normalization (layernorm1) to the sum of the attention output and the input (~1 line)\n",
+    "        Q1 = None\n",
+    "\n",
+    "        # BLOCK 2\n",
+    "        # calculate encoder-decoder (cross) attention using the Q from the first block and K and V from the encoder output. \n",
+    "        # Dropout will be applied during training\n",
+    "        # Return attention scores as attn_weights_block2 (~1 line) \n",
+    "        mult_attn_out2, attn_weights_block2 = self.mha2(None, None, None, None, return_attention_scores=True)  # (batch_size, target_seq_len, embedding_dim)\n",
+    "        \n",
+    "        # apply layer normalization (layernorm2) to the sum of the attention output and the output of the first block (~1 line)\n",
+    "        mult_attn_out2 = None  # (batch_size, target_seq_len, embedding_dim)\n",
+    "                \n",
+    "        # BLOCK 3\n",
+    "        # pass the output of the second block through a ffn\n",
+    "        ffn_output = None  # (batch_size, target_seq_len, embedding_dim)\n",
+    "        \n",
+    "        # apply a dropout layer to the ffn output\n",
+    "        # use `training=training`\n",
+    "        ffn_output = None\n",
+    "        \n",
+    "        # apply layer normalization (layernorm3) to the sum of the ffn output and the output of the second block\n",
+    "        out3 = None  # (batch_size, target_seq_len, embedding_dim)\n",
+    "        # END CODE HERE\n",
+    "\n",
+    "        return out3, attn_weights_block1, attn_weights_block2\n",
+    "    "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# UNIT TEST\n",
+    "DecoderLayer_test(DecoderLayer, create_look_ahead_mask)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<a name='5-2'></a> \n",
+    "### 5.2 - Full Decoder\n",
+    "You're almost there! Time to use your Decoder layer to build a full Transformer Decoder (Figure 3b). You will embed your output and add positional encodings. You will then feed your encoded embeddings to a stack of Decoder layers. \n",
+    "\n",
+    "\n",
+    "<img src=\"decoder.png\" alt=\"Decoder\" width=\"300\"/>\n",
+    "<caption><center><font color='purple'><b>Figure 3b: Transformer Decoder</b></font></center></caption>\n",
+    "\n",
+    "<a name='ex-7'></a>     \n",
+    "### Exercise 7 - Decoder\n",
+    "\n",
+    "Implement `Decoder()` using the `call()` method to embed your output, add positional encoding, and implement multiple decoder layers.\n",
+    " \n",
+    "In this exercise, you will initialize your Decoder with an Embedding layer, positional encoding, and multiple DecoderLayers. Your `call()` method will perform the following steps: \n",
+    "1. Pass your generated output through the Embedding layer.\n",
+    "2. Scale your embedding by multiplying it by the square root of your embedding dimension. Remember to cast the embedding dimension to data type `tf.float32` before computing the square root.\n",
+    "3. Add the position encoding `self.pos_encoding[:, :seq_len, :]` to your embedding.\n",
+    "4. Pass the encoded embedding through a dropout layer, remembering to use the `training` parameter to set the model training mode. \n",
+    "5. Pass the output of the dropout layer through the stack of Decoding layers using a for loop."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "McS3by6k4pnP"
+   },
+   "outputs": [],
+   "source": [
+    "# UNQ_C7 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n",
+    "# GRADED FUNCTION Decoder\n",
+    "class Decoder(tf.keras.layers.Layer):\n",
+    "    \"\"\"\n",
+    "    The entire Decoder starts by passing the target input to an embedding layer \n",
+    "    and using positional encoding to then pass the output through a stack of\n",
+    "    decoder Layers\n",
+    "        \n",
+    "    \"\"\" \n",
+    "    def __init__(self, num_layers, embedding_dim, num_heads, fully_connected_dim, target_vocab_size,\n",
+    "               maximum_position_encoding, dropout_rate=0.1, layernorm_eps=1e-6):\n",
+    "        super(Decoder, self).__init__()\n",
+    "\n",
+    "        self.embedding_dim = embedding_dim\n",
+    "        self.num_layers = num_layers\n",
+    "\n",
+    "        self.embedding = Embedding(target_vocab_size, self.embedding_dim)\n",
+    "        self.pos_encoding = positional_encoding(maximum_position_encoding, self.embedding_dim)\n",
+    "\n",
+    "        self.dec_layers = [DecoderLayer(embedding_dim=self.embedding_dim,\n",
+    "                                        num_heads=num_heads,\n",
+    "                                        fully_connected_dim=fully_connected_dim,\n",
+    "                                        dropout_rate=dropout_rate,\n",
+    "                                        layernorm_eps=layernorm_eps) \n",
+    "                           for _ in range(self.num_layers)]\n",
+    "        self.dropout = Dropout(dropout_rate)\n",
+    "    \n",
+    "    def call(self, x, enc_output, training, \n",
+    "           look_ahead_mask, padding_mask):\n",
+    "        \"\"\"\n",
+    "        Forward pass for the Decoder\n",
+    "        \n",
+    "        Arguments:\n",
+    "            x -- Tensor of shape (batch_size, target_seq_len)\n",
+    "            enc_output -- Tensor of shape (batch_size, input_seq_len, embedding_dim)\n",
+    "            training -- Boolean, set to true to activate\n",
+    "                        the training mode for dropout layers\n",
+    "            look_ahead_mask -- Boolean mask for the target_input\n",
+    "            padding_mask -- Boolean mask for the second multihead attention layer\n",
+    "        Returns:\n",
+    "            x -- Tensor of shape (batch_size, target_seq_len, embedding_dim)\n",
+    "            attention_weights -- Dictionary of tensors containing all the attention weights,\n",
+    "                                each of shape (batch_size, num_heads, target_seq_len, seq_len), where seq_len is\n",
+    "                                target_seq_len for block 1 (self-attention) and input_seq_len for block 2\n",
+    "        \"\"\"\n",
+    "\n",
+    "        seq_len = tf.shape(x)[1]\n",
+    "        attention_weights = {}\n",
+    "        \n",
+    "        # START CODE HERE\n",
+    "        # create word embeddings \n",
+    "        x = None  # (batch_size, target_seq_len, embedding_dim)\n",
+    "        \n",
+    "        # scale embeddings by multiplying by the square root of their dimension\n",
+    "        x *= None\n",
+    "        \n",
+    "        # calculate positional encodings and add to word embedding\n",
+    "        x += None\n",
+    "\n",
+    "        # apply a dropout layer to x\n",
+    "        # use `training=training`\n",
+    "        x = None\n",
+    "\n",
+    "        # use a for loop to pass x through a stack of decoder layers and update attention_weights (~4 lines total)\n",
+    "        for i in range(self.num_layers):\n",
+    "            # pass x and the encoder output through a stack of decoder layers and save the attention weights\n",
+    "            # of block 1 and 2 (~1 line)\n",
+    "            x, block1, block2 = self.dec_layers[i](None, None, None,\n",
+    "                                                 None, None)\n",
+    "\n",
+    "            # update attention_weights dictionary with the attention weights of block 1 and block 2\n",
+    "            attention_weights['decoder_layer{}_block1_self_att'.format(i+1)] = None\n",
+    "            attention_weights['decoder_layer{}_block2_decenc_att'.format(i+1)] = None\n",
+    "        # END CODE HERE\n",
+    "        \n",
+    "        # x.shape == (batch_size, target_seq_len, embedding_dim)\n",
+    "        return x, attention_weights"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# UNIT TEST\n",
+    "Decoder_test(Decoder, create_look_ahead_mask, create_padding_mask)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<a name='6'></a> \n",
+    "## 6 - Transformer\n",
+    "\n",
+    "Phew! This has been quite the assignment, and now you've made it to your last exercise of the Deep Learning Specialization. Congratulations! You've done all the hard work, now it's time to put it all together.  \n",
+    "\n",
+    "<img src=\"transformer.png\" alt=\"Transformer\" width=\"550\"/>\n",
+    "<caption><center><font color='purple'><b>Figure 4: Transformer</b></font></center></caption>\n",
+    "    \n",
+    "The flow of data through the Transformer Architecture is as follows:\n",
+    "* First your input passes through an Encoder, which is just repeated Encoder layers that you implemented:\n",
+    "    - embedding and positional encoding of your input\n",
+    "    - multi-head attention on your input\n",
+    "    - feed forward neural network to help detect features\n",
+    "* Then the predicted output passes through a Decoder, consisting of the decoder layers that you implemented:\n",
+    "    - embedding and positional encoding of the output\n",
+    "    - multi-head attention on your generated output\n",
+    "    - multi-head attention with the Q from the first multi-head attention layer and the K and V from the Encoder\n",
+    "    - a feed forward neural network to help detect features\n",
+    "* Finally, after the Nth Decoder layer, one dense layer and a softmax are applied to generate a prediction for the next output in your sequence.\n",
+    "\n",
+    "<a name='ex-8'></a> \n",
+    "### Exercise 8 - Transformer\n",
+    "\n",
+    "Implement `Transformer()` using the `call()` method\n",
+    "1. Pass the input through the Encoder with the appropriate mask.\n",
+    "2. Pass the encoder output and the target through the Decoder with the appropriate mask.\n",
+    "3. Apply a linear transformation and a softmax to get a prediction."
+   ]
+  },
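+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Step 3 above ends in a softmax, which turns the scores produced by the final dense layer into a probability distribution over the target vocabulary. A minimal pure-Python sketch of that last step follows. `toy_softmax` and the three-word vocabulary scores are hypothetical and made up for illustration; in the model itself this is handled by the `softmax` activation of the final `Dense` layer."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import math\n",
+    "\n",
+    "\n",
+    "def toy_softmax(logits):\n",
+    "    # Subtract the maximum before exponentiating for numerical stability\n",
+    "    largest = max(logits)\n",
+    "    exps = [math.exp(z - largest) for z in logits]\n",
+    "    total = sum(exps)\n",
+    "    return [e / total for e in exps]\n",
+    "\n",
+    "\n",
+    "toy_logits = [2.0, 1.0, 0.1]  # made-up scores for a 3-word vocabulary\n",
+    "toy_probs = toy_softmax(toy_logits)\n",
+    "print(toy_probs)\n",
+    "print(sum(toy_probs))  # probabilities sum to 1 up to rounding"
+   ]
+  },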
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "QHymPmaj-2ba"
+   },
+   "outputs": [],
+   "source": [
+    "# UNQ_C8 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n",
+    "# GRADED FUNCTION Transformer\n",
+    "class Transformer(tf.keras.Model):\n",
+    "    \"\"\"\n",
+    "    Complete transformer with an Encoder and a Decoder\n",
+    "    \"\"\"\n",
+    "    def __init__(self, num_layers, embedding_dim, num_heads, fully_connected_dim, input_vocab_size, \n",
+    "               target_vocab_size, max_positional_encoding_input,\n",
+    "               max_positional_encoding_target, dropout_rate=0.1, layernorm_eps=1e-6):\n",
+    "        super(Transformer, self).__init__()\n",
+    "\n",
+    "        self.encoder = Encoder(num_layers=num_layers,\n",
+    "                               embedding_dim=embedding_dim,\n",
+    "                               num_heads=num_heads,\n",
+    "                               fully_connected_dim=fully_connected_dim,\n",
+    "                               input_vocab_size=input_vocab_size,\n",
+    "                               maximum_position_encoding=max_positional_encoding_input,\n",
+    "                               dropout_rate=dropout_rate,\n",
+    "                               layernorm_eps=layernorm_eps)\n",
+    "\n",
+    "        self.decoder = Decoder(num_layers=num_layers, \n",
+    "                               embedding_dim=embedding_dim,\n",
+    "                               num_heads=num_heads,\n",
+    "                               fully_connected_dim=fully_connected_dim,\n",
+    "                               target_vocab_size=target_vocab_size, \n",
+    "                               maximum_position_encoding=max_positional_encoding_target,\n",
+    "                               dropout_rate=dropout_rate,\n",
+    "                               layernorm_eps=layernorm_eps)\n",
+    "\n",
+    "        self.final_layer = Dense(target_vocab_size, activation='softmax')\n",
+    "    \n",
+    "    def call(self, input_sentence, output_sentence, training, enc_padding_mask, look_ahead_mask, dec_padding_mask):\n",
+    "        \"\"\"\n",
+    "        Forward pass for the entire Transformer\n",
+    "        Arguments:\n",
+    "            input_sentence -- Tensor of shape (batch_size, input_seq_len)\n",
+    "                              An array of the indexes of the words in the input sentence\n",
+    "            output_sentence -- Tensor of shape (batch_size, target_seq_len)\n",
+    "                              An array of the indexes of the words in the output sentence\n",
+    "            training -- Boolean, set to true to activate\n",
+    "                        the training mode for dropout layers\n",
+    "            enc_padding_mask -- Boolean mask to ensure that the padding is not \n",
+    "                    treated as part of the input\n",
+    "            look_ahead_mask -- Boolean mask for the target_input\n",
+    "            dec_padding_mask -- Boolean mask for the second multihead attention layer\n",
+    "        Returns:\n",
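+    "            final_output -- Tensor of shape (batch_size, target_seq_len, target_vocab_size)\n",
+    "                            with the softmax probabilities over the target vocabulary\n",
+    "            attention_weights -- Dictionary of tensors containing all the attention weights for the decoder,\n",
+    "                                each of shape (batch_size, num_heads, target_seq_len, seq_len), where seq_len is\n",
+    "                                target_seq_len for block 1 (self-attention) and input_seq_len for block 2\n",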
+    "        \n",
+    "        \"\"\"\n",
+    "        # START CODE HERE\n",
+    "        # call self.encoder with the appropriate arguments to get the encoder output\n",
+    "        enc_output = None  # (batch_size, inp_seq_len, embedding_dim)\n",
+    "        \n",
+    "        # call self.decoder with the appropriate arguments to get the decoder output\n",
+    "        # dec_output.shape == (batch_size, tar_seq_len, embedding_dim)\n",
+    "        dec_output, attention_weights = self.decoder(None, None, None, None, None)\n",
+    "        \n",
+    "        # pass decoder output through a linear layer and softmax (~2 lines)\n",
+    "        final_output = None # (batch_size, tar_seq_len, target_vocab_size)\n",
+    "        # END CODE HERE\n",
+    "\n",
+    "        return final_output, attention_weights"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# UNIT TEST\n",
+    "Transformer_test(Transformer, create_look_ahead_mask, create_padding_mask)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Conclusion\n",
+    "\n",
+    "You've come to the end of the graded portion of the assignment. By now, you've: \n",
+    "\n",
+    "* Created positional encodings to capture sequential relationships in data\n",
+    "* Calculated scaled dot-product self-attention with word embeddings\n",
+    "* Implemented masked multi-head attention\n",
+    "* Built and trained a Transformer model"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<font color='blue'>\n",
+    "    <b>What you should remember</b>:\n",
+    "\n",
+    "- The combination of self-attention and fully connected feed-forward layers allows for parallelization of training and *faster training*.\n",
+    "- Self-attention is calculated using the generated query Q, key K, and value V matrices.\n",
+    "- Adding positional encoding to word embeddings is an effective way to include sequence information in self-attention calculations. \n",
+    "- Multi-head attention can help detect multiple features in your sentence.\n",
+    "- Masking stops the model from 'looking ahead' during training, or weighting zeroes too much when processing cropped sentences. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now that you have completed the Transformer assignment, make sure you check out the ungraded labs to apply the Transformer model to practical use cases such as Named Entity Recognition (NER) and Question Answering (QA).  \n",
+    "\n",
+    "\n",
+    "# Congratulations on finishing the Deep Learning Specialization!!!!!! 🎉🎉🎉🎉🎉\n",
+    "\n",
+    "This was the last graded assignment of the specialization. It is now time to celebrate all your hard work and dedication! \n",
+    "\n",
+    "<a name='7'></a> \n",
+    "## 7 - References\n",
+    "\n",
+    "The Transformer algorithm is due to Vaswani et al. (2017). \n",
+    "\n",
+    "- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin (2017). [Attention Is All You Need](https://arxiv.org/abs/1706.03762) "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "accelerator": "GPU",
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
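
The positional-encoding step summarized in the notebook's conclusion can be sketched in plain NumPy. This is a minimal sketch, not the notebook's graded solution: it assumes the standard formulation from Vaswani et al. (sine on even feature indices, cosine on odd ones, angle = pos / 10000^(2*(k//2)/d_model)), which is consistent with the checks in `get_angles_test` and `positional_encoding_test`. The notebook version is expected to return a TensorFlow tensor; this one returns a NumPy array for illustration only.

```python
import numpy as np


def get_angles(pos, k, d_model):
    """Return pos / 10000^(2*(k//2)/d_model).

    pos -- column vector of positions, shape (positions, 1)
    k   -- row vector of feature indices, shape (1, d_model)
    """
    i = k // 2  # columns 2i and 2i+1 share the same angle
    return pos / np.power(10000.0, 2.0 * i / d_model)


def positional_encoding(positions, d_model):
    """Sinusoidal encoding of shape (1, positions, d_model) (NumPy sketch)."""
    angles = get_angles(np.arange(positions)[:, np.newaxis],
                        np.arange(d_model)[np.newaxis, :],
                        d_model)
    angles[:, 0::2] = np.sin(angles[:, 0::2])  # even feature indices
    angles[:, 1::2] = np.cos(angles[:, 1::2])  # odd feature indices
    return angles[np.newaxis, ...]             # add a leading batch axis


pe = positional_encoding(8, 16)
print(pe.shape)
```

Because each sine/cosine pair shares one angle, the squares of paired columns sum to 1, which is the property `positional_encoding_test` asserts.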

Transformer Mechanism/Transformer_Implementation/home/jovyan/work/W4A1/C5_W4_A1_Transformer_Subclass_v1.ipynb ADDED Viewed

The diff for this file is too large to render. See raw diff

Transformer Mechanism/Transformer_Implementation/home/jovyan/work/W4A1/__pycache__/public_tests.cpython-37.pyc ADDED Viewed

Binary file (11.3 kB). View file

Transformer Mechanism/Transformer_Implementation/home/jovyan/work/W4A1/decoder.png ADDED Viewed

Git LFS Details

SHA256: 810b53e90652631fe53ca0a12c0de6b295d9cf02365b40cab549bf8102ed0f77
Pointer size: 131 Bytes
Size of remote file: 160 kB

Transformer Mechanism/Transformer_Implementation/home/jovyan/work/W4A1/decoder_layer.png ADDED Viewed

Git LFS Details

SHA256: 02a310648d114bd9b4e45590ed66b3122ebaa491ae0d4cfd4418ed7055d6c2d9
Pointer size: 131 Bytes
Size of remote file: 119 kB

Transformer Mechanism/Transformer_Implementation/home/jovyan/work/W4A1/encoder.png ADDED Viewed

Git LFS Details

SHA256: 324ecb745c1f234eedd82f54c753a6880a57422ab7c003f2e1d74ecaa264b79b
Pointer size: 131 Bytes
Size of remote file: 111 kB

Transformer Mechanism/Transformer_Implementation/home/jovyan/work/W4A1/encoder_layer.png ADDED Viewed

Git LFS Details

SHA256: 4b136a9265a6abcf53e5034388ef3b6b785da231a4798fb6aabebae7976ae600
Pointer size: 131 Bytes
Size of remote file: 116 kB

Transformer Mechanism/Transformer_Implementation/home/jovyan/work/W4A1/ner.json ADDED Viewed

The diff for this file is too large to render. See raw diff

Transformer Mechanism/Transformer_Implementation/home/jovyan/work/W4A1/public_tests.py ADDED Viewed

	@@ -0,0 +1,315 @@

+import numpy as np
+import tensorflow as tf
+def get_angles_test(target):
+    position = 4
+    d_model = 16
+    pos_m = np.arange(position)[:, np.newaxis]
+    dims = np.arange(d_model)[np.newaxis, :]
+    result = target(pos_m, dims, d_model)
+    assert type(result) == np.ndarray, "You must return a numpy ndarray"
+    assert result.shape == (position, d_model), f"Wrong shape. We expected: ({position}, {d_model})"
+    assert np.sum(result[0, :]) == 0
+    assert np.isclose(np.sum(result[:, 0]), position * (position - 1) / 2)
+    even_cols =  result[:, 0::2]
+    odd_cols = result[:,  1::2]
+    assert np.all(even_cols == odd_cols), "Submatrices of odd and even columns must be equal"
+    limit = (position - 1) / np.power(10000,14.0/16.0)
+    assert np.isclose(result[position - 1, d_model -1], limit ), f"Last value must be {limit}"
+    print("\033[92mAll tests passed")
+def positional_encoding_test(target, get_angles):
+    position = 8
+    d_model = 16
+    pos_encoding = target(position, d_model)
+    sin_part = pos_encoding[:, :, 0::2]
+    cos_part = pos_encoding[:, :, 1::2]
+    assert tf.is_tensor(pos_encoding), "Output is not a tensor"
+    assert pos_encoding.shape == (1, position, d_model), f"Wrong shape. We expected: (1, {position}, {d_model})"
+    ones = sin_part ** 2  +  cos_part ** 2
+    assert np.allclose(ones, np.ones((1, position, d_model // 2))), "Sum of square pairs must be 1 = sin(a)**2 + cos(a)**2"
+    angs = np.arctan(sin_part / cos_part)
+    angs[angs < 0] += np.pi
+    angs[sin_part.numpy() < 0] += np.pi
+    angs = angs % (2 * np.pi)
+    pos_m = np.arange(position)[:, np.newaxis]
+    dims = np.arange(d_model)[np.newaxis, :]
+    trueAngs = get_angles(pos_m, dims, d_model)[:, 0::2] % (2 * np.pi)
+    assert np.allclose(angs[0], trueAngs), "Did you apply sin and cos to even and odd parts respectively?"
+    print("\033[92mAll tests passed")
+def scaled_dot_product_attention_test(target):
+    q = np.array([[1, 0, 1, 1], [0, 1, 1, 1], [1, 0, 0, 1]]).astype(np.float32)
+    k = np.array([[1, 1, 0, 1], [1, 0, 1, 1 ], [0, 1, 1, 0], [0, 0, 0, 1]]).astype(np.float32)
+    v = np.array([[0, 0], [1, 0], [1, 0], [1, 1]]).astype(np.float32)
+    attention, weights = target(q, k, v, None)
+    assert tf.is_tensor(weights), "Weights must be a tensor"
+    assert tuple(tf.shape(weights).numpy()) == (q.shape[0], k.shape[0]), f"Wrong shape. We expected ({q.shape[0]}, {k.shape[0]})"
+    assert np.allclose(weights, [[0.2589478,  0.42693272, 0.15705977, 0.15705977],
+                                   [0.2772748,  0.2772748,  0.2772748,  0.16817567],
+                                   [0.33620113, 0.33620113, 0.12368149, 0.2039163 ]]), "Wrong unmasked weights"
+    assert tf.is_tensor(attention), "Output must be a tensor"
+    assert tuple(tf.shape(attention).numpy()) == (q.shape[0], v.shape[1]), f"Wrong shape. We expected ({q.shape[0]}, {v.shape[1]})"
+    assert np.allclose(attention, [[0.74105227, 0.15705977],
+                                   [0.7227253,  0.16817567],
+                                   [0.6637989,  0.2039163 ]]), "Wrong unmasked attention"
+    mask = np.array([[[1, 1, 0, 1], [1, 1, 0, 1], [1, 1, 0, 1]]])
+    attention, weights = target(q, k, v, mask)
+    assert np.allclose(weights, [[0.30719590187072754, 0.5064803957939148, 0.0, 0.18632373213768005],
+                                 [0.3836517333984375, 0.3836517333984375, 0.0, 0.2326965481042862],
+                                 [0.3836517333984375, 0.3836517333984375, 0.0, 0.2326965481042862]]), "Wrong masked weights"
+    assert np.allclose(attention, [[0.6928040981292725, 0.18632373213768005],
+                                   [0.6163482666015625, 0.2326965481042862],
+                                   [0.6163482666015625, 0.2326965481042862]]), "Wrong masked attention"
+    print("\033[92mAll tests passed")
+def EncoderLayer_test(target):
+    q = np.array([[[1, 0, 1, 1], [0, 1, 1, 1], [1, 0, 0, 1]]]).astype(np.float32)
+    encoder_layer1 = target(4, 2, 8)
+    tf.random.set_seed(10)
+    encoded = encoder_layer1(q, True, np.array([[1, 0, 1]]))
+    assert tf.is_tensor(encoded), "Wrong type. Output must be a tensor"
+    assert tuple(tf.shape(encoded).numpy()) == (1, q.shape[1], q.shape[2]), f"Wrong shape. We expected ((1, {q.shape[1]}, {q.shape[2]}))"
+    assert np.allclose(encoded.numpy(),
+                       [[ 0.23017104, -0.98100424, -0.78707516,  1.5379084 ],
+                       [-1.2280797 ,  0.76477575, -0.7169283 ,  1.1802323 ],
+                       [ 0.14880152, -0.48318022, -1.1908402 ,  1.5252188 ]]), "Wrong values when training=True"
+    encoded = encoder_layer1(q, False, np.array([[1, 1, 0]]))
+    assert np.allclose(encoded.numpy(), [[ 0.5167701 , -0.92981905, -0.9731106 ,  1.3861597 ],
+                           [-1.120878  ,  1.0826552 , -0.8671041 ,  0.905327  ],
+                           [ 0.28154755, -0.3661362 , -1.3330412 ,  1.4176297 ]]), "Wrong values when training=False"
+    print("\033[92mAll tests passed")
+def Encoder_test(target):
+    tf.random.set_seed(10)
+    embedding_dim=4
+    encoderq = target(num_layers=2,
+                      embedding_dim=embedding_dim,
+                      num_heads=2,
+                      fully_connected_dim=8,
+                      input_vocab_size=32,
+                      maximum_position_encoding=5)
+    x = np.array([[2, 1, 3], [1, 2, 0]])
+    encoderq_output = encoderq(x, True, None)
+    assert tf.is_tensor(encoderq_output), "Wrong type. Output must be a tensor"
+    assert tuple(tf.shape(encoderq_output).numpy()) == (x.shape[0], x.shape[1], embedding_dim), f"Wrong shape. We expected ({x.shape[0]}, {x.shape[1]}, {embedding_dim})"
+    assert np.allclose(encoderq_output.numpy(),
+                       [[[-0.6906098 ,  1.0988709 , -1.260586  ,  0.85232526],
+                         [ 0.7319228 , -0.3826024 , -1.4507656 ,  1.1014453 ],
+                         [ 1.0995713 , -1.1686686 , -0.80888665,  0.8779839 ]],
+                        [[-0.4612937 ,  1.0697356 , -1.4127715 ,  0.8043293 ],
+                         [ 0.27027237,  0.28793618, -1.6370889 ,  1.0788803 ],
+                         [ 1.2370994 , -1.0687275 , -0.8945037 ,  0.7261319 ]]]), "Wrong values case 1"
+    encoderq_output = encoderq(x, True, np.array([[[[1., 1., 1.]]], [[[1., 1., 0.]]]]))
+    assert np.allclose(encoderq_output.numpy(),
+                       [[[-0.36764443,  0.98527074, -1.4714274 ,  0.85380095],
+                           [-0.50018215,  0.66005886, -1.3647256 ,  1.204849  ],
+                           [ 0.99951494, -1.0142792 , -0.9856176 ,  1.0003818 ]],
+                         [[ 0.01838917,  1.038109  , -1.6154225 ,  0.55892444],
+                           [ 0.3872563 , -0.40960154, -1.3456631 ,  1.3680083 ],
+                           [ 0.534565  , -0.70262754, -1.18215   ,  1.3502126 ]]]), "Wrong values case 2"
+    encoderq_output = encoderq(x, False, np.array([[[[1., 1., 1.]]], [[[1., 1., 0.]]]]))
+    assert np.allclose(encoderq_output.numpy(),
+                       [[[-0.5642399 ,  1.0386591 , -1.3530676 ,  0.87864864],
+                           [ 0.5261332 ,  0.21861789, -1.6758442 ,  0.93109316],
+                           [ 1.2870724 , -1.1545564 , -0.7739521 ,  0.6414361 ]],
+                         [[-0.01885331,  0.8866553 , -1.624897  ,  0.75709504],
+                           [ 0.4165045 ,  0.27912217, -1.6719477 ,  0.97632086],
+                           [ 0.71298015, -0.7565592 , -1.1861688 ,  1.2297478 ]]]), "Wrong values case 3"
+    print("\033[92mAll tests passed")
+def DecoderLayer_test(target, create_look_ahead_mask):
+    num_heads=8
+    tf.random.set_seed(10)
+    decoderLayerq = target(
+        embedding_dim=4,
+        num_heads=num_heads,
+        fully_connected_dim=32,
+        dropout_rate=0.1,
+        layernorm_eps=1e-6)
+    encoderq_output = tf.constant([[[-0.40172306,  0.11519244, -1.2322885,   1.5188192 ],
+                                   [ 0.4017268,   0.33922842, -1.6836855,   0.9427304 ],
+                                   [ 0.4685002,  -1.6252842,   0.09368491,  1.063099  ]]])
+    q = np.array([[[1, 0, 1, 1], [0, 1, 1, 1], [1, 0, 0, 1]]]).astype(np.float32)
+    look_ahead_mask = create_look_ahead_mask(q.shape[1])
+    padding_mask = None
+    out, attn_w_b1, attn_w_b2 = decoderLayerq(q, encoderq_output, True, look_ahead_mask, padding_mask)
+    assert tf.is_tensor(attn_w_b1), "Wrong type for attn_w_b1. Output must be a tensor"
+    assert tf.is_tensor(attn_w_b2), "Wrong type for attn_w_b2. Output must be a tensor"
+    assert tf.is_tensor(out), "Wrong type for out. Output must be a tensor"
+    shape1 = (q.shape[0], num_heads, q.shape[1], q.shape[1])
+    assert tuple(tf.shape(attn_w_b1).numpy()) == shape1, f"Wrong shape. We expected {shape1}"
+    assert tuple(tf.shape(attn_w_b2).numpy()) == shape1, f"Wrong shape. We expected {shape1}"
+    assert tuple(tf.shape(out).numpy()) == q.shape, f"Wrong shape. We expected {q.shape}"
+    assert np.allclose(attn_w_b1[0, 0, 1], [0.5271505,  0.47284946, 0.], atol=1e-2), "Wrong values in attn_w_b1. Check the call to self.mha1"
+    assert np.allclose(attn_w_b2[0, 0, 1], [0.32048798, 0.390301, 0.28921106]),  "Wrong values in attn_w_b2. Check the call to self.mha2"
+    assert np.allclose(out[0, 0], [-0.22109576, -1.5455486, 0.852692, 0.9139523]), "Wrong values in out"
+    # Now let's try an example with a padding mask
+    padding_mask = np.array([[[1, 1, 0]]])
+    out, attn_w_b1, attn_w_b2 = decoderLayerq(q, encoderq_output, True, look_ahead_mask, padding_mask)
+    assert np.allclose(out[0, 0], [0.14950314, -1.6444231, 1.0268553, 0.4680646]), "Wrong values in out when we mask the last word. Are you passing the padding_mask to the inner functions?"
+    print("\033[92mAll tests passed")
+def Decoder_test(target, create_look_ahead_mask, create_padding_mask):
+    tf.random.set_seed(10)
+    num_layers=7
+    embedding_dim=4
+    num_heads=3
+    fully_connected_dim=8
+    target_vocab_size=33
+    maximum_position_encoding=6
+    x_array = np.array([[3, 2, 1], [2, 1, 0]])
+    encoderq_output = tf.constant([[[-0.40172306,  0.11519244, -1.2322885,   1.5188192 ],
+                         [ 0.4017268,   0.33922842, -1.6836855,   0.9427304 ],
+                         [ 0.4685002,  -1.6252842,   0.09368491,  1.063099  ]],
+                        [[-0.3489219,   0.31335592, -1.3568854,   1.3924513 ],
+                         [-0.08761203, -0.1680029,  -1.2742313,   1.5298463 ],
+                         [ 0.2627198,  -1.6140151,   0.2212624 ,  1.130033  ]]])
+    look_ahead_mask = create_look_ahead_mask(x_array.shape[1])
+    decoderk = target(num_layers,
+                    embedding_dim,
+                    num_heads,
+                    fully_connected_dim,
+                    target_vocab_size,
+                    maximum_position_encoding)
+    x, attention_weights = decoderk(x_array, encoderq_output, False, look_ahead_mask, None)
+    assert tf.is_tensor(x), "Wrong type for x. It must be a tensor"
+    assert np.allclose(tf.shape(x), tf.shape(encoderq_output)), f"Wrong shape. We expected { tf.shape(encoderq_output)}"
+    assert np.allclose(x[1, 1], [-0.2715261, -0.5606001, -0.861783, 1.69390933]), "Wrong values in x"
+    assert type(attention_weights) == dict, "Wrong type for attention_weights. It must be a dict"
+    keys = list(attention_weights.keys())
+    assert len(keys) == 2 * num_layers, f"Wrong length for attention weights. It must be 2 x num_layers = {2*num_layers}"
+    assert tf.is_tensor(attention_weights[keys[0]]), f"Wrong type for attention_weights[{keys[0]}]. Output must be a tensor"
+    shape1 = (x_array.shape[0], num_heads, x_array.shape[1], x_array.shape[1])
+    assert tuple(tf.shape(attention_weights[keys[1]]).numpy()) == shape1, f"Wrong shape. We expected {shape1}"
+    assert np.allclose(attention_weights[keys[0]][0, 0, 1], [0.52145624, 0.47854376, 0.]), f"Wrong values in attention_weights[{keys[0]}]"
+    x, attention_weights = decoderk(x_array, encoderq_output, True, look_ahead_mask, None)
+    assert np.allclose(x[1, 1], [-0.30814743, -0.6213016, -0.77767026, 1.7071193]), "Wrong values in x when training=True"
+    x, attention_weights = decoderk(x_array, encoderq_output, True, look_ahead_mask, create_padding_mask(x_array))
+    assert np.allclose(x[1, 1], [-0.0250004, 0.50791883, -1.5877104, 1.1047921]), "Wrong values in x when training=True and use padding mask"
+    print("\033[92mAll tests passed")
+def Transformer_test(target, create_look_ahead_mask, create_padding_mask):
+    tf.random.set_seed(10)
+    num_layers = 6
+    embedding_dim = 4
+    num_heads = 4
+    fully_connected_dim = 8
+    input_vocab_size = 30
+    target_vocab_size = 35
+    max_positional_encoding_input = 5
+    max_positional_encoding_target = 6
+    trans = target(num_layers,
+                        embedding_dim,
+                        num_heads,
+                        fully_connected_dim,
+                        input_vocab_size,
+                        target_vocab_size,
+                        max_positional_encoding_input,
+                        max_positional_encoding_target)
+    # 0 is the padding value
+    sentence_lang_a = np.array([[2, 1, 4, 3, 0]])
+    sentence_lang_b = np.array([[3, 2, 1, 0, 0]])
+    enc_padding_mask = create_padding_mask(sentence_lang_a)
+    dec_padding_mask = create_padding_mask(sentence_lang_b)
+    look_ahead_mask = create_look_ahead_mask(sentence_lang_a.shape[1])
+    translation, weights = trans(
+        sentence_lang_a,
+        sentence_lang_b,
+        True,  # Training
+        enc_padding_mask,
+        look_ahead_mask,
+        dec_padding_mask
+    )
+    assert tf.is_tensor(translation), "Wrong type for translation. Output must be a tensor"
+    shape1 = (sentence_lang_a.shape[0], max_positional_encoding_input, target_vocab_size)
+    assert tuple(tf.shape(translation).numpy()) == shape1, f"Wrong shape. We expected {shape1}"
+    assert np.allclose(translation[0, 0, 0:8],
+                       [0.017416516, 0.030932948, 0.024302809, 0.01997807,
+                        0.014861834, 0.034384135, 0.054789476, 0.032087505]), "Wrong values in translation"
+    keys = list(weights.keys())
+    assert type(weights) == dict, "Wrong type for weights. It must be a dict"
+    assert len(keys) == 2 * num_layers, f"Wrong length for attention weights. It must be 2 x num_layers = {2*num_layers}"
+    assert tf.is_tensor(weights[keys[0]]), f"Wrong type for att_weights[{keys[0]}]. Output must be a tensor"
+    shape1 = (sentence_lang_a.shape[0], num_heads, sentence_lang_a.shape[1], sentence_lang_a.shape[1])
+    assert tuple(tf.shape(weights[keys[1]]).numpy()) == shape1, f"Wrong shape. We expected {shape1}"
+    assert np.allclose(weights[keys[0]][0, 0, 1], [0.4805548, 0.51944524, 0.0, 0.0, 0.0]), f"Wrong values in weights[{keys[0]}]"
+    translation, weights = trans(
+        sentence_lang_a,
+        sentence_lang_b,
+        False, # Training
+        enc_padding_mask,
+        look_ahead_mask,
+        dec_padding_mask
+    )
+    assert np.allclose(translation[0, 0, 0:8],
+                       [0.01751175, 0.029051155, 0.024785805, 0.020421047,
+                        0.0149451075, 0.033235606, 0.053800166, 0.028556924]), "Wrong values in translation when training=False"
+    print(translation)
+    print("\033[92mAll tests passed")
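
The expected values in `scaled_dot_product_attention_test` can be reproduced with a short NumPy sketch. It assumes the mask convention implied by the test data (1 means "attend", 0 means "mask out") and a large negative additive bias of -1e9 for masked positions; the function below is an illustration and not the TensorFlow implementation the notebook asks for. The 2-D mask is a simplification of the 3-D mask used in the test.

```python
import numpy as np


def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(q k^T / sqrt(d_k)) v, with an optional 1 = keep / 0 = mask-out mask."""
    d_k = k.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)  # (..., seq_len_q, seq_len_k)
    if mask is not None:
        # Assumed convention: push masked positions towards -infinity
        scores = scores + (1.0 - mask) * -1e9
    # Numerically stable softmax over the key axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Same inputs as the test above
q = np.array([[1, 0, 1, 1], [0, 1, 1, 1], [1, 0, 0, 1]], dtype=np.float32)
k = np.array([[1, 1, 0, 1], [1, 0, 1, 1], [0, 1, 1, 0], [0, 0, 0, 1]], dtype=np.float32)
v = np.array([[0, 0], [1, 0], [1, 0], [1, 1]], dtype=np.float32)
mask = np.array([[1, 1, 0, 1]] * 3, dtype=np.float32)  # hide the third key from every query

attention, weights = scaled_dot_product_attention(q, k, v)
masked_attention, masked_weights = scaled_dot_product_attention(q, k, v, mask)
print(weights.shape, attention.shape)
```

Each row of `weights` sums to 1, and with the mask applied the third column of `masked_weights` is effectively zero, which matches the "Wrong masked weights" assertion.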

Transformer Mechanism/Transformer_Implementation/home/jovyan/work/W4A1/self-attention.png ADDED Viewed

Git LFS Details

SHA256: 6d843c634ab8867bbc26cde22fc243ef514a4b81bc8ab35cee6a68118b7db74a
Pointer size: 131 Bytes
Size of remote file: 213 kB

Transformer Mechanism/Transformer_Implementation/home/jovyan/work/W4A1/transformer.png ADDED Viewed

Git LFS Details

SHA256: 050073bfbe272db8fedf76dac0c6da69885dec397744b931a71e32930e9c0781
Pointer size: 131 Bytes
Size of remote file: 368 kB