Tjdharani/minGPT

Tjdharani committed on Jun 13, 2023

Commit

14f4c62

1 Parent(s): c4bceef

Upload minGPT.ipynb

Files changed (1)

minGPT.ipynb +1505 -0

minGPT.ipynb ADDED

	@@ -0,0 +1,1505 @@

+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": [],
+      "gpuType": "T4"
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    },
+    "accelerator": "GPU"
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Building GPT"
+      ],
+      "metadata": {
+        "id": "8FHnXpkTv_5f"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# We always start with a dataset to train on. Let's download the tiny shakespeare dataset\n",
+        "!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "YTPlvPQn-Zef",
+        "outputId": "45f9c50f-d2c6-4629-cabe-d1378e2882a7"
+      },
+      "execution_count": 1,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "--2023-06-13 07:55:40--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt\n",
+            "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n",
+            "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n",
+            "HTTP request sent, awaiting response... 200 OK\n",
+            "Length: 1115394 (1.1M) [text/plain]\n",
+            "Saving to: ‘input.txt’\n",
+            "\n",
+            "input.txt           100%[===================>]   1.06M  --.-KB/s    in 0.005s  \n",
+            "\n",
+            "2023-06-13 07:55:40 (199 MB/s) - ‘input.txt’ saved [1115394/1115394]\n",
+            "\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "with open('input.txt', 'r', encoding='utf-8') as f:\n",
+        "    text = f.read()"
+      ],
+      "metadata": {
+        "id": "mfIiqOSm-euI"
+      },
+      "execution_count": 2,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "print(\"length of dataset in characters:\", len(text))\n"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "4Qgkvnr0_N66",
+        "outputId": "6063f096-78b7-40c1-c830-531594a0bb1a"
+      },
+      "execution_count": 3,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "length of dataset in characters: 1115394\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# let's look at the first 1000 characters\n",
+        "print(text[:1000])"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "Qn9QIHwf_c-_",
+        "outputId": "4f4f837a-7b53-43fd-807e-42d16b0519c6"
+      },
+      "execution_count": 4,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "First Citizen:\n",
+            "Before we proceed any further, hear me speak.\n",
+            "\n",
+            "All:\n",
+            "Speak, speak.\n",
+            "\n",
+            "First Citizen:\n",
+            "You are all resolved rather to die than to famish?\n",
+            "\n",
+            "All:\n",
+            "Resolved. resolved.\n",
+            "\n",
+            "First Citizen:\n",
+            "First, you know Caius Marcius is chief enemy to the people.\n",
+            "\n",
+            "All:\n",
+            "We know't, we know't.\n",
+            "\n",
+            "First Citizen:\n",
+            "Let us kill him, and we'll have corn at our own price.\n",
+            "Is't a verdict?\n",
+            "\n",
+            "All:\n",
+            "No more talking on't; let it be done: away, away!\n",
+            "\n",
+            "Second Citizen:\n",
+            "One word, good citizens.\n",
+            "\n",
+            "First Citizen:\n",
+            "We are accounted poor citizens, the patricians good.\n",
+            "What authority surfeits on would relieve us: if they\n",
+            "would yield us but the superfluity, while it were\n",
+            "wholesome, we might guess they relieved us humanely;\n",
+            "but they think we are too dear: the leanness that\n",
+            "afflicts us, the object of our misery, is as an\n",
+            "inventory to particularise their abundance; our\n",
+            "sufferance is a gain to them Let us revenge this with\n",
+            "our pikes, ere we become rakes: for the gods know I\n",
+            "speak this in hunger for bread, not in thirst for revenge.\n",
+            "\n",
+            "\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# here are all the unique characters that occur in this text\n",
+        "chars = sorted(list(set(text)))\n",
+        "vocab_size = len(chars)\n",
+        "print(''.join(chars))\n",
+        "print(vocab_size)"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "JN8_xJFY_zvq",
+        "outputId": "d0ab20bb-c366-41af-9378-15ced2913126"
+      },
+      "execution_count": 5,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "\n",
+            " !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\n",
+            "65\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# create mappings from characters to integers and back\n",
+        "stoi = {ch: i for i, ch in enumerate(chars)}\n",
+        "itos = {i: ch for i, ch in enumerate(chars)}\n",
+        "encode = lambda s: [stoi[c] for c in s] # encoder: string -> list of integers\n",
+        "decode = lambda l: ''.join([itos[i] for i in l]) # decoder: list of integers -> string\n",
+        "\n",
+        "print(encode(\"hii there\"))\n",
+        "print(decode(encode(\"hii there\")))"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "X1lJF7-IAjz_",
+        "outputId": "18702fc0-b1c0-4675-b78a-e047a06f4887"
+      },
+      "execution_count": 6,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "[46, 47, 47, 1, 58, 46, 43, 56, 43]\n",
+            "hii there\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# let's now encode the entire text dataset and store it in a torch.Tensor\n",
+        "import torch # PyTorch\n",
+        "data = torch.tensor(encode(text), dtype=torch.long)\n",
+        "print(data.shape, data.dtype)\n",
+        "print(data[:1000])"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "ML1pjHfLCJ_M",
+        "outputId": "3f21fc94-ed1f-4bb5-b9db-0a1ad2e5b227"
+      },
+      "execution_count": 7,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "torch.Size([1115394]) torch.int64\n",
+            "tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,\n",
+            "        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,\n",
+            "         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,\n",
+            "        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,\n",
+            "         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,\n",
+            "        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,\n",
+            "         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,\n",
+            "        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,\n",
+            "        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,\n",
+            "         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,\n",
+            "         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,\n",
+            "        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,\n",
+            "        47, 59, 57,  1, 47, 57,  1, 41, 46, 47, 43, 44,  1, 43, 52, 43, 51, 63,\n",
+            "         1, 58, 53,  1, 58, 46, 43,  1, 54, 43, 53, 54, 50, 43,  8,  0,  0, 13,\n",
+            "        50, 50, 10,  0, 35, 43,  1, 49, 52, 53, 61,  5, 58,  6,  1, 61, 43,  1,\n",
+            "        49, 52, 53, 61,  5, 58,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47, 58,\n",
+            "        47, 64, 43, 52, 10,  0, 24, 43, 58,  1, 59, 57,  1, 49, 47, 50, 50,  1,\n",
+            "        46, 47, 51,  6,  1, 39, 52, 42,  1, 61, 43,  5, 50, 50,  1, 46, 39, 60,\n",
+            "        43,  1, 41, 53, 56, 52,  1, 39, 58,  1, 53, 59, 56,  1, 53, 61, 52,  1,\n",
+            "        54, 56, 47, 41, 43,  8,  0, 21, 57,  5, 58,  1, 39,  1, 60, 43, 56, 42,\n",
+            "        47, 41, 58, 12,  0,  0, 13, 50, 50, 10,  0, 26, 53,  1, 51, 53, 56, 43,\n",
+            "         1, 58, 39, 50, 49, 47, 52, 45,  1, 53, 52,  5, 58, 11,  1, 50, 43, 58,\n",
+            "         1, 47, 58,  1, 40, 43,  1, 42, 53, 52, 43, 10,  1, 39, 61, 39, 63,  6,\n",
+            "         1, 39, 61, 39, 63,  2,  0,  0, 31, 43, 41, 53, 52, 42,  1, 15, 47, 58,\n",
+            "        47, 64, 43, 52, 10,  0, 27, 52, 43,  1, 61, 53, 56, 42,  6,  1, 45, 53,\n",
+            "        53, 42,  1, 41, 47, 58, 47, 64, 43, 52, 57,  8,  0,  0, 18, 47, 56, 57,\n",
+            "        58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 35, 43,  1, 39, 56, 43,  1,\n",
+            "        39, 41, 41, 53, 59, 52, 58, 43, 42,  1, 54, 53, 53, 56,  1, 41, 47, 58,\n",
+            "        47, 64, 43, 52, 57,  6,  1, 58, 46, 43,  1, 54, 39, 58, 56, 47, 41, 47,\n",
+            "        39, 52, 57,  1, 45, 53, 53, 42,  8,  0, 35, 46, 39, 58,  1, 39, 59, 58,\n",
+            "        46, 53, 56, 47, 58, 63,  1, 57, 59, 56, 44, 43, 47, 58, 57,  1, 53, 52,\n",
+            "         1, 61, 53, 59, 50, 42,  1, 56, 43, 50, 47, 43, 60, 43,  1, 59, 57, 10,\n",
+            "         1, 47, 44,  1, 58, 46, 43, 63,  0, 61, 53, 59, 50, 42,  1, 63, 47, 43,\n",
+            "        50, 42,  1, 59, 57,  1, 40, 59, 58,  1, 58, 46, 43,  1, 57, 59, 54, 43,\n",
+            "        56, 44, 50, 59, 47, 58, 63,  6,  1, 61, 46, 47, 50, 43,  1, 47, 58,  1,\n",
+            "        61, 43, 56, 43,  0, 61, 46, 53, 50, 43, 57, 53, 51, 43,  6,  1, 61, 43,\n",
+            "         1, 51, 47, 45, 46, 58,  1, 45, 59, 43, 57, 57,  1, 58, 46, 43, 63,  1,\n",
+            "        56, 43, 50, 47, 43, 60, 43, 42,  1, 59, 57,  1, 46, 59, 51, 39, 52, 43,\n",
+            "        50, 63, 11,  0, 40, 59, 58,  1, 58, 46, 43, 63,  1, 58, 46, 47, 52, 49,\n",
+            "         1, 61, 43,  1, 39, 56, 43,  1, 58, 53, 53,  1, 42, 43, 39, 56, 10,  1,\n",
+            "        58, 46, 43,  1, 50, 43, 39, 52, 52, 43, 57, 57,  1, 58, 46, 39, 58,  0,\n",
+            "        39, 44, 44, 50, 47, 41, 58, 57,  1, 59, 57,  6,  1, 58, 46, 43,  1, 53,\n",
+            "        40, 48, 43, 41, 58,  1, 53, 44,  1, 53, 59, 56,  1, 51, 47, 57, 43, 56,\n",
+            "        63,  6,  1, 47, 57,  1, 39, 57,  1, 39, 52,  0, 47, 52, 60, 43, 52, 58,\n",
+            "        53, 56, 63,  1, 58, 53,  1, 54, 39, 56, 58, 47, 41, 59, 50, 39, 56, 47,\n",
+            "        57, 43,  1, 58, 46, 43, 47, 56,  1, 39, 40, 59, 52, 42, 39, 52, 41, 43,\n",
+            "        11,  1, 53, 59, 56,  0, 57, 59, 44, 44, 43, 56, 39, 52, 41, 43,  1, 47,\n",
+            "        57,  1, 39,  1, 45, 39, 47, 52,  1, 58, 53,  1, 58, 46, 43, 51,  1, 24,\n",
+            "        43, 58,  1, 59, 57,  1, 56, 43, 60, 43, 52, 45, 43,  1, 58, 46, 47, 57,\n",
+            "         1, 61, 47, 58, 46,  0, 53, 59, 56,  1, 54, 47, 49, 43, 57,  6,  1, 43,\n",
+            "        56, 43,  1, 61, 43,  1, 40, 43, 41, 53, 51, 43,  1, 56, 39, 49, 43, 57,\n",
+            "        10,  1, 44, 53, 56,  1, 58, 46, 43,  1, 45, 53, 42, 57,  1, 49, 52, 53,\n",
+            "        61,  1, 21,  0, 57, 54, 43, 39, 49,  1, 58, 46, 47, 57,  1, 47, 52,  1,\n",
+            "        46, 59, 52, 45, 43, 56,  1, 44, 53, 56,  1, 40, 56, 43, 39, 42,  6,  1,\n",
+            "        52, 53, 58,  1, 47, 52,  1, 58, 46, 47, 56, 57, 58,  1, 44, 53, 56,  1,\n",
+            "        56, 43, 60, 43, 52, 45, 43,  8,  0,  0])\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# split the data into train and validation sets\n",
+        "n = int(0.9*len(data)) # the first 90% is training data, the rest is validation\n",
+        "train_data = data[:n]\n",
+        "val_data = data[n:]"
+      ],
+      "metadata": {
+        "id": "F-6DyilNE7KM"
+      },
+      "execution_count": 8,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
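+        "block_size = 8 # maximum context length for predictions\n",
+        "train_data[:block_size+1] # block_size inputs plus one extra token, the final target"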
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "z79mbyx-GJC-",
+        "outputId": "b4b90aae-90f9-4f07-bbc0-2f726b0ff4d3"
+      },
+      "execution_count": 9,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 9
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "x = train_data[:block_size]\n",
+        "y = train_data[1:block_size+1]\n",
+        "for t in range(block_size):\n",
+        "    context = x[:t+1]\n",
+        "    target = y[t]\n",
+        "    print(f\"when input is {context} the target: {target}\")"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "5SQI_jZXGb7_",
+        "outputId": "52404a4a-91dd-4757-9c7e-c30a8a2eb2a3"
+      },
+      "execution_count": 10,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "when input is tensor([18]) the target: 47\n",
+            "when input is tensor([18, 47]) the target: 56\n",
+            "when input is tensor([18, 47, 56]) the target: 57\n",
+            "when input is tensor([18, 47, 56, 57]) the target: 58\n",
+            "when input is tensor([18, 47, 56, 57, 58]) the target: 1\n",
+            "when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15\n",
+            "when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47\n",
+            "when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "torch.manual_seed(1337)\n",
+        "batch_size = 4 # number of independent sequences processed in parallel\n",
+        "block_size = 8 # maximum context length for predictions\n",
+        "\n",
+        "def get_batch(split):\n",
+        "  # generate a small batch of data of inputs x and targets y\n",
+        "  data = train_data if split == 'train' else val_data\n",
+        "  ix = torch.randint(len(data) - block_size, (batch_size,))\n",
+        "  x = torch.stack([data[i:i+block_size] for i in ix])\n",
+        "  y = torch.stack([data[i+1:i+block_size+1] for i in ix])\n",
+        "  return x, y\n",
+        "\n",
+        "xb, yb = get_batch('train')\n",
+        "print('inputs:')\n",
+        "print(xb.shape)\n",
+        "print(xb)\n",
+        "print('targets:')\n",
+        "print(yb.shape)\n",
+        "print(yb)\n",
+        "\n",
+        "print('----')\n",
+        "\n",
+        "for b in range(batch_size): # batch dimension\n",
+        "    for t in range(block_size): # time dimension\n",
+        "        context = xb[b, :t+1]\n",
+        "        target = yb[b,t]\n",
+        "        print(f\"when input is {context.tolist()} the target: {target}\")"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "IAjhF0PTI1HF",
+        "outputId": "245c0f68-9502-4633-d365-e411176a5a14"
+      },
+      "execution_count": 11,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "inputs:\n",
+            "torch.Size([4, 8])\n",
+            "tensor([[24, 43, 58,  5, 57,  1, 46, 43],\n",
+            "        [44, 53, 56,  1, 58, 46, 39, 58],\n",
+            "        [52, 58,  1, 58, 46, 39, 58,  1],\n",
+            "        [25, 17, 27, 10,  0, 21,  1, 54]])\n",
+            "targets:\n",
+            "torch.Size([4, 8])\n",
+            "tensor([[43, 58,  5, 57,  1, 46, 43, 39],\n",
+            "        [53, 56,  1, 58, 46, 39, 58,  1],\n",
+            "        [58,  1, 58, 46, 39, 58,  1, 46],\n",
+            "        [17, 27, 10,  0, 21,  1, 54, 39]])\n",
+            "----\n",
+            "when input is [24] the target: 43\n",
+            "when input is [24, 43] the target: 58\n",
+            "when input is [24, 43, 58] the target: 5\n",
+            "when input is [24, 43, 58, 5] the target: 57\n",
+            "when input is [24, 43, 58, 5, 57] the target: 1\n",
+            "when input is [24, 43, 58, 5, 57, 1] the target: 46\n",
+            "when input is [24, 43, 58, 5, 57, 1, 46] the target: 43\n",
+            "when input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39\n",
+            "when input is [44] the target: 53\n",
+            "when input is [44, 53] the target: 56\n",
+            "when input is [44, 53, 56] the target: 1\n",
+            "when input is [44, 53, 56, 1] the target: 58\n",
+            "when input is [44, 53, 56, 1, 58] the target: 46\n",
+            "when input is [44, 53, 56, 1, 58, 46] the target: 39\n",
+            "when input is [44, 53, 56, 1, 58, 46, 39] the target: 58\n",
+            "when input is [44, 53, 56, 1, 58, 46, 39, 58] the target: 1\n",
+            "when input is [52] the target: 58\n",
+            "when input is [52, 58] the target: 1\n",
+            "when input is [52, 58, 1] the target: 58\n",
+            "when input is [52, 58, 1, 58] the target: 46\n",
+            "when input is [52, 58, 1, 58, 46] the target: 39\n",
+            "when input is [52, 58, 1, 58, 46, 39] the target: 58\n",
+            "when input is [52, 58, 1, 58, 46, 39, 58] the target: 1\n",
+            "when input is [52, 58, 1, 58, 46, 39, 58, 1] the target: 46\n",
+            "when input is [25] the target: 17\n",
+            "when input is [25, 17] the target: 27\n",
+            "when input is [25, 17, 27] the target: 10\n",
+            "when input is [25, 17, 27, 10] the target: 0\n",
+            "when input is [25, 17, 27, 10, 0] the target: 21\n",
+            "when input is [25, 17, 27, 10, 0, 21] the target: 1\n",
+            "when input is [25, 17, 27, 10, 0, 21, 1] the target: 54\n",
+            "when input is [25, 17, 27, 10, 0, 21, 1, 54] the target: 39\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "print(xb) # our input to the transformer"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "Sy2A0cbXM1Bd",
+        "outputId": "ba015f11-ee15-435e-b88a-2ad4164d7abe"
+      },
+      "execution_count": 12,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "tensor([[24, 43, 58,  5, 57,  1, 46, 43],\n",
+            "        [44, 53, 56,  1, 58, 46, 39, 58],\n",
+            "        [52, 58,  1, 58, 46, 39, 58,  1],\n",
+            "        [25, 17, 27, 10,  0, 21,  1, 54]])\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import torch\n",
+        "import torch.nn as nn\n",
+        "from torch.nn import functional as F\n",
+        "torch.manual_seed(1337)\n",
+        "\n",
+        "class BigramLanguageModel(nn.Module):\n",
+        "\n",
+        "    def __init__(self, vocab_size):\n",
+        "        super().__init__()\n",
+        "        # each token directly reads off the logits for the next token from a lookup table\n",
+        "        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)\n",
+        "\n",
+        "    def forward(self, idx, targets=None):\n",
+        "\n",
+        "        # idx and targets are both (B,T) tensor of integers\n",
+        "        logits = self.token_embedding_table(idx) # (B,T,C)\n",
+        "        \n",
+        "        if targets is None:\n",
+        "            loss = None\n",
+        "        else:\n",
+        "            B, T, C = logits.shape\n",
+        "            # flatten batch and time so F.cross_entropy sees (N, C) logits and (N,) targets\n",
+        "            logits = logits.view(B*T, C)\n",
+        "            targets = targets.view(B*T)\n",
+        "            loss = F.cross_entropy(logits, targets)\n",
+        "\n",
+        "        return logits, loss\n",
+        "    \n",
+        "    def generate(self, idx, max_new_tokens):\n",
+        "        # idx is (B, T) array of indices in the current context\n",
+        "        for _ in range(max_new_tokens):\n",
+        "            # get the predictions\n",
+        "            logits, loss = self(idx)\n",
+        "            # focus only on the last time step\n",
+        "            logits = logits[:, -1, :] # becomes (B, C)\n",
+        "            # apply softmax to get probabilities\n",
+        "            probs = F.softmax(logits, dim=-1) # (B, C)\n",
+        "            # sample from the distribution\n",
+        "            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)\n",
+        "            # append sampled index to the running sequence\n",
+        "            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)\n",
+        "        return idx\n",
+        "\n",
+        "m = BigramLanguageModel(vocab_size)\n",
+        "logits, loss = m(xb, yb)\n",
+        "print(logits.shape)\n",
+        "print(loss)\n",
+        "\n",
+        "print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))\n"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "JadlSYPFfV5i",
+        "outputId": "48885ec2-7337-4d9b-8931-9db5b06ff04a"
+      },
+      "execution_count": 13,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "torch.Size([32, 65])\n",
+            "tensor(4.8786, grad_fn=<NllLossBackward0>)\n",
+            "\n",
+            "Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# create a PyTorch optimizer\n",
+        "optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)"
+      ],
+      "metadata": {
+        "id": "kC6Sf0DkfZEs"
+      },
+      "execution_count": 14,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
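+        "batch_size = 32\n",
+        "for steps in range(100): # 100 steps barely trains the model; increase for better results\n",
+        "\n",
+        "  # sample a batch of data\n",
+        "  xb, yb = get_batch('train')\n",
+        "\n",
+        "  # evaluate the loss, then backpropagate and update the parameters\n",
+        "  logits, loss = m(xb, yb)\n",
+        "  optimizer.zero_grad(set_to_none=True)\n",
+        "  loss.backward()\n",
+        "  optimizer.step()\n",
+        "\n",
+        "print(loss.item())"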
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "eAdiWhq8mq0v",
+        "outputId": "2210d81b-5438-4e35-9336-5f30567de53d"
+      },
+      "execution_count": 15,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "4.587916374206543\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "print(decode(m.generate(idx = torch.zeros((1, 1), dtype = torch.long), max_new_tokens=500)[0].tolist()))"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "9I0z9v9NnVcW",
+        "outputId": "07133374-3061-41e3-9e0e-77ba644c3c94"
+      },
+      "execution_count": 16,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "\n",
+            "xiKi-RJ:CgqVuUa!U?qMH.uk!sCuMXvv!CJFfx;LgRyJknOEti.?I&-gPlLyulId?XlaInQ'q,lT$\n",
+            "3Q&sGlvHQ?mqSq-eON\n",
+            "x?SP fUAfCAuCX:bOlgiRQWN:Mphaw\n",
+            "tRLKuYXEaAXxrcq-gCUzeh3w!AcyaylgYWjmJM?Uzw:inaY,:C&OECW:vmGGJAn3onAuMgia!ms$Vb q-gCOcPcUhOnxJGUGSPJWT:.?ujmJFoiNL&A'DxY,prZ?qdT;hoo'dHooXXlxf'WkHK&u3Q?rqUi.kz;?Yx?C&u3Qbfzxlyh'Vl:zyxjKXgC?\n",
+            "lv'QKFiBeviNxO'm!Upm$srm&TqViqiBD3HBP!juEOpmZJyF$Fwfy!PlvWPFC\n",
+            "&WDdP!Ko,px\n",
+            "x\n",
+            "tREOE;AJ.BeXkylOVD3KHp$e?nD,.SFbWWI'ubcL!q-tU;aXmJ&uGXHxJXI&Z!gHRpajj;l.\n",
+            "pTErIBjx;JKIgoCnLGXrJSP!AU-AcbczR?\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Mathematical trick in self-attention"
+      ],
+      "metadata": {
+        "id": "JPRFdk7pn7Xz"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# toy example: matrix multiplication as a weighted aggregation\n",
+        "torch.manual_seed(42)\n",
+        "a = torch.tril(torch.ones(3, 3))\n",
+        "a = a / torch.sum(a, 1, keepdim=True)\n",
+        "b = torch.randint(0,10,(3,2)).float()\n",
+        "c = a @ b\n",
+        "print('a=')\n",
+        "print(a)\n",
+        "print('--')\n",
+        "print('b=')\n",
+        "print(b)\n",
+        "print('--')\n",
+        "print('c=')\n",
+        "print(c)\n"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "z-XvQJi_u0HL",
+        "outputId": "486bcbac-c42e-494c-e9a0-341779370076"
+      },
+      "execution_count": 17,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "a=\n",
+            "tensor([[1.0000, 0.0000, 0.0000],\n",
+            "        [0.5000, 0.5000, 0.0000],\n",
+            "        [0.3333, 0.3333, 0.3333]])\n",
+            "--\n",
+            "b=\n",
+            "tensor([[2., 7.],\n",
+            "        [6., 4.],\n",
+            "        [6., 5.]])\n",
+            "--\n",
+            "c=\n",
+            "tensor([[2.0000, 7.0000],\n",
+            "        [4.0000, 5.5000],\n",
+            "        [4.6667, 5.3333]])\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "torch.manual_seed(1337)\n",
+        "B,T,C = 4,8,2 # BATCH, TIME, CHANNELS\n",
+        "x = torch.randn(B,T,C)\n",
+        "x.shape"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "8zInghO3v5yg",
+        "outputId": "4f7a38e9-05a2-494b-eda1-2d8ca136fe03"
+      },
+      "execution_count": 18,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "torch.Size([4, 8, 2])"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 18
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# version 1: explicit loops, xbow[b,t] = mean of x[b,i] over i <= t\n",
+        "xbow = torch.zeros((B,T,C))\n",
+        "for b in range(B):\n",
+        "  for t in range(T):\n",
+        "    xprev = x[b, :t+1] # (t+1, C)\n",
+        "    xbow[b,t] = torch.mean(xprev, 0)"
+      ],
+      "metadata": {
+        "id": "kM4Az6f3xXwz"
+      },
+      "execution_count": 19,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
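+        "# version 2: matrix multiply with a row-normalized lower-triangular matrix\n",
+        "wei = torch.tril(torch.ones(T, T))\n",
+        "wei = wei / wei.sum(1, keepdim=True)\n",
+        "xbow2 = wei @ x # (T, T) @ (B, T, C) broadcasts to (B, T, C)\n",
+        "torch.allclose(xbow, xbow2)"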
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "j6mzu409x9qt",
+        "outputId": "16c8abd7-5e22-4c7e-b2e4-fc53041411d2"
+      },
+      "execution_count": 20,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "True"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 20
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# version 3: use softmax\n",
+        "tril = torch.tril(torch.ones(T, T))\n",
+        "wei = torch.zeros((T, T))\n",
+        "# future positions get -inf, so softmax assigns them zero weight\n",
+        "wei = wei.masked_fill(tril == 0, float('-inf'))\n",
+        "wei = F.softmax(wei, dim=-1)\n",
+        "xbow3 = wei @ x\n",
+        "torch.allclose(xbow, xbow3)"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "Ez5cxjXjyeyA",
+        "outputId": "8cf70b82-93bb-4b9a-c29c-50342c99ca0b"
+      },
+      "execution_count": 22,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "True"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 22
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Self-attention!\n",
+        "torch.manual_seed(1337)\n",
+        "B,T,C = 4,8,32 # batch, time, channels\n",
+        "x = torch.randn(B,T,C)\n",
+        "\n",
+        "# a single head performs self-attention\n",
+        "head_size = 16\n",
+        "key = nn.Linear(C, head_size, bias=False)\n",
+        "query = nn.Linear(C, head_size, bias=False)\n",
+        "value = nn.Linear(C, head_size, bias=False)\n",
+        "k = key(x) # (B, T, head_size)\n",
+        "q = query(x) # (B, T, head_size)\n",
+        "wei = q @ k.transpose(-2, -1) # (B, T, head_size) @ (B, head_size, T) -> (B, T, T)\n",
+        "\n",
+        "tril = torch.tril(torch.ones(T, T))\n",
+        "wei = wei.masked_fill(tril == 0, float('-inf'))\n",
+        "wei = F.softmax(wei, dim=-1)\n",
+        "\n",
+        "v = value(x) # (B, T, head_size)\n",
+        "out = wei @ v # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)\n",
+        "\n",
+        "out.shape"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "d4fbZKO_zJlE",
+        "outputId": "61bfb573-3b08-4e83-aed1-cdb4be76ead8"
+      },
+      "execution_count": 23,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "torch.Size([4, 8, 16])"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 23
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "wei[0]"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "5mUg8q-D1xJ3",
+        "outputId": "24f9aa45-1d20-4bc6-8efb-af5f5fb9899c"
+      },
+      "execution_count": 24,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],\n",
+              "        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],\n",
+              "        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],\n",
+              "        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],\n",
+              "        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],\n",
+              "        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],\n",
+              "        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],\n",
+              "        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],\n",
+              "       grad_fn=<SelectBackward0>)"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 24
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
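+        "# with unit-variance q and k, scaling by 1/sqrt(head_size) keeps the variance of wei close to 1\n",
+        "k = torch.randn(B,T,head_size)\n",
+        "q = torch.randn(B,T,head_size)\n",
+        "wei = q @ k.transpose(-2, -1) * head_size**-0.5"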
+      ],
+      "metadata": {
+        "id": "L6Hz65jN11C5"
+      },
+      "execution_count": 25,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "k.var()"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "opow74Yg82UN",
+        "outputId": "7937ca44-b52d-4373-ae58-d0c1ed450fa7"
+      },
+      "execution_count": 26,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "tensor(1.0449)"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 26
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "q.var()"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "jEGJMlZh86lD",
+        "outputId": "c093ea15-9db4-408b-8898-0192748f8ab2"
+      },
+      "execution_count": 27,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "tensor(1.0700)"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 27
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "wei.var()"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "37djNLHJ88Gh",
+        "outputId": "a3ba1d4b-bca5-41a2-afa5-f135056b80ba"
+      },
+      "execution_count": 28,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "tensor(1.0918)"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 28
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "3NK1li0w89wx",
+        "outputId": "4205b108-d666-4add-dd3e-48da20a6e351"
+      },
+      "execution_count": 29,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 29
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]) * 8, dim=-1)"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "-3UqDMG79QLI",
+        "outputId": "61674514-3887-43a4-93aa-055dfcd61b76"
+      },
+      "execution_count": 30,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 30
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "class LayerNorm1d: # adapted from a BatchNorm1d implementation\n",
+        "  \n",
+        "  def __init__(self, dim, eps=1e-5):\n",
+        "    self.eps = eps\n",
+        "    self.gamma = torch.ones(dim)\n",
+        "    self.beta = torch.zeros(dim)\n",
+        "  \n",
+        "  def __call__(self, x):\n",
+        "    # calculate the forward pass\n",
+        "    xmean = x.mean(1, keepdim=True) # per-example mean over the feature dimension\n",
+        "    xvar = x.var(1, keepdim=True) # per-example variance over the feature dimension\n",
+        "    xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to zero mean, unit variance\n",
+        "    self.out = self.gamma * xhat + self.beta\n",
+        "    return self.out\n",
+        "  \n",
+        "  def parameters(self):\n",
+        "    return [self.gamma, self.beta]\n",
+        "\n",
+        "torch.manual_seed(1337)\n",
+        "module = LayerNorm1d(100)\n",
+        "x = torch.randn(32, 100) # batch size 32 of 100-dimensional vectors\n",
+        "x = module(x)\n",
+        "x.shape"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "a_572UNcChia",
+        "outputId": "87012d0d-81cd-4841-a4e8-48bf9c0e2e61"
+      },
+      "execution_count": 32,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "torch.Size([32, 100])"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 32
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "x[:, 0].mean(), x[:,0].std()"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "LHfhDFW1Coel",
+        "outputId": "7eff9314-f287-4566-aa4d-7d9082bff11b"
+      },
+      "execution_count": 33,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "(tensor(0.1469), tensor(0.8803))"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 33
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "x[0,:].mean(), x[0,:].std()"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "bt7xbja2FOu-",
+        "outputId": "8f1cbfe0-7862-4ba0-bd54-7149a78b7153"
+      },
+      "execution_count": 34,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "(tensor(-9.5367e-09), tensor(1.0000))"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 34
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import torch\n",
+        "import torch.nn as nn\n",
+        "from torch.nn import functional as F\n",
+        "\n",
+        "# hyperparameters\n",
+        "batch_size = 16 # how many independent sequences will we process in parallel?\n",
+        "block_size = 32 # what is the maximum context length for predictions?\n",
+        "max_iters = 5000\n",
+        "eval_interval = 100\n",
+        "learning_rate = 1e-3\n",
+        "device = 'cuda' if torch.cuda.is_available() else 'cpu'\n",
+        "eval_iters = 200\n",
+        "n_embd = 64\n",
+        "n_head = 4\n",
+        "n_layer = 4\n",
+        "dropout = 0.0\n",
+        "\n",
+        "torch.manual_seed(1337)\n",
+        "\n",
+        "# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt\n",
+        "with open('input.txt', 'r', encoding='utf-8') as f:\n",
+        "    text = f.read()\n",
+        "\n",
+        "# here are all the unique characters that occur in this text\n",
+        "chars = sorted(list(set(text)))\n",
+        "vocab_size = len(chars)\n",
+        "# create a mapping from characters to integers\n",
+        "stoi = { ch:i for i,ch in enumerate(chars) }\n",
+        "itos = { i:ch for i,ch in enumerate(chars) }\n",
+        "encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers\n",
+        "decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string\n",
+        "\n",
+        "# Train and validation splits\n",
+        "data = torch.tensor(encode(text), dtype=torch.long)\n",
+        "n = int(0.9*len(data)) # first 90% will be train, rest val\n",
+        "train_data = data[:n]\n",
+        "val_data = data[n:]\n",
+        "\n",
+        "# data loading\n",
+        "def get_batch(split):\n",
+        "    # generate a small batch of data of inputs x and targets y\n",
+        "    data = train_data if split == 'train' else val_data\n",
+        "    ix = torch.randint(len(data) - block_size, (batch_size,))\n",
+        "    x = torch.stack([data[i:i+block_size] for i in ix])\n",
+        "    y = torch.stack([data[i+1:i+block_size+1] for i in ix])\n",
+        "    x, y = x.to(device), y.to(device)\n",
+        "    return x, y\n",
+        "\n",
+        "@torch.no_grad()\n",
+        "def estimate_loss():\n",
+        "    out = {}\n",
+        "    model.eval()\n",
+        "    for split in ['train', 'val']:\n",
+        "        losses = torch.zeros(eval_iters)\n",
+        "        for k in range(eval_iters):\n",
+        "            X, Y = get_batch(split)\n",
+        "            logits, loss = model(X, Y)\n",
+        "            losses[k] = loss.item()\n",
+        "        out[split] = losses.mean()\n",
+        "    model.train()\n",
+        "    return out\n",
+        "\n",
+        "class Head(nn.Module):\n",
+        "    \"\"\" one head of self-attention \"\"\"\n",
+        "\n",
+        "    def __init__(self, head_size):\n",
+        "        super().__init__()\n",
+        "        self.key = nn.Linear(n_embd, head_size, bias=False)\n",
+        "        self.query = nn.Linear(n_embd, head_size, bias=False)\n",
+        "        self.value = nn.Linear(n_embd, head_size, bias=False)\n",
+        "        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))\n",
+        "\n",
+        "        self.dropout = nn.Dropout(dropout)\n",
+        "\n",
+        "    def forward(self, x):\n",
+        "        B,T,C = x.shape\n",
+        "        k = self.key(x)   # (B,T,hs), where hs = head_size\n",
+        "        q = self.query(x) # (B,T,hs)\n",
+        "        # compute attention scores (\"affinities\"), scaled by 1/sqrt(head_size)\n",
+        "        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)\n",
+        "        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)\n",
+        "        wei = F.softmax(wei, dim=-1) # (B, T, T)\n",
+        "        wei = self.dropout(wei)\n",
+        "        # perform the weighted aggregation of the values\n",
+        "        v = self.value(x) # (B,T,hs)\n",
+        "        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)\n",
+        "        return out\n",
+        "\n",
+        "class MultiHeadAttention(nn.Module):\n",
+        "    \"\"\" multiple heads of self-attention in parallel \"\"\"\n",
+        "\n",
+        "    def __init__(self, num_heads, head_size):\n",
+        "        super().__init__()\n",
+        "        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])\n",
+        "        self.proj = nn.Linear(n_embd, n_embd)\n",
+        "        self.dropout = nn.Dropout(dropout)\n",
+        "\n",
+        "    def forward(self, x):\n",
+        "        out = torch.cat([h(x) for h in self.heads], dim=-1)\n",
+        "        out = self.dropout(self.proj(out))\n",
+        "        return out\n",
+        "\n",
+        "class FeedForward(nn.Module):\n",
+        "    \"\"\" two linear layers with a ReLU non-linearity in between \"\"\"\n",
+        "\n",
+        "    def __init__(self, n_embd):\n",
+        "        super().__init__()\n",
+        "        self.net = nn.Sequential(\n",
+        "            nn.Linear(n_embd, 4 * n_embd),\n",
+        "            nn.ReLU(),\n",
+        "            nn.Linear(4 * n_embd, n_embd),\n",
+        "            nn.Dropout(dropout),\n",
+        "        )\n",
+        "\n",
+        "    def forward(self, x):\n",
+        "        return self.net(x)\n",
+        "\n",
+        "class Block(nn.Module):\n",
+        "    \"\"\" Transformer block: communication followed by computation \"\"\"\n",
+        "\n",
+        "    def __init__(self, n_embd, n_head):\n",
+        "        # n_embd: embedding dimension, n_head: the number of heads we'd like\n",
+        "        super().__init__()\n",
+        "        head_size = n_embd // n_head\n",
+        "        self.sa = MultiHeadAttention(n_head, head_size)\n",
+        "        self.ffwd = FeedForward(n_embd)\n",
+        "        self.ln1 = nn.LayerNorm(n_embd)\n",
+        "        self.ln2 = nn.LayerNorm(n_embd)\n",
+        "\n",
+        "    def forward(self, x):\n",
+        "        x = x + self.sa(self.ln1(x))\n",
+        "        x = x + self.ffwd(self.ln2(x))\n",
+        "        return x\n",
+        "\n",
+        "# GPT-style decoder-only language model (the class keeps its earlier bigram name)\n",
+        "class BigramLanguageModel(nn.Module):\n",
+        "\n",
+        "    def __init__(self):\n",
+        "        super().__init__()\n",
+        "        # token and position embeddings feed a stack of Transformer blocks, then a final linear head\n",
+        "        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)\n",
+        "        self.position_embedding_table = nn.Embedding(block_size, n_embd)\n",
+        "        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])\n",
+        "        self.ln_f = nn.LayerNorm(n_embd) # final layer norm\n",
+        "        self.lm_head = nn.Linear(n_embd, vocab_size)\n",
+        "\n",
+        "    def forward(self, idx, targets=None):\n",
+        "        B, T = idx.shape\n",
+        "\n",
+        "        # idx and targets are both (B,T) tensor of integers\n",
+        "        tok_emb = self.token_embedding_table(idx) # (B,T,C)\n",
+        "        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T,C)\n",
+        "        x = tok_emb + pos_emb # (B,T,C)\n",
+        "        x = self.blocks(x) # (B,T,C)\n",
+        "        x = self.ln_f(x) # (B,T,C)\n",
+        "        logits = self.lm_head(x) # (B,T,vocab_size)\n",
+        "\n",
+        "        if targets is None:\n",
+        "            loss = None\n",
+        "        else:\n",
+        "            B, T, C = logits.shape\n",
+        "            logits = logits.view(B*T, C)\n",
+        "            targets = targets.view(B*T)\n",
+        "            loss = F.cross_entropy(logits, targets)\n",
+        "\n",
+        "        return logits, loss\n",
+        "\n",
+        "    def generate(self, idx, max_new_tokens):\n",
+        "        # idx is (B, T) array of indices in the current context\n",
+        "        for _ in range(max_new_tokens):\n",
+        "            # crop idx to the last block_size tokens\n",
+        "            idx_cond = idx[:, -block_size:]\n",
+        "            # get the predictions\n",
+        "            logits, loss = self(idx_cond)\n",
+        "            # focus only on the last time step\n",
+        "            logits = logits[:, -1, :] # becomes (B, C)\n",
+        "            # apply softmax to get probabilities\n",
+        "            probs = F.softmax(logits, dim=-1) # (B, C)\n",
+        "            # sample from the distribution\n",
+        "            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)\n",
+        "            # append sampled index to the running sequence\n",
+        "            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)\n",
+        "        return idx\n",
+        "\n",
+        "model = BigramLanguageModel()\n",
+        "m = model.to(device)\n",
+        "# print the number of parameters in the model\n",
+        "print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')\n",
+        "\n",
+        "# create a PyTorch optimizer\n",
+        "optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)\n",
+        "\n",
+        "for step in range(max_iters):\n",
+        "\n",
+        "    # every once in a while evaluate the loss on train and val sets\n",
+        "    if step % eval_interval == 0 or step == max_iters - 1:\n",
+        "        losses = estimate_loss()\n",
+        "        print(f\"step {step}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}\")\n",
+        "\n",
+        "    # sample a batch of data\n",
+        "    xb, yb = get_batch('train')\n",
+        "\n",
+        "    # evaluate the loss\n",
+        "    logits, loss = model(xb, yb)\n",
+        "    optimizer.zero_grad(set_to_none=True)\n",
+        "    loss.backward()\n",
+        "    optimizer.step()\n",
+        "\n",
+        "# generate from the model\n",
+        "context = torch.zeros((1, 1), dtype=torch.long, device=device)\n",
+        "print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "WYnRTqPbFXHy",
+        "outputId": "d625a959-7490-4a84-e692-600da91e0ef9"
+      },
+      "execution_count": 35,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "0.209729 M parameters\n",
+            "step 0: train loss 4.4116, val loss 4.4022\n",
+            "step 100: train loss 2.6568, val loss 2.6670\n",
+            "step 200: train loss 2.5090, val loss 2.5059\n",
+            "step 300: train loss 2.4196, val loss 2.4338\n",
+            "step 400: train loss 2.3503, val loss 2.3565\n",
+            "step 500: train loss 2.2966, val loss 2.3129\n",
+            "step 600: train loss 2.2410, val loss 2.2500\n",
+            "step 700: train loss 2.2051, val loss 2.2191\n",
+            "step 800: train loss 2.1640, val loss 2.1874\n",
+            "step 900: train loss 2.1251, val loss 2.1515\n",
+            "step 1000: train loss 2.1023, val loss 2.1291\n",
+            "step 1100: train loss 2.0699, val loss 2.1192\n",
+            "step 1200: train loss 2.0375, val loss 2.0797\n",
+            "step 1300: train loss 2.0259, val loss 2.0647\n",
+            "step 1400: train loss 1.9924, val loss 2.0362\n",
+            "step 1500: train loss 1.9700, val loss 2.0304\n",
+            "step 1600: train loss 1.9631, val loss 2.0476\n",
+            "step 1700: train loss 1.9412, val loss 2.0131\n",
+            "step 1800: train loss 1.9097, val loss 1.9960\n",
+            "step 1900: train loss 1.9101, val loss 1.9882\n",
+            "step 2000: train loss 1.8867, val loss 1.9976\n",
+            "step 2100: train loss 1.8720, val loss 1.9754\n",
+            "step 2200: train loss 1.8588, val loss 1.9606\n",
+            "step 2300: train loss 1.8542, val loss 1.9525\n",
+            "step 2400: train loss 1.8424, val loss 1.9464\n",
+            "step 2500: train loss 1.8173, val loss 1.9455\n",
+            "step 2600: train loss 1.8256, val loss 1.9388\n",
+            "step 2700: train loss 1.8116, val loss 1.9350\n",
+            "step 2800: train loss 1.8056, val loss 1.9214\n",
+            "step 2900: train loss 1.8040, val loss 1.9300\n",
+            "step 3000: train loss 1.7974, val loss 1.9205\n",
+            "step 3100: train loss 1.7694, val loss 1.9157\n",
+            "step 3200: train loss 1.7539, val loss 1.9115\n",
+            "step 3300: train loss 1.7571, val loss 1.9071\n",
+            "step 3400: train loss 1.7531, val loss 1.8954\n",
+            "step 3500: train loss 1.7368, val loss 1.8918\n",
+            "step 3600: train loss 1.7274, val loss 1.8884\n",
+            "step 3700: train loss 1.7301, val loss 1.8819\n",
+            "step 3800: train loss 1.7210, val loss 1.8938\n",
+            "step 3900: train loss 1.7260, val loss 1.8750\n",
+            "step 4000: train loss 1.7122, val loss 1.8554\n",
+            "step 4100: train loss 1.7129, val loss 1.8717\n",
+            "step 4200: train loss 1.7041, val loss 1.8634\n",
+            "step 4300: train loss 1.6986, val loss 1.8434\n",
+            "step 4400: train loss 1.7052, val loss 1.8605\n",
+            "step 4500: train loss 1.6881, val loss 1.8467\n",
+            "step 4600: train loss 1.6849, val loss 1.8318\n",
+            "step 4700: train loss 1.6833, val loss 1.8449\n",
+            "step 4800: train loss 1.6686, val loss 1.8472\n",
+            "step 4900: train loss 1.6719, val loss 1.8425\n",
+            "step 4999: train loss 1.6619, val loss 1.8215\n",
+            "\n",
+            "And they bride will to lay be madie;\n",
+            "Thou but take O-dam the change:\n",
+            "Warth full him tother dilth ane away, my fears,\n",
+            "You have was them of is heart mile,\n",
+            "You, and if ensmy contlatist, drov the does me now that\n",
+            "just, lesing that.\n",
+            "His my now, you up; and the tyby love.\n",
+            "In Bodiet, and whom\n",
+            "that demperakenous, so what evily well my\n",
+            "Murtus censurence of him the reshep and thrust for to imper my monte in Mont,\n",
+            "To fight? gry of thy hourb! stiddy as\n",
+            "ards bearing her broint must are no Runnts\n",
+            "Infortuce will me not be arm.\n",
+            "You contrantymes have myse.-\n",
+            "And fortwerle madam them may in son, live body.\n",
+            "\n",
+            "Think you:\n",
+            "It stay might. \n",
+            "CLAMENCE:\n",
+            "My whilesse everew in movet, if Cassce of's counted;\n",
+            "How what make you fear tals: the gold my sun?\n",
+            "What, loudy forgor man our him.\n",
+            "I will were but with some. Povinly Ford the welcont.\n",
+            "\n",
+            "QUEEN FIDILIZ:\n",
+            "No?\n",
+            "Their him the not.\n",
+            "\n",
+            "POLIXENENE:\n",
+            "But to me, God no now the summe wip.\n",
+            "\n",
+            "GROMPEO:\n",
+            "Conguit, bruke this belike, on so han the bodiet.\n",
+            "\n",
+            "CORIOLANUS:\n",
+            "Till the;\n",
+            "you wellseers I am with you,\n",
+            "For I hust no where Mustconce, do wind that I am nobly.\n",
+            "\n",
+            "BRUSTHORD:\n",
+            "O, wenterings so me worting.\n",
+            "\n",
+            "GRUMIO:\n",
+            "O thus favour now,\n",
+            "An bear was all beenIn\n",
+            "Before and to the sever--and.\n",
+            "In to dot me, to liberfeleing breamn'd my have\n",
+            "epince, if that jutcey's leve,\n",
+            "That Tumselfly there's little ofjess the vown;\n",
+            "Maughter armied maste love in stide belothy dong'd the not.\n",
+            "\n",
+            "BENVOLIO:\n",
+            "Well cavonzy to I have must aboe;\n",
+            "I now, I thinke numt om Three teny, delelige,\n",
+            "And yet our son one old, we\n",
+            "ell sment on you; and plock, say, as If have to kavidess corby?\n",
+            "Then eteep; upose worth\n",
+            "But arm one wall preven him there.\n",
+            "\n",
+            "BUCKINGHARD\n",
+            "\n",
+            "IVIRHAMIUS:\n",
+            "Why, unere to-marrow thy sathe court his in on\n",
+            "some no, God the have blay not, these wife it:\n",
+            "The that hear I, thou with art, lives?\n",
+            "\n",
+            "LARY:\n",
+            "Our while with you\n",
+            "That I horrtw'd will theirs is.\n",
+            "Why, I would I drue, and was father,--\n",
+            "'Tensis, thy promb, many and sentry talbatt.\n",
+            "\n",
+            "PORDINCE:\n",
+            "Why Riparding:\n",
+            "In is shown's fortunds, but whom the brike our all\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [],
+      "metadata": {
+        "id": "i8lCFzYGMkBk"
+      },
+      "execution_count": null,
+      "outputs": []
+    }
+  ]
+}