sanjudebnath
/

Numini

Question Answering

sanjudebnath committed on Feb 16, 2025

Commit

a966e1f

verified ·

1 Parent(s): 22f3dba

Delete question_answering.ipynb

Files changed (1)

question_answering.ipynb +0 -2403

question_answering.ipynb DELETED
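
The deleted notebook reports Mean EM and Mean F-1 scores, but the `eval_test_set` helper it imports from `util` is not part of this diff. As a rough sketch, assuming the scores follow the usual SQuAD-style definitions (an assumption, not something this commit confirms), per-answer exact match and token-level F1 could be computed as below. The function names `normalize_answer`, `exact_match` and `f1_score` are hypothetical and chosen only for illustration.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    # If either side is empty, the score is 1.0 only when both are empty.
    if not pred_tokens or not truth_tokens:
        return float(pred_tokens == truth_tokens)
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)
```

Averaging these per-answer values over a test set would give mean scores of the kind listed in the notebook's results, provided the notebook used the same normalization.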

@@ -1,2403 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "id": "19817716",
-   "metadata": {},
-   "source": [
-    "# Question Answering\n",
-    "This notebook compares several question answering models. It first introduces a representation for the dataset and the corresponding DataLoader, then trains and evaluates the different models."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 50,
-   "id": "49bf46c6",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from transformers import DistilBertModel, DistilBertForMaskedLM, DistilBertConfig, \\\n",
-    "            DistilBertTokenizerFast, AutoTokenizer, BertModel, BertForMaskedLM, BertTokenizerFast, BertConfig\n",
-    "from torch import nn\n",
-    "from pathlib import Path\n",
-    "import torch\n",
-    "import pandas as pd\n",
-    "from typing import Optional \n",
-    "from tqdm.auto import tqdm\n",
-    "from util import eval_test_set, count_parameters\n",
-    "from torch.optim import AdamW, RMSprop\n",
-    "\n",
-    "\n",
-    "from qa_model import QuestionDistilBERT, SimpleQuestionDistilBERT, ReuseQuestionDistilBERT, Dataset, test_model"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "3ea47820",
-   "metadata": {},
-   "source": [
-    "## Data\n",
-    "The data processing is partly based on the Hugging Face tutorial (https://huggingface.co/course/chapter7/7?fw=pt)."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 51,
-   "id": "7b1b2b3e",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 52,
-   "id": "f276eba7",
-   "metadata": {
-    "scrolled": false
-   },
-   "outputs": [],
-   "source": [
-    "   \n",
-    "# collect the training file paths for each dataset\n",
-    "squad_paths = [str(x) for x in Path('data/training_squad/').glob('**/*.txt')]\n",
-    "nat_paths = [str(x) for x in Path('data/natural_questions_train/').glob('**/*.txt')]\n",
-    "hotpotqa_paths = [str(x) for x in Path('data/hotpotqa_training/').glob('**/*.txt')]"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "ad8d532a",
-   "metadata": {},
-   "source": [
-    "## POC Model\n",
-    "* Works very well:\n",
-    "  * A dropout of 0.1 is too small (the model overfits after the first epoch), so it was raised to 0.15\n",
-    "  * The difference between AdamW and RMSprop is minimal\n",
-    "  \n",
-    "### Results:\n",
-    "Dropout = 0.15\n",
-    "* Mean EM:  0.5374\n",
-    "* Mean F-1:  0.6826317532406944\n",
-    "\n",
-    "Dropout = 0.2 (overfitting is relatively similar to the first run, but this value seems to be too high)\n",
-    "* Mean EM:  0.5044\n",
-    "* Mean F-1:  0.6437359169276439"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 54,
-   "id": "703e7f38",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "dataset = Dataset(squad_paths = squad_paths, natural_question_paths=None, hotpotqa_paths=hotpotqa_paths, tokenizer=tokenizer)\n",
-    "loader = torch.utils.data.DataLoader(dataset, batch_size=8)\n",
-    "\n",
-    "test_dataset = Dataset(squad_paths = [str(x) for x in Path('data/test_squad/').glob('**/*.txt')], \n",
-    "                       natural_question_paths=None, \n",
-    "                       hotpotqa_paths = None, tokenizer=tokenizer)\n",
-    "test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=4)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 55,
-   "id": "6672f614",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "model = DistilBertForMaskedLM.from_pretrained(\"distilbert-base-uncased\")\n",
-    "config = DistilBertConfig.from_pretrained(\"distilbert-base-uncased\")\n",
-    "mod = model.distilbert"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 56,
-   "id": "dec15198",
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "SimpleQuestionDistilBERT(\n",
-       "  (distilbert): DistilBertModel(\n",
-       "    (embeddings): Embeddings(\n",
-       "      (word_embeddings): Embedding(30522, 768, padding_idx=0)\n",
-       "      (position_embeddings): Embedding(512, 768)\n",
-       "      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "      (dropout): Dropout(p=0.1, inplace=False)\n",
-       "    )\n",
-       "    (transformer): Transformer(\n",
-       "      (layer): ModuleList(\n",
-       "        (0): TransformerBlock(\n",
-       "          (attention): MultiHeadSelfAttention(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "          )\n",
-       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "          (ffn): FFN(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
-       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
-       "            (activation): GELUActivation()\n",
-       "          )\n",
-       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "        )\n",
-       "        (1): TransformerBlock(\n",
-       "          (attention): MultiHeadSelfAttention(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "          )\n",
-       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "          (ffn): FFN(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
-       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
-       "            (activation): GELUActivation()\n",
-       "          )\n",
-       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "        )\n",
-       "        (2): TransformerBlock(\n",
-       "          (attention): MultiHeadSelfAttention(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "          )\n",
-       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "          (ffn): FFN(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
-       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
-       "            (activation): GELUActivation()\n",
-       "          )\n",
-       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "        )\n",
-       "        (3): TransformerBlock(\n",
-       "          (attention): MultiHeadSelfAttention(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "          )\n",
-       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "          (ffn): FFN(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
-       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
-       "            (activation): GELUActivation()\n",
-       "          )\n",
-       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "        )\n",
-       "        (4): TransformerBlock(\n",
-       "          (attention): MultiHeadSelfAttention(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "          )\n",
-       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "          (ffn): FFN(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
-       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
-       "            (activation): GELUActivation()\n",
-       "          )\n",
-       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "        )\n",
-       "        (5): TransformerBlock(\n",
-       "          (attention): MultiHeadSelfAttention(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "          )\n",
-       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "          (ffn): FFN(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
-       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
-       "            (activation): GELUActivation()\n",
-       "          )\n",
-       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "        )\n",
-       "      )\n",
-       "    )\n",
-       "  )\n",
-       "  (dropout): Dropout(p=0.5, inplace=False)\n",
-       "  (classifier): Linear(in_features=768, out_features=2, bias=True)\n",
-       ")"
-      ]
-     },
-     "execution_count": 56,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')\n",
-    "model = SimpleQuestionDistilBERT(mod)\n",
-    "model.to(device)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 57,
-   "id": "9def3c83",
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "+---------------------------------------------------------+------------+\n",
-      "|                         Modules                         | Parameters |\n",
-      "+---------------------------------------------------------+------------+\n",
-      "|       distilbert.embeddings.word_embeddings.weight      |  23440896  |\n",
-      "|     distilbert.embeddings.position_embeddings.weight    |   393216   |\n",
-      "|          distilbert.embeddings.LayerNorm.weight         |    768     |\n",
-      "|           distilbert.embeddings.LayerNorm.bias          |    768     |\n",
-      "|  distilbert.transformer.layer.0.attention.q_lin.weight  |   589824   |\n",
-      "|   distilbert.transformer.layer.0.attention.q_lin.bias   |    768     |\n",
-      "|  distilbert.transformer.layer.0.attention.k_lin.weight  |   589824   |\n",
-      "|   distilbert.transformer.layer.0.attention.k_lin.bias   |    768     |\n",
-      "|  distilbert.transformer.layer.0.attention.v_lin.weight  |   589824   |\n",
-      "|   distilbert.transformer.layer.0.attention.v_lin.bias   |    768     |\n",
-      "| distilbert.transformer.layer.0.attention.out_lin.weight |   589824   |\n",
-      "|  distilbert.transformer.layer.0.attention.out_lin.bias  |    768     |\n",
-      "|   distilbert.transformer.layer.0.sa_layer_norm.weight   |    768     |\n",
-      "|    distilbert.transformer.layer.0.sa_layer_norm.bias    |    768     |\n",
-      "|      distilbert.transformer.layer.0.ffn.lin1.weight     |  2359296   |\n",
-      "|       distilbert.transformer.layer.0.ffn.lin1.bias      |    3072    |\n",
-      "|      distilbert.transformer.layer.0.ffn.lin2.weight     |  2359296   |\n",
-      "|       distilbert.transformer.layer.0.ffn.lin2.bias      |    768     |\n",
-      "| distilbert.transformer.layer.0.output_layer_norm.weight |    768     |\n",
-      "|  distilbert.transformer.layer.0.output_layer_norm.bias  |    768     |\n",
-      "|  distilbert.transformer.layer.1.attention.q_lin.weight  |   589824   |\n",
-      "|   distilbert.transformer.layer.1.attention.q_lin.bias   |    768     |\n",
-      "|  distilbert.transformer.layer.1.attention.k_lin.weight  |   589824   |\n",
-      "|   distilbert.transformer.layer.1.attention.k_lin.bias   |    768     |\n",
-      "|  distilbert.transformer.layer.1.attention.v_lin.weight  |   589824   |\n",
-      "|   distilbert.transformer.layer.1.attention.v_lin.bias   |    768     |\n",
-      "| distilbert.transformer.layer.1.attention.out_lin.weight |   589824   |\n",
-      "|  distilbert.transformer.layer.1.attention.out_lin.bias  |    768     |\n",
-      "|   distilbert.transformer.layer.1.sa_layer_norm.weight   |    768     |\n",
-      "|    distilbert.transformer.layer.1.sa_layer_norm.bias    |    768     |\n",
-      "|      distilbert.transformer.layer.1.ffn.lin1.weight     |  2359296   |\n",
-      "|       distilbert.transformer.layer.1.ffn.lin1.bias      |    3072    |\n",
-      "|      distilbert.transformer.layer.1.ffn.lin2.weight     |  2359296   |\n",
-      "|       distilbert.transformer.layer.1.ffn.lin2.bias      |    768     |\n",
-      "| distilbert.transformer.layer.1.output_layer_norm.weight |    768     |\n",
-      "|  distilbert.transformer.layer.1.output_layer_norm.bias  |    768     |\n",
-      "|  distilbert.transformer.layer.2.attention.q_lin.weight  |   589824   |\n",
-      "|   distilbert.transformer.layer.2.attention.q_lin.bias   |    768     |\n",
-      "|  distilbert.transformer.layer.2.attention.k_lin.weight  |   589824   |\n",
-      "|   distilbert.transformer.layer.2.attention.k_lin.bias   |    768     |\n",
-      "|  distilbert.transformer.layer.2.attention.v_lin.weight  |   589824   |\n",
-      "|   distilbert.transformer.layer.2.attention.v_lin.bias   |    768     |\n",
-      "| distilbert.transformer.layer.2.attention.out_lin.weight |   589824   |\n",
-      "|  distilbert.transformer.layer.2.attention.out_lin.bias  |    768     |\n",
-      "|   distilbert.transformer.layer.2.sa_layer_norm.weight   |    768     |\n",
-      "|    distilbert.transformer.layer.2.sa_layer_norm.bias    |    768     |\n",
-      "|      distilbert.transformer.layer.2.ffn.lin1.weight     |  2359296   |\n",
-      "|       distilbert.transformer.layer.2.ffn.lin1.bias      |    3072    |\n",
-      "|      distilbert.transformer.layer.2.ffn.lin2.weight     |  2359296   |\n",
-      "|       distilbert.transformer.layer.2.ffn.lin2.bias      |    768     |\n",
-      "| distilbert.transformer.layer.2.output_layer_norm.weight |    768     |\n",
-      "|  distilbert.transformer.layer.2.output_layer_norm.bias  |    768     |\n",
-      "|  distilbert.transformer.layer.3.attention.q_lin.weight  |   589824   |\n",
-      "|   distilbert.transformer.layer.3.attention.q_lin.bias   |    768     |\n",
-      "|  distilbert.transformer.layer.3.attention.k_lin.weight  |   589824   |\n",
-      "|   distilbert.transformer.layer.3.attention.k_lin.bias   |    768     |\n",
-      "|  distilbert.transformer.layer.3.attention.v_lin.weight  |   589824   |\n",
-      "|   distilbert.transformer.layer.3.attention.v_lin.bias   |    768     |\n",
-      "| distilbert.transformer.layer.3.attention.out_lin.weight |   589824   |\n",
-      "|  distilbert.transformer.layer.3.attention.out_lin.bias  |    768     |\n",
-      "|   distilbert.transformer.layer.3.sa_layer_norm.weight   |    768     |\n",
-      "|    distilbert.transformer.layer.3.sa_layer_norm.bias    |    768     |\n",
-      "|      distilbert.transformer.layer.3.ffn.lin1.weight     |  2359296   |\n",
-      "|       distilbert.transformer.layer.3.ffn.lin1.bias      |    3072    |\n",
-      "|      distilbert.transformer.layer.3.ffn.lin2.weight     |  2359296   |\n",
-      "|       distilbert.transformer.layer.3.ffn.lin2.bias      |    768     |\n",
-      "| distilbert.transformer.layer.3.output_layer_norm.weight |    768     |\n",
-      "|  distilbert.transformer.layer.3.output_layer_norm.bias  |    768     |\n",
-      "|  distilbert.transformer.layer.4.attention.q_lin.weight  |   589824   |\n",
-      "|   distilbert.transformer.layer.4.attention.q_lin.bias   |    768     |\n",
-      "|  distilbert.transformer.layer.4.attention.k_lin.weight  |   589824   |\n",
-      "|   distilbert.transformer.layer.4.attention.k_lin.bias   |    768     |\n",
-      "|  distilbert.transformer.layer.4.attention.v_lin.weight  |   589824   |\n",
-      "|   distilbert.transformer.layer.4.attention.v_lin.bias   |    768     |\n",
-      "| distilbert.transformer.layer.4.attention.out_lin.weight |   589824   |\n",
-      "|  distilbert.transformer.layer.4.attention.out_lin.bias  |    768     |\n",
-      "|   distilbert.transformer.layer.4.sa_layer_norm.weight   |    768     |\n",
-      "|    distilbert.transformer.layer.4.sa_layer_norm.bias    |    768     |\n",
-      "|      distilbert.transformer.layer.4.ffn.lin1.weight     |  2359296   |\n",
-      "|       distilbert.transformer.layer.4.ffn.lin1.bias      |    3072    |\n",
-      "|      distilbert.transformer.layer.4.ffn.lin2.weight     |  2359296   |\n",
-      "|       distilbert.transformer.layer.4.ffn.lin2.bias      |    768     |\n",
-      "| distilbert.transformer.layer.4.output_layer_norm.weight |    768     |\n",
-      "|  distilbert.transformer.layer.4.output_layer_norm.bias  |    768     |\n",
-      "|  distilbert.transformer.layer.5.attention.q_lin.weight  |   589824   |\n",
-      "|   distilbert.transformer.layer.5.attention.q_lin.bias   |    768     |\n",
-      "|  distilbert.transformer.layer.5.attention.k_lin.weight  |   589824   |\n",
-      "|   distilbert.transformer.layer.5.attention.k_lin.bias   |    768     |\n",
-      "|  distilbert.transformer.layer.5.attention.v_lin.weight  |   589824   |\n",
-      "|   distilbert.transformer.layer.5.attention.v_lin.bias   |    768     |\n",
-      "| distilbert.transformer.layer.5.attention.out_lin.weight |   589824   |\n",
-      "|  distilbert.transformer.layer.5.attention.out_lin.bias  |    768     |\n",
-      "|   distilbert.transformer.layer.5.sa_layer_norm.weight   |    768     |\n",
-      "|    distilbert.transformer.layer.5.sa_layer_norm.bias    |    768     |\n",
-      "|      distilbert.transformer.layer.5.ffn.lin1.weight     |  2359296   |\n",
-      "|       distilbert.transformer.layer.5.ffn.lin1.bias      |    3072    |\n",
-      "|      distilbert.transformer.layer.5.ffn.lin2.weight     |  2359296   |\n",
-      "|       distilbert.transformer.layer.5.ffn.lin2.bias      |    768     |\n",
-      "| distilbert.transformer.layer.5.output_layer_norm.weight |    768     |\n",
-      "|  distilbert.transformer.layer.5.output_layer_norm.bias  |    768     |\n",
-      "|                    classifier.weight                    |    1536    |\n",
-      "|                     classifier.bias                     |     2      |\n",
-      "+---------------------------------------------------------+------------+\n",
-      "Total Trainable Params: 66364418\n"
-     ]
-    },
-    {
-     "data": {
-      "text/plain": [
-       "66364418"
-      ]
-     },
-     "execution_count": 57,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "count_parameters(model)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "426a6311",
-   "metadata": {},
-   "source": [
-    "### Testing the model"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 58,
-   "id": "6151c201",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# build a smaller dataset for a quick smoke test\n",
-    "batch_size = 8\n",
-    "test_ds = Dataset(squad_paths = squad_paths[:2], natural_question_paths=None, hotpotqa_paths=None, tokenizer=tokenizer)\n",
-    "test_ds_loader = torch.utils.data.DataLoader(test_ds, batch_size=batch_size)\n",
-    "optim = RMSprop(model.parameters(), lr=1e-4)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 59,
-   "id": "aeae0c56",
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Passed\n"
-     ]
-    }
-   ],
-   "source": [
-    "test_model(model, optim, test_ds_loader, device)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "59928d34",
-   "metadata": {},
-   "source": [
-    "### Model Training"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 60,
-   "id": "a8017b8c",
-   "metadata": {},
-   "outputs": [
-   ],
-   "source": [
-    "device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')\n",
-    "model = SimpleQuestionDistilBERT(mod)\n",
-    "model.to(device)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 61,
-   "id": "f13c12dc",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "model.train()\n",
-    "optim = RMSprop(model.parameters(), lr=1e-4)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 22,
-   "id": "e4fa54d9",
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "0016d9f5ba764eb98e9df8573995c86c",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "  0%|          | 0/10875 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Mean Training Error 0.7555404769408292\n"
-     ]
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "96af0e22e2ee44fd920795b0e7317839",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "  0%|          | 0/2500 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Mean Test Error 1.761920437876694\n"
-     ]
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "5160ffe5f60e4b72b46746a33b1d60d0",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "  0%|          | 0/10875 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "ename": "KeyboardInterrupt",
-     "evalue": "",
-     "output_type": "error",
-     "traceback": [
-      "\u001B[0;31m---------------------------------------------------------------------------\u001B[0m",
-      "\u001B[0;31mKeyboardInterrupt\u001B[0m                         Traceback (most recent call last)",
-      "Cell \u001B[0;32mIn [22], line 18\u001B[0m\n\u001B[1;32m     16\u001B[0m \u001B[38;5;66;03m# print(torch.argmax(outputs['start_logits'],axis=1), torch.argmax(outputs['end_logits'], axis=1), start, end)\u001B[39;00m\n\u001B[1;32m     17\u001B[0m loss \u001B[38;5;241m=\u001B[39m outputs[\u001B[38;5;124m'\u001B[39m\u001B[38;5;124mloss\u001B[39m\u001B[38;5;124m'\u001B[39m]\n\u001B[0;32m---> 18\u001B[0m loss\u001B[38;5;241m.\u001B[39mbackward()\n\u001B[1;32m     19\u001B[0m \u001B[38;5;66;03m# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)\u001B[39;00m\n\u001B[1;32m     20\u001B[0m optim\u001B[38;5;241m.\u001B[39mstep()\n",
-      "File \u001B[0;32m~/Documents/University/WS2022/applieddl/venv/lib64/python3.10/site-packages/torch/_tensor.py:396\u001B[0m, in \u001B[0;36mTensor.backward\u001B[0;34m(self, gradient, retain_graph, create_graph, inputs)\u001B[0m\n\u001B[1;32m    387\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m has_torch_function_unary(\u001B[38;5;28mself\u001B[39m):\n\u001B[1;32m    388\u001B[0m     \u001B[38;5;28;01mreturn\u001B[39;00m handle_torch_function(\n\u001B[1;32m    389\u001B[0m         Tensor\u001B[38;5;241m.\u001B[39mbackward,\n\u001B[1;32m    390\u001B[0m         (\u001B[38;5;28mself\u001B[39m,),\n\u001B[0;32m   (...)\u001B[0m\n\u001B[1;32m    394\u001B[0m         create_graph\u001B[38;5;241m=\u001B[39mcreate_graph,\n\u001B[1;32m    395\u001B[0m         inputs\u001B[38;5;241m=\u001B[39minputs)\n\u001B[0;32m--> 396\u001B[0m \u001B[43mtorch\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mautograd\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mbackward\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mgradient\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mretain_graph\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mcreate_graph\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43minputs\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43minputs\u001B[49m\u001B[43m)\u001B[49m\n",
-      "File \u001B[0;32m~/Documents/University/WS2022/applieddl/venv/lib64/python3.10/site-packages/torch/autograd/__init__.py:173\u001B[0m, in \u001B[0;36mbackward\u001B[0;34m(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)\u001B[0m\n\u001B[1;32m    168\u001B[0m     retain_graph \u001B[38;5;241m=\u001B[39m create_graph\n\u001B[1;32m    170\u001B[0m \u001B[38;5;66;03m# The reason we repeat same the comment below is that\u001B[39;00m\n\u001B[1;32m    171\u001B[0m \u001B[38;5;66;03m# some Python versions print out the first line of a multi-line function\u001B[39;00m\n\u001B[1;32m    172\u001B[0m \u001B[38;5;66;03m# calls in the traceback and some print out the last line\u001B[39;00m\n\u001B[0;32m--> 173\u001B[0m \u001B[43mVariable\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_execution_engine\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mrun_backward\u001B[49m\u001B[43m(\u001B[49m\u001B[43m  \u001B[49m\u001B[38;5;66;43;03m# Calls into the C++ engine to run the backward pass\u001B[39;49;00m\n\u001B[1;32m    174\u001B[0m \u001B[43m    \u001B[49m\u001B[43mtensors\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mgrad_tensors_\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mretain_graph\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mcreate_graph\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43minputs\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m    175\u001B[0m \u001B[43m    \u001B[49m\u001B[43mallow_unreachable\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[38;5;28;43;01mTrue\u001B[39;49;00m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43maccumulate_grad\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[38;5;28;43;01mTrue\u001B[39;49;00m\u001B[43m)\u001B[49m\n",
-      "\u001B[0;31mKeyboardInterrupt\u001B[0m: "
-     ]
-    }
-   ],
-   "source": [
-    "epochs = 5\n",
-    "\n",
-    "for epoch in range(epochs):\n",
-    "    loop = tqdm(loader, leave=True)\n",
-    "    model.train()\n",
-    "    mean_training_error = []\n",
-    "    for batch in loop:\n",
-    "        optim.zero_grad()\n",
-    "        \n",
-    "        input_ids = batch['input_ids'].to(device)\n",
-    "        attention_mask = batch['attention_mask'].to(device)\n",
-    "        start = batch['start_positions'].to(device)\n",
-    "        end = batch['end_positions'].to(device)\n",
-    "        \n",
-    "        outputs = model(input_ids, attention_mask=attention_mask, start_positions=start, end_positions=end)\n",
-    "        # print(torch.argmax(outputs['start_logits'],axis=1), torch.argmax(outputs['end_logits'], axis=1), start, end)\n",
-    "        loss = outputs['loss']\n",
-    "        loss.backward()\n",
-    "        # torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)\n",
-    "        optim.step()\n",
-    "        mean_training_error.append(loss.item())\n",
-    "        loop.set_description(f'Epoch {epoch}')\n",
-    "        loop.set_postfix(loss=loss.item())\n",
-    "    print(\"Mean Training Error\", np.mean(mean_training_error))\n",
-    "    \n",
-    "    \n",
-    "    loop = tqdm(test_loader, leave=True)\n",
-    "    model.eval()\n",
-    "    mean_test_error = []\n",
-    "    # no gradients are needed for evaluation; this saves memory and time\n",
-    "    with torch.no_grad():\n",
-    "        for batch in loop:\n",
-    "            input_ids = batch['input_ids'].to(device)\n",
-    "            attention_mask = batch['attention_mask'].to(device)\n",
-    "            start = batch['start_positions'].to(device)\n",
-    "            end = batch['end_positions'].to(device)\n",
-    "\n",
-    "            outputs = model(input_ids, attention_mask=attention_mask, start_positions=start, end_positions=end)\n",
-    "            # print(torch.argmax(outputs['start_logits'],axis=1), torch.argmax(outputs['end_logits'], axis=1), start, end)\n",
-    "            loss = outputs['loss']\n",
-    "\n",
-    "            mean_test_error.append(loss.item())\n",
-    "            loop.set_description(f'Epoch {epoch} Testset')\n",
-    "            loop.set_postfix(loss=loss.item())\n",
-    "    print(\"Mean Test Error\", np.mean(mean_test_error))"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 19,
-   "id": "6ff26fb4",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "torch.save(model.state_dict(), \"simple_distilbert_qa.model\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 20,
-   "id": "a5e7abeb",
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "<All keys matched successfully>"
-      ]
-     },
-     "execution_count": 20,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "model = SimpleQuestionDistilBERT(mod)\n",
-    "model.load_state_dict(torch.load(\"simple_distilbert_qa.model\"))"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 18,
-   "id": "f5ad7bee",
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "100%|██████████| 2500/2500 [02:09<00:00, 19.37it/s]"
-     ]
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Mean EM:  0.5374\n",
-      "Mean F-1:  0.6826317532406944\n"
-     ]
-    },
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "\n"
-     ]
-    }
-   ],
-   "source": [
-    "eval_test_set(model, tokenizer, test_loader, device)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "fa6017a8",
-   "metadata": {},
-   "source": [
-    "## Freeze baseline and train new head\n",
-    "This was my initial idea: freeze the pretrained layers and add a completely new head, which is trained from scratch. I tried many different configurations, but none of them really worked; the CrossEntropyLoss usually stayed at about 3 for the whole run. Below, you can see the different heads I tried.\n",
-    "\n",
-    "Furthermore, I experimented with different data, because I thought there might not be enough data overall. I would conclude that this approach didn't work for two reasons: (1) Transformers are very data-hungry and I probably still used too little data (one epoch already took about 1h, so it wasn't feasible to use even more). (2) The new layers are trained from scratch, which means they contain no prior structure about the problem or task. I do not think that this way of training leads to better results or lower energy use overall, because it would be too resource-intensive.\n",
-    "\n",
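-    "As a minimal sketch of the freezing idea (not the code in `qa_model.py`): a tiny stand-in encoder replaces DistilBERT so the snippet runs without downloading weights, and the names `TinyEncoder` and `FrozenBaseWithHead` are hypothetical. Only the new head stays trainable.\n",
-    "\n",
-    "```python\n",
-    "import torch\n",
-    "from torch import nn\n",
-    "\n",
-    "class TinyEncoder(nn.Module):\n",
-    "    # stand-in for the pretrained base model (assumption, not DistilBERT)\n",
-    "    def __init__(self, vocab_size=100, dim=16):\n",
-    "        super().__init__()\n",
-    "        self.emb = nn.Embedding(vocab_size, dim)\n",
-    "        self.lin = nn.Linear(dim, dim)\n",
-    "\n",
-    "    def forward(self, input_ids):\n",
-    "        return torch.tanh(self.lin(self.emb(input_ids)))\n",
-    "\n",
-    "class FrozenBaseWithHead(nn.Module):\n",
-    "    def __init__(self, base, dim=16):\n",
-    "        super().__init__()\n",
-    "        self.base = base\n",
-    "        # freeze the base: its weights receive no gradient updates\n",
-    "        for p in self.base.parameters():\n",
-    "            p.requires_grad = False\n",
-    "        # new head, trained from scratch: one start and one end logit per token\n",
-    "        self.classifier = nn.Linear(dim, 2)\n",
-    "\n",
-    "    def forward(self, input_ids):\n",
-    "        logits = self.classifier(self.base(input_ids))  # (batch, seq, 2)\n",
-    "        start_logits, end_logits = logits.split(1, dim=-1)\n",
-    "        return start_logits.squeeze(-1), end_logits.squeeze(-1)\n",
-    "\n",
-    "model = FrozenBaseWithHead(TinyEncoder())\n",
-    "trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
-    "print(trainable)  # only the head: 16 * 2 weights + 2 biases = 34\n",
-    "```\n",
-    "\n",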
-    "The following setup is partly based on the HuggingFace implementation of the question answering model (https://github.com/huggingface/transformers/blob/v4.23.1/src/transformers/models/distilbert/modeling_distilbert.py#L805)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 62,
-   "id": "92b21967",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "model = DistilBertForMaskedLM.from_pretrained(\"distilbert-base-uncased\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 63,
-   "id": "1d7b3a8c",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "config = DistilBertConfig.from_pretrained(\"distilbert-base-uncased\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 64,
-   "id": "91444894",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# only take the base model; we do not need the masked-LM head\n",
-    "mod = model.distilbert"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 65,
-   "id": "74ca6c07",
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "QuestionDistilBERT(\n",
-       "  (distilbert): DistilBertModel(\n",
-       "    (embeddings): Embeddings(\n",
-       "      (word_embeddings): Embedding(30522, 768, padding_idx=0)\n",
-       "      (position_embeddings): Embedding(512, 768)\n",
-       "      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "      (dropout): Dropout(p=0.1, inplace=False)\n",
-       "    )\n",
-       "    (transformer): Transformer(\n",
-       "      (layer): ModuleList(\n",
-       "        (0): TransformerBlock(\n",
-       "          (attention): MultiHeadSelfAttention(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "          )\n",
-       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "          (ffn): FFN(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
-       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
-       "            (activation): GELUActivation()\n",
-       "          )\n",
-       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "        )\n",
-       "        (1): TransformerBlock(\n",
-       "          (attention): MultiHeadSelfAttention(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "          )\n",
-       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "          (ffn): FFN(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
-       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
-       "            (activation): GELUActivation()\n",
-       "          )\n",
-       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "        )\n",
-       "        (2): TransformerBlock(\n",
-       "          (attention): MultiHeadSelfAttention(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "          )\n",
-       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "          (ffn): FFN(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
-       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
-       "            (activation): GELUActivation()\n",
-       "          )\n",
-       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "        )\n",
-       "        (3): TransformerBlock(\n",
-       "          (attention): MultiHeadSelfAttention(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "          )\n",
-       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "          (ffn): FFN(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
-       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
-       "            (activation): GELUActivation()\n",
-       "          )\n",
-       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "        )\n",
-       "        (4): TransformerBlock(\n",
-       "          (attention): MultiHeadSelfAttention(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "          )\n",
-       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "          (ffn): FFN(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
-       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
-       "            (activation): GELUActivation()\n",
-       "          )\n",
-       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "        )\n",
-       "        (5): TransformerBlock(\n",
-       "          (attention): MultiHeadSelfAttention(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "          )\n",
-       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "          (ffn): FFN(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
-       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
-       "            (activation): GELUActivation()\n",
-       "          )\n",
-       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "        )\n",
-       "      )\n",
-       "    )\n",
-       "  )\n",
-       "  (relu): ReLU()\n",
-       "  (dropout): Dropout(p=0.1, inplace=False)\n",
-       "  (te): TransformerEncoder(\n",
-       "    (layers): ModuleList(\n",
-       "      (0): TransformerEncoderLayer(\n",
-       "        (self_attn): MultiheadAttention(\n",
-       "          (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)\n",
-       "        )\n",
-       "        (linear1): Linear(in_features=768, out_features=2048, bias=True)\n",
-       "        (dropout): Dropout(p=0.1, inplace=False)\n",
-       "        (linear2): Linear(in_features=2048, out_features=768, bias=True)\n",
-       "        (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)\n",
-       "        (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)\n",
-       "        (dropout1): Dropout(p=0.1, inplace=False)\n",
-       "        (dropout2): Dropout(p=0.1, inplace=False)\n",
-       "      )\n",
-       "      (1): TransformerEncoderLayer(\n",
-       "        (self_attn): MultiheadAttention(\n",
-       "          (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)\n",
-       "        )\n",
-       "        (linear1): Linear(in_features=768, out_features=2048, bias=True)\n",
-       "        (dropout): Dropout(p=0.1, inplace=False)\n",
-       "        (linear2): Linear(in_features=2048, out_features=768, bias=True)\n",
-       "        (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)\n",
-       "        (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)\n",
-       "        (dropout1): Dropout(p=0.1, inplace=False)\n",
-       "        (dropout2): Dropout(p=0.1, inplace=False)\n",
-       "      )\n",
-       "      (2): TransformerEncoderLayer(\n",
-       "        (self_attn): MultiheadAttention(\n",
-       "          (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)\n",
-       "        )\n",
-       "        (linear1): Linear(in_features=768, out_features=2048, bias=True)\n",
-       "        (dropout): Dropout(p=0.1, inplace=False)\n",
-       "        (linear2): Linear(in_features=2048, out_features=768, bias=True)\n",
-       "        (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)\n",
-       "        (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)\n",
-       "        (dropout1): Dropout(p=0.1, inplace=False)\n",
-       "        (dropout2): Dropout(p=0.1, inplace=False)\n",
-       "      )\n",
-       "    )\n",
-       "  )\n",
-       "  (classifier): Sequential(\n",
-       "    (0): Dropout(p=0.1, inplace=False)\n",
-       "    (1): ReLU()\n",
-       "    (2): Linear(in_features=768, out_features=512, bias=True)\n",
-       "    (3): Dropout(p=0.1, inplace=False)\n",
-       "    (4): ReLU()\n",
-       "    (5): Linear(in_features=512, out_features=256, bias=True)\n",
-       "    (6): Dropout(p=0.1, inplace=False)\n",
-       "    (7): ReLU()\n",
-       "    (8): Linear(in_features=256, out_features=128, bias=True)\n",
-       "    (9): Dropout(p=0.1, inplace=False)\n",
-       "    (10): ReLU()\n",
-       "    (11): Linear(in_features=128, out_features=64, bias=True)\n",
-       "    (12): Dropout(p=0.1, inplace=False)\n",
-       "    (13): ReLU()\n",
-       "    (14): Linear(in_features=64, out_features=2, bias=True)\n",
-       "  )\n",
-       ")"
-      ]
-     },
-     "execution_count": 65,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')\n",
-    "model = QuestionDistilBERT(mod)\n",
-    "model.to(device)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 66,
-   "id": "340857f9",
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "+---------------------------------------+------------+\n",
-      "|                Modules                | Parameters |\n",
-      "+---------------------------------------+------------+\n",
-      "|  te.layers.0.self_attn.in_proj_weight |  1769472   |\n",
-      "|   te.layers.0.self_attn.in_proj_bias  |    2304    |\n",
-      "| te.layers.0.self_attn.out_proj.weight |   589824   |\n",
-      "|  te.layers.0.self_attn.out_proj.bias  |    768     |\n",
-      "|       te.layers.0.linear1.weight      |  1572864   |\n",
-      "|        te.layers.0.linear1.bias       |    2048    |\n",
-      "|       te.layers.0.linear2.weight      |  1572864   |\n",
-      "|        te.layers.0.linear2.bias       |    768     |\n",
-      "|        te.layers.0.norm1.weight       |    768     |\n",
-      "|         te.layers.0.norm1.bias        |    768     |\n",
-      "|        te.layers.0.norm2.weight       |    768     |\n",
-      "|         te.layers.0.norm2.bias        |    768     |\n",
-      "|  te.layers.1.self_attn.in_proj_weight |  1769472   |\n",
-      "|   te.layers.1.self_attn.in_proj_bias  |    2304    |\n",
-      "| te.layers.1.self_attn.out_proj.weight |   589824   |\n",
-      "|  te.layers.1.self_attn.out_proj.bias  |    768     |\n",
-      "|       te.layers.1.linear1.weight      |  1572864   |\n",
-      "|        te.layers.1.linear1.bias       |    2048    |\n",
-      "|       te.layers.1.linear2.weight      |  1572864   |\n",
-      "|        te.layers.1.linear2.bias       |    768     |\n",
-      "|        te.layers.1.norm1.weight       |    768     |\n",
-      "|         te.layers.1.norm1.bias        |    768     |\n",
-      "|        te.layers.1.norm2.weight       |    768     |\n",
-      "|         te.layers.1.norm2.bias        |    768     |\n",
-      "|  te.layers.2.self_attn.in_proj_weight |  1769472   |\n",
-      "|   te.layers.2.self_attn.in_proj_bias  |    2304    |\n",
-      "| te.layers.2.self_attn.out_proj.weight |   589824   |\n",
-      "|  te.layers.2.self_attn.out_proj.bias  |    768     |\n",
-      "|       te.layers.2.linear1.weight      |  1572864   |\n",
-      "|        te.layers.2.linear1.bias       |    2048    |\n",
-      "|       te.layers.2.linear2.weight      |  1572864   |\n",
-      "|        te.layers.2.linear2.bias       |    768     |\n",
-      "|        te.layers.2.norm1.weight       |    768     |\n",
-      "|         te.layers.2.norm1.bias        |    768     |\n",
-      "|        te.layers.2.norm2.weight       |    768     |\n",
-      "|         te.layers.2.norm2.bias        |    768     |\n",
-      "|          classifier.2.weight          |   393216   |\n",
-      "|           classifier.2.bias           |    512     |\n",
-      "|          classifier.5.weight          |   131072   |\n",
-      "|           classifier.5.bias           |    256     |\n",
-      "|          classifier.8.weight          |   32768    |\n",
-      "|           classifier.8.bias           |    128     |\n",
-      "|          classifier.11.weight         |    8192    |\n",
-      "|           classifier.11.bias          |     64     |\n",
-      "|          classifier.14.weight         |    128     |\n",
-      "|           classifier.14.bias          |     2      |\n",
-      "+---------------------------------------+------------+\n",
-      "Total Trainable Params: 17108290\n"
-     ]
-    },
-    {
-     "data": {
-      "text/plain": [
-       "17108290"
-      ]
-     },
-     "execution_count": 66,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "count_parameters(model)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "9babd013",
-   "metadata": {},
-   "source": [
-    "### Testing the model\n",
-    "This is the same procedure as in `distilbert.ipynb`. "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 67,
-   "id": "694c828b",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# get smaller dataset\n",
-    "batch_size = 8\n",
-    "test_ds = Dataset(squad_paths = squad_paths[:2], natural_question_paths=None, hotpotqa_paths=None, tokenizer=tokenizer)\n",
-    "test_ds_loader = torch.utils.data.DataLoader(test_ds, batch_size=batch_size)\n",
-    "optim = torch.optim.Adam(model.parameters())"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 68,
-   "id": "a76587df",
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Passed\n"
-     ]
-    }
-   ],
-   "source": [
-    "test_model(model, optim, test_ds_loader, device)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "7c326e8e",
-   "metadata": {},
-   "source": [
-    "### Training the model\n",
-    "Parameter Tuning:\n",
-    "* Learning Rate: I experimented with several values; 1e-4 seemed to work best for me. 1e-3 was very unstable and 1e-5 was too small.\n",
-    "* Gradient Clipping: I experimented with this, but the difference was only minimal.\n",
-    "\n",
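-    "A minimal sketch of a single training step with these settings and with gradient clipping enabled; the toy linear model and the random batch are stand-ins (assumptions), not the QA model or data used here:\n",
-    "\n",
-    "```python\n",
-    "import torch\n",
-    "from torch import nn\n",
-    "from torch.optim import RMSprop\n",
-    "\n",
-    "torch.manual_seed(0)\n",
-    "model = nn.Linear(8, 4)  # toy stand-in for the QA model (assumption)\n",
-    "optim = RMSprop(model.parameters(), lr=1e-4)\n",
-    "x, y = torch.randn(16, 8), torch.randint(0, 4, (16,))  # random batch\n",
-    "\n",
-    "optim.zero_grad()\n",
-    "loss = nn.functional.cross_entropy(model(x), y)\n",
-    "loss.backward()\n",
-    "# rescale gradients so that their total norm is at most max_norm;\n",
-    "# the return value is the total norm measured before clipping\n",
-    "norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)\n",
-    "optim.step()\n",
-    "```\n",
-    "\n",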
-    "Data:\n",
-    "* I first used only the SQuAD dataset, but generalisation is a problem\n",
-    "  * The dataset is relatively small and we often have entries with the same context but different questions\n",
-    "  * I believe the diversity is not big enough to train a fully functional model\n",
-    "* Hence, I included the Natural Questions dataset too\n",
-    "  * It is, however, a lot messier - I elaborated a bit more on this in `load_data.ipynb`\n",
-    "* The HotpotQA data was used as well\n",
-    "\n",
-    "Tested with: \n",
-    "* 3 Linear Layers\n",
-    "  * Training error high - needed more layers\n",
-    "  * As expected - this was mostly a proof of concept\n",
-    "* 1 TransformerEncoder with 4 attention heads + 1 Linear Layer:\n",
-    "  * Training error was high, still too simple\n",
-    "* 1 TransformerEncoder with 8 heads + 1 Linear Layer:\n",
-    "  * Training error gets lower, but stagnates at some point\n",
-    "  * Probably still too simple; it doesn't generalise either\n",
-    "* 2 TransformerEncoder layers with 8 and 4 heads + 1 Linear Layer:\n",
-    "  * Loss goes down but plateaus after some time\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "2e9f4bd3",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "dataset = Dataset(squad_paths = squad_paths, natural_question_paths=nat_paths, hotpotqa_paths=hotpotqa_paths, tokenizer=tokenizer)\n",
-    "loader = torch.utils.data.DataLoader(dataset, batch_size=8)\n",
-    "\n",
-    "test_dataset = Dataset(squad_paths = [str(x) for x in Path('data/test_squad/').glob('**/*.txt')], \n",
-    "                       natural_question_paths=None, \n",
-    "                       hotpotqa_paths = None, tokenizer=tokenizer)\n",
-    "test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=4)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 26,
-   "id": "03a6de37",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "model = QuestionDistilBERT(mod)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 41,
-   "id": "ed854b73",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from torch.optim import AdamW, RMSprop\n",
-    "\n",
-    "model.train()\n",
-    "optim = RMSprop(model.parameters(), lr=1e-4)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 42,
-   "id": "79fdfcc9",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from torch.utils.tensorboard import SummaryWriter\n",
-    "writer = SummaryWriter()"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "f7bddb43",
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "5e9e74167c4b4b22b3218f4ca3c5abf0",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "  0%|          | 0/21750 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Mean Training Error 3.8791405910185013\n"
-     ]
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "f3ce562fc61d4bfc83a4860eb06bc20c",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "  0%|          | 0/1250 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Mean Test Error 3.7705092002868654\n"
-     ]
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "2e84e21cedd446a0a5f5a40501711d1c",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "  0%|          | 0/21750 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Mean Training Error 3.7389922174091996\n"
-     ]
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "07135c48be0146498cd37d767c1ee6ab",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "  0%|          | 0/1250 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Mean Test Error 3.7443671816825868\n"
-     ]
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "e9a51fbabc7043c2819a68e247e4a3ec",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "  0%|          | 0/21750 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Mean Training Error 3.7031057048117977\n"
-     ]
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "bfdbcc9fe32542a19c47bc1d7704400e",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "  0%|          | 0/1250 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Mean Test Error 3.743248237323761\n"
-     ]
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "81fd1278b22643dc9fb3ac306533a240",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "  0%|          | 0/21750 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Mean Training Error 3.6711661003430685\n"
-     ]
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "8b38d6cd44e048ec8bcd6b5cb86cce16",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "  0%|          | 0/1250 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Mean Test Error 3.740310479736328\n"
-     ]
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "825248aa3f934f4aade9d973e6f3b43e",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "  0%|          | 0/21750 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Mean Training Error 3.6591619139813827\n"
-     ]
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "edceb7af0ec6450997820967638c12db",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "  0%|          | 0/1250 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Mean Test Error 3.8138498876571654\n"
-     ]
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "27e903eb0d0f4f949c234e4faf4277a1",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "  0%|          | 0/21750 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    }
-   ],
-   "source": [
-    "epochs = 20\n",
-    "\n",
-    "for epoch in range(epochs):\n",
-    "    loop = tqdm(loader, leave=True)\n",
-    "    model.train()\n",
-    "    mean_training_error = []\n",
-    "    for batch in loop:\n",
-    "        optim.zero_grad()\n",
-    "        \n",
-    "        input_ids = batch['input_ids'].to(device)\n",
-    "        attention_mask = batch['attention_mask'].to(device)\n",
-    "        start = batch['start_positions'].to(device)\n",
-    "        end = batch['end_positions'].to(device)\n",
-    "        \n",
-    "        outputs = model(input_ids, attention_mask=attention_mask, start_positions=start, end_positions=end)\n",
-    "        \n",
-    "        loss = outputs['loss']\n",
-    "        loss.backward()\n",
-    "        \n",
-    "        optim.step()\n",
-    "        mean_training_error.append(loss.item())\n",
-    "        loop.set_description(f'Epoch {epoch}')\n",
-    "        loop.set_postfix(loss=loss.item())\n",
-    "    print(\"Mean Training Error\", np.mean(mean_training_error))\n",
-    "    writer.add_scalar(\"Loss/train\", np.mean(mean_training_error), epoch)\n",
-    "    \n",
-    "    loop = tqdm(test_loader, leave=True)\n",
-    "    model.eval()\n",
-    "    mean_test_error = []\n",
-    "    for batch in loop:\n",
-    "        \n",
-    "        input_ids = batch['input_ids'].to(device)\n",
-    "        attention_mask = batch['attention_mask'].to(device)\n",
-    "        start = batch['start_positions'].to(device)\n",
-    "        end = batch['end_positions'].to(device)\n",
-    "        \n",
-    "        with torch.no_grad():  # evaluation only, so skip building the autograd graph\n",
-    "            outputs = model(input_ids, attention_mask=attention_mask, start_positions=start, end_positions=end)\n",
-    "        # print(torch.argmax(outputs['start_logits'],axis=1), torch.argmax(outputs['end_logits'], axis=1), start, end)\n",
-    "        loss = outputs['loss']\n",
-    "        \n",
-    "        mean_test_error.append(loss.item())\n",
-    "        loop.set_description(f'Epoch {epoch} Testset')\n",
-    "        loop.set_postfix(loss=loss.item())\n",
-    "    print(\"Mean Test Error\", np.mean(mean_test_error))\n",
-    "    writer.add_scalar(\"Loss/test\", np.mean(mean_test_error), epoch)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 238,
-   "id": "a9d6af2e",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "writer.close()"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 33,
-   "id": "ba43447e",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "torch.save(model.state_dict(), \"distilbert_qa.model\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 34,
-   "id": "ffc49aca",
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "<All keys matched successfully>"
-      ]
-     },
-     "execution_count": 34,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "model = QuestionDistilBERT(mod)\n",
-    "model.load_state_dict(torch.load(\"distilbert_qa.model\"))"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 35,
-   "id": "730a86c1",
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "100%|██████████| 2500/2500 [02:57<00:00, 14.09it/s]"
-     ]
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Mean EM:  0.0479\n",
-      "Mean F-1:  0.08989175857485086\n"
-     ]
-    },
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "\n"
-     ]
-    }
-   ],
-   "source": [
-    "eval_test_set(model, tokenizer, test_loader, device)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "bd1c7076",
-   "metadata": {},
-   "source": [
-    "## Reuse Layer\n",
-    "This variant was inspired by how well the original model with just one classification head worked. I felt that the main problem with the previous model was that its newly added layers lacked the structure the pretrained layers already have, combined with the large amount of resources needed to train a Transformer.\n",
-    "\n",
-    "Hence, I tried cloning the last layer (and then the last two layers) of the DistilBERT model, putting a classifier on top, and using this as the head. The base DistilBERT model is completely frozen. This worked extremely well, even though we only fine-tune about 21% of the parameters we did before (14M as opposed to 66M). The results are listed below.\n",
-    "\n",
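-    "A minimal sketch of this idea, not the actual `ReuseQuestionDistilBERT` from `qa_model.py` (whose code is not shown here): clone the last `n_layers` Transformer blocks, freeze the base model, and put a two-output classifier on top. The class name `ReuseHeadSketch` and its arguments are hypothetical, only the module layout is sketched (no forward pass), and a randomly initialised `DistilBertModel` is assumed so the snippet runs without downloading weights.\n",
-    "\n",
-    "```python\n",
-    "import copy\n",
-    "\n",
-    "from torch import nn\n",
-    "from transformers import DistilBertConfig, DistilBertModel\n",
-    "\n",
-    "\n",
-    "class ReuseHeadSketch(nn.Module):\n",
-    "    # Hypothetical illustration of the layout only; no forward pass is defined.\n",
-    "    def __init__(self, distilbert, n_layers=2, dropout=0.15):\n",
-    "        super().__init__()\n",
-    "        # Trainable copies of the last n_layers blocks serve as the head.\n",
-    "        last_blocks = distilbert.transformer.layer[-n_layers:]\n",
-    "        self.te = nn.ModuleList([copy.deepcopy(block) for block in last_blocks])\n",
-    "        self.distilbert = distilbert\n",
-    "        # Freeze the base model so only the copies and the classifier are updated.\n",
-    "        for param in self.distilbert.parameters():\n",
-    "            param.requires_grad = False\n",
-    "        self.dropout = nn.Dropout(dropout)\n",
-    "        # Two outputs per token: a start logit and an end logit.\n",
-    "        self.classifier = nn.Linear(distilbert.config.dim, 2)\n",
-    "\n",
-    "\n",
-    "base = DistilBertModel(DistilBertConfig())  # assumption: default config, random weights\n",
-    "sketch = ReuseHeadSketch(base)\n",
-    "trainable = sum(p.numel() for p in sketch.parameters() if p.requires_grad)\n",
-    "frozen = sum(p.numel() for p in base.parameters())\n",
-    "print('trainable:', trainable, 'frozen base:', frozen)\n",
-    "```\n",
-    "\n",
-    "With the default configuration and two cloned layers, the trainable count should match the total reported by `count_parameters` further down in this notebook.\n",
-    "\n",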
-    "### Last DistilBERT layer\n",
-    "\n",
-    "Dropout 0.1 and RMSprop 1e-4:\n",
-    "* Mean EM:  0.3888\n",
-    "* Mean F-1:  0.5122932744694068\n",
-    "\n",
-    "Dropout 0.25: stagnates very early\n",
-    "* Mean EM:  0.3552\n",
-    "* Mean F-1:  0.4711235721312687\n",
-    "\n",
-    "Dropout 0.15: seems to work well - training and test error stagnate around 1.7 and 1.8, but generalisation is good (need to add more layers)\n",
-    "* Mean EM:  0.4119\n",
-    "* Mean F-1:  0.5296387232893214\n",
-    "\n",
-    "### Last DistilBERT layer + more Dense layers\n",
-    "Dropout 0.15 + 4 dense layers ((768-512)-(512-256)-(256-128)-(128-2)) & ReLU: doesn't work too well - error stagnates at around 2.4\n",
-    "\n",
-    "### Last two DistilBERT layers\n",
-    "Dropout 0.1 but last 2 DistilBERT layers: works very well, but early overfitting - maybe use more data\n",
-    "* Mean EM:  0.458\n",
-    "* Mean F-1:  0.6003368353673634\n",
-    "\n",
-    "Dropout 0.1 - last 2 DistilBERT layers, trained on all data\n",
-    "* Mean EM:  0.484\n",
-    "* Mean F-1:  0.6344960035215299\n",
-    "\n",
-    "Dropout 0.15 - **BEST**\n",
-    "* Mean EM:  0.5178\n",
-    "* Mean F-1:  0.6671140689626448\n",
-    "\n",
-    "Dropout 0.2 - doesn't work too well\n",
-    "* Mean EM:  0.4353\n",
-    "* Mean F-1:  0.5776847879304647\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 69,
-   "id": "654e09e8",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "dataset = Dataset(squad_paths = squad_paths, natural_question_paths=None, hotpotqa_paths=hotpotqa_paths, tokenizer=tokenizer)\n",
-    "loader = torch.utils.data.DataLoader(dataset, batch_size=8)\n",
-    "\n",
-    "test_dataset = Dataset(squad_paths = [str(x) for x in Path('data/test_squad/').glob('**/*.txt')], \n",
-    "                       natural_question_paths=None, \n",
-    "                       hotpotqa_paths = None, tokenizer=tokenizer)\n",
-    "test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=4)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 70,
-   "id": "707c0cb5",
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "ReuseQuestionDistilBERT(\n",
-       "  (te): ModuleList(\n",
-       "    (0): TransformerBlock(\n",
-       "      (attention): MultiHeadSelfAttention(\n",
-       "        (dropout): Dropout(p=0.1, inplace=False)\n",
-       "        (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "        (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "        (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "        (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "      )\n",
-       "      (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "      (ffn): FFN(\n",
-       "        (dropout): Dropout(p=0.1, inplace=False)\n",
-       "        (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
-       "        (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
-       "        (activation): GELUActivation()\n",
-       "      )\n",
-       "      (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "    )\n",
-       "    (1): TransformerBlock(\n",
-       "      (attention): MultiHeadSelfAttention(\n",
-       "        (dropout): Dropout(p=0.1, inplace=False)\n",
-       "        (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "        (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "        (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "        (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "      )\n",
-       "      (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "      (ffn): FFN(\n",
-       "        (dropout): Dropout(p=0.1, inplace=False)\n",
-       "        (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
-       "        (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
-       "        (activation): GELUActivation()\n",
-       "      )\n",
-       "      (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "    )\n",
-       "  )\n",
-       "  (distilbert): DistilBertModel(\n",
-       "    (embeddings): Embeddings(\n",
-       "      (word_embeddings): Embedding(30522, 768, padding_idx=0)\n",
-       "      (position_embeddings): Embedding(512, 768)\n",
-       "      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "      (dropout): Dropout(p=0.1, inplace=False)\n",
-       "    )\n",
-       "    (transformer): Transformer(\n",
-       "      (layer): ModuleList(\n",
-       "        (0): TransformerBlock(\n",
-       "          (attention): MultiHeadSelfAttention(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "          )\n",
-       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "          (ffn): FFN(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
-       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
-       "            (activation): GELUActivation()\n",
-       "          )\n",
-       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "        )\n",
-       "        (1): TransformerBlock(\n",
-       "          (attention): MultiHeadSelfAttention(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "          )\n",
-       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "          (ffn): FFN(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
-       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
-       "            (activation): GELUActivation()\n",
-       "          )\n",
-       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "        )\n",
-       "        (2): TransformerBlock(\n",
-       "          (attention): MultiHeadSelfAttention(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "          )\n",
-       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "          (ffn): FFN(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
-       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
-       "            (activation): GELUActivation()\n",
-       "          )\n",
-       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "        )\n",
-       "        (3): TransformerBlock(\n",
-       "          (attention): MultiHeadSelfAttention(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "          )\n",
-       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "          (ffn): FFN(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
-       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
-       "            (activation): GELUActivation()\n",
-       "          )\n",
-       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "        )\n",
-       "        (4): TransformerBlock(\n",
-       "          (attention): MultiHeadSelfAttention(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "          )\n",
-       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "          (ffn): FFN(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
-       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
-       "            (activation): GELUActivation()\n",
-       "          )\n",
-       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "        )\n",
-       "        (5): TransformerBlock(\n",
-       "          (attention): MultiHeadSelfAttention(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
-       "          )\n",
-       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "          (ffn): FFN(\n",
-       "            (dropout): Dropout(p=0.1, inplace=False)\n",
-       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
-       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
-       "            (activation): GELUActivation()\n",
-       "          )\n",
-       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
-       "        )\n",
-       "      )\n",
-       "    )\n",
-       "  )\n",
-       "  (relu): ReLU()\n",
-       "  (dropout): Dropout(p=0.15, inplace=False)\n",
-       "  (classifier): Linear(in_features=768, out_features=2, bias=True)\n",
-       ")"
-      ]
-     },
-     "execution_count": 70,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "model = DistilBertForMaskedLM.from_pretrained(\"distilbert-base-uncased\")\n",
-    "config = DistilBertConfig.from_pretrained(\"distilbert-base-uncased\")\n",
-    "mod = model.distilbert\n",
-    "\n",
-    "device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')\n",
-    "model = ReuseQuestionDistilBERT(mod)\n",
-    "model.to(device)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 71,
-   "id": "d2c6bff5",
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "+-------------------------------+------------+\n",
-      "|            Modules            | Parameters |\n",
-      "+-------------------------------+------------+\n",
-      "|  te.0.attention.q_lin.weight  |   589824   |\n",
-      "|   te.0.attention.q_lin.bias   |    768     |\n",
-      "|  te.0.attention.k_lin.weight  |   589824   |\n",
-      "|   te.0.attention.k_lin.bias   |    768     |\n",
-      "|  te.0.attention.v_lin.weight  |   589824   |\n",
-      "|   te.0.attention.v_lin.bias   |    768     |\n",
-      "| te.0.attention.out_lin.weight |   589824   |\n",
-      "|  te.0.attention.out_lin.bias  |    768     |\n",
-      "|   te.0.sa_layer_norm.weight   |    768     |\n",
-      "|    te.0.sa_layer_norm.bias    |    768     |\n",
-      "|      te.0.ffn.lin1.weight     |  2359296   |\n",
-      "|       te.0.ffn.lin1.bias      |    3072    |\n",
-      "|      te.0.ffn.lin2.weight     |  2359296   |\n",
-      "|       te.0.ffn.lin2.bias      |    768     |\n",
-      "| te.0.output_layer_norm.weight |    768     |\n",
-      "|  te.0.output_layer_norm.bias  |    768     |\n",
-      "|  te.1.attention.q_lin.weight  |   589824   |\n",
-      "|   te.1.attention.q_lin.bias   |    768     |\n",
-      "|  te.1.attention.k_lin.weight  |   589824   |\n",
-      "|   te.1.attention.k_lin.bias   |    768     |\n",
-      "|  te.1.attention.v_lin.weight  |   589824   |\n",
-      "|   te.1.attention.v_lin.bias   |    768     |\n",
-      "| te.1.attention.out_lin.weight |   589824   |\n",
-      "|  te.1.attention.out_lin.bias  |    768     |\n",
-      "|   te.1.sa_layer_norm.weight   |    768     |\n",
-      "|    te.1.sa_layer_norm.bias    |    768     |\n",
-      "|      te.1.ffn.lin1.weight     |  2359296   |\n",
-      "|       te.1.ffn.lin1.bias      |    3072    |\n",
-      "|      te.1.ffn.lin2.weight     |  2359296   |\n",
-      "|       te.1.ffn.lin2.bias      |    768     |\n",
-      "| te.1.output_layer_norm.weight |    768     |\n",
-      "|  te.1.output_layer_norm.bias  |    768     |\n",
-      "|       classifier.weight       |    1536    |\n",
-      "|        classifier.bias        |     2      |\n",
-      "+-------------------------------+------------+\n",
-      "Total Trainable Params: 14177282\n"
-     ]
-    },
-    {
-     "data": {
-      "text/plain": [
-       "14177282"
-      ]
-     },
-     "execution_count": 71,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "count_parameters(model)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "c386c2eb",
-   "metadata": {},
-   "source": [
-    "### Testing the Model"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 72,
-   "id": "818deed3",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# get smaller dataset\n",
-    "batch_size = 8\n",
-    "test_ds = Dataset(squad_paths = squad_paths[:2], natural_question_paths=None, hotpotqa_paths=None, tokenizer=tokenizer)\n",
-    "test_ds_loader = torch.utils.data.DataLoader(test_ds, batch_size=batch_size)\n",
-    "optim = torch.optim.Adam(model.parameters())"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 73,
-   "id": "9da40760",
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Passed\n"
-     ]
-    }
-   ],
-   "source": [
-    "test_model(model, optim, test_ds_loader, device)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "c3f80248",
-   "metadata": {},
-   "source": [
-    "### Model Training"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 24,
-   "id": "e1adabe6",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from torch.optim import AdamW, RMSprop\n",
-    "\n",
-    "model.train()\n",
-    "optim = AdamW(model.parameters(), lr=1e-4)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 25,
-   "id": "efe1cbd5",
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "8785757b04214102830ded36c1392c8d",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "  0%|          | 0/35000 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Mean Training Error 2.6535016193100383\n"
-     ]
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "836f5365498642fa9ae891a86dca5892",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "  0%|          | 0/2500 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Mean Test Error 2.384517493388057\n"
-     ]
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "981e1cef83a1477e920d1cdbffdfcde1",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "  0%|          | 0/35000 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Mean Training Error 2.172889394424643\n"
-     ]
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "20a785e7fefb43239f1120992d2c3416",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "  0%|          | 0/2500 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Mean Test Error 2.013008696398139\n"
-     ]
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "47831e65b1ed4be78e8e7cb24068b0c3",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "  0%|          | 0/35000 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Mean Training Error 1.9743544759827\n"
-     ]
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "15904a3f930249fb944ea87184676e14",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "  0%|          | 0/2500 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Mean Test Error 1.8922049684919418\n"
-     ]
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "108bdbf644d94d78910195992b9e2652",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "  0%|          | 0/35000 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Mean Training Error 1.857202093189742\n"
-     ]
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "d6a75a6ab40d4a2599b7511bfc60bf83",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "  0%|          | 0/2500 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Mean Test Error 1.793771461571753\n"
-     ]
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "d3468a6ba72a4f42b0e7cc77ee0a0011",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "  0%|          | 0/35000 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Mean Training Error 1.7750537034896867\n"
-     ]
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "8aca0aa529d2452e8bd29fe7ada934f2",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "  0%|          | 0/2500 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Mean Test Error 1.7466133671954274\n"
-     ]
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "e09abdfa63c841ce97f445ba9b3eeaa8",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "  0%|          | 0/35000 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Mean Training Error 1.7097622096568346\n"
-     ]
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "0f49dd32d33e4f398be0942a59d735ce",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "  0%|          | 0/2500 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Mean Test Error 1.7642206047609448\n"
-     ]
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "a493dd70ffb64cd19830e5dc98607979",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "  0%|          | 0/35000 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "\n",
-      "KeyboardInterrupt\n",
-      "\n"
-     ]
-    }
-   ],
-   "source": [
-    "epochs = 16\n",
-    "\n",
-    "for epoch in range(epochs):\n",
-    "    loop = tqdm(loader, leave=True)\n",
-    "    model.train()\n",
-    "    mean_training_error = []\n",
-    "    for batch in loop:\n",
-    "        optim.zero_grad()\n",
-    "        \n",
-    "        input_ids = batch['input_ids'].to(device)\n",
-    "        attention_mask = batch['attention_mask'].to(device)\n",
-    "        start = batch['start_positions'].to(device)\n",
-    "        end = batch['end_positions'].to(device)\n",
-    "        \n",
-    "        outputs = model(input_ids, attention_mask=attention_mask, start_positions=start, end_positions=end)\n",
-    "        # print(torch.argmax(outputs['start_logits'],axis=1), torch.argmax(outputs['end_logits'], axis=1), start, end)\n",
-    "        loss = outputs['loss']\n",
-    "        loss.backward()\n",
-    "        # torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)\n",
-    "        optim.step()\n",
-    "        mean_training_error.append(loss.item())\n",
-    "        loop.set_description(f'Epoch {epoch}')\n",
-    "        loop.set_postfix(loss=loss.item())\n",
-    "    print(\"Mean Training Error\", np.mean(mean_training_error))\n",
-    "    \n",
-    "    loop = tqdm(test_loader, leave=True)\n",
-    "    model.eval()\n",
-    "    mean_test_error = []\n",
-    "    for batch in loop:\n",
-    "        \n",
-    "        input_ids = batch['input_ids'].to(device)\n",
-    "        attention_mask = batch['attention_mask'].to(device)\n",
-    "        start = batch['start_positions'].to(device)\n",
-    "        end = batch['end_positions'].to(device)\n",
-    "        \n",
-    "        with torch.no_grad():  # evaluation only, so skip building the autograd graph\n",
-    "            outputs = model(input_ids, attention_mask=attention_mask, start_positions=start, end_positions=end)\n",
-    "        # print(torch.argmax(outputs['start_logits'],axis=1), torch.argmax(outputs['end_logits'], axis=1), start, end)\n",
-    "        loss = outputs['loss']\n",
-    "        \n",
-    "        mean_test_error.append(loss.item())\n",
-    "        loop.set_description(f'Epoch {epoch} Testset')\n",
-    "        loop.set_postfix(loss=loss.item())\n",
-    "    print(\"Mean Test Error\", np.mean(mean_test_error))\n",
-    "    torch.save(model.state_dict(), \"distilbert_reuse_{}\".format(epoch))"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 48,
-   "id": "fdf37d18",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "torch.save(model.state_dict(), \"distilbert_reuse.model\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 49,
-   "id": "d1cfded4",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "m = ReuseQuestionDistilBERT(mod)\n",
-    "m.load_state_dict(torch.load(\"distilbert_reuse.model\", map_location=device))\n",
-    "model = m"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 47,
-   "id": "233bdc18",
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "100%|██████████| 2500/2500 [02:51<00:00, 14.59it/s]"
-     ]
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Mean EM:  0.5178\n",
-      "Mean F-1:  0.6671140689626448\n"
-     ]
-    },
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "\n"
-     ]
-    }
-   ],
-   "source": [
-    "eval_test_set(model, tokenizer, test_loader, device)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "0fb1ce9e",
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3.10.8 ('venv': venv)",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.10.8"
-  },
-  "toc": {
-   "base_numbering": 1,
-   "nav_menu": {},
-   "number_sections": true,
-   "sideBar": true,
-   "skip_h1_title": false,
-   "title_cell": "Table of Contents",
-   "title_sidebar": "Contents",
-   "toc_cell": false,
-   "toc_position": {},
-   "toc_section_display": true,
-   "toc_window_display": false
-  },
-  "varInspector": {
-   "cols": {
-    "lenName": 16,
-    "lenType": 16,
-    "lenVar": 40
-   },
-   "kernels_config": {
-    "python": {
-     "delete_cmd_postfix": "",
-     "delete_cmd_prefix": "del ",
-     "library": "var_list.py",
-     "varRefreshCmd": "print(var_dic_list())"
-    },
-    "r": {
-     "delete_cmd_postfix": ") ",
-     "delete_cmd_prefix": "rm(",
-     "library": "var_list.r",
-     "varRefreshCmd": "cat(var_dic_list()) "
-    }
-   },
-   "types_to_exclude": [
-    "module",
-    "function",
-    "builtin_function_or_method",
-    "instance",
-    "_Feature"
-   ],
-   "window_display": false
-  },
-  "vscode": {
-   "interpreter": {
-    "hash": "85bf9c14e9ba73b783ed1274d522bec79eb0b2b739090180d8ce17bb11aff4aa"
-   }
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
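
The final cell of the deleted notebook reports "Mean EM" and "Mean F-1" from `eval_test_set`, which is imported from `util` and not shown in this diff. As a rough illustration of what such metrics usually mean for extractive question answering, here is a minimal sketch that assumes SQuAD-style scoring (normalized exact match and token-overlap F1). The function names `normalize_answer`, `exact_match` and `f1_score` are hypothetical and are not taken from the repository; the actual `eval_test_set` may compute its scores differently.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    punctuation = set(string.punctuation)
    text = text.lower()
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not truth_tokens:
        return float(pred_tokens == truth_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this assumed definition, a dataset-level score is the mean of the per-example values, so a "Mean F-1" above the "Mean EM" would reflect predictions that overlap the reference answer without matching it exactly.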