sanjudebnath/Numini

Question Answering

sanjudebnath committed on Feb 16, 2025

Commit 89eec77 (verified) · 1 parent: a966e1f

Upload 2 files

Files changed (2)

load_data.ipynb +850 -0
question_answering.ipynb +1832 -0

load_data.ipynb ADDED Viewed

	@@ -0,0 +1,850 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "12d87b30",
+   "metadata": {},
+   "source": [
+    "# Load Data\n",
+    "This notebook loads and preprocesses all necessary data, namely the following.\n",
+    "* OpenWebTextCorpus: for the base DistilBERT model\n",
+    "* SQuAD dataset: for Q&A\n",
+    "* Natural Questions (needs to be downloaded externally but is preprocessed here): for Q&A\n",
+    "* HotpotQA: for Q&A"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "7c82d7fa",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from tqdm.auto import tqdm\n",
+    "from datasets import load_dataset\n",
+    "import os\n",
+    "import pandas as pd\n",
+    "import random"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1737f219",
+   "metadata": {},
+   "source": [
+    "## DistilBERT Data\n",
+    "In the following, we download the English OpenWebText dataset from Hugging Face (https://huggingface.co/datasets/openwebtext). The dataset is provided by Aaron Gokaslan and Vanya Cohen from Brown University (https://skylion007.github.io/OpenWebTextCorpus/).\n",
+    "\n",
+    "We first load the data, inspect its structure, and write the dataset into files of 10,000 texts each."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cce7623c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ds = load_dataset(\"openwebtext\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "678a5e86",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "DatasetDict({\n",
+       "    train: Dataset({\n",
+       "        features: ['text'],\n",
+       "        num_rows: 8013769\n",
+       "    })\n",
+       "})"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# we have a text-only training dataset with 8 million entries\n",
+    "ds"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "b141bce7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# create necessary folders (exist_ok makes the cell safe to re-run)\n",
+    "os.makedirs('data/original', exist_ok=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ca94f995",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# save text in chunks of 10000 samples\n",
+    "text = []\n",
+    "i = 0\n",
+    "\n",
+    "for sample in tqdm(ds['train']):\n",
+    "    # replace newlines with spaces so each sample fits on one line without gluing words together\n",
+    "    sample = sample['text'].replace('\\n', ' ')\n",
+    "    \n",
+    "    # append cleaned sample to all texts\n",
+    "    text.append(sample)\n",
+    "    \n",
+    "    # if we processed 10000 samples, write them to a file and start over\n",
+    "    if len(text) == 10000:\n",
+    "        with open(f\"data/original/text_{i}.txt\", 'w', encoding='utf-8') as f:\n",
+    "            f.write('\\n'.join(text))\n",
+    "        text = []\n",
+    "        i += 1 \n",
+    "\n",
+    "# write remaining samples to a file\n",
+    "with open(f\"data/original/text_{i}.txt\", 'w', encoding='utf-8') as f:\n",
+    "    f.write('\\n'.join(text))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f131dcfc",
+   "metadata": {},
+   "source": [
+    "### Testing\n",
+    "If we load the first file, we should get a file that is 10000 lines long and has one column.\n",
+    "\n",
+    "As we do not preprocess the data beyond removing newlines, but just write the text that was read into the file, this is all the testing that is necessary."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "id": "df50af74",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with open(\"data/original/text_0.txt\", 'r', encoding='utf-8') as f:\n",
+    "    lines = f.read().split('\\n')\n",
+    "lines = pd.DataFrame(lines)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "id": "8ddb0085",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Passed\n"
+     ]
+    }
+   ],
+   "source": [
+    "assert lines.shape==(10000,1)\n",
+    "print(\"Passed\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1a65b268",
+   "metadata": {},
+   "source": [
+    "## SQuAD Data\n",
+    "In the following, we download the SQuAD dataset from Hugging Face (https://huggingface.co/datasets/squad). It was initially provided by Rajpurkar et al. from Stanford University.\n",
+    "\n",
+    "We again load the dataset and store it in files of 1,000 samples each."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6750ce6e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dataset = load_dataset(\"squad\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "65a7ee23",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "os.makedirs(\"data/training_squad\", exist_ok=True)\n",
+    "os.makedirs(\"data/test_squad\", exist_ok=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f6ebf63e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# we already have a training and a validation split (the latter serves as the test set here). Each sample has an id, title, context, question and answers.\n",
+    "dataset"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f67ae448",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# answers are provided in this form - answer_end is not included, so we need to derive it for the model\n",
+    "dataset['train']['answers'][0]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "101cd650",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# column contains the split (either train or validation), save_dir is the directory\n",
+    "def save_samples(column, save_dir):\n",
+    "    text = []\n",
+    "    i = 0\n",
+    "\n",
+    "    for sample in tqdm(dataset[column]):\n",
+    "        \n",
+    "        # preprocess the context and question: replace newlines with spaces, which keeps\n",
+    "        # the string length (and therefore answer_start) unchanged\n",
+    "        context = sample['context'].replace('\\n', ' ')\n",
+    "        question = sample['question'].replace('\\n', ' ')\n",
+    "\n",
+    "        # get the answer as text and start character index\n",
+    "        answer_text = sample['answers']['text'][0]\n",
+    "        answer_start = str(sample['answers']['answer_start'][0])\n",
+    "        \n",
+    "        text.append([context, question, answer_text, answer_start])\n",
+    "\n",
+    "        # we choose chunks of 1000\n",
+    "        if len(text) == 1000:\n",
+    "            with open(f\"data/{save_dir}/text_{i}.txt\", 'w', encoding='utf-8') as f:\n",
+    "                f.write(\"\\n\".join([\"\\t\".join(t) for t in text]))\n",
+    "            text = []\n",
+    "            i += 1\n",
+    "\n",
+    "    # save remaining\n",
+    "    with open(f\"data/{save_dir}/text_{i}.txt\", 'w', encoding='utf-8') as f:\n",
+    "        f.write(\"\\n\".join([\"\\t\".join(t) for t in text]))\n",
+    "\n",
+    "save_samples(\"train\", \"training_squad\")\n",
+    "save_samples(\"validation\", \"test_squad\")\n",
+    "    "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "67044d13",
+   "metadata": {
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    }
+   },
+   "source": [
+    "### Testing\n",
+    "If we load a file, we should get a file with 1000 lines and 4 columns.\n",
+    "\n",
+    "Also, we want to ensure that `answer_start` marks the correct interval in the context. Hence the second test."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "446281cf",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with open(\"data/training_squad/text_0.txt\", 'r', encoding='utf-8') as f:\n",
+    "    lines = f.read().split('\\n')\n",
+    "    \n",
+    "lines = pd.DataFrame([line.split(\"\\t\") for line in lines], columns=[\"context\", \"question\", \"answer\", \"answer_start\"])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ccd5c650",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "assert lines.shape==(1000,4)\n",
+    "print(\"Passed\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2c9e4b70",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# we assert that we have the right interval\n",
+    "for ind, line in lines.iterrows():\n",
+    "    sample = line\n",
+    "    answer_start = int(sample['answer_start'])\n",
+    "    assert sample['context'][answer_start:answer_start+len(sample['answer'])] == sample['answer']\n",
+    "print(\"Passed\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "02265ace",
+   "metadata": {},
+   "source": [
+    "## Natural Questions Dataset\n",
+    "* Download from https://ai.google.com/research/NaturalQuestions via gsutil (the version on Hugging Face is 134.92 GB; the one from Google Cloud comes as archives)\n",
+    "* Use gunzip to extract some samples, which yields `.jsonl` files\n",
+    "* The dataset is a lot messier, as it consists of raw Wikipedia articles with all web artifacts\n",
+    "  * I cleaned the HTML tags\n",
+    "  * I also chose a random interval (containing the answer) from each document\n",
+    "  * We can't feed the whole text into the model anyway"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f3bce0c1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pathlib import Path\n",
+    "paths = [str(x) for x in Path('data/natural_questions/v1.0/train/').glob('**/*.jsonl')]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e9c58c00",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "os.makedirs(\"data/natural_questions_train\", exist_ok=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0ed7ba6c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import re\n",
+    "\n",
+    "# clean html tags\n",
+    "CLEANR = re.compile(r'<.+?>')\n",
+    "# clean multiple spaces\n",
+    "CLEANMULTSPACE = re.compile(r'\\s+')\n",
+    "\n",
+    "# the function takes an html document and removes artifacts\n",
+    "def cleanhtml(raw_html):\n",
+    "    # tags\n",
+    "    cleantext = re.sub(CLEANR, '', raw_html)\n",
+    "    # newlines\n",
+    "    cleantext = cleantext.replace(\"\\n\", '')\n",
+    "    # tabs\n",
+    "    cleantext = cleantext.replace(\"\\t\", '')\n",
+    "    # character encodings\n",
+    "    cleantext = cleantext.replace(\"&#39;\", \"'\")\n",
+    "    cleantext = cleantext.replace(\"&amp;\", \"&\")\n",
+    "    cleantext = cleantext.replace(\"&quot;\", '\"')\n",
+    "    # multiple spaces\n",
+    "    cleantext = re.sub(CLEANMULTSPACE, ' ', cleantext)\n",
+    "    # documents end with this comment; if it is present in the string, cut it off\n",
+    "    idx = cleantext.find(\"<!-- NewPP limit\")\n",
+    "    if idx > -1:\n",
+    "        cleantext = cleantext[:idx]\n",
+    "    return cleantext.strip()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "66ca19ac",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "\n",
+    "# file count\n",
+    "i = 0\n",
+    "data = []\n",
+    "\n",
+    "# iterate over all json files\n",
+    "for path in paths:\n",
+    "    print(path)\n",
+    "    # read file and store as list (this requires much memory, as the files are huge)\n",
+    "    with open(path, 'r', encoding='utf-8') as json_file:\n",
+    "        json_list = list(json_file)\n",
+    "    \n",
+    "    # process every context, question, answer pair\n",
+    "    for json_str in json_list:\n",
+    "        result = json.loads(json_str)\n",
+    "\n",
+    "        # append a question mark - SQuAD questions end with a qm too\n",
+    "        question = result['question_text'] + \"?\"\n",
+    "        \n",
+    "        # some questions do not contain a short answer - we do not need them\n",
+    "        if(len(result['annotations'][0]['short_answers'])==0):\n",
+    "            continue\n",
+    "\n",
+    "        # get true start/end byte\n",
+    "        true_start = result['annotations'][0]['short_answers'][0]['start_byte']\n",
+    "        true_end = result['annotations'][0]['short_answers'][0]['end_byte']\n",
+    "\n",
+    "        # convert to bytes\n",
+    "        byte_encoding = bytes(result['document_html'], encoding='utf-8')\n",
+    "        \n",
+    "        # the document is the whole wikipedia article, we randomly choose an appropriate part (containing the\n",
+    "        # answer): the model takes 512 tokens as input - a window of 3500 bytes plus the answer leads to a good length\n",
+    "        max_back = 3500 if true_start >= 3500 else true_start\n",
+    "        first = random.randint(int(true_start)-max_back, int(true_start))\n",
+    "        end = first + 3500 + true_end - true_start\n",
+    "        \n",
+    "        # get chosen context\n",
+    "        cleanbytes = byte_encoding[first:end]\n",
+    "        # decode back to text - if the window boundary cuts through a multi-byte character, we ignore it and cut it off\n",
+    "        cleantext = bytes.decode(cleanbytes, errors='ignore')\n",
+    "        # clean html tags\n",
+    "        cleantext = cleanhtml(cleantext)\n",
+    "\n",
+    "        # find the true answer\n",
+    "        answer_start = cleanbytes.find(byte_encoding[true_start:true_end])\n",
+    "        true_answer = bytes.decode(cleanbytes[answer_start:answer_start+(true_end-true_start)])\n",
+    "        \n",
+    "        # clean html tags\n",
+    "        true_answer = cleanhtml(true_answer)\n",
+    "        \n",
+    "        start_ind = cleantext.find(true_answer)\n",
+    "        \n",
+    "        # if cleaning the string makes the answer impossible to find, skip the sample\n",
+    "        # this hardly ever happens, except if there is an immense amount of web artifacts\n",
+    "        if start_ind == -1:\n",
+    "            continue\n",
+    "            \n",
+    "        data.append([cleantext, question, true_answer, str(start_ind)])\n",
+    "\n",
+    "        if len(data) == 1000:\n",
+    "            with open(f\"data/natural_questions_train/text_{i}.txt\", 'w', encoding='utf-8') as f:\n",
+    "                f.write(\"\\n\".join([\"\\t\".join(t) for t in data]))\n",
+    "            i += 1\n",
+    "            data = []\n",
+    "with open(f\"data/natural_questions_train/text_{i}.txt\", 'w', encoding='utf-8') as f:\n",
+    "    f.write(\"\\n\".join([\"\\t\".join(t) for t in data]))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "30f26b4e",
+   "metadata": {},
+   "source": [
+    "### Testing\n",
+    "In the following, we first check if the shape of the file is correct.\n",
+    "\n",
+    "Then we iterate over the file and check if the answers according to the file are the same as in the original file."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "490ac0db",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with open(\"data/natural_questions_train/text_0.txt\", 'r', encoding='utf-8') as f:\n",
+    "    lines = f.read().split('\\n')\n",
+    "    \n",
+    "lines = pd.DataFrame([line.split(\"\\t\") for line in lines], columns=[\"context\", \"question\", \"answer\", \"answer_start\"])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0d7cc3ee",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "assert lines.shape == (1000, 4)\n",
+    "print(\"Passed\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0fd8a854",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with open(\"data/natural_questions/v1.0/train/nq-train-00.jsonl\", 'r', encoding='utf-8') as json_file:\n",
+    "    json_list = list(json_file)[:500]\n",
+    "del json_file"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "170bff30",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "lines_index = 0\n",
+    "for i in range(len(json_list)):\n",
+    "    result = json.loads(json_list[i])\n",
+    "     \n",
+    "    if(len(result['annotations'][0]['short_answers'])==0):\n",
+    "        pass\n",
+    "    else: \n",
+    "        # assert that the question text is the same\n",
+    "        assert result['question_text'] + \"?\" == lines.loc[lines_index, 'question']\n",
+    "        true_start = result['annotations'][0]['short_answers'][0]['start_byte']\n",
+    "        true_end = result['annotations'][0]['short_answers'][0]['end_byte']\n",
+    "        true_answer = bytes.decode(bytes(result['document_html'], encoding='utf-8')[true_start:true_end])\n",
+    "        \n",
+    "        processed_answer = lines.loc[lines_index, 'answer']\n",
+    "        # assert that the answer is the same\n",
+    "        assert cleanhtml(true_answer) == processed_answer\n",
+    "    \n",
+    "        start_ind = int(lines.loc[lines_index, 'answer_start'])\n",
+    "        # assert that the answer (according to the index) is the same\n",
+    "        assert cleanhtml(true_answer) == lines.loc[lines_index, 'context'][start_ind:start_ind+len(processed_answer)]\n",
+    "        \n",
+    "        lines_index += 1\n",
+    "    \n",
+    "    if lines_index == len(lines):\n",
+    "        break\n",
+    "print(\"Passed\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "78e6e737",
+   "metadata": {},
+   "source": [
+    "## Hotpot QA"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "27efcc8c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ds = load_dataset(\"hotpot_qa\", 'fullwiki')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1493f21f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ds"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2a047946",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "os.makedirs('data/hotpotqa_training', exist_ok=True)\n",
+    "os.makedirs('data/hotpotqa_test', exist_ok=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e65b6485",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# column contains the split (either train or validation), save_dir is the directory\n",
+    "def save_samples(column, save_dir):\n",
+    "    text = []\n",
+    "    i = 0\n",
+    "\n",
+    "    for sample in tqdm(ds[column]):\n",
+    "        \n",
+    "        # build the context by joining the sentences of all paragraphs; remove newlines from the question\n",
+    "        context = sample['context']['sentences']\n",
+    "        context = \" \".join([\"\".join(sentence) for sentence in context])\n",
+    "        question = sample['question'].replace('\\n','')\n",
+    "        \n",
+    "        # get the answer as text and start character index; skip samples whose answer is not in the context\n",
+    "        answer_text = sample['answer']\n",
+    "        answer_start = context.find(answer_text)\n",
+    "        if answer_start == -1:\n",
+    "            continue\n",
+    "\n",
+    "        # for long contexts, keep a random window of about 1500 characters that contains the answer\n",
+    "        if answer_start > 1500:\n",
+    "            first = random.randint(answer_start-1500, answer_start)\n",
+    "            end = first + 1500 + len(answer_text)\n",
+    "            \n",
+    "            context = context[first:end+1]\n",
+    "            answer_start = context.find(answer_text)\n",
+    "            \n",
+    "            if answer_start == -1:\n",
+    "                continue\n",
+    "            \n",
+    "        text.append([context, question, answer_text, str(answer_start)])\n",
+    "\n",
+    "        # we choose chunks of 1000\n",
+    "        if len(text) == 1000:\n",
+    "            with open(f\"data/{save_dir}/text_{i}.txt\", 'w', encoding='utf-8') as f:\n",
+    "                f.write(\"\\n\".join([\"\\t\".join(t) for t in text]))\n",
+    "            text = []\n",
+    "            i += 1\n",
+    "\n",
+    "    # save remaining\n",
+    "    with open(f\"data/{save_dir}/text_{i}.txt\", 'w', encoding='utf-8') as f:\n",
+    "        f.write(\"\\n\".join([\"\\t\".join(t) for t in text]))\n",
+    "\n",
+    "save_samples(\"train\", \"hotpotqa_training\")\n",
+    "save_samples(\"validation\", \"hotpotqa_test\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "97cc358f",
+   "metadata": {},
+   "source": [
+    "### Testing"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f321483c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with open(\"data/hotpotqa_training/text_0.txt\", 'r', encoding='utf-8') as f:\n",
+    "    lines = f.read().split('\\n')\n",
+    "    \n",
+    "lines = pd.DataFrame([line.split(\"\\t\") for line in lines], columns=[\"context\", \"question\", \"answer\", \"answer_start\"])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "72a96e78",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "assert lines.shape == (1000, 4)\n",
+    "print(\"Passed\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c32c2f16",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# we assert that we have the right interval\n",
+    "for ind, line in lines.iterrows():\n",
+    "    sample = line\n",
+    "    answer_start = int(sample['answer_start'])\n",
+    "    assert sample['context'][answer_start:answer_start+len(sample['answer'])] == sample['answer']\n",
+    "print(\"Passed\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "bc36fe7d",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.16"
+  },
+  "toc": {
+   "base_numbering": 1,
+   "nav_menu": {},
+   "number_sections": true,
+   "sideBar": true,
+   "skip_h1_title": false,
+   "title_cell": "Table of Contents",
+   "title_sidebar": "Contents",
+   "toc_cell": false,
+   "toc_position": {},
+   "toc_section_display": true,
+   "toc_window_display": false
+  },
+  "varInspector": {
+   "cols": {
+    "lenName": 16,
+    "lenType": 16,
+    "lenVar": 40
+   },
+   "kernels_config": {
+    "python": {
+     "delete_cmd_postfix": "",
+     "delete_cmd_prefix": "del ",
+     "library": "var_list.py",
+     "varRefreshCmd": "print(var_dic_list())"
+    },
+    "r": {
+     "delete_cmd_postfix": ") ",
+     "delete_cmd_prefix": "rm(",
+     "library": "var_list.r",
+     "varRefreshCmd": "cat(var_dic_list()) "
+    }
+   },
+   "types_to_exclude": [
+    "module",
+    "function",
+    "builtin_function_or_method",
+    "instance",
+    "_Feature"
+   ],
+   "window_display": false
+  },
+  "vscode": {
+   "interpreter": {
+    "hash": "85bf9c14e9ba73b783ed1274d522bec79eb0b2b739090180d8ce17bb11aff4aa"
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
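
Note on load_data.ipynb: the SQuAD, Natural Questions and HotpotQA cells all repeat the same pattern, namely writing tab-separated rows into files of 1000 samples and then checking that `answer_start` still points at the answer. The following is a minimal sketch of that pattern as one reusable helper, not code from this commit. The names `sanitize`, `write_chunks` and `read_chunk` are hypothetical, and replacing each tab or newline with exactly one space is an assumption made here so that character offsets stay valid and no field can break the file layout.

```python
import tempfile
from pathlib import Path


def sanitize(field):
    # Hypothetical helper: swap each tab, newline or carriage return for one space.
    # One-for-one replacement keeps the string length, so answer offsets stay valid.
    return field.replace("\t", " ").replace("\n", " ").replace("\r", " ")


def _write(path, chunk):
    # One row per line, fields separated by tabs (the layout used in the notebook).
    lines = ["\t".join(sanitize(field) for field in row) for row in chunk]
    path.write_text("\n".join(lines), encoding="utf-8")


def write_chunks(rows, save_dir, chunk_size=1000):
    """Write rows (lists of strings) into text_<i>.txt files of chunk_size rows each."""
    save_dir = Path(save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)
    paths, chunk = [], []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            paths.append(save_dir / f"text_{len(paths)}.txt")
            _write(paths[-1], chunk)
            chunk = []
    if chunk:  # unlike the notebook, skip the trailing file when nothing is left over
        paths.append(save_dir / f"text_{len(paths)}.txt")
        _write(paths[-1], chunk)
    return paths


def read_chunk(path):
    # Inverse of _write: split into lines, then into tab-separated fields.
    text = Path(path).read_text(encoding="utf-8")
    return [line.split("\t") for line in text.split("\n")]


# Toy rows in the notebook's column order: context, question, answer, answer_start.
rows = [
    ["The sky\nis blue.", "What color is the sky?", "blue", "11"],
    ["Paris is the capital\tof France.", "What is the capital of France?", "Paris", "0"],
    ["Water boils at 100 degrees Celsius.", "When does water boil?", "100 degrees Celsius", "15"],
]

with tempfile.TemporaryDirectory() as tmp:
    paths = write_chunks(rows, tmp, chunk_size=2)
    for path in paths:
        for context, question, answer, start in read_chunk(path):
            start = int(start)
            # Same interval check as in the notebook's Testing cells.
            assert context[start:start + len(answer)] == answer
    print(f"{len(paths)} files written, all answer spans verified")
```

With a helper of this shape, each dataset cell would only need to produce the four fields per sample. Whether skipping an empty trailing file is desirable depends on how the downstream `Dataset` class reads the directory, which this commit does not show.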

question_answering.ipynb ADDED Viewed

	@@ -0,0 +1,1832 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "19817716",
+   "metadata": {},
+   "source": [
+    "# Question Answering\n",
+    "The following notebook contains different question answering models. We will start by introducing a representation for the dataset and corresponding DataLoader and then evaluate different models."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 50,
+   "id": "49bf46c6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from transformers import DistilBertModel, DistilBertForMaskedLM, DistilBertConfig, \\\n",
+    "            DistilBertTokenizerFast, AutoTokenizer, BertModel, BertForMaskedLM, BertTokenizerFast, BertConfig\n",
+    "from torch import nn\n",
+    "from pathlib import Path\n",
+    "import torch\n",
+    "import pandas as pd\n",
+    "from typing import Optional \n",
+    "from tqdm.auto import tqdm\n",
+    "from util import eval_test_set, count_parameters\n",
+    "from torch.optim import AdamW, RMSprop\n",
+    "\n",
+    "\n",
+    "from qa_model import QuestionDistilBERT, SimpleQuestionDistilBERT, ReuseQuestionDistilBERT, Dataset, test_model"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3ea47820",
+   "metadata": {},
+   "source": [
+    "## Data\n",
+    "The data processing is partly based on the Hugging Face tutorial (https://huggingface.co/course/chapter7/7?fw=pt)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 51,
+   "id": "7b1b2b3e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 52,
+   "id": "f276eba7",
+   "metadata": {
+    "scrolled": false
+   },
+   "outputs": [],
+   "source": [
+    "# collect the paths of the training files for each dataset\n",
+    "squad_paths = [str(x) for x in Path('data/training_squad/').glob('**/*.txt')]\n",
+    "nat_paths = [str(x) for x in Path('data/natural_questions_train/').glob('**/*.txt')]\n",
+    "hotpotqa_paths = [str(x) for x in Path('data/hotpotqa_training/').glob('**/*.txt')]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ad8d532a",
+   "metadata": {},
+   "source": [
+    "## POC Model\n",
+    "* Works very well:\n",
+    "  * Dropout 0.1 is too small (overfitting after the first epoch) - changed to 0.15\n",
+    "  * Difference between AdamW and RMSprop is minimal\n",
+    "  \n",
+    "### Results:\n",
+    "Dropout = 0.15\n",
+    "* Mean EM:  0.5374\n",
+    "* Mean F-1:  0.6826317532406944\n",
+    "\n",
+    "Dropout = 0.2 (overfitting relatively similar to the first setting, but the dropout seems to be too high)\n",
+    "* Mean EM:  0.5044\n",
+    "* Mean F-1:  0.6437359169276439"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 54,
+   "id": "703e7f38",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dataset = Dataset(squad_paths = squad_paths, natural_question_paths=None, hotpotqa_paths=hotpotqa_paths, tokenizer=tokenizer)\n",
+    "loader = torch.utils.data.DataLoader(dataset, batch_size=8)\n",
+    "\n",
+    "test_dataset = Dataset(squad_paths = [str(x) for x in Path('data/test_squad/').glob('**/*.txt')], \n",
+    "                       natural_question_paths=None, \n",
+    "                       hotpotqa_paths = None, tokenizer=tokenizer)\n",
+    "test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=4)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 55,
+   "id": "6672f614",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model = DistilBertForMaskedLM.from_pretrained(\"distilbert-base-uncased\")\n",
+    "config = DistilBertConfig.from_pretrained(\"distilbert-base-uncased\")\n",
+    "mod = model.distilbert"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 56,
+   "id": "dec15198",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "SimpleQuestionDistilBERT(\n",
+       "  (distilbert): DistilBertModel(\n",
+       "    (embeddings): Embeddings(\n",
+       "      (word_embeddings): Embedding(30522, 768, padding_idx=0)\n",
+       "      (position_embeddings): Embedding(512, 768)\n",
+       "      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "      (dropout): Dropout(p=0.1, inplace=False)\n",
+       "    )\n",
+       "    (transformer): Transformer(\n",
+       "      (layer): ModuleList(\n",
+       "        (0): TransformerBlock(\n",
+       "          (attention): MultiHeadSelfAttention(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "          )\n",
+       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "          (ffn): FFN(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
+       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
+       "            (activation): GELUActivation()\n",
+       "          )\n",
+       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "        )\n",
+       "        (1): TransformerBlock(\n",
+       "          (attention): MultiHeadSelfAttention(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "          )\n",
+       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "          (ffn): FFN(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
+       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
+       "            (activation): GELUActivation()\n",
+       "          )\n",
+       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "        )\n",
+       "        (2): TransformerBlock(\n",
+       "          (attention): MultiHeadSelfAttention(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "          )\n",
+       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "          (ffn): FFN(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
+       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
+       "            (activation): GELUActivation()\n",
+       "          )\n",
+       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "        )\n",
+       "        (3): TransformerBlock(\n",
+       "          (attention): MultiHeadSelfAttention(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "          )\n",
+       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "          (ffn): FFN(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
+       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
+       "            (activation): GELUActivation()\n",
+       "          )\n",
+       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "        )\n",
+       "        (4): TransformerBlock(\n",
+       "          (attention): MultiHeadSelfAttention(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "          )\n",
+       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "          (ffn): FFN(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
+       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
+       "            (activation): GELUActivation()\n",
+       "          )\n",
+       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "        )\n",
+       "        (5): TransformerBlock(\n",
+       "          (attention): MultiHeadSelfAttention(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "          )\n",
+       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "          (ffn): FFN(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
+       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
+       "            (activation): GELUActivation()\n",
+       "          )\n",
+       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "        )\n",
+       "      )\n",
+       "    )\n",
+       "  )\n",
+       "  (dropout): Dropout(p=0.5, inplace=False)\n",
+       "  (classifier): Linear(in_features=768, out_features=2, bias=True)\n",
+       ")"
+      ]
+     },
+     "execution_count": 56,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')\n",
+    "model = SimpleQuestionDistilBERT(mod)\n",
+    "model.to(device)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 57,
+   "id": "9def3c83",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "+---------------------------------------------------------+------------+\n",
+      "|                         Modules                         | Parameters |\n",
+      "+---------------------------------------------------------+------------+\n",
+      "|       distilbert.embeddings.word_embeddings.weight      |  23440896  |\n",
+      "|     distilbert.embeddings.position_embeddings.weight    |   393216   |\n",
+      "|          distilbert.embeddings.LayerNorm.weight         |    768     |\n",
+      "|           distilbert.embeddings.LayerNorm.bias          |    768     |\n",
+      "|  distilbert.transformer.layer.0.attention.q_lin.weight  |   589824   |\n",
+      "|   distilbert.transformer.layer.0.attention.q_lin.bias   |    768     |\n",
+      "|  distilbert.transformer.layer.0.attention.k_lin.weight  |   589824   |\n",
+      "|   distilbert.transformer.layer.0.attention.k_lin.bias   |    768     |\n",
+      "|  distilbert.transformer.layer.0.attention.v_lin.weight  |   589824   |\n",
+      "|   distilbert.transformer.layer.0.attention.v_lin.bias   |    768     |\n",
+      "| distilbert.transformer.layer.0.attention.out_lin.weight |   589824   |\n",
+      "|  distilbert.transformer.layer.0.attention.out_lin.bias  |    768     |\n",
+      "|   distilbert.transformer.layer.0.sa_layer_norm.weight   |    768     |\n",
+      "|    distilbert.transformer.layer.0.sa_layer_norm.bias    |    768     |\n",
+      "|      distilbert.transformer.layer.0.ffn.lin1.weight     |  2359296   |\n",
+      "|       distilbert.transformer.layer.0.ffn.lin1.bias      |    3072    |\n",
+      "|      distilbert.transformer.layer.0.ffn.lin2.weight     |  2359296   |\n",
+      "|       distilbert.transformer.layer.0.ffn.lin2.bias      |    768     |\n",
+      "| distilbert.transformer.layer.0.output_layer_norm.weight |    768     |\n",
+      "|  distilbert.transformer.layer.0.output_layer_norm.bias  |    768     |\n",
+      "|  distilbert.transformer.layer.1.attention.q_lin.weight  |   589824   |\n",
+      "|   distilbert.transformer.layer.1.attention.q_lin.bias   |    768     |\n",
+      "|  distilbert.transformer.layer.1.attention.k_lin.weight  |   589824   |\n",
+      "|   distilbert.transformer.layer.1.attention.k_lin.bias   |    768     |\n",
+      "|  distilbert.transformer.layer.1.attention.v_lin.weight  |   589824   |\n",
+      "|   distilbert.transformer.layer.1.attention.v_lin.bias   |    768     |\n",
+      "| distilbert.transformer.layer.1.attention.out_lin.weight |   589824   |\n",
+      "|  distilbert.transformer.layer.1.attention.out_lin.bias  |    768     |\n",
+      "|   distilbert.transformer.layer.1.sa_layer_norm.weight   |    768     |\n",
+      "|    distilbert.transformer.layer.1.sa_layer_norm.bias    |    768     |\n",
+      "|      distilbert.transformer.layer.1.ffn.lin1.weight     |  2359296   |\n",
+      "|       distilbert.transformer.layer.1.ffn.lin1.bias      |    3072    |\n",
+      "|      distilbert.transformer.layer.1.ffn.lin2.weight     |  2359296   |\n",
+      "|       distilbert.transformer.layer.1.ffn.lin2.bias      |    768     |\n",
+      "| distilbert.transformer.layer.1.output_layer_norm.weight |    768     |\n",
+      "|  distilbert.transformer.layer.1.output_layer_norm.bias  |    768     |\n",
+      "|  distilbert.transformer.layer.2.attention.q_lin.weight  |   589824   |\n",
+      "|   distilbert.transformer.layer.2.attention.q_lin.bias   |    768     |\n",
+      "|  distilbert.transformer.layer.2.attention.k_lin.weight  |   589824   |\n",
+      "|   distilbert.transformer.layer.2.attention.k_lin.bias   |    768     |\n",
+      "|  distilbert.transformer.layer.2.attention.v_lin.weight  |   589824   |\n",
+      "|   distilbert.transformer.layer.2.attention.v_lin.bias   |    768     |\n",
+      "| distilbert.transformer.layer.2.attention.out_lin.weight |   589824   |\n",
+      "|  distilbert.transformer.layer.2.attention.out_lin.bias  |    768     |\n",
+      "|   distilbert.transformer.layer.2.sa_layer_norm.weight   |    768     |\n",
+      "|    distilbert.transformer.layer.2.sa_layer_norm.bias    |    768     |\n",
+      "|      distilbert.transformer.layer.2.ffn.lin1.weight     |  2359296   |\n",
+      "|       distilbert.transformer.layer.2.ffn.lin1.bias      |    3072    |\n",
+      "|      distilbert.transformer.layer.2.ffn.lin2.weight     |  2359296   |\n",
+      "|       distilbert.transformer.layer.2.ffn.lin2.bias      |    768     |\n",
+      "| distilbert.transformer.layer.2.output_layer_norm.weight |    768     |\n",
+      "|  distilbert.transformer.layer.2.output_layer_norm.bias  |    768     |\n",
+      "|  distilbert.transformer.layer.3.attention.q_lin.weight  |   589824   |\n",
+      "|   distilbert.transformer.layer.3.attention.q_lin.bias   |    768     |\n",
+      "|  distilbert.transformer.layer.3.attention.k_lin.weight  |   589824   |\n",
+      "|   distilbert.transformer.layer.3.attention.k_lin.bias   |    768     |\n",
+      "|  distilbert.transformer.layer.3.attention.v_lin.weight  |   589824   |\n",
+      "|   distilbert.transformer.layer.3.attention.v_lin.bias   |    768     |\n",
+      "| distilbert.transformer.layer.3.attention.out_lin.weight |   589824   |\n",
+      "|  distilbert.transformer.layer.3.attention.out_lin.bias  |    768     |\n",
+      "|   distilbert.transformer.layer.3.sa_layer_norm.weight   |    768     |\n",
+      "|    distilbert.transformer.layer.3.sa_layer_norm.bias    |    768     |\n",
+      "|      distilbert.transformer.layer.3.ffn.lin1.weight     |  2359296   |\n",
+      "|       distilbert.transformer.layer.3.ffn.lin1.bias      |    3072    |\n",
+      "|      distilbert.transformer.layer.3.ffn.lin2.weight     |  2359296   |\n",
+      "|       distilbert.transformer.layer.3.ffn.lin2.bias      |    768     |\n",
+      "| distilbert.transformer.layer.3.output_layer_norm.weight |    768     |\n",
+      "|  distilbert.transformer.layer.3.output_layer_norm.bias  |    768     |\n",
+      "|  distilbert.transformer.layer.4.attention.q_lin.weight  |   589824   |\n",
+      "|   distilbert.transformer.layer.4.attention.q_lin.bias   |    768     |\n",
+      "|  distilbert.transformer.layer.4.attention.k_lin.weight  |   589824   |\n",
+      "|   distilbert.transformer.layer.4.attention.k_lin.bias   |    768     |\n",
+      "|  distilbert.transformer.layer.4.attention.v_lin.weight  |   589824   |\n",
+      "|   distilbert.transformer.layer.4.attention.v_lin.bias   |    768     |\n",
+      "| distilbert.transformer.layer.4.attention.out_lin.weight |   589824   |\n",
+      "|  distilbert.transformer.layer.4.attention.out_lin.bias  |    768     |\n",
+      "|   distilbert.transformer.layer.4.sa_layer_norm.weight   |    768     |\n",
+      "|    distilbert.transformer.layer.4.sa_layer_norm.bias    |    768     |\n",
+      "|      distilbert.transformer.layer.4.ffn.lin1.weight     |  2359296   |\n",
+      "|       distilbert.transformer.layer.4.ffn.lin1.bias      |    3072    |\n",
+      "|      distilbert.transformer.layer.4.ffn.lin2.weight     |  2359296   |\n",
+      "|       distilbert.transformer.layer.4.ffn.lin2.bias      |    768     |\n",
+      "| distilbert.transformer.layer.4.output_layer_norm.weight |    768     |\n",
+      "|  distilbert.transformer.layer.4.output_layer_norm.bias  |    768     |\n",
+      "|  distilbert.transformer.layer.5.attention.q_lin.weight  |   589824   |\n",
+      "|   distilbert.transformer.layer.5.attention.q_lin.bias   |    768     |\n",
+      "|  distilbert.transformer.layer.5.attention.k_lin.weight  |   589824   |\n",
+      "|   distilbert.transformer.layer.5.attention.k_lin.bias   |    768     |\n",
+      "|  distilbert.transformer.layer.5.attention.v_lin.weight  |   589824   |\n",
+      "|   distilbert.transformer.layer.5.attention.v_lin.bias   |    768     |\n",
+      "| distilbert.transformer.layer.5.attention.out_lin.weight |   589824   |\n",
+      "|  distilbert.transformer.layer.5.attention.out_lin.bias  |    768     |\n",
+      "|   distilbert.transformer.layer.5.sa_layer_norm.weight   |    768     |\n",
+      "|    distilbert.transformer.layer.5.sa_layer_norm.bias    |    768     |\n",
+      "|      distilbert.transformer.layer.5.ffn.lin1.weight     |  2359296   |\n",
+      "|       distilbert.transformer.layer.5.ffn.lin1.bias      |    3072    |\n",
+      "|      distilbert.transformer.layer.5.ffn.lin2.weight     |  2359296   |\n",
+      "|       distilbert.transformer.layer.5.ffn.lin2.bias      |    768     |\n",
+      "| distilbert.transformer.layer.5.output_layer_norm.weight |    768     |\n",
+      "|  distilbert.transformer.layer.5.output_layer_norm.bias  |    768     |\n",
+      "|                    classifier.weight                    |    1536    |\n",
+      "|                     classifier.bias                     |     2      |\n",
+      "+---------------------------------------------------------+------------+\n",
+      "Total Trainable Params: 66364418\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "66364418"
+      ]
+     },
+     "execution_count": 57,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "count_parameters(model)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "426a6311",
+   "metadata": {},
+   "source": [
+    "### Testing the model"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 58,
+   "id": "6151c201",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# get smaller dataset\n",
+    "batch_size = 8\n",
+    "test_ds = Dataset(squad_paths = squad_paths[:2], natural_question_paths=None, hotpotqa_paths=None, tokenizer=tokenizer)\n",
+    "test_ds_loader = torch.utils.data.DataLoader(test_ds, batch_size=batch_size)\n",
+    "optim = RMSprop(model.parameters(), lr=1e-4)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 59,
+   "id": "aeae0c56",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Passed\n"
+     ]
+    }
+   ],
+   "source": [
+    "test_model(model, optim, test_ds_loader, device)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "59928d34",
+   "metadata": {},
+   "source": [
+    "### Model Training"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 60,
+   "id": "a8017b8c",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "SimpleQuestionDistilBERT(\n",
+       "  (distilbert): DistilBertModel(\n",
+       "    (embeddings): Embeddings(\n",
+       "      (word_embeddings): Embedding(30522, 768, padding_idx=0)\n",
+       "      (position_embeddings): Embedding(512, 768)\n",
+       "      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "      (dropout): Dropout(p=0.1, inplace=False)\n",
+       "    )\n",
+       "    (transformer): Transformer(\n",
+       "      (layer): ModuleList(\n",
+       "        (0): TransformerBlock(\n",
+       "          (attention): MultiHeadSelfAttention(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "          )\n",
+       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "          (ffn): FFN(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
+       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
+       "            (activation): GELUActivation()\n",
+       "          )\n",
+       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "        )\n",
+       "        (1): TransformerBlock(\n",
+       "          (attention): MultiHeadSelfAttention(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "          )\n",
+       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "          (ffn): FFN(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
+       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
+       "            (activation): GELUActivation()\n",
+       "          )\n",
+       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "        )\n",
+       "        (2): TransformerBlock(\n",
+       "          (attention): MultiHeadSelfAttention(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "          )\n",
+       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "          (ffn): FFN(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
+       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
+       "            (activation): GELUActivation()\n",
+       "          )\n",
+       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "        )\n",
+       "        (3): TransformerBlock(\n",
+       "          (attention): MultiHeadSelfAttention(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "          )\n",
+       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "          (ffn): FFN(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
+       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
+       "            (activation): GELUActivation()\n",
+       "          )\n",
+       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "        )\n",
+       "        (4): TransformerBlock(\n",
+       "          (attention): MultiHeadSelfAttention(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "          )\n",
+       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "          (ffn): FFN(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
+       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
+       "            (activation): GELUActivation()\n",
+       "          )\n",
+       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "        )\n",
+       "        (5): TransformerBlock(\n",
+       "          (attention): MultiHeadSelfAttention(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "          )\n",
+       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "          (ffn): FFN(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
+       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
+       "            (activation): GELUActivation()\n",
+       "          )\n",
+       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "        )\n",
+       "      )\n",
+       "    )\n",
+       "  )\n",
+       "  (dropout): Dropout(p=0.5, inplace=False)\n",
+       "  (classifier): Linear(in_features=768, out_features=2, bias=True)\n",
+       ")"
+      ]
+     },
+     "execution_count": 60,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')\n",
+    "model = SimpleQuestionDistilBERT(mod)\n",
+    "model.to(device)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 61,
+   "id": "f13c12dc",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model.train()\n",
+    "optim = RMSprop(model.parameters(), lr=1e-4)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e4fa54d9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "epochs = 5\n",
+    "\n",
+    "for epoch in range(epochs):\n",
+    "    loop = tqdm(loader, leave=True)\n",
+    "    model.train()\n",
+    "    mean_training_error = []\n",
+    "    for batch in loop:\n",
+    "        optim.zero_grad()\n",
+    "        \n",
+    "        input_ids = batch['input_ids'].to(device)\n",
+    "        attention_mask = batch['attention_mask'].to(device)\n",
+    "        start = batch['start_positions'].to(device)\n",
+    "        end = batch['end_positions'].to(device)\n",
+    "        \n",
+    "        outputs = model(input_ids, attention_mask=attention_mask, start_positions=start, end_positions=end)\n",
+    "        # print(torch.argmax(outputs['start_logits'],axis=1), torch.argmax(outputs['end_logits'], axis=1), start, end)\n",
+    "        loss = outputs['loss']\n",
+    "        loss.backward()\n",
+    "        # torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)\n",
+    "        optim.step()\n",
+    "        mean_training_error.append(loss.item())\n",
+    "        loop.set_description(f'Epoch {epoch}')\n",
+    "        loop.set_postfix(loss=loss.item())\n",
+    "    print(\"Mean Training Error\", np.mean(mean_training_error))\n",
+    "    \n",
+    "    \n",
+    "    loop = tqdm(test_loader, leave=True)\n",
+    "    model.eval()\n",
+    "    mean_test_error = []\n",
+    "    # gradients are not needed for evaluation, so disable autograd to save memory\n",
+    "    with torch.no_grad():\n",
+    "        for batch in loop:\n",
+    "            input_ids = batch['input_ids'].to(device)\n",
+    "            attention_mask = batch['attention_mask'].to(device)\n",
+    "            start = batch['start_positions'].to(device)\n",
+    "            end = batch['end_positions'].to(device)\n",
+    "\n",
+    "            outputs = model(input_ids, attention_mask=attention_mask, start_positions=start, end_positions=end)\n",
+    "            # print(torch.argmax(outputs['start_logits'],axis=1), torch.argmax(outputs['end_logits'], axis=1), start, end)\n",
+    "            loss = outputs['loss']\n",
+    "\n",
+    "            mean_test_error.append(loss.item())\n",
+    "            loop.set_description(f'Epoch {epoch} Testset')\n",
+    "            loop.set_postfix(loss=loss.item())\n",
+    "    print(\"Mean Test Error\", np.mean(mean_test_error))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "id": "6ff26fb4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "torch.save(model.state_dict(), \"simple_distilbert_qa.model\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "id": "a5e7abeb",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "<All keys matched successfully>"
+      ]
+     },
+     "execution_count": 20,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "model = SimpleQuestionDistilBERT(mod)\n",
+    "model.load_state_dict(torch.load(\"simple_distilbert_qa.model\"))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "id": "f5ad7bee",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "100%|██████████| 2500/2500 [02:09<00:00, 19.37it/s]"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Mean EM:  0.5374\n",
+      "Mean F-1:  0.6826317532406944\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "eval_test_set(model, tokenizer, test_loader, device)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fa6017a8",
+   "metadata": {},
+   "source": [
+    "## Freeze baseline and train new head\n",
+    "This was my initial idea: freeze the pretrained layers and add a completely new head that is trained from scratch. I tried many different configurations, but none of them really worked; the CrossEntropyLoss usually stayed at about 3 throughout training. Below, you can see the different heads I tried.\n",
+    "\n",
+    "Furthermore, I experimented with different data, because I thought there might not be enough data overall. I would conclude that this approach didn't work for two reasons: (1) Transformers are very data-hungry, and I probably still used too little data (one epoch already took about 1h, so using even more was not feasible). (2) The new layers are trained from scratch, which means they start with no prior structure for the problem or task. I do not think that this way of training leads to better results or lower overall energy use, because it would be too resource-intensive.\n",
+    "\n",
+    "The following setup is partly based on the HuggingFace implementation of the question answering model (https://github.com/huggingface/transformers/blob/v4.23.1/src/transformers/models/distilbert/modeling_distilbert.py#L805)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 62,
+   "id": "92b21967",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model = DistilBertForMaskedLM.from_pretrained(\"distilbert-base-uncased\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 63,
+   "id": "1d7b3a8c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "config = DistilBertConfig.from_pretrained(\"distilbert-base-uncased\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 64,
+   "id": "91444894",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# only take the base model; we do not need the masked-LM head\n",
+    "mod = model.distilbert"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 65,
+   "id": "74ca6c07",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "QuestionDistilBERT(\n",
+       "  (distilbert): DistilBertModel(\n",
+       "    (embeddings): Embeddings(\n",
+       "      (word_embeddings): Embedding(30522, 768, padding_idx=0)\n",
+       "      (position_embeddings): Embedding(512, 768)\n",
+       "      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "      (dropout): Dropout(p=0.1, inplace=False)\n",
+       "    )\n",
+       "    (transformer): Transformer(\n",
+       "      (layer): ModuleList(\n",
+       "        (0): TransformerBlock(\n",
+       "          (attention): MultiHeadSelfAttention(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "          )\n",
+       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "          (ffn): FFN(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
+       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
+       "            (activation): GELUActivation()\n",
+       "          )\n",
+       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "        )\n",
+       "        (1): TransformerBlock(\n",
+       "          (attention): MultiHeadSelfAttention(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "          )\n",
+       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "          (ffn): FFN(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
+       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
+       "            (activation): GELUActivation()\n",
+       "          )\n",
+       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "        )\n",
+       "        (2): TransformerBlock(\n",
+       "          (attention): MultiHeadSelfAttention(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "          )\n",
+       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "          (ffn): FFN(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
+       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
+       "            (activation): GELUActivation()\n",
+       "          )\n",
+       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "        )\n",
+       "        (3): TransformerBlock(\n",
+       "          (attention): MultiHeadSelfAttention(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "          )\n",
+       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "          (ffn): FFN(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
+       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
+       "            (activation): GELUActivation()\n",
+       "          )\n",
+       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "        )\n",
+       "        (4): TransformerBlock(\n",
+       "          (attention): MultiHeadSelfAttention(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "          )\n",
+       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "          (ffn): FFN(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
+       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
+       "            (activation): GELUActivation()\n",
+       "          )\n",
+       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "        )\n",
+       "        (5): TransformerBlock(\n",
+       "          (attention): MultiHeadSelfAttention(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "          )\n",
+       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "          (ffn): FFN(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
+       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
+       "            (activation): GELUActivation()\n",
+       "          )\n",
+       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "        )\n",
+       "      )\n",
+       "    )\n",
+       "  )\n",
+       "  (relu): ReLU()\n",
+       "  (dropout): Dropout(p=0.1, inplace=False)\n",
+       "  (te): TransformerEncoder(\n",
+       "    (layers): ModuleList(\n",
+       "      (0): TransformerEncoderLayer(\n",
+       "        (self_attn): MultiheadAttention(\n",
+       "          (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)\n",
+       "        )\n",
+       "        (linear1): Linear(in_features=768, out_features=2048, bias=True)\n",
+       "        (dropout): Dropout(p=0.1, inplace=False)\n",
+       "        (linear2): Linear(in_features=2048, out_features=768, bias=True)\n",
+       "        (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)\n",
+       "        (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)\n",
+       "        (dropout1): Dropout(p=0.1, inplace=False)\n",
+       "        (dropout2): Dropout(p=0.1, inplace=False)\n",
+       "      )\n",
+       "      (1): TransformerEncoderLayer(\n",
+       "        (self_attn): MultiheadAttention(\n",
+       "          (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)\n",
+       "        )\n",
+       "        (linear1): Linear(in_features=768, out_features=2048, bias=True)\n",
+       "        (dropout): Dropout(p=0.1, inplace=False)\n",
+       "        (linear2): Linear(in_features=2048, out_features=768, bias=True)\n",
+       "        (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)\n",
+       "        (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)\n",
+       "        (dropout1): Dropout(p=0.1, inplace=False)\n",
+       "        (dropout2): Dropout(p=0.1, inplace=False)\n",
+       "      )\n",
+       "      (2): TransformerEncoderLayer(\n",
+       "        (self_attn): MultiheadAttention(\n",
+       "          (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)\n",
+       "        )\n",
+       "        (linear1): Linear(in_features=768, out_features=2048, bias=True)\n",
+       "        (dropout): Dropout(p=0.1, inplace=False)\n",
+       "        (linear2): Linear(in_features=2048, out_features=768, bias=True)\n",
+       "        (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)\n",
+       "        (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)\n",
+       "        (dropout1): Dropout(p=0.1, inplace=False)\n",
+       "        (dropout2): Dropout(p=0.1, inplace=False)\n",
+       "      )\n",
+       "    )\n",
+       "  )\n",
+       "  (classifier): Sequential(\n",
+       "    (0): Dropout(p=0.1, inplace=False)\n",
+       "    (1): ReLU()\n",
+       "    (2): Linear(in_features=768, out_features=512, bias=True)\n",
+       "    (3): Dropout(p=0.1, inplace=False)\n",
+       "    (4): ReLU()\n",
+       "    (5): Linear(in_features=512, out_features=256, bias=True)\n",
+       "    (6): Dropout(p=0.1, inplace=False)\n",
+       "    (7): ReLU()\n",
+       "    (8): Linear(in_features=256, out_features=128, bias=True)\n",
+       "    (9): Dropout(p=0.1, inplace=False)\n",
+       "    (10): ReLU()\n",
+       "    (11): Linear(in_features=128, out_features=64, bias=True)\n",
+       "    (12): Dropout(p=0.1, inplace=False)\n",
+       "    (13): ReLU()\n",
+       "    (14): Linear(in_features=64, out_features=2, bias=True)\n",
+       "  )\n",
+       ")"
+      ]
+     },
+     "execution_count": 65,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')\n",
+    "model = QuestionDistilBERT(mod)\n",
+    "model.to(device)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 66,
+   "id": "340857f9",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "+---------------------------------------+------------+\n",
+      "|                Modules                | Parameters |\n",
+      "+---------------------------------------+------------+\n",
+      "|  te.layers.0.self_attn.in_proj_weight |  1769472   |\n",
+      "|   te.layers.0.self_attn.in_proj_bias  |    2304    |\n",
+      "| te.layers.0.self_attn.out_proj.weight |   589824   |\n",
+      "|  te.layers.0.self_attn.out_proj.bias  |    768     |\n",
+      "|       te.layers.0.linear1.weight      |  1572864   |\n",
+      "|        te.layers.0.linear1.bias       |    2048    |\n",
+      "|       te.layers.0.linear2.weight      |  1572864   |\n",
+      "|        te.layers.0.linear2.bias       |    768     |\n",
+      "|        te.layers.0.norm1.weight       |    768     |\n",
+      "|         te.layers.0.norm1.bias        |    768     |\n",
+      "|        te.layers.0.norm2.weight       |    768     |\n",
+      "|         te.layers.0.norm2.bias        |    768     |\n",
+      "|  te.layers.1.self_attn.in_proj_weight |  1769472   |\n",
+      "|   te.layers.1.self_attn.in_proj_bias  |    2304    |\n",
+      "| te.layers.1.self_attn.out_proj.weight |   589824   |\n",
+      "|  te.layers.1.self_attn.out_proj.bias  |    768     |\n",
+      "|       te.layers.1.linear1.weight      |  1572864   |\n",
+      "|        te.layers.1.linear1.bias       |    2048    |\n",
+      "|       te.layers.1.linear2.weight      |  1572864   |\n",
+      "|        te.layers.1.linear2.bias       |    768     |\n",
+      "|        te.layers.1.norm1.weight       |    768     |\n",
+      "|         te.layers.1.norm1.bias        |    768     |\n",
+      "|        te.layers.1.norm2.weight       |    768     |\n",
+      "|         te.layers.1.norm2.bias        |    768     |\n",
+      "|  te.layers.2.self_attn.in_proj_weight |  1769472   |\n",
+      "|   te.layers.2.self_attn.in_proj_bias  |    2304    |\n",
+      "| te.layers.2.self_attn.out_proj.weight |   589824   |\n",
+      "|  te.layers.2.self_attn.out_proj.bias  |    768     |\n",
+      "|       te.layers.2.linear1.weight      |  1572864   |\n",
+      "|        te.layers.2.linear1.bias       |    2048    |\n",
+      "|       te.layers.2.linear2.weight      |  1572864   |\n",
+      "|        te.layers.2.linear2.bias       |    768     |\n",
+      "|        te.layers.2.norm1.weight       |    768     |\n",
+      "|         te.layers.2.norm1.bias        |    768     |\n",
+      "|        te.layers.2.norm2.weight       |    768     |\n",
+      "|         te.layers.2.norm2.bias        |    768     |\n",
+      "|          classifier.2.weight          |   393216   |\n",
+      "|           classifier.2.bias           |    512     |\n",
+      "|          classifier.5.weight          |   131072   |\n",
+      "|           classifier.5.bias           |    256     |\n",
+      "|          classifier.8.weight          |   32768    |\n",
+      "|           classifier.8.bias           |    128     |\n",
+      "|          classifier.11.weight         |    8192    |\n",
+      "|           classifier.11.bias          |     64     |\n",
+      "|          classifier.14.weight         |    128     |\n",
+      "|           classifier.14.bias          |     2      |\n",
+      "+---------------------------------------+------------+\n",
+      "Total Trainable Params: 17108290\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "17108290"
+      ]
+     },
+     "execution_count": 66,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "count_parameters(model)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9babd013",
+   "metadata": {},
+   "source": [
+    "### Testing the model\n",
+    "This is the same procedure as in `distilbert.ipynb`. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 67,
+   "id": "694c828b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# get a smaller dataset for a quick smoke test\n",
+    "batch_size = 8\n",
+    "test_ds = Dataset(squad_paths=squad_paths[:2], natural_question_paths=None, hotpotqa_paths=None, tokenizer=tokenizer)\n",
+    "# build the loader from the small dataset created above\n",
+    "test_ds_loader = torch.utils.data.DataLoader(test_ds, batch_size=batch_size)\n",
+    "optim = torch.optim.Adam(model.parameters())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 68,
+   "id": "a76587df",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Passed\n"
+     ]
+    }
+   ],
+   "source": [
+    "test_model(model, optim, test_ds_loader, device)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7c326e8e",
+   "metadata": {},
+   "source": [
+    "### Training the model\n",
+    "* Parameter Tuning:\n",
+    "  * Learning Rate: I experimented with several values; 1e-4 worked best for me. 1e-3 was very unstable and 1e-5 was too small.\n",
+    "  * Gradient Clipping: I experimented with this, but the difference was minimal.\n",
+    "\n",
+    "Data:\n",
+    "* I first used only the SQuAD dataset, but the model generalised poorly\n",
+    "  * The dataset is relatively small, and many entries share the same context and differ only in the question\n",
+    "  * I believe the data is not diverse enough to train a fully functional model\n",
+    "* Hence, I included the Natural Questions dataset too\n",
+    "  * It is, however, a lot messier; I elaborate on this in `load_data.ipynb`\n",
+    "* The HotPotQA data was used as well\n",
+    "\n",
+    "Tested with:\n",
+    "* 3 Linear Layers\n",
+    "  * Training error was high; more layers were needed\n",
+    "  * This was expected, as it was mostly a proof of concept\n",
+    "* 1 TransformerEncoder with 4 attention heads + 1 Linear Layer:\n",
+    "  * Training error was still high; the model was still too simple\n",
+    "* 1 TransformerEncoder with 8 heads + 1 Linear Layer:\n",
+    "  * Training error gets lower but stagnates at some point\n",
+    "  * Probably still too simple; it doesn't generalise either\n",
+    "* 2 TransformerEncoder with 8 and 4 heads + 1 Linear Layer:\n",
+    "  * The loss decreases but plateaus after some time\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2e9f4bd3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dataset = Dataset(squad_paths = squad_paths, natural_question_paths=nat_paths, hotpotqa_paths=hotpotqa_paths, tokenizer=tokenizer)\n",
+    "loader = torch.utils.data.DataLoader(dataset, batch_size=8)\n",
+    "\n",
+    "test_dataset = Dataset(squad_paths = [str(x) for x in Path('data/test_squad/').glob('**/*.txt')], \n",
+    "                       natural_question_paths=None, \n",
+    "                       hotpotqa_paths = None, tokenizer=tokenizer)\n",
+    "test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=4)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 26,
+   "id": "03a6de37",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model = QuestionDistilBERT(mod).to(device)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 41,
+   "id": "ed854b73",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from torch.optim import AdamW, RMSprop\n",
+    "\n",
+    "model.train()\n",
+    "optim = RMSprop(model.parameters(), lr=1e-4)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 42,
+   "id": "79fdfcc9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from torch.utils.tensorboard import SummaryWriter\n",
+    "writer = SummaryWriter()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f7bddb43",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "epochs = 20\n",
+    "\n",
+    "for epoch in range(epochs):\n",
+    "    loop = tqdm(loader, leave=True)\n",
+    "    model.train()\n",
+    "    mean_training_error = []\n",
+    "    for batch in loop:\n",
+    "        optim.zero_grad()\n",
+    "        \n",
+    "        input_ids = batch['input_ids'].to(device)\n",
+    "        attention_mask = batch['attention_mask'].to(device)\n",
+    "        start = batch['start_positions'].to(device)\n",
+    "        end = batch['end_positions'].to(device)\n",
+    "        \n",
+    "        outputs = model(input_ids, attention_mask=attention_mask, start_positions=start, end_positions=end)\n",
+    "        \n",
+    "        loss = outputs['loss']\n",
+    "        loss.backward()\n",
+    "        \n",
+    "        optim.step()\n",
+    "        mean_training_error.append(loss.item())\n",
+    "        loop.set_description(f'Epoch {epoch}')\n",
+    "        loop.set_postfix(loss=loss.item())\n",
+    "    print(\"Mean Training Error\", np.mean(mean_training_error))\n",
+    "    writer.add_scalar(\"Loss/train\", np.mean(mean_training_error), epoch)\n",
+    "    \n",
+    "    loop = tqdm(test_loader, leave=True)\n",
+    "    model.eval()\n",
+    "    mean_test_error = []\n",
+    "    # evaluation needs no gradients, so disable autograd to save memory and time\n",
+    "    with torch.no_grad():\n",
+    "        for batch in loop:\n",
+    "            input_ids = batch['input_ids'].to(device)\n",
+    "            attention_mask = batch['attention_mask'].to(device)\n",
+    "            start = batch['start_positions'].to(device)\n",
+    "            end = batch['end_positions'].to(device)\n",
+    "\n",
+    "            outputs = model(input_ids, attention_mask=attention_mask, start_positions=start, end_positions=end)\n",
+    "            # print(torch.argmax(outputs['start_logits'],axis=1), torch.argmax(outputs['end_logits'], axis=1), start, end)\n",
+    "            loss = outputs['loss']\n",
+    "\n",
+    "            mean_test_error.append(loss.item())\n",
+    "            loop.set_description(f'Epoch {epoch} Testset')\n",
+    "            loop.set_postfix(loss=loss.item())\n",
+    "    print(\"Mean Test Error\", np.mean(mean_test_error))\n",
+    "    writer.add_scalar(\"Loss/test\", np.mean(mean_test_error), epoch)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 238,
+   "id": "a9d6af2e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "writer.close()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 33,
+   "id": "ba43447e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "torch.save(model.state_dict(), \"distilbert_qa.model\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 34,
+   "id": "ffc49aca",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "<All keys matched successfully>"
+      ]
+     },
+     "execution_count": 34,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "model = QuestionDistilBERT(mod).to(device)\n",
+    "model.load_state_dict(torch.load(\"distilbert_qa.model\", map_location=device))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 35,
+   "id": "730a86c1",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "100%|██████████| 2500/2500 [02:57<00:00, 14.09it/s]"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Mean EM:  0.0479\n",
+      "Mean F-1:  0.08989175857485086\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "eval_test_set(model, tokenizer, test_loader, device)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bd1c7076",
+   "metadata": {},
+   "source": [
+    "## Reuse Layer\n",
+    "This approach was inspired by how well the original model with just one classification head worked. I felt that the main problem with the previous model was that its new layers lacked the structure that the pretrained layers already have, combined with the massive amount of resources needed to train a Transformer.\n",
+    "\n",
+    "Hence, I tried cloning the last layer (and then the last two layers) of the DistilBERT model, putting a classifier on top, and using this as the head. The base DistilBERT model is completely frozen. This worked extremely well, even though we fine-tune only about 21% of the parameters we did before (about 14 million as opposed to 66 million). The results are listed below.\n",
+    "\n",
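+    "As a rough check of the 21% figure, here is a back-of-the-envelope count. It assumes the trainable head is two cloned `TransformerBlock`s (hidden size 768, feed-forward size 3072) plus the final `Linear(768, 2)` classifier, as in the `ReuseQuestionDistilBERT` summary printed further down, and it takes the 66 million figure for the base model from the paragraph above:\n",
+    "\n",
+    "$$\n",
+    "\\begin{aligned}\n",
+    "P_{\\text{attention}} &= 4\\,(768 \\cdot 768 + 768) = 2\\,362\\,368 \\\\\n",
+    "P_{\\text{ffn}} &= (768 \\cdot 3072 + 3072) + (3072 \\cdot 768 + 768) = 4\\,722\\,432 \\\\\n",
+    "P_{\\text{norm}} &= 2\\,(768 + 768) = 3\\,072 \\\\\n",
+    "P_{\\text{block}} &= P_{\\text{attention}} + P_{\\text{ffn}} + P_{\\text{norm}} = 7\\,087\\,872 \\\\\n",
+    "P_{\\text{head}} &= 2\\,P_{\\text{block}} + (768 \\cdot 2 + 2) = 14\\,177\\,282 \\\\\n",
+    "\\frac{P_{\\text{head}}}{P_{\\text{base}}} &\\approx \\frac{14.18}{66} \\approx 0.215\n",
+    "\\end{aligned}\n",
+    "$$\n",
+    "\n",
+    "With only the last layer cloned, the same count gives roughly 7.1 million trainable parameters.\n",
+    "\n",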
+    "### Last DistilBERT layer\n",
+    "\n",
+    "Dropout 0.1 and RMSprop 1e-4:\n",
+    "* Mean EM:  0.3888\n",
+    "* Mean F-1:  0.5122932744694068\n",
+    "\n",
+    "Dropout 0.25: stagnates very early\n",
+    "* Mean EM:  0.3552\n",
+    "* Mean F-1:  0.4711235721312687\n",
+    "\n",
+    "Dropout 0.15: seems to work well. Training and test error stagnate around 1.7 and 1.8, respectively, but generalisation is good (more layers are needed)\n",
+    "* Mean EM:  0.4119\n",
+    "* Mean F-1:  0.5296387232893214\n",
+    "\n",
+    "### Last DistilBERT layer + more dense layers\n",
+    "Dropout 0.15 + 4 dense layers ((768-512)-(512-256)-(256-128)-(128-2)) with ReLU: doesn't work too well, the error stagnates at around 2.4\n",
+    "\n",
+    "### Last two DistilBERT layers\n",
+    "Dropout 0.1 with the last 2 DistilBERT layers: works very well, but overfits early; more data might help\n",
+    "* Mean EM:  0.458\n",
+    "* Mean F-1:  0.6003368353673634\n",
+    "\n",
+    "Dropout 0.1 with the last 2 DistilBERT layers, trained on all data:\n",
+    "* Mean EM:  0.484\n",
+    "* Mean F-1:  0.6344960035215299\n",
+    "\n",
+    "Dropout 0.15 - **BEST**\n",
+    "* Mean EM:  0.5178\n",
+    "* Mean F-1:  0.6671140689626448\n",
+    "\n",
+    "Dropout 0.2 - doesn't work too well\n",
+    "* Mean EM:  0.4353\n",
+    "* Mean F-1:  0.5776847879304647\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 69,
+   "id": "654e09e8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dataset = Dataset(squad_paths = squad_paths, natural_question_paths=None, hotpotqa_paths=hotpotqa_paths, tokenizer=tokenizer)\n",
+    "loader = torch.utils.data.DataLoader(dataset, batch_size=8)\n",
+    "\n",
+    "test_dataset = Dataset(squad_paths = [str(x) for x in Path('data/test_squad/').glob('**/*.txt')], \n",
+    "                       natural_question_paths=None, \n",
+    "                       hotpotqa_paths = None, tokenizer=tokenizer)\n",
+    "test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=4)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 70,
+   "id": "707c0cb5",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "ReuseQuestionDistilBERT(\n",
+       "  (te): ModuleList(\n",
+       "    (0): TransformerBlock(\n",
+       "      (attention): MultiHeadSelfAttention(\n",
+       "        (dropout): Dropout(p=0.1, inplace=False)\n",
+       "        (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "        (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "        (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "        (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "      )\n",
+       "      (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "      (ffn): FFN(\n",
+       "        (dropout): Dropout(p=0.1, inplace=False)\n",
+       "        (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
+       "        (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
+       "        (activation): GELUActivation()\n",
+       "      )\n",
+       "      (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "    )\n",
+       "    (1): TransformerBlock(\n",
+       "      (attention): MultiHeadSelfAttention(\n",
+       "        (dropout): Dropout(p=0.1, inplace=False)\n",
+       "        (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "        (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "        (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "        (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "      )\n",
+       "      (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "      (ffn): FFN(\n",
+       "        (dropout): Dropout(p=0.1, inplace=False)\n",
+       "        (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
+       "        (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
+       "        (activation): GELUActivation()\n",
+       "      )\n",
+       "      (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "    )\n",
+       "  )\n",
+       "  (distilbert): DistilBertModel(\n",
+       "    (embeddings): Embeddings(\n",
+       "      (word_embeddings): Embedding(30522, 768, padding_idx=0)\n",
+       "      (position_embeddings): Embedding(512, 768)\n",
+       "      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "      (dropout): Dropout(p=0.1, inplace=False)\n",
+       "    )\n",
+       "    (transformer): Transformer(\n",
+       "      (layer): ModuleList(\n",
+       "        (0): TransformerBlock(\n",
+       "          (attention): MultiHeadSelfAttention(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "          )\n",
+       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "          (ffn): FFN(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
+       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
+       "            (activation): GELUActivation()\n",
+       "          )\n",
+       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "        )\n",
+       "        (1): TransformerBlock(\n",
+       "          (attention): MultiHeadSelfAttention(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "          )\n",
+       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "          (ffn): FFN(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
+       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
+       "            (activation): GELUActivation()\n",
+       "          )\n",
+       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "        )\n",
+       "        (2): TransformerBlock(\n",
+       "          (attention): MultiHeadSelfAttention(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "          )\n",
+       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "          (ffn): FFN(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
+       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
+       "            (activation): GELUActivation()\n",
+       "          )\n",
+       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "        )\n",
+       "        (3): TransformerBlock(\n",
+       "          (attention): MultiHeadSelfAttention(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "          )\n",
+       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "          (ffn): FFN(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
+       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
+       "            (activation): GELUActivation()\n",
+       "          )\n",
+       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "        )\n",
+       "        (4): TransformerBlock(\n",
+       "          (attention): MultiHeadSelfAttention(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "          )\n",
+       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "          (ffn): FFN(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
+       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
+       "            (activation): GELUActivation()\n",
+       "          )\n",
+       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "        )\n",
+       "        (5): TransformerBlock(\n",
+       "          (attention): MultiHeadSelfAttention(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
+       "          )\n",
+       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "          (ffn): FFN(\n",
+       "            (dropout): Dropout(p=0.1, inplace=False)\n",
+       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
+       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
+       "            (activation): GELUActivation()\n",
+       "          )\n",
+       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+       "        )\n",
+       "      )\n",
+       "    )\n",
+       "  )\n",
+       "  (relu): ReLU()\n",
+       "  (dropout): Dropout(p=0.15, inplace=False)\n",
+       "  (classifier): Linear(in_features=768, out_features=2, bias=True)\n",
+       ")"
+      ]
+     },
+     "execution_count": 70,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "model = DistilBertForMaskedLM.from_pretrained(\"distilbert-base-uncased\")\n",
+    "config = DistilBertConfig.from_pretrained(\"distilbert-base-uncased\")\n",
+    "mod = model.distilbert\n",
+    "\n",
+    "device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')\n",
+    "model = ReuseQuestionDistilBERT(mod)\n",
+    "model.to(device)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 71,
+   "id": "d2c6bff5",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "+-------------------------------+------------+\n",
+      "|            Modules            | Parameters |\n",
+      "+-------------------------------+------------+\n",
+      "|  te.0.attention.q_lin.weight  |   589824   |\n",
+      "|   te.0.attention.q_lin.bias   |    768     |\n",
+      "|  te.0.attention.k_lin.weight  |   589824   |\n",
+      "|   te.0.attention.k_lin.bias   |    768     |\n",
+      "|  te.0.attention.v_lin.weight  |   589824   |\n",
+      "|   te.0.attention.v_lin.bias   |    768     |\n",
+      "| te.0.attention.out_lin.weight |   589824   |\n",
+      "|  te.0.attention.out_lin.bias  |    768     |\n",
+      "|   te.0.sa_layer_norm.weight   |    768     |\n",
+      "|    te.0.sa_layer_norm.bias    |    768     |\n",
+      "|      te.0.ffn.lin1.weight     |  2359296   |\n",
+      "|       te.0.ffn.lin1.bias      |    3072    |\n",
+      "|      te.0.ffn.lin2.weight     |  2359296   |\n",
+      "|       te.0.ffn.lin2.bias      |    768     |\n",
+      "| te.0.output_layer_norm.weight |    768     |\n",
+      "|  te.0.output_layer_norm.bias  |    768     |\n",
+      "|  te.1.attention.q_lin.weight  |   589824   |\n",
+      "|   te.1.attention.q_lin.bias   |    768     |\n",
+      "|  te.1.attention.k_lin.weight  |   589824   |\n",
+      "|   te.1.attention.k_lin.bias   |    768     |\n",
+      "|  te.1.attention.v_lin.weight  |   589824   |\n",
+      "|   te.1.attention.v_lin.bias   |    768     |\n",
+      "| te.1.attention.out_lin.weight |   589824   |\n",
+      "|  te.1.attention.out_lin.bias  |    768     |\n",
+      "|   te.1.sa_layer_norm.weight   |    768     |\n",
+      "|    te.1.sa_layer_norm.bias    |    768     |\n",
+      "|      te.1.ffn.lin1.weight     |  2359296   |\n",
+      "|       te.1.ffn.lin1.bias      |    3072    |\n",
+      "|      te.1.ffn.lin2.weight     |  2359296   |\n",
+      "|       te.1.ffn.lin2.bias      |    768     |\n",
+      "| te.1.output_layer_norm.weight |    768     |\n",
+      "|  te.1.output_layer_norm.bias  |    768     |\n",
+      "|       classifier.weight       |    1536    |\n",
+      "|        classifier.bias        |     2      |\n",
+      "+-------------------------------+------------+\n",
+      "Total Trainable Params: 14177282\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "14177282"
+      ]
+     },
+     "execution_count": 71,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "count_parameters(model)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c386c2eb",
+   "metadata": {},
+   "source": [
+    "### Testing the Model"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 72,
+   "id": "818deed3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Build a small SQuAD-only dataset for a quick smoke test\n",
+    "batch_size = 8\n",
+    "test_ds = Dataset(squad_paths=squad_paths[:2], natural_question_paths=None, hotpotqa_paths=None, tokenizer=tokenizer)\n",
+    "# Load from the small test_ds built above, not the full dataset\n",
+    "test_ds_loader = torch.utils.data.DataLoader(test_ds, batch_size=batch_size)\n",
+    "optim = torch.optim.Adam(model.parameters())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 73,
+   "id": "9da40760",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Passed\n"
+     ]
+    }
+   ],
+   "source": [
+    "test_model(model, optim, test_ds_loader, device)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c3f80248",
+   "metadata": {},
+   "source": [
+    "### Model Training"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 24,
+   "id": "e1adabe6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from torch.optim import AdamW\n",
+    "\n",
+    "# Switch to training mode (enables dropout) and set up the optimizer\n",
+    "model.train()\n",
+    "optim = AdamW(model.parameters(), lr=1e-4)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "efe1cbd5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "epochs = 16\n",
+    "\n",
+    "for epoch in range(epochs):\n",
+    "    loop = tqdm(loader, leave=True)\n",
+    "    model.train()\n",
+    "    mean_training_error = []\n",
+    "    for batch in loop:\n",
+    "        optim.zero_grad()\n",
+    "        \n",
+    "        input_ids = batch['input_ids'].to(device)\n",
+    "        attention_mask = batch['attention_mask'].to(device)\n",
+    "        start = batch['start_positions'].to(device)\n",
+    "        end = batch['end_positions'].to(device)\n",
+    "        \n",
+    "        outputs = model(input_ids, attention_mask=attention_mask, start_positions=start, end_positions=end)\n",
+    "        # print(torch.argmax(outputs['start_logits'],axis=1), torch.argmax(outputs['end_logits'], axis=1), start, end)\n",
+    "        loss = outputs['loss']\n",
+    "        loss.backward()\n",
+    "        # torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)\n",
+    "        optim.step()\n",
+    "        mean_training_error.append(loss.item())\n",
+    "        loop.set_description(f'Epoch {epoch}')\n",
+    "        loop.set_postfix(loss=loss.item())\n",
+    "    print(\"Mean Training Error\", np.mean(mean_training_error))\n",
+    "    \n",
+    "    loop = tqdm(test_loader, leave=True)\n",
+    "    model.eval()\n",
+    "    mean_test_error = []\n",
+    "    # Evaluation needs no gradients; no_grad() avoids building the autograd graph\n",
+    "    with torch.no_grad():\n",
+    "        for batch in loop:\n",
+    "            input_ids = batch['input_ids'].to(device)\n",
+    "            attention_mask = batch['attention_mask'].to(device)\n",
+    "            start = batch['start_positions'].to(device)\n",
+    "            end = batch['end_positions'].to(device)\n",
+    "\n",
+    "            outputs = model(input_ids, attention_mask=attention_mask, start_positions=start, end_positions=end)\n",
+    "            # print(torch.argmax(outputs['start_logits'],axis=1), torch.argmax(outputs['end_logits'], axis=1), start, end)\n",
+    "            loss = outputs['loss']\n",
+    "\n",
+    "            mean_test_error.append(loss.item())\n",
+    "            loop.set_description(f'Epoch {epoch} Testset')\n",
+    "            loop.set_postfix(loss=loss.item())\n",
+    "    print(\"Mean Test Error\", np.mean(mean_test_error))\n",
+    "    # Save a checkpoint after every epoch\n",
+    "    torch.save(model.state_dict(), \"distilbert_reuse_{}\".format(epoch))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 48,
+   "id": "fdf37d18",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "torch.save(model.state_dict(), \"distilbert_reuse.model\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 49,
+   "id": "d1cfded4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
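+    "m = ReuseQuestionDistilBERT(mod)\n",
+    "# map_location lets the checkpoint load on CPU-only machines as well\n",
+    "m.load_state_dict(torch.load(\"distilbert_reuse.model\", map_location=device))\n",
+    "m.to(device)\n",
+    "model = m"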
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 47,
+   "id": "233bdc18",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "100%|██████████| 2500/2500 [02:51<00:00, 14.59it/s]"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Mean EM:  0.5178\n",
+      "Mean F-1:  0.6671140689626448\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "eval_test_set(model, tokenizer, test_loader, device)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0fb1ce9e",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3.10.8 ('venv': venv)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.8"
+  },
+  "toc": {
+   "base_numbering": 1,
+   "nav_menu": {},
+   "number_sections": true,
+   "sideBar": true,
+   "skip_h1_title": false,
+   "title_cell": "Table of Contents",
+   "title_sidebar": "Contents",
+   "toc_cell": false,
+   "toc_position": {},
+   "toc_section_display": true,
+   "toc_window_display": false
+  },
+  "varInspector": {
+   "cols": {
+    "lenName": 16,
+    "lenType": 16,
+    "lenVar": 40
+   },
+   "kernels_config": {
+    "python": {
+     "delete_cmd_postfix": "",
+     "delete_cmd_prefix": "del ",
+     "library": "var_list.py",
+     "varRefreshCmd": "print(var_dic_list())"
+    },
+    "r": {
+     "delete_cmd_postfix": ") ",
+     "delete_cmd_prefix": "rm(",
+     "library": "var_list.r",
+     "varRefreshCmd": "cat(var_dic_list()) "
+    }
+   },
+   "types_to_exclude": [
+    "module",
+    "function",
+    "builtin_function_or_method",
+    "instance",
+    "_Feature"
+   ],
+   "window_display": false
+  },
+  "vscode": {
+   "interpreter": {
+    "hash": "85bf9c14e9ba73b783ed1274d522bec79eb0b2b739090180d8ce17bb11aff4aa"
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}