Question Answering
sanjudebnath commited on
Commit
22f3dba
verified
1 Parent(s): 5c41cbc

Delete load_data.ipynb

Browse files
Files changed (1) hide show
  1. load_data.ipynb +0 -1209
load_data.ipynb DELETED
@@ -1,1209 +0,0 @@
1
- {
2
- "cells": [
3
- {
4
- "cell_type": "markdown",
5
- "id": "12d87b30",
6
- "metadata": {},
7
- "source": [
8
- "# Load Data\n",
9
- "This notebook loads and preprocesses all necessary data, namely the following.\n",
10
- "* OpenWebTextCorpus: for base DistilBERT model\n",
11
- "* SQuAD dataset: for Q&A\n",
12
- "* Natural Questions (needs to be downloaded externally but is preprocessed here): for Q&A\n",
13
- "* HotPotQA: for Q&A"
14
- ]
15
- },
16
- {
17
- "cell_type": "code",
18
- "execution_count": 4,
19
- "id": "7c82d7fa",
20
- "metadata": {},
21
- "outputs": [],
22
- "source": [
23
- "from tqdm.auto import tqdm\n",
24
- "from datasets import load_dataset\n",
25
- "import os\n",
26
- "import pandas as pd\n",
27
- "import random"
28
- ]
29
- },
30
- {
31
- "cell_type": "markdown",
32
- "id": "1737f219",
33
- "metadata": {},
34
- "source": [
35
- "## Distilbert Data\n",
36
- "In the following, we download the english openwebtext dataset from huggingface (https://huggingface.co/datasets/openwebtext). The dataset is provided by Aaron Gokaslan and Vanya Cohen from Brown University (https://skylion007.github.io/OpenWebTextCorpus/).\n",
37
- "\n",
38
- "We first load the data, investigate the structure and write the dataset into files of 10,000 texts each."
39
- ]
40
- },
41
- {
42
- "cell_type": "code",
43
- "execution_count": null,
44
- "id": "cce7623c",
45
- "metadata": {},
46
- "outputs": [],
47
- "source": [
48
- "ds = load_dataset(\"openwebtext\")"
49
- ]
50
- },
51
- {
52
- "cell_type": "code",
53
- "execution_count": 4,
54
- "id": "678a5e86",
55
- "metadata": {},
56
- "outputs": [
57
- {
58
- "data": {
59
- "text/plain": [
60
- "DatasetDict({\n",
61
- " train: Dataset({\n",
62
- " features: ['text'],\n",
63
- " num_rows: 8013769\n",
64
- " })\n",
65
- "})"
66
- ]
67
- },
68
- "execution_count": 4,
69
- "metadata": {},
70
- "output_type": "execute_result"
71
- }
72
- ],
73
- "source": [
74
- "# we have a text-only training dataset with 8 million entries\n",
75
- "ds"
76
- ]
77
- },
78
- {
79
- "cell_type": "code",
80
- "execution_count": 5,
81
- "id": "b141bce7",
82
- "metadata": {},
83
- "outputs": [],
84
- "source": [
85
- "# create necessary folders\n",
86
- "os.mkdir('data')\n",
87
- "os.mkdir('data/original')"
88
- ]
89
- },
90
- {
91
- "cell_type": "code",
92
- "execution_count": null,
93
- "id": "ca94f995",
94
- "metadata": {},
95
- "outputs": [],
96
- "source": [
97
- "# save text in chunks of 10000 samples\n",
98
- "text = []\n",
99
- "i = 0\n",
100
- "\n",
101
- "for sample in tqdm(ds['train']):\n",
102
- " # replace all newlines\n",
103
- " sample = sample['text'].replace('\\n','')\n",
104
- " \n",
105
- " # append cleaned sample to all texts\n",
106
- " text.append(sample)\n",
107
- " \n",
108
- " # if we processed 10000 samples, write them to a file and start over\n",
109
- " if len(text) == 10000:\n",
110
- " with open(f\"data/original/text_{i}.txt\", 'w', encoding='utf-8') as f:\n",
111
- " f.write('\\n'.join(text))\n",
112
- " text = []\n",
113
- " i += 1 \n",
114
- "\n",
115
- "# write remaining samples to a file\n",
116
- "with open(f\"data/original/text_{i}.txt\", 'w', encoding='utf-8') as f:\n",
117
- " f.write('\\n'.join(text))"
118
- ]
119
- },
120
- {
121
- "cell_type": "markdown",
122
- "id": "f131dcfc",
123
- "metadata": {},
124
- "source": [
125
- "### Testing\n",
126
- "If we load the first file, we should get a file that is 10000 lines long and has one column\n",
127
- "\n",
128
- "As we do not preprocess the data in any way, but just write the read text into the file, this is all testing necessary"
129
- ]
130
- },
131
- {
132
- "cell_type": "code",
133
- "execution_count": 13,
134
- "id": "df50af74",
135
- "metadata": {},
136
- "outputs": [],
137
- "source": [
138
- "with open(\"data/original/text_0.txt\", 'r', encoding='utf-8') as f:\n",
139
- " lines = f.read().split('\\n')\n",
140
- "lines = pd.DataFrame(lines)"
141
- ]
142
- },
143
- {
144
- "cell_type": "code",
145
- "execution_count": 14,
146
- "id": "8ddb0085",
147
- "metadata": {},
148
- "outputs": [
149
- {
150
- "name": "stdout",
151
- "output_type": "stream",
152
- "text": [
153
- "Passed\n"
154
- ]
155
- }
156
- ],
157
- "source": [
158
- "assert lines.shape==(10000,1)\n",
159
- "print(\"Passed\")"
160
- ]
161
- },
162
- {
163
- "cell_type": "markdown",
164
- "id": "1a65b268",
165
- "metadata": {},
166
- "source": [
167
- "## SQuAD Data\n",
168
- "In the following, we download the SQuAD dataset from huggingface (https://huggingface.co/datasets/squad). It was initially provided by Rajpurkar et al. from Stanford University.\n",
169
- "\n",
170
- "We again load the dataset and store it in chunks of 1000 into files."
171
- ]
172
- },
173
- {
174
- "cell_type": "code",
175
- "execution_count": 6,
176
- "id": "6750ce6e",
177
- "metadata": {},
178
- "outputs": [
179
- {
180
- "ename": "AssertionError",
181
- "evalue": "",
182
- "output_type": "error",
183
- "traceback": [
184
- "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
185
- "\u001b[0;31mAssertionError\u001b[0m Traceback (most recent call last)",
186
- "Cell \u001b[0;32mIn [6], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m dataset \u001b[38;5;241m=\u001b[39m load_dataset(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124msquad\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
187
- "File \u001b[0;32m~/anaconda3/envs/myenv/lib/python3.10/site-packages/datasets/load.py:1670\u001b[0m, in \u001b[0;36mload_dataset\u001b[0;34m(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, **config_kwargs)\u001b[0m\n\u001b[1;32m 1667\u001b[0m ignore_verifications \u001b[38;5;241m=\u001b[39m ignore_verifications \u001b[38;5;129;01mor\u001b[39;00m save_infos\n\u001b[1;32m 1669\u001b[0m \u001b[38;5;66;03m# Create a dataset builder\u001b[39;00m\n\u001b[0;32m-> 1670\u001b[0m builder_instance \u001b[38;5;241m=\u001b[39m \u001b[43mload_dataset_builder\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 1671\u001b[0m \u001b[43m \u001b[49m\u001b[43mpath\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mpath\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1672\u001b[0m \u001b[43m \u001b[49m\u001b[43mname\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mname\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1673\u001b[0m \u001b[43m \u001b[49m\u001b[43mdata_dir\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdata_dir\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1674\u001b[0m \u001b[43m \u001b[49m\u001b[43mdata_files\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdata_files\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1675\u001b[0m \u001b[43m \u001b[49m\u001b[43mcache_dir\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcache_dir\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1676\u001b[0m \u001b[43m \u001b[49m\u001b[43mfeatures\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mfeatures\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1677\u001b[0m \u001b[43m \u001b[49m\u001b[43mdownload_config\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdownload_config\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1678\u001b[0m \u001b[43m 
\u001b[49m\u001b[43mdownload_mode\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdownload_mode\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1679\u001b[0m \u001b[43m \u001b[49m\u001b[43mrevision\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mrevision\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1680\u001b[0m \u001b[43m \u001b[49m\u001b[43muse_auth_token\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43muse_auth_token\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1681\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mconfig_kwargs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1682\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1684\u001b[0m \u001b[38;5;66;03m# Return iterable dataset in case of streaming\u001b[39;00m\n\u001b[1;32m 1685\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m streaming:\n",
188
- "File \u001b[0;32m~/anaconda3/envs/myenv/lib/python3.10/site-packages/datasets/load.py:1447\u001b[0m, in \u001b[0;36mload_dataset_builder\u001b[0;34m(path, name, data_dir, data_files, cache_dir, features, download_config, download_mode, revision, use_auth_token, **config_kwargs)\u001b[0m\n\u001b[1;32m 1445\u001b[0m download_config \u001b[38;5;241m=\u001b[39m download_config\u001b[38;5;241m.\u001b[39mcopy() \u001b[38;5;28;01mif\u001b[39;00m download_config \u001b[38;5;28;01melse\u001b[39;00m DownloadConfig()\n\u001b[1;32m 1446\u001b[0m download_config\u001b[38;5;241m.\u001b[39muse_auth_token \u001b[38;5;241m=\u001b[39m use_auth_token\n\u001b[0;32m-> 1447\u001b[0m dataset_module \u001b[38;5;241m=\u001b[39m \u001b[43mdataset_module_factory\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 1448\u001b[0m \u001b[43m \u001b[49m\u001b[43mpath\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1449\u001b[0m \u001b[43m \u001b[49m\u001b[43mrevision\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mrevision\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1450\u001b[0m \u001b[43m \u001b[49m\u001b[43mdownload_config\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdownload_config\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1451\u001b[0m \u001b[43m \u001b[49m\u001b[43mdownload_mode\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdownload_mode\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1452\u001b[0m \u001b[43m \u001b[49m\u001b[43mdata_dir\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdata_dir\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1453\u001b[0m \u001b[43m \u001b[49m\u001b[43mdata_files\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdata_files\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1454\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1456\u001b[0m \u001b[38;5;66;03m# Get dataset builder class from the processing script\u001b[39;00m\n\u001b[1;32m 1457\u001b[0m builder_cls \u001b[38;5;241m=\u001b[39m 
import_main_class(dataset_module\u001b[38;5;241m.\u001b[39mmodule_path)\n",
189
- "File \u001b[0;32m~/anaconda3/envs/myenv/lib/python3.10/site-packages/datasets/load.py:1172\u001b[0m, in \u001b[0;36mdataset_module_factory\u001b[0;34m(path, revision, download_config, download_mode, dynamic_modules_path, data_dir, data_files, **download_kwargs)\u001b[0m\n\u001b[1;32m 1167\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(e1, \u001b[38;5;167;01mFileNotFoundError\u001b[39;00m):\n\u001b[1;32m 1168\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mFileNotFoundError\u001b[39;00m(\n\u001b[1;32m 1169\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCouldn\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mt find a dataset script at \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mrelative_to_absolute_path(combined_path)\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m or any data file in the same directory. \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1170\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCouldn\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mt find \u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mpath\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m on the Hugging Face Hub either: \u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mtype\u001b[39m(e1)\u001b[38;5;241m.\u001b[39m\u001b[38;5;18m__name__\u001b[39m\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m: \u001b[39m\u001b[38;5;132;01m{\u001b[39;00me1\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1171\u001b[0m ) \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;28mNone\u001b[39m\n\u001b[0;32m-> 1172\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m e1 \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;28mNone\u001b[39m\n\u001b[1;32m 1173\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 1174\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m 
\u001b[38;5;167;01mFileNotFoundError\u001b[39;00m(\n\u001b[1;32m 1175\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCouldn\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mt find a dataset script at \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mrelative_to_absolute_path(combined_path)\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m or any data file in the same directory.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1176\u001b[0m )\n",
190
- "File \u001b[0;32m~/anaconda3/envs/myenv/lib/python3.10/site-packages/datasets/load.py:1151\u001b[0m, in \u001b[0;36mdataset_module_factory\u001b[0;34m(path, revision, download_config, download_mode, dynamic_modules_path, data_dir, data_files, **download_kwargs)\u001b[0m\n\u001b[1;32m 1143\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m HubDatasetModuleFactoryWithScript(\n\u001b[1;32m 1144\u001b[0m path,\n\u001b[1;32m 1145\u001b[0m revision\u001b[38;5;241m=\u001b[39mrevision,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 1148\u001b[0m dynamic_modules_path\u001b[38;5;241m=\u001b[39mdynamic_modules_path,\n\u001b[1;32m 1149\u001b[0m )\u001b[38;5;241m.\u001b[39mget_module()\n\u001b[1;32m 1150\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m-> 1151\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mHubDatasetModuleFactoryWithoutScript\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 1152\u001b[0m \u001b[43m \u001b[49m\u001b[43mpath\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1153\u001b[0m \u001b[43m \u001b[49m\u001b[43mrevision\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mrevision\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1154\u001b[0m \u001b[43m \u001b[49m\u001b[43mdata_dir\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdata_dir\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1155\u001b[0m \u001b[43m \u001b[49m\u001b[43mdata_files\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdata_files\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1156\u001b[0m \u001b[43m \u001b[49m\u001b[43mdownload_config\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdownload_config\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1157\u001b[0m \u001b[43m \u001b[49m\u001b[43mdownload_mode\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdownload_mode\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1158\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\u001b[38;5;241m.\u001b[39mget_module()\n\u001b[1;32m 1159\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m 
\u001b[38;5;167;01mException\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m e1: \u001b[38;5;66;03m# noqa: all the attempts failed, before raising the error we should check if the module is already cached.\u001b[39;00m\n\u001b[1;32m 1160\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n",
191
- "File \u001b[0;32m~/anaconda3/envs/myenv/lib/python3.10/site-packages/datasets/load.py:744\u001b[0m, in \u001b[0;36mHubDatasetModuleFactoryWithoutScript.__init__\u001b[0;34m(self, name, revision, data_dir, data_files, download_config, download_mode)\u001b[0m\n\u001b[1;32m 742\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdownload_config \u001b[38;5;241m=\u001b[39m download_config \u001b[38;5;129;01mor\u001b[39;00m DownloadConfig()\n\u001b[1;32m 743\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdownload_mode \u001b[38;5;241m=\u001b[39m download_mode\n\u001b[0;32m--> 744\u001b[0m \u001b[38;5;28;01massert\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mname\u001b[38;5;241m.\u001b[39mcount(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m/\u001b[39m\u001b[38;5;124m\"\u001b[39m) \u001b[38;5;241m==\u001b[39m \u001b[38;5;241m1\u001b[39m\n\u001b[1;32m 745\u001b[0m increase_load_count(name, resource_type\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mdataset\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
192
- "\u001b[0;31mAssertionError\u001b[0m: "
193
- ]
194
- }
195
- ],
196
- "source": [
197
- "dataset = load_dataset(\"squad\")"
198
- ]
199
- },
200
- {
201
- "cell_type": "code",
202
- "execution_count": null,
203
- "id": "65a7ee23",
204
- "metadata": {},
205
- "outputs": [
206
- {
207
- "ename": "",
208
- "evalue": "",
209
- "output_type": "error",
210
- "traceback": [
211
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
212
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
213
- ]
214
- },
215
- {
216
- "ename": "",
217
- "evalue": "",
218
- "output_type": "error",
219
- "traceback": [
220
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
221
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
222
- ]
223
- }
224
- ],
225
- "source": [
226
- "os.mkdir(\"data/training_squad\")\n",
227
- "os.mkdir(\"data/test_squad\")"
228
- ]
229
- },
230
- {
231
- "cell_type": "code",
232
- "execution_count": null,
233
- "id": "f6ebf63e",
234
- "metadata": {},
235
- "outputs": [
236
- {
237
- "ename": "",
238
- "evalue": "",
239
- "output_type": "error",
240
- "traceback": [
241
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
242
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
243
- ]
244
- },
245
- {
246
- "ename": "",
247
- "evalue": "",
248
- "output_type": "error",
249
- "traceback": [
250
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
251
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
252
- ]
253
- }
254
- ],
255
- "source": [
256
- "# we already have a training and test split. Each sample has an id, title, context, question and answers.\n",
257
- "dataset"
258
- ]
259
- },
260
- {
261
- "cell_type": "code",
262
- "execution_count": null,
263
- "id": "f67ae448",
264
- "metadata": {},
265
- "outputs": [
266
- {
267
- "ename": "",
268
- "evalue": "",
269
- "output_type": "error",
270
- "traceback": [
271
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
272
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
273
- ]
274
- },
275
- {
276
- "ename": "",
277
- "evalue": "",
278
- "output_type": "error",
279
- "traceback": [
280
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
281
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
282
- ]
283
- }
284
- ],
285
- "source": [
286
- "# answers are provided like that - we need to extract answer_end for the model\n",
287
- "dataset['train']['answers'][0]"
288
- ]
289
- },
290
- {
291
- "cell_type": "code",
292
- "execution_count": null,
293
- "id": "101cd650",
294
- "metadata": {},
295
- "outputs": [
296
- {
297
- "ename": "",
298
- "evalue": "",
299
- "output_type": "error",
300
- "traceback": [
301
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
302
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
303
- ]
304
- },
305
- {
306
- "ename": "",
307
- "evalue": "",
308
- "output_type": "error",
309
- "traceback": [
310
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
311
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
312
- ]
313
- }
314
- ],
315
- "source": [
316
- "# column contains the split (either train or validation), save_dir is the directory\n",
317
- "def save_samples(column, save_dir):\n",
318
- " text = []\n",
319
- " i = 0\n",
320
- "\n",
321
- " for sample in tqdm(dataset[column]):\n",
322
- " \n",
323
- " # preprocess the context and question by removing the newlines\n",
324
- " context = sample['context'].replace('\\n','')\n",
325
- " question = sample['question'].replace('\\n','')\n",
326
- "\n",
327
- " # get the answer as text and start character index\n",
328
- " answer_text = sample['answers']['text'][0]\n",
329
- " answer_start = str(sample['answers']['answer_start'][0])\n",
330
- " \n",
331
- " text.append([context, question, answer_text, answer_start])\n",
332
- "\n",
333
- " # we choose chunks of 1000\n",
334
- " if len(text) == 1000:\n",
335
- " with open(f\"data/{save_dir}/text_{i}.txt\", 'w', encoding='utf-8') as f:\n",
336
- " f.write(\"\\n\".join([\"\\t\".join(t) for t in text]))\n",
337
- " text = []\n",
338
- " i += 1\n",
339
- "\n",
340
- " # save remaining\n",
341
- " with open(f\"data/{save_dir}/text_{i}.txt\", 'w', encoding='utf-8') as f:\n",
342
- " f.write(\"\\n\".join([\"\\t\".join(t) for t in text]))\n",
343
- "\n",
344
- "save_samples(\"train\", \"training_squad\")\n",
345
- "save_samples(\"validation\", \"test_squad\")\n",
346
- " "
347
- ]
348
- },
349
- {
350
- "cell_type": "markdown",
351
- "id": "67044d13",
352
- "metadata": {
353
- "collapsed": false,
354
- "jupyter": {
355
- "outputs_hidden": false
356
- }
357
- },
358
- "source": [
359
- "### Testing\n",
360
- "If we load a file, we should get a file with 1000 lines and 4 columns\n",
361
- "\n",
362
- "Also, we want to assure the correct interval. Hence, the second test."
363
- ]
364
- },
365
- {
366
- "cell_type": "code",
367
- "execution_count": null,
368
- "id": "446281cf",
369
- "metadata": {},
370
- "outputs": [
371
- {
372
- "ename": "",
373
- "evalue": "",
374
- "output_type": "error",
375
- "traceback": [
376
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
377
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
378
- ]
379
- },
380
- {
381
- "ename": "",
382
- "evalue": "",
383
- "output_type": "error",
384
- "traceback": [
385
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
386
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
387
- ]
388
- }
389
- ],
390
- "source": [
391
- "with open(\"data/training_squad/text_0.txt\", 'r', encoding='utf-8') as f:\n",
392
- " lines = f.read().split('\\n')\n",
393
- " \n",
394
- "lines = pd.DataFrame([line.split(\"\\t\") for line in lines], columns=[\"context\", \"question\", \"answer\", \"answer_start\"])"
395
- ]
396
- },
397
- {
398
- "cell_type": "code",
399
- "execution_count": null,
400
- "id": "ccd5c650",
401
- "metadata": {},
402
- "outputs": [
403
- {
404
- "ename": "",
405
- "evalue": "",
406
- "output_type": "error",
407
- "traceback": [
408
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
409
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
410
- ]
411
- },
412
- {
413
- "ename": "",
414
- "evalue": "",
415
- "output_type": "error",
416
- "traceback": [
417
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
418
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
419
- ]
420
- }
421
- ],
422
- "source": [
423
- "assert lines.shape==(1000,4)\n",
424
- "print(\"Passed\")"
425
- ]
426
- },
427
- {
428
- "cell_type": "code",
429
- "execution_count": null,
430
- "id": "2c9e4b70",
431
- "metadata": {},
432
- "outputs": [
433
- {
434
- "ename": "",
435
- "evalue": "",
436
- "output_type": "error",
437
- "traceback": [
438
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
439
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
440
- ]
441
- },
442
- {
443
- "ename": "",
444
- "evalue": "",
445
- "output_type": "error",
446
- "traceback": [
447
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
448
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
449
- ]
450
- }
451
- ],
452
- "source": [
453
- "# we assert that we have the right interval\n",
454
- "for ind, line in lines.iterrows():\n",
455
- " sample = line\n",
456
- " answer_start = int(sample['answer_start'])\n",
457
- " assert sample['context'][answer_start:answer_start+len(sample['answer'])] == sample['answer']\n",
458
- "print(\"Passed\")"
459
- ]
460
- },
461
- {
462
- "cell_type": "markdown",
463
- "id": "02265ace",
464
- "metadata": {},
465
- "source": [
466
- "## Natural Questions Dataset\n",
467
- "* Download from https://ai.google.com/research/NaturalQuestions via gsutil (the one from huggingface has 134.92GB, the one from google cloud is in archives)\n",
468
- "* Use gunzip to get some samples - we then get `.jsonl` files\n",
469
- "* The dataset is a lot more messy, as it is just wikipedia articles with all web artifacts\n",
470
- " * I cleaned the html tags\n",
471
- " * Also I chose a random interval (containing the answer) from the dataset\n",
472
- " * We can't send the whole text into the model anyways"
473
- ]
474
- },
475
- {
476
- "cell_type": "code",
477
- "execution_count": null,
478
- "id": "f3bce0c1",
479
- "metadata": {},
480
- "outputs": [
481
- {
482
- "ename": "",
483
- "evalue": "",
484
- "output_type": "error",
485
- "traceback": [
486
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
487
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
488
- ]
489
- },
490
- {
491
- "ename": "",
492
- "evalue": "",
493
- "output_type": "error",
494
- "traceback": [
495
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
496
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
497
- ]
498
- }
499
- ],
500
- "source": [
501
- "from pathlib import Path\n",
502
- "paths = [str(x) for x in Path('data/natural_questions/v1.0/train/').glob('**/*.jsonl')]"
503
- ]
504
- },
505
- {
506
- "cell_type": "code",
507
- "execution_count": null,
508
- "id": "e9c58c00",
509
- "metadata": {},
510
- "outputs": [
511
- {
512
- "ename": "",
513
- "evalue": "",
514
- "output_type": "error",
515
- "traceback": [
516
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
517
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
518
- ]
519
- },
520
- {
521
- "ename": "",
522
- "evalue": "",
523
- "output_type": "error",
524
- "traceback": [
525
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
526
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
527
- ]
528
- }
529
- ],
530
- "source": [
531
- "os.mkdir(\"data/natural_questions_train\")"
532
- ]
533
- },
534
- {
535
- "cell_type": "code",
536
- "execution_count": null,
537
- "id": "0ed7ba6c",
538
- "metadata": {},
539
- "outputs": [
540
- {
541
- "ename": "",
542
- "evalue": "",
543
- "output_type": "error",
544
- "traceback": [
545
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
546
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
547
- ]
548
- },
549
- {
550
- "ename": "",
551
- "evalue": "",
552
- "output_type": "error",
553
- "traceback": [
554
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
555
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
556
- ]
557
- }
558
- ],
559
- "source": [
560
- "import re\n",
561
- "\n",
562
- "# clean html tags\n",
563
- "CLEANR = re.compile('<.+?>')\n",
564
- "# clean multiple spaces\n",
565
- "CLEANMULTSPACE = re.compile('(\\s)+')\n",
566
- "\n",
567
- "# the function takes an html documents and removes artifacts\n",
568
- "def cleanhtml(raw_html):\n",
569
- " # tags\n",
570
- " cleantext = re.sub(CLEANR, '', raw_html)\n",
571
- " # newlines\n",
572
- " cleantext = cleantext.replace(\"\\n\", '')\n",
573
- " # tabs\n",
574
- " cleantext = cleantext.replace(\"\\t\", '')\n",
575
- " # character encodings\n",
576
- " cleantext = cleantext.replace(\"&#39;\", \"'\")\n",
577
- " cleantext = cleantext.replace(\"&amp;\", \"'\")\n",
578
- " cleantext = cleantext.replace(\"&quot;\", '\"')\n",
579
- " # multiple spaces\n",
580
- " cleantext = re.sub(CLEANMULTSPACE, ' ', cleantext)\n",
581
- " # documents end with this tags, if it is present in the string, cut it off\n",
582
- " idx = cleantext.find(\"<!-- NewPP limit\")\n",
583
- " if idx > -1:\n",
584
- " cleantext = cleantext[:idx]\n",
585
- " return cleantext.strip()"
586
- ]
587
- },
588
- {
589
- "cell_type": "code",
590
- "execution_count": null,
591
- "id": "66ca19ac",
592
- "metadata": {},
593
- "outputs": [
594
- {
595
- "ename": "",
596
- "evalue": "",
597
- "output_type": "error",
598
- "traceback": [
599
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
600
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
601
- ]
602
- },
603
- {
604
- "ename": "",
605
- "evalue": "",
606
- "output_type": "error",
607
- "traceback": [
608
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
609
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
610
- ]
611
- }
612
- ],
613
- "source": [
614
- "import json\n",
615
- "\n",
616
- "# file count\n",
617
- "i = 0\n",
618
- "data = []\n",
619
- "\n",
620
- "# iterate over all json files\n",
621
- "for path in paths:\n",
622
- " print(path)\n",
623
- " # read file and store as list (this requires much memory, as the files are huge)\n",
624
- " with open(path, 'r') as json_file:\n",
625
- " json_list = list(json_file)\n",
626
- " \n",
627
- " # process every context, question, answer pair\n",
628
- " for json_str in json_list:\n",
629
- " result = json.loads(json_str)\n",
630
- "\n",
631
- " # append a question mark - SQuAD questions end with a qm too\n",
632
- " question = result['question_text'] + \"?\"\n",
633
- " \n",
634
- " # some question do not contain an answer - we do not need them\n",
635
- " if(len(result['annotations'][0]['short_answers'])==0):\n",
636
- " continue\n",
637
- "\n",
638
- " # get true start/end byte\n",
639
- " true_start = result['annotations'][0]['short_answers'][0]['start_byte']\n",
640
- " true_end = result['annotations'][0]['short_answers'][0]['end_byte']\n",
641
- "\n",
642
- " # convert to bytes\n",
643
- " byte_encoding = bytes(result['document_html'], encoding='utf-8')\n",
644
- " \n",
645
- " # the document is the whole wikipedia article, we randomly choose an appropriate part (containing the\n",
646
- " # answer): we have 512 tokens as the input for the model - 4000 bytes lead to a good length\n",
647
- " max_back = 3500 if true_start >= 3500 else true_start\n",
648
- " first = random.randint(int(true_start)-max_back, int(true_start))\n",
649
- " end = first + 3500 + true_end - true_start\n",
650
- " \n",
651
- " # get chosen context\n",
652
- " cleanbytes = byte_encoding[first:end]\n",
653
- " # decode back to text - if our end byte is the middle of a word, we ignore it and cut it off\n",
654
- " cleantext = bytes.decode(cleanbytes, errors='ignore')\n",
655
- " # clean html tags\n",
656
- " cleantext = cleanhtml(cleantext)\n",
657
- "\n",
658
- " # find the true answer\n",
659
- " answer_start = cleanbytes.find(byte_encoding[true_start:true_end])\n",
660
- " true_answer = bytes.decode(cleanbytes[answer_start:answer_start+(true_end-true_start)])\n",
661
- " \n",
662
- " # clean html tags\n",
663
- " true_answer = cleanhtml(true_answer)\n",
664
- " \n",
665
- " start_ind = cleantext.find(true_answer)\n",
666
- " \n",
667
- " # If cleaning the string makes the answer not findable skip it\n",
668
- "    # this hardly ever happens, except if there is an immense amount of web artifacts\n",
669
- " if start_ind == -1:\n",
670
- " continue\n",
671
- " \n",
672
- " data.append([cleantext, question, true_answer, str(start_ind)])\n",
673
- "\n",
674
- " if len(data) == 1000:\n",
675
- " with open(f\"data/natural_questions_train/text_{i}.txt\", 'w', encoding='utf-8') as f:\n",
676
- " f.write(\"\\n\".join([\"\\t\".join(t) for t in data]))\n",
677
- " i += 1\n",
678
- " data = []\n",
679
- "with open(f\"data/natural_questions_train/text_{i}.txt\", 'w', encoding='utf-8') as f:\n",
680
- " f.write(\"\\n\".join([\"\\t\".join(t) for t in data]))"
681
- ]
682
- },
683
- {
684
- "cell_type": "markdown",
685
- "id": "30f26b4e",
686
- "metadata": {},
687
- "source": [
688
- "### Testing\n",
689
- "In the following, we first check if the shape of the file is correct.\n",
690
- "\n",
691
- "Then we iterate over the file and check if the answers according to the file are the same as in the original file."
692
- ]
693
- },
694
- {
695
- "cell_type": "code",
696
- "execution_count": null,
697
- "id": "490ac0db",
698
- "metadata": {},
699
- "outputs": [
700
- {
701
- "ename": "",
702
- "evalue": "",
703
- "output_type": "error",
704
- "traceback": [
705
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
706
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
707
- ]
708
- },
709
- {
710
- "ename": "",
711
- "evalue": "",
712
- "output_type": "error",
713
- "traceback": [
714
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
715
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
716
- ]
717
- }
718
- ],
719
- "source": [
720
- "with open(\"data/natural_questions_train/text_0.txt\", 'r', encoding='utf-8') as f:\n",
721
- " lines = f.read().split('\\n')\n",
722
- " \n",
723
- "lines = pd.DataFrame([line.split(\"\\t\") for line in lines], columns=[\"context\", \"question\", \"answer\", \"answer_start\"])"
724
- ]
725
- },
726
- {
727
- "cell_type": "code",
728
- "execution_count": null,
729
- "id": "0d7cc3ee",
730
- "metadata": {},
731
- "outputs": [
732
- {
733
- "ename": "",
734
- "evalue": "",
735
- "output_type": "error",
736
- "traceback": [
737
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
738
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
739
- ]
740
- },
741
- {
742
- "ename": "",
743
- "evalue": "",
744
- "output_type": "error",
745
- "traceback": [
746
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
747
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
748
- ]
749
- }
750
- ],
751
- "source": [
752
- "assert lines.shape == (1000, 4)\n",
753
- "print(\"Passed\")"
754
- ]
755
- },
756
- {
757
- "cell_type": "code",
758
- "execution_count": null,
759
- "id": "0fd8a854",
760
- "metadata": {},
761
- "outputs": [
762
- {
763
- "ename": "",
764
- "evalue": "",
765
- "output_type": "error",
766
- "traceback": [
767
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
768
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
769
- ]
770
- },
771
- {
772
- "ename": "",
773
- "evalue": "",
774
- "output_type": "error",
775
- "traceback": [
776
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
777
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
778
- ]
779
- }
780
- ],
781
- "source": [
782
- "with open(\"data/natural_questions/v1.0/train/nq-train-00.jsonl\", 'r') as json_file:\n",
783
- " json_list = list(json_file)[:500]\n",
784
- "del json_file"
785
- ]
786
- },
787
- {
788
- "cell_type": "code",
789
- "execution_count": null,
790
- "id": "170bff30",
791
- "metadata": {},
792
- "outputs": [
793
- {
794
- "ename": "",
795
- "evalue": "",
796
- "output_type": "error",
797
- "traceback": [
798
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
799
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
800
- ]
801
- },
802
- {
803
- "ename": "",
804
- "evalue": "",
805
- "output_type": "error",
806
- "traceback": [
807
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
808
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
809
- ]
810
- }
811
- ],
812
- "source": [
813
- "lines_index = 0\n",
814
- "for i in range(len(json_list)):\n",
815
- " result = json.loads(json_list[i])\n",
816
- " \n",
817
- " if(len(result['annotations'][0]['short_answers'])==0):\n",
818
- " pass\n",
819
- " else: \n",
820
- " # assert that the question text is the same\n",
821
- " assert result['question_text'] + \"?\" == lines.loc[lines_index, 'question']\n",
822
- " true_start = result['annotations'][0]['short_answers'][0]['start_byte']\n",
823
- " true_end = result['annotations'][0]['short_answers'][0]['end_byte']\n",
824
- " true_answer = bytes.decode(bytes(result['document_html'], encoding='utf-8')[true_start:true_end])\n",
825
- " \n",
826
- " processed_answer = lines.loc[lines_index, 'answer']\n",
827
- " # assert that the answer is the same\n",
828
- " assert cleanhtml(true_answer) == processed_answer\n",
829
- " \n",
830
- " start_ind = int(lines.loc[lines_index, 'answer_start'])\n",
831
- " # assert that the answer (according to the index) is the same\n",
832
- " assert cleanhtml(true_answer) == lines.loc[lines_index, 'context'][start_ind:start_ind+len(processed_answer)]\n",
833
- " \n",
834
- " lines_index += 1\n",
835
- " \n",
836
- " if lines_index == len(lines):\n",
837
- " break\n",
838
- "print(\"Passed\")"
839
- ]
840
- },
841
- {
842
- "cell_type": "markdown",
843
- "id": "78e6e737",
844
- "metadata": {},
845
- "source": [
846
- "## Hotpot QA"
847
- ]
848
- },
849
- {
850
- "cell_type": "code",
851
- "execution_count": null,
852
- "id": "27efcc8c",
853
- "metadata": {},
854
- "outputs": [
855
- {
856
- "ename": "",
857
- "evalue": "",
858
- "output_type": "error",
859
- "traceback": [
860
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
861
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
862
- ]
863
- },
864
- {
865
- "ename": "",
866
- "evalue": "",
867
- "output_type": "error",
868
- "traceback": [
869
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
870
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
871
- ]
872
- }
873
- ],
874
- "source": [
875
- "ds = load_dataset(\"hotpot_qa\", 'fullwiki')"
876
- ]
877
- },
878
- {
879
- "cell_type": "code",
880
- "execution_count": null,
881
- "id": "1493f21f",
882
- "metadata": {},
883
- "outputs": [
884
- {
885
- "ename": "",
886
- "evalue": "",
887
- "output_type": "error",
888
- "traceback": [
889
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
890
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
891
- ]
892
- },
893
- {
894
- "ename": "",
895
- "evalue": "",
896
- "output_type": "error",
897
- "traceback": [
898
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
899
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
900
- ]
901
- }
902
- ],
903
- "source": [
904
- "ds"
905
- ]
906
- },
907
- {
908
- "cell_type": "code",
909
- "execution_count": null,
910
- "id": "2a047946",
911
- "metadata": {},
912
- "outputs": [
913
- {
914
- "ename": "",
915
- "evalue": "",
916
- "output_type": "error",
917
- "traceback": [
918
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
919
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
920
- ]
921
- },
922
- {
923
- "ename": "",
924
- "evalue": "",
925
- "output_type": "error",
926
- "traceback": [
927
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
928
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
929
- ]
930
- }
931
- ],
932
- "source": [
933
- "os.mkdir('data/hotpotqa_training')\n",
934
- "os.mkdir('data/hotpotqa_test')"
935
- ]
936
- },
937
- {
938
- "cell_type": "code",
939
- "execution_count": null,
940
- "id": "e65b6485",
941
- "metadata": {},
942
- "outputs": [
943
- {
944
- "ename": "",
945
- "evalue": "",
946
- "output_type": "error",
947
- "traceback": [
948
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
949
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
950
- ]
951
- },
952
- {
953
- "ename": "",
954
- "evalue": "",
955
- "output_type": "error",
956
- "traceback": [
957
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
958
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
959
- ]
960
- }
961
- ],
962
- "source": [
963
- "# column contains the split (either train or validation), save_dir is the directory\n",
964
- "def save_samples(column, save_dir):\n",
965
- " text = []\n",
966
- " i = 0\n",
967
- "\n",
968
- " for sample in tqdm(ds[column]):\n",
969
- " \n",
970
- " # preprocess the context and question by removing the newlines\n",
971
- " context = sample['context']['sentences']\n",
972
- " context = \" \".join([\"\".join(sentence) for sentence in context])\n",
973
- " question = sample['question'].replace('\\n','')\n",
974
- " \n",
975
- " # get the answer as text and start character index\n",
976
- " answer_text = sample['answer']\n",
977
- " answer_start = context.find(answer_text)\n",
978
- " if answer_start == -1:\n",
979
- " continue\n",
980
- " \n",
981
- " \n",
982
- " \n",
983
- " if answer_start > 1500:\n",
984
- " first = random.randint(answer_start-1500, answer_start)\n",
985
- " end = first + 1500 + len(answer_text)\n",
986
- " \n",
987
- " context = context[first:end+1]\n",
988
- " answer_start = context.find(answer_text)\n",
989
- " \n",
990
- " if answer_start == -1:continue\n",
991
- " \n",
992
- " text.append([context, question, answer_text, str(answer_start)])\n",
993
- "\n",
994
- " # we choose chunks of 1000\n",
995
- " if len(text) == 1000:\n",
996
- " with open(f\"data/{save_dir}/text_{i}.txt\", 'w', encoding='utf-8') as f:\n",
997
- " f.write(\"\\n\".join([\"\\t\".join(t) for t in text]))\n",
998
- " text = []\n",
999
- " i += 1\n",
1000
- "\n",
1001
- " # save remaining\n",
1002
- " with open(f\"data/{save_dir}/text_{i}.txt\", 'w', encoding='utf-8') as f:\n",
1003
- " f.write(\"\\n\".join([\"\\t\".join(t) for t in text]))\n",
1004
- "\n",
1005
- "save_samples(\"train\", \"hotpotqa_training\")\n",
1006
- "save_samples(\"validation\", \"hotpotqa_test\")"
1007
- ]
1008
- },
1009
- {
1010
- "cell_type": "markdown",
1011
- "id": "97cc358f",
1012
- "metadata": {},
1013
- "source": [
1014
- "## Testing"
1015
- ]
1016
- },
1017
- {
1018
- "cell_type": "code",
1019
- "execution_count": null,
1020
- "id": "f321483c",
1021
- "metadata": {},
1022
- "outputs": [
1023
- {
1024
- "ename": "",
1025
- "evalue": "",
1026
- "output_type": "error",
1027
- "traceback": [
1028
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
1029
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
1030
- ]
1031
- },
1032
- {
1033
- "ename": "",
1034
- "evalue": "",
1035
- "output_type": "error",
1036
- "traceback": [
1037
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
1038
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
1039
- ]
1040
- }
1041
- ],
1042
- "source": [
1043
- "with open(\"data/hotpotqa_training/text_0.txt\", 'r', encoding='utf-8') as f:\n",
1044
- " lines = f.read().split('\\n')\n",
1045
- " \n",
1046
- "lines = pd.DataFrame([line.split(\"\\t\") for line in lines], columns=[\"context\", \"question\", \"answer\", \"answer_start\"])"
1047
- ]
1048
- },
1049
- {
1050
- "cell_type": "code",
1051
- "execution_count": null,
1052
- "id": "72a96e78",
1053
- "metadata": {},
1054
- "outputs": [
1055
- {
1056
- "ename": "",
1057
- "evalue": "",
1058
- "output_type": "error",
1059
- "traceback": [
1060
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
1061
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
1062
- ]
1063
- },
1064
- {
1065
- "ename": "",
1066
- "evalue": "",
1067
- "output_type": "error",
1068
- "traceback": [
1069
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
1070
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
1071
- ]
1072
- }
1073
- ],
1074
- "source": [
1075
- "assert lines.shape == (1000, 4)\n",
1076
- "print(\"Passed\")"
1077
- ]
1078
- },
1079
- {
1080
- "cell_type": "code",
1081
- "execution_count": null,
1082
- "id": "c32c2f16",
1083
- "metadata": {},
1084
- "outputs": [
1085
- {
1086
- "ename": "",
1087
- "evalue": "",
1088
- "output_type": "error",
1089
- "traceback": [
1090
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
1091
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
1092
- ]
1093
- },
1094
- {
1095
- "ename": "",
1096
- "evalue": "",
1097
- "output_type": "error",
1098
- "traceback": [
1099
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
1100
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
1101
- ]
1102
- }
1103
- ],
1104
- "source": [
1105
- "# we assert that we have the right interval\n",
1106
- "for ind, line in lines.iterrows():\n",
1107
- " sample = line\n",
1108
- " answer_start = int(sample['answer_start'])\n",
1109
- " assert sample['context'][answer_start:answer_start+len(sample['answer'])] == sample['answer']\n",
1110
- "print(\"Passed\")"
1111
- ]
1112
- },
1113
- {
1114
- "cell_type": "code",
1115
- "execution_count": null,
1116
- "id": "bc36fe7d",
1117
- "metadata": {},
1118
- "outputs": [
1119
- {
1120
- "ename": "",
1121
- "evalue": "",
1122
- "output_type": "error",
1123
- "traceback": [
1124
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
1125
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
1126
- ]
1127
- },
1128
- {
1129
- "ename": "",
1130
- "evalue": "",
1131
- "output_type": "error",
1132
- "traceback": [
1133
- "\u001b[1;31mnotebook controller is DISPOSED. \n",
1134
- "\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
1135
- ]
1136
- }
1137
- ],
1138
- "source": []
1139
- }
1140
- ],
1141
- "metadata": {
1142
- "kernelspec": {
1143
- "display_name": "Python 3 (ipykernel)",
1144
- "language": "python",
1145
- "name": "python3"
1146
- },
1147
- "language_info": {
1148
- "codemirror_mode": {
1149
- "name": "ipython",
1150
- "version": 3
1151
- },
1152
- "file_extension": ".py",
1153
- "mimetype": "text/x-python",
1154
- "name": "python",
1155
- "nbconvert_exporter": "python",
1156
- "pygments_lexer": "ipython3",
1157
- "version": "3.10.16"
1158
- },
1159
- "toc": {
1160
- "base_numbering": 1,
1161
- "nav_menu": {},
1162
- "number_sections": true,
1163
- "sideBar": true,
1164
- "skip_h1_title": false,
1165
- "title_cell": "Table of Contents",
1166
- "title_sidebar": "Contents",
1167
- "toc_cell": false,
1168
- "toc_position": {},
1169
- "toc_section_display": true,
1170
- "toc_window_display": false
1171
- },
1172
- "varInspector": {
1173
- "cols": {
1174
- "lenName": 16,
1175
- "lenType": 16,
1176
- "lenVar": 40
1177
- },
1178
- "kernels_config": {
1179
- "python": {
1180
- "delete_cmd_postfix": "",
1181
- "delete_cmd_prefix": "del ",
1182
- "library": "var_list.py",
1183
- "varRefreshCmd": "print(var_dic_list())"
1184
- },
1185
- "r": {
1186
- "delete_cmd_postfix": ") ",
1187
- "delete_cmd_prefix": "rm(",
1188
- "library": "var_list.r",
1189
- "varRefreshCmd": "cat(var_dic_list()) "
1190
- }
1191
- },
1192
- "types_to_exclude": [
1193
- "module",
1194
- "function",
1195
- "builtin_function_or_method",
1196
- "instance",
1197
- "_Feature"
1198
- ],
1199
- "window_display": false
1200
- },
1201
- "vscode": {
1202
- "interpreter": {
1203
- "hash": "85bf9c14e9ba73b783ed1274d522bec79eb0b2b739090180d8ce17bb11aff4aa"
1204
- }
1205
- }
1206
- },
1207
- "nbformat": 4,
1208
- "nbformat_minor": 5
1209
- }