{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "yxqqCH-FqfL4"
   },
   "source": [
    "### Eleuther AI Evaluation Harness\n",
    "\n",
    "It's easiest to let Eleuther AI explain what they were going for:\n",
    "\n",
    "\n",
    ">\"...the LM Evaluation Harness, [is] a unifying framework that allows any causal language model to be tested on the same exact inputs and codebase. This provides a ground-truth location to evaluate new LLMs and saves practitioners time implementing few-shot evaluations repeatedly while ensuring that their results can be compared against previous work. The LM Eval Harness currently supports several different NLP tasks and model frameworks, all with a unified interface and task versioning for reproducibility.\"\n",
    "\n",
    "Let's get started with a simple task called `hellaswag`!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "id": "xfSF5WA3qfqF"
   },
   "outputs": [],
   "source": [
    "import locale\n",
    "locale.getpreferredencoding = lambda: \"UTF-8\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "XMeOdr0lqmoO"
   },
   "source": [
    "First, we'll want to clone the Eleuther AI repository so we can use their evaluation scripts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "jZEkK1VoqnII",
    "outputId": "23fc393a-e2b7-4e3b-cac3-868dae464bdf"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cloning into 'lm-evaluation-harness'...\n",
      "remote: Enumerating objects: 19181, done.\u001b[K\n",
      "remote: Counting objects: 100% (5038/5038), done.\u001b[K\n",
      "remote: Compressing objects: 100% (1385/1385), done.\u001b[K\n",
      "remote: Total 19181 (delta 3934), reused 4486 (delta 3599), pack-reused 14143\u001b[K\n",
      "Receiving objects: 100% (19181/19181), 20.07 MiB | 25.81 MiB/s, done.\n",
      "Resolving deltas: 100% (12760/12760), done.\n"
     ]
    }
   ],
   "source": [
    "!git clone https://github.com/EleutherAI/lm-evaluation-harness"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "lDiK6G23qrHV"
   },
   "source": [
    "Next, let's install the required dependencies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "29_UWp69qrcS",
    "outputId": "4888e8cb-d584-41a7-ba3d-6b4edea8615c"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/home/ec2-user/SageMaker/FourthBrain/Building-With-LLMs-EXL-main/Week 3/lm-evaluation-harness\n"
     ]
    }
   ],
   "source": [
    "%cd lm-evaluation-harness/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "osdz9eocqtNQ",
    "outputId": "0ab53535-05fc-45a9-c689-e6c5e1ddc1ce"
   },
   "outputs": [],
   "source": [
    "!pip install -q -e ."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "B85uF28Bt07B"
   },
   "source": [
    "These tests can/will take a long time!\n",
    "\n",
    "While the script is provided to explain how you can run some tests - you shouldn't run this cell yourself unless you have a lot of time!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "tMerNGv1qvai",
    "outputId": "9453324a-867f-43a3-cdf6-e7202b9a73dd"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Selected Tasks: ['hellaswag']\n",
      "Using device 'cuda:0'\n",
      "Downloading builder script: 100%|██████████| 4.36k/4.36k [00:00<00:00, 7.09MB/s]\n",
      "Downloading metadata: 100%|████████████████| 2.53k/2.53k [00:00<00:00, 4.32MB/s]\n",
      "Downloading readme: 100%|██████████████████| 6.85k/6.85k [00:00<00:00, 6.65MB/s]\n",
      "Downloading data files:   0%|                             | 0/3 [00:00<?, ?it/s]\n",
      "Downloading data:   0%|                             | 0.00/12.1M [00:00<?, ?B/s]\u001b[A\n",
      "Downloading data: 12.8MB [00:00, 128MB/s]                                       \u001b[A\n",
      "Downloading data: 27.1MB [00:00, 137MB/s]\u001b[A\n",
      "Downloading data: 47.5MB [00:00, 138MB/s]\u001b[A\n",
      "Downloading data files:  33%|███████              | 1/3 [00:03<00:07,  3.67s/it]\n",
      "Downloading data: 11.8MB [00:00, 138MB/s]                                       \u001b[A\n",
      "Downloading data files:  67%|██████████████       | 2/3 [00:04<00:02,  2.13s/it]\n",
      "Downloading data: 12.2MB [00:00, 135MB/s]                                       \u001b[A\n",
      "Downloading data files: 100%|█████████████████████| 3/3 [00:05<00:00,  1.92s/it]\n",
      "Extracting data files: 100%|████████████████████| 3/3 [00:00<00:00, 3337.64it/s]\n",
      "Generating train split: 100%|███| 39905/39905 [00:03<00:00, 12629.23 examples/s]\n",
      "Generating test split: 100%|████| 10003/10003 [00:00<00:00, 12595.93 examples/s]\n",
      "Generating validation split: 100%|█| 10042/10042 [00:00<00:00, 12512.37 examples\n",
      "Task: hellaswag; number of docs: 10042\n",
      "Task: hellaswag; document 0; context prompt (starting on next line):\n",
      "Personal Care and Style: How to increase breast size with a bra. Check your bra size. Wearing a bra that is too big will not make your breasts look larger. That is why it is important to wear the right size bra for you.\n",
      "(end of prompt on previous line)\n",
      "Requests: [Req_loglikelihood('Personal Care and Style: How to increase breast size with a bra. Check your bra size. Wearing a bra that is too big will not make your breasts look larger. That is why it is important to wear the right size bra for you.', ' You can visit a lingerie shop and have them measure you to help you fit a bra to your size, or measure yourself before you shop for a new bra to ensure that you get a good fit. Use a flexible tape measure, like one found in a sewing kit.')[0]\n",
      ", Req_loglikelihood('Personal Care and Style: How to increase breast size with a bra. Check your bra size. Wearing a bra that is too big will not make your breasts look larger. That is why it is important to wear the right size bra for you.', ' This is why it is important to keep your breasts under protection when in the shower and only wear bras that are larger than your breast size. If you are not wearing a bra, try wearing something that is a little bigger.')[0]\n",
      ", Req_loglikelihood('Personal Care and Style: How to increase breast size with a bra. Check your bra size. Wearing a bra that is too big will not make your breasts look larger. That is why it is important to wear the right size bra for you.', ' For a girl, a bra with a support strap will be easier for her, because most women are unable to pull through bra straps and bras that are too small will not be able to support breasts from side-to-side. Many bras have even been created that cover the breast side, and can be sent to other women in the world to make them look bigger.')[0]\n",
      ", Req_loglikelihood('Personal Care and Style: How to increase breast size with a bra. Check your bra size. Wearing a bra that is too big will not make your breasts look larger. That is why it is important to wear the right size bra for you.', ' Choose a color that is flattering to your breast type and specific event, in addition to those that make you uncomfortable. Look for sports bras made from natural material, such as spandex or lycra, as this is a more breathable bra.')[0]\n",
      "]\n",
      "Running loglikelihood requests\n",
      "100%|█████████████████████████████████████| 40145/40145 [33:25<00:00, 20.02it/s]\n",
      "{\n",
      "  \"results\": {\n",
      "    \"hellaswag\": {\n",
      "      \"acc\": 0.3444532961561442,\n",
      "      \"acc_stderr\": 0.0047421851692647675,\n",
      "      \"acc_norm\": 0.4296952798247361,\n",
      "      \"acc_norm_stderr\": 0.004940208641372079\n",
      "    }\n",
      "  },\n",
      "  \"versions\": {\n",
      "    \"hellaswag\": 0\n",
      "  },\n",
      "  \"config\": {\n",
      "    \"model\": \"hf-causal\",\n",
      "    \"model_args\": \"pretrained=bigscience/bloom-1b1\",\n",
      "    \"num_fewshot\": 0,\n",
      "    \"batch_size\": null,\n",
      "    \"batch_sizes\": [],\n",
      "    \"device\": \"cuda:0\",\n",
      "    \"no_cache\": false,\n",
      "    \"limit\": null,\n",
      "    \"bootstrap_iters\": 100000,\n",
      "    \"description_dict\": {}\n",
      "  }\n",
      "}\n",
      "hf-causal (pretrained=bigscience/bloom-1b1), limit: None, provide_description: False, num_fewshot: 0, batch_size: None\n",
      "|  Task   |Version| Metric |Value |   |Stderr|\n",
      "|---------|------:|--------|-----:|---|-----:|\n",
      "|hellaswag|      0|acc     |0.3445|±  |0.0047|\n",
      "|         |       |acc_norm|0.4297|±  |0.0049|\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!python main.py \\\n",
    "    --model hf-causal \\\n",
    "    --model_args pretrained=bigscience/bloom-1b1 \\\n",
    "    --tasks hellaswag \\\n",
    "    --device cuda:0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "r0bFAUTWtOzc"
   },
   "source": [
    "### Assignment Part 2: \n",
    "\n",
    "Test your model on another task! The task choice is up to you, but you'll need to explain it - and determine the models performance on that task.\n",
    "\n",
    "Again, this task will take a large amount of time - "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "id": "jctZJ2DJtd6l"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Selected Tasks: ['babi']\n",
      "Using device 'cuda:0'\n",
      "Downloading readme: 100%|██████████████████| 2.20k/2.20k [00:00<00:00, 3.90MB/s]\n",
      "Repo card metadata block was not found. Setting CardData to empty.\n",
      "Downloading data files:   0%|                             | 0/3 [00:00<?, ?it/s]\n",
      "Downloading data:   0%|                             | 0.00/6.78M [00:00<?, ?B/s]\u001b[A\n",
      "Downloading data: 100%|████████████████████| 6.78M/6.78M [00:00<00:00, 43.8MB/s]\u001b[A\n",
      "Downloading data files:  33%|███████              | 1/3 [00:00<00:00,  6.41it/s]\n",
      "Downloading data: 100%|██████████████████████| 747k/747k [00:00<00:00, 28.8MB/s]\u001b[A\n",
      "\n",
      "Downloading data:   0%|                             | 0.00/7.56M [00:00<?, ?B/s]\u001b[A\n",
      "Downloading data: 100%|████████████████████| 7.56M/7.56M [00:00<00:00, 55.3MB/s]\u001b[A\n",
      "Downloading data files: 100%|█████████████████████| 3/3 [00:00<00:00,  9.34it/s]\n",
      "Extracting data files: 100%|████████████████████| 3/3 [00:00<00:00, 3468.28it/s]\n",
      "Generating train split: 17109 examples [00:00, 892042.35 examples/s]\n",
      "Generating validation split: 1891 examples [00:00, 623490.99 examples/s]\n",
      "Generating test split: 19000 examples [00:00, 1446889.43 examples/s]\n",
      "Task: babi; number of docs: 19000\n",
      "Task: babi; document 0; context prompt (starting on next line):\n",
      "Julius is a lion.\n",
      "Greg is a frog.\n",
      "Greg is white.\n",
      "Julius is white.\n",
      "Bernhard is a rhino.\n",
      "Brian is a rhino.\n",
      "Lily is a lion.\n",
      "Brian is green.\n",
      "Lily is gray.\n",
      "What color is Bernhard?\n",
      "(end of prompt on previous line)\n",
      "Requests: Req_greedy_until('Julius is a lion.\\nGreg is a frog.\\nGreg is white.\\nJulius is white.\\nBernhard is a rhino.\\nBrian is a rhino.\\nLily is a lion.\\nBrian is green.\\nLily is gray.\\nWhat color is Bernhard?', ['\\n'])[None]\n",
      "\n",
      "Running greedy_until requests\n",
      "  0%|                                                 | 0/17839 [00:00<?, ?it/s]\n",
      "Traceback (most recent call last):\n",
      "  File \"/home/ec2-user/SageMaker/FourthBrain/Building-With-LLMs-EXL-main/Week 3/lm-evaluation-harness/main.py\", line 93, in <module>\n",
      "    main()\n",
      "  File \"/home/ec2-user/SageMaker/FourthBrain/Building-With-LLMs-EXL-main/Week 3/lm-evaluation-harness/main.py\", line 59, in main\n",
      "    results = evaluator.simple_evaluate(\n",
      "  File \"/home/ec2-user/SageMaker/FourthBrain/Building-With-LLMs-EXL-main/Week 3/lm-evaluation-harness/lm_eval/utils.py\", line 243, in _wrapper\n",
      "    return fn(*args, **kwargs)\n",
      "  File \"/home/ec2-user/SageMaker/FourthBrain/Building-With-LLMs-EXL-main/Week 3/lm-evaluation-harness/lm_eval/evaluator.py\", line 105, in simple_evaluate\n",
      "    results = evaluate(\n",
      "  File \"/home/ec2-user/SageMaker/FourthBrain/Building-With-LLMs-EXL-main/Week 3/lm-evaluation-harness/lm_eval/utils.py\", line 243, in _wrapper\n",
      "    return fn(*args, **kwargs)\n",
      "  File \"/home/ec2-user/SageMaker/FourthBrain/Building-With-LLMs-EXL-main/Week 3/lm-evaluation-harness/lm_eval/evaluator.py\", line 305, in evaluate\n",
      "    resps = getattr(lm, reqtype)([req.args for req in reqs])\n",
      "  File \"/home/ec2-user/SageMaker/FourthBrain/Building-With-LLMs-EXL-main/Week 3/lm-evaluation-harness/lm_eval/base.py\", line 922, in fn\n",
      "    rem_res = getattr(self.lm, attr)(remaining_reqs)\n",
      "  File \"/home/ec2-user/SageMaker/FourthBrain/Building-With-LLMs-EXL-main/Week 3/lm-evaluation-harness/lm_eval/base.py\", line 429, in greedy_until\n",
      "    until = request_args[\"until\"]\n",
      "TypeError: list indices must be integers or slices, not str\n"
     ]
    }
   ],
   "source": [
    "### YOUR CODE HERE\n",
    "!python main.py \\\n",
    "    --model hf-causal \\\n",
    "    --model_args pretrained=bigscience/bloom-1b1 \\\n",
    "    --tasks babi \\\n",
    "    --device cuda:0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Selected Tasks: ['wsc']\n",
      "Using device 'cuda:0'\n",
      "Downloading builder script: 100%|██████████| 30.7k/30.7k [00:00<00:00, 39.8MB/s]\n",
      "Downloading metadata: 100%|████████████████| 38.7k/38.7k [00:00<00:00, 39.8MB/s]\n",
      "Downloading readme: 100%|██████████████████| 14.8k/14.8k [00:00<00:00, 24.6MB/s]\n",
      "Downloading data: 100%|████████████████████| 32.8k/32.8k [00:00<00:00, 40.8MB/s]\n",
      "Generating train split: 100%|███████| 554/554 [00:00<00:00, 14258.02 examples/s]\n",
      "Generating validation split: 100%|██| 104/104 [00:00<00:00, 14349.88 examples/s]\n",
      "Generating test split: 100%|████████| 146/146 [00:00<00:00, 16169.00 examples/s]\n",
      "Task: wsc; number of docs: 104\n",
      "Task: wsc; document 0; context prompt (starting on next line):\n",
      "Passage: Meanwhile, in the forest, the elephants are calling and hunting high and low for Arthur and Celeste, and their mothers are very worried. Fortunately, in flying over the town, an old marabou bird has seen *them* and come back quickly to tell the news.\n",
      "Question: In the passage above, does the pronoun \"*them*\" refer to \"*the elephants*\"?\n",
      "Answer:\n",
      "(end of prompt on previous line)\n",
      "Requests: (Req_loglikelihood('Passage: Meanwhile, in the forest, the elephants are calling and hunting high and low for Arthur and Celeste, and their mothers are very worried. Fortunately, in flying over the town, an old marabou bird has seen *them* and come back quickly to tell the news.\\nQuestion: In the passage above, does the pronoun \"*them*\" refer to \"*the elephants*\"?\\nAnswer:', ' yes')[0]\n",
      ", Req_loglikelihood('Passage: Meanwhile, in the forest, the elephants are calling and hunting high and low for Arthur and Celeste, and their mothers are very worried. Fortunately, in flying over the town, an old marabou bird has seen *them* and come back quickly to tell the news.\\nQuestion: In the passage above, does the pronoun \"*them*\" refer to \"*the elephants*\"?\\nAnswer:', ' no')[0]\n",
      ")\n",
      "Running loglikelihood requests\n",
      "100%|█████████████████████████████████████████| 202/202 [00:05<00:00, 35.05it/s]\n",
      "{\n",
      "  \"results\": {\n",
      "    \"wsc\": {\n",
      "      \"acc\": 0.36538461538461536,\n",
      "      \"acc_stderr\": 0.0474473339327792\n",
      "    }\n",
      "  },\n",
      "  \"versions\": {\n",
      "    \"wsc\": 0\n",
      "  },\n",
      "  \"config\": {\n",
      "    \"model\": \"hf-causal\",\n",
      "    \"model_args\": \"pretrained=bigscience/bloom-1b1\",\n",
      "    \"num_fewshot\": 0,\n",
      "    \"batch_size\": null,\n",
      "    \"batch_sizes\": [],\n",
      "    \"device\": \"cuda:0\",\n",
      "    \"no_cache\": false,\n",
      "    \"limit\": null,\n",
      "    \"bootstrap_iters\": 100000,\n",
      "    \"description_dict\": {}\n",
      "  }\n",
      "}\n",
      "hf-causal (pretrained=bigscience/bloom-1b1), limit: None, provide_description: False, num_fewshot: 0, batch_size: None\n",
      "|Task|Version|Metric|Value |   |Stderr|\n",
      "|----|------:|------|-----:|---|-----:|\n",
      "|wsc |      0|acc   |0.3654|±  |0.0474|\n",
      "\n"
     ]
    }
   ],
   "source": [
    "\n",
    "\n",
    "### YOUR CODE HERE\n",
    "!python main.py \\\n",
    "    --model hf-causal \\\n",
    "    --model_args pretrained=bigscience/bloom-1b1 \\\n",
    "    --tasks wsc \\\n",
    "    --device cuda:0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "gpuType": "A100",
   "machine_shape": "hm",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "conda_pytorch_p310",
   "language": "python",
   "name": "conda_pytorch_p310"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}