{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "yxqqCH-FqfL4" }, "source": [ "### EleutherAI Evaluation Harness\n", "\n", "It's easiest to let EleutherAI explain what they were going for:\n", "\n", ">\"...the LM Evaluation Harness, [is] a unifying framework that allows any causal language model to be tested on the same exact inputs and codebase. This provides a ground-truth location to evaluate new LLMs and saves practitioners time implementing few-shot evaluations repeatedly while ensuring that their results can be compared against previous work. The LM Eval Harness currently supports several different NLP tasks and model frameworks, all with a unified interface and task versioning for reproducibility.\"\n", "\n", "Let's get started with a simple task called `hellaswag`!" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "xfSF5WA3qfqF" }, "outputs": [], "source": [ "# Workaround: force a UTF-8 preferred encoding so shell commands (`!...`) work in notebook environments that report a non-UTF-8 locale\n", "import locale\n", "locale.getpreferredencoding = lambda: \"UTF-8\"" ] }, { "cell_type": "markdown", "metadata": { "id": "XMeOdr0lqmoO" }, "source": [ "First, we'll clone the EleutherAI repository so we can use its evaluation scripts." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "jZEkK1VoqnII", "outputId": "23fc393a-e2b7-4e3b-cac3-868dae464bdf" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Cloning into 'lm-evaluation-harness'...\n", "remote: Enumerating objects: 19181, done.\u001b[K\n", "remote: Counting objects: 100% (5038/5038), done.\u001b[K\n", "remote: Compressing objects: 100% (1385/1385), done.\u001b[K\n", "remote: Total 19181 (delta 3934), reused 4486 (delta 3599), pack-reused 14143\u001b[K\n", "Receiving objects: 100% (19181/19181), 20.07 MiB | 25.81 MiB/s, done.\n", "Resolving deltas: 100% (12760/12760), done.\n" ] } ], "source": [ "!git clone https://github.com/EleutherAI/lm-evaluation-harness" ] }, { "cell_type": "markdown", "metadata": { "id": "lDiK6G23qrHV" }, "source": [ "Next, let's move into the cloned repository and install its dependencies." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "29_UWp69qrcS", "outputId": "4888e8cb-d584-41a7-ba3d-6b4edea8615c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/home/ec2-user/SageMaker/FourthBrain/Building-With-LLMs-EXL-main/Week 3/lm-evaluation-harness\n" ] } ], "source": [ "%cd lm-evaluation-harness/" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "osdz9eocqtNQ", "outputId": "0ab53535-05fc-45a9-c689-e6c5e1ddc1ce" }, "outputs": [], "source": [ "!pip install -q -e ." ] }, { "cell_type": "markdown", "metadata": { "id": "B85uF28Bt07B" }, "source": [ "These evaluations can take a long time!\n", "\n", "The command below shows how to run an evaluation, but you shouldn't run the cell yourself unless you have plenty of time to spare." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "tMerNGv1qvai", "outputId": "9453324a-867f-43a3-cdf6-e7202b9a73dd" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Selected Tasks: ['hellaswag']\n", "Using device 'cuda:0'\n", "Downloading builder script: 100%|██████████| 4.36k/4.36k [00:00<00:00, 7.09MB/s]\n", "Downloading metadata: 100%|████████████████| 2.53k/2.53k [00:00<00:00, 4.32MB/s]\n", "Downloading readme: 100%|██████████████████| 6.85k/6.85k [00:00<00:00, 6.65MB/s]\n", "Downloading data files: 0%| | 0/3 [00:00\n", " main()\n", " File \"/home/ec2-user/SageMaker/FourthBrain/Building-With-LLMs-EXL-main/Week 3/lm-evaluation-harness/main.py\", line 59, in main\n", " results = evaluator.simple_evaluate(\n", " File \"/home/ec2-user/SageMaker/FourthBrain/Building-With-LLMs-EXL-main/Week 3/lm-evaluation-harness/lm_eval/utils.py\", line 243, in _wrapper\n", " return fn(*args, **kwargs)\n", " File \"/home/ec2-user/SageMaker/FourthBrain/Building-With-LLMs-EXL-main/Week 3/lm-evaluation-harness/lm_eval/evaluator.py\", line 105, in simple_evaluate\n", " results = evaluate(\n", " File \"/home/ec2-user/SageMaker/FourthBrain/Building-With-LLMs-EXL-main/Week 3/lm-evaluation-harness/lm_eval/utils.py\", line 243, in _wrapper\n", " return fn(*args, **kwargs)\n", " File \"/home/ec2-user/SageMaker/FourthBrain/Building-With-LLMs-EXL-main/Week 3/lm-evaluation-harness/lm_eval/evaluator.py\", line 305, in evaluate\n", " resps = getattr(lm, reqtype)([req.args for req in reqs])\n", " File \"/home/ec2-user/SageMaker/FourthBrain/Building-With-LLMs-EXL-main/Week 3/lm-evaluation-harness/lm_eval/base.py\", line 922, in fn\n", " rem_res = getattr(self.lm, attr)(remaining_reqs)\n", " File \"/home/ec2-user/SageMaker/FourthBrain/Building-With-LLMs-EXL-main/Week 3/lm-evaluation-harness/lm_eval/base.py\", line 429, in greedy_until\n", " until = request_args[\"until\"]\n", "TypeError: list indices must 
be integers or slices, not str\n" ] } ], "source": [ "### YOUR CODE HERE\n", "!python main.py \\\n", " --model hf-causal \\\n", " --model_args pretrained=bigscience/bloom-1b1 \\\n", " --tasks babi \\\n", " --device cuda:0" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Selected Tasks: ['wsc']\n", "Using device 'cuda:0'\n", "Downloading builder script: 100%|██████████| 30.7k/30.7k [00:00<00:00, 39.8MB/s]\n", "Downloading metadata: 100%|████████████████| 38.7k/38.7k [00:00<00:00, 39.8MB/s]\n", "Downloading readme: 100%|██████████████████| 14.8k/14.8k [00:00<00:00, 24.6MB/s]\n", "Downloading data: 100%|████████████████████| 32.8k/32.8k [00:00<00:00, 40.8MB/s]\n", "Generating train split: 100%|███████| 554/554 [00:00<00:00, 14258.02 examples/s]\n", "Generating validation split: 100%|██| 104/104 [00:00<00:00, 14349.88 examples/s]\n", "Generating test split: 100%|████████| 146/146 [00:00<00:00, 16169.00 examples/s]\n", "Task: wsc; number of docs: 104\n", "Task: wsc; document 0; context prompt (starting on next line):\n", "Passage: Meanwhile, in the forest, the elephants are calling and hunting high and low for Arthur and Celeste, and their mothers are very worried. Fortunately, in flying over the town, an old marabou bird has seen *them* and come back quickly to tell the news.\n", "Question: In the passage above, does the pronoun \"*them*\" refer to \"*the elephants*\"?\n", "Answer:\n", "(end of prompt on previous line)\n", "Requests: (Req_loglikelihood('Passage: Meanwhile, in the forest, the elephants are calling and hunting high and low for Arthur and Celeste, and their mothers are very worried. 
Fortunately, in flying over the town, an old marabou bird has seen *them* and come back quickly to tell the news.\\nQuestion: In the passage above, does the pronoun \"*them*\" refer to \"*the elephants*\"?\\nAnswer:', ' yes')[0]\n", ", Req_loglikelihood('Passage: Meanwhile, in the forest, the elephants are calling and hunting high and low for Arthur and Celeste, and their mothers are very worried. Fortunately, in flying over the town, an old marabou bird has seen *them* and come back quickly to tell the news.\\nQuestion: In the passage above, does the pronoun \"*them*\" refer to \"*the elephants*\"?\\nAnswer:', ' no')[0]\n", ")\n", "Running loglikelihood requests\n", "100%|█████████████████████████████████████████| 202/202 [00:05<00:00, 35.05it/s]\n", "{\n", " \"results\": {\n", " \"wsc\": {\n", " \"acc\": 0.36538461538461536,\n", " \"acc_stderr\": 0.0474473339327792\n", " }\n", " },\n", " \"versions\": {\n", " \"wsc\": 0\n", " },\n", " \"config\": {\n", " \"model\": \"hf-causal\",\n", " \"model_args\": \"pretrained=bigscience/bloom-1b1\",\n", " \"num_fewshot\": 0,\n", " \"batch_size\": null,\n", " \"batch_sizes\": [],\n", " \"device\": \"cuda:0\",\n", " \"no_cache\": false,\n", " \"limit\": null,\n", " \"bootstrap_iters\": 100000,\n", " \"description_dict\": {}\n", " }\n", "}\n", "hf-causal (pretrained=bigscience/bloom-1b1), limit: None, provide_description: False, num_fewshot: 0, batch_size: None\n", "|Task|Version|Metric|Value | |Stderr|\n", "|----|------:|------|-----:|---|-----:|\n", "|wsc | 0|acc |0.3654|± |0.0474|\n", "\n" ] } ], "source": [ "\n", "\n", "### YOUR CODE HERE\n", "!python main.py \\\n", " --model hf-causal \\\n", " --model_args pretrained=bigscience/bloom-1b1 \\\n", " --tasks wsc \\\n", " --device cuda:0" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "accelerator": "GPU", "colab": { "gpuType": "A100", "machine_shape": "hm", "provenance": [] }, "kernelspec": { "display_name": 
"conda_pytorch_p310", "language": "python", "name": "conda_pytorch_p310" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 1 }