{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "uG0PEEBEV1P1" }, "source": [ "# Evaluation of LLMs\n", "\n", "![](runtime.png)\n", "\n", "Okay, so we've made our sweet new LLM - but how can we confirm that it's working as intended?\n", "\n", "In this notebook, we'll walk through a few popular methods of evaluating LLMs on various tasks:\n", "\n", "- Metric evaluation, like [Perplexity](https://thegradient.pub/understanding-evaluation-metrics-for-language-models/)\n", "- Human or AI Evaluation\n", "- EleutherAI's [Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) - [Notebook Here](https://colab.research.google.com/drive/1CsaPpqsB21QgQxhJpV22SgwryFFapDBP?usp=sharing)\n", "- Stanford's [HELM](https://github.com/stanford-crfm/helm) - [Notebook here]()\n", "\n", "There's nothing left to do but get started - and we'll start with the most familiar method: Metrics!" ] }, { "cell_type": "markdown", "metadata": { "id": "wfyIOQr9mL-E" }, "source": [ "If you run into CUDA memory issues, please restart the notebook and start from the next section." ] }, { "cell_type": "markdown", "metadata": { "id": "3E997GhwY80E" }, "source": [ "### Base Model\n", "\n", "For this exercise, we'll be using bigscience's `bloom-1b1` as our base model.\n", "\n", "Using the same model throughout keeps the results consistent across all tasks." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "62YhJ9tlZkkX" }, "outputs": [], "source": [ "model_id = \"bigscience/bloom-1b1\"\n" ] }, { "cell_type": "markdown", "metadata": { "id": "BjYz9DvaXgi9" }, "source": [ "### Perplexity\n", "\n", "First things first, perplexity is limited to autoregressive (CausalLM) models. 
That does restrict its usefulness, but not tremendously!\n", "\n", "Secondly, perplexity has a number of pros and cons associated with it:\n", "\n", "Pros:\n", "- Time-efficient: perplexity can be calculated in a single pass, so it's fairly quick to obtain\n", "- Can be used as a signal for over/under-fitting: if perplexity scales proportionally with training data size, it could indicate your model is overfitting\n", "\n", "Cons:\n", "- Doesn't indicate the model's performance on the final task\n", "- Because the perplexity score depends heavily on what text was used to train the model, the scores are not comparable between models or datasets\n", "\n", "That last con is a big one, and is one of the reasons that - while perplexity is useful to calculate - it isn't a great signal of how well your model will perform on its desired task.\n", "\n", "Let's get started by installing the `evaluate` library and some other dependencies we'll use." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "WzLeJ_1BVyE5", "outputId": "701d1847-d5fe-4df5-9d04-055e781227ba" }, "outputs": [], "source": [ "!pip install -q evaluate datasets #transformers torch" ] }, { "cell_type": "markdown", "metadata": { "id": "Ry8sXkzdZ8Qj" }, "source": [ "Now, let's get a small test set of strings we wish to use!" 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 241, "referenced_widgets": [ "9cb1b9c7e84940019a47a73fb00f7579", "9080a2723e3641c987236763c60a96c1", "3e7c87d89fcc452aa280844d8f999436", "f7854fe920e240f6937e671dd969498b", "38902b67b7014c2dae5d44ba91a77a63", "f62e6a89a5dc4461939f2dd39f9719fe", "c4ee26f09058480799cea3dd7531771f", "fa6b178edb234894b87e747796c624e6", "bde5ca0b3bff4bc288362e3de2ab1169", "8c30614fc8ca4962982696f203276476", "4e67b4821a984d16ae54a3cdee34570a", "a12cf70dba3246e5b24d901fc3c0ba2c", "0def0fd83249419eae158ad4b3c57160", "209bac99159f409dba293b04379e279e", "8c88f2f188ce443696751970c7f06b33", "07399d62b84445cc8b6c53ed435c1b82", "a33abf1436744d22a2204f60ab07f5f1", "eb67de2bc0a644deb2dcefb0b32f08f4", "8de822d7d35c4f6ca57c1043fdf49532", "d0cdb2cf824445688569d84c82a29811", "c5eec71552a746fc810b11b49ed74c07", "5460a720aaaa411cb34525f0e094896f", "c683c194ded24e1cade651fc85089fef", "e2a708c734b14881b7ff4de275d5ec28", "035bd136e66d4a27ac69d2ec41c674b4", "a2c25933514c40f9a35013cb0dd98672", "67e34c4361d44c0bbdbff41c50820442", "2758bbc74ac5416bb026a63f332e6fda", "3c82295675384263b5515e8461b823ef", "b59369dd7c674a809aaa25465c4a9435", "4c3fd01693f9450ca1eec52552bc9110", "0e90fb375c1645859b4287a6d427a528", "73e968cf0a354870a99ed065bb3b7022", "1cb5bb403d674024921289f2a4b6cac6", "829581ca6a8141ea92e0c15bdf79a6dc", "4505b62ac8504c31a2febea381cc1dc1", "68cc522e81cc4a93ae5f0bf747c601a3", "56fe0bba730d4b46b9c41ca7b6932911", "bbc5c7a4f69f435d9a0b6deeb6922f89", "0f6ea6523dd24dd2b248e50507df3fea", "c567f0588fee4a86b9014e945ecfad9d", "103512adf43a47b7a4f521853586fc7e", "c8e94a3a8bc34dd184ef78351eb9e6a0", "96316d955e2544d8a52dde4804b4915f", "9a91ccc5d860461caeadac37f5934d13", "2e635ef6fd6e4ee99a36810c0a5d0422", "fa9aef00029d4560ad34c7814a10fcec", "dc5d96ea11c64d6da0861919acf6039e", "4a22c3cfa2eb44a9a891e2c5fcbbd9c1", "75517c0d8ee442f4a86b5a255f0800da", "d70f38b1458245f8b2cbb339b8351c24", 
"716807a2ccf74fa5b153a2def08c91aa", "30b7b0cc11c54bcebf1b2224f7cbb18b", "6330c9f09a3c4f00b45abd2098596263", "9822117d76374d4caed64d69351e1e94", "da7198ba2cb84e4895197ae6c70cd4ac", "21fa973d67604b6d9dee650378732392", "8388b373dac2413e8afc89e9b7dc31a6", "a909c1d6afb147b3900558b595a9d17e", "69ac48d30e32418abd148e6cb1d8c7e3", "15898c37e4864a1ea772eacc3b19652d", "ca6e4d6760cd4ebda0bef530bef16f5f", "95fed3a6fc474ad5a9890f8a42ec6dcc", "bf3517ed126e482097fc4a753ed33680", "61dcba7260e24dcda74de91845963f80", "410f7c41d80b4743aefb3814bdbcbf2d", "628fccaa50f9444fb6cdb558b5a59513", "86db93f9b9404c1e8d325f856327bb4e", "13a780eba686402690bacecd6e21f591", "c461434b04d44bf48a77aa79762689ba", "40408c90feb746a1a89724cebd1cf298", "710c48e5383f4541be8027bd8086bc78", "98431b0c6b2940fea4fedc51cb2ccb90", "33f0a1285f584be5ada2bc883bd3f523", "a97a639ee51f4adb8ae41aaabe616953", "6ea82abe934f4b87ae15266eae3f85cb", "c7750bae0e3c46b899ddaeee60cf4e73" ] }, "id": "stgWCOqWZ_1a", "outputId": "859770bf-b346-4403-ee83-86bbc2c768fa" }, "outputs": [], "source": [ "from datasets import load_dataset\n", "\n", "input_data = load_dataset(\"wikitext\", \"wikitext-2-raw-v1\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "JYzzAv2AaZk4", "outputId": "9e65fb75-40b7-4c4b-8e3d-8034bf0d4826" }, "outputs": [ { "data": { "text/plain": [ "DatasetDict({\n", " test: Dataset({\n", " features: ['text'],\n", " num_rows: 4358\n", " })\n", " train: Dataset({\n", " features: ['text'],\n", " num_rows: 36718\n", " })\n", " validation: Dataset({\n", " features: ['text'],\n", " num_rows: 3760\n", " })\n", "})" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "input_data" ] }, { "cell_type": "markdown", "metadata": { "id": "MneQ7sHrabA1" }, "source": [ "We'll use some of the data present in the `test` split to ensure we're not usings something the model was trained on." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "9WYwMCqmaWfP" }, "outputs": [], "source": [ "test_text = input_data[\"test\"][:50][\"text\"]\n", "\n", "test_text = [text for text in test_text if text != \"\"]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 241, "referenced_widgets": [ "e12d954a509a45089bf5faf3bb8da2ee", "3e2a8ef1bd2643dc81dad126ef002d60", "111f658dbe1b449b89196c11ac1dd5b1", "923e87752173429988f73ac3e451de19", "46ab00a849fd4e11ac85d532a6421a24", "348a0d4656204ff48825b14d001a0da5", "37f15f77c3294d9e9c602046991d1b29", "72f89dcf75064e8f925846592bd8971f", "c9d9b4c290bf4c7db7b4a39ce3cdf9f3", "1596cdb833c543abba6d87d57cc43e7f", "42e3682fc3e5451c8afc212bc9b510d1", "16e6d24bc3324cfcb7dfd9dbca740006", "11d453ee41c4434f9be85a792d404c7b", "ebc429b969174bc9a049107184ef4c99", "761f0296547d41de85bd7d6570cbbb88", "09b24a3a3b3b4627ba44fac21245cd72", "89afb6eb37974d66b1fb1deb5a9b95ab", "cdce2603ac7a489aaa60fca56af7881b", "3a13e4fdd12048ad9d4712e0fb61d83a", "0693065125ed4842910463a05389d2f5", "73a84382855647e3a4a2a4e616fa79d0", "7f2700ab2dc24917bf1f7044015fac75", "3719af27b7a847938aa72a49590d8277", "ded23310147c44159152641043fd56b3", "83a475ba386245bca158bc4980fde991", "5677ff81bbb64cd19e470cb7009f21bb", "647ee609492a477cb31cbdc08d66ea2f", "b435af6c6e024bafa290128765be2598", "3437daaeec0140258d36906a8427c1e8", "597d700e599746bfa47d894eff35198a", "ad9c9f072daa4766b50b4d284c73bb8f", "ec0f708e0b974a9fbfd57d0a25e68ee6", "178cc91e59ab4779b5ef160fbb938b5a", "5d121851051b453d864cfb0921c0ec91", "e4ac43fc178f43429f97468be92e0eb6", "c71806a7fa954a36976f26c562638283", "0c014df93ede496b88776dcb987358a3", "84d46a6ee0f745abb53b2b7a5d5d8f3a", "6e54067c57904a20a21cced0d0981dc2", "2ef413dc710841d98f8d8c345de25082", "68df4e2705b14f26a35adca6e7f9d4f5", "3c601217e33c4fadb21868bbdb0c8043", "35ace5a1b1a847b9a2336a0c4d371f5b", "f5023957abb440e3be2c790b7334acf0", 
"018b214c93a645e0983d6134003a5bfd", "381e425571684d1585c5d84842674327", "b451e3c7ff5642f6911a7b067ac9a7e7", "f60984b08ff4427da640e8480459dd99", "a7ce9f42bad141aeaca47b7c1f3b02ec", "1facb98e56034e7ab0c9a94a33e7298a", "68261e7b3e614c8ea54f4c6fe4ed7588", "760ca72224ff44fbae51b029facb09c2", "5cef23d0adb245d58ec554065f6fd844", "ae0adb20355c4f83a3cf16f3bf3a75db", "4b9d1b09ff0048d681c2bc4ac1499c59", "3ca92e8c0c3f4856ab3aca5022002076", "25422666869b43a8bf0cb1735e38e99b", "df3ac7c6bacb4645830def7afe4a9d2d", "b6ce4ab91d5f48c1bfa4955dcb2ccd30", "d6f417b3a42047ca8858d00f119e96e2", "a9bdebe6759b43e380b2f82e51462085", "1b99a3458572433e8438ca40309f733e", "66a4c68237a14dc4903946126ad0e564", "1d70ed6f29bf4e3bae19c18786ae5ed0", "cbec675c609f4ea98e669af3bc916369", "ee5da907c6af46c99527a127432d77db", "0e3f0031c4144a4cae973daa55bea592", "d3f7af95bea7467cab53dbe89576f7a2", "b12b0144f038476fb625bee495f07124", "8a735a15d7024105a7dc0d33dc923178", "1bb1ca8ab9604b83b23dda8d9c6e7558", "eddf9ea2613b4e679177e55285ab44d3", "ed2a6946240f4a559928bcc96ef64ddc", "17b169ba257541efaaf4f62be99cd023", "4de66a63e35a405b88e7641acc689e73", "c6e3eb87fabe4bfb83dbc88f080971f4", "dc1d1a57873141bc8bc22543cd62b1d3" ] }, "id": "08Jd0w6RZqKT", "outputId": "5f745b5a-e1ee-413a-d807-69a5b3377c5f" }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "670087ec112343fa837317ab1547e812", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading (…)lve/main/config.json: 0%| | 0.00/693 [00:00