{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "qAkXdLL2D25p" }, "source": [ "## Peft model evaluation using [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness)\n", "\n", "In this notebook, we are going to learn how to evaluate the finetuned lora model on the hellaswag task using lm-eval-harness toolkit." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "o52TJHcYD25q", "outputId": "c5482c79-ff56-4ffa-d20c-46c3d30d2cd5" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[33m DEPRECATION: Building 'rouge-score' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'rouge-score'. Discussion can be found at https://github.com/pypa/pip/issues/6334\u001b[0m\u001b[33m\n", "\u001b[0m\u001b[33m DEPRECATION: Building 'sqlitedict' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'sqlitedict'. Discussion can be found at https://github.com/pypa/pip/issues/6334\u001b[0m\u001b[33m\n", "\u001b[0m\u001b[33m DEPRECATION: Building 'word2number' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. 
A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'word2number'. Discussion can be found at https://github.com/pypa/pip/issues/6334\u001b[0m\u001b[33m\n", "\u001b[0m\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.\u001b[0m\u001b[33m\n", "\u001b[0m\n", "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m25.1.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.2\u001b[0m\n", "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpython3 -m pip install --upgrade pip\u001b[0m\n" ] } ], "source": [ "# Install LM-Eval\n", "!pip install -q datasets evaluate lm_eval" ] }, { "cell_type": "markdown", "metadata": { "id": "uhUflrJXD25q" }, "source": [ "### First we will check the accuracy score on the hellaswag task for the base bert without finetuning" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "hwJIYD5KD25q", "outputId": "51e69f81-d048-46b2-9699-658d3ffc5f08" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "7b1ea8948a0747bc98795d6459270044", "version_major": 2, "version_minor": 0 }, "text/plain": [ "README.md: 0.00B [00:00, ?B/s]" ] }, "metadata": {}, "output_type": "display_data" }, { 
"data": { "application/vnd.jupyter.widget-view+json": { "model_id": "4ec51e06812446899b66826c41697f8d", "version_major": 2, "version_minor": 0 }, "text/plain": [ "data/train-00000-of-00001.parquet: 0%| | 0.00/24.4M [00:00, ?B/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "fbd73be60e5a4bc68d1347504a6b7070", "version_major": 2, "version_minor": 0 }, "text/plain": [ "data/test-00000-of-00001.parquet: 0%| | 0.00/6.11M [00:00, ?B/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "0c887d06a56a410eae563e11a3080a52", "version_major": 2, "version_minor": 0 }, "text/plain": [ "data/validation-00000-of-00001.parquet: 0%| | 0.00/6.32M [00:00, ?B/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "adb9c60c23f74f7d9af72ea2e34bc22c", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Generating train split: 0%| | 0/39905 [00:00, ? examples/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "b3418259aeaf4d459d5ed4fe9b8434fc", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Generating test split: 0%| | 0/10003 [00:00, ? examples/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "da448b2a7a534fec9045c086fbda5d0d", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Generating validation split: 0%| | 0/10042 [00:00, ? examples/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "83f04bb57ab94be58080e8c67675a4e5", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Map: 0%| | 0/39905 [00:00, ? 
examples/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "f08ebbc81eaa45818c85e51169ac48cf", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Map: 0%| | 0/10042 [00:00, ? examples/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "100%|█████████████████████████████████████████████████████████████████████████████████████████████| 10042/10042 [00:02<00:00, 4111.19it/s]\n", "Running loglikelihood requests: 0%| | 0/40168 [00:00, ?it/s]We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.\n", "Running loglikelihood requests: 100%|██████████████████████████████████████████████████████████████| 40168/40168 [02:40<00:00, 250.28it/s]\n" ] }, { "data": { "text/plain": [ "{'hellaswag': {'alias': 'hellaswag',\n", " 'acc,none': 0.24915355506871142,\n", " 'acc_stderr,none': 0.004316389476434537,\n", " 'acc_norm,none': 0.244672376020713,\n", " 'acc_norm_stderr,none': 0.004290142029921662}}" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import torch\n", "import lm_eval\n", "\n", "\n", "device = torch.accelerator.current_accelerator().type if hasattr(torch, \"accelerator\") else \"cuda\"\n", "output = lm_eval.simple_evaluate(model = 'hf',\n", " model_args = {\n", " 'pretrained' : 'bert-base-cased',\n", " 'dtype' : 'bfloat16'},\n", " tasks = 'hellaswag',\n", " device = device,\n", " batch_size = 128,\n", " log_samples = False)\n", "output[\"results\"]" ] }, { "cell_type": "markdown", "metadata": { "id": "EZk-1JT7D25r" }, "source": [ "### Now lets try to finetune the bert on the imdb dataset (this is for demonstration and finetuning on imdb may not increase the scores on hellaswag task)" ] }, { "cell_type": "code", "execution_count": 4, 
"metadata": { "id": "FmtVeh7QD25r" }, "outputs": [], "source": [ "# Import necessary libraries\n", "import evaluate\n", "import numpy as np\n", "from datasets import load_dataset\n", "from transformers import AutoTokenizer, BertForSequenceClassification, Trainer, TrainingArguments\n", "\n", "from peft import LoraConfig, TaskType, get_peft_model" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "rHF7tzN9D25r", "outputId": "352ad9ab-2efc-41f8-c3d5-a7da05d9529b" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']\n", "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n", "The 8-bit optimizer is not available on your device, only available on CUDA for now.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "trainable params: 296,450 || all params: 108,608,260 || trainable%: 0.2730\n" ] } ], "source": [ "# Configure LoRA for Sequence Classification\n", "lora_config = LoraConfig(\n", " task_type=TaskType.SEQ_CLS, # Set task type to sequence classification\n", " target_modules=[\"query\", \"key\"] # Specify target modules for LoRA tuning\n", ")\n", "\n", "# Initialize the BERT model for sequence classification\n", "model = BertForSequenceClassification.from_pretrained(\n", " 'bert-base-cased',\n", " num_labels = 2\n", ")\n", "\n", "# Wrap the model with LoRA configuration\n", "model = get_peft_model(model, lora_config)\n", "\n", "model.print_trainable_parameters()\n", "\n", "tokenizer = AutoTokenizer.from_pretrained(\"bert-base-cased\")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 337, "referenced_widgets": [ "ebf724c3ad1443e98763dd279e6fc996", 
"9b4b309603db4847a0d94da76db15116", "93dd1cc84f26479caaa1ded80bbea5ff", "0dc109378f1c43a7b18b1be03cce32ce", "2d2adb2b7a3b41d28736a8b7aba258b1", "568c0efe6ced432d81f45af4acaa921e", "f402e8e510d64d0ca4d4ae2c09a7ddfc", "b3cde05a07b2437b935083d6aac25913", "31bad1f4c7c047a280d490b854a6e911", "db789da312a24fc9944ecb8b617109e7", "0bee940b8667495f9e685f7ba6c3706f", "3142f29a154c4a33a161237d4c605c50", "9b39cdb9f7a14ab4865f360f3c1537dc", "798c00bf640e483cbc4fea744b268461", "1882e91f0b264cbeb90b99a69c7de7f5", "17c7d8bb89184c26966a33dc27ef5517", "55cab288802d49efb930c7641b036f44", "0649702ad9764ad8bf3dfbaa6739686e", "d406b18b855d488cac92b9b96073ba43", "eee89779493748d38c01bb0a74b29e38", "8e862faf12804eaebfd7692db348842b", "7812641b79ae46d48bc88b4c773344c0", "806a2b3f4c4c4ba59370a46c0f8faa85", "38fd040b3a0d44d2adf13c2476f4505a", "3a99edeb7d5e43048fdce29b880c19a5", "1dfc470241c44ce1a0f9ae71fdfdbdf6", "feb83525a43a4c2d818f3ef1ae69d581", "a9548c2f9fd54f73abc3e9c3c0bc9fda", "faa9b111dd9745a29bf7494b95619a1b", "81eef7f1d0c7461cb443b996a5d5163f", "b832fdc7655b4f88a15e19fc8381db47", "da7f0a799616427ba7e93b0080d26d37", "9931cc064e2c400e9830e448c8ef4655", "4fe02a8771814e22b8d954cfdd8b9f86", "d52e5c49b80348fbb55ab39ce0a13f7e", "51f4dabc59d04f0c89128d41d4c184a1", "803fcea81b7b47fb91fb108e2170fa75", "ee5722120e1045e985e5e4ca29a2e192", "8fcd7e9ed5f54287b5cda0ad52a277f4", "10335d3ada7f428588c4faa3f57bbd51", "3d6b95a9f8774341be1976f10fb74679", "3ec913f1b93d4097ad9729156295f9e9", "b29cde97f90f4c489c5cdfd007c96d4f", "d8c8ee9f63b14182a9ded152435c510f", "15dc2c4e42ab48c9ad09aafff29f9278", "1fea1738e9fe48629b09e2ec9351fcd2", "2e2fc856557e40df8400e4b69f7143fc", "bce88474ca6745da99d79bc07216333c", "99bab394a68140f79def33bc6f6499b2", "60695bb251124317a897a6fc56b754ac", "d0582455fe3c449dbde19a47561770b4", "92b9e59f9038485e839598237ec3fd8c", "d69df0074246435f8481fd803863cdb1", "3cc5521074d3411cbc24d0348d3fc314", "4c55b2c1daa4497abf9c9f53e23f83b8", "f4d35fb98b0048ca8bdbe856d182d561", 
"e85940da7bc24b8bb29ce609ba6e5613", "34f8bcc1d9954fb8ba8ecca7a6bd04cd", "7f78e58f4e9c457a9c2d121759efdb09", "2d46765961fe453597645a0b56a9cbc7", "4f7b2a1359bf41cab2ed5663643509a6", "db1b2639fc4944bcbbdfbdaf9150409f", "336ffca0a89e4255a62564ec2600318c", "2bbd92cebbf445d087bcadf82625c6d5", "f1574df0debf4e1b8617e24ffcc39e16", "a9cbf2bbb1f14894886617ce8e60de12", "fdd4b9937b7744d98ba1163efbe1310b", "7823f9d79e1d4350b385e6dfef84b021", "20cd23dc1cd840c89741957f3fcbfdb8", "1c07d8f701604da0989d5f8d88d4bbcd", "00a9858d90d6430eaab54f9e013f077b", "c006e4b791204d10a9d8e7fbc4bceb81", "31664fd452bb43e3aefb87542c747b74", "16afa8cf9ca64a59afd7a4c4f293b479", "4a025adc548b4fcb8c5637f8f6dabc81", "f8d90231390a4211b698e700a66fcb0f", "aa7697dbb00641f19491f13b1a643197", "2d76c76bc4a6433b8fe2b28a1c887ada", "31aa04d4f32a411293f2f729889984b8", "87de47c8821f423d9efc5c7e85297e32", "5f0dfd26cb484695b85d021a5d687503", "d1414b66bc0f4a088c5e4551e8f4ee72", "da58efd0f2b5442c8a284a948f2614f8", "2cd718fb166641d59c8df64cbc637d9c", "2be4c676b5834100b31d7f42ab8bab85", "6eea630bd65745739ff646fbb172f426", "b1c37874948c459583a9f33dc77a6f55", "24cbed7481774ef793a8f204ba5b604b", "716d24b9c1b340a2bf9045b0bf4e7e34", "47789088bcfa4c96a5fd898812c23d17", "580a3b71b23f4a72be9e8633c04e9276", "03f7cfca9e634cf69e3cc70f24832ba3", "987a633e24594228b82e162397a63141", "8ac5b5baeefb4078a74d8c8b2fed6d93", "4bbb4bf49e50489abc875881958c00aa", "5b346ce1eaf649d195fa7dd6058dd196", "d7b4684e53d445e58de1fe155315e093", "886443acd2f14cff93059c093f98cc1b", "24abe5089abc4ecfab75f7601bc98e68", "122d7dd4d02d4df0b6573e100b5e46e3", "db0b4bb4c7a642fb9e8d5c7738e90afc", "f8b71d5cb37549fba6559e2d83531319", "30d18e09421b42c985904750e68740d1", "f67b5bbf7ccb475abe41e08316dc5b37", "e0e6445a2a774ae89729b7e2fb4a14b3", "fdd99c599f9c4ac88c09939ec397ca46", "5ce04c799bb0430386e39af4734e80e6", "404088a4057546968f4e8cfc9e7461e1", "71a7cf52600f4288bd725b7bb93e7299", "e3082a5e8a4144f5982ad478d9a54a2c" ] }, "id": "8cZUKSQLD25r", "outputId": 
"0c0120e6-28f0-4496-a395-6d48d1b159e5" }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "ca48589e7f2f46b49b0c6f0f643cbcc8", "version_major": 2, "version_minor": 0 }, "text/plain": [ "README.md: 0.00B [00:00, ?B/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "0ea02d82b14141c0a0237e8036404c84", "version_major": 2, "version_minor": 0 }, "text/plain": [ "train-00000-of-00001.parquet: 0%| | 0.00/21.0M [00:00, ?B/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c26d8db5c6ac402aa0d1042df7c10858", "version_major": 2, "version_minor": 0 }, "text/plain": [ "test-00000-of-00001.parquet: 0%| | 0.00/20.5M [00:00, ?B/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "a6b57f45ca0e46edb407d7763e7cc141", "version_major": 2, "version_minor": 0 }, "text/plain": [ "unsupervised-00000-of-00001.parquet: 0%| | 0.00/42.0M [00:00, ?B/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "2cb938dfb23240a9b5e2a85c3e6796a6", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Generating train split: 0%| | 0/25000 [00:00, ? examples/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "7e1d3d3c1e414338b44e886a0ed29b8b", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Generating test split: 0%| | 0/25000 [00:00, ? examples/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c6147cbc8124478ea9eb921a2789d20e", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Generating unsupervised split: 0%| | 0/50000 [00:00, ? 
examples/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "ed1329e2506a48b7b30f05a3ded2c230", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Map: 0%| | 0/25000 [00:00, ? examples/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "1c6d66613cfc44d391d9a8bb3c58e732", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Map: 0%| | 0/25000 [00:00, ? examples/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "26681307716846d68eca88409b126248", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Map: 0%| | 0/50000 [00:00, ? examples/s]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# load the dataset\n", "dataset = load_dataset(\"imdb\")\n", "\n", "def tokenize_function(row):\n", " return tokenizer(row[\"text\"], padding=\"max_length\", truncation = True)\n", "\n", "tokenized_datasets = dataset.map(tokenize_function, batched = True)\n", "\n", "train_dataset = tokenized_datasets[\"train\"]\n", "eval_dataset = tokenized_datasets[\"test\"]" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "id": "bA3k0iVED25r" }, "outputs": [], "source": [ "# Define a function to compute evaluation metrics\n", "\n", "def compute_metrics(eval_pred):\n", " logits, labels = eval_pred\n", " predictions = np.argmax(logits, axis=-1)\n", " metric = evaluate.load(\"accuracy\")\n", " return metric.compute(predictions = predictions, references = labels)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 380 }, "id": "DFG74c3kD25s", "outputId": "5dd9f988-95db-4efb-e632-5f741801910a" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "No label_names provided for model class `PeftModelForSequenceClassification`. 
Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.\n" ] }, { "data": { "text/html": [ "\n", "
<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: left;\">\n", "      <th>Epoch</th>\n", "      <th>Training Loss</th>\n", "      <th>Validation Loss</th>\n", "      <th>Accuracy</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <td>1</td>\n", "      <td>0.353800</td>\n", "      <td>0.261258</td>\n", "      <td>0.901160</td>\n", "    </tr>\n", "    <tr>\n", "      <td>2</td>\n", "      <td>0.277400</td>\n", "      <td>0.221651</td>\n", "      <td>0.912480</td>\n", "    </tr>\n", "    <tr>\n", "      <td>3</td>\n", "      <td>0.244500</td>\n", "      <td>0.216107</td>\n", "      <td>0.918200</td>\n", "    </tr>\n", "    <tr>\n", "      <td>4</td>\n", "      <td>0.197000</td>\n", "      <td>0.215257</td>\n", "      <td>0.920040</td>\n", "    </tr>\n", "    <tr>\n", "      <td>5</td>\n", "      <td>0.157700</td>\n", "      <td>0.215050</td>\n", "      <td>0.923240</td>\n", "    </tr>\n", "  </tbody>\n", "</table>"
],
"text/plain": [
"