# Evaluation of LLMs

![](runtime.png)

Okay, so we've made our sweet new LLM - but how can we confirm that it's working as intended?

In this notebook, we'll walk through a few popular methods of evaluating LLMs on various tasks:

- Metric evaluation, like [Perplexity](https://thegradient.pub/understanding-evaluation-metrics-for-language-models/)
- Human or AI Evaluation
- Eleuther AI's [Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) - [Notebook Here](https://colab.research.google.com/drive/1CsaPpqsB21QgQxhJpV22SgwryFFapDBP?usp=sharing)
- Stanford's [HELM](https://github.com/stanford-crfm/helm) - [Notebook here]()

There's nothing left to do but get started - and we'll start with the most familiar method: Metrics!

If you run into CUDA memory issues, please restart the notebook and start from the next section.

### Base Model

For this exercise, we'll be using bigscience's `bloom-1b1` as our base model.

This is to ensure we stay consistent across all tasks.

In [1]:
model_id = "bigscience/bloom-1b1"


### Perplexity

First things first, perplexity is limited to autoregressive (CausalLM) models. That does restrict its usefulness, but not tremendously!

Secondly, Perplexity has a number of pros and cons associated with it:

Pros:
- Time-efficient: perplexity can be calculated in a single forward pass, so it's fairly quick to obtain
- Can be used as a signal for over/under-fitting: if perplexity keeps falling on the training data while it rises on held-out data, your model is likely overfitting

Cons:
- Doesn't indicate model's performance on the final task
- Because the perplexity score depends heavily on what text was used to train the model - the scores are not comparable between models or datasets

That last con is a big one, and is one of the reasons that - while perplexity is useful to calculate - it isn't a great signal of how well your model will perform on its desired task.
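To make the definition concrete, here is a minimal sketch of the calculation: perplexity is the exponential of the average negative log-likelihood the model assigns to each true next token. The function name `perplexity_from_log_probs` and the probabilities below are made up for illustration; they are not part of the `evaluate` library.

```python
import math

def perplexity_from_log_probs(token_log_probs):
    """Perplexity = exp(average negative log-likelihood per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Hypothetical probabilities a model assigned to the correct next token
confident = [math.log(p) for p in (0.9, 0.8, 0.95)]
uncertain = [math.log(p) for p in (0.1, 0.2, 0.05)]

print(perplexity_from_log_probs(confident))  # close to 1: rarely surprised
print(perplexity_from_log_probs(uncertain))  # much larger: often surprised
```

A model that spreads its probability evenly over four candidate tokens at every step ends up with a perplexity of 4, which is why perplexity is often read as an "effective number of choices" per token.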

Let's get started by getting the `evaluate` library and some other dependencies we'll use.

In [2]:
!pip install -q evaluate datasets #transformers torch

Now, let's get a small test set of strings we wish to use!

In [2]:
from datasets import load_dataset

input_data = load_dataset("wikitext", "wikitext-2-raw-v1")

In [3]:
input_data

DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['text'],
        num_rows: 36718
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
})

We'll use some of the data present in the `test` split to ensure we're not using something the model was trained on.

In [4]:
test_text = input_data["test"][:50]["text"]

test_text = [text for text in test_text if text != ""]

In [5]:
from evaluate import load

perplexity = load("perplexity", module_type="metric")
results = perplexity.compute(predictions=test_text, model_id=model_id)

In [6]:
results["mean_perplexity"]

264.3619388297752

Perplexity is measured as a score between 1 (the model predicts every token with certainty) and `inf`, so a lower score is better.

In this case, the results are absolutely fine - though unimpressive.

bloom-1b1 was not trained on Wikitext.
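The mean hides a lot of variation. Besides `mean_perplexity`, the dictionary returned by `compute` should also carry a `perplexities` list with one score per input string, so we can look at which texts the model found hardest. The helper below is a hypothetical sketch (the name `summarize_perplexities` and the sample numbers are ours, not real model output):

```python
import statistics

def summarize_perplexities(results, texts, top_k=3):
    """Summarize per-text perplexities and list the hardest inputs."""
    ppls = results["perplexities"]
    if len(ppls) != len(texts):
        raise ValueError("expected one perplexity per input text")
    # Highest perplexity first: these are the texts the model found most surprising
    ranked = sorted(zip(ppls, texts), key=lambda pair: pair[0], reverse=True)
    return {
        "mean": statistics.mean(ppls),
        "median": statistics.median(ppls),
        "hardest": ranked[:top_k],
    }

# Made-up numbers purely for illustration, not real model output
fake_results = {"perplexities": [12.0, 950.0, 48.0, 30.0]}
fake_texts = ["plain sentence", " = Heading = ", "another sentence", "a third one"]

summary = summarize_perplexities(fake_results, fake_texts, top_k=1)
print(summary["mean"], summary["median"], summary["hardest"])
```

In the notebook you would call it as `summarize_perplexities(results, test_text)`. A median far below the mean usually means a handful of short or unusual strings (headings, for instance) are dominating the average.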

### Human or AI Evaluation

Now, let's get into how we could evaluate the model's actual final output - with human or AI supervision!

The idea here is that we ask the model to perform a task - and then have a human being (or another, stronger model) judge the result.

This method similarly comes with some pros and cons:

Pros:
- Should provide excellent feedback on whether or not your model is performing as expected

Cons:
- Extremely expensive

Since we're going to be leveraging AI in this example, you will need an OpenAI API key!

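Also, we're going to use an instruct-tuned version of Bloom (`bigscience/bloomz-1b7`) to gauge how well it follows instructions! Note that the cell below loads the base `bigscience/bloom-1b1` checkpoint, so swap in a BLOOMZ model name there to evaluate the instruct-tuned variant.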

In [7]:
!pip install -q openai accelerate


In [8]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "bigscience/bloom-1b1"

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

prompt_template = "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:"

In [9]:
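import os
from getpass import getpass

# Never hard-code API keys in a notebook; prompt for the key if it isn't already set
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")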

We're going to be using a short list of instructions to score the models - the idea can be extended as far as you'd like!

In [10]:
list_of_instructions = [
    "Give three tips for staying healthy.",
    "What are the three primary colors?",
    "Describe a time when you had to make a difficult decision.",
]

In [11]:
def get_model_response(text):
  # Fill the instruction into the prompt template and tokenize it
  prompt_text = prompt_template.format(instruction=text)
  input_ids = tokenizer(prompt_text, return_tensors="pt").input_ids.to("cpu")

  # Run on CPU in full precision
  model.to("cpu")
  model.float()

  generated = model.generate(input_ids, max_length=512)

  # Drop the prompt tokens so only the newly generated text is decoded
  new_tokens = generated[0, input_ids.shape[1]:]
  return tokenizer.decode(new_tokens, skip_special_tokens=True)



In [12]:
# Requires openai>=1.0; older releases used openai.ChatCompletion.create instead
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

for prompt in list_of_instructions:
  output_to_test = get_model_response(prompt)

  # The system message poses the grading question; the user message carries our model's answer
  gpt_35_turbo_prompt = [
      {"role": "system",
       "content": f"Is the following a good response to this instruction: {prompt}"},
      {"role": "user",
       "content": output_to_test},
  ]

  print(prompt)

  completion = client.chat.completions.create(
      model="gpt-3.5-turbo",
      messages=gpt_35_turbo_prompt
  )
  print(completion.choices[0].message.content)

  print("-----------------")

Give three tips for staying healthy.
Yes, this is a good response to the instruction. It provides three clear and concise tips for staying healthy: exercise regularly, eat a balanced diet, and avoid smoking.
-----------------
What are the three primary colors?
primary colors?

### Response:
The three primary colors are red, blue, and yellow.
-----------------
Describe a time when you had to make a difficult decision.
No, this response is not appropriate. It does not provide any information or insight about a difficult decision.
-----------------
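Notice that for the second instruction the judge answered the question itself instead of rating the response. One way to make the judge more predictable is to ask for a fixed output format and parse it. The following is only a sketch under our own assumptions: the prompt wording, the 1-5 scale, and the helper names `build_judge_messages` and `parse_score` are hypothetical, not part of the OpenAI API.

```python
import re

# Hypothetical grading instructions; adjust the wording and scale to your task
JUDGE_SYSTEM_PROMPT = (
    "You are grading another model's response to an instruction. "
    "Do not answer the instruction yourself. "
    "Reply with 'Score: N' where N is an integer from 1 (poor) to 5 (excellent), "
    "followed by a one-sentence justification."
)

def build_judge_messages(instruction, response):
    """Build chat messages that clearly separate the instruction from the response."""
    return [
        {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
        {"role": "user",
         "content": f"Instruction:\n{instruction}\n\nResponse:\n{response}"},
    ]

def parse_score(judge_reply):
    """Return the 1-5 score from the judge's reply, or None if it is missing."""
    match = re.search(r"Score:\s*([1-5])\b", judge_reply)
    return int(match.group(1)) if match else None

messages = build_judge_messages(
    "What are the three primary colors?",
    "Red, blue, and yellow.",
)
print(messages[1]["content"])
print(parse_score("Score: 4. Correct and concise."))
```

The `messages` list can be passed straight to the chat completion call used above. Returning `None` for replies that ignore the format lets you count (and retry) judge failures rather than silently treating them as scores.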


Try it out yourself!