
	# Adding a New Metric

	## Before You Start

	### Two different types of metrics

	There are two types of metrics in Lighteval:

	#### Sample-Level Metrics
	- Purpose: Evaluate individual samples/predictions
	- Input: Takes a `Doc` and `ModelResponse` (model's prediction)
	- Output: Returns a float or boolean value for that specific sample
	- Example: Checking if a model's answer matches the correct answer for one sample

	#### Corpus-Level Metrics
	- Purpose: Compute final scores across the entire dataset/corpus
	- Input: Takes the results from all sample-level evaluations
	- Output: Returns a single score representing overall performance
	- Examples:
	  - Simple aggregation: calculating average accuracy across all test samples
	  - Complex metrics: BLEU score, where the sample-level metric prepares the data (tokenization, etc.) and the corpus-level metric computes the actual BLEU score
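
The split between the two levels can be sketched in plain Python, without Lighteval. All function names below (`sample_exact_match`, `corpus_mean`, `sample_token_overlap`, `corpus_micro_precision`) are hypothetical and not part of the Lighteval API, and the pooled token precision is only a simplified stand-in for a BLEU-style metric.

```python
from collections import Counter


def sample_exact_match(prediction: str, gold: str) -> bool:
    """Sample level: score a single prediction."""
    return prediction.strip() == gold.strip()


def corpus_mean(sample_scores: list) -> float:
    """Corpus level (simple aggregation): average the per-sample scores."""
    return sum(sample_scores) / len(sample_scores)


def sample_token_overlap(prediction: str, gold: str) -> tuple:
    """Sample level (data preparation): tokenize and count, but do not score yet."""
    pred_tokens = Counter(prediction.split())
    gold_tokens = Counter(gold.split())
    matched = sum((pred_tokens & gold_tokens).values())
    return matched, sum(pred_tokens.values())


def corpus_micro_precision(counts: list) -> float:
    """Corpus level (complex metric): pool all counts, then compute one score."""
    matched = sum(m for m, _ in counts)
    total = sum(t for _, t in counts)
    return matched / total if total else 0.0


predictions = ["the cat sat", "a dog"]
golds = ["the cat sat", "the dog barked"]
pairs = list(zip(predictions, golds))

accuracy = corpus_mean([sample_exact_match(p, g) for p, g in pairs])
precision = corpus_micro_precision([sample_token_overlap(p, g) for p, g in pairs])
```

In the first pair of functions the sample-level step already produces a score and the corpus-level step only averages. In the second pair the sample-level step returns raw counts, and the score exists only once the corpus-level step has pooled them.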

	### Check Existing Metrics

	First, check if you can use one of the parameterized functions in
	[Corpus Metrics](package_reference/metrics#corpus-metrics) or
	[Sample Metrics](package_reference/metrics#sample-metrics).

	If not, you can use the `custom_task` system to register your new metric.

	> [!TIP]
	> To see an example of a custom metric added along with a custom task, look at the [IFEval custom task](https://github.com/huggingface/lighteval/tree/main/examples/custom_tasks/ifeval).

	> [!WARNING]
	> To contribute your custom metric to the Lighteval repository, first install
	> the required dev dependencies by running `pip install -e .[dev]`,
	> then run `pre-commit install` to install the pre-commit hooks.

	## Creating a Custom Metric

	### Step 1: Create the Metric File

	Create a new Python file containing the full logic of your metric.
	The file needs to start with these imports:

	```python
	from aenum import extend_enum
	from lighteval.metrics import Metrics
	```

	### Step 2: Define the Sample-Level Metric

	Define a sample-level metric. All sample-level metrics share the same signature: they take a
	`~lighteval.types.Doc` and a `~lighteval.types.ModelResponse` and return a float or a
	boolean.

	#### Single Metric Example

	```python
	def custom_metric(doc: Doc, model_response: ModelResponse) -> bool:
	    response = model_response.final_text[0]
	    return response == doc.choices[doc.gold_index]
	```
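
To try a function like this without loading a model, you can call it on hand-built inputs. The sketch below uses `SimpleNamespace` stand-ins that expose only the attributes the function reads (`choices`, `gold_index`, `final_text`). They are hypothetical substitutes for illustration, not the real `Doc` and `ModelResponse` classes.

```python
from types import SimpleNamespace


def custom_metric(doc, model_response) -> bool:
    # Same logic as the single-metric example, without type annotations
    # so that it runs without importing Lighteval.
    response = model_response.final_text[0]
    return response == doc.choices[doc.gold_index]


# Hypothetical stand-ins: only the attributes used by custom_metric are defined.
doc = SimpleNamespace(choices=["Paris", "Rome"], gold_index=0)
exact_answer = SimpleNamespace(final_text=["Paris"])
padded_answer = SimpleNamespace(final_text=["Paris "])  # trailing space

result_exact = custom_metric(doc, exact_answer)
result_padded = custom_metric(doc, padded_answer)
```

Because the comparison is a strict string equality, the padded answer does not match. If that is not the behavior you want, normalize both strings (for example with `str.strip()`) before comparing.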

	#### Multiple Metrics Example

	To return multiple metrics per sample, return a dictionary that maps each metric name to its value:

	```python
	def custom_metric(doc: Doc, model_response: ModelResponse) -> dict:
	    response = model_response.final_text[0]
	    return {"accuracy": response == doc.choices[doc.gold_index], "other_metric": 0.5}
	```

	### Step 3: Define Aggregation Function (Optional)

	You can define an aggregation function if needed. A common choice is `np.mean`; you can also write your own, such as this mean over flattened lists of scores:

	```python
	def agg_function(items):
	    # Flatten one level of nesting (each item is expected to be a list of scores), then average.
	    flat_items = [item for sublist in items for item in sublist]
	    score = sum(flat_items) / len(flat_items)
	    return score
	```
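
The aggregation function above flattens its input before averaging, so it expects each item to be an iterable of scores. The sketch below contrasts the two input shapes using only the standard library. `flat_mean` is a hypothetical helper name for a plain, non-flattening mean, and which shape applies depends on what your sample-level function returns.

```python
from statistics import fmean


def agg_function(items):
    # Flatten one level of nesting, then average.
    flat_items = [item for sublist in items for item in sublist]
    return sum(flat_items) / len(flat_items)


def flat_mean(items):
    # Hypothetical helper: plain mean when there is one score per sample.
    return fmean(items)


nested_scores = [[1, 0], [1, 1]]          # several scores per sample
flat_scores = [True, False, True, True]   # one score per sample

nested_result = agg_function(nested_scores)
flat_result = flat_mean(flat_scores)

# Passing the flat list to the flattening version fails,
# because a single bool cannot be iterated over.
try:
    agg_function(flat_scores)
    flattening_failed = False
except TypeError:
    flattening_failed = True
```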

	### Step 4: Create the Metric Object

	#### Single Metric

	If it's a sample-level metric, you can use the following code
	with [SampleLevelMetric](/docs/lighteval/pr_1221/en/package_reference/metrics#lighteval.metrics.utils.metric_utils.SampleLevelMetric):

	```python
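	# Module path as given in the API reference linked above.
	from lighteval.metrics.utils.metric_utils import SampleLevelMetric

	my_custom_metric = SampleLevelMetric(
	    metric_name="custom_accuracy",
	    higher_is_better=True,
	    category=SamplingMethod.GENERATIVE,
	    sample_level_fn=custom_metric,
	    corpus_level_fn=agg_function,
	)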
	```

	#### Multiple Metrics

	If your metric defines multiple metrics per sample, you can use the following code
	with [SampleLevelMetricGrouping](/docs/lighteval/pr_1221/en/package_reference/metrics#lighteval.metrics.utils.metric_utils.SampleLevelMetricGrouping):

	```python
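	import numpy as np
	from lighteval.metrics.utils.metric_utils import SampleLevelMetricGrouping

	# Named differently from the `custom_metric` function so the function is not shadowed.
	my_custom_metric_grouping = SampleLevelMetricGrouping(
	    metric_name=["accuracy", "response_length", "confidence"],
	    higher_is_better={
	        "accuracy": True,
	        "response_length": False,  # Shorter responses might be better
	        "confidence": True,
	    },
	    category=SamplingMethod.GENERATIVE,
	    sample_level_fn=custom_metric,
	    corpus_level_fn={
	        "accuracy": np.mean,
	        "response_length": np.mean,
	        "confidence": np.mean,
	    },
	)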
	```
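
Conceptually, a grouping pairs each key of the per-sample dictionary with its own corpus-level function. The sketch below illustrates that idea in plain Python. `aggregate_grouped` is a hypothetical helper and is not meant to reflect how Lighteval implements this internally. It uses `max` for one metric only to show that each key can have a different aggregation.

```python
from statistics import fmean


def aggregate_grouped(sample_results, corpus_level_fns):
    """Apply each metric's aggregation function to that metric's per-sample values."""
    return {
        name: fn([result[name] for result in sample_results])
        for name, fn in corpus_level_fns.items()
    }


# One dictionary per sample, as a multi-metric sample-level function would return.
sample_results = [
    {"accuracy": True, "response_length": 12, "confidence": 0.9},
    {"accuracy": False, "response_length": 20, "confidence": 0.5},
]

scores = aggregate_grouped(
    sample_results,
    {"accuracy": fmean, "response_length": fmean, "confidence": max},
)
```

In this sketch, a key missing from any sample's dictionary raises `KeyError`. That is one reason to use the same names in the sample-level function's return value, `metric_name`, `higher_is_better`, and `corpus_level_fn`.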

	### Step 5: Register the Metric

	Finally, add the following code so that your metric is added to the Lighteval metrics list
	when the file is loaded as a module:

	```python
	# Adds the metric to the metric list!
	extend_enum(Metrics, "CUSTOM_ACCURACY", my_custom_metric)

	if __name__ == "__main__":
	    print("Imported metric")
	```

	## Using Your Custom Metric

	### With Custom Tasks

	After adding your custom metric to a task config, pass the file to Lighteval
	with `--custom-tasks path_to_your_file` at launch.

	```bash
	lighteval accelerate \
	    "model_name=openai-community/gpt2" \
	    "truthfulqa:mc" \
	    --custom-tasks path_to_your_metric_file.py
	```

	```python
	from lighteval.tasks.lighteval_task import LightevalTaskConfig

	task = LightevalTaskConfig(
	    name="my_custom_task",
	    metric=[my_custom_metric],  # Use your custom metric here
	    prompt_function=my_prompt_function,
	    hf_repo="my_dataset",
	    evaluation_splits=["test"],
	)
	```
