# Adding a New Metric
## Before You Start
### Two Types of Metrics
There are two types of metrics in Lighteval:
#### Sample-Level Metrics
- **Purpose**: Evaluate individual samples/predictions
- **Input**: Takes a `Doc` and a `ModelResponse` (the model's prediction)
- **Output**: Returns a float or boolean value for that specific sample
- **Example**: Checking if a model's answer matches the correct answer for one sample
#### Corpus-Level Metrics
- **Purpose**: Compute final scores across the entire dataset/corpus
- **Input**: Takes the results from all sample-level evaluations
- **Output**: Returns a single score representing overall performance
- **Examples**:
  - Simple aggregation: Calculating average accuracy across all test samples
  - Complex metrics: BLEU score where sample-level metric prepares data (tokenization, etc.) and corpus-level metric computes the actual BLEU score
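
To make the split between the two levels concrete, here is a minimal sketch in plain Python. It does not use Lighteval's API: the function names and the token-overlap precision are illustrative assumptions standing in for a real metric such as BLEU. The sample-level step only prepares counts, and the corpus-level step pools them before dividing.

```python
from collections import Counter


def sample_level_overlap(prediction: str, reference: str) -> tuple[int, int]:
    """Per-sample step: count predicted tokens that also occur in the reference."""
    pred_tokens = prediction.lower().split()
    ref_counts = Counter(reference.lower().split())
    matched = 0
    for token in pred_tokens:
        # Each reference token can be matched at most as often as it appears
        if ref_counts[token] > 0:
            matched += 1
            ref_counts[token] -= 1
    return matched, len(pred_tokens)


def corpus_level_precision(per_sample: list[tuple[int, int]]) -> float:
    """Corpus step: pool the counts from every sample, then divide once."""
    total_matched = sum(m for m, _ in per_sample)
    total_tokens = sum(n for _, n in per_sample)
    return total_matched / total_tokens if total_tokens else 0.0


# Hypothetical (prediction, reference) pairs
pairs = [("the cat sat", "the cat sat down"), ("a dog", "the dog barked loudly")]
per_sample = [sample_level_overlap(p, r) for p, r in pairs]
corpus_score = corpus_level_precision(per_sample)
```

Pooling counts before dividing gives a different result from averaging per-sample ratios, which is why such metrics need a dedicated corpus-level function rather than a plain mean.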
### Check Existing Metrics
First, check if you can use one of the parameterized functions in
[Corpus Metrics](package_reference/metrics#corpus-metrics) or
[Sample Metrics](package_reference/metrics#sample-metrics).
If not, you can use the `custom_task` system to register your new metric.
> [!TIP]
> To see an example of a custom metric added along with a custom task, look at the [IFEval custom task](https://github.com/huggingface/lighteval/tree/main/examples/custom_tasks/ifeval).
> [!WARNING]
> To contribute your custom metric to the Lighteval repository, first install
> the required dev dependencies by running `pip install -e .[dev]`,
> then run `pre-commit install` to install the pre-commit hooks.
## Creating a Custom Metric
### Step 1: Create the Metric File
Create a new Python file which should contain the full logic of your metric.
The file also needs to start with these imports:
```python
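import numpy as np  # used by the aggregation examples below
from aenum import extend_enum
from lighteval.metrics import Metrics
from lighteval.metrics.utils.metric_utils import SampleLevelMetric, SampleLevelMetricGrouping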
```
### Step 2: Define the Sample-Level Metric
You need to define a sample-level metric. All sample-level metrics share the same signature: they take a
`~lighteval.types.Doc` and a `~lighteval.types.ModelResponse`, and return a float or a
boolean.
#### Single Metric Example
```python
def custom_metric(doc: Doc, model_response: ModelResponse) -> bool:
    # Compare the first generated text with the gold choice
    response = model_response.final_text[0]
    return response == doc.choices[doc.gold_index]
```
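
Before wiring the metric into Lighteval, it can help to run it against hand-built inputs. The sketch below assumes two hypothetical stand-in classes, `FakeDoc` and `FakeModelResponse`, that carry only the attributes the example metric reads. They are not Lighteval classes and exist only so the snippet runs on its own.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for Doc and ModelResponse, reduced to the attributes
# the example metric reads. They are assumptions for local testing only.
@dataclass
class FakeDoc:
    choices: list
    gold_index: int


@dataclass
class FakeModelResponse:
    final_text: list


def custom_metric(doc, model_response) -> bool:
    # Same logic as the example above, without the type annotations
    response = model_response.final_text[0]
    return response == doc.choices[doc.gold_index]


doc = FakeDoc(choices=["Paris", "Rome"], gold_index=0)
hit = custom_metric(doc, FakeModelResponse(final_text=["Paris"]))
miss = custom_metric(doc, FakeModelResponse(final_text=[" Paris"]))
```

The second call returns `False` because of the leading space: a strict `==` comparison is sensitive to whitespace, so you may want to normalize the response (for example with `str.strip()`) if your task allows it.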
#### Multiple Metrics Example
If you want to return multiple metrics per sample, return a dictionary with the metric names as keys and their scores as values:
```python
def custom_metric(doc: Doc, model_response: ModelResponse) -> dict:
    response = model_response.final_text[0]
    return {"accuracy": response == doc.choices[doc.gold_index], "other_metric": 0.5}
```
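
In the example above, `"other_metric": 0.5` is a placeholder. As a sketch of a dictionary whose values are all computed from the response, the variant below reports an accuracy flag and a word count. The name `custom_metrics`, the `response_length` key, and the `SimpleNamespace` inputs are assumptions for illustration, not Lighteval objects.

```python
from types import SimpleNamespace


def custom_metrics(doc, model_response) -> dict:
    response = model_response.final_text[0]
    return {
        # True when the stripped response equals the gold choice
        "accuracy": response.strip() == doc.choices[doc.gold_index],
        # Number of whitespace-separated tokens in the response
        "response_length": len(response.split()),
    }


# Hypothetical inputs exposing only the attributes the function reads
doc = SimpleNamespace(choices=["4", "5"], gold_index=0)
model_response = SimpleNamespace(final_text=["4"])
scores = custom_metrics(doc, model_response)
```

Every key returned here needs a matching entry when the metric object is declared, so it is worth keeping the key names short and stable.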
### Step 3: Define Aggregation Function (Optional)
You can define an aggregation function if needed. A common aggregation function is `np.mean`; a hand-written one might look like this:
```python
def agg_function(items):
    # Assumes each entry of `items` is itself a list of per-sample scores
    flat_items = [item for sublist in items for item in sublist]
    score = sum(flat_items) / len(flat_items)
    return score
```
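
The `agg_function` above expects nested lists and would raise on a flat list of scalars or on an empty input. The following sketch shows one way to tolerate both shapes. The name `robust_mean` and the choice to return `0.0` for an empty input are assumptions made for this example, not Lighteval behavior.

```python
def robust_mean(items) -> float:
    """Average per-sample scores given either scalars or lists of scalars."""
    flat = []
    for item in items:
        if isinstance(item, (list, tuple)):
            flat.extend(item)  # nested entry: add each inner score
        else:
            flat.append(item)  # scalar entry (float or bool)
    # Assumption: an empty input scores 0.0 instead of raising ZeroDivisionError
    return sum(flat) / len(flat) if flat else 0.0


scalar_score = robust_mean([True, False, True, True])
nested_score = robust_mean([[1.0, 0.0], [1.0]])
empty_score = robust_mean([])
```

Booleans count as `1` and `0` in the sum, so a metric that returns `True`/`False` per sample aggregates to an accuracy between 0 and 1.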
### Step 4: Create the Metric Object
#### Single Metric
If it's a sample-level metric, you can use the following code
with [SampleLevelMetric](/docs/lighteval/pr_1221/en/package_reference/metrics#lighteval.metrics.utils.metric_utils.SampleLevelMetric):
```python
my_custom_metric = SampleLevelMetric(
    metric_name="custom_accuracy",
    higher_is_better=True,
    category=SamplingMethod.GENERATIVE,
    sample_level_fn=custom_metric,
    corpus_level_fn=agg_function,
)
```
#### Multiple Metrics
If your metric defines multiple metrics per sample, you can use the following code
with [SampleLevelMetricGrouping](/docs/lighteval/pr_1221/en/package_reference/metrics#lighteval.metrics.utils.metric_utils.SampleLevelMetricGrouping):
```python
# Named differently from the `custom_metric` function so the function is not shadowed
my_custom_metrics = SampleLevelMetricGrouping(
    metric_name=["accuracy", "response_length", "confidence"],
    higher_is_better={
        "accuracy": True,
        "response_length": False,  # Shorter responses might be better
        "confidence": True,
    },
    category=SamplingMethod.GENERATIVE,
    sample_level_fn=custom_metric,  # should return a dict keyed by the names above
    corpus_level_fn={
        "accuracy": np.mean,
        "response_length": np.mean,
        "confidence": np.mean,
    },
)
```
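
Conceptually, a grouped metric pairs each key of the per-sample dictionaries with its own aggregation function. The sketch below illustrates that idea in plain Python. It is not Lighteval's implementation: `aggregate_grouped` is a hypothetical helper, and `statistics.fmean` and `max` stand in for `np.mean` so the snippet needs no third-party packages.

```python
from statistics import fmean


def aggregate_grouped(per_sample: list[dict], corpus_level_fn: dict) -> dict:
    """Apply each key's aggregation function to that key's per-sample values."""
    results = {}
    for name, fn in corpus_level_fn.items():
        results[name] = fn([sample[name] for sample in per_sample])
    return results


# Hypothetical per-sample outputs of a dictionary-returning metric
per_sample = [
    {"accuracy": True, "response_length": 12},
    {"accuracy": False, "response_length": 8},
]
corpus_scores = aggregate_grouped(
    per_sample,
    {"accuracy": fmean, "response_length": max},
)
```

A key that appears in the aggregation mapping but is missing from a per-sample dictionary raises a `KeyError` in this sketch, which is a reminder to keep the metric names and the returned keys in sync.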
### Step 5: Register the Metric
Finally, add the following code so that your metric is added to the Lighteval metrics list
when the file is loaded as a module:
```python
# Adds the metric to the metric list!
extend_enum(Metrics, "CUSTOM_ACCURACY", my_custom_metric)

if __name__ == "__main__":
    print("Imported metric")
```
## Using Your Custom Metric
### With Custom Tasks
After adding your custom metric to a task config, pass the file to Lighteval with
`--custom-tasks path_to_your_file` at launch. The command comes first, followed by a task config that uses the metric:
```bash
lighteval accelerate \
"model_name=openai-community/gpt2" \
"truthfulqa:mc" \
--custom-tasks path_to_your_metric_file.py
```
```python
from lighteval.tasks.lighteval_task import LightevalTaskConfig

task = LightevalTaskConfig(
    name="my_custom_task",
    metric=[my_custom_metric],  # Use your custom metric here
    prompt_function=my_prompt_function,
    hf_repo="my_dataset",
    evaluation_splits=["test"],
)
```
