# Adding a New Metric

## Before You Start

### Two different types of metrics

There are two types of metrics in Lighteval:

#### Sample-Level Metrics

- **Purpose**: Evaluate individual samples/predictions
- **Input**: Takes a `Doc` and `ModelResponse` (model's prediction)
- **Output**: Returns a float or boolean value for that specific sample
- **Example**: Checking if a model's answer matches the correct answer for one sample

#### Corpus-Level Metrics

- **Purpose**: Compute final scores across the entire dataset/corpus
- **Input**: Takes the results from all sample-level evaluations
- **Output**: Returns a single score representing overall performance
- **Examples**:
  - Simple aggregation: Calculating average accuracy across all test samples
  - Complex metrics: BLEU score where sample-level metric prepares data (tokenization, etc.) and corpus-level metric computes the actual BLEU score
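
The split between the two levels can be sketched in plain Python. The example below is only an illustration of the pattern, not Lighteval code and not BLEU: the hypothetical `sample_level_counts` function tokenizes one sample and returns raw counts, and the hypothetical `corpus_level_precision` function pools those counts into a single score.

```python
from collections import Counter


def sample_level_counts(reference: str, prediction: str) -> tuple[int, int]:
    """Sample level: tokenize and count predicted tokens also found in the reference."""
    ref_counts = Counter(reference.split())
    pred_tokens = prediction.split()
    matched = 0
    for token in pred_tokens:
        # Each reference token can be matched at most once.
        if ref_counts[token] > 0:
            ref_counts[token] -= 1
            matched += 1
    return matched, len(pred_tokens)


def corpus_level_precision(per_sample: list[tuple[int, int]]) -> float:
    """Corpus level: pool the counts of all samples into one score."""
    total_matched = sum(matched for matched, _ in per_sample)
    total_predicted = sum(predicted for _, predicted in per_sample)
    return total_matched / total_predicted if total_predicted else 0.0


samples = [
    ("the cat sat", "the cat sat down"),  # 3 of 4 predicted tokens match
    ("hello world", "hello"),             # 1 of 1 predicted tokens match
]
per_sample = [sample_level_counts(ref, pred) for ref, pred in samples]
score = corpus_level_precision(per_sample)  # (3 + 1) / (4 + 1) = 0.8
```

Pooling counts at the corpus level gives a different result from averaging per-sample ratios, which is why some metrics need a dedicated corpus-level function rather than a plain mean.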
### Check Existing Metrics

First, check whether one of the parameterized functions in
[Corpus Metrics](package_reference/metrics#corpus-metrics) or
[Sample Metrics](package_reference/metrics#sample-metrics) already covers your use case.

If not, you can use the `custom_task` system to register your new metric.

> [!TIP]
> To see an example of a custom metric added along with a custom task, look at the [IFEval custom task](https://github.com/huggingface/lighteval/tree/main/examples/custom_tasks/ifeval).

> [!WARNING]
> To contribute your custom metric to the Lighteval repository, first install
> the required dev dependencies by running `pip install -e .[dev]`,
> then run `pre-commit install` to install the pre-commit hooks.
## Creating a Custom Metric

### Step 1: Create the Metric File

Create a new Python file that contains the full logic of your metric.
The file must start with these imports:

```python
from aenum import extend_enum
from lighteval.metrics import Metrics
```
### Step 2: Define the Sample-Level Metric

You need to define a sample-level metric. All sample-level metrics have the same signature: they take a
`~lighteval.types.Doc` and a `~lighteval.types.ModelResponse` and return a float or a
boolean.

#### Single Metric Example

```python
def custom_metric(doc: Doc, model_response: ModelResponse) -> bool:
    # Compare the first generated text with the gold choice.
    response = model_response.final_text[0]
    return response == doc.choices[doc.gold_index]
```

#### Multiple Metrics Example

To return multiple metrics per sample, return a dictionary that maps each metric name to its value:

```python
def custom_metric(doc: Doc, model_response: ModelResponse) -> dict:
    response = model_response.final_text[0]
    return {"accuracy": response == doc.choices[doc.gold_index], "other_metric": 0.5}
```
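
One way to sanity-check a sample-level function before wiring it into Lighteval is to call it on small stand-in objects. The sketch below assumes nothing about Lighteval beyond the three attributes used in this guide (`choices`, `gold_index`, `final_text`); `FakeDoc` and `FakeModelResponse` are hypothetical helpers, not Lighteval classes.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Hypothetical stand-in exposing only the Doc attributes used in this guide."""
    choices: list[str]
    gold_index: int


@dataclass
class FakeModelResponse:
    """Hypothetical stand-in exposing only `final_text`."""
    final_text: list[str] = field(default_factory=list)


def custom_metric(doc, model_response) -> dict:
    # Same logic as the multiple-metrics example in this guide.
    response = model_response.final_text[0]
    return {"accuracy": response == doc.choices[doc.gold_index], "other_metric": 0.5}


doc = FakeDoc(choices=["Paris", "London"], gold_index=0)
correct = custom_metric(doc, FakeModelResponse(final_text=["Paris"]))
wrong = custom_metric(doc, FakeModelResponse(final_text=["London"]))
```

Here `correct["accuracy"]` is `True` and `wrong["accuracy"]` is `False`. The real `Doc` and `ModelResponse` objects carry more fields, so this only checks the metric's own logic.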
### Step 3: Define Aggregation Function (Optional)

You can define an aggregation function if needed. A common choice is `np.mean`. The custom function below instead flattens nested lists of per-sample scores before averaging them:

```python
def agg_function(items):
    # Expects a list of lists of scores; flatten it before averaging.
    flat_items = [item for sublist in items for item in sublist]
    score = sum(flat_items) / len(flat_items)
    return score
```
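
The flattening step matters only when each sample contributes a list of scores. The following sketch, using made-up score values, contrasts that case with a flat list of booleans, where a plain mean (what `np.mean` would compute) is enough:

```python
def agg_function(items):
    # Same logic as the guide's example: expects a list of lists of scores.
    flat_items = [item for sublist in items for item in sublist]
    return sum(flat_items) / len(flat_items)


# Made-up scores, for illustration only.
nested_scores = [[1.0, 0.0], [1.0], [0.0, 1.0]]  # several scores per sample
flat_scores = [True, False, True, True]          # one boolean per sample

nested_mean = agg_function(nested_scores)         # 3 / 5 = 0.6
flat_mean = sum(flat_scores) / len(flat_scores)   # 3 / 4 = 0.75
```

Passing `flat_scores` to `agg_function` would raise a `TypeError`, because a boolean cannot be iterated like a sublist.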
### Step 4: Create the Metric Object

#### Single Metric

If your function returns a single value per sample, wrap it in a
[SampleLevelMetric](/docs/lighteval/pr_1221/en/package_reference/metrics#lighteval.metrics.utils.metric_utils.SampleLevelMetric):

```python
from lighteval.metrics.utils.metric_utils import SampleLevelMetric
from lighteval.tasks.requests import SamplingMethod  # assumed import path; check your Lighteval version

my_custom_metric = SampleLevelMetric(
    metric_name="custom_accuracy",
    higher_is_better=True,
    category=SamplingMethod.GENERATIVE,
    sample_level_fn=custom_metric,
    corpus_level_fn=agg_function,
)
```
#### Multiple Metrics

If your function returns multiple metrics per sample, wrap it in a
[SampleLevelMetricGrouping](/docs/lighteval/pr_1221/en/package_reference/metrics#lighteval.metrics.utils.metric_utils.SampleLevelMetricGrouping).
Each name listed in `metric_name` should correspond to a key of the dictionary returned by the sample-level function:

```python
import numpy as np
from lighteval.metrics.utils.metric_utils import SampleLevelMetricGrouping

# Named differently from the `custom_metric` function so the function is not shadowed.
my_custom_metrics = SampleLevelMetricGrouping(
    metric_name=["accuracy", "response_length", "confidence"],
    higher_is_better={
        "accuracy": True,
        "response_length": False,  # Shorter responses might be better
        "confidence": True,
    },
    category=SamplingMethod.GENERATIVE,
    sample_level_fn=custom_metric,
    corpus_level_fn={
        "accuracy": np.mean,
        "response_length": np.mean,
        "confidence": np.mean,
    },
)
```
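
To make the role of the per-metric `corpus_level_fn` dictionary concrete, here is a rough sketch of how per-sample dictionaries could be reduced key by key. This illustrates the idea only and is not Lighteval's internal implementation; the `aggregate` helper and the sample values are hypothetical, and `statistics.fmean` and `max` stand in for `np.mean` to keep the example dependency-free.

```python
from statistics import fmean

# Made-up per-sample outputs of a multi-metric sample-level function.
per_sample_results = [
    {"accuracy": 1.0, "response_length": 12.0, "confidence": 0.9},
    {"accuracy": 0.0, "response_length": 30.0, "confidence": 0.4},
    {"accuracy": 1.0, "response_length": 18.0, "confidence": 0.8},
]

# One aggregation function per metric name.
corpus_level_fn = {
    "accuracy": fmean,
    "response_length": max,  # report the longest response
    "confidence": fmean,
}


def aggregate(results, fns):
    """Hypothetical helper: apply each function to the values collected for its key."""
    return {name: fn([sample[name] for sample in results]) for name, fn in fns.items()}


corpus_scores = aggregate(per_sample_results, corpus_level_fn)
```

A missing key in any per-sample dictionary would raise a `KeyError` in this sketch, which is one reason to keep the metric names and the returned keys in sync.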
### Step 5: Register the Metric

Finally, add the following code so that your metric is added to the Lighteval metrics list
when the file is loaded as a module:

```python
# Adds the metric to the metric list!
extend_enum(Metrics, "CUSTOM_ACCURACY", my_custom_metric)

if __name__ == "__main__":
    print("Imported metric")
```
## Using Your Custom Metric

### With Custom Tasks

After adding your metric to the task config, pass the file to Lighteval with
`--custom-tasks path_to_your_file` when launching an evaluation:

```bash
lighteval accelerate \
    "model_name=openai-community/gpt2" \
    "truthfulqa:mc" \
    --custom-tasks path_to_your_metric_file.py
```

In the task config, the metric is listed in the `metric` field:

```python
from lighteval.tasks.lighteval_task import LightevalTaskConfig

task = LightevalTaskConfig(
    name="my_custom_task",
    metric=[my_custom_metric],  # Use your custom metric here
    prompt_function=my_prompt_function,
    hf_repo="my_dataset",
    evaluation_splits=["test"],
)
```