# Quick Tour
> [!TIP]
> We recommend using the `--help` flag to get more information about the
> available options for each command.
> `lighteval --help`
Lighteval can be used with several different commands, each optimized for different evaluation scenarios.
## Find your benchmark
<iframe
src="https://openevals-open-benchmark-index.hf.space"
frameborder="0"
width="850"
height="450"
></iframe>
## Available Commands
### Evaluation Backends
- `lighteval eval`: Use [inspect-ai](https://inspect.aisi.org.uk/) as a backend to evaluate and inspect your models (preferred way)
- `lighteval accelerate`: Evaluate models on CPU or one or more GPUs using [🤗
  Accelerate](https://github.com/huggingface/accelerate)
- `lighteval nanotron`: Evaluate models in distributed settings using [⚡️
  Nanotron](https://github.com/huggingface/nanotron)
- `lighteval vllm`: Evaluate models on one or more GPUs using [🚀
  vLLM](https://github.com/vllm-project/vllm)
- `lighteval custom`: Evaluate custom models (bring your own implementation)
- `lighteval sglang`: Evaluate models using [SGLang](https://github.com/sgl-project/sglang) as a backend
- `lighteval endpoint`: Evaluate models using various endpoints as a backend
  - `lighteval endpoint inference-endpoint`: Evaluate models using Hugging Face's [Inference Endpoints API](https://huggingface.co/inference-endpoints/dedicated)
  - `lighteval endpoint tgi`: Evaluate models using [🔗 Text Generation Inference](https://huggingface.co/docs/text-generation-inference/en/index) running locally
  - `lighteval endpoint litellm`: Evaluate models on any compatible API using [LiteLLM](https://www.litellm.ai/)
  - `lighteval endpoint inference-providers`: Evaluate models using Hugging Face's [Inference Providers](https://huggingface.co/docs/inference-providers/en/index) as a backend
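As noted in the tip above, each of these commands accepts `--help`. For example, to see every option of the vLLM backend:
```bash
lighteval vllm --help
```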
### Evaluation Utils
- `lighteval baseline`: Compute baselines for given tasks
### Utils
- `lighteval tasks`: List or inspect tasks
  - `lighteval tasks list`: List all available tasks
  - `lighteval tasks inspect`: Inspect a specific task to see its configuration and samples
  - `lighteval tasks create`: Create a new task from a template
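For example, to browse the registry and then look at one task in detail (the task name below is just an illustration):
```bash
# List every available task
lighteval tasks list

# Show the configuration and a few samples for one task
lighteval tasks inspect truthfulqa:mc
```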
## Basic Usage
To evaluate `GPT-2` on the Truthful QA benchmark with [🤗
Accelerate](https://github.com/huggingface/accelerate), run:
```bash
lighteval accelerate \
    "model_name=openai-community/gpt2" \
    truthfulqa:mc
```
### Task Specification
Tasks have one function applied at the sample level and one at the corpus level. For example:
- an exact match can be computed per sample, then averaged over the corpus to give the final score;
- samples can be left untouched, with Corpus BLEU applied at the corpus level.
If the task you are looking at has a sample-level function (`sample_level_fn`) that can be parametrized, you can pass parameters from the CLI using the following syntax:
```txt
{task}@{parameter_name1}={value1}@{parameter_name2}={value2},...|0
```
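For instance, a task exposing a boolean `strip_whitespace` parameter (both the task and parameter names here are illustrative) would be run as:
```bash
lighteval accelerate \
    "model_name=openai-community/gpt2" \
    "mytask@strip_whitespace=True|0"
```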
All officially supported tasks are listed in the [tasks list](available-tasks) and in the
[extended folder](https://github.com/huggingface/lighteval/tree/main/src/lighteval/tasks/extended).
Moreover, community-provided tasks can be found in the
[community](https://github.com/huggingface/lighteval/tree/main/community_tasks) folder.
For more details on the implementation of the tasks, such as how prompts are constructed or which metrics are used, you can examine the
[implementation file](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/default_tasks.py).
### Running Multiple Tasks
Running multiple tasks is supported, either with a comma-separated list or by specifying a file path.
The file should be structured like [examples/tasks/recommended_set.txt](https://github.com/huggingface/lighteval/blob/main/examples/tasks/recommended_set.txt).
When specifying a path to a file, it should start with `./`.
```bash
lighteval accelerate \
    "model_name=openai-community/gpt2" \
    ./path/to/lighteval/examples/tasks/recommended_set.txt
# or, e.g., "truthfulqa:mc|0,gsm8k|3"
```
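Each line of such a file holds one task specification, in the same format as the CLI argument. A minimal file covering the two tasks above might look like this (contents illustrative; see the linked `recommended_set.txt` for the actual set):
```txt
truthfulqa:mc|0
gsm8k|3
```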
## Backend Configuration
### General Information
The `model-args` argument takes a string representing a comma-separated list of
`key=value` model arguments. The allowed arguments vary depending on the backend
you use and correspond to the fields of the model configurations, which can be
found [here](./package_reference/models).
All backends let you post-process the predictions of reasoning models to remove
the thinking tokens from the trace used to compute the metrics: pass
`--remove-reasoning-tags`, and use `--reasoning-tags` to specify which reasoning
tags to strip (defaults to `<think>` and `</think>`).
Here's an example with `mistralai/Magistral-Small-2507`, which outputs custom
thinking tokens:
```bash
lighteval vllm \
    "model_name=mistralai/Magistral-Small-2507,dtype=float16,data_parallel_size=4" \
    aime24 \
    --remove-reasoning-tags \
    --reasoning-tags="[('[THINK]','[/THINK]')]"
```
### Nanotron
To evaluate a model trained with Nanotron on a single GPU:
> [!WARNING]
> Nanotron models cannot be evaluated without torchrun.
```bash
torchrun --standalone --nnodes=1 --nproc-per-node=1 \
    src/lighteval/__main__.py nanotron \
    --checkpoint-config-path ../nanotron/checkpoints/10/config.yaml \
    --lighteval-config-path examples/nanotron/lighteval_config_override_template.yaml
```
The `nproc-per-node` argument should match the product of the data, tensor, and
pipeline parallelism configured in the
`lighteval_config_override_template.yaml` file, that is:
`nproc-per-node = data_parallelism * tensor_parallelism * pipeline_parallelism`.
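For instance, with `data_parallelism=2`, `tensor_parallelism=2`, and `pipeline_parallelism=1` (an illustrative configuration), you would launch with `--nproc-per-node=4`, since 2 * 2 * 1 = 4.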
