Tasks
LightevalTask
LightevalTaskConfig
class lighteval.tasks.lighteval_task.LightevalTaskConfig
- prompt_function (Callable[[dict, str], Doc]) -- Function that converts dataset row to Doc objects for evaluation. Takes a dataset row dict and task name as input.
- hf_repo (str) -- HuggingFace Hub repository path containing the evaluation dataset.
- hf_subset (str) -- Dataset subset/configuration name to use for this task.
- metrics (ListLike[Metric]) -- List of metrics to compute for this task.
Configuration dataclass for a LightevalTask.
This class stores all the configuration parameters needed to define and run an evaluation task, including dataset information, prompt formatting, evaluation metrics, and generation parameters.
Dataset Configuration:
- hf_revision (str | None, optional) -- Specific dataset revision to use. Defaults to None (latest).
- hf_filter (Callable[[dict], bool] | None, optional) -- Filter function to apply to dataset items. Defaults to None.
- hf_avail_splits (ListLike[str], optional) -- Available dataset splits. Defaults to ["train", "validation", "test"].
Evaluation Splits:
- evaluation_splits (ListLike[str], optional) -- Dataset splits to use for evaluation. Defaults to ["validation"].
- few_shots_split (str | None, optional) -- Split to sample few-shot examples from. Defaults to None.
- few_shots_select (str | None, optional) -- Method for selecting few-shot examples. Defaults to None.
Generation Parameters:
- generation_size (int | None, optional) -- Maximum token length for generated text. Defaults to None.
- generation_grammar (TextGenerationInputGrammarType | None, optional) -- Grammar for structured text generation. Only available for TGI and Inference Endpoint models. Defaults to None.
- stop_sequence (ListLike[str] | None, optional) -- Sequences that stop text generation. Defaults to None.
- num_samples (list[int] | None, optional) -- Number of samples to generate per input. Defaults to None.
Task Configuration:
- suite (ListLike[str], optional) -- Evaluation suites this task belongs to. Defaults to ["custom"].
- version (int, optional) -- Task version number. Increment when the dataset or prompt changes. Defaults to 0.
- num_fewshots (int, optional) -- Number of few-shot examples to include. Defaults to 0.
- truncate_fewshots (bool, optional) -- Whether to truncate few-shot examples. Defaults to False.
- must_remove_duplicate_docs (bool, optional) -- Whether to remove duplicate documents. Defaults to False.
Document Tracking:
- original_num_docs (int, optional) -- Total number of documents in the task. Defaults to -1.
- effective_num_docs (int, optional) -- Number of documents actually used in evaluation. Defaults to -1.
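As a minimal sketch of how these fields fit together, here is a custom task definition; the repository path, subset, and dataset column names ("question", "choices", "answer") are hypothetical, and the metric is just one of the options available in lighteval.metrics.metrics.Metrics:
from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc

def prompt_fn(line: dict, task_name: str) -> Doc:
    # Map one dataset row to a Doc; the column names are hypothetical.
    return Doc(
        task_name=task_name,
        query=line["question"],
        choices=line["choices"],
        gold_index=line["answer"],
    )

my_task = LightevalTaskConfig(
    name="my_task",
    prompt_function=prompt_fn,
    suite=["custom"],
    hf_repo="my_org/my_dataset",  # hypothetical dataset repository
    hf_subset="default",
    evaluation_splits=["test"],
    metrics=[Metrics.loglikelihood_acc],
)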
LightevalTask
class lighteval.tasks.lighteval_task.LightevalTask
aggregation
download_dataset_worker
Downloads the dataset specified in the task configuration, optionally applies a filter if configured, and returns the dataset dictionary. This method is designed to be used for parallel dataset loading.
eval_docs
fewshot_docs
get_docs
Get evaluation documents with few-shot examples and generation parameters configured.
Retrieves the evaluation documents, optionally limits the number of samples, shuffles them deterministically for reproducibility, and configures each document with few-shot examples and generation parameters for evaluation.
Raises: ValueError -- If no documents are available for evaluation.
get_first_possible_fewshot_splits
load_datasets
Load datasets from the HuggingFace Hub for the given tasks.
- dataset_loading_processes (int, optional) -- Number of processes to use for parallel dataset loading. Defaults to 1 (sequential loading).
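A hedged sketch of parallel loading, assuming load_datasets is called with the list of tasks to load plus the parameter documented above; tasks stands in for any list of configured LightevalTask objects:
from lighteval.tasks.lighteval_task import LightevalTask

# Load the datasets for several tasks at once, using four loader
# processes (call pattern assumed from the parameter list above).
LightevalTask.load_datasets(tasks, dataset_loading_processes=4)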
PromptManager
class lighteval.tasks.prompt_manager.PromptManager
prepare_prompt
prepare_prompt_api
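The listing above gives no usage notes, so here is a hedged sketch of the intended flow; the constructor arguments are assumptions (they vary between lighteval versions), while the two method names come from the listing:
from lighteval.tasks.prompt_manager import PromptManager

pm = PromptManager(use_chat_template=False, tokenizer=None)  # arguments assumed for illustration
prompt = pm.prepare_prompt(doc)        # plain-text prompt for a local model
messages = pm.prepare_prompt_api(doc)  # chat-style messages for an API model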
Registry
class lighteval.tasks.registry.Registry
create_custom_tasks_module
create_task_config_dict
print_all_tasks
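A minimal, hedged example of inspecting the registry; the no-argument constructor is an assumption (some versions accept custom task modules), while print_all_tasks comes from the listing above:
from lighteval.tasks.registry import Registry

registry = Registry()       # constructor arguments assumed
registry.print_all_tasks()  # lists every registered task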
Doc
class lighteval.tasks.requests.Doc
- choices (list[str]) -- List of possible answer choices for the query. For multiple-choice tasks, this contains all options (A, B, C, D, etc.). For generative tasks, this may be empty or contain reference answers.
- gold_index (Union[int, list[int]]) -- Index or indices of the correct answer(s) in the choices list. For a single correct answer, use an int (e.g., 0 for the first choice). For multiple correct answers, use a list (e.g., [0, 2] for the first and third).
- instruction (str | None) -- System prompt or task-specific instructions to guide the model. This is typically prepended to the query to set context or behavior.
- images (list["Image"] | None) -- List of PIL Image objects for multimodal tasks.
- specific (dict | None) -- Task-specific information or metadata. Can contain any additional data needed for evaluation.
- unconditioned_query (Optional[str]) -- Query without task-specific context, used for PMI normalization: log P(choice | query) - log P(choice | unconditioned query). See the PMI example under Usage Examples below.
- original_query (str | None) -- The query before any preprocessing or modification.
Set by task parameters:
- id (str) -- Unique identifier for this evaluation instance. Set by the task and not the user.
- task_name (str) -- Name of the task or benchmark this Doc belongs to.
Few-shot Learning Parameters:
- fewshot_samples (list) -- List of Doc objects representing few-shot examples. These examples are prepended to the main query to provide context.
- sampling_methods (list[SamplingMethod]) -- List of sampling methods to use for this instance. Options: GENERATIVE, LOGPROBS, PERPLEXITY.
- fewshot_sorting_class (Optional[str]) -- Class label for balanced few-shot example selection. Used to ensure diverse representation in few-shot examples.
Generation Control Parameters:
- generation_size (int | None) -- Maximum number of tokens to generate for this instance.
- stop_sequences (list[str] | None) -- List of strings that stop generation when encountered. Used for controlled generation and preventing unwanted continuations.
- use_logits (bool) -- Whether to return logits (raw model outputs) in addition to text. Used for probability analysis, confidence scoring, and detailed evaluation.
- num_samples (int) -- Number of different samples to generate for this instance. Used for diversity analysis, uncertainty estimation, and ensemble methods.
- generation_grammar (None) -- Grammar constraints for generation (currently not implemented). Reserved for future structured generation features.
Dataclass representing a single evaluation sample for a benchmark.
This class encapsulates all the information needed to evaluate a model on a single task instance. It contains the input query, expected outputs, metadata, and configuration parameters for different types of evaluation tasks.
Required Fields:
- query: The input prompt or question.
- choices: Available answer choices (for multiple-choice tasks).
- gold_index: Index(es) of the correct answer(s).
Optional Fields:
- instruction: Task-specific system prompt; appended to the model-specific system prompt.
- images: Visual inputs for multimodal tasks.
Methods:
- get_golds(): Returns the correct answer(s) as strings based on gold_index. Handles both single and multiple correct answers.
Usage Examples:
Multiple Choice Question:
doc = Doc(
    query="What is the capital of France?",
    choices=["London", "Paris", "Berlin", "Madrid"],
    gold_index=1,  # Paris is the correct answer
    instruction="Answer the following geography question:",
)
Generative Task:
doc = Doc(
    query="Write a short story about a robot.",
    choices=[],  # No predefined choices for generative tasks
    gold_index=0,  # Not used for generative tasks
    generation_size=100,
    stop_sequences=["\nEnd"],
)
Few-shot Learning:
doc = Doc(
    query="Translate 'Hello world' to Spanish.",
    choices=["Hola mundo", "Bonjour monde", "Ciao mondo"],
    gold_index=0,
    fewshot_samples=[
        Doc(query="Translate 'Good morning' to Spanish.",
            choices=["Buenos días", "Bonjour", "Buongiorno"],
            gold_index=0),
        Doc(query="Translate 'Thank you' to Spanish.",
            choices=["Gracias", "Merci", "Grazie"],
            gold_index=0),
    ],
)
Multimodal Task:
doc = Doc(
    query="What is shown in this image?",
    choices=["A cat"],
    gold_index=0,
    images=[pil_image],  # PIL Image object
)
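PMI Normalization (an added sketch; using "Answer:" as the unconditioned query is a common convention, not a requirement of the class):
doc = Doc(
    query="Question: What is the capital of France?\nAnswer:",
    choices=[" London", " Paris", " Berlin", " Madrid"],
    gold_index=1,
    unconditioned_query="Answer:",  # scores log P(choice | query) - log P(choice | "Answer:")
)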
get_golds
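For example, reusing the multiple-choice Doc from the usage examples above, get_golds() resolves gold_index against choices:
doc = Doc(
    query="What is the capital of France?",
    choices=["London", "Paris", "Berlin", "Madrid"],
    gold_index=1,
)
doc.get_golds()  # ["Paris"]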
Datasets
class lighteval.data.DynamicBatchDataset
get_original_order
splits_iterator
class lighteval.data.LoglikelihoodDataset
class lighteval.data.GenerativeTaskDataset
init_split_limits
For generative tasks, self._sorting_criteria outputs:
- a boolean (whether the generation task uses logits)
- a list (the stop sequences)
- the item length (the actual size sorting factor).
This function creates evaluation groups keyed by those generation parameters (use of logits and stop sequences), so that samples with similar properties are batched together afterwards; within each split, samples are then further organized by length, as the sketch below illustrates.
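An illustrative Python sketch of the grouping idea (not the library's actual implementation): group items by their generation parameters, then sort by length within each group:
from itertools import groupby

def group_by_generation_params(items):
    # items: (uses_logits, stop_sequences, length) tuples, mirroring the
    # sorting criteria described above (illustrative only).
    key = lambda item: (item[0], tuple(item[1]))
    grouped = groupby(sorted(items, key=key), key=key)
    # Within each group, order by length so similarly sized samples batch together.
    return {k: sorted(g, key=lambda item: item[2]) for k, g in grouped}

items = [
    (False, ["\n"], 120),
    (True, ["\n"], 80),
    (False, ["\n"], 40),
]
print(group_by_generation_params(items))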
class lighteval.data.GenerativeTaskDatasetNanotron
class lighteval.data.GenDistributedSampler