# Logging
## EvaluationTracker[[lighteval.logging.evaluation_tracker.EvaluationTracker]]
class lighteval.logging.evaluation_tracker.EvaluationTracker
- results_path_template (str, optional) -- Template for results directory structure. Example: "{output_dir}/results/{org}_{model}"
- save_details (bool, defaults to True) -- Whether to save detailed evaluation records
- push_to_hub (bool, defaults to False) -- Whether to push results to HF Hub
- push_to_tensorboard (bool, defaults to False) -- Whether to push metrics to TensorBoard
- hub_results_org (str, optional) -- HF Hub organization to push results to
- tensorboard_metric_prefix (str, defaults to "eval") -- Prefix for TensorBoard metrics
- public (bool, defaults to False) -- Whether to make Hub datasets public
- nanotron_run_info (GeneralArgs, optional) -- Nanotron model run information
- use_wandb (bool, defaults to False) -- Whether to log to Weights & Biases or Trackio if available

Tracks and manages evaluation results, metrics, and logging for model evaluations.
The EvaluationTracker coordinates multiple specialized loggers to track different aspects of model evaluation:
- Details Logger (DetailsLogger): Records per-sample evaluation details and predictions
- Metrics Logger (MetricsLogger): Tracks aggregate evaluation metrics and scores
- Versions Logger (VersionsLogger): Records task and dataset versions
- General Config Logger (GeneralConfigLogger): Stores overall evaluation configuration
- Task Config Logger (TaskConfigLogger): Maintains per-task configuration details
The tracker can save results locally and optionally push them to:
- Hugging Face Hub as datasets
- TensorBoard for visualization
- Trackio or Weights & Biases for experiment tracking
Example:

```python
from lighteval.logging.evaluation_tracker import EvaluationTracker

tracker = EvaluationTracker(
    output_dir="./eval_results",
    push_to_hub=True,
    hub_results_org="my-org",
    save_details=True,
)

# Log evaluation results
tracker.metrics_logger.add_metric("accuracy", 0.85)
tracker.details_logger.add_detail(task_name="qa", prediction="Paris")

# Save all results
tracker.save()
```
### generate_final_dict[[lighteval.logging.evaluation_tracker.EvaluationTracker.generate_final_dict]]
Aggregates all of the tracker's experiment information into a single dictionary. Use this function to gather and display results at the end of an evaluation run.
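A minimal usage sketch, building on the example above; it assumes generate_final_dict() takes no arguments and returns a nested dictionary (the exact keys shown are hypothetical):

```python
# Hypothetical sketch: collect everything the tracker has logged at the end
# of a run. Assumes generate_final_dict() takes no arguments and returns a
# nested dict; the "results" key below is an assumption, not a documented API.
final_results = tracker.generate_final_dict()
print(final_results.get("results"))
```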
### push_to_hub[[lighteval.logging.evaluation_tracker.EvaluationTracker.push_to_hub]]
Pushes the evaluation results and details to the Hugging Face Hub as datasets.
### recreate_metadata_card[[lighteval.logging.evaluation_tracker.EvaluationTracker.recreate_metadata_card]]
- repo_id (str) -- Details dataset repository path on the Hub (org/dataset)

Fully updates the details repository metadata card for the currently evaluated model.
### save[[lighteval.logging.evaluation_tracker.EvaluationTracker.save]]
Saves the experiment information and results to files, and to the Hub if requested.
## GeneralConfigLogger[[lighteval.logging.info_loggers.GeneralConfigLogger]]
class lighteval.logging.info_loggers.GeneralConfigLogger
- num_fewshot_seeds (int) -- Number of random seeds used for few-shot example sampling.
  - If <= 1: Single evaluation with seed=0
  - If > 1: Multiple evaluations with different few-shot samplings (HELM-style)
- max_samples (int, optional) -- Maximum number of samples to evaluate per task. Only used for debugging - truncates each task's dataset.
- job_id (int, optional) -- Slurm job ID if running on a cluster. Used to cross-reference with scheduler logs.
- start_time (float) -- Unix timestamp when evaluation started. Automatically set during logger initialization.
- end_time (float) -- Unix timestamp when evaluation completed. Set by calling log_end_time().
- total_evaluation_time_secondes (str) -- Total runtime in seconds. Calculated as end_time - start_time.
- model_config (ModelConfig) -- Complete model configuration settings. Contains model architecture, tokenizer, and generation parameters.
- model_name (str) -- Name identifier for the evaluated model. Extracted from model_config.

Tracks general configuration and runtime information for model evaluations.
This logger captures key configuration parameters, model details, and timing information to ensure reproducibility and provide insights into the evaluation process.
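A conceptual sketch of the timing bookkeeping described by the start_time, end_time, and total_evaluation_time_secondes attributes (not the logger's actual internals):

```python
import time

start_time = time.time()   # set automatically when the logger is initialized
# ... evaluation runs ...
end_time = time.time()     # set by calling log_end_time()

# Calculated as end_time - start_time; per the attribute list, stored as a str.
total_evaluation_time_secondes = str(end_time - start_time)
```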
### log_args_info[[lighteval.logging.info_loggers.GeneralConfigLogger.log_args_info]]
- max_samples (int | None) -- Maximum number of samples; if None, all available samples are used.
- job_id (str) -- Job ID, used to retrieve logs.

Logs the information about the arguments passed to the method.
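A hypothetical call sketch; the exact signature is an assumption pieced together from the parameters documented here and the num_fewshot_seeds attribute above, and may differ between lighteval versions:

```python
from lighteval.logging.info_loggers import GeneralConfigLogger

# Hypothetical usage -- argument names are assumptions from the docs above.
general_logger = GeneralConfigLogger()
general_logger.log_args_info(num_fewshot_seeds=1, max_samples=None, job_id="12345")
```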
### log_model_info[[lighteval.logging.info_loggers.GeneralConfigLogger.log_model_info]]
Logs the model configuration information.
## DetailsLogger[[lighteval.logging.info_loggers.DetailsLogger]]
class lighteval.logging.info_loggers.DetailsLogger
- hashes (dict[str, list[Hash]]) -- Maps each task name to the list of all its samples' Hash.
- compiled_hashes (dict[str, CompiledHash]) -- Maps each task name to its CompiledHash, an aggregation of all the individual sample hashes.
- details (dict[str, list[Detail]]) -- Maps each task name to the list of its samples' details. Example: winogrande: [sample1_details, sample2_details, ...]
- compiled_details (dict[str, CompiledDetail]) -- Maps each task name to the list of its samples' compiled details.
- compiled_details_over_all_tasks (CompiledDetailOverAllTasks) -- Aggregated details over all the tasks.

Logger for the experiment details.
Stores and logs experiment information both at the task and at the sample level.
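An illustrative sketch of the per-task storage pattern described above, using plain dicts in place of the actual Detail and CompiledDetail classes:

```python
from collections import defaultdict

# Each task name maps to the list of its samples' details (fields are hypothetical).
details: dict[str, list[dict]] = defaultdict(list)

details["winogrande|winogrande_xl"].append(
    {"prediction": "Paris", "gold": "Paris", "accuracy": 1.0}
)
details["winogrande|winogrande_xl"].append(
    {"prediction": "London", "gold": "Paris", "accuracy": 0.0}
)

# Compiling aggregates the per-sample records into one summary per task.
compiled_details = {
    task: {"num_samples": len(samples)} for task, samples in details.items()
}
print(compiled_details)  # {'winogrande|winogrande_xl': {'num_samples': 2}}
```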
### aggregate[[lighteval.logging.info_loggers.DetailsLogger.aggregate]]
### log[[lighteval.logging.info_loggers.DetailsLogger.log]]
- doc (Doc) -- Current sample that we want to store.
- model_response (ModelResponse) -- Model outputs for the current sample.
- metrics (dict) -- Model scores for said sample on the current task's metrics.

Stores the relevant information for one sample of one task in the total list of samples stored in the DetailsLogger.
## MetricsLogger[[lighteval.logging.info_loggers.MetricsLogger]]
class lighteval.logging.info_loggers.MetricsLogger
- metric_aggregated (dict[str, dict[str, float]]) -- Maps each task to its dictionary of metrics, aggregated over all the examples of the task. Example: {"winogrande|winogrande_xl": {"accuracy": 0.5}}

Logs the actual scores for each metric of each task.
### aggregate[[lighteval.logging.info_loggers.MetricsLogger.aggregate]]
- bootstrap_iters (int, optional) -- Number of iterations used for the statistical bootstrap. Defaults to 1000.

Aggregates the metrics for each task and then over all tasks.
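To illustrate what the bootstrap does (a conceptual sketch, not lighteval's implementation): resample the per-sample scores with replacement bootstrap_iters times, and use the spread of the resampled means to estimate the standard error of the aggregated metric.

```python
import random

def bootstrap_stderr(scores: list[float], bootstrap_iters: int = 1000) -> float:
    """Estimate the standard error of the mean score by resampling."""
    means = []
    for _ in range(bootstrap_iters):
        resample = random.choices(scores, k=len(scores))  # sample with replacement
        means.append(sum(resample) / len(resample))
    mean_of_means = sum(means) / len(means)
    variance = sum((m - mean_of_means) ** 2 for m in means) / (len(means) - 1)
    return variance ** 0.5

# Example: per-sample accuracies for one task
print(bootstrap_stderr([1.0, 0.0, 1.0, 1.0, 0.0]))
```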
## VersionsLogger[[lighteval.logging.info_loggers.VersionsLogger]]
class lighteval.logging.info_loggers.VersionsLogger
Tasks can have a version number or date, which indicates the precise metric definition and dataset version used for an evaluation.
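Conceptually, the logger stores a mapping from task name to version (the values shown here are hypothetical):

```python
# Hypothetical contents of the versions mapping: task name -> version number.
versions: dict[str, int] = {"winogrande|winogrande_xl": 0}
```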
## TaskConfigLogger[[lighteval.logging.info_loggers.TaskConfigLogger]]
class lighteval.logging.info_loggers.TaskConfigLogger
Logs the different parameters of the current LightevalTask of interest.