Metrics
Metric
class lighteval.metrics.Metric
CorpusLevelMetric
class lighteval.metrics.utils.metric_utils.CorpusLevelMetric
SampleLevelMetric
class lighteval.metrics.utils.metric_utils.SampleLevelMetric
MetricGrouping
class lighteval.metrics.utils.metric_utils.MetricGrouping
CorpusLevelMetricGrouping
class lighteval.metrics.utils.metric_utils.CorpusLevelMetricGrouping
SampleLevelMetricGrouping
class lighteval.metrics.utils.metric_utils.SampleLevelMetricGrouping
Corpus Metrics
CorpusLevelF1Score
class lighteval.metrics.metrics_corpus.CorpusLevelF1Score
lighteval.metrics.metrics_corpus.CorpusLevelF1Score.compute_corpus
CorpusLevelPerplexityMetric
class lighteval.metrics.metrics_corpus.CorpusLevelPerplexityMetric
lighteval.metrics.metrics_corpus.CorpusLevelPerplexityMetric.compute_corpus
CorpusLevelTranslationMetric
class lighteval.metrics.metrics_corpus.CorpusLevelTranslationMetric
lighteval.metrics.metrics_corpus.CorpusLevelTranslationMetric.compute_corpus
MatthewsCorrCoef
class lighteval.metrics.metrics_corpus.MatthewsCorrCoef
lighteval.metrics.metrics_corpus.MatthewsCorrCoef.compute_corpus
Sample Metrics
ExactMatches
class lighteval.metrics.metrics_sample.ExactMatches
lighteval.metrics.metrics_sample.ExactMatches.compute
- model_response (ModelResponse) -- The model's response containing predictions.
- **kwargs -- Additional keyword arguments.
Returns: float -- Aggregated score over the current sample's items.
Computes the metric over a list of golds and predictions for one single sample.
lighteval.metrics.metrics_sample.ExactMatches.compute_one_item
- pred (str) -- One of the possible predictions.
Returns: float -- The exact match score. Will be 1 for a match, 0 otherwise.
Compares two strings only.
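The per-item comparison and the per-sample aggregation described above can be illustrated with a minimal sketch. The function names, the `strip` flag, and the choice of `max` as the aggregation are assumptions made for illustration; they are not lighteval's API, and the library may apply further normalization.

```python
def exact_match_one_item(gold: str, pred: str, strip: bool = True) -> int:
    """Hypothetical helper: return 1 if pred equals gold, 0 otherwise."""
    if strip:
        # Assumption: surrounding whitespace is ignored before comparing.
        gold, pred = gold.strip(), pred.strip()
    return 1 if gold == pred else 0


def exact_match_sample(golds: list[str], preds: list[str]) -> float:
    """Hypothetical aggregation: best score over all gold/prediction pairs."""
    return float(max(exact_match_one_item(g, p) for g in golds for p in preds))
```

With this sketch, a sample is credited as soon as any prediction matches any gold exactly; the comparison stays case-sensitive.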
F1_score
class lighteval.metrics.metrics_sample.F1_score
lighteval.metrics.metrics_sample.F1_score.compute
- model_response (ModelResponse) -- The model's response containing predictions.
- **kwargs -- Additional keyword arguments.
Returns: float -- Aggregated score over the current sample's items.
Computes the metric over a list of golds and predictions for one single sample.
lighteval.metrics.metrics_sample.F1_score.compute_one_item
- pred (str) -- One of the possible predictions.
Returns: float -- The F1 score over the bag of words, computed using nltk.
Compares two strings only.
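To make the bag-of-words F1 concrete, here is a small standard-library sketch. It is not the nltk-based computation the library uses: it splits on whitespace and counts repeated tokens (multiset overlap), so its values may differ from lighteval's when tokens repeat. The function name is hypothetical.

```python
from collections import Counter


def bag_of_words_f1(gold: str, pred: str) -> float:
    """Hypothetical helper: token-level F1 between one gold and one prediction."""
    gold_tokens = gold.split()
    pred_tokens = pred.split()
    if not gold_tokens or not pred_tokens:
        return 0.0
    # Multiset intersection: each shared token counts as often as it appears in both.
    overlap = sum((Counter(gold_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```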
LoglikelihoodAcc
class lighteval.metrics.metrics_sample.LoglikelihoodAcc
lighteval.metrics.metrics_sample.LoglikelihoodAcc.compute
- model_response (ModelResponse) -- The model's response containing logprobs.
- **kwargs -- Additional keyword arguments.
Returns: int -- The eval score: 1 if the best log-prob choice is in gold, 0 otherwise.
Computes the log likelihood accuracy: is the choice with the highest logprob in choices_logprob present in the gold_ixs?
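The check described above amounts to an argmax followed by a membership test. A minimal sketch, assuming the log-probabilities and gold indices are already available as plain lists (the function name is hypothetical, and any length normalization the library may apply is left out):

```python
def loglikelihood_acc(choices_logprob: list[float], gold_ixs: list[int]) -> int:
    """Hypothetical helper: 1 if the highest-logprob choice is a gold choice, else 0."""
    # Index of the choice with the highest log-probability.
    best_choice = max(range(len(choices_logprob)), key=choices_logprob.__getitem__)
    return int(best_choice in gold_ixs)
```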
NormalizedMultiChoiceProbability
class lighteval.metrics.metrics_sample.NormalizedMultiChoiceProbability
lighteval.metrics.metrics_sample.NormalizedMultiChoiceProbability.compute
- model_response (ModelResponse) -- The model's response containing logprobs.
- **kwargs -- Additional keyword arguments.
Returns: float -- The probability of the best log-prob choice being a gold choice.
Computes the log likelihood probability: chance of choosing the best choice.
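One way to read "normalized" here is that the per-choice log-probabilities are turned into a distribution over the choices before the gold choice's probability is read off. The sketch below follows that reading; it is an assumption about the intent, not a copy of the library's code, and the function name is hypothetical.

```python
import math


def normalized_choice_probability(choices_logprob: list[float], gold_ix: int) -> float:
    """Hypothetical helper: probability of the gold choice after normalizing over choices."""
    # Subtract the maximum before exponentiating for numerical stability.
    max_lp = max(choices_logprob)
    exps = [math.exp(lp - max_lp) for lp in choices_logprob]
    return exps[gold_ix] / sum(exps)
```

For example, two choices with raw probabilities 0.2 and 0.6 are renormalized to 0.25 and 0.75.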
Probability
class lighteval.metrics.metrics_sample.Probability
lighteval.metrics.metrics_sample.Probability.compute
- model_response (ModelResponse) -- The model's response containing logprobs.
- **kwargs -- Additional keyword arguments.
Returns: float -- The probability of the best log-prob choice being a gold choice.
Computes the log likelihood probability: chance of choosing the best choice.
Recall
class lighteval.metrics.metrics_sample.Recall
lighteval.metrics.metrics_sample.Recall.compute
- model_response (ModelResponse) -- The model's response containing logprobs.
- **kwargs -- Additional keyword arguments.
Returns: int -- Score: 1 if one of the top level predicted choices was correct, 0 otherwise.
Computes the recall at the requested depth level: looks at the n best predicted choices (those with the highest log probabilities) and checks whether there is an actual gold among them.
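Recall at a given depth can be sketched by ranking the choices by log-probability and checking the top n for a gold index. The function name and the plain-list inputs are assumptions for illustration only.

```python
def recall_at_n(choices_logprob: list[float], gold_ixs: list[int], n: int = 1) -> int:
    """Hypothetical helper: 1 if any of the n highest-logprob choices is gold, else 0."""
    # Choice indices sorted from most to least likely.
    ranked = sorted(range(len(choices_logprob)),
                    key=lambda i: choices_logprob[i], reverse=True)
    return int(any(ix in gold_ixs for ix in ranked[:n]))
```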
MRR
class lighteval.metrics.metrics_sample.MRR
lighteval.metrics.metrics_sample.MRR.compute
- doc (Doc) -- The document containing choices and gold indices.
- **kwargs -- Additional keyword arguments.
Returns: float -- MRR score.
Mean reciprocal rank. Measures the quality of a ranking of choices (ordered by correctness).
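As a rough illustration of mean reciprocal rank, the sketch below computes the reciprocal rank of the first gold choice for one sample, then averages over samples. Both function names are hypothetical, and the inputs are assumed to be plain lists rather than lighteval's `Doc` and `ModelResponse` objects.

```python
def reciprocal_rank(choices_logprob: list[float], gold_ixs: list[int]) -> float:
    """Hypothetical helper: 1 / rank of the best-ranked gold choice (0.0 if none)."""
    ranked = sorted(range(len(choices_logprob)),
                    key=lambda i: choices_logprob[i], reverse=True)
    for rank, ix in enumerate(ranked, start=1):
        if ix in gold_ixs:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(per_sample_scores: list[float]) -> float:
    """Average the per-sample reciprocal ranks across the corpus."""
    return sum(per_sample_scores) / len(per_sample_scores)
```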
ROUGE
class lighteval.metrics.metrics_sample.ROUGE
lighteval.metrics.metrics_sample.ROUGE.compute
- model_response (ModelResponse) -- The model's response containing predictions.
- **kwargs -- Additional keyword arguments.
Returns: float or dict -- Aggregated score over the current sample's items. If several rouge functions have been selected, returns a dict which maps name and scores.
Computes the metric(s) over a list of golds and predictions for one single sample.
BertScore
class lighteval.metrics.metrics_sample.BertScore
lighteval.metrics.metrics_sample.BertScore.compute
- model_response (ModelResponse) -- The model's response containing predictions.
- **kwargs -- Additional keyword arguments.
Returns: dict -- Scores over the current sample's items.
Computes the precision, recall and F1 score using the BERT scorer.
Extractiveness
class lighteval.metrics.metrics_sample.Extractiveness
lighteval.metrics.metrics_sample.Extractiveness.compute
- model_response (ModelResponse) -- The model's response containing predictions.
- **kwargs -- Additional keyword arguments.
Returns: dict[str, float] -- The extractiveness scores.
Compute the extractiveness of the predictions.
This method calculates coverage, density, and compression scores for a single prediction against the input text.
Faithfulness
class lighteval.metrics.metrics_sample.Faithfulness
lighteval.metrics.metrics_sample.Faithfulness.compute
- model_response (ModelResponse) -- The model's response containing predictions.
- **kwargs -- Additional keyword arguments.
Returns: dict[str, float] -- The faithfulness scores.
Compute the faithfulness of the predictions.
The SummaCZS (Summary Content Zero-Shot) model is used with configurable granularity and model variation.
BLEURT
class lighteval.metrics.metrics_sample.BLEURT
lighteval.metrics.metrics_sample.BLEURT.compute
- model_response (ModelResponse) -- The model's response containing predictions.
- **kwargs -- Additional keyword arguments.
Returns: float -- Score over the current sample's items.
Uses the stored BLEURT scorer to compute the score on the current sample.
BLEU
class lighteval.metrics.metrics_sample.BLEU
lighteval.metrics.metrics_sample.BLEU.compute
- model_response (ModelResponse) -- The model's response containing predictions.
- **kwargs -- Additional keyword arguments.
Returns: float -- Score over the current sample's items.
Computes the sentence level BLEU between the golds and each prediction, then takes the average.
StringDistance
class lighteval.metrics.metrics_sample.StringDistance
lighteval.metrics.metrics_sample.StringDistance.compute
- model_response (ModelResponse) -- The model's response containing predictions.
- **kwargs -- Additional keyword arguments.
Returns: dict -- The different scores computed.
Computes all the requested metrics on the golds and prediction.
lighteval.metrics.metrics_sample.StringDistance.edit_similarity
Edit similarity is also used in the paper Lee, Katherine, et al. "Deduplicating training data makes language models better." arXiv preprint arXiv:2107.06499 (2021).
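A common way to define edit similarity, and the one assumed in this sketch, is one minus the Levenshtein distance divided by the length of the longer sequence. The code below is a standard-library illustration with hypothetical function names. It works on strings as well as token lists and is not taken from the library's source.

```python
from collections.abc import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a: Sequence, b: Sequence) -> float:
    """Assumed definition: 1 - edit_distance / length of the longer sequence."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty sequences are treated as identical
    return 1.0 - edit_distance(a, b) / longest
```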
lighteval.metrics.metrics_sample.StringDistance.longest_common_prefix_length
JudgeLLM
class lighteval.metrics.metrics_sample.JudgeLLM
JudgeLLMMTBench
class lighteval.metrics.metrics_sample.JudgeLLMMTBench
lighteval.metrics.metrics_sample.JudgeLLMMTBench.compute
JudgeLLMMixEval
class lighteval.metrics.metrics_sample.JudgeLLMMixEval
lighteval.metrics.metrics_sample.JudgeLLMMixEval.compute
MajAtK
class lighteval.metrics.metrics_sample.MajAtK
lighteval.metrics.metrics_sample.MajAtK.compute
- model_response (ModelResponse) -- The model's response containing predictions.
- **kwargs -- Additional keyword arguments.
Returns: float -- Aggregated score over the current sample's items.
Computes the metric over a list of golds and predictions for one single sample. It applies normalisation (if needed) to the model predictions and the gold, takes the most frequent answer among all available ones, then compares it to the gold.
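The majority-vote procedure described above can be sketched as follows. The function name, the `normalize` callable, and the way `k` truncates the prediction list are assumptions for illustration; the library's own normalization and tie-breaking may differ.

```python
from collections import Counter
from typing import Callable, Optional


def maj_at_k(gold: str, predictions: list[str], k: int,
             normalize: Optional[Callable[[str], str]] = None) -> float:
    """Hypothetical helper: 1.0 if the most frequent of the first k answers equals gold."""
    norm = normalize if normalize is not None else (lambda s: s)
    votes = Counter(norm(p) for p in predictions[:k])
    if not votes:
        return 0.0
    # most_common(1) returns the answer with the highest count; on ties,
    # the answer seen first wins.
    majority_answer, _ = votes.most_common(1)[0]
    return float(majority_answer == norm(gold))
```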
LLM-as-a-Judge
JudgeLM
class lighteval.metrics.utils.llm_as_judge.JudgeLM
- templates (Callable) -- A function taking into account the question, options, answer, and gold and returning the judge prompt.
- process_judge_response (Callable) -- A function for processing the judge's response.
- judge_backend (Literal["litellm", "openai", "transformers", "tgi", "vllm", "inference-providers"]) -- The backend for the judge.
- url (str | None) -- The URL for the OpenAI API.
- api_key (str | None) -- The API key for the OpenAI API (either OpenAI or HF key).
- max_tokens (int) -- The maximum number of tokens to generate. Defaults to 512.
- response_format (BaseModel | None) -- The format of the response from the API, used for the OpenAI and TGI backend.
- hf_provider (Literal["black-forest-labs", "cerebras", "cohere", "fal-ai", "fireworks-ai", "inference-providers", "hyperbolic", "nebius", "novita", "openai", "replicate", "sambanova", "together"] | None) -- The Hugging Face provider when using the inference-providers backend.
- backend_options (dict | None) -- Options for the backend. Currently only supported for litellm.
A class representing a judge for evaluating answers using the chosen backend.
Methods:
- evaluate_answer: Evaluates an answer using the OpenAI API or Transformers library.
- __lazy_load_client: Lazy loads the OpenAI client or Transformers pipeline.
- __call_api: Calls the API to get the judge's response.
- __call_transformers: Calls the Transformers pipeline to get the judge's response.
- __call_vllm: Calls the VLLM pipeline to get the judge's response.
lighteval.metrics.utils.llm_as_judge.JudgeLM.dict_of_lists_to_list_of_dicts
Each dictionary in the output list will contain one element from each list in the input dictionary, with the same keys as the input dictionary.
Example:
>>> dict_of_lists_to_list_of_dicts({'k': [1, 2, 3], 'k2': ['a', 'b', 'c']})
[{'k': 1, 'k2': 'a'}, {'k': 2, 'k2': 'b'}, {'k': 3, 'k2': 'c'}]
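The transformation in the example can be reproduced with a short standalone function. This is an illustrative re-implementation under a hypothetical name (`to_list_of_dicts`), not the library's method; in particular, it silently truncates to the shortest list, which the original may handle differently.

```python
def to_list_of_dicts(dict_of_lists: dict) -> list[dict]:
    """Hypothetical helper: turn {'k': [1, 2]} style input into [{'k': 1}, {'k': 2}]."""
    keys = list(dict_of_lists.keys())
    # zip(*values) walks the lists in parallel, one position at a time.
    return [dict(zip(keys, values)) for values in zip(*dict_of_lists.values())]
```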
lighteval.metrics.utils.llm_as_judge.JudgeLM.evaluate_answer
- answer (str) -- Answer given by the evaluated model.
- options (list[str] | None) -- Optional list of answer options.
- gold (str | None) -- Optional reference answer.
Returns: A tuple containing the score, prompts, and judgment.
Evaluates an answer using either the Transformers library or the OpenAI API.