---
title: L3Score
datasets:
- google/spiqa
tags:
- evaluate
- metric
- semantic-similarity
- qa
- llm-eval
description: >
  L3Score is a metric for evaluating the semantic similarity of free-form
  answers in question answering tasks. It uses log-probabilities of "Yes"/"No"
  tokens from a language model acting as a judge. Based on the SPIQA benchmark:
  https://arxiv.org/pdf/2407.09413
sdk: gradio
sdk_version: 5.25.1
app_file: app.py
pinned: false
---

# Metric Card: L3Score

## 📌 Description

**L3Score** evaluates how semantically close a model-generated answer is to a reference answer for a given question. It prompts a **language model as a judge** using the following format:

```text
You are given a question, ground-truth answer, and a candidate answer.

Question: {question}
Ground-truth answer: {gt}
Candidate answer: {answer}

Is the semantic meaning of the ground-truth and candidate answers similar?
Answer in one word - Yes or No.
```

The model's **log-probabilities** for the "Yes" and "No" tokens are used to compute the score.

### 🧮 Scoring Logic

Let $l_{\text{yes}}$ and $l_{\text{no}}$ be the log-probabilities of the "Yes" and "No" tokens, respectively, among the judge's top-5 next-token candidates.

- If neither token is in the top-5:

$$
\text{L3Score} = 0
$$

- If both are present:

$$
\text{L3Score} = \frac{\exp(l_{\text{yes}})}{\exp(l_{\text{yes}}) + \exp(l_{\text{no}})}
$$

- If only one is present, the missing token's probability is estimated as the minimum of (see the formula below):
  - the probability mass remaining outside the top-5 tokens
  - the probability of the least likely top-5 token
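
One compact way to write this estimate (the $\hat{p}$ notation below is ours, for illustration; it does not appear in the paper):

$$
\hat{p} = \min\left(1 - \sum_{t \in \text{top-5}} \exp(l_t),\ \min_{t \in \text{top-5}} \exp(l_t)\right)
$$

The score is then computed as in the two-token case, with $\hat{p}$ substituted for the missing $\exp(l_{\text{yes}})$ or $\exp(l_{\text{no}})$.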

The score ranges from 0 to 1, where 1 indicates the judge's highest confidence that the predicted and reference answers are semantically equivalent.

See the [SPIQA paper](https://arxiv.org/pdf/2407.09413) for details.
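
To make the three cases concrete, here is a minimal sketch of the scoring logic in plain Python. It is not the packaged implementation: it assumes the judge's top-5 tokens arrive as a simple `{token: logprob}` mapping and that "Yes"/"No" appear as exact token strings.

```python
import math


def l3score_from_top5(top5_logprobs: dict[str, float]) -> float:
    """Score a single answer from a judge's top-5 token log-probabilities."""
    l_yes = top5_logprobs.get("Yes")
    l_no = top5_logprobs.get("No")

    # Case 1: neither "Yes" nor "No" made the top-5.
    if l_yes is None and l_no is None:
        return 0.0

    # Case 2: both present -> normalize their probabilities.
    if l_yes is not None and l_no is not None:
        p_yes, p_no = math.exp(l_yes), math.exp(l_no)
        return p_yes / (p_yes + p_no)

    # Case 3: one missing -> estimate its probability as the minimum of
    # the mass left outside the top-5 and the least likely top-5 token.
    probs = [math.exp(l) for l in top5_logprobs.values()]
    p_hat = min(max(0.0, 1.0 - sum(probs)), min(probs))
    if l_yes is not None:
        return math.exp(l_yes) / (math.exp(l_yes) + p_hat)
    return p_hat / (p_hat + math.exp(l_no))


# Example: a judge that is almost certain the answers match.
print(l3score_from_top5(
    {"Yes": -0.02, "The": -4.2, "No": -4.6, "Answer": -5.0, "It": -5.3}
))  # ~0.99
```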

## 🚀 How to Use

```python
import evaluate

l3score = evaluate.load("nhop/L3Score")

questions = ["What is the capital of France?", "What is the capital of Germany?"]
predictions = ["Paris", "Moscow"]
references = ["Paris", "Berlin"]

score = l3score.compute(
    questions=questions,
    predictions=predictions,
    references=references,
    api_key="your-openai-api-key",
    provider="openai",
    model="gpt-4o-mini"
)

print(score)
# {'L3Score': 0.49..., 'Cost': ...}
```

---

### 📋 Inputs

| Name          | Type        | Description |
|---------------|-------------|-------------|
| `questions`   | `list[str]` | The list of input questions. |
| `predictions` | `list[str]` | Answers generated by the model under evaluation. |
| `references`  | `list[str]` | Ground-truth or reference answers. |
| `api_key`     | `str`       | API key for the selected LLM provider. |
| `provider`    | `str`       | LLM provider; must support top-n token log-probabilities. Currently supported: `"openai"`, `"deepseek"`, `"xai"`. |
| `model`       | `str`       | Name of the judge LLM (e.g., `"gpt-4o-mini"`). |
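
The provider requirement exists because the metric reads log-probabilities from the first generated token. As a rough sketch (the metric's actual internals may differ), this is what such a judge call looks like with the OpenAI Python client, using the prompt template from the description above:

```python
from openai import OpenAI

PROMPT = (
    "You are given a question, ground-truth answer, and a candidate answer.\n\n"
    "Question: {question}\n"
    "Ground-truth answer: {gt}\n"
    "Candidate answer: {answer}\n\n"
    "Is the semantic meaning of the ground-truth and candidate answers similar? "
    "Answer in one word - Yes or No."
)

client = OpenAI(api_key="your-openai-api-key")
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": PROMPT.format(
        question="What is the capital of France?",
        gt="Paris",
        answer="Paris",
    )}],
    max_tokens=1,    # only the first token ("Yes"/"No") matters
    logprobs=True,
    top_logprobs=5,  # expose the top-5 token log-probabilities
)

# {token: logprob} for the first generated token, as consumed by the scoring logic.
top5 = {
    entry.token: entry.logprob
    for entry in response.choices[0].logprobs.content[0].top_logprobs
}
```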

---

### 📤 Output

A dictionary with the score and the cost of querying the LLM provider's API:

```python
{"L3Score": float, "Cost": float}
```

`L3Score` is the **average score** over all (question, prediction, reference) triplets; `Cost` is the total cost of all API calls.

---

## 💡 Examples

```python
import evaluate

l3score = evaluate.load("nhop/L3Score")

score = l3score.compute(
    questions=["What is the capital of France?"],
    predictions=["Paris"],
    references=["Paris"],
    api_key="your-openai-api-key",
    provider="openai",
    model="gpt-4o-mini"
)
# {'L3Score': 0.99..., 'Cost': ...}

score = l3score.compute(
    questions=["What is the capital of Germany?"],
    predictions=["Moscow"],
    references=["Berlin"],
    api_key="your-openai-api-key",
    provider="openai",
    model="gpt-4o-mini"
)
# {'L3Score': 0.00..., 'Cost': ...}
```

---

## ⚠️ Limitations and Bias

- Requires a judge model that exposes **top-n token log-probabilities** (e.g., OpenAI, DeepSeek, xAI).
- Scores are **only comparable when produced by the same judge model**.

---

## 📚 Citation

```bibtex
@article{pramanick2024spiqa,
  title={SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers},
  author={Pramanick, Shraman and Chellappa, Rama and Venugopalan, Subhashini},
  journal={arXiv preprint arXiv:2407.09413},
  year={2024}
}
```

---

## 🔗 Further References

- 🤗 [Dataset on Hugging Face](https://huggingface.co/datasets/google/spiqa)
- 📁 [GitHub Repository](https://github.com/google/spiqa)
- 📄 [SPIQA Paper (arXiv:2407.09413)](https://arxiv.org/pdf/2407.09413)