# Evaluation Guidelines
## 1. Selected Metrics
### 1.1 Correctness
Combines elements of:
- **coverage**: the portion of vital information (as identified by a powerful LLM) in the ground-truth answer that is covered by the generated answer. This metric is strongly inspired by the work in [1].
- **relevance**: the portion of the generated response that directly addresses the question, regardless of its factual correctness.
Graded on a continuous scale with the following representative points:
- **2:** Correct and relevant (no irrelevant information)
- **1:** Correct but contains irrelevant information
- **0:** No answer provided (abstention)
- **-1:** Incorrect answer
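
To make the scale concrete, here is a minimal sketch of how a judge's outputs could be mapped onto it. It assumes an LLM judge has already produced a nugget-coverage fraction and a relevance fraction; the mapping below (including treating zero coverage as incorrect) is an illustrative assumption, not the official scoring formula.

```python
def correctness_score(coverage: float, relevance: float, answered: bool) -> float:
    """Illustrative mapping of judge outputs onto the correctness scale.

    coverage  - fraction of vital ground-truth nuggets covered ([0, 1]),
                in the spirit of the nugget recall of [1]
    relevance - fraction of the answer that directly addresses the question
    Both values are assumed to come from an LLM judge.
    """
    if not answered:
        return 0.0   # abstention
    if coverage == 0.0:
        return -1.0  # assumption: no vital nugget covered counts as incorrect
    # interpolate between 1 (correct, padded with irrelevant text)
    # and 2 (correct and fully on-topic)
    return 1.0 + relevance
```
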
### 1.2 Faithfulness
Assesses whether the response is **grounded in the retrieved passages**. This metric reimplements the work discussed in [2].
Graded on a continuous scale with the following representative points:
- **1:** Full support. All answer parts are grounded
- **0:** Partial support. Not all answer parts are grounded
- **-1:** No support. No answer parts are grounded
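
In the same spirit, here is a minimal sketch of the per-claim grounding idea behind [2]: the answer is decomposed into claims, each claim is checked against the supporting passages by an LLM judge, and the fraction of supported claims is mapped onto the scale. The linear mapping is an assumption for illustration.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Illustrative mapping of per-claim grounding verdicts onto the scale.

    claim_supported[i] is True when claim i of the answer is entailed by
    the supporting passages (only the first 10 passages are considered);
    the verdicts are assumed to come from an LLM judge, as in RAGAs [2].
    """
    if not claim_supported:
        return 0.0
    frac = sum(claim_supported) / len(claim_supported)
    # 1.0 -> full support (1), 0.5 -> partial (0), 0.0 -> no support (-1)
    return 2.0 * frac - 1.0
```
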
### 1.3 Aggregation of Metrics
Both **correctness** and **faithfulness** will contribute to the final evaluation score.
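
The exact weighting is not specified here; purely as an illustration, a weighted combination could look like the following, where the weight `w` is a placeholder rather than an official parameter.

```python
def final_score(correctness: float, faithfulness: float, w: float = 0.5) -> float:
    # placeholder linear combination; the official aggregation may differ
    return w * correctness + (1.0 - w) * faithfulness
```
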
## 2. Manual and Automated Evaluation
### 2.1 First Stage
- Automated evaluation by a state-of-the-art LLM, using the **correctness** and **faithfulness** metrics to rank the participating teams.
### 2.2 Final Stage
- **Manual evaluation** of the top-ranked submissions (e.g., the **top 10 teams**) to determine the winners.
## 3. Other Notable Points
- Answer length is **unlimited**, but only the first **300 words** will be evaluated.
- Participants will submit (see the Answer file [json schema](Answer_File.json.schema) and [example](Answer_File_Example.json), and the sketch after this list):
  - **The Question ID**.
  - **The Question**.
  - **The answer**.
  - **Supporting passages in decreasing order of importance, with their respective FinWeb doc-IDs**.
  - **The full prompt used for generation**.
- Remarks:
  - The number of supporting passages is unlimited, but only the first 10 will be considered by the Faithfulness metric.
  - We accept partial submissions in which not all questions are answered.
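
As a convenience, here is a minimal Python sketch of assembling one answer record. The field names below are placeholders; the authoritative structure is the [json schema](Answer_File.json.schema) linked above.

```python
import json

# Placeholder field names -- consult Answer_File.json.schema for the
# authoritative structure.
record = {
    "question_id": 1,
    "question": "...",
    "answer": "...",  # only the first 300 words are evaluated
    "supporting_passages": [
        # decreasing order of importance; only the first 10 are used
        # by the Faithfulness metric
        {"passage": "...", "doc_id": "<FinWeb doc-ID>"},
    ],
    "prompt": "...",  # the full prompt used for generation
}

with open("answers.json", "w", encoding="utf-8") as f:
    json.dump([record], f, indent=2)
```
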
These measures align the evaluation framework with the challenge's emphasis on **retrieval-augmented systems**.
## References
[1] The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models. Ronak Pradeep, Nandan Thakur, Shivani Upadhyay, Daniel Campos, Nick Craswell, Jimmy Lin. TREC 2024 RAG Track.
[2] RAGAs: Automated Evaluation of Retrieval Augmented Generation. Shahul Es, Jithin James, Luis Espinosa Anke, Steven Schockaert. EACL 2024.