# Evaluation Guidelines

## 1. Selected Metrics

### 1.1 Correctness
Combines elements of:
- **coverage**: the portion of vital information (as identified by a powerful LLM) in the ground-truth answer that is covered by the generated answer. This metric is strongly inspired by the work in [1].
- **relevance**: the portion of the generated response that directly addresses the question, regardless of its factual correctness.

Graded on a continuous scale with the following representative points (a scoring sketch in Python follows the list):
- **2:** Correct and relevant (no irrelevant information)
- **1:** Correct but contains irrelevant information
- **0:** No answer provided (abstention)
- **-1:** Incorrect answer
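
To make the grading concrete, here is a minimal Python sketch of one way such a correctness score could be computed. The function names, the judgment inputs (nugget and sentence counts), and the interpolation between the representative points are all illustrative assumptions; in the actual evaluation an LLM produces the underlying judgments, and the official combination may differ.

```python
def coverage(covered_nuggets: int, total_vital_nuggets: int) -> float:
    """Fraction of vital ground-truth nuggets present in the generated answer.

    Which nuggets count as covered would be judged by an LLM, following [1];
    here the judgments are assumed to be given.
    """
    if total_vital_nuggets == 0:
        return 0.0
    return covered_nuggets / total_vital_nuggets


def relevance(relevant_sentences: int, total_sentences: int) -> float:
    """Fraction of the generated answer that directly addresses the question."""
    if total_sentences == 0:
        return 0.0
    return relevant_sentences / total_sentences


def correctness(abstained: bool, cov: float, rel: float) -> float:
    """Place an answer on the continuous scale anchored at -1, 0, 1 and 2.

    Assumed interpolation: abstention scores 0; an answer covering no vital
    information scores -1; otherwise the score grows with coverage, from 1
    (fully correct but padded with irrelevant text) up to 2 (fully correct
    and fully relevant).
    """
    if abstained:
        return 0.0
    if cov == 0.0:
        return -1.0
    return cov * (1.0 + rel)
```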

### 1.2 Faithfulness
Assesses whether the response is **grounded in the retrieved passages**. This metric reimplements the approach described in [2].

Graded on a continuous scale with the following representative points (a scoring sketch follows the list):
- **1:** Full support. All answer parts are grounded
- **0:** Partial support. Not all answer parts are grounded
- **-1:** No support. No answer part is grounded
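
Below is a minimal sketch of one way to place a response on this scale, assuming the answer has already been decomposed into claims and each claim judged as grounded or not (e.g., by an LLM, as in RAGAs [2]). The linear mapping is an assumption, chosen so that the three representative points land exactly at 1, 0, and -1.

```python
def faithfulness(grounded_claims: int, total_claims: int) -> float:
    """Map the grounded fraction of answer claims linearly onto [-1, 1].

    Fraction 1.0 -> 1 (full support), 0.5 -> 0 (partial support),
    0.0 -> -1 (no support).
    """
    if total_claims == 0:
        return -1.0  # nothing verifiable; treated as unsupported (assumption)
    return 2.0 * grounded_claims / total_claims - 1.0
```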

### 1.3 Aggregation of Metrics
Both **correctness** and **faithfulness** will contribute to the final evaluation score; one possible aggregation is sketched below.
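
The guidelines do not specify how the two metrics are combined, so this sketch simply rescales both to [0, 1] and averages them; the rescaling and the equal weighting are assumptions, not the official formula.

```python
def final_score(correctness_score: float, faithfulness_score: float) -> float:
    """Aggregate correctness and faithfulness (equal weighting is an assumption).

    Correctness lies in [-1, 2] and faithfulness in [-1, 1]; both are
    rescaled to [0, 1] before averaging so that neither range dominates.
    """
    c = (correctness_score + 1.0) / 3.0
    f = (faithfulness_score + 1.0) / 2.0
    return (c + f) / 2.0
```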

## 2. Manual and Automated Evaluation

### 2.1 First Stage
- Automated evaluation by a state-of-the-art LLM, using the **correctness** and **faithfulness** metrics to rank the participating teams.

### 2.2 Final Stage
- **Manual evaluation** of the top-ranked submissions (e.g., the **top 10 teams**) to determine the winners.

## 3. Other Notable Points
- Answer length is **unlimited**, but only the first **300 words** will be evaluated.
- Participants will submit (a sketch of one submission record follows this list):
  - **The question ID**.
  - **The question**.
  - **The answer**.
  - **The supporting passages, in decreasing order of importance, with their respective FinWeb doc-IDs**.
  - **The full prompt used for generation**.
- Remarks:
  - The number of supporting passages is unlimited, but only the first 10 will be considered by the faithfulness metric.
  - We accept partial submissions in which not all questions are answered.
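
To illustrate the submission fields and the evaluation limits above, here is a sketch that assembles one submission record and writes it as JSONL. The file format, the field names, and the example values are assumptions for illustration, not an official schema; only the listed fields themselves come from the guidelines.

```python
import json

MAX_EVALUATED_WORDS = 300       # evaluation-side limit: first 300 words scored
MAX_FAITHFULNESS_PASSAGES = 10  # evaluation-side limit: first 10 passages used


def make_record(question_id, question, answer, passages, prompt):
    """Build one submission record (field names are illustrative assumptions).

    `passages` is a list of (finweb_doc_id, passage_text) pairs, already
    sorted by decreasing importance.
    """
    return {
        "question_id": question_id,
        "question": question,
        "answer": answer,        # may be longer, but only ~300 words count
        "passages": [
            {"doc_id": doc_id, "text": text} for doc_id, text in passages
        ],
        "prompt": prompt,        # the full prompt used for generation
    }


# Example: write a single-record (partial) submission as JSONL.
record = make_record(
    question_id="q-001",
    question="What drove the company's FY2023 revenue growth?",
    answer="Revenue grew primarily because ...",
    passages=[("finweb-doc-42", "Revenue increased 12% year over year ...")],
    prompt="Answer the question using only the passages below: ...",
)
with open("submission.jsonl", "w") as fh:
    fh.write(json.dumps(record) + "\n")
```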

These measures align the evaluation framework with the challenge's emphasis on **retrieval-augmented systems**.

## References

[1] The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models. Ronak Pradeep, Nandan Thakur, Shivani Upadhyay, Daniel Campos, Nick Craswell, Jimmy Lin. TREC 2024 RAG Track.

[2] RAGAs: Automated Evaluation of Retrieval Augmented Generation. Shahul Es, Jithin James, Luis Espinosa Anke, Steven Schockaert. EACL 2024.