---
title: FEVER
emoji: πŸ”₯
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
  - evaluate
  - metric
description: >-
  The FEVER (Fact Extraction and VERification) metric evaluates the performance of systems that verify factual claims against evidence retrieved from Wikipedia.

  It consists of three main components: Label accuracy (measures how often the predicted claim label matches the gold label), FEVER score (considers a prediction correct only if the label is correct and at least one complete gold evidence set is retrieved), and Evidence F1 (computes the micro-averaged precision, recall, and F1 between predicted and gold evidence sentences).

  The FEVER score is the official leaderboard metric used in the FEVER shared tasks. All metrics range from 0 to 1, with higher values indicating better performance.
---

# Metric Card for FEVER

## Metric description

The FEVER (Fact Extraction and VERification) metric evaluates the performance of systems that verify factual claims against evidence retrieved from Wikipedia. It was introduced in the FEVER shared task and has become a standard benchmark for fact verification systems.

FEVER consists of three main evaluation components:

1. **Label accuracy**: measures how often the predicted claim label (SUPPORTED, REFUTED, or NOT ENOUGH INFO) matches the gold label
2. **FEVER score**: considers a prediction correct only if the label is correct _and_ at least one complete gold evidence set is retrieved
3. **Evidence F1**: computes the micro-averaged precision, recall, and F1 between predicted and gold evidence sentences
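To make the three definitions concrete, here is a minimal, illustrative sketch of how they could be computed. This is a simplified reimplementation for exposition, not the code behind `load("fever")`; it assumes evidence recall is measured against the union of all gold evidence sets, matching the example outputs shown below.

```python
def fever_metrics(predictions, references):
    """Illustrative sketch: label accuracy, FEVER score, and micro-averaged evidence P/R/F1."""
    correct_labels = 0
    strict_correct = 0  # label correct AND at least one full gold evidence set retrieved
    tp = pred_total = gold_total = 0

    for pred, ref in zip(predictions, references):
        label_ok = pred["label"] == ref["label"]
        correct_labels += label_ok

        pred_ev = set(pred["evidence"])
        # FEVER score: some complete gold evidence set must be a subset of the prediction
        evidence_ok = any(set(gold_set) <= pred_ev for gold_set in ref["evidence_sets"])
        strict_correct += label_ok and evidence_ok

        # Micro-averaged evidence counts, pooled across all examples
        gold_ev = {e for gold_set in ref["evidence_sets"] for e in gold_set}
        tp += len(pred_ev & gold_ev)
        pred_total += len(pred_ev)
        gold_total += len(gold_ev)

    n = len(predictions)
    precision = tp / pred_total if pred_total else 0.0
    recall = tp / gold_total if gold_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "label_accuracy": correct_labels / n,
        "fever_score": strict_correct / n,
        "evidence_precision": precision,
        "evidence_recall": recall,
        "evidence_f1": f1,
    }
```

Note how the FEVER score is strictly harder than label accuracy: a correct label with only partial evidence still scores 0 for that claim.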

## How to use

The metric takes two inputs: predictions (a list of dictionaries containing predicted labels and evidence) and references (a list of dictionaries containing gold labels and evidence sets).

```python
from evaluate import load
fever = load("fever")
predictions = [{"label": "SUPPORTED", "evidence": ["E1", "E2"]}]
references = [{"label": "SUPPORTED", "evidence_sets": [["E1", "E2"]]}]
results = fever.compute(predictions=predictions, references=references)
```

## Output values

This metric outputs a dictionary containing five float values:

```python
print(results)
{
    'label_accuracy': 1.0,
    'fever_score': 1.0,
    'evidence_precision': 1.0,
    'evidence_recall': 1.0,
    'evidence_f1': 1.0
}
```

- **label_accuracy**: Proportion of claims with correctly predicted labels (0-1, higher is better)
- **fever_score**: Proportion of claims where both the label and at least one full gold evidence set are correct (0-1, higher is better). This is the **official FEVER leaderboard metric**
- **evidence_precision**: Micro-averaged precision of evidence retrieval (0-1, higher is better)
- **evidence_recall**: Micro-averaged recall of evidence retrieval (0-1, higher is better)
- **evidence_f1**: Micro-averaged F1 of evidence retrieval (0-1, higher is better)

All values range from 0 to 1, with **1.0 representing perfect performance**.

### Values from popular papers

The FEVER shared task has established performance benchmarks on the FEVER dataset:

- Human performance: FEVER score of ~0.92
- Top systems (2018-2019): FEVER scores ranging from 0.64 to 0.70
- State-of-the-art models (2020+): FEVER scores above 0.75

Performance varies significantly based on:

- Model architecture (retrieval + verification pipeline vs. end-to-end)
- Pre-training (BERT, RoBERTa, etc.)
- Evidence retrieval quality

## Examples

Perfect prediction (label and evidence both correct):

```python
from evaluate import load
fever = load("fever")
predictions = [{"label": "SUPPORTED", "evidence": ["E1", "E2"]}]
references = [{"label": "SUPPORTED", "evidence_sets": [["E1", "E2"]]}]
results = fever.compute(predictions=predictions, references=references)
print(results)
{
    'label_accuracy': 1.0,
    'fever_score': 1.0,
    'evidence_precision': 1.0,
    'evidence_recall': 1.0,
    'evidence_f1': 1.0
}
```

Correct label but incomplete evidence:

```python
from evaluate import load
fever = load("fever")
predictions = [{"label": "SUPPORTED", "evidence": ["E1"]}]
references = [{"label": "SUPPORTED", "evidence_sets": [["E1", "E2"]]}]
results = fever.compute(predictions=predictions, references=references)
print(results)
{
    'label_accuracy': 1.0,
    'fever_score': 0.0,
    'evidence_precision': 1.0,
    'evidence_recall': 0.5,
    'evidence_f1': 0.6666666666666666
}
```

Incorrect label (FEVER score is 0):

```python
from evaluate import load
fever = load("fever")
predictions = [{"label": "REFUTED", "evidence": ["E1", "E2"]}]
references = [{"label": "SUPPORTED", "evidence_sets": [["E1", "E2"]]}]
results = fever.compute(predictions=predictions, references=references)
print(results)
{
    'label_accuracy': 0.0,
    'fever_score': 0.0,
    'evidence_precision': 1.0,
    'evidence_recall': 1.0,
    'evidence_f1': 1.0
}
```

Multiple valid evidence sets:

```python
from evaluate import load
fever = load("fever")
predictions = [{"label": "SUPPORTED", "evidence": ["E3", "E4"]}]
references = [{"label": "SUPPORTED", "evidence_sets": [["E1", "E2"], ["E3", "E4"]]}]
results = fever.compute(predictions=predictions, references=references)
print(results)
{
    'label_accuracy': 1.0,
    'fever_score': 1.0,
    'evidence_precision': 1.0,
    'evidence_recall': 0.5,
    'evidence_f1': 0.6666666666666666
}
```

## Limitations and bias

The FEVER metric has several important considerations:

1. **Evidence set completeness**: The FEVER score requires retrieving _all_ sentences in at least one gold evidence set. Partial evidence retrieval (even if sufficient for verification) results in a score of 0.
2. **Multiple valid evidence sets**: Some claims can be verified using different sets of evidence. The metric gives credit if any one complete set is retrieved.
3. **Micro-averaging**: Evidence precision, recall, and F1 are micro-averaged across all examples, which means performance on longer evidence sets has more influence on the final metrics.
4. **Label dependency**: The FEVER score requires both correct labeling _and_ correct evidence retrieval, making it a strict metric that penalizes systems for either type of error.
5. **Wikipedia-specific**: The metric was designed for Wikipedia-based fact verification and may not generalize directly to other knowledge sources or domains.
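The micro-averaging effect from point 3 is easy to see with a toy calculation (the numbers here are invented for illustration): an example with a long gold evidence set contributes far more pooled counts than one with a short set, so it dominates the micro average.

```python
# Two toy examples: a short gold set (fully retrieved) and a long one (mostly missed).
examples = [
    {"retrieved": 1, "gold": 1},   # short set: per-example recall 1.0
    {"retrieved": 1, "gold": 9},   # long set: per-example recall ~0.11
]

# Macro-averaged recall weights both examples equally.
macro = sum(e["retrieved"] / e["gold"] for e in examples) / len(examples)

# Micro-averaged recall pools the counts, so the long set dominates.
micro = sum(e["retrieved"] for e in examples) / sum(e["gold"] for e in examples)

print(f"macro recall = {macro:.2f}")   # 0.56
print(f"micro recall = {micro:.2f}")   # 0.20
```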

## Citation

```bibtex
@inproceedings{thorne2018fever,
  title={FEVER: a Large-scale Dataset for Fact Extraction and VERification},
  author={Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Mittal, Arpit},
  booktitle={Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)},
  pages={809--819},
  year={2018}
}
```

## Further References

- [FEVER Dataset Website](https://fever.ai/dataset/)
- [FEVER Paper on arXiv](https://arxiv.org/abs/1803.05355)
- [Hugging Face Tasks -- Fact Checking](https://huggingface.co/tasks/text-classification)
- [FEVER Shared Task Overview](https://fever.ai/task.html)