---
title: ParaPLUIE
emoji: ☂️
tags:
- evaluate
- metric
description: >-
    ParaPLUIE is a metric for evaluating the semantic proximity of two sentences.
    ParaPLUIE uses the perplexity of an LLM to compute a confidence score.
    It has shown the highest correlation with human judgement on paraphrase classification while keeping the computational cost low: roughly the cost of generating a single token.
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
---

# Metric Card for ParaPLUIE (Paraphrase Generation Evaluation Powered by an LLM)

W.I.P

## Metric Description
ParaPLUIE is a metric for evaluating the semantic proximity of two sentences.
ParaPLUIE uses the perplexity of an LLM to compute a confidence score.
It has shown the highest correlation with human judgement on paraphrase classification while keeping the computational cost low: roughly the cost of generating a single token.
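The idea behind such a score can be sketched in a few lines. This is only an illustrative toy, not the actual ParaPLUIE implementation: the prompt format and the log-odds score are assumptions, and `toy_log_prob` is a hypothetical stand-in for querying a real LLM's token log-probabilities.

```python
import math

def toy_log_prob(prompt: str, token: str) -> float:
    """Stand-in for an LLM's log P(token | prompt).
    A real implementation would query a causal language model."""
    # Crude heuristic: word overlap between the two sentences in the prompt.
    src, hyp = prompt.split("|")[:2]
    src_words = set(src.lower().split())
    hyp_words = set(hyp.lower().split())
    total = len(src_words | hyp_words) or 1
    p_yes = len(src_words & hyp_words) / total
    p = p_yes if token == "Yes" else 1.0 - p_yes
    return math.log(max(p, 1e-9))

def confidence_score(source: str, hypothesis: str) -> float:
    """Log-odds that the model answers 'Yes' to a paraphrase question.
    Positive means the pair is judged a likely paraphrase."""
    prompt = f"{source}|{hypothesis}"
    return toy_log_prob(prompt, "Yes") - toy_log_prob(prompt, "No")

print(confidence_score("The cat sat on the mat.", "A cat was sitting on the mat."))
```

Because only the probability of one answer token is needed, the cost stays on the order of a single token generation, which is what keeps the metric cheap compared to full LLM-as-a-judge generation.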

## How to Use
This metric requires a source sentence and its hypothetical paraphrase.

```python
>>> accuracy_metric = evaluate.load("accuracy")
>>> results = accuracy_metric.compute(references=[0, 1], predictions=[0, 1])
>>> print(results)
{'accuracy': 1.0}
```

### Inputs
- **predictions** (`list` of `int`): Predicted labels.
- **references** (`list` of `int`): Ground truth labels.
- **normalize** (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
- **sample_weight** (`list` of `float`): Sample weights. Defaults to None.

### Output Values
- **accuracy** (`float` or `int`): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input if `normalize` is set to `False`. A higher score means higher accuracy.
Output Example(s):
```python
{'accuracy': 1.0}
```
This metric outputs a dictionary containing the accuracy score.

#### Values from Papers
ParaPLUIE has been compared to other state-of-the-art metrics in [Lemesle et al. (2025)](https://aclanthology.org/2025.coling-main.538/) and showed a high correlation with human judgement while being less compute-intensive than LLM-as-a-judge methods.

### Examples
Example 1: A simple example
```python
>>> accuracy_metric = evaluate.load("accuracy")
>>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])
>>> print(results)
{'accuracy': 0.5}
```
Example 2: The same as Example 1, except with `normalize` set to `False`.
```python
>>> accuracy_metric = evaluate.load("accuracy")
>>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], normalize=False)
>>> print(results)
{'accuracy': 3.0}
```
Example 3: The same as Example 1, except with `sample_weight` set.
```python
>>> accuracy_metric = evaluate.load("accuracy")
>>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], sample_weight=[0.5, 2, 0.7, 0.5, 9, 0.4])
>>> print(results)
{'accuracy': 0.8778625954198473}
```

## Limitations and Bias
This metric is computed with an LLM and therefore inherits the limitations and biases of the model used.

## Citation
```bibtex
@inproceedings{lemesle-etal-2025-paraphrase,
    title = "Paraphrase Generation Evaluation Powered by an {LLM}: A Semantic Metric, Not a Lexical One",
    author = "Lemesle, Quentin  and
      Chevelu, Jonathan  and
      Martin, Philippe  and
      Lolive, Damien  and
      Delhay, Arnaud  and
      Barbot, Nelly",
    booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
    year = "2025",
    url = "https://aclanthology.org/2025.coling-main.538/"
}
```