---
title: ParaPLUIE
emoji: ☂️
tags:
- evaluate
- metric
description: >-
  ParaPLUIE is a metric for evaluating the semantic proximity between two sentences.
  ParaPLUIE uses the perplexity of an LLM to compute a confidence score. It has
  shown the highest correlation with human judgment on paraphrase
  classification while maintaining a low computational cost, as it is roughly
  equivalent to the cost of generating a single token.
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
short_description: ParaPLUIE, an LLM-based semantic proximity metric
---

# Metric Card for ParaPLUIE (Paraphrase Generation Evaluation Powered by an LLM)

## Metric Description
ParaPLUIE is a metric for evaluating the semantic proximity between two sentences.
It uses the perplexity of an LLM to compute a confidence score.
ParaPLUIE has shown the highest correlation with human judgment on paraphrase classification while maintaining a low computational cost, as its cost is roughly equivalent to that of generating a single token.
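
The intuition can be sketched as follows: the LLM is prompted to judge whether the hypothesis paraphrases the source, and the score compares the model's likelihood of answering "Yes" versus "No" as a single next token. The snippet below is a simplified, hypothetical illustration of this idea, not ParaPLUIE's actual implementation: the prompt wording is invented here, and the real templates (see `show_templates()` below) and special-token handling differ.

```python
# Hypothetical sketch of an LLM-based paraphrase score: compare
# log P("Yes") vs log P("No") as the next token after a judgment prompt.
# NOT ParaPLUIE's implementation; the prompt below is invented.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def sketch_score(source: str, hypothesis: str) -> float:
    prompt = (
        f'Sentence 1: "{source}"\nSentence 2: "{hypothesis}"\n'
        "Is sentence 2 a paraphrase of sentence 1? Answer Yes or No: "
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    log_probs = torch.log_softmax(logits, dim=-1)
    yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
    no_id = tokenizer.encode("No", add_special_tokens=False)[0]
    # Positive when "Yes" is more likely than "No".
    return (log_probs[yes_id] - log_probs[no_id]).item()
```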

## How to Use

This metric requires a source sentence and its hypothetical paraphrase.

```python
import evaluate
ppluie = evaluate.load("qlemesle/parapluie")
ppluie.init(model="mistralai/Mistral-7B-Instruct-v0.2")
S = "Have you ever seen a tsunami ?" 
H = "Have you ever seen a tiramisu ?"
results = ppluie.compute(sources=[S], hypotheses=[H])
print(results)
>>> {'scores': [-16.97607421875]}
```

### Inputs

- **sources** (`list` of `string`): Source sentences.
- **hypotheses** (`list` of `string`): Hypothetical paraphrases.

### Output Values

- **scores** (`list` of `float`): ParaPLUIE scores, one per (source, hypothesis) pair. The minimum possible value is -inf and the maximum is +inf. A score greater than 0 means that the sentences are paraphrases; a score lower than 0 indicates the opposite.

This metric outputs a dictionary containing the list of scores.
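
For instance, scores can be turned into binary paraphrase decisions by thresholding at 0 (a minimal illustration reusing the variables from the example above):

```python
# Scores above 0 are classified as paraphrases.
results = ppluie.compute(sources=[S], hypotheses=[H])
labels = [score > 0 for score in results["scores"]]
print(labels)
>>> [False]
```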

### Examples

Simple example
```python
import evaluate
ppluie = evaluate.load("qlemesle/parapluie")
ppluie.init(model="mistralai/Mistral-7B-Instruct-v0.2")
S = "Have you ever seen a tsunami ?" 
H = "Have you ever seen a tiramisu ?"
results = ppluie.compute(sources=[S], hypotheses=[H])
print(results)
>>> {'scores': [-16.97607421875]}
```

Configure metric
```python
ppluie.init(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    device="cuda:0",
    template="FS-DIRECT",
    use_chat_template=True,
    half_mode=True,
    n_right_specials_tokens=1,
)
```
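
Once configured, `compute` accepts several pairs in a single call; the sentence pairs below are illustrative:

```python
# One score is returned per (source, hypothesis) pair, in order.
sources = [
    "Have you ever seen a tsunami ?",
    "Have you ever seen a tsunami ?",
]
hypotheses = [
    "Did you ever witness a tsunami ?",
    "Have you ever seen a tiramisu ?",
]
results = ppluie.compute(sources=sources, hypotheses=hypotheses)
print(results["scores"])
```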

Show the available prompting templates
```python
ppluie.show_templates()
>>> DIRECT
>>> MEANING
>>> INDIRECT
>>> FS-DIRECT
>>> FS-DIRECT_MAJ
>>> FS-DIRECT_FR
>>> FS-DIRECT_MAJ_FR
>>> FS-DIRECT_FR_MIN
>>> NETWORK
```

Show the LLMs that have already been tested with ParaPLUIE
```python
ppluie.show_available_models()
>>> HuggingFaceTB/SmolLM2-135M-Instruct
>>> HuggingFaceTB/SmolLM2-360M-Instruct
>>> HuggingFaceTB/SmolLM2-1.7B-Instruct
>>> google/gemma-2-2b-it
>>> state-spaces/mamba-2.8b-hf
>>> internlm/internlm2-chat-1_8b
>>> microsoft/Phi-4-mini-instruct
>>> mistralai/Mistral-7B-Instruct-v0.2
>>> tiiuae/falcon-mamba-7b-instruct
>>> Qwen/Qwen2.5-7B-Instruct
>>> CohereForAI/aya-expanse-8b
>>> google/gemma-2-9b-it
>>> meta-llama/Meta-Llama-3-8B-Instruct
>>> microsoft/phi-4
>>> CohereForAI/aya-expanse-32b
>>> Qwen/QwQ-32B
>>> CohereForAI/c4ai-command-r-08-2024
```
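
For quick experiments on modest hardware, one of the smaller tested models can be substituted (illustrative choice):

```python
# Smaller tested model: faster and lighter, likely at some cost in
# correlation with human judgment.
ppluie.init(model="HuggingFaceTB/SmolLM2-360M-Instruct", device="cpu")
```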

Change the prompting template
```python
ppluie.setTemplate("DIRECT")
```

Show how the prompt is encoded, to verify that the correct number of special tokens is removed and that the words "Yes" and "No" each fit into a single token
```python
ppluie.check_end_tokens_tmpl()
```
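
To see why this check matters, here is a standalone sketch, independent of ParaPLUIE's API, of verifying the single-token property with a Hugging Face tokenizer (whether "Yes" and "No" map to one token depends on the tokenizer):

```python
# Standalone check, not part of the ParaPLUIE API: verify that "Yes" and
# "No" each encode to a single token for a given tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
for word in ("Yes", "No"):
    ids = tokenizer.encode(word, add_special_tokens=False)
    status = "single token" if len(ids) == 1 else "multiple tokens"
    print(f"{word}: {ids} -> {status}")
```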

## Limitations and Bias
This metric relies on an underlying LLM, so its quality, biases, and supported languages are bounded by those of the chosen model.

## Source code
[GitLab](https://gitlab.inria.fr/expression/paraphrase-generation-evaluation-powered-by-an-llm-a-semantic-metric-not-a-lexical-one-coling-2025)


## Citation
```bibtex
@inproceedings{lemesle-etal-2025-paraphrase,
    title = "Paraphrase Generation Evaluation Powered by an {LLM}: A Semantic Metric, Not a Lexical One",
    author = "Lemesle, Quentin  and
      Chevelu, Jonathan  and
      Martin, Philippe  and
      Lolive, Damien  and
      Delhay, Arnaud  and
      Barbot, Nelly",
    booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
    year = "2025",
    url = "https://aclanthology.org/2025.coling-main.538/"
}
```