---
title: ParaPLUIE
emoji: ☂️
tags:
- evaluate
- metric
description: >-
  ParaPLUIE is a metric for evaluating the semantic proximity between two
  sentences. ParaPLUIE uses the perplexity of an LLM to compute a confidence
  score. It has shown the highest correlation with human judgment on paraphrase
  classification while maintaining a low computational cost, as it is roughly
  equivalent to the cost of generating a single token.
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
short_description: ParaPLUIE is a metric for evaluating the semantic proximity
---
# Metric Card for ParaPLUIE (Paraphrase Generation Evaluation Powered by an LLM)
## Metric Description
ParaPLUIE is a metric for evaluating the semantic proximity between two sentences.
ParaPLUIE uses the perplexity of an LLM to compute a confidence score.
It has shown the highest correlation with human judgment on paraphrase classification while maintaining a low computational cost, as it is roughly equivalent to the cost of generating a single token.
## How to Use
This metric requires a source sentence and its hypothetical paraphrase.
```python
import evaluate
ppluie = evaluate.load("qlemesle/parapluie")
ppluie.init(model="mistralai/Mistral-7B-Instruct-v0.2")
S = "Have you ever seen a tsunami ?"
H = "Have you ever seen a tiramisu ?"
results = ppluie.compute(sources=[S], hypotheses=[H])
print(results)
>>> {'scores': [-16.97607421875]}
```
### Inputs
- **sources** (`list` of `string`): Source sentences.
- **hypotheses** (`list` of `string`): Hypothetical paraphrases.
### Output Values
- **scores** (`list` of `float`): ParaPLUIE scores, one per sentence pair. The minimum possible value is -inf and the maximum possible value is +inf. A score greater than 0 means that the sentences are paraphrases; a score lower than 0 indicates the opposite.
This metric outputs a dictionary containing the list of scores.
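Since the sign of the score carries the paraphrase decision, scores can be turned into binary labels by thresholding at 0. A minimal sketch (the positive score here is hypothetical, added only to illustrate both outcomes):

```python
def is_paraphrase(score: float) -> bool:
    # ParaPLUIE convention: scores above 0 indicate paraphrases,
    # scores below 0 indicate non-paraphrases.
    return score > 0

# Example output shape from ppluie.compute(); 3.2 is a hypothetical positive score.
results = {"scores": [-16.97607421875, 3.2]}
decisions = [is_paraphrase(s) for s in results["scores"]]
print(decisions)  # [False, True]
```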
### Examples
Simple example
```python
import evaluate
ppluie = evaluate.load("qlemesle/parapluie")
ppluie.init(model="mistralai/Mistral-7B-Instruct-v0.2")
S = "Have you ever seen a tsunami ?"
H = "Have you ever seen a tiramisu ?"
results = ppluie.compute(sources=[S], hypotheses=[H])
print(results)
>>> {'scores': [-16.97607421875]}
```
Configure the metric
```python
ppluie.init(
model = "mistralai/Mistral-7B-Instruct-v0.2",
device = "cuda:0",
template = "FS-DIRECT",
use_chat_template = True,
half_mode = True,
n_right_specials_tokens = 1
)
```
Show the available prompting templates
```python
ppluie.show_templates()
>>> DIRECT
>>> MEANING
>>> INDIRECT
>>> FS-DIRECT
>>> FS-DIRECT_MAJ
>>> FS-DIRECT_FR
>>> FS-DIRECT_MAJ_FR
>>> FS-DIRECT_FR_MIN
>>> NETWORK
```
Show the LLMs that have already been tested with ParaPLUIE
```python
ppluie.show_available_models()
>>> HuggingFaceTB/SmolLM2-135M-Instruct
>>> HuggingFaceTB/SmolLM2-360M-Instruct
>>> HuggingFaceTB/SmolLM2-1.7B-Instruct
>>> google/gemma-2-2b-it
>>> state-spaces/mamba-2.8b-hf
>>> internlm/internlm2-chat-1_8b
>>> microsoft/Phi-4-mini-instruct
>>> mistralai/Mistral-7B-Instruct-v0.2
>>> tiiuae/falcon-mamba-7b-instruct
>>> Qwen/Qwen2.5-7B-Instruct
>>> CohereForAI/aya-expanse-8b
>>> google/gemma-2-9b-it
>>> meta-llama/Meta-Llama-3-8B-Instruct
>>> microsoft/phi-4
>>> CohereForAI/aya-expanse-32b
>>> Qwen/QwQ-32B
>>> CohereForAI/c4ai-command-r-08-2024
```
Change the prompting template
```python
ppluie.setTemplate("DIRECT")
```
Show how the prompt is encoded, to verify that the correct number of special tokens is removed and that the words "Yes" and "No" each fit into a single token
```python
ppluie.check_end_tokens_tmpl()
```
## Limitations and Bias
This metric is based on an LLM and therefore inherits the limitations and biases of the underlying model.
## Source code
[GitLab](https://gitlab.inria.fr/expression/paraphrase-generation-evaluation-powered-by-an-llm-a-semantic-metric-not-a-lexical-one-coling-2025)
## Citation
```bibtex
@inproceedings{lemesle-etal-2025-paraphrase,
title = "Paraphrase Generation Evaluation Powered by an {LLM}: A Semantic Metric, Not a Lexical One",
author = "Lemesle, Quentin and
Chevelu, Jonathan and
Martin, Philippe and
Lolive, Damien and
Delhay, Arnaud and
Barbot, Nelly",
booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
year = "2025",
url = "https://aclanthology.org/2025.coling-main.538/"
}
``` |