update doc

- README.md +56 -37
- parapluie.py +10 -5

README.md CHANGED
@@ -16,66 +16,85 @@ pinned: false

Before:

# Metric Card for ParaPLUIE (Paraphrase Generation Evaluation Powered by an LLM)

W.I.P

## Metric Description
ParaPLUIE is a metric for evaluating the semantic proximity of two sentences.
ParaPLUIE uses the perplexity of an LLM to compute a confidence score.
It has shown the highest correlation with human judgement on paraphrase classification while keeping the computational cost low, roughly equal to the cost of generating one token.

## How to Use

This metric requires a source sentence and its hypothetical paraphrase.

```python
```

### Inputs

- **
- **
- **normalize** (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
- **sample_weight** (`list` of `float`): Sample weights. Defaults to None.

### Output Values

- **

```python
```

This metric outputs a dictionary, containing the accuracy score.

# TODO
Example 1: A simple example
```python
>>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])
>>> print(results)
{'accuracy': 0.5}
```

```python
>>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], normalize=False)
>>> print(results)
{'accuracy': 3.0}
```

```python
>>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], sample_weight=[0.5, 2, 0.7, 0.5, 9, 0.4])
>>> print(results)
{'accuracy': 0.8778625954198473}
```

## Limitations and Bias

After:

# Metric Card for ParaPLUIE (Paraphrase Generation Evaluation Powered by an LLM)

## Metric Description
ParaPLUIE is a metric for evaluating the semantic proximity of two sentences.
ParaPLUIE uses the perplexity of an LLM to compute a confidence score.
It has shown the highest correlation with human judgement on paraphrase classification while keeping the computational cost low, roughly equal to the cost of generating one token.
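
For intuition, the idea can be sketched with a generic Hugging Face causal LM: ask the model whether H is a paraphrase of S and compare the log-probabilities of the "Yes" and "No" answer tokens, which costs a single forward pass (roughly one generated token). This is only an illustration of the principle; the prompt wording and variable names below are assumptions, not ParaPLUIE's actual internals.

```python
# Illustrative sketch, not ParaPLUIE's implementation: score a sentence pair
# by the log-odds of "Yes" vs "No" as the model's next token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # any causal LM would do here
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

S = "Have you ever seen a tsunami ?"
H = "Have you ever seen a tiramisu ?"
prompt = f'Is "{H}" a paraphrase of "{S}"? Answer Yes or No. Answer:'

inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits after the prompt
logprobs = torch.log_softmax(logits, dim=-1)
yes_id = tok.encode("Yes", add_special_tokens=False)[0]
no_id = tok.encode("No", add_special_tokens=False)[0]
score = (logprobs[yes_id] - logprobs[no_id]).item()  # > 0 leans "paraphrase"
```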

## How to Use

This metric requires a source sentence and its hypothetical paraphrase.

```python
import evaluate
ppluie = evaluate.load("qlemesle/parapluie")
ppluie.init(model="mistralai/Mistral-7B-Instruct-v0.2")

S = "Have you ever seen a tsunami ?"
H = "Have you ever seen a tiramisu ?"

results = ppluie.compute(sources=[S], hypotheses=[H])
print(results)
>>> {'scores': [-16.97607421875]}
```

Here the score is negative because the hypothesis changes the meaning, so the pair is not a paraphrase.

### Inputs

- **sources** (`list` of `string`): Source sentences.
- **hypotheses** (`list` of `string`): Hypothetical paraphrases.

### Output Values

- **score** (`float`): ParaPLUIE score. Minimum possible value is -inf. Maximum possible value is +inf. A score greater than 0 means that the sentences are paraphrases. A score lower than 0 means the opposite.

This metric outputs a dictionary containing the score.
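
Because only the sign matters for classification, the returned scores can simply be thresholded at 0. A small batch-usage sketch built on the same `compute` call as above (the extra sentence pair is made up for illustration):

```python
# Score several pairs at once and classify each by the sign of its score.
sources = [
    "Have you ever seen a tsunami ?",
    "Have you ever seen a tsunami ?",
]
hypotheses = [
    "Did you ever witness a tsunami ?",  # hypothetical paraphrase, for illustration
    "Have you ever seen a tiramisu ?",
]

results = ppluie.compute(sources=sources, hypotheses=hypotheses)
is_paraphrase = [s > 0 for s in results["scores"]]  # score > 0 => paraphrase
```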

### Examples

A simple example:
```python
import evaluate
ppluie = evaluate.load("qlemesle/parapluie")
ppluie.init(model="mistralai/Mistral-7B-Instruct-v0.2")

S = "Have you ever seen a tsunami ?"
H = "Have you ever seen a tiramisu ?"

results = ppluie.compute(sources=[S], hypotheses=[H])
print(results)
>>> {'scores': [-16.97607421875]}
```

Configure the metric:
```python
ppluie.init(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # HF id of the scoring LLM
    device="cuda:0",            # device on which the model is loaded
    template="FS-DIRECT",       # prompting template (see ppluie.show_templates())
    use_chat_template=True,     # wrap the prompt in the model's chat template
    half_mode=True,             # presumably loads the model in half precision
    n_right_specials_tokens=1,  # special tokens at the right end of the encoded
                                # prompt (cf. check_end_tokens_tmpl below)
)
```

Show the available prompting templates:
```python
ppluie.show_templates()
```

Show how the prompt is encoded, to check that the correct number of special tokens is removed and that the "Yes" / "No" words each fit in a single token:
```python
ppluie.check_end_tokens_tmpl()
```

Show the LLMs already tested with ParaPLUIE:
```python
ppluie.show_available_models()
```

Change the prompting template:
```python
ppluie.setTemplate("DIRECT")
```
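
After switching templates, later calls presumably use the new prompt; for example, re-scoring the quickstart pair (an illustrative follow-up, reusing `S` and `H` from above):

```python
# Re-score the same pair with the newly selected "DIRECT" template.
ppluie.setTemplate("DIRECT")
results = ppluie.compute(sources=[S], hypotheses=[H])
print(results["scores"])
```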

## Limitations and Bias

parapluie.py CHANGED
@@ -59,11 +59,16 @@ Args:

Before:

Returns:
    score (`float`): ParaPLUIE score. Minimum possible value is -inf. Maximum possible value is +inf. A score greater than 0 means that the sentences are paraphrases. A score lower than 0 means the opposite.
Examples:

"""

After:

Returns:
    score (`float`): ParaPLUIE score. Minimum possible value is -inf. Maximum possible value is +inf. A score greater than 0 means that the sentences are paraphrases. A score lower than 0 means the opposite.
Examples:
    import evaluate
    ppluie = evaluate.load("qlemesle/parapluie")
    ppluie.init(model="mistralai/Mistral-7B-Instruct-v0.2")

    S = "Have you ever seen a tsunami ?"
    H = "Have you ever seen a tiramisu ?"

    results = ppluie.compute(sources=[S], hypotheses=[H])
    print(results)
    >>> {'scores': [-16.97607421875]}
"""