qlemesle committed
Commit 22a146d · 1 Parent(s): a0da963

update doc

Files changed (2)
  1. README.md +56 -37
  2. parapluie.py +10 -5
README.md CHANGED
@@ -16,66 +16,85 @@ pinned: false

 # Metric Card for ParaPLUIE (Paraphrase Generation Evaluation Powered by an LLM)

- W.I.P
-
 ## Metric Description
 ParaPLUIE is a metric for evaluating the semantic proximity of two sentences.
 ParaPLUIE uses the perplexity of an LLM to compute a confidence score.
 It has shown the highest correlation with human judgement on paraphrase classification while keeping the computational cost low: scoring a pair costs roughly as much as generating a single token.

 ## How to Use
- # TODO
 This metric requires a source sentence and its hypothetical paraphrase.

 ```python
- >>> accuracy_metric = evaluate.load("accuracy")
- >>> results = accuracy_metric.compute(references=[0, 1], predictions=[0, 1])
- >>> print(results)
- {'accuracy': 1.0}
 ```

 ### Inputs
- # TODO
- - **predictions** (`list` of `int`): Predicted labels.
- - **references** (`list` of `int`): Ground truth labels.
- - **normalize** (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
- - **sample_weight** (`list` of `float`): Sample weights Defaults to None.

 ### Output Values
- # TODO
- - **accuracy**(`float` or `int`): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input, if `normalize` is set to `True`. A higher score means higher accuracy.
- Output Example(s):
 ```python
- {'accuracy': 1.0}
 ```
- This metric outputs a dictionary, containing the accuracy score.

- #### Values from Papers
- # TODO
- ParaPLUIE has been compared to other state of art metrics in: [ImageNet](https://paperswithcode.com/sota/image-classification-on-imagenet) and showed a high correlation with humand judgement while beeing less computing intensive than LLM as a judge methods.

- ### Examples
- # TODO
- Example 1-A simple example
 ```python
- >>> accuracy_metric = evaluate.load("accuracy")
- >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])
- >>> print(results)
- {'accuracy': 0.5}
 ```
- Example 2-The same as Example 1, except with `normalize` set to `False`.
 ```python
- >>> accuracy_metric = evaluate.load("accuracy")
- >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], normalize=False)
- >>> print(results)
- {'accuracy': 3.0}
 ```
- Example 3-The same as Example 1, except with `sample_weight` set.
 ```python
- >>> accuracy_metric = evaluate.load("accuracy")
- >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], sample_weight=[0.5, 2, 0.7, 0.5, 9, 0.4])
- >>> print(results)
- {'accuracy': 0.8778625954198473}
 ```

 ## Limitations and Bias

 # Metric Card for ParaPLUIE (Paraphrase Generation Evaluation Powered by an LLM)

 ## Metric Description
 ParaPLUIE is a metric for evaluating the semantic proximity of two sentences.
 ParaPLUIE uses the perplexity of an LLM to compute a confidence score.
 It has shown the highest correlation with human judgement on paraphrase classification while keeping the computational cost low: scoring a pair costs roughly as much as generating a single token.

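To make the mechanism concrete, here is a minimal sketch of the underlying idea, not the package's actual implementation: the LLM is prompted to judge whether the two sentences are paraphrases, and the score compares the log-probabilities of "Yes" and "No" as the next token, which is why scoring costs about one forward pass. The prompt wording and function name below are illustrative assumptions.

```python
# Illustrative sketch only; not ParaPLUIE's actual code.
# Assumes a Hugging Face causal LM; the prompt wording is a made-up example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name)

def paraphrase_score_sketch(source: str, hypothesis: str) -> float:
    prompt = (
        f'Sentence 1: "{source}"\n'
        f'Sentence 2: "{hypothesis}"\n'
        "Is sentence 2 a paraphrase of sentence 1? Answer Yes or No: "
    )
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits[0, -1]  # next-token distribution
    logprobs = torch.log_softmax(logits, dim=-1)
    yes_id = tok.encode("Yes", add_special_tokens=False)[0]
    no_id = tok.encode("No", add_special_tokens=False)[0]
    # A single forward pass, roughly the cost of generating one token;
    # positive favours "Yes" (paraphrase), negative favours "No".
    return (logprobs[yes_id] - logprobs[no_id]).item()
```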
 ## How to Use
+
 This metric requires a source sentence and its hypothetical paraphrase.

 ```python
+ import evaluate
+ ppluie = evaluate.load("qlemesle/parapluie")
+ ppluie.init(model="mistralai/Mistral-7B-Instruct-v0.2")
+
+ S = "Have you ever seen a tsunami ?"
+ H = "Have you ever seen a tiramisu ?"
+
+ results = ppluie.compute(sources=[S], hypotheses=[H])
+ print(results)
+ # {'scores': [-16.97607421875]}
 ```

 ### Inputs
+
+ - **sources** (`list` of `string`): Source sentences.
+ - **hypotheses** (`list` of `string`): Hypothetical paraphrases.

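Both arguments are parallel lists, so several pairs can be scored in one call. A short sketch continuing the setup from "How to Use" above; the second sentence pair is invented for illustration:

```python
sources = ["Have you ever seen a tsunami ?", "The cat sat on the mat."]
hypotheses = ["Have you ever seen a tiramisu ?", "A cat was sitting on the mat."]
results = ppluie.compute(sources=sources, hypotheses=hypotheses)
# results["scores"] holds one float per (source, hypothesis) pair
```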
 
 ### Output Values
+
+ - **score** (`float`): ParaPLUIE score for each (source, hypothesis) pair, returned in a list under the `scores` key. Minimum possible value is -inf; maximum possible value is +inf. A score greater than 0 means the sentences are paraphrases; a score lower than 0 means the opposite.
+
+ This metric outputs a dictionary containing the list of scores.
+
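Because the sign of the score carries the decision, a binary paraphrase label can be derived by thresholding at 0. This is an illustrative pattern, not part of the metric's API, and it continues the example above:

```python
results = ppluie.compute(sources=[S], hypotheses=[H])
is_paraphrase = [s > 0 for s in results["scores"]]
print(is_paraphrase)  # [False]; the tsunami/tiramisu pair is not a paraphrase
```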
+ ### Examples
+
+ A simple example:
 ```python
+ import evaluate
+ ppluie = evaluate.load("qlemesle/parapluie")
+ ppluie.init(model="mistralai/Mistral-7B-Instruct-v0.2")
+
+ S = "Have you ever seen a tsunami ?"
+ H = "Have you ever seen a tiramisu ?"
+
+ results = ppluie.compute(sources=[S], hypotheses=[H])
+ print(results)
+ # {'scores': [-16.97607421875]}
 ```


+ Configure the metric:
+ ```python
+ ppluie.init(
+     model="mistralai/Mistral-7B-Instruct-v0.2",
+     device="cuda:0",
+     template="FS-DIRECT",
+     use_chat_template=True,
+     half_mode=True,
+     n_right_specials_tokens=1
+ )
+ ```

+ Show the available prompting templates:
 ```python
+ ppluie.show_templates()
 ```
+
+ Show how the prompt is encoded, to check that the correct number of special tokens is removed and that the "Yes" / "No" words each fit in a single token:
 ```python
+ ppluie.check_end_tokens_tmpl()
 ```
+
+ Show the LLMs already tested with ParaPLUIE:
+ ```python
+ ppluie.show_available_models()
+ ```
+
+ Change the prompting template:
 ```python
+ ppluie.setTemplate("DIRECT")
 ```

 ## Limitations and Bias
parapluie.py CHANGED
@@ -59,11 +59,16 @@ Args:
 Returns:
 score (`float`): ParaPLUIE score for each (source, hypothesis) pair, returned in a list under the `scores` key. Minimum possible value is -inf; maximum possible value is +inf. A score greater than 0 means the sentences are paraphrases; a score lower than 0 means the opposite.
 Examples:
- Example 1-A simple example
- >>> accuracy_metric = evaluate.load("accuracy")
- >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])
- >>> print(results)
- {'accuracy': 0.5}
 """

 Returns:
 score (`float`): ParaPLUIE score for each (source, hypothesis) pair, returned in a list under the `scores` key. Minimum possible value is -inf; maximum possible value is +inf. A score greater than 0 means the sentences are paraphrases; a score lower than 0 means the opposite.
 Examples:
+ >>> import evaluate
+ >>> ppluie = evaluate.load("qlemesle/parapluie")
+ >>> ppluie.init(model="mistralai/Mistral-7B-Instruct-v0.2")
+ >>> S = "Have you ever seen a tsunami ?"
+ >>> H = "Have you ever seen a tiramisu ?"
+ >>> results = ppluie.compute(sources=[S], hypotheses=[H])
+ >>> print(results)
+ {'scores': [-16.97607421875]}
 """