update doc

- README.md +56 -37
- parapluie.py +10 -5

README.md CHANGED
@@ -16,66 +16,85 @@ pinned: false

Before:

# Metric Card for ParaPLUIE (Paraphrase Generation Evaluation Powered by an LLM)

W.I.P

## Metric Description
ParaPLUIE is a metric for evaluating the semantic proximity of two sentences.
ParaPLUIE uses the perplexity of an LLM to compute a confidence score.
It has shown the highest correlation with human judgement on paraphrase classification while keeping the computational cost low, roughly equal to the cost of generating one token.

## How to Use

This metric requires a source sentence and its hypothetical paraphrase.

```python
```

### Inputs

- **
- **
- **normalize** (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
- **sample_weight** (`list` of `float`): Sample weights. Defaults to None.

### Output Values

- **

```python
```

This metric outputs a dictionary, containing the accuracy score.

# TODO
Example 1: A simple example
```python
>>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])
>>> print(results)
{'accuracy': 0.5}
```

```python
>>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], normalize=False)
>>> print(results)
{'accuracy': 3.0}
```

```python
>>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], sample_weight=[0.5, 2, 0.7, 0.5, 9, 0.4])
>>> print(results)
{'accuracy': 0.8778625954198473}
```

## Limitations and Bias

After:

# Metric Card for ParaPLUIE (Paraphrase Generation Evaluation Powered by an LLM)

## Metric Description
ParaPLUIE is a metric for evaluating the semantic proximity of two sentences.
ParaPLUIE uses the perplexity of an LLM to compute a confidence score.
It has shown the highest correlation with human judgement on paraphrase classification while keeping the computational cost low, roughly equal to the cost of generating one token.
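
For intuition, the idea can be sketched with a generic Hugging Face causal LM: ask the model whether H is a paraphrase of S and compare the log-probabilities of the "Yes" and "No" answer tokens, which costs a single forward pass (roughly one generated token). This is only an illustration of the principle; the prompt wording and variable names below are assumptions, not ParaPLUIE's actual internals.

```python
# Illustrative sketch, not ParaPLUIE's implementation: score a sentence pair
# by the log-odds of "Yes" vs "No" as the model's next token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # any causal LM would do here
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

S = "Have you ever seen a tsunami ?"
H = "Have you ever seen a tiramisu ?"
prompt = f'Is "{H}" a paraphrase of "{S}"? Answer Yes or No. Answer:'

inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits after the prompt
logprobs = torch.log_softmax(logits, dim=-1)
yes_id = tok.encode("Yes", add_special_tokens=False)[0]
no_id = tok.encode("No", add_special_tokens=False)[0]
score = (logprobs[yes_id] - logprobs[no_id]).item()  # > 0 leans "paraphrase"
```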

## How to Use

This metric requires a source sentence and its hypothetical paraphrase.

```python
import evaluate
ppluie = evaluate.load("qlemesle/parapluie")
ppluie.init(model="mistralai/Mistral-7B-Instruct-v0.2")

S = "Have you ever seen a tsunami ?"
H = "Have you ever seen a tiramisu ?"

results = ppluie.compute(sources=[S], hypotheses=[H])
print(results)
>>> {'scores': [-16.97607421875]}
```

Here the score is negative because the hypothesis changes the meaning, so the pair is not a paraphrase.

### Inputs

- **sources** (`list` of `string`): Source sentences.
- **hypotheses** (`list` of `string`): Hypothetical paraphrases.

### Output Values

- **score** (`float`): ParaPLUIE score. Minimum possible value is -inf. Maximum possible value is +inf. A score greater than 0 means that the sentences are paraphrases. A score lower than 0 means the opposite.

This metric outputs a dictionary containing the score.
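
Because only the sign matters for classification, the returned scores can simply be thresholded at 0. A small batch-usage sketch built on the same `compute` call as above (the extra sentence pair is made up for illustration):

```python
# Score several pairs at once and classify each by the sign of its score.
sources = [
    "Have you ever seen a tsunami ?",
    "Have you ever seen a tsunami ?",
]
hypotheses = [
    "Did you ever witness a tsunami ?",  # hypothetical paraphrase, for illustration
    "Have you ever seen a tiramisu ?",
]

results = ppluie.compute(sources=sources, hypotheses=hypotheses)
is_paraphrase = [s > 0 for s in results["scores"]]  # score > 0 => paraphrase
```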

### Examples

A simple example:
```python
import evaluate
ppluie = evaluate.load("qlemesle/parapluie")
ppluie.init(model="mistralai/Mistral-7B-Instruct-v0.2")

S = "Have you ever seen a tsunami ?"
H = "Have you ever seen a tiramisu ?"

results = ppluie.compute(sources=[S], hypotheses=[H])
print(results)
>>> {'scores': [-16.97607421875]}
```

Configure the metric:
```python
ppluie.init(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # HF id of the scoring LLM
    device="cuda:0",            # device on which the model is loaded
    template="FS-DIRECT",       # prompting template (see ppluie.show_templates())
    use_chat_template=True,     # wrap the prompt in the model's chat template
    half_mode=True,             # presumably loads the model in half precision
    n_right_specials_tokens=1,  # special tokens at the right end of the encoded
                                # prompt (cf. check_end_tokens_tmpl below)
)
```

Show the available prompting templates:
```python
ppluie.show_templates()
```

Show how the prompt is encoded, to check that the correct number of special tokens is removed and that the "Yes" / "No" words each fit in a single token:
```python
ppluie.check_end_tokens_tmpl()
```

Show the LLMs already tested with ParaPLUIE:
```python
ppluie.show_available_models()
```

Change the prompting template:
```python
ppluie.setTemplate("DIRECT")
```
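
After switching templates, later calls presumably use the new prompt; for example, re-scoring the quickstart pair (an illustrative follow-up, reusing `S` and `H` from above):

```python
# Re-score the same pair with the newly selected "DIRECT" template.
ppluie.setTemplate("DIRECT")
results = ppluie.compute(sources=[S], hypotheses=[H])
print(results["scores"])
```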

## Limitations and Bias

parapluie.py CHANGED
@@ -59,11 +59,16 @@ Args:

Before:

Returns:
    score (`float`): ParaPLUIE score. Minimum possible value is -inf. Maximum possible value is +inf. A score greater than 0 means that the sentences are paraphrases. A score lower than 0 means the opposite.
Examples:

"""

After:

Returns:
    score (`float`): ParaPLUIE score. Minimum possible value is -inf. Maximum possible value is +inf. A score greater than 0 means that the sentences are paraphrases. A score lower than 0 means the opposite.
Examples:
    import evaluate
    ppluie = evaluate.load("qlemesle/parapluie")
    ppluie.init(model="mistralai/Mistral-7B-Instruct-v0.2")

    S = "Have you ever seen a tsunami ?"
    H = "Have you ever seen a tiramisu ?"

    results = ppluie.compute(sources=[S], hypotheses=[H])
    print(results)
    >>> {'scores': [-16.97607421875]}
"""