Rodrigo Ferreira Rodrigues committed on
Commit
cdc442b
·
1 Parent(s): ca19a65

Updating doc

Files changed (3)
  1. README.md +38 -16
  2. place_gen_evaluate.py +26 -22
  3. tests.py +3 -13
README.md CHANGED
@@ -14,37 +14,59 @@ pinned: false
14
 
15
  # Metric Card for Place_gen_evaluate
16
 
17
- ***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
18
-
19
  ## Metric Description
20
- *Give a brief overview of this metric, including what task(s) it is usually used for, if any.*
21
 
22
  ## How to Use
23
- *Give general statement of how to use the metric*
24
 
25
- *Provide simplest possible example for using the metric*
26
 
27
- ### Inputs
28
- *List all input arguments in the format below*
29
- - **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*
30
 
31
  ### Output Values
32
 
33
- *Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*
34
 
35
- *State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*
36
 
37
  #### Values from Popular Papers
38
- *Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
39
 
40
  ### Examples
41
- *Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
42
 
43
  ## Limitations and Bias
44
- *Note any known limitations or biases that the metric has, with links and references if possible.*
45
 
46
  ## Citation
47
- *Cite the source where this metric was introduced.*
48
 
49
- ## Further References
50
- *Add any useful further references.*
 
14
 
15
  # Metric Card for Place_gen_evaluate
16
 
 
 
17
  ## Metric Description
18
+ This metric evaluates geographic place prediction tasks performed by LMs. For each question, the model is expected to generate a list of places, and the gold answer must likewise be a list of place names.
19
 
20
  ## How to Use
 
21
 
22
+ This metric takes two mandatory arguments: `generations` (a list of strings) and `golds` (a list of lists of strings containing place names).
23
+
24
+ ```python
25
+ import evaluate
26
+ place_pred_eval = evaluate.load("rfr2003/place_gen_evaluate")
27
+ results = place_pred_eval.compute(generations=['[Hotel New Home, Hopeland]'], golds=[['Bar Guisness', 'Hotel New Home', 'New Hopeland']])
28
+ print(results)
29
+ {'bert_score_precision': 0.8470218181610107, 'bert_score_recall': 0.9131535291671753, 'bert_score_f1': 0.8788453936576843, 'bleu-1': 0.5714285714285714, 'precision': [6.0], 'recall': [15.0], 'macro-mean': [10.5], 'median macro-mean': 10.5}
30
+ ```
31
+
32
+ This metric accepts one optional argument:
33
 
34
+ `d`: function used to compute the distance between a generated value and a gold one. The default is the `distance` function from the [Levenshtein library](https://github.com/rapidfuzz/python-Levenshtein).
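+ For instance, a custom distance can be supplied through `d`. The sketch below is hypothetical (the name `loose_match_distance` is illustrative, not part of the module):
+
+ ```python
+ # Hypothetical custom distance for the optional `d` argument:
+ # ignore case and surrounding whitespace, then score 0 on an exact
+ # match and 1 otherwise.
+ def loose_match_distance(generated, gold):
+     return 0 if generated.strip().lower() == gold.strip().lower() else 1
+
+ print(loose_match_distance(' Hotel New Home', 'hotel new home'))  # 0
+ ```
+
+ It would then be passed alongside the mandatory arguments, e.g. `place_pred_eval.compute(generations=..., golds=..., d=loose_match_distance)`.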
 
 
35
 
36
  ### Output Values
37
 
38
+ This metric outputs a dictionary with the following values:
39
+
40
+ `bert_score_precision`: Average of the BERTScore precision values computed by the [bertscore module](https://github.com/huggingface/evaluate/blob/main/metrics/bertscore/README.md).
41
+
42
+ `bert_score_recall`: Average of the BERTScore recall values computed by the [bertscore module](https://github.com/huggingface/evaluate/blob/main/metrics/bertscore/README.md).
43
+
44
+ `bert_score_f1`: Average of the BERTScore F1 values computed by the [bertscore module](https://github.com/huggingface/evaluate/blob/main/metrics/bertscore/README.md).
45
 
46
+ `bleu-1`: BLEU-1 score computed by the [bleu module](https://github.com/huggingface/evaluate/blob/main/metrics/bleu/README.md).
47
+
48
+ `precision`: Sum of the minimum distances between each predicted value and the set of gold values, computed for each question.
49
+
50
+ `recall`: Sum of the minimum distances between each gold value and the set of generated values, computed for each question.
51
+
52
+ `macro-mean`: Average of precision and recall, computed for each question.
53
+
54
+ `median macro-mean`: Median across macro-mean values.
55
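+ The distance-based scores above can be sketched as follows. This is a minimal stand-alone illustration, not the module itself; a small pure-Python Levenshtein function stands in for the `python-Levenshtein` dependency, and `pre_rec` is a hypothetical helper name:
+
+ ```python
+ # Minimal edit-distance implementation standing in for
+ # Levenshtein.distance from python-Levenshtein.
+ def levenshtein(a, b):
+     prev = list(range(len(b) + 1))
+     for i, ca in enumerate(a, 1):
+         cur = [i]
+         for j, cb in enumerate(b, 1):
+             cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
+         prev = cur
+     return prev[-1]
+
+ def pre_rec(generated, gold, d=levenshtein):
+     # dists[i][j] = distance between gold[i] and generated[j]
+     dists = [[d(g, a) for a in generated] for g in gold]
+     precision = sum(min(col) for col in zip(*dists))  # best gold per prediction
+     recall = sum(min(row) for row in dists)           # best prediction per gold
+     return precision, recall
+
+ p, r = pre_rec(['paris', 'lyon'], ['paris', 'london', 'lyon'])
+ print(p, r, (p + r) / 2)  # 0 3 1.5
+ ```
+
+ By construction, lower values are better: a score of 0 means every prediction exactly matches a gold answer (precision) and every gold answer is matched by a prediction (recall).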
 
56
  #### Values from Popular Papers
 
57
 
58
  ### Examples
59
+
60
+ ```python
61
+ import evaluate
62
+ place_pred_eval = evaluate.load("rfr2003/place_gen_evaluate")
63
+ results = place_pred_eval.compute(generations=['[Hotel New Home, Hopeland]'], golds=[['Bar Guisness', 'Hotel New Home', 'New Hopeland']])
64
+ print(results)
65
+ {'bert_score_precision': 0.8470218181610107, 'bert_score_recall': 0.9131535291671753, 'bert_score_f1': 0.8788453936576843, 'bleu-1': 0.5714285714285714, 'precision': [6.0], 'recall': [15.0], 'macro-mean': [10.5], 'median macro-mean': 10.5}
66
+ ```
67
 
68
  ## Limitations and Bias
 
69
 
70
  ## Citation
 
71
 
72
+ ## Further References
 
place_gen_evaluate.py CHANGED
@@ -32,29 +32,35 @@ year={2020}
32
 
33
  # TODO: Add description of the module here
34
  _DESCRIPTION = """\
35
- This new module is designed to solve this great ML task and is crafted with a lot of care.
36
  """
37
 
38
 
39
  # TODO: Add description of the arguments of the module here
40
  _KWARGS_DESCRIPTION = """
41
- Calculates how good are predictions given some references, using certain scores
42
  Args:
43
- predictions: list of predictions to score. Each predictions
44
- should be a string with tokens separated by spaces.
45
- references: list of reference for each prediction. Each
46
- reference should be a string with tokens separated by spaces.
 
47
  Returns:
48
- accuracy: description of the first score,
49
- another_score: description of the second score,
 
 
 
 
 
 
50
  Examples:
51
- Examples should be written in doctest format, and should illustrate how
52
- to use the function.
53
 
54
- >>> my_new_module = evaluate.load("my_new_module")
55
- >>> results = my_new_module.compute(references=[0, 1], predictions=[0, 1])
56
  >>> print(results)
57
- {'accuracy': 1.0}
58
  """
59
 
60
 
@@ -95,15 +101,13 @@ class Place_gen_evaluate(evaluate.Metric):
95
  dists.append(g_dist)
96
 
97
  dists = np.array(dists)
98
- precision = np.min(dists, axis=0).sum()
99
 
100
- recall = np.min(dists, axis=1).sum()
101
 
102
  return precision, recall
103
 
104
- def _compute(self, generations, golds):
105
- '''Calculate Accuracy and BLEU-1 scores between model generations and golden answers.
106
- We expect a set of generated answers and want to find it among a set of gold answers.'''
107
  assert len(generations) == len(golds)
108
  assert isinstance(golds, list)
109
 
@@ -128,7 +132,7 @@ class Place_gen_evaluate(evaluate.Metric):
128
  f_ans = list(set([str(a).lower().strip() for a in f_ans])) #get rid of duples values
129
  f_gold = list(set(gold))
130
 
131
- precision, recall = self._calculate_pre_rec(f_ans, f_gold, Levenshtein.distance)
132
 
133
  precisions.append(precision)
134
  recalls.append(recall)
@@ -140,9 +144,9 @@ class Place_gen_evaluate(evaluate.Metric):
140
  metrics = {f"bert_score_{k}":np.mean(v).item() for k,v in self.bert_score.compute(predictions=predictions, references=references, lang="en").items() if k in ['recall', 'precision', 'f1']}
141
  metrics.update({
142
  'bleu-1': self.bleu.compute(predictions=predictions, references=references, max_order=1)['bleu'],
143
- 'precision': np.mean(precisions).item(),
144
- 'rappel': np.mean(recalls).item(),
145
- 'macro-mean': np.mean(means_pre_rec).item(),
146
  'median macro-mean': median(means_pre_rec)
147
  })
148
 
 
32
 
33
  # TODO: Add description of the module here
34
  _DESCRIPTION = """\
35
+ This metric evaluates geographic place prediction tasks performed by LMs.
36
  """
37
 
38
 
39
  # TODO: Add description of the arguments of the module here
40
  _KWARGS_DESCRIPTION = """
41
+ Calculates precision, recall and macro-mean between generations and gold answers in a place-prediction context.
42
  Args:
43
+ generations: list of predictions to score. Each prediction
44
+ should be a string generated by a language model.
45
+ golds: list of references for each prediction. Each
46
+ reference should be a list of strings corresponding to names of geographic places.
47
+ d: function used to compute the distance between a generated value and a gold one.
48
  Returns:
49
+ bert_score_precision: Average of the BERTScore precision values.
50
+ bert_score_recall: Average of the BERTScore recall values.
51
+ bert_score_f1: Average of the BERTScore F1 values.
52
+ bleu-1: BLEU-1 score across all questions.
53
+ precision: Sum of the minimum distances between each predicted value and the set of gold values, computed for each question.
54
+ recall: Sum of the minimum distances between each gold value and the set of generated values, computed for each question.
55
+ macro-mean: Average of precision and recall, computed for each question.
56
+ median macro-mean: Median across macro-mean values.
57
  Examples:
58
+ Here is an example of how to use the metric:
 
59
 
60
+ >>> metric = evaluate.load("rfr2003/place_gen_evaluate")
61
+ >>> results = metric.compute(generations=['[Hotel New Home, Hopeland]'], golds=[['Bar Guisness', 'Hotel New Home', 'New Hopeland']])
62
  >>> print(results)
63
+ {'bert_score_precision': 0.8470218181610107, 'bert_score_recall': 0.9131535291671753, 'bert_score_f1': 0.8788453936576843, 'bleu-1': 0.5714285714285714, 'precision': [6.0], 'recall': [15.0], 'macro-mean': [10.5], 'median macro-mean': 10.5}
64
  """
65
 
66
 
 
101
  dists.append(g_dist)
102
 
103
  dists = np.array(dists)
104
+ precision = np.min(dists, axis=0).sum().item()
105
 
106
+ recall = np.min(dists, axis=1).sum().item()
107
 
108
  return precision, recall
109
 
110
+ def _compute(self, generations, golds, d=Levenshtein.distance):
 
 
111
  assert len(generations) == len(golds)
112
  assert isinstance(golds, list)
113
 
 
132
  f_ans = list(set([str(a).lower().strip() for a in f_ans])) #get rid of duples values
133
  f_gold = list(set(gold))
134
 
135
+ precision, recall = self._calculate_pre_rec(f_ans, f_gold, d)
136
 
137
  precisions.append(precision)
138
  recalls.append(recall)
 
144
  metrics = {f"bert_score_{k}":np.mean(v).item() for k,v in self.bert_score.compute(predictions=predictions, references=references, lang="en").items() if k in ['recall', 'precision', 'f1']}
145
  metrics.update({
146
  'bleu-1': self.bleu.compute(predictions=predictions, references=references, max_order=1)['bleu'],
147
+ 'precision': precisions,
148
+ 'recall': recalls,
149
+ 'macro-mean': means_pre_rec,
150
  'median macro-mean': median(means_pre_rec)
151
  })
152
 
tests.py CHANGED
@@ -1,17 +1,7 @@
1
  test_cases = [
2
  {
3
- "predictions": [0, 0],
4
- "references": [1, 1],
5
- "result": {"metric_score": 0}
6
  },
7
- {
8
- "predictions": [1, 1],
9
- "references": [1, 1],
10
- "result": {"metric_score": 1}
11
- },
12
- {
13
- "predictions": [1, 0],
14
- "references": [1, 1],
15
- "result": {"metric_score": 0.5}
16
- }
17
  ]
 
1
  test_cases = [
2
  {
3
+ "predictions": ['[Hotel New Home, Hopeland]'],
4
+ "references": [['Bar Guisness', 'Hotel New Home', 'New Hopeland']],
5
+ "result": {'bert_score_precision': 0.8470218181610107, 'bert_score_recall': 0.9131535291671753, 'bert_score_f1': 0.8788453936576843, 'bleu-1': 0.5714285714285714, 'precision': [6.0], 'recall': [15.0], 'macro-mean': [10.5], 'median macro-mean': 10.5}
6
  },
 
7
  ]