Rodrigo Ferreira Rodrigues committed on
Commit
cdc442b
·
1 Parent(s): ca19a65

Updating doc

Files changed (3)
  1. README.md +38 -16
  2. place_gen_evaluate.py +26 -22
  3. tests.py +3 -13
README.md CHANGED
@@ -14,37 +14,59 @@ pinned: false
14
 
15
  # Metric Card for Place_gen_evaluate
16
 
17
- ***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
18
-
19
  ## Metric Description
20
- *Give a brief overview of this metric, including what task(s) it is usually used for, if any.*
21
 
22
  ## How to Use
23
- *Give general statement of how to use the metric*
24
 
25
- *Provide simplest possible example for using the metric*
26
 
27
- ### Inputs
28
- *List all input arguments in the format below*
29
- - **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*
30
 
31
  ### Output Values
32
 
33
- *Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*
34
 
35
- *State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*
36
 
37
  #### Values from Popular Papers
38
- *Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
39
 
40
  ### Examples
41
- *Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
42
 
43
  ## Limitations and Bias
44
- *Note any known limitations or biases that the metric has, with links and references if possible.*
45
 
46
  ## Citation
47
- *Cite the source where this metric was introduced.*
48
 
49
- ## Further References
50
- *Add any useful further references.*
 
14
 
15
  # Metric Card for Place_gen_evaluate
16
 
 
 
17
  ## Metric Description
18
+ This metric evaluates geographic place prediction tasks performed by LMs. For each question, the model is expected to generate a list of places, and the gold answer must likewise be a list of place names.
19
 
20
  ## How to Use
 
21
 
22
+ This metric takes two mandatory arguments: `generations` (a list of strings) and `golds` (a list of lists of strings containing place names).
23
+
24
+ ```python
25
+ import evaluate
26
+ place_pred_eval = evaluate.load("rfr2003/place_gen_evaluate")
27
+ results = place_pred_eval.compute(generations=['[Hotel New Home, Hopeland]'], golds=[['Bar Guisness', 'Hotel New Home', 'New Hopeland']])
28
+ print(results)
29
+ {'bert_score_precision': 0.8470218181610107, 'bert_score_recall': 0.9131535291671753, 'bert_score_f1': 0.8788453936576843, 'bleu-1': 0.5714285714285714, 'precision': [6.0], 'recall': [15.0], 'macro-mean': [10.5], 'median macro-mean': 10.5}
30
+ ```
31
+
32
+ This metric accepts one optional argument:
33
 
34
+ `d`: function used to compute the distance between a generated value and a gold one. The default is the `distance` function from the [Levenshtein library](https://github.com/rapidfuzz/python-Levenshtein).
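+ For instance, a custom distance can be supplied through `d`. The sketch below is hypothetical (the name `loose_match_distance` is illustrative, not part of the module):
+
+ ```python
+ # Hypothetical custom distance for the optional `d` argument:
+ # ignore case and surrounding whitespace, then score 0 on an exact
+ # match and 1 otherwise.
+ def loose_match_distance(generated, gold):
+     return 0 if generated.strip().lower() == gold.strip().lower() else 1
+
+ print(loose_match_distance(' Hotel New Home', 'hotel new home'))  # 0
+ ```
+
+ It would then be passed alongside the mandatory arguments, e.g. `place_pred_eval.compute(generations=..., golds=..., d=loose_match_distance)`.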
 
 
35
 
36
  ### Output Values
37
 
38
+ This metric outputs a dictionary with the following values:
39
+
40
+ `bert_score_precision`: Average of the BERTScore precision values computed by the [bertscore module](https://github.com/huggingface/evaluate/blob/main/metrics/bertscore/README.md).
41
+
42
+ `bert_score_recall`: Average of the BERTScore recall values computed by the [bertscore module](https://github.com/huggingface/evaluate/blob/main/metrics/bertscore/README.md).
43
+
44
+ `bert_score_f1`: Average of the BERTScore F1 values computed by the [bertscore module](https://github.com/huggingface/evaluate/blob/main/metrics/bertscore/README.md).
45
 
46
+ `bleu-1`: BLEU-1 score computed by the [bleu module](https://github.com/huggingface/evaluate/blob/main/metrics/bleu/README.md).
47
+
48
+ `precision`: Sum of the minimum distances between each predicted value and the set of gold values, computed for each question.
49
+
50
+ `recall`: Sum of the minimum distances between each gold value and the set of generated values, computed for each question.
51
+
52
+ `macro-mean`: Average of precision and recall, computed for each question.
53
+
54
+ `median macro-mean`: Median across macro-mean values.
55
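+ The distance-based scores above can be sketched as follows. This is a minimal stand-alone illustration, not the module itself; a small pure-Python Levenshtein function stands in for the `python-Levenshtein` dependency, and `pre_rec` is a hypothetical helper name:
+
+ ```python
+ # Minimal edit-distance implementation standing in for
+ # Levenshtein.distance from python-Levenshtein.
+ def levenshtein(a, b):
+     prev = list(range(len(b) + 1))
+     for i, ca in enumerate(a, 1):
+         cur = [i]
+         for j, cb in enumerate(b, 1):
+             cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
+         prev = cur
+     return prev[-1]
+
+ def pre_rec(generated, gold, d=levenshtein):
+     # dists[i][j] = distance between gold[i] and generated[j]
+     dists = [[d(g, a) for a in generated] for g in gold]
+     precision = sum(min(col) for col in zip(*dists))  # best gold per prediction
+     recall = sum(min(row) for row in dists)           # best prediction per gold
+     return precision, recall
+
+ p, r = pre_rec(['paris', 'lyon'], ['paris', 'london', 'lyon'])
+ print(p, r, (p + r) / 2)  # 0 3 1.5
+ ```
+
+ By construction, lower values are better: a score of 0 means every prediction exactly matches a gold answer (precision) and every gold answer is matched by a prediction (recall).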
 
56
  #### Values from Popular Papers
 
57
 
58
  ### Examples
59
+
60
+ ```python
61
+ import evaluate
62
+ place_pred_eval = evaluate.load("rfr2003/place_gen_evaluate")
63
+ results = place_pred_eval.compute(generations=['[Hotel New Home, Hopeland]'], golds=[['Bar Guisness', 'Hotel New Home', 'New Hopeland']])
64
+ print(results)
65
+ {'bert_score_precision': 0.8470218181610107, 'bert_score_recall': 0.9131535291671753, 'bert_score_f1': 0.8788453936576843, 'bleu-1': 0.5714285714285714, 'precision': [6.0], 'recall': [15.0], 'macro-mean': [10.5], 'median macro-mean': 10.5}
66
+ ```
67
 
68
  ## Limitations and Bias
 
69
 
70
  ## Citation
 
71
 
72
+ ## Further References
 
place_gen_evaluate.py CHANGED
@@ -32,29 +32,35 @@ year={2020}
32
 
33
  # TODO: Add description of the module here
34
  _DESCRIPTION = """\
35
- This new module is designed to solve this great ML task and is crafted with a lot of care.
36
  """
37
 
38
 
39
  # TODO: Add description of the arguments of the module here
40
  _KWARGS_DESCRIPTION = """
41
- Calculates how good are predictions given some references, using certain scores
42
  Args:
43
- predictions: list of predictions to score. Each predictions
44
- should be a string with tokens separated by spaces.
45
- references: list of reference for each prediction. Each
46
- reference should be a string with tokens separated by spaces.
 
47
  Returns:
48
- accuracy: description of the first score,
49
- another_score: description of the second score,
 
 
 
 
 
 
50
  Examples:
51
- Examples should be written in doctest format, and should illustrate how
52
- to use the function.
53
 
54
- >>> my_new_module = evaluate.load("my_new_module")
55
- >>> results = my_new_module.compute(references=[0, 1], predictions=[0, 1])
56
  >>> print(results)
57
- {'accuracy': 1.0}
58
  """
59
 
60
 
@@ -95,15 +101,13 @@ class Place_gen_evaluate(evaluate.Metric):
95
  dists.append(g_dist)
96
 
97
  dists = np.array(dists)
98
- precision = np.min(dists, axis=0).sum()
99
 
100
- recall = np.min(dists, axis=1).sum()
101
 
102
  return precision, recall
103
 
104
- def _compute(self, generations, golds):
105
- '''Calculate Accuracy and BLEU-1 scores between model generations and golden answers.
106
- We expect a set of generated answers and want to find it among a set of gold answers.'''
107
  assert len(generations) == len(golds)
108
  assert isinstance(golds, list)
109
 
@@ -128,7 +132,7 @@ class Place_gen_evaluate(evaluate.Metric):
128
  f_ans = list(set([str(a).lower().strip() for a in f_ans])) #get rid of duples values
129
  f_gold = list(set(gold))
130
 
131
- precision, recall = self._calculate_pre_rec(f_ans, f_gold, Levenshtein.distance)
132
 
133
  precisions.append(precision)
134
  recalls.append(recall)
@@ -140,9 +144,9 @@ class Place_gen_evaluate(evaluate.Metric):
140
  metrics = {f"bert_score_{k}":np.mean(v).item() for k,v in self.bert_score.compute(predictions=predictions, references=references, lang="en").items() if k in ['recall', 'precision', 'f1']}
141
  metrics.update({
142
  'bleu-1': self.bleu.compute(predictions=predictions, references=references, max_order=1)['bleu'],
143
- 'precision': np.mean(precisions).item(),
144
- 'rappel': np.mean(recalls).item(),
145
- 'macro-mean': np.mean(means_pre_rec).item(),
146
  'median macro-mean': median(means_pre_rec)
147
  })
148
 
 
32
 
33
  # TODO: Add description of the module here
34
  _DESCRIPTION = """\
35
+ This metric evaluates geographic place prediction tasks performed by LMs.
36
  """
37
 
38
 
39
  # TODO: Add description of the arguments of the module here
40
  _KWARGS_DESCRIPTION = """
41
+ Calculates precision, recall and macro-mean between generations and gold answers in a place-prediction context.
42
  Args:
43
+ generations: list of predictions to score. Each prediction
44
+ should be a string generated by a language model.
45
+ golds: list of references for each prediction. Each
46
+ reference should be a list of strings corresponding to names of geographic places.
47
+ d: function used to compute the distance between a generated value and a gold one.
48
  Returns:
49
+ bert_score_precision: Average of the BERTScore precision values.
50
+ bert_score_recall: Average of the BERTScore recall values.
51
+ bert_score_f1: Average of the BERTScore F1 values.
52
+ bleu-1: BLEU-1 score across all questions.
53
+ precision: Sum of the minimum distances between each predicted value and the set of gold values, computed for each question.
54
+ recall: Sum of the minimum distances between each gold value and the set of generated values, computed for each question.
55
+ macro-mean: Average of precision and recall, computed for each question.
56
+ median macro-mean: Median across macro-mean values.
57
  Examples:
58
+ Here is an example of how to use the metric:
 
59
 
60
+ >>> metric = evaluate.load("rfr2003/place_gen_evaluate")
61
+ >>> results = metric.compute(generations=['[Hotel New Home, Hopeland]'], golds=[['Bar Guisness', 'Hotel New Home', 'New Hopeland']])
62
  >>> print(results)
63
+ {'bert_score_precision': 0.8470218181610107, 'bert_score_recall': 0.9131535291671753, 'bert_score_f1': 0.8788453936576843, 'bleu-1': 0.5714285714285714, 'precision': [6.0], 'recall': [15.0], 'macro-mean': [10.5], 'median macro-mean': 10.5}
64
  """
65
 
66
 
 
101
  dists.append(g_dist)
102
 
103
  dists = np.array(dists)
104
+ precision = np.min(dists, axis=0).sum().item()
105
 
106
+ recall = np.min(dists, axis=1).sum().item()
107
 
108
  return precision, recall
109
 
110
+ def _compute(self, generations, golds, d=Levenshtein.distance):
 
 
111
  assert len(generations) == len(golds)
112
  assert isinstance(golds, list)
113
 
 
132
  f_ans = list(set([str(a).lower().strip() for a in f_ans])) #get rid of duples values
133
  f_gold = list(set(gold))
134
 
135
+ precision, recall = self._calculate_pre_rec(f_ans, f_gold, d)
136
 
137
  precisions.append(precision)
138
  recalls.append(recall)
 
144
  metrics = {f"bert_score_{k}":np.mean(v).item() for k,v in self.bert_score.compute(predictions=predictions, references=references, lang="en").items() if k in ['recall', 'precision', 'f1']}
145
  metrics.update({
146
  'bleu-1': self.bleu.compute(predictions=predictions, references=references, max_order=1)['bleu'],
147
+ 'precision': precisions,
148
+ 'recall': recalls,
149
+ 'macro-mean': means_pre_rec,
150
  'median macro-mean': median(means_pre_rec)
151
  })
152
 
tests.py CHANGED
@@ -1,17 +1,7 @@
1
  test_cases = [
2
  {
3
- "predictions": [0, 0],
4
- "references": [1, 1],
5
- "result": {"metric_score": 0}
6
  },
7
- {
8
- "predictions": [1, 1],
9
- "references": [1, 1],
10
- "result": {"metric_score": 1}
11
- },
12
- {
13
- "predictions": [1, 0],
14
- "references": [1, 1],
15
- "result": {"metric_score": 0.5}
16
- }
17
  ]
 
1
  test_cases = [
2
  {
3
+ "predictions": ['[Hotel New Home, Hopeland]'],
4
+ "references": [['Bar Guisness', 'Hotel New Home', 'New Hopeland']],
5
+ "result": {'bert_score_precision': 0.8470218181610107, 'bert_score_recall': 0.9131535291671753, 'bert_score_f1': 0.8788453936576843, 'bleu-1': 0.5714285714285714, 'precision': [6.0], 'recall': [15.0], 'macro-mean': [10.5], 'median macro-mean': 10.5}
6
  },
 
7
  ]