Commit · 3a84b3b
1 Parent(s): 5f70754
chore: Small update

app.py CHANGED
@@ -45,12 +45,12 @@ the models 10 times with bootstrapped test sets and different few-shot examples
 iteration. This allows us to better measure the uncertainty of the results. We use the
 uncertainty in the radial plot when we compute the rank scores for the models. Namely,
 we compute the rank score by firstly computing the rank of the model on each task,
-where two models are considered to have the same rank if
-
-
-
-
-
+where two models are considered to have the same rank if there is not a statistically
+significant difference between their scores (one-tailed t-test with p < 0.05). We next
+apply a logarithmic transformation to the ranks, to downplay the importance of the
+poorly performing models. Lastly, we invert and normalise the logarithmic ranks to the
+range [0, 1], resulting in the best performing models having rank scores close to 1 and
+the worst performing models having rank scores close to 0.
 
 ## The Benchmark Datasets
 
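The added paragraph spells out the whole rank-score recipe: significance-aware ranks, a logarithmic transformation, then inversion and normalisation. Below is a minimal sketch of that recipe for a single task and language; it is not the actual code in app.py, the compute_rank_scores helper and its bootstrap_scores input are hypothetical, and the one-tailed t-test is done here with scipy's ttest_ind.

# Hypothetical illustration of the rank-score recipe described above,
# not the implementation in app.py.
import numpy as np
from scipy import stats

def compute_rank_scores(bootstrap_scores: dict[str, list[float]]) -> dict[str, float]:
    """Map each model to a rank score in [0, 1], higher being better."""
    # Sort the models by their mean bootstrapped score, best first
    model_ids = sorted(bootstrap_scores, key=lambda m: -np.mean(bootstrap_scores[m]))

    # A model shares the rank of the previous (better) model when a one-tailed
    # t-test cannot separate the two at p < 0.05
    ranks = [1]
    for prev_id, model_id in zip(model_ids[:-1], model_ids[1:]):
        p_value = stats.ttest_ind(
            bootstrap_scores[prev_id],
            bootstrap_scores[model_id],
            alternative="greater",
        ).pvalue
        ranks.append(ranks[-1] if p_value >= 0.05 else ranks[-1] + 1)

    # Log-transform the ranks to downplay poorly performing models, then invert
    # and normalise to [0, 1] so the best models end up close to 1
    log_ranks = np.log(ranks)
    max_log_rank = log_ranks.max()
    scores = 1 - log_ranks / max_log_rank if max_log_rank > 0 else np.ones_like(log_ranks)
    return dict(zip(model_ids, scores))

Note that if no model can be statistically separated from the others, every model gets rank 1 and hence a rank score of 1.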
@@ -551,9 +551,9 @@ def produce_radial_plot(
 ranks.append(rank)
 
 log_ranks = np.log(ranks)
-scores = log_ranks / log_ranks.max()
+scores = 1 - (log_ranks / log_ranks.max())
 for model_id, score in zip(model_ids_sorted, scores):
-    all_rank_scores[task][language][model_id] =
+    all_rank_scores[task][language][model_id] = score
 logger.info("Successfully computed rank scores.")
 
 # Add all the evaluation results for each model
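For example, with ranks 1, 2 and 4, np.log gives roughly 0.0, 0.69 and 1.39; the old expression mapped these to rank scores 0.0, 0.5 and 1.0 (best model lowest), while the new expression gives 1.0, 0.5 and 0.0, so the best-ranked model now receives the highest rank score, as described in the text above.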
@@ -568,15 +568,13 @@ def produce_radial_plot(
 if model_id not in results_dfs_filtered[language].index:
     continue
 
-score_list = results_dfs_filtered[language].loc[model_id][task]
-
 rank_score = 100 * all_rank_scores[task][language][model_id]
 rank_scores.append(rank_score)
 
-
-
-
-scores.append(
+score_arr = np.array(results_dfs_filtered[language].loc[model_id][task])
+if score_arr.mean() < 1:
+    score_arr *= 100
+scores.append(score_arr.mean())
 if use_rank_score:
     result_list.append(np.mean(rank_scores))
 else:
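The replacement block also appears to harmonise metric scales before averaging: when the bootstrapped scores for a task are fractions (mean below 1), they are multiplied by 100 so that every task contributes on a 0-100 scale. A rough illustration with invented values:

import numpy as np

score_arr = np.array([0.71, 0.69, 0.74])  # fraction-scale metric such as accuracy
if score_arr.mean() < 1:
    score_arr *= 100  # rescale so it averages with metrics already on a 0-100 scale
print(score_arr.mean())  # 71.33...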