Spaces: TabArena/leaderboard


LennartPurucker committed on May 22

Commit d425853 (1 parent: cb51391)

add: more metric cols

Files changed (2)

data/tabarena_leaderboard.csv.zip +2 -2
main.py +68 -16

data/tabarena_leaderboard.csv.zip CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:c95306a347a69e561a82562c1a6306f5a9f6819a60f458d5e350639c35cde848
-size 10202
+oid sha256:7b23c724927320a54d5e4edcf1b2d938bc818c1dfd5461f1a8d204bb0b44d095
+size 10582

main.py CHANGED

@@ -26,6 +26,9 @@ tuned configurations. Each model is implemented in a tested real-world pipeline
 optimized to get the most out of the model by the maintainers of TabArena, and where
 possible together with the authors of the model.
 **Reference Pipeline:** The leaderboard includes a reference pipeline, which is applied
 independently of the tuning protocol and constraints we constructed for models within TabArena.
 The reference pipeline aims to represent the performance quickly achievable by a
@@ -39,22 +42,68 @@ The current leaderboard is based on TabArena-v0.1.
 ABOUT_TEXT = """
 ## Using TabArena for Benchmarking
 To compare your own methods to the pre-computed results for all models on the leaderboard,
 you can use the TabArena framework. For examples on how to use TabArena for benchmarking,
 please see https://github.com/TabArena/tabarena_benchmarking_examples
 ## Contributing Data
 For anything related to the datasets used in TabArena, please see https://github.com/TabArena/tabarena_dataset_curation
-## Contributing to the Leaderboard
-For guidelines on how to contribute the result of your model to the official leaderboard,
-please see the appendix of our paper. <TODO: publish documentation>
-## Contact The Maintainers
-For any inquiries related to TabArena, please reach out to: contact@tabarena.ai
-## Core Maintainers
 The current core maintainers of TabArena are:
 [Nick Erickson](https://github.com/Innixma),
 [Lennart Purucker](https://github.com/LennartPurucker/),
@@ -139,6 +188,9 @@ def load_data(filename: str):
         + df_leaderboard["elo-"].round(0).astype(int).astype(str)
     )
     # select only the columns we want to display
     df_leaderboard = df_leaderboard.loc[
         :,
         [
@@ -147,8 +199,10 @@ def load_data(filename: str):
             "method",
             "elo",
             "Elo 95% CI",
             "rank",
-            "normalized-error",
             "median_time_train_s_per_1K",
             "median_time_infer_s_per_1K",
         ],
@@ -158,11 +212,11 @@ def load_data(filename: str):
     df_leaderboard[["elo", "Elo 95% CI"]] = df_leaderboard[["elo", "Elo 95% CI"]].round(
         0
     )
-    df_leaderboard[["median_time_train_s_per_1K", "rank"]] = df_leaderboard[
-        ["median_time_train_s_per_1K", "rank"]
     ].round(2)
-    df_leaderboard[["normalized-error", "median_time_infer_s_per_1K"]] = df_leaderboard[
-        ["normalized-error", "median_time_infer_s_per_1K"]
     ].round(3)
     df_leaderboard = df_leaderboard.sort_values(by="elo", ascending=False)
@@ -177,14 +231,12 @@ def load_data(filename: str):
             "method": "Model",
             "elo": "Elo [⬆️]",
             "rank": "Rank [⬇️]",
-            "normalized-error": "Normalized Error [⬇️]",
         }
     )
-    # TODO show ELO +/- sem
-    # TODO: rename and re-order columns
 def make_leaderboard(df_leaderboard: pd.DataFrame) -> Leaderboard:
     df_leaderboard["TypeFiler"] = df_leaderboard["TypeName"].apply(
         lambda m: f"{m} {model_type_emoji[m]}"

 optimized to get the most out of the model by the maintainers of TabArena, and where
 possible together with the authors of the model.
+**Metrics:** The leaderboard is ranked based on Elo. We present several additional
+metrics. See the `About` tab for more information on the metrics.
 **Reference Pipeline:** The leaderboard includes a reference pipeline, which is applied
 independently of the tuning protocol and constraints we constructed for models within TabArena.
 The reference pipeline aims to represent the performance quickly achievable by a
 ABOUT_TEXT = """
+TabArena is a living benchmark system for predictive machine learning on tabular data.
+We introduce TabArena and provide an overview of TabArena-v0.1 in our paper: TBA.
 ## Using TabArena for Benchmarking
 To compare your own methods to the pre-computed results for all models on the leaderboard,
 you can use the TabArena framework. For examples on how to use TabArena for benchmarking,
 please see https://github.com/TabArena/tabarena_benchmarking_examples
+## Contributing to the Leaderboard; Contributing Models
+For guidelines on how to contribute your model to TabArena, or the result of your model
+to the official leaderboard, please see the appendix of our paper: TBA.
 ## Contributing Data
 For anything related to the datasets used in TabArena, please see https://github.com/TabArena/tabarena_dataset_curation
+---
+## Leaderboard Documentation
+The leaderboard is ranked by Elo and includes several other metrics. Here is a short
+description for these metrics:
+#### Elo
+We evaluate models using the Elo rating system, following Chatbot Arena. Elo is a
+pairwise comparison-based rating system where each model's rating predicts its expected
+win probability against others, with a 400-point Elo gap corresponding to a 10 to 1
+(91\%) expected win rate. We calibrate 1000 Elo to the performance of our default
+random forest configuration across all figures, and perform 100 rounds of bootstrapping
+to obtain 95\% confidence intervals. Elo scores are computed using ROC AUC for binary
+classification, log-loss for multiclass classification, and RMSE for regression.
+#### Normalized Score
+Following TabRepo, we linearly rescale the error such that the best method has a
+normalized score of one, and the median method has a normalized score of 0. Scores
+below zero are clipped to zero. These scores are then averaged across datasets.
+#### Average Rank
+Ranks of methods are computed on each dataset (lower is better) and averaged.
+#### Harmonic Mean Rank
+Taking the harmonic mean of ranks, 1/((1/N) * sum(1/rank_i for i in range(N))),
+more strongly favors methods having very low ranks on some datasets. It therefore favors
+ methods that are sometimes very good and sometimes very bad over methods that are
+ always mediocre, as the former are more likely to be useful in conjunction with
+ other methods.
+#### Improvability
+We introduce improvability as a metric that measures how many percent lower the error
+of the best method is than the current method on a dataset. This is then averaged over
+datasets. Formally, for a single dataset improvability is (err_i - besterr_i)/err_i * 100\%.
+Improvability is always between $0\%$ and $100\%$.
+---
+## Contact
+For most inquiries, please open issues in the relevant GitHub repository or here on
+HuggingFace.
+For any other inquiries related to TabArena, please reach out to: contact@tabarena.ai
+### Core Maintainers
 The current core maintainers of TabArena are:
 [Nick Erickson](https://github.com/Innixma),
 [Lennart Purucker](https://github.com/LennartPurucker/),
         + df_leaderboard["elo-"].round(0).astype(int).astype(str)
     )
     # select only the columns we want to display
+    df_leaderboard["normalized-score"] = 1 - df_leaderboard["normalized-error"]
+    df_leaderboard["hmr"] = 1/df_leaderboard["mrr"]
+    df_leaderboard["improvability"] = 100 * df_leaderboard["champ_delta"]
     df_leaderboard = df_leaderboard.loc[
         :,
         [
             "method",
             "elo",
             "Elo 95% CI",
+            "normalized-score",
             "rank",
+            "hmr",
+            "improvability",
             "median_time_train_s_per_1K",
             "median_time_infer_s_per_1K",
         ],
     df_leaderboard[["elo", "Elo 95% CI"]] = df_leaderboard[["elo", "Elo 95% CI"]].round(
         0
     )
+    df_leaderboard[["median_time_train_s_per_1K", "rank", "hmr"]] = df_leaderboard[
+        ["median_time_train_s_per_1K", "rank", "hmr"]
     ].round(2)
+    df_leaderboard[["normalized-score", "median_time_infer_s_per_1K", "improvability"]] = df_leaderboard[
+        ["normalized-score", "median_time_infer_s_per_1K", "improvability"]
     ].round(3)
     df_leaderboard = df_leaderboard.sort_values(by="elo", ascending=False)
             "method": "Model",
             "elo": "Elo [⬆️]",
             "rank": "Rank [⬇️]",
+            "normalized-score": "Normalized Score [⬆️]",
+            "hmr": "Harmonic Mean Rank [⬇️]",
+            "improvability": "Improvability (%) [⬇️]",
         }
     )
 def make_leaderboard(df_leaderboard: pd.DataFrame) -> Leaderboard:
     df_leaderboard["TypeFiler"] = df_leaderboard["TypeName"].apply(
         lambda m: f"{m} {model_type_emoji[m]}"
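
The metric definitions added to `ABOUT_TEXT` in this commit can be illustrated with a small sketch. This is not the TabArena implementation: the diff only shows `main.py` deriving the displayed columns from precomputed CSV fields (`normalized-error`, `mrr`, `champ_delta`). The function names `elo_expected_win_rate` and `summarize_metrics`, the output column names, and the input layout (rows are datasets, columns are methods, values are strictly positive errors) are assumptions made for illustration. The sketch also assumes the median error on each dataset is strictly larger than the best error.

```python
import pandas as pd


def elo_expected_win_rate(elo_gap: float) -> float:
    """Standard Elo expected win probability for a rating advantage of `elo_gap`."""
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))


def summarize_metrics(errors: pd.DataFrame) -> pd.DataFrame:
    """Aggregate per-dataset errors into the leaderboard metrics described above.

    `errors` is a hypothetical layout: rows = datasets, columns = methods,
    values = error (lower is better, assumed strictly positive).
    """
    best = errors.min(axis=1)       # best error per dataset
    median = errors.median(axis=1)  # median error per dataset

    # Normalized score: 1 for the best method, 0 for the median method,
    # negative values clipped to 0 (assumes median > best on every dataset).
    rescaled_error = errors.sub(best, axis=0).div(median - best, axis=0)
    normalized_score = (1.0 - rescaled_error).clip(lower=0.0)

    # Per-dataset ranks; rank 1 is the lowest error, ties get the average rank.
    ranks = errors.rank(axis=1, method="average")

    # Improvability: how many percent lower the best error is than this method's.
    improvability = errors.sub(best, axis=0).div(errors, axis=0) * 100.0

    return pd.DataFrame(
        {
            "normalized_score": normalized_score.mean(axis=0),
            "avg_rank": ranks.mean(axis=0),
            # Harmonic mean rank = 1 / mean reciprocal rank.
            "harmonic_mean_rank": 1.0 / (1.0 / ranks).mean(axis=0),
            "improvability_pct": improvability.mean(axis=0),
        }
    )


if __name__ == "__main__":
    # Made-up errors for three hypothetical methods on two datasets.
    demo = pd.DataFrame(
        {"A": [0.10, 0.40], "B": [0.20, 0.20], "C": [0.30, 0.30]},
        index=["d1", "d2"],
    )
    print(summarize_metrics(demo))
    print(elo_expected_win_rate(400))
```

In this sketch, the harmonic mean rank is computed as the inverse of the mean reciprocal rank, which matches the `hmr = 1/mrr` line in the diff, and the normalized score mirrors `1 - normalized-error` under the assumption that `normalized-error` is the clipped, rescaled error.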