from __future__ import annotations

TITLE = """<h1 align="center" id="space-title">TabArena Leaderboard for Predictive Machine Learning on IID Tabular Data</h1>"""
INTRODUCTION_TEXT = """
TabArena is a living benchmark system for predictive machine learning on tabular data.
The goal of TabArena and its leaderboard is to assess the peak performance of
model-specific pipelines.
"""
OVERVIEW_DATASETS = """
The leaderboard is based on a manually curated collection of
51 tabular classification and regression datasets for independent and identically distributed
(IID) data, spanning the small to medium data regime. The datasets were selected to
represent a wide range of real-world predictive machine learning use cases.
"""
OVERVIEW_MODELS = """
The focus of the leaderboard is on model-specific pipelines. Each pipeline is
evaluated with its default hyperparameter configuration, with tuned hyperparameters,
and as an ensemble of tuned configurations. Each model is implemented in a tested,
real-world pipeline that the maintainers of TabArena optimized to get the most out of
the model, where possible together with the authors of the model.
"""
OVERVIEW_METRICS = """
The leaderboards are ranked by Elo. We also present several additional
metrics; see `More Details` for more information on them.
**Note: we impute** the performance of models that cannot run on all datasets due to
task or dataset size constraints (e.g., TabPFN, TabICL). Imputation generally
understates a model's performance, penalizing it for not being able
to run on all datasets. We therefore also provide leaderboards computed only on the
subsets of datasets where TabPFN, TabICL, or both can run, denoted by an
`X-compatible` suffix.
"""
OVERVIEW_REF_PIPE = """
The leaderboard includes a reference pipeline, which is applied
independently of the tuning protocol and constraints we constructed for models within TabArena.
The reference pipeline aims to represent the performance a practitioner can achieve
quickly on a dataset. The current reference pipeline is the predictive machine
learning system AutoGluon (version 1.3, with the best_quality preset and
a 4-hour training budget). AutoGluon ensembles across various model
types and thus provides a reference point for the model-specific pipelines.
"""
ABOUT_TEXT = r"""
### Extended Overview of TabArena (References / Papers)
We introduce TabArena and provide an overview of TabArena-v0.1 in our paper: TBA.
### Using TabArena for Benchmarking
To compare your own methods to the pre-computed results for all models on the leaderboard,
you can use the TabArena framework. For examples of how to use TabArena for benchmarking,
please see https://tabarena.ai/code-examples
### Contributing to the Leaderboard; Contributing Models
For guidelines on how to contribute your model to TabArena, or the results of your model
to the official leaderboard, please see the appendix of our paper: TBA.
### Contributing Data
For anything related to the datasets used in TabArena, please see https://tabarena.ai/data-tabular-ml-iid-study
---
### Leaderboard Documentation
The leaderboard is ranked by Elo and includes several other metrics. Below is a short
description of each metric:
#### Elo
We evaluate models using the Elo rating system, following Chatbot Arena. Elo is a
pairwise comparison-based rating system in which each model's rating predicts its expected
win probability against others, with a 400-point Elo gap corresponding to a 10-to-1
(91\%) expected win rate. We calibrate an Elo of 1000 to the performance of our default
random forest configuration across all figures, and perform 100 rounds of bootstrapping
to obtain 95\% confidence intervals. Elo scores are computed using ROC AUC for binary
classification, log-loss for multiclass classification, and RMSE for regression.
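
For intuition, the win probability implied by an Elo gap follows the standard Elo
formula. A minimal sketch (the function name is ours, not part of the TabArena code):

```python
def expected_win_probability(rating_a: float, rating_b: float) -> float:
    # Expected probability that a model rated rating_a beats one rated rating_b.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# A 400-point gap corresponds to roughly a 10-to-1 expected win rate:
# expected_win_probability(1400, 1000)  ->  ~0.909
```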
#### Normalized Score
Following TabRepo, we linearly rescale the error on each dataset such that the best
method has a normalized score of 1 and the median method has a normalized score of 0.
Scores below zero are clipped to zero. These scores are then averaged across datasets.
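
A minimal sketch of this rescaling for one dataset (illustrative only; the function
name and the tie-handling are our assumptions, not the exact evaluation code):

```python
import numpy as np

def normalized_scores(errors: np.ndarray) -> np.ndarray:
    # Rescale errors on one dataset: best error -> 1, median error -> 0.
    best, median = errors.min(), np.median(errors)
    if median == best:  # degenerate case: all methods tie
        return np.ones_like(errors, dtype=float)
    # Scores below zero (errors worse than the median) are clipped to zero.
    return np.clip((median - errors) / (median - best), 0.0, None)
```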
#### Average Rank
Ranks of methods are computed on each dataset (lower is better) and then averaged.
#### Harmonic Mean Rank
Taking the harmonic mean of ranks, 1 / ((1/N) * sum(1/rank_i for i in range(N))),
more strongly favors methods that achieve very low ranks on some datasets. It therefore
favors methods that are sometimes very good and sometimes very bad over methods that are
always mediocre, as the former are more likely to be useful in conjunction with
other methods.
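
Equivalently, the harmonic mean of a method's ranks is N / sum(1/rank_i). A small
sketch (the function name is our own):

```python
def harmonic_mean_rank(ranks: list[float]) -> float:
    # Harmonic mean of per-dataset ranks; dominated by the lowest (best) ranks.
    return len(ranks) / sum(1.0 / r for r in ranks)
```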
#### Improvability
We introduce improvability as a metric that measures by how many percent the error of
the best method on a dataset is lower than that of the current method. This is then
averaged over datasets. Formally, for a single dataset, improvability is
(err_i - besterr_i) / err_i * 100\%, which is always between 0\% and 100\%.
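
As a small illustration (the function name is ours; errors are assumed positive):

```python
def improvability(err: float, best_err: float) -> float:
    # Percent by which the best method's error undercuts this method's error.
    # Since 0 <= best_err <= err, the result lies in [0, 100].
    return (err - best_err) / err * 100.0
```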
---
### Contact
For most inquiries, please open issues in the relevant GitHub repository or here on
Hugging Face.
For any other inquiries related to TabArena, please reach out to: mail@tabarena.ai
### Core Maintainers
The current core maintainers of TabArena are:
[Nick Erickson](https://github.com/Innixma),
[Lennart Purucker](https://github.com/LennartPurucker/),
[Andrej Tschalzev](https://github.com/atschalz),
[David Holzmüller](https://github.com/dholzmueller)
"""
CITATION_BUTTON_LABEL = (
    "If you use TabArena or the leaderboard in your research, please cite the following:"
)
CITATION_BUTTON_TEXT = r"""@article{
TBA,
}
"""
VERSION_HISTORY_BUTTON_TEXT = """
**Current Version: TabArena-v0.1.1**
The following details updates to the leaderboard (date format is YYYY/MM/DD):
* 2025/06/13: Added data for all subsets and re-runs on GPU; added leaderboards for
  subsets; new overview; added figures to the leaderboards.
* 2025/05: Initialization of the TabArena-v0.1 leaderboard.
"""