vitaliykinakh/binary-ddpm-tabular

vitaliykinakh committed on Dec 11, 2024

Commit 012c18c (verified) · 1 Parent(s): 2de08f8

Update README.md

Files changed (1)

README.md +53 -3

@@ -1,3 +1,53 @@
----
-license: mit
----

+---
+license: mit
+datasets:
+- demo-org/diabetes
+- scikit-learn/adult-census-income
+- leostelon/california-housing
+- vitaliykinakh/heloc
+- vitaliykinakh/sick
+- vitaliykinakh/travel
+metrics:
+- accuracy
+---
+This repository contains the official models from the paper "[Tabular Data Generation using Binary Diffusion](https://arxiv.org/abs/2409.13882)",
+accepted to the [3rd Table Representation Learning Workshop @ NeurIPS 2024](https://table-representation-learning.github.io/).
+# Abstract
+Generating synthetic tabular data is critical in machine learning, especially when real data is limited or sensitive.
+Traditional generative models often face challenges due to the unique characteristics of tabular data, such as mixed
+data types and varied distributions, and require complex preprocessing or large pretrained models. In this paper, we
+introduce a novel, lossless binary transformation method that converts any tabular data into fixed-size binary
+representations, and a corresponding new generative model called Binary Diffusion, specifically designed for binary
+data. Binary Diffusion leverages the simplicity of XOR operations for noise addition and removal and employs binary
+cross-entropy loss for training. Our approach eliminates the need for extensive preprocessing, complex noise parameter
+tuning, and pretraining on large datasets. We evaluate our model on several popular tabular benchmark datasets,
+demonstrating that Binary Diffusion outperforms existing state-of-the-art models on Travel, Adult Income, and Diabetes
+datasets while being significantly smaller in size.
+# Results
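+The table below reports **Binary Diffusion** results for each dataset with three downstream models: linear/logistic regression (LR), decision tree (DT), and random forest (RF). Values are shown as **mean ± standard deviation**, and **Params** is the number of parameters of the Binary Diffusion model for that dataset. California Housing is a regression task, so its row is not on the same scale as the classification accuracies (in %) reported for the other datasets.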
+| **Dataset**             | **LR (Binary Diffusion)** | **DT (Binary Diffusion)** | **RF (Binary Diffusion)** | **Params** |
+|-------------------------|---------------------------|---------------------------|---------------------------|------------|
+| **Travel**              | **83.79 ± 0.08**          | **88.90 ± 0.57**          | **89.95 ± 0.44**          | **1.1M**   |
+| **Sick**                | 96.14 ± 0.63              | **97.07 ± 0.24**          | 96.59 ± 0.55              | **1.4M**   |
+| **HELOC**               | 71.76 ± 0.30              | 70.25 ± 0.43              | 70.47 ± 0.32              | **2.6M**   |
+| **Adult Income**        | **85.45 ± 0.11**          | **85.27 ± 0.11**          | **85.74 ± 0.11**          | **1.4M**   |
+| **Diabetes**            | **57.75 ± 0.04**          | **57.13 ± 0.15**          | 57.52 ± 0.12              | **1.8M**   |
+| **California Housing**  | *0.55 ± 0.00*             | 0.45 ± 0.00               | 0.39 ± 0.00               | **1.5M**   |
+---
+# Citation
+```
+@article{kinakh2024tabular,
+  title={Tabular Data Generation using Binary Diffusion},
+  author={Kinakh, Vitaliy and Voloshynovskiy, Slava},
+  journal={arXiv preprint arXiv:2409.13882},
+  year={2024}
+}
+```
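
Note on the abstract added in this README: it describes three mechanisms, namely a lossless conversion of table values to fixed-size binary vectors, noise added and removed with XOR, and a binary cross-entropy training loss. The sketch below illustrates each on a toy column. It is a minimal illustration under stated assumptions, not the repository's code: the helper names (`floats_to_bits`, `bits_to_floats`, `add_xor_noise`, `remove_xor_noise`, `binary_cross_entropy`), the use of raw float32 bit patterns as the encoding, and the fixed flip probability are all hypothetical choices made for this example.

```python
import numpy as np


def floats_to_bits(x: np.ndarray) -> np.ndarray:
    """Encode each value as its 32-bit float32 pattern (assumed encoding)."""
    x32 = np.ascontiguousarray(x, dtype=np.float32)
    as_bytes = x32.view(np.uint8).reshape(-1, 4)   # 4 bytes per value
    return np.unpackbits(as_bytes, axis=1)         # shape (n, 32), values 0/1


def bits_to_floats(bits: np.ndarray) -> np.ndarray:
    """Invert floats_to_bits exactly (the transformation is lossless)."""
    as_bytes = np.packbits(bits, axis=1)           # shape (n, 4), dtype uint8
    return as_bytes.view(np.float32).reshape(-1)


def add_xor_noise(x0: np.ndarray, flip_prob: float, rng: np.random.Generator):
    """Flip each bit independently with probability flip_prob via XOR."""
    noise = (rng.random(x0.shape) < flip_prob).astype(np.uint8)
    return np.bitwise_xor(x0, noise), noise


def remove_xor_noise(xt: np.ndarray, noise: np.ndarray) -> np.ndarray:
    """XOR is its own inverse: applying the same mask restores the input."""
    return np.bitwise_xor(xt, noise)


def binary_cross_entropy(pred_prob: np.ndarray, target_bits: np.ndarray,
                         eps: float = 1e-7) -> float:
    """Mean BCE between predicted bit probabilities and 0/1 targets."""
    p = np.clip(pred_prob.astype(np.float64), eps, 1.0 - eps)
    t = target_bits.astype(np.float64)
    return float(-np.mean(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)))


# Toy continuous column, assumed to be already scaled to [0, 1].
rng = np.random.default_rng(0)
column = np.array([0.0, 0.25, 0.5, 1.0], dtype=np.float32)

x0 = floats_to_bits(column)                    # clean binary representation
xt, noise = add_xor_noise(x0, flip_prob=0.3, rng=rng)
recovered = remove_xor_noise(xt, noise)        # exact when the mask is known

# A denoiser would output per-bit probabilities; an uninformed guess of 0.5
# everywhere gives a loss of ln(2), while a perfect prediction is near zero.
loss_uninformed = binary_cross_entropy(np.full(x0.shape, 0.5), x0)
loss_perfect = binary_cross_entropy(x0, x0)
```

In this sketch the noise mask is known, so recovery is exact. In a trained model the mask is not available at sampling time and a network has to estimate it (or the clean bits) from `xt`; which of these targets the released models use is not stated in this README, so the loss above is shown only against a generic bit target.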