Add metadata, paper links, and sample usage

#1, opened by nielsr (HF Staff)

Files changed (1): README.md +61 -258

@@ -1,305 +1,108 @@
1
- <div align="center">
2
- <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/LimiX-Logo.png" alt="LimiX summary" width="89%">
3
- </div>
4
-
5
- # 💥 News
6
- - 2025-11-10: LimiX-2M is officially released! Compared to LimiX-16M, this smaller variant offers significantly lower GPU memory usage and faster inference speed. The retrieval mechanism has also been enhanced, further improving model performance while reducing both inference time and memory consumption.
7
- - 2025-08-29: LimiX V1.0 Released.
8
-
9
- # ⚡ Latest Results Compared with SOTA Models
10
- <div align="center">
11
- <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/BCCO-CLS.png" width="30%" style="display:inline-block;">
12
- <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/TabArena-CLS.png" width="30%" style="display:inline-block;">
13
- <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/TabZilla-CLS.png" width="30%" style="display:inline-block;">
14
- </div>
15
- <div align="center">
16
- <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/BCCO-REG.png" width="30%" style="display:inline-block;">
17
- <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/TabArena-REG.png" width="30%" style="display:inline-block;">
18
- <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/CTR23-REG.png" width="30%" style="display:inline-block;">
19
- </div>
20
-
21
-
22
- # ➤ Overview
23
- <div align="center">
24
- <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/LimiX_Summary.png" alt="LimiX summary" width="89%">
25
- </div>
26
- We introduce LimiX, the first installment of our LDM series. LimiX aims to push generality further: a single model that handles classification, regression, missing-value imputation, feature selection, sample selection, and causal inference under one training and inference recipe, advancing the shift from bespoke pipelines to unified, foundation-style tabular learning.
27
-
28
- LimiX adopts a transformer architecture optimized for structured data modeling and task generalization. The model first embeds features X and targets Y from the prior knowledge base into token representations. Within the core modules, attention mechanisms are applied across both sample and feature dimensions to identify salient patterns in key samples and features. The resulting high-dimensional representations are then passed to regression and classification heads, enabling the model to support diverse predictive tasks.
29
-
30
- For details, please refer to the technical report at the link: [LimiX:Unleashing Structured-Data Modeling Capability for Generalist Intelligence](https://arxiv.org/abs/2509.03505) or [LimiX_Technical_Report.pdf](https://github.com/limix-ldm/LimiX/blob/main/LimiX_Technical_Report.pdf).
31
-
32
- # ➤ Superior Performance
33
- The LimiX model achieved SOTA performance across multiple tasks.
34
-
35
- ## ➩ Classification (Tech Report)
36
- <div align="center">
37
- <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/Classifier.png" alt="Classification" width="80%">
38
- </div>
39
 
40
- ## ➩ Regression (Tech Report)
41
  <div align="center">
42
- <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/Regression.png" alt="Regression" width="60%">
43
  </div>
44
 
45
- ## ➩ Missing Values Imputation (Tech Report)
46
- <div align="center">
47
- <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/MissingValueImputation.png" alt="Missing value imputation" width="80%">
48
- </div>
49
 
50
- # Tutorials
51
- ## ➩ Installation
52
- ### Option 1 (recommended): Use the Dockerfile
53
- Download [Dockerfile](https://github.com/limix-ldm/LimiX/blob/main/Dockerfile)
54
- ```bash
55
- docker build --network=host -t limix/infe:v1 --build-arg FROM_IMAGES=nvidia/cuda:12.2.0-base-ubuntu22.04 -f Dockerfile .
56
- ```
57
-
58
- ### Option 2: Build manually
59
- Download the prebuilt flash_attn files
60
- ```bash
61
- wget -O flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.0.post2/flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
62
- ```
63
- Install Python dependencies
64
- ```bash
65
- pip install python==3.12.7 torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1
66
- pip install flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
67
- pip install scikit-learn einops huggingface-hub matplotlib networkx numpy pandas scipy tqdm typing_extensions xgboost kditransform hyperopt
68
- ```
69
-
70
- ### Download source code
71
- ```bash
72
- git clone https://github.com/limix-ldm/LimiX.git
73
- cd LimiX
74
- ```
75
-
76
- # ➤ Inference
77
- LimiX supports tasks such as classification, regression, and missing value imputation
78
- ## ➩ Model download
79
- | Model size | Download link | Tasks supported |
80
- | --- | --- | --- |
81
- | LimiX-16M | [LimiX-16M.ckpt](https://huggingface.co/stableai-org/LimiX-16M/tree/main) | ✅ classification ✅regression ✅missing value imputation |
82
- | LimiX-2M | [LimiX-2M.ckpt](https://huggingface.co/stableai-org/LimiX-2M/tree/main) | ✅ classification ✅regression ✅missing value imputation |
83
-
84
- ## ➩ Interface description
85
-
86
- ### Model Creation
87
- ```python
88
- class LimiXPredictor:
89
- def __init__(self,
90
- device:torch.device,
91
- model_path:str,
92
- mix_precision:bool=True,
93
- inference_config: list|str,
94
- categorical_features_indices:List[int]|None=None,
95
- outlier_remove_std: float=12,
96
- softmax_temperature:float=0.9,
97
- task_type: Literal['Classification', 'Regression']='Classification',
98
- mask_prediction:bool=False,
99
- inference_with_DDP: bool = False,
100
- seed:int=0)
101
- ```
102
- | Parameter | Data Type | Description |
103
- |--------|----------|----------|
104
- | device | torch.device | The hardware that loads the model |
105
- | model_path | str | The path to the model that needs to be loaded |
106
- | mix_precision | bool | Whether to enable the mixed precision inference |
107
- | inference_config | list/str | Configuration file used for inference |
108
- | categorical_features_indices | list | The indices of categorical columns in the tabular data |
109
- | outlier_remove_std | float | The threshold is employed to remove outliers, defined as values that are multiples of the standard deviation |
110
- | softmax_temperature | float | The temperature used to control the behavior of softmax operator |
111
- | task_type | str | The task type which can be either "Classification" or "Regression" |
112
- | mask_prediction | bool | Whether to enable missing value imputation |
113
- | inference_with_DDP | bool | Whether to enable DDP during inference |
114
- | seed | int | The seed to control random states |
115
- ### Predict
116
- ```python
117
- def predict(self, x_train:np.ndarray, y_train:np.ndarray, x_test:np.ndarray) -> np.ndarray:
118
- ```
119
- | Parameter | Data Type | Description |
120
- | ------- | ---------- | ----------------- |
121
- | x_train | np.ndarray | The input features of the training set |
122
- | y_train | np.ndarray | The target variable of the training set |
123
- | x_test | np.ndarray | The input features of the test set |
124
-
125
- ## Inference Configuration File Description
126
- | Configuration File Name | Description | Difference |
127
- | ------- | ---------- | ----- |
128
- | cls_default_retrieval.json | Default **classification task** inference configuration file **with retrieval** | Better classification performance |
129
- | cls_default_noretrieval.json | Default **classification task** inference configuration file **without retrieval** | Faster speed, lower memory requirements |
130
- | reg_default_retrieval.json | Default **regression task** inference configuration file **with retrieval** | Better regression performance |
131
- | reg_default_noretrieval.json | Default **regression task** inference configuration file **without retrieval** | Faster speed, lower memory requirements |
132
- | reg_default_noretrieval_MVI.json | Default inference configuration file for **missing value imputation task** | |
133
-
134
- ## ➩ Ensemble Inference Based on Sample Retrieval
135
-
136
- For a detailed technical introduction to Ensemble Inference Based on Sample Retrieval, please refer to the [technical report](https://github.com/limix-ldm/LimiX/blob/main/LimiX_Technical_Report.pdf).
137
-
138
- Considering inference speed and memory requirements, ensemble inference based on sample retrieval currently only supports hardware with specifications higher than the NVIDIA RTX 4090 GPU.
139
-
140
- ### Classification Task
141
-
142
- ```
143
- python inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
144
- ```
145
-
146
- ### Regression Task
147
-
148
- ```
149
- python inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
150
- ```
151
 
152
- ### Customizing Data Preprocessing for Inference Tasks
153
- #### First, Generate the Inference Configuration File
 
 
154
 
155
- ```python
156
- generate_inference_config()
157
- ```
158
-
159
- ### Classification Task
160
- #### Single GPU or CPU
161
-
162
- ```
163
- python inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
164
- ```
165
-
166
- #### Multi-GPU Distributed Inference
167
-
168
- ```
169
- torchrun --nproc_per_node=8 inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data --inference_with_DDP
170
- ```
171
-
172
- ### Regression Task
173
- #### Single GPU or CPU
174
-
175
- ```
176
- python inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
177
- ```
178
 
179
- #### Multi-GPU Distributed Inference
180
 
181
- ```
182
- torchrun --nproc_per_node=8 inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data --inference_with_DDP
183
- ```
184
 
185
- ### Retrieval Optimization Project
186
- This project implements an optimized retrieval system. To achieve the best performance, we utilize Optuna for hyperparameter tuning of retrieval parameters.
187
- #### Installation
188
- Ensure you have the required dependencies installed:
189
- ```
190
- pip install optuna
191
- ```
192
- #### Usage
193
- For standard inference using pre-optimized parameters, refer to the code below:
194
- ```
195
- searchInference = RetrievalSearchHyperparameters(
196
- dict(device_id=0,model_path=model_path), X_train, y_train, X_test, y_test,
197
- )
198
- config, result = searchInference.search(n_trials=10, metric="AUC",
199
- inference_config='config/cls_default_retrieval.json',task_type="cls")
200
- ```
201
- This will launch an Optuna study to find the best combination of retrieval parameters for your specific dataset and use case.
202
 
203
- ## ➩ Classification
204
  ```python
 
 
205
  from sklearn.datasets import load_breast_cancer
206
  from sklearn.metrics import accuracy_score, roc_auc_score
207
  from sklearn.model_selection import train_test_split
208
  from huggingface_hub import hf_hub_download
209
- import numpy as np
210
  import os, sys
211
 
 
212
  os.environ["RANK"] = "0"
213
  os.environ["WORLD_SIZE"] = "1"
214
  os.environ["MASTER_ADDR"] = "127.0.0.1"
215
  os.environ["MASTER_PORT"] = "29500"
216
 
217
- ROOT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
218
- if ROOT_DIR not in sys.path:
219
- sys.path.insert(0, ROOT_DIR)
220
  from inference.predictor import LimiXPredictor
221
 
 
222
  X, y = load_breast_cancer(return_X_y=True)
223
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
224
 
225
- model_file = hf_hub_download(repo_id="stableai-org/LimiX-16M", filename="LimiX-16M.ckpt", local_dir="./cache")
 
226
 
227
- clf = LimiXPredictor(device='cuda', model_path=model_file, inference_config='config/cls_default_retrieval.json')
 
 
 
 
 
228
  prediction = clf.predict(X_train, y_train, X_test)
229
 
230
  print("roc_auc_score:", roc_auc_score(y_test, prediction[:, 1]))
231
  print("accuracy_score:", accuracy_score(y_test, np.argmax(prediction, axis=1)))
232
  ```
233
- For additional examples, refer to [inference_classifier.py](./inference_classifier.py)
234
-
235
- ## ➩ Regression
236
- ```python
237
- from functools import partial
238
-
239
- from sklearn.datasets import fetch_california_housing
240
- from sklearn.model_selection import train_test_split
241
- from sklearn.metrics import r2_score
242
- from huggingface_hub import hf_hub_download
243
- try:
244
- from sklearn.metrics import root_mean_squared_error as mean_squared_error
245
- except:
246
- from sklearn.metrics import mean_squared_error
247
- mean_squared_error = partial(mean_squared_error, squared=False)
248
- import os, sys
249
-
250
- os.environ["RANK"] = "0"
251
- os.environ["WORLD_SIZE"] = "1"
252
- os.environ["MASTER_ADDR"] = "127.0.0.1"
253
- os.environ["MASTER_PORT"] = "29500"
254
-
255
- ROOT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
256
- if ROOT_DIR not in sys.path:
257
- sys.path.insert(0, ROOT_DIR)
258
- from inference.predictor import LimiXPredictor
259
-
260
- house_data = fetch_california_housing()
261
- X, y = house_data.data, house_data.target
262
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
263
-
264
- y_mean = y_train.mean()
265
- y_std = y_train.std()
266
- y_train_normalized = (y_train - y_mean) / y_std
267
- y_test_normalized = (y_test - y_mean) / y_std
268
-
269
- model_path = hf_hub_download(repo_id="stableai-org/LimiX-16M", filename="LimiX-16M.ckpt", local_dir="./cache")
270
 
271
- model = LimiXPredictor(device='cuda', model_path=model_path, inference_config='config/reg_default_retrieval.json')
272
- y_pred = model.predict(X_train, y_train_normalized, X_test)
 
 
 
273
 
274
- # Compute RMSE and
275
- y_pred = y_pred.to('cpu').numpy()
276
- rmse = mean_squared_error(y_test_normalized, y_pred)
277
- r2 = r2_score(y_test_normalized, y_pred)
278
 
279
- print(f'RMSE: {rmse}')
280
- print(f'R2: {r2}')
 
 
 
 
281
  ```
282
- For additional examples, refer to [inference_regression.py](https://github.com/limix-ldm/LimiX/raw/main/inference_regression.py)
283
-
284
- ## ➩ Missing value imputation
285
- For the demo file, see [demo_missing_value_imputation.py](https://github.com/limix-ldm/LimiX/raw/main/examples/inference_regression.py)
286
 
287
- # Link
288
- - LimiX:Unleashing Structured-Data Modeling Capability for Generalist Intelligence: [LimiX:Unleashing Structured-Data Modeling Capability for Generalist Intelligence](https://arxiv.org/abs/2509.03505)
289
- - LimiX Technical Report: [LimiX_Technical_Report.pdf](https://github.com/limix-ldm/LimiX/blob/main/LimiX_Technical_Report.pdf)
290
- - Detailed instructions for using Limix: [Visit the official Limix documentation](https://www.limix.ai/doc/)
291
- - Balance Comprehensive Challenging Omni-domain Classification Benchmark: [bcco_cls](https://huggingface.co/datasets/stableai-org/bcco_cls)
292
- - Balance Comprehensive Challenging Omni-domain Regression Benchmark: [bcco_reg](https://huggingface.co/datasets/stableai-org/bcco_reg)
293
 
294
  # ➤ License
295
- The code in this repository is open-sourced under the [Apache-2.0](LICENSE.txt) license, while the usage of the LimiX model weights is subject to the Model License. The LimiX weights are fully available for academic research and may be used commercially upon obtaining proper authorization.
296
 
297
  # ➤ Citation
298
- ```
 
 
 
 
 
 
 
299
  @article{LimiX,
300
- title={LimiX:Unleashing Structured-Data Modeling Capability for Generalist Intelligence},
301
- author={LimiXTeam},
302
  journal={arXiv preprint arXiv:2509.03505},
303
  year={2025}
304
  }
305
- ```
 
+ ---
+ license: apache-2.0
+ pipeline_tag: other
+ tags:
+ - tabular
+ ---
 
  <div align="center">
+ <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/LimiX-Logo.png" alt="LimiX summary" width="89%">
  </div>
 
+ # LimiX-2M
 
+ LimiX-2M is a 2M-parameter tabular foundation model (TFM) introduced in the paper [LimiX-2M: Mitigating Low-Rank Collapse and Attention Bottlenecks in Tabular Foundation Models](https://huggingface.co/papers/2606.04485). It utilizes a "tokenize-and-route" framework with RaBEL tokenization and a readout-aligned routing architecture to achieve state-of-the-art performance on tabular benchmarks with significantly lower compute requirements.
 
+ - **Paper:** [LimiX-2M: Mitigating Low-Rank Collapse and Attention Bottlenecks in Tabular Foundation Models](https://huggingface.co/papers/2606.04485)
+ - **Technical Report (Original LimiX):** [LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence](https://arxiv.org/abs/2509.03505)
+ - **GitHub Repository:** [limix-ldm-ai/LimiX](https://github.com/limix-ldm-ai/LimiX)
+ - **Project Page:** [limix.ai](https://www.limix.ai/)
 
+ # 💥 News
+ - 2025-11-10: LimiX-2M is officially released! Compared to LimiX-16M, this smaller variant offers significantly lower GPU memory usage and faster inference speed. The retrieval mechanism has also been enhanced, further improving model performance while reducing both inference time and memory consumption.
+ - 2025-08-29: LimiX V1.0 Released.
 
+ # Sample Usage
 
+ The following example demonstrates how to use the `LimiXPredictor` for a classification task.
 
+ Note: You will need to clone the [official repository](https://github.com/limix-ldm-ai/LimiX) to access the `inference.predictor` module.
 
  ```python
+ import torch
+ import numpy as np
  from sklearn.datasets import load_breast_cancer
  from sklearn.metrics import accuracy_score, roc_auc_score
  from sklearn.model_selection import train_test_split
  from huggingface_hub import hf_hub_download
  import os, sys
 
+ # Setup environment for distributed backend (required by LimiXPredictor)
  os.environ["RANK"] = "0"
  os.environ["WORLD_SIZE"] = "1"
  os.environ["MASTER_ADDR"] = "127.0.0.1"
  os.environ["MASTER_PORT"] = "29500"
 
+ # Import LimiXPredictor (requires the source code from GitHub)
  from inference.predictor import LimiXPredictor
 
+ # Load data
  X, y = load_breast_cancer(return_X_y=True)
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
 
+ # Download the model checkpoint
+ model_file = hf_hub_download(repo_id="stableai-org/LimiX-2M", filename="LimiX-2M.ckpt")
 
+ # Initialize and run inference
+ clf = LimiXPredictor(
+     device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'),
+     model_path=model_file,
+     inference_config='config/cls_default_retrieval.json'
+ )
  prediction = clf.predict(X_train, y_train, X_test)
 
  print("roc_auc_score:", roc_auc_score(y_test, prediction[:, 1]))
  print("accuracy_score:", accuracy_score(y_test, np.argmax(prediction, axis=1)))
  ```
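
The sample above treats the output of `predict` as per-class probabilities: it reads `prediction[:, 1]` for the AUC and takes `np.argmax` over axis 1 for the labels. The sketch below isolates that post-processing step using only NumPy. It assumes an array of shape `(n_samples, n_classes)`; the helper name `summarize_predictions` and the probability values are made up for illustration and are not part of the LimiX code.

```python
import numpy as np


def summarize_predictions(proba, y_true):
    """Return (predicted labels, accuracy) from per-class probabilities."""
    proba = np.asarray(proba, dtype=float)
    y_true = np.asarray(y_true)
    # Assumed layout: one row per sample, one column per class.
    if proba.ndim != 2 or proba.shape[0] != y_true.shape[0]:
        raise ValueError("expected proba of shape (n_samples, n_classes)")
    labels = np.argmax(proba, axis=1)            # most probable class per row
    accuracy = float(np.mean(labels == y_true))  # fraction of correct labels
    return labels, accuracy


# Made-up probabilities for four samples and two classes.
example_proba = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6], [0.7, 0.3]])
example_truth = np.array([0, 1, 0, 0])
labels, accuracy = summarize_predictions(example_proba, example_truth)
print(labels, accuracy)
```

If the real predictor returns something other than a NumPy array (the removed regression example calls `.to('cpu').numpy()` on its output), convert it before applying a helper like this.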
 
+ # Overview
+ <div align="center">
+ <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/LimiX_Summary.png" alt="LimiX summary" width="89%">
+ </div>
+ We introduce LimiX, the first installment of our LDM series. LimiX aims to push generality further: a single model that handles classification, regression, missing-value imputation, feature selection, sample selection, and causal inference under one training and inference recipe, advancing the shift from bespoke pipelines to unified, foundation-style tabular learning.
 
+ LimiX adopts a transformer architecture optimized for structured data modeling and task generalization. The model first embeds features X and targets Y from the prior knowledge base into token representations. Within the core modules, attention mechanisms are applied across both sample and feature dimensions to identify salient patterns in key samples and features.
 
+ # ➤ Tutorials
+ ## ➩ Installation
+ ### Build manually
+ Install Python dependencies:
+ ```bash
+ pip install scikit-learn einops huggingface-hub matplotlib networkx numpy pandas scipy tqdm typing_extensions xgboost kditransform hyperopt
  ```
 
+ ### Download source code
+ ```bash
+ git clone https://github.com/limix-ldm/LimiX.git
+ cd LimiX
+ ```
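
The sample code imports `inference.predictor` from the cloned repository, so the clone has to be on Python's import path. The previous README did this by computing a `ROOT_DIR` relative to the script and inserting it into `sys.path`. The sketch below shows the same idea as a small helper; the name `add_repo_to_path` and the `LimiX` directory in the usage comment are assumptions for illustration, not part of the repository.

```python
import os
import sys


def add_repo_to_path(repo_dir):
    """Put a cloned repository at the front of sys.path if it is not there yet."""
    repo_dir = os.path.abspath(repo_dir)
    if not os.path.isdir(repo_dir):
        raise FileNotFoundError(f"repository directory not found: {repo_dir}")
    # Avoid duplicate entries when the helper is called more than once.
    if repo_dir not in sys.path:
        sys.path.insert(0, repo_dir)
    return repo_dir


# Usage, assuming the clone from the step above lives in ./LimiX:
#   add_repo_to_path("LimiX")
#   from inference.predictor import LimiXPredictor
```

Running the script from inside the cloned directory has the same effect without any path changes. Note that the sample also passes the relative path `config/cls_default_retrieval.json`, which presumably resolves against the working directory rather than `sys.path`.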
 
 
  # ➤ License
+ The code in the associated repository is open-sourced under the [Apache-2.0](LICENSE.txt) license, while the usage of the LimiX model weights is subject to the Model License. The LimiX weights are fully available for academic research and may be used commercially upon obtaining proper authorization.
 
  # ➤ Citation
+ ```bibtex
+ @article{LimiX-2M,
+ title={LimiX-2M: Mitigating Low-Rank Collapse and Attention Bottlenecks in Tabular Foundation Models},
+ author={Zhang, Xingxuan and others},
+ journal={arXiv preprint arXiv:2606.04485},
+ year={2026}
+ }
+
  @article{LimiX,
+ title={LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence},
+ author={Zhang, Xingxuan and Ren, Gang and Yu, Han and Yuan, Hao and Wang, Hui and Li, Jiansheng and Wu, Jiayun and Mo, Lang and Mao, Li and Hao, Mingchao and others},
  journal={arXiv preprint arXiv:2509.03505},
  year={2025}
  }
+ ```