stableai-org committed on
Commit
744a3c7
·
verified ·
1 Parent(s): 06e7608

Upload 12 files

.gitattributes CHANGED
@@ -43,3 +43,12 @@ figures/media/image5.png filter=lfs diff=lfs merge=lfs -text
43
  figures/media/image6.png filter=lfs diff=lfs merge=lfs -text
44
  figures/media/image7.png filter=lfs diff=lfs merge=lfs -text
45
  figures/media/image8.png filter=lfs diff=lfs merge=lfs -text
46
+ images/BCCO-CLS.png filter=lfs diff=lfs merge=lfs -text
47
+ images/BCCO-REG.png filter=lfs diff=lfs merge=lfs -text
48
+ images/CTR23-REG.png filter=lfs diff=lfs merge=lfs -text
49
+ images/image-1.png filter=lfs diff=lfs merge=lfs -text
50
+ images/image-3.png filter=lfs diff=lfs merge=lfs -text
51
+ images/image-4.png filter=lfs diff=lfs merge=lfs -text
52
+ images/image.png filter=lfs diff=lfs merge=lfs -text
53
+ images/TabArena-CLS.png filter=lfs diff=lfs merge=lfs -text
54
+ images/TabZilla-CLS.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,137 +1,189 @@
1
- <div align="center">
2
- <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/LimiX-Logo.png" alt="LimiX summary" width="89%">
3
- </div>
4
-
5
- # 💥 News
6
- - 2025-11-10: LimiX-2M is officially released! Compared to LimiX-16M, this smaller variant offers significantly lower GPU memory usage and faster inference speed. The retrieval mechanism has also been enhanced, further improving model performance while reducing both inference time and memory consumption.
7
- - 2025-08-29: LimiX V1.0 Released.
8
-
9
- # ⚡ Latest Results Compared with SOTA Models
10
- <div align="center">
11
- <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/BCCO-CLS.png" width="30%" style="display:inline-block;">
12
- <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/TabArena-CLS.png" width="30%" style="display:inline-block;">
13
- <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/TabZilla-CLS.png" width="30%" style="display:inline-block;">
14
- </div>
15
- <div align="center">
16
- <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/BCCO-REG.png" width="30%" style="display:inline-block;">
17
- <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/TabArena-REG.png" width="30%" style="display:inline-block;">
18
- <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/CTR23-REG.png" width="30%" style="display:inline-block;">
19
- </div>
20
-
21
-
22
- # ➤ Overview
23
- <div align="center">
24
- <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/LimiX_Summary.png" alt="LimiX summary" width="89%">
25
- </div>
26
- We introduce LimiX, the first installment of our LDM series. LimiX aims to push generality further: a single model that handles classification, regression, missing-value imputation, feature selection, sample selection, and causal inference under one training and inference recipe, advancing the shift from bespoke pipelines to unified, foundation-style tabular learning.
27
-
28
- LimiX adopts a transformer architecture optimized for structured data modeling and task generalization. The model first embeds features X and targets Y from the prior knowledge base into token representations. Within the core modules, attention mechanisms are applied across both sample and feature dimensions to identify salient patterns in key samples and features. The resulting high-dimensional representations are then passed to regression and classification heads, enabling the model to support diverse predictive tasks.
29
-
30
- For details, please refer to the technical report at the link: [LimiX:Unleashing Structured-Data Modeling Capability for Generalist Intelligence](https://arxiv.org/abs/2509.03505) or [LimiX_Technical_Report.pdf](https://github.com/limix-ldm/LimiX/blob/main/LimiX_Technical_Report.pdf).
31
-
32
- # ➤ Superior Performance
33
- The LimiX model achieved SOTA performance across multiple tasks.
34
-
35
- ## ➩ Classification (Tech Report)
36
- <div align="center">
37
- <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/Classifier.png" alt="Classification" width="80%">
38
- </div>
39
-
40
- ## ➩ Regression (Tech Report)
41
- <div align="center">
42
- <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/Regression.png" alt="Regression" width="60%">
43
- </div>
44
-
45
- ## ➩ Missing Values Imputation (Tech Report)
46
- <div align="center">
47
- <img src="https://github.com/limix-ldm/LimiX/raw/main/doc/MissingValueImputation.png" alt="Missing value imputation" width="80%">
48
- </div>
49
-
50
- # ➤ Tutorials
51
- ## ➩ Installation
52
- ### Option 1 (recommended): Use the Dockerfile
53
- Download [Dockerfile](https://github.com/limix-ldm/LimiX/blob/main/Dockerfile)
54
- ```bash
55
- docker build --network=host -t limix/infe:v1 --build-arg FROM_IMAGES=nvidia/cuda:12.2.0-base-ubuntu22.04 -f Dockerfile .
56
- ```
57
 
58
- ### Option 2: Build manually
59
- Download the prebuilt flash_attn files
60
- ```bash
61
- wget -O flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.0.post2/flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
62
- ```
63
- Install Python dependencies
64
- ```bash
65
- pip install python==3.12.7 torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1
66
- pip install flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
67
- pip install scikit-learn einops huggingface-hub matplotlib networkx numpy pandas scipy tqdm typing_extensions xgboost kditransform hyperopt
68
- ```
69
 
70
- ### Download source code
71
- ```bash
72
- git clone https://github.com/limix-ldm/LimiX.git
73
- cd LimiX
74
- ```
75
 
76
- # ➤ Inference
77
- LimiX supports tasks such as classification, regression, and missing value imputation
78
- ## ➩ Model download
79
- | Model size | Download link | Tasks supported |
80
- | --- | --- | --- |
81
- | LimiX-16M | [LimiX-16M.ckpt](https://huggingface.co/stableai-org/LimiX-16M/tree/main) | ✅ classification ✅regression ✅missing value imputation |
82
- | LimiX-2M | [LimiX-2M.ckpt](https://huggingface.co/stableai-org/LimiX-2M/tree/main) | ✅ classification ✅regression ✅missing value imputation |
83
-
84
- ## ➩ Interface description
85
-
86
- ### Model Creation
87
- ```python
88
- class LimiXPredictor:
89
- def __init__(self,
90
- device:torch.device,
91
- model_path:str,
92
- mix_precision:bool=True,
93
- inference_config: list|str,
94
- categorical_features_indices:List[int]|None=None,
95
- outlier_remove_std: float=12,
96
- softmax_temperature:float=0.9,
97
- task_type: Literal['Classification', 'Regression']='Classification',
98
- mask_prediction:bool=False,
99
- inference_with_DDP: bool = False,
100
- seed:int=0)
101
- ```
102
- | Parameter | Data Type | Description |
103
- |--------|----------|----------|
104
- | device | torch.device | The hardware that loads the model |
105
- | model_path | str | The path to the model that needs to be loaded |
106
- | mix_precision | bool | Whether to enable the mixed precision inference |
107
- | inference_config | list/str | Configuration file used for inference |
108
- | categorical_features_indices | list | The indices of categorical columns in the tabular data |
109
- | outlier_remove_std | float | The threshold is employed to remove outliers, defined as values that are multiples of the standard deviation |
110
- | softmax_temperature | float | The temperature used to control the behavior of softmax operator |
111
- | task_type | str | The task type which can be either "Classification" or "Regression" |
112
- | mask_prediction | bool | Whether to enable missing value imputation |
113
- | inference_with_DDP | bool | Whether to enable DDP during inference |
114
- | seed | int | The seed to control random states |
115
- ### Predict
116
- ```python
117
- def predict(self, x_train:np.ndarray, y_train:np.ndarray, x_test:np.ndarray) -> np.ndarray:
118
- ```
119
- | Parameter | Data Type | Description |
120
- | ------- | ---------- | ----------------- |
121
- | x_train | np.ndarray | The input features of the training set |
122
- | y_train | np.ndarray | The target variable of the training set |
123
- | x_test | np.ndarray | The input features of the test set |
124
-
125
- ## Inference Configuration File Description
126
- | Configuration File Name | Description | Difference |
127
- | ------- | ---------- | ----- |
128
- | cls_default_retrieval.json | Default **classification task** inference configuration file **with retrieval** | Better classification performance |
129
- | cls_default_noretrieval.json | Default **classification task** inference configuration file **without retrieval** | Faster speed, lower memory requirements |
130
- | reg_default_retrieval.json | Default **regression task** inference configuration file **with retrieval** | Better regression performance |
131
- | reg_default_noretrieval.json | Default **regression task** inference configuration file **without retrieval** | Faster speed, lower memory requirements |
132
- | reg_default_noretrieval_MVI.json | Default inference configuration file for **missing value imputation task** | |
133
-
134
- ## Ensemble Inference Based on Sample Retrieval
 
135
 
136
  For a detailed technical introduction to Ensemble Inference Based on Sample Retrieval, please refer to the [technical report](https://github.com/limix-ldm/LimiX/blob/main/LimiX_Technical_Report.pdf).
137
 
@@ -139,167 +191,141 @@ Considering inference speed and memory requirements, ensemble inference based on
139
 
140
  ### Classification Task
141
 
142
- ```
143
  python inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
144
  ```
145
 
146
  ### Regression Task
147
 
148
- ```
149
  python inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
150
  ```
151
 
152
  ### Customizing Data Preprocessing for Inference Tasks
 
153
  #### First, Generate the Inference Configuration File
154
 
155
- ```python
156
- generate_inference_config()
157
- ```
158
 
159
  ### Classification Task
 
160
  #### Single GPU or CPU
161
 
162
- ```
163
  python inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
164
  ```
165
 
166
  #### Multi-GPU Distributed Inference
167
 
168
- ```
169
  torchrun --nproc_per_node=8 inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data --inference_with_DDP
170
  ```
171
 
172
  ### Regression Task
 
173
  #### Single GPU or CPU
174
 
175
- ```
176
  python inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
177
  ```
178
 
179
  #### Multi-GPU Distributed Inference
180
 
181
- ```
182
  torchrun --nproc_per_node=8 inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data --inference_with_DDP
183
  ```
184
 
185
  ### Retrieval Optimization Project
 
186
  This project implements an optimized retrieval system. To achieve the best performance, we utilize Optuna for hyperparameter tuning of retrieval parameters.
 
187
  #### Installation
 
188
  Ensure you have the required dependencies installed:
189
- ```
 
190
  pip install optuna
191
  ```
 
192
  #### Usage
 
193
  For standard inference using pre-optimized parameters, refer to the code below:
194
- ```
 
195
  searchInference = RetrievalSearchHyperparameters(
196
  dict(device_id=0,model_path=model_path), X_train, y_train, X_test, y_test,
197
  )
198
  config, result = searchInference.search(n_trials=10, metric="AUC",
199
  inference_config='config/cls_default_retrieval.json',task_type="cls")
200
  ```
 
201
  This will launch an Optuna study to find the best combination of retrieval parameters for your specific dataset and use case.
202
 
203
- ## ➩ Classification
204
- ```python
205
- from sklearn.datasets import load_breast_cancer
206
- from sklearn.metrics import accuracy_score, roc_auc_score
207
- from sklearn.model_selection import train_test_split
208
- from huggingface_hub import hf_hub_download
209
- import numpy as np
210
- import os, sys
211
 
212
- os.environ["RANK"] = "0"
213
- os.environ["WORLD_SIZE"] = "1"
214
- os.environ["MASTER_ADDR"] = "127.0.0.1"
215
- os.environ["MASTER_PORT"] = "29500"
216
 
217
- ROOT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
218
- if ROOT_DIR not in sys.path:
219
- sys.path.insert(0, ROOT_DIR)
220
- from inference.predictor import LimiXPredictor
221
 
222
- X, y = load_breast_cancer(return_X_y=True)
223
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
224
 
225
- model_file = hf_hub_download(repo_id="stableai-org/LimiX-16M", filename="LimiX-16M.ckpt", local_dir="./cache")
226
 
227
- clf = LimiXPredictor(device='cuda', model_path=model_file, inference_config='config/cls_default_retrieval.json')
228
- prediction = clf.predict(X_train, y_train, X_test)
229
 
230
- print("roc_auc_score:", roc_auc_score(y_test, prediction[:, 1]))
231
- print("accuracy_score:", accuracy_score(y_test, np.argmax(prediction, axis=1)))
232
- ```
233
- For additional examples, refer to [inference_classifier.py](./inference_classifier.py)
234
-
235
- ## ➩ Regression
236
- ```python
237
- from functools import partial
238
-
239
- from sklearn.datasets import fetch_california_housing
240
- from sklearn.model_selection import train_test_split
241
- from sklearn.metrics import r2_score
242
- from huggingface_hub import hf_hub_download
243
- try:
244
- from sklearn.metrics import root_mean_squared_error as mean_squared_error
245
- except:
246
- from sklearn.metrics import mean_squared_error
247
- mean_squared_error = partial(mean_squared_error, squared=False)
248
- import os, sys
249
-
250
- os.environ["RANK"] = "0"
251
- os.environ["WORLD_SIZE"] = "1"
252
- os.environ["MASTER_ADDR"] = "127.0.0.1"
253
- os.environ["MASTER_PORT"] = "29500"
254
-
255
- ROOT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
256
- if ROOT_DIR not in sys.path:
257
- sys.path.insert(0, ROOT_DIR)
258
- from inference.predictor import LimiXPredictor
259
-
260
- house_data = fetch_california_housing()
261
- X, y = house_data.data, house_data.target
262
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
263
-
264
- y_mean = y_train.mean()
265
- y_std = y_train.std()
266
- y_train_normalized = (y_train - y_mean) / y_std
267
- y_test_normalized = (y_test - y_mean) / y_std
268
-
269
- model_path = hf_hub_download(repo_id="stableai-org/LimiX-16M", filename="LimiX-16M.ckpt", local_dir="./cache")
270
-
271
- model = LimiXPredictor(device='cuda', model_path=model_path, inference_config='config/reg_default_retrieval.json')
272
- y_pred = model.predict(X_train, y_train_normalized, X_test)
273
-
274
- # Compute RMSE and R²
275
- y_pred = y_pred.to('cpu').numpy()
276
- rmse = mean_squared_error(y_test_normalized, y_pred)
277
- r2 = r2_score(y_test_normalized, y_pred)
278
-
279
- print(f'RMSE: {rmse}')
280
- print(f'R2: {r2}')
281
- ```
282
- For additional examples, refer to [inference_regression.py](https://github.com/limix-ldm/LimiX/raw/main/inference_regression.py)
283
 
284
- ## ➩ Missing value imputation
285
- For the demo file, see [demo_missing_value_imputation.py](https://github.com/limix-ldm/LimiX/raw/main/examples/inference_regression.py)
 
 
 
 
286
 
287
- # Link
288
- - LimiX:Unleashing Structured-Data Modeling Capability for Generalist Intelligence: [LimiX:Unleashing Structured-Data Modeling Capability for Generalist Intelligence](https://arxiv.org/abs/2509.03505)
289
- - LimiX Technical Report: [LimiX_Technical_Report.pdf](https://github.com/limix-ldm/LimiX/blob/main/LimiX_Technical_Report.pdf)
290
- - Detailed instructions for using Limix: [Visit the official Limix documentation](https://www.limix.ai/doc/)
291
- - Balance Comprehensive Challenging Omni-domain Classification Benchmark: [bcco_cls](https://huggingface.co/datasets/stableai-org/bcco_cls)
292
- - Balance Comprehensive Challenging Omni-domain Regression Benchmark: [bcco_reg](https://huggingface.co/datasets/stableai-org/bcco_reg)
293
 
294
- # ➤ License
295
- The code in this repository is open-sourced under the [Apache-2.0](LICENSE.txt) license, while the usage of the LimiX model weights is subject to the Model License. The LimiX weights are fully available for academic research and may be used commercially upon obtaining proper authorization.
296
 
297
- # ➤ Citation
298
- ```
299
- @article{LimiX,
300
- title={LimiX:Unleashing Structured-Data Modeling Capability for Generalist Intelligence},
301
- author={LimiXTeam},
302
- journal={arXiv preprint arXiv:2509.03505},
303
- year={2025}
304
- }
305
- ```
1
+ # 1. Model Introduction
2
 
3
+ **LimiX** is a new class of tabular AI models designed to overcome one of modern machine learning’s longest-standing bottlenecks: structured data. With only **2M parameters**, **LimiX-2M** sets a new state of the art across classification, regression, and missing-value imputation, surpassing XGBoost, CatBoost, AutoGluon, and TabPFN, and approaching the performance of the larger LimiX-16M. Its lightweight, training-free design makes advanced tabular modeling accessible on ordinary hardware while preserving full transparency and offline deployability.
4
 
5
+ ![](images/BCCO-CLS.png)
 
 
 
 
6
 
7
+ ![](images/TabArena-CLS.png)
8
+
9
+ ![](images/TabZilla-CLS.png)
10
+
11
+ ![](images/BCCO-REG.png)
12
+
13
+ ![](images/TabArena-REG.png)
14
+
15
+ ![](images/CTR23-REG.png)
16
+
17
+
18
+
19
+ **Key Features**
20
+
21
+ **Unified Tabular Reasoning:**
22
+
23
+ Designed end to end for multi-task tabular intelligence: a single model handles classification, regression, and imputation without additional tuning, preprocessing, or task-specific fine-tuning.
24
+
25
+ **Training-Free, Context-Driven Inference:**
26
+
27
+ Operates purely through in-context learning: no training, no hyperparameter tuning, no preprocessing pipelines. LimiX interprets and processes raw tabular inputs automatically for immediate use.
28
+
29
+ **Lightweight & Efficient Deployment:**
30
+
31
+ A compact 2M-parameter architecture enables fast inference and smooth operation on standard CPUs and laptops, dramatically reducing compute requirements for advanced tabular modeling.
32
+
33
+
34
+
35
+ # 2. Model Architecture & Pretraining Procedures
36
+
37
+ LimiX adopts a 12-block transformer architecture with axis-wise attention over features and samples, supported by pre-normalized LayerNorm for stable scaling. The LimiX-16M variant uses an asymmetric design (two feature-axis passes and one sample-axis pass per block) to strengthen feature-interaction modeling in heterogeneous schemas with minimal overhead.
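The axis-wise attention pattern can be sketched shape-wise in NumPy. This toy single-head version (no projections, multiple heads, or LayerNorm, all of which the real blocks add) only shows how the same attention op alternates between the feature axis and the sample axis:

```python
import numpy as np

def attend(x):
    # single-head scaled dot-product self-attention over axis 0 of x: (seq, dim)
    scores = x @ x.T / np.sqrt(x.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ x

rng = np.random.default_rng(0)
n, f, d = 32, 10, 8                       # samples, features, embedding dim
x = rng.normal(size=(n, f, d))

# feature-axis pass: within each sample, feature tokens attend to each other
x = np.stack([attend(x[i]) for i in range(n)])
# sample-axis pass: within each feature, sample tokens attend to each other
x = np.stack([attend(x[:, j]) for j in range(f)], axis=1)

print(x.shape)
```

Both passes preserve the `(samples, features, dim)` token grid, which is what lets the blocks be stacked and repeated asymmetrically.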
38
+
39
+ To learn the joint distribution of tabular variables, LimiX is pretrained through Context-Conditional Masked Modeling (CCMM). By masking table cells and conditioning predictions on a small set of context rows, the model internalizes a wide range of conditional dependencies while adapting to new datasets without training or labels.
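A toy illustration of the CCMM objective's shape, assuming nothing about LimiX internals: cells are masked, and each masked cell is predicted from the same column of a few unmasked context rows (a simple column mean stands in for the transformer's prediction):

```python
import numpy as np

rng = np.random.default_rng(0)
table = rng.normal(size=(8, 4))          # 8 samples, 4 features
mask = rng.random(table.shape) < 0.25    # cells to hide

context_rows = table[:5]                 # rows the prediction conditions on
masked = np.where(mask, np.nan, table)   # masked view of the table

# predict each masked cell from the corresponding context column
col_means = context_rows.mean(axis=0)
imputed = np.where(mask, col_means, masked)

print(np.isnan(imputed).sum())  # 0: every masked cell received a prediction
```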
40
+
41
+ ![](images/image.png)
42
+
43
+ # 3. Evaluation Results
44
+
45
+ ## Classification
46
+
47
+ ![](images/image-1.png)
48
+
49
+ On the BCCO-CLS benchmark, LimiX-16M establishes leading performance, significantly outperforming AutoGluon and all PFN variants in mean AUC, Accuracy, and F1, with substantially better ranks. LimiX-2M also leads these models on most metrics, the exception being its AUC rank.
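The three metrics reported here (AUC, Accuracy, F1) can be reproduced for any model's output with scikit-learn; toy predictions stand in for model output below:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score

y_true = np.array([0, 1, 1, 0, 1, 0])
y_prob = np.array([0.2, 0.8, 0.6, 0.3, 0.9, 0.4])  # predicted P(class=1)
y_pred = (y_prob >= 0.5).astype(int)               # hard labels at 0.5

print("AUC:", roc_auc_score(y_true, y_prob))       # uses probabilities
print("Accuracy:", accuracy_score(y_true, y_pred)) # uses hard labels
print("F1:", f1_score(y_true, y_pred))
```

Note AUC is threshold-free (it scores the probabilities directly), while Accuracy and F1 depend on the 0.5 cutoff.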
50
+
51
+ ## Regression
52
+
53
+ ![](images/image-2.png)
54
+
55
+ LimiX-16M achieves the best overall scores and rankings on TALENT-REG, with the PFN models and LimiX-2M emerging as close runners-up in both R² and RMSE.
56
+
57
+ ## Missing Value Imputation
58
+
59
+ LimiX introduces the first training-free, in-context approach for missing-value imputation on entirely new datasets. Across a wide set of real-world benchmarks, LimiX-16M delivers the best performance, achieving lower RMSE and error rates than classical and learned imputers including KNN, MICE, MissForest, GAIN, and MIWAE. Unlike all prior methods, which depend on additional fitting, LimiX performs imputation directly from context with consistently superior accuracy.
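For context, imputation benchmarks of this kind are scored by hiding known cells and measuring reconstruction error on them. The sketch below does exactly that, with scikit-learn's `KNNImputer` (one of the classical baselines named above) standing in for the model being evaluated:

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
X_true = rng.normal(size=(100, 5))
mask = rng.random(X_true.shape) < 0.1    # hide 10% of cells

X_missing = X_true.copy()
X_missing[mask] = np.nan

# impute, then score only the cells that were hidden
X_imputed = KNNImputer(n_neighbors=5).fit_transform(X_missing)
rmse = np.sqrt(np.mean((X_imputed[mask] - X_true[mask]) ** 2))
print(f"imputation RMSE on hidden cells: {rmse:.3f}")
```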
60
+
61
+ ![](images/image-3.png)
62
+
63
+ ## Fine-tuning
64
+
65
+ Using an attention-based retrieval–guided downsampling strategy, LimiX-16M fine-tunes on compact, highly relevant in-context episodes rather than full long contexts, substantially improving sample efficiency and reducing training cost. This approach enables LimiX-16M to significantly outperform strong baselines such as TabDPT and TabPFN-v2, with notable AUC gains across BCCO-CLS datasets.
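A minimal sketch of what retrieval-guided downsampling does, with plain cosine similarity standing in for the attention-based relevance scores described above:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 8))   # full training pool
X_query = rng.normal(size=(3, 8))     # rows we want episodes for
k = 16                                # episode size

def topk_context(query, pool, k):
    # cosine similarity between one query row and every pool row
    sims = pool @ query / (np.linalg.norm(pool, axis=1) * np.linalg.norm(query))
    return pool[np.argsort(sims)[-k:]]   # keep the k most similar rows

# each query gets a compact, highly relevant in-context episode
episodes = [topk_context(q, X_train, k) for q in X_query]
print(len(episodes), episodes[0].shape)
```

Fine-tuning then runs on these `(k, features)` episodes instead of the full 200-row context, which is what yields the sample-efficiency and cost savings claimed above.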
66
+
67
+ ![](images/image-4.png)
68
+
69
+ # 4. Deployment
70
+
71
+ 1. **Environment Preparation**
72
+
73
+ * Deploying with Docker is recommended. Download the [Dockerfile](https://github.com/limix-ldm/LimiX/blob/main/Dockerfile) from the repository and build the image with:
74
+
75
+ ```bash
76
+ docker build --network=host -t limix/infe:v1 --build-arg FROM_IMAGES=nvidia/cuda:12.2.0-base-ubuntu22.04 -f Dockerfile .
77
+ ```
78
+
79
+ * For manual deployment, install dependencies:
80
+
81
+ ```bash
82
+ # Download precompiled flash_attn file
83
+ wget -O flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.0.post2/flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
84
+ # Install basic dependencies
85
+ pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1  # requires a Python 3.12 environment
86
+ pip install flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
87
+ pip install scikit-learn einops huggingface-hub matplotlib networkx numpy pandas scipy tqdm typing_extensions xgboost kditransform hyperopt
88
+ ```
89
+
90
+ 2. **Model Download**
91
+
92
+ * Download model weights via Hugging Face Hub:
93
+
94
+ ```python
95
+ from huggingface_hub import hf_hub_download
96
+ model_file = hf_hub_download(repo_id="stableai-org/LimiX-16M", filename="LimiX-16M.ckpt", local_dir="./cache")
97
+ ```
98
+
99
+ ***
100
+
101
+
102
+
103
+ # 5. Model Usage
104
+
105
+ 1. **Classification Task Example**
106
+
107
+ ```python
108
+ from sklearn.datasets import load_breast_cancer
109
+ from sklearn.metrics import accuracy_score, roc_auc_score
110
+ from sklearn.model_selection import train_test_split
111
+ from huggingface_hub import hf_hub_download
112
+ import numpy as np
113
+ import os, sys
114
+
115
+ os.environ["RANK"] = "0"
116
+ os.environ["WORLD_SIZE"] = "1"
117
+ os.environ["MASTER_ADDR"] = "127.0.0.1"
118
+ os.environ["MASTER_PORT"] = "29500"
119
+
120
+ ROOT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
121
+ if ROOT_DIR not in sys.path:
122
+ sys.path.insert(0, ROOT_DIR)
123
+ from inference.predictor import LimiXPredictor
124
+
125
+ X, y = load_breast_cancer(return_X_y=True)
126
+ X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
127
+
128
+ model_file = hf_hub_download(repo_id="stableai-org/LimiX-16M", filename="LimiX-16M.ckpt", local_dir="./cache")
129
+
130
+ clf = LimiXPredictor(device='cuda', model_path=model_file, inference_config='config/cls_default_retrieval.json')
131
+ prediction = clf.predict(X_train, y_train, X_test)
132
+
133
+ print("roc_auc_score:", roc_auc_score(y_test, prediction[:, 1]))
134
+ print("accuracy_score:", accuracy_score(y_test, np.argmax(prediction, axis=1)))
135
+ ```
136
+
137
+ 2. **Regression Task Example**
138
+
139
+ ```python
140
+ from functools import partial
141
+
142
+ from sklearn.datasets import fetch_california_housing
143
+ from sklearn.model_selection import train_test_split
144
+ from sklearn.metrics import r2_score
145
+ from huggingface_hub import hf_hub_download
146
+ try:
147
+ from sklearn.metrics import root_mean_squared_error as mean_squared_error
148
+ except:
149
+ from sklearn.metrics import mean_squared_error
150
+ mean_squared_error = partial(mean_squared_error, squared=False)
151
+ import os, sys
152
+
153
+ os.environ["RANK"] = "0"
154
+ os.environ["WORLD_SIZE"] = "1"
155
+ os.environ["MASTER_ADDR"] = "127.0.0.1"
156
+ os.environ["MASTER_PORT"] = "29500"
157
+
158
+ ROOT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
159
+ if ROOT_DIR not in sys.path:
160
+ sys.path.insert(0, ROOT_DIR)
161
+ from inference.predictor import LimiXPredictor
162
+
163
+ house_data = fetch_california_housing()
164
+ X, y = house_data.data, house_data.target
165
+ X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
166
+
167
+ y_mean = y_train.mean()
168
+ y_std = y_train.std()
169
+ y_train_normalized = (y_train - y_mean) / y_std
170
+ y_test_normalized = (y_test - y_mean) / y_std
171
+
172
+ model_path = hf_hub_download(repo_id="stableai-org/LimiX-16M", filename="LimiX-16M.ckpt", local_dir="./cache")
173
+
174
+ model = LimiXPredictor(device='cuda', model_path=model_path, inference_config='config/reg_default_retrieval.json')
175
+ y_pred = model.predict(X_train, y_train_normalized, X_test)
176
+
177
+ # Compute RMSE and R²
178
+ y_pred = y_pred.to('cpu').numpy()
179
+ rmse = mean_squared_error(y_test_normalized, y_pred)
180
+ r2 = r2_score(y_test_normalized, y_pred)
181
+
182
+ print(f'RMSE: {rmse}')
183
+ print(f'R2: {r2}')
184
+ ```
185
+
186
+ ## Ensemble Inference Based on Sample Retrieval
187
 
188
  For a detailed technical introduction to Ensemble Inference Based on Sample Retrieval, please refer to the [technical report](https://github.com/limix-ldm/LimiX/blob/main/LimiX_Technical_Report.pdf).
189
 
 
191
 
192
  ### Classification Task
193
 
194
+ ```bash
195
  python inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
196
  ```
197
 
198
  ### Regression Task
199
 
200
+ ```bash
201
  python inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
202
  ```
203
 
204
  ### Customizing Data Preprocessing for Inference Tasks
205
+
206
  #### First, Generate the Inference Configuration File
207
 
208
+ ```python
+ generate_inference_config()
+ ```
 
 
209
 
210
  ### Classification Task
211
+
212
  #### Single GPU or CPU
213
 
214
+ ```bash
215
  python inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
216
  ```
217
 
218
  #### Multi-GPU Distributed Inference
219
 
220
+ ```bash
221
  torchrun --nproc_per_node=8 inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data --inference_with_DDP
222
  ```
223
 
224
  ### Regression Task
225
+
226
  #### Single GPU or CPU
227
 
228
+ ```bash
229
  python inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
230
  ```
231
 
232
  #### Multi-GPU Distributed Inference
233
 
234
+ ```bash
235
  torchrun --nproc_per_node=8 inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data --inference_with_DDP
236
  ```
237
 
238
  ### Retrieval Optimization Project
239
+
240
  This project implements an optimized retrieval system. To achieve the best performance, we utilize Optuna for hyperparameter tuning of retrieval parameters.
241
+
242
  #### Installation
243
+
244
  Ensure you have the required dependencies installed:
245
+
246
+ ```bash
247
  pip install optuna
248
  ```
249
+
250
  #### Usage
251
+
252
  For standard inference using pre-optimized parameters, refer to the code below:
253
+
254
+ ```python
255
  searchInference = RetrievalSearchHyperparameters(
256
  dict(device_id=0,model_path=model_path), X_train, y_train, X_test, y_test,
257
  )
258
  config, result = searchInference.search(n_trials=10, metric="AUC",
259
  inference_config='config/cls_default_retrieval.json',task_type="cls")
260
  ```
261
+
262
  This will launch an Optuna study to find the best combination of retrieval parameters for your specific dataset and use case.
263
 
264
+ ***
 
 
 
 
 
 
 
265
 
 
 
 
 
266
 
 
 
 
 
267
 
268
+ # 6. Tool Invocation
 
269
 
270
+ The LimiX model can integrate with various toolchains for extended functionality:
271
 
272
+ * **Data Processing Tools**: Integrates with `pandas` and `scikit-learn` for data cleaning, feature engineering, and result evaluation (e.g., `r2_score`, `mean_squared_error`).
 
273
 
274
+ * **Hyperparameter Optimization Tools**: Retrieval parameters can be optimized via the `hyperopt` library, for example:
 
275
 
276
+ ```python
277
+ # Hyperparameter search example (refer to inference_regression.py)
278
+ from utils.inference_utils import sample_inferece_params
279
+ hyperopt_config, base_config = sample_inferece_params(rng, 2, 4)
280
+ model.set_inference_config(inference_config=hyperopt_config, **base_config)
281
+ ```
282
 
283
+ * **Distributed Inference**: Supports DDP (Distributed Data Parallel) mode for multi-GPU acceleration via `torch.distributed`.
 
 
 
 
 
284
 
285
+ ***
 
286
 
287
+
288
+
289
+ # 7. License
290
+
291
+ 1. **Code License**: The repository code is licensed under the [Apache-2.0 License](LICENSE.txt), allowing commercial use and secondary development with retention of the original copyright notice.
292
+
293
+ 2. **Model Weight License**: The use of LimiX model weights is subject to a separate Model License:
294
+
295
+ * Fully open for academic research; no authorization required.
296
+
297
+ * Commercial use requires official authorization (refer to the license application process on the [StableAI official website](https://www.stable-ai.ai/)).
298
+
299
+ ***
300
+
301
+
302
+
303
+ # 8. Third-Party Notices
304
+
305
+ This project uses the following third-party components, whose usage is governed by their respective licenses:
306
+
307
+ * **PyTorch**: BSD-style license
308
+
309
+ * **scikit-learn**: BSD license
310
+
311
+ * **flash-attention**: MIT License
312
+
313
+ * **Hugging Face Hub**: Apache-2.0 License
314
+
315
+ * For the complete list of dependencies and license information, refer to `requirements.txt` and the official documentation of each component.
316
+
317
+ ***
318
+
319
+
320
+
321
+ # 9. Contact Us
322
+
323
+ * **Official Documentation**: <https://www.limix.ai/doc/>
324
+
325
+ * **GitHub Repository**: <https://github.com/limix-ldm/LimiX> (Submit issues for questions)
326
+
327
+ * **Official Website**: <https://www.stable-ai.ai/> (For commercial cooperation and license inquiries)
328
+
329
+ * **Technical Report**: [LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence](https://arxiv.org/abs/2509.03505)
330
+
331
+ ***
images/BCCO-CLS.png ADDED

Git LFS Details

  • SHA256: 2821150d1ccd6ded0082ef80a3c32a4fab6f1e1c012fe5a55d07e60885594d8f
  • Pointer size: 131 Bytes
  • Size of remote file: 104 kB
images/BCCO-REG.png ADDED

Git LFS Details

  • SHA256: 718306dd3fbc44cd9d39d13938d67672692bfc5135cf15f439d8a8e8e948600c
  • Pointer size: 131 Bytes
  • Size of remote file: 102 kB
images/CTR23-REG.png ADDED

Git LFS Details

  • SHA256: d7064053579eca46ecf9a65fc52572c58c3222f6a2daf83ea02460a9785c358a
  • Pointer size: 131 Bytes
  • Size of remote file: 100 kB
images/TabArena-CLS.png ADDED

Git LFS Details

  • SHA256: 6e0889d5e43fd59d2bccfe22bfb2af7cb0b3465fa7bf449bfb532f9ad01d4a8e
  • Pointer size: 131 Bytes
  • Size of remote file: 105 kB
images/TabArena-REG.png ADDED
images/TabZilla-CLS.png ADDED

Git LFS Details

  • SHA256: d5f3dfb054f64cfeaf53e85ae799a6291aaee73d5620cd0c5ee84c8a6cae8ffa
  • Pointer size: 131 Bytes
  • Size of remote file: 105 kB
images/image-1.png ADDED

Git LFS Details

  • SHA256: 01302c57b9fefce9e157bb2c78cedb577e314acbe98b742a6f3588f907b4fb95
  • Pointer size: 131 Bytes
  • Size of remote file: 152 kB
images/image-2.png ADDED
images/image-3.png ADDED

Git LFS Details

  • SHA256: a56da429b53a15278f730d7bc4254c4579cbfd8cb52ef87738d62f9f6610f7a8
  • Pointer size: 131 Bytes
  • Size of remote file: 201 kB
images/image-4.png ADDED

Git LFS Details

  • SHA256: 01ba90cbbc80b3e7e0861cc0f5830bb93c36ecc889884ec7628c3c18e6e8f604
  • Pointer size: 131 Bytes
  • Size of remote file: 125 kB
images/image.png ADDED

Git LFS Details

  • SHA256: 4a79e90bfb6d66f2b31f13979e47068585ccc65124bf499f1e6ccca44c4db318
  • Pointer size: 131 Bytes
  • Size of remote file: 239 kB