# 1. Model Introduction

**LimiX** is a new class of tabular AI model designed to overcome one of modern machine learning’s longest-standing bottlenecks: modeling structured data. With only **2M parameters**, **LimiX-2M** sets a new state of the art across classification, regression, and missing-value imputation, surpassing XGBoost, CatBoost, AutoGluon, and TabPFN and approaching the performance of the larger LimiX-16M. Its lightweight, training-free design makes advanced tabular modeling accessible on ordinary hardware while preserving full transparency and offline deployability.

![](images/image.png)





**Key Features**

**Unified Tabular Reasoning:**

End-to-end designed for multi-task tabular intelligence, enabling a single model to handle classification, regression, and imputation without additional tuning, preprocessing, or task-specific fine-tuning.

**Training-Free, Context-Driven Inference:**

Operates purely through in-context learning: no training, no hyperparameter tuning, no preprocessing pipelines. LimiX automatically interprets and processes raw tabular inputs for immediate use.

**Lightweight & Efficient Deployment:**

A compact 2M-parameter architecture enables fast inference and smooth operation on standard CPUs and laptops, dramatically reducing compute requirements for advanced tabular modeling.



# 2. Model Architecture & Pretraining Procedures

LimiX adopts a 12-block transformer architecture with axis-wise attention over features and samples, supported by pre-normalization LayerNorm for stable scaling. The LimiX-16M variant uses an asymmetric design (two feature-axis passes and one sample-axis pass per block) to strengthen feature-interaction modeling in heterogeneous schemas with minimal overhead.
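The axis-wise attention pattern can be sketched in a few lines of NumPy. This is a toy single-head illustration of the idea, not the actual LimiX implementation; all function and variable names here are ours:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(x, axis):
    """Single-head self-attention along one axis of a
    (samples, features, dim) tensor. axis=0 mixes samples,
    axis=1 mixes features; the other axis acts as a batch."""
    if axis == 0:
        x = x.transpose(1, 0, 2)            # (features, samples, dim)
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    out = softmax(scores, axis=-1) @ x      # attend within the chosen axis
    if axis == 0:
        out = out.transpose(1, 0, 2)        # back to (samples, features, dim)
    return out

rng = np.random.default_rng(0)
table = rng.normal(size=(8, 5, 16))         # 8 samples, 5 features, dim 16

# One asymmetric block in the LimiX-16M style:
# two feature-axis passes, then one sample-axis pass
h = axis_attention(table, axis=1)
h = axis_attention(h, axis=1)
h = axis_attention(h, axis=0)
print(h.shape)                              # shape is preserved: (8, 5, 16)
```

The key point is that each pass only mixes information along one axis, so the cost of feature mixing scales with the number of features rather than the full table size.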

To learn the joint distribution of tabular variables, LimiX is pretrained through Context-Conditional Masked Modeling (CCMM). By masking table cells and conditioning predictions on a small set of context rows, the model internalizes a wide range of conditional dependencies while adapting to new datasets without training or labels.
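A toy version of the CCMM objective can make the setup concrete: mask cells in query rows and predict them conditioned on unmasked context rows. In the sketch below the "model" is just the per-column context mean, the simplest context-conditional predictor; the real model replaces it with the transformer:

```python
import numpy as np

rng = np.random.default_rng(1)
table = rng.normal(size=(100, 6))        # a small synthetic table

# Split rows into context (fully observed) and query (to be predicted)
context, query = table[:80], table[80:]

# Mask a random 30% of cells in the query rows
mask = rng.random(query.shape) < 0.3
corrupted = query.copy()
corrupted[mask] = np.nan

# CCMM-style prediction: fill each masked cell conditioned on the context.
# Here that conditioning is simply the column mean of the context rows.
pred = np.where(mask, context.mean(axis=0), corrupted)

mse = np.mean((pred[mask] - query[mask]) ** 2)
print(f"masked-cell MSE: {mse:.3f}")
```

Pretraining minimizes this kind of masked-cell loss over many synthetic tables, which is what lets the model adapt to a new dataset from context alone.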

![](images/image-5.png)

# 3.  Evaluation Results

## Classification

![](images/image-4.png)

On the BCCO-CLS benchmark, LimiX-16M establishes leading performance, significantly outperforming AutoGluon and all PFN variants in mean AUC, accuracy, and F1, with substantially better ranks. LimiX-2M also leads these models on most metrics, the exception being its AUC rank.

## Regression

![](images/image-3.png)

LimiX-16M achieves the best overall scores and rankings on TALENT-REG, with the PFN models and LimiX-2M emerging as close runners-up in both R² and RMSE.

## Missing Value Imputation

LimiX introduces the first training-free, in-context approach for missing-value imputation on entirely new datasets. Across a wide set of real-world benchmarks, LimiX-16M delivers the best performance, achieving lower RMSE and error rates than classical and learned imputers including KNN, MICE, MissForest, GAIN, and MIWAE. Unlike all prior methods, which depend on additional fitting, LimiX performs imputation directly from context with consistently superior accuracy.
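The RMSE-based evaluation protocol used here can be reproduced for any imputer by hiding known cells and scoring the reconstruction. A minimal NumPy sketch with a column-mean baseline (the function names are ours, not the benchmark code):

```python
import numpy as np

def evaluate_imputer(impute, X, mask_rate=0.2, seed=0):
    """Hide a fraction of known cells, impute, and return RMSE
    on the hidden cells only."""
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < mask_rate
    X_missing = X.copy()
    X_missing[mask] = np.nan
    X_hat = impute(X_missing)
    return float(np.sqrt(np.mean((X_hat[mask] - X[mask]) ** 2)))

def mean_impute(X):
    # Column-mean imputation: the simplest classical baseline
    col_mean = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_mean, X)

X = np.random.default_rng(42).normal(size=(200, 8))
print("mean-imputer RMSE:", evaluate_imputer(mean_impute, X))
```

Swapping `mean_impute` for KNN, MICE, or a context-conditioned model keeps the protocol identical, which is what makes the cross-method RMSE comparison meaningful.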

![](images/image-1.png)

## Fine-tuning

Using an attention-based retrieval–guided downsampling strategy, LimiX-16M fine-tunes on compact, highly relevant in-context episodes rather than full long contexts, substantially improving sample efficiency and reducing training cost. This approach enables LimiX-16M to significantly outperform strong baselines such as TabDPT and TabPFN-v2, with notable AUC gains across BCCO-CLS datasets.
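The retrieval idea behind this downsampling can be sketched simply: for each query row, select only its nearest training rows as a compact, query-specific context instead of the full table. The toy version below uses Euclidean distance; the actual LimiX retriever is attention-based:

```python
import numpy as np

def retrieve_context(X_train, x_query, k=32):
    """Return indices of the k training rows closest to the query,
    forming a compact context episode for that query."""
    d = np.linalg.norm(X_train - x_query, axis=1)
    return np.argsort(d)[:k]

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 10))
x_query = rng.normal(size=(10,))

idx = retrieve_context(X_train, x_query, k=32)
print(len(idx))   # 32 retrieved rows instead of the full 1000
```

Fine-tuning on such episodes keeps each training step short while still exposing the model to the rows most relevant to each prediction.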

![](images/image-2.png)

# 4. Deployment

1. **Environment Preparation**

   * Deployment with Docker is recommended. Download the [Dockerfile](https://github.com/limix-ldm/LimiX/blob/main/Dockerfile) from the repository and build the image with:

     ```bash
     docker build --network=host -t limix/infe:v1 --build-arg FROM_IMAGES=nvidia/cuda:12.2.0-base-ubuntu22.04 -f Dockerfile .
     ```

   * For manual deployment, install dependencies:

     ```bash
     # Download precompiled flash_attn file
     wget -O flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.0.post2/flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
     # Install basic dependencies
     # Requires an existing Python 3.12 environment (pip cannot install Python itself)
     pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1
     pip install flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
     pip install scikit-learn einops huggingface-hub matplotlib networkx numpy pandas scipy tqdm typing_extensions xgboost kditransform hyperopt
     ```

2. **Model Download**

   * Download model weights via Hugging Face Hub:

     ```python
     from huggingface_hub import hf_hub_download
     model_file = hf_hub_download(repo_id="stableai-org/LimiX-16M", filename="LimiX-16M.ckpt", local_dir="./cache")
     ```

***



# 5. Model Usage

1. **Classification Task Example**

   ```python
   from sklearn.datasets import load_breast_cancer
   from sklearn.metrics import accuracy_score, roc_auc_score
   from sklearn.model_selection import train_test_split
   from huggingface_hub import hf_hub_download
   import numpy as np
   import os, sys

   os.environ["RANK"] = "0"
   os.environ["WORLD_SIZE"] = "1"
   os.environ["MASTER_ADDR"] = "127.0.0.1"
   os.environ["MASTER_PORT"] = "29500"

   ROOT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
   if ROOT_DIR not in sys.path:
       sys.path.insert(0, ROOT_DIR)
   from inference.predictor import LimiXPredictor

   X, y = load_breast_cancer(return_X_y=True)
   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

   model_file = hf_hub_download(repo_id="stableai-org/LimiX-16M", filename="LimiX-16M.ckpt", local_dir="./cache")

   clf = LimiXPredictor(device='cuda', model_path=model_file, inference_config='config/cls_default_retrieval.json')
   prediction = clf.predict(X_train, y_train, X_test)

   print("roc_auc_score:", roc_auc_score(y_test, prediction[:, 1]))
   print("accuracy_score:", accuracy_score(y_test, np.argmax(prediction, axis=1)))
   ```

2. **Regression Task Example**

   ```python
   from functools import partial

   from sklearn.datasets import fetch_california_housing
   from sklearn.model_selection import train_test_split
   from sklearn.metrics import r2_score
   from huggingface_hub import hf_hub_download
   try:
       from sklearn.metrics import root_mean_squared_error as mean_squared_error
   except ImportError:  # older scikit-learn: fall back to mean_squared_error(squared=False)
       from sklearn.metrics import mean_squared_error
       mean_squared_error = partial(mean_squared_error, squared=False)
   import os, sys

   os.environ["RANK"] = "0"
   os.environ["WORLD_SIZE"] = "1"
   os.environ["MASTER_ADDR"] = "127.0.0.1"
   os.environ["MASTER_PORT"] = "29500"

   ROOT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
   if ROOT_DIR not in sys.path:
       sys.path.insert(0, ROOT_DIR)
   from inference.predictor import LimiXPredictor

   house_data = fetch_california_housing()
   X, y = house_data.data, house_data.target
   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

   y_mean = y_train.mean()
   y_std = y_train.std()
   y_train_normalized = (y_train - y_mean) / y_std
   y_test_normalized = (y_test - y_mean) / y_std

   model_path = hf_hub_download(repo_id="stableai-org/LimiX-16M", filename="LimiX-16M.ckpt", local_dir="./cache")

   model = LimiXPredictor(device='cuda', model_path=model_path, inference_config='config/reg_default_retrieval.json')
   y_pred = model.predict(X_train, y_train_normalized, X_test)    

   # Compute RMSE and R²
   y_pred = y_pred.to('cpu').numpy()
   rmse = mean_squared_error(y_test_normalized, y_pred)
   r2 = r2_score(y_test_normalized, y_pred)

   print(f'RMSE: {rmse}')
   print(f'R2: {r2}')
   ```
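   Note that the script above reports RMSE and R² on the normalized target scale. To report metrics in the original units, invert the z-score normalization first. A minimal self-contained sketch, with toy values standing in for `y_mean`, `y_std`, and the model's predictions:

   ```python
   import numpy as np

   # Toy stand-ins for the quantities computed in the script above
   y_train = np.array([1.0, 2.0, 3.0, 4.0])
   y_mean, y_std = y_train.mean(), y_train.std()

   y_pred_normalized = np.array([-1.0, 0.0, 1.0])

   # Invert the z-score normalization to recover original target units
   y_pred = y_pred_normalized * y_std + y_mean
   print(y_pred)
   ```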

## Ensemble Inference Based on Sample Retrieval

For a detailed technical introduction to Ensemble Inference Based on Sample Retrieval, please refer to the [technical report](https://github.com/limix-ldm/LimiX/blob/main/LimiX_Technical_Report.pdf).

Because of its inference speed and memory requirements, ensemble inference based on sample retrieval currently requires a GPU at least as capable as an NVIDIA RTX 4090.

### Classification Task

```bash
python inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
```

### Regression Task

```bash
python inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
```

### Customizing Data Preprocessing for Inference Tasks

#### First, Generate the Inference Configuration File

Call `generate_inference_config()` to produce the inference configuration file.

### Classification Task

#### Single GPU or CPU

```bash
python inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
```

#### Multi-GPU Distributed Inference

```bash
torchrun --nproc_per_node=8 inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data --inference_with_DDP
```

### Regression Task

#### Single GPU or CPU

```bash
python inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
```

#### Multi-GPU Distributed Inference

```bash
torchrun --nproc_per_node=8 inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data --inference_with_DDP
```

### Retrieval Optimization Project

This project implements an optimized retrieval system. To achieve the best performance, we utilize Optuna for hyperparameter tuning of retrieval parameters.

#### Installation

Ensure you have the required dependencies installed:

```bash
pip install optuna
```

#### Usage

To tune the retrieval parameters for your dataset, use the code below:

```python
searchInference = RetrievalSearchHyperparameters(
    dict(device_id=0, model_path=model_path), X_train, y_train, X_test, y_test,
)
config, result = searchInference.search(
    n_trials=10, metric="AUC",
    inference_config='config/cls_default_retrieval.json', task_type="cls",
)
```

This will launch an Optuna study to find the best combination of retrieval parameters for your specific dataset and use case.

***



# 6. Tool Invocation

The LimiX model can integrate with various toolchains for extended functionality:

* **Data Processing Tools**: Integrates with `pandas` and `scikit-learn` for data cleaning, feature engineering, and result evaluation (e.g., `r2_score`, `mean_squared_error`).

* **Hyperparameter Optimization Tools**: Optimize retrieval parameters via the `hyperopt` library, for example:

  ```python
  # Hyperparameter search example (refer to inference_regression.py)
  from utils.inference_utils import sample_inferece_params
  hyperopt_config, base_config = sample_inferece_params(rng, 2, 4)
  model.set_inference_config(inference_config=hyperopt_config, **base_config)
  ```

* **Distributed Inference**: Supports DDP (Distributed Data Parallel) mode for multi-GPU acceleration via `torch.distributed`.

***



# 7. License

1. **Code License**: The repository code is licensed under the [Apache-2.0 License](LICENSE.txt), which allows commercial use and secondary development provided the original copyright notice is retained.

2. **Model Weight License**: The use of LimiX model weights is subject to a separate Model License:

   * Fully open for academic research; no authorization required.

   * Commercial use requires official authorization (refer to the license application process on the [StableAI official website](https://www.stable-ai.ai/)).

***



# 8. Third-Party Notices

This project uses the following third-party components, whose usage is governed by their respective licenses:

* **PyTorch**: BSD-style license

* **scikit-learn**: BSD license

* **flash-attention**: MIT License

* **Hugging Face Hub**: Apache-2.0 License

* For the complete list of dependencies and license information, refer to `requirements.txt` and the official documentation of each component.

***



# 9. Contact Us

* **Official Documentation**: <https://www.limix.ai/doc/>

* **GitHub Repository**: <https://github.com/limix-ldm/LimiX> (Submit issues for questions)

* **Official Website**: <https://www.stable-ai.ai/> (For commercial cooperation and license inquiries)

* **Technical Report**: [LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence](https://arxiv.org/abs/2509.03505)

***