Upload README.md

Files changed:
- .gitattributes (+10, -0)
- README.md (+309, -283)
- figures/media/image1.png (+3, -0)
- figures/media/image10.png (+3, -0)
- figures/media/image11.png (+3, -0)
- figures/media/image2.png (+3, -0)
- figures/media/image3.png (+3, -0)
- figures/media/image4.png (+3, -0)
- figures/media/image5.png (+3, -0)
- figures/media/image6.png (+3, -0)
- figures/media/image7.png (+3, -0)
- figures/media/image8.png (+3, -0)
- figures/media/image9.png (+0, -0)
.gitattributes
CHANGED

@@ -33,3 +33,13 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+figures/media/image1.png filter=lfs diff=lfs merge=lfs -text
+figures/media/image10.png filter=lfs diff=lfs merge=lfs -text
+figures/media/image11.png filter=lfs diff=lfs merge=lfs -text
+figures/media/image2.png filter=lfs diff=lfs merge=lfs -text
+figures/media/image3.png filter=lfs diff=lfs merge=lfs -text
+figures/media/image4.png filter=lfs diff=lfs merge=lfs -text
+figures/media/image5.png filter=lfs diff=lfs merge=lfs -text
+figures/media/image6.png filter=lfs diff=lfs merge=lfs -text
+figures/media/image7.png filter=lfs diff=lfs merge=lfs -text
+figures/media/image8.png filter=lfs diff=lfs merge=lfs -text
README.md
CHANGED

@@ -1,305 +1,331 @@
The following sections of the previous README were removed in this revision (the remaining removed lines duplicated content that is kept below in the new version):

## ➩ Missing value imputation
For the demo file, see [demo_missing_value_imputation.py](https://github.com/limix-ldm/LimiX/raw/main/examples/inference_regression.py)

# ➤ Link
- LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence: [arXiv:2509.03505](https://arxiv.org/abs/2509.03505)
- LimiX Technical Report: [LimiX_Technical_Report.pdf](https://github.com/limix-ldm/LimiX/blob/main/LimiX_Technical_Report.pdf)
- Detailed instructions for using LimiX: [Visit the official LimiX documentation](https://www.limix.ai/doc/)
- Balance Comprehensive Challenging Omni-domain Classification Benchmark: [bcco_cls](https://huggingface.co/datasets/stableai-org/bcco_cls)
- Balance Comprehensive Challenging Omni-domain Regression Benchmark: [bcco_reg](https://huggingface.co/datasets/stableai-org/bcco_reg)

# ➤ Citation
```
@article{LimiX,
  title={LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence},
  author={LimiXTeam},
  journal={arXiv preprint arXiv:2509.03505},
  year={2025}
}
```
**LimiX (Hugging Face)**

1. **Model Introduction**

**LimiX** is a new class of tabular AI model designed to overcome one of modern machine learning's longest-standing bottlenecks: structured data. With only **2M parameters**, **LimiX-2M** sets a new state of the art across classification, regression, and missing-value imputation, surpassing XGBoost, CatBoost, AutoGluon, and TabPFN, and approaching the performance level of the larger LimiX-16M. Its lightweight, training-free design makes advanced tabular modeling accessible on ordinary hardware while preserving full transparency and offline deployability.

(Six benchmark-highlight images from figures/media/ appear here, in two rows of three.)
**Key Features**

**Unified Tabular Reasoning:**

End-to-end designed for multi-task tabular intelligence, enabling a single model to handle classification, regression, and imputation without additional tuning, preprocessing, or task-specific fine-tuning.

**Training-Free, Context-Driven Inference:**

Operates directly through in-context learning: no training, no hyperparameters, no preprocessing pipelines. LimiX automatically interprets and processes raw tabular inputs for immediate use.

**Lightweight & Efficient Deployment:**

A compact 2M-parameter architecture enables fast inference and smooth operation on standard CPUs and laptops, dramatically reducing compute requirements for advanced tabular modeling.

2. **Model Architecture & Pretraining Procedures**

LimiX adopts a 12-block transformer architecture with axis-wise attention over features and samples, supported by pre-normalized LayerNorm for stable scaling. The LimiX-16M variant uses an asymmetric design, with two feature-axis passes and one sample-axis pass per block, to strengthen feature-interaction modeling in heterogeneous schemas with minimal overhead.
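For intuition, axis-wise attention can be sketched as ordinary self-attention applied first along the feature axis and then along the sample axis of a (samples × features × embedding) tensor. The toy below is an illustrative NumPy sketch, not the LimiX implementation: it uses a single head with no learned projections.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attend(x):
    # Single-head self-attention along the second-to-last axis of x: (..., n, d).
    # Toy version: queries/keys/values are the inputs themselves (no learned weights).
    q = k = v = x
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores, -1) @ v

rng = np.random.default_rng(0)
table = rng.normal(size=(8, 5, 16))  # (samples, features, embed_dim)

h = attend(table)                                     # feature-axis pass: features exchange information within each row
h = np.swapaxes(attend(np.swapaxes(h, 0, 1)), 0, 1)   # sample-axis pass: rows exchange information per feature

print(h.shape)  # (8, 5, 16)
```

Stacking such blocks, with the feature-axis pass repeated twice per block, mirrors the asymmetric design described above.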
To learn the joint distribution of tabular variables, LimiX is pretrained through Context-Conditional Masked Modeling (CCMM). By masking table cells and conditioning predictions on a small set of context rows, the model internalizes a wide range of conditional dependencies while adapting to new datasets without training or labels.
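The masking setup behind CCMM can be sketched in a few lines. This is an illustrative toy only (the row counts and mask ratio are assumptions, not the actual pretraining code): cells in the target rows are hidden, while a small set of context rows stays fully observed to condition the predictions.

```python
import numpy as np

rng = np.random.default_rng(42)
table = rng.normal(size=(10, 6))   # 10 rows x 6 columns of tabular data

n_context = 3                      # the first rows serve as fully observed context
mask = rng.random(table.shape) < 0.3
mask[:n_context] = False           # context rows are never masked

masked_table = table.copy()
masked_table[mask] = np.nan        # masked cells become prediction targets

# The model would be trained to reconstruct table[mask] given masked_table.
print(int(mask[:n_context].sum()), int(mask.sum()))  # first number is 0: context rows stay observed
```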
(Architecture figure from figures/media/ appears here.)

3. **Evaluation Results**

**Classification**

(Classification benchmark figure.)

On the BCCO-CLS benchmark, LimiX-16M establishes leading performance by significantly outperforming AutoGluon and all PFN variants in mean AUC, accuracy, and F1 scores, with substantially better ranks. LimiX-2M also marks a clear lead over these models in most metrics, except for its AUC rank.

**Regression**

(Regression benchmark figure.)

LimiX-16M achieves the best overall scores and rankings on TALENT-REG, with the PFN models and LimiX-2M emerging as close runners-up in both R² and RMSE.

**Missing Value Imputation**

LimiX introduces the first training-free, in-context approach for missing-value imputation on entirely new datasets. Across a wide set of real-world benchmarks, LimiX-16M delivers the best performance, achieving lower RMSE and error rates than classical and learned imputers, including KNN, MICE, MissForest, GAIN, and MIWAE. Unlike all prior methods, which depend on additional fitting, LimiX performs imputation directly from context with consistently superior accuracy.
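The evaluation protocol behind such comparisons (hide known cells, impute them, score with RMSE on the hidden entries) can be sketched with a simple column-mean baseline. This is an illustrative setup only, not the LimiX imputer:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, size=(200, 4))   # complete ground-truth table

mask = rng.random(X.shape) < 0.2         # hide 20% of the cells
X_missing = X.copy()
X_missing[mask] = np.nan

# Baseline imputer: replace each missing cell with its column mean.
col_means = np.nanmean(X_missing, axis=0)
X_imputed = np.where(np.isnan(X_missing), col_means, X_missing)

# RMSE is computed only on the cells that were hidden.
rmse = np.sqrt(np.mean((X_imputed[mask] - X[mask]) ** 2))
print(round(float(rmse), 3))
```

Stronger imputers (KNN, MICE, or a context-conditioned model) are scored the same way, by comparing their fill-ins against the hidden ground truth.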
(Imputation benchmark figure.)

**Finetune**

Using an attention-based, retrieval-guided downsampling strategy, LimiX-16M fine-tunes on compact, highly relevant in-context episodes rather than full long contexts, substantially improving sample efficiency and reducing training cost. This approach enables LimiX-16M to significantly outperform strong baselines such as TabDPT and TabPFN-v2, with notable AUC gains across BCCO-CLS datasets.

(Finetuning results figure.)

4. **Deployment**

**Environment Preparation**

We recommend deploying with Docker. Download the [Dockerfile](https://github.com/limix-ldm/LimiX/blob/main/Dockerfile) from the repository and run the following command to build the image:

```bash
docker build --network=host -t limix/infe:v1 --build-arg FROM_IMAGES=nvidia/cuda:12.2.0-base-ubuntu22.04 -f Dockerfile .
```

For manual deployment, install the dependencies:

```bash
# Download the precompiled flash_attn wheel
wget -O flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.0.post2/flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
# Install basic dependencies (requires a Python 3.12 environment)
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1
pip install flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
pip install scikit-learn einops huggingface-hub matplotlib networkx numpy pandas scipy tqdm typing_extensions xgboost kditransform hyperopt
```

**Model Download**

Download the model weights via the Hugging Face Hub:

```python
from huggingface_hub import hf_hub_download
model_file = hf_hub_download(repo_id="stableai-org/LimiX-16M", filename="LimiX-16M.ckpt", local_dir="./cache")
```
5. **Model Usage**

**Classification Task Example**

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from huggingface_hub import hf_hub_download
import numpy as np
import os, sys

os.environ["RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"

ROOT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
if ROOT_DIR not in sys.path:
    sys.path.insert(0, ROOT_DIR)
from inference.predictor import LimiXPredictor

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

model_file = hf_hub_download(repo_id="stableai-org/LimiX-16M", filename="LimiX-16M.ckpt", local_dir="./cache")

clf = LimiXPredictor(device='cuda', model_path=model_file, inference_config='config/cls_default_retrieval.json')
prediction = clf.predict(X_train, y_train, X_test)

print("roc_auc_score:", roc_auc_score(y_test, prediction[:, 1]))
print("accuracy_score:", accuracy_score(y_test, np.argmax(prediction, axis=1)))
```

**Regression Task Example**

```python
from functools import partial

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from huggingface_hub import hf_hub_download
try:
    from sklearn.metrics import root_mean_squared_error as mean_squared_error
except ImportError:
    from sklearn.metrics import mean_squared_error
    mean_squared_error = partial(mean_squared_error, squared=False)
import os, sys

os.environ["RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"

ROOT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
if ROOT_DIR not in sys.path:
    sys.path.insert(0, ROOT_DIR)
from inference.predictor import LimiXPredictor

house_data = fetch_california_housing()
X, y = house_data.data, house_data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

y_mean = y_train.mean()
y_std = y_train.std()
y_train_normalized = (y_train - y_mean) / y_std
y_test_normalized = (y_test - y_mean) / y_std

model_path = hf_hub_download(repo_id="stableai-org/LimiX-16M", filename="LimiX-16M.ckpt", local_dir="./cache")

model = LimiXPredictor(device='cuda', model_path=model_path, inference_config='config/reg_default_retrieval.json')
y_pred = model.predict(X_train, y_train_normalized, X_test)

# Compute RMSE and R²
y_pred = y_pred.to('cpu').numpy()
rmse = mean_squared_error(y_test_normalized, y_pred)
r2 = r2_score(y_test_normalized, y_pred)

print(f'RMSE: {rmse}')
print(f'R2: {r2}')
```
**Ensemble Inference Based on Sample Retrieval**

For a detailed technical introduction to ensemble inference based on sample retrieval, please refer to the [technical report](https://github.com/limix-ldm/LimiX/blob/main/LimiX_Technical_Report.pdf).

Considering inference speed and memory requirements, ensemble inference based on sample retrieval currently only supports hardware with specifications higher than the NVIDIA RTX 4090 GPU.

**Classification Task**

```bash
python inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
```

**Regression Task**

```bash
python inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
```

**Customizing Data Preprocessing for Inference Tasks**

**First, Generate the Inference Configuration File**

```python
generate_inference_config()
```
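The call above writes the inference configuration file that the commands below consume; its exact schema is defined in the repository's `config/*.json` files. Purely as a hypothetical illustration (every key below is invented for the example, not the real schema), such a config could be produced and round-tripped like this:

```python
import json

# Hypothetical retrieval-inference config (illustrative keys only; consult the
# repository's config/*.json files for the actual schema).
config = {
    "task_type": "cls",
    "retrieval": {"enabled": True, "n_neighbors": 256},
    "ensemble": {"n_members": 8},
}

with open("my_retrieval_config.json", "w") as f:
    json.dump(config, f, indent=2)

# The resulting path would be passed as --inference_config_path.
print(json.load(open("my_retrieval_config.json"))["task_type"])  # cls
```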
**Classification Task**

**Single GPU or CPU**

```bash
python inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
```

**Multi-GPU Distributed Inference**

```bash
torchrun --nproc_per_node=8 inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data --inference_with_DDP
```

**Regression Task**

**Single GPU or CPU**

```bash
python inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
```

**Multi-GPU Distributed Inference**

```bash
torchrun --nproc_per_node=8 inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data --inference_with_DDP
```

**Retrieval Optimization Project**

This project implements an optimized retrieval system. To achieve the best performance, we use Optuna for hyperparameter tuning of the retrieval parameters.

**Installation**

Ensure you have the required dependencies installed:

```bash
pip install optuna
```

**Usage**

To search for optimized retrieval parameters on your own data, refer to the code below:

```python
searchInference = RetrievalSearchHyperparameters(
    dict(device_id=0, model_path=model_path), X_train, y_train, X_test, y_test,
)
config, result = searchInference.search(n_trials=10, metric="AUC",
    inference_config='config/cls_default_retrieval.json', task_type="cls")
```

This will launch an Optuna study to find the best combination of retrieval parameters for your specific dataset and use case.
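For intuition, the loop such a study runs can be mimicked with a tiny random search. This is a toy sketch with a made-up objective, not the `RetrievalSearchHyperparameters` internals; the parameter names are invented for the example:

```python
import random

random.seed(0)

def objective(n_neighbors, n_members):
    # Stand-in for a real validation AUC; peaks near n_neighbors=128, n_members=8.
    return 1.0 - abs(n_neighbors - 128) / 512 - abs(n_members - 8) / 32

best = None
for _ in range(50):  # 50 "trials"
    params = {"n_neighbors": random.choice([32, 64, 128, 256, 512]),
              "n_members": random.randint(1, 16)}
    score = objective(**params)
    if best is None or score > best[0]:
        best = (score, params)

print(best[1])
```

An Optuna study does the same thing with smarter samplers (e.g. TPE), pruning, and persistent trial storage.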
| 279 |
|
| 280 |
+
6\. **Tool Invocation**
|
| 281 |
+
|
| 282 |
+
The LimiX model can integrate with various toolchains for extended functionality:
|
| 283 |
+
|
| 284 |
+
**Data Processing Tools**: Integrates with pandas and scikit-learn for data cleaning, feature engineering, and result evaluation (e.g., r2\_score, mean\_squared\_error).
|
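As a reminder of what those evaluation calls compute, here are the R² and RMSE formulas written out in NumPy (equivalent in spirit to scikit-learn's `r2_score` and `root_mean_squared_error`):

```python
import numpy as np

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

print(round(r2(y_true, y_pred), 4))    # 0.9486
print(round(rmse(y_true, y_pred), 4))  # 0.6124
```

Note that when targets are standardized, as in the regression example above, both metrics are computed on the normalized scale.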
| 285 |
+
|
| 286 |
+
**Hyperparameter Optimization Tools**: Optimize retrieval parameters via the hyperopt library, example as follows:
|
| 287 |
+
|
| 288 |
+
------------------------------------------------------------------------------------
|
| 289 |
+
Python\
|
| 290 |
+
\# Hyperparameter search example (refer to inference\_regression.py)\
|
| 291 |
+
from utils.inference\_utils import sample\_inferece\_params\
|
| 292 |
+
hyperopt\_config, base\_config = sample\_inferece\_params(rng, 2, 4)\
|
| 293 |
+
model.set\_inference\_config(inference\_config=hyperopt\_config, \*\*base\_config)
|
| 294 |
+
|
| 295 |
+
------------------------------------------------------------------------------------
|
| 296 |
+
|
| 297 |
+
**Distributed Inference**: Supports DDP (Distributed Data Parallel) mode for multi-GPU acceleration via torch.distributed.
|
| 298 |
+
|
| 299 |
+
7\. **License**
|
| 300 |
+
|
| 301 |
+
**Code License**: The repository code is licensed under the \[Apache-2.0 License\](LICENSE.txt), allowing commercial use and secondary development with retention of the original copyright notice.
|
| 302 |
+
|
| 303 |
+
**Model Weight License**: The use of LimiX model weights is subject to a separate Model License:
|
| 304 |
+
|
| 305 |
+
Fully open for academic research without authorization required.
|
| 306 |
+
|
| 307 |
+
Commercial use requires official authorization (refer to the license application process on the [StableAI official website](https://www.stable-ai.ai/)).
|
| 308 |
+
|
| 309 |
+
8\. **Third-Party Notices**
|
| 310 |
+
|
| 311 |
+
This project uses the following third-party components, whose usage is governed by their respective licenses:
|
| 312 |
+
|
| 313 |
+
**PyTorch**: BSD-style license
|
| 314 |
+
|
| 315 |
+
**scikit-learn**: BSD license
|
| 316 |
+
|
| 317 |
+
**flash-attention**: MIT License
|
| 318 |
+
|
| 319 |
+
**Hugging Face Hub**: Apache-2.0 License
|
| 320 |
+
|
| 321 |
+
For the complete list of dependencies and license information, refer to requirements.txt and the official documentation of each component.
|
| 322 |
+
|
| 323 |
+
9\. **Contact Us**
|
| 324 |
+
|
| 325 |
+
**Official Documentation**: <https://www.limix.ai/doc/>
|
| 326 |
+
|
| 327 |
+
**GitHub Repository**: <https://github.com/limix-ldm/LimiX> (Submit issues for questions)
|
| 328 |
+
|
| 329 |
+
**Official Website**: <https://www.stable-ai.ai/> (For commercial cooperation and license inquiries)
|
| 330 |
+
|
| 331 |
+
**Technical Report**: [LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence](https://arxiv.org/abs/2509.03505)
figures/media/image1.png ADDED (Git LFS)
figures/media/image10.png ADDED (Git LFS)
figures/media/image11.png ADDED (Git LFS)
figures/media/image2.png ADDED (Git LFS)
figures/media/image3.png ADDED (Git LFS)
figures/media/image4.png ADDED (Git LFS)
figures/media/image5.png ADDED (Git LFS)
figures/media/image6.png ADDED (Git LFS)
figures/media/image7.png ADDED (Git LFS)
figures/media/image8.png ADDED (Git LFS)
figures/media/image9.png ADDED