  - regression,
  - hypernetwork,
  - retrieval,
---

# iLTM: Integrated Large Tabular Model

[![PyPI version](https://badge.fury.io/py/iltm.svg)](https://badge.fury.io/py/iltm)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://github.com/AI-sandbox/iLTM/blob/main/LICENSE)
[![Downloads](https://img.shields.io/pypi/dm/iltm)](https://pypistats.org/packages/iltm)
[![Python Versions](https://img.shields.io/badge/python-3.11%20%7C%203.12%20%7C%203.13-blue)](https://pypi.org/project/iltm/)
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-dbonet%2FiLTM-yellow)](https://huggingface.co/dbonet/iLTM)

iLTM is a foundation model for tabular data that integrates tree-derived embeddings, dimensionality-agnostic representations, a meta-trained hypernetwork, multilayer perceptron (MLP) neural networks, and retrieval. iLTM automatically handles feature scaling, categorical features, and missing values.

We release open weights for pre-trained model checkpoints that achieve strong performance across tabular classification and regression tasks, from small datasets to large, high-dimensional ones.

### Install

iLTM is a Python package; install it with pip:

```bash
pip install iltm
```

iLTM runs on Linux, macOS, and Windows, on CPU or GPU, although a GPU is **highly recommended** for faster execution.

Pre-trained model checkpoints are automatically downloaded from [Hugging Face](https://huggingface.co/dbonet/iLTM) on first use.
By default, checkpoints are stored in platform-specific cache directories (e.g., `~/.cache/iltm` on Linux, `~/Library/Caches/iltm` on macOS).
You can specify where model checkpoints are stored by setting the `ILTM_CKPT_DIR` environment variable:

```bash
export ILTM_CKPT_DIR=/path/to/checkpoints
```
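The lookup order (environment variable first, then a platform cache directory) can be sketched in Python. `resolve_checkpoint_dir` below is a hypothetical helper for illustration, not part of the package:

```python
import os
import sys
from pathlib import Path

def resolve_checkpoint_dir() -> Path:
    """Resolve the checkpoint cache directory (illustrative sketch).

    ILTM_CKPT_DIR takes precedence; otherwise fall back to a
    platform-specific cache location like those mentioned above.
    """
    env_dir = os.environ.get("ILTM_CKPT_DIR")
    if env_dir:
        return Path(env_dir).expanduser()
    if sys.platform == "darwin":
        return Path.home() / "Library" / "Caches" / "iltm"
    return Path.home() / ".cache" / "iltm"
```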

> [!NOTE]
> The first call to `iLTMRegressor` or `iLTMClassifier` downloads the selected
> checkpoint. Later runs reuse the cached weights from `ILTM_CKPT_DIR` or the
> default cache location.

> [!TIP]
> For interactive work on a local machine, it is often worth pointing
> `ILTM_CKPT_DIR` at a fast local disk to avoid repeated downloads across
> environments.

### Quick Start

iLTM is designed to be easy to use, with an API similar to scikit-learn's.

```py
from iltm import iLTMRegressor, iLTMClassifier

# Regression
reg = iLTMRegressor().fit(X_train, y_train)
y_pred = reg.predict(X_test)

# Classification
clf = iLTMClassifier().fit(X_train, y_train)
proba = clf.predict_proba(X_test)
y_hat = clf.predict(X_test)

# With a time limit (returns a partial ensemble if time runs out)
reg = iLTMRegressor().fit(X_train, y_train, fit_max_time=3600)  # 1-hour limit
```

### Model Checkpoints

Available checkpoint names:

- `"xgbrconcat"` (default): robust preprocessing + XGBoost embeddings + concatenation
- `"cbrconcat"`: robust preprocessing + CatBoost embeddings + concatenation
- `"r128bn"`: robust preprocessing with a 128-dimensional bottleneck
- `"rnobn"`: robust preprocessing without a bottleneck
- `"xgb"`: XGBoost embeddings only
- `"catb"`: CatBoost embeddings only
- `"rtr"`: robust preprocessing with retrieval
- `"rtrcb"`: CatBoost embeddings with retrieval

You can also provide a local path to a checkpoint file.
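The tree-embedding idea behind checkpoints such as `"xgb"` and `"catb"` — describing each sample by the leaves it lands in across a fitted GBDT — can be illustrated with scikit-learn on synthetic data. This is a conceptual sketch, not iLTM's actual pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Fit a small GBDT and read off the leaf index each sample lands in, per tree.
gbdt = GradientBoostingClassifier(n_estimators=20, max_depth=3, random_state=0).fit(X, y)
leaves = gbdt.apply(X)[:, :, 0]   # shape: (n_samples, n_trees)

# One-hot encode the leaf indices to obtain a sparse "tree embedding".
emb = OneHotEncoder().fit_transform(leaves)
```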

Common keyword arguments:

- `checkpoint`: checkpoint name or path to a model file. Default `"xgbrconcat"`.
- `device`: torch device string. Default `"cuda:0"`.
- `n_ensemble`: number of generated predictors.
- `batch_size`: batch size for weight prediction and inference.
- `preprocessing`: `"realmlp_td_s_v0"`, `"minimal"`, or `"none"`.
- `cat_features`: list of categorical column indices.
- `tree_embedding`: enable GBDT leaf embeddings.
- `tree_model`: `"XGBoost_hist"` or `"CatBoost"`.
- `concat_tree_with_orig_features`: concatenate original features with embeddings.
- `finetuning`: end-to-end fine-tuning.
- Retrieval: `do_retrieval`, `retrieval_alpha`, `retrieval_temperature`, `retrieval_distance`.
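For intuition about how parameters like `retrieval_alpha` and `retrieval_temperature` typically interact, here is a generic retrieval-weighted blending sketch. The function and its formula are assumptions for illustration only, not iLTM's actual retrieval mechanism:

```python
import numpy as np

def retrieval_blend(model_pred, neighbor_targets, neighbor_dists, alpha, temperature):
    """Blend a model prediction with retrieved neighbors' targets (illustrative).

    Nearer neighbors get larger weights via a softmax over negative distances
    scaled by `temperature`; `alpha` mixes the retrieved estimate with the
    model's own output.
    """
    w = np.exp(-np.asarray(neighbor_dists) / temperature)
    w = w / w.sum()
    retrieved = float(np.dot(w, neighbor_targets))
    return alpha * retrieved + (1 - alpha) * model_pred
```

With `alpha=0` the model output passes through unchanged; as `temperature` shrinks, the nearest neighbor dominates the retrieved estimate.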

Regressor only:

- `clip_predictions`: clip predictions to the training target range.
- `normalize_predictions`: z-normalize outputs before unscaling.

Classifier only:

- `voting`: `"soft"` or `"hard"`.
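The difference between the two voting modes can be seen in a small NumPy sketch (conceptual, not iLTM's implementation): with soft voting, one confident ensemble member can outweigh the rest, while hard voting follows the majority of argmax votes.

```python
import numpy as np

# Predicted class probabilities from 3 ensemble members for 1 sample, 2 classes.
probs = np.array([
    [[0.55, 0.45]],   # member 1: weakly favors class 0
    [[0.55, 0.45]],   # member 2: weakly favors class 0
    [[0.05, 0.95]],   # member 3: strongly favors class 1
])

# Soft voting: average the probabilities across members, then take the argmax.
soft = probs.mean(axis=0).argmax(axis=1)

# Hard voting: each member casts its argmax vote; the majority class wins.
votes = probs.argmax(axis=2)   # shape: (members, samples)
hard = np.array([np.bincount(v, minlength=2).argmax() for v in votes.T])
```

Here soft voting picks class 1 (the confident member dominates the average), while hard voting picks class 0 (two votes to one).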

## Hyperparameter Optimization

iLTM performs best when you tune its hyperparameters.

### Recommended search space

The package exposes a recommended search space via `iltm.get_hyperparameter_search_space`, a plain dictionary that maps hyperparameter names to small specs.

> [!TIP]
> When running hyperparameter optimization under time constraints, use the `fit_max_time` parameter of `fit()` to limit training time per configuration. The model returns a partial ensemble if the time limit is reached.
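The partial-ensemble behavior can be pictured as a simple time-budget loop. This is a conceptual sketch of the idea, not iLTM's internals; `fit_one` stands in for fitting a single ensemble member:

```python
import time

def fit_ensemble(fit_one, n_ensemble, fit_max_time=None):
    """Fit up to n_ensemble members, stopping early if the budget runs out."""
    start = time.monotonic()
    members = []
    for _ in range(n_ensemble):
        if fit_max_time is not None and time.monotonic() - start >= fit_max_time:
            break  # return whatever has been fit so far: a partial ensemble
        members.append(fit_one())
    return members

# Each "fit" takes ~0.05 s; a 0.12 s budget yields only a partial ensemble.
ensemble = fit_ensemble(lambda: (time.sleep(0.05) or "member"),
                        n_ensemble=10, fit_max_time=0.12)
```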

The `checkpoint` parameter is part of this space. It selects one of the built-in model checkpoints, which in turn sets other fields such as `preprocessing`, `tree_embedding`, and others.

The specification format is intentionally minimal so that it can be reused in any hyperparameter optimization library or custom tuning procedure.

- `iltm.get_hyperparameter_search_space()` gives you the canonical space definition.
- `iltm.sample_hyperparameters(rng)` draws a single random configuration from that space for quick baselines and smoke tests.

> [!TIP]
> `sample_hyperparameters` is mainly intended for quick baselines, smoke
> tests, or simple random search. For more serious tuning runs, it is
> usually better to adapt the search space from
> `get_hyperparameter_search_space` into your optimization method of
> choice, and let that method decide which configurations to try.
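As an illustration of how such a minimal spec dictionary can drive a tuner, here is a random-search sketch. The space and spec format below are assumptions for illustration; the real space comes from `iltm.get_hyperparameter_search_space()`:

```python
import random

# Hypothetical spec format: name -> {"type": ..., plus "choices" or "low"/"high"}.
space = {
    "checkpoint": {"type": "categorical", "choices": ["xgbrconcat", "cbrconcat", "rtr"]},
    "n_ensemble": {"type": "int", "low": 4, "high": 32},
    "retrieval_alpha": {"type": "float", "low": 0.0, "high": 1.0},
}

def sample_config(space, rng):
    """Draw one configuration from the spec dictionary."""
    cfg = {}
    for name, spec in space.items():
        if spec["type"] == "categorical":
            cfg[name] = rng.choice(spec["choices"])
        elif spec["type"] == "int":
            cfg[name] = rng.randint(spec["low"], spec["high"])
        else:
            cfg[name] = rng.uniform(spec["low"], spec["high"])
    return cfg

rng = random.Random(0)
configs = [sample_config(space, rng) for _ in range(5)]
```

Each sampled `configs` entry could then be passed as keyword arguments when constructing an estimator, with a held-out validation score deciding which configuration to keep.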

## Development

### Running Tests

To run the tests:

```bash
uv pip install -e ".[dev]"
pytest tests/
```

## Citation

If you use iLTM in your research, please cite our [paper](https://arxiv.org/abs/2511.15941):

```bibtex
@article{bonet2025iltm,
  title={iLTM: Integrated Large Tabular Model},
  author={Bonet, David and Comajoan Cara, Marçal and Calafell, Alvaro and Mas Montserrat, Daniel and Ioannidis, Alexander G},
  journal={arXiv preprint arXiv:2511.15941},
  year={2025},
}
```

## License

© Contributors, 2025. Licensed under the [Apache 2.0](https://github.com/AI-sandbox/iLTM/blob/main/LICENSE) license.