---
license: apache-2.0
library_name: predictlm
pipeline_tag: tabular-classification
tags:
  - tabular
  - tabular-classification
  - tabular-regression
  - in-context-learning
  - foundation-model
  - prior-fitted-network
  - tabpfn-style
metrics:
  - accuracy
  - roc_auc
  - r2
  - rmse
model-index:
  - name: predictlm-base-26m
    results:
      - task:
          type: tabular-classification
          name: Tabular Classification
        dataset:
          type: openml
          name: Locked OpenML eval (CC-18 + AMLB + TabPFN-extras), fair-set n_features ≤ 128
        metrics:
          - type: accuracy
            value: 0.685
            name: mean accuracy (n=12, seed=42, fair-set n_features ≤ 128)
      - task:
          type: tabular-regression
          name: Tabular Regression
        dataset:
          type: openml
          name: Locked OpenML eval (CTR-23 + AMLB), fair-set n_features ≤ 128
        metrics:
          - type: r2
            value: 0.589
            name: mean R² (n=13, seed=42, fair-set n_features ≤ 128)
      - task:
          type: tabular-classification
          name: Tabular Classification (Duo + TTT recipe)
        dataset:
          type: openml
          name: Locked OpenML eval (CC-18 + AMLB + TabPFN-extras), fair-set n_features ≤ 128
        metrics:
          - type: accuracy
            value: 0.751
            name: mean accuracy with Duo + TTT recipe (Mini + Base + test-time training)
      - task:
          type: tabular-regression
          name: Tabular Regression (Duo + TTT recipe)
        dataset:
          type: openml
          name: Locked OpenML eval (CTR-23 + AMLB), fair-set n_features ≤ 128
        metrics:
          - type: r2
            value: 0.609
            name: mean R² with Duo + TTT recipe (Mini + Base + test-time training)
---

# predictlm-base-26m

A 26.2M-parameter transformer-based **tabular foundation model** that uses **in-context learning** to solve regression and classification **in a single forward pass**. Pass a small training table as the context, and the model predicts on new rows — no fine-tuning, no model selection, no hyperparameter sweep.

> Looking for a more compact variant? **[PredictLM Mini (13.5M)](https://huggingface.co/zerooneresearch/predictlm-mini-13m)** is distilled from this Base model via warm-start knowledge transfer. It is **statistically tied with Base on classification accuracy** (paired-bootstrap delta -0.001, 95% CI crosses zero) and ~4 pp lower R² on regression at half the parameter count.

## Getting started — the published 0.751 cls / 0.609 reg recipe, by default

```bash
pip install predictlm
```

```python
from predictlm import PredictLM

model = PredictLM.from_pretrained("zerooneresearch/predictlm-base-26m")  # cpu / mps / cuda all OK

# Regression — pass float y, get continuous predictions
preds = model.fit(X_train_reg, y_train_reg).predict(X_test_reg)

# Classification — same model, same API; auto-routed via y_train dtype
preds = model.fit(X_train_cls, y_train_cls).predict(X_test_cls)
probs = model.predict_proba(X_test_cls)            # [n, n_classes]
```

That's it. On the first `.predict()` call the package silently downloads its partner checkpoint (`predictlm-mini-13m`) and forms the published **Duo + TTT** ensemble under the hood. This is the recipe that scores **0.751 cls / 0.609 reg** on the locked 25-dataset OpenML eval. You never manage the ensemble; the partner is cached in `~/.cache/huggingface/`.

| Recipe (chosen via `auto_duo=` flag) | cls mean acc | reg mean R² |
|---|:---:|:---:|
| Default `.predict()` (Duo + TTT under the hood) | **0.751** | **0.609** |
| `auto_duo=False` (Base-only, zero-tuning) | 0.685 | 0.589 |
| `auto_duo=False` + `fit_and_predict_with_ttt()` (Base-only TTT) | 0.748 | 0.608 |

**Edge cases:**

- **No internet / air-gapped.** Pass `auto_duo=False` at load to disable partner download — `.predict()` returns the single-model in-context result.
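- **Latency-sensitive inference.** Use `auto_duo=False` zero-tuning; Duo + TTT adds ~1-60 s per query depending on table size. Even the single-model path takes roughly 30 ms on an H100 / A100 (see *Inference latency*), so budgets under ~10 ms are out of reach for this model.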

**TTT** ([Test-Time Training](https://arxiv.org/abs/2503.11842)) runs ~15 inner Adam steps of self-supervised fine-tuning on the user's in-context examples before predicting, which adds per-task specialization on top of the generic ICL prior. It improved 19 of 20 datasets relative to zero-tuning, and no dataset regressed by more than 0.006.

PredictLM's TTT is an independent implementation of the published technique. This repo does not include or derive from TabPFN code or weights — PredictLM weights are trained from scratch on synthetic data and shipped under Apache-2.0.

## Architecture

Unified architecture: a shared backbone with two task heads (regression via a 1024-bin BarDistribution, classification via per-task masked softmax). The model auto-detects task type from the dtype of `y_train` and routes through the matching head. One `fit/predict` API for both. This unified framing follows [TabICLv2](https://huggingface.co/papers/2602.11139) (Soda Inria, Feb 2026); the closest non-unified precedent is [TabPFN v2](https://huggingface.co/Prior-Labs/TabPFN-v2-clf), which ships separate classifier and regressor checkpoints.
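
To make "per-task masked softmax" concrete, here is a minimal sketch of the idea: the head emits one logit per class slot, and slots beyond the task's class count are masked out before normalizing. This is not the package's code. The function name `masked_softmax` and the padding layout are assumptions for illustration; only the 10-slot limit comes from this card.

```python
import math

MAX_CLASSES = 10  # from this card's max_classes


def masked_softmax(logits, n_classes):
    """Softmax over the first n_classes slots; unused slots get probability 0."""
    if not 1 <= n_classes <= len(logits):
        raise ValueError("n_classes must be between 1 and len(logits)")
    active = logits[:n_classes]
    peak = max(active)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in active]
    total = sum(exps)
    return [e / total for e in exps] + [0.0] * (len(logits) - n_classes)


# Hypothetical head output for a 3-class task padded to 10 slots.
# The large logits in the padded slots must not leak into the result.
logits = [2.0, 1.0, 0.5] + [9.9] * (MAX_CLASSES - 3)
probs = masked_softmax(logits, n_classes=3)
print([round(p, 3) for p in probs[:3]])
```

Masking is what would let one set of head weights serve tasks with different class counts: the probabilities always sum to 1 over the classes that actually exist in the context.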


`X_train` and `X_test` are numeric `np.ndarray` or `torch.Tensor`. `y_train` controls task routing: float → regression, int / string → classification.
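
The routing rule can be written down as a small pure function. The sketch below mirrors the rule as stated (float targets go to regression, int or string targets to classification) using plain Python values instead of array dtypes. `infer_task` is a hypothetical helper, not part of the `predictlm` API.

```python
def infer_task(y):
    """Return 'regression' for float targets, 'classification' for int or str."""
    if len(y) == 0:
        raise ValueError("y is empty")
    if all(isinstance(v, str) for v in y):
        return "classification"
    if all(isinstance(v, (int, float)) for v in y):
        # Any float makes the whole target float, as array upcasting would.
        has_float = any(isinstance(v, float) for v in y)
        return "regression" if has_float else "classification"
    raise TypeError("y must be all strings or all numbers")


quality_as_float = [5.0, 6.0, 5.0, 7.0]
quality_as_int = [int(v) for v in quality_as_float]

print(infer_task(quality_as_float))  # regression
print(infer_task(quality_as_int))    # classification
```

Because the dtype alone decides the head, an ordinal target stored as float is treated as regression; casting it to `int` is how you opt into classification.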

## Developers and affiliations

- **Developed by**: ZeroOne Research
- **Model card contact**: message the org on the Hub
- **License**: Apache 2.0 — permissive, commercial use allowed, no attribution-only restriction

## Intended use

predictlm-base-26m is a **drop-in tabular predictor** for small-to-medium tables when you want one model that handles both regression and classification:

- **Direct use**: `fit/predict` on numeric tabular data with ≤ 128 features, ≤ 1024 training rows, and (for classification) ≤ 10 classes. Best in zero-tuning settings.
- **Downstream use**: as a baseline foundation model in tabular benchmarking, or as an ICL backbone for derivative work (the trunk weights are released under Apache 2.0).
- **Most useful when**: you have **few training rows**, you have **mixed reg + cls tasks** in one pipeline, you want **zero hyperparameter tuning**, or you want a single model artifact rather than maintaining separate per-task models.

## Not intended use

Do not use this model for:

- **High-stakes decisions** in medicine, lending, hiring, criminal justice, or any context where a wrong prediction causes individual harm — without domain-specific validation, calibration audit, and human review. Like any tabular predictor, predictlm-base-26m will reflect biases present in the labeled context the user provides.
- **Wide tables** (> 128 features). The input projection truncates extra columns.
- **Many-class classification** (> 10 classes). Will raise an error.
- **Very large training sets** (> ~10,000 rows). Performance saturates around the 1024-row context cap; gradient-boosted trees (XGBoost / LightGBM) will outperform here.
- **Non-numeric features** without prior encoding. One-hot / target-encode categoricals first.
- **Latency-critical inference** under ~10 ms on CPU. A trained XGBoost is faster on small problems.
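
For the non-numeric case above, a minimal one-hot encoder might look like the following. It is a standard-library sketch for illustration only; `one_hot_encode` is a hypothetical helper, and in practice an established encoder from your data stack is the better choice.

```python
def one_hot_encode(rows, categorical_columns):
    """Replace each categorical column with one 0/1 column per observed category."""
    # Fix a sorted category order per column so the encoding is reproducible.
    categories = {
        col: sorted({row[col] for row in rows}) for col in categorical_columns
    }
    encoded = []
    for row in rows:
        out = []
        for col, value in enumerate(row):
            if col in categories:
                out.extend(1.0 if value == cat else 0.0 for cat in categories[col])
            else:
                out.append(float(value))
        encoded.append(out)
    return encoded, categories


rows = [[1.5, "red"], [2.0, "blue"], [0.5, "red"]]
X, cats = one_hot_encode(rows, categorical_columns=[1])
print(X)  # [[1.5, 0.0, 1.0], [2.0, 1.0, 0.0], [0.5, 0.0, 1.0]]
```

Two caveats: the category order learned on the training rows has to be reused for the test rows (this sketch does not implement that step), and every category adds a column, so one-hot encoding counts against the 128-feature limit.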

## Model architecture

| field | value |
|---|---|
| Parameters | 26.2 M |
| Layers | 12 (8 shared trunk + 2 reg-head + 2 cls-head) |
| d_model | 256 |
| n_heads | 8 |
| max_features | 128 |
| max_classes | 10 |
| max_context (training rows passed at inference) | 1024 |
| max_query (test rows scored per call) | 256 |
| Regression head | BarDistribution, 1024 bins |
| Classification head | Per-task masked softmax |
| Attention | row-axis transformer; queries cross-attend to context only (deterministic given the context) |
| Feature embedding | Periodic-frequency, 8 bands (scale-invariant, no explicit standardization required) |
| Inference precision | bf16 |

The trunk is a row-axis transformer over the training context concatenated with query rows. Queries cross-attend over the context but not over each other, which makes predictions deterministic given the context.
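
The attention pattern described above can be pictured as a boolean mask over the concatenated rows. The sketch below is an assumption-laden illustration, not the model's implementation: it assumes context rows attend to all context rows, and that no row (including a query itself) attends to a query position.

```python
def build_attention_mask(n_context, n_query):
    """mask[i][j] is True when row i may attend to row j."""
    n_total = n_context + n_query
    # Every row, context or query, sees only the context columns.
    return [[j < n_context for j in range(n_total)] for _ in range(n_total)]


mask = build_attention_mask(n_context=3, n_query=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Since no row can read from a query position, a query's output cannot depend on which other queries share the batch, which is the property the paragraph above relies on.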

## Training data and priors

predictlm-base-26m was **trained on synthetic priors with cleared real-data augmentation**. No raw OpenML rows were ever shown to the model.

Training-task mix per step:

- **70%** structural causal model (SCM) tasks — mixed-node SCMs (linear / MLP / tree / periodic / discretizer), with heavy-tail noise, MNAR missingness, target censoring, hierarchical groups, Pitman-Yor categoricals, and covariate shift between context and query.
- **30%** Gaussian-copula tasks fit on **cleared** real tables — 99 bundles total, sampled from UCI and EU government open-data sources.
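
As a rough sketch of the 70/30 mix, the snippet below draws a task family for each training task. It assumes each task is sampled independently with these probabilities, which the card does not state. The names `TASK_MIX` and `sample_task_kinds` are hypothetical.

```python
import random

TASK_MIX = {"scm": 0.70, "copula": 0.30}  # proportions from the list above


def sample_task_kinds(n_tasks, rng):
    """Draw a task family for each of n_tasks training tasks."""
    kinds = list(TASK_MIX)
    weights = [TASK_MIX[k] for k in kinds]
    return rng.choices(kinds, weights=weights, k=n_tasks)


rng = random.Random(42)
batch = sample_task_kinds(10_000, rng)
scm_share = batch.count("scm") / len(batch)
print(f"SCM share: {scm_share:.3f}")
```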

Real datasets used as copula seeds were screened by a 3-rule **contamination auditor** (MinHash + character-n-gram + target-name match) against the full locked OpenML eval set before being admitted to the training pool. The auditor's clearance manifest is the load-bearing artifact for our no-leakage claim — see *Reproducibility* below.
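
The auditor itself is not published in this card, so the following is only a sketch of how three such rules could be combined. Everything specific here is an assumption: the choice to compare column-name sets with MinHash, dataset names with character trigrams, the 0.8 threshold, and the dictionary layout of a dataset entry.

```python
import hashlib


def char_ngrams(text, n=3):
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0


def minhash_signature(items, num_hashes=64):
    """One minimum per seeded hash; items must be non-empty."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha256(f"{seed}:{item}".encode()).digest()[:8], "big"
            )
            for item in items
        ))
    return signature


def minhash_similarity(sig_a, sig_b):
    """Fraction of matching slots, an estimate of the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def is_contaminated(candidate, eval_entry, threshold=0.8):
    """Flag the candidate if any rule fires. The threshold is hypothetical."""
    rule_minhash = minhash_similarity(
        minhash_signature(set(candidate["columns"])),
        minhash_signature(set(eval_entry["columns"])),
    ) >= threshold
    rule_ngram = jaccard(
        char_ngrams(candidate["name"]), char_ngrams(eval_entry["name"])
    ) >= threshold
    rule_target = candidate["target"].lower() == eval_entry["target"].lower()
    return rule_minhash or rule_ngram or rule_target


# Made-up entries for illustration.
eval_entry = {"name": "housing-prices",
              "columns": ["rooms", "area", "age", "price"], "target": "price"}
clean = {"name": "river-flow",
         "columns": ["station", "depth", "rain", "flow"], "target": "flow"}
leaky = {"name": "housing-prices-v2",
         "columns": ["rooms", "area", "age", "price"], "target": "Price"}

print(is_contaminated(clean, eval_entry), is_contaminated(leaky, eval_entry))
```

Combining the rules with a logical OR errs on the side of rejecting a seed table, which is the safe direction for a no-leakage claim.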

A **DifficultyCurriculum** + **HardExampleBuffer** (5,000-task capacity, 30% replay rate) accelerates training on tasks where the model under-performs a copula baseline.
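
A replay buffer with these two parameters could be sketched as below. The capacity and replay rate come from the sentence above; the FIFO eviction policy, the uniform replay choice, and the method names are assumptions, since the card does not describe the actual `HardExampleBuffer` internals.

```python
import random
from collections import deque


class HardExampleBuffer:
    """Fixed-capacity FIFO store of tasks the model did poorly on (a sketch)."""

    def __init__(self, capacity=5000, replay_rate=0.30, seed=0):
        self.tasks = deque(maxlen=capacity)  # oldest tasks fall off when full
        self.replay_rate = replay_rate
        self.rng = random.Random(seed)

    def record(self, task, model_score, baseline_score):
        """Keep the task only if the model under-performed the baseline."""
        if model_score < baseline_score:
            self.tasks.append(task)

    def next_task(self, make_fresh_task):
        """Replay a stored task with probability replay_rate, else draw a new one."""
        if self.tasks and self.rng.random() < self.replay_rate:
            return self.rng.choice(self.tasks)
        return make_fresh_task()


buffer = HardExampleBuffer(capacity=3)
for task_id in range(5):
    buffer.record(task_id, model_score=0.4, baseline_score=0.5)
print(list(buffer.tasks))  # [2, 3, 4]
```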

## Performance benchmarks

### Locked OpenML eval (held-out, contamination-audited)

Benchmark suites: CC-18 + CTR-23 + AMLB + TabPFN-extras (153 unique OpenML IDs, manifest SHA-256 below). 30-dataset stratified sample, seed=42, n=1500 rows max per task, fair-set filter `n_features ≤ 128`. 4-way comparison run 2026-05-14 against open and hosted SOTA baselines.

**Fair set** (`n_features ≤ 128`):

| | reg-R² (n=13) | cls-acc (n=12) |
|---|:---:|:---:|
| **predictlm-base-26m (this model, 26M)** | **+0.589** | 0.685 |
| XGBoost (200 trees, depth 6) | +0.516 | 0.743 |
| TabPFN-2.5 (hosted, ~100M, non-commercial license) | +0.662 | 0.780 |
| TabICLv2 (open, BSD-3, ~50M) | *(cls-only)* | 0.792 |

### Paired-bootstrap 95% CIs (10,000 resamples, seed=42)

Per-dataset deltas (predictlm-base-26m minus baseline):

| comparison | mean Δ | 95% CI | n | significant? |
|---|:---:|:---:|:---:|:---:|
| Reg vs XGBoost | **+0.073** | [-0.041, +0.196] | 13 | within noise |
| Reg vs TabPFN-2.5 | -0.073 | [-0.108, -0.038] | 13 | ✅ significant loss |
| Cls vs XGBoost | -0.058 | [-0.094, -0.024] | 12 | ✅ significant loss |
| Cls vs TabPFN-2.5 | -0.096 | [-0.133, -0.059] | 12 | ✅ significant loss |
| Cls vs TabICLv2 | -0.108 | [-0.150, -0.066] | 12 | ✅ significant loss |

**Honest read on the headline number.** The +7.3 pp mean R² advantage over XGBoost on regression is the point estimate; the 95% paired-bootstrap CI is [−4.1 pp, +19.6 pp], **so the regression win does not survive 95%-CI hypothesis testing on this 13-dataset sample.** Variance across datasets is large (on some datasets predictlm wins by 10+ pp, on others XGBoost wins by 5+ pp). What we can say: on this evaluation, predictlm-base-26m trends ahead of XGBoost on regression with a positive point estimate, while neither method has a statistically significant advantage.
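
For readers who want to reproduce this kind of interval on their own results, a percentile paired bootstrap can be sketched as follows. The resample count and seed match the table heading; the percentile method and the delta values are assumptions, and the deltas are made up rather than taken from the evaluation.

```python
import random


def paired_bootstrap_ci(deltas, n_resamples=10_000, seed=42, alpha=0.05):
    """Percentile CI for the mean of per-dataset deltas (model minus baseline)."""
    rng = random.Random(seed)
    n = len(deltas)
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(deltas, k=n)  # datasets drawn with replacement
        means.append(sum(resample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(deltas) / n, lo, hi


# Made-up per-dataset R² deltas for illustration; not this card's results.
deltas = [0.12, -0.05, 0.30, 0.02, -0.08, 0.15, 0.01,
          -0.03, 0.25, 0.04, -0.06, 0.10, 0.14]
mean, lo, hi = paired_bootstrap_ci(deltas)
verdict = "significant" if lo > 0 or hi < 0 else "within noise"
print(f"mean={mean:+.3f}  95% CI=[{lo:+.3f}, {hi:+.3f}]  {verdict}")
```

The resampling unit is the dataset, not the row, so the interval reflects how much the mean delta would move if a different set of datasets had been drawn.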

**Significant losses (real, not noise):** loses to XGBoost on classification (-5.8 pp, CI [-9.4, -2.4]); loses to TabPFN-2.5 on both regression and classification, and to TabICLv2 on classification (the only axis on which it was evaluated). These are commercial / SOTA models with 2-4× our parameter count.

**Out-of-regime** (`n_features > 128`, n=2 datasets): predictlm-base-26m degrades sharply (~0.15 R² / 0.50 cls) — the input projection truncates extra columns. Use a different method for wide tables (see *Limitations*).

### Empirical examples on real-world datasets (not in the eval set)

Same `fit/predict` call, default settings, 1000 train rows / 200 test rows, single seed:

| dataset | task | n_train | predictlm | XGBoost | winner |
|---|---|:---:|:---:|:---:|---|
| California housing | reg (R²) | 1000 | **0.728** | 0.727 | tied |
| Abalone | reg (R²) | 1000 | **0.562** | 0.459 | predictlm (+10 pp) |
| Wine quality, as float | reg (R²) | 1000 | 0.129 | **0.441** | XGBoost (mean reversion on ordinal) |
| Wine quality, as int | cls (acc) | 1000 | 0.530 | n/a | (cls mode resolves the failure mode above) |
| Kin8nm | reg (R²) | 1000 | **0.625** | 0.594 | predictlm |
| Titanic | cls (acc) | 1000 | 0.905 | **0.940** | XGBoost (small gap) |
| **Glass** | **cls (acc)** | **14** | **0.610** | **0.400** | **predictlm (+21 pp)** |
| Segment | cls (acc) | 1000 | 0.940 | **0.960** | XGBoost (small gap) |

The **glass result is the foundation-model signal**: with only 14 training rows on a 6-class problem, the pretrained ICL prior generalizes; XGBoost has nothing to fit. Conversely, on **wine quality** the BarDistribution regression head collapses to mean predictions on a near-categorical target — casting `y` to `int` switches the model into classification mode and recovers utility.
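
One plausible way to picture the wine-quality failure is to look at what a mean over a binned distribution does when the target only takes a few discrete values. The toy example below uses 6 unit-width bins instead of 1024 and assumes the point prediction is the expected value over bin centers. Neither detail is confirmed by this card, and the example does not reproduce the reported scores.

```python
def bin_centers(low, high, n_bins):
    """Centers of n_bins equal-width bins covering [low, high]."""
    width = (high - low) / n_bins
    return [low + (i + 0.5) * width for i in range(n_bins)]


def expected_value(probs, centers):
    return sum(p * c for p, c in zip(probs, centers))


# Bins centered on the quality scores 3, 4, 5, 6, 7, 8.
centers = bin_centers(2.5, 8.5, n_bins=6)

# A head that is unsure between quality 5 and quality 6.
probs = [0.0, 0.0, 0.5, 0.5, 0.0, 0.0]

mean_prediction = expected_value(probs, centers)
mode_prediction = centers[probs.index(max(probs))]
print(mean_prediction, mode_prediction)  # 5.5 5.0
```

The mean lands between the two candidate labels and is never a value the target actually takes, whereas a classification head commits to one of the labels.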

## When to prefer GBDTs (XGBoost / LightGBM)

This is the operating-envelope guidance, not a confession. predictlm-base-26m is not the right tool when:

- **Training set is large** (≳ 10,000 rows) — gradient-boosted trees scale better on data and saturate the predictlm context cap.
- **Wide tables** (> 128 features) — out of model regime; use trees or wait for v12.
- **High-cardinality categoricals** (e.g. ZIP codes, product IDs) — encode-and-truncate fights ICL pretraining.
- **Latency budget < 10 ms on CPU** for many small predictions — a trained tree is faster.
- **Single, well-defined task with tuning budget** — a tuned XGBoost almost always wins by a few points if you have time to grid-search.

## Ethical considerations

- **No personal data in training**: the model was not trained on any dataset containing personally identifying information. The 99 copula bundles are drawn from public UCI / EU government open-data sources.
- **No benchmark leakage**: the locked eval set was never used in training, and any real dataset used as a copula seed was screened against the eval manifest before admission. Manifest SHA-256: `fe4da8ccc...4fe0841` (see *Reproducibility*).
- **Bias inheritance**: in classification, predictions reflect the labeled context the user supplies at inference time. Like any other tabular prediction method, when applied to high-risk use cases, users should ensure the labeled data is free of biases.
- **Interpretability**: this is a black-box transformer over context+query; do not use without a human-in-the-loop in regulated decision contexts.

## Limitations

- `max_features = 128` — wider tables truncate columns.
- `max_classes = 10` — many-class classification raises an error.
- `max_context = 1024` rows — larger training sets are randomly subsampled per call.
- Numeric features only — encode categoricals before passing.
- Rows are treated as exchangeable — no time-series / sequence inductive bias.
- Single-seed eval numbers above; per-dataset variance is ±5 pp.
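
The limits above can be checked before calling the model. The helper below is a hypothetical pre-flight sketch, not part of the `predictlm` package: it fails loudly on wide tables instead of letting columns be truncated, and it subsamples the context with a fixed seed so the subsample is reproducible across calls.

```python
import random

MAX_FEATURES, MAX_CLASSES, MAX_CONTEXT = 128, 10, 1024  # from the list above


def prepare_context(X, y, is_classification, seed=0):
    """Check the documented limits and subsample the context to MAX_CONTEXT rows."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    n_features = len(X[0])
    if n_features > MAX_FEATURES:
        raise ValueError(
            f"{n_features} features exceeds {MAX_FEATURES}; "
            "extra columns would be truncated"
        )
    if is_classification and len(set(y)) > MAX_CLASSES:
        raise ValueError(f"more than {MAX_CLASSES} classes")
    if len(X) > MAX_CONTEXT:
        # Sorted indices keep the original row order within the subsample.
        keep = sorted(random.Random(seed).sample(range(len(X)), MAX_CONTEXT))
        X, y = [X[i] for i in keep], [y[i] for i in keep]
    return X, y


X = [[float(i), float(i % 7)] for i in range(3000)]
y = [i % 3 for i in range(3000)]
Xs, ys = prepare_context(X, y, is_classification=True)
print(len(Xs), len(ys))  # 1024 1024
```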

## Inference latency

Single device, `n_train=500`, `n_test=100`, `n_features=20`:

| device | latency |
|---|:---:|
| H100 / A100 | ~30 ms |
| L4 / RTX 3090 | ~80 ms |
| Apple M-series MPS | ~150 ms |
| CPU | 2–5 s |
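
To measure latency on your own hardware, a small timing harness along these lines is usually enough. It is a generic sketch, not the procedure used to produce the table above, and the workload shown is a stand-in for a closure around your own `fit(...).predict(...)` call.

```python
import statistics
import time


def measure_latency_ms(fn, warmup=3, repeats=20):
    """Median wall-clock time of fn() in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn()  # warm caches and lazy initialization before timing
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in workload; replace with a call into the model.
latency = measure_latency_ms(lambda: sum(i * i for i in range(10_000)))
print(f"median latency: {latency:.2f} ms")
```

On a GPU, kernels run asynchronously, so the timed function should block until the result is actually available (for example by copying the predictions back to the host) before the timer stops.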

## Reproducibility

- **Weights file**: `v11_final.pt`
- **SHA-256**: `e787b783f4ad06c55367d1912ec105626e94c82d399909aa98d93c446dc03e26`
- **Size**: 105 MB (EMA weights + architecture cfg only — training-only state stripped)
- **Training step**: 75,000 (the best held-out checkpoint per the locked eval; subsequent fine-tunes did not improve)
- **Training seed**: 42
- **Eval-lock manifest SHA-256**: `fe4da8cccfc78fc3c7746579f604154af7d37e525c4fd575965ba77ce4fe0841` (frozen 2026-04-25, 153 OpenML IDs)
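
A downloaded weights file can be checked against the digest above with the standard library. The sketch assumes the file has already been saved locally; the helper names are illustrative.

```python
import hashlib

EXPECTED = "e787b783f4ad06c55367d1912ec105626e94c82d399909aa98d93c446dc03e26"


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in 1 MiB chunks so large checkpoints fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_weights(path, expected=EXPECTED):
    """True when the file's SHA-256 matches the published digest."""
    return sha256_of_file(path) == expected
```

Calling `verify_weights("v11_final.pt")` on a local copy should return `True` if the file matches the published checkpoint.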

## Licensing

Apache 2.0 — see [LICENSE](./LICENSE). Permissive, commercial use allowed, no attribution-only restriction.

## Version

- **v11.0** (current) — first public release. Step-75k checkpoint of the v11 training run.
- **[predictlm-mini-13m](https://huggingface.co/zerooneresearch/predictlm-mini-13m)** — distilled compact sibling shipping alongside this release. 13.5M params, T4-trainable. Recommended for deployment on commodity GPUs.
- Future releases will ship as new HF model repos under the same `predictlm` Python package.

## Citation

### BibTeX

```bibtex
@misc{predictlm2026,
  author       = {ZeroOne Research},
  title        = {predictlm-base-26m: a unified tabular foundation model for in-context regression and classification},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/zerooneresearch/predictlm-base-26m}}
}
```

### APA

ZeroOne Research. (2026). *predictlm-base-26m: a unified tabular foundation model for in-context regression and classification.* Hugging Face. https://huggingface.co/zerooneresearch/predictlm-base-26m