---
license: apache-2.0
library_name: predictlm
pipeline_tag: tabular-classification
tags:
- tabular
- tabular-classification
- tabular-regression
- in-context-learning
- foundation-model
- prior-fitted-network
- tabpfn-style
metrics:
- accuracy
- roc_auc
- r2
- rmse
model-index:
- name: predictlm-base-26m
  results:
  - task:
      type: tabular-classification
      name: Tabular Classification
    dataset:
      type: openml
      name: Locked OpenML eval (CC-18 + AMLB + TabPFN-extras), fair-set n_features ≤ 128
    metrics:
    - type: accuracy
      value: 0.685
      name: mean accuracy (n=12, seed=42, fair-set n_features ≤ 128)
  - task:
      type: tabular-regression
      name: Tabular Regression
    dataset:
      type: openml
      name: Locked OpenML eval (CTR-23 + AMLB), fair-set n_features ≤ 128
    metrics:
    - type: r2
      value: 0.589
      name: mean R² (n=13, seed=42, fair-set n_features ≤ 128)
  - task:
      type: tabular-classification
      name: Tabular Classification (Duo + TTT recipe)
    dataset:
      type: openml
      name: Locked OpenML eval (CC-18 + AMLB + TabPFN-extras), fair-set n_features ≤ 128
    metrics:
    - type: accuracy
      value: 0.751
      name: mean accuracy with Duo + TTT recipe (Mini + Base + test-time training)
  - task:
      type: tabular-regression
      name: Tabular Regression (Duo + TTT recipe)
    dataset:
      type: openml
      name: Locked OpenML eval (CTR-23 + AMLB), fair-set n_features ≤ 128
    metrics:
    - type: r2
      value: 0.609
      name: mean R² with Duo + TTT recipe (Mini + Base + test-time training)
---
# predictlm-base-26m
A 26.2M-parameter transformer-based **tabular foundation model** that uses **in-context learning** to solve regression and classification **in a single forward pass**. Pass a small training table as the context, and the model predicts on new rows — no fine-tuning, no model selection, no hyperparameter sweep.
> Looking for a more compact variant? **[PredictLM Mini (13.5M)](https://huggingface.co/zerooneresearch/predictlm-mini-13m)** is distilled from this Base model via warm-start knowledge transfer. It is **statistically tied with Base on classification accuracy** (paired-bootstrap delta -0.001, 95% CI crosses zero) and scores ~4 pp lower R² on regression, at about half the parameter count.
## Getting started — the published 0.751 cls / 0.609 reg recipe, by default
```bash
pip install predictlm
```
```python
from predictlm import PredictLM
model = PredictLM.from_pretrained("zerooneresearch/predictlm-base-26m") # cpu / mps / cuda all OK
# Regression — pass float y, get continuous predictions
preds = model.fit(X_train_reg, y_train_reg).predict(X_test_reg)
# Classification — same model, same API; auto-routed via y_train dtype
preds = model.fit(X_train_cls, y_train_cls).predict(X_test_cls)
probs = model.predict_proba(X_test_cls) # [n, n_classes]
```
That's it. On the first `.predict()` call the package silently downloads its partner checkpoint (`predictlm-mini-13m`) and forms the published **Duo + TTT** ensemble under the hood. This is the recipe that scores **0.751 cls / 0.609 reg** on the locked 25-dataset OpenML eval. You never manage the ensemble; the partner is cached in `~/.cache/huggingface/`.
| Recipe (chosen via `auto_duo=` flag) | cls mean acc | reg mean R² |
|---|:---:|:---:|
| Default `.predict()` (Duo + TTT under the hood) | **0.751** | **0.609** |
| `auto_duo=False` (Base-only, zero-tuning) | 0.685 | 0.589 |
| `auto_duo=False` + `fit_and_predict_with_ttt()` (Base-only TTT) | 0.748 | 0.608 |
**Edge cases:**
- **No internet / air-gapped.** Pass `auto_duo=False` at load to disable partner download — `.predict()` returns the single-model in-context result.
- **Latency-sensitive inference?** Use `auto_duo=False` zero-tuning, the fastest path; Duo + TTT adds ~1-60 s per query depending on table size. For budgets under ~10 ms, see *Not intended use*.
**TTT** ([Test-Time Training](https://arxiv.org/abs/2503.11842)) runs ~15 inner Adam steps of self-supervised fine-tuning on the user's in-context examples before predicting, which adds per-task specialization on top of the generic ICL prior. 19 / 20 datasets improved vs zero-tuning; no dataset regressed by more than 0.006.
PredictLM's TTT is an independent implementation of the published technique. This repo does not include or derive from TabPFN code or weights — PredictLM weights are trained from scratch on synthetic data and shipped under Apache-2.0.
## Architecture
Unified architecture: a shared backbone with two task heads (regression via a 1024-bin BarDistribution, classification via per-task masked softmax). The model auto-detects task type from the dtype of `y_train` and routes through the matching head. One `fit/predict` API for both. This unified framing follows [TabICLv2](https://huggingface.co/papers/2602.11139) (Soda Inria, Feb 2026); the closest non-unified precedent is [TabPFN v2](https://huggingface.co/Prior-Labs/TabPFN-v2-clf), which ships separate classifier and regressor checkpoints.
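
To make the two heads concrete, here is a minimal NumPy sketch of (a) a masked softmax that ignores unused class slots and (b) a point prediction read off a binned distribution. The function names, the uniform bin edges over [-1, 1], and the use of the distribution mean as the point estimate are assumptions for illustration; the card does not describe the package's internals at this level.

```python
import numpy as np

def masked_softmax(logits, n_classes):
    """Softmax over the first `n_classes` slots; the remaining slots get probability 0."""
    logits = np.asarray(logits, dtype=np.float64)
    mask = np.arange(logits.shape[-1]) < n_classes
    masked = np.where(mask, logits, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(masked)
    return exp / exp.sum(axis=-1, keepdims=True)

def binned_mean(logits, edges):
    """Mean of a distribution that puts softmax(logits) mass on bins defined by `edges`."""
    logits = np.asarray(logits, dtype=np.float64)
    probs = masked_softmax(logits, logits.shape[-1])
    centers = 0.5 * (edges[:-1] + edges[1:])
    return probs @ centers

# 10 class slots (max_classes), but this hypothetical task only has 3 classes.
cls_probs = masked_softmax(np.zeros(10), n_classes=3)

# 1024 bins over an assumed target range of [-1, 1].
edges = np.linspace(-1.0, 1.0, 1025)
reg_pred = binned_mean(np.zeros(1024), edges)
```

With all-zero logits the three active classes share the probability mass equally and the binned mean sits at the center of the assumed range.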
`X_train` and `X_test` are numeric `np.ndarray` or `torch.Tensor`. `y_train` controls task routing: float → regression, int / string → classification.
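
The routing rule above can be sketched as a small dtype check. `infer_task` is a hypothetical helper written for this example, not a function exported by `predictlm`, and the package's real detection logic may treat edge cases (object arrays, booleans) differently.

```python
import numpy as np

def infer_task(y_train):
    """Return 'regression' for float targets and 'classification' for int / string targets."""
    y = np.asarray(y_train)
    if np.issubdtype(y.dtype, np.floating):
        return "regression"
    return "classification"

quality = np.array([5.0, 6.0, 5.0, 7.0])       # ordinal scores stored as float
task_float = infer_task(quality)               # routed to the regression head
task_int = infer_task(quality.astype(int))     # same values, routed to classification
task_str = infer_task(np.array(["low", "high"]))
```

Casting a near-categorical float target to `int` is therefore enough to change which head is used.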
## Developers and affiliations
- **Developed by**: ZeroOne Research
- **Model card contact**: message the org on the Hub
- **License**: Apache 2.0 — permissive, commercial use allowed, no attribution-only restriction
## Intended use
predictlm-base-26m is a **drop-in tabular predictor** for small-to-medium tables when you want one model that handles both regression and classification:
- **Direct use**: `fit/predict` on numeric tabular data with ≤ 128 features, ≤ 1024 training rows, and (for classification) ≤ 10 classes. Best in zero-tuning settings.
- **Downstream use**: as a baseline foundation model in tabular benchmarking, or as an ICL backbone for derivative work (the trunk weights are released under Apache 2.0).
- **Most useful when**: you have **few training rows**, you have **mixed reg + cls tasks** in one pipeline, you want **zero hyperparameter tuning**, or you want a single model artifact rather than maintaining separate per-task models.
## Not intended use
Do not use this model for:
- **High-stakes decisions** in medicine, lending, hiring, criminal justice, or any context where a wrong prediction causes individual harm — without domain-specific validation, calibration audit, and human review. Like any tabular predictor, predictlm-base-26m will reflect biases present in the labeled context the user provides.
- **Wide tables** (> 128 features). The input projection truncates extra columns.
- **Many-class classification** (> 10 classes). Will raise an error.
- **Very large training sets** (> ~10,000 rows). Performance saturates around the 1024-row context cap; gradient-boosted trees (XGBoost / LightGBM) will outperform here.
- **Non-numeric features** without prior encoding. One-hot / target-encode categoricals first.
- **Latency-critical inference** under ~10 ms on CPU. A trained XGBoost is faster on small problems.
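
For the non-numeric case in the list above, a minimal one-hot encoding sketch in NumPy follows. `one_hot` is a hypothetical helper for illustration; any standard encoder would serve the same purpose.

```python
import numpy as np

def one_hot(column):
    """One-hot encode a 1-D array of category labels; returns (matrix, categories)."""
    categories, codes = np.unique(np.asarray(column), return_inverse=True)
    return np.eye(len(categories), dtype=np.float32)[codes], categories

colors = np.array(["red", "green", "red", "blue"])
sizes = np.array([1.5, 2.0, 0.5, 3.0], dtype=np.float32)

encoded, categories = one_hot(colors)
X = np.column_stack([sizes, encoded])   # numeric matrix: 1 size column + 3 indicator columns
```

Each category adds one column, so one-hot encoding counts against the 128-feature limit.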
## Model architecture
| field | value |
|---|---|
| Parameters | 26.2 M |
| Layers | 12 (8 shared trunk + 2 reg-head + 2 cls-head) |
| d_model | 256 |
| n_heads | 8 |
| max_features | 128 |
| max_classes | 10 |
| max_context (training rows passed at inference) | 1024 |
| max_query (test rows scored per call) | 256 |
| Regression head | BarDistribution, 1024 bins |
| Classification head | Per-task masked softmax |
| Attention | row-axis transformer; queries cross-attend to context only (deterministic given the context) |
| Feature embedding | Periodic-frequency, 8 bands (scale-invariant, no explicit standardization required) |
| Inference precision | bf16 |
The trunk is a row-axis transformer over the training context concatenated with query rows. Queries cross-attend over the context but not over each other, so a query's prediction is deterministic given the context and does not depend on which other rows are scored alongside it.
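
Because queries do not attend to each other, a test set larger than `max_query` can in principle be scored in blocks. The sketch below assumes the caller does the chunking; the card does not say whether the package already does this internally. `predict_fn` stands in for a fitted model's `predict`, and `fake_predict` is a placeholder so the example runs without the model.

```python
import numpy as np

MAX_QUERY = 256  # max_query from the table above

def predict_in_chunks(predict_fn, X_test, chunk_size=MAX_QUERY):
    """Call `predict_fn` on successive row blocks of `X_test` and concatenate the results."""
    X_test = np.asarray(X_test)
    parts = [predict_fn(X_test[i:i + chunk_size])
             for i in range(0, len(X_test), chunk_size)]
    return np.concatenate(parts, axis=0)

def fake_predict(X):
    """Stand-in predictor: one value per row."""
    return X.sum(axis=1)

X_test = np.arange(600 * 4, dtype=np.float64).reshape(600, 4)
preds = predict_in_chunks(fake_predict, X_test)   # scored as blocks of 256, 256, 88 rows
```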
## Training data and priors
predictlm-base-26m was **trained on synthetic priors with cleared real-data augmentation**. No raw OpenML rows were ever shown to the model.
Training-task mix per step:
- **70%** structural causal model (SCM) tasks — mixed-node SCMs (linear / MLP / tree / periodic / discretizer), with heavy-tail noise, MNAR missingness, target censoring, hierarchical groups, Pitman-Yor categoricals, and covariate shift between context and query.
- **30%** Gaussian-copula tasks fit on **cleared** real tables — 99 bundles total, sampled from UCI and EU government open-data sources.
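
As background for the copula component, here is a sketch of the textbook Gaussian-copula recipe: convert each column to normal scores via ranks, estimate their correlation, sample correlated normals, and map back through the empirical quantiles. This is the generic technique under simple assumptions (continuous columns, no missing values), not the authors' pipeline, and all function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

_norm = NormalDist()
_ppf = np.vectorize(_norm.inv_cdf)   # standard normal quantile function
_cdf = np.vectorize(_norm.cdf)       # standard normal CDF

def fit_gaussian_copula(table):
    """Estimate the correlation of normal scores computed from per-column ranks."""
    table = np.asarray(table, dtype=np.float64)
    n = table.shape[0]
    ranks = table.argsort(axis=0).argsort(axis=0) + 1   # ranks 1..n per column
    z = _ppf(ranks / (n + 1))                           # strictly inside (0, 1)
    return np.corrcoef(z, rowvar=False)

def sample_gaussian_copula(table, corr, n_samples, rng):
    """Draw synthetic rows with the fitted dependence and the empirical marginals."""
    table = np.asarray(table, dtype=np.float64)
    z = rng.multivariate_normal(np.zeros(corr.shape[0]), corr, size=n_samples)
    u = _cdf(z)
    cols = [np.quantile(table[:, j], u[:, j]) for j in range(table.shape[1])]
    return np.column_stack(cols)

rng = np.random.default_rng(0)
x = rng.normal(size=500)
real = np.column_stack([x, np.exp(x + 0.3 * rng.normal(size=500))])  # correlated, skewed
corr = fit_gaussian_copula(real)
synthetic = sample_gaussian_copula(real, corr, n_samples=200, rng=rng)
```

The synthetic rows keep each column's marginal shape and the rank dependence between columns, but no sampled row is copied from the source table except by coincidence of interpolation.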
Real datasets used as copula seeds were screened by a 3-rule **contamination auditor** (MinHash + character-n-gram + target-name match) against the full locked OpenML eval set before being admitted to the training pool. The auditor's clearance manifest is the load-bearing artifact for our no-leakage claim — see *Reproducibility* below.
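
A rough sketch of what a three-rule check of this kind might look like follows. The card names the rules but not their inputs or thresholds, so everything specific here is an assumption: the text being compared (a header string), the 0.8 threshold, the OR combination, and all function names.

```python
import hashlib

def char_ngrams(text, n=5):
    """Set of lowercase character n-grams of `text` (whitespace normalized)."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def _hash(seed, item):
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(items, num_perm=64):
    """MinHash signature: for each seed, the minimum hash over all items."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_perm)]

def minhash_similarity(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def looks_contaminated(candidate, locked, threshold=0.8):
    """Flag a candidate whose header text or target name matches a locked eval dataset."""
    a, b = char_ngrams(candidate["header"]), char_ngrams(locked["header"])
    ngram_hit = jaccard(a, b) >= threshold
    minhash_hit = minhash_similarity(minhash_signature(a), minhash_signature(b)) >= threshold
    target_hit = candidate["target"].strip().lower() == locked["target"].strip().lower()
    return ngram_hit or minhash_hit or target_hit

iris = {"header": "sepal_length,sepal_width,petal_length,petal_width,species", "target": "species"}
loans = {"header": "age,income,zip", "target": "default"}
flag_same = looks_contaminated(iris, iris)
flag_other = looks_contaminated(loans, iris)
```

MinHash matters at scale because signatures can be precomputed and compared cheaply, whereas exact n-gram Jaccard requires both full sets.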
A **DifficultyCurriculum** + **HardExampleBuffer** (5,000-task capacity, 30% replay rate) accelerates training on tasks where the model under-performs a copula baseline.
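
One way a buffer with these two parameters could work is sketched below. Only the 5,000-task capacity and the 30% replay rate come from the text; the class body, the oldest-first eviction policy, and the admission rule based on a baseline score are assumptions for illustration.

```python
import random
from collections import deque

class HardExampleBuffer:
    """Keeps tasks where the model scored below a baseline and replays them at a fixed rate."""

    def __init__(self, capacity=5000, replay_rate=0.3, seed=0):
        self.tasks = deque(maxlen=capacity)   # oldest entries are evicted first (assumed)
        self.replay_rate = replay_rate
        self.rng = random.Random(seed)

    def maybe_add(self, task, model_score, baseline_score):
        """Store the task only if the model under-performed the baseline on it."""
        if model_score < baseline_score:
            self.tasks.append(task)

    def next_task(self, sample_fresh):
        """With probability `replay_rate` replay a stored task, otherwise draw a fresh one."""
        if self.tasks and self.rng.random() < self.replay_rate:
            return self.rng.choice(self.tasks)
        return sample_fresh()

buffer = HardExampleBuffer()
buffer.maybe_add("task-a", model_score=0.40, baseline_score=0.55)   # stored
buffer.maybe_add("task-b", model_score=0.70, baseline_score=0.55)   # skipped
task = buffer.next_task(sample_fresh=lambda: "fresh-task")
```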
## Performance benchmarks
### Locked OpenML eval (held-out, contamination-audited)
Benchmark suites: CC-18 + CTR-23 + AMLB + TabPFN-extras (153 unique OpenML IDs, manifest SHA-256 below). 30-dataset stratified sample, seed=42, n=1500 rows max per task, fair-set filter `n_features ≤ 128`. 4-way comparison run 2026-05-14 against open and hosted SOTA baselines.
**Fair set** (`n_features ≤ 128`):
| | reg-R² (n=13) | cls-acc (n=12) |
|---|:---:|:---:|
| **predictlm-base-26m (this model, 26M)** | **+0.589** | 0.685 |
| XGBoost (200 trees, depth 6) | +0.516 | 0.743 |
| TabPFN-2.5 (hosted, ~100M, non-commercial license) | +0.662 | 0.780 |
| TabICLv2 (open, BSD-3, ~50M) | *(cls-only)* | 0.792 |
### Paired-bootstrap 95% CIs (10,000 resamples, seed=42)
Per-dataset deltas (predictlm-base-26m minus baseline):
| comparison | mean Δ | 95% CI | n | significant? |
|---|:---:|:---:|:---:|:---:|
| Reg vs XGBoost | **+0.073** | [-0.041, +0.196] | 13 | within noise |
| Reg vs TabPFN-2.5 | -0.073 | [-0.108, -0.038] | 13 | ✅ significant loss |
| Cls vs XGBoost | -0.058 | [-0.094, -0.024] | 12 | ✅ significant loss |
| Cls vs TabPFN-2.5 | -0.096 | [-0.133, -0.059] | 12 | ✅ significant loss |
| Cls vs TabICLv2 | -0.108 | [-0.150, -0.066] | 12 | ✅ significant loss |
**Honest read on the headline number.** The +7.3 pp mean R² advantage over XGBoost on regression is the point estimate; the 95% paired-bootstrap CI is [−4.1 pp, +19.6 pp], **so the regression win does not survive 95%-CI hypothesis testing on this 13-dataset sample.** Per-dataset deltas vary widely (on some datasets predictlm wins by 10+ pp, on others XGBoost wins by 5+ pp). What we can say: on this evaluation, predictlm-base-26m has a positive point estimate over XGBoost on regression, but neither method has a statistically significant advantage.
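
For readers who want to run the same kind of test on their own results, here is a minimal paired-bootstrap sketch: resample datasets with replacement, recompute the mean delta, and take percentiles. The percentile interval is an assumption (the card does not say which bootstrap interval was used), and the deltas below are made-up numbers, not the eval's per-dataset results.

```python
import numpy as np

def paired_bootstrap_ci(deltas, n_resamples=10_000, alpha=0.05, seed=42):
    """Percentile CI for the mean of per-dataset deltas, resampling datasets with replacement."""
    deltas = np.asarray(deltas, dtype=np.float64)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(deltas), size=(n_resamples, len(deltas)))
    means = deltas[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return deltas.mean(), lo, hi

# Illustrative per-dataset deltas (model minus baseline); not real results.
deltas = [0.12, -0.05, 0.30, 0.02, -0.08, 0.15, 0.01]
mean, lo, hi = paired_bootstrap_ci(deltas)
significant = bool(lo > 0 or hi < 0)   # True only if the CI excludes zero
```

With few datasets, a handful of large deltas can dominate the mean, which is why a positive point estimate can still come with an interval that includes zero.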
**Significant losses (real, not noise):** loses to XGBoost on classification (-5.8 pp, CI [-9.4, -2.4]); loses to TabPFN-2.5 on both regression and classification, and to TabICLv2 on classification (TabICLv2 is cls-only). These are commercial / SOTA models at 2-4× our parameter count.
**Out-of-regime** (`n_features > 128`, n=2 datasets): predictlm-base-26m degrades sharply (~0.15 R² / 0.50 cls accuracy) because the input projection truncates extra columns. Use a different method for wide tables (see *Limitations*).
### Empirical examples on real-world datasets (not in the eval set)
Same `fit/predict` call, default settings, 1000 train rows / 200 test rows (Glass has only 14 train rows), single seed:
| dataset | task | n_train | predictlm | XGBoost | winner |
|---|---|:---:|:---:|:---:|---|
| California housing | reg (R²) | 1000 | **0.728** | 0.727 | tied |
| Abalone | reg (R²) | 1000 | **0.562** | 0.459 | predictlm (+10 pp) |
| Wine quality, as float | reg (R²) | 1000 | 0.129 | **0.441** | XGBoost (mean reversion on ordinal) |
| Wine quality, as int | cls (acc) | 1000 | 0.530 | n/a | (cls mode resolves the failure mode above) |
| Kin8nm | reg (R²) | 1000 | **0.625** | 0.594 | predictlm |
| Titanic | cls (acc) | 1000 | 0.905 | **0.940** | XGBoost (small gap) |
| **Glass** | **cls (acc)** | **14** | **0.610** | **0.400** | **predictlm (+21 pp)** |
| Segment | cls (acc) | 1000 | 0.940 | **0.960** | XGBoost (small gap) |
The **glass result is the foundation-model signal**: with only 14 training rows on a 6-class problem, the pretrained ICL prior generalizes; XGBoost has nothing to fit. Conversely, on **wine quality** the BarDistribution regression head collapses to mean predictions on a near-categorical target — casting `y` to `int` switches the model into classification mode and recovers utility.
## When to prefer GBDTs (XGBoost / LightGBM)
This is the operating-envelope guidance, not a confession. predictlm-base-26m is not the right tool when:
- **Training set is large** (≳ 10,000 rows) — gradient-boosted trees scale better on data and saturate the predictlm context cap.
- **Wide tables** (> 128 features) — out of model regime; use trees or wait for v12.
- **High-cardinality categoricals** (e.g. ZIP codes, product IDs) — encode-and-truncate fights ICL pretraining.
- **Latency budget < 10 ms on CPU** for many small predictions — a trained tree is faster.
- **Single, well-defined task with tuning budget** — a tuned XGBoost almost always wins by a few points if you have time to grid-search.
## Ethical considerations
- **No personal data in training**: the model was not trained on any dataset containing personally identifying information. The 99 copula bundles are drawn from public UCI / EU government open-data sources.
- **No benchmark leakage**: the locked eval set was never used in training, and any real dataset used as a copula seed was screened against the eval manifest before admission. Manifest SHA-256: `fe4da8ccc...4fe0841` (see *Reproducibility*).
- **Bias inheritance**: in classification, predictions reflect the labeled context the user supplies at inference time. Like any other tabular prediction method, when applied to high-risk use cases, users should ensure the labeled data is free of biases.
- **Interpretability**: this is a black-box transformer over context+query; do not use without a human-in-the-loop in regulated decision contexts.
## Limitations
- `max_features = 128` — wider tables truncate columns.
- `max_classes = 10` — many-class classification raises an error.
- `max_context = 1024` rows — larger training sets are randomly subsampled per call.
- Numeric features only — encode categoricals before passing.
- Rows are treated as exchangeable — no time-series / sequence inductive bias.
- Eval numbers above are single-seed; per-dataset results vary by about ±5 pp.
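
The limits above can be checked before calling the model. The sketch below is a hypothetical pre-flight helper, not part of `predictlm`. It deliberately raises on wide tables instead of truncating, so the problem is visible, and it subsamples the context with a fixed seed so the result is reproducible.

```python
import numpy as np

MAX_FEATURES, MAX_CLASSES, MAX_CONTEXT = 128, 10, 1024

def preflight(X_train, y_train, seed=0):
    """Check the documented limits and subsample the context to at most MAX_CONTEXT rows."""
    X = np.asarray(X_train)
    y = np.asarray(y_train)
    if X.ndim != 2 or len(X) != len(y):
        raise ValueError("X_train must be 2-D with one row per label")
    if not np.issubdtype(X.dtype, np.number):
        raise TypeError("features must be numeric; encode categoricals first")
    if X.shape[1] > MAX_FEATURES:
        raise ValueError(f"{X.shape[1]} features exceeds the {MAX_FEATURES}-feature limit")
    is_classification = not np.issubdtype(y.dtype, np.floating)
    if is_classification and len(np.unique(y)) > MAX_CLASSES:
        raise ValueError(f"more than {MAX_CLASSES} classes")
    if len(X) > MAX_CONTEXT:
        keep = np.random.default_rng(seed).choice(len(X), size=MAX_CONTEXT, replace=False)
        X, y = X[keep], y[keep]
    return X, y

rng = np.random.default_rng(1)
X_big = rng.normal(size=(3000, 5))
y_big = rng.integers(0, 3, size=3000)
X_ctx, y_ctx = preflight(X_big, y_big)   # context reduced to 1024 rows
```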
## Inference latency
Single device, `n_train=500`, `n_test=100`, `n_features=20`:
| device | latency |
|---|:---:|
| H100 / A100 | ~30 ms |
| L4 / RTX 3090 | ~80 ms |
| Apple M-series MPS | ~150 ms |
| CPU | 2–5 s |
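
To measure latency on your own hardware, a simple timing harness like the one below is usually enough. It is a generic sketch and not the procedure used to produce the table above, which the card does not describe. `fn` would wrap your own `fit(...).predict(...)` call; here a trivial workload stands in for it.

```python
import statistics
import time

def median_latency_ms(fn, warmup=2, repeats=10):
    """Median wall-clock time of `fn()` in milliseconds, after a few warm-up calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

# Placeholder workload; replace with a call into the model.
latency = median_latency_ms(lambda: sum(range(10_000)))
```

On a GPU, work is often queued asynchronously, so `fn` should wait for the device to finish (for example by copying the predictions to host memory) before returning.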
## Reproducibility
- **Weights file**: `v11_final.pt`
- **SHA-256**: `e787b783f4ad06c55367d1912ec105626e94c82d399909aa98d93c446dc03e26`
- **Size**: 105 MB (EMA weights + architecture cfg only — training-only state stripped)
- **Training step**: 75,000 (the best held-out checkpoint per the locked eval; subsequent fine-tunes did not improve)
- **Training seed**: 42
- **Eval-lock manifest SHA-256**: `fe4da8cccfc78fc3c7746579f604154af7d37e525c4fd575965ba77ce4fe0841` (frozen 2026-04-25, 153 OpenML IDs)
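
A downloaded weights file can be checked against the published digest with the standard library. The sketch assumes you already have a local copy of `v11_final.pt`; the local path and the helper names are up to you and are not defined by the package.

```python
import hashlib

EXPECTED_SHA256 = "e787b783f4ad06c55367d1912ec105626e94c82d399909aa98d93c446dc03e26"

def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in 1 MiB chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_weights(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA-256 matches the expected hex digest."""
    return sha256_of_file(path) == expected
```

Usage: `verify_weights("v11_final.pt")` should return `True` for an unmodified copy of the weights file named above.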
## Licensing
Apache 2.0 — see [LICENSE](./LICENSE). Permissive, commercial use allowed, no attribution-only restriction.
## Version
- **v11.0** (current) — first public release. Step-75k checkpoint of the v11 training run.
- **[predictlm-mini-13m](https://huggingface.co/zerooneresearch/predictlm-mini-13m)** — distilled compact sibling shipping alongside this release. 13.5M params, T4-trainable. Recommended for deployment on commodity GPUs.
- Future releases will ship as new HF model repos under the same `predictlm` Python package.
## Citation
### BibTeX
```bibtex
@misc{predictlm2026,
author = {ZeroOne Research},
title = {predictlm-base-26m: a unified tabular foundation model for in-context regression and classification},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/zerooneresearch/predictlm-base-26m}}
}
```
### APA
ZeroOne Research. (2026). *predictlm-base-26m: a unified tabular foundation model for in-context regression and classification.* Hugging Face. https://huggingface.co/zerooneresearch/predictlm-base-26m