---
license: apache-2.0
library_name: predictlm
pipeline_tag: tabular-classification
tags:
- tabular
- tabular-classification
- tabular-regression
- in-context-learning
- foundation-model
- prior-fitted-network
- tabpfn-style
metrics:
- accuracy
- roc_auc
- r2
- rmse
model-index:
- name: predictlm-base-26m
  results:
  - task:
      type: tabular-classification
      name: Tabular Classification
    dataset:
      type: openml
      name: Locked OpenML eval (CC-18 + AMLB + TabPFN-extras), fair-set n_features ≤ 128
    metrics:
    - type: accuracy
      value: 0.685
      name: mean accuracy (n=12, seed=42, fair-set n_features ≤ 128)
  - task:
      type: tabular-regression
      name: Tabular Regression
    dataset:
      type: openml
      name: Locked OpenML eval (CTR-23 + AMLB), fair-set n_features ≤ 128
    metrics:
    - type: r2
      value: 0.589
      name: mean R² (n=13, seed=42, fair-set n_features ≤ 128)
  - task:
      type: tabular-classification
      name: Tabular Classification (Duo + TTT recipe)
    dataset:
      type: openml
      name: Locked OpenML eval (CC-18 + AMLB + TabPFN-extras), fair-set n_features ≤ 128
    metrics:
    - type: accuracy
      value: 0.751
      name: mean accuracy with Duo + TTT recipe (Mini + Base + test-time training)
  - task:
      type: tabular-regression
      name: Tabular Regression (Duo + TTT recipe)
    dataset:
      type: openml
      name: Locked OpenML eval (CTR-23 + AMLB), fair-set n_features ≤ 128
    metrics:
    - type: r2
      value: 0.609
      name: mean R² with Duo + TTT recipe (Mini + Base + test-time training)
---

# predictlm-base-26m

A 26.2M-parameter transformer-based **tabular foundation model** that uses **in-context learning** to solve regression and classification **in a single forward pass**. Pass a small training table as the context, and the model predicts on new rows, with no fine-tuning, no model selection, and no hyperparameter sweep.

> Looking for a more compact variant? **[PredictLM Mini (13.5M)](https://huggingface.co/zerooneresearch/predictlm-mini-13m)** is distilled from this Base model via warm-start knowledge transfer. It is **statistically tied with Base on classification accuracy** (paired-bootstrap delta -0.001, 95% CI crosses zero) and scores ~4 pp lower R² on regression at roughly half the parameter count.

## Getting started: the published 0.751 cls / 0.609 reg recipe, by default

```bash
pip install predictlm
```

```python
from predictlm import PredictLM

# X_train_* / X_test_* are your own numeric arrays; y_train_* are the matching targets.
model = PredictLM.from_pretrained("zerooneresearch/predictlm-base-26m")  # cpu / mps / cuda all OK

# Regression: pass float y, get continuous predictions
preds = model.fit(X_train_reg, y_train_reg).predict(X_test_reg)

# Classification: same model, same API; auto-routed via y_train dtype
preds = model.fit(X_train_cls, y_train_cls).predict(X_test_cls)
probs = model.predict_proba(X_test_cls)  # [n, n_classes]
```

That's it. On the first `.predict()` call the package silently downloads its partner checkpoint (`predictlm-mini-13m`), forms the published **Duo + TTT** ensemble under the hood, and applies the recipe that scored **0.751 cls / 0.609 reg** on the locked 25-dataset OpenML eval (12 classification + 13 regression fair-set datasets). You never manage the ensemble; the partner is cached in `~/.cache/huggingface/`.

| Recipe (chosen via `auto_duo=` flag) | cls mean acc | reg mean R² |
|---|:---:|:---:|
| Default `.predict()` (Duo + TTT under the hood) | **0.751** | **0.609** |
| `auto_duo=False` (Base-only, zero-tuning) | 0.685 | 0.589 |
| `auto_duo=False` + `fit_and_predict_with_ttt()` (Base-only TTT) | 0.748 | 0.608 |

**Edge cases:**

- **No internet / air-gapped.** Pass `auto_duo=False` at load to disable the partner download; `.predict()` then returns the single-model in-context result.
- **Latency-sensitive inference.** Use `auto_duo=False` zero-tuning, which costs a single forward pass (~30 ms on an H100 / A100 for the configuration listed under *Inference latency*). Duo + TTT adds ~1-60 s per query depending on table size. For budgets under ~10 ms, see *When to prefer GBDTs*.

**TTT** ([Test-Time Training](https://arxiv.org/abs/2503.11842)) runs ~15 inner Adam steps of self-supervised fine-tuning on the user's in-context examples before predicting, which specializes the generic ICL prior to each task. 19 / 20 datasets improved vs zero-tuning; no dataset regressed by more than 0.006.

PredictLM's TTT is an independent implementation of the published technique. This repo does not include or derive from TabPFN code or weights: PredictLM weights are trained from scratch on synthetic data and shipped under Apache-2.0.

## Architecture

Unified architecture: a shared backbone with two task heads (regression via a 1024-bin BarDistribution, classification via per-task masked softmax). The model auto-detects the task type from the dtype of `y_train` and routes through the matching head, so one `fit/predict` API serves both. This unified framing follows [TabICLv2](https://huggingface.co/papers/2602.11139) (Soda Inria, Feb 2026); the closest non-unified precedent is [TabPFN v2](https://huggingface.co/Prior-Labs/TabPFN-v2-clf), which ships separate classifier and regressor checkpoints.

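To make the regression head concrete, here is a minimal sketch of how a point prediction can be read off a binned output distribution. It assumes equal-width bins over a known target range and decodes the mean as the probability-weighted sum of bin centers. The names `bin_edges`, `softmax`, and `decode_mean` are illustrative and not part of the `predictlm` API, and the released model's bin borders and decoding rule may differ.

```python
import math


def bin_edges(lo, hi, n_bins):
    """Equal-width bin borders over [lo, hi] (an assumption for this sketch)."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_mean(logits, edges):
    """Expected value of a piecewise-constant distribution: sum of p_k * center_k."""
    probs = softmax(logits)
    centers = [(a + b) / 2 for a, b in zip(edges[:-1], edges[1:])]
    return sum(p * c for p, c in zip(probs, centers))


edges = bin_edges(0.0, 4.0, 4)                       # 4 bins here; the card lists 1024
flat = decode_mean([0.0] * 4, edges)                 # uniform mass over the range
bimodal = decode_mean([5.0, 0.0, 0.0, 5.0], edges)   # two equal peaks at the ends
```

Both `flat` and `bimodal` decode to roughly the middle of the range. Under this decoding, a target whose mass is split across separate values yields a prediction between them, which is consistent with the mean-reverting behavior reported for Wine quality as a float target.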

`X_train` and `X_test` are numeric `np.ndarray` or `torch.Tensor`. `y_train` controls task routing: float → regression, int / string → classification.

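The routing rule above can be illustrated on plain Python lists. This is a sketch of the stated dtype rule only: `infer_task` is a hypothetical helper, and how the package inspects array dtypes internally is not shown in this card.

```python
from numbers import Integral, Real


def infer_task(y_train):
    """Floats route to regression; ints and strings route to classification."""
    values = list(y_train)
    if not values:
        raise ValueError("y_train is empty")
    if all(isinstance(v, (str, Integral)) for v in values):
        return "classification"
    if all(isinstance(v, Real) for v in values):
        return "regression"
    raise TypeError("y_train mixes numeric and non-numeric labels")


wine_quality = [5.0, 6.0, 5.0, 7.0]
as_float = infer_task(wine_quality)                     # float targets
as_int = infer_task([int(q) for q in wine_quality])     # same values cast to int
```

Here `as_float` is `"regression"` and `as_int` is `"classification"`, which mirrors the advice to cast near-categorical targets to `int` when classification is what you want.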

## Developers and affiliations

- **Developed by**: ZeroOne Research
- **Model card contact**: message the org on the Hub
- **License**: Apache 2.0 (permissive, commercial use allowed; redistributions must keep the license and notice files)

## Intended use

predictlm-base-26m is a **drop-in tabular predictor** for small-to-medium tables when you want one model that handles both regression and classification:

- **Direct use**: `fit/predict` on numeric tabular data with ≤ 128 features, ≤ 1024 training rows, and (for classification) ≤ 10 classes. Best in zero-tuning settings.
- **Downstream use**: as a baseline foundation model in tabular benchmarking, or as an ICL backbone for derivative work (the trunk weights are released under Apache 2.0).
- **Most useful when**: you have **few training rows**, you have **mixed reg + cls tasks** in one pipeline, you want **zero hyperparameter tuning**, or you want a single model artifact rather than maintaining separate per-task models.

## Not intended use

Do not use this model for:

- **High-stakes decisions** in medicine, lending, hiring, criminal justice, or any context where a wrong prediction causes individual harm, unless backed by domain-specific validation, a calibration audit, and human review. Like any tabular predictor, predictlm-base-26m will reflect biases present in the labeled context the user provides.
- **Wide tables** (> 128 features). The input projection truncates extra columns.
- **Many-class classification** (> 10 classes). The model raises an error.
- **Very large training sets** (> ~10,000 rows). Performance saturates around the 1024-row context cap; gradient-boosted trees (XGBoost / LightGBM) will outperform here.
- **Non-numeric features** without prior encoding. One-hot / target-encode categoricals first.
- **Latency-critical inference** under ~10 ms on CPU. A trained XGBoost is faster on small problems.

## Model architecture

| field | value |
|---|---|
| Parameters | 26.2 M |
| Layers | 12 (8 shared trunk + 2 reg-head + 2 cls-head) |
| d_model | 256 |
| n_heads | 8 |
| max_features | 128 |
| max_classes | 10 |
| max_context (training rows passed at inference) | 1024 |
| max_query (test rows scored per call) | 256 |
| Regression head | BarDistribution, 1024 bins |
| Classification head | Per-task masked softmax |
| Attention | row-axis transformer; queries cross-attend to context only (each prediction depends only on the context) |
| Feature embedding | Periodic-frequency, 8 bands (scale-invariant, no explicit standardization required) |
| Inference precision | bf16 |

The trunk is a row-axis transformer over the training context concatenated with query rows. Queries cross-attend over the context but not over each other, so each query's prediction depends only on the context and not on which other queries share the batch.

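The attention pattern described above can be sketched as a boolean mask. This is an illustration under one assumption: every row, context or query, may attend to context rows only. The card does not say whether a query also attends to itself, and `build_attention_mask` and `visible_rows` are hypothetical names rather than package functions.

```python
def build_attention_mask(n_context, n_query):
    """Return mask[i][j] == True when row i may attend to row j.

    Rows 0..n_context-1 are context rows; the remaining rows are queries.
    Every row may attend to context rows only (assumed for this sketch).
    """
    n_rows = n_context + n_query
    return [[j < n_context for j in range(n_rows)] for _ in range(n_rows)]


def visible_rows(mask, i):
    """Indices of the rows that row i can attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


small = build_attention_mask(n_context=3, n_query=2)
large = build_attention_mask(n_context=3, n_query=5)
# The first query row sees the same context rows regardless of batch size.
same_view = visible_rows(small, 3) == visible_rows(large, 3)
```

Because no query column is ever visible, adding or removing other query rows cannot change what a given query attends to, which is the property the paragraph above relies on.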

## Training data and priors

predictlm-base-26m was **trained on synthetic priors with cleared real-data augmentation**. No raw OpenML rows were ever shown to the model.

Training-task mix per step:

- **70%** structural causal model (SCM) tasks: mixed-node SCMs (linear / MLP / tree / periodic / discretizer), with heavy-tail noise, MNAR missingness, target censoring, hierarchical groups, Pitman-Yor categoricals, and covariate shift between context and query.
- **30%** Gaussian-copula tasks fit on **cleared** real tables: 99 bundles total, sampled from UCI and EU government open-data sources.

Real datasets used as copula seeds were screened by a 3-rule **contamination auditor** (MinHash + character-n-gram + target-name match) against the full locked OpenML eval set before being admitted to the training pool. The auditor's clearance manifest is the load-bearing artifact for our no-leakage claim (see *Reproducibility* below).

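As an illustration of two of the three rule types named above (character n-gram overlap and target-name match), the sketch below compares a candidate seed table against one eval entry. The signature being compared (joined column names), the n-gram length, the 0.8 threshold, and all function names are assumptions made for illustration. The card does not specify the auditor's inputs or thresholds, and MinHash, which approximates the same Jaccard similarity for large sets, is omitted.

```python
def char_ngrams(text, n=5):
    """Set of lowercase character n-grams; short strings yield a single gram."""
    text = text.lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_suspect(candidate, eval_entry, threshold=0.8):
    """Flag a candidate if its signature overlaps heavily or its target name matches."""
    similarity = jaccard(
        char_ngrams(candidate["signature"]),
        char_ngrams(eval_entry["signature"]),
    )
    same_target = (
        candidate["target"].strip().lower() == eval_entry["target"].strip().lower()
    )
    return similarity >= threshold or same_target


# Hypothetical table descriptions, for illustration only.
eval_entry = {"signature": "sepal_length,sepal_width,petal_length,petal_width", "target": "species"}
near_copy = {"signature": "sepal_length,sepal_width,petal_length,petal_width", "target": "class"}
unrelated = {"signature": "age,income,zip_code,tenure", "target": "default"}

flagged = [is_suspect(c, eval_entry) for c in (near_copy, unrelated)]
```

A candidate is admitted only if no rule fires against any eval entry; in this toy case the near copy is flagged and the unrelated table is not.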

A **DifficultyCurriculum** + **HardExampleBuffer** (5,000-task capacity, 30% replay rate) accelerates training on tasks where the model under-performs a copula baseline.

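A replay buffer with the capacity and replay rate quoted above could look like the sketch below. Only those two numbers come from this card. The eviction policy (oldest first), the admission rule (model score below baseline score, higher being better), and the class interface are assumptions, and the actual training code may work differently.

```python
import random
from collections import deque


class HardExampleBuffer:
    """Keeps recent hard tasks and replays them at a fixed rate (illustrative)."""

    def __init__(self, capacity=5000, replay_rate=0.30, seed=0):
        self.tasks = deque(maxlen=capacity)   # oldest entries are evicted first (assumed)
        self.replay_rate = replay_rate
        self.rng = random.Random(seed)

    def maybe_add(self, task, model_score, baseline_score):
        """Store the task only if the model scored below the baseline on it."""
        if model_score < baseline_score:
            self.tasks.append(task)

    def next_task(self, sample_fresh):
        """Replay a stored task with probability replay_rate, else draw a fresh one."""
        if self.tasks and self.rng.random() < self.replay_rate:
            return self.tasks[self.rng.randrange(len(self.tasks))]
        return sample_fresh()


buffer = HardExampleBuffer(capacity=3, seed=0)
for task_id in range(5):
    buffer.maybe_add(task_id, model_score=0.2, baseline_score=0.5)   # all count as hard
buffer.maybe_add(99, model_score=0.9, baseline_score=0.5)            # easy task, skipped
```

With a capacity of 3, only the three most recent hard tasks remain, and over many calls to `next_task` roughly 30% of the draws come from the buffer.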

## Performance benchmarks

### Locked OpenML eval (held-out, contamination-audited)

Benchmark suites: CC-18 + CTR-23 + AMLB + TabPFN-extras (153 unique OpenML IDs, manifest SHA-256 below). 30-dataset stratified sample, seed=42, n=1500 rows max per task, fair-set filter `n_features ≤ 128`. 4-way comparison run 2026-05-14 against open and hosted SOTA baselines.

**Fair set** (`n_features ≤ 128`):

| model | reg-R² (n=13) | cls-acc (n=12) |
|---|:---:|:---:|
| **predictlm-base-26m (this model, 26M)** | **+0.589** | 0.685 |
| XGBoost (200 trees, depth 6) | +0.516 | 0.743 |
| TabPFN-2.5 (hosted, ~100M, non-commercial license) | +0.662 | 0.780 |
| TabICLv2 (open, BSD-3, ~50M) | *(cls-only)* | 0.792 |

### Paired-bootstrap 95% CIs (10,000 resamples, seed=42)

Per-dataset deltas (predictlm-base-26m minus baseline):

| comparison | mean Δ | 95% CI | n | significant? |
|---|:---:|:---:|:---:|:---:|
| Reg vs XGBoost | **+0.073** | [-0.041, +0.196] | 13 | within noise |
| Reg vs TabPFN-2.5 | -0.073 | [-0.108, -0.038] | 13 | ✅ significant loss |
| Cls vs XGBoost | -0.058 | [-0.094, -0.024] | 12 | ✅ significant loss |
| Cls vs TabPFN-2.5 | -0.096 | [-0.133, -0.059] | 12 | ✅ significant loss |
| Cls vs TabICLv2 | -0.108 | [-0.150, -0.066] | 12 | ✅ significant loss |

**Honest read on the headline number.** The +7.3 pp mean R² advantage over XGBoost on regression is the point estimate; the 95% paired-bootstrap CI is [−4.1 pp, +19.6 pp], **so the regression win is not statistically significant at the 95% level on this 13-dataset sample.** Variance across datasets is large (on some datasets predictlm wins by 10+ pp, on others XGBoost wins by 5+ pp). What we can say: on this evaluation, predictlm-base-26m has a positive point estimate over XGBoost on regression, but neither method has a statistically significant advantage.

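For readers who want to run the same kind of test on their own results, here is a minimal sketch of a percentile paired bootstrap over per-dataset deltas. The resample count and seed follow the heading above; the percentile method, the function name, and the example deltas are assumptions, and the example values are made up rather than taken from the eval.

```python
import random


def paired_bootstrap_ci(deltas, n_resamples=10_000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean of paired per-dataset deltas."""
    rng = random.Random(seed)
    n = len(deltas)
    means = []
    for _ in range(n_resamples):
        # Resample datasets with replacement, keeping each pair's delta intact.
        resample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    k = round(alpha / 2 * n_resamples)        # observations trimmed from each tail
    return sum(deltas) / n, means[k], means[n_resamples - 1 - k]


illustrative_deltas = [0.12, -0.05, 0.20, 0.03, -0.08, 0.15]   # made-up values
mean, lo, hi = paired_bootstrap_ci(illustrative_deltas)
significant = lo > 0 or hi < 0                # True only when the CI excludes zero
```

A comparison counts as significant only when the interval excludes zero, which is the criterion behind the "within noise" and "significant loss" labels in the table.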

**Significant losses (real, not noise):** loses to XGBoost on classification (-5.8 pp, CI [-9.4, -2.4]); loses to TabPFN-2.5 on both regression and classification; loses to TabICLv2 on classification (TabICLv2 was evaluated on classification only). TabPFN-2.5 and TabICLv2 are SOTA models with roughly 2-4× our parameter count.

**Out-of-regime** (`n_features > 128`, n=2 datasets): predictlm-base-26m degrades sharply (~0.15 R² / 0.50 cls accuracy) because the input projection truncates extra columns. Use a different method for wide tables (see *Limitations*).

### Empirical examples on real-world datasets (not in the eval set)

Same `fit/predict` call, default settings, up to 1000 train rows (see the `n_train` column) / 200 test rows, single seed:

| dataset | task | n_train | predictlm | XGBoost | winner |
|---|---|:---:|:---:|:---:|---|
| California housing | reg (R²) | 1000 | **0.728** | 0.727 | tied |
| Abalone | reg (R²) | 1000 | **0.562** | 0.459 | predictlm (+10 pp) |
| Wine quality, as float | reg (R²) | 1000 | 0.129 | **0.441** | XGBoost (predictlm reverts to the mean on the ordinal target) |
| Wine quality, as int | cls (acc) | 1000 | 0.530 | n/a | (cls mode resolves the failure mode above) |
| Kin8nm | reg (R²) | 1000 | **0.625** | 0.594 | predictlm |
| Titanic | cls (acc) | 1000 | 0.905 | **0.940** | XGBoost (small gap) |
| **Glass** | **cls (acc)** | **14** | **0.610** | **0.400** | **predictlm (+21 pp)** |
| Segment | cls (acc) | 1000 | 0.940 | **0.960** | XGBoost (small gap) |

The **Glass result is the foundation-model signal**: with only 14 training rows on a 6-class problem, the pretrained ICL prior generalizes, while XGBoost has too little data to fit well. Conversely, on **Wine quality** the BarDistribution regression head collapses to mean predictions on a near-categorical target; casting `y` to `int` switches the model into classification mode and recovers utility.

## When to prefer GBDTs (XGBoost / LightGBM)

This is operating-envelope guidance, not a confession. predictlm-base-26m is not the right tool when:

- **Training set is large** (≳ 10,000 rows): gradient-boosted trees scale better with data, while predictlm saturates at its 1024-row context cap.
- **Wide tables** (> 128 features): out of model regime; use trees or wait for v12.
- **High-cardinality categoricals** (e.g. ZIP codes, product IDs): encoding them and then truncating to the feature cap works against the ICL pretraining.
- **Latency budget < 10 ms on CPU** for many small predictions: a trained tree is faster.
- **Single, well-defined task with tuning budget**: a tuned XGBoost typically wins by a few points if you have time to grid-search.

## Ethical considerations

- **No personal data in training**: the model was not trained on any dataset containing personally identifying information. The 99 copula bundles are drawn from public UCI / EU government open-data sources.
- **No benchmark leakage**: the locked eval set was never used in training, and any real dataset used as a copula seed was screened against the eval manifest before admission. Manifest SHA-256: `fe4da8ccc...4fe0841` (see *Reproducibility*).
- **Bias inheritance**: in classification, predictions reflect the labeled context the user supplies at inference time. As with any tabular prediction method, users applying it to high-risk use cases should check that labeled data for bias.
- **Interpretability**: this is a black-box transformer over context+query; do not use without a human-in-the-loop in regulated decision contexts.

## Limitations

- `max_features = 128`: wider tables truncate columns.
- `max_classes = 10`: many-class classification raises an error.
- `max_context = 1024` rows: larger training sets are randomly subsampled per call.
- Numeric features only: encode categoricals before passing.
- Rows are treated as exchangeable: no time-series / sequence inductive bias.
- Single-seed eval numbers above; per-dataset variation is about ±5 pp.

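The limits above can be checked before calling `fit`. The following is a minimal sketch that assumes plain Python lists of rows; `check_table` and `subsample_context` are hypothetical helpers, not `predictlm` functions, and the package applies its own truncation and subsampling regardless.

```python
import random

MAX_FEATURES = 128    # limits as listed above
MAX_CLASSES = 10
MAX_CONTEXT = 1024


def check_table(X, y=None, classification=False):
    """Return human-readable warnings for inputs outside the documented limits."""
    problems = []
    n_features = len(X[0]) if X else 0
    if n_features > MAX_FEATURES:
        problems.append(f"{n_features} features > {MAX_FEATURES}: extra columns are truncated")
    if any(not isinstance(v, (int, float)) for row in X for v in row):
        problems.append("non-numeric feature values: encode categoricals first")
    if classification and y is not None and len(set(y)) > MAX_CLASSES:
        problems.append(f"{len(set(y))} classes > {MAX_CLASSES}: classification raises an error")
    if len(X) > MAX_CONTEXT:
        problems.append(f"{len(X)} rows > {MAX_CONTEXT}: context is randomly subsampled")
    return problems


def subsample_context(X, y, max_rows=MAX_CONTEXT, seed=0):
    """Uniformly subsample rows without replacement, keeping X and y aligned."""
    if len(X) <= max_rows:
        return X, y
    keep = sorted(random.Random(seed).sample(range(len(X)), max_rows))
    return [X[i] for i in keep], [y[i] for i in keep]


X = [[float(i), float(i % 7)] for i in range(2000)]
y = [i % 3 for i in range(2000)]
issues = check_table(X, y, classification=True)    # only the row-count warning fires
X_ctx, y_ctx = subsample_context(X, y)             # fixed-seed subsample of 1024 rows
```

Subsampling yourself with a fixed seed makes the context reproducible across calls, which the per-call random subsampling described above does not guarantee.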

## Inference latency

Single device, `n_train=500`, `n_test=100`, `n_features=20`:

| device | latency |
|---|:---:|
| H100 / A100 | ~30 ms |
| L4 / RTX 3090 | ~80 ms |
| Apple M-series MPS | ~150 ms |
| CPU | 2–5 s |

## Reproducibility

- **Weights file**: `v11_final.pt`
- **SHA-256**: `e787b783f4ad06c55367d1912ec105626e94c82d399909aa98d93c446dc03e26`
- **Size**: 105 MB (EMA weights + architecture cfg only; training-only state stripped)
- **Training step**: 75,000 (the best held-out checkpoint per the locked eval; subsequent fine-tunes did not improve)
- **Training seed**: 42
- **Eval-lock manifest SHA-256**: `fe4da8cccfc78fc3c7746579f604154af7d37e525c4fd575965ba77ce4fe0841` (frozen 2026-04-25, 153 OpenML IDs)

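To check a downloaded copy of the weights against the hash listed above, a streaming SHA-256 comparison is enough. This sketch uses only the standard library; the helper names are illustrative, and the path to pass in is wherever your copy of `v11_final.pt` ended up locally, which this card does not specify.

```python
import hashlib

# Weights hash as listed above.
EXPECTED_WEIGHTS_SHA256 = "e787b783f4ad06c55367d1912ec105626e94c82d399909aa98d93c446dc03e26"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_WEIGHTS_SHA256):
    """True when the file's SHA-256 equals the expected hex digest (case-insensitive)."""
    return sha256_of_file(path) == expected.strip().lower()
```

The same helper works for the eval-lock manifest by passing its hash as `expected`.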

## Licensing

Apache 2.0 (see [LICENSE](./LICENSE)). Permissive; commercial use allowed; redistributions must keep the license and notice files.

## Version

- **v11.0** (current): first public release. Step-75k checkpoint of the v11 training run.
- **[predictlm-mini-13m](https://huggingface.co/zerooneresearch/predictlm-mini-13m)**: distilled compact sibling shipping alongside this release. 13.5M params, T4-trainable. Recommended for deployment on commodity GPUs.
- Future releases will ship as new HF model repos under the same `predictlm` Python package.

## Citation

### BibTeX

```bibtex
@misc{predictlm2026,
  author = {{ZeroOne Research}},
  title = {predictlm-base-26m: a unified tabular foundation model for in-context regression and classification},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/zerooneresearch/predictlm-base-26m}}
}
```

### APA

ZeroOne Research. (2026). *predictlm-base-26m: a unified tabular foundation model for in-context regression and classification.* Hugging Face. https://huggingface.co/zerooneresearch/predictlm-base-26m