---
license: apache-2.0
library_name: predictlm
pipeline_tag: tabular-classification
tags:
- tabular
- tabular-classification
- tabular-regression
- in-context-learning
- foundation-model
- prior-fitted-network
- tabpfn-style
- distilled
- compact
metrics:
- accuracy
- r2
base_model: zerooneresearch/predictlm-base-26m
model-index:
- name: predictlm-mini-13m
  results:
  - task:
      type: tabular-classification
      name: Tabular Classification
    dataset:
      type: openml
      name: Locked OpenML eval (CC-18 + AMLB + TabPFN-extras), fair-set n_features ≤ 128
    metrics:
    - type: accuracy
      value: 0.684
      name: mean accuracy (n=12, seed=42, fair-set n_features ≤ 128)
  - task:
      type: tabular-regression
      name: Tabular Regression
    dataset:
      type: openml
      name: Locked OpenML eval (CTR-23 + AMLB), fair-set n_features ≤ 128
    metrics:
    - type: r2
      value: 0.551
      name: mean R² (n=13, seed=42, fair-set n_features ≤ 128)
  - task:
      type: tabular-classification
      name: Tabular Classification (Duo + TTT recipe)
    dataset:
      type: openml
      name: Locked OpenML eval (CC-18 + AMLB + TabPFN-extras), fair-set n_features ≤ 128
    metrics:
    - type: accuracy
      value: 0.751
      name: mean accuracy with Duo + TTT recipe (Mini + Base + test-time training)
  - task:
      type: tabular-regression
      name: Tabular Regression (Duo + TTT recipe)
    dataset:
      type: openml
      name: Locked OpenML eval (CTR-23 + AMLB), fair-set n_features ≤ 128
    metrics:
    - type: r2
      value: 0.609
      name: mean R² with Duo + TTT recipe (Mini + Base + test-time training)
---
# predictlm-mini-13m
A 13.5M-parameter **distilled tabular foundation model**. Half the parameters of [PredictLM Base (26M)](https://huggingface.co/zerooneresearch/predictlm-base-26m); **statistically tied with Base on classification accuracy** and within ~4 pp R² on regression.
This is the **compact deployment variant** of PredictLM, designed to run inference on any modern laptop or commodity GPU. Same single-forward-pass in-context-learning API as Base, same architecture family — just smaller, distilled, and re-trainable on hardware most teams already have.
## Getting started — the published 0.751 cls / 0.609 reg recipe, by default
```bash
pip install predictlm
```
```python
from predictlm import PredictLM
model = PredictLM.from_pretrained("zerooneresearch/predictlm-mini-13m") # cpu / mps / cuda all OK
# Regression — pass float y, get continuous predictions
preds = model.fit(X_train_reg, y_train_reg).predict(X_test_reg)
# Classification — same model, same API; auto-routed via y_train dtype
preds = model.fit(X_train_cls, y_train_cls).predict(X_test_cls)
probs = model.predict_proba(X_test_cls)
```
That's it. On the first `.predict()` call the package silently downloads its partner checkpoint (`predictlm-base-26m`), forms the published **Duo + TTT** ensemble under the hood, and applies the recipe that scored **0.751 cls / 0.609 reg** on the locked 25-dataset OpenML eval. You never manage the ensemble yourself; the partner checkpoint is cached in `~/.cache/huggingface/`.
| Recipe (chosen via `auto_duo=` flag) | cls mean acc | reg mean R² |
|---|:---:|:---:|
| Default `.predict()` (Duo + TTT under the hood) | **0.751** | **0.609** |
| `auto_duo=False` (Mini-only, zero-tuning) | 0.673 | 0.536 |
| `auto_duo=False` + `fit_and_predict_with_ttt()` (Mini-only TTT) | 0.742 | 0.595 |
**Edge cases:**
- **No internet / air-gapped.** Pass `auto_duo=False` at load to disable partner download — `.predict()` returns the single-model in-context result.
- **Real-time inference** (<10 ms latency)? Use `auto_duo=False` zero-tuning. Duo + TTT adds ~1-60 s per query depending on table size.
**TTT** ([Test-Time Training](https://arxiv.org/abs/2503.11842)) runs ~15 inner Adam steps of self-supervised fine-tuning on the user's in-context examples before predicting, which adds per-task specialization on top of a generic ICL prior. 19 / 20 datasets improved vs zero-tuning; no dataset regressed by more than 0.006.
PredictLM's TTT is an independent implementation of the published technique. This repo does not include or derive from TabPFN code or weights — PredictLM weights are trained from scratch (Mini distilled from PredictLM-Base) and shipped under Apache-2.0.
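To make the shape of the TTT inner loop concrete, here is a toy sketch of a 15-step Adam fine-tune on labeled context rows. It is not PredictLM's implementation: a two-parameter linear model stands in for the network, plain MSE on the context rows stands in for the self-supervised objective, and the names `adam_finetune` and `mse` plus the learning rate are assumptions made for illustration.

```python
import math


def mse(w, b, rows):
    """Mean squared error of the linear stand-in y = w*x + b on (x, y) rows."""
    return sum((w * x + b - y) ** 2 for x, y in rows) / len(rows)


def adam_finetune(w, b, rows, steps=15, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run a few Adam steps on the context rows and return the adapted (w, b)."""
    params = [w, b]
    m = [0.0, 0.0]  # first-moment estimates for (w, b)
    v = [0.0, 0.0]  # second-moment estimates for (w, b)
    n = len(rows)
    for t in range(1, steps + 1):
        residuals = [(params[0] * x + params[1] - y, x) for x, y in rows]
        grads = [
            sum(2.0 * r * x for r, x in residuals) / n,  # dL/dw
            sum(2.0 * r for r, _ in residuals) / n,      # dL/db
        ]
        for i, g in enumerate(grads):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g
            v[i] = beta2 * v[i] + (1.0 - beta2) * g * g
            m_hat = m[i] / (1.0 - beta1 ** t)  # bias-corrected moments
            v_hat = v[i] / (1.0 - beta2 ** t)
            params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params[0], params[1]


# Hypothetical context rows drawn from y = 2x + 1 (illustration only).
context = [(float(x), 2.0 * x + 1.0) for x in range(8)]
w_init, b_init = 0.0, 0.0
w_adapted, b_adapted = adam_finetune(w_init, b_init, context)
loss_before = mse(w_init, b_init, context)
loss_after = mse(w_adapted, b_adapted, context)
```

The point of the sketch is only the control flow: start from generic parameters, take a small fixed number of Adam steps on the user's own rows, then predict with the adapted parameters.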
## Developers and affiliations
- **Developed by**: ZeroOne Research
- **Distilled from**: [predictlm-base-26m](https://huggingface.co/zerooneresearch/predictlm-base-26m) (v11.0)
- **Model card contact**: message the org on the Hub
- **License**: Apache 2.0 — permissive, commercial use allowed
## Why Mini (when to prefer this over Base)
- **GPU memory budget < 8 GB at inference** — Mini fits comfortably on a consumer GPU or M-series MPS
- **You want to re-distill / fine-tune yourself** — Mini's training recipe runs on a single consumer GPU; Base requires an A100/H100
- **You want a smaller artifact to ship inside a product** — 55 MB inference weights vs Base's 105 MB
- **You're running many concurrent inference jobs** — 4× as many parallel Mini instances fit per GPU vs Base
- **You can tolerate ~4 pp lower regression R²** (CI [-6.5, -1.5]; cls accuracy is statistically tied with Base)
Prefer **Base** instead if you have an A100/H100, value the last ~4 pp of regression accuracy, and don't need to re-distill.
## Performance benchmarks
### Locked OpenML eval (held-out, contamination-audited)
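Same 30-dataset stratified sample as Base (seed=42). The scored fair set (`n_features ≤ 128`) contains 25 datasets: 13 regression and 12 classification. Mini is compared against four baselines using the same eval pipeline as Base (`scripts/eval_v11.py`).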
| | reg-R² (n=13) | cls-acc (n=12) |
|---|:---:|:---:|
| predictlm-base-26m (teacher) | +0.589 | 0.685 |
| **predictlm-mini-13m (this model, 13.5M)** | **+0.551** | **0.684** |
| XGBoost (200 trees, depth 6) | +0.516 | 0.743 |
| TabPFN-2.5 (hosted, ~100M, non-commercial license) | +0.662 | 0.780 |
| TabICLv2 (open, BSD-3, ~50M) | *(cls-only)* | 0.792 |
### Paired-bootstrap 95% CIs (10,000 resamples, seed=42)
Per-dataset deltas (predictlm-mini-13m minus baseline):
| comparison | mean Δ | 95% CI | n | significant? |
|---|:---:|:---:|:---:|:---:|
| **Mini vs Base (compression cost)** | | | | |
| Reg R² | **-0.038** | [-0.065, -0.015] | 13 | ✅ real (~4 pp loss) |
| Cls acc | **-0.001** | [-0.027, +0.029] | 12 | no, **statistical tie** |
| **vs other peers (Mini)** | | | | |
| Reg vs XGBoost | +0.035 | [-0.076, +0.158] | 13 | within noise |
| Reg vs TabPFN-2.5 | -0.111 | [-0.152, -0.067] | 13 | ✅ significant loss |
| Cls vs XGBoost | -0.059 | [-0.089, -0.031] | 12 | ✅ significant loss |
| Cls vs TabPFN-2.5 | -0.097 | [-0.132, -0.059] | 12 | ✅ significant loss |
| Cls vs TabICLv2 | -0.109 | [-0.147, -0.069] | 12 | ✅ significant loss |
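For readers who want to reproduce this kind of interval on their own per-dataset scores, the following is a minimal sketch of a paired bootstrap over dataset-level deltas. It assumes a percentile interval (the card does not say which bootstrap variant was used), the function name `paired_bootstrap_ci` is hypothetical, and the deltas below are made-up illustration values, not the eval data.

```python
import random


def paired_bootstrap_ci(deltas, n_resamples=10_000, seed=42, alpha=0.05):
    """Percentile bootstrap CI for the mean of per-dataset paired deltas."""
    rng = random.Random(seed)
    n = len(deltas)
    means = []
    for _ in range(n_resamples):
        # Resample datasets with replacement; each delta keeps its pairing.
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(deltas) / n, low, high


# Hypothetical deltas (model A score minus model B score), one per dataset.
example_deltas = [-0.05, -0.02, -0.04, -0.06, -0.01, -0.03, -0.07,
                  -0.02, -0.05, -0.04, -0.03, -0.06, -0.01]
mean_delta, ci_low, ci_high = paired_bootstrap_ci(example_deltas)

# The difference is called significant when the interval excludes zero.
significant = ci_high < 0 or ci_low > 0
```

Resampling whole datasets, rather than the two models' scores independently, is what makes the test paired: per-dataset difficulty cancels out in each delta.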
**Retention vs Base — the headline compression story:**
- **Classification: statistical tie** with Base (delta -0.001, CI [-0.027, +0.029]). At half the parameters, Mini is indistinguishable from the 26M teacher on classification accuracy.
- **Regression: ~4 pp R² cost** vs Base, 95% CI [-6.5, -1.5] pp (statistically real but small).
**Honest read on the peer comparisons.** Like Base, Mini's regression-vs-XGBoost point estimate is positive (+3.5 pp) but the 95% CI on this 13-dataset sample crosses zero. We can't claim a statistically significant XGBoost win on regression from this single-seed eval. What we *can* say: Mini and XGBoost are competitive on regression on this benchmark, with Mini's distribution being slightly better on most datasets.
**Significant losses (real, not noise):** Mini loses to XGBoost on classification (-5.9 pp), to TabPFN-2.5 on both axes, and to TabICLv2 on classification (TabICLv2 is cls-only). TabPFN-2.5 and TabICLv2 are commercial / SOTA models with roughly 4-7× Mini's parameter count.
### Model size vs accuracy
| model | params | params (% of Mini) | reg-R² | cls-acc |
|---|:---:|:---:|:---:|:---:|
| TabPFN-2.5 | ~100M | 740% | 0.662 | 0.780 |
| TabICLv2 | ~50M | 370% | — | 0.792 |
| **predictlm-base-26m** | 26M | 193% | 0.589 | 0.685 |
| **predictlm-mini-13m** | 13.5M | 100% (baseline) | 0.551 | 0.684 |
Mini is the smallest open-source ICL tabular FM in this comparison and the only one that trains on a single commodity GPU.
## Architecture
Identical architecture family to PredictLM Base, with cross-layer parameter sharing (ALBERT-style) to halve the trunk parameter count.
| field | value |
|---|---|
| Parameters | 13.5 M |
| Layers (effective depth) | 12 (4 unique × 3 shares — ALBERT-style sharing in shared trunk; 2 unique × 2 shares per task head) |
| d_model | 256 |
| n_heads | 8 |
| max_features | 128 |
| max_classes | 10 |
| max_context | 1024 |
| max_query | 256 |
| Regression head | BarDistribution, 1024 bins (bins identical to Base — required for KL distillation) |
| Classification head | Per-task masked softmax |
| Attention | row-axis transformer (same as Base) |
| Inference precision | fp16 (T4-compatible — Base uses bf16 on A100/H100) |
Cross-layer sharing means Mini has 4 unique trunk blocks each applied 3 times during forward pass (vs Base's 8 unique blocks each applied once). The effective compute graph depth is preserved; only the parameter count is halved.
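The sharing scheme can be illustrated with a small sketch. This is not the PredictLM source: `AffineBlock` is a two-parameter stand-in for a transformer block, `SharedTrunk` is a hypothetical name, and the application order (each unique block repeated consecutively) is an assumption, since the card only states the counts.

```python
class AffineBlock:
    """Stand-in for a transformer block: two scalars instead of attention + MLP."""

    def __init__(self, scale, shift):
        self.scale = scale
        self.shift = shift

    def __call__(self, x):
        return [self.scale * value + self.shift for value in x]

    def n_parameters(self):
        return 2


class SharedTrunk:
    """n_unique blocks, each applied n_shares times with the same weights."""

    def __init__(self, n_unique=4, n_shares=3):
        self.blocks = [AffineBlock(1.0 + 0.1 * i, 0.01 * i) for i in range(n_unique)]
        self.n_shares = n_shares

    def forward(self, x):
        applications = 0
        for block in self.blocks:
            for _ in range(self.n_shares):  # weights are reused, not copied
                x = block(x)
                applications += 1
        return x, applications

    def n_parameters(self):
        return sum(block.n_parameters() for block in self.blocks)


trunk = SharedTrunk(n_unique=4, n_shares=3)
output, effective_depth = trunk.forward([1.0, 2.0])
shared_parameters = trunk.n_parameters()
unshared_parameters = effective_depth * 2  # same depth without any sharing
```

In this toy, the forward pass runs 12 block applications while storing parameters for only 4 blocks, which is the trade the architecture table describes.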
## Training recipe (distillation from Base)
Mini was trained via **warm-start sliced distillation**: a novel recipe for compressing in-context-learning models that preserves real-data transfer ability.
**Three-stage recipe:**
1. **Warm-start by slicing.** Copy every-Nth layer from the Base model (26M) into Mini's smaller unique-block list. Non-layer modules (feature embeddings, normalization, heads) copy verbatim. This initializes Mini at ~v11.0-half quality — student starts with the teacher's transfer ability already.
2. **Distill via teacher logits.** Train Mini on synthetic SCM tasks using Base as a frozen teacher. Loss = 0.7 × KL(student || teacher, T=2) + 0.2 × hard-label CE + 0.1 × feature MSE. Online distillation with replay buffer.
3. **30,000 training steps** with AdamW, cosine lr 3e-5 → 3e-6, fp16 mixed precision.
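The loss in step 2 can be sketched for a single example as follows. This is an illustration under stated assumptions, not the training code: the function names are hypothetical, "feature MSE" is taken to mean MSE between student and teacher hidden features, the KL direction follows the formula as written above, and no T² rescaling of the KL term is applied because the recipe does not mention one.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_entropy(logits, label, eps=1e-12):
    """Hard-label cross-entropy at temperature 1."""
    return -math.log(softmax(logits)[label] + eps)


def distillation_loss(student_logits, teacher_logits, label,
                      student_feat, teacher_feat,
                      temperature=2.0, w_kl=0.7, w_ce=0.2, w_mse=0.1):
    """0.7 * KL(student || teacher, T=2) + 0.2 * CE + 0.1 * feature MSE."""
    s_soft = softmax(student_logits, temperature)
    t_soft = softmax(teacher_logits, temperature)
    kl = kl_divergence(s_soft, t_soft)
    ce = cross_entropy(student_logits, label)
    mse = sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / len(student_feat)
    return w_kl * kl + w_ce * ce + w_mse * mse


# Hypothetical logits and features for one query row.
teacher_logits = [2.0, 0.5, -1.0]
teacher_feat = [0.3, -0.2, 0.1]
loss_matched = distillation_loss(teacher_logits, teacher_logits, 0, teacher_feat, teacher_feat)
loss_untrained = distillation_loss([0.0, 0.0, 0.0], teacher_logits, 0, [0.0, 0.0, 0.0], teacher_feat)
```

When the student reproduces the teacher exactly, the KL and MSE terms vanish and only the weighted hard-label term remains, which is a quick sanity check for any implementation of this loss.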
The critical insight: distillation from scratch (Option A in our experiments) **failed to transfer to real OpenML data** — student matched teacher on synthetic but couldn't generalize. Warm-start sliced distillation (Option B, this release) succeeded because the student inherits the teacher's transfer ability as the starting point; distillation only needs to refine.
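The warm-start slicing step can be sketched on a plain dictionary of weights. The state layout, the key names, and the choice to start the stride at layer 0 are all assumptions for illustration; the 8-to-4 block counts follow the architecture section.

```python
def slice_teacher_layers(teacher_layers, n_student_blocks):
    """Keep every-Nth teacher layer so that n_student_blocks layers remain."""
    if n_student_blocks <= 0 or len(teacher_layers) % n_student_blocks != 0:
        raise ValueError("teacher depth must be a positive multiple of the student block count")
    stride = len(teacher_layers) // n_student_blocks
    return [teacher_layers[i] for i in range(0, len(teacher_layers), stride)]


def warm_start(teacher_state, n_student_blocks):
    """Build a student state: non-layer modules copied verbatim, layers sliced."""
    student_state = {name: weights for name, weights in teacher_state.items() if name != "layers"}
    student_state["layers"] = slice_teacher_layers(teacher_state["layers"], n_student_blocks)
    return student_state


# Hypothetical teacher state; strings stand in for weight tensors.
teacher_state = {
    "feature_embedding": "emb-weights",
    "final_norm": "norm-weights",
    "heads": "head-weights",
    "layers": [f"layer-{i}" for i in range(8)],
}
student_state = warm_start(teacher_state, n_student_blocks=4)
```

With these inputs the student keeps teacher layers 0, 2, 4, and 6 and shares every other module with the teacher, so distillation starts from weights that already carry the teacher's behavior.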
## Intended use, limitations, ethical considerations
Identical to [predictlm-base-26m](https://huggingface.co/zerooneresearch/predictlm-base-26m) — see that model card for full details:
- **Intended**: drop-in tabular predictor for ≤128 features, ≤1024 training rows, ≤10 classes
- **Not intended**: high-stakes decisions without domain validation; wide tables (>128 features); many-class cls (>10); very large training sets (>10K rows); non-numeric features without encoding
- **No personal data in training**: distilled from Base, which was trained on synthetic priors + cleared real-data copulas. No raw eval-set rows seen.
- **Bias inheritance**: predictions reflect the labeled context the user supplies at inference time
The known weaknesses (cls below XGBoost; below TabPFN-2.5 on both axes and below TabICLv2 on cls) are inherited from Base; Mini does not amplify them but cannot fix them either.
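A pre-flight check against the intended-use limits above might look like the sketch below. `check_table_limits` is a hypothetical helper, not part of the `predictlm` package, and it assumes the table is a list of equal-length rows.

```python
MAX_FEATURES = 128       # max_features from the architecture table
MAX_CONTEXT_ROWS = 1024  # max_context from the architecture table
MAX_CLASSES = 10         # max_classes from the architecture table


def check_table_limits(X_train, y_train=None, classification=False):
    """Return a list of limit violations; an empty list means the table fits."""
    problems = []
    n_rows = len(X_train)
    n_features = len(X_train[0]) if n_rows else 0
    if n_features > MAX_FEATURES:
        problems.append(f"{n_features} features exceeds the limit of {MAX_FEATURES}")
    if n_rows > MAX_CONTEXT_ROWS:
        problems.append(f"{n_rows} training rows exceeds the limit of {MAX_CONTEXT_ROWS}")
    if classification and y_train is not None:
        n_classes = len(set(y_train))
        if n_classes > MAX_CLASSES:
            problems.append(f"{n_classes} classes exceeds the limit of {MAX_CLASSES}")
    return problems


# Hypothetical tables for illustration.
small_table = [[0.0] * 5 for _ in range(10)]
wide_table = [[0.0] * 129 for _ in range(10)]
small_report = check_table_limits(small_table)
wide_report = check_table_limits(wide_table)
```

Running a check like this before calling `.fit()` makes it explicit when a table falls outside the range the model was evaluated on.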
## Reproducibility
- **Weights file**: `v11_06_tiny_final.pt` (inference-only, EMA-preferred state)
- **SHA-256**: `e27c8af6cda7a3426ffed33cb98eb8338966a8190712b5d37ff9e5f442b75a17`
- **Size**: 54.4 MB (inference-only, optimizer + curriculum + buffer + L2-SP state stripped from 217 MB raw)
- **Training step**: 30,000 (final)
- **Training seed**: 42
- **Teacher**: `predictlm-base-26m` (v11.0)
- **Distillation recipe**: warm-start slice + online KL distillation
- **Eval-lock manifest SHA-256**: `fe4da8cccfc78fc3c7746579f604154af7d37e525c4fd575965ba77ce4fe0841` (identical to Base)
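To check a downloaded weights file against the SHA-256 listed above, a short standard-library sketch is enough. The helper names are hypothetical, and the local path you pass in depends on where you saved the file.

```python
import hashlib

# SHA-256 of the weights file, copied from the list above.
EXPECTED_SHA256 = "e27c8af6cda7a3426ffed33cb98eb8338966a8190712b5d37ff9e5f442b75a17"


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so large checkpoints do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_weights(path, expected=EXPECTED_SHA256):
    """Return True only if the file's digest matches the published one."""
    return sha256_of_file(path) == expected.lower()
```

For example, `verify_weights("v11_06_tiny_final.pt")` should return `True` for an intact copy of the released file saved in the current directory.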
## Licensing
Apache 2.0 — see [LICENSE](./LICENSE). Permissive, commercial use allowed.
The distillation recipe uses our own [predictlm-base-26m](https://huggingface.co/zerooneresearch/predictlm-base-26m) (Apache 2.0) as the teacher — no third-party license obligations propagate to this model. Mini is fully commercially usable.
## Version
- **v11.0.6-tiny** (current) — first public release of the compact distilled variant.
- Sibling: [predictlm-base-26m](https://huggingface.co/zerooneresearch/predictlm-base-26m) (full-size, 26M)
- Future releases under the same `predictlm` Python package.
## Citation
### BibTeX
```bibtex
@misc{predictlm_mini_2026,
author = {ZeroOne Research},
title = {predictlm-mini-13m: a compact distilled tabular foundation model for commodity hardware},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/zerooneresearch/predictlm-mini-13m}}
}
```
### APA
ZeroOne Research. (2026). *predictlm-mini-13m: a compact distilled tabular foundation model for commodity hardware.* Hugging Face. https://huggingface.co/zerooneresearch/predictlm-mini-13m