---
license: apache-2.0
library_name: predictlm
pipeline_tag: tabular-classification
tags:
  - tabular
  - tabular-classification
  - tabular-regression
  - in-context-learning
  - foundation-model
  - prior-fitted-network
  - tabpfn-style
  - distilled
  - compact
metrics:
  - accuracy
  - r2
base_model: zerooneresearch/predictlm-base-26m
model-index:
  - name: predictlm-mini-13m
    results:
      - task:
          type: tabular-classification
          name: Tabular Classification
        dataset:
          type: openml
          name: Locked OpenML eval (CC-18 + AMLB + TabPFN-extras), fair-set n_features ≤ 128
        metrics:
          - type: accuracy
            value: 0.684
            name: mean accuracy (n=12, seed=42, fair-set n_features ≤ 128)
      - task:
          type: tabular-regression
          name: Tabular Regression
        dataset:
          type: openml
          name: Locked OpenML eval (CTR-23 + AMLB), fair-set n_features ≤ 128
        metrics:
          - type: r2
            value: 0.551
            name: mean R² (n=13, seed=42, fair-set n_features ≤ 128)
      - task:
          type: tabular-classification
          name: Tabular Classification (Duo + TTT recipe)
        dataset:
          type: openml
          name: Locked OpenML eval (CC-18 + AMLB + TabPFN-extras), fair-set n_features ≤ 128
        metrics:
          - type: accuracy
            value: 0.751
            name: mean accuracy with Duo + TTT recipe (Mini + Base + test-time training)
      - task:
          type: tabular-regression
          name: Tabular Regression (Duo + TTT recipe)
        dataset:
          type: openml
          name: Locked OpenML eval (CTR-23 + AMLB), fair-set n_features ≤ 128
        metrics:
          - type: r2
            value: 0.609
            name: mean R² with Duo + TTT recipe (Mini + Base + test-time training)
---

# predictlm-mini-13m

A 13.5M-parameter **distilled tabular foundation model**. Half the parameters of [PredictLM Base (26M)](https://huggingface.co/zerooneresearch/predictlm-base-26m); **statistically tied with Base on classification accuracy** and within ~4 pp R² on regression.

This is the **compact deployment variant** of PredictLM, designed to run inference on any modern laptop or commodity GPU. Same single-forward-pass in-context-learning API as Base, same architecture family — just smaller, distilled, and re-trainable on hardware most teams already have.

## Getting started — the published 0.751 cls / 0.609 reg recipe, by default

```bash
pip install predictlm
```

```python
from predictlm import PredictLM

model = PredictLM.from_pretrained("zerooneresearch/predictlm-mini-13m")  # cpu / mps / cuda all OK

# Regression — pass float y, get continuous predictions
preds = model.fit(X_train_reg, y_train_reg).predict(X_test_reg)

# Classification — same model, same API; auto-routed via y_train dtype
preds = model.fit(X_train_cls, y_train_cls).predict(X_test_cls)
probs = model.predict_proba(X_test_cls)
```

That's it. On the first `.predict()` call the package silently downloads its partner checkpoint (`predictlm-base-26m`) and forms the published **Duo + TTT** ensemble under the hood. This is the configuration that scores **0.751 cls / 0.609 reg** on the locked 25-dataset OpenML eval. You never manage the ensemble yourself, and the partner is cached in `~/.cache/huggingface/`.

| Recipe (chosen via `auto_duo=` flag) | cls mean acc | reg mean R² |
|---|:---:|:---:|
| Default `.predict()` (Duo + TTT under the hood) | **0.751** | **0.609** |
| `auto_duo=False` (Mini-only, zero-tuning) | 0.673 | 0.536 |
| `auto_duo=False` + `fit_and_predict_with_ttt()` (Mini-only TTT) | 0.742 | 0.595 |

**Edge cases:**

- **No internet / air-gapped.** Pass `auto_duo=False` at load to disable partner download — `.predict()` returns the single-model in-context result.
- **Real-time inference** (<10 ms latency)? Use `auto_duo=False` zero-tuning. Duo + TTT adds ~1-60 s per query depending on table size.

**TTT** ([Test-Time Training](https://arxiv.org/abs/2503.11842)) runs ~15 inner Adam steps of self-supervised fine-tuning on the user's in-context examples before predicting, which adds per-task specialization on top of the generic ICL prior. 19 of 20 datasets improved vs zero-tuning; no dataset regressed by more than 0.006.
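
As a rough illustration of what "a few inner Adam steps on the context rows" means, the sketch below adapts a toy two-parameter linear model. Everything in it is an assumption made for illustration: the linear model stands in for the transformer, plain MSE on the labeled context stands in for PredictLM's actual TTT objective (which this card does not specify), and `adam_finetune`, the learning rate, and the data are hypothetical. It is not the package's implementation.

```python
import math

def adam_finetune(params, grad_fn, steps=15, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Run a fixed number of Adam steps and return updated copies of the parameters."""
    params = list(params)          # copy so the prior stays untouched
    m = [0.0] * len(params)        # first-moment estimates
    v = [0.0] * len(params)        # second-moment estimates
    for t in range(1, steps + 1):
        grads = grad_fn(params)
        for i, g in enumerate(grads):
            m[i] = b1 * m[i] + (1 - b1) * g
            v[i] = b2 * v[i] + (1 - b2) * g * g
            m_hat = m[i] / (1 - b1 ** t)   # bias correction
            v_hat = v[i] / (1 - b2 ** t)
            params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params

def mse(params, xs, ys):
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_grad(params, xs, ys):
    w, b = params
    n = len(xs)
    gw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    gb = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return [gw, gb]

# Hypothetical in-context examples (toy data, not from any benchmark).
context_x = [0.0, 1.0, 2.0, 3.0]
context_y = [1.0, 3.0, 5.0, 7.0]

prior = [0.0, 0.0]  # stands in for the generic pretrained weights
adapted = adam_finetune(prior, lambda p: mse_grad(p, context_x, context_y))

loss_before = mse(prior, context_x, context_y)
loss_after = mse(adapted, context_x, context_y)
print(f"context loss: {loss_before:.3f} -> {loss_after:.3f}")
```

The point of the sketch is only the control flow: start from generic weights, take a small fixed number of optimizer steps on the user's own context, then predict with the adapted copy.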

PredictLM's TTT is an independent implementation of the published technique. This repo does not include or derive from TabPFN code or weights — PredictLM weights are trained from scratch (Mini distilled from PredictLM-Base) and shipped under Apache-2.0.

## Developers and affiliations

- **Developed by**: ZeroOne Research
- **Distilled from**: [predictlm-base-26m](https://huggingface.co/zerooneresearch/predictlm-base-26m) (v11.0)
- **Model card contact**: message the org on the Hub
- **License**: Apache 2.0 — permissive, commercial use allowed

## Why Mini (when to prefer this over Base)

- **GPU memory budget < 8 GB at inference** — Mini fits comfortably on a consumer GPU or M-series MPS
- **You want to re-distill / fine-tune yourself** — Mini's training recipe runs on a single consumer GPU; Base requires an A100/H100
- **You want a smaller artifact to ship inside a product** — 55 MB inference weights vs Base's 105 MB
- **You're running many concurrent inference jobs** — 4× as many parallel Mini instances fit per GPU vs Base
- **You can tolerate ~4 pp lower regression R²** (95% CI [-6.5, -1.5] pp; classification accuracy is statistically tied with Base)

Prefer **Base** instead if you have an A100/H100, value the last ~4 pp of regression accuracy, and don't need to re-distill.

## Performance benchmarks

### Locked OpenML eval (held-out, contamination-audited)

Same 30-dataset stratified sample, seed=42, fair-set filter `n_features ≤ 128`, 4-way comparison. Same eval pipeline as Base (`scripts/eval_v11.py`).

| | reg-R² (n=13) | cls-acc (n=12) |
|---|:---:|:---:|
| predictlm-base-26m (teacher) | +0.589 | 0.685 |
| **predictlm-mini-13m (this model, 13.5M)** | **+0.551** | **0.684** |
| XGBoost (200 trees, depth 6) | +0.516 | 0.743 |
| TabPFN-2.5 (hosted, ~100M, non-commercial license) | +0.662 | 0.780 |
| TabICLv2 (open, BSD-3, ~50M) | *(cls-only)* | 0.792 |

### Paired-bootstrap 95% CIs (10,000 resamples, seed=42)

Per-dataset deltas (predictlm-mini-13m minus baseline):

| comparison | mean Δ | 95% CI | n | significant? |
|---|:---:|:---:|:---:|:---:|
| **Mini vs Base (compression cost)** | | | | |
| Reg R² | **-0.038** | [-0.065, -0.015] | 13 | ✅ real (~4 pp loss) |
| Cls acc | **-0.001** | [-0.027, +0.029] | 12 | ✅ **statistical tie** |
| **vs other peers (Mini)** | | | | |
| Reg vs XGBoost | +0.035 | [-0.076, +0.158] | 13 | within noise |
| Reg vs TabPFN-2.5 | -0.111 | [-0.152, -0.067] | 13 | ✅ significant loss |
| Cls vs XGBoost | -0.059 | [-0.089, -0.031] | 12 | ✅ significant loss |
| Cls vs TabPFN-2.5 | -0.097 | [-0.132, -0.059] | 12 | ✅ significant loss |
| Cls vs TabICLv2 | -0.109 | [-0.147, -0.069] | 12 | ✅ significant loss |
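
A minimal sketch of how a paired-bootstrap interval like the ones above could be computed, assuming a percentile bootstrap over per-dataset deltas. The function names and the example deltas are hypothetical (the deltas are made-up numbers, not values from the locked eval), and the actual eval script may differ in its details.

```python
import random

def paired_bootstrap_ci(deltas, n_resamples=10_000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean of paired per-dataset deltas."""
    rng = random.Random(seed)
    n = len(deltas)
    means = []
    for _ in range(n_resamples):
        # Resample datasets with replacement; each delta is already a paired difference.
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(deltas) / n, lo, hi

def excludes_zero(lo, hi):
    """True when the interval lies entirely on one side of zero."""
    return lo > 0 or hi < 0

# Hypothetical per-dataset deltas (model A minus model B), for illustration only.
example_deltas = [-0.05, 0.02, -0.01, 0.03, -0.04, 0.00]
mean_delta, ci_lo, ci_hi = paired_bootstrap_ci(example_deltas)
print(f"mean={mean_delta:+.3f}  95% CI=[{ci_lo:+.3f}, {ci_hi:+.3f}]  "
      f"significant={excludes_zero(ci_lo, ci_hi)}")
```

Under this reading, a row counts as a significant difference when its interval excludes zero, and as "within noise" or a tie when the interval crosses zero.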

**Retention vs Base — the headline compression story:**
- **Classification: statistical tie** with Base (delta -0.001, CI [-0.027, +0.029]). At half the parameters, Mini is indistinguishable from the 26M teacher on classification accuracy.
- **Regression: ~4 pp R² cost** vs Base, 95% CI [-6.5, -1.5] pp (statistically real but small).

**Honest read on the peer comparisons.** Like Base, Mini's regression-vs-XGBoost point estimate is positive (+3.5 pp), but the 95% CI on this 13-dataset sample crosses zero. We can't claim a statistically significant XGBoost win on regression from this single-seed eval. What we *can* say: Mini and XGBoost are competitive on regression on this benchmark, with Mini's mean R² slightly higher (+0.551 vs +0.516).

**Significant losses (real, not noise):** Mini loses to XGBoost on classification (-5.9 pp), to TabPFN-2.5 on both axes, and to TabICLv2 on classification (TabICLv2 is cls-only in this eval). TabPFN-2.5 and TabICLv2 are commercial / SOTA models at roughly 4-7× Mini's parameter count.

### Model size vs accuracy

| model | params | params (%) | reg-R² | cls-acc |
|---|:---:|:---:|:---:|:---:|
| TabPFN-2.5 | ~100M | 740% | 0.662 | 0.780 |
| TabICLv2 | ~50M | 370% | — | 0.792 |
| **predictlm-base-26m** | 26M | 192% | 0.589 | 0.685 |
| **predictlm-mini-13m** | 13.5M | 100% (baseline) | 0.551 | 0.684 |

Mini is the smallest open-source ICL tabular FM in this comparison and the only one that trains on a single commodity GPU.

## Architecture

Identical architecture family to PredictLM Base, with cross-layer parameter sharing (ALBERT-style) to halve the trunk parameter count.

| field | value |
|---|---|
| Parameters | 13.5 M |
| Layers (effective depth) | 12 (4 unique × 3 shares — ALBERT-style sharing in shared trunk; 2 unique × 2 shares per task head) |
| d_model | 256 |
| n_heads | 8 |
| max_features | 128 |
| max_classes | 10 |
| max_context | 1024 |
| max_query | 256 |
| Regression head | BarDistribution, 1024 bins (bins identical to Base — required for KL distillation) |
| Classification head | Per-task masked softmax |
| Attention | row-axis transformer (same as Base) |
| Inference precision | fp16 (T4-compatible — Base uses bf16 on A100/H100) |

Cross-layer sharing means Mini has 4 unique trunk blocks each applied 3 times during forward pass (vs Base's 8 unique blocks each applied once). The effective compute graph depth is preserved; only the parameter count is halved.
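
To make the sharing pattern concrete, here is a toy sketch in which 4 unique blocks produce an effective depth of 12. The `Block` class is a scalar stand-in for a transformer block, and the ordering (each unique block applied 3 times in a row) is an assumption; the card states only that each block is applied 3 times, not in which order.

```python
class Block:
    """Toy stand-in for a transformer block: one scalar weight and bias."""

    def __init__(self, weight, bias):
        self.weight = weight
        self.bias = bias
        self.calls = 0  # how many times this block ran in the forward pass

    def __call__(self, x):
        self.calls += 1
        return self.weight * x + self.bias

def build_schedule(n_unique, n_shares):
    # Assumed ordering: block 0 three times, then block 1 three times, and so on.
    return [i for i in range(n_unique) for _ in range(n_shares)]

def forward(blocks, schedule, x):
    # The same block object (same parameters) is reused at several depths.
    for idx in schedule:
        x = blocks[idx](x)
    return x

blocks = [Block(1.0, 0.1 * (i + 1)) for i in range(4)]
schedule = build_schedule(n_unique=4, n_shares=3)
out = forward(blocks, schedule, 0.0)

print("effective depth:", len(schedule))
print("unique parameter sets:", len(blocks))
```

Only the 4 block objects hold parameters, so the stored weights scale with the number of unique blocks while the forward pass still runs 12 block applications.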

## Training recipe (distillation from Base)

Mini was trained via **warm-start sliced distillation**: a novel recipe for compressing in-context-learning models that preserves real-data transfer ability.

**Three-stage recipe:**
1. **Warm-start by slicing.** Copy every-Nth layer from the Base model (26M) into Mini's smaller unique-block list. Non-layer modules (feature embeddings, normalization, heads) copy verbatim. This initializes Mini at ~v11.0-half quality — student starts with the teacher's transfer ability already.
2. **Distill via teacher logits.** Train Mini on synthetic SCM tasks using Base as a frozen teacher. Loss = 0.7 × KL(student || teacher, T=2) + 0.2 × hard-label CE + 0.1 × feature MSE. Online distillation with replay buffer.
3. **30,000 training steps** with AdamW, cosine lr 3e-5 → 3e-6, fp16 mixed precision.
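
The weighted loss in step 2 can be sketched in plain Python for a single example. This is an illustration under assumptions, not the training code: the function names are hypothetical, the temperature is applied only to the KL term, the KL direction follows the formula as written above (student || teacher), and no T² rescaling is applied because the recipe does not mention one.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def distillation_loss(student_logits, teacher_logits, label,
                      student_feat, teacher_feat,
                      temperature=2.0, w_kl=0.7, w_ce=0.2, w_mse=0.1):
    # Soft-target term: temperature-softened student vs frozen teacher.
    p_student = softmax(student_logits, temperature)
    p_teacher = softmax(teacher_logits, temperature)
    kl = kl_divergence(p_student, p_teacher)
    # Hard-label cross-entropy on the student's unsoftened distribution.
    ce = -math.log(softmax(student_logits)[label])
    # Feature-matching term: mean squared error between hidden features.
    mse = sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / len(student_feat)
    return w_kl * kl + w_ce * ce + w_mse * mse

# Hypothetical logits and features for one 3-class example.
loss = distillation_loss(
    student_logits=[2.0, 0.5, -1.0],
    teacher_logits=[2.5, 0.3, -1.2],
    label=0,
    student_feat=[0.1, 0.2],
    teacher_feat=[0.0, 0.4],
)
print(f"combined loss: {loss:.4f}")
```

When the student matches the teacher exactly, the KL and feature terms vanish and only the hard-label term remains, which is a quick sanity check for any implementation of this loss.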

The critical insight: distillation from scratch (Option A in our experiments) **failed to transfer to real OpenML data** — student matched teacher on synthetic but couldn't generalize. Warm-start sliced distillation (Option B, this release) succeeded because the student inherits the teacher's transfer ability as the starting point; distillation only needs to refine.
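
The warm-start step can be sketched as a simple slice over the teacher's block list. This is a hedged illustration: the dictionary layout, the key names, and the choice to start the stride at block 0 are all assumptions, and `warm_start` is a hypothetical helper rather than a function from the `predictlm` package.

```python
import copy

def warm_start(teacher_state, n_student_unique):
    """Build a student state by copying every Nth teacher block."""
    layers = teacher_state["layers"]
    if len(layers) % n_student_unique != 0:
        raise ValueError("teacher depth must be a multiple of the student's unique blocks")
    stride = len(layers) // n_student_unique
    # Non-layer modules (embeddings, normalization, heads) are copied verbatim.
    student = {k: copy.deepcopy(v) for k, v in teacher_state.items() if k != "layers"}
    # Assumed offset: keep blocks 0, stride, 2*stride, ...
    student["layers"] = [copy.deepcopy(layers[i]) for i in range(0, len(layers), stride)]
    return student

# Toy teacher with 8 blocks, mirroring the 8 unique blocks described for Base.
teacher = {
    "embed": [0.1, 0.2],
    "norm": [1.0],
    "head": [0.5],
    "layers": [{"w": float(i)} for i in range(8)],
}
student = warm_start(teacher, n_student_unique=4)
print([layer["w"] for layer in student["layers"]])
```

Deep copies keep the student's weights independent of the frozen teacher, so later distillation updates do not leak back into the teacher state.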

## Intended use, limitations, ethical considerations

Identical to [predictlm-base-26m](https://huggingface.co/zerooneresearch/predictlm-base-26m) — see that model card for full details:

- **Intended**: drop-in tabular predictor for ≤128 features, ≤1024 training rows, ≤10 classes
- **Not intended**: high-stakes decisions without domain validation; wide tables (>128 features); many-class cls (>10); very large training sets (>10K rows); non-numeric features without encoding
- **No personal data in training**: distilled from Base, which was trained on synthetic priors + cleared real-data copulas. No raw eval-set rows seen.
- **Bias inheritance**: predictions reflect the labeled context the user supplies at inference time
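
A small pre-flight check can catch tables that fall outside the intended range before they reach the model. The helper below is hypothetical (it is not part of the `predictlm` package); its default limits are the ones stated in the list above.

```python
def check_within_limits(n_rows, n_features, n_classes=None,
                        max_features=128, max_context=1024, max_classes=10):
    """Return a list of human-readable problems; an empty list means the table fits."""
    problems = []
    if n_features > max_features:
        problems.append(f"{n_features} features exceeds the limit of {max_features}")
    if n_rows > max_context:
        problems.append(f"{n_rows} training rows exceeds the limit of {max_context}")
    # n_classes is None for regression tasks, so the class check is skipped.
    if n_classes is not None and n_classes > max_classes:
        problems.append(f"{n_classes} classes exceeds the limit of {max_classes}")
    return problems

print(check_within_limits(n_rows=500, n_features=20, n_classes=3))
print(check_within_limits(n_rows=5000, n_features=200, n_classes=12))
```

For tables that fail the check, subsampling rows or selecting features before calling `.fit()` is one option, though the effect on accuracy would need to be validated per dataset.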

The known weaknesses (cls below XGBoost; below TabPFN-2.5 on both axes and below TabICLv2 on cls) are inherited from Base; Mini does not amplify them but cannot fix them either.

## Reproducibility

- **Weights file**: `v11_06_tiny_final.pt` (inference-only, EMA-preferred state)
- **SHA-256**: `e27c8af6cda7a3426ffed33cb98eb8338966a8190712b5d37ff9e5f442b75a17`
- **Size**: 54.4 MB (inference-only, optimizer + curriculum + buffer + L2-SP state stripped from 217 MB raw)
- **Training step**: 30,000 (final)
- **Training seed**: 42
- **Teacher**: `predictlm-base-26m` (v11.0)
- **Distillation recipe**: warm-start slice + online KL distillation
- **Eval-lock manifest SHA-256**: `fe4da8cccfc78fc3c7746579f604154af7d37e525c4fd575965ba77ce4fe0841` (identical to Base)
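
To confirm that a downloaded weights file matches the digest listed above, a check along these lines should work with only the standard library. The helper names are hypothetical, and the local path is an assumption about where the file was saved.

```python
import hashlib

# Digest copied from the "SHA-256" entry in the list above.
EXPECTED_SHA256 = "e27c8af6cda7a3426ffed33cb98eb8338966a8190712b5d37ff9e5f442b75a17"

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoints do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_weights(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA-256 matches the expected digest."""
    return sha256_of_file(path) == expected
```

For example, `verify_weights("v11_06_tiny_final.pt")` would return `True` for an intact copy of the weights file saved under that name in the working directory.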

## Licensing

Apache 2.0 — see [LICENSE](./LICENSE). Permissive, commercial use allowed.

The distillation recipe uses our own [predictlm-base-26m](https://huggingface.co/zerooneresearch/predictlm-base-26m) (Apache 2.0) as the teacher — no third-party license obligations propagate to this model. Mini is fully commercially usable.

## Version

- **v11.0.6-tiny** (current) — first public release of the compact distilled variant.
- Sibling: [predictlm-base-26m](https://huggingface.co/zerooneresearch/predictlm-base-26m) (full-size, 26M)
- Future releases under the same `predictlm` Python package.

## Citation

### BibTeX

```bibtex
@misc{predictlm_mini_2026,
  author       = {ZeroOne Research},
  title        = {predictlm-mini-13m: a compact distilled tabular foundation model for commodity hardware},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/zerooneresearch/predictlm-mini-13m}}
}
```

### APA

ZeroOne Research. (2026). *predictlm-mini-13m: a compact distilled tabular foundation model for commodity hardware.* Hugging Face. https://huggingface.co/zerooneresearch/predictlm-mini-13m