	---
	license: apache-2.0
	library_name: predictlm
	pipeline_tag: tabular-classification
	tags:
	- tabular
	- tabular-classification
	- tabular-regression
	- in-context-learning
	- foundation-model
	- prior-fitted-network
	- tabpfn-style
	- distilled
	- compact
	metrics:
	- accuracy
	- r2
	base_model: zerooneresearch/predictlm-base-26m
	model-index:
	- name: predictlm-mini-13m
	  results:
	  - task:
	      type: tabular-classification
	      name: Tabular Classification
	    dataset:
	      type: openml
	      name: Locked OpenML eval (CC-18 + AMLB + TabPFN-extras), fair-set n_features ≤ 128
	    metrics:
	    - type: accuracy
	      value: 0.684
	      name: mean accuracy (n=12, seed=42, fair-set n_features ≤ 128)
	  - task:
	      type: tabular-regression
	      name: Tabular Regression
	    dataset:
	      type: openml
	      name: Locked OpenML eval (CTR-23 + AMLB), fair-set n_features ≤ 128
	    metrics:
	    - type: r2
	      value: 0.551
	      name: mean R² (n=13, seed=42, fair-set n_features ≤ 128)
	  - task:
	      type: tabular-classification
	      name: Tabular Classification (Duo + TTT recipe)
	    dataset:
	      type: openml
	      name: Locked OpenML eval (CC-18 + AMLB + TabPFN-extras), fair-set n_features ≤ 128
	    metrics:
	    - type: accuracy
	      value: 0.751
	      name: mean accuracy with Duo + TTT recipe (Mini + Base + test-time training)
	  - task:
	      type: tabular-regression
	      name: Tabular Regression (Duo + TTT recipe)
	    dataset:
	      type: openml
	      name: Locked OpenML eval (CTR-23 + AMLB), fair-set n_features ≤ 128
	    metrics:
	    - type: r2
	      value: 0.609
	      name: mean R² with Duo + TTT recipe (Mini + Base + test-time training)
	---

	# predictlm-mini-13m

	A 13.5M-parameter distilled tabular foundation model. Half the parameters of [PredictLM Base (26M)](https://huggingface.co/zerooneresearch/predictlm-base-26m); statistically tied with Base on classification accuracy and within ~4 pp R² on regression.

	This is the compact deployment variant of PredictLM, designed to run inference on any modern laptop or commodity GPU. Same single-forward-pass in-context-learning API as Base, same architecture family — just smaller, distilled, and re-trainable on hardware most teams already have.

	## Getting started — the published 0.751 cls / 0.609 reg recipe, by default

	```bash
	pip install predictlm
	```

	```python
	from predictlm import PredictLM

	model = PredictLM.from_pretrained("zerooneresearch/predictlm-mini-13m") # cpu / mps / cuda all OK

	# X_* / y_* below are placeholders for your own train/test data.
	# Regression: pass float y, get continuous predictions
	reg_preds = model.fit(X_train_reg, y_train_reg).predict(X_test_reg)

	# Classification: same model, same API; auto-routed via y_train dtype
	cls_preds = model.fit(X_train_cls, y_train_cls).predict(X_test_cls)
	cls_probs = model.predict_proba(X_test_cls)
	```

	That's it. On the first `.predict()` call the package downloads its partner checkpoint (`predictlm-base-26m`) without prompting, forms the published Duo + TTT ensemble under the hood, and predicts with the recipe that scores 0.751 cls / 0.609 reg on the locked 25-dataset OpenML eval. You never manage the ensemble; the partner is cached in `~/.cache/huggingface/`.

	| Recipe (chosen via `auto_duo=` flag) | cls mean acc | reg mean R² |
	|---|:---:|:---:|
	| Default `.predict()` (Duo + TTT under the hood) | 0.751 | 0.609 |
	| `auto_duo=False` (Mini-only, zero-tuning) | 0.673 | 0.536 |
	| `auto_duo=False` + `fit_and_predict_with_ttt()` (Mini-only TTT) | 0.742 | 0.595 |

	Edge cases:

	- No internet / air-gapped: pass `auto_duo=False` at load to disable the partner download; `.predict()` then returns the single-model in-context result.
	- Real-time inference (<10 ms latency): use `auto_duo=False` zero-tuning. Duo + TTT adds ~1-60 s per query depending on table size.
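
The two edge cases above boil down to one decision at load time. A minimal sketch of that decision follows; `choose_load_options` and `LoadOptions` are hypothetical helpers written for illustration (they are not part of the `predictlm` package), and the 1 s threshold is simply the lower end of the "~1-60 s per query" range quoted above.

```python
from dataclasses import dataclass

# Lower bound of the "~1-60 s per query" Duo + TTT overhead quoted above.
DUO_TTT_MIN_OVERHEAD_S = 1.0


@dataclass(frozen=True)
class LoadOptions:
    auto_duo: bool  # value to pass as auto_duo=... at load time
    reason: str     # human-readable explanation, handy for logs


def choose_load_options(online: bool, latency_budget_s: float) -> LoadOptions:
    """Pick the auto_duo setting from connectivity and a per-query latency budget."""
    if not online:
        return LoadOptions(False, "air-gapped: partner checkpoint cannot be downloaded")
    if latency_budget_s < DUO_TTT_MIN_OVERHEAD_S:
        return LoadOptions(False, "latency budget is below the Duo + TTT overhead")
    return LoadOptions(True, "default Duo + TTT recipe")


# A 10 ms budget rules out Duo + TTT even with network access.
opts = choose_load_options(online=True, latency_budget_s=0.010)
print(opts)
```

The resulting flag would then be passed at load, for example `PredictLM.from_pretrained("zerooneresearch/predictlm-mini-13m", auto_duo=opts.auto_duo)`.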

	TTT ([Test-Time Training](https://arxiv.org/abs/2503.11842)) runs ~15 inner Adam steps of self-supervised fine-tuning on the user's in-context examples before predicting, which adds per-task specialization on top of a generic ICL prior. 19 / 20 datasets improved vs zero-tuning; no dataset regressed by more than 0.006.
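
To make the "fixed number of inner Adam steps on the context" idea concrete, here is a toy illustration. It is not PredictLM's implementation: the transformer is swapped for a two-parameter linear model, and plain squared error on the context rows stands in for the self-supervised objective, which this card does not specify. All names are hypothetical.

```python
import math


def mse(params, xs, ys):
    """Mean squared error of the linear model y = w * x + b on the context rows."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grad_mse(params, xs, ys):
    """Analytic gradient of mse with respect to (w, b)."""
    w, b = params
    n = len(xs)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return [gw, gb]


def adapt_with_adam(params, xs, ys, steps=15, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run a fixed number of Adam steps on the context set; return adapted params."""
    params = list(params)  # copy, so the generic starting point stays untouched
    m = [0.0] * len(params)
    v = [0.0] * len(params)
    for t in range(1, steps + 1):
        g = grad_mse(params, xs, ys)
        for i in range(len(params)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)  # bias-corrected first moment
            v_hat = v[i] / (1 - beta2 ** t)  # bias-corrected second moment
            params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params


prior = [1.0, 0.0]                 # generic starting parameters (w, b)
context_x = [0.0, 1.0, 2.0, 3.0]
context_y = [1.0, 3.0, 5.0, 7.0]   # generated by y = 2x + 1
adapted = adapt_with_adam(prior, context_x, context_y)
loss_before = mse(prior, context_x, context_y)
loss_after = mse(adapted, context_x, context_y)
print(loss_before, loss_after)
```

The starting parameters are left unchanged, so each new task adapts from the same generic point.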

	PredictLM's TTT is an independent implementation of the published technique. This repo does not include or derive from TabPFN code or weights — PredictLM weights are trained from scratch (Mini distilled from PredictLM-Base) and shipped under Apache-2.0.

	## Developers and affiliations

	- Developed by: ZeroOne Research
	- Distilled from: [predictlm-base-26m](https://huggingface.co/zerooneresearch/predictlm-base-26m) (v11.0)
	- Model card contact: message the org on the Hub
	- License: Apache 2.0 — permissive, commercial use allowed

	## Why Mini (when to prefer this over Base)

	- GPU memory budget < 8 GB at inference — Mini fits comfortably on a consumer GPU or M-series MPS
	- You want to re-distill / fine-tune yourself — Mini's training recipe runs on a single consumer GPU; Base requires an A100/H100
	- You want a smaller artifact to ship inside a product — 55 MB inference weights vs Base's 105 MB
	- You're running many concurrent inference jobs — 4× as many parallel Mini instances fit per GPU vs Base
	- You can tolerate ~4 pp lower regression R² (95% CI [-6.5, -1.5] pp; cls accuracy is statistically tied with Base)

	Prefer Base instead if you have an A100/H100, value the last ~4 pp of regression accuracy, and don't need to re-distill.

	## Performance benchmarks

	### Locked OpenML eval (held-out, contamination-audited)

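	Same 30-dataset stratified sample, seed=42, fair-set filter `n_features ≤ 128`; the results below cover n=13 regression and n=12 classification datasets. Mini is compared against four baselines with the same eval pipeline as Base (`scripts/eval_v11.py`).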

	| model | reg-R² (n=13) | cls-acc (n=12) |
	|---|:---:|:---:|
	| predictlm-base-26m (teacher) | +0.589 | 0.685 |
	| predictlm-mini-13m (this model, 13.5M) | +0.551 | 0.684 |
	| XGBoost (200 trees, depth 6) | +0.516 | 0.743 |
	| TabPFN-2.5 (hosted, ~100M, non-commercial license) | +0.662 | 0.780 |
	| TabICLv2 (open, BSD-3, ~50M) | (cls-only) | 0.792 |

	### Paired-bootstrap 95% CIs (10,000 resamples, seed=42)

	Per-dataset deltas (predictlm-mini-13m minus baseline):

	| comparison | mean Δ | 95% CI | n | significant? |
	|---|:---:|:---:|:---:|:---:|
	| Mini vs Base (compression cost) | | | | |
	| Reg R² | -0.038 | [-0.065, -0.015] | 13 | ✅ real (~4 pp loss) |
	| Cls acc | -0.001 | [-0.027, +0.029] | 12 | ✅ statistical tie |
	| vs other peers (Mini) | | | | |
	| Reg vs XGBoost | +0.035 | [-0.076, +0.158] | 13 | within noise |
	| Reg vs TabPFN-2.5 | -0.111 | [-0.152, -0.067] | 13 | ✅ significant loss |
	| Cls vs XGBoost | -0.059 | [-0.089, -0.031] | 12 | ✅ significant loss |
	| Cls vs TabPFN-2.5 | -0.097 | [-0.132, -0.059] | 12 | ✅ significant loss |
	| Cls vs TabICLv2 | -0.109 | [-0.147, -0.069] | 12 | ✅ significant loss |
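
For readers who want to reproduce intervals of this kind on their own per-dataset scores, a paired percentile bootstrap can be sketched as follows. This is a minimal sketch, not the eval script itself: the function name is hypothetical, the percentile method is an assumption about how the intervals were formed, and the `deltas` values are made-up placeholders rather than the actual per-dataset results.

```python
import random
import statistics


def paired_bootstrap_ci(deltas, n_resamples=10_000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean of paired per-dataset deltas."""
    rng = random.Random(seed)
    n = len(deltas)
    means = []
    for _ in range(n_resamples):
        # Resample datasets (not rows) with replacement, keeping each pair intact.
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = round((alpha / 2) * n_resamples)
    hi_idx = n_resamples - 1 - lo_idx
    return statistics.fmean(deltas), means[lo_idx], means[hi_idx]


# Placeholder deltas (model A score minus model B score on 13 datasets).
deltas = [-0.06, -0.01, -0.04, 0.01, -0.05, -0.03, -0.02,
          -0.07, 0.00, -0.04, -0.05, -0.03, -0.06]
mean_delta, ci_low, ci_high = paired_bootstrap_ci(deltas)
significant = ci_low > 0 or ci_high < 0  # interval excludes zero
print(mean_delta, ci_low, ci_high, significant)
```

A comparison counts as significant in this sketch only when the whole interval sits on one side of zero, which matches how the "within noise" row is read above.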

	Retention vs Base — the headline compression story:
	- Classification: statistical tie with Base (delta -0.001, CI [-0.027, +0.029]). At half the parameters, Mini is indistinguishable from the 26M teacher on classification accuracy.
	- Regression: ~4 pp R² cost vs Base, 95% CI [-6.5, -1.5] pp (statistically real but small).

	Honest read on the peer comparisons: like Base, Mini's regression-vs-XGBoost point estimate is positive (+3.5 pp), but the 95% CI on this 13-dataset sample crosses zero. We can't claim a statistically significant XGBoost win on regression from this single-seed eval. What we can say is that Mini and XGBoost are competitive on regression on this benchmark, with Mini slightly ahead on most datasets.

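	Significant losses (real, not noise): Mini loses to XGBoost on classification (-5.9 pp), to TabPFN-2.5 on both regression and classification, and to TabICLv2 on classification (TabICLv2 has no regression result here). TabPFN-2.5 (~100M) and TabICLv2 (~50M) are commercial / SOTA models at roughly 7× and 4× Mini's parameter count.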

	### Model size vs accuracy

	| model | params | params (% of Mini) | reg-R² | cls-acc |
	|---|:---:|:---:|:---:|:---:|
	| TabPFN-2.5 | ~100M | 740% | 0.662 | 0.780 |
	| TabICLv2 | ~50M | 370% | n/a (cls-only) | 0.792 |
	| predictlm-base-26m | 26M | 192% | 0.589 | 0.685 |
	| predictlm-mini-13m | 13.5M | 100% (baseline) | 0.551 | 0.684 |

	Mini is the smallest open-source ICL tabular FM in this comparison and the only one that trains on a single commodity GPU.

	## Architecture

	Identical architecture family to PredictLM Base, with cross-layer parameter sharing (ALBERT-style) to halve the trunk parameter count.

	| field | value |
	|---|---|
	| Parameters | 13.5 M |
	| Layers (effective depth) | 12 (4 unique × 3 shares, ALBERT-style sharing in the shared trunk; 2 unique × 2 shares per task head) |
	| d_model | 256 |
	| n_heads | 8 |
	| max_features | 128 |
	| max_classes | 10 |
	| max_context | 1024 |
	| max_query | 256 |
	| Regression head | BarDistribution, 1024 bins (bins identical to Base, as required for KL distillation) |
	| Classification head | Per-task masked softmax |
	| Attention | row-axis transformer (same as Base) |
	| Inference precision | fp16 (T4-compatible; Base uses bf16 on A100/H100) |

	Cross-layer sharing means Mini has 4 unique trunk blocks, each applied 3 times during the forward pass (vs Base's 8 unique blocks, each applied once). The effective compute graph depth is preserved; only the parameter count is halved.
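
The sharing pattern can be illustrated with a toy trunk in plain Python. This is a sketch under stated assumptions, not the model code: each "block" is reduced to a single scalar weight, and the grouped application order (0,0,0,1,1,1,...) is a guess, since the card does not say whether blocks are repeated back to back or cycled.

```python
def build_schedule(n_unique, n_shares, grouped=True):
    """Map each effective layer to the index of the unique block it reuses."""
    if grouped:
        # 0,0,0,1,1,1,... each block repeated back to back
        return [u for u in range(n_unique) for _ in range(n_shares)]
    # 0,1,2,3,0,1,2,3,... blocks cycled
    return [u for _ in range(n_shares) for u in range(n_unique)]


class SharedTrunk:
    """Toy trunk: few stored weights, many applications."""

    def __init__(self, block_weights, n_shares):
        self.block_weights = list(block_weights)  # one scalar per unique block
        self.schedule = build_schedule(len(self.block_weights), n_shares)

    def forward(self, x):
        for idx in self.schedule:
            x = x + self.block_weights[idx] * x   # toy residual block
        return x


trunk = SharedTrunk([0.1, 0.2, 0.3, 0.4], n_shares=3)
print(len(trunk.schedule), len(set(trunk.schedule)))  # effective depth, unique blocks
```

The point of the sketch is the bookkeeping: the forward pass runs 12 block applications while only 4 sets of weights are stored.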

	## Training recipe (distillation from Base)

	Mini was trained via warm-start sliced distillation: a novel recipe for compressing in-context-learning models that preserves real-data transfer ability.

	Three-stage recipe:
	1. Warm-start by slicing. Copy every Nth layer from the Base model (26M) into Mini's smaller unique-block list. Non-layer modules (feature embeddings, normalization, heads) are copied verbatim. This initializes Mini at ~v11.0-half quality, so the student starts with the teacher's transfer ability already in place.
	2. Distill via teacher logits. Train Mini on synthetic SCM tasks using Base as a frozen teacher. Loss = 0.7 × KL(student || teacher, T=2) + 0.2 × hard-label CE + 0.1 × feature MSE. Online distillation with a replay buffer.
	3. Train for 30,000 steps with AdamW, cosine lr 3e-5 → 3e-6, fp16 mixed precision.
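
The loss in step 2 and the schedule in step 3 can be written out in a few lines of plain Python. This is a minimal sketch under assumptions: the KL direction follows the formula as written (student first), no T² rescaling is applied because the recipe does not mention one, "feature MSE" is taken to be a mean squared difference between two equally sized feature vectors, and the cosine schedule has no warmup. Function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distill_loss(student_logits, teacher_logits, label, student_feat, teacher_feat, T=2.0):
    """0.7 * KL(student || teacher, T) + 0.2 * hard-label CE + 0.1 * feature MSE."""
    kl = kl_divergence(softmax(student_logits, T), softmax(teacher_logits, T))
    ce = -math.log(softmax(student_logits)[label])
    feat_mse = sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / len(student_feat)
    return 0.7 * kl + 0.2 * ce + 0.1 * feat_mse


def cosine_lr(step, total_steps=30_000, lr_max=3e-5, lr_min=3e-6):
    """Cosine decay from lr_max at step 0 to lr_min at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


logits = [2.0, 0.5, -1.0]
feats = [0.3, -0.2]
# When student and teacher agree exactly, only the hard-label CE term remains.
matched = distill_loss(logits, logits, 0, feats, feats)
print(matched, cosine_lr(0), cosine_lr(15_000), cosine_lr(30_000))
```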

	The critical insight: distillation from scratch (Option A in our experiments) failed to transfer to real OpenML data. The student matched the teacher on synthetic tasks but did not generalize. Warm-start sliced distillation (Option B, this release) succeeded because the student inherits the teacher's transfer ability as its starting point, so distillation only needs to refine it.
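
The warm-start step itself is simple to sketch. The example below assumes a hypothetical state layout (a dict with a `"blocks"` list plus other modules) and takes every second teacher block starting at index 0; the real checkpoint format and the slice offset are not stated in this card.

```python
def slice_every_nth(teacher_blocks, n_student_blocks):
    """Pick evenly spaced teacher blocks to initialise the student's unique blocks."""
    if len(teacher_blocks) % n_student_blocks != 0:
        raise ValueError("teacher depth must be a multiple of the student's unique-block count")
    stride = len(teacher_blocks) // n_student_blocks
    return teacher_blocks[::stride]


def warm_start(teacher_state, n_student_blocks):
    """Build the student's initial state from the teacher's (hypothetical layout)."""
    # Non-layer modules (embeddings, normalization, heads) are carried over unchanged.
    student = {name: value for name, value in teacher_state.items() if name != "blocks"}
    student["blocks"] = slice_every_nth(teacher_state["blocks"], n_student_blocks)
    return student


# Stand-in strings instead of real weight tensors.
teacher = {
    "feature_embedding": "E",
    "norm": "N",
    "heads": "H",
    "blocks": ["b0", "b1", "b2", "b3", "b4", "b5", "b6", "b7"],
}
student = warm_start(teacher, n_student_blocks=4)
print(student["blocks"])
```

With 8 teacher blocks and 4 student blocks the stride is 2, so the student starts from teacher blocks 0, 2, 4 and 6.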

	## Intended use, limitations, ethical considerations

	Identical to [predictlm-base-26m](https://huggingface.co/zerooneresearch/predictlm-base-26m) — see that model card for full details:

	- Intended: drop-in tabular predictor for ≤128 features, ≤1024 training rows, ≤10 classes
	- Not intended: high-stakes decisions without domain validation; wide tables (>128 features); many-class cls (>10); very large training sets (>10K rows); non-numeric features without encoding
	- No personal data in training: distilled from Base, which was trained on synthetic priors + cleared real-data copulas. No raw eval-set rows seen.
	- Bias inheritance: predictions reflect the labeled context the user supplies at inference time

	The known weaknesses (cls below XGBoost, TabPFN-2.5, and TabICLv2; reg below TabPFN-2.5) are inherited from Base; Mini does not amplify them but cannot fix them either.
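
A small pre-flight check can catch tables that fall outside the intended envelope before they reach the model. The sketch below is a hypothetical helper, not part of the `predictlm` package; the limits are the `max_features`, `max_context` and `max_classes` values from the architecture table.

```python
MAX_FEATURES = 128       # max_features
MAX_CONTEXT_ROWS = 1024  # max_context (training rows supplied in context)
MAX_CLASSES = 10         # max_classes


def check_within_limits(n_rows, n_features, n_classes=None):
    """Return a list of human-readable problems; an empty list means the table fits."""
    problems = []
    if n_features > MAX_FEATURES:
        problems.append(f"{n_features} features exceeds the limit of {MAX_FEATURES}")
    if n_rows > MAX_CONTEXT_ROWS:
        problems.append(f"{n_rows} training rows exceeds the limit of {MAX_CONTEXT_ROWS}")
    if n_classes is not None and n_classes > MAX_CLASSES:
        problems.append(f"{n_classes} classes exceeds the limit of {MAX_CLASSES}")
    return problems


print(check_within_limits(n_rows=500, n_features=20, n_classes=3))
print(check_within_limits(n_rows=5000, n_features=200, n_classes=12))
```

Pass `n_classes=None` for regression targets, where the class limit does not apply.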

	## Reproducibility

	- Weights file: `v11_06_tiny_final.pt` (inference-only, EMA-preferred state)
	- SHA-256: `e27c8af6cda7a3426ffed33cb98eb8338966a8190712b5d37ff9e5f442b75a17`
	- Size: 54.4 MB (inference-only, optimizer + curriculum + buffer + L2-SP state stripped from 217 MB raw)
	- Training step: 30,000 (final)
	- Training seed: 42
	- Teacher: `predictlm-base-26m` (v11.0)
	- Distillation recipe: warm-start slice + online KL distillation
	- Eval-lock manifest SHA-256: `fe4da8cccfc78fc3c7746579f604154af7d37e525c4fd575965ba77ce4fe0841` (identical to Base)
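
To check a downloaded weights file against the SHA-256 listed above, a streaming hash with the standard library is enough. This is a minimal sketch: the helper names are hypothetical, and the local path you pass in depends on where the file was saved.

```python
import hashlib
import os
import tempfile

EXPECTED_SHA256 = "e27c8af6cda7a3426ffed33cb98eb8338966a8190712b5d37ff9e5f442b75a17"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_weights(path, expected=EXPECTED_SHA256):
    """True if the file at `path` matches the expected digest."""
    return sha256_of_file(path) == expected.lower()


# Quick self-check of the helper on a throwaway file containing b"abc".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
demo_digest = sha256_of_file(tmp.name)
demo_matches_weights = verify_weights(tmp.name)
os.remove(tmp.name)
print(demo_digest)
```

For the real artifact, call `verify_weights("v11_06_tiny_final.pt")` with the path to your local copy.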

	## Licensing

	Apache 2.0 — see [LICENSE](./LICENSE). Permissive, commercial use allowed.

	The distillation recipe uses our own [predictlm-base-26m](https://huggingface.co/zerooneresearch/predictlm-base-26m) (Apache 2.0) as the teacher — no third-party license obligations propagate to this model. Mini is fully commercially usable.

	## Version

	- v11.0.6-tiny (current) — first public release of the compact distilled variant.
	- Sibling: [predictlm-base-26m](https://huggingface.co/zerooneresearch/predictlm-base-26m) (full-size, 26M)
	- Future releases under the same `predictlm` Python package.

	## Citation

	### BibTeX

	```bibtex
	@misc{predictlm_mini_2026,
	author = {ZeroOne Research},
	title = {predictlm-mini-13m: a compact distilled tabular foundation model for commodity hardware},
	year = {2026},
	publisher = {Hugging Face},
	howpublished = {\url{https://huggingface.co/zerooneresearch/predictlm-mini-13m}}
	}
	```

	### APA

	ZeroOne Research. (2026). predictlm-mini-13m: a compact distilled tabular foundation model for commodity hardware. Hugging Face. https://huggingface.co/zerooneresearch/predictlm-mini-13m