---
license: apache-2.0
library_name: predictlm
pipeline_tag: tabular-classification
tags:
- tabular
- tabular-classification
- tabular-regression
- in-context-learning
- foundation-model
- prior-fitted-network
- tabpfn-style
- distilled
- compact
metrics:
- accuracy
- r2
base_model: zerooneresearch/predictlm-base-26m
model-index:
- name: predictlm-mini-13m
  results:
  - task:
      type: tabular-classification
      name: Tabular Classification
    dataset:
      type: openml
      name: Locked OpenML eval (CC-18 + AMLB + TabPFN-extras), fair-set n_features ≤ 128
    metrics:
    - type: accuracy
      value: 0.684
      name: mean accuracy (n=12, seed=42, fair-set n_features ≤ 128)
  - task:
      type: tabular-regression
      name: Tabular Regression
    dataset:
      type: openml
      name: Locked OpenML eval (CTR-23 + AMLB), fair-set n_features ≤ 128
    metrics:
    - type: r2
      value: 0.551
      name: mean R² (n=13, seed=42, fair-set n_features ≤ 128)
  - task:
      type: tabular-classification
      name: Tabular Classification (Duo + TTT recipe)
    dataset:
      type: openml
      name: Locked OpenML eval (CC-18 + AMLB + TabPFN-extras), fair-set n_features ≤ 128
    metrics:
    - type: accuracy
      value: 0.751
      name: mean accuracy with Duo + TTT recipe (Mini + Base + test-time training)
  - task:
      type: tabular-regression
      name: Tabular Regression (Duo + TTT recipe)
    dataset:
      type: openml
      name: Locked OpenML eval (CTR-23 + AMLB), fair-set n_features ≤ 128
    metrics:
    - type: r2
      value: 0.609
      name: mean R² with Duo + TTT recipe (Mini + Base + test-time training)
---

# predictlm-mini-13m

A 13.5M-parameter **distilled tabular foundation model** with roughly half the parameters of [PredictLM Base (26M)](https://huggingface.co/zerooneresearch/predictlm-base-26m). It is **statistically tied with Base on classification accuracy** and within ~4 pp R² on regression.

This is the **compact deployment variant** of PredictLM, designed to run inference on any modern laptop or commodity GPU. It keeps the same single-forward-pass in-context-learning API and architecture family as Base, but is smaller, distilled, and re-trainable on hardware most teams already have.

## Getting started: the published 0.751 cls / 0.609 reg recipe, by default

```bash
pip install predictlm
```

```python
from predictlm import PredictLM

# X_train_*, y_train_*, and X_test_* are placeholders for your own data.
model = PredictLM.from_pretrained("zerooneresearch/predictlm-mini-13m")  # cpu / mps / cuda all OK

# Regression: pass float y, get continuous predictions
preds = model.fit(X_train_reg, y_train_reg).predict(X_test_reg)

# Classification: same model, same API; auto-routed via y_train dtype
preds = model.fit(X_train_cls, y_train_cls).predict(X_test_cls)
probs = model.predict_proba(X_test_cls)
```

That's it. On the first `.predict()` call the package silently downloads its partner checkpoint (`predictlm-base-26m`) and forms the published **Duo + TTT** ensemble under the hood, the recipe behind the **0.751 cls / 0.609 reg** result on the locked 25-dataset OpenML eval. You never manage the ensemble yourself; the partner is cached in `~/.cache/huggingface/`.

| Recipe (chosen via `auto_duo=` flag) | cls mean acc | reg mean R² |
|---|:---:|:---:|
| Default `.predict()` (Duo + TTT under the hood) | **0.751** | **0.609** |
| `auto_duo=False` (Mini-only, zero-tuning) | 0.673 | 0.536 |
| `auto_duo=False` + `fit_and_predict_with_ttt()` (Mini-only TTT) | 0.742 | 0.595 |

**Edge cases:**

- **No internet / air-gapped.** Pass `auto_duo=False` at load to disable the partner download; `.predict()` then returns the single-model in-context result.
- **Real-time inference (<10 ms latency).** Use `auto_duo=False` zero-tuning. Duo + TTT adds ~1-60 s per query depending on table size.

**TTT** ([Test-Time Training](https://arxiv.org/abs/2503.11842)) runs ~15 inner Adam steps of self-supervised fine-tuning on the user's in-context examples before predicting, which adds per-task specialization on top of a generic ICL prior. 19 / 20 datasets improved vs zero-tuning; no dataset regressed by more than 0.006.

PredictLM's TTT is an independent implementation of the published technique. This repo does not include or derive from TabPFN code or weights: PredictLM weights are trained from scratch (Mini distilled from PredictLM-Base) and shipped under Apache-2.0.

## Developers and affiliations

- **Developed by**: ZeroOne Research
- **Distilled from**: [predictlm-base-26m](https://huggingface.co/zerooneresearch/predictlm-base-26m) (v11.0)
- **Model card contact**: message the org on the Hub
- **License**: Apache 2.0 (permissive, commercial use allowed)

## Why Mini (when to prefer this over Base)

- **GPU memory budget < 8 GB at inference**: Mini fits comfortably on a consumer GPU or M-series MPS
- **You want to re-distill / fine-tune yourself**: Mini's training recipe runs on a single consumer GPU; Base requires an A100/H100
- **You want a smaller artifact to ship inside a product**: 55 MB inference weights vs Base's 105 MB
- **You're running many concurrent inference jobs**: 4× as many parallel Mini instances fit per GPU vs Base
- **You can tolerate ~4 pp lower regression R²** (95% CI [-6.5, -1.5] pp; cls accuracy is statistically tied with Base)

Prefer **Base** instead if you have an A100/H100, value the last ~4 pp of regression accuracy, and don't need to re-distill.

## Performance benchmarks

### Locked OpenML eval (held-out, contamination-audited)

The same 30-dataset stratified sample and eval pipeline as Base (`scripts/eval_v11.py`): seed=42, fair-set filter `n_features ≤ 128` (which leaves the 25 datasets scored below: 13 regression, 12 classification), 4-way comparison.

| model | reg-R² (n=13) | cls-acc (n=12) |
|---|:---:|:---:|
| predictlm-base-26m (teacher) | +0.589 | 0.685 |
| **predictlm-mini-13m (this model, 13.5M)** | **+0.551** | **0.684** |
| XGBoost (200 trees, depth 6) | +0.516 | 0.743 |
| TabPFN-2.5 (hosted, ~100M, non-commercial license) | +0.662 | 0.780 |
| TabICLv2 (open, BSD-3, ~50M) | *(cls-only)* | 0.792 |

### Paired-bootstrap 95% CIs (10,000 resamples, seed=42)

Per-dataset deltas (predictlm-mini-13m minus baseline):

| comparison | mean Δ | 95% CI | n | significant? |
|---|:---:|:---:|:---:|:---:|
| **Mini vs Base (compression cost)** | | | | |
| Reg R² | **-0.038** | [-0.065, -0.015] | 13 | ✅ real (~4 pp loss) |
| Cls acc | **-0.001** | [-0.027, +0.029] | 12 | ✅ **statistical tie** |
| **vs other peers (Mini)** | | | | |
| Reg vs XGBoost | +0.035 | [-0.076, +0.158] | 13 | within noise |
| Reg vs TabPFN-2.5 | -0.111 | [-0.152, -0.067] | 13 | ✅ significant loss |
| Cls vs XGBoost | -0.059 | [-0.089, -0.031] | 12 | ✅ significant loss |
| Cls vs TabPFN-2.5 | -0.097 | [-0.132, -0.059] | 12 | ✅ significant loss |
| Cls vs TabICLv2 | -0.109 | [-0.147, -0.069] | 12 | ✅ significant loss |

**Retention vs Base, the headline compression story:**
- **Classification: statistical tie** with Base (delta -0.001, 95% CI [-0.027, +0.029]). At half the parameters, Mini is indistinguishable from the 26M teacher on classification accuracy.
- **Regression: ~4 pp R² cost** vs Base, 95% CI [-6.5, -1.5] pp (statistically real but small).

**Honest read on the peer comparisons.** As with Base, Mini's regression-vs-XGBoost point estimate is positive (+3.5 pp), but the 95% CI on this 13-dataset sample crosses zero, so we can't claim a statistically significant XGBoost win on regression from this single-seed eval. What we *can* say: Mini and XGBoost are competitive on regression on this benchmark, with Mini's distribution being slightly better on most datasets.

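The interval logic behind the table can be sketched in a few lines. This is a minimal illustration, not the project's eval code: it assumes a percentile bootstrap over per-dataset deltas, and the function names `paired_bootstrap_ci` and `verdict` are hypothetical. A comparison counts as significant only when the interval excludes zero, which is why the regression-vs-XGBoost row reads "within noise".

```python
import random
from statistics import mean


def paired_bootstrap_ci(model_scores, baseline_scores,
                        n_resamples=10_000, seed=42, alpha=0.05):
    """Percentile-bootstrap CI for the mean per-dataset delta (model minus baseline).

    Assumption: a plain percentile bootstrap; the card does not say which
    bootstrap variant the published intervals use.
    """
    if len(model_scores) != len(baseline_scores):
        raise ValueError("scores must be paired, one entry per dataset")
    deltas = [m - b for m, b in zip(model_scores, baseline_scores)]
    rng = random.Random(seed)
    n = len(deltas)
    # Resample datasets with replacement; resampling deltas keeps each pair intact.
    boot_means = sorted(mean(rng.choices(deltas, k=n)) for _ in range(n_resamples))
    lo = boot_means[round((alpha / 2) * n_resamples)]
    hi = boot_means[round((1 - alpha / 2) * n_resamples) - 1]
    return mean(deltas), lo, hi


def verdict(lo, hi):
    """A difference is treated as significant only if the CI excludes zero."""
    return "significant" if (lo > 0 or hi < 0) else "within noise"


# Intervals copied from the table above:
print(verdict(-0.076, 0.158))   # Reg vs XGBoost
print(verdict(-0.065, -0.015))  # Reg, Mini vs Base
```
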
**Significant losses (real, not noise):** Mini loses to XGBoost on classification (-5.9 pp), to TabPFN-2.5 on both axes, and to TabICLv2 on classification (the only axis it was scored on). Both are SOTA models with roughly 7.4× and 3.7× Mini's parameter count, respectively.

### Model size vs accuracy

| model | params | params (% of Mini) | reg-R² | cls-acc |
|---|:---:|:---:|:---:|:---:|
| TabPFN-2.5 | ~100M | 740% | 0.662 | 0.780 |
| TabICLv2 | ~50M | 370% | n/a | 0.792 |
| **predictlm-base-26m** | 26M | 192% | 0.589 | 0.685 |
| **predictlm-mini-13m** | 13.5M | 100% (baseline) | 0.551 | 0.684 |

Mini is the smallest open-source ICL tabular FM in this comparison and the only one that trains on a single commodity GPU.

## Architecture

Identical architecture family to PredictLM Base, with cross-layer parameter sharing (ALBERT-style) to halve the trunk parameter count.

| field | value |
|---|---|
| Parameters | 13.5 M |
| Layers (effective depth) | 12 (4 unique × 3 shares, ALBERT-style sharing in the shared trunk; 2 unique × 2 shares per task head) |
| d_model | 256 |
| n_heads | 8 |
| max_features | 128 |
| max_classes | 10 |
| max_context | 1024 |
| max_query | 256 |
| Regression head | BarDistribution, 1024 bins (bins identical to Base, required for KL distillation) |
| Classification head | Per-task masked softmax |
| Attention | row-axis transformer (same as Base) |
| Inference precision | fp16 (T4-compatible; Base uses bf16 on A100/H100) |

Cross-layer sharing means Mini has 4 unique trunk blocks, each applied 3 times during the forward pass (vs Base's 8 unique blocks, each applied once). The effective compute graph depth is preserved; only the parameter count is halved.

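To make the sharing pattern concrete, here is a toy sketch in plain Python. It is an illustration under stated assumptions, not PredictLM's implementation: `Block` and `SharedTrunk` are hypothetical names, each block is reduced to a single scalar weight, and the application order (cycling through all unique blocks, three times over) is assumed because the card does not specify it. The point is only that 4 stored blocks yield 12 block applications.

```python
import random


class Block:
    """Stand-in for one transformer block: one scalar weight instead of real layers."""

    def __init__(self, rng):
        self.weight = rng.uniform(0.9, 1.1)
        self.calls = 0  # how many times this block was applied

    def __call__(self, x):
        self.calls += 1
        return [self.weight * v for v in x]


class SharedTrunk:
    """ALBERT-style trunk: n_unique stored blocks, each reused n_shares times."""

    def __init__(self, n_unique=4, n_shares=3, seed=0):
        rng = random.Random(seed)
        self.blocks = [Block(rng) for _ in range(n_unique)]
        self.n_shares = n_shares

    @property
    def effective_depth(self):
        return len(self.blocks) * self.n_shares

    def forward(self, x):
        # Assumed order: A B C D, A B C D, A B C D (not confirmed by the card).
        for _ in range(self.n_shares):
            for block in self.blocks:
                x = block(x)
        return x


trunk = SharedTrunk()
trunk.forward([1.0, 2.0])
print(len(trunk.blocks), trunk.effective_depth)  # stored blocks vs. applied depth
```
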
## Training recipe (distillation from Base)

Mini was trained via **warm-start sliced distillation**: a novel recipe for compressing in-context-learning models that preserves real-data transfer ability.

**Three-stage recipe:**
1. **Warm-start by slicing.** Copy every Nth layer from the Base model (26M) into Mini's smaller unique-block list. Non-layer modules (feature embeddings, normalization, heads) copy verbatim. This initializes Mini at ~v11.0-half quality, so the student starts with the teacher's transfer ability already in place.
2. **Distill via teacher logits.** Train Mini on synthetic SCM tasks using Base as a frozen teacher. Loss = 0.7 × KL(student || teacher, T=2) + 0.2 × hard-label CE + 0.1 × feature MSE. Online distillation with replay buffer.
3. **30,000 training steps** with AdamW, cosine lr 3e-5 → 3e-6, fp16 mixed precision.

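Steps 2 and 3 can be sketched numerically as follows. This is a hedged illustration in plain Python, not the training code: the KL direction follows the formula as written (student first), while everything else is an assumption, including applying the hard-label CE at temperature 1, omitting any T² rescaling of the KL term, treating "feature MSE" as a mean squared error between two equally sized feature vectors, and running the cosine schedule without warmup. The names `distill_loss` and `cosine_lr` are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same bins."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distill_loss(student_logits, teacher_logits, hard_label,
                 student_feat, teacher_feat, temperature=2.0):
    """0.7 * KL(student || teacher, T) + 0.2 * hard-label CE + 0.1 * feature MSE."""
    p_student = softmax(student_logits, temperature)
    p_teacher = softmax(teacher_logits, temperature)
    kl = kl_divergence(p_student, p_teacher)  # direction as written in step 2
    ce = -math.log(softmax(student_logits)[hard_label] + 1e-12)  # assumed T=1
    mse = sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / len(student_feat)
    return 0.7 * kl + 0.2 * ce + 0.1 * mse


def cosine_lr(step, total_steps=30_000, lr_max=3e-5, lr_min=3e-6):
    """Cosine decay from lr_max to lr_min over total_steps (no warmup assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


print(cosine_lr(0), cosine_lr(15_000), cosine_lr(30_000))
```
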
The critical insight: distillation from scratch (Option A in our experiments) **failed to transfer to real OpenML data**. The student matched the teacher on synthetic tasks but couldn't generalize. Warm-start sliced distillation (Option B, this release) succeeded because the student inherits the teacher's transfer ability as its starting point; distillation only needs to refine it.

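The warm-start step itself amounts to a key remapping over the teacher's weights. The sketch below shows one way it could look, assuming 8 teacher trunk blocks and 4 student blocks as described in the Architecture section. The parameter naming scheme (`trunk.layers.<i>.`), the choice of starting at layer 0, and the function names are all hypothetical; the card does not state which offset or key layout the release used.

```python
def slice_indices(n_teacher_layers, n_student_blocks):
    """Pick every Nth teacher layer, starting at 0 (the offset is an assumption)."""
    if n_teacher_layers % n_student_blocks != 0:
        raise ValueError("teacher depth must be a multiple of the student block count")
    stride = n_teacher_layers // n_student_blocks
    return list(range(0, n_teacher_layers, stride))


def warm_start(teacher_state, n_teacher_layers, n_student_blocks,
               layer_prefix="trunk.layers."):
    """Build a student state dict from a teacher state dict.

    Kept trunk layers are renumbered 0..n_student_blocks-1; everything that is
    not a trunk layer (embeddings, normalization, heads) is copied verbatim.
    """
    keep = slice_indices(n_teacher_layers, n_student_blocks)
    remap = {old: new for new, old in enumerate(keep)}
    student_state = {}
    for name, tensor in teacher_state.items():
        if name.startswith(layer_prefix):
            idx_str, _, rest = name[len(layer_prefix):].partition(".")
            old = int(idx_str)
            if old in remap:  # dropped layers are simply skipped
                student_state[f"{layer_prefix}{remap[old]}.{rest}"] = tensor
        else:
            student_state[name] = tensor
    return student_state


# Toy teacher: 8 trunk layers plus one non-layer module; values stand in for tensors.
teacher = {f"trunk.layers.{i}.w": f"W{i}" for i in range(8)}
teacher["embed.w"] = "E"
print(warm_start(teacher, n_teacher_layers=8, n_student_blocks=4))
```
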
## Intended use, limitations, ethical considerations

Identical to [predictlm-base-26m](https://huggingface.co/zerooneresearch/predictlm-base-26m); see that model card for full details:

- **Intended**: drop-in tabular predictor for ≤128 features, ≤1024 training rows, ≤10 classes
- **Not intended**: high-stakes decisions without domain validation; wide tables (>128 features); many-class cls (>10); very large training sets (>10K rows); non-numeric features without encoding
- **No personal data in training**: distilled from Base, which was trained on synthetic priors + cleared real-data copulas. No raw eval-set rows seen.
- **Bias inheritance**: predictions reflect the labeled context the user supplies at inference time

The known weaknesses (cls below XGBoost; below TabPFN-2.5 on both axes and below TabICLv2 on classification) are inherited from Base; Mini does not amplify them but cannot fix them either.

## Reproducibility

- **Weights file**: `v11_06_tiny_final.pt` (inference-only, EMA-preferred state)
- **SHA-256**: `e27c8af6cda7a3426ffed33cb98eb8338966a8190712b5d37ff9e5f442b75a17`
- **Size**: 54.4 MB (inference-only, optimizer + curriculum + buffer + L2-SP state stripped from 217 MB raw)
- **Training step**: 30,000 (final)
- **Training seed**: 42
- **Teacher**: `predictlm-base-26m` (v11.0)
- **Distillation recipe**: warm-start slice + online KL distillation
- **Eval-lock manifest SHA-256**: `fe4da8cccfc78fc3c7746579f604154af7d37e525c4fd575965ba77ce4fe0841` (identical to Base)

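A downloaded copy of the weights can be checked against the published digest with the Python standard library alone. This is a small sketch: the helper names are hypothetical, and the local path is whatever location you saved `v11_06_tiny_final.pt` to, since the card does not say where the package stores it.

```python
import hashlib

# Digest published in the list above.
EXPECTED_SHA256 = "e27c8af6cda7a3426ffed33cb98eb8338966a8190712b5d37ff9e5f442b75a17"


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in 1 MiB chunks so large checkpoints never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_weights(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA-256 matches the expected hex digest."""
    return sha256_of_file(path) == expected.lower()


# Usage (the path is a placeholder for your local copy):
# verify_weights("v11_06_tiny_final.pt")
```
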
## Licensing

Apache 2.0; see [LICENSE](./LICENSE). Permissive, commercial use allowed.

The distillation recipe uses our own [predictlm-base-26m](https://huggingface.co/zerooneresearch/predictlm-base-26m) (Apache 2.0) as the teacher, so no third-party license obligations propagate to this model. Mini is fully commercially usable.

## Version

- **v11.0.6-tiny** (current): first public release of the compact distilled variant.
- Sibling: [predictlm-base-26m](https://huggingface.co/zerooneresearch/predictlm-base-26m) (full-size, 26M)
- Future releases will ship under the same `predictlm` Python package.

## Citation

### BibTeX

```bibtex
@misc{predictlm_mini_2026,
  author = {ZeroOne Research},
  title = {predictlm-mini-13m: a compact distilled tabular foundation model for commodity hardware},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/zerooneresearch/predictlm-mini-13m}}
}
```

### APA

ZeroOne Research. (2026). *predictlm-mini-13m: a compact distilled tabular foundation model for commodity hardware.* Hugging Face. https://huggingface.co/zerooneresearch/predictlm-mini-13m