PredictLM v11.0 + Mini ship-bundle
README.md
CHANGED
|
@@ -17,10 +17,9 @@ metrics:
|
|
| 17 |
- r2
|
| 18 |
co2_eq_emissions:
|
| 19 |
emissions: 700
|
| 20 |
-
source: estimated from Azure EU-North grid factor (~0.3 kg CO₂/kWh) and
|
| 21 |
training_type: distillation
|
| 22 |
geographical_location: Netherlands
|
| 23 |
-
hardware_used: 1× NVIDIA Tesla T4 16GB (commodity)
|
| 24 |
model-index:
|
| 25 |
- name: predictlm-mini-13m
|
| 26 |
results:
|
|
@@ -48,7 +47,7 @@ model-index:
|
|
| 48 |
|
| 49 |
# predictlm-mini-13m
|
| 50 |
|
| 51 |
-
A 13.5M-parameter **distilled tabular foundation model**
|
| 52 |
|
| 53 |
This is the **compact deployment variant** of PredictLM, designed to run inference on any modern laptop or commodity GPU. Same single-forward-pass in-context-learning API as Base, same architecture family — just smaller, distilled, and re-trainable on hardware most teams already have.
|
| 54 |
|
|
@@ -97,8 +96,8 @@ PredictLM's TTT is an independent implementation of the published technique. Thi
|
|
| 97 |
|
| 98 |
## Why Mini (when to prefer this over Base)
|
| 99 |
|
| 100 |
-
- **GPU memory budget < 8 GB at inference** — Mini fits comfortably on a
|
| 101 |
-
- **You want to re-distill / fine-tune yourself** — Mini's training recipe runs
|
| 102 |
- **You want a smaller artifact to ship inside a product** — 55 MB inference weights vs Base's 105 MB
|
| 103 |
- **You're running many concurrent inference jobs** — 4× as many parallel Mini instances fit per GPU vs Base
|
| 104 |
- **You can tolerate ~4 pp lower regression R²** (CI [-6.5, -1.5]; cls accuracy is statistically tied with Base)
|
|
@@ -182,15 +181,12 @@ Mini was trained via **warm-start sliced distillation**: a novel recipe for comp
|
|
| 182 |
**Three-stage recipe:**
|
| 183 |
1. **Warm-start by slicing.** Copy every-Nth layer from the Base model (26M) into Mini's smaller unique-block list. Non-layer modules (feature embeddings, normalization, heads) copy verbatim. This initializes Mini at ~v11.0-half quality — student starts with the teacher's transfer ability already.
|
| 184 |
2. **Distill via teacher logits.** Train Mini on synthetic SCM tasks using Base as a frozen teacher. Loss = 0.7 × KL(student || teacher, T=2) + 0.2 × hard-label CE + 0.1 × feature MSE. Online distillation with replay buffer.
|
| 185 |
-
3. **30,000 training steps** with AdamW, cosine lr 3e-5 → 3e-6, fp16 mixed precision
|
| 186 |
|
| 187 |
The critical insight: distillation from scratch (Option A in our experiments) **failed to transfer to real OpenML data** — student matched teacher on synthetic but couldn't generalize. Warm-start sliced distillation (Option B, this release) succeeded because the student inherits the teacher's transfer ability as the starting point; distillation only needs to refine.
|
| 188 |
|
| 189 |
### Compute
|
| 190 |
|
| 191 |
-
- **Hardware**: 1× Tesla T4 16GB (commodity)
|
| 192 |
-
- **Wall time**: 3.3 hours
|
| 193 |
-
- **Cost**: ~$1.30 at $0.40/hr spot
|
| 194 |
- **Carbon**: ~0.7 kg CO₂ (Azure EU-North grid)
|
| 195 |
|
| 196 |
Mini is the **cheapest tabular foundation model release on Hugging Face** by training cost as of 2026-05-14. Reproducible from scratch with `scripts/train_v11_06_tiny.py` in the code repo.
|
|
@@ -224,7 +220,7 @@ To reproduce from scratch:
|
|
| 224 |
# Pull the v11.0 teacher
|
| 225 |
huggingface-cli download zerooneresearch/predictlm-base-26m v11_final.pt --local-dir ./
|
| 226 |
|
| 227 |
-
# Reproduce Mini
|
| 228 |
python3 scripts/train_v11_06_tiny.py \
|
| 229 |
--teacher-ckpt v11_final.pt \
|
| 230 |
--warm-start-from-v11 v11_final.pt \
|
|
|
|
| 17 |
- r2
|
| 18 |
co2_eq_emissions:
|
| 19 |
emissions: 700
|
| 20 |
+
source: estimated from Azure EU-North grid factor (~0.3 kg CO₂/kWh) and training compute footprint
|
| 21 |
training_type: distillation
|
| 22 |
geographical_location: Netherlands
|
|
|
|
| 23 |
model-index:
|
| 24 |
- name: predictlm-mini-13m
|
| 25 |
results:
|
|
|
|
| 47 |
|
| 48 |
# predictlm-mini-13m
|
| 49 |
|
| 50 |
+
A 13.5M-parameter **distilled tabular foundation model**. Half the parameters of [PredictLM Base (26M)](https://huggingface.co/zerooneresearch/predictlm-base-26m); **statistically tied with Base on classification accuracy** and within ~4 pp R² on regression.
|
| 51 |
|
| 52 |
This is the **compact deployment variant** of PredictLM, designed to run inference on any modern laptop or commodity GPU. Same single-forward-pass in-context-learning API as Base, same architecture family — just smaller, distilled, and re-trainable on hardware most teams already have.
|
| 53 |
|
|
|
|
| 96 |
|
| 97 |
## Why Mini (when to prefer this over Base)
|
| 98 |
|
| 99 |
+
- **GPU memory budget < 8 GB at inference** — Mini fits comfortably on a consumer GPU or M-series MPS
|
| 100 |
+
- **You want to re-distill / fine-tune yourself** — Mini's training recipe runs on a single consumer GPU; Base requires an A100/H100
|
| 101 |
- **You want a smaller artifact to ship inside a product** — 55 MB inference weights vs Base's 105 MB
|
| 102 |
- **You're running many concurrent inference jobs** — 4× as many parallel Mini instances fit per GPU vs Base
|
| 103 |
- **You can tolerate ~4 pp lower regression R²** (CI [-6.5, -1.5]; cls accuracy is statistically tied with Base)
|
|
|
|
| 181 |
**Three-stage recipe:**
|
| 182 |
1. **Warm-start by slicing.** Copy every-Nth layer from the Base model (26M) into Mini's smaller unique-block list. Non-layer modules (feature embeddings, normalization, heads) copy verbatim. This initializes Mini at ~v11.0-half quality — student starts with the teacher's transfer ability already.
|
| 183 |
2. **Distill via teacher logits.** Train Mini on synthetic SCM tasks using Base as a frozen teacher. Loss = 0.7 × KL(student || teacher, T=2) + 0.2 × hard-label CE + 0.1 × feature MSE. Online distillation with replay buffer.
|
| 184 |
+
3. **30,000 training steps** with AdamW, cosine lr 3e-5 → 3e-6, fp16 mixed precision.
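
The weighted loss in step 2 and the learning-rate schedule in step 3 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in `scripts/train_v11_06_tiny.py`: the function names are hypothetical, the KL term follows the argument order exactly as written above (student first), the "feature MSE" is assumed to compare one student feature vector with one teacher feature vector, and the schedule assumes no warmup. No temperature-squared scaling is applied because the recipe does not mention one.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distill_loss(student_logits, teacher_logits, label,
                 student_feat, teacher_feat, temperature=2.0):
    """0.7 * KL + 0.2 * hard-label CE + 0.1 * feature MSE (weights from the recipe)."""
    # Soft-target term at T=2; direction taken literally from the notation above.
    p_student = softmax(student_logits, temperature)
    p_teacher = softmax(teacher_logits, temperature)
    kl = kl_divergence(p_student, p_teacher)
    # Hard-label cross-entropy on the untempered student distribution.
    ce = -math.log(softmax(student_logits)[label] + 1e-12)
    # Mean squared error between feature vectors (assumed to have equal length).
    mse = sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / len(student_feat)
    return 0.7 * kl + 0.2 * ce + 0.1 * mse


def cosine_lr(step, total_steps=30_000, lr_max=3e-5, lr_min=3e-6):
    """Cosine decay from lr_max at step 0 to lr_min at total_steps (no warmup assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

When student and teacher agree exactly, the KL and MSE terms vanish and only the hard-label term remains, which is a quick sanity check for any real implementation of the same weighting.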
|
| 185 |
|
| 186 |
The critical insight: distillation from scratch (Option A in our experiments) **failed to transfer to real OpenML data** — student matched teacher on synthetic but couldn't generalize. Warm-start sliced distillation (Option B, this release) succeeded because the student inherits the teacher's transfer ability as the starting point; distillation only needs to refine.
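
The warm-start step described above can be illustrated on a plain dictionary of weights. This is a sketch under assumptions, not the released implementation: the `layers.<index>.` key naming, the `stride` parameter (the "N" in every-Nth), and the choice to keep indices 0, N, 2N, ... are all hypothetical, since the README does not state the key layout or which offset is kept.

```python
import re


def slice_state_dict(teacher_state, stride, layer_prefix="layers."):
    """Build a student state dict by keeping every `stride`-th teacher layer.

    Keys that do not belong to a numbered layer (embeddings, norms, heads)
    are copied verbatim, mirroring the description in the recipe.
    """
    pattern = re.compile(r"^" + re.escape(layer_prefix) + r"(\d+)\.(.+)$")
    student_state = {}
    for key, value in teacher_state.items():
        match = pattern.match(key)
        if match is None:
            # Non-layer module: copy unchanged.
            student_state[key] = value
            continue
        teacher_idx = int(match.group(1))
        if teacher_idx % stride == 0:
            # Renumber kept layers so the student's indices are contiguous.
            student_idx = teacher_idx // stride
            student_state[f"{layer_prefix}{student_idx}.{match.group(2)}"] = value
    return student_state


# Toy example with string placeholders standing in for weight tensors.
teacher = {
    "embed.weight": "E",
    "layers.0.w": "a",
    "layers.1.w": "b",
    "layers.2.w": "c",
    "layers.3.w": "d",
    "head.bias": "H",
}
student = slice_state_dict(teacher, stride=2)
```

With `stride=2`, the toy student keeps teacher layers 0 and 2 as its own layers 0 and 1, plus the embedding and head entries unchanged.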
|
| 187 |
|
| 188 |
### Compute
|
| 189 |
|
| 190 |
- **Carbon**: ~0.7 kg CO₂ (Azure EU-North grid)
|
| 191 |
|
| 192 |
Mini is the **cheapest tabular foundation model release on Hugging Face** by training cost as of 2026-05-14. Reproducible from scratch with `scripts/train_v11_06_tiny.py` in the code repo.
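
As a consistency check on the carbon figures, the snippet below relates the metadata value `emissions: 700` to the "~0.7 kg CO₂" quoted in the Compute list, and backs out the energy use implied by the stated grid factor. It assumes the metadata value is in grams, which is how Hugging Face model card metadata normally expresses `co2_eq_emissions.emissions`; the variable names are illustrative only.

```python
# Values quoted in the README diff above.
EMISSIONS_G = 700               # co2_eq_emissions.emissions (assumed to be grams)
GRID_FACTOR_KG_PER_KWH = 0.3    # "~0.3 kg CO2/kWh" from the source field

# Convert grams to kilograms to compare with the prose figure.
emissions_kg = EMISSIONS_G / 1000

# Energy implied by the emissions estimate and the grid factor.
implied_energy_kwh = emissions_kg / GRID_FACTOR_KG_PER_KWH

print(f"{emissions_kg:.1f} kg CO2, about {implied_energy_kwh:.2f} kWh")
```

The two figures agree at 0.7 kg, and the grid factor implies roughly 2.3 kWh of energy for the training run.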
|
|
|
|
| 220 |
# Pull the v11.0 teacher
|
| 221 |
huggingface-cli download zerooneresearch/predictlm-base-26m v11_final.pt --local-dir ./
|
| 222 |
|
| 223 |
+
# Reproduce Mini
|
| 224 |
python3 scripts/train_v11_06_tiny.py \
|
| 225 |
--teacher-ckpt v11_final.pt \
|
| 226 |
--warm-start-from-v11 v11_final.pt \
|