zerooneresearch/predictlm-mini-13m

@@ -17,10 +17,9 @@ metrics:
   - r2
 co2_eq_emissions:
   emissions: 700
-  source: estimated from Azure EU-North grid factor (~0.3 kg CO₂/kWh) and ~3.3 GPU-hours at T4 ~70W TDP
   training_type: distillation
   geographical_location: Netherlands
-  hardware_used: 1× NVIDIA Tesla T4 16GB (commodity)
 model-index:
   - name: predictlm-mini-13m
     results:
@@ -48,7 +47,7 @@ model-index:
 # predictlm-mini-13m
-A 13.5M-parameter **distilled tabular foundation model** trained on a single Tesla T4 in 3.3 hours for ~$1.30. Half the parameters of [PredictLM Base (26M)](https://huggingface.co/zerooneresearch/predictlm-base-26m); **statistically tied with Base on classification accuracy** and within ~4 pp R² on regression.
 This is the **compact deployment variant** of PredictLM, designed to run inference on any modern laptop or commodity GPU. Same single-forward-pass in-context-learning API as Base, same architecture family — just smaller, distilled, and re-trainable on hardware most teams already have.
@@ -97,8 +96,8 @@ PredictLM's TTT is an independent implementation of the published technique. Thi
 ## Why Mini (when to prefer this over Base)
-- **GPU memory budget < 8 GB at inference** — Mini fits comfortably on a Tesla T4, RTX 3060, or M-series MPS
-- **You want to re-distill / fine-tune yourself** — Mini's training recipe runs end-to-end on a single T4 in 3.3 hours for ~$1.30; Base requires an A100/H100
 - **You want a smaller artifact to ship inside a product** — 55 MB inference weights vs Base's 105 MB
 - **You're running many concurrent inference jobs** — 4× as many parallel Mini instances fit per GPU vs Base
 - **You can tolerate ~4 pp lower regression R²** (CI [-6.5, -1.5]; cls accuracy is statistically tied with Base)
@@ -182,15 +181,12 @@ Mini was trained via **warm-start sliced distillation**: a novel recipe for comp
 **Three-stage recipe:**
 1. **Warm-start by slicing.** Copy every-Nth layer from the Base model (26M) into Mini's smaller unique-block list. Non-layer modules (feature embeddings, normalization, heads) copy verbatim. This initializes Mini at ~v11.0-half quality — student starts with the teacher's transfer ability already.
 2. **Distill via teacher logits.** Train Mini on synthetic SCM tasks using Base as a frozen teacher. Loss = 0.7 × KL(student || teacher, T=2) + 0.2 × hard-label CE + 0.1 × feature MSE. Online distillation with replay buffer.
-3. **30,000 training steps** with AdamW, cosine lr 3e-5 → 3e-6, fp16 mixed precision on a single Tesla T4.
 The critical insight: distillation from scratch (Option A in our experiments) **failed to transfer to real OpenML data** — student matched teacher on synthetic but couldn't generalize. Warm-start sliced distillation (Option B, this release) succeeded because the student inherits the teacher's transfer ability as the starting point; distillation only needs to refine.
 ### Compute
-- **Hardware**: 1× Tesla T4 16GB (commodity)
-- **Wall time**: 3.3 hours
-- **Cost**: ~$1.30 at $0.40/hr spot
 - **Carbon**: ~0.7 kg CO₂ (Azure EU-North grid)
 Mini is the **cheapest tabular foundation model release on Hugging Face** by training cost as of 2026-05-14. Reproducible from scratch with `scripts/train_v11_06_tiny.py` in the code repo.
@@ -224,7 +220,7 @@ To reproduce from scratch:
 # Pull the v11.0 teacher
 huggingface-cli download zerooneresearch/predictlm-base-26m v11_final.pt --local-dir ./
-# Reproduce Mini (single T4, ~3.3 hr, ~$1.30)
 python3 scripts/train_v11_06_tiny.py \
     --teacher-ckpt v11_final.pt \
     --warm-start-from-v11 v11_final.pt \

   - r2
 co2_eq_emissions:
   emissions: 700
+  source: estimated from Azure EU-North grid factor (~0.3 kg CO₂/kWh) and training compute footprint
   training_type: distillation
   geographical_location: Netherlands
 model-index:
   - name: predictlm-mini-13m
     results:
 # predictlm-mini-13m
+A 13.5M-parameter **distilled tabular foundation model**. Half the parameters of [PredictLM Base (26M)](https://huggingface.co/zerooneresearch/predictlm-base-26m); **statistically tied with Base on classification accuracy** and within ~4 pp R² on regression.
 This is the **compact deployment variant** of PredictLM, designed to run inference on any modern laptop or commodity GPU. Same single-forward-pass in-context-learning API as Base, same architecture family — just smaller, distilled, and re-trainable on hardware most teams already have.
 ## Why Mini (when to prefer this over Base)
+- **GPU memory budget < 8 GB at inference** — Mini fits comfortably on a consumer GPU or M-series MPS
+- **You want to re-distill / fine-tune yourself** — Mini's training recipe runs on a single consumer GPU; Base requires an A100/H100
 - **You want a smaller artifact to ship inside a product** — 55 MB inference weights vs Base's 105 MB
 - **You're running many concurrent inference jobs** — 4× as many parallel Mini instances fit per GPU vs Base
 - **You can tolerate ~4 pp lower regression R²** (CI [-6.5, -1.5]; cls accuracy is statistically tied with Base)
 **Three-stage recipe:**
 1. **Warm-start by slicing.** Copy every-Nth layer from the Base model (26M) into Mini's smaller unique-block list. Non-layer modules (feature embeddings, normalization, heads) copy verbatim. This initializes Mini at ~v11.0-half quality — student starts with the teacher's transfer ability already.
 2. **Distill via teacher logits.** Train Mini on synthetic SCM tasks using Base as a frozen teacher. Loss = 0.7 × KL(student || teacher, T=2) + 0.2 × hard-label CE + 0.1 × feature MSE. Online distillation with replay buffer.
+3. **30,000 training steps** with AdamW, cosine lr 3e-5 → 3e-6, fp16 mixed precision.
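The weighted loss in step 2 can be sketched for a single example as follows. This is a minimal illustration in plain Python, not the code in `scripts/train_v11_06_tiny.py`. Several details are assumptions: the KL direction is taken literally from the notation `KL(student || teacher)`, the temperature is applied to both sets of logits, no T² rescaling is applied because the card does not mention one, and "feature MSE" is assumed to compare student and teacher feature vectors of equal length. All function and parameter names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of the same length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_entropy(logits, label):
    """Hard-label cross-entropy on the unscaled student logits."""
    return -math.log(softmax(logits)[label])


def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distill_loss(student_logits, teacher_logits, label,
                 student_feat, teacher_feat,
                 temperature=2.0, w_kl=0.7, w_ce=0.2, w_mse=0.1):
    """0.7 * KL + 0.2 * CE + 0.1 * feature MSE, weights as stated in the card."""
    p_student = softmax(student_logits, temperature)
    p_teacher = softmax(teacher_logits, temperature)
    # Assumption: direction follows the card's notation, KL(student || teacher).
    kl = kl_divergence(p_student, p_teacher)
    ce = cross_entropy(student_logits, label)
    feat = mse(student_feat, teacher_feat)
    return w_kl * kl + w_ce * ce + w_mse * feat
```

When the student reproduces the teacher's logits and features exactly, the KL and MSE terms vanish and only the weighted hard-label term remains, which makes a convenient sanity check for any real implementation.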
 The critical insight: distillation from scratch (Option A in our experiments) **failed to transfer to real OpenML data** — student matched teacher on synthetic but couldn't generalize. Warm-start sliced distillation (Option B, this release) succeeded because the student inherits the teacher's transfer ability as the starting point; distillation only needs to refine.
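The warm-start step that this paragraph credits can be sketched as a mapping over a checkpoint's parameter dictionary. This is a rough illustration under stated assumptions, not the release code: the `layers.<index>.<name>` key scheme, the stride of 2, and the function name are all hypothetical, and the real checkpoint layout may differ.

```python
import re

# Hypothetical naming scheme: per-layer parameters live under "layers.<idx>.<name>".
LAYER_KEY = re.compile(r"^layers\.(\d+)\.(.+)$")


def slice_state_dict(teacher_state, stride=2):
    """Build a student parameter dict by keeping every `stride`-th teacher layer.

    Keys that do not match the layer pattern (embeddings, normalization,
    heads) are copied verbatim, as the recipe describes.
    """
    student_state = {}
    for key, value in teacher_state.items():
        match = LAYER_KEY.match(key)
        if match is None:
            student_state[key] = value  # non-layer module: copy as-is
            continue
        teacher_idx = int(match.group(1))
        if teacher_idx % stride == 0:
            # Renumber kept layers so the student's indices are contiguous.
            student_idx = teacher_idx // stride
            student_state[f"layers.{student_idx}.{match.group(2)}"] = value
    return student_state
```

With a framework checkpoint, the same mapping would be applied to the loaded parameter dictionary before the result is loaded into the smaller model.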
 ### Compute
 - **Carbon**: ~0.7 kg CO₂ (Azure EU-North grid)
 Mini is the **cheapest tabular foundation model release on Hugging Face** by training cost as of 2026-05-14. Reproducible from scratch with `scripts/train_v11_06_tiny.py` in the code repo.
 # Pull the v11.0 teacher
 huggingface-cli download zerooneresearch/predictlm-base-26m v11_final.pt --local-dir ./
+# Reproduce Mini
 python3 scripts/train_v11_06_tiny.py \
     --teacher-ckpt v11_final.pt \
     --warm-start-from-v11 v11_final.pt \