01RAI committed on
Commit
761a46b
·
verified ·
1 Parent(s): 11f470a

PredictLM v11.0 + Mini ship-bundle

Files changed (1)
  1. README.md +6 -10
README.md CHANGED
@@ -17,10 +17,9 @@ metrics:
17
  - r2
18
  co2_eq_emissions:
19
  emissions: 700
20
- source: estimated from Azure EU-North grid factor (~0.3 kg CO₂/kWh) and ~3.3 GPU-hours at T4 ~70W TDP
21
  training_type: distillation
22
  geographical_location: Netherlands
23
- hardware_used: 1× NVIDIA Tesla T4 16GB (commodity)
24
  model-index:
25
  - name: predictlm-mini-13m
26
  results:
@@ -48,7 +47,7 @@ model-index:
48
 
49
  # predictlm-mini-13m
50
 
51
- A 13.5M-parameter **distilled tabular foundation model** trained on a single Tesla T4 in 3.3 hours for ~$1.30. Half the parameters of [PredictLM Base (26M)](https://huggingface.co/zerooneresearch/predictlm-base-26m); **statistically tied with Base on classification accuracy** and within ~4 pp R² on regression.
52
 
53
  This is the **compact deployment variant** of PredictLM, designed to run inference on any modern laptop or commodity GPU. Same single-forward-pass in-context-learning API as Base, same architecture family — just smaller, distilled, and re-trainable on hardware most teams already have.
54
 
@@ -97,8 +96,8 @@ PredictLM's TTT is an independent implementation of the published technique. Thi
97
 
98
  ## Why Mini (when to prefer this over Base)
99
 
100
- - **GPU memory budget < 8 GB at inference** — Mini fits comfortably on a Tesla T4, RTX 3060, or M-series MPS
101
- - **You want to re-distill / fine-tune yourself** — Mini's training recipe runs end-to-end on a single T4 in 3.3 hours for ~$1.30; Base requires an A100/H100
102
  - **You want a smaller artifact to ship inside a product** — 55 MB inference weights vs Base's 105 MB
103
  - **You're running many concurrent inference jobs** — 4× as many parallel Mini instances fit per GPU vs Base
104
  - **You can tolerate ~4 pp lower regression R²** (CI [-6.5, -1.5]; cls accuracy is statistically tied with Base)
@@ -182,15 +181,12 @@ Mini was trained via **warm-start sliced distillation**: a novel recipe for comp
182
  **Three-stage recipe:**
183
  1. **Warm-start by slicing.** Copy every-Nth layer from the Base model (26M) into Mini's smaller unique-block list. Non-layer modules (feature embeddings, normalization, heads) copy verbatim. This initializes Mini at ~v11.0-half quality — student starts with the teacher's transfer ability already.
184
  2. **Distill via teacher logits.** Train Mini on synthetic SCM tasks using Base as a frozen teacher. Loss = 0.7 × KL(student || teacher, T=2) + 0.2 × hard-label CE + 0.1 × feature MSE. Online distillation with replay buffer.
185
- 3. **30,000 training steps** with AdamW, cosine lr 3e-5 → 3e-6, fp16 mixed precision on a single Tesla T4.
186
 
187
  The critical insight: distillation from scratch (Option A in our experiments) **failed to transfer to real OpenML data** — student matched teacher on synthetic but couldn't generalize. Warm-start sliced distillation (Option B, this release) succeeded because the student inherits the teacher's transfer ability as the starting point; distillation only needs to refine.
188
 
189
  ### Compute
190
 
191
- - **Hardware**: 1× Tesla T4 16GB (commodity)
192
- - **Wall time**: 3.3 hours
193
- - **Cost**: ~$1.30 at $0.40/hr spot
194
  - **Carbon**: ~0.7 kg CO₂ (Azure EU-North grid)
195
 
196
  Mini is the **cheapest tabular foundation model release on Hugging Face** by training cost as of 2026-05-14. Reproducible from scratch with `scripts/train_v11_06_tiny.py` in the code repo.
@@ -224,7 +220,7 @@ To reproduce from scratch:
224
  # Pull the v11.0 teacher
225
  huggingface-cli download zerooneresearch/predictlm-base-26m v11_final.pt --local-dir ./
226
 
227
- # Reproduce Mini (single T4, ~3.3 hr, ~$1.30)
228
  python3 scripts/train_v11_06_tiny.py \
229
  --teacher-ckpt v11_final.pt \
230
  --warm-start-from-v11 v11_final.pt \
 
17
  - r2
18
  co2_eq_emissions:
19
  emissions: 700
20
+ source: estimated from Azure EU-North grid factor (~0.3 kg CO₂/kWh) and training compute footprint
21
  training_type: distillation
22
  geographical_location: Netherlands
 
23
  model-index:
24
  - name: predictlm-mini-13m
25
  results:
 
47
 
48
  # predictlm-mini-13m
49
 
50
+ A 13.5M-parameter **distilled tabular foundation model**. Half the parameters of [PredictLM Base (26M)](https://huggingface.co/zerooneresearch/predictlm-base-26m); **statistically tied with Base on classification accuracy** and within ~4 pp R² on regression.
51
 
52
  This is the **compact deployment variant** of PredictLM, designed to run inference on any modern laptop or commodity GPU. Same single-forward-pass in-context-learning API as Base, same architecture family — just smaller, distilled, and re-trainable on hardware most teams already have.
53
 
 
96
 
97
  ## Why Mini (when to prefer this over Base)
98
 
99
+ - **GPU memory budget < 8 GB at inference** — Mini fits comfortably on a consumer GPU or M-series MPS
100
+ - **You want to re-distill / fine-tune yourself** — Mini's training recipe runs on a single consumer GPU; Base requires an A100/H100
101
  - **You want a smaller artifact to ship inside a product** — 55 MB inference weights vs Base's 105 MB
102
  - **You're running many concurrent inference jobs** — 4× as many parallel Mini instances fit per GPU vs Base
103
  - **You can tolerate ~4 pp lower regression R²** (CI [-6.5, -1.5]; cls accuracy is statistically tied with Base)
 
181
  **Three-stage recipe:**
182
  1. **Warm-start by slicing.** Copy every-Nth layer from the Base model (26M) into Mini's smaller unique-block list. Non-layer modules (feature embeddings, normalization, heads) copy verbatim. This initializes Mini at ~v11.0-half quality — student starts with the teacher's transfer ability already.
183
  2. **Distill via teacher logits.** Train Mini on synthetic SCM tasks using Base as a frozen teacher. Loss = 0.7 × KL(student || teacher, T=2) + 0.2 × hard-label CE + 0.1 × feature MSE. Online distillation with replay buffer.
184
+ 3. **30,000 training steps** with AdamW, cosine lr 3e-5 → 3e-6, fp16 mixed precision.
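
The loss in step 2 of the recipe can be sketched for a single example as below. This is a minimal illustration, not the code in `scripts/train_v11_06_tiny.py`. The weights (0.7, 0.2, 0.1), the temperature (T=2), and the KL direction (student || teacher) come from the README text. Everything else is an assumption: the helper names are hypothetical, the hard-label cross-entropy is taken at temperature 1, no T² rescaling is applied to the KL term, and the README does not say which features the MSE term compares.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def cross_entropy(logits, label):
    """Hard-label cross-entropy at temperature 1 (assumption)."""
    return -math.log(softmax(logits)[label])


def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student_logits, teacher_logits, label,
                      student_feats, teacher_feats,
                      temperature=2.0, w_kl=0.7, w_ce=0.2, w_feat=0.1):
    """0.7 * KL(student || teacher, T=2) + 0.2 * CE + 0.1 * feature MSE."""
    p_student = softmax(student_logits, temperature)
    p_teacher = softmax(teacher_logits, temperature)
    # Direction follows the README as written: KL(student || teacher).
    kl = kl_divergence(p_student, p_teacher)
    ce = cross_entropy(student_logits, label)
    feat = mse(student_feats, teacher_feats)
    return w_kl * kl + w_ce * ce + w_feat * feat
```

When the student reproduces the teacher's logits and features exactly, the KL and MSE terms vanish and only the weighted hard-label term remains, which is a quick sanity check for any real implementation.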
185
 
186
  The critical insight: distillation from scratch (Option A in our experiments) **failed to transfer to real OpenML data** — student matched teacher on synthetic but couldn't generalize. Warm-start sliced distillation (Option B, this release) succeeded because the student inherits the teacher's transfer ability as the starting point; distillation only needs to refine.
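
The warm-start step that this paragraph credits can be sketched on a plain name-to-tensor mapping. This is an illustration under stated assumptions, not the release's implementation: the parameter naming scheme `layers.<index>.<param>`, the function names, the stride value, and starting the slice at layer 0 are all hypothetical, since the README only says that every-Nth layer is copied and that non-layer modules copy verbatim.

```python
def slice_layer_indices(num_teacher_layers, stride):
    """Teacher layer indices kept when copying every `stride`-th layer.

    Starting at index 0 is an assumption; the README does not give the offset.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return list(range(0, num_teacher_layers, stride))


def warm_start(teacher_state, num_teacher_layers, stride, layer_prefix="layers."):
    """Build a student state dict from a teacher state dict.

    Keys are assumed to look like 'layers.<index>.<param>' for transformer
    blocks (hypothetical naming); every other key is copied unchanged.
    """
    kept = slice_layer_indices(num_teacher_layers, stride)
    # Map each kept teacher index to its position in the smaller student.
    remap = {teacher_idx: student_idx for student_idx, teacher_idx in enumerate(kept)}

    student_state = {}
    for name, tensor in teacher_state.items():
        if not name.startswith(layer_prefix):
            # Embeddings, normalization, heads: copy verbatim.
            student_state[name] = tensor
            continue
        index_str, _, rest = name[len(layer_prefix):].partition(".")
        teacher_idx = int(index_str)
        if teacher_idx in remap:
            student_state[f"{layer_prefix}{remap[teacher_idx]}.{rest}"] = tensor
    return student_state
```

For example, with a four-layer teacher and `stride=2`, teacher layers 0 and 2 become student layers 0 and 1, while embedding and head entries pass through untouched.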
187
 
188
  ### Compute
189
 
 
 
 
190
  - **Carbon**: ~0.7 kg CO₂ (Azure EU-North grid)
191
 
192
  Mini is the **cheapest tabular foundation model release on Hugging Face** by training cost as of 2026-05-14. Reproducible from scratch with `scripts/train_v11_06_tiny.py` in the code repo.
 
220
  # Pull the v11.0 teacher
221
  huggingface-cli download zerooneresearch/predictlm-base-26m v11_final.pt --local-dir ./
222
 
223
+ # Reproduce Mini
224
  python3 scripts/train_v11_06_tiny.py \
225
  --teacher-ckpt v11_final.pt \
226
  --warm-start-from-v11 v11_final.pt \