zerooneresearch/predictlm-base-26m

@@ -15,12 +15,6 @@ metrics:
   - roc_auc
   - r2
   - rmse
-co2_eq_emissions:
-  emissions: 20000
-  source: estimated from RunPod EU-NL grid factor (~0.3 kg CO₂/kWh) and ~80 GPU-hours at H100 ~700W TDP
-  training_type: pre-training
-  geographical_location: Netherlands
-  hardware_used: 1× NVIDIA H100 80GB SXM5 (primary), 1× NVIDIA A100 40GB SXM4 (initial run), 1× NVIDIA L40S 46GB (failed Plan-B+ probe)
 model-index:
   - name: predictlm-base-26m
     results:
@@ -44,6 +38,26 @@ model-index:
           - type: r2
             value: 0.589
             name: mean R² (n=13, seed=42, fair-set n_features ≤ 128)
 ---
 # predictlm-base-26m
@@ -99,7 +113,7 @@ Unified architecture: a shared backbone with two task heads (regression via a 10
 - **Developed by**: ZeroOne Research
 - **Model card contact**: open an issue at the [code repo](https://github.com/zerooneresearch/predictlm-v11) or message the org on the Hub
-- **License**: Apache 2.0 — permissive, no attribution-only restriction (more permissive than TabPFN v2's bespoke license)
 ## Intended use
@@ -180,7 +194,7 @@ Per-dataset deltas (predictlm-base-26m minus baseline):
 | Cls vs TabPFN-2.5 | -0.096 | [-0.133, -0.059] | 12 | ✅ significant loss |
 | Cls vs TabICLv2 | -0.108 | [-0.150, -0.066] | 12 | ✅ significant loss |
-**Honest read on the headline number.** The +7.3 pp mean R² advantage over XGBoost on regression is the point estimate; the 95% paired-bootstrap CI is [−4.1 pp, +19.6 pp], **so the regression win does not survive 95%-CI hypothesis testing on this 13-dataset sample.** Within-dataset variance is large (some datasets predictlm wins by 10+ pp, others XGBoost wins by 5+ pp). What we can say: on this evaluation, predictlm-base-26m trends ahead of XGBoost on regression with a positive point estimate, while neither method has a statistically dominant advantage. Wider multi-seed evals are planned for v11.0.7.
 **Significant losses (real, not noise):** loses to XGBoost on classification (-5.8 pp, CI [-9.4, -2.4]); loses to TabPFN-2.5 and TabICLv2 on both axes — these are commercial / SOTA models 2-4× our parameter count.
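
The removed `co2_eq_emissions` entry states its inputs (about 80 GPU-hours, H100 at roughly 700 W TDP, grid factor of about 0.3 kg CO₂/kWh), so the estimate can be re-derived. The sketch below is a back-of-the-envelope check, not the authors' calculation. It assumes the GPU ran at full TDP for the whole period, ignores the A100 and L40S runs and any datacenter overhead, and assumes the `emissions` field is expressed in grams.

```python
# Back-of-the-envelope check of the removed co2_eq_emissions estimate.
# All inputs are the approximate figures quoted in the removed `source` line.
GPU_HOURS = 80            # ~80 GPU-hours
GPU_POWER_KW = 0.700      # H100 at ~700 W TDP (assumes full TDP throughout)
GRID_KG_PER_KWH = 0.3     # ~0.3 kg CO2/kWh grid factor

energy_kwh = GPU_HOURS * GPU_POWER_KW          # about 56 kWh
emissions_kg = energy_kwh * GRID_KG_PER_KWH    # about 16.8 kg CO2
emissions_g = emissions_kg * 1000              # about 16,800 g CO2

DECLARED_G = 20000  # value of the removed `emissions` field (assumed grams)

print(f"energy:    {energy_kwh:.1f} kWh")
print(f"emissions: {emissions_kg:.1f} kg CO2 ({emissions_g:.0f} g)")
print(f"declared / derived: {DECLARED_G / emissions_g:.2f}")
```

Under these assumptions the derived figure is somewhat below the declared 20000, which is consistent with the declared value being a rounded-up estimate.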

   - roc_auc
   - r2
   - rmse
 model-index:
   - name: predictlm-base-26m
     results:
           - type: r2
             value: 0.589
             name: mean R² (n=13, seed=42, fair-set n_features ≤ 128)
+      - task:
+          type: tabular-classification
+          name: Tabular Classification (Duo + TTT recipe)
+        dataset:
+          type: openml
+          name: Locked OpenML eval (CC-18 + AMLB + TabPFN-extras), fair-set n_features ≤ 128
+        metrics:
+          - type: accuracy
+            value: 0.751
+            name: mean accuracy with Duo + TTT recipe (Mini + Base + test-time training)
+      - task:
+          type: tabular-regression
+          name: Tabular Regression (Duo + TTT recipe)
+        dataset:
+          type: openml
+          name: Locked OpenML eval (CTR-23 + AMLB), fair-set n_features ≤ 128
+        metrics:
+          - type: r2
+            value: 0.609
+            name: mean R² with Duo + TTT recipe (Mini + Base + test-time training)
 ---
 # predictlm-base-26m
 - **Developed by**: ZeroOne Research
 - **Model card contact**: open an issue at the [code repo](https://github.com/zerooneresearch/predictlm-v11) or message the org on the Hub
+- **License**: Apache 2.0 — permissive, commercial use allowed, no attribution-only restriction
 ## Intended use
 | Cls vs TabPFN-2.5 | -0.096 | [-0.133, -0.059] | 12 | ✅ significant loss |
 | Cls vs TabICLv2 | -0.108 | [-0.150, -0.066] | 12 | ✅ significant loss |
+**Honest read on the headline number.** The +7.3 pp mean R² advantage over XGBoost on regression is a point estimate; the 95% paired-bootstrap CI is [−4.1 pp, +19.6 pp], which includes zero, **so the regression win is not statistically significant at the 95% level on this 13-dataset sample.** Between-dataset variance is large (on some datasets predictlm wins by 10+ pp, on others XGBoost wins by 5+ pp). What we can say: on this evaluation, predictlm-base-26m has a positive mean R² delta over XGBoost on regression, but neither method is statistically ahead.
 **Significant losses (real, not noise):** loses to XGBoost on classification (-5.8 pp, CI [-9.4, -2.4]); loses to TabPFN-2.5 and TabICLv2 on both axes — these are commercial / SOTA models 2-4× our parameter count.
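
The "significant" labels above depend on whether a 95% paired-bootstrap CI over per-dataset deltas excludes zero. The sketch below shows one common way to compute such an interval, a percentile bootstrap that resamples datasets with replacement. The card does not say which bootstrap variant or how many resamples were used, so `paired_bootstrap_ci`, its defaults, and the `example_deltas` values are all hypothetical and are not the card's data.

```python
import numpy as np


def paired_bootstrap_ci(deltas, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired per-dataset deltas.

    `deltas` holds one value per dataset: model score minus baseline score.
    Returns (mean, lower, upper).
    """
    deltas = np.asarray(deltas, dtype=float)
    rng = np.random.default_rng(seed)
    # Each row of `idx` is one bootstrap replicate: datasets drawn with replacement.
    idx = rng.integers(0, len(deltas), size=(n_boot, len(deltas)))
    boot_means = deltas[idx].mean(axis=1)
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(deltas.mean()), float(lower), float(upper)


# Hypothetical R² deltas for 13 datasets (illustrative only, NOT from the card).
example_deltas = [0.14, -0.06, 0.25, 0.03, -0.09, 0.11, 0.33,
                  -0.04, 0.02, 0.08, -0.05, 0.06, 0.01]

mean, lower, upper = paired_bootstrap_ci(example_deltas)
# The difference counts as significant only if the interval excludes zero.
significant = lower > 0 or upper < 0
print(f"mean delta = {mean:+.3f}, 95% CI = [{lower:+.3f}, {upper:+.3f}], "
      f"significant = {significant}")
```

With only 12 or 13 datasets, a few large deltas dominate the resampled means, which is why a positive point estimate can still come with an interval that spans zero.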