avewright committed on
Commit 0bf0abe · verified · 1 Parent(s): 1de75f1

Upload Tabula v1 pretrained model — step 61,825, best_val=0.2295

Files changed (5)
  1. README.md +130 -0
  2. best.pt +3 -0
  3. config.json +69 -0
  4. latest.pt +3 -0
  5. training_log.txt +0 -0
README.md ADDED
@@ -0,0 +1,130 @@
---
license: apache-2.0
tags:
- tabular
- foundation-model
- pretraining
- tabpfn
- schema-aware
- pytorch
datasets:
- avewright/tabula-pretraining-corpus-v2
language:
- en
---

# Tabula v1 — Tabular Foundation Model (Pretrained)

A schema-aware tabular transformer pretrained on a large multi-source corpus
of real and synthetic tabular datasets.

## Model Architecture

| Property | Value |
|---|---|
| Architecture | TabularTransformer |
| d_model | 256 |
| Heads | 8 |
| Layers | 8 |
| FFN dim | 512 |
| FFN activation | SwiGLU |
| Normalization | RMSNorm |
| Pooling | CLS token |
| Numeric embedding | Periodic (k=16) |
| Max numeric features | 64 |
| Max categories | 128 |
| Parameters | **10,752,769** (~10.75M) |

## Pretraining

| Property | Value |
|---|---|
| Best checkpoint | Step 45,000 |
| Best val loss | 0.2295 |
| Rows seen at best | 23,040,000 |
| Final step | 61,825 |
| Total rows seen | 31,654,400 |
| Batch size | 512 |
| Learning rate | 3e-4 (cosine decay, 2K warmup) |
| AMP | fp16 |
| Hardware | NVIDIA RTX A4500 (20 GB) |
| Training time | ~3 hours |

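The row counts in the table are internally consistent: rows seen equals step × batch size.

```python
batch_size = 512

# Rows seen at the best checkpoint (step 45,000)
assert 45_000 * batch_size == 23_040_000

# Total rows seen at the final step (61,825)
assert 61_825 * batch_size == 31_654_400
```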
Loss objective: multi-task MSE on target prediction from mixed numeric/categorical
features, normalized per-column (z-score). Each batch samples from a fixed-width
(64-feature) schema where unused slots are masked with NaN.

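A minimal sketch of that batch preparation, assuming plain per-column z-scoring and NaN padding to the 64-slot schema (the helper name and details are illustrative, not the actual Tabula pipeline):

```python
import numpy as np

MAX_NUMERIC = 64  # fixed schema width used during pretraining

def prepare_batch(X, y):
    """Z-score each column, then pad to the fixed 64-slot schema with NaN."""
    mu = np.nanmean(X, axis=0)
    sigma = np.nanstd(X, axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant columns
    Xn = (X - mu) / sigma
    # Unused feature slots are filled with NaN so the model can mask them out.
    pad = np.full((X.shape[0], MAX_NUMERIC - X.shape[1]), np.nan, dtype=np.float32)
    Xp = np.concatenate([Xn.astype(np.float32), pad], axis=1)
    y_std = y.std() if y.std() > 0 else 1.0
    yn = (y - y.mean()) / y_std
    return Xp, yn.astype(np.float32)
```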
## Pretraining Corpus

Trained on [`avewright/tabula-pretraining-corpus-v2`](https://huggingface.co/datasets/avewright/tabula-pretraining-corpus-v2):

| Source | OK Datasets | Status |
|---|---|---|
| PMLB | 422 | **Fully exhausted** (422 of 423 known datasets used; 1 download failure) |
| OpenML | 2,949 | 4,886 attempted; 1,900 rejected (too few features); 37 download failures |
| HuggingFace | 0 | 67 attempted; all failed on format incompatibilities |
| **Synthetic** | (unlimited) | tree-prior, GMM, polynomial, SCM, regression, time-series, mixed-type |

**Total corpus:** 541 shards, ~160 GB parquet.
**Format:** `feat_0..feat_63` (Float32, NaN=unused), `target` (Float32), `_source_meta` (JSON).

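Since shards are plain parquet in this schema, they can be inspected with pandas. The snippet below builds a toy frame in the documented format; loading a real shard works the same way (the shard filename shown in the comment is hypothetical):

```python
import json
import numpy as np
import pandas as pd

# Toy frame in the documented corpus schema:
# feat_0..feat_63 (Float32, NaN = unused slot), target (Float32), _source_meta (JSON).
n_rows, n_used = 4, 10
data = {f"feat_{i}": np.random.randn(n_rows).astype(np.float32) for i in range(n_used)}
data.update({f"feat_{i}": np.full(n_rows, np.nan, dtype=np.float32)
             for i in range(n_used, 64)})
data["target"] = np.random.randn(n_rows).astype(np.float32)
data["_source_meta"] = [json.dumps({"source": "toy"})] * n_rows
df = pd.DataFrame(data)

# A real shard would be read the same way, e.g.:
# df = pd.read_parquet("shard_000.parquet")  # filename is hypothetical
feature_cols = [c for c in df.columns if c.startswith("feat_")]
X = df[feature_cols].to_numpy()  # (n_rows, 64), NaN in unused slots
y = df["target"].to_numpy()
```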
### Dataset Exhaustion Notes

- **PMLB: fully exhausted.** 422 of 423 known datasets successfully processed
  (1 download failure: `chess`). No new PMLB datasets can be added without an
  upstream PMLB library update.

- **OpenML: largely exhausted.** 4,886 unique datasets attempted; 2,949 passed
  the pipeline. The 1,900 `schema_fail` entries are almost entirely datasets with
  only 1 output column and too few rows/features to be useful (e.g. `too small: (53, 1)`).
  These are unrecoverable without lowering quality thresholds. There may be a small
  tail of undiscovered OpenML datasets not yet paginated.

- **HuggingFace tabular:** 67 attempted from the curated catalog. All failed due to
  schema mismatches, missing splits, or download timeouts. The catalog needs expansion
  with manually vetted datasets.

## Files

| File | Description |
|---|---|
| `best.pt` | Best validation checkpoint (step 45,000, val_loss=0.2295) |
| `latest.pt` | Final training checkpoint (step 61,825) |
| `config.json` | Model and training hyperparameters |
| `training_log.txt` | Full training run output |

## Usage

```python
import torch
from tabula.models.transformer import TabularTransformer
from tabula.config import ModelConfig

# Load checkpoint (weights_only=False: the checkpoint stores a pickled config object)
ckpt = torch.load("best.pt", map_location="cpu", weights_only=False)
cfg = ckpt["config"].model

# Reconstruct model
model = TabularTransformer(
    d_model=cfg.d_model, n_heads=cfg.n_heads, n_layers=cfg.n_layers,
    d_ff=cfg.d_ff, dropout=cfg.dropout,
    num_numeric=64, num_categorical=0, num_text=0,
    output_dim=1,
    numeric_embedding=cfg.numeric_embedding,
    numeric_periodic_features=cfg.numeric_periodic_features,
    ffn_activation=cfg.ffn_activation, norm=cfg.norm, pooling=cfg.pooling,
)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()
```

## Training Notes

The model uses a fixed-width schema (64 numeric slots) regardless of the original
dataset width. Narrower datasets are padded out to 64 slots with NaN, which the
model masks. This forces the model to learn position-invariant feature
representations compatible with arbitrary tabular schemas.

Synthetic data fills gaps whenever the real-corpus buffer is empty, providing
100M+ rows per session of controlled variation in feature distributions,
missingness patterns, and task types.
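As an illustration of the synthetic side, one of the listed generators (the Gaussian-mixture prior) could look roughly like this. The actual generators are not published here, so this is only a sketch with a random linear target:

```python
import numpy as np

def sample_gmm_table(n_rows, n_features, n_components=4, seed=None):
    """Sample a synthetic regression table from a random Gaussian mixture.

    Features are drawn from a randomly parameterized GMM; the target is a
    random linear function of the features plus noise, giving controlled
    variation in feature distributions and task difficulty.
    """
    rng = np.random.default_rng(seed)
    means = rng.normal(0.0, 3.0, size=(n_components, n_features))
    scales = rng.uniform(0.5, 2.0, size=(n_components, n_features))
    comp = rng.integers(n_components, size=n_rows)  # per-row mixture assignment
    X = rng.normal(means[comp], scales[comp])       # (n_rows, n_features)
    w = rng.normal(size=n_features)
    y = X @ w + rng.normal(0.0, 0.1, size=n_rows)
    return X.astype(np.float32), y.astype(np.float32)
```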
best.pt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e00cb1b6d673dd836fed25916eff3e74d6ef08eee9c88d7da33943779aa7ef48
size 90133178
config.json ADDED
@@ -0,0 +1,69 @@
{
  "model_type": "tabula_transformer",
  "architecture": "TabularTransformer",
  "d_model": 256,
  "n_heads": 8,
  "n_layers": 8,
  "d_ff": 512,
  "dropout": 0.1,
  "ffn_activation": "swiglu",
  "norm": "rmsnorm",
  "pooling": "cls",
  "numeric_embedding": "periodic",
  "numeric_periodic_features": 16,
  "max_numeric_features": 64,
  "max_categories": 128,
  "feature_token_dropout": 0.05,
  "n_params": 10752769,
  "pretraining": {
    "best_step": 45000,
    "best_val_loss": 0.229543,
    "best_rows_seen": 23040000,
    "final_step": 61825,
    "final_rows_seen": 31654400,
    "batch_size": 512,
    "lr": 0.0003,
    "weight_decay": 0.0001,
    "amp": true,
    "amp_dtype": "float16",
    "grad_clip": 1.0,
    "warmup_steps": 2000,
    "lr_schedule": "cosine",
    "max_steps": 200000
  },
  "corpus": {
    "hf_repo": "avewright/tabula-pretraining-corpus-v2",
    "total_shards": 541,
    "real_datasets_ok": 3371,
    "sources": {
      "pmlb": {
        "ok": 422,
        "total_attempted": 423,
        "status": "fully_exhausted"
      },
      "openml": {
        "ok": 2949,
        "total_attempted": 4886,
        "schema_fail": 1900,
        "download_fail": 37
      },
      "huggingface": {
        "ok": 0,
        "download_fail": 66,
        "schema_fail": 1
      }
    },
    "synthetic_generators": [
      "tree_prior",
      "gaussian_mixture",
      "polynomial",
      "scm",
      "regression",
      "time_series",
      "mixed_type"
    ]
  },
  "date_trained": "2026-03-16",
  "framework": "pytorch",
  "pytorch_version": "2.4.1+cu124"
}
latest.pt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c5099d3ce00a81b85230f7e18636dc328bd1c673993f554022aa03c8e7c2af0c
size 90173498
training_log.txt ADDED
The diff for this file is too large to render. See raw diff