---
license: apache-2.0
tags:
- tabular
- foundation-model
- pretraining
- tabpfn
- schema-aware
- pytorch
datasets:
- avewright/tabula-pretraining-corpus-v2
language:
- en
---

# Tabula v1 — Tabular Foundation Model (Pretrained)

A schema-aware tabular transformer pretrained on a large multi-source corpus of real and synthetic tabular datasets.

## Model Architecture

| Property | Value |
|---|---|
| Architecture | TabularTransformer |
| d_model | 256 |
| Heads | 8 |
| Layers | 8 |
| FFN dim | 512 |
| FFN activation | SwiGLU |
| Normalization | RMSNorm |
| Pooling | CLS token |
| Numeric embedding | Periodic (k=16) |
| Max numeric features | 64 |
| Max categories | 128 |
| Parameters | **10,752,769** (~10.75M) |

## Pretraining

| Property | Value |
|---|---|
| Best checkpoint | Step 45,000 |
| Best val loss | 0.2295 |
| Rows seen at best | 23,040,000 |
| Final step | 61,825 |
| Total rows seen | 31,654,400 |
| Batch size | 512 |
| Learning rate | 3e-4 (cosine decay, 2K warmup) |
| AMP | fp16 |
| Hardware | NVIDIA RTX A4500 (20 GB) |
| Training time | ~3 hours |

Loss objective: multi-task MSE on target prediction from mixed numeric/categorical features, normalized per column (z-score). Each batch samples from a fixed-width (64-feature) schema in which unused slots are masked with NaN.

## Pretraining Corpus

Trained on [`avewright/tabula-pretraining-corpus-v2`](https://huggingface.co/datasets/avewright/tabula-pretraining-corpus-v2):

| Source | OK Datasets | Status |
|---|---|---|
| PMLB | 422 | **Fully exhausted** (422 of 423 known datasets used) |
| OpenML | 2,949 | 4,886 attempted — 1,900 rejected (too few features), 37 download failures |
| HuggingFace | 0 | 67 attempted — format incompatibilities |
| **Synthetic** | (unlimited) | tree-prior, GMM, polynomial, SCM, regression, time-series, mixed-type |

**Total corpus:** 541 shards, ~160 GB parquet.
**Format:** `feat_0..feat_63` (Float32, NaN = unused), `target` (Float32), `_source_meta` (JSON).
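To make the fixed-width layout concrete, here is a minimal sketch of building and per-column z-scoring a batch in that format. The values are toy data, not a real shard; real shards are parquet files from the corpus repo with this same column layout.

```python
import numpy as np
import pandas as pd

# Toy batch in the corpus layout: 64 Float32 feature slots,
# NaN marks unused slots, plus a Float32 target column.
n_rows, n_slots = 4, 64
rng = np.random.default_rng(0)
feats = np.full((n_rows, n_slots), np.nan, dtype=np.float32)
feats[:, :3] = rng.normal(size=(n_rows, 3))  # pretend a 3-feature source dataset
df = pd.DataFrame(feats, columns=[f"feat_{i}" for i in range(n_slots)])
df["target"] = np.array([0.1, 0.7, 0.3, 0.9], dtype=np.float32)

# Per-column z-score normalization; NaN-masked slots stay NaN throughout
feat_cols = [c for c in df.columns if c.startswith("feat_")]
mu = df[feat_cols].mean()   # skips NaN by default
sd = df[feat_cols].std()
z = (df[feat_cols] - mu) / sd
```

Used slots come out z-scored (unit variance per column) while the 61 unused slots remain NaN, which is what the loss masking keys on.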
### Dataset Exhaustion Notes

- **PMLB: fully exhausted.** 422 of the 423 known datasets were successfully processed (1 download failure: `chess`). No new PMLB datasets can be added without an upstream PMLB library update.
- **OpenML: largely exhausted.** 4,886 unique datasets attempted; 2,949 passed the pipeline. The 1,900 `schema_fail` entries are almost entirely datasets with only 1 output column and too few rows/features to be useful (e.g. `too small: (53, 1)`). These are unrecoverable without lowering quality thresholds. A small tail of undiscovered OpenML datasets, not yet paginated, may remain.
- **HuggingFace tabular:** 67 attempted from the curated catalog. All failed due to schema mismatches, missing splits, or download timeouts. The catalog needs expansion with manually vetted datasets.

## Files

| File | Description |
|---|---|
| `best.pt` | Best validation checkpoint (step 45,000, val_loss=0.2295) |
| `latest.pt` | Final training checkpoint (step 61,825) |
| `config.json` | Model and training hyperparameters |
| `training_log.txt` | Full training run output |

## Usage

```python
import torch

from tabula.models.transformer import TabularTransformer
from tabula.config import ModelConfig

# Load the checkpoint (it stores the full config object, hence weights_only=False)
ckpt = torch.load("best.pt", map_location="cpu", weights_only=False)
cfg = ckpt["config"].model

# Reconstruct the model from the saved hyperparameters
model = TabularTransformer(
    d_model=cfg.d_model,
    n_heads=cfg.n_heads,
    n_layers=cfg.n_layers,
    d_ff=cfg.d_ff,
    dropout=cfg.dropout,
    num_numeric=64,
    num_categorical=0,
    num_text=0,
    output_dim=1,
    numeric_embedding=cfg.numeric_embedding,
    numeric_periodic_features=cfg.numeric_periodic_features,
    ffn_activation=cfg.ffn_activation,
    norm=cfg.norm,
    pooling=cfg.pooling,
)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()
```

## Training Notes

The model uses a fixed-width schema (64 numeric slots) regardless of the original dataset width. Narrower datasets are padded, with the unused slots masked as NaN.
This forces the model to learn position-invariant feature representations compatible with arbitrary tabular schemas. Synthetic data fills the gaps whenever the real-corpus buffer is empty, providing 100M+ rows per session of controlled variation in feature distributions, missingness patterns, and task types.
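The synthetic fallback described above can be illustrated with a toy generator. This is a sketch only: the function name, task form, and parameters are illustrative assumptions, not the corpus generation code, but it shows the key properties named in the notes (controlled missingness rate, a regression task over a variable number of features, NaN padding to the fixed 64-slot schema).

```python
import numpy as np

def synth_regression_batch(n_rows=512, n_feats=8, miss_rate=0.1,
                           n_slots=64, seed=0):
    """Toy stand-in for the synthetic fallback (illustrative only):
    a random quadratic regression task with controlled missingness,
    padded to the fixed-width 64-slot schema."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_rows, n_feats)).astype(np.float32)

    # Random task: linear + small quadratic terms + noise
    w1 = rng.normal(size=n_feats)
    w2 = rng.normal(size=n_feats) * 0.1
    y = (x @ w1 + (x ** 2) @ w2
         + rng.normal(scale=0.05, size=n_rows)).astype(np.float32)

    # Inject missingness after computing targets, then pad unused slots
    x[rng.random(x.shape) < miss_rate] = np.nan
    feats = np.full((n_rows, n_slots), np.nan, dtype=np.float32)
    feats[:, :n_feats] = x
    return feats, y

feats, y = synth_regression_batch()
```

Varying `n_feats`, `miss_rate`, and the task form per batch is what gives the controlled variation the model never sees exhausted, unlike the finite real corpus.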