Upload Tabula v1 pretrained model — step 61,825, best_val=0.2295
- README.md +130 -0
- best.pt +3 -0
- config.json +69 -0
- latest.pt +3 -0
- training_log.txt +0 -0
README.md
ADDED
---
license: apache-2.0
tags:
- tabular
- foundation-model
- pretraining
- tabpfn
- schema-aware
- pytorch
datasets:
- avewright/tabula-pretraining-corpus-v2
language:
- en
---

# Tabula v1 — Tabular Foundation Model (Pretrained)

A schema-aware tabular transformer pretrained on a large multi-source corpus
of real and synthetic tabular datasets.

## Model Architecture

| Property | Value |
|---|---|
| Architecture | TabularTransformer |
| d_model | 256 |
| Heads | 8 |
| Layers | 8 |
| FFN dim | 512 |
| FFN activation | SwiGLU |
| Normalization | RMSNorm |
| Pooling | CLS token |
| Numeric embedding | Periodic (k=16) |
| Max numeric features | 64 |
| Max categories | 128 |
| Parameters | **10,752,769** (~10.75M) |

## Pretraining

| Property | Value |
|---|---|
| Best checkpoint | Step 45,000 |
| Best val loss | 0.2295 |
| Rows seen at best | 23,040,000 |
| Final step | 61,825 |
| Total rows seen | 31,654,400 |
| Batch size | 512 |
| Learning rate | 3e-4 (cosine decay, 2K warmup) |
| AMP | fp16 |
| Hardware | NVIDIA RTX A4500 (20 GB) |
| Training time | ~3 hours |

Loss objective: multi-task MSE on target prediction from mixed numeric/categorical
features, normalized per column (z-score). Each batch samples from a fixed-width
(64-feature) schema in which unused slots are masked with NaN.
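As a rough illustration of this batch layout, here is a minimal sketch of per-column z-scoring followed by NaN-padding into the fixed 64-slot schema. The `prepare_batch` helper and its defaults are our assumptions for illustration, not the project's actual data pipeline:

```python
import numpy as np

def prepare_batch(X, max_features=64):
    # Hypothetical helper: z-score each column, then place the result in a
    # fixed 64-slot schema where unused feature slots stay NaN.
    X = np.asarray(X, dtype=np.float32)
    mu = np.nanmean(X, axis=0)
    sigma = np.nanstd(X, axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant columns
    out = np.full((X.shape[0], max_features), np.nan, dtype=np.float32)
    out[:, : X.shape[1]] = (X - mu) / sigma
    return out

batch = prepare_batch([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
# batch.shape == (3, 64); columns 2..63 are all NaN
```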

## Pretraining Corpus

Trained on [`avewright/tabula-pretraining-corpus-v2`](https://huggingface.co/datasets/avewright/tabula-pretraining-corpus-v2):

| Source | OK Datasets | Status |
|---|---|---|
| PMLB | 422 | **Fully exhausted** (422 of 423 known datasets used) |
| OpenML | 2,949 | 4,886 attempted — 1,900 rejected (too few features), 37 download failures |
| HuggingFace | 0 | 67 attempted — format incompatibilities |
| **Synthetic** | (unlimited) | tree-prior, GMM, polynomial, SCM, regression, time-series, mixed-type |

**Total corpus:** 541 shards, ~160 GB parquet.
**Format:** `feat_0..feat_63` (Float32, NaN=unused), `target` (Float32), `_source_meta` (JSON).

### Dataset Exhaustion Notes

- **PMLB: fully exhausted.** 422 of 423 known datasets successfully processed
  (1 download failure: `chess`). No new PMLB datasets can be added without an
  upstream PMLB library update.

- **OpenML: largely exhausted.** 4,886 unique datasets attempted; 2,949 passed
  the pipeline. The 1,900 `schema_fail` entries are almost entirely datasets with
  only 1 output column and too few rows/features to be useful (e.g. `too small: (53, 1)`).
  These are unrecoverable without lowering quality thresholds. There may be a small
  tail of OpenML datasets not yet discovered through pagination.

- **HuggingFace tabular:** 67 attempted from the curated catalog; all failed due to
  schema mismatches, missing splits, or download timeouts. The catalog needs
  expansion with manually vetted datasets.

## Files

| File | Description |
|---|---|
| `best.pt` | Best validation checkpoint (step 45,000, val_loss=0.2295) |
| `latest.pt` | Final training checkpoint (step 61,825) |
| `config.json` | Model and training hyperparameters |
| `training_log.txt` | Full training run output |
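Both checkpoints are stored via Git LFS, so a downloaded file's SHA-256 can be checked against the `oid` recorded in its LFS pointer. A minimal sketch (the `sha256_of` helper is ours, not part of the repo):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    # Stream in 1 MB chunks so ~90 MB checkpoints need not fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            h.update(block)
    return h.hexdigest()

# e.g. sha256_of("best.pt") should match the oid in best.pt's LFS pointer
```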

## Usage

```python
import torch
from tabula.models.transformer import TabularTransformer
from tabula.config import ModelConfig

# Load checkpoint
ckpt = torch.load("best.pt", map_location="cpu", weights_only=False)
cfg = ckpt["config"].model

# Reconstruct model
model = TabularTransformer(
    d_model=cfg.d_model, n_heads=cfg.n_heads, n_layers=cfg.n_layers,
    d_ff=cfg.d_ff, dropout=cfg.dropout,
    num_numeric=64, num_categorical=0, num_text=0,
    output_dim=1,
    numeric_embedding=cfg.numeric_embedding,
    numeric_periodic_features=cfg.numeric_periodic_features,
    ffn_activation=cfg.ffn_activation, norm=cfg.norm, pooling=cfg.pooling,
)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()
```

## Training Notes

The model uses a fixed-width schema (64 numeric slots) regardless of original
dataset width. Narrower datasets are padded with NaN in the unused slots. This
forces the model to learn position-invariant feature representations compatible
with arbitrary tabular schemas.

Synthetic data fills gaps whenever the real-corpus buffer is empty, providing
100M+ rows per session of controlled variation in feature distributions,
missingness patterns, and task types.
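For intuition, one of the listed synthetic families (`polynomial`) could look roughly like the sketch below. The function, its parameters, and the noise level are illustrative assumptions, not the project's actual generator:

```python
import numpy as np

def polynomial_task(n_rows=512, n_features=8, degree=2, seed=0):
    # Illustrative only: random Gaussian features, target = random
    # polynomial of the features plus a little observation noise.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_rows, n_features)).astype(np.float32)
    coefs = rng.normal(size=(n_features, degree))
    y = sum((X ** (d + 1)) @ coefs[:, d] for d in range(degree))
    y = y + 0.1 * rng.normal(size=n_rows)
    return X, y.astype(np.float32)

X, y = polynomial_task()  # one synthetic regression task
```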
best.pt
ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:e00cb1b6d673dd836fed25916eff3e74d6ef08eee9c88d7da33943779aa7ef48
size 90133178
config.json
ADDED
{
  "model_type": "tabula_transformer",
  "architecture": "TabularTransformer",
  "d_model": 256,
  "n_heads": 8,
  "n_layers": 8,
  "d_ff": 512,
  "dropout": 0.1,
  "ffn_activation": "swiglu",
  "norm": "rmsnorm",
  "pooling": "cls",
  "numeric_embedding": "periodic",
  "numeric_periodic_features": 16,
  "max_numeric_features": 64,
  "max_categories": 128,
  "feature_token_dropout": 0.05,
  "n_params": 10752769,
  "pretraining": {
    "best_step": 45000,
    "best_val_loss": 0.229543,
    "best_rows_seen": 23040000,
    "final_step": 61825,
    "final_rows_seen": 31654400,
    "batch_size": 512,
    "lr": 0.0003,
    "weight_decay": 0.0001,
    "amp": true,
    "amp_dtype": "float16",
    "grad_clip": 1.0,
    "warmup_steps": 2000,
    "lr_schedule": "cosine",
    "max_steps": 200000
  },
  "corpus": {
    "hf_repo": "avewright/tabula-pretraining-corpus-v2",
    "total_shards": 541,
    "real_datasets_ok": 3371,
    "sources": {
      "pmlb": {
        "ok": 422,
        "total_attempted": 423,
        "status": "fully_exhausted"
      },
      "openml": {
        "ok": 2949,
        "total_attempted": 4886,
        "schema_fail": 1900,
        "download_fail": 37
      },
      "huggingface": {
        "ok": 0,
        "download_fail": 66,
        "schema_fail": 1
      }
    },
    "synthetic_generators": [
      "tree_prior",
      "gaussian_mixture",
      "polynomial",
      "scm",
      "regression",
      "time_series",
      "mixed_type"
    ]
  },
  "date_trained": "2026-03-16",
  "framework": "pytorch",
  "pytorch_version": "2.4.1+cu124"
}
latest.pt
ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:c5099d3ce00a81b85230f7e18636dc328bd1c673993f554022aa03c8e7c2af0c
size 90173498
training_log.txt
ADDED
The diff for this file is too large to render.