avewright committed on
Commit 0bf0abe · verified · 1 Parent(s): 1de75f1

Upload Tabula v1 pretrained model — step 61,825, best_val=0.2295

Files changed (5)
  1. README.md +130 -0
  2. best.pt +3 -0
  3. config.json +69 -0
  4. latest.pt +3 -0
  5. training_log.txt +0 -0
README.md ADDED
@@ -0,0 +1,130 @@
---
license: apache-2.0
tags:
- tabular
- foundation-model
- pretraining
- tabpfn
- schema-aware
- pytorch
datasets:
- avewright/tabula-pretraining-corpus-v2
language:
- en
---

# Tabula v1 — Tabular Foundation Model (Pretrained)

A schema-aware tabular transformer pretrained on a large multi-source corpus
of real and synthetic tabular datasets.

## Model Architecture

| Property | Value |
|---|---|
| Architecture | TabularTransformer |
| d_model | 256 |
| Heads | 8 |
| Layers | 8 |
| FFN dim | 512 |
| FFN activation | SwiGLU |
| Normalization | RMSNorm |
| Pooling | CLS token |
| Numeric embedding | Periodic (k=16) |
| Max numeric features | 64 |
| Max categories | 128 |
| Parameters | **10,752,769** (~10.75M) |

## Pretraining

| Property | Value |
|---|---|
| Best checkpoint | Step 45,000 |
| Best val loss | 0.2295 |
| Rows seen at best | 23,040,000 |
| Final step | 61,825 |
| Total rows seen | 31,654,400 |
| Batch size | 512 |
| Learning rate | 3e-4 (cosine decay, 2K warmup) |
| AMP | fp16 |
| Hardware | NVIDIA RTX A4500 (20 GB) |
| Training time | ~3 hours |

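The row counts in the table are internally consistent: rows seen equals step × batch size.

```python
batch_size = 512

# Rows seen at the best checkpoint (step 45,000)
assert 45_000 * batch_size == 23_040_000

# Total rows seen at the final step (61,825)
assert 61_825 * batch_size == 31_654_400
```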
Loss objective: multi-task MSE on target prediction from mixed numeric/categorical
features, normalized per-column (z-score). Each batch samples from a fixed-width
(64-feature) schema where unused slots are masked with NaN.

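A minimal sketch of that batch preparation, assuming plain per-column z-scoring and NaN padding to the 64-slot schema (the helper name and details are illustrative, not the actual Tabula pipeline):

```python
import numpy as np

MAX_NUMERIC = 64  # fixed schema width used during pretraining

def prepare_batch(X, y):
    """Z-score each column, then pad to the fixed 64-slot schema with NaN."""
    mu = np.nanmean(X, axis=0)
    sigma = np.nanstd(X, axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant columns
    Xn = (X - mu) / sigma
    # Unused feature slots are filled with NaN so the model can mask them out.
    pad = np.full((X.shape[0], MAX_NUMERIC - X.shape[1]), np.nan, dtype=np.float32)
    Xp = np.concatenate([Xn.astype(np.float32), pad], axis=1)
    y_std = y.std() if y.std() > 0 else 1.0
    yn = (y - y.mean()) / y_std
    return Xp, yn.astype(np.float32)
```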
## Pretraining Corpus

Trained on [`avewright/tabula-pretraining-corpus-v2`](https://huggingface.co/datasets/avewright/tabula-pretraining-corpus-v2):

| Source | OK Datasets | Status |
|---|---|---|
| PMLB | 422 | **Fully exhausted** (422 of 423 known datasets used; 1 download failure) |
| OpenML | 2,949 | 4,886 attempted; 1,900 rejected (too few features); 37 download failures |
| HuggingFace | 0 | 67 attempted; all failed on format incompatibilities |
| **Synthetic** | (unlimited) | tree-prior, GMM, polynomial, SCM, regression, time-series, mixed-type |

**Total corpus:** 541 shards, ~160 GB parquet.
**Format:** `feat_0..feat_63` (Float32, NaN=unused), `target` (Float32), `_source_meta` (JSON).

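Since shards are plain parquet in this schema, they can be inspected with pandas. The snippet below builds a toy frame in the documented format; loading a real shard works the same way (the shard filename shown in the comment is hypothetical):

```python
import json
import numpy as np
import pandas as pd

# Toy frame in the documented corpus schema:
# feat_0..feat_63 (Float32, NaN = unused slot), target (Float32), _source_meta (JSON).
n_rows, n_used = 4, 10
data = {f"feat_{i}": np.random.randn(n_rows).astype(np.float32) for i in range(n_used)}
data.update({f"feat_{i}": np.full(n_rows, np.nan, dtype=np.float32)
             for i in range(n_used, 64)})
data["target"] = np.random.randn(n_rows).astype(np.float32)
data["_source_meta"] = [json.dumps({"source": "toy"})] * n_rows
df = pd.DataFrame(data)

# A real shard would be read the same way, e.g.:
# df = pd.read_parquet("shard_000.parquet")  # filename is hypothetical
feature_cols = [c for c in df.columns if c.startswith("feat_")]
X = df[feature_cols].to_numpy()  # (n_rows, 64), NaN in unused slots
y = df["target"].to_numpy()
```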
### Dataset Exhaustion Notes

- **PMLB: fully exhausted.** 422 of 423 known datasets successfully processed
  (1 download failure: `chess`). No new PMLB datasets can be added without an
  upstream PMLB library update.

- **OpenML: largely exhausted.** 4,886 unique datasets attempted; 2,949 passed
  the pipeline. The 1,900 `schema_fail` entries are almost entirely datasets with
  only 1 output column and too few rows/features to be useful (e.g. `too small: (53, 1)`).
  These are unrecoverable without lowering quality thresholds. There may be a small
  tail of undiscovered OpenML datasets not yet paginated.

- **HuggingFace tabular:** 67 attempted from the curated catalog. All failed due to
  schema mismatches, missing splits, or download timeouts. The catalog needs expansion
  with manually vetted datasets.

## Files

| File | Description |
|---|---|
| `best.pt` | Best validation checkpoint (step 45,000, val_loss=0.2295) |
| `latest.pt` | Final training checkpoint (step 61,825) |
| `config.json` | Model and training hyperparameters |
| `training_log.txt` | Full training run output |

## Usage

```python
import torch
from tabula.models.transformer import TabularTransformer
from tabula.config import ModelConfig

# Load checkpoint (weights_only=False: the checkpoint stores a pickled config object)
ckpt = torch.load("best.pt", map_location="cpu", weights_only=False)
cfg = ckpt["config"].model

# Reconstruct model
model = TabularTransformer(
    d_model=cfg.d_model, n_heads=cfg.n_heads, n_layers=cfg.n_layers,
    d_ff=cfg.d_ff, dropout=cfg.dropout,
    num_numeric=64, num_categorical=0, num_text=0,
    output_dim=1,
    numeric_embedding=cfg.numeric_embedding,
    numeric_periodic_features=cfg.numeric_periodic_features,
    ffn_activation=cfg.ffn_activation, norm=cfg.norm, pooling=cfg.pooling,
)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()
```

## Training Notes

The model uses a fixed-width schema (64 numeric slots) regardless of the original
dataset width. Narrower datasets are padded out to 64 slots with NaN, which the
model masks. This forces the model to learn position-invariant feature
representations compatible with arbitrary tabular schemas.

Synthetic data fills gaps whenever the real-corpus buffer is empty, providing
100M+ rows per session of controlled variation in feature distributions,
missingness patterns, and task types.
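As an illustration of the synthetic side, one of the listed generators (the Gaussian-mixture prior) could look roughly like this. The actual generators are not published here, so this is only a sketch with a random linear target:

```python
import numpy as np

def sample_gmm_table(n_rows, n_features, n_components=4, seed=None):
    """Sample a synthetic regression table from a random Gaussian mixture.

    Features are drawn from a randomly parameterized GMM; the target is a
    random linear function of the features plus noise, giving controlled
    variation in feature distributions and task difficulty.
    """
    rng = np.random.default_rng(seed)
    means = rng.normal(0.0, 3.0, size=(n_components, n_features))
    scales = rng.uniform(0.5, 2.0, size=(n_components, n_features))
    comp = rng.integers(n_components, size=n_rows)  # per-row mixture assignment
    X = rng.normal(means[comp], scales[comp])       # (n_rows, n_features)
    w = rng.normal(size=n_features)
    y = X @ w + rng.normal(0.0, 0.1, size=n_rows)
    return X.astype(np.float32), y.astype(np.float32)
```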
best.pt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e00cb1b6d673dd836fed25916eff3e74d6ef08eee9c88d7da33943779aa7ef48
size 90133178
config.json ADDED
@@ -0,0 +1,69 @@
{
  "model_type": "tabula_transformer",
  "architecture": "TabularTransformer",
  "d_model": 256,
  "n_heads": 8,
  "n_layers": 8,
  "d_ff": 512,
  "dropout": 0.1,
  "ffn_activation": "swiglu",
  "norm": "rmsnorm",
  "pooling": "cls",
  "numeric_embedding": "periodic",
  "numeric_periodic_features": 16,
  "max_numeric_features": 64,
  "max_categories": 128,
  "feature_token_dropout": 0.05,
  "n_params": 10752769,
  "pretraining": {
    "best_step": 45000,
    "best_val_loss": 0.229543,
    "best_rows_seen": 23040000,
    "final_step": 61825,
    "final_rows_seen": 31654400,
    "batch_size": 512,
    "lr": 0.0003,
    "weight_decay": 0.0001,
    "amp": true,
    "amp_dtype": "float16",
    "grad_clip": 1.0,
    "warmup_steps": 2000,
    "lr_schedule": "cosine",
    "max_steps": 200000
  },
  "corpus": {
    "hf_repo": "avewright/tabula-pretraining-corpus-v2",
    "total_shards": 541,
    "real_datasets_ok": 3371,
    "sources": {
      "pmlb": {
        "ok": 422,
        "total_attempted": 423,
        "status": "fully_exhausted"
      },
      "openml": {
        "ok": 2949,
        "total_attempted": 4886,
        "schema_fail": 1900,
        "download_fail": 37
      },
      "huggingface": {
        "ok": 0,
        "download_fail": 66,
        "schema_fail": 1
      }
    },
    "synthetic_generators": [
      "tree_prior",
      "gaussian_mixture",
      "polynomial",
      "scm",
      "regression",
      "time_series",
      "mixed_type"
    ]
  },
  "date_trained": "2026-03-16",
  "framework": "pytorch",
  "pytorch_version": "2.4.1+cu124"
}
latest.pt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c5099d3ce00a81b85230f7e18636dc328bd1c673993f554022aa03c8e7c2af0c
size 90173498
training_log.txt ADDED
The diff for this file is too large to render. See raw diff