File size: 6,871 Bytes
908b1d3 d580ab2 8fa6d4f 7b1958c 8fa6d4f 908b1d3 7b1958c 908b1d3 7b1958c d580ab2 7b1958c 908b1d3 7b1958c 908b1d3 7b1958c 908b1d3 c7c5e29 908b1d3 7b1958c c7c5e29 908b1d3 7b1958c c7c5e29 7b1958c 908b1d3 7b1958c 908b1d3 7b1958c | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 | ---
tags:
- causal-inference
- time-series
- foundation-model
- temporal-causal
- causal-judgment
license: other
license_name: proprietary
license_link: LICENSE
---
# TCPFN β Temporal Causal Prior-Data Fitted Networks
A family of causal reasoning foundation models β predict effects, judge trustworthiness, operate zero-shot. Three checkpoints share one architecture (12-layer transformer, `embed_dim=512`, 8 heads, HL-Gaussian output head) and differ only in training-data distribution and curriculum.
## Pick by task
| Task | Best checkpoint | Path |
|------|-----------------|------|
| General causal discovery (biology, cross-sectional, short-lag) | **v2.1** | `models/temporal/final.pt` |
| Industrial / long-range discovery (12+ h lags, digesterβpaper machine etc.) | **v2.2** | `models/v2.2/final.pt` |
| Effect estimation (CATE / PEHE) | **v3** | `models/v3/final.pt` |
All three are zero-shot. Pick the one matching your task β specialisation beats generalist on every task we've measured.
## Shared contributions
1. **Temporal Token Design** β first PFN for temporal panel data.
2. **Causal Judgment Head** β learned reliability signals (null detection, regime classification, identifiability, mediation, confounding).
3. **Causal Regime Prior** β direct / confounded / mediated / feedback training structures.
4. **Self-Calibration** β auto-detects natural experiments in sensor data.
5. **End-to-End System** β discovery + estimation + judgment + RCA from one forward pass.
## Shared capabilities
- **Causal Discovery** β pairwise interventional CATE with judgment-aware edge scoring, natural-experiment detection, continuous treatment, multi-lag estimation, asymmetry penalty.
- **Effect Estimation** β temporal CATE trajectories with distributional output.
- **Causal Judgment** β null-effect detection, regime classification (learned heuristics, not formal guarantees).
- **Root Cause Analysis** β 8-method ensemble (AERCA, ESD, ProRCA, GCM noise, ICC, Shapley, counterfactual, chain tracing).
---
## v2.1 β default discovery model
- 200K steps, curriculum-trained (Phase 1 CATE-only β Phase 2 +Null β Phase 3 Full).
- Mixed prior: 40% CausalTimePrior + 30% base + 30% CausalFM.
- Training window: `max_T_pre=50, max_T_post=30`.
- Hardware: RTX 5090, ~4.1 h, 13.9 steps/s.
### Discovery benchmarks (14 datasets, 6 domains, zero-shot)
- Sachs (11 proteins, biological): F1 0.412, **AUROC 0.725** (vs Granger 0.291 / 0.621) β **champion**.
- Causal Rivers (environmental): F1 0.319, AUROC 0.955.
- Tennessee Eastman (52 vars, industrial): F1 0.314, AUROC 0.904.
- SWaT (51 vars, water treatment): F1 0.265, AUROC 0.859.
- CauseMe NVAR-5 / NVAR-10: F1 0.571 / 0.439.
- Highest default-threshold F1 on 6 of 14 datasets.
- Hallucination FPR: 0.02β0.08 (down from 1.0 in v2.0).
### Training metrics (mean over steps 150Kβ200K)
- EffectLoss ~2.9 | JudgmentLoss ~2.8
- NullF1 0.94 | NullAUROC 0.99 | NullBrier 0.04 | NullSep 0.86
- RegimeAcc 0.68 | RegimeMacroF1 0.48
### Limitations
- CATE estimation weak (PEHE 0.92) due to per-group Z-standardisation β **use v3 for estimation**.
---
## v2.2 β industrial / long-range specialist
Built for 12+ hour causal lags in industrial control loops (digester β paper machine, reactor β downstream controller). Training window extended 4Γ and curriculum rebalanced to include null-effect batches in Phase 2.
- 200K steps, BF16 mixed precision, `head_lr_scale=0.1` (decouples output-head learning from backbone to prevent late-stage drift collapse).
- Training window: `max_T_pre=200, max_T_post=100, max_horizon=500` β supports lags up to ~16 h at 2-min sampling.
- Manual NaN-skip with observability (saves first NaN-producing batch, aborts if skip rate β₯ threshold).
- Hardware: RTX 5090, ~14.9 h.
### Discovery benchmarks (default threshold 0.5)
Strong on industrial / multivariate temporal data β **use this** when lags exceed ~1 h or when data is genuinely time-series (not stitched cross-sectional).
| Dataset | Default F1 | Best F1 | AUROC |
|---------|-----------|---------|-------|
| Tennessee Eastman | **0.512** | 0.545 | **0.972** |
| SWaT | **0.463** | 0.552 | **0.945** |
| CauseMe VAR-5 | 0.769 | 0.800 | 0.960 |
| CauseMe NVAR-5 | 0.800 | 0.800 | 0.863 |
| CauseMe VAR-10 | 0.488 | 0.643 | 0.812 |
| CauseMe NVAR-10 | 0.634 | 0.634 | 0.759 |
| CauseMe Lorenz96-10 | 0.484 | 0.638 | 0.699 |
| Sachs | 0.174 | 0.308 | 0.565 |
Granger and PCMCI collapse on industrial data β they over-predict (1897 edges on TE vs 38 true), giving F1 ~0.04. TCPFN v2.2 is the only method with usable precision + recall together.
### Estimation benchmarks
- Overall PEHE 0.917 | ATE MAE 0.504 | trajectory correlation β 0 β **use v3 for CATE**.
### Limitations
- **Sachs regressed** vs v2.1 (AUROC 0.565 vs 0.725). Use v2.1 for cross-sectional biological graphs.
- Estimation degraded β trades short-range precision for long-range reach (see scar-tissue entry L-33 in project docs).
---
## v3 β estimation champion (experimental)
Tag: `3.0.0-exp-global-std`. Global standardisation fix for the per-group Z-score bias that caps v2.1/v2.2 estimation quality.
- 200K steps.
- **PEHE 0.72** (vs v2.1 0.92 and v2.2 0.92) β best of the three on CATE estimation.
- Discovery regressed slightly as trade-off; not yet benchmarked across all 14 discovery datasets β **use v2.1 or v2.2 for discovery**.
### Limitations
- Experimental tag β standardisation change not yet battle-tested beyond estimation.
- Full benchmark matrix still pending.
---
## Usage
```python
from tcpfn import TemporalCausalAnalyzer
# Discovery on general data (biology, cross-sectional)
analyzer = TemporalCausalAnalyzer(temporal_model="models/temporal/final.pt")
# Industrial / long-range discovery (lags in hours)
analyzer = TemporalCausalAnalyzer(temporal_model="models/v2.2/final.pt")
# Effect estimation (CATE trajectories, PEHE-sensitive work)
analyzer = TemporalCausalAnalyzer(temporal_model="models/v3/final.pt")
report = analyzer.run("sensor_data.csv")
print(report.edges) # causal graph with edge strengths and lags
print(report.summary()) # human-readable summary
result = analyzer.explain_event(
data_path="sensor_data.csv",
target_var="temperature_sensor",
event_time="2025-11-15 14:15",
)
print(result.summary()) # ranked root causes + causal chains
```
## Cross-cutting limitations
- Regime classification is noisy (~0.68 accuracy, high eval variance). Judgment heads are **learned heuristics**, not formal guarantees.
- Low-dim cross-sectional data stitched into pseudo-timeseries is out-of-distribution for v2.2 and v3; use v2.1.
- v3 has not yet been run on the full discovery benchmark suite.
## Paper
Stalupula et al., "Temporal Causal Prior-Data Fitted Networks for Panel Data with Learned Reliability Signals"
|