Causal Discovery Algorithm Selection Meta-Learner
A meta-learning system that predicts the top-3 best causal discovery algorithms for any discrete observational dataset, based on dataset meta-features.
🎯 What it Does
Given a new discrete dataset (pandas DataFrame), the system:
- Extracts 34 meta-features (entropy, mutual information, chi² statistics, CI test probes, etc.)
- Predicts normalized SHD for each of 9 algorithms via a trained Random Forest
- Ranks and returns the top-3 algorithms expected to produce the most accurate CPDAG
📊 Performance (Leave-One-Network-Out Cross-Validation)
| Metric | Value |
|---|---|
| Top-3 Hit Rate | 67.2% (true best algorithm is in predicted top-3) |
| NDCG@3 | 0.947 (ranking quality) |
| Mean Regret | 0.012 (tiny SHD gap vs oracle selection) |
| Median Regret | 0.000 (majority of predictions are perfect) |
Evaluated on 116 benchmark configs across 13 bnlearn networks (5–70 nodes).
🧪 Algorithm Pool (9 algorithms)
| Algorithm | Family | Library | Output |
|---|---|---|---|
| PC | Constraint-based | causal-learn | CPDAG |
| FCI | Constraint-based | causal-learn | PAG |
| GES | Score-based | causal-learn | CPDAG |
| BOSS | Permutation-based | causal-learn | CPDAG |
| GRaSP | Permutation-based | causal-learn | CPDAG |
| HC | Score-based (greedy) | pgmpy | DAG |
| Tabu | Score-based (meta-heuristic) | pgmpy | DAG |
| MMHC | Hybrid | pgmpy | DAG |
| K2 | Score-based (ordering) | pgmpy | DAG |
🔬 Key Insight: Dependency Parsing Connection
This project was inspired by a structural parallel between NLP dependency parsing and causal discovery:
- Both predict directed graphs over nodes (words/variables)
- Both have ground-truth annotations (treebanks/bnlearn networks)
- Both use arc-level evaluation (UAS/LAS ↔ SHD/F1)
The biaffine pairwise scoring mechanism from Dozat & Manning (2017) was independently reinvented by AVICI and CauScale for causal structure learning — validating this connection.
Top predictive meta-features (confirming the parsing analogy):
max_pairwise_MI(24.6%) — strongest pairwise dependency (≈ biaffine arc score)n_variables(14.8%) — network sizemax_entropy(9.5%) — variable complexitymax_cramers_v(6.7%) — strongest association strength
🚀 Quick Start
from causal_selection.meta_learner.predictor import predict_best_algorithms
import pandas as pd
# Load your discrete dataset
df = pd.read_csv("my_discrete_data.csv")
# Get top-3 recommendations
result = predict_best_algorithms(df, k=3)
# Prints ranked algorithms with predicted accuracy and confidence
📁 Project Structure
causal_selection/
├── data/
│ ├── generator.py # Load bnlearn networks, sample data, DAG→CPDAG
│ ├── bif_files/ # 14 bnlearn BIF files (asia through win95pts)
│ └── results/ # Benchmark CSVs: meta-features, SHD matrices
├── discovery/
│ ├── algorithms.py # 9 algorithm adapters with timeout handling
│ └── evaluator.py # SHD, F1, Precision, Recall computation
├── features/
│ └── extractor.py # 34 meta-features across 5 tiers
├── meta_learner/
│ ├── trainer.py # Multi-Output RF/GBM + LONO-CV evaluation
│ └── predictor.py # Inference: dataset → top-3 prediction
├── models/
│ ├── meta_learner.pkl # Trained Random Forest
│ └── scaler.pkl # Feature scaler
├── benchmark.py # Full benchmark orchestration
└── run_benchmark.py # Resumable benchmark runner
📈 Benchmark Data
- 14 bnlearn networks: asia, cancer, earthquake, sachs, survey, alarm, barley, child, insurance, mildew, water, hailfinder, hepar2, win95pts
- 116+ dataset configs: varying sample sizes (250–10,000) × multiple seeds
- 1,000+ algorithm runs: 9 algorithms × 116 configs with per-algorithm timeout
🔧 Dependencies
causal-learn>=0.1.4
pgmpy>=0.1.25
scikit-learn>=1.8
pandas
numpy
scipy
joblib
📚 References
- Causal-Copilot (arxiv:2504.13263) — Closest existing algorithm selection system
- AVICI (arxiv:2205.12934) — Amortized causal structure learning (biaffine architecture)
- Dozat & Manning (arxiv:1611.01734) — Deep Biaffine Attention for dependency parsing
- SATzilla (arxiv:1401.2474) — Algorithm selection via meta-learning
- bnlearn (bnlearn.com) — Bayesian network benchmark repository
License
MIT