Oguzz07
/

causal-discovery-algorithm-selection

Model card Files Files and versions

xet

Community

Oguzz07 commited on 20 days ago

Commit

3212b49

verified ·

1 Parent(s): 46cec70

Add README with project documentation

Browse files

Files changed (1) hide show

README.md +117 -0

README.md ADDED Viewed

	@@ -0,0 +1,117 @@

+# Causal Discovery Algorithm Selection Meta-Learner
+A meta-learning system that predicts the **top-3 best causal discovery algorithms** for any discrete observational dataset, based on dataset meta-features.
+## 🎯 What it Does
+Given a new discrete dataset (pandas DataFrame), the system:
+1. **Extracts 34 meta-features** (entropy, mutual information, chi² statistics, CI test probes, etc.)
+2. **Predicts normalized SHD** for each of 9 algorithms via a trained Random Forest
+3. **Ranks and returns the top-3** algorithms expected to produce the most accurate CPDAG
+## 📊 Performance (Leave-One-Network-Out Cross-Validation)
+| Metric | Value |
+|--------|-------|
+| **Top-3 Hit Rate** | **67.2%** (true best algorithm is in predicted top-3) |
+| **NDCG@3** | **0.947** (ranking quality) |
+| **Mean Regret** | **0.012** (tiny SHD gap vs oracle selection) |
+| **Median Regret** | **0.000** (majority of predictions are perfect) |
+Evaluated on 116 benchmark configs across 13 bnlearn networks (5–70 nodes).
+## 🧪 Algorithm Pool (9 algorithms)
+| Algorithm | Family | Library | Output |
+|-----------|--------|---------|--------|
+| **PC** | Constraint-based | causal-learn | CPDAG |
+| **FCI** | Constraint-based | causal-learn | PAG |
+| **GES** | Score-based | causal-learn | CPDAG |
+| **BOSS** | Permutation-based | causal-learn | CPDAG |
+| **GRaSP** | Permutation-based | causal-learn | CPDAG |
+| **HC** | Score-based (greedy) | pgmpy | DAG |
+| **Tabu** | Score-based (meta-heuristic) | pgmpy | DAG |
+| **MMHC** | Hybrid | pgmpy | DAG |
+| **K2** | Score-based (ordering) | pgmpy | DAG |
+## 🔬 Key Insight: Dependency Parsing Connection
+This project was inspired by a structural parallel between **NLP dependency parsing** and **causal discovery**:
+- Both predict **directed graphs** over nodes (words/variables)
+- Both have **ground-truth annotations** (treebanks/bnlearn networks)
+- Both use **arc-level evaluation** (UAS/LAS ↔ SHD/F1)
+The biaffine pairwise scoring mechanism from Dozat & Manning (2017) was independently reinvented by AVICI and CauScale for causal structure learning — validating this connection.
+**Top predictive meta-features** (confirming the parsing analogy):
+1. `max_pairwise_MI` (24.6%) — strongest pairwise dependency (≈ biaffine arc score)
+2. `n_variables` (14.8%) — network size
+3. `max_entropy` (9.5%) — variable complexity
+4. `max_cramers_v` (6.7%) — strongest association strength
+## 🚀 Quick Start
+```python
+from causal_selection.meta_learner.predictor import predict_best_algorithms
+import pandas as pd
+# Load your discrete dataset
+df = pd.read_csv("my_discrete_data.csv")
+# Get top-3 recommendations
+result = predict_best_algorithms(df, k=3)
+# Prints ranked algorithms with predicted accuracy and confidence
+```
+## 📁 Project Structure
+```
+causal_selection/
+├── data/
+│   ├── generator.py          # Load bnlearn networks, sample data, DAG→CPDAG
+│   ├── bif_files/            # 14 bnlearn BIF files (asia through win95pts)
+│   └── results/              # Benchmark CSVs: meta-features, SHD matrices
+├── discovery/
+│   ├── algorithms.py         # 9 algorithm adapters with timeout handling
+│   └── evaluator.py          # SHD, F1, Precision, Recall computation
+├── features/
+│   └── extractor.py          # 34 meta-features across 5 tiers
+├── meta_learner/
+│   ├── trainer.py            # Multi-Output RF/GBM + LONO-CV evaluation
+│   └── predictor.py          # Inference: dataset → top-3 prediction
+├── models/
+│   ├── meta_learner.pkl      # Trained Random Forest
+│   └── scaler.pkl            # Feature scaler
+├── benchmark.py              # Full benchmark orchestration
+└── run_benchmark.py          # Resumable benchmark runner
+```
+## 📈 Benchmark Data
+- **14 bnlearn networks**: asia, cancer, earthquake, sachs, survey, alarm, barley, child, insurance, mildew, water, hailfinder, hepar2, win95pts
+- **116+ dataset configs**: varying sample sizes (250–10,000) × multiple seeds
+- **1,000+ algorithm runs**: 9 algorithms × 116 configs with per-algorithm timeout
+## 🔧 Dependencies
+```
+causal-learn>=0.1.4
+pgmpy>=0.1.25
+scikit-learn>=1.8
+pandas
+numpy
+scipy
+joblib
+```
+## 📚 References
+- **Causal-Copilot** (arxiv:2504.13263) — Closest existing algorithm selection system
+- **AVICI** (arxiv:2205.12934) — Amortized causal structure learning (biaffine architecture)
+- **Dozat & Manning** (arxiv:1611.01734) — Deep Biaffine Attention for dependency parsing
+- **SATzilla** (arxiv:1401.2474) — Algorithm selection via meta-learning
+- **bnlearn** (bnlearn.com) — Bayesian network benchmark repository
+## License
+MIT