Add README with project documentation
README.md
ADDED
@@ -0,0 +1,117 @@
# Causal Discovery Algorithm Selection Meta-Learner

A meta-learning system that predicts the **top-3 best causal discovery algorithms** for any discrete observational dataset, based on dataset meta-features.

## 🎯 What it Does

Given a new discrete dataset (pandas DataFrame), the system:
1. **Extracts 34 meta-features** (entropy, mutual information, chi² statistics, CI test probes, etc.)
2. **Predicts normalized SHD** for each of 9 algorithms via a trained Random Forest
3. **Ranks and returns the top-3** algorithms expected to produce the most accurate CPDAG
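
The ranking step (steps 2 and 3) can be sketched as follows. This is a minimal illustration, not the project's code: the function name `top_k_algorithms` and the predicted values are hypothetical, and it only assumes that a lower predicted normalized SHD means a better expected result.

```python
from typing import Dict, List, Tuple


def top_k_algorithms(predicted_shd: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k algorithms with the lowest predicted normalized SHD."""
    # Lower SHD means the learned graph is expected to be closer to the truth,
    # so sort ascending; ties are broken alphabetically for determinism.
    ranked = sorted(predicted_shd.items(), key=lambda item: (item[1], item[0]))
    return ranked[:k]


# Hypothetical per-algorithm predictions for one dataset (made-up numbers).
predicted = {
    "PC": 0.21, "FCI": 0.30, "GES": 0.18, "BOSS": 0.12, "GRaSP": 0.14,
    "HC": 0.25, "Tabu": 0.24, "MMHC": 0.33, "K2": 0.40,
}
top3 = top_k_algorithms(predicted, k=3)
print(top3)
```

In the real system the dictionary would come from the trained Random Forest applied to the 34 extracted meta-features.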

## 📊 Performance (Leave-One-Network-Out Cross-Validation)

| Metric | Value |
|--------|-------|
| **Top-3 Hit Rate** | **67.2%** (true best algorithm is in the predicted top-3) |
| **NDCG@3** | **0.947** (ranking quality) |
| **Mean Regret** | **0.012** (small normalized-SHD gap vs. oracle selection) |
| **Median Regret** | **0.000** (more than half of the predictions have zero regret) |

Evaluated on 116 benchmark configs across 13 bnlearn networks (5–70 nodes).
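
To make the hit-rate and regret metrics concrete, here is a small sketch. It assumes that regret is measured as the best true normalized SHD among the predicted top-3 minus the oracle's best SHD; the README does not spell out the exact definition, and the function names and numbers below are hypothetical.

```python
from typing import Dict, List


def top_k_hit(true_shd: Dict[str, float], predicted_top_k: List[str]) -> bool:
    """True if an algorithm with the lowest true SHD is in the predicted top-k."""
    best = min(true_shd.values())
    return any(true_shd[name] == best for name in predicted_top_k)


def regret(true_shd: Dict[str, float], predicted_top_k: List[str]) -> float:
    """Best true SHD within the predicted top-k minus the oracle's best SHD."""
    return min(true_shd[name] for name in predicted_top_k) - min(true_shd.values())


# Two made-up benchmark configs: (true normalized SHD per algorithm, predicted top-3).
configs = [
    ({"PC": 0.20, "GES": 0.10, "HC": 0.30, "K2": 0.40}, ["GES", "PC", "HC"]),
    ({"PC": 0.25, "GES": 0.35, "HC": 0.30, "K2": 0.15}, ["PC", "HC", "GES"]),
]
hit_rate = sum(top_k_hit(t, p) for t, p in configs) / len(configs)
mean_regret = sum(regret(t, p) for t, p in configs) / len(configs)
print(f"hit rate = {hit_rate:.2f}, mean regret = {mean_regret:.3f}")
```

Under this definition a hit always gives zero regret, which is consistent with a median regret of 0.000 when more than half of the configs are hits.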

## 🧪 Algorithm Pool (9 algorithms)

| Algorithm | Family | Library | Output |
|-----------|--------|---------|--------|
| **PC** | Constraint-based | causal-learn | CPDAG |
| **FCI** | Constraint-based | causal-learn | PAG |
| **GES** | Score-based | causal-learn | CPDAG |
| **BOSS** | Permutation-based | causal-learn | CPDAG |
| **GRaSP** | Permutation-based | causal-learn | CPDAG |
| **HC** | Score-based (greedy) | pgmpy | DAG |
| **Tabu** | Score-based (meta-heuristic) | pgmpy | DAG |
| **MMHC** | Hybrid | pgmpy | DAG |
| **K2** | Score-based (ordering) | pgmpy | DAG |

## 🔬 Key Insight: Dependency Parsing Connection

This project was inspired by a structural parallel between **NLP dependency parsing** and **causal discovery**:
- Both predict **directed graphs** over nodes (words/variables)
- Both have **ground-truth annotations** (treebanks/bnlearn networks)
- Both use **arc-level evaluation** (UAS/LAS ↔ SHD/F1)
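
On the causal side of the arc-level analogy, a structural Hamming distance can be sketched as below. This is a simplified version for fully directed graphs that counts each missing, extra, or reversed edge once; it assumes neither graph contains both directions of the same edge. It is not the project's `evaluator.py`, which also has to handle CPDAG outputs and normalization.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]  # (parent, child)


def shd(true_edges: Set[Edge], learned_edges: Set[Edge]) -> int:
    """Structural Hamming distance between two directed graphs."""
    true_skeleton = {frozenset(e) for e in true_edges}
    learned_skeleton = {frozenset(e) for e in learned_edges}
    missing = len(true_skeleton - learned_skeleton)   # adjacency absent from the learned graph
    extra = len(learned_skeleton - true_skeleton)     # adjacency absent from the true graph
    # Adjacencies found in both graphs but oriented the wrong way round.
    reversed_count = sum(
        1 for (a, b) in learned_edges
        if (b, a) in true_edges and (a, b) not in true_edges
    )
    return missing + extra + reversed_count


true_graph = {("A", "B"), ("B", "C"), ("C", "D")}
learned_graph = {("A", "B"), ("C", "B"), ("A", "D")}
distance = shd(true_graph, learned_graph)  # one missing, one extra, one reversed
print(distance)
```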

The biaffine pairwise scoring mechanism from Dozat & Manning (2017) was independently reinvented by AVICI and CauScale for causal structure learning, which supports this connection.
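
For readers unfamiliar with biaffine scoring, the following is a rough sketch of the idea in plain Python: every ordered pair of nodes gets a score from a bilinear term plus a head-only bias term. The function name, dimensions, and numbers are illustrative assumptions; real parsers and causal models learn `U` and `u` and compute the node representations with neural networks.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def biaffine_scores(heads: Matrix, deps: Matrix, U: Matrix, u: Vector) -> Matrix:
    """scores[i][j] = heads[i]^T U deps[j] + heads[i] . u  (score for arc i -> j)."""
    scores = []
    for h in heads:
        # Project the head vector through U once, then dot it with every dependent.
        hU = [sum(h[a] * U[a][b] for a in range(len(h))) for b in range(len(U[0]))]
        bias = sum(x * y for x, y in zip(h, u))
        scores.append([sum(x * y for x, y in zip(hU, d)) + bias for d in deps])
    return scores


# Toy 2-dimensional head/dependent representations for three nodes (made-up numbers).
heads = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
deps = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
U = [[1.0, 0.0], [0.0, -1.0]]
u = [0.1, 0.2]
scores = biaffine_scores(heads, deps, U, u)
print(scores[0][1], scores[1][0])
```

Because each node has separate head and dependent representations, the score for `i -> j` generally differs from the score for `j -> i`, which is what makes the mechanism suitable for directed edges.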

**Top predictive meta-features** (confirming the parsing analogy):
1. `max_pairwise_MI` (24.6%): strongest pairwise dependency (≈ biaffine arc score)
2. `n_variables` (14.8%): network size
3. `max_entropy` (9.5%): variable complexity
4. `max_cramers_v` (6.7%): strongest association strength
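
As an illustration of what features such as `max_entropy` and `max_pairwise_MI` measure, here is a minimal sketch using empirical frequencies. It assumes entropies in bits and plain plug-in estimates; the project's `extractor.py` may use different units, estimators, or normalization, and the helper names below are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, Hashable, Sequence


def entropy(column: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of one discrete column."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())


def mutual_information(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """Empirical mutual information I(X;Y) = H(X) + H(Y) - H(X,Y), in bits."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def max_pairwise_mi(columns: Dict[str, Sequence[Hashable]]) -> float:
    """Largest mutual information over all unordered pairs of columns."""
    return max(
        mutual_information(columns[a], columns[b])
        for a, b in combinations(columns, 2)
    )


data = {
    "A": [0, 0, 1, 1],
    "B": [0, 0, 1, 1],  # identical to A, so I(A;B) = H(A)
    "C": [0, 1, 0, 1],  # empirically independent of A and B in this sample
}
max_entropy = max(entropy(col) for col in data.values())
max_mi = max_pairwise_mi(data)
print(max_entropy, max_mi)
```

With a pandas DataFrame, the same functions could be applied to `df[col].tolist()` for each column.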

## 🚀 Quick Start

```python
from causal_selection.meta_learner.predictor import predict_best_algorithms
import pandas as pd

# Load your discrete dataset
df = pd.read_csv("my_discrete_data.csv")

# Get top-3 recommendations
result = predict_best_algorithms(df, k=3)
# Prints ranked algorithms with predicted accuracy and confidence
```

## 📁 Project Structure

```
causal_selection/
├── data/
│   ├── generator.py          # Load bnlearn networks, sample data, DAG→CPDAG
│   ├── bif_files/            # 14 bnlearn BIF files (asia through win95pts)
│   └── results/              # Benchmark CSVs: meta-features, SHD matrices
├── discovery/
│   ├── algorithms.py         # 9 algorithm adapters with timeout handling
│   └── evaluator.py          # SHD, F1, Precision, Recall computation
├── features/
│   └── extractor.py          # 34 meta-features across 5 tiers
├── meta_learner/
│   ├── trainer.py            # Multi-Output RF/GBM + LONO-CV evaluation
│   └── predictor.py          # Inference: dataset → top-3 prediction
├── models/
│   ├── meta_learner.pkl      # Trained Random Forest
│   └── scaler.pkl            # Feature scaler
├── benchmark.py              # Full benchmark orchestration
└── run_benchmark.py          # Resumable benchmark runner
```

## 📈 Benchmark Data

- **14 bnlearn networks**: asia, cancer, earthquake, sachs, survey, alarm, barley, child, insurance, mildew, water, hailfinder, hepar2, win95pts
- **116+ dataset configs**: varying sample sizes (250–10,000) × multiple seeds
- **1,000+ algorithm runs**: 9 algorithms × 116 configs with per-algorithm timeout
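
Because several configs are sampled from the same network, a leave-one-network-out split keeps all configs of one network together in the test fold. The sketch below shows the idea in plain Python; the function name and the toy labels are assumptions, not the project's `trainer.py`.

```python
from typing import Iterator, List, Tuple


def leave_one_network_out(networks: List[str]) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_network, train_indices, test_indices) for each distinct network."""
    for held_out in sorted(set(networks)):
        train = [i for i, name in enumerate(networks) if name != held_out]
        test = [i for i, name in enumerate(networks) if name == held_out]
        yield held_out, train, test


# Network label of each benchmark config (toy example: 3 networks, 6 configs).
config_networks = ["asia", "asia", "sachs", "alarm", "alarm", "alarm"]
folds = list(leave_one_network_out(config_networks))
for held_out, train, test in folds:
    print(held_out, "train:", train, "test:", test)
```

scikit-learn's `LeaveOneGroupOut` provides the same kind of split when the network name is passed as the `groups` argument.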

## 🔧 Dependencies

```
causal-learn>=0.1.4
pgmpy>=0.1.25
scikit-learn>=1.8
pandas
numpy
scipy
joblib
```

## 📚 References

- **Causal-Copilot** (arXiv:2504.13263): Closest existing algorithm selection system
- **AVICI** (arXiv:2205.12934): Amortized causal structure learning (biaffine architecture)
- **Dozat & Manning** (arXiv:1611.01734): Deep Biaffine Attention for dependency parsing
- **SATzilla** (arXiv:1401.2474): Algorithm selection via meta-learning
- **bnlearn** (bnlearn.com): Bayesian network benchmark repository

## License

MIT