Oguzz07 committed on
Commit
3212b49
·
verified ·
1 Parent(s): 46cec70

Add README with project documentation

Files changed (1)
  1. README.md +117 -0
README.md ADDED
@@ -0,0 +1,117 @@
# Causal Discovery Algorithm Selection Meta-Learner

A meta-learning system that predicts the **top-3 causal discovery algorithms** for a discrete observational dataset, based on the dataset's meta-features.

## 🎯 What it Does

Given a new discrete dataset (pandas DataFrame), the system:
1. **Extracts 34 meta-features** (entropy, mutual information, chi² statistics, conditional-independence (CI) test probes, etc.)
2. **Predicts normalized SHD** (structural Hamming distance) for each of 9 algorithms via a trained Random Forest
3. **Ranks and returns the top-3** algorithms by predicted SHD (lower is better), i.e. those expected to produce the most accurate CPDAG
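
Step 3 amounts to sorting the algorithms by predicted normalized SHD, lowest first. A minimal sketch, assuming the predictions arrive as a plain name-to-score mapping; the function name `top_k_algorithms` and the example scores are hypothetical and not taken from `predictor.py`:

```python
def top_k_algorithms(predicted_shd, k=3):
    """Return the k algorithm names with the lowest predicted normalized SHD."""
    # Sort by score first, then by name, so ties are broken deterministically.
    ranked = sorted(predicted_shd.items(), key=lambda item: (item[1], item[0]))
    return [name for name, _ in ranked[:k]]


# Made-up predictions, for illustration only.
example = {"PC": 0.21, "GES": 0.18, "HC": 0.25, "Tabu": 0.19, "MMHC": 0.30}
print(top_k_algorithms(example))
```

Breaking ties by name is an arbitrary choice made here for reproducibility; the actual predictor may resolve ties differently.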

## 📊 Performance (Leave-One-Network-Out Cross-Validation)

| Metric | Value | Interpretation |
|--------|-------|----------------|
| **Top-3 Hit Rate** | **67.2%** | True best algorithm is in the predicted top-3 |
| **NDCG@3** | **0.947** | Ranking quality |
| **Mean Regret** | **0.012** | Small SHD gap vs. oracle selection |
| **Median Regret** | **0.000** | At least half of the predictions incur zero regret |

Evaluated on 116 benchmark configs across 13 bnlearn networks (5–70 nodes).
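
The README does not spell out how these metrics are computed, so the following is only a sketch under stated assumptions: a "hit" means the algorithm with the lowest true SHD appears in the predicted top-k, and "regret" is the lowest true SHD among the predicted top-k minus the lowest true SHD overall. The function names and the toy numbers are hypothetical.

```python
from statistics import mean, median


def top_k(scores, k=3):
    """Names of the k algorithms with the lowest score (lower SHD is better)."""
    ranked = sorted(scores.items(), key=lambda kv: (kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]


def evaluate(true_shd, pred_shd, k=3):
    """Hit rate and regret over configs.

    true_shd / pred_shd: lists of {algorithm: normalized SHD} dicts, one per config.
    """
    hits, regrets = [], []
    for truth, pred in zip(true_shd, pred_shd):
        chosen = top_k(pred, k)
        best_name = min(truth, key=truth.get)  # oracle choice (first one on ties)
        hits.append(1.0 if best_name in chosen else 0.0)
        # Assumed regret: best true SHD within the chosen set minus the oracle SHD.
        regrets.append(min(truth[name] for name in chosen) - truth[best_name])
    return {
        "hit_rate": mean(hits),
        "mean_regret": mean(regrets),
        "median_regret": median(regrets),
    }


# Toy data: two configs, four algorithms, top-2 selection.
truth = [
    {"PC": 0.10, "GES": 0.20, "HC": 0.30, "K2": 0.40},
    {"PC": 0.30, "GES": 0.25, "HC": 0.10, "K2": 0.40},
]
pred = [
    {"PC": 0.15, "GES": 0.12, "HC": 0.35, "K2": 0.50},
    {"PC": 0.20, "GES": 0.22, "HC": 0.30, "K2": 0.50},
]
metrics = evaluate(truth, pred, k=2)
```

Under this definition a miss can still have small regret when the chosen algorithms are nearly as good as the oracle, which is one way a modest hit rate and a near-zero mean regret can coexist.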

## 🧪 Algorithm Pool (9 algorithms)

| Algorithm | Family | Library | Output |
|-----------|--------|---------|--------|
| **PC** | Constraint-based | causal-learn | CPDAG |
| **FCI** | Constraint-based | causal-learn | PAG |
| **GES** | Score-based | causal-learn | CPDAG |
| **BOSS** | Permutation-based | causal-learn | CPDAG |
| **GRaSP** | Permutation-based | causal-learn | CPDAG |
| **HC** | Score-based (greedy) | pgmpy | DAG |
| **Tabu** | Score-based (meta-heuristic) | pgmpy | DAG |
| **MMHC** | Hybrid | pgmpy | DAG |
| **K2** | Score-based (ordering) | pgmpy | DAG |

## 🔬 Key Insight: Dependency Parsing Connection

This project was inspired by a structural parallel between **NLP dependency parsing** and **causal discovery**:
- Both predict **directed graphs** over nodes (words/variables)
- Both have **ground-truth annotations** (treebanks/bnlearn networks)
- Both use **arc-level evaluation** (UAS/LAS ↔ SHD/F1)
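
To make the arc-level evaluation concrete, here is a small sketch of SHD over edge sets. SHD conventions vary; this version assumes each node pair whose edge status differs (missing, extra, or differently oriented) counts once, and it normalizes by the number of node pairs. Both choices are assumptions and need not match `evaluator.py`.

```python
from itertools import combinations


def shd(true_edges, pred_edges, nodes):
    """Structural Hamming distance between two graphs over the same nodes.

    Edges are (parent, child) tuples; an undirected CPDAG edge is listed in
    both directions. Each differing node pair contributes 1.
    """
    true_edges, pred_edges = set(true_edges), set(pred_edges)

    def status(graph, a, b):
        # (a -> b present, b -> a present) describes the pair's edge state.
        return ((a, b) in graph, (b, a) in graph)

    return sum(
        status(true_edges, a, b) != status(pred_edges, a, b)
        for a, b in combinations(nodes, 2)
    )


def normalized_shd(true_edges, pred_edges, nodes):
    """SHD divided by the number of node pairs (assumed normalization)."""
    n_pairs = len(nodes) * (len(nodes) - 1) // 2
    return shd(true_edges, pred_edges, nodes) / n_pairs if n_pairs else 0.0


nodes = ["A", "B", "C"]
true_graph = [("A", "B"), ("B", "C")]
pred_graph = [("B", "A"), ("B", "C"), ("A", "C")]  # one reversal, one extra edge
distance = shd(true_graph, pred_graph, nodes)
```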

The biaffine pairwise scoring mechanism from Dozat & Manning (2017) was independently reinvented by AVICI and CauScale for causal structure learning, which supports this connection.

**Top predictive meta-features** (consistent with the parsing analogy; feature importance in parentheses):
1. `max_pairwise_MI` (24.6%): strongest pairwise dependency (≈ biaffine arc score)
2. `n_variables` (14.8%): network size
3. `max_entropy` (9.5%): variable complexity
4. `max_cramers_v` (6.7%): strongest association strength
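
The first and third features can be illustrated with empirical entropy and mutual information on discrete columns. This is a minimal sketch using only the standard library; the exact definitions in `extractor.py` (log base, smoothing, normalization) are not given in this README, so natural logarithms and raw frequencies are assumed here.

```python
from collections import Counter
from itertools import combinations
from math import log


def entropy(xs):
    """Shannon entropy (in nats) of a discrete sequence."""
    n = len(xs)
    return -sum(c / n * log(c / n) for c in Counter(xs).values())


def mutual_information(xs, ys):
    """Empirical mutual information I(X; Y) in nats."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x, y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y).
    return sum(c / n * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def max_pairwise_mi(columns):
    """Largest MI over all variable pairs; `columns` maps name -> list of values."""
    return max(
        mutual_information(columns[a], columns[b])
        for a, b in combinations(columns, 2)
    )


def max_entropy(columns):
    """Largest marginal entropy over all variables."""
    return max(entropy(values) for values in columns.values())


# Toy data: y copies x, z is independent of both.
data = {"x": [0, 1, 0, 1], "y": [0, 1, 0, 1], "z": [0, 0, 1, 1]}
strongest = max_pairwise_mi(data)
```

A pandas DataFrame can be converted to this dict-of-lists form with `df.to_dict("list")`.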

## 🚀 Quick Start

```python
import pandas as pd

from causal_selection.meta_learner.predictor import predict_best_algorithms

# Load your discrete dataset
df = pd.read_csv("my_discrete_data.csv")

# Get top-3 recommendations
result = predict_best_algorithms(df, k=3)
# Prints ranked algorithms with predicted accuracy and confidence
```

## 📁 Project Structure

```
causal_selection/
├── data/
│   ├── generator.py      # Load bnlearn networks, sample data, DAG→CPDAG
│   ├── bif_files/        # 14 bnlearn BIF files (asia through win95pts)
│   └── results/          # Benchmark CSVs: meta-features, SHD matrices
├── discovery/
│   ├── algorithms.py     # 9 algorithm adapters with timeout handling
│   └── evaluator.py      # SHD, F1, Precision, Recall computation
├── features/
│   └── extractor.py      # 34 meta-features across 5 tiers
├── meta_learner/
│   ├── trainer.py        # Multi-Output RF/GBM + LONO-CV evaluation
│   └── predictor.py      # Inference: dataset → top-3 prediction
├── models/
│   ├── meta_learner.pkl  # Trained Random Forest
│   └── scaler.pkl        # Feature scaler
├── benchmark.py          # Full benchmark orchestration
└── run_benchmark.py      # Resumable benchmark runner
```
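
The leave-one-network-out (LONO) evaluation mentioned for `trainer.py` can be sketched with scikit-learn's `LeaveOneGroupOut`, using the source network of each config as the group label. The data below is synthetic and the hyperparameters are placeholders; this illustrates the protocol and is not the project's training code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)

# Synthetic stand-ins: 12 configs, 5 meta-features, 3 algorithms, 4 networks.
X = rng.normal(size=(12, 5))    # meta-features, one row per config
Y = rng.uniform(size=(12, 3))   # normalized SHD, one column per algorithm
groups = np.repeat(["asia", "sachs", "child", "alarm"], 3)  # source network

predictions = np.empty_like(Y)
for train_idx, test_idx in LeaveOneGroupOut().split(X, Y, groups):
    # All configs from one network are held out together, so the model is
    # always scored on a network it has never seen during training.
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], Y[train_idx])  # multi-output regression
    predictions[test_idx] = model.predict(X[test_idx])

# Column indices sorted by predicted SHD, best first, for every config.
ranking = np.argsort(predictions, axis=1)
```

Grouping by network matters because configs that differ only in sample size or seed share the same ground-truth graph; a plain random split would leak that structure into the test folds.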

## 📈 Benchmark Data

- **14 bnlearn networks**: asia, cancer, earthquake, sachs, survey, alarm, barley, child, insurance, mildew, water, hailfinder, hepar2, win95pts
- **116+ dataset configs**: varying sample sizes (250–10,000) × multiple seeds
- **1,000+ algorithm runs**: 9 algorithms × 116 configs with per-algorithm timeout
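
A per-algorithm timeout can be approximated with the standard library as shown below. This is a hedged sketch: `run_with_timeout` is a hypothetical helper, not the adapter code in `algorithms.py`, and a thread-based wrapper only stops waiting for the result. The worker thread keeps running in the background, so a hard kill would need a separate process.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, status).

    status is "ok", "timeout", or "error"; result is None unless status is "ok".
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s), "ok"
    except FutureTimeout:
        return None, "timeout"
    except Exception:
        # Any failure inside the algorithm is recorded instead of aborting the run.
        return None, "error"
    finally:
        # Do not block on a worker that is still busy.
        pool.shutdown(wait=False)


fast = run_with_timeout(lambda: 42, 1.0)
slow = run_with_timeout(time.sleep, 0.05, 0.3)
```

Recording a status rather than raising lets a benchmark loop continue past slow or failing algorithms and mark those cells as missing in the results.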

## 🔧 Dependencies

```
causal-learn>=0.1.4
pgmpy>=0.1.25
scikit-learn>=1.8
pandas
numpy
scipy
joblib
```

## 📚 References

- **Causal-Copilot** (arXiv:2504.13263): Closest existing algorithm selection system
- **AVICI** (arXiv:2205.12934): Amortized causal structure learning (biaffine architecture)
- **Dozat & Manning** (arXiv:1611.01734): Deep Biaffine Attention for dependency parsing
- **SATzilla** (arXiv:1401.2474): Algorithm selection via meta-learning
- **bnlearn** (bnlearn.com): Bayesian network benchmark repository

## License

MIT