Add README with project documentation

3212b49 verified 27 days ago

4.74 kB

	# Causal Discovery Algorithm Selection Meta-Learner

	A meta-learning system that predicts the top-3 best causal discovery algorithms for any discrete observational dataset, based on dataset meta-features.

	## 🎯 What it Does

	Given a new discrete dataset (pandas DataFrame), the system:
	1. Extracts 34 meta-features (entropy, mutual information, chi² statistics, CI test probes, etc.)
	2. Predicts normalized SHD for each of 9 algorithms via a trained Random Forest
	3. Ranks and returns the top-3 algorithms expected to produce the most accurate CPDAG

	## 📊 Performance (Leave-One-Network-Out Cross-Validation)

	\| Metric \| Value \|
	\|--------\|-------\|
	\| Top-3 Hit Rate \| 67.2% (true best algorithm is in predicted top-3) \|
	\| NDCG@3 \| 0.947 (ranking quality) \|
	\| Mean Regret \| 0.012 (tiny SHD gap vs oracle selection) \|
	\| Median Regret \| 0.000 (majority of predictions are perfect) \|

	Evaluated on 116 benchmark configs across 13 bnlearn networks (5–70 nodes).

	## 🧪 Algorithm Pool (9 algorithms)

	\| Algorithm \| Family \| Library \| Output \|
	\|-----------\|--------\|---------\|--------\|
	\| PC \| Constraint-based \| causal-learn \| CPDAG \|
	\| FCI \| Constraint-based \| causal-learn \| PAG \|
	\| GES \| Score-based \| causal-learn \| CPDAG \|
	\| BOSS \| Permutation-based \| causal-learn \| CPDAG \|
	\| GRaSP \| Permutation-based \| causal-learn \| CPDAG \|
	\| HC \| Score-based (greedy) \| pgmpy \| DAG \|
	\| Tabu \| Score-based (meta-heuristic) \| pgmpy \| DAG \|
	\| MMHC \| Hybrid \| pgmpy \| DAG \|
	\| K2 \| Score-based (ordering) \| pgmpy \| DAG \|

	## 🔬 Key Insight: Dependency Parsing Connection

	This project was inspired by a structural parallel between NLP dependency parsing and causal discovery:
	- Both predict directed graphs over nodes (words/variables)
	- Both have ground-truth annotations (treebanks/bnlearn networks)
	- Both use arc-level evaluation (UAS/LAS ↔ SHD/F1)

	The biaffine pairwise scoring mechanism from Dozat & Manning (2017) was independently reinvented by AVICI and CauScale for causal structure learning — validating this connection.

	Top predictive meta-features (confirming the parsing analogy):
	1. `max_pairwise_MI` (24.6%) — strongest pairwise dependency (≈ biaffine arc score)
	2. `n_variables` (14.8%) — network size
	3. `max_entropy` (9.5%) — variable complexity
	4. `max_cramers_v` (6.7%) — strongest association strength

	## 🚀 Quick Start

	```python
	from causal_selection.meta_learner.predictor import predict_best_algorithms
	import pandas as pd

	# Load your discrete dataset
	df = pd.read_csv("my_discrete_data.csv")

	# Get top-3 recommendations
	result = predict_best_algorithms(df, k=3)
	# Prints ranked algorithms with predicted accuracy and confidence
	```

	## 📁 Project Structure

	```
	causal_selection/
	├── data/
	│ ├── generator.py # Load bnlearn networks, sample data, DAG→CPDAG
	│ ├── bif_files/ # 14 bnlearn BIF files (asia through win95pts)
	│ └── results/ # Benchmark CSVs: meta-features, SHD matrices
	├── discovery/
	│ ├── algorithms.py # 9 algorithm adapters with timeout handling
	│ └── evaluator.py # SHD, F1, Precision, Recall computation
	├── features/
	│ └── extractor.py # 34 meta-features across 5 tiers
	├── meta_learner/
	│ ├── trainer.py # Multi-Output RF/GBM + LONO-CV evaluation
	│ └── predictor.py # Inference: dataset → top-3 prediction
	├── models/
	│ ├── meta_learner.pkl # Trained Random Forest
	│ └── scaler.pkl # Feature scaler
	├── benchmark.py # Full benchmark orchestration
	└── run_benchmark.py # Resumable benchmark runner
	```

	## 📈 Benchmark Data

	- 14 bnlearn networks: asia, cancer, earthquake, sachs, survey, alarm, barley, child, insurance, mildew, water, hailfinder, hepar2, win95pts
	- 116+ dataset configs: varying sample sizes (250–10,000) × multiple seeds
	- 1,000+ algorithm runs: 9 algorithms × 116 configs with per-algorithm timeout

	## 🔧 Dependencies

	```
	causal-learn>=0.1.4
	pgmpy>=0.1.25
	scikit-learn>=1.8
	pandas
	numpy
	scipy
	joblib
	```

	## 📚 References

	- Causal-Copilot (arxiv:2504.13263) — Closest existing algorithm selection system
	- AVICI (arxiv:2205.12934) — Amortized causal structure learning (biaffine architecture)
	- Dozat & Manning (arxiv:1611.01734) — Deep Biaffine Attention for dependency parsing
	- SATzilla (arxiv:1401.2474) — Algorithm selection via meta-learning
	- bnlearn (bnlearn.com) — Bayesian network benchmark repository

	## License

	MIT