# Causal Discovery Algorithm Selection Meta-Learner

A meta-learning system that predicts the **top-3 best causal discovery algorithms** for any discrete observational dataset, based on dataset meta-features.

## 🎯 What it Does

Given a new discrete dataset (pandas DataFrame), the system:

1. **Extracts 34 meta-features** (entropy, mutual information, chi² statistics, CI test probes, etc.)
2. **Predicts normalized SHD** for each of 9 algorithms via a trained Random Forest
3. **Ranks and returns the top-3** algorithms expected to produce the most accurate CPDAG

## 📊 Performance (Leave-One-Network-Out Cross-Validation)

| Metric | Value |
|--------|-------|
| **Top-3 Hit Rate** | **67.2%** (true best algorithm is in predicted top-3) |
| **NDCG@3** | **0.947** (ranking quality) |
| **Mean Regret** | **0.012** (tiny SHD gap vs oracle selection) |
| **Median Regret** | **0.000** (majority of predictions are perfect) |

Evaluated on 116 benchmark configs across 13 bnlearn networks (5–70 nodes).

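To make the ranking step and the metrics in the table above concrete, here is a minimal sketch of how a top-k hit rate and regret could be computed from predicted and true normalized SHD matrices (rows are dataset configs, columns are algorithms, lower is better). The function names, the toy numbers, and the regret definition (best true SHD among the predicted top-k minus the oracle's best SHD) are assumptions for illustration; the project's `trainer.py` may define these differently.

```python
import numpy as np


def top_k_indices(pred_shd, k=3):
    """Return the k column indices with the lowest predicted SHD, best first."""
    pred_shd = np.asarray(pred_shd, dtype=float)
    return np.argsort(pred_shd, axis=1, kind="stable")[:, :k]


def top_k_hit_rate(pred_shd, true_shd, k=3):
    """Fraction of rows whose truly best algorithm is in the predicted top-k.

    Ties in true SHD are broken by taking the first minimum (assumption).
    """
    true_shd = np.asarray(true_shd, dtype=float)
    top_k = top_k_indices(pred_shd, k)
    best = np.argmin(true_shd, axis=1)
    hits = (top_k == best[:, None]).any(axis=1)
    return float(hits.mean())


def regret(pred_shd, true_shd, k=3):
    """Per-row gap between the best true SHD inside the predicted top-k
    and the oracle's best true SHD (assumed definition of regret)."""
    true_shd = np.asarray(true_shd, dtype=float)
    top_k = top_k_indices(pred_shd, k)
    chosen = np.take_along_axis(true_shd, top_k, axis=1).min(axis=1)
    return chosen - true_shd.min(axis=1)


# Toy example: 2 dataset configs x 4 algorithms (made-up numbers).
pred = [[0.2, 0.1, 0.5, 0.4],
        [0.3, 0.6, 0.2, 0.1]]
true = [[0.3, 0.1, 0.2, 0.6],
        [0.1, 0.5, 0.4, 0.3]]

ranking = top_k_indices(pred, k=2)
hit_rate = top_k_hit_rate(pred, true, k=2)
regrets = regret(pred, true, k=2)
print(ranking.tolist(), hit_rate, regrets.mean(), np.median(regrets))
```

In the toy data the first config is a hit (the true best algorithm is ranked first) and the second is a miss, so the selection there pays a nonzero regret.
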
## 🧪 Algorithm Pool (9 algorithms)

| Algorithm | Family | Library | Output |
|-----------|--------|---------|--------|
| **PC** | Constraint-based | causal-learn | CPDAG |
| **FCI** | Constraint-based | causal-learn | PAG |
| **GES** | Score-based | causal-learn | CPDAG |
| **BOSS** | Permutation-based | causal-learn | CPDAG |
| **GRaSP** | Permutation-based | causal-learn | CPDAG |
| **HC** | Score-based (greedy) | pgmpy | DAG |
| **Tabu** | Score-based (meta-heuristic) | pgmpy | DAG |
| **MMHC** | Hybrid | pgmpy | DAG |
| **K2** | Score-based (ordering) | pgmpy | DAG |

## 🔬 Key Insight: Dependency Parsing Connection

This project was inspired by a structural parallel between **NLP dependency parsing** and **causal discovery**:

- Both predict **directed graphs** over nodes (words/variables)
- Both have **ground-truth annotations** (treebanks/bnlearn networks)
- Both use **arc-level evaluation** (UAS/LAS ↔ SHD/F1)

The biaffine pairwise scoring mechanism from Dozat & Manning (2017) was independently reinvented by AVICI and CauScale for causal structure learning, validating this connection.

**Top predictive meta-features** (confirming the parsing analogy):

1. `max_pairwise_MI` (24.6%): strongest pairwise dependency (≈ biaffine arc score)
2. `n_variables` (14.8%): network size
3. `max_entropy` (9.5%): variable complexity
4. `max_cramers_v` (6.7%): strongest association strength

## 🚀 Quick Start

```python
from causal_selection.meta_learner.predictor import predict_best_algorithms
import pandas as pd

# Load your discrete dataset
df = pd.read_csv("my_discrete_data.csv")

# Get top-3 recommendations
result = predict_best_algorithms(df, k=3)
# Prints ranked algorithms with predicted accuracy and confidence
```

## 📁 Project Structure

```
causal_selection/
├── data/
│   ├── generator.py        # Load bnlearn networks, sample data, DAG→CPDAG
│   ├── bif_files/          # 14 bnlearn BIF files (asia through win95pts)
│   └── results/            # Benchmark CSVs: meta-features, SHD matrices
├── discovery/
│   ├── algorithms.py       # 9 algorithm adapters with timeout handling
│   └── evaluator.py        # SHD, F1, Precision, Recall computation
├── features/
│   └── extractor.py        # 34 meta-features across 5 tiers
├── meta_learner/
│   ├── trainer.py          # Multi-Output RF/GBM + LONO-CV evaluation
│   └── predictor.py        # Inference: dataset → top-3 prediction
├── models/
│   ├── meta_learner.pkl    # Trained Random Forest
│   └── scaler.pkl          # Feature scaler
├── benchmark.py            # Full benchmark orchestration
└── run_benchmark.py        # Resumable benchmark runner
```

## 📈 Benchmark Data

- **14 bnlearn networks**: asia, cancer, earthquake, sachs, survey, alarm, barley, child, insurance, mildew, water, hailfinder, hepar2, win95pts
- **116+ dataset configs**: varying sample sizes (250–10,000) × multiple seeds
- **1,000+ algorithm runs**: 9 algorithms × 116 configs with per-algorithm timeout

## 🔧 Dependencies

```
causal-learn>=0.1.4
pgmpy>=0.1.25
scikit-learn>=1.8
pandas
numpy
scipy
joblib
```

## 📚 References

- **Causal-Copilot** (arxiv:2504.13263): Closest existing algorithm selection system
- **AVICI** (arxiv:2205.12934): Amortized causal structure learning (biaffine architecture)
- **Dozat & Manning** (arxiv:1611.01734): Deep Biaffine Attention for dependency parsing
- **SATzilla** (arxiv:1401.2474): Algorithm selection via meta-learning
- **bnlearn** (bnlearn.com): Bayesian network benchmark repository

## License

MIT