Oguzz07's picture
Add README with project documentation
3212b49 verified
|
raw
history blame
4.74 kB

Causal Discovery Algorithm Selection Meta-Learner

A meta-learning system that predicts the top-3 best causal discovery algorithms for any discrete observational dataset, based on dataset meta-features.

🎯 What it Does

Given a new discrete dataset (pandas DataFrame), the system:

  1. Extracts 34 meta-features (entropy, mutual information, chi² statistics, CI test probes, etc.)
  2. Predicts normalized SHD for each of 9 algorithms via a trained Random Forest
  3. Ranks and returns the top-3 algorithms expected to produce the most accurate CPDAG

📊 Performance (Leave-One-Network-Out Cross-Validation)

Metric Value
Top-3 Hit Rate 67.2% (true best algorithm is in predicted top-3)
NDCG@3 0.947 (ranking quality)
Mean Regret 0.012 (tiny SHD gap vs oracle selection)
Median Regret 0.000 (majority of predictions are perfect)

Evaluated on 116 benchmark configs across 13 bnlearn networks (5–70 nodes).

🧪 Algorithm Pool (9 algorithms)

Algorithm Family Library Output
PC Constraint-based causal-learn CPDAG
FCI Constraint-based causal-learn PAG
GES Score-based causal-learn CPDAG
BOSS Permutation-based causal-learn CPDAG
GRaSP Permutation-based causal-learn CPDAG
HC Score-based (greedy) pgmpy DAG
Tabu Score-based (meta-heuristic) pgmpy DAG
MMHC Hybrid pgmpy DAG
K2 Score-based (ordering) pgmpy DAG

🔬 Key Insight: Dependency Parsing Connection

This project was inspired by a structural parallel between NLP dependency parsing and causal discovery:

  • Both predict directed graphs over nodes (words/variables)
  • Both have ground-truth annotations (treebanks/bnlearn networks)
  • Both use arc-level evaluation (UAS/LAS ↔ SHD/F1)

The biaffine pairwise scoring mechanism from Dozat & Manning (2017) was independently reinvented by AVICI and CauScale for causal structure learning — validating this connection.

Top predictive meta-features (confirming the parsing analogy):

  1. max_pairwise_MI (24.6%) — strongest pairwise dependency (≈ biaffine arc score)
  2. n_variables (14.8%) — network size
  3. max_entropy (9.5%) — variable complexity
  4. max_cramers_v (6.7%) — strongest association strength

🚀 Quick Start

from causal_selection.meta_learner.predictor import predict_best_algorithms
import pandas as pd

# Load your discrete dataset
df = pd.read_csv("my_discrete_data.csv")

# Get top-3 recommendations
result = predict_best_algorithms(df, k=3)
# Prints ranked algorithms with predicted accuracy and confidence

📁 Project Structure

causal_selection/
├── data/
│   ├── generator.py          # Load bnlearn networks, sample data, DAG→CPDAG
│   ├── bif_files/            # 14 bnlearn BIF files (asia through win95pts)
│   └── results/              # Benchmark CSVs: meta-features, SHD matrices
├── discovery/
│   ├── algorithms.py         # 9 algorithm adapters with timeout handling
│   └── evaluator.py          # SHD, F1, Precision, Recall computation
├── features/
│   └── extractor.py          # 34 meta-features across 5 tiers
├── meta_learner/
│   ├── trainer.py            # Multi-Output RF/GBM + LONO-CV evaluation
│   └── predictor.py          # Inference: dataset → top-3 prediction
├── models/
│   ├── meta_learner.pkl      # Trained Random Forest
│   └── scaler.pkl            # Feature scaler
├── benchmark.py              # Full benchmark orchestration
└── run_benchmark.py          # Resumable benchmark runner

📈 Benchmark Data

  • 14 bnlearn networks: asia, cancer, earthquake, sachs, survey, alarm, barley, child, insurance, mildew, water, hailfinder, hepar2, win95pts
  • 116+ dataset configs: varying sample sizes (250–10,000) × multiple seeds
  • 1,000+ algorithm runs: 9 algorithms × 116 configs with per-algorithm timeout

🔧 Dependencies

causal-learn>=0.1.4
pgmpy>=0.1.25
scikit-learn>=1.8
pandas
numpy
scipy
joblib

📚 References

  • Causal-Copilot (arxiv:2504.13263) — Closest existing algorithm selection system
  • AVICI (arxiv:2205.12934) — Amortized causal structure learning (biaffine architecture)
  • Dozat & Manning (arxiv:1611.01734) — Deep Biaffine Attention for dependency parsing
  • SATzilla (arxiv:1401.2474) — Algorithm selection via meta-learning
  • bnlearn (bnlearn.com) — Bayesian network benchmark repository

License

MIT