# COPD Open Models – Model C (72-Hour Exacerbation Prediction)

## Model Details
Model C predicts the risk of a COPD exacerbation within 72 hours using features derived from NHS EHR datasets and patient-reported outcomes (PROs). The release includes a reproducible training/evaluation pipeline and runs on standard Python ML libraries (pandas, scikit-learn, imbalanced-learn, plus optional gradient-boosting libraries).
### Key Characteristics
- PRO LOGIC – a clinically informed validation algorithm that deduplicates and filters patient-reported exacerbation events (14-day minimum between episodes, consecutive negative rescue-medication responses required for borderline events, 7-day rescue-med prescription spacing).
- Compares 10 algorithms with per-fold preprocessing to prevent data leakage.
- Training code is fully decoupled from cloud infrastructure – it runs locally with no Azure dependencies.
Note: This repository contains no real patient-level data. All included data files are synthetic or example data for pipeline validation.
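The 14-day deduplication rule in PRO LOGIC can be sketched as follows. This is an illustrative version only (function and variable names are ours, and the repository's implementation additionally applies the rescue-medication checks described above):

```python
from datetime import date

def deduplicate_events(event_dates, min_gap_days=14):
    """Keep a reported event only if it starts at least `min_gap_days`
    after the last kept event (illustrative version of the 14-day rule)."""
    kept = []
    for d in sorted(event_dates):
        if not kept or (d - kept[-1]).days >= min_gap_days:
            kept.append(d)
    return kept

reports = [date(2024, 1, 1), date(2024, 1, 5), date(2024, 1, 20)]
print(deduplicate_events(reports))  # Jan 5 is folded into the Jan 1 episode
```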
### Model Type
Traditional tabular ML classifiers (multiple candidate estimators; see "Training Procedure").
### Release Notes
- Phase 1 (current): Models C, E, H published as the initial "COPD Open Models" collection.
- Phase 2 (planned): Additional models may follow after codebase sanitisation.
## Intended Use
This model and code are published as reference implementations for research, education, and benchmarking on COPD prediction tasks.
### Intended Users
- ML practitioners exploring tabular healthcare ML pipelines
- Researchers comparing feature engineering and evaluation approaches
- Developers building internal prototypes (non-clinical)
### Out-of-Scope Uses
- Not for clinical decision-making, triage, diagnosis, or treatment planning.
- Not a substitute for clinical judgement or validated clinical tools.
- Do not deploy in healthcare settings without an appropriate regulatory, clinical safety, and information governance framework.
## Regulatory Considerations (SaMD)
Regulatory status for software depends on the intended purpose expressed in documentation, labelling, and promotional materials. Downstream users integrating or deploying this model should determine whether their implementation qualifies as Software as a Medical Device (SaMD) and identify the legal "manufacturer" responsible for compliance and post-market obligations.
## Training Data
- Source: NHS EHR-derived datasets and Lenus COPD Service PRO data (training performed on controlled datasets; not distributed here).
- Data available in this repo: Synthetic/example datasets only.
- Cohort: ~302 COPD patients (84 RECEIVER + 218 Scale-Up), with one prediction per patient per day.
- Train/test split: 85% / 15%, stratified by exacerbation status and sex.
- Class balance: Exacerbation days are the minority class (~5–10% positive).
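The 85/15 split stratified jointly on exacerbation status and sex can be sketched with scikit-learn by stratifying on a combined label. Column names here are illustrative, not the repository's schema:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy patient-level frame standing in for the real cohort table.
patients = pd.DataFrame({
    "patient_id": range(40),
    "ever_exacerbated": [0, 1] * 20,
    "sex": (["F"] * 10 + ["M"] * 10) * 2,
})

# Stratify jointly on exacerbation status and sex, as in the 85/15 split.
strata = patients["ever_exacerbated"].astype(str) + "_" + patients["sex"]
train, test = train_test_split(
    patients, test_size=0.15, stratify=strata, random_state=42
)
```

Stratifying on the concatenated label preserves the joint class/sex proportions in both partitions.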
### Features (35 total)
| Category | Features |
|---|---|
| Daily PROs | CAT Q1–Q8, CAT Score, Symptom Diary Q1–Q3, plus 3-day rolling mean difference variants for each |
| Weekly PROs | Q5 (rescue meds), Q8 (phlegm difficulty), Q9 (phlegm consistency), Q10 (phlegm colour) – target-encoded |
| Clinical | Sex_F, RequiredAcuteNIV, RequiredICUAdmission, HighestEosinophilCount_0_3, TripleTherapy, AsthmaOverlap |
| Categorical (target-encoded) | SmokingStatus, Age (binned: <50 / 50-59 / 60-69 / 70-79 / 80+), FEV1PercentPredicted (Mild / Moderate / Severe / Very Severe), Comorbidities (None / 1-2 / 3+), DaysSinceLastExac (binned) |
| Temporal | ExacsPrevYear (rolling 365-day sum), AdmissionsPrevYear (rolling 365-day sum) |
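The rolling 365-day temporal features can be computed with a time-based pandas window, for example (column names are assumptions, not the repository's schema; note this sketch's window includes the current day):

```python
import pandas as pd

# Toy longitudinal frame: one row per patient per recorded date.
df = pd.DataFrame({
    "patient_id": [1, 1, 1, 2],
    "date": pd.to_datetime(["2023-01-01", "2023-06-01", "2024-03-01", "2023-01-01"]),
    "exacerbation": [1, 1, 1, 0],
}).sort_values(["patient_id", "date"])

# Rolling 365-day event count per patient, using the date as a time-based index.
rolled = (
    df.set_index("date")
      .groupby("patient_id")["exacerbation"]
      .rolling("365D")
      .sum()
)
df["ExacsPrevYear"] = rolled.to_numpy()  # rows are already in group order
```

AdmissionsPrevYear follows the same pattern with an admissions indicator column.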
### Data Preprocessing
- Target encoding – applied per fold using K-fold encoding on categorical features.
- MinMax scaling – all features scaled to [0, 1], fit on the training fold only.
- Median imputation – missing values imputed per fold using training-fold medians.
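The per-fold discipline above can be sketched as follows: every statistic (median, min/max, category means) is learned on the training fold and only applied to the validation fold. The target encoding here is a simple mean encoding for illustration, not the repository's K-fold encoder:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Numeric preprocessing: impute with training-fold medians, then scale to [0, 1].
numeric_prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),
])

X_train = pd.DataFrame({"cat_score": [10.0, np.nan, 30.0, 20.0]})
X_valid = pd.DataFrame({"cat_score": [np.nan, 25.0]})

X_train_t = numeric_prep.fit_transform(X_train)  # fit on the training fold only
X_valid_t = numeric_prep.transform(X_valid)      # apply to the validation fold

# Mean target encoding fit on the training fold (illustrative stand-in).
y_train = pd.Series([0, 1, 1, 0])
smoking = pd.Series(["current", "ex", "current", "never"])
means = y_train.groupby(smoking).mean()
encoded_valid = pd.Series(["ex", "never"]).map(means).fillna(y_train.mean())
```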
## Training Procedure

### Training Framework
- pandas, scikit-learn, imbalanced-learn
- Optional: xgboost, lightgbm, interpret (for EBM)
- Experiment tracking: MLflow
### Algorithms Evaluated
| # | Algorithm | Library |
|---|---|---|
| 1 | RandomForestClassifier | sklearn |
| 2 | RandomForestClassifier (class_weight='balanced') | sklearn |
| 3 | BalancedBaggingClassifier | imblearn |
| 4 | BalancedRandomForestClassifier | imblearn |
| 5 | XGBClassifier | xgboost |
| 6 | XGBClassifier (scale_pos_weight) | xgboost |
| 7 | LGBMClassifier | lightgbm |
| 8 | ExplainableBoostingClassifier | interpret |
| 9 | LogisticRegression | sklearn |
| 10 | LogisticRegression (class_weight='balanced') | sklearn |
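A minimal version of the comparison loop, shown here with only the scikit-learn candidates from the table on synthetic imbalanced data; the imblearn, xgboost, lightgbm, and interpret estimators slot into the same dictionary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Imbalanced synthetic data standing in for the exacerbation-day outcome.
X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)

candidates = {
    "rf": RandomForestClassifier(n_estimators=50, random_state=0),
    "rf_balanced": RandomForestClassifier(
        n_estimators=50, class_weight="balanced", random_state=0
    ),
    "logreg": LogisticRegression(max_iter=1000),
    "logreg_balanced": LogisticRegression(max_iter=1000, class_weight="balanced"),
}

results = {
    name: cross_val_score(est, X, y, cv=5, scoring="average_precision")
    for name, est in candidates.items()
}
for name, scores in results.items():
    print(f"{name}: AUC-PR {scores.mean():.3f} (std {scores.std():.3f})")
```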
### Evaluation Design
- 5-fold stratified cross-validation, balanced by class and grouped by patient.
- Per-fold preprocessing (encoding, scaling, imputation) to prevent data leakage.
- Decision thresholds evaluated at: 0.3, 0.4, 0.5, 0.6, 0.7, 0.8.
- Calibration tested: sigmoid and isotonic methods via CalibratedClassifierCV.
## Evaluation Results
Replace this section with measured results from your training run.
| Metric | Value | Notes |
|---|---|---|
| ROC-AUC | TBD | Cross-validation mean (Β± std) |
| AUC-PR | TBD | Primary metric for imbalanced outcome |
| F1 Score | TBD | At threshold 0.5 |
| Balanced Accuracy | TBD | Cross-validation mean |
| Precision | TBD | At chosen threshold |
| Recall | TBD | At chosen threshold |
| Brier Score | TBD | Probability calibration quality |
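Once you have out-of-fold probabilities, the table's metrics can all be computed with scikit-learn. The labels and probabilities below are toy values for illustration, not measured results:

```python
import numpy as np
from sklearn.metrics import (average_precision_score, balanced_accuracy_score,
                             brier_score_loss, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Toy out-of-fold labels and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 0, 0, 1])
probs = np.array([0.1, 0.2, 0.3, 0.2, 0.8, 0.4, 0.6, 0.1, 0.3, 0.4])
preds = (probs >= 0.5).astype(int)  # threshold 0.5, as in the F1 row

metrics = {
    "ROC-AUC": roc_auc_score(y_true, probs),
    "AUC-PR": average_precision_score(y_true, probs),
    "F1": f1_score(y_true, preds),
    "Balanced Accuracy": balanced_accuracy_score(y_true, preds),
    "Precision": precision_score(y_true, preds),
    "Recall": recall_score(y_true, preds),
    "Brier": brier_score_loss(y_true, probs),  # mean squared probability error
}
```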
### Caveats on Metrics
- Performance depends heavily on cohort definition, feature availability, and label construction.
- Reported metrics from controlled datasets may not transfer to other settings without recalibration and validation.
- Exacerbation labels are constructed via PRO LOGIC – different event definitions will produce different results.
## Bias, Risks, and Limitations
- Dataset shift: EHR coding practices, care pathways, and population characteristics vary across sites and time periods.
- Label uncertainty: Exacerbations may be incompletely observed in routine data; PRO LOGIC filtering may not generalise to all clinical contexts.
- Fairness: Outcomes and feature availability may vary by age, sex, deprivation, comorbidity burden, or service access.
- Misuse risk: Using predictions to drive clinical action without clinical safety processes can cause harm through false positives and negatives.
- Cohort size: ~302 patients is relatively small; results should be interpreted with appropriate uncertainty.
## How to Use

### Pipeline Execution Order
```bash
# 1. Install dependencies
pip install pandas numpy scikit-learn imbalanced-learn xgboost lightgbm interpret mlflow matplotlib seaborn

# 2. Define exacerbations with PRO LOGIC
python training/define_exacerbations_prologic.py

# 3. Train/test split (85/15, stratified)
python training/train_test_split.py

# 4. Prepare training data (encode, scale, impute)
python training/prepare_train_data.py

# 5. Prepare cross-validation folds (per-fold preprocessing)
python training/prepare_train_data_crossval.py

# 6. Prepare test data (using training encodings)
python training/prepare_test_data.py

# 7. Compare algorithms via cross-validation
python training/cross_validation_algorithms.py

# 8. Train final model (BalancedRandomForestClassifier)
python training/cross_validation.py

# 9. Evaluate calibration methods
python training/cross_validation_calibration.py
```
### Adapting to Your Data
Replace the input data paths in define_exacerbations_prologic.py with your own EHR extract. The pipeline expects CSV files with columns for patient ID, dates, diagnoses, PRO responses, and pharmacy records.
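A simple schema check can catch mismatched extracts before the pipeline runs. The column names below are hypothetical examples; the actual columns expected are defined in define_exacerbations_prologic.py:

```python
import pandas as pd

def check_schema(df: pd.DataFrame, required: set) -> None:
    """Fail fast if an extract lacks the columns the pipeline needs."""
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"extract is missing columns: {sorted(missing)}")

# In-memory frame standing in for a CSV extract; column names are hypothetical.
ehr = pd.DataFrame({
    "patient_id": [1],
    "event_date": ["2024-01-01"],
    "diagnosis_code": ["J44.1"],
})
check_schema(ehr, {"patient_id", "event_date", "diagnosis_code"})
```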
## Environmental Impact
Training computational requirements are minimal – all models are traditional tabular ML classifiers running on CPU. A full cross-validation sweep across the 10 algorithms completes in minutes on a standard laptop.
## Citation
If you use this model or code, please cite:
- This repository: (add citation format / Zenodo DOI if minted)
- Associated publications: (clinical trial results paper β forthcoming)
## Authors and Contributors
- Storm ID (maintainers)
## License
This model and code are released under the Apache 2.0 license.