# COPD Open Models – Model C (72-Hour Exacerbation Prediction)

## Model Details
Model C predicts the risk of a COPD exacerbation within 72 hours using features derived from NHS EHR datasets and patient-reported outcomes (PROs). The release includes a reproducible training/evaluation pipeline and runs on standard Python ML libraries (pandas, scikit-learn, imbalanced-learn, plus optional gradient-boosting libraries).
### Key Characteristics
- PRO LOGIC – a clinically informed validation algorithm that deduplicates and filters patient-reported exacerbation events (14-day minimum between episodes, consecutive negative rescue-medication responses required for borderline events, 7-day rescue-med prescription spacing).
- Compares 10 algorithms with per-fold preprocessing to prevent data leakage.
- Training code is fully decoupled from cloud infrastructure – it runs locally with no Azure dependencies.
Note: This repository contains no real patient-level data. All included data files are synthetic or example data for pipeline validation.
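The 14-day deduplication rule in PRO LOGIC can be sketched as follows. This is an illustrative version only (function and variable names are ours, and the repository's implementation additionally applies the rescue-medication checks described above):

```python
from datetime import date

def deduplicate_events(event_dates, min_gap_days=14):
    """Keep a reported event only if it starts at least `min_gap_days`
    after the last kept event (illustrative version of the 14-day rule)."""
    kept = []
    for d in sorted(event_dates):
        if not kept or (d - kept[-1]).days >= min_gap_days:
            kept.append(d)
    return kept

reports = [date(2024, 1, 1), date(2024, 1, 5), date(2024, 1, 20)]
print(deduplicate_events(reports))  # Jan 5 is folded into the Jan 1 episode
```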
### Model Type
Traditional tabular ML classifiers (multiple candidate estimators; see "Training Procedure").
### Release Notes
- Phase 1 (current): Models C, E, H published as the initial "COPD Open Models" collection.
- Phase 2 (planned): Additional models may follow after codebase sanitisation.
## Intended Use
This model and code are published as reference implementations for research, education, and benchmarking on COPD prediction tasks.
### Intended Users
- ML practitioners exploring tabular healthcare ML pipelines
- Researchers comparing feature engineering and evaluation approaches
- Developers building internal prototypes (non-clinical)
### Out-of-Scope Uses
- Not for clinical decision-making, triage, diagnosis, or treatment planning.
- Not a substitute for clinical judgement or validated clinical tools.
- Do not deploy in healthcare settings without an appropriate regulatory, clinical safety, and information governance framework.
## Regulatory Considerations (SaMD)
Regulatory status for software depends on the intended purpose expressed in documentation, labelling, and promotional materials. Downstream users integrating or deploying this model should determine whether their implementation qualifies as Software as a Medical Device (SaMD) and identify the legal "manufacturer" responsible for compliance and post-market obligations.
## Training Data
- Source: NHS EHR-derived datasets and Lenus COPD Service PRO data (training performed on controlled datasets; not distributed here).
- Data available in this repo: Synthetic/example datasets only.
- Cohort: ~302 COPD patients (84 RECEIVER + 218 Scale-Up), with one prediction per patient per day.
- Train/test split: 85% / 15%, stratified by exacerbation status and sex.
- Class balance: Exacerbation days are the minority class (~5–10% positive).
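The 85/15 split stratified jointly on exacerbation status and sex can be sketched with scikit-learn by stratifying on a combined label. Column names here are illustrative, not the repository's schema:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy patient-level frame standing in for the real cohort table.
patients = pd.DataFrame({
    "patient_id": range(40),
    "ever_exacerbated": [0, 1] * 20,
    "sex": (["F"] * 10 + ["M"] * 10) * 2,
})

# Stratify jointly on exacerbation status and sex, as in the 85/15 split.
strata = patients["ever_exacerbated"].astype(str) + "_" + patients["sex"]
train, test = train_test_split(
    patients, test_size=0.15, stratify=strata, random_state=42
)
```

Stratifying on the concatenated label preserves the joint class/sex proportions in both partitions.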
### Features (35 total)
| Category | Features |
|---|---|
| Daily PROs | CAT Q1–Q8, CAT Score, Symptom Diary Q1–Q3, plus 3-day rolling mean difference variants for each |
| Weekly PROs | Q5 (rescue meds), Q8 (phlegm difficulty), Q9 (phlegm consistency), Q10 (phlegm colour) – target-encoded |
| Clinical | Sex_F, RequiredAcuteNIV, RequiredICUAdmission, HighestEosinophilCount_0_3, TripleTherapy, AsthmaOverlap |
| Categorical (target-encoded) | SmokingStatus, Age (binned: <50 / 50-59 / 60-69 / 70-79 / 80+), FEV1PercentPredicted (Mild / Moderate / Severe / Very Severe), Comorbidities (None / 1-2 / 3+), DaysSinceLastExac (binned) |
| Temporal | ExacsPrevYear (rolling 365-day sum), AdmissionsPrevYear (rolling 365-day sum) |
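The rolling 365-day temporal features can be computed with a time-based pandas window, for example (column names are assumptions, not the repository's schema; note this sketch's window includes the current day):

```python
import pandas as pd

# Toy longitudinal frame: one row per patient per recorded date.
df = pd.DataFrame({
    "patient_id": [1, 1, 1, 2],
    "date": pd.to_datetime(["2023-01-01", "2023-06-01", "2024-03-01", "2023-01-01"]),
    "exacerbation": [1, 1, 1, 0],
}).sort_values(["patient_id", "date"])

# Rolling 365-day event count per patient, using the date as a time-based index.
rolled = (
    df.set_index("date")
      .groupby("patient_id")["exacerbation"]
      .rolling("365D")
      .sum()
)
df["ExacsPrevYear"] = rolled.to_numpy()  # rows are already in group order
```

AdmissionsPrevYear follows the same pattern with an admissions indicator column.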
### Data Preprocessing
- Target encoding – applied per fold using K-fold encoding on categorical features.
- MinMax scaling – all features scaled to [0, 1], fit on the training fold only.
- Median imputation – missing values imputed per fold using training-fold medians.
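The per-fold discipline above can be sketched as follows: every statistic (median, min/max, category means) is learned on the training fold and only applied to the validation fold. The target encoding here is a simple mean encoding for illustration, not the repository's K-fold encoder:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Numeric preprocessing: impute with training-fold medians, then scale to [0, 1].
numeric_prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),
])

X_train = pd.DataFrame({"cat_score": [10.0, np.nan, 30.0, 20.0]})
X_valid = pd.DataFrame({"cat_score": [np.nan, 25.0]})

X_train_t = numeric_prep.fit_transform(X_train)  # fit on the training fold only
X_valid_t = numeric_prep.transform(X_valid)      # apply to the validation fold

# Mean target encoding fit on the training fold (illustrative stand-in).
y_train = pd.Series([0, 1, 1, 0])
smoking = pd.Series(["current", "ex", "current", "never"])
means = y_train.groupby(smoking).mean()
encoded_valid = pd.Series(["ex", "never"]).map(means).fillna(y_train.mean())
```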
## Training Procedure

### Training Framework
- pandas, scikit-learn, imbalanced-learn
- Optional: xgboost, lightgbm, interpret (for EBM)
- Experiment tracking: MLflow
### Algorithms Evaluated
| # | Algorithm | Library |
|---|---|---|
| 1 | RandomForestClassifier | sklearn |
| 2 | RandomForestClassifier (class_weight='balanced') | sklearn |
| 3 | BalancedBaggingClassifier | imblearn |
| 4 | BalancedRandomForestClassifier | imblearn |
| 5 | XGBClassifier | xgboost |
| 6 | XGBClassifier (scale_pos_weight) | xgboost |
| 7 | LGBMClassifier | lightgbm |
| 8 | ExplainableBoostingClassifier | interpret |
| 9 | LogisticRegression | sklearn |
| 10 | LogisticRegression (class_weight='balanced') | sklearn |
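A minimal version of the comparison loop, shown here with only the scikit-learn candidates from the table on synthetic imbalanced data; the imblearn, xgboost, lightgbm, and interpret estimators slot into the same dictionary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Imbalanced synthetic data standing in for the exacerbation-day outcome.
X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)

candidates = {
    "rf": RandomForestClassifier(n_estimators=50, random_state=0),
    "rf_balanced": RandomForestClassifier(
        n_estimators=50, class_weight="balanced", random_state=0
    ),
    "logreg": LogisticRegression(max_iter=1000),
    "logreg_balanced": LogisticRegression(max_iter=1000, class_weight="balanced"),
}

results = {
    name: cross_val_score(est, X, y, cv=5, scoring="average_precision")
    for name, est in candidates.items()
}
for name, scores in results.items():
    print(f"{name}: AUC-PR {scores.mean():.3f} (std {scores.std():.3f})")
```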
### Evaluation Design
- 5-fold stratified cross-validation, balanced by class and grouped by patient.
- Per-fold preprocessing (encoding, scaling, imputation) to prevent data leakage.
- Decision thresholds evaluated at: 0.3, 0.4, 0.5, 0.6, 0.7, 0.8.
- Calibration tested: sigmoid and isotonic methods via CalibratedClassifierCV.
## Evaluation Results
Replace this section with measured results from your training run.
| Metric | Value | Notes |
|---|---|---|
| ROC-AUC | TBD | Cross-validation mean (Β± std) |
| AUC-PR | TBD | Primary metric for imbalanced outcome |
| F1 Score | TBD | At threshold 0.5 |
| Balanced Accuracy | TBD | Cross-validation mean |
| Precision | TBD | At chosen threshold |
| Recall | TBD | At chosen threshold |
| Brier Score | TBD | Probability calibration quality |
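Once you have out-of-fold probabilities, the table's metrics can all be computed with scikit-learn. The labels and probabilities below are toy values for illustration, not measured results:

```python
import numpy as np
from sklearn.metrics import (average_precision_score, balanced_accuracy_score,
                             brier_score_loss, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Toy out-of-fold labels and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 0, 0, 1])
probs = np.array([0.1, 0.2, 0.3, 0.2, 0.8, 0.4, 0.6, 0.1, 0.3, 0.4])
preds = (probs >= 0.5).astype(int)  # threshold 0.5, as in the F1 row

metrics = {
    "ROC-AUC": roc_auc_score(y_true, probs),
    "AUC-PR": average_precision_score(y_true, probs),
    "F1": f1_score(y_true, preds),
    "Balanced Accuracy": balanced_accuracy_score(y_true, preds),
    "Precision": precision_score(y_true, preds),
    "Recall": recall_score(y_true, preds),
    "Brier": brier_score_loss(y_true, probs),  # mean squared probability error
}
```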
### Caveats on Metrics
- Performance depends heavily on cohort definition, feature availability, and label construction.
- Reported metrics from controlled datasets may not transfer to other settings without recalibration and validation.
- Exacerbation labels are constructed via PRO LOGIC – different event definitions will produce different results.
## Bias, Risks, and Limitations
- Dataset shift: EHR coding practices, care pathways, and population characteristics vary across sites and time periods.
- Label uncertainty: Exacerbations may be incompletely observed in routine data; PRO LOGIC filtering may not generalise to all clinical contexts.
- Fairness: Outcomes and feature availability may vary by age, sex, deprivation, comorbidity burden, or service access.
- Misuse risk: Using predictions to drive clinical action without clinical safety processes can cause harm through false positives and negatives.
- Cohort size: ~302 patients is relatively small; results should be interpreted with appropriate uncertainty.
## How to Use

### Pipeline Execution Order
```bash
# 1. Install dependencies
pip install pandas numpy scikit-learn imbalanced-learn xgboost lightgbm interpret mlflow matplotlib seaborn

# 2. Define exacerbations with PRO LOGIC
python training/define_exacerbations_prologic.py

# 3. Train/test split (85/15, stratified)
python training/train_test_split.py

# 4. Prepare training data (encode, scale, impute)
python training/prepare_train_data.py

# 5. Prepare cross-validation folds (per-fold preprocessing)
python training/prepare_train_data_crossval.py

# 6. Prepare test data (using training encodings)
python training/prepare_test_data.py

# 7. Compare algorithms via cross-validation
python training/cross_validation_algorithms.py

# 8. Train final model (BalancedRandomForestClassifier)
python training/cross_validation.py

# 9. Evaluate calibration methods
python training/cross_validation_calibration.py
```
### Adapting to Your Data
Replace the input data paths in define_exacerbations_prologic.py with your own EHR extract. The pipeline expects CSV files with columns for patient ID, dates, diagnoses, PRO responses, and pharmacy records.
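A simple schema check can catch mismatched extracts before the pipeline runs. The column names below are hypothetical examples; the actual columns expected are defined in define_exacerbations_prologic.py:

```python
import pandas as pd

def check_schema(df: pd.DataFrame, required: set) -> None:
    """Fail fast if an extract lacks the columns the pipeline needs."""
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"extract is missing columns: {sorted(missing)}")

# In-memory frame standing in for a CSV extract; column names are hypothetical.
ehr = pd.DataFrame({
    "patient_id": [1],
    "event_date": ["2024-01-01"],
    "diagnosis_code": ["J44.1"],
})
check_schema(ehr, {"patient_id", "event_date", "diagnosis_code"})
```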
## Environmental Impact
Training computational requirements are minimal – all models are traditional tabular ML classifiers running on CPU. A full cross-validation sweep across the 10 algorithms completes in minutes on a standard laptop.
## Citation
If you use this model or code, please cite:
- This repository: (add citation format / Zenodo DOI if minted)
- Associated publications: (clinical trial results paper β forthcoming)
## Authors and Contributors
- Storm ID (maintainers)
## License
This model and code are released under the Apache 2.0 license.