---
language: en
license: apache-2.0
tags:
- healthcare
- ehr
- copd
- clinical-risk
- tabular
- scikit-learn
- xgboost
- lightgbm
- catboost
- patient-reported-outcomes
pipeline_tag: tabular-classification
library_name: sklearn
---
# COPD Open Models – Model H (90-Day Exacerbation Prediction)
## Model Details
Model H predicts the risk of a COPD exacerbation within **90 days** using features derived from NHS EHR datasets and patient-reported outcomes (PROs). It serves as both a production-grade prediction pipeline and a **template project** for building new COPD prediction models, featuring a reusable 2,000+ line core library (`model_h.py`) with end-to-end training, calibration, evaluation, and SHAP explainability.
### Key Characteristics
- **Comprehensive PRO integration** – the most detailed PRO feature engineering in the portfolio, covering four instruments: EQ-5D (monthly), MRC dyspnoea (weekly), CAT (daily), and Symptom Diary (daily), each with engagement metrics, score differences, and multi-window aggregations.
- **13 algorithms screened** in the first phase, narrowed to top 3 with Bayesian hyperparameter tuning.
- **Forward validation** on 9 months of prospective data (May 2023 – February 2024).
- **Reusable core library** (`model_h.py`) – 40+ functions for label setup, feature engineering, model evaluation, calibration, and SHAP explainability.
- Training code is fully decoupled from cloud infrastructure – runs locally with no Azure dependencies.
> **Note:** This repository contains no real patient-level data. All included data files are synthetic or example data for pipeline validation.
### Model Type
Traditional tabular ML classifiers (multiple candidate estimators; see "Training Procedure").
### Release Notes
- **Phase 1 (current):** Models C, E, H published as the initial "COPD Open Models" collection.
- **Phase 2 (planned):** Additional models may follow after codebase sanitisation.
---
## Intended Use
This model and code are published as **reference implementations** for research, education, and benchmarking on COPD prediction tasks.
### Intended Users
- ML practitioners exploring tabular healthcare ML pipelines
- Researchers comparing feature engineering and evaluation approaches
- Developers building internal prototypes (non-clinical)
### Out-of-Scope Uses
- **Not** for clinical decision-making, triage, diagnosis, or treatment planning.
- **Not** a substitute for clinical judgement or validated clinical tools.
- Do **not** deploy in healthcare settings without an appropriate regulatory, clinical safety, and information governance framework.
### Regulatory Considerations (SaMD)
Regulatory status for software depends on the intended purpose expressed in documentation, labelling, and promotional materials. Downstream users integrating or deploying this model should determine whether their implementation qualifies as Software as a Medical Device (SaMD) and identify the legal "manufacturer" responsible for compliance and post-market obligations.
---
## Training Data
- **Source:** NHS EHR-derived datasets and Lenus COPD Service PRO data (training performed on controlled datasets; not distributed here).
- **Data available in this repo:** Synthetic/example datasets only.
- **Cohort:** COPD patients with RECEIVER and Scale-Up cohort membership.
- **Target:** Binary – `ExacWithin3Months` (hospital + community exacerbations) or `HospExacWithin3Months` (hospital only).
- **Configuration:** 90-day prediction window, 180-day lookback, 5-fold cross-validation.
### Features
| Category | Features |
|----------|----------|
| **Demographics** | Age (binned: <50/50-59/60-69/70-79/80+), Sex_F |
| **Comorbidities** | AsthmaOverlap (binary), Comorbidities count (binned: None/1-2/3+) |
| **Exacerbation History** | Hospital and community exac counts in lookback, days since last exac, recency-weighted counts |
| **Spirometry** | FEV1, FVC, FEV1/FVC ratio β max, min, and latest values |
| **Laboratory (20+ tests)** | MaxLifetime, MinLifetime, Max1Year, Min1Year, Latest values with recency weighting (decay_rate=0.001) – WBC, RBC, haemoglobin, haematocrit, platelets, sodium, potassium, creatinine, albumin, glucose, ALT, AST, GGT, bilirubin, ALP, cholesterol, triglycerides, TSH, and more |
| **EQ-5D (monthly)** | Q1–Q5, total score, latest values, engagement rates, score changes |
| **MRC Dyspnoea (weekly)** | MRC Score (1–5), latest value, engagement, variations |
| **CAT (daily)** | Q1–Q8, total score (0–40), latest values, engagement, score differences |
| **Symptom Diary (daily)** | Q5 rescue medication (binary), weekly aggregates, engagement rates |
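The lab features above use recency weighting with `decay_rate=0.001`. A minimal sketch of exponential recency weighting, where each observation is down-weighted by its age in days (the function name and example values are illustrative, not taken from `model_h.py`):

```python
import numpy as np

def recency_weighted_mean(values, days_ago, decay_rate=0.001):
    """Weight each observation by exp(-decay_rate * days_ago),
    so recent results count more than older ones."""
    w = np.exp(-decay_rate * np.asarray(days_ago, dtype=float))
    v = np.asarray(values, dtype=float)
    return float(np.sum(w * v) / np.sum(w))

# Example: three haemoglobin results, most recent first
hb = [138.0, 132.0, 120.0]
days = [10, 200, 900]
print(round(recency_weighted_mean(hb, days), 2))
```

With `decay_rate=0.001` the half-life is roughly 693 days, so even year-old results retain substantial weight; `decay_rate=0.0` recovers the plain mean.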
### Data Preprocessing
1. **Target encoding** – K-fold encoding with smoothing for categorical features (Age, Comorbidities, FEV1 severity, smoking status, etc.).
2. **Imputation** – median/mean/mode imputation strategies, applied per-fold.
3. **Scaling** – MinMaxScaler to [0, 1], fit on training fold only.
4. **PRO LOGIC filtering** – 14-day minimum between exacerbation episodes, 2 consecutive negative Q5 responses required for borderline events (14–35 days apart).
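Step 1 above can be sketched as out-of-fold target encoding with additive smoothing. This is an illustrative implementation, not the one in `training/encode_and_impute.py`; the function name, smoothing constant, and example column names are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(df, col, target, n_splits=5, smoothing=10.0, seed=42):
    """Out-of-fold target encoding with additive smoothing.
    Each row's encoding is computed from the *other* folds only,
    avoiding leakage of its own label into the feature."""
    global_mean = df[target].mean()
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    for fit_idx, enc_idx in KFold(n_splits, shuffle=True, random_state=seed).split(df):
        stats = df.iloc[fit_idx].groupby(col)[target].agg(["mean", "count"])
        # Shrink rare categories toward the global prevalence
        smooth = (stats["mean"] * stats["count"] + global_mean * smoothing) / (
            stats["count"] + smoothing
        )
        encoded.iloc[enc_idx] = df.iloc[enc_idx][col].map(smooth).to_numpy()
    return encoded.fillna(global_mean)  # unseen categories fall back to the prior

df = pd.DataFrame({
    "AgeBand": ["<50", "60-69", "70-79", "60-69", "80+", "<50", "70-79", "80+"] * 5,
    "ExacWithin3Months": [0, 1, 1, 0, 1, 0, 1, 1] * 5,
})
df["AgeBand_te"] = kfold_target_encode(df, "AgeBand", "ExacWithin3Months")
```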
---
## Training Procedure
### Training Framework
- pandas, scikit-learn, imbalanced-learn, xgboost, lightgbm, catboost
- Hyperparameter tuning: scikit-optimize (BayesSearchCV)
- Explainability: SHAP (TreeExplainer)
- Experiment tracking: MLflow
### Algorithms Evaluated
**First Phase (13 model types):**
| Algorithm | Library |
|-----------|---------|
| DummyClassifier (baseline) | sklearn |
| Logistic Regression | sklearn |
| Logistic Regression (balanced) | sklearn |
| Random Forest | sklearn |
| Random Forest (balanced) | sklearn |
| Balanced Random Forest | imblearn |
| Balanced Bagging | imblearn |
| XGBoost (7 variants) | xgboost |
| LightGBM (2 variants) | lightgbm |
| CatBoost | catboost |
**Hyperparameter Tuning Search Spaces:**
| Algorithm | Parameters Tuned |
|-----------|-----------------|
| Logistic Regression | penalty, class_weight, max_iter (50–300), C (0.001–10) |
| Random Forest | max_depth (4–10), n_estimators (70–850), min_samples_split (2–10), class_weight |
| XGBoost | max_depth (4–10), n_estimators (70–850) |
**Final Phase:** Top 3 models (Balanced Random Forest, XGBoost, Random Forest) retrained with tuned hyperparameters.
### Evaluation Design
- **5-fold** cross-validation with per-fold preprocessing.
- Metrics evaluated at threshold 0.5 and at best-F1 threshold.
- Event-type breakdown: hospital vs. community exacerbations evaluated separately.
- **Forward validation:** 9 months of prospective data (May 2023 – February 2024), assessed with KS test and Wasserstein distance for distribution shift.
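The distribution-shift checks use standard SciPy statistics. A sketch comparing a feature's training-window distribution against its forward-validation window (the distributions here are synthetic; only the two test statistics are taken from the design above):

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

rng = np.random.default_rng(0)
train_feat = rng.normal(55.0, 10.0, size=2000)    # feature in the training window
forward_feat = rng.normal(58.0, 12.0, size=800)   # same feature, forward-validation window

ks_stat, ks_p = ks_2samp(train_feat, forward_feat)  # max CDF gap + p-value
wd = wasserstein_distance(train_feat, forward_feat) # "earth mover's" distance
print(f"KS={ks_stat:.3f} (p={ks_p:.1e}), Wasserstein={wd:.2f}")
```

A small KS statistic with a non-significant p-value suggests the feature's distribution is stable; the Wasserstein distance adds a magnitude in the feature's own units.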
### Calibration
- **Sigmoid** (Platt scaling)
- **Isotonic** regression
- Applied via CalibratedClassifierCV with per-fold calibration.
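A minimal sketch of both calibration methods via `CalibratedClassifierCV`, compared with the Brier score (the base estimator and data are illustrative; the real pipeline calibrates its tuned models per fold):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.8], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for method in ("sigmoid", "isotonic"):
    calibrated = CalibratedClassifierCV(
        RandomForestClassifier(n_estimators=50, random_state=0),
        method=method,
        cv=5,  # internal 5-fold calibration, mirroring the design above
    )
    calibrated.fit(X_tr, y_tr)
    p = calibrated.predict_proba(X_te)[:, 1]
    print(method, round(brier_score_loss(y_te, p), 4))
```

Sigmoid (Platt) calibration fits a two-parameter logistic map and is robust on small folds; isotonic is non-parametric and more flexible but can overfit when calibration folds are small.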
---
## Evaluation Results
> Replace this section with measured results from your training run.
| Metric | Value | Notes |
|--------|-------|-------|
| ROC-AUC | TBD | Cross-validation mean (± std) |
| AUC-PR | TBD | Primary metric for imbalanced outcome |
| F1 Score (@ 0.5) | TBD | Default threshold |
| Best F1 Score | TBD | At optimal threshold |
| Balanced Accuracy | TBD | Cross-validation mean |
| Brier Score | TBD | Probability calibration quality |
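The "Best F1 Score" row corresponds to the threshold that maximises F1 along the precision-recall curve. A sketch of locating it (the scores here are toy values; in practice use out-of-fold predicted probabilities):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 0])
y_prob = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.65, 0.2, 0.8, 0.15, 0.5])

prec, rec, thresholds = precision_recall_curve(y_true, y_prob)
f1 = 2 * prec * rec / np.maximum(prec + rec, 1e-12)  # guard against 0/0
best = int(np.argmax(f1[:-1]))  # last (prec, rec) point has no threshold
print(f"best F1={f1[best]:.3f} at threshold={thresholds[best]:.2f}")
```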
### Caveats on Metrics
- Performance depends heavily on cohort definition, PRO engagement rates, and label construction.
- Forward validation results may differ from cross-validation due to temporal shifts in data availability and coding practices.
- Reported metrics from controlled datasets may not transfer to other settings without recalibration and validation.
---
## Bias, Risks, and Limitations
- **Dataset shift:** EHR coding practices, PRO engagement, and population characteristics vary across sites and time periods.
- **PRO engagement bias:** Patients who engage more with digital health tools may differ systematically from non-engagers.
- **Label uncertainty:** Exacerbation events are constructed via PRO LOGIC – different definitions produce different results.
- **Fairness:** Outcomes and feature availability may vary by age, sex, deprivation, comorbidity burden, or service access.
- **Misuse risk:** Using predictions to drive clinical action without clinical safety processes can cause harm through false positives and negatives.
---
## How to Use
### Pipeline Execution Order
```bash
# 1. Install dependencies
pip install pandas numpy scikit-learn imbalanced-learn xgboost lightgbm catboost scikit-optimize shap mlflow matplotlib seaborn pyyaml joblib scipy
# 2. Set up labels (choose one)
python training/setup_labels_hosp_comm.py # hospital + community exacerbations
python training/setup_labels_only_hosp.py # hospital only
python training/setup_labels_forward_val.py # forward validation set
# 3. Split data
python training/split_train_test_val.py
# 4. Feature engineering (run in sequence)
python training/process_demographics.py
python training/process_comorbidities.py
python training/process_exacerbation_history.py
python training/process_spirometry.py
python training/process_labs.py
python training/process_pros.py
# 5. Combine features
python training/combine_features.py
# 6. Encode and impute
python training/encode_and_impute.py
# 7. Screen algorithms
python training/cross_val_first_models.py
# 8. Hyperparameter tuning
python training/perform_hyper_param_tuning.py
# 9. Final cross-validation with best models
python training/cross_val_final_models.py
# 10. Forward validation (optional)
python training/perform_forward_validation.py
```
### Configuration
Edit `config.yaml` to adjust:
- `prediction_window` (default: 90 days)
- `lookback_period` (default: 180 days)
- `model_type` ('hosp_comm' or 'only_hosp')
- `num_folds` (default: 5)
- Input/output data paths
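An illustrative `config.yaml` covering the keys listed above (the nesting and the path key names are assumptions; check the repository's shipped config for the exact structure):

```yaml
prediction_window: 90    # days ahead for the exacerbation label
lookback_period: 180     # days of history used for features
model_type: hosp_comm    # or: only_hosp
num_folds: 5
paths:                   # hypothetical keys for input/output locations
  input_dir: data/input
  output_dir: data/output
```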
### Core Library
`model_h.py` provides 40+ reusable functions for:
- PRO LOGIC exacerbation validation
- Recency-weighted feature engineering
- Model evaluation (F1, PR-AUC, ROC-AUC, Brier, calibration curves)
- SHAP explainability (summary, local, interaction, decision plots)
- Calibration (sigmoid, isotonic, spline)
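The episode-separation part of PRO LOGIC can be sketched as follows. This shows only the 14-day minimum-gap rule from the preprocessing section; the consecutive-negative-Q5 check for borderline 14–35 day gaps is omitted, and the function name is illustrative:

```python
from datetime import date

def merge_episodes(event_dates, min_gap_days=14):
    """Collapse exacerbation events closer than min_gap_days into a
    single episode, keeping the first date of each episode."""
    episodes = []
    for d in sorted(event_dates):
        if not episodes or (d - episodes[-1]).days >= min_gap_days:
            episodes.append(d)
    return episodes

events = [date(2023, 1, 1), date(2023, 1, 10), date(2023, 2, 1)]
print(merge_episodes(events))  # the Jan 10 event folds into the Jan 1 episode
```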
---
## Environmental Impact
Training computational requirements are minimal – all models are traditional tabular ML classifiers running on CPU. A full pipeline run (feature engineering through cross-validation) completes in minutes on a standard laptop.
---
## Citation
If you use this model or code, please cite:
- This repository: *(add citation format / Zenodo DOI if minted)*
- Associated publications: *(clinical trial results paper β forthcoming)*
## Authors and Contributors
- **Storm ID** (maintainers)
## License
This model and code are released under the **Apache 2.0** license.
|