---
language: en
license: apache-2.0
tags:
  - healthcare
  - ehr
  - copd
  - clinical-risk
  - tabular
  - scikit-learn
  - xgboost
  - lightgbm
pipeline_tag: tabular-classification
library_name: sklearn
---

# COPD Open Models – Model C (72-Hour Exacerbation Prediction)

## Model Details

Model C predicts the risk of a COPD exacerbation within **72 hours** using features derived from NHS EHR datasets and patient-reported outcomes (PROs). It includes a reproducible training/evaluation pipeline and runs on standard Python ML libraries (pandas, scikit-learn, imbalanced-learn, plus optional gradient-boosting libraries).

### Key Characteristics

- **PRO LOGIC** – a clinically informed validation algorithm that deduplicates and filters patient-reported exacerbation events (14-day minimum between episodes, consecutive negative rescue-medication responses required for borderline events, 7-day rescue-med prescription spacing).
- Compares **10 algorithms** with per-fold preprocessing to prevent data leakage.
- Training code is fully decoupled from cloud infrastructure – runs locally with no Azure dependencies.

> **Note:** This repository contains no real patient-level data. All included data files are synthetic or example data for pipeline validation.
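As an illustration, the 14-day spacing rule alone can be sketched as a simple greedy filter. This is a hypothetical helper, not the published implementation; the full PRO LOGIC also applies the rescue-medication checks listed above.

```python
from datetime import date

def deduplicate_events(event_dates, min_gap_days=14):
    """Keep an event only if at least min_gap_days have elapsed since the
    last retained event for the same patient (illustrative sketch only)."""
    kept = []
    for d in sorted(event_dates):
        if not kept or (d - kept[-1]).days >= min_gap_days:
            kept.append(d)
    return kept

events = [date(2024, 1, 1), date(2024, 1, 5), date(2024, 1, 20)]
kept = deduplicate_events(events)  # Jan 5 falls inside the 14-day window
```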

### Model Type

Traditional tabular ML classifiers (multiple candidate estimators; see "Training Procedure").

### Release Notes

- **Phase 1 (current):** Models C, E, H published as the initial "COPD Open Models" collection.
- **Phase 2 (planned):** Additional models may follow after codebase sanitisation.

---

## Intended Use

This model and code are published as **reference implementations** for research, education, and benchmarking on COPD prediction tasks.

### Intended Users

- ML practitioners exploring tabular healthcare ML pipelines
- Researchers comparing feature engineering and evaluation approaches
- Developers building internal prototypes (non-clinical)

### Out-of-Scope Uses

- **Not** for clinical decision-making, triage, diagnosis, or treatment planning.
- **Not** a substitute for clinical judgement or validated clinical tools.
- Do **not** deploy in healthcare settings without an appropriate regulatory, clinical safety, and information governance framework.

### Regulatory Considerations (SaMD)

Regulatory status for software depends on the intended purpose expressed in documentation, labelling, and promotional materials. Downstream users integrating or deploying this model should determine whether their implementation qualifies as Software as a Medical Device (SaMD) and identify the legal "manufacturer" responsible for compliance and post-market obligations.

---

## Training Data

- **Source:** NHS EHR-derived datasets and Lenus COPD Service PRO data (training performed on controlled datasets; not distributed here).
- **Data available in this repo:** Synthetic/example datasets only.
- **Cohort:** ~302 COPD patients (84 RECEIVER + 218 Scale-Up). Daily predictions per patient.
- **Train/test split:** 85% / 15%, stratified by exacerbation status and sex.
- **Class balance:** Exacerbation days are minority class (~5–10% positive).

### Features (35 total)

| Category | Features |
|----------|----------|
| **Daily PROs** | CAT Q1–Q8, CAT Score, Symptom Diary Q1–Q3, plus 3-day rolling mean difference variants for each |
| **Weekly PROs** | Q5 (rescue meds), Q8 (phlegm difficulty), Q9 (phlegm consistency), Q10 (phlegm colour) – target-encoded |
| **Clinical** | Sex_F, RequiredAcuteNIV, RequiredICUAdmission, HighestEosinophilCount_0_3, TripleTherapy, AsthmaOverlap |
| **Categorical (target-encoded)** | SmokingStatus, Age (binned: <50 / 50-59 / 60-69 / 70-79 / 80+), FEV1PercentPredicted (Mild / Moderate / Severe / Very Severe), Comorbidities (None / 1-2 / 3+), DaysSinceLastExac (binned) |
| **Temporal** | ExacsPrevYear (rolling 365-day sum), AdmissionsPrevYear (rolling 365-day sum) |
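The rolling 365-day temporal features can be sketched with pandas time-based windows. Column names and data here are hypothetical; the current day's event is subtracted so the feature does not leak the label it helps predict.

```python
import pandas as pd

# Hypothetical single-patient toy data, one row per recorded day
df = pd.DataFrame({
    "patient_id": [1, 1, 1, 1],
    "date": pd.to_datetime(["2023-01-01", "2023-06-01",
                            "2024-01-15", "2024-07-01"]),
    "exac_event": [1, 1, 1, 0],
}).set_index("date")

# Rolling 365-day exacerbation count per patient, excluding the
# current day's own event
df["ExacsPrevYear"] = (
    df.groupby("patient_id")["exac_event"]
      .transform(lambda s: s.rolling("365D").sum() - s)
)
```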

### Data Preprocessing

1. **Target encoding** – applied per-fold using K-fold encoding on categorical features.
2. **MinMax scaling** – all features scaled to [0, 1], fit on the training fold only.
3. **Median imputation** – missing values imputed per-fold using training-fold medians.

---

## Training Procedure

### Training Framework

- pandas, scikit-learn, imbalanced-learn
- Optional: xgboost, lightgbm, interpret (for EBM)
- Experiment tracking: MLflow

### Algorithms Evaluated

| # | Algorithm | Library |
|---|-----------|---------|
| 1 | RandomForestClassifier | sklearn |
| 2 | RandomForestClassifier (class_weight='balanced') | sklearn |
| 3 | BalancedBaggingClassifier | imblearn |
| 4 | **BalancedRandomForestClassifier** | imblearn |
| 5 | XGBClassifier | xgboost |
| 6 | XGBClassifier (scale_pos_weight) | xgboost |
| 7 | LGBMClassifier | lightgbm |
| 8 | ExplainableBoostingClassifier | interpret |
| 9 | LogisticRegression | sklearn |
| 10 | LogisticRegression (class_weight='balanced') | sklearn |

### Evaluation Design

- **5-fold** stratified cross-validation, balanced by class and grouped by patient.
- Per-fold preprocessing (encoding, scaling, imputation) to prevent data leakage.
- Decision thresholds evaluated at: **0.3, 0.4, 0.5, 0.6, 0.7, 0.8**.
- Calibration tested: **sigmoid** and **isotonic** methods via CalibratedClassifierCV.

---

## Evaluation Results

> Replace this section with measured results from your training run.

| Metric | Value | Notes |
|--------|-------|-------|
| ROC-AUC | TBD | Cross-validation mean (Β± std) |
| AUC-PR | TBD | Primary metric for imbalanced outcome |
| F1 Score | TBD | At threshold 0.5 |
| Balanced Accuracy | TBD | Cross-validation mean |
| Precision | TBD | At chosen threshold |
| Recall | TBD | At chosen threshold |
| Brier Score | TBD | Probability calibration quality |

### Caveats on Metrics

- Performance depends heavily on cohort definition, feature availability, and label construction.
- Reported metrics from controlled datasets may not transfer to other settings without recalibration and validation.
- Exacerbation labels are constructed via PRO LOGIC – different event definitions will produce different results.

---

## Bias, Risks, and Limitations

- **Dataset shift:** EHR coding practices, care pathways, and population characteristics vary across sites and time periods.
- **Label uncertainty:** Exacerbations may be incompletely observed in routine data; PRO LOGIC filtering may not generalise to all clinical contexts.
- **Fairness:** Outcomes and feature availability may vary by age, sex, deprivation, comorbidity burden, or service access.
- **Misuse risk:** Using predictions to drive clinical action without clinical safety processes can cause harm through false positives and negatives.
- **Cohort size:** ~302 patients is relatively small; results should be interpreted with appropriate uncertainty.

---

## How to Use

### Pipeline Execution Order

```bash
# 1. Install dependencies
pip install pandas numpy scikit-learn imbalanced-learn xgboost lightgbm interpret mlflow matplotlib seaborn

# 2. Define exacerbations with PRO LOGIC
python training/define_exacerbations_prologic.py

# 3. Train/test split (85/15, stratified)
python training/train_test_split.py

# 4. Prepare training data (encode, scale, impute)
python training/prepare_train_data.py

# 5. Prepare cross-validation folds (per-fold preprocessing)
python training/prepare_train_data_crossval.py

# 6. Prepare test data (using training encodings)
python training/prepare_test_data.py

# 7. Compare algorithms via cross-validation
python training/cross_validation_algorithms.py

# 8. Train final model (BalancedRandomForestClassifier)
python training/cross_validation.py

# 9. Evaluate calibration methods
python training/cross_validation_calibration.py
```

### Adapting to Your Data

Replace the input data paths in `define_exacerbations_prologic.py` with your own EHR extract. The pipeline expects CSV files with columns for patient ID, dates, diagnoses, PRO responses, and pharmacy records.
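As a rough sketch of the expected shape, the pipeline works from tidy patient-day records. The column names below are hypothetical, chosen only to illustrate the schema; check `define_exacerbations_prologic.py` for the exact columns the scripts require.

```python
import pandas as pd

# Hypothetical column names for illustration only
ehr = pd.DataFrame({
    "patient_id": [101, 101, 102],
    "record_date": pd.to_datetime(["2024-01-01", "2024-01-02",
                                   "2024-01-01"]),
    "cat_score": [18, 21, 9],
    "rescue_med_used": [0, 1, 0],
})

# One row per patient-day, sorted so rolling/temporal features line up
ehr = ehr.sort_values(["patient_id", "record_date"]).reset_index(drop=True)
```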

---

## Environmental Impact

Training computational requirements are minimal – all models are traditional tabular ML classifiers running on CPU. A full cross-validation sweep across 10 algorithms completes in minutes on a standard laptop.

---

## Citation

If you use this model or code, please cite:

- This repository: *(add citation format / Zenodo DOI if minted)*
- Associated publications: *(clinical trial results paper β€” forthcoming)*

## Authors and Contributors

- **Storm ID** (maintainers)

## License

This model and code are released under the **Apache 2.0** license.