---
language: en
license: apache-2.0
tags:
  - healthcare
  - ehr
  - copd
  - clinical-risk
  - tabular
  - scikit-learn
  - xgboost
  - lightgbm
pipeline_tag: tabular-classification
library_name: sklearn
---

# COPD Open Models – Model C (72-Hour Exacerbation Prediction)

## Model Details

Model C predicts the risk of a COPD exacerbation within **72 hours** using features derived from NHS EHR datasets and patient-reported outcomes (PROs). It includes a reproducible training/evaluation pipeline and runs on standard Python ML libraries (pandas, scikit-learn, imbalanced-learn, plus optional gradient-boosting libraries).

### Key Characteristics

- **PRO LOGIC** – a clinically informed validation algorithm that deduplicates and filters patient-reported exacerbation events (14-day minimum between episodes, consecutive negative rescue-medication responses required for borderline events, 7-day rescue-med prescription spacing).
- Compares **10 algorithms** with per-fold preprocessing to prevent data leakage.
- Training code is fully decoupled from cloud infrastructure – runs locally with no Azure dependencies.

> **Note:** This repository contains no real patient-level data. All included data files are synthetic or example data for pipeline validation.
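As an illustration, the 14-day spacing rule alone can be sketched as a simple greedy filter. This is a hypothetical helper, not the published implementation; the full PRO LOGIC also applies the rescue-medication checks listed above.

```python
from datetime import date

def deduplicate_events(event_dates, min_gap_days=14):
    """Keep an event only if at least min_gap_days have elapsed since the
    last retained event for the same patient (illustrative sketch only)."""
    kept = []
    for d in sorted(event_dates):
        if not kept or (d - kept[-1]).days >= min_gap_days:
            kept.append(d)
    return kept

events = [date(2024, 1, 1), date(2024, 1, 5), date(2024, 1, 20)]
kept = deduplicate_events(events)  # Jan 5 falls inside the 14-day window
```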

### Model Type

Traditional tabular ML classifiers (multiple candidate estimators; see "Training Procedure").

### Release Notes

- **Phase 1 (current):** Models C, E, H published as the initial "COPD Open Models" collection.
- **Phase 2 (planned):** Additional models may follow after codebase sanitisation.

---

## Intended Use

This model and code are published as **reference implementations** for research, education, and benchmarking on COPD prediction tasks.

### Intended Users

- ML practitioners exploring tabular healthcare ML pipelines
- Researchers comparing feature engineering and evaluation approaches
- Developers building internal prototypes (non-clinical)

### Out-of-Scope Uses

- **Not** for clinical decision-making, triage, diagnosis, or treatment planning.
- **Not** a substitute for clinical judgement or validated clinical tools.
- Do **not** deploy in healthcare settings without an appropriate regulatory, clinical safety, and information governance framework.

### Regulatory Considerations (SaMD)

Regulatory status for software depends on the intended purpose expressed in documentation, labelling, and promotional materials. Downstream users integrating or deploying this model should determine whether their implementation qualifies as Software as a Medical Device (SaMD) and identify the legal "manufacturer" responsible for compliance and post-market obligations.

---

## Training Data

- **Source:** NHS EHR-derived datasets and Lenus COPD Service PRO data (training performed on controlled datasets; not distributed here).
- **Data available in this repo:** Synthetic/example datasets only.
- **Cohort:** ~302 COPD patients (84 RECEIVER + 218 Scale-Up). Daily predictions per patient.
- **Train/test split:** 85% / 15%, stratified by exacerbation status and sex.
- **Class balance:** Exacerbation days are minority class (~5–10% positive).

### Features (35 total)

| Category | Features |
|----------|----------|
| **Daily PROs** | CAT Q1–Q8, CAT Score, Symptom Diary Q1–Q3, plus 3-day rolling mean difference variants for each |
| **Weekly PROs** | Q5 (rescue meds), Q8 (phlegm difficulty), Q9 (phlegm consistency), Q10 (phlegm colour) – target-encoded |
| **Clinical** | Sex_F, RequiredAcuteNIV, RequiredICUAdmission, HighestEosinophilCount_0_3, TripleTherapy, AsthmaOverlap |
| **Categorical (target-encoded)** | SmokingStatus, Age (binned: <50 / 50-59 / 60-69 / 70-79 / 80+), FEV1PercentPredicted (Mild / Moderate / Severe / Very Severe), Comorbidities (None / 1-2 / 3+), DaysSinceLastExac (binned) |
| **Temporal** | ExacsPrevYear (rolling 365-day sum), AdmissionsPrevYear (rolling 365-day sum) |
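The rolling 365-day temporal features can be sketched with pandas time-based windows. Column names and data here are hypothetical; the current day's event is subtracted so the feature does not leak the label it helps predict.

```python
import pandas as pd

# Hypothetical single-patient toy data, one row per recorded day
df = pd.DataFrame({
    "patient_id": [1, 1, 1, 1],
    "date": pd.to_datetime(["2023-01-01", "2023-06-01",
                            "2024-01-15", "2024-07-01"]),
    "exac_event": [1, 1, 1, 0],
}).set_index("date")

# Rolling 365-day exacerbation count per patient, excluding the
# current day's own event
df["ExacsPrevYear"] = (
    df.groupby("patient_id")["exac_event"]
      .transform(lambda s: s.rolling("365D").sum() - s)
)
```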

### Data Preprocessing

1. **Target encoding** – applied per-fold using K-fold encoding on categorical features.
2. **MinMax scaling** – all features scaled to [0, 1], fit on the training fold only.
3. **Median imputation** – missing values imputed per-fold using training-fold medians.

---

## Training Procedure

### Training Framework

- pandas, scikit-learn, imbalanced-learn
- Optional: xgboost, lightgbm, interpret (for EBM)
- Experiment tracking: MLflow

### Algorithms Evaluated

| # | Algorithm | Library |
|---|-----------|---------|
| 1 | RandomForestClassifier | sklearn |
| 2 | RandomForestClassifier (class_weight='balanced') | sklearn |
| 3 | BalancedBaggingClassifier | imblearn |
| 4 | **BalancedRandomForestClassifier** | imblearn |
| 5 | XGBClassifier | xgboost |
| 6 | XGBClassifier (scale_pos_weight) | xgboost |
| 7 | LGBMClassifier | lightgbm |
| 8 | ExplainableBoostingClassifier | interpret |
| 9 | LogisticRegression | sklearn |
| 10 | LogisticRegression (class_weight='balanced') | sklearn |

### Evaluation Design

- **5-fold** stratified cross-validation, balanced by class and grouped by patient.
- Per-fold preprocessing (encoding, scaling, imputation) to prevent data leakage.
- Decision thresholds evaluated at: **0.3, 0.4, 0.5, 0.6, 0.7, 0.8**.
- Calibration tested: **sigmoid** and **isotonic** methods via CalibratedClassifierCV.

---

## Evaluation Results

> Replace this section with measured results from your training run.

| Metric | Value | Notes |
|--------|-------|-------|
| ROC-AUC | TBD | Cross-validation mean (Β± std) |
| AUC-PR | TBD | Primary metric for imbalanced outcome |
| F1 Score | TBD | At threshold 0.5 |
| Balanced Accuracy | TBD | Cross-validation mean |
| Precision | TBD | At chosen threshold |
| Recall | TBD | At chosen threshold |
| Brier Score | TBD | Probability calibration quality |

### Caveats on Metrics

- Performance depends heavily on cohort definition, feature availability, and label construction.
- Reported metrics from controlled datasets may not transfer to other settings without recalibration and validation.
- Exacerbation labels are constructed via PRO LOGIC – different event definitions will produce different results.

---

## Bias, Risks, and Limitations

- **Dataset shift:** EHR coding practices, care pathways, and population characteristics vary across sites and time periods.
- **Label uncertainty:** Exacerbations may be incompletely observed in routine data; PRO LOGIC filtering may not generalise to all clinical contexts.
- **Fairness:** Outcomes and feature availability may vary by age, sex, deprivation, comorbidity burden, or service access.
- **Misuse risk:** Using predictions to drive clinical action without clinical safety processes can cause harm through false positives and negatives.
- **Cohort size:** ~302 patients is relatively small; results should be interpreted with appropriate uncertainty.

---

## How to Use

### Pipeline Execution Order

```bash
# 1. Install dependencies
pip install pandas numpy scikit-learn imbalanced-learn xgboost lightgbm interpret mlflow matplotlib seaborn

# 2. Define exacerbations with PRO LOGIC
python training/define_exacerbations_prologic.py

# 3. Train/test split (85/15, stratified)
python training/train_test_split.py

# 4. Prepare training data (encode, scale, impute)
python training/prepare_train_data.py

# 5. Prepare cross-validation folds (per-fold preprocessing)
python training/prepare_train_data_crossval.py

# 6. Prepare test data (using training encodings)
python training/prepare_test_data.py

# 7. Compare algorithms via cross-validation
python training/cross_validation_algorithms.py

# 8. Train final model (BalancedRandomForestClassifier)
python training/cross_validation.py

# 9. Evaluate calibration methods
python training/cross_validation_calibration.py
```

### Adapting to Your Data

Replace the input data paths in `define_exacerbations_prologic.py` with your own EHR extract. The pipeline expects CSV files with columns for patient ID, dates, diagnoses, PRO responses, and pharmacy records.
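As a rough sketch of the expected shape, the pipeline works from tidy patient-day records. The column names below are hypothetical, chosen only to illustrate the schema; check `define_exacerbations_prologic.py` for the exact columns the scripts require.

```python
import pandas as pd

# Hypothetical column names for illustration only
ehr = pd.DataFrame({
    "patient_id": [101, 101, 102],
    "record_date": pd.to_datetime(["2024-01-01", "2024-01-02",
                                   "2024-01-01"]),
    "cat_score": [18, 21, 9],
    "rescue_med_used": [0, 1, 0],
})

# One row per patient-day, sorted so rolling/temporal features line up
ehr = ehr.sort_values(["patient_id", "record_date"]).reset_index(drop=True)
```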

---

## Environmental Impact

Training computational requirements are minimal – all models are traditional tabular ML classifiers running on CPU. A full cross-validation sweep across 10 algorithms completes in minutes on a standard laptop.

---

## Citation

If you use this model or code, please cite:

- This repository: *(add citation format / Zenodo DOI if minted)*
- Associated publications: *(clinical trial results paper β€” forthcoming)*

## Authors and Contributors

- **Storm ID** (maintainers)

## License

This model and code are released under the **Apache 2.0** license.