---
language: en
license: apache-2.0
tags:
  - healthcare
  - ehr
  - copd
  - clinical-risk
  - tabular
  - scikit-learn
  - xgboost
  - lightgbm
  - catboost
  - patient-reported-outcomes
pipeline_tag: tabular-classification
library_name: sklearn
---

# COPD Open Models — Model H (90-Day Exacerbation Prediction)

## Model Details

Model H predicts the risk of a COPD exacerbation within **90 days** using features derived from NHS EHR datasets and patient-reported outcomes (PROs). It serves as both a production-grade prediction pipeline and a **template project** for building new COPD prediction models, featuring a reusable 2,000+ line core library (`model_h.py`) with end-to-end training, calibration, evaluation, and SHAP explainability.

### Key Characteristics

- **Comprehensive PRO integration** — the most detailed PRO feature engineering in the portfolio, covering four instruments: EQ-5D (monthly), MRC dyspnoea (weekly), CAT (daily), and Symptom Diary (daily), each with engagement metrics, score differences, and multi-window aggregations.
- **13 algorithms screened** in the first phase, then narrowed to the top 3, which are retrained with Bayesian-tuned hyperparameters.
- **Forward validation** on 9 months of prospective data (May 2023 – February 2024).
- **Reusable core library** (`model_h.py`) — 40+ functions for label setup, feature engineering, model evaluation, calibration, and SHAP explainability.
- Training code is fully decoupled from cloud infrastructure — runs locally with no Azure dependencies.

> **Note:** This repository contains no real patient-level data. All included data files are synthetic or example data for pipeline validation.

### Model Type

Traditional tabular ML classifiers (multiple candidate estimators; see "Training Procedure").

### Release Notes

- **Phase 1 (current):** Models C, E, H published as the initial "COPD Open Models" collection.
- **Phase 2 (planned):** Additional models may follow after codebase sanitisation.

---

## Intended Use

This model and code are published as **reference implementations** for research, education, and benchmarking on COPD prediction tasks.

### Intended Users

- ML practitioners exploring tabular healthcare ML pipelines
- Researchers comparing feature engineering and evaluation approaches
- Developers building internal prototypes (non-clinical)

### Out-of-Scope Uses

- **Not** for clinical decision-making, triage, diagnosis, or treatment planning.
- **Not** a substitute for clinical judgement or validated clinical tools.
- Do **not** deploy in healthcare settings without an appropriate regulatory, clinical safety, and information governance framework.

### Regulatory Considerations (SaMD)

Regulatory status for software depends on the intended purpose expressed in documentation, labelling, and promotional materials. Downstream users integrating or deploying this model should determine whether their implementation qualifies as Software as a Medical Device (SaMD) and identify the legal "manufacturer" responsible for compliance and post-market obligations.

---

## Training Data

- **Source:** NHS EHR-derived datasets and Lenus COPD Service PRO data (training performed on controlled datasets; not distributed here).
- **Data available in this repo:** Synthetic/example datasets only.
- **Cohort:** COPD patients with RECEIVER and Scale-Up cohort membership.
- **Target:** Binary — `ExacWithin3Months` (hospital + community exacerbations) or `HospExacWithin3Months` (hospital only).
- **Configuration:** 90-day prediction window, 180-day lookback, 5-fold cross-validation.
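
The 90-day prediction window and 180-day lookback can be illustrated with a minimal sketch. It is not taken from `model_h.py`: the function name, the made-up dates, and the choice of which window boundaries are inclusive are all assumptions made for illustration.

```python
from datetime import date, timedelta

PREDICTION_WINDOW_DAYS = 90   # from the configuration above
LOOKBACK_DAYS = 180           # from the configuration above


def label_and_lookback(index_date, exacerbation_dates):
    """Return (label, n_prior) for one patient at one index date.

    label   : 1 if any exacerbation falls in (index_date, index_date + 90 days]
    n_prior : number of exacerbations in [index_date - 180 days, index_date]
    Boundary inclusivity is an assumption, not the repository's definition.
    """
    window_end = index_date + timedelta(days=PREDICTION_WINDOW_DAYS)
    lookback_start = index_date - timedelta(days=LOOKBACK_DAYS)
    label = int(any(index_date < d <= window_end for d in exacerbation_dates))
    n_prior = sum(lookback_start <= d <= index_date for d in exacerbation_dates)
    return label, n_prior


# Made-up event dates for a single hypothetical patient.
events = [date(2023, 1, 10), date(2023, 4, 20)]
label, n_prior = label_and_lookback(date(2023, 3, 1), events)
print(label, n_prior)
```

Here the April event falls inside the prediction window and the January event inside the lookback, so the row is a positive example with one prior exacerbation.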

### Features

| Category | Features |
|----------|----------|
| **Demographics** | Age (binned: <50/50-59/60-69/70-79/80+), Sex_F |
| **Comorbidities** | AsthmaOverlap (binary), Comorbidities count (binned: None/1-2/3+) |
| **Exacerbation History** | Hospital and community exacerbation counts in lookback, days since last exacerbation, recency-weighted counts |
| **Spirometry** | FEV1, FVC, FEV1/FVC ratio — max, min, and latest values |
| **Laboratory (20+ tests)** | MaxLifetime, MinLifetime, Max1Year, Min1Year, Latest values with recency weighting (decay_rate=0.001) — WBC, RBC, haemoglobin, haematocrit, platelets, sodium, potassium, creatinine, albumin, glucose, ALT, AST, GGT, bilirubin, ALP, cholesterol, triglycerides, TSH, and more |
| **EQ-5D (monthly)** | Q1–Q5, total score, latest values, engagement rates, score changes |
| **MRC Dyspnoea (weekly)** | MRC Score (1–5), latest value, engagement, variations |
| **CAT (daily)** | Q1–Q8, total score (0–40), latest values, engagement, score differences |
| **Symptom Diary (daily)** | Q5 rescue medication (binary), weekly aggregates, engagement rates |
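
The table quotes `decay_rate=0.001` but does not state the functional form of the recency weighting. The sketch below assumes an exponential decay in days, which is one common choice; the function names are hypothetical and the library's actual formula may differ.

```python
import math

DECAY_RATE = 0.001  # value quoted in the feature table


def recency_weight(days_ago, decay_rate=DECAY_RATE):
    """Exponential decay weight (assumed form): 1.0 today, shrinking with age."""
    return math.exp(-decay_rate * days_ago)


def recency_weighted_mean(values, days_ago, decay_rate=DECAY_RATE):
    """Weighted mean of lab values, giving recent results more influence."""
    weights = [recency_weight(d, decay_rate) for d in days_ago]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Two made-up haemoglobin results: one from today, one from 1000 days ago.
print(recency_weight(365))
print(recency_weighted_mean([10.0, 20.0], [0, 1000]))
```

Under this assumed form, a result from a year ago still carries roughly 70% of the weight of a result from today, so the decay is gentle.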

### Data Preprocessing

1. **Target encoding** — K-fold encoding with smoothing for categorical features (Age, Comorbidities, FEV1 severity, smoking status, etc.).
2. **Imputation** — median/mean/mode imputation strategies, applied per-fold.
3. **Scaling** — MinMaxScaler to [0, 1], fit on training fold only.
4. **PRO LOGIC filtering** — 14-day minimum between exacerbation episodes, 2 consecutive negative Q5 responses required for borderline events (14–35 days apart).
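
Step 1 can be sketched as out-of-fold target encoding with additive smoothing. This is an illustration under stated assumptions rather than the repository's implementation: the function name, the `smoothing` value, and the synthetic `AgeBand` column are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def kfold_target_encode(df, col, target, n_splits=5, smoothing=10.0, seed=0):
    """Out-of-fold target encoding with additive smoothing.

    Each row is encoded from statistics of the other folds only, so a row's
    own label never leaks into its encoded value.
    """
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        fit = df.iloc[fit_idx]
        prior = fit[target].mean()
        stats = fit.groupby(col)[target].agg(["sum", "count"])
        # Shrink each category mean towards the prior; rare categories shrink most.
        smoothed = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
        # Categories unseen in the fitting folds fall back to the prior.
        values = df.iloc[enc_idx][col].map(smoothed).fillna(prior)
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded


# Synthetic data only; no relation to the real cohort.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "AgeBand": rng.choice(["<50", "50-59", "60-69", "70-79", "80+"], size=200),
    "ExacWithin3Months": rng.integers(0, 2, size=200),
})
demo["AgeBand_te"] = kfold_target_encode(demo, "AgeBand", "ExacWithin3Months")
print(demo.head())
```

When this runs inside cross-validation, the encoder would be fit on each training fold only, in line with the per-fold imputation and scaling described above.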

---

## Training Procedure

### Training Framework

- pandas, scikit-learn, imbalanced-learn, xgboost, lightgbm, catboost
- Hyperparameter tuning: scikit-optimize (BayesSearchCV)
- Explainability: SHAP (TreeExplainer)
- Experiment tracking: MLflow

### Algorithms Evaluated

**First Phase (13 model types):**

| Algorithm | Library |
|-----------|---------|
| DummyClassifier (baseline) | sklearn |
| Logistic Regression | sklearn |
| Logistic Regression (balanced) | sklearn |
| Random Forest | sklearn |
| Random Forest (balanced) | sklearn |
| Balanced Random Forest | imblearn |
| Balanced Bagging | imblearn |
| XGBoost (7 variants) | xgboost |
| LightGBM (2 variants) | lightgbm |
| CatBoost | catboost |
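
A screening loop of this kind can be sketched with the scikit-learn candidates alone. The example is a simplified illustration: it uses synthetic data, covers only five of the listed estimators, and assumes stratified folds and median imputation, none of which is confirmed by the repository.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in for the real feature matrix.
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.85, 0.15], random_state=0)

candidates = {
    "dummy": DummyClassifier(strategy="prior"),
    "logreg": LogisticRegression(max_iter=300),
    "logreg_balanced": LogisticRegression(max_iter=300, class_weight="balanced"),
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
    "rf_balanced": RandomForestClassifier(n_estimators=100,
                                          class_weight="balanced", random_state=0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, estimator in candidates.items():
    # Wrapping preprocessing in the pipeline refits it on each training fold.
    pipeline = make_pipeline(SimpleImputer(strategy="median"), MinMaxScaler(), estimator)
    scores = cross_validate(pipeline, X, y, cv=cv,
                            scoring=["roc_auc", "average_precision"])
    results[name] = {
        "roc_auc": scores["test_roc_auc"].mean(),
        "auc_pr": scores["test_average_precision"].mean(),
    }

# Rank by AUC-PR, the metric the card treats as primary for this outcome.
ranked = sorted(results, key=lambda n: results[n]["auc_pr"], reverse=True)
print(ranked)
```

The baseline `DummyClassifier` gives a floor against which the other candidates can be compared before any tuning effort is spent.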

**Hyperparameter Tuning Search Spaces:**

| Algorithm | Parameters Tuned |
|-----------|-----------------|
| Logistic Regression | penalty, class_weight, max_iter (50–300), C (0.001–10) |
| Random Forest | max_depth (4–10), n_estimators (70–850), min_samples_split (2–10), class_weight |
| XGBoost | max_depth (4–10), n_estimators (70–850) |

**Final Phase:** Top 3 models (Balanced Random Forest, XGBoost, Random Forest) retrained with tuned hyperparameters.

### Evaluation Design

- **5-fold** cross-validation with per-fold preprocessing.
- Metrics are evaluated at the default 0.5 threshold and at the threshold that maximises F1.
- Event-type breakdown: hospital vs. community exacerbations evaluated separately.
- **Forward validation:** 9 months of prospective data (May 2023 – February 2024), with distribution shift assessed using the Kolmogorov–Smirnov (KS) test and Wasserstein distance.
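
The distribution-shift check can be sketched per feature with SciPy's `ks_2samp` and `wasserstein_distance`. The helper name `drift_report`, the 0.05 significance level, and the synthetic FEV1 values are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance


def drift_report(train, forward, alpha=0.05):
    """Compare one feature's training and forward-validation distributions."""
    ks = ks_2samp(train, forward)
    return {
        "ks_statistic": float(ks.statistic),
        "ks_pvalue": float(ks.pvalue),
        "wasserstein": float(wasserstein_distance(train, forward)),
        "shift_flagged": bool(ks.pvalue < alpha),
    }


rng = np.random.default_rng(0)
train_fev1 = rng.normal(1.6, 0.5, size=1000)    # synthetic training values
same_fev1 = rng.normal(1.6, 0.5, size=1000)     # drawn from the same distribution
shifted_fev1 = rng.normal(1.9, 0.5, size=1000)  # mean shifted upwards

print(drift_report(train_fev1, same_fev1))
print(drift_report(train_fev1, shifted_fev1))
```

The KS p-value answers whether a shift is detectable, while the Wasserstein distance is expressed in the feature's own units and so indicates how large the shift is.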

### Calibration

- **Sigmoid** (Platt scaling)
- **Isotonic** regression
- Applied via `CalibratedClassifierCV` with per-fold calibration.
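
Both calibration methods can be compared with scikit-learn's `CalibratedClassifierCV`. The sketch below uses synthetic data and a Random Forest as the base estimator; the hold-out split and hyperparameters are illustrative assumptions, not the repository's settings.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real feature matrix.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

brier = {}
for method in ("sigmoid", "isotonic"):
    base = RandomForestClassifier(n_estimators=100, random_state=0)
    # With cv=5, each base model is fit on four folds and its calibrator on
    # the fifth; the calibrated models are averaged at prediction time.
    calibrated = CalibratedClassifierCV(base, method=method, cv=5)
    calibrated.fit(X_train, y_train)
    proba = calibrated.predict_proba(X_test)[:, 1]
    brier[method] = brier_score_loss(y_test, proba)

print(brier)
```

Isotonic regression is more flexible than sigmoid scaling but can overfit when the calibration folds contain few positive cases, so comparing the Brier score of both on held-out data is a reasonable default.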

---

## Evaluation Results

> Replace this section with measured results from your training run.

| Metric | Value | Notes |
|--------|-------|-------|
| ROC-AUC | TBD | Cross-validation mean (± std) |
| AUC-PR | TBD | Primary metric for imbalanced outcome |
| F1 Score (@ 0.5) | TBD | Default threshold |
| Best F1 Score | TBD | At optimal threshold |
| Balanced Accuracy | TBD | Cross-validation mean |
| Brier Score | TBD | Probability calibration quality |
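
The metrics in the table can be computed for a single fold along the following lines. The function name `summarise` and the toy labels and probabilities are hypothetical; the repository's own evaluation helpers may be organised differently.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    balanced_accuracy_score,
    brier_score_loss,
    f1_score,
    precision_recall_curve,
    roc_auc_score,
)


def summarise(y_true, y_prob):
    """Compute the metrics listed in the results table for one fold."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    y_pred = (y_prob >= 0.5).astype(int)

    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    # The final precision/recall pair has no threshold, so drop it.
    denom = precision[:-1] + recall[:-1]
    f1 = np.divide(2 * precision[:-1] * recall[:-1], denom,
                   out=np.zeros_like(denom), where=denom > 0)
    best = int(np.argmax(f1))

    return {
        "roc_auc": roc_auc_score(y_true, y_prob),
        "auc_pr": average_precision_score(y_true, y_prob),
        "f1_at_0.5": f1_score(y_true, y_pred, zero_division=0),
        "best_f1": float(f1[best]),
        "best_f1_threshold": float(thresholds[best]),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        "brier": brier_score_loss(y_true, y_prob),
    }


# Toy example, not real results.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.9]
metrics = summarise(y_true, y_prob)
print(metrics)
```

Averaging these per-fold dictionaries across the 5 folds gives the mean and standard deviation that the table's notes column asks for.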

### Caveats on Metrics

- Performance depends heavily on cohort definition, PRO engagement rates, and label construction.
- Forward validation results may differ from cross-validation due to temporal shifts in data availability and coding practices.
- Reported metrics from controlled datasets may not transfer to other settings without recalibration and validation.

---

## Bias, Risks, and Limitations

- **Dataset shift:** EHR coding practices, PRO engagement, and population characteristics vary across sites and time periods.
- **PRO engagement bias:** Patients who engage more with digital health tools may differ systematically from non-engagers.
- **Label uncertainty:** Exacerbation events are constructed via PRO LOGIC — different definitions produce different results.
- **Fairness:** Outcomes and feature availability may vary by age, sex, deprivation, comorbidity burden, or service access.
- **Misuse risk:** Using predictions to drive clinical action without clinical safety processes can cause harm through false positives and negatives.

---

## How to Use

### Pipeline Execution Order

```bash
# 1. Install dependencies
pip install pandas numpy scikit-learn imbalanced-learn xgboost lightgbm catboost scikit-optimize shap mlflow matplotlib seaborn pyyaml joblib scipy

# 2. Set up labels (choose one)
python training/setup_labels_hosp_comm.py        # hospital + community exacerbations
python training/setup_labels_only_hosp.py        # hospital only
python training/setup_labels_forward_val.py      # forward validation set

# 3. Split data
python training/split_train_test_val.py

# 4. Feature engineering (run in sequence)
python training/process_demographics.py
python training/process_comorbidities.py
python training/process_exacerbation_history.py
python training/process_spirometry.py
python training/process_labs.py
python training/process_pros.py

# 5. Combine features
python training/combine_features.py

# 6. Encode and impute
python training/encode_and_impute.py

# 7. Screen algorithms
python training/cross_val_first_models.py

# 8. Hyperparameter tuning
python training/perform_hyper_param_tuning.py

# 9. Final cross-validation with best models
python training/cross_val_final_models.py

# 10. Forward validation (optional)
python training/perform_forward_validation.py
```
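
For repeated runs, the same order can be driven from a small Python script. This runner is a hypothetical convenience, not part of the repository: the script paths are copied from the commands above, while `build_steps`, `run_pipeline`, and the `dry_run` flag are names introduced here.

```python
import subprocess
import sys

# Paths copied from the command list above; the dictionary keys reuse the
# documented model_type values.
LABEL_SCRIPTS = {
    "hosp_comm": "training/setup_labels_hosp_comm.py",
    "only_hosp": "training/setup_labels_only_hosp.py",
}
FEATURE_SCRIPTS = [
    "training/process_demographics.py",
    "training/process_comorbidities.py",
    "training/process_exacerbation_history.py",
    "training/process_spirometry.py",
    "training/process_labs.py",
    "training/process_pros.py",
]


def build_steps(model_type="hosp_comm", forward_validation=False):
    """Return the scripts for steps 2 to 10 in execution order."""
    steps = [LABEL_SCRIPTS[model_type], "training/split_train_test_val.py"]
    steps += FEATURE_SCRIPTS
    steps += [
        "training/combine_features.py",
        "training/encode_and_impute.py",
        "training/cross_val_first_models.py",
        "training/perform_hyper_param_tuning.py",
        "training/cross_val_final_models.py",
    ]
    if forward_validation:
        steps.append("training/perform_forward_validation.py")
    return steps


def run_pipeline(steps, dry_run=True):
    """Run each script in turn, stopping at the first failure."""
    for script in steps:
        print("running", script)
        if not dry_run:
            subprocess.run([sys.executable, script], check=True)


run_pipeline(build_steps("hosp_comm"))  # dry run: prints the order only
```

Passing `check=True` makes a failing step raise immediately, which avoids later steps running on stale intermediate files.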

### Configuration

Edit `config.yaml` to adjust:
- `prediction_window` (default: 90 days)
- `lookback_period` (default: 180 days)
- `model_type` (`hosp_comm` or `only_hosp`)
- `num_folds` (default: 5)
- Input/output data paths
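
A loaded configuration can be sanity-checked before a long run. The sketch below works on a plain dictionary, such as the result of `yaml.safe_load` on the config file; the flat key layout, the `validate_config` helper, and the specific checks are assumptions, since the file's actual structure is not shown here.

```python
ALLOWED_MODEL_TYPES = {"hosp_comm", "only_hosp"}

# Documented defaults; model_type has no stated default, so it is required.
DEFAULTS = {
    "prediction_window": 90,   # days
    "lookback_period": 180,    # days
    "num_folds": 5,
}


def validate_config(raw):
    """Merge documented defaults into `raw` and check the documented keys."""
    config = {**DEFAULTS, **raw}
    if config.get("model_type") not in ALLOWED_MODEL_TYPES:
        raise ValueError(f"model_type must be one of {sorted(ALLOWED_MODEL_TYPES)}")
    for key in ("prediction_window", "lookback_period"):
        if not isinstance(config[key], int) or config[key] <= 0:
            raise ValueError(f"{key} must be a positive number of days")
    if not isinstance(config["num_folds"], int) or config["num_folds"] < 2:
        raise ValueError("num_folds must be an integer of at least 2")
    return config


config = validate_config({"model_type": "only_hosp", "num_folds": 10})
print(config)
```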

### Core Library

`model_h.py` provides 40+ reusable functions for:
- PRO LOGIC exacerbation validation
- Recency-weighted feature engineering
- Model evaluation (F1, PR-AUC, ROC-AUC, Brier, calibration curves)
- SHAP explainability (summary, local, interaction, decision plots)
- Calibration (sigmoid, isotonic, spline)

---

## Environmental Impact

Training computational requirements are minimal — all models are traditional tabular ML classifiers running on CPU. A full pipeline run (feature engineering through cross-validation) completes in minutes on a standard laptop.

---

## Citation

If you use this model or code, please cite:

- This repository: *(add citation format / Zenodo DOI if minted)*
- Associated publications: *(clinical trial results paper — forthcoming)*

## Authors and Contributors

- **Storm ID** (maintainers)

## License

This model and code are released under the **Apache 2.0** license.