COPD Open Models – Model E (Unsupervised Risk Stratification)
Model Details
Model E groups COPD patients into risk clusters using PCA dimensionality reduction followed by KMeans or hierarchical clustering. Clusters are designed to support risk stratification – identifying whether patients are receiving the appropriate level of care for their apparent risk. Clusters update with new data to track how patient risk evolves over time.
Key Characteristics
- Unsupervised – no target labels required; clusters are derived from patient feature similarity alone.
- Two-stage PCA – Stage 1 selects features explaining ≥90% variance; Stage 2 reduces to 3 components explaining ≥80% variance.
- Modular pipeline – processing, reduction, and modelling stages are fully separated and independently reusable.
- Training code is fully decoupled from cloud infrastructure – runs locally with no Azure dependencies.
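The two-stage reduction above can be sketched as follows. This is an illustrative approximation, not the repository's code: scikit-learn's `PCA` accepts a float `n_components` to keep the smallest number of components reaching a cumulative-variance threshold, which stands in here for the card's Stage 1 feature selection; the data is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 70))  # stand-in for the ~70 processed features

# Stage 1: retain components explaining >=90% cumulative variance
# (the repo selects *features* at this threshold; components approximate that).
pca1 = PCA(n_components=0.90)
X_stage1 = pca1.fit_transform(X_train)

# Stage 2: reduce to 3 components for clustering.
pca2 = PCA(n_components=3)
X_stage2 = pca2.fit_transform(X_stage1)
```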
Note: This repository contains no real patient-level data. All included data files are synthetic or example data for pipeline validation.
Model Type
Unsupervised clustering (PCA + KMeans / Agglomerative Hierarchical Clustering), validated via Decision Tree Classifier on cluster labels.
Release Notes
- Phase 1 (current): Models C, E, H published as the initial "COPD Open Models" collection.
- Phase 2 (planned): Additional models may follow after codebase sanitisation.
Intended Use
This model and code are published as reference implementations for research, education, and benchmarking on COPD risk stratification tasks.
Intended Users
- ML practitioners exploring unsupervised healthcare ML pipelines
- Researchers comparing dimensionality reduction and clustering approaches for EHR data
- Developers building internal prototypes (non-clinical)
Out-of-Scope Uses
- Not for clinical decision-making, triage, diagnosis, or treatment planning.
- Not a substitute for clinical judgement or validated clinical tools.
- Do not deploy in healthcare settings without an appropriate regulatory, clinical safety, and information governance framework.
Regulatory Considerations (SaMD)
Regulatory status for software depends on the intended purpose expressed in documentation, labelling, and promotional materials. Downstream users integrating or deploying this model should determine whether their implementation qualifies as Software as a Medical Device (SaMD) and identify the legal "manufacturer" responsible for compliance and post-market obligations.
Training Data
- Source: NHS EHR-derived datasets (training performed on controlled datasets; not distributed here).
- Data available in this repo: Synthetic/example datasets only.
- Cohort: COPD patients from hospital admissions, laboratory, pharmacy, and demographic records.
- Data split: 60% train / 15% validation / 20% test (random_state=42). RECEIVER and Scale-Up cohorts held out as external validation sets.
Features (~70 after processing)
| Category | Features |
|---|---|
| Admissions | adm_per_year, total_hosp_days, mean_los, copd_per_year, resp_per_year, copd_resp_per_year, days_since_copd_resp, days_since_adm, days_since_rescue, anxiety_depression_per_year |
| Demographics | age, sex_bin, ethnicity (7 categories), marital_status (one-hot), SIMD quintile/decile/vigintile |
| Laboratory (2-year medians) | 26 lab tests: albumin, ALT, AST, alkaline phosphatase, basophils, CRP, chloride, creatinine, eosinophils, eGFR, haematocrit, haemoglobin, lymphocytes, MCH, MCV, monocytes, neutrophils, platelets, potassium, RBC, sodium, total bilirubin, urea, WBC, neutrophil-to-lymphocyte ratio; plus labs_per_year |
| Prescribing | single/double/triple_inhaler_per_year, salbutamol_per_year, rescue_meds_per_year, anxiety_depression_presc_per_year, presc_per_year |
| Comorbidities | 30 binary flags (cardiovascular, respiratory, metabolic, oncology conditions), comorb_per_year |
Data Preprocessing
- Imputation – grouped median imputation by (age_bin × sex_bin). Age binned into 10 quantiles. Days-since features for patients with <5 years in cohort filled with dataset maximum.
- Scaling β MinMaxScaler to [0, 1], fit on training data only; test/validation/external sets transformed using saved scaler.
Training Procedure
Training Framework
- pandas, numpy, scikit-learn, matplotlib, mlflow
Algorithms
| Component | Algorithm | Parameters |
|---|---|---|
| PCA Stage 1 | sklearn.decomposition.PCA | Selects features at ≥90% cumulative explained variance |
| PCA Stage 2 | sklearn.decomposition.PCA | Reduces to 3 components at ≥80% cumulative explained variance |
| Clustering (primary) | sklearn.cluster.AgglomerativeClustering | n_clusters=3, linkage='ward' |
| Clustering (alternative) | sklearn.cluster.KMeans | n_clusters=3, random_state=10 |
| Validation | sklearn.tree.DecisionTreeClassifier | random_state=42; trained on cluster labels |
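The primary clustering plus the decision-tree validation step can be sketched with the parameters from the table above; the data here is synthetic, standing in for the 3 PCA components:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))  # stand-in for the 3 PCA components

# Primary clustering: Ward-linkage agglomerative clustering, 3 clusters.
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# Validate reproducibility: train a decision tree on the cluster labels
# and check it can recover them on held-out points.
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=42)
dtc = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
accuracy = dtc.score(X_te, y_te)
```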
Cluster Count Selection
Davies-Bouldin Index and Silhouette Score are calculated for k=2 through k=9. Both metrics are logged to MLflow for inspection before selecting the final cluster count.
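The sweep over candidate cluster counts can be sketched as below (synthetic data; MLflow logging omitted – in the pipeline each metric would be logged per k):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))

# Compute both quality metrics for k = 2..9 before choosing the final k.
scores = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, random_state=10, n_init=10).fit_predict(X)
    scores[k] = {
        "davies_bouldin": davies_bouldin_score(X, labels),  # lower is better
        "silhouette": silhouette_score(X, labels),          # higher is better
    }
```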
Evaluation Design
- Clustering quality: Davies-Bouldin Index (lower is better), Silhouette Score (higher is better).
- Cluster validation: Decision Tree accuracy on held-out data (can the clustering be reliably reproduced?).
- Clinical validation: cluster-level outcome rates (admissions, mortality, time-to-event), demographic breakdowns, medication therapy distributions.
Evaluation Results
Replace this section with measured results from your training run.
| Metric | Value | Notes |
|---|---|---|
| Davies-Bouldin Index | TBD | Lower is better |
| Silhouette Score | TBD | Range [-1, 1], higher is better |
| DTC Accuracy | TBD | Decision Tree on validation set |
| Cluster sizes | TBD | Patient counts per cluster |
Caveats on Metrics
- Cluster quality metrics assess geometric separation, not clinical meaningfulness – clinical validation requires outcome analysis.
- Results depend on the feature set, imputation strategy, and patient population.
Bias, Risks, and Limitations
- Dataset shift: EHR coding practices, care pathways, and population characteristics vary across sites and time periods.
- Feature availability: Lab test availability varies by patient; imputation strategy affects cluster assignment.
- Fairness: Cluster composition may correlate with age, sex, or deprivation – interpret with care.
- Misuse risk: Cluster labels are not validated risk scores. Using them to drive clinical action without clinical safety processes can cause harm.
- Interpretability: PCA components are linear combinations of features – clinical interpretation requires examining loadings.
How to Use
Pipeline Execution Order
```shell
# 1. Install dependencies
pip install pandas numpy scikit-learn matplotlib mlflow joblib tableone

# 2. Process raw data (run each independently)
python training/src/processing/process_demographics.py
python training/src/processing/process_admissions.py
python training/src/processing/process_comorbidities.py
python training/src/processing/process_labs.py
python training/src/processing/process_prescribing.py

# 3. Combine and reduce features (run in order)
python training/src/reduction/combine.py
python training/src/reduction/post_proc_reduction.py
python training/src/reduction/remove_ids.py
python training/src/reduction/clean_and_scale_train.py
python training/src/reduction/clean_and_scale_test.py

# 4. Run clustering model
python training/src/modelling/run_model.py

# 5. Predict clusters for new patients
python training/src/modelling/predict_clusters.py
```
Adapting to Your Data
Replace the input file paths in `config.json` with your own EHR data extracts. The pipeline expects CSVs containing a patient ID plus admission records, lab results, pharmacy records, and demographics.
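The actual keys in `config.json` are repo-specific; as a purely hypothetical illustration of the kind of layout expected (all key names and paths below are assumptions, not the repository's schema):

```json
{
  "demographics_path": "data/demographics.csv",
  "admissions_path": "data/admissions.csv",
  "labs_path": "data/labs.csv",
  "prescribing_path": "data/prescribing.csv",
  "comorbidities_path": "data/comorbidities.csv",
  "patient_id_column": "patient_id"
}
```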
Environmental Impact
Training computational requirements are minimal – PCA and clustering on tabular data complete in seconds on a standard laptop.
Citation
If you use this model or code, please cite:
- This repository: (add citation format / Zenodo DOI if minted)
- Associated publications: (forthcoming)
Authors and Contributors
- Storm ID (maintainers)
License
This model and code are released under the Apache 2.0 license.