COPD Open Models – Model E (Unsupervised Risk Stratification)
Model Details
Model E groups COPD patients into risk clusters using PCA dimensionality reduction followed by KMeans or hierarchical clustering. Clusters are designed to support risk stratification – identifying whether patients are receiving the appropriate level of care for their apparent risk. Clusters update with new data to track how patient risk evolves over time.
Key Characteristics
- Unsupervised – no target labels required; clusters are derived from patient feature similarity alone.
- Two-stage PCA – Stage 1 selects features explaining ≥90% variance; Stage 2 reduces to 3 components explaining ≥80% variance.
- Modular pipeline – processing, reduction, and modelling stages are fully separated and independently reusable.
- Training code is fully decoupled from cloud infrastructure – runs locally with no Azure dependencies.
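The two-stage reduction above can be sketched as follows. This is an illustrative approximation, not the repository's code: scikit-learn's `PCA` accepts a float `n_components` to keep the smallest number of components reaching a cumulative-variance threshold, which stands in here for the card's Stage 1 feature selection; the data is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 70))  # stand-in for the ~70 processed features

# Stage 1: retain components explaining >=90% cumulative variance
# (the repo selects *features* at this threshold; components approximate that).
pca1 = PCA(n_components=0.90)
X_stage1 = pca1.fit_transform(X_train)

# Stage 2: reduce to 3 components for clustering.
pca2 = PCA(n_components=3)
X_stage2 = pca2.fit_transform(X_stage1)
```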
Note: This repository contains no real patient-level data. All included data files are synthetic or example data for pipeline validation.
Model Type
Unsupervised clustering (PCA + KMeans / Agglomerative Hierarchical Clustering), validated via Decision Tree Classifier on cluster labels.
Release Notes
- Phase 1 (current): Models C, E, H published as the initial "COPD Open Models" collection.
- Phase 2 (planned): Additional models may follow after codebase sanitisation.
Intended Use
This model and code are published as reference implementations for research, education, and benchmarking on COPD risk stratification tasks.
Intended Users
- ML practitioners exploring unsupervised healthcare ML pipelines
- Researchers comparing dimensionality reduction and clustering approaches for EHR data
- Developers building internal prototypes (non-clinical)
Out-of-Scope Uses
- Not for clinical decision-making, triage, diagnosis, or treatment planning.
- Not a substitute for clinical judgement or validated clinical tools.
- Do not deploy in healthcare settings without an appropriate regulatory, clinical safety, and information governance framework.
Regulatory Considerations (SaMD)
Regulatory status for software depends on the intended purpose expressed in documentation, labelling, and promotional materials. Downstream users integrating or deploying this model should determine whether their implementation qualifies as Software as a Medical Device (SaMD) and identify the legal "manufacturer" responsible for compliance and post-market obligations.
Training Data
- Source: NHS EHR-derived datasets (training performed on controlled datasets; not distributed here).
- Data available in this repo: Synthetic/example datasets only.
- Cohort: COPD patients from hospital admissions, laboratory, pharmacy, and demographic records.
- Data split: 60% train / 15% validation / 20% test (random_state=42). RECEIVER and Scale-Up cohorts held out as external validation sets.
Features (~70 after processing)
| Category | Features |
|---|---|
| Admissions | adm_per_year, total_hosp_days, mean_los, copd_per_year, resp_per_year, copd_resp_per_year, days_since_copd_resp, days_since_adm, days_since_rescue, anxiety_depression_per_year |
| Demographics | age, sex_bin, ethnicity (7 categories), marital_status (one-hot), SIMD quintile/decile/vigintile |
| Laboratory (2-year medians) | 26 lab tests: albumin, ALT, AST, alkaline phosphatase, basophils, CRP, chloride, creatinine, eosinophils, eGFR, haematocrit, haemoglobin, lymphocytes, MCH, MCV, monocytes, neutrophils, platelets, potassium, RBC, sodium, total bilirubin, urea, WBC, neutrophil-to-lymphocyte ratio; plus labs_per_year |
| Prescribing | single/double/triple_inhaler_per_year, salbutamol_per_year, rescue_meds_per_year, anxiety_depression_presc_per_year, presc_per_year |
| Comorbidities | 30 binary flags (cardiovascular, respiratory, metabolic, oncology conditions), comorb_per_year |
Data Preprocessing
- Imputation – grouped median imputation by (age_bin × sex_bin). Age binned into 10 quantiles. Days-since features for patients with <5 years in cohort filled with dataset maximum.
- Scaling β MinMaxScaler to [0, 1], fit on training data only; test/validation/external sets transformed using saved scaler.
Training Procedure
Training Framework
- pandas, numpy, scikit-learn, matplotlib, mlflow
Algorithms
| Component | Algorithm | Parameters |
|---|---|---|
| PCA Stage 1 | sklearn.decomposition.PCA | Selects features at ≥90% cumulative explained variance |
| PCA Stage 2 | sklearn.decomposition.PCA | Reduces to 3 components at ≥80% cumulative explained variance |
| Clustering (primary) | sklearn.cluster.AgglomerativeClustering | n_clusters=3, linkage='ward' |
| Clustering (alternative) | sklearn.cluster.KMeans | n_clusters=3, random_state=10 |
| Validation | sklearn.tree.DecisionTreeClassifier | random_state=42; trained on cluster labels |
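The primary clustering plus the decision-tree validation step can be sketched with the parameters from the table above; the data here is synthetic, standing in for the 3 PCA components:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))  # stand-in for the 3 PCA components

# Primary clustering: Ward-linkage agglomerative clustering, 3 clusters.
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# Validate reproducibility: train a decision tree on the cluster labels
# and check it can recover them on held-out points.
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=42)
dtc = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
accuracy = dtc.score(X_te, y_te)
```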
Cluster Count Selection
Davies-Bouldin Index and Silhouette Score are calculated for k=2 through k=9. Both metrics are logged to MLflow for inspection before selecting the final cluster count.
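The sweep over candidate cluster counts can be sketched as below (synthetic data; MLflow logging omitted – in the pipeline each metric would be logged per k):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))

# Compute both quality metrics for k = 2..9 before choosing the final k.
scores = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, random_state=10, n_init=10).fit_predict(X)
    scores[k] = {
        "davies_bouldin": davies_bouldin_score(X, labels),  # lower is better
        "silhouette": silhouette_score(X, labels),          # higher is better
    }
```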
Evaluation Design
- Clustering quality: Davies-Bouldin Index (lower is better), Silhouette Score (higher is better).
- Cluster validation: Decision Tree accuracy on held-out data (can the clustering be reliably reproduced?).
- Clinical validation: cluster-level outcome rates (admissions, mortality, time-to-event), demographic breakdowns, medication therapy distributions.
Evaluation Results
Replace this section with measured results from your training run.
| Metric | Value | Notes |
|---|---|---|
| Davies-Bouldin Index | TBD | Lower is better |
| Silhouette Score | TBD | Range [-1, 1], higher is better |
| DTC Accuracy | TBD | Decision Tree on validation set |
| Cluster sizes | TBD | Patient counts per cluster |
Caveats on Metrics
- Cluster quality metrics assess geometric separation, not clinical meaningfulness – clinical validation requires outcome analysis.
- Results depend on the feature set, imputation strategy, and patient population.
Bias, Risks, and Limitations
- Dataset shift: EHR coding practices, care pathways, and population characteristics vary across sites and time periods.
- Feature availability: Lab test availability varies by patient; imputation strategy affects cluster assignment.
- Fairness: Cluster composition may correlate with age, sex, or deprivation – interpret with care.
- Misuse risk: Cluster labels are not validated risk scores. Using them to drive clinical action without clinical safety processes can cause harm.
- Interpretability: PCA components are linear combinations of features – clinical interpretation requires examining loadings.
How to Use
Pipeline Execution Order
```shell
# 1. Install dependencies
pip install pandas numpy scikit-learn matplotlib mlflow joblib tableone

# 2. Process raw data (run each independently)
python training/src/processing/process_demographics.py
python training/src/processing/process_admissions.py
python training/src/processing/process_comorbidities.py
python training/src/processing/process_labs.py
python training/src/processing/process_prescribing.py

# 3. Combine and reduce features (run in order)
python training/src/reduction/combine.py
python training/src/reduction/post_proc_reduction.py
python training/src/reduction/remove_ids.py
python training/src/reduction/clean_and_scale_train.py
python training/src/reduction/clean_and_scale_test.py

# 4. Run clustering model
python training/src/modelling/run_model.py

# 5. Predict clusters for new patients
python training/src/modelling/predict_clusters.py
```
Adapting to Your Data
Replace the input file paths in `config.json` with your own EHR data extracts. The pipeline expects CSVs containing a patient ID plus admission records, lab results, pharmacy records, and demographics.
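The actual keys in `config.json` are repo-specific; as a purely hypothetical illustration of the kind of layout expected (all key names and paths below are assumptions, not the repository's schema):

```json
{
  "demographics_path": "data/demographics.csv",
  "admissions_path": "data/admissions.csv",
  "labs_path": "data/labs.csv",
  "prescribing_path": "data/prescribing.csv",
  "comorbidities_path": "data/comorbidities.csv",
  "patient_id_column": "patient_id"
}
```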
Environmental Impact
Training computational requirements are minimal – PCA and clustering on tabular data complete in seconds on a standard laptop.
Citation
If you use this model or code, please cite:
- This repository: (add citation format / Zenodo DOI if minted)
- Associated publications: (forthcoming)
Authors and Contributors
- Storm ID (maintainers)
License
This model and code are released under the Apache 2.0 license.