stormid/copd-model-e

@@ -1,87 +1,199 @@
-# Model E
-Model E is an unsupervised learning model, built with the aim of grouping patients within the COPD cohort into _k_ clusters as a means of risk stratification. These clusters are updated with new incoming data, with the cluster for each patient monitored in order to track how their risk changes over time. Results will be used to determine if patients are receiving the correct level of care for their apparent risk.
-The model uses EHR data (information on patients admission history, labs tests data, prescribing data, comorbidities and demographic data) which is first be pre-processed, then reduced using Principal Components Analysis (PCA). The 3D components produced in PCA are then be passed to a variety of clustering algorithms, with results plotted for inspection.
-## Structure
 ```
-C:.
-|   pipeline.yml
-|   README.md
-|   requirements.txt
-|   setup.cfg
-|
-+---documentation
-|       README.md
-|
-+---training
-|   |   README.md
-|   |
-|   +---src
-|   |   |   README.md
-|   |   |
-|   |   +---data
-|   |   |
-|   |   +---modelling
-|   |   |   |   Cluster Data.ipynb
-|   |   |   |
-|   |   +---processing
-|   |   |   |   process_admissions.py
-|   |   |   |   process_comorbidities.py
-|   |   |   |   process_demographics.py
-|   |   |   |   process_gples.py
-|   |   |   |   process_labs.py
-|   |   |   |   process_prescribing.py
-|   |   |   |   README.md
-|   |   |   |   __init__.py
-|   |   |   |
-|   |   |   +---mappings
-|   |   |   |   |   Comorbidity feature review for models & clin summary update v2 May 2021.xlsx
-|   |   |   |   |   diag_copd_resp_desc.json
-|   |   |   |   |   inhaler_mapping.json
-|   |   |   |   |   README.md
-|   |   |   |   |   test_mapping.json
-|   |   |   |   |
-|   |   |   |
-|   |   |   \---utils
-|   |   |           adm_common.py
-|   |   |           adm_processing.py
-|   |   |           adm_reduction.py
-|   |   |           common.py
-|   |   |           comorb_processing.py
-|   |   |           labs_processing.py
-|   |   |           README.md
-|   |   |           __init__.py
-|   |   |
-|   |   \---reduction
-|   |       |   clean_and_scale_test.py
-|   |       |   clean_and_scale_train.py
-|   |       |   combine.py
-|   |       |   README.md
-|   |       |   remove_ids.py
-|   |       |   __init__.py
-|   |       |
-|   |       \---utils
-|   |               reduction.py
-|   |               __init__.py
-|   |
-|   \---tests
-|           README.md
-|
-\---validation
-    +---parameter_calculation
-    |       CAT_MRC_score_metrics_calculation.py
-    |       Fitbit_groups_calculation.py
-    |       GOLD_grade_GOLD_group_calculation.py
-    |       NIV_parameters_calculation.py
-    |       PRO_LOGIC_exacerbation_calculations.py
-    |       README.md
-    |       Time_to_death_calculation.py
-    |       Time_to_first_admission_calculations.py
-    |       Time_to_first_event_calculations.py
-    |
-    \---risk_score_calculation
-            combined_risk_score_RC_SU1.py
-```

+---
+language: en
+license: apache-2.0
+tags:
+  - healthcare
+  - ehr
+  - copd
+  - clinical-risk
+  - tabular
+  - scikit-learn
+  - clustering
+  - unsupervised
+pipeline_tag: tabular-classification
+library_name: sklearn
+---
+# COPD Open Models — Model E (Unsupervised Risk Stratification)
+## Model Details
+Model E groups COPD patients into risk clusters using **PCA dimensionality reduction** followed by **KMeans or hierarchical clustering**. Clusters are designed to support risk stratification — identifying whether patients are receiving the appropriate level of care for their apparent risk. Clusters update with new data to track how patient risk evolves over time.
+### Key Characteristics
+- **Unsupervised** — no target labels required; clusters are derived from patient feature similarity alone.
+- **Two-stage PCA** — Stage 1 selects features explaining ≥90% variance; Stage 2 reduces to 3 components explaining ≥80% variance.
+- **Modular pipeline** — processing, reduction, and modelling stages are fully separated and independently reusable.
+- Training code is fully decoupled from cloud infrastructure — runs locally with no Azure dependencies.
+> **Note:** This repository contains no real patient-level data. All included data files are synthetic or example data for pipeline validation.
+### Model Type
+Unsupervised clustering (PCA + KMeans / Agglomerative Hierarchical Clustering), validated via Decision Tree Classifier on cluster labels.
+### Release Notes
+- **Phase 1 (current):** Models C, E, H published as the initial "COPD Open Models" collection.
+- **Phase 2 (planned):** Additional models may follow after codebase sanitisation.
+---
+## Intended Use
+This model and code are published as **reference implementations** for research, education, and benchmarking on COPD risk stratification tasks.
+### Intended Users
+- ML practitioners exploring unsupervised healthcare ML pipelines
+- Researchers comparing dimensionality reduction and clustering approaches for EHR data
+- Developers building internal prototypes (non-clinical)
+### Out-of-Scope Uses
+- **Not** for clinical decision-making, triage, diagnosis, or treatment planning.
+- **Not** a substitute for clinical judgement or validated clinical tools.
+- Do **not** deploy in healthcare settings without an appropriate regulatory, clinical safety, and information governance framework.
+### Regulatory Considerations (SaMD)
+Regulatory status for software depends on the intended purpose expressed in documentation, labelling, and promotional materials. Downstream users integrating or deploying this model should determine whether their implementation qualifies as Software as a Medical Device (SaMD) and identify the legal "manufacturer" responsible for compliance and post-market obligations.
+---
+## Training Data
+- **Source:** NHS EHR-derived datasets (training performed on controlled datasets; not distributed here).
+- **Data available in this repo:** Synthetic/example datasets only.
+- **Cohort:** COPD patients from hospital admissions, laboratory, pharmacy, and demographic records.
+- **Data split:** 60% train / 15% validation / 20% test (random_state=42). RECEIVER and Scale-Up cohorts held out as external validation sets.
+### Features (~70 after processing)
+| Category | Features |
+|----------|----------|
+| **Admissions** | adm_per_year, total_hosp_days, mean_los, copd_per_year, resp_per_year, copd_resp_per_year, days_since_copd_resp, days_since_adm, days_since_rescue, anxiety_depression_per_year |
+| **Demographics** | age, sex_bin, ethnicity (7 categories), marital_status (one-hot), SIMD quintile/decile/vigintile |
+| **Laboratory (2-year medians)** | 26 lab tests: albumin, ALT, AST, alkaline phosphatase, basophils, CRP, chloride, creatinine, eosinophils, eGFR, haematocrit, haemoglobin, lymphocytes, MCH, MCV, monocytes, neutrophils, platelets, potassium, RBC, sodium, total bilirubin, urea, WBC, neutrophil-to-lymphocyte ratio; plus labs_per_year |
+| **Prescribing** | single/double/triple_inhaler_per_year, salbutamol_per_year, rescue_meds_per_year, anxiety_depression_presc_per_year, presc_per_year |
+| **Comorbidities** | 30 binary flags (cardiovascular, respiratory, metabolic, oncology conditions), comorb_per_year |
+### Data Preprocessing
+1. **Imputation** — grouped median imputation by (age_bin × sex_bin). Age binned into 10 quantiles. Days-since features for patients with <5 years in cohort filled with dataset maximum.
+2. **Scaling** — MinMaxScaler to [0, 1], fit on training data only; test/validation/external sets transformed using saved scaler.
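
The two preprocessing steps above can be sketched as follows. This is a minimal illustration on synthetic data, not the repository's implementation: the `make_frame` helper, the single `albumin` column, and the fallback to the overall training median for unseen groups are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

def make_frame(n):
    # Synthetic stand-in for processed features (hypothetical helper).
    df = pd.DataFrame({
        "age": rng.integers(40, 90, n).astype(float),
        "sex_bin": rng.integers(0, 2, n),
        "albumin": rng.normal(40, 5, n),
    })
    df.loc[rng.random(n) < 0.2, "albumin"] = np.nan  # inject missing lab values
    return df

train, test = make_frame(500), make_frame(100)

# Bin age into 10 quantiles on the training set and reuse the edges elsewhere.
_, edges = pd.qcut(train["age"], q=10, retbins=True, duplicates="drop")
edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range ages in new data
for df in (train, test):
    df["age_bin"] = pd.cut(df["age"], bins=edges, labels=False)

# Median per (age_bin, sex_bin) group, computed on training data only.
group_cols = ["age_bin", "sex_bin"]
medians = (
    train.groupby(group_cols)["albumin"].median()
    .rename("albumin_median").reset_index()
)
overall = train["albumin"].median()  # assumed fallback for unseen or empty groups

def impute(df):
    merged = df.merge(medians, on=group_cols, how="left")
    filled = merged["albumin"].fillna(merged["albumin_median"]).fillna(overall)
    return df.assign(albumin=filled.to_numpy())

train, test = impute(train), impute(test)

# Fit the scaler on training data only, then reuse it for every other split.
features = ["age", "sex_bin", "albumin"]
scaler = MinMaxScaler().fit(train[features])
train_scaled = scaler.transform(train[features])
test_scaled = scaler.transform(test[features])
```

Because the scaler is fit on training data only, values in other splits can fall slightly outside [0, 1]; that is expected and avoids leaking test-set statistics into training.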
+---
+## Training Procedure
+### Training Framework
+- pandas, numpy, scikit-learn, matplotlib, mlflow
+### Algorithms
+| Component | Algorithm | Parameters |
+|-----------|-----------|------------|
+| **PCA Stage 1** | sklearn.decomposition.PCA | Selects features at ≥90% cumulative explained variance |
+| **PCA Stage 2** | sklearn.decomposition.PCA | Reduces to 3 components at ≥80% cumulative explained variance |
+| **Clustering (primary)** | sklearn.cluster.AgglomerativeClustering | n_clusters=3, linkage='ward' |
+| **Clustering (alternative)** | sklearn.cluster.KMeans | n_clusters=3, random_state=10 |
+| **Validation** | sklearn.tree.DecisionTreeClassifier | random_state=42; trained on cluster labels |
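
A rough sketch of how the components in the table fit together, using the stated parameters on synthetic data. Two points are assumptions: Stage 1 is read here as keeping the leading components that reach 90% cumulative explained variance (the repository may instead select original features), and the synthetic matrix `X` merely stands in for the scaled feature table.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 300 patients, 20 correlated features scaled to [0, 1].
latent = rng.normal(size=(300, 6))
X = latent @ rng.normal(size=(6, 20)) + rng.normal(scale=0.1, size=(300, 20))
X = MinMaxScaler().fit_transform(X)

# Stage 1 (assumed reading): keep enough components for >= 90% variance.
stage1 = PCA(n_components=0.90, svd_solver="full").fit(X)
X1 = stage1.transform(X)

# Stage 2: reduce to at most 3 components and report the variance they retain.
stage2 = PCA(n_components=min(3, X1.shape[1])).fit(X1)
X3 = stage2.transform(X1)
retained = stage2.explained_variance_ratio_.sum()
print(f"stage 1 kept {stage1.n_components_} components; "
      f"stage 2 retains {retained:.0%} of stage-1 variance")

# Primary and alternative clustering with the parameters from the table.
labels_hc = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X3)
labels_km = KMeans(n_clusters=3, random_state=10, n_init=10).fit_predict(X3)
```

Whether three components actually reach the 80% threshold depends on the data, so the sketch prints the retained variance rather than assuming it.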
+### Cluster Count Selection
+Davies-Bouldin Index and Silhouette Score are calculated for **k=2 through k=9**. Both metrics are logged to MLflow for inspection before selecting the final cluster count.
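
The sweep described above might look like the following. It is a sketch on synthetic 3D data that stands in for the PCA output, and it collects the scores in a dictionary rather than logging them; in the pipeline each value would be sent to MLflow instead (for example with `mlflow.log_metric`).

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic stand-in for the 3 PCA components.
X3, _ = make_blobs(n_samples=300, centers=3, n_features=3, random_state=0)

scores = {}
for k in range(2, 10):  # k = 2 through k = 9
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X3)
    scores[k] = {
        "davies_bouldin": davies_bouldin_score(X3, labels),  # lower is better
        "silhouette": silhouette_score(X3, labels),          # higher is better
    }

for k, s in scores.items():
    print(f"k={k}: DBI={s['davies_bouldin']:.3f}, silhouette={s['silhouette']:.3f}")
```

The two metrics do not always agree on the best k, which is one reason to inspect both before fixing the cluster count.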
+### Evaluation Design
+- Clustering quality: **Davies-Bouldin Index** (lower is better), **Silhouette Score** (higher is better).
+- Cluster validation: **Decision Tree accuracy** on held-out data (can the clustering be reliably reproduced?).
+- Clinical validation: cluster-level outcome rates (admissions, mortality, time-to-event), demographic breakdowns, medication therapy distributions.
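
One way to run the decision-tree check is sketched below. It assumes the cluster labels are assigned first and then split into training and held-out portions; the repository's own split and inputs may differ, and the synthetic data is illustrative only.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the reduced feature space.
X, _ = make_blobs(n_samples=400, centers=3, n_features=3, random_state=0)

# Cluster labels act as the "target" for the validation classifier.
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

X_tr, X_val, y_tr, y_val = train_test_split(
    X, labels, test_size=0.25, random_state=42, stratify=labels
)

# If a simple tree can recover the labels on held-out rows,
# the cluster boundaries are reproducible from the features.
tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
acc = accuracy_score(y_val, tree.predict(X_val))
print(f"held-out accuracy on cluster labels: {acc:.3f}")
```

High accuracy here says only that the clusters are reproducible from the features, not that they are clinically meaningful.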
+---
+## Evaluation Results
+> Replace this section with measured results from your training run.
+| Metric | Value | Notes |
+|--------|-------|-------|
+| Davies-Bouldin Index | TBD | Lower is better |
+| Silhouette Score | TBD | Range [-1, 1], higher is better |
+| DTC Accuracy | TBD | Decision Tree on validation set |
+| Cluster sizes | TBD | Patient counts per cluster |
+### Caveats on Metrics
+- Cluster quality metrics assess geometric separation, not clinical meaningfulness — clinical validation requires outcome analysis.
+- Results depend on the feature set, imputation strategy, and patient population.
+---
+## Bias, Risks, and Limitations
+- **Dataset shift:** EHR coding practices, care pathways, and population characteristics vary across sites and time periods.
+- **Feature availability:** Lab test availability varies by patient; imputation strategy affects cluster assignment.
+- **Fairness:** Cluster composition may correlate with age, sex, or deprivation — interpret with care.
+- **Misuse risk:** Cluster labels are not validated risk scores. Using them to drive clinical action without clinical safety processes can cause harm.
+- **Interpretability:** PCA components are linear combinations of features — clinical interpretation requires examining loadings.
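
To illustrate the interpretability point, the component weights (often called loadings) can be read from a fitted PCA object. This sketch uses random data and four feature names borrowed from the features table; the values it produces carry no clinical meaning.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
feature_names = ["adm_per_year", "mean_los", "age", "presc_per_year"]
X = rng.random((200, len(feature_names)))  # random stand-in for scaled features

pca = PCA(n_components=3).fit(X)

# Rows are features, columns are components; each column has unit length.
loadings = pd.DataFrame(
    pca.components_.T, index=feature_names, columns=["PC1", "PC2", "PC3"]
)

# Features ranked by absolute weight on the first component.
top_pc1 = loadings["PC1"].abs().sort_values(ascending=False)
print(loadings.round(3))
print(top_pc1.index.tolist())
```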
+---
+## How to Use
+### Pipeline Execution Order
+```bash
+# 1. Install dependencies
+pip install pandas numpy scikit-learn matplotlib mlflow joblib tableone
+# 2. Process raw data (run each independently)
+python training/src/processing/process_demographics.py
+python training/src/processing/process_admissions.py
+python training/src/processing/process_comorbidities.py
+python training/src/processing/process_labs.py
+python training/src/processing/process_prescribing.py
+# 3. Combine and reduce features (run in order)
+python training/src/reduction/combine.py
+python training/src/reduction/post_proc_reduction.py
+python training/src/reduction/remove_ids.py
+python training/src/reduction/clean_and_scale_train.py
+python training/src/reduction/clean_and_scale_test.py
+# 4. Run clustering model
+python training/src/modelling/run_model.py
+# 5. Predict clusters for new patients
+python training/src/modelling/predict_clusters.py
 ```
+### Adapting to Your Data
+Replace input file paths in `config.json` with your own EHR data extracts. The pipeline expects CSVs with patient ID, admission records, lab results, pharmacy records, and demographics.
+---
+## Environmental Impact
+Training compute requirements are minimal: PCA and clustering on tabular data complete in seconds on a standard laptop.
+---
+## Citation
+If you use this model or code, please cite:
+- This repository: *(add citation format / Zenodo DOI if minted)*
+- Associated publications: *(forthcoming)*
+## Authors and Contributors
+- **Storm ID** (maintainers)
+## License
+This model and code are released under the **Apache 2.0** license.