IamGrooooot committed on
Commit
29a540e
·
1 Parent(s): 53a6def

Model E: Unsupervised PCA + clustering risk stratification

  1. README.md +194 -82
README.md CHANGED
@@ -1,87 +1,199 @@
1
- # Model E
2
 
3
- Model E is an unsupervised learning model, built with the aim of grouping patients within the COPD cohort into _k_ clusters as a means of risk stratification. These clusters are updated with new incoming data, with the cluster for each patient monitored in order to track how their risk changes over time. Results will be used to determine if patients are receiving the correct level of care for their apparent risk.
4
 
5
- The model uses EHR data (patients' admission history, lab test data, prescribing data, comorbidities, and demographic data), which is first pre-processed and then reduced using Principal Component Analysis (PCA). The three principal components produced by PCA are then passed to a variety of clustering algorithms, with results plotted for inspection.
6
 
7
- ## Structure
8
9
  ```
10
- C:.
11
- | pipeline.yml
12
- | README.md
13
- | requirements.txt
14
- | setup.cfg
15
- |
16
- +---documentation
17
- | README.md
18
- |
19
- +---training
20
- | | README.md
21
- | |
22
- | +---src
23
- | | | README.md
24
- | | |
25
- | | +---data
26
- | | |
27
- | | +---modelling
28
- | | | | Cluster Data.ipynb
29
- | | | |
30
- | | +---processing
31
- | | | | process_admissions.py
32
- | | | | process_comorbidities.py
33
- | | | | process_demographics.py
34
- | | | | process_gples.py
35
- | | | | process_labs.py
36
- | | | | process_prescribing.py
37
- | | | | README.md
38
- | | | | __init__.py
39
- | | | |
40
- | | | +---mappings
41
- | | | | | Comorbidity feature review for models & clin summary update v2 May 2021.xlsx
42
- | | | | | diag_copd_resp_desc.json
43
- | | | | | inhaler_mapping.json
44
- | | | | | README.md
45
- | | | | | test_mapping.json
46
- | | | | |
47
- | | | |
48
- | | | \---utils
49
- | | | adm_common.py
50
- | | | adm_processing.py
51
- | | | adm_reduction.py
52
- | | | common.py
53
- | | | comorb_processing.py
54
- | | | labs_processing.py
55
- | | | README.md
56
- | | | __init__.py
57
- | | |
58
- | | \---reduction
59
- | | | clean_and_scale_test.py
60
- | | | clean_and_scale_train.py
61
- | | | combine.py
62
- | | | README.md
63
- | | | remove_ids.py
64
- | | | __init__.py
65
- | | |
66
- | | \---utils
67
- | | reduction.py
68
- | | __init__.py
69
- | |
70
- | \---tests
71
- | README.md
72
- |
73
- \---validation
74
- +---parameter_calculation
75
- | CAT_MRC_score_metrics_calculation.py
76
- | Fitbit_groups_calculation.py
77
- | GOLD_grade_GOLD_group_calculation.py
78
- | NIV_parameters_calculation.py
79
- | PRO_LOGIC_exacerbation_calculations.py
80
- | README.md
81
- | Time_to_death_calculation.py
82
- | Time_to_first_admission_calculations.py
83
- | Time_to_first_event_calculations.py
84
- |
85
- \---risk_score_calculation
86
- combined_risk_score_RC_SU1.py
87
- ```
 
1
+ ---
2
+ language: en
3
+ license: apache-2.0
4
+ tags:
5
+ - healthcare
6
+ - ehr
7
+ - copd
8
+ - clinical-risk
9
+ - tabular
10
+ - scikit-learn
11
+ - clustering
12
+ - unsupervised
13
+ pipeline_tag: tabular-classification
14
+ library_name: sklearn
15
+ ---
16
 
17
+ # COPD Open Models Model E (Unsupervised Risk Stratification)
18
 
19
+ ## Model Details
20
 
21
+ Model E groups COPD patients into risk clusters using **PCA dimensionality reduction** followed by **KMeans or hierarchical clustering**. Clusters are designed to support risk stratification — identifying whether patients are receiving the appropriate level of care for their apparent risk. Clusters update with new data to track how patient risk evolves over time.
22
 
23
+ ### Key Characteristics
24
+
25
+ - **Unsupervised** — no target labels required; clusters are derived from patient feature similarity alone.
26
+ - **Two-stage PCA** — Stage 1 selects features explaining ≥90% variance; Stage 2 reduces to 3 components explaining ≥80% variance.
27
+ - **Modular pipeline** — processing, reduction, and modelling stages are fully separated and independently reusable.
28
+ - Training code is fully decoupled from cloud infrastructure — runs locally with no Azure dependencies.
29
+
30
+ > **Note:** This repository contains no real patient-level data. All included data files are synthetic or example data for pipeline validation.
31
+
32
+ ### Model Type
33
+
34
+ Unsupervised clustering (PCA + KMeans / Agglomerative Hierarchical Clustering), validated via a Decision Tree Classifier trained on the cluster labels.
35
+
36
+ ### Release Notes
37
+
38
+ - **Phase 1 (current):** Models C, E, H published as the initial "COPD Open Models" collection.
39
+ - **Phase 2 (planned):** Additional models may follow after codebase sanitisation.
40
+
41
+ ---
42
+
43
+ ## Intended Use
44
+
45
+ This model and code are published as **reference implementations** for research, education, and benchmarking on COPD risk stratification tasks.
46
+
47
+ ### Intended Users
48
+
49
+ - ML practitioners exploring unsupervised healthcare ML pipelines
50
+ - Researchers comparing dimensionality reduction and clustering approaches for EHR data
51
+ - Developers building internal prototypes (non-clinical)
52
+
53
+ ### Out-of-Scope Uses
54
+
55
+ - **Not** for clinical decision-making, triage, diagnosis, or treatment planning.
56
+ - **Not** a substitute for clinical judgement or validated clinical tools.
57
+ - Do **not** deploy in healthcare settings without an appropriate regulatory, clinical safety, and information governance framework.
58
+
59
+ ### Regulatory Considerations (SaMD)
60
+
61
+ Regulatory status for software depends on the intended purpose expressed in documentation, labelling, and promotional materials. Downstream users integrating or deploying this model should determine whether their implementation qualifies as Software as a Medical Device (SaMD) and identify the legal "manufacturer" responsible for compliance and post-market obligations.
62
+
63
+ ---
64
+
65
+ ## Training Data
66
+
67
+ - **Source:** NHS EHR-derived datasets (training performed on controlled datasets; not distributed here).
68
+ - **Data available in this repo:** Synthetic/example datasets only.
69
+ - **Cohort:** COPD patients from hospital admissions, laboratory, pharmacy, and demographic records.
70
+ - **Data split:** 60% train / 15% validation / 20% test (random_state=42). RECEIVER and Scale-Up cohorts held out as external validation sets.
71
+
72
+ ### Features (~70 after processing)
73
+
74
+ | Category | Features |
75
+ |----------|----------|
76
+ | **Admissions** | adm_per_year, total_hosp_days, mean_los, copd_per_year, resp_per_year, copd_resp_per_year, days_since_copd_resp, days_since_adm, days_since_rescue, anxiety_depression_per_year |
77
+ | **Demographics** | age, sex_bin, ethnicity (7 categories), marital_status (one-hot), SIMD quintile/decile/vigintile |
78
+ | **Laboratory (2-year medians)** | 26 lab tests: albumin, ALT, AST, alkaline phosphatase, basophils, CRP, chloride, creatinine, eosinophils, eGFR, haematocrit, haemoglobin, lymphocytes, MCH, MCV, monocytes, neutrophils, platelets, potassium, RBC, sodium, total bilirubin, urea, WBC, neutrophil-to-lymphocyte ratio; plus labs_per_year |
79
+ | **Prescribing** | single/double/triple_inhaler_per_year, salbutamol_per_year, rescue_meds_per_year, anxiety_depression_presc_per_year, presc_per_year |
80
+ | **Comorbidities** | 30 binary flags (cardiovascular, respiratory, metabolic, oncology conditions), comorb_per_year |
81
+
82
+ ### Data Preprocessing
83
+
84
+ 1. **Imputation** — grouped median imputation by (age_bin × sex_bin). Age binned into 10 quantiles. Days-since features for patients with <5 years in cohort filled with dataset maximum.
85
+ 2. **Scaling** — MinMaxScaler to [0, 1], fit on training data only; test/validation/external sets transformed using saved scaler.
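
The two preprocessing steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the repository's implementation: the helper `impute_grouped_median`, the `crp` column, and the fallback to the overall median are assumptions made for the example, and the special handling of days-since features is not shown. The hypothetical test frame has no missing values, so only scaling is applied to it.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def impute_grouped_median(df, feature_cols, n_age_bins=10):
    """Fill missing values with the median of each (age_bin, sex_bin) group.

    Hypothetical helper for illustration; not taken from the repository.
    """
    out = df.copy()
    # Quantile-based age bins; duplicates="drop" guards against tied bin edges.
    out["age_bin"] = pd.qcut(out["age"], q=n_age_bins, labels=False, duplicates="drop")
    group_medians = out.groupby(["age_bin", "sex_bin"])[feature_cols].transform("median")
    out[feature_cols] = out[feature_cols].fillna(group_medians)
    # Assumed fallback: use the overall median if an entire group is missing.
    out[feature_cols] = out[feature_cols].fillna(out[feature_cols].median())
    return out.drop(columns="age_bin")


rng = np.random.default_rng(0)

# Synthetic training frame with some missing lab values.
train = pd.DataFrame({
    "age": rng.integers(40, 90, size=200),
    "sex_bin": rng.integers(0, 2, size=200),
    "crp": rng.normal(10.0, 3.0, size=200),
})
train.loc[::7, "crp"] = np.nan

# Synthetic test frame without missing values.
test = pd.DataFrame({
    "age": rng.integers(40, 90, size=50),
    "sex_bin": rng.integers(0, 2, size=50),
    "crp": rng.normal(10.0, 3.0, size=50),
})

features = ["age", "crp"]
train_imputed = impute_grouped_median(train, ["crp"])

# Fit the scaler on training data only, then reuse it for the test set.
scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_imputed[features])
test_scaled = scaler.transform(test[features])
```

Because the scaler is fitted on the training set alone, scaled test values can fall slightly outside [0, 1] when a test patient lies beyond the training range.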
86
+
87
+ ---
88
+
89
+ ## Training Procedure
90
+
91
+ ### Training Framework
92
+
93
+ - pandas, numpy, scikit-learn, matplotlib, mlflow
94
+
95
+ ### Algorithms
96
+
97
+ | Component | Algorithm | Parameters |
98
+ |-----------|-----------|------------|
99
+ | **PCA Stage 1** | sklearn.decomposition.PCA | Selects features at ≥90% cumulative explained variance |
100
+ | **PCA Stage 2** | sklearn.decomposition.PCA | Reduces to 3 components at ≥80% cumulative explained variance |
101
+ | **Clustering (primary)** | sklearn.cluster.AgglomerativeClustering | n_clusters=3, linkage='ward' |
102
+ | **Clustering (alternative)** | sklearn.cluster.KMeans | n_clusters=3, random_state=10 |
103
+ | **Validation** | sklearn.tree.DecisionTreeClassifier | random_state=42; trained on cluster labels |
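
As a rough sketch of how the components in this table fit together, the example below chains two PCA steps and both clustering algorithms on synthetic data. How Stage 1 "selects features" is not specified here, so passing `n_components=0.90` (keep the fewest components reaching 90% cumulative variance) is an assumed stand-in, and `n_init=10` is an added parameter chosen only to keep behaviour stable across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Synthetic stand-in for the scaled feature matrix: 300 patients x 20 features,
# with three high-variance directions and 17 low-variance ones.
scales = np.array([5.0, 4.0, 3.0] + [0.8] * 17)
X = rng.normal(size=(300, 20)) * scales

# Stage 1 (assumed interpretation): keep the fewest components that reach
# 90% cumulative explained variance.
stage1 = PCA(n_components=0.90, svd_solver="full")
X_stage1 = stage1.fit_transform(X)

# Stage 2: reduce to 3 components and check the variance they retain.
stage2 = PCA(n_components=3)
X_3d = stage2.fit_transform(X_stage1)
retained = stage2.explained_variance_ratio_.sum()
meets_target = retained >= 0.80

# Primary clustering: agglomerative with Ward linkage.
agglo_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X_3d)

# Alternative clustering: KMeans (n_init is an assumption, not from the table).
kmeans_labels = KMeans(n_clusters=3, random_state=10, n_init=10).fit_predict(X_3d)

print(f"Stage 1 kept {stage1.n_components_} components")
print(f"Stage 2 retains {retained:.1%} of Stage 1 variance (target met: {meets_target})")
```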
104
+
105
+ ### Cluster Count Selection
106
+
107
+ Davies-Bouldin Index and Silhouette Score are calculated for **k=2 through k=9**. Both metrics are logged to MLflow for inspection before selecting the final cluster count.
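
A minimal sketch of this sweep, assuming KMeans as the clusterer and synthetic blobs in place of the 3D PCA output, might look like the following. The `scores` dictionary is a hypothetical structure for the example; in a real run each value could be passed to `mlflow.log_metric` with `step=k`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic 3D points standing in for the PCA-reduced patient data.
X, _ = make_blobs(n_samples=300, centers=3, n_features=3,
                  cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 10):  # k = 2 through 9
    labels = KMeans(n_clusters=k, random_state=10, n_init=10).fit_predict(X)
    scores[k] = {
        "davies_bouldin": davies_bouldin_score(X, labels),  # lower is better
        "silhouette": silhouette_score(X, labels),          # higher is better
    }

# The two metrics can disagree, so both candidates are reported for inspection.
best_by_dbi = min(scores, key=lambda k: scores[k]["davies_bouldin"])
best_by_silhouette = max(scores, key=lambda k: scores[k]["silhouette"])

for k, s in scores.items():
    print(f"k={k}: DBI={s['davies_bouldin']:.3f}, silhouette={s['silhouette']:.3f}")
```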
108
+
109
+ ### Evaluation Design
110
+
111
+ - Clustering quality: **Davies-Bouldin Index** (lower is better), **Silhouette Score** (higher is better).
112
+ - Cluster validation: **Decision Tree accuracy** on held-out data (can the clustering be reliably reproduced?).
113
+ - Clinical validation: cluster-level outcome rates (admissions, mortality, time-to-event), demographic breakdowns, medication therapy distributions.
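
The cluster-validation idea can be illustrated as below: a decision tree is trained to predict cluster labels and scored on data it has not seen. This is a sketch under assumptions, since the exact split used for validation is not described here; the example clusters synthetic data once and then holds out 25% of the labelled points.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3D points standing in for the PCA-reduced patient data.
X, _ = make_blobs(n_samples=400, centers=3, n_features=3, random_state=0)

# Cluster labels act as the classification target.
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# Assumed split: hold out 25% of the labelled points for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, labels, test_size=0.25, random_state=42
)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
accuracy = accuracy_score(y_val, tree.predict(X_val))
print(f"Decision tree accuracy on held-out cluster labels: {accuracy:.3f}")
```

A fitted tree also gives a way to assign clusters to unseen patients, which `AgglomerativeClustering` cannot do on its own because it has no `predict` method.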
114
+
115
+ ---
116
+
117
+ ## Evaluation Results
118
+
119
+ > Replace this section with measured results from your training run.
120
+
121
+ | Metric | Value | Notes |
122
+ |--------|-------|-------|
123
+ | Davies-Bouldin Index | TBD | Lower is better |
124
+ | Silhouette Score | TBD | Range [-1, 1], higher is better |
125
+ | DTC Accuracy | TBD | Decision Tree on validation set |
126
+ | Cluster sizes | TBD | Patient counts per cluster |
127
+
128
+ ### Caveats on Metrics
129
+
130
+ - Cluster quality metrics assess geometric separation, not clinical meaningfulness — clinical validation requires outcome analysis.
131
+ - Results depend on the feature set, imputation strategy, and patient population.
132
+
133
+ ---
134
+
135
+ ## Bias, Risks, and Limitations
136
+
137
+ - **Dataset shift:** EHR coding practices, care pathways, and population characteristics vary across sites and time periods.
138
+ - **Feature availability:** Lab test availability varies by patient; imputation strategy affects cluster assignment.
139
+ - **Fairness:** Cluster composition may correlate with age, sex, or deprivation — interpret with care.
140
+ - **Misuse risk:** Cluster labels are not validated risk scores. Using them to drive clinical action without clinical safety processes can cause harm.
141
+ - **Interpretability:** PCA components are linear combinations of features — clinical interpretation requires examining loadings.
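
For the interpretability point, one way to examine loadings is to tabulate `PCA.components_` against feature names. The sketch below uses random data and a small, partly hypothetical feature list (`crp` is an assumed column name), so the numbers carry no clinical meaning.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Illustrative feature names; "crp" is an assumed column name.
feature_names = ["adm_per_year", "mean_los", "age", "crp", "presc_per_year"]
X = rng.normal(size=(200, len(feature_names)))

pca = PCA(n_components=3).fit(X)

# Rows are original features, columns are principal components.
loadings = pd.DataFrame(
    pca.components_.T,
    index=feature_names,
    columns=["PC1", "PC2", "PC3"],
)

# Features with the largest absolute weight on the first component.
top_pc1 = loadings["PC1"].abs().sort_values(ascending=False).head(3)
print(loadings.round(3))
print(top_pc1.round(3))
```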
142
+
143
+ ---
144
+
145
+ ## How to Use
146
+
147
+ ### Pipeline Execution Order
148
+
149
+ ```bash
150
+ # 1. Install dependencies
151
+ pip install pandas numpy scikit-learn matplotlib mlflow joblib tableone
152
+
153
+ # 2. Process raw data (run each independently)
154
+ python training/src/processing/process_demographics.py
155
+ python training/src/processing/process_admissions.py
156
+ python training/src/processing/process_comorbidities.py
157
+ python training/src/processing/process_labs.py
158
+ python training/src/processing/process_prescribing.py
159
+
160
+ # 3. Combine and reduce features (run in order)
161
+ python training/src/reduction/combine.py
162
+ python training/src/reduction/post_proc_reduction.py
163
+ python training/src/reduction/remove_ids.py
164
+ python training/src/reduction/clean_and_scale_train.py
165
+ python training/src/reduction/clean_and_scale_test.py
166
+
167
+ # 4. Run clustering model
168
+ python training/src/modelling/run_model.py
169
+
170
+ # 5. Predict clusters for new patients
171
+ python training/src/modelling/predict_clusters.py
172
  ```
173
+
174
+ ### Adapting to Your Data
175
+
176
+ Replace input file paths in `config.json` with your own EHR data extracts. The pipeline expects CSVs with patient ID, admission records, lab results, pharmacy records, and demographics.
177
+
178
+ ---
179
+
180
+ ## Environmental Impact
181
+
182
+ Training computational requirements are minimal: PCA and clustering on tabular data complete in seconds on a standard laptop.
183
+
184
+ ---
185
+
186
+ ## Citation
187
+
188
+ If you use this model or code, please cite:
189
+
190
+ - This repository: *(add citation format / Zenodo DOI if minted)*
191
+ - Associated publications: *(forthcoming)*
192
+
193
+ ## Authors and Contributors
194
+
195
+ - **Storm ID** (maintainers)
196
+
197
+ ## License
198
+
199
+ This model and code are released under the **Apache 2.0** license.