
Model E

Abstract

Model E is an unsupervised learning model built to group patients in the COPD cohort into k clusters as a means of risk stratification. Clusters are updated as new data arrives, and each patient's cluster membership is monitored to track how their risk changes over time. Results will be used to determine whether patients are receiving the correct level of care for their apparent risk.

Aims

  1. To use an unsupervised learning method to cluster patients within the COPD cohort into k clusters based on a variety of features.

  2. Cluster new data and update clusters accordingly. Monitor the identified cluster for each patient and alert if they transition between clusters.

  3. Determine whether patients are on the incorrect type of care based on their clusters.
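
Aim 2's transition alert can be sketched as follows; the patient IDs, cluster labels and alert format here are purely illustrative, not the model's actual output:

```python
# Hypothetical cluster assignments per patient for two consecutive years.
prev = {"P001": 0, "P002": 2, "P003": 1}
curr = {"P001": 0, "P002": 1, "P003": 1}

# Alert on any patient whose cluster (and hence apparent risk) changed.
alerts = [(pid, prev[pid], curr[pid])
          for pid in curr if pid in prev and curr[pid] != prev[pid]]
print(alerts)  # [('P002', 2, 1)]
```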

Data - EXAMPLE_STUDY_DATA

The tables below detail the raw EHR features processed for model training, along with the resulting processed feature set.

Raw features

Admissions/Comorbidities - SMR01_Cohort3R.csv

| Feature name | Description |
| --- | --- |
| SafeHavenID | Patient ID |
| ETHGRP | Ethnicity |
| ADMDATE | Date of admission |
| DISDATE | Date of discharge |
| DIAGxDesc (x=1-6) | Diagnosis columns 1-6 |
| STAY | Length of stay (days) |

Demographics - Demographics_Cohort3R.csv

| Feature name | Description |
| --- | --- |
| SafeHavenID | Patient ID |
| OBF_DOB | Date of birth |
| SEX | Sex |
| Marital_Status | Marital status |
| SIMD_2009/12/16_QUINTILE | SIMD ranks to quintiles for 2009, 2012 and 2016 data zones |
| SIMD_2009/12/16_DECILE | SIMD ranks to deciles for 2009, 2012 and 2016 data zones |
| SIMD_2009/12/16_VIGINTILE | SIMD ranks to vigintiles for 2009, 2012 and 2016 data zones |

Prescribing - Pharmacy_Cohort3R.csv

| Feature name | Description |
| --- | --- |
| SafeHavenID | Patient ID |
| PRESC_DATE | Date of prescription |
| PI_BNF_Item_Code | Code describing the specific medicine as found in the British National Formulary (BNF) reference book |
| PI_Approved_Name | Name of medicine |

Labs - SCI_Store_Cohort3R.csv

| Feature name | Description |
| --- | --- |
| SafeHavenID | Patient ID |
| SAMPLEDATE | Date lab test was taken |
| CLINICALCODEDESCRIPTION | Name of test |
| QUANTITYVALUE | Test value |
| RANGEHIGHVALUE | Test range highest value |
| RANGELOWVALUE | Test range lowest value |

Mappings

  • inhaler_mapping.json: Inhaler mappings for any Chapter 3 BNF code inhaler prescriptions present in the SafeHaven prescribing dataset. Information on NHS inhaler types, found here, was used to create the mapping.

  • test_mapping.json: A mapping created for any of the top 20 most frequently occurring lab tests, plus any lab tests found relevant for indicating COPD severity in Model A. This mapping creates a common name for a specific test and lists any related names the test may appear under within the SCI Store dataset.

  • Comorbidity feature review for models & clin summary update v2 May 2021.xlsx: A mapping between diagnosis names found in SMR01 and their associated comorbidities (taken from Model A).

  • diag_copd_resp_desc.json: DIAGDesc for COPD and respiratory admissions.

Processed features

Demographics features

The features below are saved for any necessary validation, but are not included in model training.

Feature name Description
eth_grp Ethnicity one-hot-encoded into 1 of 7 categories
entry_dataset Dataset patient first appeared in within the health board region
first_entry Date of first appearance in the health board region
obf_dob Patient DOB at respective date
sex_bin Sex in binary format: F=1, M=0
marital_status_m Married
marital_status_n Not Known
marital_status_o Other
marital_status_s Single
age_bin Age bucket based on training data (1 of 10)
days_since_copd_resp_med Median days since COPD or respiratory admission
days_since_adm_med Median days since any admission
days_since_rescue_med Median days since rescue event
simd_quintile SIMD ranks to quintile for closest year data zone
simd_decile SIMD ranks to decile for closest year data zone
simd_vigintile SIMD ranks to vigintile for closest year data zone

Final feature set

The final feature set contains 50 features, as detailed below.

| Feature name | Description |
| --- | --- |
| SafeHavenID | Patient ID |
| year | Data year |
| total_hosp_days | Total hospital days in current year |
| mean_los | Average length of stay per year |
| ggc_years | Total years appearing in the health board region |
| age | Patient age |
| EVENT_per_year | Total events per year, where EVENT=adm/comorb/salbutamol/rescue_meds/presc/labs/copd_resp |
| EVENT_to_date | Total events to date, where EVENT=adm/copd/resp/presc/rescue/labs |
| days_since_EVENT | Days since event, where EVENT=adm/copd_resp/rescue |
| TEST_med_2yr | Median test value from previous 2 years, where TEST=alt/ast/albumin/alkaline_phosphatase/basophils/c_reactive_protein/chloride/creatinine/eosinophils/estimated_gfr/haematocrit/haemoglobin/lymphocytes/mch/mean_cell_volume/monocytes/neutrophils/platelets/potassium/red_blood_count/sodium/total_bilirubin/urea/white_blood_count/neut_lymph |
| n_inhaler | Yearly inhaler prescription count, where n=single/double/triple |

These features are further reduced using Principal Components Analysis (PCA) to produce a reduced feature set containing:

| Feature name |
| --- |
| age |
| ggc_years |
| comorb/presc/labs_per_year |
| presc/labs/rescue_to_date |
| days_since_adm/copd_resp/rescue |
| albumin/estimated_gfr/haemoglobin/labs/red_blood_count_med_2yr |

Method

Raw datasets are loaded and processed into a format suitable for machine learning. Features are then reduced to 1 row per SafeHavenID per year by selecting the:

  • Median value for lab tests taken in the previous 2 years
  • Sum of any binary/count features
  • Last value of any to-date features

Once reduced, the datasets are then joined on SafeHavenID and year.
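
A minimal pandas sketch of this per-patient-per-year reduction and join; the column names and values are illustrative, not the exact SafeHaven fields, and the 2-year rolling window for lab medians is simplified to a plain per-year median:

```python
import pandas as pd

# Illustrative event-level data: 1 row per lab test / admission.
labs = pd.DataFrame({
    "SafeHavenID": [1, 1, 1, 2],
    "year": [2019, 2019, 2020, 2020],
    "haemoglobin": [13.0, 14.0, 12.0, 11.0],
})
adms = pd.DataFrame({
    "SafeHavenID": [1, 1, 2],
    "year": [2019, 2020, 2020],
    "adm": [1, 1, 1],            # count feature: summed
    "adm_to_date": [3, 4, 1],    # to-date feature: last value kept
})

# Median of lab values per patient-year.
labs_red = labs.groupby(["SafeHavenID", "year"], as_index=False).median()

# Sum count features, keep the last value of to-date features.
adms_red = adms.groupby(["SafeHavenID", "year"], as_index=False).agg(
    adm=("adm", "sum"), adm_to_date=("adm_to_date", "last"))

# Join the reduced datasets on SafeHavenID and year.
merged = labs_red.merge(adms_red, on=["SafeHavenID", "year"], how="outer")
print(merged)
```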

At this stage SafeHavenIDs present in both the Receiver and Scale-Up cohorts are removed. The remaining data is then split into training and testing sets in a subject-wise fashion, with 20% of the remaining patients in the testing set.
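
A subject-wise split keeps every row for a given patient in the same set, so no patient leaks between training and testing. One way to sketch this is with scikit-learn's GroupShuffleSplit (the exact splitting mechanism used here is not specified; the data below is a toy stand-in):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy patient-year data: 10 patients, 2 rows (years) each.
df = pd.DataFrame({
    "SafeHavenID": [i for i in range(10) for _ in range(2)],
    "feature": range(20),
})

# 20% of patients (not rows) go to the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["SafeHavenID"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]

# No patient appears in both sets.
assert set(train["SafeHavenID"]).isdisjoint(set(test["SafeHavenID"]))
```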

Each of these sets of data (training, testing, receiver and scale-up) is min-max scaled so that all features lie between 0 and 1. Note that all validation/testing sets (testing, receiver and scale-up) use the scaler fitted on the training set.

Data is then passed through a pipeline where:

  • PCA is applied to reduce the processed dataset of 50+ features to 15 features, which are then further reduced to 6 principal components.
  • The Davies-Bouldin score is used to select the number of clusters in the training set.
  • Training data is clustered using the K-Means algorithm, with results plotted using matplotlib.
  • The test, receiver and scale-up datasets are reduced using the PCA transform fitted on the training set.
  • Clusters are calculated for all validation data.
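
The pipeline steps above can be sketched with scikit-learn; the random data, the candidate k range (2-7) and the plotting step being omitted are all assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)
# Toy stand-in for the scaled training matrix (200 patients x 15 features).
X_train = rng.random((200, 15))
X_valid = rng.random((50, 15))

# Reduce to 6 principal components, fitting PCA on training data only.
pca = PCA(n_components=6).fit(X_train)
Z_train = pca.transform(X_train)

# Pick k by minimising the Davies-Bouldin score (lower is better).
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Z_train)
    scores[k] = davies_bouldin_score(Z_train, labels)
best_k = min(scores, key=scores.get)

# Final clustering; validation data reuses the fitted PCA transform.
kmeans = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(Z_train)
valid_clusters = kmeans.predict(pca.transform(X_valid))
```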