Model E
Abstract
Model E is an unsupervised learning model, built with the aim of grouping patients within the COPD cohort into k clusters as a means of risk stratification. Clusters are updated with new incoming data, with the cluster for each patient monitored in order to track how their risk changes over time. Results will be used to determine if patients are receiving the correct level of care for their apparent risk.
Aims
To use an unsupervised learning method to cluster patients within the COPD cohort into k clusters based on a variety of features.
Cluster new data and update clusters accordingly. Monitor the identified cluster for each patient and alert if they transition between clusters.
Determine if patients are receiving the incorrect type of care based on their clusters.
Data - EXAMPLE_STUDY_DATA
The sections below detail the raw EHR features processed for model training, along with the resulting processed feature set.
Raw features
Admissions/Comorbidities - SMR01_Cohort3R.csv
| Feature name | Description |
|---|---|
| SafeHavenID | Patient ID |
| ETHGRP | Ethnicity |
| ADMDATE | Date of admission |
| DISDATE | Date of discharge |
| DIAGxDesc (x=1-6) | Diagnosis columns 1-6 |
| STAY | Length of stay (days) |
Demographics - Demographics_Cohort3R.csv
| Feature name | Description |
|---|---|
| SafeHavenID | Patient ID |
| OBF_DOB | Date of birth |
| SEX | Sex |
| Marital_Status | Marital status |
| SIMD_2009/12/16_QUINTILE | SIMD ranks to quintiles for 2009, 2012 and 2016 data zones |
| SIMD_2009/12/16_DECILE | SIMD ranks to deciles for 2009, 2012 and 2016 data zones |
| SIMD_2009/12/16_VIGINTILE | SIMD ranks to vigintiles for 2009, 2012 and 2016 data zones |
Prescribing - Pharmacy_Cohort3R.csv
| Feature name | Description |
|---|---|
| SafeHavenID | Patient ID |
| PRESC_DATE | Date of prescription |
| PI_BNF_Item_Code | Code describing specific medicine as found in the British National Formulary (BNF) reference book |
| PI_Approved_Name | Name of medicine |
Labs - SCI_Store_Cohort3R.csv
| Feature name | Description |
|---|---|
| SafeHavenID | Patient ID |
| SAMPLEDATE | Date lab test was taken |
| CLINICALCODEDESCRIPTION | Name of test |
| QUANTITYVALUE | Test value |
| RANGEHIGHVALUE | Test range highest value |
| RANGELOWVALUE | Test range lowest value |
Mappings
- inhaler_mapping.json: Inhaler mappings for any Chapter 3 BNF code inhaler prescriptions present in the SafeHaven prescribing dataset. Information on NHS inhaler types was used to create the mapping.
- test_mapping.json: A mapping covering the top 20 most frequently occurring lab tests, plus any lab tests found relevant for indicating COPD severity in Model A. This mapping creates a common name for each specific test and lists any related names the test may appear under within the SCI Store dataset.
- Comorbidity feature review for models & clin summary update v2 May 2021.xlsx: A mapping between diagnosis names found in SMR01 and their associated comorbidities (taken from Model A).
- diag_copd_resp_desc.json: DIAGDesc values for COPD and respiratory admissions.
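As a rough sketch, a JSON mapping such as inhaler_mapping.json can be applied to the prescribing data with a simple dictionary lookup. The mapping structure ({BNF item code: inhaler type}) and the example codes below are illustrative assumptions, not the real contents of the file.

```python
import json  # in the real pipeline: inhaler_mapping = json.load(open("inhaler_mapping.json"))
import pandas as pd

# Hypothetical mapping entries for illustration only
inhaler_mapping = {"0302000C0AAAAAA": "double", "0301011R0AAAPAP": "single"}

presc = pd.DataFrame({
    "SafeHavenID": [1, 1, 2],
    "PI_BNF_Item_Code": ["0302000C0AAAAAA", "0301011R0AAAPAP", "0000000000"],
})

# Unmapped codes (non-inhaler prescriptions) become NaN
presc["inhaler_type"] = presc["PI_BNF_Item_Code"].map(inhaler_mapping)
```

Rows whose code is absent from the mapping fall out as NaN, which makes it easy to keep only inhaler prescriptions downstream.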
Processed features
Demographics features
The features below are saved for any necessary validation, but are not included in model training.
| Feature name | Description |
|---|---|
| eth_grp | Ethnicity one-hot-encoded into 1 of 7 categories |
| entry_dataset | Dataset patient first appeared in within the health board region |
| first_entry | Date of first appearance in the health board region |
| obf_dob | Patient DOB at respective date |
| sex_bin | Sex in binary format: F=1, M=0 |
| marital_status_m | Married |
| marital_status_n | Not Known |
| marital_status_o | Other |
| marital_status_s | Single |
| age_bin | Age bucket based on training data (1 of 10) |
| days_since_copd_resp_med | Median days since COPD or respiratory admission |
| days_since_adm_med | Median days since any admission |
| days_since_rescue_med | Median days since rescue event |
| simd_quintile | SIMD ranks to quintile for closest year data zone |
| simd_decile | SIMD ranks to decile for closest year data zone |
| simd_vigintile | SIMD ranks to vigintile for closest year data zone |
Final feature set
The final feature set contains 50 features, as detailed below.
| Feature name | Description |
|---|---|
| SafeHavenID | Patient ID |
| year | Data year |
| total_hosp_days | Total hospital days in current year |
| mean_los | Average length of stay per year |
| ggc_years | Total years appearing in the health board region |
| age | Patient age |
| EVENT_per_year | Total events per year where EVENT=adm/comorb/salbutamol/rescue_meds/presc/labs/copd_resp |
| EVENT_to_date | Total events to date where EVENT=adm/copd/resp/presc/rescue/labs |
| days_since_EVENT | Days since event where EVENT=adm/copd_resp/rescue |
| TEST_med_2yr | Median test value from previous 2 years, where TEST=alt/ast/albumin/alkaline_phosphatase/basophils/c_reactive_protein/chloride/creatinine/eosinophils/estimated_gfr/haematocrit/haemoglobin/lymphocytes/mch/mean_cell_volume/monocytes/neutrophils/platelets/potassium/red_blood_count/sodium/total_bilirubin/urea/white_blood_count/neut_lymph |
| n_inhaler | Yearly inhaler prescription count where n=single/double/triple |
These features are further reduced using Principal Components Analysis (PCA) to produce a reduced feature set containing:
| Feature name |
|---|
| age |
| ggc_years |
| comorb/presc/labs_per_year |
| presc/labs/rescue_to_date |
| days_since_adm/copd_resp/rescue |
| albumin/estimated_gfr/haemoglobin/labs/red_blood_count_med_2yr |
Method
Raw datasets are loaded and processed into a format suitable for applying machine learning. Features are then reduced to 1 row per SafeHavenID per year by selecting the:
- Median value for lab tests taken in the previous 2 years
- Sum of any binary/count features
- Last value of any to-date features
Once reduced, the datasets are then joined on SafeHavenID and year.
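The reduction and join steps above can be sketched with pandas. The column names (test_value, adm, and the derived feature names) are illustrative stand-ins, not the actual processed feature names; the 2-year median is approximated here as a rolling median over yearly medians.

```python
import pandas as pd

# Toy lab data: multiple tests per patient per year
labs = pd.DataFrame({
    "SafeHavenID": [1, 1, 1, 2],
    "year": [2015, 2016, 2016, 2016],
    "test_value": [4.0, 6.0, 8.0, 5.0],
})

# Median lab value per patient per year, then a 2-year rolling median
yearly_med = (labs.groupby(["SafeHavenID", "year"])["test_value"]
                  .median()
                  .reset_index(name="test_med"))
yearly_med["test_med_2yr"] = (
    yearly_med.groupby("SafeHavenID")["test_med"]
              .transform(lambda s: s.rolling(2, min_periods=1).median())
)

# Toy admission events: sum of count features per year, cumulative to-date
counts = pd.DataFrame({
    "SafeHavenID": [1, 1, 2],
    "year": [2016, 2016, 2016],
    "adm": [1, 1, 1],
})
per_year = counts.groupby(["SafeHavenID", "year"])["adm"].sum().reset_index(name="adm_per_year")
per_year["adm_to_date"] = per_year.groupby("SafeHavenID")["adm_per_year"].cumsum()

# Join the reduced tables on SafeHavenID and year
features = yearly_med.merge(per_year, on=["SafeHavenID", "year"], how="left")
```

Each reduced table ends up with one row per SafeHavenID per year, so the final merge on those two keys yields the model-ready feature table.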
At this stage, SafeHavenIDs present in both the Receiver and Scale-Up cohorts are removed. The remaining data is then split into training and testing sets in a subject-wise fashion, with 20% of the remaining patients placed in the testing set.
Each of these sets of data (training, testing, receiver and scale-up) is min-max scaled so that all features lie between 0 and 1. Note that all validation/testing sets (testing, receiver and scale-up) are transformed with the scaler fitted on the training set.
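A minimal sketch of this scaling step, assuming scikit-learn's MinMaxScaler and synthetic stand-in data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 6))  # stand-in for the processed training features
X_test = rng.normal(size=(20, 6))    # stand-in for a validation/testing set

scaler = MinMaxScaler().fit(X_train)   # per-feature min/max learned from training data only
X_train_s = scaler.transform(X_train)  # training features now lie in [0, 1]
X_test_s = scaler.transform(X_test)    # validation sets reuse the fitted scaler,
                                       # so values may fall slightly outside [0, 1]
```

Fitting the scaler once on the training set keeps the validation sets on the same scale without leaking their statistics into preprocessing.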
Data is then passed through a pipeline where:
- PCA is applied to reduce the processed dataset of 50+ features down to 15 features, which are then further reduced to 6 principal components.
- The Davies-Bouldin score is used to select the number of clusters for the training set.
- Training data is clustered using the K-Means algorithm, with results plotted using matplotlib.
- The test, receiver and scale-up datasets are reduced using the PCA transform fitted on the training set.
- Clusters are calculated for all validation data.
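The pipeline above can be sketched with scikit-learn. Synthetic, well-separated data stands in for the processed feature set, and the candidate range of k values is an assumption for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

# Synthetic stand-in: three well-separated groups in a 15-feature space
rng = np.random.default_rng(42)
X_train = np.vstack([rng.normal(loc=c, size=(60, 15)) for c in (0.0, 3.0, 6.0)])
X_test = rng.normal(loc=3.0, size=(30, 15))

pca = PCA(n_components=6).fit(X_train)  # fit PCA on training data only
Z_train = pca.transform(X_train)

# Pick the cluster number with the lowest Davies-Bouldin score
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Z_train)
    scores[k] = davies_bouldin_score(Z_train, labels)
best_k = min(scores, key=scores.get)

# Cluster the training data, then assign clusters to held-out data
kmeans = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(Z_train)
Z_test = pca.transform(X_test)          # reuse the training-set PCA
test_clusters = kmeans.predict(Z_test)  # cluster assignments for validation data
```

Because both the PCA and the K-Means model are fitted only on training data, new incoming data can be assigned a cluster with `pca.transform` followed by `kmeans.predict`, which is what enables tracking patients' cluster transitions over time.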