
Model E

Abstract

Model E is an unsupervised learning model built to group patients in the COPD cohort into k clusters as a means of risk stratification. Clusters are updated as new data arrives, and each patient's cluster membership is monitored to track how their risk changes over time. Results will be used to determine whether patients are receiving the correct level of care for their apparent risk.

Aims

  1. To use an unsupervised learning method to cluster patients within the COPD cohort into k clusters based on a variety of features.

  2. Cluster new data and update clusters accordingly. Monitor the identified cluster for each patient and alert if they transition between clusters.

  3. Determine whether patients are on the incorrect type of care based on their clusters.
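
Aim 2's transition alert can be sketched as follows; the patient IDs, cluster labels and alert format here are purely illustrative, not the model's actual output:

```python
# Hypothetical cluster assignments per patient for two consecutive years.
prev = {"P001": 0, "P002": 2, "P003": 1}
curr = {"P001": 0, "P002": 1, "P003": 1}

# Alert on any patient whose cluster (and hence apparent risk) changed.
alerts = [(pid, prev[pid], curr[pid])
          for pid in curr if pid in prev and curr[pid] != prev[pid]]
print(alerts)  # [('P002', 2, 1)]
```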

Data - EXAMPLE_STUDY_DATA

The tables below detail the raw EHR features processed for model training, along with the resulting processed feature set.

Raw features

Admissions/Comorbidities - SMR01_Cohort3R.csv

| Feature name | Description |
| --- | --- |
| SafeHavenID | Patient ID |
| ETHGRP | Ethnicity |
| ADMDATE | Date of admission |
| DISDATE | Date of discharge |
| DIAGxDesc (x=1-6) | Diagnosis columns 1-6 |
| STAY | Length of stay (days) |

Demographics - Demographics_Cohort3R.csv

| Feature name | Description |
| --- | --- |
| SafeHavenID | Patient ID |
| OBF_DOB | Date of birth |
| SEX | Sex |
| Marital_Status | Marital status |
| SIMD_2009/12/16_QUINTILE | SIMD ranks to quintiles for 2009, 2012 and 2016 data zones |
| SIMD_2009/12/16_DECILE | SIMD ranks to deciles for 2009, 2012 and 2016 data zones |
| SIMD_2009/12/16_VIGINTILE | SIMD ranks to vigintiles for 2009, 2012 and 2016 data zones |

Prescribing - Pharmacy_Cohort3R.csv

| Feature name | Description |
| --- | --- |
| SafeHavenID | Patient ID |
| PRESC_DATE | Date of prescription |
| PI_BNF_Item_Code | Code describing the specific medicine as found in the British National Formulary (BNF) reference book |
| PI_Approved_Name | Name of medicine |

Labs - SCI_Store_Cohort3R.csv

| Feature name | Description |
| --- | --- |
| SafeHavenID | Patient ID |
| SAMPLEDATE | Date lab test was taken |
| CLINICALCODEDESCRIPTION | Name of test |
| QUANTITYVALUE | Test value |
| RANGEHIGHVALUE | Test range highest value |
| RANGELOWVALUE | Test range lowest value |

Mappings

  • inhaler_mapping.json: Inhaler mappings for any Chapter 3 BNF code inhaler prescriptions present in the SafeHaven prescribing dataset. Information on NHS inhaler types, found here, was used to create the mapping.

  • test_mapping.json: A mapping created for any of the top 20 most frequently occurring lab tests, plus any lab tests found relevant for indicating COPD severity in Model A. This mapping creates a common name for a specific test and lists any related names the test may appear under within the SCI Store dataset.

  • Comorbidity feature review for models & clin summary update v2 May 2021.xlsx: A mapping between diagnosis names found in SMR01 and their associated comorbidities (taken from Model A).

  • diag_copd_resp_desc.json: DIAGDesc for COPD and respiratory admissions.

Processed features

Demographics features

The features below are saved for any necessary validation, but are not included in model training.

Feature name Description
eth_grp Ethnicity one-hot-encoded into 1 of 7 categories
entry_dataset Dataset patient first appeared in within the health board region
first_entry Date of first appearance in the health board region
obf_dob Patient DOB at respective date
sex_bin Sex in binary format: F=1, M=0
marital_status_m Married
marital_status_n Not Known
marital_status_o Other
marital_status_s Single
age_bin Age bucket based on training data (1 of 10)
days_since_copd_resp_med Median days since COPD or respiratory admission
days_since_adm_med Median days since any admission
days_since_rescue_med Median days since rescue event
simd_quintile SIMD ranks to quintile for closest year data zone
simd_decile SIMD ranks to decile for closest year data zone
simd_vigintile SIMD ranks to vigintile for closest year data zone

Final feature set

The final feature set contains 50 features, as detailed below.

| Feature name | Description |
| --- | --- |
| SafeHavenID | Patient ID |
| year | Data year |
| total_hosp_days | Total hospital days in current year |
| mean_los | Average length of stay per year |
| ggc_years | Total years appearing in the health board region |
| age | Patient age |
| EVENT_per_year | Total events per year, where EVENT=adm/comorb/salbutamol/rescue_meds/presc/labs/copd_resp |
| EVENT_to_date | Total events to date, where EVENT=adm/copd/resp/presc/rescue/labs |
| days_since_EVENT | Days since event, where EVENT=adm/copd_resp/rescue |
| TEST_med_2yr | Median test value from previous 2 years, where TEST=alt/ast/albumin/alkaline_phosphatase/basophils/c_reactive_protein/chloride/creatinine/eosinophils/estimated_gfr/haematocrit/haemoglobin/lymphocytes/mch/mean_cell_volume/monocytes/neutrophils/platelets/potassium/red_blood_count/sodium/total_bilirubin/urea/white_blood_count/neut_lymph |
| n_inhaler | Yearly inhaler prescription count, where n=single/double/triple |

These features are further reduced using Principal Components Analysis (PCA) to produce a reduced feature set containing:

| Feature name |
| --- |
| age |
| ggc_years |
| comorb/presc/labs_per_year |
| presc/labs/rescue_to_date |
| days_since_adm/copd_resp/rescue |
| albumin/estimated_gfr/haemoglobin/labs/red_blood_count_med_2yr |

Method

Raw datasets are loaded and processed into a format suitable for machine learning. Features are then reduced to 1 row per SafeHavenID per year by selecting the:

  • Median value for lab tests taken in the previous 2 years
  • Sum of any binary/count features
  • Last value of any to-date features

Once reduced, the datasets are then joined on SafeHavenID and year.
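
A minimal pandas sketch of this per-patient-per-year reduction and join; the column names and values are illustrative, not the exact SafeHaven fields, and the 2-year rolling window for lab medians is simplified to a plain per-year median:

```python
import pandas as pd

# Illustrative event-level data: 1 row per lab test / admission.
labs = pd.DataFrame({
    "SafeHavenID": [1, 1, 1, 2],
    "year": [2019, 2019, 2020, 2020],
    "haemoglobin": [13.0, 14.0, 12.0, 11.0],
})
adms = pd.DataFrame({
    "SafeHavenID": [1, 1, 2],
    "year": [2019, 2020, 2020],
    "adm": [1, 1, 1],            # count feature: summed
    "adm_to_date": [3, 4, 1],    # to-date feature: last value kept
})

# Median of lab values per patient-year.
labs_red = labs.groupby(["SafeHavenID", "year"], as_index=False).median()

# Sum count features, keep the last value of to-date features.
adms_red = adms.groupby(["SafeHavenID", "year"], as_index=False).agg(
    adm=("adm", "sum"), adm_to_date=("adm_to_date", "last"))

# Join the reduced datasets on SafeHavenID and year.
merged = labs_red.merge(adms_red, on=["SafeHavenID", "year"], how="outer")
print(merged)
```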

At this stage SafeHavenIDs present in both the Receiver and Scale-Up cohorts are removed. The remaining data is then split into training and testing sets in a subject-wise fashion, with 20% of the remaining patients in the testing set.
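
A subject-wise split keeps every row for a given patient in the same set, so no patient leaks between training and testing. One way to sketch this is with scikit-learn's GroupShuffleSplit (the exact splitting mechanism used here is not specified; the data below is a toy stand-in):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy patient-year data: 10 patients, 2 rows (years) each.
df = pd.DataFrame({
    "SafeHavenID": [i for i in range(10) for _ in range(2)],
    "feature": range(20),
})

# 20% of patients (not rows) go to the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["SafeHavenID"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]

# No patient appears in both sets.
assert set(train["SafeHavenID"]).isdisjoint(set(test["SafeHavenID"]))
```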

Each of these sets of data (training, testing, receiver and scale-up) is min-max scaled so that all features lie between 0 and 1. Note that all validation/testing sets (testing, receiver and scale-up) use the scaler fitted on the training set.

Data is then passed through a pipeline where:

  • PCA is applied to reduce the processed dataset of 50+ features to 15 features, which are then further reduced to 6 principal components.
  • The Davies-Bouldin score is used to select the number of clusters in the training set.
  • Training data is clustered using the K-Means algorithm, with results plotted using matplotlib.
  • The test, receiver and scale-up datasets are reduced using the PCA transform fitted on the training set.
  • Clusters are calculated for all validation data.
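
The pipeline steps above can be sketched with scikit-learn; the random data, the candidate k range (2-7) and the plotting step being omitted are all assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)
# Toy stand-in for the scaled training matrix (200 patients x 15 features).
X_train = rng.random((200, 15))
X_valid = rng.random((50, 15))

# Reduce to 6 principal components, fitting PCA on training data only.
pca = PCA(n_components=6).fit(X_train)
Z_train = pca.transform(X_train)

# Pick k by minimising the Davies-Bouldin score (lower is better).
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Z_train)
    scores[k] = davies_bouldin_score(Z_train, labels)
best_k = min(scores, key=scores.get)

# Final clustering; validation data reuses the fitted PCA transform.
kmeans = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(Z_train)
valid_clusters = kmeans.predict(pca.transform(X_valid))
```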