diff --git a/MODEL_CARD.md b/MODEL_CARD.md new file mode 100644 index 0000000000000000000000000000000000000000..4eb7308ec41e53f9b5f38174ed742524661af998 --- /dev/null +++ b/MODEL_CARD.md @@ -0,0 +1,199 @@ +--- +language: en +license: apache-2.0 +tags: + - healthcare + - ehr + - copd + - clinical-risk + - tabular + - scikit-learn + - clustering + - unsupervised +pipeline_tag: tabular-classification +library_name: sklearn +--- + +# COPD Open Models — Model E (Unsupervised Risk Stratification) + +## Model Details + +Model E groups COPD patients into risk clusters using **PCA dimensionality reduction** followed by **KMeans or hierarchical clustering**. Clusters are designed to support risk stratification — identifying whether patients are receiving the appropriate level of care for their apparent risk. Clusters update with new data to track how patient risk evolves over time. + +### Key Characteristics + +- **Unsupervised** — no target labels required; clusters are derived from patient feature similarity alone. +- **Two-stage PCA** — Stage 1 selects features explaining ≥90% variance; Stage 2 reduces to 3 components explaining ≥80% variance. +- **Modular pipeline** — processing, reduction, and modelling stages are fully separated and independently reusable. +- Training code is fully decoupled from cloud infrastructure — runs locally with no Azure dependencies. + +> **Note:** This repository contains no real patient-level data. All included data files are synthetic or example data for pipeline validation. + +### Model Type + +Unsupervised clustering (PCA + KMeans / Agglomerative Hierarchical Clustering), validated via Decision Tree Classifier on cluster labels. + +### Release Notes + +- **Phase 1 (current):** Models C, E, H published as the initial "COPD Open Models" collection. +- **Phase 2 (planned):** Additional models may follow after codebase sanitisation. 
+ +--- + +## Intended Use + +This model and code are published as **reference implementations** for research, education, and benchmarking on COPD risk stratification tasks. + +### Intended Users + +- ML practitioners exploring unsupervised healthcare ML pipelines +- Researchers comparing dimensionality reduction and clustering approaches for EHR data +- Developers building internal prototypes (non-clinical) + +### Out-of-Scope Uses + +- **Not** for clinical decision-making, triage, diagnosis, or treatment planning. +- **Not** a substitute for clinical judgement or validated clinical tools. +- Do **not** deploy in healthcare settings without an appropriate regulatory, clinical safety, and information governance framework. + +### Regulatory Considerations (SaMD) + +Regulatory status for software depends on the intended purpose expressed in documentation, labelling, and promotional materials. Downstream users integrating or deploying this model should determine whether their implementation qualifies as Software as a Medical Device (SaMD) and identify the legal "manufacturer" responsible for compliance and post-market obligations. + +--- + +## Training Data + +- **Source:** NHS EHR-derived datasets (training performed on controlled datasets; not distributed here). +- **Data available in this repo:** Synthetic/example datasets only. +- **Cohort:** COPD patients from hospital admissions, laboratory, pharmacy, and demographic records. +- **Data split:** 60% train / 15% validation / 20% test (random_state=42). RECEIVER and Scale-Up cohorts held out as external validation sets. 
+ +### Features (~70 after processing) + +| Category | Features | +|----------|----------| +| **Admissions** | adm_per_year, total_hosp_days, mean_los, copd_per_year, resp_per_year, copd_resp_per_year, days_since_copd_resp, days_since_adm, days_since_rescue, anxiety_depression_per_year | +| **Demographics** | age, sex_bin, ethnicity (7 categories), marital_status (one-hot), SIMD quintile/decile/vigintile | +| **Laboratory (2-year medians)** | 26 lab tests: albumin, ALT, AST, alkaline phosphatase, basophils, CRP, chloride, creatinine, eosinophils, eGFR, haematocrit, haemoglobin, lymphocytes, MCH, MCV, monocytes, neutrophils, platelets, potassium, RBC, sodium, total bilirubin, urea, WBC, neutrophil-to-lymphocyte ratio; plus labs_per_year | +| **Prescribing** | single/double/triple_inhaler_per_year, salbutamol_per_year, rescue_meds_per_year, anxiety_depression_presc_per_year, presc_per_year | +| **Comorbidities** | 30 binary flags (cardiovascular, respiratory, metabolic, oncology conditions), comorb_per_year | + +### Data Preprocessing + +1. **Imputation** — grouped median imputation by (age_bin × sex_bin). Age binned into 10 quantiles. Days-since features for patients with <5 years in cohort filled with dataset maximum. +2. **Scaling** — MinMaxScaler to [0, 1], fit on training data only; test/validation/external sets transformed using saved scaler. 
+ +--- + +## Training Procedure + +### Training Framework + +- pandas, numpy, scikit-learn, matplotlib, mlflow + +### Algorithms + +| Component | Algorithm | Parameters | +|-----------|-----------|------------| +| **PCA Stage 1** | sklearn.decomposition.PCA | Selects features at ≥90% cumulative explained variance | +| **PCA Stage 2** | sklearn.decomposition.PCA | Reduces to 3 components at ≥80% cumulative explained variance | +| **Clustering (primary)** | sklearn.cluster.AgglomerativeClustering | n_clusters=3, linkage='ward' | +| **Clustering (alternative)** | sklearn.cluster.KMeans | n_clusters=3, random_state=10 | +| **Validation** | sklearn.tree.DecisionTreeClassifier | random_state=42; trained on cluster labels | + +### Cluster Count Selection + +Davies-Bouldin Index and Silhouette Score are calculated for **k=2 through k=9**. Both metrics are logged to MLflow for inspection before selecting the final cluster count. + +### Evaluation Design + +- Clustering quality: **Davies-Bouldin Index** (lower is better), **Silhouette Score** (higher is better). +- Cluster validation: **Decision Tree accuracy** on held-out data (can the clustering be reliably reproduced?). +- Clinical validation: cluster-level outcome rates (admissions, mortality, time-to-event), demographic breakdowns, medication therapy distributions. + +--- + +## Evaluation Results + +> Replace this section with measured results from your training run. + +| Metric | Value | Notes | +|--------|-------|-------| +| Davies-Bouldin Index | TBD | Lower is better | +| Silhouette Score | TBD | Range [-1, 1], higher is better | +| DTC Accuracy | TBD | Decision Tree on validation set | +| Cluster sizes | TBD | Patient counts per cluster | + +### Caveats on Metrics + +- Cluster quality metrics assess geometric separation, not clinical meaningfulness — clinical validation requires outcome analysis. +- Results depend on the feature set, imputation strategy, and patient population. 
+ +--- + +## Bias, Risks, and Limitations + +- **Dataset shift:** EHR coding practices, care pathways, and population characteristics vary across sites and time periods. +- **Feature availability:** Lab test availability varies by patient; imputation strategy affects cluster assignment. +- **Fairness:** Cluster composition may correlate with age, sex, or deprivation — interpret with care. +- **Misuse risk:** Cluster labels are not validated risk scores. Using them to drive clinical action without clinical safety processes can cause harm. +- **Interpretability:** PCA components are linear combinations of features — clinical interpretation requires examining loadings. + +--- + +## How to Use + +### Pipeline Execution Order + +```bash +# 1. Install dependencies +pip install pandas numpy scikit-learn matplotlib mlflow joblib tableone + +# 2. Process raw data (run each independently) +python training/src/processing/process_demographics.py +python training/src/processing/process_admissions.py +python training/src/processing/process_comorbidities.py +python training/src/processing/process_labs.py +python training/src/processing/process_prescribing.py + +# 3. Combine and reduce features (run in order) +python training/src/reduction/combine.py +python training/src/reduction/post_proc_reduction.py +python training/src/reduction/remove_ids.py +python training/src/reduction/clean_and_scale_train.py +python training/src/reduction/clean_and_scale_test.py + +# 4. Run clustering model +python training/src/modelling/run_model.py + +# 5. Predict clusters for new patients +python training/src/modelling/predict_clusters.py +``` + +### Adapting to Your Data + +Replace input file paths in `config.json` with your own EHR data extracts. The pipeline expects CSVs with patient ID, admission records, lab results, pharmacy records, and demographics. 
+ +--- + +## Environmental Impact + +Training computational requirements are minimal: PCA and clustering on tabular data complete in seconds on a standard laptop. + +--- + +## Citation + +If you use this model or code, please cite: + +- This repository: *(add citation format / Zenodo DOI if minted)* +- Associated publications: *(forthcoming)* + +## Authors and Contributors + +- **Storm ID** (maintainers) + +## License + +This model and code are released under the **Apache 2.0** license. diff --git a/README.md b/README.md new file mode 100644 index 0000000000000000000000000000000000000000..f4219d1a1e2f58a9160999b3b214d8f9e2c89b2e --- /dev/null +++ b/README.md @@ -0,0 +1,87 @@ +# Model E + +Model E is an unsupervised learning model, built with the aim of grouping patients within the COPD cohort into _k_ clusters as a means of risk stratification. These clusters are updated with new incoming data, with the cluster for each patient monitored in order to track how their risk changes over time. Results will be used to determine if patients are receiving the correct level of care for their apparent risk. + +The model uses EHR data (patients' admission history, lab test data, prescribing data, comorbidities and demographic data), which is first pre-processed, then reduced using Principal Components Analysis (PCA). The 3D components produced by PCA are then passed to a variety of clustering algorithms, with results plotted for inspection. + +## Structure + +``` +C:. 
+| pipeline.yml +| README.md +| requirements.txt +| setup.cfg +| ++---documentation +| README.md +| ++---training +| | README.md +| | +| +---src +| | | README.md +| | | +| | +---data +| | | +| | +---modelling +| | | | Cluster Data.ipynb +| | | | +| | +---processing +| | | | process_admissions.py +| | | | process_comorbidities.py +| | | | process_demographics.py +| | | | process_gples.py +| | | | process_labs.py +| | | | process_prescribing.py +| | | | README.md +| | | | __init__.py +| | | | +| | | +---mappings +| | | | | Comorbidity feature review for models & clin summary update v2 May 2021.xlsx +| | | | | diag_copd_resp_desc.json +| | | | | inhaler_mapping.json +| | | | | README.md +| | | | | test_mapping.json +| | | | | +| | | | +| | | \---utils +| | | adm_common.py +| | | adm_processing.py +| | | adm_reduction.py +| | | common.py +| | | comorb_processing.py +| | | labs_processing.py +| | | README.md +| | | __init__.py +| | | +| | \---reduction +| | | clean_and_scale_test.py +| | | clean_and_scale_train.py +| | | combine.py +| | | README.md +| | | remove_ids.py +| | | __init__.py +| | | +| | \---utils +| | reduction.py +| | __init__.py +| | +| \---tests +| README.md +| +\---validation + +---parameter_calculation + | CAT_MRC_score_metrics_calculation.py + | Fitbit_groups_calculation.py + | GOLD_grade_GOLD_group_calculation.py + | NIV_parameters_calculation.py + | PRO_LOGIC_exacerbation_calculations.py + | README.md + | Time_to_death_calculation.py + | Time_to_first_admission_calculations.py + | Time_to_first_event_calculations.py + | + \---risk_score_calculation + combined_risk_score_RC_SU1.py +``` \ No newline at end of file diff --git a/config.json b/config.json new file mode 100644 index 0000000000000000000000000000000000000000..58f09ffed82aa29dd7a72aaf7a1d1a96e9055fd9 --- /dev/null +++ b/config.json @@ -0,0 +1,8 @@ +{ + "date": "2019-12-31", + "extract_data_path": "/EXAMPLE_STUDY_DATA/", + "rec_data_path": "/EXAMPLE_STUDY_DATA/", + "sup_data_path": 
"/SU_IDs/", + "model_data_path": "/Model_E_Extracts/", + "model_type": "hierarchical" +} \ No newline at end of file diff --git a/documentation/README.md b/documentation/README.md new file mode 100644 index 0000000000000000000000000000000000000000..a72204c596bebc1070b764e51e146c5b6be1c8d6 --- /dev/null +++ b/documentation/README.md @@ -0,0 +1,151 @@ +# Model E + + +## Abstract + +Model E is an unsupervised learning model, built with the aim of grouping patients within the COPD cohort into k clusters as a means of risk stratification. Clusters are updated with new incoming data, with the cluster for each patient monitored in order to track how their risk changes over time. Results will be used to determine if patients are receiving the correct level of care for their apparent risk. + + +## Aims + +1. To use an unsupervised learning method to cluster patients within the COPD cohort into k clusters based on a variety of features. + +2. Cluster new data and update clusters accordingly. Monitor the identified cluster for each patient and alert if they transition between clusters. + +3. Determine if patients are on the incorrect type of care based on their clusters. + + +## Data - EXAMPLE_STUDY_DATA + +The raw EHR features processed for model training are detailed below, along with the resulting processed feature set. 
+ +### Raw features + +#### Admissions/Comorbidities - SMR01_Cohort3R.csv + +Feature name | Description | +-------------|-------------| +SafeHavenID | Patient ID | +ETHGRP | Ethnicity | +ADMDATE | Date of admission | +DISDATE | Date of discharge | +DIAGxDesc (x=1-6) | Diagnosis columns 1-6 | +STAY | Length of stay (days) | + +#### Demographics - Demographics_Cohort3R.csv + +Feature name | Description | +-------------|-------------| +SafeHavenID | Patient ID | +OBF_DOB | Date of birth | +SEX | Sex | +Marital_Status | Marital status | +SIMD_2009/12/16_QUINTILE | SIMD ranks to quintiles for 2009, 2012 and 2016 data zones | +SIMD_2009/12/16_DECILE | SIMD ranks to deciles for 2009, 2012 and 2016 data zones | +SIMD_2009/12/16_VIGINTILE | SIMD ranks to vigintiles for 2009, 2012 and 2016 data zones | + +#### Prescribing - Pharmacy_Cohort3R.csv + +Feature name | Description | +-------------|-------------| +SafeHavenID | Patient ID | +PRESC_DATE | Date of prescription | +PI_BNF_Item_Code | Code describing specific medicine as found in the British National Formulary (BNF) reference book | +PI_Approved_Name | Name of medicine | + +#### Labs - SCI_Store_Cohort3R.csv + +Feature name | Description | +-------------|-------------| +SafeHavenID | Patient ID | +SAMPLEDATE | Date lab test was taken | +CLINICALCODEDESCRIPTION | Name of test | +QUANTITYVALUE | Test value | +RANGEHIGHVALUE | Test range highest value | +RANGELOWVALUE | Test range lowest value | + +#### Mappings + +- `inhaler_mapping.json`: Inhaler mappings for any Chapter 3 BNF code inhaler prescriptions present in the SafeHaven prescribing dataset. Information on NHS inhaler types, found [here](https://www.coch.nhs.uk/media/172781/3-respiratory-system.pdf), was used to create the mapping. + +- `test_mapping.json`: A mapping created for any of the top 20 most frequently occurring lab tests, plus any lab tests found relevant for indicating COPD severity in Model A. 
This mapping creates a common name for a specific test and lists any related names the test may appear under within the SCI Store dataset. + +- `Comorbidity feature review for models & clin summary update v2 May 2021.xlsx`: A mapping between diagnosis names found in SMR01 and their associated comorbidities (taken from Model A). + +- `diag_copd_resp_desc.json`: DIAGDesc for COPD and respiratory admissions. + +### Processed features + +#### Demographics features + +The below features are saved to be used for any necessary validation, but are not included for model training. + +Feature name | Description | +-------------|-------------| +eth_grp | Ethnicity one-hot-encoded into 1 of 7 categories | +entry_dataset | Dataset patient first appeared in within the health board region | +first_entry | Date of first appearance in the health board region | +obf_dob | Patient DOB at respective date | +sex_bin | Sex in binary format: F=1, M=0 | +marital_status_m | Married | +marital_status_n | Not Known | +marital_status_o | Other | +marital_status_s | Single | +age_bin | Age bucket based on training data (1 of 10) | +days_since_copd_resp_med | Median days since COPD or respiratory admission | +days_since_adm_med | Median days since any admission | +days_since_rescue_med | Median days since rescue event | +simd_quintile | SIMD ranks to quintile for closest year data zone | +simd_decile | SIMD ranks to decile for closest year data zone | +simd_vigintile | SIMD ranks to vigintile for closest year data zone | + +#### Final feature set + +The final feature set contains 50 features, as detailed below. 
+ +Feature name | Description | +-------------|-------------| +SafeHavenID | Patient ID | +year | Data year | +total_hosp_days | Total hospital days in current year | +mean_los | Average length of stay per year | +ggc_years | Total years appearing in the health board region | +age | Patient age | +EVENT_per_year | Total events per year where EVENT=adm/comorb/salbutamol/rescue_meds/presc/labs/copd_resp | +EVENT_to_date | Total events to date where EVENT=adm/copd/resp/presc/rescue/labs | +days_since_EVENT | Days since event where EVENT=adm/copd_resp/rescue | +TEST_med_2yr | Median test value from previous 2 years, where TEST=alt/ast/albumin/alkaline_phosphatase/basophils/c_reactive_protein/chloride/creatinine/eosinophils/estimated_gfr/haematocrit/haemoglobin/lymphocytes/mch/mean_cell_volume/monocytes/neutrophils/platelets/potassium/red_blood_count/sodium/total_bilirubin/urea/white_blood_count/neut_lymph | +n_inhaler | Yearly inhaler prescription count where n=single/double/triple | + +These features are further reduced using Principal Components Analysis (PCA) to produce a reduced feature set containing: + +Feature name | +-------------| +age | +ggc_years | +comorb/presc/labs_per_year | +presc/labs/rescue_to_date | +days_since_adm/copd_resp/rescue | +albumin/estimated_gfr/haemoglobin/labs/red_blood_count_med_2yr | + +### Method + +Raw datasets are loaded in and processed into a format suitable for machine learning to be applied. Features are then reduced to 1 row per SafeHavenID per year by selecting the: + +- Median value for lab tests taken in the previous 2 years +- Sum of any binary/count features +- Last value of any to-date features + +Once reduced, the datasets are then joined on SafeHavenID and year. + +At this stage SafeHavenIDs present in both the Receiver and Scale-Up cohorts are removed. The remaining data is then split into training and testing sets in a subject-wise fashion, with 20% of the remaining patients in the testing set. 
+ +Each of these sets of data (training, testing, receiver and scale-up) are min-max scaled so that all features lie between 0 and 1. Note that all validation/testing sets (testing, receiver and scale-up) use the pre-trained scaler used to process the training set. + +Data is then passed through a pipeline where: + +- PCA is applied to reduce the processed dataset with 50+ features down to 15 features which are then further reduced to 6 principal components. +- Davies Bouldin Score is applied to detect the cluster number in the training set. +- Training data is clustered using the K-Means algorithm, with results plotted using matplotlib. +- The test, receiver and scale-up datasets are reduced using the PCA method applied to the training set. +- Clusters are calculated for all validation data. \ No newline at end of file diff --git a/pipeline.yml b/pipeline.yml new file mode 100644 index 0000000000000000000000000000000000000000..1acb8721fa3498026e67c5747908b8be8bd69dfb --- /dev/null +++ b/pipeline.yml @@ -0,0 +1,29 @@ +trigger: + branches: + include: + - main + - release/* + +jobs: +- job: 'build' + pool: + vmImage: 'ubuntu-latest' + + steps: + - task: UsePythonVersion@0 + inputs: + versionSpec: '3.8' + architecture: 'x64' + displayName: 'Specify Python version' + + - script: | + python -m pip install --upgrade pip + displayName: 'Install pip' + + - script: | + pip install -r requirements.txt + displayName: 'Install CI dependencies' + + - script: | + flake8 + displayName: 'Run linting' \ No newline at end of file diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..1ae3ec61745e33c7f8fd2960cd9350f72ba12092 --- /dev/null +++ b/requirements.txt @@ -0,0 +1 @@ +flake8 \ No newline at end of file diff --git a/setup.cfg b/setup.cfg new file mode 100644 index 0000000000000000000000000000000000000000..0cd692083f22d52ae499a4577f7514db2fd4ffd2 --- /dev/null +++ b/setup.cfg @@ -0,0 +1,7 @@ +[tool:pytest] 
+filterwarnings = + ignore::DeprecationWarning +[flake8] +ignore = E501,W293,W292 +exclude = .git,__pycache__,docs/source/conf.py,old,build,dist +max-complexity = 10 \ No newline at end of file diff --git a/training/README.md b/training/README.md new file mode 100644 index 0000000000000000000000000000000000000000..7395939cd12c9a41f0fa44d70aa329582f2371eb --- /dev/null +++ b/training/README.md @@ -0,0 +1,6 @@ +# Training + +Training scripts are split into `src` and `tests` directories, where `src` is further segmented into: +- `processing`: containing scripts to process raw EHR data +- `reduction`: containing scripts to combine, reduce, fill and scale processed EHR data for modelling +- `modelling`: containing notebooks and scripts required for model training \ No newline at end of file diff --git a/training/src/README.md b/training/src/README.md new file mode 100644 index 0000000000000000000000000000000000000000..a60cbe27f278da4756a7122d0a823043051e7e33 --- /dev/null +++ b/training/src/README.md @@ -0,0 +1,8 @@ +# Processing + +The folder contains scripts to process raw EHR data for training. + +Note that scripts must be run in the order below: +1. `processing` +2. `reduction` +3. 
`modelling` \ No newline at end of file diff --git a/training/src/modelling/__pycache__/run_model.cpython-313.pyc b/training/src/modelling/__pycache__/run_model.cpython-313.pyc new file mode 100644 index 0000000000000000000000000000000000000000..e4b0122ce960b80c33c0ed3cd0c25fd4198d4ebc Binary files /dev/null and b/training/src/modelling/__pycache__/run_model.cpython-313.pyc differ diff --git a/training/src/modelling/additional_code_onevsone_onevsrest_approaches.py b/training/src/modelling/additional_code_onevsone_onevsrest_approaches.py new file mode 100644 index 0000000000000000000000000000000000000000..4d702dfd1ba38f214f14f9f671053e3cf25087f1 --- /dev/null +++ b/training/src/modelling/additional_code_onevsone_onevsrest_approaches.py @@ -0,0 +1,346 @@ +import pandas as pd +import numpy as np +from sklearn.datasets import load_iris +from sklearn.tree import DecisionTreeClassifier +from sklearn.metrics import confusion_matrix +from sklearn.model_selection import train_test_split +import matplotlib.pyplot as plt +from sklearn.multiclass import OneVsRestClassifier +from sklearn.tree import plot_tree +from tabulate import tabulate +from sklearn.linear_model import LogisticRegression +import mlflow +from sklearn.metrics import accuracy_score +from sklearn.metrics import ConfusionMatrixDisplay +from sklearn.tree import export_text +from sklearn import tree +from itertools import combinations + + +# load in the data +data = load_iris() +iris = data +# convert to a dataframe +df = pd.DataFrame(data.data, columns=data.feature_names) +# create the species column +df['Species'] = data.target +# replace this with the actual names +target = np.unique(data.target) +target_names = np.unique(data.target_names) +targets = dict(zip(target, target_names)) +df['Species'] = df['Species'].replace(targets) + +# extract features and target variables +x = df.drop(columns="Species") +y = df["Species"] +# save the feature name and target variables +feature_names = x.columns +labels = 
y.unique() + +# split the dataset +X_train, test_x, y_train, test_lab = train_test_split(x, y, test_size=0.4, random_state=42) + + +# The below is for a classic one-vs-rest logistic regression binary classifier; +# explainability is based on the coefficients in logistic regression + +# Create a One-vs-Rest logistic regression classifier +clf = LogisticRegression(random_state=0, multi_class='ovr') + +# Train the classifier on the Iris dataset +clf.fit(X_train, y_train) + +# Get the number of classes and features +n_classes = len(set(iris.target)) +n_features = iris.data.shape[1] + +# Create a figure with one subplot for each class +fig, axs = plt.subplots(n_classes, 1, figsize=(10, 5 * n_classes)) + +# Loop over each class +for i in range(n_classes): + # Get the feature importances for the current class + coef = clf.coef_[i] + importance = coef + + # Sort the feature importances in descending order + indices = np.argsort(importance)[::-1] + + # Create a bar plot of the feature importances + axs[i].bar(range(n_features), importance[indices]) + axs[i].set_xticks(range(n_features)) + axs[i].set_xticklabels(np.array(iris.feature_names)[indices], rotation=90) + axs[i].set_xlabel('Features') + axs[i].set_ylabel('Importance') + axs[i].set_title('Feature Importance for Class {}'.format(iris.target_names[i])) + +# Adjust the spacing between subplots +fig.tight_layout() + +# Show the plot +plt.show() + + +# Make predictions on the test data +val_pred = clf.predict(test_x) +accuracy = accuracy_score(test_lab, val_pred) +mlflow.log_metric('dtc accuracy', accuracy) + +cm = confusion_matrix(test_lab, val_pred, labels=clf.classes_) +disp = ConfusionMatrixDisplay( + confusion_matrix=cm, display_labels=clf.classes_) +disp.plot() +plt.tight_layout() +mlflow.log_figure(disp.figure_, 'fig/' + 'confusion_matrix' + '.png') + + +# The below is for one-vs-rest decision trees; explainability importance values are +# calculated based on the reduction of impurity measured by the Gini index. 
+# Create a One-vs-Rest Decision Tree classifier +clf_pre = DecisionTreeClassifier(random_state=0) +clf = OneVsRestClassifier(clf_pre) + +# Train the classifier on the Iris dataset +clf.fit(X_train, y_train) + +# Get the number of classes and features +n_classes = len(set(iris.target)) +n_features = iris.data.shape[1] + +# Create a figure with one subplot for each class +fig, axs = plt.subplots(n_classes, 1, figsize=(10, 5 * n_classes)) + +# Loop over each class +for i in range(n_classes): + # Get the feature importances for the current class + importance = clf.estimators_[i].feature_importances_ + + # Sort the feature importances in descending order + indices = np.argsort(importance)[::-1] + + # Create a bar plot of the feature importances + axs[i].bar(range(n_features), importance[indices]) + axs[i].set_xticks(range(n_features)) + axs[i].set_xticklabels(np.array(iris.feature_names)[indices], rotation=90) + axs[i].set_xlabel('Features') + axs[i].set_ylabel('Importance') + axs[i].set_title('Feature Importance for Class {}'.format(iris.target_names[i])) + +# Adjust the spacing between subplots +fig.tight_layout() + +# Show the plot +plt.show() + + +y_pred_DTC = clf.predict(test_x) +accuracy = accuracy_score(test_lab, y_pred_DTC) +mlflow.log_metric('dtc accuracy', accuracy) + +cm = confusion_matrix(test_lab, y_pred_DTC, labels=clf.classes_) +disp = ConfusionMatrixDisplay( + confusion_matrix=cm, display_labels=clf.classes_) +disp.plot() +plt.tight_layout() +mlflow.log_figure(disp.figure_, 'fig/' + 'confusion_matrix' + '.png') + + +# Show decision tree for each class: two methods + +# Get the feature names +feature_names = iris.feature_names + +# Loop over each decision tree classifier in the one-vs-rest classifier +for i, estimator in enumerate(clf.estimators_): + # Export the decision rules for the current tree + tree_rules = export_text(estimator, feature_names=feature_names) + + # Print the decision rules for the current tree + print(f"Decision rules for tree for 
cluster {i}:") + print(tree_rules) + +# assume clf is your one vs rest classifier +for i, estimator in enumerate(clf.estimators_): + fig, ax = plt.subplots(figsize=(12, 8)) + tree.plot_tree(estimator, + feature_names=feature_names, + class_names=labels, + rounded=True, + filled=True, + fontsize=14, + ax=ax) + ax.set_title(f'Tree {i+1}') +plt.show() + + +# One vs one approach + +# BLR +# Create a logistic regression classifier (note: multi_class='multinomial' fits a single softmax model, not a true One-vs-One scheme) +clf = LogisticRegression(random_state=0, multi_class='multinomial', solver='lbfgs') + +# Train the classifier on the Iris dataset +clf.fit(X_train, y_train) + +# Get the number of classes and features +n_classes = len(set(iris.target)) +n_features = iris.data.shape[1] + +# Create a figure with one subplot for each class combination +fig, axs = plt.subplots(n_classes * (n_classes - 1) // 2, 1, figsize=(10, 5 * n_classes * (n_classes - 1) // 2)) + +# Loop over each class combination +index = 0 +for i in range(n_classes): + for j in range(i + 1, n_classes): + # Get the feature importances for the current class combination + coef = clf.coef_[index] + importance = coef + + # Sort the feature importances in descending order + indices = np.argsort(importance)[::-1] + + # Create a bar plot of the feature importances + axs[index].bar(range(n_features), importance[indices]) + axs[index].set_xticks(range(n_features)) + axs[index].set_xticklabels(np.array(iris.feature_names)[indices], rotation=90) + axs[index].set_xlabel('Features') + axs[index].set_ylabel('Importance') + axs[index].set_title('Feature Importance for Class Combination {} vs {}'.format(iris.target_names[i], iris.target_names[j])) + index += 1 + +# Adjust the spacing between subplots +fig.tight_layout() + +# Show the plot +plt.show() + + +# Make predictions on the test data +y_pred_ovo = clf.predict(test_x) +accuracy = accuracy_score(test_lab, y_pred_ovo) +mlflow.log_metric('blr accuracy', accuracy) + +# Get confusion matrix +cm = confusion_matrix(test_lab, y_pred_ovo, 
labels=clf.classes_) +disp = ConfusionMatrixDisplay( + confusion_matrix=cm, display_labels=clf.classes_) +disp.plot() +plt.tight_layout() +mlflow.log_figure(disp.figure_, 'fig/' + 'confusion_matrix' + '.png') + + +# Decision tree classifier + +# assume clf is your one vs one classifier +for i, (c1, c2) in enumerate(combinations(clf.classes_, 2)): + # select only the samples belonging to the current pair of classes + mask = (y_train == c1) | (y_train == c2) + + # train a decision tree on the current pair of classes + estimator = DecisionTreeClassifier() + estimator.fit(X_train[mask], y_train[mask]) + + # get feature importances + importances = estimator.feature_importances_ + + # create a bar plot showing feature importances for the current tree + fig, ax = plt.subplots(figsize=(8, 6)) + ax.bar(np.arange(len(feature_names)), importances) + ax.set_xticks(np.arange(len(feature_names))) + ax.set_xticklabels(feature_names, rotation=45, ha='right') + ax.set_title(f'Tree {i+1}: {c1} vs {c2} Feature Importances') + ax.set_ylabel('Importance') + plt.tight_layout() + plt.show() + + # initialize a list to store feature importances for each tree +importances_all = [] + +# assume clf is your one vs one classifier +for i, (c1, c2) in enumerate(combinations(clf.classes_, 2)): + # select only the samples belonging to the current pair of classes + mask = (y_train == c1) | (y_train == c2) + + # train a decision tree on the current pair of classes + estimator = DecisionTreeClassifier() + estimator.fit(X_train[mask], y_train[mask]) + + # get feature importances and store them in the list + importances = estimator.feature_importances_ + importances_all.append(importances) + + # plot the decision tree with feature importances + fig, ax = plt.subplots(figsize=(12, 8)) + tree.plot_tree(estimator, + feature_names=feature_names, + class_names=[str(c1), str(c2)], + rounded=True, + filled=True, + fontsize=14, + ax=ax) + + # add feature importances to title + title = f'Tree {i+1}: {c1} vs {c2}\n' + title += 
'Feature importances:\n' + for feature, importance in zip(feature_names, importances): + title += f'{feature}: {importance:.3f}\n' + ax.set_title(title) + + +# Get confusion matrix +cm = confusion_matrix(test_lab, val_pred, labels=clf.classes_) +disp = ConfusionMatrixDisplay( + confusion_matrix=cm, display_labels=clf.classes_) +disp.plot() +plt.tight_layout() +mlflow.log_figure(disp.figure_, 'fig/' + 'confusion_matrix' + '.png') + + +# Example of code to show explainability (one vs rest for a specific instance) + +# Split the data into training and testing sets +X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42) + +# Train a binary classifier for each class +binary_classifiers = {} +for i in range(len(iris.target_names)): + binary_y_train = np.where(y_train == i, 1, 0) + model = DecisionTreeClassifier(random_state=42) + model.fit(X_train, binary_y_train) + binary_classifiers[i] = model + +# Choose a specific instance to explain (here, the instance at index 7 of the test set) +instance = X_test[7] + +# Get the predicted probability scores for each class for the instance +probs = [] +for i in range(len(iris.target_names)): + binary_classifier = binary_classifiers[i] + prob = binary_classifier.predict_proba(instance.reshape(1, -1))[0, 1] + probs.append(prob) + +# Get the index of the class with the highest probability score +predicted_class = np.argmax(probs) + +# Extract the binary classifier with the highest probability score +binary_classifier = binary_classifiers[predicted_class] + +# Plot the decision tree for the binary classifier with the highest probability score +fig, ax = plt.subplots(figsize=(12, 12)) +plot_tree(binary_classifier, filled=True, rounded=True, ax=ax, feature_names=iris.feature_names, class_names=['not ' + iris.target_names[predicted_class], iris.target_names[predicted_class]]) +plt.show() + +# Print the predicted class and probability for the instance +predicted_prob = probs[predicted_class] 
+print('Predicted Class:', predicted_class) +print('Predicted Probability:', predicted_prob) + + +# Create a table with the ID, characteristics, true class label, and predicted class label for each sample in the test data +table_test = np.column_stack((np.arange(len(y_test)) + 1, X_test, y_test, y_pred_ovo, y_pred_DTC)) +header_test = np.concatenate((['ID'], iris.feature_names, ['True Class', 'Predicted Class_BLR', 'Predicted Class_DTC'])) +table_test = np.vstack((header_test, table_test)) + +# Print the table for the test data +print(tabulate(table_test)) \ No newline at end of file diff --git a/training/src/modelling/dtc_params.json b/training/src/modelling/dtc_params.json new file mode 100644 index 0000000000000000000000000000000000000000..e604a0514a76e9d7aecafc2f2fe0e4e9645893dd --- /dev/null +++ b/training/src/modelling/dtc_params.json @@ -0,0 +1,3 @@ +{ + "random_state": 42 +} \ No newline at end of file diff --git a/training/src/modelling/event_calculations.py b/training/src/modelling/event_calculations.py new file mode 100644 index 0000000000000000000000000000000000000000..d9c195f00e2695b1926712354edc22ebabec9ec9 --- /dev/null +++ b/training/src/modelling/event_calculations.py @@ -0,0 +1,183 @@ +""" +Find information on COPD, respiratory, rescue med and death event tracking +for patients within a timeframe +""" +import json +import pandas as pd +import numpy as np + + +merged_cols = ['adm_per_year', 'copd_resp_per_year', + 'anxiety_depression_per_year', + 'rescue_meds_per_year', 'anxiety_depression_presc_per_year'] +base_cols = ['admission_any', 'admission_copd_resp', + 'admission_anxiety_depression', + 'presc_rescue_med', 'presc_anxiety_depression'] +n_cols = ["n_" + col for col in base_cols] +adm_cols = ['SafeHavenID', 'ADMDATE', 'admission_any', 'admission_copd_resp'] +presc_cols = ['SafeHavenID', 'PRESC_DATE', 'rescue_meds'] + + +def read_deaths(extract_data_path): + """ + Read in deaths dataset + -------- + :param extract_data_path: path to data 
extracts + :return: dataframe + """ + filename = extract_data_path + 'Deaths_Cohort3R.csv' + cols = ['SafeHavenID', 'DOD'] + df = pd.read_csv(filename, usecols=cols).drop_duplicates() + df['DOD'] = pd.to_datetime(df.DOD) + + return df + + +def filter_data(df, date_col, eoy_date, start_date, end_date, typ): + """ + Filter data to only include events occurring within given date range + -------- + :param df: dataframe + :param date_col: str name of date column + :param eoy_date: end of year date + :param start_date: validation data start date + :param end_date: validation data end date + :param typ: type of data: 'adm', 'presc', 'merged', 'deaths' + :return: filtered dataframe + """ + if typ == 'merged': + df = df[df.eoy == eoy_date] + else: + df = df[(df[date_col] >= start_date) & (df[date_col] < end_date)] + + return df + + +def calc_time_to_event(df, date_col, start_date, new_col): + """ + Calculate time in months to next event + -------- + :param df: dataframe + :param date_col: str name of date column + :param start_date: validation data start date + :param new_col: new column name + :return: dataframe indexed by SafeHavenID with months to event + """ + df_next = df.groupby('SafeHavenID').agg(next_event=(date_col, 'min')) + df_next = (df_next - start_date) / np.timedelta64(1, 'D') / 30.436875 + df_next.columns = ['time_to_' + new_col] + + return df_next + + +def bucket_time_to_event(df): + """ + Calculate time in months to next event and bucket into + 1, 3, 6, 12, 12+ months. 
+ -------- + :param df: dataframe + :return: dataframe with event times in categories + """ + month = [-1, 1, 3, 6, 12, 13] + label = ['1', '3', '6', '12', '12+'] + df = df.apply(lambda x: pd.cut(x, month, labels=label)) + df = df.fillna('12+') + + return df + + +def calculate_event_metrics(data_path, eoy_date, start_date, end_date): + """ + Generate tables with number of events in 12 months and + boolean for events + -------- + :param data_path: path to generated data + :param eoy_date: end of year date + :param start_date: validation data start date + :param end_date: validation data end date + """ + # Load in data + merged = pd.read_pickle(data_path + 'merged.pkl') + + # Select relevant dates and columns + merged = filter_data( + merged, 'eoy', eoy_date, start_date, end_date, 'merged') + df_event = merged[['SafeHavenID'] + merged_cols] + + # Create frame with total events within 12mo period + df_count = df_event.copy() + df_count.columns = ['SafeHavenID'] + n_cols + df_count.to_pickle(data_path + 'metric_table_counts.pkl') + + # Create frame with boolean events within 12mo period + df_event[merged_cols] = (df_event[merged_cols] > 0).astype(int) + df_event.columns = ['SafeHavenID'] + base_cols + df_event.to_pickle(data_path + 'metric_table_events.pkl') + + +def calculate_next_event(data_path, extract_data_path, eoy_date, + start_date, end_date): + """ + Generate table with the time in 1, 3, 6, 12, 12+ months + -------- + :param data_path: path to generated data + :param extract_data_path: path to data extracts + :param eoy_date: end of year date + :param start_date: validation data start date + :param end_date: validation data end date + """ + # Find next adm events + adm = pd.read_pickle(data_path + 'validation_adm_proc.pkl') + adm = filter_data( + adm, 'ADMDATE', eoy_date, start_date, end_date, 'adm') + adm['admission_any'] = 1 + adm['admission_copd_resp'] = adm.copd_event | adm.resp_event + adm = adm[adm_cols] + time_to_adm_any = calc_time_to_event( + adm, 
'ADMDATE', start_date, 'admission_any') + time_to_adm_copd = calc_time_to_event( + adm[adm.admission_copd_resp == 1], 'ADMDATE', start_date, + 'admission_copd_resp') + + # Find next presc events + presc = pd.read_pickle(data_path + 'validation_presc_proc.pkl') + presc = filter_data( + presc, 'PRESC_DATE', eoy_date, start_date, end_date, 'presc') + presc = presc[presc_cols] + presc = presc[presc.rescue_meds == 1] + time_to_rescue = calc_time_to_event( + presc, 'PRESC_DATE', start_date, 'presc_rescue_med') + + # Find next deaths + deaths = read_deaths(extract_data_path) + deaths = filter_data( + deaths, 'DOD', eoy_date, start_date, end_date, 'deaths') + deaths['death'] = 1 + time_to_death = calc_time_to_event( + deaths, 'DOD', start_date, 'death') + + # Merge results + frames = [time_to_adm_any, time_to_adm_copd, time_to_rescue, time_to_death] + results = pd.concat(frames, join='outer', axis=1) + results = bucket_time_to_event(results) + results.to_pickle(data_path + 'metric_table_next.pkl') + + +def main(): + + # Load in config items + with open('../../../config.json') as json_config_file: + config = json.load(json_config_file) + + data_path = config['model_data_path'] + extract_data_path = config['extract_data_path'] + eoy_date = pd.to_datetime(config['date']) + start_date = eoy_date + pd.Timedelta(days=1) + end_date = eoy_date + pd.offsets.DateOffset(years=1) + + calculate_event_metrics(data_path, eoy_date, start_date, end_date) + calculate_next_event(data_path, extract_data_path, eoy_date, + start_date, end_date) + + +main() diff --git a/training/src/modelling/hierarchical_params.json b/training/src/modelling/hierarchical_params.json new file mode 100644 index 0000000000000000000000000000000000000000..8879ceb709310234dc0d4bd8a3b4ea5d0ad93848 --- /dev/null +++ b/training/src/modelling/hierarchical_params.json @@ -0,0 +1,4 @@ +{ + "n_clusters": 3, + "linkage": "ward" +} \ No newline at end of file diff --git a/training/src/modelling/kmeans_params.json 
b/training/src/modelling/kmeans_params.json new file mode 100644 index 0000000000000000000000000000000000000000..ab6d75455dfa3d5bbf355da0809c1d718cb9b082 --- /dev/null +++ b/training/src/modelling/kmeans_params.json @@ -0,0 +1,4 @@ +{ + "n_clusters": 3, + "random_state": 10 +} \ No newline at end of file diff --git a/training/src/modelling/one_vs_rest_BLR.py b/training/src/modelling/one_vs_rest_BLR.py new file mode 100644 index 0000000000000000000000000000000000000000..0a753124309772be299543b9229b3ec9bfd8228b --- /dev/null +++ b/training/src/modelling/one_vs_rest_BLR.py @@ -0,0 +1,377 @@ +""" +Modelling process +""" +import pandas as pd +import numpy as np +import pickle +import matplotlib.pyplot as plt +import mlflow +from matplotlib import rcParams +from sklearn.cluster import AgglomerativeClustering, KMeans +from sklearn.decomposition import PCA +from sklearn.metrics import (davies_bouldin_score, silhouette_score, + accuracy_score, confusion_matrix, + ConfusionMatrixDisplay) +from sklearn.linear_model import LogisticRegression +# from sklearn.multiclass import OneVsRestClassifier +import os + + +# Set-up figures +rcParams['figure.figsize'] = 20, 5 +rcParams['axes.spines.top'] = False +rcParams['axes.spines.right'] = False + +# Set parameters for current run +year = 2019 +model_type = 'hierarchical' +data_type = 'train' +k = 3 +stamp = str(pd.Timestamp.now(tz='GMT+0'))[:16].replace(':', '').replace(' ', '_') +data_path = '/Model_E_Extracts/' + +# Set MLFlow parameters +mlflow.set_tracking_uri("file:/.") +tracking_uri = mlflow.get_tracking_uri() +experiment_name = 'Model E: one vs rest adaption BLR ' + model_type +run_name = "_".join((str(year), model_type, stamp)) +description = "Clustering model with one vs rest adaption (BLR) for COPD data in " + str(year) + + +def extract_year(df, year): + """ + Extract 1 year of data + -------- + :param df: dataframe to extract from + :param year: year to select data from + :return: data from chosen year + """ + return 
df[df.year == year] + + +def read_yearly_data(typ, year): + """ + Read in data for year required + -------- + :param typ: type of data to read in + :param year: year to select data from + :return: data from chosen year and ids + """ + df = pd.read_pickle(data_path + 'min_max_' + typ + '.pkl') + df_year = extract_year(df, year) + ids = df_year.pop('SafeHavenID').to_list() + df_year = df_year.drop('year', axis=1) + + return df_year, ids + + +def plot_variance(df, typ): + """ + Plot PCA variance + --------- + :param df: dataframe to process with PCA + :param typ: type of plot - for 'full' data or 'reduced' + :return: pca object + """ + pca = PCA().fit(df) + n = list(range(1, len(df.columns) + 1)) + evr = pca.explained_variance_ratio_.cumsum() + fig, ax = plt.subplots() + ax.plot(n, evr) + title = 'PCA Variance - ' + typ + ax.set_title(title, size=20) + ax.set_xlabel('Number of principal components') + ax.set_ylabel('Cumulative explained variance') + ax.grid() + plt.tight_layout() + mlflow.log_figure(fig, 'fig/' + title + '.png') + + return pca + + +def extract_pca_loadings(df, pca_object): + """ + Extract PCA loadings + -------- + :param df: dataframe to reduce with pca + :param pca_object: pca object with feature loadings + :return: loadings table + """ + cols = df.columns + loadings = pd.DataFrame( + data=pca_object.components_.T * np.sqrt(pca_object.explained_variance_), + columns=[f'PC{i}' for i in range(1, len(cols) + 1)], + index=cols) + + return loadings + + +def plot_loadings(loadings): + """ + Plot loadings for PC1 returned from PCA + -------- + :param loadings: table of feature correlations to PC1 + :return: updated loadings table + """ + loadings_abs = loadings.abs().sort_values(by='PC1', ascending=False) + pc1_abs = loadings_abs[['PC1']].reset_index() + col_map = {'index': 'Attribute', 'PC1': 'AbsCorrWithPC1'} + pc1_abs = pc1_abs.rename(col_map, axis=1) + fig, ax = plt.subplots() + pc1_abs.plot(ax=ax, kind='bar') + title = 'PCA loading scores (PC1)' + 
ax.set_title(title, size=20) + ax.set_xticks(ticks=pc1_abs.index, labels=pc1_abs.Attribute, rotation='vertical') + ax.set_xlabel('Attribute') + ax.set_ylabel('AbsCorrWithPC1') + plt.tight_layout() + mlflow.log_figure(fig, 'fig/' + title + '.png') + + return pc1_abs + + +def extract_array(df, pca_object, typ): + """ + Extract data to pass to clustering algos + -------- + :param df: dataframe to convert + :param pca_object: initialised PCA object + :param typ: type of return needed, either 'train' or 'test' + :return: converted array (the fitted PCA object is pickled when training) + """ + if typ == 'train': + pca_func = pca_object.fit_transform + else: + pca_func = pca_object.transform + + pca_data = pd.DataFrame(pca_func(df)).to_numpy() + + if typ == 'train': + pca_file = data_path + run_name + '_pca.pkl' + pickle.dump(pca_object, open(pca_file, 'wb')) + + return pca_data + + +def get_kmeans_score(data, k): + ''' + Calculate K-Means Davies Bouldin and Silhouette scores + -------- + :param data: dataset to fit K-Means to + :param k: number of centers/clusters + :return: Scores + ''' + kmeans = KMeans(n_clusters=k) + model = kmeans.fit_predict(data) + db_score = davies_bouldin_score(data, model) + sil_score = silhouette_score(data, model) + + return db_score, sil_score + + +def plot_DB(df): + """ + Extract Davies Bouldin score and plot for a range of cluster numbers, + applied using K-Means clustering. + + "Davies Bouldin index represents the average 'similarity' of clusters, + where similarity is a measure that relates cluster distance to cluster + size" - the lowest score indicates best cluster set. 
+ -------- + :param df: dataframe to plot from + """ + db_scores = [] + sil_scores = [] + centers = list(range(2, 10)) + for center in centers: + db_score, sil_score = get_kmeans_score(df, center) + db_scores.append(db_score) + sil_scores.append(sil_score) + + # Plot DB + fig, ax = plt.subplots() + ax.plot(centers, db_scores, linestyle='--', marker='o', color='b') + ax.set_xlabel('K') + ax.set_ylabel('Davies Bouldin score') + title = 'Davies Bouldin score vs. K' + ax.set_title(title, size=20) + plt.tight_layout() + mlflow.log_figure(fig, 'fig/' + title + '.png') + + # Plot silhouette + fig, ax = plt.subplots() + ax.plot(centers, sil_scores, linestyle='--', marker='o', color='b') + ax.set_xlabel('K') + ax.set_ylabel('Silhouette score') + title = 'Silhouette score vs. K' + ax.set_title(title, size=20) + plt.tight_layout() + mlflow.log_figure(fig, 'fig/' + title + '.png') + + +def plot_clust(df, labels): + """ + Plot clusters + -------- + :param df: dataframe to plot clusters from + :param labels: cluster labels + """ + fig = plt.figure(figsize=(10, 10)) + ax = fig.add_subplot(111, projection='3d') + sc = ax.scatter(df[:, 0], df[:, 1], df[:, 2], c=labels) + ax.set_xlabel('Principal Component 1') + ax.set_ylabel('Principal Component 2') + ax.set_zlabel('Principal Component 3') + ax.legend(*sc.legend_elements(), title='clusters') + title = 'Clusters' + ax.set_title(title, size=20) + plt.tight_layout() + mlflow.log_figure(fig, 'fig/' + title + '.png') + + +def save_clusters(typ, labels): + """ + Save results from clustering + -------- + :param typ: type of datasets - train, val + :param labels: labels from clustering to add to df + :param cols: columns to use for training + :return: reduced dataframe in numpy format + """ + df_full = pd.read_pickle(data_path + 'filled_' + typ + '.pkl') + df = df_full[df_full.year == year] + df['cluster'] = labels + df.to_pickle(data_path + '_'.join((run_name, typ, 'clusters.pkl'))) + + +def main(): + + # Read in data + df_train, 
train_ids = read_yearly_data('train', year) + df_val, val_ids = read_yearly_data('val', year) + + # Set up ML Flow + print('Setting up ML Flow run') + mlflow.set_tracking_uri('http://127.0.0.1:5000/') + mlflow.set_experiment(experiment_name) + mlflow.start_run(run_name=run_name, description=description) + mlflow.set_tag("model.name", model_type) + mlflow.set_tag("model.training_data", "EXAMPLE_STUDY_DATA") + mlflow.set_tag("model.training_year", year) + mlflow.log_param("n_cols", len(df_train.columns) - 1) + mlflow.log_param("k", k) + + # Select top features using PCA feature importance + print('Feature reduction stage 1') + pca = plot_variance(df_train, 'full') + loadings = extract_pca_loadings(df_train, pca) + pc1_abs_loadings = plot_loadings(loadings) + variance_full = pca.explained_variance_ratio_.cumsum() + + n_cols = np.argmax(variance_full >= 0.9) + 1 + + mlflow.log_param("pca_stage_1", n_cols) + columns = pc1_abs_loadings.Attribute[:n_cols].values + np.save(data_path + run_name + '_cols.npy', columns) + + # Reduce data by selecting n columns + df_train_reduced = df_train[columns] + df_val_reduced = df_val[columns] + + # Convert columns to principal components + print('Feature reduction stage 2') + pca_n_cols = plot_variance(df_train_reduced, 'reduced') + variance_reduced = pca_n_cols.explained_variance_ratio_.cumsum() + + n_components = np.argmax(variance_reduced >= 0.8) + 1 + mlflow.log_param("pca_stage_2", n_components) + pca_reduced = PCA(n_components=n_components) + data_train = extract_array(df_train_reduced, pca_reduced, 'train') + data_val = extract_array(df_val_reduced, pca_reduced, 'test') + + # Find best cluster number + print('Detecting best cluster number') + plot_DB(data_train) + + # Fit clustering model + print('Cluster model training') + data = np.concatenate((data_train, data_val)) + cluster_model = AgglomerativeClustering(n_clusters=k, linkage="ward") + # cluster_model = KMeans(n_clusters=k, random_state=10) + cluster_model.fit(data) + 
cluster_model_file = data_path + "_".join((run_name, model_type, 'cluster_model.pkl')) + pickle.dump(cluster_model, open(cluster_model_file, 'wb')) + + # Split labels + labels = cluster_model.labels_ + train_labels = labels[:len(train_ids)] + val_labels = labels[len(train_ids):] + save_clusters('train', train_labels) + save_clusters('val', val_labels) + + # Plot cluster results + plot_clust(data, labels) + + # Train and validate classifier + print('BLR classifier training') + + # Create a One-vs-Rest logistic regression classifier + clf = LogisticRegression(random_state=42, multi_class='ovr') + clf.fit(df_train_reduced.to_numpy(), train_labels) + clf_model_file = data_path + run_name + '_dtc_model.pkl' + pickle.dump(clf, open(clf_model_file, 'wb')) + + # Create a figure with one feature importance subplot for each class + n_classes = len(set(train_labels)) + n_features = df_train_reduced.shape[1] + + fig, axs = plt.subplots(n_classes, 1, figsize=(10, 5 * n_classes)) + + # Set the vertical spacing between subplots + fig.subplots_adjust(hspace=0.99) + + # Loop over each class + for i in range(n_classes): + # Get the feature importances for the current class + coef = clf.coef_[i] + importance = coef + + # Sort the feature importances in descending order + indices = np.argsort(importance)[::-1] + + # Create a bar plot of the feature importances + axs[i].bar(range(n_features), importance[indices]) + axs[i].set_xticks(range(n_features)) + axs[i].set_xticklabels(np.array(df_train_reduced.columns)[indices], rotation=90, fontsize=9) + axs[i].set_xlabel('Features') + axs[i].set_ylabel('Importance') + axs[i].set_title('Class {} Feature Importance'.format(i)) + + # save the plot to a temporary file + tmpfile = "plot.png" + fig.savefig(tmpfile) + + # log the plot to MLflow + with open(tmpfile, "rb") as fig: + mlflow.log_artifact(tmpfile, "feature_importance.png") + + # remove the temporary file + os.remove(tmpfile) + + # Make predictions on the test data + val_pred = 
clf.predict(df_val_reduced.to_numpy()) + accuracy = accuracy_score(val_labels, val_pred) + mlflow.log_metric('dtc accuracy', accuracy) + + cm = confusion_matrix(val_labels, val_pred, labels=clf.classes_) + disp = ConfusionMatrixDisplay( + confusion_matrix=cm, display_labels=clf.classes_) + disp.plot() + plt.tight_layout() + mlflow.log_figure(disp.figure_, 'fig/' + 'confusion_matrix' + '.png') + + # Stop ML Flow + mlflow.end_run() + + +main() diff --git a/training/src/modelling/one_vs_rest_DTC.py b/training/src/modelling/one_vs_rest_DTC.py new file mode 100644 index 0000000000000000000000000000000000000000..5c67677c69ff0e4663516a78da75a356513e29b0 --- /dev/null +++ b/training/src/modelling/one_vs_rest_DTC.py @@ -0,0 +1,380 @@ +""" +Modelling process +""" +import pandas as pd +import numpy as np +import pickle +import matplotlib.pyplot as plt +import mlflow +from matplotlib import rcParams +from sklearn.cluster import AgglomerativeClustering, KMeans +from sklearn.decomposition import PCA +from sklearn.metrics import (davies_bouldin_score, silhouette_score, + accuracy_score, confusion_matrix, + ConfusionMatrixDisplay) +from sklearn.multiclass import OneVsRestClassifier +from sklearn.tree import DecisionTreeClassifier # , export_text +import os + + +# Set-up figures +rcParams['figure.figsize'] = 20, 5 +rcParams['axes.spines.top'] = False +rcParams['axes.spines.right'] = False + +# Set parameters for current run +year = 2019 +model_type = 'hierarchical' +data_type = 'train' +k = 3 +stamp = str(pd.Timestamp.now(tz='GMT+0'))[:16].replace(':', '').replace(' ', '_') +data_path = '/Model_E_Extracts/' + +# Set MLFlow parameters +mlflow.set_tracking_uri("file:/.") +tracking_uri = mlflow.get_tracking_uri() +experiment_name = 'Model E: one vs rest adaption DTC ' + model_type +run_name = "_".join((str(year), model_type, stamp)) +description = "Clustering model with one vs rest adaption (DTC) for COPD data in " + str(year) + + +def extract_year(df, year): + """ + Extract 1 year of 
data + -------- + :param df: dataframe to extract from + :param year: year to select data from + :return: data from chosen year + """ + return df[df.year == year] + + +def read_yearly_data(typ, year): + """ + Read in data for year required + -------- + :param typ: type of data to read in + :param year: year to select data from + :return: data from chosen year and ids + """ + df = pd.read_pickle(data_path + 'min_max_' + typ + '.pkl') + df_year = extract_year(df, year) + ids = df_year.pop('SafeHavenID').to_list() + df_year = df_year.drop('year', axis=1) + + return df_year, ids + + +def plot_variance(df, typ): + """ + Plot PCA variance + --------- + :param df: dataframe to process with PCA + :param typ: type of plot - for 'full' data or 'reduced' + :return: pca object + """ + pca = PCA().fit(df) + n = list(range(1, len(df.columns) + 1)) + evr = pca.explained_variance_ratio_.cumsum() + fig, ax = plt.subplots() + ax.plot(n, evr) + title = 'PCA Variance - ' + typ + ax.set_title(title, size=20) + ax.set_xlabel('Number of principal components') + ax.set_ylabel('Cumulative explained variance') + ax.grid() + plt.tight_layout() + mlflow.log_figure(fig, 'fig/' + title + '.png') + + return pca + + +def extract_pca_loadings(df, pca_object): + """ + Extract PCA loadings + -------- + :param df: dataframe to reduce with pca + :param pca_object: pca object with feature loadings + :return: loadings table + """ + cols = df.columns + loadings = pd.DataFrame( + data=pca_object.components_.T * np.sqrt(pca_object.explained_variance_), + columns=[f'PC{i}' for i in range(1, len(cols) + 1)], + index=cols) + + return loadings + + +def plot_loadings(loadings): + """ + Plot loadings for PC1 returned from PCA + -------- + :param loadings: table of feature correlations to PC1 + :return: updated loadings table + """ + loadings_abs = loadings.abs().sort_values(by='PC1', ascending=False) + pc1_abs = loadings_abs[['PC1']].reset_index() + col_map = {'index': 'Attribute', 'PC1': 'AbsCorrWithPC1'} + 
pc1_abs = pc1_abs.rename(col_map, axis=1) + fig, ax = plt.subplots() + pc1_abs.plot(ax=ax, kind='bar') + title = 'PCA loading scores (PC1)' + ax.set_title(title, size=20) + ax.set_xticks(ticks=pc1_abs.index, labels=pc1_abs.Attribute, rotation='vertical') + ax.set_xlabel('Attribute') + ax.set_ylabel('AbsCorrWithPC1') + plt.tight_layout() + mlflow.log_figure(fig, 'fig/' + title + '.png') + + return pc1_abs + + +def extract_array(df, pca_object, typ): + """ + Extract data to pass to clustering algos + -------- + :param df: dataframe to convert + :param pca_object: initialised PCA object + :param typ: type of return needed, either 'train' or 'test' + :return: converted array (the fitted PCA object is pickled when training) + """ + if typ == 'train': + pca_func = pca_object.fit_transform + else: + pca_func = pca_object.transform + + pca_data = pd.DataFrame(pca_func(df)).to_numpy() + + if typ == 'train': + pca_file = data_path + run_name + '_pca.pkl' + pickle.dump(pca_object, open(pca_file, 'wb')) + + return pca_data + + +def get_kmeans_score(data, k): + ''' + Calculate K-Means Davies Bouldin and Silhouette scores + -------- + :param data: dataset to fit K-Means to + :param k: number of centers/clusters + :return: Scores + ''' + kmeans = KMeans(n_clusters=k) + model = kmeans.fit_predict(data) + db_score = davies_bouldin_score(data, model) + sil_score = silhouette_score(data, model) + + return db_score, sil_score + + +def plot_DB(df): + """ + Extract Davies Bouldin score and plot for a range of cluster numbers, + applied using K-Means clustering. + + "Davies Bouldin index represents the average 'similarity' of clusters, + where similarity is a measure that relates cluster distance to cluster + size" - the lowest score indicates best cluster set. 
+ -------- + :param df: dataframe to plot from + """ + db_scores = [] + sil_scores = [] + centers = list(range(2, 10)) + for center in centers: + db_score, sil_score = get_kmeans_score(df, center) + db_scores.append(db_score) + sil_scores.append(sil_score) + + # Plot DB + fig, ax = plt.subplots() + ax.plot(centers, db_scores, linestyle='--', marker='o', color='b') + ax.set_xlabel('K') + ax.set_ylabel('Davies Bouldin score') + title = 'Davies Bouldin score vs. K' + ax.set_title(title, size=20) + plt.tight_layout() + mlflow.log_figure(fig, 'fig/' + title + '.png') + + # Plot silhouette + fig, ax = plt.subplots() + ax.plot(centers, sil_scores, linestyle='--', marker='o', color='b') + ax.set_xlabel('K') + ax.set_ylabel('Silhouette score') + title = 'Silhouette score vs. K' + ax.set_title(title, size=20) + plt.tight_layout() + mlflow.log_figure(fig, 'fig/' + title + '.png') + + +def plot_clust(df, labels): + """ + Plot clusters + -------- + :param df: dataframe to plot clusters from + :param labels: cluster labels + """ + fig = plt.figure(figsize=(10, 10)) + ax = fig.add_subplot(111, projection='3d') + sc = ax.scatter(df[:, 0], df[:, 1], df[:, 2], c=labels) + ax.set_xlabel('Principal Component 1') + ax.set_ylabel('Principal Component 2') + ax.set_zlabel('Principal Component 3') + ax.legend(*sc.legend_elements(), title='clusters') + title = 'Clusters' + ax.set_title(title, size=20) + plt.tight_layout() + mlflow.log_figure(fig, 'fig/' + title + '.png') + + +def save_clusters(typ, labels): + """ + Save results from clustering + -------- + :param typ: type of datasets - train, val + :param labels: labels from clustering to add to df + :param cols: columns to use for training + :return: reduced dataframe in numpy format + """ + df_full = pd.read_pickle(data_path + 'filled_' + typ + '.pkl') + df = df_full[df_full.year == year] + df['cluster'] = labels + df.to_pickle(data_path + '_'.join((run_name, typ, 'clusters.pkl'))) + + +def main(): + + # Read in data + df_train, 
train_ids = read_yearly_data('train', year) + df_val, val_ids = read_yearly_data('val', year) + + # Set up ML Flow + print('Setting up ML Flow run') + mlflow.set_tracking_uri('http://127.0.0.1:5000/') + mlflow.set_experiment(experiment_name) + mlflow.start_run(run_name=run_name, description=description) + mlflow.set_tag("model.name", model_type) + mlflow.set_tag("model.training_data", "EXAMPLE_STUDY_DATA") + mlflow.set_tag("model.training_year", year) + mlflow.log_param("n_cols", len(df_train.columns) - 1) + mlflow.log_param("k", k) + + # Select top features using PCA feature importance + print('Feature reduction stage 1') + pca = plot_variance(df_train, 'full') + loadings = extract_pca_loadings(df_train, pca) + pc1_abs_loadings = plot_loadings(loadings) + variance_full = pca.explained_variance_ratio_.cumsum() + + n_cols = np.argmax(variance_full >= 0.9) + 1 + + mlflow.log_param("pca_stage_1", n_cols) + columns = pc1_abs_loadings.Attribute[:n_cols].values + np.save(data_path + run_name + '_cols.npy', columns) + + # Reduce data by selecting n columns + df_train_reduced = df_train[columns] + df_val_reduced = df_val[columns] + + # Convert columns to principal components + print('Feature reduction stage 2') + pca_n_cols = plot_variance(df_train_reduced, 'reduced') + variance_reduced = pca_n_cols.explained_variance_ratio_.cumsum() + + n_components = np.argmax(variance_reduced >= 0.8) + 1 + mlflow.log_param("pca_stage_2", n_components) + pca_reduced = PCA(n_components=n_components) + data_train = extract_array(df_train_reduced, pca_reduced, 'train') + data_val = extract_array(df_val_reduced, pca_reduced, 'test') + + # Find best cluster number + print('Detecting best cluster number') + plot_DB(data_train) + + # Fit clustering model + print('Cluster model training') + data = np.concatenate((data_train, data_val)) + cluster_model = AgglomerativeClustering(n_clusters=k, linkage="ward") + # cluster_model = KMeans(n_clusters=k, random_state=10) + cluster_model.fit(data) + 
cluster_model_file = data_path + "_".join((run_name, model_type, 'cluster_model.pkl')) + pickle.dump(cluster_model, open(cluster_model_file, 'wb')) + + # Split labels + labels = cluster_model.labels_ + train_labels = labels[:len(train_ids)] + val_labels = labels[len(train_ids):] + save_clusters('train', train_labels) + save_clusters('val', val_labels) + + # Plot cluster results + plot_clust(data, labels) + + # Train and validate classifier + print('DTC classifier training') + + # Create a One-vs-Rest DecisionTreeClassifier + clf_pre = DecisionTreeClassifier(random_state=42) + clf = OneVsRestClassifier(clf_pre) + clf.fit(df_train_reduced.to_numpy(), train_labels) + clf_model_file = data_path + run_name + '_dtc_model.pkl' + pickle.dump(clf, open(clf_model_file, 'wb')) + + # Create a figure with one feature importance subplot for each class + n_classes = len(set(train_labels)) + n_features = df_train_reduced.shape[1] + + fig, axs = plt.subplots(n_classes, 1, figsize=(10, 5 * n_classes)) + + # Set the vertical spacing between subplots + fig.subplots_adjust(hspace=0.99) + + # Loop over each class + for i in range(n_classes): + # Get the feature importances for the current class + importance = clf.estimators_[i].feature_importances_ + + # Sort the feature importances in descending order + indices = np.argsort(importance)[::-1] + + # Create a bar plot of the feature importances + axs[i].bar(range(n_features), importance[indices]) + axs[i].set_xticks(range(n_features)) + axs[i].set_xticklabels(np.array(df_train_reduced.columns)[indices], rotation=90, fontsize=9) + axs[i].set_xlabel('Features') + axs[i].set_ylabel('Importance') + axs[i].set_title('Class {} Feature Importance'.format(i)) + + # Adjust the spacing between the subplots + plt.subplots_adjust(hspace=0.5) + + # save the plot to a temporary file + tmpfile = "plot.png" + fig.savefig(tmpfile) + + # log the plot to MLflow + with open(tmpfile, "rb") as fig: + mlflow.log_artifact(tmpfile, "feature_importance.png") + + # 
remove the temporary file + os.remove(tmpfile) + + # Make predictions on the validation data + val_pred = clf.predict(df_val_reduced.to_numpy()) + accuracy = accuracy_score(val_labels, val_pred) + mlflow.log_metric('dtc accuracy', accuracy) + + cm = confusion_matrix(val_labels, val_pred, labels=clf.classes_) + disp = ConfusionMatrixDisplay( + confusion_matrix=cm, display_labels=clf.classes_) + disp.plot() + plt.tight_layout() + mlflow.log_figure(disp.figure_, 'fig/' + 'confusion_matrix' + '.png') + + # Stop ML Flow + mlflow.end_run() + + +main() diff --git a/training/src/modelling/predict_clusters.py b/training/src/modelling/predict_clusters.py new file mode 100644 index 0000000000000000000000000000000000000000..26229f6a7311d1837ec56a6e3714f36e99700975 --- /dev/null +++ b/training/src/modelling/predict_clusters.py @@ -0,0 +1,70 @@ +import sys +import json +import pandas as pd +import numpy as np +import pickle + + +def extract_year(df, eoy_date): + """ + Extract 1 year of data + -------- + :param df: dataframe to extract from + :param eoy_date: user-specified end of year date + :return: data from chosen year + """ + return df[df.eoy == eoy_date] + + +def read_yearly_data(data_path, data_type, eoy_date): + """ + Read in data for year required + -------- + :param data_path: path to generated data + :param data_type: type of data to read in + :param eoy_date: user-specified end of year date + :return: data from chosen year and ids + """ + df = pd.read_pickle(data_path + 'min_max_' + data_type + '.pkl') + df_year = extract_year(df, eoy_date) + ids = df_year.pop('SafeHavenID').to_list() + df_year = df_year.drop('eoy', axis=1) + + return df_year, ids + + +def main(): + + # Load in config items + with open('../../../config.json') as json_config_file: + config = json.load(json_config_file) + + # Set model parameters + eoy_date = config['date'] + data_path = config['model_data_path'] + + # Get datatype from cmd line + data_type = sys.argv[1] + run_name = sys.argv[2] + + # Read 
data + print('Loading data') + columns = np.load(data_path + run_name + '_cols.npy', allow_pickle=True) + df_scaled, ids = read_yearly_data(data_path, data_type, eoy_date) + df_scaled_reduced = df_scaled[columns] + df_unscaled_full = pd.read_pickle(data_path + 'filled_' + data_type + '.pkl') + df_unscaled = extract_year(df_unscaled_full, eoy_date) + + # Load model + print('Loading model') + clf_model_file = data_path + run_name + '_dtc_model.pkl' + clf = pickle.load(open(clf_model_file, 'rb')) + + # Predict on new data + print('Predicting clusters') + labels = clf.predict(df_scaled_reduced.to_numpy()) + df_unscaled['cluster'] = labels + df_unscaled.to_pickle(data_path + '_'.join((run_name, data_type, 'clusters.pkl'))) + + +main() diff --git a/training/src/modelling/run_model.py b/training/src/modelling/run_model.py new file mode 100644 index 0000000000000000000000000000000000000000..a90ca90c97fa35c74baab2acc75a10f1fea32f24 --- /dev/null +++ b/training/src/modelling/run_model.py @@ -0,0 +1,355 @@ +""" +Modelling process +""" +import json +import pandas as pd +import numpy as np +import pickle +import matplotlib.pyplot as plt +import mlflow +from matplotlib import rcParams +from sklearn.cluster import AgglomerativeClustering, KMeans +from sklearn.tree import DecisionTreeClassifier as DTC +from sklearn.decomposition import PCA +from sklearn.metrics import (davies_bouldin_score, silhouette_score, + accuracy_score, confusion_matrix, + ConfusionMatrixDisplay) + + +# Set-up figures +rcParams['figure.figsize'] = 20, 5 +rcParams['axes.spines.top'] = False +rcParams['axes.spines.right'] = False + + +def extract_year(df, eoy_date): + """ + Extract 1 year of data + -------- + :param df: dataframe to extract from + :param eoy_date: user-specified EOY date for training + :return: data from chosen year + """ + return df[df.eoy == eoy_date] + + +def read_yearly_data(data_path, typ, eoy_date): + """ + Read in data for year required + -------- + :param data_path: path to generated 
data + :param typ: type of data to read in + :param eoy_date: end of year date to select data from + :return: data from chosen year and ids + """ + df = pd.read_pickle(data_path + 'min_max_' + typ + '.pkl') + df_year = extract_year(df, eoy_date) + ids = df_year.pop('SafeHavenID').to_list() + df_year = df_year.drop('eoy', axis=1) + + return df_year, ids + + +def plot_variance(df, typ): + """ + Plot PCA variance + --------- + :param df: dataframe to process with PCA + :param typ: type of plot - for 'full' data or 'reduced' + :return: pca object + """ + pca = PCA().fit(df) + n = list(range(1, len(df.columns) + 1)) + evr = pca.explained_variance_ratio_.cumsum() + fig, ax = plt.subplots() + ax.plot(n, evr) + title = 'PCA Variance - ' + typ + ax.set_title(title, size=20) + ax.set_xlabel('Number of principal components') + ax.set_ylabel('Cumulative explained variance') + ax.grid() + plt.tight_layout() + mlflow.log_figure(fig, 'fig/' + title + '.png') + + return pca + + +def extract_pca_loadings(df, pca_object): + """ + Extract PCA loadings + -------- + :param df: dataframe to reduce with pca + :param pca_object: pca object with feature loadings + :return: loadings table + """ + cols = df.columns + loadings = pd.DataFrame( + data=pca_object.components_.T * np.sqrt(pca_object.explained_variance_), + columns=[f'PC{i}' for i in range(1, len(cols) + 1)], + index=cols) + + return loadings + + +def plot_loadings(loadings): + """ + Plot loadings for PC1 returned from PCA + -------- + :param loadings: table of feature correlations to PC1 + :return: updated loadings table + """ + loadings_abs = loadings.abs().sort_values(by='PC1', ascending=False) + pc1_abs = loadings_abs[['PC1']].reset_index() + col_map = {'index': 'Attribute', 'PC1': 'AbsCorrWithPC1'} + pc1_abs = pc1_abs.rename(col_map, axis=1) + fig, ax = plt.subplots() + pc1_abs.plot(ax=ax, kind='bar') + title = 'PCA loading scores (PC1)' + ax.set_title(title, size=20) + ax.set_xticks(ticks=pc1_abs.index, 
labels=pc1_abs.Attribute, rotation='vertical') + ax.set_xlabel('Attribute') + ax.set_ylabel('AbsCorrWithPC1') + plt.tight_layout() + mlflow.log_figure(fig, 'fig/' + title + '.png') + + return pc1_abs + + +def extract_array(df, data_path, run_name, pca_object, typ): + """ + Extract data to pass to clustering algos + -------- + :param df: dataframe to convert + :param data_path: path to generated data + :param run_name: name of run in ML Flow + :param pca_object: initialised PCA object + :param typ: type of return needed, either 'train' or 'test' + :return: converted array (the fitted PCA object is pickled when training) + """ + if typ == 'train': + pca_func = pca_object.fit_transform + else: + pca_func = pca_object.transform + + pca_data = pd.DataFrame(pca_func(df)).to_numpy() + + if typ == 'train': + pca_file = data_path + run_name + '_pca.pkl' + pickle.dump(pca_object, open(pca_file, 'wb')) + + return pca_data + + +def get_kmeans_score(data, k): + ''' + Calculate K-Means Davies Bouldin and Silhouette scores + -------- + :param data: dataset to fit K-Means to + :param k: number of centers/clusters + :return: Davies Bouldin score and Silhouette score + ''' + kmeans = KMeans(n_clusters=k) + model = kmeans.fit_predict(data) + db_score = davies_bouldin_score(data, model) + sil_score = silhouette_score(data, model) + + return db_score, sil_score + + +def plot_DB(df): + """ + Extract Davies Bouldin and Silhouette scores and plot them for a range of + cluster numbers, applied using K-Means clustering. + + "Davies Bouldin index represents the average 'similarity' of clusters, + where similarity is a measure that relates cluster distance to cluster + size" - the lowest score indicates best cluster set. 
+ -------- + :param df: dataframe to plot from + """ + db_scores = [] + sil_scores = [] + centers = list(range(2, 10)) + for center in centers: + db_score, sil_score = get_kmeans_score(df, center) + db_scores.append(db_score) + sil_scores.append(sil_score) + + # Plot DB + fig, ax = plt.subplots() + ax.plot(centers, db_scores, linestyle='--', marker='o', color='b') + ax.set_xlabel('K') + ax.set_ylabel('Davies Bouldin score') + title = 'Davies Bouldin score vs. K' + ax.set_title(title, size=20) + plt.tight_layout() + mlflow.log_figure(fig, 'fig/' + title + '.png') + + # Plot silhouette + fig, ax = plt.subplots() + ax.plot(centers, sil_scores, linestyle='--', marker='o', color='b') + ax.set_xlabel('K') + ax.set_ylabel('Silhouette score') + title = 'Silhouette score vs. K' + ax.set_title(title, size=20) + plt.tight_layout() + mlflow.log_figure(fig, 'fig/' + title + '.png') + + +def plot_clust(df, labels): + """ + Plot clusters + -------- + :param df: array of principal components (the first three are plotted) + :param labels: cluster labels + """ + fig = plt.figure(figsize=(10, 10)) + ax = fig.add_subplot(111, projection='3d') + sc = ax.scatter(df[:, 0], df[:, 1], df[:, 2], c=labels) + ax.set_xlabel('Principal Component 1') + ax.set_ylabel('Principal Component 2') + ax.set_zlabel('Principal Component 3') + ax.legend(*sc.legend_elements(), title='clusters') + title = 'Clusters' + ax.set_title(title, size=20) + plt.tight_layout() + mlflow.log_figure(fig, 'fig/' + title + '.png') + + +def save_clusters(data_path, run_name, eoy_date, typ, labels): + """ + Save results from clustering + -------- + :param data_path: path to generated data + :param run_name: name of run in ML Flow + :param eoy_date: end of year date to select data from + :param typ: type of dataset - train, val + :param labels: labels from clustering to add to df + """ + df_full = pd.read_pickle(data_path + 'filled_' + typ + '.pkl') + # Copy the yearly slice so the new column is not written to a view + df = df_full[df_full.eoy == eoy_date].copy() + df['cluster'] = labels + df.to_pickle(data_path + '_'.join((run_name, typ, 'clusters.pkl'))) + + +def main(): + + # 
Load in config files + with open('../../../config.json') as json_config_file: + config = json.load(json_config_file) + + # Set model parameters + eoy_date = config['date'] + data_path = config['model_data_path'] + model_type = config['model_type'] + + # Load in model config + with open(model_type + '_params.json') as json_params_file: + model_params = json.load(json_params_file) + + # Create ML Flow run details + stamp = str(pd.Timestamp.now(tz='GMT+0'))[:16].replace(':', '').replace(' ', '_') + experiment_name = 'Model E - Date Specific: ' + model_type + run_name = "_".join((str(eoy_date), model_type, stamp)) + description = "Clustering model for COPD data in the year prior to " + str(eoy_date) + + # Set up ML Flow + print('Setting up ML Flow run') + mlflow.set_tracking_uri('http://127.0.0.1:5000/') + mlflow.set_experiment(experiment_name) + mlflow.start_run(run_name=run_name, description=description) + mlflow.set_tag("model.name", model_type) + mlflow.set_tag("model.training_data", config['extract_data_path']) + mlflow.set_tag("model.training_date", eoy_date) + mlflow.log_param("k", model_params['n_clusters']) + + # Read in data + df_train, train_ids = read_yearly_data(data_path, 'train', eoy_date) + df_val, val_ids = read_yearly_data(data_path, 'val', eoy_date) + mlflow.log_param("n_cols", len(df_train.columns)) + + # Select top features using PCA feature importance + print('Feature reduction stage 1') + pca = plot_variance(df_train, 'full') + loadings = extract_pca_loadings(df_train, pca) + pc1_abs_loadings = plot_loadings(loadings) + variance_full = pca.explained_variance_ratio_.cumsum() + n_cols = np.argmax(variance_full >= 0.9) + 1 + mlflow.log_param("pca_stage_1", n_cols) + columns = pc1_abs_loadings.Attribute[:n_cols].values + np.save(data_path + 
run_name + '_cols.npy', columns) + + # Reduce data by selecting n columns + df_train_reduced = df_train[columns] + df_val_reduced = df_val[columns] + + # Convert columns to principal components + print('Feature reduction stage 2') + pca_n_cols = plot_variance(df_train_reduced, 'reduced') + variance_reduced = pca_n_cols.explained_variance_ratio_.cumsum() + n_components = np.argmax(variance_reduced >= 0.8) + 1 + mlflow.log_param("pca_stage_2", n_components) + pca_reduced = PCA(n_components=n_components) + data_train = extract_array( + df_train_reduced, data_path, run_name, pca_reduced, 'train') + data_val = extract_array( + df_val_reduced, data_path, run_name, pca_reduced, 'test') + + # Find best cluster number + print('Detecting best cluster number') + plot_DB(data_train) + + # Fit clustering model + print('Cluster model training') + data = np.concatenate((data_train, data_val)) + if model_type == 'hierarchical': + cluster_model = AgglomerativeClustering(**model_params) + else: + cluster_model = KMeans(**model_params) + cluster_model.fit(data) + cluster_model_file = data_path + "_".join((run_name, model_type, 'cluster_model.pkl')) + pickle.dump(cluster_model, open(cluster_model_file, 'wb')) + + # Split labels + labels = cluster_model.labels_ + train_labels = labels[:len(train_ids)] + val_labels = labels[len(train_ids):] + save_clusters(data_path, run_name, eoy_date, 'train', train_labels) + save_clusters(data_path, run_name, eoy_date, 'val', val_labels) + + # Plot cluster results + plot_clust(data, labels) + + # Read in DTC parameters + with open('dtc_params.json') as dtc_params_file: + dtc_params = json.load(dtc_params_file) + + # Train and validate classifier + print('Decision tree classifier training') + clf = DTC(**dtc_params).fit(df_train_reduced.to_numpy(), train_labels) + clf_model_file = data_path + run_name + '_dtc_model.pkl' + pickle.dump(clf, open(clf_model_file, 'wb')) + + # Calculate metrics + val_pred = clf.predict(df_val_reduced.to_numpy()) + + 
accuracy = accuracy_score(val_labels, val_pred) + mlflow.log_metric('dtc accuracy', accuracy) + + # Plot confusion matrix + cm = confusion_matrix(val_labels, val_pred, labels=clf.classes_) + disp = ConfusionMatrixDisplay( + confusion_matrix=cm, display_labels=clf.classes_) + disp.plot() + plt.tight_layout() + mlflow.log_figure(disp.figure_, 'fig/' + 'confusion_matrix' + '.png') + + # Stop ML Flow + mlflow.end_run() + + +main() diff --git a/training/src/modelling/validate.py b/training/src/modelling/validate.py new file mode 100644 index 0000000000000000000000000000000000000000..3348cd8a9327f4c9844b2c34eed52b296f85d5cb --- /dev/null +++ b/training/src/modelling/validate.py @@ -0,0 +1,260 @@ +""" +Validation process +""" +import sys +import json +import pandas as pd +import numpy as np +import matplotlib.pyplot as plt +import mlflow +from matplotlib import rcParams +from tableone import TableOne + + +# Set-up figures +rcParams['figure.figsize'] = 20, 5 +rcParams['axes.spines.top'] = False +rcParams['axes.spines.right'] = False + + +def plot_cluster_size(df, data_type): + """ + Produce a bar plot of cluster size + -------- + :param df: dataframe to plot + :param data_type: type of data - train, test, val, rec, sup + """ + # Number of patients + fig, ax = plt.subplots() + df.groupby('cluster').size().plot(ax=ax, kind='barh') + title = "Patient Cohorts" + ax.set_title(title) + ax.set_xlabel("Number of Patients", size=20) + ax.set_ylabel("Cluster") + plt.tight_layout() + mlflow.log_figure(fig, 'fig/' + title.replace(' ', '_') + '_' + data_type + '.png') + + +def plot_feature_hist(df, col, data_type): + """ + Produce a histogram plot for a chosen feature + -------- + :param df: dataframe to plot + :param col: feature column to plot + :param data_type: type of data - train, test, val, rec, sup + """ + fig, ax = plt.subplots() + df.groupby('cluster')[col].plot(ax=ax, kind='hist', alpha=0.5) + ax.set_xlabel(col) + title = col + ' Histogram' + ax.set_title(title, size=20) + 
ax.legend() + plt.tight_layout() + mlflow.log_figure(fig, 'fig/' + title.replace(' ', '_') + '_' + data_type + '.png') + + +def plot_feature_bar(data, col, typ, data_type): + """ + Produce a bar plot for a chosen feature + -------- + :param data: dataframe to plot + :param col: feature column to plot + :param typ: 'count' for raw counts; any other value plots percentages + :param data_type: type of data - train, test, val, rec, sup + """ + if typ == 'count': + to_plot = data.groupby(['cluster']).apply( + lambda x: x.groupby(col).size()) + x_label = "Number" + else: + to_plot = data.groupby(['cluster']).apply( + lambda x: 100 * x.groupby(col).size() / len(x)) + x_label = "Percentage" + fig, ax = plt.subplots() + to_plot.plot(ax=ax, kind='barh') + title = "Patient Cohorts" + ax.set_title(title, size=20) + ax.set_xlabel(x_label + " of patients") + ax.set_ylabel("Cluster") + plt.tight_layout() + mlflow.log_figure(fig, 'fig/' + '_'.join((title.replace(' ', '_'), col, data_type + '.png'))) + + +def plot_cluster_bar(data, typ, data_type): + """ + Produce a bar plot of per-cluster percentages + -------- + :param data: data to plot + :param typ: name of the metric plotted, used as the title and file name + :param data_type: type of data - train, test, val, rec, sup + """ + fig, ax = plt.subplots() + data.plot(ax=ax, kind='bar') + ax.set_title(typ, size=20) + ax.set_xlabel("Cluster") + ax.set_ylabel("Percentage") + ax.set_ylim(0, 100) + plt.tight_layout() + mlflow.log_figure(fig, 'fig/' + typ + '_' + data_type + '.png') + + +def plot_events(df, data_type): + """ + Plot events in the next 12 months based on metric table + -------- + :param df: metric table + :param data_type: type of data - train, test, val, rec, sup + """ + df = df.drop('SafeHavenID', axis=1).set_index('cluster') + events = df.groupby('cluster').apply(lambda x: 100 * x.apply( + lambda x: len(x[x == 1]) / len(x))) + plot_cluster_bar(events, 'events', data_type) + + +def process_deceased_metrics(col): + """ + Process deceased column for plotting + ------- + :param col: column to process 
+ :return: one-row table of percentage alive and percentage deceased + """ + # '12+' marks patients with no death recorded in the next 12 months. + # Test for inequality: a string '<' comparison would skip '3' and '6'. + n_deceased = 100 * ((col[col != '12+']).count()) / len(col) + res = pd.DataFrame({'alive': [100 - n_deceased], 'deceased': [n_deceased]}) + + return res + + +def plot_deceased(df, data_type): + """ + Plot survival in the next 12 months based on metric table + -------- + :param df: metric table + :param data_type: type of data - train, test, val, rec, sup + """ + survival = df.groupby('cluster')['time_to_death'].apply( + process_deceased_metrics).reset_index().drop( + 'level_1', axis=1).set_index('cluster') + plot_cluster_bar(survival, 'survival', data_type) + + +def plot_therapies(df_year, results, data_type): + """ + Plot patient therapies per cluster + -------- + :param df_year: unscaled data for current year + :param results: cluster results and safehaven id + :param data_type: type of data - train, test, val, rec, sup + """ + # Inhaler data for training group + therapies = df_year[['SafeHavenID', 'single_inhaler', 'double_inhaler', 'triple_inhaler']] + res_therapies = pd.merge(therapies, results, on='SafeHavenID', how='inner') + + # Find counts/percentage per cluster + inhaler_cols = ['single_inhaler', 'double_inhaler', 'triple_inhaler'] + inhals = res_therapies[['cluster'] + inhaler_cols].set_index('cluster') + in_res = inhals.groupby('cluster').apply( + lambda x: x.apply(lambda x: 100 * (x[x > 0].count()) / len(x))) + + # Number of people without an inhaler presc + no_in = res_therapies.groupby('cluster').apply( + lambda x: 100 * len(x[(x[inhaler_cols] == 0).all(axis=1)]) / len(x)).values + + # Rename columns for plotting + in_res.columns = [c[0] for c in in_res.columns.str.split('_')] + + # Add those with no inhaler + in_res['no_inhaler'] = no_in + + plot_cluster_bar(in_res, 'therapies', data_type) + + +def main(): + + # Load in config items + with open('../../../config.json') as json_config_file: + config = json.load(json_config_file) + data_path = config['model_data_path'] + + # Get datatype from cmd line + data_type = sys.argv[1] + run_name = 
sys.argv[2] + run_id = sys.argv[3] + + # Set MLFlow parameters + model_type = 'hierarchical' + experiment_name = 'Model E - Date Specific: ' + model_type + mlflow.set_tracking_uri('http://127.0.0.1:5000/') + mlflow.set_experiment(experiment_name) + mlflow.start_run(run_id=run_id) + + # Read in unscaled data, results and column names used to train model + columns = np.load(data_path + run_name + '_cols.npy', allow_pickle=True) + df_clusters = pd.read_pickle(data_path + "_".join((run_name, data_type, 'clusters.pkl'))) + df_reduced = df_clusters[list(columns) + ['cluster']] + + # Number of patients + plot_cluster_size(df_reduced, data_type) + + # Generate mean/std table + t1_year = TableOne(df_reduced, categorical=[], groupby='cluster', pval=True) + t1yr_file = data_path + 't1_year_' + run_name + '_' + data_type + '.html' + t1_year.to_html(t1yr_file) + mlflow.log_artifact(t1yr_file) + + # Histogram feature plots + plot_feature_hist(df_clusters, 'age', data_type) + plot_feature_hist(df_clusters, 'albumin_med_2yr', data_type) + + # Bar plots + df_clusters['sex'] = df_clusters['sex_bin'].map({0: 'Male', 1: 'Female'}) + plot_feature_bar(df_clusters, 'sex', 'percent', data_type) + plot_feature_bar(df_clusters, 'simd_decile', 'percent', data_type) + + # Metrics for following 12 months + df_events = pd.read_pickle(data_path + 'metric_table_events.pkl') + df_counts = pd.read_pickle(data_path + 'metric_table_counts.pkl') + df_next = pd.read_pickle(data_path + 'metric_table_next.pkl') + + # Merge cluster number with SafeHavenID and metrics + clusters = df_clusters[['SafeHavenID', 'cluster']] + df_events = clusters.merge(df_events, on='SafeHavenID', how='left').fillna(0) + df_counts = clusters.merge(df_counts, on='SafeHavenID', how='left').fillna(0) + df_next = clusters.merge(df_next, on='SafeHavenID', how='left').fillna('12+') + + # Generate TableOne for events + cat_cols = df_events.columns[2:] + df_events[cat_cols] = df_events[cat_cols].astype('int') + event_limit = 
dict(zip(cat_cols, 5 * [1])) + event_order = dict(zip(cat_cols, 5 * [[1, 0]])) + t1_events = TableOne(df_events[df_events.columns[1:]], groupby='cluster', + limit=event_limit, order=event_order) + t1_events_file = data_path + '_'.join(('t1', data_type, 'events', run_name + '.html')) + t1_events.to_html(t1_events_file) + mlflow.log_artifact(t1_events_file) + + # Generate TableOne for event counts + count_cols = df_counts.columns[2:] + df_counts[count_cols] = df_counts[count_cols].astype('int') + t1_counts = TableOne(df_counts[df_counts.columns[1:]], categorical=[], groupby='cluster') + t1_counts_file = data_path + '_'.join(('t1', data_type, 'counts', run_name + '.html')) + t1_counts.to_html(t1_counts_file) + mlflow.log_artifact(t1_counts_file) + + # Generate TableOne for time to next events + next_cols = df_next.columns[2:] + next_event_order = dict(zip(next_cols, 5 * [['1', '3', '6', '12', '12+']])) + t1_next = TableOne(df_next[df_next.columns[1:]], groupby='cluster', + order=next_event_order) + t1_next_file = data_path + '_'.join(('t1', data_type, 'next', run_name + '.html')) + t1_next.to_html(t1_next_file) + mlflow.log_artifact(t1_next_file) + + # Plot metrics + plot_events(df_events, data_type) + plot_deceased(df_next, data_type) + plot_therapies(df_clusters, clusters, data_type) + + # Stop ML Flow + mlflow.end_run() + + +main() diff --git a/training/src/processing/README.md b/training/src/processing/README.md new file mode 100644 index 0000000000000000000000000000000000000000..ecd22edb53ca816584b041a6a0057628ad109da2 --- /dev/null +++ b/training/src/processing/README.md @@ -0,0 +1,24 @@ +# Processing + +This folder contains scripts for processing raw EHR data, along with the mappings required to carry out the initial processing steps. + +Before running any scripts, first create a directory called 'Model_E_Extracts' within the 'S:/data' directory. 
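
The directory set-up step above can be sketched in Python. This is a minimal sketch, assuming the `S:/data` root already exists; `make_extract_dir` is a hypothetical helper for illustration and is not part of this repository.

```python
from pathlib import Path


def make_extract_dir(root="S:/data", name="Model_E_Extracts"):
    """Create the extracts directory under root and return its path.

    Hypothetical helper: the default root and name are taken from the
    README text, not from any config file in this repository.
    """
    extract_dir = Path(root) / name
    # parents=False: fail loudly if the root itself is missing.
    # exist_ok=True: re-running the set-up step is harmless.
    extract_dir.mkdir(parents=False, exist_ok=True)
    return extract_dir
```

Calling `make_extract_dir()` with no arguments targets `S:/data/Model_E_Extracts`; pass a different `root` when working against the synthetic example data on another drive.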
+ +_NB: The below processing scripts can be run in any order._ + +### Admissions + +- process_admissions.py - SMR01 COPD/Resp admissions per patient per year +- process_comorbidities.py - SMR01 comorbidities per patient per year + +### Demographics + +- process_demographics.py - DOB, sex, marital status and SIMD data + +### Labs + +- process_labs.py - lab test values per patient per year, taking the median lab test value from the 2 years prior + +### Prescribing + +- process_prescribing.py - prescriptions per patient per year diff --git a/training/src/processing/__init__.py b/training/src/processing/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..cd5026e4a1076584bc8ccbda0acf1605dbfafa11 --- /dev/null +++ b/training/src/processing/__init__.py @@ -0,0 +1 @@ +# Empty file for folder to be recognised as module diff --git a/training/src/processing/mappings/Comorbidity feature review for models & clin summary update v2 May 2021.xlsx b/training/src/processing/mappings/Comorbidity feature review for models & clin summary update v2 May 2021.xlsx new file mode 100644 index 0000000000000000000000000000000000000000..85ea1673e2b3e73a04f4cdbe2a6cf19529d8c925 Binary files /dev/null and b/training/src/processing/mappings/Comorbidity feature review for models & clin summary update v2 May 2021.xlsx differ diff --git a/training/src/processing/mappings/README.md b/training/src/processing/mappings/README.md new file mode 100644 index 0000000000000000000000000000000000000000..89375696fcb6a9d558cadab690e95e07a45f6733 --- /dev/null +++ b/training/src/processing/mappings/README.md @@ -0,0 +1,7 @@ +# Mappings + +This folder contains a range of mappings used within the processing stages of model E: +- `inhaler_mapping.json`: Inhaler mappings for any Chapter 3 BNF code inhaler prescriptions present in the SafeHaven prescribing dataset. 
Information on NHS inhaler types, found [here](https://www.coch.nhs.uk/media/172781/3-respiratory-system.pdf), was used to create the mapping. +- `test_mapping.json`: A mapping created for any of the top 20 most frequently occurring lab tests, plus any lab tests found relevant for indicating COPD severity in Model A. This mapping creates a common name for a specific test and lists any related names the test may appear under within the SCI Store dataset. +- `Comorbidity feature review for models & clin summary update v2 May 2021.xlsx`: A mapping between diagnosis names found in SMR01 and their associated comorbidities (taken from Model A). +- `diag_copd_resp_desc.json`: DIAGDesc for COPD and respiratory admissions \ No newline at end of file diff --git a/training/src/processing/mappings/diag_copd_resp_desc.json b/training/src/processing/mappings/diag_copd_resp_desc.json new file mode 100644 index 0000000000000000000000000000000000000000..efa2b9acefdd29932ec5374cecaf90035c221a25 --- /dev/null +++ b/training/src/processing/mappings/diag_copd_resp_desc.json @@ -0,0 +1,5 @@ +{ + "copd": "CHRONIC OBSTRUCTIVE PULMONARY DISEASE", + "resp": ["PNEUMONITIS DUE TO FOOD AND VOMIT", "RESPIRATORY FAILURE, UNSPECIFIED; TYPE UNSPECIFIED", "CHRONIC RESPIRATORY FAILURE; TYPE II [HYPERCAPNIC]", "BRONCHOPNEUMONIA, UNSPECIFIED", "DYSPNOEA", "PLEURAL EFFUSION IN CONDITIONS CLASSIFIED ELSEWHERE", "RESPIRATORY FAILURE, UNSPECIFIED; TYPE [HYPERCAPNIC]", "PLEURAL EFFUSION, NOT ELSEWHERE CLASSIFIED", "CHRONIC RESPIRATORY FAILURE", "OTHER BACTERIAL PNEUMONIA", "ABN MICROBIOLOGICAL FINDINGS IN SPECS FROM RESPIRATORY ORGANS AND THORAX", "RESPIRATORY FAILURE, UNSPECIFIED", "PNEUMONIA, UNSPECIFIED", "LOBAR PNEUMONIA, UNSPECIFIED", "COUGH", "PLEURAL PLAQUE WITH PRESENCE OF ASBESTOS", "PLEURAL PLAQUE WITHOUT ASBESTOS", "OTHER DISORDERS OF LUNG", "OTHER SPECIFIED PLEURAL CONDITIONS", "PULMONARY COLLAPSE", "ACQUIRED ABSENCE OF LUNG [PART OF]", "ASPHYXIATION", "RESPIRATORY FAILURE, UNSPECIFIED; TYPE 
[HYPOXIC]", "TRACHEOSTOMY STATUS", "ACUTE RESPIRATORY FAILURE", "UNSPECIFIED ACUTE LOWER RESPIRATORY INFECTION", "OTHER SPECIFIED SYMPTOMS AND SIGNS INVOLVING THE CIRC AND RESP SYSTEMS", "BACTERIAL PNEUMONIA, UNSPECIFIED", "PYOTHORAX WITHOUT FISTULA", "DISEASES OF BRONCHUS, NOT ELSEWHERE CLASSIFIED", "PNEUMONIA DUE TO HAEMOPHILUS INFLUENZAE", "ABNORMAL SPUTUM", "OTHER POSTPROCEDURAL RESPIRATORY DISORDERS", "OTHER AND UNSPECIFIED ABNORMALITIES OF BREATHING", "INFLUENZA WITH OTHER RESP MANIFESTATIONS, SEASONAL INFLUENZA VIRUS IDENTIF", "PERSONAL HISTORY OF DISEASES OF THE RESPIRATORY SYSTEM", "PNEUMONIA DUE TO STREPTOCOCCUS PNEUMONIAE", "WHEEZING", "CHEST PAIN ON BREATHING", "HAEMOPTYSIS", "INFLUENZA WITH OTHER MANIFESTATIONS, VIRUS NOT IDENTIFIED", "OTHER SPECIFIED RESPIRATORY DISORDERS", "ACUTE UPPER RESPIRATORY INFECTION, UNSPECIFIED", "T.B. OF LUNG, W/O MENTION OF BACTERIOLOGICAL OR HISTOLOGICAL CONFIRMATION", "DEPENDENCE ON RESPIRATOR", "PLEURISY", "BRONCHITIS, NOT SPECIFIED AS ACUTE OR CHRONIC"], + "anxiety_depression": ["ADVERSE EFFECTS OF OTHER SEDATIVES, HYPNOTICS AND ANTIANXIETY DRUGS", "ADVERSE EFFECTS OF SEDATIVE, HYPNOTIC AND ANTIANXIETY DRUG, UNSPECIFIED", "ANXIETY DISORDER, UNSPECIFIED", "ANXIOUS [AVOIDANT] PERSONALITY DISORDER", "GENERALIZED ANXIETY DISORDER", "MIXED ANXIETY AND DEPRESSIVE DISORDER", "OTHER MIXED ANXIETY DISORDERS", "OTHER PHOBIC ANXIETY DISORDERS", "OTHER SPECIFIED ANXIETY DISORDERS", "PANIC DISORDER [EPISODIC PAROXYSMAL ANXIETY]", "PHOBIC ANXIETY DISORDER, UNSPECIFIED", "ADVERSE EFFECTS OF MONOAMINE-OXIDASE-INHIBITOR ANTIDEPRESSANTS", "ADVERSE EFFECTS OF OTHER AND UNSPECIFIED ANTIDEPRESSANTS", "ADVERSE EFFECTS OF TRICYCLIC AND TETRACYCLIC ANTIDEPRESSANTS", "BIPOLAR AFFECT DISORDER, CUR EPISODE SEVERE DEPRESSION WITH PSYCHOTIC SYMP", "BIPOLAR AFFECTIVE DISORDER, CURR EPISODE SEV DEPRESSION W/O PSYCHOTIC SYMP", "BIPOLAR AFFECTIVE DISORDER, CURRENT EPISODE MILD OR MODERATE DEPRESSION", "BRIEF DEPRESSIVE REACTION", "DEPRESSIVE EPISODE, 
UNSPECIFIED", "MILD DEPRESSIVE EPISODE", "MIXED ANXIETY AND DEPRESSIVE DISORDER", "MODERATE DEPRESSIVE EPISODE", "OTHER DEPRESSIVE EPISODES", "OTHER RECURRENT DEPRESSIVE DISORDERS", "POISONING BY MONOAMINE-OXIDASE-INHIBITOR ANTIDEPRESSANTS", "POISONING BY OTHER AND UNSPECIFIED ANTIDEPRESSANTS", "POISONING BY TRICYCLIC AND TETRACYCLIC ANTIDEPRESSANTS", "POST-SCHIZOPHRENIC DEPRESSION", "RECURRENT DEPRESSIVE DISORDER, CURRENT EPISODE MODERATE", "RECURRENT DEPRESSIVE DISORDER, CURRENT EPISODE SEVERE W/O PSYCHOTIC SYMPT", "RECURRENT DEPRESSIVE DISORDER, CURRENT EPISODE SEVERE WITH PSYCHOTIC SYMPT", "RECURRENT DEPRESSIVE DISORDER, UNSPECIFIED", "SCHIZOAFFECTIVE DISORDER, DEPRESSIVE TYPE", "SEVERE DEPRESSIVE EPISODE WITH PSYCHOTIC SYMPTOMS", "SEVERE DEPRESSIVE EPISODE WITHOUT PSYCHOTIC SYMPTOMS"] + } \ No newline at end of file diff --git a/training/src/processing/mappings/inhaler_mapping.json b/training/src/processing/mappings/inhaler_mapping.json new file mode 100644 index 0000000000000000000000000000000000000000..4b62a3733c325693877b991a3b345148af7aa0fe --- /dev/null +++ b/training/src/processing/mappings/inhaler_mapping.json @@ -0,0 +1,55 @@ +{ + "SABA": [ + "BAMBUTEROL HYDROCHLORIDE", + "SALBUTAMOL", + "TERBUTALINE SULFATE" + ], + "LABA": [ + "FORMOTEROL FUMARATE", + "INDACATEROL", + "OLODATEROL", + "SALMETEROL" + ], + "LAMA": [ + "ACLIDINIUM BROMIDE", + "TIOTROPIUM", + "UMECLIDINIUM BROMIDE" + ], + "SAMA": [ + "IPRATROPIUM BROMIDE", + "LABA-LAMA", + "ACLIDINIUM BROMIDE AND FORMOTEROL FUMARATE", + "INDACATEROL WITH GLYCOPYRRONIUM BROMIDE", + "TIOTROPIUM AND OLODATEROL", + "UMECLIDINIUM BROMIDE AND VILANTEROL TRIFENATATE" + ], + "ICS": [ + "BECLOMETASONE DIPROPIONATE", + "BUDESONIDE", + "CICLESONIDE", + "FLUTICASONE PROPIONATE", + "MOMETASONE FUROATE" + ], + "LABA-ICS": [ + "BECLOMETASONE DIPROPIONATE AND FORMOTEROL FUMARATE", + "BUDESONIDE WITH FORMOTEROL FUMARATE", + "FLUTICASONE FUROATE AND VILANTEROL", + "FLUTICASONE PROPIONATE AND FORMOTEROL FUMARATE", + 
"SALMETEROL WITH FLUTICASONE PROPIONATE" + ], + "LAMA +LABA-ICS": [ + "BECLOMETASONE DIPROPIONATE AND FORMOTEROL FUMARATE AND GLYCOPYRRONIUM", + "FLUTICASONE FUROATE WITH UMECLIDINIUM BROMIDE AND VILANTEROL TRIFENATATE" + ], + "LABA-LAMA-ICS": [], + "SABA + SAMA": [ + "SALBUTAMOL WITH IPRATROPIUM" + ], + "MCS": [ + "NEDOCROMIL SODIUM", + "SODIUM CROMOGLICATE" + ], + "Ignore": [ + "MENTHOL WITH EUCALYPTUS" + ] +} \ No newline at end of file diff --git a/training/src/processing/mappings/test_mapping.json b/training/src/processing/mappings/test_mapping.json new file mode 100644 index 0000000000000000000000000000000000000000..260371d801627b9b5ef0eda7e30bea53e5a67ed2 --- /dev/null +++ b/training/src/processing/mappings/test_mapping.json @@ -0,0 +1 @@ +{"ALT": ["A.L.T.", "Alanine Transaminase"], "AST": ["A.S.T.", "Aspartate Transam", "Aspartate Transamina"], "Alkaline Phosphatase": "Alkaline Phos.", "Basophils": ["ABS BASOPHIL", "BASOPHIL (MANUAL)", "BASOPHILS", "Basophil count"], "C Reactive Protein": "C-reactive Protein", "Eosinophils": ["EOSINOPHIL (MANUAL)", "EOSINOPHILS", "Eosinophil count", "EOSINOPHILS ABSOLUTE", "Eosinophils\u017d"], "Haematocrit": "HAEMATOCRIT", "Haemoglobin": ["HAEMOGLOBIN", "HAEMOGLOBIN A1c"], "Lymphocytes": ["ABSOLUTE LYMPHOCYTES", "LYMPHOCYTES", "Lymphocyte Count", "Lymphocyte count"], "Mean Cell Volume": ["MEAN CELL VOLUME", "Mean cell volume"], "Monocytes": ["ABSOLUTE MONOCYTES", "MONOCYTES", "Monocyte count"], "Neutrophils": ["ABSOLUTE NEUTROPHILS", "NEUTROPHILS", "Neutrophil count"], "PCO2 Temp Corrected": "PCO2 temp corrected", "Platelets": ["PLATELET COUNT", "PLATELETS", "Platelet Count", "Platelet count"], "Red Blood Count": ["Red Cell Count", "RED BLOOD COUNT", "RED CELL COUNT", "Red Blood Cell Count", "Red blood cell (RBC) count", "Red blood count"], "Serum Vitamin B12": ["Serum vitamin B12", "SERUM B12"], "White Blood Count": ["WBC", "WBC - Biological Fl", "WHITE BLOOD CELLS", "WHITE BLOOD COUNT", "White Cell Count", "White blood 
count"]} \ No newline at end of file diff --git a/training/src/processing/misc/process_gples.py b/training/src/processing/misc/process_gples.py new file mode 100644 index 0000000000000000000000000000000000000000..94685355f19cea50f5ebe676034cf27632f1aba2 --- /dev/null +++ b/training/src/processing/misc/process_gples.py @@ -0,0 +1,73 @@ +""" +Process GPLES data +-------- +Extract the number of COPD GP events per patient per year +""" +import pandas as pd +from utils.common import read_data, first_patient_appearance + + +def initialize_gples_data(file): + """ + Load in and convert GPLES dataset to correct format + -------- + :param file: filename to read from + :return: gples dataframe with correct column names and types + """ + print('Loading GPLES data') + + # Read in data + gp_cols = ['SafeHavenID', 'EventDate', 'ShortName'] + gp_types = ['int', 'object', 'str'] + df = read_data(file, gp_cols, gp_types) + + # Drop nulls and duplicates + df = df.dropna().drop_duplicates() + + # Convert date columns to correct type + df.columns = ['SafeHavenID', 'ADMDATE', 'ShortName'] + df['ADMDATE'] = pd.to_datetime(df['ADMDATE']) + + # Only track COPD events + df = df[df.ShortName == 'COPD'][['SafeHavenID', 'ADMDATE']] + df['gp_copd_event'] = 1 + + return df + + +def extract_yearly_data(df): + """ + Extract data per year from GPLES dataset + -------- + :param df: gples dataframe to be processed + :return: reduced gples dataset + """ + print('Reducing GPLES data') + + # Extract year column for historical features + df['year'] = df.ADMDATE.dt.year + + # Extract yearly data + group_cols = ['SafeHavenID', 'year'] + gples_events = df.groupby(group_cols)[['gp_copd_event']].sum() + + return gples_events + + +def main(): + + # Load data + gp_file = "/EXAMPLE_STUDY_DATA/GPLES_Cohort3R.csv" + gples = initialize_gples_data(gp_file) + + # Save first date in dataset + first_patient_appearance(gples, 'ADMDATE', 'gples') + + # Reduce GPLES to 1 row per year per ID + gples_yearly = 
extract_yearly_data(gples) + + # Save data + gples_yearly.to_pickle('/Model_E_Extracts/gples_proc.pkl') + + +main() diff --git a/training/src/processing/misc/process_validation_adm.py b/training/src/processing/misc/process_validation_adm.py new file mode 100644 index 0000000000000000000000000000000000000000..708bd0b183d74e461968b7b1a4c49aad24b6cddb --- /dev/null +++ b/training/src/processing/misc/process_validation_adm.py @@ -0,0 +1,28 @@ +from utils.adm_common import (initialize_adm_data, correct_stays, + track_copd_resp) + + +def main(): + + # Load in data + adm_file = "/EXAMPLE_STUDY_DATA/SMR01_Cohort3R.csv" + adm = initialize_adm_data(adm_file) + + # Fill null STAY data and combine transfer admissions + adm = correct_stays(adm) + + # Track COPD and respiratory events + adm = track_copd_resp(adm) + + # Select relevant columns (copy so new columns can be added safely) + adm_reduced = adm[['SafeHavenID', 'ADMDATE', 'copd_event', 'resp_event']].copy() + + # Track events + adm_reduced['copd_resp_event'] = adm_reduced['copd_event'] | adm_reduced['resp_event'] + adm_reduced['adm_event'] = 1 + + # Save data + adm_reduced.to_pickle('/Model_E_Extracts/validation_adm_proc-og.pkl') + + +main() diff --git a/training/src/processing/misc/process_validation_presc.py b/training/src/processing/misc/process_validation_presc.py new file mode 100644 index 0000000000000000000000000000000000000000..f233896935524538c4ea5112b8bd7820a2069902 --- /dev/null +++ b/training/src/processing/misc/process_validation_presc.py @@ -0,0 +1,20 @@ +from utils.presc_common import initialize_presc_data, track_medication + + +def main(): + + # Read in data + presc_file = "/EXAMPLE_STUDY_DATA/Pharmacy_Cohort3R.csv" + presc = initialize_presc_data(presc_file) + + # Track salbutamol and rescue meds + presc = track_medication(presc) + + # Reduce columns + presc_reduced = presc[['SafeHavenID', 'PRESC_DATE', 'rescue_meds']] + + # Save data + presc_reduced.to_pickle('/Model_E_Extracts/validation_presc_proc-og.pkl') + + +main() diff --git 
a/training/src/processing/process_admissions.py b/training/src/processing/process_admissions.py new file mode 100644 index 0000000000000000000000000000000000000000..52edbadb5895a66e1f0ff78b2c20bd68614134fc --- /dev/null +++ b/training/src/processing/process_admissions.py @@ -0,0 +1,153 @@ +""" +Process SMR01 admission data +-------- +Clean and process admission data while adding tracking for COPD and respiratory +admissions per year for each SafeHavenID +""" +import json +import pandas as pd +from datetime import date +from dateutil.relativedelta import relativedelta +from utils.common import add_hist_adm_presc, first_patient_appearance +from utils.adm_common import (initialize_adm_data, correct_stays, + track_copd_resp) +from utils.adm_processing import (convert_ethgrp_desc, mode_ethnicity, + search_diag) +from utils.adm_reduction import fill_missing_years, calc_adm_per_year + + +def process_ethnicity(df): + """ + Find relevant ethnic group for each patient, accounting for null data + -------- + :param df: admission dataframe to be updated + :return: admission dataframe with ethnicity cleaned and updated + """ + print('Processing ethnicity') + + # Fill in missing ethnicities + df = df.rename(columns={'ETHGRP': 'eth_grp'}) + df['eth_grp'] = df.eth_grp.str.strip() + df['eth_grp'] = df.groupby('SafeHavenID')['eth_grp'].apply( + lambda x: x.ffill().bfill().fillna('Unknown')) + + # Convert to 1 of 7 ethnic groups + df['eth_grp'] = [convert_ethgrp_desc(eth) for eth in df.eth_grp] + + # Find most commonly occurring ethnicity per SafeHavenID + df = df.groupby('SafeHavenID').apply(mode_ethnicity, 'eth_grp') + + return df + + +def add_eoy_column(df, dt_col, eoy_date): + """ + Add EOY relative to user-specified end date + -------- + :param df: dataframe + :param dt_col: date column in dataframe + :param eoy_date: EOY date from config + :return: updated df with EOY column added + """ + # Needed to stop error with creating a new column + df = df.reset_index(drop=True) + + # 
Add column with user-specified end of year date + end_date = pd.to_datetime(eoy_date) + end_month = end_date.month + end_day = end_date.day + + # Add for every year + df['eoy'] = [date(y, end_month, end_day) for y in df[dt_col].dt.year] + + # Check that EOY date is after dt_col for each entry + eoy_index = df.columns[df.columns == 'eoy'] + adm_vs_eoy = df[dt_col] > df.eoy + row_index = df.index[adm_vs_eoy] + df.loc[row_index, eoy_index] = df[adm_vs_eoy].eoy + relativedelta(years=1) + df['eoy'] = pd.to_datetime(df.eoy) + + return df + + +def extract_yearly_data(df): + """ + Extract features on a yearly basis for each SafeHavenID + -------- + :param df: admission dataframe to be updated + :return: dataframe with feature values per year + """ + print('Reducing to 1 row per SafeHavenID per year') + + # Track rows which are admissions + df['adm'] = 1 + + # Add rows from years where patient did not have admissions + df = df.groupby('SafeHavenID').apply(fill_missing_years) + df = df.reset_index(drop=True) + + # Add any historical count columns + df = df.groupby('SafeHavenID').apply(add_hist_adm_presc, 'adm', 'ADMDATE') + df = df.reset_index(drop=True) + + # Reduce data to 1 row per year + df = calc_adm_per_year(df) + + # Select columns in final order + final_cols = ['eth_grp', 'adm_per_year', 'total_hosp_days', + 'mean_los', 'copd_per_year', 'resp_per_year', + 'anxiety_depression_per_year', 'days_since_copd', + 'days_since_resp', 'days_since_adm', 'adm_to_date', + 'copd_to_date', 'resp_to_date', 'anxiety_depression_to_date', + 'copd_date', 'resp_date', 'adm_date'] + + df = df[final_cols] + + return df + + +def main(): + + # Load in config items + with open('../../../config.json') as json_config_file: + config = json.load(json_config_file) + + # Load in data + adm_file = config['extract_data_path'] + 'SMR01_Cohort3R.csv' + adm = initialize_adm_data(adm_file) + + # Fill null STAY data and combine transfer admissions + adm = correct_stays(adm) + + # Save first date in dataset + 
data_path = config['model_data_path'] + first_patient_appearance(adm, 'ADMDATE', 'adm', data_path) + + # Process ethnicity data + adm = process_ethnicity(adm) + + # Track COPD and respiratory events + adm = track_copd_resp(adm) + + # Track anxiety event + adm = search_diag(adm, 'anxiety_depression') + + # Select relevant columns + reduced_cols = ['SafeHavenID', 'eth_grp', 'ADMDATE', 'STAY', 'copd_event', + 'resp_event', 'anxiety_depression_event'] + adm_reduced = adm[reduced_cols] + + # Save per event dataset + adm_reduced.to_pickle(data_path + 'validation_adm_proc.pkl') + + # Add column relative to user-specified date + adm_reduced = add_eoy_column(adm_reduced, 'ADMDATE', config['date']) + + # Extract yearly data + adm_yearly = extract_yearly_data(adm_reduced) + + # Save data + adm_yearly.to_pickle(data_path + 'adm_proc.pkl') + + +main() diff --git a/training/src/processing/process_comorbidities.py b/training/src/processing/process_comorbidities.py new file mode 100644 index 0000000000000000000000000000000000000000..6d174eab4137757d5b40bccd011ca19d7743b70f --- /dev/null +++ b/training/src/processing/process_comorbidities.py @@ -0,0 +1,161 @@ +""" +Process SMR01 comorbidities data +-------- +Clean and process comorbidities, tracking specific comorbidities and returning +the total number of comorbidities per patient per year +""" +import json +import pandas as pd +from datetime import date +from dateutil.relativedelta import relativedelta +from utils.common import track_event +from utils.adm_common import initialize_adm_data, correct_stays +from utils.comorb_processing import diagnosis_mapping_lists + + +def track_comorbidity(df, excel_file, sheet_name, diag_names): + """ + Map from admission descriptions to comorbidities using provided sheet. + Add new column for each comorbidity. 
+ -------- + :param df: pandas dataframe + :param excel_file: str filename for diagnosis mapping + :param sheet_name: str sheet name for diagnosis mapping + :param diag_names: list of diagnoses + :return: dataframe updated with diagnosis mapping + """ + print('Tracking comorbidities') + + # Load in mappings + mapping = diagnosis_mapping_lists(excel_file, sheet_name, diag_names) + + # Select relevant columns + diag_columns = ['DIAG1Desc', 'DIAG2Desc', 'DIAG3Desc', 'DIAG4Desc', + 'DIAG5Desc', 'DIAG6Desc'] + df_diag = df[diag_columns] + + # Create column for each comorbidity + for key in mapping: + com = mapping[key] + com_bool = df_diag.apply(lambda x: track_event(x, com, False)) + com_int = com_bool.any(axis=1).astype(int) + df[key] = com_int + + return df + + +def fill_comorbidities(df, diag_names): + """ + Forward-fill comorbidities so a diagnosis persists once recorded + -------- + :param df: dataframe of groupby values + :param diag_names: list of diagnoses + :return: updated dataframe + """ + + df[diag_names] = df[diag_names].replace(to_replace=0, method='ffill') + + return df + + +def add_eoy_column(df, dt_col, eoy_date): + """ + Add EOY relative to user-specified end date + -------- + :param df: dataframe + :param dt_col: date column in dataframe + :param eoy_date: EOY date from config + :return: updated df with EOY column added + """ + # Needed to stop error with creating a new column + df = df.reset_index(drop=True) + + # Add column with user-specified end of year date + end_date = pd.to_datetime(eoy_date) + end_month = end_date.month + end_day = end_date.day + + # Add for every year + df['eoy'] = [date(y, end_month, end_day) for y in df[dt_col].dt.year] + + # Check that EOY date is after dt_col for each entry + eoy_index = df.columns[df.columns == 'eoy'] + adm_vs_eoy = df[dt_col] > df.eoy + row_index = df.index[adm_vs_eoy] + df.loc[row_index, eoy_index] = df[adm_vs_eoy].eoy + relativedelta(years=1) + df['eoy'] = pd.to_datetime(df.eoy) + + return df + + +def add_yearly_stats(df): + """ + Sum 
comorbidities per patient per year + -------- + :param df: dataframe to update + :return: sum of comorbidities per patient per year + """ + print('Adding comorbidity count per year') + + # Drop cols not required anymore + cols_2_drop = ['ADMDATE', 'DISDATE', 'STAY', 'ETHGRP', 'DIAG1Desc', + 'DIAG2Desc', 'DIAG3Desc', 'DIAG4Desc', 'DIAG5Desc', + 'DIAG6Desc'] + df = df.drop(cols_2_drop, axis=1) + + # Sum comorbidities + df = df.groupby(['SafeHavenID', 'eoy']).last().sum(axis=1) + df = df.to_frame().rename(columns={0: 'comorb_per_year'}) + + return df + + +def main(): + + # Load in config items + with open('../../../config.json') as json_config_file: + config = json.load(json_config_file) + + # Load in data + adm_file = config['extract_data_path'] + 'SMR01_Cohort3R.csv' + adm = initialize_adm_data(adm_file) + + # Fill null STAY data and combine transfer admissions + adm = correct_stays(adm) + + # Prepare text data - strip string columns + adm = adm.apply(lambda x: x.str.strip() if x.dtype == 'object' else x) + + # Track comorbidities + excel_file = "mappings/Comorbidity feature review for models & clin " \ + "summary update v2 May 2021.xlsx" + sheet_name = 'Diagnosis category mapping3' + diag_names = ['Ischaemic_hd', 'Atrial_fib', 'pacemake', 'periph_vasc', + 'cog_imp', 'HF1', 'LV_sys', 'valv_hd', 'HF_pres_ejec', + 'hypertension', 'Cerebrovascula_dis', 'Diabetes_mel', + 'Osteoporosis', 'frailty', 'liver_dis', 'metastat_canc', + 'headneck_canc', 'breast_canc', 'gi_canc', 'other_canc', + 'kidney_dis', 'Asthma_ov', 'Pulmonary_fib', + 'Obstructive_apnoea', 'Pulmonary_hyp', 'Previous_pneum', + 'DVT_PTE', 'Lung_cancer', 'Bronchiectasis', 'Resp_fail'] + adm_comorb = track_comorbidity(adm, excel_file, sheet_name, diag_names) + + # Sort admissions chronologically + adm_comorb = adm_comorb.sort_values('ADMDATE').reset_index(drop=True) + + # Forward-fill comorbidities per patient + print('Filling comorbidities') + adm_filled = adm_comorb.groupby('SafeHavenID').apply( + fill_comorbidities, 
diag_names) + + # Add column relative to user-specified date + adm_filled = add_eoy_column(adm_filled, 'ADMDATE', config['date']) + + # Add yearly stats + adm_yearly = add_yearly_stats(adm_filled) + + # Save data + adm_yearly.to_pickle(config['model_data_path'] + 'comorb_proc.pkl') + + +main() diff --git a/training/src/processing/process_demographics.py b/training/src/processing/process_demographics.py new file mode 100644 index 0000000000000000000000000000000000000000..4fd7a957619f9f5cd52dc5bbd00e9cb6777fefe8 --- /dev/null +++ b/training/src/processing/process_demographics.py @@ -0,0 +1,74 @@ +""" +Process demographics data +-------- +Process DOB, sex, marital status and SIMD data +""" +import json +from utils.common import read_data, correct_column_names + + +def initialize_demo_data(demo_file): + """ + Load in demographics dataset to correct format + -------- + :param demo_file: demographics data file name + :return: demographics dataframe with correct column names and types + """ + print('Loading demographic data') + + # Read in data + demo_cols = ['SafeHavenID', 'OBF_DOB', 'SEX', 'MARITAL_STATUS', + 'SIMD_2009_QUINTILE', 'SIMD_2009_DECILE', + 'SIMD_2009_VIGINTILE', 'SIMD_2012_QUINTILE', + 'SIMD_2012_DECILE', 'SIMD_2012_VIGINTILE', + 'SIMD_2016_QUINTILE', 'SIMD_2016_DECILE', + 'SIMD_2016_VIGINTILE'] + demo_types = ['int', 'object', 'str', 'str', 'float', 'float', 'float', + 'float', 'float', 'float', 'float', 'float', 'float'] + df = read_data(demo_file, demo_cols, demo_types) + + # Nulls dropped later in process, only drop duplicates + df = df.drop_duplicates() + + return df + + +def process_sex(df): + """ + Process sex column in demographics + -------- + :param df: dataframe to update + :return: updated dataframe + """ + print('One-hot encoding sex') + + df['sex_bin'] = (df.SEX == 'F').astype(int) + + return df + + +def main(): + + # Load in config items + with open('../../../config.json') as json_config_file: + config = json.load(json_config_file) + + # Load 
in data + demo_file = config['extract_data_path'] + 'Demographics_Cohort3R.csv' + demo = initialize_demo_data(demo_file) + + # Create binary sex column + demo = process_sex(demo) + + # Drop original columns + demo = demo.drop('SEX', axis=1) + + # Correct column names + new_cols = correct_column_names(demo.columns[1:], 'demo') + demo.columns = ['SafeHavenID'] + new_cols + + # Save data + demo.to_pickle(config['model_data_path'] + 'demo_proc.pkl') + + +main() diff --git a/training/src/processing/process_labs.py b/training/src/processing/process_labs.py new file mode 100644 index 0000000000000000000000000000000000000000..7eb12342e519c42933ffb22aa283df2f24233408 --- /dev/null +++ b/training/src/processing/process_labs.py @@ -0,0 +1,247 @@ +""" +Script for preprocessing labs data +-------- +Track median values for labs tests over the previous 2 years for patients +with resulting dataset containing 1 row of information per patient per year +""" +import json +import pandas as pd +import numpy as np +from datetime import date +from dateutil.relativedelta import relativedelta +from utils.common import (read_data, correct_column_names, + first_patient_appearance) +from utils.labs_processing import add_total_labs + + +def initialize_labs_data(labs_file): + """ + Load in labs dataset to correct format + -------- + :param labs_file: labs data file name + :return: labs dataframe with correct column names and types + """ + print('Loading labs data') + + # Read in data + old_cols = ['SafeHavenID', 'SAMPLEDATE', 'CLINICALCODEDESCRIPTION', + 'QUANTITYVALUE', 'RANGEHIGHVALUE', 'RANGELOWVALUE'] + labs_types = ['int', 'object', 'str', 'float', 'float', 'float'] + df = read_data(labs_file, old_cols, labs_types) + + # Rename columns to CamelCase + new_cols = ['SafeHavenID', 'SampleDate', 'ClinicalCodeDescription', + 'QuantityValue', 'RangeHighValue', 'RangeLowValue'] + mapping = dict(zip(old_cols, new_cols)) + df = df.rename(columns=mapping) + + # Drop any nulls, duplicates or negative 
(broken) test values + df = df.dropna().drop_duplicates() + + # Check tests are valid (values > -1) + num_cols = ['QuantityValue', 'RangeHighValue', 'RangeLowValue'] + df = df[(df[num_cols] > -1).all(axis=1)] + + # Select final columns + final_cols = ['SafeHavenID', 'SampleDate', 'ClinicalCodeDescription', + 'QuantityValue'] + df = df[final_cols].copy() + + # Convert date + df['SampleDate'] = pd.to_datetime(df.SampleDate) + + return df + + +def clean_labs(df): + """ + Clean descriptions and select relevant tests + -------- + :param df: pandas dataframe + :return: cleaned dataframe + """ + print('Cleaning labs data') + + lab_tests = ['ALT', 'AST', 'Albumin', 'Alkaline Phosphatase', 'Basophils', + 'C Reactive Protein', 'Chloride', 'Creatinine', 'Eosinophils', + 'Estimated GFR', 'Haematocrit', 'Haemoglobin', 'Lymphocytes', + 'MCH', 'Mean Cell Volume', 'Monocytes', 'Neutrophils', + 'PCO2 (temp corrected', 'Platelets', 'Potassium', + 'Red Blood Count', 'Serum vitamin B12', 'Sodium', + 'Total Bilirubin', 'Urea', 'White Blood Count'] + + # Strip any whitespace + str_col = 'ClinicalCodeDescription' + df[str_col] = df[str_col].str.strip() + + # Read in test mapping + with open('mappings/test_mapping.json') as json_file: + test_mapping = json.load(json_file) + + # Correct names for relevant tests + for k, v in test_mapping.items(): + df[str_col] = df[str_col].replace(v, k) + + # Select relevant tests + df = df[df[str_col].isin(lab_tests)] + + return df + + +def add_neut_lypmh(df): + """ + Pivot dataframe and calculate neut_lymph feature + -------- + :param df: pandas dataframe + :return: pivoted dataframe + """ + print('Calculating neut_lymph data') + + # Pivot table with CCDesc as headers and QuantityValue as values + df = pd.pivot_table( + df, index=['SafeHavenID', 'SampleDate'], + columns=['ClinicalCodeDescription'], values='QuantityValue', + dropna=True).reset_index() + + # Add neut_lymph feature + df['neut_lymph'] = df.Neutrophils / df.Lymphocytes + + # 
Replace any infinite values + df['neut_lymph'] = df.neut_lymph.replace([np.inf, -np.inf], np.nan) + + return df + + +def add_eoy_column(df, dt_col, eoy_date): + """ + Add EOY relative to user-specified end date + -------- + :param df: dataframe + :param dt_col: date column in dataframe + :param eoy_date: EOY date from config + :return: updated df with EOY column added + """ + # Needed to stop error with creating a new column + df = df.reset_index(drop=True) + + # Add column with user-specified end of year date + end_date = pd.to_datetime(eoy_date) + end_month = end_date.month + end_day = end_date.day + + # Add for every year + df['eoy'] = [date(y, end_month, end_day) for y in df[dt_col].dt.year] + + # Check that EOY date is after dt_col for each entry + eoy_index = df.columns[df.columns == 'eoy'] + adm_vs_eoy = df[dt_col] > df.eoy + row_index = df.index[adm_vs_eoy] + df.loc[row_index, eoy_index] = df[adm_vs_eoy].eoy + relativedelta(years=1) + df['eoy'] = pd.to_datetime(df.eoy) + + return df + + +def reduce_labs_data(df, dt_col): + """ + Reduce dataset to 1 row per ID per year looking back at the median values + over the previous 2 years + -------- + :param df: pandas dataframe + :param dt_col: date column + :return: reduced labs dataframe + """ + print('Reducing labs to 1 row per patient per year') + + group_cols = ['SafeHavenID', 'eoy'] + med_cols = ['ALT', 'AST', 'Albumin', 'Alkaline Phosphatase', 'Basophils', + 'C Reactive Protein', 'Chloride', 'Creatinine', 'Eosinophils', + 'Estimated GFR', 'Haematocrit', 'Haemoglobin', 'Lymphocytes', + 'MCH', 'Mean Cell Volume', 'Monocytes', 'Neutrophils', + 'Platelets', 'Potassium', 'Red Blood Count', 'Sodium', + 'Total Bilirubin', 'Urea', 'White Blood Count', 'neut_lymph'] + + # Add column to track labs per year + df['labs'] = 1 + + # Sort by date and extract year + df = df.sort_values(dt_col) + + # Include data from previous year + shifted = df[['eoy']] + pd.DateOffset(years=1) + new_tab = df[['SafeHavenID', dt_col] + 
med_cols].join(shifted) + combined_cols = ['SafeHavenID', 'eoy', dt_col] + med_cols + combined = pd.concat([df[combined_cols], new_tab]) + combined = combined.sort_values(dt_col) + + # Extract median data for last 2 years + df_med = combined.groupby(group_cols).median() + + # Rename median columns + new_med_cols = [col + '_med_2yr' for col in df_med.columns] + df_med.columns = new_med_cols + + # Only carry forward year data that appeared in df + test = [] + for k, v in df.groupby('SafeHavenID')['eoy'].unique().to_dict().items(): + test.append(df_med.loc[(k, v), ]) + df_med = pd.concat(test) + + # Extract features to take the last value of + df_last = df[group_cols + ['labs_to_date']] + df_last = df_last.groupby(group_cols).last() + + # Extract features to sum + df_sum = df[group_cols + ['labs']] + df_sum = df_sum.groupby(group_cols)['labs'].sum() + + # Rename sum columns + df_sum = df_sum.to_frame() + df_sum.columns = ['labs_per_year'] + + # Merge datasets + df_annual = df_med.join(df_last).join(df_sum) + + return df_annual + + +def main(): + + # Load in config items + with open('../../../config.json') as json_config_file: + config = json.load(json_config_file) + + # Load in data + labs_file = config['extract_data_path'] + 'SCI_Store_Cohort3R.csv' + labs = initialize_labs_data(labs_file) + + # Clean data + labs = clean_labs(labs) + + # Save first date in dataset + data_path = config['model_data_path'] + first_patient_appearance(labs, 'SampleDate', 'labs', data_path) + + # Pivot and add neut_lymph + labs = add_neut_lypmh(labs) + + # Add EOY column relative to user specified date + labs = add_eoy_column(labs, 'SampleDate', config['date']) + labs = labs.sort_values('SampleDate') + + # Track each lab event + labs['labs_to_date'] = 1 + labs = labs.groupby('SafeHavenID').apply(add_total_labs) + labs = labs.reset_index(drop=True) + + # Reduce labs to 1 row per ID per year + labs_yearly = reduce_labs_data(labs, 'SampleDate') + + # Correct column names + 
labs_yearly.columns = correct_column_names(labs_yearly.columns, 'labs') + + # Save data + labs_yearly.to_pickle(data_path + 'labs_proc.pkl') + + +main() diff --git a/training/src/processing/process_prescribing.py b/training/src/processing/process_prescribing.py new file mode 100644 index 0000000000000000000000000000000000000000..f9c96106988bec66138bc5a9255af3a42fc59372 --- /dev/null +++ b/training/src/processing/process_prescribing.py @@ -0,0 +1,145 @@ +""" +Script for preprocessing pharmacy data +-------- +Process pharmacy data and track inhaler prescriptions and rescue meds +""" +import json +import pandas as pd +from datetime import date +from dateutil.relativedelta import relativedelta +from utils.common import (add_hist_adm_presc, correct_column_names, + first_patient_appearance) +from utils.presc_common import initialize_presc_data, track_medication + + +def add_inhaler_mappings(df): + """ + Load inhaler prescription mappings and track where they appear in the data + -------- + :param df: dataframe + :return: dataframe with column added for each inhaler type + """ + print('Mapping inhaler prescriptions') + + # Load in inhaler mapping + with open('mappings/inhaler_mapping.json') as json_file: + inhaler_mapping = json.load(json_file) + + for k, v in inhaler_mapping.items(): + df[k + '_inhaler'] = df.PI_Approved_Name.str.contains( + '|'.join(v)).astype(int) + + # Remove for now as empty + df = df.drop(['LABA-LAMA-ICS_inhaler', 'Ignore_inhaler'], axis=1) + + return df + + +def add_eoy_column(df, dt_col, eoy_date): + """ + Add EOY relative to user-specified end date + -------- + :param df: dataframe + :param dt_col: date column in dataframe + :param eoy_date: EOY date from config + :return: updated df with EOY column added + """ + # Needed to stop error with creating a new column + df = df.reset_index(drop=True) + + # Add column with user-specified end of year date + end_date = pd.to_datetime(eoy_date) + end_month = end_date.month + end_day = end_date.day + + # 
Add for every year + df['eoy'] = [date(y, end_month, end_day) for y in df[dt_col].dt.year] + + # Check that EOY date is after dt_col for each entry + eoy_index = df.columns[df.columns == 'eoy'] + adm_vs_eoy = df[dt_col] > df.eoy + row_index = df.index[adm_vs_eoy] + df.loc[row_index, eoy_index] = df[adm_vs_eoy].eoy + relativedelta(years=1) + df['eoy'] = pd.to_datetime(df.eoy) + + return df + + +def calc_presc_per_year(df): + """ + Reduce data to 1 row per patient per year + -------- + :param df: dataframe to be reduced + :return: reduced dataframe + """ + print('Reducing to 1 row per patient per year') + + # Add end of year columns + eoy_cols = ['presc_to_date', 'days_since_rescue', 'rescue_to_date', + 'anxiety_depression_presc_to_date', 'rescue_date'] + last = df.groupby(['SafeHavenID', 'eoy'])[eoy_cols].last() + + # Total columns + sum_cols = ['SALBUTAMOL', 'SABA_inhaler', 'LABA_inhaler', 'LAMA_inhaler', + 'SAMA_inhaler', 'ICS_inhaler', 'LABA-ICS_inhaler', + 'LAMA +LABA-ICS_inhaler', 'SABA + SAMA_inhaler', + 'MCS_inhaler', 'rescue_meds', 'presc', 'anxiety_depression_presc'] + total_cols = [col + '_per_year' for col in sum_cols] + total = df.groupby(['SafeHavenID', 'eoy'])[sum_cols].sum() + total.columns = total_cols + + # Join together + results = last.join(total) + + return results + + +def main(): + + # Load in config items + with open('../../../config.json') as json_config_file: + config = json.load(json_config_file) + + # Load in data + presc_file = config['extract_data_path'] + 'Pharmacy_Cohort3R.csv' + presc = initialize_presc_data(presc_file) + + # Save first date in dataset + data_path = config['model_data_path'] + first_patient_appearance(presc, 'PRESC_DATE', 'presc', data_path) + + # Add inhaler mapping + presc = add_inhaler_mappings(presc) + + # Track salbutamol and rescue meds + presc = track_medication(presc) + + # Drop columns + cols_2_drop = ['PI_Approved_Name', 'PI_BNF_Item_Code', 'code'] + presc = presc.drop(cols_2_drop, axis=1) + + # Add column relative to user-specified 
date + presc = add_eoy_column(presc, 'PRESC_DATE', config['date']) + + # Track rows which are prescriptions + presc['presc'] = 1 + + # Add any historical count columns + presc = presc.groupby('SafeHavenID').apply( + add_hist_adm_presc, 'presc', 'PRESC_DATE') + presc = presc.reset_index(drop=True) + + # Save per event dataset + presc.to_pickle(data_path + 'validation_presc_proc.pkl') + + # Reduce data to 1 row per year + presc_yearly = calc_presc_per_year(presc) + + # Correct column names + presc_yearly.columns = correct_column_names(presc_yearly.columns, 'presc') + + # Save data + presc_yearly.to_pickle(data_path + 'presc_proc.pkl') + + +main() diff --git a/training/src/processing/utils/README.md b/training/src/processing/utils/README.md new file mode 100644 index 0000000000000000000000000000000000000000..3f4bddb4cd54d536490bc810a614b30b96f4cc64 --- /dev/null +++ b/training/src/processing/utils/README.md @@ -0,0 +1,11 @@ +# Processing Utilities + +This folder contains processing utilities called by the main processing scripts in the parent folder. 
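The main processing scripts each assign every event to a reporting year that ends on a user-specified end-of-year (EOY) date. The sketch below is a minimal, vectorised illustration of that rule and is not part of the repository: `assign_eoy` is a hypothetical name, and the sample dates are made up for the example.

```python
import pandas as pd


def assign_eoy(dates, eoy_date):
    """Return the first EOY date on or after each event date.

    Hypothetical helper: the month and day are taken from eoy_date, and an
    event after that year's EOY date rolls forward to the following year.
    """
    end = pd.to_datetime(eoy_date)
    dates = pd.to_datetime(pd.Series(list(dates)))

    # EOY date in the same calendar year as each event
    eoy = pd.to_datetime(pd.DataFrame({
        'year': dates.dt.year, 'month': end.month, 'day': end.day}))

    # Events after that year's EOY date belong to the next year's window
    late = dates > eoy
    eoy[late] = eoy[late] + pd.DateOffset(years=1)
    return eoy


events = ['2019-01-15', '2019-06-01', '2019-03-31']
print(assign_eoy(events, '2020-03-31').dt.strftime('%Y-%m-%d').tolist())
```

An event that falls exactly on the EOY date stays in that year's window; anything later rolls into the next one. Only the month and day of the configured date are used, so the year in `'2020-03-31'` does not affect the result.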
+ +- `adm_processing.py`, `comorb_processing.py` and `labs_processing.py` contain utilities specific to each data type + +- `adm_reduction.py` contains reduction functions required for processing admissions + +- `common.py` contains functions used when processing all datasets + +- `adm_common.py` contains functions shared by admissions and comorbidities processing diff --git a/training/src/processing/utils/__init__.py b/training/src/processing/utils/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..cd5026e4a1076584bc8ccbda0acf1605dbfafa11 --- /dev/null +++ b/training/src/processing/utils/__init__.py @@ -0,0 +1 @@ +# Empty file for folder to be recognised as module diff --git a/training/src/processing/utils/adm_common.py b/training/src/processing/utils/adm_common.py new file mode 100644 index 0000000000000000000000000000000000000000..989caceb0862cb4e6c28915dc788c418035e77bf --- /dev/null +++ b/training/src/processing/utils/adm_common.py @@ -0,0 +1,77 @@ +""" +Utility functions common across admission processing +(admissions/comorbidities/gples) +""" +import pandas as pd +from utils.common import read_data +from utils.adm_processing import (update_null_stay, calculate_total_stay, + search_diag) + + +def initialize_adm_data(adm_file): + """ + Load in and convert admission dataset to correct format + -------- + :param adm_file: admission data file name + :return: admission dataframe with correct column names and types + """ + print('Loading admission data') + + # Read in data + adm_cols = ['SafeHavenID', 'ETHGRP', 'ADMDATE', 'DISDATE', 'STAY', + 'DIAG1Desc', 'DIAG2Desc', 'DIAG3Desc', 'DIAG4Desc', + 'DIAG5Desc', 'DIAG6Desc'] + adm_types = ['int', 'object', 'object', 'object', 'int', + 'str', 'str', 'str', 'str', 'str', 'str'] + df = read_data(adm_file, adm_cols, adm_types) + + # Drop duplicates - nulls needed in DIAGDesc columns + df = df.drop_duplicates() + + # Convert date columns to correct type + df['ADMDATE'] = pd.to_datetime(df['ADMDATE']) + df['DISDATE'] = 
pd.to_datetime(df['DISDATE']) + + return df + + +def correct_stays(df): + """ + Fill any null STAY data and consolidate any transfer admissions into single + admission occurrences + -------- + :param df: admission dataframe to be corrected + :return: admission dataframe with null stays filled and transfers combined + """ + print('Correcting stays') + + # Update any null STAY data using ADM and DIS dates + df = update_null_stay(df) + + # Correct stays for patients passed across departments + df = df.sort_values(['SafeHavenID', 'ADMDATE', 'DISDATE']) + df = df.groupby('SafeHavenID').apply(calculate_total_stay) + df = df.reset_index(drop=True) + + return df + + +def track_copd_resp(df): + """ + Search for COPD and/or respiratory admissions + -------- + :param df: admission dataframe to be updated + :return: updated dataframe with events tracked + """ + print('Tracking events') + + # Strip DIAGDesc columns + df = df.apply(lambda x: x.str.strip() if x.dtype == 'object' else x) + + # Track COPD admissions + df = search_diag(df, 'copd') + + # Track respiratory admissions + df = search_diag(df, 'resp') + + return df \ No newline at end of file diff --git a/training/src/processing/utils/adm_processing.py b/training/src/processing/utils/adm_processing.py new file mode 100644 index 0000000000000000000000000000000000000000..9ca0108a183d4f970c8d92ae82306a84d9e7f2b3 --- /dev/null +++ b/training/src/processing/utils/adm_processing.py @@ -0,0 +1,146 @@ +""" +Admission processing utilities +""" +import json +import numpy as np +from utils.common import track_event + + +def update_null_stay(df): + """ + Calculate length of stay based on ADM/DISDATE for null STAY values + -------- + :param df: pandas dataframe to be updated + :return: updated dataframe + """ + # Check for nulls + is_null = df.STAY.isnull() + + # If null calculate total length of stay + if sum(is_null) > 0: + null_stay = np.where(is_null) + for i in null_stay: + stay = df.loc[i, 'DISDATE'].item() - df.loc[i, 
'ADMDATE'].item() + df.loc[i, 'STAY'] = float(stay.days) + + return df + + +def calculate_total_stay(df): + """ + Convert admissions with same ADMDATE as previous DISDATE to single + admission where patient has been transferred between departments + -------- + :param df: pandas dataframe to be updated + :return: updated dataframe + """ + df.reset_index(inplace=True, drop=True) + rows_to_drop = [] + + # If ADMDATE matches previous DISDATE, mark as transfer and combine + df['transfer'] = df.ADMDATE.eq(df.DISDATE.shift()) + for index, row in df.iloc[1:].iterrows(): + if row.transfer is True: + df.loc[index, 'ADMDATE'] = df.iloc[index - 1].ADMDATE + df.loc[index, 'STAY'] = row.STAY + df.iloc[index - 1].STAY + rows_to_drop.append(index - 1) + + # Drop original individual rows in transfer + df.drop(rows_to_drop, inplace=True) + + # Drop tracking column + df.drop('transfer', axis=1, inplace=True) + + return df + + +def convert_ethgrp_desc(eth): + """ + Find ethnic group based on given ETHGRP string + -------- + :param eth: str ethnic group description in the style of SMR01 data + :return: string ethnicity + """ + if ("White" in eth) | ("Irish" in eth) | ("Welsh" in eth) | ("English" in eth): + return "White" + + elif eth.startswith("British"): + return "White" + + elif "mixed" in eth: + return "Mixed" + + elif ("Asian" in eth) | ("Pakistani" in eth) | ("Indian" in eth) | ("Bangladeshi" in eth) | ("Chinese" in eth): + return "Asian" + + elif ("Black" in eth) | ("Caribbean" in eth) | ("African" in eth): + return "Black" + + elif ("Arab" in eth) | ("other ethnic" in eth): + return "Other" + + elif "Refused" in eth: + return "Refused" + + else: + return "Unknown" + + +def mode_ethnicity(v, eth_col): + """ + Select the most commonly occurring ethnicity for each patient in groupby + -------- + :param v: pandas patient dataframe to be updated + :param eth_col: str ethnicity column + :return: updated subset of data with common ethnicity per ID + """ + eth = v[eth_col] + n =
eth.nunique() + has_unk = eth.str.contains('Unknown') + any_unk = any(has_unk) + wout_unk = has_unk.apply(lambda x: x is False) + has_ref = eth.str.contains('Refused') + any_ref = any(has_ref) + wout_ref = has_ref.apply(lambda x: x is False) + + # Select ethnicities excluding 'Unknown' or 'Refused' where possible + if any_unk & any_ref & (n > 2): + eth = eth[wout_unk & wout_ref] + elif any_unk & (n > 1): + eth = eth[wout_unk] + elif any_ref & (n > 1): + eth = eth[wout_ref] + + # Select the most commonly appearing ethnicity + main_eth = eth.mode().values[0] + v[eth_col] = main_eth + + return v + + +def search_diag(df, typ): + """ + Search diagnosis columns for descriptions indicative of copd or resp events + -------- + :param df: dataframe to search + :param typ: 'copd', 'resp' or 'anxiety_depression' + :return: dataframe with column added tracking specific type of admission + """ + # Columns to search + diag_cols = ['DIAG1Desc', 'DIAG2Desc', 'DIAG3Desc', 'DIAG4Desc', + 'DIAG5Desc', 'DIAG6Desc'] + + # Load mappings + copd_resp_desc = json.load(open('mappings/diag_copd_resp_desc.json')) + + # Select mappings relevant to desired type of admission + desc = copd_resp_desc[typ] + + # copd descriptions will only require searching a single specific phrase + single = typ == 'copd' + + # Search columns and track + df[typ + '_event'] = df[diag_cols].apply( + lambda x: track_event(x, desc, single)).any(axis=1).astype(int) + + return df diff --git a/training/src/processing/utils/adm_reduction.py b/training/src/processing/utils/adm_reduction.py new file mode 100644 index 0000000000000000000000000000000000000000..fdf5b56b0675f548e8158b6ec5dc9b5e874614cb --- /dev/null +++ b/training/src/processing/utils/adm_reduction.py @@ -0,0 +1,65 @@ +""" +Admission reduction utilities +""" +import pandas as pd +from datetime import date + + +def fill_missing_years(df): + """ + Add admission data from years where patient is missing from the dataset + -------- + :param df: dataframe to be 
updated + :return: dataframe with missing years added + """ + df = df.sort_values('ADMDATE') + year_col = df.eoy.dt.year.tolist() + end_month = df.eoy.dt.month.iloc[0] + end_day = df.eoy.dt.day.iloc[0] + + # We only want missing years + year_range = range(year_col[0] + 1, year_col[-1]) + years = [y for y in year_range if y not in year_col] + + # If any years missing add rows + if len(years) > 0: + sh_id = df.SafeHavenID.iloc[0] + eth_grp = df.eth_grp.iloc[0] + adm_dates = pd.to_datetime([date(y, end_month, end_day) for y in years]) + data = {'SafeHavenID': sh_id, 'eth_grp': eth_grp, 'ADMDATE': adm_dates, + 'STAY': 0, 'copd_event': 0, 'resp_event': 0, 'eoy': adm_dates, + 'adm': 0, 'anxiety_depression_event': 0} + missed_years = pd.DataFrame(data) + df = pd.concat([df, missed_years]).sort_values('ADMDATE') + + return df + + +def calc_adm_per_year(df): + """ + Reduce data to 1 row per year + -------- + :param df: dataframe to be reduced + :return: reduced dataframe + """ + # Last EOY columns + eoy_cols = ['eth_grp', 'days_since_copd', 'days_since_resp', 'days_since_adm', + 'adm_to_date', 'copd_to_date', 'resp_to_date', + 'anxiety_depression_to_date', 'copd_date', 'resp_date', 'adm_date'] + last = df.groupby(['SafeHavenID', 'eoy'])[eoy_cols].last() + + # Average column + los = df.groupby(['SafeHavenID', 'eoy'])[['STAY']].mean() + los.columns = ['mean_los'] + + # Total columns + sum_cols = ['adm', 'copd_event', 'resp_event', 'anxiety_depression_event', 'STAY'] + total_cols = ['adm_per_year', 'copd_per_year', 'resp_per_year', + 'anxiety_depression_per_year', 'total_hosp_days'] + total = df.groupby(['SafeHavenID', 'eoy'])[sum_cols].sum() + total.columns = total_cols + + # Join together + results = last.join(los).join(total) + + return results \ No newline at end of file diff --git a/training/src/processing/utils/common.py b/training/src/processing/utils/common.py new file mode 100644 index 0000000000000000000000000000000000000000..fef50baafde1fa67b94aed7326d56017b704e243
--- /dev/null +++ b/training/src/processing/utils/common.py @@ -0,0 +1,132 @@ +""" +Utilities required across all processing scripts +""" +import pandas as pd +import numpy as np + + +def read_data(file, cols, types): + """ + Read in data source + -------- + :param file: string filename + :param cols: string list of column names + :param types: string list of column types + :return: dataframe + """ + schema = dict(zip(cols, types)) + df = pd.read_csv(file, usecols=cols, encoding="cp1252", dtype=schema) + + return df + + +def first_patient_appearance(df, dt_col, typ, data_path): + """ + Save first appearance of patient in dataset + -------- + :param df: dataframe to check + :param dt_col: date column to sort by + :param typ: type of dataset being used + :param data_path: path to data extracts + :return: None, dataframe with first dates saved + """ + df = df.sort_values(dt_col) + df_first = df.groupby('SafeHavenID')[dt_col].first() + df_first = df_first.to_frame().reset_index() + df_first.columns = ['SafeHavenID', 'first_adm'] + df_first.to_pickle(data_path + typ + '_first_dates.pkl') + + +def add_days_since_event(df, typ, dt_col): + """ + Historical features: add days since features e.g. 
copd/resp/rescue + -------- + :param df: dataframe to be updated + :param typ: 'rescue', 'copd' or 'resp' feature to be created + :param dt_col: str date column name + :return: updated dataframe with historical column added + """ + if typ == 'rescue': + event_col = 'rescue_meds' + elif typ == 'adm': + event_col = 'adm' + else: + event_col = typ + '_event' + date_col = typ + '_date' + days_col = 'days_since_' + typ + df[date_col] = df.apply( + lambda x: x[dt_col] if x[event_col] else np.nan, axis=1).ffill() + if df[date_col].isna().all(): + df[days_col] = np.nan + else: + df[days_col] = (df.eoy - df[date_col]).dt.days + + return df + + +def track_event(x, desc, single): + """ + Fill nulls and search to see if x matches a description + -------- + :param x: str list of features to track + :param desc: str list to compare + :param single: boolean for checking against single description e.g. + "COPD" True otherwise False + :return: tracked feature list + """ + x = x.fillna('') + + # COPD only has single description to search + if single: + result = [desc in s for s in x] + + # Respiratory has a list of descriptions to search + else: + result = [s in desc for s in x] + + return result + + +def add_hist_adm_presc(df, typ, dt_col): + """ + Historical features: add days since and to-date features + -------- + :param df: dataframe to be updated + :param typ: type of data - 'adm' or 'presc' + :param dt_col: string name of date column + :return: updated dataframe with historical columns added + """ + if typ == 'presc': + df = df.sort_values(dt_col).reset_index(drop=True) + df = add_days_since_event(df, 'rescue', dt_col) + df['rescue_to_date'] = df.rescue_meds.cumsum() + df['anxiety_depression_presc_to_date'] = df.anxiety_depression_presc.cumsum() + else: + for col in ['adm', 'copd', 'resp']: + df = add_days_since_event(df, col, dt_col) + for col in ['copd', 'resp', 'anxiety_depression']: + df[col + '_to_date'] = df[col + '_event'].cumsum() + + # Add counter for events to date 
+ df[typ + '_to_date'] = df[typ].cumsum() + + return df + + +def correct_column_names(cols, typ): + """ + Convert column names to lower case and fill any spaces with underscores + -------- + :param cols: string list of column names + :param typ: type of dataset being updated + :return: cleaned column names + """ + print('Correcting column headers') + + if typ == 'presc': + lower_cols = cols.str.replace('[+-]', ' ', regex=True).str.lower() + new_cols = ["_".join(col.split()) for col in lower_cols] + else: + new_cols = cols.str.lower().str.replace(' ', '_').tolist() + + return new_cols \ No newline at end of file diff --git a/training/src/processing/utils/comorb_processing.py b/training/src/processing/utils/comorb_processing.py new file mode 100644 index 0000000000000000000000000000000000000000..c083e44cba408b47bc8668d0ae8ce84ff16f626b --- /dev/null +++ b/training/src/processing/utils/comorb_processing.py @@ -0,0 +1,20 @@ +""" +Comorbidities processing utilities +""" +import pandas as pd + + +def diagnosis_mapping_lists(excel_file, sheet_name, diagnosis_names): + """ + Create mapping between diagnoses and comorbidities + -------- + :param excel_file: str filename for diagnosis mapping + :param sheet_name: str sheet name for diagnosis mapping + :param diagnosis_names: str list of diagnoses + :return: dictionary of diagnosis names and values + """ + df_diag = pd.read_excel(excel_file, sheet_name, skiprows=range(0, 1)) + df_lists = df_diag.T.values.tolist() + diag_lists = [[s.strip() for s in x if pd.notna(s)] for x in df_lists] + + return dict(zip(diagnosis_names, diag_lists)) diff --git a/training/src/processing/utils/labs_processing.py b/training/src/processing/utils/labs_processing.py new file mode 100644 index 0000000000000000000000000000000000000000..5823f982a36c61d53a70c4d3ae7c371f21017e3e --- /dev/null +++ b/training/src/processing/utils/labs_processing.py @@ -0,0 +1,16 @@ +""" +Labs processing utilities +""" + + +def add_total_labs(df): + """ + Historical features: to-date
features + -------- + :param df: dataframe to be updated + :return: updated dataframe with historical columns added + """ + # Add counter for labs to date + df['labs_to_date'] = df.labs_to_date.cumsum() + + return df diff --git a/training/src/processing/utils/presc_common.py b/training/src/processing/utils/presc_common.py new file mode 100644 index 0000000000000000000000000000000000000000..65bb0e774df5158aba293df4a327f3ab92f9a82f --- /dev/null +++ b/training/src/processing/utils/presc_common.py @@ -0,0 +1,68 @@ +import pandas as pd +from utils.common import read_data + + +steroid_codes = ['0603020T0AAACAC', '0603020T0AABKBK', '0603020T0AAAXAX', + '0603020T0AAAGAG', '0603020T0AABHBH', '0603020T0AAACAC', + '0603020T0AABKBK', '0603020T0AABNBN', '0603020T0AAAGAG', + '0603020T0AABHBH'] + +antib_codes = ['0501013B0AAAAAA', '0501013B0AAABAB', '0501030I0AAABAB', + '0501030I0AAAAAA', '0501050B0AAAAAA', '0501050B0AAADAD', + '0501013K0AAAJAJ'] + +exac_meds = steroid_codes + antib_codes + + +def initialize_presc_data(presc_file): + """ + Load in and convert prescribing dataset to correct format + -------- + :param presc_file: prescribing data file name + :return: prescribing dataframe with correct column names and types + """ + print('Loading prescribing data') + + # Read in data + presc_cols = ['SafeHavenID', 'PRESC_DATE', 'PI_Approved_Name', + 'PI_BNF_Item_Code'] + presc_types = ['int', 'object', 'str', 'str'] + df = read_data(presc_file, presc_cols, presc_types) + + # Drop any nulls or duplicates + df = df.dropna() + df = df.drop_duplicates() + + # Convert date + df['PRESC_DATE'] = pd.to_datetime(df.PRESC_DATE) + + return df + + +def track_medication(df): + """ + Track salbutamol and rescue med prescriptions + https://openprescribing.net/bnf/ + -------- + :param df: dataframe + :return: dataframe with tracked meds + """ + print('Tracking medication') + + # Extract BNF codes without brand info + df['code'] = df.PI_BNF_Item_Code.apply(lambda x: x[0:9]) + + # Add flag for
salbutamol - marked important by Chris + df['SALBUTAMOL'] = (df.code == '0301011R0').astype(int) + + # Track rescue meds + df['rescue_meds'] = df.PI_BNF_Item_Code.str.contains( + '|'.join(exac_meds)).astype(int) + + # Track anxiety and depression medication + ad_bnf = ('040102', '0403', '0204000R0', '0408010AE') + ad_events = df.PI_BNF_Item_Code.str.startswith(ad_bnf).fillna(False) + drop_dummy = (df.PI_Approved_Name != 'DUMMY') & (df.PI_Approved_Name != 'DUMMY REJECTED') + df['anxiety_depression_presc'] = (drop_dummy & ad_events).astype(int) + + return df \ No newline at end of file diff --git a/training/src/reduction/README.md b/training/src/reduction/README.md new file mode 100644 index 0000000000000000000000000000000000000000..4c563a2ce89bc34ae21b5de8911800a9d6c2505c --- /dev/null +++ b/training/src/reduction/README.md @@ -0,0 +1,12 @@ +# Reduction + +This folder contains scripts for combining, reducing, filling and scaling processed EHR data for modelling. + +Scripts must be run in the order below: +1. `combine.py` - combine datasets and perform any post-processing +2. `post_proc_reduction.py` - combine columns to reduce zero values +3. `remove_ids.py` - remove receiver, scale up and test IDs +4. `clean_and_scale_train.py` - impute nulls and min-max scale training data +5.
`clean_and_scale_test.py` - impute nulls and min-max scale testing data + +_NB: The data_type in `clean_and_scale_test.py` can be changed to rec, sup, val and test._ \ No newline at end of file diff --git a/training/src/reduction/__init__.py b/training/src/reduction/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..cd5026e4a1076584bc8ccbda0acf1605dbfafa11 --- /dev/null +++ b/training/src/reduction/__init__.py @@ -0,0 +1 @@ +# Empty file for folder to be recognised as module diff --git a/training/src/reduction/clean_and_scale_test.py b/training/src/reduction/clean_and_scale_test.py new file mode 100644 index 0000000000000000000000000000000000000000..19595260a25c5403e73742d39c330ddc451bc5c1 --- /dev/null +++ b/training/src/reduction/clean_and_scale_test.py @@ -0,0 +1,173 @@ +""" +TESTING +Impute any null data, save ethnicity info for each ID and scale +final dataset + +NB: This script can be used for merged receiver, scale up or testing data +""" +import json +import sys +import joblib +import pandas as pd +import numpy as np +from numpy import loadtxt + + +ds_cols = ['days_since_copd_resp', 'days_since_adm', 'days_since_rescue'] + +null_cols = ['alt_med_2yr', 'ast_med_2yr', 'albumin_med_2yr', + 'alkaline_phosphatase_med_2yr', 'basophils_med_2yr', + 'c_reactive_protein_med_2yr', 'chloride_med_2yr', + 'creatinine_med_2yr', 'eosinophils_med_2yr', + 'estimated_gfr_med_2yr', 'haematocrit_med_2yr', + 'haemoglobin_med_2yr', 'lymphocytes_med_2yr', + 'mch_med_2yr', 'mean_cell_volume_med_2yr', + 'monocytes_med_2yr', 'neutrophils_med_2yr', + 'platelets_med_2yr', 'potassium_med_2yr', + 'red_blood_count_med_2yr', 'sodium_med_2yr', + 'total_bilirubin_med_2yr', 'urea_med_2yr', + 'white_blood_count_med_2yr', 'neut_lymph_med_2yr', + 'days_since_copd_resp', 'days_since_adm', 'days_since_rescue'] + +cols2drop = ['eth_grp', 'entry_dataset', 'first_entry', 'obf_dob', + 'marital_status', 'label', 'simd_vigintile', 'simd_decile', + 'simd_quintile', 
'sex_bin'] + + +def calc_age_bins_test(df, data_path): + """ + Load training bins and assign to testing data + -------- + :param df: dataframe to be updated + :param data_path: path to generated data + :return: updated dataframe + """ + ed = loadtxt(data_path + 'age_bins_train.csv', delimiter=',') + categories, edges = pd.qcut( + df['age'], q=10, precision=0, retbins=True, labels=ed[1:]) + df['age_bin'] = categories.astype(int) + + return df + + +def create_label(df): + """ + Create a label containing the age and sex bins of the data + -------- + :param df: dataframe + :return: dataframe with label added + """ + df['label'] = df['age_bin'].astype(str) + '_' + df['sex_bin'].astype(str) + df = df.drop('age_bin', axis=1) + + return df + + +def fill_nulls(label, df, medians): + """ + Fill any null values in testing/REC/SUP data with median values from + training data. + -------- + :param label: string label containing age and sex bin values, e.g. '51_0' + for a male patient in the less than 51 age bin + :param df: dataframe + :param medians: dataframe of training set medians for each label and + column + :return: filled dataframe for specified label + """ + meds = medians[medians['label'] == label].iloc[0] + df_2_fill = df[df['label'] == label] + for col in null_cols: + df_2_fill[col] = df_2_fill[col].fillna(meds[col]) + + return df_2_fill + + +def ds_fill_5year_test(df, col, max_vals): + """ + Fill days_since_X columns where patient has been in the dataset less than + 5 years + -------- + :param df: dataframe to be updated + :param col: column to check + :param max_vals: series with columns and their max value from training + :return: dataframe with column nulls filled where patient has ggc_years < 5 + """ + df_5years = df.ggc_years < 5 + df.loc[df_5years, col] = df.loc[df_5years, col].fillna(max_vals[col]) + + return df + + +def scale_data_test(df, scaler): + """ + Min-max scale final dataset + ----- + :param df: dataframe to be scaled + :param scaler: scaler object 
to apply to df + :return: scaled dataset for modelling + """ + all_cols = df.columns + all_cols = all_cols.drop(['SafeHavenID', 'eoy']) + data_scaled = scaler.transform(df[all_cols].to_numpy()) + df_scaled = pd.DataFrame(data_scaled, columns=all_cols) + df_final = (df[['SafeHavenID', 'eoy']] + .reset_index(drop=True) + .join(df_scaled)) + + return df_final + + +def main(): + + # Load in config items + with open('../../../config.json') as json_config_file: + config = json.load(json_config_file) + + # Get generated data_path + data_path = config['model_data_path'] + + # Get datatype from cmd line + data_type = sys.argv[1] + + # Read in data + df = pd.read_pickle(data_path + 'merged_' + data_type + '.pkl') + + # Load training age groups and apply to data + df = calc_age_bins_test(df, data_path) + + # Load in training median for each age-bin/sex-bin labelled group + df_medians = pd.read_pickle(data_path + 'medians.pkl') + df_medians = df_medians.reset_index() + df_medians = create_label(df_medians) + df = create_label(df) + labels = df_medians['label'] + + # Fill null days_since columns for patient with ggc_years < 5 + max_vals = pd.read_pickle(data_path + 'maxs.pkl') + for col in ds_cols: + df = ds_fill_5year_test(df, col, max_vals) + + # Fill remaining nulls using training medians + df_filled = pd.concat([fill_nulls(x, df, df_medians) for x in labels]) + + # Convert ds_cols to int + for col in ds_cols: + day = np.timedelta64(1, 'D') + df_filled[col] = (df_filled[col] / day).astype(int) + + # Save processed data before scaling + df_filled.to_pickle(data_path + 'filled_' + data_type + '.pkl') + + # Drop non-modelling columns + df_filled = df_filled.drop(cols2drop, axis=1) + + # Load in min-max scaler from training set + scaler = joblib.load(data_path + 'min_max_scaler_train.pkl') + df_filled = scale_data_test(df_filled, scaler) + + # Save final dataset + df_filled.to_pickle(data_path + 'min_max_' + data_type + '.pkl') + + +main() diff --git 
a/training/src/reduction/clean_and_scale_train.py b/training/src/reduction/clean_and_scale_train.py new file mode 100644 index 0000000000000000000000000000000000000000..28cd43cdc8c1f0d740f3c03c93c54815dc0f5bd2 --- /dev/null +++ b/training/src/reduction/clean_and_scale_train.py @@ -0,0 +1,171 @@ +""" +TRAIN +Impute any null data, save ethnicity info for each ID and scale +final dataset +""" +import json +import joblib +import pandas as pd +import numpy as np +from numpy import savetxt +from sklearn.preprocessing import MinMaxScaler +from utils.reduction import calc_ds_med + + +demo_cols = ['age_bin', 'sex_bin'] + +ds_cols = ['days_since_copd_resp', 'days_since_adm', 'days_since_rescue'] + +null_cols = ['alt_med_2yr', 'ast_med_2yr', 'albumin_med_2yr', + 'alkaline_phosphatase_med_2yr', 'basophils_med_2yr', + 'c_reactive_protein_med_2yr', 'chloride_med_2yr', + 'creatinine_med_2yr', 'eosinophils_med_2yr', + 'estimated_gfr_med_2yr', 'haematocrit_med_2yr', + 'haemoglobin_med_2yr', 'lymphocytes_med_2yr', + 'mch_med_2yr', 'mean_cell_volume_med_2yr', + 'monocytes_med_2yr', 'neutrophils_med_2yr', + 'platelets_med_2yr', 'potassium_med_2yr', + 'red_blood_count_med_2yr', 'sodium_med_2yr', + 'total_bilirubin_med_2yr', 'urea_med_2yr', + 'white_blood_count_med_2yr', 'neut_lymph_med_2yr'] + +cols2drop = ['eth_grp', 'entry_dataset', 'first_entry', 'obf_dob', + 'sex_bin', 'marital_status', 'age_bin', + 'days_since_copd_resp_med', 'days_since_adm_med', + 'days_since_rescue_med', 'simd_vigintile', 'simd_decile', + 'simd_quintile'] + + +def calc_age_bins_train(df, data_path): + """ + Split ages into 10 bins and save results for median filling test data + -------- + :param df: dataframe to be updated + :param data_path: path to generated data + :return: updated dataframe + """ + # Split age column into 10 buckets and use the edges as labels + cat, ed = pd.qcut(df['age'], q=10, precision=0, retbins=True) + categories, edges = pd.qcut( + df['age'], q=10, precision=0, retbins=True, 
labels=ed[1:]) + df['age_bin'] = categories.astype(int) + + # Save categories for test data + savetxt(data_path + 'age_bins_train.csv', edges, delimiter=',') + + return df + + +def calc_df_med(df, data_path): + """ + Calculate the medians for all columns in the dataset + -------- + :param df: dataframe to update + :param data_path: path to generated data + :return: dataframe with null columns filled with median values and days_since + median columns added to the dataframe + """ + # Calculate median for all columns except SafeHavenID, year and ds_cols + all_cols = df.columns + all_cols = all_cols.drop(['SafeHavenID', 'eoy']) + df_median = df[all_cols].groupby(demo_cols).median() + + # Calculate medians for ds_cols + ds_med = df[demo_cols + ds_cols].groupby(demo_cols).apply(calc_ds_med) + + # Join ds_cols medians to median table and original dataframe + df_median = df_median.join(ds_med) + + # Save medians for imputing testing data + df_median.to_pickle(data_path + 'medians.pkl') + + # Rename and add to original dataframe + ds_med.columns += '_med' + df = df.join(ds_med, on=demo_cols) + + return df + + +def ds_fill_5year_train(df, col): + """ + Fill days_since_X columns where patient has been in the dataset less than + 5 years + -------- + :param df: dataframe to be updated + :param col: column to check + :return: dataframe with column nulls filled where patient has ggc_years < 5 + """ + df_5years = df.ggc_years < 5 + df.loc[df_5years, col] = df.loc[df_5years, col].fillna(df[col].max()) + + return df + + +def scale_data_train(df, data_path, scaler): + """ + Min-max scale final dataset + ----- + :param df: dataframe to be scaled + :param data_path: path to generated data + :param scaler: scaler object to apply to df + :return: scaled dataset for modelling + """ + all_cols = df.columns + all_cols = all_cols.drop(['SafeHavenID', 'eoy']) + data_scaled = scaler.fit_transform(df[all_cols].to_numpy()) + df_scaled = pd.DataFrame(data_scaled, columns=all_cols) + df_final = 
(df[['SafeHavenID', 'eoy']] + .reset_index(drop=True) + .join(df_scaled)) + + # Save the scaler for testing + joblib.dump(scaler, data_path + 'min_max_scaler_train.pkl') + + return df_final + + +def main(): + + # Load in config items + with open('../../../config.json') as json_config_file: + config = json.load(json_config_file) + data_path = config['model_data_path'] + + # Read in combined data + df = pd.read_pickle(data_path + 'merged_train.pkl') + + # Calculate age bins + df = calc_age_bins_train(df, data_path) + + # Calculate medians for each column for imputation + df = calc_df_med(df, data_path) + + # Fill null columns + df[null_cols] = df.groupby(demo_cols)[null_cols].apply( + lambda x: x.fillna(x.median())) + + # Fill null days_since columns + day = np.timedelta64(1, 'D') + df[ds_cols].max().to_pickle(data_path + 'maxs.pkl') + for col in ds_cols: + df = ds_fill_5year_train(df, col) + df[col] = df[col].fillna(df[col + '_med']) + df[col] = (df[col] / day).astype(int) + + # Save processed data before scaling + df.to_pickle(data_path + 'filled_train.pkl') + + # Drop non-modelling columns + df = df.drop(cols2drop, axis=1) + + # Initialize scaler + scaler = MinMaxScaler() + + # Scale final dataset + df_final = scale_data_train(df, data_path, scaler) + + # Save final dataset + df_final.to_pickle(data_path + 'min_max_train.pkl') + + +main() diff --git a/training/src/reduction/combine.py b/training/src/reduction/combine.py new file mode 100644 index 0000000000000000000000000000000000000000..343ac5254dfe5c02ce2f8c504c0868a2e1a14169 --- /dev/null +++ b/training/src/reduction/combine.py @@ -0,0 +1,217 @@ +""" +To Do: +- Refactor script to be more readable/smaller main function +""" +import json +import pandas as pd +import numpy as np +from datetime import timedelta + + +def read_pkl_data(dataset, data_path, path_type): + """ + Read in pickled dataset + -------- + :param dataset: type of dataset to read in + :param data_path: path to generated data + :param path_type: 
type of path to read from + :return: dataframe + """ + print('Reading in ' + dataset) + + file_path = data_path + dataset + if path_type == 'data': + file_path += '_proc.pkl' + else: + file_path += '_first_dates.pkl' + + return pd.read_pickle(file_path) + + +def fill_eth_grp_data(df): + """ + Fill nulls in eth_grp column introduced in joining + :param df: dataframe to update + :return: Filled dataframe + """ + df['eth_grp'] = df.groupby('SafeHavenID').eth_grp.apply( + lambda x: x.ffill().bfill()) + df['eth_grp'] = df['eth_grp'].fillna('Unknown') + + return df + + +def fill_to_date_columns(df): + """ + Fill nulls in to_date columns introduced in joining + :param df: dataframe to update + :return: Filled dataframe + """ + to_date_cols = ['adm_to_date', 'copd_to_date', 'resp_to_date', + 'presc_to_date', 'rescue_to_date', 'labs_to_date', + 'anxiety_depression_to_date', + 'anxiety_depression_presc_to_date'] + df[to_date_cols] = df.groupby('SafeHavenID')[to_date_cols].apply( + lambda x: x.ffill().fillna(0)) + + return df + + +def fill_yearly_columns(df): + """ + Fill nulls in yearly columns introduced in joining + :param df: dataframe to update + :return: Filled dataframe + """ + zero_cols = ['adm_per_year', 'total_hosp_days', 'mean_los', + 'copd_per_year', 'resp_per_year', 'comorb_per_year', + 'salbutamol_per_year', + 'saba_inhaler_per_year', 'laba_inhaler_per_year', + 'lama_inhaler_per_year', 'sama_inhaler_per_year', + 'ics_inhaler_per_year', 'laba_ics_inhaler_per_year', + 'lama_laba_ics_inhaler_per_year', 'saba_sama_inhaler_per_year', + 'mcs_inhaler_per_year', 'rescue_meds_per_year', + 'presc_per_year', 'labs_per_year', + 'anxiety_depression_per_year', 'anxiety_depression_presc_per_year'] + df[zero_cols] = df[zero_cols].fillna(0) + + return df + + +def fill_days_since(df, typ): + """ + Fill days_since_copd/resp/rescue + :param df: dataframe to update + :param typ: type of feature to fill ('copd', 'resp', 'rescue') + :return: Filled dataframe + """ + df['days_since_' + 
typ] = df.eoy - df[typ + '_date'].ffill() + + return df + + +def process_first_dates(df): + """ + Process dataframe containing patient's first date in the health board region + -------- + :param df: dataframe to process + :return: processed dataframe + """ + df = df.set_index('SafeHavenID') + entry_dataset = df.idxmin(axis=1).apply(lambda x: x.split('_')[1]) + first_entry = df.min(axis=1) + df['entry_dataset'] = entry_dataset + df['first_entry'] = first_entry + df_reduced = df[['entry_dataset', 'first_entry']].reset_index() + + return df_reduced + + +def find_closest_simd(v): + """ + Find closest SIMD vigintile for each row 'v' + -------- + :param v: row of data from apply statement + :param typ: type of simd column to add + :return: simd value + """ + simd_years = [2009, 2012, 2016] + bools = [v.eoy.year >= year for year in simd_years] + if any(bools): + simd_year = str(simd_years[np.where(bools)[0][-1]]) + v['simd_quintile'] = v['simd_' + simd_year + '_quintile'] + v['simd_decile'] = v['simd_' + simd_year + '_decile'] + v['simd_vigintile'] = v['simd_' + simd_year + '_vigintile'] + else: + v['simd_quintile'] = np.nan + v['simd_decile'] = np.nan + v['simd_vigintile'] = np.nan + + return v + + +def main(): + + # Load in config items + with open('../../../config.json') as json_config_file: + config = json.load(json_config_file) + data_path = config['model_data_path'] + + # Read in data + adm = read_pkl_data('adm', data_path, 'data') + comorb = read_pkl_data('comorb', data_path, 'data') + presc = read_pkl_data('presc', data_path, 'data') + labs = read_pkl_data('labs', data_path, 'data') + demo = read_pkl_data('demo', data_path, 'data') + + # Join datasets + df = adm.join( + comorb, how='left').join( + presc, how='outer').join( + labs, how='outer') + df = df.reset_index() + + # Fill nulls introduced in joining + print('Filling data') + df = fill_eth_grp_data(df) + df = fill_to_date_columns(df) + df = fill_yearly_columns(df) + + # Fill days_since columns + for typ in 
['copd', 'resp', 'rescue', 'adm']: + df = df.groupby('SafeHavenID').apply(fill_days_since, typ) + + # Reduce to single column + ds_cols = ['days_since_copd', 'days_since_resp'] + df['days_since_copd_resp'] = df[ds_cols].min(axis=1) + + # Read in first date data + print('Adding first dates') + adm_dates = read_pkl_data('adm', data_path, 'date') + presc_dates = read_pkl_data('presc', data_path, 'date') + labs_dates = read_pkl_data('labs', data_path, 'date') + + # Merge first date data + first_dates = pd.merge( + pd.merge(adm_dates, presc_dates, how="outer", on='SafeHavenID'), + labs_dates, how="outer", on='SafeHavenID') + + # Save first dates if needed + first_dates.to_pickle(data_path + 'overall_first_dates.pkl') + + # Process first_years + date_data = process_first_dates(first_dates) + + # Merge first dates data with dataframe + print('Merging data') + df_merged = pd.merge(df, date_data, on='SafeHavenID', how='inner') + + # Add years in health board region + ggc_years = (df_merged.eoy - df_merged.first_entry) / np.timedelta64(1, 'Y') + df_merged['ggc_years'] = round(ggc_years) + + # Merge demographics + df_merged = pd.merge(df_merged, demo, on='SafeHavenID') + + # Calculate age relative to end of year + dt_diff = df_merged.eoy - pd.to_datetime(df_merged.obf_dob) + df_merged['age'] = dt_diff // timedelta(days=365.2425) + + # Find closest SIMD + df_merged = df_merged.apply(find_closest_simd, axis=1) + + # Drop additional columns + cols2drop = ['copd_date', 'resp_date', 'adm_date', 'rescue_date', + 'simd_2009_quintile', 'simd_2009_decile', + 'simd_2009_vigintile', 'simd_2012_quintile', + 'simd_2012_decile', 'simd_2012_vigintile', + 'simd_2016_quintile', 'simd_2016_decile', + 'simd_2016_vigintile', 'days_since_copd', + 'days_since_resp'] + df_merged = df_merged.drop(cols2drop, axis=1) + + # Save dataset + df_merged.to_pickle(data_path + 'merged_full.pkl') + + +main() diff --git a/training/src/reduction/post_proc_reduction.py 
b/training/src/reduction/post_proc_reduction.py new file mode 100644 index 0000000000000000000000000000000000000000..3b185dfee9bf8a4419a93b7811bdbc1e275a7cc5 --- /dev/null +++ b/training/src/reduction/post_proc_reduction.py @@ -0,0 +1,37 @@ +import json +import pandas as pd + + +single_inhaler = ['saba_inhaler_per_year', 'laba_inhaler_per_year', + 'lama_inhaler_per_year', 'sama_inhaler_per_year', + 'ics_inhaler_per_year', 'mcs_inhaler_per_year'] +double_inhaler = ['laba_ics_inhaler_per_year', 'saba_sama_inhaler_per_year'] +triple_inhaler = 'lama_laba_ics_inhaler_per_year' +adm_cols = ['copd_per_year', 'resp_per_year'] + + +def main(): + + # Load in config items + with open('../../../config.json') as json_config_file: + config = json.load(json_config_file) + data_path = config['model_data_path'] + + # Read in original data before scaling + df = pd.read_pickle(data_path + 'merged_full.pkl') + + # Create new reduced columns + df['single_inhaler'] = df[single_inhaler].sum(axis=1) + df['double_inhaler'] = df[double_inhaler].sum(axis=1) + df['triple_inhaler'] = df[triple_inhaler] + df['copd_resp_per_year'] = df[adm_cols].sum(axis=1) + + # Drop original columns + cols2drop = single_inhaler + double_inhaler + [triple_inhaler] + adm_cols + df = df.drop(cols2drop, axis=1) + + # Save data + df.to_pickle(data_path + 'merged.pkl') + + +main() diff --git a/training/src/reduction/remove_ids.py b/training/src/reduction/remove_ids.py new file mode 100644 index 0000000000000000000000000000000000000000..ff35b25b3b8433def50203f778e9b9c3e43d5f4a --- /dev/null +++ b/training/src/reduction/remove_ids.py @@ -0,0 +1,108 @@ +""" +Script to remove all receiver IDs from relevant data sources. 
+""" +import json +import pandas as pd +from sklearn.model_selection import train_test_split + + +def get_ids(path): + """ + Read in IDs + -------- + :return: list of SafeHavenIDs + """ + print('Loading IDs from ' + path) + + df = pd.read_csv(path, encoding="cp1252") + ids = df['SafeHavenID'].tolist() + + return ids + + +def save_rec_sup(df, data_path, rec_ids, sup_ids): + """ + Remove receiver IDs from dataframe and pickle the dataset + -------- + :param df: pandas dataframe to remove ids from + :param data_path: path to generated data + :param rec_ids: list of SafeHavenIDs in receiver cohort to remove + :param sup_ids: list of SafeHavenIDs in scale-up cohort to remove + :return: None + """ + print('Saving REC and SUP data') + + # Remove receiver IDs + df_rec = df[df['SafeHavenID'].isin(rec_ids)] + df_sup = df[df['SafeHavenID'].isin(sup_ids)] + df = df[~df['SafeHavenID'].isin(rec_ids + sup_ids)] + + # Save data + df_rec.to_pickle(data_path + 'merged_rec.pkl') + df_sup.to_pickle(data_path + 'merged_sup.pkl') + + return df + + +def save_df_ids(df, data_path, ids, typ): + """ + Save train, test or validation ids and corresponding data + -------- + :param df: dataframe + :param data_path: path to generated data + :param ids: list of SafeHavenIDs + :param typ: type of dataset to create, 'train', 'test', 'val' + """ + print('Saving ' + typ + ' data') + + df_ids = pd.DataFrame(ids, columns=['SafeHavenID']) + df_ids.to_pickle(data_path + typ + '_ids.pkl') + df_ids_data = df[df['SafeHavenID'].isin(ids)] + df_ids_data.to_pickle(data_path + 'merged_' + typ + '.pkl') + + +def df_tts(df, data_path): + """ + Split data into training and testing sets and save dataframes + -------- + :param df: pandas dataframe to split + :param data_path: path to generated data + :return: None + """ + # Split IDs into training, testing and validation sets + ids = df['SafeHavenID'].tolist() + train_ids, test_ids = train_test_split( + ids, test_size=0.2, random_state=42) + train_ids, val_ids = 
train_test_split( + train_ids, test_size=0.25, random_state=42) + + # Save IDs and datasets + save_df_ids(df, data_path, train_ids, 'train') + save_df_ids(df, data_path, test_ids, 'test') + save_df_ids(df, data_path, val_ids, 'val') + + +def main(): + + # Load in config items + with open('../../../config.json') as json_config_file: + config = json.load(json_config_file) + + # Set paths + data_path = config['model_data_path'] + rec_path = config['rec_data_path'] + 'Cohort3Rand.csv' + sup_path = config['sup_data_path'] + 'Scale_Up_lookup.csv' + + # Get IDs to exclude + rec_ids = get_ids(rec_path) + sup_ids = get_ids(sup_path) + + # Remove IDs from datasets + df = pd.read_pickle(data_path + 'merged.pkl') + df = save_rec_sup(df, data_path, rec_ids, sup_ids) + + # Split and save the data + df_tts(df, data_path) + + +main() diff --git a/training/src/reduction/utils/README.md b/training/src/reduction/utils/README.md new file mode 100644 index 0000000000000000000000000000000000000000..934d21e71d5345246df06dc46680b1021c447b56 --- /dev/null +++ b/training/src/reduction/utils/README.md @@ -0,0 +1,5 @@ +# Reduction Utilities + +This folder contains reduction utilities called within the main reduction scripts in the folder above. 
+ +- `reduction.py` contains a utility for calculating days_since medians \ No newline at end of file diff --git a/training/src/reduction/utils/__init__.py b/training/src/reduction/utils/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..cd5026e4a1076584bc8ccbda0acf1605dbfafa11 --- /dev/null +++ b/training/src/reduction/utils/__init__.py @@ -0,0 +1 @@ +# Empty file for folder to be recognised as module diff --git a/training/src/reduction/utils/reduction.py b/training/src/reduction/utils/reduction.py new file mode 100644 index 0000000000000000000000000000000000000000..da4a67edcccbae46bbe78dd33bec0b3961eac123 --- /dev/null +++ b/training/src/reduction/utils/reduction.py @@ -0,0 +1,16 @@ +import numpy as np + + +def calc_ds_med(v): + """ + Calculate the median value of a subgroup by removing any float nulls and + converting from days to integers + -------- + :param v: values in column + :return: median value + """ + day = np.timedelta64(1, 'D') + med_val = (v.dropna() / day).astype(int).median().astype(int) + med_val *= day + + return med_val \ No newline at end of file diff --git a/training/tests/README.md b/training/tests/README.md new file mode 100644 index 0000000000000000000000000000000000000000..cae503b98fdf669a6e0a53edcccbbe6014e5aaa1 --- /dev/null +++ b/training/tests/README.md @@ -0,0 +1 @@ +# Tests \ No newline at end of file diff --git a/validation/event_tracking/Model_e_admissions_calculations.py b/validation/event_tracking/Model_e_admissions_calculations.py new file mode 100644 index 0000000000000000000000000000000000000000..cf74790bfed0925a47501f0c72d26920535ab822 --- /dev/null +++ b/validation/event_tracking/Model_e_admissions_calculations.py @@ -0,0 +1,294 @@ +# Import libraries +import pandas as pd +import numpy as np + +# Set file paths +input_file_path = '/EXAMPLE_STUDY_DATA/' +output_file_path = '/summary_files/' + + +copd = 'CHRONIC OBSTRUCTIVE PULMONARY DISEASE' + + +resp = ['PNEUMONITIS DUE TO FOOD AND VOMIT', 
'RESPIRATORY FAILURE, UNSPECIFIED; TYPE UNSPECIFIED', + 'CHRONIC RESPIRATORY FAILURE; TYPE II [HYPERCAPNIC]', + 'BRONCHOPNEUMONIA, UNSPECIFIED', 'DYSPNOEA', + 'PLEURAL EFFUSION IN CONDITIONS CLASSIFIED ELSEWHERE', + 'RESPIRATORY FAILURE, UNSPECIFIED; TYPE [HYPERCAPNIC]', + 'PLEURAL EFFUSION, NOT ELSEWHERE CLASSIFIED', + 'CHRONIC RESPIRATORY FAILURE', 'OTHER BACTERIAL PNEUMONIA', + 'ABN MICROBIOLOGICAL FINDINGS IN SPECS FROM RESPIRATORY ORGANS AND THORAX', + 'RESPIRATORY FAILURE, UNSPECIFIED', 'PNEUMONIA, UNSPECIFIED', + 'LOBAR PNEUMONIA, UNSPECIFIED', 'COUGH', + 'PLEURAL PLAQUE WITH PRESENCE OF ASBESTOS', + 'PLEURAL PLAQUE WITHOUT ASBESTOS', 'OTHER DISORDERS OF LUNG', + 'OTHER SPECIFIED PLEURAL CONDITIONS', 'PULMONARY COLLAPSE', + 'ACQUIRED ABSENCE OF LUNG [PART OF]', 'ASPHYXIATION', + 'RESPIRATORY FAILURE, UNSPECIFIED; TYPE [HYPOXIC]', + 'TRACHEOSTOMY STATUS', 'ACUTE RESPIRATORY FAILURE', + 'UNSPECIFIED ACUTE LOWER RESPIRATORY INFECTION', + 'OTHER SPECIFIED SYMPTOMS AND SIGNS INVOLVING THE CIRC AND RESP SYSTEMS', + 'BACTERIAL PNEUMONIA, UNSPECIFIED', 'PYOTHORAX WITHOUT FISTULA', + 'DISEASES OF BRONCHUS, NOT ELSEWHERE CLASSIFIED', + 'PNEUMONIA DUE TO HAEMOPHILUS INFLUENZAE', 'ABNORMAL SPUTUM', + 'OTHER POSTPROCEDURAL RESPIRATORY DISORDERS', + 'OTHER AND UNSPECIFIED ABNORMALITIES OF BREATHING', + 'INFLUENZA WITH OTHER RESP MANIFESTATIONS, SEASONAL INFLUENZA VIRUS IDENTIF', + 'PERSONAL HISTORY OF DISEASES OF THE RESPIRATORY SYSTEM', + 'PNEUMONIA DUE TO STREPTOCOCCUS PNEUMONIAE', + 'WHEEZING', 'CHEST PAIN ON BREATHING', 'HAEMOPTYSIS', + 'INFLUENZA WITH OTHER MANIFESTATIONS, VIRUS NOT IDENTIFIED', + 'OTHER SPECIFIED RESPIRATORY DISORDERS', + 'ACUTE UPPER RESPIRATORY INFECTION, UNSPECIFIED', + 'T.B. 
OF LUNG, W/O MENTION OF BACTERIOLOGICAL OR HISTOLOGICAL CONFIRMATION', + 'DEPENDENCE ON RESPIRATOR', 'PLEURISY', + 'BRONCHITIS, NOT SPECIFIED AS ACUTE OR CHRONIC'] + + +def read_data(file, cols, types): + """ + Read in data source + -------- + :param file: string filename + :param cols: string list of column names + :param types: string list of column types + :return: dataframe + """ + schema = dict(zip(cols, types)) + df = pd.read_csv(file, usecols=cols, encoding="cp1252", dtype=schema) + return df + + +def update_null_stay(df): + """ + Calculate the values for any null 'STAY' values using the admission and + discharge dates. + -------- + df : pandas dataframe to be updated + """ + is_null = df['STAY'].isnull() + if is_null.any(): + # Fill every null STAY in one vectorised step; the boolean mask is + # label-aligned, so it is safe after rows have been dropped + stay = df.loc[is_null, 'DISDATE'] - df.loc[is_null, 'ADMDATE'] + df.loc[is_null, 'STAY'] = stay.dt.days.astype(float) + + return df + + +def calculate_total_stay(df): + """ + Model A: + Calculate the cumulative (total) length of stay, given data already + grouped by patient ID and sorted by admission date then discharge date. It + sums all stays for which the admission date matches the previous discharge + date, sets the admission date to the first admission and drops all rows + except the final (or only if the patient was not transferred) record + for any given stay. Works for any number of transfers. 
Also adds a + 'transfer' column to the existing data (True/False) + + df : pandas dataframe + dataframe to be updated + """ + df.reset_index(inplace=True, drop=True) + rows_to_drop = [] + df['transfer'] = df.ADMDATE.eq(df.DISDATE.shift()) + for index, row in df.iloc[1:].iterrows(): + if row.transfer is True: + df.loc[index, 'ADMDATE'] = df.iloc[index - 1].ADMDATE + df.loc[index, 'STAY'] = row.STAY + df.iloc[index - 1].STAY + rows_to_drop.append(index - 1) + df.drop(rows_to_drop, inplace=True) + df.drop('transfer', axis=1, inplace=True) + + return df + + +def track_copd_resp(df, track_type='both'): + """ + Search for COPD and/or respiratory admissions + -------- + df : pandas dataframe + dataframe to be updated + track_type : str + 'copd', 'resp' or 'both' + """ + diag_columns = ['DIAG1Desc', 'DIAG2Desc', 'DIAG3Desc', 'DIAG4Desc', + 'DIAG5Desc', 'DIAG6Desc'] + df_diag = df[diag_columns] + + if track_type in ['copd', 'both']: + copd_event = df_diag.apply(lambda x: track_feature(x, copd, True)) + copd_event = copd_event.any(axis=1).astype(int) + df['copd_event'] = copd_event + + if track_type in ['resp', 'both']: + resp_event = df_diag.apply(lambda x: track_feature(x, resp, False)) + resp_event = resp_event.any(axis=1).astype(int) + df['resp_event'] = resp_event + + return df + + +def track_feature(x, desc, single): + """ + Fill nulls and search to see if x matches a description + ------- + x : str list + feature to track + desc : str list + string list to compare + single : boolean + if checking against single description e.g. 
"COPD" True otherwise False + """ + x = x.fillna('') + if single: + result = [desc in s for s in x] + else: + result = [s in desc for s in x] + + return result + + +def filter_data(data, date): + """ + Filter data to only include copd or resp admission events occurring after + the index date + -------- + :param data: dataframe + :param date: index date + :return: filtered dataframe + """ + data['ADMDATE'] = pd.to_datetime(data['ADMDATE']) + data = data[data['ADMDATE'] >= date] + data = data[(data['copd_event'] == 1) | (data['resp_event'] == 1)] + return data + + +def calculate_time_to_first_copd_admission(data, date): + """ + Calculate days to first COPD admission + -------- + :param data: dataframe + :param date: Index date in 'DD-MM-YYYY' format + :return: dataframe showing the number of days to the first COPD admission + event for each ID since the index date + """ + copd_data = data[data['copd_event'] == 1] + first_copd_admission = copd_data.groupby('SafeHavenID').agg(first_copd_admission=('ADMDATE', np.min)) + first_copd_admission['index_date'] = date + first_copd_admission['index_date'] = pd.to_datetime(first_copd_admission['index_date']) + first_copd_admission['days_to_first_copd_admission'] = (first_copd_admission['first_copd_admission'] - first_copd_admission['index_date']).dt.days + return first_copd_admission + + +def calculate_time_to_first_resp_admission(data, date): + """ + Calculate days to first resp admission + -------- + :param data: dataframe + :param date: Index date in 'DD-MM-YYYY' format + :return: dataframe showing the number of days to the first resp admission event for each ID since + the index date + """ + resp_data = data[data['resp_event'] == 1] + first_resp_admission = resp_data.groupby('SafeHavenID').agg(first_resp_admission=('ADMDATE', np.min)) + first_resp_admission['index_date'] = date + first_resp_admission['index_date'] = pd.to_datetime(first_resp_admission['index_date']) + first_resp_admission['days_to_first_resp_admission'] = 
(first_resp_admission['first_resp_admission'] - first_resp_admission['index_date']).dt.days + return first_resp_admission + + +def calculate_time_to_first_copd_or_resp_admission(data, date): + """ + Calculate days to first copd or resp admission + -------- + :param data: dataframe + :param date: Index date in 'DD-MM-YYYY' format + :return: dataframe showing the number of days to the first COPD or resp admission + event for each ID since the index date + """ + data['copd_or_resp_event'] = (data['resp_event'] | data['copd_event']) + resp_copd_data = data[(data['copd_or_resp_event'] == 1)] + first_resp_or_copd_admission = resp_copd_data.groupby('SafeHavenID').agg(first_copd_or_resp_admission=('ADMDATE', np.min)) + first_resp_or_copd_admission['index_date'] = date + first_resp_or_copd_admission['index_date'] = pd.to_datetime(first_resp_or_copd_admission['index_date']) + first_resp_or_copd_admission['first_copd_or_resp_admission'] = pd.to_datetime(first_resp_or_copd_admission['first_copd_or_resp_admission']) + first_resp_or_copd_admission['days_to_first_copd_or_resp_admission'] = (first_resp_or_copd_admission['first_copd_or_resp_admission'] - first_resp_or_copd_admission['index_date']).dt.days + return first_resp_or_copd_admission + + +def calculate_ad_count_1_year(data, year_censor, first_admission_df, adm_col): + """ + Calculate the number of COPD or respiratory admissions in the year + following the index date and join this data to the time to first + admissions data for each ID + -------- + :param data: dataframe containing admissions dates + :param year_censor: date 1 year following Index date 'DD-MM-YYYY' format + :param first_admission_df: dataframe showing days to first admission + :param adm_col: binary column showing if an admission was copd or + respiratory related or not + :return: dataframe showing the number of days to the first COPD or resp + admission event for each ID since the index date + """ + admission_year = data[data['ADMDATE'] < year_censor] + 
year_admission_count = admission_year.groupby('SafeHavenID').agg(admission_count_year_post_index=(adm_col, 'sum')) + all_admissions_data = pd.merge(year_admission_count, first_admission_df, on="SafeHavenID", how="outer") + all_admissions_data['admission_count_year_post_index'] = all_admissions_data['admission_count_year_post_index'].fillna(0) + return all_admissions_data + + +def main(): + + adm_file = input_file_path + "SMR01_Cohort3R.csv" + adm_cols = ['SafeHavenID', 'ETHGRP', 'ADMDATE', 'DISDATE', 'DIAG1Desc', + 'DIAG2Desc', 'DIAG3Desc', 'DIAG4Desc', 'DIAG5Desc', + 'DIAG6Desc', 'STAY'] + adm_types = ['int', 'object', 'object', 'object', 'str', 'str', 'str', + 'str', 'str', 'str', 'int'] + adm = read_data(adm_file, adm_cols, adm_types) + + # Nulls dropped later in process, only drop duplicates + adm = adm.drop_duplicates() + + # Convert date columns to correct type + adm['ADMDATE'] = pd.to_datetime(adm['ADMDATE']) + adm['DISDATE'] = pd.to_datetime(adm['DISDATE']) + + # Update any null STAY data using ADM and DIS dates + adm = update_null_stay(adm) + + # Correct stays for patients passed across departments + adm = adm.sort_values(['SafeHavenID', 'ADMDATE', 'DISDATE']) + adm = adm.groupby('SafeHavenID').apply(calculate_total_stay) + adm = adm.reset_index(drop=True) + + # Prepare text data - strip string columns + adm = adm.apply(lambda x: x.str.strip() if x.dtype == 'object' else x) + + # Track COPD and respiratory events + adm = track_copd_resp(adm) + + # Filter to only include copd or resp admission events occurring after index + adm = filter_data(adm, '01-01-2020') + + # Calculate time to first respiratory and COPD admissions + first_copd_admission = calculate_time_to_first_copd_admission(adm, '01-01-2020') + first_resp_admission = calculate_time_to_first_resp_admission(adm, '01-01-2020') + first_resp_or_copd_admission = calculate_time_to_first_copd_or_resp_admission(adm, '01-01-2020') + + # Calculate number of respiratory and COPD admissions in the follow up 
year and join this to the time to admission data + first_copd_admission = calculate_ad_count_1_year(adm, '01-01-2021', first_copd_admission, 'copd_event') + first_resp_admission = calculate_ad_count_1_year(adm, '01-01-2021', first_resp_admission, 'resp_event') + first_resp_or_copd_admission = calculate_ad_count_1_year(adm, '01-01-2021', first_resp_or_copd_admission, 'copd_or_resp_event') + + # Save data + adm.to_pickle(output_file_path + 'all_COPD_and_resp_admissions_from_index_date.pkl') + first_copd_admission.to_pickle(output_file_path + 'copd_admissions_cohort_summary.pkl') + first_resp_admission.to_pickle(output_file_path + 'resp_admissions_cohort_summary.pkl') + first_resp_or_copd_admission.to_pickle(output_file_path + 'copd_or_resp_admissions_cohort_summary.pkl') + + +main() diff --git a/validation/event_tracking/__init__.py b/validation/event_tracking/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..cd5026e4a1076584bc8ccbda0acf1605dbfafa11 --- /dev/null +++ b/validation/event_tracking/__init__.py @@ -0,0 +1 @@ +# Empty file for folder to be recognised as module diff --git a/validation/event_tracking/model_e_exacerbation_calculations.py b/validation/event_tracking/model_e_exacerbation_calculations.py new file mode 100644 index 0000000000000000000000000000000000000000..21fe48f30dc0df8101518070c54351ae5a6ac453 --- /dev/null +++ b/validation/event_tracking/model_e_exacerbation_calculations.py @@ -0,0 +1,150 @@ +# Import libraries +import pandas as pd +import numpy as np + +# Set file paths +input_file_path = '/EXAMPLE_STUDY_DATA/' +output_file_path = '/summary_files/' + +steroid_codes = ['0603020T0AAACAC', '0603020T0AABKBK', '0603020T0AAAXAX', + '0603020T0AAAGAG', '0603020T0AABHBH', '0603020T0AAACAC', + '0603020T0AABKBK', '0603020T0AABNBN', '0603020T0AAAGAG', + '0603020T0AABHBH'] + +antib_codes = ['0501013B0AAAAAA', '0501013B0AAABAB', '0501030I0AAABAB', + '0501030I0AAAAAA', '0501050B0AAAAAA', '0501050B0AAADAD', + 
'0501013K0AAAJAJ'] + +exac_meds = steroid_codes + antib_codes + + +def read_data(file, cols, types): + """ + Read in data source + -------- + :param file: string filename + :param cols: string list of column names + :param types: string list of column types + :return: dataframe + """ + schema = dict(zip(cols, types)) + df = pd.read_csv(file, usecols=cols, encoding="cp1252", dtype=schema) + return df + + +def initialize_presc_data(presc_file): + """ + Load in prescribing dataset to correct format + -------- + :param presc_file: prescribing data file name + :return: prescribing dataframe with correct column names and types + """ + print('Loading prescribing data') + + # Read in data + presc_cols = ['SafeHavenID', 'PRESC_DATE', 'PI_Approved_Name', + 'PI_BNF_Item_Code'] + presc_types = ['int', 'object', 'str', 'str'] + df = read_data(presc_file, presc_cols, presc_types) + + # Drop any nulls or duplicates + df = df.dropna() + df = df.drop_duplicates() + + # Convert date + df['PRESC_DATE'] = pd.to_datetime(df.PRESC_DATE) + + return df + + +def track_medication(df): + """ + Track salbutamol and rescue med prescriptions + -------- + :param df: dataframe + :return: dataframe with tracked meds + """ + print('Tracking medication') + + # Extract BNF codes without brand info + df['code'] = df.PI_BNF_Item_Code.apply(lambda x: x[0:9]) + + # Track rescue meds + df['rescue_meds'] = df.PI_BNF_Item_Code.str.contains( + '|'.join(exac_meds)).astype(int) + + return df + + +def filter_data(data, date): + """ + Filter data to only include rescue med prescriptions occurring + after the index date + -------- + :param data: dataframe + :param date: Index date in 'DD-MM-YYYY' format + :return: filtered dataframe + """ + data['PRESC_DATE'] = pd.to_datetime(data['PRESC_DATE']) + data = data[data['PRESC_DATE'] >= date] + data = data[data['rescue_meds'] == 1] + return data + + +def calculate_time_to_first_exacerbation(data, date): + """ + Calculate days to first exacerbation + -------- + :param 
data: dataframe + :param date: Index date in 'DD-MM-YYYY' format + :return: dataframe showing the number of days to the first exacerbation + event for each ID since the index date + """ + first_exac = data.groupby('SafeHavenID').agg(first_exac=('PRESC_DATE', np.min)) + first_exac['index_date'] = date + first_exac['index_date'] = pd.to_datetime(first_exac['index_date']) + first_exac['days_to_first_exac'] = (first_exac['first_exac'] - first_exac['index_date']).dt.days + return first_exac + + +def calculate_exac_count_1_year(data, year_censor, first_exac_df): + """ + Calculate the number of exacerbations in the year following the index date + and join this data to the time to first exacerbation data for each ID + -------- + :param data: dataframe containing exacerbation dates (based on rescue meds) + :param year_censor: date 1 year following Index date 'DD-MM-YYYY' format + :param first_exac_df: dataframe showing days to first exacerbations for IDs + :return: dataframe showing the 1-year exacerbation count and the days to + the first exacerbation event for each ID since the index date + """ + presc_year = data[data['PRESC_DATE'] < year_censor] + year_exac_count = presc_year.groupby('SafeHavenID').agg(exac_count_year_post_index=('PRESC_DATE', 'nunique')) + all_exac_data = pd.merge(year_exac_count, first_exac_df, on="SafeHavenID", how="outer") + all_exac_data['exac_count_year_post_index'] = all_exac_data['exac_count_year_post_index'].fillna(0) + return all_exac_data + + +def main(): + + # Initialise prescription data + presc = initialize_presc_data(input_file_path + 'Pharmacy_Cohort3R.csv') + + # Track rescue med prescriptions + presc = track_medication(presc) + + # Filter to only include exacerbation events (rescue med prescriptions) occurring after the index date + presc = filter_data(presc, '01-01-2020') + + # Calculate time to first exacerbation + first_exac = calculate_time_to_first_exacerbation(presc, '01-01-2020') + + # Calculate number of exacerbations in the 
follow up year and join this to the time to first exacerbation data + first_exac = calculate_exac_count_1_year(presc, '01-01-2021', first_exac) + + # Save data + presc.to_csv(output_file_path + 'all_exacerbations_from_index_date.csv') + first_exac.to_pickle(output_file_path + 'community_managed_exacerbations_cohort_summary.pkl') + + +main() diff --git a/validation/event_tracking/model_e_survival_calculations.py b/validation/event_tracking/model_e_survival_calculations.py new file mode 100644 index 0000000000000000000000000000000000000000..051fa1945b23f320a1a74a7d21d9cf6dbc18121a --- /dev/null +++ b/validation/event_tracking/model_e_survival_calculations.py @@ -0,0 +1,80 @@ +# Import libraries +import pandas as pd + +# Set file paths +input_file_path = '/EXAMPLE_STUDY_DATA/' +output_file_path = '/summary_csv_files/' + + +def read_data(file): + """ + Read in data source + -------- + :param file: string filename + :return: dataframe + """ + df = pd.read_csv(file) + return df + + +def format_data_for_output(survival_data): + """ + Remove columns not needed for output + -------- + :param survival_data: dataframe containing date of death field + :return: above dataframe filtered to only contain columns required + for analysis/output + """ + survival_data = survival_data[['SafeHavenID', 'DOD']] + return survival_data + + +def filter_data(survival_data, date): + """ + Filter data to only include those alive on the index date for analysis + -------- + :param survival_data: dataframe containing date of death field + :param date: Index date in 'DD-MM-YYYY' format + :return: dataframe including only those alive on the index date for + analysis + """ + survival_data['DOD'] = pd.to_datetime(survival_data['DOD']) + return survival_data[survival_data['DOD'] >= date] + + +def calulate_days_survived(survival_data, date): + """ + Calculate days survived following the index date + -------- + :param survival_data: dataframe containing date of death field + :param date: Index date in 
'DD-MM-YYYY' format + 
:return: days survived from index date + """ + survival_data['index_date'] = date + survival_data['index_date'] = pd.to_datetime(survival_data['index_date']) + return (survival_data['DOD'] - survival_data['index_date']).dt.days + + +def main(): + # Read in data + survival_file = input_file_path + "Deaths_Cohort3R.csv" + survival_data = read_data(survival_file) + + # Drop duplicates + survival_data = survival_data.drop_duplicates() + + # Filter to only include Safehaven ID and date of death fields + survival_data = format_data_for_output(survival_data) + + # Filter data to only include deaths after the index date for analysis + survival_data = filter_data(survival_data, '01-01-2020') + + # Calculate days survived + survival_data['days_survived'] = calulate_days_survived(survival_data, + '01-01-2020') + + # Save data + survival_data.to_pickle(output_file_path + 'Survival_from_index.pkl') + + +main() diff --git a/validation/parameter_calculation/CAT_MRC_score_metrics_calculation.py b/validation/parameter_calculation/CAT_MRC_score_metrics_calculation.py new file mode 100644 index 0000000000000000000000000000000000000000..fb947e34d1f97cbd270d379a0beb14127b55a249 --- /dev/null +++ b/validation/parameter_calculation/CAT_MRC_score_metrics_calculation.py @@ -0,0 +1,117 @@ +# Import libraries +import pandas as pd + +# Set file paths +file_path = '/' +input_file_path = file_path + 'data_for_model_e_columns/' + + +def read_data(file): + """ + Read in data source + -------- + :param file: string filename + :return: dataframe + """ + df = pd.read_csv(file) + + return df + + +def format_data(data, IDs, onboard): + """ + Convert datetime columns to datetime format, add year censor column to onboarding data, + filter the data to only include RC and SU1 patients and then join the onboarding dates data to + the PRO scores dataframe + -------- + :param data: PRO scores dataframe + :param IDs: dataframe containing RC and SU1 IDs + :param onboard: dataframe containing onboarding dates + 
:return: formatted dataframe + """ + data['SubmissionTime'] = pd.to_datetime(data['SubmissionTime'], utc=True) + onboard['OB_date'] = pd.to_datetime(onboard['OB_date'], utc=True) + onboard['yearcensor'] = onboard['OB_date'] + pd.offsets.DateOffset(days=365) + data = pd.merge(IDs, data, on="Study_ID", how="left") + data = pd.merge(data, onboard, on="Study_ID", how="left") + return data + + +def filter_study_censor(data): + """ + Filter the dataframe to only contain data obtained before the study censor date + -------- + :param data: dataframe + :param col: datetime column used for filtering + :return: dataframe containing only data obtained before the study censor date + """ + return data[data['SubmissionTime'] < '2021-09-01'] + + +def filter_first_year(data): + """ + Filter the dataframe to only contain data obtained in the first year post-onboarding + -------- + :param data: dataframe + :return: dataframe containing only data obtained in the first year post-onboarding + """ + return data[data['yearcensor'] >= data['SubmissionTime']] + + +def median_max_score(data): + """ + Get the median and max score for each study ID in the dataframe + -------- + :param data: dataframe + :return: summary dataframe showing median and max scores for each study ID + """ + return data.groupby("Study_ID").Score.agg( + median_value='median', + max_value='max').copy() + + +def calculate_summary_data(data, typ): + """ + Calculate the average score up to the study censor date and a year + after onboarding for each study ID and save the resulting summary + dataframe as a csv file + -------- + :param data: dataframe + :param typ: string value to be input into file name showing what is summarised + """ + data_filter_censor = filter_study_censor(data) + summary_censor = median_max_score(data_filter_censor) + + data_year_censor = filter_first_year(data) + summary_year = median_max_score(data_year_censor) + + output_file_path = file_path + 'Average_' + typ + '_Scores_to_' + 
summary_censor.to_csv(output_file_path + 'censor.csv') + summary_year.to_csv(output_file_path + 'year.csv') + + +def main(): + + # Read in data + cat_file = input_file_path + "df_cat.csv" + mrc_file = input_file_path + "df_mrc.csv" + onboard_file = input_file_path + "onboarding_dates.csv" + IDs_file = input_file_path + "RC_SU1_IDs.csv" + + cat = read_data(cat_file) + mrc = read_data(mrc_file) + onboard = read_data(onboard_file) + RC_SU1_IDs = read_data(IDs_file) + + # Format data + cat = format_data(cat, RC_SU1_IDs, onboard) + mrc = format_data(mrc, RC_SU1_IDs, onboard) + + # Calculate and save summary CAT data to year and study censor dates for each ID + calculate_summary_data(cat, 'cat') + + # Calculate and save summary MRC data to year and study censor dates for each ID + calculate_summary_data(mrc, 'mrc') + + +main() \ No newline at end of file diff --git a/validation/parameter_calculation/Fitbit_groups_calculation.py b/validation/parameter_calculation/Fitbit_groups_calculation.py new file mode 100644 index 0000000000000000000000000000000000000000..1c18f80ff82284a3cdc0258c617b40222fa8455c --- /dev/null +++ b/validation/parameter_calculation/Fitbit_groups_calculation.py @@ -0,0 +1,43 @@ +# Import libraries +import functools as ft +import pandas as pd + +# Set file paths +file_path = '/' +input_file_path = file_path + 'data_for_model_e_columns/' + + +def read_data(file): + """ + Read in data source + -------- + :param file: string filename + :return: dataframe + """ + df = pd.read_csv(file) + return df + + +def main(): + # Read in data + RC_SU1_IDs_file = input_file_path + "RC_SU1_IDs.csv" + steps_file = input_file_path + "step_groupings.csv" + hr_file = input_file_path + "hr_groupings.csv" + awake_asleep_file = input_file_path + "awake_asleep_groupings.csv" + steps_2000_file = input_file_path + "steps_2000.csv" + + RC_SU1_IDs = read_data(RC_SU1_IDs_file) + Steps = read_data(steps_file) + hr_file = read_data(hr_file) + awake_asleep = read_data(awake_asleep_file) 
+ steps_2000 = read_data(steps_2000_file) + + # Merge groupings columns and RC_IDs + dfs = [RC_SU1_IDs, Steps, hr_file, awake_asleep, steps_2000] + df_final = ft.reduce(lambda left, right: pd.merge(left, right, on='Study_ID', how="outer"), dfs) + + # Save this dataframe as a csv file + df_final.to_csv(file_path + 'Fitbit_groups.csv') + + +main() \ No newline at end of file diff --git a/validation/parameter_calculation/GOLD_grade_GOLD_group_calculation.py b/validation/parameter_calculation/GOLD_grade_GOLD_group_calculation.py new file mode 100644 index 0000000000000000000000000000000000000000..4631a59a84c1def5f2de4d4edad48ee9e74a4117 --- /dev/null +++ b/validation/parameter_calculation/GOLD_grade_GOLD_group_calculation.py @@ -0,0 +1,84 @@ +# Import libraries +from numpy import isnan +import pandas as pd + +# Set file paths +file_path = '/' +input_file_path = file_path + 'data_for_model_e_columns/' + + +def read_data(file): + """ + Read in data source + -------- + :param file: string filename + :return: dataframe + """ + df = pd.read_csv(file) + return df + + +def GOLD_grade(data): + """ + Calculate GOLD grade for COPD classification using FEV1% + -------- + :param data: dataframe containing FEV1% column + :return: GOLD grade values based on if else statement + """ + if (data['FEV1%'] >= 80): + val = 'GOLD 1' + elif (data['FEV1%'] >= 50) & (data['FEV1%'] < 80): + val = 'GOLD 2' + elif (data['FEV1%'] >= 30) & (data['FEV1%'] < 50): + val = 'GOLD 3' + else: + val = 'GOLD 4' + return val + + +def GOLD_group(data): + """ + Calculate GOLD group from admissions data, exacerbations data, and CAT data + -------- + :param data: dataframe containing CAT, exacerbations, and admissions data + :return: GOLD group values based on if else statement + """ + # '&' binds tighter than '|', so the risk terms are grouped explicitly + if (data['CAT_baseline'] >= 10) & ((data['Prior_Ad'] > 0) | (data['exac_prev_year'] > 1)): + val = 'GOLD group D' + elif (data['CAT_baseline'] < 10) & 
((data['Prior_Ad'] > 0) | (data['exac_prev_year'] > 1)): + val = 'GOLD group C' + 
elif (data['CAT_baseline'] >= 10) & ((data['Prior_Ad'] == 0) | (data['exac_prev_year'] < 2) | isnan(data['exac_prev_year'])): + val = 'GOLD group B' + else: + val = 'GOLD group A' + return val + + +def apply_if_else(data, condition): + """ + Apply the criteria of an if else statement to all rows + -------- + :param data: dataframe + :param condition: function containing the if else statement, applied to each row + :return: dataframe with column based on if else statement + """ + return data.apply(condition, axis=1) + + +def main(): + # Read data + RC_SU1_characteristics_file = input_file_path + "Cohort_characteristics_data_RC_SU.csv" + RC_SU1_characteristics_data = read_data(RC_SU1_characteristics_file) + + # Remove columns that are not required for calculating GOLD criteria + GOLD_data = RC_SU1_characteristics_data[['ID', 'FEV1%', 'CAT_baseline', 'Prior_Ad', 'exac_prev_year']].copy() + + # Create new columns showing the GOLD grade and GOLD group of each study participant + GOLD_data['GOLD grade'] = apply_if_else(GOLD_data, GOLD_grade) + GOLD_data['GOLD group'] = apply_if_else(GOLD_data, GOLD_group) + + # Save data + GOLD_data.to_csv(file_path + 'GOLD_data.csv') + + +main() \ No newline at end of file diff --git a/validation/parameter_calculation/NIV_parameters_calculation.py b/validation/parameter_calculation/NIV_parameters_calculation.py new file mode 100644 index 0000000000000000000000000000000000000000..353a9bb386022f196bb18c22feb1e1bffa90cf0e --- /dev/null +++ b/validation/parameter_calculation/NIV_parameters_calculation.py @@ -0,0 +1,115 @@ +# Import libraries +import pandas as pd + +# Set file paths +file_path = '/' +input_file_path = file_path + 'data_for_model_e_columns/' + + +def read_data(file): + """ + Read in data source + -------- + :param file: string filename + :return: dataframe + """ + df = pd.read_csv(file) + return df + + +def format_data(data, IDs, onboard): + """ + Convert datetime columns to datetime format, filter to only include RECEIVER and scale up IDs, + and join onboarding dates + -------- + 
:param data: NIV dataframe + :param IDs: dataframe containing Study IDs + :param onboard: dataframe containing onboarding dates + :return: formatted dataframe + """ + data = data[['Study_ID', 'ie_ratio_value_50', 'ie_ratio_value_95', + 'ie_ratio_maximum_value', 'resp_events_AHI', + 'resp_events_HI', 'Stop_time', 'Start_time']] + data['Stop_time'] = pd.to_datetime(data['Stop_time']) + onboard['OB_date'] = pd.to_datetime(onboard['OB_date']) + onboard['yearcensor'] = onboard['OB_date'] + pd.offsets.DateOffset(days=365) + data = pd.merge(IDs, data, on="Study_ID", how="left") + data = pd.merge(data, onboard, on="Study_ID", how="left") + return data + + +def filter_study_censor(data): + """ + Filter the dataframe to only contain data obtained before the study censor date + -------- + :param data: dataframe + :return: dataframe containing data obtained before the study censor date + """ + return data[data['Stop_time'] < '2021-09-01'] + + +def filter_first_year(data): + """ + Filter the dataframe to only contain data obtained in the first year post-onboarding + -------- + :param data: dataframe + :return: dataframe containing only data obtained in the first year post-onboarding + """ + return data[data['yearcensor'] >= data['Stop_time']] + + +def mean_max_summary(data, col): + """ + Create a dataframe showing mean and max values per group + -------- + :param data: dataframe + :param col: parameter to group on + :return: summary dataframe showing mean and max scores for each study ID + """ + summary_metrics = ['mean', 'max', 'count'] + return data.groupby(col).agg( + {'ie_ratio_value_50': summary_metrics, + 'ie_ratio_value_95': summary_metrics, + 'ie_ratio_maximum_value': summary_metrics, + 'resp_events_AHI': summary_metrics, + 'resp_events_HI': summary_metrics}) + + +def calculate_summary_data(data): + """ + Calculate the average NIV parameters up to the study censor date and a year + after onboarding for each study ID and save the resulting summary + dataframe as a csv 
file + -------- + :param data: dataframe + :param typ: string value to be input into file name showing what is summarised + """ + data_filter_censor = filter_study_censor(data) + summary_censor = mean_max_summary(data_filter_censor, 'Study_ID') + + data_year_censor = filter_first_year(data) + summary_year = mean_max_summary(data_year_censor, 'Study_ID') + + output_file_path = file_path + 'NIV_ Average_parameters_to_' + summary_censor.to_csv(output_file_path + 'censor.csv') + summary_year.to_csv(output_file_path + 'year.csv') + + +def main(): + # Read data + NIV_data_file = input_file_path + "NIV_data_wrangled.csv" + onboard_file = input_file_path + "onboarding_dates.csv" + RC_SU1_IDs_file = input_file_path + "RC_SU1_IDs.csv" + + NIV_data = read_data(NIV_data_file) + onboard = read_data(onboard_file) + RC_SU1_IDs = read_data(RC_SU1_IDs_file) + + # Format data + NIV_data = format_data(NIV_data, RC_SU1_IDs, onboard) + + # Calculate and save summary NIV data to year and study censor dates for each ID + calculate_summary_data(NIV_data) + + +main() \ No newline at end of file diff --git a/validation/parameter_calculation/PRO_LOGIC_exacerbation_calculations.py b/validation/parameter_calculation/PRO_LOGIC_exacerbation_calculations.py new file mode 100644 index 0000000000000000000000000000000000000000..95e8fa778aef34518d72839c0207496051ef81ba --- /dev/null +++ b/validation/parameter_calculation/PRO_LOGIC_exacerbation_calculations.py @@ -0,0 +1,105 @@ +# Import libraries +import pandas as pd + +# Set file paths +file_path = '/' +input_file_path = file_path + 'data_for_model_e_columns/' + + +def read_data(file): + """ + Read in data source + -------- + :param file: string filename + :return: dataframe + """ + df = pd.read_csv(file) + return df + + +def format_data(data, IDs, onboard): + """ + Convert datetime columns to datetime format, filter to only include RECEIVER and scale up 1 IDs, + and join onboarding dates to exacerbations data for each study ID + -------- + :param 
data: exacerbations dataframe + :param IDs: dataframe containing RC and SU1 study IDs + :param onboard: dataframe containing onboarding dates + :return: formatted dataframe + """ + data['SubmissionTime'] = pd.to_datetime(data['SubmissionTime'], utc=True) + onboard['OB_date'] = pd.to_datetime(onboard['OB_date'], utc=True) + onboard['yearcensor'] = onboard['OB_date'] + pd.offsets.DateOffset(days=365) + data = pd.merge(IDs, data, on="Study_ID", how="left") + data = pd.merge(data, onboard, on="Study_ID", how="left") + return data + + +def filter_study_censor(data): + """ + Filter the dataframe to only contain data obtained before the study censor date + -------- + :param data: dataframe + :return: dataframe containing data obtained before the study censor date + """ + return data[data['SubmissionTime'] < '2021-09-01'] + + +def filter_first_year(data): + """ + Filter a dataframe to only contain data obtained in the first year post-onboarding + -------- + :param data: dataframe + :return: dataframe containing only data obtained in the first year post-onboarding + """ + return data[data['yearcensor'] >= data['SubmissionTime']] + + +def get_exac_data(data, onboard, IDs): + """ + Calculate the number of exacerbations to year censor and study censor + and the length of time to first exacerbation for each study ID and save the + resulting dataframe + -------- + :param data: formatted PRO LOGIC exacerbations data, censored within this function + at the study censor date and at a year post onboarding + :param onboard: Dataframe showing onboarding dates for the study participants + :param IDs: Dataframe containing all RC and SU1 study IDs + :return: dataframe showing exacerbation counts and the length of time to first exacerbation for each study ID + """ + censor_data = filter_study_censor(data) + year_censor_data = filter_first_year(data) + + censor_sum = censor_data.groupby("Study_ID").SubmissionTime.agg( + first_exacerbation='min', + 
exacerbation_count_to_censor='count').copy() + censor_sum = pd.merge(censor_sum, onboard, on="Study_ID", how="outer") + censor_sum["days_to_first_exacerbation"] = (censor_sum["first_exacerbation"] - censor_sum["OB_date"]).dt.days + + year_censor_sum = year_censor_data.groupby("Study_ID").SubmissionTime.agg( + exacerbation_count_to_year='count').copy() + + PRO_LOGIC_exacerbation_data = pd.merge(censor_sum, year_censor_sum, on="Study_ID", how="outer") + PRO_LOGIC_exacerbation_data = pd.merge(IDs, PRO_LOGIC_exacerbation_data, on="Study_ID", how="left") + + PRO_LOGIC_exacerbation_data.to_csv(file_path + 'PRO_LOGIC_exacerbation_data.csv') + + +def main(): + # Read data + PRO_LOGIC_data = input_file_path + "PRO_LOGIC_exacerbations_and_dates.csv" + RC_SU1_IDs_data_file = input_file_path + "RC_SU1_IDs.csv" + onboard_file = input_file_path + "onboarding_dates.csv" + + PRO_LOGIC_data = read_data(PRO_LOGIC_data) + RC_SU1_IDs = read_data(RC_SU1_IDs_data_file) + Onboard = read_data(onboard_file) + + # Format data + PRO_LOGIC_data = format_data(PRO_LOGIC_data, RC_SU1_IDs, Onboard) + + # Calculate and save summary exacerbation data to year and study censor dates for each ID + get_exac_data(PRO_LOGIC_data, Onboard, RC_SU1_IDs) + + +main() \ No newline at end of file diff --git a/validation/parameter_calculation/README.md b/validation/parameter_calculation/README.md new file mode 100644 index 0000000000000000000000000000000000000000..9bd6c016ba434ef740852127f319ccf99df38814 --- /dev/null +++ b/validation/parameter_calculation/README.md @@ -0,0 +1,27 @@ +# Model E Validation +'In order to support model E validation, analysis was conducted to determine where RECEIVER and Scale up patients were seen in the clusters created by model E. To understand the risk profile of the RECEIVER and Scale up cohorts work was carried out to compile characteristics associated with risk for both cohorts. 
These files show the work that was done to obtain some key metrics associated with baseline and ongoing risk amongst the study cohorts.' + +# Structure +'CAT_MRC_score_metrics_calculation.py' +'Calculating the median and max CAT and MRC scores for the RECEIVER and Scale up 1 cohorts to study censor and to a year after onboarding' + +'Fitbit_groups_calculation.py' +'Formatting the Fitbit groups data for the RECEIVER and Scale up 1 cohorts' + +'GOLD_grade_GOLD_group_calculation.py' +'Calculating the GOLD grade and GOLD group for the RECEIVER and Scale up 1 cohorts' + +'NIV_parameters_calculation.py' +'Calculating the mean and max values for IE ratio, apnoea hypopnea index, and hypopnea index for NIV users amongst the RECEIVER and Scale up 1 cohorts' + +'PRO_LOGIC_exacerbation_calculations.py' +'Calculating the number of PRO LOGIC exacerbations to study censor and to a year after onboarding and calculating the time to first PRO LOGIC exacerbation for the RECEIVER and Scale up 1 cohorts' + +'Time_to_death_calculation.py' +'Calculating the number of days survived for those who died during the study period in the RECEIVER and Scale up 1 cohorts' + +'Time_to_first_admission_calculations.py' +'Calculating the number of days until first admission for those who had an admission during the study period in the RECEIVER and Scale up 1 cohorts' + +'Time_to_first_event_calculations.py' +'Calculating the number of days until a) first event (PRO LOGIC exacerbation, admission, death) and b) first admission or death for those who met these criteria during the study period in the RECEIVER and Scale up 1 cohorts' \ No newline at end of file diff --git a/validation/parameter_calculation/Time_to_death_calculation.py b/validation/parameter_calculation/Time_to_death_calculation.py new file mode 100644 index 0000000000000000000000000000000000000000..85ba6ba5a67334c3a7e6bfe17a602b8cb81c9246 --- /dev/null +++ b/validation/parameter_calculation/Time_to_death_calculation.py @@ -0,0 +1,68 @@ +# 
Import libraries +import pandas as pd +import numpy as np + +# Set file paths +file_path = '/' +input_file_path = file_path + 'data_for_model_e_columns/' + + +def read_data(file): + """ + Read in data source + -------- + :param file: string filename + :return: dataframe + """ + df = pd.read_csv(file) + return df + + +def format_data(onboard, IDs): + """ + Convert datetime columns to datetime format, filter to only include RECEIVER and scale up IDs, + and add Date of death column + -------- + :param onboard: dataframe containing onboarding dates + :param IDs: dataframe containing IDs of interest + :return: formatted dataframe + """ + onboard['OB_date'] = pd.to_datetime(onboard['OB_date']) + onboard['censor'] = pd.to_datetime(onboard['censor']) + onboard = pd.merge(IDs, onboard, on="Study_ID", how="left") + conditions_DOD = [onboard['censor'] != '2021-08-31'] + values_DOD = [onboard['censor'].dt.date] + onboard['DOD'] = np.select(conditions_DOD, values_DOD, default=None) + onboard['DOD'] = pd.to_datetime(onboard['DOD']) + return onboard + + +def calculate_survival(onboard, date_of_death, OB_date): + """ + Calculate days from onboarding to date of death for those who died over the course of the RECEIVER study + and save the dataframe + -------- + :param onboard: dataframe containing onboarding and date of death data + :param date_of_death: name of the datetime column showing date of death + :param OB_date: name of the datetime column showing onboarding date + """ + onboard['days'] = (onboard[date_of_death] - onboard[OB_date]).dt.days + onboard.to_csv(file_path + 'Time_to_death_for_cohorts.csv') + + +def main(): + # Read in data + onboard_file = input_file_path + "onboarding_dates.csv" + RC_SU1_IDs_file = input_file_path + "RC_SU1_IDs.csv" + + onboard = read_data(onboard_file) + RC_SU1_IDs = read_data(RC_SU1_IDs_file) + + # Format data + onboard = format_data(onboard, RC_SU1_IDs) + + # Calculate days alive following onboarding and save the dataframe + calculate_survival(onboard, 'DOD', 
'OB_date') + + +main() \ No newline at end of file diff --git a/validation/parameter_calculation/Time_to_first_admission_calculations.py b/validation/parameter_calculation/Time_to_first_admission_calculations.py new file mode 100644 index 0000000000000000000000000000000000000000..32e6670da83a3a559b4befb2160be4f06bdd703a --- /dev/null +++ b/validation/parameter_calculation/Time_to_first_admission_calculations.py @@ -0,0 +1,69 @@ +# Import libraries +import pandas as pd + +# Set file paths +file_path = '/' +input_file_path = file_path + 'data_for_model_e_columns/' + + +def read_data(file): + """ + Read in data source + -------- + :param file: string filename + :return: dataframe + """ + df = pd.read_csv(file) + return df + + +def format_data(data, IDs, onboard): + """ + Convert datetime columns to datetime format, remove additional columns, + filter to only include RECEIVER and scale up IDs, and join onboarding dates to admissions data + -------- + :param data: admissions dataframe + :param IDs: dataframe containing RC and SU1 study IDs + :param onboard: dataframe containing onboarding dates + :return: formatted dataframe + """ + data['admitted_1'] = pd.to_datetime(data['admitted_1'], utc=True) + onboard['OB_date'] = pd.to_datetime(onboard['OB_date'], utc=True) + data = data[['Study_ID', 'admitted_1']] + onboard = onboard[['Study_ID', 'OB_date']] + data = pd.merge(IDs, data, on="Study_ID", how="left") + data = pd.merge(data, onboard, on="Study_ID", how="left") + return data + + +def time_to_admission(data, date_of_admission, OB_date): + """ + Calculate days from onboarding to first admission for those who had an admission in the study period + -------- + :param data: dataframe containing onboarding and admissions data + :param date_of_admission: datetime column showing date of first admission + :param OB_date: datetime column showing onboarding dates + :return: dataframe with additional column showing number of days to first admission for those who had an admission + """ + 
data['days'] = (data[date_of_admission] - data[OB_date]).dt.days + data.to_csv(file_path + 'Days_to_first_admission.csv') + + +def main(): + # Read data + admissions_data_file = input_file_path + "admissions_data_up_to_31082021.csv" + onboard_file = input_file_path + "onboarding_dates.csv" + RC_SU1_IDs_file = input_file_path + "RC_SU1_IDs.csv" + + admissions_data = read_data(admissions_data_file) + onboard = read_data(onboard_file) + RC_SU1_IDs = read_data(RC_SU1_IDs_file) + + # Format data + admissions_onboard = format_data(admissions_data, RC_SU1_IDs, onboard) + + # Determine time to first admission for each ID and save the dataframe + time_to_admission(admissions_onboard, 'admitted_1', 'OB_date') + + +main() \ No newline at end of file diff --git a/validation/parameter_calculation/Time_to_first_event_calculations.py b/validation/parameter_calculation/Time_to_first_event_calculations.py new file mode 100644 index 0000000000000000000000000000000000000000..48b498d97e0b8d68ff3e44ed2217a32110e2df0f --- /dev/null +++ b/validation/parameter_calculation/Time_to_first_event_calculations.py @@ -0,0 +1,94 @@ +# Import libraries +import functools as ft +import numpy as np +import pandas as pd + +# Set file paths +file_path = '/' +input_file_path = file_path + 'data_for_model_e_columns/' + + +def read_data(file): + """ + Read in data source + -------- + :param file: string filename + :return: dataframe + """ + df = pd.read_csv(file) + return df + + +def format_data(exacerbations_data, admissions_data, onboard, IDs): + """ + Remove unnecessary columns from dataframes, + merge onboarding, admissions, and exacerbations dataframes, + convert datetime columns to datetime format, + filter to include only RECEIVER and scale up 1 IDs, + and create new column showing date of death for those who died during the study + -------- + :param exacerbations_data: dataframe containing exacerbations data + :param admissions_data: dataframe containing admissions data + :param IDs: dataframe 
containing RECEIVER and scale up 1 study IDs + :param onboard: dataframe containing onboarding dates + :return: formatted dataframe + """ + admissions_data = admissions_data[['Study_ID', 'admitted_1']] + exacerbations_data = exacerbations_data[['Study_ID', 'first_exacerbation']] + + dfs = [onboard, exacerbations_data, admissions_data] + df_combined = ft.reduce(lambda left, right: pd.merge(left, right, on='Study_ID', how="outer"), dfs) + data = pd.merge(IDs, df_combined, on="Study_ID", how="left") + + data['first_exacerbation'] = pd.to_datetime(data['first_exacerbation']) + data['admitted_1'] = pd.to_datetime(data['admitted_1']) + data['OB_date'] = pd.to_datetime(data['OB_date']) + data['censor'] = pd.to_datetime(data['censor']) + + conditions_DOD = [data['censor'] != '2021-08-31'] + values_DOD = [data['censor'].dt.date] + data['DOD'] = np.select(conditions_DOD, values_DOD, default=None) + data['DOD'] = pd.to_datetime(data['DOD']) + return data + + +def time_to_events(data): + """ + Calculate time to first event (exacerbation, admission, or death) and first admission or death + for each study ID and save the summary dataframe + -------- + :param data: dataframe containing admissions data, exacerbations data, and onboarding dates + :return: dataframe with additional columns showing number of days until first event and number of days + to first admission/ death + """ + data['first_event'] = data[["admitted_1", "first_exacerbation", "DOD"]].min(axis=1) + data['first_event'] = pd.to_datetime(data['first_event']) + data['first_admission_or_death'] = data[["admitted_1", "DOD"]].min(axis=1) + data['first_admission_or_death'] = pd.to_datetime(data['first_admission_or_death']) + + data['days_to_first_event'] = (data['first_event'] - data['OB_date']).dt.days + data['days_to_first_admission_death'] = (data['first_admission_or_death'] - data['OB_date']).dt.days + + data.to_csv(file_path + 'Time_to_first_event.csv') + + +def main(): + # Read data + PRO_LOGIC_data = 
input_file_path + "First_exacerbation_data.csv" + admissions_data_file = input_file_path + "admissions_data_up_to_31082021.csv" + RC_SU1_IDs_data_file = input_file_path + "RC_SU1_IDs.csv" + onboard_file = input_file_path + "onboarding_dates.csv" + + PRO_LOGIC_data = read_data(PRO_LOGIC_data) + admissions_data = read_data(admissions_data_file) + RC_SU1_IDs = read_data(RC_SU1_IDs_data_file) + Onboard = read_data(onboard_file) + + # Format data + RC_combined_data = format_data(PRO_LOGIC_data, admissions_data, Onboard, RC_SU1_IDs) + + # Calculate time to first event for each study ID and save the summary dataframe + time_to_events(RC_combined_data) + + +main() \ No newline at end of file diff --git a/validation/parameter_calculation/__init__.py b/validation/parameter_calculation/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..cd5026e4a1076584bc8ccbda0acf1605dbfafa11 --- /dev/null +++ b/validation/parameter_calculation/__init__.py @@ -0,0 +1 @@ +# Empty file for folder to be recognised as module diff --git a/validation/risk_score_calculation/__init__.py b/validation/risk_score_calculation/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..cd5026e4a1076584bc8ccbda0acf1605dbfafa11 --- /dev/null +++ b/validation/risk_score_calculation/__init__.py @@ -0,0 +1 @@ +# Empty file for folder to be recognised as module diff --git a/validation/risk_score_calculation/combined_risk_score_RC_SU1.py b/validation/risk_score_calculation/combined_risk_score_RC_SU1.py new file mode 100644 index 0000000000000000000000000000000000000000..2a6c164e94a16e6ce0a657eff0b9c0eefd773147 --- /dev/null +++ b/validation/risk_score_calculation/combined_risk_score_RC_SU1.py @@ -0,0 +1,218 @@ +# Import libraries +import numpy as np +import pandas as pd + +# Set file paths +file_path = '/' +input_file_path = file_path + 'data_for_model_e_columns/' + + +def read_data(file): + """ + Read in data source + -------- + :param file: string filename + 
:return: dataframe + """ + df = pd.read_csv(file) + return df + + +# Conditions FEV1 percentage predicted component +def FEV1_conditions(data, col): + """ + Calculate FEV1 percentage predicted component for risk score + -------- + :param data: dataframe containing FEV1% predicted data + :param col: column containing FEV1% predicted data + :return: FEV1 risk score component + """ + data_col = data[col] + conditions_FEV1 = [data_col == "GOLD 4", data_col == "GOLD 3", + data_col == "GOLD 2", data_col == "GOLD 1"] + values_FEV1 = [1, 0.67, 0.33, 0] + return np.select(conditions_FEV1, values_FEV1, None) + + +# Conditions home NIV and home oxygen component +def NIV_oxygen_conditions(data): + """ + Calculate NIV/ oxygen user (yes/no) component for risk score + -------- + :param data: dataframe containing NIV and home oxygen usage data + :return: NIV/ home oxygen user risk score component + """ + conditions_NIV = [(data['NIV_user'] == 1) | (data['homeoxygen'] == 1), + (data['NIV_user'] == '0') & (data['homeoxygen'] == '0')] + values_NIV = [1, 0] + return np.select(conditions_NIV, values_NIV) + + +# Conditions number of comorbidities component +def comorbidities_conditions(data, col): + """ + Calculate comorbidities component for risk score + -------- + :param data: dataframe containing comorbidities data + :param col: column containing comorbidities data + :return: Comorbidities risk score component + """ + data_col = data[col] + conditions_comorbidities = [data_col >= 3, data_col.between(1, 2), data_col == 0] + values_comorbidities = [1, 0.5, 0] + return np.select(conditions_comorbidities, values_comorbidities, None) + + +# Conditions baseline CAT component +def baseline_CAT_conditions(data, col): + """ + Calculate baseline CAT score component for risk score + -------- + :param data: dataframe containing baseline CAT score data + :param col: column containing baseline CAT score data + :return: Baseline CAT risk score component + """ + data_col = data[col] + 
conditions_baseline_CAT = [data_col >= 30, data_col.between(20, 29), + data_col.between(10, 19), data_col.between(0, 9)] + values_baseline_CAT = [1, 0.67, 0.33, 0] + return np.select(conditions_baseline_CAT, values_baseline_CAT, None) + + +# Conditions baseline MRC component +def baseline_MRC_conditions(data, col): + """ + Calculate baseline MRC score component for risk score + -------- + :param data: dataframe containing baseline MRC score data + :param col: column containing baseline MRC score data + :return: Baseline MRC risk score component + """ + data_col = data[col] + conditions_baseline_MRC = [data_col.between(4.5, 5), + data_col.between(3.5, 4.4), + data_col.between(2.5, 3.4), + data_col.between(1.5, 2.4), + data_col < 1.5] + values_baseline_MRC = [1, 0.75, 0.5, 0.25, 0] + return np.select(conditions_baseline_MRC, values_baseline_MRC, None) + + +# Conditions previous year admissions component +def previous_admissions_conditions(data, col): + """ + Calculate previous admissions component for risk score + -------- + :param data: dataframe containing admissions in previous year data + :param col: column containing admissions in previous year data + :return: Previous admissions risk score component + """ + data_col = data[col] + condition_prev_ad = [data_col >= 6, data_col.between(3, 5), + data_col.between(1, 2), data_col == 0] + value_prev_ad = [1, 0.67, 0.33, 0] + return np.select(condition_prev_ad, value_prev_ad, None) + + +# Conditions previous year occupied bed days component +def previous_OBD_conditions(data, col): + """ + Calculate previous OBDs component for risk score + -------- + :param data: dataframe containing OBDs in previous year data + :param col: column containing OBDs in previous year data + :return: Previous OBDs risk score component + """ + data_col = data[col] + conditions_prev_OBDs = [data_col >= 15, data_col.between(8, 14), + data_col.between(4, 7), data_col <= 3] + values_prev_OBDs = [1, 0.67, 0.33, 0] + return 
np.select(conditions_prev_OBDs, values_prev_OBDs, None) + + +# Conditions GOLD group component +def GOLD_group_conditions(data, col): + """ + Calculate GOLD group component for risk score + -------- + :param data: dataframe containing GOLD group data + :param col: column containing GOLD group data + :return: GOLD group risk score component + """ + data_col = data[col] + conditions_GOLD_group = [data_col == "GOLD group D", data_col == "GOLD group C", + data_col == "GOLD group B", data_col == "GOLD group A"] + values_GOLD_group = [1, 0.67, 0.33, 0] + return np.select(conditions_GOLD_group, values_GOLD_group, None) + + +# Conditions core comorbidities component +def core_comorb_conditions(data): + """ + Calculate core comorbidities component for risk score + -------- + :param data: dataframe containing core comorbidities data + :return: core comorbidities risk score component + """ + return (data['IHD'] | data['AF'] | data['HF'] | data['Bronchiectasis'] | data['OSA'] | data['PH']) + + +# Format data +def format_data(data): + """ + Remove unnecessary columns + -------- + :param data: dataframe + :return: dataframe containing only outcomes and risk score components + """ + data = data[["ID", "admissions_since_onboard", "365_days_post_admission", "OBDs_censor", + "censored_365_days", "died_before_end_of_study", "days_to_death", + "days_to_first_admission", "days_to_first_exac", "days_to_first_event", + "days_to_first_admission_death", "exacerbation_count_to_censor", + "exacerbation_count_to_year", "smoking_component", "FEV1_component", + "NIV_Oxygen_component", "comorbidities_component", "baseline_CAT_component", + "baseline_MRC_comoonent", "triple_therapy_component", + "prev_admissions_component", "prev_OBDs_component", "GOLD_group_component", + "core_comorb_component"]].copy() + return data + + +def determine_risk_score(data): + """ + Sum all the risk score components to get a combined risk score + both with and without comorbidity related components + -------- + :param 
data: dataframe with risk score component columns + :return: dataframe contaning summary columns showing combined risk score + """ + data['combined_risk_score'] = data.iloc[:, [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]].sum(axis=1) + data['combined_risk_score_no_comorbidities'] = data.iloc[:, [13, 14, 15, 17, 18, 19, 20, 21, 22]].sum(axis=1) + data.to_csv(file_path + 'risk_scores.csv') + + +def main(): + # Read data + RC_SU1_inputs_file = input_file_path + "Cohort_characteristics_data_RC_SU.csv" + risk = read_data(RC_SU1_inputs_file) + + # Get risk score component scores + risk['smoking_component'] = (risk.Smoking_status == 'Current').astype(int) + risk['FEV1_component'] = FEV1_conditions(risk, 'GOLD grade') + risk['NIV_Oxygen_component'] = NIV_oxygen_conditions(risk) + risk['comorbidities_component'] = comorbidities_conditions(risk, 'number_of_comorbidities') + risk['baseline_CAT_component'] = baseline_CAT_conditions(risk, 'CAT_baseline') + risk['baseline_MRC_comoonent'] = baseline_MRC_conditions(risk, 'MRC_baseline') + risk['triple_therapy_component'] = (risk.Triple_therapy).astype(int) + risk['prev_admissions_component'] = previous_admissions_conditions(risk, 'Prior_Ad') + risk['prev_OBDs_component'] = previous_OBD_conditions(risk, 'Prior_OBDs') + risk['GOLD_group_component'] = GOLD_group_conditions(risk, 'GOLD group') + risk['core_comorb_component'] = core_comorb_conditions(risk) + + # Remove_unnecessary columns + risk = format_data(risk) + + # Calculate combined risk score both with and without the comorbidities components + determine_risk_score(risk) + + +main() \ No newline at end of file diff --git a/validation/spirometry_scripts/spirometry_RC_SU_mapping.py b/validation/spirometry_scripts/spirometry_RC_SU_mapping.py new file mode 100644 index 0000000000000000000000000000000000000000..5ab9f7238e33a8971891fda7324507383862e7ca --- /dev/null +++ b/validation/spirometry_scripts/spirometry_RC_SU_mapping.py @@ -0,0 +1,94 @@ +""" +Map GOLD standard COPD groupings 
from REC/SUP IDs to SafeHavenIDs. +-------- +NB: Data contained within 'RC_SU1_spirometry_data.csv' has been created +from data within the teams space. +""" +import pandas as pd + + +# Set file paths +file_path = '/copd.model-e/' +input_file_path = file_path + 'training/src/data/' +output_file_path = '/Model_E_Extracts/rec_sup_spirometry_data.pkl' + + +def read_data(file): + """ + Read in data source + -------- + :param file: string filename + :return: dataframe + """ + df = pd.read_csv(file) + + return df + + +def calc_gold_grade(data): + """ + Calculate GOLD grade for COPD classification using FEV1% + -------- + :param data: dataframe containing FEV1% column + :return: GOLD grade values based on if else statement + """ + fev1 = data['FEV1%'] + if fev1 >= 80: + val = 'GOLD 1' + elif (fev1 >= 50) & (fev1 < 80): + val = 'GOLD 2' + elif (fev1 >= 30) & (fev1 < 50): + val = 'GOLD 3' + elif fev1 < 30: + val = 'GOLD 4' + else: + val = '' + + return val + + +def add_SH_mappings_for_RC_and_SU1(RC_IDs, SU1_IDs, spirometry_data): + """ + Join the SH ID mappings to the spirometry data for RC and SU1 + -------- + :param RC_IDs: dataframe containing RECEIVER - SH ID mappings + :param SU1_IDs: dataframe containing SU1 - SH ID mappings + :param spirometry_data: spirometry data for RC and SU1 + :return: RC and SU1 spirometry data with SH ID mapping columns + """ + receiver_IDs = RC_IDs.rename(columns={'RNo': 'StudyId'}) + scaleup_IDs = SU1_IDs.rename(columns={'Study_Number': 'StudyId'}) + all_service_IDs = pd.concat([receiver_IDs, scaleup_IDs], ignore_index=True) + spirometry_mappings = pd.merge( + spirometry_data, all_service_IDs, on="StudyId", how="left").dropna() + type_map = {'FEV1%': 'int32', 'SafeHavenID': 'int32'} + spirometry_mappings = spirometry_mappings.astype(type_map) + + return spirometry_mappings + + +def main(): + + # Read spirometry data + rec_sup_spiro_file = input_file_path + "RC_SU1_spirometry_data.csv" + rec_sup_spiro_data = 
read_data(rec_sup_spiro_file).dropna() + + # Create new column showing the GOLD grade of each study participant + rec_sup_spiro_data['GOLD grade'] = rec_sup_spiro_data.apply( + calc_gold_grade, axis=1) + + # Read RC and SU1 SafeHaven ID mapping files + rec_id_file = "/EXAMPLE_STUDY_DATA/Cohort3Rand.csv" + sup_id_file = "/SU_IDs/Scale_Up_lookup.csv" + rec_id_map_data = read_data(rec_id_file) + sup_id_map_data = read_data(sup_id_file) + + # Join spirometry data to SH mappings + mapped_data = add_SH_mappings_for_RC_and_SU1( + rec_id_map_data, sup_id_map_data, rec_sup_spiro_data) + + # Save data + mapped_data.to_pickle(output_file_path) + + +main()
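
The FEV1% thresholds that `calc_gold_grade` applies row by row could also be applied in a single vectorised pass. The following is only a sketch of that idea and is not part of the committed scripts: the helper name `gold_grade_vectorised` and the example values are hypothetical, and it assumes the same cut-offs as the script (below 30, 30 to below 50, 50 to below 80, and 80 or above). One difference from `calc_gold_grade` is that missing FEV1% values stay missing here instead of becoming an empty string.

```python
import numpy as np
import pandas as pd


def gold_grade_vectorised(fev1_pct: pd.Series) -> pd.Series:
    """Hypothetical helper: map FEV1 % predicted to GOLD grades in one pass."""
    # Bin edges mirror calc_gold_grade: <30, [30, 50), [50, 80), >=80.
    bins = [-np.inf, 30, 50, 80, np.inf]
    labels = ['GOLD 4', 'GOLD 3', 'GOLD 2', 'GOLD 1']
    # right=False closes each bin on the left, so 80 maps to GOLD 1 and 50 to GOLD 2.
    return pd.cut(fev1_pct, bins=bins, labels=labels, right=False)


# Made-up example values (not patient data), including the boundary cases.
example = pd.Series([85, 80, 65, 50, 42, 30, 12, np.nan], name='FEV1%')
grades = gold_grade_vectorised(example)
print(grades.tolist())
```

The result is a categorical series, so it may need `.astype(str)` or similar before being written out, depending on what downstream code expects.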