IamGrooooot committed
Commit 000de75 · 0 parent(s)

Initial Upload
.gitignore ADDED
@@ -0,0 +1,27 @@
+ # Folders for model cohort data, training data plots and logs
+ data/
+ training/logging/
+ plots/
+ tmp/
+ training/explore.ipynb
+
+ # Byte-compiled / optimized / DLL files
+ training/__pycache__/
+
+ # Environments
+ .venv
+ .venvdowhy
+
+ # VS Code
+ .vscode
+
+ # MLflow
+ mlruns/
+ mlruns.db
+
+ # CatBoost
+ catboost_info/
+
+ # DoWhy
+ training/dowhy_2.py
+ training/dowhy_example.py
MODEL_CARD.md ADDED
@@ -0,0 +1,263 @@
+ ---
+ language: en
+ license: apache-2.0
+ tags:
+ - healthcare
+ - ehr
+ - copd
+ - clinical-risk
+ - tabular
+ - scikit-learn
+ - xgboost
+ - lightgbm
+ - catboost
+ - patient-reported-outcomes
+ pipeline_tag: tabular-classification
+ library_name: sklearn
+ ---
+
+ # COPD Open Models — Model H (90-Day Exacerbation Prediction)
+
+ ## Model Details
+
+ Model H predicts the risk of a COPD exacerbation within **90 days** using features derived from NHS EHR datasets and patient-reported outcomes (PROs). It serves as both a production-grade prediction pipeline and a **template project** for building new COPD prediction models, featuring a reusable 2,000+ line core library (`model_h.py`) with end-to-end training, calibration, evaluation, and SHAP explainability.
+
+ ### Key Characteristics
+
+ - **Comprehensive PRO integration** — the most detailed PRO feature engineering in the portfolio, covering four instruments: EQ-5D (monthly), MRC dyspnoea (weekly), CAT (daily), and Symptom Diary (daily), each with engagement metrics, score differences, and multi-window aggregations.
+ - **13 algorithms screened** in the first phase, narrowed to the top 3 with Bayesian hyperparameter tuning.
+ - **Forward validation** on 9 months of prospective data (May 2023 – February 2024).
+ - **Reusable core library** (`model_h.py`) — 40+ functions for label setup, feature engineering, model evaluation, calibration, and SHAP explainability.
+ - Training code is fully decoupled from cloud infrastructure — it runs locally with no Azure dependencies.
+
+ > **Note:** This repository contains no real patient-level data. All included data files are synthetic or example data for pipeline validation.
+
+ ### Model Type
+
+ Traditional tabular ML classifiers (multiple candidate estimators; see "Training Procedure").
+
+ ### Release Notes
+
+ - **Phase 1 (current):** Models C, E, and H published as the initial "COPD Open Models" collection.
+ - **Phase 2 (planned):** Additional models may follow after codebase sanitisation.
+
+ ---
+
+ ## Intended Use
+
+ This model and code are published as **reference implementations** for research, education, and benchmarking on COPD prediction tasks.
+
+ ### Intended Users
+
+ - ML practitioners exploring tabular healthcare ML pipelines
+ - Researchers comparing feature engineering and evaluation approaches
+ - Developers building internal prototypes (non-clinical)
+
+ ### Out-of-Scope Uses
+
+ - **Not** for clinical decision-making, triage, diagnosis, or treatment planning.
+ - **Not** a substitute for clinical judgement or validated clinical tools.
+ - Do **not** deploy in healthcare settings without an appropriate regulatory, clinical safety, and information governance framework.
+
+ ### Regulatory Considerations (SaMD)
+
+ Regulatory status for software depends on the intended purpose expressed in documentation, labelling, and promotional materials. Downstream users integrating or deploying this model should determine whether their implementation qualifies as Software as a Medical Device (SaMD) and identify the legal "manufacturer" responsible for compliance and post-market obligations.
+
+ ---
+
+ ## Training Data
+
+ - **Source:** NHS EHR-derived datasets and Lenus COPD Service PRO data (training performed on controlled datasets; not distributed here).
+ - **Data available in this repo:** Synthetic/example datasets only.
+ - **Cohort:** COPD patients with RECEIVER and Scale-Up cohort membership.
+ - **Target:** Binary — `ExacWithin3Months` (hospital + community exacerbations) or `HospExacWithin3Months` (hospital only).
+ - **Configuration:** 90-day prediction window, 180-day lookback, 5-fold cross-validation.
+
+ ### Features
+
+ | Category | Features |
+ |----------|----------|
+ | **Demographics** | Age (binned: <50/50-59/60-69/70-79/80+), Sex_F |
+ | **Comorbidities** | AsthmaOverlap (binary), Comorbidities count (binned: None/1-2/3+) |
+ | **Exacerbation History** | Hospital and community exacerbation counts in the lookback, days since last exacerbation, recency-weighted counts |
+ | **Spirometry** | FEV1, FVC, FEV1/FVC ratio — max, min, and latest values |
+ | **Laboratory (20+ tests)** | MaxLifetime, MinLifetime, Max1Year, Min1Year, Latest values with recency weighting (decay_rate=0.001) — WBC, RBC, haemoglobin, haematocrit, platelets, sodium, potassium, creatinine, albumin, glucose, ALT, AST, GGT, bilirubin, ALP, cholesterol, triglycerides, TSH, and more |
+ | **EQ-5D (monthly)** | Q1–Q5, total score, latest values, engagement rates, score changes |
+ | **MRC Dyspnoea (weekly)** | MRC Score (1–5), latest value, engagement, variations |
+ | **CAT (daily)** | Q1–Q8, total score (0–40), latest values, engagement, score differences |
+ | **Symptom Diary (daily)** | Q5 rescue medication (binary), weekly aggregates, engagement rates |
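The card states that lab aggregates use recency weighting with `decay_rate=0.001`. A minimal sketch of one plausible form — exponential decay per day before the index date — is below; the function name and the exact weighting formula are assumptions, not taken from the pipeline:

```python
import numpy as np

def recency_weighted_mean(values, days_before_index, decay_rate=0.001):
    """Weight each lab result by exp(-decay_rate * days_before_index), so
    recent observations dominate old ones (assumed form; the model card
    only states 'recency weighting' with decay_rate=0.001)."""
    w = np.exp(-decay_rate * np.asarray(days_before_index, dtype=float))
    v = np.asarray(values, dtype=float)
    return float(np.average(v, weights=w))

# Three haemoglobin results taken 400, 30 and 2 days before the index date:
hb = recency_weighted_mean([118.0, 124.0, 131.0], [400, 30, 2])
```

With these inputs the result sits slightly closer to the two recent readings than an unweighted mean would.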
+
+ ### Data Preprocessing
+
+ 1. **Target encoding** — K-fold encoding with smoothing for categorical features (Age, Comorbidities, FEV1 severity, smoking status, etc.).
+ 2. **Imputation** — median/mean/mode imputation strategies, applied per fold.
+ 3. **Scaling** — MinMaxScaler to [0, 1], fit on the training fold only.
+ 4. **PRO LOGIC filtering** — a 14-day minimum between exacerbation episodes, with 2 consecutive negative Q5 responses required for borderline events (14–35 days apart).
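The 14-day episode rule can be sketched as follows. This is one plausible reading of the rule above (raw events closer than 14 days to the previous kept event are folded into the same episode); the additional consecutive-negative-Q5 check for borderline 14–35-day gaps is omitted:

```python
import pandas as pd

def merge_into_episodes(event_dates, min_gap_days=14):
    """Collapse raw exacerbation events into distinct episodes: an event
    within `min_gap_days` of the previous kept event is treated as part
    of the same episode (illustrative reading of the 14-day rule)."""
    dates = sorted(pd.to_datetime(d) for d in event_dates)
    episodes = []
    for d in dates:
        if episodes and (d - episodes[-1]).days < min_gap_days:
            continue  # same episode as the previous kept event
        episodes.append(d)
    return episodes

# Jan 5 merges into the Jan 1 episode; Feb 1 starts a new one
eps = merge_into_episodes(["2023-01-01", "2023-01-05", "2023-02-01"])
```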
+
+ ---
+
+ ## Training Procedure
+
+ ### Training Framework
+
+ - pandas, scikit-learn, imbalanced-learn, xgboost, lightgbm, catboost
+ - Hyperparameter tuning: scikit-optimize (BayesSearchCV)
+ - Explainability: SHAP (TreeExplainer)
+ - Experiment tracking: MLflow
+
+ ### Algorithms Evaluated
+
+ **First Phase (13 model types):**
+
+ | Algorithm | Library |
+ |-----------|---------|
+ | DummyClassifier (baseline) | sklearn |
+ | Logistic Regression | sklearn |
+ | Logistic Regression (balanced) | sklearn |
+ | Random Forest | sklearn |
+ | Random Forest (balanced) | sklearn |
+ | Balanced Random Forest | imblearn |
+ | Balanced Bagging | imblearn |
+ | XGBoost (7 variants) | xgboost |
+ | LightGBM (2 variants) | lightgbm |
+ | CatBoost | catboost |
+
+ **Hyperparameter Tuning Search Spaces:**
+
+ | Algorithm | Parameters Tuned |
+ |-----------|-----------------|
+ | Logistic Regression | penalty, class_weight, max_iter (50–300), C (0.001–10) |
+ | Random Forest | max_depth (4–10), n_estimators (70–850), min_samples_split (2–10), class_weight |
+ | XGBoost | max_depth (4–10), n_estimators (70–850) |
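The production search uses scikit-optimize's `BayesSearchCV`. As a dependency-light illustration of the Random Forest row of the table, here is the equivalent search space run through scikit-learn's `RandomizedSearchCV` instead (random rather than Bayesian sampling, synthetic data, and a tiny `n_iter`):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Search space mirroring the Random Forest row of the table above
space = {
    "max_depth": randint(4, 11),
    "n_estimators": randint(70, 851),
    "min_samples_split": randint(2, 11),
    "class_weight": [None, "balanced"],
}

# Small synthetic, class-imbalanced stand-in for the real cohort data
X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0), space,
    n_iter=5, cv=3, scoring="average_precision", random_state=0,
)
search.fit(X, y)
```

Swapping in `BayesSearchCV` keeps the same interface but replaces random draws with a surrogate-model-guided search.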
+
+ **Final Phase:** The top 3 models (Balanced Random Forest, XGBoost, Random Forest) were retrained with tuned hyperparameters.
+
+ ### Evaluation Design
+
+ - **5-fold** cross-validation with per-fold preprocessing.
+ - Metrics evaluated at a threshold of 0.5 and at the best-F1 threshold.
+ - Event-type breakdown: hospital vs. community exacerbations evaluated separately.
+ - **Forward validation:** 9 months of prospective data (May 2023 – February 2024), assessed with the KS test and Wasserstein distance for distribution shift.
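The best-F1 threshold search and the distribution-shift checks above can be sketched with scipy and scikit-learn. This is a minimal illustration, not the pipeline's own code; the random scores stand in for development-era and forward-validation predictions:

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, y_prob):
    """Return the probability cut-off maximising F1 (sketch of the
    'best-F1 threshold' evaluation; the pipeline's exact search may differ)."""
    prec, rec, thr = precision_recall_curve(y_true, y_prob)
    f1 = 2 * prec[:-1] * rec[:-1] / np.clip(prec[:-1] + rec[:-1], 1e-12, None)
    return thr[np.argmax(f1)], float(f1.max())

# Distribution-shift check between development-era and forward-validation scores
rng = np.random.default_rng(0)
dev_scores = rng.beta(2, 5, size=500)   # placeholder score distributions
fwd_scores = rng.beta(2, 4, size=300)
ks_stat, ks_p = ks_2samp(dev_scores, fwd_scores)
wd = wasserstein_distance(dev_scores, fwd_scores)
```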
+
+ ### Calibration
+
+ - **Sigmoid** (Platt scaling)
+ - **Isotonic** regression
+ - Applied via CalibratedClassifierCV with per-fold calibration.
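A minimal sketch of this calibration step on synthetic data (the base estimator and the Brier-score check are illustrative; the pipeline's own folds and estimators differ):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# method="sigmoid" is Platt scaling; method="isotonic" fits a monotone step function
calibrated = CalibratedClassifierCV(
    LogisticRegression(max_iter=200), method="sigmoid", cv=5
).fit(X_tr, y_tr)

probs = calibrated.predict_proba(X_te)[:, 1]
score = brier_score_loss(y_te, probs)  # lower is better calibrated
```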
+
+ ---
+
+ ## Evaluation Results
+
+ > Replace this section with measured results from your training run.
+
+ | Metric | Value | Notes |
+ |--------|-------|-------|
+ | ROC-AUC | TBD | Cross-validation mean (± std) |
+ | AUC-PR | TBD | Primary metric for the imbalanced outcome |
+ | F1 Score (@ 0.5) | TBD | Default threshold |
+ | Best F1 Score | TBD | At the optimal threshold |
+ | Balanced Accuracy | TBD | Cross-validation mean |
+ | Brier Score | TBD | Probability calibration quality |
+
+ ### Caveats on Metrics
+
+ - Performance depends heavily on cohort definition, PRO engagement rates, and label construction.
+ - Forward validation results may differ from cross-validation due to temporal shifts in data availability and coding practices.
+ - Reported metrics from controlled datasets may not transfer to other settings without recalibration and validation.
+
+ ---
+
+ ## Bias, Risks, and Limitations
+
+ - **Dataset shift:** EHR coding practices, PRO engagement, and population characteristics vary across sites and time periods.
+ - **PRO engagement bias:** Patients who engage more with digital health tools may differ systematically from non-engagers.
+ - **Label uncertainty:** Exacerbation events are constructed via PRO LOGIC — different definitions produce different results.
+ - **Fairness:** Outcomes and feature availability may vary by age, sex, deprivation, comorbidity burden, or service access.
+ - **Misuse risk:** Using predictions to drive clinical action without clinical safety processes can cause harm through false positives and negatives.
+
+ ---
+
+ ## How to Use
+
+ ### Pipeline Execution Order
+
+ ```bash
+ # 1. Install dependencies
+ pip install pandas numpy scikit-learn imbalanced-learn xgboost lightgbm catboost scikit-optimize shap mlflow matplotlib seaborn pyyaml joblib scipy
+
+ # 2. Set up labels (choose one)
+ python training/setup_labels_hosp_comm.py      # hospital + community exacerbations
+ python training/setup_labels_only_hosp.py      # hospital only
+ python training/setup_labels_forward_val.py    # forward validation set
+
+ # 3. Split data
+ python training/split_train_test_val.py
+
+ # 4. Feature engineering (run in sequence)
+ python training/process_demographics.py
+ python training/process_comorbidities.py
+ python training/process_exacerbation_history.py
+ python training/process_spirometry.py
+ python training/process_labs.py
+ python training/process_pros.py
+
+ # 5. Combine features
+ python training/combine_features.py
+
+ # 6. Encode and impute
+ python training/encode_and_impute.py
+
+ # 7. Screen algorithms
+ python training/cross_val_first_models.py
+
+ # 8. Hyperparameter tuning
+ python training/perform_hyper_param_tuning.py
+
+ # 9. Final cross-validation with the best models
+ python training/cross_val_final_models.py
+
+ # 10. Forward validation (optional)
+ python training/perform_forward_validation.py
+ ```
+
+ ### Configuration
+
+ Edit `config.yaml` to adjust:
+ - `prediction_window` (default: 90 days)
+ - `lookback_period` (default: 180 days)
+ - `model_type` (`hosp_comm` or `only_hosp`)
+ - `num_folds` (default: 5)
+ - Input/output data paths
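A minimal `config.yaml` matching the keys above might look like this (illustrative layout only; the actual file may structure or nest these keys differently):

```yaml
prediction_window: 90    # days
lookback_period: 180     # days
model_type: hosp_comm    # or: only_hosp
num_folds: 5
data_dir: ./data         # hypothetical key for input/output paths
```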
+
+ ### Core Library
+
+ `model_h.py` provides 40+ reusable functions for:
+ - PRO LOGIC exacerbation validation
+ - Recency-weighted feature engineering
+ - Model evaluation (F1, PR-AUC, ROC-AUC, Brier, calibration curves)
+ - SHAP explainability (summary, local, interaction, decision plots)
+ - Calibration (sigmoid, isotonic, spline)
+
+ ---
+
+ ## Environmental Impact
+
+ Training computational requirements are minimal — all models are traditional tabular ML classifiers running on CPU. A full pipeline run (feature engineering through cross-validation) completes in minutes on a standard laptop.
+
+ ---
+
+ ## Citation
+
+ If you use this model or code, please cite:
+
+ - This repository: *(add citation format / Zenodo DOI if minted)*
+ - Associated publications: *(clinical trial results paper — forthcoming)*
+
+ ## Authors and Contributors
+
+ - **Storm ID** (maintainers)
+
+ ## License
+
+ This model and code are released under the **Apache 2.0** license.
README.md ADDED
@@ -0,0 +1,263 @@
+ *(File contents are identical to MODEL_CARD.md above.)*
lookups/Training_20221125_labs_lookup.csv ADDED
@@ -0,0 +1,23 @@
+ ClinicalCodeDescription,RefUnit,DataQ10,DataQ50,DataQ90,DataMean,DataStd
+ ALT,u/l,8.0,17.0,45.0,30.243673995141023,113.66715870850466
+ AST,u/l,12.0,20.0,45.0,33.897970974873104,172.29514450554794
+ Albumin,g/l,23.0,33.0,40.0,32.143139689111685,6.696146405869818
+ Alkaline Phosphatase,u/l,61.0,93.0,172.0,115.04396818241,112.95974512695292
+ Basophils,10^9/l,0.0,0.01,0.1,0.04231942316043947,0.0731083879073512
+ C Reactive Protein,mg/l,3.0,24.0,153.0,54.79311137263134,72.88535292424446
+ Calcium,mmol/l,1.99,2.27,2.47,2.248002582795482,0.1994094643731489
+ Chloride,mmol/l,95.0,103.0,108.0,102.1250032055317,5.439721834421554
+ Cholesterol,mmol/l,3.3,4.5,6.3,4.702769260607771,1.2336072058201566
+ Eosinophils,10^9/l,0.0,0.14,0.4,0.1964911476952577,0.24370462126922973
+ Estimated GFR,ml/min,34.0,60.0,60.0,53.78249759753863,12.453638969866372
+ Glucose,mmol/l,4.6,6.1,10.8,7.236088370730932,3.9897001667960987
+ Haematocrit,l/l,0.298,0.381,0.455,0.3790491791553389,0.062262607997835374
+ Haemoglobin,g/l,94.0,123.0,149.0,122.11866599070132,21.51488967036221
+ Lymphocytes,10^9/l,0.7,1.5,2.8,1.7822859737658994,4.080028422225104
+ Mean Cell Volume,fl,84.7,93.1,102.2,93.31201943900349,7.3459980963707805
+ Monocytes,10^9/l,0.4,0.7,1.2,0.7565321844650618,0.5511385273529443
+ Neutrophils,10^9/l,3.1,5.8,11.5,6.777581745391032,4.192513421557571
+ Platelet Count,10^9/l,156.0,264.0,423.0,280.4851990035415,117.75695119538601
+ Sodium,mmol/l,132.0,138.0,142.0,137.67649793633612,4.375384346285572
+ Total Bilirubin,umol/l,4.0,8.0,19.0,11.707785816461053,18.33306195550475
+ White Blood Count,10^9/l,5.3,8.6,14.5,9.597243934380035,8.600370594737681
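One plausible use of this lookup is standardising raw lab values against the cohort mean and standard deviation; how the training pipeline actually consumes the file is an assumption. A sketch with two rows copied inline from the table above:

```python
import io
import pandas as pd

# Two rows copied from the lookup above, kept inline so the sketch is self-contained
lookup_csv = """ClinicalCodeDescription,RefUnit,DataQ10,DataQ50,DataQ90,DataMean,DataStd
Sodium,mmol/l,132.0,138.0,142.0,137.67649793633612,4.375384346285572
Haemoglobin,g/l,94.0,123.0,149.0,122.11866599070132,21.51488967036221
"""
lookup = pd.read_csv(io.StringIO(lookup_csv)).set_index("ClinicalCodeDescription")

def z_score(test_name, value):
    """Standardise a lab result against the cohort mean/std in the lookup
    (illustrative; the pipeline's actual use of this file may differ)."""
    row = lookup.loc[test_name]
    return (value - row["DataMean"]) / row["DataStd"]

low_sodium = z_score("Sodium", 130.0)  # negative: below the cohort mean
```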
training/check_index_date_dist.py ADDED
@@ -0,0 +1,149 @@
1
+ """
2
+ Using a master random seed, create random seeds to iterate through. These random seeds are
3
+ then used to assign a different random seed to every patient to avoid similar index dates
4
+ being generated amongst patients who were enrolled to the service at the same time.
5
+ Histograms were checked and the general random seed that provided the most uniform monthly
6
+ distribution of index dates was chosen for the final index date generation in
7
+ setup_labels_hosp_comm.py and setup_labels_only_hosp.py
8
+ """
9
+
10
+ import numpy as np
11
+ import pandas as pd
12
+ from datetime import timedelta
13
+ import matplotlib.pyplot as plt
14
+ import random
15
+ import os
16
+
17
+ # Open patient details file to calculate index dates
18
+ patient_details = pd.read_csv("./data/pat_details_to_calc_index_dt.csv")
19
+ exac_data = pd.read_pickle("./data/hosp_comm_exacs.pkl")
20
+
21
+ pat_details_date_cols = ["EarliestIndexDate", "EarliestIndexAfterGap"]
22
+ for col in pat_details_date_cols:
23
+ patient_details[col] = pd.to_datetime(patient_details[col])
24
+
25
+
26
+ # Using a master seed, generate random seeds to use for iterations
27
+ master_seed = 42
28
+ random.seed(master_seed)
29
+ general_seeds = random.sample(range(0, 2**32), 200)
30
+
31
+ # For each iteration, use a different general seed to create random seeds for each patient
32
+ for general_seed in general_seeds:
33
+ random.seed(general_seed)
34
+
35
+ # Create different random seeds for each patient
36
+ patient_details["RandomSeed"] = random.sample(
37
+ range(0, 2**32), patient_details.shape[0]
38
+ )
39
+
40
+ # Create random index dates for each patient based on their random seed
41
+ full_index_date_df = pd.DataFrame()
42
+ for index, row in patient_details.iterrows():
43
+ excluded_index_dates_close_to_exac = False
+ iter_num = 0
+ while excluded_index_dates_close_to_exac is False:
+     np.random.seed(row["RandomSeed"] + iter_num)
+     rand_days_dict = {}
+     rand_date_dict = {}
+
+     # Generate random number of days
+     rand_days_dict[row["StudyId"]] = np.random.choice(
+         row["LengthInService"], size=row["NumRows"], replace=False
+     )
+     rand_date_dict[row["StudyId"]] = []
+
+     # Using the random days generated, calculate the date
+     for day in rand_days_dict[row["StudyId"]]:
+         if day <= row["NumDaysPossibleIndex"]:
+             rand_date_dict[row["StudyId"]].append(
+                 row["EarliestIndexDate"] + timedelta(days=int(day))
+             )
+         else:
+             rand_date_dict[row["StudyId"]].append(
+                 row["EarliestIndexAfterGap"]
+                 + timedelta(days=int(day - row["NumDaysPossibleIndex"]))
+             )
+
+     # Get exacerbation info to exclude any exacerbations that occurred within 14 days
+     # before an index date
+     exac_event_per_patient = exac_data[
+         (exac_data["StudyId"] == row["StudyId"]) & (exac_data["IsExac"] == 1)
+     ][["StudyId", "DateOfEvent", "IsExac"]]
+
+     # Create df from dictionaries containing random index dates
+     index_date_df = pd.DataFrame.from_dict(
+         rand_date_dict, orient="index"
+     ).reset_index()
+     index_date_df = index_date_df.rename(columns={"index": "StudyId"})
+
+     # Convert the multiple columns containing index dates to one column
+     index_date_df = (
+         pd.melt(index_date_df, id_vars=["StudyId"], value_name="IndexDate")
+         .drop(["variable"], axis=1)
+         .sort_values(by=["StudyId", "IndexDate"])
+     )
+     index_date_df = index_date_df.dropna()
+     index_date_df = index_date_df.reset_index(drop=True)
+
+     # Calculate the time to event from exac date (DateOfEvent) to index date
+     exac_event_per_patient = exac_event_per_patient.merge(
+         index_date_df, on="StudyId", how="outer"
+     )
+     exac_event_per_patient["IndexDate"] = pd.to_datetime(
+         exac_event_per_patient["IndexDate"], utc=True
+     )
+     exac_event_per_patient["TimeToEvent"] = (
+         exac_event_per_patient["DateOfEvent"]
+         - exac_event_per_patient["IndexDate"]
+     ).dt.days
+
+     # Exit the while loop only when no exacerbation falls within the 14 days
+     # before any generated index date; otherwise increment the seed and resample
+     if (
+         not exac_event_per_patient["TimeToEvent"]
+         .between(-14, 0, inclusive="both")
+         .any()
+     ):
+         excluded_index_dates_close_to_exac = True
+         full_index_date_df = pd.concat([full_index_date_df, index_date_df])
+     else:
+         iter_num = iter_num + 1
+
+ # Check distribution of generated index dates
+ full_index_date_df["IndexYear"] = full_index_date_df["IndexDate"].dt.year
+ full_index_date_df["IndexMonth"] = full_index_date_df["IndexDate"].dt.month
+ full_index_date_df["IndexMonth"].hist(bins=10, density=True)
+
+ os.makedirs("./plots/index_date", exist_ok=True)
+ plt.savefig(
+     "./plots/index_date/seed_" + str(general_seed) + "_hist.png",
+     bbox_inches="tight",
+ )
+
+ dates_groupby = full_index_date_df.groupby(by=["IndexYear", "IndexMonth"]).count()[
+     ["StudyId"]
+ ]
+ month_groupby = (
+     full_index_date_df.groupby("IndexMonth").count()[["StudyId"]].reset_index()
+ )
+ year_groupby = (
+     full_index_date_df.groupby("IndexYear").count()[["StudyId"]].reset_index()
+ )
+
+ dates_groupby.plot.bar(y="StudyId")
+ plt.savefig(
+     "./plots/index_date/seed_" + str(general_seed) + "_month_year.png",
+     bbox_inches="tight",
+ )
+ month_groupby.plot.bar(x="IndexMonth", y="StudyId")
+ plt.savefig(
+     "./plots/index_date/seed_" + str(general_seed) + "_month.png",
+     bbox_inches="tight",
+ )
+ year_groupby.plot.bar(x="IndexYear", y="StudyId")
+ plt.savefig(
+     "./plots/index_date/seed_" + str(general_seed) + "_year.png",
+     bbox_inches="tight",
+ )
+ plt.close(fig="all")
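The while loop above is a rejection-sampling scheme: candidate index dates are redrawn with an incremented seed until none of them falls in the 14-day window after an exacerbation. A minimal, self-contained sketch of the same idea, simplified to a single continuous service period (function and variable names here are illustrative, not from the codebase):

```python
import numpy as np
from datetime import date, timedelta

def sample_index_dates(rng, n_dates, service_days, start, exac_dates, max_iter=1000):
    """Redraw candidate index dates until none falls within the
    14 days after an exacerbation (no service-gap handling here,
    unlike the script above)."""
    for _ in range(max_iter):
        # Draw distinct day offsets within the time in service
        offsets = rng.choice(service_days, size=n_dates, replace=False)
        candidates = [start + timedelta(days=int(d)) for d in offsets]
        # A candidate is invalid if any exacerbation occurred in the
        # 14 days up to and including that date
        too_close = any(
            0 <= (idx - exac).days <= 14
            for idx in candidates
            for exac in exac_dates
        )
        if not too_close:
            return sorted(candidates)
    raise RuntimeError("no valid index dates found")

rng = np.random.default_rng(0)
dates = sample_index_dates(rng, 3, 365, date(2022, 1, 1), [date(2022, 3, 1)])
```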
training/combine_features.py ADDED
@@ -0,0 +1,179 @@
+ """Combine feature sets, encode categorical features, and impute missing values.
+
+ The demographics, exacerbation history, comorbidity, spirometry, labs, and PRO
+ datasets are merged into a single table. If the data_to_process setting in
+ config.yaml is not forward_val, the combined dataset is split into training and
+ test sets. Two versions of the data are saved: an imputed and a non-imputed
+ dataframe.
+ """
+
+ import pandas as pd
+ import numpy as np
+ import os
+ import sys
+ import yaml
+ import json
+ import joblib
+ import encoding
+ import imputation
+
+
+ with open("./training/config.yaml", "r") as config:
+     config = yaml.safe_load(config)
+
+ # Specify which model to generate features for
+ model_type = config["model_settings"]["model_type"]
+
+ # Set up log file
+ log = open("./training/logging/combine_features_" + model_type + ".log", "w")
+ sys.stdout = log
+
+ # Dataset to process - set through config file
+ data_to_process = config["model_settings"]["data_to_process"]
+
+ ############################################################################
+ # Combine features
+ ############################################################################
+
+ # Load cohort data
+ if data_to_process == "forward_val":
+     demographics = pd.read_pickle(
+         os.path.join(
+             config["outputs"]["processed_data_dir"],
+             "demographics_forward_val_{}.pkl".format(model_type),
+         )
+     )
+     exac_history = pd.read_pickle(
+         os.path.join(
+             config["outputs"]["processed_data_dir"],
+             "exac_history_forward_val_{}.pkl".format(model_type),
+         )
+     )
+     comorbidities = pd.read_pickle(
+         os.path.join(
+             config["outputs"]["processed_data_dir"],
+             "comorbidities_forward_val_{}.pkl".format(model_type),
+         )
+     )
+     spirometry = pd.read_pickle(
+         os.path.join(
+             config["outputs"]["processed_data_dir"],
+             "spirometry_forward_val_{}.pkl".format(model_type),
+         )
+     )
+     labs = pd.read_pickle(
+         os.path.join(
+             config["outputs"]["processed_data_dir"],
+             "labs_forward_val_{}.pkl".format(model_type),
+         )
+     )
+     pros = pd.read_pickle(
+         os.path.join(
+             config["outputs"]["processed_data_dir"],
+             "pros_forward_val_{}.pkl".format(model_type),
+         )
+     )
+ else:
+     demographics = pd.read_pickle(
+         os.path.join(
+             config["outputs"]["processed_data_dir"],
+             "demographics_{}.pkl".format(model_type),
+         )
+     )
+     exac_history = pd.read_pickle(
+         os.path.join(
+             config["outputs"]["processed_data_dir"],
+             "exac_history_{}.pkl".format(model_type),
+         )
+     )
+     comorbidities = pd.read_pickle(
+         os.path.join(
+             config["outputs"]["processed_data_dir"],
+             "comorbidities_{}.pkl".format(model_type),
+         )
+     )
+     spirometry = pd.read_pickle(
+         os.path.join(
+             config["outputs"]["processed_data_dir"],
+             "spirometry_{}.pkl".format(model_type),
+         )
+     )
+     labs = pd.read_pickle(
+         os.path.join(
+             config["outputs"]["processed_data_dir"], "labs_{}.pkl".format(model_type)
+         )
+     )
+     pros = pd.read_pickle(
+         os.path.join(
+             config["outputs"]["processed_data_dir"], "pros_{}.pkl".format(model_type)
+         )
+     )
+
+ data_combined = demographics.merge(
+     exac_history, on=["StudyId", "IndexDate"], how="left"
+ )
+ data_combined = data_combined.merge(
+     comorbidities, on=["StudyId", "IndexDate"], how="left"
+ )
+ data_combined = data_combined.merge(spirometry, on=["StudyId", "IndexDate"], how="left")
+ data_combined = data_combined.merge(labs, on=["StudyId", "IndexDate"], how="left")
+ data_combined = data_combined.merge(pros, on=["StudyId", "IndexDate"], how="left")
+
+ # Print dataset info
+ print(
+     "Data date range",
+     data_combined["IndexDate"].min(),
+     data_combined["IndexDate"].max(),
+ )
+ print("Mean age", data_combined["Age"].mean())
+ print("Sex Female:", data_combined["Sex_F"].value_counts())
+
+ if data_to_process != "forward_val":
+     # Load training and test ids
+     train_ids = pd.read_pickle(
+         os.path.join(
+             config["outputs"]["cohort_info_dir"], "train_ids_{}.pkl".format(model_type)
+         )
+     )
+     test_ids = pd.read_pickle(
+         os.path.join(
+             config["outputs"]["cohort_info_dir"], "test_ids_{}.pkl".format(model_type)
+         )
+     )
+     fold_patients = np.load(
+         os.path.join(
+             config["outputs"]["cohort_info_dir"],
+             "fold_patients_{}.npy".format(model_type),
+         ),
+         allow_pickle=True,
+     )
+
+     # Split data into training and test sets
+     train_data = data_combined[data_combined["StudyId"].isin(train_ids)]
+     test_data = data_combined[data_combined["StudyId"].isin(test_ids)]
+     train_data = train_data.sort_values(by=["StudyId", "IndexDate"]).reset_index(
+         drop=True
+     )
+     test_data = test_data.sort_values(by=["StudyId", "IndexDate"]).reset_index(
+         drop=True
+     )
+
+     # Save data
+     train_data.to_pickle(
+         os.path.join(
+             config["outputs"]["processed_data_dir"],
+             "train_combined_{}.pkl".format(model_type),
+         )
+     )
+     test_data.to_pickle(
+         os.path.join(
+             config["outputs"]["processed_data_dir"],
+             "test_combined_{}.pkl".format(model_type),
+         )
+     )
+ else:
+     data_combined.to_pickle(
+         os.path.join(
+             config["outputs"]["processed_data_dir"],
+             "forward_val_combined_{}.pkl".format(model_type),
+         )
+     )
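The script above left-joins each feature table on the (StudyId, IndexDate) pair, so duplicate keys in any feature table would silently multiply rows. A small sketch of one way to guard against that with pandas' merge validation (the toy frames and column values below are illustrative):

```python
import pandas as pd

# Toy stand-ins for two of the feature tables merged above.
demographics = pd.DataFrame({
    "StudyId": ["A", "B"],
    "IndexDate": ["2022-01-01", "2022-02-01"],
    "Age": [65, 70],
})
labs = pd.DataFrame({
    "StudyId": ["A"],
    "IndexDate": ["2022-01-01"],
    "Eosinophils": [0.3],
})

# validate="one_to_one" raises pandas.errors.MergeError if either side
# has duplicate (StudyId, IndexDate) keys, so the row count is preserved.
combined = demographics.merge(
    labs, on=["StudyId", "IndexDate"], how="left", validate="one_to_one"
)
assert len(combined) == len(demographics)
```

Patient B has no lab row, so the left join leaves its Eosinophils value missing, which is what the downstream imputation step is for.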
training/config.yaml ADDED
@@ -0,0 +1,45 @@
+ model_settings:
+   prediction_window: 90
+   lookback_period: 180
+   seed: 0
+   index_date_generation_master_seed: 2188398760
+   pro_logic_min_days_after_exac: 14
+   pro_logic_max_days_after_exac: 35
+   neg_consecutive_q5_replies: 2
+   model_type: 'hosp_comm'
+   latest_date_before_bug_break: "2022-03-08"
+   after_bug_fixed_start_date: "2022-07-07"
+   training_data_end_date: "2023-05-23"
+   pro_q5_change_date: "2021-04-22"
+   forward_validation_earliest_date: "2023-05-23"
+   forward_validation_latest_date: "2024-02-20"
+   one_row_per_days_in_service: 150
+   num_folds: 5
+   data_to_process: 'test'
+ inputs:
+   raw_data_paths:
+     receiver_cohort: "<YOUR_DATA_PATH>/EXAMPLE_STUDY_DATA/Cohort3Rand.csv"
+     scale_up_cohort: "<YOUR_DATA_PATH>/SU_IDs/Scale_Up_lookup.csv"
+     patient_details: "<YOUR_DATA_PATH>/copd-dataset/CopdDatasetPatientDetails.txt"
+     patient_events: "<YOUR_DATA_PATH>/copd-dataset/PatientEvents.txt"
+     comorbidities: "<YOUR_DATA_PATH>/copd-dataset/CopdDatasetCoMorbidityDetails.txt"
+     copd_status: "<YOUR_DATA_PATH>/copd-dataset/CopdDatasetCopdStatusDetails.txt"
+     inhalers: "<YOUR_DATA_PATH>/copd-dataset/CopdDatasetUsualTherapies.txt"
+     pro_symptom_diary: "<YOUR_DATA_PATH>/copd-dataset/CopdDatasetProSymptomDiary.txt"
+     pro_eq5d: "<YOUR_DATA_PATH>/copd-dataset/CopdDatasetProEQ5D.txt"
+     pro_mrc: "<YOUR_DATA_PATH>/copd-dataset/CopdDatasetProMrc.txt"
+     pro_cat: "<YOUR_DATA_PATH>/copd-dataset/CopdDatasetProCat.txt"
+     receiver_community_verified_events: "<YOUR_DATA_PATH>/LenusEvents/breakdown_of_com_exac.xlsx"
+     scale_up_community_verified_events: "<YOUR_DATA_PATH>/LenusEvents/Scale_Up_comm_exac_count.xlsx"
+     admissions: "<YOUR_DATA_PATH>/03_Training/SMR01.csv"
+     prescribing: "<YOUR_DATA_PATH>/03_Training/Pharmacy.csv"
+     labs: "<YOUR_DATA_PATH>/02_Training/SCI_Store.csv"
+     labs_lookup_table: "./lookups/Training_20221125_labs_lookup.csv"
+     sh_demographics: "<YOUR_DATA_PATH>/EXAMPLE_STUDY_DATA/Demographics_Cohort4.csv"
+ outputs:
+   output_data_dir: './data'
+   cohort_info_dir: "./data/cohort_info/"
+   logging_dir: './training/logging'
+   artifact_dir: './tmp'
+   processed_data_dir: './data/processed_data'
+   model_input_data_dir: './data/model_input_data'
training/create_sh_lookup_table.py ADDED
@@ -0,0 +1,48 @@
+ import os
+ import pandas as pd
+ import yaml
+
+ with open("./training/config.yaml", "r") as config:
+     config = yaml.safe_load(config)
+
+ # Read lookups for RECEIVER
+ receiver = pd.read_csv(config['inputs']['raw_data_paths']['receiver_cohort'])
+ receiver = receiver.rename(columns={"RNo": "StudyId"})
+
+ # Read lookups for Scale Up
+ scaleup = pd.read_csv(config['inputs']['raw_data_paths']['scale_up_cohort'])
+ scaleup = scaleup.rename(columns={"Study_Number": "StudyId"})
+
+ # Concatenate tables and drop missing SH IDs (some study patients not in data extract)
+ all_patients = pd.concat([receiver, scaleup]).dropna()
+
+ # Save final mapping between StudyId and SafeHavenID
+ all_patients.to_pickle(os.path.join(config['outputs']['output_data_dir'], "sh_to_studyid_mapping.pkl"))
+
+ # Check for matching age and sex between SafeHaven and Lenus data (mapping sanity check)
+ lenus_demographics = pd.read_csv(
+     config['inputs']['raw_data_paths']['patient_details'],
+     usecols=["StudyId", "DateOfBirth", "Sex"],
+     sep="|",
+ )
+ sh_demographics = pd.read_csv(
+     config['inputs']['raw_data_paths']['sh_demographics'],
+     usecols=["SafeHavenID", "SEX", "OBF_DOB"],
+ )
+
+ sh_demographics["OBF_DOB"] = pd.to_datetime(
+     sh_demographics["OBF_DOB"], utc=True
+ ).dt.normalize()
+
+ mapping = all_patients.merge(sh_demographics, on="SafeHavenID", how="inner")
+ mapping = mapping.merge(lenus_demographics, on="StudyId", how="inner")
+
+ # Check patient sex matches
+ print(mapping[mapping.SEX != mapping.Sex])
+ # There is one mismatch
+ print(all_patients[all_patients.duplicated(subset="SafeHavenID", keep=False)])
+
+ # Check patient DOB matches
+ print(mapping[mapping.OBF_DOB != mapping.DateOfBirth])
+
+ print(mapping[mapping["StudyId"] == "SU126"])
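The duplicated() check above flags any SafeHavenID linked to more than one StudyId. A compact illustration of that pattern on a toy mapping (all values below are made up):

```python
import pandas as pd

# Toy stand-in for the all_patients mapping built above.
mapping = pd.DataFrame({
    "StudyId": ["R001", "R002", "SU126"],
    "SafeHavenID": [101, 102, 102],
})

# keep=False marks every row of a duplicated SafeHavenID, not just the repeats,
# so both StudyIds sharing an ID show up for manual review.
dupes = mapping[mapping.duplicated(subset="SafeHavenID", keep=False)]
```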
training/cross_val_final_models.py ADDED
@@ -0,0 +1,673 @@
+ import os
+ import sys
+ import numpy as np
+ import pandas as pd
+ import model_h
+ import shutil
+ import pickle
+ import yaml
+
+ # Plotting
+ import matplotlib.pyplot as plt
+
+ # Model training and evaluation
+ from sklearn.ensemble import RandomForestClassifier
+ from sklearn.model_selection import cross_validate, cross_val_predict
+ from sklearn.metrics import precision_recall_curve, auc
+ from sklearn.calibration import CalibratedClassifierCV
+ from imblearn.ensemble import BalancedRandomForestClassifier
+ import xgboost as xgb
+ import ml_insights as mli
+ import mlflow
+
+ # Explainability
+ from sklearn.inspection import permutation_importance
+
+ with open("./training/config.yaml", "r") as config:
+     config = yaml.safe_load(config)
+
+ model_type = config['model_settings']['model_type']
+
+ ##############################################################
+ # Load data
+ ##############################################################
+ # Set up log file
+ log = open(
+     os.path.join(config['outputs']['logging_dir'], "modelling_" + model_type + ".log"), "w")
+ sys.stdout = log
+
+ # Load CV folds
+ fold_patients = np.load(os.path.join(config['outputs']['cohort_info_dir'],
+                         'fold_patients_' + model_type + '.npy'), allow_pickle=True)
+
+ # Load imputed crossval data
+ train_data_imp = model_h.load_data_for_modelling(os.path.join(
+     config["outputs"]["model_input_data_dir"],
+     "train_imputed_cv_{}.pkl".format(model_type),
+ ))
+
+ # Load not imputed crossval data
+ train_data_no_imp = model_h.load_data_for_modelling(os.path.join(
+     config["outputs"]["model_input_data_dir"],
+     "train_not_imputed_cv_{}.pkl".format(model_type),
+ ))
+
+ # Load imputed test data
+ test_data_imp = model_h.load_data_for_modelling(os.path.join(
+     config["outputs"]["model_input_data_dir"],
+     "test_imputed_{}.pkl".format(model_type),
+ ))
+
+ # Load not imputed test data
+ test_data_no_imp = model_h.load_data_for_modelling(os.path.join(
+     config["outputs"]["model_input_data_dir"],
+     "test_not_imputed_{}.pkl".format(model_type),
+ ))
+
+ # Load exac data
+ #train_exac_data = pd.read_pickle('./data/train_exac_data_' + model_type + '.pkl')
+ #test_exac_data = pd.read_pickle('./data/test_exac_data_' + model_type + '.pkl')
+
+ # Print date ranges for train and test set
+ print('Train date range',
+       train_data_imp['IndexDate'].min(), train_data_imp['IndexDate'].max())
+ print('Test date range',
+       test_data_imp['IndexDate'].min(), test_data_imp['IndexDate'].max())
+
+ # Set tags
+ tags = {"prediction_window": config['model_settings']['prediction_window'],
+         "lookback_period": config['model_settings']['lookback_period'],
+         "min_index_date": train_data_imp['IndexDate'].min(),
+         "max_index_date": train_data_imp['IndexDate'].max(),
+         "1_row_per_length_in_service_days": config['model_settings']['one_row_per_days_in_service'],
+         }
+
+ # Create a tuple of training and validation indices for each fold. This can be done
+ # with either the imputed or the non-imputed data, as both contain the same patients
+ cross_val_fold_indices = []
+ for fold in fold_patients:
+     fold_val_ids = train_data_no_imp[train_data_no_imp.StudyId.isin(fold)]
+     fold_train_ids = train_data_no_imp[~(
+         train_data_no_imp.StudyId.isin(fold_val_ids.StudyId))]
+
+     # Get index of rows in val and train
+     fold_val_index = fold_val_ids.index
+     fold_train_index = fold_train_ids.index
+
+     # Append tuple of training and val indices
+     cross_val_fold_indices.append((fold_train_index, fold_val_index))
+
+ # Create list of model features
+ cols_to_drop = ['StudyId', 'ExacWithin3Months', 'IndexDate', 'HospExacWithin3Months',
+                 'CommExacWithin3Months']
+ features_list = [col for col in train_data_no_imp.columns if col not in cols_to_drop]
+
+ ### Train data ###
+ # Separate features from target for the non-imputed data
+ train_features_no_imp = train_data_no_imp[features_list].astype('float')
+ train_target_no_imp = train_data_no_imp.ExacWithin3Months.astype('float')
+ # Separate features from target for the imputed data
+ train_features_imp = train_data_imp[features_list].astype('float')
+ train_target_imp = train_data_imp.ExacWithin3Months.astype('float')
+
+ ### Test data ###
+ # Separate features from target for the non-imputed data
+ test_features_no_imp = test_data_no_imp[features_list].astype('float')
+ test_target_no_imp = test_data_no_imp.ExacWithin3Months.astype('float')
+ # Separate features from target for the imputed data
+ test_features_imp = test_data_imp[features_list].astype('float')
+ test_target_imp = test_data_imp.ExacWithin3Months.astype('float')
+
+ # Check that the target in the imputed and non-imputed datasets is the same.
+ # If not, raise an error
+ if not train_target_no_imp.equals(train_target_imp):
+     raise ValueError(
+         'Target variable is not the same in imputed and non imputed datasets in the train set.')
+ if not test_target_no_imp.equals(test_target_imp):
+     raise ValueError(
+         'Target variable is not the same in imputed and non imputed datasets in the test set.')
+ train_target = train_target_no_imp
+ test_target = test_target_no_imp
+
+ # Make sure all features are numeric
+ for features in [train_features_no_imp, train_features_imp,
+                  test_features_no_imp, test_features_imp]:
+     for col in features:
+         features[col] = pd.to_numeric(features[col], errors='coerce')
+
+ ##############################################################
+ # Specify which models to evaluate
+ ##############################################################
+ # Set up MLflow
+ mlflow.set_tracking_uri("sqlite:///mlruns.db")
+ mlflow.set_experiment('model_h_drop_1_' + model_type)
+
+ # Set CV scoring strategies and any model parameters
+ scoring = ['f1', 'balanced_accuracy', 'accuracy', 'precision', 'recall', 'roc_auc',
+            'average_precision', 'neg_brier_score']
+
+ # Set up models; each tuple contains 4 elements: model, model name, imputation status,
+ # type of model
+ models = []
+ # These models are run for both the hospital exac model and the hospital and community exac model
+ models.append((BalancedRandomForestClassifier(random_state=0),
+                'balanced_random_forest', 'imputed', 'tree'))
+ models.append((xgb.XGBClassifier(random_state=0, use_label_encoder=False,
+                                  eval_metric='logloss'),
+                'xgb', 'not_imputed', 'tree'))
+ models.append((RandomForestClassifier(),
+                'random_forest', 'imputed', 'tree'))
+
+ # Get the parent run where hyperparameter tuning was done
+ if model_type == 'only_hosp':
+     parent_run_id = 'ba2d7244654c4b84a815932a3167648f'
+ if model_type == 'hosp_comm':
+     parent_run_id = 'f71edd4c72f14c0692431dca297ec131'
+
+ ##############################################################
+ # Run models
+ ##############################################################
+ # In an MLflow run, perform K-fold cross validation and capture the mean score across folds.
+ with mlflow.start_run(run_name='hyperparameter_optimised_models_12'):
+     for model in models:
+         # Get parameters of best scoring models
+         best_params = model_h.get_mlflow_run_params(
+             model[1], parent_run_id, 'sqlite:///mlruns.db', model_type)
+         # Each model will have multiple best scores for different scoring metrics.
+         for n, scorer in enumerate(best_params):
+             params = best_params[scorer]
+             model[0].set_params(**params)
+             with mlflow.start_run(run_name=model[1] + '_tuning_scorer_' + scorer, nested=True):
+                 print(model[1], scorer)
+                 # Create the artifacts directory if it doesn't exist
+                 os.makedirs(config['outputs']['artifact_dir'], exist_ok=True)
+                 # Remove existing directory contents so files from different runs are not mixed
+                 shutil.rmtree(config['outputs']['artifact_dir'])
+
+                 # Select the correct data based on whether the model uses the imputed
+                 # or the non-imputed dataset
+                 if model[2] == 'imputed':
+                     train_features = train_features_imp
+                     test_features = test_features_imp
+                     train_data = train_data_imp
+                     test_data = test_data_imp
+                 else:
+                     train_features = train_features_no_imp
+                     test_features = test_features_no_imp
+                     train_data = train_data_no_imp
+                     test_data = test_data_no_imp
+
+                 mlflow.set_tags(tags=tags)
+
+                 # Perform K-fold cross validation with custom folds
+                 crossval = cross_validate(model[0], train_features, train_target,
+                                           cv=cross_val_fold_indices,
+                                           return_estimator=True, scoring=scoring,
+                                           return_indices=True)
+
+                 # Get the predicted probabilities from each model
+                 probabilities_cv = cross_val_predict(model[0], train_features,
+                                                      train_target,
+                                                      cv=cross_val_fold_indices,
+                                                      method='predict_proba')[:, 1]
+
+                 # Evaluation for uncalibrated model - test set
+                 for iter_num, estimator in enumerate(crossval['estimator']):
+                     probs_test = estimator.predict_proba(test_features)[:, 1]
+                     preds_test = estimator.predict(test_features)
+                     uncalib_metrics_test = model_h.calc_eval_metrics_for_model(
+                         test_target, preds_test, probs_test, 'uncalib_test')
+                     if iter_num == 0:
+                         uncalib_metrics_test_df = pd.DataFrame(
+                             uncalib_metrics_test, index=[iter_num])
+                     else:
+                         uncalib_metrics_test_df_iter = pd.DataFrame(
+                             uncalib_metrics_test, index=[iter_num])
+                         uncalib_metrics_test_df = pd.concat(
+                             [uncalib_metrics_test_df, uncalib_metrics_test_df_iter])
+                 uncalib_metrics_test_mean = uncalib_metrics_test_df.mean()
+                 uncalib_metrics_test_mean = uncalib_metrics_test_mean.to_dict()
+
+                 # Get threshold that gives best F1 score for uncalibrated model
+                 best_thres_uncal, f1_bt, prec_bt, rec_bt = model_h.get_threshold_with_best_f1_score(
+                     train_target, probabilities_cv)
+                 # Save F1 score, precision and recall for the best threshold
+                 mlflow.log_metric('best_thres_uncal', best_thres_uncal)
+                 mlflow.log_metric('f1_best_thres', f1_bt)
+                 mlflow.log_metric('precision_best_thres', prec_bt)
+                 mlflow.log_metric('recall_best_thres', rec_bt)
+
+                 #### Plot confusion matrix at different thresholds ####
+                 model_h.plot_confusion_matrix(
+                     [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, best_thres_uncal], probabilities_cv,
+                     train_target, model[1], model_type, 'uncalib')
+
+                 #### Calculate AUC-PR score ####
+                 precision, recall, thresholds = precision_recall_curve(
+                     train_target, probabilities_cv)
+                 auc_pr = auc(recall, precision)
+                 mlflow.log_metric('auc_pr', auc_pr)
+
+                 #### Generate calibration curves ####
+                 if model[1] != 'dummy_classifier':
+                     ### Sigmoid calibration ###
+                     # Perform calibration
+                     model_sig = CalibratedClassifierCV(
+                         model[0], method='sigmoid', cv=cross_val_fold_indices)
+                     model_sig.fit(train_features, train_target)
+                     probs_sig = model_sig.predict_proba(test_features)[:, 1]
+                     probs_sig_2 = model_sig.predict_proba(test_features)
+                     preds_sig = model_sig.predict(test_features)
+                     # Generate metrics for calibrated model
+                     calib_metrics_sig = model_h.calc_eval_metrics_for_model(
+                         test_target, preds_sig, probs_sig, 'sig')
+                     # Get threshold with best F1 score for calibrated model
+                     best_thres_sig, _, _, _ = model_h.get_threshold_with_best_f1_score(
+                         test_target, probs_sig)
+                     mlflow.log_metric('best_thres_sig', best_thres_sig)
+                     # Plot confusion matrices for calibrated model
+                     model_h.plot_confusion_matrix(
+                         [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, best_thres_sig], probs_sig,
+                         test_target, model[1], model_type, "sig")
+                     # Plot score distribution for calibrated model
+                     model_h.plot_score_distribution(
+                         test_target, probs_sig, config['outputs']['artifact_dir'], model[1], model_type, 'sig')
+                     # Calculate std of AUC-PR between CV folds
+                     model_h.calc_std_for_calibrated_classifiers(
+                         model_sig, 'sig', test_features, test_target)
+
+                     ### Isotonic calibration ###
+                     # Perform calibration
+                     model_iso = CalibratedClassifierCV(
+                         model[0], method='isotonic', cv=cross_val_fold_indices)
+                     model_iso.fit(train_features, train_target)
+                     probs_iso = model_iso.predict_proba(test_features)[:, 1]
+                     preds_iso = model_iso.predict(test_features)
+                     # Generate metrics for calibrated model
+                     calib_metrics_iso = model_h.calc_eval_metrics_for_model(
+                         test_target, preds_iso, probs_iso, 'iso')
+                     # Get threshold with best F1 score for calibrated model
+                     best_thres_iso, _, _, _ = model_h.get_threshold_with_best_f1_score(
+                         test_target, probs_iso)
+                     mlflow.log_metric('best_thres_iso', best_thres_iso)
+                     # Plot confusion matrices for calibrated model
+                     model_h.plot_confusion_matrix(
+                         [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, best_thres_iso], probs_iso,
+                         test_target, model[1], model_type, "iso")
+                     # Plot score distribution for calibrated model
+                     model_h.plot_score_distribution(
+                         test_target, probs_iso, config['outputs']['artifact_dir'], model[1], model_type, 'iso')
+                     # Calculate std of AUC-PR between CV folds
+                     model_h.calc_std_for_calibrated_classifiers(
+                         model_iso, 'iso', test_features, test_target)
+
+                     ### Spline calibration ###
+                     # Perform calibration
+                     spline_calib = mli.SplineCalib()
+                     spline_calib.fit(probabilities_cv, train_target)
+                     model[0].fit(train_features, train_target)
+                     preds_test_uncalib = model[0].predict_proba(test_features)[:, 1]
+                     probs_spline = spline_calib.calibrate(preds_test_uncalib)
+                     preds_spline = probs_spline > 0.5
+                     preds_spline = preds_spline.astype(int)
+                     # Generate metrics for calibrated model
+                     calib_metrics_spline = model_h.calc_eval_metrics_for_model(
+                         test_target, preds_spline, probs_spline, 'spline')
+                     # Get threshold with best F1 score for calibrated model
+                     best_thres_spline, _, _, _ = model_h.get_threshold_with_best_f1_score(
+                         test_target, probs_spline)
+                     mlflow.log_metric('best_thres_spline', best_thres_spline)
+                     # Plot confusion matrices for calibrated model
+                     model_h.plot_confusion_matrix(
+                         [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, best_thres_spline], probs_spline,
+                         test_target, model[1], model_type, "spline")
+                     # Plot score distribution for calibrated model
+                     model_h.plot_score_distribution(
+                         test_target, probs_spline, config['outputs']['artifact_dir'], model[1], model_type, 'spline')
+
+                     ### Plot calibration curves ###
+                     # Plot calibration curves for equal width bins (each bin has same width)
+                     # and equal frequency bins (each bin has same number of observations)
+                     for strategy in ['uniform', 'quantile']:
+                         for bins in [5, 6, 10]:
+                             plt.figure(figsize=(8, 8))
+                             plt.plot([0, 1], [0, 1], linestyle='--')
+                             model_h.plot_calibration_curve(
+                                 train_target, probabilities_cv, bins, strategy, 'Uncalibrated')
+                             model_h.plot_calibration_curve(
+                                 test_target, probs_sig, bins, strategy, 'Sigmoid')
+                             model_h.plot_calibration_curve(
+                                 test_target, probs_iso, bins, strategy, 'Isotonic')
+                             model_h.plot_calibration_curve(
+                                 test_target, probs_spline, bins, strategy, 'Spline')
+                             plt.legend(bbox_to_anchor=(1.05, 1.0), loc='upper left')
+                             plt.title(model[1])
+                             plt.tight_layout()
+                             plt.savefig(
+                                 os.path.join(config['outputs']['artifact_dir'], model[1] +
+                                              '_' + strategy + '_bins' + str(bins) + '_' +
+                                              model_type + '.png'))
+                             plt.close()
+
+                     # Plot uncalibrated model calibration curve at different bins and
+                     # strategies
+                     fig, (ax1, ax2) = plt.subplots(ncols=2, sharex=True, figsize=(15, 10))
+                     # plt.figure(figsize=(8,8))
+                     for ax in [ax1, ax2]:
+                         ax.plot([0, 1], [0, 1], linestyle='--')
+                     for bins in [5, 6, 7, 8, 9]:
+                         model_h.plot_calibration_curve(
+                             train_target, probabilities_cv, bins, 'quantile', 'Bins=' +
+                             str(bins), ax1)
+                     for bins in [5, 6, 7, 8, 9]:
+                         model_h.plot_calibration_curve(
+                             train_target, probabilities_cv, bins, 'uniform', 'Bins=' +
+                             str(bins), ax2)
+                     ax1.title.set_text(model[1] + ' uncalibrated model quantile bins')
+                     ax2.title.set_text(model[1] + ' uncalibrated model uniform bins')
+                     plt.legend(bbox_to_anchor=(1.05, 1.0), loc='upper left')
+                     plt.tight_layout()
+                     plt.savefig(
+                         os.path.join(config['outputs']['artifact_dir'], model[1] + '_uncal_'
+                                      + model_type + '.png'))
+                     plt.close()
+
+                     # Plot calibration curves with error bars
+                     model_h.plot_calibration_plot_with_error_bars(
+                         probabilities_cv, probs_sig, probs_iso, probs_spline, train_target,
+                         test_target, model[1])
+                     plt.close()
+
+                 #### Get total gain and total cover for boosting machine models ####
+                 if model[1].startswith("xgb"):
+                     feat_importance_tot_gain_df = model_h.plot_feat_importance_model(
+                         model[0], model[1], model_type)
+                 # Save feature importance by total gain
+                 if model[1].startswith("xgb"):
+                     feat_importance_tot_gain_df.to_csv(
+                         './data/feature_importance_tot_gain_' + model_type + '.csv', index=False)
+
+                 #### Calculate model performance by event type ####
+                 if model[1] not in ['dummy_classifier']:
+                     # Create df to contain prediction data and event type data
+                     preds_event_df_uncalib = model_h.create_df_probabilities_and_predictions(
+                         probabilities_cv, best_thres_uncal,
+                         train_data['StudyId'].tolist(),
+                         train_target,
+                         train_data[['ExacWithin3Months', 'HospExacWithin3Months', 'CommExacWithin3Months']],
+                         model[1], model_type, output_dir='./data/prediction_and_events/')
+                     preds_events_df_sig = model_h.create_df_probabilities_and_predictions(
+                         probs_sig, best_thres_sig, test_data['StudyId'].tolist(),
+                         test_target,
+                         test_data[['ExacWithin3Months', 'HospExacWithin3Months', 'CommExacWithin3Months']],
+                         model[1], model_type, output_dir='./data/prediction_and_events/',
+                         calib_type='sig')
+                     preds_events_df_iso = model_h.create_df_probabilities_and_predictions(
+                         probs_iso, best_thres_iso, test_data['StudyId'].tolist(),
+                         test_target,
+                         test_data[['ExacWithin3Months', 'HospExacWithin3Months', 'CommExacWithin3Months']],
+                         model[1], model_type, output_dir='./data/prediction_and_events/',
+                         calib_type='iso')
+                     preds_events_df_spline = model_h.create_df_probabilities_and_predictions(
+                         probs_spline, best_thres_spline, test_data['StudyId'].tolist(),
+                         test_target,
+                         test_data[['ExacWithin3Months', 'HospExacWithin3Months', 'CommExacWithin3Months']],
+                         model[1], model_type, output_dir='./data/prediction_and_events/',
+                         calib_type='spline')
+                     # Subset to each event type and calculate metrics
+                     metrics_by_event_type_uncalib = model_h.calc_metrics_by_event_type(
+                         preds_event_df_uncalib, calib_type="uncalib")
+                     metrics_by_event_type_sig = model_h.calc_metrics_by_event_type(
+                         preds_events_df_sig, calib_type='sig')
+                     metrics_by_event_type_iso = model_h.calc_metrics_by_event_type(
+                         preds_events_df_iso, calib_type='iso')
+                     metrics_by_event_type_spline = model_h.calc_metrics_by_event_type(
+                         preds_events_df_spline, calib_type='spline')
+                     # Subset to each event type and plot ROC curve
+                     model_h.plot_roc_curve_by_event_type(
+                         preds_event_df_uncalib, model[1], 'uncalib')
+                     model_h.plot_roc_curve_by_event_type(
+                         preds_events_df_sig, model[1], 'sig')
+                     model_h.plot_roc_curve_by_event_type(
+                         preds_events_df_iso, model[1], 'iso')
+                     model_h.plot_roc_curve_by_event_type(
+                         preds_events_df_spline, model[1], 'spline')
436
+ # Subset to each event type and plot PR curve
437
+ model_h.plot_prec_recall_by_event_type(
438
+ preds_event_df_uncalib, model[1], 'uncalib')
439
+ model_h.plot_prec_recall_by_event_type(
440
+ preds_events_df_sig, model[1], 'sig')
441
+ model_h.plot_prec_recall_by_event_type(
442
+ preds_events_df_iso, model[1], 'iso')
443
+ model_h.plot_prec_recall_by_event_type(
444
+ preds_events_df_spline, model[1], 'spline')
445
+
446
+
447
+ #### SHAP ####
448
+ if model[1] not in ['dummy_classifier']:
449
+ ### Uncalibrated model ###
450
+ # Get the average SHAP values from CV folds for uncalibrated model
451
+ shap_values_v_uncal, shap_values_t_uncal = model_h.get_uncalibrated_shap(
452
+ crossval['estimator'], test_features, train_features,
453
+ train_data[features_list].columns,
454
+ model[1], model_type)
455
+
456
+ ## Plot SHAP summary plots ##
457
+ model_h.plot_averaged_summary_plot(
458
+ shap_values_t_uncal,
459
+ train_data[features_list],
460
+ model[1], 'uncalib', model_type)
461
+
462
+ ## Plot SHAP interaction heatmap ##
463
+ model_h.plot_shap_interaction_value_heatmap(
464
+ crossval['estimator'], train_features,
465
+ train_data[features_list].columns,
466
+ model[1], model_type)
467
+
468
+ ### Calibrated models ###
469
+ calib_models = {'sig':model_sig, 'iso':model_iso}
470
+ for calib_model_name in calib_models:
471
+ # Get the average SHAP values from CV folds for calibrated model
472
+ shap_values_v, shap_values_t = model_h.get_calibrated_shap_by_classifier(
473
+ calib_models[calib_model_name], test_features, train_features,
474
+ train_data.drop(
475
+ columns=['StudyId', 'ExacWithin3Months', 'IndexDate',
476
+ 'HospExacWithin3Months',
477
+ 'CommExacWithin3Months']).columns,
478
+ calib_model_name, model[1], model_type)
479
+
480
+ ## Plot SHAP summary plots ##
481
+ model_h.plot_averaged_summary_plot(
482
+ shap_values_t,
483
+ train_data.drop(
484
+ columns=['StudyId', 'ExacWithin3Months', 'IndexDate',
485
+ 'HospExacWithin3Months','CommExacWithin3Months']),
486
+ model[1], calib_model_name, model_type)
487
+
488
+ ## Get feature importance for local SHAP values ##
489
+ feature_imp_df = model_h.get_local_shap_values(
490
+ model[1], model_type, shap_values_v, test_features,
491
+ calib_model_name,shap_ids_dir='./data/prediction_and_events/')
492
+ feature_imp_df.to_csv(
493
+ './data/prediction_and_events/local_feature_imp_df' + model[1] +
494
+ '_' + calib_model_name + '.csv')
495
+
496
+ ## Plot local SHAP plots ##
497
+ test_feat_enc_conv = model_h.plot_local_shap(
498
+ model[1], model_type, shap_values_v, test_features, train_features,
499
+ calib_model_name,
500
+ row_ids_to_plot=['missed', 'incorrect', 'correct'],
501
+ artifact_dir=config['outputs']['artifact_dir'],
502
+ shap_ids_dir='./data/prediction_and_events/',
503
+ reverse_scaling_flag=False,
504
+ convert_target_encodings=True, imputation=model[2],
505
+ target_enc_path="./data/artifacts/target_encodings_" + model_type + ".json",
506
+ return_enc_converted_df=False)
507
+
508
+
509
+ """
510
+ ### Plot SHAP dependency plots ###
511
+ os.makedirs( "./tmp/dependence_plots", exist_ok=True)
512
+ categorical_cols = [
513
+ "DaysSinceLastExac_te", "FEV1PercentPredicted_te"]
514
+ for categorical_col in categorical_cols:
515
+ shap.dependence_plot(
516
+ categorical_col, shap_values_v, test_feat_enc_conv,
517
+ interaction_index=None, show=False)
518
+ plt.tight_layout()
519
+ plt.savefig(
520
+ "./tmp/dependence_plots/dependence_plot_" + categorical_col
521
+ + "_" + model[1] + "_" + calib_model_name + file_suffix + ".png")
522
+ plt.close()
523
+ """
524
+ ### Plot distribution of model scores for uncalibrated model ###
525
+ model_h.plot_score_distribution(
526
+ train_target, probabilities_cv, config['outputs']['artifact_dir'],
527
+ model[1], model_type)
528
+
529
+ """
530
+ ### Permutation feature importance ###
531
+ def calc_permutation_importance(model, features, target, scoring, n_repeats):
532
+ permutation_imp = permutation_importance(model, features, target, random_state=0, scoring=scoring, n_repeats=n_repeats)
533
+ for n, score in enumerate(permutation_imp):
534
+ if n == 0:
535
+ df = pd.DataFrame(data=permutation_imp[score]['importances_mean'], index=features.columns)
536
+ df = df.rename(columns={0:score})
537
+ else:
538
+ df[score] = permutation_imp[score]['importances_mean']
539
+ return df, permutation_imp
540
+ def plot_permutation_feature_importance(permutation_imp_full, metric, col_names, n_repeats, train_or_test):
541
+ os.makedirs("./tmp/permutation_feat_imp", exist_ok=True)
542
+ sorted_importances_idx = permutation_imp_full[metric].importances_mean.argsort()
543
+ importances = pd.DataFrame(
544
+ permutation_imp_full[metric].importances[sorted_importances_idx].T,
545
+ columns=col_names[sorted_importances_idx],
546
+ )
547
+ ax = importances.plot.box(vert=False, whis=10)
548
+ ax.set_title("Permutation Importances(" + train_or_test + ")")
549
+ ax.axvline(x=0, color="k", linestyle="--")
550
+ ax.set_xlabel("Decrease in accuracy score")
551
+ ax.figure.tight_layout()
552
+ plt.savefig('./tmp/permutation_feat_imp/' + train_or_test + '_' + metric + '_repeats' + str(n_repeats) +'.png')
553
+
554
+ from scipy.cluster import hierarchy
555
+ from scipy.spatial.distance import squareform
556
+ from scipy.stats import spearmanr
557
+ full_dataset_feat = pd.concat([train_features, test_features], axis=0)
558
+ print(train_features)
559
+ print(full_dataset_feat)
560
+ fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 8))
561
+ corr = spearmanr(full_dataset_feat).correlation
562
+
563
+ # Ensure the correlation matrix is symmetric
564
+ corr = (corr + corr.T) / 2
565
+ np.fill_diagonal(corr, 1)
566
+
567
+ # We convert the correlation matrix to a distance matrix before performing
568
+ # hierarchical clustering using Ward's linkage.
569
+ distance_matrix = 1 - np.abs(corr)
570
+ dist_linkage = hierarchy.ward(squareform(distance_matrix))
571
+ dendro = hierarchy.dendrogram(
572
+ dist_linkage, labels=full_dataset_feat.columns.to_list(), ax=ax1, leaf_rotation=90
573
+ )
574
+ dendro_idx = np.arange(0, len(dendro["ivl"]))
575
+
576
+ ax2.imshow(corr[dendro["leaves"], :][:, dendro["leaves"]])
577
+ ax2.set_xticks(dendro_idx)
578
+ ax2.set_yticks(dendro_idx)
579
+ ax2.set_xticklabels(dendro["ivl"], rotation="vertical")
580
+ ax2.set_yticklabels(dendro["ivl"])
581
+ _ = fig.tight_layout()
582
+ plt.show()
583
+ plt.close()
584
+
585
+ #features_to_drop = ["TotalEngagementMRC", "NumCommExacPrior6mo", "WeekAvgCATQ2", "WeekAvgCATQ4"]
586
+
587
+ #X_train_sel = train_features.drop(columns=features_to_drop)
588
+ #X_test_sel = test_features.drop(columns=features_to_drop)
589
+
590
+ from collections import defaultdict
591
+
592
+ cluster_ids = hierarchy.fcluster(dist_linkage, 0.5, criterion="distance")
593
+ cluster_id_to_feature_ids = defaultdict(list)
594
+ for idx, cluster_id in enumerate(cluster_ids):
595
+ cluster_id_to_feature_ids[cluster_id].append(idx)
596
+ selected_features = [v[0] for v in cluster_id_to_feature_ids.values()]
597
+ selected_features_names = full_dataset_feat.columns[selected_features]
598
+
599
+ X_train_sel = train_features[selected_features_names]
600
+ X_test_sel = test_features[selected_features_names]
601
+ print(selected_features_names)
602
+ # retrain
603
+ # Perform calibration
604
+ model_sig_perm = CalibratedClassifierCV(
605
+ model[0], method='sigmoid',cv=cross_val_fold_indices)
606
+ model_sig_perm.fit(X_train_sel, train_target)
607
+ probs_sig = model_sig_perm.predict_proba(X_test_sel)[:, 1]
608
+ probs_sig_2 = model_sig_perm.predict_proba(X_test_sel)
609
+ preds_sig = model_sig_perm.predict(X_test_sel)
610
+ print('before')
611
+ print(calib_metrics_sig)
612
+ # Generate metrics for calibrated model
613
+ calib_metrics_sig = copd.calc_eval_metrics_for_model(
614
+ test_target, preds_sig, probs_sig, 'sig')
615
+ print(calib_metrics_sig)
616
+
617
+ def plot_permutation_importance(clf, X, y, ax):
618
+ result = permutation_importance(clf, X, y, n_repeats=10, random_state=42, n_jobs=2,scoring='average_precision')
619
+ perm_sorted_idx = result.importances_mean.argsort()
620
+
621
+ ax.boxplot(
622
+ result.importances[perm_sorted_idx].T,
623
+ vert=False,
624
+ labels=X.columns[perm_sorted_idx],
625
+ )
626
+ ax.axvline(x=0, color="k", linestyle="--")
627
+ return ax
628
+ fig, ax = plt.subplots(figsize=(7, 6))
629
+ plot_permutation_importance(model_sig_perm, X_test_sel, test_target, ax)
630
+ ax.set_title("Permutation Importances on selected subset of features\n(test set)")
631
+ ax.set_xlabel("Decrease in accuracy score")
632
+ ax.figure.tight_layout()
633
+ plt.savefig('./tmp/permutation_feat_imp.png')
634
+
635
+ #for metric in ['f1', 'average_precision', 'roc_auc']:
636
+ # for n_repeats in [5,10, 50]:
637
+ # permutation_imp_train_df, permutation_imp_train_dict = calc_permutation_importance(model_sig, train_features, train_target, scoring=scoring, n_repeats=n_repeats)
638
+ # plot_permutation_feature_importance(permutation_imp_train_dict, metric, train_features.columns, n_repeats, 'train')
639
+ # for n_repeats in [5,10, 50]:
640
+ # permutation_imp_test_df, permutation_imp_test_dict = calc_permutation_importance(model_sig, test_features, test_target, scoring=scoring, n_repeats=n_repeats)
641
+ # plot_permutation_feature_importance(permutation_imp_test_dict, metric, test_features.columns, n_repeats, 'test')
642
+ """
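The commented-out exploration above clusters features on Spearman correlation distance and keeps one representative per cluster to reduce multicollinearity. A runnable condensation on synthetic data (the feature layout and the 0.5 cut height are illustrative, not the pipeline's values):

```python
import numpy as np
from collections import defaultdict
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
# Fourth feature is a near-duplicate of the first
X = np.column_stack([base, base[:, 0] + 0.01 * rng.normal(size=200)])

corr = spearmanr(X).correlation
corr = (corr + corr.T) / 2            # enforce exact symmetry
np.fill_diagonal(corr, 1)
# Convert correlation to a distance before hierarchical clustering
dist_linkage = hierarchy.ward(squareform(1 - np.abs(corr)))
cluster_ids = hierarchy.fcluster(dist_linkage, 0.5, criterion="distance")

cluster_to_feats = defaultdict(list)
for idx, cid in enumerate(cluster_ids):
    cluster_to_feats[cid].append(idx)
# Keep the first feature of each cluster, as in the block above
selected = sorted(v[0] for v in cluster_to_feats.values())
```

The duplicated pair collapses into one cluster, so only three of the four features survive the cut.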
643
+ ### Log metrics, parameters, and artifacts ###
644
+ # Log metrics averaged across folds
645
+ for score in scoring:
646
+ mlflow.log_metric(score, crossval['test_' + score].mean())
647
+ mlflow.log_metric(score + '_std', crossval['test_' + score].std())
648
+ # Log metrics for calibrated models
649
+ if model[1] != 'dummy_classifier':
650
+ mlflow.log_metrics(uncalib_metrics_test_mean)
651
+ mlflow.log_metrics(calib_metrics_sig)
652
+ mlflow.log_metrics(calib_metrics_iso)
653
+ mlflow.log_metrics(calib_metrics_spline)
654
+ mlflow.log_metrics(metrics_by_event_type_uncalib)
655
+ mlflow.log_metrics(metrics_by_event_type_sig)
656
+ mlflow.log_metrics(metrics_by_event_type_iso)
657
+ mlflow.log_metrics(metrics_by_event_type_spline)
658
+ # Log model parameters
659
+ params = model[0].get_params()
660
+ for param in params:
661
+ mlflow.log_param(param, params[param])
662
+ # Log artifacts
663
+ mlflow.log_artifacts(config['outputs']['artifact_dir'])
664
+
665
+ # Save calibrated models (sigmoid, isotonic, spline)
666
+ with open('./data/model/trained_sig_' + model[1] + '.pkl', 'wb') as files:
667
+ pickle.dump(model_sig, files)
668
+ with open('./data/model/trained_iso_' + model[1] + '.pkl', 'wb') as files:
669
+ pickle.dump(model_iso, files)
670
+ with open('./data/model/trained_spline_' + model[1] + '.pkl', 'wb') as files:
671
+ pickle.dump(spline_calib, files)
672
+
673
+ mlflow.end_run()
training/cross_val_first_models.py ADDED
@@ -0,0 +1,460 @@
1
+ import os
2
+ import sys
3
+ import numpy as np
4
+ import pandas as pd
5
+ import mlflow
6
+ import model_h
7
+
8
+ # Plotting
9
+ import matplotlib.pyplot as plt
10
+ import seaborn as sns
11
+
12
+ # Model training and evaluation
13
+ from sklearn.dummy import DummyClassifier
14
+ from sklearn.linear_model import LogisticRegression
15
+ from sklearn.ensemble import RandomForestClassifier
16
+ from sklearn.model_selection import cross_validate, cross_val_predict
17
+ from sklearn.metrics import confusion_matrix, precision_recall_curve
18
+ from sklearn.calibration import calibration_curve, CalibratedClassifierCV
19
+ from imblearn.ensemble import BalancedRandomForestClassifier, BalancedBaggingClassifier
20
+ import lightgbm as lgb
21
+ import xgboost as xgb
22
+ from catboost import CatBoostClassifier
23
+ import ml_insights as mli
24
+
25
+ # Explainability
26
+ import shap
27
+
28
+ ##############################################################
29
+ # Specify which model to perform cross validation on
30
+ ##############################################################
31
+ model_only_hosp = True
32
+ if model_only_hosp is True:
33
+ file_suffix = "_only_hosp"
34
+ else:
35
+ file_suffix = "_hosp_comm"
36
+
37
+ ##############################################################
38
+ # Load data
39
+ ##############################################################
40
+ # Setup log file
41
+ log = open("./training/logging/modelling" + file_suffix + ".log", "w")
42
+ sys.stdout = log
43
+
44
+ # Load CV folds
45
+ fold_patients = np.load(
46
+ './data/cohort_info/fold_patients' + file_suffix + '.npy', allow_pickle=True)
47
+
48
+ # Load imputed train data
49
+ train_data_imp = model_h.load_data_for_modelling(
50
+ './data/model_data/train_data_cv_imp' + file_suffix + '.pkl')
51
+ train_data_imp = train_data_imp.drop(columns=['Sex_F', 'Age_TEnc'])
52
+
53
+ # Load not imputed train data
54
+ train_data_no_imp = model_h.load_data_for_modelling(
55
+ './data/model_data/train_data_cv_no_imp' + file_suffix + '.pkl')
56
+ train_data_no_imp = train_data_no_imp.drop(columns=['Sex_F', 'Age_TEnc'])
57
+
58
+ # Load imputed test data
59
+ test_data_imp = model_h.load_data_for_modelling(
60
+ './data/model_data/test_data_imp' + file_suffix + '.pkl')
61
+ test_data_imp = test_data_imp.drop(columns=['Sex_F', 'Age_TEnc'])
62
+
63
+ # Load not imputed test data
64
+ test_data_no_imp = model_h.load_data_for_modelling(
65
+ './data/model_data/test_data_no_imp' + file_suffix + '.pkl')
66
+ test_data_no_imp = test_data_no_imp.drop(columns=['Sex_F', 'Age_TEnc'])
67
+
68
+ # Create a tuple with training and validation indices for each fold. Can be done with
69
+ # either imputed or non-imputed data, as both contain the same patients
70
+ cross_val_fold_indices = []
71
+ for fold in fold_patients:
72
+ fold_val_ids = train_data_no_imp[train_data_no_imp.StudyId.isin(fold)]
73
+ fold_train_ids = train_data_no_imp[~(
74
+ train_data_no_imp.StudyId.isin(fold_val_ids.StudyId))]
75
+
76
+ # Get index of rows in val and train
77
+ fold_val_index = fold_val_ids.index
78
+ fold_train_index = fold_train_ids.index
79
+
80
+ # Append tuple of training and val indices
81
+ cross_val_fold_indices.append((fold_train_index, fold_val_index))
82
+
83
+ # Create list of model features
84
+ cols_to_drop = ['StudyId', 'ExacWithin3Months']
85
+ features_list = [col for col in train_data_no_imp.columns if col not in cols_to_drop]
86
+
87
+ # Train data
88
+ # Separate features from target for data with no imputation performed
89
+ train_features_no_imp = train_data_no_imp[features_list].astype('float')
90
+ train_target_no_imp = train_data_no_imp.ExacWithin3Months.astype('float')
91
+ # Separate features from target for data with imputation performed
92
+ train_features_imp = train_data_imp[features_list].astype('float')
93
+ train_target_imp = train_data_imp.ExacWithin3Months.astype('float')
94
+
95
+ # Test data
96
+ # Separate features from target for data with no imputation performed
97
+ test_features_no_imp = test_data_no_imp[features_list].astype('float')
98
+ test_target_no_imp = test_data_no_imp.ExacWithin3Months.astype('float')
99
+ # Separate features from target for data with imputation performed
100
+ test_features_imp = test_data_imp[features_list].astype('float')
101
+ test_target_imp = test_data_imp.ExacWithin3Months.astype('float')
102
+
103
+ # Check that the target in the imputed and non-imputed datasets is the same. If not,
104
+ # raise an error
105
+ if not train_target_no_imp.equals(train_target_imp):
106
+ raise ValueError(
107
+ 'Target variable is not the same in imputed and non imputed datasets in the train set.')
108
+ if not test_target_no_imp.equals(test_target_imp):
109
+ raise ValueError(
110
+ 'Target variable is not the same in imputed and non imputed datasets in the test set.')
111
+ train_target = train_target_no_imp
112
+ test_target = test_target_no_imp
113
+
114
+ # Make sure all features are numeric
115
+ for features in [train_features_no_imp, train_features_imp,
116
+ test_features_no_imp, test_features_imp]:
117
+ for col in features:
118
+ features[col] = pd.to_numeric(features[col], errors='coerce')
119
+
120
+ ##############################################################
121
+ # Specify which models to evaluate
122
+ ##############################################################
123
+ # Set up MLflow
124
+ mlflow.set_tracking_uri("sqlite:///mlruns.db")
125
+ mlflow.set_experiment('model_h_drop_1' + file_suffix)
126
+
127
+ # Set CV scoring strategies and any model parameters
128
+ scoring = ['f1', 'balanced_accuracy', 'accuracy', 'precision', 'recall', 'roc_auc',
129
+ 'average_precision', 'neg_brier_score']
130
+ scale_pos_weight = train_target.value_counts()[0] / train_target.value_counts()[1]
131
+
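The `scale_pos_weight` computed above is the standard boosting-library correction for class imbalance: the count of negatives divided by the count of positives. A tiny sketch with a synthetic target:

```python
import pandas as pd

train_target = pd.Series([0, 0, 0, 0, 0, 0, 1, 1])  # 6 negatives, 2 positives
counts = train_target.value_counts()
# Ratio of negative to positive class, passed to XGBoost/LightGBM
scale_pos_weight = counts[0] / counts[1]
```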
132
+ # Set up models, each tuple contains 4 elements: model, model name, imputation status,
133
+ # type of model
134
+ models = []
135
+ # Dummy classifier
136
+ models.append((DummyClassifier(strategy='stratified'),
137
+ 'dummy_classifier', 'imputed'))
138
+ # Logistic regression
139
+ models.append((LogisticRegression(random_state=0, max_iter=200),
140
+ 'logistic_regression', 'imputed', 'linear'))
141
+ models.append((LogisticRegression(random_state=0, class_weight='balanced', max_iter=200),
142
+ 'logistic_regression_CW_balanced', 'imputed', 'linear'))
143
+ # Random forest
144
+ models.append((RandomForestClassifier(random_state=0),
145
+ 'random_forest', 'imputed', 'tree'))
146
+ models.append((RandomForestClassifier(random_state=0, class_weight='balanced'),
147
+ 'random_forest_CW_balanced', 'imputed', 'tree'))
148
+ models.append((BalancedRandomForestClassifier(random_state=0),
149
+ 'balanced_random_forest', 'imputed', 'tree'))
150
+ # Bagging
151
+ models.append((BalancedBaggingClassifier(random_state=0),
152
+ 'balanced_bagging', 'imputed', 'tree'))
153
+ # XGBoost
154
+ models.append((xgb.XGBClassifier(random_state=0, use_label_encoder=False,
155
+ eval_metric='logloss', learning_rate=0.1),
156
+ 'xgb', 'not_imputed', 'tree'))
157
+ models.append((xgb.XGBClassifier(random_state=0, use_label_encoder=False,
158
+ eval_metric='logloss', learning_rate=0.1, max_depth=4),
159
+ 'xgb_mdepth_4', 'not_imputed', 'tree'))
160
+ models.append((xgb.XGBClassifier(random_state=0, use_label_encoder=False,
161
+ eval_metric='logloss', scale_pos_weight=scale_pos_weight, learning_rate=0.1),
162
+ 'xgb_spw', 'not_imputed', 'tree'))
163
+ models.append((xgb.XGBClassifier(random_state=0, use_label_encoder=False,
164
+ eval_metric='logloss', scale_pos_weight=scale_pos_weight, learning_rate=0.1,
165
+ max_depth=4),
166
+ 'xgb_spw_mdepth_4', 'not_imputed', 'tree'))
167
+ # Light GBM
168
+ models.append((lgb.LGBMClassifier(random_state=0, learning_rate=0.1, verbose=-1),
169
+ 'lgbm', 'not_imputed', 'tree'))
170
+ models.append((lgb.LGBMClassifier(random_state=0, learning_rate=0.1,
171
+ scale_pos_weight=scale_pos_weight, verbose=-1),
172
+ 'lgbm_spw', 'not_imputed', 'tree'))
173
+ # CatBoost
174
+ models.append((CatBoostClassifier(random_state=0, learning_rate=0.1),
175
+ 'catboost', 'not_imputed', 'tree'))
176
+
177
+ # Convert features and target to a numpy array
178
+ # Train data
179
+ #train_features_no_imp = train_features_no_imp.to_numpy()
180
+ #train_features_imp = train_features_imp.to_numpy()
181
+ #train_target = train_target.to_numpy()
182
+ # Test data
183
+ #test_features_no_imp = test_features_no_imp.to_numpy()
184
+ #test_features_imp = test_features_imp.to_numpy()
185
+ #test_target = test_target.to_numpy()
186
+
187
+ ##############################################################
188
+ # Run models
189
+ ##############################################################
190
+ # In each MLflow run, perform K-fold cross-validation and capture the mean score across folds.
191
+ with mlflow.start_run(run_name='model_selection_less_features_3rd_iter_minus_sex'):
192
+ for model in models:
193
+ with mlflow.start_run(run_name=model[1], nested=True):
194
+ print(model[1])
195
+ # Create the artifacts directory if it doesn't exist
196
+ artifact_dir = './tmp'
197
+ os.makedirs(artifact_dir, exist_ok=True)
198
+ # Remove existing directory contents to not mix files between different runs
199
+ for f in os.listdir(artifact_dir):
200
+ os.remove(os.path.join(artifact_dir, f))
201
+
202
+ # Perform K-fold cross validation with custom folds using imputed dataset for
203
+ # non-sparsity aware models
204
+ if model[2] == 'imputed':
205
+ crossval = cross_validate(model[0], train_features_imp, train_target,
206
+ cv=cross_val_fold_indices,
207
+ return_estimator=True, scoring=scoring,
208
+ return_indices=True)
209
+
210
+ # Get the out-of-fold predicted probabilities for each sample
211
+ probabilities_cv = cross_val_predict(model[0], train_features_imp,
212
+ train_target, cv=cross_val_fold_indices,
213
+ method='predict_proba')[:, 1]
214
+ else:
215
+ crossval = cross_validate(model[0], train_features_no_imp, train_target,
216
+ cv=cross_val_fold_indices, return_estimator=True,
217
+ scoring=scoring, return_indices=True)
218
+
219
+ # Get the out-of-fold predicted probabilities for each sample
220
+ probabilities_cv = cross_val_predict(model[0], train_features_no_imp,
221
+ train_target, cv=cross_val_fold_indices,
222
+ method='predict_proba')[:, 1]
223
+
224
+ # Get threshold that gives best F1 score
225
+ precision, recall, thresholds = precision_recall_curve(
226
+ train_target, probabilities_cv)
227
+ fscore = (2 * precision * recall) / (precision + recall)
228
+ # When getting the max fscore, if fscore is nan, nan will be returned as the
229
+ # max. Iterate until nan not returned.
230
+ fscore_zero = True
231
+ position = -1
232
+ while fscore_zero:
233
+ best_thres_idx = np.argsort(fscore, axis=0)[position]
234
+ if np.isnan(fscore[best_thres_idx]):
235
+ position = position - 1
236
+ else:
237
+ fscore_zero = False
238
+ best_threshold = thresholds[best_thres_idx]
239
+ print('Best Threshold=%f, F-Score=%.3f, Precision=%.3f, Recall=%.3f' % (
240
+ best_threshold, fscore[best_thres_idx], precision[best_thres_idx],
241
+ recall[best_thres_idx]))
242
+ # Save f1 score, precision and recall for the best threshold
243
+ mlflow.log_metric('best_threshold', best_threshold)
244
+ mlflow.log_metric('f1_best_thres', fscore[best_thres_idx])
245
+ mlflow.log_metric('precision_best_thres', precision[best_thres_idx])
246
+ mlflow.log_metric('recall_best_thres', recall[best_thres_idx])
247
+
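An equivalent, loop-free way to pick the best-F1 threshold is `np.nanargmax`, which skips the NaN F-scores that motivate the while-loop above (toy scores below, not pipeline data):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 0, 1, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.4, 0.8, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
with np.errstate(invalid="ignore"):  # 0/0 where precision + recall == 0
    fscore = (2 * precision * recall) / (precision + recall)
# The last precision/recall point has no threshold; nanargmax ignores NaNs
best_idx = np.nanargmax(fscore[:-1])
best_threshold = thresholds[best_idx]
```

On this toy example the F1-optimal cut falls at the 0.35 score, where all positives are still recalled at the highest attainable precision.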
248
+ # Plot confusion matrix at different thresholds
249
+ thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, best_threshold]
250
+ for threshold in thresholds:
251
+ y_predicted = probabilities_cv > threshold
252
+ model_h.plot_confusion_matrix(
253
+ train_target, y_predicted, model[1], threshold, file_suffix)
254
+
255
+ # Generate calibration curves
256
+ if model[1] != 'dummy_classifier':
257
+ # Calibrated model (Sigmoid)
258
+ model_sig = CalibratedClassifierCV(
259
+ model[0], method='sigmoid',cv=cross_val_fold_indices)
260
+ if model[2] == 'imputed':
261
+ model_sig.fit(train_features_imp, train_target)
262
+ probs_sig = model_sig.predict_proba(test_features_imp)[:, 1]
263
+ else:
264
+ model_sig.fit(train_features_no_imp, train_target)
265
+ probs_sig = model_sig.predict_proba(test_features_no_imp)[:, 1]
266
+
267
+ # Calibrated model (Isotonic)
268
+ model_iso = CalibratedClassifierCV(
269
+ model[0], method='isotonic', cv=cross_val_fold_indices)
270
+ if model[2] == 'imputed':
271
+ model_iso.fit(train_features_imp, train_target)
272
+ probs_iso = model_iso.predict_proba(test_features_imp)[:, 1]
273
+ else:
274
+ model_iso.fit(train_features_no_imp, train_target)
275
+ probs_iso = model_iso.predict_proba(test_features_no_imp)[:, 1]
276
+
277
+ # Spline calibration
278
+ spline_calib = mli.SplineCalib()
279
+ spline_calib.fit(probabilities_cv, train_target)
280
+
281
+ if model[2] == 'imputed':
282
+ model[0].fit(train_features_imp,train_target)
283
+ preds_test_uncalib = model[0].predict_proba(test_features_imp)[:,1]
284
+ else:
285
+ model[0].fit(train_features_no_imp,train_target)
286
+ preds_test_uncalib = model[0].predict_proba(test_features_no_imp)[:,1]
287
+ probs_spline = spline_calib.calibrate(preds_test_uncalib)
288
+
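The calibrators fitted above all map raw scores to calibrated probabilities. A self-contained sketch of the two scikit-learn methods on synthetic data (the spline method comes from `ml_insights` and is omitted here; the synthetic dataset is illustrative only):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
base = LogisticRegression(max_iter=200)

calibrated = {}
for method in ("sigmoid", "isotonic"):
    # cv=5 refits the base estimator per fold and calibrates on held-out data
    calib = CalibratedClassifierCV(base, method=method, cv=5)
    calib.fit(X, y)
    calibrated[method] = calib.predict_proba(X)[:, 1]
```

In the scripts themselves, `cv=cross_val_fold_indices` replaces the integer so calibration respects the patient-grouped folds.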
289
+ # Plot calibration curves for equal width bins (each bin has same width) and
290
+ # equal frequency bins (each bin has same number of observations)
291
+ for strategy in ['uniform', 'quantile']:
292
+ for bin_num in [5, 10]:
293
+ if strategy == 'uniform':
294
+ print('--- Creating calibration curve with equal width bins ---')
295
+ print('-- Num bins:', bin_num, ' --')
296
+ else:
297
+ print('--- Creating calibration curve with equal frequency bins ---')
298
+ print('-- Num bins:', bin_num, ' --')
299
+ print('Uncalibrated model:')
300
+ prob_true_uncal, prob_pred_uncal = calibration_curve(
301
+ train_target, probabilities_cv,n_bins=bin_num, strategy=strategy)
302
+ print('Calibrated model (sigmoid):')
303
+ prob_true_sig, prob_pred_sig = calibration_curve(
304
+ test_target, probs_sig, n_bins=bin_num, strategy=strategy)
305
+ print('Calibrated model (isotonic):')
306
+ prob_true_iso, prob_pred_iso = calibration_curve(
307
+ test_target, probs_iso, n_bins=bin_num, strategy=strategy)
308
+ print('Calibrated model (spline):')
309
+ prob_true_spline, prob_pred_spline = calibration_curve(
310
+ test_target, probs_spline, n_bins=bin_num, strategy=strategy)
311
+
312
+ plt.figure(figsize=(8,8))
313
+ plt.plot([0, 1], [0, 1], linestyle='--')
314
+ plt.plot(prob_pred_uncal, prob_true_uncal, marker='.',
315
+ label='Uncalibrated\n' + model[1])
316
+ plt.plot(prob_pred_sig, prob_true_sig, marker='.',
317
+ label='Calibrated (Sigmoid)\n' + model[1])
318
+ plt.plot(prob_pred_iso, prob_true_iso, marker='.',
319
+ label='Calibrated (Isotonic)\n' + model[1])
320
+ plt.plot(prob_pred_spline, prob_true_spline, marker='.',
321
+ label='Calibrated (Spline)\n' + model[1])
322
+ plt.legend(bbox_to_anchor=(1.05, 1.0), loc='upper left')
323
+ plt.tight_layout()
324
+ plt.savefig(os.path.join(artifact_dir, model[1] + '_uncal_' +
325
+ strategy + '_bins' + str(bin_num) +
326
+ file_suffix + '.png'))
327
+ plt.close()
328
+
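The `strategy` argument of `calibration_curve` controls the binning compared above: 'uniform' gives equal-width bins, 'quantile' gives equal-frequency bins. A minimal sketch with synthetic, well-calibrated scores:

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
scores = rng.uniform(size=500)
labels = rng.uniform(size=500) < scores  # labels drawn to match the scores

curves = {}
for strategy in ("uniform", "quantile"):
    prob_true, prob_pred = calibration_curve(
        labels, scores, n_bins=5, strategy=strategy)
    curves[strategy] = (prob_true, prob_pred)
```

Each curve returns at most `n_bins` points (empty uniform bins are dropped); a well-calibrated model tracks the diagonal under both strategies.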
329
+ # Get total gain and total cover for boosting machine models
330
+ if model[1].startswith("xgb"):
331
+ feat_importance_tot_gain_df = model_h.plot_feat_importance_model(
332
+ model[0], model[1], file_suffix=file_suffix)
333
+ if model[1].startswith("lgbm"):
334
+ feature_names = train_features_no_imp.columns.tolist()
335
+ feat_importance_tot_gain_df = model_h.plot_feat_importance_model(
336
+ model[0], model[1], file_suffix=file_suffix, feature_names=feature_names)
337
+ # Save feature importance by total gain
338
+ if model[1].startswith("xgb") or model[1].startswith("lgbm"):
339
+ feat_importance_tot_gain_df.to_csv(
340
+ './data/feature_importance_tot_gain' + file_suffix + '.csv', index=False)
341
+
342
+ # SHAP
343
+ if model[1] not in ['dummy_classifier', 'balanced_bagging']:
344
+ shap_values_list_train = []
345
+ shap_vals_per_cv = {}
346
+
347
+ # Create a dictionary to contain shap values. Dictionary is structured as
348
+ # index : fold_num : shap_values
349
+ for idx in range(0, len(train_data_imp)):
350
+ shap_vals_per_cv[idx] = {}
351
+ for n_fold in range(0, 5):
352
+ shap_vals_per_cv[idx][n_fold] = {}
353
+
354
+ # Get SHAP values for each fold
355
+ fold_num = 0
356
+ for i, estimator in enumerate(crossval['estimator']):
357
+ fold_num = fold_num + 1
358
+ # If imputation needed for model, use imputed features
359
+ if model[1] in ['logistic_regression',
360
+ 'logistic_regression_CW_balanced', 'random_forest',
361
+ 'random_forest_CW_balanced', 'balanced_bagging',
362
+ 'balanced_random_forest']:
363
+ #X_test = train_features_imp[crossval['indices']['test'][i]]
364
+ X_train = train_features_imp.iloc[crossval['indices']['train'][i]]
365
+ X_test = train_features_imp.iloc[crossval['indices']['test'][i]]
366
+ else:
367
+ X_train = train_features_no_imp.iloc[crossval['indices']['train'][i]]
368
+ X_test = train_features_no_imp.iloc[crossval['indices']['test'][i]]
369
+
370
+ # Apply different explainers depending on type of model
371
+ if model[3] == 'linear':
372
+ explainer = shap.LinearExplainer(estimator, X_train)
373
+ if model[3] == 'tree':
374
+ explainer = shap.TreeExplainer(estimator)
375
+
376
+ # Get shap values
377
+ shap_values_train = explainer.shap_values(X_train)
378
+ # Output of shap values for some models is (class, num samples,
379
+ # num features). Get these in the format of (num samples, num features)
380
+ if len(np.shape(shap_values_train)) == 3:
381
+ shap_values_train = shap_values_train[1]
382
+
383
+ # Plot SHAP plots for each cv fold
384
+ shap.summary_plot(np.array(shap_values_train), X_train, show=False)
385
+ plt.savefig(os.path.join(artifact_dir, model[1] + '_shap_cv_fold_' +
386
+ str(fold_num) + file_suffix + '.png'))
387
+ plt.close()
388
+
389
+ # Add shap values to a dictionary.
390
+ train_idxs = X_train.index.tolist()
391
+ for n, train_idx in enumerate(train_idxs):
392
+ shap_vals_per_cv[train_idx][i] = shap_values_train[n]
393
+
394
+ # Calculate average shap values
395
+ average_shap_values, stds, ranges = [],[],[]
396
+ for i in range(0,len(train_data_imp)):
397
+ for n in range(0,5):
398
+ # If a cv fold is empty as that set has not been used in training,
399
+ # replace empty fold with NaN
400
+ try:
401
+ if not shap_vals_per_cv[i][n]:
402
+ shap_vals_per_cv[i][n] = np.nan
403
+ except ValueError:  # value is already a SHAP array, not an empty placeholder
404
+ pass
405
+ # Create a df for each index that contains all shap values for each cv
406
+ # fold
407
+ df_per_obs = pd.DataFrame.from_dict(shap_vals_per_cv[i])
408
+ # Get relevant statistics for every sample
409
+ average_shap_values.append(df_per_obs.mean(axis=1).values)
410
+ stds.append(df_per_obs.std(axis=1).values)
411
+ ranges.append(df_per_obs.max(axis=1).values-df_per_obs.min(axis=1).values)
412
+
413
+ # Plot SHAP plots
414
+ if model[2] == 'imputed':
415
+ shap.summary_plot(np.array(average_shap_values), train_data_imp.drop(
416
+ columns=['StudyId', 'ExacWithin3Months']), show=False)
417
+ if model[2] == 'not_imputed':
418
+ shap.summary_plot(np.array(average_shap_values), train_data_no_imp.drop(
419
+ columns=['StudyId', 'ExacWithin3Months']), show=False)
420
+ plt.savefig(
421
+ os.path.join(artifact_dir, model[1] + '_shap' + file_suffix + '.png'))
422
+ plt.close()
423
+
424
+ # Get list of most important features in order
425
+ feat_importance_df = model_h.get_shap_feat_importance(
426
+ model[1], average_shap_values, features_list, file_suffix)
427
+ feat_importance_df.to_csv(
428
+ './data/feature_importance_shap' + file_suffix + '.csv', index=False)
429
+
430
+ # Plot distribution of model scores (histogram plus KDE)
431
+ model_scores = pd.DataFrame({'model_score': probabilities_cv,
432
+ 'true_label': train_target})
433
+ sns.displot(model_scores, x="model_score", hue="true_label", kde=True)
434
+ plt.savefig(os.path.join(artifact_dir, model[1] + 'score_distribution' +
435
+ file_suffix + '.png'))
436
+ plt.close()
437
+
438
+ # Log metrics averaged across folds
439
+ for score in scoring:
440
+ mlflow.log_metric(score, crossval['test_' + score].mean())
441
+ mlflow.log_metric(score + '_std', crossval['test_' + score].std())
442
+ # Log model parameters
443
+ params = model[0].get_params()
444
+ for param in params:
445
+ mlflow.log_param(param, params[param])
446
+ # Log artifacts
447
+ mlflow.log_artifacts(artifact_dir)
448
+
449
+ mlflow.end_run()
450
+
451
+ # Join shap feature importance and total gain
452
+ shap_feat_importance = pd.read_csv(
453
+ './data/feature_importance_shap' + file_suffix + '.csv')
454
+ tot_gain_feat_importance = pd.read_csv(
455
+ './data/feature_importance_tot_gain' + file_suffix + '.csv')
456
+ tot_gain_feat_importance = tot_gain_feat_importance.rename(columns={'index':'col_name'})
457
+ feat_importance_hierarchy = shap_feat_importance.merge(
458
+ tot_gain_feat_importance, on='col_name', how='left')
459
+ feat_importance_hierarchy.to_csv(
460
+ './data/feat_importance_hierarchy' + file_suffix + '.csv', index=False)
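The per-sample aggregation of SHAP values across CV folds above (mean, std and range taken over the fold columns of a per-observation dataframe) can be sketched in isolation. The numbers below are invented toy values, not model output; the script fills folds in which a sample never appeared with NaN before averaging:

```python
import numpy as np
import pandas as pd

# Toy stand-in for shap_vals_per_cv[i]: {fold: shap_vector_over_features}.
# This sample sat in the train split of folds 0 and 1 only.
shap_for_sample = {
    0: np.array([0.1, -0.2]),
    1: np.array([0.3, -0.4]),
    2: np.array([np.nan, np.nan]),
}

# Columns are folds, rows are features; mean/max/min skip NaN by default
df_per_obs = pd.DataFrame.from_dict(shap_for_sample)
mean_shap = df_per_obs.mean(axis=1).values    # per-feature average over folds
spread = (df_per_obs.max(axis=1) - df_per_obs.min(axis=1)).values
```

Because pandas ignores NaN in these reductions, a sample only contributes SHAP values from the folds where it was actually used in training.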
training/encode_and_impute.py ADDED
@@ -0,0 +1,286 @@
1
+ """Script that performs encoding of categorical features and imputation.
2
+
3
+ Performs encoding of categorical features and imputation of missing values. After encoding
4
+ and imputation, a subset of features is dropped. Two versions of the data are saved:
5
+ an imputed and a not-imputed dataframe.
6
+ """
7
+
8
+ import pandas as pd
9
+ import numpy as np
10
+ import os
11
+ import sys
12
+ import yaml
13
+ import json
14
+ import joblib
15
+ import encoding
16
+ import imputation
17
+
18
+
19
+ with open("./training/config.yaml", "r") as config:
20
+ config = yaml.safe_load(config)
21
+
22
+ # Specify which model to generate features for
23
+ model_type = config["model_settings"]["model_type"]
24
+
25
+ # Setup log file
26
+ log = open("./training/logging/encode_and_impute_" + model_type + ".log", "w")
27
+ sys.stdout = log
28
+
29
+ # Dataset to process - set through config file
30
+ data_to_process = config["model_settings"]["data_to_process"]
31
+
32
+ # Load data
33
+ data = pd.read_pickle(
34
+ os.path.join(
35
+ config["outputs"]["processed_data_dir"],
36
+ "{}_combined_{}.pkl".format(data_to_process, model_type),
37
+ )
38
+ )
39
+
40
+ ############################################################################
41
+ # Target encode categorical data
42
+ ############################################################################
43
+
44
+ categorical_cols = [
45
+ "LatestSymptomDiaryQ8",
46
+ "LatestSymptomDiaryQ9",
47
+ "LatestSymptomDiaryQ10",
48
+ "DaysSinceLastExac",
49
+ "AgeBinned",
50
+ "Comorbidities",
51
+ "FEV1PercentPredicted",
52
+ ]
53
+
54
+ # Multiple types of nans present in data ('nan' and np.NaN). Convert all these to 'nan' for
55
+ # categorical columns
56
+ for categorical_col in categorical_cols:
57
+ data[categorical_col] = data[categorical_col].replace(np.nan, "nan")
58
+
59
+ if data_to_process == "train":
60
+ # Get target encodings for entire train set
61
+ target_encodings = encoding.get_target_encodings(
62
+ train_data=data,
63
+ cols_to_encode=categorical_cols,
64
+ target_col="ExacWithin3Months",
65
+ smooth="auto",
66
+ )
67
+ train_encoded = encoding.apply_target_encodings(
68
+ data=data,
69
+ cols_to_encode=categorical_cols,
70
+ encodings=target_encodings,
71
+ drop_categorical_cols=False,
72
+ )
73
+ json.dump(
74
+ target_encodings,
75
+ open("./data/artifacts/target_encodings_" + model_type + ".json", "w"),
76
+ )
77
+
78
+ # K-fold target encode
79
+ # Get info on which patients belong to which fold
80
+ fold_patients = np.load(
81
+ os.path.join(
82
+ config["outputs"]["cohort_info_dir"],
83
+ "fold_patients_{}.npy".format(model_type),
84
+ ),
85
+ allow_pickle=True,
86
+ )
87
+ train_encoded_cv, target_encodings = encoding.kfold_target_encode(
88
+ df=data,
89
+ fold_ids=fold_patients,
90
+ cols_to_encode=categorical_cols,
91
+ id_col="StudyId",
92
+ target="ExacWithin3Months",
93
+ smooth="auto",
94
+ drop_categorical_cols=False,
95
+ )
96
+
97
+ # Drop categorical cols except for AgeBinned as it is needed in imputation step
98
+ categorical_cols.remove("AgeBinned")
99
+ train_encoded = train_encoded.drop(columns=categorical_cols)
100
+ train_encoded_cv = train_encoded_cv.drop(columns=categorical_cols)
101
+
102
+ if (data_to_process == "test") | (data_to_process == "forward_val"):
103
+ # Encode test set/forward val set based on entire train set
104
+ target_encodings = json.load(
105
+ open("./data/artifacts/target_encodings_" + model_type + ".json")
106
+ )
107
+ test_encoded = encoding.apply_target_encodings(
108
+ data=data,
109
+ cols_to_encode=categorical_cols,
110
+ encodings=target_encodings,
111
+ drop_categorical_cols=False,
112
+ )
113
+
114
+ # Drop categorical cols except for AgeBinned as it is needed in imputation step
115
+ categorical_cols.remove("AgeBinned")
116
+ test_encoded = test_encoded.drop(columns=categorical_cols)
117
+
118
+ ############################################################################
119
+ # Impute missing data
120
+ ############################################################################
121
+
122
+ cols_to_ignore = [
123
+ "StudyId",
124
+ "PatientId",
125
+ "IndexDate",
126
+ "ExacWithin3Months",
127
+ "HospExacWithin3Months",
128
+ "CommExacWithin3Months",
129
+ "Age",
130
+ "Sex_F",
131
+ "SafeHavenID",
132
+ "AgeBinned",
133
+ ]
134
+
135
+ if data_to_process == "train":
136
+ # Impute entire train set
137
+ not_imputed_train = train_encoded.copy()
138
+ cols_to_impute = train_encoded.drop(columns=cols_to_ignore).columns
139
+
140
+ imputer = imputation.get_imputer(
141
+ train_data=train_encoded,
142
+ cols_to_impute=cols_to_impute,
143
+ average_type="median",
144
+ cols_to_groupby=["AgeBinned", "Sex_F"],
145
+ )
146
+ imputed_train = imputation.apply_imputer(
147
+ data=train_encoded,
148
+ cols_to_impute=cols_to_impute,
149
+ imputer=imputer,
150
+ cols_to_groupby=["AgeBinned", "Sex_F"],
151
+ )
152
+ joblib.dump(imputer, "./data/artifacts/imputer_" + model_type + ".pkl")
153
+
154
+ # K-fold impute
155
+ not_imputed_train_cv = train_encoded_cv.copy()
156
+ imputed_train_cv = imputation.kfold_impute(
157
+ df=train_encoded,
158
+ fold_ids=fold_patients,
159
+ cols_to_impute=cols_to_impute,
160
+ average_type="median",
161
+ cols_to_groupby=["AgeBinned", "Sex_F"],
162
+ id_col="StudyId",
163
+ )
164
+
165
+ df_columns = imputed_train.columns.tolist()
166
+
167
+ if (data_to_process == "test") | (data_to_process == "forward_val"):
168
+ not_imputed_test = test_encoded.copy()
169
+ cols_to_impute = test_encoded.drop(columns=cols_to_ignore).columns
170
+
171
+ # Impute test set/forward val set based on entire train set
172
+ imputer = joblib.load("./data/artifacts/imputer_" + model_type + ".pkl")
173
+ imputed_test = imputation.apply_imputer(
174
+ data=test_encoded,
175
+ cols_to_impute=cols_to_impute,
176
+ imputer=imputer,
177
+ cols_to_groupby=["AgeBinned", "Sex_F"],
178
+ )
179
+
180
+ df_columns = imputed_test.columns.tolist()
181
+
182
+ ############################################################################
183
+ # Reduce feature space
184
+ ############################################################################
185
+ cols_to_drop_startswith = (
186
+ "DiffLatest",
187
+ "Var",
188
+ "LatestEQ5D",
189
+ "TotalEngagement",
190
+ "Age",
191
+ "NumHosp",
192
+ "Required",
193
+ "LungFunction",
194
+ "EngagementCAT",
195
+ "LatestSymptomDiary",
196
+ "LatestAlbumin",
197
+ "LatestEosinophils",
198
+ "LatestNeutrophils",
199
+ "LatestWhite Blood Count",
200
+ )
201
+
202
+ additional_cols_to_drop = [
203
+ "PatientId",
204
+ "SafeHavenID",
205
+ "Sex_F",
206
+ "NumCommExacPrior6mo",
207
+ "AsthmaOverlap",
208
+ "TimeSinceLungFunc",
209
+ "LatestNeutLymphRatio",
210
+ "EngagementEQ5DTW1",
211
+ "EngagementMRCTW1",
212
+ "LatestMRCQ1",
213
+ "WeekAvgCATQ1",
214
+ "WeekAvgCATQ3",
215
+ "WeekAvgCATQ4",
216
+ "WeekAvgCATQ5",
217
+ "WeekAvgCATQ6",
218
+ "WeekAvgCATQ7",
219
+ "WeekAvgCATQ8",
220
+ "WeekAvgSymptomDiaryQ1",
221
+ "WeekAvgSymptomDiaryQ3",
222
+ "WeekAvgSymptomDiaryScore",
223
+ "EngagementSymptomDiaryTW1",
224
+ "ScaledSumSymptomDiaryQ3TW1",
225
+ # "Comorbidities_te",
226
+ ]
227
+
228
+ cols_to_drop = []
229
+ cols_to_drop.extend(
230
+ [item for item in df_columns if item.startswith(cols_to_drop_startswith)]
231
+ )
232
+ cols_to_drop.extend(additional_cols_to_drop)
233
+
234
+ if data_to_process == "train":
235
+ imputed_train = imputed_train.drop(columns=cols_to_drop)
236
+ not_imputed_train = not_imputed_train.drop(columns=cols_to_drop)
237
+ imputed_train_cv = imputed_train_cv.drop(columns=cols_to_drop)
238
+ not_imputed_train_cv = not_imputed_train_cv.drop(columns=cols_to_drop)
239
+ if (data_to_process == "test") | (data_to_process == "forward_val"):
240
+ imputed_test = imputed_test.drop(columns=cols_to_drop)
241
+ not_imputed_test = not_imputed_test.drop(columns=cols_to_drop)
242
+
243
+ ############################################################################
244
+ # Save data
245
+ ############################################################################
246
+ os.makedirs(config["outputs"]["model_input_data_dir"], exist_ok=True)
247
+
248
+ if data_to_process == "train":
249
+ imputed_train.to_pickle(
250
+ os.path.join(
251
+ config["outputs"]["model_input_data_dir"],
252
+ "{}_imputed_{}.pkl".format(data_to_process, model_type),
253
+ )
254
+ )
255
+ not_imputed_train.to_pickle(
256
+ os.path.join(
257
+ config["outputs"]["model_input_data_dir"],
258
+ "{}_not_imputed_{}.pkl".format(data_to_process, model_type),
259
+ )
260
+ )
261
+ imputed_train_cv.to_pickle(
262
+ os.path.join(
263
+ config["outputs"]["model_input_data_dir"],
264
+ "{}_imputed_cv_{}.pkl".format(data_to_process, model_type),
265
+ )
266
+ )
267
+ not_imputed_train_cv.to_pickle(
268
+ os.path.join(
269
+ config["outputs"]["model_input_data_dir"],
270
+ "{}_not_imputed_cv_{}.pkl".format(data_to_process, model_type),
271
+ )
272
+ )
273
+
274
+ if (data_to_process == "test") | (data_to_process == "forward_val"):
275
+ imputed_test.to_pickle(
276
+ os.path.join(
277
+ config["outputs"]["model_input_data_dir"],
278
+ "{}_imputed_{}.pkl".format(data_to_process, model_type),
279
+ )
280
+ )
281
+ not_imputed_test.to_pickle(
282
+ os.path.join(
283
+ config["outputs"]["model_input_data_dir"],
284
+ "{}_not_imputed_{}.pkl".format(data_to_process, model_type),
285
+ )
286
+ )
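The feature-reduction step in `encode_and_impute.py` relies on `str.startswith` accepting a tuple of prefixes, which is why `cols_to_drop_startswith` is a tuple rather than a list. A minimal illustration with hypothetical column names:

```python
# str.startswith accepts a tuple of prefixes (but not a list).
prefixes = ("Var", "NumHosp")
columns = ["VarCATScore", "NumHospPrior12mo", "LatestFEV1"]

dropped = [c for c in columns if c.startswith(prefixes)]
kept = [c for c in columns if not c.startswith(prefixes)]
```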
training/encoding.py ADDED
@@ -0,0 +1,281 @@
1
+ """Functions for encoding categorical data with additive smoothing techniques applied."""
2
+
3
+ import numpy as np
4
+ import pandas as pd
5
+ import itertools
6
+ from sklearn.preprocessing import TargetEncoder
7
+
8
+
9
+ def get_target_encodings(
10
+ *,
11
+ train_data,
12
+ cols_to_encode,
13
+ target_col,
14
+ smooth="auto",
15
+ keep_nans_as_category=False,
16
+ cols_to_keep_nan_category=None,
17
+ ):
18
+ """
19
+ Retrieve target encodings of input data for later use (performs no encoding).
20
+
21
+ The complete train data set is used to target encode the holdout test data set. This
22
+ function is used to obtain encodings for storage and later use on separate data.
23
+
24
+ Smoothing addresses overfitting caused by sparse data by relying on the global mean
25
+ (mean of target across all rows) rather than the local mean (mean of target across a
26
+ specific category) when there are a small number of observations in a category. The
27
+ degree of smoothing is controlled by the parameter 'smooth'. Higher values of 'smooth'
28
+ increase the influence of the global mean on the target encoding. A 'smooth' value of
29
+ 100 can be interpreted as: there must be at least 100 values in the category for the
30
+ sample mean to overtake the global mean.
31
+
32
+ There is also an option to keep nans as a category for cases of data missing not at
33
+ random. The format of nan for categorical columns required for the function is 'nan'.
34
+
35
+ Use kfold_target_encode to perform kfold encoding, and apply_target_encodings to use the
36
+ output of this function on the test data.
37
+
38
+ Parameters
39
+ ----------
40
+ train_data : dataframe
41
+ data to be used for target encoding at a later stage. This is likely the full
42
+ train data set.
43
+ cols_to_encode : list of strings
44
+ names of columns to be encoded.
45
+ target_col : str
46
+ name of the target variable column.
47
+ smooth : str or float, optional
48
+ controls the amount of smoothing applied. A larger smooth value will put more
49
+ weight on the global target mean. If "auto", then smooth is set to an
50
+ empirical Bayes estimate, defaults to "auto".
51
+ keep_nans_as_category : bool, optional
52
+ option to retain nans as a category for cases of data missing not at random, by
53
+ default False.
54
+ cols_to_keep_nan_category : list of strings, optional
55
+ names of columns to keep the encoded nan category, by default None. Need to state
56
+ names of columns if keep_nans_as_category is True.
57
+
58
+ Returns
59
+ -------
60
+ encodings_all : dict
61
+ encodings used for each column.
62
+
63
+ Raises
64
+ -------
65
+ ValueError
66
+ error raised if there are multiple types of nan's in columns to be encoded.
67
+ ValueError
68
+ error raised if nans are not in the correct format: 'nan'.
69
+ ValueError
70
+ error raised if keep_nans_as_category is True but columns not provided.
71
+
72
+ """
73
+ train_data_to_encode = train_data[cols_to_encode]
74
+ train_target = train_data[target_col]
75
+
76
+ # Raise an error if there are multiple types of nan's
77
+ all_nan_types = [None, "None", np.NaN, "nan", "NAN", "N/A"]
78
+ incorrect_nan_types = ["None", np.NaN, "nan", "NAN", "N/A"]
79
+ for col in train_data_to_encode:
80
+ cat_present = train_data_to_encode[col].unique().tolist()
81
+ if len(list(set(all_nan_types) & set(cat_present))) > 1:
82
+ raise ValueError(
83
+ "Multiple types of nans present in data. Make sure that missing values in"
84
+ "categorical columns are all recorded as 'nan'."
85
+ )
86
+ # Raise an error if nan not in correct format for function
87
+ if any(element in all_nan_types for element in cat_present):
88
+ if not "nan" in cat_present:
89
+ raise ValueError(
90
+ "Missing values in categorical columns are not recorded as 'nan'."
91
+ )
92
+
93
+ encoder = TargetEncoder(smooth=smooth)
94
+ encoder = encoder.fit(train_data_to_encode, train_target)
95
+
96
+ # Get dictionary with encodings
97
+ paired_dicts = []
98
+ paired_arrays = zip(encoder.categories_, encoder.encodings_)
99
+ for category_array, value_array in paired_arrays:
100
+ paired_dict = dict(zip(category_array, value_array))
101
+ paired_dicts.append(paired_dict)
102
+ encodings_all = dict(zip(encoder.feature_names_in_, paired_dicts))
103
+
104
+ # Sklearn treats nans as a category. The default in this function is to convert nan
105
+ # categories back to np.NaN unless stated otherwise.
106
+ if keep_nans_as_category is False:
107
+ for col in encodings_all:
108
+ encodings_all[col].update({"nan": np.nan})
109
+ # If it is specified to keep nan categories for specific features, only those features
110
+ # not specified in cols_to_keep_nan_category are converted to np.NaN.
111
+ if (keep_nans_as_category is True) and (cols_to_keep_nan_category is None):
112
+ raise ValueError(
113
+ "Parameter keep_nans_as_category is True but cols_to_keep_nan_category not provided."
114
+ )
115
+ if (keep_nans_as_category is True) and not (cols_to_keep_nan_category is None):
116
+ cols_to_remove_nan_cat = set(cols_to_encode) - set(cols_to_keep_nan_category)
117
+ for col_to_remove_nan in cols_to_remove_nan_cat:
118
+ encodings_all[col_to_remove_nan].update({"nan": np.nan})
119
+
120
+ return encodings_all
121
+
122
+
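The docstring's interpretation of the `smooth` parameter can be written out directly. This is a hand-rolled sketch of additive smoothing, not the exact empirical-Bayes rule sklearn's `TargetEncoder` applies when `smooth="auto"`:

```python
import numpy as np

def smoothed_encoding(category_targets, global_mean, smooth):
    """Blend a category's target mean with the global mean; a larger
    `smooth` pulls the encoding further toward the global mean."""
    n = len(category_targets)
    local_mean = float(np.mean(category_targets))
    return (n * local_mean + smooth * global_mean) / (n + smooth)

# A rare category (3 observations) stays close to the global mean ...
rare = smoothed_encoding([1, 1, 1], global_mean=0.2, smooth=100)
# ... while a common category (1000 observations) is dominated by its own mean.
common = smoothed_encoding([1] * 1000, global_mean=0.2, smooth=100)
```

With `smooth=100`, a category needs on the order of 100 observations before its sample mean outweighs the global mean, matching the interpretation given in the docstring above.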
123
+ def apply_target_encodings(*, data, cols_to_encode, encodings, drop_categorical_cols=False):
124
+ """Target encode input data with supplied encodings.
125
+
126
+ Parameters
127
+ ----------
128
+ data : dataframe
129
+ data with columns to be target encoded.
130
+ cols_to_encode : list of strings
131
+ list of columns to target encode.
132
+ encodings : dict
133
+ target encodings to use on input data (from training data).
134
+ drop_categorical_cols: bool, optional
135
+ option to drop categorical columns after encoding, defaults to False.
136
+
137
+ Returns
138
+ -------
139
+ data : dataframe
140
+ target encoded version of the input data.
141
+
142
+ Raises
143
+ -------
144
+ AssertionError
145
+ raises an error if the column to be encoded is not in the passed data.
146
+ ValueError
147
+ error raised if nans are not in the correct format: 'nan'.
148
+ ValueError
149
+ error raised if keep_nans_as_category is True but columns not provided.
150
+
151
+ """
152
+ data_to_encode = data[cols_to_encode]
153
+ # Raise an error if there are multiple types of nan's
154
+ all_nan_types = [None, "None", np.NaN, "nan", "NAN", "N/A"]
155
+ incorrect_nan_types = ["None", np.NaN, "nan", "NAN", "N/A"]
156
+ for col in data_to_encode:
157
+ cat_present = data_to_encode[col].unique().tolist()
158
+ if len(list(set(all_nan_types) & set(cat_present))) > 1:
159
+ raise ValueError(
160
+ "Multiple types of nans present in data. Make sure that missing values in"
161
+ "categorical columns are all recorded as 'nan'."
162
+ )
163
+ # Raise an error if nan not in correct format for function
164
+ if any(element in all_nan_types for element in cat_present):
165
+ if not "nan" in cat_present:
166
+ raise ValueError(
167
+ "Missing values in categorical columns are not recorded as 'nan'."
168
+ )
169
+
170
+ encoded_data = data.copy()
171
+ for col in cols_to_encode:
172
+ assert (
173
+ col in encodings.keys()
174
+ ), "No target encodings found for {} column".format(col)
175
+ encodings_col = encodings[col]
176
+
177
+ # Account for the case where the new data includes a category not present
178
+ # in the train data encodings and set that category encoding nan
179
+ data_unique = data[col].unique().tolist()
180
+ encodings_unique = list(set(encodings_col.keys()))
181
+ diffs = np.setdiff1d(data_unique, encodings_unique)
182
+ encodings_col.update(zip(diffs, itertools.repeat(np.nan)))
183
+
184
+ # Use the lookup table to place each category in the current fold with its
185
+ # encoded value from the train data (in the new _te column)
186
+ filtered = encoded_data.filter(items=[col])
187
+ filtered_encodings = filtered.replace(encodings_col)
188
+ filtered_encodings = filtered_encodings.rename(columns={col: col + "_te"})
189
+ encoded_data = pd.concat([encoded_data, filtered_encodings], axis=1)
190
+
191
+ if drop_categorical_cols is True:
192
+ encoded_data = encoded_data.drop(columns=cols_to_encode)
193
+ return encoded_data
194
+
195
+
196
+ def kfold_target_encode(
197
+ *,
198
+ df,
199
+ fold_ids,
200
+ cols_to_encode,
201
+ id_col,
202
+ target,
203
+ smooth="auto",
204
+ keep_nans_as_category=False,
205
+ cols_to_keep_nan_category=None,
206
+ drop_categorical_cols=False,
207
+ ):
208
+ """Perform K-fold target encoding.
209
+
210
+ Fold by fold target encoding of train data is used to prevent data leakage in cross-
211
+ validation (the same folds are used for encoding and CV). For example, in 5-fold
212
+ target encoding, each fold is encoded using the other 4 folds and that fold is
213
+ then used as the validation fold in CV. Smoothing is performed on each K-fold.
214
+
215
+ Parameters
216
+ ----------
217
+ df : dataframe
218
+ data with columns to be target encoded. Will generally be the train data.
219
+ fold_ids : list of arrays
220
+ each array contains the validation patient IDs for each fold.
221
+ cols_to_encode : list of strings
222
+ columns to target encode.
223
+ id_col : str
224
+ name of patient ID column.
225
+ target : str
226
+ name of target column.
227
+ smooth : str or float, optional
228
+ controls the amount of smoothing applied. A larger smooth value will put more
229
+ weight on the global target mean. If "auto", then smooth is set to an
230
+ empirical Bayes estimate, defaults to "auto".
231
+ keep_nans_as_category : bool, optional
232
+ option to retain nans as a category for cases of data missing not at random, by
233
+ default False.
234
+ cols_to_keep_nan_category : list of strings, optional
235
+ names of columns to keep the encoded nan category, by default None. Need to state
236
+ names of columns if keep_nans_as_category is True.
237
+ drop_categorical_cols: bool, optional
238
+ option to drop categorical columns after encoding, defaults to False.
239
+
240
+ Returns
241
+ -------
242
+ encoded_df_cv : dataframe
243
+ k-fold target encoded version of the input data.
244
+ fold_encodings_all : dataframe
245
+ contains target encodings for each fold.
246
+
247
+ """
248
+ # Loop over CV folds and perform K-fold target encoding
249
+ encoded_data_cv = []
250
+ fold_encodings_all = []
251
+ for fold in fold_ids:
252
+ # Divide data into train folds and validation fold
253
+ validation_fold = df[df[id_col].isin(fold)]
254
+ train_folds = df[~df[id_col].isin(fold)]
255
+
256
+ # Obtain target encodings from train folds
257
+ fold_encodings = get_target_encodings(
258
+ train_data=train_folds,
259
+ cols_to_encode=cols_to_encode,
260
+ target_col=target,
261
+ smooth=smooth,
262
+ keep_nans_as_category=keep_nans_as_category,
263
+ cols_to_keep_nan_category=cols_to_keep_nan_category,
264
+ )
265
+ fold_encodings_all.append(fold_encodings)
266
+
267
+ # Apply to validation fold
268
+ encoded_data_fold = apply_target_encodings(
269
+ data=validation_fold,
270
+ cols_to_encode=cols_to_encode,
271
+ encodings=fold_encodings,
272
+ drop_categorical_cols=drop_categorical_cols,
273
+ )
274
+ encoded_data_cv.append(encoded_data_fold)
275
+
276
+ # Place the encoded validation fold data into a single df
277
+ encoded_df_cv = pd.concat(encoded_data_cv)
278
+
279
+ # Place the encodings for all folds into a df
280
+ fold_encodings_all = pd.json_normalize(fold_encodings_all).T
281
+ return encoded_df_cv, fold_encodings_all
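The fold-splitting pattern used in `kfold_target_encode` (and mirrored in `kfold_impute`) can be sketched independently: folds are defined by patient ID, so anything fitted on `train_folds` never sees the validation patients. The `StudyId` values and fold assignment below are invented:

```python
import pandas as pd

df = pd.DataFrame({
    "StudyId": [1, 1, 2, 3, 4],
    "ExacWithin3Months": [1, 0, 0, 1, 0],
})
fold_ids = [[1, 2], [3, 4]]   # validation patient IDs per fold

splits = []
for fold in fold_ids:
    # Divide data into train folds and validation fold, exactly as above
    validation_fold = df[df["StudyId"].isin(fold)]
    train_folds = df[~df["StudyId"].isin(fold)]
    splits.append((validation_fold, train_folds))
    # Patient-level split: no ID appears on both sides of a fold
    assert set(validation_fold["StudyId"]).isdisjoint(train_folds["StudyId"])
```

Splitting on patient rather than row also keeps repeated observations from the same patient together, which is what prevents leakage between encoding and cross-validation.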
training/imputation.py ADDED
@@ -0,0 +1,255 @@
1
+ """Functions for imputing missing data."""
2
+
3
+ import numpy as np
4
+ import pandas as pd
5
+ from pandas.api.types import is_numeric_dtype
6
+
7
+
8
+ def replace_nan_with_mode(df, col, col_mode, random_state):
9
+ """Replaces nan in categorical columns with the mode.
10
+
11
+ Function only used for categorical columns. Replaces nan in categorical columns with the
12
+ mode. If there are multiple modes, one mode is randomly chosen.
13
+
14
+ Parameters
15
+ ----------
16
+ df : dataframe
17
+ dataframe containing the categorical columns to impute and the columns with the mode.
18
+ col : str
19
+ name of column to impute.
20
+ col_mode : str
21
+ name of column containing the calculated modes.
22
+ random_state : int
23
+ to seed the random generator.
24
+
25
+ Returns
26
+ -------
27
+ df : dataframe
28
+ input dataframe with missing values in categorical column imputed.
29
+
30
+ """
31
+ np.random.seed(seed=random_state)
32
+ for index, row in df.iterrows():
33
+ if row[col] == "nan":
34
+ # Deals with cases where there are multiple modes. If there are multiple modes,
35
+ # one mode is chosen at random
36
+ if (not isinstance(row[col_mode], str)) and len(row[col_mode]) > 1:
37
+ df.at[index, col] = np.random.choice(list(row[col_mode]))
38
+ else:
39
+ df.at[index, col] = row[col_mode]
40
+ return df
41
+
42
+
43
+ def get_imputer(*, train_data, cols_to_impute, average_type, cols_to_groupby=None):
44
+ """Retrieve imputer of input data for later use (performs no imputing).
45
+
46
+ The complete train data set is used to impute the holdout test data set. This function
47
+ is used to obtain imputations for storage and later use on separate data. The average
48
+ specified (e.g. median) can be calculated on the data provided or the data can be
49
+ grouped by specified features (e.g. binned age and sex) prior to average calculation.
50
+ For categorical columns, the mode is used. The format of nan for categorical columns
51
+ required for the function is 'nan'.
52
+
53
+ Use apply_imputer to perform imputation.
54
+
55
+ Parameters
56
+ ----------
57
+ train_data : dataframe
58
+ data to be used for imputation at a later stage. This is likely the full train
59
+ data set.
60
+ cols_to_impute : list of strings
61
+ names of columns to perform imputation on. If cols_to_groupby is not None, the
62
+ grouping columns cannot also appear in cols_to_impute.
63
+ average_type : str
64
+ type of average to calculate. Must be either 'median' or 'mean'. For categorical
65
+ columns, the 'mode' will automatically be calculated.
66
+ cols_to_groupby : list of strings, optional
67
+ option to group data before calculating average.
68
+
69
+ Returns
70
+ -------
71
+ imputer : dataframe
72
+ contains average values calculated, to be used in imputation.
73
+
74
+ Raises
75
+ -------
76
+ ValueError
77
+ raises an error if elements in cols_to_groupby appear in cols_to_impute
78
+ ValueError
79
+ error raised if nans are not in the correct format: 'nan'.
80
+
81
+ """
82
+ if average_type not in ["mean", "median"]:
83
+ raise ValueError("average_type must be either 'mean or 'median'.")
84
+
85
+ if not cols_to_groupby is None:
86
+ if any(column in cols_to_groupby for column in cols_to_impute):
87
+ raise ValueError(
88
+ "Elements in cols_to_groupby should not appear in cols_to_impute"
89
+ )
90
+
91
+ imputer = []
92
+ for col in cols_to_impute:
93
+ is_numeric = is_numeric_dtype(train_data[col])
94
+ # For numeric columns, calculate specified average_type
95
+ if is_numeric:
96
+ imputer_for_col = train_data.groupby(cols_to_groupby)[col].agg(
97
+ [average_type]
98
+ )
99
+ imputer_for_col = imputer_for_col.rename(columns={average_type: col})
100
+ # For categorical columns, calculate mode
101
+ else:
102
+ # Raise an error if nans are in the incorrect format
103
+ all_nan_types = [None, "None", np.nan, "nan", "NAN", "N/A"]
104
+ cat_present = train_data[col].unique().tolist()
105
+ if any(element in all_nan_types for element in cat_present):
106
+ if not "nan" in cat_present:
107
+ raise ValueError(
108
+ "Missing values in categorical columns are not recorded as 'nan'."
109
+ )
110
+
111
+ # Drop any categories with 'nan' so that when getting the mode, nan isn't
112
+ # treated as a category
113
+ cols_to_groupby_plus_impute = cols_to_groupby + [col]
114
+ train_data_col = train_data[cols_to_groupby_plus_impute]
115
+ train_data_col = train_data_col[train_data_col[col] != "nan"]
116
+ imputer_for_col = train_data_col.groupby(cols_to_groupby).agg(
117
+ pd.Series.mode
118
+ )
119
+ imputer.append(imputer_for_col)
120
+ imputer = pd.concat(imputer, axis=1)
121
+
122
+ # If there are any nans after grouping, fill nans with values from similar groups
123
+ imputer = imputer.sort_values(cols_to_groupby)
124
+ imputer = imputer.ffill().bfill()
125
+ else:
126
+ imputer = train_data[cols_to_impute].agg(average_type)
127
+ return imputer
128
+
129
+
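The grouped-average imputer built above can be approximated in a few lines of pandas. This sketch covers only the numeric/median path of `get_imputer` plus `apply_imputer`; the column names are illustrative, not the real feature set:

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({
    "AgeBinned": ["60-70", "60-70", "70-80", "70-80"],
    "Sex_F": [1, 1, 0, 0],
    "LatestFEV1": [1.2, 1.4, np.nan, 0.9],
})

# Per-(AgeBinned, Sex_F) medians: the lookup table the imputer stores
imputer = (train.groupby(["AgeBinned", "Sex_F"])["LatestFEV1"]
           .median().rename("LatestFEV1_avg").reset_index())

# Merge the lookup back on and fill only the missing entries
merged = train.merge(imputer, on=["AgeBinned", "Sex_F"], how="left")
filled = merged["LatestFEV1"].fillna(merged["LatestFEV1_avg"])
```

Fitting the lookup on training data and merging it onto new data is what lets the same imputer be serialised with joblib and reapplied to the test and forward-validation sets.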
130
+ def apply_imputer(*, data, cols_to_impute, imputer, cols_to_groupby=None):
131
+ """Impute input data with supplied imputer.
132
+
133
+ Parameters
134
+ ----------
135
+ data : dataframe
136
+ data with columns to be imputed.
137
+ cols_to_impute : list of strings
138
+ names of columns to be imputed. If cols_to_groupby is not None, the grouping
139
+ columns cannot also appear in cols_to_impute.
140
+ imputer : dataframe
141
+ contains average values calculated, to be used in imputation.
142
+ cols_to_groupby : list of strings, optional
143
+ option to group data before calculating average. Must be the same as in get_imputer.
144
+
145
+ Returns
146
+ -------
147
+ data : dataframe
148
+ imputed dataframe.
149
+
150
+ Raises
151
+ -------
152
+ ValueError
153
+ raises an error if cols_to_groupby in apply_imputer do not match cols_to_groupby in
154
+ get_imputer.
155
+
156
+ """
157
+ if (cols_to_groupby) != (imputer.index.names):
158
+ raise ValueError(
159
+ "Groups used to generate the imputer and apply imputer must be the same. \
160
+ Groups used to generate the imputer are: {}, groups used to apply the \
161
+ imputer are: {}".format(
162
+ imputer.index.names, cols_to_groupby
163
+ )
164
+ )
165
+ if not cols_to_groupby is None:
166
+ imputer = imputer.add_suffix("_avg")
167
+ imputer = imputer.reset_index()
168
+
169
+ data_imputed = data.merge(imputer, on=cols_to_groupby, how="left")
170
+ for col in cols_to_impute:
171
+ is_numeric = is_numeric_dtype(data[col])
172
+ if is_numeric:
173
+ data_imputed[col] = np.where(
174
+ data_imputed[col].isna(),
175
+ data_imputed[col + "_avg"],
176
+ data_imputed[col],
177
+ )
178
+ else:
179
+ data_imputed = replace_nan_with_mode(
180
+ data_imputed, col, col + "_avg", random_state=0
181
+ )
182
+
183
+ # Drop columns containing the average values
184
+ data_imputed = data_imputed.loc[:, ~data_imputed.columns.str.endswith("_avg")]
185
+ else:
186
+ data_imputed = data_imputed.fillna(imputer)
187
+ return data_imputed
188
+
189
+
190
+ def kfold_impute(
191
+ *,
192
+ df,
193
+ fold_ids,
194
+ cols_to_impute,
195
+ average_type,
196
+ cols_to_groupby,
197
+ id_col,
198
+ ):
199
+ """Perform K-fold imputation.
200
+
201
+ Fold by fold imputation of train data is used to prevent data leakage in cross-
202
+ validation (the same folds are used for imputation and CV). For example, in 5-fold
203
+ imputation, each fold is imputed using the other 4 folds and that fold is then used as
204
+ the validation fold in CV.
205
+
206
+ Parameters
207
+ ----------
208
+ df : dataframe
209
+ data with columns to be imputed. Will generally be the train data.
210
+ fold_ids : list of arrays
211
+ each array contains the validation patient IDs for each fold.
212
+ cols_to_impute : list of strings
213
+ columns to impute.
214
+ average_type : str
215
+ type of average to calculate (e.g. median, mean).
216
+ cols_to_groupby : list of strings, optional
217
+ option to group data before calculating average.
218
+ id_col : str
219
+ name of patient ID column.
220
+
221
+ Returns
222
+ -------
223
+ imputed_df_cv : dataframe
224
+ k-fold imputed version of the input data.
225
+
226
+ """
227
+ # Loop over CV folds and perform K-fold imputation
228
+ imputed_data_cv = []
229
+ for fold in fold_ids:
230
+ # Divide data into train folds and validation fold
231
+ validation_fold = df[df[id_col].isin(fold)]
232
+ train_folds = df[~df[id_col].isin(fold)]
233
+
234
+ # Obtain imputers from train folds
235
+ fold_imputer = get_imputer(
236
+ train_data=train_folds,
237
+ cols_to_impute=cols_to_impute,
238
+ average_type=average_type,
239
+ cols_to_groupby=cols_to_groupby,
240
+ )
241
+
242
+ # Apply to validation fold
243
+ imputed_data_fold = apply_imputer(
244
+ data=validation_fold,
245
+ cols_to_impute=cols_to_impute,
246
+ imputer=fold_imputer,
247
+ cols_to_groupby=cols_to_groupby,
248
+ )
249
+ imputed_data_cv.append(imputed_data_fold)
250
+
251
+ # Place the imputed validation fold data into a single df
252
+ imputed_df_cv = pd.concat(imputed_data_cv)
253
+
254
+ return imputed_df_cv
255
+
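A minimal sketch of the grouped imputation pattern that `get_imputer` / `apply_imputer` implement (the toy cohort and column names here are hypothetical; the real functions also handle categorical modes and fill empty groups from similar groups):

```python
import numpy as np
import pandas as pd

# Hypothetical toy cohort: FEV1 readings with gaps, grouped by sex
df = pd.DataFrame(
    {
        "Sex": ["F", "F", "F", "M", "M", "M"],
        "FEV1": [1.0, np.nan, 2.0, 2.0, 4.0, np.nan],
    }
)

# Build a grouped imputer, mirroring get_imputer(average_type="median",
# cols_to_groupby=["Sex"]): one median per group
imputer = df.groupby("Sex")["FEV1"].median()

# Apply it, mirroring apply_imputer: each gap is filled with its own group's median
df["FEV1"] = df["FEV1"].fillna(df["Sex"].map(imputer))
print(df["FEV1"].tolist())  # [1.0, 1.5, 2.0, 2.0, 4.0, 3.0]
```

In `kfold_impute` the same two steps run once per CV fold, with the imputer fitted only on the other folds, so no information from the validation fold leaks into its own imputed values.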
training/model_h.py ADDED
@@ -0,0 +1,2061 @@
+ """Module containing code for model H (longer term exacerbation prediction)."""
+
+ # General
+ import numpy as np
+ import pandas as pd
+ import re
+ import os
+ from collections import defaultdict
+ import random
+ import json
+ import joblib
+ import mlflow
+
+ # Feature engineering
+ from sklearn import base
+ from sklearn.preprocessing import MinMaxScaler
+ from imblearn.over_sampling import RandomOverSampler, SMOTE
+
+ # Calibration
+ from sklearn.calibration import calibration_curve, CalibratedClassifierCV
+ import ml_insights as mli
+
+ # Metrics
+ from sklearn.metrics import (
+     confusion_matrix,
+     precision_recall_curve,
+     auc,
+     average_precision_score,
+     roc_auc_score,
+     brier_score_loss,
+     f1_score,
+     precision_score,
+     recall_score,
+     roc_curve,
+ )
+
+ # Explainability
+ import shap
+
+ # Plotting
+ import matplotlib.pyplot as plt
+ import seaborn as sns
+
+ ##############################################################
+ # Functions for setting up model labels
+ ##############################################################
+
+
+ def apply_logic_response_criterion(df, N=2, minimum_period=14, maximum_period=35):
+     """
+     Apply PRO LOGIC criterion 2 (consecutive negative Q5 replies required between events).
+
+     For events that occur after the minimum required period following a previous exac,
+     e.g. longer than 14 days, but before they are automatically considered as a new exac
+     event, e.g. 35 days, PRO LOGIC considers weekly PRO responses between the two events.
+     For subsequent events to count as separate events, there must be at least N
+     consecutive negative responses (no rescue meds taken) to weekly PROs between each
+     positive reply. Note PRO LOGIC is applied to both hospital and patient reported events.
+
+     Args:
+         df (pd.DataFrame): must contain columns for PatientId, DateOfEvent, Q5Answered,
+             NegativeQ5, IsExac and DaysSinceLastExac.
+         N (int): number of consecutive negative responses required between events.
+             Default is 2.
+         minimum_period (int): minimum number of days since the previous exac (any exacs
+             within this window will already be removed with PRO LOGIC criterion 1).
+             Default value is 14 days.
+         maximum_period (int): maximum number of days since the previous exac (any exacs
+             occurring after this period will automatically count as a separate event).
+             Default is 35 days.
+
+     Returns:
+         pd.DataFrame: input df with a new boolean column 'RemoveExac'.
+
+     """
+     # Retrieve dataframe indices of exacs falling under PRO LOGIC criterion 2 (Q5 replies)
+     indices = get_logic_exacerbation_indices(
+         df, minimum_period=minimum_period, maximum_period=maximum_period
+     )
+     remove_exac = []
+     # Loop over each exac and evaluate PRO LOGIC criterion, returning 1 (remove) or 0
+     for exac_index in indices:
+         remove_exac.append(logic_consecutive_negative_responses(df, exac_index, N))
+     # Create dataframe containing exac indices and a boolean column stating whether to
+     # remove that exac due to failing Q5 response criterion and merge with original df
+     remove_exac = pd.DataFrame({"ind": indices, "RemoveExac": remove_exac})
+     df = df.merge(
+         remove_exac.set_index("ind"), left_index=True, right_index=True, how="left"
+     )
+     return df
+
+
+ def bin_numeric_column(*, col, bins, labels):
+     """
+     Use pd.cut to bin numeric data into categories.
+
+     Args:
+         col (pd.Series): dataframe column to be binned.
+         bins (list): numeric values of bins.
+         labels (list): corresponding labels for the bins.
+
+     Returns:
+         pd.Series: binned column.
+     """
+     return pd.cut(col, bins=bins, labels=labels, right=False).astype("str")
+
+
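A small sketch of what `bin_numeric_column` does with `right=False` (left-closed intervals); the age bins and labels are hypothetical:

```python
import pandas as pd

# right=False makes intervals left-closed: [0, 65), [65, 80), [80, 120),
# so a value of exactly 65 falls into the "65-79" bin
ages = pd.Series([45, 65, 77, 81])
binned = pd.cut(
    ages, bins=[0, 65, 80, 120], labels=["<65", "65-79", "80+"], right=False
).astype("str")
print(binned.tolist())  # ['<65', '65-79', '65-79', '80+']
```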
+ def calculate_days_since_last_event(*, df, event_col, output_col):
+     """
+     Calculate the days since the last event, e.g. exacerbation or rescue med prescription.
+
+     Restarts the count from one the day following an event. Any days without a
+     previous event have the output column set to -1.
+
+     Args:
+         df (pd.DataFrame): dataframe with a column containing dates and a boolean column
+             stating whether an event occurred on that date.
+         event_col (str): name of the boolean column for whether an event occurred.
+         output_col (str): name of the output column to create.
+
+     Returns:
+         pd.DataFrame: the input dataframe with an additional column stating the number
+             of days since the previous event occurred (or -1 if no previous event).
+
+     """
+     # Get all events
+     all_events = df[df[event_col].eq(1)].copy()
+     all_events["PrevEvent"] = all_events.index
+     # Merge the full df with the event df on their indices to the closest date in the past
+     # i.e. the most recent exacerbation
+     df = pd.merge_asof(
+         df,
+         all_events["PrevEvent"],
+         left_index=True,
+         right_index=True,
+         direction="backward",
+     )
+     # Calculate the days since the previous event, restarting the count from 1 the
+     # day following an exacerbation (using shift)
+     df[output_col] = df.index - df["PrevEvent"].shift(1)
+     # Set to -1 for any rows without a prior exacerbation
+     df[output_col] = df[output_col].fillna(-1).astype("int64")
+     df = df.drop(columns=["PrevEvent"])
+     return df
+
+
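The `merge_asof` + `shift` trick above can be traced on a toy series (a plain integer index stands in for consecutive days; values are hypothetical):

```python
import pandas as pd

# Toy daily series: events on days 2 and 5
df = pd.DataFrame({"IsExac": [0, 0, 1, 0, 0, 1, 0]})
events = df[df.IsExac.eq(1)].copy()
events["PrevEvent"] = events.index

# Backward as-of merge attaches the most recent event day to every row
df = pd.merge_asof(
    df, events[["PrevEvent"]], left_index=True, right_index=True, direction="backward"
)

# shift(1) restarts the count on the day after each event; -1 where no prior event.
# Note the event row itself shows the days since the *previous* event, which is
# exactly the DaysSinceLastExac value the PRO LOGIC functions consume.
df["DaysSince"] = (df.index - df["PrevEvent"].shift(1)).fillna(-1).astype("int64")
print(df["DaysSince"].tolist())  # [-1, -1, -1, 1, 2, 3, 1]
```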
+ def extract_clinician_verified_exacerbations(df):
+     """
+     Extract verified events from clinician verification spreadsheets.
+
+     Extract only clinician verified events from verification spreadsheets and set the
+     date to the clinician supplied date if entered. Include a flag column for if the
+     date was changed from the PRO question response date.
+
+     Args:
+         df (pd.DataFrame): event verification data supplied by clinicians.
+
+     Returns:
+         pd.DataFrame: contains StudyId, DateOfEvent (a mix of true event dates and PRO
+             response dates if true dates unknown), IsCommExac (set to 1 here, used
+             after merging later) and ExacDateUnknown (boolean, 1 if clinicians did not
+             change the date).
+
+     """
+     # Filter for only verified events
+     df = df[df["Exacerbation confirmed"] == 1].copy()
+     df["DateRecorded"] = pd.to_datetime(df.DateRecorded, utc=True).dt.normalize()
+     df["New Date"] = pd.to_datetime(df["New Date"], utc=True).dt.normalize()
+     # Change the event date to the clinician supplied date if entered. This is considered
+     # the true event date. Set the event date to the PRO response date otherwise and flag
+     # that the true date is unknown
+     df["DateOfEvent"] = np.where(
+         df["Date changed"] == 1, df["New Date"], df["DateRecorded"]
+     )
+     df["DateOfEvent"] = np.where(
+         (df["Date changed"] == 1) & (df["New Date"].isna()),
+         df["DateRecorded"],
+         df["DateOfEvent"],
+     )
+     df["ExacDateUnknown"] = np.int64(np.where(df["Date changed"] == 1, 0, 1))
+     df["ExacDateUnknown"] = np.int64(
+         np.where(
+             (df["Date changed"] == 1) & (df["New Date"].isna()),
+             1,
+             df["ExacDateUnknown"],
+         )
+     )
+     # Flag all events as community events (this df will merge with hospital events later)
+     df["IsCommExac"] = 1
+     df = df[["StudyId", "DateOfEvent", "IsCommExac", "ExacDateUnknown"]]
+     return df
+
+
+ def define_hospital_admission(events):
+     """
+     Define whether a COPD service event was an admission and return 1 (yes) or 0 (no).
+
+     Args:
+         events (pd.Series): COPD service event type names (events previously merged
+             with PatientEventTypes.txt to obtain the event type name column).
+
+     Returns:
+         array: boolean stating whether an event was a hospital admission.
+
+     """
+     hospital_event_names = [
+         "Hospital admission - emergency, COPD related",
+         "Hospital admission - emergency, COPD unrelated",
+     ]
+     return np.where(events.isin(hospital_event_names), 1, 0)
+
+
+ def define_service_exac_event(
+     *, events, event_name_col="EventType", include_community=False
+ ):
+     """State if a COPD service event was an exacerbation and return 1 (yes) or 0 (no).
+
+     Args:
+         events (pd.Series): COPD service event type names (events previously merged
+             with PatientEventTypes.txt to obtain the event type name column).
+         event_name_col (str): name of column containing COPD service EventTypeId.
+         include_community (bool): whether to include event types corresponding to
+             patient reported exacerbations (e.g. community managed with rescue meds).
+             Defaults to False.
+
+     Returns:
+         array: boolean stating whether an event was an exacerbation.
+
+     """
+     if include_community is True:
+         exacerbation_event_names = [
+             "Hospital admission - emergency, COPD related",
+             "Exacerbation - self-managed with rescue pack",
+             "GP review - emergency, COPD related",
+             "Emergency department attendance, COPD related",
+             "Exacerbation - started abs/steroid by clinical team",
+         ]
+     else:
+         exacerbation_event_names = [
+             "Hospital admission - emergency, COPD related",
+             "GP review - emergency, COPD related",
+             "Emergency department attendance, COPD related",
+             "Exacerbation - started abs/steroid by clinical team",
+         ]
+     return np.where(events.isin(exacerbation_event_names), 1, 0)
+
+
+ def fill_column_by_patient(*, df, id_col, col):
+     """
+     Forward and back fill data by patient to fill gaps, e.g. from merges.
+
+     Args:
+         df (pd.DataFrame): patient data. Must contain col and id_col columns.
+         id_col (str): name of column containing unique patient identifiers.
+         col (str): name of column to be filled.
+
+     Returns:
+         pd.DataFrame: input data with col infilled.
+     """
+     # Fill within each patient group so values never leak between patients
+     df[col] = df.groupby(id_col)[col].transform(lambda x: x.ffill().bfill())
+     return df
+
+
+ def filter_symptom_diary(*, df, patients, date_cutoff=None):
+     """
+     Filter COPD symptom diary data for patients and dates of interest.
+
+     Args:
+         df (pd.DataFrame): symptom diary data. Must contain 'SubmissionTime' and
+             'PatientId' columns.
+         patients (list): patient IDs of interest.
+         date_cutoff (str or datetime, optional): earliest submission date to keep.
+             Defaults to None (keep all dates).
+
+     Returns:
+         pd.DataFrame: filtered symptom diary.
+     """
+     df["SubmissionTime"] = pd.to_datetime(df.SubmissionTime, utc=True).dt.normalize()
+     # Take only data from after the cutoff if provided (e.g. weekly Q5 change)
+     if date_cutoff:
+         df = df[df.SubmissionTime >= date_cutoff]
+     # Filter for patients of interest
+     df = df[df.PatientId.isin(patients)]
+     return df
+
+
+ def get_logic_exacerbation_indices(df, minimum_period=14, maximum_period=35):
+     """
+     Return dataframe indices of exacs that need checking for PRO responses since last exac.
+
+     Get the indices of exacerbations that occur long enough after the previous event to
+     not be removed by PRO LOGIC criterion 1 (e.g. within 14 days of previous exac) but
+     not long enough after to be counted as a separate event without further analysis.
+     Called by apply_logic_response_criterion.
+
+     Args:
+         df (pd.DataFrame): must contain IsExac and DaysSinceLastExac columns.
+         minimum_period (int): minimum number of days since the previous exac (any exacs
+             within this window will already be removed with PRO LOGIC criterion 1).
+             Default value is 14 days.
+         maximum_period (int): maximum number of days since the previous exac (any exacs
+             occurring after this period will automatically count as a separate event).
+             Default is 35 days.
+
+     Returns:
+         list: dataframe indices of relevant events.
+     """
+     # Get the dataframe indices for all exacerbations occurring within period of interest
+     indices = df[
+         (df.IsExac.eq(1))
+         & (df.DaysSinceLastExac > minimum_period)
+         & (df.DaysSinceLastExac <= maximum_period)
+     ].index.to_list()
+     return indices
+
+
+ def get_rescue_med_pro_responses(df):
+     """Extract all responses to weekly PRO Q5 (rescue meds).
+
+     Add new boolean columns stating if Q5 was answered, whether it was a negative
+     response (no rescue meds taken in previous week) and whether the reply means a
+     community exacerbation. The latter two columns will be opposites.
+
+     Args:
+         df (pd.DataFrame): PRO symptom diary responses.
+
+     Returns:
+         pd.DataFrame: filtered weekly PROs with additional boolean columns Q5Answered,
+             NegativeQ5 and IsCommExac.
+
+     """
+     # Extract responses to weekly PRO Q5 (rescue meds)
+     df = df[df.SymptomDiaryQ5.notna()].copy()
+     df["SymptomDiaryQ5"] = df["SymptomDiaryQ5"].astype("int64")
+     # Columns for whether Q5 was answered and if the response was negative (no exac)
+     df["Q5Answered"] = 1
+     df["NegativeQ5"] = np.int64(np.where(df.SymptomDiaryQ5 == 0, 1, 0))
+     # Define community exacerbation as a positive reply to Q5
+     df["IsCommExac"] = np.int64(np.where(df.SymptomDiaryQ5 == 1, 1, 0))
+     return df
+
+
+ def logic_consecutive_negative_responses(df, i, N=2):
+     """
+     Calculate number of consecutive -ve Q5 replies since previous exac (PRO LOGIC).
+
+     Given the dataframe index of the current exac identified as falling under the Q5
+     criterion, calculate the number of negative replies to the weekly rescue med
+     question and check if there are enough for the event to count as distinct from the
+     previous. Called by apply_logic_response_criterion.
+
+     Args:
+         df (pd.DataFrame): must contain weekly PRO replies and output from
+             get_rescue_med_pro_responses, set_pro_exac_dates and
+             calculate_days_since_exacerbation.
+         i (int): index of exac of interest.
+         N (int): number of consecutive negative rescue med responses required for the
+             event to be counted as a separate event and retained in data. Default is 2.
+
+     Returns:
+         int: flag for whether the exac failed the criterion. Returns 1 for failed (exac
+             to be removed) and 0 for passed (exac to be retained).
+
+     """
+     # Select data since the previous exacerbation
+     days = int(df.iloc[i].DaysSinceLastExac)
+     data = df.iloc[i - days + 1 : i]
+
+     # Select replies to Q5
+     data = data[data.Q5Answered.eq(1)][
+         ["PatientId", "DateOfEvent", "Q5Answered", "NegativeQ5"]
+     ]
+     # Check if there are sufficient responses
+     if len(data) < N:
+         return 1
+     else:
+         # Resample to 7 days (weekly) to account for missing responses. Resampling using
+         # the 'W' option can give spurious nans - use '7D' instead
+         data = (
+             data.set_index("DateOfEvent")
+             .resample("7D", origin="start")
+             .sum()
+             .reset_index()
+         )
+         # Calculate number of consecutive negative replies to Q5 (no rescue meds taken)
+         consecutive_negative_responses = (
+             data[data.NegativeQ5.eq(1)]["NegativeQ5"]
+             .groupby(data.NegativeQ5.eq(0).cumsum())
+             .sum()
+             .reset_index(drop=True)
+             .max()
+         )
+
+         return 1 if consecutive_negative_responses < N else 0
+
+
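The run-length idiom at the heart of the function above (group the 1s by the cumulative count of 0s, then sum within each group) can be checked in isolation on a hypothetical reply series:

```python
import pandas as pd

# 1 = negative weekly Q5 reply (no rescue meds), 0 = positive reply
neg = pd.Series([1, 1, 0, 1, 0, 1, 1, 1])

# Each 0 increments the group id, so consecutive 1s share a group;
# summing within each group gives the length of each negative run
runs = neg[neg.eq(1)].groupby(neg.eq(0).cumsum()).sum()
print(runs.tolist(), int(runs.max()))  # [2, 1, 3] 3
```

With N=2 this series would pass the criterion, since the longest run of consecutive negative replies (3) is at least N.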
+ def minimum_period_between_exacerbations(df, minimum_days=14):
+     """
+     Identify exacs occurring too soon after the previous exac based on DaysSinceLastExac.
+
+     Returns 1 if the exacerbation occurred within minimum_days of that patient's
+     previous exacerbation and 0 if not.
+
+     Args:
+         df (pd.DataFrame): must contain DaysSinceLastExac column.
+         minimum_days (int): window within which an exac is counted as part of the
+             previous event. Default is 14 days.
+
+     Returns:
+         array: contains 1 or 0.
+     """
+     return np.where(
+         (df["DaysSinceLastExac"] > 0) & (df["DaysSinceLastExac"] <= minimum_days), 1, 0
+     )
+
+
+ def remove_data_between_exacerbations(df):
+     """
+     Remove data between first exac and subsequent exacs that failed PRO LOGIC criterion 2.
+
+     Ensures only the first in a series of related events is counted. Any subsequent
+     exacs that occurred too close to the initial event without sufficient negative
+     weekly PRO responses in the interim will be flagged for removal. This function
+     flags for removal all data from the day after the first event up to and including
+     the date of the events to be removed. Data following the final event in the series
+     will be removed by minimum_period_between_exacerbations.
+
+     Args:
+         df (pd.DataFrame): must contain RemoveExac and DaysSinceLastExac columns.
+
+     Returns:
+         pd.DataFrame: days between first event and subsequent event(s) that failed the
+             Q5 criterion are now flagged for removal in RemoveRow.
+
+     """
+     indices = df[df.RemoveExac.eq(1)].index.to_list()
+     # Check there are exacerbations that failed the logic criterion for N consecutive
+     # negative responses to Q5 of weekly PROs (rescue meds)
+     if len(indices) > 0:
+         for exac_index in indices:
+             # Select data since the previous exacerbation
+             days = int(df.iloc[exac_index].DaysSinceLastExac)
+             # Set data since last exac up to and including current exac to be removed
+             df.loc[exac_index - days + 1 : exac_index, "RemoveRow"] = 1
+     return df
+
+
+ def remove_unknown_date_exacerbations(df, days_to_remove=7):
+     """
+     Remove data prior to and including an exacerbation whose date is unknown.
+
+     Args:
+         df (pd.DataFrame): one row per day per patient for full data window. Must
+             include ExacDateUnknown column.
+         days_to_remove (int): number of days of data to remove leading up to (and
+             including) the PRO response date. Default is 7 days.
+
+     Returns:
+         pd.DataFrame: input dataframe with updated RemoveRow column.
+
+     """
+     # Get indices of all exacs whose dates are flagged as unknown.
+     indices = df[df.ExacDateUnknown.eq(1)].index.to_list()
+     # Check there are exacerbations with unknown dates (answer=1 in SymptomDiaryQ11a)
+     if len(indices) > 0:
+         for exac_index in indices:
+             # Set specified number of previous days data up to and including current
+             # exac to be removed
+             df.loc[exac_index - days_to_remove + 1 : exac_index, "RemoveRow"] = 1
+     return df
+
+
+ def set_pro_exac_dates(df):
+     """
+     Set date of community exacerbations reported in weekly PROs Q5 and flag unknown dates.
+
+     Args:
+         df (pd.DataFrame): processed weekly PRO Q5 responses, e.g. output of
+             get_rescue_med_pro_responses.
+
+     Returns:
+         pd.DataFrame: input dataframe with additional columns for DateOfEvent (datetime)
+             and ExacDateUnknown (0 or 1).
+     """
+     # Take known exacerbation (rescue med) dates from SymptomDiaryQ11b, otherwise set
+     # the date to the date of PRO response
+     df["DateOfEvent"] = np.where(
+         df.SymptomDiaryQ11a == 2, df.SymptomDiaryQ11b, df.SubmissionTime
+     )
+     # Flag which dates were unknown from the PRO response
+     df["ExacDateUnknown"] = np.int64(
+         np.where((df.IsCommExac == 1) & (df.SymptomDiaryQ11a != 2), 1, 0)
+     )
+     df["DateOfEvent"] = pd.to_datetime(df.DateOfEvent, utc=True).dt.normalize()
+     df = df.drop_duplicates(keep="last", subset=["PatientId", "DateOfEvent"])
+     return df
+
+
+ ##############################################################
+ # Functions for generating features
+ ##############################################################
+
+
+ def weigh_features_by_recency(
+     *, df, feature, feature_recency_days, median_value, decay_rate=0.01
+ ):
+     """Give more weight to more recent observations.
+
+     More weight is given to more recent observations: the older the observation, the
+     more its value is scaled towards the median. This is because abnormal values
+     observed in the past may not be reflective of current status for the patient.
+
+     Parameters
+     ----------
+     df : dataframe
+         input dataframe containing data to be scaled.
+     feature : str
+         name of feature to scale.
+     feature_recency_days : str
+         name of the column giving the number of days prior to the index date when the
+         feature was observed.
+     median_value : float
+         median value for feature across all patients.
+     decay_rate : float, optional
+         the rate at which the observation is scaled, by default 0.01. A higher decay
+         rate leads to more extreme scaling towards the median.
+
+     Returns
+     -------
+     df : dataframe
+         input dataframe with weighted columns.
+     """
+     df["Weights"] = np.exp(-decay_rate * df[feature_recency_days])
+     df["LowerThanMedian"] = np.where((df[feature] - median_value) < 0, 1, 0)
+
+     # Below-median values are divided by the weight (pulled up), above-median values
+     # are multiplied by it (pulled down)
+     df[feature + "Weighted"] = np.where(
+         df["LowerThanMedian"] == 1,
+         df[feature] / df["Weights"],
+         df[feature] * df["Weights"],
+     )
+     df = df.drop(columns=[feature_recency_days, "Weights", "LowerThanMedian"])
+     return df
+
+
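A worked numeric sketch of the exponential decay weighting above, with hypothetical values (median 100, an above-median reading of 140): the same observation contributes much less the longer ago it was taken.

```python
import numpy as np

decay_rate = 0.01
value = 140.0  # above-median observation; median_value assumed to be 100.0

for days_ago in (30, 300):
    weight = np.exp(-decay_rate * days_ago)
    # Above-median values are multiplied by the weight, pulling them down
    scaled = value * weight
    print(days_ago, round(scaled, 1))  # 30 103.7, then 300 7.0
```

Note that multiplying (or dividing) by the weight pulls values towards zero (or infinity) rather than stopping at the median, which is presumably why the decay rate is kept small by default.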
+ ##############################################################
+ # Functions for modelling
+ ##############################################################
+
+
+ def load_data_for_modelling(data_path):
+     """Load data and sort it by StudyId and IndexDate.
+
+     Args:
+         data_path (str): filepath of data.
+
+     Returns:
+         pd.DataFrame: df containing features and target.
+
+     """
+     data = pd.read_pickle(data_path)
+     data = data.sort_values(by=["StudyId", "IndexDate"])
+     data["IndexDate"] = pd.to_datetime(data.IndexDate, utc=True)
+     data = data.reset_index(drop=True)
+     return data
+
+
+ def get_mlflow_run_params(model_name, run_id, mlflow_db, model_type, parent_run=True):
+     """Search the mlflow database runs to return the parameters used in the run(s).
+
+     Searches the database containing previously logged mlflow runs to find the specified
+     run. If parent_run is True, run_id is the run id of the parent run that contains
+     multiple child runs, and the parameters of all the runs recorded under the parent
+     run id will be returned. If parent_run is False, run_id is the run id of a specific
+     run and only the parameters of that run will be returned.
+
+     Args:
+         model_name (str): name of model.
+         run_id (str): run id value of the run(s) that are to be searched.
+         mlflow_db (str): database where the mlflow runs are recorded.
+         model_type (str): contains information on type of model being run.
+         parent_run (bool, optional): specifies whether to query all child runs in the
+             parent run specified or a specific child run. Defaults to True.
+
+     Returns:
+         dict: contains parameters used in a specific run or multiple runs depending on
+             whether parent_run is True.
+
+     """
+     # Set the tracking uri to the database where runs are recorded
+     mlflow.set_tracking_uri(mlflow_db)
+
+     # Get the run_ids of the best models from hyperparameter tuning
+     if parent_run is True:
+         param_tuning_ids = mlflow.search_runs(
+             experiment_names=["model_h_drop_1_" + model_type],
+             filter_string="tags.mlflow.parentRunId = '"
+             + run_id
+             + "' and run_name = '"
+             + model_name
+             + "'",
+         )
+
+     # Get parameters for the best models from hyperparameter tuning
+     best_params = {}
+     if parent_run is True:
+         for index, row in param_tuning_ids.iterrows():
+             best_param = mlflow.get_run(row["run_id"]).data.params
+     else:
+         best_param = mlflow.get_run(run_id).data.params
+
+     # Values are all logged as strings. Convert them back to the correct types
+     for key in best_param:
+         try:
+             best_param[key] = int(best_param[key])
+         except ValueError:
+             try:
+                 best_param[key] = float(best_param[key])
+             except ValueError:
+                 pass
+         if best_param[key] == "None":
+             best_param[key] = None
+         if best_param[key] == "True":
+             best_param[key] = True
+         if best_param[key] == "False":
+             best_param[key] = False
+
+     # Return a dictionary with multiple runs if parent_run is True
+     if parent_run is True:
+         best_params[best_param["opt_scorer"]] = best_param
+         for key in best_params:
+             best_params[key].pop("opt_scorer")
+     # Return a dictionary with a single run if parent_run is False
+     else:
+         best_params = best_param
+     return best_params
+
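The string-to-type coercion loop above can be factored into a small helper; this is a sketch (the name `coerce_param` is hypothetical) that also shows why catching `ValueError` is preferable to a bare `except`:

```python
def coerce_param(value):
    """Best-effort conversion of an MLflow string param back to a typed value."""
    # MLflow logs every param as a string; try int first, then float
    for caster in (int, float):
        try:
            return caster(value)
        except ValueError:
            pass
    # Map the string forms of the Python singletons; anything else stays a string
    return {"None": None, "True": True, "False": False}.get(value, value)

print([coerce_param(v) for v in ["3", "0.1", "None", "True", "depth"]])
# [3, 0.1, None, True, 'depth']
```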
+ ##############################################################
+ # Functions for calculating metrics
+ ##############################################################
+
+
+ def plot_confusion_matrix(
643
+ thresholds,
644
+ probs,
645
+ target,
646
+ model_name,
647
+ file_suffix,
648
+ calibration_type,
649
+ split_by_event=False,
650
+ event_type="hosp_comm",
651
+ ):
652
+ """Plot confusion matrices for calibrated models at different thresholds.
653
+
654
+ Args:
655
+ thresholds (list): list of thresholds to plot the confusion matrices at.
656
+ probs (array): probability estimates for positive class for the test set.
657
+ target (array): true values.
658
+ model_name (str): name of model.
659
+ file_suffix (str): type of model run.
660
+ calibration_type (str): type of calibration.
661
+ split_by_event (bool, optional): determines whether to plot confusion matrix by event
662
+ type (hospital vs community events). Defaults to False.
663
+ event_type (str, optional): specifies the event type(s) that the confusion matrices
664
+ are being plotted for. Defaults to 'hosp_comm'.
665
+
666
+ Returns:
667
+ None.
668
+
669
+ """
670
+ # Create folders to contain confusion matrices for each calibration type
671
+ os.makedirs("./tmp/cm_" + calibration_type, exist_ok=True)
672
+
673
+ for threshold in thresholds:
674
+ y_predicted = probs > threshold
675
+ cm = confusion_matrix(target, y_predicted)
676
+ group_names = ["True Neg", "False Pos", "False Neg", "True Pos"]
677
+ group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
678
+ group_percentages = [
679
+ "{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)
680
+ ]
681
+ labels = [
682
+ f"{v1}\n{v2}\n{v3}"
683
+ for v1, v2, v3 in zip(group_names, group_counts, group_percentages)
684
+ ]
685
+ labels = np.asarray(labels).reshape(2, 2)
686
+ sns.heatmap(cm, annot=labels, fmt="", cmap="Blues")
687
+ if split_by_event is False:
688
+ output_filename = (
689
+ "./tmp/cm_"
690
+ + calibration_type
691
+ + "/"
692
+ + model_name
693
+ + "_cm_"
694
+ + calibration_type
695
+ + "_thres_"
696
+ + str(threshold)
697
+ + "_"
698
+ + file_suffix
699
+ + ".png"
700
+ )
701
+ else:
702
+ output_filename = (
703
+ "./tmp/cm_"
704
+ + calibration_type
705
+ + "/"
706
+ + model_name
707
+ + "_cm_"
708
+ + calibration_type
709
+ + "_thres_"
710
+                 + str(threshold)
+                 + "_"
+                 + event_type
+                 + ".png"
713
+ )
714
+ plt.savefig(output_filename)
715
+ plt.close()
716
+
717
+
718
+ def calc_best_f1_score(y_true, y_probs, best_threshold=None):
+     """Finds the best F1 score and the corresponding threshold, precision, and recall.
+
+     Args:
+         y_true (array): ground truth target values.
+         y_probs (array): probabilities of the positive class for the target.
+         best_threshold (float, optional): reuse a threshold computed previously.
+             Defaults to None.
+
+     Returns:
+         best_threshold: float.
+         f1_best_thres: float.
+         precision_best_thres: float.
+         recall_best_thres: float.
+
+     """
+     # Get the threshold that gives the best F1 score
+     precision, recall, thresholds = precision_recall_curve(y_true, y_probs)
+     fscore = (2 * precision * recall) / (precision + recall)
+
+     if best_threshold is None:
+         # np.argsort places nan values at the end of the sorted order, so walk
+         # backwards from the end until a non-nan F score is found.
+         position = -1
+         while True:
+             best_thres_idx = np.argsort(fscore, axis=0)[position]
+             if np.isnan(fscore[best_thres_idx]):
+                 position -= 1
+             else:
+                 break
+     else:
+         # Find the index of the threshold closest to the threshold provided.
+         best_thres_idx = (np.abs(thresholds - best_threshold)).argmin()
+
+     # Get the scores at the threshold that gives the best F1 score
+     best_threshold = thresholds[best_thres_idx]
+     f1_best_thres = fscore[best_thres_idx]
+     precision_best_thres = precision[best_thres_idx]
+     recall_best_thres = recall[best_thres_idx]
+
+     print(
+         "Best Threshold=%f, F-Score=%.3f, Precision=%.3f, Recall=%.3f"
+         % (best_threshold, f1_best_thres, precision_best_thres, recall_best_thres)
+     )
+     return best_threshold, f1_best_thres, precision_best_thres, recall_best_thres
764
+
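The threshold search above can also be reproduced with plain NumPy by scanning each observed probability as a candidate cut-off, which sidesteps the nan handling needed with `precision_recall_curve`. An illustrative reimplementation on toy data (not the project's code):

```python
import numpy as np

def best_f1_threshold(y_true, y_probs):
    """Return (threshold, f1) maximising F1 over the observed probabilities."""
    best_t, best_f1 = 0.5, -1.0
    for t in np.unique(y_probs):
        y_pred = (y_probs >= t).astype(int)
        tp = int(np.sum((y_pred == 1) & (y_true == 1)))
        fp = int(np.sum((y_pred == 1) & (y_true == 0)))
        fn = int(np.sum((y_pred == 0) & (y_true == 1)))
        if tp == 0:
            continue  # F1 is zero/undefined with no true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

y_true = np.array([0, 0, 0, 1, 1, 1])
y_probs = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
threshold, f1 = best_f1_threshold(y_true, y_probs)
```

On this perfectly separable toy example the lowest threshold achieving F1 = 1.0 (here 0.7) is returned, since later ties do not replace the stored best.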
765
+
766
+ def get_threshold_with_best_f1_score(target, probs):
767
+ """Calculate threshold at which the best F1 score is obtained.
768
+
769
+ Args:
770
+ target (array): true values for target.
771
+ probs (array): probability estimates for positive class.
772
+
773
+ Returns:
774
+         best_threshold: float.
+         f1_best_thres: float.
+         precision_best_thres: float.
+         recall_best_thres: float.
778
+
779
+ """
780
+ # Calculate precision, recall and f1 for all thresholds
781
+ precision, recall, thresholds = precision_recall_curve(target, probs)
782
+ f1score = (2 * precision * recall) / (precision + recall)
783
+
784
+ # When getting the max f1score, if f1score is nan, nan will be returned as the
785
+ # max. Iterate until nan not returned.
786
+ f1score_zero = True
787
+ position = -1
788
+ while f1score_zero is True:
789
+ best_thres_idx = np.argsort(f1score, axis=0)[position]
790
+         if np.isnan(f1score[best_thres_idx]):
791
+ position = position - 1
792
+ else:
793
+ f1score_zero = False
794
+ best_threshold = thresholds[best_thres_idx]
795
+ f1_best_thres = f1score[best_thres_idx]
796
+ precision_best_thres = precision[best_thres_idx]
797
+ recall_best_thres = recall[best_thres_idx]
798
+
799
+ print(
800
+ "Best Threshold=%f, F-Score=%.3f, Precision=%.3f, Recall=%.3f"
801
+ % (best_threshold, f1_best_thres, precision_best_thres, recall_best_thres)
802
+ )
803
+
804
+ return best_threshold, f1_best_thres, precision_best_thres, recall_best_thres
805
+
806
+
807
+ def calc_eval_metrics_for_model(
808
+ y_true, y_pred, y_probs, calib_type, best_threshold=None
809
+ ):
810
+ """Calculates evaluation metrics for models.
811
+
812
+ Calculates the evaluation metrics for uncalibrated and calibrated models. Additionally,
813
+ also calculates metrics for specific event types (e.g. only hospital events, only
814
+ community events). Metrics for specific events types are calculated based on the
815
+ best_threshold provided to allow direct comparison between the performance of the general
816
+ model and the performance broken down by event type.
817
+
818
+ Calculates the following metrics:
819
+ - f1 score at a threshold of 0.5.
820
+ - precision at a threshold of 0.5.
821
+ - recall at a threshold of 0.5.
822
+ - highest f1 score across thresholds.
823
+ - precision at threshold that has highest f1 score.
824
+ - recall at threshold that has highest f1 score.
825
+ - auc-pr.
826
+ - average precision.
827
+ - roc-auc.
828
+ - negative brier score.
829
+
830
+ Args:
831
+ y_true (array): ground truth target values.
832
+ y_pred (array): estimated targets as returned by classifier.
833
+         y_probs (array): probabilities of the positive class for the target.
+         calib_type (str): type of calibrated model to calculate metrics.
+         best_threshold (float, optional): threshold value at which to calculate
+             metrics. Defaults to None.
837
+
838
+ Returns:
839
+ dict: contains metrics for calibrated model.
840
+
841
+ """
842
+ # Calculate precision, recall and f1 score at a threshold of 0.5
843
+ precision = precision_score(y_true, y_pred)
844
+ recall = recall_score(y_true, y_pred)
845
+ f1 = f1_score(y_true, y_pred)
846
+
847
+ # Calculate f1 score at best threshold
848
+ if best_threshold is None:
849
+ (
850
+ best_thres,
851
+ f1_best_thres,
852
+ prec_best_thres,
853
+ recall_best_thres,
854
+ ) = calc_best_f1_score(y_true, y_probs)
855
+ else:
856
+ (
857
+ best_thres,
858
+ f1_best_thres,
859
+ prec_best_thres,
860
+ recall_best_thres,
861
+ ) = calc_best_f1_score(y_true, y_probs, best_threshold=best_threshold)
862
+
863
+     # Calculate areas under the curves
+     precision_, recall_, _ = precision_recall_curve(y_true, y_probs)
+     auc_pr = auc(recall_, precision_)
+     average_precision = average_precision_score(y_true, y_probs)
+     roc_auc = roc_auc_score(y_true, y_probs)
869
+
870
+     # Calculate the negative Brier score (Brier loss is always non-negative)
+     neg_brier_score = -brier_score_loss(y_true, y_probs)
872
+
873
+ # Create dict with metrics
874
+ calib_metrics = {
875
+ "precision_" + calib_type: precision,
876
+ "recall_" + calib_type: recall,
877
+ "f1_" + calib_type: f1,
878
+ "precision_best_thres_" + calib_type: prec_best_thres,
879
+ "recall_best_thres_" + calib_type: recall_best_thres,
880
+ "f1_best_thres_" + calib_type: f1_best_thres,
881
+ "auc_pr_" + calib_type: auc_pr,
882
+ "average_precision_" + calib_type: average_precision,
883
+ "roc_auc_" + calib_type: roc_auc,
884
+ "neg_brier_score_" + calib_type: neg_brier_score,
885
+ # "best_thres_" + calib_type: best_thres,
886
+ }
887
+ return calib_metrics
888
+
889
+
890
+ def create_df_probabilities_and_predictions(
891
+ probs,
892
+ best_threshold,
893
+ patient_id,
894
+ target,
895
+ hosp_comm_exac_df,
896
+ model_name,
897
+ file_suffix,
898
+ output_dir,
899
+ calib_type="uncalib",
900
+ ):
901
+ """Creates dataframe that allows plotting of shap local plots.
902
+
903
+ Creates a dataframe that contains patient identifier, model prediction probability,
904
+ threshold, model prediction, ground truth and an explanation column that describes
905
+ whether the model prediction was correct, and whether an exacerbation occurred. The
906
+ dataframe is saved in the output directory specified.
907
+
908
+ Args:
909
+ probs (array): model output probabilities for class 1 (exacerbation).
910
+ best_threshold (float): threshold where f1 score is highest, at which classification
911
+ is performed.
912
+ patient_id (list): list of patient ids in the same order as probs.
913
+ target (array): ground truth label.
914
+ hosp_comm_exac_df (pd.Dataframe): contains the event data broken down by event type.
915
+ Possible event types are hospital or community.
916
+ model_name (str): name of model.
917
+ file_suffix (str): type of model.
918
+ output_dir (str): output directory where df is saved.
919
+ calib_type (str, optional): type of calibration performed. Defaults to 'uncalib'.
920
+
921
+ Returns:
922
+ pd.DataFrame: contains probabilities, predictions and event types.
923
+
924
+ """
925
+ predicted_best_thres = probs > best_threshold
926
+ predicted_best_thres = predicted_best_thres.astype(int)
927
+ probs_target = pd.DataFrame(
928
+ {
929
+ "StudyId": patient_id,
930
+ "Probs": probs,
931
+ "Threshold": best_threshold,
932
+ "Predicted": predicted_best_thres,
933
+ "Target": target,
934
+ "Explanation": np.NaN,
935
+ }
936
+ )
937
+
938
+ probs_target = probs_target.merge(
939
+ hosp_comm_exac_df, right_index=True, left_index=True, how="left"
940
+ )
941
+
942
+ probs_target["Explanation"] = np.where(
943
+ (probs_target["Predicted"] == probs_target["Target"])
944
+ & (probs_target["Predicted"] == 0),
945
+ "true negative",
946
+ probs_target["Explanation"],
947
+ )
948
+ probs_target["Explanation"] = np.where(
949
+ (probs_target["Predicted"] == probs_target["Target"])
950
+ & (probs_target["Predicted"] == 1),
951
+ "correct",
952
+ probs_target["Explanation"],
953
+ )
954
+ probs_target["Explanation"] = np.where(
955
+ (probs_target["Predicted"] != probs_target["Target"])
956
+ & (probs_target["Predicted"] == 0),
957
+ "missed",
958
+ probs_target["Explanation"],
959
+ )
960
+ probs_target["Explanation"] = np.where(
961
+ (probs_target["Predicted"] != probs_target["Target"])
962
+ & (probs_target["Predicted"] == 1),
963
+ "incorrect",
964
+ probs_target["Explanation"],
965
+ )
966
+ probs_target.to_csv(
967
+ os.path.join(
968
+ output_dir,
969
+ "preds_and_events_"
970
+ + calib_type
971
+ + "_"
972
+ + model_name
973
+ + "_"
974
+ + file_suffix
975
+ + ".csv",
976
+ )
977
+ )
978
+ return probs_target
979
+
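The chain of `np.where` calls above assigns one label per (prediction, target) combination; NumPy's `np.select` expresses the same mapping in a single call. A toy illustration with column names mirroring those used above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Predicted": [0, 1, 0, 1], "Target": [0, 1, 1, 0]})

# One condition per outcome, evaluated in order; first match wins
conditions = [
    (df["Predicted"] == df["Target"]) & (df["Predicted"] == 0),
    (df["Predicted"] == df["Target"]) & (df["Predicted"] == 1),
    (df["Predicted"] != df["Target"]) & (df["Predicted"] == 0),
    (df["Predicted"] != df["Target"]) & (df["Predicted"] == 1),
]
labels = ["true negative", "correct", "missed", "incorrect"]
df["Explanation"] = np.select(conditions, labels)
```

Because the four conditions are mutually exclusive and exhaustive, the result is identical to the chained `np.where` form.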
980
+
981
+ def calc_metrics_by_event_type(preds_event_df, calib_type):
+     """Calculates performance metrics by event type (hospital or community).
+
+     Args:
+         preds_event_df (pd.DataFrame): contains values required for calculating metrics.
+         calib_type (str): type of calibration performed.
+
+     Returns:
+         dict: contains performance metrics for both community and hospital events.
+
+     """
+     # The Threshold column holds a single repeated value; use positional indexing
+     # so this works regardless of how the dataframe is indexed after filtering.
+     best_threshold = preds_event_df["Threshold"].iloc[0]
+
+     preds_events_comm = preds_event_df[preds_event_df["HospExacWithin3Months"] == 0]
+     metrics_comm = calc_eval_metrics_for_model(
+         preds_events_comm["Target"],
+         preds_events_comm["Predicted"],
+         preds_events_comm["Probs"],
+         calib_type,
+         best_threshold=best_threshold,
+     )
+     metrics_comm = {f"{k}_comm": v for k, v in metrics_comm.items()}
+
+     preds_events_hosp = preds_event_df[preds_event_df["CommExacWithin3Months"] == 0]
+     metrics_hosp = calc_eval_metrics_for_model(
+         preds_events_hosp["Target"],
+         preds_events_hosp["Predicted"],
+         preds_events_hosp["Probs"],
+         calib_type,
+         best_threshold=best_threshold,
+     )
+     metrics_hosp = {f"{k}_hosp": v for k, v in metrics_hosp.items()}
+     metrics_by_event_type = metrics_comm.copy()
+     metrics_by_event_type.update(metrics_hosp)
+     return metrics_by_event_type
1014
+
1015
+
1016
+ def plot_roc_curve_by_event_type(preds_event_df, model_name, calib_type):
1017
+ """Plots ROC curve for multiple event types.
1018
+
1019
+ Args:
1020
+ preds_event_df (pd.Dataframe): contains values required for calculating metrics.
1021
+ model_name (str): name of model.
1022
+ calib_type (str): type of calibration.
1023
+
1024
+ Returns:
1025
+ None.
1026
+ """
1027
+ os.makedirs("./tmp/performance", exist_ok=True)
1028
+ fpr, tpr, _ = roc_curve(preds_event_df["Target"], preds_event_df["Probs"])
1029
+ auc_roc = roc_auc_score(preds_event_df["Target"], preds_event_df["Probs"])
1030
+ plt.plot(fpr, tpr, label="Hosp+Comm exacs" + "\n AUC=" + str(round(auc_roc, 2)))
1031
+ mapping_dict = {
1032
+ "HospExacWithin3Months": "Hosp exacs",
1033
+ "CommExacWithin3Months": "Comm exacs",
1034
+ }
1035
+ for key in mapping_dict:
1036
+ preds_events_df_subset = preds_event_df[
1037
+ (preds_event_df[key] == 1) | (preds_event_df["ExacWithin3Months"] == 0)
1038
+ ]
1039
+ fpr, tpr, _ = roc_curve(
1040
+ preds_events_df_subset["Target"], preds_events_df_subset["Probs"]
1041
+ )
1042
+ auc_roc = roc_auc_score(
1043
+ preds_events_df_subset["Target"], preds_events_df_subset["Probs"]
1044
+ )
1045
+ plt.plot(fpr, tpr, label=mapping_dict[key] + "\n AUC=" + str(round(auc_roc, 2)))
1046
+ plt.plot([0, 1], [0, 1], linestyle="--", color="black")
1047
+ plt.title("ROC-AUC Curve")
1048
+ plt.ylabel("True Positive Rate")
1049
+ plt.xlabel("False Positive Rate")
1050
+ plt.legend(loc="lower right")
1051
+ plt.savefig("./tmp/performance/" + model_name + "_roc_curve_" + calib_type + ".png")
1052
+ plt.close()
1053
+
1054
+
1055
+ def plot_prec_recall_by_event_type(preds_event_df, model_name, calib_type):
1056
+ """Plots a precision recall curve for multiple event types.
1057
+
1058
+ Args:
1059
+ preds_event_df (pd.Dataframe): contains values required for calculating metrics.
1060
+ model_name (str): name of model.
1061
+ calib_type (str): type of calibration.
1062
+
1063
+ Returns:
1064
+ None.
1065
+ """
1066
+ os.makedirs("./tmp/performance", exist_ok=True)
1067
+ precision, recall, thresholds = precision_recall_curve(
1068
+ preds_event_df["Target"], preds_event_df["Probs"]
1069
+ )
1070
+ auc_pr = auc(recall, precision)
1071
+ plt.plot(
1072
+ recall,
1073
+ precision,
1074
+ label="Hosp+Comm exacs" + "\n AUC-PR=" + str(round(auc_pr, 2)),
1075
+ )
1076
+ mapping_dict = {
1077
+ "HospExacWithin3Months": "Hosp exacs",
1078
+ "CommExacWithin3Months": "Comm exacs",
1079
+ }
1080
+ for key in mapping_dict:
1081
+ preds_event_df_subset = preds_event_df[
1082
+ (preds_event_df[key] == 1) | (preds_event_df["ExacWithin3Months"] == 0)
1083
+ ]
1084
+ precision, recall, thresholds = precision_recall_curve(
1085
+ preds_event_df_subset["Target"], preds_event_df_subset["Probs"]
1086
+ )
1087
+ auc_pr = auc(recall, precision)
1088
+ plt.plot(
1089
+ recall,
1090
+ precision,
1091
+ label=mapping_dict[key] + "\n AUC-PR=" + str(round(auc_pr, 2)),
1092
+ )
1093
+ plt.title("Precision-Recall Curve")
1094
+ plt.ylabel("Precision")
1095
+ plt.xlabel("Recall")
1096
+ plt.legend(loc="upper right")
1097
+ plt.savefig("./tmp/performance/" + model_name + "_pr_curve_" + calib_type + ".png")
1098
+ plt.close()
1099
+
1100
+
1101
+ def plot_cm_by_event_type(preds_event_df, model_name, file_suffix, calib_type):
1102
+ """Plots confusion matrices by event type (hospital or community).
1103
+
1104
+ Args:
1105
+ preds_event_df (pd.Dataframe): contains values required for plotting confusion matrices.
1106
+ model_name (str): name of model.
1107
+ file_suffix (str): type of model run.
1108
+ calib_type (str): type of calibration performed.
1109
+
1110
+ Returns:
1111
+ None.
1112
+
1113
+ """
1114
+ thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, preds_event_df["Threshold"][0]]
1115
+ # Community events
1116
+ preds_events_comm = preds_event_df[preds_event_df["HospExacWithin3Months"] == 0]
1117
+ plot_confusion_matrix(
1118
+ thresholds,
1119
+ preds_events_comm["Probs"],
1120
+ preds_events_comm["Target"],
1121
+ model_name,
1122
+ file_suffix,
1123
+ calib_type,
1124
+ split_by_event=True,
1125
+ event_type="comm",
1126
+ )
1127
+
1128
+ # Hospital events
1129
+ preds_events_hosp = preds_event_df[preds_event_df["CommExacWithin3Months"] == 0]
1130
+ plot_confusion_matrix(
1131
+ thresholds,
1132
+ preds_events_hosp["Probs"],
1133
+ preds_events_hosp["Target"],
1134
+ model_name,
1135
+ file_suffix,
1136
+ calib_type,
1137
+ split_by_event=True,
1138
+ event_type="hosp",
1139
+ )
1140
+
1141
+
1142
+ def reverse_scaling(data, imputation, file_suffix):
1143
+ """Reverse scaling performed.
1144
+
1145
+ Args:
1146
+ data (pd.Dataframe): data after scaling performed.
1147
+ imputation (str): to identify whether imputation was performed for the model.
1148
+ file_suffix (str): type of model being run.
1149
+
1150
+ Returns:
1151
+ pd.Dataframe: input dataframe after scaling has been reversed.
1152
+
1153
+ """
1154
+ if imputation == "imputed":
1155
+ scaler = joblib.load("./data/artifacts/scaler_imp" + file_suffix + ".pkl")
1156
+ else:
1157
+ scaler = joblib.load("./data/artifacts/scaler_no_imp" + file_suffix + ".pkl")
1158
+ data_scaling_reversed = scaler.inverse_transform(data)
1159
+ data_scaling_reversed = pd.DataFrame(
1160
+ data=data_scaling_reversed, columns=data.columns.tolist()
1161
+ )
1162
+ return data_scaling_reversed
1163
+
1164
+
1165
+ def convert_target_encodings_into_groups(target_enc_path, test_features):
1166
+ """Convert target encodings back into original category names.
1167
+
1168
+ Target encodings are converted back to their original category names to simplify
1169
+ interpretation using SHAP.
1170
+
1171
+ Args:
1172
+ target_enc_path (str): path where the target encodings are saved.
1173
+ test_features (pd.Dataframe): test values after scaling reversed.
1174
+
1175
+ Returns:
1176
+ pd.Dataframe: dataframe after converting target encodings into categories.
1177
+
1178
+ """
1179
+ target_encodings = json.load(open(target_enc_path))
1180
+ # Add _te suffix as all columns that have been target encoded have the suffix.
1181
+ target_encodings = {key + "_te": val for key, val in target_encodings.items()}
1182
+
1183
+ # Invert the target encodings so that target encoded value maps to the categorical value.
1184
+ target_encodings_inv = {}
1185
+ for col_name in target_encodings:
1186
+ for category in target_encodings[col_name]:
1187
+ target_encodings[col_name][category] = round(
1188
+ target_encodings[col_name][category], 6
1189
+ )
1190
+ target_encodings_inv[col_name] = dict(
1191
+ (v, k) for k, v in target_encodings[col_name].items()
1192
+ )
1193
+
1194
+ # Round categorical column values to 6 decimal places as in the target encoding dict.
1195
+ test_features_enc_conv = test_features.copy()
1196
+ target_encoded_cols = test_features_enc_conv.columns[
1197
+ test_features_enc_conv.columns.str.endswith("_te")
1198
+ ]
1199
+ for col in target_encoded_cols:
1200
+ test_features_enc_conv[col] = round(test_features_enc_conv[col], 6)
1201
+
1202
+ # Replace the target encoding with the corresponding categorical value.
1203
+ test_features_enc_conv = test_features_enc_conv.replace(target_encodings_inv)
1204
+ return test_features_enc_conv
1205
+
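Rounding both sides to six decimal places before inverting, as the function above does, is what lets the float produced by the encoder round-trip back to its category name. A small self-contained example (the `smoking_status_te` column and its values are hypothetical):

```python
import pandas as pd

# Hypothetical target encodings: category -> mean target value
target_encodings = {"smoking_status_te": {"current": 0.62, "never": 0.21}}

# Invert so each (rounded) encoded value maps back to its category name
target_encodings_inv = {
    col: {round(v, 6): k for k, v in mapping.items()}
    for col, mapping in target_encodings.items()
}

df = pd.DataFrame({"smoking_status_te": [0.62, 0.21, 0.62]})
df["smoking_status_te"] = df["smoking_status_te"].round(6)

# pd.DataFrame.replace accepts a nested {column: {old: new}} dict
decoded = df.replace(target_encodings_inv)
```

Note the same rounding must be applied on both sides: comparing unrounded floats against the rounded dictionary keys would silently leave values unreplaced.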
1206
+
1207
+ def plot_score_distribution(
1208
+ target, probs, artifact_dir, model_name, file_suffix, calibration_type="uncal"
1209
+ ):
1210
+ """Plots distribution of probabilties for each class.
1211
+
1212
+ Args:
1213
+ target (array): true values for target.
1214
+ probs (array): probability estimates for positive class.
1215
+ artifact_dir (str): output directory where plot is saved.
1216
+ model_name (str): name of model.
1217
+ file_suffix (str): type of model run.
1218
+ calibration_type (str): type of calibration performed. Defaults to 'uncal'.
1219
+
1220
+ Returns:
1221
+ None.
1222
+
1223
+ """
1224
+ # Create folders to contain score distributions
1225
+ os.makedirs(os.path.join(artifact_dir, "score_distribution"), exist_ok=True)
1226
+
1227
+ # Plot score distribution
1228
+ model_scores = pd.DataFrame({"model_score": probs, "true_label": target})
1229
+ sns.displot(model_scores, x="model_score", hue="true_label", kde=True, bins=20)
1230
+ plt.savefig(
1231
+ os.path.join(
1232
+ artifact_dir,
1233
+ "score_distribution",
1234
+ model_name + "_score_dist_" + calibration_type + "_" + file_suffix + ".png",
1235
+ )
1236
+ )
1237
+ plt.close()
1238
+
1239
+
+ ##############################################################
+ # Functions for model calibration
+ ##############################################################
+
1244
+
1245
+ def plot_calibration_curve(target, probs, bins, strategy, calib_type, ax=None):
1246
+ """Plot calibration curve.
1247
+
1248
+ Args:
1249
+ target (array): true target values.
1250
+ probs (array): probability estimates for the positive class.
1251
+ bins (int): number of bins for plotting calibration curve.
1252
+ strategy (str): strategy used to define the widths of the bin. Possible options are
1253
+ 'uniform' or 'quantile'.
1254
+ calib_type (str): type of calibration performed.
1255
+ ax (None or matplotlib axis): allows plotting of multiple calibration curves on one
1256
+ plot. Defaults to None.
1257
+
1258
+ Returns:
1259
+ None.
1260
+
1261
+ """
1262
+ prob_true, prob_pred = calibration_curve(
1263
+ target, probs, n_bins=bins, strategy=strategy
1264
+ )
1265
+ if ax is None:
1266
+ plt.plot(prob_pred, prob_true, marker=".", label=calib_type)
1267
+ else:
1268
+ ax.plot(prob_pred, prob_true, marker=".", label=calib_type)
1269
+
1270
+
1271
+ def equalObs(x, nbin):
1272
+ """Function to calculate equal frequency bins.
1273
+
1274
+ Args:
1275
+ x (array): data to be divided into bins.
1276
+ nbin (int): number of bins required.
1277
+
1278
+ Returns:
1279
+ array: bin edges that give equal frequency bins.
1280
+ """
1281
+ nlen = len(x)
1282
+ bin_edges = np.interp(np.linspace(0, nlen, nbin + 1), np.arange(nlen), np.sort(x))
1283
+ return bin_edges
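`equalObs` builds bin edges by interpolating into the sorted data, so each bin receives (approximately) the same number of observations rather than equal widths. Its behaviour is easiest to see on uniform data; this sketch repeats the same interpolation under a hypothetical name:

```python
import numpy as np

def equal_obs_edges(x, nbin):
    """Bin edges giving (approximately) equal-frequency bins, as in equalObs."""
    nlen = len(x)
    # Interpolate the sorted data at evenly spaced rank positions
    return np.interp(np.linspace(0, nlen, nbin + 1), np.arange(nlen), np.sort(x))

x = np.arange(100)
edges = equal_obs_edges(x, 5)
counts, _ = np.histogram(x, bins=edges)
```

For 100 uniformly spaced values split into 5 bins, each bin ends up holding 20 observations, which is what makes these edges suitable for the reliability diagrams plotted below.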
1284
+
1285
+
1286
+ def plot_calibration_plot_with_error_bars(
1287
+ probs_uncal,
1288
+ probs_sig,
1289
+ probs_iso,
1290
+ probs_spline,
1291
+ train_target,
1292
+ test_target,
1293
+ model_name,
1294
+ artifact_dir="./tmp",
1295
+ ):
1296
+ """Plots calibration plots with 95% confidence interval bars.
1297
+
1298
+ Args:
1299
+ probs_uncal (array): probability estimates for the positive class for the
1300
+ uncalibrated model.
1301
+ probs_sig (array): probability estimates for the positive class for the sigmoid model.
1302
+ probs_iso (array): probability estimates for the positive class for the isotonic model.
1303
+ probs_spline (array): probability estimates for the positive class for the spline
1304
+ model.
1305
+ train_target (array): true target values for the train set.
1306
+ test_target (array): true target values for the test set.
1307
+ model_name (str): name of the model.
1308
+ artifact_dir (str, optional): path of output directory. Defaults to "./tmp".
1309
+ Returns:
1310
+ None.
1311
+ """
1312
+ # Create histogram with equal-frequency bins
1313
+ n, bins_uncal, patches = plt.hist(
1314
+ probs_uncal, equalObs(probs_uncal, 5), edgecolor="black"
1315
+ )
1316
+ n, bins_sig, patches = plt.hist(
1317
+ probs_sig, equalObs(probs_sig, 5), edgecolor="black"
1318
+ )
1319
+ n, bins_iso, patches = plt.hist(
1320
+ probs_iso, equalObs(probs_iso, 5), edgecolor="black"
1321
+ )
1322
+ n, bins_spline, patches = plt.hist(
1323
+ probs_spline, equalObs(probs_spline, 5), edgecolor="black"
1324
+ )
1325
+
1326
+ # Plot calibration plots
1327
+ fig = plt.figure(figsize=(15, 15))
1328
+ plt.subplot(2, 2, 1)
1329
+ mli.plot_reliability_diagram(
1330
+ train_target,
1331
+ probs_uncal,
1332
+ bins=bins_uncal[:-1],
1333
+ ci_ref="point",
1334
+ reliability_title="Uncalibrated",
1335
+ )
1336
+ plt.subplot(2, 2, 2)
1337
+ mli.plot_reliability_diagram(
1338
+ test_target,
1339
+ probs_sig,
1340
+ bins=bins_sig[:-1],
1341
+ ci_ref="point",
1342
+ reliability_title="Sigmoid",
1343
+ )
1344
+ plt.subplot(2, 2, 3)
1345
+ mli.plot_reliability_diagram(
1346
+ test_target,
1347
+ probs_iso,
1348
+ bins=bins_iso[:-1],
1349
+ ci_ref="point",
1350
+ reliability_title="Isotonic",
1351
+ )
1352
+ plt.subplot(2, 2, 4)
1353
+ mli.plot_reliability_diagram(
1354
+ test_target,
1355
+ probs_spline,
1356
+ bins=bins_spline[:-1],
1357
+ ci_ref="point",
1358
+ reliability_title="Spline",
1359
+ )
1360
+ plt.savefig(
1361
+ os.path.join(artifact_dir, model_name + "calibration_plots_error_bars.png")
1362
+ )
1363
+ plt.close()
1364
+
1365
+
1366
+ def calc_std_for_calibrated_classifiers(
+     calib_model, calib_model_name, test_features, test_target
+ ):
+     """Prints the mean and standard deviation of AUC-PR across the cross-validated
+     classifiers inside a fitted CalibratedClassifierCV.
+
+     Args:
+         calib_model (CalibratedClassifierCV): fitted calibrated model.
+         calib_model_name (str): name of the calibrated model.
+         test_features (pd.DataFrame): test set features.
+         test_target (array): true target values for the test set.
+
+     Returns:
+         None.
+
+     """
+     auc_prs = []
+     for clf in calib_model.calibrated_classifiers_:
+         probs_calib = clf.predict_proba(test_features)[:, 1]
1372
+
1373
+ precision_, recall_, thresholds_ = precision_recall_curve(
1374
+ test_target, probs_calib
1375
+ )
1376
+ auc_pr = auc(recall_, precision_)
1377
+
1378
+ auc_prs.append(auc_pr)
1379
+ print("Calibrated model:", calib_model_name)
1380
+ print("Mean AUC-PR:", np.mean(auc_prs))
1381
+ print("Std AUC-PR:", np.std(auc_prs))
1382
+
1383
+
+ ##############################################################
+ # Functions for model explainability
+ ##############################################################
+
1388
+
1389
+ def plot_feat_importance_model(model, model_name, file_suffix, feature_names=None):
1390
+ """Plotting model feature importances derived from model.
1391
+
1392
+ Plots feature importance metrics derived from models. Only performed for XGBoost and
1393
+ Light GBM models currently. For the XGBoost model, total cover and total gain plotted.
1394
+ For the Light GBM model, total gain is plotted. Function also returns a table with a
1395
+ number that is representative of the feature importance.
1396
+
1397
+ Args:
1398
+ model (variable): model that has been fit on train data.
1399
+ model_name (str): name of model.
1400
+ file_suffix (str): contains information on type of model being run.
1401
+ feature_names (list): list of feature names.
1402
+
1403
+ Returns:
1404
+ pd.DataFrame: df containing feature importance position.
1405
+
1406
+ """
1407
+ if model_name.startswith("xgb"):
1408
+ total_gain = model.get_booster().get_score(importance_type="total_gain")
1409
+ total_cover = model.get_booster().get_score(importance_type="total_cover")
1410
+ total_gain = pd.DataFrame.from_dict(
1411
+ total_gain, orient="index", columns=["total_gain"]
1412
+ )
1413
+ total_cover = pd.DataFrame.from_dict(
1414
+ total_cover, orient="index", columns=["total_cover"]
1415
+ )
1416
+ feat_importance = total_gain.join(total_cover)
1417
+ if model_name.startswith("lgbm"):
1418
+ total_gain = model.booster_.feature_importance(importance_type="gain")
1419
+ feat_importance = dict(zip(feature_names, total_gain))
1420
+ feat_importance = pd.DataFrame.from_dict(
1421
+ feat_importance, orient="index", columns=["total_gain"]
1422
+ )
1423
+ feat_importance = feat_importance.sort_values(by="total_gain", ascending=False)
1424
+ print("Total gain and total cover\n", feat_importance)
1425
+ feat_importance.plot.barh(figsize=(10, 10))
1426
+ plt.tight_layout()
1427
+ plt.savefig(
1428
+ "./tmp/" + model_name + "_feat_importance_model_" + file_suffix + ".png"
1429
+ )
1430
+ plt.close()
1431
+
1432
+ # Create feature importance table
1433
+ feat_importance[model_name] = range(1, 1 + len(feat_importance))
1434
+ feat_importance = feat_importance.drop(
1435
+ columns=["total_gain", "total_cover"], errors="ignore"
1436
+ )
1437
+ feat_importance = feat_importance.reset_index()
1438
+ if model_name == "xgb":
1439
+ feat_importance_tot_gain_df = feat_importance
1440
+ else:
1441
+ feat_importance_tot_gain_df = pd.read_csv(
1442
+ "./data/feature_importance_tot_gain_" + file_suffix + ".csv"
1443
+ )
1444
+ feat_importance_tot_gain_df = feat_importance_tot_gain_df.merge(
1445
+ feat_importance, on="index", how="left"
1446
+ )
1447
+ return feat_importance_tot_gain_df
1448
+
1449
+
1450
+ def get_shap_feat_importance(model_name, shap_values, feature_names, file_suffix):
1451
+ """Creates a table with feature importance hierarchy using shap values.
1452
+
1453
+ A table containing feature importance position created for all models except for the
1454
+ dummy classifier.
1455
+
1456
+ Args:
1457
+ model_name (str): name of model.
1458
+ shap_values (array): contains shap values.
1459
+ feature_names (list): list of feature names.
1460
+ file_suffix (str): contains information on type of model being run.
1461
+
1462
+ Returns:
1463
+ pd.DataFrame: df containing feature importance position.
1464
+
1465
+ """
1466
+ if model_name != "dummy_classifier":
1467
+ shap_vals_df = pd.DataFrame(shap_values, columns=feature_names)
1468
+ vals = np.abs(shap_vals_df.values).mean(0)
1469
+ shap_importance = pd.DataFrame(
1470
+ list(zip(feature_names, vals)),
1471
+ columns=["col_name", "feature_importance_vals"],
1472
+ )
1473
+ shap_importance = shap_importance.sort_values(
1474
+ by=["feature_importance_vals"], ascending=False
1475
+ )
1476
+ shap_importance[model_name] = range(1, 1 + len(shap_importance))
1477
+ shap_importance = shap_importance.drop(columns=["feature_importance_vals"])
1478
+ shap_importance = shap_importance.reset_index(drop=True)
1479
+ if model_name == "logistic_regression":
1480
+ feat_importance_df = shap_importance
1481
+ else:
1482
+ feat_importance_df = pd.read_csv(
1483
+ "./data/feature_importance_shap" + file_suffix + ".csv"
1484
+ )
1485
+ feat_importance_df = feat_importance_df.merge(
1486
+ shap_importance, on="col_name", how="left"
1487
+ )
1488
+     else:
+         # No SHAP-based importances are produced for the dummy classifier
+         feat_importance_df = None
+     return feat_importance_df
1491
+
1492
+
1493
+ def get_local_shap_values(
+     model_name, file_suffix, shap_values, test_data, calib_name, shap_ids_dir
+ ):
+     """Ranks features by local SHAP value for every row in the test set.
+
+     Args:
+         model_name (str): name of model.
+         file_suffix (str): type of model run.
+         shap_values (array): matrix of SHAP values (# samples x # features).
+         test_data (pd.DataFrame): test set features, aligned with shap_values.
+         calib_name (str): type of calibration performed.
+         shap_ids_dir (str): directory where data used for SHAP plots is stored.
+
+     Returns:
+         pd.DataFrame: one column per row id, listing features in descending order
+             of local SHAP value.
+
+     """
1496
+ # Read df containing probabilities and predictions
1497
+ probs_target = pd.read_csv(
1498
+ os.path.join(
1499
+ shap_ids_dir,
1500
+ "preds_and_events_"
1501
+ + calib_name
1502
+ + "_"
1503
+ + model_name
1504
+ + "_"
1505
+ + file_suffix
1506
+ + ".csv",
1507
+ )
1508
+ )
1509
+
1510
+ explanation = shap.Explanation(shap_values, data=test_data, display_data=True)
1511
+
1512
+ feature_imp_all = []
1513
+ for row_id in range(0, len(probs_target)):
1514
+         feature_names = explanation[row_id].data.index
+         row_shap_values = explanation[row_id].values
+         feature_importance = pd.DataFrame(data=row_shap_values, index=feature_names)
1517
+ feature_importance = feature_importance.sort_values(
1518
+ by=0, ascending=False
1519
+ ).reset_index()
1520
+ feature_importance = feature_importance["index"].rename(row_id)
1521
+ feature_imp_all.append(feature_importance)
1522
+ feature_imp_all = pd.concat(feature_imp_all, axis=1)
1523
+ return feature_imp_all
1524
+
1525
+
1526
+ def plot_local_shap(
+     model_name,
+     file_suffix,
+     shap_values,
+     test_data,
+     train_data,
+     calib_name,
+     row_ids_to_plot,
+     artifact_dir,
+     shap_ids_dir,
+     reverse_scaling_flag=False,
+     convert_target_encodings=False,
+     imputation=None,
+     target_enc_path=None,
+     return_enc_converted_df=False,
+ ):
+     """Plots local SHAP plots for specified row ids.
+ 
+     Local SHAP plots are plotted for the specified row ids. The bars are colored
+     according to their values: if the value is higher than the median, the bar is
+     colored red; if it is lower than the median, the bar is colored blue. A text box
+     is added with patient details and the prediction.
+ 
+     Args:
+         model_name (str): name of model.
+         file_suffix (str): type of model.
+         test_data (pd.DataFrame): dataframe containing feature names and values. Must
+             be in the same shape as the shap_values.
+         train_data (pd.DataFrame): dataframe containing train feature names and values.
+             Needed for calculating the median for custom coloring of bars.
+         shap_values (array): matrix of SHAP values (# samples x # features).
+         calib_name (str): type of calibration performed.
+         row_ids_to_plot (list): list of ids for which to plot the local SHAP plot.
+         artifact_dir (str): output directory where plot is saved.
+         shap_ids_dir (str): directory where data used for SHAP plots is stored.
+         reverse_scaling_flag (bool, optional): flag to identify whether scaling should
+             be reversed. Defaults to False.
+         convert_target_encodings (bool, optional): flag to allow conversion of target
+             encodings to category groups. Defaults to False.
+         imputation (None or str, optional): whether imputation was performed.
+             Defaults to None.
+         target_enc_path (None or str, optional): path to target encodings.
+             Defaults to None.
+         return_enc_converted_df (bool, optional): flag to identify whether to return
+             the dataframe with converted target encodings. Defaults to False.
+ 
+     Returns:
+         pd.DataFrame: contains target encodings converted to categories. If
+             reverse_scaling is True, values in the df will be converted back to their
+             original values.
+ 
+     """
+     # Create folder to contain local plots
+     os.makedirs(os.path.join(artifact_dir, "shap_local_plots"), exist_ok=True)
+ 
+     # Read df containing probabilities and predictions
+     probs_target = pd.read_csv(
+         os.path.join(
+             shap_ids_dir,
+             "preds_and_events_"
+             + calib_name
+             + "_"
+             + model_name
+             + "_"
+             + file_suffix
+             + ".csv",
+         )
+     )
+ 
+     # Calculate median values for each feature
+     median_values = pd.DataFrame(
+         data=[np.median(test_data, axis=0)], columns=test_data.columns
+     )
+     for col_name in median_values.columns:
+         if col_name.endswith("te"):
+             median_values[col_name] = np.nan
+     median_values = median_values.T
+     median_values = median_values.rename(columns={0: "median"})
+ 
+     # TODO Add median for when reverse scaling flag is True
+ 
+     # Reverse scaling in order to convert target encodings into groups if flag is True
+     if convert_target_encodings is True:
+         # data_scaling_reversed = reverse_scaling(test_data, imputation, file_suffix)
+         data_enc_conv = convert_target_encodings_into_groups(target_enc_path, test_data)
+         # if reverse_scaling_flag is False:
+         #     data_enc_conv = data_enc_conv.loc[
+         #         :, data_enc_conv.columns.str.endswith("te")
+         #     ]
+         #     data_no_categorical = test_data.loc[
+         #         :, ~test_data.columns.str.endswith("te")
+         #     ]
+         #     data_enc_conv = data_no_categorical.merge(
+         #         data_enc_conv, left_index=True, right_index=True, how="left"
+         #     )
+         explanation = shap.Explanation(
+             shap_values, data=data_enc_conv, display_data=True
+         )
+     else:
+         explanation = shap.Explanation(
+             shap_values, data=test_data, display_data=True
+         )
+ 
+     for i in row_ids_to_plot:
+         if isinstance(i, str):
+             probs_target_to_plot = probs_target[probs_target["Explanation"] == i]
+             row_ids_to_plot_section = probs_target_to_plot.index
+             # Create folder to contain local plots for this explanation group
+             os.makedirs(
+                 os.path.join(artifact_dir, "shap_local_plots", i), exist_ok=True
+             )
+ 
+             # Plot local SHAP plots of selected ids
+             for row_id in row_ids_to_plot_section:
+                 fig, ax = plt.subplots()
+                 shap.plots.bar(
+                     explanation[row_id], show_data=True, show=False, max_display=None
+                 )
+ 
+                 # Create a text box with probability and threshold
+                 probability = round(probs_target.iloc[row_id]["Probs"], 3)
+                 threshold = round(probs_target.iloc[row_id]["Threshold"], 3)
+                 textstr = (
+                     "StudyId: "
+                     + probs_target.iloc[row_id]["StudyId"]
+                     + "\n"
+                     + "Probability: "
+                     + str(probability)
+                     + "\n"
+                     + "Threshold: "
+                     + str(threshold)
+                     + "\n"
+                     + "Ground truth: "
+                     + str(probs_target.iloc[row_id]["Target"])
+                     + "\n"
+                     + "Prediction: "
+                     + str(probs_target.iloc[row_id]["Predicted"])
+                 )
+ 
+                 # Place a text box in upper left of figure
+                 props = dict(boxstyle="round", facecolor="wheat", alpha=0.5)
+                 ax.text(
+                     0.02,
+                     0.98,
+                     textstr,
+                     fontsize=12,
+                     transform=fig.transFigure,
+                     verticalalignment="top",
+                     bbox=props,
+                 )
+                 plt.title(probs_target.iloc[row_id]["Explanation"])
+ 
+                 # Get the current Axes object
+                 ax = plt.gca()
+ 
+                 # Get the input values for the patient
+                 data_id = test_data.iloc[row_id].T.to_frame()
+                 data_id = data_id.rename(columns={row_id: "values"})
+ 
+                 # Order the columns by importance and merge
+                 df_shap_values = pd.DataFrame(
+                     data=[shap_values[row_id]], columns=test_data.columns
+                 ).T
+                 df_shap_values = df_shap_values.abs()
+                 df_shap_values = df_shap_values.rename(columns={0: "shap_values"})
+                 df_feat_importance = df_shap_values.sort_values(
+                     by="shap_values", ascending=False
+                 )
+                 df_feat_importance = df_feat_importance.merge(
+                     median_values, right_index=True, left_index=True, how="left"
+                 )
+                 df_feat_importance = df_feat_importance.merge(
+                     data_id, right_index=True, left_index=True, how="left"
+                 )
+ 
+                 # Customise the colors of the bars based on whether the value is
+                 # higher/lower compared to the median
+                 colors = [
+                     (
+                         "lightgreen"
+                         if pd.isna(median)
+                         else "#ff0051" if value >= median else "#008bfb"
+                     )
+                     for value, median in zip(
+                         df_feat_importance["values"], df_feat_importance["median"]
+                     )
+                 ]
+                 for bar, color in zip(ax.patches, colors):
+                     bar.set_color(color)
+ 
+                 # Customize the colors of the SHAP values displayed on the bars
+                 for text in ax.texts:
+                     text.set_color("black")
+ 
+                 # Save figure
+                 plt.tight_layout()
+                 plt.savefig(
+                     os.path.join(
+                         artifact_dir,
+                         "shap_local_plots",
+                         i,
+                         model_name
+                         + "_shap_local_id_"
+                         + str(row_id)
+                         + "_"
+                         + calib_name
+                         + "_"
+                         + file_suffix
+                         + ".png",
+                     ),
+                     bbox_inches="tight",
+                 )
+                 plt.close()
+     if return_enc_converted_df is True:
+         return data_enc_conv
+ 
+ 
+ def plot_averaged_summary_plot(
+     avg_shap_values_train, train_data, model_name, calib_type, file_suffix
+ ):
+     """Plots a summary SHAP plot using SHAP values averaged across cross-validation folds.
+ 
+     Args:
+         avg_shap_values_train (array): SHAP values averaged across CV folds.
+         train_data (pd.DataFrame): dataframe containing feature names and values.
+         model_name (str): name of model.
+         calib_type (str): type of calibration performed.
+         file_suffix (str): type of model.
+ 
+     Returns:
+         None.
+ 
+     """
+     # Create folder to contain summary plots
+     os.makedirs("./tmp/shap_summary_plots", exist_ok=True)
+ 
+     shap.summary_plot(
+         np.array(avg_shap_values_train), train_data, max_display=None, show=False
+     )
+     plt.savefig(
+         os.path.join(
+             "./tmp/shap_summary_plots/",
+             model_name + "_shap_" + calib_type + "_" + file_suffix + ".png",
+         )
+     )
+     plt.close()
+ 
+ 
+ def plot_shap_decision_plot(
+     explainer, shap_values, test_df, link, row_ids_to_plot, artifact_dir
+ ):
+     """Plots SHAP decision plots for specified row numbers.
+ 
+     Args:
+         explainer (object): explainer used for explaining model output.
+         shap_values (array): matrix of SHAP values (# samples x # features).
+         test_df (pd.DataFrame): test df which provides the values and feature names.
+         link (str): specifies transformation for the x-axis. Use "logit" to transform
+             log-odds into probabilities.
+         row_ids_to_plot (list): list of ids for which to plot the decision plot.
+         artifact_dir (str): output directory where plot is saved.
+ 
+     Returns:
+         None.
+ 
+     """
+     # Create folders to contain decision plots
+     os.makedirs(os.path.join(artifact_dir, "decision_plots"), exist_ok=True)
+ 
+     # Check that all specified rows are present in the CV fold. If the rows are not in
+     # the same CV fold, the decision plot is not plotted.
+     if all(x in test_df.index.tolist() for x in row_ids_to_plot):
+         # If expected value provided for both classes, get the value for class 1
+         try:
+             expected_val = explainer.expected_value[1]
+         except (TypeError, IndexError):
+             expected_val = explainer.expected_value
+     else:
+         raise IndexError("Not all specified rows are present in the same CV fold.")
+ 
+     # Create decision plot with all samples
+     shap_return = shap.decision_plot(
+         expected_val, shap_values, test_df, link=link, show=False, return_objects=True
+     )
+     plt.tight_layout()
+     plt.savefig(os.path.join(artifact_dir, "decision_plots", "cv_decision_plot.png"))
+     plt.close()
+ 
+     # Create decision plot for the specified row ids, keeping the same feature order
+     # as in the decision plot with all samples
+     for i in row_ids_to_plot:
+         shap.decision_plot(
+             expected_val,
+             shap_values[i],
+             test_df.iloc[i],
+             feature_order=shap_return.feature_idx,
+             link=link,
+             show=False,
+         )
+         plt.tight_layout()
+         plt.savefig(
+             os.path.join(
+                 artifact_dir, "decision_plots", "decision_plot_rownum" + str(i) + ".png"
+             )
+         )
+         plt.close()
+ 
+ 
+ def plot_shap_summary_plot_per_cv_fold(
+     shap_values_train, X_train, calib_type, fold_num, model_name, file_suffix
+ ):
+     """Plots a SHAP summary plot for each CV fold.
+ 
+     Args:
+         shap_values_train (array): SHAP values for the train set.
+         X_train (pd.DataFrame): contains values for the train set.
+         calib_type (str): type of calibration performed.
+         fold_num (str): CV fold number.
+         model_name (str): name of model.
+         file_suffix (str): type of model run.
+ 
+     Returns:
+         None.
+ 
+     """
+     # Create folders to contain SHAP summary plots for CV folds for each calibration type
+     os.makedirs("./tmp/shap_cv_folds_" + calib_type, exist_ok=True)
+ 
+     shap.summary_plot(
+         np.array(shap_values_train), X_train, max_display=None, show=False
+     )
+     plt.savefig(
+         os.path.join(
+             "./tmp/",
+             "shap_cv_folds_" + calib_type,
+             model_name
+             + "_shap_"
+             + calib_type
+             + "_cv_fold_"
+             + str(fold_num)
+             + "_"
+             + file_suffix
+             + ".png",
+         )
+     )
+     plt.close()
+ 
+ 
+ def get_calibrated_shap_by_classifier(
+     calib_model, x_test, x_train, features, calib_type, model_name, file_suffix
+ ):
+     """
+     Iterates over the base classifiers in the calibrated model and averages the SHAP
+     values for both the test and train sets.
+ 
+     Parameters
+     ----------
+     calib_model : calibrated sklearn model
+         Trained calibrated sklearn model.
+     x_test : pandas dataframe
+         Test features dataframe.
+     x_train : pandas dataframe
+         Training features dataframe.
+     features : list
+         Names of columns.
+     calib_type : str
+         Type of calibration.
+     model_name : str
+         Name of model.
+     file_suffix : str
+         Type of model run.
+ 
+     Returns
+     -------
+     shap_values_v : array
+         Test SHAP values.
+     shap_values_t : array
+         Train SHAP values.
+ 
+     """
+     # https://github.com/slundberg/shap/issues/899
+     shap_values_list_val = []
+     shap_values_list_train = []
+     base_list = []
+     fold_num = 0
+ 
+     for calibrated_classifier in calib_model.calibrated_classifiers_:
+         fold_num = fold_num + 1
+         explainer = shap.TreeExplainer(calibrated_classifier.estimator)
+         shap_values_val = explainer.shap_values(x_test)
+         shap_values_train = explainer.shap_values(x_train)
+         # Some explainers return per-class SHAP values; subset to class 1
+         if len(np.shape(shap_values_train)) == 3:
+             shap_values_train = shap_values_train[1]
+             shap_values_val = shap_values_val[1]
+         shap_values_list_val.append(shap_values_val)
+         shap_values_list_train.append(shap_values_train)
+         base_list.append(explainer.expected_value)
+         plot_shap_summary_plot_per_cv_fold(
+             shap_values_train, x_train, calib_type, fold_num, model_name, file_suffix
+         )
+     # Average the SHAP values across the calibration folds
+     shap_values_v = np.array(shap_values_list_val).sum(axis=0) / len(
+         shap_values_list_val
+     )
+     shap_values_t = np.array(shap_values_list_train).sum(axis=0) / len(
+         shap_values_list_train
+     )
+ 
+     # Global mean absolute SHAP value per feature (currently computed but not returned)
+     shap_values_global = pd.DataFrame(np.abs(shap_values_t), columns=features)
+     global_shap = shap_values_global.mean(axis=0)
+     global_shap_round = global_shap.round(3)
+     x = global_shap_round.to_json(double_precision=3)
+     x = x.replace("'", '"')
+     mean_shap = json.loads(x)
+     return shap_values_v, shap_values_t
+ 
+ 
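The per-fold averaging above reduces to an element-wise mean over the stacked per-fold SHAP matrices, followed by a mean of absolute values per feature for global importance. A minimal numpy sketch with made-up values (the 2-fold, 3-sample, 2-feature shapes are illustrative assumptions, not the project's real dimensions):

```python
import numpy as np

# Hypothetical per-fold SHAP matrices (2 folds, 3 samples, 2 features)
fold_shap = [
    np.array([[0.2, -0.1], [0.0, 0.3], [0.4, 0.1]]),
    np.array([[0.0, 0.1], [0.2, 0.1], [0.2, -0.1]]),
]

# Same computation as summing the list and dividing by its length
avg_shap = np.array(fold_shap).sum(axis=0) / len(fold_shap)

# Global importance: mean absolute SHAP value per feature
global_importance = np.abs(avg_shap).mean(axis=0)
```

Averaging raw (signed) values first and taking absolutes afterwards, as above, lets opposing per-fold attributions cancel; taking absolutes per fold first would instead measure average per-fold magnitude.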
+ def get_uncalibrated_shap(
+     uncal_model_estimators, x_test, x_train, features, model_name, file_suffix
+ ):
+     """
+     Iterates over the estimators of the uncalibrated model and averages the SHAP
+     values for both the test and train sets.
+ 
+     Parameters
+     ----------
+     uncal_model_estimators : list
+         Estimators of the trained uncalibrated model.
+     x_test : pandas dataframe
+         Test features dataframe.
+     x_train : pandas dataframe
+         Training features dataframe.
+     features : list
+         Names of columns.
+     model_name : str
+         Name of model.
+     file_suffix : str
+         Type of model run.
+ 
+     Returns
+     -------
+     shap_values_v : array
+         Test SHAP values.
+     shap_values_t : array
+         Train SHAP values.
+ 
+     """
+     # https://github.com/slundberg/shap/issues/899
+     shap_values_list_val = []
+     shap_values_list_train = []
+     base_list = []
+     fold_num = 0
+     for estimator in uncal_model_estimators:
+         explainer = shap.TreeExplainer(estimator)
+         shap_values_val = explainer.shap_values(x_test)
+         shap_values_train = explainer.shap_values(x_train)
+         # Some explainers return per-class SHAP values; subset to class 1
+         if len(np.shape(shap_values_train)) == 3:
+             shap_values_train = shap_values_train[1]
+             shap_values_val = shap_values_val[1]
+         shap_values_list_val.append(shap_values_val)
+         shap_values_list_train.append(shap_values_train)
+         base_list.append(explainer.expected_value)
+         fold_num = fold_num + 1
+         plot_shap_summary_plot_per_cv_fold(
+             shap_values_train, x_train, "uncalib", fold_num, model_name, file_suffix
+         )
+     # Average the SHAP values across the CV estimators
+     shap_values_v = np.array(shap_values_list_val).sum(axis=0) / len(
+         shap_values_list_val
+     )
+     shap_values_t = np.array(shap_values_list_train).sum(axis=0) / len(
+         shap_values_list_train
+     )
+     # Global mean absolute SHAP value per feature (currently computed but not returned)
+     shap_values_global = pd.DataFrame(np.abs(shap_values_t), columns=features)
+     global_shap = shap_values_global.mean(axis=0)
+     global_shap_round = global_shap.round(3)
+     x = global_shap_round.to_json(double_precision=3)
+     x = x.replace("'", '"')
+     mean_shap = json.loads(x)
+     return shap_values_v, shap_values_t
+ 
+ 
+ def plot_shap_interaction_value_heatmap(
+     estimators, train_features, column_names, model_name, file_suffix
+ ):
+     """Calculates SHAP interaction values and plots them on a heatmap.
+ 
+     Args:
+         estimators (list): list of estimators used during cross validation.
+         train_features (array): train set values.
+         column_names (list): names of columns.
+         model_name (str): name of model.
+         file_suffix (str): name of model run.
+ 
+     Returns:
+         None.
+ 
+     """
+     os.makedirs("./tmp/interaction_plot", exist_ok=True)
+     fold_num = 0
+     for estimator in estimators:
+         fold_num = fold_num + 1
+         explainer = shap.TreeExplainer(estimator)
+         shap_interaction = explainer.shap_interaction_values(train_features)
+ 
+         # Some values come in the shape (#class, #samples, #features, #features).
+         # Subset these cases to class 1.
+         if len(np.shape(shap_interaction)) == 4:
+             shap_interaction = shap_interaction[1]
+ 
+         # Plot heatmap, doubling the off-diagonal values
+         mean_shap = np.abs(shap_interaction).mean(0)
+         df = pd.DataFrame(mean_shap, index=column_names, columns=column_names)
+         df.where(df.values == np.diagonal(df), df.values * 2, inplace=True)
+         fig = plt.figure(figsize=(35, 20), facecolor="#002637", edgecolor="r")
+         ax = fig.add_subplot()
+         sns.heatmap(
+             df.round(decimals=3),
+             cmap="coolwarm",
+             annot=True,
+             fmt=".6g",
+             cbar=False,
+             ax=ax,
+         )
+         ax.tick_params(axis="x", colors="w", labelsize=15, rotation=90)
+         ax.tick_params(axis="y", colors="w", labelsize=15)
+         plt.suptitle("SHAP interaction values", color="white", fontsize=60, y=0.97)
+         plt.yticks(rotation=0)
+         plt.savefig(
+             "./tmp/interaction_plot/shap_interaction_heatmap_cv_"
+             + str(fold_num)
+             + "_"
+             + model_name
+             + "_"
+             + file_suffix
+             + ".png"
+         )
+         plt.close()
training/perform_forward_validation.py ADDED
@@ -0,0 +1,296 @@
+ import numpy as np
+ import pandas as pd
+ import pickle
+ import model_h
+ import mlflow
+ import os
+ import shutil
+ import matplotlib.pyplot as plt
+ import sys
+ import scipy.stats
+ import yaml
+ 
+ with open("./training/config.yaml", "r") as config_file:
+     config = yaml.safe_load(config_file)
+ 
+ 
+ def perform_ks_test(train_data, forward_val_data):
+     """Performs a two-sample Kolmogorov-Smirnov test for each feature.
+ 
+     Args:
+         train_data (pd.DataFrame): data used to train the model.
+         forward_val_data (pd.DataFrame): data used for the forward validation.
+ 
+     Returns:
+         pd.DataFrame: dataframe containing the results of the K-S test.
+     """
+     rows = []
+     for feature_name in train_data.columns.tolist():
+         statistic, pvalue = scipy.stats.ks_2samp(
+             train_data[feature_name], forward_val_data[feature_name]
+         )
+         rows.append(
+             {
+                 "FeatureName": feature_name,
+                 "KS_PValue": round(pvalue, 4),
+                 "KS_TestStatistic": statistic,
+             }
+         )
+     df_ks = pd.DataFrame(rows)
+     df_ks["KS_DistributionsIdentical"] = np.where(df_ks["KS_PValue"] < 0.05, 0, 1)
+     return df_ks
+ 
+ 
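perform_ks_test flags drift for a feature when the two-sample K-S p-value drops below 0.05. A small self-contained sketch with simulated data (the unit shift and sample count are arbitrary assumptions for illustration):

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
train_col = rng.normal(loc=0.0, scale=1.0, size=500)    # training-era distribution
forward_col = rng.normal(loc=1.0, scale=1.0, size=500)  # simulated drifted feature

statistic, pvalue = scipy.stats.ks_2samp(train_col, forward_col)
# Same encoding as df_ks: 0 = distributions differ, 1 = no evidence of drift
distributions_identical = 0 if pvalue < 0.05 else 1
```

With a full standard-deviation shift and 500 samples per side, the test rejects decisively, so the feature would be flagged as drifted.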
+ def compute_wasserstein_distance(train_data, forward_val_data):
+     """Calculates the Wasserstein distance for each feature.
+ 
+     Args:
+         train_data (pd.DataFrame): data used to train the model.
+         forward_val_data (pd.DataFrame): data used for the forward validation.
+ 
+     Returns:
+         pd.DataFrame: dataframe containing the Wasserstein distance results.
+     """
+     rows = []
+     for feature_name in train_data.columns.tolist():
+         w_distance = scipy.stats.wasserstein_distance(
+             train_data[feature_name], forward_val_data[feature_name]
+         )
+         rows.append({"FeatureName": feature_name, "WassersteinDistance": w_distance})
+     df_wd = pd.DataFrame(rows)
+     df_wd = df_wd.sort_values(by="WassersteinDistance", ascending=True)
+     return df_wd
+ 
+ 
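Unlike the K-S p-value, the Wasserstein distance is expressed in the units of the feature itself, which is why the script sorts features by it. For instance, identical samples give zero distance, and shifting every value by a constant shifts the distance by exactly that constant:

```python
import scipy.stats

# Identical empirical distributions: distance is zero
d_same = scipy.stats.wasserstein_distance([0, 1, 2], [0, 1, 2])

# Shifting every value by 1 moves the distance to exactly 1
d_shift = scipy.stats.wasserstein_distance([0, 1, 2], [1, 2, 3])
```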
+ ##############################################################
+ # Load data
+ ##############################################################
+ model_type = config["model_settings"]["model_type"]
+ 
+ # Setup log file
+ log = open(
+     os.path.join(
+         config["outputs"]["logging_dir"], "run_forward_val_" + model_type + ".log"
+     ),
+     "w",
+ )
+ sys.stdout = log
+ 
+ # Load test data
+ forward_val_data_imputed = pd.read_pickle(
+     os.path.join(
+         config["outputs"]["model_input_data_dir"],
+         "forward_val_imputed_{}.pkl".format(model_type),
+     )
+ )
+ forward_val_data_not_imputed = pd.read_pickle(
+     os.path.join(
+         config["outputs"]["model_input_data_dir"],
+         "forward_val_not_imputed_{}.pkl".format(model_type),
+     )
+ )
+ 
+ # Load exac event type data
+ # test_exac_data = pd.read_pickle("./data/forward_val_exac_data.pkl")
+ 
+ # Load data the model was trained on
+ train_data = model_h.load_data_for_modelling(
+     os.path.join(
+         config["outputs"]["model_input_data_dir"],
+         "crossval_imputed_{}.pkl".format(model_type),
+     )
+ )
+ 
+ ##############################################################
+ # Check for data drift
+ ##############################################################
+ train_data_for_data_drift = train_data.drop(columns=["StudyId", "IndexDate"])
+ forward_val_data_for_data_drift = forward_val_data_imputed.drop(
+     columns=["StudyId", "IndexDate"]
+ )
+ 
+ df_ks = perform_ks_test(train_data_for_data_drift, forward_val_data_for_data_drift)
+ df_wd = compute_wasserstein_distance(
+     train_data_for_data_drift, forward_val_data_for_data_drift
+ )
+ df_data_drift = df_wd.merge(df_ks, on="FeatureName", how="left")
+ print(df_data_drift)
+ 
+ ##############################################################
+ # Prepare data for running model
+ ##############################################################
+ # Value counts for hospital and community exacerbations
+ print(forward_val_data_imputed["ExacWithin3Months"].value_counts())
+ print(
+     forward_val_data_imputed[forward_val_data_imputed["ExacWithin3Months"] == 1][
+         "HospExacWithin3Months"
+     ].value_counts()
+ )
+ print(
+     forward_val_data_imputed[forward_val_data_imputed["ExacWithin3Months"] == 1][
+         "CommExacWithin3Months"
+     ].value_counts()
+ )
+ 
+ # Separate features and target
+ forward_val_features_imp = forward_val_data_imputed.drop(
+     columns=[
+         "StudyId",
+         "IndexDate",
+         "ExacWithin3Months",
+         "HospExacWithin3Months",
+         "CommExacWithin3Months",
+     ]
+ )
+ forward_val_target_imp = forward_val_data_imputed["ExacWithin3Months"]
+ forward_val_features_no_imp = forward_val_data_not_imputed.drop(
+     columns=[
+         "StudyId",
+         "IndexDate",
+         "ExacWithin3Months",
+         "HospExacWithin3Months",
+         "CommExacWithin3Months",
+     ]
+ )
+ forward_val_target_no_imp = forward_val_data_not_imputed["ExacWithin3Months"]
+ 
+ # Check that the targets in the imputed and non-imputed datasets are the same.
+ # If not, raise an error
+ if not forward_val_target_no_imp.equals(forward_val_target_imp):
+     raise ValueError(
+         "Target variable is not the same in imputed and non-imputed datasets in the test set."
+     )
+ test_target = forward_val_target_no_imp
+ 
+ # Make sure all features are numeric
+ for features in [forward_val_features_imp, forward_val_features_no_imp]:
+     for col in features:
+         features[col] = pd.to_numeric(features[col], errors="coerce")
+ 
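The coercion loop above relies on pandas turning any non-numeric entry into NaN rather than raising; a minimal example of that behaviour:

```python
import pandas as pd

s = pd.Series(["1.5", "not_a_number", None])
coerced = pd.to_numeric(s, errors="coerce")  # non-numeric entries become NaN
```

With `errors="coerce"`, the parseable string becomes a float while the unparseable string and the missing value both end up as NaN.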
+ # Make a list of models to perform forward validation on. Each tuple contains the
+ # model name, whether imputation was performed, and the threshold used in the
+ # original model
+ models = [
+     ("balanced_random_forest", "imputed", 0.27),
+     # ("xgb", "not_imputed", 0.44),
+     # ("random_forest", "imputed", 0.30),
+ ]
+ 
+ ##############################################################
+ # Run models
+ ##############################################################
+ mlflow.set_tracking_uri("sqlite:///mlruns.db")
+ mlflow.set_experiment("model_h_drop_1_hosp_comm")
+ 
+ with mlflow.start_run(run_name="sig_forward_val_models_10_2023"):
+     for model_info in models:
+         print(model_info[0])
+         with mlflow.start_run(run_name=model_info[0], nested=True):
+             # Remove existing directory contents to not mix files between different
+             # runs, then recreate the artifacts directory
+             shutil.rmtree(config["outputs"]["artifact_dir"], ignore_errors=True)
+             os.makedirs(config["outputs"]["artifact_dir"], exist_ok=True)
+ 
+             #### Load model ####
+             with open("./data/model/trained_iso_" + model_info[0] + "_pkl", "rb") as f:
+                 model = pickle.load(f)
+ 
+             # Select the correct data based on the model used
+             if model_info[1] == "imputed":
+                 test_features = forward_val_features_imp
+             else:
+                 test_features = forward_val_features_no_imp
+ 
+             #### Run model and get predictions for forward validation data ####
+             test_probs = model.predict_proba(test_features)[:, 1]
+             test_preds = model.predict(test_features)
+ 
+             #### Calculate metrics ####
+             metrics = model_h.calc_eval_metrics_for_model(
+                 test_target,
+                 test_preds,
+                 test_probs,
+                 "forward_val",
+                 best_threshold=model_info[2],
+             )
+ 
+             #### Plot confusion matrix ####
+             model_h.plot_confusion_matrix(
+                 [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, model_info[2]],
+                 test_probs,
+                 test_target,
+                 model_info[0],
+                 model_type,
+                 "forward_val",
+             )
+ 
+             #### Plot calibration curves ####
+             for bins in [6, 10]:
+                 plt.figure(figsize=(8, 8))
+                 plt.plot([0, 1], [0, 1], linestyle="--")
+                 model_h.plot_calibration_curve(
+                     test_target, test_probs, bins, "quantile", "Forward Validation"
+                 )
+                 plt.legend(bbox_to_anchor=(1.05, 1.0), loc="upper left")
+                 plt.title(model_info[0])
+                 plt.tight_layout()
+                 plt.savefig(
+                     os.path.join(
+                         config["outputs"]["artifact_dir"],
+                         model_info[0]
+                         + "_"
+                         + "quantile"
+                         + "_bins"
+                         + str(bins)
+                         + model_type
+                         + ".png",
+                     )
+                 )
+                 plt.close()
+ 
+             #### Calculate model performance by event type ####
+             # Create df to contain prediction data and event type data
+             preds_events_df_forward_val = model_h.create_df_probabilities_and_predictions(
+                 test_probs,
+                 model_info[2],
+                 forward_val_data_imputed["StudyId"].tolist(),
+                 test_target,
+                 forward_val_data_imputed[
+                     ["ExacWithin3Months", "HospExacWithin3Months", "CommExacWithin3Months"]
+                 ],
+                 model_info[0],
+                 model_type,
+                 output_dir="./data/prediction_and_events/",
+                 calib_type="forward_val",
+             )
+             # Subset to each event type and calculate metrics
+             metrics_by_event_type_forward_val = model_h.calc_metrics_by_event_type(
+                 preds_events_df_forward_val, calib_type="forward_val"
+             )
+             # Subset to each event type and plot ROC curve
+             model_h.plot_roc_curve_by_event_type(
+                 preds_events_df_forward_val, model_info[0], "forward_val"
+             )
+             # Subset to each event type and plot PR curve
+             model_h.plot_prec_recall_by_event_type(
+                 preds_events_df_forward_val, model_info[0], "forward_val"
+             )
+ 
+             #### Plot distribution of model scores for uncalibrated model ####
+             model_h.plot_score_distribution(
+                 test_target,
+                 test_probs,
+                 config["outputs"]["artifact_dir"],
+                 model_info[0],
+                 model_type,
+             )
+ 
+             #### Log to MLflow ####
+             mlflow.log_metrics(metrics)
+             mlflow.log_artifacts(config["outputs"]["artifact_dir"])
+ mlflow.end_run()
training/perform_hyper_param_tuning.py ADDED
@@ -0,0 +1,215 @@
+ import os
+ import numpy as np
+ import pandas as pd
+ import mlflow
+ import shutil
+ import model_h
+ 
+ # Model training and evaluation
+ from sklearn.linear_model import LogisticRegression
+ from sklearn.ensemble import RandomForestClassifier
+ from imblearn.ensemble import BalancedRandomForestClassifier
+ import xgboost as xgb
+ from skopt import BayesSearchCV
+ from skopt.space import Real, Integer
+ 
+ ##############################################################
+ # Specify which model to perform cross validation on
+ ##############################################################
+ model_only_hosp = False
+ if model_only_hosp is True:
+     file_suffix = "_only_hosp"
+ else:
+     file_suffix = "_hosp_comm"
+ 
+ ##############################################################
+ # Load data
+ ##############################################################
+ # Load CV folds
+ fold_patients = np.load(
+     "./data/cohort_info/fold_patients" + file_suffix + ".npy", allow_pickle=True
+ )
+ 
+ # Load imputed train data
+ train_data_imp = model_h.load_data_for_modelling(
+     "./data/model_data/train_data_cv_imp" + file_suffix + ".pkl"
+ )
+ 
+ # Load not imputed train data
+ train_data_no_imp = model_h.load_data_for_modelling(
+     "./data/model_data/train_data_cv_no_imp" + file_suffix + ".pkl"
+ )
+ 
+ # Load imputed test data
+ test_data_imp = model_h.load_data_for_modelling(
+     "./data/model_data/test_data_imp" + file_suffix + ".pkl"
+ )
+ # Load not imputed test data
+ test_data_no_imp = model_h.load_data_for_modelling(
+     "./data/model_data/test_data_no_imp" + file_suffix + ".pkl"
+ )
+ 
+ # Create a tuple with training and validation indices for each fold. Can be done with
+ # either imputed or not imputed data as both have the same patients
+ cross_val_fold_indices = []
+ for fold in fold_patients:
+     fold_val_ids = train_data_no_imp[train_data_no_imp.StudyId.isin(fold)]
+     fold_train_ids = train_data_no_imp[
+         ~(train_data_no_imp.StudyId.isin(fold_val_ids.StudyId))
+     ]
+ 
+     # Get index of rows in val and train
+     fold_val_index = fold_val_ids.index
+     fold_train_index = fold_train_ids.index
+ 
+     # Append tuple of training and val indices
+     cross_val_fold_indices.append((fold_train_index, fold_val_index))
+ 
+ # Create list of model features
+ cols_to_drop = ["StudyId", "ExacWithin3Months", "IndexDate"]
+ features_list = [col for col in train_data_no_imp.columns if col not in cols_to_drop]
+ 
+ # Train data
+ # Separate features from target for data with no imputation performed
+ train_features_no_imp = train_data_no_imp[features_list].astype("float")
+ train_target_no_imp = train_data_no_imp.ExacWithin3Months.astype("float")
+ # Separate features from target for data with imputation performed
+ train_features_imp = train_data_imp[features_list].astype("float")
+ train_target_imp = train_data_imp.ExacWithin3Months.astype("float")
+ 
+ # Test data
+ # Separate features from target for data with no imputation performed
+ test_features_no_imp = test_data_no_imp[features_list].astype("float")
+ test_target_no_imp = test_data_no_imp.ExacWithin3Months.astype("float")
+ # Separate features from target for data with imputation performed
+ test_features_imp = test_data_imp[features_list].astype("float")
+ test_target_imp = test_data_imp.ExacWithin3Months.astype("float")
+ 
+ # Check that the targets in the imputed and non-imputed datasets are the same.
+ # If not, raise an error
+ if not train_target_no_imp.equals(train_target_imp):
+     raise ValueError(
+         "Target variable is not the same in imputed and non-imputed datasets in the train set."
+     )
87
+ if not test_target_no_imp.equals(test_target_imp):
88
+ raise ValueError(
89
+ 'Target variable is not the same in imputed and non imputed datasets in the test set.')
90
+ train_target = train_target_no_imp
91
+ test_target = test_target_no_imp
92
+
93
+ # Make sure all features are numeric
94
+ for features in [train_features_no_imp, train_features_imp,
95
+ test_features_no_imp, test_features_imp]:
96
+ for col in features:
97
+ features[col] = pd.to_numeric(features[col], errors='coerce')
98
+
99
+ ##############################################################
100
+ # Specify which models to evaluate
101
+ ##############################################################
102
+ # Set up MLflow
103
+ mlflow.set_tracking_uri("sqlite:///mlruns.db")
104
+ mlflow.set_experiment('model_h_drop_1' + file_suffix)
105
+
106
+ # Set CV scoring strategies and any model parameters
107
+ scoring_methods = ['average_precision']
108
+ scale_pos_weight = train_target.value_counts()[0] / train_target.value_counts()[1]
109
+
110
+ # Set up models, each tuple contains 4 elements: model, model name, imputation status,
111
+ # type of model
112
+ models = []
113
+ # Run different models depending on which parallel model is being used.
114
+ if model_only_hosp is True:
115
+ # Logistic regression
116
+ models.append((LogisticRegression(),
117
+ 'logistic_regression', 'imputed', 'linear'))
118
+ # Balanced random forest
119
+ models.append((BalancedRandomForestClassifier(),
120
+ 'balanced_random_forest', 'imputed', 'tree'))
121
+ # XGBoost
122
+ models.append((xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss'),
123
+ 'xgb', 'not_imputed', 'tree'))
124
+ if model_only_hosp is False:
125
+ # Logistic regression
126
+ models.append((LogisticRegression(),
127
+ 'logistic_regression', 'imputed', 'linear'))
128
+ # Random forest
129
+ models.append((RandomForestClassifier(),
130
+ 'random_forest', 'imputed', 'tree'))
131
+ # Balanced random forest
132
+ models.append((BalancedRandomForestClassifier(),
133
+ 'balanced_random_forest', 'imputed', 'tree'))
134
+ # XGBoost
135
+ models.append((xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss'),
136
+ 'xgb', 'not_imputed', 'tree'))
137
+
138
+ # Define search spaces
139
+ log_reg_search_spaces = {'penalty': ['l2', None],
140
+ 'class_weight': ['balanced', None],
141
+ 'max_iter': Integer(50, 300),
142
+ 'C': Real(0.001, 10),
143
+ }
144
+ rf_search_spaces = {'max_depth': Integer(4, 10),
145
+ 'n_estimators': Integer(70, 850),
146
+ 'min_samples_split': Integer(2, 10),
147
+ 'class_weight': ['balanced', None],
148
+ }
149
+ xgb_search_spaces = {'max_depth': Integer(4, 10),
150
+ 'n_estimators': Integer(70, 850),
151
+ 'subsample': Real(0.55, 0.95),
152
+ 'colsample_bytree': Real(0.55, 0.95),
153
+ 'learning_rate': Real(0.05, 0.14),
154
+ 'scale_pos_weight': Real(1, scale_pos_weight),
155
+ }
156
+
157
+ ##############################################################
158
+ # Run models
159
+ ##############################################################
160
+ #In MLflow run, perform K-fold cross validation and capture mean score across folds.
161
+ with mlflow.start_run(run_name='hyperparameter_tuning_2023_tot_length'):
162
+ for scoring_method in scoring_methods:
163
+ for model in models:
164
+ with mlflow.start_run(run_name=model[1], nested=True):
165
+ print(model[1])
166
+ # Create the artifacts directory if it doesn't exist
167
+ artifact_dir = './tmp'
168
+ os.makedirs(artifact_dir, exist_ok=True)
169
+ # Remove existing directory contents to not mix files between different runs
170
+ shutil.rmtree(artifact_dir)
171
+
172
+ # Run hyperparameter tuning
173
+ if (model[1] == 'balanced_random_forest') | (model[1] == 'random_forest'):
174
+ opt = BayesSearchCV(model[0],
175
+ search_spaces= rf_search_spaces,
176
+ n_iter=200,
177
+ random_state=0,
178
+ cv=cross_val_fold_indices,
179
+ scoring=scoring_method,
180
+ )
181
+ # Execute bayesian optimization
182
+ np.int = int
183
+ opt.fit(train_features_imp, train_target)
184
+
185
+ if model[1] == 'logistic_regression':
186
+ opt = BayesSearchCV(model[0],
187
+ search_spaces= log_reg_search_spaces,
188
+ n_iter=200,
189
+ random_state=0,
190
+ cv=cross_val_fold_indices,
191
+ scoring=scoring_method,
192
+ )
193
+ np.int = int
194
+ opt.fit(train_features_imp, train_target)
195
+
196
+ if model[1] == 'xgb':
197
+ opt = BayesSearchCV(model[0],
198
+ search_spaces= xgb_search_spaces,
199
+ n_iter=200,
200
+ random_state=0,
201
+ cv=cross_val_fold_indices,
202
+ scoring=scoring_method,
203
+ )
204
+ np.int = int
205
+ opt.fit(train_features_no_imp, train_target)
206
+
207
+ # Get scores from hyperparameter tuning
208
+ print(opt.best_params_)
209
+ print(opt.best_score_)
210
+
211
+ # Log scores from hyperparameter tuning
212
+ mlflow.log_param('opt_scorer', scoring_method)
213
+ mlflow.log_params(opt.best_params_)
214
+ mlflow.log_metric("opt_best_score", opt.best_score_)
215
+ mlflow.end_run()
training/process_comorbidities.py ADDED
@@ -0,0 +1,109 @@
+ """
+ Derive features from the comorbidities dataset for 2 models:
+ Parallel model 1: uses both hospital and community exacerbation events
+ Parallel model 2: uses only hospital exacerbation events
+ """
+
+ import numpy as np
+ import pandas as pd
+ import sys
+ import os
+ import yaml
+ import model_h
+
+ with open("./training/config.yaml", "r") as config_file:
+     config = yaml.safe_load(config_file)
+
+ # Specify which model to generate features for
+ model_type = config["model_settings"]["model_type"]
+
+ # Setup log file
+ log = open("./training/logging/process_comorbidities_" + model_type + ".log", "w")
+ sys.stdout = log
+
+ # Dataset to process - set through config file
+ data_to_process = config["model_settings"]["data_to_process"]
+
+ # Load cohort data
+ if data_to_process == "forward_val":
+     exac_data = pd.read_pickle("./data/patient_labels_forward_val_hosp_comm.pkl")
+     patient_details = pd.read_pickle("./data/patient_details_forward_val.pkl")
+ else:
+     exac_data = pd.read_pickle("./data/patient_labels_" + model_type + ".pkl")
+     patient_details = pd.read_pickle("./data/patient_details.pkl")
+ exac_data = exac_data[["StudyId", "IndexDate"]]
+ patient_details = exac_data.merge(
+     patient_details[["StudyId", "PatientId"]],
+     on="StudyId",
+     how="left",
+ )
+
+ comorbidities = pd.read_csv(
+     config["inputs"]["raw_data_paths"]["comorbidities"], delimiter="|"
+ )
+ comorbidities = patient_details.merge(comorbidities, on="PatientId", how="left")
+
+ # Only keep records submitted before the index date
+ comorbidities["Created"] = pd.to_datetime(comorbidities["Created"], utc=True)
+ comorbidities["TimeSinceSubmission"] = (
+     comorbidities["IndexDate"] - comorbidities["Created"]
+ ).dt.days
+ comorbidities = comorbidities[comorbidities["TimeSinceSubmission"] > 0]
+
+ # If multiple records were submitted for the same patient, keep the most recent record
+ # (in relation to the index date)
+ comorbidities = comorbidities.sort_values(
+     by=["StudyId", "IndexDate", "TimeSinceSubmission"]
+ )
+ comorbidities = comorbidities.drop_duplicates(
+     subset=["StudyId", "IndexDate"], keep="first"
+ )
+
+ # Get list of comorbidities captured in the service
+ comorbidity_list = list(comorbidities)
+ comorbidity_list = [
+     e
+     for e in comorbidity_list
+     if e
+     not in ("PatientId", "Id", "StudyId", "IndexDate", "TimeSinceSubmission", "Created")
+ ]
+
+ # Map True/False values to integers
+ bool_mapping = {True: 1, False: 0}
+ comorbidities[comorbidity_list] = (
+     comorbidities[comorbidity_list].replace(bool_mapping).fillna(0)
+ )
+
+ # Get comorbidity counts for each patient
+ comorbidities["Comorbidities"] = comorbidities[comorbidity_list].sum(axis=1)
+
+ # Drop comorbidities columns from train data but retain AsthmaOverlap
+ comorbidity_list.remove("AsthmaOverlap")
+ comorbidities = comorbidities.drop(columns=comorbidity_list)
+ comorbidities = comorbidities.drop(columns=["Id", "Created", "TimeSinceSubmission"])
+
+ # Bin number of comorbidities
+ comorb_bins = [0, 1, 3, np.inf]
+ comorb_labels = ["No comorbidities", "1-2", "3+"]
+ comorbidities["Comorbidities"] = model_h.bin_numeric_column(
+     col=comorbidities["Comorbidities"], bins=comorb_bins, labels=comorb_labels
+ )
+
+ comorbidities = comorbidities.drop(columns=["PatientId"])
+
+ # Save data
+ os.makedirs(config["outputs"]["processed_data_dir"], exist_ok=True)
+ if data_to_process == "forward_val":
+     comorbidities.to_pickle(
+         os.path.join(
+             config["outputs"]["processed_data_dir"],
+             "comorbidities_forward_val_" + model_type + ".pkl",
+         )
+     )
+ else:
+     comorbidities.to_pickle(
+         os.path.join(
+             config["outputs"]["processed_data_dir"],
+             "comorbidities_" + model_type + ".pkl",
+         )
+     )
training/process_demographics.py ADDED
@@ -0,0 +1,75 @@
+ """
+ Derive features from demographics for 2 models:
+ Parallel model 1: uses both hospital and community exacerbation events
+ Parallel model 2: uses only hospital exacerbation events
+ """
+
+ import numpy as np
+ import pandas as pd
+ import sys
+ import os
+ import model_h
+ import yaml
+
+ with open("./training/config.yaml", "r") as config_file:
+     config = yaml.safe_load(config_file)
+
+ # Specify which model to generate features for
+ model_type = config["model_settings"]["model_type"]
+
+ # Setup log file
+ log = open("./training/logging/process_demographics_" + model_type + ".log", "w")
+ sys.stdout = log
+
+ # Dataset to process - set through config file
+ data_to_process = config["model_settings"]["data_to_process"]
+
+ # Load cohort data
+ if data_to_process == "forward_val":
+     data = pd.read_pickle("./data/patient_labels_forward_val_hosp_comm.pkl")
+     patient_details = pd.read_pickle("./data/patient_details_forward_val.pkl")
+ else:
+     data = pd.read_pickle("./data/patient_labels_" + model_type + ".pkl")
+     patient_details = pd.read_pickle("./data/patient_details.pkl")
+ data = data.merge(
+     patient_details[["StudyId"]],
+     on="StudyId",
+     how="left",
+ )
+
+ # Calculate age
+ data["DateOfBirth"] = pd.to_datetime(data["DateOfBirth"], utc=True)
+ data["Age"] = (data["IndexDate"] - data["DateOfBirth"]).dt.days
+ data["Age"] = np.floor(data["Age"] / 365)
+ data = data.drop(columns="DateOfBirth")
+
+ # Bin patient age
+ age_bins = [0, 50, 60, 70, 80, np.inf]
+ age_labels = ["<50", "50-59", "60-69", "70-79", "80+"]
+ data["AgeBinned"] = model_h.bin_numeric_column(
+     col=data["Age"], bins=age_bins, labels=age_labels
+ )
+
+ # Smoking status: TODO
+
+ # Map the M and F sex column to binary (1=F)
+ sex_mapping = {"F": 1, "M": 0}
+ data["Sex_F"] = data.Sex.map(sex_mapping)
+ data = data.drop(columns=["Sex"])
+
+ # Save data
+ os.makedirs(config["outputs"]["processed_data_dir"], exist_ok=True)
+ if data_to_process == "forward_val":
+     data.to_pickle(
+         os.path.join(
+             config["outputs"]["processed_data_dir"],
+             "demographics_forward_val_" + model_type + ".pkl",
+         )
+     )
+ else:
+     data.to_pickle(
+         os.path.join(
+             config["outputs"]["processed_data_dir"],
+             "demographics_" + model_type + ".pkl",
+         )
+     )
training/process_exacerbation_history.py ADDED
@@ -0,0 +1,297 @@
+ """
+ Derive features from exacerbation event history for 2 models:
+ Parallel model 1: uses both hospital and community exacerbation events
+ Parallel model 2: uses only hospital exacerbation events
+ """
+
+ import numpy as np
+ import pandas as pd
+ import sys
+ import re
+ import model_h
+ import os
+ import yaml
+
+ with open("./training/config.yaml", "r") as config_file:
+     config = yaml.safe_load(config_file)
+
+ # Specify which model to generate features for
+ model_type = config["model_settings"]["model_type"]
+ if model_type == "only_hosp":
+     cols_required = ["IsHospExac", "IsHospAdmission"]
+     pharmacy_prescriptions_req = False
+ if model_type == "hosp_comm":
+     cols_required = ["IsExac", "IsHospExac", "IsCommExac", "IsHospAdmission"]
+     pharmacy_prescriptions_req = True
+
+ # Setup log file
+ log = open("./training/logging/process_exacerbation_history_" + model_type + ".log", "w")
+ sys.stdout = log
+
+ # Dataset to process - set through config file
+ data_to_process = config["model_settings"]["data_to_process"]
+
+ # Load cohort data
+ if data_to_process == "forward_val":
+     data = pd.read_pickle("./data/patient_labels_forward_val_hosp_comm.pkl")
+     patient_details = pd.read_pickle("./data/patient_details_forward_val.pkl")
+ else:
+     data = pd.read_pickle("./data/patient_labels_" + model_type + ".pkl")
+     patient_details = pd.read_pickle("./data/patient_details.pkl")
+ data = data[["StudyId", "IndexDate"]]
+ data = data.merge(
+     patient_details[["StudyId", "PatientId", "FirstSubmissionDate"]],
+     on="StudyId",
+     how="left",
+ )
+
+ # Read mapping between StudyId and SafeHavenID
+ id_mapping = pd.read_pickle("./data/sh_to_studyid_mapping.pkl")
+
+ # Remove the mapping for patient SU125 as the mapping for this patient is incorrect
+ id_mapping["SafeHavenID"] = np.where(
+     id_mapping["StudyId"] == "SU125", np.NaN, id_mapping["SafeHavenID"]
+ )
+ id_mapping = id_mapping.merge(
+     data[["StudyId"]], on="StudyId", how="inner"
+ ).drop_duplicates()
+ print(
+     "Num patients with SafeHaven mapping: {} of {}".format(
+         len(id_mapping), data.StudyId.nunique()
+     )
+ )
+
+ # Add column with SafeHavenID to main df
+ data = data.merge(id_mapping, on="StudyId", how="left")
+
+ # Calculate the lookback start date. This is needed to aggregate data for model
+ # features
+ data["LookbackStartDate"] = data["IndexDate"] - pd.DateOffset(
+     days=config["model_settings"]["lookback_period"]
+ )
+
+ ############################################################################
+ # Derive features from patient history
+ ############################################################################
+ #########################################
+ # Num previous exacerbations/admissions
+ #########################################
+ exacs = pd.read_pickle("./data/{}_exacs.pkl".format(model_type))
+ exacs = exacs.fillna(0)
+ print(exacs.columns)
+ print(data.columns)
+ exacs = data[["StudyId", "PatientId", "LookbackStartDate", "IndexDate"]].merge(
+     exacs, on=["StudyId", "PatientId"], how="left"
+ )
+
+ # Calculate the total number of exacerbations prior to the index date and normalise by
+ # length in service
+ exac_counts_tot = exacs[exacs["DateOfEvent"] < exacs["IndexDate"]]
+ exac_counts_tot = exac_counts_tot.groupby(["StudyId", "IndexDate"])["IsExac"].sum()
+ exac_counts_tot = pd.DataFrame(exac_counts_tot).reset_index()
+ exac_counts_tot = exac_counts_tot.merge(
+     patient_details[["StudyId", "FirstSubmissionDate"]], on="StudyId", how="left"
+ )
+ exac_counts_tot["LengthInServiceBeforeIndex"] = (
+     exac_counts_tot["IndexDate"] - exac_counts_tot["FirstSubmissionDate"]
+ ).dt.days
+ exac_counts_tot["LengthInServiceBeforeIndex"] = (
+     exac_counts_tot["LengthInServiceBeforeIndex"] / 30
+ )
+ exac_counts_tot["TotExacPerMonthBeforeIndex"] = (
+     exac_counts_tot["IsExac"] / exac_counts_tot["LengthInServiceBeforeIndex"]
+ )
+ exac_counts_tot = exac_counts_tot.drop(
+     columns=["IsExac", "FirstSubmissionDate", "LengthInServiceBeforeIndex"]
+ )
+
+ # Calculate the number of previous exacerbations in the 6 months before the index date
+ exac_counts_6mo = exacs[
+     (exacs["DateOfEvent"] >= exacs["LookbackStartDate"])
+     & (exacs["DateOfEvent"] < exacs["IndexDate"])
+ ]
+ exac_counts_6mo = exac_counts_6mo.groupby(["StudyId", "IndexDate"])[cols_required].sum()
+
+ # Remove 'Is' prefix and add 'Num' prefix and 'Prior6mo' suffix
+ new_col_names = []
+ for col in cols_required:
+     base_col_name = re.findall(r"[A-Z][^A-Z]*", col)
+     base_col_name.pop(0)
+     base_col_name = "".join(base_col_name)
+     new_col_names.append("Num" + base_col_name + "Prior6mo")
+
+ # Rename columns and merge to main df
+ exac_counts_6mo = exac_counts_6mo.rename(
+     columns=dict(zip(cols_required, new_col_names))
+ ).reset_index()
+ data = data.merge(exac_counts_6mo, on=["StudyId", "IndexDate"], how="left")
+ data = data.merge(exac_counts_tot, on=["StudyId", "IndexDate"], how="left")
+ data = data.fillna(0)
+
+ #########################################
+ # Days since previous exacerbation
+ #########################################
+ # Calculate the number of days since the last exacerbation before the index date
+ days_since_exac = exacs[exacs["DateOfEvent"] < exacs["IndexDate"]]
+ days_since_exac = days_since_exac[days_since_exac[cols_required[0]] == 1]
+ days_since_exac = days_since_exac.sort_values(
+     by=["StudyId", "IndexDate", "DateOfEvent"], ascending=False
+ )
+ days_since_exac = days_since_exac.drop_duplicates(
+     subset=["StudyId", "IndexDate"], keep="first"
+ )
+ days_since_exac["DaysSinceLastExac"] = (
+     days_since_exac["IndexDate"] - days_since_exac["DateOfEvent"]
+ ).dt.days
+ data = data.merge(
+     days_since_exac[["StudyId", "IndexDate", "DaysSinceLastExac"]],
+     on=["StudyId", "IndexDate"],
+     how="left",
+ )
+
+ # If a patient has missing values in DaysSinceLastExac, find exacerbations in SafeHaven
+ # data, as service data is only manually entered up to one year before onboarding
+ missing_data_to_lookup = data[data["DaysSinceLastExac"].isna()].drop_duplicates(
+     subset="StudyId", keep="first"
+ )
+ missing_data_to_lookup = missing_data_to_lookup[
+     ["StudyId", "SafeHavenID", "IndexDate", "FirstSubmissionDate"]
+ ]
+
+ # For both parallel models 1 and 2, find hospital exacerbations in SMR data for patients
+ # who have missing data in the DaysSinceLastExac column
+ smr = pd.read_csv(config["inputs"]["raw_data_paths"]["admissions"])
+ smr = missing_data_to_lookup.merge(smr, on="SafeHavenID", how="left")
+
+ # Only find exacerbations prior to onboarding
+ smr = smr.rename(columns={"ADMDATE": "DateOfEvent"})
+ smr["DateOfEvent"] = pd.to_datetime(smr["DateOfEvent"], utc=True)
+ smr = smr[smr["DateOfEvent"] < smr["FirstSubmissionDate"]]
+
+ # A COPD admission is defined as: an admission with J40-J44 as the principal diagnosis,
+ # or J20 as the principal diagnosis with one of J41-J44 in the secondary diagnosis field
+ principal_diag = [
+     "Bronchitis, not specified as acute or chronic",
+     "chronic bronchitis",
+     "Emphysema",
+     "MacLeod syndrome",
+     "chronic obstructive pulmonary disease",
+ ]
+ principal_diag_alt = ["Acute bronchitis"]
+ secondary_diag_alt = [
+     "chronic bronchitis",
+     "Emphysema",
+     "MacLeod syndrome",
+     "chronic obstructive pulmonary disease",
+ ]
+ # na=False treats rows with a missing diagnosis description as non-matches
+ condition_primary = smr["DIAG1Desc"].str.contains(
+     r"\b(?:" + "|".join(principal_diag) + r")\b", case=False, regex=True, na=False
+ )
+ condition_secondary = (
+     smr["DIAG1Desc"].str.contains(
+         r"\b(?:" + "|".join(principal_diag_alt) + r")\b", case=False, regex=True,
+         na=False,
+     )
+ ) & (
+     smr["DIAG2Desc"].str.contains(
+         r"\b(?:" + "|".join(secondary_diag_alt) + r")\b", case=False, regex=True,
+         na=False,
+     )
+ )
+ smr["COPD_admission_smr"] = np.where(condition_primary, True, False)
+ smr["COPD_admission_smr"] = np.where(
+     condition_secondary, True, smr["COPD_admission_smr"]
+ )
+
+ smr = smr[smr["COPD_admission_smr"] == True]
+
+ # Find rescue med prescriptions prior to onboarding for parallel model 1 (where both
+ # hospital and community exacerbations are used)
+ if pharmacy_prescriptions_req is True:
+     # Read pharmacy data and filter for model cohort
+     pharmacy = pd.read_csv(config["inputs"]["raw_data_paths"]["prescribing"])
+     pharmacy = missing_data_to_lookup.merge(pharmacy, on="SafeHavenID", how="left")
+
+     # Pull out rescue med prescriptions only
+     steroid_codes = [
+         "0603020T0AAACAC",
+         "0603020T0AABKBK",
+         "0603020T0AAAXAX",
+         "0603020T0AAAGAG",
+         "0603020T0AABHBH",
+         "0603020T0AAACAC",
+         "0603020T0AABKBK",
+         "0603020T0AABNBN",
+         "0603020T0AAAGAG",
+         "0603020T0AABHBH",
+     ]
+     antibiotic_codes = [
+         "0501013B0AAAAAA",
+         "0501013B0AAABAB",
+         "0501030I0AAABAB",
+         "0501030I0AAAAAA",
+         "0501050B0AAAAAA",
+         "0501050B0AAADAD",
+         "0501013K0AAAJAJ",
+     ]
+     rescue_med_bnf_codes = steroid_codes + antibiotic_codes
+     pharmacy = pharmacy[pharmacy.PI_BNF_Item_Code.isin(rescue_med_bnf_codes)]
+
+     # Only keep rescue meds before patient onboarding
+     pharmacy = pharmacy.rename(columns={"PRESC_DATE": "DateOfEvent"})
+     pharmacy["DateOfEvent"] = pd.to_datetime(
+         pharmacy["DateOfEvent"], utc=True
+     ).dt.normalize()
+     pharmacy = pharmacy[pharmacy["DateOfEvent"] < pharmacy["FirstSubmissionDate"]]
+
+     # Combine pharmacy data with smr admissions data
+     smr = pd.concat([smr, pharmacy])
+
+ # Calculate the days since the last exacerbation
+ smr["DaysSinceLastExac"] = (smr["IndexDate"] - smr["DateOfEvent"]).dt.days
+ smr = smr.sort_values(by=["StudyId", "IndexDate", "DaysSinceLastExac"], ascending=True)
+ smr = smr.drop_duplicates(subset=["StudyId", "IndexDate"], keep="first")
+
+ # Merge back to main df
+ data = data.merge(
+     smr[["StudyId", "IndexDate", "DaysSinceLastExac"]],
+     on=["StudyId", "IndexDate"],
+     how="left",
+ )
+ data["DaysSinceLastExac"] = np.where(
+     data["DaysSinceLastExac_x"].notnull(),
+     data["DaysSinceLastExac_x"],
+     data["DaysSinceLastExac_y"],
+ )
+ data = data.drop(columns=["DaysSinceLastExac_x", "DaysSinceLastExac_y"])
+ print(
+     "Number of patients with missing DaysSinceLastExac: {}".format(
+         data[data["DaysSinceLastExac"].isna()].StudyId.nunique()
+     )
+ )
+
+ # Bin days since last exacerbation
+ exac_bins = [0, 21, 90, 180, np.inf]
+ exac_labels = ["<21 days", "21 - 89 days", "90 - 179 days", ">= 180 days"]
+ data["DaysSinceLastExac"] = model_h.bin_numeric_column(
+     col=data["DaysSinceLastExac"], bins=exac_bins, labels=exac_labels
+ )
+ # If DaysSinceLastExac is nan, put it into the >= 180 category
+ data["DaysSinceLastExac"] = data["DaysSinceLastExac"].replace("nan", ">= 180 days")
+
+ data = data.drop(columns=["FirstSubmissionDate", "LookbackStartDate", "PatientId"])
+
+ # Save data
+ os.makedirs(config["outputs"]["processed_data_dir"], exist_ok=True)
+ if data_to_process == "forward_val":
+     data.to_pickle(
+         os.path.join(
+             config["outputs"]["processed_data_dir"],
+             "exac_history_forward_val_" + model_type + ".pkl",
+         )
+     )
+ else:
+     data.to_pickle(
+         os.path.join(
+             config["outputs"]["processed_data_dir"],
+             "exac_history_" + model_type + ".pkl",
+         )
+     )
training/process_labs.py ADDED
@@ -0,0 +1,228 @@
1
+ """
2
+ Derive features from lab tests for 2 models:
3
+ Parallel model 1: uses both hospital and community exacerbation events
4
+ Parallel model 2: uses only hospital exacerbation events
5
+ """
6
+
7
+ import numpy as np
8
+ import pandas as pd
9
+ import sys
10
+ import os
11
+ import model_h
12
+ import ggc.preprocessing.labs as labs_preprocessing
13
+ import yaml
14
+
15
+
16
+ def calc_lab_metric(lab_df, data, lab_name, metric, weigh_data_by_recency=False):
17
+ """
18
+ Calculate metrics on laboratory data.
19
+
20
+ Args:
21
+ lab_df (pd.DataFrame): dataframe containing labs to be used in calculations.
22
+ data (pd.DataFrame): main dataframe to which columns containing the results from
23
+ the lab calculations are merged onto.
24
+ lab_name (list): name of labs required for metric calculations.
25
+ metric (str): name of metric to be calculated. The possible metrics are:
26
+ 'MaxLifetime': calculates the maximum value of lab for patient within
27
+ entire dataset before their index date.
28
+ 'MinLifetime': calculates the minimum value of lab for patient within
29
+ entire dataset before their index date.
30
+ 'Max1Year': calculates the maximum value of lab for patient within 1
31
+ year prior to index date.
32
+ 'Min1Year': calculates the maximum value of lab for patient within 1
33
+ year prior to index date.
34
+ 'Latest': finds the closest lab value prior to index date.
35
+ weigh_data_by_recency (bool): option to weigh data based on how recent it is. Older
36
+ observations are decreased or increased towards the median. Defaults to False.
37
+
38
+ Returns:
39
+ pd.DataFrame: the input dataframe with additional columns with calculated
40
+ metrics.
41
+ """
42
+ # Subset labs to only those specified in lab_names
43
+ cols_to_keep = ["StudyId", "IndexDate", "TimeSinceLab"]
44
+ cols_to_keep.append(lab_name)
45
+ labs_calc = lab_df[cols_to_keep]
46
+
47
+ # Subset labs to correct time frames and calculate metrics
48
+ if (metric == "Max1Year") | (metric == "Min1Year"):
49
+ labs_calc = labs_calc[labs_calc["TimeSinceLab"] <= 365]
50
+ if (metric == "MaxLifetime") | (metric == "Max1Year"):
51
+ labs_calc = labs_calc.groupby(["StudyId", "IndexDate"]).max()
52
+ if (metric == "MinLifetime") | (metric == "Min1Year"):
53
+ labs_calc = labs_calc.groupby(["StudyId", "IndexDate"]).min()
54
+ labs_calc = labs_calc.drop(columns=["TimeSinceLab"])
55
+ if metric == "Latest":
56
+ labs_calc = labs_calc[labs_calc["TimeSinceLab"] <= 365]
57
+ labs_calc = labs_calc.sort_values(
58
+ by=["StudyId", "IndexDate", "TimeSinceLab"], ascending=True
59
+ )
60
+ labs_calc["TimeSinceLab"] = np.where(
61
+ labs_calc[lab_name].isna(), np.NaN, labs_calc["TimeSinceLab"]
62
+ )
63
+ labs_calc = labs_calc.bfill()
64
+ labs_calc = labs_calc.drop_duplicates(
65
+ subset=["StudyId", "IndexDate"], keep="first"
66
+ )
67
+ if weigh_data_by_recency is True:
68
+ median_val = labs_calc[lab_name].median()
69
+ labs_calc = model_h.weigh_features_by_recency(
70
+ df=labs_calc,
+ feature=lab_name,
+ feature_recency_days="TimeSinceLab",
+ median_value=median_val,
+ decay_rate=0.001,
+ )
+ labs_calc = labs_calc.set_index(["StudyId", "IndexDate"])
+
+ # Add prefix to lab names and merge with main df
+ labs_calc = labs_calc.add_prefix(metric)
+ labs_calc = labs_calc.reset_index()
+ data = data.merge(labs_calc, on=["StudyId", "IndexDate"], how="left")
+ return data
+
+
+ with open("./training/config.yaml", "r") as config:
+ config = yaml.safe_load(config)
+
+ # Specify which model to generate features for
+ model_type = config["model_settings"]["model_type"]
+
+ # Setup log file
+ log = open("./training/logging/process_labs_" + model_type + ".log", "w")
+ sys.stdout = log
+
+ # Dataset to process - set through config file
+ data_to_process = config["model_settings"]["data_to_process"]
+
+ # Load cohort data
+ if data_to_process == "forward_val":
+ data = pd.read_pickle("./data/patient_labels_forward_val_hosp_comm.pkl")
+ patient_details = pd.read_pickle("./data/patient_details_forward_val.pkl")
+ else:
+ data = pd.read_pickle("./data/patient_labels_" + model_type + ".pkl")
+ patient_details = pd.read_pickle("./data/patient_details.pkl")
+ data = data[["StudyId", "IndexDate"]]
+ patient_details = data.merge(
+ patient_details[["StudyId", "PatientId"]],
+ on="StudyId",
+ how="left",
+ )
+
+ # Read mapping between StudyId and SafeHavenID
+ id_mapping = pd.read_pickle("./data/sh_to_studyid_mapping.pkl")
+
+ # Remove mapping for patient SU125 as the mapping for this patient is incorrect
+ id_mapping["SafeHavenID"] = np.where(
+ id_mapping["StudyId"] == "SU125", np.NaN, id_mapping["SafeHavenID"]
+ )
+ id_mapping = id_mapping.merge(
+ data[["StudyId"]], on="StudyId", how="inner"
+ ).drop_duplicates()
+ print(
+ "Num patients with SafeHaven mapping: {} of {}".format(
+ len(id_mapping), data.StudyId.nunique()
+ )
+ )
+
+ # Add column with SafeHavenID to main df
+ patient_details = patient_details.merge(id_mapping, on="StudyId", how="left")
+
+ # Calculate the lookback start date. Will need this to aggregate data for model
+ # features
+ patient_details["LookbackStartDate"] = patient_details["IndexDate"] - pd.DateOffset(
+ days=config["model_settings"]["lookback_period"]
+ )
+
+ ############################################################################
+ # Derive features from labs
+ ############################################################################
+ # Convert column names into format required for labs processing using the ggc package
+ cols_to_use = [
+ "SafeHavenID",
+ "ClinicalCodeDescription",
+ "QuantityUnit",
+ "RangeHighValue",
+ "RangeLowValue",
+ "QuantityValue",
+ "SampleDate",
+ ]
+
+ labs = pd.read_csv(config["inputs"]["raw_data_paths"]["labs"], usecols=cols_to_use)
+
+ # Subset labs table to only patients of interest
+ labs = labs[labs.SafeHavenID.isin(patient_details.SafeHavenID)]
+
+ # Process labs
+ lookup_table = pd.read_csv(config["inputs"]["raw_data_paths"]["labs_lookup_table"])
+ tests_of_interest = [
+ "Eosinophils",
+ "Albumin",
+ "Neutrophils",
+ "White Blood Count",
+ "Lymphocytes",
+ ]
+ labs_processed = labs_preprocessing.clean_labs_data(
+ df=labs,
+ tests_of_interest=tests_of_interest,
+ units_lookup=lookup_table,
+ print_log=True,
+ )
+ labs_processed = patient_details[["StudyId", "IndexDate", "SafeHavenID"]].merge(
+ labs_processed, on="SafeHavenID", how="left"
+ )
+ labs_processed["SampleDate"] = pd.to_datetime(labs_processed["SampleDate"], utc=True)
+ labs_processed["TimeSinceLab"] = (
+ labs_processed["IndexDate"] - labs_processed["SampleDate"]
+ ).dt.days
+
+ # Only keep labs performed before IndexDate
+ labs_processed = labs_processed[labs_processed["TimeSinceLab"] >= 0]
+
+ # Convert lab names to columns
+ labs_processed = pd.pivot_table(
+ labs_processed,
+ values="QuantityValue",
+ index=["StudyId", "IndexDate", "TimeSinceLab"],
+ columns=["ClinicalCodeDescription"],
+ )
+ labs_processed = labs_processed.reset_index()
+
+ # Calculate neutrophil/lymphocyte ratio
+ labs_processed["NeutLymphRatio"] = (
+ labs_processed["Neutrophils"] / labs_processed["Lymphocytes"]
+ )
+
+ # Calculate lowest albumin in past year
+ data = calc_lab_metric(labs_processed, data, lab_name="Albumin", metric="Min1Year")
+
+ # Calculate the latest lab value
+ lab_names = [
+ "NeutLymphRatio",
+ "Albumin",
+ "Eosinophils",
+ "Neutrophils",
+ "White Blood Count",
+ ]
+
+ for lab_name in lab_names:
+ data = calc_lab_metric(
+ labs_processed, data, lab_name, metric="Latest", weigh_data_by_recency=True
+ )
+
+ # Save data
+ os.makedirs(config["outputs"]["processed_data_dir"], exist_ok=True)
+ if data_to_process == "forward_val":
+ data.to_pickle(
+ os.path.join(
+ config["outputs"]["processed_data_dir"],
+ "labs_forward_val_" + model_type + ".pkl",
+ )
+ )
+ else:
+ data.to_pickle(
+ os.path.join(
+ config["outputs"]["processed_data_dir"],
+ "labs_" + model_type + ".pkl",
+ )
+ )
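The labs script above weights stale lab values toward the cohort median via `weigh_features_by_recency` (a helper defined elsewhere in the repo, called with `decay_rate=0.001`). A minimal sketch of what such recency weighting could look like, assuming an exponential decay of the form `w = exp(-decay_rate * age_days)` — the repo's actual helper may differ:

```python
import numpy as np


def weigh_by_recency_sketch(values, age_days, median_value, decay_rate=0.001):
    """Hypothetical recency weighting: blend each value with the cohort
    median, trusting the observed value less as the sample ages."""
    weight = np.exp(-decay_rate * np.asarray(age_days, dtype=float))
    return weight * np.asarray(values, dtype=float) + (1 - weight) * median_value


# A fresh lab keeps its value; a very old one converges toward the median.
print(weigh_by_recency_sketch([40.0], [0], median_value=38.0))     # [40.]
print(weigh_by_recency_sketch([40.0], [5000], median_value=38.0))  # close to 38
```

With `decay_rate=0.001`, a one-year-old lab still carries roughly `exp(-0.365) ≈ 0.69` of its original weight, so the decay is deliberately gentle.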
training/process_pros.py ADDED
@@ -0,0 +1,1031 @@
+ """
+ Derive features from PRO responses for 2 models:
+ Parallel model 1: uses both hospital and community exacerbation events
+ Parallel model 2: uses only hospital exacerbation events
+ """
+
+ import numpy as np
+ import pandas as pd
+ import sys
+ import os
+ import re
+ from collections import defaultdict
+ import yaml
+
+
+ def calc_total_pro_engagement(pro_df, pro_name):
+ """
+ Calculates PRO engagement per patient across their entire time within the service.
+
+ Args:
+ pro_df (pd.DataFrame): dataframe containing the onboarding date and the latest
+ prediction date.
+ pro_name (str): name of the PRO.
+
+ Returns:
+ pd.DataFrame: the input dataframe with an additional column stating the total
+ engagement for each patient across the service.
+ """
+ # Calculate time in service according to type of PRO
+ if pro_name == "EQ5D":
+ date_unit = "M"
+ if pro_name == "MRC":
+ date_unit = "W"
+ if (pro_name == "CAT") | (pro_name == "SymptomDiary"):
+ date_unit = "D"
+ pro_df["TimeInService"] = np.floor(
+ (
+ (pro_df.LatestPredictionDate - pro_df.FirstSubmissionDate)
+ / np.timedelta64(1, date_unit)
+ )
+ )
+
+ # PRO engagement for the total time in service
+ pro_response_count = pro_df.groupby("StudyId").count()[["PatientId"]].reset_index()
+ pro_response_count = pro_response_count.rename(
+ columns={"PatientId": "Response" + pro_name}
+ )
+ pro_df = pro_df.merge(pro_response_count, on="StudyId", how="left")
+ pro_df["TotalEngagement" + pro_name] = round(
+ pro_df["Response" + pro_name] / pro_df["TimeInService"], 2
+ )
+ return pro_df
+
+
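The time-in-service calculation above relies on dividing a date difference by `np.timedelta64(1, unit)` to express it in that unit, which `np.floor` then truncates to whole periods. A toy illustration (not the project's data) for the week and day units used by MRC and CAT/SymptomDiary:

```python
import numpy as np
import pandas as pd

# Elapsed time between onboarding and the latest prediction date,
# expressed in whole weeks and whole days.
first = pd.Timestamp("2023-01-01")
latest = pd.Timestamp("2023-03-01")  # 59 days later

weeks_in_service = np.floor((latest - first) / np.timedelta64(1, "W"))
days_in_service = np.floor((latest - first) / np.timedelta64(1, "D"))
print(weeks_in_service, days_in_service)  # 8.0 59.0
```

Dividing by the month unit `"M"` (the EQ5D branch) is handled differently by numpy/pandas because months have variable length, so that branch is worth checking against the installed pandas version.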
+ def calc_pro_engagement_in_time_window(pro_df, pro_name, time_window, data):
+ """
+ Calculates PRO engagement per patient across a specified time window. The time
+ window is given in months, and consists of the specified time period prior to
+ IndexDate.
+
+ Args:
+ pro_df (pd.DataFrame): dataframe containing the index dates and PRO response
+ submission dates.
+ pro_name (str): name of the PRO.
+ time_window (int): number of months in which to calculate PRO engagement.
+ data (pd.DataFrame): main dataframe.
+
+ Returns:
+ pd.DataFrame: a dataframe containing the calculated PRO engagement.
+ """
+
+ # Calculate time in service according to type of PRO.
+ if pro_name == "EQ5D":
+ unit_val = 1
+ if pro_name == "MRC":
+ unit_val = 4
+ if (pro_name == "CAT") | (pro_name == "SymptomDiary"):
+ unit_val = 30
+
+ pro_df["SubmissionTime"] = pd.to_datetime(pro_df["SubmissionTime"], utc=True)
+ pro_engagement_6mo = pro_df.copy()
+ pro_engagement_6mo["TimeSinceSubmission"] = (
+ pro_engagement_6mo["IndexDate"] - pro_engagement_6mo["SubmissionTime"]
+ ).dt.days
+
+ # Only include PRO responses within the specified time window
+ pro_engagement_6mo = pro_engagement_6mo[
+ pro_engagement_6mo["TimeSinceSubmission"].between(
+ 0, (time_window * 30), inclusive="both"
+ )
+ ]
+
+ # Calculate number of PRO responses within specified time window
+ pro_engagement_6mo = (
+ pro_engagement_6mo.groupby(["StudyId", "IndexDate"])
+ .count()[["PatientId"]]
+ .reset_index()
+ )
+ pro_engagement_6mo = pro_engagement_6mo.rename(
+ columns={"PatientId": "ResponseCountTW" + str(time_window)}
+ )
+ pro_engagement_6mo["Engagement" + pro_name + "TW" + str(time_window)] = round(
+ pro_engagement_6mo["ResponseCountTW" + str(time_window)]
+ / (time_window * unit_val),
+ 2,
+ )
+ pro_engagement_6mo = data[["StudyId", "IndexDate"]].merge(
+ pro_engagement_6mo, on=["StudyId", "IndexDate"], how="left"
+ )
+
+ # Fill N/As with 0 as no engagement was observed for those patients
+ pro_engagement_6mo = pro_engagement_6mo.fillna(0)
+ return pro_engagement_6mo
+
+
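The windowed-engagement logic above counts submissions whose `TimeSinceSubmission` falls within the window, then divides by the expected number of submissions (`time_window * unit_val`). A self-contained toy example with hypothetical data for a daily PRO (`unit_val = 30`) and a 1-month window:

```python
import pandas as pd

# One patient with four submissions; only those within 30 days before
# IndexDate count toward the engagement rate.
subs = pd.DataFrame({
    "StudyId": ["A"] * 4,
    "IndexDate": pd.Timestamp("2023-02-01"),
    "SubmissionTime": pd.to_datetime(
        ["2023-01-05", "2023-01-20", "2023-01-31", "2022-11-01"]
    ),
})
subs["TimeSinceSubmission"] = (subs["IndexDate"] - subs["SubmissionTime"]).dt.days

in_window = subs[subs["TimeSinceSubmission"].between(0, 30)]  # 3 of 4 rows
engagement = round(len(in_window) / 30, 2)  # daily PRO: expected 30 responses
print(engagement)  # 0.1
```

An engagement of 1.0 would mean the patient submitted every expected response in the window; 0.1 here reflects 3 responses out of an expected 30.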
+ def calc_pro_engagement_at_specific_month(pro_df, pro_name, month_num, data):
+ """
+ Calculates PRO engagement per patient for a single specified month prior to
+ IndexDate.
+ """
+ # Calculate time in service according to type of PRO.
+ if pro_name == "EQ5D":
+ unit_val = 1
+ if pro_name == "MRC":
+ unit_val = 4
+ if (pro_name == "CAT") | (pro_name == "SymptomDiary"):
+ unit_val = 30
+
+ pro_df["SubmissionTime"] = pd.to_datetime(pro_df["SubmissionTime"], utc=True)
+ pro_engagement = pro_df.copy()
+ pro_engagement["TimeSinceSubmission"] = (
+ pro_engagement["IndexDate"] - pro_engagement["SubmissionTime"]
+ ).dt.days
+
+ # Only include PRO responses for the month specified
+ # Calculate the number of months between index date and specified month
+ months_between_index_and_specified = month_num - 1
+ pro_engagement = pro_engagement[
+ pro_engagement["TimeSinceSubmission"].between(
+ (months_between_index_and_specified * 30),
+ (month_num * 30),
+ inclusive="both",
+ )
+ ]
+
+ # Calculate number of PRO responses within specified time window
+ pro_engagement = (
+ pro_engagement.groupby(["StudyId", "IndexDate"])
+ .count()[["PatientId"]]
+ .reset_index()
+ )
+ pro_engagement = pro_engagement.rename(
+ columns={"PatientId": "ResponseCountMonth" + str(month_num)}
+ )
+ pro_engagement["Engagement" + pro_name + "Month" + str(month_num)] = round(
+ pro_engagement["ResponseCountMonth" + str(month_num)] / (1 * unit_val),
+ 2,
+ )
+ pro_engagement = data[["StudyId", "IndexDate"]].merge(
+ pro_engagement, on=["StudyId", "IndexDate"], how="left"
+ )
+
+ # Fill N/As with 0 as no engagement was observed for those patients
+ pro_engagement = pro_engagement.fillna(0)
+ return pro_engagement
+
+
+ def calc_last_pro_score(pro_df, pro_name):
+ """
+ Calculates the most recent PRO response. The latest PRO score is required to be
+ within 365 days of the index date to allow recency of data without having many
+ missing values.
+
+ Args:
+ pro_df (pd.DataFrame): dataframe containing the index dates and PRO response
+ submission dates.
+ pro_name (str): name of the PRO.
+
+ Returns:
+ pd.DataFrame: the input dataframe with additional columns stating the latest PRO
+ score for each PRO question.
+ """
+ # Calculate last PRO score
+ pro_df["TimeSinceSubmission"] = (
+ pro_df["IndexDate"] - pro_df["SubmissionTime"]
+ ).dt.days
+ pro_df = pro_df[pro_df["TimeSinceSubmission"] > 0]
+ pro_df = pro_df.sort_values(
+ by=["StudyId", "IndexDate", "TimeSinceSubmission"], ascending=True
+ )
+ latest_pro = pro_df.drop_duplicates(subset=["StudyId", "IndexDate"], keep="first")
+
+ # Ensure that the latest PRO score is within 365 days of the index date
+ latest_pro = latest_pro[latest_pro["TimeSinceSubmission"] <= 365]
+
+ # Select specific columns
+ question_cols = latest_pro.columns[
+ latest_pro.columns.str.startswith(pro_name)
+ ].tolist()
+ question_cols.extend(
+ ["StudyId", "IndexDate", "Score", "SubmissionTime", "TimeSinceSubmission"]
+ )
+ latest_pro = latest_pro[question_cols]
+
+ # if pro_name == "EQ5D":
+ # median_val_q1 = latest_pro["EQ5DScoreWithoutQ6"].median()
+ # print(median_val_q1)
+ # latest_pro = weigh_features_by_recency(
+ # df=latest_pro,
+ # feature="EQ5DScoreWithoutQ6",
+ # feature_recency_days="TimeSinceSubmission",
+ # median_value=median_val_q1,
+ # decay_rate=0.001,
+ # )
+ # print(latest_pro.columns)
+ #
+ # # Add prefix to question columns
+ # cols_to_rename = latest_pro.columns[
+ # ~latest_pro.columns.isin(
+ # ["StudyId", "IndexDate", "Score", "SubmissionTime"]
+ # )
+ # ]
+ # latest_pro = latest_pro.rename(
+ # columns=dict(zip(cols_to_rename, "Latest" + cols_to_rename))
+ # )
+ #
+ # # Rename columns where prefix not added
+ # latest_pro = latest_pro.rename(
+ # columns={
+ # "Score": "Latest" + pro_name + "Score",
+ # "SubmissionTime": "LatestPRODate",
+ # }
+ # )
+ #
+ # elif pro_name == "MRC":
+ # median_val_q1 = latest_pro["Score"].median()
+ # print(median_val_q1)
+ # latest_pro = weigh_features_by_recency(
+ # df=latest_pro,
+ # feature="Score",
+ # feature_recency_days="TimeSinceSubmission",
+ # median_value=median_val_q1,
+ # decay_rate=0.001,
+ # )
+ # print(latest_pro.columns)
+
+ # # Add prefix to question columns
+ # cols_to_rename = latest_pro.columns[
+ # ~latest_pro.columns.isin(
+ # ["StudyId", "IndexDate", "Score", "SubmissionTime", "ScoreWeighted"]
+ # )
+ # ]
+ # latest_pro = latest_pro.rename(
+ # columns=dict(zip(cols_to_rename, "Latest" + cols_to_rename))
+ # )
+
+ # # Rename columns where prefix not added
+ # latest_pro = latest_pro.rename(
+ # columns={
+ # "Score": "Latest" + pro_name + "Score",
+ # "ScoreWeighted": "Latest" + pro_name + "ScoreWeighted",
+ # "SubmissionTime": "LatestPRODate",
+ # }
+ # )
+
+ # else:
+ # Add prefix to question columns
+ cols_to_rename = latest_pro.columns[
+ ~latest_pro.columns.isin(["StudyId", "IndexDate", "Score", "SubmissionTime"])
+ ]
+ latest_pro = latest_pro.rename(
+ columns=dict(zip(cols_to_rename, "Latest" + cols_to_rename))
+ )
+
+ # Rename columns where prefix not added
+ latest_pro = latest_pro.rename(
+ columns={
+ "Score": "Latest" + pro_name + "Score",
+ "SubmissionTime": "LatestPRODate",
+ }
+ )
+
+ pro_df = pro_df.merge(latest_pro, on=["StudyId", "IndexDate"], how="left")
+ return pro_df
+
+
+ def calc_pro_score_prior_to_latest(pro_df, pro_name, time_prior_to_latest=60):
+ """
+ Finds the PRO score prior to the latest PRO score before index date.
+
+ Args:
+ pro_df (pd.DataFrame): dataframe containing the latest PRO score and PRO
+ response submission dates.
+ pro_name (str): name of the PRO.
+ time_prior_to_latest (int, optional): time period before latest PRO score in
+ days. Default time frame set to 60 days (two months).
+
+ Returns:
+ pd.DataFrame: the input dataframe with additional columns stating the previous
+ score closest to the latest PRO score for each PRO question.
+ """
+ pro_previous = pro_df.copy()
+ pro_previous = pro_previous[
+ pro_previous["SubmissionTime"] < pro_previous["LatestPRODate"]
+ ]
+ pro_previous = pro_previous.sort_values(
+ by=["StudyId", "IndexDate", "SubmissionTime"], ascending=[True, True, False]
+ )
+ pro_previous = pro_previous.drop_duplicates(
+ subset=["StudyId", "IndexDate"], keep="first"
+ )
+
+ # Make sure that the previous score is within time_prior_to_latest days of the
+ # LatestPRODate
+ pro_previous["TimeSinceLatestPRODate"] = (
+ pro_previous["LatestPRODate"] - pro_previous["SubmissionTime"]
+ ).dt.days
+ pro_previous = pro_previous[
+ pro_previous["TimeSinceLatestPRODate"] <= time_prior_to_latest
+ ]
+
+ # Add prefix to question columns
+ cols_to_rename = [col for col in pro_previous if col.startswith(pro_name)]
+ cols_to_rename = pro_previous[cols_to_rename].columns
+ pro_previous = pro_previous.rename(
+ columns=dict(zip(cols_to_rename, "Prev" + cols_to_rename))
+ )
+ pro_previous = pro_previous[["StudyId", "IndexDate", "Score"]].join(
+ pro_previous.filter(regex="^Prev")
+ )
+ pro_previous = pro_previous.rename(columns={"Score": "Prev" + pro_name + "Score"})
+ pro_df = pro_df.merge(pro_previous, on=["StudyId", "IndexDate"], how="left")
+ return pro_df
+
+
+ def define_mapping_for_calcs(pro_name, questions, prefixes):
+ """
+ Defines the mapping for calculations between PRO responses.
+
+ Args:
+ pro_name (str): name of the PRO.
+ questions (list): question names of the PRO.
+ prefixes (list): prefixes to identify which columns to use in calculations. The
+ possible prefixes are: 'Avg', 'Prev', 'LongerAvg', 'WeekPrevAvg'.
+
+ Returns:
+ dict: mapping that maps columns for performing calculations.
+ """
+ # Create empty dictionary to append questions
+ mapping = defaultdict(list)
+
+ # Iterate through questions and create mapping for calculations
+ for question in questions:
+ if (pro_name == "EQ5D") | (pro_name == "MRC"):
+ map_key = "Latest" + pro_name + question
+ if (pro_name == "CAT") | (pro_name == "SymptomDiary"):
+ map_key = "WeekAvg" + pro_name + question
+ for prefix in prefixes:
+ mapping[map_key].append(prefix + pro_name + question)
+ return mapping
+
+
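The mapping built above keys each "latest" (or weekly-average) column to the comparison columns it should be differenced against. A small standalone reproduction for EQ5D with a reduced question list:

```python
from collections import defaultdict

# Same construction as define_mapping_for_calcs("EQ5D", ..., prefixes=["Avg"]),
# shown for two questions only.
mapping = defaultdict(list)
for question in ["Q1", "Score"]:
    map_key = "Latest" + "EQ5D" + question
    for prefix in ["Avg"]:
        mapping[map_key].append(prefix + "EQ5D" + question)

print(dict(mapping))
# {'LatestEQ5DQ1': ['AvgEQ5DQ1'], 'LatestEQ5DScore': ['AvgEQ5DScore']}
```

Using `defaultdict(list)` avoids having to check whether a key exists before appending; each key's value is a list so multiple prefixes can map to the same latest-score column.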
+ def calc_pro_average(pro_df, pro_name, time_window=None, avg_period=None):
+ """
+ Calculate the PRO average before the latest PRO score and within a specified time
+ window.
+
+ Args:
+ pro_df (pd.DataFrame): dataframe containing index dates and PRO submission
+ dates.
+ pro_name (str): name of the PRO.
+ time_window (int, optional): time window (in months) used for calculating the
+ average of PRO responses. Defaults to None.
+ avg_period (str, optional): identifies which prefix to add to output columns.
+ Defaults to None.
+
+ Returns:
+ pd.DataFrame: the input dataframe with additional columns with the calculated
+ averages.
+ """
+ # Calculate average in PRO responses for the time window specified prior to the
+ # index date
+ pro_df = pro_df.loc[
+ :,
+ ~(
+ pro_df.columns.str.startswith("Avg")
+ | pro_df.columns.str.startswith("Longer")
+ ),
+ ]
+
+ if avg_period is None:
+ prefix = "Avg"
+ pro_df["AvgStartDate"] = pro_df["IndexDate"] - pd.DateOffset(months=time_window)
+ avg_pro = pro_df[
+ (pro_df["SubmissionTime"] >= pro_df["AvgStartDate"])
+ & (pro_df["SubmissionTime"] < pro_df["LatestPRODate"])
+ ]
+ else:
+ pro_df["WeekStartDate"] = pro_df["IndexDate"] - pd.DateOffset(weeks=1)
+ pro_df["WeekPrevStartDate"] = pro_df["WeekStartDate"] - pd.DateOffset(weeks=1)
+
+ # When looking at daily PROs, three averages are calculated:
+ # The weekly average is the average of PRO scores in the week prior to IndexDate
+ if avg_period == "WeeklyAvg":
+ prefix = "WeekAvg"
+ avg_pro = pro_df[
+ (pro_df["SubmissionTime"] >= pro_df["WeekStartDate"])
+ & (pro_df["SubmissionTime"] <= pro_df["IndexDate"])
+ ]
+ # The weekly previous average is the average of PRO scores in the week prior to the
+ # WeeklyAvg. This is needed to calculate the difference of scores between the most
+ # recent week and the week before that
+ elif avg_period == "WeekPrevAvg":
+ prefix = "WeekPrevAvg"
+ avg_pro = pro_df[
+ (pro_df["SubmissionTime"] >= pro_df["WeekPrevStartDate"])
+ & (pro_df["SubmissionTime"] < pro_df["WeekStartDate"])
+ ]
+ # The longer average is calculated over the time window specified prior to the
+ # WeekStartDate
+ elif avg_period == "LongerAvg":
+ prefix = "LongerAvg"
+ pro_df["AvgStartDate"] = pro_df["IndexDate"] - pd.DateOffset(months=time_window)
+ avg_pro = pro_df[
+ (pro_df["SubmissionTime"] >= pro_df["AvgStartDate"])
+ & (pro_df["SubmissionTime"] < pro_df["WeekStartDate"])
+ ]
+
+ # Select specific columns
+ cols_required = avg_pro.columns[avg_pro.columns.str.startswith(pro_name)].tolist()
+ cols_required.extend(["StudyId", "IndexDate", "Score"])
+ avg_pro = avg_pro[cols_required]
+
+ # Calculate average pro scores
+ avg_pro = avg_pro.groupby(["StudyId", "IndexDate"]).mean().reset_index()
+
+ # Add prefix to question columns
+ cols_to_rename = avg_pro.columns[
+ ~avg_pro.columns.isin(["StudyId", "IndexDate", "Score"])
+ ]
+ avg_pro = avg_pro.rename(columns=dict(zip(cols_to_rename, prefix + cols_to_rename)))
+
+ # Rename columns where prefix not added
+ avg_pro = avg_pro.rename(columns={"Score": prefix + pro_name + "Score"})
+
+ # Merge average PRO with rest of the df
+ pro_df = pro_df.merge(avg_pro, on=["StudyId", "IndexDate"], how="left")
+ return pro_df
+
+
+ def calc_diff_pro_scores(pro_df, pro_name, latest_pro, other_pro, time_window=None):
+ """
+ Calculate the difference between PRO scores.
+
+ Args:
+ pro_df (pd.DataFrame): dataframe containing columns required for calculations.
+ pro_name (str): name of the PRO.
+ latest_pro (str): column name containing the latest PRO response for PROs EQ5D
+ and MRC, and the latest week average for PROs CAT and SymptomDiary.
+ other_pro (str): column name containing the other variable for calculating the
+ difference.
+ time_window (int, optional): time window (in months) used to specify which
+ column to use when calculating differences.
+
+ Returns:
+ pd.DataFrame: the input dataframe with additional columns with the calculated
+ differences.
+ """
+ # Remove prefix of score
+ split_feat_name = re.findall(r"[A-Z][^A-Z]*", latest_pro)
+
+ # Remove first element of list to get the base name of feature
+ split_feat_name.pop(0)
+
+ # Remove the second element in list if PRO is CAT or SymptomDiary
+ if pro_name in ["CAT", "SymptomDiary"]:
+ split_feat_name.pop(0)
+
+ # Combine remaining elements of list
+ stripped_feat_name = "".join(split_feat_name)
+
+ if time_window is None:
+ pro_df["DiffLatestPrev" + stripped_feat_name] = (
+ pro_df[latest_pro] - pro_df[other_pro]
+ )
+ else:
+ pro_df["DiffLatestAvg" + stripped_feat_name + "TW" + str(time_window)] = (
+ pro_df[latest_pro] - pro_df[other_pro]
+ )
+ return pro_df
+
+
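The prefix-stripping step in `calc_diff_pro_scores` splits a CamelCase column name on capital letters, drops the leading prefix token(s), and rejoins the rest to recover the base feature name. An isolated demonstration of that regex trick (wrapped in a helper for clarity; the function itself inlines the same steps):

```python
import re


def strip_prefix(feature_name, n_prefix_tokens=1):
    # "WeekAvgCATScore" -> ["Week", "Avg", "C", "A", "T", "Score"]
    tokens = re.findall(r"[A-Z][^A-Z]*", feature_name)
    # Drop the leading prefix token(s) and rejoin the remainder.
    return "".join(tokens[n_prefix_tokens:])


print(strip_prefix("LatestMRCQ1"))         # MRCQ1
print(strip_prefix("WeekAvgCATScore", 2))  # CATScore
```

Note that all-caps acronyms like `CAT` split into single-letter tokens, which is why the CAT/SymptomDiary branch pops a second token: the two-word `WeekAvg` prefix contributes two tokens rather than one.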
+ def calc_variation(pro_df, pro_name):
+ """
+ Calculate the variation (standard deviation) of PRO responses for a time window of
+ 1 month.
+
+ Args:
+ pro_df (pd.DataFrame): dataframe containing index dates and PRO submission
+ dates.
+ pro_name (str): name of the PRO.
+
+ Returns:
+ pd.DataFrame: the input dataframe with additional columns with the calculated
+ variation.
+ """
+ # Only calculate variation in the scores within 1 month before the IndexDate
+ if "TimeSinceSubmission" not in pro_df:
+ pro_df["TimeSinceSubmission"] = (
+ pro_df["IndexDate"] - pro_df["SubmissionTime"]
+ ).dt.days
+ pro_var = pro_df[
+ (pro_df["TimeSinceSubmission"] > 0) & (pro_df["TimeSinceSubmission"] <= 30)
+ ]
+
+ # Select specific columns
+ cols_required = pro_var.columns[pro_var.columns.str.startswith(pro_name)].tolist()
+ cols_required.extend(["StudyId", "IndexDate", "Score"])
+ pro_var = pro_var[cols_required]
+
+ # Calculate variation
+ pro_var = pro_var.groupby(["StudyId", "IndexDate"]).std().reset_index()
+
+ # Add prefix to question columns
+ cols_to_rename = pro_var.columns[
+ ~pro_var.columns.isin(["StudyId", "IndexDate", "Score"])
+ ]
+ pro_var = pro_var.rename(columns=dict(zip(cols_to_rename, "Var" + cols_to_rename)))
+
+ # Rename columns where prefix not added
+ pro_var = pro_var.rename(columns={"Score": "Var" + pro_name + "Score"})
+
+ # Merge back to main df
+ pro_df = pro_df.merge(pro_var, on=["StudyId", "IndexDate"], how="left")
+ return pro_df
+
+
+ def calc_sum_binary_vals(pro_df, binary_cols, time_window=1):
+ """
+ For SymptomDiary questions that contain binary values, calculate the sum of the
+ binary values for a specified time window.
+
+ Args:
+ pro_df (pd.DataFrame): dataframe containing index dates and PRO submission
+ dates.
+ binary_cols (list): column names for which the sum of binary values is
+ calculated.
+ time_window (int, optional): time window (in months) for which the sum of the
+ binary values is calculated. Defaults to 1.
+
+ Returns:
+ pd.DataFrame: a dataframe containing the sum of the binary values.
+ """
+ # Make sure only entries before the index date and after the time window start date
+ # are used
+ pro_df["TimeWindowStartDate"] = pro_df["IndexDate"] - pd.DateOffset(
+ months=time_window
+ )
+ pro_df = pro_df[
+ (pro_df["SubmissionTime"] >= pro_df["TimeWindowStartDate"])
+ & (pro_df["SubmissionTime"] <= pro_df["IndexDate"])
+ ]
+ sum_df = pro_df.groupby(["StudyId", "IndexDate"])[binary_cols].sum()
+
+ # Rename columns
+ sum_df = sum_df.add_prefix("Sum")
+ sum_df = sum_df.add_suffix("TW" + str(time_window))
+ sum_df = sum_df.reset_index()
+ return sum_df
+
+
+ def scale_sum_to_response_rate(pro_df, sum, engagement_rate):
+ """
+ Scale the sum calculated using copd.calc_sum_binary_vals() to the response
+ rate to obtain a feature that is comparable between patients.
+
+ Args:
+ pro_df (pd.DataFrame): dataframe containing the columns for scaling the sum by
+ the engagement rate.
+ sum (str): column name that contains the data for the sum of the binary values.
+ engagement_rate (str): column name that contains the data for the response rate.
+
+ Returns:
+ pd.DataFrame: the input dataframe with an additional column with the scaled sum.
+ """
+ pro_df["Scaled" + sum] = pro_df[sum] / pro_df[engagement_rate]
+ return pro_df
+
+
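`scale_sum_to_response_rate` divides a symptom-flag sum by the patient's response rate so that patients with different engagement levels become comparable. A toy example with hypothetical column names (the real SymptomDiary columns may be named differently):

```python
import pandas as pd

# Two patients with the same raw symptom-flag sum but different engagement:
# a daily responder (1.0) versus an every-other-day responder (0.5).
df = pd.DataFrame({
    "SumBreathlessTW1": [6, 6],
    "EngagementSymptomDiaryTW1": [1.0, 0.5],
})
df["ScaledSumBreathlessTW1"] = (
    df["SumBreathlessTW1"] / df["EngagementSymptomDiaryTW1"]
)
print(df["ScaledSumBreathlessTW1"].tolist())  # [6.0, 12.0]
```

The half-engaged patient's 6 observed flags scale up to an estimated 12, reflecting that only half of their days were actually recorded.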
+ with open("./training/config.yaml", "r") as config:
+ config = yaml.safe_load(config)
+
+ # Specify which model to generate features for
+ model_type = config["model_settings"]["model_type"]
+
+ # Setup log file
+ log = open("./training/logging/process_pros_" + model_type + ".log", "w")
+ sys.stdout = log
+
+ # Dataset to process - set through config file
+ data_to_process = config["model_settings"]["data_to_process"]
+
+ # Load cohort data
+ if data_to_process == "forward_val":
+ data = pd.read_pickle("./data/patient_labels_forward_val_hosp_comm.pkl")
+ patient_details = pd.read_pickle("./data/patient_details_forward_val.pkl")
+ else:
+ data = pd.read_pickle("./data/patient_labels_" + model_type + ".pkl")
+ patient_details = pd.read_pickle("./data/patient_details.pkl")
+ data = data[["StudyId", "IndexDate"]]
+ patient_details = data.merge(
+ patient_details[["StudyId", "FirstSubmissionDate", "LatestPredictionDate"]],
+ on="StudyId",
+ how="left",
+ )
+
+ # Calculate the lookback start date. Will need this to aggregate data for model
+ # features
+ data["LookbackStartDate"] = data["IndexDate"] - pd.DateOffset(
+ days=config["model_settings"]["lookback_period"]
+ )
+
+ ############################################
+ # Monthly PROs - EQ5D
+ ############################################
+ pro_eq5d = pd.read_csv(config["inputs"]["raw_data_paths"]["pro_eq5d"], delimiter="|")
+ pro_eq5d = pro_eq5d.merge(
+ patient_details,
+ on="StudyId",
+ how="inner",
+ )
+
+ # EQ5DQ6 is a much less structured question compared to the other questions in EQ5D.
+ # A new score will be calculated using only EQ5DQ1-Q5 to prevent Q6 affecting the score
+ pro_eq5d["EQ5DScoreWithoutQ6"] = pro_eq5d.loc[:, "EQ5DQ1":"EQ5DQ5"].sum(axis=1)
+
+ # Calculate engagement over service
+ pro_eq5d = calc_total_pro_engagement(pro_eq5d, "EQ5D")
+
+ # Calculate engagement for a time window of 1 month (time window chosen based on signal
+ # output observed from results of feature_eng_multiple_testing)
+ pro_eq5d_engagement = calc_pro_engagement_in_time_window(
+ pro_eq5d, "EQ5D", time_window=1, data=data
+ )
+ pro_eq5d = pro_eq5d.merge(pro_eq5d_engagement, on=["StudyId", "IndexDate"], how="left")
+
+ # Calculate last PRO score
+ pro_eq5d = calc_last_pro_score(pro_eq5d, "EQ5D")
+
+ # Mapping to calculate the difference between the latest PRO scores and the average
+ # PRO score
+ question_names_eq5d = ["Q1", "Q2", "Q3", "Q4", "Q5", "Q6", "Score", "ScoreWithoutQ6"]
+ mapping_eq5d = define_mapping_for_calcs("EQ5D", question_names_eq5d, prefixes=["Avg"])
+
+ # Calculate average PRO score for a time window of 1 month prior to IndexDate,
+ # ignoring the latest PRO score
+ pro_eq5d = calc_pro_average(pro_eq5d, "EQ5D", time_window=1)
+ for key in mapping_eq5d:
+ calc_diff_pro_scores(pro_eq5d, "EQ5D", key, mapping_eq5d[key][0], time_window=1)
+
+ # Calculate variation of scores across 1 month
+ pro_eq5d = calc_variation(pro_eq5d, "EQ5D")
+
+ # Remove unwanted columns and duplicates
+ pro_eq5d = pro_eq5d.loc[
+ :,
+ ~(
+ pro_eq5d.columns.str.startswith("Avg")
+ | pro_eq5d.columns.str.startswith("EQ5D")
+ | pro_eq5d.columns.str.startswith("Response")
+ ),
+ ]
+ pro_eq5d = pro_eq5d.drop(
+ columns=[
+ "Score",
+ "SubmissionTime",
+ "FirstSubmissionDate",
+ "TimeInService",
+ "TimeSinceSubmission",
+ "LatestPredictionDate",
+ "LatestPRODate",
+ ]
+ )
+ pro_eq5d = pro_eq5d.drop_duplicates()
+
+ ############################################
+ # Weekly PROs - MRC
+ ############################################
+ pro_mrc = pd.read_csv(config["inputs"]["raw_data_paths"]["pro_mrc"], delimiter="|")
+ pro_mrc = pro_mrc.merge(
+ patient_details,
+ on="StudyId",
+ how="inner",
+ )
+
+ # Calculate engagement over service
+ pro_mrc = calc_total_pro_engagement(pro_mrc, "MRC")
+
+ # Calculate engagement for a time window of 1 month
+ pro_mrc_engagement = calc_pro_engagement_in_time_window(
+ pro_mrc, "MRC", time_window=1, data=data
+ )
+ pro_mrc = pro_mrc.merge(pro_mrc_engagement, on=["StudyId", "IndexDate"], how="left")
+
+ # Calculate last PRO score
+ pro_mrc = calc_last_pro_score(pro_mrc, "MRC")
+
+ # Mapping to calculate the difference between the latest PRO scores and the average
+ # PRO score
+ question_names_mrc = ["Q1"]
+ mapping_mrc = define_mapping_for_calcs("MRC", question_names_mrc, prefixes=["Avg"])
+
+ # Calculate average PRO score for a time window of 1 month prior to IndexDate,
+ # ignoring the latest PRO score
+ pro_mrc = calc_pro_average(pro_mrc, "MRC", time_window=1)
+ for key in mapping_mrc:
+ calc_diff_pro_scores(pro_mrc, "MRC", key, mapping_mrc[key][0], time_window=1)
+
+ # Calculate variation of scores across 1 month
+ pro_mrc = calc_variation(pro_mrc, "MRC")
+
+ # Remove unwanted columns and duplicates
+ pro_mrc = pro_mrc.loc[
+ :,
+ ~(
+ pro_mrc.columns.str.startswith("Avg")
+ | pro_mrc.columns.str.startswith("MRC")
+ | pro_mrc.columns.str.startswith("Response")
+ ),
+ ]
+ pro_mrc = pro_mrc.drop(
+ columns=[
+ "SubmissionTime",
+ "Score",
+ "FirstSubmissionDate",
+ "TimeInService",
+ "TimeSinceSubmission",
+ "LatestPredictionDate",
+ "LatestPRODate",
+ ]
+ )
+ pro_mrc = pro_mrc.drop_duplicates()
+
+ ############################################
+ # Daily PROs - CAT
+ ############################################
+ pro_cat_full = pd.read_csv(config["inputs"]["raw_data_paths"]["pro_cat"], delimiter="|")
+ pro_cat = pro_cat_full.merge(
+ patient_details,
+ on="StudyId",
+ how="inner",
+ )
+
+ # Calculate engagement over service
+ pro_cat = calc_total_pro_engagement(pro_cat, "CAT")
+
+ # Calculate engagement for a time window of 1 month
+ pro_cat_engagement = calc_pro_engagement_in_time_window(
+ pro_cat, "CAT", time_window=1, data=data
+ )
+ pro_cat = pro_cat.merge(pro_cat_engagement, on=["StudyId", "IndexDate"], how="left")
+
+ # Calculate engagement in the month before the most recent month prior to the index
+ # date
+ pro_cat_month1 = calc_pro_engagement_at_specific_month(
+ pro_cat, "CAT", month_num=1, data=data
+ )
+ pro_cat_month2 = calc_pro_engagement_at_specific_month(
762
+ pro_cat, "CAT", month_num=2, data=data
763
+ )
764
+ pro_cat_month3 = calc_pro_engagement_at_specific_month(
765
+ pro_cat, "CAT", month_num=3, data=data
766
+ )
767
+ pro_cat = pro_cat.merge(pro_cat_month1, on=["StudyId", "IndexDate"], how="left")
768
+ pro_cat = pro_cat.merge(pro_cat_month2, on=["StudyId", "IndexDate"], how="left")
769
+ pro_cat = pro_cat.merge(pro_cat_month3, on=["StudyId", "IndexDate"], how="left")
770
+ pro_cat["EngagementDiffMonth1and2"] = (
771
+ pro_cat["EngagementCATMonth1"] - pro_cat["EngagementCATMonth2"]
772
+ )
773
+ pro_cat["EngagementDiffMonth1and3"] = (
774
+ pro_cat["EngagementCATMonth1"] - pro_cat["EngagementCATMonth3"]
775
+ )
776
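The month-on-month engagement differencing above can be sketched on a toy example. `calc_pro_engagement_at_specific_month` is assumed here to count submissions in each 30-day window before the index date; the window logic and all dates below are illustrative, not the library's actual implementation.

```python
import pandas as pd

# One patient's (made-up) submission timestamps and index date
index_date = pd.Timestamp("2023-06-30")
submissions = pd.to_datetime(
    ["2023-06-25", "2023-06-10", "2023-05-20", "2023-05-05", "2023-04-15"]
)

def engagement_in_month(month_num):
    # month_num=1 is the 30 days immediately before the index date,
    # month_num=2 the 30 days before that, and so on
    end = index_date - pd.DateOffset(days=30 * (month_num - 1))
    start = index_date - pd.DateOffset(days=30 * month_num)
    return int(((submissions > start) & (submissions <= end)).sum())

month1, month2, month3 = (engagement_in_month(m) for m in (1, 2, 3))
diff_1_2 = month1 - month2  # mirrors EngagementDiffMonth1and2
print(month1, month2, month3, diff_1_2)
```

A falling difference (fewer submissions in the most recent month than before) is the kind of disengagement signal these features are built to capture.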
+
+ # Calculate PRO average for the week before the index date
+ pro_cat = calc_pro_average(pro_cat, "CAT", avg_period="WeeklyAvg")
+
+ # Calculate variation of scores across 1 month
+ pro_cat = calc_variation(pro_cat, "CAT")
+
+ # Remove unwanted columns and duplicates
+ pro_cat = pro_cat.loc[
+     :,
+     ~(
+         pro_cat.columns.str.startswith("CAT")
+         | pro_cat.columns.str.startswith("Response")
+     ),
+ ]
+ pro_cat = pro_cat.drop(
+     columns=[
+         "Score",
+         "SubmissionTime",
+         "FirstSubmissionDate",
+         "TimeSinceSubmission",
+         "LatestPredictionDate",
+         "TimeInService",
+         "WeekStartDate",
+         "WeekPrevStartDate",
+     ]
+ )
+ pro_cat = pro_cat.drop_duplicates()
+
+ ############################################
+ # Daily PROs - Symptom Diary
+ ############################################
+
+ # The Symptom Diary has some questions that are numeric and some that are categorical
+ pro_sd_full = pd.read_csv(
+     config["inputs"]["raw_data_paths"]["pro_symptom_diary"], delimiter="|"
+ )
+ pro_sd = pro_sd_full.merge(
+     patient_details,
+     on="StudyId",
+     how="inner",
+ )
+
+ # Calculate engagement over service
+ pro_sd = calc_total_pro_engagement(pro_sd, "SymptomDiary")
+ pro_sd_engagement = pro_sd[
+     ["StudyId", "PatientId", "IndexDate", "TotalEngagementSymptomDiary"]
+ ]
+
+ # Calculate engagement for 1 month prior to IndexDate
+ pro_sd_engagement_tw = calc_pro_engagement_in_time_window(
+     pro_sd, "SymptomDiary", time_window=1, data=data
+ )
+ pro_sd_engagement = pro_sd_engagement.merge(
+     pro_sd_engagement_tw, on=["StudyId", "IndexDate"], how="left"
+ )
+ pro_sd_engagement = pro_sd_engagement.drop_duplicates()
+
+ ###############################
+ # Categorical questions
+ # (Q8, Q9, Q10)
+ ###############################
+ pro_cat_q5 = pro_cat_full[["StudyId", "SubmissionTime", "CATQ5"]]
+ pro_sd_categ = pro_sd_full[
+     [
+         "StudyId",
+         "SubmissionTime",
+         "SymptomDiaryQ8",
+         "SymptomDiaryQ9",
+         "SymptomDiaryQ10",
+         "Score",
+     ]
+ ]
+
+ # Extract the submission date from the timestamp, as same-day entries in CAT
+ # and SymptomDiary have different timestamps
+ for df in [pro_cat_q5, pro_sd_categ]:
+     df["Date"] = pd.to_datetime(df["SubmissionTime"], utc=True).dt.date
+ pro_sd_cat = pro_sd_categ.merge(pro_cat_q5, on=["StudyId", "Date"], how="outer")
+
+ # If CATQ5 is 0, then Symptom Diary questions 8, 9 and 10 don't get asked. Add this as
+ # an option to the columns. There are some cases where patients have a 0 in CATQ5 but
+ # have also answered Symptom Diary questions 8, 9 and 10 - keep these answers as is.
+ for col in ["SymptomDiaryQ8", "SymptomDiaryQ9", "SymptomDiaryQ10"]:
+     pro_sd_cat[col] = np.where(
+         (pro_sd_cat["CATQ5"] == 0) & (pro_sd_cat[col].isna()),
+         "Question Not Asked",
+         pro_sd_cat[col],
+     )
+
+ # Calculate the most recent score for SymptomDiary categorical questions
+ pro_sd_cat = pro_sd_cat.merge(data[["StudyId", "IndexDate"]], on="StudyId", how="inner")
+ pro_sd_cat = pro_sd_cat.rename(columns={"SubmissionTime_x": "SubmissionTime"})
+ pro_sd_cat["SubmissionTime"] = pd.to_datetime(pro_sd_cat["SubmissionTime"], utc=True)
+ pro_sd_cat = calc_last_pro_score(pro_sd_cat, "SymptomDiary")
+
+ pro_sd_cat = pro_sd_cat.drop(
+     columns=[
+         "SubmissionTime",
+         "SubmissionTime_y",
+         "CATQ5",
+         "SymptomDiaryQ8",
+         "SymptomDiaryQ9",
+         "Date",
+         "SymptomDiaryQ10",
+         "Score",
+         "LatestSymptomDiaryScore",
+         "LatestPRODate",
+         "TimeSinceSubmission",
+     ]
+ )
+ pro_sd_cat = pro_sd_cat.drop_duplicates()
+
+ ###############################
+ # Numeric questions
+ # (Q1, Q2)
+ # Q3 included for comparison
+ ###############################
+ # Calculate PRO average for the week before the index date
+ pro_sd_numeric = pro_sd[
+     [
+         "StudyId",
+         "PatientId",
+         "IndexDate",
+         "SubmissionTime",
+         "Score",
+         "SymptomDiaryQ1",
+         "SymptomDiaryQ2",
+         "SymptomDiaryQ3",
+     ]
+ ]
+ pro_sd_numeric = calc_pro_average(
+     pro_sd_numeric, "SymptomDiary", avg_period="WeeklyAvg"
+ )
+
+ # Calculate variation of scores across 1 month
+ pro_sd_numeric = calc_variation(pro_sd_numeric, "SymptomDiary")
+
+ ###############################
+ # Binary questions
+ # (Q3)
+ ###############################
+ # Calculate sum of binary values for a time window of 1 month
+ sd_sum_all = pro_sd_numeric[["StudyId", "IndexDate"]]
+ sd_sum_all = sd_sum_all.drop_duplicates()
+ sd_sum = calc_sum_binary_vals(
+     pro_sd_numeric, binary_cols=["SymptomDiaryQ3"], time_window=1
+ )
+ sd_sum_all = sd_sum_all.merge(sd_sum, on=["StudyId", "IndexDate"], how="left")
+
+ # Scale sums by how often patients responded
+ sd_sum_all = sd_sum_all.merge(
+     pro_sd_engagement, on=["StudyId", "IndexDate"], how="left"
+ )
+ mapping_scaling = {"SumSymptomDiaryQ3TW1": "EngagementSymptomDiaryTW1"}
+ for key in mapping_scaling:
+     scale_sum_to_response_rate(sd_sum_all, key, mapping_scaling[key])
+
+ # Combine numeric, categorical and binary dfs
+ pro_sd_all = pro_sd_numeric.merge(
+     sd_sum_all, on=["StudyId", "PatientId", "IndexDate"], how="left"
+ )
+ pro_sd_all = pro_sd_all.merge(pro_sd_cat, on=["StudyId", "IndexDate"], how="left")
+
+ # Remove unwanted columns from the numeric df
+ pro_sd_all = pro_sd_all.loc[
+     :,
+     ~(
+         pro_sd_all.columns.str.startswith("Symptom")
+         | pro_sd_all.columns.str.startswith("Sum")
+         | pro_sd_all.columns.str.startswith("Response")
+     ),
+ ]
+ pro_sd_all = pro_sd_all.drop(
+     columns=[
+         "Score",
+         "SubmissionTime",
+         "TimeWindowStartDate",
+         "WeekStartDate",
+         "WeekPrevStartDate",
+         "TimeSinceSubmission",
+     ]
+ )
+ pro_sd_all = pro_sd_all.drop_duplicates()
+
+ # Combine PROs into one df
+ pro_df = pro_eq5d.merge(pro_mrc, on=["StudyId", "PatientId", "IndexDate"], how="left")
+ pro_df = pro_df.merge(pro_cat, on=["StudyId", "PatientId", "IndexDate"], how="left")
+ pro_df = pro_df.merge(pro_sd_all, on=["StudyId", "PatientId", "IndexDate"], how="left")
+
+ ###############################
+ # Map some categorical features
+ ###############################
+
+ # Replace SDQ8 codes with strings for phlegm difficulty
+ q8_dict = {
+     "1.0": "Not difficult",
+     "2.0": "A little difficult",
+     "3.0": "Quite difficult",
+     "4.0": "Very difficult",
+ }
+ for key in q8_dict:
+     pro_df["LatestSymptomDiaryQ8"] = pro_df["LatestSymptomDiaryQ8"].str.replace(
+         key, q8_dict[key]
+     )
+
+ # Replace SDQ9 codes with strings for phlegm consistency
+ q9_dict = {
+     "1.0": "Watery",
+     "2.0": "Sticky liquid",
+     "3.0": "Semi-solid",
+     "4.0": "Solid",
+ }
+ for key in q9_dict:
+     pro_df["LatestSymptomDiaryQ9"] = pro_df["LatestSymptomDiaryQ9"].str.replace(
+         key, q9_dict[key]
+     )
+
+ # Replace SDQ10 codes with strings for phlegm colour
+ q10_dict = {
+     "1.0": "White",
+     "2.0": "Yellow",
+     "3.0": "Green",
+     "4.0": "Dark green",
+ }
+ for key in q10_dict:
+     pro_df["LatestSymptomDiaryQ10"] = pro_df["LatestSymptomDiaryQ10"].str.replace(
+         key, q10_dict[key]
+     )
+
+ pro_df = pro_df.drop(
+     columns=[
+         "PatientId",
+         "LatestTimeSinceSubmission",
+         "LatestTimeSinceSubmission_x",
+         "LatestTimeSinceSubmission_y",
+     ]
+ )
+
+ # Save data
+ os.makedirs(config["outputs"]["processed_data_dir"], exist_ok=True)
+ if data_to_process == "forward_val":
+     pro_df.to_pickle(
+         os.path.join(
+             config["outputs"]["processed_data_dir"],
+             "pros_forward_val_" + model_type + ".pkl",
+         )
+     )
+ else:
+     pro_df.to_pickle(
+         os.path.join(
+             config["outputs"]["processed_data_dir"],
+             "pros_" + model_type + ".pkl",
+         )
+     )
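The code-to-label mapping for the Symptom Diary questions above uses `Series.str.replace` in a loop. A minimal sketch with made-up data shows the idea; passing `regex=False` makes the `.` in codes like `"1.0"` match literally rather than as a regex wildcard.

```python
import pandas as pd

# Toy column mimicking LatestSymptomDiaryQ8: stringified numeric codes plus
# the "Question Not Asked" marker added earlier in the pipeline
q8_dict = {
    "1.0": "Not difficult",
    "2.0": "A little difficult",
    "3.0": "Quite difficult",
    "4.0": "Very difficult",
}
s = pd.Series(["1.0", "3.0", "Question Not Asked", None])
for code, label in q8_dict.items():
    # regex=False matches the code as a literal string; with a regex the "."
    # would act as a wildcard and could rewrite unintended values
    s = s.str.replace(code, label, regex=False)
print(s.tolist())
```

Values that match no code (markers, missing entries) pass through unchanged, which is the behaviour the pipeline relies on.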
training/process_spirometry.py ADDED
@@ -0,0 +1,116 @@
+ """
+ Derive features from spirometry for 2 models:
+ Parallel model 1: uses both hospital and community exacerbation events
+ Parallel model 2: uses only hospital exacerbation events
+ """
+
+ import numpy as np
+ import pandas as pd
+ import sys
+ import os
+ import yaml
+ import model_h
+
+ with open("./training/config.yaml", "r") as config:
+     config = yaml.safe_load(config)
+
+ # Specify which model to generate features for
+ model_type = config["model_settings"]["model_type"]
+
+ # Setup log file
+ log = open("./training/logging/process_spirometry_" + model_type + ".log", "w")
+ sys.stdout = log
+
+ # Dataset to process - set through config file
+ data_to_process = config["model_settings"]["data_to_process"]
+
+ # Load cohort data
+ if data_to_process == "forward_val":
+     data = pd.read_pickle("./data/patient_labels_forward_val_hosp_comm.pkl")
+     patient_details = pd.read_pickle("./data/patient_details_forward_val.pkl")
+ else:
+     data = pd.read_pickle("./data/patient_labels_" + model_type + ".pkl")
+     patient_details = pd.read_pickle("./data/patient_details.pkl")
+ data = data[["StudyId", "IndexDate"]]
+ patient_details = data.merge(
+     patient_details[["StudyId", "PatientId"]],
+     on="StudyId",
+     how="left",
+ )
+
+ copd_status = pd.read_csv(
+     config["inputs"]["raw_data_paths"]["copd_status"], delimiter="|"
+ )
+
+ copd_status = patient_details.merge(copd_status, on="PatientId", how="left")
+ copd_status["LungFunction_Date"] = pd.to_datetime(
+     copd_status["LungFunction_Date"], utc=True
+ )
+ copd_status["TimeSinceLungFunc"] = (
+     copd_status["IndexDate"] - copd_status["LungFunction_Date"]
+ ).dt.days
+ print(
+     "COPD Status Details: Number of patients with a lung function date < 1 year \
+ from index date: {} of {}".format(
+         len(copd_status[copd_status["TimeSinceLungFunc"] < 365]), len(patient_details)
+     )
+ )
+ copd_status = copd_status[
+     [
+         "StudyId",
+         "IndexDate",
+         "RequiredAcuteNIV",
+         "RequiredICUAdmission",
+         "LungFunction_FEV1PercentPredicted",
+         "LungFunction_FEV1Litres",
+         "LungFunction_FEV1FVCRatio",
+         "TimeSinceLungFunc",
+     ]
+ ]
+
+ # Map bool values
+ bool_mapping = {True: 1, False: 0}
+ copd_status["RequiredAcuteNIV"] = copd_status.RequiredAcuteNIV.map(bool_mapping)
+ copd_status["RequiredICUAdmission"] = copd_status.RequiredICUAdmission.map(bool_mapping)
+
+ # Convert columns in COPD Status to numeric
+ copd_status["LungFunction_FEV1PercentPredicted"] = copd_status[
+     "LungFunction_FEV1PercentPredicted"
+ ].str.replace("%", "")
+ for col in copd_status.drop(
+     columns=["StudyId", "IndexDate", "RequiredAcuteNIV", "RequiredICUAdmission"]
+ ).columns:
+     copd_status[col] = pd.to_numeric(copd_status[col])
+
+ # Bin patient spirometry at onboarding
+ spirometry_bins = [0, 30, 50, 80, np.inf]
+ spirometry_labels = ["Very severe", "Severe", "Moderate", "Mild"]
+ copd_status["FEV1PercentPredicted"] = model_h.bin_numeric_column(
+     col=copd_status["LungFunction_FEV1PercentPredicted"],
+     bins=spirometry_bins,
+     labels=spirometry_labels,
+ )
+ copd_status = copd_status.drop(columns=["LungFunction_FEV1PercentPredicted"])
+
+ # Assign patients without spirometry in service data to the Mild category
+ copd_status.loc[
+     copd_status["FEV1PercentPredicted"] == "nan", "FEV1PercentPredicted"
+ ] = "Mild"
+
+ # Save data
+ os.makedirs(config["outputs"]["processed_data_dir"], exist_ok=True)
+ if data_to_process == "forward_val":
+     copd_status.to_pickle(
+         os.path.join(
+             config["outputs"]["processed_data_dir"],
+             "spirometry_forward_val_" + model_type + ".pkl",
+         )
+     )
+ else:
+     copd_status.to_pickle(
+         os.path.join(
+             config["outputs"]["processed_data_dir"],
+             "spirometry_" + model_type + ".pkl",
+         )
+     )
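The spirometry severity binning in this script can be sketched with `pd.cut`. This is a hedged illustration: `model_h.bin_numeric_column` is assumed to behave like `pd.cut` with right-closed bins, and the FEV1 % predicted values below are made up.

```python
import numpy as np
import pandas as pd

# Same bins and labels as the script above
spirometry_bins = [0, 30, 50, 80, np.inf]
spirometry_labels = ["Very severe", "Severe", "Moderate", "Mild"]

# Made-up FEV1 % predicted values, including one missing reading
fev1 = pd.Series([25.0, 45.0, 65.0, 95.0, np.nan])
severity = pd.cut(fev1, bins=spirometry_bins, labels=spirometry_labels).astype(str)

# Mirror the script: patients without spirometry ("nan" after the string cast)
# are assigned to the Mild category
severity[severity == "nan"] = "Mild"
print(severity.tolist())
```

With right-closed bins, a value of exactly 30 falls in the "Very severe" bucket and 80 in "Moderate"; if `bin_numeric_column` closes bins on the left instead, the boundary cases shift by one category.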
training/pros_multiple_time_windows.py ADDED
@@ -0,0 +1,618 @@
+ """
+ Derive features from PRO responses for multiple time windows and select the time
+ window that gives the best signal.
+ """
+
+ import numpy as np
+ import pandas as pd
+ import model_h
+ import matplotlib.pyplot as plt
+ from collections import defaultdict
+
+
+ def create_cols_for_plotting(pro_name, question_col_names=None, var_engagement=False):
+     """
+     Create a mapping for the specified PRO questions that allows results from the
+     same question with different time windows to be plotted on the same grid. The
+     key of the dictionary is the PRO question (e.g. 'EQ5DQ1') and the values are a
+     list containing the column names to be plotted
+     (e.g. ['LatestEQ5DQ1', 'DiffLatestAvgEQ5DQ1TW1']).
+
+     Args:
+         pro_name (str): name of PRO.
+         question_col_names (list, optional): a list of question names required for
+             plotting. Defaults to None.
+         var_engagement (bool, optional): whether the variable to be plotted is
+             engagement. Defaults to False.
+
+     Returns:
+         dict of (str:list): dictionary containing the mapping, where each key maps
+             to a list of column names.
+     """
+     cols_for_plotting = defaultdict(list)
+
+     if var_engagement is False:
+         for question in question_col_names:
+             for time_window_num in range(1, 7):
+                 col_name = (
+                     "DiffLatestAvg" + pro_name + question + "TW" + str(time_window_num)
+                 )
+                 cols_for_plotting[pro_name + question].append(col_name)
+                 if (pro_name == "SymptomDiary") & (question == "Q3"):
+                     col_name = (
+                         "ScaledSum" + pro_name + question + "TW" + str(time_window_num)
+                     )
+                     cols_for_plotting["ScaledSum" + pro_name + question].append(
+                         col_name
+                     )
+             cols_for_plotting[pro_name + question].append(
+                 "DiffLatestPrev" + pro_name + question
+             )
+             if (pro_name == "EQ5D") | (pro_name == "MRC"):
+                 cols_for_plotting[pro_name + question].append(
+                     "Latest" + pro_name + question
+                 )
+             if (pro_name == "CAT") | (pro_name == "SymptomDiary"):
+                 cols_for_plotting[pro_name + question].append(
+                     "WeekAvg" + pro_name + question
+                 )
+
+     if var_engagement is True:
+         for time_window_num in range(1, 7):
+             col_name = "Engagement" + pro_name + "TW" + str(time_window_num)
+             cols_for_plotting[pro_name].append(col_name)
+     return cols_for_plotting
+
+
+ def plot_feature_signal(
+     data, nrows, ncols, figsize, cols_to_plot, fig_name, outcome="ExacWithin3Months"
+ ):
+     """
+     Plot boxplots for the specified columns, placing them on the same grid when
+     multiple columns are given.
+
+     Args:
+         data (pd.DataFrame): dataframe containing all data to plot and the outcome
+             column.
+         nrows (int): number of rows for the subplot grid.
+         ncols (int): number of columns for the subplot grid.
+         figsize (tuple): (width, height) in inches.
+         cols_to_plot (list): column names to plot.
+         fig_name (str): name of the figure, required to save the figure.
+         outcome (str, optional): name of the column to group values by for plotting.
+             Defaults to 'ExacWithin3Months'.
+
+     Returns:
+         None.
+     """
+     fig, ax = plt.subplots(nrows, ncols, figsize=figsize)
+     if (nrows > 1) | (ncols > 1):
+         ax = ax.flatten()
+         for i, col in enumerate(cols_to_plot):
+             data.boxplot(
+                 col,
+                 outcome,
+                 ax=ax[i],
+                 flierprops={"markersize": 2},
+                 medianprops={"color": "black"},
+                 # boxprops={"color": "black"},
+             )
+     else:
+         for i, col in enumerate(cols_to_plot):
+             data.boxplot(
+                 col,
+                 outcome,
+                 flierprops={"markersize": 2},
+                 medianprops={"color": "black"},
+             )
+     plt.tight_layout()
+     plt.savefig("./plots/boxplots/" + fig_name + ".png")
+     plt.close()
110
+
111
+ data = pd.read_pickle("./data/patient_labels_hosp_comm.pkl")
112
+ patient_details = pd.read_pickle("./data/patient_details.pkl")
113
+
114
+ data = data.merge(
115
+ patient_details[["StudyId", "FirstSubmissionDate", "LatestPredictionDate"]],
116
+ on="StudyId",
117
+ how="left",
118
+ )
119
+
120
+ # Calculate the lookback start date. Will need this to aggreggate data for model
121
+ # features
122
+ data["LookbackStartDate"] = data["IndexDate"] - pd.DateOffset(days=180)
123
+
124
+ ############################################################################
125
+ # Derive features from PRO responses
126
+ ############################################################################
127
+ ############################################
128
+ # Monthly PROs - EQ5D
129
+ ############################################
130
+ pro_eq5d = pd.read_csv("<YOUR_DATA_PATH>/copd-dataset/CopdDatasetProEQ5D.txt", delimiter="|")
131
+ pro_eq5d = pro_eq5d.merge(
132
+ data[["StudyId", "IndexDate", "FirstSubmissionDate", "LatestPredictionDate"]],
133
+ on="StudyId",
134
+ how="inner",
135
+ )
136
+
137
+ # EQ5DQ6 is a much less structured question compared to the other questions in EQ5D. A
138
+ # new score will be calculated using only EQ5DQ1-Q5 to prevent Q6 affecting the score.
139
+ pro_eq5d["EQ5DScoreWithoutQ6"] = pro_eq5d.loc[:, "EQ5DQ1":"EQ5DQ5"].sum(axis=1)
140
+
141
+ # Calculate engagement over service
142
+ pro_eq5d = model_h.calc_total_pro_engagement(pro_eq5d, "EQ5D")
143
+
144
+ # Calculate engagement over multiple time windows
145
+ for time_window in range(1, 7):
146
+ pro_eq5d_engagement = model_h.calc_pro_engagement_in_time_window(
147
+ pro_eq5d, "EQ5D", time_window=time_window, data=data
148
+ )
149
+ pro_eq5d = pro_eq5d.merge(
150
+ pro_eq5d_engagement, on=["StudyId", "IndexDate"], how="left"
151
+ )
152
+
153
+ # Calculate last PRO score
154
+ pro_eq5d = model_h.calc_last_pro_score(pro_eq5d, "EQ5D")
155
+
156
+ # Calculate the PRO score prior to the last PRO score
157
+ pro_eq5d = model_h.calc_pro_score_prior_to_latest(pro_eq5d, "EQ5D")
158
+
159
+ #############################
160
+ # Scores across time windows
161
+ #############################
162
+ # Mapping to calculate the difference between the latest PRO scores and both the average
163
+ # and previous PRO score
164
+ question_names_eq5d = ["Q1", "Q2", "Q3", "Q4", "Q5", "Q6", "Score", "ScoreWithoutQ6"]
165
+ mapping_eq5d = model_h.define_mapping_for_calcs(
166
+ "EQ5D", question_names_eq5d, prefixes=["Avg", "Prev"]
167
+ )
168
+
169
+ # Calculate average PRO score across various time windows (months) prior to IndexDate,
170
+ # ignoring the latest PRO score
171
+ for time_window in range(1, 7):
172
+ pro_eq5d = model_h.calc_pro_average(pro_eq5d, "EQ5D", time_window=time_window)
173
+ for key in mapping_eq5d:
174
+ model_h.calc_diff_pro_scores(
175
+ pro_eq5d, "EQ5D", key, mapping_eq5d[key][0], time_window=time_window
176
+ )
177
+
178
+ # Calculate difference between latest PRO score and PRO score prior to the latest
179
+ for key in mapping_eq5d:
180
+ model_h.calc_diff_pro_scores(pro_eq5d, "EQ5D", key, mapping_eq5d[key][1])
181
+
182
+ # Remove unwanted columns and duplicates
183
+ pro_eq5d = pro_eq5d.loc[
184
+ :,
185
+ ~(
186
+ pro_eq5d.columns.str.startswith("Avg")
187
+ | pro_eq5d.columns.str.startswith("EQ5D")
188
+ | pro_eq5d.columns.str.startswith("Prev")
189
+ | pro_eq5d.columns.str.startswith("Response")
190
+ ),
191
+ ]
192
+ pro_eq5d = pro_eq5d.drop(
193
+ columns=[
194
+ "Score",
195
+ "SubmissionTime",
196
+ "FirstSubmissionDate",
197
+ "TimeInService",
198
+ "TimeSinceSubmission",
199
+ "LatestPredictionDate",
200
+ "LatestPRODate",
201
+ ]
202
+ )
203
+ pro_eq5d = pro_eq5d.drop_duplicates()
204
+
205
+ ############################################
206
+ # Weekly PROs - MRC
207
+ ############################################
208
+ pro_mrc = pd.read_csv("<YOUR_DATA_PATH>/copd-dataset/CopdDatasetProMrc.txt", delimiter="|")
209
+ pro_mrc = pro_mrc.merge(
210
+ data[["StudyId", "IndexDate", "FirstSubmissionDate", "LatestPredictionDate"]],
211
+ on="StudyId",
212
+ how="inner",
213
+ )
214
+
215
+ # Calculate engagement over service
216
+ pro_mrc = model_h.calc_total_pro_engagement(pro_mrc, "MRC")
217
+
218
+ # Calculate engagement over multiple time windows
219
+ for time_window in range(1, 7):
220
+ pro_mrc_engagement = model_h.calc_pro_engagement_in_time_window(
221
+ pro_mrc, "MRC", time_window=time_window, data=data
222
+ )
223
+ pro_mrc = pro_mrc.merge(pro_mrc_engagement, on=["StudyId", "IndexDate"], how="left")
224
+
225
+ # Calculate last PRO score
226
+ pro_mrc = model_h.calc_last_pro_score(pro_mrc, "MRC")
227
+
228
+ # Calculate the PRO score prior to the last PRO score
229
+ pro_mrc = model_h.calc_pro_score_prior_to_latest(pro_mrc, "MRC")
230
+
231
+ #############################
232
+ # Scores across time windows
233
+ #############################
234
+ # Mapping to calculate the difference between the latest PRO scores and both the average
235
+ # and previous PRO score
236
+ question_names_mrc = ["Q1"]
237
+ mapping_mrc = model_h.define_mapping_for_calcs(
238
+ "MRC", question_names_mrc, prefixes=["Avg", "Prev"]
239
+ )
240
+
241
+ # Calculate average PRO score across various time windows (months) prior to IndexDate,
242
+ # ignoring the latest PRO score
243
+ for time_window in range(1, 7):
244
+ pro_mrc = model_h.calc_pro_average(pro_mrc, "MRC", time_window=time_window)
245
+ for key in mapping_mrc:
246
+ model_h.calc_diff_pro_scores(
247
+ pro_mrc, "MRC", key, mapping_mrc[key][0], time_window=time_window
248
+ )
249
+
250
+ # Calculate difference between latest PRO score and PRO score prior to the latest
251
+ for key in mapping_mrc:
252
+ model_h.calc_diff_pro_scores(pro_mrc, "MRC", key, mapping_mrc[key][1])
253
+
254
+ # Remove unwanted columns and duplicates
255
+ pro_mrc = pro_mrc.loc[
256
+ :,
257
+ ~(
258
+ pro_mrc.columns.str.startswith("Avg")
259
+ | pro_mrc.columns.str.startswith("MRC")
260
+ | pro_mrc.columns.str.startswith("Prev")
261
+ | pro_mrc.columns.str.startswith("Response")
262
+ ),
263
+ ]
264
+ pro_mrc = pro_mrc.drop(
265
+ columns=[
266
+ "SubmissionTime",
267
+ "Score",
268
+ "FirstSubmissionDate",
269
+ "TimeInService",
270
+ "TimeSinceSubmission",
271
+ "LatestPredictionDate",
272
+ "LatestPRODate",
273
+ ]
274
+ )
275
+ pro_mrc = pro_mrc.drop_duplicates()
276
+
277
+ ############################################
278
+ # Daily PROs - CAT
279
+ ############################################
280
+ pro_cat = pd.read_csv("<YOUR_DATA_PATH>/copd-dataset/CopdDatasetProCat.txt", delimiter="|")
281
+ pro_cat = pro_cat.merge(
282
+ data[["StudyId", "IndexDate", "FirstSubmissionDate", "LatestPredictionDate"]],
283
+ on="StudyId",
284
+ how="inner",
285
+ )
286
+
287
+ # Calculate engagement over service and 1 month prior to index date
288
+ pro_cat = model_h.calc_total_pro_engagement(pro_cat, "CAT")
289
+
290
+ # Calculate engagement over multiple time windows
291
+ for time_window in range(1, 7):
292
+ pro_cat_engagement = model_h.calc_pro_engagement_in_time_window(
293
+ pro_cat, "CAT", time_window=time_window, data=data
294
+ )
295
+ pro_cat = pro_cat.merge(pro_cat_engagement, on=["StudyId", "IndexDate"], how="left")
296
+
297
+ # Calculate PRO average for the week before the index date
298
+ pro_cat = model_h.calc_pro_average(pro_cat, "CAT", avg_period="WeeklyAvg")
299
+
300
+ # Calculate PRO average for the week before most recent week to the index date
301
+ pro_cat = model_h.calc_pro_average(pro_cat, "CAT", avg_period="WeekPrevAvg")
302
+
303
+ #############################
304
+ # Scores across time windows
305
+ #############################
306
+ # Mapping to calculate the difference between the latest PRO scores and both the average
307
+ # and previous PRO score
308
+ question_names_cat = ["Q1", "Q2", "Q3", "Q4", "Q5", "Q6", "Q7", "Q8", "Score"]
309
+ mapping_cat = model_h.define_mapping_for_calcs(
310
+ "CAT", question_names_cat, prefixes=["LongerAvg", "WeekPrevAvg"]
311
+ )
312
+
313
+ # Calculate average PRO score across various time windows (months) prior to IndexDate,
314
+ # ignoring the latest PRO score
315
+ for time_window in range(1, 7):
316
+ pro_cat = model_h.calc_pro_average(
317
+ pro_cat, "CAT", time_window=time_window, avg_period="LongerAvg"
318
+ )
319
+ for key in mapping_cat:
320
+ model_h.calc_diff_pro_scores(
321
+ pro_cat, "CAT", key, mapping_cat[key][0], time_window=time_window
322
+ )
323
+
324
+ # Calculate difference between latest PRO score and PRO score prior to the latest
325
+ for key in mapping_cat:
326
+ model_h.calc_diff_pro_scores(pro_cat, "CAT", key, mapping_cat[key][1])
327
+
328
+ # Remove unwanted columns and duplicates
329
+ pro_cat = pro_cat.loc[
330
+ :,
331
+ ~(
332
+ pro_cat.columns.str.startswith("WeekPrev")
333
+ | pro_cat.columns.str.startswith("Longer")
334
+ | pro_cat.columns.str.startswith("CAT")
335
+ | pro_cat.columns.str.startswith("Response")
336
+ ),
337
+ ]
338
+ pro_cat = pro_cat.drop(
339
+ columns=[
340
+ "Score",
341
+ "SubmissionTime",
342
+ "FirstSubmissionDate",
343
+ "LatestPredictionDate",
344
+ "TimeInService",
345
+ "AvgStartDate",
346
+ "WeekStartDate",
347
+ ]
348
+ )
349
+ pro_cat = pro_cat.drop_duplicates()
350
+
351
+ ############################################
352
+ # Daily PROs - Symptom Diary
353
+ ############################################
354
+ # Symptom diary have some questions that are numeric and some that are categorical
355
+ pro_sd = pd.read_csv(
356
+ "<YOUR_DATA_PATH>/copd-dataset/CopdDatasetProSymptomDiary.txt", delimiter="|"
357
+ )
358
+ pro_sd = pro_sd.merge(
359
+ data[["StudyId", "IndexDate", "FirstSubmissionDate", "LatestPredictionDate"]],
360
+ on="StudyId",
361
+ how="inner",
362
+ )
363
+
364
+ # Calculate engagement over service
365
+ pro_sd = model_h.calc_total_pro_engagement(pro_sd, "SymptomDiary")
366
+ pro_sd_engagement = pro_sd[
367
+ ["StudyId", "PatientId", "IndexDate", "TotalEngagementSymptomDiary"]
368
+ ]
369
+
370
+ # Calculate engagement over multiple time windows
371
+ for time_window in range(1, 7):
372
+ pro_sd_engagement_tw = model_h.calc_pro_engagement_in_time_window(
373
+ pro_sd, "SymptomDiary", time_window=time_window, data=data
374
+ )
375
+ pro_sd_engagement = pro_sd_engagement.merge(
376
+ pro_sd_engagement_tw, on=["StudyId", "IndexDate"], how="left"
377
+ )
378
+ pro_sd_engagement = pro_sd_engagement.drop_duplicates()
379
+
+ ###############################
+ # Numeric questions
+ # (Q1, Q2)
+ # Q3 included for comparison
+ ###############################
+ # Calculate PRO average for the week before the index date
+ pro_sd_numeric = pro_sd[
+     [
+         "StudyId",
+         "PatientId",
+         "IndexDate",
+         "SubmissionTime",
+         "Score",
+         "SymptomDiaryQ1",
+         "SymptomDiaryQ2",
+         "SymptomDiaryQ3",
+     ]
+ ]
+ pro_sd_numeric = model_h.calc_pro_average(
+     pro_sd_numeric, "SymptomDiary", avg_period="WeeklyAvg"
+ )
+
+ # Calculate PRO average for the week before the most recent week prior to the index date
+ pro_sd_numeric = model_h.calc_pro_average(
+     pro_sd_numeric, "SymptomDiary", avg_period="WeekPrevAvg"
+ )
+
+ #############################
+ # Scores across time windows
+ #############################
+ # Mapping to calculate the difference between the latest PRO scores and both the average
+ # and previous PRO score
+ question_names_sd = ["Q1", "Q2", "Q3"]
+ mapping_sd = model_h.define_mapping_for_calcs(
+     "SymptomDiary", question_names_sd, prefixes=["LongerAvg", "WeekPrevAvg"]
+ )
+
+ # Calculate average PRO score across various time windows (months) prior to IndexDate,
+ # ignoring the latest PRO score
+ for time_window in range(1, 7):
+     pro_sd_numeric = model_h.calc_pro_average(
+         pro_sd_numeric, "SymptomDiary", time_window=time_window, avg_period="LongerAvg"
+     )
+     for key in mapping_sd:
+         model_h.calc_diff_pro_scores(
+             pro_sd_numeric,
+             "SymptomDiary",
+             key,
+             mapping_sd[key][0],
+             time_window=time_window,
+         )
+
+ # Calculate the difference between the latest PRO score and the PRO score prior to the
+ # latest week
+ for key in mapping_sd:
+     model_h.calc_diff_pro_scores(pro_sd_numeric, "SymptomDiary", key, mapping_sd[key][1])
+
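The difference features computed by `calc_diff_pro_scores` reduce to a column subtraction; a minimal sketch with hypothetical column names (the actual names produced by `model_h` may differ):

```python
import pandas as pd

# Hypothetical sketch: the signal is the latest weekly average minus a longer-run
# average, so positive values indicate recently worsening symptom scores
scores = pd.DataFrame({
    "WeeklyAvgSymptomDiaryQ1": [3.0, 4.0],
    "LongerAvgSymptomDiaryQ1TW3": [2.0, 4.5],
})
scores["DiffLongerAvgSymptomDiaryQ1TW3"] = (
    scores["WeeklyAvgSymptomDiaryQ1"] - scores["LongerAvgSymptomDiaryQ1TW3"]
)
```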
+ ###############################
+ # Binary questions
+ # (Q3)
+ ###############################
+ # Calculate sum of binary values across previous months
+ sd_sum_all = pro_sd_numeric[["StudyId", "IndexDate"]]
+ sd_sum_all = sd_sum_all.drop_duplicates()
+ for time_window in range(1, 7):
+     sd_sum = model_h.calc_sum_binary_vals(
+         pro_sd_numeric, binary_cols=["SymptomDiaryQ3"], time_window=time_window
+     )
+     sd_sum_all = sd_sum_all.merge(sd_sum, on=["StudyId", "IndexDate"], how="left")
+
+ # Scale sums by how often patients responded
+ sd_sum_all = sd_sum_all.merge(
+     pro_sd_engagement, on=["StudyId", "IndexDate"], how="left"
+ )
+ mapping_scaling = {}
+ for time_window in range(1, 7):
+     mapping_scaling["SumSymptomDiaryQ3TW" + str(time_window)] = (
+         "EngagementSymptomDiaryTW" + str(time_window)
+     )
+ for key in mapping_scaling:
+     model_h.scale_sum_to_response_rate(sd_sum_all, key, mapping_scaling[key])
+
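The scaling step normalises the binary sums by engagement so infrequent responders are comparable with frequent ones. A minimal sketch of that idea (hypothetical column names; `model_h.scale_sum_to_response_rate` may handle edge cases differently):

```python
import numpy as np
import pandas as pd

# Divide the sum of positive binary answers by the number of submissions in the
# same window, guarding against division by zero for patients with no responses
df = pd.DataFrame({
    "SumSymptomDiaryQ3TW3": [2.0, 0.0, 3.0],
    "EngagementSymptomDiaryTW3": [4.0, 0.0, 6.0],
})
df["ScaledSumSymptomDiaryQ3TW3"] = np.where(
    df["EngagementSymptomDiaryTW3"] > 0,
    df["SumSymptomDiaryQ3TW3"] / df["EngagementSymptomDiaryTW3"],
    0.0,
)
```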
+ # Combine numeric and binary dfs
+ pro_sd_full = pro_sd_numeric.merge(
+     sd_sum_all, on=["StudyId", "PatientId", "IndexDate"], how="left"
+ )
+
+ # Remove unwanted columns from numeric df
+ pro_sd_full = pro_sd_full.loc[
+     :,
+     ~(
+         pro_sd_full.columns.str.startswith("WeekPrev")
+         | pro_sd_full.columns.str.startswith("Longer")
+         | pro_sd_full.columns.str.startswith("Symptom")
+         | pro_sd_full.columns.str.startswith("Sum")
+         | pro_sd_full.columns.str.startswith("Response")
+     ),
+ ]
+ pro_sd_full = pro_sd_full.drop(
+     columns=[
+         "Score",
+         "SubmissionTime",
+         "AvgStartDate",
+         "TimeWindowStartDate",
+         "WeekStartDate",
+     ]
+ )
+ pro_sd_full = pro_sd_full.drop_duplicates()
+
+ ############################################################################
+ # Combine PROs with main df
+ ############################################################################
+ data = data.merge(pro_eq5d, on=["StudyId", "PatientId", "IndexDate"], how="left")
+ data = data.merge(pro_mrc, on=["StudyId", "PatientId", "IndexDate"], how="left")
+ data = data.merge(pro_cat, on=["StudyId", "PatientId", "IndexDate"], how="left")
+ data = data.merge(pro_sd_full, on=["StudyId", "PatientId", "IndexDate"], how="left")
+
+ # Calculate mean for features grouped by outcome
+ feat_to_explore = data.loc[:, "TotalEngagementEQ5D":"ScaledSumSymptomDiaryQ3TW6"]
+ feat_to_explore.loc[:, "ExacWithin3Months"] = data.loc[:, "ExacWithin3Months"]
+ grouped_data_by_outcome = feat_to_explore.groupby("ExacWithin3Months").mean()
+ grouped_data_by_outcome = grouped_data_by_outcome.T
+
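The feature/outcome comparison above is a groupby-mean with the result transposed so features become rows. A minimal reproduction with toy data:

```python
import pandas as pd

# Mean engagement for each outcome class; after .T, features are rows and the
# columns are the outcome values 0 and 1
feat = pd.DataFrame({
    "TotalEngagementEQ5D": [10, 2, 8, 4],
    "ExacWithin3Months": [0, 1, 0, 1],
})
by_outcome = feat.groupby("ExacWithin3Months").mean().T
```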
+ ############################################################################
+ # Plotting
+ ############################################################################
+ # Boxplots of score values and engagement for each PRO
+ pro_question_names = {
+     "EQ5D": question_names_eq5d,
+     "MRC": question_names_mrc,
+     "CAT": question_names_cat,
+     "SymptomDiary": question_names_sd,
+ }
+ for pro, question_names in pro_question_names.items():
+     # Plotting score values
+     cols_for_plotting = model_h.create_cols_for_plotting(
+         pro, question_col_names=question_names
+     )
+     for key in cols_for_plotting:
+         model_h.plot_feature_signal(
+             data,
+             nrows=3,
+             ncols=3,
+             figsize=(12, 12),
+             cols_to_plot=cols_for_plotting[key],
+             fig_name=key + "_boxplot",
+         )
+
+     # Plotting engagement
+     cols_for_plotting = model_h.create_cols_for_plotting(pro, var_engagement=True)
+     for key in cols_for_plotting:
+         model_h.plot_feature_signal(
+             data,
+             nrows=2,
+             ncols=3,
+             figsize=(12, 12),
+             cols_to_plot=cols_for_plotting[key],
+             fig_name=key + "_engagement",
+         )
training/setup_labels_forward_val.py ADDED
@@ -0,0 +1,643 @@
+ """
+ Script uses both hospital and community exacerbation events.
+
+ Collate all hospital and patient-reported events and apply PRO LOGIC to determine the
+ number of exacerbation events. Use the exacerbation events to determine the number of
+ rows required per patient in the data, then generate random index dates and set up
+ labels. Data starts at July 2022, runs until Dec 2023, and will be used for forward
+ validation of the model.
+ """
+ import model_h
+ import numpy as np
+ import os
+ import sys
+ import pandas as pd
+ import matplotlib.pyplot as plt
+ from datetime import timedelta
+ import random
+ import yaml
+
+ with open("./training/config.yaml", "r") as config_file:
+     config = yaml.safe_load(config_file)
+
+ # Setup log file
+ log = open(
+     os.path.join(config["outputs"]["logging_dir"], "setup_labels_hosp_comm.log"), "w"
+ )
+ sys.stdout = log
+
+ ############################################################################
+ # Define model cohort and training data windows
+ ############################################################################
+
+ # Read relevant info from patient details
+ patient_details = pd.read_csv(
+     config["inputs"]["raw_data_paths"]["patient_details"],
+     usecols=[
+         "PatientId",
+         "FirstSubmissionDate",
+         "MostRecentSubmissionDate",
+         "DateOfBirth",
+         "Sex",
+         "StudyId",
+     ],
+     delimiter="|",
+ )
+
+ # Select patients for inclusion (those with up-to-date events in service)
+ # Original RECEIVER cohort study id list
+ receiver_patients = ["RC{:02d}".format(i) for i in range(1, 85)]
+ # This patient needs removing
+ receiver_patients.remove("RC34")
+
+ # Scale up patients (subset)
+ scaleup_patients = ["SU{:02d}".format(i) for i in range(1, 219)]
+ # scaleup_patients.append('SU287')  # Removed as study ID contains 2 patients
+
+ # List of all valid patients for modelling
+ valid_patients = receiver_patients + scaleup_patients
+
+ # Filter for valid patients, accounting for white space in StudyId (e.g. RC 26 and RC 52)
+ patient_details = patient_details[
+     patient_details.StudyId.str.replace(" ", "").isin(valid_patients)
+ ]
+ # Select only non-null entries in patient data start/end dates
+ patient_details = patient_details[
+     (patient_details.FirstSubmissionDate.notna())
+     & (patient_details.MostRecentSubmissionDate.notna())
+ ]
+
+ # Create a column stating the earliest permitted date for forward validation
+ patient_details["EarliestIndexDate"] = config["model_settings"][
+     "forward_validation_earliest_date"
+ ]
+
+ # Create a column stating the latest date permitted based on events added to service data
+ patient_details["LatestPredictionDate"] = config["model_settings"][
+     "forward_validation_latest_date"
+ ]
+
+ date_cols = [
+     "FirstSubmissionDate",
+     "MostRecentSubmissionDate",
+     "LatestPredictionDate",
+     "EarliestIndexDate",
+ ]
+ patient_details[date_cols] = patient_details[date_cols].apply(
+     lambda x: pd.to_datetime(x, utc=True, format="mixed").dt.normalize(), axis=1
+ )
+
+ # Choose the earlier date out of the patient's last submission and the latest COPD data
+ # events
+ patient_details["LatestPredictionDate"] = patient_details[
+     ["MostRecentSubmissionDate", "LatestPredictionDate"]
+ ].min(axis=1)
+
+ # Calculate the latest date that the index date can be for each patient
+ patient_details["LatestIndexDate"] = patient_details[
+     "LatestPredictionDate"
+ ] - pd.DateOffset(days=config["model_settings"]["prediction_window"])
+
+ # Set the start of the data window 6 months before the earliest index date to allow
+ # enough of a lookback period
+ patient_details["EarliestDataDate"] = patient_details[
+     "EarliestIndexDate"
+ ] - pd.DateOffset(days=config["model_settings"]["lookback_period"])
+
+ # Remove any patients for whom the earliest index date overlaps the latest index
+ # date, i.e. they have too short a window of data
+ print("Number of total patients", len(patient_details))
+ print(
+     "Number of patients with too short a window of data:",
+     len(
+         patient_details[
+             patient_details["EarliestIndexDate"] > patient_details["LatestIndexDate"]
+         ]
+     ),
+ )
+ patient_details = patient_details[
+     patient_details["EarliestIndexDate"] < patient_details["LatestIndexDate"]
+ ]
+ patient_details.to_pickle("./data/patient_details_forward_val.pkl")
+
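The window arithmetic above can be checked in isolation. A sketch with illustrative settings (a 90-day prediction window and 180-day lookback stand in for the values read from `config.yaml`):

```python
import pandas as pd

# Derive the latest permissible index date and the earliest data date from the
# forward-validation boundaries, as in the script above
prediction_window, lookback_period = 90, 180
latest_prediction = pd.Timestamp("2023-12-31", tz="UTC")
earliest_index = pd.Timestamp("2022-07-01", tz="UTC")

latest_index = latest_prediction - pd.DateOffset(days=prediction_window)
earliest_data = earliest_index - pd.DateOffset(days=lookback_period)
```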
+ # List of remaining patients
+ model_patients = list(patient_details.PatientId.unique())
+ model_study_ids = list(patient_details.StudyId.unique())
+
+ print(
+     "Model cohort: {} patients. {} RECEIVER and {} SU".format(
+         len(model_patients),
+         len(patient_details[patient_details["StudyId"].str.startswith("RC")]),
+         len(patient_details[patient_details["StudyId"].str.startswith("SU")]),
+     )
+ )
+
+ df = patient_details[
+     [
+         "PatientId",
+         "DateOfBirth",
+         "Sex",
+         "StudyId",
+         "EarliestDataDate",
+         "EarliestIndexDate",
+         "LatestIndexDate",
+         "LatestPredictionDate",
+     ]
+ ].copy()
+
+ # Create a row per day between the EarliestDataDate and the LatestPredictionDate
+ df["DateOfEvent"] = df.apply(
+     lambda x: pd.date_range(x.EarliestDataDate, x.LatestPredictionDate, freq="D"),
+     axis=1,
+ )
+ df = df.explode("DateOfEvent").reset_index(drop=True)
+
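The `date_range` + `explode` pattern above turns one row per patient into one row per patient-day. A minimal, self-contained reproduction:

```python
import pandas as pd

# One patient with a five-day window expands to five daily rows
df = pd.DataFrame({
    "StudyId": ["RC01"],
    "EarliestDataDate": [pd.Timestamp("2023-01-01")],
    "LatestPredictionDate": [pd.Timestamp("2023-01-05")],
})
df["DateOfEvent"] = df.apply(
    lambda x: pd.date_range(x.EarliestDataDate, x.LatestPredictionDate, freq="D"),
    axis=1,
)
daily = df.explode("DateOfEvent").reset_index(drop=True)
```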
+ ############################################################################
+ # Extract hospital exacerbations and admissions from COPD service data
+ ############################################################################
+
+ # Contains exacerbations among other event types
+ patient_events = pd.read_csv(
+     config["inputs"]["raw_data_paths"]["patient_events"],
+     delimiter="|",
+     usecols=["PatientId", "DateOfEvent", "EventType"],
+ )
+
+ # Filter for only patients in model cohort
+ patient_events = patient_events[patient_events.PatientId.isin(model_patients)]
+
+ # Identify hospital exacerbation events
+ patient_events["IsHospExac"] = model_h.define_service_exac_event(
+     events=patient_events.EventType, include_community=False
+ )
+
+ # Identify hospital admissions (all causes)
+ patient_events["IsHospAdmission"] = model_h.define_hospital_admission(
+     patient_events.EventType
+ )
+
+ admissions = patient_events[patient_events.IsHospAdmission == 1][
+     ["PatientId", "DateOfEvent", "IsHospAdmission"]
+ ]
+ hosp_exacs = patient_events[patient_events.IsHospExac == 1][
+     ["PatientId", "DateOfEvent", "IsHospExac"]
+ ]
+ admissions["DateOfEvent"] = pd.to_datetime(
+     admissions.DateOfEvent, utc=True
+ ).dt.normalize()
+ hosp_exacs["DateOfEvent"] = pd.to_datetime(
+     hosp_exacs.DateOfEvent, utc=True
+ ).dt.normalize()
+
+ hosp_exacs = hosp_exacs.drop_duplicates()
+ admissions = admissions.drop_duplicates()
+
+ # Save hospital exacerbations and admissions data
+ hosp_exacs.to_pickle("./data/hospital_exacerbations.pkl")
+ admissions.to_pickle("./data/hospital_admissions.pkl")
+
+ ############################################################################
+ # Extract patient reported exacerbation events
+ ############################################################################
+
+ ########################
+ # Data post Q5 change
+ ########################
+
+ # Read the file containing patient reported events (not patient_events, because that
+ # contains the dates when patients answered PROs and not the date they reported as
+ # having taken their rescue meds)
+ symptom_diary = pd.read_csv(
+     config["inputs"]["raw_data_paths"]["pro_symptom_diary"],
+     usecols=[
+         "PatientId",
+         "StudyId",
+         "Score",
+         "SubmissionTime",
+         "SymptomDiaryQ5",
+         "SymptomDiaryQ11a",
+         "SymptomDiaryQ11b",
+     ],
+     delimiter="|",
+ )
+
+ q5_change_date = pd.to_datetime(config["model_settings"]["pro_q5_change_date"], utc=True)
+ symptom_diary = model_h.filter_symptom_diary(
+     df=symptom_diary, date_cutoff=q5_change_date, patients=model_patients
+ )
+
+ weekly_pros = model_h.get_rescue_med_pro_responses(symptom_diary)
+ weekly_pros = model_h.set_pro_exac_dates(weekly_pros)
+ weekly_pros = weekly_pros[
+     [
+         "PatientId",
+         "Q5Answered",
+         "NegativeQ5",
+         "IsCommExac",
+         "DateOfEvent",
+         "ExacDateUnknown",
+     ]
+ ]
+
+ ####################################################################################
+ # Merge hospital and patient reported events with daily patient records
+ #
+ # Exacerbations occurring in the Lenus service period include verified clinician
+ # events pre-April 2021 (after onboarding) and community exacerbations recorded in
+ # weekly PROs post-April 2021. Hospital exacerbations include exacerbations occurring
+ # during the service period.
+ ####################################################################################
+
+ # Patient reported, clinician verified
+ # df = df.merge(verified_exacs, on=["StudyId", "DateOfEvent"], how="left")
+
+ # Patient reported, new rescue med PRO (April 2021 onwards)
+ df = df.merge(weekly_pros, on=["PatientId", "DateOfEvent"], how="left")
+
+ # Hospital exacerbations
+ df = df.merge(hosp_exacs, on=["PatientId", "DateOfEvent"], how="left")
+ df = model_h.fill_column_by_patient(df=df, id_col="PatientId", col="StudyId")
+
+ # Hospital admissions
+ df = df.merge(admissions, on=["PatientId", "DateOfEvent"], how="left")
+ df = model_h.fill_column_by_patient(df=df, id_col="PatientId", col="StudyId")
+
+ # Column for whether an exacerbation of any kind occurred on each date. To be filtered
+ # using (PRO) LOGIC
+ df["IsExac"] = np.where((df.IsCommExac == 1) | (df.IsHospExac == 1), 1, 0)
+
+ # Resample the df to one day per patient starting from the earliest record
+ df = (
+     df.set_index("DateOfEvent")
+     .groupby("StudyId")
+     .resample("D")
+     .asfreq()
+     .drop("StudyId", axis=1)
+     .reset_index()
+ )
+
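The groupby/resample step above fills in any missing calendar days per patient; a minimal reproduction of that pattern with toy data:

```python
import pandas as pd

# Two recorded days per patient; resample("D").asfreq() inserts the missing days
# with NaN values, which are later infilled
events = pd.DataFrame({
    "StudyId": ["RC01", "RC01"],
    "DateOfEvent": pd.to_datetime(["2023-01-01", "2023-01-04"]),
    "IsExac": [1, 1],
})
daily = (
    events.set_index("DateOfEvent")
    .groupby("StudyId")
    .resample("D")
    .asfreq()
    .drop("StudyId", axis=1)
    .reset_index()
)
```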
+ # Infill binary cols with zero where applicable
+ binary_cols = [
+     "Q5Answered",
+     "NegativeQ5",
+     "IsHospExac",
+     "IsCommExac",
+     "ExacDateUnknown",
+     "IsExac",
+     "IsHospAdmission",
+ ]
+ df[binary_cols] = df[binary_cols].fillna(0)
+
+ # Infill some columns by StudyId to populate the entire df
+ # df = copd.fill_column_by_patient(df=df, id_col="StudyId", col="FirstSubmissionDate")
+ df = model_h.fill_column_by_patient(df=df, id_col="StudyId", col="LatestPredictionDate")
+ df = model_h.fill_column_by_patient(df=df, id_col="StudyId", col="PatientId")
+
+ # Retain only dates before the end of each patient's data window
+ df = df[df.DateOfEvent <= df.LatestPredictionDate]
+
+ print("Starting number of exacerbations: {}".format(df.IsExac.sum()))
+ print(
+     "Number of exacerbations during COPD service: {}".format(
+         len(df[(df.IsExac == 1) & (df.DateOfEvent >= df.EarliestDataDate)])
+     )
+ )
+ print(
+     "Number of unique exacerbation patients: {}".format(
+         len(df[df.IsExac == 1].PatientId.unique())
+     )
+ )
+ print(
+     "Exacerbation breakdown: {} hospital, {} patient reported and {} overlapping".format(
+         df.IsHospExac.sum(),
+         df.IsCommExac.sum(),
+         len(df.loc[(df.IsCommExac == 1) & (df.IsHospExac == 1)]),
+     )
+ )
+ print(
+     "Number of hospital exacerbations during COPD service: {} ({} unique patients)".format(
+         len(df[(df.IsHospExac == 1) & (df.DateOfEvent >= df.EarliestDataDate)]),
+         len(
+             df[
+                 (df.IsHospExac == 1) & (df.DateOfEvent >= df.EarliestDataDate)
+             ].StudyId.unique()
+         ),
+     )
+ )
+ print(
+     "Community exacerbations from weekly PROs: {} ({} unique patients)".format(
+         len(df[df.IsCommExac == 1]), len(df[df.IsCommExac == 1].StudyId.unique())
+     )
+ )
+ print(
+     "Number of patient reported exacerbations with unknown dates: {} "
+     "({} overlapping with hospital events)".format(
+         df.ExacDateUnknown.sum(),
+         len(df[(df.IsHospExac == 1) & (df.ExacDateUnknown == 1)]),
+     )
+ )
+
+ # Check for any patient reported events with unknown dates that occurred on the same day
+ # as a hospital event. Hospital events are trusted, so set the date to known
+ df.loc[(df.IsCommExac == 1) & (df.IsHospExac == 1), "ExacDateUnknown"] = 0
+ print("Remaining exacerbations with unknown dates: {}".format(df.ExacDateUnknown.sum()))
+
+ ############################################################################
+ # Implement PRO LOGIC on hospital and patient reported exacerbation events
+ ############################################################################
+
+ # Define min and max days for PRO LOGIC. No predictions are made and no data is used
+ # within logic_min_days after an exacerbation. Events falling between logic_min_days
+ # and logic_max_days after an event are subject to the weekly rescue med LOGIC
+ # criterion
+ logic_min_days = config["model_settings"]["pro_logic_min_days_after_exac"]
+ logic_max_days = config["model_settings"]["pro_logic_max_days_after_exac"]
+
+ # Calculate the days since the previous exacerbation for all patient days
+ df = (
+     df.groupby("StudyId")
+     .apply(
+         lambda x: model_h.calculate_days_since_last_event(
+             df=x, event_col="IsExac", output_col="DaysSinceLastExac"
+         )
+     )
+     .reset_index(drop=True)
+ )
+
+ # Apply exclusion period following all exacerbations
+ df["RemoveRow"] = model_h.minimum_period_between_exacerbations(
+     df, minimum_days=logic_min_days
+ )
+
+ # Do not remove hospital exacerbations even if they get flagged by PRO logic
+ df["RemoveRow"] = np.where(df["IsHospExac"] == 1, 0, df["RemoveRow"])
+
+ print(
+     "Number of community exacerbations excluded by PRO LOGIC {} day criterion: {}".format(
+         logic_min_days, len(df[(df.IsExac == 1) & (df.RemoveRow == 1)])
+     )
+ )
+
+ # Apply criterion for negative weekly Q5 responses - doesn't capture anything post Q5
+ # change
+ consecutive_replies = config["model_settings"]["neg_consecutive_q5_replies"]
+ df = model_h.apply_logic_response_criterion(
+     df,
+     minimum_period=logic_min_days,
+     maximum_period=logic_max_days,
+     N=consecutive_replies,
+ )
+
+ # Do not remove hospital exacerbations even if they get flagged by PRO logic
+ df["RemoveExac"] = np.where(df["IsHospExac"] == 1, 0, df["RemoveExac"])
+
+ print(
+     "Weekly rescue med (Q5) criterion applied to events occurring between {} and {} "
+     "days after a previous event. {} consecutive negative replies are required for "
+     "the event to count as a new event".format(
+         logic_min_days, logic_max_days, consecutive_replies
+     )
+ )
+ print(
+     "Number of exacerbations excluded by PRO LOGIC Q5 response criterion: {}".format(
+         df.RemoveExac.sum()
+     )
+ )
+ print(
+     "Earliest and latest exacerbations excluded: {}, {}".format(
+         df[df.RemoveExac == 1].DateOfEvent.min(),
+         df[df.RemoveExac == 1].DateOfEvent.max(),
+     )
+ )
+ print(
+     "Remaining number of exacerbations: {}".format(
+         len(df[(df.IsExac == 1) & (df.RemoveRow != 1) & (df.RemoveExac != 1)])
+     )
+ )
+ print(
+     "Remaining exacerbations with unknown dates: {}".format(
+         len(df[(df.ExacDateUnknown == 1) & (df.RemoveRow != 1) & (df.RemoveExac != 1)])
+     )
+ )
+
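The days-since-last-exacerbation feature used by the PRO LOGIC can be sketched as a forward-fill over event dates. This is a hypothetical stand-in for `model_h.calculate_days_since_last_event` (the library's edge-case handling, e.g. whether the event day counts as day 0, may differ):

```python
import numpy as np
import pandas as pd

def days_since_last_event(df: pd.DataFrame, event_col: str) -> pd.Series:
    # Keep the date only on event days, forward-fill it, then take the day gap;
    # days before the first event come out as NaN
    last_event = df["DateOfEvent"].where(df[event_col] == 1).ffill()
    return (df["DateOfEvent"] - last_event).dt.days

df = pd.DataFrame({
    "DateOfEvent": pd.date_range("2023-01-01", periods=6, freq="D"),
    "IsExac": [0, 1, 0, 0, 1, 0],
})
df["DaysSinceLastExac"] = days_since_last_event(df, "IsExac")
```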
+ # Remove data between segments of prolonged events, counting only the first occurrence
+ df = model_h.remove_data_between_exacerbations(df)
+
+ # Remove the 7 days before each reported exacerbation with an unknown date (rescue
+ # meds taken in the last week)
+ df = model_h.remove_unknown_date_exacerbations(df)
+
+ # Remove rows flagged for removal
+ df = df[df["RemoveRow"] != 1]
+
+ # Events breakdown on the df with unwanted rows removed
+ print("---Final exacerbation counts---")
+ print("Final number of exacerbations: {}".format(df.IsExac.sum()))
+ exac_patients = pd.Series(df[df.IsExac == 1].StudyId.unique())
+ print(
+     "Number of unique exacerbation patients: {} ({} RC and {} SU)".format(
+         len(exac_patients),
+         exac_patients.str.startswith("RC").sum(),
+         exac_patients.str.startswith("SU").sum(),
+     )
+ )
+ print(
+     "Exacerbation breakdown: {} hospital, {} patient reported and {} overlapping".format(
+         df.IsHospExac.sum(),
+         df.IsCommExac.sum(),
+         len(df.loc[(df.IsCommExac == 1) & (df.IsHospExac == 1)]),
+     )
+ )
+ df.to_pickle("./data/hosp_comm_exacs.pkl")
+
+ ############################################################################
+ # Calculate the number of rows to include per patient in the dataset. This
+ # is calculated based on the average number of exacerbations per patient and
+ # is then adjusted to the average time within the service
+ ############################################################################
+
+ # Calculate the average time patients have data recorded in the COPD service
+ service_time = df[["StudyId", "LatestPredictionDate", "EarliestDataDate"]]
+ service_time = service_time.drop_duplicates(subset="StudyId", keep="first")
+ service_time["ServiceTime"] = (
+     service_time["LatestPredictionDate"] - service_time["EarliestDataDate"]
+ ).dt.days
+ avg_service_time = sum(service_time["ServiceTime"]) / len(service_time["ServiceTime"])
+ avg_service_time_months = round(avg_service_time / 30)
+ print("Average time in service (days):", avg_service_time)
+ print("Average time in service (months):", avg_service_time_months)
+
+ # Calculate the average number of exacerbations per patient
+ avg_exac_per_patient = round(
+     len(df[df["IsExac"] == 1]) / df[df["IsExac"] == 1][["StudyId"]].nunique().item(), 2
+ )
+ print(
+     "Number of exac/patient/months: {} exacerbations/patient in {} months".format(
+         avg_exac_per_patient, avg_service_time_months
+     )
+ )
+ print(
+     "On average, 1 exacerbation occurs in a patient every {} months".format(
+         round(avg_service_time_months / avg_exac_per_patient, 2)
+     )
+ )
+
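The rate arithmetic above reduces to a few averages; as plain numbers (illustrative values, not the cohort's actual figures):

```python
# Average days in service and exacerbations per patient give a rough
# months-per-exacerbation rate, which motivates the rows-per-patient choice
service_days = [300, 450, 600]
avg_service_time = sum(service_days) / len(service_days)
avg_service_time_months = round(avg_service_time / 30)
exacs, exac_patients = 30, 10
avg_exac_per_patient = round(exacs / exac_patients, 2)
months_per_exac = round(avg_service_time_months / avg_exac_per_patient, 2)
```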
+ #################################################################
+ # Calculate index dates: 1 row/patient for every 5 months in service
+ #################################################################
+
+ # Obtain the number of rows required per patient
+ service_time["NumRows"] = round(
+     service_time["ServiceTime"]
+     / config["model_settings"]["one_row_per_days_in_service"]
+ ).astype("int")
+ patient_details = pd.merge(
+     patient_details, service_time[["StudyId", "NumRows"]], on="StudyId", how="left"
+ )
+
+ # Calculate the number of days between the earliest and latest index dates
+ patient_details["NumDaysPossibleIndex"] = (
+     patient_details["LatestIndexDate"] - patient_details["EarliestIndexDate"]
+ ).dt.days
+ patient_details.to_csv("./data/pat_details_to_calc_index_dt.csv", index=False)
+
+ # Make sure the number of rows isn't larger than the number of possible index dates
+ patient_details["NumRows"] = np.where(
+     patient_details["NumRows"] > patient_details["NumDaysPossibleIndex"],
+     patient_details["NumDaysPossibleIndex"],
+     patient_details["NumRows"],
+ )
+
+ # Generate random index dates
+ # Multiple seeds were tested to identify the random index dates that give a good
+ # distribution across months. Seed chosen as 2188398760 from check_index_date_dist.py
+ random_seed_general = config["model_settings"]["index_date_generation_master_seed"]
+ random.seed(random_seed_general)
+
+ # Create different random seeds for each patient
+ patient_details["RandomSeed"] = random.sample(
+     range(0, 2**32), patient_details.shape[0]
+ )
+
+ # Create random index dates for each patient based on their random seed
+ rand_days_dict = {}
+ rand_date_dict = {}
+ for index, row in patient_details.iterrows():
+     np.random.seed(row["RandomSeed"])
+     rand_days_dict[row["StudyId"]] = np.random.choice(
+         row["NumDaysPossibleIndex"], size=row["NumRows"], replace=False
+     )
+     rand_date_dict[row["StudyId"]] = [
+         row["EarliestIndexDate"] + timedelta(days=int(day))
+         for day in rand_days_dict[row["StudyId"]]
+     ]
+
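The per-patient index-date draw above, in isolation: a fixed seed plus `np.random.choice` without replacement gives reproducible, unique day offsets inside the permitted window (illustrative numbers, not the config values):

```python
import numpy as np
import pandas as pd
from datetime import timedelta

# Draw 5 unique offsets from a 100-day window of possible index dates
np.random.seed(42)
num_days_possible, num_rows = 100, 5
earliest_index = pd.Timestamp("2022-07-01")
offsets = np.random.choice(num_days_possible, size=num_rows, replace=False)
index_dates = [earliest_index + timedelta(days=int(d)) for d in offsets]
```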
+ # Create a df from the dictionaries containing random index dates
+ index_date_df = pd.DataFrame.from_dict(rand_date_dict, orient="index").reset_index()
+ index_date_df = index_date_df.rename(columns={"index": "StudyId"})
+
+ # Convert the multiple columns containing index dates to one column
+ index_date_df = (
+     pd.melt(index_date_df, id_vars=["StudyId"], value_name="IndexDate")
+     .drop(["variable"], axis=1)
+     .sort_values(by=["StudyId", "IndexDate"])
+ )
+ index_date_df = index_date_df.dropna()
+ index_date_df = index_date_df.reset_index(drop=True)
+
+ # Join index dates with exacerbation events
+ exac_events = pd.merge(index_date_df, df, on="StudyId", how="left")
+ exac_events["IndexDate"] = pd.to_datetime(exac_events["IndexDate"], utc=True)
+
+ # Calculate whether an exacerbation event occurred within the model time window
+ # (3 months) after the index date
+ exac_events["TimeToEvent"] = (
+     exac_events["DateOfEvent"] - exac_events["IndexDate"]
+ ).dt.days
+ prediction_window = config["model_settings"]["prediction_window"]
+ exac_events["ExacWithin3Months"] = np.where(
+     (exac_events["TimeToEvent"].between(1, prediction_window, inclusive="both"))
+     & (exac_events["IsExac"] == 1),
+     1,
+     0,
+ )
+ exac_events["HospExacWithin3Months"] = np.where(
+     (exac_events["TimeToEvent"].between(1, prediction_window, inclusive="both"))
+     & (exac_events["IsHospExac"] == 1),
+     1,
+     0,
+ )
+ exac_events["CommExacWithin3Months"] = np.where(
+     (exac_events["TimeToEvent"].between(1, prediction_window, inclusive="both"))
+     & (exac_events["IsCommExac"] == 1),
+     1,
+     0,
+ )
+
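The 90-day label above is a day-difference plus a `between` mask: an index date is positive if an exacerbation falls 1 to 90 days after it. A minimal reproduction (the `prediction_window` value stands in for the config setting):

```python
import numpy as np
import pandas as pd

prediction_window = 90
ev = pd.DataFrame({
    "IndexDate": pd.to_datetime(["2023-01-01", "2023-01-01", "2023-06-01"]),
    "DateOfEvent": pd.to_datetime(["2023-02-15", "2023-06-01", "2023-06-01"]),
    "IsExac": [1, 1, 0],
})
ev["TimeToEvent"] = (ev["DateOfEvent"] - ev["IndexDate"]).dt.days
# Positive label only when the event is a real exacerbation inside 1..90 days
ev["ExacWithin3Months"] = np.where(
    ev["TimeToEvent"].between(1, prediction_window, inclusive="both")
    & (ev["IsExac"] == 1),
    1,
    0,
)
```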
567
+ exac_events = exac_events.sort_values(
568
+ by=["StudyId", "IndexDate", "ExacWithin3Months"], ascending=[True, True, False]
569
+ )
570
+ exac_events = exac_events.drop_duplicates(subset=["StudyId", "IndexDate"], keep="first")
571
+ exac_events = exac_events[
572
+ [
573
+ "StudyId",
574
+ "PatientId",
575
+ "IndexDate",
576
+ "DateOfBirth",
577
+ "Sex",
578
+ "ExacWithin3Months",
579
+ "HospExacWithin3Months",
580
+ "CommExacWithin3Months",
581
+ ]
582
+ ]
583
+
584
+ # Save exac_events
585
+ exac_events.to_pickle("./data/patient_labels_forward_val_hosp_comm.pkl")
586
+
587
+ # Summary info
588
+ class_distribution = (
589
+ exac_events.groupby("ExacWithin3Months").count()[["StudyId"]].reset_index()
590
+ )
591
+ class_distribution.plot.bar(x="ExacWithin3Months", y="StudyId")
592
+ plt.savefig(
593
+ "./plots/class_distributions/final_seed_"
594
+ + str(random_seed_general)
595
+ + "_class_distribution_hosp_comm.png",
596
+ bbox_inches="tight",
597
+ )
598
+
599
+ print("---Summary info after setting up labels---")
600
+ print("Number of unique patients:", exac_events["StudyId"].nunique())
601
+ print("Number of rows:", len(exac_events))
602
+ print(
603
+ "Number of exacerbations within 3 months of index date:",
604
+ len(exac_events[exac_events["ExacWithin3Months"] == 1]),
605
+ )
606
+ print(
607
+ "Percentage positive class (num exac/total rows): {} %".format(
608
+ round(
609
+ (len(exac_events[exac_events["ExacWithin3Months"] == 1]) / len(exac_events))
610
+ * 100,
611
+ 2,
612
+ )
613
+ )
614
+ )
615
+ print(
616
+ "Percentage negative class: {} %".format(
617
+ round(
618
+ (len(exac_events[exac_events["ExacWithin3Months"] == 0]) / len(exac_events))
619
+ * 100,
620
+ 2,
621
+ )
622
+ )
623
+ )
624
+ print(
625
+ "Percentage hospital exacs: {} %".format(
626
+ round(
627
+ (len(exac_events[exac_events["HospExacWithin3Months"] == 1]) / len(exac_events))
628
+ * 100,
629
+ 2,
630
+ )
631
+ )
632
+ )
633
+ print(
634
+ "Percentage community exacs: {} %".format(
635
+ round(
636
+ (len(exac_events[exac_events["CommExacWithin3Months"] == 1]) / len(exac_events))
637
+ * 100,
638
+ 2,
639
+ )
640
+ )
641
+ )
642
+ print("Class balance:")
643
+ print(class_distribution)
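The labelling rule used above — a row is positive when `TimeToEvent` falls between 1 day and the configured `prediction_window` and the event flag is set, then one row is kept per `StudyId`/`IndexDate` with positives preferred — can be sketched on toy data. The frame, dates, and the 90-day window value below are illustrative only, not the project's data:

```python
import numpy as np
import pandas as pd

PREDICTION_WINDOW = 90  # days; the script reads this from config.yaml

# Toy joined frame: one row per (index date, candidate event)
events = pd.DataFrame(
    {
        "StudyId": ["RC01", "RC01", "RC02"],
        "IndexDate": pd.to_datetime(
            ["2021-01-01", "2021-01-01", "2021-01-01"], utc=True
        ),
        "DateOfEvent": pd.to_datetime(
            ["2021-02-15", "2021-06-01", "2021-01-01"], utc=True
        ),
        "IsExac": [1, 1, 1],
    }
)

# Positive label when the event lands 1..PREDICTION_WINDOW days after index
events["TimeToEvent"] = (events["DateOfEvent"] - events["IndexDate"]).dt.days
events["ExacWithin3Months"] = np.where(
    events["TimeToEvent"].between(1, PREDICTION_WINDOW, inclusive="both")
    & (events["IsExac"] == 1),
    1,
    0,
)

# Deduplicate to one row per StudyId/IndexDate, keeping positives first
labels = events.sort_values(
    by=["StudyId", "IndexDate", "ExacWithin3Months"],
    ascending=[True, True, False],
).drop_duplicates(subset=["StudyId", "IndexDate"], keep="first")

print(labels["ExacWithin3Months"].tolist())  # → [1, 0]
```

RC01 has one event inside the window (day 45) and one outside (day 151), so its single retained row is positive; RC02's event falls on the index date itself (day 0), which the `between(1, …)` rule deliberately excludes.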
training/setup_labels_hosp_comm.py ADDED
@@ -0,0 +1,935 @@
1
+ """
2
+ Script uses both hospital and community exacerbation events.
3
+
4
+ Collate all hospital, clinician-verified and patient-reported events and apply
5
+ PRO LOGIC to determine the number of exacerbation events. Use exacerbation events to
6
+ determine the number of rows required per patient in the data, generate random
7
+ index dates, and set up labels.
8
+ """
9
+
10
+ import model_h
11
+ import numpy as np
12
+ import os
13
+ import sys
14
+ import pandas as pd
15
+ import matplotlib.pyplot as plt
16
+ from datetime import timedelta
17
+ import random
18
+ import yaml
19
+
20
+ # Requires PyYAML in the environment
21
+ with open("./training/config.yaml", "r") as config_file:
22
+ config = yaml.safe_load(config_file)
23
+
24
+ # Setup log file
25
+ log = open(
26
+ os.path.join(
27
+ config["outputs"]["logging_dir"],
28
+ "setup_labels" + config["model_settings"]["model_type"] + "2023.log",
29
+ ),
30
+ "w",
31
+ )
32
+ sys.stdout = log
33
+
34
+ ############################################################################
35
+ # Define model cohort and training data windows
36
+ ############################################################################
37
+
38
+ # Read relevant info from patient details
39
+ patient_details = pd.read_csv(
40
+ config["inputs"]["raw_data_paths"]["patient_details"],
41
+ usecols=[
42
+ "PatientId",
43
+ "FirstSubmissionDate",
44
+ "MostRecentSubmissionDate",
45
+ "DateOfBirth",
46
+ "Sex",
47
+ "StudyId",
48
+ ],
49
+ delimiter="|",
50
+ )
51
+
52
+ # Select patients for inclusion (those with up to date events in service)
53
+ # Original RECEIVER cohort study id list
54
+ receiver_patients = ["RC{:02d}".format(i) for i in range(1, 85)]
55
+ # This patient needs removing
56
+ receiver_patients.remove("RC34")
57
+
58
+ # Scale up patients (subset)
59
+ scaleup_patients = ["SU{:02d}".format(i) for i in range(1, 219)]
60
+
61
+ # List of all valid patients for modelling
62
+ valid_patients = receiver_patients + scaleup_patients
63
+
64
+ # Filter for valid patients accounting for white spaces in StudyId (e.g. RC 26 and RC 52)
65
+ patient_details = patient_details[
66
+ patient_details.StudyId.str.replace(" ", "").isin(valid_patients)
67
+ ]
68
+ # Select only non null entries in patient data start/end dates
69
+ patient_details = patient_details[
70
+ (patient_details.FirstSubmissionDate.notna())
71
+ & (patient_details.MostRecentSubmissionDate.notna())
72
+ ]
73
+
74
+ # Get death data
75
+ patient_deaths = pd.read_csv(
76
+ config["inputs"]["raw_data_paths"]["patient_events"],
77
+ usecols=[
78
+ "PatientId",
79
+ "DateOfEvent",
80
+ "EventType",
81
+ ],
82
+ delimiter="|",
83
+ )
84
+ patient_deaths = patient_deaths[patient_deaths["EventType"] == "Death"]
85
+ patient_deaths = patient_deaths.rename(columns={"DateOfEvent": "DeathDate"})
86
+ patient_deaths = patient_deaths.drop(columns=["EventType"])
87
+
88
+ # Merge patient details with deaths
89
+ patient_details = patient_details.merge(patient_deaths, on="PatientId", how="left")
90
+
91
+ ############################################################################
92
+ # Define training data windows
93
+ ############################################################################
94
+
95
+ # Create a column stating the latest date permitted based on events added to service data
96
+ patient_details["LatestPredictionDate"] = config["model_settings"][
97
+ "latest_date_before_bug_break"
98
+ ]
99
+
100
+ # Create a column stating when the events start again
101
+ patient_details["AfterGapStartDate"] = config["model_settings"][
102
+ "after_bug_fixed_start_date"
103
+ ]
104
+ patient_details["DataEndDate"] = config["model_settings"]["training_data_end_date"]
105
+
106
+ date_cols = [
107
+ "FirstSubmissionDate",
108
+ "MostRecentSubmissionDate",
109
+ "LatestPredictionDate",
110
+ "AfterGapStartDate",
111
+ "DataEndDate",
112
+ "DeathDate",
113
+ ]
114
+ patient_details[date_cols] = patient_details[date_cols].apply(
115
+ lambda x: pd.to_datetime(x, utc=True, format="mixed").dt.normalize(), axis=1
116
+ )
117
+
118
+ # Choose the earlier date out of the patient's last submission, death date and the latest
119
+ # COPD data events
120
+ patient_details["LatestPredictionDate"] = patient_details[
121
+ ["MostRecentSubmissionDate", "LatestPredictionDate", "DeathDate"]
122
+ ].min(axis=1)
123
+
124
+ patient_details["DataEndDate"] = patient_details[
125
+ ["MostRecentSubmissionDate", "DataEndDate", "DeathDate"]
126
+ ].min(axis=1)
127
+
128
+ # Calculate the latest date that the index date can be for each patient
129
+ patient_details["LatestIndexDate"] = patient_details[
130
+ "LatestPredictionDate"
131
+ ] - pd.DateOffset(days=config["model_settings"]["prediction_window"])
132
+
133
+ patient_details["LatestIndexAfterGap"] = patient_details["DataEndDate"] - pd.DateOffset(
134
+ days=config["model_settings"]["prediction_window"]
135
+ )
136
+
137
+ # Offset the start of the data window by the lookback period (6 months) to allow enough history
138
+ patient_details["EarliestIndexDate"] = patient_details[
139
+ "FirstSubmissionDate"
140
+ ] + pd.DateOffset(days=config["model_settings"]["lookback_period"])
141
+
142
+ patient_details["EarliestIndexAfterGap"] = patient_details[
143
+ "AfterGapStartDate"
144
+ ] + pd.DateOffset(days=config["model_settings"]["lookback_period"])
145
+
146
+ # Remove any patients for whom the index start date overlaps the last index
147
+ # date, i.e. they have too short a window of data
148
+ print("Number of total patients", len(patient_details))
149
+ print(
150
+ "Number of patients with too short a window of data:",
151
+ len(
152
+ patient_details[
153
+ patient_details["EarliestIndexDate"] >= patient_details["LatestIndexDate"]
154
+ ]
155
+ ),
156
+ )
157
+ patient_details = patient_details[
158
+ patient_details["EarliestIndexDate"] < patient_details["LatestIndexDate"]
159
+ ]
160
+
161
+ patient_details["DatesAfterGap"] = np.where(
162
+ patient_details["EarliestIndexAfterGap"] > patient_details["LatestIndexAfterGap"],
163
+ False,
164
+ True,
165
+ )
166
+
167
+ # Calculate length in service
168
+ patient_details["FirstLength"] = (
169
+ patient_details["LatestIndexDate"] - patient_details["EarliestIndexDate"]
170
+ ).dt.days
171
+ patient_details["SecondLength"] = (
172
+ patient_details["LatestIndexAfterGap"] - patient_details["EarliestIndexAfterGap"]
173
+ ).dt.days
174
+ patient_details["LengthInService"] = np.where(
175
+ patient_details["DatesAfterGap"] == True,
176
+ patient_details["FirstLength"] + patient_details["SecondLength"],
177
+ patient_details["FirstLength"],
178
+ )
179
+ patient_details["TotalLength1"] = (
180
+ patient_details["LatestPredictionDate"] - patient_details["FirstSubmissionDate"]
181
+ ).dt.days
182
+ patient_details["TotalLength2"] = (
183
+ patient_details["DataEndDate"] - patient_details["AfterGapStartDate"]
184
+ ).dt.days
185
+ patient_details["TotalLengthInService"] = np.where(
186
+ patient_details["DatesAfterGap"] == True,
187
+ patient_details["TotalLength1"] + patient_details["TotalLength2"],
188
+ patient_details["TotalLength1"],
189
+ )
190
+
191
+ # Save patient details
192
+ patient_details.to_pickle(
193
+ os.path.join(config["outputs"]["output_data_dir"], "patient_details.pkl")
194
+ )
195
+
196
+ # List of remaining patients
197
+ model_patients = list(patient_details.PatientId.unique())
198
+ model_study_ids = list(patient_details.StudyId.unique())
199
+
200
+ print(
201
+ "Model cohort: {} patients. {} RECEIVER and {} SU".format(
202
+ len(model_patients),
203
+ len(patient_details[patient_details["StudyId"].str.startswith("RC")]),
204
+ len(patient_details[patient_details["StudyId"].str.startswith("SU")]),
205
+ )
206
+ )
207
+
208
+ df1 = patient_details[
209
+ [
210
+ "PatientId",
211
+ "DateOfBirth",
212
+ "Sex",
213
+ "StudyId",
214
+ "FirstSubmissionDate",
215
+ "EarliestIndexDate",
216
+ "LatestIndexDate",
217
+ "LatestPredictionDate",
218
+ "AfterGapStartDate",
219
+ "EarliestIndexAfterGap",
220
+ "LatestIndexAfterGap",
221
+ "DataEndDate",
222
+ ]
223
+ ].copy()
224
+ df2 = df1.copy()
225
+
226
+ # Create a row per day between the FirstSubmissionDate and the LatestPredictionDate
227
+ df1["DateOfEvent"] = df1.apply(
228
+ lambda x: pd.date_range(x.FirstSubmissionDate, x.LatestPredictionDate, freq="D"),
229
+ axis=1,
230
+ )
231
+ # Create a row per day between AfterGapStartDate and DataEndDate
232
+ df2["DateOfEvent"] = df2.apply(
233
+ lambda x: pd.date_range(x.AfterGapStartDate, x.DataEndDate, freq="D"),
234
+ axis=1,
235
+ )
236
+
237
+ # Combine dfs from before and after the time gap where exac data is unreliable
238
+ df1 = df1.explode("DateOfEvent").reset_index(drop=True)
239
+ df2 = df2.explode("DateOfEvent").reset_index(drop=True)
240
+ df2 = df2.dropna(subset=["DateOfEvent"])
241
+ df = pd.concat([df1, df2])
242
+ df = df.sort_values(by=["StudyId", "DateOfEvent"])
243
+
244
+ ############################################################################
245
+ # Extract hospital exacerbations and admissions from COPD service data
246
+ ############################################################################
247
+
248
+ # Contains exacerbations among other event types
249
+ patient_events = pd.read_csv(
250
+ config["inputs"]["raw_data_paths"]["patient_events"],
251
+ delimiter="|",
252
+ usecols=["PatientId", "DateOfEvent", "EventType"],
253
+ )
254
+
255
+ # Filter for only patients in model cohort
256
+ patient_events = patient_events[patient_events.PatientId.isin(model_patients)]
257
+
258
+ # Identify hospital exacerbation events
259
+ patient_events["IsHospExac"] = model_h.define_service_exac_event(
260
+ events=patient_events.EventType, include_community=False
261
+ )
262
+
263
+ # Identify hospital admissions (all causes)
264
+ patient_events["IsHospAdmission"] = model_h.define_hospital_admission(
265
+ patient_events.EventType
266
+ )
267
+
268
+ admissions = patient_events[patient_events.IsHospAdmission == 1][
269
+ ["PatientId", "DateOfEvent", "IsHospAdmission"]
270
+ ]
271
+ hosp_exacs = patient_events[patient_events.IsHospExac == 1][
272
+ ["PatientId", "DateOfEvent", "IsHospExac"]
273
+ ]
274
+ admissions["DateOfEvent"] = pd.to_datetime(
275
+ admissions.DateOfEvent, utc=True
276
+ ).dt.normalize()
277
+ hosp_exacs["DateOfEvent"] = pd.to_datetime(
278
+ hosp_exacs.DateOfEvent, utc=True
279
+ ).dt.normalize()
280
+
281
+ hosp_exacs = hosp_exacs.drop_duplicates()
282
+ admissions = admissions.drop_duplicates()
283
+
284
+ # Save hospital exacerbations and admissions data
285
+ hosp_exacs.to_pickle(
286
+ os.path.join(config["outputs"]["output_data_dir"], "hospital_exacerbations.pkl")
287
+ )
288
+ admissions.to_pickle(
289
+ os.path.join(config["outputs"]["output_data_dir"], "hospital_admissions.pkl")
290
+ )
291
+
292
+ ############################################################################
293
+ # Extract patient reported exacerbation events
294
+ ############################################################################
295
+
296
+ ########################
297
+ # Data post Q5 change
298
+ #######################
299
+
300
+ # Read file containing patient reported events (not patient_events because it contains
301
+ # the dates when patients answered PROs and not which date they reported as having taken
302
+ # their rescue meds)
303
+ symptom_diary = pd.read_csv(
304
+ config["inputs"]["raw_data_paths"]["pro_symptom_diary"],
305
+ usecols=[
306
+ "PatientId",
307
+ "StudyId",
308
+ "Score",
309
+ "SubmissionTime",
310
+ "SymptomDiaryQ5",
311
+ "SymptomDiaryQ11a",
312
+ "SymptomDiaryQ11b",
313
+ ],
314
+ delimiter="|",
315
+ )
316
+
317
+ Q5ChangeDate = pd.to_datetime(config["model_settings"]["pro_q5_change_date"], utc=True)
318
+ symptom_diary = model_h.filter_symptom_diary(
319
+ df=symptom_diary, date_cutoff=Q5ChangeDate, patients=model_patients
320
+ )
321
+
322
+ weekly_pros = model_h.get_rescue_med_pro_responses(symptom_diary)
323
+ weekly_pros = model_h.set_pro_exac_dates(weekly_pros)
324
+ weekly_pros = weekly_pros[
325
+ [
326
+ "PatientId",
327
+ "Q5Answered",
328
+ "NegativeQ5",
329
+ "IsCommExac",
330
+ "DateOfEvent",
331
+ "ExacDateUnknown",
332
+ ]
333
+ ]
334
+
335
+ #########################
336
+ # Pre Q5 change events
337
+ #########################
338
+
339
+ # RECEIVER cohort - community events verified up to 16/03/21
340
+ receiver = pd.read_excel(
341
+ config["inputs"]["raw_data_paths"]["receiver_community_verified_events"]
342
+ )
343
+ receiver = receiver.rename(
344
+ columns={"Study number": "StudyId", "Exacerbation recorded": "DateRecorded"}
345
+ )
346
+ receiver_exacs = model_h.extract_clinician_verified_exacerbations(receiver)
347
+
348
+ # Scale up cohort - community events verified up to 17/05/2021
349
+ scaleup = pd.read_excel(
350
+ config["inputs"]["raw_data_paths"]["scale_up_community_verified_events"]
351
+ )
352
+ scaleup = scaleup.rename(
353
+ columns={"Study Number": "StudyId", "Date Exacerbation recorded": "DateRecorded"}
354
+ )
355
+ scaleup["StudyId"] = scaleup["StudyId"].ffill()
356
+ scaleup_exacs = model_h.extract_clinician_verified_exacerbations(scaleup)
357
+
358
+ # Combine RECEIVER and scale up events into one df
359
+ verified_exacs = pd.concat([receiver_exacs, scaleup_exacs])
360
+ verified_exacs = verified_exacs[verified_exacs.StudyId.isin(model_study_ids)]
361
+
362
+ ####################################################################################
363
+ # Merge hospital and patient reported events with daily patient records
364
+ #
365
+ # Exacerbations occurring in Lenus service period include verified clinician events
366
+ # pre-April 2021 (after onboarding) and community exacerbations recorded in weekly
367
+ # PROs post-April 2021. Hospital exacerbations include exacerbations occurring during
368
+ # service period.
369
+ #####################################################################################
370
+
371
+ # Patient reported, clinician verified
372
+ df = df.merge(verified_exacs, on=["StudyId", "DateOfEvent"], how="left")
373
+
374
+ # Patient reported, new rescue med PRO (April 2021 onwards)
375
+ df = df.merge(weekly_pros, on=["PatientId", "DateOfEvent"], how="left")
376
+
377
+ # Hospital exacerbations
378
+ df = df.merge(hosp_exacs, on=["PatientId", "DateOfEvent"], how="left")
379
+ df = model_h.fill_column_by_patient(df=df, id_col="PatientId", col="StudyId")
380
+
381
+ # Hospital admissions
382
+ df = df.merge(admissions, on=["PatientId", "DateOfEvent"], how="left")
383
+ df = model_h.fill_column_by_patient(df=df, id_col="PatientId", col="StudyId")
384
+
385
+ # Combine cols from individual datasets into one
386
+ df["ExacDateUnknown"] = np.where(
387
+ (df.ExacDateUnknown_x == 1) | (df.ExacDateUnknown_y == 1), 1, 0
388
+ )
389
+ df["IsCommExac"] = np.where((df.IsCommExac_x == 1) | (df.IsCommExac_y == 1), 1, 0)
390
+
391
+ # Column for whether an exacerbation of any kind occurred on each date. To be filtered
392
+ # using (PRO) LOGIC
393
+ df["IsExac"] = np.where((df.IsCommExac == 1) | (df.IsHospExac == 1), 1, 0)
394
+
395
+ # Resample the df to one day per patient starting from the earliest record
396
+ df = (
397
+ df.set_index("DateOfEvent")
398
+ .groupby("StudyId")
399
+ .resample("D")
400
+ .asfreq()
401
+ .drop("StudyId", axis=1)
402
+ .reset_index()
403
+ )
404
+
405
+ # Infill binary cols with zero where applicable
406
+ df[
407
+ [
408
+ "Q5Answered",
409
+ "NegativeQ5",
410
+ "IsHospExac",
411
+ "IsCommExac",
412
+ "ExacDateUnknown",
413
+ "IsExac",
414
+ "IsHospAdmission",
415
+ ]
416
+ ] = df[
417
+ [
418
+ "Q5Answered",
419
+ "NegativeQ5",
420
+ "IsHospExac",
421
+ "IsCommExac",
422
+ "ExacDateUnknown",
423
+ "IsExac",
424
+ "IsHospAdmission",
425
+ ]
426
+ ].fillna(
427
+ 0
428
+ )
429
+
430
+ # Infill some columns by StudyId to populate entire df
431
+ df = model_h.fill_column_by_patient(df=df, id_col="StudyId", col="FirstSubmissionDate")
432
+ df = model_h.fill_column_by_patient(df=df, id_col="StudyId", col="LatestPredictionDate")
433
+ df = model_h.fill_column_by_patient(df=df, id_col="StudyId", col="PatientId")
434
+
435
+ print("Starting number of exacerbations: {}".format(df.IsExac.sum()))
436
+ print(
437
+ "Number of exacerbations during COPD service: {}".format(
438
+ len(df[(df.IsExac == 1) & (df.DateOfEvent >= df.FirstSubmissionDate)])
439
+ )
440
+ )
441
+ print(
442
+ "Number of unique exacerbation patients: {}".format(
443
+ len(df[df.IsExac == 1].PatientId.unique())
444
+ )
445
+ )
446
+ print(
447
+ "Exacerbation breakdown: {} hospital, {} patient reported and {} overlapping".format(
448
+ df.IsHospExac.sum(),
449
+ df.IsCommExac.sum(),
450
+ len(df.loc[(df.IsCommExac == 1) & (df.IsHospExac == 1)]),
451
+ )
452
+ )
453
+ print(
454
+ "Number of hospital exacerbations during COPD service: {} ({} unique patients)".format(
455
+ len(df[(df.IsHospExac == 1) & (df.DateOfEvent >= df.FirstSubmissionDate)]),
456
+ len(
457
+ df[
458
+ (df.IsHospExac == 1) & (df.DateOfEvent >= df.FirstSubmissionDate)
459
+ ].StudyId.unique()
460
+ ),
461
+ )
462
+ )
463
+ print(
464
+ "Clinician verified community exacerbations during COPD service: {} ({} unique patients)".format(
465
+ len(df[df.IsCommExac_x == 1]), len(df[df.IsCommExac_x == 1].StudyId.unique())
466
+ )
467
+ )
468
+ print(
469
+ "Community exacerbations from weekly PROs: {} ({} unique patients)".format(
470
+ len(df[df.IsCommExac_y == 1]), len(df[df.IsCommExac_y == 1].StudyId.unique())
471
+ )
472
+ )
473
+ print(
474
+ "Number of patient reported exacerbations with unknown dates: {} ({} overlapping\
475
+ with hospital events)".format(
476
+ df.ExacDateUnknown.sum(),
477
+ len(df[(df.IsHospExac == 1) & (df.ExacDateUnknown == 1)]),
478
+ )
479
+ )
480
+
481
+ # Check for any patient reported events with unknown dates that occurred on the same day
482
+ # as a hospital event. Hospital events are trusted so set the date to known
483
+ df.loc[(df.IsCommExac == 1) & (df.IsHospExac == 1), "ExacDateUnknown"] = 0
484
+ print("Remaining exacerbations with unknown dates: {}".format(df.ExacDateUnknown.sum()))
485
+
486
+ df = df.drop(
487
+ columns=["IsCommExac_x", "IsCommExac_y", "ExacDateUnknown_x", "ExacDateUnknown_y"]
488
+ )
489
+
490
+ ############################################################################
491
+ # Implement PRO LOGIC on hospital and patient reported exacerbation events
492
+ ############################################################################
493
+
494
+ # Define min and max days for PRO LOGIC. No predictions made or data used within
495
+ # logic_min_days after an exacerbation. Events falling between logic_min_days and
496
+ # logic_max_days after an event are subject to the weekly rescue med LOGIC criterion
497
+ logic_min_days = config["model_settings"]["pro_logic_min_days_after_exac"]
498
+ logic_max_days = config["model_settings"]["pro_logic_max_days_after_exac"]
499
+
500
+ # Calculate the days since the previous exacerbation for all patient days.
501
+ df = (
502
+ df.groupby("StudyId")
503
+ .apply(
504
+ lambda x: model_h.calculate_days_since_last_event(
505
+ df=x, event_col="IsExac", output_col="DaysSinceLastExac"
506
+ )
507
+ )
508
+ .reset_index(drop=True)
509
+ )
510
+
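`model_h.calculate_days_since_last_event` lives in the core library and is not shown in this diff. A minimal sketch of one way such a helper could work on a daily-resampled, single-patient frame — forward-filling the date of the most recent event and differencing — with hypothetical names throughout (the real helper may differ, e.g. by excluding the current day):

```python
import numpy as np
import pandas as pd


def days_since_last_event(df, event_col, output_col):
    # Carry forward the date of the most recent event (inclusive of the
    # current day), then subtract it from each row's date.
    out = df.copy()
    last_event = out["DateOfEvent"].where(out[event_col] == 1).ffill()
    out[output_col] = (out["DateOfEvent"] - last_event).dt.days
    return out


daily = pd.DataFrame(
    {
        "DateOfEvent": pd.date_range("2021-01-01", periods=5, freq="D"),
        "IsExac": [0, 1, 0, 0, 1],
    }
)
daily = days_since_last_event(daily, "IsExac", "DaysSinceLastExac")
print(daily["DaysSinceLastExac"].tolist())
```

Days before any event stay NaN, which conveniently exempts them from the `minimum_period_between_exacerbations` exclusion applied next.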
511
+ # Apply exclusion period following all exacerbations
512
+ df["RemoveRow"] = model_h.minimum_period_between_exacerbations(
513
+ df, minimum_days=logic_min_days
514
+ )
515
+
516
+ # Do not remove hospital exacerbations even if they get flagged up by PRO logic
517
+ df["RemoveRow"] = np.where(df["IsHospExac"] == 1, 0, df["RemoveRow"])
518
+
519
+ print(
520
+ "Number of community exacerbations excluded by PRO LOGIC {} day criterion: {}".format(
521
+ logic_min_days, len(df[(df.IsExac == 1) & (df.RemoveRow == 1)])
522
+ )
523
+ )
524
+
525
+ # Apply criterion for negative weekly Q5 responses - doesn't capture anything post Q5
526
+ # change
527
+ consecutive_replies = config["model_settings"]["neg_consecutive_q5_replies"]
528
+ df = model_h.apply_logic_response_criterion(
529
+ df,
530
+ minimum_period=logic_min_days,
531
+ maximum_period=logic_max_days,
532
+ N=consecutive_replies,
533
+ )
534
+
535
+ # Do not remove hospital exacerbations even if they get flagged up by PRO logic
536
+ df["RemoveExac"] = np.where(df["IsHospExac"] == 1, 0, df["RemoveExac"])
537
+
538
+ print(
539
+ "Weekly rescue med (Q5) criterion applied to events occurring between {} and {} \
540
+ days after a previous event. {} consecutive negative replies required for the event to \
541
+ count as a new event".format(
542
+ logic_min_days, logic_max_days, consecutive_replies
543
+ )
544
+ )
545
+ print(
546
+ "Number of exacerbations excluded by PRO LOGIC Q5 response criterion: {}".format(
547
+ df.RemoveExac.sum()
548
+ )
549
+ )
550
+ print(
551
+ "Earliest and latest exacerbations excluded: {}, {}".format(
552
+ df[df.RemoveExac == 1].DateOfEvent.min(),
553
+ df[df.RemoveExac == 1].DateOfEvent.max(),
554
+ )
555
+ )
556
+ print(
557
+ "Remaining number of exacerbations: {}".format(
558
+ len(df[(df.IsExac == 1) & (df.RemoveRow != 1) & (df.RemoveExac != 1)])
559
+ )
560
+ )
561
+ print(
562
+ "Remaining exacerbations with unknown dates: {}".format(
563
+ len(df[(df.ExacDateUnknown == 1) & (df.RemoveRow != 1) & (df.RemoveExac != 1)])
564
+ )
565
+ )
566
+
567
+ # Remove data between segments of prolonged events, count only first occurrence
568
+ df = model_h.remove_data_between_exacerbations(df)
569
+
570
+ # Remove 7 days before each reported exacerbation within unknown date (meds in last week)
571
+ df = model_h.remove_unknown_date_exacerbations(df)
572
+
573
+ # Remove rows flagged as to remove
574
+ df = df[df["RemoveRow"] != 1]
575
+
576
+ # New df with unwanted rows removed for events breakdown.
577
+ print("---Final exacerbation counts---")
578
+ print("Final number of exacerbations: {}".format(df.IsExac.sum()))
579
+ exac_patients = pd.Series(df[df.IsExac == 1].StudyId.unique())
580
+ print(
581
+ "Number of unique exacerbation patients: {} ({} RC and {} SU)".format(
582
+ len(exac_patients),
583
+ exac_patients.str.startswith("RC").sum(),
584
+ exac_patients.str.startswith("SU").sum(),
585
+ )
586
+ )
587
+ print(
588
+ "Exacerbation breakdown: {} hospital, {} patient reported and {} overlapping".format(
589
+ df.IsHospExac.sum(),
590
+ df.IsCommExac.sum(),
591
+ len(df.loc[(df.IsCommExac == 1) & (df.IsHospExac == 1)]),
592
+ )
593
+ )
594
+ df.to_pickle(os.path.join(config["outputs"]["output_data_dir"], "hosp_comm_exacs.pkl"))
595
+
596
+ ############################################################################
597
+ # Calculate the number of rows to include per patient in the dataset. This
598
+ # is calculated based on the average number of exacerbations per patient and
599
+ # is then adjusted to the average time within the service
600
+ ############################################################################
601
+
602
+ # Calculate the average time patients have data recorded in the COPD service
603
+ service_time = patient_details[["StudyId", "TotalLengthInService"]]
604
+ service_time = service_time.drop_duplicates(subset="StudyId", keep="first")
605
+ print(service_time)
606
+
607
+ avg_service_time = sum(service_time["TotalLengthInService"]) / len(
608
+ service_time["TotalLengthInService"]
609
+ )
610
+ avg_service_time_months = round(avg_service_time / 30)
611
+ print("Average time in service (days):", avg_service_time)
612
+ print("Average time in service (months):", avg_service_time_months)
613
+
614
+ # Calculate the average number of exacerbations per patient
615
+ avg_exac_per_patient = round(
616
+ len(df[df["IsExac"] == 1]) / df["StudyId"].nunique(), 2
617
+ )
618
+ print(
619
+ "Number of exac/patient/months: {} exacerbations/patient in {} months".format(
620
+ avg_exac_per_patient, avg_service_time_months
621
+ )
622
+ )
623
+ print(
624
+ "On average, 1 exacerbation occurs in a patient every: {} months".format(
625
+ round(avg_service_time_months / avg_exac_per_patient, 2)
626
+ )
627
+ )
628
+
629
+ #################################################################
630
+ # Calculate index dates. 1 row/patient for every 5 months in service.
631
+ #################################################################
632
+
633
+ # Obtain the number of rows required per patient.
634
+ service_time["NumRows"] = round(
635
+ service_time["TotalLengthInService"]
636
+ / config["model_settings"]["one_row_per_days_in_service"]
637
+ ).astype("int")
638
+ # If patient has not been in service for 5 months, make sure they have 1 row
639
+ service_time["NumRows"] = np.where(
640
+ service_time["NumRows"] < 1, 1, service_time["NumRows"]
641
+ )
642
+ patient_details = pd.merge(
643
+ patient_details, service_time[["StudyId", "NumRows"]], on="StudyId", how="left"
644
+ )
645
+
646
+ # Calculate the number of days between earliest and latest index
647
+ patient_details["NumDaysPossibleIndex"] = (
648
+ patient_details["LatestIndexDate"] - patient_details["EarliestIndexDate"]
649
+ ).dt.days
650
+
651
+ patient_details["NumDaysPossibleIndex2"] = (
652
+ patient_details["LatestIndexAfterGap"] - patient_details["EarliestIndexAfterGap"]
653
+ ).dt.days
654
+
655
+ patient_details.to_csv(
656
+ os.path.join(
657
+ config["outputs"]["output_data_dir"], "pat_details_to_calc_index_dt.csv"
658
+ ),
659
+ index=False,
660
+ )
661
+
662
+ # Generate random index dates
663
+ # Multiple seeds tested to identify the random index dates that give a good
664
+ # distribution across months. Seed chosen as 2188398760 from check_index_date_dist.py
665
+ random_seed_general = config["model_settings"]["index_date_generation_master_seed"]
666
+ random.seed(random_seed_general)
667
+
668
+ # Create different random seeds for each patient
669
+ patient_details["RandomSeed"] = random.sample(range(0, 2**32), patient_details.shape[0])
670
+
671
+ # Create random index dates for each patient based on their random seed
672
+ rand_days_dict = {}
673
+ rand_date_dict = {}
674
+ for index, row in patient_details.iterrows():
675
+ np.random.seed(row["RandomSeed"])
676
+ rand_days_dict[row["StudyId"]] = np.random.choice(
677
+ row["LengthInService"], size=row["NumRows"], replace=False
678
+ )
679
+ rand_date_dict[row["StudyId"]] = [
680
+ (
681
+ row["EarliestIndexDate"] + timedelta(days=int(day))
682
+ if day <= row["NumDaysPossibleIndex"]
683
+ else row["EarliestIndexAfterGap"]
684
+ + timedelta(days=int(day - row["NumDaysPossibleIndex"]))
685
+ )
686
+ for day in rand_days_dict[row["StudyId"]]
687
+ ]
688
+
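The seeding scheme above — one master seed from config, a derived seed per patient, then `np.random.choice` without replacement over that patient's days in service — can be sketched as a standalone helper. `sample_index_day_offsets` and its arguments are hypothetical names, the master seed value is the one quoted in the comment above, and unlike the script this sketch ignores the pre/post-gap window split:

```python
import random

import numpy as np

MASTER_SEED = 2188398760  # seed quoted in the comment above


def sample_index_day_offsets(patient_lengths, num_rows, master_seed=MASTER_SEED):
    # Derive an independent seed per patient from one master seed so each
    # patient's index-date draws are reproducible in isolation.
    rng = random.Random(master_seed)
    per_patient_seeds = {pid: rng.randrange(2**32) for pid in patient_lengths}

    offsets = {}
    for pid, length in patient_lengths.items():
        np.random.seed(per_patient_seeds[pid])
        # Distinct day offsets within the patient's service window
        offsets[pid] = np.random.choice(length, size=num_rows[pid], replace=False)
    return offsets


offs = sample_index_day_offsets({"RC01": 300, "SU01": 150}, {"RC01": 2, "SU01": 1})
```

Re-running the helper with the same master seed reproduces identical offsets per patient, which is what makes the seed search in `check_index_date_dist.py` meaningful.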
689
+ # Create df from dictionaries containing random index dates
690
+ index_date_df = pd.DataFrame.from_dict(rand_date_dict, orient="index").reset_index()
691
+ index_date_df = index_date_df.rename(columns={"index": "StudyId"})
692
+
693
+ # Convert the multiple columns containing index dates to one column
694
+ index_date_df = (
695
+ pd.melt(index_date_df, id_vars=["StudyId"], value_name="IndexDate")
696
+ .drop(["variable"], axis=1)
697
+ .sort_values(by=["StudyId", "IndexDate"])
698
+ )
699
+ index_date_df = index_date_df.dropna()
700
+ index_date_df = index_date_df.reset_index(drop=True)
701
+
702
+ # Join index dates with exacerbation events
703
+ exac_events = pd.merge(index_date_df, df, on="StudyId", how="left")
704
+ exac_events["IndexDate"] = pd.to_datetime(exac_events["IndexDate"], utc=True)
705
+
706
+ # Calculate whether an exacerbation event occurred within the model time window (3 months)
707
+ # after the index date
708
+ exac_events["TimeToEvent"] = (
709
+ exac_events["DateOfEvent"] - exac_events["IndexDate"]
710
+ ).dt.days
711
+ exac_events["ExacWithin3Months"] = np.where(
712
+ (
713
+ exac_events["TimeToEvent"].between(
714
+ 1, config["model_settings"]["prediction_window"], inclusive="both"
715
+ )
716
+ )
717
+ & (exac_events["IsExac"] == 1),
718
+ 1,
719
+ 0,
720
+ )
721
+ exac_events["HospExacWithin3Months"] = np.where(
722
+ (
723
+ exac_events["TimeToEvent"].between(
724
+ 1, config["model_settings"]["prediction_window"], inclusive="both"
725
+ )
726
+ )
727
+ & (exac_events["IsHospExac"] == 1),
728
+ 1,
729
+ 0,
730
+ )
731
+ exac_events["CommExacWithin3Months"] = np.where(
732
+ (
733
+ exac_events["TimeToEvent"].between(
734
+ 1, config["model_settings"]["prediction_window"], inclusive="both"
735
+ )
736
+ )
737
+ & (exac_events["IsCommExac"] == 1),
738
+ 1,
739
+ 0,
740
+ )
741
+
742
+ exac_events = exac_events.sort_values(
743
+ by=["StudyId", "IndexDate", "ExacWithin3Months"], ascending=[True, True, False]
744
+ )
745
+ exac_events = exac_events.drop_duplicates(subset=["StudyId", "IndexDate"], keep="first")
746
+ exac_events = exac_events[
747
+ [
748
+ "StudyId",
749
+ "PatientId",
750
+ "IndexDate",
751
+ "DateOfBirth",
752
+ "Sex",
753
+ "ExacWithin3Months",
754
+ "HospExacWithin3Months",
755
+ "CommExacWithin3Months",
756
+ ]
757
+ ]
758
+
759
+ # Save exac_events
760
+ exac_events.to_pickle(
761
+ os.path.join(config["outputs"]["output_data_dir"], "patient_labels_hosp_comm.pkl")
762
+ )
763
+
764
+ # Summary info
765
+ class_distribution = (
766
+ exac_events.groupby("ExacWithin3Months").count()[["StudyId"]].reset_index()
767
+ )
768
+ class_distribution.plot.bar(x="ExacWithin3Months", y="StudyId")
769
+ plt.savefig(
770
+ "./plots/class_distributions/final_seed_"
771
+ + str(random_seed_general)
772
+ + "_class_distribution_hosp_comm.png",
773
+ bbox_inches="tight",
774
+ )
775
+
776
+ print("---Summary info after setting up labels---")
777
+ print("Number of unique patients:", exac_events["StudyId"].nunique())
778
+ print("Number of rows:", len(exac_events))
779
+ print(
780
+ "Number of exacerbations within 3 months of index date:",
781
+ len(exac_events[exac_events["ExacWithin3Months"] == 1]),
782
+ )
783
+ print(
784
+ "Percentage positive class (num exac/total rows): {} %".format(
785
+ round(
786
+ (len(exac_events[exac_events["ExacWithin3Months"] == 1]) / len(exac_events))
787
+ * 100,
788
+ 2,
789
+ )
790
+ )
791
+ )
792
+ print(
793
+ "Percentage negative class: {} %".format(
794
+ round(
795
+ (len(exac_events[exac_events["ExacWithin3Months"] == 0]) / len(exac_events))
796
+ * 100,
797
+ 2,
798
+ )
799
+ )
800
+ )
801
+ print(
802
+ "Percentage hospital exacs: {} %".format(
803
+ round(
804
+ (
805
+ len(exac_events[exac_events["HospExacWithin3Months"] == 1])
806
+ / len(exac_events)
807
+ )
808
+ * 100,
809
+ 2,
810
+ )
811
+ )
812
+ )
813
+ print(
814
+ "Percentage community exacs: {} %".format(
815
+ round(
816
+ (
817
+ len(exac_events[exac_events["CommExacWithin3Months"] == 1])
818
+ / len(exac_events)
819
+ )
820
+ * 100,
821
+ 2,
822
+ )
823
+ )
824
+ )
825
+ print("Class balance:")
826
+ print(class_distribution)
827
+
828
+ print("---Events based on dates---")
829
+ verified_events = exac_events[
830
+ exac_events["IndexDate"] <= pd.to_datetime("2021-11-30", utc=True)
831
+ ]
832
+ unverified_events = exac_events[
833
+ exac_events["IndexDate"] > pd.to_datetime("2021-11-30", utc=True)
834
+ ]
835
+ print("---Verified events---")
836
+ print(
837
+ "Percentage positive class (num exac/total rows): {} %".format(
838
+ round(
839
+ (
840
+ len(verified_events[verified_events["ExacWithin3Months"] == 1])
841
+ / len(verified_events)
842
+ )
843
+ * 100,
844
+ 2,
845
+ )
846
+ )
847
+ )
848
+ print(
849
+ "Percentage negative class: {} %".format(
850
+ round(
851
+ (
852
+ len(verified_events[verified_events["ExacWithin3Months"] == 0])
853
+ / len(verified_events)
854
+ )
855
+ * 100,
856
+ 2,
857
+ )
858
+ )
859
+ )
860
+ print(
861
+ "Percentage hospital exacs: {} %".format(
862
+ round(
863
+ (
864
+ len(verified_events[verified_events["HospExacWithin3Months"] == 1])
865
+ / len(verified_events)
866
+ )
867
+ * 100,
868
+ 2,
869
+ )
870
+ )
871
+ )
872
+ print(
873
+ "Percentage community exacs: {} %".format(
874
+ round(
875
+ (
876
+ len(verified_events[verified_events["CommExacWithin3Months"] == 1])
877
+ / len(verified_events)
878
+ )
879
+ * 100,
880
+ 2,
881
+ )
882
+ )
883
+ )
884
+ print("---Unverified events---")
885
+ print(
886
+ "Percentage positive class (num exac/total rows): {} %".format(
887
+ round(
888
+ (
889
+ len(unverified_events[unverified_events["ExacWithin3Months"] == 1])
890
+ / len(unverified_events)
891
+ )
892
+ * 100,
893
+ 2,
894
+ )
895
+ )
896
+ )
897
+ print(
898
+ "Percentage negative class: {} %".format(
899
+ round(
900
+ (
901
+ len(unverified_events[unverified_events["ExacWithin3Months"] == 0])
902
+ / len(unverified_events)
903
+ )
904
+ * 100,
905
+ 2,
906
+ )
907
+ )
908
+ )
909
+ print(
910
+ "Percentage hospital exacs: {} %".format(
911
+ round(
912
+ (
913
+ len(unverified_events[unverified_events["HospExacWithin3Months"] == 1])
914
+ / len(unverified_events)
915
+ )
916
+ * 100,
917
+ 2,
918
+ )
919
+ )
920
+ )
921
+ print(
922
+ "Percentage community exacs: {} %".format(
923
+ round(
924
+ (
925
+ len(unverified_events[unverified_events["CommExacWithin3Months"] == 1])
926
+ / len(unverified_events)
927
+ )
928
+ * 100,
929
+ 2,
930
+ )
931
+ )
932
+ )
933
+ print(
934
+ "Train date range", exac_events["IndexDate"].min(), exac_events["IndexDate"].max()
935
+ )
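The window-labeling step above can be exercised in isolation: a row is positive when an exacerbation event falls 1 to `prediction_window` days after its index date. A minimal sketch with toy data (column names follow the script; the dates are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy data: one patient, two candidate index dates, one exacerbation event
events = pd.DataFrame(
    {
        "StudyId": ["RC01", "RC01"],
        "IndexDate": pd.to_datetime(["2021-01-01", "2021-06-01"], utc=True),
        "DateOfEvent": pd.to_datetime(["2021-02-15", "2021-02-15"], utc=True),
        "IsExac": [1, 1],
    }
)

prediction_window = 90  # days, as in the 90-day model

# Days from index date to event; label positive only when the event falls
# 1..prediction_window days AFTER the index date (past events stay negative)
events["TimeToEvent"] = (events["DateOfEvent"] - events["IndexDate"]).dt.days
events["ExacWithin3Months"] = np.where(
    events["TimeToEvent"].between(1, prediction_window, inclusive="both")
    & (events["IsExac"] == 1),
    1,
    0,
)
print(events[["IndexDate", "TimeToEvent", "ExacWithin3Months"]])
```

The first index date (45 days before the event) is labelled positive; the second index date falls after the event, so its `TimeToEvent` is negative and the label stays 0.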
training/setup_labels_only_hosp.py ADDED
@@ -0,0 +1,338 @@
+ """
+ Script uses only hospital exacerbation events.
+
+ Collate all hospital events to determine the number of exacerbation events. Use
+ exacerbation events to determine the number of rows required per patient in the data,
+ generate random index dates, and set up labels.
+ """
+ import model_h
+ import numpy as np
+ import os
+ import pandas as pd
+ import sys
+ import matplotlib.pyplot as plt
+ from datetime import timedelta
+ import random
+
+ data_dir_service = "<YOUR_DATA_PATH>/copd-dataset"
+ data_dir_model = "./data"
+
+ # Setup log file
+ log = open("./training/logging/setup_labels_only_hosp.log", "w")
+ sys.stdout = log
+
+ # Model time window in days to predict exacerbation
+ model_time_window = 90
+
+ ############################################################################
+ # Define model cohort and training data windows
+ ############################################################################
+
+ # Read relevant info from patient details
+ patient_details = pd.read_csv(
+     os.path.join(data_dir_service, "CopdDatasetPatientDetails.txt"),
+     usecols=[
+         "PatientId",
+         "FirstSubmissionDate",
+         "MostRecentSubmissionDate",
+         "DateOfBirth",
+         "Sex",
+         "StudyId",
+     ],
+     delimiter="|",
+ )
+
+ # Select patients for inclusion (those with up-to-date events in service)
+ # Create list of patients for model inclusion
+ # Original RECEIVER cohort study id list
+ receiver_patients = ["RC{:02d}".format(i) for i in range(1, 85)]
+ # This patient needs removing
+ receiver_patients.remove("RC34")
+ # Scale up patients (subset)
+ scaleup_patients = ["SU{:02d}".format(i) for i in range(1, 219)]
+ # scaleup_patients.append('SU287')  # Removed as study ID contains 2 patients
+
+ # List of all valid patients for modelling
+ valid_patients = receiver_patients + scaleup_patients
+
+ # Filter for valid patients accounting for white spaces in StudyId (e.g. RC 26 and RC 52)
+ patient_details = patient_details[
+     patient_details.StudyId.str.replace(" ", "").isin(valid_patients)
+ ]
+ # Select only non-null entries in patient data start/end dates
+ patient_details = patient_details[
+     (patient_details.FirstSubmissionDate.notna())
+     & (patient_details.MostRecentSubmissionDate.notna())
+ ]
+
+ # Create a column stating the latest date permitted based on events added to service data
+ patient_details["LatestPredictionDate"] = "2022-02-28"
+
+ date_cols = ["FirstSubmissionDate", "MostRecentSubmissionDate", "LatestPredictionDate"]
+ patient_details[date_cols] = patient_details[date_cols].apply(
+     lambda x: pd.to_datetime(x, utc=True, format="mixed").dt.normalize(), axis=1
+ )
+ # Choose the earlier date out of the patient's last submission and the latest COPD data
+ # events
+ patient_details["LatestPredictionDate"] = patient_details[
+     ["MostRecentSubmissionDate", "LatestPredictionDate"]
+ ].min(axis=1)
+
+ # Calculate the latest date that the index date can be for each patient
+ patient_details["LatestIndexDate"] = patient_details[
+     "LatestPredictionDate"
+ ] - pd.DateOffset(days=model_time_window)
+
+ # Add 6 months to start of data window to allow enough of a lookback period
+ patient_details["EarliestIndexDate"] = patient_details[
+     "FirstSubmissionDate"
+ ] + pd.DateOffset(days=180)
+
+ # Remove any patients for whom the earliest index date overlaps the latest index
+ # date, i.e. they have too short a window of data
+ print("Number of total patients", len(patient_details))
+ print(
+     "Number of patients with too short a window of data:",
+     len(
+         patient_details[
+             patient_details["EarliestIndexDate"] > patient_details["LatestIndexDate"]
+         ]
+     ),
+ )
+ patient_details = patient_details[
+     patient_details["EarliestIndexDate"] < patient_details["LatestIndexDate"]
+ ]
+
+ # List of remaining patients
+ model_patients = list(patient_details.PatientId.unique())
+ model_study_ids = list(patient_details.StudyId.unique())
+
+ print(
+     "Model cohort: {} patients. {} RECEIVER and {} SU".format(
+         len(model_patients),
+         len(patient_details[patient_details["StudyId"].str.startswith("RC")]),
+         len(patient_details[patient_details["StudyId"].str.startswith("SU")]),
+     )
+ )
+
+ df = patient_details[
+     [
+         "PatientId",
+         "DateOfBirth",
+         "Sex",
+         "StudyId",
+         "FirstSubmissionDate",
+         "EarliestIndexDate",
+         "LatestIndexDate",
+         "LatestPredictionDate",
+     ]
+ ].copy()
+
+ ############################################################################
+ # Extract hospital exacerbations and admissions from COPD service data
+ ############################################################################
+
+ # Load hospital exacerbations and admissions data
+ hosp_exacs = pd.read_pickle(os.path.join(data_dir_model, "hospital_exacerbations.pkl"))
+ admissions = pd.read_pickle(os.path.join(data_dir_model, "hospital_admissions.pkl"))
+
+ # Merge hospital exacs and admissions data
+ hosp_exacs = hosp_exacs.merge(admissions, on=["PatientId", "DateOfEvent"], how="outer")
+
+ # Fill missing values in PatientId and StudyId using a lookup table
+ patient_id_lookup = patient_details[["PatientId", "StudyId"]]
+ hosp_exacs["StudyId"] = np.NaN
+ hosp_exacs["StudyId"] = np.where(
+     hosp_exacs.StudyId.isnull(),
+     hosp_exacs.PatientId.map(patient_id_lookup.set_index("PatientId").StudyId),
+     hosp_exacs.StudyId,
+ )
+ hosp_exacs = hosp_exacs.sort_values(
+     by=["StudyId", "DateOfEvent", "IsHospExac", "IsHospAdmission"],
+     ascending=[True, True, False, False],
+ )
+ exac_data = hosp_exacs.drop_duplicates(subset=["StudyId", "DateOfEvent"], keep="first")
+ exac_data.to_pickle(os.path.join(data_dir_model, "only_hosp_exacs.pkl"))
+
+ # Merge with patient details
+ exac_data = pd.merge(
+     exac_data,
+     df[["StudyId", "PatientId", "FirstSubmissionDate", "LatestPredictionDate"]],
+     on=["StudyId", "PatientId"],
+     how="left",
+ )
+
+ # Remove exacerbations before onboarding to COPD service
+ exac_data = exac_data[exac_data["DateOfEvent"] > exac_data["FirstSubmissionDate"]]
+
+ # Retain only dates before the end of each patient's data window
+ exac_data = exac_data[exac_data.DateOfEvent <= exac_data.LatestPredictionDate]
+ exac_data = exac_data.drop(columns=["FirstSubmissionDate", "LatestPredictionDate"])
+
+ df = pd.merge(df, exac_data, on=["StudyId", "PatientId"], how="left")
+ df = df.rename(columns={"IsHospExac": "IsExac"})
+
+ print("Starting number of exacerbations: {}".format(df.IsExac.sum()))
+ print(
+     "Number of unique exacerbation patients: {}".format(
+         len(df[df.IsExac == 1].PatientId.unique())
+     )
+ )
+ print(
+     "Hospital exacerbations: {} ({} unique patients)".format(
+         len(df[(df.IsExac == 1)]), len(df[(df.IsExac == 1)].StudyId.unique())
+     )
+ )
+
+ #####################################################################
+ # Calculate the number of rows to include per patient in the dataset.
+ # This is calculated based on the average number of exacerbations per
+ # patient and is then adjusted to the average time within the service
+ #####################################################################
+
+ # Calculate the average time patients have data recorded in the COPD service
+ service_time = df[["StudyId", "LatestPredictionDate", "FirstSubmissionDate"]]
+ service_time = service_time.drop_duplicates(subset="StudyId", keep="first")
+ service_time["ServiceTime"] = (
+     service_time["LatestPredictionDate"] - service_time["FirstSubmissionDate"]
+ ).dt.days
+ avg_service_time = sum(service_time["ServiceTime"]) / len(service_time["ServiceTime"])
+ avg_service_time_months = round(avg_service_time / 30)
+ print("Average time in service (days):", avg_service_time)
+ print("Average time in service (months):", avg_service_time_months)
+
+ # Calculate the average number of exacerbations per patient
+ avg_exac_per_patient = round(
+     len(df[df["IsExac"] == 1]) / df[df["IsExac"] == 1][["StudyId"]].nunique().item(), 2
+ )
+ print(
+     "Number of exac/patient/months: {} exacerbations/patient in {} months".format(
+         avg_exac_per_patient, avg_service_time_months
+     )
+ )
+ print(
+     "On average, 1 exacerbation occurs in a patient every: {} months".format(
+         round(avg_service_time_months / avg_exac_per_patient, 2)
+     )
+ )
+
+ #################################################################
+ # Calculate index dates. 1 row/patient for every 6 months in service.
+ #################################################################
+
+ # Obtain the number of rows required per patient. One row per patient for every
+ # 6 months in service.
+ service_time["NumRows"] = round(service_time["ServiceTime"] / 180).astype("int")
+ patient_details = pd.merge(
+     patient_details, service_time[["StudyId", "NumRows"]], on="StudyId", how="left"
+ )
+
+ # Calculate the number of days between earliest and latest index
+ patient_details["NumDaysPossibleIndex"] = (
+     patient_details["LatestIndexDate"] - patient_details["EarliestIndexDate"]
+ ).dt.days
+ # patient_details['NumRows'] = patient_details['NumRows'].astype('int')
+ patient_details.to_csv("./data/pat_details_to_calc_index_dt.csv", index=False)
+
+ # Generate random index dates
+ # Multiple seeds tested to identify the random index dates that give a good
+ # distribution across months. Seed chosen as 2188398760 from check_index_date_dist.py
+ random_seed_general = 2188398760
+ random.seed(random_seed_general)
+
+ # Create different random seeds for each patient
+ patient_details["RandomSeed"] = random.sample(
+     range(0, 2**32), patient_details.shape[0]
+ )
+
+ # Create random index dates for each patient based on their random seed
+ rand_days_dict = {}
+ rand_date_dict = {}
+ for index, row in patient_details.iterrows():
+     np.random.seed(row["RandomSeed"])
+     rand_days_dict[row["StudyId"]] = np.random.choice(
+         row["NumDaysPossibleIndex"], size=row["NumRows"], replace=False
+     )
+     rand_date_dict[row["StudyId"]] = [
+         row["EarliestIndexDate"] + timedelta(days=int(day))
+         for day in rand_days_dict[row["StudyId"]]
+     ]
+
+ # Create df from dictionaries containing random index dates
+ index_date_df = pd.DataFrame.from_dict(rand_date_dict, orient="index").reset_index()
+ index_date_df = index_date_df.rename(columns={"index": "StudyId"})
+
+ # Convert the multiple columns containing index dates to one column
+ index_date_df = (
+     pd.melt(index_date_df, id_vars=["StudyId"], value_name="IndexDate")
+     .drop(["variable"], axis=1)
+     .sort_values(by=["StudyId", "IndexDate"])
+ )
+ index_date_df = index_date_df.dropna()
+ index_date_df = index_date_df.reset_index(drop=True)
+
+ # Join index dates with exacerbation events
+ exac_events = pd.merge(index_date_df, df, on="StudyId", how="left")
+ exac_events["IndexDate"] = pd.to_datetime(exac_events["IndexDate"], utc=True)
+
+ # Calculate whether an exacerbation event occurred within
+ # the model time window (3 months) after the index date
+ exac_events["TimeToEvent"] = (
+     exac_events["DateOfEvent"] - exac_events["IndexDate"]
+ ).dt.days
+ exac_events["ExacWithin3Months"] = np.where(
+     (exac_events["TimeToEvent"].between(1, model_time_window, inclusive="both"))
+     & (exac_events["IsExac"] == 1),
+     1,
+     0,
+ )
+ exac_events = exac_events.sort_values(
+     by=["StudyId", "IndexDate", "ExacWithin3Months"], ascending=[True, True, False]
+ )
+ exac_events = exac_events.drop_duplicates(subset=["StudyId", "IndexDate"], keep="first")
+ exac_events = exac_events[
+     ["StudyId", "PatientId", "IndexDate", "DateOfBirth", "Sex", "ExacWithin3Months"]
+ ]
+
+ # Save exac_events
+ exac_events.to_pickle(os.path.join(data_dir_model, "patient_labels_only_hosp.pkl"))
+
+ # Summary info
+ class_distribution = (
+     exac_events.groupby("ExacWithin3Months").count()[["StudyId"]].reset_index()
+ )
+ class_distribution.plot.bar(x="ExacWithin3Months", y="StudyId")
+ plt.title("Class distribution of hospital exacerbations occurring within 3 months")
+ plt.savefig(
+     "./plots/class_distributions/final_seed_"
+     + str(random_seed_general)
+     + "_class_distribution_only_hosp.png",
+     bbox_inches="tight",
+ )
+
+ print("---Summary info after setting up labels---")
+ print("Number of unique patients:", exac_events["StudyId"].nunique())
+ print("Number of rows:", len(exac_events))
+ print(
+     "Number of exacerbations within 3 months of index date:",
+     len(exac_events[exac_events["ExacWithin3Months"] == 1]),
+ )
+ print(
+     "Percentage positive class (num exac/total rows): {} %".format(
+         round(
+             (len(exac_events[exac_events["ExacWithin3Months"] == 1]) / len(exac_events))
+             * 100,
+             2,
+         )
+     )
+ )
+ print(
+     "Percentage negative class: {} %".format(
+         round(
+             (len(exac_events[exac_events["ExacWithin3Months"] == 0]) / len(exac_events))
+             * 100,
+             2,
+         )
+     )
+ )
+ print("Class balance:")
+ print(class_distribution)
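The per-patient seeding pattern used above (a global seed fixes per-patient seeds, each patient then draws distinct day offsets) can be sketched with a toy cohort. This is a minimal sketch: the patient IDs, window lengths, and earliest index date are invented; only the seeding mechanics mirror the script.

```python
import random
from datetime import date, timedelta

import numpy as np

# Toy cohort: StudyId -> (days between earliest and latest index, rows required)
patients = {"RC01": (300, 2), "SU01": (450, 3)}
earliest_index = date(2021, 1, 1)

# Global seed fixes the per-patient seeds, making the whole draw reproducible
random.seed(2188398760)
patient_seeds = dict(zip(patients, random.sample(range(0, 2**32), len(patients))))

index_dates = {}
for study_id, (num_days, num_rows) in patients.items():
    # Each patient draws distinct day offsets from their own seeded stream,
    # so adding/removing one patient does not reshuffle the others
    np.random.seed(patient_seeds[study_id])
    offsets = np.random.choice(num_days, size=num_rows, replace=False)
    index_dates[study_id] = [earliest_index + timedelta(days=int(d)) for d in offsets]

print(index_dates)
```

`replace=False` guarantees distinct index dates per patient, and re-running the script reproduces the exact same dates.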
training/split_train_test_val.py ADDED
@@ -0,0 +1,108 @@
+ """Splits the model H cohort into train, test and balanced cross validation folds.
+
+ The train set retains class ratio, sex and age distributions from the full dataset.
+ Patients can only appear in either train or test set.
+
+ This script also splits the train data into balanced folds for cross-validation. Patient
+ IDs for train, test and all data folds are stored for use in subsequent scripts.
+ """
+
+ import numpy as np
+ import os
+ import pandas as pd
+ import pickle
+ import sys
+ import yaml
+ import splitting
+
+ with open("./training/config.yaml", "r") as config:
+     config = yaml.safe_load(config)
+
+ ##############################################################
+ # Load correct files
+ ##############################################################
+ save_cohort_info = True
+
+ # Specify which model to perform split on
+ model_type = config["model_settings"]["model_type"]
+
+ # Setup log file
+ log = open("./training/logging/split_train_test_" + model_type + ".log", "w")
+ sys.stdout = log
+
+ demographics = pd.read_pickle(
+     os.path.join(
+         config["outputs"]["processed_data_dir"],
+         "demographics_{}.pkl".format(model_type),
+     )
+ )
+
+ ##############################################################
+ # Split data into a train and hold-out test set
+ ##############################################################
+ train_data, test_data = splitting.subject_wise_train_test_split(
+     data=demographics,
+     target_col="ExacWithin3Months",
+     id_col="StudyId",
+     test_size=0.2,
+     stratify_by_sex=True,
+     sex_col="Sex_F",
+     stratify_by_age=True,
+     age_bin_col="AgeBinned",
+ )
+ print(demographics.Sex_F.value_counts() / demographics.Sex_F.count())
+ print(train_data.Sex_F.value_counts() / train_data.Sex_F.count())
+ print(test_data.Sex_F.value_counts() / test_data.Sex_F.count())
+ print(demographics.Age.mean())
+ print(train_data.Age.mean())
+ print(test_data.Age.mean())
+
+ train_ids = train_data.StudyId.unique()
+ test_ids = test_data.StudyId.unique()
+ ##############################################################
+ # Split training data into groups for cross validation
+ ##############################################################
+ fold_patients = splitting.subject_wise_kfold_split(
+     train_data=train_data,
+     target_col="ExacWithin3Months",
+     id_col="StudyId",
+     num_folds=5,
+     sex_col="Sex_F",
+     age_col="Age",
+     stratify_by_sex=True,
+     print_log=True,
+ )
+
+ ##############################################################
+ # Save cohort info
+ ##############################################################
+ fold_patients = np.array(fold_patients, dtype="object")
+
+ if save_cohort_info:
+     # Save train and test ID info
+     os.makedirs(config["outputs"]["cohort_info_dir"], exist_ok=True)
+     with open(
+         os.path.join(
+             config["outputs"]["cohort_info_dir"], "test_ids_" + model_type + ".pkl"
+         ),
+         "wb",
+     ) as f:
+         pickle.dump(list(test_ids), f)
+     with open(
+         os.path.join(
+             config["outputs"]["cohort_info_dir"], "train_ids_" + model_type + ".pkl"
+         ),
+         "wb",
+     ) as f:
+         pickle.dump(list(train_ids), f)
+     print("Train and test patient IDs saved")
+
+     # Save cross validation fold info
+     np.save(
+         os.path.join(
+             config["outputs"]["cohort_info_dir"], "fold_patients_" + model_type + ".npy"
+         ),
+         fold_patients,
+         allow_pickle=True,
+     )
+     print("Cross validation fold information saved")
training/splitting.py ADDED
@@ -0,0 +1,292 @@
1
+ """Module to perform splitting of data into train/test or K-folds."""
2
+
3
+ import pandas as pd
4
+ import numpy as np
5
+ from sklearn.model_selection import StratifiedGroupKFold
6
+
7
+
8
+ def subject_wise_train_test_split(
9
+ data,
10
+ target_col,
11
+ id_col,
12
+ test_size,
13
+ stratify_by_sex=False,
14
+ sex_col=None,
15
+ stratify_by_age=False,
16
+ age_bin_col=None,
17
+ shuffle=False,
18
+ random_state=None,
19
+ ):
20
+ """Subject wise splitting data into train and test sets.
21
+
22
+ Splits data into train and test sets ensuring that the same patient can only appear in
23
+ either train or test set. Stratifies data according to class label, with additional
24
+ options to stratify by sex and age.
25
+
26
+ Parameters
27
+ ----------
28
+ data : pd.DataFrame
29
+ dataframe containing features and target.
30
+ target_col : str
31
+ name of target column.
32
+ id_col : str
33
+ name of patient id column.
34
+ test_size : float
35
+ represents the proportion of the dataset to include in the test split. Float should
36
+ be between 0 and 1.
37
+ stratify_by_sex : bool, optional
38
+ option to stratify data by sex, by default False.
39
+ sex_col : str, optional
40
+ name of sex column, by default None.
41
+ stratify_by_age : bool, optional
42
+ option to stratify data by age, by default False.
43
+ age_bin_col : str, optional
44
+ name of age column, by default None. Age column must be provided in binned format.
45
+ shuffle : bool, optional
46
+ whether to shuffle each class's samples before splitting into batches, by default
47
+ False.
48
+ random_state : int, optional
49
+ when shuffle is True, random_state affects the ordering of the indices, by default
50
+ None.
51
+
52
+ Returns
53
+ -------
54
+ train_data : pd.DataFrame
55
+ train data stratified by class. Also stratified by age/sex as specified in input
56
+ parameters.
57
+ test_data : pd.DataFrame
58
+ test data stratified by class. Also stratified by age/sex as specified in input
59
+ parameters.
60
+
61
+ Raises
62
+ -------
63
+ ValueError : error raised when boolean stratify_by_age or stratify_by_sex is True but
64
+ the respective columns are not provided for stratifying.
65
+
66
+ """
67
+ # Raise error if stratify_by_sex/stratify_by_age is True but the respective columns are
68
+ # not provided
69
+ if (stratify_by_age is True) & (age_bin_col is None):
70
+ raise ValueError(
71
+ "Parameter stratify_by_age is True but age_bin_col not provided."
72
+ )
73
+ if (stratify_by_sex is True) & (sex_col is None):
74
+ raise ValueError("Parameter stratify_by_sex is True but sex_col not provided.")
75
+
76
+ # Adapt target column to contain all variables to split by to allow stratified splitting
77
+ # by StratifiedGroupKFold.
78
+ if (stratify_by_sex is True) and (stratify_by_age is True):
79
+ data["TempTarget"] = (
80
+ data[target_col].astype(str) + data[sex_col].astype(str) + data[age_bin_col]
81
+ )
82
+ elif (stratify_by_sex is True) and (stratify_by_age is False):
83
+ data["TempTarget"] = data[target_col].astype(str) + data[sex_col].astype(str)
84
+ elif (stratify_by_sex is False) and (stratify_by_age is True):
85
+ data["TempTarget"] = data[target_col].astype(str) + data[sex_col].astype(str)
86
+ else:
87
+ data["TempTarget"] = data[target_col]
88
+ temp_target_col = "TempTarget"
89
+
90
+ # Calculate the number of folds to split data to using the size of the test data.
91
+ num_folds = round(1 / test_size)
92
+ sgkf = StratifiedGroupKFold(
93
+ n_splits=num_folds, shuffle=shuffle, random_state=random_state
94
+ )
95
+
96
+ # Retrieve the first split and create train and test dfs
97
+ train_test_splits = next(
98
+ sgkf.split(data, data[temp_target_col], groups=data[id_col])
99
+ )
100
+ train_ids = train_test_splits[0].tolist()
101
+ test_ids = train_test_splits[1].tolist()
102
+ train_data = data.iloc[train_ids]
103
+ test_data = data.iloc[test_ids]
104
+
105
+ # Drop temporary target column
106
+ train_data = train_data.drop(columns=temp_target_col)
107
+ test_data = test_data.drop(columns=temp_target_col)
108
+
109
+ return train_data, test_data
110
+
111
+
112
+ def subject_wise_kfold_split(
113
+ train_data,
114
+ target_col,
115
+ id_col,
116
+ num_folds,
117
+ sex_col=None,
118
+ age_col=None,
119
+ stratify_by_sex=False,
120
+ stratify_by_age=False,
121
+ age_bin_col=None,
122
+ shuffle=False,
123
+ random_state=None,
124
+ print_log=False,
125
+ ):
126
+ """Subject wise splitting data into balanced K-folds.
127
+
128
+ Splits data into K-folds ensuring that the same patient can only appear in
129
+ either train or validation set. Stratifies data according to class label, with additional
130
+ options to stratify by sex and age.
131
+
132
+ Parameters
133
+ ----------
134
+ train_data : pd.DataFrame
135
+ dataframe containing features and target.
136
+ target_col : str
137
+ name of target column.
138
+ id_col : str
139
+ name of patient id column.
140
+ num_folds : int
141
+ number of folds.
142
+ sex_col : str, optional
143
+ name of sex column, by default None. Required if stratify_by_sex is True. Can be
144
+ included when stratify_by_sex is False to get info on sex ratio across folds.
145
+ age_col : str, optional
146
+ name of age column, by default None. Column must be a continous variable. Can be
147
+ included to get info on mean age across folds.
148
+ stratify_by_sex : bool, optional
149
+ option to stratify data by sex, by default False.
150
+ stratify_by_age : bool, optional
151
+ option to stratify data by age, by default False. The binned age (age_bin_col) is
152
+ used for stratifying by age rather than age_col.
153
+ age_bin_col : str, optional
154
+ name of age column, by default None. Age column must be provided in binned format.
155
+ shuffle : bool, optional
156
+ whether to shuffle each class's samples before splitting into batches, by default
157
+ False.
158
+ random_state : int, optional
159
+ when shuffle is True, random_state affects the ordering of the indices, by default
160
+ None.
161
+ print_log : bool, optional
162
+ flag to print distributions across folds, by default False.
163
+
164
+ Returns
165
+ -------
166
+ validation_fold_ids : list of arrays
167
+ each array contains the validation patient IDs for each fold.
168
+
169
+ Raises
170
+ -------
171
+ ValueError : error raised when boolean stratify_by_age or stratify_by_sex is True but
172
+ the respective columns are not provided for stratifying.
173
+
174
+ """
175
+ # Raise error if stratify_by_sex/stratify_by_age is True but the respective columns are
176
+ # not provided
177
+ if (stratify_by_age is True) & (age_bin_col is None):
178
+ raise ValueError(
179
+ "Parameter stratify_by_age is True but age_bin_col not provided."
180
+ )
181
+ if (stratify_by_sex is True) & (sex_col is None):
182
+ raise ValueError("Parameter stratify_by_sex is True but sex_col not provided.")
183
+
184
+ # Adapt target column to contain all variables to split by to allow stratified splitting
185
+ # by StratifiedGroupKFold.
186
+ if (stratify_by_sex is True) and (stratify_by_age is True):
187
+ train_data["TempTarget"] = (
188
+ train_data[target_col].astype(str)
189
+ + train_data[sex_col].astype(str)
190
+ + train_data[age_bin_col]
191
+ )
192
+ elif (stratify_by_sex is True) and (stratify_by_age is False):
193
+ train_data["TempTarget"] = train_data[target_col].astype(str) + train_data[
194
+ sex_col
195
+ ].astype(str)
196
+ elif (stratify_by_sex is False) and (stratify_by_age is True):
197
+ train_data["TempTarget"] = train_data[target_col].astype(str) + train_data[
198
+ sex_col
199
+ ].astype(str)
200
+ else:
201
+ train_data["TempTarget"] = train_data[target_col]
202
+ temp_target_col = "TempTarget"
+
+ sgkf_train = StratifiedGroupKFold(
+ n_splits=num_folds, shuffle=shuffle, random_state=random_state
+ )
+
+ validation_fold_ids = []
+ class_fold_ratios = []
+ sex_fold_ratios = []
+ fold_mean_ages = []
+ for train_index, validation_index in sgkf_train.split(
+ train_data, train_data[temp_target_col], groups=train_data[id_col]
+ ):
+ # Get patient IDs for each validation fold
+ validation_ids = train_data[id_col].iloc[validation_index].unique()
+ validation_fold_ids.append(validation_ids)
+
+ # Get class, sex and age distributions of the training data for each fold
+ train_fold_data = train_data[~train_data[id_col].isin(validation_ids)]
+ class_ratio = train_fold_data[target_col].value_counts()[1] / len(
+ train_fold_data
+ )
+ class_fold_ratios.append(class_ratio)
+ if sex_col is not None:
+ sex_ratio = train_fold_data[sex_col].value_counts()[1] / len(
+ train_fold_data
+ )
+ sex_fold_ratios.append(sex_ratio)
+ if age_col is not None:
+ mean_age = train_fold_data[age_col].mean()
+ fold_mean_ages.append(mean_age)
+
+ if print_log is True:
+ print("Fold proportions:")
+ print("Train class ratio:", class_fold_ratios)
+ if sex_col is not None:
+ print("Sex class ratio:", sex_fold_ratios)
+ if age_col is not None:
+ print("Mean age:", fold_mean_ages)
+
+ # Allows the inhomogeneous array to be saved with np.save
+ validation_fold_ids = np.asarray(validation_fold_ids, dtype="object")
+
+ # Delete the temporary target column
+ del train_data[temp_target_col]
+
+ return validation_fold_ids
+
+
+ def get_cv_fold_indices(validation_fold_ids, train_data, id_col):
+ """
+ Find train/val dataframe indices for each fold and format for cross-validation.
+
+ Creates a tuple of training and validation indices for each K-fold using
+ validation_fold_ids: the listed patients are assigned to the validation portion
+ of the data for that fold and all other patients to the train portion. The
+ returned list contains the tuples required by sklearn's cross_validate function
+ (through the cv argument).
+
+ Parameters
+ ----------
+ validation_fold_ids : array
+ lists of patient IDs for each of the K folds.
+ train_data : pd.DataFrame
+ train data (must contain id_col).
+ id_col : str
+ name of column containing patient ID.
+
+ Returns
+ -------
+ cross_val_fold_indices : list of tuples
+ K tuples of (train, val) dataframe indices.
+
+ """
+ # Create a tuple of training and validation indices for each fold.
+ cross_val_fold_indices = []
+ for fold in validation_fold_ids:
+ fold_val_data = train_data[train_data[id_col].isin(fold)]
+ fold_train_data = train_data[~train_data[id_col].isin(fold)]
+
+ # Get the index of rows in val and train
+ fold_val_index = fold_val_data.index
+ fold_train_index = fold_train_data.index
+
+ # Append a tuple of (train, val) indices
+ cross_val_fold_indices.append((fold_train_index, fold_val_index))
+ return cross_val_fold_indices
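A minimal sketch of how these per-fold index tuples plug into sklearn's `cross_validate`. The data, fold IDs, and `DummyClassifier` are placeholders, and the index lookup replicates the loop above inline (using positional indices, which is what an iterable `cv` is expected to yield):

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# Toy data: 12 patients with 2 rows each, plus fold IDs as produced by the
# stratified splitting step (here simply patients 0-3, 4-7, 8-11)
patients = np.repeat(np.arange(12), 2)
train_data = pd.DataFrame(
    {"PatientID": patients, "x": patients % 3, "y": patients % 2}
)
validation_fold_ids = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]

# Replicate the loop above: one (train_index, val_index) tuple per fold
cv_folds = []
for fold in validation_fold_ids:
    val_mask = train_data["PatientID"].isin(fold).to_numpy()
    cv_folds.append((np.flatnonzero(~val_mask), np.flatnonzero(val_mask)))

# The tuples pass straight to the cv argument of cross_validate
scores = cross_validate(
    DummyClassifier(strategy="most_frequent"),
    train_data[["x"]],
    train_data["y"],
    cv=cv_folds,
    scoring="accuracy",
)
```

Because the folds are defined by patient ID rather than by row, all rows from one patient stay on the same side of each split, avoiding leakage between train and validation portions.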