import pandas as pd
import numpy as np
#import pyforest
Reading the datasets.
df= pd.read_csv("./data1.csv")
df2 = pd.read_csv("./data2.csv")
df2
df
Making some columns easier to use: creating dummy variables now so I don't have to later on.
df['Showed_up'] = df['No-show'].map(
{'Yes':0 ,'No':1})
df['sum_missed'] = df['No-show'].map(
{'Yes':1 ,'No':0})
df['Gender'] = df['Gender'].map(
{'M':1 ,'F':0})
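One caveat worth noting about `.map` with a dict: any value not in the dict silently becomes NaN, so a quick NaN check after each mapping is cheap insurance. A toy sketch (values hypothetical):

```python
import pandas as pd

# 'Unknown' has no entry in the dict, so .map turns it into NaN.
s = pd.Series(["Yes", "No", "Unknown"]).map({"Yes": 0, "No": 1})
print(int(s.isna().sum()))  # 1 unmapped value
```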
Grouping the dataset by PatientID to build a new column holding the number of appointments each patient has missed, then checking its correlation with missing an appointment.
missed_prior = df.groupby('PatientID')['sum_missed'].sum()
df.drop(['sum_missed'], axis=1, inplace=True)
df.drop(['No-show'], axis=1, inplace=True)
missed_prior = pd.DataFrame(missed_prior)
df = pd.merge(df, missed_prior, on="PatientID")
df['sum_missed'].corr(df['Showed_up'])
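As an aside, the groupby-then-merge above can be collapsed into one step with `transform`, which broadcasts each patient's total back onto every row. A minimal sketch on toy data (values hypothetical):

```python
import pandas as pd

# Toy stand-in for the appointment data.
toy = pd.DataFrame({"PatientID": [1, 1, 2],
                    "sum_missed": [1, 0, 1]})

# transform('sum') returns a Series aligned with the original rows,
# so no intermediate DataFrame or pd.merge is needed.
toy["missed_prior"] = toy.groupby("PatientID")["sum_missed"].transform("sum")
print(toy["missed_prior"].tolist())  # [1, 1, 1]
```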
Grouping the dataset by PatientID to add a flag showing whether the patient has missed an appointment before.
# 1 if the patient has missed at least one appointment before
# (sum_missed already holds the per-patient total after the merge above)
df['missed_appointment_before'] = (df['sum_missed'] > 0).astype(int)
df['Showed_up'].corr(df['missed_appointment_before'])
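The has-missed-before flag can also be broadcast with a groupby/transform in one step. A sketch with hypothetical values:

```python
import pandas as pd

toy = pd.DataFrame({"PatientID": [1, 1, 2],
                    "missed": [1, 0, 0]})  # hypothetical per-visit misses

# True wherever the patient's total misses exceed zero, cast to 0/1.
toy["missed_before"] = (toy.groupby("PatientID")["missed"]
                           .transform("sum") > 0).astype(int)
print(toy["missed_before"].tolist())  # [1, 1, 0]
```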
Extracting different time variables for further analysis later.
import datetime
df["ScheduledDayofweek"] = pd.to_datetime(df['ScheduledDay']).dt.day_name()
df["Scheduledmonth"] = pd.to_datetime(df['ScheduledDay']).dt.month_name()
df["Scheduledhour"] = pd.to_datetime(df['ScheduledDay']).dt.hour
df["AppointmentDayofweek"] = pd.to_datetime(df['AppointmentDay']).dt.day_name()
df["Appointmentmonth"] = pd.to_datetime(df['AppointmentDay']).dt.month_name()
Dropping unwanted columns
df.drop(["PatientID", "AppointmentID",'ScheduledDay','AppointmentDay', "LocationID"], axis=1, inplace=True)
# "Gender"
df.columns
On average, most people do show up. I'm sensing the dataset is imbalanced, so I'll need to handle this prior to modeling. We have someone who's missed 18 times; that's quite unfortunate for the doctors.
df.describe()
print(df.columns)
df.drop_duplicates(inplace=True)
corr = df.corr()
corr.style.background_gradient(cmap='Purples')
df.isna().describe()
df.duplicated().describe()
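`describe()` on these boolean frames works, but `.sum()` reports the counts directly. A minimal sketch on toy data (values hypothetical):

```python
import pandas as pd

# Toy frame with one NaN and one duplicated row:
# .isna().sum() counts missing values per column,
# .duplicated().sum() counts repeated rows.
toy = pd.DataFrame({"a": [1.0, None, 1.0], "b": [1, 2, 1]})
print(toy.isna().sum().to_dict())   # {'a': 1, 'b': 0}
print(int(toy.duplicated().sum()))  # 1 -> third row repeats the first
```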
# #Creates automated visualizations
# %matplotlib inline
# from autovizwidget.widget.utils import display_dataframe
# display_dataframe(df)
Exporting automated visuals to HTML.
#from pandas_profiling import ProfileReport
#design_report = ProfileReport(df)
#design_report.to_file(output_file='no_showreport.html')
#design_report
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_palette("RdBu")
df["Showed_up"].value_counts().plot(kind='bar')
# Showed = 1
# Missed = 0
Gender doesn't seem to have much visible effect.
ax = sns.countplot(x="Gender", hue="Showed_up", data=df)
# Men = 1
# Women = 0
Sending texts seems to help only a little. Perhaps switching to calls would be better.
ax = sns.countplot(x="SMS_received", hue="Showed_up", data=df)
# Texted = 1
# Not-texted = 0
Now let's see how time (month, day, hour) affects cancellations.
ax = sns.countplot(x="Scheduledmonth", hue="Showed_up", data=df)
# Showed = 1
# Missed = 0
ax = sns.countplot(x="Appointmentmonth", hue="Showed_up", data=df)
# Showed = 1
# Missed = 0
We can disregard months: either there isn't enough data for them, or people simply don't go to hospitals outside of the summertime. The latter is extremely unlikely.
ax = sns.countplot(x="ScheduledDayofweek", hue="Showed_up", data=df)
# Showed = 1
# Missed = 0
Thursdays and Fridays have the fewest cancellations and the fewest appointments. Doctors should encourage patients to book these days, as people generally have more flexibility toward the end of the week.
ax = sns.countplot(x="AppointmentDayofweek", hue="Showed_up", data=df)
# Showed = 1
# Missed = 0
There are fewer cancellations among people who called to set the appointment during mid-day and evenings, probably because the working class is less pressured during lunch breaks and after working hours, and so can be surer they'll keep the commitment. Evenings and mid-afternoon seem to be the best times to call to set an appointment. The dataset has no time for when the appointment itself is held, but it would be nice to see the correlation with that too.
ax = sns.countplot(x="Scheduledhour", hue="Showed_up", data=df)
# Showed = 1
# Missed = 0
ax = sns.countplot(x="sum_missed", hue="Showed_up", data=df)
# Showed = 1
# Missed = 0
df["condition"] = df['Hypertension'] + df['Diabetes']
ax = sns.countplot(x="condition", hue="Showed_up", data=df)
# Showed = 1
# Missed = 0
Let's handle the imbalance issue.
showed = df[df["Showed_up"] == 1]
missed = df[df["Showed_up"] == 0]
print(showed.shape)
print(missed.shape)
# Crude first pass: duplicate the minority (missed) rows once to see the effect
df = pd.concat([df, missed])
print(df["Showed_up"].value_counts())
df.groupby('Showed_up').size().plot(kind='pie',
y = "Showed_up",
label = "Type",
autopct='%1.1f%%')
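A quick numeric check to go with the pie chart: `value_counts(normalize=True)` gives the class shares directly. A toy sketch (labels hypothetical):

```python
import pandas as pd

y = pd.Series([1, 1, 1, 0])  # hypothetical Showed_up labels
shares = y.value_counts(normalize=True)
print(shares[1], shares[0])  # 0.75 0.25
```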
I'll try downsampling; if it doesn't achieve a somewhat decent F1 score, I'll opt for upsampling using SMOTE.
from sklearn.utils import resample
df_resampled = resample(showed,
                        replace=False,  # sample without replacement for a true downsample
                        n_samples=len(missed),
                        random_state=42)
print(df_resampled.shape)
df_resampled = pd.concat([df_resampled, missed])
print(df_resampled["Showed_up"].value_counts())
df_resampled.groupby('Showed_up').size().plot(kind='pie',
y = "Showed_up",
label = "Type",
autopct='%1.1f%%')
Let's analyze these features one more time.
ax = sns.countplot(x="condition", hue="Showed_up", data=df_resampled)
# Showed = 1
# Missed = 0
The chance of missing is high when there are no pre-existing conditions, but with one or two conditions the chance of missing drops.
# Feature importance via raw correlation with the target
corrshowed = []
df = df_resampled
y = df['Showed_up']
features = df.drop(['Showed_up', 'ScheduledDayofweek', 'Scheduledmonth',
                    'AppointmentDayofweek', 'Appointmentmonth'], axis=1)
for x in features.columns:
    corrshowed.append({"Feature": x, "Correlation": df[x].corr(y)})
sorted_list = sorted(corrshowed, key=lambda d: d['Correlation'], reverse=True)
sorted_list
print("######### By Absolute ########")
abscorrshowed = []
y = df['Showed_up']
features = df.drop(['Showed_up', 'ScheduledDayofweek', 'Scheduledmonth',
                    'AppointmentDayofweek', 'Appointmentmonth'], axis=1)
for x in features.columns:
    abscorrshowed.append({"Feature": x, "Correlation": abs(df[x].corr(y))})
abssorted_list = sorted(abscorrshowed, key=lambda d: d['Correlation'], reverse=True)
abssorted_list
# Both are somewhat correlated with each other; hopefully this won't be too much of an issue if I use both while modeling
df['missed_appointment_before'].corr(df['sum_missed'])
A function for our metrics: F1 score and accuracy, plus a confusion matrix, used for all the models.
# Defining metrics
import itertools
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score

def plot_confusion_matrix(y_test, y_pred,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    Prints and plots the confusion matrix, then reports accuracy and F1.
    Normalization can be applied by setting `normalize=True`.
    """
    classes = ['Missed', 'Showed']
    cm = confusion_matrix(y_test, y_pred)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    # Obtain accuracy & F1 score
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("F1 Score:", f1_score(y_test, y_pred))
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score,f1_score,roc_auc_score
#important = ['missed_appointment_before','sum_missed']
important = ['missed_appointment_before','sum_missed','Age','SMS_received', 'Hypertension','Diabetes']
#important = ['missed_appointment_before','sum_missed','Age','SMS_received', 'Hypertension','Diabetes','Scheduledhour']
X = df[important]
Y = df['Showed_up']
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=42, test_size=0.2)
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=42, max_iter=200))
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
plot_confusion_matrix(y_test,y_pred,
normalize=True,
title='Confusion matrix',
cmap=plt.cm.Blues)
# Saving our basic model that performed pretty well
import pickle
olsmodel = "olsmodel.sav"
# save
pickle.dump(pipe, open(olsmodel, "wb"))
Time didn't seem to affect the outcome, so I removed it to prevent over-generalizing.
Removing either the missed-before flag OR the total number of missed appointments dropped the model's accuracy by 5%, so I chose to keep both while ignoring their multicollinearity.
The disease and notification features also made no impact on the model's performance, but I chose to leave them in for the doctors themselves to decide the right course of action: in an emergency, for instance, getting hold of a patient might be life-saving.
# Function for printing the prediction and its probability after modeling
def analysis(missedprior, totalmissed, reminded, age, hyper, diabetes):
    column = ['missed_appointment_before', 'sum_missed', 'Age', 'SMS_received',
              'Hypertension', 'Diabetes']
    serInput = [missedprior, totalmissed, age, reminded, hyper, diabetes]
    data = pd.DataFrame([serInput], columns=column)
    filename = 'olsmodel.sav'
    model = pickle.load(open(filename, 'rb'))
    # The saved pipeline already includes the StandardScaler, so no
    # separate scaler needs to be loaded here.
    r = model.predict(data)
    result = "Might Miss" if r[0] == 0 else "Might Show"
    proba = np.max(model.predict_proba(data) * 100, axis=1)[0]
    return result + ' With A Probability of: ' + '%.2f' % proba + '%'
missedprior,totalmissed,age,reminded,hyper,diabetes = 1,1,50,1,0,0
result = analysis(missedprior,totalmissed,reminded,age,hyper,diabetes)
print(result)
Let's see how much better a gradient-boosted ensemble model performs.
import xgboost as xgb
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=42, test_size=0.2)
pipe = make_pipeline(StandardScaler(), xgb.XGBClassifier(n_jobs=-1, eval_metric='auc', use_label_encoder=False,
random_state=42))
pipe.fit(X_train, y_train.astype(int))
y_pred = pipe.predict(X_test)
# Obtain accuracy & F1 score
plot_confusion_matrix(y_test,y_pred,
normalize=True,
title='Confusion matrix',
cmap=plt.cm.Blues)
# save
xgbmodel = "xgb.sav"
pickle.dump(pipe, open(xgbmodel, "wb"))
# load
xgb_model_loaded = pickle.load(open(xgbmodel, "rb"))
Not much better, unfortunately. In an actual production environment it would be better to just use the base model rather than incur ongoing computational cost for a 2% gain. Still, I'll use XGBoost while deploying this for now, simply because I already have an environment set up for it and don't have time to reinvent the wheel. Hence I'll also perform a grid search with just a few parameters below:
# candidate values for the tuned parameters
Random_Grid={
'n_estimators':[3,5,6],
'max_depth':[3,4,10]
}
# xgboost model
gb = xgb.XGBClassifier( n_jobs=-1, eval_metric='auc', use_label_encoder=False,
random_state=42)
# randomized search CV
from sklearn.model_selection import RandomizedSearchCV
rs = RandomizedSearchCV(gb, Random_Grid, cv=20, scoring="f1", n_iter=6)
# fit the train data
rs.fit(X_train,y_train)
gs = xgb.XGBClassifier( n_estimators=rs.best_params_["n_estimators"],
max_depth=rs.best_params_["max_depth"],
n_jobs=-1, eval_metric='auc', use_label_encoder=False,
random_state=42)
# fit the model
gs.fit(X_train, y_train)
y_pred = gs.predict(X_test)
# Obtain accuracy & F1 score
print("Accuracy :" , accuracy_score(y_test, y_pred))
print("F1 Score:" , f1_score(y_test, y_pred))
plot_confusion_matrix(y_test,y_pred,
normalize=True,
title='Confusion matrix',
cmap=plt.cm.Blues)
import pickle
gsmodel = "gs.pkl"
# save
pickle.dump(gs, open(gsmodel, "wb"))
# load
#xgb_model_loaded = pickle.load(open(xgbmodel, "rb"))
Finally, I'd like to see how much better a simple neural network performs in contrast to the prior models.
import tensorflow
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Sequential
#important = ['missed_appointment_before','sum_missed']
important = ['missed_appointment_before','sum_missed','Age','SMS_received', 'Hypertension','Diabetes']
#important = ['missed_appointment_before','sum_missed','Age','SMS_received', 'Hypertension','Diabetes','Scheduledhour']
X = df[important]
Y = df['Showed_up']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
standardScaler = StandardScaler()
X_train = standardScaler.fit_transform(X_train)
X_test = standardScaler.transform(X_test)
classifier = Sequential()
classifier.add(Dense(units = 64, activation = 'relu', input_dim = len(important)))
classifier.add(Dropout(rate = 0.5))
classifier.add(Dense(units = 128, activation = 'relu'))
classifier.add(Dropout(rate = 0.5))
classifier.add(Dense(units = 128, activation = 'relu'))
classifier.add(Dropout(rate = 0.5))
classifier.add(Dense(units = 1, activation = 'sigmoid'))
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
classifier.summary()
history = classifier.fit(X_train, y_train, epochs = 10, validation_split = 0.2)
y_pred = classifier.predict(X_test)
y_pred = y_pred > 0.5
plot_confusion_matrix(y_test,y_pred,
normalize=True,
title='Confusion matrix',
cmap=plt.cm.Blues)
Even with three dense layers and a dropout layer after each to reduce over-generalizing, the model didn't improve by much, if at all, so the neural network definitely shouldn't be used here: it won't be cost-efficient compared to the simpler models. I did use only 10 epochs in training, so with more it might have outperformed the rest, but I want to build a user interface right now, so that's work for another day.
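On the more-epochs point: rather than guessing an epoch count, one hedged option is Keras's EarlyStopping callback, which halts training once validation loss stops improving and rolls back to the best weights. A minimal sketch (the patience value is an assumption, not tuned here):

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop when val_loss hasn't improved for 5 consecutive epochs and keep
# the best weights seen so far (patience=5 is a hypothetical choice).
early_stop = EarlyStopping(monitor="val_loss", patience=5,
                           restore_best_weights=True)

# Usage with the classifier above (sketch, not run here):
# history = classifier.fit(X_train, y_train, epochs=100,
#                          validation_split=0.2, callbacks=[early_stop])
```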
Send text notifications to those who scheduled during high-cancellation time periods, i.e. [between 7am - 11am & 2pm - 3pm].
Encourage Thursday and Friday bookings more.
Switch to phone-call reminders a day prior to the appointment.
Use the app built here to view a patient's probability of canceling a day prior; if it's > 70%, call to suggest rescheduling before the patient even thinks of canceling.
Try upsampling for better model accuracy.
Try a more in-depth grid-search approach using more parameters.
Get more data, especially for the time variables.
Experiment with label encoding for the time variables made in the beginning, i.e. ['ScheduledDayofweek', 'Scheduledmonth', 'AppointmentDayofweek', 'Appointmentmonth'].
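One hedged sketch of that label-encoding idea: map the day names onto a fixed ordinal (the mapping below is an assumption for illustration, not something the notebook defines):

```python
import pandas as pd

order = ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"]
codes = {day: i for i, day in enumerate(order)}

# Hypothetical sample of the ScheduledDayofweek column.
s = pd.Series(["Friday", "Monday"]).map(codes)
print(s.tolist())  # [4, 0]
```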
%%shell
jupyter nbconvert --to html code.ipynb