In [6]:
import pandas as pd
import numpy as np
#import pyforest

Reading the datasets.

In [7]:
df= pd.read_csv("./data1.csv")
df2 = pd.read_csv("./data2.csv")
df2
Out[7]:
Column Description
0 MemberID Member ID
1 AppointmentID Appointment ID
2 Gender M = Male, F = Female
3 ScheduledDay Date the appointment was scheduled
4 AppointmentDay Actual appointment date
5 Age Age
6 LocationID Patient Geography ID
7 MedicaidIND 1 = Medicaid patient, 0 = Non-Medicaid patient
8 Hypertension Hypertension indicator 1 = Yes, 0 = No
9 Diabetes Diabetes indicator 1 = Yes, 0 = No
10 Alcoholism Alcoholism indicator 1 = Yes, 0 = No
11 Disability Disability indicator 1 = Yes, 0 = No
12 SMS_received Text was sent to patient as an appointment rem...
13 No-show Yes = Did not attend the appointment, No = App...
In [8]:
df
Out[8]:
PatientID AppointmentID Gender ScheduledDay AppointmentDay Age LocationID MedicaidIND Hypertension Diabetes Alcoholism Disability SMS_received No-show
0 #29872499824296 5642903 F 2016-04-29T18:38:08Z 2016-04-29T00:00:00Z 62 40 0 1 0 0 0 0 No
1 #558997776694438 5642503 M 2016-04-29T16:08:27Z 2016-04-29T00:00:00Z 56 40 0 0 0 0 0 0 No
2 #4262962299951 5642549 F 2016-04-29T16:19:04Z 2016-04-29T00:00:00Z 62 47 0 0 0 0 0 0 No
3 #867951213174 5642828 F 2016-04-29T17:29:31Z 2016-04-29T00:00:00Z 8 55 0 0 0 0 0 0 No
4 #8841186448183 5642494 F 2016-04-29T16:07:23Z 2016-04-29T00:00:00Z 56 40 0 1 1 0 0 0 No
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
110522 #2572134369293 5651768 F 2016-05-03T09:15:35Z 2016-06-07T00:00:00Z 56 45 0 0 0 0 0 1 No
110523 #3596266328735 5650093 F 2016-05-03T07:27:33Z 2016-06-07T00:00:00Z 51 45 0 0 0 0 0 1 No
110524 #15576631729893 5630692 F 2016-04-27T16:03:52Z 2016-06-07T00:00:00Z 21 45 0 0 0 0 0 1 No
110525 #92134931435557 5630323 F 2016-04-27T15:09:23Z 2016-06-07T00:00:00Z 38 45 0 0 0 0 0 1 No
110526 #377511518121127 5629448 F 2016-04-27T13:30:56Z 2016-06-07T00:00:00Z 54 45 0 0 0 0 0 1 No

110527 rows × 14 columns

Processing

Making some columns easier to use: encoding them as numeric indicator variables now so I don't have to later on.

In [9]:
df['Showed_up'] = df['No-show'].map(
                   {'Yes':0 ,'No':1})

df['sum_missed'] = df['No-show'].map(
                   {'Yes':1 ,'No':0})

df['Gender'] = df['Gender'].map(
                   {'M':1 ,'F':0})
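Since `.map()` silently yields NaN for any value outside the dictionary, a quick null check after mapping is cheap insurance against unexpected categories. A minimal sketch on hypothetical toy data (not the real dataset):

```python
import pandas as pd

# Toy frame standing in for df (hypothetical values)
toy = pd.DataFrame({"No-show": ["Yes", "No", "No"], "Gender": ["M", "F", "F"]})

toy["Showed_up"] = toy["No-show"].map({"Yes": 0, "No": 1})
toy["Gender"] = toy["Gender"].map({"M": 1, "F": 0})

# Any category missing from the mapping dict would surface here as NaN
assert toy["Showed_up"].notna().all() and toy["Gender"].notna().all()
print(toy[["Showed_up", "Gender"]].values.tolist())  # [[0, 1], [1, 0], [1, 0]]
```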

Grouping the dataset by PatientID to create a new column with the number of appointments each patient has missed, then checking its correlation with showing up.

In [10]:
missed_prior = df.groupby('PatientID')['sum_missed'].sum()
df.drop(['sum_missed'], axis=1, inplace=True)
df.drop(['No-show'], axis=1, inplace=True)
missed_prior = pd.DataFrame(missed_prior)
df = pd.merge(df, missed_prior, on="PatientID")
df['sum_missed'].corr(df['Showed_up'])
Out[10]:
-0.4625076926719454
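The same per-patient total can be computed without the DataFrame round-trip and merge by using `groupby().transform`, which broadcasts the group sum back onto every row. A sketch on hypothetical data (note that, as in the cell above, the total includes the current appointment itself):

```python
import pandas as pd

# Two hypothetical patients, three appointments
toy = pd.DataFrame({"PatientID": ["a", "a", "b"], "sum_missed": [1, 0, 0]})

# transform('sum') returns a Series aligned to the original index
toy["sum_missed_total"] = toy.groupby("PatientID")["sum_missed"].transform("sum")
print(toy["sum_missed_total"].tolist())  # [1, 1, 0]
```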

Grouping the dataset by PatientID to make a new column flagging each patient's attendance history. (Note: since Showed_up is summed, the flag is set for patients who have attended at least once, which is why its correlation with Showed_up below is strongly positive.)

In [11]:
missed_appointment = df.groupby('PatientID')['Showed_up'].sum()
missed_appointment = missed_appointment.to_dict()
df['missed_appointment_before'] = df.PatientID.map(lambda x: 1 if missed_appointment[x]>0 else 0)
df['Showed_up'].corr(df['missed_appointment_before'])
Out[11]:
0.6098312772100071

Extracting different time variables for further analysis later

In [12]:
import datetime
df["ScheduledDayofweek"] = pd.to_datetime(df['ScheduledDay']).dt.day_name()
df["Scheduledmonth"] = pd.to_datetime(df['ScheduledDay']).dt.month_name()
df["Scheduledhour"] = pd.to_datetime(df['ScheduledDay']).dt.hour

df["AppointmentDayofweek"] = pd.to_datetime(df['AppointmentDay']).dt.day_name()
df["Appointmentmonth"] = pd.to_datetime(df['AppointmentDay']).dt.month_name()
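Each `pd.to_datetime` call above re-parses the whole column; parsing once and reusing the parsed Series avoids repeating that work. A sketch on a hypothetical one-row frame:

```python
import pandas as pd

toy = pd.DataFrame({"ScheduledDay": ["2016-04-29T18:38:08Z"]})

# Parse once, then derive all three features from the cached Series
sched = pd.to_datetime(toy["ScheduledDay"])
toy["ScheduledDayofweek"] = sched.dt.day_name()
toy["Scheduledmonth"] = sched.dt.month_name()
toy["Scheduledhour"] = sched.dt.hour
print(toy.iloc[0, 1:].tolist())  # ['Friday', 'April', 18]
```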

Dropping unwanted columns

In [13]:
df.drop(["PatientID", "AppointmentID",'ScheduledDay','AppointmentDay', "LocationID"], axis=1, inplace=True)
# "Gender"
In [14]:
df.columns
Out[14]:
Index(['Gender', 'Age', 'MedicaidIND', 'Hypertension', 'Diabetes',
       'Alcoholism', 'Disability', 'SMS_received', 'Showed_up', 'sum_missed',
       'missed_appointment_before', 'ScheduledDayofweek', 'Scheduledmonth',
       'Scheduledhour', 'AppointmentDayofweek', 'Appointmentmonth'],
      dtype='object')

On average most people do show up, so the dataset is clearly imbalanced and I'll need to handle that before modeling. At least one patient has missed 18 appointments, which is quite unfortunate for the doctors. (Note also the minimum Age of -1 below, which must be a data-entry error.)

In [15]:
df.describe()
Out[15]:
Gender Age MedicaidIND Hypertension Diabetes Alcoholism Disability SMS_received Showed_up sum_missed missed_appointment_before Scheduledhour
count 110527.000000 110527.000000 110527.000000 110527.000000 110527.000000 110527.000000 110527.000000 110527.000000 110527.000000 110527.000000 110527.000000 110527.000000
mean 0.350023 37.088874 0.098266 0.197246 0.071865 0.030400 0.022248 0.321026 0.798067 0.632796 0.913994 10.774517
std 0.476979 23.110205 0.297675 0.397921 0.258265 0.171686 0.161543 0.466873 0.401444 1.145807 0.280374 3.216189
min 0.000000 -1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6.000000
25% 0.000000 18.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 1.000000 8.000000
50% 0.000000 37.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 1.000000 10.000000
75% 1.000000 55.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 13.000000
max 1.000000 115.000000 1.000000 1.000000 1.000000 1.000000 4.000000 1.000000 1.000000 18.000000 1.000000 21.000000
In [16]:
print(df.columns)
df.drop_duplicates(inplace=True)
Index(['Gender', 'Age', 'MedicaidIND', 'Hypertension', 'Diabetes',
       'Alcoholism', 'Disability', 'SMS_received', 'Showed_up', 'sum_missed',
       'missed_appointment_before', 'ScheduledDayofweek', 'Scheduledmonth',
       'Scheduledhour', 'AppointmentDayofweek', 'Appointmentmonth'],
      dtype='object')
In [17]:
corr = df.corr()
corr.style.background_gradient(cmap='Purples')
Out[17]:
  Gender Age MedicaidIND Hypertension Diabetes Alcoholism Disability SMS_received Showed_up sum_missed missed_appointment_before Scheduledhour
Gender 1.000000 -0.089898 -0.122643 -0.052455 -0.031750 0.107607 0.024807 -0.046877 0.006172 0.006974 -0.014144 0.000063
Age -0.089898 1.000000 -0.122760 0.501015 0.290950 0.085628 0.074033 -0.038251 0.106571 -0.088141 0.081525 -0.019296
MedicaidIND -0.122643 -0.122760 1.000000 -0.033131 -0.033325 0.030996 -0.013361 -0.016568 -0.013915 0.039385 -0.000075 -0.036376
Hypertension -0.052455 0.501015 -0.033131 1.000000 0.427491 0.082001 0.077579 -0.038713 0.063945 -0.053396 0.052116 -0.052622
Diabetes -0.031750 0.290950 -0.033325 0.427491 1.000000 0.014516 0.055605 -0.037040 0.032903 -0.029765 0.028443 -0.028557
Alcoholism 0.107607 0.085628 0.030996 0.082001 0.014516 1.000000 0.000701 -0.038360 0.009210 0.007060 0.008809 -0.015566
Disability 0.024807 0.074033 -0.013361 0.077579 0.055605 0.000701 1.000000 -0.037380 0.019583 0.004645 0.019667 -0.007222
SMS_received -0.046877 -0.038251 -0.016568 -0.038713 -0.037040 -0.038360 -0.037380 1.000000 -0.090447 -0.011530 -0.059179 0.024449
Showed_up 0.006172 0.106571 -0.013915 0.063945 0.032903 0.009210 0.019583 -0.090447 1.000000 -0.454892 0.604063 -0.029399
sum_missed 0.006974 -0.088141 0.039385 -0.053396 -0.029765 0.007060 0.004645 -0.011530 -0.454892 1.000000 -0.209066 0.044835
missed_appointment_before -0.014144 0.081525 -0.000075 0.052116 0.028443 0.008809 0.019667 -0.059179 0.604063 -0.209066 1.000000 -0.014312
Scheduledhour 0.000063 -0.019296 -0.036376 -0.052622 -0.028557 -0.015566 -0.007222 0.024449 -0.029399 0.044835 -0.014312 1.000000
In [18]:
df.isna().describe()
Out[18]:
Gender Age MedicaidIND Hypertension Diabetes Alcoholism Disability SMS_received Showed_up sum_missed missed_appointment_before ScheduledDayofweek Scheduledmonth Scheduledhour AppointmentDayofweek Appointmentmonth
count 91402 91402 91402 91402 91402 91402 91402 91402 91402 91402 91402 91402 91402 91402 91402 91402
unique 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
top False False False False False False False False False False False False False False False False
freq 91402 91402 91402 91402 91402 91402 91402 91402 91402 91402 91402 91402 91402 91402 91402 91402
In [19]:
df.duplicated().describe()
Out[19]:
count     91402
unique        1
top       False
freq      91402
dtype: object

EDA

In [21]:
# #Creates automated visualizations 
# %matplotlib inline
# from autovizwidget.widget.utils import display_dataframe

# display_dataframe(df)

Exporting Automated Visuals to html

In [22]:
#from pandas_profiling import ProfileReport
#design_report = ProfileReport(df)
#design_report.to_file(output_file='no_showreport.html')
#design_report
In [23]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_palette("RdBu")

df["Showed_up"].value_counts().plot(kind='bar')
# Showed = 1
# Missed = 0
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7faafa68c610>

Gender doesn't seem to have much visible effect.

In [24]:
ax = sns.countplot(x="Gender", hue="Showed_up", data=df)
# Men = 1
# Women = 0

Sending texts seems to help only a little. Perhaps switching to calls would be better.

In [25]:
ax = sns.countplot(x="SMS_received", hue="Showed_up", data=df)
# Texted = 1
# Not-texted = 0

Now let's see how time (month, day, hour) affects cancellations.

In [26]:
ax = sns.countplot(x="Scheduledmonth", hue="Showed_up", data=df)
# Showed = 1
# Missed = 0
In [27]:
ax = sns.countplot(x="Appointmentmonth", hue="Showed_up", data=df)
# Showed = 1
# Missed = 0

We can disregard months: either there isn't enough data for them, or people simply don't go to hospitals outside of the summertime. The latter is extremely unlikely.

In [28]:
ax = sns.countplot(x="ScheduledDayofweek", hue="Showed_up", data=df)
# Showed = 1
# Missed = 0

Thursdays and Fridays have the fewest cancellations and the fewest appointments. Doctors should steer patients toward these days, as people generally have more flexibility towards the end of the week.

In [29]:
ax = sns.countplot(x="AppointmentDayofweek", hue="Showed_up", data=df)
# Showed = 1
# Missed = 0

There are fewer cancellations for people who scheduled their appointment during midday or the evening, probably because working patients are under less pressure during lunch breaks and after working hours, and can be surer they'll make the commitment.

Evenings and mid-afternoon seem to be the best times to call to set an appointment. The dataset has no time of day for when the appointment is actually held, but it would be nice to see its correlation as well.

In [30]:
ax = sns.countplot(x="Scheduledhour", hue="Showed_up", data=df)
# Showed = 1
# Missed = 0
In [33]:
ax = sns.countplot(x="sum_missed", hue="Showed_up", data=df)
# Showed = 1
# Missed = 0
In [34]:
df["condition"] = df['Hypertension'] + df['Diabetes']
In [35]:
ax = sns.countplot(x="condition", hue="Showed_up", data=df)
# Showed = 1
# Missed = 0

Model

Let's handle the class imbalance.

In [36]:
showed = df[df["Showed_up"] == 1]
missed  = df[df["Showed_up"] == 0]
print(showed.shape)
print(missed.shape)
df = pd.concat([df, missed])  # append a second copy of the missed class to soften the imbalance

print(df["Showed_up"].value_counts())

df.groupby('Showed_up').size().plot(kind='pie',
                                       y = "Showed_up",
                                       label = "Type",
                                       autopct='%1.1f%%')
(70047, 17)
(21355, 17)
1    70047
0    42710
Name: Showed_up, dtype: int64
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x7faafa69a050>

I'll try downsampling; if it doesn't achieve a somewhat decent F1 score, I'll opt for upsampling using SMOTE.
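As a middle ground, many scikit-learn estimators also accept `class_weight='balanced'`, which reweights the loss inversely to class frequency and leaves the data itself untouched. A sketch on synthetic data (not the appointment dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 80/20 imbalanced data, roughly mirroring the show/no-show split
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)

# 'balanced' weights each class by n_samples / (n_classes * class_count)
clf = LogisticRegression(class_weight="balanced", max_iter=200).fit(X, y)
print(clf.score(X, y))
```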

In [37]:
from sklearn.utils import resample
df_resampled = resample(showed,
             replace=True,
             n_samples=len(missed),
             random_state=42)

print(df_resampled.shape)
(21355, 17)
In [38]:
df_resampled = pd.concat([df_resampled, missed])

print(df_resampled["Showed_up"].value_counts())

df_resampled.groupby('Showed_up').size().plot(kind='pie',
                                       y = "Showed_up",
                                       label = "Type",
                                       autopct='%1.1f%%')
1    21355
0    21355
Name: Showed_up, dtype: int64
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x7faafbc355d0>

Let's analyze these features one more time.

In [39]:
ax = sns.countplot(x="condition", hue="Showed_up", data=df_resampled)
# Showed = 1
# Missed = 0

The chance of missing is higher when there are no pre-existing conditions; with 1 or 2 conditions, the chance of missing drops.

In [43]:
#Feature Importance
abscorrshowed = []
df = df_resampled
y = df['Showed_up']
features = df.drop(['Showed_up', 'ScheduledDayofweek', 'Scheduledmonth','AppointmentDayofweek', 'Appointmentmonth'], axis = 1)
for x in features.columns:
    abscorrshowed.append({"Feature": x, "Correlation" : (df[x].corr(y))})
abssorted_list = sorted(abscorrshowed,key=lambda x:x['Correlation'],reverse=True)
abssorted_list
Out[43]:
[{'Correlation': 0.5221551562049078, 'Feature': 'missed_appointment_before'},
 {'Correlation': 0.12602838004041125, 'Feature': 'Age'},
 {'Correlation': 0.07542836038507394, 'Feature': 'Hypertension'},
 {'Correlation': 0.06867140655649014, 'Feature': 'condition'},
 {'Correlation': 0.03288910570793788, 'Feature': 'Diabetes'},
 {'Correlation': 0.023093822097828173, 'Feature': 'Disability'},
 {'Correlation': 0.00874532827670649, 'Feature': 'Alcoholism'},
 {'Correlation': 0.004724813996662372, 'Feature': 'Gender'},
 {'Correlation': -0.011934077638705375, 'Feature': 'MedicaidIND'},
 {'Correlation': -0.03717159317752768, 'Feature': 'Scheduledhour'},
 {'Correlation': -0.10538677555673681, 'Feature': 'SMS_received'},
 {'Correlation': -0.47840224582944235, 'Feature': 'sum_missed'}]
In [44]:
print("#########  By Absolute ########")
abscorrshowed = []
y = df['Showed_up']
features = df.drop(['Showed_up', 'ScheduledDayofweek', 'Scheduledmonth','AppointmentDayofweek', 'Appointmentmonth'], axis = 1)
for x in features.columns:
    abscorrshowed.append({"Feature": x, "Correlation" : abs(df[x].corr(y))})
abssorted_list = sorted(abscorrshowed,key=lambda x:x['Correlation'],reverse=True)
abssorted_list
#########  By Absolute ########
Out[44]:
[{'Correlation': 0.5221551562049078, 'Feature': 'missed_appointment_before'},
 {'Correlation': 0.47840224582944235, 'Feature': 'sum_missed'},
 {'Correlation': 0.12602838004041125, 'Feature': 'Age'},
 {'Correlation': 0.10538677555673681, 'Feature': 'SMS_received'},
 {'Correlation': 0.07542836038507394, 'Feature': 'Hypertension'},
 {'Correlation': 0.06867140655649014, 'Feature': 'condition'},
 {'Correlation': 0.03717159317752768, 'Feature': 'Scheduledhour'},
 {'Correlation': 0.03288910570793788, 'Feature': 'Diabetes'},
 {'Correlation': 0.023093822097828173, 'Feature': 'Disability'},
 {'Correlation': 0.011934077638705375, 'Feature': 'MedicaidIND'},
 {'Correlation': 0.00874532827670649, 'Feature': 'Alcoholism'},
 {'Correlation': 0.004724813996662372, 'Feature': 'Gender'}]
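As an aside, `DataFrame.corrwith` collapses both loops above into a single call; a sketch on a hypothetical mini-frame:

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 3, 3, 1], "y": [1, 2, 3, 5]})

# Correlate every feature column with the target, rank by absolute value
ranked = toy[["a", "b"]].corrwith(toy["y"]).abs().sort_values(ascending=False)
print(ranked.index.tolist())  # ['a', 'b']
```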
In [45]:
# The two features are somewhat correlated with each other; hopefully this won't be too much of an issue if I use both while modeling
df['missed_appointment_before'].corr(df['sum_missed'])
Out[45]:
-0.1584175923864393
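Pairwise correlation only sees two columns at a time; variance inflation factors quantify multicollinearity against all the other features jointly. A minimal sketch with a hypothetical `vif` helper (VIF = 1/(1-R²) from regressing each column on the rest; values above roughly 5-10 are usually taken as a warning sign):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """Variance inflation factor per column of X."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        out.append(1.0 / (1.0 - r2))
    return out

# Synthetic columns: b is a near-duplicate of a, c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.1, size=200)
c = rng.normal(size=200)
print(vif(np.column_stack([a, b, c])))  # VIFs for a and b are large; c stays near 1
```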

A function for our metrics: F1 score, accuracy, and a confusion matrix, shared by all the models.

In [46]:
# Defining metrics
import itertools
def plot_confusion_matrix(y_test,y_pred,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    classes = ['Missed', 'Showed']
    from sklearn.metrics import confusion_matrix
    cm = confusion_matrix(y_test,y_pred)
    
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion Matrix, without normalization')

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                    color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    
    # obtain accuracy & F1 score
    from sklearn.metrics import accuracy_score, f1_score
    print("Accuracy :", accuracy_score(y_test, y_pred))
    print("F1 Score:", f1_score(y_test, y_pred))
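For reference, scikit-learn ships most of this out of the box: `classification_report` prints per-class precision/recall/F1, and recent versions also provide `ConfusionMatrixDisplay` for the plot. A sketch on hypothetical toy labels:

```python
from sklearn.metrics import classification_report

# Toy labels only, to show the built-in report
y_true = [1, 0, 1, 1, 0, 1]
y_hat = [1, 0, 0, 1, 1, 1]
report = classification_report(y_true, y_hat, target_names=["Missed", "Showed"])
print(report)

# In scikit-learn >= 1.0 the matrix plot itself is one line:
# ConfusionMatrixDisplay.from_predictions(y_true, y_hat,
#                                         display_labels=["Missed", "Showed"])
```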

Logistic Regression

In [56]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score,f1_score,roc_auc_score


#important = ['missed_appointment_before','sum_missed']
important = ['missed_appointment_before','sum_missed','Age','SMS_received', 'Hypertension','Diabetes']
#important = ['missed_appointment_before','sum_missed','Age','SMS_received', 'Hypertension','Diabetes','Scheduledhour']


X = df[important]
Y = df['Showed_up']


X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=42, test_size=0.2)

pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=42, max_iter=200))
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
plot_confusion_matrix(y_test,y_pred,
                          normalize=True,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues)

# Saving our basic model that performed pretty well
import pickle
olsmodel = "olsmodel.sav"

# save
pickle.dump(pipe, open(olsmodel, "wb"))
Normalized confusion matrix
Accuracy : 0.8114024818543667
F1 Score: 0.8276083467094704

Time didn't seem to affect the result, so I removed it to prevent over-generalizing.

Removing either missed-before (yes or no) or the total number of missed appointments dropped the model's accuracy by 5%, so I chose to keep both while ignoring their multicollinearity.

The disease and notification features also made no impact on the model's performance, but I chose to leave them in for the doctors themselves to decide the right course of action. For instance, in an emergency, getting hold of a patient might be life-saving.

In [48]:
# Function for spitting out the predictions and its probability after done modeling
def analysis(missedprior,totalmissed,reminded,age,hyper,diabetes):

    column = ['missed_appointment_before', 'sum_missed', 'Age', 'SMS_received',
       'Hypertension', 'Diabetes']
    
    serInput = [missedprior,totalmissed,age,reminded,hyper,diabetes]
    
    data = pd.DataFrame([serInput], columns=column)
    
    filename = 'olsmodel.sav'
    model = pickle.load(open(filename, 'rb'))

    #from joblib import dump, load
    #scaler = load('std_scaler.bin')

    #data = scaler.transform(data)

    r =  model.predict(data)
    for i in r:
        if i == 0:
            result = "Might Miss"
        else:
            result = "Might Show"
    proba = np.max(model.predict_proba(data)*100, axis=1)
    pred = str( (result) + ' With A Probability of: ' +'%.2f' % (proba) +'%')
     

    return  pred

missedprior,totalmissed,age,reminded,hyper,diabetes = 1,1,50,1,0,0
result =  analysis(missedprior,totalmissed,reminded,age,hyper,diabetes)

print(result)
Might Show With A Probability of: 54.05%

Xgboost Classifier

Lets see how much better a gradient boosted ensemble model will perform.

In [55]:
import xgboost as xgb

X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=42, test_size=0.2)

pipe = make_pipeline(StandardScaler(), xgb.XGBClassifier(n_jobs=-1, eval_metric='auc', use_label_encoder=False,
                         random_state=42))
pipe.fit(X_train, y_train.astype(int))
y_pred = pipe.predict(X_test)
# obtain accuracy & F1 score
plot_confusion_matrix(y_test,y_pred,
                          normalize=True,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues)

# save
xgbmodel = "xgb.sav"
pickle.dump(pipe, open(xgbmodel, "wb"))
# load
xgb_model_loaded = pickle.load(open(xgbmodel, "rb"))
Normalized confusion matrix
Accuracy : 0.8686490283306018
F1 Score: 0.8496381667113375

Not that much better, unfortunately, so in an actual production environment it would be better to use the base model and avoid incurring extra computational cost over time for just a 2% gain. Still, I'll use XGBoost when deploying this for now, simply because I already have an environment set up for it and don't have time to reinvent the wheel. That's also why the grid search below uses only a few parameters:

In [ ]:
# candidate values for the tuned parameters
from sklearn.model_selection import RandomizedSearchCV

Random_Grid = {
             'n_estimators': [3, 5, 6],
             'max_depth': [3, 4, 10]
             }
# XGBoost model
gb = xgb.XGBClassifier(n_jobs=-1, eval_metric='auc', use_label_encoder=False,
                         random_state=42)
# randomized search CV
rs = RandomizedSearchCV(gb, Random_Grid, cv=20, scoring="f1", n_iter=6)

# fit the train data
rs.fit(X_train,y_train)
Out[ ]:
RandomizedSearchCV(cv=20,
                   estimator=XGBClassifier(base_score=None, booster=None,
                                           colsample_bylevel=None,
                                           colsample_bynode=None,
                                           colsample_bytree=None,
                                           enable_categorical=False,
                                           eval_metric='auc', gamma=None,
                                           gpu_id=None, importance_type=None,
                                           interaction_constraints=None,
                                           learning_rate=None,
                                           max_delta_step=None, max_depth=None,
                                           min_child_weight=None, missing=nan,
                                           monotone_constraints=None,
                                           n_estimators=100, n_jobs=-1,
                                           num_parallel_tree=None,
                                           predictor=None, random_state=42,
                                           reg_alpha=None, reg_lambda=None,
                                           scale_pos_weight=None,
                                           subsample=None, tree_method=None,
                                           use_label_encoder=False,
                                           validate_parameters=None,
                                           verbosity=None),
                   n_iter=6,
                   param_distributions={'max_depth': [3, 4, 10],
                                        'n_estimators': [3, 5, 6]},
                   scoring='f1')
In [ ]:
gs = xgb.XGBClassifier(  n_estimators=rs.best_params_["n_estimators"],
                           max_depth=rs.best_params_["max_depth"],
                        n_jobs=-1,  eval_metric='auc', use_label_encoder=False,
                        random_state=42)
# fit the model 
gs.fit(X_train, y_train)

y_pred = gs.predict(X_test)
# obtain accuracy & F1 score
print("Accuracy :" , accuracy_score(y_test, y_pred))
print("F1 Score:" , f1_score(y_test, y_pred))
plot_confusion_matrix(y_test,y_pred,
                          normalize=True,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues)

import pickle
gsmodel = "gs.pkl"

# save
pickle.dump(gs, open(gsmodel, "wb"))
# load
#xgb_model_loaded = pickle.load(open(xgbmodel, "rb"))
Accuracy : 0.8455569350833629
F1 Score: 0.8715850016591086
Normalized confusion matrix
Accuracy : 0.8455569350833629
F1 Score: 0.8715850016591086

Neural Network

Finally, I'd like to see how a simple neural network performs in contrast to the prior models.

In [60]:
import tensorflow
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Sequential

#important = ['missed_appointment_before','sum_missed']
important = ['missed_appointment_before','sum_missed','Age','SMS_received', 'Hypertension','Diabetes']
#important = ['missed_appointment_before','sum_missed','Age','SMS_received', 'Hypertension','Diabetes','Scheduledhour']


X = df[important]
Y = df['Showed_up']



X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

standardScaler = StandardScaler()
X_train = standardScaler.fit_transform(X_train)
X_test = standardScaler.transform(X_test)

classifier = Sequential()
classifier.add(Dense(units = 64, activation = 'relu', input_dim = len(important)))
classifier.add(Dropout(rate = 0.5))
classifier.add(Dense(units = 128, activation = 'relu'))
classifier.add(Dropout(rate = 0.5))
classifier.add(Dense(units = 128, activation = 'relu'))
classifier.add(Dropout(rate = 0.5))
classifier.add(Dense(units = 1, activation = 'sigmoid'))
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
classifier.summary()

history = classifier.fit(X_train, y_train, epochs = 10, validation_split = 0.2)

y_pred = classifier.predict(X_test)
y_pred = y_pred > 0.5


plot_confusion_matrix(y_test,y_pred,
                          normalize=True,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues)
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_4 (Dense)             (None, 64)                448       
                                                                 
 dropout_3 (Dropout)         (None, 64)                0         
                                                                 
 dense_5 (Dense)             (None, 128)               8320      
                                                                 
 dropout_4 (Dropout)         (None, 128)               0         
                                                                 
 dense_6 (Dense)             (None, 128)               16512     
                                                                 
 dropout_5 (Dropout)         (None, 128)               0         
                                                                 
 dense_7 (Dense)             (None, 1)                 129       
                                                                 
=================================================================
Total params: 25,409
Trainable params: 25,409
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
855/855 [==============================] - 5s 4ms/step - loss: 0.3244 - accuracy: 0.8354 - val_loss: 0.2470 - val_accuracy: 0.8709
Epoch 2/10
855/855 [==============================] - 5s 6ms/step - loss: 0.2627 - accuracy: 0.8655 - val_loss: 0.2480 - val_accuracy: 0.8709
Epoch 3/10
855/855 [==============================] - 5s 6ms/step - loss: 0.2597 - accuracy: 0.8672 - val_loss: 0.2503 - val_accuracy: 0.8709
Epoch 4/10
855/855 [==============================] - 2s 3ms/step - loss: 0.2600 - accuracy: 0.8676 - val_loss: 0.2475 - val_accuracy: 0.8709
Epoch 5/10
855/855 [==============================] - 3s 3ms/step - loss: 0.2580 - accuracy: 0.8679 - val_loss: 0.2462 - val_accuracy: 0.8709
Epoch 6/10
855/855 [==============================] - 2s 3ms/step - loss: 0.2583 - accuracy: 0.8675 - val_loss: 0.2467 - val_accuracy: 0.8709
Epoch 7/10
855/855 [==============================] - 2s 3ms/step - loss: 0.2580 - accuracy: 0.8674 - val_loss: 0.2487 - val_accuracy: 0.8709
Epoch 8/10
855/855 [==============================] - 2s 3ms/step - loss: 0.2580 - accuracy: 0.8677 - val_loss: 0.2475 - val_accuracy: 0.8709
Epoch 9/10
855/855 [==============================] - 2s 3ms/step - loss: 0.2581 - accuracy: 0.8678 - val_loss: 0.2468 - val_accuracy: 0.8709
Epoch 10/10
855/855 [==============================] - 2s 3ms/step - loss: 0.2574 - accuracy: 0.8679 - val_loss: 0.2473 - val_accuracy: 0.8709
Normalized confusion matrix
Accuracy : 0.8727464294076329
F1 Score: 0.8537208989368861

Conclusion:

Even with three dense layers and a dropout layer after each to reduce overfitting, the model barely improved, if at all, so the neural network definitely shouldn't be used: it isn't cost-efficient compared to the simpler models. I only trained for 10 epochs, so with more it might have outperformed the rest, but I want to build a user interface right now, so that's work for another day.

Recommendations:

Operation Unit

Send text notifications to those who scheduled during high-cancellation time periods, i.e. between 7am-11am and 2pm-3pm.

Encourage Friday and Thursday bookings more.

Switch to phone call reminders a day prior to the appointment.

Use the app to view a patient's probability of canceling a day prior; if it's > 70%, call to suggest rescheduling before the patient even thinks of canceling.
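That last recommendation can be sketched with the pipeline's `predict_proba`; everything below is a hypothetical stand-in (tiny made-up frames and a throwaway model). In practice the saved olsmodel.sav / xgb.sav pipeline and the real feature columns would be used:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in training data: 1 = showed, 0 = missed
train = pd.DataFrame({"missed_appointment_before": [0, 1, 0, 1],
                      "sum_missed": [0, 2, 1, 3]})
target = [1, 0, 1, 0]
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(train, target)

# Flag tomorrow's patients whose no-show probability exceeds 70% for a call
upcoming = pd.DataFrame({"missed_appointment_before": [1, 0],
                         "sum_missed": [3, 0]})
p_miss = model.predict_proba(upcoming)[:, 0]  # column 0 = class 0 = missed
call_list = upcoming[p_miss > 0.70]
print(p_miss)
```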

Model:

Try upsampling for better model accuracy.

Try a more in-depth grid search using more parameters.

Get more data especially for time variables.

Experiment with label encoding for the time variables created at the beginning, i.e. ['ScheduledDayofweek', 'Scheduledmonth', 'AppointmentDayofweek', 'Appointmentmonth'].

In [59]:
%%shell
jupyter nbconvert --to html code.ipynb
[NbConvertApp] Converting notebook code.ipynb to html
[NbConvertApp] Writing 607126 bytes to code.html
Out[59]:
