CSC310-fall25 / audit_Diabetes_clustering.md

jupytext:
  formats: ipynb,md:myst
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.17.3
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3

Diabetes Hospital from 1999-2008

Info:

This data represents 10 years of clinical care at 130 hospitals. Each row is the hospital record of a patient diagnosed with diabetes. Despite strong improvements in clinical care for diabetic patients, not every patient receives the same outcome as the

import pandas as pd
import seaborn as sns
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn import tree
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

sns.set_theme(palette = 'colorblind')

from sklearn.metrics import confusion_matrix, classification_report
from IPython.display import display

# seed once; a second call would simply override the first
np.random.seed(113)

url_base = 'https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008'

# I couldn't find a direct link to the raw CSV; this is the dataset page I downloaded it from.

diabetic_df =  pd.read_csv('diabetic_data.csv', index_col = 0)

# '?' marks missing values; use np.nan (not the string 'N/A') so dropna() can detect them
diabetic_df.replace('?', np.nan, inplace = True)

diabetic_df.shape

diabetic_df.columns

sns.pairplot(diabetic_df)

important_data = ['race', 'gender', 'age', 'admission_type_id', 'time_in_hospital', 'num_lab_procedures', 'num_medications', 'number_emergency']

# dropping any data with nothing

data_df = diabetic_df[important_data].dropna()

data_df.head()

data_df.shape # far fewer columns now that we keep only the selected features
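
As a side check on the "dropping any data with nothing" step, here is a minimal sketch of how the `'?'` placeholder interacts with `dropna()`. The toy frame and its values are hypothetical, not rows from the real dataset; the point is only that a string placeholder still counts as a value, while `np.nan` counts as missing.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) that uses '?' as the missing-value marker
toy_df = pd.DataFrame({
    'race': ['Caucasian', '?', 'AfricanAmerican', '?'],
    'time_in_hospital': [1, 3, 2, 5],
})

# A string placeholder is still a value, so dropna() keeps every row
as_string = toy_df.replace('?', 'N/A').dropna()

# np.nan is recognized as missing, so dropna() removes those rows
as_nan = toy_df.replace('?', np.nan).dropna()

print(len(as_string), len(as_nan))
```

Comparing the two lengths on the real data would show how many rows actually have a missing value in the selected columns.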

data_info = data_df['race']
data_info.iloc[0:25]

num_data = pd.get_dummies(data_df, drop_first= True)
num_data.head() # one-hot encode age, race, and gender so every column is numeric

num_data.columns # more columns

num_data['number_emergency'].value_counts()

km = KMeans(n_clusters=3)

km.__dict__

km.fit(num_data)

km.__dict__

labels = km.predict(num_data)
# store the labels as strings so seaborn treats them as categories
num_data['km3'] = labels.astype(str)

sns.pairplot(num_data.sample(100), hue='km3')

# This works on the full data, but it draws too many points to finish quickly, so I plot a sample of 100 rows instead.

silo_values = metrics.silhouette_samples(num_data.drop(columns=['km3']),num_data['km3'].astype(int))

num_data['km3_silo'] = silo_values

num_data.groupby('km3')['km3_silo'].mean()
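
`StandardScaler` is imported at the top but not applied before `KMeans`. Because k-means uses Euclidean distance, columns with a wide range can dominate the 0/1 dummy columns. The following is a minimal sketch, on synthetic stand-in data rather than the hospital records, of scaling first and then summarizing the silhouette as a single number. The variable names (`wide`, `dummy`, `features`) are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(113)

# Synthetic stand-in for the patient features (not the real data)
wide = rng.normal(loc=40, scale=20, size=300)        # wide range, like a lab count
dummy = rng.integers(0, 2, size=300).astype(float)   # 0/1, like a dummy column
features = np.column_stack([wide, dummy])

# Standardize so every column has mean 0 and unit variance
scaled = StandardScaler().fit_transform(features)

km_scaled = KMeans(n_clusters=2, n_init=10, random_state=113).fit(scaled)

# silhouette_score is the mean of the per-sample silhouette values
overall = silhouette_score(scaled, km_scaled.labels_)
per_sample = silhouette_samples(scaled, km_scaled.labels_)
print(overall)
```

The silhouette ranges from -1 to 1, with values near 0 indicating overlapping clusters. Whether scaling helps on the real features is something to check, not something this sketch proves.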

target = diabetic_df['readmitted']

X = num_data.drop(columns=['km3', 'km3_silo'])
y = target.loc[X.index]
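
With `X` and `y` aligned, the cluster labels can also be scored against the ground truth. Below is a minimal sketch of that kind of metric using scikit-learn's `adjusted_rand_score` and `adjusted_mutual_info_score`. It runs on synthetic blobs with hand-picked centers (an assumption made so the example is self-contained), not on the diabetes data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

# Synthetic stand-in: 3 well-separated groups with known labels
X_demo, y_demo = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=113,
)

demo_labels = KMeans(n_clusters=3, n_init=10, random_state=113).fit_predict(X_demo)

# Both scores are near 1 when clusters match the true groups (up to renaming)
ami = adjusted_mutual_info_score(y_demo, demo_labels)
ari = adjusted_rand_score(y_demo, demo_labels)

# Shuffled labels carry no information, so the score falls to about 0
rng = np.random.default_rng(113)
ari_shuffled = adjusted_rand_score(rng.permutation(y_demo), demo_labels)
print(ami, ari, ari_shuffled)
```

On the notebook's own variables, the analogous call would presumably be `metrics.adjusted_mutual_info_score(y, num_data['km3'])`, since `y` was built from the same index as the clustered rows.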

Conclusion:

Describe what question you would be asking in applying clustering to this dataset. What does it mean if clustering does not work well?

Can we find patterns among the patients that show whether some of them are related to one another?
If clustering does not work well, it could just mean that diabetes is a condition where it is hard to find a clear pattern in these features.

How does this task compare to the classification task on this dataset?

I believe most of the information in this data comes down to the readmitted column: whether patients came back within 30 days, came back after 30 days, or did not come back at all. With clustering, we wanted to know what else we could learn from the data if we did not know the readmission outcome.

Apply KMeans using the known, correct number of clusters, K. Evaluate how well clustering worked on the data: using a true clustering metric, using visualization, and using a clustering metric that uses the ground truth labels.

The results came out close, but not close enough, so this is not a good dataset for clustering.
In the pairplot, the cluster colors overlap and do not form organized groups.

Include a discussion of your results that addresses the following: what the clustering means; what the metrics show; whether this clustering works better or worse than expected based on the classification performance (if you didn't complete assignment 7, also apply a classifier).

I think the clustering shows that it is really hard to see a pattern when patients have a lot going on at the same time. For example, some stay in the hospital longer while others do not, so the groups end up messy.
The metrics show that there is some structure, a small amount of pattern, but the clusters are far from perfect.
I mostly focused on clustering. As you can see below, I was trying to make classification work, but to be honest I feel that clustering makes things worse here, since we were mixing and grouping records together just to see something. So I suspect classification works better on this dataset.
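
Since `GaussianNB`, `train_test_split`, and `classification_report` are already imported but never used, here is a minimal sketch of the classifier baseline the prompt asks for. It uses synthetic data from `make_classification` as a stand-in (three classes, loosely mirroring the three readmitted outcomes), so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in with 3 classes (not the real patient data)
X_demo, y_demo = make_classification(
    n_samples=500,
    n_features=6,
    n_informative=3,
    n_classes=3,
    random_state=113,
)

# Hold out 25% of the rows (the default) for testing
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=113)

gnb = GaussianNB().fit(X_train, y_train)
y_pred = gnb.predict(X_test)

# Accuracy on the held-out rows, plus per-class precision and recall
accuracy = gnb.score(X_test, y_test)
print(accuracy)
print(classification_report(y_test, y_pred))
```

Swapping in the notebook's `X` and `y` would give a classification score to set beside the clustering metrics.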

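df6 = diabetic_df[['race','gender','age']]
df6.head()

# df6 has no 'char' column and no numeric columns, so pairplot cannot draw it;
# count patients per age bracket, split by gender, instead
sns.countplot(data=df6, x='age', hue='gender')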