"""
Title: Structured data classification from scratch
Author: [fchollet](https://twitter.com/fchollet)
Date created: 2020/06/09
Last modified: 2020/06/09
Description: Binary classification of structured data including numerical and categorical features.
Accelerator: GPU
Made backend-agnostic by: [Humbulani Ndou](https://github.com/Humbulani1234)
"""
"""
## Introduction
This example demonstrates how to do structured data classification, starting from a raw
CSV file. Our data includes both numerical and categorical features. We will use Keras
preprocessing layers to normalize the numerical features and vectorize the categorical
ones.
Note that this example requires TensorFlow 2.5 or higher, since we use `tf.data`
for the input pipeline even when running Keras on another backend.
### The dataset
[Our dataset](https://archive.ics.uci.edu/ml/datasets/heart+Disease) is provided by the
Cleveland Clinic Foundation for Heart Disease.
It's a CSV file with 303 rows. Each row contains information about a patient (a
**sample**), and each column describes an attribute of the patient (a **feature**). We
use the features to predict whether a patient has heart disease (**binary
classification**).
Here's the description of each feature:
Column| Description| Feature Type
------------|--------------------|----------------------
Age | Age in years | Numerical
Sex | (1 = male; 0 = female) | Categorical
CP | Chest pain type (0, 1, 2, 3, 4) | Categorical
Trestbpd | Resting blood pressure (in mm Hg on admission) | Numerical
Chol | Serum cholesterol in mg/dl | Numerical
FBS | Fasting blood sugar > 120 mg/dl (1 = true; 0 = false) | Categorical
RestECG | Resting electrocardiogram results (0, 1, 2) | Categorical
Thalach | Maximum heart rate achieved | Numerical
Exang | Exercise induced angina (1 = yes; 0 = no) | Categorical
Oldpeak | ST depression induced by exercise relative to rest | Numerical
Slope | Slope of the peak exercise ST segment | Numerical
CA | Number of major vessels (0-3) colored by fluoroscopy | Both numerical & categorical
Thal | 3 = normal; 6 = fixed defect; 7 = reversible defect | Categorical
Target | Diagnosis of heart disease (1 = true; 0 = false) | Target
"""
"""
## Setup
"""
import os
os.environ["KERAS_BACKEND"] = "torch" # or torch, or tensorflow
import pandas as pd
import keras
from keras import layers
"""
## Preparing the data
Let's download the data and load it into a Pandas dataframe:
"""
file_url = "http://storage.googleapis.com/download.tensorflow.org/data/heart.csv"
dataframe = pd.read_csv(file_url)
"""
The dataset includes 303 samples with 14 columns per sample (13 features, plus the target
label):
"""
dataframe.shape
"""
Here's a preview of a few samples:
"""
dataframe.head()
"""
The last column, "target", indicates whether the patient has heart disease (1) or not
(0).
Let's split the data into a training and validation set:
"""
val_dataframe = dataframe.sample(frac=0.2, random_state=1337)
train_dataframe = dataframe.drop(val_dataframe.index)
print(
f"Using {len(train_dataframe)} samples for training "
f"and {len(val_dataframe)} for validation"
)
"""
## Define dataset metadata
Here, we define the metadata of the dataset that will be useful for reading and
parsing the data into input features, and encoding the input features with respect
to their types.
"""
COLUMN_NAMES = [
"age",
"sex",
"cp",
"trestbps",
"chol",
"fbs",
"restecg",
"thalach",
"exang",
"oldpeak",
"slope",
"ca",
"thal",
"target",
]
# Target feature name.
TARGET_FEATURE_NAME = "target"
# Numeric feature names.
NUMERIC_FEATURE_NAMES = ["age", "trestbps", "thalach", "oldpeak", "slope", "chol"]
# Categorical features and their vocabulary lists.
# The vocabulary values keep the column's dtype: integer columns stay integers,
# and anything else is cast to a string, so each lookup layer later receives
# values of a consistent type.
CATEGORICAL_FEATURES_WITH_VOCABULARY = {
feature_name: sorted(
[
            # Integer categorical values must stay int; string values must be str.
value if dataframe[feature_name].dtype == "int64" else str(value)
for value in list(dataframe[feature_name].unique())
]
)
for feature_name in COLUMN_NAMES
if feature_name not in list(NUMERIC_FEATURE_NAMES + [TARGET_FEATURE_NAME])
}
# All features names.
FEATURE_NAMES = NUMERIC_FEATURE_NAMES + list(
CATEGORICAL_FEATURES_WITH_VOCABULARY.keys()
)
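"""
As a quick sanity check (optional; nothing below depends on it), we can print
the vocabulary inferred for each categorical feature:
"""

for feature_name, vocabulary in CATEGORICAL_FEATURES_WITH_VOCABULARY.items():
    print(feature_name, vocabulary)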
"""
## Feature preprocessing with Keras layers
The following features are categorical features encoded as integers:
- `sex`
- `cp`
- `fbs`
- `restecg`
- `exang`
- `ca`
We will encode these features using **one-hot encoding**. We have two options
here:
- Use `CategoryEncoding()`, which requires knowing the range of input values
and will error on input outside the range.
- Use `IntegerLookup()`, which will build a lookup table for inputs and reserve
an output index for unknown input values.

For this example, we want a simple solution that will handle out-of-range inputs
at inference, so we will use `IntegerLookup()`.
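
As a toy, standalone illustration (the pipeline below sets `num_oov_indices=0`
instead, since its vocabularies are built from the full dataframe), here is how
`IntegerLookup` can reserve an index for unknown values:

```python
from keras import layers

lookup = layers.IntegerLookup(vocabulary=[0, 1, 2], num_oov_indices=1)
# 0 and 2 are in the vocabulary; 9 is not, so it maps to the OOV index 0.
print(lookup([[0], [2], [9]]))
```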
We also have a categorical feature encoded as a string: `thal`. We will build an
index of all possible values and encode them using the `StringLookup()` layer.

Finally, the following features are continuous numerical features:
- `age`
- `trestbps`
- `chol`
- `thalach`
- `oldpeak`
- `slope`
For each of these features, we will use a `Normalization()` layer to make sure the mean
of each feature is 0 and its standard deviation is 1.
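
For instance, here is a minimal standalone sketch (toy data, separate from the
pipeline below) of how a `Normalization` layer learns statistics via `adapt()`:

```python
import numpy as np
from keras import layers

norm = layers.Normalization()
norm.adapt(np.array([[1.0], [2.0], [3.0]]))
# 2.0 is the mean of the adapted data, so it normalizes to roughly 0.
print(norm(np.array([[2.0]])))
```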
Below, we define two utility functions to perform these operations:

- `encode_numerical_feature` to apply featurewise normalization to numerical features.
- `encode_categorical` to one-hot encode string or integer categorical features.
"""
# TensorFlow is needed for the `tf.data.Dataset` input pipeline.
import tensorflow as tf


# We one-hot encode the categorical features of each dataset element here,
# ahead of model training, because only the TensorFlow backend supports
# string tensors inside a model.
def encode_categorical(features, target):
for feature_name in features:
if feature_name in CATEGORICAL_FEATURES_WITH_VOCABULARY:
lookup_class = (
layers.StringLookup
if features[feature_name].dtype == "string"
else layers.IntegerLookup
)
vocabulary = CATEGORICAL_FEATURES_WITH_VOCABULARY[feature_name]
            # Create a lookup layer that one-hot encodes the feature's values.
            # Since we are not using a mask token nor expecting any out-of-vocabulary
            # (oov) token, we set mask_token to None and num_oov_indices to 0.
index = lookup_class(
vocabulary=vocabulary,
mask_token=None,
num_oov_indices=0,
output_mode="binary",
)
            # Encode the feature value with the lookup layer (one-hot output).
value_index = index(features[feature_name])
features[feature_name] = value_index
    # Change the features from an OrderedDict to a plain dict, to match the
    # dict of `Input` objects the model expects.
    return dict(features), target
def encode_numerical_feature(feature, name, dataset):
# Create a Normalization layer for our feature
normalizer = layers.Normalization()
# Prepare a Dataset that only yields our feature
feature_ds = dataset.map(lambda x, y: x[name])
feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))
# Learn the statistics of the data
normalizer.adapt(feature_ds)
# Normalize the input feature
encoded_feature = normalizer(feature)
return encoded_feature
"""
Let's generate `tf.data.Dataset` objects for each dataframe:
"""
def dataframe_to_dataset(dataframe):
dataframe = dataframe.copy()
labels = dataframe.pop("target")
ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels)).map(
encode_categorical
)
ds = ds.shuffle(buffer_size=len(dataframe))
return ds
train_ds = dataframe_to_dataset(train_dataframe)
val_ds = dataframe_to_dataset(val_dataframe)
"""
Each `Dataset` yields a tuple `(input, target)` where `input` is a dictionary of features
and `target` is the value `0` or `1`:
"""
for x, y in train_ds.take(1):
print("Input:", x)
print("Target:", y)
"""
Let's batch the datasets:
"""
train_ds = train_ds.batch(32)
val_ds = val_ds.batch(32)
"""
## Build a model
With this done, we can create our end-to-end model:
"""
# Categorical features have different shapes after encoding, depending on the
# vocabulary size of each feature. We create the inputs accordingly, to match the
# elements generated by the tf.data.Dataset after preprocessing.
def create_model_inputs():
inputs = {}
    # A helper function for creating the input of a categorical feature.
def create_input_helper(feature_name):
num_categories = len(CATEGORICAL_FEATURES_WITH_VOCABULARY[feature_name])
inputs[feature_name] = layers.Input(
name=feature_name, shape=(num_categories,), dtype="int64"
)
return inputs
for feature_name in FEATURE_NAMES:
if feature_name in CATEGORICAL_FEATURES_WITH_VOCABULARY:
# Categorical features
create_input_helper(feature_name)
else:
            # Numerical features are real-valued, so make them float32.
feature_input = layers.Input(name=feature_name, shape=(1,), dtype="float32")
            # Normalize the numerical inputs using statistics from the training data.
inputs[feature_name] = encode_numerical_feature(
feature_input, feature_name, train_ds
)
return inputs
# This layer defines the classification logic of the model.
class Classifier(keras.layers.Layer):
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.dense_1 = layers.Dense(32, activation="relu")
self.dropout = layers.Dropout(0.5)
self.dense_2 = layers.Dense(1, activation="sigmoid")
def call(self, inputs):
all_features = layers.concatenate(list(inputs.values()))
x = self.dense_1(all_features)
x = self.dropout(x)
output = self.dense_2(x)
return output
    # Suppress build warnings
def build(self, input_shape):
self.built = True
# Create the Classifier model
def create_model():
all_inputs = create_model_inputs()
output = Classifier()(all_inputs)
model = keras.Model(all_inputs, output)
return model
model = create_model()
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
"""
Let's visualize our connectivity graph:
"""
# `rankdir='LR'` is to make the graph horizontal.
keras.utils.plot_model(model, show_shapes=True, rankdir="LR")
"""
## Train the model
"""
model.fit(train_ds, epochs=50, validation_data=val_ds)
"""
We quickly get to 80% validation accuracy.
"""
"""
## Inference on new data

To get a prediction for a new sample, you can simply call `model.predict()`. There are
just a few things you need to do:

1. Wrap scalars into a list so as to have a batch dimension (models only process batches
of data, not single samples).
2. One-hot encode the categorical features, since the model expects them already
encoded (during training this was done by `encode_categorical`).
3. Call `convert_to_tensor` on each feature.
"""
sample = {
"age": 60,
"sex": 1,
"cp": 1,
"trestbps": 145,
"chol": 233,
"fbs": 1,
"restecg": 2,
"thalach": 150,
"exang": 0,
"oldpeak": 2.3,
"slope": 3,
"ca": 0,
"thal": "fixed",
}
# Given a feature's vocabulary list (`cat`) and the sample's value for that feature
# (`cat_value`), return the one-hot encoding of the value.
def get_cat_encoding(cat, cat_value):
# Create a list of zeros with the same length as categories
encoding = [0] * len(cat)
# Find the index of category_value in categories and set the corresponding position to 1
if cat_value in cat:
encoding[cat.index(cat_value)] = 1
return encoding
for name, value in sample.items():
if name in CATEGORICAL_FEATURES_WITH_VOCABULARY:
sample.update(
{
name: get_cat_encoding(
CATEGORICAL_FEATURES_WITH_VOCABULARY[name], sample[name]
)
}
)
# Convert inputs to tensors
input_dict = {name: tf.convert_to_tensor([value]) for name, value in sample.items()}
predictions = model.predict(input_dict)
print(
    f"This particular patient had a {100 * predictions[0][0]:.1f} "
    "percent probability of having heart disease, "
    "as evaluated by our model."
)
"""
## Conclusions

- The original model (the one that runs only on TensorFlow) converges quickly to
around 80% validation accuracy, remains there for extended periods, and at times
reaches 85%.
- The updated (backend-agnostic) model may fluctuate between 78% and 83% validation
accuracy, at times reaching 86%, and likewise converges to around 80%.
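
To reproduce the comparison yourself, you can switch the backend before importing
Keras, as at the top of this example:

```python
import os

os.environ["KERAS_BACKEND"] = "jax"  # or "tensorflow", or "torch"
import keras
```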
"""