language:
- en
pipeline_tag: tabular-classification
---

# Stroke Prediction Model

This project implements a machine learning pipeline for predicting stroke risk from tabular patient data. Multiple models are trained and the best-performing one is selected. Below is a detailed explanation of how each key consideration was implemented.

### Dataset

This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient.

### Attribute Information

1. id: unique identifier
2. gender: "Male", "Female" or "Other"
3. age: age of the patient
4. hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5. heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
6. ever_married: "No" or "Yes"
7. work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
8. Residence_type: "Rural" or "Urban"
9. avg_glucose_level: average glucose level in blood
10. bmi: body mass index
11. smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"\*
12. stroke: 1 if the patient had a stroke or 0 if not

\*"Unknown" in smoking_status means that the information is unavailable for this patient.

## Key Considerations Implementation

### Data Cleaning

#### Drop id column

The id column is dropped: it serves as a unique identifier for each row but does not contribute to the predictive power of the model.

#### Remove missing values

Entries with a missing bmi value are removed; because they are few in number, dropping them has negligible impact on model accuracy.

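The two cleaning steps above can be sketched with pandas; the toy DataFrame here is only a stand-in for the real dataset, whose loading code is not shown in this README:

```python
import pandas as pd

# Toy frame standing in for the stroke dataset
df = pd.DataFrame({
    "id": [101, 102, 103],
    "age": [67.0, 54.0, 80.0],
    "bmi": [36.6, None, 32.5],
    "stroke": [1, 0, 1],
})

# Drop the id column: it only identifies rows and has no predictive value
df = df.drop(columns=["id"])

# Remove the few rows with a missing bmi value
df = df.dropna(subset=["bmi"]).reset_index(drop=True)
```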
### Feature Engineering

#### Binary Encoding

Convert categorical features with only two unique values into binary numeric format for easier processing by machine learning models:

- ever_married: encoded as 0 for "No" and 1 for "Yes".
- Residence_type: encoded as 0 for "Rural" and 1 for "Urban".

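A minimal sketch of this encoding using pandas `map` (the project's exact implementation is not shown here):

```python
import pandas as pd

df = pd.DataFrame({
    "ever_married": ["Yes", "No", "Yes"],
    "Residence_type": ["Urban", "Rural", "Urban"],
})

# Map each two-valued categorical column onto {0, 1}
df["ever_married"] = df["ever_married"].map({"No": 0, "Yes": 1})
df["Residence_type"] = df["Residence_type"].map({"Rural": 0, "Urban": 1})
```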
#### One-Hot Encoding for Multi-Class Categorical Features

- For features with more than two categories, such as gender, work_type, and smoking_status, one-hot encoding creates a separate binary column for each category.
- The onehot_encode function is assumed to handle the transformation, adding a column per category while dropping the original column.

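The `onehot_encode` helper itself is not shown in this README; a plausible implementation, assumed here to be built on `pd.get_dummies`, might look like:

```python
import pandas as pd

def onehot_encode(df, column):
    """Replace `column` with one binary indicator column per category."""
    dummies = pd.get_dummies(df[column], prefix=column, dtype=int)
    return pd.concat([df.drop(columns=[column]), dummies], axis=1)

# Illustrative use on a toy frame
df = pd.DataFrame({"gender": ["Male", "Female", "Other"], "age": [67, 54, 80]})
df = onehot_encode(df, "gender")
```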
#### Split Dataset into Features and Target

- Separate the target variable (stroke) from the features:
  - X: contains all feature columns used as input for the model.
  - y: contains the target column, which indicates whether a stroke occurred.

#### Train-Test Split

- Split the dataset into training and testing sets so the model is evaluated on unseen data, which helps detect overfitting.
- The specific split ratio (e.g., 70% train, 30% test) can be customized as needed.

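These two steps can be sketched with scikit-learn's `train_test_split`; the toy DataFrame and the 70/30 stratified split are illustrative assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the cleaned, encoded stroke data
df = pd.DataFrame({
    "age": [67, 54, 80, 49, 61, 72, 58, 45, 69, 51],
    "hypertension": [0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
    "stroke": [1, 0, 1, 0, 0, 1, 0, 0, 1, 0],
})

# Separate the target from the features
X = df.drop(columns=["stroke"])
y = df["stroke"]

# 70/30 split; stratify keeps the stroke/no-stroke ratio similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```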
### Model Selection

The following models are evaluated:

- Logistic Regression
- K-Nearest Neighbors
- Support Vector Machine (Linear Kernel)
- Support Vector Machine (RBF Kernel)
- Neural Network
- Gradient Boosting

Each model is evaluated on whether it:

- Handles both numerical and categorical features
- Resists overfitting
- Provides feature importance
- Performs well on imbalanced data

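The exact model configurations used in this project are not shown here; a minimal comparison loop over the listed estimators, using scikit-learn and synthetic stand-in data, might look like:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Imbalanced synthetic data standing in for the prepared stroke features
X, y = make_classification(n_samples=300, n_features=10, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "SVM (Linear Kernel)": SVC(kernel="linear"),
    "SVM (RBF Kernel)": SVC(kernel="rbf"),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model and keep the one with the best held-out accuracy
scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in models.items()}
best_name = max(scores, key=scores.get)
```

Accuracy is used here only for brevity; on imbalanced data a metric such as recall or F1 on the stroke class would be more informative.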
### Software Engineering Best Practices

#### A. Logging

A comprehensive logging system is configured:

```python
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
```

Logging features:

- Timestamp for each operation
- Different log levels (INFO, ERROR)
- Operation tracking
- Error capture and reporting

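How the pipeline consumes this configuration is not shown in this README; a hypothetical `run_step` wrapper illustrates how operation tracking and error capture could use it:

```python
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

def run_step(name, func, *args):
    """Run one pipeline step, logging its start, completion, and any failure."""
    logger.info("Starting step: %s", name)
    try:
        result = func(*args)
        logger.info("Finished step: %s", name)
        return result
    except Exception:
        # logger.exception records the traceback at ERROR level
        logger.exception("Step failed: %s", name)
        raise

# Illustrative call with a trivial stand-in step
cleaned = run_step("square", lambda x: x * x, 4)
```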
#### B. Documentation

- Docstrings for all classes and methods
- Clear code structure with comments
- This README file
- Logging outputs for tracking