---
metrics:
- accuracy
- ROC AUC
- confusion matrix
pipeline_tag: tabular-classification
library_name: sklearn
---

# Final Project in AI Engineering, DVAE26, ht24  


## Instructions  
This project focuses on developing a machine learning model for predictive analytics using a healthcare dataset. It encompasses the entire machine learning lifecycle, from data quality assessment and preprocessing to model training, evaluation, and deployment.  

*Objective:* Predict stroke occurrences using tabular healthcare data.  

*Key Considerations:*  
- Data preprocessing: Handle missing values, encode categorical variables, and normalize features.  
- Model selection: Train, evaluate, and compare multiple machine learning models.  
- Evaluation metrics: Measure model performance with accuracy, ROC AUC, and the confusion matrix.  
- SE best practices: Apply modular code, version control, hyperparameter tuning, and interpretability techniques.  

*Dataset:* Kaggle Stroke Prediction Dataset  
*Model Development:* Build a pipeline to process data, train models, and evaluate their performance.  
*Deployment:* Deploy the best-performing model on Hugging Face.  
*Deliverables:* A report summarizing the workflow, key findings, and insights.  

---

## Workflow Pipeline  

### 1. Data Acquisition  
The dataset was sourced from [Kaggle's Stroke Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset). It includes 5110 records with 12 features, such as age, gender, hypertension, and smoking status.  

### 2. Data Preprocessing  
The following steps were performed to ensure data quality and readiness for modeling:  
- **Handling Missing Values:** Imputed missing BMI values using the mean strategy.  
- **Encoding Categorical Features:** Used one-hot encoding for variables like gender and residence type.  
- **Feature Scaling:** Standardized continuous variables to ensure uniformity across features.  
- **Balancing the Dataset:** Addressed class imbalance using SMOTE (Synthetic Minority Oversampling Technique).  
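The steps above can be sketched as a single scikit-learn `ColumnTransformer`. The column names follow the Kaggle stroke dataset's schema, and the tiny DataFrame below is illustrative only (not the real data); SMOTE from `imbalanced-learn` would then be applied to the training split after this transform.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column names as they appear in the Kaggle stroke dataset.
numeric_cols = ["age", "avg_glucose_level", "bmi"]
categorical_cols = ["gender", "ever_married", "work_type",
                    "Residence_type", "smoking_status"]

preprocess = ColumnTransformer([
    # Mean-impute missing BMI values, then standardize continuous features.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    # One-hot encode categorical variables such as gender and residence type.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
], sparse_threshold=0)  # force a dense output array

# Tiny illustrative frame with a missing BMI value (not the real dataset).
df = pd.DataFrame({
    "age": [67.0, 61.0, 80.0],
    "avg_glucose_level": [228.69, 202.21, 105.92],
    "bmi": [36.6, np.nan, 32.5],
    "gender": ["Male", "Female", "Male"],
    "ever_married": ["Yes", "Yes", "Yes"],
    "work_type": ["Private", "Self-employed", "Private"],
    "Residence_type": ["Urban", "Rural", "Rural"],
    "smoking_status": ["formerly smoked", "never smoked", "never smoked"],
})

X = preprocess.fit_transform(df)
print(X.shape)  # one row per record; categorical columns expanded one-hot
```

Wrapping the imputer and scaler in one transformer ensures the identical steps are reapplied to unseen data at prediction time. SMOTE (from `imblearn.over_sampling`) is fit on the training split only, so no synthetic samples leak into evaluation.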

### 3. Model Development  
Several machine learning models were developed and evaluated:  
1. **Linear Regression**  
2. **Logistic Regression**  
3. **Random Forest Classifier** (with hyperparameter tuning)  
4. **Decision Tree Classifier** 
5. **Gradient Boosting Classifier** (with hyperparameter tuning)  
6. **K-Nearest Neighbors (KNN)**  
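A minimal sketch of the train-and-compare loop, using synthetic data in place of the preprocessed, SMOTE-balanced feature matrix. Linear regression is omitted here because it lacks `predict_proba` and would need an explicit probability threshold; the other five models share a common interface.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed, balanced stroke data.
X, y = make_classification(n_samples=1000, n_features=12, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "roc_auc": roc_auc_score(y_test, proba),
        "confusion_matrix": confusion_matrix(y_test, pred),
    }

for name, r in results.items():
    print(f"{name}: acc={r['accuracy']:.3f}, auc={r['roc_auc']:.3f}")
```

Collecting all three metrics per model in one dictionary makes it straightforward to render the comparison table shown in the Results section.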


---

## Results  

The table below summarizes the performance of the models:  

| Model                 | Accuracy | ROC AUC  | Confusion Matrix        |  
|-----------------------|----------|----------|-------------------------|  
| Linear Regression     | 0.775632 | 0.837220 | [[478, 195], [107, 566]] |  
| Logistic Regression   | 0.776374 | 0.837297 | [[492, 181], [120, 553]] |  
| Random Forest         | 0.942793 | 0.990407 | [[629, 44], [33, 640]]   |  
| Decision Tree         | 0.892273 | 0.892273 | [[588, 85], [60, 613]]   |  
| Gradient Boosting     | 0.957652 | 0.988327 | [[651, 22], [35, 638]]   |  
| KNN                   | 0.880386 | 0.946411 | [[531, 142], [19, 654]]  |  

### Key Insights  
- **Gradient Boosting** achieved the highest accuracy (95.76%) and maintained an excellent ROC AUC score (0.988).  
- **Random Forest** provided comparable ROC AUC performance (0.990), with slightly lower accuracy (94.28%).  
- Simpler models like **Logistic Regression** and **Linear Regression** performed adequately but fell well short of the ensemble methods.  
- **KNN** and **Decision Tree** showed strong results but were outperformed by Gradient Boosting and Random Forest.  

### Evaluation Metrics  
1. **Accuracy:** Gradient Boosting demonstrated the best predictive accuracy, correctly identifying 95.76% of instances.  
2. **ROC AUC:** Random Forest achieved the highest ROC AUC score, indicating excellent separability between stroke and non-stroke cases.  
3. **Confusion Matrix:** Gradient Boosting had minimal false negatives and false positives, demonstrating robust performance.  
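The confusion matrices in the table follow scikit-learn's `[[TN, FP], [FN, TP]]` layout. As a sanity check, the derived metrics can be recomputed by hand from the Gradient Boosting matrix:

```python
# Gradient Boosting confusion matrix from the results table: [[TN, FP], [FN, TP]].
(tn, fp), (fn, tp) = [[651, 22], [35, 638]]

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision = tp / (tp + fp)  # of predicted strokes, how many were real
recall = tp / (tp + fn)     # of real strokes, how many were caught

print(f"accuracy={accuracy:.6f}, precision={precision:.4f}, recall={recall:.4f}")
# accuracy comes out to 0.957652, matching the table
```

Recall matters most here: false negatives (missed strokes) carry the highest clinical cost, so the small FN count (35) is the key strength of this model.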

---

## Deployment  
The best-performing model (Gradient Boosting) was deployed on Hugging Face. It can be accessed publicly via the following link: [ML stroke prediction](https://huggingface.co/emlacodeuse/ml-stroke-prediction).  

Deployment highlights:  
- **Platform:** Hugging Face  
- **Usage:** Users can input patient data to predict stroke risk interactively.  
- **Versioning:** Deployment includes metadata for reproducibility.  

---

## Data Quality Analysis  

### Analysis of Data Quality  
- The dataset included various features with distinct scales and types. It required preprocessing to ensure readiness for training.  
- Missing values in BMI were a key challenge, addressed using mean imputation.  

### Challenges Faced  
1. **Class Imbalance:** Stroke cases were significantly underrepresented in the dataset, potentially biasing model predictions.  
2. **Feature Correlation:** Certain features (e.g., age and hypertension) required analysis to ensure they contributed meaningfully to the prediction task.  

### Measures Taken to Ensure Data Integrity  
- Visualized data distribution and checked for anomalies before preprocessing.  
- Validated preprocessing steps using unit tests.  
- Conducted exploratory data analysis (EDA) to confirm correlations and feature relevance.  

---

## SE Best Practices  

### Key Practices Implemented:  
- **Modularization:** Encapsulated preprocessing, training, and evaluation in separate functions for reusability.  
- **Version Control:** Maintained a Git repository with detailed commit messages.  
- **Hyperparameter Tuning:** Experimented with different learning rates, tree depths, and ensemble configurations.  
- **Documentation:** Created comprehensive project documentation, including a README file.  
- **Testing:** Added unit tests for data preprocessing and model evaluation steps.  
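The hyperparameter search over learning rates, tree depths, and ensemble sizes can be sketched with `GridSearchCV`; the grid below is illustrative, and the exact values searched in the project may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed training data.
X, y = make_classification(n_samples=400, n_features=12, random_state=0)

# Example grid over learning rate, tree depth, and ensemble size.
param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
    "n_estimators": [50, 100],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # matches the project's primary ranking metric
    cv=3,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Scoring on ROC AUC rather than accuracy keeps the search aligned with the metric used to compare models, which is especially relevant for imbalanced medical data.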

---

## Key Outcomes  

### Results:  
- Gradient Boosting emerged as the best-performing model for stroke prediction.  
- The project showcased the impact of robust preprocessing and careful model selection on achieving high accuracy and ROC AUC scores.  

### Personal Insights:  
- This was an extremely interesting project for me, as I had never done anything like it before. I learned a lot.

---