---
metrics:
- accuracy
- ROC AUC
- confusion matrix
pipeline_tag: tabular-classification
library_name: sklearn
---

# Final Project in AI Engineering, DVAE26, ht24

## Instructions

This project focuses on developing a machine learning model for predictive analytics using a healthcare dataset. It encompasses the entire machine learning lifecycle, from data quality assessment and preprocessing to model training, evaluation, and deployment.

*Objective:* Predict stroke occurrences using tabular healthcare data.

*Key Considerations:*

- Data preprocessing: Handle missing values, encode categorical variables, and normalize features.
- Model selection: Train, evaluate, and compare multiple machine learning models.
- Evaluation metrics: Use accuracy, ROC AUC, and the confusion matrix to measure model performance.
- SE best practices: Apply modular code, version control, hyperparameter tuning, and interpretability techniques.

*Dataset:* Kaggle Stroke Prediction Dataset

*Model Development:* Build a pipeline to process data, train models, and evaluate their performance.

*Deployment:* Deploy the best-performing model on Hugging Face.

*Deliverables:* A report summarizing the workflow, key findings, and insights.

---

## Workflow Pipeline

### 1. Data Acquisition

The dataset was sourced from [Kaggle's Stroke Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset). It includes 5110 records with 12 features, such as age, gender, hypertension, and smoking status.
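Loading and inspecting the CSV can be sketched as follows; a tiny inline sample stands in for the downloaded file so the snippet is self-contained (in the project, `pd.read_csv` would point at the Kaggle file instead):

```python
import io

import pandas as pd

# Three-row stand-in for the real CSV; column names follow the Kaggle dataset.
sample = io.StringIO(
    "gender,age,hypertension,bmi,smoking_status,stroke\n"
    "Male,67,0,36.6,formerly smoked,1\n"
    "Female,61,0,,never smoked,1\n"
    "Male,80,1,32.5,never smoked,0\n"
)
df = pd.read_csv(sample)

# Quick sanity checks before any preprocessing.
print(df.shape)                 # (3, 6) here; (5110, 12) for the full dataset
print(df["bmi"].isna().sum())   # BMI is the column with missing values
```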

### 2. Data Preprocessing

The following steps were performed to ensure data quality and readiness for modeling:

- **Handling Missing Values:** Imputed missing BMI values using the mean strategy.
- **Encoding Categorical Features:** Used one-hot encoding for variables like gender and residence type.
- **Feature Scaling:** Standardized continuous variables to ensure uniformity across features.
- **Balancing the Dataset:** Addressed class imbalance using SMOTE (Synthetic Minority Oversampling Technique).
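The first three steps can be sketched as a scikit-learn `ColumnTransformer` (the toy frame and column subset here are assumptions for illustration); SMOTE, from the separate imbalanced-learn package, would then be applied to the transformed *training* split only:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative frame with the kinds of columns the dataset has.
df = pd.DataFrame({
    "age": [67, 61, 80, 49],
    "bmi": [36.6, None, 32.5, 27.4],   # one missing value to impute
    "gender": ["Male", "Female", "Female", "Male"],
    "Residence_type": ["Urban", "Rural", "Urban", "Rural"],
})

numeric = ["age", "bmi"]
categorical = ["gender", "Residence_type"]

preprocess = ColumnTransformer([
    # Mean-impute BMI, then standardize the continuous features.
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())]), numeric),
    # One-hot encode the categoricals.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

X = preprocess.fit_transform(df)
# SMOTE (imblearn.over_sampling.SMOTE) would be fit on the training
# portion of X and y at this point to balance the classes.
print(X.shape)  # (4, 6): 2 scaled numeric + 4 one-hot columns
```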

### 3. Model Development

Several machine learning models were developed and evaluated:

1. **Linear Regression**
2. **Logistic Regression**
3. **Random Forest Classifier** + hyperparameter tuning
4. **Decision Tree Classifier**
5. **Gradient Boosting Classifier** + hyperparameter tuning
6. **K-Nearest Neighbors (KNN)**
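A minimal sketch of the train-and-compare loop, using synthetic data in place of the preprocessed stroke features (the feature matrix and model settings here are assumptions, and only three of the six models are shown):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed stroke features.
X, y = make_classification(n_samples=600, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]   # scores for ROC AUC
    print(f"{name}: acc={model.score(X_te, y_te):.3f} "
          f"auc={roc_auc_score(y_te, proba):.3f}")
```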

---

## Results

The table below summarizes the performance of the models:

| Model               | Accuracy | ROC AUC  | Confusion Matrix         |
|---------------------|----------|----------|--------------------------|
| Linear Regression   | 0.775632 | 0.837220 | [[478, 195], [107, 566]] |
| Logistic Regression | 0.776374 | 0.837297 | [[492, 181], [120, 553]] |
| Random Forest       | 0.942793 | 0.990407 | [[629, 44], [33, 640]]   |
| Decision Tree       | 0.892273 | 0.892273 | [[588, 85], [60, 613]]   |
| Gradient Boosting   | 0.957652 | 0.988327 | [[651, 22], [35, 638]]   |
| KNN                 | 0.880386 | 0.946411 | [[531, 142], [19, 654]]  |

### Key Insights

- **Gradient Boosting** achieved the highest accuracy (95.76%) and maintained an excellent ROC AUC score (0.988).
- **Random Forest** provided comparable ROC AUC performance (0.990), with slightly lower accuracy (94.28%).
- Simpler models like **Logistic Regression** and **Linear Regression** performed adequately but were less effective in comparison.
- **KNN** and **Decision Tree** showed strong results but were outperformed by Gradient Boosting and Random Forest.

### Evaluation Metrics

1. **Accuracy:** Gradient Boosting demonstrated the best predictive accuracy, correctly identifying 95.76% of instances.
2. **ROC AUC:** Random Forest achieved the highest ROC AUC score, indicating excellent separability between stroke and non-stroke cases.
3. **Confusion Matrix:** Gradient Boosting had minimal false negatives and false positives, demonstrating robust performance.
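All three metrics come straight from `sklearn.metrics`; a toy example with hypothetical labels and scores:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Hypothetical predictions for six test cases.
y_true  = [0, 0, 1, 1, 1, 0]   # ground truth
y_pred  = [0, 1, 1, 1, 0, 0]   # hard class predictions
y_score = [0.2, 0.6, 0.9, 0.8, 0.4, 0.1]  # predicted probabilities

print(accuracy_score(y_true, y_pred))    # 4 of 6 correct -> 0.666...
print(roc_auc_score(y_true, y_score))    # uses scores, not hard labels
print(confusion_matrix(y_true, y_pred))  # rows: true class, cols: predicted
```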

---

## Deployment

The best-performing model (Gradient Boosting) was deployed on Hugging Face. It can be accessed publicly via the following link: [ML stroke prediction](https://huggingface.co/emlacodeuse/ml-stroke-prediction).

Deployment highlights:

- **Platform:** Hugging Face
- **Usage:** Users can input patient data to predict stroke risk interactively.
- **Versioning:** Deployment includes metadata for reproducibility.
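A sketch of the serialization step behind such a deployment, assuming the trained model is saved with `joblib` before being uploaded to the Hugging Face repo (the exact upload mechanism used in the project is not documented here):

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in training run; the real model is trained on the stroke data.
X, y = make_classification(n_samples=200, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Serialize the trained model; this artifact is what would be uploaded
# to the Hugging Face repo for interactive inference.
joblib.dump(model, "stroke_model.joblib")

# Round-trip check: the restored model must predict identically.
restored = joblib.load("stroke_model.joblib")
print((restored.predict(X) == model.predict(X)).all())  # True
```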

---

## Data Quality Analysis

### Analysis of Data Quality

- The dataset included various features with distinct scales and types. It required preprocessing to ensure readiness for training.
- Missing values in BMI were a key challenge, addressed using mean imputation.

### Challenges Faced

1. **Class Imbalance:** Stroke cases were significantly underrepresented in the dataset, potentially biasing model predictions.
2. **Feature Correlation:** Certain features (e.g., age and hypertension) required analysis to ensure they contributed meaningfully to the prediction task.

### Measures Taken to Ensure Data Integrity

- Visualized data distribution and checked for anomalies before preprocessing.
- Validated preprocessing steps using unit tests.
- Conducted exploratory data analysis (EDA) to confirm correlations and feature relevance.
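A hypothetical unit test in the spirit of the ones described above, checking that mean imputation fills the missing BMI entry (the function name and values are illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

def test_bmi_mean_imputation():
    # The missing entry should be replaced by the column mean (30.0),
    # and no NaNs should survive the transform.
    bmi = np.array([[20.0], [np.nan], [40.0]])
    imputed = SimpleImputer(strategy="mean").fit_transform(bmi)
    assert imputed[1, 0] == 30.0
    assert not np.isnan(imputed).any()

test_bmi_mean_imputation()
print("ok")
```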

---

## SE Best Practices

### Key Practices Implemented:

- **Modularization:** Encapsulated preprocessing, training, and evaluation in separate functions for reusability.
- **Version Control:** Maintained a Git repository with detailed commit messages.
- **Hyperparameter Tuning:** Experimented with different learning rates, tree depths, and ensemble configurations.
- **Documentation:** Created comprehensive project documentation, including a README file.
- **Testing:** Added unit tests for data preprocessing and model evaluation steps.
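A tuning search over learning rates, tree depths, and ensemble sizes is commonly run with `GridSearchCV`; a sketch on synthetic data (the project's actual search space and scoring are assumptions here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed training data.
X, y = make_classification(n_samples=300, random_state=0)

# Illustrative grid over the kinds of knobs mentioned above.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"learning_rate": [0.05, 0.1],
                "max_depth": [2, 3],
                "n_estimators": [50, 100]},
    scoring="roc_auc",  # matches the project's headline metric
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)  # best combination found by cross-validation
```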

---

## Key Outcomes

### Results:

- Gradient Boosting emerged as the best-performing model for stroke prediction.
- The project showcased the impact of robust preprocessing and careful model selection on achieving high accuracy and ROC AUC scores.

### Personal Insights:

- This was an extremely interesting project for me, as I had never done anything like it before. I learned a lot.

---
|