---
metrics:
- accuracy
- roc_auc
- confusion_matrix
pipeline_tag: tabular-classification
library_name: sklearn
---
# Final Project in AI Engineering, DVAE26, ht24
## Instructions
This project focuses on developing a machine learning model for predictive analytics using a healthcare dataset. It encompasses the entire machine learning lifecycle, from data quality assessment and preprocessing to model training, evaluation, and deployment.
*Objective:* Predict stroke occurrences using tabular healthcare data.
*Key Considerations:*
- Data preprocessing: Handle missing values, encode categorical variables, and normalize features.
- Model selection: Train, evaluate, and compare multiple machine learning models.
- Evaluation metrics: Measure model performance with accuracy, ROC AUC, and the confusion matrix.
- SE best practices: Apply modular code, version control, hyperparameter tuning, and interpretability techniques.
*Dataset:* Kaggle Stroke Prediction Dataset
*Model Development:* Build a pipeline to process data, train models, and evaluate their performance.
*Deployment:* Deploy the best-performing model on Hugging Face.
*Deliverables:* A report summarizing the workflow, key findings, and insights.
---
## Workflow Pipeline
### 1. Data Acquisition
The dataset was sourced from [Kaggle's Stroke Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset). It includes 5110 records with 12 features, such as age, gender, hypertension, and smoking status.
### 2. Data Preprocessing
The following steps were performed to ensure data quality and readiness for modeling:
- **Handling Missing Values:** Imputed missing BMI values using the mean strategy.
- **Encoding Categorical Features:** Used one-hot encoding for variables like gender and residence type.
- **Feature Scaling:** Standardized continuous variables to ensure uniformity across features.
- **Balancing the Dataset:** Addressed class imbalance using SMOTE (Synthetic Minority Oversampling Technique).
### 3. Model Development
Several machine learning models were developed and evaluated:
1. **Linear Regression**
2. **Logistic Regression**
3. **Random Forest Classifier** + hyperparameter tuning
4. **Decision Tree Classifier**
5. **Gradient Boosting Classifier** + hyperparameter tuning
6. **K-Nearest Neighbors (KNN)**
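A minimal sketch of the train-and-compare loop behind this model list, using synthetic data in place of the preprocessed, SMOTE-balanced feature matrix. Linear regression is left out here because scoring it as a classifier requires a manual probability threshold; everything else mirrors the comparison described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the balanced training data
X, y = make_classification(n_samples=1000, n_features=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "roc_auc": roc_auc_score(y_test, proba),
        "confusion_matrix": confusion_matrix(y_test, pred),
    }
```

Collecting all three metrics in one dictionary makes it straightforward to render a comparison table like the one in the Results section.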
---
## Results
The table below summarizes the performance of the models:
| Model | Accuracy | ROC AUC | Confusion Matrix |
|-----------------------|----------|----------|-------------------------|
| Linear Regression | 0.775632 | 0.837220 | [[478, 195], [107, 566]] |
| Logistic Regression | 0.776374 | 0.837297 | [[492, 181], [120, 553]] |
| Random Forest | 0.942793 | 0.990407 | [[629, 44], [33, 640]] |
| Decision Tree | 0.892273 | 0.892273 | [[588, 85], [60, 613]] |
| Gradient Boosting | 0.957652 | 0.988327 | [[651, 22], [35, 638]] |
| KNN | 0.880386 | 0.946411 | [[531, 142], [19, 654]] |
### Key Insights
- **Gradient Boosting** achieved the highest accuracy (95.76%) and maintained an excellent ROC AUC score (0.988).
- **Random Forest** provided comparable ROC AUC performance (0.990), with slightly lower accuracy (94.28%).
- Simpler models like **Logistic Regression** and **Linear Regression** performed adequately but were less effective in comparison.
- **KNN** and **Decision Tree** showed strong results but were outperformed by Gradient Boosting and Random Forest.
### Evaluation Metrics
1. **Accuracy:** Gradient Boosting demonstrated the best predictive accuracy, correctly identifying 95.76% of instances.
2. **ROC AUC:** Random Forest achieved the highest ROC AUC score, indicating excellent separability between stroke and non-stroke cases.
3. **Confusion Matrix:** Gradient Boosting had minimal false negatives and false positives, demonstrating robust performance.
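The accuracy figures above follow directly from the confusion matrices; for example, the Gradient Boosting accuracy in the table can be recomputed from its matrix:

```python
import numpy as np

# Gradient Boosting confusion matrix from the results table:
# rows = actual [no stroke, stroke], columns = predicted [no stroke, stroke]
cm = np.array([[651, 22],
               [35, 638]])

tn, fp = cm[0]   # true negatives, false positives
fn, tp = cm[1]   # false negatives, true positives

# Accuracy = correct predictions / all predictions
accuracy = (tn + tp) / cm.sum()   # ≈ 0.957652, matching the table
```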
---
## Deployment
The best-performing model (Gradient Boosting) was deployed on Hugging Face. It can be accessed publicly via the following link: [ML stroke prediction](https://huggingface.co/emlacodeuse/ml-stroke-prediction).
Deployment highlights:
- **Platform:** Hugging Face
- **Usage:** Users can input patient data to predict stroke risk interactively.
- **Versioning:** Deployment includes metadata for reproducibility.
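A hedged sketch of how a scikit-learn model is typically serialized for the Hub and restored by a consumer; the filename is illustrative, and in practice the file would be fetched with `huggingface_hub.hf_hub_download` before loading.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Train a stand-in model and serialize it (filename is hypothetical)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)
joblib.dump(model, "stroke_model.joblib")

# A consumer restores the artifact and predicts stroke risk on new data,
# e.g. after downloading it from the Hub
restored = joblib.load("stroke_model.joblib")
risk = restored.predict_proba(X[:1])[:, 1]  # probability of stroke
```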
---
## Data Quality Analysis
### Overview
- The dataset included various features with distinct scales and types. It required preprocessing to ensure readiness for training.
- Missing values in BMI were a key challenge, addressed using mean imputation.
### Challenges Faced
1. **Class Imbalance:** Stroke cases were significantly underrepresented in the dataset, potentially biasing model predictions.
2. **Feature Correlation:** Certain features (e.g., age and hypertension) required analysis to ensure they contributed meaningfully to the prediction task.
### Measures Taken to Ensure Data Integrity
- Visualized data distribution and checked for anomalies before preprocessing.
- Validated preprocessing steps using unit tests.
- Conducted exploratory data analysis (EDA) to confirm correlations and feature relevance.
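A hypothetical example of the kind of unit test used to validate preprocessing, here checking that mean imputation fills missing BMI values as expected (the test name and data are illustrative, not the project's actual test suite):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

def test_bmi_imputation_fills_missing_with_mean():
    # Two observed BMI values (mean 25.0) and one missing value
    df = pd.DataFrame({"bmi": [20.0, 30.0, np.nan]})
    imputed = SimpleImputer(strategy="mean").fit_transform(df[["bmi"]])
    assert not np.isnan(imputed).any()        # no missing values remain
    assert imputed[2, 0] == 25.0              # NaN replaced by the mean

test_bmi_imputation_fills_missing_with_mean()
```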
---
## SE Best Practices
### Key Practices Implemented
- **Modularization:** Encapsulated preprocessing, training, and evaluation in separate functions for reusability.
- **Version Control:** Maintained a Git repository with detailed commit messages.
- **Hyperparameter Tuning:** Experimented with different learning rates, tree depths, and ensemble configurations.
- **Documentation:** Created comprehensive project documentation, including a README file.
- **Testing:** Added unit tests for data preprocessing and model evaluation steps.
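The hyperparameter tuning mentioned above can be sketched with `GridSearchCV` over a Gradient Boosting classifier; the grid and the synthetic data are illustrative, and the project's actual search ranges may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed training data
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Illustrative grid over learning rate, tree depth, and ensemble size
param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
    "n_estimators": [50, 100],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)

best_model = search.best_estimator_  # refit on all data with the best params
```

Scoring the search on ROC AUC rather than accuracy keeps the tuning objective aligned with the evaluation metric used to compare models.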
---
## Key Outcomes
### Results
- Gradient Boosting emerged as the best-performing model for stroke prediction.
- The project showcased the impact of robust preprocessing and careful model selection on achieving high accuracy and ROC AUC scores.
### Personal Insights
- This was an extremely interesting project for me, as I had never done anything like it before, and I learned a lot.
---