---
metrics:
- accuracy
- ROC AUC
- confusion matrix
pipeline_tag: tabular-classification
library_name: sklearn
---

# Final Project in AI Engineering, DVAE26, ht24

## Instructions

This project focuses on developing a machine learning model for predictive analytics using a healthcare dataset. It encompasses the entire machine learning lifecycle, from data quality assessment and preprocessing to model training, evaluation, and deployment.

*Objective:* Predict stroke occurrences using tabular healthcare data.

*Key Considerations:*

- Data preprocessing: Handle missing values, encode categorical variables, and normalize features.
- Model selection: Train, evaluate, and compare multiple machine learning models.
- Evaluation metrics: Use accuracy, ROC AUC, and the confusion matrix to measure model performance.
- SE best practices: Apply modular code, version control, hyperparameter tuning, and interpretability techniques.

*Dataset:* Kaggle Stroke Prediction Dataset

*Model Development:* Build a pipeline to process data, train models, and evaluate their performance.

*Deployment:* Deploy the best-performing model on Hugging Face.

*Deliverables:* A report summarizing the workflow, key findings, and insights.

---

## Workflow Pipeline

### 1. Data Acquisition

The dataset was sourced from [Kaggle's Stroke Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset). It includes 5110 records with 12 features, such as age, gender, hypertension, and smoking status.

### 2. Data Preprocessing

The following steps were performed to ensure data quality and readiness for modeling:

- **Handling Missing Values:** Imputed missing BMI values using the mean strategy.
- **Encoding Categorical Features:** Used one-hot encoding for variables like gender and residence type.
- **Feature Scaling:** Standardized continuous variables to ensure uniformity across features.
- **Balancing the Dataset:** Addressed class imbalance using SMOTE (Synthetic Minority Oversampling Technique).

### 3. Model Development

Several machine learning models were developed and evaluated:

1. **Linear Regression**
2. **Logistic Regression**
3. **Random Forest Classifier** + hyperparameter tuning
4. **Decision Tree Classifier**
5. **Gradient Boosting Classifier** + hyperparameter tuning
6. **K-Nearest Neighbors (KNN)**

---

## Results

The table below summarizes the performance of the models:

| Model               | Accuracy | ROC AUC  | Confusion Matrix         |
|---------------------|----------|----------|--------------------------|
| Linear Regression   | 0.775632 | 0.837220 | [[478, 195], [107, 566]] |
| Logistic Regression | 0.776374 | 0.837297 | [[492, 181], [120, 553]] |
| Random Forest       | 0.942793 | 0.990407 | [[629, 44], [33, 640]]   |
| Decision Tree       | 0.892273 | 0.892273 | [[588, 85], [60, 613]]   |
| Gradient Boosting   | 0.957652 | 0.988327 | [[651, 22], [35, 638]]   |
| KNN                 | 0.880386 | 0.946411 | [[531, 142], [19, 654]]  |

### Key Insights

- **Gradient Boosting** achieved the highest accuracy (95.76%) and maintained an excellent ROC AUC score (0.988).
- **Random Forest** provided comparable ROC AUC performance (0.990), with slightly lower accuracy (94.28%).
- Simpler models like **Logistic Regression** and **Linear Regression** performed adequately but were less effective in comparison.
- **KNN** and **Decision Tree** showed strong results but were outperformed by Gradient Boosting and Random Forest.

### Evaluation Metrics

1. **Accuracy:** Gradient Boosting demonstrated the best predictive accuracy, correctly identifying 95.76% of instances.
2. **ROC AUC:** Random Forest achieved the highest ROC AUC score, indicating excellent separability between stroke and non-stroke cases.
3. **Confusion Matrix:** Gradient Boosting had minimal false negatives and false positives, demonstrating robust performance.
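As an illustrative sketch (not the project's actual code), the preprocessing and modeling steps described above can be assembled into a single scikit-learn pipeline. The synthetic stand-in data, the column subset, and the default hyperparameters here are placeholders, and the SMOTE step (which comes from the separate `imbalanced-learn` package) is omitted so the example stays self-contained:

```python
# Hedged sketch of the preprocessing + modeling pipeline: mean-impute BMI,
# one-hot encode categoricals, standardize numerics, then fit Gradient
# Boosting and report accuracy, ROC AUC, and the confusion matrix.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small synthetic stand-in for the Kaggle stroke dataset; column names
# mirror the real dataset, values are random.
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "age": rng.uniform(1, 90, n),
    "avg_glucose_level": rng.uniform(55, 270, n),
    # ~5% missing BMI, as in the real data quality issue described above.
    "bmi": np.where(rng.random(n) < 0.05, np.nan, rng.uniform(15, 45, n)),
    "gender": rng.choice(["Male", "Female"], n),
    "Residence_type": rng.choice(["Urban", "Rural"], n),
})
# Imbalanced binary target, loosely correlated with age.
df["stroke"] = (rng.random(n) < 0.03 + 0.002 * df["age"]).astype(int)

numeric = ["age", "avg_glucose_level", "bmi"]
categorical = ["gender", "Residence_type"]

preprocess = ColumnTransformer([
    # Mean-impute missing BMI, then standardize continuous features.
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())]), numeric),
    # One-hot encode categoricals such as gender and residence type.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

model = Pipeline([("prep", preprocess),
                  ("clf", GradientBoostingClassifier(random_state=42))])

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="stroke"), df["stroke"],
    test_size=0.2, stratify=df["stroke"], random_state=42)

model.fit(X_train, y_train)
pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]
print("accuracy:", accuracy_score(y_test, pred))
print("roc auc:", roc_auc_score(y_test, proba))
print("confusion matrix:\n", confusion_matrix(y_test, pred))
```

In the project workflow, the SMOTE resampling would be applied to the training split only (e.g. via an `imblearn.pipeline.Pipeline`), so that synthetic minority samples never leak into the test set.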
---

## Deployment

The best-performing model (Gradient Boosting) was deployed on Hugging Face. It can be accessed publicly via the following link: [ML stroke prediction](https://huggingface.co/emlacodeuse/ml-stroke-prediction).

Deployment highlights:

- **Platform:** Hugging Face
- **Usage:** Users can input patient data to predict stroke risk interactively.
- **Versioning:** Deployment includes metadata for reproducibility.

---

## Data Quality Analysis

### Analysis of Data Quality

- The dataset included features with distinct scales and types, requiring preprocessing before training.
- Missing values in BMI were a key challenge, addressed using mean imputation.

### Challenges Faced

1. **Class Imbalance:** Stroke cases were significantly underrepresented in the dataset, potentially biasing model predictions.
2. **Feature Correlation:** Certain features (e.g., age and hypertension) required analysis to ensure they contributed meaningfully to the prediction task.

### Measures Taken to Ensure Data Integrity

- Visualized data distributions and checked for anomalies before preprocessing.
- Validated preprocessing steps using unit tests.
- Conducted exploratory data analysis (EDA) to confirm correlations and feature relevance.

---

## SE Best Practices

### Key Practices Implemented:

- **Modularization:** Encapsulated preprocessing, training, and evaluation in separate functions for reusability.
- **Version Control:** Maintained a Git repository with detailed commit messages.
- **Hyperparameter Tuning:** Experimented with different learning rates, tree depths, and ensemble configurations.
- **Documentation:** Created comprehensive project documentation, including a README file.
- **Testing:** Added unit tests for data preprocessing and model evaluation steps.

---

## Key Outcomes

### Results:

- Gradient Boosting emerged as the best-performing model for stroke prediction.
- The project showcased the impact of robust preprocessing and careful model selection on achieving high accuracy and ROC AUC scores.

### Personal Insights:

- This was an extremely interesting project for me, as I had never done anything like it before. I learned a lot.

---
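To close with a hedged sketch of the deployment step: a fitted scikit-learn pipeline can be serialized with `joblib` and reloaded at serving time to score new patient records. The tiny stand-in pipeline, the file name, and the feature columns below are illustrative only, not the actual artifacts of the deployed Hugging Face model:

```python
# Hedged sketch: persist a fitted sklearn pipeline for deployment,
# then reload it to score a new (hypothetical) patient record.
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny stand-in pipeline; in practice this would be the tuned
# Gradient Boosting pipeline selected during model development.
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression())])
X = pd.DataFrame({"age": [30, 70, 45, 80],
                  "bmi": [22.0, 31.0, 27.5, 29.0]})
y = [0, 1, 0, 1]
pipe.fit(X, y)

joblib.dump(pipe, "stroke_model.joblib")     # save for deployment
loaded = joblib.load("stroke_model.joblib")  # reload at serving time

new_patient = pd.DataFrame({"age": [67], "bmi": [28.4]})
risk = loaded.predict_proba(new_patient)[0, 1]
print(f"predicted stroke probability: {risk:.3f}")
```

Bundling preprocessing and model in one pipeline means the serving code never has to reimplement imputation, encoding, or scaling, which helps the reproducibility goal noted under "Versioning" above.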