---
metrics:
- accuracy
- roc_auc
- confusion_matrix
pipeline_tag: tabular-classification
library_name: sklearn
---
# Final Project in AI Engineering, DVAE26, ht24
## Instructions
This project focuses on developing a machine learning model for predictive analytics using a healthcare dataset. It encompasses the entire machine learning lifecycle, from data quality assessment and preprocessing to model training, evaluation, and deployment.
*Objective:* Predict stroke occurrences using tabular healthcare data.
*Key Considerations:*
- Data preprocessing: Handle missing values, encode categorical variables, and normalize features.
- Model selection: Train, evaluate, and compare multiple machine learning models.
- Evaluation metrics: Measure model performance with accuracy, ROC AUC, and the confusion matrix.
- SE best practices: Apply modular code, version control, hyperparameter tuning, and interpretability techniques.
*Dataset:* Kaggle Stroke Prediction Dataset
*Model Development:* Build a pipeline to process data, train models, and evaluate their performance.
*Deployment:* Deploy the best-performing model on Hugging Face.
*Deliverables:* A report summarizing the workflow, key findings, and insights.
---
## Workflow Pipeline
### 1. Data Acquisition
The dataset was sourced from [Kaggle's Stroke Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset). It includes 5110 records with 12 features, such as age, gender, hypertension, and smoking status.
### 2. Data Preprocessing
The following steps were performed to ensure data quality and readiness for modeling:
- **Handling Missing Values:** Imputed missing BMI values using the mean strategy.
- **Encoding Categorical Features:** Used one-hot encoding for variables like gender and residence type.
- **Feature Scaling:** Standardized continuous variables to ensure uniformity across features.
- **Balancing the Dataset:** Addressed class imbalance using SMOTE (Synthetic Minority Oversampling Technique).
### 3. Model Development
Several machine learning models were developed and evaluated:
1. **Linear Regression**
2. **Logistic Regression**
3. **Random Forest Classifier** + hyperparameter tuning
4. **Decision Tree Classifier**
5. **Gradient Boosting Classifier** + hyperparameter tuning
6. **K-Nearest Neighbors (KNN)**
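A minimal sketch of the train-and-compare loop behind this model list, using synthetic data in place of the preprocessed, SMOTE-balanced feature matrix. Linear regression is left out here because scoring it as a classifier requires a manual probability threshold; everything else mirrors the comparison described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the balanced training data
X, y = make_classification(n_samples=1000, n_features=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "roc_auc": roc_auc_score(y_test, proba),
        "confusion_matrix": confusion_matrix(y_test, pred),
    }
```

Collecting all three metrics in one dictionary makes it straightforward to render a comparison table like the one in the Results section.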
---
## Results
The table below summarizes the performance of the models:
| Model | Accuracy | ROC AUC | Confusion Matrix |
|-----------------------|----------|----------|-------------------------|
| Linear Regression | 0.775632 | 0.837220 | [[478, 195], [107, 566]] |
| Logistic Regression | 0.776374 | 0.837297 | [[492, 181], [120, 553]] |
| Random Forest | 0.942793 | 0.990407 | [[629, 44], [33, 640]] |
| Decision Tree | 0.892273 | 0.892273 | [[588, 85], [60, 613]] |
| Gradient Boosting | 0.957652 | 0.988327 | [[651, 22], [35, 638]] |
| KNN | 0.880386 | 0.946411 | [[531, 142], [19, 654]] |
### Key Insights
- **Gradient Boosting** achieved the highest accuracy (95.76%) and maintained an excellent ROC AUC score (0.988).
- **Random Forest** provided comparable ROC AUC performance (0.990), with slightly lower accuracy (94.28%).
- Simpler models like **Logistic Regression** and **Linear Regression** performed adequately but were less effective in comparison.
- **KNN** and **Decision Tree** showed strong results but were outperformed by Gradient Boosting and Random Forest.
### Evaluation Metrics
1. **Accuracy:** Gradient Boosting demonstrated the best predictive accuracy, correctly identifying 95.76% of instances.
2. **ROC AUC:** Random Forest achieved the highest ROC AUC score, indicating excellent separability between stroke and non-stroke cases.
3. **Confusion Matrix:** Gradient Boosting had minimal false negatives and false positives, demonstrating robust performance.
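The accuracy figures above follow directly from the confusion matrices; for example, the Gradient Boosting accuracy in the table can be recomputed from its matrix:

```python
import numpy as np

# Gradient Boosting confusion matrix from the results table:
# rows = actual [no stroke, stroke], columns = predicted [no stroke, stroke]
cm = np.array([[651, 22],
               [35, 638]])

tn, fp = cm[0]   # true negatives, false positives
fn, tp = cm[1]   # false negatives, true positives

# Accuracy = correct predictions / all predictions
accuracy = (tn + tp) / cm.sum()   # ≈ 0.957652, matching the table
```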
---
## Deployment
The best-performing model (Gradient Boosting) was deployed on Hugging Face. It can be accessed publicly via the following link: [ML stroke prediction](https://huggingface.co/emlacodeuse/ml-stroke-prediction).
Deployment highlights:
- **Platform:** Hugging Face
- **Usage:** Users can input patient data to predict stroke risk interactively.
- **Versioning:** Deployment includes metadata for reproducibility.
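A hedged sketch of how a scikit-learn model is typically serialized for the Hub and restored by a consumer; the filename is illustrative, and in practice the file would be fetched with `huggingface_hub.hf_hub_download` before loading.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Train a stand-in model and serialize it (filename is hypothetical)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)
joblib.dump(model, "stroke_model.joblib")

# A consumer restores the artifact and predicts stroke risk on new data,
# e.g. after downloading it from the Hub
restored = joblib.load("stroke_model.joblib")
risk = restored.predict_proba(X[:1])[:, 1]  # probability of stroke
```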
---
## Data Quality Analysis
### Overview
- The dataset included various features with distinct scales and types. It required preprocessing to ensure readiness for training.
- Missing values in BMI were a key challenge, addressed using mean imputation.
### Challenges Faced
1. **Class Imbalance:** Stroke cases were significantly underrepresented in the dataset, potentially biasing model predictions.
2. **Feature Correlation:** Certain features (e.g., age and hypertension) required analysis to ensure they contributed meaningfully to the prediction task.
### Measures Taken to Ensure Data Integrity
- Visualized data distribution and checked for anomalies before preprocessing.
- Validated preprocessing steps using unit tests.
- Conducted exploratory data analysis (EDA) to confirm correlations and feature relevance.
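A hypothetical example of the kind of unit test used to validate preprocessing, here checking that mean imputation fills missing BMI values as expected (the test name and data are illustrative, not the project's actual test suite):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

def test_bmi_imputation_fills_missing_with_mean():
    # Two observed BMI values (mean 25.0) and one missing value
    df = pd.DataFrame({"bmi": [20.0, 30.0, np.nan]})
    imputed = SimpleImputer(strategy="mean").fit_transform(df[["bmi"]])
    assert not np.isnan(imputed).any()        # no missing values remain
    assert imputed[2, 0] == 25.0              # NaN replaced by the mean

test_bmi_imputation_fills_missing_with_mean()
```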
---
## SE Best Practices
### Key Practices Implemented
- **Modularization:** Encapsulated preprocessing, training, and evaluation in separate functions for reusability.
- **Version Control:** Maintained a Git repository with detailed commit messages.
- **Hyperparameter Tuning:** Experimented with different learning rates, tree depths, and ensemble configurations.
- **Documentation:** Created comprehensive project documentation, including a README file.
- **Testing:** Added unit tests for data preprocessing and model evaluation steps.
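The hyperparameter tuning mentioned above can be sketched with `GridSearchCV` over a Gradient Boosting classifier; the grid and the synthetic data are illustrative, and the project's actual search ranges may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed training data
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Illustrative grid over learning rate, tree depth, and ensemble size
param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
    "n_estimators": [50, 100],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)

best_model = search.best_estimator_  # refit on all data with the best params
```

Scoring the search on ROC AUC rather than accuracy keeps the tuning objective aligned with the evaluation metric used to compare models.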
---
## Key Outcomes
### Results
- Gradient Boosting emerged as the best-performing model for stroke prediction.
- The project showcased the impact of robust preprocessing and careful model selection on achieving high accuracy and ROC AUC scores.
### Personal Insights
- This was an extremely interesting project for me, as I had never done anything like it before, and I learned a lot.
---