DeepActionPotential committed (verified)
Commit 247c16d · Parent: da0d126

Update README.md

Files changed (1): README.md (+157, -142)
---
title: StrokeLine - Stroke Prediction Using Machine Learning
emoji: 🤖
colorFrom: indigo
colorTo: blue
sdk: streamlit
sdk_version: 1.30.0
app_file: app.py
pinned: false
license: mit
---

# Stroke Prediction Using Machine Learning

## About the Project

This project provides a comprehensive machine learning pipeline for predicting the risk of stroke from clinical and demographic features. The goal is to enable early identification of high-risk patients, supporting healthcare professionals in making informed decisions and potentially reducing stroke-related morbidity and mortality. The project covers the full data science workflow: data exploration, preprocessing, feature engineering, model selection, hyperparameter optimization, evaluation, explainability, and deployment. The final solution includes a trained model and a Streamlit web application for real-time inference.

---

## About the Dataset

The dataset used is the [Stroke Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset) from Kaggle. It contains 5110 records with 11 attributes and a binary target variable (`stroke`):

- **id**: Unique identifier (not used for modeling)
- **gender**: Patient gender (`Male`, `Female`, `Other`)
- **age**: Age in years
- **hypertension**: Hypertension status (`0`: No, `1`: Yes)
- **heart_disease**: Heart disease status (`0`: No, `1`: Yes)
- **ever_married**: Marital status (`Yes`, `No`)
- **work_type**: Type of work (`children`, `Govt_job`, `Never_worked`, `Private`, `Self-employed`)
- **Residence_type**: Living area (`Urban`, `Rural`)
- **avg_glucose_level**: Average glucose level
- **bmi**: Body mass index (may contain missing values)
- **smoking_status**: Smoking behavior (`formerly smoked`, `never smoked`, `smokes`, `Unknown`)
- **stroke**: Target variable (`1`: Stroke occurred, `0`: No stroke)

The dataset is heavily imbalanced (only about 5% of records are positive stroke cases), and the `bmi` column contains missing values.
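The imbalance and missing-value checks reduce to two pandas one-liners; a minimal sketch on a tiny synthetic frame standing in for the Kaggle CSV (which has the same columns):

```python
import pandas as pd

# Synthetic stand-in with the two columns of interest from the Kaggle CSV.
df = pd.DataFrame({
    "bmi": [22.5, None, 31.0, 27.4, None, 24.8],
    "stroke": [0, 0, 1, 0, 0, 0],
})

# Class balance: stroke cases are a small minority.
counts = df["stroke"].value_counts()
print(counts)

# Missing values are confined to the bmi column.
print(df["bmi"].isna().sum())
```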

---

## Notebook Summary

The notebook documents the entire process:

1. **Problem Definition**: Outlines the clinical motivation, dataset, and challenges.
2. **EDA**: Visualizes distributions, checks for missing values, and explores feature-target relationships.
3. **Feature Engineering**: Handles missing data, encodes categorical variables, and examines feature correlations.
4. **Data Balancing**: Uses RandomUnderSampler and SMOTE to address class imbalance.
5. **Model Selection**: Compares Random Forest, SVM, and XGBoost classifiers.
6. **Hyperparameter Tuning**: Uses Optuna for automated optimization of XGBoost.
7. **Evaluation**: Reports F1 score, confusion matrix, and classification report.
8. **Explainability**: Applies SHAP for model interpretation.
9. **Model Export**: Saves the trained model for deployment.

---

## Model Results

### Preprocessing

- **Missing Values**: Imputed missing `bmi` values with the column mean.
- **Categorical Encoding**: Used `OrdinalEncoder` to convert categorical features to numeric.
- **Feature Selection**: Dropped the `id` column and checked for highly correlated features.
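The three preprocessing steps can be sketched with scikit-learn; the tiny frame and its values below are illustrative, not the project's data:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "gender": ["Male", "Female", "Female", "Male"],
    "ever_married": ["Yes", "No", "Yes", "Yes"],
    "bmi": [28.1, None, 33.2, None],
})

# Drop the identifier column: it carries no predictive signal.
df = df.drop(columns=["id"])

# Impute missing bmi values with the column mean.
df[["bmi"]] = SimpleImputer(strategy="mean").fit_transform(df[["bmi"]])

# Encode categorical features as integer codes.
cat_cols = ["gender", "ever_married"]
df[cat_cols] = OrdinalEncoder().fit_transform(df[cat_cols])

print(df)
```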

### Data Balancing

- **RandomUnderSampler**: Reduced the majority class to 10% of its original size.
- **SMOTE**: Oversampled the minority class to achieve a 1:1 ratio.

### Training

- **Train-Test Split**: Stratified split to preserve the class distribution.
- **Model Comparison**: Evaluated Random Forest, SVM, and XGBoost on the balanced data.
- **Best Model**: XGBoost achieved the highest F1 score.
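A minimal sketch of the stratified split and model comparison, using scikit-learn only (here `GradientBoostingClassifier` stands in for the project's XGBoost model so the sketch has no extra dependency):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=1000, random_state=0)

# stratify=y preserves the class distribution in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

models = {
    "RandomForest": RandomForestClassifier(random_state=0),
    "SVM": SVC(random_state=0),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),  # XGBoost stand-in
}
scores = {name: f1_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
print(scores)
```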

### Hyperparameter Tuning

- **Optuna**: Ran 50 trials to optimize XGBoost hyperparameters (`n_estimators`, `max_depth`, `learning_rate`, `gamma`, and others), using 5-fold cross-validation with the F1 score as the objective.

### Evaluation

- **F1 Score**: Achieved an F1 score of roughly 0.90 on the balanced test set.
- **Confusion Matrix**: Showed balanced sensitivity and specificity.
- **Classification Report**: Detailed precision, recall, and F1 for each class.
- **Explainability**: SHAP analysis identified the most influential features and provided both local and global interpretability.
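The evaluation artifacts are standard scikit-learn calls; a self-contained sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, confusion_matrix, classification_report

X, y = make_classification(n_samples=600, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

model = RandomForestClassifier(random_state=1).fit(X_tr, y_tr)
pred = model.predict(X_te)

print("F1:", f1_score(y_te, pred))
print(confusion_matrix(y_te, pred))       # rows: true class, cols: predicted
print(classification_report(y_te, pred))  # per-class precision/recall/F1
```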

---

## How to Install

Follow these steps to set up the project in a virtual environment:

```bash
# Clone the repository
git clone https://github.com/DeepActionPotential/StrokeLineAI
cd StrokeLineAI

# Create a virtual environment
python -m venv venv

# Activate the virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

# Upgrade pip and install dependencies
pip install --upgrade pip
pip install -r requirements.txt
```

---

## How to Use the Software

1. **Run the Web Application**

   Start the Streamlit app:

   ```bash
   streamlit run app.py
   ```

2. **Demo**

   [demo-video](demo/strokeline_demo.mp4)

   ![demo-screenshot](demo/strokeline_demo.jpeg)

---

## Technologies Used

### Data Science & Model Training

- **matplotlib, seaborn**: Data visualization.
- **scikit-learn**: Preprocessing, model selection, metrics, and pipelines.
- **imbalanced-learn**: Resampling (SMOTE, RandomUnderSampler) for class balancing.
- **XGBoost**: High-performance gradient boosting for classification.
- **Optuna**: Automated hyperparameter optimization.
- **SHAP**: Model explainability and feature-importance analysis.

### Deployment

- **Streamlit**: Rapid web-app development for interactive model inference.
- **joblib**: Model serialization for deployment.
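The joblib serialization round-trip is one call each way; a minimal sketch (the model and file name here are illustrative, not the project's artifacts):

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Save the trained model so the Streamlit app can load it at startup.
path = os.path.join(tempfile.mkdtemp(), "stroke_model.joblib")
joblib.dump(model, path)

# Inside app.py the model is restored once and reused for every prediction.
restored = joblib.load(path)
print((restored.predict(X) == model.predict(X)).all())  # True
```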

---

## License

This project is licensed under the MIT License.
See the [LICENSE](LICENSE) file for details.