DeepActionPotential committed (verified)
Commit 247c16d · Parent: da0d126

Update README.md

Files changed (1): README.md (+157, -142)
---
title: StrokeLine - Stroke Prediction Using Machine Learning
emoji: 🤖
colorFrom: indigo
colorTo: blue
sdk: streamlit
sdk_version: 1.30.0
app_file: app.py
pinned: false
license: mit
---

# Stroke Prediction Using Machine Learning

## About the Project

This project provides a comprehensive machine learning pipeline for predicting the risk of stroke from clinical and demographic features. The goal is to enable early identification of high-risk patients, supporting healthcare professionals in making informed decisions and potentially reducing stroke-related morbidity and mortality. The project covers the full data science workflow: data exploration, preprocessing, feature engineering, model selection, hyperparameter optimization, evaluation, explainability, and deployment. The final solution includes a trained model and a Streamlit web application for real-time inference.

---

## About the Dataset

The dataset used is the [Stroke Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset) from Kaggle. It contains 5110 records with 11 attributes and a binary target variable (`stroke`):

- **id**: Unique identifier (not used for modeling)
- **gender**: Patient gender (`Male`, `Female`, `Other`)
- **age**: Age in years
- **hypertension**: Hypertension status (`0`: No, `1`: Yes)
- **heart_disease**: Heart disease status (`0`: No, `1`: Yes)
- **ever_married**: Marital status (`Yes`, `No`)
- **work_type**: Type of work (`children`, `Govt_job`, `Never_worked`, `Private`, `Self-employed`)
- **Residence_type**: Living area (`Urban`, `Rural`)
- **avg_glucose_level**: Average glucose level
- **bmi**: Body mass index (may contain missing values)
- **smoking_status**: Smoking behavior (`formerly smoked`, `never smoked`, `smokes`, `Unknown`)
- **stroke**: Target variable (`1`: Stroke occurred, `0`: No stroke)

The dataset is heavily imbalanced (only about 5% of records are positive stroke cases), and the `bmi` column contains missing values.
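The imbalance and missing-value checks reduce to two pandas one-liners; a minimal sketch on a tiny synthetic frame standing in for the Kaggle CSV (which has the same columns):

```python
import pandas as pd

# Synthetic stand-in with the two columns of interest from the Kaggle CSV.
df = pd.DataFrame({
    "bmi": [22.5, None, 31.0, 27.4, None, 24.8],
    "stroke": [0, 0, 1, 0, 0, 0],
})

# Class balance: stroke cases are a small minority.
counts = df["stroke"].value_counts()
print(counts)

# Missing values are confined to the bmi column.
print(df["bmi"].isna().sum())
```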

---

## Notebook Summary

The notebook documents the entire process:

1. **Problem Definition**: Outlines the clinical motivation, dataset, and challenges.
2. **EDA**: Visualizes distributions, checks for missing values, and explores feature-target relationships.
3. **Feature Engineering**: Handles missing data, encodes categorical variables, and examines feature correlations.
4. **Data Balancing**: Uses RandomUnderSampler and SMOTE to address class imbalance.
5. **Model Selection**: Compares Random Forest, SVM, and XGBoost classifiers.
6. **Hyperparameter Tuning**: Uses Optuna for automated optimization of XGBoost.
7. **Evaluation**: Reports F1 score, confusion matrix, and classification report.
8. **Explainability**: Applies SHAP for model interpretation.
9. **Model Export**: Saves the trained model for deployment.

---

## Model Results

### Preprocessing

- **Missing Values**: Imputed missing `bmi` values with the column mean.
- **Categorical Encoding**: Used `OrdinalEncoder` to convert categorical features to numeric.
- **Feature Selection**: Dropped the `id` column and checked for highly correlated features.
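The three preprocessing steps can be sketched with scikit-learn; the tiny frame and its values below are illustrative, not the project's data:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "gender": ["Male", "Female", "Female", "Male"],
    "ever_married": ["Yes", "No", "Yes", "Yes"],
    "bmi": [28.1, None, 33.2, None],
})

# Drop the identifier column: it carries no predictive signal.
df = df.drop(columns=["id"])

# Impute missing bmi values with the column mean.
df[["bmi"]] = SimpleImputer(strategy="mean").fit_transform(df[["bmi"]])

# Encode categorical features as integer codes.
cat_cols = ["gender", "ever_married"]
df[cat_cols] = OrdinalEncoder().fit_transform(df[cat_cols])

print(df)
```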

### Data Balancing

- **RandomUnderSampler**: Reduced the majority class to 10% of its original size.
- **SMOTE**: Oversampled the minority class to achieve a 1:1 ratio.

### Training

- **Train-Test Split**: Stratified split to preserve the class distribution.
- **Model Comparison**: Evaluated Random Forest, SVM, and XGBoost on the balanced data.
- **Best Model**: XGBoost achieved the highest F1 score.
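A minimal sketch of the stratified split and model comparison, using scikit-learn only (here `GradientBoostingClassifier` stands in for the project's XGBoost model so the sketch has no extra dependency):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=1000, random_state=0)

# stratify=y preserves the class distribution in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

models = {
    "RandomForest": RandomForestClassifier(random_state=0),
    "SVM": SVC(random_state=0),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),  # XGBoost stand-in
}
scores = {name: f1_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
print(scores)
```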

### Hyperparameter Tuning

- **Optuna**: Ran 50 trials to optimize XGBoost hyperparameters (`n_estimators`, `max_depth`, `learning_rate`, `gamma`, and others), using 5-fold cross-validation with the F1 score as the objective.

### Evaluation

- **F1 Score**: Achieved an F1 score of roughly 0.90 on the balanced test set.
- **Confusion Matrix**: Showed balanced sensitivity and specificity.
- **Classification Report**: Detailed precision, recall, and F1 for each class.
- **Explainability**: SHAP analysis identified the most influential features and provided both local and global interpretability.
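The evaluation artifacts are standard scikit-learn calls; a self-contained sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, confusion_matrix, classification_report

X, y = make_classification(n_samples=600, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

model = RandomForestClassifier(random_state=1).fit(X_tr, y_tr)
pred = model.predict(X_te)

print("F1:", f1_score(y_te, pred))
print(confusion_matrix(y_te, pred))       # rows: true class, cols: predicted
print(classification_report(y_te, pred))  # per-class precision/recall/F1
```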

---

## How to Install

Follow these steps to set up the project in a virtual environment:

```bash
# Clone the repository
git clone https://github.com/DeepActionPotential/StrokeLineAI
cd StrokeLineAI

# Create a virtual environment
python -m venv venv

# Activate the virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

# Upgrade pip and install dependencies
pip install --upgrade pip
pip install -r requirements.txt
```

---

## How to Use the Software

1. **Run the Web Application**

   Start the Streamlit app:

   ```bash
   streamlit run app.py
   ```

2. **Demo**

   [demo-video](demo/strokeline_demo.mp4)

   ![demo-screenshot](demo/strokeline_demo.jpeg)

---

## Technologies Used

### Data Science & Model Training

- **matplotlib, seaborn**: Data visualization.
- **scikit-learn**: Preprocessing, model selection, metrics, and pipelines.
- **imbalanced-learn**: Resampling (SMOTE, RandomUnderSampler) for class balancing.
- **XGBoost**: High-performance gradient boosting for classification.
- **Optuna**: Automated hyperparameter optimization.
- **SHAP**: Model explainability and feature-importance analysis.

### Deployment

- **Streamlit**: Rapid web-app development for interactive model inference.
- **joblib**: Model serialization for deployment.
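The joblib serialization round-trip is one call each way; a minimal sketch (the model and file name here are illustrative, not the project's artifacts):

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Save the trained model so the Streamlit app can load it at startup.
path = os.path.join(tempfile.mkdtemp(), "stroke_model.joblib")
joblib.dump(model, path)

# Inside app.py the model is restored once and reused for every prediction.
restored = joblib.load(path)
print((restored.predict(X) == model.predict(X)).all())  # True
```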

---

## License

This project is licensed under the MIT License.
See the [LICENSE](LICENSE) file for details.