Space: DeepActionPotential/StrokeLineAI

Commit da0d126 (verified) · 1 parent: fb88264
DeepActionPotential committed on Jun 9, 2025

Upload folder using huggingface_hub

Files changed (14)

.gitattributes +1 -0
LICENCE +21 -0
README.md +142 -19
app.py +34 -0
data/healthcare-dataset-stroke-data.csv +0 -0
demo/strokeline_demo.jpeg +0 -0
demo/strokeline_demo.mp4 +3 -0
models/model.pkl +3 -0
requirements.txt +7 -3
run.py +3 -0
stroke-prediction-using-smote-90-f1-score.ipynb +0 -0
styles.css +146 -0
ui.py +34 -0
utils.py +26 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+demo/strokeline_demo.mp4 filter=lfs diff=lfs merge=lfs -text

LICENCE ADDED

@@ -0,0 +1,21 @@

+MIT License
+
+Copyright (c) 2025 Eslam Tarek
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md CHANGED

@@ -1,19 +1,142 @@
----
-title: StrokeLineAI
-emoji: 🚀
-colorFrom: red
-colorTo: red
-sdk: docker
-app_port: 8501
-tags:
-- streamlit
-pinned: false
-short_description: Predicting a stroke based on medical features
----
-# Welcome to Streamlit!
-Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :heart:
-If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community
-forums](https://discuss.streamlit.io).

+# Stroke Prediction Using Machine Learning
+## About the Project
+This project provides a comprehensive machine learning pipeline for predicting the risk of stroke in individuals based on clinical and demographic features. The goal is to enable early identification of high-risk patients, supporting healthcare professionals in making informed decisions and potentially reducing stroke-related morbidity and mortality. The project covers the full data science workflow: data exploration, preprocessing, feature engineering, model selection, hyperparameter optimization, evaluation, explainability, and deployment. The final solution includes a trained model and a Streamlit web application for real-time inference.
+---
+## About the Dataset
+The dataset used is the [Stroke Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset) from Kaggle. It contains 5110 records with 12 columns: an identifier, 10 predictive features, and a binary target variable (`stroke`). The columns are:
+- **id**: Unique identifier (not used for modeling)
+- **gender**: Patient gender (`Male`, `Female`, `Other`)
+- **age**: Age in years
+- **hypertension**: Hypertension status (`0`: No, `1`: Yes)
+- **heart_disease**: Heart disease status (`0`: No, `1`: Yes)
+- **ever_married**: Marital status (`Yes`, `No`)
+- **work_type**: Type of work (`children`, `Govt_job`, `Never_worked`, `Private`, `Self-employed`)
+- **Residence_type**: Living area (`Urban`, `Rural`)
+- **avg_glucose_level**: Average glucose level
+- **bmi**: Body mass index (may contain missing values)
+- **smoking_status**: Smoking behavior (`formerly smoked`, `never smoked`, `smokes`, `Unknown`)
+- **stroke**: Target variable (`1`: Stroke occurred, `0`: No stroke)
+The dataset is imbalanced, with far fewer positive stroke cases than negatives, and contains missing values in the `bmi` column.
+---
+## Notebook Summary
+The notebook documents the entire process:
+1. **Problem Definition**: Outlines the clinical motivation, dataset, and challenges.
+2. **EDA**: Visualizes distributions, checks for missing values, and explores feature-target relationships.
+3. **Feature Engineering**: Handles missing data, encodes categorical variables, and examines feature correlations.
+4. **Data Balancing**: Uses RandomUnderSampler and SMOTE to address class imbalance.
+5. **Model Selection**: Compares Random Forest, SVM, and XGBoost classifiers.
+6. **Hyperparameter Tuning**: Uses Optuna for automated optimization of XGBoost.
+7. **Evaluation**: Reports F1 score, confusion matrix, and classification report.
+8. **Explainability**: Applies SHAP for model interpretation.
+9. **Model Export**: Saves the trained model for deployment.
+---
+## Model Results
+### Preprocessing
+- **Missing Values**: Imputed missing `bmi` values with the mean.
+- **Categorical Encoding**: Used `OrdinalEncoder` to convert categorical features to numeric.
+- **Feature Selection**: Dropped the `id` column and checked for highly correlated features.
+### Data Balancing
+- **RandomUnderSampler**: Reduced the majority class to 10% of its original size.
+- **SMOTE**: Oversampled the minority class to achieve a 1:1 ratio.
+### Training
+- **Train-Test Split**: Stratified split to preserve class distribution.
+- **Model Comparison**: Evaluated Random Forest, SVM, and XGBoost on balanced data.
+- **Best Model**: XGBoost achieved the highest F1 score.
+### Hyperparameter Tuning
+- **Optuna**: Ran 50 trials to optimize XGBoost hyperparameters (e.g., `n_estimators`, `max_depth`, `learning_rate`, `gamma`) using 5-fold cross-validation with F1 score as the metric.
+### Evaluation
+- **F1 Score**: Achieved ~90% F1 score on the balanced test set.
+- **Confusion Matrix**: Demonstrated balanced sensitivity and specificity.
+- **Classification Report**: Provided detailed precision, recall, and F1 for each class.
+- **Explainability**: SHAP analysis identified the most influential features and provided local/global interpretability.
+---
+## How to Install
+Follow these steps to set up the project using a virtual environment:
+```bash
+# Clone or download the repository
+git clone https://github.com/DeepActionPotential/StrokeLineAI
+cd StrokeLineAI
+# Create a virtual environment
+python -m venv venv
+# Activate the virtual environment
+# On Windows:
+venv\Scripts\activate
+# On macOS/Linux:
+source venv/bin/activate
+# Upgrade pip
+pip install --upgrade pip
+# Install dependencies
+pip install -r requirements.txt
+```
+---
+## How to Use the Software
+1. **Run the Web Application**
+   Start the Streamlit app:
+   ```bash
+   streamlit run app.py
+   ```
+2. **Demo**
+   [demo-video](demo/strokeline_demo.mp4)
+   ![demo-screenshot](demo/strokeline_demo.jpeg)
+---
+## Technologies Used
+### Data Science & Model Training
+- **matplotlib, seaborn**: Data visualization.
+- **scikit-learn**: Preprocessing, model selection, metrics, and pipelines.
+- **imbalanced-learn**: Advanced resampling (SMOTE, RandomUnderSampler) for class balancing.
+- **XGBoost**: High-performance gradient boosting for classification.
+- **Optuna**: Automated hyperparameter optimization.
+- **SHAP**: Model explainability and feature importance analysis.
+### Deployment
+- **Streamlit**: Rapid web app development for interactive model inference.
+- **joblib**: Model serialization for deployment.
+---
+## License
+This project is licensed under the MIT License.
+See the [LICENCE](LICENCE) file for details.
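
Review note on the README's "Data Balancing" section: the sketch below illustrates the interpolation idea behind SMOTE in plain Python. It is a simplified illustration, not the imbalanced-learn implementation and not the notebook's code. The helper name `smote_like_oversample`, the toy points, and `k=2` are assumptions made only for this example.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (hypothetical helper)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to the base point.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy data: 3 minority points versus 6 majority points (made-up values).
minority = [(1.0, 1.0), (1.5, 1.2), (0.8, 1.4)]
majority_count = 6
new_points = smote_like_oversample(minority, majority_count - len(minority))
balanced_minority = minority + new_points
print(len(balanced_minority), majority_count)
```

Because every synthetic point lies on a segment between two real minority points, oversampling of this kind is normally applied to the training split only, so that the test split contains no synthetic rows.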

app.py ADDED

	@@ -0,0 +1,34 @@

+import streamlit as st
+import joblib
+from utils import preprocess_input, predict_stroke
+from ui import input_form, display_result
+@st.cache_resource
+def load_model(path: str = "./models/model.pkl"):
+    """Load the trained classifier from disk."""
+    return joblib.load(path)
+def local_css(file_name):
+    with open(file_name) as f:
+        st.markdown(f"<style>{f.read()}</style>", unsafe_allow_html=True)
+local_css("styles.css")
+def main():
+    st.title("Stroke Prediction Demo")
+    st.write("Enter patient metrics to predict stroke risk.")
+    # Get raw numeric inputs
+    data = input_form()
+    # Preprocess and predict
+    model = load_model()
+    X = preprocess_input(data)
+    label, proba = predict_stroke(model, X)
+    # Show result
+    display_result(label, proba)
+if __name__ == "__main__":
+    main()

data/healthcare-dataset-stroke-data.csv ADDED

The diff for this file is too large to render.

demo/strokeline_demo.jpeg ADDED

demo/strokeline_demo.mp4 ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b35c7ccc4990cd87a60cf139ba0c628d36e91bc54dfefc7523e6a1f5b4ebafe3
+size 2894759

models/model.pkl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ca2bc023cf8a92424c7cb37655ec4bcbd60e69dc7f6d74fb1c0937eb14a597cd
+size 676565

requirements.txt CHANGED

@@ -1,3 +1,7 @@
-altair
-pandas
-streamlit

+streamlit>=1.20.0
+scikit-learn>=1.2.0
+pandas>=1.5.0
+numpy>=1.22.0
+xgboost>=2.0.0
+joblib>=1.2.0
+xgboost>=2.1.0
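
Review note: `xgboost` appears twice in this hunk, with two different lower bounds (`>=2.0.0` and `>=2.1.0`). A small check like the one below can catch repeated entries before they are committed. It is a sketch using only the standard library; the function name `find_duplicate_requirements` is hypothetical, and the parsing is deliberately simple (it does not handle URLs or `-r` includes).

```python
import re
from collections import defaultdict


def find_duplicate_requirements(text):
    """Return {package: [requirement lines]} for packages listed more than once."""
    seen = defaultdict(list)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        # The package name is everything before the first operator, extras bracket, or marker.
        name = re.split(r"[<>=!~\[;\s]", line, maxsplit=1)[0].lower()
        seen[name].append(line)
    return {name: lines for name, lines in seen.items() if len(lines) > 1}


requirements = """\
streamlit>=1.20.0
scikit-learn>=1.2.0
pandas>=1.5.0
numpy>=1.22.0
xgboost>=2.0.0
joblib>=1.2.0
xgboost>=2.1.0
"""

print(find_duplicate_requirements(requirements))
```

Keeping a single `xgboost` line, presumably the stricter `>=2.1.0`, would state the intended constraint unambiguously.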

run.py ADDED

@@ -0,0 +1,3 @@

+import subprocess
+
+subprocess.run(['streamlit', 'run', 'app.py'])
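
Review note: `run.py` depends on a `streamlit` executable being on `PATH` and ignores the exit code. One possible variant, sketched below, launches Streamlit as a module of the current interpreter and raises if it exits with an error. The helper names `build_command` and `launch` are hypothetical and not part of the commit.

```python
import subprocess
import sys


def build_command(script="app.py"):
    """Build the Streamlit launch command using the current interpreter."""
    return [sys.executable, "-m", "streamlit", "run", script]


def launch(script="app.py"):
    """Start the app; raises CalledProcessError if Streamlit exits with an error."""
    return subprocess.run(build_command(script), check=True)


# Show the arguments that would be passed after the interpreter path.
print(build_command()[1:])
```

Calling `launch()` from a script starts the app in the same virtual environment that runs the script, which avoids picking up a different Streamlit installation.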

stroke-prediction-using-smote-90-f1-score.ipynb ADDED

The diff for this file is too large to render.

styles.css ADDED

	@@ -0,0 +1,146 @@

+/* Hide Streamlit default UI elements */
+#MainMenu, header, footer {
+    visibility: hidden;
+}
+/* Full-screen center layout */
+.stApp {
+    display: flex;
+    justify-content: center;
+    align-items: center;
+    min-height: 100vh;
+    margin: 10px;
+    padding: 10px;
+}
+/* Global dark theme base */
+body {
+    background-color: #343541; /* ChatGPT dark gray */
+    color: #ececf1;            /* Light neutral for text */
+    font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
+    margin: 10px;
+    padding: 10px;
+}
+/* Container centering */
+.centered-container {
+    display: flex;
+    align-items: center;
+    justify-content: center;
+    height: 100vh;
+    width: 100vw;
+}
+/* ChatGPT-style button */
+.stButton > button {
+    background-color: #444654 !important;
+    color: #ececf1 !important;
+    border: 1px solid #5c5f72 !important;
+    border-radius: 999px !important;
+    padding: 0.5rem 1.25rem !important;
+    font-weight: 500;
+    transition: background-color 0.2s ease, transform 0.1s ease;
+    position: relative;
+}
+.stButton > button:hover {
+    background-color: #565869 !important;
+    transform: scale(1.03);
+}
+/* Sidebar styling */
+[data-testid="stSidebar"] {
+    background-color: #202123;
+    color: #ececf1;
+    border-right: 1px solid #2d2f36;
+    min-width: 140px;
+    max-width: 250px;
+    transition: all 0.3s ease;
+}
+[data-testid="stSidebar"][aria-expanded="false"] {
+    margin-left: -250px;
+}
+[data-testid="stSidebar"] h1,
+[data-testid="stSidebar"] h2,
+[data-testid="stSidebar"] h3 {
+    color: #ececf1;
+}
+/* Markdown and text elements */
+.stMarkdown, .stCaption, .stHeader {
+    color: #ececf1;
+}
+/* Dropdown styling */
+select {
+    background-color: #3e3f4b;
+    color: #ececf1;
+    border: 1px solid #5c5f72;
+    border-radius: 6px;
+    padding: 6px 10px;
+}
+/* Selectbox refinements */
+.stSelectbox {
+    cursor: pointer !important;
+}
+.stSelectbox input {
+    cursor: pointer !important;
+    caret-color: transparent !important;
+}
+.stSelectbox div[data-baseweb="select"] {
+    cursor: pointer !important;
+}
+.stSelectbox [role="option"] {
+    cursor: pointer !important;
+}
+.stSelectbox ::selection {
+    background: transparent !important;
+}
+/* General container */
+.block-container {
+    padding: 15px !important;
+    margin: 15px !important;
+    max-width: 100% !important;
+}
+/* Progress bar */
+.stProgress > div > div > div {
+    background-color: #10a37f !important; /* ChatGPT green */
+}
+.stProgress > div > div {
+    background-color: #3e3f4b !important;
+    height: 10px !important;
+    border-radius: 5px;
+}
+/* Loading or status text */
+.st-emotion-cache-1q7spjk {
+    font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
+    color: #ececf1 !important;
+    font-size: 1.1rem;
+    margin-bottom: 15px;
+}
+/* Optional rotation animation */
+.rotate {
+    display: inline-block;
+    color: #10a37f;
+    animation: rotation 2s infinite linear;
+}
+@keyframes rotation {
+    from { transform: rotate(0deg); }
+    to { transform: rotate(359deg); }
+}
+/* Centered button containers */
+.centered-button-container,
+.button-container {
+    display: flex;
+    justify-content: center;
+    align-items: center;
+    text-align: center;
+}

ui.py ADDED

	@@ -0,0 +1,34 @@

+import streamlit as st
+def input_form() -> dict:
+    """Collect numeric-encoded patient features via sidebar widgets."""
+    st.sidebar.header("Patient Information")
+    return {
+        "gender": st.sidebar.selectbox("Gender", [(0.0, "Male"), (1.0, "Female")])[0],
+        "age": st.sidebar.slider("Age", 0.0, 100.0, 50.0),
+        "hypertension": st.sidebar.selectbox("Hypertension", [(0, "No"), (1, "Yes")])[0],
+        "heart_disease": st.sidebar.selectbox("Heart Disease", [(0, "No"), (1, "Yes")])[0],
+        "ever_married": st.sidebar.selectbox("Ever Married", [(0.0, "No"), (1.0, "Yes")])[0],
+        "work_type": st.sidebar.selectbox(
+            "Work Type",
+            [(0.0, "Private"), (1.0, "Self-employed"), (2.0, "Govt_job"), (3.0, "children"), (4.0, "Never_worked")]
+        )[0],
+        "Residence_type": st.sidebar.selectbox(
+            "Residence Type", [(0.0, "Urban"), (1.0, "Rural")]
+        )[0],
+        "avg_glucose_level": st.sidebar.number_input("Avg Glucose Level", 40.0, 300.0, 100.0),
+        "bmi": st.sidebar.number_input("BMI", 10.0, 60.0, 25.0),
+        "smoking_status": st.sidebar.selectbox(
+            "Smoking Status",
+            [(0.0, "formerly smoked"), (1.0, "never smoked"), (2.0, "smokes"), (3.0, "Unknown")]
+        )[0]
+    }
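+def display_result(label: str, proba: float) -> None:
+    """Render the predicted label and the probability of that label."""
+    st.header("Prediction Result")
+    st.markdown(f"**Prediction:** {label}")
+    st.markdown(f"**Confidence:** {proba:.1%}")
+    # NOTE: proba is the top-class probability, so for a two-class model it is
+    # never below 0.5 and this branch does not run.
+    if proba < 0.5:
+        st.info("Model confidence is low; consider additional evaluation.")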
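
Review note: the numeric codes hard-coded in `input_form` need to match whatever the encoder produced during training. The README says `OrdinalEncoder` was used; assuming it was fitted with its default settings on the raw string columns, codes follow the sorted order of the category values. The sketch below mimics that ordering in plain Python (the helper `ordinal_codes` is hypothetical) so the result can be compared with the tuples in `ui.py`.

```python
def ordinal_codes(categories):
    """Mimic OrdinalEncoder's default behaviour: codes follow sorted category order."""
    return {cat: float(i) for i, cat in enumerate(sorted(categories))}


# Category values as listed in the README.
columns = {
    "gender": ["Male", "Female", "Other"],
    "ever_married": ["Yes", "No"],
    "work_type": ["children", "Govt_job", "Never_worked", "Private", "Self-employed"],
    "Residence_type": ["Urban", "Rural"],
    "smoking_status": ["formerly smoked", "never smoked", "smokes", "Unknown"],
}

for name, cats in columns.items():
    print(name, ordinal_codes(cats))
```

Under that assumption, `Male` would map to `1.0` and `Private` to `2.0`, which differs from the values in the form. The fitted encoder's `categories_` attribute in the notebook is the authoritative source for the mapping and is worth checking before relying on the app's predictions.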

utils.py ADDED

	@@ -0,0 +1,26 @@

+import pandas as pd
+def preprocess_input(data: dict) -> pd.DataFrame:
+    """
+    Build a single-row DataFrame matching the training schema:
+      ['gender','age','hypertension','heart_disease','ever_married',
+       'work_type','Residence_type','avg_glucose_level','bmi',
+       'smoking_status']
+    """
+    # Note: 'stroke' column is not included as a feature
+    feature_cols = [
+        "gender","age","hypertension","heart_disease","ever_married",
+        "work_type","Residence_type","avg_glucose_level","bmi",
+        "smoking_status"
+    ]
+    df = pd.DataFrame([{k: data[k] for k in feature_cols}])
+    return df
+def predict_stroke(model, X: pd.DataFrame):
+    """
+    Returns human-readable label and probability for the top class.
+    """
+    proba = model.predict_proba(X)[0]
+    idx = proba.argmax()
+    label_map = {0: "No Stroke", 1: "Stroke"}
+    return label_map[idx], proba[idx]
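
Review note: `predict_stroke` returns the probability of the winning class, which for a two-class model is always at least 0.5, so the `proba < 0.5` check in `ui.py` cannot trigger. An alternative, sketched below, reports the positive-class probability and applies an explicit threshold. `StubModel` stands in for the pickled classifier, and `stroke_risk` and its `threshold` parameter are hypothetical names. The sketch assumes class index 1 means "Stroke", as in `label_map`.

```python
class StubModel:
    """Stand-in for the pickled classifier (hypothetical; returns fixed probabilities)."""

    def __init__(self, p_stroke):
        self.p_stroke = p_stroke

    def predict_proba(self, X):
        # One row per input: [P(no stroke), P(stroke)].
        return [[1.0 - self.p_stroke, self.p_stroke] for _ in X]


def stroke_risk(model, X, threshold=0.5):
    """Return (label, P(stroke)) based on the positive-class probability."""
    p_stroke = float(model.predict_proba(X)[0][1])
    label = "Stroke" if p_stroke >= threshold else "No Stroke"
    return label, p_stroke


print(stroke_risk(StubModel(0.2), [[0]]))
print(stroke_risk(StubModel(0.8), [[0]]))
```

Reporting the stroke probability directly lets the UI show a risk value that spans the full 0 to 1 range, and the threshold can be tuned separately, for example to favour recall on the minority class.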