Space: DeepActionPotential/RespiraAI

DeepActionPotential committed on Jun 10, 2025

Commit 741c10e (verified) · 1 parent: fdf0c32

Upload folder using huggingface_hub

Files changed (15)

.gitattributes +1 -0
LICENCE +21 -0
README.md +128 -19
__pycache__/ui.cpython-311.pyc +0 -0
__pycache__/utils.cpython-311.pyc +0 -0
app.py +33 -0
assets/1.jpeg +0 -0
assets/lung_cancer.mp4 +3 -0
lung-cancer-prediction-high-recall.ipynb +0 -0
models/model.pkl +3 -0
requirements.txt +5 -3
run.py +3 -0
styles.css +146 -0
ui.py +33 -0
utils.py +16 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+assets/lung_cancer.mp4 filter=lfs diff=lfs merge=lfs -text

LICENCE ADDED

	@@ -0,0 +1,21 @@

+MIT License
+Copyright (c) 2025 Eslam Tarek
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md CHANGED

@@ -1,19 +1,128 @@
----
-title: RespiraAI
-emoji: 🚀
-colorFrom: red
-colorTo: red
-sdk: docker
-app_port: 8501
-tags:
-- streamlit
-pinned: false
-short_description: A ML model to predict the existence of pulmonary cancer.
----
-# Welcome to Streamlit!
-Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :heart:
-If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community
-forums](https://discuss.streamlit.io).

+# Lung Cancer Prediction with High Recall
+## About the Project
+This project aims to develop a machine learning model that predicts the likelihood of lung cancer based on patient survey data. Early detection of lung cancer is crucial for improving survival rates, as it is often diagnosed at advanced stages. By leveraging simple survey responses, this tool can assist clinicians and healthcare professionals in identifying high-risk individuals for further screening and intervention. The focus of this project is on maximizing recall (sensitivity), ensuring that as many true cancer cases as possible are identified, even if it means accepting a higher rate of false positives. This approach is particularly important in medical diagnostics, where missing a positive case can have severe consequences.
+The project includes a complete workflow: data preprocessing, exploratory data analysis, feature engineering, model selection, training, evaluation, and model persistence. The final model is designed to be interpretable and easily deployable in real-world healthcare settings.
+---
+## About the Dataset
+The dataset used in this project is sourced from Kaggle: [Lung Cancer Dataset](https://www.kaggle.com/datasets/mysarahmadbhat/lung-cancer). It contains survey responses from 309 individuals, each described by 16 columns (15 features plus the target):
+- **Demographics:** Age, Gender
+- **Lifestyle:** Smoking status, Alcohol consumption
+- **Symptoms:** Fatigue, Coughing, Shortness of breath, Wheezing, Swallowing difficulty, Chest pain, etc.
+- **Target Variable:** LUNG_CANCER (YES/NO)
+The dataset is relatively small and exhibits class imbalance, with more positive cases than negative ones. Apart from age, which is numeric, all features are categorical or binary, making them suitable for various classification algorithms after appropriate encoding.
+---
+## Notebook Summary
+The accompanying Jupyter notebook provides a step-by-step walkthrough of the entire machine learning pipeline:
+1. **Problem Definition:** Outlines the medical and machine learning objectives, emphasizing the importance of recall.
+2. **Exploratory Data Analysis (EDA):** Visualizes feature distributions, examines class imbalance, and investigates relationships between features and the target.
+3. **Feature Engineering:** Handles missing values, encodes categorical variables, and removes highly collinear features using correlation matrices and Variance Inflation Factor (VIF) analysis.
+4. **Model Selection:** Compares several algorithms (Logistic Regression, Random Forest, XGBoost, SVM) with a focus on recall. Hyperparameters are tuned, and class imbalance is addressed using class weighting and stratified splits.
+5. **Model Evaluation:** Reports metrics such as recall, precision, F1-score, and ROC-AUC. Confusion matrices and classification reports are visualized for each model.
+6. **Model Persistence:** The best-performing model (SVM with high recall) is saved using `joblib` for future deployment.
+---
+## Model Results
+### Preprocessing
+- **Duplicate Removal:** All duplicate rows are dropped to ensure data integrity.
+- **Missing Values:** The dataset contains no missing values, simplifying preprocessing.
+- **Encoding:** Categorical features are encoded numerically. Binary responses (YES/NO) are mapped to 1/0, and gender is mapped to 0 (Male) and 1 (Female).
+- **Collinearity:** Feature correlation and VIF analysis are performed. The 'AGE' feature is removed due to high multicollinearity.
+- **Class Imbalance:** Stratified train-test splits and class weighting are used to address the imbalance in the target variable.
+### Training
+- **Algorithms Tested:** Logistic Regression, Random Forest, XGBoost, and Support Vector Machine (SVM).
+- **Cross-Validation:** Stratified K-Fold cross-validation is used to ensure robust evaluation.
+- **Hyperparameter Tuning:** Randomized search and Optuna are available for hyperparameter optimization (though not fully detailed in the notebook).
+- **Pipeline:** For SVM, a pipeline with feature scaling (`StandardScaler`) is used to improve performance.
+### Evaluation
+- **Metrics:** Emphasis on recall, but also reports accuracy, precision, F1-score, and ROC-AUC.
+- **Results:** SVM achieved the highest recall, making it the preferred model for this application.
+- **Visualization:** Confusion matrices and classification reports are plotted for each model to facilitate comparison.
+### Model Persistence
+- The final SVM model is saved as `model.pkl` using `joblib`, enabling easy reuse and deployment.
+---
+## How to Install
+Follow these steps to set up the project in a virtual environment:
+1. **Clone the Repository**
+   ```bash
+   git clone https://github.com/DeepActionPotential/RepiraAI RespiraAI
+   cd RespiraAI
+   ```
+2. **Create a Virtual Environment**
+   ```bash
+   python -m venv venv
+   ```
+3. **Activate the Virtual Environment**
+   - On Windows:
+     ```bash
+     venv\Scripts\activate
+     ```
+   - On macOS/Linux:
+     ```bash
+     source venv/bin/activate
+     ```
+4. **Install Dependencies**
+   ```bash
+   pip install -r requirements.txt
+   ```
+---
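+## How to Use the Software
+Start the app from the project root with `streamlit run app.py` (or `python run.py`, which runs the same command). Enter the patient's answers in the sidebar; the page then shows the predicted label and the estimated probability of cancer.
+- [Demo video](assets/lung_cancer.mp4)
+- ![Demo image](assets/1.jpeg)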
+---
+## Technologies Used
+- **Pandas:** Data manipulation and analysis, including cleaning, encoding, and feature engineering.
+- **NumPy:** Efficient numerical computations and array operations.
+- **Matplotlib & Seaborn:** Data visualization for EDA, feature distributions, and evaluation metrics.
+- **Scikit-learn:** Machine learning library used for model training, evaluation, cross-validation, and pipelines.
+- **XGBoost:** Advanced gradient boosting algorithm for classification.
+- **Optuna:** Hyperparameter optimization framework (optional, for advanced tuning).
+- **Joblib:** Model serialization and persistence.
+- **Streamlit:** Powers the interactive web app (`app.py`) that serves the prediction model.
+- **Jupyter Notebook / VSCode:** Interactive development and documentation environment.
+Each technology is chosen for its robustness, ease of use, and suitability for rapid prototyping and deployment in machine learning workflows.
+---
+## License
+This project is licensed under the MIT License. You are free to use, modify, and distribute this software for personal or commercial purposes, provided that the copyright notice and permission notice are included in all copies or substantial portions of the software.
+---
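
The README's emphasis on recall over precision can be illustrated with a small, dependency-free sketch. The labels and scores below are made-up illustration values, not results from the notebook, and `recall_precision` is a hypothetical helper that does not appear in this commit. It shows how lowering the decision threshold trades extra false positives for fewer missed cases.

```python
def recall_precision(y_true, scores, threshold):
    """Return (recall, precision) when scores >= threshold are flagged positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Made-up labels (1 = cancer) and model scores, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.45, 0.35, 0.6, 0.4, 0.32, 0.1]

for threshold in (0.5, 0.3):
    recall, precision = recall_precision(y_true, scores, threshold)
    print(f"threshold={threshold}: recall={recall:.2f}, precision={precision:.2f}")
```

With these toy values, the lower threshold catches every positive case at the cost of more false alarms, which is the trade-off the README accepts.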

__pycache__/ui.cpython-311.pyc ADDED

Binary file (2.64 kB).

__pycache__/utils.cpython-311.pyc ADDED

Binary file (867 Bytes).

app.py ADDED

	@@ -0,0 +1,33 @@

+import streamlit as st
+import pandas as pd
+from utils import load_model, predict
+from ui import get_user_input
+def local_css(file_name):
+    """Inject a local CSS file into the Streamlit page."""
+    with open(file_name, encoding="utf-8") as f:
+        st.markdown(f"<style>{f.read()}</style>", unsafe_allow_html=True)
+local_css("styles.css")
+st.title("Lung Cancer Risk Predictor")
+# 1. Load trained model
+model = load_model('./models/model.pkl')
+# 2. Collect user input as dataframe
+input_df = get_user_input()
+st.subheader("Patient Data")
+st.write(input_df)
+# 3. Make prediction
+prediction, proba = predict(model, input_df)
+# 4. Display results
+st.subheader("Prediction")
+label = 'Cancer' if prediction[0] == 1 else 'No Cancer'
+st.write(f"**{label}**")
+st.subheader("Prediction Probability")
+st.write(f"Probability of Cancer: {proba[0][1]:.2f}")
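
The display step at the end of `app.py` could be factored into a pure function so it can be checked without starting Streamlit. The sketch below is one possible shape; `describe_prediction` is a hypothetical helper that is not part of this commit, and it assumes the probabilities are ordered `[P(no cancer), P(cancer)]`, as the `proba[0][1]` lookup in `app.py` implies.

```python
def describe_prediction(label, class_probs):
    """Turn a predicted label and its class probabilities into display strings.

    Assumes class_probs is ordered [P(no cancer), P(cancer)].
    """
    text = "Cancer" if int(label) == 1 else "No Cancer"
    return text, f"Probability of Cancer: {float(class_probs[1]):.2f}"


print(describe_prediction(1, [0.2, 0.8]))
print(describe_prediction(0, [0.9, 0.1]))
```

In `app.py` this would be called as `describe_prediction(prediction[0], proba[0])`, with the two returned strings passed to `st.write`.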

assets/1.jpeg ADDED

assets/lung_cancer.mp4 ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:fe6ae2c1b5f5881aa8027f7439e059335f35f0637794e9ecb7f0f59fbc6cb286
+size 2077293

lung-cancer-prediction-high-recall.ipynb ADDED

The diff for this file is too large to render.

models/model.pkl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:97ab42772a2eb847a63f881205146724183347ff6640d26984988783da9c71d3
+size 15930

requirements.txt CHANGED

@@ -1,3 +1,5 @@
-altair
-pandas
-streamlit

+streamlit>=1.29.0
+pandas>=2.1.0
+scikit-learn>=1.3.0
+joblib>=1.3.0
+numpy>=1.24.0

run.py ADDED

	@@ -0,0 +1,3 @@


+import subprocess
+
+subprocess.run(['streamlit', 'run', 'app.py'])
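
`run.py` relies on a `streamlit` executable being on `PATH` and ignores the exit status. A slightly more defensive variant, sketched below under the assumption that Streamlit is installed in the active interpreter, launches it as a module and raises on failure. `build_streamlit_command` and `launch` are hypothetical names, not part of this commit.

```python
import subprocess
import sys


def build_streamlit_command(script="app.py"):
    """Build the argv list for launching a Streamlit script.

    Running `python -m streamlit` uses the Streamlit installed in the
    current interpreter instead of whichever executable is first on PATH.
    """
    return [sys.executable, "-m", "streamlit", "run", script]


def launch(script="app.py"):
    """Start the app; check=True raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_streamlit_command(script), check=True)


# Show the arguments without starting the server.
print(build_streamlit_command()[1:])
```

Calling `launch()` would replace the single `subprocess.run` line in `run.py`.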

styles.css ADDED

	@@ -0,0 +1,146 @@

+/* Hide Streamlit default UI elements */
+#MainMenu, header, footer {
+    visibility: hidden;
+}
+/* Full-screen center layout */
+.stApp {
+    display: flex;
+    justify-content: center;
+    align-items: center;
+    min-height: 100vh;
+    margin: 10px;
+    padding: 10px;
+}
+/* Global dark theme base */
+body {
+    background-color: #343541; /* ChatGPT dark gray */
+    color: #ececf1;            /* Light neutral for text */
+    font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
+    margin: 10px;
+    padding: 10px;
+}
+/* Container centering */
+.centered-container {
+    display: flex;
+    align-items: center;
+    justify-content: center;
+    height: 100vh;
+    width: 100vw;
+}
+/* ChatGPT-style button */
+.stButton > button {
+    background-color: #444654 !important;
+    color: #ececf1 !important;
+    border: 1px solid #5c5f72 !important;
+    border-radius: 999px !important;
+    padding: 0.5rem 1.25rem !important;
+    font-weight: 500;
+    transition: background-color 0.2s ease, transform 0.1s ease;
+    position: relative;
+}
+.stButton > button:hover {
+    background-color: #565869 !important;
+    transform: scale(1.03);
+}
+/* Sidebar styling */
+[data-testid="stSidebar"] {
+    background-color: #202123;
+    color: #ececf1;
+    border-right: 1px solid #2d2f36;
+    min-width: 140px;
+    max-width: 250px;
+    transition: all 0.3s ease;
+}
+[data-testid="stSidebar"][aria-expanded="false"] {
+    margin-left: -250px;
+}
+[data-testid="stSidebar"] h1,
+[data-testid="stSidebar"] h2,
+[data-testid="stSidebar"] h3 {
+    color: #ececf1;
+}
+/* Markdown and text elements */
+.stMarkdown, .stCaption, .stHeader {
+    color: #ececf1;
+}
+/* Dropdown styling */
+select {
+    background-color: #3e3f4b;
+    color: #ececf1;
+    border: 1px solid #5c5f72;
+    border-radius: 6px;
+    padding: 6px 10px;
+}
+/* Selectbox refinements */
+.stSelectbox {
+    cursor: pointer !important;
+}
+.stSelectbox input {
+    cursor: pointer !important;
+    caret-color: transparent !important;
+}
+.stSelectbox div[data-baseweb="select"] {
+    cursor: pointer !important;
+}
+.stSelectbox [role="option"] {
+    cursor: pointer !important;
+}
+.stSelectbox ::selection {
+    background: transparent !important;
+}
+/* General container */
+.block-container {
+    padding: 15px !important;
+    margin: 15px !important;
+    max-width: 100% !important;
+}
+/* Progress bar */
+.stProgress > div > div > div {
+    background-color: #10a37f !important; /* ChatGPT green */
+}
+.stProgress > div > div {
+    background-color: #3e3f4b !important;
+    height: 10px !important;
+    border-radius: 5px;
+}
+/* Loading or status text */
+.st-emotion-cache-1q7spjk {
+    font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
+    color: #ececf1 !important;
+    font-size: 1.1rem;
+    margin-bottom: 15px;
+}
+/* Optional rotation animation */
+.rotate {
+    display: inline-block;
+    color: #10a37f;
+    animation: rotation 2s infinite linear;
+}
+@keyframes rotation {
+    from { transform: rotate(0deg); }
+    to { transform: rotate(359deg); }
+}
+/* Centered button containers */
+.centered-button-container,
+.button-container {
+    display: flex;
+    justify-content: center;
+    align-items: center;
+    text-align: center;
+}

ui.py ADDED

	@@ -0,0 +1,33 @@

+import streamlit as st
+import pandas as pd
+def get_user_input() -> pd.DataFrame:
+    """
+    Render Streamlit widgets for each predictor and
+    return a 1-row DataFrame ready for prediction.
+    """
+    st.sidebar.header('Patient Information')
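+    # Keys are the model's feature names; 'FATIGUE ' and 'ALLERGY ' keep their
+    # trailing spaces, presumably to match the training data's column names.
+    inputs = {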
+        'GENDER': st.sidebar.selectbox('Gender', ['Male', 'Female']),
+        'SMOKING': st.sidebar.selectbox('Smoking', ['YES', 'NO']),
+        'YELLOW_FINGERS': st.sidebar.selectbox('Yellow Fingers', ['YES', 'NO']),
+        'ANXIETY': st.sidebar.selectbox('Anxiety', ['YES', 'NO']),
+        'PEER_PRESSURE': st.sidebar.selectbox('Peer Pressure', ['YES', 'NO']),
+        'CHRONIC DISEASE': st.sidebar.selectbox('Chronic Disease', ['YES', 'NO']),
+        'FATIGUE ': st.sidebar.selectbox('Fatigue', ['YES', 'NO']),
+        'ALLERGY ': st.sidebar.selectbox('Allergy', ['YES', 'NO']),
+        'WHEEZING': st.sidebar.selectbox('Wheezing', ['YES', 'NO']),
+        'ALCOHOL CONSUMING': st.sidebar.selectbox('Alcohol Consuming', ['YES', 'NO']),
+        'COUGHING': st.sidebar.selectbox('Coughing', ['YES', 'NO']),
+        'SHORTNESS OF BREATH': st.sidebar.selectbox('Shortness of Breath', ['YES', 'NO']),
+        'SWALLOWING DIFFICULTY': st.sidebar.selectbox('Swallowing Difficulty', ['YES', 'NO']),
+        'CHEST PAIN': st.sidebar.selectbox('Chest Pain', ['YES', 'NO'])
+    }
+    # Convert to DataFrame
+    df = pd.DataFrame(inputs, index=[0])
+    mapping = {'YES': 1, 'NO': 0, 'Male': 0, 'Female': 1}
+    for col in df.columns:
+        df[col] = df[col].map(mapping)
+    return df
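
`get_user_input` encodes answers with `Series.map`, which yields `NaN` for any value missing from the mapping dictionary, so an unexpected option would reach the model silently. The sketch below shows one way to fail loudly instead, using plain dictionaries. `encode_answers` is a hypothetical helper, not part of this commit; the mapping itself is copied from `ui.py`.

```python
MAPPING = {"YES": 1, "NO": 0, "Male": 0, "Female": 1}


def encode_answers(answers):
    """Encode raw widget answers with MAPPING, raising on unknown values."""
    unknown = {key: value for key, value in answers.items() if value not in MAPPING}
    if unknown:
        raise ValueError(f"Unmapped answers: {unknown}")
    return {key: MAPPING[value] for key, value in answers.items()}


print(encode_answers({"GENDER": "Female", "SMOKING": "NO"}))
```

The encoded dictionary can be passed to `pd.DataFrame(encoded, index=[0])` in place of the per-column `map` loop.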

utils.py ADDED

	@@ -0,0 +1,16 @@

+import joblib
+import pandas as pd
+def load_model(path: str):
+    """Load a saved pipeline model from disk."""
+    return joblib.load(path)
+def predict(model, input_df: pd.DataFrame):
+    """
+    Predict class labels and probabilities.
+    Returns: (labels, probabilities)
+    """
+    labels = model.predict(input_df)
+    probs = model.predict_proba(input_df)
+    return labels, probs
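
`predict` calls `predict_proba` unconditionally. A scikit-learn `SVC` exposes that method only when it is constructed with `probability=True`, and this commit does not show how the saved model was trained. The sketch below guards the call; `predict_with_optional_proba` and the two stand-in model classes are hypothetical and exist only to make the example runnable without a trained model.

```python
def predict_with_optional_proba(model, rows):
    """Return (labels, probabilities).

    probabilities is None when the model does not expose predict_proba.
    """
    labels = model.predict(rows)
    if hasattr(model, "predict_proba"):
        return labels, model.predict_proba(rows)
    return labels, None


class LabelsOnlyModel:
    """Hypothetical stand-in for an estimator without probability support."""

    def predict(self, rows):
        return [1 for _ in rows]


class ProbaModel(LabelsOnlyModel):
    """Hypothetical stand-in for an estimator that also returns probabilities."""

    def predict_proba(self, rows):
        return [[0.25, 0.75] for _ in rows]


print(predict_with_optional_proba(LabelsOnlyModel(), [[0, 1]]))
print(predict_with_optional_proba(ProbaModel(), [[0, 1]]))
```

The caller in `app.py` would then need to skip the probability display when the second element is `None`.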