DeepActionPotential committed
Commit 741c10e · verified · 1 Parent(s): fdf0c32

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ assets/lung_cancer.mp4 filter=lfs diff=lfs merge=lfs -text
LICENCE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2025 Eslam Tarek
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md CHANGED
@@ -1,19 +1,128 @@
- ---
- title: RespiraAI
- emoji: 🚀
- colorFrom: red
- colorTo: red
- sdk: docker
- app_port: 8501
- tags:
- - streamlit
- pinned: false
- short_description: A ML model to predict the existence of pulmonary cancer.
- ---
-
- # Welcome to Streamlit!
-
- Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :heart:
-
- If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community
- forums](https://discuss.streamlit.io).
+ # Lung Cancer Prediction with High Recall
+
+ ## About the Project
+
+ This project develops a machine learning model that predicts the likelihood of lung cancer from patient survey data. Early detection is crucial for improving survival rates, because lung cancer is often diagnosed at an advanced stage. Using simple survey responses, the tool can help clinicians and other healthcare professionals identify high-risk individuals for further screening and intervention. The project prioritizes recall (sensitivity): it aims to identify as many true cancer cases as possible, even at the cost of a higher rate of false positives. This trade-off matters in medical diagnostics, where missing a positive case can have severe consequences.
+
+ The project includes a complete workflow: data preprocessing, exploratory data analysis, feature engineering, model selection, training, evaluation, and model persistence. The final model is designed to be interpretable and easily deployable in real-world healthcare settings.
+
+ ---
+
+ ## About the Dataset
+
+ The dataset used in this project is sourced from Kaggle: [Lung Cancer Dataset](https://www.kaggle.com/datasets/mysarahmadbhat/lung-cancer). It contains survey responses from 309 individuals, each described by 16 columns (15 predictors plus the target):
+
+ - **Demographics:** Age, Gender
+ - **Lifestyle:** Smoking status, Alcohol consumption
+ - **Symptoms:** Fatigue, Coughing, Shortness of breath, Wheezing, Swallowing difficulty, Chest pain, etc.
+ - **Target Variable:** LUNG_CANCER (YES/NO)
+
+ The dataset is relatively small and exhibits class imbalance, with far more positive (YES) cases than negative ones. Apart from the numeric `AGE` column, all features are categorical or binary, making them suitable for various classification algorithms after appropriate encoding.
+
+ ---
+
+ ## Notebook Summary
+
+ The accompanying Jupyter notebook provides a step-by-step walkthrough of the entire machine learning pipeline:
+
+ 1. **Problem Definition:** Outlines the medical and machine learning objectives, emphasizing the importance of recall.
+ 2. **Exploratory Data Analysis (EDA):** Visualizes feature distributions, examines class imbalance, and investigates relationships between features and the target.
+ 3. **Feature Engineering:** Checks for missing values, encodes categorical variables, and removes highly collinear features using correlation matrices and Variance Inflation Factor (VIF) analysis.
+ 4. **Model Selection:** Compares several algorithms (Logistic Regression, Random Forest, XGBoost, SVM) with a focus on recall. Hyperparameters are tuned, and class imbalance is addressed using class weighting and stratified splits.
+ 5. **Model Evaluation:** Reports metrics such as recall, precision, F1-score, and ROC-AUC. Confusion matrices and classification reports are visualized for each model.
+ 6. **Model Persistence:** The best-performing model (SVM with high recall) is saved using `joblib` for future deployment.
+
+ ---
+
+ ## Model Results
+
+ ### Preprocessing
+
+ - **Duplicate Removal:** All duplicate rows are dropped to ensure data integrity.
+ - **Missing Values:** The dataset contains no missing values, simplifying preprocessing.
+ - **Encoding:** Categorical features are encoded numerically. Binary responses (YES/NO) are mapped to 1/0, and gender is mapped to 0 (Male) and 1 (Female).
+ - **Collinearity:** Feature correlation and VIF analysis are performed. The 'AGE' feature is removed due to high multicollinearity.
+ - **Class Imbalance:** Stratified train-test splits and class weighting are used to address the imbalance in the target variable.
+
+ ### Training
+
+ - **Algorithms Tested:** Logistic Regression, Random Forest, XGBoost, and Support Vector Machine (SVM).
+ - **Cross-Validation:** Stratified K-Fold cross-validation is used to ensure robust evaluation.
+ - **Hyperparameter Tuning:** Randomized search and Optuna are available for hyperparameter optimization (though not fully detailed in the notebook).
+ - **Pipeline:** For SVM, a pipeline with feature scaling (`StandardScaler`) is used to improve performance.
+
+ ### Evaluation
+
+ - **Metrics:** Recall is the primary metric; accuracy, precision, F1-score, and ROC-AUC are also reported.
+ - **Results:** SVM achieved the highest recall, making it the preferred model for this application.
+ - **Visualization:** Confusion matrices and classification reports are plotted for each model to facilitate comparison.
+
+ ### Model Persistence
+
+ - The final SVM model is saved as `model.pkl` using `joblib`, enabling easy reuse and deployment.
+
+ ---
+
+ ## How to Install
+
+ Follow these steps to set up the project in a virtual environment:
+
+ 1. **Clone the Repository**
+    ```bash
+    git clone https://github.com/DeepActionPotential/RepiraAI RespiraAI
+    cd RespiraAI
+    ```
+
+ 2. **Create a Virtual Environment**
+    ```bash
+    python -m venv venv
+    ```
+
+ 3. **Activate the Virtual Environment**
+    - On Windows:
+      ```bash
+      venv\Scripts\activate
+      ```
+    - On macOS/Linux:
+      ```bash
+      source venv/bin/activate
+      ```
+
+ 4. **Install Dependencies**
+    ```bash
+    pip install -r requirements.txt
+    ```
+
+ ---
+
+ ## How to Use the Software
+
+ Start the Streamlit app with `python run.py` (or `streamlit run app.py`), enter the patient's answers in the sidebar, and read the predicted label and probability on the main page.
+
+ [demo-video](assets/lung_cancer.mp4)
+ ![demo-image](assets/1.jpeg)
+
+ ---
+
+ ## Technologies Used
+
+ - **Pandas:** Data manipulation and analysis, including cleaning, encoding, and feature engineering.
+ - **NumPy:** Efficient numerical computations and array operations.
+ - **Matplotlib & Seaborn:** Data visualization for EDA, feature distributions, and evaluation metrics.
+ - **Scikit-learn:** Machine learning library used for model training, evaluation, cross-validation, and pipelines.
+ - **XGBoost:** Advanced gradient boosting algorithm for classification.
+ - **Optuna:** Hyperparameter optimization framework (optional, for advanced tuning).
+ - **Joblib:** Model serialization and persistence.
+ - **Streamlit:** Interactive web demo of the prediction model (`app.py`).
+ - **Jupyter Notebook / VSCode:** Interactive development and documentation environment.
+
+ Each technology is chosen for its robustness, ease of use, and suitability for rapid prototyping and deployment in machine learning workflows.
+
+ ---
+
+ ## License
+
+ This project is licensed under the MIT License (see the `LICENCE` file). You are free to use, modify, and distribute this software for personal or commercial purposes, provided that the copyright notice and permission notice are included in all copies or substantial portions of the software.
+
+ ---
+
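The recall-first trade-off described in the new README can be illustrated with a few lines of Python. This is a minimal sketch, not code from the notebook: the confusion-matrix counts are made up, and `recall` and `precision` are illustrative helper names.

```python
def recall(tp: int, fn: int) -> float:
    """Share of actual positives that the model flags: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp: int, fp: int) -> float:
    """Share of flagged cases that are actual positives: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical counts for two candidate models on the same 100 patients
# (40 with cancer, 60 without). The numbers are invented for illustration.
cautious = {"tp": 38, "fn": 2, "fp": 15}  # flags more patients
strict = {"tp": 30, "fn": 10, "fp": 3}    # flags fewer patients

for name, m in (("cautious", cautious), ("strict", strict)):
    r = round(recall(m["tp"], m["fn"]), 2)
    p = round(precision(m["tp"], m["fp"]), 2)
    print(name, r, p)
# cautious 0.95 0.72
# strict 0.75 0.91
```

Under these assumed counts, the "cautious" model misses only 2 of 40 cancer cases at the cost of 15 false alarms, which is the kind of trade-off the README says it accepts.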
__pycache__/ui.cpython-311.pyc ADDED
Binary file (2.64 kB). View file
 
__pycache__/utils.cpython-311.pyc ADDED
Binary file (867 Bytes). View file
 
app.py ADDED
@@ -0,0 +1,33 @@
+ import streamlit as st
+ import pandas as pd
+ from utils import load_model, predict
+ from ui import get_user_input
+
+
+ def local_css(file_name):
+     with open(file_name) as f:
+         st.markdown(f"<style>{f.read()}</style>", unsafe_allow_html=True)
+
+ local_css("styles.css")
+
+ st.title("Lung Cancer Risk Predictor")
+
+ # 1. Load trained model
+ model = load_model('./models/model.pkl')
+
+ # 2. Collect user input as dataframe
+ input_df = get_user_input()
+ st.subheader("Patient Data")
+ st.write(input_df)
+
+ # 3. Make prediction
+ prediction, proba = predict(model, input_df)
+
+ # 4. Display results
+ st.subheader("Prediction")
+ print(prediction[0])
+ label = 'Cancer' if prediction[0] == 1 else 'No Cancer'
+ st.write(f"**{label}**")
+
+ st.subheader("Prediction Probability")
+ st.write(f"Probability of Cancer: {proba[0][1]:.2f}")
assets/1.jpeg ADDED
assets/lung_cancer.mp4 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fe6ae2c1b5f5881aa8027f7439e059335f35f0637794e9ecb7f0f59fbc6cb286
+ size 2077293
lung-cancer-prediction-high-recall.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
models/model.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:97ab42772a2eb847a63f881205146724183347ff6640d26984988783da9c71d3
+ size 15930
requirements.txt CHANGED
@@ -1,3 +1,5 @@
- altair
- pandas
- streamlit
+ streamlit>=1.29.0
+ pandas>=2.1.0
+ scikit-learn>=1.3.0
+ joblib>=1.3.0
+ numpy>=1.24.0
run.py ADDED
@@ -0,0 +1,3 @@
+ import subprocess
+
+ subprocess.run(['streamlit', 'run', 'app.py'])
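`run.py` relies on a `streamlit` executable being on `PATH`. One hedged alternative is to launch Streamlit as a module of the current interpreter, so that the active virtual environment is the one used. The sketch below only builds the command; `build_command` is a hypothetical helper name, not part of this commit.

```python
import sys


def build_command(script: str = "app.py") -> list[str]:
    """Return the argv that runs Streamlit through the current interpreter."""
    return [sys.executable, "-m", "streamlit", "run", script]


cmd = build_command()
print(cmd[1:])  # ['-m', 'streamlit', 'run', 'app.py']
```

Passing the result to `subprocess.run(cmd, check=True)` would also surface a non-zero exit code as an exception instead of ignoring it.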
styles.css ADDED
@@ -0,0 +1,146 @@
+ /* Hide Streamlit default UI elements */
+ #MainMenu, header, footer {
+     visibility: hidden;
+ }
+
+ /* Full-screen center layout */
+ .stApp {
+     display: flex;
+     justify-content: center;
+     align-items: center;
+     min-height: 100vh;
+     margin: 10px;
+     padding: 10px;
+ }
+
+ /* Global dark theme base */
+ body {
+     background-color: #343541; /* ChatGPT dark gray */
+     color: #ececf1; /* Light neutral for text */
+     font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
+     margin: 10px;
+     padding: 10px;
+ }
+
+ /* Container centering */
+ .centered-container {
+     display: flex;
+     align-items: center;
+     justify-content: center;
+     height: 100vh;
+     width: 100vw;
+ }
+
+ /* ChatGPT-style button */
+ .stButton > button {
+     background-color: #444654 !important;
+     color: #ececf1 !important;
+     border: 1px solid #5c5f72 !important;
+     border-radius: 999px !important;
+     padding: 0.5rem 1.25rem !important;
+     font-weight: 500;
+     transition: background-color 0.2s ease, transform 0.1s ease;
+     position: relative;
+ }
+
+ .stButton > button:hover {
+     background-color: #565869 !important;
+     transform: scale(1.03);
+ }
+
+ /* Sidebar styling */
+ [data-testid="stSidebar"] {
+     background-color: #202123;
+     color: #ececf1;
+     border-right: 1px solid #2d2f36;
+     min-width: 140px;
+     max-width: 250px;
+     transition: all 0.3s ease;
+ }
+
+ [data-testid="stSidebar"][aria-expanded="false"] {
+     margin-left: -250px;
+ }
+
+ [data-testid="stSidebar"] h1,
+ [data-testid="stSidebar"] h2,
+ [data-testid="stSidebar"] h3 {
+     color: #ececf1;
+ }
+
+ /* Markdown and text elements */
+ .stMarkdown, .stCaption, .stHeader {
+     color: #ececf1;
+ }
+
+ /* Dropdown styling */
+ select {
+     background-color: #3e3f4b;
+     color: #ececf1;
+     border: 1px solid #5c5f72;
+     border-radius: 6px;
+     padding: 6px 10px;
+ }
+
+ /* Selectbox refinements */
+ .stSelectbox {
+     cursor: pointer !important;
+ }
+ .stSelectbox input {
+     cursor: pointer !important;
+     caret-color: transparent !important;
+ }
+ .stSelectbox div[data-baseweb="select"] {
+     cursor: pointer !important;
+ }
+ .stSelectbox [role="option"] {
+     cursor: pointer !important;
+ }
+ .stSelectbox ::selection {
+     background: transparent !important;
+ }
+
+ /* General container */
+ .block-container {
+     padding: 15px !important;
+     margin: 15px !important;
+     max-width: 100% !important;
+ }
+
+ /* Progress bar */
+ .stProgress > div > div > div {
+     background-color: #10a37f !important; /* ChatGPT green */
+ }
+ .stProgress > div > div {
+     background-color: #3e3f4b !important;
+     height: 10px !important;
+     border-radius: 5px;
+ }
+
+ /* Loading or status text */
+ .st-emotion-cache-1q7spjk {
+     font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
+     color: #ececf1 !important;
+     font-size: 1.1rem;
+     margin-bottom: 15px;
+ }
+
+ /* Optional rotation animation */
+ .rotate {
+     display: inline-block;
+     color: #10a37f;
+     animation: rotation 2s infinite linear;
+ }
+ @keyframes rotation {
+     from { transform: rotate(0deg); }
+     to { transform: rotate(359deg); }
+ }
+
+ /* Centered button containers */
+ .centered-button-container,
+ .button-container {
+     display: flex;
+     justify-content: center;
+     align-items: center;
+     text-align: center;
+ }
ui.py ADDED
@@ -0,0 +1,33 @@
+ import streamlit as st
+ import pandas as pd
+
+
+ def get_user_input() -> pd.DataFrame:
+     """
+     Render Streamlit widgets for each predictor and
+     return a 1-row DataFrame ready for prediction.
+     """
+     st.sidebar.header('Patient Information')
+     inputs = {
+         'GENDER': st.sidebar.selectbox('Gender', ['Male', 'Female']),
+         'SMOKING': st.sidebar.selectbox('Smoking', ['YES', 'NO']),
+         'YELLOW_FINGERS': st.sidebar.selectbox('Yellow Fingers', ['YES', 'NO']),
+         'ANXIETY': st.sidebar.selectbox('Anxiety', ['YES', 'NO']),
+         'PEER_PRESSURE': st.sidebar.selectbox('Peer Pressure', ['YES', 'NO']),
+         'CHRONIC DISEASE': st.sidebar.selectbox('Chronic Disease', ['YES', 'NO']),
+         'FATIGUE ': st.sidebar.selectbox('Fatigue', ['YES', 'NO']),  # trailing space kept; presumably matches the training column name
+         'ALLERGY ': st.sidebar.selectbox('Allergy', ['YES', 'NO']),  # trailing space kept; presumably matches the training column name
+         'WHEEZING': st.sidebar.selectbox('Wheezing', ['YES', 'NO']),
+         'ALCOHOL CONSUMING': st.sidebar.selectbox('Alcohol Consuming', ['YES', 'NO']),
+         'COUGHING': st.sidebar.selectbox('Coughing', ['YES', 'NO']),
+         'SHORTNESS OF BREATH': st.sidebar.selectbox('Shortness of Breath', ['YES', 'NO']),
+         'SWALLOWING DIFFICULTY': st.sidebar.selectbox('Swallowing Difficulty', ['YES', 'NO']),
+         'CHEST PAIN': st.sidebar.selectbox('Chest Pain', ['YES', 'NO'])
+     }
+     # Convert to DataFrame
+     df = pd.DataFrame(inputs, index=[0])
+     mapping = {'YES': 1, 'NO': 0, 'Male': 0, 'Female': 1}
+     for col in df.columns:
+         df[col] = df[col].map(mapping)
+
+     return df
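The encoding step in `get_user_input` can be sketched without Streamlit or pandas. Note that `Series.map` with a dict yields `NaN` for any value missing from the dict, so an unexpected answer would pass through silently. The helper below is a hypothetical variant (`encode_answers` is not part of this commit) that uses the same mapping but fails loudly instead.

```python
MAPPING = {"YES": 1, "NO": 0, "Male": 0, "Female": 1}


def encode_answers(answers: dict[str, str]) -> dict[str, int]:
    """Encode raw widget answers, raising on any value outside MAPPING."""
    unknown = {key: value for key, value in answers.items() if value not in MAPPING}
    if unknown:
        raise ValueError(f"Unmapped answers: {unknown}")
    return {key: MAPPING[value] for key, value in answers.items()}


print(encode_answers({"GENDER": "Female", "SMOKING": "NO", "COUGHING": "YES"}))
# {'GENDER': 1, 'SMOKING': 0, 'COUGHING': 1}
```

Because the selectboxes only offer values that appear in the mapping, the original loop is safe as written; the stricter check would matter mainly if the widget options and the mapping drifted apart.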
utils.py ADDED
@@ -0,0 +1,16 @@
+ import joblib
+ import pandas as pd
+
+ def load_model(path: str):
+     """Load a saved pipeline model from disk."""
+     return joblib.load(path)
+
+
+ def predict(model, input_df: pd.DataFrame):
+     """
+     Predict class labels and probabilities.
+     Returns: (labels, probabilities)
+     """
+     labels = model.predict(input_df)
+     probs = model.predict_proba(input_df)
+     return labels, probs
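`predict` calls `predict_proba`, which a scikit-learn `SVC` only provides when it was fitted with `probability=True`. The sketch below shows a save/load round trip that satisfies this contract. It is an assumption-laden illustration, not the notebook's training code: the data is synthetic, the 14-column width mirrors the predictors in `ui.py`, and the `SVC` settings are guesses based on the README's description (scaling pipeline plus class weighting).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: 60 rows of 14 binary predictors.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(60, 14))
y = X[:, 0].copy()  # toy label tied to the first column

# Assumed configuration: scaling + SVM with class weighting.
# probability=True is what makes predict_proba available on SVC.
pipeline = make_pipeline(
    StandardScaler(),
    SVC(class_weight="balanced", probability=True, random_state=0),
)
pipeline.fit(X, y)

# Round trip through joblib, as load_model() does for models/model.pkl.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

labels = restored.predict(X[:1])
probs = restored.predict_proba(X[:1])
print(labels.shape, probs.shape)  # (1,) (1, 2)
```

The `(1, 2)` probability shape is why `app.py` reads the cancer probability as `proba[0][1]`: row 0 is the single patient, column 1 is the positive class.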