Commit a7c44e0 (verified) by DeepActionPotential · Parent: 741c10e

Update README.md

Files changed (1): README.md (+141 −128)
---
title: RespirAI - Lung Cancer Prediction with High Recall
emoji: 🤖
colorFrom: indigo
colorTo: blue
sdk: streamlit
sdk_version: 1.30.0
app_file: app.py
pinned: false
license: mit
---

## About the Project

This project develops a machine learning model that predicts the likelihood of lung cancer from patient survey data. Early detection of lung cancer is crucial for improving survival rates, as the disease is often diagnosed at advanced stages. By leveraging simple survey responses, this tool can assist clinicians and healthcare professionals in identifying high-risk individuals for further screening and intervention. The project focuses on maximizing recall (sensitivity), ensuring that as many true cancer cases as possible are identified, even at the cost of a higher false-positive rate. This trade-off is particularly important in medical diagnostics, where missing a positive case can have severe consequences.

The project includes a complete workflow: data preprocessing, exploratory data analysis, feature engineering, model selection, training, evaluation, and model persistence. The final model is designed to be interpretable and easily deployable in real-world healthcare settings.

---
## About the Dataset

The dataset used in this project is sourced from Kaggle: [Lung Cancer Dataset](https://www.kaggle.com/datasets/mysarahmadbhat/lung-cancer). It contains survey responses from 309 individuals, each described by 15 features plus the target:

- **Demographics:** Age, Gender
- **Lifestyle:** Smoking status, Alcohol consumption
- **Symptoms:** Fatigue, Coughing, Shortness of breath, Wheezing, Swallowing difficulty, Chest pain, etc.
- **Target Variable:** LUNG_CANCER (YES/NO)

The dataset is relatively small and exhibits class imbalance, with substantially more positive cases than negative ones. All features are either categorical or binary, making them suitable for various classification algorithms after appropriate encoding.

---
## Notebook Summary

The accompanying Jupyter notebook provides a step-by-step walkthrough of the entire machine learning pipeline:

1. **Problem Definition:** Outlines the medical and machine learning objectives, emphasizing the importance of recall.
2. **Exploratory Data Analysis (EDA):** Visualizes feature distributions, examines class imbalance, and investigates relationships between features and the target.
3. **Feature Engineering:** Handles missing values, encodes categorical variables, and removes highly collinear features using correlation matrices and Variance Inflation Factor (VIF) analysis.
4. **Model Selection:** Compares several algorithms (Logistic Regression, Random Forest, XGBoost, SVM) with a focus on recall. Hyperparameters are tuned, and class imbalance is addressed using class weighting and stratified splits.
5. **Model Evaluation:** Reports metrics such as recall, precision, F1-score, and ROC-AUC. Confusion matrices and classification reports are visualized for each model.
6. **Model Persistence:** The best-performing model (SVM with high recall) is saved using `joblib` for future deployment.

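The VIF check in step 3 can be sketched with plain NumPy: each feature is regressed on all the others, and VIF = 1 / (1 − R²). The synthetic data below is illustrative only, not the survey dataset.

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """Variance Inflation Factor for each column of a feature matrix X."""
    n, p = X.shape
    out = np.empty(p)
    for i in range(p):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(n), others])   # intercept + other features
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out[i] = 1.0 / (1.0 - r2) if r2 < 1.0 else np.inf
    return out

# Two independent columns plus a near-copy of the first:
rng = np.random.default_rng(0)
a, b = rng.normal(size=200), rng.normal(size=200)
X = np.column_stack([a, b, a + 0.1 * rng.normal(size=200)])
print(vif(X))  # the collinear pair (columns 0 and 2) gets inflated VIFs
```

Features whose VIF exceeds a chosen threshold (commonly 5 or 10) are candidates for removal, which is the kind of reasoning behind dropping a highly collinear column.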
## Model Results

### Preprocessing

- **Duplicate Removal:** All duplicate rows are dropped to ensure data integrity.
- **Missing Values:** The dataset contains no missing values, simplifying preprocessing.
- **Encoding:** Categorical features are encoded numerically. Binary responses (YES/NO) are mapped to 1/0, and gender is mapped to 0 (Male) and 1 (Female).
- **Collinearity:** Feature correlation and VIF analysis are performed. The `AGE` feature is removed due to high multicollinearity.
- **Class Imbalance:** Stratified train-test splits and class weighting are used to address the imbalance in the target variable.

### Training

- **Algorithms Tested:** Logistic Regression, Random Forest, XGBoost, and Support Vector Machine (SVM).
- **Cross-Validation:** Stratified K-Fold cross-validation is used to ensure robust evaluation.
- **Hyperparameter Tuning:** Randomized search and Optuna are available for hyperparameter optimization (though not fully detailed in the notebook).
- **Pipeline:** For SVM, a pipeline with feature scaling (`StandardScaler`) is used to improve performance.

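A minimal sketch of this setup, combining the scaling pipeline, class weighting, and recall-scored stratified cross-validation (synthetic data stands in for the encoded survey features; the exact hyperparameters are assumptions, not the notebook's tuned values):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the encoded survey features.
X, y = make_classification(n_samples=300, n_features=14,
                           weights=[0.3, 0.7], random_state=42)

# Scale features, then fit a class-weighted SVM; score on recall so that
# missed positive cases are penalized directly.
model = make_pipeline(StandardScaler(), SVC(class_weight="balanced"))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="recall")
print(scores.mean())
```

Putting the scaler inside the pipeline matters: it is refit on each training fold, so no information from the validation fold leaks into the scaling statistics.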
### Evaluation

- **Metrics:** Recall is emphasized, with accuracy, precision, F1-score, and ROC-AUC also reported.
- **Results:** SVM achieved the highest recall, making it the preferred model for this application.
- **Visualization:** Confusion matrices and classification reports are plotted for each model to facilitate comparison.

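These evaluation outputs can be reproduced on a toy prediction vector (the labels below are illustrative, not model outputs):

```python
from sklearn.metrics import classification_report, confusion_matrix, recall_score

# Toy labels: 5 positives, 3 negatives; one positive is missed.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]

print(confusion_matrix(y_true, y_pred))  # rows: true class, cols: predicted
print(recall_score(y_true, y_pred))      # 4 of 5 positives caught -> 0.8
print(classification_report(y_true, y_pred))
```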
### Model Persistence

- The final SVM model is saved as `model.pkl` using `joblib`, enabling easy reuse and deployment.

---

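Persistence with `joblib` can be sketched as follows. The model here is trained on synthetic data and written to a temporary directory; the notebook saves its actual tuned SVM pipeline as `model.pkl`:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=14, random_state=0)
model = make_pipeline(StandardScaler(), SVC(class_weight="balanced")).fit(X, y)

# Persist the whole pipeline (scaler + SVM) so inference needs no refitting.
path = os.path.join(tempfile.gettempdir(), "model.pkl")
joblib.dump(model, path)
restored = joblib.load(path)
print(restored.predict(X[:3]))
```

Saving the entire pipeline, rather than the bare classifier, ensures that new inputs pass through the same scaling step at inference time.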
## How to Install

Follow these steps to set up the project in a virtual environment:

1. **Clone the Repository**
   ```bash
   git clone https://github.com/DeepActionPotential/RepiraAI
   cd RepiraAI
   ```

2. **Create a Virtual Environment**
   ```bash
   python -m venv venv
   ```

3. **Activate the Virtual Environment**
   - On Windows:
     ```bash
     venv\Scripts\activate
     ```
   - On macOS/Linux:
     ```bash
     source venv/bin/activate
     ```

4. **Install Dependencies**
   ```bash
   pip install -r requirements.txt
   ```

---

## How to Use the Software

- [Demo video](assets/lung_cancer.mp4)

![Demo image](assets/1.jpeg)

---

## Technologies Used

- **Pandas:** Data manipulation and analysis, including cleaning, encoding, and feature engineering.
- **NumPy:** Efficient numerical computations and array operations.
- **Matplotlib & Seaborn:** Data visualization for EDA, feature distributions, and evaluation metrics.
- **Scikit-learn:** Model training, evaluation, cross-validation, and pipelines.
- **XGBoost:** Gradient boosting algorithm for classification.
- **Optuna:** Hyperparameter optimization framework (optional, for advanced tuning).
- **Joblib:** Model serialization and persistence.
- **Streamlit:** Powers the interactive web demo of the prediction model.
- **Jupyter Notebook / VSCode:** Interactive development and documentation environment.

Each technology was chosen for its robustness, ease of use, and suitability for rapid prototyping and deployment in machine learning workflows.

---

## License

This project is licensed under the MIT License. You are free to use, modify, and distribute this software for personal or commercial purposes, provided that proper attribution is given.

---