---
title: RespirAI - Lung Cancer Prediction with High Recall
emoji: 🤖
colorFrom: indigo
colorTo: blue
sdk: streamlit
sdk_version: 1.30.0
app_file: app.py
pinned: false
license: mit
---
## About the Project
This project develops a machine learning model that predicts the likelihood of lung cancer from patient survey data. Lung cancer is often diagnosed at an advanced stage, so early detection is crucial for improving survival rates. By relying on simple survey responses, the tool can help clinicians and healthcare professionals identify high-risk individuals for further screening and intervention. The model is tuned to maximize recall (sensitivity): it aims to identify as many true cancer cases as possible, even at the cost of a higher rate of false positives. This trade-off matters in medical diagnostics, where missing a positive case can have severe consequences.
The project includes a complete workflow: data preprocessing, exploratory data analysis, feature engineering, model selection, training, evaluation, and model persistence. The final model is designed to be interpretable and easily deployable in real-world healthcare settings.
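To make the recall-versus-false-positive trade-off concrete, here is a minimal sketch that computes recall and precision from confusion-matrix counts. The function name `recall_precision` and all counts are hypothetical, chosen only for illustration; they are not results from this project.

```python
from __future__ import annotations


def recall_precision(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (recall, precision) from confusion-matrix counts."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Hypothetical counts (not project results).
# Model A misses 10 cancer cases; model B misses only 2 but raises more false alarms.
model_a = recall_precision(tp=40, fp=5, fn=10)
model_b = recall_precision(tp=48, fp=20, fn=2)

print(f"A: recall={model_a[0]:.2f}, precision={model_a[1]:.2f}")
print(f"B: recall={model_b[0]:.2f}, precision={model_b[1]:.2f}")
```

Under a recall-first objective, model B would be preferred despite its lower precision, because it leaves fewer true cases undetected.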
---
## About the Dataset
The dataset used in this project is sourced from Kaggle: [Lung Cancer Dataset](https://www.kaggle.com/datasets/mysarahmadbhat/lung-cancer). It contains survey responses from 309 individuals, each described by 16 features:
- **Demographics:** Age, Gender
- **Lifestyle:** Smoking status, Alcohol consumption
- **Symptoms:** Fatigue, Coughing, Shortness of breath, Wheezing, Swallowing difficulty, Chest pain, etc.
- **Target Variable:** LUNG_CANCER (YES/NO)
The dataset is small and its target classes are imbalanced. Apart from age, which is numeric, the features are categorical or binary, so they suit a range of classification algorithms once encoded.
---
## Notebook Summary
The accompanying Jupyter notebook provides a step-by-step walkthrough of the entire machine learning pipeline:
1. **Problem Definition:** Outlines the medical and machine learning objectives, emphasizing the importance of recall.
2. **Exploratory Data Analysis (EDA):** Visualizes feature distributions, examines class imbalance, and investigates relationships between features and the target.
3. **Feature Engineering:** Checks for missing values (none are present), encodes categorical variables, and removes highly collinear features using correlation matrices and Variance Inflation Factor (VIF) analysis.
4. **Model Selection:** Compares several algorithms (Logistic Regression, Random Forest, XGBoost, SVM) with a focus on recall. Hyperparameters are tuned, and class imbalance is addressed using class weighting and stratified splits.
5. **Model Evaluation:** Reports metrics such as recall, precision, F1-score, and ROC-AUC. Confusion matrices and classification reports are visualized for each model.
6. **Model Persistence:** The best-performing model (SVM with high recall) is saved using `joblib` for future deployment.
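The class weighting mentioned in step 4 can be sketched as follows. The snippet reproduces the heuristic behind scikit-learn's `class_weight="balanced"` option, `n_samples / (n_classes * class_count)`. Whether the notebook uses exactly this setting is an assumption, and the label vector is made up for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical labels with an 8:2 split; which class is the minority is arbitrary here.
y = [1] * 8 + [0] * 2
weights = balanced_class_weights(y)
print(weights)  # {1: 0.625, 0: 2.5}
```

The minority class receives the larger weight, so misclassifying one of its samples costs the model more during training.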
---
## Model Results
### Preprocessing
- **Duplicate Removal:** All duplicate rows are dropped to ensure data integrity.
- **Missing Values:** The dataset contains no missing values, simplifying preprocessing.
- **Encoding:** Categorical features are encoded numerically. Binary responses (YES/NO) are mapped to 1/0, and gender is mapped to 0 (Male) and 1 (Female).
- **Collinearity:** Feature correlation and VIF analysis are performed. The 'AGE' feature is removed due to high multicollinearity.
- **Class Imbalance:** Stratified train-test splits and class weighting are used to address the imbalance in the target variable.
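The duplicate removal, encoding, and `AGE` removal steps above could look roughly like this in plain Python. The raw gender codes (`"M"`/`"F"`), the `SMOKING` column name, and the sample rows are assumptions for illustration. With Pandas, `DataFrame.drop_duplicates` and `Series.map` would do the same job.

```python
YES_NO = {"YES": 1, "NO": 0}
GENDER = {"M": 0, "F": 1}  # assumed raw codes; 0 = Male, 1 = Female
DROPPED = {"AGE"}          # removed because of multicollinearity


def preprocess(rows):
    """Drop exact duplicate rows, remove AGE, and encode the remaining columns."""
    seen, cleaned = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        encoded = {}
        for col, value in row.items():
            if col in DROPPED:
                continue
            if col == "GENDER":
                encoded[col] = GENDER[value]
            else:
                # Map YES/NO to 1/0; leave any other value unchanged.
                encoded[col] = YES_NO.get(value, value)
        cleaned.append(encoded)
    return cleaned


# Hypothetical rows; the third is an exact duplicate of the first.
rows = [
    {"GENDER": "M", "AGE": 69, "SMOKING": "YES", "LUNG_CANCER": "YES"},
    {"GENDER": "F", "AGE": 55, "SMOKING": "NO", "LUNG_CANCER": "NO"},
    {"GENDER": "M", "AGE": 69, "SMOKING": "YES", "LUNG_CANCER": "YES"},
]
print(preprocess(rows))
```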
### Training
- **Algorithms Tested:** Logistic Regression, Random Forest, XGBoost, and Support Vector Machine (SVM).
- **Cross-Validation:** Stratified K-Fold cross-validation is used to ensure robust evaluation.
- **Hyperparameter Tuning:** Randomized search and Optuna are available for hyperparameter optimization (though not fully detailed in the notebook).
- **Pipeline:** For SVM, a pipeline with feature scaling (`StandardScaler`) is used to improve performance.
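A minimal sketch of the SVM setup described above, assuming default `SVC` hyperparameters and balanced class weights (the notebook's tuned values are not reproduced here). The data is synthetic, generated with `make_classification` as a stand-in for the survey data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: 300 rows, 14 features, imbalanced classes.
X, y = make_classification(
    n_samples=300, n_features=14, weights=[0.2, 0.8], random_state=0
)

# Scaling lives inside the pipeline, so it is refit on each training fold only.
model = make_pipeline(StandardScaler(), SVC(class_weight="balanced"))

# Stratified folds keep the class ratio roughly constant across splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
recalls = cross_val_score(model, X, y, cv=cv, scoring="recall")

print(f"recall per fold: {recalls.round(2)}")
print(f"mean recall: {recalls.mean():.2f}")
```

Keeping the scaler inside the pipeline avoids leaking statistics from the validation fold into training.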
### Evaluation
- **Metrics:** Recall is the primary metric; accuracy, precision, F1-score, and ROC-AUC are also reported.
- **Results:** SVM achieved the highest recall, making it the preferred model for this application.
- **Visualization:** Confusion matrices and classification reports are plotted for each model to facilitate comparison.
### Model Persistence
- The final SVM model is saved as `model.pkl` using `joblib`, enabling easy reuse and deployment.
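Saving and reloading a fitted pipeline with `joblib` can be sketched as below. The model here is trained on synthetic data and written to a temporary directory; only the `model.pkl` file name comes from this project.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data and an untuned pipeline, for illustration only.
X, y = make_classification(n_samples=100, n_features=14, random_state=0)
model = make_pipeline(StandardScaler(), SVC()).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pkl"
    joblib.dump(model, path)       # serialize the fitted pipeline
    restored = joblib.load(path)   # load it back, e.g. inside an app

# The restored pipeline should reproduce the original predictions.
same = bool((restored.predict(X) == model.predict(X)).all())
print(same)
```

Because the scaler is part of the saved pipeline, the loaded object can be applied directly to unscaled inputs.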
---
## How to Install
Follow these steps to set up the project in a virtual environment:
1. **Clone the Repository**
```bash
git clone https://github.com/DeepActionPotential/RepiraAI
cd RepiraAI
```
2. **Create a Virtual Environment**
```bash
python -m venv venv
```
3. **Activate the Virtual Environment**
- On Windows:
```bash
venv\Scripts\activate
```
- On macOS/Linux:
```bash
source venv/bin/activate
```
4. **Install Dependencies**
```bash
pip install -r requirements.txt
```
---
## How to Use the Software
A demo video is available here: [demo-video](assets/lung_cancer.mp4)
---
## Technologies Used
- **Pandas:** Data manipulation and analysis, including cleaning, encoding, and feature engineering.
- **NumPy:** Efficient numerical computations and array operations.
- **Matplotlib & Seaborn:** Data visualization for EDA, feature distributions, and evaluation metrics.
- **Scikit-learn:** Machine learning library used for model training, evaluation, cross-validation, and pipelines.
- **XGBoost:** Advanced gradient boosting algorithm for classification.
- **Optuna:** Hyperparameter optimization framework (optional, for advanced tuning).
- **Joblib:** Model serialization and persistence.
- **Streamlit:** (Optional) For building interactive web demos of the prediction model.
- **Jupyter Notebook / VSCode:** Interactive development and documentation environment.
Each technology is chosen for its robustness, ease of use, and suitability for rapid prototyping and deployment in machine learning workflows.
---
## License
This project is licensed under the MIT License. You are free to use, modify, and distribute this software for personal or commercial purposes, provided that the copyright notice and license text are included in all copies or substantial portions of the software.
---