---
title: RespirAI - Lung Cancer Prediction with High Recall
emoji: 🤖
colorFrom: indigo
colorTo: blue
sdk: streamlit
sdk_version: 1.30.0
app_file: app.py
pinned: false
license: mit
---

## About the Project

This project develops a machine learning model that predicts the likelihood of lung cancer from patient survey data. Early detection of lung cancer is crucial for improving survival rates, as the disease is often diagnosed at advanced stages. By leveraging simple survey responses, this tool can help clinicians and healthcare professionals identify high-risk individuals for further screening and intervention.

The project prioritizes recall (sensitivity): it aims to identify as many true cancer cases as possible, even at the cost of a higher false-positive rate. This trade-off is particularly important in medical diagnostics, where missing a positive case can have severe consequences.

The project includes a complete workflow: data preprocessing, exploratory data analysis, feature engineering, model selection, training, evaluation, and model persistence. The final model is designed to be interpretable and easily deployable in real-world healthcare settings.


## About the Dataset

The dataset used in this project is sourced from Kaggle (the Lung Cancer Dataset). It contains survey responses from 309 individuals, each described by 15 features plus a target label:

- **Demographics:** Age, Gender
- **Lifestyle:** Smoking status, Alcohol consumption
- **Symptoms:** Fatigue, Coughing, Shortness of breath, Wheezing, Swallowing difficulty, Chest pain, etc.
- **Target Variable:** LUNG_CANCER (YES/NO)

The dataset is relatively small and exhibits class imbalance, with more negative cases than positive ones. All features are either categorical or binary, making them suitable for various classification algorithms after appropriate encoding.
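As a concrete sketch of that encoding step, binary survey responses can be mapped to numbers with a few lines of pandas. The columns below are a small illustrative subset, not the full 16-column schema:

```python
import pandas as pd

# A tiny illustrative sample; the real dataset has 309 rows and 16 columns.
df = pd.DataFrame({
    "GENDER": ["M", "F", "F", "M"],
    "SMOKING": ["YES", "NO", "YES", "NO"],
    "LUNG_CANCER": ["YES", "NO", "YES", "NO"],
})

# Map gender to 0 (Male) / 1 (Female) and YES/NO responses to 1/0.
df["GENDER"] = df["GENDER"].map({"M": 0, "F": 1})
for col in ["SMOKING", "LUNG_CANCER"]:
    df[col] = df[col].map({"YES": 1, "NO": 0})

print(df.dtypes)  # all columns are now numeric
```

After this mapping, every feature is a 0/1 (or small integer) value, which is what tree-based and linear classifiers alike expect.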


## Notebook Summary

The accompanying Jupyter notebook provides a step-by-step walkthrough of the entire machine learning pipeline:

1. **Problem Definition:** Outlines the medical and machine learning objectives, emphasizing the importance of recall.
2. **Exploratory Data Analysis (EDA):** Visualizes feature distributions, examines class imbalance, and investigates relationships between features and the target.
3. **Feature Engineering:** Handles missing values, encodes categorical variables, and removes highly collinear features using correlation matrices and Variance Inflation Factor (VIF) analysis.
4. **Model Selection:** Compares several algorithms (Logistic Regression, Random Forest, XGBoost, SVM) with a focus on recall. Hyperparameters are tuned, and class imbalance is addressed using class weighting and stratified splits.
5. **Model Evaluation:** Reports metrics such as recall, precision, F1-score, and ROC-AUC. Confusion matrices and classification reports are visualized for each model.
6. **Model Persistence:** The best-performing model (an SVM with high recall) is saved with joblib for future deployment.

## Model Results

### Preprocessing

- **Duplicate Removal:** All duplicate rows are dropped to ensure data integrity.
- **Missing Values:** The dataset contains no missing values, simplifying preprocessing.
- **Encoding:** Categorical features are encoded numerically. Binary responses (YES/NO) are mapped to 1/0, and gender is mapped to 0 (Male) and 1 (Female).
- **Collinearity:** Feature correlation and VIF analysis are performed; the `AGE` feature is removed due to high multicollinearity.
- **Class Imbalance:** Stratified train-test splits and class weighting are used to address the imbalance in the target variable.
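The VIF check above can be sketched with plain NumPy: regress each feature on the remaining ones and compute 1 / (1 − R²). The data here is synthetic and only the procedure mirrors the notebook:

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor per column: regress each column on the
    others (with an intercept) and return 1 / (1 - R^2)."""
    n, p = X.shape
    scores = []
    for j in range(p):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1.0 - ((y - A @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
        scores.append(1.0 / (1.0 - r2))
    return np.array(scores)

rng = np.random.default_rng(0)
age = rng.normal(60, 10, 200)
collinear = age + rng.normal(0, 0.5, 200)   # nearly a copy of age
independent = rng.normal(0, 1, 200)

scores = vif(np.column_stack([age, collinear, independent]))
# The two collinear columns get large VIFs; the independent one stays near 1.
```

Features whose VIF exceeds a chosen threshold (commonly 5 or 10) are candidates for removal, which is how `AGE` was flagged here.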

### Training

- **Algorithms Tested:** Logistic Regression, Random Forest, XGBoost, and Support Vector Machine (SVM).
- **Cross-Validation:** Stratified K-Fold cross-validation is used for robust evaluation.
- **Hyperparameter Tuning:** Randomized search and Optuna are available for hyperparameter optimization (though not fully detailed in the notebook).
- **Pipeline:** For the SVM, a pipeline with feature scaling (`StandardScaler`) is used to improve performance.
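The training setup above can be sketched as follows; synthetic imbalanced data stands in for the real survey features, and the exact hyperparameters are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: ~15% positive cases, roughly mirroring an imbalanced survey.
X, y = make_classification(
    n_samples=300, n_features=15, weights=[0.85, 0.15], random_state=42
)

# Scaling matters for SVMs; class_weight="balanced" counteracts the imbalance.
pipe = make_pipeline(StandardScaler(), SVC(class_weight="balanced", random_state=42))

# Stratified K-Fold keeps the class ratio in every fold; score on recall.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
recall_scores = cross_val_score(pipe, X, y, cv=cv, scoring="recall")
print(f"mean recall: {recall_scores.mean():.3f}")
```

Putting the scaler inside the pipeline ensures it is refit on each training fold, so no information leaks from the validation folds into the scaling statistics.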

### Evaluation

- **Metrics:** Emphasis is on recall, but accuracy, precision, F1-score, and ROC-AUC are also reported.
- **Results:** The SVM achieved the highest recall, making it the preferred model for this application.
- **Visualization:** Confusion matrices and classification reports are plotted for each model to facilitate comparison.
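A minimal version of this evaluation step, with a toy SVC on synthetic data standing in for the tuned model:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=15, random_state=0)
# Stratify the split so the class ratio is preserved in train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = make_pipeline(StandardScaler(), SVC(class_weight="balanced"))
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)

cm = confusion_matrix(y_te, y_pred)   # rows: true class, cols: predicted class
rec = recall_score(y_te, y_pred)      # the headline metric for this project
print(cm)
print(classification_report(y_te, y_pred))
print("recall:", round(rec, 3))
```

The bottom-left cell of the confusion matrix (positives predicted as negative) is exactly what maximizing recall tries to drive toward zero.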

### Model Persistence

- The final SVM model is saved as `model.pkl` using joblib, enabling easy reuse and deployment.
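Persisting and reloading the pipeline takes two calls. The filename `model.pkl` matches the artifact named above; the model here is a toy stand-in trained on synthetic data:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=15, random_state=0)
model = make_pipeline(StandardScaler(), SVC(class_weight="balanced")).fit(X, y)

joblib.dump(model, "model.pkl")       # persist the whole pipeline to disk
reloaded = joblib.load("model.pkl")   # later, e.g. inside app.py
assert (reloaded.predict(X) == model.predict(X)).all()
```

Saving the entire pipeline (scaler plus classifier) rather than the bare SVC means the deployment code never has to re-implement the preprocessing.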

## How to Install

Follow these steps to set up the project in a virtual environment:

1. **Clone the repository**

   ```shell
   git clone https://github.com/DeepActionPotential/RepiraAI
   cd RespiraAI
   ```

2. **Create a virtual environment**

   ```shell
   python -m venv venv
   ```

3. **Activate the virtual environment**

   - On Windows:

     ```shell
     venv\Scripts\activate
     ```

   - On macOS/Linux:

     ```shell
     source venv/bin/activate
     ```

4. **Install dependencies**

   ```shell
   pip install -r requirements.txt
   ```

## How to Use the Software

After installing the dependencies, launch the app locally with `streamlit run app.py`, or try the hosted demo on this Hugging Face Space.

*(demo video and demo image)*


## Technologies Used

- **Pandas:** Data manipulation and analysis, including cleaning, encoding, and feature engineering.
- **NumPy:** Efficient numerical computations and array operations.
- **Matplotlib & Seaborn:** Data visualization for EDA, feature distributions, and evaluation metrics.
- **Scikit-learn:** Model training, evaluation, cross-validation, and pipelines.
- **XGBoost:** Gradient-boosting algorithm for classification.
- **Optuna:** Hyperparameter optimization framework (optional, for advanced tuning).
- **Joblib:** Model serialization and persistence.
- **Streamlit:** Interactive web demo of the prediction model.
- **Jupyter Notebook / VSCode:** Interactive development and documentation environment.

Each technology is chosen for its robustness, ease of use, and suitability for rapid prototyping and deployment in machine learning workflows.


## License

This project is licensed under the MIT License. You are free to use, modify, and distribute this software for personal or commercial purposes, provided that proper attribution is given.