Spaces:

SpencerCPurdy
/

Automated_MLOps_Framework_for_Customer_Churn_Prediction

Running

App Files Files Community

Automated_MLOps_Framework_for_Customer_Churn_Prediction / README.md

SpencerCPurdy

Update README.md

82d3fa4 verified 26 days ago

preview code

raw

history blame contribute delete

6.18 kB

A newer version of the Gradio SDK is available: 6.5.1

Upgrade

metadata

title: Automated MLOps Framework For Customer Churn Prediction
colorFrom: gray
colorTo: gray
sdk: gradio
sdk_version: 6.3.0
app_file: app.py
pinned: false
license: mit
short_description: MLOps platform with drift detection and model monitoring

Automated MLOps Framework for Customer Churn Prediction

A comprehensive machine learning operations (MLOps) framework for predicting customer churn in the telecommunications industry. This project demonstrates end-to-end ML pipeline development including data preprocessing, model training, hyperparameter optimization, drift detection, and deployment with an interactive web interface.

About

This portfolio project showcases practical MLOps skills by implementing an automated pipeline for customer churn prediction. The system trains multiple machine learning models, performs hyperparameter optimization, monitors for data drift, and provides an interactive interface for making predictions.

Author: Spencer Purdy
Development Environment: Google Colab Pro (A100 GPU, High RAM)

Features

Automated Model Training: Trains and compares XGBoost, LightGBM, and Random Forest models
Hyperparameter Optimization: Uses Optuna for automated hyperparameter tuning (30 trials)
Model Versioning: SQLite-based model registry with performance tracking
Data Drift Detection: Kolmogorov-Smirnov statistical test for distribution changes
Feature Engineering: Creates derived features including tenure groups, charge ratios, and service counts
Class Balancing: SMOTE implementation to handle imbalanced dataset
Interactive Interface: Gradio web application for predictions and system monitoring
Model Explainability: Feature importance visualization
Performance Monitoring: Tracks training time, inference latency, and cost metrics

Dataset

Source: IBM Telco Customer Churn Dataset
License: Database Contents License (DbCL) v1.0
Samples: 7,043 customers
Features: 20 (demographic, account information, and service details)
Target: Binary classification (Churn: Yes/No)
Class Distribution: Approximately 26% churn rate

Model Performance

Performance metrics on held-out test set (20% of data):

Metric	Score
ROC-AUC	0.9337
Accuracy	85.46%
Precision	0.8536
Recall	0.8560
F1-Score	0.8548

Best Model: LightGBM
Training Time: 0.84 minutes
Inference Latency: <100ms per prediction

Technical Stack

Python Libraries: pandas, numpy, scikit-learn, xgboost, lightgbm, optuna, shap, imbalanced-learn
Database: SQLite (model registry and experiment tracking)
UI Framework: Gradio
Visualization: matplotlib, seaborn, plotly
Development: Google Colab Pro with A100 GPU

Setup and Usage

Running in Google Colab

Clone this repository or download the notebook file
Upload Automated MLOps Framework for Customer Churn Prediction.ipynb to Google Colab
Select Runtime > Change runtime type > A100 GPU (or T4 GPU for free tier)
Run all cells sequentially

The notebook will automatically:

Install required dependencies
Download and preprocess the dataset
Train multiple models with hyperparameter optimization
Launch a Gradio interface with a shareable link

Running Locally

# Clone the repository
git clone https://github.com/SpencerCPurdy/Automated_MLOps_Framework_for_Customer_Churn_Prediction.git
cd Automated_MLOps_Framework_for_Customer_Churn_Prediction

# Install dependencies
pip install pandas numpy scikit-learn xgboost lightgbm optuna shap imbalanced-learn gradio plotly seaborn matplotlib scipy joblib

# Run the notebook
jupyter notebook "Automated MLOps Framework for Customer Churn Prediction.ipynb"

Project Structure

├── Automated MLOps Framework for Customer Churn Prediction.ipynb
├── README.md
├── LICENSE
└── .gitignore

The notebook contains the following components:

Configuration & Setup: System configuration, logging, and reproducibility settings
Database Management: Model registry and experiment tracking
Data Processing: Loading, cleaning, and feature engineering
Model Training: Automated training pipeline with Optuna optimization
Drift Detection: Statistical tests for data distribution changes
Evaluation: Comprehensive performance metrics and visualizations
Gradio Interface: Interactive web application for predictions

Key Implementation Details

Reproducibility: All random seeds set to 42 for deterministic results
Cross-Validation: 5-fold stratified cross-validation for model selection
Feature Engineering: Automated creation of tenure groups, charge ratios, and service counts
Missing Data: Median imputation for numerical features
Class Imbalance: SMOTE oversampling applied to training data

Limitations

Trained specifically on telecommunications customer data; may not generalize to other industries
Performance degrades with significant data drift (p-value < 0.05)
Requires minimum 1,000 samples for reliable predictions
Binary classification only (churn vs. no churn)
Model performance may degrade over time without retraining

Model Registry

The system maintains a SQLite database tracking:

Model versions and hyperparameters
Performance metrics on validation and test sets
Training time and sample counts
Production deployment status
Drift detection results

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

IBM Telco Customer Churn dataset (Database Contents License v1.0)
Kaggle community for dataset hosting and documentation
Open-source libraries and frameworks used in this project

Contact

Spencer Purdy
GitHub: @SpencerCPurdy

This is a portfolio project developed to demonstrate machine learning engineering and MLOps capabilities. Performance metrics are based on the specific dataset used and should be validated for any real-world application.