SpencerCPurdy committed on
Commit 0674890 · verified · 1 Parent(s): c329ef1

Update README.md

Files changed (1)
  1. README.md +147 -1
README.md CHANGED
@@ -11,4 +11,150 @@ license: mit
11
  short_description: MLOps platform with drift detection and model monitoring
12
  ---
13
 
14
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
14
+ # Automated MLOps Framework for Customer Churn Prediction
15
+
16
+ A comprehensive machine learning operations (MLOps) framework for predicting customer churn in the telecommunications industry. This project demonstrates end-to-end ML pipeline development including data preprocessing, model training, hyperparameter optimization, drift detection, and deployment with an interactive web interface.
17
+
18
+ ## About
19
+
20
+ This portfolio project showcases practical MLOps skills by implementing an automated pipeline for customer churn prediction. The system trains multiple machine learning models, performs hyperparameter optimization, monitors for data drift, and provides an interactive interface for making predictions.
21
+
22
+ **Author:** Spencer Purdy
23
+ **Development Environment:** Google Colab Pro (A100 GPU, High RAM)
24
+
25
+ ## Features
26
+
27
+ - **Automated Model Training**: Trains and compares XGBoost, LightGBM, and Random Forest models
28
+ - **Hyperparameter Optimization**: Uses Optuna for automated hyperparameter tuning (30 trials)
29
+ - **Model Versioning**: SQLite-based model registry with performance tracking
30
+ - **Data Drift Detection**: Kolmogorov-Smirnov statistical test for distribution changes
31
+ - **Feature Engineering**: Creates derived features including tenure groups, charge ratios, and service counts
32
+ - **Class Balancing**: SMOTE oversampling to handle the imbalanced dataset
33
+ - **Interactive Interface**: Gradio web application for predictions and system monitoring
34
+ - **Model Explainability**: Feature importance visualization
35
+ - **Performance Monitoring**: Tracks training time, inference latency, and cost metrics
36
+
37
+ ## Dataset
38
+
39
+ - **Source:** IBM Telco Customer Churn Dataset
40
+ - **License:** Database Contents License (DbCL) v1.0
41
+ - **Samples:** 7,043 customers
42
+ - **Features:** 20 (demographic, account information, and service details)
43
+ - **Target:** Binary classification (Churn: Yes/No)
44
+ - **Class Distribution:** Approximately 26% churn rate
45
+
46
+ ## Model Performance
47
+
48
+ Performance metrics on held-out test set (20% of data):
49
+
50
+ | Metric | Score |
51
+ |--------|-------|
52
+ | ROC-AUC | 0.9337 |
53
+ | Accuracy | 0.8546 |
54
+ | Precision | 0.8536 |
55
+ | Recall | 0.8560 |
56
+ | F1-Score | 0.8548 |
57
+
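As a quick consistency check on the table above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the reported values. It assumes the precision and recall were computed under a scheme where that identity holds exactly (with macro or weighted averaging across classes it holds only approximately). The helper name `f1_from_precision_recall` is illustrative and not taken from the notebook.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the Model Performance table
reported_precision = 0.8536
reported_recall = 0.8560
reported_f1 = 0.8548

f1 = f1_from_precision_recall(reported_precision, reported_recall)
print(f"recomputed F1 = {f1:.4f} (reported {reported_f1:.4f})")
```

The recomputed value agrees with the reported F1-Score to four decimal places.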
58
+ **Best Model:** LightGBM
59
+ **Training Time:** 0.84 minutes
60
+ **Inference Latency:** <100ms per prediction
61
+
62
+ ## Technical Stack
63
+
64
+ - **Python Libraries:** pandas, numpy, scikit-learn, xgboost, lightgbm, optuna, shap, imbalanced-learn
65
+ - **Database:** SQLite (model registry and experiment tracking)
66
+ - **UI Framework:** Gradio
67
+ - **Visualization:** matplotlib, seaborn, plotly
68
+ - **Development:** Google Colab Pro with A100 GPU
69
+
70
+ ## Setup and Usage
71
+
72
+ ### Running in Google Colab
73
+
74
+ 1. Clone this repository or download the notebook file
75
+ 2. Upload `Automated MLOps Framework for Customer Churn Prediction.ipynb` to Google Colab
76
+ 3. Select Runtime > Change runtime type > A100 GPU (or T4 GPU for free tier)
77
+ 4. Run all cells sequentially
78
+
79
+ The notebook will automatically:
80
+ - Install required dependencies
81
+ - Download and preprocess the dataset
82
+ - Train multiple models with hyperparameter optimization
83
+ - Launch a Gradio interface with a shareable link
84
+
85
+ ### Running Locally
86
+
87
+ ```bash
88
+ # Clone the repository
89
+ git clone https://github.com/SpencerCPurdy/Automated_MLOps_Framework_for_Customer_Churn_Prediction.git
90
+ cd Automated_MLOps_Framework_for_Customer_Churn_Prediction
91
+
92
+ # Install dependencies
93
+ pip install pandas numpy scikit-learn xgboost lightgbm optuna shap imbalanced-learn gradio plotly seaborn matplotlib scipy joblib
94
+
95
+ # Run the notebook
96
+ jupyter notebook "Automated MLOps Framework for Customer Churn Prediction.ipynb"
97
+ ```
98
+
99
+ ## Project Structure
100
+
101
+ ```
102
+ ├── Automated MLOps Framework for Customer Churn Prediction.ipynb
103
+ ├── README.md
104
+ ├── LICENSE
105
+ └── .gitignore
106
+ ```
107
+
108
+ The notebook contains the following components:
109
+
110
+ 1. **Configuration & Setup**: System configuration, logging, and reproducibility settings
111
+ 2. **Database Management**: Model registry and experiment tracking
112
+ 3. **Data Processing**: Loading, cleaning, and feature engineering
113
+ 4. **Model Training**: Automated training pipeline with Optuna optimization
114
+ 5. **Drift Detection**: Statistical tests for data distribution changes
115
+ 6. **Evaluation**: Comprehensive performance metrics and visualizations
116
+ 7. **Gradio Interface**: Interactive web application for predictions
117
+
118
+ ## Key Implementation Details
119
+
120
+ - **Reproducibility:** All random seeds set to 42 for deterministic results
121
+ - **Cross-Validation:** 5-fold stratified cross-validation for model selection
122
+ - **Feature Engineering:** Automated creation of tenure groups, charge ratios, and service counts
123
+ - **Missing Data:** Median imputation for numerical features
124
+ - **Class Imbalance:** SMOTE oversampling applied to training data
125
+
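The feature engineering and imputation steps listed above could look roughly like the following dependency-free sketch. The column names (`tenure`, `MonthlyCharges`, `TotalCharges`, and the service columns) come from the Telco dataset. The bin edges, the ratio definition, the choice of service columns, and all function names are assumptions made for illustration, since the README does not specify them.

```python
from statistics import median

# Yes/No service columns counted toward service_count (assumed selection)
SERVICE_COLUMNS = [
    "PhoneService", "OnlineSecurity", "OnlineBackup",
    "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies",
]


def tenure_group(tenure_months: int) -> str:
    """Bucket tenure into coarse groups (bin edges are illustrative)."""
    if tenure_months <= 12:
        return "0-12"
    if tenure_months <= 24:
        return "13-24"
    if tenure_months <= 48:
        return "25-48"
    return "49+"


def impute_median(rows, column):
    """Fill missing values (None) in a numeric column with the column median."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    for r in rows:
        if r[column] is None:
            r[column] = fill
    return fill


def add_features(row):
    """Return a copy of the row with the three derived features added."""
    out = dict(row)
    out["tenure_group"] = tenure_group(row["tenure"])
    # Lifetime charges relative to the current monthly charge; guard against zero
    monthly = row["MonthlyCharges"]
    out["charge_ratio"] = row["TotalCharges"] / monthly if monthly else 0.0
    out["service_count"] = sum(row.get(col) == "Yes" for col in SERVICE_COLUMNS)
    return out


# Toy records, not rows from the real dataset
customers = [
    {"tenure": 1, "MonthlyCharges": 30.0, "TotalCharges": 30.0, "PhoneService": "Yes"},
    {"tenure": 30, "MonthlyCharges": 80.0, "TotalCharges": None,
     "PhoneService": "Yes", "TechSupport": "Yes", "StreamingTV": "No"},
    {"tenure": 60, "MonthlyCharges": 100.0, "TotalCharges": 6000.0, "OnlineSecurity": "Yes"},
]
impute_median(customers, "TotalCharges")
engineered = [add_features(c) for c in customers]
```

In a real pipeline the median would normally be computed on the training split only and reused for validation and test data, so that no information leaks across splits.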
126
+ ## Limitations
127
+
128
+ - Trained specifically on telecommunications customer data; may not generalize to other industries
129
+ - Performance may degrade under significant data drift, flagged when the Kolmogorov-Smirnov test returns a p-value < 0.05
130
+ - Requires a minimum of 1,000 samples for reliable predictions
131
+ - Binary classification only (churn vs. no churn)
132
+ - Model performance may degrade over time without retraining
133
+
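To make the drift threshold above concrete, here is a minimal standard-library sketch of a two-sample Kolmogorov-Smirnov check for one numeric feature. It is not the notebook's implementation: the function names are hypothetical, and the p-value uses the asymptotic Kolmogorov series, which is only an approximation for small samples. In practice `scipy.stats.ks_2samp` is the usual choice.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest absolute gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    n, m = len(ref), len(cur)
    d = 0.0
    for x in ref + cur:
        # Empirical CDF at x = fraction of samples <= x
        gap = abs(bisect_right(ref, x) / n - bisect_right(cur, x) / m)
        d = max(d, gap)
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic two-sided p-value (approximation, best for large samples)."""
    if d == 0.0:
        return 1.0
    lam = math.sqrt(n * m / (n + m)) * d
    series = sum(
        (-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2)
        for j in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * series))


def detect_drift(reference, current, alpha=0.05):
    """Flag drift when the KS p-value falls below alpha."""
    d = ks_statistic(reference, current)
    p = ks_pvalue(d, len(reference), len(current))
    return {"statistic": d, "p_value": p, "drift": p < alpha}


# Synthetic data: the "shifted" sample has its mean moved by three standard deviations
rng = random.Random(42)
reference = [rng.gauss(0.0, 1.0) for _ in range(500)]
shifted = [rng.gauss(3.0, 1.0) for _ in range(500)]

print(detect_drift(reference, reference)["drift"])  # identical samples: no drift
print(detect_drift(reference, shifted)["drift"])    # large mean shift: drift
```

Running one test per feature at alpha = 0.05 will occasionally flag drift by chance, so a multiple-testing correction may be worth considering when many features are monitored.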
134
+ ## Model Registry
135
+
136
+ The system maintains a SQLite database tracking:
137
+ - Model versions and hyperparameters
138
+ - Performance metrics on validation and test sets
139
+ - Training time and sample counts
140
+ - Production deployment status
141
+ - Drift detection results
142
+
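A registry like the one described above can be sketched with Python's built-in `sqlite3` module. The table layout, column names, and helper functions below are hypothetical and cover only part of what the list mentions (drift results are omitted). The metric values are placeholders, not results from the notebook.

```python
import json
import sqlite3


def init_registry(conn):
    """Create the models table if it does not exist yet."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS models (
            version INTEGER PRIMARY KEY AUTOINCREMENT,
            name TEXT NOT NULL,
            hyperparameters TEXT NOT NULL,
            test_roc_auc REAL,
            training_seconds REAL,
            n_samples INTEGER,
            in_production INTEGER NOT NULL DEFAULT 0
        )
        """
    )


def register_model(conn, name, hyperparameters, test_roc_auc, training_seconds, n_samples):
    """Insert a new model version and return its version number."""
    cur = conn.execute(
        "INSERT INTO models (name, hyperparameters, test_roc_auc, training_seconds, n_samples) "
        "VALUES (?, ?, ?, ?, ?)",
        # Hyperparameters are stored as a JSON string
        (name, json.dumps(hyperparameters), test_roc_auc, training_seconds, n_samples),
    )
    conn.commit()
    return cur.lastrowid


def promote(conn, version):
    """Mark exactly one version as the production model."""
    with conn:  # both updates commit together or roll back together
        conn.execute("UPDATE models SET in_production = 0")
        conn.execute("UPDATE models SET in_production = 1 WHERE version = ?", (version,))


def production_model(conn):
    """Return (version, name, test_roc_auc) of the production model, or None."""
    return conn.execute(
        "SELECT version, name, test_roc_auc FROM models WHERE in_production = 1"
    ).fetchone()


# In-memory database with placeholder values for illustration
conn = sqlite3.connect(":memory:")
init_registry(conn)
v1 = register_model(conn, "xgboost", {"max_depth": 4}, 0.90, 30.0, 1000)
v2 = register_model(conn, "lightgbm", {"num_leaves": 31}, 0.92, 45.0, 1000)
promote(conn, v2)
print(production_model(conn))
```

Passing a file path instead of `":memory:"` to `sqlite3.connect` would persist the registry between runs.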
143
+ ## License
144
+
145
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
146
+
147
+ ## Acknowledgments
148
+
149
+ - IBM Telco Customer Churn dataset (Database Contents License v1.0)
150
+ - Kaggle community for dataset hosting and documentation
151
+ - Open-source libraries and frameworks used in this project
152
+
153
+ ## Contact
154
+
155
+ **Spencer Purdy**
156
+ GitHub: [@SpencerCPurdy](https://github.com/SpencerCPurdy)
157
+
158
+ ---
159
+
160
+ *This is a portfolio project developed to demonstrate machine learning engineering and MLOps capabilities. Performance metrics are based on the specific dataset used and should be validated for any real-world application.*