{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ML Practice Series: Module 26 - End-to-End ML Project (Production Pipeline)\n", "\n", "This is the **FINAL MODULE** and the ultimate test of everything you've learned. You will build a complete, production-ready ML system from scratch that includes:\n", "\n", "### Full Production Workflow:\n", "1. **Problem Definition & Data Collection**\n", "2. **EDA & Statistical Analysis**\n", "3. **Feature Engineering & Selection**\n", "4. **Model Selection & Hyperparameter Tuning**\n", "5. **Model Evaluation & Explainability (SHAP)**\n", "6. **Model Persistence & Deployment**\n", "7. **Monitoring & Documentation**\n", "\n", "### Dataset:\n", "We will use the **Credit Card Fraud Detection** dataset (highly imbalanced, real-world complexity).\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Phase 1: Problem Understanding & Data Loading\n", "\n", "### Business Goal:\n", "Build a model to detect fraudulent credit card transactions to minimize financial losses.\n", "\n", "**Success Metrics**: Precision, Recall, F1-Score (since data is imbalanced)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "from sklearn.model_selection import train_test_split, GridSearchCV\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score\n", "import joblib\n", "\n", "# For this demo, we'll use a simulated dataset\n", "# In production, replace with: pd.read_csv('creditcard.csv')\n", "np.random.seed(42)\n", "df = pd.DataFrame({\n", " 'Amount': np.random.uniform(1, 5000, 1000),\n", " 'Time': np.random.uniform(0, 172800, 1000),\n", " 'V1': np.random.randn(1000),\n", " 'V2': np.random.randn(1000),\n", " 
'Class': np.random.choice([0, 1], 1000, p=[0.95, 0.05])\n", "})\n", "\n", "print(\"Dataset loaded!\")\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Phase 2: Exploratory Data Analysis (EDA)\n", "\n", "### Task 1: Check Class Imbalance\n", "Plot the distribution of fraud vs non-fraud transactions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Click to see Solution\n", "\n", "```python\n", "sns.countplot(x='Class', data=df)\n", "plt.title('Fraud vs Normal Transactions')\n", "plt.show()\n", "print(df['Class'].value_counts())\n", "```\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Phase 3: Feature Engineering\n", "\n", "### Task 2: Scaling & Train-Test Split\n", "1. Scale the `Amount` and `Time` columns\n", "2. Split data (80/20) with stratification" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
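\n", "Click to see Solution\n", "\n", "```python\n", "X = df.drop('Class', axis=1)\n", "y = df['Class']\n", "\n", "# Split first so the test set stays unseen during scaling.\n", "X_train, X_test, y_train, y_test = train_test_split(\n", "    X, y, test_size=0.2, stratify=y, random_state=42\n", ")\n", "X_train, X_test = X_train.copy(), X_test.copy()\n", "\n", "# Fit the scaler on the training data only to avoid data leakage,\n", "# then apply the same transformation to the test data.\n", "scaler = StandardScaler()\n", "cols = ['Amount', 'Time']\n", "X_train[cols] = scaler.fit_transform(X_train[cols])\n", "X_test[cols] = scaler.transform(X_test[cols])\n", "```\n", "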
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Phase 4: Model Training & Hyperparameter Tuning\n", "\n", "### Task 3: GridSearchCV\n", "Use GridSearch to find the best `max_depth` and `n_estimators` for a Random Forest." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Click to see Solution\n", "\n", "```python\n", "# Small grid to keep the search fast; widen it for real data.\n", "param_grid = {\n", "    'n_estimators': [50, 100],\n", "    'max_depth': [10, 20, None]\n", "}\n", "\n", "rf = RandomForestClassifier(random_state=42)\n", "# Optimize F1 on the fraud class, since accuracy is misleading on imbalanced data.\n", "grid = GridSearchCV(rf, param_grid, cv=3, scoring='f1')\n", "grid.fit(X_train, y_train)\n", "\n", "print(\"Best params:\", grid.best_params_)\n", "best_model = grid.best_estimator_\n", "```\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Phase 5: Model Evaluation\n", "\n", "### Task 4: Comprehensive Metrics\n", "Evaluate with Confusion Matrix, Classification Report, and ROC-AUC." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
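\n", "Click to see Solution\n", "\n", "```python\n", "y_pred = best_model.predict(X_test)\n", "# ROC-AUC needs scores, so use the predicted probability of the fraud class.\n", "y_proba = best_model.predict_proba(X_test)[:, 1]\n", "\n", "print(classification_report(y_test, y_pred))\n", "print(\"ROC-AUC:\", roc_auc_score(y_test, y_proba))\n", "\n", "# Rows are actual classes, columns are predicted classes.\n", "sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d')\n", "plt.xlabel('Predicted')\n", "plt.ylabel('Actual')\n", "plt.show()\n", "```\n", "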
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Phase 6: Model Persistence\n", "\n", "### Task 5: Save the Pipeline\n", "Save the scaler and model for production deployment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
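\n", "Click to see Solution\n", "\n", "```python\n", "# Save both artifacts: new transactions must be scaled with the same\n", "# fitted scaler before they are passed to the model.\n", "joblib.dump(best_model, 'fraud_model.pkl')\n", "joblib.dump(scaler, 'scaler.pkl')\n", "print(\"Production artifacts saved!\")\n", "```\n", "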
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "--- \n", "### 🎓 CONGRATULATIONS! \n", "You have completed the **ENTIRE 26-MODULE CURRICULUM**. \n", "\n", "You are now ready to:\n", "- Build production ML systems\n", "- Compete in Kaggle competitions\n", "- Interview for Data Scientist roles\n", "- Deploy models to the real world\n", "\n", "**Your journey has just begun!** 🚀" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 4 }