{
    "cells": [
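        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# ML Practice Series: Module 26 - End-to-End ML Project (Production Pipeline)\n",
                "\n",
                "This is the **final module**, and it brings together everything you've learned. You will build an ML pipeline from scratch, from raw data to saved model artifacts. The hands-on tasks run from data loading through model persistence; SHAP explainability, deployment, and monitoring belong to the full workflow but are not exercised in this notebook.\n",
                "\n",
                "### Full Production Workflow:\n",
                "1. **Problem Definition & Data Collection**\n",
                "2. **EDA & Statistical Analysis**\n",
                "3. **Feature Engineering & Selection**\n",
                "4. **Model Selection & Hyperparameter Tuning**\n",
                "5. **Model Evaluation & Explainability (SHAP)**\n",
                "6. **Model Persistence & Deployment**\n",
                "7. **Monitoring & Documentation**\n",
                "\n",
                "### Dataset:\n",
                "The project targets the **Credit Card Fraud Detection** dataset (highly imbalanced, real-world complexity). So that the notebook runs without a download, the code simulates a small dataset with a subset of its columns (`Time`, `Amount`, `V1`, `V2`, `Class`); load `creditcard.csv` instead to work with the real data.\n",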
                "\n",
                "---"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Phase 1: Problem Understanding & Data Loading\n",
                "\n",
                "### Business Goal:\n",
                "Build a model to detect fraudulent credit card transactions to minimize financial losses.\n",
                "\n",
                "**Success Metrics**: Precision, Recall, and F1-Score. Accuracy is misleading on imbalanced data: a model that never flags fraud would still be about 95% accurate on the simulated dataset used here."
            ]
        },
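        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "To see why accuracy is a poor success metric here, consider a minimal sketch with hypothetical labels (950 normal, 50 fraud; the counts are an assumption chosen to mirror a 95/5 class split). A predictor that never flags fraud looks accurate but catches nothing:\n",
                "\n",
                "```python\n",
                "import numpy as np\n",
                "from sklearn.metrics import accuracy_score, f1_score, recall_score\n",
                "\n",
                "# Hypothetical labels: 950 normal (0) and 50 fraud (1) transactions\n",
                "y_true = np.array([0] * 950 + [1] * 50)\n",
                "\n",
                "# A naive predictor that always outputs the majority class\n",
                "y_naive = np.zeros_like(y_true)\n",
                "\n",
                "acc = accuracy_score(y_true, y_naive)\n",
                "rec = recall_score(y_true, y_naive, zero_division=0)\n",
                "f1 = f1_score(y_true, y_naive, zero_division=0)\n",
                "print(f'accuracy={acc:.2f}  recall={rec:.2f}  f1={f1:.2f}')\n",
                "```\n",
                "\n",
                "This prints `accuracy=0.95  recall=0.00  f1=0.00`, which is why the project is scored on precision, recall, and F1 rather than accuracy."
            ]
        },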
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "import pandas as pd\n",
                "import numpy as np\n",
                "import matplotlib.pyplot as plt\n",
                "import seaborn as sns\n",
                "from sklearn.model_selection import train_test_split, GridSearchCV\n",
                "from sklearn.preprocessing import StandardScaler\n",
                "from sklearn.ensemble import RandomForestClassifier\n",
                "from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score\n",
                "import joblib\n",
                "\n",
                "# For this demo, we simulate a small dataset with about 5% fraud.\n",
                "# Labels are drawn independently of the features, so expect near-chance scores.\n",
                "# In production, replace with: pd.read_csv('creditcard.csv')\n",
                "np.random.seed(42)\n",
                "df = pd.DataFrame({\n",
                "    'Amount': np.random.uniform(1, 5000, 1000),\n",
                "    'Time': np.random.uniform(0, 172800, 1000),\n",
                "    'V1': np.random.randn(1000),\n",
                "    'V2': np.random.randn(1000),\n",
                "    'Class': np.random.choice([0, 1], 1000, p=[0.95, 0.05])\n",
                "})\n",
                "\n",
                "print(\"Dataset loaded!\")\n",
                "df.head()"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Phase 2: Exploratory Data Analysis (EDA)\n",
                "\n",
                "### Task 1: Check Class Imbalance\n",
                "Plot the distribution of fraud vs non-fraud transactions."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# YOUR CODE HERE"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<details>\n",
                "<summary><b>Click to see Solution</b></summary>\n",
                "\n",
                "```python\n",
                "sns.countplot(x='Class', data=df)\n",
                "plt.title('Fraud vs Normal Transactions')\n",
                "plt.show()\n",
                "print(df['Class'].value_counts())\n",
                "```\n",
                "</details>"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Phase 3: Feature Engineering\n",
                "\n",
                "### Task 2: Scaling & Train-Test Split\n",
                "1. Split data (80/20) with stratification\n",
                "2. Scale the `Amount` and `Time` columns, fitting the scaler on the training split only so that no test-set statistics leak into training"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# YOUR CODE HERE"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<details>\n",
                "<summary><b>Click to see Solution</b></summary>\n",
                "\n",
                "```python\n",
                "X = df.drop('Class', axis=1)\n",
                "y = df['Class']\n",
                "\n",
                "# Split first so the scaler never sees test rows (avoids data leakage)\n",
                "X_train, X_test, y_train, y_test = train_test_split(\n",
                "    X, y, test_size=0.2, stratify=y, random_state=42\n",
                ")\n",
                "X_train, X_test = X_train.copy(), X_test.copy()\n",
                "\n",
                "# Fit on the training split, then apply the same transform to the test split\n",
                "scaler = StandardScaler()\n",
                "cols = ['Amount', 'Time']\n",
                "X_train[cols] = scaler.fit_transform(X_train[cols])\n",
                "X_test[cols] = scaler.transform(X_test[cols])\n",
                "```\n",
                "</details>"
            ]
        },
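        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "One way to keep scaling and modelling together is a scikit-learn `Pipeline`. The sketch below is an optional alternative, not the approach the tasks require: it simulates its own data with the notebook's column names (an assumption), and the step names `preprocess`, `scale`, and `model` are illustrative.\n",
                "\n",
                "```python\n",
                "import numpy as np\n",
                "import pandas as pd\n",
                "from sklearn.compose import ColumnTransformer\n",
                "from sklearn.ensemble import RandomForestClassifier\n",
                "from sklearn.model_selection import train_test_split\n",
                "from sklearn.pipeline import Pipeline\n",
                "from sklearn.preprocessing import StandardScaler\n",
                "\n",
                "# Simulated stand-in for the notebook's DataFrame (assumed column layout)\n",
                "rng = np.random.default_rng(42)\n",
                "df = pd.DataFrame({\n",
                "    'Amount': rng.uniform(1, 5000, 1000),\n",
                "    'Time': rng.uniform(0, 172800, 1000),\n",
                "    'V1': rng.standard_normal(1000),\n",
                "    'V2': rng.standard_normal(1000),\n",
                "    'Class': rng.choice([0, 1], 1000, p=[0.95, 0.05]),\n",
                "})\n",
                "X, y = df.drop('Class', axis=1), df['Class']\n",
                "X_train, X_test, y_train, y_test = train_test_split(\n",
                "    X, y, test_size=0.2, stratify=y, random_state=42\n",
                ")\n",
                "\n",
                "# Scale only Amount and Time; pass V1 and V2 through unchanged\n",
                "preprocess = ColumnTransformer(\n",
                "    [('scale', StandardScaler(), ['Amount', 'Time'])],\n",
                "    remainder='passthrough',\n",
                ")\n",
                "pipe = Pipeline([\n",
                "    ('preprocess', preprocess),\n",
                "    ('model', RandomForestClassifier(n_estimators=50, random_state=42)),\n",
                "])\n",
                "\n",
                "# fit() learns the scaling statistics from X_train only\n",
                "pipe.fit(X_train, y_train)\n",
                "print(pipe.predict(X_test[:5]))\n",
                "```\n",
                "\n",
                "Because the scaler lives inside the pipeline, cross-validation refits it on each training fold, and a single `joblib.dump(pipe, ...)` would persist preprocessing and model together."
            ]
        },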
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Phase 4: Model Training & Hyperparameter Tuning\n",
                "\n",
                "### Task 3: GridSearchCV\n",
                "Use `GridSearchCV` to find the best `max_depth` and `n_estimators` for a Random Forest, scoring each candidate by F1."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# YOUR CODE HERE"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<details>\n",
                "<summary><b>Click to see Solution</b></summary>\n",
                "\n",
                "```python\n",
                "param_grid = {\n",
                "    'n_estimators': [50, 100],\n",
                "    'max_depth': [10, 20, None]\n",
                "}\n",
                "\n",
                "rf = RandomForestClassifier(random_state=42)\n",
                "grid = GridSearchCV(rf, param_grid, cv=3, scoring='f1')\n",
                "grid.fit(X_train, y_train)\n",
                "\n",
                "print(\"Best params:\", grid.best_params_)\n",
                "best_model = grid.best_estimator_\n",
                "```\n",
                "</details>"
            ]
        },
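        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Beyond `best_params_`, it can help to look at every candidate's cross-validated score. The following is a minimal sketch on simulated data (an assumption: 5% positives, with fraud rows shifted so there is some signal to learn); the explicit shuffled `StratifiedKFold` is an illustrative choice, not something the task requires.\n",
                "\n",
                "```python\n",
                "import numpy as np\n",
                "import pandas as pd\n",
                "from sklearn.ensemble import RandomForestClassifier\n",
                "from sklearn.model_selection import GridSearchCV, StratifiedKFold\n",
                "\n",
                "# Simulated features and imbalanced labels (assumption: 5% positives)\n",
                "rng = np.random.default_rng(0)\n",
                "X = rng.standard_normal((400, 4))\n",
                "y = np.array([1] * 20 + [0] * 380)\n",
                "X[y == 1] += 1.5  # shift fraud rows so there is some signal to learn\n",
                "\n",
                "# Shuffled, stratified folds keep the fraud rate similar in every fold\n",
                "cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)\n",
                "grid = GridSearchCV(\n",
                "    RandomForestClassifier(random_state=42),\n",
                "    {'n_estimators': [50, 100], 'max_depth': [10, None]},\n",
                "    cv=cv,\n",
                "    scoring='f1',\n",
                ")\n",
                "grid.fit(X, y)\n",
                "\n",
                "# One row per parameter combination, best-ranked first\n",
                "results = pd.DataFrame(grid.cv_results_)[\n",
                "    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']\n",
                "].sort_values('rank_test_score')\n",
                "print(results.to_string(index=False))\n",
                "```\n",
                "\n",
                "If the top candidates differ by less than their `std_test_score`, the simpler or cheaper configuration is usually a reasonable pick."
            ]
        },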
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Phase 5: Model Evaluation\n",
                "\n",
                "### Task 4: Comprehensive Metrics\n",
                "Evaluate with Confusion Matrix, Classification Report, and ROC-AUC."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# YOUR CODE HERE"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<details>\n",
                "<summary><b>Click to see Solution</b></summary>\n",
                "\n",
                "```python\n",
                "y_pred = best_model.predict(X_test)\n",
                "\n",
                "print(classification_report(y_test, y_pred))\n",
                "print(\"ROC-AUC:\", roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1]))\n",
                "\n",
                "sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d')\n",
                "plt.show()\n",
                "```\n",
                "</details>"
            ]
        },
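        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "On imbalanced data, the precision-recall curve is often more informative than ROC-AUC, and the default 0.5 decision threshold is rarely the best one. The sketch below uses simulated data with some signal (an assumption, not the real dataset) and picks the threshold that maximises F1. In practice the threshold would be chosen on a separate validation split rather than on the test set.\n",
                "\n",
                "```python\n",
                "import numpy as np\n",
                "from sklearn.ensemble import RandomForestClassifier\n",
                "from sklearn.metrics import average_precision_score, precision_recall_curve\n",
                "from sklearn.model_selection import train_test_split\n",
                "\n",
                "# Simulated data with some signal (assumption, not the real dataset)\n",
                "rng = np.random.default_rng(1)\n",
                "X = rng.standard_normal((2000, 4))\n",
                "y = (rng.random(2000) < 0.05).astype(int)\n",
                "X[y == 1] += 1.0\n",
                "\n",
                "X_train, X_test, y_train, y_test = train_test_split(\n",
                "    X, y, test_size=0.25, stratify=y, random_state=42\n",
                ")\n",
                "model = RandomForestClassifier(n_estimators=100, random_state=42)\n",
                "model.fit(X_train, y_train)\n",
                "scores = model.predict_proba(X_test)[:, 1]\n",
                "\n",
                "# Average precision summarises the precision-recall curve in one number\n",
                "pr_auc = average_precision_score(y_test, scores)\n",
                "print('PR-AUC:', round(pr_auc, 3))\n",
                "\n",
                "# precision/recall have one more entry than thresholds; drop the last point\n",
                "precision, recall, thresholds = precision_recall_curve(y_test, scores)\n",
                "f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)\n",
                "best = int(np.argmax(f1))\n",
                "print(f'threshold={thresholds[best]:.2f}  precision={precision[best]:.2f}  recall={recall[best]:.2f}')\n",
                "```\n",
                "\n",
                "To apply the chosen threshold, compare the predicted probabilities against it, for example `(scores >= thresholds[best]).astype(int)`, instead of calling `predict`."
            ]
        },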
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Phase 6: Model Persistence\n",
                "\n",
                "### Task 5: Save the Pipeline\n",
                "Save the scaler and model for production deployment."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# YOUR CODE HERE"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<details>\n",
                "<summary><b>Click to see Solution</b></summary>\n",
                "\n",
                "```python\n",
                "joblib.dump(best_model, 'fraud_model.pkl')\n",
                "joblib.dump(scaler, 'scaler.pkl')\n",
                "print(\"Production artifacts saved!\")\n",
                "```\n",
                "</details>"
            ]
        },
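        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Saved artifacts are only useful if the serving code applies the same preprocessing at inference time. The sketch below is self-contained: it trains tiny stand-in artifacts on simulated data, writes them to a temporary directory (the paths and the example transaction are hypothetical), then reloads them and scores one new transaction.\n",
                "\n",
                "```python\n",
                "import os\n",
                "import tempfile\n",
                "\n",
                "import joblib\n",
                "import numpy as np\n",
                "import pandas as pd\n",
                "from sklearn.ensemble import RandomForestClassifier\n",
                "from sklearn.preprocessing import StandardScaler\n",
                "\n",
                "# Train and save tiny stand-in artifacts (simulated data, hypothetical paths)\n",
                "rng = np.random.default_rng(42)\n",
                "train = pd.DataFrame({\n",
                "    'Amount': rng.uniform(1, 5000, 500),\n",
                "    'Time': rng.uniform(0, 172800, 500),\n",
                "    'V1': rng.standard_normal(500),\n",
                "    'V2': rng.standard_normal(500),\n",
                "})\n",
                "labels = rng.choice([0, 1], 500, p=[0.95, 0.05])\n",
                "cols = ['Amount', 'Time']\n",
                "scaler = StandardScaler().fit(train[cols])\n",
                "train[cols] = scaler.transform(train[cols])\n",
                "model = RandomForestClassifier(n_estimators=50, random_state=42).fit(train, labels)\n",
                "\n",
                "out_dir = tempfile.mkdtemp()\n",
                "joblib.dump(model, os.path.join(out_dir, 'fraud_model.pkl'))\n",
                "joblib.dump(scaler, os.path.join(out_dir, 'scaler.pkl'))\n",
                "\n",
                "# Inference side: load both artifacts and apply the same preprocessing\n",
                "loaded_model = joblib.load(os.path.join(out_dir, 'fraud_model.pkl'))\n",
                "loaded_scaler = joblib.load(os.path.join(out_dir, 'scaler.pkl'))\n",
                "\n",
                "new_tx = pd.DataFrame([{'Amount': 250.0, 'Time': 3600.0, 'V1': 0.1, 'V2': -0.3}])\n",
                "new_tx[cols] = loaded_scaler.transform(new_tx[cols])  # transform only, never fit\n",
                "fraud_probability = loaded_model.predict_proba(new_tx)[0, 1]\n",
                "print(f'Fraud probability: {fraud_probability:.3f}')\n",
                "```\n",
                "\n",
                "Note that the new transaction keeps the training column order, and that `joblib` files should only be loaded from trusted sources and with a compatible scikit-learn version."
            ]
        },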
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "---\n",
                "### 🎓 CONGRATULATIONS!\n",
                "You have completed the **ENTIRE 26-MODULE CURRICULUM**.\n",
                "\n",
                "You are now ready to:\n",
                "- Build production ML systems\n",
                "- Compete in Kaggle competitions\n",
                "- Interview for Data Scientist roles\n",
                "- Deploy models to the real world\n",
                "\n",
                "**Your journey has just begun!** 🚀"
            ]
        }
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "Python 3",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.12.7"
        }
    },
    "nbformat": 4,
    "nbformat_minor": 4
}