diff --git "a/1_Churn_Data_Creation_and_Processing.ipynb" "b/1_Churn_Data_Creation_and_Processing.ipynb" new file mode 100644--- /dev/null +++ "b/1_Churn_Data_Creation_and_Processing.ipynb" @@ -0,0 +1,2342 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "pAyn432WulFm" + }, + "source": [ + "# πŸ” Notebook 1: Churn Data Creation and Processing\n", + "## AI for Big Data Management β€” ESCP Business School\n", + "### Final Group Project\n", + "\n", + "---\n", + "\n", + "## πŸ“Œ Problem Statement\n", + "> **\"How can a company predict customer churn based on support interactions and proactively adapt its retention strategy?\"**\n", + "\n", + "\n", + "- We aim to predict and understand customer churn by combining structured telecom data with synthetic behavioral signals derived from customer support interactions.\n", + "---\n", + "\n", + "## πŸ—ΊοΈ Project Pipeline\n", + "```\n", + "PROBLEM CREATION β†’ REAL-WORLD DATA PROCESSING β†’ SYNTHETIC DATASET GENERATION β†’ AUTOMATION β†’ WRAP-UP\n", + "```\n", + "\n", + "---\n", + "\n", + "## πŸ“‹ What This Notebook Does\n", + "1. **[REAL-WORLD]** Loads the Telco Customer Churn dataset from Kaggle\n", + "2. **[REAL-WORLD]** Cleans and preprocesses the real data\n", + "3. **[SYNTHETIC]** Generates realistic support interaction variables\n", + "4. **[SYNTHETIC]** Creates a merged, enriched final dataset\n", + "5. Exports `customer_churn_support_dataset.csv` for Notebook 2\n", + "\n", + "---\n", + "\n", + "### ⚠️ Before Running\n", + "You need to upload one file:\n", + "- `WA_Fn-UseC_-Telco-Customer-Churn.csv` (downloaded from Kaggle)\n", + "\n", + "Upload it using the πŸ“ Files panel on the left sidebar in Google Colab.\n", + "All other data is generated synthetically in this notebook." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "acacAe8GulFp" + }, + "source": [ + "---\n", + "## πŸ“¦ SECTION 1: Install & Import Libraries\n", + "Run this cell first. It installs VADER for sentiment analysis and imports all necessary libraries." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "p5lEOq-1ulFq", + "outputId": "09999301-c22f-49db-896f-5344d4c0322d" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "βœ… All libraries imported successfully!\n", + " pandas : 2.2.2\n", + " numpy : 2.0.2\n" + ] + } + ], + "source": [ + "# ── Install required packages ──────────────────────────────────────────────────\n", + "!pip install vaderSentiment --quiet\n", + "\n", + "# ── Standard imports ──────────────────────────────────────────────────────────\n", + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "import random\n", + "from datetime import datetime, timedelta\n", + "from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer\n", + "\n", + "# ── Settings ──────────────────────────────────────────────────────────────────\n", + "np.random.seed(42) # reproducibility\n", + "random.seed(42)\n", + "pd.set_option('display.max_columns', None)\n", + "\n", + "print('βœ… All libraries imported successfully!')\n", + "print(f' pandas : {pd.__version__}')\n", + "print(f' numpy : {np.__version__}')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hrZYSJs4ulFr" + }, + "source": [ + "---\n", + "## πŸ“₯ SECTION 2: Load the Real-World Dataset\n", + "\n", + "### [REAL-WORLD DATA PROCESSING]\n", + "\n", + "**Where to download the dataset:**\n", + "1. Go to: https://www.kaggle.com/datasets/blastchar/telco-customer-churn\n", + "2. Click the **Download** button (top right)\n", + "3. Unzip the file β€” you will get: `WA_Fn-UseC_-Telco-Customer-Churn.csv`\n", + "4. In Google Colab, click the πŸ“ folder icon on the left sidebar\n", + "5. Click the ⬆️ Upload button and select the CSV file\n", + "6. Wait for the upload to finish, then run the cell below\n", + "\n", + "**Dataset info:** IBM Telco Customer Churn β€” 7,043 real customers with billing, contract, and service information." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 382 + }, + "id": "2BkVJSBzulFr", + "outputId": "a11033cd-94ba-4ba8-9387-edfdb4ca27bb" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "βœ… Dataset loaded successfully!\n", + " Shape: 7043 rows Γ— 21 columns\n", + "\n", + "πŸ“Š First 5 rows:\n" + ] + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + " customerID gender SeniorCitizen Partner Dependents tenure PhoneService \\\n", + "0 7590-VHVEG Female 0 Yes No 1 No \n", + "1 5575-GNVDE Male 0 No No 34 Yes \n", + "2 3668-QPYBK Male 0 No No 2 Yes \n", + "3 7795-CFOCW Male 0 No No 45 No \n", + "4 9237-HQITU Female 0 No No 2 Yes \n", + "\n", + " MultipleLines InternetService OnlineSecurity OnlineBackup \\\n", + "0 No phone service DSL No Yes \n", + "1 No DSL Yes No \n", + "2 No DSL Yes Yes \n", + "3 No phone service DSL Yes No \n", + "4 No Fiber optic No No \n", + "\n", + " DeviceProtection TechSupport StreamingTV StreamingMovies Contract \\\n", + "0 No No No No Month-to-month \n", + "1 Yes No No No One year \n", + "2 No No No No Month-to-month \n", + "3 Yes Yes No No One year \n", + "4 No No No No Month-to-month \n", + "\n", + " PaperlessBilling PaymentMethod MonthlyCharges TotalCharges \\\n", + "0 Yes Electronic check 29.85 29.85 \n", + "1 No Mailed check 56.95 1889.5 \n", + "2 Yes Mailed check 53.85 108.15 \n", + "3 No Bank transfer (automatic) 42.30 1840.75 \n", + "4 Yes Electronic check 70.70 151.65 \n", + "\n", + " Churn \n", + "0 No \n", + "1 No \n", + "2 Yes \n", + "3 No \n", + "4 Yes " + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
customerIDgenderSeniorCitizenPartnerDependentstenurePhoneServiceMultipleLinesInternetServiceOnlineSecurityOnlineBackupDeviceProtectionTechSupportStreamingTVStreamingMoviesContractPaperlessBillingPaymentMethodMonthlyChargesTotalChargesChurn
07590-VHVEGFemale0YesNo1NoNo phone serviceDSLNoYesNoNoNoNoMonth-to-monthYesElectronic check29.8529.85No
15575-GNVDEMale0NoNo34YesNoDSLYesNoYesNoNoNoOne yearNoMailed check56.951889.5No
23668-QPYBKMale0NoNo2YesNoDSLYesYesNoNoNoNoMonth-to-monthYesMailed check53.85108.15Yes
37795-CFOCWMale0NoNo45NoNo phone serviceDSLYesNoYesYesNoNoOne yearNoBank transfer (automatic)42.301840.75No
49237-HQITUFemale0NoNo2YesNoFiber opticNoNoNoNoNoNoMonth-to-monthYesElectronic check70.70151.65Yes
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe" + } + }, + "metadata": {} + } + ], + "source": [ + "# ── Load the real-world Telco Churn dataset ────────────────────────────────────\n", + "# If you renamed your file differently, change the filename below\n", + "DATASET_FILENAME = 'WA_Fn-UseC_-Telco-Customer-Churn.csv'\n", + "\n", + "try:\n", + " df_real = pd.read_csv(DATASET_FILENAME)\n", + " print(f'βœ… Dataset loaded successfully!')\n", + " print(f' Shape: {df_real.shape[0]} rows Γ— {df_real.shape[1]} columns')\n", + " print(f'\\nπŸ“Š First 5 rows:')\n", + " display(df_real.head())\n", + "except FileNotFoundError:\n", + " print('❌ ERROR: File not found!')\n", + " print(' Please upload the CSV file to Colab (see instructions above).')\n", + " print(' File expected:', DATASET_FILENAME)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "0ZxflowUulFr", + "outputId": "d7356987-2ce1-4eff-cf0e-8263f2c1942d" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "πŸ“‹ Column names:\n", + "['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn']\n", + "\n", + "πŸ“‹ Data types:\n", + "customerID object\n", + "gender object\n", + "SeniorCitizen int64\n", + "Partner object\n", + "Dependents object\n", + "tenure int64\n", + "PhoneService object\n", + "MultipleLines object\n", + "InternetService object\n", + "OnlineSecurity object\n", + "OnlineBackup object\n", + "DeviceProtection object\n", + "TechSupport object\n", + "StreamingTV object\n", + "StreamingMovies object\n", + "Contract object\n", + "PaperlessBilling object\n", + "PaymentMethod object\n", + "MonthlyCharges float64\n", + "TotalCharges object\n", + "Churn object\n", + "dtype: object\n", + "\n", + "πŸ“‹ Missing values per column:\n", + "customerID 0\n", + "gender 0\n", + "SeniorCitizen 0\n", + "Partner 0\n", + "Dependents 0\n", + "tenure 0\n", + "PhoneService 0\n", + "MultipleLines 0\n", + "InternetService 0\n", + "OnlineSecurity 0\n", + "OnlineBackup 0\n", + "DeviceProtection 0\n", + "TechSupport 0\n", + "StreamingTV 0\n", + "StreamingMovies 0\n", + "Contract 0\n", + "PaperlessBilling 0\n", + "PaymentMethod 0\n", + "MonthlyCharges 0\n", + "TotalCharges 0\n", + "Churn 0\n", + "dtype: int64\n", + "\n", + "πŸ“‹ Churn distribution (real data):\n", + "Churn\n", + "No 5174\n", + "Yes 1869\n", + "Name: count, dtype: int64\n", + "\n", + "πŸ“‹ Churn rate (real data): 26.54 %\n" + ] + } + ], + "source": [ + "# ── Basic exploration of real dataset ─────────────────────────────────────────\n", + "print('πŸ“‹ Column names:')\n", + "print(df_real.columns.tolist())\n", + "print('\\nπŸ“‹ Data types:')\n", + "print(df_real.dtypes)\n", + "print('\\nπŸ“‹ Missing values per column:')\n", + "print(df_real.isnull().sum())\n", + "print('\\nπŸ“‹ Churn distribution (real data):')\n", + "print(df_real['Churn'].value_counts())\n", + "print('\\nπŸ“‹ Churn rate (real data):', round(df_real['Churn'].value_counts(normalize=True)['Yes'] * 100, 2), '%')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Qtj_34hvulFr" + }, + "source": [ + "---\n", + "## 🧹 SECTION 3: Real-World Data Cleaning\n", + "\n", + "### [REAL-WORLD DATA PROCESSING β€” continued]\n", + "\n", + "This section handles:\n", + "- Converting `TotalCharges` to numeric (it arrives as a string with spaces)\n", + "- Filling missing values\n", + "- Encoding the target variable `Churn` as 0/1\n", + "- Selecting the columns we need" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 729 + }, + "id": "jpJKPzueulFs", + "outputId": "99e08519-c0e8-495c-eb84-09af613c0a4c" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "βœ… TotalCharges NaN values filled with median: 1397.47\n", + "βœ… Churn encoded: Yes=1, No=0\n", + "\n", + "βœ… SeniorCitizen unique values: [0 1]\n", + "\n", + "βœ… Cleaned dataset shape: (7043, 14)\n", + " Missing values after cleaning:\n", + "customerID 0\n", + "gender 0\n", + "SeniorCitizen 0\n", + "Partner 0\n", + "Dependents 0\n", + "tenure 0\n", + "Contract 0\n", + "PaymentMethod 0\n", + "MonthlyCharges 0\n", + "TotalCharges 0\n", + "InternetService 0\n", + "TechSupport 0\n", + "Churn 0\n", + "Churn_binary 0\n", + "dtype: int64\n" + ] + }, + { + "output_type": "stream", + "name": "stderr", + "text": [ + "/tmp/ipykernel_2585/1930681133.py:6: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.\n", + "The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.\n", + "\n", + "For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.\n", + "\n", + "\n", + " df_real['TotalCharges'].fillna(median_total, inplace=True)\n" + ] + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + " customerID gender SeniorCitizen Partner Dependents tenure \\\n", + "0 7590-VHVEG Female 0 Yes No 1 \n", + "1 5575-GNVDE Male 0 No No 34 \n", + "2 3668-QPYBK Male 0 No No 2 \n", + "3 7795-CFOCW Male 0 No No 45 \n", + "4 9237-HQITU Female 0 No No 2 \n", + "\n", + " Contract PaymentMethod MonthlyCharges TotalCharges \\\n", + "0 Month-to-month Electronic check 29.85 29.85 \n", + "1 One year Mailed check 56.95 1889.50 \n", + "2 Month-to-month Mailed check 53.85 108.15 \n", + "3 One year Bank transfer (automatic) 42.30 1840.75 \n", + "4 Month-to-month Electronic check 70.70 151.65 \n", + "\n", + " InternetService TechSupport Churn Churn_binary \n", + "0 DSL No No 0 \n", + "1 DSL No No 0 \n", + "2 DSL No Yes 1 \n", + "3 DSL Yes No 0 \n", + "4 Fiber optic No Yes 1 " + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
customerIDgenderSeniorCitizenPartnerDependentstenureContractPaymentMethodMonthlyChargesTotalChargesInternetServiceTechSupportChurnChurn_binary
07590-VHVEGFemale0YesNo1Month-to-monthElectronic check29.8529.85DSLNoNo0
15575-GNVDEMale0NoNo34One yearMailed check56.951889.50DSLNoNo0
23668-QPYBKMale0NoNo2Month-to-monthMailed check53.85108.15DSLNoYes1
37795-CFOCWMale0NoNo45One yearBank transfer (automatic)42.301840.75DSLYesNo0
49237-HQITUFemale0NoNo2Month-to-monthElectronic check70.70151.65Fiber opticNoYes1
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"display(df_clean\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"customerID\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"5575-GNVDE\",\n \"9237-HQITU\",\n \"3668-QPYBK\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"gender\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"Male\",\n \"Female\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"SeniorCitizen\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 0,\n \"num_unique_values\": 1,\n \"samples\": [\n 0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Partner\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"No\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Dependents\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"No\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"tenure\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 21,\n \"min\": 1,\n \"max\": 45,\n \"num_unique_values\": 4,\n \"samples\": [\n 34\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Contract\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"One year\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"PaymentMethod\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Electronic check\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"MonthlyCharges\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 15.445573799635934,\n \"min\": 29.85,\n \"max\": 70.7,\n \"num_unique_values\": 5,\n \"samples\": [\n 56.95\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"TotalCharges\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 969.8243111512518,\n \"min\": 29.85,\n \"max\": 1889.5,\n \"num_unique_values\": 5,\n \"samples\": [\n 1889.5\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"InternetService\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"Fiber optic\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"TechSupport\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"Yes\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Churn\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"Yes\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Churn_binary\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 1,\n \"num_unique_values\": 2,\n \"samples\": [\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {} + } + ], + "source": [ + "# ── Step 1: Fix TotalCharges column (arrives as string) ──────────────────────\n", + "df_real['TotalCharges'] = pd.to_numeric(df_real['TotalCharges'], errors='coerce')\n", + "\n", + "# ── Step 2: Fill missing TotalCharges with median ────────────────────────────\n", + "median_total = df_real['TotalCharges'].median()\n", + "df_real['TotalCharges'].fillna(median_total, inplace=True)\n", + "print(f'βœ… TotalCharges NaN values filled with median: {median_total:.2f}')\n", + "\n", + "# ── Step 3: Encode Churn as 0 / 1 ────────────────────────────────────────────\n", + "df_real['Churn_binary'] = df_real['Churn'].map({'Yes': 1, 'No': 0})\n", + "print(f'βœ… Churn encoded: Yes=1, No=0')\n", + "\n", + "# ── Step 4: Encode SeniorCitizen (already 0/1, but verify) ───────────────────\n", + "print(f'\\nβœ… SeniorCitizen unique values: {df_real[\"SeniorCitizen\"].unique()}')\n", + "\n", + "# ── Step 5: Select core columns for our analysis ─────────────────────────────\n", + "core_cols = [\n", + " 'customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',\n", + " 'tenure', 'Contract', 'PaymentMethod', 'MonthlyCharges',\n", + " 'TotalCharges', 'InternetService', 'TechSupport',\n", + " 'Churn', 'Churn_binary'\n", + "]\n", + "df_clean = df_real[core_cols].copy()\n", + "\n", + "print(f'\\nβœ… Cleaned dataset shape: {df_clean.shape}')\n", + "print(f' Missing values after cleaning:')\n", + "print(df_clean.isnull().sum())\n", + "display(df_clean.head())" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 317 + }, + "id": "ySxvvy5hulFs", + "outputId": "5c552cda-2f7f-4088-e11e-0d07e0886538" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "πŸ“Š Descriptive statistics (numeric columns):\n" + ] + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + " SeniorCitizen tenure MonthlyCharges TotalCharges Churn_binary\n", + "count 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000\n", + "mean 0.162147 32.371149 64.761692 2281.916928 0.265370\n", + "std 0.368612 24.559481 30.090047 2265.270398 0.441561\n", + "min 0.000000 0.000000 18.250000 18.800000 0.000000\n", + "25% 0.000000 9.000000 35.500000 402.225000 0.000000\n", + "50% 0.000000 29.000000 70.350000 1397.475000 0.000000\n", + "75% 0.000000 55.000000 89.850000 3786.600000 1.000000\n", + "max 1.000000 72.000000 118.750000 8684.800000 1.000000" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
SeniorCitizentenureMonthlyChargesTotalChargesChurn_binary
count7043.0000007043.0000007043.0000007043.0000007043.000000
mean0.16214732.37114964.7616922281.9169280.265370
std0.36861224.55948130.0900472265.2703980.441561
min0.0000000.00000018.25000018.8000000.000000
25%0.0000009.00000035.500000402.2250000.000000
50%0.00000029.00000070.3500001397.4750000.000000
75%0.00000055.00000089.8500003786.6000001.000000
max1.00000072.000000118.7500008684.8000001.000000
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"display(df_clean\",\n \"rows\": 8,\n \"fields\": [\n {\n \"column\": \"SeniorCitizen\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2489.9992387084,\n \"min\": 0.0,\n \"max\": 7043.0,\n \"num_unique_values\": 5,\n \"samples\": [\n 0.1621468124378816,\n 1.0,\n 0.36861160561002687\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"tenure\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2478.9752758409018,\n \"min\": 0.0,\n \"max\": 7043.0,\n \"num_unique_values\": 8,\n \"samples\": [\n 32.37114865824223,\n 29.0,\n 7043.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"MonthlyCharges\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2468.7047672837775,\n \"min\": 18.25,\n \"max\": 7043.0,\n \"num_unique_values\": 8,\n \"samples\": [\n 64.76169246059918,\n 70.35,\n 7043.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"TotalCharges\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 3119.0484860242914,\n \"min\": 18.8,\n \"max\": 8684.8,\n \"num_unique_values\": 8,\n \"samples\": [\n 2281.9169281556156,\n 1397.475,\n 7043.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Churn_binary\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2489.939844235915,\n \"min\": 0.0,\n \"max\": 7043.0,\n \"num_unique_values\": 5,\n \"samples\": [\n 0.2653698707936959,\n 1.0,\n 0.44156130512195013\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {} + } + ], + "source": [ + "# ── Descriptive statistics on real data ──────────────────────────────────────\n", + "print('πŸ“Š Descriptive statistics (numeric columns):')\n", + "display(df_clean.describe())" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RpDjH_T6ulFs" + }, + "source": [ + "---\n", + "## πŸ€– SECTION 4: Synthetic Support Interaction Data Generation\n", + "\n", + "### [SYNTHETIC DATASET GENERATION]\n", + "\n", + "**Why synthetic data?** \n", + "Real telecom datasets do not include detailed support call logs. We simulate realistic support interaction variables that are **statistically correlated** with churn β€” just as a real company's CRM data would show.\n", + "\n", + "**Variables we create:**\n", + "| Variable | Description |\n", + "|---|---|\n", + "| `support_calls` | Number of support calls made in the last 6 months |\n", + "| `avg_call_duration` | Average call duration in minutes |\n", + "| `complaint_type` | Type of most frequent complaint |\n", + "| `days_since_last_contact` | Days since the customer last contacted support |\n", + "| `last_contact_sentiment` | Text of the last customer feedback |\n", + "| `sentiment_score` | VADER compound sentiment score |\n", + "| `support_churn_risk` | Composite risk score (Low / Medium / High) |" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "NK03tdXdulFs", + "outputId": "6a003b55-cad7-4b8e-9ddb-426a7d5f9135" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "βœ… Feedback templates and complaint types defined.\n" + ] + } + ], + "source": [ + "# ── Helper: realistic sentiment phrases per churn status ─────────────────────\n", + "POSITIVE_FEEDBACK = [\n", + " \"The support agent was very helpful and solved my issue quickly.\",\n", + " \"Great service, no complaints at all!\",\n", + " \"Fast resolution. I am happy with the service.\",\n", + " \"The team was professional and friendly. Very satisfied.\",\n", + " \"Everything was resolved in one call. Excellent experience.\",\n", + " \"I love this company, always responsive and caring.\",\n", + " \"No issues, service works perfectly. Very happy customer.\"\n", + "]\n", + "\n", + "NEUTRAL_FEEDBACK = [\n", + " \"The wait time was long but the issue was eventually resolved.\",\n", + " \"Average experience. Could be better.\",\n", + " \"Service is okay. Nothing special.\",\n", + " \"The agent was polite but the problem took two calls to fix.\",\n", + " \"Acceptable support, but I expected faster resolution.\"\n", + "]\n", + "\n", + "NEGATIVE_FEEDBACK = [\n", + " \"I have called five times and the problem is still not fixed!\",\n", + " \"Terrible service. I am thinking of switching providers.\",\n", + " \"The agents are unhelpful and the wait times are ridiculous.\",\n", + " \"I am very frustrated. Nobody seems to care about my problem.\",\n", + " \"Worst customer service I have ever experienced. Cancelling soon.\",\n", + " \"My bill is wrong again. This is the third time this month!\",\n", + " \"I am extremely disappointed. No follow-up, no resolution.\"\n", + "]\n", + "\n", + "COMPLAINT_TYPES = [\n", + " 'Billing Issue', 'Service Outage', 'Speed/Performance',\n", + " 'Contract Dispute', 'Technical Failure', 'Overcharge'\n", + "]\n", + "\n", + "print('βœ… Feedback templates and complaint types defined.')" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "EIzwa0U3ulFs", + "outputId": "0aa0c944-1e0d-4df3-cc48-15acc7a590ae" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "βœ… Synthetic support variables generated!\n", + " support_calls range : 1 – 15\n", + " avg_call_duration range: 3.0 – 35.0 min\n", + " Sample complaint types : {'Contract Dispute', 'Billing Issue', 'Technical Failure', 'Service Outage'}\n" + ] + } + ], + "source": [ + "# ── Generate synthetic support variables ──────────────────────────────────────\n", + "n = len(df_clean)\n", + "churn_flag = df_clean['Churn_binary'].values\n", + "\n", + "# support_calls: churners call more (5-15 calls), non-churners call less (1-6)\n", + "support_calls = np.where(\n", + " churn_flag == 1,\n", + " np.random.randint(5, 16, n),\n", + " np.random.randint(1, 7, n)\n", + ")\n", + "\n", + "# avg_call_duration: churners have longer calls (frustration)\n", + "avg_call_duration = np.where(\n", + " churn_flag == 1,\n", + " np.round(np.random.uniform(12, 35, n), 1),\n", + " np.round(np.random.uniform(3, 15, n), 1)\n", + ")\n", + "\n", + "# complaint_type: random, but churners have heavier billing/contract issues\n", + "def pick_complaint(is_churner):\n", + " if is_churner:\n", + " # weight toward billing and contract disputes\n", + " weights = [0.30, 0.15, 0.15, 0.20, 0.10, 0.10]\n", + " else:\n", + " weights = [0.20, 0.20, 0.20, 0.10, 0.20, 0.10]\n", + " return random.choices(COMPLAINT_TYPES, weights=weights, k=1)[0]\n", + "\n", + "complaint_type = [pick_complaint(c) for c in churn_flag]\n", + "\n", + "# days_since_last_contact: churners contacted recently (about to leave)\n", + "days_since_last_contact = np.where(\n", + " churn_flag == 1,\n", + " np.random.randint(1, 30, n),\n", + " np.random.randint(15, 90, n)\n", + ")\n", + "\n", + "# last_contact_sentiment: text phrase matching churn likelihood\n", + "def pick_sentiment_text(is_churner):\n", + " if is_churner:\n", + " # 70% negative, 20% neutral, 10% positive\n", + " pool = (NEGATIVE_FEEDBACK * 7) + (NEUTRAL_FEEDBACK * 2) + (POSITIVE_FEEDBACK * 1)\n", + " else:\n", + " # 10% negative, 20% neutral, 70% positive\n", + " pool = (POSITIVE_FEEDBACK * 7) + (NEUTRAL_FEEDBACK * 2) + (NEGATIVE_FEEDBACK * 1)\n", + " return random.choice(pool)\n", + "\n", + "last_contact_sentiment = [pick_sentiment_text(c) for c in churn_flag]\n", + "\n", + "print('βœ… Synthetic support variables generated!')\n", + "print(f' support_calls range : {support_calls.min()} – {support_calls.max()}')\n", + "print(f' avg_call_duration range: {avg_call_duration.min()} – {avg_call_duration.max()} min')\n", + "print(f' Sample complaint types : {set(complaint_type[:10])}')" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Yak2n7gpulFt", + "outputId": "ff972046-295c-44c0-c4bb-e0d6c59cb165" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "βœ… VADER sentiment scores computed!\n", + " Score range: -0.710 to 0.872\n", + " Mean score : 0.307\n" + ] + } + ], + "source": [ + "# ── Compute VADER sentiment score ─────────────────────────────────────────────\n", + "analyzer = SentimentIntensityAnalyzer()\n", + "\n", + "def get_compound_score(text):\n", + " \"\"\"Return VADER compound score: -1 (most negative) to +1 (most positive)\"\"\"\n", + " return analyzer.polarity_scores(text)['compound']\n", + "\n", + "sentiment_score = [get_compound_score(text) for text in last_contact_sentiment]\n", + "\n", + "print('βœ… VADER sentiment scores computed!')\n", + "print(f' Score range: {min(sentiment_score):.3f} to {max(sentiment_score):.3f}')\n", + "print(f' Mean score : {np.mean(sentiment_score):.3f}')" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "KJ3wGj9mulFt", + "outputId": "6b8e8fd1-6eaf-4f77-8e37-6218bea1e414" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "βœ… support_churn_risk categories created!\n", + " Distribution: Counter({'Medium': 4166, 'Low': 1471, 'High': 1406})\n" + ] + } + ], + "source": [ + "# ── Compute composite support_churn_risk ──────────────────────────────────────\n", + "# Logic:\n", + "# HIGH = many calls (>=6) AND negative sentiment (<= -0.3)\n", + "# MEDIUM = moderate calls (3-5) OR somewhat negative (-0.3 to 0)\n", + "# LOW = everything else\n", + "\n", + "def compute_risk(calls, score):\n", + " if calls >= 6 and score <= -0.3:\n", + " return 'High'\n", + " elif calls >= 3 or score <= 0.0:\n", + " return 'Medium'\n", + " else:\n", + " return 'Low'\n", + "\n", + "support_churn_risk = [\n", + " compute_risk(c, s) for c, s in zip(support_calls, sentiment_score)\n", + "]\n", + "\n", + "print('βœ… support_churn_risk categories created!')\n", + "from collections import Counter\n", + "print(' Distribution:', Counter(support_churn_risk))" + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Synthetic Data Design & Assumptions\n", + "\n", + "### Why Synthetic Data Was Created\n", + "\n", + "The original Telco dataset does not include detailed information about customer support interactions or behavioral signals such as sentiment. However, in real-world business settings, these factors play a critical role in customer churn.\n", + "\n", + "To better approximate real-world conditions, we generated synthetic variables representing:\n", + "- number of support calls\n", + "- average call duration\n", + "- complaint type\n", + "- time since last interaction\n", + "- sentiment score\n", + "- support-based churn risk\n", + "\n", + "These variables allow us to enrich the dataset and create a more realistic narrative around customer experience and dissatisfaction.\n", + "\n", + "---\n", + "\n", + "### Key Assumptions\n", + "\n", + "The synthetic data generation is based on the following assumptions:\n", + "\n", + "- Customers who churn tend to have **more frequent support interactions**\n", + "- Negative experiences lead to **lower sentiment scores**\n", + "- Certain complaint types (e.g., technical failures) are more strongly associated with dissatisfaction\n", + "- Customers with unresolved issues are more likely to churn\n", + "\n", + "These assumptions are grounded in typical telecom business logic but are not directly observed in the original dataset.\n", + "\n", + "---\n", + "\n", + "### Limitations of Synthetic Data\n", + "\n", + "Because these variables are artificially generated:\n", + "- They may introduce **bias toward expected relationships**\n", + "- They are not independent of the churn outcome\n", + "- They may **overstate model performance**\n", + "\n", + "Therefore, results should be interpreted as illustrative rather than fully generalizable." + ], + "metadata": { + "id": "zvVoEV0Fl77_" + } + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7TZ0lxJQulFt" + }, + "source": [ + "---\n", + "## πŸ”— SECTION 5: Merge Real + Synthetic Data\n", + "\n", + "We now combine the cleaned real-world dataset with our synthetic support variables into one rich dataset." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 330 + }, + "id": "7tMGLUqvulFt", + "outputId": "d5013de1-0912-46b4-886e-e077adc64ac3" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "βœ… Final merged dataset: 7043 rows Γ— 21 columns\n" + ] + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + " customerID gender SeniorCitizen Partner Dependents tenure \\\n", + "0 7590-VHVEG Female 0 Yes No 1 \n", + "1 5575-GNVDE Male 0 No No 34 \n", + "2 3668-QPYBK Male 0 No No 2 \n", + "3 7795-CFOCW Male 0 No No 45 \n", + "4 9237-HQITU Female 0 No No 2 \n", + "\n", + " Contract PaymentMethod MonthlyCharges TotalCharges \\\n", + "0 Month-to-month Electronic check 29.85 29.85 \n", + "1 One year Mailed check 56.95 1889.50 \n", + "2 Month-to-month Mailed check 53.85 108.15 \n", + "3 One year Bank transfer (automatic) 42.30 1840.75 \n", + "4 Month-to-month Electronic check 70.70 151.65 \n", + "\n", + " InternetService TechSupport Churn Churn_binary support_calls \\\n", + "0 DSL No No 0 2 \n", + "1 DSL No No 0 4 \n", + "2 DSL No Yes 1 15 \n", + "3 DSL Yes No 0 3 \n", + "4 Fiber optic No Yes 1 9 \n", + "\n", + " avg_call_duration complaint_type days_since_last_contact \\\n", + "0 9.7 Contract Dispute 16 \n", + "1 3.8 Billing Issue 57 \n", + "2 21.2 Billing Issue 12 \n", + "3 9.8 Service Outage 40 \n", + "4 12.3 Contract Dispute 24 \n", + "\n", + " last_contact_sentiment sentiment_score \\\n", + "0 I love this company, always responsive and car... 0.8720 \n", + "1 Great service, no complaints at all! 0.7684 \n", + "2 My bill is wrong again. This is the third time... -0.5255 \n", + "3 Everything was resolved in one call. Excellent... 0.6597 \n", + "4 My bill is wrong again. This is the third time... -0.5255 \n", + "\n", + " support_churn_risk \n", + "0 Low \n", + "1 Medium \n", + "2 High \n", + "3 Medium \n", + "4 High " + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
customerIDgenderSeniorCitizenPartnerDependentstenureContractPaymentMethodMonthlyChargesTotalChargesInternetServiceTechSupportChurnChurn_binarysupport_callsavg_call_durationcomplaint_typedays_since_last_contactlast_contact_sentimentsentiment_scoresupport_churn_risk
07590-VHVEGFemale0YesNo1Month-to-monthElectronic check29.8529.85DSLNoNo029.7Contract Dispute16I love this company, always responsive and car...0.8720Low
15575-GNVDEMale0NoNo34One yearMailed check56.951889.50DSLNoNo043.8Billing Issue57Great service, no complaints at all!0.7684Medium
23668-QPYBKMale0NoNo2Month-to-monthMailed check53.85108.15DSLNoYes11521.2Billing Issue12My bill is wrong again. This is the third time...-0.5255High
37795-CFOCWMale0NoNo45One yearBank transfer (automatic)42.301840.75DSLYesNo039.8Service Outage40Everything was resolved in one call. Excellent...0.6597Medium
49237-HQITUFemale0NoNo2Month-to-monthElectronic check70.70151.65Fiber opticNoYes1912.3Contract Dispute24My bill is wrong again. This is the third time...-0.5255High
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe" + } + }, + "metadata": {} + } + ], + "source": [ + "# ── Build the synthetic support DataFrame ────────────────────────────────────\n", + "df_support = pd.DataFrame({\n", + " 'support_calls' : support_calls,\n", + " 'avg_call_duration' : avg_call_duration,\n", + " 'complaint_type' : complaint_type,\n", + " 'days_since_last_contact': days_since_last_contact,\n", + " 'last_contact_sentiment' : last_contact_sentiment,\n", + " 'sentiment_score' : sentiment_score,\n", + " 'support_churn_risk' : support_churn_risk\n", + "})\n", + "\n", + "# ── Merge with real-world data (index-aligned) ────────────────────────────────\n", + "df_final = pd.concat([df_clean.reset_index(drop=True), df_support], axis=1)\n", + "\n", + "print(f'βœ… Final merged dataset: {df_final.shape[0]} rows Γ— {df_final.shape[1]} columns')\n", + "display(df_final.head())" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "UAW-XT66ulFu", + "outputId": "32243187-b541-40b2-8069-894cbe430f8b" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "πŸ“Š Mean support_calls by Churn:\n", + "Churn\n", + "No 3.47\n", + "Yes 10.02\n", + "Name: support_calls, dtype: float64\n", + "\n", + "πŸ“Š Mean sentiment_score by Churn:\n", + "Churn\n", + "No 0.519\n", + "Yes -0.279\n", + "Name: sentiment_score, dtype: float64\n", + "\n", + "πŸ“Š support_churn_risk vs Churn crosstab:\n", + "Churn No Yes\n", + "support_churn_risk \n", + "High 0.080 0.920\n", + "Low 1.000 0.000\n", + "Medium 0.862 0.138\n" + ] + } + ], + "source": [ + "# ── Verification: check correlations make sense ──────────────────────────────\n", + "print('πŸ“Š Mean support_calls by Churn:')\n", + "print(df_final.groupby('Churn')['support_calls'].mean().round(2))\n", + "\n", + "print('\\nπŸ“Š Mean sentiment_score by Churn:')\n", + "print(df_final.groupby('Churn')['sentiment_score'].mean().round(3))\n", + "\n", + "print('\\nπŸ“Š support_churn_risk vs Churn crosstab:')\n", + "print(pd.crosstab(df_final['support_churn_risk'], df_final['Churn'], normalize='index').round(3))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "27am_r6QulFu" + }, + "source": [ + "---\n", + "## πŸ’Ύ SECTION 6: Export Final Dataset" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "EfM7zP4XulFu", + "outputId": "5f76e254-0dc9-49ea-c6ff-3c202b00c78f" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "βœ… Dataset exported: customer_churn_support_dataset.csv\n", + " Rows : 7043\n", + " Columns : 21\n", + "\n", + "πŸ“‹ Final column list:\n", + " β€’ customerID\n", + " β€’ gender\n", + " β€’ SeniorCitizen\n", + " β€’ Partner\n", + " β€’ Dependents\n", + " β€’ tenure\n", + " β€’ Contract\n", + " β€’ PaymentMethod\n", + " β€’ MonthlyCharges\n", + " β€’ TotalCharges\n", + " β€’ InternetService\n", + " β€’ TechSupport\n", + " β€’ Churn\n", + " β€’ Churn_binary\n", + " β€’ support_calls\n", + " β€’ avg_call_duration\n", + " β€’ complaint_type\n", + " β€’ days_since_last_contact\n", + " β€’ last_contact_sentiment\n", + " β€’ sentiment_score\n", + " β€’ support_churn_risk\n" + ] + } + ], + "source": [ + "# ── Export to CSV ─────────────────────────────────────────────────────────────\n", + "OUTPUT_FILENAME = 'customer_churn_support_dataset.csv'\n", + "df_final.to_csv(OUTPUT_FILENAME, index=False)\n", + "\n", + "print(f'βœ… Dataset exported: {OUTPUT_FILENAME}')\n", + "print(f' Rows : {df_final.shape[0]}')\n", + "print(f' Columns : {df_final.shape[1]}')\n", + "print(f'\\nπŸ“‹ Final column list:')\n", + "for col in df_final.columns:\n", + " print(f' β€’ {col}')" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "id": "HFWnx9zlulFu", + "outputId": "a9084d14-05d9-4cb2-98c6-a25491f9b6b1" + }, + "outputs": [ + { + "output_type": "display_data", + "data": { + "text/plain": [ + "" + ], + "application/javascript": [ + "\n", + " async function download(id, filename, size) {\n", + " if (!google.colab.kernel.accessAllowed) {\n", + " return;\n", + " }\n", + " const div = document.createElement('div');\n", + " const label = document.createElement('label');\n", + " label.textContent = `Downloading \"${filename}\": `;\n", + " div.appendChild(label);\n", + " const progress = document.createElement('progress');\n", + " progress.max = size;\n", + " div.appendChild(progress);\n", + " document.body.appendChild(div);\n", + "\n", + " const buffers = [];\n", + " let downloaded = 0;\n", + "\n", + " const channel = await google.colab.kernel.comms.open(id);\n", + " // Send a message to notify the kernel that we're ready.\n", + " channel.send({})\n", + "\n", + " for await (const message of channel.messages) {\n", + " // Send a message to notify the kernel that we're ready.\n", + " channel.send({})\n", + " if (message.buffers) {\n", + " for (const buffer of message.buffers) {\n", + " buffers.push(buffer);\n", + " downloaded += buffer.byteLength;\n", + " progress.value = downloaded;\n", + " }\n", + " }\n", + " }\n", + " const blob = new Blob(buffers, {type: 'application/binary'});\n", + " const a = document.createElement('a');\n", + " a.href = window.URL.createObjectURL(blob);\n", + " a.download = filename;\n", + " div.appendChild(a);\n", + " a.click();\n", + " div.remove();\n", + " }\n", + " " + ] + }, + "metadata": {} + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + "" + ], + "application/javascript": [ + "download(\"download_160eb193-7480-4c6e-90ba-3271caaea521\", \"customer_churn_support_dataset.csv\", 1309271)" + ] + }, + "metadata": {} + }, + { + "output_type": "stream", + "name": "stdout", + "text": [ + "βœ… Download triggered! Check your Downloads folder.\n" + ] + } + ], + "source": [ + "# ── Download the file to your computer ───────────────────────────────────────\n", + "from google.colab import files\n", + "files.download(OUTPUT_FILENAME)\n", + "print('βœ… Download triggered! Check your Downloads folder.')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wr8dQ0YQulFu" + }, + "source": [ + "---\n", + "## βœ… SECTION 7: Final Verification Checks" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "FK7d-FE2ulFu", + "outputId": "4e8cb3b5-08f8-4e0a-efd6-97b738be9181" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "============================================================\n", + " FINAL DATASET VERIFICATION REPORT\n", + "============================================================\n", + "\n", + "βœ… Shape : (7043, 21)\n", + "βœ… Missing values : 0 total\n", + "βœ… Churn rate : 26.5%\n", + "βœ… Risk categories : {'Medium': 4166, 'Low': 1471, 'High': 1406}\n", + "βœ… Complaint types : 6 unique\n", + "βœ… Sentiment range : -0.710 to 0.872\n", + "\n", + "============================================================\n", + " βœ… Notebook 1 COMPLETE! Proceed to Notebook 2.\n", + "============================================================\n" + ] + } + ], + "source": [ + "# ── Final verification ────────────────────────────────────────────────────────\n", + "print('=' * 60)\n", + "print(' FINAL DATASET VERIFICATION REPORT')\n", + "print('=' * 60)\n", + "\n", + "df_verify = pd.read_csv(OUTPUT_FILENAME)\n", + "\n", + "print(f'\\nβœ… Shape : {df_verify.shape}')\n", + "print(f'βœ… Missing values : {df_verify.isnull().sum().sum()} total')\n", + "print(f'βœ… Churn rate : {df_verify[\"Churn_binary\"].mean()*100:.1f}%')\n", + "print(f'βœ… Risk categories : {df_verify[\"support_churn_risk\"].value_counts().to_dict()}')\n", + "print(f'βœ… Complaint types : {df_verify[\"complaint_type\"].nunique()} unique')\n", + "print(f'βœ… Sentiment range : {df_verify[\"sentiment_score\"].min():.3f} to {df_verify[\"sentiment_score\"].max():.3f}')\n", + "\n", + "print('\\n' + '=' * 60)\n", + "print(' βœ… Notebook 1 COMPLETE! Proceed to Notebook 2.')\n", + "print('=' * 60)" + ] + }, + { + "cell_type": "markdown", + "source": [ + "## SECTION 8: Final Dataset Summary" + ], + "metadata": { + "id": "7KUQ8dwMa9pK" + } + }, + { + "cell_type": "code", + "source": [ + "print('=' * 60)\n", + "print(' NOTEBOOK 1 β€” FINAL DATASET SUMMARY')\n", + "print('=' * 60)\n", + "\n", + "print(f'\\nπŸ“ Shape: {df_final.shape[0]} rows Γ— {df_final.shape[1]} columns')\n", + "\n", + "print('\\nπŸ“‹ Column overview:')\n", + "for col in df_final.columns:\n", + " dtype = str(df_final[col].dtype)\n", + " sample = str(df_final[col].iloc[0])[:40]\n", + " print(f' {col:<30} [{dtype:<10}] e.g. {sample}')\n", + "\n", + "print('\\nπŸ“Š Real vs Synthetic columns:')\n", + "real_cols = ['customerID','gender','SeniorCitizen','Partner','Dependents',\n", + " 'tenure','Contract','PaymentMethod','MonthlyCharges',\n", + " 'TotalCharges','InternetService','TechSupport','Churn','Churn_binary']\n", + "synthetic_cols = ['support_calls','avg_call_duration','complaint_type',\n", + " 'days_since_last_contact','last_contact_sentiment',\n", + " 'sentiment_score','support_churn_risk']\n", + "\n", + "print(f' βœ… Real-world columns ({len(real_cols)}): {real_cols}')\n", + "print(f' πŸ€– Synthetic columns ({len(synthetic_cols)}): {synthetic_cols}')\n", + "\n", + "print('\\nπŸ“Š Sample rows:')\n", + "display(df_final[['customerID','tenure','Churn','support_calls',\n", + " 'sentiment_score','support_churn_risk']].head(5))\n", + "\n", + "\n", + "print('=' * 60)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 833 + }, + "id": "XpmZTW0RahxT", + "outputId": "10a7c450-6a35-447c-ca9f-7db0aab8d5ce" + }, + "execution_count": 21, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "============================================================\n", + " NOTEBOOK 1 β€” FINAL DATASET SUMMARY\n", + "============================================================\n", + "\n", + "πŸ“ Shape: 7043 rows Γ— 21 columns\n", + "\n", + "πŸ“‹ Column overview:\n", + " customerID [object ] e.g. 7590-VHVEG\n", + " gender [object ] e.g. Female\n", + " SeniorCitizen [int64 ] e.g. 0\n", + " Partner [object ] e.g. Yes\n", + " Dependents [object ] e.g. No\n", + " tenure [int64 ] e.g. 1\n", + " Contract [object ] e.g. Month-to-month\n", + " PaymentMethod [object ] e.g. Electronic check\n", + " MonthlyCharges [float64 ] e.g. 29.85\n", + " TotalCharges [float64 ] e.g. 29.85\n", + " InternetService [object ] e.g. DSL\n", + " TechSupport [object ] e.g. No\n", + " Churn [object ] e.g. No\n", + " Churn_binary [int64 ] e.g. 0\n", + " support_calls [int64 ] e.g. 2\n", + " avg_call_duration [float64 ] e.g. 9.7\n", + " complaint_type [object ] e.g. Contract Dispute\n", + " days_since_last_contact [int64 ] e.g. 16\n", + " last_contact_sentiment [object ] e.g. I love this company, always responsive a\n", + " sentiment_score [float64 ] e.g. 0.872\n", + " support_churn_risk [object ] e.g. Low\n", + "\n", + "πŸ“Š Real vs Synthetic columns:\n", + " βœ… Real-world columns (14): ['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'Contract', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'InternetService', 'TechSupport', 'Churn', 'Churn_binary']\n", + " πŸ€– Synthetic columns (7): ['support_calls', 'avg_call_duration', 'complaint_type', 'days_since_last_contact', 'last_contact_sentiment', 'sentiment_score', 'support_churn_risk']\n", + "\n", + "πŸ“Š Sample rows:\n" + ] + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + " customerID tenure Churn support_calls sentiment_score support_churn_risk\n", + "0 7590-VHVEG 1 No 2 0.8720 Low\n", + "1 5575-GNVDE 34 No 4 0.7684 Medium\n", + "2 3668-QPYBK 2 Yes 15 -0.5255 High\n", + "3 7795-CFOCW 45 No 3 0.6597 Medium\n", + "4 9237-HQITU 2 Yes 9 -0.5255 High" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
customerIDtenureChurnsupport_callssentiment_scoresupport_churn_risk
07590-VHVEG1No20.8720Low
15575-GNVDE34No40.7684Medium
23668-QPYBK2Yes15-0.5255High
37795-CFOCW45No30.6597Medium
49237-HQITU2Yes9-0.5255High
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"print('=' * 60)\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"customerID\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"5575-GNVDE\",\n \"9237-HQITU\",\n \"3668-QPYBK\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"tenure\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 21,\n \"min\": 1,\n \"max\": 45,\n \"num_unique_values\": 4,\n \"samples\": [\n 34,\n 45,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Churn\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"Yes\",\n \"No\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"support_calls\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5,\n \"min\": 2,\n \"max\": 15,\n \"num_unique_values\": 5,\n \"samples\": [\n 4,\n 9\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"sentiment_score\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.7117367821041709,\n \"min\": -0.5255,\n \"max\": 0.872,\n \"num_unique_values\": 4,\n \"samples\": [\n 0.7684,\n 0.6597\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"support_churn_risk\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Low\",\n \"Medium\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {} + }, + { + "output_type": "stream", + "name": "stdout", + "text": [ + "============================================================\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nEAK367ZulFu" + }, + "source": [ + "---\n", + "## πŸ“Œ Summary β€” What Was Done in This Notebook\n", + "\n", + "| Phase | Action | Result |\n", + "|---|---|---|\n", + "| **Real-World Data** | Downloaded Telco Churn from Kaggle | 7,043 real customers |\n", + "| **Cleaning** | Fixed TotalCharges, encoded Churn | Clean 13-column DataFrame |\n", + "| **Synthetic Generation** | Created 7 support variables with VADER | Statistically realistic |\n", + "| **Merge** | Combined real + synthetic | 20-column final dataset |\n", + "| **Export** | Saved as CSV | `customer_churn_support_dataset.csv` |\n", + "\n", + "**➑️ Next Step:** Open `2_Churn_Data_Analysis_and_Insights.ipynb` and upload this CSV file." + ] + } + ] +} \ No newline at end of file