diff --git "a/1_Churn_Data_Creation_and_Processing.ipynb" "b/1_Churn_Data_Creation_and_Processing.ipynb"
new file mode 100644--- /dev/null
+++ "b/1_Churn_Data_Creation_and_Processing.ipynb"
@@ -0,0 +1,2342 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "provenance": []
+ },
+ "kernelspec": {
+ "display_name": "Python 3",
+ "name": "python3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "pAyn432WulFm"
+ },
+ "source": [
+ "# π Notebook 1: Churn Data Creation and Processing\n",
+ "## AI for Big Data Management β ESCP Business School\n",
+ "### Final Group Project\n",
+ "\n",
+ "---\n",
+ "\n",
+ "## π Problem Statement\n",
+ "> **\"How can a company predict customer churn based on support interactions and proactively adapt its retention strategy?\"**\n",
+ "\n",
+ "\n",
+ "- We aim to predict and understand customer churn by combining structured telecom data with synthetic behavioral signals derived from customer support interactions.\n",
+ "---\n",
+ "\n",
+ "## πΊοΈ Project Pipeline\n",
+ "```\n",
+ "PROBLEM CREATION β REAL-WORLD DATA PROCESSING β SYNTHETIC DATASET GENERATION β AUTOMATION β WRAP-UP\n",
+ "```\n",
+ "\n",
+ "---\n",
+ "\n",
+ "## π What This Notebook Does\n",
+ "1. **[REAL-WORLD]** Loads the Telco Customer Churn dataset from Kaggle\n",
+ "2. **[REAL-WORLD]** Cleans and preprocesses the real data\n",
+ "3. **[SYNTHETIC]** Generates realistic support interaction variables\n",
+ "4. **[SYNTHETIC]** Creates a merged, enriched final dataset\n",
+ "5. Exports `customer_churn_support_dataset.csv` for Notebook 2\n",
+ "\n",
+ "---\n",
+ "\n",
+ "### β οΈ Before Running\n",
+ "You need to upload one file:\n",
+ "- `WA_Fn-UseC_-Telco-Customer-Churn.csv` (downloaded from Kaggle)\n",
+ "\n",
+ "Upload it using the π Files panel on the left sidebar in Google Colab.\n",
+ "All other data is generated synthetically in this notebook."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "acacAe8GulFp"
+ },
+ "source": [
+ "---\n",
+ "## π¦ SECTION 1: Install & Import Libraries\n",
+ "Run this cell first. It installs VADER for sentiment analysis and imports all necessary libraries."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "p5lEOq-1ulFq",
+ "outputId": "09999301-c22f-49db-896f-5344d4c0322d"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "β
All libraries imported successfully!\n",
+ " pandas : 2.2.2\n",
+ " numpy : 2.0.2\n"
+ ]
+ }
+ ],
+ "source": [
+ "# ββ Install required packages ββββββββββββββββββββββββββββββββββββββββββββββββββ\n",
+ "!pip install vaderSentiment --quiet\n",
+ "\n",
+ "# ββ Standard imports ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n",
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "import seaborn as sns\n",
+ "import random\n",
+ "from datetime import datetime, timedelta\n",
+ "from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer\n",
+ "\n",
+ "# ββ Settings ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n",
+ "np.random.seed(42) # reproducibility\n",
+ "random.seed(42)\n",
+ "pd.set_option('display.max_columns', None)\n",
+ "\n",
+ "print('β
All libraries imported successfully!')\n",
+ "print(f' pandas : {pd.__version__}')\n",
+ "print(f' numpy : {np.__version__}')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "hrZYSJs4ulFr"
+ },
+ "source": [
+ "---\n",
+ "## π₯ SECTION 2: Load the Real-World Dataset\n",
+ "\n",
+ "### [REAL-WORLD DATA PROCESSING]\n",
+ "\n",
+ "**Where to download the dataset:**\n",
+ "1. Go to: https://www.kaggle.com/datasets/blastchar/telco-customer-churn\n",
+ "2. Click the **Download** button (top right)\n",
+ "3. Unzip the file β you will get: `WA_Fn-UseC_-Telco-Customer-Churn.csv`\n",
+ "4. In Google Colab, click the π folder icon on the left sidebar\n",
+ "5. Click the β¬οΈ Upload button and select the CSV file\n",
+ "6. Wait for the upload to finish, then run the cell below\n",
+ "\n",
+ "**Dataset info:** IBM Telco Customer Churn β 7,043 real customers with billing, contract, and service information."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 382
+ },
+ "id": "2BkVJSBzulFr",
+ "outputId": "a11033cd-94ba-4ba8-9387-edfdb4ca27bb"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "β
Dataset loaded successfully!\n",
+ " Shape: 7043 rows Γ 21 columns\n",
+ "\n",
+ "π First 5 rows:\n"
+ ]
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ " customerID gender SeniorCitizen Partner Dependents tenure PhoneService \\\n",
+ "0 7590-VHVEG Female 0 Yes No 1 No \n",
+ "1 5575-GNVDE Male 0 No No 34 Yes \n",
+ "2 3668-QPYBK Male 0 No No 2 Yes \n",
+ "3 7795-CFOCW Male 0 No No 45 No \n",
+ "4 9237-HQITU Female 0 No No 2 Yes \n",
+ "\n",
+ " MultipleLines InternetService OnlineSecurity OnlineBackup \\\n",
+ "0 No phone service DSL No Yes \n",
+ "1 No DSL Yes No \n",
+ "2 No DSL Yes Yes \n",
+ "3 No phone service DSL Yes No \n",
+ "4 No Fiber optic No No \n",
+ "\n",
+ " DeviceProtection TechSupport StreamingTV StreamingMovies Contract \\\n",
+ "0 No No No No Month-to-month \n",
+ "1 Yes No No No One year \n",
+ "2 No No No No Month-to-month \n",
+ "3 Yes Yes No No One year \n",
+ "4 No No No No Month-to-month \n",
+ "\n",
+ " PaperlessBilling PaymentMethod MonthlyCharges TotalCharges \\\n",
+ "0 Yes Electronic check 29.85 29.85 \n",
+ "1 No Mailed check 56.95 1889.5 \n",
+ "2 Yes Mailed check 53.85 108.15 \n",
+ "3 No Bank transfer (automatic) 42.30 1840.75 \n",
+ "4 Yes Electronic check 70.70 151.65 \n",
+ "\n",
+ " Churn \n",
+ "0 No \n",
+ "1 No \n",
+ "2 Yes \n",
+ "3 No \n",
+ "4 Yes "
+ ],
+ "text/html": [
+ "\n",
+ "
\n",
+ "
\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " customerID | \n",
+ " gender | \n",
+ " SeniorCitizen | \n",
+ " Partner | \n",
+ " Dependents | \n",
+ " tenure | \n",
+ " PhoneService | \n",
+ " MultipleLines | \n",
+ " InternetService | \n",
+ " OnlineSecurity | \n",
+ " OnlineBackup | \n",
+ " DeviceProtection | \n",
+ " TechSupport | \n",
+ " StreamingTV | \n",
+ " StreamingMovies | \n",
+ " Contract | \n",
+ " PaperlessBilling | \n",
+ " PaymentMethod | \n",
+ " MonthlyCharges | \n",
+ " TotalCharges | \n",
+ " Churn | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " | 0 | \n",
+ " 7590-VHVEG | \n",
+ " Female | \n",
+ " 0 | \n",
+ " Yes | \n",
+ " No | \n",
+ " 1 | \n",
+ " No | \n",
+ " No phone service | \n",
+ " DSL | \n",
+ " No | \n",
+ " Yes | \n",
+ " No | \n",
+ " No | \n",
+ " No | \n",
+ " No | \n",
+ " Month-to-month | \n",
+ " Yes | \n",
+ " Electronic check | \n",
+ " 29.85 | \n",
+ " 29.85 | \n",
+ " No | \n",
+ "
\n",
+ " \n",
+ " | 1 | \n",
+ " 5575-GNVDE | \n",
+ " Male | \n",
+ " 0 | \n",
+ " No | \n",
+ " No | \n",
+ " 34 | \n",
+ " Yes | \n",
+ " No | \n",
+ " DSL | \n",
+ " Yes | \n",
+ " No | \n",
+ " Yes | \n",
+ " No | \n",
+ " No | \n",
+ " No | \n",
+ " One year | \n",
+ " No | \n",
+ " Mailed check | \n",
+ " 56.95 | \n",
+ " 1889.5 | \n",
+ " No | \n",
+ "
\n",
+ " \n",
+ " | 2 | \n",
+ " 3668-QPYBK | \n",
+ " Male | \n",
+ " 0 | \n",
+ " No | \n",
+ " No | \n",
+ " 2 | \n",
+ " Yes | \n",
+ " No | \n",
+ " DSL | \n",
+ " Yes | \n",
+ " Yes | \n",
+ " No | \n",
+ " No | \n",
+ " No | \n",
+ " No | \n",
+ " Month-to-month | \n",
+ " Yes | \n",
+ " Mailed check | \n",
+ " 53.85 | \n",
+ " 108.15 | \n",
+ " Yes | \n",
+ "
\n",
+ " \n",
+ " | 3 | \n",
+ " 7795-CFOCW | \n",
+ " Male | \n",
+ " 0 | \n",
+ " No | \n",
+ " No | \n",
+ " 45 | \n",
+ " No | \n",
+ " No phone service | \n",
+ " DSL | \n",
+ " Yes | \n",
+ " No | \n",
+ " Yes | \n",
+ " Yes | \n",
+ " No | \n",
+ " No | \n",
+ " One year | \n",
+ " No | \n",
+ " Bank transfer (automatic) | \n",
+ " 42.30 | \n",
+ " 1840.75 | \n",
+ " No | \n",
+ "
\n",
+ " \n",
+ " | 4 | \n",
+ " 9237-HQITU | \n",
+ " Female | \n",
+ " 0 | \n",
+ " No | \n",
+ " No | \n",
+ " 2 | \n",
+ " Yes | \n",
+ " No | \n",
+ " Fiber optic | \n",
+ " No | \n",
+ " No | \n",
+ " No | \n",
+ " No | \n",
+ " No | \n",
+ " No | \n",
+ " Month-to-month | \n",
+ " Yes | \n",
+ " Electronic check | \n",
+ " 70.70 | \n",
+ " 151.65 | \n",
+ " Yes | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
\n",
+ "
\n",
+ "
\n"
+ ],
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "dataframe"
+ }
+ },
+ "metadata": {}
+ }
+ ],
+ "source": [
+ "# ββ Load the real-world Telco Churn dataset ββββββββββββββββββββββββββββββββββββ\n",
+ "# If you renamed your file differently, change the filename below\n",
+ "DATASET_FILENAME = 'WA_Fn-UseC_-Telco-Customer-Churn.csv'\n",
+ "\n",
+ "try:\n",
+ " df_real = pd.read_csv(DATASET_FILENAME)\n",
+ " print(f'β
Dataset loaded successfully!')\n",
+ " print(f' Shape: {df_real.shape[0]} rows Γ {df_real.shape[1]} columns')\n",
+ " print(f'\\nπ First 5 rows:')\n",
+ " display(df_real.head())\n",
+ "except FileNotFoundError:\n",
+ " print('β ERROR: File not found!')\n",
+ " print(' Please upload the CSV file to Colab (see instructions above).')\n",
+ " print(' File expected:', DATASET_FILENAME)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "0ZxflowUulFr",
+ "outputId": "d7356987-2ce1-4eff-cf0e-8263f2c1942d"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "π Column names:\n",
+ "['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn']\n",
+ "\n",
+ "π Data types:\n",
+ "customerID object\n",
+ "gender object\n",
+ "SeniorCitizen int64\n",
+ "Partner object\n",
+ "Dependents object\n",
+ "tenure int64\n",
+ "PhoneService object\n",
+ "MultipleLines object\n",
+ "InternetService object\n",
+ "OnlineSecurity object\n",
+ "OnlineBackup object\n",
+ "DeviceProtection object\n",
+ "TechSupport object\n",
+ "StreamingTV object\n",
+ "StreamingMovies object\n",
+ "Contract object\n",
+ "PaperlessBilling object\n",
+ "PaymentMethod object\n",
+ "MonthlyCharges float64\n",
+ "TotalCharges object\n",
+ "Churn object\n",
+ "dtype: object\n",
+ "\n",
+ "π Missing values per column:\n",
+ "customerID 0\n",
+ "gender 0\n",
+ "SeniorCitizen 0\n",
+ "Partner 0\n",
+ "Dependents 0\n",
+ "tenure 0\n",
+ "PhoneService 0\n",
+ "MultipleLines 0\n",
+ "InternetService 0\n",
+ "OnlineSecurity 0\n",
+ "OnlineBackup 0\n",
+ "DeviceProtection 0\n",
+ "TechSupport 0\n",
+ "StreamingTV 0\n",
+ "StreamingMovies 0\n",
+ "Contract 0\n",
+ "PaperlessBilling 0\n",
+ "PaymentMethod 0\n",
+ "MonthlyCharges 0\n",
+ "TotalCharges 0\n",
+ "Churn 0\n",
+ "dtype: int64\n",
+ "\n",
+ "π Churn distribution (real data):\n",
+ "Churn\n",
+ "No 5174\n",
+ "Yes 1869\n",
+ "Name: count, dtype: int64\n",
+ "\n",
+ "π Churn rate (real data): 26.54 %\n"
+ ]
+ }
+ ],
+ "source": [
+ "# ββ Basic exploration of real dataset βββββββββββββββββββββββββββββββββββββββββ\n",
+ "print('π Column names:')\n",
+ "print(df_real.columns.tolist())\n",
+ "print('\\nπ Data types:')\n",
+ "print(df_real.dtypes)\n",
+ "print('\\nπ Missing values per column:')\n",
+ "print(df_real.isnull().sum())\n",
+ "print('\\nπ Churn distribution (real data):')\n",
+ "print(df_real['Churn'].value_counts())\n",
+ "print('\\nπ Churn rate (real data):', round(df_real['Churn'].value_counts(normalize=True)['Yes'] * 100, 2), '%')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Qtj_34hvulFr"
+ },
+ "source": [
+ "---\n",
+ "## π§Ή SECTION 3: Real-World Data Cleaning\n",
+ "\n",
+ "### [REAL-WORLD DATA PROCESSING β continued]\n",
+ "\n",
+ "This section handles:\n",
+ "- Converting `TotalCharges` to numeric (it arrives as a string with spaces)\n",
+ "- Filling missing values\n",
+ "- Encoding the target variable `Churn` as 0/1\n",
+ "- Selecting the columns we need"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 729
+ },
+ "id": "jpJKPzueulFs",
+ "outputId": "99e08519-c0e8-495c-eb84-09af613c0a4c"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "β
TotalCharges NaN values filled with median: 1397.47\n",
+ "β
Churn encoded: Yes=1, No=0\n",
+ "\n",
+ "β
SeniorCitizen unique values: [0 1]\n",
+ "\n",
+ "β
Cleaned dataset shape: (7043, 14)\n",
+ " Missing values after cleaning:\n",
+ "customerID 0\n",
+ "gender 0\n",
+ "SeniorCitizen 0\n",
+ "Partner 0\n",
+ "Dependents 0\n",
+ "tenure 0\n",
+ "Contract 0\n",
+ "PaymentMethod 0\n",
+ "MonthlyCharges 0\n",
+ "TotalCharges 0\n",
+ "InternetService 0\n",
+ "TechSupport 0\n",
+ "Churn 0\n",
+ "Churn_binary 0\n",
+ "dtype: int64\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "/tmp/ipykernel_2585/1930681133.py:6: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.\n",
+ "The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.\n",
+ "\n",
+ "For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.\n",
+ "\n",
+ "\n",
+ " df_real['TotalCharges'].fillna(median_total, inplace=True)\n"
+ ]
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ " customerID gender SeniorCitizen Partner Dependents tenure \\\n",
+ "0 7590-VHVEG Female 0 Yes No 1 \n",
+ "1 5575-GNVDE Male 0 No No 34 \n",
+ "2 3668-QPYBK Male 0 No No 2 \n",
+ "3 7795-CFOCW Male 0 No No 45 \n",
+ "4 9237-HQITU Female 0 No No 2 \n",
+ "\n",
+ " Contract PaymentMethod MonthlyCharges TotalCharges \\\n",
+ "0 Month-to-month Electronic check 29.85 29.85 \n",
+ "1 One year Mailed check 56.95 1889.50 \n",
+ "2 Month-to-month Mailed check 53.85 108.15 \n",
+ "3 One year Bank transfer (automatic) 42.30 1840.75 \n",
+ "4 Month-to-month Electronic check 70.70 151.65 \n",
+ "\n",
+ " InternetService TechSupport Churn Churn_binary \n",
+ "0 DSL No No 0 \n",
+ "1 DSL No No 0 \n",
+ "2 DSL No Yes 1 \n",
+ "3 DSL Yes No 0 \n",
+ "4 Fiber optic No Yes 1 "
+ ],
+ "text/html": [
+ "\n",
+ " \n",
+ "
\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " customerID | \n",
+ " gender | \n",
+ " SeniorCitizen | \n",
+ " Partner | \n",
+ " Dependents | \n",
+ " tenure | \n",
+ " Contract | \n",
+ " PaymentMethod | \n",
+ " MonthlyCharges | \n",
+ " TotalCharges | \n",
+ " InternetService | \n",
+ " TechSupport | \n",
+ " Churn | \n",
+ " Churn_binary | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " | 0 | \n",
+ " 7590-VHVEG | \n",
+ " Female | \n",
+ " 0 | \n",
+ " Yes | \n",
+ " No | \n",
+ " 1 | \n",
+ " Month-to-month | \n",
+ " Electronic check | \n",
+ " 29.85 | \n",
+ " 29.85 | \n",
+ " DSL | \n",
+ " No | \n",
+ " No | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ " | 1 | \n",
+ " 5575-GNVDE | \n",
+ " Male | \n",
+ " 0 | \n",
+ " No | \n",
+ " No | \n",
+ " 34 | \n",
+ " One year | \n",
+ " Mailed check | \n",
+ " 56.95 | \n",
+ " 1889.50 | \n",
+ " DSL | \n",
+ " No | \n",
+ " No | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ " | 2 | \n",
+ " 3668-QPYBK | \n",
+ " Male | \n",
+ " 0 | \n",
+ " No | \n",
+ " No | \n",
+ " 2 | \n",
+ " Month-to-month | \n",
+ " Mailed check | \n",
+ " 53.85 | \n",
+ " 108.15 | \n",
+ " DSL | \n",
+ " No | \n",
+ " Yes | \n",
+ " 1 | \n",
+ "
\n",
+ " \n",
+ " | 3 | \n",
+ " 7795-CFOCW | \n",
+ " Male | \n",
+ " 0 | \n",
+ " No | \n",
+ " No | \n",
+ " 45 | \n",
+ " One year | \n",
+ " Bank transfer (automatic) | \n",
+ " 42.30 | \n",
+ " 1840.75 | \n",
+ " DSL | \n",
+ " Yes | \n",
+ " No | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ " | 4 | \n",
+ " 9237-HQITU | \n",
+ " Female | \n",
+ " 0 | \n",
+ " No | \n",
+ " No | \n",
+ " 2 | \n",
+ " Month-to-month | \n",
+ " Electronic check | \n",
+ " 70.70 | \n",
+ " 151.65 | \n",
+ " Fiber optic | \n",
+ " No | \n",
+ " Yes | \n",
+ " 1 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
\n",
+ "
\n",
+ "
\n"
+ ],
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "dataframe",
+ "summary": "{\n \"name\": \"display(df_clean\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"customerID\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"5575-GNVDE\",\n \"9237-HQITU\",\n \"3668-QPYBK\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"gender\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"Male\",\n \"Female\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"SeniorCitizen\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 0,\n \"num_unique_values\": 1,\n \"samples\": [\n 0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Partner\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"No\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Dependents\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"No\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"tenure\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 21,\n \"min\": 1,\n \"max\": 45,\n \"num_unique_values\": 4,\n \"samples\": [\n 34\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Contract\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"One year\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"PaymentMethod\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Electronic check\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"MonthlyCharges\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 15.445573799635934,\n \"min\": 29.85,\n \"max\": 70.7,\n \"num_unique_values\": 5,\n \"samples\": [\n 56.95\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"TotalCharges\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 969.8243111512518,\n \"min\": 29.85,\n \"max\": 1889.5,\n \"num_unique_values\": 5,\n \"samples\": [\n 1889.5\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"InternetService\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"Fiber optic\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"TechSupport\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"Yes\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Churn\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"Yes\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Churn_binary\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 1,\n \"num_unique_values\": 2,\n \"samples\": [\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
+ }
+ },
+ "metadata": {}
+ }
+ ],
+ "source": [
+ "# ββ Step 1: Fix TotalCharges column (arrives as string) ββββββββββββββββββββββ\n",
+ "df_real['TotalCharges'] = pd.to_numeric(df_real['TotalCharges'], errors='coerce')\n",
+ "\n",
+ "# ββ Step 2: Fill missing TotalCharges with median ββββββββββββββββββββββββββββ\n",
+ "median_total = df_real['TotalCharges'].median()\n",
+ "df_real['TotalCharges'].fillna(median_total, inplace=True)\n",
+ "print(f'β
TotalCharges NaN values filled with median: {median_total:.2f}')\n",
+ "\n",
+ "# ββ Step 3: Encode Churn as 0 / 1 ββββββββββββββββββββββββββββββββββββββββββββ\n",
+ "df_real['Churn_binary'] = df_real['Churn'].map({'Yes': 1, 'No': 0})\n",
+ "print(f'β
Churn encoded: Yes=1, No=0')\n",
+ "\n",
+ "# ββ Step 4: Encode SeniorCitizen (already 0/1, but verify) βββββββββββββββββββ\n",
+ "print(f'\\nβ
SeniorCitizen unique values: {df_real[\"SeniorCitizen\"].unique()}')\n",
+ "\n",
+ "# ββ Step 5: Select core columns for our analysis βββββββββββββββββββββββββββββ\n",
+ "core_cols = [\n",
+ " 'customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',\n",
+ " 'tenure', 'Contract', 'PaymentMethod', 'MonthlyCharges',\n",
+ " 'TotalCharges', 'InternetService', 'TechSupport',\n",
+ " 'Churn', 'Churn_binary'\n",
+ "]\n",
+ "df_clean = df_real[core_cols].copy()\n",
+ "\n",
+ "print(f'\\nβ
Cleaned dataset shape: {df_clean.shape}')\n",
+ "print(f' Missing values after cleaning:')\n",
+ "print(df_clean.isnull().sum())\n",
+ "display(df_clean.head())"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 317
+ },
+ "id": "ySxvvy5hulFs",
+ "outputId": "5c552cda-2f7f-4088-e11e-0d07e0886538"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "π Descriptive statistics (numeric columns):\n"
+ ]
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ " SeniorCitizen tenure MonthlyCharges TotalCharges Churn_binary\n",
+ "count 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000\n",
+ "mean 0.162147 32.371149 64.761692 2281.916928 0.265370\n",
+ "std 0.368612 24.559481 30.090047 2265.270398 0.441561\n",
+ "min 0.000000 0.000000 18.250000 18.800000 0.000000\n",
+ "25% 0.000000 9.000000 35.500000 402.225000 0.000000\n",
+ "50% 0.000000 29.000000 70.350000 1397.475000 0.000000\n",
+ "75% 0.000000 55.000000 89.850000 3786.600000 1.000000\n",
+ "max 1.000000 72.000000 118.750000 8684.800000 1.000000"
+ ],
+ "text/html": [
+ "\n",
+ " \n",
+ "
\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " SeniorCitizen | \n",
+ " tenure | \n",
+ " MonthlyCharges | \n",
+ " TotalCharges | \n",
+ " Churn_binary | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " | count | \n",
+ " 7043.000000 | \n",
+ " 7043.000000 | \n",
+ " 7043.000000 | \n",
+ " 7043.000000 | \n",
+ " 7043.000000 | \n",
+ "
\n",
+ " \n",
+ " | mean | \n",
+ " 0.162147 | \n",
+ " 32.371149 | \n",
+ " 64.761692 | \n",
+ " 2281.916928 | \n",
+ " 0.265370 | \n",
+ "
\n",
+ " \n",
+ " | std | \n",
+ " 0.368612 | \n",
+ " 24.559481 | \n",
+ " 30.090047 | \n",
+ " 2265.270398 | \n",
+ " 0.441561 | \n",
+ "
\n",
+ " \n",
+ " | min | \n",
+ " 0.000000 | \n",
+ " 0.000000 | \n",
+ " 18.250000 | \n",
+ " 18.800000 | \n",
+ " 0.000000 | \n",
+ "
\n",
+ " \n",
+ " | 25% | \n",
+ " 0.000000 | \n",
+ " 9.000000 | \n",
+ " 35.500000 | \n",
+ " 402.225000 | \n",
+ " 0.000000 | \n",
+ "
\n",
+ " \n",
+ " | 50% | \n",
+ " 0.000000 | \n",
+ " 29.000000 | \n",
+ " 70.350000 | \n",
+ " 1397.475000 | \n",
+ " 0.000000 | \n",
+ "
\n",
+ " \n",
+ " | 75% | \n",
+ " 0.000000 | \n",
+ " 55.000000 | \n",
+ " 89.850000 | \n",
+ " 3786.600000 | \n",
+ " 1.000000 | \n",
+ "
\n",
+ " \n",
+ " | max | \n",
+ " 1.000000 | \n",
+ " 72.000000 | \n",
+ " 118.750000 | \n",
+ " 8684.800000 | \n",
+ " 1.000000 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
\n",
+ "
\n",
+ "
\n"
+ ],
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "dataframe",
+ "summary": "{\n \"name\": \"display(df_clean\",\n \"rows\": 8,\n \"fields\": [\n {\n \"column\": \"SeniorCitizen\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2489.9992387084,\n \"min\": 0.0,\n \"max\": 7043.0,\n \"num_unique_values\": 5,\n \"samples\": [\n 0.1621468124378816,\n 1.0,\n 0.36861160561002687\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"tenure\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2478.9752758409018,\n \"min\": 0.0,\n \"max\": 7043.0,\n \"num_unique_values\": 8,\n \"samples\": [\n 32.37114865824223,\n 29.0,\n 7043.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"MonthlyCharges\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2468.7047672837775,\n \"min\": 18.25,\n \"max\": 7043.0,\n \"num_unique_values\": 8,\n \"samples\": [\n 64.76169246059918,\n 70.35,\n 7043.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"TotalCharges\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 3119.0484860242914,\n \"min\": 18.8,\n \"max\": 8684.8,\n \"num_unique_values\": 8,\n \"samples\": [\n 2281.9169281556156,\n 1397.475,\n 7043.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Churn_binary\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2489.939844235915,\n \"min\": 0.0,\n \"max\": 7043.0,\n \"num_unique_values\": 5,\n \"samples\": [\n 0.2653698707936959,\n 1.0,\n 0.44156130512195013\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
+ }
+ },
+ "metadata": {}
+ }
+ ],
+ "source": [
+ "# ββ Descriptive statistics on real data ββββββββββββββββββββββββββββββββββββββ\n",
+ "print('π Descriptive statistics (numeric columns):')\n",
+ "display(df_clean.describe())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "RpDjH_T6ulFs"
+ },
+ "source": [
+ "---\n",
+ "## π€ SECTION 4: Synthetic Support Interaction Data Generation\n",
+ "\n",
+ "### [SYNTHETIC DATASET GENERATION]\n",
+ "\n",
+ "**Why synthetic data?** \n",
+ "Real telecom datasets do not include detailed support call logs. We simulate realistic support interaction variables that are **statistically correlated** with churn β just as a real company's CRM data would show.\n",
+ "\n",
+ "**Variables we create:**\n",
+ "| Variable | Description |\n",
+ "|---|---|\n",
+ "| `support_calls` | Number of support calls made in the last 6 months |\n",
+ "| `avg_call_duration` | Average call duration in minutes |\n",
+ "| `complaint_type` | Type of most frequent complaint |\n",
+ "| `days_since_last_contact` | Days since the customer last contacted support |\n",
+ "| `last_contact_sentiment` | Text of the last customer feedback |\n",
+ "| `sentiment_score` | VADER compound sentiment score |\n",
+ "| `support_churn_risk` | Composite risk score (Low / Medium / High) |"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "NK03tdXdulFs",
+ "outputId": "6a003b55-cad7-4b8e-9ddb-426a7d5f9135"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "β
Feedback templates and complaint types defined.\n"
+ ]
+ }
+ ],
+ "source": [
+ "# ββ Helper: realistic sentiment phrases per churn status βββββββββββββββββββββ\n",
+ "POSITIVE_FEEDBACK = [\n",
+ " \"The support agent was very helpful and solved my issue quickly.\",\n",
+ " \"Great service, no complaints at all!\",\n",
+ " \"Fast resolution. I am happy with the service.\",\n",
+ " \"The team was professional and friendly. Very satisfied.\",\n",
+ " \"Everything was resolved in one call. Excellent experience.\",\n",
+ " \"I love this company, always responsive and caring.\",\n",
+ " \"No issues, service works perfectly. Very happy customer.\"\n",
+ "]\n",
+ "\n",
+ "NEUTRAL_FEEDBACK = [\n",
+ " \"The wait time was long but the issue was eventually resolved.\",\n",
+ " \"Average experience. Could be better.\",\n",
+ " \"Service is okay. Nothing special.\",\n",
+ " \"The agent was polite but the problem took two calls to fix.\",\n",
+ " \"Acceptable support, but I expected faster resolution.\"\n",
+ "]\n",
+ "\n",
+ "NEGATIVE_FEEDBACK = [\n",
+ " \"I have called five times and the problem is still not fixed!\",\n",
+ " \"Terrible service. I am thinking of switching providers.\",\n",
+ " \"The agents are unhelpful and the wait times are ridiculous.\",\n",
+ " \"I am very frustrated. Nobody seems to care about my problem.\",\n",
+ " \"Worst customer service I have ever experienced. Cancelling soon.\",\n",
+ " \"My bill is wrong again. This is the third time this month!\",\n",
+ " \"I am extremely disappointed. No follow-up, no resolution.\"\n",
+ "]\n",
+ "\n",
+ "COMPLAINT_TYPES = [\n",
+ " 'Billing Issue', 'Service Outage', 'Speed/Performance',\n",
+ " 'Contract Dispute', 'Technical Failure', 'Overcharge'\n",
+ "]\n",
+ "\n",
+ "print('β
Feedback templates and complaint types defined.')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "EIzwa0U3ulFs",
+ "outputId": "0aa0c944-1e0d-4df3-cc48-15acc7a590ae"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "β
Synthetic support variables generated!\n",
+ " support_calls range : 1 β 15\n",
+ " avg_call_duration range: 3.0 β 35.0 min\n",
+ " Sample complaint types : {'Contract Dispute', 'Billing Issue', 'Technical Failure', 'Service Outage'}\n"
+ ]
+ }
+ ],
+ "source": [
+ "# ββ Generate synthetic support variables ββββββββββββββββββββββββββββββββββββββ\n",
+ "n = len(df_clean)\n",
+ "churn_flag = df_clean['Churn_binary'].values\n",
+ "\n",
+ "# support_calls: churners call more (5-15 calls), non-churners call less (1-6)\n",
+ "support_calls = np.where(\n",
+ " churn_flag == 1,\n",
+ " np.random.randint(5, 16, n),\n",
+ " np.random.randint(1, 7, n)\n",
+ ")\n",
+ "\n",
+ "# avg_call_duration: churners have longer calls (frustration)\n",
+ "avg_call_duration = np.where(\n",
+ " churn_flag == 1,\n",
+ " np.round(np.random.uniform(12, 35, n), 1),\n",
+ " np.round(np.random.uniform(3, 15, n), 1)\n",
+ ")\n",
+ "\n",
+ "# complaint_type: random, but churners have heavier billing/contract issues\n",
+ "def pick_complaint(is_churner):\n",
+ " if is_churner:\n",
+ " # weight toward billing and contract disputes\n",
+ " weights = [0.30, 0.15, 0.15, 0.20, 0.10, 0.10]\n",
+ " else:\n",
+ " weights = [0.20, 0.20, 0.20, 0.10, 0.20, 0.10]\n",
+ " return random.choices(COMPLAINT_TYPES, weights=weights, k=1)[0]\n",
+ "\n",
+ "complaint_type = [pick_complaint(c) for c in churn_flag]\n",
+ "\n",
+ "# days_since_last_contact: churners contacted recently (about to leave)\n",
+ "days_since_last_contact = np.where(\n",
+ " churn_flag == 1,\n",
+ " np.random.randint(1, 30, n),\n",
+ " np.random.randint(15, 90, n)\n",
+ ")\n",
+ "\n",
+ "# last_contact_sentiment: text phrase matching churn likelihood\n",
+ "def pick_sentiment_text(is_churner):\n",
+ " if is_churner:\n",
+ " # 70% negative, 20% neutral, 10% positive\n",
+ " pool = (NEGATIVE_FEEDBACK * 7) + (NEUTRAL_FEEDBACK * 2) + (POSITIVE_FEEDBACK * 1)\n",
+ " else:\n",
+ " # 10% negative, 20% neutral, 70% positive\n",
+ " pool = (POSITIVE_FEEDBACK * 7) + (NEUTRAL_FEEDBACK * 2) + (NEGATIVE_FEEDBACK * 1)\n",
+ " return random.choice(pool)\n",
+ "\n",
+ "last_contact_sentiment = [pick_sentiment_text(c) for c in churn_flag]\n",
+ "\n",
+ "print('β
Synthetic support variables generated!')\n",
+ "print(f' support_calls range : {support_calls.min()} β {support_calls.max()}')\n",
+ "print(f' avg_call_duration range: {avg_call_duration.min()} β {avg_call_duration.max()} min')\n",
+ "print(f' Sample complaint types : {set(complaint_type[:10])}')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "Yak2n7gpulFt",
+ "outputId": "ff972046-295c-44c0-c4bb-e0d6c59cb165"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "β
VADER sentiment scores computed!\n",
+ " Score range: -0.710 to 0.872\n",
+ " Mean score : 0.307\n"
+ ]
+ }
+ ],
+ "source": [
+ "# ββ Compute VADER sentiment score βββββββββββββββββββββββββββββββββββββββββββββ\n",
+ "analyzer = SentimentIntensityAnalyzer()\n",
+ "\n",
+ "def get_compound_score(text):\n",
+ " \"\"\"Return VADER compound score: -1 (most negative) to +1 (most positive)\"\"\"\n",
+ " return analyzer.polarity_scores(text)['compound']\n",
+ "\n",
+ "sentiment_score = [get_compound_score(text) for text in last_contact_sentiment]\n",
+ "\n",
+ "print('β
VADER sentiment scores computed!')\n",
+ "print(f' Score range: {min(sentiment_score):.3f} to {max(sentiment_score):.3f}')\n",
+ "print(f' Mean score : {np.mean(sentiment_score):.3f}')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "KJ3wGj9mulFt",
+ "outputId": "6b8e8fd1-6eaf-4f77-8e37-6218bea1e414"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "β
support_churn_risk categories created!\n",
+ " Distribution: Counter({'Medium': 4166, 'Low': 1471, 'High': 1406})\n"
+ ]
+ }
+ ],
+ "source": [
+ "# ββ Compute composite support_churn_risk ββββββββββββββββββββββββββββββββββββββ\n",
+ "# Logic:\n",
+ "# HIGH = many calls (>=6) AND negative sentiment (<= -0.3)\n",
+ "# MEDIUM = moderate calls (3-5) OR somewhat negative (-0.3 to 0)\n",
+ "# LOW = everything else\n",
+ "\n",
+ "def compute_risk(calls, score):\n",
+ " if calls >= 6 and score <= -0.3:\n",
+ " return 'High'\n",
+ " elif calls >= 3 or score <= 0.0:\n",
+ " return 'Medium'\n",
+ " else:\n",
+ " return 'Low'\n",
+ "\n",
+ "support_churn_risk = [\n",
+ " compute_risk(c, s) for c, s in zip(support_calls, sentiment_score)\n",
+ "]\n",
+ "\n",
+ "print('β
support_churn_risk categories created!')\n",
+ "from collections import Counter\n",
+ "print(' Distribution:', Counter(support_churn_risk))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Synthetic Data Design & Assumptions\n",
+ "\n",
+ "### Why Synthetic Data Was Created\n",
+ "\n",
+ "The original Telco dataset does not include detailed information about customer support interactions or behavioral signals such as sentiment. However, in real-world business settings, these factors play a critical role in customer churn.\n",
+ "\n",
+ "To better approximate real-world conditions, we generated synthetic variables representing:\n",
+ "- number of support calls\n",
+ "- average call duration\n",
+ "- complaint type\n",
+ "- time since last interaction\n",
+ "- sentiment score\n",
+ "- support-based churn risk\n",
+ "\n",
+ "These variables allow us to enrich the dataset and create a more realistic narrative around customer experience and dissatisfaction.\n",
+ "\n",
+ "---\n",
+ "\n",
+ "### Key Assumptions\n",
+ "\n",
+ "The synthetic data generation is based on the following assumptions:\n",
+ "\n",
+ "- Customers who churn tend to have **more frequent support interactions**\n",
+ "- Negative experiences lead to **lower sentiment scores**\n",
+ "- Certain complaint types (e.g., technical failures) are more strongly associated with dissatisfaction\n",
+ "- Customers with unresolved issues are more likely to churn\n",
+ "\n",
+ "These assumptions are grounded in typical telecom business logic but are not directly observed in the original dataset.\n",
+ "\n",
+ "---\n",
+ "\n",
+ "### Limitations of Synthetic Data\n",
+ "\n",
+ "Because these variables are artificially generated:\n",
+ "- They may introduce **bias toward expected relationships**\n",
+ "- They are not independent of the churn outcome\n",
+ "- They may **overstate model performance**\n",
+ "\n",
+ "Therefore, results should be interpreted as illustrative rather than fully generalizable."
+ ],
+ "metadata": {
+ "id": "zvVoEV0Fl77_"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "7TZ0lxJQulFt"
+ },
+ "source": [
+ "---\n",
+ "## π SECTION 5: Merge Real + Synthetic Data\n",
+ "\n",
+ "We now combine the cleaned real-world dataset with our synthetic support variables into one rich dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 330
+ },
+ "id": "7tMGLUqvulFt",
+ "outputId": "d5013de1-0912-46b4-886e-e077adc64ac3"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "β
Final merged dataset: 7043 rows Γ 21 columns\n"
+ ]
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ " customerID gender SeniorCitizen Partner Dependents tenure \\\n",
+ "0 7590-VHVEG Female 0 Yes No 1 \n",
+ "1 5575-GNVDE Male 0 No No 34 \n",
+ "2 3668-QPYBK Male 0 No No 2 \n",
+ "3 7795-CFOCW Male 0 No No 45 \n",
+ "4 9237-HQITU Female 0 No No 2 \n",
+ "\n",
+ " Contract PaymentMethod MonthlyCharges TotalCharges \\\n",
+ "0 Month-to-month Electronic check 29.85 29.85 \n",
+ "1 One year Mailed check 56.95 1889.50 \n",
+ "2 Month-to-month Mailed check 53.85 108.15 \n",
+ "3 One year Bank transfer (automatic) 42.30 1840.75 \n",
+ "4 Month-to-month Electronic check 70.70 151.65 \n",
+ "\n",
+ " InternetService TechSupport Churn Churn_binary support_calls \\\n",
+ "0 DSL No No 0 2 \n",
+ "1 DSL No No 0 4 \n",
+ "2 DSL No Yes 1 15 \n",
+ "3 DSL Yes No 0 3 \n",
+ "4 Fiber optic No Yes 1 9 \n",
+ "\n",
+ " avg_call_duration complaint_type days_since_last_contact \\\n",
+ "0 9.7 Contract Dispute 16 \n",
+ "1 3.8 Billing Issue 57 \n",
+ "2 21.2 Billing Issue 12 \n",
+ "3 9.8 Service Outage 40 \n",
+ "4 12.3 Contract Dispute 24 \n",
+ "\n",
+ " last_contact_sentiment sentiment_score \\\n",
+ "0 I love this company, always responsive and car... 0.8720 \n",
+ "1 Great service, no complaints at all! 0.7684 \n",
+ "2 My bill is wrong again. This is the third time... -0.5255 \n",
+ "3 Everything was resolved in one call. Excellent... 0.6597 \n",
+ "4 My bill is wrong again. This is the third time... -0.5255 \n",
+ "\n",
+ " support_churn_risk \n",
+ "0 Low \n",
+ "1 Medium \n",
+ "2 High \n",
+ "3 Medium \n",
+ "4 High "
+ ],
+ "text/html": [
+ "\n",
+ " \n",
+ "
\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " customerID | \n",
+ " gender | \n",
+ " SeniorCitizen | \n",
+ " Partner | \n",
+ " Dependents | \n",
+ " tenure | \n",
+ " Contract | \n",
+ " PaymentMethod | \n",
+ " MonthlyCharges | \n",
+ " TotalCharges | \n",
+ " InternetService | \n",
+ " TechSupport | \n",
+ " Churn | \n",
+ " Churn_binary | \n",
+ " support_calls | \n",
+ " avg_call_duration | \n",
+ " complaint_type | \n",
+ " days_since_last_contact | \n",
+ " last_contact_sentiment | \n",
+ " sentiment_score | \n",
+ " support_churn_risk | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " | 0 | \n",
+ " 7590-VHVEG | \n",
+ " Female | \n",
+ " 0 | \n",
+ " Yes | \n",
+ " No | \n",
+ " 1 | \n",
+ " Month-to-month | \n",
+ " Electronic check | \n",
+ " 29.85 | \n",
+ " 29.85 | \n",
+ " DSL | \n",
+ " No | \n",
+ " No | \n",
+ " 0 | \n",
+ " 2 | \n",
+ " 9.7 | \n",
+ " Contract Dispute | \n",
+ " 16 | \n",
+ " I love this company, always responsive and car... | \n",
+ " 0.8720 | \n",
+ " Low | \n",
+ "
\n",
+ " \n",
+ " | 1 | \n",
+ " 5575-GNVDE | \n",
+ " Male | \n",
+ " 0 | \n",
+ " No | \n",
+ " No | \n",
+ " 34 | \n",
+ " One year | \n",
+ " Mailed check | \n",
+ " 56.95 | \n",
+ " 1889.50 | \n",
+ " DSL | \n",
+ " No | \n",
+ " No | \n",
+ " 0 | \n",
+ " 4 | \n",
+ " 3.8 | \n",
+ " Billing Issue | \n",
+ " 57 | \n",
+ " Great service, no complaints at all! | \n",
+ " 0.7684 | \n",
+ " Medium | \n",
+ "
\n",
+ " \n",
+ " | 2 | \n",
+ " 3668-QPYBK | \n",
+ " Male | \n",
+ " 0 | \n",
+ " No | \n",
+ " No | \n",
+ " 2 | \n",
+ " Month-to-month | \n",
+ " Mailed check | \n",
+ " 53.85 | \n",
+ " 108.15 | \n",
+ " DSL | \n",
+ " No | \n",
+ " Yes | \n",
+ " 1 | \n",
+ " 15 | \n",
+ " 21.2 | \n",
+ " Billing Issue | \n",
+ " 12 | \n",
+ " My bill is wrong again. This is the third time... | \n",
+ " -0.5255 | \n",
+ " High | \n",
+ "
\n",
+ " \n",
+ " | 3 | \n",
+ " 7795-CFOCW | \n",
+ " Male | \n",
+ " 0 | \n",
+ " No | \n",
+ " No | \n",
+ " 45 | \n",
+ " One year | \n",
+ " Bank transfer (automatic) | \n",
+ " 42.30 | \n",
+ " 1840.75 | \n",
+ " DSL | \n",
+ " Yes | \n",
+ " No | \n",
+ " 0 | \n",
+ " 3 | \n",
+ " 9.8 | \n",
+ " Service Outage | \n",
+ " 40 | \n",
+ " Everything was resolved in one call. Excellent... | \n",
+ " 0.6597 | \n",
+ " Medium | \n",
+ "
\n",
+ " \n",
+ " | 4 | \n",
+ " 9237-HQITU | \n",
+ " Female | \n",
+ " 0 | \n",
+ " No | \n",
+ " No | \n",
+ " 2 | \n",
+ " Month-to-month | \n",
+ " Electronic check | \n",
+ " 70.70 | \n",
+ " 151.65 | \n",
+ " Fiber optic | \n",
+ " No | \n",
+ " Yes | \n",
+ " 1 | \n",
+ " 9 | \n",
+ " 12.3 | \n",
+ " Contract Dispute | \n",
+ " 24 | \n",
+ " My bill is wrong again. This is the third time... | \n",
+ " -0.5255 | \n",
+ " High | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
\n",
+ "
\n",
+ "
\n"
+ ],
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "dataframe"
+ }
+ },
+ "metadata": {}
+ }
+ ],
+ "source": [
+ "# ββ Build the synthetic support DataFrame ββββββββββββββββββββββββββββββββββββ\n",
+ "df_support = pd.DataFrame({\n",
+ " 'support_calls' : support_calls,\n",
+ " 'avg_call_duration' : avg_call_duration,\n",
+ " 'complaint_type' : complaint_type,\n",
+ " 'days_since_last_contact': days_since_last_contact,\n",
+ " 'last_contact_sentiment' : last_contact_sentiment,\n",
+ " 'sentiment_score' : sentiment_score,\n",
+ " 'support_churn_risk' : support_churn_risk\n",
+ "})\n",
+ "\n",
+ "# ββ Merge with real-world data (index-aligned) ββββββββββββββββββββββββββββββββ\n",
+ "df_final = pd.concat([df_clean.reset_index(drop=True), df_support], axis=1)\n",
+ "\n",
+ "print(f'β
Final merged dataset: {df_final.shape[0]} rows Γ {df_final.shape[1]} columns')\n",
+ "display(df_final.head())"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "UAW-XT66ulFu",
+ "outputId": "32243187-b541-40b2-8069-894cbe430f8b"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "π Mean support_calls by Churn:\n",
+ "Churn\n",
+ "No 3.47\n",
+ "Yes 10.02\n",
+ "Name: support_calls, dtype: float64\n",
+ "\n",
+ "π Mean sentiment_score by Churn:\n",
+ "Churn\n",
+ "No 0.519\n",
+ "Yes -0.279\n",
+ "Name: sentiment_score, dtype: float64\n",
+ "\n",
+ "π support_churn_risk vs Churn crosstab:\n",
+ "Churn No Yes\n",
+ "support_churn_risk \n",
+ "High 0.080 0.920\n",
+ "Low 1.000 0.000\n",
+ "Medium 0.862 0.138\n"
+ ]
+ }
+ ],
+ "source": [
+ "# ββ Verification: check correlations make sense ββββββββββββββββββββββββββββββ\n",
+ "print('π Mean support_calls by Churn:')\n",
+ "print(df_final.groupby('Churn')['support_calls'].mean().round(2))\n",
+ "\n",
+ "print('\\nπ Mean sentiment_score by Churn:')\n",
+ "print(df_final.groupby('Churn')['sentiment_score'].mean().round(3))\n",
+ "\n",
+ "print('\\nπ support_churn_risk vs Churn crosstab:')\n",
+ "print(pd.crosstab(df_final['support_churn_risk'], df_final['Churn'], normalize='index').round(3))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "27am_r6QulFu"
+ },
+ "source": [
+ "---\n",
+ "## πΎ SECTION 6: Export Final Dataset"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "EfM7zP4XulFu",
+ "outputId": "5f76e254-0dc9-49ea-c6ff-3c202b00c78f"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "β
Dataset exported: customer_churn_support_dataset.csv\n",
+ " Rows : 7043\n",
+ " Columns : 21\n",
+ "\n",
+ "π Final column list:\n",
+ " β’ customerID\n",
+ " β’ gender\n",
+ " β’ SeniorCitizen\n",
+ " β’ Partner\n",
+ " β’ Dependents\n",
+ " β’ tenure\n",
+ " β’ Contract\n",
+ " β’ PaymentMethod\n",
+ " β’ MonthlyCharges\n",
+ " β’ TotalCharges\n",
+ " β’ InternetService\n",
+ " β’ TechSupport\n",
+ " β’ Churn\n",
+ " β’ Churn_binary\n",
+ " β’ support_calls\n",
+ " β’ avg_call_duration\n",
+ " β’ complaint_type\n",
+ " β’ days_since_last_contact\n",
+ " β’ last_contact_sentiment\n",
+ " β’ sentiment_score\n",
+ " β’ support_churn_risk\n"
+ ]
+ }
+ ],
+ "source": [
+ "# ββ Export to CSV βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n",
+ "OUTPUT_FILENAME = 'customer_churn_support_dataset.csv'\n",
+ "df_final.to_csv(OUTPUT_FILENAME, index=False)\n",
+ "\n",
+ "print(f'β
Dataset exported: {OUTPUT_FILENAME}')\n",
+ "print(f' Rows : {df_final.shape[0]}')\n",
+ "print(f' Columns : {df_final.shape[1]}')\n",
+ "print(f'\\nπ Final column list:')\n",
+ "for col in df_final.columns:\n",
+ " print(f' β’ {col}')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ },
+ "id": "HFWnx9zlulFu",
+ "outputId": "a9084d14-05d9-4cb2-98c6-a25491f9b6b1"
+ },
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ ""
+ ],
+ "application/javascript": [
+ "\n",
+ " async function download(id, filename, size) {\n",
+ " if (!google.colab.kernel.accessAllowed) {\n",
+ " return;\n",
+ " }\n",
+ " const div = document.createElement('div');\n",
+ " const label = document.createElement('label');\n",
+ " label.textContent = `Downloading \"${filename}\": `;\n",
+ " div.appendChild(label);\n",
+ " const progress = document.createElement('progress');\n",
+ " progress.max = size;\n",
+ " div.appendChild(progress);\n",
+ " document.body.appendChild(div);\n",
+ "\n",
+ " const buffers = [];\n",
+ " let downloaded = 0;\n",
+ "\n",
+ " const channel = await google.colab.kernel.comms.open(id);\n",
+ " // Send a message to notify the kernel that we're ready.\n",
+ " channel.send({})\n",
+ "\n",
+ " for await (const message of channel.messages) {\n",
+ " // Send a message to notify the kernel that we're ready.\n",
+ " channel.send({})\n",
+ " if (message.buffers) {\n",
+ " for (const buffer of message.buffers) {\n",
+ " buffers.push(buffer);\n",
+ " downloaded += buffer.byteLength;\n",
+ " progress.value = downloaded;\n",
+ " }\n",
+ " }\n",
+ " }\n",
+ " const blob = new Blob(buffers, {type: 'application/binary'});\n",
+ " const a = document.createElement('a');\n",
+ " a.href = window.URL.createObjectURL(blob);\n",
+ " a.download = filename;\n",
+ " div.appendChild(a);\n",
+ " a.click();\n",
+ " div.remove();\n",
+ " }\n",
+ " "
+ ]
+ },
+ "metadata": {}
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ ""
+ ],
+ "application/javascript": [
+ "download(\"download_160eb193-7480-4c6e-90ba-3271caaea521\", \"customer_churn_support_dataset.csv\", 1309271)"
+ ]
+ },
+ "metadata": {}
+ },
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "β
Download triggered! Check your Downloads folder.\n"
+ ]
+ }
+ ],
+ "source": [
+ "# ββ Download the file to your computer βββββββββββββββββββββββββββββββββββββββ\n",
+ "from google.colab import files\n",
+ "files.download(OUTPUT_FILENAME)\n",
+ "print('β
Download triggered! Check your Downloads folder.')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "wr8dQ0YQulFu"
+ },
+ "source": [
+ "---\n",
+ "## β
SECTION 7: Final Verification Checks"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "FK7d-FE2ulFu",
+ "outputId": "4e8cb3b5-08f8-4e0a-efd6-97b738be9181"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "============================================================\n",
+ " FINAL DATASET VERIFICATION REPORT\n",
+ "============================================================\n",
+ "\n",
+ "β
Shape : (7043, 21)\n",
+ "β
Missing values : 0 total\n",
+ "β
Churn rate : 26.5%\n",
+ "β
Risk categories : {'Medium': 4166, 'Low': 1471, 'High': 1406}\n",
+ "β
Complaint types : 6 unique\n",
+ "β
Sentiment range : -0.710 to 0.872\n",
+ "\n",
+ "============================================================\n",
+ " β
Notebook 1 COMPLETE! Proceed to Notebook 2.\n",
+ "============================================================\n"
+ ]
+ }
+ ],
+ "source": [
+ "# ββ Final verification ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n",
+ "print('=' * 60)\n",
+ "print(' FINAL DATASET VERIFICATION REPORT')\n",
+ "print('=' * 60)\n",
+ "\n",
+ "df_verify = pd.read_csv(OUTPUT_FILENAME)\n",
+ "\n",
+ "print(f'\\nβ
Shape : {df_verify.shape}')\n",
+ "print(f'β
Missing values : {df_verify.isnull().sum().sum()} total')\n",
+ "print(f'β
Churn rate : {df_verify[\"Churn_binary\"].mean()*100:.1f}%')\n",
+ "print(f'β
Risk categories : {df_verify[\"support_churn_risk\"].value_counts().to_dict()}')\n",
+ "print(f'β
Complaint types : {df_verify[\"complaint_type\"].nunique()} unique')\n",
+ "print(f'β
Sentiment range : {df_verify[\"sentiment_score\"].min():.3f} to {df_verify[\"sentiment_score\"].max():.3f}')\n",
+ "\n",
+ "print('\\n' + '=' * 60)\n",
+ "print(' β
Notebook 1 COMPLETE! Proceed to Notebook 2.')\n",
+ "print('=' * 60)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## SECTION 8: Final Dataset Summary"
+ ],
+ "metadata": {
+ "id": "7KUQ8dwMa9pK"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "print('=' * 60)\n",
+ "print(' NOTEBOOK 1 β FINAL DATASET SUMMARY')\n",
+ "print('=' * 60)\n",
+ "\n",
+ "print(f'\\nπ Shape: {df_final.shape[0]} rows Γ {df_final.shape[1]} columns')\n",
+ "\n",
+ "print('\\nπ Column overview:')\n",
+ "for col in df_final.columns:\n",
+ " dtype = str(df_final[col].dtype)\n",
+ " sample = str(df_final[col].iloc[0])[:40]\n",
+ " print(f' {col:<30} [{dtype:<10}] e.g. {sample}')\n",
+ "\n",
+ "print('\\nπ Real vs Synthetic columns:')\n",
+ "real_cols = ['customerID','gender','SeniorCitizen','Partner','Dependents',\n",
+ " 'tenure','Contract','PaymentMethod','MonthlyCharges',\n",
+ " 'TotalCharges','InternetService','TechSupport','Churn','Churn_binary']\n",
+ "synthetic_cols = ['support_calls','avg_call_duration','complaint_type',\n",
+ " 'days_since_last_contact','last_contact_sentiment',\n",
+ " 'sentiment_score','support_churn_risk']\n",
+ "\n",
+ "print(f' β
Real-world columns ({len(real_cols)}): {real_cols}')\n",
+ "print(f' π€ Synthetic columns ({len(synthetic_cols)}): {synthetic_cols}')\n",
+ "\n",
+ "print('\\nπ Sample rows:')\n",
+ "display(df_final[['customerID','tenure','Churn','support_calls',\n",
+ " 'sentiment_score','support_churn_risk']].head(5))\n",
+ "\n",
+ "\n",
+ "print('=' * 60)"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 833
+ },
+ "id": "XpmZTW0RahxT",
+ "outputId": "10a7c450-6a35-447c-ca9f-7db0aab8d5ce"
+ },
+ "execution_count": 21,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "============================================================\n",
+ " NOTEBOOK 1 β FINAL DATASET SUMMARY\n",
+ "============================================================\n",
+ "\n",
+ "π Shape: 7043 rows Γ 21 columns\n",
+ "\n",
+ "π Column overview:\n",
+ " customerID [object ] e.g. 7590-VHVEG\n",
+ " gender [object ] e.g. Female\n",
+ " SeniorCitizen [int64 ] e.g. 0\n",
+ " Partner [object ] e.g. Yes\n",
+ " Dependents [object ] e.g. No\n",
+ " tenure [int64 ] e.g. 1\n",
+ " Contract [object ] e.g. Month-to-month\n",
+ " PaymentMethod [object ] e.g. Electronic check\n",
+ " MonthlyCharges [float64 ] e.g. 29.85\n",
+ " TotalCharges [float64 ] e.g. 29.85\n",
+ " InternetService [object ] e.g. DSL\n",
+ " TechSupport [object ] e.g. No\n",
+ " Churn [object ] e.g. No\n",
+ " Churn_binary [int64 ] e.g. 0\n",
+ " support_calls [int64 ] e.g. 2\n",
+ " avg_call_duration [float64 ] e.g. 9.7\n",
+ " complaint_type [object ] e.g. Contract Dispute\n",
+ " days_since_last_contact [int64 ] e.g. 16\n",
+ " last_contact_sentiment [object ] e.g. I love this company, always responsive a\n",
+ " sentiment_score [float64 ] e.g. 0.872\n",
+ " support_churn_risk [object ] e.g. Low\n",
+ "\n",
+ "π Real vs Synthetic columns:\n",
+ " β
Real-world columns (14): ['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'Contract', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'InternetService', 'TechSupport', 'Churn', 'Churn_binary']\n",
+ " π€ Synthetic columns (7): ['support_calls', 'avg_call_duration', 'complaint_type', 'days_since_last_contact', 'last_contact_sentiment', 'sentiment_score', 'support_churn_risk']\n",
+ "\n",
+ "π Sample rows:\n"
+ ]
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ " customerID tenure Churn support_calls sentiment_score support_churn_risk\n",
+ "0 7590-VHVEG 1 No 2 0.8720 Low\n",
+ "1 5575-GNVDE 34 No 4 0.7684 Medium\n",
+ "2 3668-QPYBK 2 Yes 15 -0.5255 High\n",
+ "3 7795-CFOCW 45 No 3 0.6597 Medium\n",
+ "4 9237-HQITU 2 Yes 9 -0.5255 High"
+ ],
+ "text/html": [
+ "\n",
+ " \n",
+ "
\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " customerID | \n",
+ " tenure | \n",
+ " Churn | \n",
+ " support_calls | \n",
+ " sentiment_score | \n",
+ " support_churn_risk | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " | 0 | \n",
+ " 7590-VHVEG | \n",
+ " 1 | \n",
+ " No | \n",
+ " 2 | \n",
+ " 0.8720 | \n",
+ " Low | \n",
+ "
\n",
+ " \n",
+ " | 1 | \n",
+ " 5575-GNVDE | \n",
+ " 34 | \n",
+ " No | \n",
+ " 4 | \n",
+ " 0.7684 | \n",
+ " Medium | \n",
+ "
\n",
+ " \n",
+ " | 2 | \n",
+ " 3668-QPYBK | \n",
+ " 2 | \n",
+ " Yes | \n",
+ " 15 | \n",
+ " -0.5255 | \n",
+ " High | \n",
+ "
\n",
+ " \n",
+ " | 3 | \n",
+ " 7795-CFOCW | \n",
+ " 45 | \n",
+ " No | \n",
+ " 3 | \n",
+ " 0.6597 | \n",
+ " Medium | \n",
+ "
\n",
+ " \n",
+ " | 4 | \n",
+ " 9237-HQITU | \n",
+ " 2 | \n",
+ " Yes | \n",
+ " 9 | \n",
+ " -0.5255 | \n",
+ " High | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
\n",
+ "
\n",
+ "
\n"
+ ],
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "dataframe",
+ "summary": "{\n \"name\": \"print('=' * 60)\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"customerID\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"5575-GNVDE\",\n \"9237-HQITU\",\n \"3668-QPYBK\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"tenure\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 21,\n \"min\": 1,\n \"max\": 45,\n \"num_unique_values\": 4,\n \"samples\": [\n 34,\n 45,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Churn\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"Yes\",\n \"No\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"support_calls\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5,\n \"min\": 2,\n \"max\": 15,\n \"num_unique_values\": 5,\n \"samples\": [\n 4,\n 9\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"sentiment_score\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.7117367821041709,\n \"min\": -0.5255,\n \"max\": 0.872,\n \"num_unique_values\": 4,\n \"samples\": [\n 0.7684,\n 0.6597\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"support_churn_risk\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Low\",\n \"Medium\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
+ }
+ },
+ "metadata": {}
+ },
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "============================================================\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "nEAK367ZulFu"
+ },
+ "source": [
+ "---\n",
+ "## π Summary β What Was Done in This Notebook\n",
+ "\n",
+ "| Phase | Action | Result |\n",
+ "|---|---|---|\n",
+ "| **Real-World Data** | Downloaded Telco Churn from Kaggle | 7,043 real customers |\n",
+ "| **Cleaning** | Fixed TotalCharges, encoded Churn | Clean 13-column DataFrame |\n",
+ "| **Synthetic Generation** | Created 7 support variables with VADER | Statistically realistic |\n",
+ "| **Merge** | Combined real + synthetic | 20-column final dataset |\n",
+ "| **Export** | Saved as CSV | `customer_churn_support_dataset.csv` |\n",
+ "\n",
+ "**β‘οΈ Next Step:** Open `2_Churn_Data_Analysis_and_Insights.ipynb` and upload this CSV file."
+ ]
+ }
+ ]
+}
\ No newline at end of file