{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "pAyn432WulFm"
},
"source": [
"# π Notebook 1: Churn Data Creation and Processing\n",
"## AI for Big Data Management β ESCP Business School\n",
"### Final Group Project\n",
"\n",
"---\n",
"\n",
"## π Problem Statement\n",
"> **\"How can a company predict customer churn based on support interactions and proactively adapt its retention strategy?\"**\n",
"\n",
"\n",
"- We aim to predict and understand customer churn by combining structured telecom data with synthetic behavioral signals derived from customer support interactions.\n",
"---\n",
"\n",
"## πΊοΈ Project Pipeline\n",
"```\n",
"PROBLEM CREATION β REAL-WORLD DATA PROCESSING β SYNTHETIC DATASET GENERATION β AUTOMATION β WRAP-UP\n",
"```\n",
"\n",
"---\n",
"\n",
"## π What This Notebook Does\n",
"1. **[REAL-WORLD]** Loads the Telco Customer Churn dataset from Kaggle\n",
"2. **[REAL-WORLD]** Cleans and preprocesses the real data\n",
"3. **[SYNTHETIC]** Generates realistic support interaction variables\n",
"4. **[SYNTHETIC]** Creates a merged, enriched final dataset\n",
"5. Exports `customer_churn_support_dataset.csv` for Notebook 2\n",
"\n",
"---\n",
"\n",
"### β οΈ Before Running\n",
"You need to upload one file:\n",
"- `WA_Fn-UseC_-Telco-Customer-Churn.csv` (downloaded from Kaggle)\n",
"\n",
"Upload it using the π Files panel on the left sidebar in Google Colab.\n",
"All other data is generated synthetically in this notebook."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "acacAe8GulFp"
},
"source": [
"---\n",
"## π¦ SECTION 1: Install & Import Libraries\n",
"Run this cell first. It installs VADER for sentiment analysis and imports all necessary libraries."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "p5lEOq-1ulFq",
"outputId": "09999301-c22f-49db-896f-5344d4c0322d"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"β
All libraries imported successfully!\n",
" pandas : 2.2.2\n",
" numpy : 2.0.2\n"
]
}
],
"source": [
"# ββ Install required packages ββββββββββββββββββββββββββββββββββββββββββββββββββ\n",
"!pip install vaderSentiment --quiet\n",
"\n",
"# ββ Standard imports ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"import random\n",
"from datetime import datetime, timedelta\n",
"from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer\n",
"\n",
"# ββ Settings ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n",
"np.random.seed(42) # reproducibility\n",
"random.seed(42)\n",
"pd.set_option('display.max_columns', None)\n",
"\n",
"print('β
All libraries imported successfully!')\n",
"print(f' pandas : {pd.__version__}')\n",
"print(f' numpy : {np.__version__}')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hrZYSJs4ulFr"
},
"source": [
"---\n",
"## π₯ SECTION 2: Load the Real-World Dataset\n",
"\n",
"### [REAL-WORLD DATA PROCESSING]\n",
"\n",
"**Where to download the dataset:**\n",
"1. Go to: https://www.kaggle.com/datasets/blastchar/telco-customer-churn\n",
"2. Click the **Download** button (top right)\n",
"3. Unzip the file β you will get: `WA_Fn-UseC_-Telco-Customer-Churn.csv`\n",
"4. In Google Colab, click the π folder icon on the left sidebar\n",
"5. Click the β¬οΈ Upload button and select the CSV file\n",
"6. Wait for the upload to finish, then run the cell below\n",
"\n",
"**Dataset info:** IBM Telco Customer Churn β 7,043 real customers with billing, contract, and service information."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 382
},
"id": "2BkVJSBzulFr",
"outputId": "a11033cd-94ba-4ba8-9387-edfdb4ca27bb"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"β
Dataset loaded successfully!\n",
" Shape: 7043 rows Γ 21 columns\n",
"\n",
"π First 5 rows:\n"
]
},
{
"output_type": "display_data",
"data": {
"text/plain": [
" customerID gender SeniorCitizen Partner Dependents tenure PhoneService \\\n",
"0 7590-VHVEG Female 0 Yes No 1 No \n",
"1 5575-GNVDE Male 0 No No 34 Yes \n",
"2 3668-QPYBK Male 0 No No 2 Yes \n",
"3 7795-CFOCW Male 0 No No 45 No \n",
"4 9237-HQITU Female 0 No No 2 Yes \n",
"\n",
" MultipleLines InternetService OnlineSecurity OnlineBackup \\\n",
"0 No phone service DSL No Yes \n",
"1 No DSL Yes No \n",
"2 No DSL Yes Yes \n",
"3 No phone service DSL Yes No \n",
"4 No Fiber optic No No \n",
"\n",
" DeviceProtection TechSupport StreamingTV StreamingMovies Contract \\\n",
"0 No No No No Month-to-month \n",
"1 Yes No No No One year \n",
"2 No No No No Month-to-month \n",
"3 Yes Yes No No One year \n",
"4 No No No No Month-to-month \n",
"\n",
" PaperlessBilling PaymentMethod MonthlyCharges TotalCharges \\\n",
"0 Yes Electronic check 29.85 29.85 \n",
"1 No Mailed check 56.95 1889.5 \n",
"2 Yes Mailed check 53.85 108.15 \n",
"3 No Bank transfer (automatic) 42.30 1840.75 \n",
"4 Yes Electronic check 70.70 151.65 \n",
"\n",
" Churn \n",
"0 No \n",
"1 No \n",
"2 Yes \n",
"3 No \n",
"4 Yes "
],
"text/html": [
"\n",
"
\n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" customerID | \n",
" gender | \n",
" SeniorCitizen | \n",
" Partner | \n",
" Dependents | \n",
" tenure | \n",
" PhoneService | \n",
" MultipleLines | \n",
" InternetService | \n",
" OnlineSecurity | \n",
" OnlineBackup | \n",
" DeviceProtection | \n",
" TechSupport | \n",
" StreamingTV | \n",
" StreamingMovies | \n",
" Contract | \n",
" PaperlessBilling | \n",
" PaymentMethod | \n",
" MonthlyCharges | \n",
" TotalCharges | \n",
" Churn | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 7590-VHVEG | \n",
" Female | \n",
" 0 | \n",
" Yes | \n",
" No | \n",
" 1 | \n",
" No | \n",
" No phone service | \n",
" DSL | \n",
" No | \n",
" Yes | \n",
" No | \n",
" No | \n",
" No | \n",
" No | \n",
" Month-to-month | \n",
" Yes | \n",
" Electronic check | \n",
" 29.85 | \n",
" 29.85 | \n",
" No | \n",
"
\n",
" \n",
" | 1 | \n",
" 5575-GNVDE | \n",
" Male | \n",
" 0 | \n",
" No | \n",
" No | \n",
" 34 | \n",
" Yes | \n",
" No | \n",
" DSL | \n",
" Yes | \n",
" No | \n",
" Yes | \n",
" No | \n",
" No | \n",
" No | \n",
" One year | \n",
" No | \n",
" Mailed check | \n",
" 56.95 | \n",
" 1889.5 | \n",
" No | \n",
"
\n",
" \n",
" | 2 | \n",
" 3668-QPYBK | \n",
" Male | \n",
" 0 | \n",
" No | \n",
" No | \n",
" 2 | \n",
" Yes | \n",
" No | \n",
" DSL | \n",
" Yes | \n",
" Yes | \n",
" No | \n",
" No | \n",
" No | \n",
" No | \n",
" Month-to-month | \n",
" Yes | \n",
" Mailed check | \n",
" 53.85 | \n",
" 108.15 | \n",
" Yes | \n",
"
\n",
" \n",
" | 3 | \n",
" 7795-CFOCW | \n",
" Male | \n",
" 0 | \n",
" No | \n",
" No | \n",
" 45 | \n",
" No | \n",
" No phone service | \n",
" DSL | \n",
" Yes | \n",
" No | \n",
" Yes | \n",
" Yes | \n",
" No | \n",
" No | \n",
" One year | \n",
" No | \n",
" Bank transfer (automatic) | \n",
" 42.30 | \n",
" 1840.75 | \n",
" No | \n",
"
\n",
" \n",
" | 4 | \n",
" 9237-HQITU | \n",
" Female | \n",
" 0 | \n",
" No | \n",
" No | \n",
" 2 | \n",
" Yes | \n",
" No | \n",
" Fiber optic | \n",
" No | \n",
" No | \n",
" No | \n",
" No | \n",
" No | \n",
" No | \n",
" Month-to-month | \n",
" Yes | \n",
" Electronic check | \n",
" 70.70 | \n",
" 151.65 | \n",
" Yes | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe"
}
},
"metadata": {}
}
],
"source": [
"# ββ Load the real-world Telco Churn dataset ββββββββββββββββββββββββββββββββββββ\n",
"# If you renamed your file differently, change the filename below\n",
"DATASET_FILENAME = 'WA_Fn-UseC_-Telco-Customer-Churn.csv'\n",
"\n",
"try:\n",
" df_real = pd.read_csv(DATASET_FILENAME)\n",
" print(f'β
Dataset loaded successfully!')\n",
" print(f' Shape: {df_real.shape[0]} rows Γ {df_real.shape[1]} columns')\n",
" print(f'\\nπ First 5 rows:')\n",
" display(df_real.head())\n",
"except FileNotFoundError:\n",
" print('β ERROR: File not found!')\n",
" print(' Please upload the CSV file to Colab (see instructions above).')\n",
" print(' File expected:', DATASET_FILENAME)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "0ZxflowUulFr",
"outputId": "d7356987-2ce1-4eff-cf0e-8263f2c1942d"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"π Column names:\n",
"['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn']\n",
"\n",
"π Data types:\n",
"customerID object\n",
"gender object\n",
"SeniorCitizen int64\n",
"Partner object\n",
"Dependents object\n",
"tenure int64\n",
"PhoneService object\n",
"MultipleLines object\n",
"InternetService object\n",
"OnlineSecurity object\n",
"OnlineBackup object\n",
"DeviceProtection object\n",
"TechSupport object\n",
"StreamingTV object\n",
"StreamingMovies object\n",
"Contract object\n",
"PaperlessBilling object\n",
"PaymentMethod object\n",
"MonthlyCharges float64\n",
"TotalCharges object\n",
"Churn object\n",
"dtype: object\n",
"\n",
"π Missing values per column:\n",
"customerID 0\n",
"gender 0\n",
"SeniorCitizen 0\n",
"Partner 0\n",
"Dependents 0\n",
"tenure 0\n",
"PhoneService 0\n",
"MultipleLines 0\n",
"InternetService 0\n",
"OnlineSecurity 0\n",
"OnlineBackup 0\n",
"DeviceProtection 0\n",
"TechSupport 0\n",
"StreamingTV 0\n",
"StreamingMovies 0\n",
"Contract 0\n",
"PaperlessBilling 0\n",
"PaymentMethod 0\n",
"MonthlyCharges 0\n",
"TotalCharges 0\n",
"Churn 0\n",
"dtype: int64\n",
"\n",
"π Churn distribution (real data):\n",
"Churn\n",
"No 5174\n",
"Yes 1869\n",
"Name: count, dtype: int64\n",
"\n",
"π Churn rate (real data): 26.54 %\n"
]
}
],
"source": [
"# ββ Basic exploration of real dataset βββββββββββββββββββββββββββββββββββββββββ\n",
"print('π Column names:')\n",
"print(df_real.columns.tolist())\n",
"print('\\nπ Data types:')\n",
"print(df_real.dtypes)\n",
"print('\\nπ Missing values per column:')\n",
"print(df_real.isnull().sum())\n",
"print('\\nπ Churn distribution (real data):')\n",
"print(df_real['Churn'].value_counts())\n",
"print('\\nπ Churn rate (real data):', round(df_real['Churn'].value_counts(normalize=True)['Yes'] * 100, 2), '%')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Qtj_34hvulFr"
},
"source": [
"---\n",
"## π§Ή SECTION 3: Real-World Data Cleaning\n",
"\n",
"### [REAL-WORLD DATA PROCESSING β continued]\n",
"\n",
"This section handles:\n",
"- Converting `TotalCharges` to numeric (it arrives as a string with spaces)\n",
"- Filling missing values\n",
"- Encoding the target variable `Churn` as 0/1\n",
"- Selecting the columns we need"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 729
},
"id": "jpJKPzueulFs",
"outputId": "99e08519-c0e8-495c-eb84-09af613c0a4c"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"β
TotalCharges NaN values filled with median: 1397.47\n",
"β
Churn encoded: Yes=1, No=0\n",
"\n",
"β
SeniorCitizen unique values: [0 1]\n",
"\n",
"β
Cleaned dataset shape: (7043, 14)\n",
" Missing values after cleaning:\n",
"customerID 0\n",
"gender 0\n",
"SeniorCitizen 0\n",
"Partner 0\n",
"Dependents 0\n",
"tenure 0\n",
"Contract 0\n",
"PaymentMethod 0\n",
"MonthlyCharges 0\n",
"TotalCharges 0\n",
"InternetService 0\n",
"TechSupport 0\n",
"Churn 0\n",
"Churn_binary 0\n",
"dtype: int64\n"
]
},
{
"output_type": "stream",
"name": "stderr",
"text": [
"/tmp/ipykernel_2585/1930681133.py:6: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.\n",
"The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.\n",
"\n",
"For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.\n",
"\n",
"\n",
" df_real['TotalCharges'].fillna(median_total, inplace=True)\n"
]
},
{
"output_type": "display_data",
"data": {
"text/plain": [
" customerID gender SeniorCitizen Partner Dependents tenure \\\n",
"0 7590-VHVEG Female 0 Yes No 1 \n",
"1 5575-GNVDE Male 0 No No 34 \n",
"2 3668-QPYBK Male 0 No No 2 \n",
"3 7795-CFOCW Male 0 No No 45 \n",
"4 9237-HQITU Female 0 No No 2 \n",
"\n",
" Contract PaymentMethod MonthlyCharges TotalCharges \\\n",
"0 Month-to-month Electronic check 29.85 29.85 \n",
"1 One year Mailed check 56.95 1889.50 \n",
"2 Month-to-month Mailed check 53.85 108.15 \n",
"3 One year Bank transfer (automatic) 42.30 1840.75 \n",
"4 Month-to-month Electronic check 70.70 151.65 \n",
"\n",
" InternetService TechSupport Churn Churn_binary \n",
"0 DSL No No 0 \n",
"1 DSL No No 0 \n",
"2 DSL No Yes 1 \n",
"3 DSL Yes No 0 \n",
"4 Fiber optic No Yes 1 "
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" customerID | \n",
" gender | \n",
" SeniorCitizen | \n",
" Partner | \n",
" Dependents | \n",
" tenure | \n",
" Contract | \n",
" PaymentMethod | \n",
" MonthlyCharges | \n",
" TotalCharges | \n",
" InternetService | \n",
" TechSupport | \n",
" Churn | \n",
" Churn_binary | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 7590-VHVEG | \n",
" Female | \n",
" 0 | \n",
" Yes | \n",
" No | \n",
" 1 | \n",
" Month-to-month | \n",
" Electronic check | \n",
" 29.85 | \n",
" 29.85 | \n",
" DSL | \n",
" No | \n",
" No | \n",
" 0 | \n",
"
\n",
" \n",
" | 1 | \n",
" 5575-GNVDE | \n",
" Male | \n",
" 0 | \n",
" No | \n",
" No | \n",
" 34 | \n",
" One year | \n",
" Mailed check | \n",
" 56.95 | \n",
" 1889.50 | \n",
" DSL | \n",
" No | \n",
" No | \n",
" 0 | \n",
"
\n",
" \n",
" | 2 | \n",
" 3668-QPYBK | \n",
" Male | \n",
" 0 | \n",
" No | \n",
" No | \n",
" 2 | \n",
" Month-to-month | \n",
" Mailed check | \n",
" 53.85 | \n",
" 108.15 | \n",
" DSL | \n",
" No | \n",
" Yes | \n",
" 1 | \n",
"
\n",
" \n",
" | 3 | \n",
" 7795-CFOCW | \n",
" Male | \n",
" 0 | \n",
" No | \n",
" No | \n",
" 45 | \n",
" One year | \n",
" Bank transfer (automatic) | \n",
" 42.30 | \n",
" 1840.75 | \n",
" DSL | \n",
" Yes | \n",
" No | \n",
" 0 | \n",
"
\n",
" \n",
" | 4 | \n",
" 9237-HQITU | \n",
" Female | \n",
" 0 | \n",
" No | \n",
" No | \n",
" 2 | \n",
" Month-to-month | \n",
" Electronic check | \n",
" 70.70 | \n",
" 151.65 | \n",
" Fiber optic | \n",
" No | \n",
" Yes | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"display(df_clean\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"customerID\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"5575-GNVDE\",\n \"9237-HQITU\",\n \"3668-QPYBK\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"gender\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"Male\",\n \"Female\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"SeniorCitizen\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 0,\n \"num_unique_values\": 1,\n \"samples\": [\n 0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Partner\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"No\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Dependents\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"No\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"tenure\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 21,\n \"min\": 1,\n \"max\": 45,\n \"num_unique_values\": 4,\n \"samples\": [\n 34\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Contract\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"One year\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"PaymentMethod\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Electronic check\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"MonthlyCharges\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 15.445573799635934,\n \"min\": 29.85,\n \"max\": 70.7,\n \"num_unique_values\": 5,\n \"samples\": [\n 56.95\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"TotalCharges\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 969.8243111512518,\n \"min\": 29.85,\n \"max\": 1889.5,\n \"num_unique_values\": 5,\n \"samples\": [\n 1889.5\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"InternetService\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"Fiber optic\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"TechSupport\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"Yes\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Churn\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"Yes\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Churn_binary\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 1,\n \"num_unique_values\": 2,\n \"samples\": [\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {}
}
],
"source": [
"# ββ Step 1: Fix TotalCharges column (arrives as string) ββββββββββββββββββββββ\n",
"df_real['TotalCharges'] = pd.to_numeric(df_real['TotalCharges'], errors='coerce')\n",
"\n",
"# ββ Step 2: Fill missing TotalCharges with median ββββββββββββββββββββββββββββ\n",
"median_total = df_real['TotalCharges'].median()\n",
"df_real['TotalCharges'].fillna(median_total, inplace=True)\n",
"print(f'β
TotalCharges NaN values filled with median: {median_total:.2f}')\n",
"\n",
"# ββ Step 3: Encode Churn as 0 / 1 ββββββββββββββββββββββββββββββββββββββββββββ\n",
"df_real['Churn_binary'] = df_real['Churn'].map({'Yes': 1, 'No': 0})\n",
"print(f'β
Churn encoded: Yes=1, No=0')\n",
"\n",
"# ββ Step 4: Encode SeniorCitizen (already 0/1, but verify) βββββββββββββββββββ\n",
"print(f'\\nβ
SeniorCitizen unique values: {df_real[\"SeniorCitizen\"].unique()}')\n",
"\n",
"# ββ Step 5: Select core columns for our analysis βββββββββββββββββββββββββββββ\n",
"core_cols = [\n",
" 'customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',\n",
" 'tenure', 'Contract', 'PaymentMethod', 'MonthlyCharges',\n",
" 'TotalCharges', 'InternetService', 'TechSupport',\n",
" 'Churn', 'Churn_binary'\n",
"]\n",
"df_clean = df_real[core_cols].copy()\n",
"\n",
"print(f'\\nβ
Cleaned dataset shape: {df_clean.shape}')\n",
"print(f' Missing values after cleaning:')\n",
"print(df_clean.isnull().sum())\n",
"display(df_clean.head())"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 317
},
"id": "ySxvvy5hulFs",
"outputId": "5c552cda-2f7f-4088-e11e-0d07e0886538"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"π Descriptive statistics (numeric columns):\n"
]
},
{
"output_type": "display_data",
"data": {
"text/plain": [
" SeniorCitizen tenure MonthlyCharges TotalCharges Churn_binary\n",
"count 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000\n",
"mean 0.162147 32.371149 64.761692 2281.916928 0.265370\n",
"std 0.368612 24.559481 30.090047 2265.270398 0.441561\n",
"min 0.000000 0.000000 18.250000 18.800000 0.000000\n",
"25% 0.000000 9.000000 35.500000 402.225000 0.000000\n",
"50% 0.000000 29.000000 70.350000 1397.475000 0.000000\n",
"75% 0.000000 55.000000 89.850000 3786.600000 1.000000\n",
"max 1.000000 72.000000 118.750000 8684.800000 1.000000"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" SeniorCitizen | \n",
" tenure | \n",
" MonthlyCharges | \n",
" TotalCharges | \n",
" Churn_binary | \n",
"
\n",
" \n",
" \n",
" \n",
" | count | \n",
" 7043.000000 | \n",
" 7043.000000 | \n",
" 7043.000000 | \n",
" 7043.000000 | \n",
" 7043.000000 | \n",
"
\n",
" \n",
" | mean | \n",
" 0.162147 | \n",
" 32.371149 | \n",
" 64.761692 | \n",
" 2281.916928 | \n",
" 0.265370 | \n",
"
\n",
" \n",
" | std | \n",
" 0.368612 | \n",
" 24.559481 | \n",
" 30.090047 | \n",
" 2265.270398 | \n",
" 0.441561 | \n",
"
\n",
" \n",
" | min | \n",
" 0.000000 | \n",
" 0.000000 | \n",
" 18.250000 | \n",
" 18.800000 | \n",
" 0.000000 | \n",
"
\n",
" \n",
" | 25% | \n",
" 0.000000 | \n",
" 9.000000 | \n",
" 35.500000 | \n",
" 402.225000 | \n",
" 0.000000 | \n",
"
\n",
" \n",
" | 50% | \n",
" 0.000000 | \n",
" 29.000000 | \n",
" 70.350000 | \n",
" 1397.475000 | \n",
" 0.000000 | \n",
"
\n",
" \n",
" | 75% | \n",
" 0.000000 | \n",
" 55.000000 | \n",
" 89.850000 | \n",
" 3786.600000 | \n",
" 1.000000 | \n",
"
\n",
" \n",
" | max | \n",
" 1.000000 | \n",
" 72.000000 | \n",
" 118.750000 | \n",
" 8684.800000 | \n",
" 1.000000 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"display(df_clean\",\n \"rows\": 8,\n \"fields\": [\n {\n \"column\": \"SeniorCitizen\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2489.9992387084,\n \"min\": 0.0,\n \"max\": 7043.0,\n \"num_unique_values\": 5,\n \"samples\": [\n 0.1621468124378816,\n 1.0,\n 0.36861160561002687\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"tenure\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2478.9752758409018,\n \"min\": 0.0,\n \"max\": 7043.0,\n \"num_unique_values\": 8,\n \"samples\": [\n 32.37114865824223,\n 29.0,\n 7043.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"MonthlyCharges\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2468.7047672837775,\n \"min\": 18.25,\n \"max\": 7043.0,\n \"num_unique_values\": 8,\n \"samples\": [\n 64.76169246059918,\n 70.35,\n 7043.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"TotalCharges\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 3119.0484860242914,\n \"min\": 18.8,\n \"max\": 8684.8,\n \"num_unique_values\": 8,\n \"samples\": [\n 2281.9169281556156,\n 1397.475,\n 7043.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Churn_binary\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2489.939844235915,\n \"min\": 0.0,\n \"max\": 7043.0,\n \"num_unique_values\": 5,\n \"samples\": [\n 0.2653698707936959,\n 1.0,\n 0.44156130512195013\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {}
}
],
"source": [
"# ββ Descriptive statistics on real data ββββββββββββββββββββββββββββββββββββββ\n",
"print('π Descriptive statistics (numeric columns):')\n",
"display(df_clean.describe())"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RpDjH_T6ulFs"
},
"source": [
"---\n",
"## π€ SECTION 4: Synthetic Support Interaction Data Generation\n",
"\n",
"### [SYNTHETIC DATASET GENERATION]\n",
"\n",
"**Why synthetic data?** \n",
"Real telecom datasets do not include detailed support call logs. We simulate realistic support interaction variables that are **statistically correlated** with churn β just as a real company's CRM data would show.\n",
"\n",
"**Variables we create:**\n",
"| Variable | Description |\n",
"|---|---|\n",
"| `support_calls` | Number of support calls made in the last 6 months |\n",
"| `avg_call_duration` | Average call duration in minutes |\n",
"| `complaint_type` | Type of most frequent complaint |\n",
"| `days_since_last_contact` | Days since the customer last contacted support |\n",
"| `last_contact_sentiment` | Text of the last customer feedback |\n",
"| `sentiment_score` | VADER compound sentiment score |\n",
"| `support_churn_risk` | Composite risk score (Low / Medium / High) |"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "NK03tdXdulFs",
"outputId": "6a003b55-cad7-4b8e-9ddb-426a7d5f9135"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"β
Feedback templates and complaint types defined.\n"
]
}
],
"source": [
"# ββ Helper: realistic sentiment phrases per churn status βββββββββββββββββββββ\n",
"POSITIVE_FEEDBACK = [\n",
" \"The support agent was very helpful and solved my issue quickly.\",\n",
" \"Great service, no complaints at all!\",\n",
" \"Fast resolution. I am happy with the service.\",\n",
" \"The team was professional and friendly. Very satisfied.\",\n",
" \"Everything was resolved in one call. Excellent experience.\",\n",
" \"I love this company, always responsive and caring.\",\n",
" \"No issues, service works perfectly. Very happy customer.\"\n",
"]\n",
"\n",
"NEUTRAL_FEEDBACK = [\n",
" \"The wait time was long but the issue was eventually resolved.\",\n",
" \"Average experience. Could be better.\",\n",
" \"Service is okay. Nothing special.\",\n",
" \"The agent was polite but the problem took two calls to fix.\",\n",
" \"Acceptable support, but I expected faster resolution.\"\n",
"]\n",
"\n",
"NEGATIVE_FEEDBACK = [\n",
" \"I have called five times and the problem is still not fixed!\",\n",
" \"Terrible service. I am thinking of switching providers.\",\n",
" \"The agents are unhelpful and the wait times are ridiculous.\",\n",
" \"I am very frustrated. Nobody seems to care about my problem.\",\n",
" \"Worst customer service I have ever experienced. Cancelling soon.\",\n",
" \"My bill is wrong again. This is the third time this month!\",\n",
" \"I am extremely disappointed. No follow-up, no resolution.\"\n",
"]\n",
"\n",
"COMPLAINT_TYPES = [\n",
" 'Billing Issue', 'Service Outage', 'Speed/Performance',\n",
" 'Contract Dispute', 'Technical Failure', 'Overcharge'\n",
"]\n",
"\n",
"print('β
Feedback templates and complaint types defined.')"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "EIzwa0U3ulFs",
"outputId": "0aa0c944-1e0d-4df3-cc48-15acc7a590ae"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"β
Synthetic support variables generated!\n",
" support_calls range : 1 β 15\n",
" avg_call_duration range: 3.0 β 35.0 min\n",
" Sample complaint types : {'Contract Dispute', 'Billing Issue', 'Technical Failure', 'Service Outage'}\n"
]
}
],
"source": [
"# ββ Generate synthetic support variables ββββββββββββββββββββββββββββββββββββββ\n",
"n = len(df_clean)\n",
"churn_flag = df_clean['Churn_binary'].values\n",
"\n",
"# support_calls: churners call more (5-15 calls), non-churners call less (1-6)\n",
"support_calls = np.where(\n",
" churn_flag == 1,\n",
" np.random.randint(5, 16, n),\n",
" np.random.randint(1, 7, n)\n",
")\n",
"\n",
"# avg_call_duration: churners have longer calls (frustration)\n",
"avg_call_duration = np.where(\n",
" churn_flag == 1,\n",
" np.round(np.random.uniform(12, 35, n), 1),\n",
" np.round(np.random.uniform(3, 15, n), 1)\n",
")\n",
"\n",
"# complaint_type: random, but churners have heavier billing/contract issues\n",
"def pick_complaint(is_churner):\n",
" if is_churner:\n",
" # weight toward billing and contract disputes\n",
" weights = [0.30, 0.15, 0.15, 0.20, 0.10, 0.10]\n",
" else:\n",
" weights = [0.20, 0.20, 0.20, 0.10, 0.20, 0.10]\n",
" return random.choices(COMPLAINT_TYPES, weights=weights, k=1)[0]\n",
"\n",
"complaint_type = [pick_complaint(c) for c in churn_flag]\n",
"\n",
"# days_since_last_contact: churners contacted recently (about to leave)\n",
"days_since_last_contact = np.where(\n",
" churn_flag == 1,\n",
" np.random.randint(1, 30, n),\n",
" np.random.randint(15, 90, n)\n",
")\n",
"\n",
"# last_contact_sentiment: text phrase matching churn likelihood\n",
"def pick_sentiment_text(is_churner):\n",
" if is_churner:\n",
" # 70% negative, 20% neutral, 10% positive\n",
" pool = (NEGATIVE_FEEDBACK * 7) + (NEUTRAL_FEEDBACK * 2) + (POSITIVE_FEEDBACK * 1)\n",
" else:\n",
" # 10% negative, 20% neutral, 70% positive\n",
" pool = (POSITIVE_FEEDBACK * 7) + (NEUTRAL_FEEDBACK * 2) + (NEGATIVE_FEEDBACK * 1)\n",
" return random.choice(pool)\n",
"\n",
"last_contact_sentiment = [pick_sentiment_text(c) for c in churn_flag]\n",
"\n",
"print('β
Synthetic support variables generated!')\n",
"print(f' support_calls range : {support_calls.min()} β {support_calls.max()}')\n",
"print(f' avg_call_duration range: {avg_call_duration.min()} β {avg_call_duration.max()} min')\n",
"print(f' Sample complaint types : {set(complaint_type[:10])}')"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Yak2n7gpulFt",
"outputId": "ff972046-295c-44c0-c4bb-e0d6c59cb165"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"β
VADER sentiment scores computed!\n",
" Score range: -0.710 to 0.872\n",
" Mean score : 0.307\n"
]
}
],
"source": [
"# ββ Compute VADER sentiment score βββββββββββββββββββββββββββββββββββββββββββββ\n",
"analyzer = SentimentIntensityAnalyzer()\n",
"\n",
"def get_compound_score(text):\n",
" \"\"\"Return VADER compound score: -1 (most negative) to +1 (most positive)\"\"\"\n",
" return analyzer.polarity_scores(text)['compound']\n",
"\n",
"sentiment_score = [get_compound_score(text) for text in last_contact_sentiment]\n",
"\n",
"print('β
VADER sentiment scores computed!')\n",
"print(f' Score range: {min(sentiment_score):.3f} to {max(sentiment_score):.3f}')\n",
"print(f' Mean score : {np.mean(sentiment_score):.3f}')"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "KJ3wGj9mulFt",
"outputId": "6b8e8fd1-6eaf-4f77-8e37-6218bea1e414"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"β
support_churn_risk categories created!\n",
" Distribution: Counter({'Medium': 4166, 'Low': 1471, 'High': 1406})\n"
]
}
],
"source": [
"# ββ Compute composite support_churn_risk ββββββββββββββββββββββββββββββββββββββ\n",
"# Logic:\n",
"# HIGH = many calls (>=6) AND negative sentiment (<= -0.3)\n",
"# MEDIUM = moderate calls (3-5) OR somewhat negative (-0.3 to 0)\n",
"# LOW = everything else\n",
"\n",
"def compute_risk(calls, score):\n",
" if calls >= 6 and score <= -0.3:\n",
" return 'High'\n",
" elif calls >= 3 or score <= 0.0:\n",
" return 'Medium'\n",
" else:\n",
" return 'Low'\n",
"\n",
"support_churn_risk = [\n",
" compute_risk(c, s) for c, s in zip(support_calls, sentiment_score)\n",
"]\n",
"\n",
"print('β
support_churn_risk categories created!')\n",
"from collections import Counter\n",
"print(' Distribution:', Counter(support_churn_risk))"
]
},
{
"cell_type": "markdown",
"source": [
"## Synthetic Data Design & Assumptions\n",
"\n",
"### Why Synthetic Data Was Created\n",
"\n",
"The original Telco dataset does not include detailed information about customer support interactions or behavioral signals such as sentiment. However, in real-world business settings, these factors play a critical role in customer churn.\n",
"\n",
"To better approximate real-world conditions, we generated synthetic variables representing:\n",
"- number of support calls\n",
"- average call duration\n",
"- complaint type\n",
"- time since last interaction\n",
"- sentiment score\n",
"- support-based churn risk\n",
"\n",
"These variables allow us to enrich the dataset and create a more realistic narrative around customer experience and dissatisfaction.\n",
"\n",
"---\n",
"\n",
"### Key Assumptions\n",
"\n",
"The synthetic data generation is based on the following assumptions:\n",
"\n",
"- Customers who churn tend to have **more frequent support interactions**\n",
"- Negative experiences lead to **lower sentiment scores**\n",
"- Certain complaint types (e.g., technical failures) are more strongly associated with dissatisfaction\n",
"- Customers with unresolved issues are more likely to churn\n",
"\n",
"These assumptions are grounded in typical telecom business logic but are not directly observed in the original dataset.\n",
"\n",
"---\n",
"\n",
"### Limitations of Synthetic Data\n",
"\n",
"Because these variables are artificially generated:\n",
"- They may introduce **bias toward expected relationships**\n",
"- They are not independent of the churn outcome\n",
"- They may **overstate model performance**\n",
"\n",
"Therefore, results should be interpreted as illustrative rather than fully generalizable."
],
"metadata": {
"id": "zvVoEV0Fl77_"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "7TZ0lxJQulFt"
},
"source": [
"---\n",
"## π SECTION 5: Merge Real + Synthetic Data\n",
"\n",
"We now combine the cleaned real-world dataset with our synthetic support variables into one rich dataset."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 330
},
"id": "7tMGLUqvulFt",
"outputId": "d5013de1-0912-46b4-886e-e077adc64ac3"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"β
Final merged dataset: 7043 rows Γ 21 columns\n"
]
},
{
"output_type": "display_data",
"data": {
"text/plain": [
" customerID gender SeniorCitizen Partner Dependents tenure \\\n",
"0 7590-VHVEG Female 0 Yes No 1 \n",
"1 5575-GNVDE Male 0 No No 34 \n",
"2 3668-QPYBK Male 0 No No 2 \n",
"3 7795-CFOCW Male 0 No No 45 \n",
"4 9237-HQITU Female 0 No No 2 \n",
"\n",
" Contract PaymentMethod MonthlyCharges TotalCharges \\\n",
"0 Month-to-month Electronic check 29.85 29.85 \n",
"1 One year Mailed check 56.95 1889.50 \n",
"2 Month-to-month Mailed check 53.85 108.15 \n",
"3 One year Bank transfer (automatic) 42.30 1840.75 \n",
"4 Month-to-month Electronic check 70.70 151.65 \n",
"\n",
" InternetService TechSupport Churn Churn_binary support_calls \\\n",
"0 DSL No No 0 2 \n",
"1 DSL No No 0 4 \n",
"2 DSL No Yes 1 15 \n",
"3 DSL Yes No 0 3 \n",
"4 Fiber optic No Yes 1 9 \n",
"\n",
" avg_call_duration complaint_type days_since_last_contact \\\n",
"0 9.7 Contract Dispute 16 \n",
"1 3.8 Billing Issue 57 \n",
"2 21.2 Billing Issue 12 \n",
"3 9.8 Service Outage 40 \n",
"4 12.3 Contract Dispute 24 \n",
"\n",
" last_contact_sentiment sentiment_score \\\n",
"0 I love this company, always responsive and car... 0.8720 \n",
"1 Great service, no complaints at all! 0.7684 \n",
"2 My bill is wrong again. This is the third time... -0.5255 \n",
"3 Everything was resolved in one call. Excellent... 0.6597 \n",
"4 My bill is wrong again. This is the third time... -0.5255 \n",
"\n",
" support_churn_risk \n",
"0 Low \n",
"1 Medium \n",
"2 High \n",
"3 Medium \n",
"4 High "
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" customerID | \n",
" gender | \n",
" SeniorCitizen | \n",
" Partner | \n",
" Dependents | \n",
" tenure | \n",
" Contract | \n",
" PaymentMethod | \n",
" MonthlyCharges | \n",
" TotalCharges | \n",
" InternetService | \n",
" TechSupport | \n",
" Churn | \n",
" Churn_binary | \n",
" support_calls | \n",
" avg_call_duration | \n",
" complaint_type | \n",
" days_since_last_contact | \n",
" last_contact_sentiment | \n",
" sentiment_score | \n",
" support_churn_risk | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 7590-VHVEG | \n",
" Female | \n",
" 0 | \n",
" Yes | \n",
" No | \n",
" 1 | \n",
" Month-to-month | \n",
" Electronic check | \n",
" 29.85 | \n",
" 29.85 | \n",
" DSL | \n",
" No | \n",
" No | \n",
" 0 | \n",
" 2 | \n",
" 9.7 | \n",
" Contract Dispute | \n",
" 16 | \n",
" I love this company, always responsive and car... | \n",
" 0.8720 | \n",
" Low | \n",
"
\n",
" \n",
" | 1 | \n",
" 5575-GNVDE | \n",
" Male | \n",
" 0 | \n",
" No | \n",
" No | \n",
" 34 | \n",
" One year | \n",
" Mailed check | \n",
" 56.95 | \n",
" 1889.50 | \n",
" DSL | \n",
" No | \n",
" No | \n",
" 0 | \n",
" 4 | \n",
" 3.8 | \n",
" Billing Issue | \n",
" 57 | \n",
" Great service, no complaints at all! | \n",
" 0.7684 | \n",
" Medium | \n",
"
\n",
" \n",
" | 2 | \n",
" 3668-QPYBK | \n",
" Male | \n",
" 0 | \n",
" No | \n",
" No | \n",
" 2 | \n",
" Month-to-month | \n",
" Mailed check | \n",
" 53.85 | \n",
" 108.15 | \n",
" DSL | \n",
" No | \n",
" Yes | \n",
" 1 | \n",
" 15 | \n",
" 21.2 | \n",
" Billing Issue | \n",
" 12 | \n",
" My bill is wrong again. This is the third time... | \n",
" -0.5255 | \n",
" High | \n",
"
\n",
" \n",
" | 3 | \n",
" 7795-CFOCW | \n",
" Male | \n",
" 0 | \n",
" No | \n",
" No | \n",
" 45 | \n",
" One year | \n",
" Bank transfer (automatic) | \n",
" 42.30 | \n",
" 1840.75 | \n",
" DSL | \n",
" Yes | \n",
" No | \n",
" 0 | \n",
" 3 | \n",
" 9.8 | \n",
" Service Outage | \n",
" 40 | \n",
" Everything was resolved in one call. Excellent... | \n",
" 0.6597 | \n",
" Medium | \n",
"
\n",
" \n",
" | 4 | \n",
" 9237-HQITU | \n",
" Female | \n",
" 0 | \n",
" No | \n",
" No | \n",
" 2 | \n",
" Month-to-month | \n",
" Electronic check | \n",
" 70.70 | \n",
" 151.65 | \n",
" Fiber optic | \n",
" No | \n",
" Yes | \n",
" 1 | \n",
" 9 | \n",
" 12.3 | \n",
" Contract Dispute | \n",
" 24 | \n",
" My bill is wrong again. This is the third time... | \n",
" -0.5255 | \n",
" High | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe"
}
},
"metadata": {}
}
],
"source": [
"# ββ Build the synthetic support DataFrame ββββββββββββββββββββββββββββββββββββ\n",
"df_support = pd.DataFrame({\n",
" 'support_calls' : support_calls,\n",
" 'avg_call_duration' : avg_call_duration,\n",
" 'complaint_type' : complaint_type,\n",
" 'days_since_last_contact': days_since_last_contact,\n",
" 'last_contact_sentiment' : last_contact_sentiment,\n",
" 'sentiment_score' : sentiment_score,\n",
" 'support_churn_risk' : support_churn_risk\n",
"})\n",
"\n",
"# ββ Merge with real-world data (index-aligned) ββββββββββββββββββββββββββββββββ\n",
"df_final = pd.concat([df_clean.reset_index(drop=True), df_support], axis=1)\n",
"\n",
"print(f'β
Final merged dataset: {df_final.shape[0]} rows Γ {df_final.shape[1]} columns')\n",
"display(df_final.head())"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "UAW-XT66ulFu",
"outputId": "32243187-b541-40b2-8069-894cbe430f8b"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"π Mean support_calls by Churn:\n",
"Churn\n",
"No 3.47\n",
"Yes 10.02\n",
"Name: support_calls, dtype: float64\n",
"\n",
"π Mean sentiment_score by Churn:\n",
"Churn\n",
"No 0.519\n",
"Yes -0.279\n",
"Name: sentiment_score, dtype: float64\n",
"\n",
"π support_churn_risk vs Churn crosstab:\n",
"Churn No Yes\n",
"support_churn_risk \n",
"High 0.080 0.920\n",
"Low 1.000 0.000\n",
"Medium 0.862 0.138\n"
]
}
],
"source": [
"# ββ Verification: check correlations make sense ββββββββββββββββββββββββββββββ\n",
"print('π Mean support_calls by Churn:')\n",
"print(df_final.groupby('Churn')['support_calls'].mean().round(2))\n",
"\n",
"print('\\nπ Mean sentiment_score by Churn:')\n",
"print(df_final.groupby('Churn')['sentiment_score'].mean().round(3))\n",
"\n",
"print('\\nπ support_churn_risk vs Churn crosstab:')\n",
"print(pd.crosstab(df_final['support_churn_risk'], df_final['Churn'], normalize='index').round(3))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "27am_r6QulFu"
},
"source": [
"---\n",
"## πΎ SECTION 6: Export Final Dataset"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "EfM7zP4XulFu",
"outputId": "5f76e254-0dc9-49ea-c6ff-3c202b00c78f"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"β
Dataset exported: customer_churn_support_dataset.csv\n",
" Rows : 7043\n",
" Columns : 21\n",
"\n",
"π Final column list:\n",
" β’ customerID\n",
" β’ gender\n",
" β’ SeniorCitizen\n",
" β’ Partner\n",
" β’ Dependents\n",
" β’ tenure\n",
" β’ Contract\n",
" β’ PaymentMethod\n",
" β’ MonthlyCharges\n",
" β’ TotalCharges\n",
" β’ InternetService\n",
" β’ TechSupport\n",
" β’ Churn\n",
" β’ Churn_binary\n",
" β’ support_calls\n",
" β’ avg_call_duration\n",
" β’ complaint_type\n",
" β’ days_since_last_contact\n",
" β’ last_contact_sentiment\n",
" β’ sentiment_score\n",
" β’ support_churn_risk\n"
]
}
],
"source": [
"# ββ Export to CSV βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n",
"OUTPUT_FILENAME = 'customer_churn_support_dataset.csv'\n",
"df_final.to_csv(OUTPUT_FILENAME, index=False)\n",
"\n",
"print(f'β
Dataset exported: {OUTPUT_FILENAME}')\n",
"print(f' Rows : {df_final.shape[0]}')\n",
"print(f' Columns : {df_final.shape[1]}')\n",
"print(f'\\nπ Final column list:')\n",
"for col in df_final.columns:\n",
" print(f' β’ {col}')"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"id": "HFWnx9zlulFu",
"outputId": "a9084d14-05d9-4cb2-98c6-a25491f9b6b1"
},
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
""
],
"application/javascript": [
"\n",
" async function download(id, filename, size) {\n",
" if (!google.colab.kernel.accessAllowed) {\n",
" return;\n",
" }\n",
" const div = document.createElement('div');\n",
" const label = document.createElement('label');\n",
" label.textContent = `Downloading \"${filename}\": `;\n",
" div.appendChild(label);\n",
" const progress = document.createElement('progress');\n",
" progress.max = size;\n",
" div.appendChild(progress);\n",
" document.body.appendChild(div);\n",
"\n",
" const buffers = [];\n",
" let downloaded = 0;\n",
"\n",
" const channel = await google.colab.kernel.comms.open(id);\n",
" // Send a message to notify the kernel that we're ready.\n",
" channel.send({})\n",
"\n",
" for await (const message of channel.messages) {\n",
" // Send a message to notify the kernel that we're ready.\n",
" channel.send({})\n",
" if (message.buffers) {\n",
" for (const buffer of message.buffers) {\n",
" buffers.push(buffer);\n",
" downloaded += buffer.byteLength;\n",
" progress.value = downloaded;\n",
" }\n",
" }\n",
" }\n",
" const blob = new Blob(buffers, {type: 'application/binary'});\n",
" const a = document.createElement('a');\n",
" a.href = window.URL.createObjectURL(blob);\n",
" a.download = filename;\n",
" div.appendChild(a);\n",
" a.click();\n",
" div.remove();\n",
" }\n",
" "
]
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
""
],
"application/javascript": [
"download(\"download_160eb193-7480-4c6e-90ba-3271caaea521\", \"customer_churn_support_dataset.csv\", 1309271)"
]
},
"metadata": {}
},
{
"output_type": "stream",
"name": "stdout",
"text": [
"β
Download triggered! Check your Downloads folder.\n"
]
}
],
"source": [
"# ββ Download the file to your computer βββββββββββββββββββββββββββββββββββββββ\n",
"from google.colab import files\n",
"files.download(OUTPUT_FILENAME)\n",
"print('β
Download triggered! Check your Downloads folder.')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wr8dQ0YQulFu"
},
"source": [
"---\n",
"## β
SECTION 7: Final Verification Checks"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "FK7d-FE2ulFu",
"outputId": "4e8cb3b5-08f8-4e0a-efd6-97b738be9181"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"============================================================\n",
" FINAL DATASET VERIFICATION REPORT\n",
"============================================================\n",
"\n",
"β
Shape : (7043, 21)\n",
"β
Missing values : 0 total\n",
"β
Churn rate : 26.5%\n",
"β
Risk categories : {'Medium': 4166, 'Low': 1471, 'High': 1406}\n",
"β
Complaint types : 6 unique\n",
"β
Sentiment range : -0.710 to 0.872\n",
"\n",
"============================================================\n",
" β
Notebook 1 COMPLETE! Proceed to Notebook 2.\n",
"============================================================\n"
]
}
],
"source": [
"# ββ Final verification ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n",
"print('=' * 60)\n",
"print(' FINAL DATASET VERIFICATION REPORT')\n",
"print('=' * 60)\n",
"\n",
"df_verify = pd.read_csv(OUTPUT_FILENAME)\n",
"\n",
"print(f'\\nβ
Shape : {df_verify.shape}')\n",
"print(f'β
Missing values : {df_verify.isnull().sum().sum()} total')\n",
"print(f'β
Churn rate : {df_verify[\"Churn_binary\"].mean()*100:.1f}%')\n",
"print(f'β
Risk categories : {df_verify[\"support_churn_risk\"].value_counts().to_dict()}')\n",
"print(f'β
Complaint types : {df_verify[\"complaint_type\"].nunique()} unique')\n",
"print(f'β
Sentiment range : {df_verify[\"sentiment_score\"].min():.3f} to {df_verify[\"sentiment_score\"].max():.3f}')\n",
"\n",
"print('\\n' + '=' * 60)\n",
"print(' β
Notebook 1 COMPLETE! Proceed to Notebook 2.')\n",
"print('=' * 60)"
]
},
{
"cell_type": "markdown",
"source": [
"## SECTION 8: Final Dataset Summary"
],
"metadata": {
"id": "7KUQ8dwMa9pK"
}
},
{
"cell_type": "code",
"source": [
"print('=' * 60)\n",
"print(' NOTEBOOK 1 β FINAL DATASET SUMMARY')\n",
"print('=' * 60)\n",
"\n",
"print(f'\\nπ Shape: {df_final.shape[0]} rows Γ {df_final.shape[1]} columns')\n",
"\n",
"print('\\nπ Column overview:')\n",
"for col in df_final.columns:\n",
" dtype = str(df_final[col].dtype)\n",
" sample = str(df_final[col].iloc[0])[:40]\n",
" print(f' {col:<30} [{dtype:<10}] e.g. {sample}')\n",
"\n",
"print('\\nπ Real vs Synthetic columns:')\n",
"real_cols = ['customerID','gender','SeniorCitizen','Partner','Dependents',\n",
" 'tenure','Contract','PaymentMethod','MonthlyCharges',\n",
" 'TotalCharges','InternetService','TechSupport','Churn','Churn_binary']\n",
"synthetic_cols = ['support_calls','avg_call_duration','complaint_type',\n",
" 'days_since_last_contact','last_contact_sentiment',\n",
" 'sentiment_score','support_churn_risk']\n",
"\n",
"print(f' β
Real-world columns ({len(real_cols)}): {real_cols}')\n",
"print(f' π€ Synthetic columns ({len(synthetic_cols)}): {synthetic_cols}')\n",
"\n",
"print('\\nπ Sample rows:')\n",
"display(df_final[['customerID','tenure','Churn','support_calls',\n",
" 'sentiment_score','support_churn_risk']].head(5))\n",
"\n",
"\n",
"print('=' * 60)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 833
},
"id": "XpmZTW0RahxT",
"outputId": "10a7c450-6a35-447c-ca9f-7db0aab8d5ce"
},
"execution_count": 21,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"============================================================\n",
" NOTEBOOK 1 β FINAL DATASET SUMMARY\n",
"============================================================\n",
"\n",
"π Shape: 7043 rows Γ 21 columns\n",
"\n",
"π Column overview:\n",
" customerID [object ] e.g. 7590-VHVEG\n",
" gender [object ] e.g. Female\n",
" SeniorCitizen [int64 ] e.g. 0\n",
" Partner [object ] e.g. Yes\n",
" Dependents [object ] e.g. No\n",
" tenure [int64 ] e.g. 1\n",
" Contract [object ] e.g. Month-to-month\n",
" PaymentMethod [object ] e.g. Electronic check\n",
" MonthlyCharges [float64 ] e.g. 29.85\n",
" TotalCharges [float64 ] e.g. 29.85\n",
" InternetService [object ] e.g. DSL\n",
" TechSupport [object ] e.g. No\n",
" Churn [object ] e.g. No\n",
" Churn_binary [int64 ] e.g. 0\n",
" support_calls [int64 ] e.g. 2\n",
" avg_call_duration [float64 ] e.g. 9.7\n",
" complaint_type [object ] e.g. Contract Dispute\n",
" days_since_last_contact [int64 ] e.g. 16\n",
" last_contact_sentiment [object ] e.g. I love this company, always responsive a\n",
" sentiment_score [float64 ] e.g. 0.872\n",
" support_churn_risk [object ] e.g. Low\n",
"\n",
"π Real vs Synthetic columns:\n",
" β
Real-world columns (14): ['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'Contract', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'InternetService', 'TechSupport', 'Churn', 'Churn_binary']\n",
" π€ Synthetic columns (7): ['support_calls', 'avg_call_duration', 'complaint_type', 'days_since_last_contact', 'last_contact_sentiment', 'sentiment_score', 'support_churn_risk']\n",
"\n",
"π Sample rows:\n"
]
},
{
"output_type": "display_data",
"data": {
"text/plain": [
" customerID tenure Churn support_calls sentiment_score support_churn_risk\n",
"0 7590-VHVEG 1 No 2 0.8720 Low\n",
"1 5575-GNVDE 34 No 4 0.7684 Medium\n",
"2 3668-QPYBK 2 Yes 15 -0.5255 High\n",
"3 7795-CFOCW 45 No 3 0.6597 Medium\n",
"4 9237-HQITU 2 Yes 9 -0.5255 High"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" customerID | \n",
" tenure | \n",
" Churn | \n",
" support_calls | \n",
" sentiment_score | \n",
" support_churn_risk | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 7590-VHVEG | \n",
" 1 | \n",
" No | \n",
" 2 | \n",
" 0.8720 | \n",
" Low | \n",
"
\n",
" \n",
" | 1 | \n",
" 5575-GNVDE | \n",
" 34 | \n",
" No | \n",
" 4 | \n",
" 0.7684 | \n",
" Medium | \n",
"
\n",
" \n",
" | 2 | \n",
" 3668-QPYBK | \n",
" 2 | \n",
" Yes | \n",
" 15 | \n",
" -0.5255 | \n",
" High | \n",
"
\n",
" \n",
" | 3 | \n",
" 7795-CFOCW | \n",
" 45 | \n",
" No | \n",
" 3 | \n",
" 0.6597 | \n",
" Medium | \n",
"
\n",
" \n",
" | 4 | \n",
" 9237-HQITU | \n",
" 2 | \n",
" Yes | \n",
" 9 | \n",
" -0.5255 | \n",
" High | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"print('=' * 60)\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"customerID\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"5575-GNVDE\",\n \"9237-HQITU\",\n \"3668-QPYBK\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"tenure\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 21,\n \"min\": 1,\n \"max\": 45,\n \"num_unique_values\": 4,\n \"samples\": [\n 34,\n 45,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Churn\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"Yes\",\n \"No\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"support_calls\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5,\n \"min\": 2,\n \"max\": 15,\n \"num_unique_values\": 5,\n \"samples\": [\n 4,\n 9\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"sentiment_score\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.7117367821041709,\n \"min\": -0.5255,\n \"max\": 0.872,\n \"num_unique_values\": 4,\n \"samples\": [\n 0.7684,\n 0.6597\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"support_churn_risk\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Low\",\n \"Medium\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {}
},
{
"output_type": "stream",
"name": "stdout",
"text": [
"============================================================\n"
]
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nEAK367ZulFu"
},
"source": [
"---\n",
"## π Summary β What Was Done in This Notebook\n",
"\n",
"| Phase | Action | Result |\n",
"|---|---|---|\n",
"| **Real-World Data** | Downloaded Telco Churn from Kaggle | 7,043 real customers |\n",
"| **Cleaning** | Fixed TotalCharges, encoded Churn | Clean 13-column DataFrame |\n",
"| **Synthetic Generation** | Created 7 support variables with VADER | Statistically realistic |\n",
"| **Merge** | Combined real + synthetic | 20-column final dataset |\n",
"| **Export** | Saved as CSV | `customer_churn_support_dataset.csv` |\n",
"\n",
"**β‘οΈ Next Step:** Open `2_Churn_Data_Analysis_and_Insights.ipynb` and upload this CSV file."
]
}
]
}