07_conflicts_prediction_mlflow.ipynb (ADDED)
The diff for this file is too large to render. See raw diff.
08_regression_addicted_score.ipynb (ADDED)
The diff for this file is too large to render. See raw diff.
09_clustering_analysis.ipynb (ADDED)
@@ -0,0 +1,868 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "60199374",
   "metadata": {},
   "source": [
    "# Clustering Analysis - Social Media Usage Patterns\n",
    "\n",
    "## Overview\n",
    "This notebook performs comprehensive clustering analysis on student social media usage data to identify distinct behavioral patterns and user segments.\n",
    "\n",
    "## Analysis Pipeline:\n",
    "1. **Data Preparation** - Feature engineering and scaling\n",
    "2. **Dimensionality Reduction** - PCA/UMAP for visualization\n",
    "3. **Clustering Algorithms** - KMeans, HDBSCAN, and others\n",
    "4. **Evaluation** - Silhouette scores and visual validation\n",
    "5. **Interpretability** - Cluster profiling and labeling\n",
    "6. **MLflow Tracking** - Experiment tracking and model management\n",
    "\n",
    "## Key Objectives:\n",
    "- Identify distinct user segments based on social media behavior\n",
    "- Understand relationships between usage patterns and demographics\n",
    "- Create actionable insights for intervention strategies\n",
    "- Build reproducible clustering pipeline with MLflow\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bc8d220b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Core data science libraries\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from sklearn.preprocessing import StandardScaler, MinMaxScaler\n",
    "from sklearn.decomposition import PCA\n",
    "from sklearn.cluster import KMeans, DBSCAN\n",
    "from sklearn.metrics import silhouette_score, silhouette_samples\n",
    "from sklearn.manifold import TSNE\n",
    "import umap\n",
    "import hdbscan\n",
    "from scipy import stats\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "# MLflow for experiment tracking\n",
    "import mlflow\n",
    "import mlflow.sklearn\n",
    "from mlflow.tracking import MlflowClient\n",
    "\n",
    "# Visualization settings\n",
    "plt.style.use('seaborn-v0_8')\n",
    "sns.set_palette(\"husl\")\n",
    "%matplotlib inline\n",
    "\n",
    "# Set pandas display options\n",
    "pd.set_option('display.max_columns', None)\n",
    "pd.set_option('display.max_rows', 100)\n",
    "pd.set_option('display.width', None)\n",
    "\n",
    "print(\"\u2705 Libraries imported successfully!\")\n",
    "\n",
    "# MLflow setup with autologging disabled\n",
    "mlflow.set_tracking_uri(\"file:mlruns\")\n",
    "mlflow.set_experiment(\"Clustering_Analysis\")\n",
    "\n",
    "# Disable autologging to avoid conflicts\n",
    "mlflow.sklearn.autolog(disable=True)\n",
    "\n",
    "print(\"\u2705 MLflow tracking configured!\")\n",
    "print(\"\ud83d\udcca MLflow tracking URI: file:mlruns\")\n",
    "print(\"\ud83d\udcca MLflow experiment: Clustering_Analysis\")\n",
    "print(\"\ud83d\udd12 Autologging disabled to prevent conflicts\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f406fe7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the dataset\n",
    "from pathlib import Path\n",
    "PROJECT_ROOT = Path.cwd().parent\n",
    "DATA_DIR = PROJECT_ROOT / \"data\"\n",
    "\n",
    "print(\"\ud83d\udcca Loading Students Social Media Addiction dataset...\")\n",
    "df = pd.read_csv(DATA_DIR / \"Students Social Media Addiction.csv\")\n",
    "\n",
    "print(f\"\u2705 Dataset loaded successfully!\")\n",
    "print(f\"\ud83d\udccb Shape: {df.shape}\")\n",
    "print(f\"\ud83d\udccb Columns: {list(df.columns)}\")\n",
    "\n",
    "# Display basic info\n",
    "print(\"\\n\ud83d\udcca Dataset Overview:\")\n",
    "print(f\" - Total students: {len(df)}\")\n",
    "print(f\" - Age range: {df['Age'].min()} - {df['Age'].max()} years\")\n",
    "print(f\" - Countries: {df['Country'].nunique()}\")\n",
    "print(f\" - Platforms: {df['Most_Used_Platform'].nunique()}\")\n",
    "\n",
    "# Display first few rows\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7ab8f782",
   "metadata": {},
   "outputs": [],
   "source": [
    "## 1. Data Preparation\n",
    "\n",
    "### 1.1 Feature Engineering\n",
    "\n",
    "# Create binary features for categorical variables\n",
    "df['Is_Female'] = (df['Gender'] == 'Female').astype(int)\n",
    "df['Is_Male'] = (df['Gender'] == 'Male').astype(int)\n",
    "\n",
    "# Academic level features\n",
    "df['Is_Undergraduate'] = (df['Academic_Level'] == 'Undergraduate').astype(int)\n",
    "df['Is_Graduate'] = (df['Academic_Level'] == 'Graduate').astype(int)\n",
    "df['Is_High_School'] = (df['Academic_Level'] == 'High School').astype(int)\n",
    "\n",
    "# Relationship status features\n",
    "df['Is_Single'] = (df['Relationship_Status'] == 'Single').astype(int)\n",
    "df['Is_In_Relationship'] = (df['Relationship_Status'] == 'In Relationship').astype(int)\n",
    "df['Is_Complicated'] = (df['Relationship_Status'] == 'Complicated').astype(int)\n",
    "\n",
    "# Academic performance\n",
    "df['Affects_Academic'] = (df['Affects_Academic_Performance'] == 'Yes').astype(int)\n",
    "\n",
    "# Create platform dummies (top 6 platforms)\n",
    "top_platforms = df['Most_Used_Platform'].value_counts().head(6).index\n",
    "for platform in top_platforms:\n",
    "    df[f'Uses_{platform}'] = (df['Most_Used_Platform'] == platform).astype(int)\n",
    "\n",
    "# Create behavioral features\n",
    "df['High_Usage'] = (df['Avg_Daily_Usage_Hours'] >= 6).astype(int)\n",
    "df['Low_Sleep'] = (df['Sleep_Hours_Per_Night'] <= 6).astype(int)\n",
    "df['Poor_Mental_Health'] = (df['Mental_Health_Score'] <= 5).astype(int)\n",
    "df['High_Conflict'] = (df['Conflicts_Over_Social_Media'] >= 3).astype(int)\n",
    "df['High_Addiction'] = (df['Addicted_Score'] >= 7).astype(int)\n",
    "\n",
    "# Create interaction features\n",
    "df['Usage_Sleep_Ratio'] = df['Avg_Daily_Usage_Hours'] / df['Sleep_Hours_Per_Night']\n",
    "df['Mental_Health_Usage_Ratio'] = df['Mental_Health_Score'] / df['Avg_Daily_Usage_Hours']\n",
    "\n",
    "print(\"\u2705 Feature engineering completed!\")\n",
    "print(f\"\ud83d\udcca New features created: {len([col for col in df.columns if col.startswith(('Is_', 'Uses_', 'High_', 'Low_', 'Poor_', 'Ratio'))])}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a2e3ee38",
   "metadata": {},
   "outputs": [],
   "source": [
    "### 1.2 Feature Selection for Clustering\n",
    "\n",
    "# Select numerical features for clustering\n",
    "numerical_features = [\n",
    "    'Age', 'Avg_Daily_Usage_Hours', 'Sleep_Hours_Per_Night',\n",
    "    'Mental_Health_Score', 'Conflicts_Over_Social_Media', 'Addicted_Score',\n",
    "    'Is_Female', 'Is_Undergraduate', 'Is_Graduate', 'Is_High_School',\n",
    "    'Is_Single', 'Is_In_Relationship', 'Is_Complicated', 'Affects_Academic',\n",
    "    'High_Usage', 'Low_Sleep', 'Poor_Mental_Health', 'High_Conflict', 'High_Addiction',\n",
    "    'Usage_Sleep_Ratio', 'Mental_Health_Usage_Ratio'\n",
    "]\n",
    "\n",
    "# Add platform features\n",
    "platform_features = [col for col in df.columns if col.startswith('Uses_')]\n",
    "numerical_features.extend(platform_features)\n",
    "\n",
    "# Create feature matrix\n",
    "X = df[numerical_features].copy()\n",
    "\n",
    "print(f\"\ud83d\udcca Feature matrix shape: {X.shape}\")\n",
    "print(f\"\ud83d\udcca Features selected: {len(numerical_features)}\")\n",
    "\n",
    "# Check for missing values\n",
    "print(\"\\n\ud83d\udcca Missing values check:\")\n",
    "print(X.isnull().sum().sum(), \"missing values found\")\n",
    "\n",
    "# Display feature statistics\n",
    "print(\"\\n\ud83d\udcca Feature statistics:\")\n",
    "print(X.describe())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9ac222e5",
   "metadata": {},
   "outputs": [],
   "source": [
    "### 1.3 Feature Scaling\n",
    "\n",
    "# Standardize features for clustering\n",
    "scaler = StandardScaler()\n",
    "X_scaled = scaler.fit_transform(X)\n",
    "\n",
    "# Convert back to DataFrame for easier handling\n",
    "X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)\n",
    "\n",
    "print(\"\u2705 Features scaled successfully!\")\n",
    "print(f\"\ud83d\udcca Scaled features shape: {X_scaled_df.shape}\")\n",
    "\n",
    "# Verify scaling\n",
    "print(\"\\n\ud83d\udcca Scaling verification:\")\n",
    "print(\"Mean of scaled features:\", X_scaled_df.mean().mean())\n",
    "print(\"Std of scaled features:\", X_scaled_df.std().mean())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "69057eff",
   "metadata": {},
   "outputs": [],
   "source": [
    "## 2. Dimensionality Reduction for Visualization\n",
    "\n",
    "### 2.1 Principal Component Analysis (PCA)\n",
    "\n",
    "# Perform PCA\n",
    "pca = PCA(n_components=2, random_state=42)\n",
    "X_pca = pca.fit_transform(X_scaled)\n",
    "\n",
    "# Create PCA DataFrame\n",
    "pca_df = pd.DataFrame(X_pca, columns=['PC1', 'PC2'], index=X.index)\n",
    "\n",
    "print(\"\ud83d\udcca PCA Results:\")\n",
    "print(f\"Explained variance ratio: {pca.explained_variance_ratio_}\")\n",
    "print(f\"Total explained variance: {pca.explained_variance_ratio_.sum():.3f}\")\n",
    "\n",
    "# Visualize PCA\n",
    "plt.figure(figsize=(12, 8))\n",
    "plt.scatter(pca_df['PC1'], pca_df['PC2'], alpha=0.6, s=50)\n",
    "plt.xlabel('Principal Component 1')\n",
    "plt.ylabel('Principal Component 2')\n",
    "plt.title('PCA Visualization of Social Media Usage Patterns')\n",
    "plt.grid(True, alpha=0.3)\n",
    "plt.show()\n",
    "\n",
    "# Feature importance in PCA\n",
    "feature_importance = pd.DataFrame(\n",
    "    pca.components_.T,\n",
    "    columns=['PC1', 'PC2'],\n",
    "    index=X.columns\n",
    ")\n",
    "\n",
    "print(\"\\n\ud83d\udcca Top features contributing to PC1:\")\n",
    "print(feature_importance['PC1'].abs().sort_values(ascending=False).head(10))\n",
    "\n",
    "print(\"\\n\ud83d\udcca Top features contributing to PC2:\")\n",
    "print(feature_importance['PC2'].abs().sort_values(ascending=False).head(10))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d315d19f",
   "metadata": {},
   "outputs": [],
   "source": [
    "### 2.2 UMAP for Non-linear Dimensionality Reduction\n",
    "\n",
    "# Perform UMAP\n",
    "umap_reducer = umap.UMAP(n_components=2, random_state=42, n_neighbors=15, min_dist=0.1)\n",
    "X_umap = umap_reducer.fit_transform(X_scaled)\n",
    "\n",
    "# Create UMAP DataFrame\n",
    "umap_df = pd.DataFrame(X_umap, columns=['UMAP1', 'UMAP2'], index=X.index)\n",
    "\n",
    "print(\"\u2705 UMAP reduction completed!\")\n",
    "\n",
    "# Visualize UMAP\n",
    "plt.figure(figsize=(12, 8))\n",
    "plt.scatter(umap_df['UMAP1'], umap_df['UMAP2'], alpha=0.6, s=50)\n",
    "plt.xlabel('UMAP Component 1')\n",
    "plt.ylabel('UMAP Component 2')\n",
    "plt.title('UMAP Visualization of Social Media Usage Patterns')\n",
    "plt.grid(True, alpha=0.3)\n",
    "plt.show()\n",
    "\n",
    "# Compare PCA vs UMAP\n",
    "fig, axes = plt.subplots(1, 2, figsize=(16, 6))\n",
    "\n",
    "# PCA plot\n",
    "axes[0].scatter(pca_df['PC1'], pca_df['PC2'], alpha=0.6, s=50)\n",
    "axes[0].set_xlabel('PC1')\n",
    "axes[0].set_ylabel('PC2')\n",
    "axes[0].set_title('PCA Visualization')\n",
    "axes[0].grid(True, alpha=0.3)\n",
    "\n",
    "# UMAP plot\n",
    "axes[1].scatter(umap_df['UMAP1'], umap_df['UMAP2'], alpha=0.6, s=50)\n",
    "axes[1].set_xlabel('UMAP1')\n",
    "axes[1].set_ylabel('UMAP2')\n",
    "axes[1].set_title('UMAP Visualization')\n",
    "axes[1].grid(True, alpha=0.3)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fca5c18d",
   "metadata": {},
   "outputs": [],
   "source": [
    "## 3. Clustering Algorithms\n",
    "\n",
    "### 3.1 K-Means Clustering\n",
    "\n",
    "# Find optimal number of clusters using elbow method\n",
    "inertias = []\n",
    "silhouette_scores = []\n",
    "k_range = range(2, 11)\n",
    "\n",
    "for k in k_range:\n",
    "    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)\n",
    "    kmeans.fit(X_scaled)\n",
    "    inertias.append(kmeans.inertia_)\n",
    "    silhouette_scores.append(silhouette_score(X_scaled, kmeans.labels_))\n",
    "\n",
    "# Plot elbow curve\n",
    "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))\n",
    "\n",
    "# Inertia plot\n",
    "ax1.plot(k_range, inertias, 'bo-')\n",
    "ax1.set_xlabel('Number of Clusters (k)')\n",
    "ax1.set_ylabel('Inertia')\n",
    "ax1.set_title('Elbow Method for Optimal k')\n",
    "ax1.grid(True, alpha=0.3)\n",
    "\n",
    "# Silhouette score plot\n",
    "ax2.plot(k_range, silhouette_scores, 'ro-')\n",
    "ax2.set_xlabel('Number of Clusters (k)')\n",
    "ax2.set_ylabel('Silhouette Score')\n",
    "ax2.set_title('Silhouette Score vs Number of Clusters')\n",
    "ax2.grid(True, alpha=0.3)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "# Find optimal k\n",
    "optimal_k = k_range[np.argmax(silhouette_scores)]\n",
    "print(f\"\ud83d\udcca Optimal number of clusters (K-Means): {optimal_k}\")\n",
    "print(f\"\ud83d\udcca Best silhouette score: {max(silhouette_scores):.3f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e9fc6734",
   "metadata": {},
   "outputs": [],
   "source": [
    "### 3.2 K-Means with Optimal k\n",
    "\n",
    "# Perform K-Means with optimal k and clean MLflow logging\n",
    "with mlflow.start_run(run_name=\"kmeans_optimal\"):\n",
    "    # Create and fit the model\n",
    "    kmeans_optimal = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)\n",
    "    kmeans_labels = kmeans_optimal.fit_predict(X_scaled)\n",
    "    df['KMeans_Cluster'] = kmeans_labels\n",
    "\n",
    "    # Log only essential parameters (avoid conflicts)\n",
    "    mlflow.log_param(\"algorithm\", \"KMeans\")\n",
    "    mlflow.log_param(\"n_clusters\", optimal_k)\n",
    "\n",
    "    # Log metrics\n",
    "    mlflow.log_metric(\"silhouette_score\", max(silhouette_scores))\n",
    "    mlflow.log_metric(\"inertia\", kmeans_optimal.inertia_)\n",
    "\n",
    "    # Log model\n",
    "    mlflow.sklearn.log_model(kmeans_optimal, \"kmeans_model\")\n",
    "\n",
    "    print(f\"\u2705 K-Means clustering completed with {optimal_k} clusters!\")\n",
    "    print(f\"\u2705 K-Means experiment logged to MLflow!\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eba9df96",
   "metadata": {},
   "outputs": [],
   "source": [
    "### 3.3 HDBSCAN Clustering\n",
    "\n",
    "# Perform HDBSCAN clustering with clean MLflow logging\n",
    "with mlflow.start_run(run_name=\"hdbscan_clustering\"):\n",
    "    # Create and fit the model\n",
    "    hdbscan_clusterer = hdbscan.HDBSCAN(\n",
    "        min_cluster_size=15,\n",
    "        min_samples=5,\n",
    "        cluster_selection_epsilon=0.1,\n",
    "        cluster_selection_method='eom'\n",
    "    )\n",
    "    hdbscan_labels = hdbscan_clusterer.fit_predict(X_scaled)\n",
    "\n",
    "    # Log only essential parameters (avoid conflicts)\n",
    "    mlflow.log_param(\"algorithm\", \"HDBSCAN\")\n",
    "    mlflow.log_param(\"min_cluster_size\", 15)\n",
    "\n",
    "    # Log model\n",
    "    mlflow.sklearn.log_model(hdbscan_clusterer, \"hdbscan_model\")\n",
    "\n",
    "# Add HDBSCAN labels to data\n",
    "df['HDBSCAN_Cluster'] = hdbscan_labels\n",
    "\n",
    "# Count clusters (noise points are labeled -1)\n",
    "n_clusters_hdbscan = len(set(hdbscan_labels)) - (1 if -1 in hdbscan_labels else 0)\n",
    "n_noise_points = list(hdbscan_labels).count(-1)\n",
    "\n",
    "print(f\"\ud83d\udcca HDBSCAN Results:\")\n",
    "print(f\" - Number of clusters: {n_clusters_hdbscan}\")\n",
    "print(f\" - Noise points: {n_noise_points}\")\n",
    "print(f\" - Noise percentage: {n_noise_points/len(df)*100:.1f}%\")\n",
    "\n",
    "# Calculate silhouette score (excluding noise points)\n",
    "if n_noise_points < len(df):\n",
    "    non_noise_mask = hdbscan_labels != -1\n",
    "    if len(set(hdbscan_labels[non_noise_mask])) > 1:\n",
    "        hdbscan_silhouette = silhouette_score(X_scaled[non_noise_mask], hdbscan_labels[non_noise_mask])\n",
    "        print(f\" - Silhouette score: {hdbscan_silhouette:.3f}\")\n",
    "\n",
    "        # Log HDBSCAN metrics in a separate run\n",
    "        with mlflow.start_run(run_name=\"hdbscan_metrics\"):\n",
    "            mlflow.log_metric(\"silhouette_score\", hdbscan_silhouette)\n",
    "            mlflow.log_metric(\"noise_percentage\", n_noise_points/len(df)*100)\n",
    "    else:\n",
    "        print(\" - Cannot calculate silhouette score (only one cluster)\")\n",
    "else:\n",
    "    print(\" - Cannot calculate silhouette score (all points are noise)\")\n",
    "\n",
    "print(\"\u2705 HDBSCAN experiment logged to MLflow!\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4a2882ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "### 3.4 Clustering Visualization\n",
    "\n",
    "# Create visualization plots\n",
    "fig, axes = plt.subplots(2, 2, figsize=(16, 12))\n",
    "\n",
    "# K-Means on PCA\n",
    "scatter1 = axes[0,0].scatter(pca_df['PC1'], pca_df['PC2'], c=kmeans_labels, cmap='viridis', alpha=0.6, s=50)\n",
    "axes[0,0].set_xlabel('PC1')\n",
    "axes[0,0].set_ylabel('PC2')\n",
    "axes[0,0].set_title('K-Means Clusters (PCA)')\n",
    "plt.colorbar(scatter1, ax=axes[0,0])\n",
    "\n",
    "# K-Means on UMAP\n",
    "scatter2 = axes[0,1].scatter(umap_df['UMAP1'], umap_df['UMAP2'], c=kmeans_labels, cmap='viridis', alpha=0.6, s=50)\n",
    "axes[0,1].set_xlabel('UMAP1')\n",
    "axes[0,1].set_ylabel('UMAP2')\n",
    "axes[0,1].set_title('K-Means Clusters (UMAP)')\n",
    "plt.colorbar(scatter2, ax=axes[0,1])\n",
    "\n",
    "# HDBSCAN on PCA\n",
    "scatter3 = axes[1,0].scatter(pca_df['PC1'], pca_df['PC2'], c=hdbscan_labels, cmap='Set1', alpha=0.6, s=50)\n",
    "axes[1,0].set_xlabel('PC1')\n",
    "axes[1,0].set_ylabel('PC2')\n",
    "axes[1,0].set_title('HDBSCAN Clusters (PCA)')\n",
    "plt.colorbar(scatter3, ax=axes[1,0])\n",
    "\n",
    "# HDBSCAN on UMAP\n",
    "scatter4 = axes[1,1].scatter(umap_df['UMAP1'], umap_df['UMAP2'], c=hdbscan_labels, cmap='Set1', alpha=0.6, s=50)\n",
    "axes[1,1].set_xlabel('UMAP1')\n",
    "axes[1,1].set_ylabel('UMAP2')\n",
    "axes[1,1].set_title('HDBSCAN Clusters (UMAP)')\n",
    "plt.colorbar(scatter4, ax=axes[1,1])\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "# Compare clustering results\n",
    "print(\"\ud83d\udcca Clustering Comparison:\")\n",
    "print(f\"K-Means clusters: {optimal_k}\")\n",
    "print(f\"HDBSCAN clusters: {n_clusters_hdbscan}\")\n",
    "print(f\"K-Means silhouette: {max(silhouette_scores):.3f}\")\n",
    "if n_noise_points < len(df) and len(set(hdbscan_labels[hdbscan_labels != -1])) > 1:\n",
    "    print(f\"HDBSCAN silhouette: {hdbscan_silhouette:.3f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d151af67",
   "metadata": {},
   "outputs": [],
   "source": [
    "## 4. Cluster Profiling and Interpretation\n",
    "\n",
    "### 4.1 K-Means Cluster Analysis\n",
    "\n",
    "# Analyze K-Means clusters\n",
    "print(\"\ud83d\udcca K-Means Cluster Analysis\")\n",
    "print(\"=\" * 50)\n",
    "\n",
    "# Key features for profiling (check which ones exist)\n",
    "base_features = ['Age', 'Avg_Daily_Usage_Hours', 'Sleep_Hours_Per_Night',\n",
    "                 'Mental_Health_Score', 'Conflicts_Over_Social_Media', 'Addicted_Score']\n",
    "\n",
    "# Add binary features that exist\n",
    "binary_features = []\n",
    "for feature in ['Is_Female', 'Is_Undergraduate', 'Is_Graduate', 'High_Usage', 'Low_Sleep',\n",
    "                'Poor_Mental_Health', 'High_Conflict', 'High_Addiction']:\n",
    "    if feature in df.columns:\n",
    "        binary_features.append(feature)\n",
    "\n",
    "key_features = base_features + binary_features\n",
    "\n",
    "# Create cluster profiles\n",
    "cluster_profiles = df.groupby('KMeans_Cluster')[key_features].mean()\n",
    "\n",
    "print(\"\\n\ud83d\udcca Cluster Profiles (Mean Values):\")\n",
    "print(cluster_profiles.round(3))\n",
    "\n",
    "# Visualize cluster characteristics\n",
    "fig, axes = plt.subplots(2, 3, figsize=(18, 12))\n",
    "\n",
    "# Usage patterns\n",
    "usage_features = ['Avg_Daily_Usage_Hours', 'Addicted_Score', 'Conflicts_Over_Social_Media']\n",
    "available_usage_features = [f for f in usage_features if f in cluster_profiles.columns]\n",
    "if available_usage_features:\n",
    "    cluster_profiles[available_usage_features].plot(\n",
    "        kind='bar', ax=axes[0,0], title='Usage & Addiction Patterns')\n",
    "    axes[0,0].set_ylabel('Score')\n",
    "    axes[0,0].tick_params(axis='x', rotation=45)\n",
    "\n",
    "# Health patterns\n",
    "health_features = ['Mental_Health_Score', 'Sleep_Hours_Per_Night']\n",
    "available_health_features = [f for f in health_features if f in cluster_profiles.columns]\n",
    "if available_health_features:\n",
    "    cluster_profiles[available_health_features].plot(\n",
    "        kind='bar', ax=axes[0,1], title='Health & Sleep Patterns')\n",
    "    axes[0,1].set_ylabel('Score')\n",
    "    axes[0,1].tick_params(axis='x', rotation=45)\n",
    "\n",
    "# Demographics\n",
    "demo_features = ['Is_Female', 'Is_Undergraduate', 'Is_Graduate']\n",
    "available_demo_features = [f for f in demo_features if f in cluster_profiles.columns]\n",
    "if available_demo_features:\n",
    "    cluster_profiles[available_demo_features].plot(\n",
    "        kind='bar', ax=axes[0,2], title='Demographic Patterns')\n",
    "    axes[0,2].set_ylabel('Proportion')\n",
    "    axes[0,2].tick_params(axis='x', rotation=45)\n",
    "\n",
    "# Binary features\n",
    "binary_plot_features = ['High_Usage', 'Low_Sleep', 'Poor_Mental_Health']\n",
    "available_binary_features = [f for f in binary_plot_features if f in cluster_profiles.columns]\n",
    "if available_binary_features:\n",
    "    cluster_profiles[available_binary_features].plot(\n",
    "        kind='bar', ax=axes[1,0], title='Risk Factor Patterns')\n",
    "    axes[1,0].set_ylabel('Proportion')\n",
    "    axes[1,0].tick_params(axis='x', rotation=45)\n",
    "\n",
    "# Age distribution\n",
    "if 'Age' in cluster_profiles.columns:\n",
    "    cluster_profiles['Age'].plot(kind='bar', ax=axes[1,1], title='Age Distribution')\n",
    "    axes[1,1].set_ylabel('Age')\n",
    "    axes[1,1].tick_params(axis='x', rotation=45)\n",
    "\n",
    "# Cluster sizes\n",
    "cluster_sizes = df['KMeans_Cluster'].value_counts().sort_index()\n",
    "cluster_sizes.plot(kind='bar', ax=axes[1,2], title='Cluster Sizes')\n",
    "axes[1,2].set_ylabel('Number of Students')\n",
    "axes[1,2].tick_params(axis='x', rotation=45)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3490cb53",
   "metadata": {},
   "outputs": [],
   "source": [
    "### 4.2 Cluster Labeling\n",
    "\n",
    "# Create intuitive labels for clusters based on their characteristics\n",
    "def label_clusters(cluster_profiles):\n",
    "    labels = {}\n",
    "\n",
    "    for cluster_id in cluster_profiles.index:\n",
    "        profile = cluster_profiles.loc[cluster_id]\n",
    "\n",
    "        # Determine usage level\n",
|
| 610 |
+
" if profile['Avg_Daily_Usage_Hours'] > 6:\n",
|
| 611 |
+
" usage_level = \"High-Usage\"\n",
|
| 612 |
+
" elif profile['Avg_Daily_Usage_Hours'] > 4:\n",
|
| 613 |
+
" usage_level = \"Moderate-Usage\"\n",
|
| 614 |
+
" else:\n",
|
| 615 |
+
" usage_level = \"Low-Usage\"\n",
|
| 616 |
+
" \n",
|
| 617 |
+
" # Determine health status\n",
|
| 618 |
+
" if profile['Mental_Health_Score'] < 5 and profile['Sleep_Hours_Per_Night'] < 6:\n",
|
| 619 |
+
" health_status = \"Poor-Health\"\n",
|
| 620 |
+
" elif profile['Mental_Health_Score'] > 7 and profile['Sleep_Hours_Per_Night'] > 7:\n",
|
| 621 |
+
" health_status = \"Good-Health\"\n",
|
| 622 |
+
" else:\n",
|
| 623 |
+
" health_status = \"Average-Health\"\n",
|
| 624 |
+
" \n",
|
| 625 |
+
" # Determine addiction level\n",
|
| 626 |
+
" if profile['Addicted_Score'] > 7:\n",
|
| 627 |
+
" addiction_level = \"High-Addiction\"\n",
|
| 628 |
+
" elif profile['Addicted_Score'] > 5:\n",
|
| 629 |
+
" addiction_level = \"Moderate-Addiction\"\n",
|
| 630 |
+
" else:\n",
|
| 631 |
+
" addiction_level = \"Low-Addiction\"\n",
|
| 632 |
+
" \n",
|
| 633 |
+
" # Create label\n",
|
| 634 |
+
" label = f\"{usage_level}_{health_status}_{addiction_level}\"\n",
|
| 635 |
+
" labels[cluster_id] = label\n",
|
| 636 |
+
" \n",
|
| 637 |
+
" return labels\n",
|
| 638 |
+
"\n",
|
| 639 |
+
"# Generate labels\n",
|
| 640 |
+
"cluster_labels = label_clusters(cluster_profiles)\n",
|
| 641 |
+
"\n",
|
| 642 |
+
"print(\"\ud83d\udcca Cluster Labels:\")\n",
|
| 643 |
+
"for cluster_id, label in cluster_labels.items():\n",
|
| 644 |
+
" size = cluster_sizes[cluster_id]\n",
|
| 645 |
+
" print(f\"Cluster {cluster_id} ({size} students): {label}\")\n",
|
| 646 |
+
"\n",
|
| 647 |
+
"# Add labels to dataframe\n",
|
| 648 |
+
"df['Cluster_Label'] = df['KMeans_Cluster'].map(cluster_labels)\n",
|
| 649 |
+
"\n",
|
| 650 |
+
"# Display sample students from each cluster\n",
|
| 651 |
+
"print(\"\\n\ud83d\udcca Sample Students from Each Cluster:\")\n",
|
| 652 |
+
"for cluster_id in sorted(df['KMeans_Cluster'].unique()):\n",
|
| 653 |
+
" cluster_students = df[df['KMeans_Cluster'] == cluster_id].head(3)\n",
|
| 654 |
+
" print(f\"\\nCluster {cluster_id} - {cluster_labels[cluster_id]}:\")\n",
|
| 655 |
+
" print(cluster_students[['Age', 'Gender', 'Avg_Daily_Usage_Hours', \n",
|
| 656 |
+
" 'Mental_Health_Score', 'Sleep_Hours_Per_Night', 'Addicted_Score']].to_string())"
|
| 657 |
+
]
|
| 658 |
+
},
|
| 659 |
+
{
|
| 660 |
+
"cell_type": "code",
|
| 661 |
+
"execution_count": null,
|
| 662 |
+
"id": "022f52d9",
|
| 663 |
+
"metadata": {},
|
| 664 |
+
"outputs": [],
|
| 665 |
+
"source": [
|
| 666 |
+
"## 5. MLflow Experiment Tracking\n",
|
| 667 |
+
"\n",
|
| 668 |
+
"### 5.1 Comprehensive MLflow Logging\n",
|
| 669 |
+
"\n",
|
| 670 |
+
"print(\"\u2705 MLflow tracking configured!\")\n",
|
| 671 |
+
"\n",
|
| 672 |
+
"# Log clustering comparison and summary\n",
|
| 673 |
+
"with mlflow.start_run(run_name=\"clustering_summary\"):\n",
|
| 674 |
+
" # Log overall statistics\n",
|
| 675 |
+
" mlflow.log_metric(\"total_students\", len(df))\n",
|
| 676 |
+
" mlflow.log_metric(\"kmeans_clusters\", optimal_k)\n",
|
| 677 |
+
" mlflow.log_metric(\"hdbscan_clusters\", n_clusters_hdbscan)\n",
|
| 678 |
+
" mlflow.log_metric(\"best_silhouette_score\", max(silhouette_scores))\n",
|
| 679 |
+
" \n",
|
| 680 |
+
" # Log cluster labels\n",
|
| 681 |
+
" mlflow.log_dict(cluster_labels, \"cluster_labels.json\")\n",
|
| 682 |
+
" \n",
|
| 683 |
+
" # Log feature scaling info\n",
|
| 684 |
+
" scaling_info = {\n",
|
| 685 |
+
" \"scaler_type\": \"StandardScaler\",\n",
|
| 686 |
+
" \"n_features_scaled\": len(numerical_features)\n",
|
| 687 |
+
" }\n",
|
| 688 |
+
" mlflow.log_dict(scaling_info, \"scaling_info.json\")\n",
|
| 689 |
+
" \n",
|
| 690 |
+
" # Log clustering comparison metrics\n",
|
| 691 |
+
" comparison_metrics = {\n",
|
| 692 |
+
" \"kmeans_silhouette\": max(silhouette_scores),\n",
|
| 693 |
+
" \"kmeans_inertia\": kmeans_optimal.inertia_,\n",
|
| 694 |
+
" \"hdbscan_noise_percentage\": n_noise_points/len(df)*100\n",
|
| 695 |
+
" }\n",
|
| 696 |
+
" \n",
|
| 697 |
+
" # Add HDBSCAN silhouette if available\n",
|
| 698 |
+
" if 'hdbscan_silhouette' in locals():\n",
|
| 699 |
+
" comparison_metrics[\"hdbscan_silhouette\"] = hdbscan_silhouette\n",
|
| 700 |
+
" \n",
|
| 701 |
+
" for metric_name, metric_value in comparison_metrics.items():\n",
|
| 702 |
+
" mlflow.log_metric(metric_name, metric_value)\n",
|
| 703 |
+
" \n",
|
| 704 |
+
" print(\"\u2705 Clustering summary logged to MLflow!\")\n",
|
| 705 |
+
"\n",
|
| 706 |
+
"print(\"\\n\ud83d\udcca All experiments logged to MLflow successfully!\")\n",
|
| 707 |
+
"print(\"\ud83d\udcca You can view results using: mlflow ui --port 5001\")\n",
|
| 708 |
+
"print(\"\ud83d\udcca Experiments logged:\")\n",
|
| 709 |
+
"print(\" - kmeans_optimal\")\n",
|
| 710 |
+
"print(\" - hdbscan_clustering\") \n",
|
| 711 |
+
"print(\" - hdbscan_metrics\")\n",
|
| 712 |
+
"print(\" - clustering_summary\")"
|
| 713 |
+
]
|
| 714 |
+
},
|
| 715 |
+
{
|
| 716 |
+
"cell_type": "code",
|
| 717 |
+
"execution_count": null,
|
| 718 |
+
"id": "daef8797",
|
| 719 |
+
"metadata": {},
|
| 720 |
+
"outputs": [],
|
| 721 |
+
"source": [
|
| 722 |
+
"## 6. Final Analysis and Insights\n",
|
| 723 |
+
"\n",
|
| 724 |
+
"### 6.1 Key Findings\n",
|
| 725 |
+
"\n",
|
| 726 |
+
"print(\"\ud83d\udcca CLUSTERING ANALYSIS INSIGHTS\")\n",
|
| 727 |
+
"print(\"=\" * 50)\n",
|
| 728 |
+
"\n",
|
| 729 |
+
"# Summary statistics by cluster\n",
|
| 730 |
+
"cluster_summary = df.groupby('Cluster_Label').agg({\n",
|
| 731 |
+
" 'Avg_Daily_Usage_Hours': ['mean', 'std'],\n",
|
| 732 |
+
" 'Mental_Health_Score': ['mean', 'std'],\n",
|
| 733 |
+
" 'Sleep_Hours_Per_Night': ['mean', 'std'],\n",
|
| 734 |
+
" 'Addicted_Score': ['mean', 'std'],\n",
|
| 735 |
+
" 'Age': ['mean', 'std']\n",
|
| 736 |
+
"}).round(2)\n",
|
| 737 |
+
"\n",
|
| 738 |
+
"print(\"\\n\ud83d\udcca Cluster Summary Statistics:\")\n",
|
| 739 |
+
"print(cluster_summary)\n",
|
| 740 |
+
"\n",
|
| 741 |
+
"# Risk assessment by cluster\n",
|
| 742 |
+
"risk_factors = ['High_Usage', 'Low_Sleep', 'Poor_Mental_Health', 'High_Conflict', 'High_Addiction']\n",
|
| 743 |
+
"risk_by_cluster = df.groupby('Cluster_Label')[risk_factors].mean()\n",
|
| 744 |
+
"\n",
|
| 745 |
+
"print(\"\\n\ud83d\udcca Risk Factors by Cluster:\")\n",
|
| 746 |
+
"print(risk_by_cluster.round(3))\n",
|
| 747 |
+
"\n",
|
| 748 |
+
"# Platform usage by cluster\n",
|
| 749 |
+
"platform_cols = [col for col in df.columns if col.startswith('Uses_')]\n",
|
| 750 |
+
"platform_by_cluster = df.groupby('Cluster_Label')[platform_cols].mean()\n",
|
| 751 |
+
"\n",
|
| 752 |
+
"print(\"\\n\ud83d\udcca Platform Usage by Cluster:\")\n",
|
| 753 |
+
"print(platform_by_cluster.round(3))\n",
|
| 754 |
+
"\n",
|
| 755 |
+
"### 6.2 Intervention Recommendations\n",
|
| 756 |
+
"\n",
|
| 757 |
+
"print(\"\\n\ud83d\udcca INTERVENTION RECOMMENDATIONS\")\n",
|
| 758 |
+
"print(\"=\" * 50)\n",
|
| 759 |
+
"\n",
|
| 760 |
+
"for cluster_label in df['Cluster_Label'].unique():\n",
|
| 761 |
+
" cluster_data = df[df['Cluster_Label'] == cluster_label]\n",
|
| 762 |
+
" size = len(cluster_data)\n",
|
| 763 |
+
" percentage = size / len(df) * 100\n",
|
| 764 |
+
" \n",
|
| 765 |
+
" print(f\"\\n\ud83c\udfaf Cluster: {cluster_label}\")\n",
|
| 766 |
+
" print(f\" Size: {size} students ({percentage:.1f}%)\")\n",
|
| 767 |
+
" \n",
|
| 768 |
+
" # Identify key characteristics\n",
|
| 769 |
+
" avg_usage = cluster_data['Avg_Daily_Usage_Hours'].mean()\n",
|
| 770 |
+
" avg_mental_health = cluster_data['Mental_Health_Score'].mean()\n",
|
| 771 |
+
" avg_sleep = cluster_data['Sleep_Hours_Per_Night'].mean()\n",
|
| 772 |
+
" avg_addiction = cluster_data['Addicted_Score'].mean()\n",
|
| 773 |
+
" \n",
|
| 774 |
+
" print(f\" Average Usage: {avg_usage:.1f} hours/day\")\n",
|
| 775 |
+
" print(f\" Mental Health Score: {avg_mental_health:.1f}/10\")\n",
|
| 776 |
+
" print(f\" Sleep Hours: {avg_sleep:.1f} hours/night\")\n",
|
| 777 |
+
" print(f\" Addiction Score: {avg_addiction:.1f}/10\")\n",
|
| 778 |
+
" \n",
|
| 779 |
+
" # Generate recommendations\n",
|
| 780 |
+
" if avg_usage > 6 and avg_addiction > 7:\n",
|
| 781 |
+
" print(\" \u26a0\ufe0f HIGH RISK: Intensive intervention needed\")\n",
|
| 782 |
+
" print(\" \ud83d\udca1 Recommendations: Digital detox programs, counseling, parental monitoring\")\n",
|
| 783 |
+
" elif avg_usage > 4 and avg_mental_health < 6:\n",
|
| 784 |
+
" print(\" \u26a0\ufe0f MODERATE RISK: Targeted intervention recommended\")\n",
|
| 785 |
+
" print(\" \ud83d\udca1 Recommendations: Screen time limits, mental health support, sleep hygiene\")\n",
|
| 786 |
+
" else:\n",
|
| 787 |
+
" print(\" \u2705 LOW RISK: Monitor and provide resources\")\n",
|
| 788 |
+
" print(\" \ud83d\udca1 Recommendations: Educational materials, healthy usage guidelines\")\n",
|
| 789 |
+
"\n",
|
| 790 |
+
"print(\"\\n\u2705 Clustering analysis completed successfully!\")\n",
|
| 791 |
+
"print(\"\ud83d\udcca Check MLflow UI for detailed experiment tracking\")\n",
|
| 792 |
+
"print(\"\ud83d\udcca Use cluster labels for targeted interventions\")\n"
|
| 793 |
+
]
|
| 794 |
+
},
|
| 795 |
+
{
|
| 796 |
+
"cell_type": "markdown",
|
| 797 |
+
"id": "2c6194f8",
|
| 798 |
+
"metadata": {},
|
| 799 |
+
"source": [
|
| 800 |
+
"## 11. Next Steps & Best Practices\n",
|
| 801 |
+
"\n",
|
| 802 |
+
"This final section provides a comprehensive summary of the clustering analysis workflow and actionable next steps for production deployment.\n",
|
| 803 |
+
"\n",
|
| 804 |
+
"### What We've Accomplished\n",
|
| 805 |
+
"- **Data Preparation**: Feature engineering, scaling, and dimensionality reduction\n",
|
| 806 |
+
"- **Clustering Analysis**: KMeans and HDBSCAN algorithms with optimal parameter selection\n",
|
| 807 |
+
"- **Evaluation**: Silhouette scores, visual validation, and cluster profiling\n",
|
| 808 |
+
"- **MLflow Integration**: Complete experiment tracking and model versioning\n",
|
| 809 |
+
"- **Interpretability**: Cluster labeling and actionable insights for intervention strategies\n",
|
| 810 |
+
"\n",
|
| 811 |
+
"### Key Insights\n",
|
| 812 |
+
"- Identified distinct user segments based on social media behavior patterns\n",
|
| 813 |
+
"- Created risk assessment profiles for targeted interventions\n",
|
| 814 |
+
"- Built reproducible clustering pipeline with MLflow tracking\n",
|
| 815 |
+
"- Generated actionable recommendations for each cluster\n",
|
| 816 |
+
"\n",
|
| 817 |
+
"### Production Readiness\n",
|
| 818 |
+
"The analysis is now ready for production deployment with proper monitoring, retraining pipelines, and API integration."
|
| 819 |
+
]
|
| 820 |
+
},
|
| 821 |
+
{
|
| 822 |
+
"cell_type": "code",
|
| 823 |
+
"execution_count": null,
|
| 824 |
+
"id": "17fc0802",
|
| 825 |
+
"metadata": {},
|
| 826 |
+
"outputs": [],
|
| 827 |
+
"source": [
|
| 828 |
+
"\n",
|
| 829 |
+
"# Best Practices Summary\n",
|
| 830 |
+
"print(\"\ud83c\udfaf MLflow Clustering Best Practices Implemented:\")\n",
|
| 831 |
+
"print(\"1. \u2705 Comprehensive data preparation and feature engineering\")\n",
|
| 832 |
+
"print(\"2. \u2705 Dimensionality reduction (PCA, UMAP) for visualization\")\n",
|
| 833 |
+
"print(\"3. \u2705 Multiple clustering algorithms (KMeans, HDBSCAN)\")\n",
|
| 834 |
+
"print(\"4. \u2705 Silhouette score and visual validation\")\n",
|
| 835 |
+
"print(\"5. \u2705 Cluster profiling and interpretability\")\n",
|
| 836 |
+
"print(\"6. \u2705 MLflow experiment tracking and model versioning\")\n",
|
| 837 |
+
"\n",
|
| 838 |
+
"print(\"\\n\ud83d\udcca Cluster Analysis Insights:\")\n",
|
| 839 |
+
"print(\"\u2022 Silhouette Score: Measures cluster separation (higher is better)\")\n",
|
| 840 |
+
"print(\"\u2022 Visualizations: PCA/UMAP plots for cluster structure\")\n",
|
| 841 |
+
"print(\"\u2022 Cluster profiles: Mean values and risk factors by group\")\n",
|
| 842 |
+
"print(\"\u2022 Labeling: Intuitive cluster names for actionable insights\")\n",
|
| 843 |
+
"\n",
|
| 844 |
+
"print(\"\\n\ud83d\udccb Next Steps:\")\n",
|
| 845 |
+
"print(\"1. Launch MLflow UI: mlflow ui --port 5001\")\n",
|
| 846 |
+
"print(\"2. Access experiments at: http://localhost:5001\")\n",
|
| 847 |
+
"print(\"3. Compare clustering runs and diagnostic plots\")\n",
|
| 848 |
+
"print(\"4. Deploy best clustering model to production API\")\n",
|
| 849 |
+
"print(\"5. Set up automated retraining pipeline\")\n",
|
| 850 |
+
"print(\"6. Monitor cluster assignments in production\")\n",
|
| 851 |
+
"print(\"7. Consider ensemble or consensus clustering for robustness\")\n",
|
| 852 |
+
"\n",
|
| 853 |
+
"print(\"\\n\ud83d\udd27 To launch MLflow UI:\")\n",
|
| 854 |
+
"print(\"!mlflow ui --port 5001 --host 0.0.0.0\")\n",
|
| 855 |
+
"\n",
|
| 856 |
+
"print(\"\\n\ud83d\udcc8 Additional Recommendations:\")\n",
|
| 857 |
+
"print(\"\u2022 Explore ensemble clustering (consensus, voting)\")\n",
|
| 858 |
+
"print(\"\u2022 Implement feature selection for clustering\")\n",
|
| 859 |
+
"print(\"\u2022 Add cluster explainability tools (e.g., SHAP for cluster assignment)\")\n",
|
| 860 |
+
"print(\"\u2022 Set up automated monitoring for cluster drift\")\n",
|
| 861 |
+
"print(\"\u2022 Build dashboards for real-time cluster insights\")\n"
|
| 862 |
+
]
|
| 863 |
+
}
|
| 864 |
+
],
|
| 865 |
+
"metadata": {},
|
| 866 |
+
"nbformat": 4,
|
| 867 |
+
"nbformat_minor": 5
|
| 868 |
+
}
|