BlakeL committed
Commit e7988fe · verified · 1 Parent(s): b73dae3

Upload 3 files

07_conflicts_prediction_mlflow.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
08_regression_addicted_score.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
09_clustering_analysis.ipynb ADDED
@@ -0,0 +1,868 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "60199374",
6
+ "metadata": {},
7
+ "source": [
8
+ "# Clustering Analysis - Social Media Usage Patterns\n",
9
+ "\n",
10
+ "## Overview\n",
11
+ "This notebook performs comprehensive clustering analysis on student social media usage data to identify distinct behavioral patterns and user segments.\n",
12
+ "\n",
13
+ "## Analysis Pipeline:\n",
14
+ "1. **Data Preparation** - Feature engineering and scaling\n",
15
+ "2. **Dimensionality Reduction** - PCA/UMAP for visualization\n",
16
+ "3. **Clustering Algorithms** - KMeans, HDBSCAN, and others\n",
17
+ "4. **Evaluation** - Silhouette scores and visual validation\n",
18
+ "5. **Interpretability** - Cluster profiling and labeling\n",
19
+ "6. **MLflow Tracking** - Experiment tracking and model management\n",
20
+ "\n",
21
+ "## Key Objectives:\n",
22
+ "- Identify distinct user segments based on social media behavior\n",
23
+ "- Understand relationships between usage patterns and demographics\n",
24
+ "- Create actionable insights for intervention strategies\n",
25
+ "- Build reproducible clustering pipeline with MLflow\n"
26
+ ]
27
+ },
28
+ {
29
+ "cell_type": "code",
30
+ "execution_count": null,
31
+ "id": "bc8d220b",
32
+ "metadata": {},
33
+ "outputs": [],
34
+ "source": [
35
+ "# Core data science libraries\n",
36
+ "import pandas as pd\n",
37
+ "import numpy as np\n",
38
+ "import matplotlib.pyplot as plt\n",
39
+ "import seaborn as sns\n",
40
+ "from sklearn.preprocessing import StandardScaler, MinMaxScaler\n",
41
+ "from sklearn.decomposition import PCA\n",
42
+ "from sklearn.cluster import KMeans, DBSCAN\n",
43
+ "from sklearn.metrics import silhouette_score, silhouette_samples\n",
44
+ "from sklearn.manifold import TSNE\n",
45
+ "import umap\n",
46
+ "import hdbscan\n",
47
+ "from scipy import stats\n",
48
+ "import warnings\n",
49
+ "warnings.filterwarnings('ignore')\n",
50
+ "\n",
51
+ "# MLflow for experiment tracking\n",
52
+ "import mlflow\n",
53
+ "import mlflow.sklearn\n",
54
+ "from mlflow.tracking import MlflowClient\n",
55
+ "\n",
56
+ "# Visualization settings\n",
57
+ "plt.style.use('seaborn-v0_8')\n",
58
+ "sns.set_palette(\"husl\")\n",
59
+ "%matplotlib inline\n",
60
+ "\n",
61
+ "# Set pandas display options\n",
62
+ "pd.set_option('display.max_columns', None)\n",
63
+ "pd.set_option('display.max_rows', 100)\n",
64
+ "pd.set_option('display.width', None)\n",
65
+ "\n",
66
+ "print(\"\u2705 Libraries imported successfully!\")\n",
67
+ "\n",
68
+ "# Fresh MLflow setup with autologging disabled\n",
69
+ "mlflow.set_tracking_uri(\"file:mlruns\")\n",
70
+ "mlflow.set_experiment(\"Clustering_Analysis\")\n",
71
+ "\n",
72
+ "# Disable autologging to avoid conflicting runs\n",
73
+ "mlflow.sklearn.autolog(disable=True)\n",
74
+ "\n",
75
+ "print(\"\u2705 MLflow tracking configured!\")\n",
76
+ "print(\"\ud83d\udcca MLflow tracking URI: file:mlruns\")\n",
77
+ "print(\"\ud83d\udcca MLflow experiment: Clustering_Analysis\")\n",
78
+ "print(\"\ud83d\udd12 Autologging disabled to prevent conflicts\")"
79
+ ]
80
+ },
81
+ {
82
+ "cell_type": "code",
83
+ "execution_count": null,
84
+ "id": "9f406fe7",
85
+ "metadata": {},
86
+ "outputs": [],
87
+ "source": [
88
+ "# Load the dataset\n",
89
+ "from pathlib import Path\n",
90
+ "PROJECT_ROOT = Path.cwd().parent\n",
91
+ "DATA_DIR = PROJECT_ROOT / \"data\"\n",
92
+ "\n",
93
+ "print(\"\ud83d\udcca Loading Students Social Media Addiction dataset...\")\n",
94
+ "df = pd.read_csv(DATA_DIR / \"Students Social Media Addiction.csv\")\n",
95
+ "\n",
96
+ "print(f\"\u2705 Dataset loaded successfully!\")\n",
97
+ "print(f\"\ud83d\udccb Shape: {df.shape}\")\n",
98
+ "print(f\"\ud83d\udccb Columns: {list(df.columns)}\")\n",
99
+ "\n",
100
+ "# Display basic info\n",
101
+ "print(\"\\n\ud83d\udcca Dataset Overview:\")\n",
102
+ "print(f\" - Total students: {len(df)}\")\n",
103
+ "print(f\" - Age range: {df['Age'].min()} - {df['Age'].max()} years\")\n",
104
+ "print(f\" - Countries: {df['Country'].nunique()}\")\n",
105
+ "print(f\" - Platforms: {df['Most_Used_Platform'].nunique()}\")\n",
106
+ "\n",
107
+ "# Display first few rows\n",
108
+ "df.head()"
109
+ ]
110
+ },
111
+ {
112
+ "cell_type": "code",
113
+ "execution_count": null,
114
+ "id": "7ab8f782",
115
+ "metadata": {},
116
+ "outputs": [],
117
+ "source": [
118
+ "## 1. Data Preparation\n",
119
+ "\n",
120
+ "### 1.1 Feature Engineering\n",
121
+ "\n",
122
+ "# Create binary features for categorical variables\n",
123
+ "df['Is_Female'] = (df['Gender'] == 'Female').astype(int)\n",
124
+ "df['Is_Male'] = (df['Gender'] == 'Male').astype(int)\n",
125
+ "\n",
126
+ "# Academic level features\n",
127
+ "df['Is_Undergraduate'] = (df['Academic_Level'] == 'Undergraduate').astype(int)\n",
128
+ "df['Is_Graduate'] = (df['Academic_Level'] == 'Graduate').astype(int)\n",
129
+ "df['Is_High_School'] = (df['Academic_Level'] == 'High School').astype(int)\n",
130
+ "\n",
131
+ "# Relationship status features\n",
132
+ "df['Is_Single'] = (df['Relationship_Status'] == 'Single').astype(int)\n",
133
+ "df['Is_In_Relationship'] = (df['Relationship_Status'] == 'In Relationship').astype(int)\n",
134
+ "df['Is_Complicated'] = (df['Relationship_Status'] == 'Complicated').astype(int)\n",
135
+ "\n",
136
+ "# Academic performance\n",
137
+ "df['Affects_Academic'] = (df['Affects_Academic_Performance'] == 'Yes').astype(int)\n",
138
+ "\n",
139
+ "# Create platform dummies (top 6 platforms)\n",
140
+ "top_platforms = df['Most_Used_Platform'].value_counts().head(6).index\n",
141
+ "for platform in top_platforms:\n",
142
+ " df[f'Uses_{platform}'] = (df['Most_Used_Platform'] == platform).astype(int)\n",
143
+ "\n",
144
+ "# Create behavioral features\n",
145
+ "df['High_Usage'] = (df['Avg_Daily_Usage_Hours'] >= 6).astype(int)\n",
146
+ "df['Low_Sleep'] = (df['Sleep_Hours_Per_Night'] <= 6).astype(int)\n",
147
+ "df['Poor_Mental_Health'] = (df['Mental_Health_Score'] <= 5).astype(int)\n",
148
+ "df['High_Conflict'] = (df['Conflicts_Over_Social_Media'] >= 3).astype(int)\n",
149
+ "df['High_Addiction'] = (df['Addicted_Score'] >= 7).astype(int)\n",
150
+ "\n",
151
+ "# Create interaction features\n",
152
+ "df['Usage_Sleep_Ratio'] = df['Avg_Daily_Usage_Hours'] / df['Sleep_Hours_Per_Night']\n",
153
+ "df['Mental_Health_Usage_Ratio'] = df['Mental_Health_Score'] / df['Avg_Daily_Usage_Hours']\n",
154
+ "\n",
155
+ "print(\"\u2705 Feature engineering completed!\")\n",
156
+ "print(f\"\ud83d\udcca New features created: {len([col for col in df.columns if col.startswith(('Is_', 'Uses_', 'High_', 'Low_', 'Poor_')) or col.endswith('_Ratio')])}\")"
157
+ ]
158
+ },
159
+ {
160
+ "cell_type": "code",
161
+ "execution_count": null,
162
+ "id": "a2e3ee38",
163
+ "metadata": {},
164
+ "outputs": [],
165
+ "source": [
166
+ "### 1.2 Feature Selection for Clustering\n",
167
+ "\n",
168
+ "# Select numerical features for clustering\n",
169
+ "numerical_features = [\n",
170
+ " 'Age', 'Avg_Daily_Usage_Hours', 'Sleep_Hours_Per_Night', \n",
171
+ " 'Mental_Health_Score', 'Conflicts_Over_Social_Media', 'Addicted_Score',\n",
172
+ " 'Is_Female', 'Is_Undergraduate', 'Is_Graduate', 'Is_High_School',\n",
173
+ " 'Is_Single', 'Is_In_Relationship', 'Is_Complicated', 'Affects_Academic',\n",
174
+ " 'High_Usage', 'Low_Sleep', 'Poor_Mental_Health', 'High_Conflict', 'High_Addiction',\n",
175
+ " 'Usage_Sleep_Ratio', 'Mental_Health_Usage_Ratio'\n",
176
+ "]\n",
177
+ "\n",
178
+ "# Add platform features\n",
179
+ "platform_features = [col for col in df.columns if col.startswith('Uses_')]\n",
180
+ "numerical_features.extend(platform_features)\n",
181
+ "\n",
182
+ "# Create feature matrix\n",
183
+ "X = df[numerical_features].copy()\n",
184
+ "\n",
185
+ "print(f\"\ud83d\udcca Feature matrix shape: {X.shape}\")\n",
186
+ "print(f\"\ud83d\udcca Features selected: {len(numerical_features)}\")\n",
187
+ "\n",
188
+ "# Check for missing values\n",
189
+ "print(\"\\n\ud83d\udcca Missing values check:\")\n",
190
+ "print(X.isnull().sum().sum(), \"missing values found\")\n",
191
+ "\n",
192
+ "# Display feature statistics\n",
193
+ "print(\"\\n\ud83d\udcca Feature statistics:\")\n",
194
+ "print(X.describe())"
195
+ ]
196
+ },
197
+ {
198
+ "cell_type": "code",
199
+ "execution_count": null,
200
+ "id": "9ac222e5",
201
+ "metadata": {},
202
+ "outputs": [],
203
+ "source": [
204
+ "### 1.3 Feature Scaling\n",
205
+ "\n",
206
+ "# Standardize features for clustering\n",
207
+ "scaler = StandardScaler()\n",
208
+ "X_scaled = scaler.fit_transform(X)\n",
209
+ "\n",
210
+ "# Convert back to DataFrame for easier handling\n",
211
+ "X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)\n",
212
+ "\n",
213
+ "print(\"\u2705 Features scaled successfully!\")\n",
214
+ "print(f\"\ud83d\udcca Scaled features shape: {X_scaled_df.shape}\")\n",
215
+ "\n",
216
+ "# Verify scaling\n",
217
+ "print(\"\\n\ud83d\udcca Scaling verification:\")\n",
218
+ "print(\"Mean of scaled features:\", X_scaled_df.mean().mean())\n",
219
+ "print(\"Std of scaled features:\", X_scaled_df.std().mean())"
220
+ ]
221
+ },
222
+ {
223
+ "cell_type": "code",
224
+ "execution_count": null,
225
+ "id": "69057eff",
226
+ "metadata": {},
227
+ "outputs": [],
228
+ "source": [
229
+ "## 2. Dimensionality Reduction for Visualization\n",
230
+ "\n",
231
+ "### 2.1 Principal Component Analysis (PCA)\n",
232
+ "\n",
233
+ "# Perform PCA\n",
234
+ "pca = PCA(n_components=2, random_state=42)\n",
235
+ "X_pca = pca.fit_transform(X_scaled)\n",
236
+ "\n",
237
+ "# Create PCA DataFrame\n",
238
+ "pca_df = pd.DataFrame(X_pca, columns=['PC1', 'PC2'], index=X.index)\n",
239
+ "\n",
240
+ "print(\"\ud83d\udcca PCA Results:\")\n",
241
+ "print(f\"Explained variance ratio: {pca.explained_variance_ratio_}\")\n",
242
+ "print(f\"Total explained variance: {pca.explained_variance_ratio_.sum():.3f}\")\n",
243
+ "\n",
244
+ "# Visualize PCA\n",
245
+ "plt.figure(figsize=(12, 8))\n",
246
+ "plt.scatter(pca_df['PC1'], pca_df['PC2'], alpha=0.6, s=50)\n",
247
+ "plt.xlabel('Principal Component 1')\n",
248
+ "plt.ylabel('Principal Component 2')\n",
249
+ "plt.title('PCA Visualization of Social Media Usage Patterns')\n",
250
+ "plt.grid(True, alpha=0.3)\n",
251
+ "plt.show()\n",
252
+ "\n",
253
+ "# Feature importance in PCA\n",
254
+ "feature_importance = pd.DataFrame(\n",
255
+ " pca.components_.T,\n",
256
+ " columns=['PC1', 'PC2'],\n",
257
+ " index=X.columns\n",
258
+ ")\n",
259
+ "\n",
260
+ "print(\"\\n\ud83d\udcca Top features contributing to PC1:\")\n",
261
+ "print(feature_importance['PC1'].abs().sort_values(ascending=False).head(10))\n",
262
+ "\n",
263
+ "print(\"\\n\ud83d\udcca Top features contributing to PC2:\")\n",
264
+ "print(feature_importance['PC2'].abs().sort_values(ascending=False).head(10))"
265
+ ]
266
+ },
267
+ {
268
+ "cell_type": "code",
269
+ "execution_count": null,
270
+ "id": "d315d19f",
271
+ "metadata": {},
272
+ "outputs": [],
273
+ "source": [
274
+ "### 2.2 UMAP for Non-linear Dimensionality Reduction\n",
275
+ "\n",
276
+ "# Perform UMAP\n",
277
+ "umap_reducer = umap.UMAP(n_components=2, random_state=42, n_neighbors=15, min_dist=0.1)\n",
278
+ "X_umap = umap_reducer.fit_transform(X_scaled)\n",
279
+ "\n",
280
+ "# Create UMAP DataFrame\n",
281
+ "umap_df = pd.DataFrame(X_umap, columns=['UMAP1', 'UMAP2'], index=X.index)\n",
282
+ "\n",
283
+ "print(\"\u2705 UMAP reduction completed!\")\n",
284
+ "\n",
285
+ "# Visualize UMAP\n",
286
+ "plt.figure(figsize=(12, 8))\n",
287
+ "plt.scatter(umap_df['UMAP1'], umap_df['UMAP2'], alpha=0.6, s=50)\n",
288
+ "plt.xlabel('UMAP Component 1')\n",
289
+ "plt.ylabel('UMAP Component 2')\n",
290
+ "plt.title('UMAP Visualization of Social Media Usage Patterns')\n",
291
+ "plt.grid(True, alpha=0.3)\n",
292
+ "plt.show()\n",
293
+ "\n",
294
+ "# Compare PCA vs UMAP\n",
295
+ "fig, axes = plt.subplots(1, 2, figsize=(16, 6))\n",
296
+ "\n",
297
+ "# PCA plot\n",
298
+ "axes[0].scatter(pca_df['PC1'], pca_df['PC2'], alpha=0.6, s=50)\n",
299
+ "axes[0].set_xlabel('PC1')\n",
300
+ "axes[0].set_ylabel('PC2')\n",
301
+ "axes[0].set_title('PCA Visualization')\n",
302
+ "axes[0].grid(True, alpha=0.3)\n",
303
+ "\n",
304
+ "# UMAP plot\n",
305
+ "axes[1].scatter(umap_df['UMAP1'], umap_df['UMAP2'], alpha=0.6, s=50)\n",
306
+ "axes[1].set_xlabel('UMAP1')\n",
307
+ "axes[1].set_ylabel('UMAP2')\n",
308
+ "axes[1].set_title('UMAP Visualization')\n",
309
+ "axes[1].grid(True, alpha=0.3)\n",
310
+ "\n",
311
+ "plt.tight_layout()\n",
312
+ "plt.show()"
313
+ ]
314
+ },
315
+ {
316
+ "cell_type": "code",
317
+ "execution_count": null,
318
+ "id": "fca5c18d",
319
+ "metadata": {},
320
+ "outputs": [],
321
+ "source": [
322
+ "## 3. Clustering Algorithms\n",
323
+ "\n",
324
+ "### 3.1 K-Means Clustering\n",
325
+ "\n",
326
+ "# Find optimal number of clusters using elbow method\n",
327
+ "inertias = []\n",
328
+ "silhouette_scores = []\n",
329
+ "k_range = range(2, 11)\n",
330
+ "\n",
331
+ "for k in k_range:\n",
332
+ " kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)\n",
333
+ " kmeans.fit(X_scaled)\n",
334
+ " inertias.append(kmeans.inertia_)\n",
335
+ " silhouette_scores.append(silhouette_score(X_scaled, kmeans.labels_))\n",
336
+ "\n",
337
+ "# Plot elbow curve\n",
338
+ "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))\n",
339
+ "\n",
340
+ "# Inertia plot\n",
341
+ "ax1.plot(k_range, inertias, 'bo-')\n",
342
+ "ax1.set_xlabel('Number of Clusters (k)')\n",
343
+ "ax1.set_ylabel('Inertia')\n",
344
+ "ax1.set_title('Elbow Method for Optimal k')\n",
345
+ "ax1.grid(True, alpha=0.3)\n",
346
+ "\n",
347
+ "# Silhouette score plot\n",
348
+ "ax2.plot(k_range, silhouette_scores, 'ro-')\n",
349
+ "ax2.set_xlabel('Number of Clusters (k)')\n",
350
+ "ax2.set_ylabel('Silhouette Score')\n",
351
+ "ax2.set_title('Silhouette Score vs Number of Clusters')\n",
352
+ "ax2.grid(True, alpha=0.3)\n",
353
+ "\n",
354
+ "plt.tight_layout()\n",
355
+ "plt.show()\n",
356
+ "\n",
357
+ "# Find optimal k\n",
358
+ "optimal_k = k_range[np.argmax(silhouette_scores)]\n",
359
+ "print(f\"\ud83d\udcca Optimal number of clusters (K-Means): {optimal_k}\")\n",
360
+ "print(f\"\ud83d\udcca Best silhouette score: {max(silhouette_scores):.3f}\")"
361
+ ]
362
+ },
363
+ {
364
+ "cell_type": "code",
365
+ "execution_count": null,
366
+ "id": "e9fc6734",
367
+ "metadata": {},
368
+ "outputs": [],
369
+ "source": [
370
+ "### 3.2 K-Means with Optimal k\n",
371
+ "\n",
372
+ "# Perform K-Means with optimal k and clean MLflow logging\n",
373
+ "with mlflow.start_run(run_name=\"kmeans_optimal\"):\n",
374
+ " # Create and fit the model\n",
375
+ " kmeans_optimal = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)\n",
376
+ " kmeans_labels = kmeans_optimal.fit_predict(X_scaled)\n",
377
+ " df['KMeans_Cluster'] = kmeans_labels\n",
378
+ " \n",
379
+ " # Log only essential parameters (avoid conflicts)\n",
380
+ " mlflow.log_param(\"algorithm\", \"KMeans\")\n",
381
+ " mlflow.log_param(\"n_clusters\", optimal_k)\n",
382
+ " \n",
383
+ " # Log metrics\n",
384
+ " mlflow.log_metric(\"silhouette_score\", max(silhouette_scores))\n",
385
+ " mlflow.log_metric(\"inertia\", kmeans_optimal.inertia_)\n",
386
+ " \n",
387
+ " # Log model\n",
388
+ " mlflow.sklearn.log_model(kmeans_optimal, \"kmeans_model\")\n",
389
+ " \n",
390
+ " print(f\"\u2705 K-Means clustering completed with {optimal_k} clusters!\")\n",
391
+ " print(f\"\u2705 K-Means experiment logged to MLflow!\")"
392
+ ]
393
+ },
394
+ {
395
+ "cell_type": "code",
396
+ "execution_count": null,
397
+ "id": "eba9df96",
398
+ "metadata": {},
399
+ "outputs": [],
400
+ "source": [
401
+ "### 3.3 HDBSCAN Clustering\n",
402
+ "\n",
403
+ "# Perform HDBSCAN clustering with clean MLflow logging\n",
404
+ "with mlflow.start_run(run_name=\"hdbscan_clustering\"):\n",
405
+ " # Create and fit the model\n",
406
+ " hdbscan_clusterer = hdbscan.HDBSCAN(\n",
407
+ " min_cluster_size=15,\n",
408
+ " min_samples=5,\n",
409
+ " cluster_selection_epsilon=0.1,\n",
410
+ " cluster_selection_method='eom'\n",
411
+ " )\n",
412
+ " hdbscan_labels = hdbscan_clusterer.fit_predict(X_scaled)\n",
413
+ " \n",
414
+ " # Log only essential parameters (avoid conflicts)\n",
415
+ " mlflow.log_param(\"algorithm\", \"HDBSCAN\")\n",
416
+ " mlflow.log_param(\"min_cluster_size\", 15)\n",
417
+ " \n",
418
+ " # Log model\n",
419
+ " mlflow.sklearn.log_model(hdbscan_clusterer, \"hdbscan_model\")\n",
420
+ "\n",
421
+ "# Add HDBSCAN labels to data\n",
422
+ "df['HDBSCAN_Cluster'] = hdbscan_labels\n",
423
+ "\n",
424
+ "# Count clusters (including noise points labeled as -1)\n",
425
+ "n_clusters_hdbscan = len(set(hdbscan_labels)) - (1 if -1 in hdbscan_labels else 0)\n",
426
+ "n_noise_points = list(hdbscan_labels).count(-1)\n",
427
+ "\n",
428
+ "print(f\"\ud83d\udcca HDBSCAN Results:\")\n",
429
+ "print(f\" - Number of clusters: {n_clusters_hdbscan}\")\n",
430
+ "print(f\" - Noise points: {n_noise_points}\")\n",
431
+ "print(f\" - Noise percentage: {n_noise_points/len(df)*100:.1f}%\")\n",
432
+ "\n",
433
+ "# Calculate silhouette score (excluding noise points)\n",
434
+ "if n_noise_points < len(df):\n",
435
+ " non_noise_mask = hdbscan_labels != -1\n",
436
+ " if len(set(hdbscan_labels[non_noise_mask])) > 1:\n",
437
+ " hdbscan_silhouette = silhouette_score(X_scaled[non_noise_mask], hdbscan_labels[non_noise_mask])\n",
438
+ " print(f\" - Silhouette score: {hdbscan_silhouette:.3f}\")\n",
439
+ " \n",
440
+ " # Log HDBSCAN metrics in a separate run\n",
441
+ " with mlflow.start_run(run_name=\"hdbscan_metrics\"):\n",
442
+ " mlflow.log_metric(\"silhouette_score\", hdbscan_silhouette)\n",
443
+ " mlflow.log_metric(\"noise_percentage\", n_noise_points/len(df)*100)\n",
444
+ " else:\n",
445
+ " print(\" - Cannot calculate silhouette score (only one cluster)\")\n",
446
+ "else:\n",
447
+ " print(\" - Cannot calculate silhouette score (all points are noise)\")\n",
448
+ "\n",
449
+ "print(\"\u2705 HDBSCAN experiment logged to MLflow!\")"
450
+ ]
451
+ },
452
+ {
453
+ "cell_type": "code",
454
+ "execution_count": null,
455
+ "id": "4a2882ec",
456
+ "metadata": {},
457
+ "outputs": [],
458
+ "source": [
459
+ "### 3.4 Clustering Visualization\n",
460
+ "\n",
461
+ "# Create visualization plots\n",
462
+ "fig, axes = plt.subplots(2, 2, figsize=(16, 12))\n",
463
+ "\n",
464
+ "# K-Means on PCA\n",
465
+ "scatter1 = axes[0,0].scatter(pca_df['PC1'], pca_df['PC2'], c=kmeans_labels, cmap='viridis', alpha=0.6, s=50)\n",
466
+ "axes[0,0].set_xlabel('PC1')\n",
467
+ "axes[0,0].set_ylabel('PC2')\n",
468
+ "axes[0,0].set_title('K-Means Clusters (PCA)')\n",
469
+ "plt.colorbar(scatter1, ax=axes[0,0])\n",
470
+ "\n",
471
+ "# K-Means on UMAP\n",
472
+ "scatter2 = axes[0,1].scatter(umap_df['UMAP1'], umap_df['UMAP2'], c=kmeans_labels, cmap='viridis', alpha=0.6, s=50)\n",
473
+ "axes[0,1].set_xlabel('UMAP1')\n",
474
+ "axes[0,1].set_ylabel('UMAP2')\n",
475
+ "axes[0,1].set_title('K-Means Clusters (UMAP)')\n",
476
+ "plt.colorbar(scatter2, ax=axes[0,1])\n",
477
+ "\n",
478
+ "# HDBSCAN on PCA\n",
479
+ "scatter3 = axes[1,0].scatter(pca_df['PC1'], pca_df['PC2'], c=hdbscan_labels, cmap='Set1', alpha=0.6, s=50)\n",
480
+ "axes[1,0].set_xlabel('PC1')\n",
481
+ "axes[1,0].set_ylabel('PC2')\n",
482
+ "axes[1,0].set_title('HDBSCAN Clusters (PCA)')\n",
483
+ "plt.colorbar(scatter3, ax=axes[1,0])\n",
484
+ "\n",
485
+ "# HDBSCAN on UMAP\n",
486
+ "scatter4 = axes[1,1].scatter(umap_df['UMAP1'], umap_df['UMAP2'], c=hdbscan_labels, cmap='Set1', alpha=0.6, s=50)\n",
487
+ "axes[1,1].set_xlabel('UMAP1')\n",
488
+ "axes[1,1].set_ylabel('UMAP2')\n",
489
+ "axes[1,1].set_title('HDBSCAN Clusters (UMAP)')\n",
490
+ "plt.colorbar(scatter4, ax=axes[1,1])\n",
491
+ "\n",
492
+ "plt.tight_layout()\n",
493
+ "plt.show()\n",
494
+ "\n",
495
+ "# Compare clustering results\n",
496
+ "print(\"\ud83d\udcca Clustering Comparison:\")\n",
497
+ "print(f\"K-Means clusters: {optimal_k}\")\n",
498
+ "print(f\"HDBSCAN clusters: {n_clusters_hdbscan}\")\n",
499
+ "print(f\"K-Means silhouette: {max(silhouette_scores):.3f}\")\n",
500
+ "if n_noise_points < len(df) and len(set(hdbscan_labels[hdbscan_labels != -1])) > 1:\n",
501
+ " print(f\"HDBSCAN silhouette: {hdbscan_silhouette:.3f}\")"
502
+ ]
503
+ },
504
+ {
505
+ "cell_type": "code",
506
+ "execution_count": null,
507
+ "id": "d151af67",
508
+ "metadata": {},
509
+ "outputs": [],
510
+ "source": [
511
+ "## 4. Cluster Profiling and Interpretation\n",
512
+ "\n",
513
+ "### 4.1 K-Means Cluster Analysis\n",
514
+ "\n",
515
+ "# Analyze K-Means clusters\n",
516
+ "print(\"\ud83d\udcca K-Means Cluster Analysis\")\n",
517
+ "print(\"=\" * 50)\n",
518
+ "\n",
519
+ "# Key features for profiling (check which ones exist)\n",
520
+ "base_features = ['Age', 'Avg_Daily_Usage_Hours', 'Sleep_Hours_Per_Night', \n",
521
+ " 'Mental_Health_Score', 'Conflicts_Over_Social_Media', 'Addicted_Score']\n",
522
+ "\n",
523
+ "# Add binary features that exist\n",
524
+ "binary_features = []\n",
525
+ "for feature in ['Is_Female', 'Is_Undergraduate', 'Is_Graduate', 'High_Usage', 'Low_Sleep', \n",
526
+ " 'Poor_Mental_Health', 'High_Conflict', 'High_Addiction']:\n",
527
+ " if feature in df.columns:\n",
528
+ " binary_features.append(feature)\n",
529
+ "\n",
530
+ "key_features = base_features + binary_features\n",
531
+ "\n",
532
+ "# Create cluster profiles\n",
533
+ "cluster_profiles = df.groupby('KMeans_Cluster')[key_features].mean()\n",
534
+ "\n",
535
+ "print(\"\\n\ud83d\udcca Cluster Profiles (Mean Values):\")\n",
536
+ "print(cluster_profiles.round(3))\n",
537
+ "\n",
538
+ "# Visualize cluster characteristics\n",
539
+ "fig, axes = plt.subplots(2, 3, figsize=(18, 12))\n",
540
+ "\n",
541
+ "# Usage patterns\n",
542
+ "usage_features = ['Avg_Daily_Usage_Hours', 'Addicted_Score', 'Conflicts_Over_Social_Media']\n",
543
+ "available_usage_features = [f for f in usage_features if f in cluster_profiles.columns]\n",
544
+ "if available_usage_features:\n",
545
+ " cluster_profiles[available_usage_features].plot(\n",
546
+ " kind='bar', ax=axes[0,0], title='Usage & Addiction Patterns')\n",
547
+ " axes[0,0].set_ylabel('Score')\n",
548
+ " axes[0,0].tick_params(axis='x', rotation=45)\n",
549
+ "\n",
550
+ "# Health patterns\n",
551
+ "health_features = ['Mental_Health_Score', 'Sleep_Hours_Per_Night']\n",
552
+ "available_health_features = [f for f in health_features if f in cluster_profiles.columns]\n",
553
+ "if available_health_features:\n",
554
+ " cluster_profiles[available_health_features].plot(\n",
555
+ " kind='bar', ax=axes[0,1], title='Health & Sleep Patterns')\n",
556
+ " axes[0,1].set_ylabel('Score')\n",
557
+ " axes[0,1].tick_params(axis='x', rotation=45)\n",
558
+ "\n",
559
+ "# Demographics\n",
560
+ "demo_features = ['Is_Female', 'Is_Undergraduate', 'Is_Graduate']\n",
561
+ "available_demo_features = [f for f in demo_features if f in cluster_profiles.columns]\n",
562
+ "if available_demo_features:\n",
563
+ " cluster_profiles[available_demo_features].plot(\n",
564
+ " kind='bar', ax=axes[0,2], title='Demographic Patterns')\n",
565
+ " axes[0,2].set_ylabel('Proportion')\n",
566
+ " axes[0,2].tick_params(axis='x', rotation=45)\n",
567
+ "\n",
568
+ "# Binary features\n",
569
+ "binary_plot_features = ['High_Usage', 'Low_Sleep', 'Poor_Mental_Health']\n",
570
+ "available_binary_features = [f for f in binary_plot_features if f in cluster_profiles.columns]\n",
571
+ "if available_binary_features:\n",
572
+ " cluster_profiles[available_binary_features].plot(\n",
573
+ " kind='bar', ax=axes[1,0], title='Risk Factor Patterns')\n",
574
+ " axes[1,0].set_ylabel('Proportion')\n",
575
+ " axes[1,0].tick_params(axis='x', rotation=45)\n",
576
+ "\n",
577
+ "# Age distribution\n",
578
+ "if 'Age' in cluster_profiles.columns:\n",
579
+ " cluster_profiles['Age'].plot(kind='bar', ax=axes[1,1], title='Age Distribution')\n",
580
+ " axes[1,1].set_ylabel('Age')\n",
581
+ " axes[1,1].tick_params(axis='x', rotation=45)\n",
582
+ "\n",
583
+ "# Cluster sizes\n",
584
+ "cluster_sizes = df['KMeans_Cluster'].value_counts().sort_index()\n",
585
+ "cluster_sizes.plot(kind='bar', ax=axes[1,2], title='Cluster Sizes')\n",
586
+ "axes[1,2].set_ylabel('Number of Students')\n",
587
+ "axes[1,2].tick_params(axis='x', rotation=45)\n",
588
+ "\n",
589
+ "plt.tight_layout()\n",
590
+ "plt.show()"
591
+ ]
592
+ },
593
+ {
594
+ "cell_type": "code",
595
+ "execution_count": null,
596
+ "id": "3490cb53",
597
+ "metadata": {},
598
+ "outputs": [],
599
+ "source": [
600
+ "### 4.2 Cluster Labeling\n",
601
+ "\n",
602
+ "# Create intuitive labels for clusters based on their characteristics\n",
603
+ "def label_clusters(cluster_profiles):\n",
604
+ " labels = {}\n",
605
+ " \n",
606
+ " for cluster_id in cluster_profiles.index:\n",
607
+ " profile = cluster_profiles.loc[cluster_id]\n",
608
+ " \n",
609
+ " # Determine usage level\n",
610
+ " if profile['Avg_Daily_Usage_Hours'] > 6:\n",
611
+ " usage_level = \"High-Usage\"\n",
612
+ " elif profile['Avg_Daily_Usage_Hours'] > 4:\n",
613
+ " usage_level = \"Moderate-Usage\"\n",
614
+ " else:\n",
615
+ " usage_level = \"Low-Usage\"\n",
616
+ " \n",
617
+ " # Determine health status\n",
618
+ " if profile['Mental_Health_Score'] < 5 and profile['Sleep_Hours_Per_Night'] < 6:\n",
619
+ " health_status = \"Poor-Health\"\n",
620
+ " elif profile['Mental_Health_Score'] > 7 and profile['Sleep_Hours_Per_Night'] > 7:\n",
621
+ " health_status = \"Good-Health\"\n",
622
+ " else:\n",
623
+ " health_status = \"Average-Health\"\n",
624
+ " \n",
625
+ " # Determine addiction level\n",
626
+ " if profile['Addicted_Score'] > 7:\n",
627
+ " addiction_level = \"High-Addiction\"\n",
628
+ " elif profile['Addicted_Score'] > 5:\n",
629
+ " addiction_level = \"Moderate-Addiction\"\n",
630
+ " else:\n",
631
+ " addiction_level = \"Low-Addiction\"\n",
632
+ " \n",
633
+ " # Create label\n",
634
+ " label = f\"{usage_level}_{health_status}_{addiction_level}\"\n",
635
+ " labels[cluster_id] = label\n",
636
+ " \n",
637
+ " return labels\n",
638
+ "\n",
639
+ "# Generate labels\n",
640
+ "cluster_labels = label_clusters(cluster_profiles)\n",
641
+ "\n",
642
+ "print(\"\ud83d\udcca Cluster Labels:\")\n",
643
+ "for cluster_id, label in cluster_labels.items():\n",
644
+ " size = cluster_sizes[cluster_id]\n",
645
+ " print(f\"Cluster {cluster_id} ({size} students): {label}\")\n",
646
+ "\n",
647
+ "# Add labels to dataframe\n",
648
+ "df['Cluster_Label'] = df['KMeans_Cluster'].map(cluster_labels)\n",
649
+ "\n",
650
+ "# Display sample students from each cluster\n",
651
+ "print(\"\\n\ud83d\udcca Sample Students from Each Cluster:\")\n",
652
+ "for cluster_id in sorted(df['KMeans_Cluster'].unique()):\n",
653
+ " cluster_students = df[df['KMeans_Cluster'] == cluster_id].head(3)\n",
654
+ " print(f\"\\nCluster {cluster_id} - {cluster_labels[cluster_id]}:\")\n",
655
+ " print(cluster_students[['Age', 'Gender', 'Avg_Daily_Usage_Hours', \n",
656
+ " 'Mental_Health_Score', 'Sleep_Hours_Per_Night', 'Addicted_Score']].to_string())"
657
+ ]
658
+ },
659
+ {
660
+ "cell_type": "code",
661
+ "execution_count": null,
662
+ "id": "022f52d9",
663
+ "metadata": {},
664
+ "outputs": [],
665
+ "source": [
666
+ "## 5. MLflow Experiment Tracking\n",
667
+ "\n",
668
+ "### 5.1 Comprehensive MLflow Logging\n",
669
+ "\n",
670
+ "print(\"\u2705 MLflow tracking configured!\")\n",
671
+ "\n",
672
+ "# Log clustering comparison and summary\n",
673
+ "with mlflow.start_run(run_name=\"clustering_summary\"):\n",
674
+ " # Log overall statistics\n",
675
+ " mlflow.log_metric(\"total_students\", len(df))\n",
676
+ " mlflow.log_metric(\"kmeans_clusters\", optimal_k)\n",
677
+ " mlflow.log_metric(\"hdbscan_clusters\", n_clusters_hdbscan)\n",
678
+ " mlflow.log_metric(\"best_silhouette_score\", max(silhouette_scores))\n",
679
+ " \n",
680
+ " # Log cluster labels\n",
681
+ " mlflow.log_dict(cluster_labels, \"cluster_labels.json\")\n",
682
+ " \n",
683
+ " # Log feature scaling info\n",
684
+ " scaling_info = {\n",
685
+ " \"scaler_type\": \"StandardScaler\",\n",
686
+ " \"n_features_scaled\": len(numerical_features)\n",
687
+ " }\n",
688
+ " mlflow.log_dict(scaling_info, \"scaling_info.json\")\n",
689
+ " \n",
690
+ " # Log clustering comparison metrics\n",
691
+ " comparison_metrics = {\n",
692
+ " \"kmeans_silhouette\": max(silhouette_scores),\n",
693
+ " \"kmeans_inertia\": kmeans_optimal.inertia_,\n",
694
+ " \"hdbscan_noise_percentage\": n_noise_points/len(df)*100\n",
695
+ " }\n",
696
+ " \n",
697
+ " # Add HDBSCAN silhouette if available\n",
698
+ " if 'hdbscan_silhouette' in locals():\n",
699
+ " comparison_metrics[\"hdbscan_silhouette\"] = hdbscan_silhouette\n",
700
+ " \n",
701
+ " for metric_name, metric_value in comparison_metrics.items():\n",
702
+ " mlflow.log_metric(metric_name, metric_value)\n",
703
+ " \n",
704
+ " print(\"\u2705 Clustering summary logged to MLflow!\")\n",
705
+ "\n",
706
+ "print(\"\\n\ud83d\udcca All experiments logged to MLflow successfully!\")\n",
707
+ "print(\"\ud83d\udcca You can view results using: mlflow ui --port 5001\")\n",
708
+ "print(\"\ud83d\udcca Experiments logged:\")\n",
709
+ "print(\" - kmeans_optimal\")\n",
710
+ "print(\" - hdbscan_clustering\")\n",
711
+ "print(\" - hdbscan_metrics\")\n",
712
+ "print(\" - clustering_summary\")"
713
+ ]
714
+ },
715
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "daef8797",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "## 6. Final Analysis and Insights\n",
+    "\n",
+    "### 6.1 Key Findings\n",
+    "\n",
+    "print(\"\ud83d\udcca CLUSTERING ANALYSIS INSIGHTS\")\n",
+    "print(\"=\" * 50)\n",
+    "\n",
+    "# Summary statistics by cluster\n",
+    "cluster_summary = df.groupby('Cluster_Label').agg({\n",
+    "    'Avg_Daily_Usage_Hours': ['mean', 'std'],\n",
+    "    'Mental_Health_Score': ['mean', 'std'],\n",
+    "    'Sleep_Hours_Per_Night': ['mean', 'std'],\n",
+    "    'Addicted_Score': ['mean', 'std'],\n",
+    "    'Age': ['mean', 'std']\n",
+    "}).round(2)\n",
+    "\n",
+    "print(\"\\n\ud83d\udcca Cluster Summary Statistics:\")\n",
+    "print(cluster_summary)\n",
+    "\n",
+    "# Risk assessment by cluster\n",
+    "risk_factors = ['High_Usage', 'Low_Sleep', 'Poor_Mental_Health', 'High_Conflict', 'High_Addiction']\n",
+    "risk_by_cluster = df.groupby('Cluster_Label')[risk_factors].mean()\n",
+    "\n",
+    "print(\"\\n\ud83d\udcca Risk Factors by Cluster:\")\n",
+    "print(risk_by_cluster.round(3))\n",
+    "\n",
+    "# Platform usage by cluster\n",
+    "platform_cols = [col for col in df.columns if col.startswith('Uses_')]\n",
+    "platform_by_cluster = df.groupby('Cluster_Label')[platform_cols].mean()\n",
+    "\n",
+    "print(\"\\n\ud83d\udcca Platform Usage by Cluster:\")\n",
+    "print(platform_by_cluster.round(3))\n",
+    "\n",
755
+    "### 6.2 Intervention Recommendations\n",
+    "\n",
+    "print(\"\\n\ud83d\udcca INTERVENTION RECOMMENDATIONS\")\n",
+    "print(\"=\" * 50)\n",
+    "\n",
+    "for cluster_label in df['Cluster_Label'].unique():\n",
+    "    cluster_data = df[df['Cluster_Label'] == cluster_label]\n",
+    "    size = len(cluster_data)\n",
+    "    percentage = size / len(df) * 100\n",
+    "    \n",
+    "    print(f\"\\n\ud83c\udfaf Cluster: {cluster_label}\")\n",
+    "    print(f\"   Size: {size} students ({percentage:.1f}%)\")\n",
+    "    \n",
+    "    # Identify key characteristics\n",
+    "    avg_usage = cluster_data['Avg_Daily_Usage_Hours'].mean()\n",
+    "    avg_mental_health = cluster_data['Mental_Health_Score'].mean()\n",
+    "    avg_sleep = cluster_data['Sleep_Hours_Per_Night'].mean()\n",
+    "    avg_addiction = cluster_data['Addicted_Score'].mean()\n",
+    "    \n",
+    "    print(f\"   Average Usage: {avg_usage:.1f} hours/day\")\n",
+    "    print(f\"   Mental Health Score: {avg_mental_health:.1f}/10\")\n",
+    "    print(f\"   Sleep Hours: {avg_sleep:.1f} hours/night\")\n",
+    "    print(f\"   Addiction Score: {avg_addiction:.1f}/10\")\n",
+    "    \n",
+    "    # Generate recommendations\n",
+    "    if avg_usage > 6 and avg_addiction > 7:\n",
+    "        print(\"   \u26a0\ufe0f HIGH RISK: Intensive intervention needed\")\n",
+    "        print(\"   \ud83d\udca1 Recommendations: Digital detox programs, counseling, parental monitoring\")\n",
+    "    elif avg_usage > 4 and avg_mental_health < 6:\n",
+    "        print(\"   \u26a0\ufe0f MODERATE RISK: Targeted intervention recommended\")\n",
+    "        print(\"   \ud83d\udca1 Recommendations: Screen time limits, mental health support, sleep hygiene\")\n",
+    "    else:\n",
+    "        print(\"   \u2705 LOW RISK: Monitor and provide resources\")\n",
+    "        print(\"   \ud83d\udca1 Recommendations: Educational materials, healthy usage guidelines\")\n",
+    "\n",
+    "print(\"\\n\u2705 Clustering analysis completed successfully!\")\n",
+    "print(\"\ud83d\udcca Check MLflow UI for detailed experiment tracking\")\n",
+    "print(\"\ud83d\udcca Use cluster labels for targeted interventions\")\n"
+   ]
+  },
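+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c0ffee02",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Sketch: the same risk thresholds as above, factored into a reusable helper.\n",
+    "# The function name and tier labels are illustrative, not part of the original notebook.\n",
+    "def assign_risk(avg_usage, avg_mental_health, avg_addiction):\n",
+    "    \"\"\"Map cluster-level averages to the risk tiers used above.\"\"\"\n",
+    "    if avg_usage > 6 and avg_addiction > 7:\n",
+    "        return 'HIGH'\n",
+    "    elif avg_usage > 4 and avg_mental_health < 6:\n",
+    "        return 'MODERATE'\n",
+    "    return 'LOW'\n",
+    "\n",
+    "risk_tiers = df.groupby('Cluster_Label').apply(\n",
+    "    lambda g: assign_risk(g['Avg_Daily_Usage_Hours'].mean(),\n",
+    "                          g['Mental_Health_Score'].mean(),\n",
+    "                          g['Addicted_Score'].mean())\n",
+    ")\n",
+    "print(risk_tiers)\n"
+   ]
+  },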
795
+  {
+   "cell_type": "markdown",
+   "id": "2c6194f8",
+   "metadata": {},
+   "source": [
+    "## 7. Next Steps & Best Practices\n",
+    "\n",
+    "This final section summarizes the clustering analysis workflow and lists concrete next steps for production deployment.\n",
+    "\n",
+    "### What We've Accomplished\n",
+    "- **Data Preparation**: Feature engineering, scaling, and dimensionality reduction\n",
+    "- **Clustering Analysis**: KMeans and HDBSCAN with optimal parameter selection\n",
+    "- **Evaluation**: Silhouette scores, visual validation, and cluster profiling\n",
+    "- **MLflow Integration**: Complete experiment tracking and model versioning\n",
+    "- **Interpretability**: Cluster labeling and actionable insights for intervention strategies\n",
+    "\n",
+    "### Key Insights\n",
+    "- Identified distinct user segments based on social media behavior patterns\n",
+    "- Created risk assessment profiles for targeted interventions\n",
+    "- Built a reproducible clustering pipeline with MLflow tracking\n",
+    "- Generated actionable recommendations for each cluster\n",
+    "\n",
+    "### Production Readiness\n",
+    "With monitoring, retraining pipelines, and API integration in place, the analysis can move to production deployment."
+   ]
+  },
821
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "17fc0802",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "\n",
+    "# Best Practices Summary\n",
+    "print(\"\ud83c\udfaf MLflow Clustering Best Practices Implemented:\")\n",
+    "print(\"1. \u2705 Comprehensive data preparation and feature engineering\")\n",
+    "print(\"2. \u2705 Dimensionality reduction (PCA, UMAP) for visualization\")\n",
+    "print(\"3. \u2705 Multiple clustering algorithms (KMeans, HDBSCAN)\")\n",
+    "print(\"4. \u2705 Silhouette score and visual validation\")\n",
+    "print(\"5. \u2705 Cluster profiling and interpretability\")\n",
+    "print(\"6. \u2705 MLflow experiment tracking and model versioning\")\n",
+    "\n",
+    "print(\"\\n\ud83d\udcca Cluster Analysis Insights:\")\n",
+    "print(\"\u2022 Silhouette Score: Measures cluster separation (higher is better)\")\n",
+    "print(\"\u2022 Visualizations: PCA/UMAP plots for cluster structure\")\n",
+    "print(\"\u2022 Cluster profiles: Mean values and risk factors by group\")\n",
+    "print(\"\u2022 Labeling: Intuitive cluster names for actionable insights\")\n",
843
+    "\n",
+    "print(\"\\n\ud83d\udccb Next Steps:\")\n",
+    "print(\"1. Launch MLflow UI: mlflow ui --port 5001\")\n",
+    "print(\"2. Access experiments at: http://localhost:5001\")\n",
+    "print(\"3. Compare clustering runs and diagnostic plots\")\n",
+    "print(\"4. Deploy best clustering model to production API\")\n",
+    "print(\"5. Set up automated retraining pipeline\")\n",
+    "print(\"6. Monitor cluster assignments in production\")\n",
+    "print(\"7. Consider ensemble or consensus clustering for robustness\")\n",
+    "\n",
+    "print(\"\\n\ud83d\udd27 To launch MLflow UI:\")\n",
+    "print(\"!mlflow ui --port 5001 --host 0.0.0.0\")\n",
+    "\n",
+    "print(\"\\n\ud83d\udcc8 Additional Recommendations:\")\n",
+    "print(\"\u2022 Explore ensemble clustering (consensus, voting)\")\n",
+    "print(\"\u2022 Implement feature selection for clustering\")\n",
+    "print(\"\u2022 Add cluster explainability tools (e.g., SHAP for cluster assignment)\")\n",
+    "print(\"\u2022 Set up automated monitoring for cluster drift\")\n",
+    "print(\"\u2022 Build dashboards for real-time cluster insights\")\n"
+   ]
+  }
864
+ ],
+ "metadata": {},
+ "nbformat": 4,
+ "nbformat_minor": 5
+ }