ahczhg committed on
Commit
9aa33c0
·
verified ·
1 Parent(s): ac22b9d

Upload folder using huggingface_hub

Files changed (3)
  1. KNN_SHAP_Explainability.ipynb +1234 -0
  2. LICENSE +21 -0
  3. README.md +274 -0
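Before the notebook diff itself: the classification rule the notebook's KNN model applies (majority vote among the k nearest training points) can be sketched in a few lines of pure Python. This is an illustrative aside, not part of the committed files; `knn_predict` is a hypothetical helper, and sklearn's `KNeighborsClassifier` used in the notebook adds scaling-aware distance search and probability estimates on top of this rule.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training
    points under Euclidean distance (the KNeighborsClassifier default)."""
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Tiny 2-D example: two well-separated clusters
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
           (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_X, train_y, (0.3, 0.3), k=3))  # → 0
print(knn_predict(train_X, train_y, (4.8, 5.1), k=3))  # → 1
```

Because the vote is distance-based, unscaled features with large ranges dominate the neighbor search, which is why the notebook applies `StandardScaler` before fitting.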
KNN_SHAP_Explainability.ipynb ADDED
@@ -0,0 +1,1234 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": "# KNN Explainability with SHAP on Cloud GPU\n\nThis notebook demonstrates how to use SHAP (SHapley Additive exPlanations) to interpret K-Nearest Neighbors (KNN) model predictions with comprehensive visualizations.\n\n**Environment:** Cloud GPU Instance (Running in VS Code)\n\n## Prerequisites\n- Cloud GPU instance with GPU (RTX 3090, RTX 4090, or A100 recommended)\n- VS Code with Jupyter extension installed\n- SSH connection to cloud instance"
7
+ },
8
+ {
9
+ "cell_type": "markdown",
10
+ "metadata": {},
11
+ "source": [
12
+ "## 1. Environment Setup and GPU Verification"
13
+ ]
14
+ },
15
+ {
16
+ "cell_type": "code",
17
+ "execution_count": null,
18
+ "metadata": {},
19
+ "outputs": [],
20
+ "source": [
21
+ "# Check GPU availability and specifications\n",
22
+ "import subprocess\n",
23
+ "import sys\n",
24
+ "\n",
25
+ "print(\"=\" * 80)\n",
26
+ "print(\"CLOUD GPU INFORMATION\")\n",
27
+ "print(\"=\" * 80)\n",
28
+ "\n",
29
+ "try:\n",
30
+ " result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)\n",
31
+ " print(result.stdout)\n",
32
+ "except FileNotFoundError:\n",
33
+ " print(\"nvidia-smi not found. GPU may not be available.\")\n",
34
+ "\n",
35
+ "print(\"\\n\" + \"=\" * 80)\n",
36
+ "print(\"PYTHON ENVIRONMENT\")\n",
37
+ "print(\"=\" * 80)\n",
38
+ "print(f\"Python version: {sys.version}\")\n",
39
+ "print(f\"Python executable: {sys.executable}\")"
40
+ ]
41
+ },
42
+ {
43
+ "cell_type": "code",
44
+ "execution_count": null,
45
+ "metadata": {},
46
+ "outputs": [],
47
+ "source": [
48
+ "# Check PyTorch CUDA availability\n",
49
+ "try:\n",
50
+ " import torch\n",
51
+ " print(\"\\n\" + \"=\" * 80)\n",
52
+ " print(\"PYTORCH & CUDA INFORMATION\")\n",
53
+ " print(\"=\" * 80)\n",
54
+ " print(f\"PyTorch version: {torch.__version__}\")\n",
55
+ " print(f\"CUDA available: {torch.cuda.is_available()}\")\n",
56
+ " \n",
57
+ " if torch.cuda.is_available():\n",
58
+ " print(f\"CUDA version: {torch.version.cuda}\")\n",
59
+ " print(f\"Number of GPUs: {torch.cuda.device_count()}\")\n",
60
+ " for i in range(torch.cuda.device_count()):\n",
61
+ " print(f\"\\nGPU {i}: {torch.cuda.get_device_name(i)}\")\n",
62
+ " print(f\" Memory Total: {torch.cuda.get_device_properties(i).total_memory / 1e9:.2f} GB\")\n",
63
+ " print(f\" Memory Allocated: {torch.cuda.memory_allocated(i) / 1e9:.4f} GB\")\n",
64
+ " print(f\" Memory Cached: {torch.cuda.memory_reserved(i) / 1e9:.4f} GB\")\n",
65
+ " else:\n",
66
+ " print(\"\\nWARNING: CUDA not available. Running on CPU.\")\n",
67
+ " print(\"This notebook is optimized for GPU but will work on CPU (slower).\")\n",
68
+ "except ImportError:\n",
69
+ " print(\"\\nPyTorch not installed yet. Will install in next cell.\")"
70
+ ]
71
+ },
72
+ {
73
+ "cell_type": "code",
74
+ "execution_count": null,
75
+ "metadata": {},
76
+ "outputs": [],
77
+ "source": [
78
+ "# Install required packages\n",
79
+ "print(\"Installing required packages...\\n\")\n",
80
+ "\n",
81
+ "packages = [\n",
82
+ " 'torch',\n",
83
+ " 'shap',\n",
84
+ " 'scikit-learn',\n",
85
+ " 'matplotlib',\n",
86
+ " 'seaborn',\n",
87
+ " 'pandas',\n",
88
+ " 'numpy',\n",
89
+ " 'plotly',\n",
90
+ " 'ipywidgets',\n",
91
+ " 'tqdm'\n",
92
+ "]\n",
93
+ "\n",
94
+ "for package in packages:\n",
95
+ " print(f\"Installing {package}...\")\n",
96
+ " subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', package])\n",
97
+ "\n",
98
+ "print(\"\\n\" + \"=\"*80)\n",
99
+ "print(\"All packages installed successfully!\")\n",
100
+ "print(\"=\"*80)"
101
+ ]
102
+ },
103
+ {
104
+ "cell_type": "markdown",
105
+ "metadata": {},
106
+ "source": [
107
+ "## 2. Import Libraries"
108
+ ]
109
+ },
110
+ {
111
+ "cell_type": "code",
112
+ "execution_count": null,
113
+ "metadata": {},
114
+ "outputs": [],
115
+ "source": "# Import all necessary libraries\nimport numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nimport shap\nimport torch\nimport sklearn\nfrom sklearn.neighbors import KNeighborsClassifier\nfrom sklearn.model_selection import train_test_split, cross_val_score\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.datasets import load_breast_cancer, load_wine, load_iris, make_classification\nfrom sklearn.metrics import (\n accuracy_score, \n classification_report, \n confusion_matrix,\n roc_curve,\n roc_auc_score,\n precision_recall_curve\n)\nfrom tqdm.auto import tqdm\nimport warnings\nwarnings.filterwarnings('ignore')\n\n# Set random seeds for reproducibility\nnp.random.seed(42)\ntorch.manual_seed(42)\nif torch.cuda.is_available():\n torch.cuda.manual_seed_all(42)\n\n# Configure plotting style\nsns.set_style('whitegrid')\nplt.rcParams['figure.figsize'] = (14, 8)\nplt.rcParams['font.size'] = 10\nplt.rcParams['axes.titlesize'] = 14\nplt.rcParams['axes.labelsize'] = 12\n\n# Initialize SHAP's JavaScript visualization\nshap.initjs()\n\nprint(\"=\" * 80)\nprint(\"LIBRARY VERSIONS\")\nprint(\"=\" * 80)\nprint(f\"NumPy version: {np.__version__}\")\nprint(f\"Pandas version: {pd.__version__}\")\nprint(f\"Matplotlib version: {plt.matplotlib.__version__}\")\nprint(f\"Seaborn version: {sns.__version__}\")\nprint(f\"SHAP version: {shap.__version__}\")\nprint(f\"PyTorch version: {torch.__version__}\")\nprint(f\"Scikit-learn version: {sklearn.__version__}\")\nprint(\"\\nLibraries imported successfully!\")"
116
+ },
117
+ {
118
+ "cell_type": "markdown",
119
+ "metadata": {},
120
+ "source": [
121
+ "## 3. GPU Memory Management Functions"
122
+ ]
123
+ },
124
+ {
125
+ "cell_type": "code",
126
+ "execution_count": null,
127
+ "metadata": {},
128
+ "outputs": [],
129
+ "source": [
130
+ "# Utility functions for GPU memory management on the cloud instance\n",
131
+ "\n",
132
+ "def print_gpu_memory():\n",
133
+ " \"\"\"Print current GPU memory usage\"\"\"\n",
134
+ " if torch.cuda.is_available():\n",
135
+ " for i in range(torch.cuda.device_count()):\n",
136
+ " allocated = torch.cuda.memory_allocated(i) / 1e9\n",
137
+ " cached = torch.cuda.memory_reserved(i) / 1e9\n",
138
+ " total = torch.cuda.get_device_properties(i).total_memory / 1e9\n",
139
+ " print(f\"GPU {i} ({torch.cuda.get_device_name(i)}):\")\n",
140
+ " print(f\" Allocated: {allocated:.3f} GB / {total:.2f} GB ({allocated/total*100:.1f}%)\")\n",
141
+ " print(f\" Cached: {cached:.3f} GB\")\n",
142
+ " else:\n",
143
+ " print(\"No GPU available\")\n",
144
+ "\n",
145
+ "def clear_gpu_memory():\n",
146
+ " \"\"\"Clear GPU cache\"\"\"\n",
147
+ " if torch.cuda.is_available():\n",
148
+ " torch.cuda.empty_cache()\n",
149
+ " print(\"GPU cache cleared\")\n",
150
+ "\n",
151
+ "def get_optimal_device():\n",
152
+ " \"\"\"Get optimal device for computation\"\"\"\n",
153
+ " if torch.cuda.is_available():\n",
154
+ " device = torch.device('cuda')\n",
155
+ " print(f\"Using GPU: {torch.cuda.get_device_name(0)}\")\n",
156
+ " else:\n",
157
+ " device = torch.device('cpu')\n",
158
+ " print(\"Using CPU (GPU not available)\")\n",
159
+ " return device\n",
160
+ "\n",
161
+ "# Initialize device\n",
162
+ "device = get_optimal_device()\n",
163
+ "print(\"\\nInitial GPU Memory:\")\n",
164
+ "print_gpu_memory()"
165
+ ]
166
+ },
167
+ {
168
+ "cell_type": "markdown",
169
+ "metadata": {},
170
+ "source": [
171
+ "## 4. Data Loading and Exploration"
172
+ ]
173
+ },
174
+ {
175
+ "cell_type": "code",
176
+ "execution_count": null,
177
+ "metadata": {},
178
+ "outputs": [],
179
+ "source": [
180
+ "# Load the Breast Cancer Wisconsin dataset\n",
181
+ "print(\"=\" * 80)\n",
182
+ "print(\"LOADING DATASET\")\n",
183
+ "print(\"=\" * 80)\n",
184
+ "\n",
185
+ "data = load_breast_cancer()\n",
186
+ "X = pd.DataFrame(data.data, columns=data.feature_names)\n",
187
+ "y = pd.Series(data.target, name='target')\n",
188
+ "\n",
189
+ "print(f\"\\nDataset: Breast Cancer Wisconsin (Diagnostic)\")\n",
190
+ "print(f\"Number of samples: {X.shape[0]}\")\n",
191
+ "print(f\"Number of features: {X.shape[1]}\")\n",
192
+ "print(f\"Number of classes: {len(data.target_names)}\")\n",
193
+ "print(f\"Class names: {list(data.target_names)}\")\n",
194
+ "print(f\"\\nTarget distribution:\")\n",
195
+ "for idx, name in enumerate(data.target_names):\n",
196
+ " count = (y == idx).sum()\n",
197
+ " percentage = count / len(y) * 100\n",
198
+ " print(f\" {name}: {count} ({percentage:.2f}%)\")\n",
199
+ "\n",
200
+ "print(f\"\\nFeature statistics:\")\n",
201
+ "print(X.describe().round(2))"
202
+ ]
203
+ },
204
+ {
205
+ "cell_type": "code",
206
+ "execution_count": null,
207
+ "metadata": {},
208
+ "outputs": [],
209
+ "source": [
210
+ "# Display first few rows\n",
211
+ "print(\"\\nFirst 5 rows of the dataset:\")\n",
212
+ "display_df = X.head().copy()  # copy to avoid mutating X / SettingWithCopyWarning\n",
213
+ "display_df['target'] = y.head().map({0: data.target_names[0], 1: data.target_names[1]})\n",
214
+ "display_df"
215
+ ]
216
+ },
217
+ {
218
+ "cell_type": "markdown",
219
+ "metadata": {},
220
+ "source": [
221
+ "## 5. Exploratory Data Analysis"
222
+ ]
223
+ },
224
+ {
225
+ "cell_type": "code",
226
+ "execution_count": null,
227
+ "metadata": {},
228
+ "outputs": [],
229
+ "source": [
230
+ "# Target distribution visualization\n",
231
+ "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
232
+ "\n",
233
+ "# Bar plot\n",
234
+ "target_counts = y.value_counts().sort_index()\n",
235
+ "colors = ['#FF6B6B', '#4ECDC4']\n",
236
+ "axes[0].bar(data.target_names, target_counts.values, color=colors, edgecolor='black', linewidth=1.5)\n",
237
+ "axes[0].set_ylabel('Count', fontweight='bold')\n",
238
+ "axes[0].set_title('Target Class Distribution', fontweight='bold', fontsize=14)\n",
239
+ "axes[0].grid(axis='y', alpha=0.3)\n",
240
+ "\n",
241
+ "# Add count labels on bars\n",
242
+ "for i, (name, count) in enumerate(zip(data.target_names, target_counts.values)):\n",
243
+ " axes[0].text(i, count + 5, str(count), ha='center', fontweight='bold')\n",
244
+ "\n",
245
+ "# Pie chart\n",
246
+ "axes[1].pie(target_counts.values, labels=data.target_names, autopct='%1.1f%%',\n",
247
+ " colors=colors, startangle=90, textprops={'fontsize': 12, 'fontweight': 'bold'})\n",
248
+ "axes[1].set_title('Target Class Proportion', fontweight='bold', fontsize=14)\n",
249
+ "\n",
250
+ "plt.tight_layout()\n",
251
+ "plt.show()"
252
+ ]
253
+ },
254
+ {
255
+ "cell_type": "code",
256
+ "execution_count": null,
257
+ "metadata": {},
258
+ "outputs": [],
259
+ "source": [
260
+ "# Feature distributions for key features\n",
261
+ "print(\"Feature Distributions (Top 12 Features):\")\n",
262
+ "fig, axes = plt.subplots(3, 4, figsize=(16, 12))\n",
263
+ "axes = axes.ravel()\n",
264
+ "\n",
265
+ "for idx, col in enumerate(X.columns[:12]):\n",
266
+ " # Histogram for each class\n",
267
+ " for target_idx, target_name in enumerate(data.target_names):\n",
268
+ " mask = y == target_idx\n",
269
+ " axes[idx].hist(X.loc[mask, col], bins=25, alpha=0.6, \n",
270
+ " label=target_name, color=colors[target_idx], edgecolor='black')\n",
271
+ " \n",
272
+ " axes[idx].set_title(f'{col}', fontsize=10, fontweight='bold')\n",
273
+ " axes[idx].set_xlabel('Value', fontsize=9)\n",
274
+ " axes[idx].set_ylabel('Frequency', fontsize=9)\n",
275
+ " axes[idx].legend(fontsize=8)\n",
276
+ " axes[idx].grid(alpha=0.3)\n",
277
+ "\n",
278
+ "plt.tight_layout()\n",
279
+ "plt.suptitle('Feature Distributions by Class', y=1.002, fontsize=16, fontweight='bold')\n",
280
+ "plt.show()"
281
+ ]
282
+ },
283
+ {
284
+ "cell_type": "code",
285
+ "execution_count": null,
286
+ "metadata": {},
287
+ "outputs": [],
288
+ "source": [
289
+ "# Correlation heatmap for top features\n",
290
+ "print(\"\\nFeature Correlation Analysis (Top 15 Features):\")\n",
291
+ "top_features = X.columns[:15]\n",
292
+ "correlation_matrix = X[top_features].corr()\n",
293
+ "\n",
294
+ "plt.figure(figsize=(14, 12))\n",
295
+ "mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))\n",
296
+ "sns.heatmap(correlation_matrix, mask=mask, annot=True, fmt='.2f', \n",
297
+ " cmap='coolwarm', center=0, square=True, linewidths=0.5,\n",
298
+ " cbar_kws={\"shrink\": 0.8, \"label\": \"Correlation Coefficient\"})\n",
299
+ "plt.title('Feature Correlation Matrix (Top 15 Features)', fontsize=14, fontweight='bold', pad=20)\n",
300
+ "plt.tight_layout()\n",
301
+ "plt.show()\n",
302
+ "\n",
303
+ "# Find highly correlated pairs\n",
304
+ "high_corr_pairs = []\n",
305
+ "for i in range(len(correlation_matrix.columns)):\n",
306
+ " for j in range(i+1, len(correlation_matrix.columns)):\n",
307
+ " if abs(correlation_matrix.iloc[i, j]) > 0.8:\n",
308
+ " high_corr_pairs.append((\n",
309
+ " correlation_matrix.columns[i],\n",
310
+ " correlation_matrix.columns[j],\n",
311
+ " correlation_matrix.iloc[i, j]\n",
312
+ " ))\n",
313
+ "\n",
314
+ "if high_corr_pairs:\n",
315
+ " print(\"\\nHighly correlated feature pairs (|r| > 0.8):\")\n",
316
+ " for feat1, feat2, corr in high_corr_pairs[:5]:\n",
317
+ " print(f\" {feat1} <-> {feat2}: {corr:.3f}\")"
318
+ ]
319
+ },
320
+ {
321
+ "cell_type": "markdown",
322
+ "metadata": {},
323
+ "source": [
324
+ "## 6. Data Preprocessing"
325
+ ]
326
+ },
327
+ {
328
+ "cell_type": "code",
329
+ "execution_count": null,
330
+ "metadata": {},
331
+ "outputs": [],
332
+ "source": [
333
+ "# Split the data\n",
334
+ "print(\"=\" * 80)\n",
335
+ "print(\"DATA PREPROCESSING\")\n",
336
+ "print(\"=\" * 80)\n",
337
+ "\n",
338
+ "X_train, X_test, y_train, y_test = train_test_split(\n",
339
+ " X, y, test_size=0.2, random_state=42, stratify=y\n",
340
+ ")\n",
341
+ "\n",
342
+ "print(f\"\\nTraining set size: {X_train.shape[0]} samples\")\n",
343
+ "print(f\"Test set size: {X_test.shape[0]} samples\")\n",
344
+ "print(f\"\\nTraining set class distribution:\")\n",
345
+ "for idx, name in enumerate(data.target_names):\n",
346
+ " count = (y_train == idx).sum()\n",
347
+ " percentage = count / len(y_train) * 100\n",
348
+ " print(f\" {name}: {count} ({percentage:.2f}%)\")\n",
349
+ "\n",
350
+ "# Feature scaling (critical for KNN)\n",
351
+ "print(\"\\nApplying StandardScaler...\")\n",
352
+ "scaler = StandardScaler()\n",
353
+ "X_train_scaled = scaler.fit_transform(X_train)\n",
354
+ "X_test_scaled = scaler.transform(X_test)\n",
355
+ "\n",
356
+ "# Convert back to DataFrame for better handling\n",
357
+ "X_train_scaled = pd.DataFrame(X_train_scaled, columns=X.columns, index=X_train.index)\n",
358
+ "X_test_scaled = pd.DataFrame(X_test_scaled, columns=X.columns, index=X_test.index)\n",
359
+ "\n",
360
+ "print(f\"\\nScaled features - Mean: {X_train_scaled.mean().mean():.6f}\")\n",
361
+ "print(f\"Scaled features - Std: {X_train_scaled.std().mean():.6f}\")\n",
362
+ "print(\"\\nData preprocessing completed!\")"
363
+ ]
364
+ },
365
+ {
366
+ "cell_type": "code",
367
+ "execution_count": null,
368
+ "metadata": {},
369
+ "outputs": [],
370
+ "source": [
371
+ "# Visualize scaling effect\n",
372
+ "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
373
+ "\n",
374
+ "# Before scaling\n",
375
+ "sample_features = X.columns[:5]\n",
376
+ "X_train[sample_features].boxplot(ax=axes[0])\n",
377
+ "axes[0].set_title('Feature Scales - Before Scaling', fontweight='bold', fontsize=12)\n",
378
+ "axes[0].set_ylabel('Value', fontweight='bold')\n",
379
+ "axes[0].tick_params(axis='x', rotation=45)\n",
380
+ "axes[0].grid(alpha=0.3)\n",
381
+ "\n",
382
+ "# After scaling\n",
383
+ "X_train_scaled[sample_features].boxplot(ax=axes[1])\n",
384
+ "axes[1].set_title('Feature Scales - After Scaling', fontweight='bold', fontsize=12)\n",
385
+ "axes[1].set_ylabel('Standardized Value', fontweight='bold')\n",
386
+ "axes[1].tick_params(axis='x', rotation=45)\n",
387
+ "axes[1].grid(alpha=0.3)\n",
388
+ "\n",
389
+ "plt.tight_layout()\n",
390
+ "plt.show()"
391
+ ]
392
+ },
393
+ {
394
+ "cell_type": "markdown",
395
+ "metadata": {},
396
+ "source": [
397
+ "## 7. KNN Model Training and Optimization"
398
+ ]
399
+ },
400
+ {
401
+ "cell_type": "code",
402
+ "execution_count": null,
403
+ "metadata": {},
404
+ "outputs": [],
405
+ "source": [
406
+ "# Find optimal K value using cross-validation\n",
407
+ "print(\"=\" * 80)\n",
408
+ "print(\"KNN MODEL TRAINING\")\n",
409
+ "print(\"=\" * 80)\n",
410
+ "print(\"\\nFinding optimal K value...\\n\")\n",
411
+ "\n",
412
+ "k_range = range(1, 31)\n",
413
+ "train_scores = []\n",
414
+ "test_scores = []\n",
415
+ "cv_scores = []\n",
416
+ "\n",
417
+ "# Use tqdm for progress bar\n",
418
+ "for k in tqdm(k_range, desc=\"Testing K values\"):\n",
419
+ " knn = KNeighborsClassifier(n_neighbors=k, n_jobs=-1)\n",
420
+ " \n",
421
+ " # Training score\n",
422
+ " knn.fit(X_train_scaled, y_train)\n",
423
+ " train_scores.append(knn.score(X_train_scaled, y_train))\n",
424
+ " \n",
425
+ " # Test score\n",
426
+ " test_scores.append(knn.score(X_test_scaled, y_test))\n",
427
+ " \n",
428
+ " # Cross-validation score\n",
429
+ " cv_score = cross_val_score(knn, X_train_scaled, y_train, cv=5, n_jobs=-1)\n",
430
+ " cv_scores.append(cv_score.mean())\n",
431
+ "\n",
432
+ "# Find best K\n",
433
+ "best_k_test = k_range[np.argmax(test_scores)]\n",
434
+ "best_k_cv = k_range[np.argmax(cv_scores)]\n",
435
+ "\n",
436
+ "print(f\"\\n✓ Optimal K (based on test accuracy): {best_k_test}\")\n",
437
+ "print(f\"✓ Optimal K (based on CV score): {best_k_cv}\")\n",
438
+ "print(f\"\\nUsing K = {best_k_cv} for final model\")"
439
+ ]
440
+ },
441
+ {
442
+ "cell_type": "code",
443
+ "execution_count": null,
444
+ "metadata": {},
445
+ "outputs": [],
446
+ "source": [
447
+ "# Plot K vs Accuracy\n",
448
+ "fig, axes = plt.subplots(1, 2, figsize=(16, 6))\n",
449
+ "\n",
450
+ "# Left plot: All scores\n",
451
+ "axes[0].plot(k_range, train_scores, label='Training Accuracy', \n",
452
+ " marker='o', linewidth=2, markersize=4, color='#2ecc71')\n",
453
+ "axes[0].plot(k_range, test_scores, label='Test Accuracy', \n",
454
+ " marker='s', linewidth=2, markersize=4, color='#e74c3c')\n",
455
+ "axes[0].plot(k_range, cv_scores, label='CV Accuracy (5-fold)', \n",
456
+ " marker='^', linewidth=2, markersize=4, color='#3498db')\n",
457
+ "axes[0].axvline(x=best_k_cv, color='black', linestyle='--', alpha=0.5, label=f'Best K={best_k_cv}')\n",
458
+ "axes[0].set_xlabel('K Value (Number of Neighbors)', fontweight='bold')\n",
459
+ "axes[0].set_ylabel('Accuracy', fontweight='bold')\n",
460
+ "axes[0].set_title('KNN: K Value vs Accuracy', fontweight='bold', fontsize=14)\n",
461
+ "axes[0].legend(loc='best', fontsize=10)\n",
462
+ "axes[0].grid(True, alpha=0.3)\n",
463
+ "\n",
464
+ "# Right plot: Train-test gap\n",
465
+ "gap = np.array(train_scores) - np.array(test_scores)\n",
466
+ "axes[1].plot(k_range, gap, marker='o', linewidth=2, markersize=4, color='#9b59b6')\n",
467
+ "axes[1].axvline(x=best_k_cv, color='black', linestyle='--', alpha=0.5, label=f'Best K={best_k_cv}')\n",
468
+ "axes[1].axhline(y=0, color='red', linestyle='-', alpha=0.3)\n",
469
+ "axes[1].set_xlabel('K Value (Number of Neighbors)', fontweight='bold')\n",
470
+ "axes[1].set_ylabel('Train-Test Accuracy Gap', fontweight='bold')\n",
471
+ "axes[1].set_title('Overfitting Analysis', fontweight='bold', fontsize=14)\n",
472
+ "axes[1].legend(loc='best', fontsize=10)\n",
473
+ "axes[1].grid(True, alpha=0.3)\n",
474
+ "\n",
475
+ "plt.tight_layout()\n",
476
+ "plt.show()\n",
477
+ "\n",
478
+ "print(f\"\\nBest test accuracy: {max(test_scores):.4f} at K={best_k_test}\")\n",
479
+ "print(f\"Best CV accuracy: {max(cv_scores):.4f} at K={best_k_cv}\")"
480
+ ]
481
+ },
482
+ {
483
+ "cell_type": "code",
484
+ "execution_count": null,
485
+ "metadata": {},
486
+ "outputs": [],
487
+ "source": [
488
+ "# Train final KNN model with optimal K\n",
489
+ "print(\"\\nTraining final KNN model...\")\n",
490
+ "optimal_k = best_k_cv\n",
491
+ "knn_model = KNeighborsClassifier(n_neighbors=optimal_k, n_jobs=-1)\n",
492
+ "knn_model.fit(X_train_scaled, y_train)\n",
493
+ "\n",
494
+ "# Make predictions\n",
495
+ "y_train_pred = knn_model.predict(X_train_scaled)\n",
496
+ "y_test_pred = knn_model.predict(X_test_scaled)\n",
497
+ "y_train_proba = knn_model.predict_proba(X_train_scaled)\n",
498
+ "y_test_proba = knn_model.predict_proba(X_test_scaled)\n",
499
+ "\n",
500
+ "# Calculate metrics\n",
501
+ "train_accuracy = accuracy_score(y_train, y_train_pred)\n",
502
+ "test_accuracy = accuracy_score(y_test, y_test_pred)\n",
503
+ "\n",
504
+ "print(\"\\n\" + \"=\" * 80)\n",
505
+ "print(\"MODEL PERFORMANCE\")\n",
506
+ "print(\"=\" * 80)\n",
507
+ "print(f\"\\nOptimal K: {optimal_k}\")\n",
508
+ "print(f\"Training Accuracy: {train_accuracy:.4f}\")\n",
509
+ "print(f\"Test Accuracy: {test_accuracy:.4f}\")\n",
510
+ "print(f\"\\nClassification Report (Test Set):\")\n",
511
+ "print(classification_report(y_test, y_test_pred, target_names=data.target_names, digits=4))"
512
+ ]
513
+ },
514
+ {
515
+ "cell_type": "code",
516
+ "execution_count": null,
517
+ "metadata": {},
518
+ "outputs": [],
519
+ "source": [
520
+ "# Confusion Matrix Visualization\n",
521
+ "fig, axes = plt.subplots(1, 2, figsize=(14, 6))\n",
522
+ "\n",
523
+ "# Absolute counts\n",
524
+ "cm = confusion_matrix(y_test, y_test_pred)\n",
525
+ "sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', \n",
526
+ " xticklabels=data.target_names, \n",
527
+ " yticklabels=data.target_names,\n",
528
+ " cbar_kws={'label': 'Count'},\n",
529
+ " ax=axes[0],\n",
530
+ " annot_kws={'fontsize': 14, 'fontweight': 'bold'})\n",
531
+ "axes[0].set_title('Confusion Matrix (Counts)', fontsize=14, fontweight='bold')\n",
532
+ "axes[0].set_ylabel('True Label', fontsize=12, fontweight='bold')\n",
533
+ "axes[0].set_xlabel('Predicted Label', fontsize=12, fontweight='bold')\n",
534
+ "\n",
535
+ "# Normalized\n",
536
+ "cm_norm = confusion_matrix(y_test, y_test_pred, normalize='true')\n",
537
+ "sns.heatmap(cm_norm, annot=True, fmt='.2%', cmap='Greens', \n",
538
+ " xticklabels=data.target_names, \n",
539
+ " yticklabels=data.target_names,\n",
540
+ " cbar_kws={'label': 'Proportion'},\n",
541
+ " ax=axes[1],\n",
542
+ " annot_kws={'fontsize': 14, 'fontweight': 'bold'})\n",
543
+ "axes[1].set_title('Confusion Matrix (Normalized)', fontsize=14, fontweight='bold')\n",
544
+ "axes[1].set_ylabel('True Label', fontsize=12, fontweight='bold')\n",
545
+ "axes[1].set_xlabel('Predicted Label', fontsize=12, fontweight='bold')\n",
546
+ "\n",
547
+ "plt.tight_layout()\n",
548
+ "plt.show()"
549
+ ]
550
+ },
551
+ {
552
+ "cell_type": "code",
553
+ "execution_count": null,
554
+ "metadata": {},
555
+ "outputs": [],
556
+ "source": [
557
+ "# ROC Curve and AUC\n",
558
+ "fpr, tpr, thresholds = roc_curve(y_test, y_test_proba[:, 1])\n",
559
+ "roc_auc = roc_auc_score(y_test, y_test_proba[:, 1])\n",
560
+ "\n",
561
+ "fig, axes = plt.subplots(1, 2, figsize=(14, 6))\n",
562
+ "\n",
563
+ "# ROC Curve\n",
564
+ "axes[0].plot(fpr, tpr, color='darkorange', lw=2, \n",
565
+ " label=f'ROC curve (AUC = {roc_auc:.4f})')\n",
566
+ "axes[0].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Classifier')\n",
567
+ "axes[0].set_xlim([0.0, 1.0])\n",
568
+ "axes[0].set_ylim([0.0, 1.05])\n",
569
+ "axes[0].set_xlabel('False Positive Rate', fontweight='bold')\n",
570
+ "axes[0].set_ylabel('True Positive Rate', fontweight='bold')\n",
571
+ "axes[0].set_title('Receiver Operating Characteristic (ROC) Curve', fontweight='bold', fontsize=12)\n",
572
+ "axes[0].legend(loc=\"lower right\", fontsize=10)\n",
573
+ "axes[0].grid(alpha=0.3)\n",
574
+ "\n",
575
+ "# Precision-Recall Curve\n",
576
+ "precision, recall, _ = precision_recall_curve(y_test, y_test_proba[:, 1])\n",
577
+ "axes[1].plot(recall, precision, color='green', lw=2, label='Precision-Recall curve')\n",
578
+ "axes[1].set_xlabel('Recall', fontweight='bold')\n",
579
+ "axes[1].set_ylabel('Precision', fontweight='bold')\n",
580
+ "axes[1].set_title('Precision-Recall Curve', fontweight='bold', fontsize=12)\n",
581
+ "axes[1].legend(loc=\"best\", fontsize=10)\n",
582
+ "axes[1].grid(alpha=0.3)\n",
583
+ "\n",
584
+ "plt.tight_layout()\n",
585
+ "plt.show()"
586
+ ]
587
+ },
588
+ {
589
+ "cell_type": "markdown",
590
+ "metadata": {},
591
+ "source": [
592
+ "## 8. SHAP Explainability Setup"
593
+ ]
594
+ },
595
+ {
596
+ "cell_type": "code",
597
+ "execution_count": null,
598
+ "metadata": {},
599
+ "outputs": [],
600
+ "source": [
601
+ "print(\"=\" * 80)\n",
602
+ "print(\"SHAP EXPLAINABILITY ANALYSIS\")\n",
603
+ "print(\"=\" * 80)\n",
604
+ "print(\"\\nSetting up SHAP explainer...\\n\")\n",
605
+ "\n",
606
+ "# Create background dataset for SHAP\n",
607
+ "# Using kmeans to select representative samples (faster for large datasets)\n",
608
+ "background_size = 100\n",
609
+ "background = shap.kmeans(X_train_scaled, background_size)\n",
610
+ "\n",
611
+ "print(f\"Background dataset size: {background_size} samples\")\n",
612
+ "print(f\"Background dataset shape: {background.data.shape}\")\n",
613
+ "\n",
614
+ "# Create SHAP explainer\n",
615
+ "# Using KernelExplainer (model-agnostic) for KNN\n",
616
+ "print(\"\\nCreating SHAP KernelExplainer (this may take a moment)...\")\n",
617
+ "explainer = shap.KernelExplainer(knn_model.predict_proba, background)\n",
618
+ "\n",
619
+ "print(\"\\n✓ SHAP explainer created successfully!\")\n",
620
+ "print(f\"Expected value (class 0): {explainer.expected_value[0]:.4f}\")\n",
621
+ "print(f\"Expected value (class 1): {explainer.expected_value[1]:.4f}\")\n",
622
+ "\n",
623
+ "# Check GPU memory after setup\n",
624
+ "print(\"\\nGPU Memory Status:\")\n",
625
+ "print_gpu_memory()"
626
+ ]
627
+ },
628
+ {
629
+ "cell_type": "code",
630
+ "execution_count": null,
631
+ "metadata": {},
632
+ "outputs": [],
633
+ "source": "# Compute SHAP values for test set\n# Adjust sample size based on your GPU memory and time constraints\nn_samples = min(100, len(X_test_scaled)) # Use 100 samples or all test samples if less\nX_test_sample = X_test_scaled.iloc[:n_samples]\ny_test_sample = y_test.iloc[:n_samples]\n\nprint(f\"Computing SHAP values for {n_samples} test samples...\")\nprint(\"This may take several minutes depending on your GPU...\")\nprint(\"Progress will be shown below:\\n\")\n\n# Compute SHAP values with progress tracking\nimport time\nstart_time = time.time()\n\nshap_values = explainer.shap_values(X_test_sample, nsamples=100) # nsamples controls accuracy/speed tradeoff\n\nelapsed_time = time.time() - start_time\n\nprint(f\"\\n✓ SHAP values computed successfully!\")\nprint(f\"Computation time: {elapsed_time:.2f} seconds ({elapsed_time/60:.2f} minutes)\")\nprint(f\"Time per sample: {elapsed_time/n_samples:.2f} seconds\")\n\n# Debug: Check the structure of shap_values\nprint(f\"\\nDEBUG: Type of shap_values: {type(shap_values)}\")\nif isinstance(shap_values, list):\n print(f\"DEBUG: shap_values is a list with {len(shap_values)} elements\")\n for i, sv in enumerate(shap_values):\n print(f\"DEBUG: shap_values[{i}].shape = {np.array(sv).shape}\")\nelse:\n print(f\"DEBUG: shap_values.shape = {np.array(shap_values).shape}\")\nprint(f\"DEBUG: X_test_sample.shape = {X_test_sample.shape}\")\n\nprint(f\"\\nSHAP values shape: {np.array(shap_values).shape}\")\nprint(f\" - 2 classes (binary classification)\")\nprint(f\" - {n_samples} samples explained\")\nprint(f\" - {X_test_sample.shape[1]} features\")\n\n# Check GPU memory\nprint(\"\\nGPU Memory Status:\")\nprint_gpu_memory()"
634
+ },
635
+ {
636
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": "## 9. SHAP Visualizations - Global Explanations"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "# Fix SHAP values format for binary classification\n# Recent SHAP versions return an array of shape (n_samples, n_features, n_classes)\n# from KernelExplainer with predict_proba; older versions return a list of two\n# (n_samples, n_features) arrays. Normalize to the list format used by the plots below.\nprint(\"Reshaping SHAP values for visualization...\")\n\nif not isinstance(shap_values, list) and shap_values.ndim == 3 and shap_values.shape[2] == 2:\n    print(f\"Original shape: {shap_values.shape}\")\n    shap_values = [shap_values[:, :, i] for i in range(shap_values.shape[2])]\n    print(f\"  Class 0 SHAP values shape: {shap_values[0].shape}\")\n    print(f\"  Class 1 SHAP values shape: {shap_values[1].shape}\")\nelse:\n    print(\"SHAP values already in list format\")\n\nprint(\"✓ SHAP values ready for visualization!\")"
639
+ },
640
+ {
641
+ "cell_type": "code",
642
+ "execution_count": null,
643
+ "metadata": {},
644
+ "outputs": [],
645
+ "source": "# SHAP Summary Plot (Beeswarm) - Shows global feature importance\nprint(\"=\" * 80)\nprint(\"SHAP VISUALIZATION 1: SUMMARY PLOT (BEESWARM)\")\nprint(\"=\" * 80)\nprint(\"\"\"\nThis plot shows:\n- Feature importance (vertical axis, ordered by importance)\n- SHAP values (horizontal axis, impact on prediction)\n- Feature values (color, red=high, blue=low)\n- Distribution across all samples (density)\n\nReading the plot:\n- Features at the top are most important\n- Points to the right increase probability of class 1 (malignant)\n- Points to the left decrease probability of class 1\n- Color shows whether high (red) or low (blue) feature values have that effect\n\"\"\")\n\nplt.figure(figsize=(14, 10))\nshap.summary_plot(shap_values[1], X_test_sample.values, \n feature_names=X_test_sample.columns.tolist(),\n plot_type=\"dot\", show=False, max_display=20)\nplt.title('SHAP Summary Plot - Global Feature Importance and Impact\\n(Predicting Malignant Class)', \n fontsize=14, fontweight='bold', pad=20)\nplt.tight_layout()\nplt.show()"
646
+ },
647
+ {
648
+ "cell_type": "code",
649
+ "execution_count": null,
650
+ "metadata": {},
651
+ "outputs": [],
652
+ "source": "# SHAP Bar Plot - Mean absolute SHAP values\nprint(\"=\" * 80)\nprint(\"SHAP VISUALIZATION 2: BAR PLOT\")\nprint(\"=\" * 80)\nprint(\"\"\"\nThis plot shows:\n- Average magnitude of feature impact (mean |SHAP value|)\n- Overall feature importance ranking\n- Which features have the strongest effect on predictions (regardless of direction)\n\"\"\")\n\nplt.figure(figsize=(14, 10))\nshap.summary_plot(shap_values[1], X_test_sample.values,\n feature_names=X_test_sample.columns.tolist(),\n plot_type=\"bar\", show=False, max_display=20)\nplt.title('SHAP Bar Plot - Mean Absolute Feature Importance', \n fontsize=14, fontweight='bold', pad=20)\nplt.xlabel('Mean |SHAP value| (average impact on model output magnitude)', fontweight='bold')\nplt.tight_layout()\nplt.show()"
653
+ },
654
+ {
655
+ "cell_type": "code",
656
+ "execution_count": null,
657
+ "metadata": {},
658
+ "outputs": [],
659
+ "source": [
660
+ "# Calculate and display feature importance rankings\n",
661
+ "feature_importance = np.abs(shap_values[1]).mean(axis=0)\n",
662
+ "feature_importance_df = pd.DataFrame({\n",
663
+ " 'Feature': X_test_sample.columns,\n",
664
+ " 'Mean_Abs_SHAP': feature_importance,\n",
665
+ " 'Mean_SHAP': shap_values[1].mean(axis=0),\n",
666
+ " 'Std_SHAP': np.std(shap_values[1], axis=0)\n",
667
+ "}).sort_values('Mean_Abs_SHAP', ascending=False)\n",
668
+ "\n",
669
+ "print(\"\\n\" + \"=\" * 80)\n",
670
+ "print(\"TOP 15 MOST IMPORTANT FEATURES (by mean |SHAP value|)\")\n",
671
+ "print(\"=\" * 80)\n",
672
+ "print(feature_importance_df.head(15).to_string(index=False))\n",
673
+ "\n",
674
+ "# Visualize feature importance with custom plot\n",
675
+ "fig, axes = plt.subplots(1, 2, figsize=(16, 8))\n",
676
+ "\n",
677
+ "# Top features by absolute importance\n",
678
+ "top_15 = feature_importance_df.head(15)\n",
679
+ "y_pos = np.arange(len(top_15))\n",
680
+ "axes[0].barh(y_pos, top_15['Mean_Abs_SHAP'], color='steelblue', edgecolor='black')\n",
681
+ "axes[0].set_yticks(y_pos)\n",
682
+ "axes[0].set_yticklabels(top_15['Feature'], fontsize=9)\n",
683
+ "axes[0].set_xlabel('Mean |SHAP Value|', fontweight='bold')\n",
684
+ "axes[0].set_title('Top 15 Features by Absolute Importance', fontweight='bold', fontsize=12)\n",
685
+ "axes[0].invert_yaxis()\n",
686
+ "axes[0].grid(axis='x', alpha=0.3)\n",
687
+ "\n",
688
+ "# Top features by directional importance (showing positive/negative effect)\n",
689
+ "colors = ['green' if x > 0 else 'red' for x in top_15['Mean_SHAP']]\n",
690
+ "axes[1].barh(y_pos, top_15['Mean_SHAP'], color=colors, edgecolor='black', alpha=0.7)\n",
691
+ "axes[1].set_yticks(y_pos)\n",
692
+ "axes[1].set_yticklabels(top_15['Feature'], fontsize=9)\n",
693
+ "axes[1].set_xlabel('Mean SHAP Value (Directional)', fontweight='bold')\n",
694
+ "axes[1].set_title('Top 15 Features - Directional Impact\\n(Green: Increases Malignant Prob, Red: Decreases)', \n",
695
+ " fontweight='bold', fontsize=12)\n",
696
+ "axes[1].axvline(x=0, color='black', linestyle='-', linewidth=0.8)\n",
697
+ "axes[1].invert_yaxis()\n",
698
+ "axes[1].grid(axis='x', alpha=0.3)\n",
699
+ "\n",
700
+ "plt.tight_layout()\n",
701
+ "plt.show()"
702
+ ]
703
+ },
704
+ {
705
+ "cell_type": "markdown",
706
+ "metadata": {},
707
+ "source": [
708
+ "## 10. SHAP Visualizations - Local Explanations (Individual Predictions)"
709
+ ]
710
+ },
711
+ {
712
+ "cell_type": "code",
713
+ "execution_count": null,
714
+ "metadata": {},
715
+ "outputs": [],
716
+ "source": [
717
+ "# Select interesting samples for detailed explanation\n",
718
+ "print(\"=\" * 80)\n",
719
+ "print(\"SELECTING SAMPLES FOR DETAILED EXPLANATION\")\n",
720
+ "print(\"=\" * 80)\n",
721
+ "\n",
722
+ "# Get predictions for our sample\n",
723
+ "y_pred_sample = knn_model.predict(X_test_sample)\n",
724
+ "y_proba_sample = knn_model.predict_proba(X_test_sample)\n",
725
+ "\n",
726
+ "# Find interesting samples\n",
727
+ "correct_mask = y_test_sample == y_pred_sample\n",
728
+ "incorrect_mask = ~correct_mask\n",
729
+ "\n",
730
+ "# High confidence correct\n",
731
+ "high_conf_correct_idx = np.where(correct_mask & (np.max(y_proba_sample, axis=1) > 0.9))[0]\n",
732
+ "# Low confidence correct\n",
733
+ "low_conf_correct_idx = np.where(correct_mask & (np.max(y_proba_sample, axis=1) < 0.7))[0]\n",
734
+ "# Misclassified\n",
735
+ "misclassified_idx = np.where(incorrect_mask)[0]\n",
736
+ "\n",
737
+ "print(f\"\\nTotal samples analyzed: {len(y_test_sample)}\")\n",
738
+ "print(f\"Correctly classified: {correct_mask.sum()} ({correct_mask.sum()/len(y_test_sample)*100:.1f}%)\")\n",
739
+ "print(f\"Misclassified: {incorrect_mask.sum()} ({incorrect_mask.sum()/len(y_test_sample)*100:.1f}%)\")\n",
740
+ "print(f\"\\nHigh confidence correct predictions: {len(high_conf_correct_idx)}\")\n",
741
+ "print(f\"Low confidence correct predictions: {len(low_conf_correct_idx)}\")\n",
742
+ "print(f\"Misclassified predictions: {len(misclassified_idx)}\")\n",
743
+ "\n",
744
+ "# Select samples to explain\n",
745
+ "samples_to_explain = []\n",
746
+ "sample_descriptions = []\n",
747
+ "\n",
748
+ "if len(high_conf_correct_idx) > 0:\n",
749
+ " samples_to_explain.append(high_conf_correct_idx[0])\n",
750
+ " sample_descriptions.append(\"High confidence correct\")\n",
751
+ "\n",
752
+ "if len(low_conf_correct_idx) > 0:\n",
753
+ " samples_to_explain.append(low_conf_correct_idx[0])\n",
754
+ " sample_descriptions.append(\"Low confidence correct\")\n",
755
+ "\n",
756
+ "if len(misclassified_idx) > 0:\n",
757
+ " samples_to_explain.append(misclassified_idx[0])\n",
758
+ " sample_descriptions.append(\"Misclassified\")\n",
759
+ "\n",
760
+ "if len(samples_to_explain) == 0:\n",
761
+ " # If no special cases, just use first sample\n",
762
+ " samples_to_explain = [0]\n",
763
+ " sample_descriptions = [\"First sample\"]\n",
764
+ "\n",
765
+ "print(f\"\\nSelected {len(samples_to_explain)} samples for detailed explanation\")"
766
+ ]
767
+ },
768
+ {
769
+ "cell_type": "code",
770
+ "execution_count": null,
771
+ "metadata": {},
772
+ "outputs": [],
773
+ "source": [
774
+ "# Waterfall plots for selected samples\n",
775
+ "print(\"\\n\" + \"=\" * 80)\n",
776
+ "print(\"SHAP VISUALIZATION 3: WATERFALL PLOTS (Individual Predictions)\")\n",
777
+ "print(\"=\" * 80)\n",
778
+ "print(\"\"\"\n",
779
+ "Waterfall plots show how each feature contributes to pushing the prediction\n",
780
+ "from the base value (expected value) to the final prediction for a single sample.\n",
781
+ "\n",
782
+ "Reading the plot:\n",
783
+ "- Starts at E[f(x)] (expected value/average prediction)\n",
784
+ "- Each bar shows a feature's contribution\n",
785
+ "- Red bars push prediction higher (toward malignant)\n",
786
+ "- Blue bars push prediction lower (toward benign)\n",
787
+ "- Final value f(x) is the model's output for this sample\n",
788
+ "\"\"\")\n",
789
+ "\n",
790
+ "for idx, (sample_idx, description) in enumerate(zip(samples_to_explain, sample_descriptions)):\n",
791
+ " true_label = data.target_names[y_test_sample.iloc[sample_idx]]\n",
792
+ " pred_label = data.target_names[y_pred_sample[sample_idx]]\n",
793
+ " pred_proba = y_proba_sample[sample_idx]\n",
794
+ " \n",
795
+ " print(f\"\\n{'-'*80}\")\n",
796
+ " print(f\"Sample {idx+1}: {description} (Index {sample_idx})\")\n",
797
+ " print(f\"{'-'*80}\")\n",
798
+ " print(f\"True Label: {true_label}\")\n",
799
+ " print(f\"Predicted Label: {pred_label}\")\n",
800
+ " print(f\"Prediction Probabilities:\")\n",
801
+ " for class_idx, class_name in enumerate(data.target_names):\n",
802
+ " print(f\" {class_name}: {pred_proba[class_idx]:.4f} ({pred_proba[class_idx]*100:.2f}%)\")\n",
803
+ " print(f\"Correct: {true_label == pred_label}\")\n",
804
+ " \n",
805
+ " # Create waterfall plot\n",
806
+ " shap.plots.waterfall(\n",
807
+ " shap.Explanation(\n",
808
+ " values=shap_values[1][sample_idx],\n",
809
+ " base_values=explainer.expected_value[1],\n",
810
+ " data=X_test_sample.iloc[sample_idx],\n",
811
+ " feature_names=X_test_sample.columns.tolist()\n",
812
+ " ),\n",
813
+ " max_display=15\n",
814
+ " )"
815
+ ]
816
+ },
817
+ {
818
+ "cell_type": "code",
819
+ "execution_count": null,
820
+ "metadata": {},
821
+ "outputs": [],
822
+ "source": [
823
+ "# Force plots for individual predictions\n",
824
+ "print(\"\\n\" + \"=\" * 80)\n",
825
+ "print(\"SHAP VISUALIZATION 4: FORCE PLOTS (Individual Predictions)\")\n",
826
+ "print(\"=\" * 80)\n",
827
+ "print(\"\"\"\n",
828
+ "Force plots provide another view of individual predictions:\n",
829
+ "- Red features push prediction toward higher values (malignant)\n",
830
+ "- Blue features push prediction toward lower values (benign)\n",
831
+ "- Width of each feature shows magnitude of impact\n",
832
+ "\"\"\")\n",
833
+ "\n",
834
+ "for idx, (sample_idx, description) in enumerate(zip(samples_to_explain[:3], sample_descriptions[:3])):\n",
835
+ " print(f\"\\nSample {idx+1}: {description}\")\n",
836
+ " \n",
837
+ " shap.force_plot(\n",
838
+ " explainer.expected_value[1],\n",
839
+ " shap_values[1][sample_idx],\n",
840
+ " X_test_sample.iloc[sample_idx],\n",
841
+ " matplotlib=True,\n",
842
+ " show=False,\n",
843
+ " figsize=(20, 3)\n",
844
+ " )\n",
845
+ " plt.title(f'Force Plot - {description} (Sample {sample_idx})', \n",
846
+ " fontsize=12, fontweight='bold', pad=10)\n",
847
+ " plt.tight_layout()\n",
848
+ " plt.show()"
849
+ ]
850
+ },
851
+ {
852
+ "cell_type": "code",
853
+ "execution_count": null,
854
+ "metadata": {},
855
+ "outputs": [],
856
+ "source": [
857
+ "# Interactive force plot for multiple samples\n",
858
+ "print(\"\\n\" + \"=\" * 80)\n",
859
+ "print(\"SHAP VISUALIZATION 5: INTERACTIVE FORCE PLOT (Multiple Predictions)\")\n",
860
+ "print(\"=\" * 80)\n",
861
+ "print(\"\"\"\n",
862
+ "This interactive visualization shows force plots for multiple samples simultaneously.\n",
863
+ "Samples are sorted by their predicted probability, allowing you to see patterns\n",
864
+ "across different prediction strengths.\n",
865
+ "\"\"\")\n",
866
+ "\n",
867
+ "# Use first 50 samples for visualization\n",
868
+ "n_force_samples = min(50, len(X_test_sample))\n",
869
+ "\n",
870
+ "shap.force_plot(\n",
871
+ " explainer.expected_value[1],\n",
872
+ " shap_values[1][:n_force_samples],\n",
873
+ " X_test_sample.iloc[:n_force_samples]\n",
874
+ ")"
875
+ ]
876
+ },
877
+ {
878
+ "cell_type": "markdown",
879
+ "metadata": {},
880
+ "source": [
881
+ "## 11. SHAP Dependence Plots - Feature Interactions"
882
+ ]
883
+ },
884
+ {
885
+ "cell_type": "code",
886
+ "execution_count": null,
887
+ "metadata": {},
888
+ "outputs": [],
889
+ "source": "# Dependence plots for top features\nprint(\"=\" * 80)\nprint(\"SHAP VISUALIZATION 6: DEPENDENCE PLOTS\")\nprint(\"=\" * 80)\nprint(\"\"\"\nDependence plots show how feature values relate to their SHAP values:\n- X-axis: Feature value\n- Y-axis: SHAP value (impact on prediction)\n- Color: Another feature that may interact with this feature\n\nThese plots reveal:\n- Non-linear relationships between features and predictions\n- Feature interactions (shown by color patterns)\n- Threshold effects\n\"\"\")\n\n# Get top 6 most important features\ntop_features = feature_importance_df.head(6)['Feature'].values\n\nfig, axes = plt.subplots(2, 3, figsize=(18, 12))\naxes = axes.ravel()\n\nfor idx, feature in enumerate(top_features):\n plt.sca(axes[idx])\n shap.dependence_plot(\n feature,\n shap_values[1],\n X_test_sample.values,\n feature_names=X_test_sample.columns.tolist(),\n show=False,\n ax=axes[idx]\n )\n axes[idx].set_title(f'Dependence Plot: {feature}', fontsize=11, fontweight='bold')\n axes[idx].grid(alpha=0.3)\n\nplt.suptitle('SHAP Dependence Plots - Top 6 Features', \n fontsize=14, fontweight='bold', y=1.002)\nplt.tight_layout()\nplt.show()\n\nprint(\"\\nKey observations from dependence plots:\")\nprint(\"- Look for non-linear patterns in the scatter plots\")\nprint(\"- Color gradients indicate feature interactions\")\nprint(\"- Vertical spread at a given x-value suggests interactions with other features\")"
890
+ },
891
+ {
892
+ "cell_type": "markdown",
893
+ "metadata": {},
894
+ "source": [
895
+ "## 12. SHAP Decision Plot"
896
+ ]
897
+ },
898
+ {
899
+ "cell_type": "code",
900
+ "execution_count": null,
901
+ "metadata": {},
902
+ "outputs": [],
903
+ "source": [
904
+ "# Decision plot showing prediction paths\n",
905
+ "print(\"=\" * 80)\n",
906
+ "print(\"SHAP VISUALIZATION 7: DECISION PLOT\")\n",
907
+ "print(\"=\" * 80)\n",
908
+ "print(\"\"\"\n",
909
+ "Decision plots show the cumulative effect of features on predictions:\n",
910
+ "- Each line represents one sample's prediction path\n",
911
+ "- Starts from expected value at bottom\n",
912
+ "- Each feature shifts the prediction up or down\n",
913
+ "- Final position (top) is the model's prediction\n",
914
+ "- Color indicates the final predicted class\n",
915
+ "\n",
916
+ "This helps visualize:\n",
917
+ "- Which features drive different predictions\n",
918
+ "- Where predictions diverge\n",
919
+ "- Similarity between prediction paths\n",
920
+ "\"\"\")\n",
921
+ "\n",
922
+ "# Select diverse samples for decision plot\n",
923
+ "n_decision_samples = min(30, len(X_test_sample))\n",
924
+ "decision_indices = np.linspace(0, len(X_test_sample)-1, n_decision_samples, dtype=int)\n",
925
+ "\n",
926
+ "plt.figure(figsize=(14, 10))\n",
927
+ "shap.decision_plot(\n",
928
+ " explainer.expected_value[1],\n",
929
+ " shap_values[1][decision_indices],\n",
930
+ " X_test_sample.iloc[decision_indices],\n",
931
+ " show=False,\n",
932
+ " feature_display_range=slice(-1, -21, -1) # Show top 20 features\n",
933
+ ")\n",
934
+ "plt.title(f'SHAP Decision Plot - Prediction Paths for {n_decision_samples} Samples\\n(Top 20 Features)', \n",
935
+ " fontsize=14, fontweight='bold', pad=20)\n",
936
+ "plt.tight_layout()\n",
937
+ "plt.show()"
938
+ ]
939
+ },
940
+ {
941
+ "cell_type": "markdown",
942
+ "metadata": {},
943
+ "source": [
944
+ "## 13. Advanced Analysis - Correct vs Incorrect Predictions"
945
+ ]
946
+ },
947
+ {
948
+ "cell_type": "code",
949
+ "execution_count": null,
950
+ "metadata": {},
951
+ "outputs": [],
952
+ "source": "# Compare SHAP patterns between correct and incorrect predictions\nprint(\"=\" * 80)\nprint(\"ADVANCED ANALYSIS: CORRECT VS INCORRECT PREDICTIONS\")\nprint(\"=\" * 80)\n\nif incorrect_mask.sum() > 0:\n print(f\"\\nAnalyzing {incorrect_mask.sum()} misclassified samples...\\n\")\n \n # Compare average SHAP values\n shap_correct = np.abs(shap_values[1][correct_mask]).mean(axis=0)\n shap_incorrect = np.abs(shap_values[1][incorrect_mask]).mean(axis=0)\n \n comparison_df = pd.DataFrame({\n 'Feature': X_test_sample.columns,\n 'Correct_Predictions': shap_correct,\n 'Incorrect_Predictions': shap_incorrect,\n 'Difference': shap_incorrect - shap_correct\n }).sort_values('Difference', ascending=False)\n \n print(\"Features with largest difference in importance:\")\n print(\"\\nTop 10 features MORE important in incorrect predictions:\")\n print(comparison_df.head(10).to_string(index=False))\n \n # Visualize comparison\n fig, axes = plt.subplots(1, 2, figsize=(16, 8))\n \n # Convert mask to numpy array for indexing\n correct_mask_np = correct_mask.values if hasattr(correct_mask, 'values') else correct_mask\n incorrect_mask_np = incorrect_mask.values if hasattr(incorrect_mask, 'values') else incorrect_mask\n \n # Get the data for correct and incorrect predictions\n X_correct = X_test_sample.values[correct_mask_np]\n X_incorrect = X_test_sample.values[incorrect_mask_np]\n \n # Summary plot for correct predictions\n plt.sca(axes[0])\n shap.summary_plot(shap_values[1][correct_mask_np], \n X_correct,\n feature_names=X_test_sample.columns.tolist(),\n plot_type=\"bar\", show=False, max_display=15)\n axes[0].set_title(f'Feature Importance - Correct Predictions (n={correct_mask.sum()})', \n fontweight='bold', fontsize=12)\n \n # Summary plot for incorrect predictions\n plt.sca(axes[1])\n shap.summary_plot(shap_values[1][incorrect_mask_np], \n X_incorrect,\n feature_names=X_test_sample.columns.tolist(),\n plot_type=\"bar\", show=False, max_display=15)\n 
axes[1].set_title(f'Feature Importance - Incorrect Predictions (n={incorrect_mask.sum()})', \n fontweight='bold', fontsize=12)\n \n plt.tight_layout()\n plt.show()\n \nelse:\n print(\"\\nAll samples in the test set were correctly classified!\")\n print(\"This indicates excellent model performance.\")"
953
+ },
954
+ {
955
+ "cell_type": "markdown",
956
+ "metadata": {},
957
+ "source": [
958
+ "## 14. Interactive Exploration Function"
959
+ ]
960
+ },
961
+ {
962
+ "cell_type": "code",
963
+ "execution_count": null,
964
+ "metadata": {},
965
+ "outputs": [],
966
+ "source": [
967
+ "# Create interactive exploration function\n",
968
+ "def explain_sample(sample_index):\n",
969
+ " \"\"\"\n",
970
+ " Provide detailed explanation for a specific sample prediction.\n",
971
+ " \n",
972
+ " Args:\n",
973
+ " sample_index: Index of the sample to explain (0 to len(X_test_sample)-1)\n",
974
+ " \"\"\"\n",
975
+ " if sample_index < 0 or sample_index >= len(X_test_sample):\n",
976
+ " print(f\"Error: Sample index out of range. Please use 0 to {len(X_test_sample)-1}\")\n",
977
+ " return\n",
978
+ " \n",
979
+ " print(\"\\n\" + \"=\" * 80)\n",
980
+ " print(f\"DETAILED EXPLANATION FOR SAMPLE {sample_index}\")\n",
981
+ " print(\"=\" * 80)\n",
982
+ " \n",
983
+ " # Get prediction information\n",
984
+ " true_label = data.target_names[y_test_sample.iloc[sample_index]]\n",
985
+ " pred_label = data.target_names[y_pred_sample[sample_index]]\n",
986
+ " pred_proba = y_proba_sample[sample_index]\n",
987
+ " \n",
988
+ " print(f\"\\n1. PREDICTION SUMMARY\")\n",
989
+ " print(f\"{'-'*80}\")\n",
990
+ " print(f\"True Label: {true_label}\")\n",
991
+ " print(f\"Predicted Label: {pred_label}\")\n",
992
+ " print(f\"Correct: {'✓ Yes' if true_label == pred_label else '✗ No'}\")\n",
993
+ " print(f\"\\nPrediction Probabilities:\")\n",
994
+ " for class_idx, class_name in enumerate(data.target_names):\n",
995
+ " prob = pred_proba[class_idx]\n",
996
+ " bar = '█' * int(prob * 50)\n",
997
+ " print(f\" {class_name:12s}: {prob:.4f} ({prob*100:5.2f}%) {bar}\")\n",
998
+ " \n",
999
+ " # SHAP explanation\n",
1000
+ " print(f\"\\n2. SHAP EXPLANATION\")\n",
1001
+ " print(f\"{'-'*80}\")\n",
1002
+ " print(f\"Base value (expected output): {explainer.expected_value[1]:.4f}\")\n",
1003
+ " print(f\"Model output for this sample: {explainer.expected_value[1] + shap_values[1][sample_index].sum():.4f}\")\n",
1004
+ " \n",
1005
+ " # Top positive and negative contributors\n",
1006
+ " shap_sample = shap_values[1][sample_index]\n",
1007
+ " feature_impacts = pd.DataFrame({\n",
1008
+ " 'Feature': X_test_sample.columns,\n",
1009
+ " 'Value': X_test_sample.iloc[sample_index].values,\n",
1010
+ " 'SHAP_Value': shap_sample\n",
1011
+ " }).sort_values('SHAP_Value', key=abs, ascending=False)\n",
1012
+ " \n",
1013
+ " print(f\"\\nTop 5 features INCREASING malignant probability:\")\n",
1014
+ " positive_features = feature_impacts[feature_impacts['SHAP_Value'] > 0].head(5)\n",
1015
+ " for idx, row in positive_features.iterrows():\n",
1016
+ " print(f\" {row['Feature']:30s}: {row['SHAP_Value']:+.4f} (value={row['Value']:.4f})\")\n",
1017
+ " \n",
1018
+ " print(f\"\\nTop 5 features DECREASING malignant probability:\")\n",
1019
+ " negative_features = feature_impacts[feature_impacts['SHAP_Value'] < 0].head(5)\n",
1020
+ " for idx, row in negative_features.iterrows():\n",
1021
+ " print(f\" {row['Feature']:30s}: {row['SHAP_Value']:+.4f} (value={row['Value']:.4f})\")\n",
1022
+ " \n",
1023
+ " # Visualizations\n",
1024
+ " print(f\"\\n3. VISUALIZATIONS\")\n",
1025
+ " print(f\"{'-'*80}\\n\")\n",
1026
+ " \n",
1027
+ " # Waterfall plot\n",
1028
+ " print(\"Waterfall Plot:\")\n",
1029
+ " shap.plots.waterfall(\n",
1030
+ " shap.Explanation(\n",
1031
+ " values=shap_values[1][sample_index],\n",
1032
+ " base_values=explainer.expected_value[1],\n",
1033
+ " data=X_test_sample.iloc[sample_index],\n",
1034
+ " feature_names=X_test_sample.columns.tolist()\n",
1035
+ " ),\n",
1036
+ " max_display=15\n",
1037
+ " )\n",
1038
+ " \n",
1039
+ " # Feature values comparison\n",
1040
+ " print(\"\\n4. FEATURE VALUES COMPARISON\")\n",
1041
+ " print(f\"{'-'*80}\")\n",
1042
+ " print(\"\\nTop 10 features by absolute value (original scale):\")\n",
1043
+ " original_values = X_test.iloc[sample_index].sort_values(ascending=False).head(10)\n",
1044
+ " for feature, value in original_values.items():\n",
1045
+ " mean_val = X_train[feature].mean()\n",
1046
+ " std_val = X_train[feature].std()\n",
1047
+ " z_score = (value - mean_val) / std_val\n",
1048
+ " print(f\" {feature:30s}: {value:10.4f} (μ={mean_val:8.4f}, z={z_score:+6.2f})\")\n",
1049
+ "\n",
1050
+ "# Display usage instructions\n",
1051
+ "print(\"=\" * 80)\n",
1052
+ "print(\"INTERACTIVE EXPLORATION\")\n",
1053
+ "print(\"=\" * 80)\n",
1054
+ "print(f\"\\nUse the explain_sample() function to explore any prediction:\")\n",
1055
+ "print(f\"\\nExample usage:\")\n",
1056
+ "print(f\" explain_sample(0) # Explain first sample\")\n",
1057
+ "print(f\" explain_sample(10) # Explain 11th sample\")\n",
1058
+ "print(f\"\\nValid range: 0 to {len(X_test_sample)-1}\")\n",
1059
+ "print(f\"\\nTry these interesting samples:\")\n",
1060
+ "if len(high_conf_correct_idx) > 0:\n",
1061
+ " print(f\" explain_sample({high_conf_correct_idx[0]}) # High confidence correct\")\n",
1062
+ "if len(low_conf_correct_idx) > 0:\n",
1063
+ " print(f\" explain_sample({low_conf_correct_idx[0]}) # Low confidence correct\")\n",
1064
+ "if len(misclassified_idx) > 0:\n",
1065
+ " print(f\" explain_sample({misclassified_idx[0]}) # Misclassified sample\")\n",
1066
+ "\n",
1067
+ "print(\"\\n\" + \"=\"*80)\n",
1068
+ "print(\"Example: Explaining sample 0\")\n",
1069
+ "print(\"=\"*80)\n",
1070
+ "explain_sample(0)"
1071
+ ]
1072
+ },
1073
+ {
1074
+ "cell_type": "markdown",
1075
+ "metadata": {},
1076
+ "source": [
1077
+ "## 15. GPU Resource Monitoring"
1078
+ ]
1079
+ },
1080
+ {
1081
+ "cell_type": "code",
1082
+ "execution_count": null,
1083
+ "metadata": {},
1084
+ "outputs": [],
1085
+ "source": [
1086
+ "# Final GPU memory check\n",
1087
+ "print(\"=\" * 80)\n",
1088
+ "print(\"VAST.AI GPU RESOURCE SUMMARY\")\n",
1089
+ "print(\"=\" * 80)\n",
1090
+ "print(\"\\nFinal GPU Memory Status:\")\n",
1091
+ "print_gpu_memory()\n",
1092
+ "\n",
1093
+ "print(\"\\nTo free GPU memory, run: clear_gpu_memory()\")"
1094
+ ]
1095
+ },
1096
+ {
1097
+ "cell_type": "markdown",
1098
+ "metadata": {},
1099
+ "source": [
1100
+ "## 16. Summary and Key Insights"
1101
+ ]
1102
+ },
1103
+ {
1104
+ "cell_type": "code",
1105
+ "execution_count": null,
1106
+ "metadata": {},
1107
+ "outputs": [],
1108
+ "source": [
1109
+ "# Comprehensive summary\n",
1110
+ "print(\"=\" * 80)\n",
1111
+ "print(\"COMPREHENSIVE SUMMARY\")\n",
1112
+ "print(\"=\" * 80)\n",
1113
+ "\n",
1114
+ "print(\"\\n1. MODEL PERFORMANCE\")\n",
1115
+ "print(\"-\" * 80)\n",
1116
+ "print(f\"Algorithm: K-Nearest Neighbors (KNN)\")\n",
1117
+ "print(f\"Optimal K: {optimal_k} neighbors\")\n",
1118
+ "print(f\"Training Accuracy: {train_accuracy:.4f} ({train_accuracy*100:.2f}%)\")\n",
1119
+ "print(f\"Test Accuracy: {test_accuracy:.4f} ({test_accuracy*100:.2f}%)\")\n",
1120
+ "print(f\"ROC AUC Score: {roc_auc:.4f}\")\n",
1121
+ "\n",
1122
+ "print(\"\\n2. DATASET INFORMATION\")\n",
1123
+ "print(\"-\" * 80)\n",
1124
+ "print(f\"Dataset: Breast Cancer Wisconsin (Diagnostic)\")\n",
1125
+ "print(f\"Total Samples: {len(X)}\")\n",
1126
+ "print(f\"Features: {X.shape[1]}\")\n",
1127
+ "print(f\"Classes: {len(data.target_names)} ({', '.join(data.target_names)})\")\n",
1128
+ "print(f\"Train/Test Split: {len(X_train)}/{len(X_test)}\")\n",
1129
+ "\n",
1130
+ "print(\"\\n3. TOP 5 MOST IMPORTANT FEATURES (SHAP)\")\n",
1131
+ "print(\"-\" * 80)\n",
1132
+ "for idx, row in feature_importance_df.head(5).iterrows():\n",
1133
+ " print(f\"{idx+1}. {row['Feature']:30s} (mean |SHAP|={row['Mean_Abs_SHAP']:.4f})\")\n",
1134
+ "\n",
1135
+ "print(\"\\n4. SHAP EXPLAINABILITY INSIGHTS\")\n",
1136
+ "print(\"-\" * 80)\n",
1137
+ "print(f\"Samples explained: {n_samples}\")\n",
1138
+ "print(f\"Background dataset size: {background_size}\")\n",
1139
+ "print(f\"Computation time: {elapsed_time:.2f} seconds\")\n",
1140
+ "print(f\"Time per sample: {elapsed_time/n_samples:.2f} seconds\")\n",
1141
+ "\n",
1142
+ "print(\"\\n5. KEY TAKEAWAYS\")\n",
1143
+ "print(\"-\" * 80)\n",
1144
+ "print(\"\"\"\n",
1145
+ "✓ SHAP provides model-agnostic explainability for KNN predictions\n",
1146
+ "✓ Feature importance rankings identify which features drive predictions\n",
1147
+ "✓ Waterfall plots explain individual predictions step-by-step\n",
1148
+ "✓ Dependence plots reveal non-linear relationships and interactions\n",
1149
+ "✓ Summary plots show global patterns across all predictions\n",
1150
+ "✓ Decision plots visualize prediction paths for multiple samples\n",
1151
+ "✓ KNN combined with SHAP offers both accuracy and interpretability\n",
1152
+ "\"\"\")\n",
1153
+ "\n",
1154
+ "print(\"\\n6. SHAP VISUALIZATION TYPES USED\")\n",
1155
+ "print(\"-\" * 80)\n",
1156
+ "visualizations = [\n",
1157
+ " (\"Summary Plot (Beeswarm)\", \"Global feature importance with value distributions\"),\n",
1158
+ " (\"Bar Plot\", \"Mean absolute feature importance\"),\n",
1159
+ " (\"Waterfall Plot\", \"Individual prediction breakdown\"),\n",
1160
+ " (\"Force Plot\", \"Visual representation of feature contributions\"),\n",
1161
+ " (\"Dependence Plot\", \"Feature-value relationships and interactions\"),\n",
1162
+ " (\"Decision Plot\", \"Cumulative prediction paths for multiple samples\")\n",
1163
+ "]\n",
1164
+ "for viz_name, description in visualizations:\n",
1165
+ " print(f\" • {viz_name:30s}: {description}\")\n",
1166
+ "\n",
1167
+ "print(\"\\n7. VAST.AI ENVIRONMENT\")\n",
1168
+ "print(\"-\" * 80)\n",
1169
+ "if torch.cuda.is_available():\n",
1170
+ " print(f\"GPU: {torch.cuda.get_device_name(0)}\")\n",
1171
+ " print(f\"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB\")\n",
1172
+ " print(f\"CUDA Version: {torch.version.cuda}\")\n",
1173
+ "else:\n",
1174
+ " print(\"Running on CPU (GPU not detected)\")\n",
1175
+ "\n",
1176
+ "print(\"\\n\" + \"=\" * 80)\n",
1177
+ "print(\"ANALYSIS COMPLETE\")\n",
1178
+ "print(\"=\" * 80)\n",
1179
+ "print(\"\\nNext steps:\")\n",
1180
+ "print(\" 1. Explore individual predictions using: explain_sample(index)\")\n",
1181
+ "print(\" 2. Try different K values for KNN\")\n",
1182
+ "print(\" 3. Test with other datasets (load_wine, load_iris, etc.)\")\n",
1183
+ "print(\" 4. Compare with other algorithms (Random Forest, SVM, XGBoost)\")\n",
1184
+ "print(\" 5. Use SHAP insights for feature engineering\")\n"
1185
+ ]
1186
+ },
1187
+ {
1188
+ "cell_type": "markdown",
1189
+ "metadata": {},
1190
+ "source": [
1191
+ "## 17. Cleanup and Resource Management"
1192
+ ]
1193
+ },
1194
+ {
1195
+ "cell_type": "code",
1196
+ "execution_count": null,
1197
+ "metadata": {},
1198
+ "outputs": [],
1199
+ "source": [
1200
+ "# Optional: Clear GPU memory when done\n",
1201
+ "# Uncomment the line below to free GPU memory\n",
1202
+ "# clear_gpu_memory()\n",
1203
+ "\n",
1204
+ "print(\"=\" * 80)\n",
1205
+ "print(\"NOTEBOOK COMPLETE\")\n",
1206
+ "print(\"=\" * 80)\n",
1207
+ "print(\"\\nThank you for using this SHAP explainability notebook!\")\n",
1208
+ "print(\"\\nTo clear GPU memory, run: clear_gpu_memory()\")\n",
1209
+ "print(\"To check GPU status, run: print_gpu_memory()\")"
1210
+ ]
1211
+ }
1212
+ ],
1213
+ "metadata": {
1214
+ "kernelspec": {
1215
+ "display_name": "Python 3",
1216
+ "language": "python",
1217
+ "name": "python3"
1218
+ },
1219
+ "language_info": {
1220
+ "codemirror_mode": {
1221
+ "name": "ipython",
1222
+ "version": 3
1223
+ },
1224
+ "file_extension": ".py",
1225
+ "mimetype": "text/x-python",
1226
+ "name": "python",
1227
+ "nbconvert_exporter": "python",
1228
+ "pygments_lexer": "ipython3",
1229
+ "version": "3.8.10"
1230
+ }
1231
+ },
1232
+ "nbformat": 4,
1233
+ "nbformat_minor": 4
1234
+ }
LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025 KNN SHAP Explainability Project
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
README.md ADDED
@@ -0,0 +1,274 @@
1
+ # KNN Explainability with SHAP
2
+
3
+ A comprehensive Jupyter notebook demonstrating how to use SHAP (SHapley Additive exPlanations) to interpret K-Nearest Neighbors (KNN) model predictions with detailed visualizations.
4
+
5
+ ## Overview
6
+
7
+ This project provides a complete walkthrough of:
8
+ - Training a K-Nearest Neighbors classifier on the Breast Cancer Wisconsin dataset
9
+ - Using SHAP to explain model predictions at both global and local levels
10
+ - Creating comprehensive visualizations to understand feature importance and model behavior
11
+ - Interactive exploration of individual predictions
12
+
13
+ ## Features
14
+
15
+ ### Model Training
16
+ - Optimal K value selection through cross-validation
17
+ - StandardScaler preprocessing for KNN optimization
18
+ - Comprehensive model evaluation metrics
19
+ - ROC curves and confusion matrices
20
+
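
The K-selection step listed above can be sketched as follows. This is a minimal illustration, not the notebook's exact code: the search grid (odd K from 1 to 21) and the 5-fold CV are assumptions, and scaling is placed inside a pipeline so each fold is standardized independently.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Evaluate candidate K values with 5-fold cross-validation.
# Scaling inside the pipeline avoids leaking test-fold statistics.
k_values = list(range(1, 22, 2))
cv_scores = [
    cross_val_score(
        make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k)),
        X_train, y_train, cv=5,
    ).mean()
    for k in k_values
]
optimal_k = k_values[int(np.argmax(cv_scores))]
print(f"Optimal K: {optimal_k}")
```

Selecting K on cross-validated accuracy rather than training accuracy matters for KNN, since K=1 always fits the training set perfectly.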
21
+ ### SHAP Explainability
22
+ - **Summary Plots**: Global feature importance with value distributions
23
+ - **Bar Plots**: Mean absolute feature importance rankings
24
+ - **Waterfall Plots**: Step-by-step breakdown of individual predictions
25
+ - **Force Plots**: Visual representation of feature contributions
26
+ - **Dependence Plots**: Feature-value relationships and interactions
27
+ - **Decision Plots**: Cumulative prediction paths for multiple samples
28
+
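
The global importance rankings behind the summary and bar plots are simply mean absolute SHAP values per feature. A self-contained sketch of that aggregation, using a synthetic SHAP array and illustrative feature names (the notebook computes the same statistics from real KernelExplainer output):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-in for one class's SHAP values: shape (n_samples, n_features)
shap_vals = rng.normal(size=(100, 4))
feature_names = ["mean radius", "mean texture", "mean perimeter", "mean area"]

importance = pd.DataFrame({
    "Feature": feature_names,
    "Mean_Abs_SHAP": np.abs(shap_vals).mean(axis=0),  # magnitude -> ranking
    "Mean_SHAP": shap_vals.mean(axis=0),              # sign -> direction of effect
}).sort_values("Mean_Abs_SHAP", ascending=False)

print(importance.to_string(index=False))
```

Mean |SHAP| gives the ranking used in bar plots; the signed mean shows whether a feature, on average, pushes predictions toward one class or the other.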
### Interactive Analysis
- Custom `explain_sample()` function for detailed prediction exploration
- Comparison of correct vs. incorrect predictions
- GPU memory management utilities
- Comprehensive reporting and summaries

## Prerequisites

### Required Packages
```
torch
shap
scikit-learn
matplotlib
seaborn
pandas
numpy
plotly
ipywidgets
tqdm
```

### Environment
- Python 3.7+
- Jupyter Notebook or JupyterLab
- VS Code with the Jupyter extension (recommended)
- GPU support optional (CUDA-enabled PyTorch for faster computation)

## Installation

1. Clone this repository:
```bash
git clone <repository-url>
cd <repository-directory>
```

2. Install the required packages:
```bash
pip install torch shap scikit-learn matplotlib seaborn pandas numpy plotly ipywidgets tqdm
```

3. Launch Jupyter:
```bash
jupyter notebook
```

4. Open `KNN_SHAP_Explainability.ipynb`

## Usage

### Basic Usage

Run the notebook cells sequentially from top to bottom. The notebook is structured in logical sections:

1. **Environment Setup**: GPU verification and package installation
2. **Data Loading**: Load and explore the Breast Cancer Wisconsin dataset
3. **Preprocessing**: Feature scaling and train-test split
4. **Model Training**: KNN optimization and evaluation
5. **SHAP Analysis**: Compute SHAP values for test samples
6. **Visualizations**: Generate all SHAP plots and explanations
7. **Interactive Exploration**: Use custom functions to explore predictions

### Interactive Exploration

After running all cells, use the `explain_sample()` function to explore any prediction:

```python
# Explain the first test sample
explain_sample(0)

# Explain a high-confidence correct prediction
explain_sample(5)

# Explain a misclassified sample
explain_sample(42)
```

### GPU Memory Management

If running on a GPU, monitor and manage memory:

```python
# Check current GPU memory usage
print_gpu_memory()

# Clear the GPU cache
clear_gpu_memory()

# Get the optimal device (CPU or GPU)
device = get_optimal_device()
```

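These helper names come from the notebook; their bodies are not shown here, but a plausible sketch (an assumption, written to degrade gracefully when CUDA or even PyTorch is absent) looks like:

```python
# Assumed implementations of the notebook's GPU helpers; these are
# sketches, not the notebook's exact code. All fall back to CPU.
import gc

def get_optimal_device():
    """Return a CUDA device when available, otherwise the CPU."""
    try:
        import torch
        return torch.device("cuda" if torch.cuda.is_available() else "cpu")
    except ImportError:
        return "cpu"

def clear_gpu_memory():
    """Run the garbage collector and release cached CUDA memory."""
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass

def print_gpu_memory():
    """Report allocated/reserved CUDA memory in MB (no-op on CPU)."""
    try:
        import torch
        if torch.cuda.is_available():
            alloc = torch.cuda.memory_allocated() / 1024 ** 2
            reserved = torch.cuda.memory_reserved() / 1024 ** 2
            print(f"allocated: {alloc:.1f} MB, reserved: {reserved:.1f} MB")
            return
    except ImportError:
        pass
    print("no GPU available")
```
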
## Dataset

The notebook uses the **Breast Cancer Wisconsin (Diagnostic)** dataset from scikit-learn:
- **Samples**: 569
- **Features**: 30 (mean, standard error, and worst values of 10 real-valued features)
- **Classes**: 2 (malignant, benign)
- **Task**: Binary classification

Features include radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.

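Loading and framing the data takes only a few lines with scikit-learn:

```python
# Load the Breast Cancer Wisconsin (Diagnostic) dataset as a DataFrame
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

print(X.shape)            # (569, 30)
print(data.target_names)  # ['malignant' 'benign']
```
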
## Model Performance

The KNN model achieves:
- High accuracy on both training and test sets
- An optimal K value determined through cross-validation
- A ROC AUC score above 0.95 (typical)
- Interpretable predictions through SHAP analysis

## SHAP Visualization Guide

### Summary Plot (Beeswarm)
- Shows global feature importance across all samples
- Color indicates feature value (red = high, blue = low)
- Horizontal position shows impact on the prediction

### Waterfall Plot
- Explains individual predictions step by step
- Starts from the base value (the expected prediction)
- Each bar shows a feature's contribution
- Red pushes toward malignant, blue toward benign

### Dependence Plot
- Reveals non-linear feature relationships
- Shows feature interactions through color
- Identifies threshold effects

### Decision Plot
- Visualizes prediction paths for multiple samples
- Shows the cumulative effect of features
- Helps identify prediction patterns

## Key Insights

1. **Feature Importance**: SHAP identifies the features most critical to the diagnosis
2. **Non-linearity**: Dependence plots reveal complex feature-value relationships
3. **Interactions**: Color gradients show which features interact
4. **Individual Explanations**: Each prediction can be fully explained and understood
5. **Model Trust**: Transparent explanations increase confidence in model decisions

## Customization

### Using Different Datasets

Replace the data loading section with your own dataset:

```python
# Load your dataset
X = pd.DataFrame(your_data)
y = pd.Series(your_labels)

# Continue with the rest of the notebook
```

### Adjusting SHAP Computation

Modify SHAP parameters to trade speed against accuracy:

```python
# Faster computation (less accurate)
shap_values = explainer.shap_values(X_test_sample, nsamples=50)

# More accurate (slower)
shap_values = explainer.shap_values(X_test_sample, nsamples=200)

# Smaller background dataset (faster)
background = shap.kmeans(X_train_scaled, 50)
```

### Trying Other Algorithms

The SHAP approach works with any model:

```python
from sklearn.ensemble import RandomForestClassifier

# Train a Random Forest instead of KNN
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

# Use TreeExplainer for faster computation on tree-based models
explainer = shap.TreeExplainer(model)
```

## Performance Tips

1. **GPU Acceleration**: Use a GPU for faster PyTorch operations
2. **Background Size**: Reduce the background dataset size for faster SHAP computation
3. **Sample Size**: Start with fewer samples (e.g., 50) for quick testing
4. **`nsamples` Parameter**: Lower values speed up computation but reduce accuracy
5. **Memory Management**: Clear the GPU cache between major computations

## Troubleshooting

### Common Issues

**GPU not detected:**
- Check your CUDA installation
- Verify PyTorch GPU support with `torch.cuda.is_available()`
- The notebook will fall back to CPU automatically

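The automatic fallback amounts to a one-line check:

```python
# Prefer CUDA when PyTorch reports a usable GPU, otherwise use CPU
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:  # PyTorch not installed; everything runs on CPU
    device = "cpu"
print(f"using device: {device}")
```
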
**SHAP computation too slow:**
- Reduce the background dataset size
- Decrease the number of test samples
- Lower the `nsamples` parameter

**Memory errors:**
- Process fewer samples at once
- Clear the GPU cache with `clear_gpu_memory()`
- Reduce the background dataset size

**Visualization issues:**
- Ensure the matplotlib backend is compatible
- Update SHAP to the latest version
- Restart the kernel if plots don't render

## Contributing

Contributions are welcome! Feel free to submit pull requests or open issues for bugs, questions, or feature requests.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Acknowledgments

- **SHAP**: Scott Lundberg et al. for the SHAP library
- **scikit-learn**: For the Breast Cancer Wisconsin dataset and ML tools
- **PyTorch**: For GPU acceleration capabilities
- **Community**: All contributors to the open-source ML/AI ecosystem

## References

- Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. NeurIPS.
- SHAP documentation: https://shap.readthedocs.io/
- scikit-learn documentation: https://scikit-learn.org/
- Breast Cancer Wisconsin dataset: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

## Contact

For questions or feedback, please open an issue in the repository.

---

**Note**: This notebook is designed for educational purposes and demonstrates best practices for ML model interpretability using SHAP.