Spaces · Running

Commit c2b5d4f · Parent(s): 939a3b2
feat: Enhanced Feature Engineering module with MathJax and Python code

feature-engineering/index.html · CHANGED · +501 -38
@@ -264,7 +534,8 @@
</div>

<div class="form-group">
- <label for="confidenceLevel" class="form-label">Confidence Level: <span
<input type="range" id="confidenceLevel" min="90" max="99" step="1" value="95" class="form-control" />
</div>
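The confidence-level slider above parameterizes an interval estimate for the dashboard. As a minimal sketch of the computation behind such a control (a normal-approximation confidence interval for a mean; `values` is hypothetical sample data, not taken from the app):

```python
import numpy as np
from scipy import stats

# Hypothetical sample, standing in for the dashboard's selected feature
rng = np.random.default_rng(42)
values = rng.normal(loc=50, scale=10, size=200)

def confidence_interval(data, level=0.95):
    """Normal-approximation CI for the mean at the given confidence level."""
    mean = np.mean(data)
    sem = stats.sem(data)                 # standard error of the mean
    z = stats.norm.ppf(0.5 + level / 2)   # two-sided critical value
    return mean - z * sem, mean + z * sem

low, high = confidence_interval(values, level=0.95)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```

Moving the slider from 90 to 99 widens the interval, since the critical value `z` grows with the confidence level.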
@@ -278,9 +549,57 @@
<canvas id="canvas-eda" width="800" height="500"></canvas>
</div>

- <div class="callout callout--insight">💡 EDA typically takes 30-40% of total project time. Good EDA reveals
-
- <div class="callout callout--

<h3>Use Cases and Applications</h3>
<ul>
@@ -292,19 +611,25 @@
</ul>

<h3>Summary & Key Takeaways</h3>
- <p>Exploratory Data Analysis is the foundation of any successful machine learning project. It combines
<p><strong>Descriptive EDA</strong> answers: "What is happening in the dataset?"<br>
-
<p>Remember: <strong>Data → EDA → Feature Engineering → ML → Deployment</strong></p>
</section>

<!-- =================== 9. FEATURE TRANSFORMATION ==================== -->
<section id="feature-transformation" class="topic-section">
<h2>Feature Transformation</h2>
- <p>Feature transformation creates new representations of data to capture non-linear patterns. Techniques like

<div class="info-card">
- <strong>Real Example:</strong> Predicting house prices with polynomial features (adding x² terms) improves
</div>

<h3>Mathematical Foundations</h3>
@@ -332,9 +657,50 @@
<canvas id="canvas-transformation" width="700" height="350"></canvas>
</div>

- <div class="callout callout--insight">💡 Polynomial features capture curve fitting, but degree=3 on 10 features
- <div class="callout callout--

<h3>Use Cases</h3>
<ul>
@@ -347,10 +713,12 @@
<!-- =================== 10. FEATURE CREATION ========================= -->
<section id="feature-creation" class="topic-section">
<h2>Feature Creation</h2>
- <p>Creating new features from existing ones based on domain knowledge. Interaction terms, ratios, and

<div class="info-card">
- <strong>Real Example:</strong> E-commerce revenue = price × quantity. Profit margin = (selling_price -
</div>
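The e-commerce example above (revenue as an interaction term, margin as a ratio) is a few lines of pandas; the column names here are hypothetical stand-ins:

```python
import pandas as pd

# Hypothetical order data
df = pd.DataFrame({
    'price': [10.0, 25.0, 40.0],
    'quantity': [3, 2, 1],
    'cost_price': [6.0, 15.0, 30.0],
})

# Interaction term: revenue = price × quantity
df['revenue'] = df['price'] * df['quantity']

# Ratio feature: profit margin relative to selling price
df['profit_margin'] = (df['price'] - df['cost_price']) / df['price']

print(df[['revenue', 'profit_margin']])
```

Both derived columns encode domain knowledge the raw columns only express implicitly.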

<h3>Mathematical Foundations</h3>
@@ -379,9 +747,49 @@
<canvas id="canvas-creation" width="700" height="350"></canvas>
</div>

- <div class="callout callout--insight">💡 Interaction terms are especially powerful in linear models - neural
- <div class="callout callout--

<h3>Use Cases</h3>
<ul>
@@ -394,22 +802,24 @@
<!-- ================ 11. DIMENSIONALITY REDUCTION ==================== -->
<section id="dimensionality-reduction" class="topic-section">
<h2>Dimensionality Reduction</h2>
- <p>Reducing the number of features while preserving information. PCA (Principal Component Analysis) projects

<div class="info-card">
- <strong>Real Example:</strong> Image compression and genome analysis with thousands of genes benefit from PCA.
</div>

<h3>PCA Mathematical Foundations</h3>
<div class="info-card">
<strong>Algorithm Steps:</strong><br>
- 1. Standardize data:
- 2. Compute covariance matrix:
- 3. Calculate eigenvalues and eigenvectors<br>
4. Sort eigenvectors by eigenvalues (descending)<br>
- 5. Select top k eigenvectors (principal components)<br>
- 6. Transform:
- <strong>Explained Variance:</strong>
<strong>Cumulative Variance:</strong> Shows total information preserved<br><br>
<strong>Why PCA Works:</strong><br>
• Removes correlated features<br>
@@ -428,9 +838,61 @@
<canvas id="canvas-pca" width="700" height="400"></canvas>
</div>

- <div class="callout callout--insight">💡 PCA is unsupervised - it doesn't use the target variable. First PC
- <div class="callout callout--

<h3>Use Cases</h3>
<ul>
@@ -452,4 +914,5 @@

<script src="app.js" defer></script>
</body>
-
<!DOCTYPE html>
<html lang="en">
+
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Feature Engineering Explorer</title>
+
+ <!-- MathJax for rendering LaTeX formulas -->
+ <script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
+ <script>
+ MathJax = {
+   tex: {
+     inlineMath: [['$', '$'], ['\\(', '\\)']]
+   }
+ };
+ </script>
+ <script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
+
<link rel="stylesheet" href="style.css" />
</head>
+
<body>
<div class="app flex">
<!-- Sidebar Navigation -->

@@ -33,26 +47,49 @@
<!-- ============================ 1. INTRO ============================ -->
<section id="intro" class="topic-section">
<h2>Introduction to Feature Engineering</h2>
+ <p>Feature Engineering is the process of transforming raw data into meaningful inputs that boost machine-learning model performance. A well-crafted feature set can improve accuracy by 10-30% without changing the underlying algorithm.</p>

<div class="info-card">
+ <strong>Key Idea:</strong> 💡 Thoughtful features provide the model with clearer patterns, like lenses sharpening a blurry picture.
</div>

<!-- Canvas Visual -->
<div class="canvas-wrapper">
<canvas id="canvas-intro" width="600" height="280"></canvas>
</div>
+
+ <div class="code-block" style="margin-top: 20px;">
+ <div class="code-header">
+ <span>setup.py - Pandas Basics</span>
+ <button class="copy-btn" onclick="copyCode(this)">Copy</button>
+ </div>
+ <pre><code>import pandas as pd
+ import numpy as np
+
+ # Load the dataset
+ df = pd.read_csv('housing_data.csv')
+
+ # Inspect raw data types and missing values
+ df.info()
+
+ # View summary statistics
+ print(df.describe())</code></pre>
+ </div>
</section>

<!-- ====================== 2. HANDLING MISSING DATA ================== -->
<section id="missing-data" class="topic-section">
<h2>Handling Missing Data</h2>
+ <p>Missing values come in three flavors: MCAR (Missing Completely At Random), MAR (Missing At Random), and MNAR (Missing Not At Random). Each demands different treatment to avoid bias.</p>

<!-- Real-world Example -->
<div class="info-card">
+ <strong>Real Example:</strong> A hospital's patient records often have absent <em>cholesterol</em> values because certain tests were not ordered for healthy young adults.
</div>

<!-- Controls -->

@@ -70,13 +107,49 @@
<!-- Callouts -->
<div class="callout callout--insight">💡 Mean/Median work best when data is MCAR or MAR.</div>
<div class="callout callout--mistake">⚠️ Using mean imputation on skewed data can distort distributions.</div>
+ <div class="callout callout--tip">✅ Always impute <strong>after</strong> splitting into train and test to avoid leakage.</div>
+
+ <div class="info-card" style="margin-top: 20px; border-left-color: #9900ff;">
+ <h3 style="margin-top: 0; color: #9900ff;">🧠 Under the Hood: Imputation Math</h3>
+ <p><strong>KNN Imputation</strong> predicts missing values by finding the $k$ closest neighbors using a distance metric like Euclidean distance. For two samples $x$ and $y$ with $n$ features, ignoring missing dimensions:</p>
+ <div style="background: rgba(0,0,0,0.2); padding: 15px; border-radius: 8px; text-align: center; margin: 15px 0; font-size: 1.1em; color: #e4e6eb;">
+ $$ d(x, y) = \sqrt{\sum_{i=1}^{n} w_i (x_i - y_i)^2} $$
+ </div>
+ <p style="margin-bottom: 0;">Once the $k$ neighbors are found, their values are averaged (or weighted by distance) to fill the missing slot. This preserves local cluster distributions better than global mean imputation.</p>
+ </div>
+
+ <div class="code-block" style="margin-top: 20px;">
+ <div class="code-header">
+ <span>missing_data.py - Scikit-Learn Imputers</span>
+ <button class="copy-btn" onclick="copyCode(this)">Copy</button>
+ </div>
+ <pre><code>from sklearn.impute import SimpleImputer, KNNImputer
+
+ # 1. Simple Imputation (Mean/Median/Most Frequent)
+ # Good for MCAR (Missing Completely At Random)
+ mean_imputer = SimpleImputer(strategy='mean')
+ df['age_imputed'] = mean_imputer.fit_transform(df[['age']])
+
+ # 2. KNN Imputation (Distance-based)
+ # Good for MAR (Missing At Random) when variables are correlated
+ knn_imputer = KNNImputer(n_neighbors=5, weights='distance')
+ df_imputed = knn_imputer.fit_transform(df)
+
+ # Note: Tree-based models like XGBoost can handle NaNs natively!</code></pre>
+ </div>
</section>

<!-- ======================= 3. HANDLING OUTLIERS ===================== -->
<section id="outliers" class="topic-section">
<h2>Handling Outliers</h2>
+ <p>Outliers are data points that deviate markedly from others. Detecting and treating them prevents skewed models.</p>

<div class="form-group">
<button id="btn-detect-iqr" class="btn btn--primary">IQR Method</button>

@@ -90,6 +163,40 @@
<div class="callout callout--insight">💡 The IQR method is robust to non-normal data.</div>
<div class="callout callout--mistake">⚠️ Removing legitimate extreme values can erase important signals.</div>
+
+ <div class="info-card" style="margin-top: 20px; border-left-color: #9900ff;">
+ <h3 style="margin-top: 0; color: #9900ff;">🧠 Under the Hood: Outlier Math</h3>
+ <p><strong>Z-Score</strong> measures how many standard deviations $\sigma$ a point is from the mean $\mu$. It assumes the data is normally distributed:</p>
+ <div style="background: rgba(0,0,0,0.2); padding: 15px; border-radius: 8px; text-align: center; margin: 15px 0; font-size: 1.1em; color: #e4e6eb;">
+ $$ z = \frac{x - \mu}{\sigma} \quad \text{(Threshold: } |z| > 3 \text{)} $$
+ </div>
+ <p style="margin-bottom: 0;"><strong>IQR (Interquartile Range)</strong> is non-parametric. It defines fences based on the 25th ($Q1$) and 75th ($Q3$) percentiles: $[Q1 - 1.5 \times \text{IQR},\ Q3 + 1.5 \times \text{IQR}]$. <em>Winsorization</em> caps values at these percentiles instead of dropping them.</p>
+ </div>
+
+ <div class="code-block" style="margin-top: 20px;">
+ <div class="code-header">
+ <span>outliers.py - Z-Score and Winsorization</span>
+ <button class="copy-btn" onclick="copyCode(this)">Copy</button>
+ </div>
+ <pre><code>import numpy as np
+ from scipy import stats
+
+ # 1. Z-Score Method (Dropping Outliers)
+ z_scores = np.abs(stats.zscore(df['income']))
+ # Keep only rows where z-score is less than 3
+ df_clean = df[z_scores < 3]
+
+ # 2. Winsorization (Percentile Capping)
+ # Capping at 5th and 95th percentiles to retain data points
+ lower_limit = df['income'].quantile(0.05)
+ upper_limit = df['income'].quantile(0.95)
+
+ df['income_capped'] = np.clip(df['income'], lower_limit, upper_limit)</code></pre>
+ </div>
</section>

<!-- ========================== 4. SCALING ============================ -->

@@ -106,6 +213,43 @@
<div class="canvas-wrapper">
<canvas id="canvas-scaling" width="600" height="300"></canvas>
</div>
+
+ <div class="info-card" style="margin-top: 20px; border-left-color: #9900ff;">
+ <h3 style="margin-top: 0; color: #9900ff;">🧠 Under the Hood: Scaling Math</h3>
+ <p><strong>Min-Max Scaling (Normalization)</strong> scales data to a fixed range, usually $[0, 1]$:</p>
+ <div style="background: rgba(0,0,0,0.2); padding: 15px; border-radius: 8px; text-align: center; margin: 15px 0; font-size: 1.1em; color: #e4e6eb;">
+ $$ X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}} $$
+ </div>
+ <p><strong>Standardization (Z-Score Scaling)</strong> centers the data around a mean of 0 with a standard deviation of 1. It does not bound data to a specific range, handling outliers better than Min-Max:</p>
+ <div style="background: rgba(0,0,0,0.2); padding: 15px; border-radius: 8px; text-align: center; margin: 15px 0; font-size: 1.1em; color: #e4e6eb;">
+ $$ X_{std} = \frac{X - \mu}{\sigma} $$
+ </div>
+ <p style="margin-bottom: 0;"><strong>Robust Scaling</strong> uses statistics that are robust to outliers, like the median and Interquartile Range (IQR): $X_{robust} = \frac{X - \text{median}}{Q3 - Q1}$.</p>
+ </div>
+
+ <div class="code-block" style="margin-top: 20px;">
+ <div class="code-header">
+ <span>scaling.py - Scikit-Learn Scalers</span>
+ <button class="copy-btn" onclick="copyCode(this)">Copy</button>
+ </div>
+ <pre><code>from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
+
+ # 1. Min-Max Scaler (Best for Neural Networks/Images)
+ minmax = MinMaxScaler()
+ df[['age_minmax', 'income_minmax']] = minmax.fit_transform(df[['age', 'income']])
+
+ # 2. Standard Scaler (Best for PCA, SVM, Logistic Regression)
+ standard = StandardScaler()
+ df_scaled = standard.fit_transform(df)
+
+ # 3. Robust Scaler (Best when dataset has many outliers)
+ robust = RobustScaler()
+ df_robust = robust.fit_transform(df)</code></pre>
+ </div>
</section>

<!-- ========================== 5. ENCODING =========================== -->

@@ -122,6 +266,44 @@
<div class="canvas-wrapper">
<canvas id="canvas-encoding" width="600" height="300"></canvas>
</div>
+
+ <div class="info-card" style="margin-top: 20px; border-left-color: #9900ff;">
+ <h3 style="margin-top: 0; color: #9900ff;">🧠 Under the Hood: Target Encoding Math</h3>
+ <p><strong>One-Hot Encoding</strong> creates $N$ sparse binary columns for $N$ categories, which can cause the "Curse of Dimensionality" for high-cardinality features.</p>
+ <p><strong>Target Encoding</strong> replaces a categorical value with the average target value for that category. To prevent overfitting (especially on rare categories), a <em>Bayesian Smoothing</em> average is applied:</p>
+ <div style="background: rgba(0,0,0,0.2); padding: 15px; border-radius: 8px; text-align: center; margin: 15px 0; font-size: 1.1em; color: #e4e6eb;">
+ $$ S = \lambda \cdot \bar{y}_{cat} + (1 - \lambda) \cdot \bar{y}_{global} $$
+ </div>
+ <p style="margin-bottom: 0;">Where $\bar{y}_{cat}$ is the mean of the target for the specific category, $\bar{y}_{global}$ is the global target mean, and $\lambda$ is a weight between 0 and 1 determined by the category's frequency.</p>
+ </div>
+
+ <div class="code-block" style="margin-top: 20px;">
+ <div class="code-header">
+ <span>encoding.py - Category Encoders</span>
+ <button class="copy-btn" onclick="copyCode(this)">Copy</button>
+ </div>
+ <pre><code>import pandas as pd
+ from sklearn.preprocessing import OneHotEncoder
+ from category_encoders import TargetEncoder
+
+ # 1. One-Hot Encoding (Best for nominal variables with few categories)
+ ohe = OneHotEncoder(sparse_output=False, drop='first') # drop='first' avoids multicollinearity
+ color_encoded = ohe.fit_transform(df[['color']])
+
+ # Pandas alternative (easy but not ideal for pipelines):
+ # pd.get_dummies(df, columns=['color'], drop_first=True)
+
+ # 2. Target Encoding (Best for high-cardinality nominal variables like zipcodes)
+ # Requires 'category_encoders' library
+ te = TargetEncoder(smoothing=10) # Higher smoothing pulls estimates closer to global mean
+ df['zipcode_encoded'] = te.fit_transform(df['zipcode'], df['target'])</code></pre>
+ </div>
</section>

<!-- ===================== 6. FEATURE SELECTION ======================= -->

@@ -138,6 +320,47 @@
<div class="canvas-wrapper">
<canvas id="canvas-selection" width="600" height="300"></canvas>
</div>
+
+ <div class="info-card" style="margin-top: 20px; border-left-color: #9900ff;">
+ <h3 style="margin-top: 0; color: #9900ff;">🧠 Under the Hood: Selection Math</h3>
+ <p>Feature selection can be filter-based, wrapper-based, or intrinsic.</p>
+ <p><strong>Filter Method (ANOVA F-Value):</strong> Scikit-Learn's <code>f_classif</code> computes the ANOVA F-value between numerical features and a categorical target. The F-statistic measures the ratio of variance <em>between</em> groups to the variance <em>within</em> groups:</p>
+ <div style="background: rgba(0,0,0,0.2); padding: 15px; border-radius: 8px; text-align: center; margin: 15px 0; font-size: 1.1em; color: #e4e6eb;">
+ $$ F = \frac{\text{Between-group variability}}{\text{Within-group variability}} $$
+ </div>
+ <p style="margin-bottom: 0;"><strong>Wrapper Method (RFE):</strong> Recursive Feature Elimination fits a model (e.g., Logistic Regression or Random Forest), ranks features by importance coefficients, drops the weakest feature, and repeats until the desired $N$ features remain.</p>
+ </div>
+
+ <div class="code-block" style="margin-top: 20px;">
+ <div class="code-header">
+ <span>selection.py - Feature Selection</span>
+ <button class="copy-btn" onclick="copyCode(this)">Copy</button>
+ </div>
+ <pre><code>from sklearn.feature_selection import SelectKBest, f_classif, RFE
+ from sklearn.linear_model import LogisticRegression
+
+ X = df.drop('target', axis=1)
+ y = df['target']
+
+ # 1. Filter Method: SelectKBest (ANOVA F-value)
+ # Keeps the 5 features with the highest ANOVA F-scores
+ selector = SelectKBest(score_func=f_classif, k=5)
+ X_top_5 = selector.fit_transform(X, y)
+ selected_columns = X.columns[selector.get_support()]
+
+ # 2. Wrapper Method: Recursive Feature Elimination (RFE)
+ # Fits the estimator repeatedly, dropping the weakest feature each round
+ estimator = LogisticRegression()
+ rfe = RFE(estimator, n_features_to_select=5, step=1)
+ X_rfe = rfe.fit_transform(X, y)
+ rfe_columns = X.columns[rfe.support_]</code></pre>
+ </div>
</section>

<!-- =================== 7. IMBALANCED DATA =========================== -->

@@ -154,12 +377,56 @@
| 377 |
<div class="canvas-wrapper">
|
| 378 |
<canvas id="canvas-imbalanced" width="600" height="300"></canvas>
|
| 379 |
</div>
|
| 380 |
+
|
| 381 |
+
<div class="info-card" style="margin-top: 20px; border-left-color: #9900ff;">
|
| 382 |
+
<h3 style="margin-top: 0; color: #9900ff;">🧠 Under the Hood: SMOTE Math</h3>
|
| 383 |
+
<p><strong>SMOTE (Synthetic Minority Over-sampling Technique)</strong> doesn't just duplicate data (like
|
| 384 |
+
Random Over-Sampling). It creates novel synthetic examples by interpolating between existing minority
|
| 385 |
+
instances.</p>
|
| 386 |
+
<p>For a minority class point $x_i$, SMOTE finds its $k$-nearest minority neighbors. It picks one neighbor
|
| 387 |
+
$x_{zi}$ and generates a synthetic point $x_{new}$ along the line segment joining them:</p>
|
| 388 |
+
<div
|
| 389 |
+
style="background: rgba(0,0,0,0.2); padding: 15px; border-radius: 8px; text-align: center; margin: 15px 0; font-size: 1.1em; color: #e4e6eb;">
|
| 390 |
+
$$ x_{new} = x_i + \lambda \times (x_{zi} - x_i) $$
|
| 391 |
+
</div>
|
| 392 |
+
<p style="margin-bottom: 0;">Where $\lambda$ is a random number between 0 and 1. This creates a denser, more
|
| 393 |
+
generalized decision region for the minority class.</p>
|
| 394 |
+
</div>
|
| 395 |
+
|
| 396 |
+
<div class="code-block" style="margin-top: 20px;">
|
| 397 |
+
<div class="code-header">
|
| 398 |
+
<span>imbalanced.py - Imblearn Resampling</span>
|
| 399 |
+
<button class="copy-btn" onclick="copyCode(this)">Copy</button>
|
| 400 |
+
</div>
|
| 401 |
+
<pre><code>from imblearn.over_sampling import SMOTE
|
| 402 |
+
from imblearn.under_sampling import RandomUnderSampler
|
| 403 |
+
from imblearn.pipeline import Pipeline
|
| 404 |
+
|
| 405 |
+
# 1. SMOTE (Over-sampling the minority class)
|
| 406 |
+
smote = SMOTE(sampling_strategy='auto', k_neighbors=5, random_state=42)
|
| 407 |
+
X_smote, y_smote = smote.fit_resample(X, y)
|
| 408 |
+
|
| 409 |
+
# 2. Random Under-Sampling (Reducing the majority class)
|
| 410 |
+
rus = RandomUnderSampler(sampling_strategy='auto', random_state=42)
|
| 411 |
+
X_rus, y_rus = rus.fit_resample(X, y)
|
| 412 |
+
|
| 413 |
+
# 3. Best Practice Pipeline: Under-sample majority THEN SMOTE minority
|
| 414 |
+
# Prevents creating too many synthetic points if the imbalance is extreme
|
| 415 |
+
resample_pipe = Pipeline([
|
| 416 |
+
('rus', RandomUnderSampler(sampling_strategy=0.1)), # Reduce majority until minority is 10%
|
| 417 |
+
('smote', SMOTE(sampling_strategy=0.5)) # SMOTE minority until it's 50%
|
| 418 |
+
])
|
| 419 |
+
X_resampled, y_resampled = resample_pipe.fit_resample(X, y)</code></pre>
|
| 420 |
+
</div>
|
| 421 |
</section>

<!-- ========================== 8. EDA ================================ -->
<section id="eda" class="topic-section">
<h2>Exploratory Data Analysis (EDA)</h2>
<p><strong>Exploratory Data Analysis (EDA)</strong> is a critical step in the machine learning pipeline that
comes BEFORE feature engineering. EDA helps you understand your data, discover patterns, identify anomalies,
detect outliers, test hypotheses, and check assumptions through summary statistics and graphical
representations.</p>

<div class="info-card">
<strong>Key Questions EDA Answers:</strong>
</div>

<div class="info-card">
<strong>Real-World Example:</strong> Imagine you're analyzing customer data for a bank to predict loan
defaults. EDA helps you understand:
<ul>
<li>Age distribution of customers (histogram)</li>
<li>Income levels (box plot for outliers)</li>
</ul>
</div>

<h4>2. Inferential Statistics</h4>
<p><strong>Purpose:</strong> Make inferences or generalizations about the population from the sample</p>
<p><strong>Key Question:</strong> Can we claim this effect exists in the larger population, or is it just by
chance?</p>

<div class="info-card">
<strong>A. Hypothesis Testing:</strong><br>
<strong>3. Analyze Distributions:</strong> Histograms, count plots, box plots<br>
<strong>4. Check for Imbalance:</strong> Count target classes, plot distribution<br>
<strong>5. Correlation Analysis:</strong> Correlation matrix, heatmap, identify multicollinearity<br>
<strong>6. Statistical Testing:</strong> Compare groups (t-test, ANOVA), test assumptions, calculate effect
sizes
</div>
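<p>Step 6 of the workflow can be illustrated with <code>scipy.stats</code>. The sample data below is fabricated for the bank-default example (the group means, spreads, and sizes are assumptions, not real figures):</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical annual incomes for the two groups in the loan-default example
defaulters = rng.normal(40_000, 8_000, size=200)
non_defaulters = rng.normal(48_000, 9_000, size=800)

# Welch's two-sample t-test (does not assume equal variances)
t_stat, p_value = stats.ttest_ind(defaulters, non_defaulters, equal_var=False)

# Effect size: Cohen's d with a pooled standard deviation
pooled_sd = np.sqrt((defaulters.var(ddof=1) + non_defaulters.var(ddof=1)) / 2)
cohens_d = (defaulters.mean() - non_defaulters.mean()) / pooled_sd

print(f"t = {t_stat:.2f}, p = {p_value:.2g}, d = {cohens_d:.2f}")
# A p-value below 0.05 means we reject H0 (that the two mean incomes are equal)
```

<p>Reporting the effect size alongside the p-value shows whether a "significant" difference is also practically meaningful.</p>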

<h3>Interactive EDA Dashboard</h3>
</div>

<div class="form-group">
<label for="confidenceLevel" class="form-label">Confidence Level: <span id="confidenceValue">95</span>%</label>
<input type="range" id="confidenceLevel" min="90" max="99" step="1" value="95" class="form-control" />
</div>

<div class="canvas-wrapper">
<canvas id="canvas-eda" width="800" height="500"></canvas>
</div>

<div class="callout callout--insight">💡 EDA typically takes 30-40% of total project time. Good EDA reveals
which features to engineer.</div>
<div class="callout callout--mistake">⚠️ Common Mistakes: Skipping EDA, not checking outliers before scaling,
ignoring missing value patterns, overlooking class imbalance, ignoring multicollinearity.</div>
<div class="callout callout--tip">✅ Best Practices: ALWAYS start with EDA, visualize EVERY feature, check
correlations with target, document insights, use both descriptive and inferential statistics.</div>

<div class="info-card" style="margin-top: 20px; border-left-color: #9900ff;">
<h3 style="margin-top: 0; color: #9900ff;">🧠 Under the Hood: Skewness & Kurtosis</h3>
<p>Beyond the mean and variance, we can examine the shape of a distribution using the 3rd and 4th
statistical moments.</p>
<p><strong>Skewness ($s$)</strong> measures asymmetry. Positive means right-tailed, negative means
left-tailed:</p>
<div
style="background: rgba(0,0,0,0.2); padding: 15px; border-radius: 8px; text-align: center; margin: 15px 0; font-size: 1.1em; color: #e4e6eb;">
$$ s = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{\sigma^3} $$
</div>
<p><strong>Kurtosis ($k$)</strong> measures "tailedness" (the presence of outliers). A normal distribution has a
kurtosis of 3. High kurtosis means heavy tails:</p>
<div
style="background: rgba(0,0,0,0.2); padding: 15px; border-radius: 8px; text-align: center; margin: 15px 0; font-size: 1.1em; color: #e4e6eb;">
$$ k = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^4}{\sigma^4} $$
</div>
</div>

<div class="code-block" style="margin-top: 20px;">
<div class="code-header">
<span>eda.py - Automated & Visual EDA</span>
<button class="copy-btn" onclick="copyCode(this)">Copy</button>
</div>
<pre><code>import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# 1. Descriptive statistics, skewness, and missing-value counts
print(df.describe().T)
print("Skewness:\n", df.skew(numeric_only=True))
print("\nMissing Values:\n", df.isnull().sum())

# 2. Visual distributions (pairplot)
# KDE plots on the diagonal, scatter plots for every pairwise relationship
sns.pairplot(df, hue='target_class', diag_kind='kde', corner=True)
plt.show()

# 3. Correlation heatmap
plt.figure(figsize=(10, 8))
# Spearman is rank-based: robust to outliers and captures monotonic (not just linear) relationships
corr_matrix = df.corr(method='spearman', numeric_only=True)
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title("Spearman Correlation Heatmap")
plt.show()</code></pre>
</div>
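<p>The moment formulas above can also be checked directly with NumPy. Note that pandas' <code>df.skew()</code> uses a bias-corrected sample estimator, so its values differ slightly from these population-moment versions:</p>

```python
import numpy as np

def skewness(x):
    x = np.asarray(x, dtype=float)
    s = x.std()                                    # population standard deviation
    return np.mean((x - x.mean()) ** 3) / s ** 3   # 3rd standardized moment

def kurtosis(x):
    x = np.asarray(x, dtype=float)
    s = x.std()
    return np.mean((x - x.mean()) ** 4) / s ** 4   # 4th standardized moment, ~3 for a Gaussian

rng = np.random.default_rng(1)
normal_data = rng.normal(size=100_000)         # symmetric, light tails
skewed_data = rng.exponential(size=100_000)    # right-tailed; true skewness = 2

print(f"normal: s = {skewness(normal_data):.2f}, k = {kurtosis(normal_data):.2f}")
print(f"exponential: s = {skewness(skewed_data):.2f}")
```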

<h3>Use Cases and Applications</h3>
<ul>
</ul>

<h3>Summary & Key Takeaways</h3>
<p>Exploratory Data Analysis is the foundation of any successful machine learning project. It combines
<strong>descriptive statistics</strong> (mean, median, variance, correlation) with <strong>inferential
statistics</strong> (hypothesis testing, confidence intervals) to understand data deeply.</p>
<p><strong>Descriptive EDA</strong> answers: "What is happening in the dataset?"<br>
<strong>Inferential EDA</strong> answers: "Can we claim this effect exists in the larger population?"</p>
<p>Remember: <strong>Data → EDA → Feature Engineering → ML → Deployment</strong></p>
</section>

<!-- =================== 9. FEATURE TRANSFORMATION ==================== -->
<section id="feature-transformation" class="topic-section">
<h2>Feature Transformation</h2>
<p>Feature transformation creates new representations of data to capture non-linear patterns. Techniques like
polynomial features, binning, and mathematical transformations unlock hidden relationships.</p>

<div class="info-card">
<strong>Real Example:</strong> Predicting house prices with polynomial features (adding x² terms) improves
model fit for non-linear relationships between square footage and price.
</div>

<h3>Mathematical Foundations</h3>

<div class="canvas-wrapper">
<canvas id="canvas-transformation" width="700" height="350"></canvas>
</div>

<div class="callout callout--insight">💡 Polynomial features capture curve fitting, but degree=3 on 10 features
creates 286 features!</div>
<div class="callout callout--mistake">⚠️ Always scale features after polynomial transformation to prevent
magnitude issues.</div>
<div class="callout callout--tip">✅ Start with degree=2 and visualize distributions before/after transformation.</div>

<div class="info-card" style="margin-top: 20px; border-left-color: #9900ff;">
<h3 style="margin-top: 0; color: #9900ff;">🧠 Under the Hood: Power Transforms</h3>
<p>When a log transformation $\ln(1+x)$ isn't enough to fix severe skewness, we use parametric power
transformations like <strong>Box-Cox</strong> (requires $x > 0$) or <strong>Yeo-Johnson</strong> (supports
negative values). They automatically find the optimal $\lambda$ parameter using Maximum Likelihood
Estimation.</p>
<p><strong>Box-Cox Transformation Formula:</strong></p>
<div
style="background: rgba(0,0,0,0.2); padding: 15px; border-radius: 8px; text-align: center; margin: 15px 0; font-size: 1.1em; color: #e4e6eb;">
$$ x^{(\lambda)} = \begin{cases} \frac{x^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \ln(x) & \text{if } \lambda = 0 \end{cases} $$
</div>
<p style="margin-bottom: 0;">These transforms stretch and compress the variable to map it as closely to a
Gaussian (normal) distribution as mathematically possible.</p>
</div>

<div class="code-block" style="margin-top: 20px;">
<div class="code-header">
<span>transformation.py - Power Transforms & Binning</span>
<button class="copy-btn" onclick="copyCode(this)">Copy</button>
</div>
<pre><code>import numpy as np
from sklearn.preprocessing import PowerTransformer, KBinsDiscretizer

# 1. Power transformation (Yeo-Johnson)
# Attempts to map a skewed feature to a Gaussian distribution
pt = PowerTransformer(method='yeo-johnson', standardize=True)
df['income_gaussian'] = pt.fit_transform(df[['income']])

# 2. Log transformation (np.log1p computes log(1 + x), so zeros are handled safely)
df['revenue_log'] = np.log1p(df['revenue'])

# 3. Discretization / binning
# Converts continuous age into 5 ordinal bins; strategy='quantile' gives equal-frequency bins
binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
df['age_group'] = binner.fit_transform(df[['age']])</code></pre>
</div>
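<p>The $\lambda$ search can also be done with <code>scipy.stats.boxcox</code>, which returns the fitted parameter alongside the transformed values. The income data here is simulated (log-normal, so the optimal $\lambda$ should come out near 0, i.e. a plain log transform):</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Simulated right-skewed income data (log-normal, strictly positive as Box-Cox requires)
income = rng.lognormal(mean=10, sigma=0.8, size=5_000)

# scipy searches for the lambda that maximizes the log-likelihood of normality
transformed, fitted_lambda = stats.boxcox(income)

print(f"fitted lambda = {fitted_lambda:.3f}")
print(f"skew before = {stats.skew(income):.2f}, after = {stats.skew(transformed):.2f}")
```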

<h3>Use Cases</h3>
<ul>
</ul>
</section>

<!-- =================== 10. FEATURE CREATION ========================= -->
<section id="feature-creation" class="topic-section">
<h2>Feature Creation</h2>
<p>Creating new features from existing ones based on domain knowledge. Interaction terms, ratios, and
domain-specific calculations enhance model performance.</p>

<div class="info-card">
<strong>Real Example:</strong> E-commerce revenue = price × quantity. Profit margin = (selling_price -
cost_price) / cost_price. These derived features often have stronger predictive power than raw features.
</div>
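<p>These ratio features are one-liners in pandas. The column names and toy values below are assumptions for the e-commerce example:</p>

```python
import pandas as pd

df = pd.DataFrame({
    'price': [10.0, 25.0, 40.0],
    'quantity': [3, 2, 5],
    'selling_price': [12.0, 30.0, 44.0],
    'cost_price': [10.0, 25.0, 40.0],
})

# Derived features from domain knowledge
df['revenue'] = df['price'] * df['quantity']
df['profit_margin'] = (df['selling_price'] - df['cost_price']) / df['cost_price']

print(df[['revenue', 'profit_margin']])
```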

<h3>Mathematical Foundations</h3>

<div class="canvas-wrapper">
<canvas id="canvas-creation" width="700" height="350"></canvas>
</div>

<div class="callout callout--insight">💡 Interaction terms are especially powerful in linear models - neural
networks learn them automatically.</div>
<div class="callout callout--mistake">⚠️ Creating features without domain knowledge leads to meaningless
combinations.</div>
<div class="callout callout--tip">✅ Always check correlation between new and existing features to avoid
redundancy.</div>

<div class="info-card" style="margin-top: 20px; border-left-color: #9900ff;">
<h3 style="margin-top: 0; color: #9900ff;">🧠 Under the Hood: Polynomial Combinations</h3>
<p>Scikit-Learn's <code>PolynomialFeatures</code> generates a new feature matrix consisting of all polynomial
combinations of the features with degree less than or equal to the specified degree.</p>
<p>For two features $X = [x_1, x_2]$ and a degree of 2, the expanded polynomial vector is:</p>
<div
style="background: rgba(0,0,0,0.2); padding: 15px; border-radius: 8px; text-align: center; margin: 15px 0; font-size: 1.1em; color: #e4e6eb;">
$$ [1,\; x_1,\; x_2,\; x_1^2,\; x_1 \cdot x_2,\; x_2^2] $$
</div>
<p style="margin-bottom: 0;">Notice the $x_1 \cdot x_2$ term. This is an <strong>interaction term</strong>,
which lets a linear model learn conditional relationships (e.g., "if $x_1$ is high, the effect of $x_2$
changes").</p>
</div>

<div class="code-block" style="margin-top: 20px;">
<div class="code-header">
<span>creation.py - Automated Polynomial Features</span>
<button class="copy-btn" onclick="copyCode(this)">Copy</button>
</div>
<pre><code>from sklearn.preprocessing import PolynomialFeatures
import pandas as pd

# Assume df has two features: 'length' and 'width'
X = df[['length', 'width']]

# Create polynomial and interaction features up to degree 2
# include_bias=False prevents adding a column of 1s (the intercept)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Get the names of the new features (e.g., 'length^2', 'length width')
feature_names = poly.get_feature_names_out(['length', 'width'])
df_poly = pd.DataFrame(X_poly, columns=feature_names)

print(df_poly.head())</code></pre>
</div>

<h3>Use Cases</h3>
<ul>
</ul>
</section>

<!-- ================ 11. DIMENSIONALITY REDUCTION ==================== -->
<section id="dimensionality-reduction" class="topic-section">
<h2>Dimensionality Reduction</h2>
<p>Reducing the number of features while preserving information. PCA (Principal Component Analysis) projects
high-dimensional data onto lower dimensions by finding directions of maximum variance.</p>

<div class="info-card">
<strong>Real Example:</strong> Image compression and genome analysis with thousands of genes benefit from PCA.
The first 2-3 principal components often capture 80%+ of the variance.
</div>

<h3>PCA Mathematical Foundations</h3>
<div class="info-card">
<strong>Algorithm Steps:</strong><br>
1. Standardize data: $X_{scaled} = \frac{X - \mu}{\sigma}$<br>
2. Compute covariance matrix: $\Sigma = \frac{1}{n-1} X^T X$<br>
3. Calculate eigenvalues $\lambda$ and eigenvectors $v$<br>
4. Sort eigenvectors by eigenvalues (descending)<br>
5. Select top $k$ eigenvectors (principal components)<br>
6. Transform: $X_{new} = X \times v_k$<br><br>
<strong>Explained Variance:</strong> $\frac{\lambda_i}{\sum_j \lambda_j}$<br>
<strong>Cumulative Variance:</strong> Shows total information preserved<br><br>
<strong>Why PCA Works:</strong><br>
• Removes correlated features<br>
</div>

<div class="canvas-wrapper">
<canvas id="canvas-pca" width="700" height="400"></canvas>
</div>

<div class="callout callout--insight">💡 PCA is unsupervised - it doesn't use the target variable. The first PC
always captures the most variance.</div>
<div class="callout callout--mistake">⚠️ Not standardizing before PCA is a critical error - features with large
scales will dominate.</div>
<div class="callout callout--tip">✅ Aim for 95% cumulative explained variance when choosing the number of
components.</div>

<div class="info-card" style="margin-top: 20px; border-left-color: #9900ff;">
<h3 style="margin-top: 0; color: #9900ff;">🧠 Under the Hood: PCA Math</h3>
<p>PCA finds the directions (principal components) that maximize the variance of the data. Mathematically, it
works by computing the covariance matrix of the standardized dataset $X$:</p>
<div
style="background: rgba(0,0,0,0.2); padding: 15px; border-radius: 8px; text-align: center; margin: 15px 0; font-size: 1.1em; color: #e4e6eb;">
$$ \Sigma = \frac{1}{n-1} X^T X $$
</div>
<p>Then we solve the eigenvalue problem $\Sigma v = \lambda v$ for the eigenvectors $v$ and eigenvalues $\lambda$.</p>
<ul style="margin-top: 10px; margin-bottom: 0;">
<li><strong>Eigenvectors</strong> ($v_i$) are the axes of the new feature space (the directions).</li>
<li><strong>Eigenvalues</strong> ($\lambda_i$) represent the amount of variance captured along each vector.</li>
</ul>
</div>

<div class="code-block" style="margin-top: 20px;">
<div class="code-header">
<span>pca.py - Principal Component Analysis</span>
<button class="copy-btn" onclick="copyCode(this)">Copy</button>
</div>
<pre><code>from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np

# 1. ALWAYS scale data before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Fit PCA without specifying components to inspect all the variance
pca_full = PCA()
pca_full.fit(X_scaled)

# 3. Plot cumulative explained variance
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)
plt.plot(cumulative_variance, marker='o')
plt.axhline(y=0.95, color='r', linestyle='--')  # 95% threshold
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.show()

# 4. Apply PCA retaining 95% of the variance
# A float between 0 and 1 selects enough components to cover that fraction of variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print(f"Reduced from {X.shape[1]} to {X_pca.shape[1]} features.")</code></pre>
</div>
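<p>To demystify what <code>PCA</code> does internally, the six algorithm steps can be reproduced with plain NumPy on simulated data (the correlated columns below are an assumption for illustration). The eigendecomposition of the covariance matrix gives the same components as scikit-learn, up to sign:</p>

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * rng.normal(size=500)   # make two columns strongly correlated

# Step 1: standardize
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix, Sigma = X^T X / (n - 1)
cov = X_scaled.T @ X_scaled / (len(X_scaled) - 1)

# Steps 3-4: eigendecomposition, sorted by descending eigenvalue
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh is for symmetric matrices
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Steps 5-6: explained-variance ratios and projection onto the top-2 components
explained_ratio = eigvals / eigvals.sum()
X_proj = X_scaled @ eigvecs[:, :2]
print(explained_ratio.round(3))
```

<p>With one near-duplicate pair among four standardized features, the first component absorbs roughly half the total variance, which is exactly why PCA "removes correlated features".</p>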

<h3>Use Cases</h3>
<ul>
</ul>
</section>

<script src="app.js" defer></script>
</body>

</html>