Machine Learning: The Ultimate Learning Platform
Master ML through Supervised, Unsupervised & Reinforcement Learning
Complete with step-by-step mathematical solutions, interactive visualizations, and real-world examples
📊 Supervised - Regression Linear Regression
Linear Regression is one of the simplest and most powerful techniques for predicting continuous values. It finds the "best fit line" through data points.
Understanding Linear Regression
Think of it like this: You want to predict house prices based on size. If you plot size vs. price on a graph, you'll see points scattered around. Linear regression draws the "best" line through these points that you can use to predict prices for houses of any size.
y = mx + c

where:
y = predicted value (output)
x = input feature
m = slope (how steep the line is)
c = intercept (where line crosses y-axis)
Example: Predicting Salary from Experience
We can find a line (y = 7.5x + 32) that predicts: Someone with 7 years experience will earn approximately $84.5k.
Figure 1: Scatter plot showing experience vs. salary with the best fit line
📐 Complete Mathematical Derivation
Let's solve this step-by-step with actual numbers using our salary data!
Our data points (Experience x, Salary y):
(1, 39.8), (2, 48.9), (3, 57.0), (4, 68.3), (5, 77.9), (6, 85.0)
Number of data points: n = 6
Mean of x (x̄):
x̄ = (x₁ + x₂ + x₃ + x₄ + x₅ + x₆) / n
x̄ = (1 + 2 + 3 + 4 + 5 + 6) / 6
x̄ = 21 / 6
x̄ = 3.5

Mean of y (ȳ):
ȳ = (y₁ + y₂ + y₃ + y₄ + y₅ + y₆) / n
ȳ = (39.8 + 48.9 + 57.0 + 68.3 + 77.9 + 85.0) / 6
ȳ = 376.9 / 6
ȳ ≈ 62.82
Formula for slope:
m = Σ[(xᵢ - x̄)(yᵢ - ȳ)] / Σ[(xᵢ - x̄)²]

Calculate numerator (sum of products of deviations):
| xᵢ | yᵢ | xᵢ - x̄ | yᵢ - ȳ | (xᵢ - x̄)(yᵢ - ȳ) | (xᵢ - x̄)² |
|---|---|---|---|---|---|
| 1 | 39.8 | -2.5 | -23.02 | 57.54 | 6.25 |
| 2 | 48.9 | -1.5 | -13.92 | 20.88 | 2.25 |
| 3 | 57.0 | -0.5 | -5.82 | 2.91 | 0.25 |
| 4 | 68.3 | 0.5 | 5.48 | 2.74 | 0.25 |
| 5 | 77.9 | 1.5 | 15.08 | 22.62 | 2.25 |
| 6 | 85.0 | 2.5 | 22.18 | 55.46 | 6.25 |
| Sum: | | | | 162.15 | 17.50 |
m = 162.15 / 17.50
m = 9.27 (salary increases by $9.27k per year of experience)
Formula: c = ȳ - m × x̄
c = 62.82 - (9.27 × 3.5)
c = 62.82 - 32.45
c = 30.37 (base salary with 0 years experience)
ŷ = 9.27x + 30.37

Make a Prediction: What salary for 7 years of experience?
ŷ = 9.27 × 7 + 30.37
ŷ = 64.89 + 30.37
ŷ = $95.26k predicted salary
For each point, calculate (actual - predicted)²:
| x | Actual y | Predicted ŷ | Error (y - ŷ) | Error² |
|---|---|---|---|---|
| 1 | 39.8 | 39.64 | 0.16 | 0.03 |
| 2 | 48.9 | 48.91 | -0.01 | 0.00 |
| 3 | 57.0 | 58.18 | -1.18 | 1.39 |
| 4 | 68.3 | 67.45 | 0.85 | 0.72 |
| 5 | 77.9 | 76.72 | 1.18 | 1.39 |
| 6 | 85.0 | 85.99 | -0.99 | 0.98 |
| Sum of Squared Errors: | | | | 4.51 |
MSE = 4.51 / 6
MSE = 0.75 (Very low - great fit!)
1. m (slope) = Σ[(x-x̄)(y-ȳ)] / Σ[(x-x̄)²] = 9.27
2. c (intercept) = ȳ - m×x̄ = 30.37
3. Final equation: ŷ = 9.27x + 30.37
4. MSE = 0.75 (low error = good model!)
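The whole derivation above can be reproduced in a few lines of plain Python. A minimal sketch with the table's data hard-coded (the intercept comes out ≈ 30.39 here; the text's 30.37 results from rounding ȳ and m before computing c):

```python
# Salary data from the worked example: (experience, salary in $k)
xs = [1, 2, 3, 4, 5, 6]
ys = [39.8, 48.9, 57.0, 68.3, 77.9, 85.0]
n = len(xs)

# Means
x_bar = sum(xs) / n   # 3.5
y_bar = sum(ys) / n   # ≈ 62.82

# Slope: m = Σ[(x - x̄)(y - ȳ)] / Σ[(x - x̄)²]
num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))   # 162.15
den = sum((x - x_bar) ** 2 for x in xs)                        # 17.5
m = num / den          # ≈ 9.27

# Intercept: c = ȳ - m·x̄
c = y_bar - m * x_bar  # ≈ 30.39 (30.37 in the text, from rounded intermediates)

# Mean squared error of the fitted line
mse = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys)) / n
```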
📊 Supervised - Regression Polynomial Regression
When your data curves and a straight line won't fit, Polynomial Regression extends linear regression by adding polynomial terms (x², x³, etc.) to capture non-linear relationships.
- Extends linear regression to fit curves
- Uses polynomial features: x, x², x³, etc.
- Higher degree = more flexible (but beware overfitting!)
- Still linear in parameters (coefficients)
When Linear Fails
Consider predicting car stopping distance based on speed. The relationship isn't linear - doubling speed quadruples stopping distance (physics: kinetic energy = ½mv²)!
Polynomial Degree 2: y = β₀ + β₁x + β₂x²
Polynomial Degree 3: y = β₀ + β₁x + β₂x² + β₃x³
Polynomial Degree n: y = β₀ + β₁x + β₂x² + ... + βₙxⁿ

Degree 4-5: Can start overfitting
Degree > 5: High risk of overfitting - the model memorizes noise!
📐 Complete Mathematical Derivation
Let's fit a quadratic curve to data step-by-step!
| Speed x (mph) | Stopping Distance y (ft) |
|---|---|
| 10 | 15 |
| 20 | 40 |
| 30 | 80 |
| 40 | 130 |
| 50 | 200 |
For degree 2, we add x² as a new feature:
| x (speed) | x² (speed squared) | y (distance) |
|---|---|---|
| 10 | 100 | 15 |
| 20 | 400 | 40 |
| 30 | 900 | 80 |
| 40 | 1600 | 130 |
| 50 | 2500 | 200 |
Model: y = β₀ + β₁x + β₂x²

Design Matrix X:

```
    [1  10  100 ]       [15 ]        [β₀]
    [1  20  400 ]       [40 ]        [β₁]
X = [1  30  900 ]   y = [80 ]    β = [β₂]
    [1  40  1600]       [130]
    [1  50  2500]       [200]
```
Normal Equation: β = (XᵀX)⁻¹ Xᵀy

After matrix multiplication (done by computer):

β₀ = 2.5 (base distance)
β₁ = 0.5 (linear component)
β₂ = 0.07 (quadratic component)
ŷ = 2.5 + 0.5x + 0.07x²

Make Predictions:
Speed = 25 mph: ŷ = 2.5 + 0.5(25) + 0.07(625) = 2.5 + 12.5 + 43.75 = 58.75 ft
Speed = 60 mph: ŷ = 2.5 + 0.5(60) + 0.07(3600) = 2.5 + 30 + 252 = 284.5 ft
1. Create polynomial features: x → [x, x², x³, ...]
2. Apply standard linear regression on expanded features
3. The model is still "linear" in parameters, just non-linear in input
4. Use cross-validation to choose optimal degree!
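With the fitted coefficients from the derivation (β₀ = 2.5, β₁ = 0.5, β₂ = 0.07), the quadratic model is a one-line function. A minimal sketch reproducing the predictions above:

```python
# Quadratic model from the worked example: ŷ = 2.5 + 0.5x + 0.07x²
def predict_stopping_distance(speed_mph):
    return 2.5 + 0.5 * speed_mph + 0.07 * speed_mph ** 2

d25 = predict_stopping_distance(25)   # ≈ 58.75 ft, matching the text
d60 = predict_stopping_distance(60)   # ≈ 284.5 ft, matching the text
```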
Python Code
```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Create polynomial features (degree 2)
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Fit linear regression on polynomial features
model = LinearRegression()
model.fit(X_poly, y)

# Predict
y_pred = model.predict(poly.transform(X_new))
```
📊 Supervised - Optimization Gradient Descent
Gradient Descent is the optimization algorithm that helps us find the best values for our model parameters (like m and c in linear regression). Think of it as rolling a ball downhill to find the lowest point.
Understanding Gradient Descent
Imagine you're hiking down a mountain in thick fog. You can't see the bottom, but you can feel the slope under your feet. The smart strategy? Always step in the steepest downward direction. That's exactly what gradient descent does with mathematical functions!
θ_new = θ_old - α × ∇J(θ)

where:
θ = parameters (m, c)
α = learning rate (step size)
∇J(θ) = gradient (direction and steepness)
The Learning Rate (α)
The learning rate is like your step size when walking down the mountain:
- Too small: You take tiny steps and it takes forever to reach the bottom
- Too large: You take huge leaps and might jump over the valley or even go uphill!
- Just right: You make steady progress toward the minimum
Figure 2: Loss surface showing gradient descent path to minimum
Types of Gradient Descent
- Batch Gradient Descent: Uses all data points for each update. Accurate but slow for large datasets.
- Stochastic Gradient Descent (SGD): Uses one random data point per update. Fast but noisy.
- Mini-batch Gradient Descent: Uses small batches (e.g., 32 points). Best of both worlds!
📐 Complete Mathematical Derivation: Gradient Descent in Action

Let's watch gradient descent optimize a simple example step-by-step!
We want to find the value of x that minimizes f(x) = x²
Settings:
• Starting point: x₀ = 4
• Learning rate: α = 0.3
• Goal: Find x that minimizes x² (answer should be x = 0)
The gradient tells us which direction increases the function.
f(x) = x²
f'(x) = d/dx (x²) = 2x

Why 2x?
Using the power rule: d/dx (xⁿ) = n × xⁿ⁻¹
So: d/dx (x²) = 2 × x²⁻¹ = 2 × x¹ = 2x
Update Formula: x_new = x_old - α × f'(x_old)
| Iteration | x_old | f'(x) = 2x | α × f'(x) | x_new = x_old - α×f'(x) | f(x) = x² |
|---|---|---|---|---|---|
| 0 (Start) | 4.000 | — | — | — | 16.00 |
| 1 | 4.000 | 2×4 = 8 | 0.3×8 = 2.4 | 4 - 2.4 = 1.600 | 2.56 |
| 2 | 1.600 | 2×1.6 = 3.2 | 0.3×3.2 = 0.96 | 1.6 - 0.96 = 0.640 | 0.41 |
| 3 | 0.640 | 2×0.64 = 1.28 | 0.3×1.28 = 0.384 | 0.64 - 0.384 = 0.256 | 0.066 |
| 4 | 0.256 | 2×0.256 = 0.512 | 0.3×0.512 = 0.154 | 0.256 - 0.154 = 0.102 | 0.010 |
| 5 | 0.102 | 2×0.102 = 0.205 | 0.3×0.205 = 0.061 | 0.102 - 0.061 = 0.041 | 0.002 |
| ... | → | → | → | ≈ 0 | ≈ 0 |
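The iteration table can be generated directly. A minimal sketch of the update loop, using the same settings (x₀ = 4, α = 0.3):

```python
x = 4.0       # starting point x₀
alpha = 0.3   # learning rate α

for i in range(5):
    grad = 2 * x          # f'(x) = 2x for f(x) = x²
    x = x - alpha * grad  # gradient descent update
    # after 5 steps: x ≈ 0.041, f(x) ≈ 0.002, matching the table
```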
For linear regression y = mx + c, we minimize MSE:
MSE = (1/n) × Σ(yᵢ - (mxᵢ + c))²

Partial derivatives (gradients):
∂MSE/∂m = (-2/n) × Σ xᵢ(yᵢ - ŷᵢ)
∂MSE/∂c = (-2/n) × Σ (yᵢ - ŷᵢ)

Update rules:
m_new = m_old - α × ∂MSE/∂m
c_new = c_old - α × ∂MSE/∂c

Each iteration brings m and c closer to optimal values!
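Applying these update rules to the salary data from the linear regression section recovers nearly the same m and c as the closed-form solution. A minimal sketch in plain Python (the learning rate and iteration count are arbitrary choices, not from the text):

```python
xs = [1, 2, 3, 4, 5, 6]
ys = [39.8, 48.9, 57.0, 68.3, 77.9, 85.0]
n = len(xs)

m, c = 0.0, 0.0   # initial parameters
alpha = 0.01      # learning rate (assumed)

for _ in range(50_000):
    # Gradients of MSE with respect to m and c
    grad_m = (-2 / n) * sum(x * (y - (m * x + c)) for x, y in zip(xs, ys))
    grad_c = (-2 / n) * sum(y - (m * x + c) for x, y in zip(xs, ys))
    m -= alpha * grad_m
    c -= alpha * grad_c
# m ≈ 9.27, c ≈ 30.39 — matching the closed-form least-squares fit
```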
• Started at x = 4, loss = 16
• After 5 iterations: x ≈ 0.041, loss ≈ 0.002
• The loss dropped from 16 to 0.002 in just 5 steps!
This is the power of gradient descent - it automatically finds the minimum by following the steepest path downhill!
📊 Supervised - Classification Logistic Regression
Logistic Regression is used for binary classification - when you want to predict categories (yes/no, spam/not spam, disease/healthy) not numbers. Despite its name, it's a classification algorithm!
Enter the Sigmoid Function
The sigmoid function σ(z) squashes any input into the range [0, 1], making it perfect for probabilities!
σ(z) = 1 / (1 + e⁻ᶻ)

where:
z = w·x + b (linear combination)
σ(z) = probability (always between 0 and 1)
e ≈ 2.718 (Euler's number)
Sigmoid Properties:
Figure: Sigmoid function transforms linear input to probability
Logistic Regression Formula
Figure: Logistic regression with decision boundary at 0.5
Log Loss (Cross-Entropy)
We can't use MSE for logistic regression because it creates a non-convex optimization surface (multiple local minima). Instead, we use log loss:
Understanding Log Loss:
Case 1: Actual y=1, Predicted p=0.9
Loss = -[1·log(0.9) + 0·log(0.1)] = -log(0.9) = 0.105 ✓ Low loss (good!)

Case 2: Actual y=1, Predicted p=0.1
Loss = -[1·log(0.1) + 0·log(0.9)] = -log(0.1) = 2.303 ✗ High loss (bad!)

Case 3: Actual y=0, Predicted p=0.1
Loss = -[0·log(0.1) + 1·log(0.9)] = -log(0.9) = 0.105 ✓ Low loss (good!)
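The three cases can be checked numerically (natural log, as is standard for cross-entropy):

```python
import math

def log_loss(y, p):
    """Binary cross-entropy for one example."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

case1 = log_loss(1, 0.9)   # ≈ 0.105 — confident and correct: low loss
case2 = log_loss(1, 0.1)   # ≈ 2.303 — confident and wrong: high loss
case3 = log_loss(0, 0.1)   # ≈ 0.105 — confident and correct: low loss
```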
📐 Complete Mathematical Derivation: Logistic Regression

Let's walk through the entire process with real numbers!
Training Data:
Person 1: Height = 155 cm → Not Tall (y = 0)
Person 2: Height = 165 cm → Not Tall (y = 0)
Person 3: Height = 175 cm → Tall (y = 1)
Person 4: Height = 185 cm → Tall (y = 1)

Given trained weights: w = 0.05, b = -8.5
Formula: z = w × height + b
| Height (cm) | z = 0.05 × height - 8.5 | z value |
|---|---|---|
| 155 | 0.05 × 155 - 8.5 | -0.75 |
| 165 | 0.05 × 165 - 8.5 | -0.25 |
| 175 | 0.05 × 175 - 8.5 | +0.25 |
| 185 | 0.05 × 185 - 8.5 | +0.75 |
Sigmoid Formula: σ(z) = 1 / (1 + e⁻ᶻ)
| z | e⁻ᶻ | 1 + e⁻ᶻ | σ(z) = 1/(1+e⁻ᶻ) | Interpretation |
|---|---|---|---|---|
| -0.75 | e⁰·⁷⁵ = 2.117 | 3.117 | 0.32 | 32% chance tall |
| -0.25 | e⁰·²⁵ = 1.284 | 2.284 | 0.44 | 44% chance tall |
| +0.25 | e⁻⁰·²⁵ = 0.779 | 1.779 | 0.56 | 56% chance tall |
| +0.75 | e⁻⁰·⁷⁵ = 0.472 | 1.472 | 0.68 | 68% chance tall |
| Height | p = σ(z) | p ≥ 0.5? | Prediction | Actual | Correct? |
|---|---|---|---|---|---|
| 155 | 0.32 | No | 0 (Not Tall) | 0 | ✓ |
| 165 | 0.44 | No | 0 (Not Tall) | 0 | ✓ |
| 175 | 0.56 | Yes | 1 (Tall) | 1 | ✓ |
| 185 | 0.68 | Yes | 1 (Tall) | 1 | ✓ |
Formula: L = -[y × log(p) + (1-y) × log(1-p)]
| y (actual) | p (predicted) | Calculation | Loss |
|---|---|---|---|
| 0 | 0.32 | -[0×log(0.32) + 1×log(0.68)] | 0.39 |
| 0 | 0.44 | -[0×log(0.44) + 1×log(0.56)] | 0.58 |
| 1 | 0.56 | -[1×log(0.56) + 0×log(0.44)] | 0.58 |
| 1 | 0.68 | -[1×log(0.68) + 0×log(0.32)] | 0.39 |
| Average Log Loss: | | (0.39+0.58+0.58+0.39)/4 | 0.485 |
1. Linear: z = w×x + b (compute a score)
2. Sigmoid: p = 1/(1+e⁻ᶻ) (convert score to probability 0-1)
3. Threshold: if p ≥ 0.5, predict class 1; else predict class 0
4. Loss: Log Loss = -[y×log(p) + (1-y)×log(1-p)]
5. Train: Use gradient descent to minimize total log loss!
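The full pipeline for the height example - linear score, sigmoid, threshold, loss - fits in a few lines (the small difference from the text's 0.485 average loss comes from rounding in the tables):

```python
import math

w, b = 0.05, -8.5               # trained weights from the example
heights = [155, 165, 175, 185]
labels = [0, 0, 1, 1]           # 0 = Not Tall, 1 = Tall

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

probs = [sigmoid(w * h + b) for h in heights]   # ≈ [0.32, 0.44, 0.56, 0.68]
preds = [1 if p >= 0.5 else 0 for p in probs]   # [0, 0, 1, 1] — all correct

# Average log loss over the four examples
avg_loss = sum(-(y * math.log(p) + (1 - y) * math.log(1 - p))
               for y, p in zip(labels, probs)) / len(labels)   # ≈ 0.48
```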
📊 Supervised - Classification Support Vector Machines (SVM)
What is SVM?
Support Vector Machine (SVM) is a powerful supervised machine learning algorithm used for both classification and regression tasks. Unlike logistic regression which just needs any line that separates the classes, SVM finds the BEST decision boundary - the one with the maximum margin between classes.
Decision Boundary
The decision boundary is a line (or hyperplane in higher dimensions) that separates the two classes. It's defined by the equation:
w·x + b = 0

where:
w = [w₁, w₂] is the weight vector
x = [x₁, x₂] is the data point
b is the bias term
Figure 3: SVM decision boundary with 6 data points. Hover to see scores.
Margin and Support Vectors
Figure 4: Decision boundary with margin lines and support vectors highlighted in cyan
Hard Margin vs Soft Margin
Hard Margin SVM
Hard margin SVM requires perfect separation - no points can violate the margin. It works only when data is linearly separable.
Soft Margin SVM
Soft margin SVM allows some margin violations, making it more practical for real-world data. It balances margin maximization with allowing some misclassifications.
```
min  ½||w||²   +   C × Σ max(0, 1 - yᵢ(w·xᵢ + b))
        ↓                        ↓
 Maximize margin      Hinge Loss (penalize violations)
```
The C Parameter
The C parameter controls the trade-off between maximizing the margin and minimizing classification errors. It acts like regularization in other ML algorithms.
- Small C (0.1 or 1): Wider margin, more violations allowed, better generalization, use when data is noisy
- Large C (1000): Narrower margin, fewer violations, classify everything correctly, risk of overfitting, use when data is clean
Figure 5: Effect of C parameter on margin and violations
Slide to see: 0.1 → 1 → 10 → 1000
Training Algorithm
SVM can be trained using gradient descent. For each training sample (xᵢ, yᵢ), we check if it violates the margin and update weights accordingly.
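A minimal sketch of that training loop on a tiny hand-made 2-D dataset (labels ±1; the data points, learning rate, and C below are illustrative assumptions, not from the text). Each step applies the hinge-loss sub-gradient when a point violates the margin, and otherwise only shrinks w (the regularization term):

```python
# Hypothetical linearly separable data: class -1 lower-left, class +1 upper-right
X = [(1, 2), (2, 1), (1, 1), (4, 5), (5, 4), (5, 5)]
y = [-1, -1, -1, 1, 1, 1]

w = [0.0, 0.0]
b = 0.0
alpha, C = 0.01, 1.0   # learning rate and soft-margin penalty (assumed)

for epoch in range(2000):
    for (x1, x2), yi in zip(X, y):
        margin = yi * (w[0] * x1 + w[1] * x2 + b)
        if margin < 1:
            # Margin violated: hinge-loss gradient plus weight shrinkage
            w[0] += alpha * (C * yi * x1 - w[0])
            w[1] += alpha * (C * yi * x2 - w[1])
            b += alpha * C * yi
        else:
            # Correct with margin: only the regularizer acts (shrink w)
            w[0] -= alpha * w[0]
            w[1] -= alpha * w[1]

# Classify by the sign of the decision function w·x + b
preds = [1 if w[0] * x1 + w[1] * x2 + b > 0 else -1 for (x1, x2) in X]
```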
Figure 6: SVM training visualization - step through each point
SVM Kernels (Advanced)
Real-world data is often not linearly separable. Kernels transform data to higher dimensions where a linear boundary exists, which appears non-linear in the original space!
Figure 7: Kernel comparison on non-linear data
Key Formulas Summary
Advantages
- Effective in high dimensions: Works well even when features > samples
- Memory efficient: Only stores support vectors, not entire dataset
- Versatile: Different kernels for different data patterns
- Robust: Works well with clear margin of separation
Disadvantages

- Slow on large datasets: Training time grows quickly with >10k samples
- No probability estimates: Doesn't directly provide confidence scores
- Kernel choice: Requires expertise to select right kernel
- Feature scaling: Very sensitive to feature scales

Real-World Example: Email Spam Classification

Imagine we have emails with two features:
- x₂ = number of capital letters

SVM finds the widest "road" between spam and non-spam emails. Support vectors are the emails closest to this road - they're the trickiest cases that define our boundary! An email far from the boundary is clearly spam or clearly legitimate.
Python Code
```python
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Scale features (very important for SVM!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create SVM with RBF kernel
svm = SVC(
    kernel='rbf',    # Options: 'linear', 'poly', 'rbf'
    C=1.0,           # Regularization parameter
    gamma='scale'    # Kernel coefficient
)

# Train
svm.fit(X_train_scaled, y_train)

# Predict
predictions = svm.predict(X_test_scaled)

# Get support vectors
print(f"Number of support vectors: {len(svm.support_vectors_)}")
```
📊 Supervised - Classification K-Nearest Neighbors (KNN)
K-Nearest Neighbors is the simplest machine learning algorithm! To classify a new point, just look at its K nearest neighbors and take a majority vote. No training required!
How KNN Works
- Choose K: Decide how many neighbors (e.g., K=3)
- Calculate distance: Find distance from new point to all training points
- Find K nearest: Select K points with smallest distances
- Vote: Majority class wins (or take average for regression)
Distance Metrics
Figure: KNN classification - drag the test point to see predictions
| Point | Position | Class | Distance |
|---|---|---|---|
| A | (1.0, 2.0) | Orange | 1.80 |
| B | (0.9, 1.7) | Orange | 2.00 |
| C | (1.5, 2.5) | Orange | 1.00 ← nearest! |
| D | (4.0, 5.0) | Yellow | 3.35 |
| E | (4.2, 4.8) | Yellow | 3.15 |
| F | (3.8, 5.2) | Yellow | 3.12 |
📐 Complete Mathematical Derivation: KNN Classification

Let's classify a new point step-by-step with actual calculations!
Training Data:

| Fruit | Weight (g) | Size (cm) | Class |
|---|---|---|---|
| A | 140 | 7 | Apple |
| B | 150 | 7.5 | Apple |
| C | 180 | 9 | Orange |
| D | 200 | 10 | Orange |
| E | 160 | 8 | Orange |

Using K = 3 (3 nearest neighbors)
New point to classify: Weight = 165 g, Size = 8.5 cm

Distance Formula: d = √[(x₂-x₁)² + (y₂-y₁)²]

| Point | Calculation | Distance |
|---|---|---|
| A | √[(165-140)² + (8.5-7)²] = √[625 + 2.25] | 25.04 |
| B | √[(165-150)² + (8.5-7.5)²] = √[225 + 1] | 15.03 |
| C | √[(165-180)² + (8.5-9)²] = √[225 + 0.25] | 15.01 |
| D | √[(165-200)² + (8.5-10)²] = √[1225 + 2.25] | 35.03 |
| E | √[(165-160)² + (8.5-8)²] = √[25 + 0.25] | 5.02 |
Sort by distance:

| Rank | Point | Distance | Class | Include? |
|---|---|---|---|---|
| 1st | E | 5.02 | Orange | ✓ Yes |
| 2nd | C | 15.01 | Orange | ✓ Yes |
| 3rd | B | 15.03 | Apple | ✓ Yes |
| 4th | A | 25.04 | Apple | ✗ No |
| 5th | D | 35.03 | Orange | ✗ No |
K=3 Neighbors:
• E: Orange (1 vote)
• C: Orange (1 vote)
• B: Apple (1 vote)

Final Vote Count:
• Orange: 2 votes
• Apple: 1 vote

🍊 Prediction: ORANGE (majority wins!)
1. Calculate distance from new point to ALL training points
2. Sort distances from smallest to largest
3. Pick K nearest neighbors
4. Vote: Classification = majority class, Regression = average value

Note: Always normalize features first! Weight (100s) would dominate Size (10s) otherwise!
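The whole fruit example fits in a few lines of plain Python. Features are left unscaled here only because the example's distances were computed that way; in practice, normalize first as the note says:

```python
import math
from collections import Counter

# Training data from the example: (weight g, size cm) -> class
train = [((140, 7), "Apple"), ((150, 7.5), "Apple"),
         ((180, 9), "Orange"), ((200, 10), "Orange"),
         ((160, 8), "Orange")]
new_point = (165, 8.5)
k = 3

# Steps 1-2: compute all Euclidean distances and sort
dists = sorted((math.dist(new_point, feats), label) for feats, label in train)

# Steps 3-4: take the K nearest and vote
neighbors = [label for _, label in dists[:k]]         # ['Orange', 'Orange', 'Apple']
prediction = Counter(neighbors).most_common(1)[0][0]  # 'Orange'
```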
Python Code
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Scale features (essential for KNN!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create KNN classifier
knn = KNeighborsClassifier(
    n_neighbors=5,        # Number of neighbors (K)
    metric='euclidean',   # Distance metric
    weights='uniform'     # 'uniform' or 'distance'
)

# Train (just stores the data!)
knn.fit(X_train_scaled, y_train)

# Predict
predictions = knn.predict(X_test_scaled)

# Get probabilities
probas = knn.predict_proba(X_test_scaled)
```
📊 Supervised - Evaluation Model Evaluation
How do we know if our model is good? Model evaluation provides metrics to measure performance and identify problems!
Figure: Confusion matrix for spam detection (TP=600, FP=100, FN=300, TN=900)
Classification Metrics
Accuracy: Percentage of correct predictions overall
Example: (600 + 900) / (600 + 900 + 100 + 300) = 1500/1900 = 0.789 (78.9%)
Precision:
Example: 600 / (600 + 100) = 600/700 = 0.857 (85.7%)
Use when: False positives are costly (e.g., spam filter - don't want to block legitimate emails)
Recall:
Example: 600 / (600 + 300) = 600/900 = 0.667 (66.7%)
Use when: False negatives are costly (e.g., disease detection - can't miss sick patients)
F1 Score: Harmonic mean - balances precision and recall
Example: 2 × (0.857 × 0.667) / (0.857 + 0.667) = 0.750 (75.0%)
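All four metrics follow directly from the confusion matrix counts used above (TP=600, FP=100, FN=300, TN=900):

```python
TP, FP, FN, TN = 600, 100, 300, 900   # counts from the spam-detection example

accuracy = (TP + TN) / (TP + TN + FP + FN)           # ≈ 0.789
precision = TP / (TP + FP)                           # ≈ 0.857
recall = TP / (TP + FN)                              # ≈ 0.667
f1 = 2 * precision * recall / (precision + recall)   # 0.75
```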
ROC Curve & AUC
The ROC (Receiver Operating Characteristic) curve shows model performance across ALL possible thresholds!
Figure: ROC curve - slide threshold to see trade-off
Regression Metrics: R² Score
For regression problems, R² (coefficient of determination) measures how well the model explains variance:
Figure: R² calculation on height-weight regression
Regularization prevents overfitting by penalizing complex models. It adds a "simplicity constraint" to force the model to generalize better!
Loss = Original Loss + λ × Penalty(θ)

where:
θ = model parameters (weights)
λ = regularization strength
Penalty = function of parameter magnitudes
L1 Regularization (Lasso)
Figure: Comparing vanilla, L1, and L2 regularization effects
Practical Example
Predicting house prices with 10 features (size, bedrooms, age, etc.):
Without regularization: All features have large, varying coefficients. Model overfits noise.

With L1: Only 4 features remain (size, location, bedrooms, age). Others set to 0. Simpler, more interpretable!

With L2: All features kept but coefficients shrunk. More stable predictions, handles correlated features well.
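The shrinkage effect of L2 is easy to see in the one-feature case: with a centered feature and an unpenalized intercept, ridge regression simply adds λ to the denominator of the ordinary least-squares slope. A minimal sketch using the salary data from earlier (λ = 5 is an arbitrary choice for illustration):

```python
xs = [1, 2, 3, 4, 5, 6]
ys = [39.8, 48.9, 57.0, 68.3, 77.9, 85.0]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)

lam = 5.0                    # regularization strength λ (assumed)
m_ols = sxy / sxx            # ≈ 9.27 — ordinary least-squares slope
m_ridge = sxy / (sxx + lam)  # ≈ 7.21 — slope shrunk toward 0 by the L2 penalty
```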
Every model makes two types of errors: bias and variance. The bias-variance tradeoff is the fundamental challenge in machine learning - we must balance them!
Understanding Bias
Bias is the error from overly simplistic assumptions. High bias causes underfitting.
Characteristics of High Bias:
- Poor performance on both training and test data
- Model is too simple to capture the underlying pattern
The Driving Test Analogy

- High Bias (Underfitting): Failed practice tests, failed real test → Can't learn to drive at all
- Good Balance: Passed practice tests, passed real test → Actually learned to drive!
- High Variance (Overfitting): Perfect on practice tests, failed real test → Memorized practice, didn't truly learn

Model Complexity Curve
Figure: Error vs model complexity - find the sweet spot
✅ Key Takeaway
The bias-variance tradeoff is unavoidable. You can't have zero bias AND zero variance. The art of machine learning is finding the sweet spot where total error is minimized!
Understanding Variance
Variance is the error from sensitivity to small fluctuations in training data. High variance causes overfitting.
Characteristics of High Variance:
- Excellent performance on training data, poor performance on test data
- Model is too complex and memorizes noise in the training set
Figure: Three models showing underfitting, good fit, and overfitting
🧠 Neural + Networks The Perceptron
+ +The Perceptron is the simplest neural network - just one neuron! It's the building block of all + deep learning and was invented in 1958. Understanding it is key to understanding neural + networks.
+ +-
+
How a Perceptron Works
1. Weighted Sum: z = Σ(wᵢxᵢ) + b
2. Activation: output = activation(z)
   Step Function (Original): output = 1 if z > 0, else 0
   Sigmoid (Modern): output = 1/(1 + e⁻ᶻ)
📐 Complete Mathematical Derivation: Perceptron

Let's build a simple AND gate with a perceptron!
| x₁ | x₂ | AND Output |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |
Formula: z = w₁x₁ + w₂x₂ + b

Try weights w₁ = 0.5, w₂ = 0.5 and bias b = -0.7:
| x₁ | x₂ | z = 0.5x₁ + 0.5x₂ - 0.7 | z value |
|---|---|---|---|
| 0 | 0 | 0.5(0) + 0.5(0) - 0.7 | -0.7 |
| 0 | 1 | 0.5(0) + 0.5(1) - 0.7 | -0.2 |
| 1 | 0 | 0.5(1) + 0.5(0) - 0.7 | -0.2 |
| 1 | 1 | 0.5(1) + 0.5(1) - 0.7 | +0.3 |
Step Function: output = 1 if z > 0, else 0
| x₁ | x₂ | z | z > 0? | Output | Expected | Match? |
|---|---|---|---|---|---|---|
| 0 | 0 | -0.7 | No | 0 | 0 | ✓ |
| 0 | 1 | -0.2 | No | 0 | 0 | ✓ |
| 1 | 0 | -0.2 | No | 0 | 0 | ✓ |
| 1 | 1 | +0.3 | Yes | 1 | 1 | ✓ |
Update Rule: w_new = w_old + α × (target - output) × input

Where α = learning rate (e.g., 0.1)

Example update:
If prediction was 0 but target was 1 (error = 1):
w₁_new = 0.5 + 0.1 × (1 - 0) × 1 = 0.5 + 0.1 = 0.6

Weights increase for inputs that should have been positive!
1. Initialize weights randomly
2. For each training example: compute z = Σ(wᵢxᵢ) + b
3. Apply activation: output = step(z)
4. Update weights if wrong: w += α × error × input
5. Repeat until all examples correct (or max iterations)
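The training loop above can be sketched in a few lines of plain Python. This is an illustrative implementation (weights start at zero rather than random values, for reproducibility) that learns the AND gate from the truth table:

```python
def step(z):
    return 1 if z > 0 else 0

def train_perceptron(data, alpha=0.1, max_epochs=100):
    w1, w2, b = 0.0, 0.0, 0.0          # 1. initialize (zeros for reproducibility)
    for _ in range(max_epochs):
        errors = 0
        for x1, x2, target in data:
            z = w1 * x1 + w2 * x2 + b   # 2. weighted sum
            out = step(z)               # 3. activation
            err = target - out
            if err != 0:                # 4. update only when wrong
                w1 += alpha * err * x1
                w2 += alpha * err * x2
                b += alpha * err
                errors += 1
        if errors == 0:                 # 5. stop when every example is correct
            break
    return w1, w2, b

and_data = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
w1, w2, b = train_perceptron(and_data)
preds = [step(w1 * x1 + w2 * x2 + b) for x1, x2, _ in and_data]
print(preds)  # matches the AND column: [0, 0, 0, 1]
```

The learned weights differ from the hand-picked (0.5, 0.5, -0.7) above - any line separating (1,1) from the other three points works.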
🧠 Neural Networks Multi-Layer Perceptron (MLP)

A Multi-Layer Perceptron (MLP) stacks multiple layers of neurons to learn complex, non-linear patterns. This is the foundation of deep learning!
Activation Functions
Sigmoid: σ(z) = 1/(1 + e⁻ᶻ) → output (0, 1)
ReLU: f(z) = max(0, z) → output [0, ∞)
Tanh: tanh(z) = (eᶻ - e⁻ᶻ)/(eᶻ + e⁻ᶻ) → output (-1, 1)
Softmax: For multi-class classification
📐 Complete Mathematical Derivation: Forward Propagation

Let's trace through a small neural network step-by-step!
Network Architecture:
• Input layer: 2 neurons (x₁, x₂)
• Hidden layer: 2 neurons (h₁, h₂)
• Output layer: 1 neuron (ŷ)

Given Weights:
W₁ (input→hidden): [[0.1, 0.3], [0.2, 0.4]]
b₁ (hidden bias): [0.1, 0.1]
W₂ (hidden→output): [[0.5], [0.6]]
b₂ (output bias): [0.2]
Input: x = [1.0, 2.0]

Hidden neuron h₁:
z₁ = w₁₁×x₁ + w₁₂×x₂ + b₁
z₁ = 0.1×1.0 + 0.2×2.0 + 0.1
z₁ = 0.1 + 0.4 + 0.1 = 0.6
h₁ = sigmoid(0.6) = 1/(1 + e⁻⁰·⁶) = 0.646

Hidden neuron h₂:
z₂ = w₂₁×x₁ + w₂₂×x₂ + b₂
z₂ = 0.3×1.0 + 0.4×2.0 + 0.1
z₂ = 0.3 + 0.8 + 0.1 = 1.2
h₂ = sigmoid(1.2) = 1/(1 + e⁻¹·²) = 0.769
Hidden layer output: h = [0.646, 0.769]

Output neuron:
z_out = w₁×h₁ + w₂×h₂ + b
z_out = 0.5×0.646 + 0.6×0.769 + 0.2
z_out = 0.323 + 0.461 + 0.2 = 0.984

ŷ = sigmoid(0.984) = 1/(1 + e⁻⁰·⁹⁸⁴)
ŷ = 0.728 (Final Prediction!)
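The forward pass above can be reproduced in a few lines of NumPy (same weights and input as the worked example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0])                  # input from the example
W1 = np.array([[0.1, 0.3], [0.2, 0.4]])   # input -> hidden weights
b1 = np.array([0.1, 0.1])                 # hidden biases
W2 = np.array([0.5, 0.6])                 # hidden -> output weights
b2 = 0.2                                  # output bias

h = sigmoid(x @ W1 + b1)        # hidden activations: [0.646, 0.769]
y_hat = sigmoid(h @ W2 + b2)    # final prediction: ~0.728
print(h.round(3), round(float(y_hat), 3))
```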
Binary Cross-Entropy Loss:
L = -[y×log(ŷ) + (1-y)×log(1-ŷ)]

If true label y = 1:
L = -[1×log(0.728) + 0×log(0.272)]
L = -log(0.728)
L = 0.317

Lower loss = better prediction!
Chain Rule: ∂L/∂w = ∂L/∂ŷ × ∂ŷ/∂z × ∂z/∂w

Output layer gradient:
∂L/∂ŷ = -(y/ŷ) + (1-y)/(1-ŷ), and ∂ŷ/∂z_out = ŷ(1-ŷ) for sigmoid
Multiplied together, these simplify to δ_output = ∂L/∂z_out = ŷ - y
δ_output = 0.728 - 1 = -0.272

Hidden layer gradient:
δ_hidden = (δ_output × W₂) ⊙ h ⊙ (1-h)

Gradients flow backward to update all weights!
1. Forward Pass: Input → Hidden → Output (calculate prediction)
2. Loss Calculation: Compare prediction to true value
3. Backward Pass: Calculate gradients using chain rule
4. Update Weights: w = w - α × gradient
5. Repeat for many epochs until loss minimizes!
📐 Complete Backpropagation Derivation (Line-by-Line)

Let's derive backpropagation step-by-step using the network from the forward pass example!

Network: 2 inputs → 2 hidden → 1 output
Input: x = [1.0, 2.0], True label: y = 1

Forward Pass Results:
• Hidden layer: h₁ = 0.646, h₂ = 0.769
• Output: ŷ = 0.728
• Loss: L = 0.317
Goal: Calculate ∂L/∂z_out (gradient of loss w.r.t. output before activation)

Using Chain Rule:
δ_output = ∂L/∂z_out = ∂L/∂ŷ × ∂ŷ/∂z_out

For Binary Cross-Entropy + Sigmoid, this simplifies to:
δ_output = ŷ - y
δ_output = 0.728 - 1
δ_output = -0.272
Formula: ∂L/∂W₂ = δ_output × h (hidden layer output)

Calculation:
∂L/∂w₁(h→o) = δ_output × h₁ = -0.272 × 0.646 = -0.176
∂L/∂w₂(h→o) = δ_output × h₂ = -0.272 × 0.769 = -0.209

Bias gradient:
∂L/∂b₂ = δ_output = -0.272
The Key Insight: Hidden neurons contributed to output error based on their weights!

Formula: δ_hidden = (W₂ᵀ × δ_output) ⊙ σ'(z_hidden)

Sigmoid derivative: σ'(z) = σ(z) × (1 - σ(z)) = h × (1 - h)

For hidden neuron h₁:
σ'(z₁) = h₁ × (1 - h₁) = 0.646 × (1 - 0.646) = 0.646 × 0.354 = 0.229
δ₁ = w₁(h→o) × δ_output × σ'(z₁)
δ₁ = 0.5 × (-0.272) × 0.229 = -0.031

For hidden neuron h₂:
σ'(z₂) = h₂ × (1 - h₂) = 0.769 × (1 - 0.769) = 0.769 × 0.231 = 0.178
δ₂ = w₂(h→o) × δ_output × σ'(z₂)
δ₂ = 0.6 × (-0.272) × 0.178 = -0.029
Formula: ∂L/∂W₁ = δ_hidden × x (input)

Input: x = [1.0, 2.0]

Gradients for weights to h₁:
∂L/∂w₁₁ = δ₁ × x₁ = -0.031 × 1.0 = -0.031
∂L/∂w₁₂ = δ₁ × x₂ = -0.031 × 2.0 = -0.062

Gradients for weights to h₂:
∂L/∂w₂₁ = δ₂ × x₁ = -0.029 × 1.0 = -0.029
∂L/∂w₂₂ = δ₂ × x₂ = -0.029 × 2.0 = -0.058

Hidden bias gradients:
∂L/∂b₁ = δ₁ = -0.031, ∂L/∂b₂ = δ₂ = -0.029
Learning rate: α = 0.1

Update Rule: w_new = w_old - α × ∂L/∂w
| Weight | Old Value | Gradient | Update | New Value |
|---|---|---|---|---|
| w₁₁ | 0.1 | -0.031 | 0.1 - 0.1×(-0.031) | 0.103 |
| w₁₂ | 0.2 | -0.062 | 0.2 - 0.1×(-0.062) | 0.206 |
| w₂₁ | 0.3 | -0.029 | 0.3 - 0.1×(-0.029) | 0.303 |
| w₂₂ | 0.4 | -0.058 | 0.4 - 0.1×(-0.058) | 0.406 |
| w₁(h→o) | 0.5 | -0.176 | 0.5 - 0.1×(-0.176) | 0.518 |
| w₂(h→o) | 0.6 | -0.209 | 0.6 - 0.1×(-0.209) | 0.621 |
1. Forward pass: Calculate all activations from input → output
2. Calculate output error: δ_output = ŷ - y (for sigmoid + BCE)
3. Backpropagate error: δ_hidden = (Wᵀ × δ_next) ⊙ σ'(z)
4. Calculate gradients: ∂L/∂W = δ × (input to that layer)ᵀ
5. Update weights: W = W - α × ∂L/∂W

This is iterated thousands of times until the loss converges!
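The whole derivation can be checked in a few lines of NumPy. This sketch runs one forward pass, one backward pass, and one gradient step on the example network, reproducing the updated weights from the table above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = np.array([1.0, 2.0]), 1.0
W1 = np.array([[0.1, 0.3], [0.2, 0.4]]); b1 = np.array([0.1, 0.1])
W2 = np.array([0.5, 0.6]); b2 = 0.2
alpha = 0.1

# forward pass
h = sigmoid(x @ W1 + b1)
y_hat = sigmoid(h @ W2 + b2)

# backward pass (sigmoid + binary cross-entropy)
delta_out = y_hat - y                     # ~ -0.272
grad_W2 = delta_out * h                   # ~ [-0.176, -0.209]
delta_h = W2 * delta_out * h * (1 - h)    # ~ [-0.031, -0.029]
grad_W1 = np.outer(x, delta_h)            # gradients for input->hidden weights

# gradient descent step
W2 -= alpha * grad_W2
b2 -= alpha * delta_out
W1 -= alpha * grad_W1
b1 -= alpha * delta_h
print(W1.round(3))  # ~ [[0.103, 0.303], [0.206, 0.406]]
print(W2.round(3))  # ~ [0.518, 0.621]
```

The printed values match the update table up to rounding.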
Python Code

```python
from sklearn.neural_network import MLPClassifier

# Assumes X_train, y_train, X_test are already defined
# Create neural network
mlp = MLPClassifier(
    hidden_layer_sizes=(100, 50),  # 2 hidden layers
    activation='relu',
    max_iter=500
)

# Train
mlp.fit(X_train, y_train)

# Predict
predictions = mlp.predict(X_test)
```
📊 Supervised - Evaluation Cross-Validation

Cross-validation gives more reliable performance estimates by testing your model on multiple different splits of the data!
Figure: 3-Fold Cross-Validation - each fold serves as test set once
Example: 3-Fold CV
| Fold | Test Set | Training Set | Accuracy |
|---|---|---|---|
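The 3-fold procedure can be sketched with scikit-learn's `cross_val_score` (the synthetic dataset and logistic-regression model here are illustrative assumptions, not from the text):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Illustrative synthetic data standing in for a real dataset
X, y = make_classification(n_samples=150, n_features=5, random_state=42)
model = LogisticRegression(max_iter=1000)

# cv=3: the data is split into 3 folds; each fold serves as test set once
scores = cross_val_score(model, X, y, cv=3)
print(scores, scores.mean())  # one accuracy per fold, plus the average
```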
| Speed | Very Fast | Fast | Slow | Very Slow |
| Works with Little Data | Yes | Yes | No | No |
| Interpretable | Very | Yes | No | No |
| Handles Non-linear | Yes | No | Yes | Yes |
| High Dimensions | Excellent | Good | Good | Poor |
🎯 PART A: Categorical Naive Bayes (Step-by-Step from PDF)

Dataset: Tennis Play Prediction

| Outlook | Temperature | Play |
|---|---|---|
| Sunny | Hot | No |
| Sunny | Mild | No |
| Cloudy | Hot | Yes |
| Rainy | Mild | Yes |
| Rainy | Cool | Yes |
| Cloudy | Cool | Yes |
Problem: Predict whether to play tennis when Outlook=Rainy and Temperature=Hot
• Count(Rainy AND No) = 0 examples ❌
• Count(No) = 2 total
• P(Rainy|No) = 0/2 = 0 ⚠️ ZERO PROBABILITY PROBLEM!
For Temperature = "Hot":
• P(Hot|Yes) = 1/4 = 0.25
• P(Hot|No) = 1/2 = 0.5
P(Yes|Rainy,Hot) ∝ P(Yes) × P(Rainy|Yes) × P(Hot|Yes)
                 = 0.667 × 0.5 × 0.25
                 = 0.0833

P(No|Rainy,Hot) ∝ P(No) × P(Rainy|No) × P(Hot|No)
                = 0.333 × 0 × 0.5
                = 0 ❌ Problem!
For Outlook (3 categories: Sunny, Cloudy, Rainy):
P(Rainy|Yes) = (2 + 1) / (4 + 1×3)
             = 3/7
             = 0.429 ✓

P(Rainy|No) = (0 + 1) / (2 + 1×3)
            = 1/5
            = 0.2 ✓ Fixed the zero!
For Temperature (3 categories: Hot, Mild, Cool):
P(Hot|Yes) = (1 + 1) / (4 + 1×3) = 2/7 = 0.286
P(Hot|No) = (1 + 1) / (2 + 1×3) = 2/5 = 0.4
Smoothed scores:
P(Yes) × P(Rainy|Yes) × P(Hot|Yes) = 0.667 × 0.429 × 0.286 = 0.0818
P(No) × P(Rainy|No) × P(Hot|No) = 0.333 × 0.2 × 0.4 = 0.0266
Sum = 0.0818 + 0.0266 = 0.1084

Normalize:
P(Yes|Rainy,Hot) = 0.0818 / 0.1084 = 0.755 (75.5%)
P(No|Rainy,Hot) = 0.0266 / 0.1084 = 0.245 (24.5%)
Prediction: YES, with confidence 75.5%

Figure: Categorical Naive Bayes calculation visualization
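The smoothed calculation can be reproduced in plain Python (counts taken from the tennis table; expect tiny rounding differences from the hand arithmetic):

```python
# Priors from the 6-row table: 4 Yes, 2 No
p_yes, p_no = 4 / 6, 2 / 6

# Laplace smoothing: (count + 1) / (class_total + num_categories)
p_rainy_yes = (2 + 1) / (4 + 3)   # 3/7
p_rainy_no = (0 + 1) / (2 + 3)    # 1/5 -- the zero is fixed!
p_hot_yes = (1 + 1) / (4 + 3)     # 2/7
p_hot_no = (1 + 1) / (2 + 3)      # 2/5

score_yes = p_yes * p_rainy_yes * p_hot_yes
score_no = p_no * p_rainy_no * p_hot_no
posterior_yes = score_yes / (score_yes + score_no)
print(round(posterior_yes, 3))  # ~0.754, the ~75.5% above up to rounding
```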
🎯 PART B: Gaussian Naive Bayes (Step-by-Step from PDF)

Dataset: 2D Classification

| ID | X₁ | X₂ | Class |
|---|---|---|---|
| A | 1.0 | 2.0 | Yes |
| B | 2.0 | 1.0 | Yes |
| C | 1.5 | 1.8 | Yes |
| D | 3.0 | 3.0 | No |
| E | 3.5 | 2.8 | No |
| F | 2.9 | 3.2 | No |
Problem: Classify test point [X₁=2.0, X₂=2.0]
This gives us the probability density at point x given mean μ and variance σ²

Step-by-step:
• Normalization: 1/√(2π × 0.1667) = 1/√1.047 = 1/1.023 = 0.977
• Exponent: -(2.0-1.5)²/(2 × 0.1667) = -(0.5)²/0.333 = -0.75
• e^(-0.75) = 0.472
• Final: 0.977 × 0.472 = 0.461
Step-by-step:
• Normalization: 1/√(2π × 0.0688) = 1.523
• Exponent: -(2.0-3.133)²/(2 × 0.0688) = -1.284/0.1376 = -9.333
• e^(-9.333) = 0.000088
• Final: 1.523 × 0.000088 = 0.000134
• Based on X₁ alone, the point is MUCH more likely to be "Yes"!
P(x₂=2.0|No) = 2.449 × 0.0000000614 = 0.00000015

Score for "No":
P(No) × P(x₁|No) × P(x₂|No) = 0.5 × 0.000134 × 0.00000015 = 0.00000000001

Prediction: YES ✅

Figure: Gaussian Naive Bayes with decision boundary
Python Code

```python
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# For continuous features (e.g., measurements)
gnb = GaussianNB()
gnb.fit(X_train, y_train)
predictions = gnb.predict(X_test)

# For text/count data (e.g., TF-IDF features)
from sklearn.feature_extraction.text import CountVectorizer

# Convert text to word counts
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train_text)
X_test_counts = vectorizer.transform(X_test_text)

# Train Multinomial NB (good for text)
mnb = MultinomialNB(alpha=1.0)  # Laplace smoothing
mnb.fit(X_train_counts, y_train)

# Predict & get probabilities
predictions = mnb.predict(X_test_counts)
probabilities = mnb.predict_proba(X_test_counts)
```
🔍 Unsupervised - Clustering K-means Clustering

K-means is an unsupervised learning algorithm that groups data into K clusters. Each cluster has a centroid (center point), and points are assigned to the nearest centroid. Perfect for customer segmentation, image compression, and pattern discovery!
Dataset: 6 Points in 2D Space

| Point | X | Y |
|---|---|---|
| A | 1 | 2 |
| B | 1.5 | 1.8 |
| C | 5 | 8 |
| D | 8 | 8 |
| E | 1 | 0.6 |
| F | 9 | 11 |
WCSS Calculation:
WCSS₁ = d²(A,c₁) + d²(B,c₁) + d²(E,c₁)
      = (1-1.17)²+(2-1.47)² + (1.5-1.17)²+(1.8-1.47)² + (1-1.17)²+(0.6-1.47)²
      = 0.311 + 0.218 + 0.786 = 1.315

WCSS₂ = d²(C,c₂) + d²(D,c₂) + d²(F,c₂)
      = (5-7.33)²+(8-9)² + (8-7.33)²+(8-9)² + (9-7.33)²+(11-9)²
      = 6.433 + 1.447 + 6.789 = 14.669

Total WCSS = 1.315 + 14.669 = 15.984
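The WCSS arithmetic above can be verified with NumPy, using exact centroids instead of the rounded 1.17/1.47 and 7.33/9 used by hand:

```python
import numpy as np

points = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])
labels = np.array([0, 0, 1, 1, 0, 1])   # cluster 1: A, B, E; cluster 2: C, D, F

wcss = 0.0
for k in (0, 1):
    cluster = points[labels == k]
    centroid = cluster.mean(axis=0)               # exact cluster mean
    wcss += ((cluster - centroid) ** 2).sum()     # squared distances to centroid

print(round(wcss, 2))  # ~15.98, matching the hand total of 15.984
```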
Figure: K-means clustering visualization with centroid movement
Finding Optimal K: The Elbow Method
Figure: Elbow method - optimal K is where the curve bends
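The elbow method can be sketched with scikit-learn's `KMeans` on the six points above (`inertia_` is scikit-learn's name for WCSS; the range of K values tried is an illustrative choice):

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

wcss = []
for k in range(1, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    wcss.append(km.inertia_)   # WCSS for this k

print(wcss)  # drops sharply up to the "elbow", then flattens
```

For this dataset the big drop happens at K=2 (the two obvious groups), after which extra clusters barely help.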
📊 Supervised - Regression Decision Tree Regression

Decision Tree Regression predicts continuous values by recursively splitting data to minimize variance. Unlike classification trees that use entropy, regression trees use variance reduction!
Dataset: House Price Prediction

| ID | Square Feet | Price (Lakhs) |
|---|---|---|
| 1 | 800 | 50 |
| 2 | 850 | 52 |
| 3 | 900 | 54 |
| 4 | 1500 | 90 |
| 5 | 1600 | 95 |
| 6 | 1700 | 100 |
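Variance reduction for one candidate split of this table can be computed directly (the 1200 sq ft threshold is an illustrative choice between the two price groups, not from the text):

```python
import numpy as np

sqft = np.array([800, 850, 900, 1500, 1600, 1700])
price = np.array([50, 52, 54, 90, 95, 100])

parent_var = price.var()                       # variance before splitting
left, right = price[sqft < 1200], price[sqft >= 1200]
n = len(price)
weighted = len(left) / n * left.var() + len(right) / n * right.var()
reduction = parent_var - weighted              # what the tree maximizes

print(round(parent_var, 1), round(weighted, 2), round(reduction, 2))
```

The split separates the cheap and expensive houses cleanly, so the weighted child variance collapses and nearly all of the parent variance is removed.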
Figure: Decision tree regression with splits and predictions
Variance Reduction vs Information Gain
| Aspect | Classification Trees | Regression Trees |
|---|---|---|
| Splitting Criterion | Information Gain (Entropy/Gini) | Variance Reduction |
| Prediction | Majority class | Mean value |
| Leaf Node | Class label | Continuous value |
| Goal | Maximize purity | Minimize variance |
Figure: Comparing different split points and their variance reduction
📊 Supervised Decision Trees

Decision Trees make decisions by asking yes/no questions recursively. They're interpretable, powerful, and the foundation for ensemble methods like Random Forests!
How Decision Trees Work
Imagine you're playing "20 Questions" to guess an animal. Each question splits possibilities into two groups. Decision Trees work the same way!
Figure 1: Interactive decision tree structure
Splitting Criteria
How do we choose which question to ask at each node? We want splits that maximize information gain!
1. Entropy (Information Theory)
Figure 2: Entropy and Information Gain visualization
3. Gini Impurity (Alternative)
Figure 3: Comparing different splits by information gain
Decision Boundaries
Figure 4: Decision tree creates rectangular regions
Overfitting in Decision Trees
Advantages vs Disadvantages
| Advantages ✅ | Disadvantages ❌ |
|---|---|
📐 Complete Mathematical Derivation: Decision Tree Splitting

Let's calculate Entropy, Information Gain, and Gini step-by-step!

Training Data (14 days):
• 9 days we played tennis (Yes)
• 5 days we didn't play (No)

Features: Weather (Sunny/Overcast/Rain), Wind (Weak/Strong)
Entropy Formula: H(S) = -Σ pᵢ × log₂(pᵢ)

p(Yes) = 9/14 = 0.643
p(No) = 5/14 = 0.357

H(S) = -[p(Yes) × log₂(p(Yes)) + p(No) × log₂(p(No))]
H(S) = -[0.643 × log₂(0.643) + 0.357 × log₂(0.357)]
H(S) = -[0.643 × (-0.637) + 0.357 × (-1.486)]
H(S) = -[-0.410 + (-0.531)]
H(S) = 0.940 bits (before any split)
Split counts:

| Wind | Yes | No | Total | Entropy Calculation | H(subset) |
|---|---|---|---|---|---|
| Weak | 6 | 2 | 8 | -[6/8×log₂(6/8) + 2/8×log₂(2/8)] | 0.811 |
| Strong | 3 | 3 | 6 | -[3/6×log₂(3/6) + 3/6×log₂(3/6)] | 1.000 |

H(S|Wind) = (8/14) × 0.811 + (6/14) × 1.000
H(S|Wind) = 0.463 + 0.429
H(S|Wind) = 0.892
Formula: IG(S, Feature) = H(S) - H(S|Feature)

IG(S, Wind) = 0.940 - 0.892
IG(S, Wind) = 0.048 bits

This means splitting by Wind reduces uncertainty by 0.048 bits
| Feature | H(S\|Feature) | Information Gain | Decision |
|---|---|---|---|
| Weather | 0.693 | 0.247 | ✓ BEST! |
| Wind | 0.892 | 0.048 | |
Gini Formula: Gini(S) = 1 - Σ pᵢ²

For root node:
Gini(S) = 1 - [(9/14)² + (5/14)²]
Gini(S) = 1 - [0.413 + 0.128]
Gini(S) = 1 - 0.541
Gini(S) = 0.459

Interpretation:
• Gini = 0: Pure node (all same class)
• Gini = 0.5: Maximum impurity (50-50 split)
• Our 0.459 indicates moderate impurity
1. Calculate parent entropy/Gini
2. For each feature:
   • Split data by feature values
   • Calculate weighted child entropy/Gini
   • Compute Information Gain = Parent - Weighted Children
3. Choose feature with HIGHEST Information Gain
4. Repeat recursively until stopping criteria met!
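The entropy, information-gain, and Gini arithmetic above can be checked with a small helper:

```python
from math import log2

def entropy(pos, neg):
    """Binary entropy of a (pos, neg) count split, in bits."""
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        p = c / total
        if p > 0:               # 0 * log(0) is treated as 0
            h -= p * log2(p)
    return h

H_root = entropy(9, 5)                        # ~0.940 bits
H_weak, H_strong = entropy(6, 2), entropy(3, 3)
H_wind = 8/14 * H_weak + 6/14 * H_strong      # ~0.892
ig_wind = H_root - H_wind                     # ~0.048 bits
gini_root = 1 - (9/14)**2 - (5/14)**2         # ~0.459

print(round(H_root, 3), round(ig_wind, 3), round(gini_root, 3))
```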
Python Code

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import matplotlib.pyplot as plt

# Create Decision Tree
dt = DecisionTreeClassifier(
    criterion='gini',      # 'gini' or 'entropy'
    max_depth=5,           # Limit depth (prevent overfitting)
    min_samples_split=2,   # Min samples to split
    min_samples_leaf=1     # Min samples in leaf
)

# Train
dt.fit(X_train, y_train)

# Predict
predictions = dt.predict(X_test)

# Visualize the tree
plt.figure(figsize=(20, 10))
tree.plot_tree(dt, filled=True, feature_names=feature_names)
plt.show()

# Feature importance
print(dict(zip(feature_names, dt.feature_importances_)))
```
🎮 Reinforcement Introduction to Reinforcement Learning

Reinforcement Learning (RL) is learning by trial and error, just like teaching a dog tricks! The agent takes actions in an environment, receives rewards or punishments, and learns which actions lead to the best outcomes.
Reinforcement: "Try things and I'll tell you if you did well or poorly"

RL must explore to discover good actions, while supervised learning is given correct answers upfront!

Balance is key! Too much exploration wastes time on bad actions. Too much exploitation misses better strategies.
Discounted return: G = r₁ + γr₂ + γ²r₃ + ...

where:
γ = discount factor (0 ≤ γ ≤ 1)
Future rewards are worth less than immediate rewards
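The effect of the discount factor is easy to see numerically (the reward sequence below is an illustrative example, not from the text):

```python
def discounted_return(rewards, gamma):
    """G = r1 + gamma*r2 + gamma^2*r3 + ..."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [1, 1, 1, 10]                   # the big reward arrives 3 steps later
print(discounted_return(rewards, 0.9))    # 0.9^3 shrinks the 10 to 7.29
print(discounted_return(rewards, 0.5))    # a short-sighted agent values it at only 1.25
```

A higher γ makes the agent more patient; at γ = 0 only the immediate reward counts.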
🎮 Reinforcement Q-Learning

Q-Learning is a value-based RL algorithm that learns the quality (Q-value) of taking each action in each state. It's model-free and can learn optimal policies even without knowing how the environment works!
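One Q-learning update can be sketched on a toy two-state problem (the states, actions, rewards, and hyperparameters below are illustrative assumptions). The rule is Q(s,a) ← Q(s,a) + α[r + γ max Q(s',·) − Q(s,a)]:

```python
# Q-table: state -> {action: value}; all numbers illustrative
Q = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 0.0, "right": 5.0},
}
alpha, gamma = 0.5, 0.9

# The agent takes "right" in s0, receives reward 1, and lands in s1
s, a, r, s_next = "s0", "right", 1.0, "s1"

td_target = r + gamma * max(Q[s_next].values())   # 1 + 0.9 * 5 = 5.5
Q[s][a] += alpha * (td_target - Q[s][a])          # 0 + 0.5 * (5.5 - 0) = 2.75
print(Q["s0"]["right"])  # 2.75
```

Note how value from the promising next state s₁ flows backward into s₀ without any model of the environment.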
🎮 Reinforcement Policy Gradient Methods

Policy Gradient methods directly optimize the policy (action selection strategy) instead of learning value functions. They're powerful for continuous action spaces and stochastic policies!
Policy vs Value-Based Methods

| Aspect | Value-Based (Q-Learning) | Policy-Based |
|---|---|---|
| What it learns | Q(s,a) values | π(a\|s) policy directly |
| Action selection | argmax Q(s,a) | Sample from π(a\|s) |
| Continuous actions | Difficult | Natural |
| Stochastic policy | Indirect | Direct |
| Convergence | Can be unstable | Smoother |
Actor-Critic: Combine policy gradient with value function to reduce variance
PPO (Proximal Policy Optimization): Constrain policy updates for stability
TRPO (Trust Region): Guarantee monotonic improvement

These advances make policy gradients practical for complex tasks like robot control and game playing!
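A minimal REINFORCE sketch shows the core idea of directly nudging the policy toward actions that earned reward. The two-armed bandit, its reward means, the learning rate, and the seed are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)            # policy parameters (action preferences)
alpha = 0.1
true_means = [0.2, 0.8]        # arm 1 pays more on average

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)               # sample action from the policy
    reward = rng.normal(true_means[a], 0.1)  # noisy reward
    # REINFORCE: grad of log pi(a) for a softmax policy is one_hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += alpha * reward * grad_log_pi    # push toward rewarded actions

print(softmax(theta))  # probability mass shifts toward the better arm 1
```

In practice a baseline (the critic in Actor-Critic) is subtracted from the reward to reduce the variance of this update.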
🔄 Comparison Algorithm Comparison Tool