diff --git "a/ml_complete-all-topics/index.html" "b/ml_complete-all-topics/index.html" --- "a/ml_complete-all-topics/index.html" +++ "b/ml_complete-all-topics/index.html" @@ -1,505 +1,509 @@ + Machine Learning: Complete Educational Guide + +
@@ -510,7 +514,7 @@ canvas {
@@ -596,8 +610,12 @@ canvas {

Machine Learning: The Ultimate Learning Platform

-

Master ML through Supervised, Unsupervised & Reinforcement Learning

-

Complete with step-by-step mathematical solutions, interactive visualizations, and real-world examples

+

Master ML through Supervised, Unsupervised & Reinforcement Learning

+

Complete with step-by-step mathematical solutions, + interactive visualizations, and real-world examples

-

📊 Supervised - Regression Linear Regression

+

📊 Supervised + - Regression Linear Regression

-

Linear Regression is one of the simplest and most powerful techniques for predicting continuous values. It finds the "best fit line" through data points.

+

Linear Regression is one of the simplest and most powerful techniques for predicting continuous + values. It finds the "best fit line" through data points.

Key Concepts
@@ -710,12 +746,15 @@ canvas {

Understanding Linear Regression

-

Think of it like this: You want to predict house prices based on size. If you plot size vs. price on a graph, you'll see points scattered around. Linear regression draws the "best" line through these points that you can use to predict prices for houses of any size.

+

Think of it like this: You want to predict house prices based on size. If you plot size vs. price + on a graph, you'll see points scattered around. Linear regression draws the "best" line through + these points that you can use to predict prices for houses of any size.

The Linear Equation: y = mx + c -
where:
y = predicted value (output)
x = input feature
m = slope (how steep the line is)
c = intercept (where line crosses y-axis)
+
where:
y = predicted value (output)
x = input feature
m = slope (how steep + the line is)
c = intercept (where line crosses y-axis)

Example: Predicting Salary from Experience

@@ -729,22 +768,42 @@ canvas { - 139.8 - 248.9 - 357.0 - 468.3 - 577.9 - 685.0 + + 1 + 39.8 + + + 2 + 48.9 + + + 3 + 57.0 + + + 4 + 68.3 + + + 5 + 77.9 + + + 6 + 85.0 + -

We can find a line (y = 7.5x + 32) that predicts: Someone with 7 years experience will earn approximately $84.5k.

+

We can find a best-fit line (ŷ ≈ 9.27x + 30.37) that predicts: Someone with 7 years experience will earn + approximately $95.3k.

-

Figure 1: Scatter plot showing experience vs. salary with the best fit line

+

Figure 1: Scatter plot showing experience vs. salary + with the best fit line

@@ -767,14 +826,16 @@ canvas {
💡 Key Insight
- The "best fit line" is the one that minimizes the total error between actual points and predicted points. We square the errors so positive and negative errors don't cancel out. + The "best fit line" is the one that minimizes the total error between actual points and + predicted points. We square the errors so positive and negative errors don't cancel out.
⚠️ Common Mistake
- Linear regression assumes a straight-line relationship. If your data curves, you need polynomial regression or other techniques! + Linear regression assumes a straight-line relationship. If your data curves, you need + polynomial regression or other techniques!
@@ -785,17 +846,420 @@ canvas {
  • Find values of m and c that minimize prediction errors
  • Use the equation y = mx + c to predict new values
  • + + +
    +

    📐 Complete Mathematical Derivation

    + +

    Let's solve this step-by-step with actual numbers + using our salary data!

    + +
    + Step 1: Organize Our Data

    + Our data points (Experience x, Salary y):
    + (1, 39.8), (2, 48.9), (3, 57.0), (4, 68.3), (5, 77.9), (6, 85.0)

    + Number of data points: n = 6 +
    + +
    + Step 2: Calculate Means (x̄ and ȳ)

    + + Mean of x (x̄):
    + x̄ = (x₁ + x₂ + x₃ + x₄ + x₅ + x₆) / n
    + x̄ = (1 + 2 + 3 + 4 + 5 + 6) / 6
    + x̄ = 21 / 6
    + x̄ = 3.5

    + + Mean of y (ȳ):
    + ȳ = (y₁ + y₂ + y₃ + y₄ + y₅ + y₆) / n
    + ȳ = (39.8 + 48.9 + 57.0 + 68.3 + 77.9 + 85.0) / 6
    + ȳ = 376.9 / 6
    + ȳ = 62.82 +
    + +
    + Step 3: Calculate Slope (m) Using the + Formula

    + + Formula for slope:
    + m = Σ[(xᵢ - x̄)(yᵢ - ȳ)] / Σ[(xᵢ - x̄)²]

    + + Calculate numerator (sum of products of deviations):
    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
xᵢ | yᵢ | xᵢ - x̄ | yᵢ - ȳ | (xᵢ - x̄)(yᵢ - ȳ) | (xᵢ - x̄)²
1 | 39.8 | -2.5 | -23.02 | 57.54 | 6.25
2 | 48.9 | -1.5 | -13.92 | 20.88 | 2.25
3 | 57.0 | -0.5 | -5.82 | 2.91 | 0.25
4 | 68.3 | 0.5 | 5.48 | 2.74 | 0.25
5 | 77.9 | 1.5 | 15.08 | 22.62 | 2.25
6 | 85.0 | 2.5 | 22.18 | 55.46 | 6.25
Sum: | | | | 162.15 | 17.50
    + + Calculate m:
    + m = 162.15 / 17.50
    + m = 9.27 (salary increases by $9.27k per + year of experience) +
    + +
    + Step 4: Calculate Intercept (c)

    + + Formula: c = ȳ - m × x̄

    + + c = 62.82 - (9.27 × 3.5)
    + c = 62.82 - 32.45
    + c = 30.37 (base salary with 0 years + experience) +
    + +
    + Step 5: Our Final Equation!

    + + ŷ = 9.27x + 30.37

    + + Make a Prediction: What salary for 7 years of experience?
    + ŷ = 9.27 × 7 + 30.37
    + ŷ = 64.89 + 30.37
    + ŷ = $95.26k predicted salary +
    + +
    + Step 6: Calculate MSE (How Good is Our + Model?)

    + + For each point, calculate (actual - predicted)²:
    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
x | Actual y | Predicted ŷ | Error (y - ŷ) | Error²
1 | 39.8 | 39.64 | 0.16 | 0.03
2 | 48.9 | 48.91 | -0.01 | 0.00
3 | 57.0 | 58.18 | -1.18 | 1.39
4 | 68.3 | 67.45 | 0.85 | 0.72
5 | 77.9 | 76.72 | 1.18 | 1.39
6 | 85.0 | 85.99 | -0.99 | 0.98
Sum of Squared Errors: 4.51
    + + MSE = Sum of Squared Errors / n
    + MSE = 4.51 / 6
    + MSE = 0.75 (Very low - great fit!) +
    + +
    +
    ✓ What We Learned
    +
    + The Math Summary:
    + 1. m (slope) = Σ[(x-x̄)(y-ȳ)] / Σ[(x-x̄)²] = 9.27
    + 2. c (intercept) = ȳ - m×x̄ = 30.37
    + 3. Final equation: ŷ = 9.27x + 30.37
    + 4. MSE = 0.75 (low error = good model!) +
    +
    +
    +
    +
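The whole derivation above can be reproduced in a few lines of plain NumPy (a sketch; the variable names are ours, not from the page):

```python
import numpy as np

# Salary data from the worked example: (experience in years, salary in $k)
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([39.8, 48.9, 57.0, 68.3, 77.9, 85.0])

# Least-squares slope and intercept (the formulas from Steps 2-4)
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - m * x.mean()

# Mean squared error of the fitted line (Step 6)
mse = np.mean((y - (m * x + c)) ** 2)

print(round(m, 2), round(c, 2), round(mse, 2))  # 9.27 30.39 0.75
```

Note that full precision gives c ≈ 30.39; the 30.37 in the walkthrough comes from plugging the already-rounded slope 9.27 into Step 4. Either way the fitted line and MSE agree to two decimals.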
    + + +
    +
    +

    📊 Supervised + - Regression Polynomial Regression

    + +
    +
    +

    When your data curves and a straight line won't fit, Polynomial Regression extends linear + regression by adding polynomial terms (x², x³, etc.) to capture non-linear relationships.

    + +
    +
    Key Concepts
    +
      +
    • Extends linear regression to fit curves
    • +
    • Uses polynomial features: x, x², x³, etc.
    • +
    • Higher degree = more flexible (but beware overfitting!)
    • +
    • Still linear in parameters (coefficients)
    • +
    +
    + +

    When Linear Fails

    +

    Consider predicting car stopping distance based on speed. The relationship isn't linear - + doubling speed quadruples stopping distance (physics: kinetic energy = ½mv²)!

    + +
    + Linear: y = β₀ + β₁x (straight line)

    + Polynomial Degree 2: y = β₀ + β₁x + β₂x²
    + Polynomial Degree 3: y = β₀ + β₁x + β₂x² + β₃x³
    + Polynomial Degree n: y = β₀ + β₁x + β₂x² + ... + βₙxⁿ +
    + +
    +
    ⚠️ Overfitting Warning!
    +
    + Degree 2-3: Usually safe, captures curves
    + Degree 4-5: Can start overfitting
    + Degree > 5: High risk of overfitting - the model memorizes noise! +
    +
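One way to see the overfitting risk is to fit increasing degrees to a small noisy sample and watch the training error: it can only shrink as the degree grows, even once the extra flexibility is just memorizing noise (a sketch on synthetic data, not the page's dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 10)
y = x**2 + rng.normal(0, 2.0, size=x.shape)  # quadratic ground truth + noise

# Training MSE for polynomial fits of increasing degree.
# Each larger model contains the smaller ones, so training error never rises.
mses = []
for degree in (1, 2, 3, 4, 6):
    coeffs = np.polyfit(x, y, degree)
    mses.append(np.mean((np.polyval(coeffs, x) - y) ** 2))

print([round(m, 3) for m in mses])
```

The fix is to judge each degree on held-out data (cross-validation), where the high-degree fits stop winning.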
    + + +
    +

    📐 Complete Mathematical Derivation

    + +

    Let's fit a quadratic curve to data step-by-step! +

    + +
    + Problem: Predict stopping distance from + speed

    + + + + + + + + + + + + + + + + + + + + + + + + + + +
Speed x (mph) | Stopping Distance y (ft)
10 | 15
20 | 40
30 | 80
40 | 130
50 | 200
    +
    + +
    + Step 1: Create Polynomial Features

    + + For degree 2, we add x² as a new feature:

    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
x (speed) | x² (speed squared) | y (distance)
10 | 100 | 15
20 | 400 | 40
30 | 900 | 80
40 | 1600 | 130
50 | 2500 | 200
    +
    + +
    + Step 2: Matrix Form (Design Matrix)

    + + Model: y = β₀ + β₁x + β₂x²

    + + Design Matrix X:
    +
    +    [1   10   100 ]       [15 ]       [β₀]
    +    [1   20   400 ]       [40 ]       [β₁]
    +X = [1   30   900 ]   y = [80 ]   β = [β₂]
    +    [1   40   1600]       [130]
    +    [1   50   2500]       [200]
    +
    + +
    + Step 3: Solve Using Normal Equation

    + + Normal Equation: β = (XᵀX)⁻¹ Xᵀy

    + + After matrix multiplication (done by computer):

+ + β₀ = 5.0 (base distance)
+ β₁ = 0.3143 (linear component)
+ β₂ = 0.0714 (quadratic component)
    +
    + +
    + Step 4: Final Equation

+ + ŷ = 5.0 + 0.3143x + 0.0714x²

+ + Make Predictions:
+ Speed = 25 mph: ŷ = 5.0 + 0.3143(25) + 0.0714(625) = 5.0 + 7.86 + 44.64 = 57.5 ft
+ Speed = 60 mph: ŷ = 5.0 + 0.3143(60) + 0.0714(3600) = 5.0 + 18.86 + 257.14 = 281.0 ft +
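The normal-equation solve can be checked numerically (a sketch with NumPy; `lstsq` solves the same least-squares problem as β = (XᵀX)⁻¹Xᵀy, but without forming the inverse explicitly):

```python
import numpy as np

speed = np.array([10, 20, 30, 40, 50], dtype=float)
dist = np.array([15, 40, 80, 130, 200], dtype=float)

# Design matrix with columns [1, x, x²], exactly as in Step 2
X = np.column_stack([np.ones_like(speed), speed, speed**2])

# Least-squares solution for beta = [β₀, β₁, β₂]
beta, *_ = np.linalg.lstsq(X, dist, rcond=None)
print(np.round(beta, 4))  # ≈ [5.0, 0.3143, 0.0714]
```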
    + +
    +
    ✓ Key Points
    +
    + Polynomial Regression Summary:
    + 1. Create polynomial features: x → [x, x², x³, ...]
    + 2. Apply standard linear regression on expanded features
    + 3. The model is still "linear" in parameters, just non-linear in input
    + 4. Use cross-validation to choose optimal degree! +
    +
    +
    + +

    Python Code

    +
    +
    +from sklearn.preprocessing import PolynomialFeatures
    +from sklearn.linear_model import LinearRegression
    +
    +# Create polynomial features (degree 2)
    +poly = PolynomialFeatures(degree=2)
    +X_poly = poly.fit_transform(X)
    +
    +# Fit linear regression on polynomial features
    +model = LinearRegression()
    +model.fit(X_poly, y)
    +
    +# Predict
    +y_pred = model.predict(poly.transform(X_new))
    +
    -

    📊 Supervised - Optimization Gradient Descent

    +

    📊 Supervised + - Optimization Gradient Descent

    -

    Gradient Descent is the optimization algorithm that helps us find the best values for our model parameters (like m and c in linear regression). Think of it as rolling a ball downhill to find the lowest point.

    +

    Gradient Descent is the optimization algorithm that helps us find the best values for our model + parameters (like m and c in linear regression). Think of it as rolling a ball downhill to find + the lowest point.

    Key Concepts
    @@ -808,7 +1272,9 @@ canvas {

    Understanding Gradient Descent

    -

    Imagine you're hiking down a mountain in thick fog. You can't see the bottom, but you can feel the slope under your feet. The smart strategy? Always step in the steepest downward direction. That's exactly what gradient descent does with mathematical functions!

    +

    Imagine you're hiking down a mountain in thick fog. You can't see the bottom, but you can feel + the slope under your feet. The smart strategy? Always step in the steepest downward direction. + That's exactly what gradient descent does with mathematical functions!

    💡 The Mountain Analogy
    @@ -823,14 +1289,17 @@ canvas {
    Gradient Descent Update Rule: θ_new = θ_old - α × ∇J(θ) -
    where:
    θ = parameters (m, c)
    α = learning rate (step size)
    ∇J(θ) = gradient (direction and steepness)
    +
    where:
    θ = parameters (m, c)
    α = learning rate (step size)
    ∇J(θ) = gradient + (direction and steepness)

    The Learning Rate (α)

    The learning rate is like your step size when walking down the mountain:

      -
    • Too small: You take tiny steps and it takes forever to reach the bottom
    • -
    • Too large: You take huge leaps and might jump over the valley or even go uphill!
    • +
    • Too small: You take tiny steps and it takes forever to reach the bottom +
    • +
    • Too large: You take huge leaps and might jump over the valley or even go + uphill!
    • Just right: You make steady progress toward the minimum
    @@ -838,7 +1307,8 @@ canvas {
    -

    Figure 2: Loss surface showing gradient descent path to minimum

    +

    Figure 2: Loss surface showing gradient descent path + to minimum

    @@ -861,15 +1331,19 @@ canvas {

    Types of Gradient Descent

      -
    1. Batch Gradient Descent: Uses all data points for each update. Accurate but slow for large datasets.
    2. -
    3. Stochastic Gradient Descent (SGD): Uses one random data point per update. Fast but noisy.
    4. -
    5. Mini-batch Gradient Descent: Uses small batches (e.g., 32 points). Best of both worlds!
    6. +
    7. Batch Gradient Descent: Uses all data points for each update. Accurate but + slow for large datasets.
    8. +
    9. Stochastic Gradient Descent (SGD): Uses one random data point per update. + Fast but noisy.
    10. +
    11. Mini-batch Gradient Descent: Uses small batches (e.g., 32 points). Best of + both worlds!
    ⚠️ Watch Out!
    - Gradient descent can get stuck in local minima (small valleys) instead of finding the global minimum (deepest valley). This is more common with complex, non-convex loss functions. + Gradient descent can get stuck in local minima (small valleys) instead of finding the global + minimum (deepest valley). This is more common with complex, non-convex loss functions.
    @@ -880,17 +1354,165 @@ canvas {
  • Gradients become very small (near zero)
  • We reach maximum iterations (e.g., 1000 steps)
  • + + +
    +

    📐 Complete Mathematical Derivation: Gradient + Descent in Action

    + +

    Let's watch gradient descent optimize a simple + example step-by-step!

    + +
    + Problem Setup: Finding the Minimum of f(x) = + x²

    + We want to find the value of x that minimizes f(x) = x²

    + Settings:
    + • Starting point: x₀ = 4
    + • Learning rate: α = 0.3
    + • Goal: Find x that minimizes x² (answer should be x = 0) +
    + +
    + Step 1: Calculate the Gradient (Derivative)

    + + The gradient tells us which direction increases the function.

    + f(x) = x²
    + f'(x) = d/dx (x²) = 2x

    + + Why 2x?
    + Using the power rule: d/dx (xⁿ) = n × xⁿ⁻¹
    + So: d/dx (x²) = 2 × x²⁻¹ = 2 × x¹ = 2x +
    + +
    + Step 2: Apply the Update Rule Iteratively

    + + Update Formula: x_new = x_old - α × f'(x_old)

    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Iteration | x_old | f'(x) = 2x | α × f'(x) | x_new = x_old - α×f'(x) | f(x) = x²
0 (Start) | 4.000 | | | | 16.00
1 | 4.000 | 2×4 = 8 | 0.3×8 = 2.4 | 4 - 2.4 = 1.600 | 2.56
2 | 1.600 | 2×1.6 = 3.2 | 0.3×3.2 = 0.96 | 1.6 - 0.96 = 0.64 | 0.41
3 | 0.640 | 2×0.64 = 1.28 | 0.3×1.28 = 0.384 | 0.64 - 0.384 = 0.256 | 0.066
4 | 0.256 | 2×0.256 = 0.512 | 0.3×0.512 = 0.154 | 0.256 - 0.154 = 0.102 | 0.010
5 | 0.102 | 2×0.102 = 0.205 | 0.3×0.205 = 0.061 | 0.102 - 0.061 = 0.041 | 0.002
... | ≈ 0 | | | | ≈ 0
    +
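The iteration table can be reproduced with a short loop (a sketch; note that with α = 0.3 each update is simply x ← x − 0.3·2x = 0.4x):

```python
# Gradient descent on f(x) = x**2, starting from x = 4 with alpha = 0.3
alpha = 0.3
x = 4.0
for step in range(1, 6):
    grad = 2 * x            # f'(x) = 2x
    x = x - alpha * grad    # update rule: x_new = x_old - alpha * f'(x_old)
    print(step, round(x, 3), round(x**2, 3))

# After 5 steps x ≈ 0.041 and f(x) ≈ 0.002, matching the table
```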
    + +
    + Step 3: Applying to Linear Regression

    + + For linear regression y = mx + c, we minimize MSE:
    + MSE = (1/n) × Σ(yᵢ - (mxᵢ + c))²

    + + Partial derivatives (gradients):
    + ∂MSE/∂m = (-2/n) × Σ xᵢ(yᵢ - ŷᵢ)
    + ∂MSE/∂c = (-2/n) × Σ (yᵢ - ŷᵢ)

    + + Update rules:
    + m_new = m_old - α × ∂MSE/∂m
    + c_new = c_old - α × ∂MSE/∂c

    + + Each iteration brings m and c closer to optimal values! +
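These update rules can be sketched directly on the salary data (the learning rate and iteration count here are our choices, small enough for gradient descent to converge to the closed-form least-squares line from the earlier section):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([39.8, 48.9, 57.0, 68.3, 77.9, 85.0])

m, c = 0.0, 0.0   # start from an arbitrary guess
alpha = 0.01
n = len(x)

for _ in range(20000):
    y_hat = m * x + c
    grad_m = (-2 / n) * np.sum(x * (y - y_hat))  # ∂MSE/∂m
    grad_c = (-2 / n) * np.sum(y - y_hat)        # ∂MSE/∂c
    m -= alpha * grad_m
    c -= alpha * grad_c

print(round(m, 2), round(c, 2))  # ≈ 9.27 and ≈ 30.39, the least-squares fit
```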
    + +
    +
    ✓ Key Insight
    +
    + Watch what happens:
    + • Started at x = 4, loss = 16
    + • After 5 iterations: x ≈ 0.041, loss ≈ 0.002
    + • The loss dropped from 16 to 0.002 in just 5 + steps!

    + This is the power of gradient descent - it automatically finds the minimum by following + the steepest path downhill! +
    +
    +
    -

    📊 Supervised - Classification Logistic Regression

    +

    📊 Supervised + - Classification Logistic Regression

    -

    Logistic Regression is used for binary classification - when you want to predict categories (yes/no, spam/not spam, disease/healthy) not numbers. Despite its name, it's a classification algorithm!

    +

    Logistic Regression is used for binary classification - when you want to predict categories + (yes/no, spam/not spam, disease/healthy) not numbers. Despite its name, it's a classification + algorithm!

    Key Concepts
    @@ -919,12 +1541,14 @@ canvas {

    Enter the Sigmoid Function

    -

    The sigmoid function σ(z) squashes any input into the range [0, 1], making it perfect for probabilities!

    +

    The sigmoid function σ(z) squashes any input into the range [0, 1], making it perfect for + probabilities!

    Sigmoid Function: σ(z) = 1 / (1 + e^(-z)) -
    where:
    z = w·x + b (linear combination)
    σ(z) = probability (always between 0 and 1)
    e ≈ 2.718 (Euler's number)
    +
    where:
    z = w·x + b (linear combination)
    σ(z) = probability (always between 0 + and 1)
    e ≈ 2.718 (Euler's number)

    Sigmoid Properties:

    @@ -939,9 +1563,10 @@ canvas {
    - +
    -

    Figure: Sigmoid function transforms linear input to probability

    +

    Figure: Sigmoid function transforms linear input to + probability

    Logistic Regression Formula

    @@ -964,24 +1589,50 @@ canvas { - 1500 (Not Tall)0.2 - 16000.35 - 17000.5 - 1801 (Tall)0.65 - 19010.8 - 20010.9 + + 150 + 0 (Not Tall) + 0.2 + + + 160 + 0 + 0.35 + + + 170 + 0 + 0.5 + + + 180 + 1 (Tall) + 0.65 + + + 190 + 1 + 0.8 + + + 200 + 1 + 0.9 +
    - +
    -

    Figure: Logistic regression with decision boundary at 0.5

    +

    Figure: Logistic regression with decision boundary at + 0.5

    Log Loss (Cross-Entropy)

    -

    We can't use MSE for logistic regression because it creates a non-convex optimization surface (multiple local minima). Instead, we use log loss:

    +

    We can't use MSE for logistic regression because it creates a non-convex optimization surface + (multiple local minima). Instead, we use log loss:

    Log Loss for Single Sample: @@ -991,18 +1642,22 @@ canvas {

    Understanding Log Loss:

    Case 1: Actual y=1, Predicted p=0.9

    -

    Loss = -[1·log(0.9) + 0·log(0.1)] = -log(0.9) = 0.105 ✓ Low loss (good!)

    +

    Loss = -[1·log(0.9) + 0·log(0.1)] = -log(0.9) = 0.105 ✓ Low loss + (good!)

    Case 2: Actual y=1, Predicted p=0.1

    -

    Loss = -[1·log(0.1) + 0·log(0.9)] = -log(0.1) = 2.303 ✗ High loss (bad!)

    +

    Loss = -[1·log(0.1) + 0·log(0.9)] = -log(0.1) = 2.303 ✗ High loss + (bad!)

    Case 3: Actual y=0, Predicted p=0.1

    -

    Loss = -[0·log(0.1) + 1·log(0.9)] = -log(0.9) = 0.105 ✓ Low loss (good!)

    +

    Loss = -[0·log(0.1) + 1·log(0.9)] = -log(0.9) = 0.105 ✓ Low loss + (good!)

    💡 Why Log Loss Works
    - Log loss heavily penalizes confident wrong predictions! If you predict 0.99 but the answer is 0, you get a huge penalty. This encourages the model to be accurate AND calibrated. + Log loss heavily penalizes confident wrong predictions! If you predict 0.99 but the answer + is 0, you get a huge penalty. This encourages the model to be accurate AND calibrated.
    @@ -1019,7 +1674,216 @@ canvas {
    ✅ Key Takeaway
    - Logistic regression = Linear regression + Sigmoid function + Log loss. It's called "regression" for historical reasons, but it's actually for classification! + Logistic regression = Linear regression + Sigmoid function + Log loss. It's called + "regression" for historical reasons, but it's actually for classification! +
    +
    + + +
    +

    📐 Complete Mathematical Derivation: Logistic + Regression

    + +

    Let's walk through the entire process with real + numbers!

    + +
    + Problem: Predict if a person is "Tall" based on + height

    + + Training Data:
    + Person 1: Height = 155 cm → Not Tall (y = 0)
    + Person 2: Height = 165 cm → Not Tall (y = 0)
    + Person 3: Height = 175 cm → Tall (y = 1)
    + Person 4: Height = 185 cm → Tall (y = 1)

    + + Given trained weights: w = 0.05, b = -8.5 +
    + +
    + Step 1: Calculate Linear Combination (z)

    + + Formula: z = w × height + b

    + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Height (cm) | z = 0.05 × height - 8.5 | z value
155 | 0.05 × 155 - 8.5 | -0.75
165 | 0.05 × 165 - 8.5 | -0.25
175 | 0.05 × 175 - 8.5 | +0.25
185 | 0.05 × 185 - 8.5 | +0.75
    + + Negative z → likely class 0, Positive z → likely class 1 +
    + +
    + Step 2: Apply Sigmoid Function σ(z)

    + + Sigmoid Formula: σ(z) = 1 / (1 + e⁻ᶻ)

    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
z | e⁻ᶻ | 1 + e⁻ᶻ | σ(z) = 1/(1+e⁻ᶻ) | Interpretation
-0.75 | e⁰·⁷⁵ = 2.117 | 3.117 | 0.32 | 32% chance tall
-0.25 | e⁰·²⁵ = 1.284 | 2.284 | 0.44 | 44% chance tall
+0.25 | e⁻⁰·²⁵ = 0.779 | 1.779 | 0.56 | 56% chance tall
+0.75 | e⁻⁰·⁷⁵ = 0.472 | 1.472 | 0.68 | 68% chance tall
    +
    + +
    + Step 3: Make Predictions (threshold = 0.5)

    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Height | p = σ(z) | p ≥ 0.5? | Prediction | Actual | Correct?
155 | 0.32 | No | 0 (Not Tall) | 0 | ✓
165 | 0.44 | No | 0 (Not Tall) | 0 | ✓
175 | 0.56 | Yes | 1 (Tall) | 1 | ✓
185 | 0.68 | Yes | 1 (Tall) | 1 | ✓
    + + 100% accuracy on training data! +
    + +
    + Step 4: Calculate Log Loss (Cross-Entropy)

    + + Formula: L = -[y × log(p) + (1-y) × log(1-p)]

    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
y (actual) | p (predicted) | Calculation | Loss
0 | 0.32 | -[0×log(0.32) + 1×log(0.68)] | 0.39
0 | 0.44 | -[0×log(0.44) + 1×log(0.56)] | 0.58
1 | 0.56 | -[1×log(0.56) + 0×log(0.44)] | 0.58
1 | 0.68 | -[1×log(0.68) + 0×log(0.32)] | 0.39
Average Log Loss: (0.39 + 0.58 + 0.58 + 0.39) / 4 = 0.485
    +
    + +
    +
    ✓ Summary of Logistic Regression Math
    +
    + The Complete Pipeline:
    + 1. Linear: z = w×x + b (compute a score)
    + 2. Sigmoid: p = 1/(1+e⁻ᶻ) (convert score to probability 0-1)
    + 3. Threshold: if p ≥ 0.5, predict class 1; else predict class 0
    + 4. Loss: Log Loss = -[y×log(p) + (1-y)×log(1-p)]
    + 5. Train: Use gradient descent to minimize total log loss! +
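The whole pipeline fits in a few lines of plain Python, using the same four heights and the given weights w = 0.05, b = −8.5:

```python
import math

heights = [155, 165, 175, 185]
labels = [0, 0, 1, 1]
w, b = 0.05, -8.5   # the trained weights from the walkthrough

losses = []
for height, y in zip(heights, labels):
    z = w * height + b                   # Step 1: linear score
    p = 1 / (1 + math.exp(-z))           # Step 2: sigmoid -> probability
    pred = 1 if p >= 0.5 else 0          # Step 3: threshold at 0.5
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))  # Step 4: log loss
    losses.append(loss)
    assert pred == y                     # all four points classified correctly

print(round(sum(losses) / len(losses), 2))  # ≈ 0.48 average log loss
```

Full precision gives an average loss of about 0.481; the table's 0.485 comes from averaging the already-rounded per-row losses.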
    @@ -1028,13 +1892,17 @@ canvas {
    -

    📊 Supervised - Classification Support Vector Machines (SVM)

    +

    📊 Supervised + - Classification Support Vector Machines (SVM)

    What is SVM?

    -

    Support Vector Machine (SVM) is a powerful supervised machine learning algorithm used for both classification and regression tasks. Unlike logistic regression which just needs any line that separates the classes, SVM finds the BEST decision boundary - the one with the maximum margin between classes.

    +

    Support Vector Machine (SVM) is a powerful supervised machine learning algorithm used for both + classification and regression tasks. Unlike logistic regression which just needs any line that + separates the classes, SVM finds the BEST decision boundary - the one with the maximum margin + between classes.

    Key Concepts
    @@ -1049,7 +1917,8 @@ canvas {
    💡 Key Insight
    - SVM doesn't just want w·x + b > 0, it wants every point to be confidently far from the boundary. The score is directly proportional to the distance from the decision boundary! + SVM doesn't just want w·x + b > 0, it wants every point to be confidently far from the + boundary. The score is directly proportional to the distance from the decision boundary!
    @@ -1067,12 +1936,42 @@ canvas { - A27+1 - B38+1 - C47+1 - D62-1 - E73-1 - F82-1 + + A + 2 + 7 + +1 + + + B + 3 + 8 + +1 + + + C + 4 + 7 + +1 + + + D + 6 + 2 + -1 + + + E + 7 + 3 + -1 + + + F + 8 + 2 + -1 + @@ -1080,12 +1979,14 @@ canvas {

    Decision Boundary

    -

    The decision boundary is a line (or hyperplane in higher dimensions) that separates the two classes. It's defined by the equation:

    +

    The decision boundary is a line (or hyperplane in higher dimensions) that separates the two + classes. It's defined by the equation:

    Decision Boundary Equation: w·x + b = 0 -
    where:
    w = [w₁, w₂] is the weight vector
    x = [x₁, x₂] is the data point
    b is the bias term
    +
    where:
    w = [w₁, w₂] is the weight vector
    x = [x₁, x₂] is the data point
    b is + the bias term
    @@ -1099,9 +2000,10 @@ canvas {
    - +
    -

    Figure 3: SVM decision boundary with 6 data points. Hover to see scores.

    +

    Figure 3: SVM decision boundary with 6 data points. + Hover to see scores.

    @@ -1121,11 +2023,14 @@ canvas {

    Margin and Support Vectors

    - +
    📏 Understanding Margin
    - The margin is the distance between the decision boundary and the closest points from each class. Support vectors are the points exactly at the margin (with score = ±1). These are the points with "lowest acceptable confidence" and they're the only ones that matter for defining the boundary! + The margin is the distance between the decision boundary and the closest + points from each class. Support vectors are the points exactly at the + margin (with score = ±1). These are the points with "lowest acceptable confidence" and + they're the only ones that matter for defining the boundary!
    @@ -1142,16 +2047,18 @@ canvas {
    - +
    -

    Figure 4: Decision boundary with margin lines and support vectors highlighted in cyan

    +

    Figure 4: Decision boundary with margin lines and + support vectors highlighted in cyan

    Hard Margin vs Soft Margin

    Hard Margin SVM

    -

    Hard margin SVM requires perfect separation - no points can violate the margin. It works only when data is linearly separable.

    +

    Hard margin SVM requires perfect separation - no points can violate the margin. It works only + when data is linearly separable.

    Hard Margin Optimization: @@ -1162,61 +2069,74 @@ canvas {
    ⚠️ Hard Margin Limitation
    - Hard margin can lead to overfitting if we force perfect separation on noisy data! Real-world data often has outliers and noise. + Hard margin can lead to overfitting if we force perfect separation on noisy data! Real-world + data often has outliers and noise.

    Soft Margin SVM

    -

    Soft margin SVM allows some margin violations, making it more practical for real-world data. It balances margin maximization with allowing some misclassifications.

    +

    Soft margin SVM allows some margin violations, making it more practical for real-world data. It + balances margin maximization with allowing some misclassifications.

    Soft Margin Cost Function: Cost = (1/2)||w||² + C·Σ max(0, 1 - yᵢ(w·xᵢ + b))
          ↓                           ↓
    Maximize margin      Hinge Loss
    -                           (penalize violations) +                           (penalize + violations)
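To make the hinge-loss term concrete, here is the penalty computed for the six example points under an assumed boundary w = [−1, 1], b = 0 (our illustrative choice, not the trained SVM). A point contributes zero as soon as its margin yᵢ(w·xᵢ + b) reaches 1:

```python
# Example points (x1, x2, label) from the table above
points = [(2, 7, +1), (3, 8, +1), (4, 7, +1),
          (6, 2, -1), (7, 3, -1), (8, 2, -1)]

w, b = (-1.0, 1.0), 0.0   # assumed separating boundary, for illustration only

total_hinge = 0.0
for x1, x2, y in points:
    score = w[0] * x1 + w[1] * x2 + b
    margin = y * score                     # confidence, signed by the true label
    total_hinge += max(0.0, 1.0 - margin)  # hinge loss: 0 once margin >= 1

print(total_hinge)  # 0.0 -- every point lies outside the margin for this w, b
```

With C multiplying this sum in the cost, any point that slipped inside the margin would immediately add to the objective being minimized.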

    The C Parameter

    -

    The C parameter controls the trade-off between maximizing the margin and minimizing classification errors. It acts like regularization in other ML algorithms.

    +

    The C parameter controls the trade-off between maximizing the margin and minimizing + classification errors. It acts like regularization in other ML algorithms.

    Effects of C Parameter
      -
    • Small C (0.1 or 1): Wider margin, more violations allowed, better generalization, use when data is noisy
    • -
    • Large C (1000): Narrower margin, fewer violations, classify everything correctly, risk of overfitting, use when data is clean
    • +
    • Small C (0.1 or 1): Wider margin, more violations allowed, better + generalization, use when data is noisy
    • +
    • Large C (1000): Narrower margin, fewer violations, classify everything + correctly, risk of overfitting, use when data is clean
    - +
    -

    Figure 5: Effect of C parameter on margin and violations

    +

    Figure 5: Effect of C parameter on margin and + violations

    -

    Slide to see: 0.1 → 1 → 10 → 1000

    +

    Slide to see: 0.1 → 1 → 10 → + 1000

    -
    +
    Margin Width
    -
    2.00
    +
    2.00 +
    -
    +
    Violations
    -
    0
    +
    0 +

    Training Algorithm

    -

    SVM can be trained using gradient descent. For each training sample (xᵢ, yᵢ), we check if it violates the margin and update weights accordingly.

    +

    SVM can be trained using gradient descent. For each training sample (xᵢ, yᵢ), we check if it + violates the margin and update weights accordingly.

    Update Rules:
    @@ -1234,9 +2154,10 @@ canvas {
    - +
    -

    Figure 6: SVM training visualization - step through each point

    +

    Figure 6: SVM training visualization - step through + each point

    @@ -1245,7 +2166,8 @@ canvas {
    -
    +
    Step: 0 / 6
    Current Point: -
    w = [0.00, 0.00]
    @@ -1268,12 +2190,14 @@ canvas {

    SVM Kernels (Advanced)

    -

    Real-world data is often not linearly separable. Kernels transform data to higher dimensions where a linear boundary exists, which appears non-linear in the original space!

    +

    Real-world data is often not linearly separable. Kernels transform data to higher dimensions + where a linear boundary exists, which appears non-linear in the original space!

    💡 The Kernel Trick
    - Kernels let us solve non-linear problems without explicitly computing high-dimensional features! They compute similarity between points in transformed space efficiently. + Kernels let us solve non-linear problems without explicitly computing high-dimensional + features! They compute similarity between points in transformed space efficiently.
    @@ -1296,7 +2220,7 @@ canvas {
    - +

    Figure 7: Kernel comparison on non-linear data

    @@ -1318,7 +2242,7 @@ canvas {

    Key Formulas Summary

    - +
    Essential SVM Formulas:

    @@ -1353,7 +2277,8 @@ canvas {
    ✅ Why SVM is Powerful
    - SVM only cares about support vectors - the points closest to the boundary. Other points don't affect the decision boundary at all! This makes it memory efficient and robust. + SVM only cares about support vectors - the points closest to the boundary. Other points + don't affect the decision boundary at all! This makes it memory efficient and robust.
    @@ -1369,7 +2294,8 @@ canvas {

    Advantages

      -
    • Effective in high dimensions: Works well even when features > samples
    • +
    • Effective in high dimensions: Works well even when features > samples +
    • Memory efficient: Only stores support vectors, not entire dataset
    • Versatile: Different kernels for different data patterns
    • Robust: Works well with clear margin of separation
    • @@ -1377,7 +2303,8 @@ canvas {

      Disadvantages

        -
      • Slow on large datasets: Training time grows quickly with >10k samples
      • +
      • Slow on large datasets: Training time grows quickly with >10k samples +
      • No probability estimates: Doesn't directly provide confidence scores
      • Kernel choice: Requires expertise to select right kernel
      • Feature scaling: Very sensitive to feature scales
      • @@ -1385,7 +2312,7 @@ canvas {

        Real-World Example: Email Spam Classification

        - +
        📧 Email Spam Detection

        Imagine we have emails with two features:

        @@ -1394,14 +2321,48 @@ canvas {
      • x₂ = number of capital letters

      - SVM finds the widest "road" between spam and non-spam emails. Support vectors are the emails closest to this road - they're the trickiest cases that define our boundary! An email far from the boundary is clearly spam or clearly legitimate. + SVM finds the widest "road" between spam and non-spam emails. Support vectors are the emails + closest to this road - they're the trickiest cases that define our boundary! An email far + from the boundary is clearly spam or clearly legitimate.

    +

    Python Code

    +
    +
+import numpy as np
+from sklearn.svm import SVC
+from sklearn.preprocessing import StandardScaler
+from sklearn.model_selection import train_test_split
+from sklearn.datasets import make_blobs
+
+# Synthetic two-class data (a stand-in for your own features)
+X, y = make_blobs(n_samples=200, centers=2, random_state=42)
+X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
+
+# Scale features (very important for SVM!)
+scaler = StandardScaler()
+X_train_scaled = scaler.fit_transform(X_train)
+X_test_scaled = scaler.transform(X_test)
+
+# Create SVM with RBF kernel
+svm = SVC(
+    kernel='rbf',      # Options: 'linear', 'poly', 'rbf'
+    C=1.0,             # Regularization parameter
+    gamma='scale'      # Kernel coefficient
+)
+
+# Train
+svm.fit(X_train_scaled, y_train)
+
+# Predict
+predictions = svm.predict(X_test_scaled)
+
+# Get support vectors
+print(f"Number of support vectors: {len(svm.support_vectors_)}")
    +
    +
    🎯 Key Takeaway
    Unlike other algorithms that try to classify all points correctly, SVM focuses on the decision boundary. It asks: "What's the safest road I can build between these two groups?" The answer: Make it as wide as possible!

    📊 Supervised - Classification K-Nearest Neighbors (KNN)


    K-Nearest Neighbors is the simplest machine learning algorithm! To classify a new point, just look at its K nearest neighbors and take a majority vote. No training required!


    Key Concepts

    How KNN Works

    1. Choose K: Decide how many neighbors (e.g., K=3)
    2. Calculate distance: Find distance from new point to all training points
    3. Find K nearest: Select K points with smallest distances
    4. Vote: Majority class wins (or take average for regression)

    Distance Metrics

    Euclidean Distance (straight line): d = √[(x₁-x₂)² + (y₁-y₂)²]

    Figure: KNN classification - drag the test point to see predictions

    Point   Position     Class    Distance
    A       (1.0, 2.0)   Orange   1.80
    B       (0.9, 1.7)   Orange   2.00
    C       (1.5, 2.5)   Orange   1.00 ← nearest!
    D       (4.0, 5.0)   Yellow   3.35
    E       (4.2, 4.8)   Yellow   3.15
    F       (3.8, 5.2)   Yellow   3.12
    ⚠️ Critical: Feature Scaling!
    Always scale features before using KNN! If one feature has range [0, 1000] and another [0, 1], the large feature dominates distance calculations. Use StandardScaler or MinMaxScaler.
    💡 When to Use KNN
    KNN works best with small to medium datasets (<10,000 samples) with few features (<20). Great for recommendation systems, pattern recognition, and as a baseline to compare other models!

    📐 Complete Mathematical Derivation: KNN Classification

    Let's classify a new point step-by-step with actual calculations!

    Problem: Classify a new fruit

    Training Data:
    Fruit   Weight (g)   Size (cm)   Class
    A       140          7           Apple
    B       150          7.5         Apple
    C       180          9           Orange
    D       200          10          Orange
    E       160          8           Orange

    New point to classify: Weight = 165g, Size = 8.5cm
    Using K = 3 (3 nearest neighbors)

    Step 1: Calculate Euclidean Distance to ALL Points

    Distance Formula: d = √[(x₂-x₁)² + (y₂-y₁)²]

    Point   Calculation                                  Distance
    A       √[(165-140)² + (8.5-7)²] = √[625 + 2.25]     25.04
    B       √[(165-150)² + (8.5-7.5)²] = √[225 + 1]      15.03
    C       √[(165-180)² + (8.5-9)²] = √[225 + 0.25]     15.01
    D       √[(165-200)² + (8.5-10)²] = √[1225 + 2.25]   35.03
    E       √[(165-160)² + (8.5-8)²] = √[25 + 0.25]       5.02

    Step 2: Find K=3 Nearest Neighbors

    Sort by distance:

    Rank   Point   Distance   Class    Include?
    1st    E       5.02       Orange   ✓ Yes
    2nd    C       15.01      Orange   ✓ Yes
    3rd    B       15.03      Apple    ✓ Yes
    4th    A       25.04      Apple    ✗ No
    5th    D       35.03      Orange   ✗ No

    Step 3: Vote Among K Neighbors

    K=3 Neighbors:
    • E: Orange (1 vote)
    • C: Orange (1 vote)
    • B: Apple (1 vote)

    Final Vote Count:
    • Orange: 2 votes
    • Apple: 1 vote

    🍊 Prediction: ORANGE (majority wins!)

    ✓ KNN Math Summary

    The KNN Algorithm:
    1. Calculate distance from new point to ALL training points
    2. Sort distances from smallest to largest
    3. Pick K nearest neighbors
    4. Vote: Classification = majority class, Regression = average value

    Note: Always normalize features first! Weight (100s) would dominate Size (10s) otherwise!
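The whole derivation can be checked in a few lines of plain Python — a sketch using the fruit data from this example (`math.dist` computes the Euclidean distance):

```python
import math

# Training data from the worked example: (weight g, size cm) -> class
train = [
    ((140, 7.0), "Apple"),
    ((150, 7.5), "Apple"),
    ((180, 9.0), "Orange"),
    ((200, 10.0), "Orange"),
    ((160, 8.0), "Orange"),
]
new_point = (165, 8.5)
k = 3

# Step 1 + 2: Euclidean distance to every training point, sorted ascending
dists = sorted((math.dist(new_point, xy), label) for xy, label in train)

# Step 3: the K nearest neighbors vote; majority class wins
votes = {}
for _, label in dists[:k]:
    votes[label] = votes.get(label, 0) + 1
prediction = max(votes, key=votes.get)
print(prediction)  # Orange (2 votes to 1)
```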

    Python Code

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.preprocessing import StandardScaler

    # Scale features (essential for KNN!)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Create KNN classifier
    knn = KNeighborsClassifier(
        n_neighbors=5,        # Number of neighbors (K)
        metric='euclidean',   # Distance metric
        weights='uniform'     # 'uniform' or 'distance'
    )

    # Train (just stores the data!)
    knn.fit(X_train_scaled, y_train)

    # Predict
    predictions = knn.predict(X_test_scaled)

    # Get probabilities
    probas = knn.predict_proba(X_test_scaled)

    📊 Supervised - Evaluation Model Evaluation


    How do we know if our model is good? Model evaluation provides metrics to measure performance and identify problems!


    Key Metrics

    Figure: Confusion matrix for spam detection (TP=600, FP=100, FN=300, TN=900)


    Classification Metrics

    Accuracy: Percentage of correct predictions overall

    Example: (600 + 900) / (600 + 900 + 100 + 300) = 1500/1900 = 0.789 (78.9%)


    ⚠️ Accuracy Paradox
    Accuracy misleads on imbalanced data! If 99% of emails are not spam, a model that always predicts "not spam" gets 99% accuracy but is useless!
    Precision: TP / (TP + FP)

    Example: 600 / (600 + 100) = 600/700 = 0.857 (85.7%)


    Use when: False positives are costly (e.g., spam filter - don't want to block legitimate emails)


    Recall (Sensitivity, TPR): TP / (TP + FN)

    Example: 600 / (600 + 300) = 600/900 = 0.667 (66.7%)


    Use when: False negatives are costly (e.g., disease detection - can't miss sick patients)


    F1-Score: 2 × (Precision × Recall) / (Precision + Recall)
    Harmonic mean - balances precision and recall

    Example: 2 × (0.857 × 0.667) / (0.857 + 0.667) = 0.750 (75.0%)

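All four metrics can be reproduced directly from the confusion-matrix counts used above (TP=600, FP=100, FN=300, TN=900):

```python
# Metrics from the spam-detection confusion matrix above
TP, FP, FN, TN = 600, 100, 300, 900

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")   # 0.789
print(f"Precision: {precision:.3f}")  # 0.857
print(f"Recall:    {recall:.3f}")     # 0.667
print(f"F1-score:  {f1:.3f}")         # 0.750
```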

    ROC Curve & AUC


    The ROC (Receiver Operating Characteristic) curve shows model performance across ALL possible thresholds!


    ROC Components: the curve plots True Positive Rate (TPR) on the y-axis against False Positive Rate (FPR) on the x-axis, traced as the decision threshold varies

    Figure: ROC curve - slide threshold to see trade-off

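scikit-learn can compute the full curve and its area; a minimal sketch with four hypothetical scores (not from the text):

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Tiny illustrative example: two negatives, two positives
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# One (FPR, TPR) point per threshold
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)
print(f"AUC = {auc:.2f}")  # 0.75
```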


    Regression Metrics: R² Score


    For regression problems, R² (coefficient of determination) measures how well the model explains variance:


    R² Formula: R² = 1 - (SS_res / SS_tot)

    Figure: R² calculation on height-weight regression

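A quick check of the formula with hypothetical values (SS_res = 2, SS_tot = 5, so R² = 1 - 2/5):

```python
from sklearn.metrics import r2_score

# Hypothetical values to illustrate R² = 1 - SS_res/SS_tot
y_true = [1, 2, 3, 4]
y_pred = [2, 2, 3, 3]

# SS_res = (1-2)² + 0 + 0 + (4-3)² = 2
# SS_tot = sum((y - mean)²) with mean = 2.5 → 5.0
r2 = r2_score(y_true, y_pred)
print(r2)  # 0.6
```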


    Regularization prevents overfitting by penalizing complex models. It adds a "simplicity constraint" to force the model to generalize better!


    Key Concepts
    ⚠️ Overfitting Example
    Imagine fitting a 10th-degree polynomial to 12 data points. It perfectly fits training data (even noise) but fails on new data. Regularization prevents this!
    Regularized Cost Function: Cost = Loss + λ × Penalty(θ)
    where:
    θ = model parameters (weights)
    λ = regularization strength
    Penalty = function of parameter magnitudes

    L1 Regularization (Lasso)


    Figure: Comparing vanilla, L1, and L2 regularization effects



    Practical Example

    Predicting house prices with 10 features (size, bedrooms, age, etc.):


    Without regularization: All features have large, varying coefficients. Model overfits noise.



    With L1: Only 4 features remain (size, location, bedrooms, age). Others set to 0. Simpler, more interpretable!



    With L2: All features kept but coefficients shrunk. More stable predictions, handles correlated features well.


    ✅ Key Takeaway
    Regularization is like adding a "simplicity tax" to your model. Complex models pay more tax, encouraging simpler solutions that generalize better!
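The L1-vs-L2 behavior described above can be seen on synthetic data — a sketch under the hypothetical assumption that only the first of five features actually drives the target:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical dataset: y depends only on the first feature;
# the other four features are pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.5).fit(X, y)   # L1: drives noise coefficients to exactly 0
ridge = Ridge(alpha=10.0).fit(X, y)  # L2: keeps all features, shrinks coefficients

print(lasso.coef_.round(3))  # only the first coefficient stays clearly non-zero
print(ridge.coef_.round(3))  # all five kept, but shrunk toward 0
```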

    Every model makes two types of errors: bias and variance. The bias-variance tradeoff is the fundamental challenge in machine learning - we must balance them!


    Key Concepts

    Understanding Bias


    Bias is the error from overly simplistic assumptions. High bias causes underfitting.


    Characteristics of High Bias:

      🎯 High Bias Example
      Trying to fit a parabola with a straight line. No matter how much training data you have, a line can't capture the curve. That's bias!

      Understanding Variance


      Variance is the error from sensitivity to small fluctuations in training data. High variance causes overfitting.


      Characteristics of High Variance:

        📊 High Variance Example
        A wiggly curve that passes through every training point perfectly, including outliers. Change one data point and the entire curve changes dramatically. That's variance!

        Figure: Three models showing underfitting, good fit, and overfitting


        The Driving Test Analogy

        Driving Test Analogy
      • High Bias (Underfitting):
        Failed practice tests, failed real test
        → Can't learn to drive at all
      • Good Balance:
        Passed practice tests, passed real test
        → Actually learned to drive!
      • High Variance (Overfitting):
        Perfect on practice tests, failed real test
        → Memorized practice, didn't truly learn

          Model Complexity Curve


          Figure: Error vs model complexity - find the sweet spot


          ✅ Key Takeaway
          The bias-variance tradeoff is unavoidable. You can't have zero bias AND zero variance. The art of machine learning is finding the sweet spot where total error is minimized!

    🧠 Neural Networks The Perceptron

    The Perceptron is the simplest neural network - just one neuron! It's the building block of all deep learning and was invented in 1958. Understanding it is key to understanding neural networks.


    Key Concepts

    • Single artificial neuron
    • Takes multiple inputs, produces one output
    • Uses weights to determine importance of inputs
    • Applies activation function to make decision


        How a Perceptron Works


    1. Weighted Sum: z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
    2. Activation: output = activation(z)

    Step Function (Original): output = 1 if z > 0, else 0
    Sigmoid (Modern): output = 1/(1 + e⁻ᶻ)


    📐 Complete Mathematical Derivation: Perceptron

    Let's build a simple AND gate with a perceptron!

    Problem: Learn the AND logic gate

    x₁   x₂   AND Output
    0    0    0
    0    1    0
    1    0    0
    1    1    1

    Given weights: w₁ = 0.5, w₂ = 0.5, b = -0.7

    Step 1: Compute Weighted Sum for Each Input

    Formula: z = w₁x₁ + w₂x₂ + b

    x₁   x₂   z = 0.5x₁ + 0.5x₂ - 0.7   z value
    0    0    0.5(0) + 0.5(0) - 0.7     -0.7
    0    1    0.5(0) + 0.5(1) - 0.7     -0.2
    1    0    0.5(1) + 0.5(0) - 0.7     -0.2
    1    1    0.5(1) + 0.5(1) - 0.7     +0.3

    Step 2: Apply Step Activation Function

    Step Function: output = 1 if z > 0, else 0

    x₁   x₂   z      z > 0?   Output   Expected   Match?
    0    0    -0.7   No       0        0          ✓
    0    1    -0.2   No       0        0          ✓
    1    0    -0.2   No       0        0          ✓
    1    1    +0.3   Yes      1        1          ✓

    🎉 The perceptron perfectly learns the AND gate!

    Step 3: Perceptron Learning Rule (How to Find Weights)

    Update Rule: w_new = w_old + α × (target - output) × input

    Where α = learning rate (e.g., 0.1)

    Example update:
    If prediction was 0 but target was 1 (error = 1):
    w₁_new = 0.5 + 0.1 × (1 - 0) × 1 = 0.5 + 0.1 = 0.6

    Weights increase for inputs that should have been positive!

    ✓ Perceptron Summary

    The Perceptron Algorithm:
    1. Initialize weights randomly
    2. For each training example: compute z = Σ(wᵢxᵢ) + b
    3. Apply activation: output = step(z)
    4. Update weights if wrong: w += α × error × input
    5. Repeat until all examples correct (or max iterations)
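The AND-gate derivation above can be verified in a few lines of Python, using the given weights w₁ = 0.5, w₂ = 0.5, b = -0.7:

```python
# Verify the AND-gate perceptron from the derivation above
w1, w2, b = 0.5, 0.5, -0.7

def perceptron(x1, x2):
    z = w1 * x1 + w2 * x2 + b   # weighted sum
    return 1 if z > 0 else 0    # step activation

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", perceptron(x1, x2))
# 0 0 -> 0, 0 1 -> 0, 1 0 -> 0, 1 1 -> 1  (exactly the AND truth table)
```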

    ⚠️ Perceptron Limitation

    A single perceptron can only learn linearly separable patterns. It CANNOT learn XOR!
    This is why we need multi-layer networks (next section).

    🧠 Neural Networks Multi-Layer Perceptron (MLP)

    A Multi-Layer Perceptron (MLP) stacks multiple layers of neurons to learn complex, non-linear patterns. This is the foundation of deep learning!


    Network Architecture

    • Input Layer: Receives features (one neuron per feature)
    • Hidden Layer(s): Learn abstract representations
    • Output Layer: Produces final prediction
    • Weights: Connect neurons between layers

        Activation Functions


    Sigmoid: σ(z) = 1/(1 + e⁻ᶻ) → output (0, 1)
    ReLU: f(z) = max(0, z) → output [0, ∞)
    Tanh: tanh(z) = (eᶻ - e⁻ᶻ)/(eᶻ + e⁻ᶻ) → output (-1, 1)
    Softmax: For multi-class classification


    📐 Complete Mathematical Derivation: Forward Propagation

    Let's trace through a small neural network step-by-step!

    Network Architecture: 2 → 2 → 1

    • Input layer: 2 neurons (x₁, x₂)
    • Hidden layer: 2 neurons (h₁, h₂)
    • Output layer: 1 neuron (ŷ)

    Given Weights:
    W₁ (input→hidden): [[0.1, 0.3], [0.2, 0.4]]
    b₁ (hidden bias): [0.1, 0.1]
    W₂ (hidden→output): [[0.5], [0.6]]
    b₂ (output bias): [0.2]

    Step 1: Forward Pass - Input to Hidden Layer

    Input: x = [1.0, 2.0]

    Hidden neuron h₁:
    z₁ = w₁₁×x₁ + w₁₂×x₂ + b₁
    z₁ = 0.1×1.0 + 0.2×2.0 + 0.1
    z₁ = 0.1 + 0.4 + 0.1 = 0.6
    h₁ = sigmoid(0.6) = 1/(1 + e⁻⁰·⁶) = 0.646

    Hidden neuron h₂:
    z₂ = w₂₁×x₁ + w₂₂×x₂ + b₂
    z₂ = 0.3×1.0 + 0.4×2.0 + 0.1
    z₂ = 0.3 + 0.8 + 0.1 = 1.2
    h₂ = sigmoid(1.2) = 1/(1 + e⁻¹·²) = 0.769

    Step 2: Forward Pass - Hidden to Output Layer

    Hidden layer output: h = [0.646, 0.769]

    Output neuron:
    z_out = w₁×h₁ + w₂×h₂ + b
    z_out = 0.5×0.646 + 0.6×0.769 + 0.2
    z_out = 0.323 + 0.461 + 0.2 = 0.984

    ŷ = sigmoid(0.984) = 1/(1 + e⁻⁰·⁹⁸⁴)
    ŷ = 0.728 (Final Prediction!)

    Step 3: Calculate Loss

    Binary Cross-Entropy Loss:
    L = -[y×log(ŷ) + (1-y)×log(1-ŷ)]

    If true label y = 1:
    L = -[1×log(0.728) + 0×log(0.272)]
    L = -log(0.728)
    L = 0.317 (Loss value)

    Lower loss = better prediction!

    Step 4: Backpropagation (Gradient Calculation)

    Chain Rule: ∂L/∂w = ∂L/∂ŷ × ∂ŷ/∂z × ∂z/∂w

    Output layer gradient:
    ∂L/∂ŷ × ∂ŷ/∂z = ŷ - y for sigmoid + binary cross-entropy
    δ_output = 0.728 - 1 = -0.272

    Hidden layer gradient:
    δ_hidden = δ_output × W₂ × h × (1-h)

    Gradients flow backward to update all weights!

    ✓ Neural Network Training Summary

    The Full Training Loop:
    1. Forward Pass: Input → Hidden → Output (calculate prediction)
    2. Loss Calculation: Compare prediction to true value
    3. Backward Pass: Calculate gradients using chain rule
    4. Update Weights: w = w - α × gradient
    5. Repeat for many epochs until loss minimizes!
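The forward pass above can be reproduced with a few NumPy operations, using exactly the weights from the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weights from the worked example (rows of W1 = inputs, columns = hidden units)
W1 = np.array([[0.1, 0.3],
               [0.2, 0.4]])
b1 = np.array([0.1, 0.1])
W2 = np.array([0.5, 0.6])
b2 = 0.2

x = np.array([1.0, 2.0])

h = sigmoid(x @ W1 + b1)        # hidden activations: ≈ [0.646, 0.769]
y_hat = sigmoid(h @ W2 + b2)    # output prediction: ≈ 0.728

loss = -np.log(y_hat)           # binary cross-entropy for true label y = 1
print(h.round(3), round(float(y_hat), 3))
```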

    📐 Complete Backpropagation Derivation (Line-by-Line)

    Let's derive backpropagation step-by-step using the network from the forward pass example!

    Recap: Network Architecture & Forward Pass Results

    Network: 2 inputs → 2 hidden → 1 output
    Input: x = [1.0, 2.0], True label: y = 1

    Forward Pass Results:
    • Hidden layer: h₁ = 0.646, h₂ = 0.769
    • Output: ŷ = 0.728
    • Loss: L = 0.317

    Step 1: Output Layer Error (δ_output)

    Goal: Calculate ∂L/∂z_out (gradient of loss w.r.t. output before activation)

    Using Chain Rule:
    δ_output = ∂L/∂z_out = ∂L/∂ŷ × ∂ŷ/∂z_out

    For Binary Cross-Entropy + Sigmoid, this simplifies to:
    δ_output = ŷ - y
    δ_output = 0.728 - 1
    δ_output = -0.272

    Step 2: Gradients for Hidden→Output Weights (W₂)

    Formula: ∂L/∂W₂ = δ_output × h (hidden layer output)

    Calculation:
    ∂L/∂w₁(h→o) = δ_output × h₁ = -0.272 × 0.646 = -0.176
    ∂L/∂w₂(h→o) = δ_output × h₂ = -0.272 × 0.769 = -0.209

    Bias gradient:
    ∂L/∂b₂ = δ_output = -0.272

    Step 3: Backpropagate Error to Hidden Layer (δ_hidden)

    The Key Insight: Hidden neurons contributed to output error based on their weights!

    Formula: δ_hidden = (W₂ᵀ × δ_output) ⊙ σ'(z_hidden)

    Sigmoid derivative: σ'(z) = σ(z) × (1 - σ(z)) = h × (1 - h)

    For hidden neuron h₁:
    σ'(z₁) = h₁ × (1 - h₁) = 0.646 × (1 - 0.646) = 0.646 × 0.354 = 0.229
    δ₁ = w₁(h→o) × δ_output × σ'(z₁)
    δ₁ = 0.5 × (-0.272) × 0.229 = -0.031

    For hidden neuron h₂:
    σ'(z₂) = h₂ × (1 - h₂) = 0.769 × (1 - 0.769) = 0.769 × 0.231 = 0.178
    δ₂ = w₂(h→o) × δ_output × σ'(z₂)
    δ₂ = 0.6 × (-0.272) × 0.178 = -0.029

    Step 4: Gradients for Input→Hidden Weights (W₁)

    Formula: ∂L/∂W₁ = δ_hidden × x (input)

    Input: x = [1.0, 2.0]

    Gradients for weights to h₁:
    ∂L/∂w₁₁ = δ₁ × x₁ = -0.031 × 1.0 = -0.031
    ∂L/∂w₁₂ = δ₁ × x₂ = -0.031 × 2.0 = -0.062

    Gradients for weights to h₂:
    ∂L/∂w₂₁ = δ₂ × x₁ = -0.029 × 1.0 = -0.029
    ∂L/∂w₂₂ = δ₂ × x₂ = -0.029 × 2.0 = -0.058

    Bias gradients:
    ∂L/∂b₁ = δ₁ = -0.031, ∂L/∂b₂ = δ₂ = -0.029

    Step 5: Update All Weights

    Learning rate: α = 0.1

    Update Rule: w_new = w_old - α × ∂L/∂w

    Weight     Old Value   Gradient   Update               New Value
    w₁₁        0.1         -0.031     0.1 - 0.1×(-0.031)   0.103
    w₁₂        0.2         -0.062     0.2 - 0.1×(-0.062)   0.206
    w₂₁        0.3         -0.029     0.3 - 0.1×(-0.029)   0.303
    w₂₂        0.4         -0.058     0.4 - 0.1×(-0.058)   0.406
    w₁(h→o)    0.5         -0.176     0.5 - 0.1×(-0.176)   0.518
    w₂(h→o)    0.6         -0.209     0.6 - 0.1×(-0.209)   0.621

    Weights increased because the gradients were negative (we want to increase the output toward 1)

    ✓ Backpropagation Summary

    The Algorithm:
    1. Forward pass: Calculate all activations from input → output
    2. Calculate output error: δ_output = ŷ - y (for sigmoid + BCE)
    3. Backpropagate error: δ_hidden = (Wᵀ × δ_next) ⊙ σ'(z)
    4. Calculate gradients: ∂L/∂W = δ × (input to that layer)ᵀ
    5. Update weights: W = W - α × ∂L/∂W

    This is iterated thousands of times until the loss converges!
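The full backward pass and weight update from Steps 1-5 can be checked numerically with NumPy (same network and input as the forward pass example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Network from the example (rows of W1 = inputs, columns = hidden units)
W1 = np.array([[0.1, 0.3], [0.2, 0.4]])
b1 = np.array([0.1, 0.1])
W2 = np.array([0.5, 0.6])
b2 = 0.2
x = np.array([1.0, 2.0])
y = 1.0

# Forward pass
h = sigmoid(x @ W1 + b1)
y_hat = sigmoid(h @ W2 + b2)

# Backward pass
delta_out = y_hat - y                        # ≈ -0.272 (sigmoid + BCE shortcut)
grad_W2 = delta_out * h                      # ≈ [-0.176, -0.209]
delta_hidden = delta_out * W2 * h * (1 - h)  # ≈ [-0.031, -0.029]
grad_W1 = np.outer(x, delta_hidden)          # input→hidden gradients

# Gradient-descent update with α = 0.1
alpha = 0.1
W1_new = W1 - alpha * grad_W1
W2_new = W2 - alpha * grad_W2
print(W1_new.round(3))  # [[0.103, 0.303], [0.206, 0.406]]
print(W2_new.round(3))  # [0.518, 0.621]
```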

        Python Code

    from sklearn.neural_network import MLPClassifier

    # Create neural network
    mlp = MLPClassifier(
        hidden_layer_sizes=(100, 50),  # 2 hidden layers
        activation='relu',
        max_iter=500
    )

    # Train
    mlp.fit(X_train, y_train)

    # Predict
    predictions = mlp.predict(X_test)

        📊 Supervised - Evaluation Cross-Validation


        Cross-validation gives more reliable performance estimates by testing your model on multiple different splits of the data!


        Key Concepts
        ⚠️ Single Split Problem
    You test once and get 85% accuracy. Is that good? Or did you just get lucky with an easy test set? Without multiple tests, you don't know!

        Figure: 3-Fold Cross-Validation - each fold serves as test set once


        Example: 3-Fold CV

        @@ -2045,7 +3823,12 @@ Actual Pos TP FN - + + + + + + @@ -2087,7 +3870,8 @@ Actual Pos TP FN

        Stratified K-Fold


        For classification with imbalanced classes, use stratified K-fold to maintain class proportions in each fold!


        💡 Example

        🔍 Unsupervised - Preprocessing Data Preprocessing



        Raw data is messy! Data preprocessing cleans and transforms data into a format that machine learning algorithms can use effectively.


        Key Steps
        ⚠️ Warning
    Never drop columns with many missing values without investigation! The missingness itself might be informative (e.g., income not reported might correlate with high income).
        ⚠️ Don't Mix Them Up!
    Never use label encoding for nominal data! If you encode ["Red", "Blue", "Green"] as [0, 1, 2], the model thinks Green > Blue > Red, which is meaningless!

        3. Feature Scaling


        Different features have different scales. Age (0-100) vs Income ($0-$1M). This causes problems!


        Why Scale?

          Formula: z = (x - μ) / σ
          where:
          μ = mean of feature
          σ = standard deviation
          Result: mean=0, std=1

          Example: [10, 20, 30, 40, 50]

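Applying `StandardScaler` to the example values [10, 20, 30, 40, 50] (mean 30, population std √200 ≈ 14.14):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[10], [20], [30], [40], [50]], dtype=float)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(scaler.mean_[0])  # 30.0
print(X_scaled.ravel())
# ≈ [-1.414, -0.707, 0.0, 0.707, 1.414] — mean 0, std 1, as the formula promises
```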

          Figure: Feature distributions before and after scaling


          Critical: fit_transform vs transform


          Complete Pipeline Example


          Figure: Complete preprocessing pipeline


        Loss functions measure how wrong our predictions are. Different problems need different loss functions! The choice dramatically affects what your model learns.


        Key Concepts

        Figure: Comparing MSE, MAE, and their response to errors

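A small hypothetical example makes the MSE-vs-MAE difference concrete — one large error dominates MSE because it gets squared:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical predictions with one large error (9 → 12)
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.0, 12.0]

mse = mean_squared_error(y_true, y_pred)   # (0.25 + 0 + 0 + 9) / 4 = 2.3125
mae = mean_absolute_error(y_true, y_pred)  # (0.5 + 0 + 0 + 3) / 4 = 0.875
print(mse, mae)  # MSE punishes the outlier far harder than MAE
```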

        Loss Functions for Classification


        Visualizing Loss Curves


        Figure: How different losses respond to errors


        Choosing the right K value is critical for KNN performance! Too small causes overfitting, too large causes underfitting. Let's explore systematic methods to find the optimal K.


        Key Methods

        Method 1: Elbow Method


        Test different K values and plot performance. Look for the "elbow" where adding more neighbors doesn't help much.



        Figure 1: Elbow curve showing optimal K at the bend


        Method 2: Cross-Validation Approach


        For each K value, run k-fold cross-validation and calculate mean accuracy. Choose K with highest mean accuracy.


    Cross-Validation Process:
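The process can be sketched with `cross_val_score` on the built-in iris dataset — score each candidate K and keep the one with the highest mean accuracy:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# For each candidate K, run 5-fold CV and record the mean accuracy
results = {}
for k in range(1, 16, 2):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    results[k] = scores.mean()

best_k = max(results, key=results.get)
print(best_k, round(results[best_k], 3))
```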

        Figure 2: Cross-validation accuracies heatmap for different K values



        Hyperparameters control how your model learns. Unlike model parameters (learned from data), hyperparameters are set BEFORE training. GridSearch systematically finds the best combination!


        Common Hyperparameters

        GridSearch Explained


        GridSearch tests ALL combinations of hyperparameters you specify. It's exhaustive but guarantees finding the best combination in your grid.


Example: SVM GridSearch

        Figure: GridSearch heatmap showing accuracy for C vs gamma combinations
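The C-vs-gamma grid above can be run with scikit-learn's `GridSearchCV`. A minimal sketch on synthetic data (the grid values and dataset are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data
X, y = make_classification(n_samples=150, n_features=5, random_state=0)

param_grid = {
    'C': [0.1, 1, 10],        # regularization strength
    'gamma': [0.01, 0.1, 1],  # RBF kernel width
}

# Tries all 3 x 3 = 9 combinations, each scored with 5-fold CV
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`search.cv_results_` holds the per-combination scores — exactly the numbers the heatmap visualizes.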



        Performance Surface (3D View)


        Figure: 3D surface showing how parameters affect performance


        When GridSearch Fails


        📊 Supervised - Classification Naive Bayes Classification



        Naive Bayes is a probabilistic classifier based on Bayes' Theorem. Despite its "naive" independence assumption, it works surprisingly well for text classification and other tasks! We'll cover both Categorical and Gaussian Naive Bayes with complete mathematical solutions.


        Key Concepts
              ↓                             ↓                  ↓               ↓
        Posterior              Likelihood        Prior       Evidence
(What we want)     (From data)     (Baseline)  (Normalizer)

        The Naive Independence Assumption


        Figure 1: Bayes' Theorem visual explanation


        Figure 2: Spam classification calculation step-by-step


        Step-by-Step Calculation

Speed: Very Fast | Fast | Slow | Very Slow
Works with Little Data: Yes | Yes | No | No
Interpretable: Very | Yes | No | No
Handles Non-linear: Yes | No | Yes | Yes
High Dimensions: Excellent | Good | Good | Poor

        🎯 PART A: Categorical Naive Bayes (Step-by-Step from PDF)


        Dataset: Tennis Play Prediction

Outlook | Temperature | Play
Sunny   | Hot         | No
Sunny   | Mild        | No
Cloudy  | Hot         | Yes
Rainy   | Mild        | Yes
Rainy   | Cool        | Yes
Cloudy  | Cool        | Yes

Problem: Predict whether to play tennis when Outlook=Rainy and Temperature=Hot
        STEP 1: Calculate Prior Probabilities
P(Yes) = 4/6 = 0.667 (66.7%)
P(No) = 2/6 = 0.333 (33.3%)
        STEP 2: Calculate Conditional Probabilities (Before Smoothing)
For Outlook = "Rainy":
• P(Rainy|Yes) = 2/4 = 0.5
• Count (Rainy AND No) = 0 examples ❌
• Count (No) = 2 total
• P(Rainy|No) = 0/2 = 0 ⚠️ ZERO PROBABILITY PROBLEM!

        For Temperature = "Hot":
        • P(Hot|Yes) = 1/4 = 0.25
        • P(Hot|No) = 1/2 = 0.5
        Step 3: Apply Bayes' Theorem (Initial)

        P(Yes|Rainy,Hot) = P(Yes) × P(Rainy|Yes) × P(Hot|Yes)
                 = 0.667 × 0.5 × 0.25
                 = 0.0833

        P(No|Rainy,Hot) = P(No) × P(Rainy|No) × P(Hot|No)
                = 0.333 × 0 × 0.5
                = 0 ❌ Problem!
        ⚠️ Zero Probability Problem
When P(Rainy|No) = 0, the entire probability becomes 0! This is unrealistic - just because we haven't seen "Rainy" with "No" in our training data doesn't mean it's impossible. We need Laplace Smoothing!
        STEP 4: Apply Laplace Smoothing (α = 1)
        For Outlook (3 categories: Sunny, Cloudy, Rainy):
P(Rainy|Yes) = (2 + 1) / (4 + 1×3)
             = 3/7
             = 0.429

P(Rainy|No) = (0 + 1) / (2 + 1×3)
            = 1/5
            = 0.2 ✓ Fixed the zero!

        For Temperature (3 categories: Hot, Mild, Cool):
        P(Hot|Yes) = (1 + 1) / (4 + 1×3) = 2/7 = 0.286
        P(Hot|No) = (1 + 1) / (2 + 1×3) = 2/5 = 0.4
        STEP 5: Recalculate with Smoothing
P(Yes|Rainy,Hot) ∝ P(Yes) × P(Rainy|Yes) × P(Hot|Yes)
                 = 0.667 × 0.429 × 0.286
                 = 0.0818

P(No|Rainy,Hot) ∝ P(No) × P(Rainy|No) × P(Hot|No)
                = 0.333 × 0.2 × 0.4
                = 0.0266
        STEP 6: Normalize to Get Final Probabilities
Normalize:
Sum = 0.0818 + 0.0266 = 0.1084

P(Yes|Rainy,Hot) = 0.0818 / 0.1084
                 = 0.755 (75.5%)

P(No|Rainy,Hot) = 0.0266 / 0.1084
                = 0.245 (24.5%)

✅ FINAL PREDICTION: YES (Play Tennis!)
Confidence: 75.5%

        Figure: Categorical Naive Bayes calculation visualization
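The hand calculation above can be reproduced with scikit-learn's `CategoricalNB` (features must be integer-encoded first; `alpha=1.0` is the same Laplace smoothing used in Step 4):

```python
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# The six training rows from the tennis table above
X_raw = [['Sunny', 'Hot'], ['Sunny', 'Mild'], ['Cloudy', 'Hot'],
         ['Rainy', 'Mild'], ['Rainy', 'Cool'], ['Cloudy', 'Cool']]
y = ['No', 'No', 'Yes', 'Yes', 'Yes', 'Yes']

enc = OrdinalEncoder()            # CategoricalNB expects integer category codes
X = enc.fit_transform(X_raw)

clf = CategoricalNB(alpha=1.0)    # alpha=1.0 = the Laplace smoothing above
clf.fit(X, y)

test_point = enc.transform([['Rainy', 'Hot']])
print(clf.predict(test_point))        # agrees with the manual result: Yes
print(clf.predict_proba(test_point))  # roughly [0.245, 0.755] for [No, Yes]
```

Because sklearn applies the same smoothed counts and empirical priors, the predicted probability matches the hand-derived 75.5% to rounding.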



        🎯 PART B: Gaussian Naive Bayes (Step-by-Step from PDF)


        Dataset: 2D Classification

ID | X₁  | X₂  | Class
A  | 1.0 | 2.0 | Yes
B  | 2.0 | 1.0 | Yes
C  | 1.5 | 1.8 | Yes
D  | 3.0 | 3.0 | No
E  | 3.5 | 2.8 | No
F  | 2.9 | 3.2 | No

        Problem: Classify test point [X₁=2.0, X₂=2.0]

        STEP 1: Calculate Mean and Variance for Each Class
Class "Yes" (points A, B, C):
μ₁(Yes) = (1.0 + 2.0 + 1.5) / 3 = 1.5     σ₁²(Yes) = 0.166
μ₂(Yes) = (2.0 + 1.0 + 1.8) / 3 = 1.6     σ₂²(Yes) = 0.187

Class "No" (points D, E, F):
μ₁(No) = (3.0 + 3.5 + 2.9) / 3 = 3.133    σ₁²(No) = 0.0688
μ₂(No) = (3.0 + 2.8 + 3.2) / 3 = 3.0      σ₂²(No) = 0.0266
        Step 2: Gaussian Probability Density Function

f(x | μ, σ²) = (1 / √(2πσ²)) × e^(−(x−μ)² / (2σ²))
        This gives us the probability density at point x given mean μ and variance σ²
        STEP 3: Calculate P(X₁=2.0 | Class) using Gaussian PDF
        Step-by-step:
        • Normalization: 1/√(2π × 0.166) = 1/√1.043 = 1/1.021 = 0.9772
• Exponent: -(2.0-1.5)²/(2 × 0.166) = -(0.5)²/0.332 = -0.25/0.332 = -0.753
        • e^(-0.753) = 0.471
        • Final: 0.9772 × 0.471 = 0.460

        Step-by-step:
        • Normalization: 1/√(2π × 0.0688) = 1.523
• Exponent: -(2.0-3.133)²/(2 × 0.0688) = -(-1.133)²/0.1376 = -1.283/0.1376 = -9.333
        • e^(-9.333) = 0.000088
        • Final: 1.523 × 0.000088 = 0.000134

        • Point (2.0, ?) is MUCH more likely to be "Yes"!
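Step 3's numbers can be verified in a few lines of Python — just the Gaussian density itself, evaluated with the Step 1 means and variances:

```python
import math

def gaussian_pdf(x, mu, var):
    # Density of N(mu, var) at x, as used by Gaussian Naive Bayes
    return (1.0 / math.sqrt(2 * math.pi * var)) * math.exp(-(x - mu) ** 2 / (2 * var))

p_x1_yes = gaussian_pdf(2.0, 1.5, 0.166)    # ~0.46, matching the calculation above
p_x1_no = gaussian_pdf(2.0, 3.133, 0.0688)  # tiny: 2.0 is far from the "No" mean

print(round(p_x1_yes, 3), p_x1_no)
```

The roughly 3000× ratio between the two densities is what makes the final classification so confident.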
        Step 4: Calculate P(X₂=2.0 | Class)

P(X₂=2.0 | Yes) = (1/√(2π × 0.187)) × e^(-(2.0-1.6)²/(2 × 0.187))
                = 0.923 × 0.652
                = 0.602

P(X₂=2.0 | No) = (1/√(2π × 0.0266)) × e^(-(2.0-3.0)²/(2 × 0.0266))
               = 2.449 × 0.0000000614
                  = 0.00000015
        Step 5: Combine with Prior (assume equal priors)

P(Yes|x) ∝ P(Yes) × P(X₁=2.0|Yes) × P(X₂=2.0|Yes)
         = 0.5 × 0.460 × 0.602
         = 0.138

P(No|x) ∝ P(No) × P(X₁=2.0|No) × P(X₂=2.0|No)
         = 0.5 × 0.000134 × 0.00000015
                 = 0.00000000001
        Step 6: Normalize

        Prediction: YES ✅

        Figure: Gaussian Naive Bayes with decision boundary


Avoid Naive Bayes when:
• Complex feature interactions matter

        Python Code

from sklearn.naive_bayes import GaussianNB, MultinomialNB

# For continuous features (e.g., measurements)
gnb = GaussianNB()
gnb.fit(X_train, y_train)
predictions = gnb.predict(X_test)

# For text/count data (e.g., word counts)
from sklearn.feature_extraction.text import CountVectorizer

# Convert text to word counts
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train_text)
X_test_counts = vectorizer.transform(X_test_text)

# Train Multinomial NB (good for text)
mnb = MultinomialNB(alpha=1.0)  # Laplace smoothing
mnb.fit(X_train_counts, y_train)

# Predict & get probabilities
predictions = mnb.predict(X_test_counts)
probabilities = mnb.predict_proba(X_test_counts)

        🔍 Unsupervised - Clustering K-means Clustering


        K-means is an unsupervised learning algorithm that groups data into K clusters. Each cluster has a centroid (center point), and points are assigned to the nearest centroid. Perfect for customer segmentation, image compression, and pattern discovery!


        Key Concepts

        Dataset: 6 Points in 2D Space

Point | X   | Y
A     | 1   | 2
B     | 1.5 | 1.8
C     | 5   | 8
D     | 8   | 8
E     | 1   | 0.6
F     | 9   | 11
        ⚠️ Poor Initial Centroids!
All points assigned to c₁! This happens with bad initialization. Let's try better initial centroids for the algorithm to work properly.
        WCSS Calculation:
        WCSS₁ = d²(A,c₁) + d²(B,c₁) + d²(E,c₁)
       = (1-1.17)²+(2-1.47)² + (1.5-1.17)²+(1.8-1.47)² + (1-1.17)²+(0.6-1.47)²
               = 0.311 + 0.218 + 0.786 = 1.315

        WCSS₂ = d²(C,c₂) + d²(D,c₂) + d²(F,c₂)
       = (5-7.33)²+(8-9)² + (8-7.33)²+(8-9)² + (9-7.33)²+(11-9)²
               = 6.433 + 1.447 + 6.789 = 14.669

Total WCSS = 1.315 + 14.669 = 15.984

        Figure: K-means clustering visualization with centroid movement


        Finding Optimal K: The Elbow Method


        Figure: Elbow method - optimal K is where the curve bends
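The elbow method can be run directly on the six points from the table above; scikit-learn's `KMeans` exposes WCSS as `inertia_`:

```python
import numpy as np
from sklearn.cluster import KMeans

# The six points from the table above
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

# WCSS for K = 1..5; look for the bend in the curve
wcss = []
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)

print([round(w, 2) for w in wcss])  # drops sharply, then flattens out
```

`n_init=10` reruns the algorithm from multiple random centroids and keeps the best result, which guards against the poor-initialization problem shown above.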



        📊 Supervised - Regression Decision Tree Regression


        Decision Tree Regression predicts continuous values by recursively splitting data to minimize variance. Unlike classification trees that use entropy, regression trees use variance reduction!


        Key Concepts

        Dataset: House Price Prediction

ID | Square Feet | Price (Lakhs)
1  | 800         | 50
2  | 850         | 52
3  | 900         | 54
4  | 1500        | 90
5  | 1600        | 95
6  | 1700        | 100
        STEP 1: Calculate Parent Variance
Mean price = (50 + 52 + 54 + 90 + 95 + 100) / 6
           = 441 / 6
           = 73.5 Lakhs

Variance = Σ(yᵢ - mean)² / n

Calculating each term:
• (50 - 73.5)² = (-23.5)² = 552.25
• (52 - 73.5)² = (-21.5)² = 462.25
• (54 - 73.5)² = (-19.5)² = 380.25
• (90 - 73.5)² = (16.5)² = 272.25
• (95 - 73.5)² = (21.5)² = 462.25
• (100 - 73.5)² = (26.5)² = 702.25

Sum = 552.25 + 462.25 + 380.25 + 272.25 + 462.25 + 702.25 = 2831.5

Variance = 2831.5 / 6 = 471.92

✓ Parent Variance = 471.92
        STEP 2: Test Split Points
Sort by Square Feet: 800, 850, 900, 1500, 1600, 1700

Possible midpoints: 825, 875, 1200, 1550, 1650

Testing Split at 1200:

LEFT (Square Feet <= 1200):
Samples: 800(50), 850(52), 900(54)
Left Mean = (50 + 52 + 54) / 3 = 156 / 3 = 52

Left Variance:
• (50 - 52)² = 4
• (52 - 52)² = 0
• (54 - 52)² = 4
Sum = 8
Variance = 8 / 3 = 2.67

RIGHT (Square Feet > 1200):
Samples: 1500(90), 1600(95), 1700(100)
Right Mean = (90 + 95 + 100) / 3 = 285 / 3 = 95

Right Variance:
• (90 - 95)² = 25
• (95 - 95)² = 0
• (100 - 95)² = 25
Sum = 50
Variance = 50 / 3 = 16.67
        STEP 3: Calculate Weighted Variance After Split
Weighted Variance = (n_left/n_total) × Var_left + (n_right/n_total) × Var_right
                  = (3/6) × 2.67 + (3/6) × 16.67
                  = 0.5 × 2.67 + 0.5 × 16.67
                  = 1.335 + 8.335
                  = 9.67
        STEP 4: Calculate Variance Reduction
Variance Reduction = Parent Variance - Weighted Variance After Split
                   = 471.92 - 9.67
                   = 462.25

✓ This is the BEST SPLIT!
Splitting at 1200 sq ft reduces variance by 462.25
        STEP 5: Build Final Tree Structure
Final Decision Tree:

          [All data, Mean=73.5, Var=471.92]
                        │
            Split at Square Feet = 1200
               /                  \
          <= 1200                > 1200
             /                      \
        Mean = 52            Split at 1550
       (3 samples)            /         \
                         <= 1550       > 1550
                           /               \
                      Mean = 90        Mean = 97.5
                     (1 sample)        (2 samples)

Prediction Example:
New property: 950 sq ft
├─ 950 <= 1200? YES → Go LEFT
└─ Prediction: ₹52 Lakhs

New property: 1650 sq ft
├─ 1650 <= 1200? NO → Go RIGHT
├─ 1650 <= 1550? NO → Go RIGHT
└─ Prediction: ₹97.5 Lakhs

        Figure: Decision tree regression with splits and predictions
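The hand-built tree can be reproduced with scikit-learn. A minimal sketch: `max_leaf_nodes=3` grows the tree best-first, which on this data yields the same split at 1200 and then at 1550:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# House data from the table above (square feet -> price in lakhs)
X = np.array([[800], [850], [900], [1500], [1600], [1700]])
y = np.array([50, 52, 54, 90, 95, 100])

# Three leaves = the hand-built tree: split at 1200, then 1550
reg = DecisionTreeRegressor(max_leaf_nodes=3, random_state=0)
reg.fit(X, y)

print(reg.predict([[950]]))   # left leaf: mean(50, 52, 54) = 52
print(reg.predict([[1650]]))  # right-right leaf: mean(95, 100) = 97.5
```

Each leaf predicts the mean of its training samples, so the fitted model is the same piecewise constant function derived above.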


        ✅ Key Takeaway
Decision Tree Regression finds splits that minimize variance in leaf nodes. Each leaf predicts the mean of samples in that region. The recursive splitting creates a piecewise constant function!

        Variance Reduction vs Information Gain

Aspect              | Classification Trees            | Regression Trees
Splitting Criterion | Information Gain (Entropy/Gini) | Variance Reduction
Prediction          | Majority class                  | Mean value
Leaf Node           | Class label                     | Continuous value
Goal                | Maximize purity                 | Minimize variance

        Figure: Comparing different split points and their variance reduction



        📊 Supervised Decision Trees


        Decision Trees make decisions by asking yes/no questions recursively. They're interpretable, powerful, and the foundation for ensemble methods like Random Forests!


        Key Concepts

        How Decision Trees Work


        Imagine you're playing "20 Questions" to guess an animal. Each question splits possibilities into two groups. Decision Trees work the same way!


        Figure 1: Interactive decision tree structure

        Splitting Criteria


        How do we choose which question to ask at each node? We want splits that maximize information gain!


        1. Entropy (Information Theory)


        Figure 2: Entropy and Information Gain visualization


        3. Gini Impurity (Alternative)


        Figure 3: Comparing different splits by information gain


        Decision Boundaries


        Figure 4: Decision tree creates rectangular regions


        Overfitting in Decision Trees


        Advantages vs Disadvantages


📐 Complete Mathematical Derivation: Decision Tree Splitting


Let's calculate Entropy, Information Gain, and Gini step-by-step!

Problem: Should we play tennis today?

Training Data (14 days):
• 9 days we played tennis (Yes)
• 5 days we didn't play (No)

Features: Weather (Sunny/Overcast/Rain), Wind (Weak/Strong)
Step 1: Calculate Root Entropy H(S)

Entropy Formula: H(S) = -Σ pᵢ × log₂(pᵢ)

p(Yes) = 9/14 = 0.643
p(No) = 5/14 = 0.357

H(S) = -[p(Yes) × log₂(p(Yes)) + p(No) × log₂(p(No))]
H(S) = -[0.643 × log₂(0.643) + 0.357 × log₂(0.357)]
H(S) = -[0.643 × (-0.637) + 0.357 × (-1.486)]
H(S) = -[-0.410 + (-0.531)]
H(S) = -[-0.940]
H(S) = 0.940 bits (before any split)
Step 2: Calculate Entropy After Splitting by "Wind"

Split counts:
Wind   | Yes | No | Total | Entropy Calculation              | H(subset)
Weak   | 6   | 2  | 8     | -[6/8×log₂(6/8) + 2/8×log₂(2/8)] | 0.811
Strong | 3   | 3  | 6     | -[3/6×log₂(3/6) + 3/6×log₂(3/6)] | 1.000

Weighted Average Entropy:
H(S|Wind) = (8/14) × 0.811 + (6/14) × 1.000
H(S|Wind) = 0.463 + 0.429
H(S|Wind) = 0.892
Step 3: Calculate Information Gain

Formula: IG(S, Feature) = H(S) - H(S|Feature)

IG(S, Wind) = 0.940 - 0.892
IG(S, Wind) = 0.048 bits

This means splitting by Wind reduces uncertainty by 0.048 bits
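Steps 1-3 can be checked with a tiny entropy helper in plain Python:

```python
import math

def entropy(counts):
    # H = -sum p_i * log2(p_i), skipping empty classes
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

h_root = entropy([9, 5])                       # 9 Yes, 5 No at the root
h_weak, h_strong = entropy([6, 2]), entropy([3, 3])

# Weighted child entropy after splitting on Wind, then the information gain
h_wind = (8 / 14) * h_weak + (6 / 14) * h_strong
ig_wind = h_root - h_wind

print(round(h_root, 3), round(h_wind, 3), round(ig_wind, 3))  # 0.94 0.892 0.048
```

The same helper applied to the Weather split reproduces the 0.247 gain compared in Step 4.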
Step 4: Compare with Other Features

Feature | H(S|Feature) | Information Gain | Decision
Weather | 0.693        | 0.247            | ✓ BEST!
Wind    | 0.892        | 0.048            |

→ Split by "Weather" first (highest information gain!)
Step 5: Gini Impurity Alternative

Gini Formula: Gini(S) = 1 - Σ pᵢ²

For root node:
Gini(S) = 1 - [(9/14)² + (5/14)²]
Gini(S) = 1 - [0.413 + 0.128]
Gini(S) = 1 - 0.541
Gini(S) = 0.459

Interpretation:
• Gini = 0: Pure node (all same class)
• Gini = 0.5: Maximum impurity (50-50 split)
• Our 0.459 indicates moderate impurity
✓ Summary: Decision Tree Math

The algorithm at each node:
1. Calculate parent entropy/Gini
2. For each feature:
   • Split data by feature values
   • Calculate weighted child entropy/Gini
   • Compute Information Gain = Parent - Weighted Children
3. Choose feature with HIGHEST Information Gain
4. Repeat recursively until stopping criteria met!

        Python Code

from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import matplotlib.pyplot as plt

# Create Decision Tree
dt = DecisionTreeClassifier(
    criterion='gini',        # 'gini' or 'entropy'
    max_depth=5,             # Limit depth (prevent overfitting)
    min_samples_split=2,     # Min samples to split
    min_samples_leaf=1       # Min samples in leaf
)

# Train
dt.fit(X_train, y_train)

# Predict
predictions = dt.predict(X_test)

# Visualize the tree
plt.figure(figsize=(20, 10))
tree.plot_tree(dt, filled=True, feature_names=feature_names)
plt.show()

# Feature importance
print(dict(zip(feature_names, dt.feature_importances_)))

        🎮 Reinforcement Introduction to Reinforcement Learning


        Reinforcement Learning (RL) is learning by trial and error, just like teaching a dog tricks! The agent takes actions in an environment, receives rewards or punishments, and learns which actions lead to the best outcomes.


        Key Concepts
Supervised: "Here's the right answer for each example"
        Reinforcement: "Try things and I'll tell you if you did well or poorly"

RL must explore to discover good actions, while supervised learning is given correct answers upfront!
      • Game Playing: AlphaGo learning to play Go by playing millions of games
      • Robotics: Robot learning to walk by trying different leg movements
      • Self-Driving Cars: Learning to drive safely through experience
• Recommendation Systems: Learning what users like from their interactions
      • Resource Management: Optimizing data center cooling to save energy
    • Exploration: Try new actions to discover better rewards
    • Exploitation: Use known good actions to maximize reward

    Balance is key! Too much exploration wastes time on bad actions. Too much exploitation misses better strategies.


Reward Signal: Total Return = R = r₁ + γr₂ + γ²r₃ + ... = Σ γᵗ rₜ₊₁
    where:
    γ = discount factor (0 ≤ γ ≤ 1)
    Future rewards are worth less than immediate rewards
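The discounted return formula can be computed directly. A minimal sketch in plain Python (the reward sequence is invented) showing how γ controls patience:

```python
def discounted_return(rewards, gamma):
    # R = r1 + gamma*r2 + gamma^2*r3 + ...
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1, 1, 1, 10]  # the big reward arrives late

print(discounted_return(rewards, 0.9))  # patient agent: the 10 still counts
print(discounted_return(rewards, 0.1))  # myopic agent: the 10 barely matters
```

With γ = 0.9 the late reward contributes 7.29 to the return; with γ = 0.1 it contributes only 0.01, so a myopic agent would never plan toward it.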

    🎮 Reinforcement Q-Learning


    Q-Learning is a value-based RL algorithm that learns the quality (Q-value) of taking each action in each state. It's model-free and can learn optimal policies even without knowing how the environment works!

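The core Q-value update can be sketched on a toy problem. A minimal sketch, assuming a made-up 1-D corridor environment (the states, rewards, and constants are all illustrative):

```python
import random

# Q-learning on a tiny 1-D corridor: states 0..3, goal (reward 1) at state 3
n_states, actions = 4, [0, 1]        # action 0 = left, 1 = right
alpha, gamma, epsilon = 0.5, 0.9, 0.2
Q = [[0.0, 0.0] for _ in range(n_states)]

def step(s, a):
    s2 = max(0, min(n_states - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == n_states - 1 else 0.0)  # reward only at the goal

random.seed(0)
for _ in range(500):                 # episodes
    s = 0
    while s != n_states - 1:
        if random.random() < epsilon:                  # explore
            a = random.choice(actions)
        else:                                          # exploit
            a = max(actions, key=lambda act: Q[s][act])
        s2, r = step(s, a)
        # Core update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print([round(max(q), 2) for q in Q[:-1]])  # values grow toward the goal
```

Note how the learned values decay by a factor of γ per step away from the goal (roughly 0.81, 0.9, 1.0) — the discounting from the previous section appearing in the Q-table.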

    Key Concepts

    🎮 Reinforcement Policy Gradient Methods


    Policy Gradient methods directly optimize the policy (action selection strategy) instead of learning value functions. They're powerful for continuous action spaces and stochastic policies!


    Key Concepts

    Policy vs Value-Based Methods

Aspect             | Value-Based (Q-Learning) | Policy-Based
What it learns     | Q(s,a) values            | π(a|s) policy directly
Action selection   | argmax Q(s,a)            | Sample from π(a|s)
Continuous actions | Difficult                | Natural
Stochastic policy  | Indirect                 | Direct
Convergence        | Can be unstable          | Smoother
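The "learns π(a|s) directly" idea can be made concrete with the simplest policy-gradient method, REINFORCE. A minimal sketch on an invented 2-armed bandit (softmax policy; all numbers illustrative):

```python
import math
import random

# REINFORCE on a 2-armed bandit: the policy is a softmax over two
# action preferences, updated with the score function grad log pi(a).
random.seed(1)
theta = [0.0, 0.0]       # one preference per action
alpha = 0.1              # learning rate
win_prob = [0.2, 0.8]    # arm 1 pays off more often (hidden from the agent)

def softmax(prefs):
    exps = [math.exp(p) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

for _ in range(2000):
    probs = softmax(theta)
    a = 0 if random.random() < probs[0] else 1         # sample from the policy
    r = 1.0 if random.random() < win_prob[a] else 0.0  # stochastic reward
    # REINFORCE update: theta += alpha * r * grad log pi(a)
    # For a softmax policy, grad log pi(a) = 1 - p(a) on the taken arm,
    # and -p(i) on every other arm.
    for i in range(2):
        grad = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += alpha * r * grad

final_probs = softmax(theta)
print([round(p, 2) for p in final_probs])  # mass shifts to the better arm
```

Rewarded actions become more probable, unrewarded ones less — no Q-table is ever built, which is what distinguishes the right-hand column of the table above.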
    ✅ Modern Improvements
Actor-Critic: Combine policy gradient with value function to reduce variance
PPO (Proximal Policy Optimization): Constrain policy updates for stability
TRPO (Trust Region): Guarantee monotonic improvement

These advances make policy gradients practical for complex tasks like robot control and game playing!

    🔄 Comparison Algorithm Comparison Tool



    Step 2: Select Algorithms to Compare (2-5)


    Selected: 0 algorithms
