diff --git "a/ml_complete-all-topics/index.html" "b/ml_complete-all-topics/index.html" --- "a/ml_complete-all-topics/index.html" +++ "b/ml_complete-all-topics/index.html" @@ -3,9699 +3,4056 @@ - Mathematics Mastery Platform - Statistics, Linear Algebra & Calculus - + Machine Learning: Complete Educational Guide + + - - - - -
- - -
-

Module 5: Distributions

- -
+ +
+
+

Machine Learning: The Ultimate Learning Platform

+

Master ML through Supervised, Unsupervised & Reinforcement Learning

+

Complete with step-by-step mathematical solutions, interactive visualizations, and real-world examples

+
- + + +
+
+

1. Introduction to Machine Learning

+ +
+
+

Machine Learning is teaching computers to learn from experience, just like humans do. Instead of programming every rule, we let the computer discover patterns in data and make decisions on its own.

+ +
+
+
📊
+

Supervised Learning

+

Learning with labeled data - like a teacher providing answers

+
+
✓ Regression
+
✓ Classification
+
✓ Evaluation
+
+
+
+
🔍
+

Unsupervised Learning

+

Finding patterns without labels - discovering hidden structure

+
+
✓ Clustering
+
✓ Dimensionality Reduction
+
✓ Preprocessing
+
+
+
+
🎮
+

Reinforcement Learning

+

Learning through trial & error - maximizing rewards

+
+
✓ Q-Learning
+
✓ Policy Gradient
+
✓ Applications
+
+
+
+ +
+
Key Concepts
+
    +
  • Learning from data instead of explicit programming
  • +
  • Three types: Supervised, Unsupervised, Reinforcement
  • +
  • Powers Netflix recommendations, Face ID, and more
  • +
  • Requires: Data, Algorithm, and Computing Power
  • +
+
- +

Understanding Machine Learning

+

Imagine teaching a child to recognize animals. You show them pictures of cats and dogs, telling them which is which. After seeing many examples, the child learns to identify new animals they've never seen before. Machine Learning works the same way!

- +

The Three Types of Learning:

+
    +
  1. Supervised Learning: Learning with a teacher. You provide labeled examples (like "this is a cat", "this is a dog"), and the model learns to predict labels for new data.
  2. +
  3. Unsupervised Learning: Learning without labels. The model finds hidden patterns on its own, like grouping similar customers together.
  4. +
  5. Reinforcement Learning: Learning by trial and error. The model tries actions and learns from rewards/punishments, like teaching a robot to walk.
  6. +
- - +
+
💡 Key Insight
+
+ ML is not magic! It's mathematics + statistics + computer science working together to find patterns in data. +
+
- - +
- + +
+
+

📊 Supervised - Regression: Linear Regression

+ +
+
+

Linear Regression is one of the simplest and most powerful techniques for predicting continuous values. It finds the "best fit line" through data points.

+ +
+
Key Concepts
+
    +
  • Predicts continuous values (prices, temperatures, etc.)
  • +
  • Finds the straight line that best fits the data
  • +
  • Uses equation: y = mx + c
  • +
  • Minimizes prediction errors
  • +
+
- +

Understanding Linear Regression

+

Think of it like this: You want to predict house prices based on size. If you plot size vs. price on a graph, you'll see points scattered around. Linear regression draws the "best" line through these points that you can use to predict prices for houses of any size.

- - +
+ The Linear Equation: + y = mx + c +
where:
y = predicted value (output)
x = input feature
m = slope (how steep the line is)
c = intercept (where line crosses y-axis)
+
- +

Example: Predicting Salary from Experience

+

Let's say we have data about employees' years of experience and their salaries:

- + + + + + + + + + + + + + + + +
Experience (years) | Salary ($k)
1 | 39.8
2 | 48.9
3 | 57.0
4 | 68.3
5 | 77.9
6 | 85.0
- - +

We can find a best-fit line (approximately y = 9.3x + 30.4 for this data) that predicts: someone with 7 years of experience will earn approximately $95k.

- +
+
+ +
+

Figure 1: Scatter plot showing experience vs. salary with the best fit line

+
- +
+
+ + +
+
+ + +
+
- - +
+ Cost Function (Mean Squared Error): + MSE = Σ(y_actual - y_predicted)² / n +
This measures how wrong our predictions are. Lower MSE = better fit! +
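As a quick check, the MSE formula can be evaluated on the experience/salary data from the table above (the candidate lines here are just examples):

```python
# Mean Squared Error for a candidate line y = m*x + c
# on the experience/salary data from the table above.
data = [(1, 39.8), (2, 48.9), (3, 57.0), (4, 68.3), (5, 77.9), (6, 85.0)]

def mse(m, c, points):
    """Average squared gap between actual and predicted salaries."""
    return sum((y - (m * x + c)) ** 2 for x, y in points) / len(points)

print(round(mse(9.0, 31.0, data), 3))  # a line close to the data -> small error
print(round(mse(2.0, 10.0, data), 3))  # a line far from the data -> large error
```

A lower MSE means the line tracks the points more closely, which is exactly what "best fit" means.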
- +
+
💡 Key Insight
+
+ The "best fit line" is the one that minimizes the total error between actual points and predicted points. We square the errors so positive and negative errors don't cancel out. +
+
- +
+
โš ๏ธ Common Mistake
+
+ Linear regression assumes a straight-line relationship. If your data curves, you need polynomial regression or other techniques! +
+
- +
- + +
+
+

📊 Supervised - Optimization: Gradient Descent

+ +
+
+

Gradient Descent is the optimization algorithm that helps us find the best values for our model parameters (like m and c in linear regression). Think of it as rolling a ball downhill to find the lowest point.

+ +
+
Key Concepts
+
    +
  • Optimization algorithm to minimize loss function
  • +
  • Takes small steps in the direction of steepest descent
  • +
  • Learning rate controls step size
  • +
  • Stops when it reaches the minimum (convergence)
  • +
+
- +

Understanding Gradient Descent

+

Imagine you're hiking down a mountain in thick fog. You can't see the bottom, but you can feel the slope under your feet. The smart strategy? Always step in the steepest downward direction. That's exactly what gradient descent does with mathematical functions!

- -
- +
+
💡 The Mountain Analogy
+
+ Your position on the mountain = current parameter values (m, c)
+ Your altitude = loss/error
+ Goal = reach the valley (minimum loss)
+ Gradient = tells you which direction is steepest +
+
- -
- -
-
- Topic 1 -

๐Ÿ“Š What is Statistics & Why It Matters

-

The science of collecting, organizing, analyzing, and interpreting data

-
+
+ Gradient Descent Update Rule: + θ_new = θ_old - α × ∇J(θ) +
where:
θ = parameters (m, c)
α = learning rate (step size)
∇J(θ) = gradient (direction and steepness)
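A minimal sketch of this update rule on the toy function J(θ) = θ² (whose gradient is 2θ) shows the parameter sliding toward the minimum at θ = 0:

```python
# Gradient descent on J(theta) = theta**2, gradient dJ/dtheta = 2*theta.
theta = 5.0   # starting position on the "mountain"
alpha = 0.1   # learning rate (step size)

for _ in range(100):
    grad = 2 * theta               # direction and steepness
    theta = theta - alpha * grad   # the update rule above

print(theta)  # very close to the minimum at 0
```

Each step shrinks θ by a constant factor, so the parameter converges geometrically toward the minimum.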
+
-
-

Introduction

-

What is it? Statistics is a branch of mathematics that deals with data. It provides methods to make sense of numbers and help us make informed decisions based on evidence rather than guesswork.

-

Why it matters: From business forecasting to medical research, sports analysis to government policy, statistics powers nearly every decision in our modern world.

-

When to use it: Whenever you need to understand patterns, test theories, make predictions, or draw conclusions from data.

-
+

The Learning Rate (α)

+

The learning rate is like your step size when walking down the mountain:

+
    +
  • Too small: You take tiny steps and it takes forever to reach the bottom
  • +
  • Too large: You take huge leaps and might jump over the valley or even go uphill!
  • +
  • Just right: You make steady progress toward the minimum
  • +
-
-
๐Ÿ’ก REAL-WORLD EXAMPLE
-

Imagine Netflix deciding what shows to produce. They analyze viewing statistics: what genres people watch, when they pause, what they finish. Statistics transforms millions of data points into actionable insights like "Create more thriller series" or "Release episodes on Fridays."

-
+
+
+ +
+

Figure 2: Loss surface showing gradient descent path to minimum

+
-
-

Two Branches of Statistics

-
-
-

Descriptive Statistics

-
    -
  • Summarizes and describes data
  • -
  • Uses charts, graphs, averages
  • -
  • Example: "Average class score is 85"
  • -
+
+
+ +
-
-

Inferential Statistics

-
    -
  • Makes predictions and inferences
  • -
  • Tests hypotheses
  • -
  • Example: "New teaching method improves scores"
  • -
+
+ +
-
-
-

Use Cases & Applications

-
    -
  • Healthcare: Clinical trials testing new drugs, disease outbreak tracking
  • -
  • Business: Customer behavior analysis, sales forecasting, A/B testing
  • -
  • Government: Census data, economic indicators, policy impact assessment
  • -
  • Sports: Player performance metrics, game strategy optimization
  • -
-
+
+ Gradients for Linear Regression: + ∂MSE/∂m = (2/n) × Σ(ŷ - y) × x
+ ∂MSE/∂c = (2/n) × Σ(ŷ - y) +
These tell us how much to adjust m and c +
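Plugging these gradients into the update rule gives a complete training loop. The sketch below fits the experience/salary data from the earlier table (the learning rate and iteration count are illustrative):

```python
# Batch gradient descent for y = m*x + c on the salary data.
data = [(1, 39.8), (2, 48.9), (3, 57.0), (4, 68.3), (5, 77.9), (6, 85.0)]
m, c = 0.0, 0.0
alpha = 0.02   # learning rate
n = len(data)

for _ in range(20000):
    # Gradients of MSE with respect to m and c (formulas above).
    grad_m = (2 / n) * sum(((m * x + c) - y) * x for x, y in data)
    grad_c = (2 / n) * sum(((m * x + c) - y) for x, y in data)
    m -= alpha * grad_m
    c -= alpha * grad_c

print(round(m, 2), round(c, 2))  # slope and intercept of the fitted line
```

The loop recovers a slope near 9.3 and an intercept near 30.4, matching the least-squares fit for this data.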
+ +

Types of Gradient Descent

+
    +
  1. Batch Gradient Descent: Uses all data points for each update. Accurate but slow for large datasets.
  2. +
  3. Stochastic Gradient Descent (SGD): Uses one random data point per update. Fast but noisy.
  4. +
  5. Mini-batch Gradient Descent: Uses small batches (e.g., 32 points). Best of both worlds!
  6. +
+ +
+
โš ๏ธ Watch Out!
+
+ Gradient descent can get stuck in local minima (small valleys) instead of finding the global minimum (deepest valley). This is more common with complex, non-convex loss functions. +
+
-
-

๐ŸŽฏ Key Takeaways

+

Convergence Criteria

+

How do we know when to stop? We stop when:

    -
  • Statistics transforms raw data into meaningful insights
  • -
  • Two main branches: Descriptive (what happened) and Inferential (what will happen)
  • -
  • Essential for decision-making across all fields
  • -
  • Combines mathematics with real-world problem solving
  • +
  • Loss stops decreasing significantly (e.g., change < 0.0001)
  • +
  • Gradients become very small (near zero)
  • +
  • We reach maximum iterations (e.g., 1000 steps)
-
- - -
-
- Topic 2 -

๐Ÿ‘ฅ Population vs Sample

-

Understanding the difference between the entire group and a subset

-
+
-
-

Introduction

-

What is it? A population includes ALL members of a defined group. A sample is a subset selected from that population.

-

Why it matters: It's usually impossible or impractical to study entire populations. Sampling allows us to make inferences about large groups by studying smaller representative groups.

-

When to use it: Use populations when you can access all data; use samples when populations are too large, expensive, or time-consuming to study.

-
+ +
+
+

📊 Supervised - Classification: Logistic Regression

+ +
+
+

Logistic Regression is used for binary classification - when you want to predict categories (yes/no, spam/not spam, disease/healthy) not numbers. Despite its name, it's a classification algorithm!

+ +
+
Key Concepts
+
    +
  • Binary classification (2 classes: 0 or 1)
  • +
  • Uses sigmoid function to output probabilities
  • +
  • Output is always between 0 and 1
  • +
  • Uses log loss (cross-entropy) instead of MSE
  • +
+
-
-
๐Ÿ’ก REAL-WORLD ANALOGY
-

Think of tasting soup. You don't need to eat the entire pot (population) to know if it needs salt. A single spoonful (sample) gives you a good ideaโ€”as long as you stirred it well first!

-
+

Why Not Linear Regression?

+

Imagine using linear regression (y = mx + c) for classification. The problems:

+
    +
  • Can predict values < 0 or > 1 (not valid probabilities!)
  • +
  • Sensitive to outliers pulling the line
  • +
  • No natural threshold for decision making
  • +
-
-

Interactive Visualization

- -
- - -
- - +
+
โš ๏ธ The Problem
+
+ Linear regression: ŷ = mx + c can give ANY value (-∞ to +∞)
+ Classification needs: probability between 0 and 1
-
- -
-

Key Differences

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
AspectPopulationSample
SizeEntire group (N)Subset (n)
SymbolN (uppercase)n (lowercase)
CostHighLower
TimeLongShorter
Accuracy100% (if measured correctly)Has sampling error
-
-
-
โš ๏ธ COMMON MISTAKE
-

Biased Sampling: If your sample doesn't represent the population, your conclusions will be wrong. Example: Surveying only morning shoppers at a store will miss evening customer patterns.

-
+

Enter the Sigmoid Function

+

The sigmoid function σ(z) squashes any input into the range [0, 1], making it perfect for probabilities!

-
-
โœ… PRO TIP
-

For a sample to be representative, use random sampling. Every member of the population should have an equal chance of being selected.

-
+
+ Sigmoid Function: + σ(z) = 1 / (1 + e^(-z)) +
where:
z = w·x + b (linear combination)
σ(z) = probability (always between 0 and 1)
e ≈ 2.718 (Euler's number)
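The sigmoid is a one-liner in code; the spot checks below mirror the properties listed in this section:

```python
import math

def sigmoid(z):
    """Squash any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))    # 0.5 at the middle point
print(sigmoid(10))   # close to 1 for large positive z
print(sigmoid(-10))  # close to 0 for large negative z
```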
+
-
-

๐ŸŽฏ Key Takeaways

+

Sigmoid Properties:

    -
  • Population (N): All members of a defined group
  • -
  • Sample (n): A subset selected from the population
  • -
  • Good samples are random and representative
  • -
  • Larger samples generally provide better estimates
  • +
  • Input: Any real number (-∞ to +∞)
  • +
  • Output: Always between 0 and 1
  • +
  • Shape: S-shaped curve
  • +
  • At z=0: σ(0) = 0.5 (middle point)
  • +
  • As z→∞: σ(z) → 1
  • +
  • As z→-∞: σ(z) → 0
-
- - - -
-
- Topic 3 -

๐Ÿ“ˆ Parameters vs Statistics

-

Population measures vs sample measures

-
-
-

Introduction

-

What is it? A parameter is a numerical characteristic of a population. A statistic is a numerical characteristic of a sample.

-

Why it matters: We usually can't measure parameters directly (populations are too large), so we estimate them using statistics from samples.

-

When to use it: Parameters are what we want to know; statistics are what we can calculate.

-
+
+
+ +
+

Figure: Sigmoid function transforms linear input to probability

+
-
-
๐Ÿ’ก REAL-WORLD EXAMPLE
-

You want to know the average height of all students in your country (parameter). You can't measure everyone, so you measure 1,000 students (sample) and calculate their average height (statistic) to estimate the population parameter.

-
+

Logistic Regression Formula

+
+ Complete Process: + 1. Linear combination: z = w·x + b
+ 2. Sigmoid transformation: p = σ(z) = 1/(1 + e^(-z))
+ 3. Decision: if p ≥ 0.5 → Class 1, else → Class 0 +
+ +

Example: Height Classification

+

Let's classify people as "Tall" (1) or "Not Tall" (0) based on height:

-
-

Common Parameters and Statistics

- +
- - - + + + - - - - - - - - - - - - - - - - - - - - - - - - - + + + + + +
Measure | Parameter (Population) | Statistic (Sample)
Height (cm) | Label | Probability
Mean (Average)ฮผ (mu)xฬ„ (x-bar)
Standard Deviationฯƒ (sigma)s
Varianceฯƒยฒsยฒ
Proportionppฬ‚ (p-hat)
SizeNn
150 | 0 (Not Tall) | 0.2
160 | 0 | 0.35
170 | 0 | 0.5
180 | 1 (Tall) | 0.65
190 | 1 | 0.8
200 | 1 | 0.9
-
-
-

The Relationship

-
-
Key Concept
-

- Statistic โ†’ Estimates โ†’ Parameter -

-

We use statistics (calculated from samples) to estimate parameters (unknown population values).

+
+
+ +
+

Figure: Logistic regression with decision boundary at 0.5

-
-
-
๐Ÿ“Š EXAMPLE
-
-

Scenario: A factory wants to know the average weight of cereal boxes.

-
    -
  • Population: All cereal boxes produced (millions)
  • -
  • Parameter: ฮผ = true average weight of ALL boxes (unknown)
  • -
  • Sample: 100 randomly selected boxes
  • -
  • Statistic: xฬ„ = 510 grams (calculated from the 100 boxes)
  • -
  • Inference: We estimate ฮผ โ‰ˆ 510 grams
  • -
+

Log Loss (Cross-Entropy)

+

We can't use MSE for logistic regression because it creates a non-convex optimization surface (multiple local minima). Instead, we use log loss:

+ +
+ Log Loss for Single Sample: + L(y, p) = -[y·log(p) + (1-y)·log(1-p)] +
where:
y = actual label (0 or 1)
p = predicted probability
-
-
-
โš ๏ธ COMMON MISTAKE
-

Confusing symbols! Greek letters (ฮผ, ฯƒ, ฯ) refer to parameters (population). Roman letters (xฬ„, s, r) refer to statistics (sample).

-
+

Understanding Log Loss:

+

Case 1: Actual y=1, Predicted p=0.9

+

Loss = -[1·log(0.9) + 0·log(0.1)] = -log(0.9) = 0.105 ✓ Low loss (good!)

-
-

๐ŸŽฏ Key Takeaways

-
    -
  • Parameter: Describes a population (usually unknown)
  • -
  • Statistic: Describes a sample (calculated from data)
  • -
  • Greek letters = population, Roman letters = sample
  • -
  • Statistics are used to estimate parameters
  • -
-
-
- - -
-
- Topic 4 -

๐Ÿ”ข Types of Data

-

Categorical, Numerical, Discrete, Continuous, Ordinal, Nominal

-
+

Case 2: Actual y=1, Predicted p=0.1

+

Loss = -[1·log(0.1) + 0·log(0.9)] = -log(0.1) = 2.303 ✗ High loss (bad!)

-
-

Introduction

-

What is it? Data comes in different types, and understanding these types determines which statistical methods you can use.

-

Why it matters: Using the wrong analysis method for your data type leads to incorrect conclusions. You can't calculate an average of colors!

-

When to use it: Before any analysis, identify your data type to choose appropriate statistical techniques.

-
+

Case 3: Actual y=0, Predicted p=0.1

+

Loss = -[0·log(0.1) + 1·log(0.9)] = -log(0.9) = 0.105 ✓ Low loss (good!)

-
-

Data Type Hierarchy

-
-
-
DATA
-
-
-
CATEGORICAL
-
NUMERICAL
-
-
-
Nominal
-
Ordinal
-
Discrete
-
Continuous
+
+
💡 Why Log Loss Works
+
+ Log loss heavily penalizes confident wrong predictions! If you predict 0.99 but the answer is 0, you get a huge penalty. This encourages the model to be accurate AND calibrated.
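The three cases worked above can be verified numerically (log is the natural logarithm here):

```python
import math

def log_loss(y, p):
    """Cross-entropy loss for a single sample with label y and probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(round(log_loss(1, 0.9), 3))  # confident and right -> 0.105
print(round(log_loss(1, 0.1), 3))  # confident and wrong -> 2.303
print(round(log_loss(0, 0.1), 3))  # confident and right -> 0.105
```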
-
-
-

Categorical Data

-

Represents categories or groups (qualitative)

- -
-
-

Nominal

-

Categories with NO order

-
    -
  • Colors: Red, Blue, Green
  • -
  • Gender: Male, Female, Non-binary
  • -
  • Country: USA, India, Japan
  • -
  • Blood Type: A, B, AB, O
  • -
-
-
-

Ordinal

-

Categories WITH meaningful order

-
    -
  • Education: High School < Bachelor's < Master's
  • -
  • Satisfaction: Poor < Fair < Good < Excellent
  • -
  • Medal: Bronze < Silver < Gold
  • -
  • Size: Small < Medium < Large
  • -
+

Training with Gradient Descent

+

Just like linear regression, we use gradient descent to optimize weights:

+ +
+ Gradient for Logistic Regression: + ∂Loss/∂w = (p - y)·x
+ ∂Loss/∂b = (p - y) +
Update: w = w - α·∂Loss/∂w +
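A compact training sketch using these gradients, on the height data from the table above (heights are rescaled so gradient descent behaves well; the learning rate and epoch count are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Height data from the table: (height_cm, label).
# Heights are rescaled as (h - 175) / 10 to keep gradients well-scaled.
data = [(150, 0), (160, 0), (170, 0), (180, 1), (190, 1), (200, 1)]
w, b, alpha = 0.0, 0.0, 0.5

for _ in range(5000):
    for h, y in data:
        x = (h - 175) / 10
        p = sigmoid(w * x + b)
        w -= alpha * (p - y) * x   # gradient step for the weight
        b -= alpha * (p - y)       # gradient step for the bias

# Taller people should get higher probabilities of "Tall".
print(sigmoid(w * (200 - 175) / 10 + b))  # near 1
print(sigmoid(w * (150 - 175) / 10 + b))  # near 0
```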
+ +
+
✅ Key Takeaway
+
+ Logistic regression = Linear regression + Sigmoid function + Log loss. It's called "regression" for historical reasons, but it's actually for classification!
+
-
-

Numerical Data

-

Represents quantities (quantitative)

- -
-
-

Discrete

-

Countable, specific values only

-
    -
  • Number of students: 25, 30, 42
  • -
  • Number of cars: 0, 1, 2, 3...
  • -
  • Dice roll: 1, 2, 3, 4, 5, 6
  • -
  • Number of children: 0, 1, 2, 3...
  • -
-

Can't have 2.5 students!

-
-
-

Continuous

-

Can take any value in a range

-
    -
  • Height: 165.3 cm, 180.7 cm
  • -
  • Weight: 68.5 kg, 72.3 kg
  • -
  • Temperature: 23.4ยฐC, 24.7ยฐC
  • -
  • Time: 3.25 seconds
  • -
-

Infinite precision possible

+ +
+
+

📊 Supervised - Classification: Support Vector Machines (SVM)

+ +
+
+ +

What is SVM?

+

Support Vector Machine (SVM) is a powerful supervised machine learning algorithm used for both classification and regression tasks. Unlike logistic regression which just needs any line that separates the classes, SVM finds the BEST decision boundary - the one with the maximum margin between classes.

+ +
+
Key Concepts
+
    +
  • Finds the best decision boundary with maximum margin
  • +
  • Support vectors are critical points that define the margin
  • +
  • Score is proportional to distance from boundary
  • +
  • Only support vectors matter - other points don't affect boundary
  • +
+
+ +
+
💡 Key Insight
+
+ SVM doesn't just want w·x + b > 0, it wants every point to be confidently far from the boundary. The score is directly proportional to the distance from the decision boundary!
-
-
-
๐Ÿ’ก QUICK TEST
-

Ask yourself:

-
    -
  1. Is it a label/category? โ†’ Categorical
  2. -
  3. Is it a number? โ†’ Numerical
  4. -
  5. Can you count it? โ†’ Discrete
  6. -
  7. Can you measure it? โ†’ Continuous
  8. -
  9. Does order matter? โ†’ Ordinal (else Nominal)
  10. -
-
+ +

Dataset and Example

+

Let's work with a simple 2D dataset to understand SVM:

-
-
๐Ÿ“Š EXAMPLES
- +
- - - + + + + - - - - - - - - - - - - - - - - - - - - + + + + + +
Data | Type | Reason
Point | X₁ | X₂ | Class
Zip codesCategorical (Nominal)Numbers used as labels, not quantities
Test scores (A, B, C, D, F)Categorical (Ordinal)Categories with clear order
Number of pages in booksNumerical (Discrete)Countable whole numbers
Reaction time in millisecondsNumerical (Continuous)Can be measured to any precision
A | 2 | 7 | +1
B | 3 | 8 | +1
C | 4 | 7 | +1
D | 6 | 2 | -1
E | 7 | 3 | -1
F | 8 | 2 | -1
-
-
-
โš ๏ธ COMMON MISTAKE
-

Just because something is written as a number doesn't make it numerical! Phone numbers, jersey numbers, and zip codes are categorical because they identify categories, not quantities.

-
- -
-

๐ŸŽฏ Key Takeaways

-
    -
  • Categorical: Labels/categories (Nominal: no order, Ordinal: has order)
  • -
  • Numerical: Quantities (Discrete: countable, Continuous: measurable)
  • -
  • Data type determines which statistical methods to use
  • -
  • Always identify data type before analysis
  • -
-
-
- - -
-
- Topic 5 -

๐Ÿ“ Measures of Central Tendency

-

Mean, Median, Mode - Finding the center of data

-
+

Initial parameters: w₁ = 1, w₂ = 1, b = -10
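With these starting parameters, the score w·x + b and the margin condition can be checked for each point in the table (a small sketch):

```python
# Scores w.x + b for the 6 points with w = [1, 1], b = -10.
points = {"A": (2, 7, +1), "B": (3, 8, +1), "C": (4, 7, +1),
          "D": (6, 2, -1), "E": (7, 3, -1), "F": (8, 2, -1)}
w, b = [1.0, 1.0], -10.0

for name, (x1, x2, label) in points.items():
    score = w[0] * x1 + w[1] * x2 + b
    ok = label * score >= 1   # SVM's margin condition y*(w.x + b) >= 1
    print(name, score, "meets margin" if ok else "violates margin")
```

Point A gets score -1 despite being class +1, and E and F sit exactly on the boundary, so these starting parameters are far from the maximum-margin solution.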

-
-

Introduction

-

What is it? Measures of central tendency are single values that represent the "center" or "typical" value in a dataset.

-

Why it matters: Instead of looking at hundreds of numbers, one central value summarizes the data. "Average salary" tells you more than listing every employee's salary.

-

When to use it: When you need to summarize data with a single representative value.

-
+ +

Decision Boundary

+

The decision boundary is a line (or hyperplane in higher dimensions) that separates the two classes. It's defined by the equation:

-
-
๐Ÿ’ก REAL-WORLD ANALOGY
-

Imagine finding the "center" of a group of people standing on a field. Mean is like finding the balance point where they'd balance on a seesaw. Median is literally the middle person. Mode is where the most people are clustered together.

-
+
+ Decision Boundary Equation: + w·x + b = 0 +
where:
w = [w₁, w₂] is the weight vector
x = [x₁, x₂] is the data point
b is the bias term
+
-
-

Mathematical Foundations

- -
-
Mean (Average)
-
- ฮผ = - - ฮฃx - - n - -
-

Where:

-
    -
  • ฮผ (mu) = population mean or xฬ„ (x-bar) = sample mean
  • -
  • ฮฃx = sum of all values
  • -
  • n = number of values
  • +
    +
    Interpretation
    +
      +
    • w·x + b > 0 → point above line → class +1
    • +
    • w·x + b < 0 → point below line → class -1
    • +
    • w·x + b = 0 → exactly on boundary
    -
    -

    Steps:

    -
      -
    1. Add all values together
    2. -
    3. Divide by the count of values
    4. -
    -
    -
    -
    Median (Middle Value)
    -
    -

    If odd number of values: Middle value

    -

    If even number of values: Average of two middle values

    -
    -
    -

    Steps:

    -
      -
    1. Sort values in ascending order
    2. -
    3. Find the middle position: (n + 1) / 2
    4. -
    5. If between two values, average them
    6. -
    +
    +
    +
    +

    Figure 3: SVM decision boundary with 6 data points. Hover to see scores.

    -
    -
    Mode (Most Frequent)
    -
    -

    The value(s) that appear most frequently

    +
    +
    + +
    -
    -

    Types:

    -
      -
    • Unimodal: One mode
    • -
    • Bimodal: Two modes
    • -
    • Multimodal: More than two modes
    • -
    • No mode: All values appear equally
    • -
    +
    + + +
    +
    + +
    -
    -
    -

    Interactive Calculator

    - -
    -
    - - - - -
    -
    -
    Mean: 30
    -
    Median: 30
    -
    Mode: None
    + +

    Margin and Support Vectors

    + +
    +
    ๐Ÿ“ Understanding Margin
    +
+ The margin is the distance between the decision boundary and the closest points from each class. Support vectors are the points exactly at the margin (with score = ±1). These are the points with "lowest acceptable confidence" and they're the only ones that matter for defining the boundary!
    -
    -
    -
    ๐Ÿ“Š WORKED EXAMPLE
    -

    Dataset: Test scores: 65, 70, 75, 80, 85, 90, 95

    -
    -

    Mean:

    -

    Sum = 65 + 70 + 75 + 80 + 85 + 90 + 95 = 560

    -

    Mean = 560 / 7 = 80

    - -

    Median:

    -

    Already sorted. Middle position = (7 + 1) / 2 = 4th value

    -

    Median = 80

    - -

    Mode:

    -

    All values appear once. No mode

    +
+ Margin Constraints: + For positive points (yᵢ = +1): w·xᵢ + b ≥ +1
+ For negative points (yᵢ = -1): w·xᵢ + b ≤ -1
+
+ Combined: yᵢ(w·xᵢ + b) ≥ 1
+
+ Margin Width: 2/||w|| +
To maximize margin → minimize ||w||
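For example, the margin width 2/||w|| can be computed for a couple of weight vectors (the vectors are illustrative):

```python
import math

def margin_width(w):
    """Margin width 2 / ||w|| for weight vector w."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return 2.0 / norm

print(margin_width([1.0, 1.0]))  # ||w|| = sqrt(2), width = 2/sqrt(2)
print(margin_width([0.5, 0.5]))  # smaller ||w|| -> wider margin
```

Halving w doubles the margin width, which is why maximizing the margin and minimizing ||w|| are the same goal.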
    -
    -
    -

    When to Use Which?

    -
    -
    -

    Use Mean

    -
      -
    • Data is symmetrical
    • -
    • No extreme outliers
    • -
    • Numerical data
    • -
    • Need to use all data points
    • -
    -
    -
    -

    Use Median

    -
      -
    • Data has outliers
    • -
    • Data is skewed
    • -
    • Ordinal data
    • -
    • Need robust measure
    • -
    -
    -
    -

    Use Mode

    -
      -
    • Categorical data
    • -
    • Finding most common value
    • -
    • Discrete data
    • -
    • Multiple peaks in data
    • -
    +
    +
    +
    +

    Figure 4: Decision boundary with margin lines and support vectors highlighted in cyan

    -
    -
    -
    โš ๏ธ COMMON MISTAKE
    -

    Mean is affected by outliers! In salary data like $30K, $35K, $40K, $45K, $500K, the mean is $130K (misleading!). The median of $40K better represents typical salary.

    -
    + +

    Hard Margin vs Soft Margin

    -
    -
    โœ… PRO TIP
    -

    For skewed data (like income, house prices), always report the median along with the mean. If they're very different, your data has outliers or is skewed!

    -
    +

    Hard Margin SVM

    +

    Hard margin SVM requires perfect separation - no points can violate the margin. It works only when data is linearly separable.

    - -
    -

    ๐Ÿ“ Worked Example - Step by Step

    - -
    -

    Problem:

    -

    Find the mean, median, and mode of: [12, 15, 12, 18, 20, 15, 12, 22]

    +
+ Hard Margin Optimization: + minimize (1/2)||w||²
+ subject to: yᵢ(w·xᵢ + b) ≥ 1 for all i
    - -
    -

    Solution:

    - -
    -
    Step 1:
    -
    -

    Calculate the Mean (Average)

    -
    - Sum = 12 + 15 + 12 + 18 + 20 + 15 + 12 + 22 = 126
    - Count (n) = 8 values
    - Mean = Sum รท n = 126 รท 8 = 15.75 -
    -

    Add all values together, then divide by how many values there are

    -
    + +
    +
    โš ๏ธ Hard Margin Limitation
    +
    + Hard margin can lead to overfitting if we force perfect separation on noisy data! Real-world data often has outliers and noise.
    - -
    -
    Step 2:
    -
    -

    Find the Median (Middle Value)

    -
    - Sorted data: [12, 12, 12, 15, 15, 18, 20, 22]
    - Even number of values (8), so average the middle two
    - Middle positions: 4th and 5th values = 15 and 15
    - Median = (15 + 15) รท 2 = 15 -
    -

    For even-sized datasets, average the two middle values

    -
    +
    + +

    Soft Margin SVM

    +

    Soft margin SVM allows some margin violations, making it more practical for real-world data. It balances margin maximization with allowing some misclassifications.

    + +
+ Soft Margin Cost Function: + Cost = (1/2)||w||² + C·Σ max(0, 1 - yᵢ(w·xᵢ + b))
+       ↓                           ↓
+ Maximize margin      Hinge Loss
+                           (penalize violations) +
    + + +

    The C Parameter

    +

    The C parameter controls the trade-off between maximizing the margin and minimizing classification errors. It acts like regularization in other ML algorithms.

    + +
    +
    Effects of C Parameter
    +
      +
    • Small C (0.1 or 1): Wider margin, more violations allowed, better generalization, use when data is noisy
    • +
    • Large C (1000): Narrower margin, fewer violations, classify everything correctly, risk of overfitting, use when data is clean
    • +
    +
    + +
    +
    +
    - -
    -
    Step 3:
    -
    -

    Find the Mode (Most Frequent Value)

    -
    - Frequency count:
    - โ€ข 12 appears 3 times โ† Most frequent!
    - โ€ข 15 appears 2 times
    - โ€ข 18, 20, 22 each appear 1 time
    - Mode = 12 -
    -

    The mode is the value that appears most often

    +

    Figure 5: Effect of C parameter on margin and violations

    +
    + +
    +
    + + +

Slide to see: 0.1 → 1 → 10 → 1000

    +
    +
    +
    +
    Margin Width
    +
    2.00
    +
    +
    Violations
    +
    0
    +
    +
    +
    + + +

    Training Algorithm

    +

    SVM can be trained using gradient descent. For each training sample (xแตข, yแตข), we check if it violates the margin and update weights accordingly.

    + +
+ Update Rules:
+
+ Case 1: No violation (yᵢ(w·xᵢ + b) ≥ 1)
+   w = w - η·w  (just regularization)
+   b = b
+
+ Case 2: Violation (yᵢ(w·xᵢ + b) < 1)
+   w = w - η(w - C·yᵢ·xᵢ)
+   b = b + η·C·yᵢ
+
+ where η = learning rate (e.g., 0.01) +
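These update rules translate into a short sub-gradient training loop. The sketch below uses the 6-point dataset from earlier; η, C, and the epoch count are illustrative:

```python
# Soft-margin SVM trained with the update rules above on the 6-point dataset.
data = [((2, 7), +1), ((3, 8), +1), ((4, 7), +1),
        ((6, 2), -1), ((7, 3), -1), ((8, 2), -1)]
w, b = [0.0, 0.0], 0.0
eta, C = 0.01, 1.0

for _ in range(2000):
    for (x1, x2), y in data:
        score = w[0] * x1 + w[1] * x2 + b
        if y * score >= 1:                       # case 1: no violation
            w = [wi - eta * wi for wi in w]      # shrink w (regularization only)
        else:                                    # case 2: violation
            w = [w[0] - eta * (w[0] - C * y * x1),
                 w[1] - eta * (w[1] - C * y * x2)]
            b = b + eta * C * y

# All six points should now be classified correctly by sign(w.x + b).
preds = [1 if w[0] * x1 + w[1] * x2 + b > 0 else -1
         for (x1, x2), _ in data]
print(preds)
```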
    + +
    +
    +
    - -
    - Final Answer: - Mean = 15.75, Median = 15, Mode = 12 +

    Figure 6: SVM training visualization - step through each point

    +
    + +
    +
    + + +
    - -
    - โœ“ Check: -

    Mean (15.75) is slightly higher than median (15) because the outlier 22 pulls it up. The mode (12) is the lowest because it's the most common value at the lower end.

    +
    +
    Step: 0 / 6
    +
    Current Point: -
    +
    w = [0.00, 0.00]
    +
    b = 0.00
    +
    Violation: -
    - -
    -

    ๐Ÿ’ช Try These:

    -
      -
    1. Find the mean of: [5, 10, 15, 20]
    2. -
    3. What's the median of: [3, 1, 4, 1, 5]?
    4. -
    5. Find the mode of: [2, 2, 3, 4, 4, 4, 5]
    6. -
    - - -
    -

    ๐ŸŽฏ Key Takeaways

    -
      -
    • Mean: Sum of all values divided by count (affected by outliers)
    • -
    • Median: Middle value when sorted (resistant to outliers)
    • -
    • Mode: Most frequent value (useful for categorical data)
    • -
    • Choose the measure that best represents your data type and distribution
    • -
    -
    -
- - -
-
- Topic 6 -

โšก Outliers

-

Extreme values that don't fit the pattern

-
+ +

SVM Kernels (Advanced)

+

Real-world data is often not linearly separable. Kernels transform data to higher dimensions where a linear boundary exists, which appears non-linear in the original space!

-
-

Introduction

-

What is it? Outliers are data points that are significantly different from other observations in a dataset.

-

Why it matters: Outliers can indicate data errors, special cases, or important patterns. They can also severely distort statistical analyses.

-

When to use it: Always check for outliers before analyzing data, especially when calculating means and standard deviations.

-
+
+
💡 The Kernel Trick
+
+ Kernels let us solve non-linear problems without explicitly computing high-dimensional features! They compute similarity between points in transformed space efficiently. +
+
-
-
๐Ÿ’ก REAL-WORLD EXAMPLE
-

In a salary dataset for entry-level employees: $35K, $38K, $40K, $37K, $250K. The $250K is an outlierโ€”maybe it's a data entry error (someone added an extra zero) or a special case (CEO's child). Either way, it needs investigation!

-
+
+ Three Main Kernels:
+
+ 1. Linear Kernel
+ K(x₁, x₂) = x₁·x₂
+ Use case: Linearly separable data
+
+ 2. Polynomial Kernel (degree 2)
+ K(x₁, x₂) = (x₁·x₂ + 1)²
+ Use case: Curved boundaries, circular patterns
+
+ 3. RBF / Gaussian Kernel
+ K(x₁, x₂) = e^(-γ||x₁-x₂||²)
+ Use case: Complex non-linear patterns
+ Most popular in practice! +
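Each kernel is a small similarity function; γ and the polynomial degree below are illustrative defaults:

```python
import math

def linear_kernel(a, b):
    """Plain dot product: similarity in the original space."""
    return sum(ai * bi for ai, bi in zip(a, b))

def poly_kernel(a, b, degree=2):
    """Dot product in an implicit polynomial feature space."""
    return (linear_kernel(a, b) + 1) ** degree

def rbf_kernel(a, b, gamma=0.5):
    """Similarity that decays with squared distance between points."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)

x1, x2 = [1.0, 2.0], [2.0, 1.0]
print(linear_kernel(x1, x2))  # 4.0
print(poly_kernel(x1, x2))    # 25.0
print(rbf_kernel(x1, x1))     # 1.0, identical points are maximally similar
```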
-
-

Detection Methods

-
-
-

IQR Method

-

Most common approach:

-
    -
  • Calculate Q1, Q3, and IQR = Q3 - Q1
  • -
  • Lower fence = Q1 - 1.5 ร— IQR
  • -
  • Upper fence = Q3 + 1.5 ร— IQR
  • -
  • Outliers fall outside fences
  • -
-
-
-

Z-Score Method

-

For normal distributions:

-
    -
  • Calculate z-score for each value
  • -
  • z = (x - ฮผ) / ฯƒ
  • -
  • If |z| > 3: definitely outlier
  • -
  • If |z| > 2: possible outlier
  • -
+
+
+
+

Figure 7: Kernel comparison on non-linear data

-
-
-
โš ๏ธ COMMON MISTAKE
-

Never automatically delete outliers! They might be: (1) Valid extreme values, (2) Data entry errors, (3) Important discoveries. Always investigate before removing.

-
+
+
+ +
+ + + +
+
+ +
+ + +

Key Formulas Summary

+ +
+ Essential SVM Formulas:
+
+ 1. Decision Boundary: wยทx + b = 0
+
+ 2. Classification Rule: sign(wยทx + b)
+
+ 3. Margin Width: 2/||w||
+
+ 4. Hard Margin Optimization:
+    minimize (1/2)||w||ยฒ
+    subject to yแตข(wยทxแตข + b) โ‰ฅ 1
+
+ 5. Soft Margin Cost:
+    (1/2)||w||ยฒ + Cยทฮฃ max(0, 1 - yแตข(wยทxแตข + b))
+
+ 6. Hinge Loss: max(0, 1 - yแตข(wยทxแตข + b))
+
+ 7. Update Rules (if violation):
+    w = w - ฮท(w - Cยทyแตขยทxแตข)
+    b = b + ฮทยทCยทyแตข
+
+ 8. Kernel Functions:
+    Linear: K(xโ‚, xโ‚‚) = xโ‚ยทxโ‚‚
+    Polynomial: K(xโ‚, xโ‚‚) = (xโ‚ยทxโ‚‚ + 1)^d
+    RBF: K(xโ‚, xโ‚‚) = e^(-ฮณ||xโ‚-xโ‚‚||ยฒ) +
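Formulas 6 and 7 can be combined into a tiny sketch of one stochastic-gradient step for the soft-margin objective (a minimal illustration with made-up numbers, not a full SVM trainer; function names are our own):

```python
def hinge_loss(w, b, x, y):
    # Formula 6: max(0, 1 - y*(w·x + b))
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)

def sgd_step(w, b, x, y, C=1.0, eta=0.1):
    # Formula 7: apply the update only when the margin is violated.
    # (Full trainers also shrink w by the regularization term otherwise.)
    if hinge_loss(w, b, x, y) > 0:
        w = [wi - eta * (wi - C * y * xi) for wi, xi in zip(w, x)]
        b = b + eta * C * y
    return w, b

# A misclassified positive example: loss = max(0, 1 - 0) = 1 > 0, so the update fires.
w, b = sgd_step([0.0, 0.0], 0.0, x=[1.0, 2.0], y=+1)
print(w, b)  # [0.1, 0.2] 0.1
```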
+ + +

Practical Insights

+ +
+
โœ… Why SVM is Powerful
+
+ SVM only cares about support vectors - the points closest to the boundary. Other points don't affect the decision boundary at all! This makes it memory efficient and robust. +
+
+ +
+
When to Use SVM
+
    +
  • Small to medium datasets (works great up to ~10,000 samples)
  • +
  • High-dimensional data (even more features than samples!)
  • +
  • Clear margin of separation exists between classes
  • +
  • Need interpretable decision boundary
  • +
+
-
-

๐ŸŽฏ Key Takeaways

+

Advantages

    -
  • Outliers are extreme values that differ significantly from other data
  • -
  • Use IQR method (1.5 ร— IQR rule) or Z-score method to detect
  • -
  • Mean is heavily affected by outliers; median is resistant
  • -
  • Always investigate outliers before deciding to keep or remove
  • +
  • Effective in high dimensions: Works well even when features > samples
  • +
  • Memory efficient: Only stores support vectors, not entire dataset
  • +
  • Versatile: Different kernels for different data patterns
  • +
  • Robust: Works well with clear margin of separation
-
-
- - -
-
- Topic 7 -

๐Ÿ“ Variance & Standard Deviation

-

Measuring spread and variability in data

-
-
-

Introduction

-

What is it? Variance measures the average squared deviation from the mean. Standard deviation is the square root of variance.

-

Why it matters: Shows how spread out data is. Low values mean data is clustered; high values mean data is scattered.

-

When to use it: Whenever you need to understand data variabilityโ€”in finance (risk), manufacturing (quality control), or research (reliability).

-
+

Disadvantages

+
    +
  • Slow on large datasets: Training time grows quickly with >10k samples
  • +
  • No probability estimates: Doesn't directly provide confidence scores
  • +
  • Kernel choice: Requires expertise to select right kernel
  • +
  • Feature scaling: Very sensitive to feature scales
  • +
-
-

Mathematical Formulas

-
-
Population Variance (ฯƒยฒ)
-
ฯƒยฒ = ฮฃ(x - ฮผ)ยฒ / N
-

Where N = population size, ฮผ = population mean

-
-
-
Sample Variance (sยฒ)
-
sยฒ = ฮฃ(x - xฬ„)ยฒ / (n - 1)
-

Where n = sample size, xฬ„ = sample mean. We use (n-1) for unbiased estimation.

+ +

Real-World Example: Email Spam Classification

+ +
+
๐Ÿ“ง Email Spam Detection
+

Imagine we have emails with two features:

+
    +
  • xโ‚ = number of promotional words ("free", "buy", "limited")
  • +
  • xโ‚‚ = number of capital letters
  • +
+

+ SVM finds the widest "road" between spam and non-spam emails. Support vectors are the emails closest to this road - they're the trickiest cases that define our boundary! An email far from the boundary is clearly spam or clearly legitimate. +

-
-
Standard Deviation
-
ฯƒ = โˆš(variance)
-

Same units as original data, easier to interpret

+ +
+
๐ŸŽฏ Key Takeaway
+
+ Unlike other algorithms that try to classify all points correctly, SVM focuses on the decision boundary. It asks: "What's the safest road I can build between these two groups?" The answer: Make it as wide as possible! +
+
-
-
๐Ÿ“Š WORKED EXAMPLE
-

Dataset: [4, 8, 6, 5, 3, 7]

-
-

Step 1: Mean = (4+8+6+5+3+7)/6 = 5.5

-

Step 2: Deviations: [-1.5, 2.5, 0.5, -0.5, -2.5, 1.5]

-

Step 3: Squared: [2.25, 6.25, 0.25, 0.25, 6.25, 2.25]

-

Step 4: Sum = 17.5

-

Step 5: Variance = 17.5/(6-1) = 3.5

-

Step 6: Std Dev = โˆš3.5 = 1.87

+ +
+
+

๐Ÿ“Š Supervised - Classification K-Nearest Neighbors (KNN)

+ +
+
+

K-Nearest Neighbors is one of the simplest machine learning algorithms! To classify a new point, just look at its K nearest neighbors and take a majority vote. No training required!

+ +
+
Key Concepts
+
    +
  • Lazy learning: No training phase, just memorize data
  • +
  • K = number of neighbors to consider
  • +
  • Uses distance metrics (Euclidean, Manhattan)
  • +
  • Classification: majority vote | Regression: average
  • +
-
- -
-

๐Ÿ“ Worked Example - Step by Step

+

How KNN Works

+
    +
  1. Choose K: Decide how many neighbors (e.g., K=3)
  2. +
  3. Calculate distance: Find distance from new point to all training points
  4. +
  5. Find K nearest: Select K points with smallest distances
  6. +
  7. Vote: Majority class wins (or take average for regression)
  8. +
+ +

Distance Metrics

-
-

Problem:

-

Calculate the variance and standard deviation for the dataset: [4, 8, 6, 5, 3]

+
+ Euclidean Distance (straight line): + d = โˆš[(xโ‚-xโ‚‚)ยฒ + (yโ‚-yโ‚‚)ยฒ] +
Like measuring with a ruler - shortest path
- -
-

Solution:

- -
-
Step 1:
-
-

Calculate the Mean

-
- Sum = 4 + 8 + 6 + 5 + 3 = 26
- Mean (xฬ„) = 26 รท 5 = 5.2 -
-

First, we need the mean to calculate deviations

-
-
- -
-
Step 2:
-
-

Find Deviations from Mean

-
- (4 - 5.2) = -1.2
- (8 - 5.2) = 2.8
- (6 - 5.2) = 0.8
- (5 - 5.2) = -0.2
- (3 - 5.2) = -2.2 -
-

Subtract the mean from each value

-
-
- -
-
Step 3:
-
-

Square Each Deviation

-
- (-1.2)ยฒ = 1.44
- (2.8)ยฒ = 7.84
- (0.8)ยฒ = 0.64
- (-0.2)ยฒ = 0.04
- (-2.2)ยฒ = 4.84 -
-

Squaring eliminates negative signs and emphasizes larger deviations

-
+ +
+ Manhattan Distance (city blocks): + d = |xโ‚-xโ‚‚| + |yโ‚-yโ‚‚| +
Like walking on city grid - only horizontal/vertical +
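Both metrics are one-liners in plain Python (function names are our own):

```python
import math

def euclidean(p, q):
    # straight-line distance: sqrt((x1-x2)^2 + (y1-y2)^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # city-block distance: |x1-x2| + |y1-y2|
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean((0, 0), (3, 4)))  # 5.0  (the classic 3-4-5 triangle)
print(manhattan((0, 0), (3, 4)))  # 7    (3 blocks + 4 blocks)
```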
+ +
+
+
- -
-
Step 4:
-
-

Calculate Variance (sample)

-
- Sum of squared deviations = 1.44 + 7.84 + 0.64 + 0.04 + 4.84 = 14.8
- Divide by (n-1) = 5-1 = 4
- sยฒ = 14.8 รท 4 = 3.7 -
-

We use (n-1) for sample variance (Bessel's correction)

-
+

Figure: KNN classification - drag the test point to see predictions

+
+ +
+
+ +
- -
-
Step 5:
-
-

Calculate Standard Deviation

-
- s = โˆšsยฒ = โˆš3.7 โ‰ˆ 1.92 -
-

Standard deviation is the square root of variance

+
+ +
+ +
- -
- Final Answer: - Variance = 3.7, Standard Deviation = 1.92 -
- -
- โœ“ Interpretation: -

A standard deviation of 1.92 means most values fall within about 1.92 units of the mean (5.2). This indicates moderate spread in the data.

-
-
- -
-

๐Ÿ’ช Try These:

-
    -
  1. Calculate the standard deviation of: [2, 4, 6, 8]
  2. -
  3. Find the variance of: [10, 12, 14, 16, 18]
  4. -
- -
-
-
-

๐ŸŽฏ Key Takeaways

-
    -
  • Variance measures average squared deviation from mean
  • -
  • Standard deviation is square root of variance (same units as data)
  • -
  • Use (n-1) for sample variance to avoid bias
  • -
  • Higher values = more spread; lower values = more clustered
  • -
-
-
- - -
-
- Topic 8 -

๐ŸŽฏ Quartiles & Percentiles

-

Dividing data into equal parts

-
+

Worked Example

+

Test point at (2.5, 2.5), K=3:

-
-

Introduction

-

What is it? Quartiles divide sorted data into 4 equal parts. Percentiles divide data into 100 equal parts.

-

Why it matters: Shows relative position in a dataset. "90th percentile" means you scored better than 90% of people.

-
+ + + + + + + + + + + + +
PointPositionClassDistance
A(1.0, 2.0)Orange1.58
B(0.9, 1.7)Orange1.79
C(1.5, 2.5)Orange1.00 ← nearest!
D(4.0, 5.0)Yellow2.92
E(4.2, 4.8)Yellow2.86
F(3.8, 5.2)Yellow3.00
+ +

3-Nearest Neighbors: C (orange), A (orange), B (orange)

+

Vote: 3 orange, 0 yellow โ†’ Prediction: Orange ๐ŸŸ 
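This worked example takes only a few lines of plain Python; the sketch below recomputes every distance itself with `math.dist` and then votes (a minimal illustration, not a production KNN):

```python
import math
from collections import Counter

train = [
    ((1.0, 2.0), "orange"), ((0.9, 1.7), "orange"), ((1.5, 2.5), "orange"),
    ((4.0, 5.0), "yellow"), ((4.2, 4.8), "yellow"), ((3.8, 5.2), "yellow"),
]

def knn_predict(test, train, k=3):
    # Steps 1-2: distance from the test point to every training point
    dists = [(math.dist(test, pt), label) for pt, label in train]
    # Step 3: keep the k smallest distances
    dists.sort(key=lambda d: d[0])
    # Step 4: majority vote among the k nearest labels
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((2.5, 2.5), train))  # orange  (C, A, B are the 3 nearest)
```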

-
-

The Five-Number Summary

+

Choosing K

    -
  • Minimum: Smallest value
  • -
  • Q1 (25th percentile): 25% of data below this
  • -
  • Q2 (50th percentile/Median): Middle value
  • -
  • Q3 (75th percentile): 75% of data below this
  • -
  • Maximum: Largest value
  • +
  • K=1: Very sensitive to noise, overfits
  • +
  • Small K (3,5): Flexible boundaries, can capture local patterns
  • +
  • Large K (>10): Smoother boundaries, more stable but might underfit
  • +
  • Odd K: Avoids ties in binary classification
  • +
  • Rule of thumb: K = โˆšn (where n = number of training samples)
-
-
-
๐Ÿ’ก REAL-WORLD EXAMPLE
-

SAT scores: If you score 1350 and that's the 90th percentile, it means you scored higher than 90% of test-takers. Percentiles are perfect for standardized tests!

-
+
+
โš ๏ธ Critical: Feature Scaling!
+
+ Always scale features before using KNN! If one feature has range [0, 1000] and another [0, 1], the large feature dominates distance calculations. Use StandardScaler or MinMaxScaler. +
+
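To see what a scaler does under the hood, here is a hand-rolled min-max rescale (the same idea as scikit-learn's MinMaxScaler; the income/age numbers are made up):

```python
def min_max_scale(column):
    # Rescale one feature to [0, 1]: (x - min) / (max - min)
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

incomes = [20_000, 50_000, 100_000]  # huge range would dominate distances...
ages = [25, 40, 60]                  # ...over this small-range feature

print(min_max_scale(incomes))                         # [0.0, 0.375, 1.0]
print([round(v, 3) for v in min_max_scale(ages)])     # [0.0, 0.429, 1.0]
```

After scaling, both features contribute comparably to the distance calculation.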
-
-

๐ŸŽฏ Key Takeaways

+

Advantages

    -
  • Q1 = 25th percentile, Q2 = median, Q3 = 75th percentile
  • -
  • Percentiles show relative standing in a dataset
  • -
  • Five-number summary: Min, Q1, Q2, Q3, Max
  • -
  • Useful for understanding data distribution
  • +
  • โœ“ Simple to understand and implement
  • +
  • โœ“ No training time (just stores data)
  • +
  • โœ“ Works with any number of classes
  • +
  • โœ“ Can learn complex decision boundaries
  • +
  • โœ“ Naturally handles multi-class problems
-
-
- - -
-
- Topic 9 -

๐Ÿ“ฆ Interquartile Range (IQR)

-

Middle 50% of data and outlier detection

-
-
-

Introduction

-

What is it? IQR = Q3 - Q1. It represents the range of the middle 50% of your data.

-

Why it matters: IQR is resistant to outliers and is the foundation of the 1.5ร—IQR rule for outlier detection.

-
+

Disadvantages

+
    +
  • โœ— Slow prediction (compares to ALL training points)
  • +
  • โœ— High memory usage (stores entire dataset)
  • +
  • โœ— Sensitive to feature scaling
  • +
  • โœ— Curse of dimensionality (struggles with many features)
  • +
  • โœ— Sensitive to irrelevant features
  • +
-
-

The 1.5 ร— IQR Rule

-
-
Outlier Boundaries
-
- Lower Fence = Q1 - 1.5 ร— IQR
- Upper Fence = Q3 + 1.5 ร— IQR +
+
๐Ÿ’ก When to Use KNN
+
+ KNN works best on small to medium datasets (<10,000 samples) with few features (<20). Great for recommendation systems, pattern recognition, and as a baseline to compare other models!
-

Any value outside these fences is considered an outlier

+
+ +
+
+

๐Ÿ“Š Supervised - Evaluation Model Evaluation

+ +
+
+

How do we know if our model is good? Model evaluation provides metrics to measure performance and identify problems!

+ +
+
Key Metrics
+
    +
  • Confusion Matrix: Shows all prediction outcomes
  • +
  • Accuracy, Precision, Recall, F1-Score
  • +
  • ROC Curve & AUC: Performance across thresholds
  • +
  • Rยฒ Score: For regression problems
  • +
+
+ +

Confusion Matrix

+

The confusion matrix shows all possible outcomes of binary classification:

-
-

๐ŸŽฏ Key Takeaways

+
+ Confusion Matrix Structure: +
+                Predicted
+                Pos    Neg
+Actual  Pos     TP     FN
+        Neg     FP     TN
+
+ +

Definitions:

    -
  • IQR = Q3 - Q1 (range of middle 50% of data)
  • -
  • Resistant to outliers (unlike standard deviation)
  • -
  • 1.5ร—IQR rule: standard method for outlier detection
  • -
  • Box plots visualize IQR and outliers
  • +
  • True Positive (TP): Correctly predicted positive
  • +
  • True Negative (TN): Correctly predicted negative
  • +
  • False Positive (FP): Wrongly predicted positive (Type I error)
  • +
  • False Negative (FN): Wrongly predicted negative (Type II error)
-
-
- - -
-
- Topic 10 -

๐Ÿ“‰ Skewness

-

Understanding data distribution shape

-
-
-

Introduction

-

What is it? Skewness measures the asymmetry of a distribution.

-

Why it matters: Indicates whether data leans left or right, affecting which statistical methods to use.

-
- -
-

Types of Skewness

-
-
-

Negative (Left) Skew

-

Tail extends to the left

-

Mean < Median < Mode

-

Example: Test scores when most students do well

-
-
-

Symmetric (No Skew)

-

Perfectly balanced

-

Mean = Median = Mode

-

Example: Normal distribution

-
-
-

Positive (Right) Skew

-

Tail extends to the right

-

Mode < Median < Mean

-

Example: Income data, house prices

+
+
+
+

Figure: Confusion matrix for spam detection (TP=600, FP=100, FN=300, TN=900)

-
- -
-

๐Ÿ“ Worked Example - Step by Step

- -
-

Problem:

-

Calculate and interpret skewness for dataset: [2, 3, 4, 5, 15]

+

Classification Metrics

+ +
+ Accuracy: + Accuracy = (TP + TN) / (TP + TN + FP + FN) +
Percentage of correct predictions overall
- -
-

Solution:

- -
-
Step 1:
-
-

Calculate the Mean

-
- Sum = 2 + 3 + 4 + 5 + 15 = 29
- n = 5
- Mean (xฬ„) = 29/5 = 5.8 -
-

First, find the average of all values

-
-
- -
-
Step 2:
-
-

Calculate Standard Deviation

-
- Deviations from mean: (2-5.8), (3-5.8), (4-5.8), (5-5.8), (15-5.8)
- = -3.8, -2.8, -1.8, -0.8, 9.2
- Squared: 14.44, 7.84, 3.24, 0.64, 84.64
- Variance (sample) = (14.44+7.84+3.24+0.64+84.64)/4 = 110.8/4 = 27.7
- SD = โˆš27.7 = 5.26 -
-

We need standard deviation for the skewness formula

-
-
- -
-
Step 3:
-
-

Calculate Skewness

-
- Cubed deviations: (-3.8)ยณ, (-2.8)ยณ, (-1.8)ยณ, (-0.8)ยณ, (9.2)ยณ
- = -54.87, -21.95, -5.83, -0.51, 778.69
- Sum = 695.53
- Skewness = (695.53/5) / (5.26)ยณ = 139.11 / 145.77 = 0.95 -
-

Skewness formula uses cubed deviations divided by cubed standard deviation

-
-
- -
-
Step 4:
-
-

Interpret the Result

-
- Skewness = +0.95 (positive)
- Distribution is right-skewed
- The value 15 pulls the tail to the right
- Most data clustered on left, long tail on right -
-

Positive skewness means tail extends to the right

-
-
- -
- โœ“ Final Answer: - Skewness = +0.95 (positively skewed, right tail) + +

Example: (600 + 900) / (600 + 900 + 100 + 300) = 1500/1900 = 0.789 (78.9%)

+ +
+
โš ๏ธ Accuracy Paradox
+
+ Accuracy misleads on imbalanced data! If 99% of emails are not spam, a model that always predicts "not spam" gets 99% accuracy but is useless!
- -
- Check: -

The positive skewness confirms that the outlier (15) creates a long right tail, pulling the mean (5.8) above the median (4).

+
+ +
+ Precision: + Precision = TP / (TP + FP) +
"Of all predicted positives, how many are actually positive?" +
+ +

Example: 600 / (600 + 100) = 600/700 = 0.857 (85.7%)

+

Use when: False positives are costly (e.g., spam filter - don't want to block legitimate emails)

+ +
+ Recall (Sensitivity, TPR): + Recall = TP / (TP + FN) +
"Of all actual positives, how many did we catch?" +
+ +

Example: 600 / (600 + 300) = 600/900 = 0.667 (66.7%)

+

Use when: False negatives are costly (e.g., disease detection - can't miss sick patients)

+ +
+ F1-Score: + F1 = 2 ร— (Precision ร— Recall) / (Precision + Recall) +
Harmonic mean - balances precision and recall +
+ +

Example: 2 ร— (0.857 ร— 0.667) / (0.857 + 0.667) = 0.750 (75.0%)

+ +

ROC Curve & AUC

+

The ROC (Receiver Operating Characteristic) curve shows model performance across ALL possible thresholds!

+ +
+ ROC Components: + TPR (True Positive Rate) = TP / (TP + FN) = Recall
+ FPR (False Positive Rate) = FP / (FP + TN) +
Plot: FPR (x-axis) vs TPR (y-axis) +
+ +
+
+
+

Figure: ROC curve - slide threshold to see trade-off

- -
-

๐Ÿ’ช Try These:

-
    -
  1. Find skewness of [1, 1, 2, 3, 3]
  2. -
  3. Data with left tail - positive or negative skew?
  4. -
  5. If mean < median, what type of skew?
  6. -
- - -
-

๐ŸŽฏ Key Takeaways

+

Understanding ROC:

    -
  • Skewness measures asymmetry in distribution
  • -
  • Negative skew: tail to left, Mean < Median
  • -
  • Positive skew: tail to right, Mean > Median
  • -
  • Symmetric: Mean = Median = Mode
  • +
  • Top-left corner (0, 1): Perfect classifier
  • +
  • Diagonal line: Random guessing
  • +
  • Above diagonal: Better than random
  • +
  • Below diagonal: Worse than random (invert predictions!)
-
-
- - -
-
- Topic 11 -

๐Ÿ”— Covariance

-

How two variables vary together

-
-
-

Introduction

-

What is it? Covariance measures how two variables change together.

-

Why it matters: Shows if variables have a positive, negative, or no relationship.

-
+
+ AUC (Area Under Curve): + AUC = Area under ROC curve +
AUC = 1.0: Perfect | AUC = 0.5: Random | AUC > 0.8: Good +
-
-

Formula

-
-
Sample Covariance
-
Cov(X,Y) = ฮฃ(xแตข - xฬ„)(yแตข - ศณ) / (n-1)
+

Regression Metrics: Rยฒ Score

+

For regression problems, Rยฒ (coefficient of determination) measures how well the model explains variance:

+ +
+ Rยฒ Formula: + Rยฒ = 1 - (SS_res / SS_tot)
+
+ SS_res = ฮฃ(y - ลท)ยฒ (sum of squared residuals)
+ SS_tot = ฮฃ(y - ศณ)ยฒ (total sum of squares)
+
ศณ = mean of actual values
-
-
-

Interpretation

+

Interpreting Rยฒ:

    -
  • Positive: Variables increase together
  • -
  • Negative: One increases as other decreases
  • -
  • Zero: No linear relationship
  • -
  • Problem: Scale-dependent, hard to interpret magnitude
  • +
  • Rยฒ = 1.0: Perfect fit (model explains 100% of variance)
  • +
  • Rยฒ = 0.7: Model explains 70% of variance (pretty good!)
  • +
  • Rยฒ = 0.0: Model no better than just using the mean
  • +
  • Rยฒ < 0: Model worse than mean (something's very wrong!)
-
- -
-

๐Ÿ“ Worked Example - Step by Step

- -
-

Problem:

-

Find covariance between X=[2, 4, 6, 8] and Y=[1, 3, 5, 7]

-
- -
-

Solution:

- -
-
Step 1:
-
-

Calculate the Means

-
- xฬ„ = (2 + 4 + 6 + 8) / 4 = 20 / 4 = 5
- ศณ = (1 + 3 + 5 + 7) / 4 = 16 / 4 = 4 -
-

Find the average of each variable

-
-
- -
-
Step 2:
-
-

Create Deviation Table

-
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
xy(x-xฬ„)(y-ศณ)(x-xฬ„)(y-ศณ)
21-3-39
43-1-11
65111
87339
Sum20
-
-

Calculate deviations from means and their products

-
-
- -
-
Step 3:
-
-

Calculate Sample Covariance

-
- Cov(X,Y) = ฮฃ(x-xฬ„)(y-ศณ) / (n-1)
- Cov(X,Y) = 20 / (4-1)
- Cov(X,Y) = 20 / 3
- Cov(X,Y) = 6.67 -
-

Use n-1 for sample covariance (Bessel's correction)

-
-
- -
-
Step 4:
-
-

Interpret the Result

-
- Cov(X,Y) = 6.67 > 0
- Positive covariance indicates:
- โ€ข X and Y tend to increase together
- โ€ข When X is above its mean, Y tends to be above its mean
- โ€ข When X is below its mean, Y tends to be below its mean -
-

Positive covariance shows positive relationship

-
-
- -
- Final Answer: - Cov(X,Y) = 6.67 (positive relationship) -
- -
- โœ“ Verification: -

The positive covariance confirms that X and Y have a positive linear relationship. As X increases by 2, Y also increases by 2, showing consistent movement together.

+
+
+
+

Figure: Rยฒ calculation on height-weight regression

- -
-

๐Ÿ’ช Try These:

-
    -
  1. Calculate Cov(X,Y) for X=[1, 2, 3] and Y=[2, 4, 6]
  2. -
  3. If Cov(X,Y) = -5, what does this tell you about the relationship?
  4. -
  5. Find Cov(X,Y) for X=[5, 5, 5] and Y=[1, 2, 3]. What do you notice?
  6. -
- - +
+ +
+
+

8. Regularization

+ +
+
+

Regularization prevents overfitting by penalizing complex models. It adds a "simplicity constraint" to force the model to generalize better!

+ +
+
Key Concepts
+
    +
  • Prevents overfitting by penalizing large coefficients
  • +
  • L1 (Lasso): Drives coefficients to zero, feature selection
  • +
  • L2 (Ridge): Shrinks coefficients proportionally
  • +
  • ฮป controls penalty strength
  • +
+
-
-

๐ŸŽฏ Key Takeaways

+

The Overfitting Problem

+

Without regularization, models can learn training data TOO well:

    -
  • Covariance measures joint variability of two variables
  • -
  • Positive: variables move together; Negative: inverse relationship
  • -
  • Scale-dependent (unlike correlation)
  • -
  • Foundation for correlation calculation
  • +
  • Captures noise instead of patterns
  • +
  • High training accuracy, poor test accuracy
  • +
  • Large coefficient values
  • +
  • Model too complex for the problem
-
-
- - -
-
- Topic 12 -

๐Ÿ’ž Correlation

-

Standardized measure of relationship strength

-
-
-

Introduction

-

What is it? Correlation coefficient (r) is a standardized measure of linear relationship between two variables.

-

Why it matters: Always between -1 and +1, making it easy to interpret strength and direction of relationships.

-
+
+
โš ๏ธ Overfitting Example
+
+ Imagine fitting a 10th-degree polynomial to 12 data points. It perfectly fits training data (even noise) but fails on new data. Regularization prevents this! +
+
-
-

Pearson Correlation Formula

-
-
Correlation Coefficient (r)
-
r = Cov(X,Y) / (ฯƒโ‚“ ร— ฯƒแตง)
-

Covariance divided by product of standard deviations

+

The Regularization Solution

+

Instead of minimizing just the loss, we minimize: Loss + Penalty

+ +
+ Regularized Cost Function: + Cost = Loss + ฮป ร— Penalty(ฮธ) +
where:
ฮธ = model parameters (weights)
ฮป = regularization strength
Penalty = function of parameter magnitudes
-
-
-

Interpretation Guide

+

L1 Regularization (Lasso)

+
+ L1 Penalty: + Cost = MSE + ฮป ร— ฮฃ|ฮธแตข| +
Sum of absolute values of coefficients +
+ +

L1 Effects:

    -
  • r = +1: Perfect positive correlation
  • -
  • r = 0.7 to 0.9: Strong positive
  • -
  • r = 0.4 to 0.6: Moderate positive
  • -
  • r = 0.1 to 0.3: Weak positive
  • -
  • r = 0: No correlation
  • -
  • r = -0.1 to -0.3: Weak negative
  • -
  • r = -0.4 to -0.6: Moderate negative
  • -
  • r = -0.7 to -0.9: Strong negative
  • -
  • r = -1: Perfect negative correlation
  • +
  • Feature selection: Drives coefficients to exactly 0
  • +
  • Sparse models: Only important features remain
  • +
  • Interpretable: Easy to see which features matter
  • +
  • Use when: Many features, few are important
-
- -
-
๐Ÿ’ก REAL-WORLD EXAMPLE
-

Study hours vs exam scores typically show r = 0.7 (strong positive). More study hours correlate with higher scores.

-
- -
-

๐Ÿ“ Worked Example - Step by Step

- -
-

Problem:

-

Calculate correlation coefficient for X=[2, 4, 6, 8] and Y=[1, 3, 5, 7]

+

L2 Regularization (Ridge)

+
+ L2 Penalty: + Cost = MSE + ฮป ร— ฮฃฮธแตขยฒ +
Sum of squared coefficients
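The two penalties are easy to compare side by side; the coefficient vector, MSE, and λ below are made-up numbers purely for illustration:

```python
def l1_penalty(theta):
    # Lasso term: Σ|θi|
    return sum(abs(t) for t in theta)

def l2_penalty(theta):
    # Ridge term: Σθi²
    return sum(t ** 2 for t in theta)

theta = [3.0, -0.5, 0.0, 2.0]  # hypothetical model weights
mse, lam = 4.0, 0.1            # hypothetical loss and regularization strength

print(mse + lam * l1_penalty(theta))  # 4 + 0.1*5.5  = 4.55
print(mse + lam * l2_penalty(theta))  # 4 + 0.1*13.25 = 5.325
```

Note how L2 punishes the large weight 3.0 much harder (9 vs 3), which is why Ridge shrinks big coefficients while Lasso treats all magnitudes linearly and can zero small ones out.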
- -
-

Solution:

- -
-
Step 1:
-
-

Use Covariance from Topic 11

-
- From previous calculation:
- Cov(X,Y) = 6.67
- xฬ„ = 5, ศณ = 4 -
-

We already calculated this in Topic 11

-
-
- -
-
Step 2:
-
-

Calculate Standard Deviation of X

-
- Deviations from mean: -3, -1, 1, 3
- Squared deviations: 9, 1, 1, 9
- Sum of squared deviations = 20
- Variance_x = 20 / (4-1) = 20/3 = 6.67
- SD_x = โˆš6.67 โ‰ˆ 2.58 -
-

Standard deviation measures spread of X values

-
-
- -
-
Step 3:
-
-

Calculate Standard Deviation of Y

-
- Deviations from mean: -3, -1, 1, 3
- Squared deviations: 9, 1, 1, 9
- Sum of squared deviations = 20
- Variance_y = 20 / (4-1) = 20/3 = 6.67
- SD_y = โˆš6.67 โ‰ˆ 2.58 -
-

Standard deviation measures spread of Y values

-
-
- -
-
Step 4:
-
-

Calculate Correlation Coefficient

-
- r = Cov(X,Y) / (SD_x ร— SD_y)
- r = 6.67 / (2.58 ร— 2.58)
- r = 6.67 / 6.66
- r โ‰ˆ 1.00 -
-

Correlation standardizes covariance by dividing by both standard deviations

-
-
- -
-
Step 5:
-
-

Interpret the Result

-
- r = 1.00 (perfect positive correlation)
- This means:
- โ€ข X and Y have a perfect linear relationship
- โ€ข As X increases by 2, Y increases by 2 (exactly)
- โ€ข All points lie exactly on a straight line
- โ€ข The relationship is: Y = 0.5X (or Y = -1 + 0.5X when adjusted) -
-

r = 1 indicates perfect positive linear correlation

-
+ +

L2 Effects:

+
    +
  • Shrinks coefficients: Makes them smaller, not zero
  • +
  • Keeps all features: No automatic selection
  • +
  • Smooth predictions: Less sensitive to individual features
  • +
  • Use when: Many correlated features (multicollinearity)
  • +
+ +
+
+
- -
- Final Answer: - r = 1.00 (perfect positive linear correlation) +

Figure: Comparing vanilla, L1, and L2 regularization effects

+
+ +
+
+ +
- -
- โœ“ Verification: -

Check: If we plot these points, they form a perfect line. When X=2, Y=1; X=4, Y=3; X=6, Y=5; X=8, Y=7. The relationship is Y = (X/2) - 1 + (X/2) = 0.5X, which is indeed perfectly linear! โœ“

+
+ +

The Lambda (ฮป) Parameter

+
    +
  • ฮป = 0: No regularization (original model, risk of overfitting)
  • +
  • Small ฮป (0.01): Weak penalty, slight regularization
  • +
  • Medium ฮป (1): Balanced, good generalization
  • +
  • Large ฮป (100): Strong penalty, risk of underfitting
  • +
+ +
+
๐Ÿ’ก L1 vs L2: Quick Guide
+
+ Use L1 when:
+ โ€ข You suspect many features are irrelevant
+ โ€ข You want automatic feature selection
+ โ€ข You need interpretability
+
+ Use L2 when:
+ โ€ข All features might be useful
+ โ€ข Features are highly correlated
+ โ€ข You want smooth, stable predictions
+
+ Elastic Net: Combines both L1 and L2!
- -
-

๐Ÿ’ช Try These:

-
    -
  1. If Cov(X,Y) = 10, SD_x = 2, SD_y = 5, find r
  2. -
  3. What does r = -0.8 indicate about the relationship?
  4. -
  5. Can correlation be greater than 1? Why or why not?
  6. -
- - +
-
-

๐ŸŽฏ Key Takeaways

+
+
+

9. Bias-Variance Tradeoff

+ +
+
+

Every model makes two types of errors: bias and variance. The bias-variance tradeoff is the fundamental challenge in machine learning - we must balance them!

+ +
+
Key Concepts
+
    +
  • Bias = systematic error (underfitting)
  • +
  • Variance = sensitivity to training data (overfitting)
  • +
  • Can't minimize both simultaneously
  • +
  • Goal: Find the sweet spot
  • +
+
+ +

Understanding Bias

+

Bias is the error from overly simplistic assumptions. High bias causes underfitting.

+ +

Characteristics of High Bias:

    -
  • r ranges from -1 to +1
  • -
  • Measures strength AND direction of linear relationship
  • -
  • Scale-independent (unlike covariance)
  • -
  • Only measures LINEAR relationships
  • +
  • Model too simple for the problem
  • +
  • High error on training data
  • +
  • High error on test data
  • +
  • Can't capture underlying patterns
  • +
  • Example: Using a straight line for curved data
-
-
- - -
-
- Topic 13 -

๐Ÿ’ช Interpreting Correlation

-

Correlation vs causation and common pitfalls

-
-
-

The Golden Rule

-
-
โš ๏ธ CORRELATION โ‰  CAUSATION
-

Just because two variables are correlated does NOT mean one causes the other!

+
+
๐ŸŽฏ High Bias Example
+
+ Trying to fit a parabola with a straight line. No matter how much training data you have, a line can't capture the curve. That's bias! +
-
-
-

Common Scenarios

+

Understanding Variance

+

Variance is the error from sensitivity to small fluctuations in training data. High variance causes overfitting.

+ +

Characteristics of High Variance:

    -
  • Direct Causation: X causes Y (smoking causes cancer)
  • -
  • Reverse Causation: Y causes X (not the direction you thought)
  • -
  • Third Variable: Z causes both X and Y (confounding variable)
  • -
  • Coincidence: Pure chance with no real relationship
  • +
  • Model too complex for the problem
  • +
  • Very low error on training data
  • +
  • High error on test data
  • +
  • Captures noise as if it were pattern
  • +
  • Example: Using 10th-degree polynomial for simple data
-
-
-
๐Ÿ“Š FAMOUS EXAMPLE
-

Ice cream sales correlate with drowning deaths.

-

Does ice cream cause drowning? NO! The third variable is summer weatherโ€”more people swim in summer (more drownings) and eat ice cream in summer.

-
+
+
๐Ÿ“Š High Variance Example
+
+ A wiggly curve that passes through every training point perfectly, including outliers. Change one data point and the entire curve changes dramatically. That's variance! +
+
- -
-

๐Ÿ“ Worked Example - Step by Step

- -
-

Problem:

-

Study finds r = -0.75 between hours of TV watched and exam scores. Interpret this result and discuss causation.

+

The Tradeoff

+
+ Total Error Decomposition: + Total Error = Biasยฒ + Variance + Irreducible Error +
Irreducible error = noise in data (can't be eliminated)
- -
-

Solution:

- -
-
Step 1:
-
-

Analyze the Sign

-
- Negative correlation (r < 0)
- As one variable increases, the other decreases
- More TV โ†’ Lower scores (or vice versa) -
-

The negative sign tells us the direction of the relationship

-
-
- -
-
Step 2:
-
-

Analyze the Strength

-
- |r| = |-0.75| = 0.75
- Interpretation scale:
- โ€ข 0.0-0.3 = Weak
- โ€ข 0.3-0.7 = Moderate
- โ€ข 0.7-1.0 = Strong
- 0.75 falls in "Strong" category -
-

The absolute value determines relationship strength

-
-
- -
-
Step 3:
-
-

State the Relationship

-
- Strong negative correlation
- Students who watch more TV tend to have lower exam scores
- Relationship is fairly consistent but not perfect -
-

Combine sign and strength for complete interpretation

-
-
- -
-
Step 4:
-
-

Address Causation

-
- Correlation โ‰  Causation!
- Possible explanations:
- a) TV causes lower scores (less study time)
- b) Lower-performing students watch more TV (compensating)
- c) Third variable: stress causes both TV watching and poor performance
- Cannot determine causation from correlation alone -
-

Correlation never proves causation - always consider alternatives

-
-
- -
-
Step 5:
-
-

Predict Using Correlation

-
- If we know TV hours, we can predict exam score
- But prediction โ‰  causation
- rยฒ = 0.75ยฒ = 0.56 = 56% of variance explained -
-

rยฒ shows percentage of variance in one variable explained by the other

-
-
- -
- โœ“ Final Answer: - Strong negative correlation (r = -0.75), but does NOT prove TV causes lower scores -
- -
- Check: -

While the correlation is strong, we must resist concluding causation. The relationship could be coincidental, reverse-causal, or due to confounding variables.

-
-
- -
-

๐Ÿ’ช Try These:

-
    -
  1. r = +0.90 between study hours and grades. Interpret.
  2. -
  3. Can r = 1.5? Why or why not?
  4. -
  5. If r = 0, does that mean no relationship at all?
  6. -
- - -
-
-
-

๐ŸŽฏ Key Takeaways

+

The tradeoff:

    -
  • Correlation shows relationship, NOT causation
  • -
  • Always consider third variables (confounders)
  • -
  • Need controlled experiments to prove causation
  • -
  • Be skeptical of correlation claims in media
  • +
  • Decrease bias โ†’ Increase variance (more complex model)
  • +
  • Decrease variance โ†’ Increase bias (simpler model)
  • +
  • Goal: Minimize total error by balancing both
-
-
- - -
-
- Topic 14 -

๐ŸŽฒ Probability Basics

-

Foundation of statistical inference

-
-
-

Introduction

-

What is it? Probability measures the likelihood of an event occurring, ranging from 0 (impossible) to 1 (certain).

-

Why it matters: Foundation for all statistical inference, hypothesis testing, and prediction.

-
+
+
+ +
+

Figure: Three models showing underfitting, good fit, and overfitting

+
-
-

Basic Formula

-
-
Probability of Event E
-
P(E) = Number of favorable outcomes / Total number of possible outcomes
+

The Driving Test Analogy

+

Think of learning to drive:

+ +
+
Driving Test Analogy
+
    +
  • + High Bias (Underfitting):
    + Failed practice tests, failed real test
    + โ†’ Can't learn to drive at all +
  • +
  • + Good Balance:
    + Passed practice tests, passed real test
    + โ†’ Actually learned to drive! +
  • +
  • + High Variance (Overfitting):
    + Perfect on practice tests, failed real test
    + โ†’ Memorized practice, didn't truly learn +
  • +
-
-
-

Key Rules

+

How to Find the Balance

+ +

Reduce Bias (if underfitting):

    -
  • Range: 0 โ‰ค P(E) โ‰ค 1
  • -
  • Complement: P(not E) = 1 - P(E)
  • -
  • Addition (OR): P(A or B) = P(A) + P(B) - P(A and B)
  • -
  • Multiplication (AND): P(A and B) = P(A) ร— P(B) [if independent]
  • +
  • Use more complex model (more features, higher degree polynomial)
  • +
  • Add more features
  • +
  • Reduce regularization
  • +
  • Train longer (more iterations)
-
- -
-
๐Ÿ“Š EXAMPLE
-

Rolling a die:

-

P(rolling a 4) = 1/6 โ‰ˆ 0.167

-

P(rolling even) = 3/6 = 0.5

-

P(not rolling a 6) = 5/6 โ‰ˆ 0.833

-
-
-

๐ŸŽฏ Key Takeaways

+

Reduce Variance (if overfitting):

    -
  • Probability ranges from 0 to 1
  • -
  • P(E) = favorable outcomes / total outcomes
  • -
  • Complement rule: P(not E) = 1 - P(E)
  • -
  • Foundation for all statistical inference
  • +
  • Use simpler model (fewer features, lower degree)
  • +
  • Get more training data
  • +
  • Add regularization (L1, L2)
  • +
  • Use cross-validation
  • +
  • Feature selection or dimensionality reduction
-
-
- - -
-
- Topic 15 -

๐Ÿ”ท Set Theory

-

Union, intersection, and complement

-
-
-

Introduction

-

What is it? Set theory provides a mathematical framework for organizing events and calculating probabilities.

+

Model Complexity Curve

+
+
+ +
+

Figure: Error vs model complexity - find the sweet spot

+
+ +
+
๐Ÿ’ก Detecting Bias vs Variance
+
+ High Bias:
+ Training error: High ๐Ÿ”ด
+ Test error: High ๐Ÿ”ด
+ Gap: Small
+
+ High Variance:
+ Training error: Low ๐ŸŸข
+ Test error: High ๐Ÿ”ด
+ Gap: Large โš ๏ธ
+
+ Good Model:
+ Training error: Low ๐ŸŸข
+ Test error: Low ๐ŸŸข
+ Gap: Small โœ“ +
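The detection pattern above can be turned into a tiny heuristic. A minimal sketch in Python; the error thresholds (`high`, `gap_threshold`) are illustrative assumptions, not fixed rules:

```python
def diagnose(train_error, test_error, high=0.15, gap_threshold=0.10):
    """Classify a model using the train/test error pattern described above."""
    gap = test_error - train_error
    if train_error > high:
        return "high bias (underfitting)"     # both errors high, small gap
    if gap > gap_threshold:
        return "high variance (overfitting)"  # low train error, large gap
    return "good fit"                         # both errors low, small gap

print(diagnose(0.30, 0.32))   # high bias (underfitting)
print(diagnose(0.02, 0.25))   # high variance (overfitting)
print(diagnose(0.04, 0.06))   # good fit
```

In practice you would read `train_error` and `test_error` off a learning curve rather than pick thresholds by hand.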
+
+ +
+
โœ… Key Takeaway
+
+ The bias-variance tradeoff is unavoidable. You can't have zero bias AND zero variance. The art of machine learning is finding the sweet spot where total error is minimized! +
+
+
-
-

Key Concepts

+
+
+

๐Ÿ“Š Supervised - Evaluation Cross-Validation

+ +
+
+

Cross-validation gives more reliable performance estimates by testing your model on multiple different splits of the data!

+ +
+
Key Concepts
+
    +
  • Splits data into K folds
  • +
  • Trains K times, each with different test fold
  • +
  • Averages results for robust estimate
  • +
  • Reduces variance in performance estimate
  • +
+
+ +

The Problem with Simple Train-Test Split

+

With a single 80-20 split:

    -
  • Union (A โˆช B): A OR B (either event occurs)
  • -
  • Intersection (A โˆฉ B): A AND B (both events occur)
  • -
  • Complement (A'): NOT A (event doesn't occur)
  • -
  • Mutually Exclusive: A โˆฉ B = โˆ… (can't both occur)
  • +
  • Performance depends on which data you randomly picked
  • +
  • Might get lucky/unlucky with the split
  • +
  • 20% of data wasted (not used for training)
  • +
  • One number doesn't tell you about variance
-
- -
-

๐Ÿ“ Worked Example - Step by Step

- -
-

Problem:

-

In a class of 40 students: 25 like Math, 20 like Science, 10 like both. Find: a) P(Math OR Science), b) P(only Math), c) P(neither)

-
- -
-

Solution:

- -
-
Step 1:
-
-

Set Up the Information

-
- Total students: n = 40
- P(Math) = 25/40 = 0.625
- P(Science) = 20/40 = 0.5
- P(Math โˆฉ Science) = 10/40 = 0.25 -
-

Convert all counts to probabilities

-
-
- -
-
Step 2:
-
-

Find P(Math โˆช Science) using Addition Rule

-
- Formula: P(A โˆช B) = P(A) + P(B) - P(A โˆฉ B)
- P(Math โˆช Science) = 0.625 + 0.5 - 0.25
- = 1.125 - 0.25
- = 0.875 -
-

We subtract the intersection to avoid double-counting

-
-
- -
-
Step 3:
-
-

Find P(only Math)

-
- Only Math = Math AND NOT Science
- Students in only Math = 25 - 10 = 15
- P(only Math) = 15/40 = 0.375 -
-

Subtract those who like both from total Math students

-
-
- -
-
Step 4:
-
-

Find P(neither)

-
- Neither = NOT (Math OR Science)
- P(neither) = 1 - P(Math โˆช Science)
- = 1 - 0.875
- = 0.125
- Or: 40 - 35 = 5 students, so 5/40 = 0.125 โœ“ -
-

Use complement rule or count directly

-
-
- -
- โœ“ Final Answer: - a) P(Math OR Science) = 0.875 (87.5%)
b) P(only Math) = 0.375 (37.5%)
c) P(neither) = 0.125 (12.5%)
-
- -
- Verification: -

Check: 0.375 (only Math) + 0.25 (both) + 0.25 (only Science) + 0.125 (neither) = 1.0 โœ“

+
+
โš ๏ธ Single Split Problem
+
+ You test once and get 85% accuracy. Is that good? Or did you just get lucky with an easy test set? Without multiple tests, you don't know!
- -
-

๐Ÿ’ช Try These:

-
    -
  1. P(A)=0.6, P(B)=0.5, P(AโˆฉB)=0.3. Find P(AโˆชB)
  2. -
  3. If P(AโˆชB)=0.8, P(A)=0.5, P(B)=0.4, find P(AโˆฉB)
  4. -
  5. 100 students: 60 like pizza, 40 like burgers, 20 like both. How many like neither?
  6. -
- - -
-

๐ŸŽฏ Key Takeaways

+

Choosing K

    -
  • Union (โˆช): OR operation
  • -
  • Intersection (โˆฉ): AND operation
  • -
  • Complement ('): NOT operation
  • -
  • Venn diagrams visualize set relationships
  • +
  • K=5: Most common, good balance
  • +
  • K=10: More reliable, standard in research
  • +
  • K=n (Leave-One-Out): Maximum data usage, but expensive
  • +
  • Larger K: More computation, less bias, more variance
  • +
  • Smaller K: Less computation, more bias, less variance
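The fold mechanics are easy to sketch from scratch. A minimal K-fold loop in plain Python; the "model" here is a majority-class predictor purely for illustration (real use would fit an actual estimator, e.g. via scikit-learn's `cross_val_score`):

```python
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(labels, k=5):
    """Train k times; each fold is the test set exactly once, then average."""
    scores = []
    for fold in k_fold_indices(len(labels), k):
        held_out = set(fold)
        train = [y for i, y in enumerate(labels) if i not in held_out]
        majority = max(set(train), key=train.count)                 # "training" step
        accuracy = sum(labels[i] == majority for i in fold) / len(fold)
        scores.append(accuracy)
    return statistics.mean(scores), statistics.pstdev(scores)

labels = [0] * 80 + [1] * 20
mean_acc, std_acc = cross_validate(labels, k=5)   # robust estimate plus its spread
```

Reporting the mean together with the standard deviation across folds is what gives the "one number plus variance" picture a single split cannot.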
-
- - - -
-
- Topic 16 -

๐Ÿ”€ Conditional Probability

-

Probability given that something else happened

-
-
-

Introduction

-

What is it? Conditional probability is the probability of event A occurring given that event B has already occurred.

-
+

Stratified K-Fold

+

For classification with imbalanced classes, use stratified K-fold to maintain class proportions in each fold!

-
-

Formula

-
-
Conditional Probability
-
P(A|B) = P(A and B) / P(B)
-

Read as: "Probability of A given B"

+
+
๐Ÿ’ก Example
+
+ Dataset: 80% class 0, 20% class 1
+
+ Regular K-fold: One fold might have 90% class 0, another 70%
+ Stratified K-fold: Every fold has 80% class 0, 20% class 1 โœ“ +
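A stratified split can be sketched by dealing each class's indices round-robin across the folds. This toy version (not scikit-learn's `StratifiedKFold`, which is the usual tool) reproduces the 80/20 proportions from the example above in every fold:

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Deal each class's indices across k folds so every fold keeps
    (approximately) the overall class proportions."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    return folds

labels = [0] * 80 + [1] * 20                 # 80% class 0, 20% class 1
folds = stratified_folds(labels, 5)
ratios = [sum(labels[i] for i in f) / len(f) for f in folds]   # 0.2 in every fold
```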
-
-
-
๐Ÿ“Š EXAMPLE
-

Drawing cards: P(King | Red card) = ?

-

P(Red card) = 26/52

-

P(King and Red) = 2/52

-

P(King | Red) = (2/52) / (26/52) = 2/26 = 1/13

-
+

Leave-One-Out Cross-Validation (LOOCV)

+

Special case where K = n (number of samples):

+
    +
  • Each sample is test set once
  • +
  • Train on n-1 samples, test on 1
  • +
  • Repeat n times
  • +
  • Maximum use of training data
  • +
  • Very expensive for large datasets
  • +
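LOOCV is short enough to write out directly. A minimal sketch where the "model" is just the mean of the remaining n-1 values, an illustrative stand-in for a real regressor:

```python
def loocv_mse(values):
    """Leave-one-out: each sample is the test set once; train on the rest."""
    n = len(values)
    total = 0.0
    for i in range(n):
        train = values[:i] + values[i + 1:]
        prediction = sum(train) / (n - 1)        # "fit" = take the training mean
        total += (values[i] - prediction) ** 2   # test on the single held-out sample
    return total / n

print(loocv_mse([10, 20, 30, 40, 50]))   # 312.5
```

The loop body runs n times, which is exactly why LOOCV gets expensive on large datasets.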
-
-

๐ŸŽฏ Key Takeaways

+

Benefits of Cross-Validation

    -
  • P(A|B) = probability of A given B occurred
  • -
  • Formula: P(A|B) = P(A and B) / P(B)
  • -
  • Critical for Bayes' Theorem
  • -
  • Used in machine learning and diagnostics
  • +
  • โœ“ More reliable performance estimate
  • +
  • โœ“ Uses all data for both training and testing
  • +
  • โœ“ Reduces variance in estimate
  • +
  • โœ“ Detects overfitting (high variance across folds)
  • +
  • โœ“ Better for small datasets
-
-
- - -
-
- Topic 17 -

๐ŸŽฏ Independence

-

When events don't affect each other

-
-
-

Introduction

-

What is it? Two events are independent if the occurrence of one doesn't affect the probability of the other.

-
+

Drawbacks

+
    +
  • โœ— Computationally expensive (train K times)
  • +
  • โœ— Not suitable for time series (can't shuffle)
  • +
  • โœ— Still need final train-test split for final model
  • +
-
-

Test for Independence

-
-
Events A and B are independent if:
-
P(A|B) = P(A)
-

OR equivalently:

-
P(A and B) = P(A) ร— P(B)
+
+
โœ… Best Practice
+
+ 1. Use cross-validation to evaluate models and tune hyperparameters
+ 2. Once you pick the best model, train on ALL training data
+ 3. Test once on held-out test set for final unbiased estimate
+
+ Never use test set during cross-validation! +
+
+ +
+
+

๐Ÿ” Unsupervised - Preprocessing Data Preprocessing

+ +
+
+

Raw data is messy! Data preprocessing cleans and transforms data into a format that machine learning algorithms can use effectively.

+ +
+
Key Steps
+
    +
  • Handle missing values
  • +
  • Encode categorical variables
  • +
  • Scale/normalize features
  • +
  • Split data properly
  • +
+
+ +

1. Handling Missing Values

+

Real-world data often has missing values. We can't just ignore them!

-
-

Examples

+

Strategies:

    -
  • Independent: Coin flips, die rolls with replacement
  • -
  • Dependent: Drawing cards without replacement, weather on consecutive days
  • +
  • Drop rows: If only few values missing (<5%)
  • +
  • Mean imputation: Replace with column mean (numerical)
  • +
  • Median imputation: Replace with median (robust to outliers)
  • +
  • Mode imputation: Replace with most frequent (categorical)
  • +
  • Forward/backward fill: Use previous/next value (time series)
  • +
  • Predictive imputation: Train model to predict missing values
-
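The first strategies can be sketched with the standard library alone (in practice you would reach for pandas' `fillna` or scikit-learn's `SimpleImputer`); the toy columns here are made up for illustration:

```python
import statistics

ages = [25, 31, None, 40, None, 28]            # numerical column with gaps
cities = ["NY", None, "LA", "NY", "NY", None]  # categorical column with gaps

# Mean imputation (numerical)
mean_age = statistics.mean(a for a in ages if a is not None)        # 31
ages_filled = [a if a is not None else mean_age for a in ages]

# Mode imputation (categorical)
mode_city = statistics.mode(c for c in cities if c is not None)     # "NY"
cities_filled = [c if c is not None else mode_city for c in cities]
```

Swapping `statistics.mean` for `statistics.median` gives the outlier-robust variant.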
- -
-

๐Ÿ“ Worked Example - Step by Step

- -
-

Problem:

-

Two dice are rolled. Let A = "first die shows 6" and B = "sum is 7". Are A and B independent?

-
- -
-

Solution:

- -
-
Step 1:
-
-

Find P(A)

-
- First die shows 6: one outcome out of 6
- P(A) = 1/6 โ‰ˆ 0.167 -
-

Probability the first die is 6

-
-
- -
-
Step 2:
-
-

Find P(B)

-
- Sum equals 7: (1,6), (2,5), (3,4), (4,3), (5,2), (6,1)
- 6 favorable outcomes out of 36 total
- P(B) = 6/36 = 1/6 โ‰ˆ 0.167 -
-

Count all ways to get sum of 7

-
-
- -
-
Step 3:
-
-

Find P(A โˆฉ B)

-
- First die is 6 AND sum is 7
- Only possibility: (6,1)
- P(A โˆฉ B) = 1/36 โ‰ˆ 0.028 -
-

Find where both events occur simultaneously

-
-
- -
-
Step 4:
-
-

Test Independence

-
- If independent: P(A โˆฉ B) = P(A) ร— P(B)
- P(A) ร— P(B) = (1/6) ร— (1/6) = 1/36
- P(A โˆฉ B) = 1/36
- 1/36 = 1/36 โœ“ EQUAL! -
-

Compare the two probabilities to test independence

-
-
- -
-
Step 5:
-
-

Conclusion

-
- Events A and B ARE independent
- Knowing first die is 6 doesn't change probability of sum being 7 -
-

When the product rule holds, events are independent

-
-
- -
- โœ“ Final Answer: - YES, events are independent. P(AโˆฉB) = P(A)ร—P(B) = 1/36 -
- -
- Check: -

We can also verify: P(B|A) = P(AโˆฉB)/P(A) = (1/36)/(1/6) = 1/6 = P(B). Since P(B|A) = P(B), the events are independent.

+
+
โš ๏ธ Warning
+
+ Never drop columns with many missing values without investigation! The missingness itself might be informative (e.g., income not reported might correlate with high income).
- -
-

๐Ÿ’ช Try These:

-
    -
  1. P(A)=0.3, P(B)=0.4, P(AโˆฉB)=0.12. Independent?
  2. -
  3. Coin flip: P(Heads) and P(Tails). Independent?
  4. -
  5. Drawing two cards without replacement. Independent?
  6. -
- - -
-

๐ŸŽฏ Key Takeaways

+

3. Feature Scaling

+

Different features have different scales. Age (0-100) vs Income ($0-$1M). This causes problems!

+ +

Why Scale?

    -
  • Independent events don't affect each other
  • -
  • Test: P(A and B) = P(A) ร— P(B)
  • -
  • With replacement โ†’ independent
  • -
  • Without replacement โ†’ dependent
  • +
  • Gradient descent converges faster
  • +
  • Distance-based algorithms (KNN, SVM) need it
  • +
  • Regularization treats features equally
  • +
  • Neural networks train better
-
-
- - -
-
- Topic 18 -

๐Ÿงฎ Bayes' Theorem

-

Updating probabilities with new evidence

-
-
-

Introduction

-

What is it? Bayes' Theorem shows how to update probability based on new information.

-

Why it matters: Used in medical diagnosis, spam filters, machine learning, and countless applications.

-
+

StandardScaler (Z-score normalization)

+
+ Formula: + z = (x - μ) / σ +
where:
μ = mean of feature
σ = standard deviation
Result: mean=0, std=1
+
-
-

The Formula

-
-
Bayes' Theorem
-
P(A|B) = [P(B|A) ร— P(A)] / P(B)
-
    -
  • P(A|B) = posterior probability
  • -
  • P(B|A) = likelihood
  • -
  • P(A) = prior probability
  • -
  • P(B) = marginal probability
  • -
+

Example: [10, 20, 30, 40, 50]

+

ฮผ = 30, ฯƒ = 15.81

+

Scaled: [-1.26, -0.63, 0, 0.63, 1.26]

+ +

MinMaxScaler

+
+ Formula: + x' = (x - min) / (max - min) +
Result: range [0, 1]
-
-
-
๐Ÿ“Š MEDICAL DIAGNOSIS EXAMPLE
-

Disease affects 1% of population. Test is 95% accurate.

-

You test positive. What's probability you have disease?

-
-

P(Disease) = 0.01

-

P(Positive|Disease) = 0.95

-

P(Positive|No Disease) = 0.05

-

P(Positive) = 0.01ร—0.95 + 0.99ร—0.05 = 0.059

-

P(Disease|Positive) = (0.95ร—0.01)/0.059 = 0.161

-

Only 16.1% chance you have the disease!

+

Example: [10, 20, 30, 40, 50]

+

Scaled: [0, 0.25, 0.5, 0.75, 1.0]
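Both scalers on the running example [10, 20, 30, 40, 50], in plain Python. One subtlety worth flagging: the ±1.26 z-scores quoted above come from the sample standard deviation (15.81); scikit-learn's StandardScaler uses the population standard deviation (≈14.14), which gives ±1.41 instead:

```python
data = [10, 20, 30, 40, 50]

# StandardScaler-style z-scores (population std, as scikit-learn computes it)
mu = sum(data) / len(data)                                      # 30
sigma = (sum((x - mu) ** 2 for x in data) / len(data)) ** 0.5   # ~14.14
z_scores = [round((x - mu) / sigma, 2) for x in data]

# MinMaxScaler-style rescaling to [0, 1]
low, high = min(data), max(data)
min_max = [(x - low) / (high - low) for x in data]

print(z_scores)   # [-1.41, -0.71, 0.0, 0.71, 1.41]
print(min_max)    # [0.0, 0.25, 0.5, 0.75, 1.0]
```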

+ +
+
+ +
+

Figure: Feature distributions before and after scaling

-
- -
-

๐Ÿ“ Worked Example - Step by Step

- -
-

Problem:

-

A disease affects 1% of the population. A test is 99% accurate (detects 99% of sick people and correctly identifies 99% of healthy people). You test positive. What's the probability you actually have the disease?

+

Critical: fit_transform vs transform

+

This is where many beginners make mistakes!

+ +
+ fit_transform():
+ 1. Learns parameters (ฮผ, ฯƒ, min, max) from data
+ 2. Transforms the data
+ Use on: Training data ONLY
+
+ transform():
+ 1. Uses already-learned parameters
+ 2. Transforms the data
+ Use on: Test data, new data
- -
-

Solution:

- -
-
Step 1:
-
-

Define the Events and Given Information

-
- Let A = has disease
- Let B = tests positive
- P(A) = 0.01 (1% of population has disease)
- P(B|A) = 0.99 (99% true positive rate)
- P(B|A') = 0.01 (1% false positive rate) -
-

Set up all known probabilities before applying Bayes' Theorem

-
-
- -
-
Step 2:
-
-

Calculate P(B) using Total Probability

-
- P(B) = P(B|A) ร— P(A) + P(B|A') ร— P(A')
- P(B) = (0.99 ร— 0.01) + (0.01 ร— 0.99)
- P(B) = 0.0099 + 0.0099 = 0.0198 -
-

Find the overall probability of testing positive

-
-
- -
-
Step 3:
-
-

Apply Bayes' Theorem

-
- P(A|B) = [P(B|A) ร— P(A)] / P(B)
- P(A|B) = (0.99 ร— 0.01) / 0.0198
- P(A|B) = 0.0099 / 0.0198
- P(A|B) = 0.5 = 50% -
-

This is the posterior probability - what we want to find!

-
-
- -
- Final Answer: - Only 50% chance you have the disease despite testing positive! + +
+
โš ๏ธ DATA LEAKAGE!
+
+ WRONG:
+ scaler.fit(test_data) # Learns from test data!
+
+ CORRECT:
+ scaler.fit(train_data) # Learn from train only
+ train_scaled = scaler.transform(train_data)
+ test_scaled = scaler.transform(test_data)
+
+ If you fit on test data, you're "peeking" at the answers!
- -
- โœ“ Why So Low? -

This counter-intuitive result occurs because the disease is so rare (1%). Even with a 99% accurate test, there are many more false positives from the healthy 99% than true positives from the sick 1%. Base rates matter!

+
+ +

4. Train-Test Split

+

Always split data BEFORE any preprocessing that learns parameters!

+ +
+ Correct Order:
+ 1. Split data โ†’ train (80%), test (20%)
+ 2. Handle missing values (fit on train)
+ 3. Encode categories (fit on train)
+ 4. Scale features (fit on train)
+ 5. Train model
+ 6. Test model (using same transformations) +
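The correct order can be demonstrated with a minimal scaler class that mimics the fit/transform split (a stand-in for scikit-learn's StandardScaler, with made-up numbers):

```python
class MiniScaler:
    """Learns mu/sigma in fit(); transform() only reuses them."""

    def fit(self, xs):
        self.mu = sum(xs) / len(xs)
        self.sigma = (sum((x - self.mu) ** 2 for x in xs) / len(xs)) ** 0.5
        return self

    def transform(self, xs):
        return [(x - self.mu) / self.sigma for x in xs]

    def fit_transform(self, xs):
        return self.fit(xs).transform(xs)

data = [float(x) for x in range(1, 11)]
train, test = data[:8], data[8:]                 # 1. split FIRST
scaler = MiniScaler()
train_scaled = scaler.fit_transform(train)       # 2. learn mu/sigma from train only
test_scaled = scaler.transform(test)             # 3. reuse them on test: no leakage
```

Calling `fit` (or `fit_transform`) on `test` anywhere in this flow would be exactly the data leakage warned about above.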
+ +

Complete Pipeline Example

+
+
+
+

Figure: Complete preprocessing pipeline

- -
-

๐Ÿ’ช Try These:

-
    -
  1. What if the disease affects 10% of the population instead? Recalculate P(A|B)
  2. -
  3. If the test was 95% accurate instead of 99%, what would P(A|B) be?
  4. -
- - +
-
-

๐ŸŽฏ Key Takeaways

-
    -
  • Updates probability based on new evidence
  • -
  • P(A|B) = [P(B|A) ร— P(A)] / P(B)
  • -
  • Critical for medical testing and machine learning
  • -
  • Counter-intuitive results common (base rate matters!)
  • -
-
-
- - -
-
- Topic 19 -

๐Ÿ“Š Probability Mass Function (PMF)

-

Probabilities for discrete random variables

-
+
+
+

12. Loss Functions

+ +
+
+

Loss functions measure how wrong our predictions are. Different problems need different loss functions! The choice dramatically affects what your model learns.

+ +
+
Key Concepts
+
    +
  • Loss = how wrong a single prediction is
  • +
  • Cost = average loss over all samples
  • +
  • Regression: MSE, MAE, RMSE
  • +
  • Classification: Log Loss, Hinge Loss
  • +
+
-
-

Introduction

-

What is it? PMF gives the probability that a discrete random variable equals a specific value.

-

Why it matters: Used for countable outcomes like dice rolls, coin flips, or number of defects.

-
+

Loss Functions for Regression

+ +

Mean Squared Error (MSE)

+
+ Formula: + MSE = (1/n) × Σ(y - ŷ)² +
where:
y = actual value
ŷ = predicted value
n = number of samples
+
-
-

Properties

+
Characteristics:
    -
  • 0 โ‰ค P(X = x) โ‰ค 1 for all x
  • -
  • Sum of all probabilities = 1
  • -
  • Only defined for discrete variables
  • -
  • Visualized with bar charts
  • +
  • Squares errors: Penalizes large errors heavily
  • +
  • Always positive: Minimum is 0 (perfect predictions)
  • +
  • Differentiable: Great for gradient descent
  • +
  • Sensitive to outliers: One huge error dominates
  • +
  • Units: Squared units (harder to interpret)
-
-
-
๐Ÿ“Š EXAMPLE: Die Roll
-

P(X = 1) = 1/6

-

P(X = 2) = 1/6

-

... and so on

-

Sum = 6 ร— (1/6) = 1 โœ“

-
+

Example: Predictions [12, 19, 32], Actual [10, 20, 30]

+

Errors: [2, -1, 2]

+

Squared: [4, 1, 4]

+

MSE = (4 + 1 + 4) / 3 = 3.0

+ +

Mean Absolute Error (MAE)

+
+ Formula: + MAE = (1/n) × Σ|y - ŷ| +
Absolute value of errors +
-
-

๐ŸŽฏ Key Takeaways

+
Characteristics:
    -
  • PMF is for discrete random variables
  • -
  • Gives P(X = specific value)
  • -
  • All probabilities sum to 1
  • -
  • Visualized with bar charts
  • +
  • Linear penalty: All errors weighted equally
  • +
  • Robust to outliers: One huge error doesn't dominate
  • +
  • Interpretable units: Same units as target
  • +
  • Not differentiable at 0: Slightly harder to optimize
-
-
- - -
-
- Topic 20 -

๐Ÿ“ˆ Probability Density Function (PDF)

-

Probabilities for continuous random variables

-
-
-

Introduction

-

What is it? PDF describes probability for continuous random variables. Probability at exact point is 0; we calculate probability over intervals.

-
+

Example: Predictions [12, 19, 32], Actual [10, 20, 30]

+

Errors: [2, -1, 2]

+

Absolute: [2, 1, 2]

+

MAE = (2 + 1 + 2) / 3 = 1.67

+ +

Root Mean Squared Error (RMSE)

+
+ Formula: + RMSE = √MSE +
Square root of MSE +
-
-

Key Differences from PMF

+
Characteristics:
    -
  • For continuous (not discrete) variables
  • -
  • P(X = exact value) = 0
  • -
  • Calculate P(a < X < b) = area under curve
  • -
  • Total area under curve = 1
  • +
  • Same units as target: More interpretable than MSE
  • +
  • Still sensitive to outliers: But less than MSE
  • +
  • Common in competitions: Kaggle, etc.
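All three regression losses computed on the same example used above (predictions [12, 19, 32] vs actuals [10, 20, 30]):

```python
y_true = [10, 20, 30]
y_pred = [12, 19, 32]
n = len(y_true)

errors = [p - t for p, t in zip(y_pred, y_true)]   # [2, -1, 2]
mse = sum(e ** 2 for e in errors) / n              # (4 + 1 + 4) / 3 = 3.0
mae = sum(abs(e) for e in errors) / n              # (2 + 1 + 2) / 3 ~ 1.67
rmse = mse ** 0.5                                  # sqrt(3) ~ 1.73
```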
-
- -
-

๐Ÿ“ Worked Example - Step by Step

- -
-

Problem:

-

Continuous random variable X has uniform distribution on interval [0, 10]. a) Find the PDF f(x), b) Calculate P(3 โ‰ค X โ‰ค 7)

-
- -
-

Solution:

- -
-
Step 1:
-
-

Understand Uniform Distribution

-
- X is equally likely anywhere between 0 and 10
- For uniform on [a, b], PDF is constant
- Total area under curve must equal 1 -
-

Uniform means constant probability density across the interval

-
-
- -
-
Step 2:
-
-

Find PDF Height

-
- Interval length = b - a = 10 - 0 = 10
- For area = 1: height ร— width = 1
- height ร— 10 = 1
- height = 1/10 = 0.1
- Therefore: f(x) = 0.1 for 0 โ‰ค x โ‰ค 10, and 0 otherwise -
-

The constant height must give total area of 1

-
-
- -
-
Step 3:
-
-

Calculate P(3 โ‰ค X โ‰ค 7)

-
- For continuous uniform: P(a โ‰ค X โ‰ค b) = (b-a) ร— height
- P(3 โ‰ค X โ‰ค 7) = (7-3) ร— 0.1
- = 4 ร— 0.1
- = 0.4 -
-

Probability is the area of the rectangle

-
-
- -
-
Step 4:
-
-

Visualize (Area Under Curve)

-
- Rectangle: width = 4, height = 0.1
- Area = 4 ร— 0.1 = 0.4
- This represents probability -
-

The geometric area equals the probability

-
-
- -
- โœ“ Final Answer: - a) f(x) = 0.1 for x โˆˆ [0,10]
b) P(3 โ‰ค X โ‰ค 7) = 0.4 (40%)
-
- -
- Verification: -

P(0 โ‰ค X โ‰ค 10) = 10 ร— 0.1 = 1.0 โœ“ (total probability = 1)

+
+
+
+

Figure: Comparing MSE, MAE, and their response to errors

- -
-

๐Ÿ’ช Try These:

-
    -
  1. Uniform on [5,15]. Find PDF.
  2. -
  3. For above, find P(8 โ‰ค X โ‰ค 12)
  4. -
  5. Why is P(X = 7) = 0 for continuous distributions?
  6. -
- - + +

Loss Functions for Classification

+ +

Log Loss (Cross-Entropy)

+
+ Binary Cross-Entropy: + Loss = -(1/n) × Σ[y·log(ŷ) + (1-y)·log(1-ŷ)] +
where:
y ∈ {0, 1} = actual label
ŷ ∈ (0, 1) = predicted probability
-
-
-

๐ŸŽฏ Key Takeaways

+
Characteristics:
    -
  • PDF is for continuous random variables
  • -
  • Probability = area under curve
  • -
  • P(X = exact point) = 0
  • -
  • Total area under PDF = 1
  • +
  • For probabilities: Output must be [0, 1]
  • +
  • Heavily penalizes confident wrong predictions: Good!
  • +
  • Convex: No local minima, easy to optimize
  • +
  • Probabilistic interpretation: Maximum likelihood
-
-
- - -
-
- Topic 21 -

๐Ÿ“‰ Cumulative Distribution Function (CDF)

-

Probability up to a value

-
-
-

Introduction

-

What is it? CDF gives the probability that X is less than or equal to a specific value.

-

Formula: F(x) = P(X โ‰ค x)

-
+

Example: y=1 (spam), predicted p=0.9

+

Loss = -[1·log(0.9) + 0·log(0.1)] = -log(0.9) = 0.105 (low, good!)

+ +

Example: y=1 (spam), predicted p=0.1

+

Loss = -[1·log(0.1) + 0·log(0.9)] = -log(0.1) = 2.303 (high, bad!)
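The two examples check out numerically; a minimal sketch of binary cross-entropy for a single prediction:

```python
import math

def binary_cross_entropy(y, p):
    """Log loss for one prediction: y in {0, 1}, p = predicted P(y=1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(round(binary_cross_entropy(1, 0.9), 3))   # 0.105 -> confident and right
print(round(binary_cross_entropy(1, 0.1), 3))   # 2.303 -> confident and wrong
```

Note how the loss explodes as a confident prediction moves to the wrong side; that steep penalty is the point.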

+ +

Hinge Loss (for SVM)

+
+ Formula: + Loss = max(0, 1 - y·score) +
where:
y ∈ {-1, +1}
score = w·x + b
+
-
-

Properties

+
Characteristics:
    -
  • Always non-decreasing
  • -
  • F(-โˆž) = 0
  • -
  • F(+โˆž) = 1
  • -
  • P(a < X โ‰ค b) = F(b) - F(a)
  • +
  • Margin-based: Encourages confident predictions
  • +
  • Zero loss for correct & confident: When yยทscore โ‰ฅ 1
  • +
  • Linear penalty: For violations
  • +
  • Used in SVM: Maximizes margin
-
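A minimal hinge-loss sketch; the scores are made-up raw margins standing in for w·x + b:

```python
def hinge_loss(y, score):
    """y in {-1, +1}; zero loss once y * score >= 1 (outside the margin)."""
    return max(0.0, 1.0 - y * score)

print(hinge_loss(+1, 2.0))   # 0.0 -> correct and confident: no loss
print(hinge_loss(+1, 0.5))   # 0.5 -> correct but inside the margin
print(hinge_loss(-1, 0.5))   # 1.5 -> wrong side: linear penalty
```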
- -
-

๐Ÿ“ Worked Example - Step by Step

- -
-

Problem:

-

For the uniform distribution from Topic 20 (X ~ Uniform[0,10]), find: a) F(5) = P(X โ‰ค 5), b) F(12), c) P(2 < X โ‰ค 8)

+

When to Use Which Loss?

+ +
+
Regression Problems
+
    +
  • + MSE: Default choice, smooth optimization, use when outliers are errors +
  • +
  • + MAE: When you have outliers that are valid data points +
  • +
  • + RMSE: When you need interpretable metric in original units +
  • +
  • + Huber Loss: Combines MSE and MAE - best of both worlds! +
  • +
- -
-

Solution:

- -
-
Step 1:
-
-

Recall PDF

-
- f(x) = 0.1 for 0 โ‰ค x โ‰ค 10
- CDF is cumulative (area from left up to x) -
-

CDF accumulates probability from the left

-
-
- -
-
Step 2:
-
-

Find F(5)

-
- F(5) = P(X โ‰ค 5)
- Area from 0 to 5: width = 5, height = 0.1
- F(5) = 5 ร— 0.1 = 0.5 -
-

Half of the distribution is below x = 5

-
-
- -
-
Step 3:
-
-

Find F(12)

-
- F(12) = P(X โ‰ค 12)
- But X can't exceed 10
- All probability is accounted for by x = 10
- F(12) = 1.0 (certainty) -
-

CDF plateaus at 1 beyond the support of the distribution

-
-
- -
-
Step 4:
-
-

Find P(2 < X โ‰ค 8)

-
- Using CDF: P(a < X โ‰ค b) = F(b) - F(a)
- F(8) = 8 ร— 0.1 = 0.8
- F(2) = 2 ร— 0.1 = 0.2
- P(2 < X โ‰ค 8) = 0.8 - 0.2 = 0.6 -
-

Subtract lower CDF from upper CDF

-
-
- -
-
Step 5:
-
-

General CDF Formula

-
- For uniform [0, 10]:
- โ€ข F(x) = 0 if x < 0
- โ€ข F(x) = x/10 if 0 โ‰ค x โ‰ค 10
- โ€ข F(x) = 1 if x > 10 -
-

The complete CDF function has three pieces

-
-
- -
- โœ“ Final Answer: - a) F(5) = 0.5
b) F(12) = 1.0
c) P(2 < X โ‰ค 8) = 0.6
+ +
+
Classification Problems
+
    +
  • + Log Loss: Default for binary/multi-class, when you need probabilities +
  • +
  • + Hinge Loss: For SVM, when you want maximum margin +
  • +
  • + Focal Loss: For highly imbalanced datasets +
  • +
+
+ +

Visualizing Loss Curves

+
+
+
- -
- Check: -

F(0) = 0 (no probability below 0), F(10) = 1 (all probability by 10), F is non-decreasing โœ“

+

Figure: How different losses respond to errors

+
+ +
+
๐Ÿ’ก Impact of Outliers
+
+ Imagine predictions [100, 102, 98, 150] for actuals [100, 100, 100, 100]:
+
+ MSE: (0 + 4 + 4 + 2500) / 4 = 627 ← Dominated by outlier!
+ MAE: (0 + 2 + 2 + 50) / 4 = 13.5 ← More balanced
+
+ MSE is about 46× larger because it squares the huge error!
- -
-

๐Ÿ’ช Try These:

-
    -
  1. For uniform [5,15], find F(10)
  2. -
  3. What is P(X > 7) using the CDF?
  4. -
  5. If F(x) = 0.75, what does this mean?
  6. -
- - -
-

๐ŸŽฏ Key Takeaways

-
    -
  • CDF: F(x) = P(X โ‰ค x)
  • -
  • Works for both discrete and continuous
  • -
  • Always increases from 0 to 1
  • -
  • Useful for finding percentiles
  • -
-
-
- - -
-
- Topic 22 -

๐Ÿช™ Bernoulli Distribution

-

Single trial with two outcomes

+
-
-

Introduction

-

What is it? Models a single trial with two outcomes: success (1) or failure (0).

-

Examples: Coin flip, pass/fail test, yes/no question

-
+ +
+
+

13. Finding Optimal K in KNN

+ +
+
+

Choosing the right K value is critical for KNN performance! Too small causes overfitting, too large causes underfitting. Let's explore systematic methods to find the optimal K.

+ +
+
Key Methods
+
    +
  • Elbow Method: Plot accuracy vs K, find the "elbow"
  • +
  • Cross-Validation: Test multiple K values with k-fold CV
  • +
  • Grid Search: Systematically test K values
  • +
  • Avoid K=1 (overfits) and K=n (underfits)
  • +
+
-
-

Formula

-
-
Bernoulli PMF
-
P(X = 1) = p
-
P(X = 0) = 1 - p = q
-

Mean = p, Variance = p(1-p)

+

Method 1: Elbow Method

+

Test different K values and plot performance. Look for the "elbow" where adding more neighbors doesn't help much.

+ +
+
+ +
+

Figure 1: Elbow curve showing optimal K at the bend

-
- -
-

๐Ÿ“ Worked Example - Step by Step

- -
-

Problem:

-

Flip a fair coin once. Let X = 1 if Heads, X = 0 if Tails. a) Find P(X=1) and P(X=0), b) Calculate E(X) and Var(X)

+

Method 2: Cross-Validation Approach

+

For each K value, run k-fold cross-validation and calculate mean accuracy. Choose K with highest mean accuracy.

+ +
+ Cross-Validation Process: + for K in [1, 2, 3, ..., 20]:
+   accuracies = []
+   for fold in [1, 2, 3]:
+     train model with K neighbors
+     test on validation fold
+     accuracies.append(accuracy)
+   mean_accuracy[K] = mean(accuracies)
+
+ optimal_K = argmax(mean_accuracy)
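The pseudocode above, made concrete on a tiny made-up 1-D dataset, with leave-one-out CV standing in for the folds (real use would go through scikit-learn's `cross_val_score` or `GridSearchCV`):

```python
from collections import Counter

X = [1.0, 1.2, 1.4, 2.0, 5.0, 5.2, 5.4, 6.0]   # toy 1-D features
y = [0,   0,   0,   0,   1,   1,   1,   1]

def knn_predict(x, train_X, train_y, k):
    """Majority vote among the k nearest training points."""
    nearest = sorted(range(len(train_X)), key=lambda i: abs(train_X[i] - x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def loo_accuracy(k):
    """Leave-one-out CV accuracy for a given K."""
    hits = 0
    for i in range(len(X)):
        train_X, train_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        hits += knn_predict(X[i], train_X, train_y, k) == y[i]
    return hits / len(X)

scores = {k: loo_accuracy(k) for k in [1, 3, 5, 7]}
best_k = max(scores, key=scores.get)
```

On this toy data K = 7 (i.e. K = n-1) scores 0: each held-out point leaves 3 same-class and 4 other-class neighbors, so the vote always goes the wrong way — a clean illustration of K-too-large underfitting.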
- -
-

Solution:

- -
-
Step 1:
-
-

Identify Bernoulli Trial

-
- Single trial with two outcomes (Success/Failure)
- Success = Heads, p = 0.5
- Failure = Tails, 1-p = 0.5 -
-

This is a classic Bernoulli trial

-
+ +
+
+
- -
-
Step 2:
-
-

Find Probabilities

-
- P(X = 1) = p = 0.5 (probability of heads)
- P(X = 0) = 1-p = 0.5 (probability of tails)
- Check: 0.5 + 0.5 = 1.0 โœ“ -
-

Probabilities must sum to 1

-
+

Figure 2: Cross-validation accuracies heatmap for different K values

+
+ +
+
โœ… Why Cross-Validation is Better
+
+ Single train-test split might be lucky/unlucky. Cross-validation gives you: +
    +
  • Mean accuracy (average performance)
  • +
  • Standard deviation (how stable is K?)
  • +
  • Confidence in your choice
  • +
- -
-
Step 3:
-
-

Calculate Expected Value

-
- Formula: E(X) = p
- E(X) = 0.5
- Or: E(X) = 0ร—P(X=0) + 1ร—P(X=1)
- = 0ร—0.5 + 1ร—0.5 = 0.5 โœ“ -
-

Expected value is the probability of success

-
+
+ +

Practical Guidelines

+
    +
  • Start with K = โˆšn: Good rule of thumb
  • +
  • Try odd K values: Avoids ties in binary classification
  • +
  • Test range [1, 20]: Covers most practical scenarios
  • +
  • Check for stability: Low std dev across folds
  • +
+ +
+
๐Ÿ’ก Real-World Example
+
+ Iris Dataset (150 samples):
+ √150 ≈ 12, so start testing around K=11, K=13, K=15
+ After CV: K=5 gives 96% ± 2% → Optimal choice!
+ K=1 gives 94% ± 8% → Too much variance
+ K=25 gives 88% ± 1% → Too smooth, underfitting
- -
-
Step 4:
-
-

Calculate Variance

-
- Formula: Var(X) = p(1-p)
- Var(X) = 0.5 ร— 0.5 = 0.25
- Standard deviation: ฯƒ = โˆš0.25 = 0.5 -
-

Variance measures spread of outcomes

-
+
+
+
+ + +
+
+

14. Hyperparameter Tuning with GridSearch

+ +
+
+

Hyperparameters control how your model learns. Unlike model parameters (learned from data), hyperparameters are set BEFORE training. GridSearch systematically finds the best combination!

+ +
+
Common Hyperparameters
+
    +
  • Learning rate (ฮฑ) - Gradient Descent step size
  • +
  • K - Number of neighbors in KNN
  • +
  • C, gamma - SVM parameters
  • +
  • Max depth - Decision Tree depth
  • +
  • Number of trees - Random Forest
  • +
+
+ +

GridSearch Explained

+

GridSearch tests ALL combinations of hyperparameters you specify. It's exhaustive but guarantees finding the best combination in your grid.

+ +
+ Example: SVM GridSearch + param_grid = {
+   'C': [0.1, 1, 10, 100],
+   'gamma': [0.001, 0.01, 0.1, 1],
+   'kernel': ['linear', 'rbf']
+ }
+
+ Total combinations: 4 ร— 4 ร— 2 = 32
+ With 5-fold CV: 32 ร— 5 = 160 model trainings! +
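Grid search is just an exhaustive loop over the Cartesian product of the grid. A from-scratch sketch with a dummy scoring function (a stand-in for "mean 5-fold CV accuracy"; real code would use scikit-learn's GridSearchCV):

```python
from itertools import product

param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1],
    "kernel": ["linear", "rbf"],
}

def cv_score(params):
    """Dummy scorer that pretends C=10, gamma=0.01, kernel='rbf' is best.
    A real version would train a model with 5-fold CV here."""
    return (0.7 + 0.1 * (params["C"] == 10)
                + 0.1 * (params["gamma"] == 0.01)
                + 0.05 * (params["kernel"] == "rbf"))

names = list(param_grid)
combos = [dict(zip(names, values)) for values in product(*param_grid.values())]
best = max(combos, key=cv_score)

print(len(combos))   # 32 combinations, matching the count above
print(best)
```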
+ +
+
+
- -
-
Step 5:
-
-

Interpret

-
- On average, we get 0.5 heads per flip
- Variance measures spread of 0 and 1 outcomes -
-

Expected value represents long-run average

+

Figure: GridSearch heatmap showing accuracy for C vs gamma combinations

+
+ +
+
+ +
+ +
- -
- โœ“ Final Answer: - a) P(X=1) = 0.5, P(X=0) = 0.5
b) E(X) = 0.5, Var(X) = 0.25
-
- -
- Check: -

For fair coin, p = 0.5 makes sense. Over many flips, we expect half heads (E(X) = 0.5).

+
+ +

Performance Surface (3D View)

+
+
+
+

Figure: 3D surface showing how parameters affect performance

- -
-

๐Ÿ’ช Try These:

-
    -
  1. Biased coin: P(Heads) = 0.7. Find E(X) and Var(X)
  2. -
  3. Free throw: 80% success rate. Model as Bernoulli
  4. -
  5. When is Var(X) maximized for Bernoulli?
  6. -
- - -
-

๐ŸŽฏ Key Takeaways

+

Best Practices

    -
  • Single trial, two outcomes (0 or 1)
  • -
  • Parameter: p (probability of success)
  • -
  • Mean = p, Variance = p(1-p)
  • -
  • Building block for binomial distribution
  • +
  • Start coarse: Wide range, few values (e.g., C: [0.1, 1, 10, 100])
  • +
  • Then refine: Narrow range around best (e.g., C: [5, 7, 9, 11])
  • +
  • Use cross-validation: Avoid overfitting to validation set
  • +
  • Log scale for wide ranges: [0.001, 0.01, 0.1, 1, 10, 100]
  • +
  • Consider computation time: More folds = more reliable but slower
- - - -
-
- Topic 23 -

๐ŸŽฐ Binomial Distribution

-

Multiple independent Bernoulli trials

-
+
-
-

Introduction

-

What is it? Models the number of successes in n independent Bernoulli trials.

-

Requirements: Fixed n, same p, independent trials, binary outcomes

-
+ +
+
+

๐Ÿ“Š Supervised - Classification Naive Bayes Classification

+ +
+
+

Naive Bayes is a probabilistic classifier based on Bayes' Theorem. Despite its "naive" independence assumption, it works surprisingly well for text classification and other tasks! We'll cover both Categorical and Gaussian Naive Bayes with complete mathematical solutions.

+ +
+
Key Concepts
+
    +
  • Based on Bayes' Theorem from probability theory
  • +
  • Assumes features are independent (naive assumption)
  • +
  • Very fast training and prediction
  • +
  • Works well with high-dimensional data
  • +
+
-
-

Formula

-
-
Binomial PMF
-
P(X = k) = C(n,k) ร— p^k ร— (1-p)^(n-k)
-

C(n,k) = n! / (k!(n-k)!)

-

Mean = np, Variance = np(1-p)

+

Bayes' Theorem

+
+ The Foundation: + P(Class|Features) = P(Features|Class) × P(Class) / P(Features)
+
+ Posterior = Likelihood × Prior / Evidence
+ (What we want) = (From data) × (Baseline) / (Normalizer)
-
-
-
๐Ÿ“Š EXAMPLE
-

Flip coin 10 times. P(exactly 6 heads)?

-

n=10, k=6, p=0.5

-

P(X=6) = C(10,6) ร— 0.5^6 ร— 0.5^4 = 210 ร— 0.000977 โ‰ˆ 0.205

-
+

The Naive Independence Assumption

+

"Naive" because we assume all features are independent given the class:

-
-

๐ŸŽฏ Key Takeaways

-
    -
  • n independent trials, probability p each
  • -
  • Counts number of successes
  • -
  • Mean = np, Variance = np(1-p)
  • -
  • Common in quality control and surveys
  • -
-
- - - -
-
- Topic 24 -

๐Ÿ”” Normal Distribution

-

The bell curve and 68-95-99.7 rule

-
+
+ Independence Assumption: + P(x₁, x₂, ..., xₙ | Class) = P(x₁|Class) × P(x₂|Class) × ... × P(xₙ|Class)
+
+ This is often NOT true in reality, but works anyway! +
-
-

Introduction

-

What is it? The most important continuous probability distributionโ€”symmetric, bell-shaped curve.

-

Why it matters: Many natural phenomena follow normal distribution. Foundation of inferential statistics.

-
+
+
+ +
+

Figure 1: Bayes' Theorem visual explanation

+
-
-

Properties

-
    -
  • Symmetric around mean ฮผ
  • -
  • Bell-shaped curve
  • -
  • Mean = Median = Mode
  • -
  • Defined by ฮผ (mean) and ฯƒ (standard deviation)
  • -
  • Total area under curve = 1
  • -
-
+

Real-World Example: Email Spam Detection

+

Let's classify an email with words: ["free", "winner", "click"]

+ +
+ Training Data:
+ โ€ข 300 spam emails (30%)
+ โ€ข 700 not-spam emails (70%)
+
+ Word frequencies:
+ P("free" | spam) = 0.8 (appears in 80% of spam)
+ P("free" | not-spam) = 0.1 (appears in 10% of not-spam)
+
+ P("winner" | spam) = 0.7
+ P("winner" | not-spam) = 0.05
+
+ P("click" | spam) = 0.6
+ P("click" | not-spam) = 0.2 +
-
-

The 68-95-99.7 Rule (Empirical Rule)

+
+
+ +
+

Figure 2: Spam classification calculation step-by-step

+
+ +

Step-by-Step Calculation

+
+
๐Ÿ“ง Classifying Our Email
+
+ P(spam | features):
+ = P("free"|spam) ร— P("winner"|spam) ร— P("click"|spam) ร— P(spam)
+ = 0.8 ร— 0.7 ร— 0.6 ร— 0.3
+ = 0.1008
+
+ P(not-spam | features):
+ = P("free"|not-spam) ร— P("winner"|not-spam) ร— P("click"|not-spam) ร— P(not-spam)
+ = 0.1 ร— 0.05 ร— 0.2 ร— 0.7
+ = 0.0007
+
+ Prediction: 0.1008 > 0.0007 โ†’ SPAM! ๐Ÿ“งโŒ +
+
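The arithmetic above is easy to verify in a few lines of Python. This is a minimal sketch using the illustrative word frequencies from the training-data box; the class scores are unnormalized (prior times product of word likelihoods), which is all we need to compare classes:

```python
# Naive Bayes spam score for an email containing "free", "winner", "click".
# Probabilities are the illustrative values from the example above.
p_word_given_spam = {"free": 0.8, "winner": 0.7, "click": 0.6}
p_word_given_ham = {"free": 0.1, "winner": 0.05, "click": 0.2}
p_spam, p_ham = 0.3, 0.7

words = ["free", "winner", "click"]

# Unnormalized class scores: prior times product of word likelihoods.
score_spam = p_spam
score_ham = p_ham
for w in words:
    score_spam *= p_word_given_spam[w]
    score_ham *= p_word_given_ham[w]

print(round(score_spam, 4))  # 0.1008
print(round(score_ham, 4))   # 0.0007
print("SPAM" if score_spam > score_ham else "NOT SPAM")  # SPAM
```

Dividing each score by their sum would give the normalized posteriors, but the prediction only depends on which score is larger.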
+ +

Why It Works Despite Wrong Assumption

    -
  • 68% of data within ฮผ ยฑ 1ฯƒ
  • -
  • 95% of data within ฮผ ยฑ 2ฯƒ
  • -
  • 99.7% of data within ฮผ ยฑ 3ฯƒ
  • +
  • Don't need exact probabilities: Just need correct ranking
  • +
  • Errors cancel out: Multiple features reduce impact
  • +
  • Simple is robust: Fewer parameters = less overfitting
  • +
  • Fast: Just multiply probabilities!
-
-
-
๐Ÿ’ก REAL-WORLD EXAMPLE
-

IQ scores: ฮผ = 100, ฯƒ = 15

-

68% of people have IQ between 85-115

-

95% have IQ between 70-130

-

99.7% have IQ between 55-145

-
+

Comparison with Other Classifiers

+ + + + + + + + + + + + + + + + + +
Aspect | Naive Bayes | Logistic Reg | SVM | KNN
Speed | Very Fast | Fast | Slow | Very Slow
Works with Little Data | Yes | Yes | No | No
Interpretable | Very | Yes | No | No
Handles Non-linear | Yes | No | Yes | Yes
High Dimensions | Excellent | Good | Good | Poor
- -
-

๐Ÿ“ Worked Example - Step by Step

+

๐ŸŽฏ PART A: Categorical Naive Bayes (Step-by-Step from PDF)

-
-

Problem:

-

IQ scores follow Normal distribution with ฮผ = 100, ฯƒ = 15. Find: a) P(IQ โ‰ค 115), b) P(85 โ‰ค IQ โ‰ค 115), c) IQ score at 95th percentile

-
+

Dataset: Tennis Play Prediction

+ + + + + + + + + + + + +
Outlook | Temperature | Play
Sunny | Hot | No
Sunny | Mild | No
Cloudy | Hot | Yes
Rainy | Mild | Yes
Rainy | Cool | Yes
Cloudy | Cool | Yes
-
-

Solution:

- -
-
Step 1:
-
-

Understand Normal Distribution

-
- Bell-shaped, symmetric around mean
- ฮผ = 100 (center)
- ฯƒ = 15 (spread) -
-

Parameters define the shape and location of the curve

-
-
- -
-
Step 2:
-
-

Find P(IQ โ‰ค 115) using z-score

-
- z = (x - ฮผ)/ฯƒ = (115 - 100)/15 = 15/15 = 1
- P(Z โ‰ค 1) = 0.8413 (from z-table)
- About 84.13% have IQ โ‰ค 115 -
-

Standardize to z-score, then use standard normal table

-
-
- -
-
Step 3:
-
-

Find P(85 โ‰ค IQ โ‰ค 115)

-
- Lower bound: zโ‚ = (85-100)/15 = -15/15 = -1
- Upper bound: zโ‚‚ = (115-100)/15 = 1
- This is ฮผ ยฑ 1ฯƒ (68-95-99.7 rule)
- P(-1 โ‰ค Z โ‰ค 1) = 0.68 (approximately 68%)
- Exact: P(Zโ‰ค1) - P(Zโ‰ค-1) = 0.8413 - 0.1587 = 0.6826 -
-

One standard deviation on each side covers 68% of data

-
-
- -
-
Step 4:
-
-

Find 95th Percentile

-
- P(IQ โ‰ค x) = 0.95
- From z-table: z = 1.645 for 95th percentile
- x = ฮผ + zฯƒ = 100 + 1.645ร—15
- = 100 + 24.675 = 124.675
- IQ โ‰ˆ 125 -
-

Convert z-score back to original scale using inverse formula

-
-
- -
- โœ“ Final Answer: - a) P(IQ โ‰ค 115) = 0.8413 (84.13%)
b) P(85 โ‰ค IQ โ‰ค 115) = 0.6826 (68.26%)
c) 95th percentile = IQ of 125
-
- -
- Verification: -

Using 68-95-99.7 rule: ฮผยฑ1ฯƒ contains 68% โœ“, ฮผยฑ2ฯƒ contains 95%, ฮผยฑ3ฯƒ contains 99.7%. Our answer matches the empirical rule!

-
-
+

Problem: Predict whether to play tennis when Outlook=Rainy and Temperature=Hot

+ +
+
STEP 1: Calculate Prior Probabilities
+
+ Count occurrences in training data:
+ โ€ข Play=Yes appears 4 times out of 6 total
+ โ€ข Play=No appears 2 times out of 6 total
+
+ Calculation:
+ P(Yes) = 4/6 = 0.667 (66.7%)
+ P(No) = 2/6 = 0.333 (33.3%) +
+
+ +
+
STEP 2: Calculate Conditional Probabilities (Before Smoothing)
+
+ For Outlook = "Rainy":
+ โ€ข Count (Rainy AND Yes) = 2 examples
+ โ€ข Count (Yes) = 4 total
+ โ€ข P(Rainy|Yes) = 2/4 = 0.5
+
+ โ€ข Count (Rainy AND No) = 0 examples โŒ
+ โ€ข Count (No) = 2 total
+ โ€ข P(Rainy|No) = 0/2 = 0 โš ๏ธ ZERO PROBABILITY PROBLEM!
+
+ For Temperature = "Hot":
+ โ€ข P(Hot|Yes) = 1/4 = 0.25
+ โ€ข P(Hot|No) = 1/2 = 0.5 +
+
+ +
+ STEP 3: Apply Bayes' Theorem (Initial)
+
+ P(Yes|Rainy,Hot) = P(Yes) ร— P(Rainy|Yes) ร— P(Hot|Yes)
+                    = 0.667 ร— 0.5 ร— 0.25
+                    = 0.0833
+
+ P(No|Rainy,Hot) = P(No) ร— P(Rainy|No) ร— P(Hot|No)
+                   = 0.333 ร— 0 ร— 0.5
+                   = 0 โŒ Problem! +
+ +
+
โš ๏ธ Zero Probability Problem
+
+ When P(Rainy|No) = 0, the entire probability becomes 0! This is unrealistic - just because we haven't seen "Rainy" with "No" in our training data doesn't mean it's impossible. We need Laplace Smoothing! +
+
+ +
+
STEP 4: Apply Laplace Smoothing (ฮฑ = 1)
+
+ Smoothed formula:
+ P(x|c) = (count(x,c) + ฮฑ) / (count(c) + ฮฑ ร— num_categories)
+
+ For Outlook (3 categories: Sunny, Cloudy, Rainy):
+ P(Rainy|Yes) = (2 + 1) / (4 + 1ร—3)
+               = 3/7
+               = 0.429 โœ“
+
+ P(Rainy|No) = (0 + 1) / (2 + 1ร—3)
+             = 1/5
+             = 0.2 โœ“ Fixed the zero!
+
+ For Temperature (3 categories: Hot, Mild, Cool):
+ P(Hot|Yes) = (1 + 1) / (4 + 1ร—3) = 2/7 = 0.286
+ P(Hot|No) = (1 + 1) / (2 + 1ร—3) = 2/5 = 0.4 +
+
+ +
+
STEP 5: Recalculate with Smoothing
+
+ P(Yes|Rainy,Hot):
+ = P(Yes) ร— P(Rainy|Yes) ร— P(Hot|Yes)
+ = 0.667 ร— 0.429 ร— 0.286
+ = 0.0818
+
+ P(No|Rainy,Hot):
+ = P(No) ร— P(Rainy|No) ร— P(Hot|No)
+ = 0.333 ร— 0.2 ร— 0.4
+ = 0.0266 +
+
+ +
+
STEP 6: Normalize to Get Final Probabilities
+
+ Sum of probabilities:
+ Sum = 0.0818 + 0.0266 = 0.1084
+
+ Normalize:
+ P(Yes|Rainy,Hot) = 0.0818 / 0.1084
+                  = 0.755 (75.5%)
+
+ P(No|Rainy,Hot) = 0.0266 / 0.1084
+                 = 0.245 (24.5%)
+
+
+ โœ… FINAL PREDICTION: YES (Play Tennis!)
+ Confidence: 75.5% +
+
+
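All six steps above can be reproduced with a short script. This is a sketch of categorical Naive Bayes with Laplace smoothing on the tennis table; note that full precision gives 75.4% rather than 75.5%, because the text rounds each intermediate probability before multiplying:

```python
from collections import Counter

# Tennis dataset from the table above: (Outlook, Temperature, Play)
data = [
    ("Sunny", "Hot", "No"), ("Sunny", "Mild", "No"),
    ("Cloudy", "Hot", "Yes"), ("Rainy", "Mild", "Yes"),
    ("Rainy", "Cool", "Yes"), ("Cloudy", "Cool", "Yes"),
]
alpha = 1                  # Laplace smoothing strength
n_outlook, n_temp = 3, 3   # categories per feature

classes = Counter(row[2] for row in data)   # {'Yes': 4, 'No': 2}
total = sum(classes.values())

def p_feature(value, idx, cls, n_categories):
    """Smoothed P(feature = value | class = cls)."""
    count = sum(1 for row in data if row[idx] == value and row[2] == cls)
    return (count + alpha) / (classes[cls] + alpha * n_categories)

def score(cls, outlook, temp):
    """Prior times smoothed likelihoods (unnormalized)."""
    prior = classes[cls] / total
    return prior * p_feature(outlook, 0, cls, n_outlook) * p_feature(temp, 1, cls, n_temp)

s_yes = score("Yes", "Rainy", "Hot")   # (4/6) * (3/7) * (2/7)
s_no = score("No", "Rainy", "Hot")     # (2/6) * (1/5) * (2/5)
p_yes = s_yes / (s_yes + s_no)         # normalize as in Step 6
print(round(p_yes, 3))  # 0.754 -> predict YES
```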
+ +
+
+ +
+

Figure: Categorical Naive Bayes calculation visualization

+
+ +

๐ŸŽฏ PART B: Gaussian Naive Bayes (Step-by-Step from PDF)

+ +

Dataset: 2D Classification

+ + + + + + + + + + + + +
IDXโ‚Xโ‚‚Class
A1.02.0Yes
B2.01.0Yes
C1.51.8Yes
D3.03.0No
E3.52.8No
F2.93.2No
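For Part B's continuous features, Gaussian Naive Bayes replaces the category counts with a per-class normal distribution fitted by each feature's mean and variance. A minimal sketch on the dataset above; the test point (2.0, 2.0) is my own illustrative choice, not from the source:

```python
import math

# Dataset from the table above: (x1, x2) -> class
data = [
    ((1.0, 2.0), "Yes"), ((2.0, 1.0), "Yes"), ((1.5, 1.8), "Yes"),
    ((3.0, 3.0), "No"), ((3.5, 2.8), "No"), ((2.9, 3.2), "No"),
]

def fit(cls):
    """Per-feature mean and (population) variance for one class."""
    pts = [x for x, c in data if c == cls]
    stats = []
    for j in range(2):
        vals = [p[j] for p in pts]
        mu = sum(vals) / len(vals)
        var = sum((v - mu) ** 2 for v in vals) / len(vals)
        stats.append((mu, var))
    return stats

def gauss(x, mu, var):
    """Normal density N(x; mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def score(x, cls):
    """Prior times product of per-feature Gaussian likelihoods."""
    prior = sum(1 for _, c in data if c == cls) / len(data)
    s = prior
    for j, (mu, var) in enumerate(fit(cls)):
        s *= gauss(x[j], mu, var)
    return s

x_new = (2.0, 2.0)  # hypothetical test point near the "Yes" cluster
pred = max(["Yes", "No"], key=lambda c: score(x_new, c))
print(pred)  # Yes
```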
-
-

๐Ÿ’ช Try These:

-
    -
  1. Find P(IQ > 130) using same distribution
  2. -
  3. What IQ scores contain the middle 95% of people?
  4. -
  5. If z = -2, what percentile is this?
  6. -
- - +
-
-

๐ŸŽฏ Key Takeaways

-
    -
  • Symmetric bell curve, parameters ฮผ and ฯƒ
  • -
  • 68-95-99.7 rule for standard deviations
  • -
  • Foundation for hypothesis testing
  • -
  • Central Limit Theorem connects to sampling
  • -
-
-
- - -
-
- Topic 25 -

โš–๏ธ Hypothesis Testing Introduction

-

Making decisions from data

-
+ +
+
+

๐Ÿ” Unsupervised - Clustering K-means Clustering

+ +
+
+

K-means is an unsupervised learning algorithm that groups data into K clusters. Each cluster has a centroid (center point), and points are assigned to the nearest centroid. Perfect for customer segmentation, image compression, and pattern discovery!

+ +
+
Key Concepts
+
    +
  • Unsupervised: No labels needed!
  • +
  • K = number of clusters (you choose)
  • +
  • Minimizes Within-Cluster Sum of Squares (WCSS)
  • +
  • Iterative: Updates centroids until convergence
  • +
+
-
-

Introduction

-

What is it? Statistical method for testing claims about populations using sample data.

-

Why it matters: Allows us to make evidence-based decisions and determine if effects are real or due to chance.

-
+

๐ŸŽฏ Step-by-Step K-means Algorithm (from PDF)

-
-

The Two Hypotheses

-
    -
  • Null Hypothesis (Hโ‚€): Status quo, no effect, no difference
  • -
  • Alternative Hypothesis (Hโ‚ or Hโ‚): What we're trying to prove
  • +

    Dataset: 6 Points in 2D Space

    + + + + + + + + + + + + +
Point | X | Y
A | 1 | 2
B | 1.5 | 1.8
C | 5 | 8
D | 8 | 8
E | 1 | 0.6
F | 9 | 11
    + +

    Goal: Group into K=2 clusters

    +

    Initial Centroids: cโ‚ = [3, 4], cโ‚‚ = [5, 1]

    + +
    + Distance Formula (Euclidean):
    + d(point, centroid) = โˆš[(xโ‚-xโ‚‚)ยฒ + (yโ‚-yโ‚‚)ยฒ] +
    + +

    Iteration 1

    + +
    + Step 1: Calculate Distances to All Centroids
    +
    + Point A (1, 2):
    + d(A, cโ‚) = โˆš[(1-3)ยฒ + (2-4)ยฒ] = โˆš[4+4] = โˆš8 = 2.83
    + d(A, cโ‚‚) = โˆš[(1-5)ยฒ + (2-1)ยฒ] = โˆš[16+1] = โˆš17 = 4.12
    + โ†’ Assign to cโ‚ (closer)
    +
    + Point B (1.5, 1.8):
    + d(B, cโ‚) = โˆš[(1.5-3)ยฒ + (1.8-4)ยฒ] = โˆš[2.25+4.84] = 2.66
    + d(B, cโ‚‚) = โˆš[(1.5-5)ยฒ + (1.8-1)ยฒ] = โˆš[12.25+0.64] = 3.59
    + โ†’ Assign to cโ‚
    +
    + Point C (5, 8):
    + d(C, cโ‚) = โˆš[(5-3)ยฒ + (8-4)ยฒ] = โˆš[4+16] = 4.47
    + d(C, cโ‚‚) = โˆš[(5-5)ยฒ + (8-1)ยฒ] = โˆš[0+49] = 7.0
    + โ†’ Assign to cโ‚
    +
    + Point D (8, 8):
    + d(D, cโ‚) = โˆš[(8-3)ยฒ + (8-4)ยฒ] = โˆš[25+16] = 6.40
    + d(D, cโ‚‚) = โˆš[(8-5)ยฒ + (8-1)ยฒ] = โˆš[9+49] = 7.62
    + โ†’ Assign to cโ‚
    +
    + Point E (1, 0.6):
    + d(E, cโ‚) = โˆš[(1-3)ยฒ + (0.6-4)ยฒ] = โˆš[4+11.56] = 3.94
    + d(E, cโ‚‚) = โˆš[(1-5)ยฒ + (0.6-1)ยฒ] = โˆš[16+0.16] = 4.02
    + โ†’ Assign to cโ‚
    +
    + Point F (9, 11):
    + d(F, cโ‚) = โˆš[(9-3)ยฒ + (11-4)ยฒ] = โˆš[36+49] = 9.22
    + d(F, cโ‚‚) = โˆš[(9-5)ยฒ + (11-1)ยฒ] = โˆš[16+100] = 10.77
    + โ†’ Assign to cโ‚
    +
    + Result: Cluster 1 = {A, B, C, D, E, F}, Cluster 2 = {} +
    + +
    +
    โš ๏ธ Poor Initial Centroids!
    +
    + All points assigned to cโ‚! This happens with bad initialization. Let's try better initial centroids for the algorithm to work properly. +
    +
    + +

    Better Initial Centroids: cโ‚ = [1, 1], cโ‚‚ = [8, 9]

    + +
    + Iteration 1 (Revised):
    +
    + Cluster 1: {A, B, E} โ†’ cโ‚_new = mean = [(1+1.5+1)/3, (2+1.8+0.6)/3] = [1.17, 1.47]
    + Cluster 2: {C, D, F} โ†’ cโ‚‚_new = mean = [(5+8+9)/3, (8+8+11)/3] = [7.33, 9.00]
    +
    + WCSS Calculation:
    + WCSSโ‚ = dยฒ(A,cโ‚) + dยฒ(B,cโ‚) + dยฒ(E,cโ‚)
    +        = (1-1.17)ยฒ+(2-1.47)ยฒ + (1.5-1.17)ยฒ+(1.8-1.47)ยฒ + (1-1.17)ยฒ+(0.6-1.47)ยฒ
    +        = 0.311 + 0.218 + 0.786 = 1.315
    +
    + WCSSโ‚‚ = dยฒ(C,cโ‚‚) + dยฒ(D,cโ‚‚) + dยฒ(F,cโ‚‚)
    +        = (5-7.33)ยฒ+(8-9)ยฒ + (8-7.33)ยฒ+(8-9)ยฒ + (9-7.33)ยฒ+(11-9)ยฒ
    +        = 6.433 + 1.447 + 6.789 = 14.669
    +
    + Total WCSS = 1.315 + 14.669 = 15.984 +
    + +
    + Iteration 2:
    +
    + Using cโ‚ = [1.17, 1.47] and cโ‚‚ = [7.33, 9.00], recalculate distances...
    +
    + Result: Same assignments! Centroids don't change.
    + โœ“ Converged! +
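The two iterations above can be replayed with a small K-means loop, starting from the better initial centroids cโ‚ = [1, 1] and cโ‚‚ = [8, 9]; it converges to the same clusters and centroids as the walkthrough:

```python
# Minimal K-means on the six points above.
points = {"A": (1, 2), "B": (1.5, 1.8), "C": (5, 8),
          "D": (8, 8), "E": (1, 0.6), "F": (9, 11)}
centroids = [(1.0, 1.0), (8.0, 9.0)]  # the "better" initial centroids

def assign(centroids):
    """Map each point name to the index of its nearest centroid."""
    out = {}
    for name, (x, y) in points.items():
        d2 = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
        out[name] = d2.index(min(d2))
    return out

for it in range(10):
    labels = assign(centroids)
    new = []
    for k in range(2):
        members = [points[n] for n, lab in labels.items() if lab == k]
        new.append((sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members)))
    if new == centroids:   # converged: centroids stopped moving
        break
    centroids = new

cluster1 = sorted(n for n, lab in labels.items() if lab == 0)
print(cluster1)                             # ['A', 'B', 'E']
print([round(c, 2) for c in centroids[0]])  # [1.17, 1.47]
print([round(c, 2) for c in centroids[1]])  # [7.33, 9.0]
```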
    + +
    +
    + +
    +

    Figure: K-means clustering visualization with centroid movement

    +
    + +

    Finding Optimal K: The Elbow Method

    + +

    How do we choose K? Try different values and plot WCSS!

    + +
    + WCSS for Different K Values:
    +
    + K=1: WCSS = 50.0 (all in one cluster)
    + K=2: WCSS = 18.0
    + K=3: WCSS = 10.0 โ† Elbow point!
    + K=4: WCSS = 8.0
    + K=5: WCSS = 7.0
    +
    + Rule: Choose K at the "elbow" where WCSS stops decreasing rapidly +
    + +
    +
    + +
    +

    Figure: Elbow method - optimal K is where the curve bends

    +
    + +
    +
    ๐Ÿ’ก K-means Tips
    +
    + Advantages:
    + โœ“ Simple and fast
    + โœ“ Works well with spherical clusters
    + โœ“ Scales to large datasets
    +
    + Disadvantages:
    + โœ— Need to specify K in advance
    + โœ— Sensitive to initial centroids (use K-means++!)
    + โœ— Assumes spherical clusters
    + โœ— Sensitive to outliers
    +
    + Solutions:
    + โ€ข Use elbow method for K
    + โ€ข Use K-means++ initialization
    + โ€ข Run multiple times with different initializations +
    +
    + +

    Real-World Applications

    +
      +
    • Customer Segmentation: Group customers by behavior
    • +
    • Image Compression: Reduce colors in images
    • +
    • Document Clustering: Group similar articles
    • +
    • Anomaly Detection: Points far from centroids are outliers
    • +
    • Feature Learning: Learn representations for neural networks
+
-
-

Decision Process

-
    -
  1. State hypotheses (Hโ‚€ and Hโ‚)
  2. -
  3. Choose significance level (ฮฑ)
  4. -
  5. Collect data and calculate test statistic
  6. -
  7. Find p-value or critical value
  8. -
  9. Make decision: Reject Hโ‚€ or Fail to reject Hโ‚€
  10. -
-
+ +
+
+

๐Ÿ“Š Supervised: Decision Trees

+ +
+
+

Decision Trees make decisions by asking yes/no questions recursively. They're interpretable, powerful, and the foundation for ensemble methods like Random Forests!

+ +
+
Key Concepts
+
    +
  • Recursive partitioning of feature space
  • +
  • Each node asks a yes/no question
  • +
  • Leaves contain predictions
  • +
  • Uses Information Gain or Gini Impurity for splitting
  • +
+
-
-
๐Ÿ“Š EXAMPLE
-

Claim: New teaching method improves test scores

-

Hโ‚€: ฮผ = 75 (no improvement)

-

Hโ‚: ฮผ > 75 (scores improved)

-
+

How Decision Trees Work

+

Imagine you're playing "20 Questions" to guess an animal. Each question splits possibilities into two groups. Decision Trees work the same way!

-
-

๐ŸŽฏ Key Takeaways

-
    -
  • Hโ‚€ = null hypothesis (status quo)
  • -
  • Hโ‚ = alternative hypothesis (what we test)
  • -
  • We either reject or fail to reject Hโ‚€
  • -
  • Never "accept" or "prove" anything
  • -
-
-
- - -
-
- Topic 26 -

๐ŸŽฏ Significance Level (ฮฑ)

-

Setting your error tolerance

-
+
+
+ +
+

Figure 1: Interactive decision tree structure

+
-
-

Introduction

-

What is it? ฮฑ (alpha) is the probability of rejecting Hโ‚€ when it's actually true (Type I error rate).

-

Common values: 0.05 (5%), 0.01 (1%), 0.10 (10%)

-
+

Splitting Criteria

+

How do we choose which question to ask at each node? We want splits that maximize information gain!

-
-

Interpretation

-
    -
  • ฮฑ = 0.05: Willing to be wrong 5% of the time
  • -
  • Lower ฮฑ: More stringent, harder to reject Hโ‚€
  • -
  • Higher ฮฑ: More lenient, easier to reject Hโ‚€
  • -
  • Confidence level: 1 - ฮฑ (e.g., 0.05 โ†’ 95% confidence)
  • -
-
+

1. Entropy (Information Theory)

+
+ Entropy Formula: + H(S) = -ฮฃ pแตข ร— logโ‚‚(pแตข)
+
+ where pแตข = proportion of class i
+
+ Interpretation:
+ โ€ข Entropy = 0: Pure (all same class)
+ โ€ข Entropy = 1: Maximum disorder (50-50 split)
+ โ€ข Lower entropy = better! +
- -
-

๐Ÿ“ Worked Example - Step by Step

- -
-

Problem:

-

Explain the difference between ฮฑ = 0.05 and ฮฑ = 0.01. Which is more strict? Find critical values for both in a two-tailed test.

+

2. Information Gain

+
+ Information Gain Formula: + IG(S, A) = H(S) - ฮฃ |Sแตฅ|/|S| ร— H(Sแตฅ)
+
+ = Entropy before split - Weighted entropy after split
+
+ We choose the split with HIGHEST information gain!
- -
-

Solution:

- -
-
Step 1:
-
-

Understand ฮฑ = 0.05

-
- ฮฑ = 0.05 means 5% significance
- 95% confidence level (1 - 0.05)
- P(Type I error) = 5%
- Willing to be wrong 5% of the time
-
-
-
- -
-
Step 2:
-
-

Understand ฮฑ = 0.01

-
- ฮฑ = 0.01 means 1% significance
- 99% confidence level (1 - 0.01)
- P(Type I error) = 1%
- Only willing to be wrong 1% of the time
-
-
-
- -
-
Step 3:
-
-

Find Critical Values for ฮฑ = 0.05

-
- Two-tailed: split ฮฑ into both tails
- Each tail = 0.05/2 = 0.025
- Zโ‚€.โ‚‰โ‚‡โ‚… = ยฑ1.96
- Reject if |z| > 1.96
-
-
+ +
+
+
- -
-
Step 4:
-
-

Find Critical Values for ฮฑ = 0.01

-
- Two-tailed: each tail = 0.01/2 = 0.005
- Zโ‚€.โ‚‰โ‚‰โ‚… = ยฑ2.576
- Reject if |z| > 2.576
- Harder to reject (more strict!)
-
-
+

Figure 2: Entropy and Information Gain visualization

+
+ +

3. Gini Impurity (Alternative)

+
+ Gini Formula: + Gini(S) = 1 - ฮฃ pแตขยฒ
+
+ Interpretation:
+ โ€ข Gini = 0: Pure
+ โ€ข Gini = 0.5: Maximum impurity (binary)
+ โ€ข Faster to compute than entropy +
+ +

Worked Example: Email Classification

+

Dataset: 10 emails - 7 spam, 3 not spam

+ +
+
๐Ÿ“Š Calculating Information Gain
+
+ Initial Entropy:
+ H(S) = -7/10ร—logโ‚‚(7/10) - 3/10ร—logโ‚‚(3/10)
+ H(S) = 0.881 bits
+
+ Split by "Contains 'FREE'":
+ โ€ข Left (5 emails): 4 spam, 1 not โ†’ H = 0.722
+ โ€ข Right (5 emails): 3 spam, 2 not โ†’ H = 0.971
+
+ Weighted Entropy:
+ = 5/10 ร— 0.722 + 5/10 ร— 0.971 = 0.847
+
+ Information Gain:
+ IG = 0.881 - 0.847 = 0.034 bits
+
+ Split by "Has suspicious link":
+ IG = 0.156 bits โ† BETTER! Use this split!
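The entropy and information-gain arithmetic above can be checked directly. A short sketch; at full precision the "Contains 'FREE'" gain comes out as 0.035 bits (the 0.034 in the box reflects rounding the intermediate entropies first):

```python
import math

def entropy(counts):
    """Shannon entropy in bits for a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# 10 emails: 7 spam, 3 not spam
h_root = entropy([7, 3])
print(round(h_root, 3))  # 0.881

# Split by "Contains 'FREE'": left = 4 spam / 1 not, right = 3 spam / 2 not
h_left, h_right = entropy([4, 1]), entropy([3, 2])
weighted = 5 / 10 * h_left + 5 / 10 * h_right
ig = h_root - weighted
print(round(ig, 3))  # 0.035
```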
- -
-
Step 5:
-
-

Compare

-
- ฮฑ = 0.01 is MORE STRICT
- Requires stronger evidence to reject Hโ‚€
- Reduces Type I errors but increases Type II
-
-
+
+ +
+
+
- -
- โœ“ Final Answer: - ฮฑ = 0.05: z = ยฑ1.96; ฮฑ = 0.01: z = ยฑ2.576 (more strict) +

Figure 3: Comparing different splits by information gain

+
+ +

Decision Boundaries

+
+
+
+

Figure 4: Decision tree creates rectangular regions

- -
-

๐Ÿ’ช Practice Problems:

-
    -
  1. Find critical value for ฮฑ = 0.10, two-tailed
  2. -
  3. If we want to be very strict, should we use ฮฑ = 0.05 or ฮฑ = 0.001?
  4. -
  5. What happens to Type II error when ฮฑ decreases?
  6. -
+ +

Overfitting in Decision Trees

+
+
โš ๏ธ The Overfitting Problem
+
+ Without constraints, decision trees grow until each leaf has ONE sample!
+
+ Solutions:
+ โ€ข Max depth: Limit tree height (e.g., max_depth=5)
+ โ€ข Min samples split: Need X samples to split (e.g., min=10)
+ โ€ข Min samples leaf: Each leaf must have X samples
+ โ€ข Pruning: Grow full tree, then remove branches +
+ +

Advantages vs Disadvantages

+ + + + + + + + + + + + + + + + + + + + + + + + + + +
Advantages โœ… | Disadvantages โŒ
Easy to understand and interpret | Prone to overfitting
No feature scaling needed | Small changes โ†’ big tree changes
Handles non-linear relationships | Biased toward features with more levels
Works with mixed data types | Can't extrapolate beyond training data
Fast prediction | Less accurate than ensemble methods
+
+ + + + +
+
+

๐ŸŽฎ Reinforcement: Introduction to Reinforcement Learning

+ +
+
+

Reinforcement Learning (RL) is learning by trial and error, just like teaching a dog tricks! The agent takes actions in an environment, receives rewards or punishments, and learns which actions lead to the best outcomes.

+ +
+
Key Concepts
+
    +
  • Agent: The learner/decision maker
  • +
  • Environment: The world the agent interacts with
  • +
  • State: Current situation of the agent
  • +
  • Action: What the agent can do
  • +
  • Reward: Feedback signal (positive or negative)
  • +
  • Policy: Strategy the agent follows
  • +
+
+ +

The RL Loop

+
    +
  1. Observe state: Agent sees current situation
  2. +
  3. Choose action: Based on policy ฯ€(s)
  4. +
  5. Execute action: Interact with environment
  6. +
  7. Receive reward: Get feedback r
  8. +
  9. Transition to new state: Environment changes to s'
  10. +
  11. Learn and update: Improve policy
  12. +
+ +
+
๐Ÿ’ก Key Difference from Supervised Learning
+
+ Supervised: "Here's the right answer for each example"
+ Reinforcement: "Try things and I'll tell you if you did well or poorly"
+
+ RL must explore to discover good actions, while supervised learning is given correct answers upfront! +
+
-
-

๐ŸŽฏ Key Takeaways

+

Real-World Examples

    -
  • ฮฑ = probability of Type I error
  • -
  • Common: ฮฑ = 0.05 (5% error rate)
  • -
  • Set before collecting data
  • -
  • Trade-off between Type I and Type II errors
  • +
  • Game Playing: AlphaGo learning to play Go by playing millions of games
  • +
  • Robotics: Robot learning to walk by trying different leg movements
  • +
  • Self-Driving Cars: Learning to drive safely through experience
  • +
  • Recommendation Systems: Learning what users like from their interactions
  • +
  • Resource Management: Optimizing data center cooling to save energy
-
-
- - -
-
- Topic 27 -

๐Ÿ“Š Standard Error

-

Measuring sampling variability

-
-
-

Introduction

-

What is it? Standard error (SE) measures how much sample means vary from the true population mean.

-
+

Exploration vs Exploitation

+

The fundamental dilemma in RL:

+
    +
  • Exploration: Try new actions to discover better rewards
  • +
  • Exploitation: Use known good actions to maximize reward
  • +
+

Balance is key! Too much exploration wastes time on bad actions. Too much exploitation misses better strategies.

-
-

Formula

-
-
Standard Error of Mean
-
SE = ฯƒ / โˆšn
-

or estimate: SE = s / โˆšn

+
+ Reward Signal: + Total Return = R = rโ‚ + ฮณrโ‚‚ + ฮณยฒrโ‚ƒ + ... = ฮฃ ฮณแต— rแต—โ‚Šโ‚ +
where:
ฮณ = discount factor (0 โ‰ค ฮณ โ‰ค 1)
Future rewards are worth less than immediate rewards
+
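The discounted-return formula above is a one-liner in code. A sketch with a hypothetical reward sequence (three steps of +1, not from the source):

```python
# Discounted return G = r1 + gamma*r2 + gamma^2*r3 + ...
gamma = 0.9
rewards = [1, 1, 1]  # hypothetical: +1 reward on each of three steps

G = sum(gamma ** t * r for t, r in enumerate(rewards))
print(round(G, 2))  # 2.71  (= 1 + 0.9 + 0.81)
```

With gamma close to 1 the agent is far-sighted; with gamma near 0 it cares almost only about the immediate reward.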
-
-

Key Points

-
    -
  • Decreases as sample size increases
  • -
  • Measures precision of sample mean
  • -
  • Lower SE = better estimate
  • -
  • Used in confidence intervals and hypothesis tests
  • + +
    +
    +

๐ŸŽฎ Reinforcement: Q-Learning

    + +
    +
    +

    Q-Learning is a value-based RL algorithm that learns the quality (Q-value) of taking each action in each state. It's model-free and can learn optimal policies even without knowing how the environment works!

    + +
    +
    Key Concepts
    +
      +
    • Q-value: Expected future reward for action a in state s
    • +
    • Q-table: Stores Q-values for all state-action pairs
    • +
    • Off-policy: Can learn optimal policy while following exploratory policy
    • +
    • Temporal Difference: Learn from each step, not just end of episode
    • +
    +
    + +
    + Q-Learning Update Rule: + Q(s, a) โ† Q(s, a) + ฮฑ[r + ฮณ ยท max Q(s', a') - Q(s, a)] +

    + Breaking it down:
    + Q(s, a) = Current Q-value estimate
    + ฮฑ = Learning rate (e.g., 0.1)
    + r = Immediate reward received
    + ฮณ = Discount factor (e.g., 0.9)
    + max Q(s', a') = Best Q-value in next state
    + [r + ฮณ ยท max Q(s', a') - Q(s, a)] = TD error (how wrong we were) +
    + +

    Step-by-Step Example: Grid World Navigation

    +

    Problem: Agent navigates 3x3 grid to reach goal at (2,2)

    + +
    +
    STEP 1: Initialize Q-Table
    +
    + States: 9 positions (0,0) to (2,2)
    + Actions: 4 directions (Up, Down, Left, Right)
    +
    + Q-table: 9 ร— 4 = 36 values, all initialized to 0
    +
    + Example entry: Q((1,1), Right) = 0.0 +
    +
    + +
    +
    STEP 2: Episode 1 - Random Exploration
    +
    + Start: s = (0,0)
    +
    + Step 1: Choose action a = Right (ฮต-greedy)
    + Execute: Move to s' = (0,1)
    + Reward: r = -1 (penalty for each step)
    +
    + Update Q((0,0), Right):
    + Q = 0 + 0.1[-1 + 0.9 ร— max(0, 0, 0, 0) - 0]
    + Q = 0 + 0.1[-1]
    + Q((0,0), Right) = -0.1 โœ“
    +
    + Step 2: s = (0,1), action = Down
    + s' = (1,1), r = -1
    + Q((0,1), Down) = 0 + 0.1[-1 + 0] = -0.1
    +
    + Step 3: s = (1,1), action = Right
    + s' = (1,2), r = -1
    + Q((1,1), Right) = -0.1
    +
    + Step 4: s = (1,2), action = Down
    + s' = (2,2) โ† GOAL!
    + r = +100 (big reward!)
    +
    + Q((1,2), Down) = 0 + 0.1[100 + 0]
    + Q((1,2), Down) = 10.0 โœ“โœ“โœ“ +
    +
    + +
    +
    STEP 3: Episode 2 - Learning Propagates Backward
    +
    + Path: (0,0) โ†’ (0,1) โ†’ (1,1) โ†’ (1,2) โ†’ (2,2)
    +
    + At (1,1), choosing Right:
    + Q((1,1), Right) = -0.1 + 0.1[-1 + 0.9 ร— 10.0 - (-0.1)]
    + = -0.1 + 0.1[-1 + 9.0 + 0.1]
    + = -0.1 + 0.1[8.1]
    + = -0.1 + 0.81
    + Q((1,1), Right) = 0.71 โœ“
    +
    + โ†’ The value of being near the goal propagates backward! +
    +
    + +
    +
    โœ… After Many Episodes
    +
    + The Q-table converges to optimal values:
    +
+ Q((0,0), Right) โ‰ˆ 70.2
+ Q((1,1), Right) โ‰ˆ 89.0  (= -1 + 0.9 ร— 100)
+ Q((1,2), Down) = 100.0  (goal is terminal, no future term)
    +
    + Optimal Policy: Always move toward (2,2) via shortest path!
    + Agent has learned to navigate perfectly through trial and error. +
    +
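The update rule used throughout the walkthrough fits in one small function; here it reproduces the Episode 1 goal update and the Episode 2 backward-propagation update:

```python
def q_update(q_sa, reward, max_q_next, alpha=0.1, gamma=0.9):
    """One temporal-difference update of Q(s, a)."""
    td_error = reward + gamma * max_q_next - q_sa
    return q_sa + alpha * td_error

# Episode 1, step 4: reaching the goal from (1,2), r = +100, terminal next state
q = q_update(0.0, 100, 0.0)
print(round(q, 2))   # 10.0

# Episode 2, at (1,1) choosing Right, where max Q((1,2)) is now 10.0
q2 = q_update(-0.1, -1, 10.0)
print(round(q2, 2))  # 0.71
```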
    + +

    ฮต-Greedy Policy

    +
    + Action Selection:
    + With probability ฮต: Choose random action (explore)
    + With probability 1-ฮต: Choose argmax Q(s,a) (exploit)
    +
    + Common: Start ฮต=1.0, decay to ฮต=0.01 over time +
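The action-selection rule above is a few lines of Python. A sketch; the Q-values are hypothetical:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, else the argmax action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))     # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

q = [0.2, 0.8, -0.1, 0.5]  # hypothetical Q-values for 4 actions
print(epsilon_greedy(q, epsilon=0.0))  # 1  (pure exploitation: argmax)
```

In training, epsilon typically starts near 1.0 and decays each episode toward a small floor such as 0.01.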
    + +

    Advantages

    +
      +
    • โœ“ Simple to implement
    • +
    • โœ“ Guaranteed to converge to optimal policy
    • +
    • โœ“ Model-free (doesn't need environment model)
    • +
    • โœ“ Off-policy (learn from exploratory behavior)
    • +
    + +

    Disadvantages

    +
      +
    • โœ— Doesn't scale to large/continuous state spaces
    • +
    • โœ— Slow convergence in complex environments
    • +
    • โœ— Requires discrete actions
    +
    - -
    -

    ๐Ÿ“ Worked Example - Step by Step

    - -
    -

    Problem:

    -

    Population has ฯƒ = 20. Calculate standard error for sample sizes: n = 4, n = 16, n = 64, n = 100. What pattern do you notice?

    + +
    +
    +

๐ŸŽฎ Reinforcement: Policy Gradient Methods

    + +
    +
    +

    Policy Gradient methods directly optimize the policy (action selection strategy) instead of learning value functions. They're powerful for continuous action spaces and stochastic policies!

    + +
    +
    Key Concepts
    +
      +
  • Direct policy optimization: Learn ฯ€_ฮธ(a|s) directly
    • +
    • Parameterized policy: Use neural network with weights ฮธ
    • +
    • Gradient ascent: Move parameters to maximize expected reward
    • +
    • Works with continuous actions: Can output action distributions
    • +
    - -
    -

    Solution:

    - -
    -
    Step 1:
    -
    -

    Recall Standard Error Formula

    -
    - SE = ฯƒ / โˆšn
    - Where:
    - - ฯƒ = population standard deviation
    - - n = sample size
    - SE measures variability of sample means
    -
    -
    + +

    Policy vs Value-Based Methods

    + + + + + + + + + + + +
Aspect | Value-Based (Q-Learning) | Policy-Based
What it learns | Q(s,a) values | ฯ€(a|s) policy directly
Action selection | argmax Q(s,a) | Sample from ฯ€(a|s)
Continuous actions | Difficult | Natural
Stochastic policy | Indirect | Direct
Convergence | Can be unstable | Smoother
    + +
+ Policy Gradient Theorem: + โˆ‡_ฮธ J(ฮธ) = E_ฯ€[โˆ‡_ฮธ log ฯ€_ฮธ(a|s) ยท Q^ฯ€(s,a)] +

    + Practical form (REINFORCE):
+ โˆ‡_ฮธ J(ฮธ) โ‰ˆ โˆ‡_ฮธ log ฯ€_ฮธ(aแต—|sแต—) ยท Gแต—
+
+ where:
+ Gแต— = Total return from time t onward
+ ฯ€_ฮธ(a|s) = Probability of action a in state s
    + ฮธ = Policy parameters (neural network weights) +
    + +

    REINFORCE Algorithm (Monte Carlo Policy Gradient)

    +
    +
    Algorithm Steps
    +
    + 1. Initialize: Random policy parameters ฮธ
    +
    + 2. For each episode:
    +    a. Generate trajectory: sโ‚€, aโ‚€, rโ‚, sโ‚, aโ‚, rโ‚‚, ..., sโ‚œ
    +    b. For each time step t:
    +       - Calculate return: Gแต— = rแต—โ‚Šโ‚ + ฮณrแต—โ‚Šโ‚‚ + ฮณยฒrแต—โ‚Šโ‚ƒ + ...
    +       - Update: ฮธ โ† ฮธ + ฮฑ ยท Gแต— ยท โˆ‡แตง log ฯ€แตง(aแต—|sแต—)
    +
    + 3. Repeat until policy converges +
    +
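The smallest working instance of these steps is a two-armed bandit with a softmax policy (my own toy setup, not from the source): arm 1 always pays +1 and arm 0 pays 0, so the REINFORCE update should drive the policy toward always choosing arm 1:

```python
import math
import random

random.seed(0)

theta = [0.0, 0.0]  # policy parameters, one preference per arm

def policy(theta):
    """Softmax over arm preferences."""
    exps = [math.exp(t) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

alpha = 0.1
for episode in range(2000):
    pi = policy(theta)
    a = 0 if random.random() < pi[0] else 1  # sample an arm from the policy
    r = 1.0 if a == 1 else 0.0               # arm 1 is the good arm
    G = r                                    # one-step episode: return = reward
    # REINFORCE: theta_k += alpha * G * d/dtheta_k log pi(a)
    # For a softmax policy, that gradient is (1[k == a] - pi[k]).
    for k in range(2):
        theta[k] += alpha * G * ((1.0 if k == a else 0.0) - pi[k])

print(policy(theta)[1] > 0.9)  # True: the policy learned to prefer arm 1
```

Good outcomes raise the probability of the actions taken, exactly as described above; a full CartPole version replaces the two parameters with a neural network.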
    + +

    Example: CartPole Balancing

    +

    Problem: Balance a pole on a cart by moving left or right

    + +
    +
    Episode Example
    +
    + State: s = [cart_pos, cart_vel, pole_angle, pole_vel]
    + Actions: a โˆˆ {Left, Right}
    +
    + Time t=0:
    + sโ‚€ = [0.0, 0.0, 0.1, 0.0] (pole leaning right)
    + ฯ€(Left|sโ‚€) = 0.3, ฯ€(Right|sโ‚€) = 0.7
    + Sample action: aโ‚€ = Right
    + Reward: rโ‚ = +1 (pole still balanced)
    +
    + Time t=1:
    + sโ‚ = [0.05, 0.1, 0.08, -0.05]
    + Action: aโ‚ = Right
    + rโ‚‚ = +1
    +
    + ... episode continues for T=200 steps ...
    +
    + Total return: G = 200 (balanced entire episode!)
    +
    + Update policy:
    + For each (sแต—, aแต—) in trajectory:
    + ฮธ โ† ฮธ + 0.01 ร— 200 ร— โˆ‡ log ฯ€(aแต—|sแต—)
    +
    + โ†’ Increase probability of all actions taken in this successful episode! +
    +
    + +
    +
    ๐Ÿ’ก Why It Works
    +
    + Good episode (high G): Increase probability of actions taken
    + Bad episode (low G): Decrease probability of actions taken
    +
    + Over many episodes, the policy learns which actions lead to better outcomes! +
    +
    + +

    Advantages

    +
      +
    • โœ“ Works with continuous action spaces
    • +
    • โœ“ Can learn stochastic policies
    • +
    • โœ“ Better convergence properties
    • +
    • โœ“ Effective in high-dimensional spaces
    • +
    + +

    Disadvantages

    +
      +
    • โœ— High variance in gradient estimates
    • +
    • โœ— Sample inefficient (needs many episodes)
    • +
    • โœ— Can get stuck in local optima
    • +
    • โœ— Sensitive to learning rate
    • +
    + +
    +
    โœ… Modern Improvements
    +
    + Actor-Critic: Combine policy gradient with value function to reduce variance
    + PPO (Proximal Policy Optimization): Constrain policy updates for stability
    + TRPO (Trust Region): Guarantee monotonic improvement
    +
    + These advances make policy gradients practical for complex tasks like robot control and game playing!
    - -
    -
    Step 2:
    -
    -

    Calculate SE for n = 4

    -
    - SE = 20 / โˆš4
    - SE = 20 / 2
    - SE = 10
    -
    -
    +
    +
    +
    + + +
    +
    +

๐Ÿ”„ Algorithm Comparison Tool

    + +
    +
    +

    Compare machine learning algorithms side-by-side to choose the best one for your problem!

    + + +
    +

    Step 1: Select Learning Category

    +
    + + +
    - -
    -
    Step 3:
    -
    -

    Calculate SE for n = 16

    -
    - SE = 20 / โˆš16
    - SE = 20 / 4
    - SE = 5
    -
    -
    +
    + + +
    +

    Step 2: Select Algorithms to Compare (2-5)

    +
    +
    - -
    -
    Step 4:
    -
    -

    Calculate SE for n = 64

    -
    - SE = 20 / โˆš64
    - SE = 20 / 8
    - SE = 2.5
    -
    -
    +

    Selected: 0 algorithms

    +
    + + +
    + +
    + + +
- - -
-
- Topic 28 -

๐Ÿ“ Z-Test

-

Hypothesis test for large samples with known ฯƒ

-
+ + -
-

Formula

-
-
Z-Test Statistic
-
z = (xฬ„ - ฮผโ‚€) / (ฯƒ / โˆšn)
-

xฬ„ = sample mean

-

ฮผโ‚€ = hypothesized population mean

-

ฯƒ = population standard deviation

-

n = sample size

-
-
+ + - -
-

๐Ÿ“ Worked Example - Step by Step

- -
-

Problem:

-

A factory claims ฮผ = 100. Sample: n = 36, xฬ„ = 105, ฯƒ = 12. Test at ฮฑ = 0.05 (two-tailed).

+ +
+ +
- -
-

Solution:

- -
-
Step 1:
-
-

State Hypotheses

-
- Hโ‚€: ฮผ = 100 (claim is true)
- Hโ‚: ฮผ โ‰  100 (claim is false)
- ฮฑ = 0.05, two-tailed test
+ + +
+

๐ŸŽฏ Not Sure Which Algorithm? Take the Quiz!

+
+
+

Question 1: Do you have labeled data?

+
+ +
-
- -
-
Step 2:
-
-

Calculate Standard Error

-
- SE = ฯƒ / โˆšn
- SE = 12 / โˆš36
- SE = 12 / 6
- SE = 2
+ -
- -
-
Step 3:
-
-

Calculate Z-Statistic

-
- z = (xฬ„ - ฮผโ‚€) / SE
- z = (105 - 100) / 2
- z = 5 / 2
- z = 2.5
+ -
- -
-
Step 4:
-
-

Find Critical Values

-
- ฮฑ = 0.05, two-tailed
- Critical values: z = ยฑ1.96
- Rejection regions: z < -1.96 or z > 1.96
+ -
- -
-
Step 5:
-
-

Make Decision

-
- Test statistic: z = 2.5
- Critical value: z = 1.96
- 2.5 > 1.96 โ†’ In rejection region
-
- REJECT Hโ‚€
-
+
- -
-
Step 6:
-
-

Interpret

-
- There IS significant evidence that ฮผ โ‰  100
- The sample mean of 105 is statistically different
- Factory's claim is likely false
-
-
-
- -
- โœ“ Final Answer: - z = 2.5 > 1.96, REJECT Hโ‚€ (claim is false) -
- -
- Check: -

P-value = 2 ร— P(Z > 2.5) = 2 ร— 0.0062 = 0.0124 < 0.05 โœ“ Confirms rejection

-
- -
-

๐Ÿ’ช Practice Problems:

-
    -
  1. Test: ฮผโ‚€ = 50, xฬ„ = 48, ฯƒ = 10, n = 25, ฮฑ = 0.05
  2. -
  3. If z = -1.5, ฮฑ = 0.05, two-tailed, what's your decision?
  4. -
  5. When should we use z-test vs t-test?
  6. -
-
-
- -
-

๐ŸŽฏ Key Takeaways

-
    -
  • Use when n โ‰ฅ 30 and ฯƒ known
  • -
  • z = (xฬ„ - ฮผโ‚€) / SE
  • -
  • Compare z to critical value or find p-value
  • -
  • Large |z| = evidence against Hโ‚€
  • -
-
-
- - -
-
- Topic 29 -

๐ŸŽš๏ธ Z-Score & Critical Values

-

Standardization and rejection regions

+
-
-

Z-Score (Standardization)

-
-
Z-Score Formula
-
z = (x - ฮผ) / ฯƒ
-

Converts any normal distribution to standard normal (ฮผ=0, ฯƒ=1)

+ +
+
+

๐Ÿ“Š Supervised: Ensemble Methods

+ +
+
+

"Wisdom of the crowds" applied to machine learning! Ensemble methods combine multiple weak learners to create a strong learner. They power most Kaggle competition winners!

+ +
+
Key Concepts
+
    +
  • Combine multiple models for better predictions
  • +
  • Bagging: Train on random subsets (parallel)
  • +
  • Boosting: Sequential learning from mistakes
  • +
  • Stacking: Meta-learner combines base models
  • +
-
-
-

Critical Values

-
    -
  • ฮฑ = 0.05 (two-tailed): z = ยฑ1.96
  • -
  • ฮฑ = 0.05 (one-tailed): z = 1.645
  • -
  • ฮฑ = 0.01 (two-tailed): z = ยฑ2.576
  • -
-
+

Why Ensembles Work

+

Imagine 100 doctors diagnosing a patient. Even if each is only 70% accurate individually, their majority vote is over 95% accurate, provided their errors are independent. The same principle applies to ML.

- -
-

๐Ÿ“ Worked Example - Step by Step

- -
-

Problem:

-

Find critical z-values for: a) ฮฑ = 0.05 one-tailed (right), b) ฮฑ = 0.05 two-tailed, c) ฮฑ = 0.01 two-tailed. Draw rejection regions.

-
- -
-

Solution:

- -
-
Step 1:
-
-

One-Tailed Right (ฮฑ = 0.05)

-
- All ฮฑ in right tail
- Find z where P(Z > z) = 0.05
- P(Z โ‰ค z) = 1 - 0.05 = 0.95
- From z-table: zโ‚€.โ‚‰โ‚… = 1.645
-
- Critical value: z = 1.645
- Reject Hโ‚€ if z > 1.645
-
-
-
- -
-
Step 2:
-
-

Two-Tailed (ฮฑ = 0.05)

-
- Split ฮฑ between both tails
- Each tail = 0.05/2 = 0.025
- Left tail: P(Z < z) = 0.025 โ†’ z = -1.96
- Right tail: P(Z > z) = 0.025 โ†’ z = +1.96
-
- Critical values: z = ยฑ1.96
- Reject Hโ‚€ if |z| > 1.96
-
-
-
- -
-
Step 3:
-
-

Two-Tailed (ฮฑ = 0.01)

-
- More strict test
- Each tail = 0.01/2 = 0.005
- P(Z < z) = 0.005 โ†’ z = -2.576
- P(Z > z) = 0.005 โ†’ z = +2.576
-
- Critical values: z = ยฑ2.576
- Reject Hโ‚€ if |z| > 2.576
-
-
-
- -
-
Step 4:
-
-

Visualize Rejection Regions

-
- One-tailed (ฮฑ=0.05): [______|โ–ˆโ–ˆโ–ˆโ–ˆ] z > 1.645
- Two-tailed (ฮฑ=0.05): [โ–ˆโ–ˆ|________|โ–ˆโ–ˆ] |z| > 1.96
- Two-tailed (ฮฑ=0.01): [โ–ˆ|__________|โ–ˆ] |z| > 2.576
-
- Smaller ฮฑ โ†’ Larger critical values โ†’ Harder to reject
-
-
-
- -
- โœ“ Final Answer: - a) z = 1.645, b) z = ยฑ1.96, c) z = ยฑ2.576 +
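The same three critical values can be recovered without a z-table using Python's `statistics.NormalDist` (standard library, 3.8+); a quick sketch:

```python
from statistics import NormalDist

std = NormalDist()  # standard normal: mu = 0, sigma = 1

z_one = std.inv_cdf(0.95)      # a) one-tailed right, alpha = 0.05
z_two_05 = std.inv_cdf(0.975)  # b) two-tailed, alpha = 0.05 (0.025 per tail)
z_two_01 = std.inv_cdf(0.995)  # c) two-tailed, alpha = 0.01 (0.005 per tail)

print(round(z_one, 3))     # 1.645
print(round(z_two_05, 2))  # 1.96
print(round(z_two_01, 3))  # 2.576
```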
+
๐ŸŽฏ The Magic of Diversity
+
+ Key insight: Each model makes DIFFERENT errors!
+
+ Model A: Correct on samples [1,2,3,5,7,9] - 60% accuracy
+ Model B: Correct on samples [2,4,5,6,8,10] - 60% accuracy
+ Model C: Correct on samples [1,3,4,6,7,8] - 60% accuracy
+
+ Majority vote: Correct on [1,2,3,4,5,6,7,8] - 80% accuracy!
+
+ Diversity reduces variance!
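The toy numbers above can be verified directly; a small sketch:

```python
# Samples 1-10 that each 60%-accurate model gets right (from the example)
correct = {
    "A": {1, 2, 3, 5, 7, 9},
    "B": {2, 4, 5, 6, 8, 10},
    "C": {1, 3, 4, 6, 7, 8},
}

# The ensemble is right wherever at least 2 of the 3 models are right
ensemble = {s for s in range(1, 11)
            if sum(s in hits for hits in correct.values()) >= 2}

print(sorted(ensemble))    # [1, 2, 3, 4, 5, 6, 7, 8]
print(len(ensemble) / 10)  # 0.8 -> 80% accuracy
```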
- -
-

๐Ÿ’ช Practice Problems:

-
    -
  1. Find critical value for ฮฑ = 0.10, one-tailed (left)
  2. -
  3. If your test statistic is z = 2.0, which tests would reject Hโ‚€?
  4. -
  5. Why are two-tailed critical values larger than one-tailed?
  6. -
-
-
- -
-

๐ŸŽฏ Key Takeaways

-
    -
  • Z-score standardizes values
  • -
  • Critical values define rejection region
  • -
  • |z| > critical value โ†’ reject Hโ‚€
  • -
  • Common: ยฑ1.96 for 95% confidence
  • -
-
- - - -
-
- Topic 30 -

๐Ÿ’ฏ P-Value Method

-

Probability of observing data if Hโ‚€ is true

-
- -
-

Introduction

-

What is it? P-value is the probability of getting results as extreme as observed, assuming Hโ‚€ is true.

-
- -
-

Decision Rule

-
    -
  • If p-value โ‰ค ฮฑ: Reject Hโ‚€ (statistically significant)
  • -
  • If p-value > ฮฑ: Fail to reject Hโ‚€ (not significant)
  • -
-
- -
-

Interpretation

-
    -
  • p < 0.01: Very strong evidence against Hโ‚€
  • -
  • 0.01 โ‰ค p < 0.05: Strong evidence against Hโ‚€
  • -
  • 0.05 โ‰ค p < 0.10: Weak evidence against Hโ‚€
  • -
  • p โ‰ฅ 0.10: Little or no evidence against Hโ‚€
  • -
-
-
-
โš ๏ธ COMMON MISCONCEPTION
-

P-value is NOT the probability that Hโ‚€ is true! It's the probability of observing your data IF Hโ‚€ were true.

-
+

Method 1: Bagging (Bootstrap Aggregating)

+

Train multiple models on different random subsets of data (with replacement), then average predictions.

- -
-

๐Ÿ“ Worked Example - Step by Step

- -
-

Problem:

-

Sample of 36 students has mean score xฬ„ = 78. Population mean claimed to be ฮผโ‚€ = 75 with ฯƒ = 12. Test at ฮฑ = 0.05 using p-value method.

-
- -
-

Solution:

- -
-
Step 1:
-
-

State Hypotheses

-
- Hโ‚€: ฮผ = 75 (null hypothesis - no difference)
- Hโ‚: ฮผ โ‰  75 (alternative - there is a difference)
- Two-tailed test -
-

Set up null and alternative hypotheses

-
-
- -
-
Step 2:
-
-

Calculate Test Statistic

-
- z = (xฬ„ - ฮผโ‚€) / (ฯƒ/โˆšn)
- z = (78 - 75) / (12/โˆš36)
- z = 3 / (12/6)
- z = 3 / 2 = 1.5 -
-

Calculate the z-score

-
-
- -
-
Step 3:
-
-

Find P-Value

-
- For two-tailed: p-value = 2 ร— P(Z > |1.5|)
- P(Z > 1.5) = 1 - 0.9332 = 0.0668
- p-value = 2 ร— 0.0668 = 0.1336 -
-

Multiply by 2 for two-tailed test

-
-
- -
-
Step 4:
-
-

Compare with ฮฑ

-
- p-value = 0.1336
- ฮฑ = 0.05
- 0.1336 > 0.05 -
-

Since p-value exceeds ฮฑ, we fail to reject Hโ‚€

-
-
- -
-
Step 5:
-
-

Make Decision

-
- Since p-value > ฮฑ, FAIL TO REJECT Hโ‚€
- Not enough evidence to conclude mean differs from 75
- p-value of 13.36% means we'd see results this extreme
- 13.36% of time if Hโ‚€ true -
-

Interpret in context

-
-
- -
- โœ“ Final Answer: - p-value = 0.1336 > 0.05, Fail to reject Hโ‚€ -
- -
- Check: -

The result is not statistically significant at ฮฑ = 0.05 level. We need stronger evidence to claim the mean differs from 75.
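A sketch of the same calculation with `statistics.NormalDist` (standard library only):

```python
from math import sqrt
from statistics import NormalDist

x_bar, mu0, sigma, n, alpha = 78, 75, 12, 36, 0.05

z = (x_bar - mu0) / (sigma / sqrt(n))
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed

print(z)                  # 1.5
print(round(p_value, 4))  # 0.1336
print(p_value <= alpha)   # False -> fail to reject H0
```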

-
-
- -
-

๐Ÿ’ช Try These:

-
    -
  1. If z = 2.5, ฮฑ = 0.01, find p-value and decide
  2. -
  3. When do we reject Hโ‚€ using p-value method?
  4. -
- - +
+ Bagging Algorithm:
+ 1. Create B bootstrap samples (random sampling with replacement)
+ 2. Train a model on each sample independently
+ 3. For prediction:
+    โ€ข Regression: Average all predictions
+    โ€ข Classification: Majority vote
+
+ Effect: Reduces variance, prevents overfitting
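The three steps above can be sketched from scratch. This is an illustrative toy only: the made-up 1-D data and the deliberately high-variance 1-nearest-neighbour base learner are assumptions for demonstration, not a production implementation.

```python
import random
from statistics import mean

random.seed(0)

# Toy 1-D regression data: y is roughly 2x plus noise
data = [(x, 2 * x + random.uniform(-1, 1)) for x in range(10)]

def one_nn_predict(sample, x):
    """Base learner: 1-nearest-neighbour regression (high variance)."""
    return min(sample, key=lambda p: abs(p[0] - x))[1]

def bagging_predict(data, x, n_models=50):
    preds = []
    for _ in range(n_models):
        # Step 1: bootstrap sample - draw len(data) points WITH replacement
        boot = [random.choice(data) for _ in data]
        # Step 2: "train" one base model per sample and collect its prediction
        preds.append(one_nn_predict(boot, x))
    # Step 3 (regression): average all predictions
    return mean(preds)

print(round(bagging_predict(data, 4.5), 1))  # close to 2 * 4.5 = 9
```

Averaging over bootstrap replicas smooths out the jumpy single-neighbour predictions, which is exactly the variance reduction bagging promises.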
-
- -
-

๐ŸŽฏ Key Takeaways

-
    -
  • P-value = P(data | Hโ‚€ true)
  • -
  • Reject Hโ‚€ if p โ‰ค ฮฑ
  • -
  • Smaller p-value = stronger evidence against Hโ‚€
  • -
  • Most common approach in modern statistics
  • -
-
-
- - -
-
- Topic 31 -

โ†”๏ธ One-Tailed vs Two-Tailed Tests

-

Directional vs non-directional hypotheses

-
-
-

Two-Tailed Test

-
    -
  • Hโ‚: ฮผ โ‰  ฮผโ‚€ (different, could be higher or lower)
  • -
  • Testing for any difference
  • -
  • Rejection regions in both tails
  • -
  • More conservative
  • -
-
- -
-

One-Tailed Test

-
    -
  • Right-tailed: Hโ‚: ฮผ > ฮผโ‚€
  • -
  • Left-tailed: Hโ‚: ฮผ < ฮผโ‚€
  • -
  • Testing for specific direction
  • -
  • Rejection region in one tail
  • -
  • More powerful for directional effects
  • -
-
- - -
-

๐Ÿ“ Worked Example - Step by Step

- -
-

Problem:

-

Researcher claims new drug LOWERS blood pressure (ฮผ < 120). Sample of 49: xฬ„ = 115, ฯƒ = 21. Test at ฮฑ = 0.05. Should this be one-tailed or two-tailed?

-
- -
-

Solution:

- -
-
Step 1:
-
-

Analyze the Claim

-
- Claim: drug LOWERS pressure (directional)
- Looking for decrease specifically
- This requires ONE-TAILED test (left tail) -
-

Directional claim = one-tailed test

-
-
- -
-
Step 2:
-
-

Set Up Hypotheses

-
- Hโ‚€: ฮผ โ‰ฅ 120 (blood pressure not lower)
- Hโ‚: ฮผ < 120 (blood pressure IS lower)
- Left-tailed test -
-

Alternative hypothesis shows the direction

-
-
- -
-
Step 3:
-
-

Calculate Z-Score

-
- z = (xฬ„ - ฮผโ‚€) / (ฯƒ/โˆšn)
- z = (115 - 120) / (21/โˆš49)
- z = -5 / (21/7)
- z = -5 / 3 = -1.67 -
-

Negative z-score indicates below mean

-
-
- -
-
Step 4:
-
-

Find Critical Value (One-Tailed)

-
- For ฮฑ = 0.05, one-tailed (left)
- Critical value: z = -1.645 -
-

One-tailed critical value differs from two-tailed

-
-
- -
-
Step 5:
-
-

Make Decision

-
- Test statistic: z = -1.67
- Critical value: z = -1.645
- -1.67 < -1.645 (in rejection region)
- REJECT Hโ‚€ -
-

Falls in rejection region, so reject null

-
-
- -
-
Step 6:
-
-

Contrast with Two-Tailed

-
- If two-tailed: critical values ยฑ1.96
- Our |z| = 1.67 < 1.96
- Would NOT reject Hโ‚€ with two-tailed!
- This shows importance of choosing correct test -
-

Test choice matters!

-
-
- -
- โœ“ Final Answer: - Use ONE-TAILED (left). z = -1.67 < -1.645, Reject Hโ‚€ -
- -
- Check: -

Evidence supports claim that drug lowers blood pressure. One-tailed test was appropriate for directional claim.
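Both decision rules from this example, side by side in a short standard-library sketch:

```python
from math import sqrt
from statistics import NormalDist

x_bar, mu0, sigma, n = 115, 120, 21, 49
z = (x_bar - mu0) / (sigma / sqrt(n))

z_left = NormalDist().inv_cdf(0.05)   # one-tailed left critical value, -1.645
z_two = NormalDist().inv_cdf(0.975)   # two-tailed critical value, 1.96

print(round(z, 2))     # -1.67
print(z < z_left)      # True  -> one-tailed test rejects H0
print(abs(z) > z_two)  # False -> two-tailed test would NOT reject
```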

-
-
- -
-

๐Ÿ’ช Try These:

-
    -
  1. Claim: ฮผ > 50. One-tailed or two-tailed?
  2. -
  3. Claim: ฮผ โ‰  100. Which test?
  4. -
- - - -
-

๐ŸŽฏ Key Takeaways

-
    -
  • Two-tailed: testing for any difference
  • -
  • One-tailed: testing for specific direction
  • -
  • Choose before collecting data
  • -
  • Two-tailed is more conservative
  • -
-
-
- - -
-
- Topic 32 -

๐Ÿ“ T-Test

-

Hypothesis test for small samples or unknown ฯƒ

-
-
-

When to Use T-Test

-
    -
  • Small sample (n < 30)
  • -
  • Population ฯƒ unknown (use sample s)
  • -
  • Population approximately normal
  • -
-
+

Method 2: Boosting (Sequential Learning)

+

Train models sequentially, where each new model focuses on examples the previous models got wrong.

-
-

Formula

-
-
T-Test Statistic
-
t = (xฬ„ - ฮผโ‚€) / (s / โˆšn)
-

Same as z-test but uses s instead of ฯƒ

-

Follows t-distribution with df = n - 1

+
+ Boosting Algorithm:
+ 1. Start with equal weights for all samples
+ 2. Train model on weighted data
+ 3. Increase weights for misclassified samples
+ 4. Train next model (focuses on hard examples)
+ 5. Repeat for M iterations
+ 6. Final prediction = weighted vote of all models
+
+ Effect: Reduces bias AND variance
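The loop above is essentially AdaBoost. A self-contained sketch with decision stumps on a tiny made-up dataset (the data and the 3-round budget are illustrative assumptions, not part of the original text):

```python
import math

def train_stump(X, y, w):
    """Pick the (threshold, polarity) stump with lowest weighted error."""
    best = None
    for t in sorted(set(X)):
        for pol in (1, -1):
            pred = [pol if x >= t else -pol for x in X]
            err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
            if best is None or err < best[0]:
                best = (err, t, pol)
    return best

def adaboost(X, y, rounds=3):
    n = len(X)
    w = [1 / n] * n                            # 1. equal weights
    ensemble = []
    for _ in range(rounds):
        err, t, pol = train_stump(X, y, w)     # 2. train on weighted data
        alpha = 0.5 * math.log((1 - err) / max(err, 1e-12))
        pred = [pol if x >= t else -pol for x in X]
        # 3. up-weight misclassified samples, down-weight correct ones
        w = [wi * math.exp(-alpha * yi * pi) for wi, yi, pi in zip(w, y, pred)]
        total = sum(w)
        w = [wi / total for wi in w]           # renormalise
        ensemble.append((alpha, t, pol))       # 4./5. repeat
    return ensemble

def predict(ensemble, x):
    # 6. weighted vote of all stumps
    score = sum(a * (pol if x >= t else -pol) for a, t, pol in ensemble)
    return 1 if score >= 0 else -1

X = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 1, 1, -1, -1, -1, 1, 1]        # no single stump can fit this pattern
model = adaboost(X, y)
print([predict(model, x) for x in X])  # matches y after 3 rounds
```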
-
- -
-

๐Ÿ“ Worked Example - Step by Step

- -
-

Problem:

-

Small sample: n = 16, xฬ„ = 52, s = 8. Test if ฮผ = 50 at ฮฑ = 0.05. Population ฯƒ unknown.

-
- -
-

Solution:

- -
-
Step 1:
-
-

Choose Correct Test

-
- n = 16 < 30 (small sample)
- ฯƒ unknown (use sample s)
- Use T-TEST instead of z-test -
-

Small sample + unknown ฯƒ = t-test

-
-
- -
-
Step 2:
-
-

Calculate T-Statistic

-
- t = (xฬ„ - ฮผโ‚€) / (s/โˆšn)
- t = (52 - 50) / (8/โˆš16)
- t = 2 / (8/4)
- t = 2 / 2 = 1.0 -
-

Use sample standard deviation s

-
-
- -
-
Step 3:
-
-

Find Degrees of Freedom

-
- df = n - 1
- df = 16 - 1 = 15 -
-

Lose 1 df for estimating mean

-
-
- -
-
Step 4:
-
-

Find Critical Value

-
- Two-tailed test, ฮฑ = 0.05
- df = 15
- From t-table: tโ‚€.โ‚€โ‚‚โ‚…,โ‚โ‚… = ยฑ2.131 -
-

Look up in t-distribution table

-
-
- -
-
Step 5:
-
-

Compare and Decide

-
- Test statistic: t = 1.0
- Critical values: ยฑ2.131
- |1.0| < 2.131
- FAIL TO REJECT Hโ‚€ -
-

Test statistic not in rejection region

-
-
- -
-
Step 6:
-
-

Interpret

-
- Not enough evidence that ฮผ โ‰  50
- Sample mean of 52 is not significantly different from 50 -
-

Interpret in context of problem

-
-
- -
- โœ“ Final Answer: - t = 1.0, critical = ยฑ2.131, Fail to reject Hโ‚€ -
- -
- Check: -

The difference between 52 and 50 is not statistically significant at ฮฑ = 0.05 level with this small sample.
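The computation, sketched in Python (the 2.131 critical value is read from a t-table, as in the worked example, since the standard library has no t-distribution):

```python
from math import sqrt

n, x_bar, s, mu0 = 16, 52, 8, 50

t = (x_bar - mu0) / (s / sqrt(n))  # uses sample s, not population sigma
df = n - 1
t_crit = 2.131                     # t(0.025, df = 15) from a t-table

print(t)                # 1.0
print(df)               # 15
print(abs(t) > t_crit)  # False -> fail to reject H0
```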

-
-
- -
-

๐Ÿ’ช Try These:

-
    -
  1. n = 25, xฬ„ = 100, s = 15, test ฮผ = 95 at ฮฑ = 0.01
  2. -
  3. Why use t-test instead of z-test?
  4. -
- - - -
-

๐ŸŽฏ Key Takeaways

-
    -
  • Use when ฯƒ unknown or n < 30
  • -
  • t = (xฬ„ - ฮผโ‚€) / (s / โˆšn)
  • -
  • Follows t-distribution
  • -
  • More variable than z-distribution
  • -
-
-
- - -
-
- Topic 33 -

๐Ÿ”“ Degrees of Freedom

-

Independent pieces of information

-
- -
-

Introduction

-

What is it? Degrees of freedom (df) is the number of independent values that can vary in analysis.

-
- -
-

Common Formulas

-
    -
  • One-sample t-test: df = n - 1
  • -
  • Two-sample t-test: df โ‰ˆ nโ‚ + nโ‚‚ - 2
  • -
  • Chi-squared: df = (rows-1)(cols-1)
  • -
-
-
-

Why It Matters

-
    -
  • Determines shape of t-distribution
  • -
  • Higher df โ†’ closer to normal distribution
  • -
  • Affects critical values
  • -
-
+

Random Forest: Bagging + Decision Trees

+

The most popular ensemble method! Combines bagging with feature randomness.

- -
-

๐Ÿ“ Worked Example - Step by Step

- -
-

Problem:

-

Calculate degrees of freedom for: a) Single sample t-test: n = 20, b) Two-sample t-test: nโ‚ = 15, nโ‚‚ = 18, c) Chi-squared test: 3ร—4 contingency table

-
- -
-

Solution:

- -
-
Step 1:
-
-

Single Sample T-Test

-
- Formula: df = n - 1
- n = 20
- df = 20 - 1 = 19
- We "lose" 1 df because we estimate mean from sample -
-

Each parameter estimated reduces df by 1

-
-
- -
-
Step 2:
-
-

Two-Sample T-Test (Equal Variances)

-
- Formula: df = nโ‚ + nโ‚‚ - 2
- nโ‚ = 15, nโ‚‚ = 18
- df = 15 + 18 - 2 = 31
- Lose 1 df per sample for estimating each mean -
-

Two samples = two means estimated

-
-
- -
-
Step 3:
-
-

Chi-Squared Contingency Table

-
- Formula: df = (rows - 1) ร— (columns - 1)
- 3 rows, 4 columns
- df = (3 - 1) ร— (4 - 1)
- df = 2 ร— 3 = 6 -
-

Degrees of freedom for independence test

-
-
- -
-
Step 4:
-
-

Explain Concept

-
- Degrees of freedom = number of values free to vary
- Each parameter estimated reduces df by 1
- Higher df โ†’ distribution closer to normal -
-

Conceptual understanding

-
-
- -
- โœ“ Final Answer: - a) df = 19, b) df = 31, c) df = 6 -
- -
- Check: -

These df values would be used to find appropriate critical values from respective distribution tables.
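All three formulas in one short sketch:

```python
# a) one-sample t-test: df = n - 1
df_a = 20 - 1

# b) two-sample t-test (pooled): df = n1 + n2 - 2
df_b = 15 + 18 - 2

# c) chi-squared test of independence: df = (rows - 1) * (cols - 1)
df_c = (3 - 1) * (4 - 1)

print(df_a, df_b, df_c)  # 19 31 6
```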

-
+
+ Random Forest Algorithm:
+ 1. Create B bootstrap samples
+ 2. For each sample:
+    โ€ข Grow decision tree
+    โ€ข At each split, consider random subset of features
+    โ€ข Don't prune (let trees overfit!)
+ 3. Final prediction = average/vote of all trees
+
+ Typical values: B=100-500 trees, โˆšfeatures per split
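A toy from-scratch illustration of the two sources of randomness (bootstrap rows plus a random feature per split). The data and the depth-1 "trees" are deliberately minimal assumptions for demonstration; in practice you would reach for a real implementation such as scikit-learn's `RandomForestClassifier`.

```python
import random
from collections import Counter

random.seed(0)

# Toy data: two correlated features; class +1 when the first feature > 0.5
data = [([0.1, 0.15], -1), ([0.2, 0.25], -1), ([0.3, 0.3], -1),
        ([0.7, 0.65], 1), ([0.8, 0.85], 1), ([0.9, 0.9], 1)]

def train_tree(sample, n_features=2, max_features=1):
    """Depth-1 'tree': split on one RANDOMLY chosen feature (feature bagging)."""
    f = random.randrange(n_features) if max_features < n_features else 0
    best = None
    for (xs, _) in sample:
        for pol in (1, -1):
            t = xs[f]
            err = sum(1 for xv, yv in sample
                      if (pol if xv[f] >= t else -pol) != yv)
            if best is None or err < best[0]:
                best = (err, f, t, pol)
    return best[1:]  # (feature, threshold, polarity)

def random_forest(data, n_trees=25):
    forest = []
    for _ in range(n_trees):
        boot = [random.choice(data) for _ in data]  # bootstrap sample
        forest.append(train_tree(boot))
    return forest

def predict(forest, xs):
    votes = Counter((pol if xs[f] >= t else -pol) for f, t, pol in forest)
    return votes.most_common(1)[0][0]               # majority vote

forest = random_forest(data)
acc = sum(predict(forest, xs) == yv for xs, yv in data) / len(data)
print(acc)  # usually 1.0 on this easy toy set
```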
- -
-

๐Ÿ’ช Try These:

-
    -
  1. Sample size 100, find df for t-test
  2. -
  3. 5ร—3 table, find df for chi-squared
  4. -
- - - -
-

๐ŸŽฏ Key Takeaways

-
    -
  • df = number of independent values
  • -
  • For t-test: df = n - 1
  • -
  • Higher df โ†’ distribution closer to normal
  • -
  • Critical for finding correct critical values
  • -
-
-
- - -
-
- Topic 34 -

โš ๏ธ Type I & Type II Errors

-

False positives and false negatives

-
-
-

The Two Types of Errors

- +

Comparison: Bagging vs Boosting

+
- - - - - + - - - - - - - - - - + + + + + +
Hโ‚€ TrueHโ‚€ False
AspectBaggingBoosting
Reject Hโ‚€Type I Error (ฮฑ)Correct!
Fail to Reject Hโ‚€Correct!Type II Error (ฮฒ)
TrainingParallel (independent)Sequential (dependent)
FocusReduce varianceReduce bias & variance
WeightsEqual for all samplesHigher for hard samples
SpeedFast (parallelizable)Slower (sequential)
OverfittingResistantCan overfit if too many iterations
ExamplesRandom ForestAdaBoost, Gradient Boosting, XGBoost
-
- -
-

Definitions

-
    -
  • Type I Error (ฮฑ): Rejecting true Hโ‚€ (false positive)
  • -
  • Type II Error (ฮฒ): Failing to reject false Hโ‚€ (false negative)
  • -
  • Power = 1 - ฮฒ: Probability of correctly rejecting false Hโ‚€
  • -
-
- -
-
๐Ÿ“Š MEDICAL ANALOGY
-

Type I Error: Telling healthy person they're sick (false alarm)

-

Type II Error: Telling sick person they're healthy (missed diagnosis)

-
- - -
-

๐Ÿ“ Worked Example - Step by Step

- -
-

Problem:

-

Drug trial tests Hโ‚€: "Drug is safe" vs Hโ‚: "Drug is dangerous". Describe Type I and Type II errors with consequences.

-
- -
-

Solution:

- -
-
Step 1:
-
-

Define Type I Error (False Positive)

-
- Type I: Reject Hโ‚€ when Hโ‚€ is TRUE
- In this case: Conclude drug is dangerous when it's actually safe
- Probability = ฮฑ (significance level)
- Consequence: Safe drug rejected, patients miss beneficial treatment -
-

False alarm - reject truth

-
-
- -
-
Step 2:
-
-

Define Type II Error (False Negative)

-
- Type II: Fail to reject Hโ‚€ when Hโ‚ is TRUE
- In this case: Conclude drug is safe when it's actually dangerous
- Probability = ฮฒ
- Consequence: Dangerous drug approved, patients harmed! -
-

Miss detecting danger

-
-
- -
-
Step 3:
-
-

Create Decision Matrix

-
- Reality vs Decision:
- If Hโ‚€ true (safe) + Reject Hโ‚€ (call dangerous) = TYPE I
- If Hโ‚ true (dangerous) + Fail to reject = TYPE II
- Correct decisions: Accept truth or reject false -
-

Four possible outcomes

-
-
- -
-
Step 4:
-
-

Calculate Example

-
- If ฮฑ = 0.05: 5% chance of Type I error
- If ฮฒ = 0.20: 20% chance of Type II error
- Power = 1 - ฮฒ = 0.80 (80% chance of detecting dangerous drug) -
-

Probabilities of each error

-
-
- -
-
Step 5:
-
-

Compare Consequences

-
- Type I: Waste safe drug (economic cost)
- Type II: Approve dangerous drug (LIFE RISK!)
- Type II often more serious → increase power (larger sample), even if that means tolerating a higher α -
-

Context determines which error is worse

-
-
- -
- โœ“ Final Answer: - Type I (ฮฑ): Reject safe drug
Type II (ฮฒ): Approve dangerous drug
Type II more dangerous in this case!
-
- -
- Check: -

In medical contexts, Type II errors (missing danger) are often considered worse than Type I errors (false alarms).

-
-
- -
-

๐Ÿ’ช Try These:

-
    -
  1. Security scanner: Hโ‚€ = "Safe". Describe Type I/II errors
  2. -
  3. If ฮฑ = 0.01, what's P(Type I error)?
  4. -
- - -
-
- -
-

๐ŸŽฏ Key Takeaways

-
    -
  • Type I: False positive (ฮฑ)
  • -
  • Type II: False negative (ฮฒ)
  • -
  • Trade-off: decreasing one increases the other
  • -
  • Power = 1 - ฮฒ (ability to detect true effect)
  • -
-
-
- - -
-
- Topic 35 -

ฯ‡ยฒ Chi-Squared Distribution

-

Distribution for categorical data analysis

-
- -
-

Introduction

-

What is it? Chi-squared (ฯ‡ยฒ) distribution is used for testing hypotheses about categorical data.

-
- -
-

Properties

-
    -
  • Always positive (ranges from 0 to โˆž)
  • -
  • Right-skewed
  • -
  • Shape depends on degrees of freedom
  • -
  • Higher df โ†’ more symmetric
  • -
-
- -
-

Uses

-
    -
  • Goodness of fit test
  • -
  • Test of independence
  • -
  • Testing variance
  • -
-
- -
-

๐ŸŽฏ Key Takeaways

-
    -
  • Used for categorical data
  • -
  • Always positive, right-skewed
  • -
  • Shape depends on df
  • -
  • Foundation for chi-squared tests
  • -
-
-
- - -
-
- Topic 36 -

โœ“ Goodness of Fit Test

-

Testing if data follows expected distribution

-
- -
-

Introduction

-

What is it? Tests whether observed frequencies match expected frequencies from a theoretical distribution.

-
- -
-

Formula

-
-
Chi-Squared Test Statistic
-
ฯ‡ยฒ = ฮฃ [(O - E)ยฒ / E]
-

O = observed frequency

-

E = expected frequency

-

df = k - 1 (k = number of categories)

-
-
-
-
๐Ÿ“Š EXAMPLE
-

Testing if die is fair:

-

Roll 60 times. Expected: 10 per face

-

Observed: 8, 12, 11, 9, 10, 10

-

Calculate ฯ‡ยฒ and compare to critical value
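The die example worked through in code (the 11.07 critical value is read from a chi-squared table):

```python
observed = [8, 12, 11, 9, 10, 10]  # 60 rolls
expected = [10] * 6                # fair die: 60 / 6 per face

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

print(round(chi_sq, 2))  # 1.0
print(df)                # 5
# chi^2(0.05, df=5) = 11.07 from a table: 1.0 < 11.07,
# so we fail to reject H0 - the die looks fair
```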

-
- -
-

๐ŸŽฏ Key Takeaways

-
    -
  • Tests if observed matches expected distribution
  • -
  • ฯ‡ยฒ = ฮฃ(O-E)ยฒ/E
  • -
  • Large ฯ‡ยฒ = poor fit
  • -
  • df = number of categories - 1
  • -
+

Real-World Success Stories

+
    +
  • Netflix Prize (2009): Winning team used ensemble of 100+ models
  • +
  • Kaggle competitions: most winning solutions on structured data use ensembles
  • +
  • XGBoost: Most popular algorithm for structured data
  • +
  • Random Forests: Default choice for many data scientists
  • +
+ +
+
๐Ÿ’ก When to Use Each Method
+
+ Use Random Forest when:
+ โ€ข You want good accuracy with minimal tuning
+ โ€ข You have high-variance base models
+ โ€ข Interpretability is secondary
+
+ Use Gradient Boosting (XGBoost) when:
+ โ€ข You want maximum accuracy
+ โ€ข You can afford hyperparameter tuning
+ โ€ข You have high-bias base models
+
+ Use Stacking when:
+ โ€ข You want to combine very different model types
+ โ€ข You're in a competition (squeeze every 0.1%!) +
+
+ +

๐ŸŽ‰ Course Complete!

+

+ Congratulations! You've mastered all 17 machine learning topics - from basic linear regression to advanced ensemble methods! You now have the knowledge to: +

+
    +
  • Choose the right algorithm for any problem
  • +
  • Understand the math behind each method
  • +
  • Tune hyperparameters systematically
  • +
  • Evaluate models properly
  • +
  • Build production-ready ML systems
  • +
+

+ Keep practicing, building projects, and exploring! The ML journey never ends. ๐Ÿš€โœจ +

-
- - -
-
- Topic 37 -

๐Ÿ”— Test of Independence

-

Testing relationship between categorical variables

-
- -
-

Introduction

-

What is it? Tests whether two categorical variables are independent or associated.

-
- -
-

Formula

-
-
Chi-Squared for Independence
-
ฯ‡ยฒ = ฮฃ [(O - E)ยฒ / E]
-

E = (row total ร— column total) / grand total

-

df = (rows - 1)(columns - 1)

-
-
- -
-
๐Ÿ“Š EXAMPLE
-

Are gender and color preference independent?

-

Create contingency table, calculate expected frequencies, compute ฯ‡ยฒ, and test against critical value.

-
- -
-

๐ŸŽฏ Key Takeaways

-
    -
  • Tests independence of two categorical variables
  • -
  • Uses contingency tables
  • -
  • df = (r-1)(c-1)
  • -
  • Large ฯ‡ยฒ suggests association
  • -
-
-
- - -
-
- Topic 38 -

๐Ÿ“ Chi-Squared Variance Test

-

Testing claims about population variance

-
- -
-

Introduction

-

What is it? Tests hypotheses about population variance or standard deviation.

-
- -
-

Formula

-
-
Chi-Squared for Variance
-
ฯ‡ยฒ = (n-1)sยฒ / ฯƒโ‚€ยฒ
-

n = sample size

-

sยฒ = sample variance

-

ฯƒโ‚€ยฒ = hypothesized population variance

-

df = n - 1

-
-
- -
-

๐ŸŽฏ Key Takeaways

-
    -
  • Tests claims about variance/standard deviation
  • -
  • ฯ‡ยฒ = (n-1)sยฒ/ฯƒโ‚€ยฒ
  • -
  • Requires normal population
  • -
  • Common in quality control
  • -
-
-
- - -
-
- Topic 39 -

๐Ÿ“Š Confidence Intervals

-

Range of plausible values for parameter

-
- -
-

Introduction

-

What is it? A confidence interval provides a range of values that likely contains the true population parameter.

-

Why it matters: More informative than point estimatesโ€”shows precision and uncertainty.

-
- -
-

Formula

-
-
Confidence Interval for Mean
-
CI = xฬ„ ยฑ (critical value ร— SE)
-

For z: CI = xฬ„ ยฑ z* ร— (ฯƒ/โˆšn)

-

For t: CI = xฬ„ ยฑ t* ร— (s/โˆšn)

-
-
- -
-

Common Confidence Levels

-
    -
  • 90% CI: z* = 1.645
  • -
  • 95% CI: z* = 1.96
  • -
  • 99% CI: z* = 2.576
  • -
-
- -
-
๐Ÿ“Š EXAMPLE
-

Sample: n=100, xฬ„=50, s=10

-

95% CI = 50 ยฑ 1.96(10/โˆš100)

-

95% CI = 50 ยฑ 1.96 = (48.04, 51.96)
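The same interval as a quick sketch:

```python
from math import sqrt

n, x_bar, s = 100, 50, 10
z_star = 1.96  # critical value for 95% confidence

moe = z_star * s / sqrt(n)  # margin of error
ci = (x_bar - moe, x_bar + moe)

print(round(moe, 2))                     # 1.96
print(round(ci[0], 2), round(ci[1], 2))  # 48.04 51.96
```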

-
- -
-

๐ŸŽฏ Key Takeaways

-
    -
  • CI = point estimate ยฑ margin of error
  • -
  • 95% CI most common
  • -
  • Wider CI = more uncertainty
  • -
  • Larger sample = narrower CI
  • -
-
-
- - -
-
- Topic 40 -

ยฑ Margin of Error

-

Measuring estimate precision

-
- -
-

Introduction

-

What is it? Margin of error (MOE) is the ยฑ part of a confidence interval, showing the precision of an estimate.

-
- -
-

Formula

-
-
Margin of Error
-
MOE = (critical value) ร— SE
-

MOE = z* ร— (ฯƒ/โˆšn) or t* ร— (s/โˆšn)

-
-
- -
-

Factors Affecting MOE

-
    -
  • Sample size: Larger n โ†’ smaller MOE
  • -
  • Confidence level: Higher confidence โ†’ larger MOE
  • -
  • Variability: Higher ฯƒ โ†’ larger MOE
  • -
-
- -
-

๐ŸŽฏ Key Takeaways

-
    -
  • MOE = critical value ร— SE
  • -
  • Indicates precision of estimate
  • -
  • Inversely related to sample size
  • -
  • Trade-off between confidence and precision
  • -
-
-
- - -
-
- Topic 41 -

๐Ÿ” Interpreting Confidence Intervals

-

Common misconceptions and proper interpretation

-
- -
-

Correct Interpretation

-

"We are 95% confident that the true population parameter lies within this interval."

-

This means: If we repeated this process many times, 95% of the intervals would contain the true parameter.

-
- -
-
โš ๏ธ COMMON MISCONCEPTIONS
-
    -
  • WRONG: "There's a 95% probability the parameter is in this interval."
  • -
  • WRONG: "95% of the data falls in this interval."
  • -
  • WRONG: "We are 95% sure our sample mean is in this interval."
  • -
-
- -
-

Using CIs for Hypothesis Testing

-
    -
  • If hypothesized value is INSIDE CI โ†’ fail to reject Hโ‚€
  • -
  • If hypothesized value is OUTSIDE CI โ†’ reject Hโ‚€
  • -
  • 95% CI corresponds to ฮฑ = 0.05 test
  • -
-
- -
-
โœ… PRO TIP
-

Report confidence intervals instead of just p-values! CIs provide more information: effect size AND statistical significance.

-
- -
-

๐ŸŽฏ Key Takeaways

-
    -
  • Correct interpretation: confidence in the method, not the specific interval
  • -
  • 95% refers to long-run success rate
  • -
  • Can use CIs for hypothesis testing
  • -
  • More informative than p-values alone
  • -
-
-
+