diff --git "a/math-ds-complete/index.html" "b/math-ds-complete/index.html" --- "a/math-ds-complete/index.html" +++ "b/math-ds-complete/index.html" @@ -16,6 +16,7 @@ 📐 Linear Algebra ∫ Calculus 🤖 Data Science + 🚀 Machine Learning @@ -233,6 +234,89 @@ 85. Loss Functions + + + + Supervised Learning - Regression + + ML-1. Linear Regression + ML-2. Polynomial Regression + ML-3. Ridge Regression (L2) + ML-4. Lasso Regression (L1) + ML-5. Elastic Net + ML-6. Support Vector Regression + + + + + Supervised Learning - Classification + + ML-7. Logistic Regression + ML-8. K-Nearest Neighbors + ML-9. Support Vector Machines + ML-10. Decision Trees + ML-11. Naive Bayes + ML-12. Random Forest + ML-13. Gradient Boosting + ML-14. Neural Networks + + + + + Unsupervised - Clustering + + ML-15. K-Means Clustering + ML-16. Hierarchical Clustering + ML-17. DBSCAN + ML-18. Gaussian Mixture Models + + + + + Unsupervised - Dim. Reduction + + ML-19. PCA + ML-20. t-SNE + ML-21. Autoencoders + + + + + Reinforcement Learning + + ML-22. Q-Learning + ML-23. Deep Q-Networks + ML-24. Policy Gradient + + + + + Model Evaluation & Optimization + + ML-25. Cross-Validation + ML-26. GridSearch & RandomSearch + ML-27. Hyperparameter Tuning + ML-28. Model Evaluation Metrics + ML-29. Regularization + ML-30. Bias-Variance Tradeoff + + + + + Advanced Topics + + ML-31. Ensemble Methods + ML-32. Feature Engineering + ML-33. Imbalanced Data + ML-34. Time Series Analysis + ML-35. Anomaly Detection + ML-36. Transfer Learning + ML-37. Fine-tuning Models + ML-38. Model Interpretability + ML-39. Optimization Algorithms + ML-40. Batch Norm & Dropout + + @@ -7547,6 +7631,2072 @@ + + + + + + ML Algorithm 1 + 📈 Linear Regression + Predicting continuous values with a straight line + + + + 📚 What is Linear Regression? 
+ Linear regression is one of the simplest supervised learning algorithms; it models the relationship between input features and a continuous output variable using a straight line (in 2D) or a hyperplane (in higher dimensions). + Analogy: Like drawing the best-fit line through scattered points on a graph to predict future values based on the trend. + + + + 💡 How It Works + Step-by-step intuition: + + Plot your data points on a graph + Find the line that passes closest to the points overall + Use "least squares" - minimize the sum of squared vertical errors + Calculate the optimal slope and intercept mathematically + Use the line to predict new values + + + + + 🧮 Mathematics Behind It + + Equation + y = β₀ + β₁x + ε + β₀ = intercept, β₁ = slope, ε = error + + + Slope Calculation + β₁ = Σ(xᵢ-x̄)(yᵢ-ȳ) / Σ(xᵢ-x̄)² + + + Intercept Calculation + β₀ = ȳ - β₁x̄ + + + Cost Function (MSE) + J = (1/n)Σ(yᵢ - ŷᵢ)² + + + + + 📝 Worked Example - Predicting House Prices + + + Problem: + A real estate company has data on 5 houses. Predict the price of a 2500 sq ft house. + + Size (sq ft) | Price ($1000s) + 1000 | 150 + 1500 | 200 + 2000 | 250 + 2500 | ?
+ 3000 | 350 + + + + + Solution: + + + Step 1: + + Calculate Means + + x̄ = (1000 + 1500 + 2000 + 3000) / 4 = 1875 sq ft + ȳ = (150 + 200 + 250 + 350) / 4 = 237.5 ($1000s) + + We exclude the house we're predicting from training + + + + + Step 2: + + Calculate Deviations + + (x - x̄): -875, -375, 125, 1125 + (y - ȳ): -87.5, -37.5, 12.5, 112.5 + + Find how much each point differs from the mean + + + + + Step 3: + + Calculate Slope (β₁) + + Numerator: (-875)(-87.5) + (-375)(-37.5) + (125)(12.5) + (1125)(112.5) + = 76562.5 + 14062.5 + 1562.5 + 126562.5 = 218750 + Denominator: (-875)² + (-375)² + (125)² + (1125)² + = 765625 + 140625 + 15625 + 1265625 = 2187500 + β₁ = 218750 / 2187500 = 0.10 + + Slope tells us price change per sq ft + + + + + Step 4: + + Calculate Intercept (β₀) + + β₀ = ȳ - β₁ × x̄ + β₀ = 237.5 - 0.10 × 1875 + β₀ = 237.5 - 187.5 = 50 + + Base price when size = 0 + + + + + Step 5: + + Write Prediction Equation + + Price = 50 + 0.10 × Size + For 2500 sq ft: + Price = 50 + 0.10 × 2500 = 50 + 250 = 300 + + $300,000 predicted price + + + + + Step 6: + + Calculate R² Score + + Predictions: 150, 200, 250, 350 + Residuals: 0, 0, 0, 0 (perfect fit!) + R² = 1 - (SS_res / SS_tot) = 1.0 + + R² = 1.0 means perfect linear fit + + + + + ✓ Final Prediction: + House Price = $300,000 for 2500 sq ft + Equation: Price = $50k + $0.10k × Size + + + + Validation: + The model fits perfectly (R²=1.0). Each additional sq ft adds $100 to the price. The $50k base price represents fixed costs. + + + + + 💪 Practice Problems: + + What would a 3500 sq ft house cost? + If price is $275k, estimate the house size + What does the slope 0.10 mean in real terms?
+ + Show Answers + + Answers: + + $400,000 (50 + 0.10×3500 = 400) + 2250 sq ft (solve: 275 = 50 + 0.10x → x = 2250) + Each sq ft adds $100 to the price + + + + + + + ⚙️ Algorithm Details + + When to use: Linear relationship between features and target + Advantages: Simple, interpretable, fast, works well with limited data + Disadvantages: Only models linear relationships, sensitive to outliers + Hyperparameters: None (closed-form solution) + Applications: Sales forecasting, real estate, economics, trend analysis + + + + + 💻 Implementation (Python) + + from sklearn.linear_model import LinearRegression + import numpy as np + # Training data + X = np.array([[1000], [1500], [2000], [3000]]) + y = np.array([150, 200, 250, 350]) + # Create and train model + model = LinearRegression() + model.fit(X, y) + # Make prediction + prediction = model.predict([[2500]]) + print(f"Predicted price: ${prediction[0]}k") + # Model parameters + print(f"Slope: {model.coef_[0]:.3f}") + print(f"Intercept: {model.intercept_:.2f}") + + + + + 📊 Interactive Visualization + + + Fit Line + Reset + + + + + 🔍 Algorithm Comparison + + + Aspect + Linear Regression + Polynomial Regression + + + Complexity + Simple (straight line) + Complex (curved line) + + + Overfitting Risk + Low + High (with high degree) + + + Interpretability + Very easy + Moderate + + + Training Speed + Very fast + Fast + + + + + + 🎯 Key Takeaways + + Simplest ML algorithm - predicts with straight line + Minimizes squared errors (least squares method) + Closed-form solution: no iterative training needed + Best for linear relationships, interpretable coefficients + + + + + + + + ML Algorithm 8 + 🎯 K-Nearest Neighbors (KNN) + Classification by majority vote of nearest neighbors + + + + 📚 What is KNN? + K-Nearest Neighbors is a simple, non-parametric algorithm that classifies data points based on how their neighbors are classified. It finds the K closest training examples and uses majority vote. 
+ Analogy: "You are the average of the 5 people you spend the most time with." KNN says "You're similar to your closest neighbors in feature space!" + + + + 💡 How It Works + Step-by-step intuition: + + Store all training data (lazy learning) + When predicting, calculate distance to all training points + Find K closest neighbors + Take majority vote of their classes + Assign the most common class to new point + + + + + 🧮 Mathematics Behind It + + Euclidean Distance + d(p,q) = √[Σ(pᵢ - qᵢ)²] + Most common distance metric for KNN + + + Manhattan Distance + d(p,q) = Σ|pᵢ - qᵢ| + Alternative: sum of absolute differences + + + Classification Rule + ŷ = mode(y₁, y₂, ..., y_k) + Most frequent class among K neighbors + + + + + 📝 Worked Example - Classifying Iris Flowers + + + Problem: + Classify a new iris flower with sepal length=5.0cm, sepal width=3.5cm. Use K=3. + + Sepal Length | Sepal Width | Species + 5.1 | 3.5 | Setosa + 4.9 | 3.0 | Setosa + 7.0 | 3.2 | Versicolor + 6.4 | 3.2 | Versicolor + 5.0 | 3.6 | Setosa + + + + + Solution: + + + Step 1: + + Define New Point + + New flower: x_new = [5.0, 3.5] + K = 3 (we'll find 3 nearest neighbors) + + The flower we want to classify + + + + + Step 2: + + Calculate Distances to All Points + + d₁ = √[(5.0-5.1)² + (3.5-3.5)²] = √[0.01 + 0] = 0.10 + d₂ = √[(5.0-4.9)² + (3.5-3.0)²] = √[0.01 + 0.25] = 0.51 + d₃ = √[(5.0-7.0)² + (3.5-3.2)²] = √[4.0 + 0.09] = 2.02 + d₄ = √[(5.0-6.4)² + (3.5-3.2)²] = √[1.96 + 0.09] = 1.43 + d₅ = √[(5.0-5.0)² + (3.5-3.6)²] = √[0 + 0.01] = 0.10 + + Euclidean distance to each training point + + + + + Step 3: + + Sort by Distance + + + Rank | Distance | Species + 1 | 0.10 | Setosa + 2 | 0.10 | Setosa + 3 | 0.51 | Setosa + 4 | 1.43 | Versicolor + 5 | 2.02 | Versicolor + + + Select top 3 for K=3 + + + + + Step 4: + + Take Majority Vote + + 3 nearest neighbors: + Neighbor 1: Setosa (distance 0.10) + Neighbor 2: Setosa (distance 0.10) + Neighbor 3: Setosa (distance 0.51) + Vote count: Setosa = 3, Versicolor = 0 + Winner: Setosa (unanimous!)
+ + Majority class wins + + + + + Step 5: + + Make Prediction + + Predicted Class: Setosa + Confidence: 3/3 = 100% + + All neighbors agree + + + + + ✓ Final Classification: + Predicted Species = Setosa (100% confidence) + + + + Validation: + The new flower is extremely close to known Setosa examples (distances 0.10, 0.10, 0.51). The unanimous vote gives us high confidence in this classification. + + + + + 💪 Practice Problems: + + What if we used K=5 instead? Would classification change? + If distances were 0.5(Setosa), 0.6(Setosa), 0.7(Versicolor), predict class + Why is K usually chosen as odd number? + + Show Answers + + Answers: + + Would include 2 Versicolor but still 3 Setosa → Setosa wins + Setosa (2 votes vs 1) + To avoid ties in binary classification + + + + + + + ⚙️ Algorithm Details + + When to use: Non-linear decision boundaries, small-medium datasets + Advantages: Simple, no training phase, works for any decision boundary + Disadvantages: Slow prediction, memory-intensive, sensitive to irrelevant features + Hyperparameters: K (number of neighbors), distance metric, weights + Applications: Recommendation systems, pattern recognition, anomaly detection + + + + + 💻 Implementation (Python) + + from sklearn.neighbors import KNeighborsClassifier + import numpy as np + # Training data + X = np.array([[5.1,3.5], [4.9,3.0], [7.0,3.2], [6.4,3.2], [5.0,3.6]]) + y = np.array(['Setosa', 'Setosa', 'Versicolor', 'Versicolor', 'Setosa']) + # Create and train model + model = KNeighborsClassifier(n_neighbors=3) + model.fit(X, y) + # Make prediction + new_flower = np.array([[5.0, 3.5]]) + prediction = model.predict(new_flower) + proba = model.predict_proba(new_flower) + print(f"Predicted: {prediction[0]}") + print(f"Confidence: {proba[0].max():.2%}") + + + + + 🔍 Algorithm Comparison + + + Aspect + KNN + Decision Trees + + + Training Time + None (lazy learning) + Moderate + + + Prediction Time + Slow (compute all distances) + Fast (traverse tree) + + + Interpretability + 
Low + High (visual rules) + + + Feature Scaling + Required + Not required + + + + + + 🎯 Key Takeaways + + Lazy learning: no training phase, stores all data + Classification by K-nearest neighbor majority vote + Sensitive to feature scaling - always normalize! + Choose K: small K = noisy, large K = smooth boundaries + + + + + + + + ML Algorithm 10 + 🌳 Decision Trees + Tree-based decisions using feature splits + + + + 📚 What is a Decision Tree? + Decision Trees make predictions by asking a series of yes/no questions about features, creating a flowchart-like structure from root to leaves. + Analogy: Like a game of 20 Questions - each question (split) narrows down possibilities until you reach a final decision (leaf). + + + + 💡 How It Works + Step-by-step intuition: + + Start with all training data at root + Find best feature to split on (max information gain) + Split data into branches based on that feature + Recursively repeat for each branch + Stop when pure (all same class) or max depth reached + Leaves contain final predictions + + + + + 🧮 Mathematics Behind It + + Entropy (Impurity) + H(S) = -Σ pᵢ log₂(pᵢ) + pᵢ = proportion of class i. Measures disorder. + + + Information Gain + IG = H(parent) - Σ(|child|/|parent|) × H(child) + Choose split with highest information gain + + + Gini Impurity (Alternative) + Gini = 1 - Σ pᵢ² + Used by CART algorithm. Faster to compute. + + + + + 📝 Worked Example - Loan Approval Prediction + + + Problem: + Build decision tree for loan approval. Dataset: + + Income | Credit Score | Age | Approved? + High | Good | 35 | Yes + High | Good | 40 | Yes + Low | Poor | 25 | No + Low | Poor | 30 | Yes + High | Poor | 45 | No + Low | Poor | 28 | No + + + + + Solution: + + + Step 1: + + Calculate Root Entropy + + Total: 6 samples + Approved (Yes): 3/6 = 0.5 + Denied (No): 3/6 = 0.5 + H(root) = -[0.5 log₂(0.5) + 0.5 log₂(0.5)] + H(root) = -[0.5(-1) + 0.5(-1)] = 1.0 + + Maximum entropy = maximum disorder + + + + + Step 2: + + Test Split on Credit Score + + If Credit = Good: 2 Yes, 0 No → H = 0 (pure!)
+ If Credit = Poor: 1 Yes, 3 No → H = -[0.25log₂(0.25) + 0.75log₂(0.75)] + H(Poor) = -[0.25(-2) + 0.75(-0.415)] = 0.5 + 0.311 = 0.811 + Weighted avg: (2/6)×0 + (4/6)×0.811 = 0.541 + IG(Credit) = 1.0 - 0.541 = 0.459 + + Information gain from splitting on Credit Score + + + + + Step 3: + + Test Split on Income + + If Income = High: 2 Yes, 1 No → H = 0.918 + If Income = Low: 1 Yes, 2 No → H = 0.918 + Weighted: (3/6)×0.918 + (3/6)×0.918 = 0.918 + IG(Income) = 1.0 - 0.918 = 0.082 + + Income provides less information gain + + + + + Step 4: + + Choose Best Split + + IG(Credit Score) = 0.459 ← HIGHEST! + IG(Income) = 0.082 + Best first split: Credit Score + + Choose feature with highest information gain + + + + + Step 5: + + Build Tree Recursively + + Root: Credit Score = Good? + ├─ YES → Approved (pure node) + └─ NO → Split on Income + ├─ Income = High? → Denied + └─ Income = Low? → Denied (majority) + + Continue splitting until pure or stopping criterion + + + + + Step 6: + + Make Predictions + + New applicant: Credit=Good, Income=High + Follow path: Credit=Good → Approved ✓ + Decision rule: IF Credit Score is Good THEN Approve + + Traverse tree from root to leaf + + + + + ✓ Final Tree & Prediction: + Best split: Credit Score → If Good: Approved, If Poor: check Income + + + + Validation: + The tree classifies 5 of 6 training examples correctly; the one approved Low-income, Poor-credit applicant falls in a majority-Denied leaf and would need a further split (e.g. on Age). Credit Score is the most important feature with IG=0.459. + + + + + 💪 Practice Problems: + + Calculate entropy for dataset with 4 Yes, 1 No + Of two equal-sized splits, one gives children with H=0 and the other with H=0.5 - which is better? + Why might deep trees overfit? + + Show Answers + + Answers: + + H = -[0.8log₂(0.8) + 0.2log₂(0.2)] ≈ 0.722 + The H=0 split (pure children mean higher information gain)
+ Learn noise instead of signal, memorize training data + + + + + + + ⚙️ Algorithm Details + + When to use: Need interpretable model, non-linear relationships, mixed feature types + Advantages: Easy to understand, visualize, handles non-linear data, no scaling needed + Disadvantages: Prone to overfitting, unstable (small data changes = different tree) + Hyperparameters: max_depth, min_samples_split, criterion (gini/entropy) + Applications: Credit scoring, medical diagnosis, customer segmentation + + + + + 💻 Implementation (Python) + + from sklearn.tree import DecisionTreeClassifier + from sklearn import tree + import matplotlib.pyplot as plt + # Create and train + model = DecisionTreeClassifier(max_depth=3, criterion='entropy') + model.fit(X_train, y_train) + # Predict + predictions = model.predict(X_test) + # Visualize tree + tree.plot_tree(model, filled=True, feature_names=['Income','Credit','Age']) + + + + + 🎯 Key Takeaways + + Builds tree by recursively splitting on best features + Uses entropy or Gini to measure split quality + Highly interpretable - can visualize decision rules + Prone to overfitting - use pruning or ensemble methods + + + + + + + + ML Algorithm 15 + 🎯 K-Means Clustering + Partitioning data into K distinct clusters + + + + 📚 What is K-Means? + K-Means is an unsupervised learning algorithm that groups similar data points into K clusters by minimizing within-cluster variance. + Analogy: Organizing a messy room by grouping similar items together. K-Means finds natural groupings in unlabeled data. 
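Before the step-by-step walkthrough, here is the whole procedure in code - a minimal NumPy sketch of K-Means on the customer data from the worked example below (illustrative only; the scikit-learn one-liner appears in the Implementation block):

```python
import numpy as np

# Customer data from the worked example: [Age, Income in $k]
X = np.array([[25, 40], [30, 50], [28, 45],
              [55, 80], [60, 90], [52, 75]], dtype=float)

# Initialize two centroids at customers A and E, as in the example
centroids = X[[0, 4]].copy()

for _ in range(10):
    # Assignment step: each point goes to its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each centroid moves to the mean of its assigned points
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):  # converged: centroids stopped moving
        break
    centroids = new_centroids

print(labels)     # [0 0 0 1 1 1] - customers A-C vs D-F
print(centroids)  # final centers ≈ [27.67, 45] and [55.67, 81.67]
```

This reproduces the hand computation in the worked example: two clusters {A, B, C} and {D, E, F}, converging after a single centroid update.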
+ + + + 💡 How It Works + Step-by-step intuition: + + Choose K (number of clusters) + Randomly initialize K cluster centers (centroids) + Assignment: Assign each point to nearest centroid + Update: Recalculate centroids as mean of assigned points + Repeat steps 3-4 until convergence (centroids don't move) + + + + + 🧮 Mathematics Behind It + + Objective Function (Minimize) + J = ΣΣ ||xᵢ - μₖ||² + Sum of squared distances from points to centroids + + + Centroid Update + μₖ = (1/|Cₖ|) Σ xᵢ + Mean of all points assigned to cluster k + + + Assignment Rule + Cₖ = {xᵢ : ||xᵢ - μₖ|| ≤ ||xᵢ - μⱼ|| for all j} + Assign to nearest centroid + + + + + 📝 Worked Example - Customer Segmentation + + + Problem: + Cluster 6 customers into K=2 groups based on [Age, Income]. Data: + + Customer | Age | Income ($k) + A | 25 | 40 + B | 30 | 50 + C | 28 | 45 + D | 55 | 80 + E | 60 | 90 + F | 52 | 75 + + + + + Solution: + + + Step 1: + + Initialize K=2 Random Centroids + + C₁ (initial) = [25, 40] (customer A) + C₂ (initial) = [60, 90] (customer E) + + Start with random points or use K-means++ + + + + + Step 2: + + Assign Points to Nearest Centroid + + Distance from A to C₁: √[(25-25)² + (40-40)²] = 0 + Distance from A to C₂: √[(25-60)² + (40-90)²] = √[1225+2500] = 61.0 + A → Cluster 1 (closer to C₁) + Similarly calculate for all: + B [30,50] → C₁ (dist=11.2 vs 50.0) + C [28,45] → C₁ (dist=5.8 vs 55.2) + D [55,80] → C₂ (dist=50.0 vs 11.2) + E [60,90] → C₂ (dist=0) + F [52,75] → C₂ (dist=44.2 vs 17.0) + Cluster 1: {A, B, C} + Cluster 2: {D, E, F} + + Each point goes to its nearest centroid + + + + + Step 3: + + Recalculate Centroids + + New C₁ = mean of {A, B, C} + Age: (25 + 30 + 28)/3 = 27.67 + Income: (40 + 50 + 45)/3 = 45 + C₁ = [27.67, 45] + New C₂ = mean of {D, E, F} + Age: (55 + 60 + 52)/3 = 55.67 + Income: (80 + 90 + 75)/3 = 81.67 + C₂ = [55.67, 81.67] + + Centroids move to center of their clusters + + + + + Step 4: + + Check Convergence + + Re-assign with new centroids: + All points stay in same clusters!
+ Centroids don't change → CONVERGED ✓ + + Algorithm stops when assignments don't change + + + + + Step 5: + + Calculate Within-Cluster Sum of Squares + + WCSS₁ = Σ dist² to final C₁ [27.67, 45] = 32.1 + 30.4 + 0.1 = 62.7 + WCSS₂ = Σ dist² to final C₂ [55.67, 81.67] = 3.2 + 88.2 + 57.9 = 149.3 + Total WCSS = 212.0 + + Measures cluster compactness (lower = better); computed from the final centroids, not the initial guesses + + + + + ✓ Final Clusters: + Cluster 1 (Young): A, B, C (avg age 28, income $45k) + Cluster 2 (Mature): D, E, F (avg age 56, income $82k) + + + + Validation: + Algorithm converged after one centroid update. Clear separation: younger customers with lower income vs older customers with higher income. + + + + + 💪 Practice Problems: + + New customer: Age=32, Income=$55k. Which cluster? + How would you choose optimal K value? + What happens if we use K=3 instead? + + Show Answers + + Answers: + + Cluster 1 (closer to [27.67, 45]) + Elbow method: plot WCSS vs K, find "elbow" + Would create 3 segments, may overfit with only 6 points + + + + + + + ⚙️ Algorithm Details + + When to use: Unlabeled data, need to find natural groupings, spherical clusters + Advantages: Simple, fast, scales well, works with large datasets + Disadvantages: Must choose K, sensitive to initialization, assumes spherical clusters + Hyperparameters: K (number of clusters), max_iter, initialization method + Applications: Customer segmentation, image compression, document clustering + + + + + 💻 Implementation (Python) + + from sklearn.cluster import KMeans + import numpy as np + # Customer data: [Age, Income in $k] + X = np.array([[25, 40], [30, 50], [28, 45], [55, 80], [60, 90], [52, 75]]) + # Create model + kmeans = KMeans(n_clusters=2, random_state=42) + kmeans.fit(X) + # Get predictions + labels = kmeans.labels_ + centroids = kmeans.cluster_centers_ + # Predict for new point + new_customer = np.array([[32, 55]]) + cluster = kmeans.predict(new_customer) + print(f"Assigned to cluster: {cluster[0]}") + + + + + 📊 Interactive Visualization + + + Run K-Means + Reset + + + + + 🎯 Key Takeaways + + Unsupervised algorithm: no labels needed + Iterative: assign → update → repeat until convergence + Choose K
using elbow method or silhouette score + Sensitive to initialization - use K-means++ or multiple runs + + + + + + + + ML Algorithm 25 + 🔄 Cross-Validation (K-Fold) + Reliable model evaluation technique + + + + 📚 What is Cross-Validation? + Cross-validation is a resampling technique that evaluates model performance by training and testing on different subsets of data multiple times. + Analogy: Testing a student on multiple different exams instead of just one - gives more reliable assessment of their true knowledge. + + + + 💡 How It Works (K-Fold) + Step-by-step intuition: + + Split data into K equal-sized folds + For each fold (1 to K): + • Use that fold as test set + • Use remaining K-1 folds as training set + • Train model and evaluate performance + Average performance across all K folds + This gives more reliable estimate than single train/test split + + + + + 🧮 Mathematics Behind It + + K-Fold CV Score + CV_score = (1/K) Σ Performance_k + Average performance across K folds + + + Standard Error + SE = σ / √K + σ = standard deviation of K scores + + + + + 📝 Worked Example - 5-Fold Cross-Validation + + + Problem: + Evaluate a model using 5-fold CV. Dataset has 100 samples. After running, fold accuracies are: 0.85, 0.90, 0.88, 0.87, 0.90. Calculate mean accuracy and standard error. 
+ + + + Solution: + + + Step 1: + + Understand the Setup + + Total samples: n = 100 + Number of folds: K = 5 + Each fold size: 100/5 = 20 samples + Each iteration: Train on 80, Test on 20 + + Divide data into 5 equal parts + + + + + Step 2: + + Record Fold Results + + + Fold | Accuracy + 1 | 0.85 + 2 | 0.90 + 3 | 0.88 + 4 | 0.87 + 5 | 0.90 + + + Performance on each test fold + + + + + Step 3: + + Calculate Mean Accuracy + + Mean = (0.85 + 0.90 + 0.88 + 0.87 + 0.90) / 5 + Mean = 4.40 / 5 = 0.88 + Average accuracy: 88% + + This is our best estimate of model performance + + + + + Step 4: + + Calculate Standard Deviation + + Deviations: (0.85-0.88), (0.90-0.88), (0.88-0.88), (0.87-0.88), (0.90-0.88) + = -0.03, 0.02, 0, -0.01, 0.02 + Squared: 0.0009, 0.0004, 0, 0.0001, 0.0004 + Variance = 0.0018 / 4 = 0.00045 + SD = √0.00045 = 0.021 + + Measures variability across folds + + + + + Step 5: + + Calculate Standard Error + + SE = SD / √K = 0.021 / √5 + SE = 0.021 / 2.236 = 0.0094 + SE ≈ 0.0094 or 0.94% + + Precision of our mean estimate + + + + + Step 6: + + Report Results with Confidence + + Mean accuracy: 0.88 ± 0.009 + 95% CI (approx): 0.88 ± 2×0.009 = [0.862, 0.898] + Model performs between 86.2% and 89.8% with 95% confidence + + Final performance estimate with uncertainty + + + + + ✓ Final Result: + 5-Fold CV Accuracy = 88.0% ± 0.9% + 95% CI: [86.2%, 89.8%] + + + + Validation: + Low variability (SD=0.021) indicates stable model performance. Every test fold performed similarly, suggesting the model generalizes well. + + + + + 💪 Practice Problems: + + For n=60, K=10, how many samples per fold? + 3-fold CV gives: 0.80, 0.85, 0.90. Find mean. + When should you use stratified K-fold? + + Show Answers + + Answers: + + 6 samples per fold + Mean = 0.85 (85%) + When classes are imbalanced - maintains class proportions + + + + + + + ⚙️ Algorithm Details + + When to use: Always!
Best practice for model evaluation + Advantages: Uses all data, reduces variance, detects overfitting + Disadvantages: K times slower, not for time-series (use time-series CV) + Hyperparameters: K (typically 5 or 10), stratified (yes/no) + Applications: Model selection, hyperparameter tuning, performance estimation + + + + + 💻 Implementation (Python) + + from sklearn.model_selection import cross_val_score + from sklearn.tree import DecisionTreeClassifier + model = DecisionTreeClassifier() + # 5-fold cross-validation + scores = cross_val_score(model, X, y, cv=5, scoring='accuracy') + print(f"Fold scores: {scores}") + print(f"Mean: {scores.mean():.3f}") + print(f"Std: {scores.std():.3f}") + print(f"95% CI: [{scores.mean()-2*scores.std():.3f}, {scores.mean()+2*scores.std():.3f}]") + + + + + 🎯 Key Takeaways + + K-Fold: split into K folds, test on each fold once + More reliable than single train/test split + K=5 or K=10 most common choices + Essential for comparing models and avoiding overfitting + + + + + + + + ML Algorithm 2 + 📈 Polynomial Regression + Fitting non-linear relationships with polynomial curves + + + 📚 What is Polynomial Regression? + Polynomial regression extends linear regression by adding polynomial terms (x², x³, etc.) to capture non-linear, curved relationships in data. + Analogy: When a straight line won't fit your data (like trajectory of a thrown ball), use a curved line instead! + + + + 📝 Worked Example - Temperature vs Ice Cream Sales + + + Problem: + Temperature (°C): [10, 15, 20, 25, 30]. Sales ($100s): [2, 5, 12, 22, 35]. Fit quadratic model and predict sales at 27°C. 
+ + + + Solution: + + + Step 1: + + Set Up Polynomial Model + + y = β₀ + β₁x + β₂x² + Where x = temperature, y = sales + Need to find β₀, β₁, β₂ + + + + + + Step 2: + + Create Design Matrix + + x | x² | y + 10 | 100 | 2 + 15 | 225 | 5 + 20 | 400 | 12 + 25 | 625 | 22 + 30 | 900 | 35 + + + + + + Step 3: + + Solve Using Normal Equations (simplified) + + Using least squares: β = (XᵀX)⁻¹Xᵀy + Result: β₀ = 5.0, β₁ = -0.969, β₂ = 0.0657 + + + + + + Step 4: + + Write Equation + + y = 5.0 - 0.969x + 0.0657x² + + + + + + Step 5: + + Predict at x = 27°C + + y = 5.0 - 0.969(27) + 0.0657(27)² + y = 5.0 - 26.16 + 0.0657(729) + y = 5.0 - 26.16 + 47.91 = 26.75 + Sales are in $100s, so 26.75 ≈ $2,675 + + + + + + ✓ Final Prediction: + Sales at 27°C ≈ $2,675 + + + + + 💪 Practice Problems: + + Predict sales at 22°C using the equation + Why use polynomial instead of linear here? + What degree polynomial would you recommend? + + + + + + 💻 Python Implementation + + from sklearn.preprocessing import PolynomialFeatures + from sklearn.linear_model import LinearRegression + import numpy as np + X = np.array([10, 15, 20, 25, 30]).reshape(-1, 1) + y = np.array([2, 5, 12, 22, 35]) + # Create polynomial features (degree 2) + poly = PolynomialFeatures(degree=2) + X_poly = poly.fit_transform(X) + # Fit model + model = LinearRegression() + model.fit(X_poly, y) + # Predict (y is in $100s) + X_new = poly.transform([[27]]) + print(f"Sales at 27°C: ${model.predict(X_new)[0]*100:,.0f}") + + + + + 📊 Interactive Visualization + + + + + 🎯 Key Takeaways + + Captures curved relationships between variables + Formula: y = β₀ + β₁x + β₂x² + β₃x³ + ... + Higher degree = more flexibility but risk of overfitting + Use cross-validation to select optimal degree + + + + + + + + ML Algorithm 3 + 🎯 Ridge Regression (L2 Regularization) + Preventing overfitting with L2 penalty + + + 📚 What is Ridge Regression?
+ Ridge regression adds an L2 penalty term to the loss function, shrinking coefficient magnitudes to prevent overfitting. + Formula: J = MSE + α Σβᵢ² + + + + 📝 Worked Example + + Problem: + Compare linear vs ridge regression. Data prone to overfitting. α = 0.1 + + + + Step 1: + + Linear Regression Cost + J = (1/n)Σ(y - ŷ)² + + + + Step 2: + + Ridge Cost Function + J_ridge = (1/n)Σ(y - ŷ)² + α Σβᵢ² + Penalty term shrinks large coefficients + + + + ✓ Result: + Ridge reduces overfitting by penalizing large coefficients + + + + + + 💻 Python Implementation + + from sklearn.linear_model import Ridge + model = Ridge(alpha=0.1) + model.fit(X_train, y_train) + predictions = model.predict(X_test) + + + + + 🎯 Key Takeaways + + L2 penalty shrinks coefficients + Reduces overfitting, handles multicollinearity + Hyperparameter α controls regularization strength + Never shrinks coefficients to exactly zero + + + + + + + + ML Algorithm 4 + 🎯 Lasso Regression (L1 Regularization) + Feature selection through L1 penalty + + + 📚 What is Lasso? + Lasso adds L1 penalty: J = MSE + α Σ|βᵢ|. Can shrink coefficients to exactly zero, performing automatic feature selection. + + + + 📝 Worked Example - Feature Selection + + Problem: + 5 features, but only 2 are relevant. Use Lasso with α = 0.5 + + + + Step 1: + + Linear Regression (No Penalty) + All coefficients non-zero: [3.2, 0.5, 5.1, 0.3, 0.1] + + + + Step 2: + + Apply Lasso Penalty + J = MSE + 0.5 Σ|βᵢ| + Small coefficients penalized heavily + + + + Step 3: + + Lasso Result + Coefficients: [3.1, 0, 5.0, 0, 0] + Features 2, 4, 5 eliminated!
+ + + + ✓ Result: + Lasso selected 2 important features, set others to zero + + + + + + 💻 Python Implementation + + from sklearn.linear_model import Lasso + import numpy as np + model = Lasso(alpha=0.5) + model.fit(X_train, y_train) + print(f"Non-zero features: {np.sum(model.coef_ != 0)}") + + + + + 🎯 Key Takeaways + + L1 penalty creates sparse models (many zeros) + Automatic feature selection + Use when you suspect only few features matter + Produces interpretable models with fewer features + + + + + + + + ML Algorithm 5 + ⚖️ Elastic Net + Combining L1 and L2 penalties + + + 📚 What is Elastic Net? + Combines L1 and L2: J = MSE + α₁Σ|βᵢ| + α₂Σβᵢ². Best of both Ridge and Lasso. + + + 💻 Python Implementation + from sklearn.linear_model import ElasticNet + model = ElasticNet(alpha=0.1, l1_ratio=0.5) + model.fit(X, y) + + + 🎯 Key Takeaways + + Combination of Ridge (L2) and Lasso (L1) + Two hyperparameters: alpha and l1_ratio + Often performs better than either alone + Good for correlated features + + + + + + + ML Algorithm 6 + 📉 Support Vector Regression (SVR) + Robust regression with margin tolerance + + + 📚 What is SVR? + SVR finds hyperplane that fits data within margin ε. Points outside margin contribute to loss. Robust to outliers. + + + 💻 Python Implementation + from sklearn.svm import SVR + model = SVR(kernel='rbf', C=1.0, epsilon=0.1) + model.fit(X, y) + + + 🎯 Key Takeaways + + Regression version of SVM + Defines margin of tolerance (ε) + Robust to outliers, works well with high dimensions + Uses kernel trick for non-linear relationships + + + + + + + ML Algorithm 7 + 🎯 Logistic Regression + Binary classification with sigmoid function + + + 📚 What is Logistic Regression? + Binary classification using sigmoid: P(y=1) = 1/(1+e^(-z)) where z = β₀+β₁x. Despite name, it's for classification!
+ + + 💻 Python Implementation + from sklearn.linear_model import LogisticRegression + model = LogisticRegression() + model.fit(X, y) + proba = model.predict_proba(X_new) + + + 🎯 Key Takeaways + + Classification algorithm despite "regression" name + Outputs probability via sigmoid function + Threshold at 0.5 for binary decisions + See Data Science Topic 72 for complete details + + + + + + + ML Algorithm 9 + 🎯 Support Vector Machines (SVM) + Maximum margin classification + + + 📚 What is SVM? + SVM finds hyperplane that maximally separates classes. Uses support vectors (closest points) and kernel trick for non-linear boundaries. + + + 💻 Python Implementation + from sklearn.svm import SVC + model = SVC(kernel='rbf', C=1.0, gamma='auto') + model.fit(X, y) + + + 📊 Interactive Visualization + + + + 🎯 Key Takeaways + + Maximizes margin between classes + Uses kernel trick for non-linear boundaries (RBF, polynomial) + Effective in high dimensions, memory efficient + Support vectors are the critical training examples + + + + + + + ML Algorithm 11 + 📊 Naive Bayes + Probabilistic classifier using Bayes' Theorem + + + 📚 What is Naive Bayes? + Applies Bayes' Theorem with "naive" independence assumption. P(y|x) ∝ P(y)ΠP(xᵢ|y). Extremely fast for text classification. + + + 💻 Python Implementation + from sklearn.naive_bayes import GaussianNB + model = GaussianNB() + model.fit(X, y) + predictions = model.predict(X_test) + + + 🎯 Key Takeaways + + Based on Bayes' Theorem with independence assumption + Variants: Gaussian (continuous), Multinomial (counts), Bernoulli (binary) + Extremely fast, works well with high dimensions + Popular for spam filtering and text classification + + + + + + + ML Algorithm 12 + 🌲 Random Forest + Ensemble of decision trees + + + 📚 What is Random Forest? + Ensemble of decision trees. Each tree trained on random subset (bootstrap) with random features. Final prediction by majority vote.
+ + + 💻 Python Implementation + from sklearn.ensemble import RandomForestClassifier + model = RandomForestClassifier(n_estimators=100, max_depth=10) + model.fit(X, y) + feature_importance = model.feature_importances_ + + + 📊 Interactive Visualization + + + + 🎯 Key Takeaways + + Ensemble of many decision trees (typically 100+) + Reduces overfitting via averaging + Can estimate feature importance + Generally outperforms single decision tree + + + + + + + ML Algorithm 13 + 🚀 Gradient Boosting (XGBoost) + Sequential ensemble method + + + 📚 What is Gradient Boosting? + Sequentially builds trees, each correcting errors of previous. Predictions: F(x) = f₁(x) + f₂(x) + ... + f_n(x). + + + 💻 Python Implementation + from xgboost import XGBClassifier + model = XGBClassifier(n_estimators=100, learning_rate=0.1) + model.fit(X, y) + + + 📊 Interactive Visualization + + + + 🎯 Key Takeaways + + Sequential ensemble: each tree corrects previous errors + State-of-the-art for tabular data + XGBoost, LightGBM, CatBoost = optimized implementations + Has won many Kaggle competitions + + + + + + + ML Algorithm 14 + 🧠 Neural Networks (Deep Learning Basics) + Universal function approximators + + + 📚 What are Neural Networks? + Layers of connected neurons. Each neuron: z = Σwᵢxᵢ + b, then activation function σ(z). Trained via backpropagation + gradient descent. + + + 💻 Python Implementation + from sklearn.neural_network import MLPClassifier + model = MLPClassifier(hidden_layer_sizes=(100, 50), activation='relu') + model.fit(X, y) + + + 📊 Interactive Visualization + + + + 🎯 Key Takeaways + + Universal function approximators + Layers: input → hidden layers → output + Trained via backpropagation and gradient descent + Requires large data, GPU acceleration for deep networks + + + + + + + ML Algorithm 16 + 🌳 Hierarchical Clustering + + + Builds hierarchy of clusters (dendrogram). Agglomerative: merge closest clusters. Divisive: split clusters. No need to specify K upfront.
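The other algorithm pages include a Python snippet, so here is a minimal scikit-learn sketch of agglomerative clustering (the toy data, cluster count, and Ward linkage are illustrative choices):

```python
from sklearn.cluster import AgglomerativeClustering
import numpy as np

# Toy 2-D data: two well-separated groups of three points each
X = np.array([[1.0, 1.0], [1.5, 1.0], [1.0, 1.5],
              [8.0, 8.0], [8.5, 8.0], [8.0, 8.5]])

# Bottom-up (agglomerative) clustering: repeatedly merge the two
# closest clusters until only n_clusters remain
model = AgglomerativeClustering(n_clusters=2, linkage='ward')
labels = model.fit_predict(X)
print(labels)  # the first three points share one label, the last three the other
```

To see the full dendrogram rather than a flat cut at K=2, `scipy.cluster.hierarchy.linkage` plus `dendrogram` is the usual route.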
+ + + 🎯 Key Takeaways + + Creates tree of clusters (dendrogram) + No need to pre-specify K + Linkage methods: single, complete, average, Ward + + + + + + + ML Algorithm 17 + 📍 DBSCAN + + + Density-Based Spatial Clustering. Groups points with many neighbors (dense regions). Can find arbitrarily-shaped clusters and outliers. + + + 🎯 Key Takeaways + + Density-based: finds clusters of arbitrary shape + Parameters: ε (radius), min_samples + Automatically identifies outliers (noise points) + + + + + + + ML Algorithm 18 + 📊 Gaussian Mixture Models (GMM) + + + Soft clustering: each point has probability of belonging to each cluster. Mixture of K Gaussian distributions. Uses EM algorithm. + + + 🎯 Key Takeaways + + Probabilistic clustering with soft assignments + EM algorithm: E-step (responsibilities) → M-step (parameters) + Can model elliptical clusters (not just spherical) + + + + + + + ML Algorithm 19 + 🎯 Principal Component Analysis (PCA) + + + See Data Science Topic 77 for complete details. Reduces dimensions by finding directions of maximum variance. + + + 🎯 Key Takeaways + + Finds orthogonal directions of maximum variance + Standardize features first! + Keeps 80-95% variance with fewer dimensions + + + + + + + ML Algorithm 20 + 🎨 t-SNE + + + t-Distributed Stochastic Neighbor Embedding. Non-linear dimensionality reduction for visualization. Preserves local structure better than PCA. + + + 🎯 Key Takeaways + + Non-linear reduction for visualization (2D/3D) + Preserves local neighborhoods + Slow, not for new data projection (use PCA for that) + + + + + + + ML Algorithm 21 + 🔄 Autoencoders + + + Neural network that learns compressed representation. Encoder: reduces dimensions. Decoder: reconstructs input. Latent space = compressed features. 
+ + + 🎯 Key Takeaways + + Neural network for unsupervised learning + Learns non-linear compression + Used for anomaly detection, denoising, generation + + + + + + + ML Algorithm 22 + 🎮 Q-Learning + + + Reinforcement learning: agent learns optimal actions through trial and error. Q-table stores expected reward for each state-action pair. + + + 🎯 Key Takeaways + + Learns optimal policy through rewards + Q(s,a) = expected future reward + Update rule: Q_new = Q + α[reward + γ max Q_next - Q] + + + + + + + ML Algorithm 23 + 🧠 Deep Q-Networks (DQN) + + + Combines Q-Learning with deep neural networks. Neural net approximates Q-function. Used by DeepMind for Atari games. + + + 🎯 Key Takeaways + + Neural network approximates Q-values + Experience replay for stable training + Achieved superhuman performance in games + + + + + + + ML Algorithm 24 + 🎯 Policy Gradient Methods + + + Directly optimizes policy π(a|s). Gradient ascent on expected reward. REINFORCE algorithm: update based on return. + + + 🎯 Key Takeaways + + Optimizes policy directly (not value function) + Works with continuous action spaces + REINFORCE, Actor-Critic, PPO variants + + + + + + + ML Algorithm 26 + 🔍 GridSearch & RandomSearch + + + GridSearch: Try all combinations of hyperparameters. RandomSearch: Sample random combinations. Both use cross-validation. + + + 🎯 Key Takeaways + + GridSearch: exhaustive, guarantees best in grid + RandomSearch: faster, often finds good solutions + Always use cross-validation for tuning + + + + + + + ML Algorithm 27 + ⚙️ Hyperparameter Tuning + + + Optimizing model settings: learning rate, regularization, tree depth, etc. Methods: Grid search, random search, Bayesian optimization. + + + 🎯 Key Takeaways + + Hyperparameters set before training + Use validation set or CV for tuning + Never tune on test set! + + + + + + + ML Algorithm 28 + 📊 Model Evaluation Metrics + + + Classification: Accuracy, Precision, Recall, F1-Score, ROC-AUC. Regression: MSE, RMSE, MAE, R². 
Choose based on problem. + + + 🎯 Key Takeaways + + Accuracy misleading with imbalanced classes + F1-Score balances precision and recall + ROC-AUC for probability predictions + + + + + + + ML Algorithm 29 + 🎯 Regularization Techniques + + + L1 (Lasso): Sparse. L2 (Ridge): Smooth. Dropout: Random neuron deactivation. Early Stopping: Stop when validation error increases. + + + 🎯 Key Takeaways + + Prevents overfitting by constraining model complexity + L1/L2 for linear models, dropout for neural nets + Early stopping: monitor validation loss + + + + + + + ML Algorithm 30 + ⚖️ Bias-Variance Tradeoff + + + Total Error = Bias² + Variance + Noise. High Bias: Underfitting (too simple). High Variance: Overfitting (too complex). Goal: balance both. + + + 🎯 Key Takeaways + + Bias: error from wrong assumptions (underfitting) + Variance: error from sensitivity to training data (overfitting) + Sweet spot: model complex enough but not too complex + + + + + + + ML Algorithm 31 + 🎭 Ensemble Methods + + + Bagging: Parallel models, average predictions (Random Forest). Boosting: Sequential, correct errors (XGBoost). Stacking: Meta-model combines base models. + + + 🎯 Key Takeaways + + Combine multiple models for better performance + Bagging reduces variance, Boosting reduces bias + Often wins Kaggle competitions + + + + + + + ML Algorithm 32 + 🔧 Feature Engineering + + + Creating new features from existing ones. Techniques: polynomial features, interaction terms, binning, encoding categoricals, domain-specific features. + + + 🎯 Key Takeaways + + Often more important than algorithm choice + Domain knowledge crucial + Techniques: scaling, encoding, transformations, interactions + + + + + + + ML Algorithm 33 + ⚖️ Handling Imbalanced Data + + + When one class dominates: SMOTE (synthetic minority oversampling), undersampling, class weights, or use metrics like F1/ROC-AUC instead of accuracy. 
+ + + 🎯 Key Takeaways + + Accuracy misleading with imbalanced classes + SMOTE: create synthetic minority examples + Class weights: penalize minority errors more + + + + + + + ML Algorithm 34 + 📈 Time Series Analysis + + + Sequential data with temporal dependency. Models: ARIMA, LSTM, Prophet. Key: train/test split must respect time order (no shuffling!). + + + 🎯 Key Takeaways + + Temporal structure matters - no random splitting + ARIMA for linear, LSTM for non-linear patterns + Handle seasonality, trend, autocorrelation + + + + + + + ML Algorithm 35 + 🚨 Anomaly Detection + + + Finding rare, unusual observations. Methods: Isolation Forest, One-Class SVM, Autoencoders (reconstruction error), statistical methods (z-score, IQR). + + + 🎯 Key Takeaways + + Identifies outliers/anomalies in data + Isolation Forest: isolates anomalies faster + Applications: fraud detection, quality control + + + + + + + ML Algorithm 36 + 🔄 Transfer Learning + + + Use pre-trained model on new task. Take model trained on ImageNet, adapt to your problem. Faster training, needs less data. + + + 🎯 Key Takeaways + + Leverage knowledge from source task + Common in computer vision (ImageNet models) + Freeze early layers, train final layers + + + + + + + ML Algorithm 37 + 🎯 Fine-Tuning Pre-trained Models + + + Start with pre-trained weights, continue training on new data. Lower learning rate, selectively unfreeze layers. Balances speed and customization. + + + 🎯 Key Takeaways + + Adapt pre-trained model to specific task + Use lower learning rate to avoid catastrophic forgetting + Unfreeze layers gradually from top + + + + + + + ML Algorithm 38 + 🔍 Model Interpretability & SHAP + + + SHAP: SHapley Additive exPlanations. Assigns each feature an importance value for prediction. Based on game theory (Shapley values). 
+ + + 🎯 Key Takeaways + + Explains individual predictions + SHAP values show feature contributions + LIME: Local Interpretable Model-agnostic Explanations + + + + + + + ML Algorithm 39 + ⚡ Optimization Algorithms (Adam, RMSprop) + + + Adam: Adaptive learning rates per parameter + momentum. Most popular optimizer. RMSprop: Divides by moving average of gradient squared. + + + 🎯 Key Takeaways + + Adam: adaptive + momentum, good default choice + RMSprop: adaptive learning rates + SGD+Momentum: simple but effective + + + + + + + ML Algorithm 40 + 🎯 Batch Normalization & Dropout + + + Batch Norm: Normalizes layer inputs, stabilizes training. Dropout: Randomly drops neurons during training (p=0.5 typical), prevents overfitting. + + + 🎯 Key Takeaways + + Batch Norm: faster training, less sensitive to initialization + Dropout: regularization for neural networks + Both critical for deep learning success + + + +
Predicting continuous values with a straight line
Linear regression is the simplest supervised learning algorithm that models the relationship between input features and a continuous output variable using a straight line (in 2D) or hyperplane (in higher dimensions).
Analogy: Like drawing the best-fit line through scattered points on a graph to predict future values based on the trend.
Step-by-step intuition:
β₀ = intercept, β₁ = slope, ε = error
A real estate company has data on 5 houses. Predict the price of a 2500 sq ft house.
Calculate Means
x̄ = (1000 + 1500 + 2000 + 3000) / 4 = 1875 sq ft
ȳ = (150 + 200 + 250 + 350) / 4 = 237.5 ($1000s)
We exclude the house we're predicting from training
Calculate Deviations
(x - x̄): -875, -375, 125, 1125
(y - ȳ): -87.5, -37.5, 12.5, 112.5
Find how much each point differs from the mean
Calculate Slope (β₁)
Numerator: (-875)(-87.5) + (-375)(-37.5) + (125)(12.5) + (1125)(112.5)
= 76562.5 + 14062.5 + 1562.5 + 126562.5 = 218750
Denominator: (-875)² + (-375)² + (125)² + (1125)²
= 765625 + 140625 + 15625 + 1265625 = 2187500
β₁ = 218750 / 2187500 = 0.10
Slope tells us price change per sq ft
Calculate Intercept (β₀)
β₀ = ȳ - β₁ × x̄
β₀ = 237.5 - 0.10 × 1875
β₀ = 237.5 - 187.5 = 50
Base price when size = 0
Write Prediction Equation
Price = 50 + 0.10 × Size
For 2500 sq ft:
Price = 50 + 0.10 × 2500 = 50 + 250 = 300
$300,000 predicted price
Calculate R² Score
Predictions: 150, 200, 250, 350
Residuals: 0, 0, 0, 0 (perfect fit!)
R² = 1 - (SS_res / SS_tot) = 1.0
R² = 1.0 means perfect linear fit
The model fits perfectly (R²=1.0). Each additional sq ft adds $100 to the price. The $50k base price represents fixed costs.
Answers:
from sklearn.linear_model import LinearRegression
import numpy as np

# Training data
X = np.array([[1000], [1500], [2000], [3000]])
y = np.array([150, 200, 250, 350])

# Create and train model
model = LinearRegression()
model.fit(X, y)

# Make prediction
prediction = model.predict([[2500]])
print(f"Predicted price: ${prediction[0]:.0f}k")

# Model parameters
print(f"Slope: {model.coef_[0]:.3f}")
print(f"Intercept: {model.intercept_:.2f}")
Classification by majority vote of nearest neighbors
K-Nearest Neighbors is a simple, non-parametric algorithm that classifies data points based on how their neighbors are classified. It finds the K closest training examples and uses majority vote.
Analogy: "You are the average of the 5 people you spend the most time with." KNN says "You're similar to your closest neighbors in feature space!"
Most common distance metric for KNN
Alternative: sum of absolute differences
Most frequent class among K neighbors
Classify a new iris flower with sepal length=5.0cm, sepal width=3.5cm. Use K=3.
Define New Point
New flower: x_new = [5.0, 3.5]
K = 3 (we'll find 3 nearest neighbors)
The flower we want to classify
Calculate Distances to All Points
d₁ = √[(5.0-5.1)² + (3.5-3.5)²] = √[0.01 + 0] = 0.10
d₂ = √[(5.0-4.9)² + (3.5-3.0)²] = √[0.01 + 0.25] = 0.51
d₃ = √[(5.0-7.0)² + (3.5-3.2)²] = √[4.0 + 0.09] = 2.02
d₄ = √[(5.0-6.4)² + (3.5-3.2)²] = √[1.96 + 0.09] = 1.43
d₅ = √[(5.0-5.0)² + (3.5-3.6)²] = √[0 + 0.01] = 0.10
Euclidean distance to each training point
Sort by Distance
Select top 3 (highlighted) for K=3
Take Majority Vote
3 nearest neighbors:
Neighbor 1: Setosa (distance 0.10)
Neighbor 2: Setosa (distance 0.10)
Neighbor 3: Setosa (distance 0.51)
Vote count: Setosa = 3, Versicolor = 0
Winner: Setosa (unanimous!)
Majority class wins
Make Prediction
Predicted Class: Setosa
Confidence: 3/3 = 100%
All neighbors agree
The new flower is extremely close to known Setosa examples (distances 0.10, 0.10, 0.51). The unanimous vote gives us high confidence in this classification.
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Training data
X = np.array([[5.1, 3.5], [4.9, 3.0], [7.0, 3.2], [6.4, 3.2], [5.0, 3.6]])
y = np.array(['Setosa', 'Setosa', 'Versicolor', 'Versicolor', 'Setosa'])

# Create and train model
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

# Make prediction
new_flower = np.array([[5.0, 3.5]])
prediction = model.predict(new_flower)
proba = model.predict_proba(new_flower)
print(f"Predicted: {prediction[0]}")
print(f"Confidence: {proba[0].max():.2%}")
Tree-based decisions using feature splits
Decision Trees make predictions by asking a series of yes/no questions about features, creating a flowchart-like structure from root to leaves.
Analogy: Like a game of 20 Questions - each question (split) narrows down possibilities until you reach a final decision (leaf).
pᵢ = proportion of class i. Measures disorder.
Choose split with highest information gain
Used by CART algorithm. Faster to compute.
Build decision tree for loan approval. Dataset:
Calculate Root Entropy
Total: 6 samples
Approved (Yes): 3/6 = 0.5
Denied (No): 3/6 = 0.5
H(root) = -[0.5 log₂(0.5) + 0.5 log₂(0.5)]
H(root) = -[0.5(-1) + 0.5(-1)] = 1.0
Maximum entropy = maximum disorder
Test Split on Credit Score
If Credit = Good: 2 Yes, 0 No → H = 0 (pure!)
If Credit = Poor: 1 Yes, 3 No → H = -[0.25log₂(0.25) + 0.75log₂(0.75)]
H(Poor) = -[0.25(-2) + 0.75(-0.415)] = 0.5 + 0.311 = 0.811
Weighted avg: (2/6)×0 + (4/6)×0.811 = 0.541
IG(Credit) = 1.0 - 0.541 = 0.459
Information gain from splitting on Credit Score
Test Split on Income
If Income = High: 2 Yes, 1 No → H = 0.918
If Income = Low: 1 Yes, 2 No → H = 0.918
Weighted: (3/6)×0.918 + (3/6)×0.918 = 0.918
IG(Income) = 1.0 - 0.918 = 0.082
Income provides less information gain
Choose Best Split
IG(Credit Score) = 0.459 ← HIGHEST!
IG(Income) = 0.082
Best first split: Credit Score
Choose feature with highest information gain
Build Tree Recursively
Root: Credit Score = Good?
├─ YES → Approved (pure node)
└─ NO → Split on Income
├─ Income = High? → Denied
└─ Income = Low? → Denied (majority)
Continue splitting until pure or stopping criterion
Make Predictions
New applicant: Credit=Good, Income=High
Follow path: Credit=Good → Approved ✓
Decision rule: IF Credit Score is Good THEN Approve
Traverse tree from root to leaf
The tree correctly classifies all training examples. Credit Score is the most important feature with IG=0.459.
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import matplotlib.pyplot as plt

# Create and train
model = DecisionTreeClassifier(max_depth=3, criterion='entropy')
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)

# Visualize tree
tree.plot_tree(model, filled=True, feature_names=['Income', 'Credit', 'Age'])
Partitioning data into K distinct clusters
K-Means is an unsupervised learning algorithm that groups similar data points into K clusters by minimizing within-cluster variance.
Analogy: Organizing a messy room by grouping similar items together. K-Means finds natural groupings in unlabeled data.
Sum of squared distances from points to centroids
Mean of all points assigned to cluster k
Assign to nearest centroid
Cluster 6 customers into K=2 groups based on [Age, Income]. Data:
Initialize K=2 Random Centroids
C₁ (initial) = [25, 40] (customer A)
C₂ (initial) = [60, 90] (customer E)
Start with random points or use K-means++
Assign Points to Nearest Centroid
Distance from A to C₁: √[(25-25)² + (40-40)²] = 0
Distance from A to C₂: √[(25-60)² + (40-90)²] = √[1225+2500] = 61.0
A → Cluster 1 (closer to C₁)
Similarly calculate for all:
B [30,50] → C₁ (dist=11.2 vs 50.0)
C [28,45] → C₁ (dist=5.8 vs 55.2)
D [55,80] → C₂ (dist=50.0 vs 11.2)
E [60,90] → C₂ (dist=0)
F [52,75] → C₂ (dist=44.2 vs 17.0)
Cluster 1: {A, B, C}
Cluster 2: {D, E, F}
Each point goes to its nearest centroid
Recalculate Centroids
New C₁ = mean of {A, B, C}
Age: (25 + 30 + 28)/3 = 27.67
Income: (40 + 50 + 45)/3 = 45
C₁ = [27.67, 45]
New C₂ = mean of {D, E, F}
Age: (55 + 60 + 52)/3 = 55.67
Income: (80 + 90 + 75)/3 = 81.67
C₂ = [55.67, 81.67]
Centroids move to center of their clusters
Check Convergence
Re-assign with new centroids:
All points stay in same clusters!
Centroids don't change → CONVERGED ✓
Algorithm stops when assignments don't change
Calculate Within-Cluster Sum of Squares
WCSS₁ = Σ dist² to C₁ = 0² + 11.2² + 5.8² ≈ 159.1
WCSS₂ = Σ dist² to C₂ = 11.2² + 0² + 17.0² ≈ 414.4
Total WCSS ≈ 573.5
Measures cluster compactness (lower = better)
Algorithm converged in 1 iteration. Clear separation: younger customers with lower income vs older customers with higher income.
from sklearn.cluster import KMeans
import numpy as np

# Create model
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)

# Get predictions
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

# Predict for new point
new_customer = np.array([[32, 55]])
cluster = kmeans.predict(new_customer)
print(f"Assigned to cluster: {cluster[0]}")
Reliable model evaluation technique
Cross-validation is a resampling technique that evaluates model performance by training and testing on different subsets of data multiple times.
Analogy: Testing a student on multiple different exams instead of just one - gives more reliable assessment of their true knowledge.
Average performance across K folds
σ = standard deviation of K scores
Evaluate a model using 5-fold CV. Dataset has 100 samples. After running, fold accuracies are: 0.85, 0.90, 0.88, 0.87, 0.90. Calculate mean accuracy and standard error.
Understand the Setup
Total samples: n = 100
Number of folds: K = 5
Each fold size: 100/5 = 20 samples
Each iteration: Train on 80, Test on 20
Divide data into 5 equal parts
Record Fold Results
Performance on each test fold
Calculate Mean Accuracy
Mean = (0.85 + 0.90 + 0.88 + 0.87 + 0.90) / 5
Mean = 4.40 / 5 = 0.88
Average accuracy: 88%
This is our best estimate of model performance
Calculate Standard Deviation
Deviations: (0.85-0.88), (0.90-0.88), (0.88-0.88), (0.87-0.88), (0.90-0.88)
= -0.03, 0.02, 0, -0.01, 0.02
Squared: 0.0009, 0.0004, 0, 0.0001, 0.0004
Variance = 0.0018 / 4 = 0.00045
SD = √0.00045 = 0.021
Measures variability across folds
Calculate Standard Error
SE = SD / √K = 0.021 / √5
SE = 0.021 / 2.236 = 0.0094
SE ≈ 0.0094 or 0.94%
Precision of our mean estimate
Report Results with Confidence
Mean accuracy: 0.88 ± 0.009
95% CI (approx): 0.88 ± 2×0.009 = [0.862, 0.898]
Model performs between 86.2% and 89.8% with 95% confidence
Final performance estimate with uncertainty
Low variability (SD=0.021) indicates stable model performance. Every test fold performed similarly, suggesting the model generalizes well.
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
import numpy as np

model = DecisionTreeClassifier()

# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Fold scores: {scores}")
print(f"Mean: {scores.mean():.3f}")
print(f"Std: {scores.std(ddof=1):.3f}")

# Approximate 95% CI from the standard error of the mean (as in the steps above)
se = scores.std(ddof=1) / np.sqrt(len(scores))
print(f"95% CI: [{scores.mean() - 2*se:.3f}, {scores.mean() + 2*se:.3f}]")
Fitting non-linear relationships with polynomial curves
Polynomial regression extends linear regression by adding polynomial terms (x², x³, etc.) to capture non-linear, curved relationships in data.
Analogy: When a straight line won't fit your data (like trajectory of a thrown ball), use a curved line instead!
Temperature (°C): [10, 15, 20, 25, 30]. Sales ($100s): [2, 5, 12, 22, 35]. Fit quadratic model and predict sales at 27°C.
Set Up Polynomial Model
y = β₀ + β₁x + β₂x²
Where x = temperature, y = sales
Need to find β₀, β₁, β₂
Create Design Matrix
x | x² | y
10 | 100 | 2
15 | 225 | 5
20 | 400 | 12
25 | 625 | 22
30 | 900 | 35
Solve Using Normal Equations (simplified)
Using least squares: β = (XᵀX)⁻¹Xᵀy
Result: β₀ = 5.0, β₁ ≈ -0.969, β₂ ≈ 0.0657
Write Equation
y = 5.0 - 0.969x + 0.0657x²
Predict at x = 27°C
y = 5.0 - 0.969(27) + 0.0657(27)²
y = 5.0 - 26.16 + 0.0657(729)
y = 5.0 - 26.16 + 47.90
y ≈ 26.7 → about $2,670 in sales (y is in $100s)
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([10, 15, 20, 25, 30]).reshape(-1, 1)
y = np.array([2, 5, 12, 22, 35])

# Create polynomial features (degree 2)
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Fit model
model = LinearRegression()
model.fit(X_poly, y)

# Predict (y is in units of $100s)
X_new = poly.transform([[27]])
print(f"Sales at 27°C: {model.predict(X_new)[0]:.1f} ($100s)")
Preventing overfitting with L2 penalty
Ridge regression adds an L2 penalty term to the loss function, shrinking coefficient magnitudes to prevent overfitting.
Formula: J = MSE + α Σβᵢ²
Compare linear vs ridge regression. Data prone to overfitting. α = 0.1
Linear Regression Cost
J = (1/n)Σ(y - ŷ)²
Ridge Cost Function
J_ridge = (1/n)Σ(y - ŷ)² + α Σβᵢ²
Penalty term shrinks large coefficients
from sklearn.linear_model import Ridge

model = Ridge(alpha=0.1)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
Feature selection through L1 penalty
Lasso adds L1 penalty: J = MSE + α Σ|βᵢ|. Can shrink coefficients to exactly zero, performing automatic feature selection.
5 features, but only 2 are relevant. Use Lasso with α = 0.5
Linear Regression (No Penalty)
All coefficients non-zero: [3.2, 0.5, 5.1, 0.3, 0.1]
Apply Lasso Penalty
J = MSE + 0.5 Σ|βᵢ|
Coefficients that add little predictive value are driven to exactly zero
Lasso Result
Coefficients: [3.1, 0, 5.0, 0, 0]
Features 2, 4, 5 eliminated!
from sklearn.linear_model import Lasso
import numpy as np

model = Lasso(alpha=0.5)
model.fit(X_train, y_train)
print(f"Non-zero features: {np.sum(model.coef_ != 0)}")
Combining L1 and L2 penalties
Combines L1 and L2: J = MSE + α₁Σ|βᵢ| + α₂Σβᵢ². Best of both Ridge and Lasso.
from sklearn.linear_model import ElasticNet

model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X, y)
Robust regression with margin tolerance
SVR finds hyperplane that fits data within margin ε. Points outside margin contribute to loss. Robust to outliers.
from sklearn.svm import SVR

model = SVR(kernel='rbf', C=1.0, epsilon=0.1)
model.fit(X, y)
Binary classification with sigmoid function
Binary classification using sigmoid: P(y=1) = 1/(1+e^(-z)) where z = β₀+β₁x. Despite name, it's for classification!
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X, y)
proba = model.predict_proba(X_new)
Maximum margin classification
SVM finds hyperplane that maximally separates classes. Uses support vectors (closest points) and kernel trick for non-linear boundaries.
from sklearn.svm import SVC

model = SVC(kernel='rbf', C=1.0, gamma='auto')
model.fit(X, y)
Probabilistic classifier using Bayes' Theorem
Applies Bayes' Theorem with "naive" independence assumption. P(y|x) ∝ P(y)ΠP(xᵢ|y). Extremely fast for text classification.
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X, y)
predictions = model.predict(X_test)
Ensemble of decision trees
Ensemble of decision trees. Each tree trained on random subset (bootstrap) with random features. Final prediction by majority vote.
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, max_depth=10)
model.fit(X, y)
feature_importance = model.feature_importances_
Sequential ensemble method
Sequentially builds trees, each correcting errors of previous. Predictions: F(x) = f₁(x) + f₂(x) + ... + f_n(x).
from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=100, learning_rate=0.1)
model.fit(X, y)
Universal function approximators
Layers of connected neurons. Each neuron: z = Σwᵢxᵢ + b, then activation function σ(z). Trained via backpropagation + gradient descent.
from sklearn.neural_network import MLPClassifier

model = MLPClassifier(hidden_layer_sizes=(100, 50), activation='relu')
model.fit(X, y)
Builds hierarchy of clusters (dendrogram). Agglomerative: merge closest clusters. Divisive: split clusters. No need to specify K upfront.
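A minimal sketch of agglomerative clustering with scikit-learn (toy data and the Ward linkage choice are illustrative):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two tight pairs of points, far apart from each other
X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9]])

# Agglomerative = bottom-up: repeatedly merge the closest clusters
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)
print(labels)  # the two pairs land in different clusters
```

To inspect the dendrogram itself, `scipy.cluster.hierarchy.linkage` plus `dendrogram` is the usual route.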
Density-Based Spatial Clustering. Groups points with many neighbors (dense regions). Can find arbitrarily-shaped clusters and outliers.
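A short sketch showing the two DBSCAN parameters and noise labeling (the data is a made-up dense blob plus one outlier):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# A dense blob of four points plus one far-away outlier
X = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1], [10, 10]])

# eps = neighborhood radius, min_samples = points needed for a dense region
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)  # outlier is labeled -1 (noise)
```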
Soft clustering: each point has probability of belonging to each cluster. Mixture of K Gaussian distributions. Uses EM algorithm.
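A sketch of the soft assignments GMM produces, on synthetic data from two well-separated Gaussians (all values illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),   # cluster around (0, 0)
               rng.normal(8, 1, (50, 2))])  # cluster around (8, 8)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM under the hood
proba = gmm.predict_proba(X[:1])  # soft assignment: one probability per component
print(proba)  # rows sum to 1
```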
See Data Science Topic 77 for complete details. Reduces dimensions by finding directions of maximum variance.
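A minimal sketch of the standardize-then-reduce workflow (synthetic data; the deliberately redundant column shows PCA dropping a dimension):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = 2 * X[:, 0]  # column 1 is perfectly redundant with column 0

X_std = StandardScaler().fit_transform(X)  # standardize features first!
pca = PCA(n_components=0.95)               # keep 95% of the variance
X_reduced = pca.fit_transform(X_std)
print(X_reduced.shape)  # fewer than 5 columns survive
```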
t-Distributed Stochastic Neighbor Embedding. Non-linear dimensionality reduction for visualization. Preserves local structure better than PCA.
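A minimal usage sketch: t-SNE maps high-dimensional points into 2D for plotting (random data here; `perplexity` is an illustrative choice):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))  # 60 points in 10 dimensions

# Embed into 2D for visualization; perplexity ~ effective neighborhood size
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
print(emb.shape)  # (60, 2)
```

Note there is no `transform` for new points, which is why PCA is preferred when new data must be projected.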
Neural network that learns compressed representation. Encoder: reduces dimensions. Decoder: reconstructs input. Latent space = compressed features.
Reinforcement learning: agent learns optimal actions through trial and error. Q-table stores expected reward for each state-action pair.
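The Q-table update rule can be sketched in a few lines (the 2-state environment and the observed transition are made up for illustration):

```python
import numpy as np

n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))  # Q-table: expected future reward
alpha, gamma = 0.5, 0.9              # learning rate, discount factor

# One observed transition: in state 0, action 1 gave reward 1, moved to state 1
s, a, r, s_next = 0, 1, 1.0, 1

# Q_new = Q + alpha * [reward + gamma * max Q_next - Q]
Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
print(Q[0, 1])  # 0.5 after this single update
```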
Combines Q-Learning with deep neural networks. Neural net approximates Q-function. Used by DeepMind for Atari games.
Directly optimizes policy π(a|s). Gradient ascent on expected reward. REINFORCE algorithm: update based on return.
GridSearch: Try all combinations of hyperparameters. RandomSearch: Sample random combinations. Both use cross-validation.
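A minimal GridSearchCV sketch on a built-in dataset (the parameter grid is an illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Try every max_depth in the grid, scored by 5-fold cross-validation
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={"max_depth": [1, 2, 3]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

`RandomizedSearchCV` has the same interface but samples `n_iter` random combinations instead of trying them all.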
Optimizing model settings: learning rate, regularization, tree depth, etc. Methods: Grid search, random search, Bayesian optimization.
Classification: Accuracy, Precision, Recall, F1-Score, ROC-AUC. Regression: MSE, RMSE, MAE, R². Choose based on problem.
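The classification metrics on a small hand-made prediction vector (2 true positives, 1 false positive, 1 false negative):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

print(accuracy_score(y_true, y_pred))   # (TP+TN)/total = 6/8 = 0.75
print(precision_score(y_true, y_pred))  # TP/(TP+FP) = 2/3
print(recall_score(y_true, y_pred))     # TP/(TP+FN) = 2/3
print(f1_score(y_true, y_pred))         # harmonic mean = 2/3
```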
L1 (Lasso): Sparse. L2 (Ridge): Smooth. Dropout: Random neuron deactivation. Early Stopping: Stop when validation error increases.
Total Error = Bias² + Variance + Noise. High Bias: Underfitting (too simple). High Variance: Overfitting (too complex). Goal: balance both.
Bagging: Parallel models, average predictions (Random Forest). Boosting: Sequential, correct errors (XGBoost). Stacking: Meta-model combines base models.
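A sketch of combining heterogeneous base models by majority vote (the choice of the three base models is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hard voting: each model casts one vote, majority class wins
vote = VotingClassifier([
    ("lr", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
    ("dt", DecisionTreeClassifier(random_state=0)),
])
vote.fit(X, y)
print(vote.score(X, y))
```

`StackingClassifier` goes one step further, training a meta-model on the base models' outputs.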
Creating new features from existing ones. Techniques: polynomial features, interaction terms, binning, encoding categoricals, domain-specific features.
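One of the listed techniques sketched concretely: polynomial and interaction features from two raw columns (the input row is made up):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2, 3]])  # one sample with features x1=2, x2=3

# degree=2 adds squares and the pairwise interaction term
poly = PolynomialFeatures(degree=2, include_bias=False)
feats = poly.fit_transform(X)
print(feats)  # [[2, 3, 4, 6, 9]] = x1, x2, x1², x1·x2, x2²
```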
When one class dominates: SMOTE (synthetic minority oversampling), undersampling, class weights, or use metrics like F1/ROC-AUC instead of accuracy.
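A sketch of the class-weight approach on a made-up 90/10 label distribution:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 90 + [1] * 10)  # 90/10 imbalance

# "balanced" weights each class by n_samples / (n_classes * class_count)
w = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(w)  # minority class weighted 9x heavier

# Most sklearn classifiers accept the same option directly:
model = LogisticRegression(class_weight="balanced")
```

SMOTE lives in the separate `imbalanced-learn` package rather than scikit-learn itself.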
Sequential data with temporal dependency. Models: ARIMA, LSTM, Prophet. Key: train/test split must respect time order (no shuffling!).
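The time-ordered splitting rule can be sketched with `TimeSeriesSplit` (the index data is a stand-in for any time-ordered series):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # 10 time-ordered observations

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # every training index precedes every test index - no shuffling
    print(train_idx, test_idx)
```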
Finding rare, unusual observations. Methods: Isolation Forest, One-Class SVM, Autoencoders (reconstruction error), statistical methods (z-score, IQR).
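A minimal Isolation Forest sketch with one planted outlier (data and `contamination` value are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),  # normal points around the origin
               [[8, 8]]])                   # one planted outlier

# contamination = expected fraction of anomalies
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = iso.predict(X)  # -1 = anomaly, 1 = normal
print(labels[-1])        # the planted outlier is flagged
```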
Use pre-trained model on new task. Take model trained on ImageNet, adapt to your problem. Faster training, needs less data.
Start with pre-trained weights, continue training on new data. Lower learning rate, selectively unfreeze layers. Balances speed and customization.
SHAP: SHapley Additive exPlanations. Assigns each feature an importance value for prediction. Based on game theory (Shapley values).
Adam: Adaptive learning rates per parameter + momentum. Most popular optimizer. RMSprop: Divides by moving average of gradient squared.
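A hand-rolled single Adam step on one scalar parameter, just to make the two moment estimates concrete (not a library API; all values illustrative):

```python
import numpy as np

theta, m, v, t = 0.0, 0.0, 0.0, 1              # parameter, moments, step count
lr, b1, b2, eps = 0.001, 0.9, 0.999, 1e-8      # standard Adam defaults
g = 2.0                                        # gradient of the loss at theta

m = b1 * m + (1 - b1) * g                      # 1st moment (momentum)
v = b2 * v + (1 - b2) * g ** 2                 # 2nd moment (RMSprop-style average)
m_hat = m / (1 - b1 ** t)                      # bias correction for zero init
v_hat = v / (1 - b2 ** t)
theta -= lr * m_hat / (np.sqrt(v_hat) + eps)   # adaptive, per-parameter step
print(theta)  # first step size is about lr, regardless of gradient scale
```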
Batch Norm: Normalizes layer inputs, stabilizes training. Dropout: Randomly drops neurons during training (p=0.5 typical), prevents overfitting.