Linear Regression — Statistical Machine Learning

Theory

What it does

Estimates a continuous output by fitting a hyperplane to the data that minimizes the sum of squared residuals (equivalently, the mean squared error, MSE). Each feature receives a coefficient β that quantifies its linear contribution to the prediction when the other features are held fixed.

When to use

Target is continuous (yield, price, log-survival). Relationship between features and target is roughly linear. Check that residuals are approximately normally distributed (Q-Q plot). Add regularization (Ridge/Lasso) when there are many features or when features are strongly correlated.

Key strength

Fully interpretable: β_i means "y changes by β_i for every unit increase in x_i, with the other features held fixed". Very fast to train (closed-form solution) and a good first baseline. Limitation: assumes linearity; underfits non-linear data and is sensitive to outliers.

Regularization

Ridge (L2) shrinks all coefficients toward zero, which stabilizes the estimates under multicollinearity. Lasso (L1) can set some coefficients exactly to zero, which acts as automatic feature selection. Each adds a penalty to the loss: λ·‖β‖₂² for Ridge and λ·‖β‖₁ for Lasso.

Hypothesis & Loss

ŷ = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ (p features)

Loss (MSE) = (1/n) Σ (yᵢ − ŷᵢ)²

β = (XᵀX)⁻¹ Xᵀy [Normal Eq.]
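The normal equation can be checked numerically. The following is a minimal NumPy sketch, not the page's own implementation: the toy data, the seed, and every variable name are assumptions made for illustration. It solves the normal equation directly and compares the result with NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples, 3 features, known coefficients.
n_samples, n_features = 200, 3
X = rng.normal(size=(n_samples, n_features))
true_beta = np.array([2.0, -1.0, 0.5])
true_intercept = 4.0
y = true_intercept + X @ true_beta + rng.normal(scale=0.1, size=n_samples)

# Prepend a column of ones so that beta[0] plays the role of the intercept.
X_design = np.column_stack([np.ones(n_samples), X])

# Normal equation: solve (X^T X) beta = X^T y without forming the inverse.
beta_normal = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

# SVD-based least squares, usually safer when X^T X is ill-conditioned.
beta_lstsq, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# MSE of the fitted model on the same data.
y_hat = X_design @ beta_normal
mse = np.mean((y - y_hat) ** 2)

print("normal equation:", beta_normal)
print("lstsq:          ", beta_lstsq)
print("MSE:", mse)
```

Both routes should agree to numerical precision on well-conditioned data. Solving the linear system instead of inverting XᵀX is a common design choice because it is cheaper and more stable.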

Ridge: Loss + λ·Σβⱼ²

Lasso: Loss + λ·Σ|βⱼ|
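For Ridge, the penalized loss still has a closed-form minimizer. The sketch below assumes the loss is MSE + λ·Σβⱼ² exactly as written above, so the (1/n) factor in the MSE turns into an n·λ term in the linear system. Leaving the intercept unpenalized by centering the data is an additional assumption, as are the function name `ridge_fit` and the toy data with two nearly identical features.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/n)*||y - b0 - X b||^2 + lam*||b||^2 (intercept b0 unpenalized)."""
    n, p = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean  # centering removes the intercept from the problem
    # Setting the gradient to zero gives (Xc^T Xc + n*lam*I) b = Xc^T yc.
    coef = np.linalg.solve(Xc.T @ Xc + n * lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return intercept, coef

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly a copy of x1 (multicollinearity)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

_, coef_ols = ridge_fit(X, y, lam=0.0)    # lam = 0 reduces to ordinary least squares
_, coef_ridge = ridge_fit(X, y, lam=0.1)

print("OLS coefficients:  ", coef_ols)
print("Ridge coefficients:", coef_ridge)
```

With two almost collinear features, the OLS coefficients tend to be unstable, while the Ridge solution spreads the effect roughly evenly across both features.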

Dataset Selection

Synthetic Datasets — 2D

Real Datasets — Multi-feature

Train / Test split: 80 % / 20 %

Model

Performance Metrics

Model Fit

Scatter + Regression Line

Actual vs Predicted

Perfect predictions lie on the diagonal. Systematic deviation → model bias.
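An actual-vs-predicted comparison can be reproduced on held-out data. The sketch below is one possible setup under assumed synthetic data: it makes an 80/20 split by shuffling indices, fits OLS on the training part, and computes a few common metrics on the test part. The metric selection (RMSE, MAE, R², mean error) is an assumption, since the page does not list which metrics it reports.

```python
import numpy as np

rng = np.random.default_rng(8)

# Hypothetical data: 500 samples, 4 features (the last one has no effect).
n = 500
X = rng.normal(size=(n, 4))
y = 10.0 + X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=n)

# 80/20 train/test split on shuffled indices.
idx = rng.permutation(n)
n_train = int(0.8 * n)
train_idx, test_idx = idx[:n_train], idx[n_train:]

# Fit OLS with an intercept on the training set only.
Xd_train = np.column_stack([np.ones(n_train), X[train_idx]])
beta, *_ = np.linalg.lstsq(Xd_train, y[train_idx], rcond=None)

# Predict on the held-out set.
y_test = y[test_idx]
y_pred = beta[0] + X[test_idx] @ beta[1:]

rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
mae = np.mean(np.abs(y_test - y_pred))
r2 = 1.0 - np.sum((y_test - y_pred) ** 2) / np.sum((y_test - y_test.mean()) ** 2)
mean_error = np.mean(y_test - y_pred)  # far from zero would hint at systematic bias

print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}  mean error={mean_error:.3f}")
```

Plotting `y_test` against `y_pred` would give the actual-vs-predicted scatter; a mean error far from zero corresponds to points sitting systematically above or below the diagonal.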

Coefficients & Confidence Intervals

Coefficient Values

Feature	Coef	95% CI	p-value	Magnitude

* p < 0.05: significant at the 5% level. CIs are shown for OLS only; the standard closed-form intervals do not carry over to Ridge/Lasso estimates.

Coefficient Confidence Intervals (OLS)

Error bars = 95% CI. Bars crossing zero indicate features whose coefficients are not significant at the 5% level.
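OLS confidence intervals follow from the coefficient standard errors. The sketch below is an approximation under stated assumptions: it uses the normal quantile (about 1.96) from the Python standard library instead of the Student-t quantile, which is reasonable only for large samples, and the toy data and variable names are illustrative.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)

# Hypothetical data: feature 0 has a real effect, feature 1 has none.
n = 500
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] + 0.0 * X[:, 1] + rng.normal(size=n)

Xd = np.column_stack([np.ones(n), X])  # design matrix with intercept column
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)

# Unbiased residual variance estimate with n - (number of parameters) degrees of freedom.
resid = y - Xd @ beta
dof = n - Xd.shape[1]
sigma2 = resid @ resid / dof

# Var(beta) = sigma^2 * (X^T X)^-1; standard errors are the square roots of the diagonal.
cov_beta = sigma2 * np.linalg.inv(Xd.T @ Xd)
se = np.sqrt(np.diag(cov_beta))

# Large-sample 95% interval using the normal quantile (assumption: n is large).
z = NormalDist().inv_cdf(0.975)
ci_low, ci_high = beta - z * se, beta + z * se

# Two-sided p-values from the same normal approximation.
z_scores = beta / se
p_values = np.array([2.0 * (1.0 - NormalDist().cdf(abs(t))) for t in z_scores])

for name, b, lo, hi, pv in zip(["intercept", "x0", "x1"], beta, ci_low, ci_high, p_values):
    print(f"{name:9s} coef={b:+.3f}  95% CI=[{lo:+.3f}, {hi:+.3f}]  p={pv:.4f}")
```

For small samples, the t distribution with `dof` degrees of freedom would give slightly wider intervals than this normal approximation.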

Model Understanding

Learning Curve

Train vs validation R² as training set grows. Large gap → high variance (overfit). Both low → high bias (underfit).
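A learning curve of this kind can be computed by refitting on growing prefixes of the training set. The sketch below assumes synthetic data, a fixed validation set, and hypothetical helper names `fit_ols` and `r2_score`; it is not the page's own routine.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns [b0, b1, ..., bp]."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta

def r2_score(X, y, beta):
    """Coefficient of determination of the model beta on (X, y)."""
    pred = beta[0] + X @ beta[1:]
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(3)
n, p = 400, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=2.0, size=n)

# Hold out the last 80 rows as a fixed validation set.
X_train, y_train = X[:320], y[:320]
X_val, y_val = X[320:], y[320:]

sizes = [20, 40, 80, 160, 320]
curve = []
for m in sizes:
    beta = fit_ols(X_train[:m], y_train[:m])
    train_r2 = r2_score(X_train[:m], y_train[:m], beta)
    val_r2 = r2_score(X_val, y_val, beta)
    curve.append((m, train_r2, val_r2))
    print(f"m={m:3d}  train R2={train_r2:.3f}  val R2={val_r2:.3f}")
```

With few training rows relative to the number of features, the training R² is typically much higher than the validation R²; the gap usually narrows as more rows are added.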

Regularization Path

Coefficients vs log₁₀(λ). Lasso zeroes features (feature selection); Ridge shrinks them gradually.
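The contrast between the two paths can be reproduced with a small coordinate-descent Lasso and the closed-form Ridge solution. This is a sketch under assumptions: the objective is taken as (1/n)·‖y − Xβ‖² plus the penalty, the features are standardized and the target centered so no intercept is needed, and the function names, λ grid, and toy data are illustrative.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exactly zero inside [-threshold, threshold]."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/n)*||y - X b||^2 + lam*||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back.
            partial_resid = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ partial_resid / n
            # One-dimensional minimizer: soft-threshold at lam/2, then rescale.
            b[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return b

def ridge_closed_form(X, y, lam):
    """Minimizer of (1/n)*||y - X b||^2 + lam*||b||^2."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(9)
n, p = 200, 5
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize features
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)
y = y - y.mean()                          # center target

lambdas = np.logspace(-3, 1, 9)
lasso_path = np.array([lasso_cd(X, y, lam) for lam in lambdas])
ridge_path = np.array([ridge_closed_form(X, y, lam) for lam in lambdas])

# Number of non-zero Lasso coefficients at each lambda.
nonzero_counts = (lasso_path != 0).sum(axis=1)
for lam, k in zip(lambdas, nonzero_counts):
    print(f"log10(lambda)={np.log10(lam):+.1f}  non-zero Lasso coefficients: {k}")
```

Plotting each column of `lasso_path` and `ridge_path` against `np.log10(lambdas)` would give the regularization path: Lasso coefficients hit exactly zero, while Ridge coefficients shrink but stay non-zero.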

Permutation Feature Importance

Drop in R² when a feature is shuffled (avg over 20 repeats). Larger drop → more important feature.
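Permutation importance as described here can be sketched in a few lines of NumPy. The data, the OLS model, and the helper names are assumptions for illustration; the 20 repeats match the caption above.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: feature 0 matters a lot, feature 1 a little, feature 2 not at all.
n = 300
X = rng.normal(size=(n, 3))
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

X_train, y_train = X[:240], y[:240]
X_test, y_test = X[240:], y[240:]

# Fit OLS with an intercept on the training set.
Xd = np.column_stack([np.ones(len(X_train)), X_train])
beta, *_ = np.linalg.lstsq(Xd, y_train, rcond=None)

def predict(features):
    return beta[0] + features @ beta[1:]

def r2(y_true, y_pred):
    return 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

baseline = r2(y_test, predict(X_test))

n_repeats = 20
importance = np.zeros(X.shape[1])
for j in range(X.shape[1]):
    drops = []
    for _ in range(n_repeats):
        X_perm = X_test.copy()
        # Shuffling one column breaks its link to y but keeps its distribution.
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        drops.append(baseline - r2(y_test, predict(X_perm)))
    importance[j] = np.mean(drops)

print("baseline R2:", baseline)
print("mean drop in R2 per feature:", importance)
```

Computing the drop on held-out data rather than training data is a design choice here, so that the score reflects what the model relies on when generalizing.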

Assumption Diagnostics

Residuals vs Fitted

Residuals should scatter randomly around zero. A funnel shape → heteroscedasticity; a curved pattern → non-linearity.

Q-Q Plot (Normality)

Points near diagonal → residuals approximately normal.
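The points of a Q-Q plot can be computed without a plotting library. The sketch below assumes plotting positions (i − 0.5)/n, one of several common conventions, and uses simulated residuals rather than residuals from a fitted model; the helper name `qq_points` is hypothetical. The correlation between theoretical and sample quantiles serves as a rough numeric summary of how close the points are to a straight line.

```python
import numpy as np
from statistics import NormalDist

def qq_points(residuals):
    """Return (theoretical normal quantiles, sorted standardized residuals)."""
    standardized = (residuals - residuals.mean()) / residuals.std(ddof=1)
    sample_q = np.sort(standardized)
    n = len(sample_q)
    probs = (np.arange(1, n + 1) - 0.5) / n  # assumed plotting positions
    theoretical_q = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theoretical_q, sample_q

rng = np.random.default_rng(5)
normal_resid = rng.normal(size=500)       # simulated well-behaved residuals
skewed_resid = rng.exponential(size=500)  # simulated right-skewed residuals

theo_n, samp_n = qq_points(normal_resid)
theo_s, samp_s = qq_points(skewed_resid)

# Correlation close to 1 means the Q-Q points lie close to a straight line.
corr_normal = np.corrcoef(theo_n, samp_n)[0, 1]
corr_skewed = np.corrcoef(theo_s, samp_s)[0, 1]

print("Q-Q correlation, normal residuals:", corr_normal)
print("Q-Q correlation, skewed residuals:", corr_skewed)
```

Scatter-plotting `theo_n` against `samp_n` together with the line y = x gives the usual Q-Q panel; the skewed sample bends away from the line in the tails.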

Scale-Location (Homoscedasticity)

Flat red trend → equal variance. Rising trend → heteroscedastic errors.

Residuals vs Leverage (Cook's Distance)

Points beyond the dashed threshold are influential — they disproportionately affect the fit.
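Leverage and Cook's distance can be computed directly from the hat matrix. In the sketch below, the data, the injected outlier, and the 4/n cutoff are assumptions; the page does not state which threshold its dashed line uses, and other rules of thumb (such as 0.5 or 1) are also common.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical 1-D data with one injected influential point.
n = 50
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=0.5, size=n)
x[-1], y[-1] = 30.0, 0.0  # far out in x and far off the trend

X = np.column_stack([np.ones(n), x])
p = X.shape[1]  # number of fitted parameters (intercept + slope)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Leverage h_ii: diagonal of the hat matrix H = X (X^T X)^-1 X^T.
H = X @ np.linalg.solve(X.T @ X, X.T)
leverage = np.diag(H)

# Cook's distance: D_i = (e_i^2 / (p * s^2)) * h_ii / (1 - h_ii)^2.
s2 = resid @ resid / (n - p)
cooks_d = (resid ** 2 / (p * s2)) * leverage / (1.0 - leverage) ** 2

threshold = 4.0 / n  # assumed rule-of-thumb cutoff
flagged = np.where(cooks_d > threshold)[0]

print("most influential index:", int(np.argmax(cooks_d)))
print("flagged indices:", flagged)
```

A point needs both a large residual and high leverage to get a large Cook's distance, which is why the injected point stands out.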

Interactive — Gradient Descent

Gradient Descent Animation (1-D Regression)

Data + Regression Line

MSE over Iterations

Loss Surface (β₀, β₁)

Gradient descent updates β₀ and β₁ iteratively: β ← β − η·∇MSE, where η is the learning rate. The path on the loss surface shows how quickly the iterates converge to the minimum.
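The update rule can be written out for the 1-D case. The following is a minimal sketch, assuming synthetic data, a starting point at (0, 0), a learning rate η = 0.1, and 500 iterations; none of these values come from the page.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical 1-D data: y = 1.5 + 3.0 * x + noise.
n = 100
x = rng.uniform(-1, 1, size=n)
y = 1.5 + 3.0 * x + rng.normal(scale=0.2, size=n)

b0, b1 = 0.0, 0.0  # starting point on the loss surface
eta = 0.1          # learning rate (assumed value)
n_iters = 500
history = []       # MSE before each update

for _ in range(n_iters):
    resid = y - (b0 + b1 * x)
    history.append(np.mean(resid ** 2))
    # Partial derivatives of MSE = mean((y - b0 - b1*x)^2).
    grad_b0 = -2.0 * np.mean(resid)
    grad_b1 = -2.0 * np.mean(resid * x)
    # Update step: beta <- beta - eta * grad(MSE).
    b0 -= eta * grad_b0
    b1 -= eta * grad_b1

# Closed-form least-squares fit for comparison (polyfit returns slope first).
b1_exact, b0_exact = np.polyfit(x, y, deg=1)

print(f"gradient descent: b0={b0:.4f}  b1={b1:.4f}  MSE={history[-1]:.5f}")
print(f"closed form:      b0={b0_exact:.4f}  b1={b1_exact:.4f}")
```

Recording (b0, b1) at each iteration in the same loop would give the path drawn on the loss surface. A learning rate that is too large makes the MSE oscillate or diverge, while a very small one slows convergence.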

Diagnostic Summary