Logistic Regression — Statistical Machine Learning

Theory

What it does

Models the probability that an observation belongs to a class. The log-odds of the positive class is modelled as a linear combination of features; the sigmoid function squashes this to (0, 1), giving a probability. A decision boundary at threshold 0.5 separates classes.

When to use

Binary or multi-class classification where you need well-calibrated class probabilities (e.g. medical diagnosis, credit scoring). Assumes each class is (roughly) linearly separable in feature space. Add regularization (L1 / L2) when features are many or correlated.

Key strength

Outputs calibrated probabilities, not just a label. Coefficients are interpretable as log-odds ratios: e^βⱼ is the odds ratio, meaning "the odds multiply by e^βⱼ for each unit increase in xⱼ" with the other features held fixed. Limitation: linear decision boundary; fails on XOR-type data (use a kernel method or a neural net).
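
The odds-ratio reading of a coefficient can be checked numerically. The sketch below is only an illustration: the one-feature model and the values of `beta0` and `beta1` are hypothetical, not taken from any fitted model on this page.

```python
import math

def sigmoid(z):
    """Logistic function: maps log-odds z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Convert a probability p to odds p / (1 - p)."""
    return p / (1.0 - p)

# Hypothetical one-feature model: z = beta0 + beta1 * x
beta0, beta1 = -1.0, 0.7

x = 2.0
odds_before = odds(sigmoid(beta0 + beta1 * x))
odds_after = odds(sigmoid(beta0 + beta1 * (x + 1.0)))  # x increased by one unit

# The ratio of the two odds should match exp(beta1), whatever x is.
odds_ratio = odds_after / odds_before
print(odds_ratio, math.exp(beta1))
```

The same ratio comes out for any starting value of x, which is why a single coefficient can be reported as one odds ratio.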

Regularization (note the inversion)

C = 1/λ — the inverse of regularization strength. Smaller C → stronger regularization → simpler boundary. L2 shrinks all coefficients. L1 zeroes irrelevant features (feature selection). ElasticNet blends both.

Hypothesis, Loss & Multi-class

σ(z) = 1 / (1 + e⁻ᶻ)

z = β₀ + β₁x₁ + … + βₚxₚ, where p is the number of features

P(y=1|x) = σ(z)

Decision boundary: z = 0 ⟺ P = 0.5

Loss (Cross-Entropy) = −(1/n) Σ [ yᵢ log ŷᵢ + (1−yᵢ) log(1−ŷᵢ) ]

L2: Loss + (1/2C)·Σβⱼ²

L1: Loss + (1/C)·Σ|βⱼ|

Multi-class: One-vs-Rest (OvR) by default
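
The loss and the two penalties can be written out directly from the formulas above. This is a minimal sketch, not a reference implementation: the function names, the clipping constant `eps`, and the toy numbers are assumptions, and the intercept β₀ is assumed to be left out of the penalty.

```python
import math

def cross_entropy(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

def l2_objective(y_true, y_prob, betas, C):
    """Loss + (1/2C) * sum of squared coefficients."""
    return cross_entropy(y_true, y_prob) + sum(b * b for b in betas) / (2.0 * C)

def l1_objective(y_true, y_prob, betas, C):
    """Loss + (1/C) * sum of absolute coefficients."""
    return cross_entropy(y_true, y_prob) + sum(abs(b) for b in betas) / C

# Toy values (hypothetical): labels, predicted probabilities, slope coefficients.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4]
betas = [1.0, -2.0]  # intercept assumed unpenalized, so it is not listed here

for C in (0.5, 1.0, 10.0):
    print(C, l2_objective(y_true, y_prob, betas, C), l1_objective(y_true, y_prob, betas, C))
```

For fixed coefficients, the penalty term grows as C shrinks, which is the inversion that C = 1/λ encodes.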

Sigmoid Function

The sigmoid maps a log-odds value z to a probability. At z = 0.00, σ(z) = 0.500: the decision boundary, exactly 50% probability.
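
The mapping from log-odds to probability can be sketched in a few lines of Python. This is an illustration rather than the page's own implementation; the names `sigmoid`, `predict_proba`, and `predict_label` and the example coefficients are assumptions made here.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function (avoids overflow for large |z|)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(x, beta0, betas):
    """P(y=1|x) for a feature vector x, intercept beta0, and slope coefficients betas."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return sigmoid(z)

def predict_label(x, beta0, betas, threshold=0.5):
    """Class 1 when the predicted probability reaches the threshold, else class 0."""
    return 1 if predict_proba(x, beta0, betas) >= threshold else 0

# Hypothetical coefficients for a two-feature model.
beta0, betas = -1.0, [2.0, -0.5]
for x in ([0.0, 0.0], [0.5, 0.0], [2.0, 1.0]):
    print(x, predict_proba(x, beta0, betas), predict_label(x, beta0, betas))
```

The second point has z = 0, so it sits exactly on the boundary with probability 0.5.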

Dataset Selection

Synthetic Datasets — 2D (decision boundary visible)

Real Datasets — Multi-feature

Train / Test split: 80% / 20%

Regularization

C = 1/λ = 1.00 (smaller C gives stronger regularization, larger C gives weaker)

Performance Metrics

Model Fit

Decision Boundary

Confusion Matrix


Rows = True class, Columns = Predicted. Diagonal = correct. Off-diagonal = misclassifications.
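
A confusion matrix with this layout can be built with a simple counting loop. The sketch assumes class labels are integers 0 … k−1; the function name and the toy labels are hypothetical.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns index the predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Toy labels (hypothetical).
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred, n_classes=2)
correct = sum(cm[k][k] for k in range(2))   # diagonal entries
accuracy = correct / len(y_true)
print(cm, accuracy)
```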

Probabilistic Performance

ROC Curve

True Positive Rate vs False Positive Rate. AUC = 1 is perfect; AUC = 0.5 is random. Multi-class: One-vs-Rest per class.
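
AUC also has a ranking interpretation: it equals the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half. The sketch below uses that pairwise definition instead of integrating the curve; it is quadratic in the number of samples and is meant only for small illustrative inputs. The function name and the scores are assumptions.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))

# Toy scores (hypothetical).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)
```

Three of the four positive/negative pairs are ordered correctly here, so the value lies between random (0.5) and perfect (1).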

Precision-Recall Curve

A better diagnostic than ROC on imbalanced datasets. AP (average precision) summarizes the area under this curve. High precision together with high recall indicates an ideal classifier.
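
One common way to compute average precision is to walk down the samples in order of decreasing score and average the precision observed at each true positive. The sketch below follows that step-wise definition and assumes there are no tied scores; the function name and data are hypothetical.

```python
def average_precision(y_true, scores):
    """Mean of the precision values measured at the rank of each positive sample."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    n_pos = sum(y_true)
    true_positives = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            true_positives += 1
            total += true_positives / rank  # precision at this cut-off
    return total / n_pos

# Toy scores (hypothetical, no ties).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
ap = average_precision(y_true, scores)
print(ap)
```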

Probability Output

Predicted Probability Distribution

Histograms of predicted probabilities split by true class. A good model produces well-separated distributions.

Calibration Curve (Reliability Diagram)

Mean predicted probability vs observed fraction of positives per bin. The diagonal = perfectly calibrated model. Logistic regression is typically well-calibrated.
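
The points of a reliability diagram can be computed by binning the predictions. This sketch assumes equal-width bins on [0, 1] and skips empty bins; the function name, bin count, and toy data are assumptions, not the settings used for the plot.

```python
def calibration_points(y_true, y_prob, n_bins=5):
    """Return (mean predicted probability, observed fraction of positives) per non-empty bin."""
    prob_sum = [0.0] * n_bins
    positives = [0] * n_bins
    counts = [0] * n_bins
    for y, p in zip(y_true, y_prob):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        prob_sum[b] += p
        positives[b] += y
        counts[b] += 1
    return [
        (prob_sum[b] / counts[b], positives[b] / counts[b])
        for b in range(n_bins)
        if counts[b] > 0
    ]

# Toy predictions (hypothetical).
y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.1, 0.1, 0.9, 0.9, 0.9, 0.9]
points = calibration_points(y_true, y_prob, n_bins=2)
print(points)
```

In this toy case the upper bin predicts about 0.9 on average but only 75% of its samples are positive, so that point would sit below the diagonal (over-confident).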

Feature Importance

Permutation Feature Importance

Drop in accuracy when a feature is shuffled (avg over 20 repeats). Larger drop → feature is more important to the model.
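
The shuffling procedure can be sketched as follows. The model here is a hypothetical stand-in (a fixed rule that looks only at the first feature) so that the example stays self-contained; the function names, the seed, and the data are assumptions.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows for which predict(row) matches the label."""
    return sum(predict(row) == t for row, t in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

def predict(row):
    """Hypothetical fitted model: uses feature 0 only, ignores feature 1."""
    return 1 if row[0] > 0 else 0

X = [[-3.0, 5.0], [-2.0, 1.0], [-1.0, 4.0], [-0.5, 2.0],
     [0.5, 3.0], [1.0, 1.5], [2.0, 4.5], [3.0, 0.5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

importances = permutation_importance(predict, X, y)
print(importances)
```

Shuffling the ignored feature leaves every prediction unchanged, so its importance is zero, while shuffling the feature the rule depends on lowers accuracy.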

Coefficient Magnitudes

Positive coefficient → log-odds increase (pushes toward the class). Negative → pushes away. Magnitudes are comparable across features only when the features share a scale (e.g. after standardization).
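
Why scaling matters for comparing magnitudes: if a feature is standardized as x′ = (x − mean) / std, the coefficient on x′ equals the raw coefficient times std. The sketch below applies that identity to hypothetical raw coefficients; the function name and all numbers are assumptions.

```python
import statistics

def standardized_coefs(raw_coefs, X):
    """Rescale raw coefficients to 'per standard deviation' units, column by column."""
    stds = [statistics.pstdev(row[j] for row in X) for j in range(len(raw_coefs))]
    return [b * s for b, s in zip(raw_coefs, stds)]

# Hypothetical data: feature 0 varies by about 1 unit, feature 1 by about 100 units.
X = [[1.0, 100.0], [3.0, 300.0], [1.0, 100.0], [3.0, 300.0]]
raw_coefs = [0.5, 0.02]  # per raw unit: feature 0 looks 25 times larger

std_coefs = standardized_coefs(raw_coefs, X)
print(std_coefs)
```

On the raw scale the first coefficient looks larger, but per standard deviation the second feature moves the log-odds more.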

Model Understanding

Learning Curve

Train vs validation accuracy as training set grows. Large gap → overfitting (increase regularization / get more data). Both low → underfitting (decrease regularization / add features).

Regularization Path (C sweep)

Effect of C (= 1/λ) on the model. Left = strong regularization (small C), right = weak (large C). Coefficient view: L1 drives some coefficients to exactly zero. Accuracy view: find the sweet spot between underfitting and overfitting.
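
A C sweep can be imitated on a toy problem. The sketch below fits a one-feature, no-intercept logistic model by plain gradient descent on the mean cross-entropy plus an L2 penalty of (1/2C)·β², then repeats the fit for several C values. The data, learning rate, iteration count, and function name are all assumptions chosen for illustration; the L1 path is not shown because it needs a different optimizer.

```python
import math

def fit_l2_logistic(xs, ys, C, lr=0.05, n_iter=5000):
    """Gradient descent on mean cross-entropy + (1/2C) * beta**2 (one feature, no intercept)."""
    beta = 0.0
    n = len(xs)
    for _ in range(n_iter):
        grad = sum((1.0 / (1.0 + math.exp(-beta * x)) - y) * x
                   for x, y in zip(xs, ys)) / n
        grad += beta / C  # derivative of the L2 penalty
        beta -= lr * grad
    return beta

# Toy data (hypothetical): not perfectly separable, so the coefficient stays finite.
xs = [-2.0, -1.0, 0.5, -0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

Cs = [0.1, 1.0, 10.0, 100.0]
path = [fit_l2_logistic(xs, ys, C) for C in Cs]
for C, beta in zip(Cs, path):
    print(C, beta)
```

Reading the output from small C to large C mirrors the left-to-right direction of the plot: the coefficient is shrunk toward zero under strong regularization and grows as the penalty weakens.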

Diagnostic Summary