Models the probability that an observation belongs to a class. The log-odds of the positive class is modelled as a linear combination of features; the sigmoid function squashes this to (0, 1), giving a probability. A decision boundary at threshold 0.5 separates classes.
When to use
Binary or multi-class classification where you need well-calibrated class probabilities (e.g. medical diagnosis, credit scoring). Assumes each class is (roughly) linearly separable in feature space. Add regularization (L1 / L2) when features are many or correlated.
Key strength
Outputs calibrated probabilities, not just a label. Each coefficient βⱼ is a log-odds ratio: exp(βⱼ) means "the odds multiply by exp(βⱼ) for each unit increase in xⱼ, holding the other features fixed". Limitation: linear decision boundary; fails on XOR-type data (use a kernel method or a neural net).
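To make the odds interpretation concrete, the sketch below checks numerically that raising one feature by one unit multiplies the odds by exp(βⱼ). The coefficient value and the starting score are hypothetical, picked only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Odds in favour of the positive class for probability p."""
    return p / (1.0 - p)

beta_j = 0.7   # hypothetical coefficient of feature x_j
z = 0.2        # hypothetical linear score before the change

p_before = sigmoid(z)
p_after = sigmoid(z + beta_j)   # x_j increased by one unit, other features fixed

ratio = odds(p_after) / odds(p_before)
odds_ratio = math.exp(beta_j)

print(f"odds ratio from probabilities: {ratio:.4f}")
print(f"exp(beta_j):                   {odds_ratio:.4f}")
```

The two printed values agree regardless of the starting score z, which is why the coefficient can be read on its own.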
Regularization: note the inverse relationship
C = 1/λ is the inverse of the regularization strength λ. Smaller C → stronger regularization → smaller coefficients and a simpler model. L2 shrinks all coefficients towards zero. L1 drives the coefficients of irrelevant features to exactly zero (feature selection). Elastic net blends both.
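A minimal sketch of the L1 versus L2 behaviour, assuming scikit-learn is installed and that `LogisticRegression` accepts the `penalty`, `C` and `solver="liblinear"` arguments (true of long-standing releases; newer releases may prefer other spellings). The synthetic dataset and the two C values are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption): 4 informative features, 16 pure-noise features.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=4,
    n_redundant=0, random_state=0,
)
X = StandardScaler().fit_transform(X)

def fit(penalty, C):
    # liblinear supports both the L1 and the L2 penalty.
    return LogisticRegression(penalty=penalty, C=C, solver="liblinear").fit(X, y)

def n_zero(model):
    """Number of coefficients that are exactly zero."""
    return int(np.sum(model.coef_ == 0))

l1_strong, l1_weak = fit("l1", 0.01), fit("l1", 10.0)
l2_strong, l2_weak = fit("l2", 0.01), fit("l2", 10.0)

print("L1, C=0.01: zeroed coefficients =", n_zero(l1_strong))
print("L1, C=10:   zeroed coefficients =", n_zero(l1_weak))
print("L2, C=0.01: coefficient norm =", np.linalg.norm(l2_strong.coef_))
print("L2, C=10:   coefficient norm =", np.linalg.norm(l2_weak.coef_))
```

Standardizing the features first matters here, because the penalty acts on raw coefficient size and would otherwise punish features measured in small units.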
Hypothesis, Loss & Multi-class
σ(z) = 1 / (1 + e⁻ᶻ)
z = β₀ + β₁x₁ + … + βₙxₙ
P(y=1|x) = σ(z)
Decision boundary: z = 0 ⟺ P(y=1|x) = 0.5
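The hypothesis can be sketched in a few lines of plain Python. The coefficients β₀ and β₁, β₂ used here are hypothetical; the point is only to show that a sample with z = 0 lands exactly on P = 0.5 and that the log-odds recover z.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function, split by sign to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(x, beta0, betas):
    """P(y=1|x) for one sample: sigmoid of the linear score z."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return sigmoid(z)

# Hypothetical coefficients, chosen only for illustration.
beta0, betas = -1.0, [2.0, -0.5]

p_boundary = predict_proba([0.5, 0.0], beta0, betas)   # z = -1 + 1 + 0 = 0
p_positive = predict_proba([2.0, 1.0], beta0, betas)   # z = -1 + 4 - 0.5 = 2.5

# The log-odds of the prediction give back the linear score z.
log_odds = math.log(p_positive / (1.0 - p_positive))

print(p_boundary, p_positive, log_odds)
```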
ROC Curve
True Positive Rate vs False Positive Rate as the decision threshold varies. AUC = 1 is perfect; AUC = 0.5 is random. Multi-class: one One-vs-Rest curve per class.
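AUC can also be read as the probability that a randomly chosen positive is scored above a randomly chosen negative. The sketch below computes it that way in plain Python and applies it One-vs-Rest; the labels and scores are made up for illustration, and in practice a library routine such as scikit-learn's `roc_auc_score` would normally be used instead.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of samples,
    so this is only suitable for small illustrative inputs.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ovr_auc(y_true, proba_by_class):
    """One-vs-Rest: one AUC per class, treating that class as positive."""
    return {
        c: roc_auc([1 if y == c else 0 for y in y_true], scores)
        for c, scores in proba_by_class.items()
    }

# Binary toy example (made-up scores).
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly

# Three-class toy example: per-class predicted probabilities (made up).
y_multi = ["a", "b", "c", "a"]
proba = {
    "a": [0.7, 0.2, 0.1, 0.6],
    "b": [0.2, 0.5, 0.3, 0.3],
    "c": [0.1, 0.3, 0.6, 0.1],
}
per_class = ovr_auc(y_multi, proba)
print(per_class)
```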
Precision-Recall Curve
A better diagnostic than ROC on imbalanced datasets. AP (average precision) summarizes the area under this curve. High precision together with high recall indicates an ideal classifier.
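One common way to compute AP is to walk down the samples in order of decreasing score and average the precision at each rank where a positive is found. The sketch below does this in plain Python, assuming distinct scores (ties are not handled) and using made-up data.

```python
def average_precision(y_true, scores):
    """AP = mean of the precision values at the ranks where positives occur.

    Assumes all scores are distinct; tied scores would need grouping.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    tp = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            total += tp / rank   # precision at this rank; recall rises by 1/n_pos
    return total / n_pos

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
ap = average_precision(y_true, scores)
print(round(ap, 4))  # precision is 1/1 at rank 1 and 2/3 at rank 3
```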
Probability Output
Predicted Probability Distribution
Histograms of predicted probabilities split by true class. A good model produces well-separated distributions.
Calibration Curve (Reliability Diagram)
Mean predicted probability vs fraction of actual positives per bin. The diagonal = perfectly calibrated model. Logistic regression is typically well-calibrated.
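The binning behind a reliability diagram can be sketched as follows. The function name `reliability_bins` and the simulated data are assumptions for illustration (scikit-learn ships a comparable helper, `sklearn.calibration.calibration_curve`). The labels are drawn so that the predicted probabilities are calibrated by construction, so the points should sit near the diagonal.

```python
import random

def reliability_bins(y_true, y_prob, n_bins=5):
    """Return (mean predicted probability, fraction of positives) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)   # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    mean_pred, frac_pos = [], []
    for b in bins:
        if b:
            mean_pred.append(sum(p for _, p in b) / len(b))
            frac_pos.append(sum(y for y, _ in b) / len(b))
    return mean_pred, frac_pos

# Simulated, perfectly calibrated predictions: y is 1 with probability p.
rng = random.Random(0)
y_prob = [rng.random() for _ in range(20000)]
y_true = [1 if rng.random() < p else 0 for p in y_prob]

mean_pred, frac_pos = reliability_bins(y_true, y_prob)
max_gap = max(abs(m - f) for m, f in zip(mean_pred, frac_pos))

for m, f in zip(mean_pred, frac_pos):
    print(f"mean predicted = {m:.3f}   fraction positive = {f:.3f}")
```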
Feature Importance
Permutation Feature Importance
Drop in accuracy when a feature's values are shuffled (averaged over 20 repeats). Larger drop → feature is more important to the model.
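The procedure can be sketched in plain Python for any model that maps a feature row to a label. The toy data and the fixed `model` function below are hypothetical stand-ins for a fitted classifier; scikit-learn offers the same idea as `sklearn.inspection.permutation_importance`.

```python
import random

def accuracy(model, X, y):
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)   # breaks the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy data: the label depends on feature 0 only; feature 1 is noise.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(500)]
y = [1 if row[0] > 0 else 0 for row in X]

def model(row):
    # Hypothetical fitted model: linear score 3*x0 + 0*x1, positive if score >= 0.
    return 1 if 3.0 * row[0] + 0.0 * row[1] >= 0 else 0

importances = permutation_importance(model, X, y)
print(importances)   # large drop for feature 0, no drop for feature 1
```

Measuring the drop on held-out data rather than the training set is usually preferable, since it reflects what the model relies on to generalize.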
Learning Curve
Train vs validation accuracy as training set grows. Large gap → overfitting (increase regularization / get more data). Both low → underfitting (decrease regularization / add features).
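A sketch of how such a curve might be computed, assuming scikit-learn is available and using its `learning_curve` helper with 5-fold cross-validation. The synthetic dataset and the grid of training sizes are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic data (an assumption for illustration).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4,
    n_clusters_per_class=1, random_state=0,
)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy",
)

train_mean = train_scores.mean(axis=1)   # average over the 5 folds
val_mean = val_scores.mean(axis=1)
gap = train_mean - val_mean              # a large gap suggests overfitting

for n, tr, va, g in zip(sizes, train_mean, val_mean, gap):
    print(f"n={n:4d}  train={tr:.3f}  validation={va:.3f}  gap={g:.3f}")
```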
Regularization Path (C sweep)
Effect of C (= 1/λ) on model. Left = strong regularization, right = weak. Coef view: L1 zeroes features. Accuracy view: find the sweet spot between underfitting and overfitting.
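A C sweep along these lines could look like the sketch below, assuming scikit-learn and its default L2-penalized `LogisticRegression`. The dataset, the hold-out split and the grid of seven C values are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption): 5 informative features among 30.
X, y = make_classification(
    n_samples=400, n_features=30, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, random_state=1,
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y,
)

C_grid = np.logspace(-3, 3, 7)   # left = strong regularization, right = weak
results = []
for C in C_grid:
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=2000))
    model.fit(X_tr, y_tr)
    coef_norm = np.linalg.norm(model.named_steps["logisticregression"].coef_)
    results.append((C, model.score(X_tr, y_tr), model.score(X_val, y_val), coef_norm))

for C, train_acc, val_acc, coef_norm in results:
    print(f"C={C:8.3f}  train={train_acc:.3f}  validation={val_acc:.3f}  |coef|={coef_norm:.3f}")

# The "sweet spot" is taken here as the C with the best validation accuracy.
best_C, _, best_val, _ = max(results, key=lambda r: r[2])
print("best C:", best_C, "validation accuracy:", best_val)
```

With more data, cross-validation (for example `LogisticRegressionCV`) gives a less noisy choice of C than a single hold-out split.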
Interactive — Decision Threshold
Threshold Analysis — Binary Classification
The default threshold is 0.5: predict positive if P(class) ≥ 0.5. Lowering it increases Recall (catches more positives) at the cost of Precision (more false alarms). This tradeoff is crucial in medical diagnosis, fraud detection, and other asymmetric cost settings.
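The tradeoff can be seen on a handful of predictions. The labels and probabilities below are made up for illustration; the helper simply counts true positives, false positives and false negatives at a given threshold.

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall when predicting positive for P >= threshold."""
    tp = fp = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    # Convention assumed here: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.05, 0.1, 0.2, 0.3, 0.35, 0.45, 0.55, 0.6, 0.7, 0.9]

for t in (0.3, 0.5, 0.8):
    precision, recall = precision_recall_at(y_true, y_prob, t)
    print(f"threshold={t:.1f}  precision={precision:.3f}  recall={recall:.3f}")
```

Lowering the threshold from 0.5 to 0.3 recovers the positive that was missed at 0.35, at the price of two extra false alarms.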