| # Logistic Regression | |
| References: | |
| https://www.kdnuggets.com/2020/01/guide-precision-recall-confusion-matrix.html | |
| https://developers.google.com/machine-learning/crash-course/classification/thresholding | |
| https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/ | |
| https://developers.google.com/machine-learning/crash-course/logistic-regression/model-training | |
| ### Thresholding | |
| A logistic regression model that returns 0.9995 for a particular email message is predicting that it is very likely to be spam. Conversely, another email message with a prediction score of 0.0003 on that same logistic regression model is very likely not spam. However, what about an email message with a prediction score of 0.6? In order to map a logistic regression value to a binary category, you must define a classification threshold (also called the decision threshold). A value above that threshold indicates "spam"; a value below indicates "not spam." It is tempting to assume that the classification threshold should always be 0.5, but thresholds are problem-dependent, and are therefore values that you must tune. | |
| ### Accuracy | |
| $ | |
| Accuracy = Total correct / total predictions | |
| $ | |
| Using the Confusion Matrix values | |
| $ | |
| Accuracy = TP + TN / TP + FP + TN + FN | |
| $ | |
| Accuracy alone doesn't tell the full story when you're working with a **class-imbalanced data** set, like this one, where there is a significant disparity between the number of positive and negative labels. | |
| ### Precision | |
| Precision — Also called Positive predictive value | |
| The ratio of correct positive predictions to the *total predicted positives.* | |
| $ | |
| Precision = \frac{TP}{TP + FP} | |
| $ | |
| ### Recall | |
| Recall — Also called Sensitivity, Probability of Detection, True Positive Rate | |
| The ratio of correct positive predictions to the *total positives examples.* | |
| $ | |
| Recall = \frac{TP}{TP + FN} | |
| $ | |
| ### ROC & AUC | |
| * ROC Curves summarize the trade-off between the true positive rate and false positive rate for a predictive model using different probability thresholds. | |
| * Precision-Recall curves summarize the trade-off between the true positive rate and the positive predictive value for a predictive model using different probability thresholds. | |
| * ROC curves are appropriate when the observations are balanced between each class, whereas precision-recall curves are appropriate for imbalanced datasets. | |
| ### sklearn functions | |
| https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html | |
| https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html | |
| ### What about using XGBoost for classification? | |
| We can still use XGBoost but logistic regression is linear and XGBoost is *not* linear. | |
| For example we can see here that we are drawing linear boundaries between classifications in the iris dataset. | |
| https://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html#sphx-glr-auto-examples-linear-model-plot-iris-logistic-py | |
| ### What about difference between SVM and logistic regression? | |
| ### Logistic Regression Loss function | |
| **This always trips me up because some people call it *log loss* or cross entropy or logits or something else!** | |
| The loss function for linear regression is squared loss. The loss function for logistic regression is Log Loss, which is defined as follows: | |
| $ | |
| put formula in here later | |
| $ | |
| is the data set containing many labeled examples, which are | |
| pairs. | |
| is the label in a labeled example. Since this is logistic regression, every value of | |
| must either be 0 or 1. | |
| is the predicted value (somewhere between 0 and 1), given the set of features in . | |