# Logistic Regression

References:

https://www.kdnuggets.com/2020/01/guide-precision-recall-confusion-matrix.html

https://developers.google.com/machine-learning/crash-course/classification/thresholding

https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/

https://developers.google.com/machine-learning/crash-course/logistic-regression/model-training

### Thresholding

A logistic regression model that returns 0.9995 for a particular email message is predicting that it is very likely to be spam. Conversely, another email message with a prediction score of 0.0003 on that same model is very likely not spam. But what about an email message with a prediction score of 0.6? To map a logistic regression value to a binary category, you must define a classification threshold (also called the decision threshold). A value above that threshold indicates "spam"; a value below indicates "not spam." It is tempting to assume that the classification threshold should always be 0.5, but thresholds are problem-dependent, and are therefore values that you must tune.

### Accuracy

$ Accuracy = \frac{total\ correct}{total\ predictions} $

Using the confusion matrix values:

$ Accuracy = \frac{TP + TN}{TP + FP + TN + FN} $

Accuracy alone doesn't tell the full story when you're working with a **class-imbalanced** data set, like this one, where there is a significant disparity between the number of positive and negative labels.
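The thresholding and accuracy ideas above can be sketched with sklearn. This is a minimal, illustrative example: the synthetic imbalanced data from `make_classification` and the 0.6 threshold are assumptions standing in for the spam data set, not values from this document.

```python
# Sketch: apply a tuned classification threshold instead of the default 0.5,
# then compute accuracy from the confusion matrix values.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Synthetic class-imbalanced data set (roughly 90% negatives, 10% positives).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression().fit(X, y)

# predict_proba gives P(class = 1); model.predict() would threshold at 0.5,
# but here we compare the scores against a problem-specific threshold.
scores = model.predict_proba(X)[:, 1]
threshold = 0.6
preds = (scores >= threshold).astype(int)

# Accuracy = (TP + TN) / (TP + FP + TN + FN)
tn, fp, fn, tp = confusion_matrix(y, preds).ravel()
accuracy = (tp + tn) / (tp + fp + tn + fn)
print(f"accuracy at threshold {threshold}: {accuracy:.3f}")
```

Raising the threshold trades recall for precision; the right value depends on the relative cost of false positives versus false negatives.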
### Precision

Precision — also called positive predictive value. The ratio of correct positive predictions to the *total predicted positives*.

$ Precision = \frac{TP}{TP + FP} $

### Recall

Recall — also called sensitivity, probability of detection, or true positive rate. The ratio of correct positive predictions to the *total actual positives*.

$ Recall = \frac{TP}{TP + FN} $

### ROC & AUC

* ROC curves summarize the trade-off between the true positive rate and false positive rate for a predictive model using different probability thresholds.
* Precision-recall curves summarize the trade-off between the true positive rate and the positive predictive value for a predictive model using different probability thresholds.
* ROC curves are appropriate when the observations are balanced between each class, whereas precision-recall curves are appropriate for imbalanced datasets.

### sklearn functions

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html

### What about using XGBoost for classification?

We can still use XGBoost, but note that logistic regression is linear and XGBoost is *not*. For example, we can see here that logistic regression draws linear boundaries between classes in the iris dataset.

https://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html#sphx-glr-auto-examples-linear-model-plot-iris-logistic-py

### What about the difference between SVM and logistic regression?

### Logistic Regression Loss function

**This always trips me up because some people call it *log loss* or cross entropy or logits or something else!**

The loss function for linear regression is squared loss. The loss function for logistic regression is Log Loss, which is defined as follows:

$ Log\ Loss = \sum_{(x, y) \in D} -y \log(y') - (1 - y) \log(1 - y') $

where:

* $D$ is the data set containing many labeled examples, which are $(x, y)$ pairs.
* $y$ is the label in a labeled example. Since this is logistic regression, every value of $y$ must either be 0 or 1.
* $y'$ is the predicted value (somewhere between 0 and 1), given the set of features in $x$.
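The Log Loss formula above can be computed directly in a few lines and checked against `sklearn.metrics.log_loss`. Note that sklearn reports the *mean* loss per example rather than the sum; the toy labels and predicted probabilities below are illustrative assumptions.

```python
# Sketch: Log Loss computed from the definition, compared to sklearn's value.
import numpy as np
from sklearn.metrics import log_loss

y = np.array([0, 1, 1, 0])               # true labels, each 0 or 1
y_pred = np.array([0.1, 0.8, 0.6, 0.3])  # predicted probabilities y'

# Sum over (x, y) pairs of -y*log(y') - (1 - y)*log(1 - y')
loss_sum = np.sum(-y * np.log(y_pred) - (1 - y) * np.log(1 - y_pred))

# sklearn's log_loss averages over examples rather than summing.
mean_loss = loss_sum / len(y)
print(mean_loss, log_loss(y, y_pred))
```

Confident predictions that are wrong (e.g. $y = 0$ with $y' \approx 1$) are penalized very heavily, which is why Log Loss is a better training signal for probabilities than squared loss.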
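The ROC & AUC section earlier compares ROC and precision-recall curves across thresholds; here is a minimal sketch using the `precision_recall_curve` function linked in the sklearn functions section. The synthetic imbalanced data set is an assumption for illustration.

```python
# Sketch: precision-recall and ROC curves across probability thresholds.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_curve

# Heavily imbalanced synthetic data (~95% negatives), where PR curves
# are the more informative view.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=1)
model = LogisticRegression().fit(X, y)
scores = model.predict_proba(X)[:, 1]

# Precision = TP / (TP + FP) and Recall = TP / (TP + FN) at each threshold.
precision, recall, pr_thresholds = precision_recall_curve(y, scores)

# For comparison: the ROC curve trades true positive rate against
# false positive rate across the same kind of threshold sweep.
fpr, tpr, roc_thresholds = roc_curve(y, scores)
```

Plotting `recall` against `precision` (or `fpr` against `tpr`) then visualizes the threshold trade-off described above.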