# Logistic Regression
References:
* https://www.kdnuggets.com/2020/01/guide-precision-recall-confusion-matrix.html
* https://developers.google.com/machine-learning/crash-course/classification/thresholding
* https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/
* https://developers.google.com/machine-learning/crash-course/logistic-regression/model-training
### Thresholding
A logistic regression model that returns 0.9995 for a particular email message is predicting that it is very likely to be spam. Conversely, another email message with a prediction score of 0.0003 on that same logistic regression model is very likely not spam. However, what about an email message with a prediction score of 0.6? In order to map a logistic regression value to a binary category, you must define a classification threshold (also called the decision threshold). A value above that threshold indicates "spam"; a value below indicates "not spam." It is tempting to assume that the classification threshold should always be 0.5, but thresholds are problem-dependent, and are therefore values that you must tune.
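A minimal sketch of tuning the threshold explicitly, using synthetic data from `make_classification` as a stand-in for a real spam corpus (the dataset, feature names, and the 0.6 threshold are assumptions for illustration):
```python
# Sketch: apply a custom decision threshold instead of the implicit 0.5.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in imbalanced data; in practice X, y would come from your email corpus.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# predict_proba returns one column per class; column 1 is the positive class.
scores = model.predict_proba(X_test)[:, 1]

# model.predict() implicitly thresholds at 0.5; tune the threshold yourself.
threshold = 0.6
predictions = (scores >= threshold).astype(int)  # 1 = "spam", 0 = "not spam"
```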
### Accuracy
$
\text{Accuracy} = \frac{\text{total correct predictions}}{\text{total predictions}}
$
Using the confusion matrix values:
$
\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}
$
Accuracy alone doesn't tell the full story when you're working with a **class-imbalanced dataset**, where there is a significant disparity between the number of positive and negative labels.
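A quick sketch of why (toy numbers, chosen for illustration): with 95 negatives and 5 positives, a model that always predicts "negative" still scores 95% accuracy while catching zero positives.
```python
# Sketch: accuracy is misleading on a class-imbalanced set.
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0] * 95 + [1] * 5)  # 95 negatives, 5 positives
y_pred = np.zeros(100, dtype=int)      # always predict the negative class

print(accuracy_score(y_true, y_pred))  # 0.95, despite zero true positives
```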
### Precision
Precision — Also called Positive predictive value
The ratio of correct positive predictions to the *total predicted positives.*
$
Precision = \frac{TP}{TP + FP}
$
### Recall
Recall — Also called Sensitivity, Probability of Detection, True Positive Rate
The ratio of correct positive predictions to the *total positive examples.*
$
Recall = \frac{TP}{TP + FN}
$
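A minimal sketch computing both metrics straight from the confusion matrix (the toy label arrays here are made up for illustration), cross-checked against sklearn's built-ins:
```python
# Sketch: precision and recall from the confusion matrix counts.
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("precision:", tp / (tp + fp))  # TP / (TP + FP)
print("recall:   ", tp / (tp + fn))  # TP / (TP + FN)

# Same values via sklearn's built-ins:
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```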
### ROC & AUC
* ROC Curves summarize the trade-off between the true positive rate and false positive rate for a predictive model using different probability thresholds.
* Precision-Recall curves summarize the trade-off between the true positive rate and the positive predictive value for a predictive model using different probability thresholds.
* ROC curves are appropriate when the observations are balanced between each class, whereas precision-recall curves are appropriate for imbalanced datasets.
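A minimal sketch plotting both curves with scikit-learn, reusing `y_test` and `scores` from the thresholding sketch above (assumed still in scope):
```python
# Sketch: ROC and precision-recall curves side by side.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve

fpr, tpr, roc_thresholds = roc_curve(y_test, scores)
precision, recall, pr_thresholds = precision_recall_curve(y_test, scores)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(fpr, tpr)
ax1.set(xlabel="False Positive Rate", ylabel="True Positive Rate",
        title=f"ROC (AUC = {roc_auc_score(y_test, scores):.2f})")
ax2.plot(recall, precision)
ax2.set(xlabel="Recall", ylabel="Precision", title="Precision-Recall")
plt.show()
```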
### sklearn functions
* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
* https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html
### What about using XGBoost for classification?
We can still use XGBoost for classification, but keep in mind that logistic regression learns a *linear* decision boundary while XGBoost, as an ensemble of trees, does *not*.
For example, the scikit-learn demo linked below shows logistic regression drawing linear boundaries between the classes in the iris dataset:
https://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html#sphx-glr-auto-examples-linear-model-plot-iris-logistic-py
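A minimal sketch contrasting the two models on iris (assumes the `xgboost` package is installed; `XGBClassifier` is its scikit-learn-style wrapper):
```python
# Sketch: linear logistic regression vs. non-linear XGBoost on iris.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = load_iris(return_X_y=True)

linear_model = LogisticRegression(max_iter=1000)  # linear decision boundaries
tree_model = XGBClassifier()                      # tree splits, non-linear boundaries

print(cross_val_score(linear_model, X, y, cv=5).mean())
print(cross_val_score(tree_model, X, y, cv=5).mean())
```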
### What about the difference between SVM and logistic regression?
Both learn linear decision boundaries in their basic forms, but logistic regression minimizes log loss and outputs calibrated probabilities, while a linear SVM minimizes hinge loss and maximizes the margin between classes; see the sketch below.
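A minimal sketch of that practical difference: `LinearSVC` has no `predict_proba` because hinge loss doesn't yield probabilities, so it exposes `decision_function` instead (synthetic data again stands in for a real problem).
```python
# Sketch: probabilistic output (logistic regression) vs. margin output (SVM).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(random_state=0)

log_reg = LogisticRegression().fit(X, y)  # minimizes log loss
svm = LinearSVC().fit(X, y)               # minimizes hinge loss (max margin)

print(log_reg.predict_proba(X[:3]))  # class probabilities between 0 and 1
print(svm.decision_function(X[:3]))  # signed margin distances, not probabilities
```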
### Logistic Regression Loss function
**The naming always trips me up: *log loss* and (binary) *cross-entropy* are two names for the same loss, while *logits* are something else entirely, the raw pre-sigmoid model outputs!**
The loss function for linear regression is squared loss. The loss function for logistic regression is Log Loss, which is defined as follows:
$
\text{Log Loss} = \sum_{(x, y) \in D} -y \log(y') - (1 - y) \log(1 - y')
$
where:
* $D$ is the data set containing many labeled examples, which are $(x, y)$ pairs.
* $y$ is the label in a labeled example. Since this is logistic regression, every value of $y$ must either be 0 or 1.
* $y'$ is the predicted value (somewhere between 0 and 1), given the set of features in $x$.
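A minimal sketch of the formula above in NumPy, checked against sklearn (the toy labels and probabilities are made up; note that `log_loss` reports the *mean* over examples rather than the sum):
```python
# Sketch: log loss computed by hand vs. sklearn.metrics.log_loss.
import numpy as np
from sklearn.metrics import log_loss

y = np.array([1, 0, 1, 1, 0])                   # true labels, each 0 or 1
y_prime = np.array([0.9, 0.2, 0.7, 0.6, 0.1])   # predicted probabilities

# Sum over the (x, y) pairs in D, exactly as in the formula above:
manual = np.sum(-y * np.log(y_prime) - (1 - y) * np.log(1 - y_prime))

# sklearn averages instead of summing, so divide by len(y) to match:
print(manual / len(y), log_loss(y, y_prime))
```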