ml_courses / code /classification.md

Upload 80 files

8938d1b verified 12 months ago

3.67 kB

	# Logistic Regression

	References:
	https://www.kdnuggets.com/2020/01/guide-precision-recall-confusion-matrix.html
	https://developers.google.com/machine-learning/crash-course/classification/thresholding
	https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/

	https://developers.google.com/machine-learning/crash-course/logistic-regression/model-training

	### Thresholding

	A logistic regression model that returns 0.9995 for a particular email message is predicting that it is very likely to be spam. Conversely, another email message with a prediction score of 0.0003 on that same logistic regression model is very likely not spam. However, what about an email message with a prediction score of 0.6? In order to map a logistic regression value to a binary category, you must define a classification threshold (also called the decision threshold). A value above that threshold indicates "spam"; a value below indicates "not spam." It is tempting to assume that the classification threshold should always be 0.5, but thresholds are problem-dependent, and are therefore values that you must tune.

	### Accuracy

	$
	Accuracy = Total correct / total predictions
	$

	Using the Confusion Matrix values

	$
	Accuracy = TP + TN / TP + FP + TN + FN
	$

	Accuracy alone doesn't tell the full story when you're working with a class-imbalanced data set, like this one, where there is a significant disparity between the number of positive and negative labels.

	### Precision

	Precision — Also called Positive predictive value
	The ratio of correct positive predictions to the total predicted positives.

	$
	Precision = \frac{TP}{TP + FP}
	$


	### Recall

	Recall — Also called Sensitivity, Probability of Detection, True Positive Rate

	The ratio of correct positive predictions to the total positives examples.

	$
	Recall = \frac{TP}{TP + FN}
	$

	### ROC & AUC

	* ROC Curves summarize the trade-off between the true positive rate and false positive rate for a predictive model using different probability thresholds.

	* Precision-Recall curves summarize the trade-off between the true positive rate and the positive predictive value for a predictive model using different probability thresholds.

	* ROC curves are appropriate when the observations are balanced between each class, whereas precision-recall curves are appropriate for imbalanced datasets.


	### sklearn functions

	https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
	https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html


	### What about using XGBoost for classification?

	We can still use XGBoost but logistic regression is linear and XGBoost is not linear.

	For example we can see here that we are drawing linear boundaries between classifications in the iris dataset.

	https://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html#sphx-glr-auto-examples-linear-model-plot-iris-logistic-py


	### What about difference between SVM and logistic regression?


	### Logistic Regression Loss function

	*This always trips me up because some people call it log loss* or cross entropy or logits or something else!**

	The loss function for linear regression is squared loss. The loss function for logistic regression is Log Loss, which is defined as follows:
	$
	put formula in here later
	$

	is the data set containing many labeled examples, which are
	pairs.
	is the label in a labeled example. Since this is logistic regression, every value of
	must either be 0 or 1.
	is the predicted value (somewhere between 0 and 1), given the set of features in .