https://stats.stackexchange.com/questions/11859/what-is-the-difference-between-multiclass-and-multilabel-problem
https://machinelearningmastery.com/choose-an-activation-function-for-deep-learning/
https://web.stanford.edu/~nanbhas/blog/sigmoid-softmax/

- Binary classification: one node, sigmoid activation.
- Multiclass classification: one node per class, softmax activation.
- Multilabel classification: one node per class, sigmoid activation.

Multi-class vs. binary-class is the question of how many classes your classifier models. In theory a binary classifier is much simpler than a multi-class one, so it's important to make this distinction. For example, a support vector machine (SVM) can trivially learn a hyperplane to separate two classes, but three or more classes make the problem more complex. In neural networks, we commonly use sigmoid for binary classification but softmax for multi-class as the last layer of the model.

Multi-label vs. single-label is the question of how many classes any one object or example can belong to. In neural networks, if a single label is needed we use a single softmax layer as the last layer, thus learning one probability distribution that spans all classes. If multi-label classification is needed, we use one sigmoid per class on the last layer, thus learning a separate distribution for each class. (Both setups appear in the PyTorch sketch at the end of this section.)

# Cross-Entropy or Log Likelihood in Output Layer (StackExchange)

*Negative log likelihood* is also known as multi-class cross-entropy.

### All of the normal loss functions and their applications in PyTorch

https://neptune.ai/blog/pytorch-loss-functions
https://medium.com/deeplearningmadeeasy/negative-log-likelihood-6bd79b55d8b6

Negative log likelihood is a cost function used as the loss for machine learning models, telling us how badly the model is performing; the lower, the better. I'm going to explain it word by word; hopefully that will make it easier to understand.

Negative: obviously means multiplying by -1. What do we negate? The log likelihood our model assigns to the correct category. Most machine learning frameworks only offer minimization optimizers, but we want to maximize the probability of choosing the correct category. We can **maximize by minimizing the negative log likelihood**: there you have it, maximizing by minimizing. It is also much easier to reason about the loss this way, consistent with the convention that loss functions approach 0 as the model gets better. For a single example with true class y and predicted probability p(y), the loss is -log p(y).

Cross-entropy loss is the same as negative log likelihood.

NLL carries the negative sign because the probabilities (or likelihoods) vary between zero and one, and the logarithms of values in this range are negative. Negating them makes the final loss value positive.
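To make the three output-layer setups above concrete, here is a minimal PyTorch sketch; the batch size, feature count, and variable names are made up for illustration:

```python
import torch
import torch.nn as nn

num_features, num_classes = 16, 5
x = torch.randn(8, num_features)                         # hypothetical batch of 8 examples

# Binary classification: one node, sigmoid activation.
binary_head = nn.Linear(num_features, 1)
p_binary = torch.sigmoid(binary_head(x))                 # shape (8, 1), each value in (0, 1)

# Multiclass classification: one node per class, softmax activation.
multiclass_head = nn.Linear(num_features, num_classes)
p_multiclass = torch.softmax(multiclass_head(x), dim=1)  # each row sums to 1

# Multilabel classification: one node per class, sigmoid activation.
multilabel_head = nn.Linear(num_features, num_classes)
p_multilabel = torch.sigmoid(multilabel_head(x))         # independent probability per class

print(p_multiclass.sum(dim=1))  # all ones: a single distribution across classes
print(p_multilabel.sum(dim=1))  # need not sum to 1: a separate distribution per class
```

In practice the activation is usually folded into the loss for numerical stability: `nn.BCEWithLogitsLoss` covers the two sigmoid cases and `nn.CrossEntropyLoss` covers the softmax case, and both take the raw logits rather than the activated outputs.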
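The claim that cross-entropy loss is the same as negative log likelihood can be checked directly in PyTorch, where `nn.CrossEntropyLoss` is log-softmax followed by `nn.NLLLoss`. A small sketch with made-up data:

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 3)              # 4 examples, 3 classes (made-up data)
targets = torch.tensor([0, 2, 1, 2])    # index of the correct class per example

# CrossEntropyLoss takes raw logits, applies log-softmax internally, then
# averages the negative log likelihood of the correct classes.
ce = nn.CrossEntropyLoss()(logits, targets)

# Doing the two steps explicitly gives the same number.
log_probs = nn.LogSoftmax(dim=1)(logits)  # log of values in (0, 1) is negative...
nll = nn.NLLLoss()(log_probs, targets)    # ...so negating it yields a positive loss

print(ce.item(), nll.item())
assert torch.allclose(ce, nll)            # identical up to floating-point error
```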