Suppose we want to train a machine learning model on a binary classification problem. A standard way of measuring model performance here is called log-loss or binary cross entropy (I will refer to this as cross entropy throughout this post). This means that given the task of predicting some binary label $y \in \{0, 1\}$, rather than outputting a hard 0 / 1 for the predicted class, one outputs a probability, $\hat{y}$ say. Then the cross entropy score of the model over $N$ samples is

$$-\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right].$$
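To make the definition concrete, here is a minimal sketch of the mean binary cross entropy in plain Python. The function name `binary_cross_entropy`, the clipping constant `eps` and the toy labels are my own illustrative choices, not anything taken from the competition.

```python
import math


def binary_cross_entropy(labels, probs, eps=1e-15):
    """Mean binary cross entropy (log-loss) of predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        # Clip so that a prediction of exactly 0 or 1 does not give log(0).
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]  # mostly right, and fairly sure about it
hedging = [0.5, 0.5, 0.5, 0.5]    # refuses to commit either way

print(binary_cross_entropy(labels, confident))
print(binary_cross_entropy(labels, hedging))  # equals log(2)
```

Lower is better: the confident (and correct) predictions score below the all-0.5 predictions, which always score exactly log 2 whatever the labels are.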
We will explain roughly where this loss comes from in the next section. Now suppose that the test set actually has a different proportion of positives to negatives from the training set. This is not a hypothetical scenario: it is exactly what competitors in the recently added ‘Quora question pair challenge’ are faced with. This post explains why the nature of cross entropy makes this a problematic setup (something I and other posters pointed out), and offers a theoretical solution. The same problem could also come up where the proportion of positives changes over time (and this is known), but the cross-entropy score is still to be used. Some posters on the Kaggle discussion boards mentioned attempts to convert training set predictions to test set predictions, but to my knowledge there is no serious published analysis of it so far, so here goes…
Cross-entropy and class imbalance problems
Cross entropy is a loss function that derives from information theory. One way to think about it is how much extra information is required to derive the label set from the predicted set. This is how it is explained on the Wikipedia page, for example. In my opinion, a more intuitive way to view it is as a loss function that rewards the model for being ‘honest’ about how probable it believes labels to be. Say our predictive model believes that there is a probability $p$ of a given label being positive. What value $\hat{y}$ should we output to minimise cross entropy loss? Well, if we really believe that there is a chance $p$ that a label is positive, then our best estimate for our loss is

$$L(\hat{y}) = -p \log \hat{y} - (1 - p) \log (1 - \hat{y}).$$

So we differentiate this function with respect to $\hat{y}$, and set it to zero (to find the minimum). We get

$$-\frac{p}{\hat{y}} + \frac{1 - p}{1 - \hat{y}} = 0,$$

and thus can verify that the value of $\hat{y}$ which minimises the loss is $\hat{y} = p$.
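The claim that honesty is optimal can also be checked numerically. The sketch below assumes an illustrative belief of 0.3 (my choice, any value strictly between 0 and 1 would do) and searches a grid of candidate outputs for the one with the smallest expected loss.

```python
import math


def expected_loss(q, p):
    """Expected cross entropy of outputting q when the label is positive with probability p."""
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))


p = 0.3  # illustrative belief, not a value from the post
grid = [i / 1000 for i in range(1, 1000)]  # candidate outputs 0.001 ... 0.999
best_q = min(grid, key=lambda q: expected_loss(q, p))

print(best_q)  # the grid point closest to p
```

Outputting anything other than the believed probability, in either direction, increases the expected loss.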
Now let’s think about why class proportions being different between the training and test set is problematic for cross entropy loss. Suppose we just wanted to take the most naive possible model, where we output the same value for every label. By the above discussion, the single value that will optimise loss on the training set is $P(y = 1)$, i.e. the probability that a randomly chosen training label is positive, but it will only also minimise the loss on the test set if this probability is the same there, i.e. if positives are equally likely in the training and test sets. This is our ‘prior’ in Bayesian-speak.
Moreover, more complicated models will tend to gravitate around this prior when they are very ‘unsure’, i.e. when they don’t glean any extra information about the label from the training features, and they will be punished for doing this if training/test class balances are not equal.
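The cost of carrying the wrong prior over to the test set can be sketched with the naive constant model described above. I use the class balances quoted for the Quora data later in this post (37% positives in training, an estimated 16.5% in test); the function name is hypothetical.

```python
import math


def constant_prediction_loss(q, positive_rate):
    """Cross entropy of predicting q for every label when a fraction positive_rate is positive."""
    return -(positive_rate * math.log(q) + (1 - positive_rate) * math.log(1 - q))


train_rate, test_rate = 0.37, 0.165

# Predict the training prior everywhere, but get scored on the test balance.
loss_using_train_prior = constant_prediction_loss(train_rate, test_rate)
# Predict the test prior everywhere instead.
loss_using_test_prior = constant_prediction_loss(test_rate, test_rate)

print(loss_using_train_prior)  # about 0.55
print(loss_using_test_prior)   # about 0.45
```

So even a model that has learned nothing at all gives up roughly 0.1 of log-loss simply by sitting on the training prior rather than the test prior.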
Note that there are other evaluation metrics available that are less sensitive to this class imbalance problem, for example area under the ROC curve (AUC).
Converting predictions using Bayes’ theorem
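Let’s suppose our training set is drawn from a distribution $\mathcal{D}$, and our test set is drawn from $\mathcal{D}'$. Our assumption at this point will be that the only difference between these two distributions is that they happen to have different proportions of positives and negatives, i.e. $P_{\mathcal{D}}(x \mid y = 1) = P_{\mathcal{D}'}(x \mid y = 1)$ and $P_{\mathcal{D}}(x \mid y = 0) = P_{\mathcal{D}'}(x \mid y = 0)$.

Suppose we have some sample $x$. Our model is trying to estimate $P_{\mathcal{D}}(A \mid x)$, where (abusing notation for now) $A$ is the event that the label is positive (and $A^c$ is the event that the label is negative). Suppose our model’s best estimate of this is $p$. By Bayes’ theorem, we have

$$p = P_{\mathcal{D}}(A \mid x) = \frac{P_{\mathcal{D}}(x \mid A) P_{\mathcal{D}}(A)}{P_{\mathcal{D}}(x \mid A) P_{\mathcal{D}}(A) + P_{\mathcal{D}}(x \mid A^c) P_{\mathcal{D}}(A^c)} = \frac{\alpha}{\alpha + \beta},$$

say, where $\alpha = P_{\mathcal{D}}(x \mid A) P_{\mathcal{D}}(A)$ and $\beta = P_{\mathcal{D}}(x \mid A^c) P_{\mathcal{D}}(A^c)$. Now suppose that the same $x$ was instead sampled from $\mathcal{D}'$, and we are now trying to estimate $P_{\mathcal{D}'}(A \mid x)$. We suppose that $\mathcal{D}'$ is the same as $\mathcal{D}$ except the positives have been oversampled by a ratio $a$, and the negatives by a ratio $b$, i.e. $P_{\mathcal{D}'}(A) = a P_{\mathcal{D}}(A)$ and $P_{\mathcal{D}'}(A^c) = b P_{\mathcal{D}}(A^c)$. As noted, conditional on $A$, $\mathcal{D}$ and $\mathcal{D}'$ are identical, so that $P_{\mathcal{D}'}(x \mid A) = P_{\mathcal{D}}(x \mid A)$, and likewise conditional on $A^c$. So then:

$$P_{\mathcal{D}'}(A \mid x) = \frac{P_{\mathcal{D}'}(x \mid A) P_{\mathcal{D}'}(A)}{P_{\mathcal{D}'}(x \mid A) P_{\mathcal{D}'}(A) + P_{\mathcal{D}'}(x \mid A^c) P_{\mathcal{D}'}(A^c)} = \frac{a \alpha}{a \alpha + b \beta}.$$

Now from the equation $p = \alpha / (\alpha + \beta)$, we have that $\beta = \alpha (1 - p) / p$, and so

$$P_{\mathcal{D}'}(A \mid x) = \frac{a \alpha}{a \alpha + b \alpha (1 - p) / p} = \frac{a p}{a p + b (1 - p)}.$$

Thus a link function mapping from probabilities in the training set to probabilities in the test set is

$$f(p) = \frac{a p}{a p + b (1 - p)}.$$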
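As a minimal sketch, the link function is a one-liner in Python. Here `p` is the model's probability under the training class balance, `a` is the ratio of the test positive rate to the training positive rate, and `b` is the same ratio for the negatives; the function name and the example values are illustrative only.

```python
def convert_probability(p, a, b):
    """Map a probability estimated under the training class balance to the test balance.

    p: model's probability that the label is positive (training distribution).
    a: assumed ratio (test positive rate) / (training positive rate).
    b: assumed ratio (test negative rate) / (training negative rate).
    """
    return a * p / (a * p + b * (1.0 - p))


# With a == b the class balance is unchanged, so the prediction is left alone.
print(convert_probability(0.3, 1.0, 1.0))
# Fewer positives and more negatives in the test set pull the prediction down.
print(convert_probability(0.3, 0.5, 1.25))
```

Note that only the ratio of `a` to `b` matters, since any common factor cancels between numerator and denominator.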
Further work could be done to estimate how uncertainty on the training-set estimate $p$ affects the uncertainty on the converted estimate $f(p)$, but I’m not going to pursue that here.
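Note that one can also derive the formula $f(p) = \frac{a p}{a p + b (1 - p)}$ by trying to optimise the loss function

$$-a p \log q - b (1 - p) \log (1 - q)$$

with respect to $q$: setting the derivative $-\frac{a p}{q} + \frac{b (1 - p)}{1 - q}$ to zero gives $q = f(p)$. This suggests that we can optimise the weighted loss function

$$-\sum_i \left[ a \, y_i \log \hat{y}_i + b \, (1 - y_i) \log (1 - \hat{y}_i) \right]$$

on $\mathcal{D}$, to optimise the cross entropy on $\mathcal{D}'$, which is another way to derive $f$ (though slightly less insightful, in my opinion).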
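The weighted-loss view can be sanity-checked on a toy training set. The sketch below assumes 37 positives and 63 negatives with class weights `a = 0.165 / 0.37` and `b = 0.835 / 0.63` (the Quora figures used in this post), and looks for the best constant prediction under the weighted loss; all names are hypothetical.

```python
import math


def weighted_cross_entropy(labels, probs, a, b):
    """Cross entropy with positives weighted by a and negatives weighted by b."""
    total = 0.0
    for y, q in zip(labels, probs):
        total += -(a * y * math.log(q) + b * (1 - y) * math.log(1 - q))
    return total / len(labels)


labels = [1] * 37 + [0] * 63  # toy training set with 37% positives
a = 0.165 / 0.37              # positives are rarer in the test set
b = (1 - 0.165) / (1 - 0.37)  # negatives are more common in the test set

grid = [i / 1000 for i in range(1, 1000)]
best_q = min(
    grid,
    key=lambda q: weighted_cross_entropy(labels, [q] * len(labels), a, b),
)

print(best_q)  # lands on the test prior rather than the training prior
```

The best constant prediction under the weighted training loss is the test-set prior of 0.165 rather than the training prior of 0.37, which is exactly what we want a model tuned for the test set to do.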
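As an example let’s go back to the original Quora dataset. It is (currently) believed that the training set has 37% positives, whereas the test set only has 16.5% positives. From the above discussion, we take $a = 0.165 / 0.37$ and $b = (1 - 0.165) / (1 - 0.37)$, and then the link function $f(p) = \frac{a p}{a p + b (1 - p)}$ looks like so:

[Figure: $f(p)$ plotted against $p$ on $[0, 1]$, with marked lines at $p = 0.37$ and $f(p) = 0.165$]

Here we have marked lines to confirm that $f(0.37) = 0.165$. Our function also has some desirable properties that a simple linear scaling could not have, e.g. if our $p$ is very close to 0 or 1, then $f(p)$ will be too.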
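For readers who want to reproduce the numbers, here is a small sketch that builds the link function directly from the two class balances. The helper `make_link` is hypothetical, and the 37% and 16.5% figures are the (estimated) Quora values quoted above.

```python
def make_link(train_positive_rate, test_positive_rate):
    """Build the train-to-test link function from the two positive rates."""
    a = test_positive_rate / train_positive_rate
    b = (1.0 - test_positive_rate) / (1.0 - train_positive_rate)

    def link(p):
        return a * p / (a * p + b * (1.0 - p))

    return link


quora_link = make_link(0.37, 0.165)

# Tabulate the conversion for a few training-set probabilities.
for p in (0.01, 0.37, 0.5, 0.9, 0.99):
    print(f"{p:.2f} -> {quora_link(p):.4f}")
```

The training prior 0.37 is sent to the test prior 0.165, while predictions near 0 or 1 stay near 0 or 1.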
Also note that we relied heavily on the assumption that the positives/negatives in the training distribution $\mathcal{D}$ were equidistributed with the positives/negatives in the test distribution $\mathcal{D}'$ respectively. If that is not true, then this analysis could be of limited use!