Cross entropy and training-test class imbalance


Suppose we want to train a machine learning model on a binary classification problem.  A standard way of measuring model performance here is called log-loss or binary cross entropy (I will refer to this as cross entropy throughout this post).  This means that given the task to predict some binary label y, rather than outputting a hard 0 / 1 to the predicted classes, one outputs a probability, \hat{y} say.  Then the cross entropy score of the model is

\sum_i -y_i\log\hat{y_i} - (1-y_i)\log(1-\hat{y_i}).

We will explain roughly where this loss comes from in the next section.  Now suppose that the test set actually has a different proportion of positives to negatives to the training set.  This is not a hypothetical scenario. this is exactly what competitors of the recently added ‘Quora question pair challenge‘ are faced with.  This post is to explain why the nature of cross entropy makes this is a problematic setup (something I, and other posters pointed out), and a theoretical solution.  This problem could also come up where the proportion of positives changes over time (and this is known), but the training cross-entropy score is to be used. Some posters on the Kaggle discussion boards mentioned attempts to convert training set predictions to test set predictions, but to my knowledge there is no serious published analysis on it so far, so here goes…

Continue reading