Cross entropy and training-test class imbalance


Suppose we want to train a machine learning model on a binary classification problem.  A standard way of measuring model performance here is called log-loss or binary cross entropy (I will refer to this as cross entropy throughout this post).  This means that given the task of predicting some binary label y, rather than outputting a hard 0 / 1 as the predicted class, one outputs a probability, \hat{y} say.  Then the cross entropy score of the model is

\sum_i \left[ -y_i\log\hat{y}_i - (1-y_i)\log(1-\hat{y}_i) \right].
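As a rough illustration of the formula above, here is a minimal NumPy sketch.  The function name `cross_entropy` and the clipping constant `eps` are my own assumptions (the clipping is only there to avoid taking log of 0), and the score is summed rather than averaged, to match the formula as written.

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-15):
    """Summed binary cross entropy between labels y and probabilities y_hat.

    eps is an assumed safety margin: predictions are clipped to
    [eps, 1 - eps] so that log(0) never occurs.
    """
    y = np.asarray(y, dtype=float)
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1 - eps)
    return float(np.sum(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)))

labels = [1, 0, 1, 1]
# Predictions that lean towards the correct label...
confident = cross_entropy(labels, [0.9, 0.1, 0.8, 0.7])
# ...versus predicting 0.5 everywhere, which costs log(2) per example.
hedged = cross_entropy(labels, [0.5, 0.5, 0.5, 0.5])
```

Predictions that lean the right way score lower than the uninformative 0.5 guess, and a confident wrong prediction is penalised heavily because of the logarithm.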

We will explain roughly where this loss comes from in the next section.  Now suppose that the test set actually has a different proportion of positives to negatives from the training set.  This is not a hypothetical scenario: it is exactly what competitors in the recently added ‘Quora question pair challenge’ are faced with.  This post explains why the nature of cross entropy makes this a problematic setup (something I and other posters pointed out), and offers a theoretical solution.  This problem could also come up where the proportion of positives changes over time (and this is known), but the training cross-entropy score is to be used.  Some posters on the Kaggle discussion boards mentioned attempts to convert training set predictions to test set predictions, but to my knowledge there is no serious published analysis of it so far, so here goes…


DeepRhyme (D-Prime) – generating dope rhymes with deep learning


Disclaimer : I am a white boy, straight outta Cambridge, UK.  I do not condone the nastier, sometimes misogynistic language in this post, but in the interest of keeping it real, I have not made any effort to omit it either :b

Mobile users : you probably want to turn your screen sideways


The following was completely generated, unedited, by a rapbot (minus the title).

House Full of Bricks - D-Prime

pump it up , just throw yo hands in the air
this goes out to all my niggaz that don't dare
i got a whole lot of diamonds in yo veins
about havin thangs gold rings , and my platinum chain
that's why all these bitch - ass niggaz get smoked
they don't give a fuck , that shit ain't no joke
nigga this goes out to all them crooked ass cops
since back in the days when i was getting popped
always had a good time for me to be fine
i've lost my mind , and i'm still gon shine
but i don't wanna see a nigga on the grind
we ain't trying to be ready for war or crime
check this out , a lot of people are scarred
and it's my time when it comes to the stars
cause i was born with a house full of bricks
yeah , we can see it all in the mix
but now it's hard for me to beg and feed
i gotta wake up , so take away my seed

Ever since I listened to Illmatic as a youngster, I’ve loved hip hop, and the amazing creativity of (some) rap lyrics.  I also love machine learning.  Over a year ago, I decided I would write a machine learning algorithm that could ingest rap lyric text, then generate lyrics of its own automatically.  This was hard, and I gave up.  Twice.  After many, many iterations, I eventually came up with a model that could produce the above.  The following is a brief description of how this was achieved; the full gory technical details, as well as code, will be written up in a later post.

but right now i'm just trying to make it nice
this is my life , you can pay the price
i ain't gotta wait til it's time to take flight
have a party all night , everything's gonna be alright
so now do you really wanna get with me tonight
it ain't no need to talk about what i write



Enhancing images using Deep Convolutional Generative Adversarial Networks (DCGANs)

(edit1 : this got to the top of r/machinelearning, check out the comments for some discussion)

(edit2 : code for this project can now be found at this repo, discussion has been written up here)

Recently a very impressive paper came out which produced some extremely life-like images generated from a neural network.  Since I wrote a post about convolutional autoencoders about 9 months ago, I have been thinking about the problem of how one could upscale or ‘enhance’ an image, CSI-style, by using a neural network to fill in the missing pixels.  I was therefore very interested while reading this paper, as one of its predecessors was attempting to do just that (albeit in a much more extreme way, generating images from essentially 4 pixels).

Using this as inspiration, I built a neural network with the DCGAN structure in Theano, and trained it on a large set of images of celebrities.  Here are some random outputs: the original images are on the left, the grainy images fed into the neural network are in the middle, and the outputs are on the right.


N.B. this image is large, so you should open it and zoom in to really see the detail / lack of detail produced by the DCGAN.  I certainly do not claim the DCGAN did phenomenally well on pixel-level detail (although occasionally I’d say it did a pretty impressive job, particularly with things like hair).


Convolutional autoencoders in python/theano/lasagne

If you are just looking for code for a convolutional autoencoder in python, look at this git.  It needs quite a few python dependencies; the only non-standard ones are theano, nolearn, and lasagne (make sure they are up to date).  There is also a section at the end of this post that explains it.


I have recently been working on a project to teach a neural network to count the number of blobs in an image, such as the one below :


I will save the why, and details of this task for a later post.  One of the methods I have been looking at is using autoencoders.  This means we build a network which takes an image as an input, and tries to reconstruct the same image as an output.  Of course, the identity function would do this exactly, but we would not have learned anything interesting about the image that way.  There are several methods to prevent this, such as adding some sort of randomness to the image, see here or here for example.  We will use an easier method, which is to make one of the layers of the network quite narrow.  This means that the network must compress all the data from the image to a small vector from which it must reconstruct the image.  We hope that this forces the autoencoder to learn useful features about the image.

The most powerful tool we have to help networks learn about images is convolutional layers.  This post has a good explanation of how they work.  It seems natural to try to use convolutional layers for our autoencoder, and indeed there is some work in the literature about this, see for example this paper of Masci et al.  Unfortunately there is not really all that much online to help someone get started with building one of these structures, so having built one myself, I have provided some code.  Our network has the following structure :


There are a lot of things we can change about this skeleton model.  After a lot of trial and error, I arrived at a net with one convolutional / pooling layer and one deconvolution / unpooling layer, both with filter size 7.  The narrow encoding layer has 64 units.  The model trains on an Nvidia 900 series GPU in roughly 20 minutes.


Here are some of the inputs / outputs of the autoencoder.  We see it does a fairly decent job of getting the position and size of the circles, though they are not fantastically circular.


We can run pretty much the same architecture on lots of datasets : here is MNIST with 64 units on the narrow layer :


Code discussion

Code can be found at this git, which works on the popular MNIST handwritten digit dataset.

The convolution stage of the network is straightforward to build with neural network libraries, such as caffe, torch7, pylearn etc.  I have done all of my work on neural networks in Theano, a python library that can work out the gradient steps involved in training, and compile to CUDA, which can be run on a GPU for large speed gains over CPUs.  Recently I have been using the lasagne library built on Theano, to help write layers for neural nets, and nolearn, which has some nice classes to help with the training code, which is generally quite long and messy in Theano.  My code is written with these libraries, however it should be reasonably straightforward to convert it into code that relies only on Theano.


There are some fancy things one can do here in undoing the pooling operation, however in our net we just do a simple upsampling.  That is to say our unpooling operation looks like this :
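To make the simple upsampling concrete, here is a minimal NumPy sketch (not the lasagne layer used in the actual code).  The function name `unpool2d` and the factor of 2 are assumptions for illustration: each value is simply repeated over a `factor` x `factor` block.

```python
import numpy as np

def unpool2d(x, factor=2):
    """Upsample the last two axes of x by repeating every entry.

    This is a plain nearest-neighbour upsampling, written as a hypothetical
    stand-in for an unpooling layer; it does not remember where the
    maximum was in the original pooling window.
    """
    x = np.asarray(x)
    rows_repeated = np.repeat(x, factor, axis=-2)
    return np.repeat(rows_repeated, factor, axis=-1)

pooled = np.array([[1, 2],
                   [3, 4]])
upsampled = unpool2d(pooled)
# upsampled is the 4 x 4 array
# [[1 1 2 2]
#  [1 1 2 2]
#  [3 3 4 4]
#  [3 3 4 4]]
```

Because the last two axes are used, the same function should also work on a batch of feature maps with shape (batch, channels, height, width).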


Unpooling certainly seems to help the deconvolutional step.

Deconvolution layer = Convolutional layer

Let’s consider 1-dimensional convolutional layers.  The following picture is supposed to represent one filter of length 3 between two layers :

The connections of a single colour all share a single weight.  Note this picture is completely symmetric.  That is to say we can consider this as a convolution upwards or downwards.  There are two important points to note however.  Firstly, for deconvolution we need to use a symmetric activation function (definitely NOT rectified linear units).  Secondly, the picture is not symmetric at the edges of the diagram.  There are two main border methods used in convolution layers: ‘valid’, where we only take the inputs from places where the filter can completely fit inside the border (which causes the output dimensions to go down), and ‘full’, where we allow the filter to cross the edges of the picture (which causes the output dimensions to go up).  I chose to use valid borders for the convolutional and deconvolutional step, but note that this means we have to be careful about the sizes of the layers at each step.
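The 1-dimensional picture can be checked numerically.  The following is a small NumPy sketch, separate from the Theano code for this post; the sizes `n = 8` and `k = 3` are arbitrary choices, and `np.convolve` flips the filter, which may differ from the convention in a given neural network library.  It shows the output lengths for the two border modes, and that reading a ‘valid’ convolution in the opposite direction (the transposed linear map) is a ‘full’ convolution with the reversed filter.

```python
import numpy as np

n, k = 8, 3                      # assumed layer width and filter length
rng = np.random.default_rng(0)
x = rng.normal(size=n)           # activations of the lower layer
w = rng.normal(size=k)           # one shared filter

valid = np.convolve(x, w, mode="valid")   # length n - k + 1: shrinks
full = np.convolve(x, w, mode="full")     # length n + k - 1: grows

# Write the 'valid' convolution as a matrix A, so that A @ x == valid.
# Column j of A is the response to the j-th basis vector.
A = np.stack([np.convolve(e, w, mode="valid") for e in np.eye(n)], axis=1)

# Going "downwards": apply the transposed map to a vector h that lives
# in the upper layer.  This equals a 'full' convolution with w reversed.
h = rng.normal(size=n - k + 1)
downwards = A.T @ h
as_full_conv = np.convolve(h, w[::-1], mode="full")
```

With these sizes a valid convolution takes 8 inputs to 6 outputs, and the full convolution in the other direction takes those 6 back to 8, which is the kind of bookkeeping on layer sizes mentioned above.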

We could also try setting the deconvolutional weights to be the transpose of the original weights, as in the classic autoencoder.  This is useful because it means that the encoding will be roughly normalised, and has fewer degrees of freedom.  I did not try this however, since the simpler approach seemed to work fine.