Cross entropy and training-test class imbalance


Suppose we want to train a machine learning model on a binary classification problem.  A standard way of measuring model performance here is called log-loss or binary cross entropy (I will refer to this as cross entropy throughout this post).  This means that given the task of predicting some binary label y, rather than outputting a hard 0/1 class prediction, one outputs a probability, \hat{y} say.  Then the cross entropy score of the model is

\sum_i -y_i\log\hat{y_i} - (1-y_i)\log(1-\hat{y_i}).
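To make the formula concrete, here is a minimal numpy sketch of the score (the clipping constant `eps` is just a numerical guard against log(0), not part of the definition):

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Sum of per-example log losses, as in the formula above."""
    y = np.asarray(y, dtype=float)
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1 - eps)
    return float(np.sum(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)))

# A confident correct prediction costs little; two mild correct ones cost ~0.21.
print(binary_cross_entropy([1, 0], [0.9, 0.1]))  # ≈ 0.2107
```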

We will explain roughly where this loss comes from in the next section.  Now suppose that the test set actually has a different proportion of positives to negatives than the training set.  This is not a hypothetical scenario: it is exactly what competitors in the recently added ‘Quora question pair challenge‘ are faced with.  This post explains why the nature of cross entropy makes this a problematic setup (something I, and other posters, pointed out), and gives a theoretical solution.  The same problem could also come up where the proportion of positives changes over time (and this change is known), but the training cross-entropy score is still to be used.  Some posters on the Kaggle discussion boards mentioned attempts to convert training set predictions to test set predictions, but to my knowledge there is no serious published analysis of this so far, so here goes…
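One natural way to convert training predictions to test predictions (a standard prior-shift correction, though not necessarily the exact derivation this post goes on to give) is to assume p(x | y) is the same in train and test, so only the base rate moves; Bayes' rule then rescales the predicted odds. A sketch, with illustrative base rates rather than the competition's actual ones:

```python
def shift_prior(y_hat, train_pos_rate, test_pos_rate):
    """Adjust a predicted probability when only the positive-class prior changes.

    Under the assumption that p(x | y) is shared between train and test,
    Bayes' rule scales the positive and negative posteriors by the prior ratios.
    """
    a = test_pos_rate / train_pos_rate              # scale on the positive class
    b = (1 - test_pos_rate) / (1 - train_pos_rate)  # scale on the negative class
    return a * y_hat / (a * y_hat + b * (1 - y_hat))

# Sanity check: predicting the training base rate maps to the test base rate.
print(shift_prior(0.5, train_pos_rate=0.5, test_pos_rate=0.17))  # → 0.17
```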


Note on an item-item recommender system using matrix factorisation

Collaborative filtering for product recommendations tends to focus on user-item recommendations – given a user, try to find the products they might like based on the products they have already interacted with.  While personalisation like this is often preferable, sometimes we would just like to know what products to recommend if a user lands on a given product page.  There is far less written about this in the literature, so I would like to share some thoughts (and maths).
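As a rough illustration of the setup (a sketch under my own assumptions, not the method derived in the post): if a matrix factorisation has already produced an item-factor matrix, item-item recommendation can be reduced to nearest neighbours in that embedding space.

```python
import numpy as np

# Hypothetical setup: a factorisation has produced item embeddings V
# (n_items x n_factors); here V is random purely for illustration.
rng = np.random.default_rng(0)
V = rng.normal(size=(5, 3))

def most_similar(item, V, k=2):
    """Return the k items with highest cosine similarity to `item`."""
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)  # unit-normalise rows
    sims = Vn @ Vn[item]                               # cosine similarities
    sims[item] = -np.inf                               # exclude the item itself
    return np.argsort(sims)[::-1][:k]

print(most_similar(0, V))
```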


DeepRhyme (D-Prime) – generating dope rhymes with deep learning


Disclaimer : I am a white boy, straight outta Cambridge, UK.  I do not condone the nastier, sometimes misogynistic language in this post, but in the interest of keeping it real, I have not made any effort to omit it either :b

Mobile users : you probably want to turn your screen sideways


The following was completely generated, unedited, by a rapbot (minus the title).

House Full of Bricks - D-Prime

pump it up , just throw yo hands in the air
this goes out to all my niggaz that don't dare
i got a whole lot of diamonds in yo veins
about havin thangs gold rings , and my platinum chain
that's why all these bitch - ass niggaz get smoked
they don't give a fuck , that shit ain't no joke
nigga this goes out to all them crooked ass cops
since back in the days when i was getting popped
always had a good time for me to be fine
i've lost my mind , and i'm still gon shine
but i don't wanna see a nigga on the grind
we ain't trying to be ready for war or crime
check this out , a lot of people are scarred
and it's my time when it comes to the stars
cause i was born with a house full of bricks
yeah , we can see it all in the mix
but now it's hard for me to beg and feed
i gotta wake up , so take away my seed

Ever since I listened to Illmatic as a youngster, I’ve loved hip hop, and the amazing creativity of (some) rap lyrics.  I also love machine learning.  Over a year ago, I decided I would write a machine learning algorithm that could ingest rap lyric text, then generate lyrics of its own automatically.  This was hard, and I gave up.  Twice.  After many, many iterations, I eventually came up with a model that could produce the above.  The following is a brief description of how this was achieved; the full gory technical details, as well as code, will be written up in a later post.

but right now i'm just trying to make it nice
this is my life , you can pay the price
i ain't gotta wait til it's time to take flight
have a party all night , everything's gonna be alright
so now do you really wanna get with me tonight
it ain't no need to talk about what i write



The birthday problem II : three people or more

This is a follow-up to the previous post about the birthday problem.  Now we will look at the more general case, where we want more than two people to share a birthday.  Incidentally, the reasons for my interest in this problem are that we used to use it as a coding/probability exercise for data science interviews at Qubit (not anymore!), and that there seems to be surprisingly little written about it online.
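In the spirit of the coding exercise, here is a quick Monte Carlo sketch of the general question; the post itself is about attacking it analytically.

```python
import random
from collections import Counter

def p_shared_birthday(n_people, k, trials=20000, seed=0):
    """Monte Carlo estimate of P(at least k people share a birthday)
    among n_people, assuming 365 equally likely birthdays."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        counts = Counter(rng.randrange(365) for _ in range(n_people))
        if max(counts.values()) >= k:
            hits += 1
    return hits / trials

# The classic k=2 threshold: 23 people gives just over a 50% chance.
print(p_shared_birthday(23, 2))
# For k=3 the crossover is much higher, at around 88 people.
print(p_shared_birthday(88, 3))
```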


RNN-spelling bee : Seq2seq learning in TensorFlow

You might want to just go straight to the IPython notebook; I put a lot more effort into it than this post.

I have been doing a bunch of side projects recently and not writing them up.  This one, I think, may be of some interest to other people, since TensorFlow is so in vogue right now.  I have been interested in trying sequence to sequence learning for some time, so I came up with a toy problem to solve.  I actually took some effort to make the notebook legible, and it would probably be easiest to just read that for the problem and code description: see the notebook here.  It includes this picture (just to make this post look more interesting):


Please note this is a work in progress; I will probably write up the problem/solution itself at a later date (but at my current rate of write-ups, perhaps not…)

Generative adversarial autoencoders in Theano

Just here for code?  Git repo.

This is a follow-up to my last post about enhancing images using a generative adversarial autoencoder structure.  This post is about how it was done, and we provide code to hopefully let the reader replicate the results.

This project was done in Theano, and closely follows the code given for the DCGAN paper of Alec Radford et al.  I refactored some things to make the code easier to experiment with, and I had to change the architecture a bit.  I originally tried porting the code over to Lasagne, a library built on top of Theano, but decided that this was only slowing me down.  After this project I have started to think that, sadly, for small experimental projects using novel techniques, working with small, simple modules on top of Theano is quicker than trying to twist your code to fit a given library.


Enhancing images using Deep Convolutional Generative Adversarial Networks (DCGANs)

(edit1 : this got to the top of r/machinelearning, check out the comments for some discussion)

(edit2 : code for this project can now be found at this repo, discussion has been written up here)

Recently a very impressive paper came out which produced some extremely life-like images from a neural network.  Since I wrote a post about convolutional autoencoders about 9 months ago, I have been thinking about the problem of how one could upscale or ‘enhance’ an image, CSI-style, by using a neural network to fill in the missing pixels.  I was therefore very interested while reading this paper, as one of its predecessors was attempting to do just that (albeit in a much more extreme way, generating images from essentially 4 pixels).

Using this as inspiration, I built a neural network with the DCGAN structure in Theano, and trained it on a large set of images of celebrities.  Here is a random sample of outputs: the original images are on the left, the grainy images fed into the neural network are in the middle, and the outputs are on the right.


N.B. this image is large; you should open it and zoom in to really see the detail (or lack of it) produced by the DCGAN.  I certainly do not claim the DCGAN did phenomenally well on pixel-level detail, although occasionally I’d say it did a pretty impressive job, particularly with things like hair.
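The post doesn't spell out how the grainy inputs were produced; one simple, purely hypothetical degradation that yields inputs of the kind described is block-averaging followed by nearest-neighbour upsampling:

```python
import numpy as np

def make_grainy(img, factor=4):
    """Downsample an (H, W) array by block-averaging, then blow it back up
    with nearest-neighbour repetition. H and W must be divisible by factor."""
    h, w = img.shape
    small = img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    return np.repeat(np.repeat(small, factor, axis=0), factor, axis=1)

img = np.arange(64.0).reshape(8, 8)
grainy = make_grainy(img, factor=4)
print(grainy.shape)  # same size as the input, but only 4 distinct blocks remain
```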
