Reddit buzzwords of 2015 visualised by month

Back in March 2015, I wrote a post where I visualised a large number of reddit comments by looking for buzzwords over time and plotting them as a word cloud.  Now that 2015 is over, I decided to plot the remainder of that year.


I made a few small tweaks to the original algorithm.  For example, when measuring how ‘surprising’ a given word is for a given month, the algorithm now looks at its prevalence over the preceding 12 months (rather than over the 12 calendar months of that year, which would compare a month against data that comes after it).  For aesthetic reasons, I also made sure that each word was only emphasised in one of the months (otherwise ‘Trump’ got ever more buzzword-y over time, and showed up in a lot of the different months).
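As a rough illustration of the trailing-window idea, here is a simplified stand-in for the surprise metric (the actual metric used in the post is a proprietary Bayesian one, described later): each word in a month is scored by its frequency relative to its smoothed frequency over the preceding 12 months.

```python
from collections import deque

def surprise_scores(monthly_counts, month_keys):
    """Score each word in each month by how much more prevalent it is
    than in the trailing 12-month window.  A simplified sketch -- the
    post's actual metric is a more sophisticated Bayesian one."""
    scores = {}
    window = deque(maxlen=12)  # word counts for the preceding 12 months
    for key in month_keys:
        counts = monthly_counts[key]
        total = sum(counts.values())
        history_total = sum(sum(c.values()) for c in window) or 1
        month_scores = {}
        for word, n in counts.items():
            rate = n / total
            hist_n = sum(c.get(word, 0) for c in window)
            hist_rate = (hist_n + 1) / history_total  # add-one smoothing
            month_scores[word] = rate / hist_rate
        scores[key] = month_scores
        window.append(counts)
    return scores
```

A word that suddenly spikes (little history, high current rate) gets a large score, while a word that is always common scores near 1.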


All 24-bit RGB colours in one animation

Edit: this page was featured on Engadget!  Thanks for the love!

(THIS PAGE WON’T LOOK GOOD ON MOBILE, make sure Vimeo links are playing in 720p)

The way a computer monitor displays colours is to mix various intensities of red, green and blue light.  A typical monitor can display 256 intensity levels for each of the three colours, giving 256×256×256 ≈ 16.8 million different colours, which can be expressed in 24 bits of information.  The following looping animation contains every 24-bit colour exactly once in one of its pixels in one of its frames (at least the uncompressed mp4 I uploaded to vimeo did).  It has 256 frames, 256×256 pixels apiece.
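As a minimal sketch of how 256 frames of 256×256 pixels can hold every colour exactly once: let the frame index supply the blue channel, and the pixel coordinates supply red and green.  (The animations in the post use far more interesting orderings than this; the linked code has the real algorithms.)

```python
import numpy as np

def frame(b):
    """Frame b of a 256-frame animation: red varies along rows, green
    along columns, blue is the frame index.  Across all 256 frames,
    each 24-bit colour appears exactly once."""
    r, g = np.meshgrid(np.arange(256, dtype=np.uint8),
                       np.arange(256, dtype=np.uint8), indexing='ij')
    b_chan = np.full((256, 256), b, dtype=np.uint8)
    return np.dstack([r, g, b_chan])  # shape (256, 256, 3)
```

Within any single frame the 65,536 (red, green) pairs are all distinct, and distinct frames never share a blue value, so no colour can repeat.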

It’s a bit small, so most of the animations that follow are actually 256 frames of 512×512 (every colour appears 4 times).  The following uses the same algorithm as the first one, just bigger.

There are animations that look quite different to this below.

Code that generated these is here.

Continue reading

The Birthday Problem with Generalisations I

A well known ‘paradox’ in probability is the following: suppose we have a set of 23 people in a room.  Then the probability that at least two of them share a birthday (assuming birthdays are uniformly distributed over 365 days) is more than 50%.  This is not a real paradox, but people generally find it somewhat surprising at first, since 23 is so much smaller than 365.  In this post, we will look at how the distribution of birthdays affects this probability.
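The uniform case is easy to check directly: the probability that all n birthdays are distinct is the product of (365 − i)/365 for i from 0 to n − 1, and the shared-birthday probability is its complement.

```python
def p_shared_birthday(n, days=365):
    """Probability that at least two of n people share a birthday,
    assuming birthdays are uniform and independent."""
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct
```

With 23 people this gives about 0.507, just over a half, while 22 people give about 0.476 — so 23 is the smallest group for which a shared birthday is more likely than not.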

Continue reading

Convolutional autoencoders in python/theano/lasagne

If you are just looking for code for a convolutional autoencoder in python, look at this git.  It needs quite a few python dependencies; the only non-standard ones are theano, nolearn, and lasagne (make sure they are up to date).  There is also a section at the end of this post that explains the code.


I have recently been working on a project to teach a neural network to count the number of blobs in an image, such as the one below:


I will save the why and the details of this task for a later post.  One of the methods I have been looking at is using autoencoders.  This means we build a network which takes an image as an input, and tries to reconstruct the same image as an output.  Of course, the identity function would do this exactly, but we would not have learned anything interesting about the image that way.  There are several methods to prevent this, such as adding some sort of randomness to the input image (see here or here for example).  We will use an easier method, which is to make one of the layers of the network quite narrow.  This means that the network must compress all the data from the image into a small vector, from which it must then reconstruct the image.  We hope that this forces the autoencoder to learn useful features about the image.

The most powerful tool we have to help networks learn about images is convolutional layers.  This post has a good explanation of how they work.  It seems natural to try to use convolutional layers for our autoencoder, and indeed there is some work in the literature about this, see for example this paper of Masci et al.  Unfortunately there is not really all that much online to help someone get started with building one of these structures, so having built one myself, I have provided some code.  Our network has the following structure :


There are a lot of things we can change about this skeleton model.  After a lot of trial and error, I arrived at a net with one convolutional / pooling layer and one deconvolution / unpooling layer, both with filter size 7.  The narrow encoding layer has 64 units.  The model trains on an Nvidia 900 series GPU in roughly 20 minutes.
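To see how the layer sizes fit together, here is the shape bookkeeping for a 28×28 MNIST input with the stated filter size of 7 (the 2×2 pooling is an assumption of this sketch, not something specified in the post):

```python
def valid_conv(size, filt):
    """Output size of a 'valid' convolution: the filter must fit
    entirely inside the input, so the output shrinks."""
    return size - filt + 1

def full_conv(size, filt):
    """Output size of a 'full' convolution: the filter may overhang
    the edges, so the output grows."""
    return size + filt - 1

# Shape trace for a 28x28 input, filter size 7, 2x2 pooling, and a
# 64-unit encoding layer -- an illustrative reconstruction of the
# skeleton, not necessarily the exact net from the post.
size = 28
size = valid_conv(size, 7)   # conv (valid):         28 -> 22
size = size // 2             # 2x2 max pool:         22 -> 11
# ...dense layers squeeze the 11x11 feature maps down to the
# 64-unit encoding and back up again...
size = size * 2              # unpool (2x upsample): 11 -> 22
size = full_conv(size, 7)    # deconv (full):        22 -> 28
assert size == 28            # reconstruction matches the input size
```

Note how a valid convolution on the way down pairs with a full convolution on the way up so that the output comes back to the input size.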


Here are some of the inputs and outputs of the autoencoder; we see it does a fairly decent job of capturing the position and size of the circles, though they are not fantastically circular.


We can run pretty much the same architecture on lots of datasets: here is MNIST with 64 units on the narrow layer:


Code discussion

Code can be found at this git, which works on the popular MNIST handwritten digit dataset.

The convolution stage of the network is straightforward to build with neural network libraries such as caffe, torch7, pylearn, etc.  I have done all of my work on neural networks in Theano, a python library that can work out the gradient steps involved in training, and compile to CUDA code that runs on a GPU for large speed gains over CPUs.  Recently I have been using the lasagne library, built on Theano, to help write layers for neural nets, and nolearn, which has some nice classes to help with the training code (which is generally quite long and messy in raw Theano).  My code is written with these libraries; however, it should be reasonably straightforward to convert it into code that relies only on Theano.


There are some fancy things one can do here in undoing the pooling operation, however in our net we just do a simple upsampling.  That is to say our unpooling operation looks like this :
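Concretely, the simple upsampling just repeats each value of the pooled feature map into a 2×2 block, which can be done with numpy's `repeat` (a sketch of the operation, not the Theano implementation used in the actual net):

```python
import numpy as np

# Simple 2x upsampling as the unpooling step: each value in the
# pooled feature map is repeated into a 2x2 block.
pooled = np.array([[1, 2],
                   [3, 4]])
unpooled = np.repeat(np.repeat(pooled, 2, axis=0), 2, axis=1)
# unpooled:
# [[1 1 2 2]
#  [1 1 2 2]
#  [3 3 4 4]
#  [3 3 4 4]]
```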


Unpooling certainly seems to help the deconvolutional step.

Deconvolution layer = Convolutional layer

Let’s consider 1-dimensional convolutional layers.  The following picture is supposed to represent one filter of length 3 between two layers :

The connections of a single colour all share a single weight.  Note this picture is completely symmetric: we can consider it as a convolution going either upwards or downwards.  There are two important caveats, however.  Firstly, for deconvolution we need to use a symmetric activation function (definitely NOT rectified linear units).  Secondly, the picture is not symmetric at the edges of the diagram.  There are two main border modes used in convolutional layers – ‘valid’, where we only apply the filter at positions where it fits entirely inside the input (which causes the output dimensions to shrink), and ‘full’, where we allow the filter to overhang the edges of the input (which causes the output dimensions to grow).  I chose to use valid borders for both the convolutional and deconvolutional steps, but note that this means we have to be careful about the sizes of the layers at each step.
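The size arithmetic for the two border modes is easy to check with numpy's 1-D `convolve`, which supports both:

```python
import numpy as np

x = np.arange(10.0)          # a length-10 1-D input
w = np.ones(3) / 3.0         # a filter of length 3

valid = np.convolve(x, w, mode='valid')  # filter fits entirely inside
full = np.convolve(x, w, mode='full')    # filter overhangs the edges

assert valid.shape == (8,)   # 10 - 3 + 1: the output shrinks
assert full.shape == (12,)   # 10 + 3 - 1: the output grows
```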

We could also try setting the deconvolutional weights equal to the transpose of the original weights, as in the classic tied-weight autoencoder.  This is useful because it means that the encoding will be roughly normalised, and the model has fewer degrees of freedom.  I did not try this, however, since the simpler approach seemed to work fine.

Visualising reddit buzzwords over time

I was recently inspired by this post to write a python script that scraped a large number of reddit comments (see the last section for code).  This constitutes roughly 10 million comments, and 200 million words.  I’ve since been looking for interesting ways of visualising this data.  My favourite so far is to split the comments into months, create word clouds of the ‘buzzwords’ in each month, and plot the images on the same axis to give a ‘word cloud timeline’, something I had not seen done before.  I know that word clouds have fallen out of fashion a bit in the last few years, but I’ve always been fond of them.  It is important to note that these word clouds are not based on the raw counts of each word; rather, I used a metric for how unusual the prevalence of a word was in a given month to dictate the font sizes.  More details will follow, but for now here is what this analysis produced for the last 14 months of comments (click to enlarge).

More word clouds – Here are the timelines for 2013 and 2012 respectively.


I also looked at dividing by subreddit. Most were not so interesting, some of them failed because their status as a default subreddit changed in the last year.  Unsurprisingly, the ones that worked best tended to be based on current events :

[Word cloud timelines for the gaming, movies, Music, science, worldnews and IAmA subreddits, January 2014 – March 2015.]

How I made this – For each day, I got a list of top frontpage posts from the reddit archive, using Beautiful Soup.  I then processed the comments using the reddit comment scraper PRAW.  To limit the number of queries to reddit, I only took the first 200 comments from each page (i.e. the ones you can see without clicking on ‘continue thread’ or ‘load more comments’).  Next I cleaned the data (always a headache).  Generally this involved removing things with regexes, attempting to remove bots, etc.  One thing that removed a lot of weirdness was deduplicating the words within each comment; this thread is an example of why.  This step is always imperfect, but I did as much as I could stomach.  As mentioned before, the word cloud font sizes are not based on raw counts, as the most common words are boring – ‘the’, ‘and’ etc.  A standard strategy is to use a metric that reflects how unusual the prevalence of a word is in a given month.  A well known example of such a metric is tf-idf, but I instead opted for a Bayesian approach.  I can’t say too much about this as it is a proprietary algorithm of the research team at Qubit (where I work), but it involves comparing the prior and posterior distributions of the prevalence of words, first based on the data without that month, and then after we add it in.  Finally I created the word clouds.  I normalised the font sizes so they would not be too sparse or too dense, plotted them using this python library, and glued them together.
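Since the actual metric is proprietary, here is a sketch of the standard tf-idf baseline mentioned above, treating each month as a document — a stand-in for illustration only, not the algorithm used for the clouds:

```python
import math
from collections import Counter

def monthly_tfidf(months):
    """months: dict mapping month name -> list of words in that month.
    Returns per-month tf-idf scores.  A standard baseline -- the post
    uses a proprietary Bayesian metric instead."""
    counts = {m: Counter(words) for m, words in months.items()}
    n_months = len(months)
    df = Counter()                       # document frequency per word
    for c in counts.values():
        df.update(c.keys())
    scores = {}
    for m, c in counts.items():
        total = sum(c.values())
        scores[m] = {w: (n / total) * math.log(n_months / df[w])
                     for w, n in c.items()}
    return scores
```

Words that appear in every month (like ‘the’) get an idf of zero and vanish, while words concentrated in one month score highly — which is exactly the behaviour the font sizes need.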

Data Sharing – I have made (almost) all the code used to generate the images available at this repo.  Raw, uncleaned tsvs for the last few months can be found here (not the best file hosting service, I know).  Usernames have been hashed for anonymity.

Problems – There are many problems with this methodology that could be addressed, e.g. what happens if a word peaks in popularity on the boundary of two months?  Relying on the reddit archive was a little dodgy in retrospect, as I believe its frontpage only takes into account the default subreddits, which change over time.  For a subreddit’s word cloud, one should really scrape that subreddit’s own frontpage.  Data cleaning also has problems, e.g. ‘fil’ is one of the top words in 2012, because Chick-fil-a was a popular talking point and the cleaning split this word.