The birthday problem II : three people or more

This is a follow-on from the previous post about the birthday problem. Now we will look at the more general case, where we want more than two people to share a birthday. Incidentally, the reason for my interest in this problem is that we used to use it as a coding/probability exercise for data science interviews at Qubit (not anymore!), and that there seems to be surprisingly little written about it online.
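As a quick illustration of the kind of question involved, here is a Monte Carlo sketch (not the analytical approach taken in the post, and assuming 365 equally likely birthdays) for estimating the probability that at least k of n people share a birthday :

```python
import numpy as np

def prob_at_least_k_share(n, k, trials=20_000, days=365, seed=0):
    """Monte Carlo estimate of P(at least k of n people share a birthday)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        birthdays = rng.integers(0, days, size=n)  # uniform over the year
        if np.bincount(birthdays, minlength=days).max() >= k:
            hits += 1
    return hits / trials

# With 88 people, a triple shared birthday becomes (just) more likely than not.
print(prob_at_least_k_share(n=88, k=3))  # roughly 0.5
```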


Seq2seq learning in TensorFlow

You might want to just go straight to the IPython notebook; I put a lot more effort into it than this post.

I have been doing a bunch of side projects recently and not writing them up. This one I think may be of some interest to other people, since TensorFlow is so in vogue right now. I have been interested in trying sequence-to-sequence learning for some time, so I came up with a toy problem to solve. I actually took some effort to make the notebook readable, and it would probably be easiest to just read that for the problem and code description : see the notebook here. It includes this picture (just to make this post look more interesting) :

[Image: seq2seq.png]

Please note this is a work in progress; I will probably write up the problem/solution itself at a later date (but at my current rate of write-ups, maybe not!).

Generative adversarial autoencoders in Theano

Just here for code?  Git repo.

This is a follow-up to my last post about enhancing images using a generative adversarial autoencoder structure. This post is about how it was done, and provides code to hopefully let the reader replicate the results.

This project was done in Theano, and closely follows the code released for the DCGAN paper of Alec Radford et al. I refactored some things to make it easier to experiment with, and I had to change the architecture a bit. I originally tried porting the code over to Lasagne, a library built on top of Theano, but decided that doing so was only slowing me down. After this project I have started to think that, sadly, for small experimental projects using novel techniques, working with small, simple modules on top of Theano is quicker than trying to twist your code to fit some given library.


Reddit buzzwords of 2015 visualised by month

Back in March 2015, I wrote a post where I visualised a large number of reddit comments by looking for buzzwords over time and plotting them as a word cloud. Now that 2015 is over, I decided to plot the remainder of that year.

[Image: reddit_2015 buzzword timeline]

I made a few small tweaks to the original algorithm. For example, when looking at how 'surprising' a given word is in a given month, the algorithm now measures its prevalence over the preceding 12 months (rather than the 12 calendar months of that year, which is less chronologically meaningful). For aesthetic reasons, I also made sure that each word was only emphasised in one of the months (otherwise 'Trump' got ever more buzzword-y over time, and showed up in a lot of the different months).
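The actual scoring is the proprietary Bayesian metric described further down, so the following is only an illustrative stand-in: a smoothed log-ratio of a word's rate in the current month against its rate over the preceding 12 months. Here `monthly_counts` is a hypothetical {month: {word: count}} mapping; to stop one word dominating several months, each word can then be kept only in the month where its score peaks.

```python
import math

def surprise_scores(monthly_counts, month, history, alpha=1.0):
    """Score each word in `month` against the preceding `history` months."""
    total_now = sum(monthly_counts[month].values())
    hist_counts = {}
    hist_total = 0
    for m in history:
        hist_total += sum(monthly_counts[m].values())
        for w, c in monthly_counts[m].items():
            hist_counts[w] = hist_counts.get(w, 0) + c
    scores = {}
    for w, c in monthly_counts[month].items():
        rate_now = (c + alpha) / (total_now + alpha)            # smoothed rate this month
        rate_hist = (hist_counts.get(w, 0) + alpha) / (hist_total + alpha)  # smoothed trailing rate
        scores[w] = math.log(rate_now / rate_hist)
    return scores

# e.g. scores = surprise_scores(monthly_counts, "2015-12", previous_12_months)
```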

All 24-bit RGB colours in one animation

Edit : this page was featured on engadget!  Thanks for the love!

(THIS PAGE WON’T LOOK GOOD ON MOBILE, make sure Vimeo links are playing in 720p)

The way a computer monitor displays colours is to mix various intensities of red, green and blue light. A typical monitor can display 256 levels of intensity for each of the three colours, giving 256×256×256 ≈ 16.8 million different colours, which can be expressed in 24 bits of information. The following looping animation contains every 24-bit colour exactly once in one of its pixels in one of its frames (at least the uncompressed mp4 I uploaded to Vimeo did). It has 256 frames, of 256×256 pixels apiece.
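As a minimal sketch of how you could cover every colour exactly once with that frame/pixel budget (not necessarily the algorithm behind the animations here, and assuming numpy and imageio are available): let the frame index be the blue channel, and the two pixel coordinates be red and green.

```python
import numpy as np
import imageio  # assumed available for writing the frames

# r varies along one axis, g along the other; every (r, g) pair appears once per frame.
r, g = np.meshgrid(np.arange(256, dtype=np.uint8),
                   np.arange(256, dtype=np.uint8), indexing="ij")

for b in range(256):
    frame = np.stack([r, g, np.full((256, 256), b, dtype=np.uint8)], axis=-1)
    imageio.imwrite(f"frame_{b:03d}.png", frame)

# Stitch the PNGs into a video with e.g. ffmpeg; note that lossy compression will
# perturb pixel values, so 'every colour exactly once' only holds for the
# uncompressed frames.
```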

It’s a bit small, so most of the animations that follow are actually 256 frames of 512×512 (every colour appears 4 times).  The following uses the same algorithm as the first one, just bigger.

There are animations further down that look quite different to this one.

The code that generated these is here.


The Birthday Problem with Generalisations I

A well-known 'paradox' in probability is the following : suppose we have 23 people in a room. Then the probability that at least two of them share a birthday is more than 50%. This is not a real paradox, but people generally find it somewhat surprising at first, since 23 is so much smaller than 365. In this post, we will look at how the distribution of birthdays affects this probability.
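For reference, here is the standard calculation behind the 23-person figure in the uniform case: the probability that all n birthdays are distinct is the product of (365 − i)/365 for i = 0, …, n − 1, and we want the complement.

```python
def prob_shared_birthday(n, days=365):
    """P(at least two of n people share a birthday), assuming uniform birthdays."""
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct

print(prob_shared_birthday(23))  # about 0.507, i.e. just over 50%
```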


Visualising reddit buzzwords over time

I was recently inspired by this post to write a python script that scraped a large number of reddit comments (see the last section for the code). This constitutes roughly 10 million comments, and 200 million words. I've since been looking for interesting ways of visualising this data. My favourite one so far is to split the comments into months, create word clouds of the 'buzzwords' in each month, and plot the images on the same axis to give a 'word cloud timeline', something I had not seen done before. I know that word clouds have rather fallen out of fashion in the last few years, but I've always been fond of them. It is important to note that these word clouds are not based on the raw counts of each word; rather, I used a metric for how unusual the prevalence of a word was in a given month to dictate the font sizes. More details follow below, but first here is what this analysis produced for the last 14 months of comments (click to enlarge).

[Image: word cloud timeline for the last 14 months]

More word clouds – Here are the timelines for 2013 and 2012 respectively.

[Images: word cloud timelines for 2013 and 2012]

I also looked at dividing by subreddit. Most were not so interesting, and some of them failed because their status as a default subreddit changed in the last year. Unsurprisingly, the ones that worked best tended to be based on current events :

[Images: word clouds for the gaming, movies, Music, science, worldnews and IAmA subreddits, January 2014 – March 2015]

How I made this – For each day, I got a list of top frontpage posts from the reddit archive, using Beautiful Soup. I then processed the comments using the reddit comment scraper PRAW. To limit the number of queries to reddit, I only took the first 200 comments from each page (i.e. the ones you can see without clicking on 'continue thread' or 'load more comments'). Next I cleaned the data (always a headache). Generally this involved removing things with regexes, attempting to remove bots, etc. One thing that I found removed a lot of weirdness was removing duplicate words from every comment; this thread is an example of why (a minimal version of that step is sketched below). This step is always imperfect, but I did as much as I could stomach. As mentioned before, the word cloud font sizes are not based on raw counts, as the most common words are boring – 'the', 'and', etc. A standard strategy is to use a metric that reflects how unusual the prevalence of a word is in a given month. A well-known example of such a metric is tf-idf, but I instead opted for a Bayesian approach. I can't talk too much about this, as it is a proprietary algorithm of the research team at Qubit (where I work), but it involves comparing the prior and posterior distributions of the prevalence of words based on the data without that month, and then after we add it in. Finally I created the word clouds: I normalised the font sizes so they would not be too sparse or too dense, plotted them using this python library, and glued the images together.
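The cleaning code itself lives in the repo linked below; as a minimal sketch, the duplicate-word-removal step amounts to keeping only the first occurrence of each word in a comment, so comments that repeat one word hundreds of times don't swamp the counts.

```python
def dedupe_words(comment):
    """Keep only the first occurrence of each word in a comment, preserving order."""
    seen = set()
    kept = []
    for word in comment.split():
        key = word.lower()
        if key not in seen:
            seen.add(key)
            kept.append(word)
    return " ".join(kept)

print(dedupe_words("lol lol lol LOL this this is fine"))  # -> "lol this is fine"
```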

Data Sharing – I have made (almost) all the code used to generate the images available at this repo. Raw, uncleaned TSVs for the last few months can be found here (not the best file hosting service, I know). User names have been hashed for anonymity.

Problems – There are many problems with this methodology that could be addressed, e.g. what happens if a word peaks in popularity on the boundary of two months? Relying on the reddit archive was a little dodgy in retrospect, as I believe its frontpage only takes into account the default subreddits, which change over time. For a subreddit's word cloud, one should really scrape that subreddit's own frontpage. Data cleaning also has some problems, e.g. 'fil' is one of the top words in 2012, because Chick-fil-A was a popular talking point and the cleaning split this word.