I was recently inspired by this post to write a Python script that scraped a large number of reddit comments (see the last section for the code): roughly 10 million comments, or 200 million words. I’ve since been looking for interesting ways of visualising this data. My favourite so far is to split the comments into months, create word clouds of the ‘buzzwords’ in each month, and plot the images on the same axis to give a ‘word cloud timeline’, something I had not seen done before. I know that word clouds have rather fallen out of fashion in the last few years, but I’ve always been fond of them. It is important to note that these word clouds are not based on the raw counts of each word; rather, I used a metric for how unusual the prevalence of a word was in a given month to dictate the font sizes. More details follow, but first here is what this analysis produced for the last 14 months of comments (click to enlarge).
I also looked at dividing the comments by subreddit. Most were not especially interesting, and some failed because their status as default subreddits changed over the last year. Unsurprisingly, the ones that worked best tended to be based on current events:
How I made this – For each day, I got a list of top frontpage posts from the reddit archive, using Beautiful Soup. I then fetched and processed the comments using PRAW, the Python reddit API wrapper. To limit the number of queries to reddit, I only took the first 200 comments from each page (i.e. the ones you can see without clicking on ‘continue thread’ or ‘load more comments’). Next I cleaned the data (always a headache). This mostly involved removing things with regexes, attempting to filter out bots, and so on. One thing that removed a lot of weirdness was stripping duplicate words from every comment; this thread is an example of why. This step is always imperfect, but I did as much as I could stomach. As mentioned above, the word cloud font sizes are not based on raw counts, as the most common words are boring – ‘the’, ‘and’ etc. A standard strategy is to use a metric that reflects how unusual the prevalence of a word is in a given month. A well-known example of such a metric is tf-idf, but I instead opted for a Bayesian approach. I can’t say too much about this, as it is a proprietary algorithm of the research team at Qubit (where I work), but it involves comparing the prior and posterior distributions of a word’s prevalence, first computed from the data without that month and then after adding it in. Finally I created the word clouds. I normalised the font sizes so they would be neither too sparse nor too dense, plotted them using this python library, and glued the images together.
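The cleaning pipeline isn't spelled out in full above, but the duplicate-word step can be sketched in a few lines. This is a simplified illustration, not the exact code from the repo; the real pipeline did more (bot filtering, etc.):

```python
import re

def clean_comment(text):
    """Lowercase, strip URLs, keep word-like tokens, then drop
    repeated words so copy-pasta comments don't dominate the
    counts. A simplified sketch of the cleaning step."""
    text = re.sub(r"https?://\S+", " ", text.lower())  # remove links
    words = re.findall(r"[a-z']+", text)               # word-like tokens only
    seen = set()
    deduped = []
    for w in words:
        if w not in seen:
            seen.add(w)
            deduped.append(w)
    return deduped

print(clean_comment("Nice nice NICE post, see http://example.com nice"))
# → ['nice', 'post', 'see']
```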
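Since the Bayesian metric itself is proprietary, here is the well-known baseline mentioned above, tf-idf, treating each month's comments as one big document. The data and function name are illustrative, not from the actual code:

```python
import math
from collections import Counter

def monthly_tfidf(months):
    """Score each word in each month by tf-idf, where every month
    is one 'document'. `months` maps a month label to a list of
    (already cleaned) words. This is the standard baseline, not
    the proprietary Bayesian metric the post actually used."""
    counts = {m: Counter(ws) for m, ws in months.items()}
    n_months = len(months)
    # document frequency: in how many months does each word appear?
    df = Counter()
    for c in counts.values():
        df.update(c.keys())
    scores = {}
    for m, c in counts.items():
        total = sum(c.values())
        scores[m] = {w: (n / total) * math.log(n_months / df[w])
                     for w, n in c.items()}
    return scores

months = {
    "2014-01": "sochi olympics reddit cat".split(),
    "2014-02": "oscars selfie reddit cat".split(),
}
scores = monthly_tfidf(months)
# 'reddit' appears in every month, so its idf (and hence score) is 0;
# month-specific words like 'sochi' score higher and would get bigger fonts.
```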
Data Sharing – I have made (almost) all of the code used to generate the images available at this repo. Raw, uncleaned TSVs for the last few months can be found here (not the best file-hosting service, I know). Usernames have been hashed for anonymity.
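The hashing scheme isn't described beyond "hashed for anonymity"; a reasonable sketch is a salted SHA-256 digest, truncated to keep the TSVs compact. The salt value and truncation length here are hypothetical:

```python
import hashlib

def anonymise(username, salt="some-secret-salt"):
    """Replace a username with a short salted hash. The salt
    (hypothetical here) makes trivial dictionary reversal of
    common usernames harder; truncation keeps files small."""
    digest = hashlib.sha256((salt + username).encode("utf-8")).hexdigest()
    return digest[:12]

# The same name always maps to the same token, so per-user
# analysis still works without exposing the account:
assert anonymise("some_user") == anonymise("some_user")
assert anonymise("some_user") != anonymise("other_user")
```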
Problems – There are many problems with this methodology that could be addressed, e.g. what happens if a word peaks in popularity on the boundary between two months? Relying on the reddit archive was a little dodgy in retrospect, as I believe its frontpage only takes into account the default subreddits, which change over time. For a subreddit’s word cloud, one should really scrape that subreddit’s own frontpage. The data cleaning also has problems, e.g. ‘fil’ is one of the top words of 2012 because Chick-fil-A was a popular talking point and the cleaning split the name apart.
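The ‘fil’ bug is easy to reproduce with a naive word regex (an assumed, simplified version of the cleaning step, not the actual code):

```python
import re

# A letters-only tokenizer splits hyphenated names into fragments:
tokens = re.findall(r"[a-z]+", "chick-fil-a")
print(tokens)  # → ['chick', 'fil', 'a']
# ...which is how 'fil' can surface as a top word on its own.
# Allowing internal hyphens keeps the name together:
tokens_fixed = re.findall(r"[a-z]+(?:-[a-z]+)*", "chick-fil-a")
print(tokens_fixed)  # → ['chick-fil-a']
```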