The birthday problem II : three people or more

This is a follow on from the previous post about the birthday problem.  Now we will look at the more general case where we want more than two people to share a birthday.  Incidentally, the reason for my interest in this problem is that we used to use it as a coding/probability exercise for data science interviews at Qubit (not anymore!), and because there seems to be surprisingly little written about it online.

We will be using the falling factorial  notation (x)_n to ease exposition slightly (because writing latex in wordpress is not so fun).

To simplify, we will assume that birthdays are uniformly distributed throughout the year.  Consider the case where we want to know if three or more people from a group of N share a birthday.  There are other discussions of this on the internet, e.g. this post, but what follows is an easier derivation than what I have seen.  Again, we look at the complement probability – let’s try and work out the probability that all days have at most two people with that as their birthday.  We will break into cases based on how many pairs of people share a birthday, which we will label m, so that 0 \leq m \leq \lfloor \frac{N}{2} \rfloor.  The idea is to count how many ways we can split our N people into m pairs, and N - 2m singletons, and count the distinct number of ways to put these into 365 bins.  This is :

(N)_{2m}\cdot \frac{1}{2 ^m} \cdot (365)_m \cdot (365 - m)_{N-2m} .

where the four terms are respectively : the number of ways of selecting m pairs from the N people, a correction term, since we can swap two people in a pair around, the number of ways of selecting the m ‘pair-days’ from 365, and finally the number of way of selecting the remaining n-2m people from the remaining 365 - m days.  Simplifying this slightly and taking the sum over m, we have the probability of N people having at least one triple is

1- \frac{1}{365^N} \sum \limits_{m = 0}^{ \lfloor \frac{N}{2} \rfloor }\frac{ (N)_{2m} \cdot (365)_{N-m}} {2 ^ m}.

It seems unlikely that there is a clean closed formula to calculate the general case here (one was claimed in this stackexchange post, but it’s wrong, there’s some erroneous double counting here).  A natural way to approximate it would be to run a simulation (and in interviews we would push the candidate towards this), but for curiosity’s sake, I’m going to show how you can work it out precisely.

If we think about the case where we want there to be no quadruples, it is already getting quite daunting to try and use our previous strategy to break this up in terms of smaller subgroups (we’d have to look at all the ways that different numbers of triples, pairs and singletons which add up to N can be arranged into 365 bins) – you can do it with an induction onto lower terms (left as an exercise!), but it’s messy, slow, and memory intensive to implement.  Instead, we are going to try and reduce the number of days we need to think about.  To this end, let F_M (N, k) be the number of ways of pigeonholing N objects into k bins so that there are no more than M objects in any one bin. Then the probability that amongst N randomly selected people, there is one set of at least M who all share a birthday is

1 - \frac{F_{M - 1} (N, 365)}{365^N}.

Consider a binning that is counted by F_M (N, k).  Let’s assume there are m objects in the first bin, so that 0 \leq m \leq M.  There are {N \choose m} sets of objects this could be.  So considering what remains, and summing over all valid m‘s, we have the recurrence

F_M (N, k) = \sum_{m=0}^M  {N \choose m} F_M (N - m, k - 1).

 This together with the base cases F_M (N, 1) = 1 for 0 \leq N \leq M and zero otherwise, allows us to calculate the probability exactly.  Here is some python code to do this (this is essentially a tail recursion on k up to 365.  Note the use of Decimal, to keep precision.

import math
from decimal import Decimal
from decimal import getcontext

#bit of a trade off between speed/memory and precision here
getcontext().prec = 50

def binom(n, m):
    return Decimal(math.factorial(n)) / (Decimal(math.factorial(m)) * Decimal(math.factorial(n-m)))

def get_birthday_prob(N, M):
    current_row = [Decimal(0)] * (N + 1)
    
    #base case
    for k in range(M):
        current_row[k] = Decimal(1)

    #recursion
    for d in range(1, 365):
        new_row = [Decimal(0)] * (N + 1)
        for n in range(N + 1):
            s = Decimal(0)
            for m in range(min(n + 1, M)):
                s += binom(n, m) * current_row[n-m]
            new_row[n] = s
        current_row = new_row

    complement_prob = current_row[N] / Decimal(math.pow(365, N))
    return float(1 - complement_prob)

print get_birthday_prob(23, 2)

 

Advertisements

2 thoughts on “The birthday problem II : three people or more

  1. Hi Mike, to cut a long story short, I wanted to know the probability of 5 people in a group of 521 sharing the same birthday. It’s for something of no great consequence, but I am not enough of a mathematician (or Python programmer) to apply your formula, which I came across while searching for a solution to this question, so I wondered if you could simply apply it for me and send me the result. In case you’re interested, I have data on 521 people’s birthdays, and there are 7 different birthdays that are shared by 5 people (5 is the highest number of shared birthdays in the data). Is that within the expected probability?

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s