## Bet You Can’t Solve The Birthday Paradox

September 4, 2020 Leave a comment

If you’ve heard of the Birthday Paradox and/or like math puzzles and/or want to know how it connects with Computational Genomics and the Seven Bridges of Konigsberg, this is for you.

The Birthday Paradox (which would be more accurately named The Birthday Somewhat Unintuitive Result) asks “How many people do you need in a group before there’s a 50% chance that at least two share a birthday?”

It’s easier to flip it around and ask “For a given number of people, what are the chances that NONE of them share a birthday?” This makes it simpler: each time we add another person to the group, we just need to calculate the number of “open” birthdays and multiply by the odds of our new member having one of those.

P(no shared birthdays for n people) =

If we just keep increasing n and calculating the probability, we find that with 23 people there’s a 50% chance of at least two people sharing a birthday.

(Ok, technically we answered “Given a number of people what probability…” and used brute force instead of “Given a probability what number of people…” but let’s ignore that; everyone else does.)

#### How about if we want to know the probability that no THREE people in a group share a birthday?

This variant is trickier, and has tripped up many smart people. Do you think you can solve it?

Give it a shot! I’ll talk about the solution after some context for why it matters and some graphics/tools made with Wolfram Mathematica. I started as a machine learning research scientist with Wolfram earlier this year and I’ve been really enjoying playing with the tools!

### Birthdays, Bridges, de Bruijn Graphs, and… Bgenomics

This is more than an idle math puzzle; it’s related to a fascinating challenge in Computational Genomics. (It was actually a question on my homework in grad school.)

When we want to read a DNA sequence, it’s usually far too long for our machines to process all at once. Instead, we copy the sequence a bunch, slice the copies up, and gather all the pieces of a chosen, more manageable, length — say, 4 base pair “letters”.

In the end, we know every length-four subsection in our DNA sequence, we just don’t know their order. For example, if we had a length-7 sequence and took all the random chunks of 4, we might end up with

TACA, ATTA, GATT, TTAC

Now, as though it were a one-dimensional jigsaw puzzle, we try to piece them together. Each chunk of 4 will overlap with its neighbors by 3 letters: the chunk ATTA ends with TTA, so we know the next chunk must **start** with TTA. There’s only one which does — TTAC — so those pieces fit together as ATTAC. Now we look for a piece that begins with TAC, and so on.

In our case, there’s only one unique way to arrange the five chunks:

GATTACA

As the sequences get longer, scientists turn to graph theory to find this arrangement. Treating each DNA chunk as an edge, we can form a directed graph indicating which chunks could follow (overlap with) each other.

This overlapping-ness graph is called a De Bruijn Graph, and if there’s a path which uses every edge exactly once, we’ve done it: we’ve found a way to order the DNA chunks and reconstruct the larger sequence!

If this sounds a bit like the Seven Bridges of Konigsberg problem, there’s a reason — it’s the same issue. It’s as though we were walking from “overlap island” to “overlap island”, each time saying to ourselves “Ok, if I have a piece ending in GCA, what bridge-chunk could come next?” and hoping to find a path that uses every chunk.

Since Euler solved the Bridges of Konigsberg problem, this type of path — using every edge exactly once — is known as an Eulerian Path. The great thing is that we have really efficient algorithms to find them!

However, we can run into trouble if some of our “overlap” sections aren’t distinct. If a section of DNA is repeated, that’s ok — we can probably still find the Eulerian Path through the graph.

…But if a section of the original DNA repeats three times, we’re screwed. When the repeat is at least as long as our “overlap” we can no longer find a unique path — multiple Eulerian Paths exist.

Let’s take the sequence AA**CAT**CC**CAT**GG**CAT**TT, in which the phrase “CAT” repeats three times. Once we reach the “ACAT” chunk, we don’t know what comes next. Overlapping with chunk “CATT” would lead us straight to the end — leaving out many chunks — so we know that comes last. But the loops starting either CATG or CATC could come next.

So, if we’re going to read a long DNA sequence, we might ask:

How many overlapping chunks of 4 can we take before there’s a 50% or a 5% or 1% chance that we see a triple-repeat which ruins our attempt to reconstruct the original?

This is where we return to our Birthday Paradox variant!

### Back to the Birthday Problem

With our Birthday Problem, there are 365 different birthdays a person can have. With DNA chunks of 4, there are 64 different three-letter ways each chunk could end. If any three chunks have the same ending, we won’t know how to reconstruct our sequence.

As the chunks get longer, we have a much better chance of producing a unique Eulerian Path from our graph.

While we can’t move the Earth farther from the Sun (nor should we (probably)) to increase the number of possible birthdays in a year, we CAN use chunks larger than 4 and increase the number of ways each chunk can end. So if we know we’re sequencing a genome 100,000 letters long, how long do our chunks need to be in order for us to have a >99% chance of reconstructing it?

Since starting my job at Wolfram Research, I’ve been playing with their graph capabilities and put together this little interactive tool. It generates a random gene and shows the De Bruijn Graph when you take chunks of different lengths. It’s amazing how quickly a totally chaotic graph becomes more orderly!

(The cloud-deployment can be a bit sluggish, but give it a second after clicking a button. If you get the desktop version of Wolfram Mathematica you can play with things like this incredibly quickly. They’re so cool.)

### The Answer

Heck if I know. Sorry.

I can get the right answer of about 88 by running simulations, but I didn’t manage to derive the general formula for my class.

Every time I’ve shown this question to a friend — and the first time I saw it on my Computational Genomics homework — the response has been “Oh, this is simple. You just… wait, no, that won’t work. We can just… well… hm.”

Stack Exchange confirmed my fears: it’s ugly and we typically try to find approximations. I was momentarily excited to find Dr. Math claim to solve it, but they’d made a mistake and other mathematicians corrected them.

This 2005 paper in The Journal of Statistical Planning and Inference by Anirban DasGupta provides an answer, but it’s way more involved than I expected:

### Why is this so ugly?

In the original version, there’s a unique situation — each person has different birthdays. But in our version, for 23 people:

- each person could have distinct birthdays
- one pair could share a birthday and the other 21 are distinct
- two pairs could share birthdays and the other 19 are distinct
- three pairs could share birthdays and the other 17 are distinct
- …
- eleven pairs could share birthdays and the last one is distinct

For each scenario, we need to calculate the number of ways it can occur, the probability of each, and how it impacts our chance of getting a triple. It’s a mess.

But enough of my complaining about an old homework problem I never solved and which I’m clearly over and never think about.

How did you approach the problem? Did you solve it? Let me know!

Personal note: In my continuing efforts against Perfectionism, I’m going to declare this done. It’s taken up real estate in my head long enough.