## Bet You Can’t Solve The Birthday Paradox

If you’ve heard of the Birthday Paradox and/or like math puzzles and/or want to know how it connects with Computational Genomics and the Seven Bridges of Konigsberg, this is for you.

Math cake! By Sarah Lynn

The Birthday Paradox (which would be more accurately named The Birthday Somewhat Unintuitive Result) asks “How many people do you need in a group before there’s a 50% chance that at least two share a birthday?”

It’s easier to flip it around and ask “For a given number of people, what are the chances that NONE of them share a birthday?” This makes it simpler: each time we add another person to the group, we just need to calculate the number of “open” birthdays and multiply by the odds of our new member having one of those.

P(no shared birthdays for n people) =

If we just keep increasing n and calculating the probability, we find that with 23 people there’s a 50% chance of at least two people sharing a birthday.

(Ok, technically we answered “Given a number of people what probability…” and used brute force instead of “Given a probability what number of people…” but let’s ignore that; everyone else does.)

#### How about if we want to know the probability that no THREE people in a group share a birthday?

This variant is trickier, and has tripped up many smart people. Do you think you can solve it?

Give it a shot! I’ll talk about the solution after some context for why it matters and some graphics/tools made with Wolfram Mathematica. I started as a machine learning research scientist with Wolfram earlier this year and I’ve been really enjoying playing with the tools!

### Birthdays, Bridges, de Bruijn Graphs, and… Bgenomics

This is more than an idle math puzzle; it’s related to a fascinating challenge in Computational Genomics. (It was actually a question on my homework in grad school.)

When we want to read a DNA sequence, it’s usually far too long for our machines to process all at once. Instead, we copy the sequence a bunch, slice the copies up, and gather all the pieces of a chosen, more manageable, length — say, 4 base pair “letters”.

In the end, we know every length-four subsection in our DNA sequence, we just don’t know their order. For example, if we had a length-7 sequence and took all the random chunks of 4, we might end up with

TACA, ATTA, GATT, TTAC

Now, as though it were a one-dimensional jigsaw puzzle, we try to piece them together. Each chunk of 4 will overlap with its neighbors by 3 letters: the chunk ATTA ends with TTA, so we know the next chunk must start with TTA. There’s only one which does — TTAC — so those pieces fit together as ATTAC. Now we look for a piece that begins with TAC, and so on.

In our case, there’s only one unique way to arrange the five chunks:

GATTACA

As the sequences get longer, scientists turn to graph theory to find this arrangement. Treating each DNA chunk as an edge, we can form a directed graph indicating which chunks could follow (overlap with) each other.

This overlapping-ness graph is called a De Bruijn Graph, and if there’s a path which uses every edge exactly once, we’ve done it: we’ve found a way to order the DNA chunks and reconstruct the larger sequence!

If this sounds a bit like the Seven Bridges of Konigsberg problem, there’s a reason — it’s the same issue. It’s as though we were walking from “overlap island” to “overlap island”, each time saying to ourselves “Ok, if I have a piece ending in GCA, what bridge-chunk could come next?” and hoping to find a path that uses every chunk.

Since Euler solved the Bridges of Konigsberg problem, this type of path — using every edge exactly once — is known as an Eulerian Path. The great thing is that we have really efficient algorithms to find them!

However, we can run into trouble if some of our “overlap” sections aren’t distinct. If a section of DNA is repeated, that’s ok — we can probably still find the Eulerian Path through the graph.

…But if a section of the original DNA repeats three times, we’re screwed. When the repeat is at least as long as our “overlap” we can no longer find a unique path — multiple Eulerian Paths exist.

Let’s take the sequence AACATCCCATGGCATTT, in which the phrase “CAT” repeats three times. Once we reach the “ACAT” chunk, we don’t know what comes next. Overlapping with chunk “CATT” would lead us straight to the end — leaving out many chunks — so we know that comes last. But the loops starting either CATG or CATC could come next.

So, if we’re going to read a long DNA sequence, we might ask:
How many overlapping chunks of 4 can we take before there’s a 50% or a 5% or 1% chance that we see a triple-repeat which ruins our attempt to reconstruct the original?

### Back to the Birthday Problem

With our Birthday Problem, there are 365 different birthdays a person can have. With DNA chunks of 4, there are 64 different three-letter ways each chunk could end. If any three chunks have the same ending, we won’t know how to reconstruct our sequence.

As the chunks get longer, we have a much better chance of producing a unique Eulerian Path from our graph.

While we can’t move the Earth farther from the Sun (nor should we (probably)) to increase the number of possible birthdays in a year, we CAN use chunks larger than 4 and increase the number of ways each chunk can end. So if we know we’re sequencing a genome 100,000 letters long, how long do our chunks need to be in order for us to have a >99% chance of reconstructing it?

Since starting my job at Wolfram Research, I’ve been playing with their graph capabilities and put together this little interactive tool. It generates a random gene and shows the De Bruijn Graph when you take chunks of different lengths. It’s amazing how quickly a totally chaotic graph becomes more orderly!

(The cloud-deployment can be a bit sluggish, but give it a second after clicking a button. If you get the desktop version of Wolfram Mathematica you can play with things like this incredibly quickly. They’re so cool.)

Heck if I know. Sorry.

I can get the right answer of about 88 by running simulations, but I didn’t manage to derive the general formula for my class.

Every time I’ve shown this question to a friend — and the first time I saw it on my Computational Genomics homework — the response has been “Oh, this is simple. You just… wait, no, that won’t work. We can just… well… hm.”

Stack Exchange confirmed my fears: it’s ugly and we typically try to find approximations. I was momentarily excited to find Dr. Math claim to solve it, but they’d made a mistake and other mathematicians corrected them.

This 2005 paper in The Journal of Statistical Planning and Inference by Anirban DasGupta provides an answer, but it’s way more involved than I expected:

### Why is this so ugly?

In the original version, there’s a unique situation — each person has different birthdays. But in our version, for 23 people:

• each person could have distinct birthdays
• one pair could share a birthday and the other 21 are distinct
• two pairs could share birthdays and the other 19 are distinct
• three pairs could share birthdays and the other 17 are distinct
• eleven pairs could share birthdays and the last one is distinct

For each scenario, we need to calculate the number of ways it can occur, the probability of each, and how it impacts our chance of getting a triple. It’s a mess.

But enough of my complaining about an old homework problem I never solved and which I’m clearly over and never think about.

How did you approach the problem? Did you solve it? Let me know!

Personal note: In my continuing efforts against Perfectionism, I’m going to declare this done. It’s taken up real estate in my head long enough.

## Why Blocking Roads Can Speed Up Traffic

It’s so counter-intuitive that it’s called Braess’ Paradox: How can closing a road actually make everyone’s commute shorter? You would think that blocking a route would be an inconvenience, but under some circumstances it’s actually for the best.

Doesn’t sound right, does it?  Here’s the situation: Assume drivers are rational and intelligent.  I know, that’s a stretch – I grew up around DC.  But bear with me.  If there are multiple paths that people can take, they should in theory find an equilibrium between them.  If one path has less traffic and takes less time, more people will switch to it until it loses its advantage.  If one path starts longer than the others, nobody will use it until the other paths get congested enough to make it worth it.

So how can an extra path actually make the average commute time longer?  Shouldn’t an extra path just give people more options to choose from, and ultimately find the best equilibrium?

### The Situation:

It turns out that when some roads are more prone to traffic than others, it can create Braess’ Paradox.  Imagine that some roads aren’t as affected by traffic – I picture these as the local roads with traffic lights. They add a fixed amount of time to your commute, say 45 minutes. The other roads are heavily dependent on traffic – these highways can either be wonderfully fast or a mess of stop-and-go congestion, depending on how many other people are on them. The average time it takes to drive on them is the number of cars over 100.

Let’s say there are 4000 cars driving from the start to finish. Without the connector (dotted in the diagram), an equilibrium forms where half the drivers (2000 cars) take the top route through A, and half take the bottom route through B.  The highway takes 2000/100 = 20 minutes, and the local road takes 45 minutes. So half the population spends 45 minutes on a local street, followed by 20 minutes on a highway, and the other half of the drivers spend 20 minutes on a highway, followed by 45 minutes on a local street. Everyone gets to their destination in 65 minutes. Nobody has any incentive to switch.

But what if a new connector is opened between A and B, allowing people to go straight from one highway to the other? Now everyone thinks to themselves, “Hey, why spend 45 minutes on a local street when I could spend 20 minutes on the highway? I’m going to take the route Start –> A –> B –> Finish, and shave 25 minutes off of my commute time!”

Of course, if everyone thinks that way, there are now double the cars on each highway than there were before, and it’s half as fast: now each highway takes 40 minutes, not 20 minutes. That’s still 5 minutes less than the 45 minutes it takes to drive on the local street, though, so everyone still has an incentive to take the highway.

So in the end, how has the connector affected people’s commutes? Everyone’s commute used to be 65 minutes; now, everyone’s commute is 80 minutes. And to make it stranger, there’s no better path to take – anyone considering switching to their original route would be looking at an 85 minute drive.

### How does this happen?

How can opening a new, super-fast connector make commutes worse? It comes down to the price of anarchy and people’s selfish motivations.  With the connector open, each set of cars has the option to clog up the other half’s highways – saving themselves 5 minutes but adding 20 minutes to the other guys’ commute.

It’s like the prisoner’s dilemma: Each driver has the motivation to take the highways, even though it damages the overall system. Without the connector, nobody is allowed to “defect” for personal gain. In the traditional prisoner’s dilemma, it would be like a mafia boss keeping all his criminals anonymous. Without the option to rat each other out, criminals would avoid the selfish temptation and the entire system is better off.

Braess’ Paradox isn’t purely hypothetical – it has real-world implications in city planning. According to this New York Times article titled What if They Closed 42d Street and Nobody Noticed?, “When a network is not congested, adding a new street will indeed make things better. But in the case of congested networks, adding a new street probably makes things worse at least half the time, mathematicians say.”  That’s shocking. My intuitions about how traffic works were way off.

Lastly, via Presh Talkwalkar’s fantastic game theory blog, Mind Your Decisions, (which brought Braess’ paradox to my attention) there’s a great video of the paradox physically in action with springs. Check it out:

Imagine that one Sunday afternoon, Sleeping Beauty is taking part in a mysterious science experiment. The experimenter tells her:

“I’m going to put you to sleep tonight, and wake you up on Monday. Then, out of your sight, I’m going to flip a fair coin. If it lands Heads, I will send you home. If it lands Tails, I’ll put you back to sleep and wake you up again on Tuesday, and then send you home. But I will also, if the coin lands Tails, administer a drug to you while you’re sleeping that will erase your memory of waking up on Monday.”

So when she wakes up, she doesn’t know what day it is, but she does know that the possibilities are:

• It’s Monday, and the coin will land either Heads or Tails.
• It’s Tuesday, and the coin landed Tails.

We can rewrite the possibilities as:

• Tails, Monday
• Tails, Tuesday

I’d argue that since it’s a fair coin, you should place 1/2 probability on the coin being Heads and 1/2 on the coin being  Tails. So the probability on (Heads, Monday) should be 1/2. I’d also argue that since Tails means she wakes up once on Monday and once on Tuesday, and since those two wakings are indistinguishable from each other, you should split the remaining 1/2 probability evenly between (Tails, Monday) and (Tails, Tuesday). So you end up with:

• Heads, Monday  (P = 1/2)
• Tails, Monday (P = 1/4)
• Tails, Tuesday  (P = 1/4)

So, is that the answer? It seems indisputable, right? Not so fast. There’s something troubling about this result. To see what it is, imagine that Beauty is told, upon waking, that it’s Monday. Given that information, what probability should she assign to the coin landing Heads? Well, if you look at the probabilities we’ve assigned to the three scenarios, you’ll see that conditional on it being Monday, Heads is twice as likely as Tails. And why is that so troubling? Because the coin hasn’t been flipped yet. How can Beauty claim that a fair coin is twice as likely to come up Heads as Tails?

Can you figure out what’s wrong with the reasoning in this post?

## Easy Math Puzzle – Or is it?

How good are you at basic math? Can you solve this simple logic puzzle? Here, give it a go and let me know how long it took you to answer:

Got it yet?

It looks easier than it is. The options are presented beautifully to cause maximum mental confusion.

As my dad put it, the answer depends on the answer. If the answer is 60%, it’s 25%. If the answer is 25% it’s 50%. If the answer is 50% it’s 25%. There’s an endless loop with no correct answer.

Don’t lose sleep, I “found” an answer, it was hidden: [edited for clarity]

Yes, I photoshopped this. I’m either cheating or engaging in outside-the-box thinking. Sometimes it’s tough to tell the difference.

My preferred set of answers would be:

• A) 25%
• B) 50%
• C) 75%
• D) 50%

Though I’m tempted to throw a “0%” in for good measure…

(Puzzle via PostSecret by way of Spencer of Ask a Mathematician/Ask a Physicist)

[Edited for clarity]