Statistics | Measure of Doubt

Bet You Can’t Solve The Birthday Paradox

September 4, 2020 by Jesse Galef

If you’ve heard of the Birthday Paradox and/or like math puzzles and/or want to know how it connects with Computational Genomics and the Seven Bridges of Konigsberg, this is for you.

Math cake! By Sarah Lynn

The Birthday Paradox (which would be more accurately named The Birthday Somewhat Unintuitive Result) asks “How many people do you need in a group before there’s a 50% chance that at least two share a birthday?”

It’s easier to flip it around and ask “For a given number of people, what are the chances that NONE of them share a birthday?” This makes it simpler: each time we add another person to the group, we just need to calculate the number of “open” birthdays and multiply by the odds of our new member having one of those.

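$$P(\text{no shared birthdays for } n \text{ people}) = \frac{365}{365} \times \frac{364}{365} \times \cdots \times \frac{365 - (n - 1)}{365}$$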

If we just keep increasing n and calculating the probability, we find that with 23 people there’s a just-over-50% chance (about 50.7%) of at least two people sharing a birthday.

(Ok, technically we answered “Given a number of people what probability…” and used brute force instead of “Given a probability what number of people…” but let’s ignore that; everyone else does.)
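The brute-force search described above fits in a few lines. This is a minimal Python sketch (the post itself uses Mathematica); the function names are made up for illustration, and it assumes 365 equally likely birthdays with no leap years.

```python
def prob_no_shared_birthday(n, days=365):
    """Probability that n people all have different birthdays."""
    p = 1.0
    for k in range(n):
        # Person k+1 must land on one of the (days - k) still-open birthdays.
        p *= (days - k) / days
    return p


def smallest_group(target=0.5, days=365):
    """Smallest n whose chance of at least one shared birthday reaches target."""
    n = 1
    while 1 - prob_no_shared_birthday(n, days) < target:
        n += 1
    return n


print(smallest_group())                           # 23
print(round(1 - prob_no_shared_birthday(23), 3))  # 0.507
```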

How about if we want to know the probability that no THREE people in a group share a birthday?

This variant is trickier, and has tripped up many smart people. Do you think you can solve it?

Give it a shot! I’ll talk about the solution after some context for why it matters and some graphics/tools made with Wolfram Mathematica. I started as a machine learning research scientist with Wolfram earlier this year and I’ve been really enjoying playing with the tools!

Birthdays, Bridges, de Bruijn Graphs, and… Bgenomics

This is more than an idle math puzzle; it’s related to a fascinating challenge in Computational Genomics. (It was actually a question on my homework in grad school.)

When we want to read a DNA sequence, it’s usually far too long for our machines to process all at once. Instead, we copy the sequence a bunch, slice the copies up, and gather all the pieces of a chosen, more manageable, length — say, 4 base pair “letters”.

In the end, we know every length-four subsection in our DNA sequence, we just don’t know their order. For example, if we had a length-7 sequence and took all the random chunks of 4, we might end up with

TACA, ATTA, GATT, TTAC

Now, as though it were a one-dimensional jigsaw puzzle, we try to piece them together. Each chunk of 4 will overlap with its neighbors by 3 letters: the chunk ATTA ends with TTA, so we know the next chunk must start with TTA. There’s only one which does — TTAC — so those pieces fit together as ATTAC. Now we look for a piece that begins with TAC, and so on.

In our case, there’s only one way to arrange the four chunks:

GATTACA
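The jigsaw procedure is easy to sketch in Python. The helper name `assemble` is hypothetical, and the sketch assumes the easy case described here: every three-letter overlap is distinct, so there is exactly one chunk that can come next at each step.

```python
def assemble(chunks):
    """Chain chunks whose (k-1)-letter suffix and prefix overlap.

    Assumes exactly one valid ordering exists (no repeated overlaps).
    """
    k = len(chunks[0])
    by_prefix = {c[:k - 1]: c for c in chunks}
    suffixes = {c[1:] for c in chunks}
    # The first chunk is the one whose prefix is nobody else's suffix.
    current = next(c for c in chunks if c[:k - 1] not in suffixes)
    sequence = current
    while current[1:] in by_prefix:
        current = by_prefix[current[1:]]
        sequence += current[-1]  # each new chunk contributes one new letter
    return sequence


print(assemble(["TACA", "ATTA", "GATT", "TTAC"]))  # GATTACA
```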

As the sequences get longer, scientists turn to graph theory to find this arrangement. Treating each DNA chunk as an edge, we can form a directed graph indicating which chunks could follow (overlap with) each other.

This overlapping-ness graph is called a De Bruijn Graph, and if there’s a path which uses every edge exactly once, we’ve done it: we’ve found a way to order the DNA chunks and reconstruct the larger sequence!

If this sounds a bit like the Seven Bridges of Konigsberg problem, there’s a reason — it’s the same issue. It’s as though we were walking from “overlap island” to “overlap island”, each time saying to ourselves “Ok, if I have a piece ending in GCA, what bridge-chunk could come next?” and hoping to find a path that uses every chunk.

Since Euler solved the Bridges of Konigsberg problem, this type of path — using every edge exactly once — is known as an Eulerian Path. The great thing is that we have really efficient algorithms to find them!
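One of those efficient algorithms is Hierholzer's. Below is a minimal Python sketch of it applied to a De Bruijn graph, assuming an Eulerian path exists; the function name is made up, and this is not the code behind the interactive tool in this post.

```python
from collections import defaultdict


def eulerian_path(chunks):
    """Hierholzer's algorithm on the De Bruijn graph of the chunks.

    Nodes are (k-1)-letter overlaps; each chunk is an edge from its
    prefix to its suffix. Assumes an Eulerian path exists.
    """
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for c in chunks:
        graph[c[:-1]].append(c[1:])
        balance[c[:-1]] += 1
        balance[c[1:]] -= 1
    # Start where out-degree exceeds in-degree; any node works for a circuit.
    start = next((n for n, b in balance.items() if b == 1), chunks[0][:-1])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())         # dead end: node is finished
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])


print(eulerian_path(["TACA", "ATTA", "GATT", "TTAC"]))  # GATTACA
```

When several Eulerian paths exist, this returns just one of them, which is exactly the ambiguity the next paragraphs worry about.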

However, we can run into trouble if some of our “overlap” sections aren’t distinct. If a section of DNA appears twice, that’s ok: we can probably still find the Eulerian Path through the graph.

…But if a section of the original DNA repeats three times, we’re screwed. When the repeat is at least as long as our “overlap” we can no longer find a unique path — multiple Eulerian Paths exist.

Let’s take the sequence AACATCCCATGGCATTT, in which the phrase “CAT” repeats three times. Once we reach the “ACAT” chunk, we don’t know what comes next. Overlapping with chunk “CATT” would lead us straight to the end (leaving out many chunks), so we know that comes last. But the loops starting with either CATG or CATC could come next.
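To see the ambiguity concretely, here is a small brute-force Python sketch (hypothetical helper name) that tries every ordering of the chunks consistent with the overlaps. It is exponential in the worst case, so it only suits toy sequences like this one.

```python
def all_reconstructions(chunks):
    """Backtracking search over every ordering that uses each chunk once."""
    k = len(chunks[0])
    results = set()

    def extend(sequence, remaining):
        if not remaining:
            results.add(sequence)
            return
        for i, c in enumerate(remaining):
            # A chunk fits if its first k-1 letters match the current ending.
            if sequence[-(k - 1):] == c[:-1]:
                extend(sequence + c[-1], remaining[:i] + remaining[i + 1:])

    for i, c in enumerate(chunks):
        extend(c, chunks[:i] + chunks[i + 1:])
    return sorted(results)


sequence = "AACATCCCATGGCATTT"
chunks = [sequence[i:i + 4] for i in range(len(sequence) - 3)]
for candidate in all_reconstructions(chunks):
    print(candidate)
# AACATCCCATGGCATTT
# AACATGGCATCCCATTT
```

Both candidates use exactly the same fourteen chunks, so the chunks alone can't tell us which one was the original.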

So, if we’re going to read a long DNA sequence, we might ask:
How many overlapping chunks of 4 can we take before there’s a 50% or a 5% or 1% chance that we see a triple-repeat which ruins our attempt to reconstruct the original?

This is where we return to our Birthday Paradox variant!

Back to the Birthday Problem

With our Birthday Problem, there are 365 different birthdays a person can have. With DNA chunks of 4, there are 64 different three-letter ways each chunk could end. If any three chunks have the same ending, we won’t know how to reconstruct our sequence.

As the chunks get longer, we have a much better chance of producing a unique Eulerian Path from our graph.

While we can’t move the Earth farther from the Sun (nor should we (probably)) to increase the number of possible birthdays in a year, we CAN use chunks larger than 4 and increase the number of ways each chunk can end. So if we know we’re sequencing a genome 100,000 letters long, how long do our chunks need to be in order for us to have a >99% chance of reconstructing it?

Since starting my job at Wolfram Research, I’ve been playing with their graph capabilities and put together this little interactive tool. It generates a random gene and shows the De Bruijn Graph when you take chunks of different lengths. It’s amazing how quickly a totally chaotic graph becomes more orderly!

(The cloud-deployment can be a bit sluggish, but give it a second after clicking a button. If you get the desktop version of Wolfram Mathematica you can play with things like this incredibly quickly. They’re so cool.)

The Answer

Heck if I know. Sorry.

I can get the right answer of about 88 by running simulations, but I didn’t manage to derive the general formula for my class.
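A simulation along those lines might look like this minimal Python sketch. It is not the original Mathematica code; the function names, trial count, and seed are arbitrary choices for illustration, and it assumes 365 equally likely birthdays.

```python
import random
from collections import Counter


def has_triple(n, days=365, rng=random):
    """True if any birthday is shared by at least three of n people."""
    counts = Counter(rng.randrange(days) for _ in range(n))
    return max(counts.values()) >= 3


def estimate_triple_probability(n, trials=5_000, seed=0):
    """Monte Carlo estimate of the chance of a triple among n people."""
    rng = random.Random(seed)
    hits = sum(has_triple(n, rng=rng) for _ in range(trials))
    return hits / trials


for n in (80, 88, 96):
    print(n, estimate_triple_probability(n))
```

The estimates are noisy at this trial count, so pinning down the exact crossover point needs more trials or an exact calculation.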

Every time I’ve shown this question to a friend — and the first time I saw it on my Computational Genomics homework — the response has been “Oh, this is simple. You just… wait, no, that won’t work. We can just… well… hm.”

Stack Exchange confirmed my fears: it’s ugly and we typically try to find approximations. I was momentarily excited to find Dr. Math claim to solve it, but they’d made a mistake and other mathematicians corrected them.
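One common approximation treats the number of triples as roughly Poisson, so P(no triple) ≈ exp(−C(n, 3) / d²) for n items spread over d equally likely bins. The Python sketch below applies it to the 100,000-letter question from earlier. It is a rough guide only: it assumes chunk endings behave like independent uniform draws, which real genomes don't, and the function names are invented for illustration.

```python
from math import comb, exp


def approx_prob_no_triple(n_items, n_bins):
    """Poisson approximation: expected number of triples is C(n, 3) / bins^2."""
    expected_triples = comb(n_items, 3) / n_bins**2
    return exp(-expected_triples)


def shortest_chunk_length(genome_length, target=0.99):
    """Smallest chunk length k whose approximate P(no triple) exceeds target."""
    k = 2
    while True:
        n_chunks = genome_length - k + 1  # overlapping chunks of length k
        n_endings = 4 ** (k - 1)          # possible (k-1)-letter endings
        if approx_prob_no_triple(n_chunks, n_endings) > target:
            return k
        k += 1


print(shortest_chunk_length(100_000))  # 15 under this approximation
```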

This 2005 paper in The Journal of Statistical Planning and Inference by Anirban DasGupta provides an answer, but it’s way more involved than I expected:

Why is this so ugly?

In the original version, there’s only one situation to count: every person has a different birthday. But in our version, for 23 people:

all 23 people could have distinct birthdays
one pair could share a birthday and the other 21 are distinct
two pairs could share birthdays and the other 19 are distinct
three pairs could share birthdays and the other 17 are distinct
…
eleven pairs could share birthdays and the last one is distinct

For each scenario, we need to calculate the number of ways it can occur, the probability of each, and how it impacts our chance of getting a triple. It’s a mess.
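That bookkeeping can at least be handed to a computer. The Python sketch below sums over the number of pairs: for each count, it chooses which days hold pairs, which hold singles, and which stay empty, then counts how the people can be dealt onto those days. The function name is made up, it assumes 365 equally likely birthdays, and it is still a sum over every scenario rather than a tidy closed form.

```python
from fractions import Fraction
from math import factorial


def p_no_triple(n, days=365):
    """Exact probability that no birthday is shared by three or more of n people."""
    favorable = 0
    for pairs in range(n // 2 + 1):
        singles = n - 2 * pairs
        unused = days - pairs - singles
        if unused < 0:
            continue  # not enough days for this scenario
        # Ways to pick which days hold a pair, a single person, or nobody.
        day_choices = factorial(days) // (
            factorial(pairs) * factorial(singles) * factorial(unused)
        )
        # Ways to deal n labelled people onto those days (order within a pair is irrelevant).
        people_arrangements = factorial(n) // 2**pairs
        favorable += day_choices * people_arrangements
    return Fraction(favorable, days**n)


n = 3
while 1 - p_no_triple(n) < Fraction(1, 2):
    n += 1
print(n)  # 88
```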

But enough of my complaining about an old homework problem I never solved and which I’m clearly over and never think about.

How did you approach the problem? Did you solve it? Let me know!

Personal note: In my continuing efforts against Perfectionism, I’m going to declare this done. It’s taken up real estate in my head long enough.

Filed under Math, Paradoxes, Statistics

How has Bayes’ Rule changed the way I think?

June 4, 2013 by Julia Galef 3 Comments

People talk about how Bayes’ Rule is so central to rationality, and I agree. But given that I don’t go around plugging numbers into the equation in my daily life, how does Bayes actually affect my thinking?
A short answer, in my new video below:

(This is basically what the title of this blog was meant to convey — quantifying your uncertainty.)

Filed under Math, Rationality, Statistics

RS episode #53: Parapsychology

January 29, 2012 by Julia Galef 5 Comments

In Episode 53 of the Rationally Speaking Podcast, Massimo and I take on parapsychology, the study of phenomena such as extrasensory perception, precognition, and remote viewing. We discuss the type of studies parapsychologists conduct, what evidence they’ve found, and how we should interpret that evidence. The field is mostly not taken seriously by other scientists, which parapsychologists argue is unfair, given that their field shows some consistent and significant results. Do they have a point? Massimo and I discuss the evidence and talk about what the results from parapsychology tell us about the practice of science in general.

http://www.rationallyspeakingpodcast.org/show/rs53-parapsychology.html

Filed under Psychology, Statistics

Coach Smith’s Gutsy Call

November 16, 2011 by Jesse Galef 17 Comments

Coach Mike Smith was facing a tough decision. His Falcons were in overtime against the division-rival Saints. His team had been stopped on their own 29-yard line and were facing fourth down and inches. Should he tell his players to punt, or go for it? A punt would be safe. Trying to get the first down would be the high-risk, high-reward play. Success would mean a good chance to win; failure would practically guarantee a loss. What play call would give his team the best chance to win?

He decided to be aggressive. He called for star running back Michael Turner to try pounding up the middle of the field.

It failed. The Saints were given the ball in easy range to score, and quickly did so. The media and fans criticized Smith for his stupid decision.

But is the criticism fair? If the play call had worked, I bet he would have been praised for his guts and brilliance. I think my favorite reaction came from ESPN writer Pat Yasinskas:

When Mike Smith first decided to go for it on fourth-and-inches in overtime, I liked the call. I thought it was gutsy and ambitious. After watching Michael Turner get stuffed, I changed my mind. Smith should have punted and taken his chances with his defense.

What a perfect, unabashed example of Outcome Bias! We have a tendency to judge a past decision solely based on the result, not on the quality of the choice given the information available at the time.

Did Coach Smith know that the play would fail? No, of course not. He took a risk, which could go well or poorly. The quality of his decision lies in the chances of success and the expected values for each call.

Fortunately, some other people at ESPN did the real analysis, using 10 years of historical data of teams’ chances to win based on factors like field position, score, time remaining, and so on:

Choice No. 1: Go for the first down

…Since 2001, the average conversion percentage for NFL teams that go for it on fourth-and-1 is 66 percent. Using this number, we can find the expected win probability for Atlanta if it chooses this option.

* Atlanta win probability if it converts (first-and-10 from own 30-yard line): 67.1 percent
* Atlanta win probability if it does not convert (Saints first-and-10 from Falcons’ 29-yard line): 18 percent.
* Expected win probability of going for the first down: 0.660*(.671) + (1-.660)*(.180) = 50.4%

Choice No. 2: Punt

* For this choice, we will assume the Falcons’ net punt average of 36 yards for this season. This means the expected field position of the Saints after the punt is their own 35-yard line. This situation (Saints with first-and-10 from their 35, in OT, etc.) would give the Falcons a win probability of 41.4%.

So by choosing to go for it on fourth down, the Falcons increased their win probability by 9 percentage points.

That’s a much better way to evaluate a coach’s decision! Based on a simple model and league averages (there are problems with both of those, but they’re better than simply trusting outcome!) the punt was not the best option. Smith made the right decision.

Well, sort of. There are different ways to go for the fourth-down conversion, and according to Brian Burke at AdvancedNFLStats, Smith chose the wrong one:

Conversion success rates on 1-yd to go runs (%)

Position	3rd Down	4th Down
FB	77	70
QB	87	82
RB	68	66
Total	72	72

In these situations, quarterback sneaks have proven much more effective than having your running back take the ball. In a perfect game-theory world, defenses would realize their weakness and focus more effort on stopping it. But for now, it remains something more offenses can exploit. According to the numbers, the Falcons probably could have made a better decision.
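To put rough numbers on that, the Python sketch below plugs the table's fourth-down conversion rates into the ESPN expected-win-probability formula quoted earlier. It assumes the win probabilities after a conversion (67.1%) and after a failure (18%) stay the same no matter who carries the ball, which is a simplification of mine rather than anything ESPN or Burke calculated.

```python
WIN_IF_CONVERT = 0.671  # ESPN figure: first-and-10 from own 30
WIN_IF_FAIL = 0.180     # ESPN figure: Saints' ball at the Falcons' 29
WIN_IF_PUNT = 0.414     # ESPN figure for punting


def expected_win_probability(p_convert):
    """Weighted average of the two outcomes of going for it."""
    return p_convert * WIN_IF_CONVERT + (1 - p_convert) * WIN_IF_FAIL


fourth_down_rates = {"RB": 0.66, "FB": 0.70, "QB": 0.82}
for position, rate in fourth_down_rates.items():
    print(f"{position}: {expected_win_probability(rate):.1%}")
print(f"Punt: {WIN_IF_PUNT:.1%}")
# RB: 50.4%
# FB: 52.4%
# QB: 58.3%
# Punt: 41.4%
```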

And, of course, it was OBVIOUS to me at the time that they should have called a quarterback sneak. </hindsight bias>

Filed under Game Theory, Statistics

A Sleeping Beauty paradox

November 10, 2011 by Julia Galef 29 Comments

Imagine that one Sunday afternoon, Sleeping Beauty is taking part in a mysterious science experiment. The experimenter tells her:

“I’m going to put you to sleep tonight, and wake you up on Monday. Then, out of your sight, I’m going to flip a fair coin. If it lands Heads, I will send you home. If it lands Tails, I’ll put you back to sleep and wake you up again on Tuesday, and then send you home. But I will also, if the coin lands Tails, administer a drug to you while you’re sleeping that will erase your memory of waking up on Monday.”

So when she wakes up, she doesn’t know what day it is, but she does know that the possibilities are:

It’s Monday, and the coin will land either Heads or Tails.
It’s Tuesday, and the coin landed Tails.

We can rewrite the possibilities as:

Heads, Monday
Tails, Monday
Tails, Tuesday

I’d argue that since it’s a fair coin, you should place 1/2 probability on the coin being Heads and 1/2 on the coin being Tails. So the probability on (Heads, Monday) should be 1/2. I’d also argue that since Tails means she wakes up once on Monday and once on Tuesday, and since those two wakings are indistinguishable from each other, you should split the remaining 1/2 probability evenly between (Tails, Monday) and (Tails, Tuesday). So you end up with:

Heads, Monday (P = 1/2)
Tails, Monday (P = 1/4)
Tails, Tuesday (P = 1/4)

So, is that the answer? It seems indisputable, right? Not so fast. There’s something troubling about this result. To see what it is, imagine that Beauty is told, upon waking, that it’s Monday. Given that information, what probability should she assign to the coin landing Heads? Well, if you look at the probabilities we’ve assigned to the three scenarios, you’ll see that conditional on it being Monday, Heads is twice as likely as Tails. And why is that so troubling? Because the coin hasn’t been flipped yet. How can Beauty claim that a fair coin is twice as likely to come up Heads as Tails?
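The conditional probability in question can be checked mechanically. This short Python sketch just applies the definition of conditional probability to the three numbers assigned above; it takes those assignments as given and says nothing about whether they are the right ones.

```python
from fractions import Fraction

# Probabilities assigned in the post to each (coin, day) scenario.
scenarios = {
    ("Heads", "Monday"): Fraction(1, 2),
    ("Tails", "Monday"): Fraction(1, 4),
    ("Tails", "Tuesday"): Fraction(1, 4),
}

# P(Heads | Monday) = P(Heads and Monday) / P(Monday)
p_monday = sum(p for (coin, day), p in scenarios.items() if day == "Monday")
p_heads_given_monday = scenarios[("Heads", "Monday")] / p_monday

print(p_heads_given_monday)      # 2/3
print(1 - p_heads_given_monday)  # 1/3
```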

Can you figure out what’s wrong with the reasoning in this post?

Filed under Paradoxes, Philosophy, Statistics

RS #47: The Search for Extra-Terrestrial Intelligence

November 7, 2011 by Julia Galef 3 Comments

In the latest episode of Rationally Speaking, Massimo and I spar about SETI, the Search for Extra-Terrestrial Intelligence: Is it a “scientific” endeavor? Is it worth maintaining? How would we find intelligent alien life, if it’s out there?

My favorite parts of this episode are the ones in which we’re debating how likely it is that intelligent alien life exists. Massimo’s opinion is essentially that we have no way to answer the question; I’m less pessimistic. There are a number of scientific facts which I think should raise or lower our estimates of the prevalence of intelligent alien life. And what about the fact of our own existence? Does that provide any evidence we can use to reason about the likelihood of our ever encountering other intelligent life? It’s a very tricky question, fraught as it is with unresolved philosophical problems in probability theory, but a fascinating one.


Filed under Physics, Statistics

Game theory and basketball

June 13, 2011 by Julia Galef 5 Comments

Ben Morris is a friend-of-a-friend of mine who recently competed in a contest sponsored by ESPN called “Stat Geek Smackdown,” in which the goal was to correctly predict as many of the NBA playoff games as possible. For each correct guess, a contestant received 5 points.

Heading into the final game between Miami and Dallas, Ben was in second place, trailing just 4 points behind a veteran stat geek named Ilardi. By most estimates, Miami had about a 63% chance of beating Dallas. But Ben realized that if he and Ilardi both chose Miami, then even if Miami won the game, Ilardi would still win the competition, because he and Ben would each get 5 points and the gap between their scores would remain unchanged. In order for Ben to win the competition, he would have to pick the winning team and Ilardi would have to pick the losing team.

So that created an interesting game theory problem: If Ben predicted that Ilardi would pick Miami, since they were more likely to win, then Ben should pick Dallas. But if Ilardi predicted that Ben would be reasoning that way, then Ilardi might pick Dallas, knowing that all he needs to do to win the competition is to pick the same team as Ben. But of course if Ben predicts that Ilardi will be thinking that way, maybe Ben should pick Miami…

What would you do if you were Ben? You can read about Ben’s reasoning on his excellent blog, Skeptical Sports, but here’s my summary. Ben essentially had two options:

(1) His first option was to play his Nash equilibrium strategy, which is a concept you might recall if you ever took game theory (or if you saw the movie “A Beautiful Mind,” although the movie botched the explanation). That’s the set of strategies (Ben’s and Ilardi’s) which gives each of them no incentive to switch to a new strategy as long as the other guy doesn’t. The Nash equilibrium strategy is especially appealing if you’re risk averse because it’s “unexploitable,” meaning that it gives you predictable, fixed odds of winning the game, no matter what strategy your opponent uses.

In this case — and you can read Ben’s blog for the proof — the Nash equilibrium is for Ben to pick Miami with exactly the same probability as Miami has of losing (0.37) and for Ilardi to pick Miami with exactly the same probability as Miami has of winning (0.63). (You might wonder how you should pick a team “with X probability,” but it’s pretty easy: just roll a 100-sided die, and pick the team if the die comes up X or lower.)

If you do the calculation, you’ll find that playing this strategy — i.e., rolling a hundred-sided die and picking Miami only if the die came up 37 or lower — would give Ben a 23.3% chance of beating Ilardi, no matter how Ilardi decided to play. Not terrible odds, especially given that this approach doesn’t require Ben to make any predictions about Ilardi’s strategy. But perhaps Ben could do better if he were able to make a reasonable guess about what Ilardi would do.

(2) That leads us to option two: Ben could abandon his Nash equilibrium strategy, if he felt that he could predict Ilardi’s action with sufficient confidence. To be precise, if Ben thinks that Ilardi is more than 63% likely to pick Miami, then Ben should pick Dallas.

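Here’s a rough proof. Call “p” the likelihood that Ilardi picks Miami, and “q” the likelihood that Ben picks Miami. Then we can assign probabilities to each of the outcomes in which Ben wins:

Miami wins, Ben picks Miami, Ilardi picks Dallas: .63 × q × (1 – p)
Dallas wins, Ben picks Dallas, Ilardi picks Miami: .37 × (1 – q) × p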

Since the two outcomes are mutually exclusive, we can add up their probabilities to get the total probability that Ben wins, as a function of p and q:

Probability Ben wins = .37p + .63q – pq

Just to illustrate how Ben’s chance of winning changes depending on p, I plugged in three different values of p to create three different lines: For the black line, p=0.63. For the red line, p < 0.63 (to be precise, I plugged in p=0.62, but any value of p<0.63 will create an upward sloping line). For the blue line, p > 0.63 (to be precise, I plugged in p=0.64, but any value of p>0.63 will create a downward sloping line).

If p = .63, that renders Ben’s chance of winning constant ( .233) for all values of q. In other words, if Ilardi seems to be about 63% likely to pick Miami, then it doesn’t matter how Ben picks, he’ll have the same chance of winning (23.3%) as he would if he played his Nash equilibrium strategy.

If p > .63, Ben’s chance of winning decreases as q (his probability of choosing Miami) increases. In other words, if Ben thinks there’s a greater than 63% chance that Ilardi will pick Miami, then Ben should pick Miami with as low a probability as possible (i.e., he should pick Dallas).

If p < .63, Ben’s chance of winning increases as q (his probability of choosing Miami) increases. In other words, if Ben thinks there’s a lower than 63% chance that Ilardi will pick Miami, then Ben should pick Miami with as high a probability as possible (i.e., he should pick Miami).
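The three cases can be checked numerically. This Python sketch (function name made up) computes Ben's chance of winning from the two winning outcomes, which expands to the same .37p + .63q – pq formula, and evaluates it at a few values of p and q.

```python
P_MIAMI_WINS = 0.63


def ben_win_probability(p, q, miami=P_MIAMI_WINS):
    """p: chance Ilardi picks Miami; q: chance Ben picks Miami."""
    ben_miami_ilardi_dallas = miami * q * (1 - p)        # Miami wins
    ben_dallas_ilardi_miami = (1 - miami) * (1 - q) * p  # Dallas wins
    return ben_miami_ilardi_dallas + ben_dallas_ilardi_miami


for p in (0.62, 0.63, 0.64):
    row = [round(ben_win_probability(p, q), 4) for q in (0.0, 0.37, 1.0)]
    print(p, row)
# 0.62 [0.2294, 0.2331, 0.2394]
# 0.63 [0.2331, 0.2331, 0.2331]
# 0.64 [0.2368, 0.2331, 0.2268]
```

The middle column is the Nash equilibrium strategy (q = 0.37): it gives Ben the same 23.3% whatever Ilardi does.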

So what happened? Ben estimated that Ilardi would pick Miami with greater than 63% probability. That’s mainly because most people aren’t comfortable playing probabilistic strategies that require them to roll a die — people will simply “round up” in their mind and pick the team that would give them a win more often than not. And Ben knew that if he was right about Ilardi picking Miami, then Ben would end up with a 37% chance of winning, rather than the 23.3% chance he would have had if he stuck to his equilibrium strategy.

So Ben picked Dallas. As he’d predicted, Ilardi picked Miami, and lucky for Ben, Dallas won. This one case study doesn’t prove that Ilardi reasoned as Ben expected, of course. Ben summed up the takeaway on his blog:

Of course, we shouldn’t read too much into this: it’s only a single result, and doesn’t prove that either one of us had an advantage. On the other hand, I did make that pick in part because I felt that Ilardi was unlikely to “outlevel” me. To be clear, this was not based on any specific assessment about Ilardi personally, but based my general beliefs about people’s tendencies in that kind of situation.

Was I right? The outcome and reasoning given in the final “picking game” has given me no reason to believe otherwise, though I think that the reciprocal lack of information this time around was a major part of that advantage. If Ilardi and I find ourselves in a similar spot in the future (perhaps in next year’s Smackdown), I’d guess the considerations on both sides would be quite different.

Filed under Game Theory, Statistics

Thinking in greyscale

May 23, 2011 by Julia Galef 30 Comments

Have you ever converted an image from greyscale into black and white? Basically, your graphics program rounds all of the lighter shades of grey down to “white,” and all of the darker shades of grey up to “black.” The result is a visual mess – same rough shape as the original, but unrecognizable.

Something similar happens to our mental picture of the world whenever we talk about how we “believe” or “don’t believe” an idea. Belief isn’t binary. Or at least, it shouldn’t be. In reality, while we can be more confident in the truth of some claims than others, we can’t be absolutely certain of anything. So it’s more accurate to talk about how much we believe a claim, rather than whether or not we believe it. For example, I’m at least 99% sure that the moon landing was real. My confidence that mice have the capacity to suffer is high, but not quite as high. Maybe 85%. Ask me about a less-developed animal, like a shrimp, and my confidence would fall to near-uncertainty, around 60%.

Obviously there’s no rigorous, precise way to assign a number to how confident you are about something. But it’s still valuable to get in the habit, at least, of qualifying your statements of belief with words like “probably,” or “somewhat,” or “very.” It just helps keep you thinking in greyscale, and reminds you that different amounts of evidence should yield different degrees of belief. Why lose all that resolution unnecessarily by switching to black and white?

More importantly, the reason you shouldn’t ever have 0% or 100% confidence in any empirical claim is because that implies that there is no conceivable evidence that could ever make you change your mind. You can prove this formally with Bayes’ theorem, which is a simple rule of probability that also serves as a way of describing how an ideal reasoner would update his belief in some hypothesis “H” after encountering some evidence “E.” Bayes’ theorem can be written like this:

… in other words, it’s a rule for how to take your prior probability of a hypothesis, P[H], and update it based on new evidence [E] to get the probability of H given that evidence: P[H | E].

So what happens if you think there’s zero chance of some hypothesis H being true? Well, just plug in zero for “P[H],” all the way on the right, and you’ll realize that the entire equation becomes zero (because zero times anything is zero). So you don’t have to know any of the other terms to conclude that P[H | E] = 0. That means that if you start out with zero belief in a hypothesis, you’ll always have zero belief in that hypothesis no matter what evidence comes your way.

And what if you start out convinced, beyond a shadow of a doubt, that some hypothesis is true? That’s akin to saying that P[H] = 1. That also implies you must put zero probability on all the other possible hypotheses. So plug in 1 for P[H] and 0 for P[not H] in the equation above. With just a bit of arithmetic you’ll find that P[H | E] = 1. Which means that no matter what evidence you come across, if your belief in a hypothesis is 100% before seeing some evidence (that is, P[H] = 1) then your belief in that hypothesis will still be 100% after seeing that evidence (that is, P[H | E] = 1).
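Both edge cases are easy to check in code. The Python sketch below uses a standard form of Bayes' theorem with the denominator expanded, P[H | E] = P[E | H]·P[H] / (P[E | H]·P[H] + P[E | not H]·P[not H]). The function name and the example likelihoods are made up for illustration, and it assumes the evidence has nonzero probability, since otherwise the division fails.

```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    """P[H | E] from Bayes' theorem with the denominator expanded."""
    numerator = p_e_given_h * prior
    evidence = numerator + p_e_given_not_h * (1 - prior)  # P[E]
    return numerator / evidence


# An ordinary prior moves with the evidence.
print(f"{posterior(0.5, 0.9, 0.1):.3f}")  # 0.900

# A prior of 0 stays at 0 even when the evidence strongly favors H.
print(posterior(0.0, 0.9, 0.1))           # 0.0

# A prior of 1 stays at 1 even when the evidence strongly favors not-H.
print(posterior(1.0, 0.001, 0.999))       # 1.0
```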

As much as I’m in favor of thinking in greyscale, however, I will admit that it can be really difficult to figure out how to feel when you haven’t committed yourself wholeheartedly to one way of viewing the world. For example, if you hear that someone has been accused of rape, your estimation of the likelihood of his guilt should be somewhere between 0 and 100%, depending on the circumstances. But we want, instinctively, to know how we should feel about the suspect. And the two possible states of the world (he’s guilty/he’s innocent) have such radically different emotional attitudes associated with them (“That monster!”/”That poor man!”). So how do you translate your estimated probability of his guilt into an emotional reaction? How should you feel about him if you’re, say, 80% confident he’s guilty and 20% confident he’s innocent? Somehow, finding a weighted average of outrage and empathy doesn’t seem like the right response — and even if it were, I have no idea what that would feel like.

Filed under Rationality, Statistics

The D.I.Y. way of getting a probability estimate from your doctor

May 6, 2011 by Julia Galef 20 Comments

One frustrating thing about dealing with doctors is that they tend to be unwilling or unable to talk about probabilities. I run into this problem in particular when they’ve told me there is “a chance” of something, like a chance of a complication of a procedure, or a chance of transmitting an infection, or a chance of an illness lasting past some time threshold, and so on. Whenever I’ve pressed them to try to tell me approximately how much of a chance there is, they’ve told me something to the effect of, “It varies” or “I can’t say.” I sometimes tell them, look, I know you’re not going to have exact numbers for me, but I just want to know if we’re talking more like 50% or, you know, 1%? Still, they balk.

My interpretation is that this happens due to a combination of (1) people not having a good intuitive sense of how to estimate probabilities and (2) doctors not wanting to be held liable for making me a “promise” – perhaps they’re concerned that if they give me a low estimate and it happens anyway, then I’ll get angry or sue them or something.

So I wanted to share a useful tip from my friend, the mathematician who blogs at www.askamathematician.com, who was about to have his wisdom teeth removed and was trying unsuccessfully to get his surgeon to tell him the approximate risks of various possible complications from surgery. He discovered that you can actually get a percentage out of your doctor if you’re willing to just construct it yourself:

Friend: “I’ve heard that it’s possible to end up with permanent numbness in your mouth or lip after this surgery… what’s the chance of that happening?”

Surgeon: “It’s pretty low.”

Friend: “About how low? Are we talking, like five percent? Or only a fraction of one percent?”

Surgeon: “I really can’t say.”

Friend: “Okay, well… how many of these surgeries have you done?”

Surgeon: “About four thousand.”

Friend: “How many of your patients have had permanent numbness?”

Surgeon: “Two.”

Friend: “Ah, okay. So, about one twentieth of one percent.”

Surgeon: “I really can’t give you a percentage.”

Filed under Health, Statistics

Visualizing data with lines, blocks, and roller coasters

April 25, 2011 by Julia Galef 8 Comments

Randall Munroe's infographic on radiation dose levels

I’m a huge fan of clever ways of visualizing data, especially when there’s something challenging about the data in question. For example, if it contains more than three important dimensions and therefore can’t be easily graphed with the typical representations (e.g., position on x-axis, position on y-axis, color of dot). Or if it contains a few huge outliers which distort the scale of the data.

This recent infographic in Scientific American by my friend (and co-blogger, at Rationally Speaking) Lena Groeger is a great example of the latter. The challenge in displaying relative levels of radioactivity is that there are a few outliers (e.g., Chernobyl) which are so many times higher than the rest of the data that when you try to graph them on the same scale, you end up with the outlier at one end and then all the rest of the data clumped together in an indeterminate mass at the other end.

Randall Munroe over at the webcomic XKCD came up with a pretty good, inventive solution that relies on our intuitive sense of area, rather than length. Each successive grid represents only one small block of the next grid, which is how he manages to cram the entire skewed scale into one page. It’s cool, but I don’t think it works that intuitively. We have to consciously keep in mind the reminder of how big each grid is relative to the next, and it’s easy to lose your grip on the relative scales involved.

However, one of the benefits of online infographics as opposed to print is that you don’t have to fit the whole image in view at once. Lena and her colleagues created a long, leisurely scale that has the space at one end to show the differences between various low levels of radiation dose, below 100,000 micro-Sieverts… and then it hits you with a sense of relative magnitude as you have to scroll down, down, down, until you get to Chernobyl at 6 million micro-Sieverts.

It reminded me of one of my all-time favorite data visualizations: over one hundred years of housing prices, transformed into a first-person perspective roller coaster ride. There are a number of wonderful things about this design choice. For one thing, it works on a visceral level: reaching unprecedented heights actually makes you feel giddy, and sudden steep declines are a little scary.

I also love the way it captures the most recent housing bubble — as you keep climbing higher, and higher, and higher, and higher, and higher, the repetitive climb starts to feel relaxing, and you even forget that you’re on a roller coaster. You forget, in other words, that you’re not going to keep going up forever. And that moment at the end, when the coaster pauses and you turn around to look down at how far away the ground is (this video stops right before the 2008 crash) — shiver. Just perfect.

Filed under Statistics, Video
