Calibrating our Confidence


It’s one thing to know how confident we are in our beliefs; it’s another to know how confident we should be. Sure, the de Finetti’s Game thought experiment gives us a way to put a number on our confidence – quantifying how likely we feel we are to be right. But we still need to learn to calibrate that sense of confidence against actual results. Are we appropriately confident?

Taken at face value, if we express 90% confidence 100 times, we expect to be proven wrong an average of 10 times. But very few people take the time to see whether that’s the case. We can’t trust our memories on this, as we’re probably more likely to remember our accurate predictions and forget all the offhand predictions that fell flat. If we want to get an accurate sense of how well we’ve calibrated our confidence, we need a better way to track it.

Well, here’s a way: PredictionBook.com. While working on my last post, I stumbled on this nifty project. Its homepage features the words “How Sure Are You?” and “Find out just how sure you should be, and get better at being only as sure as the facts justify.” Sounds perfect, right?

It allows you to enter your prediction, how confident you are, and when the answer will be known.  When the time comes, you record whether or not you were right and it tracks your aggregate stats.  Your predictions can be private or public – if they’re public, other people can weigh in with their own confidence levels and see how accurate you’ve been.

(This site isn’t new to rationalists: Eliezer and the LessWrong community noticed it a couple years ago, and LessWrong’er Gwern has been using it to – among other things – track Intrade predictions.)

Since I don’t know who’s using the site and how, I don’t know how seriously to take the following numbers. So take this chart with a heaping dose of salt. But I’m not surprised that the confidences entered are higher than the likelihood of being right:

Predicted Certainty   50%   60%   70%   80%   90%   100%   Total
Actual Certainty      37%   52%   58%   70%   79%    81%
Sample Size           350   544   561   558   709    219    2941
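A calibration table like this boils down to a simple bucketing computation: group predictions by stated confidence, then check what fraction actually came true. A minimal sketch in Python, using made-up prediction records for illustration:

```python
from collections import defaultdict

def calibration_table(predictions):
    """Group (stated_confidence, was_correct) records by confidence
    level and report the fraction that actually came true."""
    buckets = defaultdict(list)
    for confidence, correct in predictions:
        buckets[confidence].append(correct)
    return {
        conf: (sum(results) / len(results), len(results))
        for conf, results in sorted(buckets.items())
    }

# Hypothetical records: (stated confidence, whether the prediction held)
records = [(0.9, True), (0.9, True), (0.9, False), (0.9, True),
           (0.6, True), (0.6, False)]
print(calibration_table(records))
# 0.9 bucket: 75% correct over 4 predictions; 0.6 bucket: 50% over 2
```

A well-calibrated predictor’s actual fraction in each bucket would match the stated confidence; the gaps are the calibration error.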

Sometimes the miscalibration matters more than others. In Mistakes Were Made (but not by me), Tavris and Aronson describe the overconfidence police interrogators feel about their ability to discern honest denials from false ones. In one study, researchers selected videos of police officers interviewing suspects who were denying a crime – some innocent and some guilty.

Kassin and Fong asked forty-four professional detectives in Florida and Ontario, Canada, to watch the tapes. These professionals averaged nearly fourteen years of experience each, and two-thirds had had special training, many in the Reid Technique. Like the students [in a similar study], they did no better than chance, yet they were convinced that their accuracy rate was close to 100 percent. Their experience and training did not improve their performance. Their experience and training simply increased their belief that it did.

As a result, more people are falsely imprisoned as prosecutors steadfastly pursue convictions for people they’re sure are guilty. This is a case in which poor calibration does real harm.

Of course, it’s often a more benign issue. Since finding PredictionBook, I see everything as a prediction to be measured. A coworker and I were just discussing plans to have a group dinner, and had the following conversation (almost word for word):

Her: “How do you feel about squash?”
Me: “I’m uncertain about squash…”
Her: “What about sauteed in butter and garlic?”
Me: “That has potential. My estimation of liking it just went up slightly.”
*Runs off to enter prediction*

I’ve already started making predictions in hopes that tracking my calibration errors will help me correct them. I wish PredictionBook had tags – it would be fascinating (and helpful!) to know that I’m particularly prone to misjudge whether I’ll like foods or that I’m especially well-calibrated at predicting the winners of sports games.

And yes, I will be using PredictionBook on football this season. Every week I’ll try to predict the winners and losers, and see whether my confidence is well-placed. Honestly, I expect to see some homer-bias and have too much confidence in the Ravens.  Isn’t exposing irrationality fun?

De Finetti’s Game: How to Quantify Belief

What do people really mean when they say they’re “sure” of something? Everyday language is terrible at describing actual levels of confidence – it lumps together different degrees of belief into vague groups which don’t always match from person to person. When one friend tells you she’s “pretty sure” we should turn left and another says he’s “fairly certain” we should turn right, it would be useful to know how confident they each are.

Sometimes it’s enough to hear your landlord say she’s pretty sure you’ll get towed from that parking space – you’d move your car. But when you’re basing an important decision on another person’s advice, it would be better to describe confidence on an objective, numeric scale. It’s not necessarily easy to quantify a feeling, but there’s a method that can help.

Bruno de Finetti, a 20th-century Italian mathematician, came up with a creative idea called de Finetti’s Game to help connect the feeling of confidence to a percent (hat tip Keith Devlin in The Unfinished Game). It works like this:


Suppose you’re half a mile into a road trip when your friend tells you that he’s “pretty sure” he locked the door. Do you go back? When you ask him for a specific number, he replies breezily that he’s 95% sure. Use that number as a starting point and begin the thought experiment.

In the experiment, you show your friend a bag with 95 red and 5 blue marbles. You then offer him a choice: he can either pick a marble at random and, if it’s red, win $1 million. Or he can go back and verify that the door is locked and, if it is, get $1 million.

If your friend chooses to draw a marble from the bag, he prefers the 95% chance to win – so his real confidence that he locked the door must be somewhere below that. So you play another round, this time with 80 red and 20 blue marbles. If he would rather check the door this time, his confidence is higher than 80%, and perhaps you try an 87/13 split next round.

And so on. You keep offering different deals in order to home in on the level where he feels equally comfortable selecting a random marble and checking the door. That’s his real level of confidence.
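The narrowing process is just a binary search over confidence levels. A small Python sketch, with a hypothetical player whose true (unspoken) confidence happens to be 87%:

```python
def definetti_game(prefers_marbles, lo=0.0, hi=1.0, tol=0.01):
    """Narrow in on the confidence level at which the player is
    indifferent between drawing a marble and betting on the belief.

    prefers_marbles(p) -> True if the player would rather draw from
    a bag whose fraction of red marbles is p than bet on the belief.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if prefers_marbles(mid):
            hi = mid   # belief confidence is below the offered odds
        else:
            lo = mid   # belief confidence is above the offered odds
    return (lo + hi) / 2

# A player whose true confidence is 0.87 prefers the marbles
# exactly when the bag offers better than 87% odds:
estimate = definetti_game(lambda p: p > 0.87)
print(round(estimate, 2))  # 0.87
```

In practice the “oracle” is a person’s gut reaction to each offer rather than a clean function, so the answer is fuzzier than the code suggests – but the logic of the search is the same.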


The thought experiment should guide people through the tricky process of connecting their feeling of confidence to a corresponding percent. The answer will still be somewhat fuzzy – after all, we’re still relying on a feeling that one option is better than another.

It’s important to remember that the game doesn’t tell us how likely we are to BE right. It only tells us about our confidence – which can be misplaced. From cognitive dissonance to confirmation bias there are countless psychological influences messing up the calibration between our confidence level and our chance of being right. But the more we pay attention to the impact of those biases, the more we can do to compensate. It’s a good practice (though pretty rare) to stop and think, “Have I really been as accurate as I would expect, given how confident I feel?”

I love the idea of measuring people’s confidence (and not just because I can rephrase it as measuring their doubt). I just love being able to quantify things! We can quantify exactly how much a new piece of evidence is likely to affect jurors, how much a person’s suit affects their persuasive impact, or how much confidence affects our openness to new ideas.

We could even use de Finetti’s Game to watch the inner workings of our minds doing Bayesian updating. Maybe I’ll try it out on myself to see how confident I feel that the Ravens will win the Superbowl this year before and after the Week 1 game against the rival Pittsburgh Steelers. I expect that my feeling of confidence won’t shift quite in accordance with what the Bayesian analysis tells me a fully rational person would believe. It’ll be fun to see just how irrational I am!
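The Bayesian benchmark I’d be comparing my gut against is a one-line computation. A sketch with invented numbers for the Ravens scenario (the prior and both likelihoods are made up purely for illustration):

```python
def bayes_update(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Posterior probability of a hypothesis after observing evidence."""
    numerator = prior * p_evidence_given_h
    return numerator / (numerator + (1 - prior) * p_evidence_given_not_h)

# Hypothetical numbers: prior of 20% that the Ravens win the Superbowl;
# a Week 1 win over the Steelers is, say, 80% likely if they're
# championship-caliber and 50% likely otherwise.
posterior = bayes_update(0.20, 0.80, 0.50)
print(round(posterior, 3))  # 0.286
```

If my felt confidence after Week 1 lands far from the computed posterior, that gap is the irrationality being measured.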

Video Short: Sentient Meat

One of my earlier posts quoted the Terry Bisson “Sentient Meat” short story – well, it turns out that there’s a film short based on the story!

Great exchange:

And the ones who have been aboard our vessels, the ones you have probed? You’re sure they won’t remember?

They’ll be considered crackpots if they do. We went into their heads and smoothed out their meat so that we’re just a dream to them.

A dream to meat! How strangely appropriate, that we should be meat’s dream.

I liked the feeling of surrealism surrounding something we take for granted every day: we are thinking, loving, dreaming meat.

And yes, that IS Ben Bailey of Cash Cab.

Happy Tau Day!

I almost missed the chance to promote Tau Day! Many of you probably know about Pi Day, held on March 14th. At my high school we used to bring in pies to the math room and eat them at 1:59PM in a glorious (and delicious) celebration of mathematics. But the inimitable Vi Hart lobs an objection: using Pi often doesn’t make as much sense as using Tau, the ratio of the circumference of a circle over its RADIUS.

Thus, we need a new day in celebration of the more-useful Tau:

Seeing as Tau is approximately 6.28 and today is June 28th, have yourself a great Tau Day and enjoy two pi(e)s! While you’re eating, you can go check out more of Vi Hart’s work – she does a fantastic job showing how much fun math can be. We need more voices like hers, and I’ll be sure to post more of her videos!

Game Theory and Football: How Irrationality Affects Play Calling

Coaches and coordinators in professional football get paid a lot of money to call the right plays – not just the best plays for particular situations, but also unpredictable plays that will catch the other team off guard. It’s a perfect setup for game theory analysis!

As in other game theory situations, the best play depends in part on what your opponent does. Your running play is much more likely to succeed against a pass-prevent defense, but would be in trouble against a run-stuffing formation. If the defense can guess what you’re going to call, they can adjust accordingly and have an advantage. Even on 3rd down and long – a common passing situation – there’s value in calling a percent of running plays, because the defense is less likely to be geared toward stopping that. But as you do it more, the chance of catching the defense off guard gets smaller. There’s some optimal balance where the expected success of a surprising run is equal to the expected success of a more sensible (but anticipated) pass.

The goal is to stay unpredictable and exploit patterns where your opponent is using a sub-optimal combination. If a team notices that passing plays are working better, they’ll be more likely to call them. As the defense notices, they’ll shift away from their run-defense and focus more on defending passes. In theory, the two teams reach an equilibrium.
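That equilibrium is a mixed strategy, and for a 2x2 game it can be computed directly: the offense’s run probability is whatever makes the defense indifferent between its two responses. A sketch using made-up expected-yard payoffs (the numbers are purely illustrative):

```python
def offense_equilibrium_mix(payoffs):
    """For a 2x2 zero-sum game, find the offense's run probability q
    that leaves the defense indifferent between its two responses.

    payoffs[offense_play][defense_focus] = expected yards gained.
    """
    rr = payoffs["run"]["stop_run"]
    rp = payoffs["run"]["stop_pass"]
    pr = payoffs["pass"]["stop_run"]
    pp = payoffs["pass"]["stop_pass"]
    # Indifference: q*rr + (1-q)*pr == q*rp + (1-q)*pp
    return (pp - pr) / ((rr - pr) - (rp - pp))

# Hypothetical payoffs for a 3rd-and-long situation:
payoffs = {
    "run":  {"stop_run": 1.0, "stop_pass": 8.0},
    "pass": {"stop_run": 6.0, "stop_pass": 4.0},
}
q = offense_equilibrium_mix(payoffs)
print(round(q, 3))  # 0.222 -- run on about 22% of plays
```

With these numbers the surprising run is worth calling just often enough (about 22% of the time) that the defense gains nothing by keying on either play.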

In practice, it doesn’t quite work that perfectly – human beings are making the decisions, and humans are both vulnerable to cognitive biases and notoriously bad at mimicking true unpredictability. Brian Burke, a fellow fan of combining sports with statistics, was poring over the play-calling data for second downs and noticed something odd:

There’s a strange spike in percent of running plays called at 2nd and 10! Tactically, 2nd and 10 isn’t all that different from 2nd and 9 or 11, so it’s strange to see such a difference. Why would they call so many more running plays in that particular situation?

The key is to realize that there are two ways a team tends to find itself facing a 2nd and 10 situation – runs that happen to go nowhere or any incomplete pass. Of those, incomplete passes are far more common. So in cases of 2nd and 10, it’s most often because the team just failed a passing play. That suggests two reasons coaches might be irrationally switching to running plays, even at the cost of sacrificing unpredictability:

(1) The hasty generalization bias (also called the small sample bias) and the recency effect are cognitive biases in which people overgeneralize from a small amount of data, especially recent data. Failed passes are very common (about 40% fail), so there’s no good reason for a coach to treat any single failed pass as evidence that they’d be better off switching to a running play. But the urge to overreact to the failed pass that just happened is strong, thanks to these two biases.

(2) People are terrible at generating unpredictability — when asked to make up a “seemingly-random” sequence of coin flips, we tend to use far more alternation between Heads and Tails than would actually occur in a real sequence of coin flips. So even if coaches weren’t overreacting to a failed pass, and they were simply trying to be unpredictable, they would still tend to switch to a running play after a passing play more often than random chance would dictate.
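The alternation tendency is easy to demonstrate in simulation. A sketch comparing a truly random play-caller with one who switches plays 62% of the time (a made-up figure for illustration, not Brian Burke’s measured number):

```python
import random

def alternation_rate(seq):
    """Fraction of consecutive calls that switch between run and pass."""
    switches = sum(a != b for a, b in zip(seq, seq[1:]))
    return switches / (len(seq) - 1)

random.seed(0)

# Truly random play calling alternates about half the time...
coin = [random.choice("RP") for _ in range(100_000)]
print(round(alternation_rate(coin), 2))  # ~0.50

# ...while a coach who switches 62% of the time is exploitable:
biased = ["R"]
for _ in range(99_999):
    if random.random() < 0.62:
        biased.append("P" if biased[-1] == "R" else "R")
    else:
        biased.append(biased[-1])
print(round(alternation_rate(biased), 2))  # ~0.62
```

A defense that noticed the second pattern could shade its calls toward whichever play the offense didn’t just run.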

Indeed, when Brian separated the data by previous play, the alternation trend is clear — passes are more likely after runs, and runs are more likely after passes:

(My favorite team, the Baltimore Ravens, was pretty bad about this under the previous regime of Coach Billick.)

Brian concludes:

Coaches and coordinators are apparently not immune to the small sample fallacy. In addition to the inability to simulate true randomness, I think this helps explain the tendency to alternate. I also think this is why the tendency is so easy to spot in the 2nd and 10 situation. It’s the situation that nearly always follows a failure. The impulse to try the alternative, even knowing that a single recent bad outcome is not necessarily representative of overall performance, is very strong.

So recency bias may be playing a role. Recent outcomes loom disproportionately larger in our minds than past outcomes. When coaches are weighing how successful various play types have been, they might be subconsciously over-weighting the most recent information – the last play. But regardless of the reasons, coaches are predictable, at least to some degree.

Coaches are letting irrational biases influence their play calling, pulling them away from the optimal mix. The result, according to Pro Football Reference stats, is less success on those plays. I wonder how well a computer could call plays using a Statistical Prediction Rule…

Why Imagined Indulgence Helps Us Diet

What makes decadent waffles so damn satisfying in the morning? Is it the optimal balance of crispy and soft textures? The fat in the whipped cream? The sugar content? It turns out that there’s a factor beyond the actual food: your frame of mind. A team of researchers at Yale just performed a clever study and found that you feel fuller and more sated if you believe you just ate something indulgent.

As with most psychology experiments, the study involved lying to people. Subjects were given a milkshake on two separate occasions but were told that one contained a whopping 620 calories and the other had a more sensible 140 calories. In reality, both shakes were the same – right in the middle at 380 calories.

Before and after each test, the researchers monitored the subjects’ ghrelin levels as a measure of how satisfied they were. Ghrelin – the hormone which triggers hunger – increases and spikes before meals, then drops off after people eat. If the calorie content were all that mattered, there would be no difference in reactions to the two shakes. But there was:

Results: The mindset of indulgence produced a dramatically steeper decline in ghrelin after consuming the shake, whereas the mindset of sensibility produced a relatively flat ghrelin response. Participants’ satiety was consistent with what they believed they were consuming rather than the actual nutritional value of what they consumed.

What should we make of the finding (besides a continued fascination with the placebo effect)? For one thing, it reinforces the notion that our stomachs are very crude sense organs which aren’t precise or accurate at judging how much food they need.

For anyone trying to achieve (or maintain) a healthy weight, the dynamic makes it tougher to diet. The conscious decision to eat ‘sensible’ food motivates our bodies to demand more calories. What a frustrating situation!

Patrick at Discoblog toys with a creative solution:

It definitely suggests some new approaches to dieting, like berating yourself for eating celery sticks in an effort to make them seem more luxurious and satisfying. But it’s not clear if lying to yourself is as effective as having other people lie to you. And believing that you are constantly eating poorly might have other psychological side effects, one supposes.

I agree, it probably doesn’t work as well to lie to yourself (and nobody will be able to convince me that celery sticks are fatty treats). But we can draw a useful tactic that doesn’t require deception. Instead of applying the study’s findings when we eat light food, keep it in mind when eating dessert. Next time you want a rich slice of cheesecake, look up how many calories it has! According to the study, focusing on the fact that the slice has 50% of your recommended calories will make you feel more satisfied eating less of it.

What I’d like to see is a study that’s honest about the number of calories but emphasizes different ingredients to foster that ‘indulgent’ mindset. Would our bodies react differently to drinking a “300-calorie fruit milkshake” compared to the same one described as a “300-calorie shake with bananas, heavy cream, vanilla extract, and pure cane juice”?

If that works, we can help our friends and families by focusing attention on the fattiest, sweetest, and tastiest part of a dish. Next time Julia is willing to make those delicious-looking blintzes again, sign me up. I can eat one as she tells me about the heavy cream that went into the homemade ricotta.

[UPDATE] I’m looking a little deeper into what exactly was being measured. The abstract and researchers said “Participants’ satiety was consistent with what they believed they were consuming rather than the actual nutritional value of what they consumed.” But news sources report this as well:

The study also didn’t find that the larger drop in ghrelin in those who drank the indulgent shakes was accompanied by a larger drop in hunger levels, a finding that the researchers couldn’t fully explain. “We may not have used a reliable measure of hunger,” says Crum. “My sense is that hunger levels should have changed.”

I had assumed that the participants’ satiety was the same as their remaining hunger – but those two quotes seem at odds at first glance.

The Game Theory of Story Endings

Do happy endings really make you as happy if you see them coming a mile away? When we watch a trashy action flick or a fluffy romantic comedy, aren’t the conflicts less interesting because we know it’ll all end happily ever after? Someone has to bite the bullet and write a sad ending to give plausibility to the threat of unhappiness. It’s disincentivized because sad endings are more challenging and risk upsetting the audience, but someone has to do it.

Steven E. Landsburg muses about this in The Armchair Economist:

I am intrigued by the market for movie endings. Movie-goers want two things in an ending: They want it to be happy and they want it to be unpredictable. There is some optimal frequency of sad endings that maintains the right level of suspense. Yet the market might fail to provide enough sad endings.

An individual director who films a sad ending risks short-term losses, as word gets around that the movie is “unsatisfying.” It is true that there are long-term gains, as viewers are kept off their guard for future movies. Unfortunately, most of those gains may be captured by other directors, because movie-goers remember only that the murderer does sometimes catch up with the heroine in the basement, and do not remember that it happens only in movies with particular directors. Under these circumstances, no individual director may be willing to incur costs for his rivals’ benefit.

A solution is for directors to display their names prominently, so that viewers know when a movie was made by someone unpredictable. Viewers, however, may find it in their interests to retaliate by covering their eyes when the director’s name is shown.

If you can be associated more strongly with unpredictability, you reap more benefits. You’re also more strongly associated with the unhappy ending, which might turn audiences away.

One way to ease the blow of an unexpected sad ending is to make deaths triumphant, defiant, or heroic. Think of how Spock died in The Wrath of Khan (no, I’m not going to give a spoiler alert for a 30-year-old movie). Sure, people die in Star Trek all the time – when Kirk, Spock, and fresh-faced, red-shirted Ensign Jimmy beam down to explore a planet for life, we all know one of them isn’t going to make it back. But to kill a main character is more significant. And it was done in a touching way. They got the unpredictability without upsetting their audience.

I genuinely respect Joss Whedon for his willingness to throw curve balls like this in his story lines. He’s developed a reputation for having sympathetic characters die, leave, or change sides – often without warning. Rather than watching Buffy, Firefly and Serenity thinking “So, how is it all going to work out this time?” we’re forced to think “Is it going to work out this time?”

TV Tropes has a name for all this – Anyone Can Die:

This is where no one is exempt from being killed, including the main characters (maybe even the hero). The Sacrificial Lamb is often used to establish the writer’s Anyone Can Die cred early on. However, if the Lamb’s death is a one-off with no follow-up, it’s just Killed Off for Real. To really be Anyone Can Die, the work must include multiple deaths, happening at different points in the story. Bonus points if the death is unnecessary and devoid of Heroic Sacrifice.

In game theory situations, reputation plays a large role. TV Tropes mentions building an ‘Anyone Can Die’ cred, which can be achieved through repeated interactions. In a TV series or multiple films by the same director, you get a feel for whether the good guys always prevail. But even within a single story, early and repeated signaling can make the remainder of the plot more intense. When a major character is killed off without it being a Heroic Sacrifice, that’s a powerful signal that anything can happen. The musical Into the Woods will always have a special place in my heart for mastering this dynamic.

But there’s another route. Historical dramas can increase society’s perception of “sadness plausibility” without anyone taking a hit for being a downer. Nobody’s going to feel unsatisfied that Titanic, The Great Escape, or Butch Cassidy and the Sundance Kid have sad endings. (Or if they do, they can take it up with reality for writing a depressing script.) It’s not easy to keep those separate in our brains; we just get the overall sense that sometimes stories have sad endings. And that perception helps us enjoy all the other movies we watch.

All Wikipedia Roads Were Forced to Philosophy

Does everything boil down to philosophy? A case could be made that it’s really all about math and science. Or perhaps breasts. In the alt-text of Wednesday’s XKCD comic, a specific challenge was made: “Wikipedia trivia: if you take any article, click on the first link in the article text not in parentheses or italics, and then repeat, you will eventually end up at ‘Philosophy’.”

Game on. I already had a tab open to the Wikipedia page for “Where Mathematics Comes From” and decided to see how long it took:

Where Mathematics Comes From

  1. George Lakoff
  2. Cognitive linguistics
  3. Linguistics
  4. Human
  5. Taxonomy
  6. Science
  7. Knowledge
  8. Fact
  9. Information
  10. Sequence
  11. Mathematics
  12. Quantity
  13. Property (Philosophy)
  14. Modern Philosophy
  15. Philosophy

Ok, maybe that one was too easy. Let’s use my go-to example: Waffles:

Waffle

  1. Batter (cooking)
  2. Flour
  3. Powder
  4. Solid
  5. State of matter
  6. Phase (matter)
  7. Outline of physical science
  8. Natural science
  9. Science
  10. Knowledge
  11. Fact
  12. Information
  13. Sequence
  14. Mathematics
  15. Quantity
  16. Property (Philosophy)
  17. Modern Philosophy
  18. Philosophy

Well son of a gun. I’ve tried it with ‘Mongoose’ (11 clicks), Baltimore Ravens football player ‘Ed Reed’ (16 clicks), and ‘Lord of the Rings’ (22 clicks). All led back to Philosophy.

Well, we don’t END at philosophy – we could keep going. It turns out that (as of writing this) there’s a 19-step loop including philosophy, science, mathematics, and mammary gland.

We could just as easily say that all paths lead to science! Or math! Or breasts!
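The first-link rule turns Wikipedia into a function from each page to a single successor, so every chain must eventually enter a cycle. A toy sketch of the chain-following on a hypothetical miniature link graph (the entries are invented, not scraped):

```python
def follow_first_links(graph, start):
    """Follow each article's first link until a page repeats,
    returning the full path and the cycle it ends in."""
    path, seen = [start], {start}
    while graph[path[-1]] not in seen:
        path.append(graph[path[-1]])
        seen.add(path[-1])
    cycle_start = graph[path[-1]]
    return path, path[path.index(cycle_start):]

# A hypothetical miniature first-link graph:
graph = {
    "Waffle": "Batter",
    "Batter": "Flour",
    "Flour": "Science",
    "Science": "Knowledge",
    "Knowledge": "Philosophy",
    "Philosophy": "Science",   # the loop
}
path, cycle = follow_first_links(graph, "Waffle")
print(path)   # ['Waffle', 'Batter', 'Flour', 'Science', 'Knowledge', 'Philosophy']
print(cycle)  # ['Science', 'Knowledge', 'Philosophy']
```

Whichever page in the cycle you name the “destination” is arbitrary – which is exactly why all roads can be said to lead to philosophy, or science, or math.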

However, before you get too excited, it turns out there’s some mischief afoot.

First, it’s been two days since the XKCD comic went up, and considering how malleable Wikipedia is, some things have been changed. I was suspicious that Quantity’s first link went to Property (Philosophy) so I checked the history page:

# (cur | prev) 09:54, 25 May 2011 99.186.253.32 (talk) (14,042 bytes) (Edited for xkcd)

# (cur | prev) 09:28, 25 May 2011 146.162.240.242 (talk) (14,004 bytes) (Undid revision 430815864 by Antony-22 (talk) see today’s xkcd, without the “property” link, it breaks the “all pages eventually end up at philosophy ” game. The link should be there)

I actually found a small loop: Male leads to Gender, which leads back to Male. I expect the Male one will be “fixed” at some point (a phrase you don’t hear outside the veterinarian very often).

The philosophy topic showed up in the XKCD forums last Sunday, and the idea was around for longer than that. Tricky editing has been going on toward this goal for a while.

I’d heard that philosophy leads to reason, which leads to rationality, which leads back to philosophy. That’s been changed since Wednesday, and I wonder if a deliberate effort moved the path from rationality to breasts.

Yes, it’s true that the first link on an article is likely to be broad and trend toward science/philosophy, but this isn’t unguided evolution. This is intelligently designed.

When Literal Honesty Goes Awry

When is it NOT appropriate to bluntly speak the truth? We’ve all heard someone be insulting and resort to the defense of “Well, it’s true!” Even boring, inoffensive facts can become offensive if brought up inartfully. I think this is a perfect example, illustrated by the hilarious comedy team of David Mitchell and Robert Webb:

I mean, technically it’s true. The literal fact that “anyone we know is unlikely to be the most attractive person on earth” shouldn’t hurt feelings. Nobody should think that much of themselves!

…And yet, it’s rude to say. Why?

I think that’s because nobody took Robert’s original statement “this is the most beautiful woman in the world” at its face value. It violated the maxim of quality – the literal meaning was clearly false so people look for alternative interpretations (“She’s beautiful and I love her” or “She’s very attractive in a combination of ways”).

Since nobody took it seriously at face value, challenges to the claim are perceived as challenging the alternate interpretations rather than the literal meaning. The very decision to call attention to it makes a statement. Why would David be so motivated to discuss her beauty unless he strenuously disagreed? So, in essence, he’s saying “No, she’s not very beautiful.”

Yes, David’s literal content is true: she’s not the most beautiful person in the world. But so much of our reaction to a statement is really a reaction to its implied meaning, and it’s tough to get around that. Initial gut reactions can be powerful.

But it’s possible to do it right. I love having the opportunity to share the awesome and incredible Tim Minchin song If I Didn’t Have You:

Somehow, when Tim does it, the honest approach works better. People often claim that they DO have a soul mate, so it isn’t automatically interpreted as a figure of speech for something more casual.

But the way he addresses the literal meanings is particularly important. Compare “I don’t think you’re special. I mean, I think you’re special but not off the charts” with “I don’t think you’re special. I mean, I think you’re special but you fall within a bell-curve.” It’s a strange enough statement to make people think about it harder and realize he’s not being snide.

I found myself thinking of something Steven Pinker wrote in The Stuff of Thought:

The incongruity in a fresh literary metaphor is another ingredient that gives it its pungency. The listener resolves the incongruity soon enough by spotting the underlying similarity, but the initial double take and subsequent brainwork conveys something in addition. It implies that the similarity is not apparent in the humdrum course of everyday life, and that the author is presenting real news in forcing it upon the listener’s attention.

Pinker was writing about using new metaphors to emphasize non-literal meaning, but it works the other way as well. Fresh phrasings – in this case gloriously nerdy ones – make listeners pay more attention to parsing the intended meaning, metaphorical or literal.

If you’re worried about being misinterpreted, try a creative way of expressing the same thought. Protesting “But I was telling the truth!” won’t always be enough.

Quality Webcomic Nerdiness

I love it when comics are both intelligent and fun. One of the first things I do when I wake up is read webcomics – it makes for a good transition to consciousness. And today’s Saturday Morning Breakfast Cereal is particularly relevant to last month’s discussion:

First thought: that could totally have been my high school.
Actual first thought: Blurg… (read: Where’s my coffee)
Second thought: Wait a minute, 0^0 is a complicated question! It’s not necessarily 1!

But before I got too far up on my high horse about knowing that 0^0 is poorly defined, I looked at the scroll-over bonus panel:

And this is why I love SMBC.