### Transcript

For as long as we've had oral or written histories, we've had stories. And humans are remarkably good at telling each other stories and our brain has adapted to work well with this mode of communicating ideas. We remember things better when they're told to us as stories. And we're more likely to recall facts if they're presented as a story rather than as disconnected units.

In fact, we're so good at this that when we hear a good story, our brains are adept at placing us inside the story. We imagine ourselves experiencing the events as if they really happened to us. And neurological scans will bear that out too.

But there are other kinds of information we want to convey that don't fit into the mold of a story and that our brains haven't really evolved to be specialized at-- things like accurately assessing risk, drawing statistical conclusions, understanding who to trust, and correcting or counteracting our own biases.

By way of example of how stories might not always work, consider this short fragment. Alex loved his grandmother. In fact, he was on his way to visit her right now. Alex came bearing gifts, three of her favorite longtime foods from his kitchen-- tart blueberry muffins, freshly roasted coffee beans, and decadent cheesecake. He couldn't wait to see her and hear how her day had been.

Chances are good that if I ask you in 30 seconds what the grandmother's favorite foods were, you'll be able to remember that they were blueberry muffins, coffee, and cheesecake. Now, who here thinks that they would be able to remember those three things for 30 seconds when this slide goes away? Raise your hand if you feel pretty sure that you could remember those three foods. OK, pretty strong confidence. Don't worry. There won't be a quiz.

Now let's change the story a little bit. What if Alex's grandmother's favorite things were numbers instead of food? And we'll randomly generate some numbers that have the same number of characters as blueberry muffins, coffee beans, and cheesecake respectively. That's a little bit different, right?

Now raise your hand if you think you can remember those three numbers for 30 seconds after they disappear. OK. I just wanted to see if anyone was confident enough to say yes. So it turns out that stories aren't a universally good tool for representing information that we'd like to convey. This has sweeping implications for how we build systems, especially complex computer systems, because those systems are built by minds-- at least for now-- that are good at some things, like stories, and very bad at others.

In fact, it turns out that we're so bad at some kinds of things that we're very likely to frequently make the same kinds of mistakes over, and over, and over again. Unfortunately, it also turns out that many of the things that we're bad at are also things that we need to be good at in order to build complex systems and draw accurate conclusions from them.

So stories are good at telling us about predictable phenomena. And they can help us understand and provide conclusions that would otherwise take a long time for us to learn. And we're really good at using them.

Statistics, by contrast-- that's only really been around for about 100 to 200 years, a tiny blip on the timescale of human evolution. So what happens when the lessons from our stories are wrong or incomplete, or don't account for some edge case that's important? What happens when the need to communicate some idea doesn't quite fit the mold of a story? What kinds of statistical mistakes and errors are we making? And what kinds of storytelling traps are we falling into, all without even realizing it? What happens when we're not only not right but not even wrong?

Hi, I'm John. And that's what I'd like to talk to you about today. I want to help you recognize some of the most common statistical mistakes we make, especially if you participate in building or designing computer systems. And I'll give you some strategies for overcoming them or counteracting them.

I'll also give you some further reading that you can explore. And I'll send you the slides afterwards, and you can click on a bunch of links. And that way you can have a chance to implement some of these ideas directly in your own systems and on your own teams.

So I hope that by the end of this talk, you feel more self-aware of your collective limitations as human beings-- hopefully, everyone in this room is a human-- human beings who use statistics and are better equipped to counteract any biases or limitations we might have.

So I work with a lot of customers who are big enterprise companies. And let me hasten to reassure you that the size of a company doesn't necessarily reduce the chance that they'll make a very expensive statistical or arithmetic mistake.

For example, take this story from 1995, where Fidelity's fund manager accidentally omitted a minus sign when reporting how the accounting of the fund was going, turning a $1.3 billion loss into a $1.3 billion gain, which is an error of $2.6 billion. So one character difference made a much larger difference than might be expected by having a typo.

And beyond just costing money, these kinds of mistakes can also have serious ethical and personal ramifications, like this example from last year where the UK's new camera system mistakenly flagged hundreds of people for arrests that it shouldn't have. And they claimed that a 92% false positive rate is nothing to worry about.

So the stakes are high, I think, for getting this right. So let's talk about how we should balance storytelling and statistics. We can't overcome millennia of evolution with one 25-ish minute talk. But what we can do is leave you more aware of the kinds of storytelling and statistical mistakes people make and then give you some strategies that might help you recognize and compensate for those problems.

The crux of the matter is that we have both statistical and storytelling techniques and we use those to understand the world. And that's good. It's good to have tools in our toolbox. What's a problem is when we use stories to solve statistical problems and when we use statistics to solve story problems. What we'd like to do is use the right tool for the job.

We also want to avoid relying exclusively on either stories or statistics. We have to balance our use of those appropriately. Someone who relies exclusively on storytelling is going to miss important nuances buried under a blanket of generalities that might not apply well or to every situation, [INAUDIBLE] poor or erroneous decision-making when we want to talk about facts rather than stories.

Someone who relies exclusively on statistics is going to have trouble delivering compelling insights for themselves or stakeholders. That can lead to making it hard to change or create some outcome that you'd like to drive because we're generally more moved to action by stories that are told with statistics, especially if they make appeals to our emotions rather than just statistics alone.

So in order to achieve that balance, we have to be aware of what kinds of problems might be tipping our internal cognitive scale in one direction or the other without us realizing it. So let's talk about a few of the problems we might encounter.

So one problem facing us is that we use natural language, words with fuzzy meanings, to describe things that are, really, precise probability notions-- so words like improbable, or pretty sure, or unlikely. So to maybe illustrate this and to give you an example of why this might be a problem, let's take this instance.

So suppose I take 100 different laptop batteries of the same make and model and I run them down, and I see how long they last, and I record the results as shown here. So, zero batteries lasted less than one hour. 25 batteries lasted between one and two hours. 45 lasted between two and three. 22 lasted between three and four. Eight batteries lasted between four and five hours. And finally, no batteries lasted more than five hours.

Now, I'm going to ask you a question. And I'd like you to raise your hand at the first answer that you agree with. So, let's say I get another battery of the exact same make and model as the ones I was testing here. How many of you would say that it was unlikely that this battery would last at least one hour?

That makes sense. Almost no one's raising their hands right now because all of the batteries lasted at least one hour. How about now? How many of you would say that it was unlikely that the battery would last at least two hours? How about three hours? How about four hours? How about five hours?

OK. So it's a pretty wide dispersion of what people mean when they say unlikely. So some of you thought it meant a number very, very close to 0% probability. Some of you thought it meant a number around 10% probability. Some of you thought it meant about 30% probability. That's a huge swing in what we're describing.
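To put numbers on what each show of hands was implicitly claiming, here is a minimal Python sketch of the empirical chance that a battery lasts at least a given number of hours. It assumes the three-to-four-hour bucket holds 22 batteries, which makes the counts total 100 and matches the "about 30%" figure; the names `buckets` and `share_lasting_at_least` are made up for this illustration.

```python
# Battery lifetimes from the example, bucketed by the hour each range starts at.
# Assumption: the 3-4 hour bucket holds 22 batteries, so the counts total 100.
buckets = {0: 0, 1: 25, 2: 45, 3: 22, 4: 8, 5: 0}  # start hour -> number of batteries

total = sum(buckets.values())


def share_lasting_at_least(hours):
    """Fraction of the tested batteries that lasted at least `hours` hours."""
    survivors = sum(count for start, count in buckets.items() if start >= hours)
    return survivors / total


for hours in range(1, 6):
    print(f"at least {hours} h: {share_lasting_at_least(hours):.0%}")
```

Each threshold where someone raised a hand corresponds to one of these percentages, which is why "unlikely" ended up covering such a wide range.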

So that's a problem. We told ourselves a story. We used natural language to describe something that really had a precise measure around it. So imagine if we had the same level of disagreement about how many kilograms of fuel we should put on the rocket or the right amount of medicine to put in someone's IV bag. That would be really bad if we got that wrong by 30%.

So is it possible to go too far? How do we balance storytelling with statistics here? Maybe another example will make that clear.

Here's a real example from a client I worked with. Suppose you're an engineer on your team and your manager has asked you to look at your AWS spending to see how much money the team is burning through each week. They need to start putting together weekly projections for how much the team is going to consume the next week, and they want to know what their budget should be.

So you look at the current spending and you see values like this where longer bars are bigger values. And your manager looks at you and says, what if I just ask for the same amount of money that we used last week? Would that work? I don't want to ask for more money than I need to because I have my own OKRs and metrics that I get measured on and I'd prefer to keep my spending as low as possible.

How good should you feel about that manager's estimate? How confident are you that this will be enough money? Well, one way of thinking about it is you might feel pretty good about this because if you look at the past, what already happened, you would say, well, four out of those five weeks, we didn't exceed that budget. So that's an 80% chance that we won't exceed the budget in the future. So I feel pretty good about this, about picking that value.

So this is a little bit of a trap because if you use that same logic and the numbers look like this instead, would your confidence change? So here, 80% of the week-by-week numbers are still below the proposed budget just like the earlier example. But the numbers swing much wider. There's a much higher variance in what they are relative to each other. So would you still feel good about that estimate? How good should we feel about any given estimate?

We need to make sure that when we talk about estimates, we're using the right technique to do so. If our manager asked us for a number, it's good to make sure that we understand, in relatively precise terms, how sure we are about that number. If you're estimating, say, story points, you might provide a specific number. But is there an uncertainty associated with that number that you can provide? Is there some value you can provide that includes an estimate of how sure you are? That helps if you're telling a story because it helps others understand how reliable that story is.

One popular statistical measure to measure how confident we are about some number is to talk about our statistical confidence. How sure are we that some estimate we provide is a good estimate? So there's lots of ways to do complicated statistical analysis when you have lots of data and to feel good about those results.

We don't have lots of data in this example, though. We just have five data points. So what can we do? Well, statistics are really just measurements about data. What are some statistics that we might use to help us understand what the shape of this data looks like?

Two that people reach for a lot to help them understand sets like this are the sample mean and the sample standard deviation. To get the mean, which we write with an X-bar here, that little funny bar over the top-- we take the sum of all the values we have and then we divide by the number of values that we have. And that's how we get the mean.

You might sometimes see that written like this where there's a funny-looking E. That's the Greek letter sigma, which is conventionally taken to mean the sum of these values. So for the data set that we have here, that's what the average looks like. That's what the mean looks like. It's the length of that blue bar.

So if we picked this for our budget, this is the number that minimizes the overall error relative to the values in the sample. So the mean is special in that way. It's the value that minimizes the total squared length of those red lines.

You can see this if I pick some other number besides the mean. Let's say I move this dotted line to the right. Well, the error for week three improved. But the other four weeks also changed by the same amount. And the total length of all the red lines increased so our error got worse. So [INAUDIBLE] that mean is the number that minimizes that error. So that gives us a little bit of a picture of things. But unfortunately, while the mean describes a central value for some data set, it doesn't really tell us how widely dispersed that data is.
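A quick numerical check of that claim, as a hedged sketch: the weekly figures below are hypothetical, since the slide's actual values are not in the transcript. Strictly speaking, the quantity the mean minimizes is the sum of squared distances; the sum of plain (absolute) distances is minimized by the median, so the sketch reports both.

```python
from statistics import mean, median

# Hypothetical weekly spend figures (not the values from the slide).
weekly_spend = [120.0, 135.0, 150.0, 160.0, 410.0]


def squared_error(guess, values):
    """Sum of squared distances between one guess and every value."""
    return sum((v - guess) ** 2 for v in values)


def absolute_error(guess, values):
    """Sum of plain distances, i.e. the total length of the lines."""
    return sum(abs(v - guess) for v in values)


m = mean(weekly_spend)
med = median(weekly_spend)

# Moving the guess away from the mean in either direction raises the squared error.
for guess in (m - 20, m, m + 20):
    print(f"guess={guess:6.1f}  squared error={squared_error(guess, weekly_spend):9.1f}")

# The median does at least as well as the mean on plain distances.
print(f"absolute error at mean:   {absolute_error(m, weekly_spend):.1f}")
print(f"absolute error at median: {absolute_error(med, weekly_spend):.1f}")
```

With one unusually expensive week in the data, the mean and the median sit far apart, which is a first hint that a single central value does not describe this kind of data well.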

So consider these two sets of numbers-- 0, 1, 99, 100. That's one set. And 48, 49, 51, 52. That's another set. And I'll put these on a number line here so we have a visual of what they look like.

Both of these data sets have a mean of 50. But the second set of numbers is much more tightly packed than the first set. So if our AWS spending is very widely distributed, we might need more padding to feel good about telling our manager what we've picked. So these two data sets look very different but they have the same statistic, they have the same mean. So that's not a good, complete picture of the data.

One way to get a picture of how widely distributed a particular data set is is to take the sample standard deviation. So to compute the sample standard deviation and figure out how widely distributed those values are, there's this formula. So that little funny Greek letter, that's a lowercase sigma, not to be confused with the uppercase sigma from before.

So first, we're going to take the difference between each value and the mean, which you've already computed. We're going to square it. So let's say we had 56 as one of our values and our mean was 50. We would take 56 minus 50. Here we'd square it. So the 6 would become 36.

Then we add that up for all the values that we have. So we do that difference each time. And then we divide by n minus one. And we take the square root of the whole thing. So that gets us the standard deviation.

If we do that on the two data sets that we described before, you can see that their standard deviations are very, very different from each other. Now we can see that the standard deviation is much bigger than the mean is for the top data set and much smaller than the mean is for the bottom data set. That gives us a hint that the first data set is more widely dispersed than the second one.
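The recipe just described (subtract the mean, square, sum, divide by n minus one, take the square root) can be sketched in a few lines of Python and applied to the two example sets. The function names are made up for this illustration.

```python
from math import sqrt


def sample_mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)


def sample_std(values):
    """Sample standard deviation: square root of the summed squared
    differences from the mean, divided by n - 1."""
    m = sample_mean(values)
    squared_diffs = sum((v - m) ** 2 for v in values)
    return sqrt(squared_diffs / (len(values) - 1))


wide = [0, 1, 99, 100]
tight = [48, 49, 51, 52]

for name, data in (("wide", wide), ("tight", tight)):
    print(f"{name}: mean={sample_mean(data):.1f}  std={sample_std(data):.2f}")
# wide:  mean=50.0  std=57.16
# tight: mean=50.0  std=1.83
```

Both sets report the same mean, and only the standard deviation reveals how differently they are spread.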

Going back to our AWS data, we see here that our data is pretty widely dispersed. The standard deviation is pretty close to the size of the mean. So that's a hint that we shouldn't have a lot of confidence if we picked a number close to the mean for our budget. We probably need to pick a bigger number-- but how much bigger?

So maybe you've seen a curve like this before called the bell curve or, more formally, a Gaussian function. A data scientist will tell you that this curve describes what's called the normal distribution. Many kinds of data are described well by a distribution that looks like this, like the height of people or the lifespan of the batteries that we saw before.

Now, one of the nice properties about the normal distribution is that it's described completely by knowing the mean and standard deviation. So if you know those two things and your data is normal or comes from a normal distribution, then you can draw a lot of useful conclusions. And we do know what those are because we just computed them. So if we go on to assume that our AWS spending is normally distributed, then we could use that to try to estimate how likely it is that we have enough budget.

So here, this data is centered around the mean of the data that we calculated earlier. And we can also label where one standard deviation away from the mean on either side would be. And we can do that for two and three standard deviations. And if our data is normal-- and remember that we're assuming that it is-- then we can use a very nice property of normal distributions.

You can say, I'd like to have a certain percent certainty that I won't exceed my budget. Let's say we want to pick 85%. So it turns out that for normal distributions, about 68% of the values fall within one standard deviation. About 95% fall within two. And about 99.7% fall within three.

There's a technique called z-scoring that you can use to compute exactly which value you should pick. But since we picked 85%, we can use a small observation. Since about 68% of the values fall within one standard deviation, that means that 32% don't fall within one standard deviation, so that's about 16% on either side.

So if we take the 16% on the left and we combine it with the 68% in the middle, that gives us about 85%. So if we set our budget at about one standard deviation to the right of our mean, we should have about 85% confidence that the budget for next week won't exceed this number.
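The shortcut above can be checked with Python's standard-library `statistics.NormalDist`. This is only a sketch under the same assumption the talk makes at this point, namely that the data really is normal; the mean and standard deviation passed to the hypothetical helper `budget_for` are invented numbers, not the ones from the slide.

```python
from statistics import NormalDist

standard = NormalDist(mu=0.0, sigma=1.0)

# Share of values within 1, 2, and 3 standard deviations of the mean.
for k in (1, 2, 3):
    within = standard.cdf(k) - standard.cdf(-k)
    print(f"within {k} sd: {within:.1%}")

# Chance of landing at or below "mean + 1 sd" (the left tail plus the middle).
below_one_sd = standard.cdf(1)
print(f"at or below +1 sd: {below_one_sd:.1%}")

# The z-score that gives exactly 85% is a little more than one.
z85 = standard.inv_cdf(0.85)
print(f"z-score for 85%: {z85:.3f}")


def budget_for(confidence, mean, sd):
    """Budget a normal model says will not be exceeded with the given
    probability. Only meaningful if the data is actually normal."""
    return NormalDist(mean, sd).inv_cdf(confidence)


# Hypothetical weekly mean of 1000 and standard deviation of 400.
print(f"85% budget: {budget_for(0.85, 1000, 400):.0f}")
```

The formula returns a number for any mean and standard deviation you give it, so it cannot tell you whether the normality assumption was reasonable in the first place.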

So if we set our budget there, we think we have an 85% chance of staying within the budget. That seems reasonable. So you pick that estimate and you report that value to your boss. Remember, this is a real story.

Unfortunately, it turns out that November 24 is the week of Thanksgiving. And more importantly for this example, it contains Black Friday. So the actual AWS spending as compared to the projected AWS spending is about 500% over. So your AWS spending skyrocketed in this example and it's much, much higher than your estimate. And you get called into the VP's office to explain how you could have forgotten that Black Friday occurred.

So what went wrong here? Often, people working on a team are asked to estimate things like how much budget they need, or how many story points something is, or how long a project will take. So we'll have some data about the problem at hand. But often, it will be limited or noisy in some way. What should we do when we want to provide an estimate?

Well, there are lots of statistical techniques you can use. But we can't ignore the real-world context like the fact that it was Black Friday in this example. We have to be sure that the assumptions we're making are reliable.

In this case, one of the other things that tripped us up is that spending isn't typically normally distributed. It's extremely improbable that your spending, for a growing company, is going to stay within a normal distribution or that it will be described well by one. So for any given business, that kind of data is going to be highly seasonal. Probably, for a growing business, it's going to increase over time. And that's going to really affect what the distribution of that data looks like and it probably won't be normal.

Moreover, you already have way better tools and technical tooling to measure what your [INAUDIBLE] consumption looks like than resorting to statistics. You can run your own experiments and projections based on users, and visits, and requests per second rather than statistics. Statistics is going to help you tell a certain story but you have other tools that can help you tell a better story with a much clearer answer than running a statistical test, in this case. Or it will at least give you a better starting point to work from.

So the danger of relying on these kinds of assumptions is that you'll still get answers out of the formulas. You'll still get an answer about what the mean or the standard deviation of this is and what percent confidence you should have for a particular z-score. But that will work regardless of what the actual underlying data is. So if you don't check your work against that, you're going to have problems.

We also don't want to rely exclusively on statistics to tell a story. Consider this group of 100 randomly-placed data points on 0 to 100 on both axes. These have the following statistics for the X and Y means and the X and Y standard deviations.

Now, consider this graph, which is also 100 points but arranged in the shape of a T-Rex. It has these statistics. So there's the two graphs side by side. They look pretty different from each other but they have exactly the same X and Y means and standard deviations.

And in fact, you can perturb these data points pretty significantly and, within the most significant figures, retain consistency no matter how you move them around. So that gives us a big hint that we can't rely exclusively on even these-- this is five different statistics, not the two that we had before. Even five statistics isn't really enough to completely describe this data set in a way that helps us understand the general shape.

So it can be difficult to include all of the context we might need to make accurate estimates at the moment we're making them. So we have to be extra careful that when we're building statistical models, we check those models against the real world knowledge that we've gained through stories. We exclude relevant knowledge and context at our peril.

So how do we know what kinds of contextual relationships we should think about? Well, there are lots of ways that data can be related to each other. For example, perhaps the more visitors your website has, the higher your monthly AWS spending will be.

When two different variables seem to share some kind of relationship, we say that they're correlated. For example, when the temperature goes up, the number of people who buy ice cream tends to increase. The more gasoline you have in your car's fuel tank, the farther your car will go. When workers close to the minimum wage get a raise, their productivity per dollar goes up, et cetera, et cetera. An interesting question to ask, though, is if one of these things causes the other. So we often want to know if adjusting some value will adjust or will cause a change in some other value.

One way to explore the relationship between any two variables is to plot them against each other and see what you get. Here on the horizontal axis we've got foot size in inches. And on the vertical axis, we've got spelling test scores for 60 people who were asked to spell 36 words each ranging from relatively easy words, like cat, to relatively hard words, like sesquipedalian.

When we plot these results, we get something interesting. It seems like the bigger your foot size is, the better your score on this test. These variables seem to share some kind of relationship.

Well, that seems strange. Why would your feet have anything to do with how well you spell? You may have heard the phrase correlation doesn't imply causation. What we mean by that is just because two variables seem to have a relationship with each other doesn't mean that one causes the other. The relationship could just be a coincidence like in this example where the number of people who died by becoming tangled in their bed sheets is highly correlated with the per-capita cheese consumption in America.

So what we're doing when we say that, when we compare those two values, is saying, well, cheese consumption is coincidental with bedsheet tangling-- that is, they might be related. One might cause the other. But the most we can say is that they're both happening at the same time.

So, using our real world knowledge, though, we can probably say, as far as we know, there isn't any relationship between cheese consumption and bedsheet tangling. There's no reason to believe, that I'm aware of, that that would be true. So it's just a coincidence that they seem to be related.

But why did we draw that conclusion? There's nothing in the data that says that these aren't related, right? We had to apply some external knowledge that these things are probably not related. So unless you're a cheese consumption slash bed sheet tangling expert, we're all just kind of relying on our own common sense idea that these things shouldn't be connected.

Or is it a coincidence? What if it turns out that, say, the show 30 Rock is to blame for the gradual rise in bedsheet tanglings? In 30 Rock, Tina Fey's character, Liz Lemon, frequently enjoys eating cheese in bed. And perhaps some viewers emulated this behavior and got tangled in their own bedsheets and died.

Now, what we're seeing here is that there might be a causal relationship here from a hidden variable, a variable that's not one of the two we considered. So in this case, people who watch 30 Rock love both Slankets and cheese. And the show motivated them to combine these activities, leading to a net rise in bedsheet tanglings. That's probably what's going on in the foot length and spelling test data here.

These two variables are related but they don't have a causal relationship. Instead, a third hidden variable-- namely, your age-- is responsible for this. So the older you are, the more words you know how to spell-- if you're a toddler, you probably don't know that many words and your shoe size is probably small. If you're an adult, you probably know a lot more words and your shoe size is probably larger.

So it's not really driven by any one of those two factors. It's driven by a third factor that is related to both of the other two. So in this case, we're saying that age was a hidden variable that describes a causal relationship.
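A small simulation can make the hidden-variable effect concrete. Everything here is invented for illustration: the formulas tying foot length and spelling score to age, the noise levels, and the unclamped score scale are assumptions, not data from the talk. Foot length and score are each generated from age alone and never from each other.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def simulate(ages, rng):
    """Made-up model: both quantities depend on age plus independent noise."""
    feet = [4.0 + 0.35 * age + rng.gauss(0, 0.5) for age in ages]
    scores = [1.8 * age + rng.gauss(0, 3.0) for age in ages]
    return feet, scores


rng = random.Random(0)  # fixed seed so the run is repeatable

mixed_ages = [rng.randint(3, 18) for _ in range(2000)]  # toddlers through teens
same_age = [10] * 2000                                  # age held constant

r_mixed = pearson(*simulate(mixed_ages, rng))
r_same = pearson(*simulate(same_age, rng))

print(f"correlation across mixed ages: {r_mixed:.2f}")
print(f"correlation at a single age:   {r_same:.2f}")
```

Across mixed ages the two variables look strongly related, and once age is held fixed the relationship essentially vanishes, which is the signature of a hidden variable doing the work.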

One place where causation and correlation come up a lot for engineers is in A/B testing. A popular thing to do is to try lots of experiments-- adjusting color, fonts, positioning, and so on-- and see what improves whatever metrics you care about the most-- say, the number of purchases somebody makes on your website, or the number of hits you got, and so on.

There's a problem though, which is that when you run a lot of experiments, then by chance sometimes you'll get a coincidental relationship. Sometimes you'll get eating cheese causes bedsheet tangling instead of smoking causes lung cancer. And that's partly how headlines like this come about, right? When you try lots and lots of foods to see if one of them might be a superfood, eventually you're going to find something that works and that maybe isn't good for you. You're going to stumble on things that happen just by coincidence.
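As a rough sketch of why running many experiments produces coincidental winners: assume each experiment has no real effect, the experiments are independent, and each is judged at a 5% false-positive threshold. That threshold and the function name are assumptions for illustration, not something stated in the talk.

```python
def chance_of_false_win(num_experiments, alpha=0.05):
    """Probability that at least one of several independent no-effect
    experiments still looks like a winner at false-positive rate alpha."""
    return 1 - (1 - alpha) ** num_experiments


for n in (1, 5, 20, 50):
    print(f"{n:2d} experiments: {chance_of_false_win(n):.1%} chance of a spurious winner")
```

Under these assumptions, twenty no-effect experiments already give roughly a two-in-three chance that at least one looks like a success.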

And conversely, sometimes people take this too far. It's correlation doesn't imply causation, not correlated things can't be causally related. So smoking does cause lung cancer and smoking is highly correlated to lung cancer. There is strong evidence that many correlated things have causal relationships, like smoking causes lung cancer and skydiving causes your heart rate to increase.

It can be difficult to include all the context we might need to make accurate estimates at the moment we're making them. So we have to be extra careful that when we're building statistical models, we check those models against the real world knowledge that we've gained. How do we know what kind of contextual relationships we should think about?

Well, one of the trickiest parts for people is measuring risks that depend on decisions. We're all going to die at some point, for example. But how much does that risk change if you decide to eat ice cream every day, or if you don't get enough exercise, or if you get on airplanes frequently?

Understanding this requires understanding two ideas-- base rates and conditional probabilities. A base rate is an unconditional probability, how likely it is that something is true. For example, if 30% of the people in this room are software engineers, that's the base rate of being a software engineer. Now, if I told you that 20% of the people in the room have green eyes, what's the probability that someone in the room who's a software engineer has green eyes? That's a conditional probability.

To understand these two ideas, let's take a look at the historical weather forecast for Seattle and Phoenix. Seattle has very stereotypically rainy weather and Phoenix has very stereotypically desert-like dry weather. Suppose you look back over the last 28 days for each city and you see this pattern of rainy days in Seattle and this pattern of rainy days in Phoenix.

Now I tell you that based on current weather conditions and historical weather patterns, the weather forecast for tomorrow calls for a 30% chance of rain in both Seattle and Phoenix. Is it more likely to rain in Seattle, more likely to rain in Phoenix, or are both equally probable? Raise your hand if you think it's more likely to rain in Seattle. Raise your hand if you think it's more likely to rain in Phoenix. Raise your hand if you think the chance is exactly the same.

So it turns out the chance is exactly the same because these probabilities are equal. It doesn't matter what happened before or what that context is because that context has already been taken into account when we came up with those probabilities. That's what the base rate is.

So you don't want to double-count information about your base rates. If we adjust our expectations about what probabilities are based on stereotypes of Seattle or Phoenix, we're in essence double-counting information that has already been taken into account.

Most of the time, engineers are building systems that are essentially black boxes to users or the people that the systems interact with. [INAUDIBLE] sweeping ethical ramifications. If it's not done right, then lives or livelihoods might be at stake.

For instance, maybe we're checking to see if someone or something belongs to an interesting group-- measuring whether they have diabetes, scanning their face to see if they're on the no-fly list, evaluating their credit score, et cetera. These are essentially measurements involving conditional probability.

For example, given that someone has some set of biometric data, what is the chance that they're a terrorist? That's a question that involves a conditional probability. That's one that's being considered right now by lots of government systems. And unfortunately for us and for those systems, it turns out that our intuition about conditional probabilities is strikingly bad.

So here's an example. Let's say that 1% of the population has a genetic heart condition that we would like to screen for, and we have a treatment for it. And we have a test that is 95% accurate at identifying the people with the condition. 95% accurate means that 95% of the time, it correctly tells you whether you have the heart condition or not. And 5% of the time, it gives you the opposite result.

So here's the conditional probability question. Given a positive result, what is the probability that you have the heart condition? So I'm not going to ask you to compute the exact number. I just want your gut feeling about, what's the probability that this is true? There'll be five choices. Raise your hand at the choice you think is the closest to the correct answer.

Raise your hand if you think the probability that-- given that you test positive-- you actually have the condition is closest to 0%. OK. 25%? 50%? 75%? And 95%?

OK. So it turns out the closest value is 25%. The exact answer is 16%. So given a 95%-accurate test, the chance that you actually have the heart condition is 16%.

Here's how you could see how that works. Because this is tricky, I think, from an intuition perspective to understand those conditional probabilities. But here's a way I think is more intuitive to understand.

Let's say you have 100,000 people and let's say that we look at who's got the condition and who doesn't. So 1,000 of those people do actually have the condition and 99,000 of them don't actually have the condition. And then we run the test on them.

Well, 950 of the 1,000 people-- because the test is 95% accurate-- will report-- the test will report that they do, in fact, have the condition. And 5% of them will be incorrectly reported. Same for the 99,000 people-- 95% of them will get the correct result. 5% of them will get the wrong result.

Now we combine all the people that have positive results, some of which are true positives and some of which are false positives. And we divide the true positives by that total-- so, a 16% chance that you actually have a heart condition given the positive test.
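Written out with the counts from the 100,000-person example (950 correct positives among the 1,000 people with the condition, and 5% of 99,000, or 4,950, incorrect positives among the people without it), the division looks like this. The notation is added for reference and is not taken from the slides.

```latex
P(\text{condition} \mid \text{positive})
  = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}
  = \frac{950}{950 + 4{,}950}
  = \frac{950}{5{,}900}
  \approx 0.16
```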

Let's say we bump up the accuracy of the test to 99%. It turns out that this actually only makes it about as good as a coin flip. So we got 990 out of 990 plus 990. So that's 50%.

If we bump it up again to 99.9%, we get about a 91% chance. So intuitively, these results make sense from the perspective of-- if we have a very, very, very rare condition, let's say one in a billion people have some condition, then we need a really accurate test in order to detect that, something that's at least as sensitive as that. Because otherwise, we could just invent a fake test that says, no, you don't have that condition and it returns it all the time. And you'd be right most of the time, right? Because most people don't have that condition.
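The counting argument can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function name and parameters are made up here, and it assumes (as the example does) that the test's accuracy is the same for people with and without the condition.

```python
def chance_condition_given_positive(prevalence, accuracy, population=100_000):
    """Chance of having the condition given a positive test, by counting people.

    Assumes the stated accuracy applies equally to people with and
    without the condition.
    """
    have_it = population * prevalence
    dont_have_it = population - have_it
    true_positives = have_it * accuracy              # correctly flagged
    false_positives = dont_have_it * (1 - accuracy)  # wrongly flagged
    return true_positives / (true_positives + false_positives)


# The three test accuracies discussed above, with a 1% base rate.
for accuracy in (0.95, 0.99, 0.999):
    chance = chance_condition_given_positive(prevalence=0.01, accuracy=accuracy)
    print(f"{accuracy:.1%} accurate test -> {chance:.0%} chance")
```

The three results round to 16%, 50%, and 91%. Lowering `prevalence` shows how quickly a rare condition swamps even a very accurate test.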

So when we're dealing with even simple conditional probabilities, our intuition is likely not going to be correct. So we've talked a lot about the flavors of mistakes that people can make. And I hope I've given you some things to think about or at least made you aware of your own biases. It's very, very challenging to come to grips with this and recognize when you're making one of these mistakes. And it's certainly been a long-term struggle for me.

I want to leave you with some reassurance that you're not alone and that we've been thinking about these things for a while as human beings. So when Charles Babbage and Ada Lovelace were thinking about the first ways to use early computers to solve problems, they were thinking about statistics and numbers too. And people wanted them to estimate things with their magic computer machine. That was actually an important goal of the difference engine, the machine you see pictured here.

In fact, people asked him for all kinds of different answers to things. And he wrote in his diary on more than one occasion, I had been asked, pray, Mr. Babbage, if you put the wrong figures into the machine, will the right answers come out? I am not rightly able to apprehend the kind of confusion of such ideas that could provoke such a question.

So while this is a sick and classy burn about computers, it's also good advice. We need to make sure that the numbers we're putting into our machines are the right ones. And we have to make sure that the numbers we're putting in overcame our own cognitive biases to get there.

And if we can do that, maybe we'll have even better stories to tell. Thanks for listening and hope you enjoyed Deconstruct.

[APPLAUSE]

(Editor's note: transcripts don't do talks justice. This transcript is useful for searching and reference, but we recommend watching the video rather than reading the transcript alone! For a reader of typical speed, reading this will take 15% less time than watching the video, but you'll miss out on body language and the speaker's slides!)

[APPLAUSE] Hi, everyone. Thanks for coming to Deconstruct and thanks for sticking around until 3:45 on the last day. So I want you to try to imagine the very first book you remember reading to yourself. It may have been a book that a friend, or a parent, or a mentor had read to you before. And they probably did it by passing on a story.

For as long as we've had oral or written histories, we've had stories. And humans are remarkably good at telling each other stories and our brain has adapted to work well with this mode of communicating ideas. We remember things better when they're told to us as stories. And we're more likely to recall facts if they're presented as a story rather than as disconnected units.

In fact, we're so good at this that when we hear a good story, our brains are adept at placing us inside the story. We imagine ourselves experiencing the events as if they really happened to us. And neurological scans will bear that out too.

But there are other kinds of information we want to convey that don't fit into the mold of a story and that our brains haven't really evolved to be specialized at-- things like accurately assessing risk, drawing statistical conclusions, understanding who to trust, and correcting or counteracting our own biases.

By way of example of how stories might not always work, consider this short fragment. Alex loved his grandmother. In fact, he was on his way to visit her right now. Alex came bearing gifts, three of her longtime favorite foods from his kitchen-- tart blueberry muffins, freshly roasted coffee beans, and decadent cheesecake. He couldn't wait to see her and hear how her day had been.

Chances are good that if I ask you in 30 seconds what the grandmother's favorite foods were, you'll be able to remember that they were blueberry muffins, coffee, and cheesecake. Now, who here thinks that they would be able to remember those three things for 30 seconds when this slide goes away? Raise your hand if you feel pretty sure that you could remember those three foods. OK, pretty strong confidence. Don't worry. There won't be a quiz.

Now let's change the story a little bit. What if Alex's grandmother's favorite things were numbers instead of food? And we'll randomly generate some numbers that have the same number of characters as blueberry muffins, coffee beans, and cheesecake respectively. That's a little bit different, right?

Now raise your hand if you think you can remember those three numbers for 30 seconds after they disappear. OK. I just wanted to see if anyone was confident enough to say yes. So it turns out that stories aren't a universally good tool for representing information that we'd like to convey. This has sweeping implications for how we build systems, especially complex computer systems because those systems are built by minds-- at least for now-- that are good at some things like stories and very bad at others.

In fact, it turns out that we're so bad at some kinds of things that we're very likely to frequently make the same kinds of mistakes over, and over, and over again. Unfortunately, it also turns out that many of the things that we're bad at are also things that we need to be good at in order to build complex systems and draw accurate conclusions from them.

So stories are good at telling us about predictable phenomena. And they can help us understand and provide conclusions that would otherwise take a long time for us to learn. And we're really good at using them.

Statistics, by contrast-- that's only really been around for about 100 to 200 years, a tiny blip on the timescale of human evolution. So what happens when the lessons from our stories are wrong or incomplete, or don't account for some edge case that's important? What happens when the need to communicate some idea doesn't quite fit the mold of a story? What kinds of statistical mistakes and errors are we making? And what kinds of storytelling traps are we falling into, all without even realizing it? What happens when we're not only not right but not even wrong?

Hi, I'm John. And that's what I'd like to talk to you about today. I want to help you recognize some of the most common statistical mistakes we make, especially if you participate in building or designing computer systems. And I'll give you some strategies for overcoming them or counteracting them.

I'll also give you some further reading that you can explore. And I'll send you the slides afterwards, and you can click on a bunch of links. And that way you can have a chance to implement some of these ideas directly in your own systems and on your own teams.

So I hope that by the end of this talk, you feel more self-aware of your collective limitations as human beings-- hopefully, everyone in this room is a human-- human beings who use statistics, and are better equipped to counteract any biases or limitations we might have.

So I work with a lot of customers who are big enterprise companies. And let me hasten to reassure you that the size of a company, a large company, doesn't necessarily reduce the chance that they'll make a very expensive statistical or arithmetic mistake.

For example, take this story from 1995, where Fidelity's fund manager accidentally omitted a minus sign when reporting how the accounting of the fund was going, turning a $1.3 billion loss into a $1.3 billion gain, which is an error of $2.6 billion. So one character difference made a much larger difference than might be expected by having a typo.

And beyond just costing money, these kinds of mistakes can also have serious ethical and personal ramifications, like this example from last year where the UK's new camera system mistakenly flagged hundreds of people for arrests that it shouldn't have. And they claimed that a 92% false positive rate is nothing to worry about.

So stakes are high, I think, for getting this right. So let's talk about how we should balance storytelling and statistics. We can't overcome millennia of evolution with one 25-ish minute talk. But what we can do is leave you more aware of the kinds of storytelling and statistical mistakes people make and then give you some strategies that might equip you to recognize and compensate for those problems.

The crux of the matter is that we have both statistical and storytelling techniques and we use those to understand the world. And that's good. It's good to have tools in our toolbox. What's a problem is when we use stories to solve statistical problems and when we use statistics to solve story problems. What we'd like to do is use the right tool for the job.

We also want to avoid relying exclusively on either stories or statistics. We have to balance our use of those appropriately. Someone who relies exclusively on storytelling is going to miss important nuances buried under a blanket of generalities that might not apply well or to every situation, [INAUDIBLE] poor or erroneous decision-making when we want to talk about facts rather than stories.

Someone who relies exclusively on statistics is going to have trouble delivering compelling insights for themselves or stakeholders. That can lead to making it hard to change or create some outcome that you'd like to drive because we're generally more moved to action by stories that are told with statistics, especially if they make appeals to our emotions rather than just statistics alone.

So in order to achieve that balance, we have to be aware of what kinds of problems might be tipping our internal cognitive scale in one direction or the other without us realizing it. So let's talk about a few of the problems we might encounter.

So one problem facing us is that we use natural language, words with fuzzy meanings, to describe things that are, really, precise probability notions-- so words like improbable, or pretty sure, or unlikely. So to maybe illustrate this and to give you an example of why this might be a problem, let's take this instance.

So suppose I take 100 different laptop batteries of the same make and model and I run them down, and I see how long they last, and I record the results as shown here. So, zero batteries lasted less than one hour. 25 batteries lasted between one and two. 45 lasted between two and three. 22 lasted between three and four. Eight batteries lasted between four and five hours. And finally, no batteries lasted more than five hours.

Now, I'm going to ask you a question. And I'd like you to raise your hand at the first answer that you agree with. So, let's say I get another battery of the exact same make and model as the ones I was testing here. How many of you would say that it was unlikely that this battery would last at least one hour?

That makes sense. Almost no one's raising their hands right now because all of the batteries lasted at least one hour. How about now? How many of you would say that it was unlikely that this battery would last at least two hours? How about three hours? How about four hours? How about five hours?

OK. So it's a pretty wide dispersion of what people mean when they say unlikely. So some of you thought it meant a number very, very close to 0% probability. Some of you thought it meant a number around 10% probability. Some of you thought it meant about 30% probability. That's a huge swing in what we're describing.

So that's a problem. We told ourselves a story. We used natural language to describe something that really had a precise measure around it. So imagine if we had the same level of disagreement about how many kilograms of fuel we should put on the rocket or the right amount of medicine to put in someone's IV bag. That would be really bad if we got that wrong by 30%.

So is it possible to go too far? How do we balance storytelling with statistics here? Maybe another example will make that clear.

Here's a real example from a client I worked with. Suppose you're an engineer on your team and your manager has asked you to look at your AWS spending to see how much money the team is burning through each week. They need to start putting together weekly projections for how much the team is going to consume the next week and they want to know what their budget should be.

So you look at the current spending and you see values like this where longer bars are bigger values. And your manager looks at you and says, what if I just ask for the same amount of money that we used last week? Would that work? I don't want to ask for more money than I need to because I have my own OKRs and metrics that I get measured on and I'd prefer to keep my spending as low as possible.

How good should you feel about that manager's estimate? How confident are you that this will be enough money? Well, one way of thinking about it is you might feel pretty good about this because if you look at the past, what already happened, you would say, well, four out of those five weeks, we didn't exceed that budget. So that's an 80% chance that we won't exceed the budget in the future. So I feel pretty good about this, about picking that value.

So this is a little bit of a trap because if you use that same logic and the numbers look like this instead, would your confidence change? So here, 80% of the week-by-week numbers are still below the proposed budget just like the earlier example. But the numbers swing much wider. There's a much higher variance in what they are relative to each other. So would you still feel good about that estimate? How good should we feel about any given estimate?

We need to make sure that when we talk about estimates, we're using the right technique to do so. If our manager asked us for a number, it's good to make sure that we understand, in relatively precise terms, how sure we are about that number. If you're estimating, say, story points, you might provide a specific number. But is there an uncertainty associated with that number that you can provide? Is there some value you can provide that includes an estimate of how sure you are? That helps if you're telling a story because it helps others understand how reliable that story is.

One popular statistical measure to measure how confident we are about some number is to talk about our statistical confidence. How sure are we that some estimate we provide is a good estimate? So there's lots of ways to do complicated statistical analysis when you have lots of data and to feel good about those results.

We don't have lots of data in this example, though. We just have five data points. So what can we do? Well, statistics are really just measurements about data. What are some statistics that we might use to help us understand what the shape of this data looks like?

Two that people reach for a lot to help them understand sets like this are the sample mean and the sample standard deviation. To get the mean, which we write with an x-bar here, that little funny bar over the top-- we take the sum of all the values we have and then we divide by the number of values that we have. And that's how we get the mean.

You might sometimes see that written like this where there's a funny-looking E. That's the Greek letter sigma, which is conventionally taken to mean the sum of these values. So for the data set that we have here, that's what the average looks like. That's what the mean looks like. It's the length of that blue bar.
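The slide isn't reproduced in this transcript. The conventional way to write the sample mean described here, for $n$ values $x_1, \dots, x_n$ (symbol names assumed), is:

```latex
\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i
```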

So if we picked this for our budget, this is the number that minimizes the error difference from any other value in the sample. So the mean is special in that way. It's the value that minimizes the total length of those red lines.

You can see this if I pick some other number besides the mean. Let's say I move this dotted line to the right. Well, the error for week three improved. But the other four weeks also changed by the same amount. And the total length of all the red lines increased so our error got worse. So [INAUDIBLE] that mean is the number that minimizes that error. So that gives us a little bit of a picture of things. But unfortunately, while the mean describes a central value for some data set, it doesn't really tell us how widely dispersed that data is.
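One way to make "minimizes the error" concrete is to try every candidate budget and measure the error directly. Strictly speaking, the mean is the value that minimizes the sum of squared differences; if you add up the plain lengths of the lines instead, the minimizer is the median. The sketch below checks both, using made-up weekly spending figures rather than the client's data, and hypothetical function names.

```python
from statistics import mean, median

# Hypothetical weekly spending figures (not the client's data from the talk).
weekly_spend = [100, 120, 110, 130, 400]


def total_squared_error(budget, values):
    """Sum of squared differences between each value and the budget."""
    return sum((value - budget) ** 2 for value in values)


def total_absolute_error(budget, values):
    """Sum of plain distances between each value and the budget."""
    return sum(abs(value - budget) for value in values)


candidates = range(0, 501)  # try every whole-number budget from 0 to 500
best_squared = min(candidates, key=lambda b: total_squared_error(b, weekly_spend))
best_absolute = min(candidates, key=lambda b: total_absolute_error(b, weekly_spend))

print("mean:", mean(weekly_spend), "| lowest squared error at:", best_squared)
print("median:", median(weekly_spend), "| lowest absolute error at:", best_absolute)
```

With one unusually expensive week in the data, the two answers sit far apart, which is the same dispersion problem the mean alone can't show.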

So consider these two sets of numbers-- 0, 1, 99, 100. That's one set. And 48, 49, 51, 52. That's another set. And I'll put these on a number line here so we have a visual of what they look like.

Both of these data sets have a mean of 50. But the second set of numbers is much more tightly packed than the first set. So if our possible budget for our AWS spending is very widely distributed, we might need more padding to feel good about telling our manager what we've picked. So these two data sets look very different but they have the same statistic, they have the same mean. So that's not a good, complete picture of the data.

One way to get a picture of how widely distributed a particular data set is is to take the sample standard deviation. So to compute the sample standard deviation and figure out how widely distributed those values are, there's this formula. So that little funny Greek letter, that's another lowercase sigma, not to be confused with the uppercase sigma.

So first, we're going to take the difference between each value and the mean, which you've already computed. We're going to square it. So let's say we had 56 as one of our values and our mean was 50. We would take 56 minus 50. Here we'd square it. So the 6 would become 36.

Then we add that up for all the values that we have. So we do that difference each time. And then we divide by n minus one. And we take the square root of the whole thing. So that gets us the standard deviation.
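The formula on the slide isn't reproduced in this transcript. Following the steps as spoken, and using the lowercase sigma the talk mentions (many texts write $s$ for the sample version), it would read:

```latex
\sigma = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2}
```

For the single value in the spoken example, the term inside the sum is $(56 - 50)^2 = 6^2 = 36$.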

If we do that on the two data sets that we described before, you can see that their standard deviations are very, very different from each other. Now we can see that the standard deviation is much bigger than the mean is for the top data set and much smaller than the mean is for the bottom data set. That gives us a hint that the first data set is more widely dispersed than the second one.
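For reference, here is a short Python sketch that computes both statistics for the two sets of numbers given earlier, using the standard library's `statistics` module. The variable names are made up for this illustration.

```python
from statistics import mean, stdev

wide = [0, 1, 99, 100]
tight = [48, 49, 51, 52]

for name, values in (("wide", wide), ("tight", tight)):
    # stdev() is the sample standard deviation (it divides by n - 1).
    print(f"{name}: mean={mean(values)}, standard deviation={stdev(values):.2f}")
```

Both sets report a mean of 50, while the standard deviations come out at roughly 57.16 and 1.83.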

Going back to our AWS data, we see here that our data is pretty widely dispersed. The standard deviation is pretty close to the size of the mean. So that's a hint that we shouldn't have a lot of confidence if we picked a number close to the mean for our budget. We probably need to pick a bigger number-- but how much bigger?

So maybe you've seen a curve like this before called the bell curve or, more formally, a Gaussian function. A data scientist will tell you that this curve describes what's called the normal distribution. Many kinds of data are described well by a distribution that looks like this, like the height of people or the lifespan of the batteries that we saw before.

Now, one of the nice properties about the normal distribution is that it's described completely by knowing the mean and standard deviation. So if you know those two things and your data is normal or comes from a normal distribution, then you can draw a lot of useful conclusions. And we do know what those are because we just computed them. So if we go on to assume that our AWS spending is normally distributed, then we could use that to try to estimate how likely it is that we have enough budget.

So here, this data is centered around the mean of the data that we calculated earlier. And we can also label where one standard deviation away from the mean on either side would be. And we can do that for two and three standard deviations. And if our data is normal-- and remember that we're assuming that it is-- then we can use a very nice property of normal distributions.

You can say, I'd like to have a certain percent certainty that I won't exceed my budget. Let's say we want to pick 85%. So it turns out that for normal distributions, about 68% of the values fall within one standard deviation. About 95% fall within two. And about 99.7% fall within three.

There's a technique called z-scoring that you can use to compute exactly which value you should pick. But since we picked 85%, we can use a small observation. Since about 68% of the values fall within one standard deviation, that means that 32% don't fall within one standard deviation, so that's about 16% on either side.

So if we take the 16% on the left and we combine it with the 68% in the middle, that gives us about 85%. So if we set our budget at about one standard deviation to the right of our mean, we should have about 85% confidence that the budget for next week won't exceed this number.
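The same reasoning can be checked with Python's `statistics.NormalDist`. This is a sketch under the talk's own assumption that the data is normal; the spending figures are hypothetical stand-ins, since the real numbers aren't in the transcript.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Share of values at or below one standard deviation above the mean.
share_below_one_sd = standard_normal.cdf(1.0)

# How many standard deviations above the mean gives exactly 85%?
z_for_85 = standard_normal.inv_cdf(0.85)

# Hypothetical weekly spending summary (not the figures from the talk).
mean_spend = 1_000.0
sd_spend = 400.0
budget = mean_spend + z_for_85 * sd_spend

print(f"share below mean + 1 standard deviation: {share_below_one_sd:.1%}")
print(f"z-score for 85%: {z_for_85:.2f}")
print(f"budget for 85% confidence: {budget:.0f}")
```

The rule of thumb lands at roughly 84%, close enough to the 85% target. The exact z-score is slightly more than one standard deviation.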

So if we set our budget there, we think we have an 85% chance of staying within the budget. That seems reasonable. So you pick that estimate and you report that value to your boss. Remember, this is a real story.

Unfortunately, it turns out that November 24 is the week of Thanksgiving. And more importantly for this example, it contains Black Friday. So the actual AWS spending as compared to the projected AWS spending is about 500% over. So your AWS spending skyrocketed in this example and it's much, much higher than your estimate. And you get called into the VP's office to explain how you could have forgotten that Black Friday occurred.

So what went wrong here? Often, people working on a team are asked to estimate things like how much budget they need, or how many story points something is, or how long a project will take. So we'll have some data about the problem at hand. But often, it will be limited or noisy in some way. What should we do when we want to provide an estimate?

Well, there are lots of statistical techniques you can use. But we can't ignore the real-world context like the fact that it was Black Friday in this example. We have to be sure that the assumptions we're making are reliable.

In this case, one of the other things that tripped us up is that spending isn't typically normally distributed. It's extremely improbable that your spending, for a growing company, is going to stay within a normal distribution or that it will be described well by one. So for any given business, that kind of data is going to be highly seasonal. Probably, for a growing business, it's going to increase over time. And that's going to really affect what the distribution of that data looks like and it probably won't be normal.

Moreover, you already have way better tools and technical tooling to measure what your [INAUDIBLE] consumption looks like than resorting to statistics. You can run your own experiments and projections based on users, and visits, and requests per second rather than statistics. Statistics is going to help you tell a certain story but you have other tools that can help you tell a better story with a much clearer answer than running a statistical test, in this case. Or it will at least give you a better starting point to work from.

So the danger of relying on these kinds of assumptions is that you'll still get answers out of the formulas. You'll still get an answer about what the mean or the standard deviation of this is and what percent confidence you should have for a particular z-score. But that will work regardless of what the actual underlying data is. So if you don't check your work against that, you're going to have problems.

We also don't want to rely exclusively on statistics to tell a story. Consider this group of 100 randomly-placed data points on 0 to 100 on both axes. These have the following statistics for the X and Y means and the X and Y standard deviations.

Now, consider this graph, which is also 100 points but arranged in the shape of a T-Rex. It has these statistics. So there's the two graphs side by side. They look pretty different from each other but they have exactly the same X and Y means and standard deviations.

And in fact, you can perturb these data points pretty significantly and, within the most significant figures, retain consistency no matter how you move them around. So that gives us a big hint that we can't rely exclusively on even these-- this is five different statistics, not the two that we had before. Even five statistics isn't really enough to completely describe this data set in a way that helps us understand the general shape.
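A much smaller illustration of the same effect, not the T-Rex data from the slides: if two data sets use the same x values and the same y values but pair them up differently, the means and standard deviations can't tell them apart. The data and the `summary` helper below are made up for this sketch, and it only covers four of the statistics.

```python
from statistics import mean, stdev

# Two made-up data sets that share the same x values and the same set of
# y values, paired up differently.
xs = list(range(10))
line_ys = list(range(10))                   # y = x: a straight diagonal line
zigzag_ys = [0, 9, 1, 8, 2, 7, 3, 6, 4, 5]  # same numbers, zigzag shape


def summary(x_values, y_values):
    """Return (x mean, y mean, x standard deviation, y standard deviation)."""
    return (mean(x_values), mean(y_values), stdev(x_values), stdev(y_values))


print("line:  ", summary(xs, line_ys))
print("zigzag:", summary(xs, zigzag_ys))
```

Plotting the two sets would show a straight line and a zigzag, yet the two printed summaries are identical.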

So it can be difficult to include all of the context we might need to make accurate estimates at the moment we're making them. So we have to be extra careful that when we're building statistical models, we check those models against the real world knowledge that we've gained through stories. We exclude relevant knowledge and context at our peril.

So how do we know what kinds of contextual relationships we should think about? Well, there are lots of ways that data can be related to each other. For example, perhaps the more visitors your website has, the higher your monthly AWS spending will be.

When two different variables seem to share some kind of relationship, we say that they're correlated. For example, when the temperature goes up, the number of people who buy ice cream tends to increase. The more gasoline you have in your car's fuel tank, the farther your car will go. When workers close to the minimum wage get a raise, their productivity per dollar goes up, et cetera, et cetera. An interesting question to ask, though, is if one of these things causes the other. So we often want to know if adjusting some value will adjust or will cause a change in some other value.

One way to explore the relationship between any two variables is to plot them against each other and see what you get. Here on the horizontal axis we've got foot size in inches. And on the vertical axis, we've got spelling test scores for 60 people who were asked to spell 36 words each ranging from relatively easy words, like cat, to relatively hard words, like sesquipedalian.

When we plot these results, we get something interesting. It seems like the bigger your foot size is, the better your score on this test. These variables seem to share some kind of relationship.

Well, that seems strange. Why would your feet have anything to do with how well you spell? You may have heard the phrase correlation doesn't imply causation. What we mean by that is just because two variables seem to have a relationship with each other doesn't mean that one causes the other. The relationship could just be a coincidence like in this example where the number of people who died by becoming tangled in their bed sheets is highly correlated with the per-capita cheese consumption in America.

So what we're doing when we say that, when we compare those two values, is saying, well, cheese consumption is coincidental with bedsheet tangling-- that is, they might be related. One might cause the other. But the most we can say is that they're both happening at the same time.

So, using our real world knowledge, though, we can probably say, as far as we know, there isn't any relationship between cheese consumption and bedsheet tangling. There's no reason to believe, that I'm aware of, that that would be true. So it's just a coincidence that they seem to be related.

But why did we draw that conclusion? There's nothing in the data that says that these aren't related, right? We had to apply some external knowledge that these things are probably not related. So unless you're a cheese consumption slash bed sheet tangling expert, we're all just kind of relying on our own common sense idea that these things shouldn't be connected.

Or is it a coincidence? What if it turns out that, say, the show 30 Rock is to blame for the gradual rise in bedsheet tanglings? In 30 Rock, Tina Fey's character, Liz Lemon, frequently enjoys eating cheese in bed. And perhaps some viewers emulated this behavior and got tangled in their own bedsheets and died.

Now, what we're seeing here is that there might be a causal relationship here from a hidden variable, a variable that's not one of the two we considered. So in this case, people who watch 30 Rock love both Slankets and cheese. And the show motivated them to combine these activities, leading to a net rise in bedsheet tanglings. That's probably what's going on in the foot length and spelling test data here.

These two variables are related but they don't have a causal relationship. Instead, a third hidden variable-- namely, your age-- is responsible for this. So the older you are, the more words you know how to spell-- if you're a toddler, you probably don't know that many words and your shoe size is probably small. If you're an adult, you probably know a lot more words and your shoe size is probably larger.

So it's not really driven by any one of those two factors. It's driven by a third factor that is related to both of the other two. So in this case, we're saying that age was a hidden variable that describes a causal relationship.
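A small simulation can show how a hidden variable produces this kind of correlation. Everything here is an assumption for illustration: the numbers are invented, not the data behind the plot, and the helper names `pearson` and `residuals` are made up. Age drives both foot size and spelling score, and neither one influences the other.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What is left of ys after removing a straight-line fit on xs."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Made-up model: age drives both foot size and spelling score; neither
# depends on the other.
ages = [rng.uniform(2, 18) for _ in range(60)]
foot_inches = [4 + 0.35 * age + rng.gauss(0, 0.5) for age in ages]
spelling = [min(36, max(0, 2 * age + rng.gauss(0, 3))) for age in ages]

raw = pearson(foot_inches, spelling)
# Correlation that remains once age is accounted for in both variables.
adjusted = pearson(residuals(foot_inches, ages), residuals(spelling, ages))

print(f"foot size vs. spelling score: {raw:.2f}")
print(f"same, after removing the effect of age: {adjusted:.2f}")
```

The raw correlation should come out strongly positive, while the age-adjusted one should sit much closer to zero, since age is the only link between the two in this model.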

One place where causation and correlation come up a lot for engineers is in A/B testing. A popular thing to do is to try lots of experiments-- adjusting color, fonts, positioning, and so on-- and see what improves whatever metrics you care about the most-- say, the number of purchases somebody makes on your website, or the number of hits you got, and so on.

There's a problem though, which is that when you run a lot of experiments, then by chance sometimes you'll get a coincidental relationship. Sometimes you'll get eating cheese causes bedsheet tangling instead of smoking causes lung cancer. And that's partly how headlines like this come about, right? When you try lots and lots of foods to see if one of them might be a superfood, eventually you're going to find something that works and that maybe isn't good for you. You're going to stumble on things that happen just by coincidence.
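To put a rough number on "by chance sometimes": the sketch below assumes each experiment is independent and has a 5% chance of showing a win when nothing really changed. Both the 5% threshold and the function name are assumptions for illustration, not something stated in the talk.

```python
def chance_of_false_win(num_experiments, false_positive_rate=0.05):
    """Chance that at least one of several no-effect experiments looks like a win.

    Assumes the experiments are independent and share the same
    false-positive rate (5% is a common convention, assumed here).
    """
    return 1 - (1 - false_positive_rate) ** num_experiments


for n in (1, 5, 20, 100):
    print(f"{n:>3} experiments -> {chance_of_false_win(n):.0%} chance of a fluke")
```

Under those assumptions, twenty experiments with no real effect will produce at least one apparent winner roughly 64% of the time.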

And conversely, sometimes people take this too far. It's correlation doesn't imply causation, not correlated things can't be causally related. So smoking does cause lung cancer and smoking is highly correlated to lung cancer. There is strong evidence that many correlated things have causal relationships, like smoking causes lung cancer and skydiving causes your heart rate to increase.

It can be difficult to include all the context we might need to make accurate estimates at the moment we're making them. So we have to be extra careful that when we're building statistical models, we check those models against the real world knowledge that we've gained. How do we know what kind of contextual relationships we should think about?

Well, one of the trickiest parts for people is measuring risks that depend on decisions. We're all going to die at some point, for example. But how much does that risk change if you decide to eat ice cream every day, or if you don't get enough exercise, or if you get on airplanes frequently?

Understanding this requires understanding two ideas-- base rates and conditional probabilities. A base rate is an unconditional probability, how likely it is that something is true. For example, if 30% of the people in this room are software engineers, that's the base rate of being a software engineer. Now, if I told you that 20% of the people in the room have green eyes, what's the probability that someone in the room who's a software engineer has green eyes? That's a conditional probability.
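A small sketch of the difference, using a hypothetical room of 100 people. The 30 engineers and 20 green-eyed people match the percentages above; the overlap counts are invented, because the two base rates alone do not determine the conditional probability.

```python
def conditional_probability(count_both, count_given):
    """P(A given B) = (people who are both A and B) / (people who are B)."""
    return count_both / count_given

room_size = 100
engineers = 30      # base rate of being an engineer: 30 / 100
green_eyed = 20     # base rate of green eyes: 20 / 100

# Hypothetical overlaps: how many engineers also have green eyes.
for label, both in [("no overlap", 0), ("unrelated traits", 6), ("heavy overlap", 12)]:
    p = conditional_probability(both, engineers)
    print(f"{label}: P(green eyes given engineer) = {p:.2f}")
```

Only when the two traits are unrelated does the conditional probability equal the 20% base rate; otherwise you need to know how the two groups overlap.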

To understand these two ideas, let's take a look at the historical weather forecast for Seattle and Phoenix. Seattle has very stereotypically rainy weather and Phoenix has very stereotypically desert-like dry weather. Suppose you look back over the last 28 days for each city and you see this pattern of rainy days in Seattle and this pattern of rainy days in Phoenix.

Now I tell you that based on current weather conditions and historical weather patterns, the weather forecast for tomorrow calls for a 30% chance of rain in both Seattle and Phoenix. Is it more likely to rain in Seattle, more likely to rain in Phoenix, or are both equally probable? Raise your hand if you think it's more likely to rain in Seattle. Raise your hand if you think it's more likely to rain in Phoenix. Raise your hand if you think the chance is exactly the same.

So it turns out the chance is exactly the same because these probabilities are equal. It doesn't matter what happened before or what that context is because that context has already been taken into account when we came up with those probabilities. That's what the base rate is.

So you don't want to double-count information about your base rates. If we adjust our expectations about what probabilities are based on stereotypes of Seattle or Phoenix, we're in essence double-counting information that has already been taken into account.

Most of the time, engineers are building systems that are essentially black boxes to users or the people that the systems interact with. [INAUDIBLE] sweeping ethical ramifications. If it's not done right, then lives or livelihoods might be at stake.

For instance, maybe we're checking to see if someone or something belongs to an interesting group-- measuring whether they have diabetes, scanning their face to see if they're on the no-fly list, evaluating their credit score, et cetera. These are essentially measurements involving conditional probability.

For example, given that someone has some set of biometric data, what is the chance that they're a terrorist? That's a question that involves a conditional probability. That's one that's being considered right now by lots of government systems. And unfortunately for us and for those systems, it turns out that our intuition about conditional probabilities is strikingly bad.

So here's an example. Let's say that 1% of the population has a genetic heart condition that we would like to screen for, and we have a treatment for it. And we have a test that is 95% accurate at identifying the people with the condition. 95% accurate means that 95% of the time, it correctly tells you whether you have the heart condition or not. And 5% of the time, it gives you the opposite result.

So here's the conditional probability question. Given a positive result, what is the probability that you have the heart condition? So I'm not going to ask you to compute the exact number. I just want your gut feeling about, what's the probability that this is true? There'll be five choices. Raise your hand at the choice you think is the closest to the correct answer.

Raise your hand if you think the probability that-- given that you test positive-- you actually have the condition is closest to 0%. OK. 25%? 50%? 75%? And 95%?

OK. So it turns out the closest value is 25%. The exact answer is 16%. So given a 95%-accurate test, the chance that you actually have the heart condition is 16%.

Here's how you could see how that works. Because this is tricky, I think, from an intuition perspective to understand those conditional probabilities. But here's a way I think is more intuitive to understand.

Let's say you have 100,000 people and let's say that we look at who's got the condition and who doesn't. So 1,000 of those people do actually have the condition and 99% of them don't actually have the condition. And then we run the test on them.

Well, 950 of the 1,000 people-- because the test is 95% accurate-- will report-- the test will report that they do, in fact, have the condition. And 5% of them will be incorrectly reported. Same for the 99,000 people-- 95% of them will get the correct result. 5% of them will get the wrong result.

Now we combine all the people that have positive results, some of which are true positives and some of which are false positives. And we divide the true positives by all the positives-- 950 out of 950 plus 4,950-- so, 16% chance that you actually have a heart condition given the positive test.
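The head count above can be sketched in a few lines. This assumes, as the example does, that "95% accurate" applies equally to people with and without the condition; the function name is hypothetical.

```python
def count_positives(population, prevalence, accuracy):
    """Return (true_positives, false_positives, share_of_positives_that_are_real)."""
    have_condition = round(population * prevalence)
    no_condition = population - have_condition

    # People with the condition whom the test correctly flags.
    true_positives = round(have_condition * accuracy)
    # People without the condition whom the test wrongly flags.
    false_positives = no_condition - round(no_condition * accuracy)

    share = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, share

tp, fp, share = count_positives(population=100_000, prevalence=0.01, accuracy=0.95)
print(tp, fp, round(share, 2))  # 950 4950 0.16
```

The false positives outnumber the true positives about five to one simply because there are so many more people without the condition to begin with.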

Let's say we bump up the accuracy of the test to 99%. It turns out that this actually only makes it about as good as a coin flip. So we got 990 out of 990 plus 990. So that's 50%.

If we bump it up again to 99.9%, we get a 91% chance. So intuitively, these results make sense from the perspective of-- if we have a very, very, very rare condition, let's say one in a billion people have some condition, then we need a really accurate test in order to detect that, something that's at least as sensitive as that. Because otherwise, we could just invent a fake test that says, no, you don't have that condition and it returns that all the time. And you'd be right most of the time, right? Because most people don't have that condition.
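A short sketch that sweeps the test accuracy for the 1% condition and also scores the "fake test" that always answers no. It assumes the accuracy figure applies equally to people with and without the condition; both function names are hypothetical.

```python
def chance_real_given_positive(prevalence, accuracy):
    """Probability of having the condition, given a positive result."""
    true_positive = prevalence * accuracy
    false_positive = (1 - prevalence) * (1 - accuracy)
    return true_positive / (true_positive + false_positive)

def always_no_accuracy(prevalence):
    """Accuracy of a fake test that tells everyone they are healthy."""
    return 1 - prevalence

for accuracy in (0.95, 0.99, 0.999):
    chance = chance_real_given_positive(prevalence=0.01, accuracy=accuracy)
    print(f"accuracy {accuracy}: {chance:.0%} chance a positive is real")

# For a one-in-a-billion condition, the fake test is almost never wrong,
# yet it never finds a single person who has the condition.
print(always_no_accuracy(1e-9))
```

Comparing a real test against the always-no baseline is a quick sanity check: if its accuracy is not clearly above one minus the prevalence, the headline accuracy figure says very little.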

So when we're dealing with even simple conditional probabilities, our intuition is likely not going to be correct. So we've talked a lot about the flavors of mistakes that people can make. And I hope I've given you some things to think about or at least made you aware of your own biases. It's very, very challenging to come to grips with this and recognize when you're making one of these mistakes. And it's certainly been a long-term struggle for me.

I want to leave you with some reassurance that you're not alone and that we've been thinking about these things for a while as human beings. So when Charles Babbage and Ada Lovelace were thinking about the first ways to use early computers to solve problems, they were thinking about statistics and numbers too. And people wanted them to estimate things with their magic computer machine. That was actually an important goal of the difference engine, the machine you see pictured here.

In fact, people asked him for all kinds of different answers to things. And he wrote in his diary on more than one occasion, I had been asked, pray, Mr. Babbage, if you put the wrong figures into the machine, will the right answers come out? I am not rightly able to apprehend the kind of confusion of such ideas that could provoke such a question.

So while this is a sick and classy burn about computers, it's also good advice. We need to make sure that the numbers we're putting into our machines are the right ones. And we have to make sure that the numbers we're putting in overcame our own cognitive biases to get there.

And if we can do that, maybe we'll have even better stories to tell. Thanks for listening and hope you enjoyed Deconstruct.

[APPLAUSE]