What does an executive need to know about Statistics (with no Maths involved)?

“Does any of this stuff work?” is a question that’s easier to ask than it is to answer. Find out why you should care and what you need to know.

Imagine you’re a senior tech executive at a big company. You’ve got the corner office, the view, and the comfortable salary. Congrats! That all sounds pretty sweet to me. Unfortunately here on Earth, those trappings come with a weight that you need to carry. And a big chunk of that weight is the paranoia that comes with constantly scanning the horizon for technologies that might disrupt your company’s business.

Now you’ve heard the buzz about A.I., so you put together a plan for a pilot project to see how it could help your company. Then, being the charming person you are, you get sign-off from a skeptical C.F.O., and find a consultancy to build some whizzbang thing that detects defects, interprets images, generates text, or some other A.I. thing.

The project completes, and the consultancy delivers a working prototype. This, of course, comes with a bunch of PowerPoint slides. The shininess of these slides will vary depending on the consultancy, but there will always be a line that says something like “this will add X million to your bottom line”. And, mirabile dictu, X is always much, much larger than the cost of the project.

Now as an executive, what do you do? The dark pattern is to loudly take pre-emptive credit for the £X million figure and pass the prototype to someone else a.s.a.p. If it all works out, great! It was your idea! On the other hand, if things don’t go well — that’s on the implementation team.

Maybe I shouldn’t be surprised. Success has a thousand fathers, but failure is an orphan after all. 

But let’s stay positive! Let’s suppose you’re an ethical executive and you’re not quite ready to join the dark side just yet. What do you do? You know you can’t just accept the headline figure. Your C.F.O. isn’t going to believe it for a start. Worse, your new A.I. is going to be doing something that is currently being done by a human. Someone somewhere in the company will be pushing for it to fail.

Doing a test seems sensible. You can run some small portion of the company’s operations on the prototype for a limited time, then compare how the new version does against the old one. That lets you extrapolate the difference and work out how much the new version would add if it were rolled out across the company as a whole.

Now there’s an immediate problem here. As the saying goes, if you’re running many tests you’re a scientist; if you’re running one test you’re a politician. And of course your C.F.O. only signed off on one prototype initially, so there might be only one test to run.

Anyway, you’re a fair-minded person, and you figure that the right thing to do is to run the test and see what happens. You can worry about the politics afterwards. So you embark on a trial, looking for an answer to a fairly simple question: “does this stuff actually work?”

Unfortunately at the end of the process you get a statistician (or data scientist) in front of you giving what seems like unnecessary detail about some fairly technical stuff. All you really want is a straightforward answer: yes or no!

Meanwhile the statistician (or data scientist) is there giving what they think is a perfectly standard answer to a perfectly well understood problem. As far as they’re concerned they are giving you a straight answer, and it’s your problem that you don’t understand it. Come on, SURELY everyone knows what a Bayesian credible interval is! The clue’s in the name!

Now I’ve seen this play out enough times to realise there are problems all around here. On the one hand, our statistician probably isn’t being hugely helpful — especially if, like me, they’ve been a “maybe-this-maybe-that-more-research-is-needed” academic. In the end, they’re not the ones carrying the weight of the decision. They can prevaricate until the end of time.

On the other hand, there’s a sense in which a yes or no answer is impossible. Probabilities are involved and there’s a chance that anything might happen. To get a handle on this, statisticians have developed a language to describe how to weigh up the evidence from experiments. To a large extent, it’s the use of this language that makes things abstruse.

So it’s easy then? Learn the language and you get the answer? Well not so fast. There’s no universally agreed way to weigh up evidence and draw conclusions. No matter how much Maths people throw at you, in the end someone has to look at the evidence and make a call. But if you’re a senior executive, exercising good judgment is your thing! Right … ?

So what do you need to know? And is it that complicated? Well judge for yourself … 

Everything an executive needs to know about frequentist methods.

There’s a lot of stuff out there about how unnatural and hard to interpret these methods are, and how you need a specialist to tell you what they mean. Don’t buy it! If I came to you saying: “It’s a miracle! I tossed a coin 1000 times and got a head each time”, your natural reaction should be to call me an idiot and tell me it’s a double-headed coin.

So what’s gone on here? You’ve started off with a preconceived idea about a “normal” version of the world, i.e. that the coin was fair. Then you’ve thought to yourself, “if the coin is fair, then what are the chances of seeing a result like 1000 heads in a row?”. You’ve thought about those chances and decided “that’s ridiculous, the coin can’t be fair, it must be double-headed”.

In an intuitive way, you’ve just performed a frequentist statistical analysis. To translate into statistics speak: the preconceived idea is a null hypothesis, the number of heads in a row is your test statistic and the chance of seeing a result at least that weird is your p-value. Then when you decide “that’s too weird”, you’ve compared the p-value against some threshold for weirdness. Statisticians call this threshold the significance level. When you decided the coin can’t be fair, you did something called “rejecting the null hypothesis”. You then accepted the alternative, in this case that the coin is double-headed.

All pretty intuitive.
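
If you’d like to see the mechanics, the whole coin-toss argument fits in a few lines of Python. This is just a sketch: the 1000 heads come from the story above, but the 5% threshold for “too weird” is an arbitrary choice of mine.

```python
# The coin-toss reasoning as a frequentist hypothesis test.
# Null hypothesis: the coin is fair (probability of heads = 0.5).
from scipy.stats import binomtest

tosses, heads = 1000, 1000
result = binomtest(heads, tosses, p=0.5, alternative="two-sided")
print(f"p-value under a fair coin: {result.pvalue:.3g}")

significance_level = 0.05  # the "how weird is too weird" judgement call
if result.pvalue < significance_level:
    print("Reject the null hypothesis: this does not look like a fair coin.")
else:
    print("Not weird enough to reject the idea of a fair coin.")
```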

In the example we started with, the null hypothesis would be that the A.I. has no effect on the business process. You then go out and gather the data for your trial and calculate how likely the differences you see between the A.I. version and the normal version would be, all while assuming that there truly is no difference between the two.

Now there’s a whole load of Maths about which test statistic to use, and how to calculate all the probabilities involved, but that’s what statisticians are for. And at the risk of being thrown out of the Super Secret Guild of Statisticians for revealing our sacred mysteries — 99.9% of the time we just google it.
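
For what it’s worth, the kind of thing the googling turns up usually looks like the sketch below. Everything in it is illustrative: I’ve invented a metric (minutes per case) and simulated the data rather than used anything real.

```python
# A standard two-sample comparison for the trial: old process vs A.I. process.
# Null hypothesis: the A.I. makes no difference to the average time per case.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
old_process = rng.normal(loc=10.0, scale=2.0, size=500)  # minutes per case, old way
ai_process = rng.normal(loc=9.5, scale=2.0, size=500)    # minutes per case, with the A.I.

statistic, p_value = ttest_ind(ai_process, old_process)
print(f"test statistic: {statistic:.2f}, p-value: {p_value:.4f}")
```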

However, there are some things a statistician needs from an executive. If they’ve performed an analysis similar to the one we’ve described, someone will have to decide the significance level. In other words, how weird do things have to look before we can say: “this is rubbish, the null hypothesis can’t be true.”

Now that’s not Maths, that’s judgement. You have to earn that corner office somehow.

OK. So far it looks like there’s a role for everybody, which is good. But the world is more complicated than that. There’s something odd about what we’ve done. If we reject our null hypothesis — in this case that the A.I. has no effect — we accept the alternative, that the A.I. does have an effect. 

Great! It works! 

But now you ask your statistician: “so how much difference does this A.I. make?”

From the analysis we did before, technically called hypothesis testing, the statistician can’t tell you. They can just tell you how weird the data looks if we assume the A.I. hasn’t improved things.

However a frequentist statistician can give you a confidence interval for the thing the A.I. is supposed to be improving, whatever that might be: speed of manufacture, click-through rate, average purchase size.

So the statistician tells you that the 95% confidence interval for the thing you care about is between two values. Reasonably you might believe that there’s a 95% chance that the value of the thing you care about is between those two values. 

Nope. Sorry. This is where things get a bit … odd.

The statistician is saying that there’s a real value out there in the ether that we don’t know, but are trying to find out. If we were to do our experiment over and over again, calculating a confidence interval each time, then that real, unknown value would lie within the interval 95% of the time. So out of 100 different experiments, we’d calculate 100 different confidence intervals and roughly 95 of them would contain the true value.
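
A small simulation makes the repeated-experiments idea concrete. The “true value” below is invented precisely so we can check how often the interval catches it; in real life you never get to see it.

```python
# Run the "same" experiment many times, build a 95% confidence interval
# each time, and count how often the interval contains the true value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mean, noise, sample_size, n_experiments = 5.0, 2.0, 100, 1000

hits = 0
for _ in range(n_experiments):
    sample = rng.normal(true_mean, noise, size=sample_size)
    low, high = stats.t.interval(0.95, sample_size - 1,
                                 loc=sample.mean(), scale=stats.sem(sample))
    if low <= true_mean <= high:
        hits += 1

print(f"{hits} of {n_experiments} intervals contained the true value "
      f"(we expect roughly 950)")
```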

Not surprisingly, real-life human beings don’t love this. “So what’s the number then?” is a perfectly reasonable question to ask. And you might reasonably feel that a load of stuff about infinitely repeated experiments doesn’t answer it. You only did the experiment once! However we’ve now run into the limits of the method. We can only talk about the probability of the events we’ve observed. We can’t talk about the probability of the values that describe what’s going on under the hood. 

But, before we get too far from the executive version, we’ll deal with this problem by turning to … 

Everything an executive needs to know about Bayes.

So far, I’ve been trying to hide the concept of probability behind concepts that might be less well defined, but are more intuitive, e.g. “oddness”, “weirdness” and “chance”. However we now need to briefly peep behind the curtain. In the world we just described, the probability of an event happening is something you get to in the long run. If you keep tossing a fair coin, the proportion of heads that you see will, in the long run, get ever closer to 50%. The probability of getting a head is therefore 50%.

However in this world, you can only calculate the probability of events that you could, in principle, measure. You can’t, for example, calculate the probability that a coin is biased because you can’t see bias directly. You just see the pattern it leaves in the data.

However, “given the data I see, what’s the probability that the A.I. performs better?” is a natural question to ask, even though you can’t observe “better”. So we need another definition of probability.

Enter Bayes.

For a Bayesian a probability is a “degree of belief”: 1 is complete certainty that something is true, 0 is complete certainty that something is false. The numbers in-between reflect the strength of your belief in the truth (or falsity) of “the thing”. You can have a degree of belief in anything, not just things you can observe directly.

Now if you think that all sounds a bit subjective, you’re not alone. I’m far too young to have witnessed it first-hand, but by all accounts the Great Bayesian Wars split statistics departments in the late 90s. It all died down when machine learning came along of course. There’s nothing like a shared enemy to unite warring tribes.

Anyway, let’s go back to our example of tossing 1000 heads in a row. Suppose you wanted to calculate the probability that the coin was double-headed from the data you have. Well, you might think: “OK, let’s say about one in a million coins are double-headed. I’ll weight the probability of seeing 1000 heads in a row from a normal coin by my 999999/1000000 chance that it actually is a normal coin.”

You could then weight the probability of seeing 1000 heads in a row from a double-headed coin by your 1/1000000 chance that it is a double-headed coin, and compare the two weighted numbers.
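
Written out, that weighing-up is just a few lines of arithmetic. The one-in-a-million prior is the made-up number from above, not a fact about coins.

```python
# The coin example as a Bayesian update, using the made-up
# one-in-a-million prior belief that a coin is double-headed.
prior_double = 1 / 1_000_000
prior_fair = 1 - prior_double

likelihood_double = 1.0        # a double-headed coin always shows heads
likelihood_fair = 0.5 ** 1000  # a fair coin showing 1000 heads in a row

# Weight each likelihood by the prior belief, then normalise.
posterior_double = (likelihood_double * prior_double) / (
    likelihood_double * prior_double + likelihood_fair * prior_fair
)
print(f"Belief that the coin is double-headed, given the data: {posterior_double:.6f}")
```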

Congratulations. You’ve just done a Bayesian analysis! But as an executive, you need to realise that when you decided that one in a million coins were double-headed, you brought along some previously held beliefs. These were then used in the analysis.

Now this “incorporation of prior beliefs” is the trick that Bayesians use to calculate the probabilities of things that can’t be observed. They take the observable evidence from the data, and weight it by their prior beliefs to get the belief in the thing that they care about. Again if this all sounds a bit subjective, you’re correct — hence the Bayesian Wars.

But remember that there was judgement in the frequentist case too. We had to decide how weird was too weird! The difference is that now, we make those judgements up front, and they then filter through the analysis.

Of course, in practice it often doesn’t matter too much what the priors are. The story from the data should overwhelm the beliefs that you started with. Sometimes statisticians will start with ‘uninformative’ priors. For example, you could say: I know nothing about how common double-headed coins are, so I’ll just assume the probability of starting with one is 0.5.

Now just thinking about this example, you can probably see that starting with an uninformative prior isn’t necessarily the right thing to do. 

Anyway, the thing you need to know is that judgements are being made upfront. You should probably ask your statistician what their prior beliefs are and how much another set of plausible prior beliefs would change the analysis. The other thing to recognise is that a Bayesian probability is telling you about degrees of belief. So if a Bayesian statistician tells you that there’s a 95% probability that a value lies between two values, they’re telling you that their belief that it lies between those two values is 19 times stronger than the belief that it lies outside them.
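
One practical way to have that conversation is to rerun the calculation with a couple of different priors and see whether the conclusion moves. Here’s a sketch reusing the coin example, but with only ten tosses, so the data is thin enough for the prior to matter.

```python
# How much does the prior matter? The same Bayesian update, run with
# two very different priors and only 10 heads' worth of evidence.
def belief_coin_is_double_headed(prior_double, n_heads):
    likelihood_double = 1.0           # double-headed coin: heads every time
    likelihood_fair = 0.5 ** n_heads  # fair coin: n_heads in a row
    prior_fair = 1 - prior_double
    return (likelihood_double * prior_double) / (
        likelihood_double * prior_double + likelihood_fair * prior_fair
    )

for prior in (1 / 1_000_000, 0.5):    # sceptical prior vs "uninformative" prior
    posterior = belief_coin_is_double_headed(prior, n_heads=10)
    print(f"prior {prior:>10.6f} -> posterior {posterior:.4f}")
```

With 1000 heads the two priors end up in essentially the same place; with ten, they point in opposite directions. That’s exactly the question to put in front of your statistician.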

So what should we take away?

Hopefully, if you’re with me this far, I’ll have convinced you of a few things. Firstly I think that the idea that you can move mechanically from data to an optimal decision is a mathematician’s fantasy. There is always a place for judgement in that fuzzy place where an experimental model meets the rest of the world.

Secondly, I think that we should be explicit about the judgement calls that are being made. Depending on which type of statistician you are talking to, there are two obvious places where judgment calls are made.

  1. If you’re talking to a frequentist statistician, the judgement call is about how implausible the data has to look under the standard way of doing things, i.e. the “null” hypothesis, before you reject it. The null hypothesis is usually something along the lines of “there is no difference between the old way of doing things and the new way”.

  2. If you’re talking to a Bayesian, then a judgement call is being made with the prior beliefs that you start with. With enough data, that prior belief won’t matter, but you need to ask your statistician whether there actually is enough data to overwhelm reasonable priors.

But there is another place where judgement calls are necessary. In any statistical analysis, a complex real world is being modelled by a simplified mathematical model. Someone, somewhere, needs to think about what happens when that messy reality deviates from the model. However, that is a subject large enough to form a future blogpost of its own.

Dave Dale