Data Overload: How to Deal With Too Much Information

Why the devil isn’t always in the detail

Organisations from every sector of the economy are facing a deluge of data, and there is an expectation that we should be using every last drop: decisions must always be data-driven, data-led, evidence-based. In many cases, this is asking too much — a recent survey found that 95% of employees were overwhelmed by data, and that people were increasingly relying on gut feel because of the information overload.

Faced with complex problems and/or too much data to handle, what should we do? Ignore the data and rely on instinct; or stop procrastinating, accept that it’s going to be hard, and just dive into the detail?

Start small and simple

Keeping it simple doesn’t mean ignoring the detail, but it does mean stepping back to try to understand causal relationships: what is the big picture; which factors are most likely to affect the outcome significantly; and, crucially, which are not?

From the outset, we need to be able to see the wood for the trees. And then keep the wood in mind as we undertake detailed analysis. That is, if it’s really necessary.

Many analysts, researchers and data scientists (myself included) have a tendency to get sucked into the detail. We want to know how-stuff-works and how-people-behave at the micro level. What are the mechanics of the situation? And this extends to data — down to asking why we see specific values in certain rows in the database. This is valuable in lots of situations, especially in scientific research, where it deepens our understanding of the world.

However, it’s also a recipe for getting lost in the weeds, and losing sight of the reasons for doing the analysis. This is especially important in the context of decision making. Before jumping down the rabbit hole of a line-by-line analysis we ought at least to believe it possible that the analysis will change the direction of the decision.

The high-level process set out below is designed to help you navigate complex problems and guide your data analysis towards decisive conclusions — and avoid analysis paralysis. The final section then considers a worked example of understanding article engagement (a field where I have already been down many rabbit holes): what actions should I take to improve engagement with this blog post?

The problem is defining the problem

  1. Define the questions you want to answer.

  2. Think causally about your problem. Without worrying about the data, what are the most likely causal explanations underlying these questions?

  3. Match these causal mechanisms to your data. What can you actually measure, and what data is already being collected? It can be useful to think in terms of the mechanisms which are likely to have generated the data: are these the same as the causal processes from the previous step?

  4. Review the original questions. Given your data, what questions can you answer without making too many assumptions?

Uncomfortable honesty is required at these last two stages. Everyone wants a success, but we have to avoid simply willing it into existence. All too often, the questions you can answer are not the ones you would like to answer. Similarly, the things you can measure are not the things you are asking questions about.

Knowing when to stop

If you can answer the original questions, then great, dive into your analysis. But … only go into detail if you are convinced it could affect the answer to the original question. Once an answer becomes obvious, then stop.

This is easier said than done. Intellectual curiosity will always be tempting you into more detail, to clean some more input data, to try a new modelling technique, but you must resist! First convince yourself that it is genuinely important to the decision making exercise.

If, on the other hand, you conclude that you can’t answer the original question, then congratulations! This is a difficult conclusion to reach, but a vital one if your analysis is to lead to the right decision. At this stage you have two options:

  • iterate, by redefining the goal around the questions you can answer, as long as that will still be useful to the final decision; or

  • stop.

Again, honesty is required. Sometimes it is important to admit that the data can’t answer your questions, no matter how hard you try. All too often, individuals and organisations avoid this awkward conclusion and keep plugging away at an analysis which is never going to answer the questions being asked.

Don’t stop monitoring

One final point. Although you might not be able to use data to answer questions about future decisions, you may still be able to gain valuable insights from historic performance. So, don’t stop collecting the data.


Are you still reading this?

In this final section, I look at the question of reader engagement as a worked example of the process outlined above. Step one is to define the goal: I want to understand what I can do to improve engagement with my blog posts.

The second step requires that I try to understand the drivers of engagement. There are lots of possible reasons why users might or might not engage. For example, the quality of the article; the relevance of the topic to the user; the type of device they are using; their emotional state whilst reading; how the user found the article; and so on. The list of possibilities is endless. However, we can make a reasonable assumption that, on average, the first two listed are going to be very important, and the others less so.

Now comes the tricky part: matching what we have done so far to the data. Can we actually measure engagement, quality of the article, or relevance of the topic to the user? Simply put, no.

We can suggest a reasonable proxy for engagement such as the time spent on the article page, but we certainly don’t have anything on article quality, and it’s unlikely we would ever have sufficient information about a user to judge whether the topic was relevant (unless, of course, you are Google).

So, where does this leave us? First, we should redefine the goal to make it transparent that we are addressing a different question: what can I do to increase the time users spend on blog post pages? Laid bare, this doesn’t sound quite right. Longer articles are probably going to increase time on page, so perhaps we need to iterate and suggest a new proxy, something like time spent per word (sketched below).
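
To make that concrete, here is a minimal sketch in Python using two hypothetical posts. The field names and the numbers are invented for illustration, not measurements from this blog; it simply shows how raw time on page flatters longer articles, and how dividing by word count removes that advantage.

    # A minimal sketch with invented analytics for two hypothetical posts.
    # "seconds_on_page" and "words" are illustrative names, not real fields.
    articles = [
        {"title": "Short post", "words": 600, "seconds_on_page": 95},
        {"title": "Long post", "words": 2400, "seconds_on_page": 260},
    ]

    for article in articles:
        # Raw proxy: total time on page. Longer articles score higher
        # simply because there is more to read.
        seconds = article["seconds_on_page"]
        # Normalised proxy: seconds per word, which strips out the
        # mechanical advantage of length.
        per_word = seconds / article["words"]
        print(f'{article["title"]}: {seconds}s on page, {per_word:.3f}s per word')

On these made-up numbers the long post wins on raw time (260s against 95s) but loses per word (0.108 against 0.158), which is exactly the distortion the per-word proxy is meant to expose.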

More fundamentally, I don’t have any way of measuring the two principal causes I identified: article quality, and topic relevance to the user. So, perhaps I just need to stop.

But wait! I can measure device type, and that might affect time on page. This is true, but I can’t realistically influence the devices that readers use, and so it isn’t directly relevant to helping me understand what I can do to increase time spent per word (as a proxy for engagement).


The conclusion is that the answer to the original question is not in the data. Article quality and topic relevance are the most significant factors affecting user engagement, so I need to strive to write better articles and try to get them published in appropriate places. The current data cannot give me these answers.

This might feel like giving up. It really isn’t: it’s just being honest about where the data can and can’t help you, and avoiding doing unnecessary analysis.

However, the final caveat is that I could run experiments. As long as I continue collecting data, I can trial changes and compare the results against historic performance. If you are still reading this, a full treatment will have to wait for another blog post, but the sketch below shows the basic shape of such a comparison.
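
A minimal sketch, assuming invented per-visit samples of time per word before and after some change (say, a new headline style), and using SciPy’s two-sample t-test to ask whether the difference looks like more than noise. The numbers, the sample sizes and the SciPy dependency are all assumptions made for illustration.

    # A minimal before/after comparison on invented data. "baseline" and
    # "variant" are hypothetical samples of seconds-per-word per visit.
    from scipy.stats import ttest_ind

    baseline = [0.14, 0.11, 0.16, 0.12, 0.15, 0.13, 0.14]
    variant = [0.17, 0.15, 0.19, 0.16, 0.18, 0.14, 0.20]

    # Two-sample t-test: is the difference in means plausibly real?
    stat, p_value = ttest_ind(variant, baseline)
    print(f"baseline mean: {sum(baseline) / len(baseline):.3f} s/word")
    print(f"variant mean:  {sum(variant) / len(variant):.3f} s/word")
    print(f"p-value: {p_value:.3f}")

With samples this small the result would be indicative at best, which is part of why the full treatment deserves its own post.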

Ed Rushton