The Problem with Data

Allen Faulton
10 min read · Sep 19, 2024


An Article of the Modern Survival Guide

Photo by Lukas: https://www.pexels.com/photo/charts-on-black-wooden-table-669622/

This is the Modern Survival Guide, a blog I’ve been writing for a while now about the challenges of the modern world. Now, if there’s something that really typifies the modern world, I would argue that the concept of “data” is pretty high on that list. Data is everywhere, in every business and government organization, and the sad fact of the matter is that it is very often poorly used, poorly understood, and overly persistent.

This is a big deal. I mean, think about something like tax assessment data. That kind of data has huge impacts on your life every year. Or, to take another example, standardized test data — this impacts everyone from the ages of about 5 through 18 in the US, and the government is doubling down on its use every year, for good or ill.

So, what happens when people get the data wrong? Bad things. Very bad things. This article is going to look at why this happens, how it happens, and what you can do about it (spoiler: not much).

When people talk about data, generally speaking the idea is that you are quantifying an event or phenomenon, which is to say that you take something that happens and turn it into numbers. Something that either happens or doesn’t gets turned into a 1 or a 0. Things that can be counted get assigned a count number. Things that are more subjective get assigned a range number (e.g., rate this article on a scale of 1 to 10). And since, from a certain point of view, everything can therefore be quantified, this type of data can be very useful.
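To make that concrete, here is a minimal sketch of the three kinds of quantification described above. The `ArticleFeedback` record and its field names are hypothetical, invented purely for illustration; they do not come from any real survey tool.

```python
from dataclasses import dataclass


@dataclass
class ArticleFeedback:
    """Hypothetical record of one reader's reaction to an article."""
    finished_reading: bool  # something that either happens or doesn't
    times_shared: int       # something that can be counted
    rating: int             # something subjective, on a 1 to 10 scale


def quantify(feedback: ArticleFeedback) -> dict:
    """Turn one reader's reaction into plain numbers."""
    if not 1 <= feedback.rating <= 10:
        raise ValueError("rating must be between 1 and 10")
    if feedback.times_shared < 0:
        raise ValueError("times_shared cannot be negative")
    return {
        "finished_reading": 1 if feedback.finished_reading else 0,  # 1 or 0
        "times_shared": feedback.times_shared,                      # count
        "rating": feedback.rating,                                  # range
    }


row = quantify(ArticleFeedback(finished_reading=True, times_shared=3, rating=8))
print(row)
```

Notice how much judgment is already baked in: somebody had to decide what counts as "finished reading" and what an 8 out of 10 actually means.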

The whole idea behind data science and the data industry is to quantify as much of every part of life as possible, and then turn the resulting data into a product for consumption. This is done both internally at companies and government agencies, and externally for the purpose of sale to interested parties. This is a HUGE industry; hundreds of billions of dollars, with the exact number depending on who you ask (hint: this might have something to do with the core point of the article).

Why is the industry so big? Because everyone in a position of power wants to know what’s going on. It’s that simple. Data, properly refined, becomes business intelligence. It provides a solid point of reference to allow people to make decisions about where to invest, what products to build, what products to buy, what people want, what people think, what people think they want vs. what they really want, how people will vote, where lightning will strike, where to build a house, where to park a car — almost literally, anything can be represented as data, and any data can drive decisions.

Now that I’ve made the point, let’s look at why and how it goes wrong.

Why and How Data Goes Wrong

There are three main ways data fails to provide value.

  • When it is incorrectly used
  • When it is incorrectly interpreted
  • When it is incorrectly collected

Generally speaking, incorrect usage occurs at the management level, incorrect interpretation occurs at the analysis level, and incorrect collection is a problem of either the collection mechanism or the collection strategy.

We’ll go through each of these in turn, because they each deserve a day in the sun. Note that I will be presenting these issues from the perspective of a large organization, but this is not always the case. Any individual person can also mess up in any of these ways. It’s just generally a bigger deal when it happens in a large organization, because the whole point of large organizations is to do big things.

Incorrect Data Use

Let’s get something out of the way first: most senior managers are not going to spend a lot of time looking at raw data. They usually don’t have time and it’s usually not their job; that’s what analysts are for. Instead, most of these folks are going to be looking at data summaries of some sort, and attempting to make decisions based on these summaries. These decisions can suffer from the following problems, and I don’t pretend this is an exhaustive list:

  • Misunderstanding the summary
  • Not reading the summary
  • Disregarding the summary
  • Wholeheartedly believing the summary
  • Applying the data to the wrong things
  • Internalizing the data summary
  • Using old data

Starting from the top, it’s very easy for senior managers to misunderstand the data summaries they’ve been given. There are all sorts of causes for this problem, including things like poor analysis, bad charts, lack of education, too much raw data, or simply misreading the summary. This is why it is always a bad idea to simply hand your boss a PowerPoint presentation and walk away; they might not be able to figure out what it means.

It’s important to recognize that data presentation is kind of like looking into the mind of another person — different minds work in different ways. If you’ve ever looked at a coworker’s Excel sheets, you know what I mean. What makes sense to me might not make sense to you in one form of presentation or another. It is worth your while to work with anyone you are handing a data summary, or who is handing you a data summary, to ensure that you both understand how the other person thinks about the data.

Added to that, most senior managers have some degree of personal pride or personal beliefs that can and will color their interpretation of data. They can and will cherry-pick information they like, and ignore or willfully misunderstand the information they don’t.

All of this, and many other issues besides, creates situations where a senior manager might completely misunderstand the point of the data they’ve been given, and act accordingly.

Similarly, many senior managers suffer from a problem of not reading some or all of the data summaries they are handed, or simply disregarding them. I’ve interacted with many senior managers in my life, and at this point in my career I am one; the overriding commonality among these people is that they don’t have a lot of time, and they generally value personal relationships more highly than reading charts.

What that means is that your manager is very likely to prefer a very short summary, and may simply rely on asking other people for their opinions rather than read a white paper or PowerPoint presentation. Death by PowerPoint is a real phenomenon, and shouldn’t be taken lightly.

On the flip side, some senior managers run into the problem of wholeheartedly believing the data put in front of them, when they should have held some healthy skepticism for reasons we’ll get into later in the article. This produces the classic “garbage in, garbage out” situation where a manager makes bad decisions because they didn’t question bad data.

Similarly, some senior managers have an issue where they will grab any shred of data they can get their hands on, and then apply that data to the wrong things. Correlation is not causation, but it can be easier than you might think for an executive to base their decisions solely off of the only data points they have.

For example, if I’m an executive at a meatpacking plant, and my subordinate hands me a report that says that hotdog consumption is way up, it’s entirely possible that I take that to mean that meat consumption on the whole is up. All I know about for sure is that people are eating more hot dogs, but surely that must mean that other meat is back on the menu, right boys?

There’s another problem with senior management psychology, and that is the tendency to latch onto data points. This can result in internalized data or old data creating bad decisions.

Internalized data in this sense is data that an executive has built into their worldview to the point that they have difficulty disregarding it. For many of the current generation of investment managers, “the stonks always go up” is a good example. It’s held true, more or less, so far. That doesn’t mean it always will.

Old data, on the other hand, is a touchpoint problem. Data is almost never static; what is true at one moment may not be true the next. Cost data, consumption data, public opinion polls: these and thousands of other things are highly variable. Unfortunately, if all you have is a hammer, everything looks like a nail. If all a senior manager has is one data point, they are likely to continue to make decisions based on that data point long after it has ceased to be relevant.
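One modest guard against the old-data problem is to attach a collection date to every number and check it before leaning on it. What follows is only a sketch under assumptions: the `is_stale` helper and the 90-day shelf life are made up for illustration, and the right shelf life depends entirely on what is being measured.

```python
from datetime import date, timedelta
from typing import Optional


def is_stale(collected_on: date, max_age: timedelta,
             today: Optional[date] = None) -> bool:
    """Return True if a data point is older than the shelf life we allow it."""
    if today is None:
        today = date.today()
    return (today - collected_on) > max_age


# Assumed shelf life for a public opinion poll; pick your own.
POLL_SHELF_LIFE = timedelta(days=90)

print(is_stale(date(2024, 1, 1), POLL_SHELF_LIFE, today=date(2024, 9, 19)))
print(is_stale(date(2024, 9, 1), POLL_SHELF_LIFE, today=date(2024, 9, 19)))
```

The check itself is trivial. The hard part is getting anyone to write down the collection date and the shelf life in the first place.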

Again, I don’t pretend this is an exhaustive list, but this should serve to highlight the fact that there are lots of ways to use data poorly. For that matter, there’s never just one way that a person can use data poorly; plenty of people mix and match different bad things! This is always something to guard against, because poor data usage creates bad decisions, and bad decisions create failure.

Incorrect Data Interpretation

This is a problem that generally occurs at the analysis level, and applies to whoever is doing the analysis. Generally speaking, this isn’t going to be a senior manager, but rather a lower-level manager, researcher, or an analyst of some sort.

I’m not going to go too deep into all the ways that data analysis can go wrong, especially not when other Medium writers have already done it. The point is that poor analysis leads to bad interpretations, and then those interpretations get funneled up to the level where they can do damage. Suffice it to say, there are lots of ways this can go wrong.

The largest fundamental impact of poor data analysis is that the resulting interpretations have a tendency to stick around. A great example of this happening in reality is the vaccine controversy sparked by a now-retracted article published in The Lancet, which incorrectly linked the MMR vaccine to autism. This was just bad analysis (and bad data collection, but who’s counting), but the core point is that once a bad idea is out in public, it doesn’t just go away. You have to slowly bludgeon it back into a hole, and then there are still people who will crawl in after it.

Incorrect Data Collection

Data collection errors are the lowest level of data problems, and also arguably the most common. These errors typically also have terrifically outsized impacts, since they generate poor interpretations and poor data usage according to the old expression, “garbage in, garbage out.” It’s worth remembering that people often value data simply for the sake of being data, and having any numbers is considered better than having no numbers. Sometimes, as a result, people aren’t too choosy about the quality of the data they’re using… but they really should be.

There are two broad categories of data collection errors. There are statistical errors, which you can read about in some depth here, and there are survey design errors, some of which are listed here. Broadly speaking, statistical errors creep in when you already have quantified data and you’re trying to play with it, while survey errors creep in when you are creating quantifiable data from subjective opinions. These are not mutually exclusive, in other words.

It’s worth noting that both of these error categories have entire books written about them, and I highly encourage you to read some of them.

To summarize: it is worth your while to never, ever completely trust any data. It’s not just that incorrect data collection is endemic due to bad choices, it’s also the case that quite often people simply don’t know what choices to make when collecting data. To put that another way, it’s not just that some data collection is wrong, it’s that some of it doesn’t know how to be right. Nonetheless, once the numbers are out there, someone will glom onto them.

Dealing with Bad Data

Ok, so what can you do about this? The short answer is, not much. You’re going to have to live with a lot of poor data decisions, because they’re happening all around you, all the time.

The slightly longer answer is that doubt is your best friend when it comes to data use, interpretation, and collection that you control. It is unwise to take data at face value, because someone collected it and someone (often someone else) interpreted it, and everyone who touched it is human and therefore fallible.

This is why the scientific method calls for repeated testing. Scientific thought is focused on creating replicable data in order to triangulate conclusions. You can and should use the same methods and patterns of thought in your usage of data. Always look for other sources. Always verify your sources, if you can. Always ask questions. Never simply believe that just because something has a number attached to it, that means it’s actually understood or correct.
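As a rough illustration of "always look for other sources," the sketch below compares the same figure as reported by several independent sources and flags disagreement. The `sources_agree` helper, the 10% tolerance, and the source names are all assumptions chosen for the example, not an established method.

```python
from statistics import median


def sources_agree(estimates: dict, tolerance: float = 0.10) -> bool:
    """Check whether independent estimates of one quantity roughly agree.

    `estimates` maps a source name to the number it reports. Every estimate
    must sit within `tolerance` (as a fraction) of the median.
    """
    if len(estimates) < 2:
        return False  # a single source cannot be cross-checked at all
    values = list(estimates.values())
    middle = median(values)
    if middle == 0:
        return all(v == 0 for v in values)
    return all(abs(v - middle) / abs(middle) <= tolerance for v in values)


# Hypothetical figures for the same quantity from three sources.
print(sources_agree({"agency": 100, "vendor": 104, "survey": 97}))
print(sources_agree({"agency": 100, "vendor": 150}))
```

Agreement does not prove the number is right, since every source may have copied the same bad original. Disagreement, though, is a cheap and reliable signal that you should start asking questions.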

At the end of the day, there is data out there that is reasonably trustworthy, and there are people out there who are generally capable of making good data decisions. We wouldn’t have electric lights or computers if that were not the case. You should do your best to understand data collection and interpretation best practices so that you can use data correctly. If you don’t do these things, you are at risk of falling prey to bad data and bad interpretations.

And that, friends, can have serious consequences for your decision-making, like thinking it’s a good idea to apply for a 30-year mortgage on a beachfront house in Florida.

You can’t do much to deal with bad data, because it’s all around you, all the time. The Information Age was grossly misnamed; we’re actually living in the Age of Bullshit, and it’s getting worse all the time. The best you can do is to understand the problem, and try to avoid adding to it.

If you liked this article, check out the Modern Survival Guide Volume I, and my current work on Volume II! It’s an utterly random assortment of things I think people ought to know; there’s something in there for everyone.
