What Is Big Data? (and what do we do with it?)
Big data is like Moneyball for geeks. The term itself is so innately exciting that it made the leap from neologism to buzzword in a heartbeat. Mostly, that’s a good thing, because it has undoubtedly increased enthusiasm and investment in a crucial area.
Unfortunately, all the hype also has a downside: we are now in the midst of a big data bubble in which anything and everything seems to be touted as having some kind of big data tie-in, which makes it hard to know what we’re really talking about.
In their book, Big Data, authors Viktor Mayer-Schönberger and Ken Cukier define it as “things that one can do at a large scale that can’t be done at a small one” and that, I think, gets to the heart of the matter. Big data is not just a difference in scale; it’s a difference in kind, and it demands that we make serious changes to how we think, manage and operate.
The Problem With Facts
Imagine you wanted to find out whether someone was a good free throw shooter and set 90% as the target level. Ten tries would be too small a sample, because one lucky or unlucky shot would make too big a difference, so we might try 100 (which, for a shooter near that 90% level, would have a 95% chance of being accurate to within roughly ± 6 percentage points).
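To see where a figure like that comes from, here is a minimal sketch using the standard normal approximation for a proportion. The function name and the choice of z = 1.96 for 95% confidence are my own illustration, not something from the book, and the approximation is rough for very small samples like 10 shots.

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation confidence interval
    for a success rate p observed over n independent attempts.
    z = 1.96 corresponds to roughly 95% confidence."""
    return z * math.sqrt(p * (1 - p) / n)

# Assumed example: a shooter whose true rate is near the 90% target.
for shots in (10, 100, 1000):
    moe = margin_of_error(0.9, shots)
    print(f"{shots:>5} shots: about +/- {moe * 100:.1f} percentage points")
```

At 100 shots the margin works out to a little under six percentage points, in the same ballpark as the figure quoted above, and it shrinks only with the square root of the number of shots.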
In order to make a clear comparison with other players, however, you would also need to be very careful to create controlled conditions, including standards for the ball, the height of the net, lighting conditions in the gym and so on. The more varied the conditions, the less accurate your assessment will be.
Once all of this has been achieved, you can say with a precise degree of confidence (e.g. 95%) that you have made a reasonably accurate assessment and treat it as a reliable fact. That’s how science is done:
Controlled Conditions + Statistical Significance = Facts.
However, you can probably already see the problem. You’ll be wrong one out of twenty times (5%), and that’s assuming conditions were perfectly controlled. Unfortunately, facts aren’t always what they seem (a 2012 report in Nature found that a majority of the landmark cancer studies it examined could not be replicated).
How To Become Less Wrong Over Time
What if you took a different strategy for assessing free throws? What if you simply assumed that everyone was an average performer and then changed your assessment as data came in? If someone made nine out of ten shots, you’d figure that they were pretty good, but you wouldn’t be certain that they hadn’t just gotten lucky.
As they shot more free throws, in different conditions and at different times, you would continue to adjust your assessment. You’d be a whole lot more confident at 100 trials and even more so at 1000. Whatever your initial assessment, you’d become less wrong over time.
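One common way to formalize this "start average, then adjust" idea is Bayesian updating with a Beta prior. The sketch below is only an illustration: the 75% starting assumption, the prior weight of ten imaginary shots and the function names are all hypothetical choices of mine.

```python
import math

def update(makes_so_far: float, misses_so_far: float,
           new_makes: int, new_misses: int) -> tuple[float, float]:
    """Beta-Binomial update: add the new results to the running counts."""
    return makes_so_far + new_makes, misses_so_far + new_misses

def estimate(makes: float, misses: float) -> float:
    """Current best guess of the shooter's true rate (posterior mean)."""
    return makes / (makes + misses)

def uncertainty(makes: float, misses: float) -> float:
    """Standard deviation of the Beta posterior; smaller means more sure."""
    total = makes + misses
    return math.sqrt(makes * misses / (total ** 2 * (total + 1)))

# Hypothetical prior: "everyone is average", taken here to mean 75%,
# held with the weight of ten imaginary shots.
belief = (7.5, 2.5)

# Assumed observations: 9 of 10, then 90 of 100, then 900 of 1000.
for made, missed in [(9, 1), (90, 10), (900, 100)]:
    belief = update(*belief, made, missed)
    print(f"estimate {estimate(*belief):.3f}, "
          f"uncertainty {uncertainty(*belief):.3f}")
```

The estimate drifts from the starting assumption toward the observed rate, and the uncertainty shrinks with every batch of shots, which is the "less wrong over time" behavior described above.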
Now imagine that you had billions of data points, all being collected and analyzed in real time and real conditions.
That’s the secret to the transformative power of big data. By vastly increasing the data we use, we can incorporate lower quality sources and still be amazingly accurate. What’s more, because we continue to reevaluate, we can correct errors in initial assessments and make adjustments as facts on the ground change.
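A rough way to see why volume can make up for quality is the standard error of an average, which falls with the square root of the number of readings. The numbers below are made up for illustration, and the sketch assumes the errors are independent and unbiased; a source that is systematically off will not average out, no matter how much of it you collect.

```python
import math

def standard_error(noise_sd: float, n_readings: int) -> float:
    """Typical error of the average of n independent, unbiased readings,
    each with random noise of standard deviation noise_sd."""
    return noise_sd / math.sqrt(n_readings)

# Hypothetical comparison: a precise instrument with few readings
# versus a source ten times noisier with vastly more readings.
precise_small = standard_error(noise_sd=1.0, n_readings=100)
noisy_large = standard_error(noise_sd=10.0, n_readings=1_000_000)

print(f"precise source, 100 readings:     {precise_small:.3f}")
print(f"noisy source, 1,000,000 readings: {noisy_large:.3f}")
```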
In an increasingly connected age, with low cost sensors and a central Internet, this is unleashing a world of new possibilities. We are monitoring everything from signs of stress on bridges to vibrations in vehicle engines to how people search for information on the Internet and are able to glean valuable insights.
More is Better Than Smarter
A big data turning point came in 2009 with the onset of the H1N1 virus. The Centers for Disease Control and Prevention (CDC) requested that highly trained specialists (i.e. doctors) report signs of flu in their area in order to track its spread. The data was accurate, but had a lag of two weeks, which greatly hampered its effectiveness.
At the same time, Google began its own flu tracking, using big data methodology to correlate search terms with the flu. The data was just as accurate, but almost instantaneous and was able to outperform the work of thousands of highly trained specialists by identifying specific patterns in the information.
What’s important to understand about Google Flu Trends is that, from a conventional perspective, it doesn’t make a whole lot of sense. No symptoms are checked. No one with medical training is evaluating the data. The service identifies correlation, not causation. In other words, this isn’t medicine, it’s data.
And that’s the beauty of big data: it can be dumb and still be incredibly useful. We don’t need to find a rhyme or reason in order to make judgments about the world around us. If an unusual pattern precedes an infection in a hospital patient or an engine problem or a bridge collapse, we can take preventative steps, even if we don’t fully understand the causal link.
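To make "correlation, not causation" concrete, here is a minimal sketch of the kind of calculation involved: a plain Pearson correlation between two weekly series. This is not Google's actual method, and the numbers are invented purely for illustration.

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up weekly figures, NOT real data: searches for a flu-related
# term and flu cases reported later for the same weeks.
weekly_searches = [120, 150, 200, 310, 420, 380]
reported_cases = [10, 14, 19, 33, 41, 39]

r = pearson(weekly_searches, reported_cases)
print(f"correlation: {r:.2f}")
```

A value near 1 says the two series rise and fall together. It says nothing about why, which is exactly the point: the search data can act as an early signal without anyone diagnosing a single patient.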
However, what is perhaps even more exciting is when vast computational power is applied to simpler patterns. That’s when dumb data starts to become smart.
The first things an infant begins to recognize are simple patterns like lines, shading and phonemes (elemental units of language). Infants then learn to combine those elemental patterns into higher order ones, like shapes, objects and words. Over the course of a lifetime, we continually learn to combine simpler patterns into higher order ones.
This, of course, takes an enormous amount of time, because humans must gain significant experience in lower order patterns before they can advance. Doctors spend years at medical school and then years more in residency before they master enough patterns to practice in a specialized area. Computers, however, can learn much, much faster.
For example, researchers at IBM taught their algorithm to translate between French and English by exposing it to proceedings of the Canadian Parliament, which by law must be produced in both languages. This allowed them to connect not just words, but entire phrases and even slang. It would take a year for a human to sit through it all, but a computer can do it without breaking a sweat.
Others, such as Mattersight, a company that uses artificial intelligence to analyze and improve call center operations, use a more human centered approach. They have trained analysts check the computer’s work and teach it to improve over time. Researchers at Cornell have developed algorithms that can learn by merely observing human behavior.
From Hypothesis to Simulation
In the past, we’ve mostly operated by testing hypotheses. We come up with an idea of how the world might work, like a new way to treat an illness or a potentially lucrative marketing approach and then do some research in a lab, conduct consumer surveys or use some other method. If those go well, we might invest further in a trial or pilot program.
The problem is that’s all incredibly expensive and time consuming. With big data, we can simply look for patterns that correlate with real world phenomena. Once we identify a potentially fertile model, we can continue to observe and test it as more data comes in. With billions of data points and cheap computing power, this is extremely efficient.
What’s even more exciting is our ability to use big data to build large scale simulations called agent based models, which allow us to test ideas in a virtual environment. Marketing simulations have been shown to be 90% accurate, which allows us to eliminate a lot of bad ideas before we go to the trouble and expense of testing them in the real world.
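For readers who have not seen one, here is a toy agent based model of word-of-mouth adoption. It is a deliberately simple sketch, not a description of any commercial marketing simulation; every parameter and name in it is a hypothetical choice for illustration.

```python
import random

def simulate_adoption(n_agents: int = 200, n_neighbors: int = 4,
                      p_influence: float = 0.3, n_seeds: int = 5,
                      n_steps: int = 20, seed: int = 0) -> list[int]:
    """Return the number of adopters after each step.

    Each agent watches a fixed random set of other agents. At every
    step, a non-adopter adopts with probability p_influence times the
    share of the agents it watches who have already adopted.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    neighbors = [
        rng.sample([j for j in range(n_agents) if j != i], n_neighbors)
        for i in range(n_agents)
    ]
    adopted = [False] * n_agents
    for i in rng.sample(range(n_agents), n_seeds):
        adopted[i] = True  # the initial customers

    history = [sum(adopted)]
    for _ in range(n_steps):
        next_state = adopted[:]  # update everyone simultaneously
        for i in range(n_agents):
            if not adopted[i]:
                share = sum(adopted[j] for j in neighbors[i]) / n_neighbors
                if rng.random() < p_influence * share:
                    next_state[i] = True
        adopted = next_state
        history.append(sum(adopted))
    return history

# Compare two hypothetical campaigns that differ only in seeding.
print("5 initial customers: ", simulate_adoption(n_seeds=5)[-1])
print("20 initial customers:", simulate_adoption(n_seeds=20)[-1])
```

Changing a parameter and rerunning takes seconds, which is the appeal: a weak idea can be discarded in the simulation before anyone pays for a real world pilot.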
In essence, big data is enabling us to create a simulation economy where organizations can learn much more efficiently. However, before we can manage effectively in a big data world, we must first change our business culture from one that values clever ideas to one that embraces simulation, testing and action.