If Big Data Is To Live Up To Its Promise, We Need To Fix Our Data Systems
A data scientist, it’s been said, is a statistician who works in Silicon Valley, which is another way of saying that the term has attained true buzzword status. The potential to be unlocked is undeniable, but so far there has been no shortage of disappointment and frustration. Truth be told, the steak hasn’t always lived up to the sizzle.
The problem hasn’t been with big data itself, but with the practicalities of technology. Simply put, we design systems to perform particular tasks and only later realize that we want them to do more than we originally intended. That’s when it becomes clear that our systems are hopelessly incompatible.
In a nutshell, that’s the problem IBM is now trying to fix. By creating a universal platform, which it calls the Data Science Experience, it hopes to integrate data trapped in separate protocols and incompatible systems. This will not only enable more advanced analytics, it will help us to reimagine how we manage our organizations and compete in the marketplace.
The Power Of A Universal Platform
The European Organization for Nuclear Research, more commonly known as CERN, has long been one of the world’s premier scientific institutions. Its discoveries include the isolation of antimatter and, more recently, the identification of a particle that is consistent with the Higgs boson. Six Nobel prizes have been awarded in connection with work done there.
Yet when Tim Berners-Lee was working there in the 1980s, he noticed that CERN had a very thorny problem. Brilliant researchers would arrive from around the globe, perform important scientific experiments, write up their results and then leave. The documents would be stored on different systems and in different formats, which made them hard to share.
So in November 1989, Berners-Lee created three protocols, HTTP, URI and HTML, that, taken together, would form a universal platform for documents. That platform, now known as the World Wide Web, allowed us to share information as never before and has changed the world in more ways than we can count.
Still, as he described in his memoir, Weaving The Web, Berners-Lee soon recognized the web’s shortcomings. While it allowed humans to communicate with humans as never before, it did little to help machines communicate with machines. In other words, while our ideas could flow freely, our data remained trapped within the systems in which we created it.
The Problem With Data
Most organizations today have a problem very similar to the one CERN had in the 1980s. They collect data on a variety of systems, commissioned by different departments, that don’t talk to each other very well. Some of these systems are decades old and were originally designed for a completely different computing environment.
Consider a typical retail enterprise, which has separate operations for purchasing, point-of-sale, inventory, marketing and other functions. All of these are continually generating and storing data as they interact with the real world in real time. Ideally, these systems would be tightly integrated, so that data generated in one area could influence decisions in another.
The reality, unfortunately, is that things rarely work together so seamlessly. Each of these systems stores information differently, which makes it difficult to get full value from data. To understand how, for example, a marketing campaign is affecting traffic on the web site and in the stores, you often need to pull the data out of separate systems and load it into Excel spreadsheets.
That, essentially, is what has been holding data science back. We have the tools to analyze mountains of data and derive amazing insights in real time. New advanced cognitive systems, like Watson, can then take that data, learn from it and help guide our actions. But for all that to work, the information has to be accessible.
Building An Integrated Data Environment
None of this is to say that there haven’t been real advances in the way we handle data over the last decade or so. Hadoop, which grew out of research Google first published in 2003, allows us to store data across clusters of thousands of machines and analyze it as a single data set. Spark, which reached version 1.0 in 2014, is a processing engine that can run on top of Hadoop and helps us to analyze data in real time. But working with incompatible systems within organizations still presents a problem.
Let’s return to our retail example and imagine that we want to build a predictive model to make purchasing decisions. We’d want to incorporate the response to marketing campaigns, to gauge demand, but also data from inventory systems, so that we can avoid stockouts or having extra items that we’ll have to discount later.
Seems simple, right? Actually it’s not, because all that data resides in separate systems.
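To make the difficulty concrete, here is a minimal sketch in Python of what even a toy version of that integration involves. Everything in it is an assumption made for illustration: the three exports, their field names, the product identifier formats and the numbers are all invented, and real systems would differ in far messier ways.

```python
# Hypothetical exports from three separate retail systems. Every field
# name, identifier format, and number below is invented for illustration.
marketing_rows = [
    {"sku": "A-100", "responses": 420},
    {"sku": "B-200", "responses": 75},
]
inventory_rows = [
    {"item_code": "a100", "on_hand": 30},
    {"item_code": "b200", "on_hand": 500},
]
pos_rows = [
    {"product": "A-100", "units": 12},
    {"product": "A-100", "units": 15},
    {"product": "B-200", "units": 3},
]


def normalize_sku(raw):
    """Map each system's product identifier onto one canonical form."""
    return raw.replace("-", "").upper()


def build_unified_view(marketing, inventory, pos):
    """Combine the three exports into one record per product."""
    view = {}

    def record(raw_id):
        # Look up (or create) the combined record for this product.
        key = normalize_sku(raw_id)
        return view.setdefault(
            key, {"responses": 0, "on_hand": 0, "units_sold": 0}
        )

    for row in marketing:
        record(row["sku"])["responses"] += row["responses"]
    for row in inventory:
        record(row["item_code"])["on_hand"] = row["on_hand"]
    for row in pos:
        record(row["product"])["units_sold"] += row["units"]
    return view


unified = build_unified_view(marketing_rows, inventory_rows, pos_rows)
for sku, fields in sorted(unified.items()):
    print(sku, fields)
```

Only once the three sources share a key can a model see that a product with strong campaign response and thin stock is a candidate for reordering. In this sketch the mapping is a single line of string cleanup. In a real organization, each pair of systems needs its own hand-built translation, and someone has to maintain it.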
That’s precisely the problem that IBM’s Data Science Experience aims to fix. Rob Thomas, a VP at IBM and author of Big Data Revolution, told me, “Today, data science is an individual sport. What we’re doing now is transforming it into a team sport, where groups in separate departments can work together to build, refine and share data models and insights.”
So essentially, IBM’s Data Science Experience seeks to do for data what Tim Berners-Lee’s World Wide Web did for documents: take a balkanized world made up of disparate islands and integrate it into a single, unified environment in which we can work effectively.
The Management Challenge Ahead
When William Faulkner wrote, “The past is never dead. It’s not even past,” he meant that we are, to a large extent, bounded by legacy, and that is certainly true of technology. We never build anew, but nest technologies inside one another, as if they were some incredibly elaborate set of matryoshka dolls.
This becomes painfully obvious when you try to integrate new systems with old, but a more pervasive problem is that management practices are even further behind. We designed our computer systems to mirror our organizational mindsets and, while technology is now breaking free, our management ethos is still largely trapped in an earlier age.
Today, we are increasingly living in a semantic economy, where information flows freely across once impermeable boundaries of firm and industry. A dizzying array of devices and sensors allows us to interact with the world in real time, but all too often we act according to predetermined plans and expect the world to conform.
As Steve Blank has often said, no business plan survives first contact with a customer, yet we are still stuck in the planning mindset, using historical data to predict things months in advance. Then we make investment decisions based on those judgments (often arrived at by consensus in a boardroom over several more months) and wonder why things don’t work out as planned.
Clearly, that’s becoming an untenable mindset. We need to start taking a more Bayesian approach to strategy, where we don’t expect to predict things and be right, but rather allow data streams to help us become less wrong over time. Data is no panacea, but it can help us to see the world more clearly.
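For readers who want to see what becoming less wrong over time can look like in practice, here is a minimal sketch of Bayesian updating in Python. It is an illustration under stated assumptions, not a method described by IBM or anyone quoted here: it uses a standard Beta-Binomial model to estimate the response rate of a hypothetical campaign, and the weekly figures are invented.

```python
def update(belief, successes, failures):
    """Fold one batch of observations into a Beta(a, b) belief."""
    a, b = belief
    return (a + successes, b + failures)


def mean(belief):
    """Current best estimate of the underlying rate."""
    a, b = belief
    return a / (a + b)


def variance(belief):
    """How uncertain the estimate still is (variance of Beta(a, b))."""
    a, b = belief
    return (a * b) / ((a + b) ** 2 * (a + b + 1))


# Start with no opinion: Beta(1, 1) treats every rate as equally likely.
belief = (1, 1)

# Hypothetical weekly results: (customers who responded, customers who did not).
weekly_results = [(3, 47), (6, 44), (5, 45)]

estimates = []
variances = []
for responded, ignored in weekly_results:
    belief = update(belief, responded, ignored)
    estimates.append(mean(belief))
    variances.append(variance(belief))

for week, (est, var) in enumerate(zip(estimates, variances), start=1):
    print(f"week {week}: estimate={est:.4f} variance={var:.6f}")
```

The estimate is never declared final. Each week’s data nudges it, and the variance shrinks as evidence accumulates, which is the opposite of fixing a forecast months in advance and waiting to see whether the world complies.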