A New Era For Data
In The Innovator’s Solution, Harvard Professor Clayton Christensen argued that, during the early stages of an industry, firms with wholly proprietary products have the advantage. New technology is always glitchy, so engineering the entire architecture is the best way to ensure quality.
However, as an industry matures and the technology becomes better understood, it inevitably becomes more modular, with different firms specializing in different parts of the value chain. That’s when the true potential of a technology is unlocked, empowering an entirely new era of value creation.
The computer industry is a good example of Christensen’s model at work. Before the PC, computers were proprietary systems. Yet when the basic architecture became universal, an amazing amount of innovation was unleashed. IBM’s recent announcement of its commitment to the open source Apache Spark community marks just such a point in the evolution of data.
Unlocking The Power Of The PC
In the 1970s, enterprise computing was dominated by a handful of companies, such as IBM, Hewlett-Packard and DEC. Each offered its own line of hardware and software, which could only be combined with the products of other vendors through a long and expensive systems integration process. Customers, for the most part, were locked in.
Initially, the personal computing industry developed the same way, with tightly integrated hardware and software, but in 1981, IBM broke that mold with its launch of the PC. Rather than do everything in house, IBM created an open architecture and its internal divisions had to compete with outside vendors to build components for the new computer.
The move was a fantastic success and Big Blue vaulted past Apple to become the leading personal computing company. Yet, even more importantly, it helped preserve IBM’s position in the enterprise computing market, which became crucial for Lou Gerstner’s turnaround a decade later, when many of the firm’s former competitors ceased to exist.
Alas, it wasn’t a completely happy story for Big Blue. One of the companies that IBM outsourced to, Microsoft, was able to dominate the industry through its control of the operating system. While IBM created the PC revolution, it was Microsoft that ended up taking the bulk of the profits.
That was a hard lesson, but IBM seems to have learned it.
The Evolution Of Data
Data, it’s been said, is the plural of anecdote. People naturally catalogue specific events until they become noticeable trends. Unfortunately, the scope of human experience is relatively small. We can only be in one place at a time, and there are only 24 hours in a day. So it’s hard for a manager of an office in Boston to know whether his colleague in San Diego sees what he sees.
That’s essentially the problem that computerized databases are designed to solve. At first, they could only store fairly simple information. Yet soon, relational databases were developed that could retrieve and analyze data far more efficiently, enabling data mining and more sophisticated analysis. Still, challenges remained.
One was that data needed to be housed in a single location. Another was that while databases worked well with information that was formally structured, like point-of-sale records, they couldn’t handle unstructured data, like Word documents or scientific papers. In effect, anecdotes could only become data if they were properly processed.
Hadoop, an open source technology created in 2005, solved both of these problems. It allows us to store and pull data, whether structured or unstructured, from many locations at once. That’s what enabled the era of big data, creating an integrated environment from which events—or anecdotes if you will—can be widely aggregated and analyzed.
An Operating System For Data
As transformational as Hadoop has been, issues remain. Although it is an incredibly effective filing system for storing data, it is less adept at retrieving it. To analyze information in Hadoop, you must go through the entire data set, which takes time. Further, you can’t analyze data continuously, only in batches.
That’s a real problem for machine learning. Imagine if you had to go through an entire day and could only process your experiences when you got home. Without the ability to process continuously—in effect, to turn anecdotes into data to be analyzed—you couldn’t react to changes in your immediate environment. Every decision would be a day late.
Spark, an open source technology created at UC Berkeley in 2009, effectively solves that problem. It can pull relevant data in from Hadoop, or any other source, and analyze it continuously, updating insights in real time. That makes it a boon for machine learning. Much like a human, a system built on Spark can see the world as it happens and update its analysis accordingly.
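The batch-versus-continuous distinction can be sketched in a few lines of plain Python. To be clear, this is an illustrative analogy, not the actual Hadoop or Spark APIs: a batch-style job must rescan the entire data set to produce each answer, while a streaming-style job keeps a running aggregate and refreshes its answer the moment each new record arrives.

```python
# Illustrative analogy only -- not the Hadoop or Spark APIs.
# Batch: recompute from scratch over the full data set each time.
# Streaming: maintain a running aggregate, updated per record.

def batch_average(all_records):
    """Hadoop-style batch job: requires a full pass over everything."""
    return sum(all_records) / len(all_records)

class StreamingAverage:
    """Spark-style continuous job: incremental state, always current."""
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, record):
        self.total += record
        self.count += 1
        return self.total / self.count  # fresh insight after every record

records = [10, 20, 30, 40]
stream = StreamingAverage()
for r in records:
    latest = stream.update(r)  # answer is updated as each record arrives

print(latest)                  # 25.0 -- available continuously along the way
print(batch_average(records))  # 25.0 -- same answer, but only after a full pass
```

Both approaches reach the same result; the difference is when the answer is available. The streaming version had a usable (partial) answer after the very first record, which is the property that matters for machine learning systems reacting to their environment in real time.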
That’s why Rob Thomas, a Vice President at IBM and author of Big Data Revolution, calls Spark an operating system for data. Much like Microsoft’s operating system for the PC, it allows machine learning systems and analytical engines to pull resources from anywhere in the system and deliver those resources to applications.
Put another way, the data revolution is like the PC revolution all over again. Except this time, instead of Microsoft, we basically have the equivalent of Linux—an open source, rather than a proprietary, operating system.
History Doesn’t Repeat Itself, But It Does Rhyme
Twenty years from now, we will most likely see the PC era as an anomaly—a time when one lucky firm was able to transform one section of the stack and use it as a choke point to gain dominance of the industry. Although Microsoft remains the third most valuable company in the world, it will likely never regain its former power and influence.
Yet besides that element, the same basic narrative continues to play out. Innovations in core technologies continue to unlock massive amounts of new value, which eventually takes the form of applications that transform our daily lives. Who can imagine modern existence without personal computing devices, the Internet and, increasingly, the cloud?
The data revolution should be seen in that context, and IBM’s investment in Spark will help power the transformation. The company will put over 3,500 researchers and engineers to work on improving the core technology, while at the same time embedding it into its analytics and artificial intelligence platforms. It will also offer Spark as a cloud service on its Bluemix platform.
It’s sad, in a sense, that most of this will be invisible to the average consumer. We rarely take notice of underlying technologies. However, the capabilities that are about to be unleashed promise to be no less transformative than the PC or the Internet.