This is especially important to the perennial also-ran, Yahoo, which was one of the early believers in Hadoop and supported much of the original work. Open source believers point to Hadoop as a huge victory for the open source strategy: Yahoo shared with everyone else and everyone else shared back, yielding a whole new software ecosystem -- one that's driving the hottest industry trend today.
Embedded in Hadoop is the ongoing tension between open and proprietary. A swarm of startups have sprung from the open source Hadoop code, adding just enough proprietary contributions to attract and retain customers. This debate is being played out as one company alone, Hortonworks, tries to keep its entire platform open. Will Hortonworks succeed? One competitor told me archly, "It's nice to see that Hortonworks finally got a platform out."
Yet pragmatists see this tension as a creative force that fosters exciting new businesses. The core of Hadoop is still pretty close to a standard, which makes life easier for everyone. The extras keep everything running and pay for the upkeep of the core. Programmers need to eat, and the secret sauce is the best way to justify salaries, while the core remains open and improving.
From rows and relations to keys and columns
Hadoop and its satellites are not the only projects working to solve large and complex data problems -- or simpler data problems for that matter. After decades of throwing every sort of data into relational database management systems, we're now seeing a slew of open source alternatives to the traditional data store.
Call them NoSQL, not only SQL, or un-SQL: these alternatives range from refreshingly simple to startlingly sophisticated. Many of them offer high performance or horizontal scalability by trading away some of the power of the relational database. The differences often lie in the tradeoffs they've made to accommodate certain kinds of use cases.
For instance, key-value stores such as Couchbase and Cassandra offer high performance and high scalability where preserving relationships among the data isn't a priority. Both will integrate with Hadoop, and serve as good analytical stores for semi-structured data. Cassandra offers a column-oriented solution, while Couchbase is evolving into a document database -- yet another way to organize the keys and values.
A document database is nothing but a key-value store in which the values are JSON documents. Programmers like these databases because they fit squarely into the object-oriented paradigm. Plus, the elements inside the JSON document can be indexed to speed up searches. Documents are a natural way to store clumps of related elements -- like the makings of a blog post or a patient record -- that don't fit neatly into a relational model. All of this explains the popularity of MongoDB.
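To make the idea concrete, here is a minimal Python sketch of a key-value store whose values are JSON, with one field indexed to speed up searches. It is a toy illustration only: the class name TinyDocumentStore, its methods, and the blog-post fields are all hypothetical, and it does not reflect how MongoDB or Couchbase are actually implemented.

```python
import json
from collections import defaultdict


class TinyDocumentStore:
    """Toy in-memory store (hypothetical): keys map to JSON text, with one indexed field."""

    def __init__(self, indexed_field):
        self._values = {}                   # key -> JSON string (the "document")
        self._indexed_field = indexed_field
        self._index = defaultdict(set)      # field value -> set of keys holding it

    def put(self, key, document):
        # If the key already exists, drop its old index entry first.
        if key in self._values:
            old_value = json.loads(self._values[key]).get(self._indexed_field)
            if old_value is not None:
                self._index[old_value].discard(key)
        self._values[key] = json.dumps(document)
        # Assumes the indexed field holds a hashable scalar such as a string.
        value = document.get(self._indexed_field)
        if value is not None:
            self._index[value].add(key)

    def get(self, key):
        # Plain key-value lookup: parse the stored JSON back into a dict.
        return json.loads(self._values[key])

    def find(self, value):
        # Indexed lookup: no scan over every document is needed.
        return [self.get(k) for k in sorted(self._index.get(value, ()))]


store = TinyDocumentStore(indexed_field="author")
store.put("post:1", {"title": "Hello", "author": "ana", "tags": ["intro"]})
store.put("post:2", {"title": "Again", "author": "ben"})
store.put("post:3", {"title": "More", "author": "ana"})

titles_by_ana = [doc["title"] for doc in store.find("ana")]
```

Each blog post lives under a single key as one self-contained clump, tags and all, and the index on "author" lets find() jump straight to the matching keys instead of parsing every stored document.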
There's a different project for each kind of problem. Some of the answers, such as graph database Neo4j, come into play when the relational powers of the traditional RDBMS aren't relational enough. In the past, depending on the requirements, the architect would tweak the configuration of MySQL or Oracle in a different way. Now, the architect can choose a completely different project.
Puffing up the private cloud
With a private cloud, you can borrow technologies and architectures pioneered by public cloud providers and apply them to your own data center. A bunch of open source projects have emerged to offer software to accomplish just that.
The OpenStack open source project in particular has gained surprising momentum in the private cloud space. This stack of Apache 2-licensed bits provides a framework for managing virtualized compute, storage, and networking resources -- with identity, monitoring, and self-service thrown in for good measure.
Billing itself as a "cloud operating system," OpenStack was initially developed by Rackspace and NASA. Now governed by a separate Foundation, OpenStack claims more than 192 participating companies, including Canonical, Cisco, Dell, HP, IBM, Red Hat, VMware, and a gaggle of cloud startups. Many of these companies plan to offer "packaged" versions because -- as with the Linux kernel -- the raw OpenStack bits are not something you'd normally download and put into production.
But OpenStack is not the only open source private cloud game in town. The best-known open source competitor to OpenStack is Eucalyptus, developed at the University of California and intended to mimic Amazon Web Services, with full API compatibility. CloudStack, an open source project launched by Citrix in April, is well-positioned for use by cloud service providers, with a great Web UI for administering cloud resources.
To the Internets and beyond!?
Open source is outgrowing the world of computers. The success of open source software means that developers are giving it a whirl for everything from automobiles to clothing.
Take routers -- AutoAP, for instance, is a fascinating bit of code that can turn your wireless router into a node in a self-organizing network. It turns the once passive box for setting up Wi-Fi connections into an active participant that's constantly looking to bond with any neighbor.