Enterprise Hadoop: Big data processing made easier
Amazon, Cloudera, Hortonworks, IBM, and MapR mix simpler setup of Hadoop clusters with proprietary twists and trade-offs
It's been a big year for Apache Hadoop, the open source project that helps you split your workload among a rack of computers. The buzzword is now well known to your boss but still just a vague and hazy concept for your boss's boss. That puts it in the sweet spot where there's still plenty of room for experimentation. The list of companies using Hadoop in production work grows longer each day, and it probably won't be long before "Hadoop cluster" takes over the role that the words "crazy supercomputer" used to play in thriller movies. The next version of the WOPR is bound to run Hadoop.
The area is flourishing as the core project attracts a wide collection of helper projects that organize the workload and make it simpler to manage a collection of jobs to run at particular times. There's HDFS, a distributed file system that can organize the data spread out around the cluster; Hive, a data warehousing layer for making sense of this data; Mahout, a collection of routines for trying to learn something from said data; and ZooKeeper, a tool for keeping all of the balls in the air. At least a half-dozen other open source tools live in a stable orbit around Hadoop.
The open source projects are just the beginning -- a surprisingly large number of companies are emerging with the plan of helping people actually use Hadoop. Some are just selling support, and others are building their own tools that sit alongside Hadoop and make it easier to use.
This kind of competition is usually seen as open source at its best. There is a core collection of packages that serves as a standard to keep everyone in sync. Each of the groups is competing to add the right sauce that will attract customers, both paying and nonpaying. There continues to be controversy over just how much is rolled into the central collection, as there can be in any major open source project, but the amount of experimentation is so large that it's hard to be too focused on the amount of sharing.
To get a feel for the excitement, I took four major collections out for a test-drive. I powered up a cluster of nodes on Rackspace, installed the tools, pushed the buttons, and ran some sample jobs. It's getting to be surprisingly easy to spend a few pennies for an hour or two of machine time -- so much so that I found myself debating whether it was worth leaving my cluster idling over lunchtime. Lest anyone doubt the efficiency of cloud computing, I noticed that the rate for my cluster of relatively fat machines with 4GB of RAM was less than the cost to park a car around the corner. The parking meters spin faster.
The not-so-good news is that these collections are far from perfect. None of the tools I tried worked exactly as promised. There were always small glitches. I often found myself reading the log files and paging through endless lists of Java stack dumps. (Someone is going to have to apply Hadoop to analyzing the endless stack dumps. They're getting so involved that I doubt a single machine will be able to parse them for much longer.) After a few seconds, I could usually get things on track again. These tools may not require much experience to use once they're running, but they can't be installed unless you're fairly adept with how the Java stack is organized.
Despite these impediments, I spent most of my time churning through data. The good news is that all of these tools make it pretty easy to get a cluster of computers working together to solve problems. Using these tools is much easier than downloading and configuring the source code yourself. They're designed to be one-button applications, and they come close to achieving that goal.