Bossie Awards 2013: The best open source big data tools
InfoWorld's top picks in the expanding Hadoop ecosystem, the NoSQL universe, and beyond
The best open source big data tools
MapReduce was a response to the limitations of traditional databases. Tools like Giraph, Hama, and Impala are responses to the limitations of MapReduce. These all run on Hadoop, but graph, document, column, and other NoSQL databases might also be part of the mix. Which big data tools will meet your needs? The number of options seems to be expanding faster than ever.
Apache Hadoop
When people say "big data" or "data science," they're usually talking about a Hadoop project. Hadoop generally refers to the MapReduce framework, but the project also includes important tools for data storage and processing. The new YARN framework, aka MapReduce 2.0, is an important step forward for Hadoop, and you can expect a big hype cycle to start shortly (if it hasn't already, I'll start one!).
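To give a sense of the programming model, here is the canonical word count job written against Hadoop's Java MapReduce API, a minimal sketch assuming a Hadoop 2/YARN cluster; the input and output paths are placeholders you would point at your own HDFS directories.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);          // emit (word, 1) for each token
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();   // total the counts for this word
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```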
There aren't many Apache projects that support even one heavily capitalized startup. Hadoop supports several. Analysts estimate the Hadoop market will balloon to tens of billions of dollars per year. If you slipped into a coma during the financial crisis and just woke up, this is the biggest thing you missed.
-- Andrew C. Oliver
Apache Sqoop
When you think of big data processing, you think of Hadoop, but that doesn't mean traditional databases don't play a role. In fact, in most cases you'll still be drawing from data locked in legacy databases. That's where Apache Sqoop comes in.
Sqoop facilitates fast data transfers from relational database systems to Hadoop by leveraging concurrent connections, customizable mapping of data types, and metadata propagation. You can tailor imports (such as new data only) to HDFS, Hive, and HBase; you can export results back to relational databases as well. Sqoop manages all of the complexities inherent in the use of data connectors and mismatched data formats.
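Sqoop is normally driven from the command line, but it can also be invoked from Java. The sketch below assumes Sqoop 1.x's org.apache.sqoop.Sqoop.runTool entry point; the connect string, credentials, table, check column, and target directory are all hypothetical placeholders for your own environment.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.sqoop.Sqoop;

public class OrdersImport {
  public static void main(String[] args) throws Exception {
    // Arguments mirror the sqoop command line; values here are illustrative only.
    String[] importArgs = {
        "import",
        "--connect", "jdbc:mysql://dbhost/sales",
        "--username", "etl_user",
        "--table", "orders",
        "--incremental", "append",        // pull only rows added since the last run
        "--check-column", "order_id",
        "--last-value", "1000000",
        "--target-dir", "/user/etl/orders",
        "--num-mappers", "4"              // four concurrent database connections
    };
    int exitCode = Sqoop.runTool(importArgs, new Configuration());
    System.exit(exitCode);
  }
}
```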
-- James R. Borck
Talend Open Studio for Big Data
Talend Open Studio for Big Data lets you load files into Hadoop (via HDFS, Hive, Sqoop, and so on) without manual coding. Its graphical IDE generates native Hadoop code (supporting YARN/MapReduce 2) that leverages Hadoop's distributed environment for large-scale data transformations.
Talend's visual mapping tools allow you to build flows and test your transforms without ever getting your hands dirty with Pig. Project scheduling and job optimization tools further enhance the toolkit.
Gleaning intelligence from big piles of data starts with getting that data from one place to Hadoop, and often from Hadoop to another place. Talend Open Studio helps you swim through these migrations without getting bogged down in operational complexities.
-- James R. Borck
Apache Giraph
Apache Giraph is a graph processing system built for high scalability and high availability. The open source equivalent of Google's Pregel, Giraph is used by Facebook to analyze the social graph of its users and their connections. Rather than forcing iterative graph algorithms into chains of MapReduce jobs, Giraph implements Pregel's more efficient Bulk Synchronous Parallel (BSP) model. The best part: Giraph computations run as Hadoop jobs on your existing Hadoop infrastructure, so you get distributed graph processing with the same familiar tools.
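To illustrate the vertex-centric model, here is a rough PageRank sketch assuming Giraph's BasicComputation API; the superstep limit and damping factor are illustrative values, not tuned ones.

```java
import java.io.IOException;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

public class SimplePageRank
    extends BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  private static final int MAX_SUPERSTEPS = 30;   // illustrative cutoff

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
                      Iterable<DoubleWritable> messages) throws IOException {
    // Each superstep, fold in the rank contributions received from neighbors.
    if (getSuperstep() >= 1) {
      double sum = 0;
      for (DoubleWritable msg : messages) sum += msg.get();
      vertex.setValue(new DoubleWritable(0.15 / getTotalNumVertices() + 0.85 * sum));
    }
    if (getSuperstep() < MAX_SUPERSTEPS) {
      // Spread this vertex's rank evenly across its outgoing edges.
      sendMessageToAllEdges(vertex,
          new DoubleWritable(vertex.getValue().get() / vertex.getNumEdges()));
    } else {
      vertex.voteToHalt();   // computation ends when all vertices have halted
    }
  }
}
```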
-- Indika Kotakadeniya
Apache Hama
Like Giraph, Apache Hama brings Bulk Synchronous Parallel processing to the Hadoop ecosystem and runs on top of the Hadoop Distributed File System. However, whereas Giraph focuses exclusively on graph processing, Hama is a more generalized framework for performing massive matrix and graph computations. It combines the advantages of Hadoop compatibility with a more flexible programming model for tackling data-intensive scientific applications.
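A Hama job is expressed as a single bsp() method in which each peer computes locally, exchanges messages, and synchronizes at a barrier. The sketch below, loosely modeled on Hama's Monte Carlo pi example, assumes the BSP/BSPPeer API of the 0.6-era releases; the sample count and output key are illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

public class PiEstimator
    extends BSP<NullWritable, NullWritable, Text, DoubleWritable, DoubleWritable> {

  @Override
  public void bsp(BSPPeer<NullWritable, NullWritable, Text, DoubleWritable, DoubleWritable> peer)
      throws IOException, SyncException, InterruptedException {
    // Local computation: sample random points in the unit square.
    int inside = 0, samples = 100000;
    for (int i = 0; i < samples; i++) {
      double x = 2.0 * Math.random() - 1.0, y = 2.0 * Math.random() - 1.0;
      if (x * x + y * y <= 1.0) inside++;
    }
    double estimate = 4.0 * inside / samples;

    // Communication: send the local estimate to a designated master peer.
    String master = peer.getPeerName(0);
    peer.send(master, new DoubleWritable(estimate));
    peer.sync();   // barrier: ends the superstep

    // The master averages the estimates from all peers.
    if (peer.getPeerName().equals(master)) {
      double sum = 0; int n = 0;
      DoubleWritable msg;
      while ((msg = peer.getCurrentMessage()) != null) { sum += msg.get(); n++; }
      peer.write(new Text("pi-estimate"), new DoubleWritable(sum / n));
    }
  }
}
```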
-- Indika Kotakadeniya
Cloudera Impala
What MapReduce does for batch processing, Cloudera Impala does for real-time SQL queries. The Impala engine sits on all the data nodes in your Hadoop cluster, listening for queries. After parsing each query and optimizing an execution plan, it coordinates parallel processing among the worker nodes in the cluster. The result is low-latency SQL queries across Hadoop with near-real-time insight into big data.
Because Impala uses your native Hadoop infrastructure (HDFS, HBase, Hive metadata), you get a unified platform where you can analyze all of your data without connector complexities, ETL, or expensive data warehousing. And because Impala can be tapped from any ODBC/JDBC source, it makes a great companion for BI packages like Pentaho.
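Because Impala speaks the HiveServer2 protocol, any JDBC client can query it. A minimal sketch, assuming the Hive JDBC driver and Impala's default HiveServer2 port of 21050 with no authentication; the host name and weblogs table are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");   // HiveServer2 JDBC driver
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://impalad-host:21050/;auth=noSasl");   // hypothetical host
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT page, COUNT(*) AS hits FROM weblogs "
             + "GROUP BY page ORDER BY hits DESC LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}
```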
-- James R. Borck
Serengeti
VMware's project aimed at bringing virtualization to big data processing, Serengeti lets you spin up Hadoop clusters dynamically on shared server infrastructure. The project leverages the Apache Hadoop Virtualization Extensions -- created and contributed by VMware -- that make Hadoop virtualization-ready.
With Serengeti, you can deploy your Hadoop cluster environments in minutes without sacrificing control over configuration options like node placement, high availability, and job scheduling. Further, by deploying Hadoop in multiple VMs on each host, Serengeti allows data and compute functions to be separated, improving computational scaling while maintaining local data storage.
-- James R. Borck
Apache Drill
Inspired by Google's Dremel system, Apache Drill is designed for low-latency interactive analysis of very large data sets. Drill supports multiple sources of data, including HBase, Cassandra, and MongoDB as well as traditional relational databases. Hadoop gives you massive data throughput, but a single exploratory job might take minutes or hours to complete. Drill returns results fast enough to work interactively, so ideas can be rapidly explored and fruitful theories developed further.
-- Steven Nuñez
Gephi
Graph theory has applications across the board. A suspected case of insider trading can be investigated by a link analysis of the traders and employees involved. A complex IT environment can be visualized to uncover the most important connection points in the system. Developed by a consortium of academics, corporations, and individuals, Gephi is a visualization and exploration tool that supports multiple graph types and networks as large as 1 million nodes. The wiki, forums, and tutorials are extensive, and the active Gephi community has produced a large set of plug-ins, so it's likely you won't have to reinvent the wheel for common applications.
-- Steven Nuñez
Neo4j
An agile and blazing-fast graph database, Neo4j can be put to a variety of uses, including social applications, recommendation engines, fraud detection, resource authorization, and data center network management. Neo4j has continued its steady progress with both performance improvements (streaming of query results) and better clustering and high-availability support.
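The sketch below shows the flavor of Neo4j's embedded Java API, assuming a 2.x-era release; the store path, property keys, and relationship type are hypothetical placeholders for a fraud-detection style graph.

```java
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class FraudGraph {
  public static void main(String[] args) {
    // Opens (or creates) an embedded graph store at a hypothetical path.
    GraphDatabaseService db = new GraphDatabaseFactory()
        .newEmbeddedDatabase("/tmp/fraud-graph");
    try (Transaction tx = db.beginTx()) {
      Node account = db.createNode();
      account.setProperty("holder", "Alice");          // hypothetical property
      Node address = db.createNode();
      address.setProperty("street", "1 Main St");      // hypothetical property
      // Link the account holder to an address; shared addresses are a classic fraud signal.
      account.createRelationshipTo(address, DynamicRelationshipType.withName("LIVES_AT"));
      tx.success();   // mark the transaction for commit
    }
    db.shutdown();
  }
}
```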
-- Michael Scarlett
MongoDB
Perhaps the most popular NoSQL database of them all, MongoDB stores data as BSON documents, a binary form of JSON. This allows schemas to vary from document to document, giving developers unbridled freedom compared to traditional relational databases, which impose flat, rigid schemas across numerous tables. And yet MongoDB still provides the functionality developers expect in a relational database.
This was a big year for MongoDB with two new releases and scores of new features, including text search and geospatial capabilities, as well as such performance improvements as concurrent index builds and a faster JavaScript engine (V8).
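As a small illustration of that schema flexibility, here is a sketch using the 2.x-era Java driver; the database, collection, and field names are hypothetical.

```java
import java.util.Arrays;
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public class FlexibleSchemaDemo {
  public static void main(String[] args) throws Exception {
    MongoClient mongo = new MongoClient("localhost", 27017);
    DBCollection users = mongo.getDB("demo").getCollection("users");   // hypothetical names

    // Two documents in the same collection with different shapes -- no schema migration needed.
    users.insert(new BasicDBObject("name", "Ada")
        .append("languages", Arrays.asList("en", "fr")));
    users.insert(new BasicDBObject("name", "Lin")
        .append("location", new BasicDBObject("type", "Point")
            .append("coordinates", Arrays.asList(-73.97, 40.77))));

    // Query by a field, just as you would expect from a database with ad hoc queries.
    DBObject found = users.findOne(new BasicDBObject("name", "Ada"));
    System.out.println(found);
    mongo.close();
  }
}
```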
-- Michael Scarlett
Couchbase Server
Like other NoSQL databases, and unlike most relational databases, Couchbase Server does not require you to create a schema before data is inserted. One unique attribute of Couchbase Server is its memcached compatibility, which allows developers to transition seamlessly from a memcached environment while gaining data replication, durability, and zero application downtime. The 2.0 release added document database capability. The 2.1 release built on this with cross-data center replication and improved storage performance.
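The memcached-style get/set idiom looks like this in the 1.x-era Couchbase Java client; the cluster address, bucket, key, and document body are hypothetical placeholders.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import com.couchbase.client.CouchbaseClient;

public class SessionCache {
  public static void main(String[] args) throws Exception {
    // Connect to a hypothetical single-node cluster and its default bucket.
    List<URI> nodes = Arrays.asList(URI.create("http://localhost:8091/pools"));
    CouchbaseClient client = new CouchbaseClient(nodes, "default", "");

    // Same set/get calls as memcached, but the write is replicated and persisted by the cluster.
    client.set("session::42", 3600, "{\"user\":\"ada\",\"cart\":[]}").get();   // block until acknowledged
    Object doc = client.get("session::42");
    System.out.println(doc);

    client.shutdown();
  }
}
```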
-- Michael Scarlett
Paradigm4 SciDB
SciDB is a distributed database system that leverages parallel processing to perform real-time analytics on streaming data. Built from the ground up to support massive scientific data sets, it eschews the rows and columns of relational databases for native array constructs that are better suited to ordered data sets such as time series and location data. Neither relational nor MapReduce, SciDB offers a unified solution that scales across large clusters without requiring Hadoop's multilayered infrastructure and data massaging obligations.
-- James R. Borck
Read about more open source winners
InfoWorld's seventh annual Best of Open Source Software Awards feature more than 120 open source projects in seven categories ranging from data center and cloud to desktop and mobile.
Bossie Awards 2013: The best open source applications
Bossie Awards 2013: The best open source application development tools
Bossie Awards 2013: The best open source data center and cloud software
Bossie Awards 2013: The best open source big data tools
Bossie Awards 2013: The best open source desktop and mobile software
Bossie Awards 2013: The best open source networking and security software
Copyright © 2013 IDG Communications, Inc.