What is a data lake? Massively scalable storage for big data analytics

Dive into data lakes: what they are, how they're used, and how they differ from and complement data warehouses.


In 2011, James Dixon, then CTO of the business intelligence company Pentaho, coined the term data lake. He described the data lake in contrast to the information silos typical of data marts, which were popular at the time: 

If you think of a data mart as a store of bottled water—cleansed and packaged and structured for easy consumption—the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.

Data lakes have evolved since then, and now compete with data warehouses for a share of big data storage and analytics. Various tools and products support faster SQL querying in data lakes, and all three major cloud providers offer data lake storage and analytics. There's even the new data lakehouse concept, which combines governance, security, and analytics with affordable storage. This article is a high dive into data lakes, including what they are, how they're used, and how to ensure your data lake does not become a data swamp.

The data lake explained

A data lake is essentially a single data repository that holds all your data until it is ready for analysis, or possibly only the data that doesn't fit into your data warehouse. Typically, a data lake stores data in its native file format, but the data may be transformed to another format to make analysis more efficient. The goal of having a data lake is to extract business or other analytic value from the data.

Data lakes can host binary data, such as images and video, unstructured data, such as PDF documents, and semi-structured data, such as CSV and JSON files, as well as structured data, typically from relational databases. Structured data is more useful for analysis, but semi-structured data can easily be imported into a structured form. Unstructured data can often be converted to structured data using intelligent automation.

Data lake vs data warehouse

The question isn't whether you need a data lake or a data warehouse; you most likely need both, but for different purposes. It is also possible to combine them, as we'll discuss soon. To start, let's look at the major differences between data lakes and data warehouses:

  • Data sources: Typical sources of data for data lakes include log files, clickstream data, social media posts, and data from internet-connected devices. Data warehouses typically store data extracted from transactional databases, line-of-business applications, and operational databases for analysis.
  • Schema strategy: The database schema for a data lake is usually applied at analysis time, which is called schema-on-read. The database schema for an enterprise data warehouse is usually designed prior to the creation of the data store and applied to the data as it is imported. This is called schema-on-write. (The sketch after this list illustrates schema-on-read.)
  • Storage infrastructure: Data warehouses often have significant amounts of expensive RAM and SSD disks in order to provide query results quickly. Data lakes often use cheap spinning disks on clusters of commodity computers. Both data warehouses and data lakes use massively parallel processing (MPP) to speed up SQL queries.
  • Raw vs curated data: The data in a data warehouse is supposed to be curated to the point where the data warehouse can be treated as the "single source of truth" for an organization. Data in a data lake may or may not be curated: data lakes typically start with raw data, which can later be filtered and transformed for analysis.
  • Who uses it: Data warehouse users are usually business analysts. Data lake users are more often data scientists or data engineers, at least initially. Business analysts get access to the data once it has been curated.
  • Type of analytics: Typical analysis for data warehouses includes business intelligence, batch reporting, and visualizations. For data lakes, typical analysis includes machine learning, predictive analytics, data discovery, and data profiling.
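
To make the schema-on-read idea concrete, here is a minimal PySpark sketch. The raw JSON files land in the lake untouched, and structure is imposed only when the data is read for analysis; the bucket path, column names, and types are all hypothetical.

```python
# Minimal schema-on-read sketch with PySpark; the path and schema
# below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# The schema is declared at read time, not when the data was written.
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

# Raw JSON files sit in the lake as-is; structure is imposed here.
events = spark.read.schema(event_schema).json("s3a://example-lake/raw/events/")
events.groupBy("event_type").count().show()
```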

What about data marts?

Data marts are analysis databases that are limited to data from a single department or business unit, as opposed to data warehouses, which combine all of a company's relational data in a form suitable for analysis. Data marts offer efficient analysis by containing only data relevant to the department; as such, they are inherently siloed. Some claim the siloing doesn't matter because the business unit doesn't need the excluded data. In real life, it often does matter—there's always a higher-up who needs reports based on combined data from multiple business units. That's one reason why we currently see many data lakes and data warehouses, and few data marts.

Faster SQL queries on a data lake

When you store raw data in a data lake, the data might be useless for business analysts until it has been processed by a data engineer or data scientist. In addition to filtering and data transformations, data lakes need data catalogs, data security, and schema definitions. The unfortunate shorthand term for a data lake without these features is data swamp.

Fortunately, there are plenty of tools to help filter and organize the data in your data lake. For example, you could address the need for a schema by defining tables in a Hive metastore over data stored in ORC format. Once it's set up, the metastore supports fast SQL queries via a massively parallel SQL engine like Presto. (The Optimized Row Columnar format is a compressed columnar store that is optimized for Hive and works well with Presto.)
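
As a rough sketch of that setup (with invented table and path names), the following PySpark code registers an ORC-backed table in a Hive metastore; Presto or Trino pointed at the same metastore could then query the table with ordinary SQL.

```python
# Sketch assuming a Spark build with Hive support and an existing
# metastore; the path and table name are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("orc-metastore-demo")
    .enableHiveSupport()  # store table schemas in the Hive metastore
    .getOrCreate()
)

# Read raw CSV from the lake and persist it as an ORC-backed table.
raw = spark.read.csv("s3a://example-lake/raw/events/", header=True, inferSchema=True)
raw.write.format("orc").mode("overwrite").saveAsTable("events_orc")

# Any engine that shares the metastore (Spark, Hive, Presto/Trino)
# now sees the same schema and can query it with SQL.
spark.sql("SELECT event_type, count(*) AS n FROM events_orc GROUP BY event_type").show()
```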

Apache Spark is another massively parallel SQL engine. While it can work with the ORC format, it works even better with Parquet, another compressed columnar store. Spark can exploit both horizontal partitioning (splitting the data into separate files by key) and Parquet's columnar layout (in effect, vertical partitioning), producing a query plan that reads only the necessary rows and columns and skips irrelevant data.
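
Here is an illustrative sketch of that behavior, again with hypothetical paths: the data is written as Parquet partitioned by date, and a later query lets Spark prune both partitions (rows) and columns.

```python
# Hypothetical sketch of partition and column pruning with Parquet.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-pruning-demo").getOrCreate()

events = spark.read.json("s3a://example-lake/raw/events/")  # assumed path

# Horizontal partitioning: one directory per event_date value.
events.write.mode("overwrite").partitionBy("event_date") \
      .parquet("s3a://example-lake/curated/events/")

# Spark reads only the 2023-01-01 partition (row pruning) and only
# the user_id column (column pruning), skipping everything else.
daily_users = (
    spark.read.parquet("s3a://example-lake/curated/events/")
    .where("event_date = '2023-01-01'")
    .select("user_id")
    .distinct()
)
daily_users.explain()  # the plan shows PartitionFilters and a pruned schema
```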

The data lakehouse

Databricks, the company behind Spark and MLflow, offers what they call a data lakehouse. According to Databricks, the lakehouse combines the best features of data warehouses and data lakes:

A data lakehouse unifies the best of data warehouses and data lakes in one simple platform to handle all your data, analytics, and AI use cases. It’s built on an open and reliable data foundation that efficiently handles all data types and applies one common security and governance approach across all of your data and cloud platforms.

Delta Lake, which Databricks released to open source, forms the foundation of the lakehouse by providing reliability and high performance directly on data in the data lake. Databricks Lakehouse Platform also includes the Unity Catalog, which provides fine-grained governance for data and AI. Databricks claims that its data lakehouse offers 12 times the price/performance ratio of a data warehouse.
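
As a small illustration (not Databricks' own setup), the open source delta-spark package lets you convert Parquet data in a lake to Delta Lake simply by writing with the delta format, which adds a transaction log with ACID guarantees and features such as time travel; the paths below are invented.

```python
# Sketch assuming the delta-spark package is on the classpath;
# paths are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.read.parquet("s3a://example-lake/curated/events/")

# Writing as Delta adds a transaction log over ordinary Parquet files,
# providing ACID transactions on data sitting in the lake.
df.write.format("delta").mode("overwrite").save("s3a://example-lake/delta/events/")

# Time travel: read the table as it was at an earlier version.
v0 = (spark.read.format("delta")
      .option("versionAsOf", 0)
      .load("s3a://example-lake/delta/events/"))
```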

Data lakes on-prem and in public clouds

Historically, data lakes were implemented on-premises using Apache Hadoop clusters of commodity computers and HDFS (Hadoop Distributed File System). Hadoop clusters were once big business for Cloudera, Hortonworks, and other vendors. Cloudera and Hortonworks merged in 2018, which tells you something about the direction of the market.

What changed was the cloud, specifically the hyperscale public cloud vendors Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). All three offer data lake storage products: Amazon Simple Storage Service (Amazon S3), Azure Data Lake Store (ADLS), and Google Cloud Storage (GCS). Amazon EMR (formerly Amazon Elastic MapReduce) adds managed big data processing on top of S3, and all three providers also offer services for data ingestion, data processing, analytics, and machine learning. It's much easier and faster to create, manage, and scale cloud data lakes than it is to manage Hadoop clusters in your data center; the tradeoff is that long-term operational expenditures in the cloud will eventually become significant.
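
For a sense of how little ceremony a cloud data lake requires compared with standing up a Hadoop cluster, here is a minimal boto3 sketch that creates an S3 bucket and lands a raw file in it; the bucket and file names are invented.

```python
# Hypothetical sketch: the storage layer of an S3-based data lake is
# just a bucket. Bucket and file names are assumptions.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")
s3.create_bucket(Bucket="example-data-lake")

# Land raw data in its native format; no schema or cluster needed yet.
s3.upload_file("clickstream-2023-01-01.json",
               "example-data-lake",
               "raw/clickstream/2023-01-01.json")
```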

Analytics and machine learning for data lakes

Earlier, I discussed using Presto and Apache Spark for faster SQL queries on data lakes. SQL is only one way to analyze data, although it's quite important and often the first step. In addition, consider business intelligence tools such as Power BI, Tableau, or Qlik; notebooks such as Jupyter, Zeppelin, or Spark notebooks; machine learning frameworks such as scikit-learn, SparkML, or KNIME; and deep learning frameworks such as TensorFlow or PyTorch.

The hyperscale cloud vendors have analytics and machine learning tools of their own that connect to their data lakes.

Amazon Athena uses Presto and Hive to perform SQL queries on data in Amazon S3. Amazon EMR is a cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning applications using open source analytics frameworks such as Apache Spark, Apache Hive, and Presto. Amazon SageMaker is a fully managed service to build, train, and deploy machine learning models.
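
As a hedged example of querying lake data in place, this boto3 sketch submits a SQL query to Athena against a table assumed to be already defined over files in S3; the database, table, and bucket names are made up.

```python
# Hypothetical Athena query via boto3; names are assumptions.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT event_type, count(*) AS n FROM events GROUP BY event_type",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://example-lake/athena-results/"},
)
# Poll get_query_execution with this ID until the query completes.
print(response["QueryExecutionId"])
```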

Azure Data Lake Analytics (ADLA) is an older on-demand (serverless) analytics job service that uses U-SQL, a combination of SQL and C#. ADLA is being replaced by Azure Synapse Analytics, which brings together data integration, enterprise data warehousing, and big data analytics, and lets you query data at scale using either serverless or dedicated resources. Synapse combines data lake, enterprise data warehouse, and in-place operational data query functionality, and can automatically migrate data and code from ADLA as well as from data warehouses. Synapse has deep integration with Azure Machine Learning, Azure Cognitive Services, and Power BI.

Google Cloud Storage provides native integration with a number of powerful Google Cloud services, such as BigQuery (a data warehouse), Dataproc (Hadoop ecosystem), Dataflow (serverless streaming analytics), Video Intelligence API, Cloud Vision API, and AI Platform.

Altogether, you have considerable flexibility to choose the right tool to analyze your data.

Conclusion

Data lakes have become much more useful since the days of Hadoop clusters and MapReduce. Presto and Apache Spark offer much faster SQL processing than MapReduce, thanks to in-memory, massively parallel execution and Hive-based schemas. Cloud-based data lakes are much easier and faster to create, manage, and scale than on-prem clusters of commodity computers. And cloud data lakes integrate tightly with a wide range of analytics and artificial intelligence tools.