How to choose a cloud data warehouse

Modern data warehouses can query structured data and semi-structured data simultaneously, and even combine historical data and streaming live data for analysis.

Enterprise data warehouses, or EDWs, are unified databases for all historical data across an enterprise, optimized for analytics. These days, organizations implementing data warehouses often consider creating the data warehouse in the cloud rather than on premises. Many also consider using data lakes that support queries instead of traditional data warehouses. A third question is whether you want to combine historical data with streaming live data.

A data warehouse is an analytic, usually relational, database created from two or more data sources, typically to store historical data, which may reach petabyte scale. Data warehouses often have significant compute and memory resources for running complicated queries and generating reports, and are often the data sources for business intelligence (BI) systems and machine learning.

The write throughput requirements of transactional operational databases limit the number and kind of indexes you can create (more indexes mean more writes and updates per record added, and more possible contention). This in turn slows down analytic queries against the operational database. Once you have exported your data into a data warehouse, you can index everything you care about in the data warehouse for good analytic query performance, without affecting the write performance of the separate OLTP (online transaction processing) database.
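
To make the trade-off concrete, here is a minimal sketch using SQLite as a stand-in for both databases; the table and column names are hypothetical. The OLTP side indexes only what transactions need, while the warehouse copy is loaded in bulk rather than written continuously, so it can be indexed freely.

```python
import sqlite3

oltp = sqlite3.connect(":memory:")
oltp.execute("""CREATE TABLE orders
                (id INTEGER PRIMARY KEY, customer_id INT,
                 region TEXT, amount REAL, placed_at TEXT)""")
# OLTP side: index only what transactions need, to keep writes fast.
oltp.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

warehouse = sqlite3.connect(":memory:")
warehouse.execute("""CREATE TABLE orders_fact
                     (id INT, customer_id INT, region TEXT,
                      amount REAL, placed_at TEXT)""")
# Warehouse side: data arrives in bulk loads, so index freely for analytics.
for col in ("customer_id", "region", "placed_at"):
    warehouse.execute(f"CREATE INDEX idx_fact_{col} ON orders_fact ({col})")
```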

Data marts contain data oriented toward a specific business line. Data marts may be dependent on the data warehouse, independent of the data warehouse (i.e., drawn from an operational database or external source), or a hybrid of the two.

Data lakes, which store files of data in its native format, are essentially “schema on read,” meaning that any application that reads data from the lake will need to impose its own types and relationships on the data. Traditional data warehouses, on the other hand, are “schema on write,” meaning that data types, indexes, and relationships are imposed on the data as it is stored in the data warehouse.
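
A small illustration of the difference, with made-up field names: with schema on read, the reader imposes types as it loads raw records from the lake; with schema on write, the warehouse table enforces them at insert time (SQLite stands in for the warehouse here).

```python
import json
import sqlite3

# Schema on read: the lake stores the raw record as-is, and the reader
# imposes types when it loads the data.
raw = '{"user_id": "42", "amount": "19.99", "ts": "2021-06-01T12:00:00"}'
rec = json.loads(raw)
typed = (int(rec["user_id"]), float(rec["amount"]), rec["ts"])

# Schema on write: the warehouse table enforces types as rows are stored.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user_id INTEGER, amount REAL, ts TEXT)")
db.execute("INSERT INTO events VALUES (?, ?, ?)", typed)
```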

Modern data warehouses can often store and query structured and semi-structured data together, and many can also combine historical data with recently streamed data in the same query.

Cloud data warehouses vs. on-prem data warehouses

A data warehouse can be implemented on-premises, in the cloud, or as a hybrid. Historically, data warehouses were always on-prem, but the capital cost and lack of scalability of on-prem servers in data centers were sometimes issues. On-prem EDW installations grew when vendors started offering data warehouse appliances. Now, however, the trend is to move all or part of your data warehouse to the cloud to take advantage of the inherent scalability of cloud data warehouses, and the ease of connecting to other cloud services.

The downside of putting petabytes of data in the cloud is the operational cost, both for cloud data storage and for cloud data warehouse compute and memory resources. You might think that the time to upload petabytes of data to the cloud would be a huge barrier, but the hyperscale cloud vendors now offer high-capacity, disk-based data transfer services.

Speed and scalability requirements

Data warehouses are designed so that analytical queries can run fast. For old on-prem data warehouses, reports with multiple queries based on historical data were typically run overnight. For modern cloud data warehouses, the performance requirements are stiffer, as analysts expect to run queries based on historical plus streaming data interactively, and then dig deeper with more queries.

Cloud data warehouses are usually designed to scale CPU capacity as needed, so that interactive queries against petabytes of data can return answers in minutes. Some cloud data warehouses can increase the CPU resources while a query is running without restarting the query, and reduce them again when the data warehouse is idle. Aggressive up-scaling and down-scaling can be a good strategy to get high performance when needed for low overall cost.
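
What this looks like in practice varies by vendor. As one hedged illustration, Snowflake exposes scaling through plain SQL; the sketch below, using the snowflake-connector-python package with placeholder account, credentials, and warehouse name, scales a virtual warehouse up before a heavy interactive session and lets it suspend itself when idle.

```python
import snowflake.connector

# Account, credentials, and warehouse name are placeholders.
conn = snowflake.connector.connect(
    account="myorg-myaccount", user="analyst", password="...",
)
cur = conn.cursor()
# Scale up before a heavy interactive session...
cur.execute("ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'XLARGE'")
# ...and let the warehouse suspend itself after five idle minutes.
cur.execute("ALTER WAREHOUSE analytics_wh SET AUTO_SUSPEND = 300")
```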

Columnar versus row storage

Row-oriented databases organize data by record, and typically attempt to store one database row in one block of storage, so that the whole row can be retrieved with a single read operation. Row-oriented databases are efficient for both reading and writing rows. Most transactional databases are row-oriented, and use B-tree indexes.

Column-oriented databases organize data by field, and attempt to store all the data associated with a field together. Columnar databases are efficient for reading and computing on columns. Most data warehouses store data in columns, compress their data heavily, and use LSM-tree indexes. The original paper describing C-Store, a read-optimized column-oriented database, was published in 2005. The C-Store paper laid the groundwork for most modern columnar data warehouses, including Amazon Redshift, Google BigQuery, and Snowflake.

Some databases combine row and columnar storage. They use row storage for OLTP, and columnar storage for analytic queries. A few databases can query data in columnar storage and row storage together, which speeds up queries where not all fields can fit into columnar storage.
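
The following toy Python sketch shows why analytic scans favor the columnar layout: summing one field in a row layout touches every record object, while the columnar layout reads a single contiguous array. (Real engines add compression and vectorized execution on top of this.)

```python
import random

# Toy comparison: summing one field touches every record object in the
# row layout, but only one contiguous array in the columnar layout.
rows = [{"id": i, "region": "emea", "amount": random.random()}
        for i in range(100_000)]
row_total = sum(r["amount"] for r in rows)   # row-oriented scan

columns = {"id": [r["id"] for r in rows],
           "amount": [r["amount"] for r in rows]}
col_total = sum(columns["amount"])           # columnar scan reads one array
assert abs(row_total - col_total) < 1e-6
```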

In-memory storage and layered storage

What’s faster than a compressed columnar store on disk? A compressed columnar store in memory. What can handle more data than a columnar store in memory? A layered storage system that backs memory with PMEM, such as Intel Optane, which is faster than flash and cheaper than DRAM. Additional layers would be flash and spinning disks. The hard part of a scheme like this is implementing the multi-level caching without slowing down retrievals or allowing unnecessary cache flushing in the faster layers.
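
As a rough sketch of the caching problem (not any vendor's implementation), here is a toy two-tier cache in which evictions from the fast tier demote entries to the slower tier instead of discarding them, and hits in the slow tier promote entries back up:

```python
from collections import OrderedDict

# Toy two-tier cache: a small, fast tier (stand-in for DRAM) backed by a
# larger, slower tier (stand-in for PMEM or flash).
class TieredCache:
    def __init__(self, fast_capacity, slow_capacity):
        self.fast = OrderedDict()
        self.slow = OrderedDict()
        self.fast_capacity = fast_capacity
        self.slow_capacity = slow_capacity

    def get(self, key):
        if key in self.fast:
            self.fast.move_to_end(key)      # LRU bookkeeping
            return self.fast[key]
        if key in self.slow:
            value = self.slow.pop(key)
            self.put(key, value)            # promote on access
            return value
        return None                         # miss: caller reads from disk

    def put(self, key, value):
        self.fast[key] = value
        self.fast.move_to_end(key)
        if len(self.fast) > self.fast_capacity:
            old_key, old_value = self.fast.popitem(last=False)
            self.slow[old_key] = old_value  # demote, don't discard
            if len(self.slow) > self.slow_capacity:
                self.slow.popitem(last=False)
```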

ETL versus ELT

ETL (extract, transform, and load) tools pull the data, perform any desired mappings and transformations, and load the data into the data storage layer. ELT (extract, load, and transform) tools store the data first and transform it later. When you use ELT tools, it is common to also use a data lake.
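
A minimal sketch of the contrast, using a made-up event record and SQLite (whose JSON1 functions stand in for a warehouse SQL engine):

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE raw_events (payload TEXT)")          # ELT landing zone
db.execute("CREATE TABLE events (user_id INT, amount REAL)")  # shaped table

record = '{"user_id": 42, "amount": 19.99}'

# ETL: transform in the pipeline, then load the already-shaped row.
rec = json.loads(record)
db.execute("INSERT INTO events VALUES (?, ?)", (rec["user_id"], rec["amount"]))

# ELT: load the raw record first, transform later inside the store.
db.execute("INSERT INTO raw_events VALUES (?)", (record,))
db.execute("""INSERT INTO events
              SELECT json_extract(payload, '$.user_id'),
                     json_extract(payload, '$.amount')
              FROM raw_events""")
```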

Clustered and distributed cloud data warehouses

Since data warehouses are read-mostly databases, it is easier to cluster them than to cluster OLTP databases. It is also easier to distribute data warehouses geographically without incurring high write latency. Once your data warehouse has a clustered architecture, it is easy to add nodes to the cluster to increase processing capacity and return results faster.

Cloud UI for admin and queries

Just about every cloud data warehouse has its own user interface for administration and queries. Some are more usable than others. Administration is simpler than query building. Adding a node (or setting a maximum number of nodes for autoscaling) can be as easy as pressing one button. Some cloud data warehouses offer a graphical query builder, which is useful for SQL novices. Many cloud data warehouses offer a history pane for past queries and their answers.

Key cloud data warehouses

The 13 products listed below alphabetically either are cloud data warehouses, or provide the functionality of data warehouses while building on a different base architecture, such as data lakes. You could argue that Ahana, Delta Lake, and Qubole are built on data lakes rather than starting as data warehouses, but you could also argue that they provide much the same functionality as unquestioned data warehouses such as Amazon Redshift, Azure Synapse, and Google BigQuery. As all these products add heterogeneous federated query engines, the functional distinction between data lakes and data warehouses tends to blur.

Ahana Cloud for Presto

Ahana Cloud for Presto turns a data lake on Amazon S3 into what is effectively a data warehouse, without moving any data. SQL queries run quickly even when joining multiple heterogeneous data sources.

Presto is an open source, distributed SQL query engine for running interactive analytic queries against data sources of all sizes. Presto allows querying data where it lives, including Hive, Cassandra, relational databases, and proprietary data stores. A single Presto query can combine data from multiple sources. Facebook uses Presto for interactive queries against several internal data stores, including its 300 PB data warehouse.
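
As a hedged illustration of such a federated query, the sketch below uses the presto-python-client package (prestodb); the host, catalogs, schemas, and table names are made up, and the single query joins a Hive table with a MySQL table in place:

```python
import prestodb  # presto-python-client

# Host, catalogs, schemas, and table names are made up.
conn = prestodb.dbapi.connect(host="presto.example.com", port=8080,
                              user="analyst", catalog="hive", schema="web")
cur = conn.cursor()
# One query joins a Hive table with a MySQL table, in place.
cur.execute("""
    SELECT c.name, count(*) AS page_views
    FROM hive.web.logs AS l
    JOIN mysql.crm.customers AS c ON l.customer_id = c.id
    GROUP BY c.name
    ORDER BY page_views DESC
    LIMIT 10
""")
print(cur.fetchall())
```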

Ahana Cloud for Presto runs on Amazon, has a fairly simple user interface, and offers end-to-end cluster lifecycle management. It runs in Kubernetes and is highly scalable. It has a built-in catalog and easy integration with data sources, catalogs, and dashboarding tools. The default Ahana query interface is Apache Superset; you can also use Jupyter or Zeppelin notebooks, especially if you are doing machine learning.

Ahana claims to have 3X the performance of other Presto services, including Amazon Elastic MapReduce and Amazon Athena.

Amazon Redshift

Using Amazon Redshift, you can query and combine exabytes of structured and semi-structured data across your data warehouse, operational database, and data lake using standard SQL. Redshift lets you easily save the results of your queries back to your S3 data lake in open formats, such as Apache Parquet, so that you can do additional analytics with other services such as Amazon EMR, Amazon Athena, and Amazon SageMaker.
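
For example, here is a hedged sketch of saving query results back to S3 as Parquet with Redshift's UNLOAD command. The connection details, table, bucket, and IAM role ARN are placeholders; psycopg2 works because Redshift speaks the PostgreSQL wire protocol.

```python
import psycopg2

conn = psycopg2.connect(host="example.redshift.amazonaws.com", port=5439,
                        dbname="analytics", user="analyst", password="...")
conn.autocommit = True
# Run the query inside Redshift and write the results to S3 as Parquet.
conn.cursor().execute("""
    UNLOAD ('SELECT region, sum(amount) FROM sales GROUP BY region')
    TO 's3://example-bucket/exports/sales_by_region_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload'
    FORMAT AS PARQUET
""")
```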

Azure Synapse Analytics

Azure Synapse Analytics is an analytics service that brings together data integration, data warehousing, and big data analytics. It allows you to ingest, explore, prepare, manage, and serve data for immediate BI and machine learning needs, and query data using either serverless or dedicated resources at scale. Azure Synapse can run queries using Spark or SQL engines. It has deep integration with Azure Machine Learning, Azure Cognitive Services, and Power BI.

Databricks Delta Lake

Developed by Databricks, Delta Lake is an open source project that enables building a “lakehouse” architecture on top of existing storage systems such as Amazon S3, Microsoft Azure Data Lake Storage, Google Cloud Storage, and HDFS. It adds ACID transactions, metadata handling, data versioning, schema enforcement, and schema evolution to data lakes. Databricks Lakehouse Platform uses Delta Lake, Spark, and MLflow in a cloud service that runs on AWS, Microsoft Azure, and Google Cloud to combine the data management and performance typically found in data warehouses with the low-cost, flexible object stores offered by data lakes.
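
A brief sketch of Delta Lake's ACID writes and data versioning with PySpark, assuming the delta-spark package is installed; the local storage path is a placeholder:

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Local Spark session wired up for Delta Lake.
builder = (SparkSession.builder.appName("delta-demo")
           .config("spark.sql.extensions",
                   "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

df = spark.createDataFrame([(1, "emea"), (2, "apac")], ["id", "region"])
df.write.format("delta").mode("overwrite").save("/tmp/events")  # ACID write

# Data versioning: read the table as of an earlier version (time travel).
old = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events")
```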

Google BigQuery

Google BigQuery is a serverless, petabyte-scale, cloud data warehouse with an internal BI engine, internal machine learning accessible via SQL extensions, and integrations across all Google Cloud services including Vertex AI and TensorFlow. BigQuery Omni extends BigQuery to analyze data across clouds, using Anthos. Data QnA provides a natural language front end to BigQuery. Connected Sheets allow users to analyze billions of rows of live BigQuery data in Google Sheets. BigQuery can run federated queries over external data sources, including Parquet and ORC (Optimized Row Columnar) files in object storage (Google Cloud Storage), transactional databases (Google Cloud Bigtable, Google Cloud SQL), and spreadsheets in Google Drive.
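
A minimal sketch with the google-cloud-bigquery client library; the query runs against a real public dataset, but treat project configuration and billing setup as assumptions:

```python
from google.cloud import bigquery

client = bigquery.Client()  # picks up credentials from the environment
sql = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(sql).result():  # serverless: no cluster to size
    print(row.name, row.total)
```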

Oracle Autonomous Data Warehouse

Oracle Autonomous Data Warehouse is a cloud data warehouse service that automates provisioning, configuring, securing, tuning, scaling, and backing up of the data warehouse. It includes tools for self-service data loading, data transformations, business models, automatic insights, and built-in converged database capabilities that enable simpler queries across multiple data types and machine learning analysis. It’s available in both the Oracle public cloud and customers’ data centers with Oracle Cloud@Customer.

Qubole

Qubole is a simple, open, and secure data lake platform for machine learning, streaming, and ad hoc analytics. It is available on the AWS, Azure, Google, and Oracle clouds. Qubole helps you to ingest datasets from a data lake, build schemas with Hive, query the data with Hive, Presto, Quantum, or Spark, and then proceed to data engineering and data science. You can work with Qubole data in Zeppelin or Jupyter notebooks and Airflow workflows.

Rockset

Rockset is an operational analytics database. It occupies a niche between transactional databases and data warehouses. Rockset can analyze gigabytes to terabytes of recent, real-time, and streaming data, and has the indexes to make most queries run in milliseconds. Rockset builds a converged index on structured and semi-structured data from OLTP databases, streams, and data lakes in real time, and exposes a RESTful SQL interface.
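
A hedged sketch of the RESTful SQL interface; the regional API endpoint, API key, and collection name below are placeholders:

```python
import requests

# Regional endpoint, API key, and collection name are placeholders.
resp = requests.post(
    "https://api.rs2.usw2.rockset.com/v1/orgs/self/queries",
    headers={"Authorization": "ApiKey YOUR_API_KEY"},
    json={"sql": {"query": "SELECT * FROM commons.events LIMIT 10"}},
)
print(resp.json())
```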

Snowflake

Snowflake is a dynamically scalable enterprise data warehouse designed for the cloud. It runs on AWS, Azure, and Google Cloud. Snowflake features storage, compute, and global services layers that are physically separated but logically integrated. Data workloads scale independently from one another, making Snowflake an appropriate platform for data warehousing, data lakes, data engineering, data science, modern data sharing, and developing data applications.

Teradata Vantage

Teradata Vantage is a connected multi-cloud data platform for enterprise analytics that unifies data lakes, data warehouses, analytics, and new data sources and types. Vantage runs on public clouds (such as AWS, Azure, and Google Cloud), hybrid multi-cloud environments, on-premises with Teradata IntelliFlex, or on commodity hardware with VMware.

Vertica

Vertica is a unified analytics data warehouse built on the massively parallel, columnar architecture that grew out of the C-Store research project discussed earlier. It runs on-premises on commodity hardware as well as in the major public clouds; in Eon Mode, Vertica separates compute from storage so that clusters can scale elastically.