Databases are typically classified as relational (SQL) or NoSQL, and transactional (OLTP), analytic (OLAP), or hybrid (HTAP). Departmental and special-purpose databases were initially considered huge improvements to business practices, but later derided as “islands.” Attempts to create unified databases for all data across an enterprise are classified as data lakes if the data is left in its native format, and data warehouses if the data is brought into a common format and schema. Subsets of a data warehouse are called data marts.
Data warehouse defined
Essentially, a data warehouse is an analytic database, usually relational, that is created from two or more data sources, typically to store historical data, which may have a scale of petabytes. Data warehouses often have significant compute and memory resources for running complicated queries and generating reports. They are often the data sources for business intelligence (BI) systems and machine learning.
Why use a data warehouse?
One major motivation for using an enterprise data warehouse, or EDW, is that your operational (OLTP) database limits the number and kind of indexes you can create, and therefore slows down your analytic queries. Once you have copied your data into the data warehouse, you can index everything you care about in the data warehouse for good analytic query performance, without affecting the write performance of the OLTP database.
Another reason to have an enterprise data warehouse is to enable joining data from multiple sources for analysis. For example, your sales OLTP application probably has no need to know about the weather at your sales locations, but your sales predictions could take advantage of that data. If you add historical weather data to your data warehouse, it would be easy to factor it into your models of historical sales data.
Data warehouse vs. data lake
Data lakes, which store files of data in its native format, are essentially “schema on read,” meaning that any application that reads data from the lake will need to impose its own types and relationships on the data. Data warehouses, on the other hand, are “schema on write,” meaning that data types, indexes, and relationships are imposed on the data as it is stored in the EDW.
“Schema on read” is good for data that may be used in several contexts, and poses little risk of losing data, although the danger is that the data will never be used at all. (Qubole, a vendor of cloud data warehouse tools for data lakes, estimates that 90% of the data in most data lakes is inactive.) “Schema on write” is good for data that has a specific purpose, and good for data that must relate properly to data from other sources. The danger is that mis-formatted data may be discarded on import because it doesn’t convert properly to the desired data type.
Data warehouse vs. data mart
Data warehouses contain enterprise-wide data, while data marts contain data oriented towards a specific business line. Data marts may be dependent on the data warehouse, independent of the data warehouse (i.e. drawn from an operational database or external source), or a hybrid of the two.
Reasons to create a data mart include using less space, returning query results faster, and costing less to run than a full data warehouse. Often a data mart contains summarized and selected data, instead of or in addition to the detailed data found in the data warehouse.
Data warehouse architectures
In general, data warehouses have a layered architecture: source data, a staging database, ETL (extract, transform, and load) or ELT (extract, load, and transform) tools, the data storage proper, and data presentation tools. Each layer serves a different purpose.
The source data often includes operational databases from sales, marketing, and other parts of the business. It may also include social media and external data, such as surveys and demographics.
The staging layer stores the data retrieved from the data sources; if a source is unstructured, such as social media text, this is where a schema is imposed. This is also where quality checks are applied, to remove poor quality data and to correct common mistakes. ETL tools pull the data, perform any desired mappings and transformations, and load the data into the data storage layer.
ELT tools store the data first and transform later. When you use ELT tools, you may also use a data lake and skip the traditional staging layer.
The data storage layer of a data warehouse contains cleaned, transformed data ready for analysis. It will often be a row-oriented relational store, but may also be column-oriented or have inverted-list indexes for full-text search. Data warehouses often have many more indexes than operational data stores, to speed analytic queries.
Data presentation from a data warehouse is often done by running SQL queries, which may be constructed with the help of a GUI tool. The output of the SQL queries is used to create display tables, charts, dashboards, reports, and forecasts, often with the help of BI (business intelligence) tools.
Of late, data warehouses have started to support machine learning to improve the quality of models and forecasts. Google BigQuery, for example, has added SQL statements to support linear regression models for forecasting and binary logistic regression models for classification. Some data warehouses have even integrated with deep learning libraries and automated machine learning (AutoML) tools.
Cloud data warehouse vs. on-prem data warehouse
A data warehouse can be implemented on-premises, in the cloud, or as a hybrid. Historically, data warehouses were always on-prem, but the capital cost and lack of scalability of on-prem servers in data centers was sometimes an issue. EDW installations grew when vendors started offering data warehouse appliances. Now, however, the trend is to move all or part of your data warehouse to the cloud to take advantage of the inherent scalability of cloud EDW, and the ease of connecting to other cloud services.
The downside of putting petabytes of data in the cloud is the operational cost, both for cloud data storage and for cloud data warehouse compute and memory resources. You might think that the time to upload petabytes of data to the cloud would be a huge barrier, but the hyperscale cloud vendors now offer high-capacity, disk-based data transfer services.
Top-down vs. bottom-up data warehouse design
There are two major schools of thought about how to design a data warehouse. The difference between the two has to do with the direction of data flow between the data warehouse and the data marts.
Top-down design (known as the Inman approach) treats the data warehouse as the centralized data repository for the whole enterprise. Data marts are derived from the data warehouse.
Bottom-up design (known as the Kimball approach) treats the data marts as primary, and combines them into the data warehouse. In Kimball’s definition, the data warehouse is “a copy of transaction data specifically structured for query and analysis.”
Insurance and manufacturing applications of the EDW tend to favor the Inman top-down design methodology. Marketing tends to favor the Kimball approach.
Data lake, data mart, or data warehouse?
Ultimately, all of the decisions associated with enterprise data warehouses boil down to your company’s goals, resources, and budget. The first question is whether you need a data warehouse at all. The next task, assuming you do, is to identify your data sources, their size, their current growth rate, and what you’re currently doing to utilize and analyze them. After that, you can start to experiment with data lakes, data marts, and data warehouses to see what works for your organization.
I’d suggest doing your proof of concept with a small subset of data, hosted either on existing on-prem hardware or on a small cloud installation. Once you have validated your designs and demonstrated the benefits to the organization, you can scale up to a full-blown installation with full management support.