Modern data infrastructures don’t do ETL

Business happens in real time but many business systems don’t. It’s time to move past client-server databases, data warehouses, and batch processes.

Abstract network of digital streams.
Gonin / Getty Images

Businesses are 24/7. This includes everything from the website, back office, supply chain, and beyond. At another time, everything ran in batches. Even a few years ago, operational systems would be paused so that data could be loaded into a data warehouse and reports would be run. Now reports are about where things are right now. There is no time for ETL.

Much of IT architecture is still based on a hub-and-spoke system. Operational systems feed a data warehouse, which then feeds other systems. Specialized visualization software creates reports and dashboards based on “the warehouse.” However, this is changing, and these changes in business require both databases and system architecture to adapt.

Fewer copies, better databases

Part of the great cloud migration and the scalability efforts of the last decade resulted in the use of many purpose-built databases. In many companies, the website is backed by a NoSQL database, while critical systems involving money are on a mainframe or relational database. That is just the surface of the issue. For many problems, even more specialized databases are used. Often times, this architecture requires moving a lot of data around using traditional batch processes. The operational complexity leads not only to latency but faults. This architecture was not made to scale, but was patched together to stop the bleeding.

Databases are changing. Relational databases are now able to handle unstructured, document, and JSON data. NoSQL databases now have at least some transactional support. Meanwhile distributed SQL databases enable data integrity, relational data, and extreme scalability while maintaining compatibility with existing SQL databases and tools.

However, that in itself is not enough. The line between transactional or operational systems and analytical systems cannot be a border. A database needs to handle both lots of users and long-running queries, at least most of the time. To that end, transactional/operational databases are adding analytical capabilities in the form of columnar indexes or MPP (massively parallel processing) capabilities. It is now possible to run analytical queries on some distributed operational databases, such as MariaDB Xpand (distributed SQL) or Couchbase (distributed NoSQL).

Never extract

This is not to say that technology is at a place where no specialized databases are needed. No operational database is presently capable of doing petabyte-scale analytics. There are edge cases where nothing but a time series or other specialized database will work. The trick to keeping things simpler or achieving real-time analytics is to avoid extracts.

In many cases, the answer is how data is captured in the first place. Rather than sending data to one database and then pulling data from another, the transaction can be applied to both. Modern tools like Apache Kafka or Amazon Kinesis enable this kind of data streaming. While this approach ensures that data make it to both places without delay, it requires more complex development to ensure data integrity. By avoiding the push-pull of data, both transactional and analytical databases can be updated at the same time, enabling real-time analytics when a specialized database is required.

Some analytical databases just cannot take this. In that case more regular batched loads can be used as a stopgap. However, doing this efficiently requires the source operational database to take on more long-running queries, potentially during peak ours. This necessitates a built-in columnar index or MPP.

Databases old and new

Client-server databases were amazing in their era. They evolved to make good use of lots of CPUs and controllers to deliver performance to a wide variety of applications. However, client-server databases were designed for employees, workgroups, and internal systems, not the internet. They have become absolutely untenable in the modern age of web-scale systems and data omnipresence.

Lots of applications use lots of different stove-pipe databases. The advantage is a small blast radius if one goes down. The disadvantage is there is something broken all of the time. Combining fewer databases into a distributed data fabric allows IT departments to create a more reliable data infrastructure that handles varying amounts of data and traffic with less downtime. It also means less pushing data around when it is time to analyze it.

Supporting new business models and real-time operational analytics are just two advantages of a distributed database architecture. Another is that with fewer copies of data around, understanding data lineage and ensuring data integrity become simpler. Storing more copies of data in different systems creates a larger opportunity for something to not match up. Sometimes the mismatch is just different time indexes and other times it is genuine error. Combining data into fewer and more capable systems, you reduce the number of copies and have less to check. 

A new real-time architecture

By relying mostly on general-purpose distributed databases that can handle both transactions and analytics, and using streaming for those larger analytics cases, you can support the kind of real-time operational analytics that modern businesses require. These databases and tools are readily available in the cloud and on-premises and already widely deployed in production.  

Change is hard and it takes time. It is not just a technical problem but a personnel and logistical issue. Many applications have been deployed with stovepipe architectures, and live apart from the development cycle of the rest of the data infrastructure. However, economic pressure, growing competition, and new business models are pushing this change in even the most conservative and stalwart companies.

Meanwhile, many organizations are using migration to the cloud to refresh their IT architecture. Regardless of how or why, business is now real-time. Data architecture must match it.

Andrew C. Oliver is senior director of product marketing at MariaDB.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Copyright © 2023 IDG Communications, Inc.