In an effort to compete with its cloud-services rivals and help enterprises generate more business value out of their accumulated data, Oracle on Tuesday joined the data lakehouse bandwagon by debuting its MySQL HeatWave Lakehouse service.
MySQL HeatWave Lakehouse, announced at the Oracle CloudWorld conference, is currently available in beta and expected to be made generally available in the first half of 2023. It's designed to quickly load and query up to 400TB of data, while the HeatWave cluster can scale up to 512 nodes, Oracle said.
As the name suggests, a data lakehouse is an architecture that combines the benefits of a data warehouse—such as structured data management and processing functionality, including support for table formats, metadata management, and transactional updates and deletes—with the low cost and agility advantages of a data lake.
The lakehouse architecture concept has been gaining popularity, especially among enterprises that have invested in a data lake, said Matt Aslett, research vice president at Ventana Research.
“By 2024, more than three-quarters of current data lake adopters will be investing in data lakehouse technologies,” Aslett said.
Oracle rivals including Snowflake, Databricks, Teradata, Dremio, Google, AWS, and Microsoft Azure have all introduced some form of the data lakehouse concept.
Data lakes themselves have become an important part of the analytics data estate for many enterprises, according to a report from Ventana.
Data lakes have gained significance since vendors started offering cloud object storage as the underlying data repository, which makes the lake concept a relatively inexpensive way of storing large volumes of data from multiple enterprise applications and workloads. This is all the more relevant for semistructured and unstructured data that is unsuitable for storing and processing in a data warehouse, Aslett explained.
More than half (53%) of the participants in Ventana Research's Analytics & Data Benchmark Research poll said they are using object storage in their analytics efforts, the market research firm said, adding that a further 29% are evaluating or planning to do so.
Lakehouse provides support for multiple file formats
MySQL HeatWave Lakehouse, the latest addition to Oracle’s MySQL HeatWave cloud service for analytics and mixed workloads, will allow enterprises to process and query data across file formats, such as CSV and Parquet, as well as Aurora and Redshift backups from AWS, the company said.
This means that enterprises can use MySQL HeatWave even when their data is not stored inside a MySQL database.
The new service allows enterprises to query their online transaction processing (OLTP) data stored inside a MySQL database and combine it with data stored in the object store using standard MySQL syntax.
“Any change made to the OLTP data is updated in real time and reflected in the query result,” the company said in a statement.
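To make the idea concrete, here is a minimal sketch of one SQL statement joining transactional rows with rows that originated in a file. It is not HeatWave code: it uses Python's built-in sqlite3 module as a stand-in, and the table names, column names, and CSV contents are all hypothetical. Unlike the service Oracle describes, the sketch copies the file into a table before querying it.

```python
import csv
import io
import sqlite3

# Hypothetical file contents standing in for a CSV object in cloud storage.
CLICKS_CSV = "customer_id,clicks\n1,40\n2,15\n"

# One ordinary SQL statement that joins transactional rows with file-derived rows.
QUERY = """
SELECT o.customer_id, SUM(o.amount) AS revenue, c.clicks
FROM orders AS o
JOIN clicks AS c ON c.customer_id = o.customer_id
GROUP BY o.customer_id, c.clicks
ORDER BY o.customer_id
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with an OLTP-style table and a file-backed table."""
    conn = sqlite3.connect(":memory:")
    # Transactional table (assumed schema, for illustration only).
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 20.0), (2, 5.0)])
    # Table populated from the CSV text.
    conn.execute("CREATE TABLE clicks (customer_id INTEGER, clicks INTEGER)")
    rows = [
        (int(r["customer_id"]), int(r["clicks"]))
        for r in csv.DictReader(io.StringIO(CLICKS_CSV))
    ]
    conn.executemany("INSERT INTO clicks VALUES (?, ?)", rows)
    return conn


def revenue_and_clicks(conn: sqlite3.Connection):
    """Run the combined query and return all result rows."""
    return conn.execute(QUERY).fetchall()
```

Running `revenue_and_clicks` again after inserting a new row into `orders` returns the updated revenue, which loosely mirrors the behavior described in Oracle's statement.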
The entire MySQL HeatWave portfolio has also been made available across multiple cloud service providers including Oracle Cloud Infrastructure (OCI), AWS and Microsoft Azure, Oracle said.
Machine learning-based automation with MySQL Autopilot
Oracle’s MySQL HeatWave Lakehouse comes with support for MySQL Autopilot, which was launched in August 2021 as a component of the HeatWave portfolio and uses machine learning to improve query performance and scalability.
Some of the existing features of MySQL Autopilot, such as auto provisioning and auto query plan, have been improved to support better performance in the lakehouse service, the company said.
The new capabilities of MySQL Autopilot designed for the lakehouse include auto schema inference, adaptive data sampling, auto load, and adaptive data flow.
Auto schema inference allows Autopilot to automatically infer the mapping of file data to data types in the database, which means enterprise users don’t need to manually specify the mapping for each new file to be queried by MySQL HeatWave Lakehouse, the company said.
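Oracle has not detailed how Autopilot performs this inference. As a rough illustration of the general idea only, the sketch below guesses a column type for each field of a CSV sample by trying progressively wider types. The function names, the three type labels, and the sample size are assumptions made for this example.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest of INT, DOUBLE, VARCHAR that fits all non-empty values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "VARCHAR"
    for caster, name in ((int, "INT"), (float, "DOUBLE")):
        try:
            for v in non_empty:
                caster(v)
            return name
        except ValueError:
            continue  # at least one value did not parse; try the next wider type
    return "VARCHAR"


def infer_schema(csv_text, sample_rows=100):
    """Map each header name to an inferred type using the first `sample_rows` rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    sample = [row for _, row in zip(range(sample_rows), reader)]
    # Transpose rows into columns; with no data rows, every column is empty.
    columns = list(zip(*sample)) if sample else [()] * len(header)
    return {name: infer_type(col) for name, col in zip(header, columns)}
```

A production system would also need to handle dates, nulls, ragged rows, and values that only appear outside the sample, none of which this sketch attempts.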
To improve query performance, Autopilot uses adaptive data sampling, collecting statistics with minimal data access. MySQL HeatWave uses these statistics to generate and improve query plans, to determine the optimal schema mapping, and for other purposes.
Autopilot uses adaptive data flow to get the maximum available performance from the underlying cloud infrastructure, which improves overall performance and availability, Oracle said.
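The article does not say how Autopilot decides how much data to sample. One common way to collect a statistic while touching little data, shown here purely as an assumed illustration, is to start with a small random sample and keep doubling it until the estimate stops moving. The function name, starting size, and tolerance below are hypothetical.

```python
import random


def adaptive_mean(values, start=100, tolerance=0.01, seed=0):
    """Estimate the mean by doubling a random sample until the estimate stabilizes.

    Returns (estimate, sample_size). Each larger sample contains the previous
    one, so sample_size is also the number of distinct rows read.
    """
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)
    order = list(range(len(values)))
    rng.shuffle(order)  # a random order makes every prefix a random sample
    size = min(start, len(values))
    previous = None
    while True:
        estimate = sum(values[i] for i in order[:size]) / size
        stable = (
            previous is not None
            and abs(estimate - previous) <= tolerance * max(abs(previous), 1e-12)
        )
        # Stop when the estimate has settled or the whole column has been read.
        if stable or size == len(values):
            return estimate, size
        previous = estimate
        size = min(size * 2, len(values))
```

On a column with little variation the loop stops after a couple of rounds, while a highly skewed column pushes it toward reading more rows, which is the trade-off the word "adaptive" suggests.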
Additional improvements to the MySQL HeatWave portfolio include support for forecasting models, a new query optimizer, and updated support for the VS Code plug-in.
“Data scientists can now influence various stages of the automated HeatWave ML training pipeline, including the choice of algorithm, feature selection, scoring metric, and the explanation technique,” Oracle said, adding that HeatWave ML has been updated to allow import of machine learning models into HeatWave.
Will Oracle shed high-cost provider reputation?
The lakehouse announcement can be seen as part of Oracle’s broader strategy to reverse its reputation as a high-cost provider, said Tony Baer, principal analyst at market research firm dbInsight.
“Oracle’s strategy for reversing its reputation in this context is not with me-too technology, but with optimized database engines that outperform the competition,” Baer explained.
However, he warned that most vendors were also diving into the lakehouse space.
“The momentum is more on the vendor side than the customer side, but it’s a case of going where the hockey puck is going as opposed to where it is today,” Baer said. “The company can only bring its mainstream customer under the lakehouse fold if Oracle’s flagship databases hop the bandwagon,” he added.
Oracle claims that customers migrating from AWS, Google, and on-premises infrastructure have been using MySQL HeatWave for a broad set of applications including marketing analytics, real-time analysis of advertising campaign performance and customer data analytics.
Customers who migrated from AWS include firms in the automotive, telecommunications, retail, high-tech, and healthcare industries, it added.
Meanwhile, the phenomenon of an increasing number of vendors offering lakehouse architecture can benefit Oracle, according to Baer.
“Given that open source is creeping up the stack, and for Oracle, MySQL HeatWave is about reaching out to new audiences, hopping on the bandwagon could make HeatWave more accessible since, at the table level, there wouldn’t be any lock-in,” said Baer.
This will also depend on factors such as whether open source formats, namely Delta Lake, Apache Iceberg, or possibly Apache Hudi, emerge as the de facto standard for modern lakehouses, Baer added.