AWS simplifies data management, analytics with new services

A major theme at re:Invent 2022 was Amazon's efforts to ease data management, as AWS announced new ETL capabilities and features for collaboration, searching and cataloging.

By Anirban Ghoshal, Senior Writer, InfoWorld


Simplifying data management and analytics for enterprises is a big theme at this year's AWS re:Invent conference, as Amazon announces new services and features targeted at easing extract, transform, load (ETL) processes and providing support for cataloging and searching data across organizations.

AWS has announced two new capabilities, Amazon Aurora zero-ETL integration with Amazon Redshift and Amazon Redshift integration for Apache Spark, that it claims will make the ETL process obsolete.

Enterprises typically use ETL to integrate data from multiple sources into a single, consistent data store that is then loaded into a data warehouse for analysis.

However, most data engineers claim that transforming data from disparate sources can be a difficult and time-consuming task, as the process involves steps such as cleaning, filtering, reshaping, and summarizing the raw data.
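
To make those steps concrete, here is a minimal sketch of a transform stage in plain Python. The sample records, field names, and function names are all hypothetical and are not taken from any AWS service. It only illustrates the kind of cleaning, filtering, reshaping, and summarizing work that a data pipeline team maintains.

```python
from collections import defaultdict

# Hypothetical raw order records, as they might arrive from two source systems.
RAW_ORDERS = [
    {"order_id": "1", "region": " us-east ", "amount": "120.50", "status": "paid"},
    {"order_id": "2", "region": "US-EAST", "amount": "80.00", "status": "cancelled"},
    {"order_id": "3", "region": "eu-west", "amount": "n/a", "status": "paid"},
    {"order_id": "4", "region": "EU-West", "amount": "42.25", "status": "paid"},
    {"order_id": "5", "region": "us-east", "amount": "10.00", "status": "paid"},
]


def clean(rows):
    """Cleaning: normalize region names, parse amounts, drop unparseable rows."""
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount is not a number
        yield {**row, "region": row["region"].strip().lower(), "amount": amount}


def keep_paid(rows):
    """Filtering: keep only orders that were actually paid."""
    return (row for row in rows if row["status"] == "paid")


def reshape(rows):
    """Reshaping: reduce each record to the (region, amount) pair we analyze."""
    return ((row["region"], row["amount"]) for row in rows)


def summarize(pairs):
    """Summarizing: total revenue per region."""
    totals = defaultdict(float)
    for region, amount in pairs:
        totals[region] += amount
    return dict(totals)


def run_pipeline(rows):
    """Chain the four steps in the order described above."""
    return summarize(reshape(keep_paid(clean(rows))))


if __name__ == "__main__":
    print(run_pipeline(RAW_ORDERS))
```

Even in this toy version, every new source format or business rule means another change to the pipeline code, which is the maintenance burden the article describes.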

Another issue is the added cost of maintaining teams that prepare data pipelines for running analytics, AWS said.

New features aim to eliminate ETL

In contrast, the Amazon Aurora zero-ETL integration, according to the company, eliminates the need to perform ETL between Aurora and Redshift, as transactional data written into Aurora is replicated into Redshift almost immediately and is ready for analysis.

“Customers can replicate data from multiple Amazon Aurora database clusters into the same Amazon Redshift instance to derive insights across several applications,” the company said in a statement, adding that the integration was currently in preview.

In addition, the company said that Amazon Redshift integration for Apache Spark will help enterprise developers use AWS analytics and machine learning services to build and run Apache Spark applications on data from Amazon Redshift.

Apache Spark, which is a common tool used by developers, is an open source, unified analytics engine for processing big data.

“Developers can begin running queries on Amazon Redshift data from Apache Spark-based applications within seconds using popular language frameworks (e.g., Java, Python, R, and Scala),” the company said, adding that the integration has been made generally available.
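
As a rough illustration of what querying Redshift from a Spark application can look like, the sketch below reads a table into a PySpark DataFrame. It is not taken from the AWS announcement. The data source name and the option names (url, dbtable, tempdir, aws_iam_role) are those of the open source spark-redshift community connector, and the assumption here is that the cluster has that connector and a Redshift JDBC driver installed. The helper function names are hypothetical, and the exact settings should be checked against current AWS documentation.

```python
# Data source name of the open source spark-redshift community connector.
# Assumption: the connector and a Redshift JDBC driver are on the cluster.
REDSHIFT_FORMAT = "io.github.spark_redshift_community.spark.redshift"


def redshift_read_options(jdbc_url, table, temp_s3_dir, iam_role_arn):
    """Build the reader options for one Redshift table (hypothetical helper)."""
    if not jdbc_url.startswith("jdbc:redshift://"):
        raise ValueError("expected a jdbc:redshift:// URL")
    if not temp_s3_dir.startswith(("s3://", "s3a://", "s3n://")):
        raise ValueError("tempdir must be an S3 location")
    return {
        "url": jdbc_url,               # JDBC endpoint of the Redshift cluster
        "dbtable": table,              # table to read
        "tempdir": temp_s3_dir,        # S3 staging area used for data transfer
        "aws_iam_role": iam_role_arn,  # role Redshift assumes to reach S3
    }


def read_redshift_table(spark, jdbc_url, table, temp_s3_dir, iam_role_arn):
    """Load a Redshift table as a DataFrame.

    `spark` is an existing pyspark.sql.SparkSession created by the caller.
    """
    options = redshift_read_options(jdbc_url, table, temp_s3_dir, iam_role_arn)
    return spark.read.format(REDSHIFT_FORMAT).options(**options).load()
```

A caller would pass its own session and connection details, for example read_redshift_table(spark, "jdbc:redshift://example-host:5439/dev", "public.sales", "s3://example-bucket/tmp/", "arn:aws:iam::123456789012:role/example-role"), where every value is a placeholder.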

Amazon DataZone to help catalog and search data

The cloud services provider has also previewed a new data management service, dubbed Amazon DataZone. The service, which is not yet available, is expected to help enterprises catalog, discover, share, and govern data stored across AWS, on-premises, and third-party sources, the company said.

Data producers in an enterprise can set up the data catalog by defining data sources, data taxonomy and governance policies via the service’s web portal, AWS said.

“Amazon DataZone removes the heavy lifting of maintaining a catalog by using machine learning to collect and suggest metadata (e.g., origin and data type) for each dataset and by training on a customer’s taxonomy and preferences to improve over time,” the company said in a press release.

After the catalog is set up, data consumers can use the Amazon DataZone web portal to search and discover data assets, examine metadata for context, and request access to data sets, it added.

In order to run analytics on the data, enterprise users have to create an Amazon DataZone Data Project—a shared space in the web portal that enables users to pull in different data sets, share access with colleagues, and collaborate on analysis, AWS said.

“Amazon DataZone is integrated with AWS analytics services, such as Amazon Redshift, Amazon Athena, and Amazon QuickSight, which enables data consumers to access these services in the context of their data project,” the company said.

The service also provides APIs to integrate with custom solutions or partners such as Databricks, Snowflake, and Tableau.

AWS Clean Rooms eases data collaboration

To help enterprises collaborate on data with their partners, AWS has announced a new service, dubbed AWS Clean Rooms.

The service, which is currently restricted to AWS customers, can be accessed via the AWS Management Console, where an enterprise can choose the partners it wants to collaborate with, the company said, adding that the console provides options to choose the data sets to be shared and to configure permissions for participants.

The data sets that are being shared in the clean room are encrypted and don't have to move out of the AWS environment or be loaded into another platform, AWS said, adding that queries can also be run on these data sets.

Additionally, AWS Clean Rooms provides a broad set of configurable data access controls—including query controls, query output restrictions, and query logging—that allow enterprises to customize restrictions on the queries run by each clean room participant.
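
The sketch below shows, in plain Python, how the three kinds of controls could work together in principle. It is a toy model and not the AWS Clean Rooms API: the class name, its parameters, and the minimum group size rule are all assumptions made for illustration. The query control limits which columns a participant may group by, the output restriction suppresses small groups, and every request is logged.

```python
from collections import Counter


class CleanRoomTable:
    """Toy shared data set with query controls (hypothetical, not an AWS API)."""

    def __init__(self, rows, allowed_group_columns, min_group_size):
        self._rows = list(rows)
        self._allowed = set(allowed_group_columns)  # query control
        self._min_group_size = min_group_size       # output restriction
        self.query_log = []                         # query logging

    def count_by(self, participant, column):
        """Return row counts per value of `column`, hiding small groups."""
        # Log every attempt, including ones that are later rejected.
        self.query_log.append((participant, column))
        if column not in self._allowed:
            raise PermissionError(f"grouping by {column!r} is not permitted")
        counts = Counter(row[column] for row in self._rows)
        # Only aggregates over sufficiently large groups leave the clean room.
        return {
            value: n for value, n in counts.items() if n >= self._min_group_size
        }


if __name__ == "__main__":
    table = CleanRoomTable(
        rows=[
            {"user": "u1", "segment": "sports"},
            {"user": "u2", "segment": "sports"},
            {"user": "u3", "segment": "sports"},
            {"user": "u4", "segment": "travel"},
        ],
        allowed_group_columns=["segment"],
        min_group_size=2,
    )
    print(table.count_by("partner-a", "segment"))
```

In this model a participant never sees row-level data, and a group with too few members, such as the single "travel" user above, is left out of the result.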

AWS Clean Rooms, which is available as a standalone offering and as part of AWS for Advertising and Marketing, will be available in early 2023 in the US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), Europe (London), and Europe (Stockholm) regions.

AWS adds new features to Amazon QuickSight

In addition to updating other services, AWS has added new features to its unified business intelligence service, Amazon QuickSight.

The cloud service provider has added the capability to ask natural language queries inside QuickSight via a new feature dubbed QuickSight Q.

QuickSight Q uses machine learning to let enterprise users ask questions about business data in natural language and receive accurate answers with relevant visualizations in seconds, the company said, adding that the feature will allow users to ask “why” questions and request forecasts about their data.

Support for forecasting and “why” questions is available at no additional cost to all QuickSight Q customers, according to the company.

QuickSight Q also comes with another capability that automatically infers and adds semantic information to data sets, reducing the time business intelligence teams spend prepping data for natural language querying from days to minutes, AWS said.

This is made possible by pretrained machine learning models and learnings from business intelligence assets such as dashboards and reports.

The ability to automatically prepare data within QuickSight Q is also available to existing QuickSight Q customers at no extra cost.

Other added features include the ability to generate paginated reports and fast analysis for large data sets.

The paginated report service is being made available as an add-on service for QuickSight Enterprise edition customers, the company said.


Anirban Ghoshal is a senior writer, covering enterprise software for CIO and databases and cloud infrastructure for InfoWorld.