Measuring quality of any kind means turning a subjective property into a quantifiable attribute, a measure or key performance indicator. Measuring quality should be a means to drive operational and delivery improvements. But measuring quality has a cost, and people can only track so many metrics, so there’s an art to picking the ones that drive the most significant business impact.
We can usually spot bad quality, but defining good quality is subjective. Well-defined quality metrics pin down what poor quality looks like and how much better something needs to be to move from good quality to better quality to top quality.
Managing data quality has the same challenges. When subject matter experts look at a data visualization or study the results of a machine learning model, they can often spot data quality issues that undermine the results. Data scientists also know how to use data prep and data quality tools to profile a data source and either improve the quality of its fields or exclude them from the analysis. Common data quality problems include missing data, such as addresses that lack ZIP codes, and data normalization issues, such as a U.S. state field that sometimes holds the state name (New York) and other times its abbreviation (NY).
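To make those problems concrete, here is a minimal pandas sketch that profiles a hypothetical customer table for missing ZIP codes and normalizes a U.S. state field that mixes full names and abbreviations. The column names and the small state lookup are assumptions for illustration; a production pipeline would use a complete lookup table and a dedicated data prep or data quality tool.

```python
import pandas as pd

# Hypothetical customer records showing two common quality problems
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "state": ["New York", "NY", "California", "CA"],
    "zip_code": ["10001", None, "94105", ""],
})

# Profile completeness: how many rows lack a ZIP code?
missing_zip = df["zip_code"].isna() | (df["zip_code"].str.strip() == "")
print(f"Rows missing a ZIP code: {missing_zip.sum()} of {len(df)}")

# Normalize the state field to two-letter abbreviations
# (illustrative mapping only; real pipelines need the full 50-state lookup)
state_abbrev = {"New York": "NY", "California": "CA"}
df["state"] = df["state"].replace(state_abbrev)
print(df)
```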
Shift-left data quality improvements
One approach to improving data quality is to “shift left” the steps to measure and automate improvements as a dataops practice. Dataops focuses on all the steps in integrating, transforming, joining, and making data available and ready for consumption. It’s the optimal place to measure and remediate data quality issues so that all downstream analytics, data visualizations, and machine learning use cases operate on consistent, higher-quality data sources.
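In practice, shifting left often means running lightweight quality checks inside the pipeline, before a batch moves downstream. The sketch below is one way that could look, assuming hypothetical rules, a hypothetical 5% error tolerance, and illustrative function names; it is not a prescribed implementation.

```python
import pandas as pd

ERROR_THRESHOLD = 0.05  # assumed tolerance: reject the batch above 5% bad rows

def quality_gate(batch: pd.DataFrame) -> pd.DataFrame:
    """Shift-left check: validate a batch before it reaches downstream consumers."""
    errors = (
        batch["zip_code"].isna()                                 # completeness rule
        | ~batch["state"].str.fullmatch(r"[A-Z]{2}", na=False)   # validity rule
    )
    error_ratio = errors.mean()  # a ratio-style metric the team can trend over time
    if error_ratio > ERROR_THRESHOLD:
        raise ValueError(f"Batch rejected: {error_ratio:.1%} of rows failed quality checks")
    return batch[~errors]  # pass only clean rows downstream

# Hypothetical usage inside a pipeline step:
# load_to_warehouse(quality_gate(extract_batch()))
```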
You’ll find many data quality metrics to consider if you survey the latest research and articles. For example, six commonly used categories of data quality metrics are:
- Accuracy
- Completeness
- Consistency
- Timeliness
- Uniqueness
- Validity
When measuring data quality in data warehouses and databases, intrinsic data quality dimensions such as consistency are independent of the use case, whereas extrinsic ones such as reliability may depend on the analysis. Measuring data quality as a ratio, such as the ratio of data to errors or the data transformation error rate, provides a better mechanism for tracking quality improvements than absolute metrics.
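As a rough illustration of both the categories and the ratio idea, the sketch below computes completeness, uniqueness, and validity as ratios over a small hypothetical orders table. The column names and the validity rule are assumptions made for the example.

```python
import pandas as pd

# Hypothetical orders table with a duplicate key, mixed formats, and missing values
df = pd.DataFrame({
    "order_id": [100, 101, 101, 102],
    "state": ["NY", "New York", "CA", None],
    "order_ts": pd.to_datetime(["2023-05-01", "2023-05-01", "2023-05-02", None]),
})

# Completeness: share of non-null values in each column
completeness = df.notna().mean()

# Uniqueness: share of rows whose primary key is not a duplicate
uniqueness = 1 - df["order_id"].duplicated().mean()

# Validity: share of non-null state values matching the expected two-letter format
validity = df["state"].dropna().str.fullmatch(r"[A-Z]{2}").mean()

print(completeness)
print(f"uniqueness: {uniqueness:.0%}, validity: {validity:.0%}")
```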
The hard question is where to start and which dataops improvements and metrics to prioritize. I asked several experts to weigh in.
Drive trust with data accuracy, completeness, and usability
Simon Swan, head of field solutions strategy at Talend, says, “60% of executives don’t consistently trust the data they work with,” a troubling finding for organizations promoting more data-driven decision-making.
Swan offers this suggestion to dataops teams. “First, dataops teams should prioritize improving data quality metrics for accuracy, completeness, and usability to ensure that users have verifiable insights to power the business,” he says.
Dataops teams can instrument these data health practices in several ways.
- Accuracy is improved when dataops integrates referenceable data sources, and data stewards resolve conflicts through automated rules and exception workflows.
- Completeness is an important quality metric for entity data such as people and products. Technologies for master data management and customer data platforms can help dataops teams centralize and complete golden records using multiple data sources (see the completeness-scoring sketch after this list).
- Usability is improved by simplifying data structures, centralizing access, and documenting data dictionaries in a data catalog.
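One simple way to instrument the completeness practice is to score each golden record by the share of required fields that are populated and track the average over time. The customer table and the list of required fields below are hypothetical.

```python
import pandas as pd

# Hypothetical golden records assembled from multiple sources
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Acme Corp", "Globex", None],
    "email": ["ops@acme.example", None, None],
    "phone": ["+1-212-555-0100", "+1-415-555-0199", None],
})

REQUIRED_FIELDS = ["name", "email", "phone"]  # assumed definition of a complete record

# Per-record completeness: fraction of required fields that are populated
customers["completeness"] = customers[REQUIRED_FIELDS].notna().mean(axis=1)

# A single metric the dataops team can report and trend
print(f"Average record completeness: {customers['completeness'].mean():.0%}")
```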
Swan adds, “Data trust provides dataops teams with a measure of operational resilience and agility that readily equips business users with fact-based insights to improve business outcomes.”
Focus on data and system availability as data quality improves
The good news is that as business leaders trust their data, they’ll use it more for decision-making, analysis, and prediction. With that comes an expectation that the data, network, and systems for accessing key data sources are available and reliable.
Ian Funnell, manager of developer relations at Matillion, says, “The key data quality metric for dataops teams to prioritize is availability. Data quality starts at the source because it’s the source data that run today’s business operations.”
Funnell suggests that dataops must also show they can drive data and systems improvements. He says, “Dataops is concerned with the automation of the data processing life cycle that powers data integration and, when used properly, allows quick and reliable data processing changes.”
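There are several reasonable ways to quantify availability. One simple option, sketched below with a hypothetical run log, is the share of scheduled pipeline runs for a source that completed successfully over a window; teams could just as reasonably measure uptime of the source system or the freshness of the latest load.

```python
import pandas as pd

# Hypothetical log of scheduled pipeline runs for one data source
runs = pd.DataFrame({
    "scheduled_at": pd.to_datetime(
        ["2023-05-01 06:00", "2023-05-02 06:00", "2023-05-03 06:00", "2023-05-04 06:00"]
    ),
    "status": ["success", "success", "failed", "success"],
})

# Availability as the share of scheduled loads that succeeded in the window
availability = (runs["status"] == "success").mean()
print(f"Source availability over the window: {availability:.0%}")
```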
Barr Moses, CEO and cofounder of Monte Carlo Data, shares a similar perspective. “After speaking with hundreds of data teams over the years about how they measure the impact of data quality or lack thereof, I found that two key metrics—time to detection and time to resolution for data downtime—offer a good start.”
Moses shares how dataops teams can measure downtime. “Data downtime refers to any period of time marked by broken, erroneous, or otherwise inaccurate data and can be measured by adding the amount of time it takes to detect (TTD) and resolve (TTR), multiplied by the engineering time spent tackling the issue.”
Measuring downtime is one approach to creating a dataops key performance indicator tied to financial performance. Moses adds, “Inspired by tried and tested devops measurements, TTD, TTR, and data downtime ease quantifying the financial impact of poor data quality on a company’s bottom line.”
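A rough sketch of that calculation, using hypothetical incident data and an assumed loaded hourly engineering cost, might look like the following: per incident, downtime is TTD plus TTR, and the financial impact is estimated from the engineering hours spent at that rate.

```python
import pandas as pd

# Hypothetical incident log (all durations in hours)
incidents = pd.DataFrame({
    "time_to_detect_hrs": [4.0, 12.0, 1.5],
    "time_to_resolve_hrs": [2.0, 6.0, 0.5],
    "engineer_hrs_spent": [3.0, 8.0, 1.0],
})

ENGINEERING_RATE = 120  # assumed loaded hourly cost in dollars

# Data downtime per incident: time to detect plus time to resolve
incidents["downtime_hrs"] = (
    incidents["time_to_detect_hrs"] + incidents["time_to_resolve_hrs"]
)
total_downtime = incidents["downtime_hrs"].sum()

# Estimated cost of the incidents in engineering effort
est_cost = (incidents["engineer_hrs_spent"] * ENGINEERING_RATE).sum()

print(f"Total data downtime: {total_downtime:.1f} hours")
print(f"Estimated engineering cost: ${est_cost:,.0f}")
```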
Differentiate with data timeliness and real-time dataops
Kunal Agarwal, cofounder and CEO of Unravel Data, says dataops must aspire to exceed basic data quality and availability metrics and look to more real-time capabilities. He says, “While most data quality metrics focus on accuracy, completeness, consistency, and integrity, another data quality metric that every dataops team should think about prioritizing is data timeliness.”
Timeliness captures the end-to-end data flow, from capture through processing to availability, including supplier and batch processing delays. Agarwal explains, “Reliable timeliness metrics make it much easier to assess and enforce internal and third-party vendor SLAs and ultimately provide a direct line to improved and accelerated data analysis.”
Swan agrees about the importance of improving data timeliness. He says, “Dataops should also focus on guaranteeing speed and timeliness so that users can access up-to-date data across any environment. The data is only as good as its ability to keep up with business needs in near real time.”
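A timeliness metric can be as simple as the gap between when a record is captured and when it becomes available for analysis, compared against a freshness SLA. The sketch below uses hypothetical timestamps and an assumed two-hour target.

```python
import pandas as pd

SLA = pd.Timedelta(hours=2)  # assumed freshness target

# Hypothetical records with capture time and the time they became queryable
events = pd.DataFrame({
    "captured_at": pd.to_datetime(
        ["2023-05-01 08:00", "2023-05-01 09:00", "2023-05-01 10:00"]
    ),
    "available_at": pd.to_datetime(
        ["2023-05-01 08:45", "2023-05-01 12:30", "2023-05-01 10:20"]
    ),
})

# End-to-end latency per record and the share delivered within the SLA
events["latency"] = events["available_at"] - events["captured_at"]
within_sla = (events["latency"] <= SLA).mean()

print(events[["captured_at", "available_at", "latency"]])
print(f"Records delivered within SLA: {within_sla:.0%}")
```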
For many organizations, getting business leaders to trust the data, improving reliability, and enabling closer to real-time data delivery may be aspirational. Many companies have a backlog of data debt issues, significant amounts of dark data that have never been analyzed, and an overreliance on spreadsheets.
So, if you work in dataops, there’s plenty of work to do. Applying data quality metrics can help drum up support from the business, data scientists, and technology leaders.