Data lineage: What it is and why it’s important

As your data evolves, you need a way to track the who, what, when, why, and how of those changes. You need a data lineage system.

Databases are good at inserting, updating, querying, and deleting data and representing the data’s current state. Developers rely on data consistency so APIs can perform the correct transactions and applications can retrieve accurate records. Other consumers of data include data scientists developing machine learning models and citizen data scientists creating data visualizations.

Query a SQL or NoSQL database for what the data looked like two days ago and you might have to rely on database snapshots or proprietary features to get this view. Snapshots and backups may be good enough for developers or data scientists to compare older data sets, but they are not adequate tools for tracking how the data changed.
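One alternative to snapshots is to keep an append-only log of changes and replay it up to a chosen moment. The sketch below is only an illustration under simplifying assumptions: the `state_as_of` function and the `(timestamp, key, value)` tuple layout are hypothetical, and a `None` value is assumed to mean a delete. It does not reflect how any particular database implements point-in-time queries.

```python
from datetime import datetime


def state_as_of(changes, as_of):
    """Rebuild a key/value state by replaying changes made at or before as_of.

    changes: iterable of (timestamp, key, value) tuples (hypothetical layout).
    A value of None is treated as a delete of that key.
    """
    state = {}
    # Replay in chronological order so later changes overwrite earlier ones.
    for ts, key, value in sorted(changes, key=lambda change: change[0]):
        if ts > as_of:
            break
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state


changes = [
    (datetime(2021, 3, 1, 9, 0), "customer:42:city", "Boston"),
    (datetime(2021, 3, 2, 9, 0), "customer:42:city", "Chicago"),
    (datetime(2021, 3, 3, 9, 0), "customer:42:city", None),
]
two_days_ago = state_as_of(changes, datetime(2021, 3, 1, 12, 0))
```

Unlike a snapshot, the same log also shows each intermediate step between the old state and the current one.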

There are many good reasons to know more about how people and systems modify data. It’s important to be able to answer questions such as:

  • Who or what business process changed the data?
  • What tool or technology made the change?
  • How was the data changed? Was it changed by an algorithm, a data flow, an API call, or someone entering data into a form?
  • What were the changes to records, documents, nodes, fields, or attributes?
  • When was the change made, and if done by a person, where were they geographically?
  • Why was the change made? What was the context?
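The questions above map naturally onto the fields of a change record. The following is a minimal sketch, not a reference design: the `ChangeEvent` and `ChangeLog` names and every field name are hypothetical, and a real system would persist events rather than hold them in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, List, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One recorded change to one field (all names are hypothetical)."""
    actor: str        # who or what business process changed the data
    tool: str         # what tool or technology made the change
    method: str       # how: e.g. "algorithm", "data_flow", "api_call", "form"
    entity: str       # which record, document, or node was changed
    field_name: str   # which field or attribute was changed
    old_value: Any
    new_value: Any
    reason: str       # why the change was made, and in what context
    location: Optional[str] = None  # where the person was, if known
    changed_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)  # when
    )


class ChangeLog:
    """Append-only, in-memory log of change events."""

    def __init__(self) -> None:
        self._events: List[ChangeEvent] = []

    def record(self, event: ChangeEvent) -> None:
        self._events.append(event)

    def history(self, entity: str, field_name: str) -> List[ChangeEvent]:
        """Return every recorded change to one field, in insertion order."""
        return [
            e for e in self._events
            if e.entity == entity and e.field_name == field_name
        ]


log = ChangeLog()
log.record(ChangeEvent(
    actor="service-agent-17", tool="crm-web-form", method="form",
    entity="customer:42", field_name="address",
    old_value="1 Main St", new_value="9 Elm St",
    reason="customer called to report a move", location="Toronto",
))
```

Making the event immutable (`frozen=True`) and the log append-only is a deliberate choice here: a lineage record is only useful if it cannot be quietly edited later.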

Data lineage explained

Data lineage comprises methodologies and tools that expose data’s life cycle and help answer questions about who changed the data and when, where, why, and how it changed. It’s a discipline within metadata management and is often a featured capability of data catalogs, which help data consumers understand the context of the data they use for decision-making and other business purposes.

One way to explain data lineage is that it’s the GPS of data that provides “turn-by-turn directions and a visual overview of the completely mapped route.” Others view data lineage as a core datagovops practice, where data lineage, testing, and sandboxes are data governance’s technical practices and automation opportunities.

Capturing and understanding data lineage is important for several reasons:

Compliance requirements: Many organizations must implement data lineage to stay on the good side of government regulators. Data lineage in risk management and reporting is required for capital market trading firms to support BCBS 239 and MiFID II regulations. For large banks, automating the extraction of lineage from source systems can save significant IT time and reduce risks. In pharmaceutical clinical trials, the ADaM standard requires traceability between analysis and source data. Other regulations, including the General Data Protection Regulation (GDPR), the Personal Information Protection and Electronic Documents Act (PIPEDA), and the California Consumer Privacy Act (CCPA), also require more organizations to implement data governance and data lineage capabilities, especially to track private and sensitive data.

A data-driven culture: Organizations developing citizen data science programs, establishing key performance indicator dashboards, managing a hybrid BI (business intelligence) environment, and taking other steps to become data-driven organizations can easily trip up on data lineage challenges. When the financial data in a dashboard changes significantly, it’s a safe bet that executives want to know what caused the change. Citizen data science and other self-service BI programs are hard to get off the ground if subject matter experts don’t trust the data. Data lineage tools help them better understand data sources, flows, and rules around data they are querying, reporting on, or building into data visualizations.  

Transparency: Organizations developing products, services, and workflows seek to improve data quality, create master data hubs, or invest in master data management. These approaches typically include data lineage as a capability to provide transparency on business rules and changes. Example use cases include maturing customer 360 capabilities, scaling digital marketing programs, prioritizing customer experience initiatives, optimizing e-commerce storefronts, and creating transparency into supply chains.

Analytics and machine learning: Data lineage is also important to support modelops and the machine learning life cycle. Capturing and analyzing data lineage can help determine when enough data is new or changed that models need retraining to reduce model drift. But it’s equally important to track the model’s full life cycle because machine learning models are often inputs to services, applications, and downstream analytics.
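As a rough illustration of a lineage-driven retraining check, the sketch below counts record-level changes logged after a model was trained and compares that count with the size of the training set. Everything here is an assumption made for the example: the `needs_retraining` name, its parameters, and the 10 percent default threshold are hypothetical, and a simple change ratio is a stand-in for, not a measure of, actual model drift.

```python
from datetime import datetime


def needs_retraining(change_times, trained_at, training_rows, threshold=0.10):
    """Return True when enough records changed since training.

    change_times: timestamps of record-level changes taken from a lineage log.
    trained_at: when the current model was trained.
    training_rows: number of rows the model was trained on.
    threshold: hypothetical cutoff for the changed-to-trained ratio.
    """
    if training_rows <= 0:
        raise ValueError("training_rows must be positive")
    changed = sum(1 for t in change_times if t > trained_at)
    return changed / training_rows >= threshold


trained_at = datetime(2021, 1, 1)
change_times = [datetime(2021, 1, 5), datetime(2021, 1, 9), datetime(2020, 12, 30)]
retrain = needs_retraining(change_times, trained_at, training_rows=10)
```

In practice the right trigger depends on the model and the data; the point is that the decision needs the change history that lineage tools capture.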

As more organizations invest in data, analytics, and machine learning, data lineage becomes an increasingly important data governance practice. While regulatory requirements drive some organizations to mature data lineage capabilities, others seek data processing transparency, and some view data lineage as a core competency in democratizing data and analytics.   

Data lineage can improve business process

Here are some examples of how organizations use data lineage practices and tools in critical business processes.

The key to success may be setting priorities and defining reasonable goals, especially for organizations with many data sources, technologies, and usage patterns.

Examples of data lineage capabilities

One way to think about data lineage is through flow diagrams illustrating how new data and changes in primary data sources flow through different systems and impact derivative data elements. For example, a customer calls customer service to request an address change, and the data lineage shows the flow of data to other systems updated with the new address.
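A forward flow like the address change can be modeled as a walk over a directed graph of systems. This is a minimal sketch under stated assumptions: the system names and the `FEEDS` mapping are invented for illustration, and real lineage tools derive these edges from pipelines and metadata rather than a hand-written dictionary.

```python
from collections import deque

# Hypothetical edges: each system feeds the systems listed for it.
FEEDS = {
    "customer_service_crm": ["customer_master"],
    "customer_master": ["billing", "shipping", "marketing_email"],
    "billing": ["finance_warehouse"],
    "shipping": [],
    "marketing_email": [],
    "finance_warehouse": [],
}


def downstream(source, feeds):
    """Return every system reachable from source, in breadth-first order."""
    seen = {source}  # guards against revisiting nodes if the graph has cycles
    order = []
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in feeds.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


affected = downstream("customer_service_crm", FEEDS)
```

Here `affected` lists the systems that should eventually show the new address, nearest first.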

The more common way to use data lineage tools is to audit a backward flow of information. For example, if a sales projection changes, sales leaders can review all the data element changes contributing to the new projection.
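The backward audit can be sketched the same way, this time walking from a derived element to everything it depends on. The element names, the `DERIVED_FROM` mapping, and both function names are hypothetical and chosen only to mirror the sales projection example.

```python
# Hypothetical derivations: each element is computed from the elements listed.
DERIVED_FROM = {
    "sales_projection": ["pipeline_value", "win_rate"],
    "pipeline_value": ["open_opportunities"],
    "win_rate": ["closed_won", "closed_lost"],
}


def upstream(element, derived_from):
    """Return all elements that contribute to element, directly or indirectly."""
    found = set()
    stack = [element]
    while stack:
        current = stack.pop()
        for parent in derived_from.get(current, []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def changed_contributors(element, derived_from, changed):
    """Return the upstream elements that appear in the set of changed elements."""
    return sorted(upstream(element, derived_from) & set(changed))


suspects = changed_contributors(
    "sales_projection", DERIVED_FROM, {"closed_won", "unrelated_table"}
)
```

Intersecting the upstream set with the recently changed elements narrows the review to the changes that could actually have moved the projection.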

Inside data catalogs, data lineage is a key documentation tool for all participants who create, steward, and analyze data. Data lineage helps establish a shared understanding of any dimension’s or measure’s computational context. One place to start with data catalogs is by capturing the data sources or data provenance and then using tools to trace data lineage.

The challenges for multicloud enterprises

The public clouds have some data lineage capabilities embedded in their platforms. For example, Azure Purview Data Catalog tracks source-to-target lineage, including column-level lineage. Google Cloud Data Fusion shows data-set and field-level changes for pipelines running on this data integration platform.

The challenge in implementing data lineage is that the organizations with the most to gain from data lineage’s transparency and diagnostics capabilities are also likely to have more heterogeneous data management, processing, and analytics tools.

When data warehouses, data lakes, data integration services, and analytics platforms operate on multiple clouds, multicloud data catalogs and lineage capabilities are required. Competing platforms that promote data lineage capabilities include Alex Solutions, ASG, Ataccama, Alation, Boomi, Collibra, DataKitchen, Erwin, IBM, Infogix, Informatica, Manta, Microsoft, Octopai, Oracle, SAP, SAS, Talend, and others. There are also several open source data lineage solutions.

OpenLineage aims to create standards for supporting data lineage across platforms. Initiatives that create implementation standards, interoperability protocols, and cross-platform integration capabilities are needed to increase the adoption of data lineage and other data governance practices.
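To give a feel for what a cross-platform lineage standard exchanges, the sketch below builds a run event as a plain dictionary and serializes it to JSON. The field names (`eventType`, `eventTime`, `run`, `job`, `inputs`, `outputs`, `producer`) follow my reading of the OpenLineage specification and should be verified against the current spec; the event shows only a subset of fields, and the `build_run_event` helper, the namespaces, the dataset names, and the producer URI are all hypothetical. It does not use the OpenLineage client library.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(job_name, inputs, outputs, event_type="COMPLETE", run_id=None):
    """Build a dict shaped like a lineage run event (subset of fields only)."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": "example-namespace", "name": job_name},
        # Each dataset is identified by a namespace and a name.
        "inputs": [{"namespace": "example-warehouse", "name": n} for n in inputs],
        "outputs": [{"namespace": "example-warehouse", "name": n} for n in outputs],
        # Hypothetical URI identifying the code that emitted the event.
        "producer": "https://example.com/hypothetical-etl",
    }


event = build_run_event(
    job_name="nightly_sales_rollup",
    inputs=["raw.orders", "raw.customers"],
    outputs=["analytics.sales_by_region"],
)
payload = json.dumps(event, indent=2)
```

Because the event names its input and output data sets explicitly, any tool that understands the shared format can stitch events from different platforms into one lineage graph.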

Considering how fast enterprise data is growing, the business value of machine learning capabilities, and the growing number of data regulations, more companies will have to increase their efforts to implement data governance and data lineage capabilities.

Copyright © 2021 IDG Communications, Inc.