Have you noticed that most organizations are trying to do a lot more with their data?
Businesses are investing heavily in data science programs, self-service business intelligence tools, artificial intelligence initiatives, and organizational efforts to promote data-driven decision making. Some are developing customer-facing applications by embedding data visualizations into web and mobile products or by collecting new forms of data from sensors (Internet of Things), wearables, and third-party APIs. Still others are harnessing intelligence from unstructured data sources such as documents, images, videos, and spoken language.
Much of the work around data and analytics focuses on delivering value from it. This includes the dashboards, reports, and other data visualizations used in decision making; the models that data scientists create to predict outcomes; and the applications that incorporate data, analytics, and models.
What has sometimes been undervalued is all the underlying data operations work, or dataops, required before data is ready for people to analyze, build into applications, and present to end users.
Dataops includes all the work to source, process, cleanse, store, and manage data. We’ve used complicated jargon to represent its different capabilities: data integration, data wrangling, ETL (extract, transform, and load), data prep, data quality, master data management, data masking, and test data management.
But just as a car is more than the sum of its parts, dataops is more than the sum of these disciplines.
Dataops is a relatively new umbrella term for the collection of data management practices whose goal is making users of the data—executives, data scientists, and even applications—successful in delivering business value from it.
How dataops works with other technology practices
Dataops shares aspects of agile methodologies in that it drives iterative improvements in data processing metrics and quality. It also shares aspects of devops, especially around automating data flows, enabling more frequent changes to data processing capabilities, and reducing recovery time when responding to data operational incidents.
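A small example makes the parallel concrete. Here is a minimal sketch of that automation mindset in Python: wrapping a pipeline step in automated retries so transient failures recover without a person paging through logs. The step name, attempt count, and backoff are all hypothetical.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("dataops")

def run_with_retry(step, max_attempts=3, backoff_seconds=30):
    """Run a pipeline step, retrying with a fixed backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception as exc:
            log.warning("%s failed (attempt %d of %d): %s",
                        step.__name__, attempt, max_attempts, exc)
            if attempt == max_attempts:
                raise  # surface the incident for on-call follow-up
            time.sleep(backoff_seconds)

def refresh_sales_extract():
    """Placeholder for a real extraction job (hypothetical)."""

run_with_retry(refresh_sales_extract)
```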
There’s even a published DataOps Manifesto with 20 principles covering culture (continually satisfy your customer), team dynamics (self-organize, interact daily), technical practices (create disposable environments), and quality (monitor quality and performance).
You might wonder why the term is needed or useful. The answer is that it simplifies the conversation and defines a role for this critical business function. It helps drive investments, align teams, and define priorities around business outcomes.
One way to better understand new terminology is to define it around people, process, technology, and culture.
Understand the people aspect of dataops
When it comes to people, there are several roles tied to dataops:
- Customers are the direct beneficiaries of the data, analytics, applications, and machine learning that are produced. They can be external product or service customers, or internal ones such as executives and leaders who use analytics to make decisions and other employees who consume data as part of business processes.
- Data end users include data scientists, dashboard developers, report writers, application developers, citizen data scientists, and others who consume data to deliver results through applications, data visualizations, APIs, or other tools.
- Dataops practitioners are the database engineers, data engineers, and other developers who directly manage the data flows and database tools.
- Data stewards who are responsible for data quality, definition, and linkages.
- Business owners who are typically the buyers of data services and own decisions around sourcing, funding, creating policies, and processing (the data supply chain).
Define dataops flow, development, and operational processes
Dataops has many processes and disciplines, but which of them an organization invests in and matures largely depends on the nature of its business needs, data types, data complexities, service-level requirements, and compliance factors.
One aspect of dataops is the data flow from source to delivery. This is the manufacturing line, managed through dataops development and operational processes. The data flows, or data pipelines, can be built on different data integration technologies, data cleansing technologies, and data management platforms. These processes not only bring in data but also provide tools for data stewards to manage exceptions to data quality and master data rules, enable data lineage and other metadata capabilities, and perform data archiving and removal procedures.
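To make this concrete, here is a minimal sketch of such a flow in Python with pandas, assuming hypothetical file paths, column names, and a single quality rule; real flows run on dedicated integration platforms and handle many sources, rules, and targets. Note that records failing the quality rule are routed aside for data stewards rather than silently dropped.

```python
import pandas as pd

# Hypothetical paths; a real flow would read from source systems
# and write to a warehouse or data lake.
SOURCE = "raw/customers.csv"
TARGET = "curated/customers.csv"
EXCEPTIONS = "exceptions/customers.csv"

def run_flow():
    raw = pd.read_csv(SOURCE)                    # source

    # Cleanse: normalize a key field, then apply a simple quality rule.
    raw["email"] = raw["email"].str.strip().str.lower()
    valid = raw["email"].str.contains("@", na=False)

    raw[valid].to_csv(TARGET, index=False)       # deliver clean records
    raw[~valid].to_csv(EXCEPTIONS, index=False)  # route to data stewards

if __name__ == "__main__":
    run_flow()
```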
The second aspect of dataops is the development process by which the data flows are maintained and enhanced. A good description of this process appears in the article “Dataops is not just devops for data.” The development process includes several stages: sandbox management, develop, orchestrate, test, deploy, and monitor. The orchestrate, test, and deploy stages are similar to a devops CI/CD pipeline.
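As an illustration of the test stage, data tests can run in CI/CD the same way unit tests do for application code. Here is a minimal sketch using pytest against the flow above; the file path, column names, and threshold are hypothetical.

```python
# test_customers_flow.py: run by pytest in the test stage of the pipeline.
import pandas as pd

CURATED = "curated/customers.csv"  # hypothetical output of the data flow

def test_customer_ids_are_unique():
    df = pd.read_csv(CURATED)
    assert df["customer_id"].is_unique

def test_email_completeness_meets_threshold():
    df = pd.read_csv(CURATED)
    completeness = df["email"].notna().mean()
    assert completeness >= 0.99, f"email completeness {completeness:.2%} is below 99%"
```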
The final aspect of dataops involves operations and managing the infrastructure. Like devops, some of this work is tied to managing production data flows and ensuring their reliability, security, and performance. Since data science workflows—especially around machine learning—are highly variable, there’s also the more challenging responsibility of provisioning scalable, high-performance development and data science environments that can be torn up and down to support varying workloads.
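Here is a minimal sketch of the tear-up-and-tear-down idea, assuming the Docker CLI is available; the image, password, and naming scheme are illustrative, and production teams would more likely rely on infrastructure-as-code and orchestration tooling.

```python
import subprocess
import uuid

def start_sandbox(image="postgres:15"):
    """Spin up a disposable database container as a data science sandbox."""
    name = f"sandbox-{uuid.uuid4().hex[:8]}"
    subprocess.run(
        ["docker", "run", "-d", "--name", name,
         "-e", "POSTGRES_PASSWORD=sandbox", image],
        check=True,
    )
    return name

def tear_down_sandbox(name):
    """Remove the container, and its state, once the experiment is done."""
    subprocess.run(["docker", "rm", "-f", name], check=True)

sandbox = start_sandbox()
# ... load a data subset, run experiments ...
tear_down_sandbox(sandbox)
```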
The vast landscape of dataops technologies
Because dataops covers a large number of data orchestration, processing, and management functions, a lot of technologies fit under this term. In addition, since many businesses are investing in big data, data science, and machine learning capabilities, the number of vendors competing in this space is significant.
Here is a brief starting point:
- Amazon Web Services has seven types of databases, from commonplace relational databases to document stores and key-value databases. Azure also offers several database types.
- Many tools build and run data flows, spanning batch data integration and real-time data streaming. Within a data flow, dedicated capabilities handle data quality and master data management.
- There are many tools with ties to development, data science, and testing aspects of dataops. Many organizations use Jupyter, but there are other options for data science work. For testing, consider tools such as Delphix and QuerySurge.
- Alteryx, Databricks, Dataiku, and H2O.ai provide end-to-end analytics and machine learning platforms that blend dataops, data science, and devops capabilities.
- Other tools tackle data security, data masking, and other data operations.
Competitive intelligence drives dataops culture
Devops came about because of the tension between application development teams, pressured by agile processes to release code frequently, and operations teams, which naturally slowed things down to ensure reliability, performance, and security. Devops teams aligned on the mission to do both well and invested in automation such as CI/CD, automated testing, infrastructure as code, and centralized monitoring to help bridge the technical gaps.
Dataops brings another group to the table. Data scientists, dashboard developers, data engineers, database developers, and other engineers work on data flows and data quality. In addition to managing the speed of releases and the performance, reliability, and security of the infrastructure, dataops teams also drive the competitive value of the data, analytics, machine learning models, and other data deliverables.
Competitive value is driven not only by the overall analytics deliverable but also by how dataops teams work through the complexities of processing data. How fast does data run through the flows? What volume of data, and what level of quality, is supported? How fast can the team integrate a new data source, and how versatile are the database platforms in supporting a growing variety of data modeling needs?
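Teams often answer these questions by instrumenting the flows themselves. Here is a minimal sketch, assuming each flow reports its input and clean-output row counts (the contract and metric names are hypothetical); in practice the resulting metrics would be shipped to a monitoring system.

```python
import time

def run_flow_with_metrics(flow, source_name):
    """Time a flow and derive basic throughput and quality indicators."""
    start = time.monotonic()
    rows_in, rows_clean = flow()  # assumed contract: (rows_in, rows_clean)
    duration = time.monotonic() - start
    return {
        "source": source_name,
        "duration_seconds": round(duration, 2),
        "rows_processed": rows_in,
        "rows_per_second": round(rows_in / duration, 1) if duration else None,
        "quality_rate": rows_clean / rows_in if rows_in else 1.0,
    }

# Example: a flow that read 1,000 rows and kept 990 clean ones.
print(run_flow_with_metrics(lambda: (1000, 990), "crm"))
```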
Questions and indicators like these are just a starting point for what dataops teams must examine. As more organizations achieve business value from data and analytics investments, expect a corresponding growth in demand for dataops practices and culture.