Data scientists have some practices and needs in common with software developers. Both data scientists and software engineers plan, architect, code, iterate, test, and deploy code to achieve their goals. For software developers this often means custom coding applications and microservices; data scientists implement data integrations with dataops, make predictions through analytical models, and create dashboards to help end users navigate results.
Devops engineers who automate workflows and collaborate with operations engineers should expand their scope and provide services to data scientists as part of their charter.
Larger organizations with multiple data science teams may invest in data science platforms such as Alteryx Analytics, Databricks, and Dataiku that provide a mix of tools for developing, testing, and deploying analytical models. These tools compete on dataops and analytics capabilities, integration options, governance, tools for business users, and deployment options.
Devops requirements for data scientists differ from those of application developers
Not every organization is ready to invest in a data science platform, and some have small data science teams that need only basic operational capabilities. In these cases, it may be better to apply devops best practices to data science teams than to select and instrument a platform.
To do this, many of the agile and devops paradigms used by software development teams can be applied to data science workflows, with some significant adjustments. Although data scientists’ processes resemble developers’ workflows, there are important differences.
- Data science work requires a lot more experimentation around data sets, models, and configuration. It’s not the simple plan, build, test, deploy cycle that most software development release management practices follow.
- Developing and testing models may not run on a uniform compute stack. Some models can be implemented with simple Python scripts, whereas others may require Apache Spark or other big data platforms.
- The computing needs can vary significantly even while models are in development. A data scientist who wants to test six variants of a model against a large data set is going to need a lot more compute and storage than another scientist testing one model at a time on a smaller data set.
- Models deployed into production also require ongoing maintenance, but there are more variables than just changing the underlying code. Models also require retraining with updated data sets, reconfiguring operating parameters, and adjusting infrastructure, all of which might trigger a new deployment.
- Monitoring data pipelines often requires more sophisticated validations. It’s not enough to know that a dataops process is running and that a model is processing data. Once in production, these pipelines must be monitored for throughput, error conditions, data source anomalies, and other conditions that might impact downstream results.
- To be successful, data scientists have to partner with developers, engineers, and business leaders, which can be a more daunting task than cementing a collaboration between developers and operations in application development. In addition, many data scientists and teams may not report into the IT organization, making it more difficult to dictate standards and governance to these groups.
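The monitoring point above can be sketched as a set of simple pipeline health checks. The metric names and thresholds below are invented for illustration; real pipelines would derive baselines from historical runs and feed alerts into a monitoring system:

```python
from dataclasses import dataclass

@dataclass
class PipelineMetrics:
    rows_processed: int
    error_count: int
    expected_rows_min: int  # baseline derived from historical runs

def check_pipeline_health(m: PipelineMetrics) -> list[str]:
    """Return a list of alerts; an empty list means healthy."""
    alerts = []
    # Throughput check: the job "ran" but processed too little data
    if m.rows_processed < m.expected_rows_min:
        alerts.append(
            f"throughput below baseline: {m.rows_processed} < {m.expected_rows_min}"
        )
    # Error-rate check: a 1% threshold, purely illustrative
    if m.error_count / max(m.rows_processed, 1) > 0.01:
        alerts.append(f"error rate above 1%: {m.error_count} errors")
    return alerts
```

A check like this catches the case where a dataops process completes without failing yet silently drops most of its input, which a simple "did the job run" monitor would miss.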
Supporting data scientists requires understanding these and other differences before embarking on devops practices and solutions. Here are some places to start.
Start with the data scientist experience
Like application developers, data scientists are most interested in solving problems, are very involved in configuring their tools, and often have less interest in configuring infrastructure. But unlike software developers, data scientists may not have the same experience and background to fully configure their development workflows. This presents an opportunity for devops engineers to treat data scientists as customers, help define their requirements, and take ownership in delivering solutions.
This can start with the data scientists’ infrastructure. Are they coding in Python, R, or other languages? What tools, such as Jupyter, Tableau, Apache Kafka, or NLTK, are they using for analysis and modeling? What databases and clouds are they using as data sources, for storing trained data, and for deploying models?
From there, a devops engineer can help select and standardize a development environment. This can be done traditionally on a computing device or on a virtualized desktop. Either way, mirroring their applications and configurations to the development environments is an important first step when working with data scientists.
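One low-effort way to start mirroring configurations is to snapshot the package versions in a working environment so they can be pinned in the shared one. A minimal sketch using Python's standard importlib.metadata; the package names passed in are examples:

```python
from importlib import metadata

def snapshot_environment(packages: list[str]) -> dict[str, str]:
    """Record the installed versions of the packages a team depends on,
    so the same versions can be pinned in shared dev environments."""
    snapshot = {}
    for name in packages:
        try:
            snapshot[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            snapshot[name] = "not installed"
    return snapshot

# e.g. snapshot_environment(["numpy", "pandas", "jupyter"])
```

The resulting mapping can be written into a requirements file or a container definition so every environment resolves the same versions.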
After that, devops engineers should review where data scientists store their code, how the code is versioned, and how code is packaged for deployments. Some data scientists are relatively new to using version control tools such as Git; others may be using a code repository but have not automated any integrations. Implementing continuous integration is a second place for devops engineers to help data scientists, as it creates standards and removes some of the manual work in testing new algorithms.
One thing to keep in mind is that some SaaS and enterprise data platforms may have built-in version control and don’t naturally interface with version control systems designed for code. Many of these platforms do have APIs to trigger integrations and deployments or other mechanisms that can mimic CI/CD pipelines.
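Where a platform exposes only an API, a CI job can call it to trigger a deployment. The endpoint path, payload fields, and token handling below are hypothetical; the real shapes come from the specific platform's API reference:

```python
import json
import urllib.request

def build_deploy_request(base_url: str, model_id: str,
                         version: str, token: str) -> urllib.request.Request:
    """Assemble (but do not send) a deployment trigger for a
    hypothetical platform API; all field names are illustrative."""
    payload = json.dumps({"model_id": model_id, "version": version}).encode()
    return urllib.request.Request(
        url=f"{base_url}/api/models/{model_id}/deploy",  # hypothetical endpoint
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# A CI step would then send it with urllib.request.urlopen(req).
```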
Define deployment pipelines and configure infrastructure
With a development environment and continuous integration standardized, devops engineers should then look to other aspects of automating test and production environments. This can be done by introducing deployment pipelines with tools like Jenkins and configuring infrastructure as code with Chef, Puppet, Ansible, or other tools.
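Whatever tool executes it, a deployment pipeline reduces to ordered, gated stages that stop on the first failure. A minimal tool-agnostic sketch; the stage names are examples, and a real pipeline would shell out to test runners, packagers, and infrastructure-as-code tools:

```python
from typing import Callable

def run_pipeline(stages: list[tuple[str, Callable[[], bool]]]) -> list[str]:
    """Run stages in order, stopping at the first failure,
    and return a log of the results."""
    log = []
    for name, stage in stages:
        ok = stage()
        log.append(f"{name}: {'ok' if ok else 'failed'}")
        if not ok:
            break  # later stages never run after a failure
    return log

# Illustrative pipeline definition
log = run_pipeline([
    ("unit tests", lambda: True),
    ("package model", lambda: True),
    ("deploy to staging", lambda: True),
])
```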
Data science environments are also strong candidates for containers like Docker and for container management and orchestration tools such as Kubernetes. Data science environments are often a mix of dataops, data management, and data modeling platforms that need to be deployed and managed as an integrated environment.
It is critical to understand the scale and frequency of running data integration, machine learning training, and other data analytics jobs. Devops engineers will likely find multiple patterns since data scientists do a mix of different workloads such as frequent testing of new models against partial data sets, scheduled runs to retrain production machine learning models, and special jobs to train new analytical models. These workload types should help devops engineers decide how best to configure and scale cloud infrastructure to meet different computing and storage needs.
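To make those workload patterns concrete, a rough back-of-the-envelope model can translate a grid of model variants into node-hours, which is the kind of estimate that informs cloud sizing. All constants below are invented for illustration, not benchmarks:

```python
import math

def estimate_cluster_hours(n_variants: int, dataset_gb: float,
                           gb_per_node: float = 64.0,
                           hours_per_run: float = 2.0) -> dict:
    """Rough sizing for a grid of model variants: each variant is one
    training run, and nodes are sized to hold the data set in memory.
    All default constants are illustrative."""
    nodes = max(1, math.ceil(dataset_gb / gb_per_node))
    return {
        "nodes": nodes,
        "node_hours": nodes * hours_per_run * n_variants,
    }

# e.g. six variants on a 200 GB data set vs. one variant on 10 GB
# shows how quickly experimentation multiplies compute needs.
```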
Developers, devops engineers, and data scientists should collaborate on business outcomes
The most important aspect of implementing devops is fostering collaboration between developers and engineers who have conflicting objectives. Developers are pressured to release application changes frequently, while engineers are accountable for the performance and reliability of production workflows. Having developers and engineers collaborate on automation and standardized configurations serves both objectives.
Data scientists are a third party to this collaboration. They are often under pressure to deliver analytics to executives and business managers. Other times they are developing models they hope developers will consume in their applications. They have a strong need for variable-capacity infrastructure and can be even more demanding than developers when experimenting with new platforms, libraries, and infrastructure configurations.
Data scientists require this partnership with developers and engineers to deliver successful analytics. Understanding their objectives, defining targeted goals, and partnering on devops implementations is how these groups can collaborate and deliver business outcomes.