Use synthetic data for continuous testing and machine learning

Where real data is unethical, unavailable, or doesn’t exist, synthetic data sets can provide the needed quantity and variety.

Devops teams aim to increase deployment frequency, reduce the number of defects found in production, and improve the reliability of everything from microservices and customer-facing applications to employee workflows and business process automations. 

Implementing CI/CD (continuous integration and continuous delivery) pipelines gives teams a repeatable path to building and deploying all of these applications and services, while automated and continuous testing practices help them maintain quality, reliability, and performance. With continuous testing, agile development teams can shift their testing left, grow the number of test cases, and increase testing velocity.

It’s one thing to build test cases and automate them; it’s another to have a sufficient volume and variety of test data to validate an adequate number of use cases and boundary scenarios. For example, testing a website registration form should cover many combinations of input patterns, including missing data, long data entries, special characters, multilingual inputs, and other scenarios.
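As a rough illustration of the registration form example, the sketch below builds every combination of a few input variants per field. The field names and sample values are hypothetical and chosen only to show the idea; a real form would need its own list of cases.

```python
import itertools

# Hypothetical field variants for a registration form. The field names and
# values are illustrative assumptions, not taken from any real application.
NAME_CASES = {
    "missing": "",
    "typical": "Ada Lovelace",
    "long": "x" * 300,
    "special_chars": "O'Brien <script>",
    "multilingual": "山田 太郎",
}
EMAIL_CASES = {
    "missing": "",
    "typical": "ada@example.com",
    "no_at_sign": "ada.example.com",
    "long": "a" * 250 + "@example.com",
}


def registration_test_inputs():
    """Yield one labeled input dict per combination of name and email case."""
    for (name_label, name), (email_label, email) in itertools.product(
        NAME_CASES.items(), EMAIL_CASES.items()
    ):
        yield {
            "case": f"name={name_label},email={email_label}",
            "name": name,
            "email": email,
        }


cases = list(registration_test_inputs())
print(len(cases), "generated input combinations")
```

Each generated dict could be fed to an automated form test, with the `case` label used to identify which combination failed.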

The challenge is generating test data. One approach is synthetic data generation, which uses different techniques to extrapolate data sets based on a model and set of input patterns. Synthetic data generation addresses the volume and variety of the data required. You can also use synthetic data generation to create data sets in cases where using real data might raise legal or other compliance issues.

“Synthetic data provides a great option when the needed data doesn’t exist or the original data set is rife with personally identifiable information,” says Roman Golod, CTO and cofounder of Accelario. “The best approach is to create synthetic data based on existing schemas for test data management or build rules that ensure your BI, AI, and other analyses provide actionable results. For both, you need to ensure the synthetic data generation automation can be fine-tuned according to changing business requirements.”
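To make the schema-based approach more concrete, here is a minimal sketch that generates rows from a column schema. The schema format, column names, and value ranges are assumptions invented for this example; commercial test data platforms work from real database schemas and far richer rules.

```python
import random
import string

# Hypothetical schema: column name -> (type, constraints). Both the format
# and the columns are assumptions made for this illustration.
SCHEMA = {
    "customer_id": ("int", {"min": 1, "max": 99999}),
    "first_name": ("str", {"max_len": 12}),
    "balance": ("float", {"min": 0.0, "max": 5000.0}),
    "is_active": ("bool", {}),
}


def synth_value(col_type, rules, rng):
    """Return one random value that satisfies the column's constraints."""
    if col_type == "int":
        return rng.randint(rules["min"], rules["max"])
    if col_type == "float":
        return round(rng.uniform(rules["min"], rules["max"]), 2)
    if col_type == "bool":
        return rng.random() < 0.5
    if col_type == "str":
        length = rng.randint(1, rules["max_len"])
        return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    raise ValueError(f"unsupported column type: {col_type}")


def synth_rows(schema, n, seed=0):
    """Generate n rows; a fixed seed keeps test data reproducible."""
    rng = random.Random(seed)
    return [
        {col: synth_value(col_type, rules, rng)
         for col, (col_type, rules) in schema.items()}
        for _ in range(n)
    ]


rows = synth_rows(SCHEMA, 5)
for row in rows:
    print(row)
```

Because the rules live in the schema rather than in the generator code, they can be adjusted as business requirements change without touching the generation logic.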

Use cases for synthetic data generation

While the most basic need for synthetic data generation stems from testing applications, automations, and integrations, demand is growing as data science testing requires test data for machine learning and artificial intelligence algorithms. Data scientists sometimes use synthetic data to train neural networks; at other times they use machine-generated data to validate a model’s results.

Other synthetic data use cases are more specific:

  • Testing cloud migrations by ensuring the same app running on two infrastructures generates identical results
  • Creating data for security testing, fraud detection, and other real-world scenarios where actual data may not exist
  • Generating data to test large-scale ERP (enterprise resource planning) and CRM (customer relationship management) upgrades where testers want to validate configurations before migrating live data
  • Generating data for decision-support systems to test boundary conditions, validate feature selections, provide a wider unbiased sample of test data, and ensure AI results are explainable
  • Stress testing AI and Internet of Things systems, such as autonomous vehicles, and validating their responses to different safety situations

If you are developing algorithms or applications with high-dimensionality data inputs and critical quality and safety factors, then synthetic data generation provides a mechanism for cost-effectively creating large data sets.

“Synthetic data is sometimes the only way to go since real data is either not available or not usable,” says Maarit Widman, data scientist at KNIME.

How platforms generate synthetic data

You might wonder how platforms generate synthetic test data and how to select optimal algorithms and configurations for creating the required data.

Widman explains, “There are two main strategies to generate synthetic data: based on statistical probabilities or based on machine learning algorithms. Recently, deep learning techniques like recurrent neural networks (such as long short-term memory networks) and generative adversarial networks have risen in popularity for their capability to generate new music, text, and images out of literally nothing.”

Data scientists use RNNs (recurrent neural networks) when there are dependencies between data points, such as time-series data and text analysis. LSTM (long short-term memory) creates a form of long-term memory through a series of repeating modules, each one with gates that provide a memory-like function. For example, LSTM in text analytics can learn the dependencies between characters and words to generate new character sequences. It is also used for music creation, fraud detection, and Google’s Pixel 6 grammar correction.
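Training an LSTM is beyond a short example, but the underlying idea of learning which characters tend to follow which, then sampling new sequences, can be sketched with a much simpler statistical stand-in: a character-level Markov chain. This is not an LSTM and does not capture long-range dependencies; the tiny corpus and the function names are assumptions for illustration only.

```python
import random
from collections import defaultdict


def train_char_model(text, order=2):
    """Map each `order`-character context to the characters that follow it."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        context = text[i:i + order]
        model[context].append(text[i + order])
    return model


def generate(model, seed_text, length, rng):
    """Extend seed_text by sampling a follower for the latest context."""
    order = len(seed_text)
    out = seed_text
    for _ in range(length):
        followers = model.get(out[-order:])
        if not followers:  # context never seen in training text
            break
        out += rng.choice(followers)
    return out


# Tiny illustrative corpus; real use would train on far more text.
corpus = "synthetic data sets can provide the needed quantity and variety of test data "
model = train_char_model(corpus, order=2)
sample = generate(model, "sy", 40, random.Random(1))
print(sample)
```

Followers are stored with repetition, so characters that follow a context more often in the training text are proportionally more likely to be sampled.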

GANs (generative adversarial networks) have been used to generate many forms of images, crack passwords in cybersecurity, and even put together a pizza. GANs create data by pitting two algorithms against each other: one generates candidate data patterns, and the second tests them. The adversarial competition between the two pushes the generated patterns toward optimal ones. Code examples of GANs to generate synthetic data include PyTorch handwritten digits, a TensorFlow model for developing one-dimensional Gaussian distributions, and an R model for simulating satellite images.

There’s an art and science to picking machine learning and statistics-based models. Andrew Clark, cofounder and CTO of Monitaur, explains how to experiment with synthetic data generation. He says, “The rule of thumb here is always to pick the simplest model for the job that performs with an acceptable level of accuracy. If you are modeling customer checkout lines, then a univariate stochastic process based off of a Poisson distribution would be a good starting point. On the other hand, if you have a large loan underwriting data set and would like to create test data, a GAN model might be a better fit to capture the complex correlations and relationships between individual features.”
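Clark’s checkout line starting point can be sketched as a Poisson arrival process, in which the gaps between customers are exponentially distributed. The arrival rate and duration below are made-up values, and the function name is hypothetical; this is one simple way to simulate such a process, not a description of any particular tool.

```python
import random


def simulate_arrivals(rate_per_minute, duration_minutes, seed=0):
    """Return sorted arrival times (in minutes) for a Poisson process.

    Inter-arrival gaps are drawn from an exponential distribution with
    the given rate, which is what defines a Poisson arrival process.
    """
    rng = random.Random(seed)
    arrivals = []
    t = rng.expovariate(rate_per_minute)
    while t < duration_minutes:
        arrivals.append(t)
        t += rng.expovariate(rate_per_minute)
    return arrivals


# Assumed values: on average 2 customers per minute over one hour.
arrivals = simulate_arrivals(rate_per_minute=2.0, duration_minutes=60)
print(f"{len(arrivals)} synthetic customers arrived in 60 minutes")
```

The resulting timestamps could drive a load test or a queueing simulation; a more complex model is only worth reaching for if this simple one fails to reproduce the behavior that matters.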

If you’re working on a data science use case, then you might want the flexibility to develop a synthetic data generation model. Commercial options include Chooch for computer vision, Datomize, and Deep Vision Data.

If your goal is application testing, consider platforms for test data management or synthetic test data generation, such as Accelario, Delphix, GenRocket, Informatica, K2View, and Tonic, as well as open source test data generators. Microsoft’s Visual Studio Premium also has a built-in test data generator, and Java developers can look at Vaadin’s data generator as an example.

Having a robust testing practice is incredibly important today because organizations depend on application reliability and the accuracy of machine learning models. Synthetic data generation is yet another approach to closing gaps. So not only do you have testing, training, or validating methodologies, but you also have a way of generating sufficient data to build models and validate applications.