MIT Computer Science & Artificial Intelligence Laboratory (CSAIL) spin-off DataCebo is offering a new tool, dubbed Synthetic Data (SD) Metrics, to help enterprises evaluate the quality of machine-generated synthetic data by comparing it against real data sets.
The tool, an open-source, model-agnostic Python library for evaluating tabular synthetic data, defines metrics for the statistics, efficiency, and privacy of data, according to Kalyan Veeramachaneni, a principal research scientist at MIT and co-founder of DataCebo.
“For tabular synthetic data, it's necessary to create metrics that quantify how the synthetic data compares to the real data. Each metric measures a particular aspect of the data—such as coverage or correlation—allowing you to identify which specific elements have been preserved or forgotten during the synthetic data process,” said Neha Patki, co-founder of DataCebo.
Metrics such as CategoryCoverage and RangeCoverage can quantify whether an enterprise's synthetic data covers the same range of possible values as the real data, Patki added.
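A minimal sketch of how those two coverage metrics can be invoked, assuming SDMetrics' current single-column API; the table and column values here are fabricated purely for illustration:

```python
import pandas as pd
from sdmetrics.single_column import CategoryCoverage, RangeCoverage

# Fabricated example columns standing in for real enterprise data.
real = pd.DataFrame({
    "plan": ["basic", "pro", "enterprise", "pro", "basic"],
    "age": [23, 35, 47, 52, 61],
})
synthetic = pd.DataFrame({
    "plan": ["basic", "pro", "pro", "basic", "basic"],  # never produces "enterprise"
    "age": [25, 33, 45, 50, 58],
})

# Each metric returns a score between 0 and 1; 1.0 means the synthetic
# column fully covers the categories or numeric range of the real column.
cat_score = CategoryCoverage.compute(
    real_data=real["plan"], synthetic_data=synthetic["plan"]
)
range_score = RangeCoverage.compute(
    real_data=real["age"], synthetic_data=synthetic["age"]
)
print(cat_score, range_score)
```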
“To compare correlations, the software developer or data scientist downloading SDMetrics can use the CorrelationSimilarity metric. There are a total of over 30 metrics and more are still in development,” said Veeramachaneni.
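A comparable sketch for the correlation check, assuming SDMetrics' column-pairs API; the two columns compared are again hypothetical:

```python
import pandas as pd
from sdmetrics.column_pairs import CorrelationSimilarity

# Illustrative two-column tables; the metric compares the pairwise
# correlation in the real data against the one in the synthetic data.
real = pd.DataFrame({"income": [30, 50, 70, 90], "spend": [10, 22, 29, 41]})
synthetic = pd.DataFrame({"income": [28, 55, 68, 95], "spend": [12, 20, 31, 39]})

# Returns a score in [0, 1]; 1.0 means the correlation was perfectly preserved.
score = CorrelationSimilarity.compute(
    real_data=real[["income", "spend"]],
    synthetic_data=synthetic[["income", "spend"]],
    coefficient="Pearson",
)
print(score)
```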
Synthetic Data Vault generates synthetic data
The SDMetrics library, according to Veeramachaneni, is part of the Synthetic Data Vault (SDV) Project, initiated at MIT's Data to AI Lab in 2016. Since 2020, DataCebo has owned and developed all aspects of the SDV.
The Vault, an ecosystem of synthetic data generation libraries, was started to help enterprises create data models for developing new software and applications.
“While there is a lot of work going on in the area of synthetic data, especially around autonomous vehicles or images, little is being done to help enterprises take advantage of it,” Veeramachaneni said.
“The SDV was developed to ensure that enterprises can download the packages for generating synthetic data in cases where no data was available or there was a chance of putting data privacy at risk,” Veeramachaneni added.
Under the hood, the company says it uses several graphical modeling and deep learning techniques, including Copulas, CTGAN, and DeepEcho.
Copulas, according to Veeramachaneni, has been downloaded over a million times, and models using the technique are in use at large banks, insurance firms, and companies focused on clinical trials.
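As a rough illustration of the copula-based approach, here is a minimal sketch using the open-source Copulas library; the table and its columns are fabricated for demonstration:

```python
import pandas as pd
from copulas.multivariate import GaussianMultivariate

# Fabricated numeric table standing in for a real enterprise data set.
real_data = pd.DataFrame({
    "balance": [1200.0, 3400.0, 560.0, 7800.0, 2100.0],
    "tenure_months": [12, 48, 6, 96, 24],
})

# Fit a Gaussian copula to the joint distribution, then sample new rows
# that follow the same statistical shape as the original data.
model = GaussianMultivariate()
model.fit(real_data)
synthetic_data = model.sample(5)
print(synthetic_data)
```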
CTGAN, a neural network-based model, has been downloaded over 500,000 times.
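The GAN-based workflow looks much the same. A minimal sketch, assuming the standalone ctgan package's current interface (the class was named CTGANSynthesizer in some older releases); the data and column names are again invented:

```python
import pandas as pd
from ctgan import CTGAN

# Fabricated mixed-type table; discrete columns must be named explicitly.
data = pd.DataFrame({
    "segment": ["retail", "sme", "retail", "corporate"] * 50,
    "revenue": [100.0, 250.0, 90.0, 5000.0] * 50,
})

model = CTGAN(epochs=5)          # a few epochs, just to demonstrate the flow
model.fit(data, discrete_columns=["segment"])
synthetic = model.sample(10)     # draw 10 synthetic rows
print(synthetic)
```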
Data sets with multiple tables or time-series data are also supported, the DataCebo founders said.