Amy Steier, Senior Machine Learning Scientist at Gretel.ai – Interview Series

Amy Steier is a Principal Machine Learning Scientist at Gretel.ai, the world’s most advanced privacy engineering platform. Gretel makes it easy to embed privacy by design into the fabric of data-driven technology. Its AI-powered open source libraries are designed to transform, anonymize and synthesize sensitive information.

Amy is a highly accomplished machine learning and data scientist with over 20 years of experience. Her passion is big data and uncovering hidden intelligence using machine learning, data mining, artificial intelligence, and statistical techniques. She is highly skilled in predictive modeling, classification, clustering, anomaly detection, data visualization, ensemble methods, information retrieval, cybersecurity analytics, NLP, recommendation models, and user behavior analysis.

What initially inspired you to pursue a career in computer science and machine learning?

My pure, unwavering and enduring love of data. The power, mystery, intrigue and potential of data have always fascinated me. Computer science and machine learning are tools to exploit this potential. It’s also terribly fun to work in a field where the state of the art is changing so rapidly. I love the intersection of research and product. It’s very satisfying to take state-of-the-art ideas, take them one step further, and then transform them to match existing, tangible product needs.

For readers who are unfamiliar, could you explain what synthetic data is?

Synthetic data is data that looks and acts like the original data, but is also different enough to satisfy certain use cases. The most common use case is the need to protect the confidentiality of information contained in the original data. Others include creating additional records to increase the size of the original dataset, or resolving a class imbalance or demographic bias in it.

Synthetic data allows us to continue to develop new, innovative products and solutions when the data necessary to do so would otherwise not be present or available.

How does the Gretel platform work to create synthetic data via APIs?

Gretel’s Privacy Engineering APIs allow you to ingest data into Gretel and explore the insights we are able to extract. These are the same APIs used by our console. By exposing the APIs through an intuitive interface, we hope to enable developers and data scientists to create their own workflows around Gretel.

While the console facilitates the creation of synthetic data, the APIs allow you to integrate the creation of synthetic data into your own workflow. I love using the APIs because they allow me to customize synthetic data creation for a very particular use case.

Could you discuss some of the tools offered by Gretel to help assess the quality of synthetic data?

After creating the synthetic data, Gretel will generate a synthetic data report. In this report, you can see the Synthetic Data Quality Score (SQS), as well as a Privacy Protection Level (PPL) score.

The SQS score is an estimate of how well the generated synthetic data retains the same statistical properties as the original dataset. In this sense, the SQS score can be thought of as a score of utility, or a score of confidence in whether the scientific conclusions drawn from the synthetic dataset would be the same had one used the original dataset instead.

The synthetic data quality score is calculated by combining the individual quality measures: field distribution stability, field correlation stability, and deep structure stability.

Field distribution stability is a measure of the ability of synthetic data to maintain the same field distributions as in the original data. Field correlation stability is a measure of how well correlations between fields have been maintained in the synthetic data. And finally, deep structure stability measures the statistical integrity of deeper multi-field distributions and correlations. To estimate this, Gretel compares a Principal Component Analysis (PCA) computed first on the original data and then again on the synthetic data.
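The three measures described above can be illustrated with a toy sketch. This is not Gretel's implementation; it uses rough stand-ins (a KS statistic for distribution stability, correlation-matrix distance for correlation stability, and a comparison of PCA explained-variance spectra for deep structure stability) on fabricated data, just to make the idea concrete.

```python
# Illustrative sketch (NOT Gretel's actual scoring code): three rough
# analogues of the quality measures, computed on toy data.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
original = rng.normal(size=(500, 3))
# Stand-in "synthetic" data: the original plus a little noise.
synthetic = original + rng.normal(scale=0.1, size=(500, 3))

# 1. Field distribution stability: per-column two-sample KS statistic
#    (0 means identical empirical distributions), averaged and inverted.
dist_stability = np.mean(
    [1 - ks_2samp(original[:, i], synthetic[:, i]).statistic for i in range(3)]
)

# 2. Field correlation stability: closeness of the two correlation matrices.
corr_stability = 1 - np.mean(
    np.abs(np.corrcoef(original, rowvar=False) - np.corrcoef(synthetic, rowvar=False))
)

# 3. Deep structure stability: compare PCA explained-variance spectra.
def explained_variance(x):
    x = x - x.mean(axis=0)
    s = np.linalg.svd(x, compute_uv=False)
    return s**2 / np.sum(s**2)

pca_stability = 1 - np.sum(
    np.abs(explained_variance(original) - explained_variance(synthetic))
)

score = np.mean([dist_stability, corr_stability, pca_stability])
print(f"dist={dist_stability:.2f} corr={corr_stability:.2f} "
      f"pca={pca_stability:.2f} overall={score:.2f}")
```

Since the toy "synthetic" data is nearly identical to the original, all three sub-scores come out close to 1; real synthetic data generated by a model would score lower, and that gap is what a quality score quantifies.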

How do Gretel privacy filters work?

Gretel’s privacy filters are the culmination of extensive research into the nature of adversarial attacks on synthetic data. The filters prevent the creation of synthetic data with weaknesses commonly exploited by adversaries. There are two: the similarity filter and the outlier filter. The similarity filter prevents the creation of synthetic records that are too similar to a training record; these are prime targets for adversaries seeking to better understand the original data. The outlier filter prevents the creation of synthetic records that would be considered outliers in the space defined by the training data. Outliers revealed in a synthetic dataset can be exploited by membership inference attacks, attribute inference, and a wide variety of other adversarial attacks; they represent a serious privacy risk.
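A conceptual sketch of the two filters, not Gretel's actual code: reject candidate synthetic records whose nearest-neighbor distance to the training data is either too small (near-copies of a real record) or too large (outliers). The data and both thresholds here are hypothetical, chosen purely for illustration.

```python
# Conceptual sketch (NOT Gretel's implementation) of a similarity filter
# and an outlier filter, both based on nearest-neighbor distance.
import numpy as np

rng = np.random.default_rng(1)
train = rng.normal(size=(200, 4))       # stand-in training data
candidates = rng.normal(size=(50, 4))   # stand-in candidate synthetic records

# Pairwise Euclidean distances: one row per candidate, one column per
# training record; then the distance to each candidate's closest record.
dists = np.linalg.norm(candidates[:, None, :] - train[None, :, :], axis=-1)
nearest = dists.min(axis=1)

SIMILARITY_FLOOR = 0.05                       # hypothetical threshold
OUTLIER_CEILING = np.quantile(nearest, 0.95)  # hypothetical threshold

too_similar = nearest < SIMILARITY_FLOOR   # near-copy: memorization risk
too_isolated = nearest > OUTLIER_CEILING   # outlier: inference-attack risk
keep = candidates[~(too_similar | too_isolated)]
print(f"kept {keep.shape[0]} of {candidates.shape[0]} candidate records")
```

The design point is that both failure modes are detected with the same statistic: a record sitting almost on top of a training point leaks that point, while a record far from all training points leaks the existence of an unusual individual.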

How can synthetic data help reduce AI bias?

The most common technique is to address representational bias in the data feeding an AI system. For example, if there is a strong class imbalance in your data, or if there may be a demographic bias in your data, Gretel offers tools to help first measure the imbalance and then resolve it in synthetic data. By removing the bias in the data, you often remove the bias in the AI system built on the data.
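A minimal sketch of the measure-then-resolve idea on toy labels. Plain replication stands in here for conditional synthetic generation; Gretel's tooling would instead synthesize new, varied minority-class records with a generative model.

```python
# Toy sketch: measure a class imbalance, then bring the minority class
# up to parity. Replication stands in for conditional synthesis.
import numpy as np

labels = np.array(["A"] * 900 + ["B"] * 100)  # toy 9:1 imbalance

# Step 1: measure the imbalance.
classes, counts = np.unique(labels, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))  # {'A': 900, 'B': 100}

# Step 2: generate extra minority-class records until classes are balanced.
deficit = counts.max() - counts.min()
minority = classes[counts.argmin()]
extra = np.full(deficit, minority)  # stand-in for synthesized records
balanced = np.concatenate([labels, extra])

_, new_counts = np.unique(balanced, return_counts=True)
print(dict(zip(classes.tolist(), new_counts.tolist())))  # {'A': 900, 'B': 900}
```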

You clearly enjoy learning about new machine learning technologies, how do you personally keep up with all the changes?

Read, read and read some more, lol! I like to start my day by reading about new ML technologies. Medium knows me so well. I enjoy reading articles in Towards Data Science, Analytics Vidhya, and newsletters like The Sequence. Facebook AI, Google AI, and OpenMined all have great blogs. There are a plethora of good conferences to follow, such as NeurIPS, ICML, ICLR, and AISTATS.

I also appreciate tools that follow citation trails, help you find articles similar to ones you like, learn your specific interests, and watch in the background for articles you might find relevant. Zeta Alpha is one of those tools that I use a lot.

Finally, you really can’t overestimate the benefit of having colleagues with similar interests. At Gretel, the ML team tracks research papers relevant to the areas we explore and meets frequently to discuss papers of interest.

What is your vision for the future of machine learning?

Easy access to data will launch a great era of innovation in machine learning which will then drive innovation in a wide range of fields such as healthcare, finance, manufacturing and biosciences. Historically, many groundbreaking advances in ML can be attributed to a large volume of rich data. Yet historically, much research has been hampered by the inability to access or share data due to privacy concerns. As tools like Gretel break down this barrier, access to data will become more democratic. The entire machine learning community will benefit from access to rich, large datasets, instead of just a few elite mega-corporations.

Is there anything else you would like to share about Gretel?

If you love data, you’ll love Gretel (so clearly I love Gretel!). Easy access to data has been the thorn in the side of every data scientist I have known. At Gretel, we take great pride in having created a console and set of APIs that make creating private and shareable data as easy as possible. We deeply believe that data is more valuable when shared.

Thank you for this great interview and for sharing your insights, readers who want to learn more should visit Gretel.ai.

Sherry J. Basler