Realistic Synthetic Data Generation at Scale - Modeling Production Data Without Exposure

Overview

Watch a 33-minute conference talk from SDC 2020 on generating realistic synthetic test data at scale: data that mirrors the characteristics of production data without exposing actual customer information. Learn about Druva's methodology for modeling and generating test datasets that preserve realistic patterns and relationships while remaining entirely synthetic. The talk covers analyzing production data patterns, building models that capture key variables such as file sizes and directory structures, and generating controlled random data that reflects real-world usage. It also covers modeling directory trees, file distributions, naming conventions, and other variables that matter when testing backup software, antivirus tools, and legal discovery applications. The resulting datasets are versatile and repeatable, enabling thorough product testing while protecting sensitive production data. Principal Performance Engineer Mehul Sheth shares practical strategies that also apply to other data types, including mailboxes and transactional databases.
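To make the idea of "controlled random data" concrete, here is a minimal Python sketch of a seeded generator that produces a synthetic directory tree with randomly drawn file sizes. It is not the method presented in the talk. The function name `generate_tree`, the log-normal size distribution, and all parameter values are assumptions chosen for illustration; in practice such parameters would be fitted to statistics measured on production data.

```python
import random


def generate_tree(seed, depth=3, max_dirs=3, max_files=5,
                  size_mu=9.0, size_sigma=2.0):
    """Return a list of (path, size_in_bytes) describing a synthetic file tree.

    All parameters are hypothetical model knobs, not values from the talk.
    """
    # A private RNG keeps the output repeatable: same seed, same tree.
    rng = random.Random(seed)
    entries = []

    def walk(path, level):
        # Number of files and each file size are both random draws.
        for i in range(rng.randint(0, max_files)):
            # Log-normal is an assumed stand-in for a fitted size distribution.
            size = int(rng.lognormvariate(size_mu, size_sigma))
            entries.append((f"{path}/file_{i}.dat", size))
        # Recurse into a random number of subdirectories until max depth.
        if level < depth:
            for d in range(rng.randint(1, max_dirs)):
                walk(f"{path}/dir_{d}", level + 1)

    walk("root", 0)
    return entries
```

Calling `generate_tree(42)` twice returns the same list, which is what makes a test run reproducible; changing the seed or the distribution parameters yields a different dataset from the same model, with no production file names or contents involved.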

Syllabus

SDC 2020: Realistic Synthetic Data at scale: Influenced by, but not production data

Taught by

SNIAVideo


