Overview
Watch a 33-minute conference talk from SDC 2020 on generating realistic synthetic test data at scale: data that mirrors the characteristics of production data without exposing actual customer information. Learn about Druva's methodology for modeling and generating test datasets that preserve authentic patterns and relationships while remaining entirely synthetic. Discover techniques for analyzing production data patterns, building models that capture key variables such as file sizes and directory structures, and generating controlled random data that reflects real-world usage. Explore approaches for modeling directory trees, file distributions, naming conventions, and other variables that matter when testing backup software, anti-virus tools, and legal discovery applications. Gain insights into creating versatile, repeatable synthetic datasets that enable thorough product testing while protecting sensitive production data. Principal Performance Engineer Mehul Sheth shares practical strategies for synthetic data generation that can also be applied to other data types, including mailboxes and transactional databases.
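To make the idea of controlled, repeatable random data concrete, here is a minimal Python sketch. It is not taken from the talk and does not represent Druva's implementation. The function name `generate_tree_spec`, its parameters, and the choice of a log-normal distribution for file sizes are all illustrative assumptions; in practice the distribution and its parameters would be fitted to statistics gathered from production metadata.

```python
import random


def generate_tree_spec(seed, num_dirs=5, files_per_dir=4, mu=9.0, sigma=2.0):
    """Return a list of (path, size_in_bytes) tuples; nothing is written to disk.

    All names and defaults here are hypothetical. mu and sigma parameterize
    an assumed log-normal file-size distribution.
    """
    # A private, seeded generator makes every run with the same seed identical.
    rng = random.Random(seed)

    # Grow a directory tree by attaching each new directory to a random
    # existing one, so the depth and shape vary with the seed.
    dirs = ["root"]
    for i in range(num_dirs):
        parent = rng.choice(dirs)
        dirs.append(f"{parent}/dir{i}")

    # Assign each directory a fixed number of files with random sizes.
    spec = []
    for d in dirs:
        for j in range(files_per_dir):
            size = max(1, int(rng.lognormvariate(mu, sigma)))
            spec.append((f"{d}/file{j}.dat", size))
    return spec


if __name__ == "__main__":
    for path, size in generate_tree_spec(seed=42)[:5]:
        print(path, size)
```

Because the generator is seeded, the same seed reproduces the same dataset description, which is what makes a test run repeatable. Changing the seed or the distribution parameters yields a different but similarly shaped dataset.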
Syllabus
SDC 2020: Realistic Synthetic Data at scale: Influenced by, but not production data
Taught by
SNIAVideo