Using Apache Spark and Differential Privacy for 2020 Census Data Protection

Databricks via YouTube

Overview

This 29-minute talk explains how Apache Spark and differential privacy were used to protect respondent confidentiality in the 2020 US Census. It covers the challenge of balancing data accuracy against privacy protection for data used to distribute $675 billion in federal funds and to apportion the US House of Representatives. Learn about the custom-built Spark application that performs millions of optimizations using mixed integer linear programs on a massive cluster, the design of this differential privacy application, and the monitoring systems built in Amazon's GovCloud to oversee multiple clusters and thousands of application runs. The talk also introduces the TopDown Algorithm (TDA), discusses the key challenges in monitoring Spark, and explains how the Disclosure Avoidance System enforces global confidentiality protections for census data.

Syllabus

Intro
Abstract
Outline
Privacy and the Decennial Census
2010 Census: Summary of Publications (approximate counts)
We performed a database reconstruction and re-identification attack for all 308,745,538 people in the 2010 Census
The basic idea of differential privacy: Uncertainty (noise) protects privacy
The Census Bureau is using differential privacy for the 2020 Census.
How much noise do we add? That's a policy decision.
We planned to create a Disclosure Avoidance System that dropped into the Census production system.
The Disclosure Avoidance System allows the Census Bureau to enforce global confidentiality protections
Our DP mechanism protects histograms of person types. Census "block"
Running the block-by-block algorithm with Spark
In 2018 we invented the TopDown Algorithm (TDA)
Key challenges in monitoring Spark
We created our own monitoring framework
Cluster List
Each DAS run is a "mission"
Mission Report
System Load
Free Memory
In Summary
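
The syllabus item "Uncertainty (noise) protects privacy" can be illustrated with a minimal sketch. It adds Laplace noise to a small histogram of person types for one census block. This is a textbook illustration only, not the Census Bureau's actual mechanism or the TopDown Algorithm. The function names, the example counts, and the choice of epsilon are all hypothetical, and the sketch assumes each person falls into exactly one histogram cell, so adding or removing one person changes the histogram by at most 1.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution centered at 0 with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_histogram(counts, epsilon, rng=None):
    """Return a copy of `counts` with Laplace noise added to every cell.

    Assumption: sensitivity is 1 (one person affects one cell by one),
    so the noise scale is 1 / epsilon. Smaller epsilon means more noise.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    scale = 1.0 / epsilon
    return {key: value + laplace_noise(scale, rng) for key, value in counts.items()}


# Hypothetical person-type counts for a single block.
block_counts = {"adult": 12, "child": 5, "senior": 3}
noisy = noisy_histogram(block_counts, epsilon=0.5, rng=random.Random(0))
for person_type, value in noisy.items():
    print(person_type, round(value, 2))
```

The noisy values are real numbers and can be negative, which is one reason a production system needs post-processing before publishing counts. The choice of epsilon sets the accuracy versus privacy trade-off, which the talk describes as a policy decision.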

Taught by

Databricks
