Overview
Syllabus
Intro
The Business Intelligence use case How BI tools connect to Databricks
Data growth
Challenges and opportunities Breaking down the extract problem Problem
Fetching query results Result pagination
Importing tables Use internal compute engine
Serving results before Arrow Multiple layers of conversion
Serving results with Arrow Bring results faster to the client
Collecting results in Arrow format Tasks generate Arrow batches
Arrow batch sizing Fetching Arrow batches
Improvements with Arrow Speedups of less than 3x
Extract bottlenecks
New data extract architecture Cloud Fetch system design
Inlining small results Hybrid results
Data layout File sizing and pagination
Fetching results from URLs Parallel file downloads
Cloud Fetch performance Extract faster than BI tools can ingest
Cloud Fetch in the wild Outperforms direct fetch by an order of magnitude
Conclusions Scaled up extract workloads using cloud storage
DATA+AI SUMMIT 2022
Taught by
Databricks