Building a Data Platform with Apache Spark on Kubernetes

Overview

Explore the challenges and solutions of building a data platform with Apache Spark on Kubernetes in this 31-minute conference talk. Learn how PUBG Corporation, which serves millions of online gamers, migrated its on-demand data analytics platform to Spark on Kubernetes. Discover the Sphynx project, which manages on-demand Spark clusters and Jupyter Notebooks as containerized applications on Kubernetes. Gain insights into the main log pipeline, Apache Spark, the layer platform, notebooks, the batch system, and the data system domain. Understand Kubernetes deployment, scheduling, and the platform architecture. Delve into workflows, best practices, challenges, monitoring strategies, and future work. Walk away with key takeaways for running Spark on Kubernetes in large-scale data processing environments.

Syllabus

Introduction
Overview
Main Log Pipeline
Apache Spark
Layer Platform
Notebooks
Batch System
Spark Platform
Data System Domain
Problems
What is Kubernetes
Kubernetes Deployment
Kubernetes Scheduler
Platform Architecture
Workflow
Best Practices
Challenges
Monitoring
Future Work
Key Takeaways
Questions