HDFS CSI Plugin: Speeding Up Kubernetes in On-Premises Big Data Clusters

Overview

Explore the integration of Kubernetes with on-premises big data clusters through this conference talk. Learn about the HDFS CSI Plugin design and architecture, addressing the challenge of consuming HDFS data with Kubernetes. Discover best practices for running Spark workloads on Kubernetes with HDFS access using the CSI plugin. Examine performance comparisons between Spark on Kubernetes with HDFS and Spark on YARN with HDFS using the TPC-DS benchmark suite. Gain insights into big data history, containerization benefits, Kubernetes architecture, CSI core services, volume lifecycle management, and Hadoop HDFS characteristics as persistent volumes. Understand the potential of Kubernetes as an alternative to Hadoop YARN for resource scheduling in on-premises big data environments.

Syllabus

Intro
Outline
Big Data History Cont.
Big Data Stack
Big Data Trend
Benefit of Containerization
Kubernetes Architecture
Challenges
CSI (Container Storage Interface)
CSI Core Services
CSI Advanced Features
Volume Lifecycle
Controller and Node Services
Kubernetes Storage
Kubernetes CSI Support
PV, PVC and Storage Class
Package and Deployment Suggestion
Hadoop HDFS
HDFS Cluster Scale
Apache Ozone
HDFS/Ozone as PV
HDFS Characteristics as PV
HDFS NFS Gateway CSI
Ozone CSI
Resources