Fixing Small Files Performance Issues in Apache Spark Using DataFlint
Big Data Demystified via YouTube
Overview
Learn how to optimize Apache Spark performance by addressing small files issues in data lakes in this 26-minute technical lecture. Explore how Spark's interactions with the storage layer depend on file management in big data environments, with a focus on best practices for file sizing. Discover how Apache Spark processes files at the task level, and learn techniques for identifying and resolving small files problems using the open-source DataFlint library. Gain practical insights into handling small files challenges when working with modern table formats such as Delta Lake and Iceberg. The lecture is delivered by Meni Shmueli, founder of DataFlint and an experienced big data specialist who has helped numerous companies improve the performance of their data operations and their development efficiency.
Syllabus
Fixing small files performance issues in Apache Spark, using DataFlint [English]
Taught by
Big Data Demystified