Fault Tolerance in Distributed Systems - A Case Study in Apache Spark
Scala Days Conferences via YouTube
Overview
Explore fault tolerance in distributed systems through a case study of Apache Spark in this Scala Days conference talk. Delve into the challenges of building robust distributed computing platforms and learn lessons applicable to developing your own systems. Examine how Spark implements fault tolerance, including its use of Scala and functional programming principles, as well as the places where it deviates from concepts like immutability. Gain insight into the Spark computation model, its similarities to MapReduce, and how it extends beyond that paradigm. Understand what fault tolerance means in practice, including how to handle hardware failures and the importance of fault injection testing. Discover the limits of platform guarantees and the essential questions to ask when evaluating distributed systems. Through real-world examples and code analysis, gain a deeper understanding of how fault tolerance is implemented and what it implies for both system developers and users.
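To make the computation model concrete: Spark records transformations over immutable RDDs as a lineage graph, so a lost partition can be recomputed from its parents instead of being restored from replicas. The sketch below is an illustration of that MapReduce-style model under assumed setup (a local Spark dependency; the object name WordCountSketch and the sample lines are my own, not taken from the talk).

// A minimal sketch, assuming Spark is on the classpath and run in local mode.
import org.apache.spark.sql.SparkSession

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("word-count-sketch")
      .master("local[*]")            // run locally for illustration
      .getOrCreate()
    val sc = spark.sparkContext

    // A MapReduce-style word count expressed as RDD transformations.
    // Transformations are lazy: Spark records each RDD's lineage, so a partition
    // lost to a failed executor can be recomputed from its parents.
    val lines = sc.parallelize(Seq(
      "spark tolerates faults by recomputation",
      "lost partitions are rebuilt from lineage"
    ))
    val counts = lines
      .flatMap(_.split("\\s+"))      // "map" phase: split lines into words
      .map(word => (word, 1))        // key each word with a count of 1
      .reduceByKey(_ + _)            // "reduce" phase: sum counts per word

    counts.collect().foreach(println)
    spark.stop()
  }
}

On a single local machine recomputation is rarely triggered, but on a cluster this lineage is what lets Spark re-run only the failed pieces of a job.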
Syllabus
Intro
Spark Computation Model: Like MapReduce
Beyond MapReduce
What is Fault Tolerance?
One Bad Disk Spoils The Whole Bunch
Handling Flaky Hardware (SPARK-8425)
Miracles do happen
Testing Fault Tolerance
Fault Injection Testing
What Fault Tolerance might mean to you
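For the "Fault Injection Testing" item above, one common way to observe Spark's task-level fault tolerance is to make every task fail on its first attempt and watch Spark retry it. This is a minimal sketch of that idea, not code from the talk: the object name FaultInjectionSketch is hypothetical, and the local[2,3] master string is used so that local mode allows up to 3 failures per task.

import org.apache.spark.{SparkConf, SparkContext, TaskContext}

object FaultInjectionSketch {
  def main(args: Array[String]): Unit = {
    // local[2,3]: two worker threads, up to 3 attempts per task, so retries happen locally.
    val conf = new SparkConf().setAppName("fault-injection-sketch").setMaster("local[2,3]")
    val sc   = new SparkContext(conf)

    val total = sc.parallelize(1 to 100, numSlices = 4)
      .map { n =>
        // Injected fault: every task's first attempt throws, forcing Spark to
        // reschedule the task and recompute the partition from lineage.
        if (TaskContext.get().attemptNumber() == 0)
          throw new RuntimeException("injected fault for testing")
        n
      }
      .sum()

    println(s"sum computed despite injected faults: $total") // expect 5050.0
    sc.stop()
  }
}

The job still produces the correct result because the failed first attempts are retried and only the affected partitions are recomputed; running with maxFailures set to 1 instead would make the same injected fault kill the job.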
Taught by
Scala Days Conferences