Syllabus
Intro
How does data reach the disk?
fsync is really important
It's hard to get durability correct Applications find it difficult
fsync can fail Durability gets harder to get right
Why care about fsync failures? "About a year ago the PostgreSQL community discovered that fsync (on Linux and some BSD systems) may not work the way we always thought it is [sic], with possibly disastrous consequences for data durability/consistency (which is something the PostgreSQL community really values)."
Our work Systematically understand fsync failures
File System Results
Application Results
Outline
File System Methodology: Fault Injection
File System Methodology: Workloads Common write patterns in applications • Reduced to simplest form
File System Result #1: Clean Pages Dirty page is marked clean after fsync failure on all three file systems
File System Result #2: Page Content File systems do not handle fsync errors uniformly • Page content depends on file system
File System Result #3: In-memory state In-memory data structures are not entirely reverted
Applications Five widely used applications
Applications Results: Overview Ext4 Ordered Mode
Crash/Restart Simple strategies fail • Crash/restart is incorrect: it recovers wrong data from the page cache • Example: PostgreSQL
Applications Results #1: False Failures False Failures: Indicate failure but actually succeed
Late Error Reporting All applications susceptible to data loss on ext4 data mode
Btrfs winning?
Applications Results Summary Simple strategies fail • Applications have moved away from retries
Challenges and Directions
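The file system results listed above (a dirty page is marked clean after an fsync failure) suggest why a simple retry can be misleading: a second fsync may report success even though the data never reached the disk. The following is a minimal Python sketch of that idea, not the method used in the talk. The names durable_write and DurabilityError are hypothetical, and the fsync parameter exists only so that a failure can be injected for illustration.

```python
import os


class DurabilityError(Exception):
    """Hypothetical error type: the data may not be on stable storage."""


def durable_write(path, data, fsync=os.fsync):
    """Write data to path and call fsync exactly once.

    Assumption for illustration: a failed fsync is treated as fatal for
    this write and is never retried, because the page cache may already
    have marked the dirty pages clean.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # os.write may write fewer bytes than requested, so loop.
        offset = 0
        while offset < len(data):
            offset += os.write(fd, data[offset:])
        try:
            fsync(fd)
        except OSError as exc:
            # Do not retry: a later fsync could succeed without the
            # failed pages ever being written back.
            raise DurabilityError(f"fsync failed for {path}") from exc
    finally:
        os.close(fd)
```

A caller that catches DurabilityError would then need to decide how to recover without trusting the page cache contents, which is the difficulty the crash/restart entry above points at.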
Taught by
USENIX