Overview
Explore a conference talk on ScaleCheck, an innovative approach for discovering scalability bugs in large distributed systems using a single machine. Learn about the program analysis technique employed to identify potential causes of scalability issues and the colocation techniques used to test implementation code at real scales on a commodity PC. Discover how ScaleCheck has been integrated into popular storage systems like Cassandra, HDFS, Riak, and Voldemort, successfully exposing both known and unknown scalability bugs at scales up to 512 nodes on a 16-core PC. Gain insights into the methodology, including Naive Packing, Single Process Cluster, and Global Event Driven Architecture, as well as the concept of Colocation Factor. Understand the limitations and future work focused on scale-dependent CPU processing time.
Syllabus
Intro
ScaleCheck: A Single-Machine Approach for Discovering Scalability Bugs in Large Distributed Systems
An Example: Cassandra Bug #3831
The "Flapping" Bug(s)
Outline introduction
Naive Packing (NP)
Single Process Cluster (SPC) Deploy nodes as threads in a single process
Per-Node Services Frequent design pattern
Global Event Driven Architecture (GEDA) One global event handler per service
Finding New Bugs
Colocation Factor
Limitations and Future Work Focus on scale-dependent CPU processing time
Taught by
USENIX