Overview
Syllabus
Intro
Why Large Bare Metal Boxes? • Faster local communication UNIX Domain Sockets Shared Memory
The Scale in our Department • 100K processes across hundreds of physical machines
SysV semaphore bottleneck (AIX)
Observations and Findings AIX CPU measurement when hyper-threading is very misleading No 'out of the box metrics on SysV IPC operations Sporadic slowness (depending on concurrency/contention)
SysV shared memory bottleneck (Linux) • Low-level application infrastructure code dropping messages Messaging leverages a form of "zero copy" IPC using Sysv
SysV shared memory bottleneck (Linux RHEL 6) The micro-benchmark
Case #2: Observations and Findings • No 'out of the box metrics on SysV IPC operations
UNIX domain socket bottleneck (Solaris) • Critical software infrastructure experiencing timeouts on load Identity management with very strict SLOS Narrowing down the problem A key SLI for the service is token generation latency
An Aside: Histograms and Distributions are Useful! • More representative of the data set
An Aside: A Histogram Example
Early Observations • No out of the box metrics on socket operations
Case #3: UNIX domain socket bottleneck (Solaris) The micro-benchmarkt-testing against size
Case #3: Conclusions • Solaris 11.3 is limited to a max of 256K UDS sockets
Task clone and exit bottleneck (Linux)
More Summary (Plea to Kernel Folks) • The Prime Directive of Monitoring: Non-interference
References
Taught by
USENIX