Syllabus
Intro
Parameter Server Training Overview
Adaptive Learning Rate
Synchronous Parameter Server Training
Evaluation by Estimator
Problems with Multi-Client Setup
Benefits of Single-Client Setup
Problems of Single-Client Setup
Schedule/Join APIs
Custom Training Loop with PS
Current Limitations of the APIs
Benefits of Inline Evaluation
Current Limitations of Inline Evaluation
Variable Sharding
Ongoing and Future Work
Runtime, Performance, and Scalability
Parameter server training in runtime
Invoke model func with async schedule API
Distributed functions in PS training
Large embedding model
Performance compared with Estimator
Worker profiles with multi-step packing
Multi-step packing: pros and cons
Preemptions and failures
Fault tolerance: worker failures
Large-scale fault tolerance testing
Run jobs with preemptible resources
Multi-worker testing framework
MLCompass dashboard
Taught by
TensorFlow