Classroom Contents
Inside TensorFlow - Parameter Server Training
- 1 Intro
- 2 Parameter Server Training Overview
- 3 Adaptive Learning Rate
- 4 Synchronous Parameter Server Training
- 5 Evaluation by Estimator
- 6 Problems with Multi-Client Setup
- 7 Benefits of Single-Client Setup
- 8 Problems of Single-Client Setup
- 9 Schedule/Join APIs
- 10 Custom Training Loop with PS
- 11 Current Limitations of the APIs
- 12 Benefits of Inline Evaluation
- 13 Current Limitations of Inline Evaluation
- 14 Variable Sharding
- 15 Ongoing and Future Work
- 16 Runtime, Performance, and Scalability
- 17 Parameter server training in runtime
- 18 Invoke model func with async schedule API
- 19 Distributed functions in PS training
- 20 Large embedding model
- 21 Performance compared with Estimator
- 22 Worker profiles with multi-step packing
- 23 Multi-step packing: pros and cons
- 24 Preemptions and failures
- 25 Fault tolerance: worker failures
- 26 Large-scale fault tolerance testing
- 27 Run jobs with preemptible resources
- 28 Multi-worker testing framework
- 29 MLCompass dashboard