Overview
Syllabus
Intro
Training in Distributed ML Systems
Parameters in Distributed ML Systems
Issues with Empirical Parameter Tuning
Proposals for Automatic Parameter Adaptation
Open Challenges
Existing Approaches for Adaptation
KungFu Overview
Adaptation Policies
Example: Adaptation Policy for GNS
Embedding Monitoring Inside Dataflow
  Problem: High monitoring cost reduces adaptation benefit
  Idea: Improve efficiency by adding monitoring operators to the dataflow graph
Challenges of Dataflow Collective Communication
Making Collective Communication Asynchronous
  Idea: Use asynchronous collective communication
Issues When Adapting System Parameters
Distributed Mechanism for Parameter Adaptation
How Effectively Does KungFu Adapt?
Conclusions: KungFu
Taught by
USENIX