Overview
This 15-minute conference talk introduces TGS (Transparent GPU Sharing), a system that operates at the OS layer to share GPUs among containers running deep learning workloads, addressing GPU underutilization in datacenters. Learn how TGS uses adaptive rate control and transparent unified memory to achieve high GPU utilization and performance isolation, keeping the impact on production jobs minimal while significantly improving throughput for opportunistic jobs. See how TGS compares with existing application-layer and OS-layer solutions, how it integrates with Docker and Kubernetes, and what it could mean for resource management and deep learning training in container clouds.
Syllabus
NSDI '23 - Transparent GPU Sharing in Container Clouds for Deep Learning Workloads
Taught by
USENIX