Machine Learning Using Various GPU Technologies with Kubeflow

Overview

This 32-minute conference talk by Jihye Choi of SAMSUNG SDS covers GPU technologies that improve GPU utilization and distributed training in Kubeflow environments. The first is Multi-Instance GPU (MIG) on the NVIDIA A100, which partitions a single GPU into up to seven isolated instances so that smaller models do not each occupy a whole GPU. The second is GPUDirect RDMA, a high-performance networking technology that lets GPU memory be accessed directly across nodes without CPU intervention, improving GPU utilization and performance in distributed training. The talk also shows how MLOps practitioners can combine these technologies with Kubeflow to work around cost and GPU resource constraints.
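
As a rough illustration of how a workload might request one MIG slice rather than a whole A100, the sketch below builds a Kubernetes Pod manifest in Python. It is not taken from the talk. It assumes the cluster runs the NVIDIA device plugin with the "mixed" MIG strategy, which advertises slices under resource names such as nvidia.com/mig-1g.5gb; the function name, pod name, and image are hypothetical.

```python
import json

# Assumption: the NVIDIA device plugin is deployed with the "mixed" MIG
# strategy, so MIG slices appear as resources like nvidia.com/mig-1g.5gb.
MIG_RESOURCE = "nvidia.com/mig-1g.5gb"


def mig_pod_manifest(name: str, image: str, slices: int = 1) -> dict:
    """Build a Pod manifest that requests `slices` MIG instances (hypothetical helper)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "trainer",
                    "image": image,
                    # Request MIG slices instead of a full nvidia.com/gpu device.
                    "resources": {"limits": {MIG_RESOURCE: slices}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Placeholder pod name and image, for illustration only.
    manifest = mig_pod_manifest("mig-demo", "example.com/trainer:latest")
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved and applied with kubectl. Because an A100 can expose up to seven such slices, up to seven pods of this shape could share one physical GPU, provided the GPU has been partitioned with that profile.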