Overview
Learn about training deep learning models in cloud environments in this 56-minute webcast presented by experts from Habana (Intel) and IBM. Explore industry predictions showing deep learning's dominance in future cloud workloads, with a focus on foundation models with billions of parameters. Gain insights into the benefits of AI adoption across industries, infrastructure selection considerations for both on-premises and cloud deployments, and solution approaches for enterprise AI implementation.

Discover how organizations use cloud-native software stacks such as Kubernetes to manage the complexity of evolving frameworks like TensorFlow and PyTorch. Examine critical aspects of operationalizing deep learning infrastructure, including scaling, cost optimization, training time reduction, data storage capacity, bandwidth requirements, and other key selection criteria. Dive into technical topics such as GPUDirect RDMA, storage needs, training methods, network usage, collective operations, and Optane Persistent Memory. Master the essentials of deep learning infrastructure design while understanding the tradeoffs between cost, performance, and flexibility in modern AI deployments.
Syllabus
Introduction
Industry Trends
AI by Enterprises
Storage and Compute
Ecosystem
Training and Deployment
Network Interfaces
Middleware Stack
Software
Preprocessing
Summary
Questions
What is GDR
Dual Approach
GPUDirect RDMA
Storage Needs
Training Methods
Network Usage
Collective Operations
Optane Persistent Memory
Taught by
SNIA