Optimize LLM Workflows with Smart Infrastructure Enhanced by Volcano
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Explore strategies for optimizing Large Language Model (LLM) workflows using smart infrastructure enhanced by Volcano in this conference talk. Learn how to manage large-scale LLM training and inference platforms while addressing challenges such as training efficiency, fault tolerance, resource fragmentation, operational cost, and topology-aware scheduling. The talk covers fault detection, fast job recovery, and self-healing mechanisms that markedly improve training efficiency; approaches to reducing long downtime when training LLMs on heterogeneous GPUs; intelligent GPU workload scheduling that cuts resource fragmentation and cost; and topology-aware scheduling on rack/supernode systems that accelerates LLM training. The speakers draw on real-world experience operating a cloud-native AI platform with thousands of GPUs and a monthly workload of numerous LLM training and inference jobs.
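To make the fault-tolerance and gang-scheduling ideas concrete, here is a minimal sketch of submitting a self-healing distributed training job through Volcano's batch.volcano.sh/v1alpha1 Job CRD. The field names (minAvailable, maxRetry, policies, tasks) follow Volcano's Job API; the namespace, image, replica count, and GPU numbers are illustrative placeholders, not details from the talk.

```python
# Sketch: a gang-scheduled, self-healing Volcano Job for distributed LLM training.
# Assumes a cluster with Volcano installed and the Python kubernetes client configured.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster

job = {
    "apiVersion": "batch.volcano.sh/v1alpha1",
    "kind": "Job",
    "metadata": {"name": "llm-train", "namespace": "default"},
    "spec": {
        "schedulerName": "volcano",
        # Gang scheduling: the job does not start until all 8 workers can be
        # placed, so no GPUs sit idle behind a partially scheduled job.
        "minAvailable": 8,
        "queue": "default",
        # Self-healing: restart the whole job (up to maxRetry times) on pod
        # failure or eviction, letting training resume from its last checkpoint.
        "maxRetry": 3,
        "policies": [
            {"event": "PodEvicted", "action": "RestartJob"},
            {"event": "PodFailed", "action": "RestartJob"},
        ],
        "tasks": [
            {
                "name": "worker",
                "replicas": 8,
                "template": {
                    "spec": {
                        "restartPolicy": "Never",
                        "containers": [
                            {
                                "name": "trainer",
                                "image": "example.com/llm-trainer:latest",  # placeholder
                                "resources": {"limits": {"nvidia.com/gpu": 8}},
                            }
                        ],
                    }
                },
            }
        ],
    },
}

# Volcano Jobs are custom resources, so they are created via the custom objects API.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="batch.volcano.sh",
    version="v1alpha1",
    namespace="default",
    plural="jobs",
    body=job,
)
```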
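The topology-aware scheduling discussed in the talk aims to keep a job's workers within one rack or supernode so collective communication avoids slower cross-switch hops. As a hedged sketch, assuming a Volcano release with network-topology-aware scheduling (v1.11+) and HyperNode objects describing the cluster fabric, the Job spec above could be constrained like this; which tier corresponds to a rack or supernode depends entirely on how the cluster's topology is modeled.

```python
# Sketch: restrict the job above to a single low-latency network tier.
# Requires Volcano's network topology aware scheduling; tier numbering is
# an assumption about this cluster's HyperNode definitions.
job["spec"]["networkTopology"] = {
    # "hard": never place workers beyond the allowed tier;
    # "soft": prefer it, but fall back when capacity is unavailable.
    "mode": "hard",
    # Tier 1 is assumed here to map to one rack/supernode.
    "highestTierAllowed": 1,
}
```

For the fragmentation problem, Volcano's binpack scheduler plugin can be weighted toward nvidia.com/gpu so that partially used nodes fill up before fresh ones are allocated, leaving fewer stranded single GPUs across the fleet.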
Syllabus
Optimize LLM Workflows with Smart Infrastructure Enhanced by Volcano - Xin Li & Xuzheng Chang
Taught by
CNCF [Cloud Native Computing Foundation]