Effortless Scalability: Orchestrating Large Language Model Inference with Kubernetes

Overview

This 27-minute talk from the Linux Foundation covers deploying and orchestrating large open-source models for inference on Kubernetes. It shows how to automate the deployment of heavyweight models such as Falcon and Llama 2 using Kubernetes Custom Resource Definitions (CRDs), with the large model files managed through container images. Topics include serving inference calls through an HTTP server, replacing manual tuning of deployment parameters with preset configurations, and auto-provisioning GPU nodes based on each model's requirements. The talk also explains how users can deploy their own containerized models by supplying a pod template in the inference field of the workspace custom resource, from which the controller dynamically creates deployment workloads that use all of the GPU nodes. Along the way, it offers insights into optimizing resource utilization and simplifying deployment for large language model inference.
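
The workspace-to-deployment flow described above can be illustrated with a small Python sketch. This is not the controller from the talk: the field names (`inference`, `preset`, `template`, `resource.count`), the preset table, the image names, and the GPU counts are all hypothetical placeholders chosen for illustration. Only the `apps/v1` Deployment shape and the `nvidia.com/gpu` resource name are standard Kubernetes conventions. The sketch assumes one replica per provisioned GPU node.

```python
import copy

# Hypothetical preset table. Image names and GPU counts are placeholders,
# not measured requirements for these models.
PRESETS = {
    "falcon-7b": {"image": "registry.example.com/falcon-7b:latest", "gpus_per_pod": 1},
    "llama-2-70b": {"image": "registry.example.com/llama-2-70b:latest", "gpus_per_pod": 4},
}


def pod_template_for(inference: dict) -> dict:
    """Return a pod template: the user-supplied one if present, else one built from a preset."""
    if "template" in inference:
        # Copy so that later label changes do not mutate the caller's object.
        return copy.deepcopy(inference["template"])
    preset = PRESETS[inference["preset"]]
    return {
        "spec": {
            "containers": [{
                "name": "inference",
                "image": preset["image"],
                # Port for the HTTP inference server (assumed value).
                "ports": [{"containerPort": 80}],
                "resources": {"limits": {"nvidia.com/gpu": preset["gpus_per_pod"]}},
            }]
        }
    }


def build_deployment(workspace: dict) -> dict:
    """Translate a (hypothetical) workspace object into an apps/v1 Deployment manifest."""
    name = workspace["metadata"]["name"]
    labels = {"app": name}
    template = pod_template_for(workspace["inference"])
    # The Deployment selector must match the labels on the pod template.
    template.setdefault("metadata", {}).setdefault("labels", {}).update(labels)
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            # Assumption: one replica per GPU node requested in the workspace.
            "replicas": workspace["resource"]["count"],
            "selector": {"matchLabels": labels},
            "template": template,
        },
    }


# Preset-based workspace: no manual tuning of deployment parameters.
preset_workspace = {
    "metadata": {"name": "falcon-demo"},
    "resource": {"count": 2},
    "inference": {"preset": "falcon-7b"},
}
deployment = build_deployment(preset_workspace)

# Custom workspace: the user provides the pod template directly.
custom_workspace = {
    "metadata": {"name": "custom-demo"},
    "resource": {"count": 1},
    "inference": {"template": {"spec": {"containers": [
        {"name": "my-model", "image": "registry.example.com/my-model:v1"}
    ]}}},
}
custom_deployment = build_deployment(custom_workspace)
```

In this sketch the two paths differ only in where the pod template comes from; everything else (labels, selector, replica count) is derived from the workspace, which mirrors the idea of letting a controller fill in deployment details.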

Syllabus

Effortless Scalability: Orchestrating Large Language Model Inference w... Joinal Ahmed & Nirav Kumar

Taught by

Linux Foundation
