AI Inference Workloads - Solving MLOps Challenges in Production
Toronto Machine Learning Series (TMLS) via YouTube
Overview
Syllabus
Intro
Agenda
The Machine Learning Process
Deployment Types for Inference Workloads
Machine Learning is Different from Traditional Software Engineering
Low Latency
High Throughput
Maximize GPU Utilization
Embedding ML Models into Web Servers
Decouple Web Serving and Model Serving
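The decoupling idea above can be sketched with a minimal, self-contained example: the web tier does not load the model into its own process, but forwards inference requests over HTTP to a separate model-serving endpoint. Everything here (the `predict` stub, the `/predict` route, ports) is hypothetical and only illustrates the pattern, not any specific serving framework.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

# Hypothetical "model": a stub standing in for a real inference call.
def predict(features):
    return {"score": sum(features)}

class ModelServer(BaseHTTPRequestHandler):
    # The model-serving tier: deployed and scaled separately from the web tier,
    # so GPU-backed replicas can be sized independently of web traffic.
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(predict(payload["features"])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

def serve_model():
    # Port 0 asks the OS for any free port.
    server = HTTPServer(("127.0.0.1", 0), ModelServer)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

def web_handler(features, model_port):
    # The web tier: forwards the request instead of embedding the model.
    req = Request(
        f"http://127.0.0.1:{model_port}/predict",
        data=json.dumps({"features": features}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return json.loads(resp.read())

server = serve_model()
port = server.server_address[1]
print(web_handler([1.0, 2.0, 3.0], port))
```

Because the two tiers talk over a network boundary, the model servers can run on GPU nodes (and autoscale on inference load) while the web servers stay on cheap CPU nodes.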
Model Serving System on Kubernetes
Multi-Instance GPU (MIG)
Run:AI's Dynamic MIG Allocations
Run 3 instances of type 2g.10gb
Valid Profiles & Configurations
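As a concrete sketch of the "3 instances of type 2g.10gb" configuration above: on an A100-40GB, MIG instances can be created with `nvidia-smi`. The profile ID 14 shown here corresponds to 2g.10gb on that card, but IDs vary by GPU model, so the listing step should be used to confirm them; treat this as an illustrative fragment, not an exact recipe for any particular cluster.

```shell
# Enable MIG mode on GPU 0 (may require a GPU reset).
sudo nvidia-smi -i 0 -mig 1

# List the valid GPU instance profiles and their IDs for this card.
sudo nvidia-smi mig -lgip

# Create three 2g.10gb GPU instances (profile ID 14 on A100-40GB)
# and, with -C, a matching compute instance inside each.
sudo nvidia-smi mig -cgi 14,14,14 -C

# Verify: each MIG device now appears as a separately schedulable GPU.
nvidia-smi -L
```

Each resulting 2g.10gb slice gets its own memory and compute partition, which is what lets several small inference workloads share one physical GPU with hardware isolation.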
Serving on Fractional GPUs
A Game Changer for Model Inferencing
Taught by
Toronto Machine Learning Series (TMLS)