Serverless Machine Learning Inference with KFServing
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
A CNCF conference talk on KFServing: the evolution of the model inference stack, GPU autoscaling and utilization, multi-model serving for deploying many models, lessons from operating a serverless inference platform (tail latency, cold starts, monitoring and alerting), and the KFServing roadmap for 2020.
Syllabus
Intro
Inference Stack Evolution: PyTorch
Model explanation and model pre/post-processing transformers
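(KFServing pairs the predictor with optional explainer and transformer components; the transformer handles pre- and post-processing around the model. Below is a minimal sketch of a transformer using the kfserving 0.x Python SDK; the model name, predictor host, and normalization logic are illustrative placeholders, not details from the talk.)

    import kfserving
    from typing import Dict

    class ImageTransformer(kfserving.KFModel):
        def __init__(self, name: str, predictor_host: str):
            super().__init__(name)
            self.predictor_host = predictor_host

        def preprocess(self, inputs: Dict) -> Dict:
            # Turn raw request payloads into model-ready inputs.
            return {"instances": [normalize(i) for i in inputs["instances"]]}

        def postprocess(self, inputs: Dict) -> Dict:
            # Map raw model outputs back to a client-friendly response.
            return inputs

    def normalize(instance):
        # Placeholder feature scaling; real logic depends on the model.
        return [float(x) / 255.0 for x in instance]

    if __name__ == "__main__":
        transformer = ImageTransformer("my-model",
                                       predictor_host="my-model-predictor-default")
        kfserving.KFServer(workers=1).start(models=[transformer])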
GPU Autoscaling: The Challenge
Challenge: Increase GPU utilization
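(One common way to raise GPU utilization is batching several inputs into a single predict call, so each HTTP round trip keeps the GPU busier. A hedged sketch against the V1 "instances" HTTP protocol that KFServing exposes; the service URL and model name are hypothetical.)

    import requests

    SERVICE_URL = "http://my-model.default.example.com/v1/models/my-model:predict"

    def predict_batch(instances):
        # One request carrying a whole batch amortizes per-request
        # overhead compared with calling the model item by item.
        resp = requests.post(SERVICE_URL, json={"instances": instances})
        resp.raise_for_status()
        return resp.json()["predictions"]

    preds = predict_batch([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])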
Use Case: Personalized News Monitoring
Challenge: Deploy many models
Proposed Solution: Multi-model Inference Service
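(With multi-model serving, many models share one InferenceService and its pod pool instead of each model getting a dedicated deployment. The sketch below shows the client-side view, assuming models are addressed by name on a shared endpoint; the host and per-user model names are hypothetical.)

    import requests

    HOST = "http://multi-model-service.default.example.com"

    def predict(model_name, instances):
        # Models served from the same pods are selected by name in the path.
        url = f"{HOST}/v1/models/{model_name}:predict"
        resp = requests.post(url, json={"instances": instances})
        resp.raise_for_status()
        return resp.json()["predictions"]

    # Two per-user models behind the same service:
    a = predict("news-user-123", [[0.1, 0.2]])
    b = predict("news-user-456", [[0.3, 0.4]])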
Experience from running a serverless inference platform
Reduce tail latency caused by CPU throttling
Reduce cold start latency
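(A common cold-start mitigation is keeping a minimum number of replicas warm rather than letting the service scale to zero. A sketch using the kfserving 0.x SDK and the v1alpha2 API; the storage bucket, names, and namespace are placeholders.)

    from kubernetes import client
    from kfserving import (KFServingClient, constants, V1alpha2EndpointSpec,
                           V1alpha2PredictorSpec, V1alpha2TFServingSpec,
                           V1alpha2InferenceServiceSpec, V1alpha2InferenceService)

    endpoint = V1alpha2EndpointSpec(
        predictor=V1alpha2PredictorSpec(
            min_replicas=1,  # a warm pod sidesteps scale-to-zero cold starts
            tensorflow=V1alpha2TFServingSpec(storage_uri="gs://my-bucket/my-model")))

    isvc = V1alpha2InferenceService(
        api_version=constants.KFSERVING_GROUP + "/" + constants.KFSERVING_VERSION,
        kind=constants.KFSERVING_KIND,
        metadata=client.V1ObjectMeta(name="my-model", namespace="default"),
        spec=V1alpha2InferenceServiceSpec(default=endpoint))

    KFServingClient().create(isvc)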
Monitoring and Alerting: Control Plane
Monitoring and Alerting: Access logs
Monitoring and Alerting: Inference Service metrics
KFServing Roadmap 2020
Our Working Group is Open
Taught by
CNCF [Cloud Native Computing Foundation]