Building and Managing a Centralized ML Platform with Kubeflow at CERN
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Explore how CERN built and manages a centralized machine learning platform using Kubeflow in this 31-minute conference talk. Discover how CERN applies ML to challenges including particle classification, simulation data generation, and beam calibration. Learn about the recently introduced centralized service, which handles data preparation, model training, and serving while optimizing resource usage across different types of accelerators. Gain insights into CERN's experience running Kubeflow on Kubernetes, its integration of on-premises resources, and potential extensions to public clouds. Topics include cluster layout, deployment strategies, integrations, and automation of distributed training. Watch a demo of job submission and results, and understand the motivations behind CERN's ML platform development.
Syllabus
Introduction
Introductions
What is CERN
Motivation for our service
Reconstruction
Simulations
Goals
Platform
Cluster Layout
Deployment
Integrations
Issues
Burst to Public Clouds
Automating Distributed Training
Service Dashboard
Demo
Submitting jobs
Results
Closing remarks
Taught by
CNCF [Cloud Native Computing Foundation]