Deep Learning Pipelines for High Energy Physics Using Apache Spark and Distributed Keras
Databricks via YouTube
Overview
Syllabus
Intro
Experimental High Energy Physics is Data Intensive
Key Data Processing Challenge
Data Flow at LHC Experiments
R&D - Data Pipelines
Particle Classifiers Using Neural Networks
Deep Learning Pipeline for Physics Data
Analytics Platform at CERN
Hadoop and Spark Clusters at CERN
Step 1: Data Ingestion • Read input files: 4.5 TB from a custom (ROOT) format
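A minimal sketch of what this ingestion step could look like in PySpark, assuming the spark-root data source (org.dianahep:spark-root) is on the classpath; the input path is illustrative, not taken from the talk:

```python
# Sketch of the ingestion step: read ROOT files into a Spark DataFrame.
# Assumes the spark-root package (org.dianahep:spark-root) is available;
# the HDFS path is illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hep-data-ingestion")
         .getOrCreate())

events = (spark.read
          .format("org.dianahep.sparkroot")
          .load("hdfs:///project/hep/events/*.root"))

events.printSchema()
```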
Feature Engineering
Step 2: Feature Preparation • Features are converted to formats suitable for training
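A hypothetical version of this preparation step with Spark ML, assembling and scaling the high-level features and persisting them for training; the column names and output path are assumptions for illustration:

```python
# Hypothetical feature-preparation step: assemble the high-level features
# into a vector, scale them, and persist as Parquet for training.
# Column names and paths are assumptions, not from the talk.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, MinMaxScaler

# 'events' is the DataFrame produced by the ingestion step above.
assembler = VectorAssembler(
    inputCols=["missing_et", "lepton_pt", "n_jets"],  # assumed feature columns
    outputCol="features_raw")
scaler = MinMaxScaler(inputCol="features_raw", outputCol="features")

pipeline = Pipeline(stages=[assembler, scaler])
prepared = pipeline.fit(events).transform(events)

prepared.select("features", "label").write.parquet("hdfs:///project/hep/train.parquet")
```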
Performance and Lessons Learned • Data preparation is CPU bound
Neural Network Models and Hyper-Parameter Tuning
Hyper-Parameter Tuning - DNN • Hyper-parameter tuning of the DNN model
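An illustrative grid search over DNN hyper-parameters with plain Keras; the search space, the 14 input features, and the 3 output classes are assumptions for the sketch, not values quoted from the slides:

```python
# Illustrative hyper-parameter grid search for the DNN classifier.
# Search space, input width (14), and class count (3) are assumed.
import itertools
from tensorflow import keras

def build_dnn(n_hidden, n_units, lr):
    layers = [keras.layers.Dense(n_units, activation="relu", input_shape=(14,))]
    layers += [keras.layers.Dense(n_units, activation="relu")
               for _ in range(n_hidden - 1)]
    layers.append(keras.layers.Dense(3, activation="softmax"))
    model = keras.Sequential(layers)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

for n_hidden, n_units, lr in itertools.product([2, 3], [128, 256], [1e-3, 1e-4]):
    model = build_dnn(n_hidden, n_units, lr)
    # model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=5)
```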
Deep Learning at Scale with Spark
Spark, Analytics Zoo and BigDL
BigDL Run as Standard Spark Programs
BigDL Parameter Synchronization
Model Development - DNN for HLF • Model is instantiated using the Keras-compatible API provided by Analytics Zoo
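A sketch of how the HLF classifier might be instantiated with Analytics Zoo's Keras-compatible API; the layer sizes are assumptions, not figures from the slides:

```python
# Sketch of the HLF classifier using the Keras-compatible API from
# Analytics Zoo. Layer sizes are illustrative assumptions.
from zoo.pipeline.api.keras.models import Sequential
from zoo.pipeline.api.keras.layers import Dense

model = Sequential()
model.add(Dense(50, activation="relu", input_shape=(14,)))
model.add(Dense(20, activation="relu"))
model.add(Dense(10, activation="relu"))
model.add(Dense(3, activation="softmax"))
model.compile(optimizer="adam", loss="categorical_crossentropy")
```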
Model Development - GRU + HLF • A more complex network topology, combining a GRU over the Low-Level Features with a DNN over the High-Level Features
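A tf.keras functional-API sketch of such a combined topology, with a GRU branch over the sequence of low-level particle features concatenated with the high-level features; all shapes and layer sizes are illustrative assumptions:

```python
# Sketch of the combined GRU + HLF topology. Input shapes and layer
# sizes are assumptions for illustration.
from tensorflow import keras

low_level = keras.Input(shape=(801, 19))   # particles x low-level features (assumed)
high_level = keras.Input(shape=(14,))      # high-level features (assumed)

gru_branch = keras.layers.GRU(50)(low_level)
merged = keras.layers.Concatenate()([gru_branch, high_level])
hidden = keras.layers.Dense(25, activation="relu")(merged)
output = keras.layers.Dense(3, activation="softmax")(hidden)

model = keras.Model(inputs=[low_level, high_level], outputs=output)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```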
Distributed Training
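Continuing the Analytics Zoo sketch above, distributed training runs as a standard Spark job with a single fit call, BigDL handling parameter synchronization across executors; the RDD name and batch size are illustrative:

```python
# 'model' is the Analytics Zoo model above; 'train_rdd' is an assumed
# RDD of BigDL Samples built from the prepared data.
model.fit(train_rdd,
          batch_size=4096,  # global batch size, partitioned across executors
          nb_epoch=12)
```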
Performance and Scalability of Analytics Zoo/BigDL
Results - Model Performance
Workload Characterization
Training with TensorFlow 2.0 • Training and test data
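An illustrative TF 2.0 setup that reads the prepared data with tf.data and trains a Keras model; the feature spec, file pattern, and model architecture are assumptions for the sketch:

```python
# Illustrative TF 2.0 training: parse TFRecord files with tf.data and
# fit a Keras model. Feature spec, paths, and architecture are assumed.
import tensorflow as tf

feature_spec = {
    "features": tf.io.FixedLenFeature([14], tf.float32),
    "label": tf.io.FixedLenFeature([3], tf.float32),
}

def parse(record):
    example = tf.io.parse_single_example(record, feature_spec)
    return example["features"], example["label"]

dataset = (tf.data.TFRecordDataset(tf.io.gfile.glob("train-*.tfrecord"))
           .map(parse, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .shuffle(10_000)
           .batch(256)
           .prefetch(tf.data.experimental.AUTOTUNE))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(50, activation="relu", input_shape=(14,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.fit(dataset, epochs=5)
```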
Recap: our Deep Learning Pipeline with Spark
Model Serving and Future Work
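A hypothetical export step for serving, saving the trained model in SavedModel format so tools such as TensorFlow Serving can load it:

```python
# 'model' is the trained tf.keras model from the sketch above.
model.save("serving/hep_classifier/1", save_format="tf")
```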
Summary • The use case developed addresses the need for higher efficiency in event filtering at LHC experiments • Spark, Python notebooks
Labeled Data for Training and Test • Simulated events: software simulators are used to generate the events
Taught by
Databricks