Deep Learning Pipelines for High Energy Physics Using Apache Spark and Distributed Keras
Databricks via YouTube
Overview
Syllabus
Intro
Experimental High Energy Physics is Data Intensive
Key Data Processing Challenge
Data Flow at LHC Experiments
R&D - Data Pipelines
Particle Classifiers Using Neural Networks
Deep Learning Pipeline for Physics Data
Analytics Platform at CERN
Hadoop and Spark Clusters at CERN
Step 1: Data Ingestion • Read input files: 4.5 TB from a custom (ROOT) format
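A minimal sketch of what this ingestion step could look like in PySpark, assuming the spark-root data source (org.dianahep:spark-root) is on the classpath; the input path is illustrative, not taken from the talk:

```python
# Sketch of the ingestion step: read ROOT files into a Spark DataFrame.
# Assumes the spark-root package (org.dianahep:spark-root) is available;
# the HDFS path is illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hep-data-ingestion")
         .getOrCreate())

events = (spark.read
          .format("org.dianahep.sparkroot")
          .load("hdfs:///project/hep/events/*.root"))

events.printSchema()
```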
Feature Engineering
Step 2: Feature Preparation • Features are converted to formats suitable for training
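A hypothetical version of this preparation step with Spark ML, assembling and scaling the high-level features and persisting them for training; the column names and output path are assumptions for illustration:

```python
# Hypothetical feature-preparation step: assemble the high-level features
# into a vector, scale them, and persist as Parquet for training.
# Column names and paths are assumptions, not from the talk.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, MinMaxScaler

# 'events' is the DataFrame produced by the ingestion step above.
assembler = VectorAssembler(
    inputCols=["missing_et", "lepton_pt", "n_jets"],  # assumed feature columns
    outputCol="features_raw")
scaler = MinMaxScaler(inputCol="features_raw", outputCol="features")

pipeline = Pipeline(stages=[assembler, scaler])
prepared = pipeline.fit(events).transform(events)

prepared.select("features", "label").write.parquet("hdfs:///project/hep/train.parquet")
```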
Performance and Lessons Learned • Data preparation is CPU bound
Neural Network Models and Hyper-Parameter Tuning
Hyper-Parameter Tuning - DNN • Hyper-parameter tuning of the DNN model
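An illustrative grid search over DNN hyper-parameters with plain Keras; the search space, the 14 input features, and the 3 output classes are assumptions for the sketch, not values quoted from the slides:

```python
# Illustrative hyper-parameter grid search for the DNN classifier.
# Search space, input width (14), and class count (3) are assumed.
import itertools
from tensorflow import keras

def build_dnn(n_hidden, n_units, lr):
    layers = [keras.layers.Dense(n_units, activation="relu", input_shape=(14,))]
    layers += [keras.layers.Dense(n_units, activation="relu")
               for _ in range(n_hidden - 1)]
    layers.append(keras.layers.Dense(3, activation="softmax"))
    model = keras.Sequential(layers)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

for n_hidden, n_units, lr in itertools.product([2, 3], [128, 256], [1e-3, 1e-4]):
    model = build_dnn(n_hidden, n_units, lr)
    # model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=5)
```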
Deep Learning at Scale with Spark
Spark, Analytics Zoo and BigDL
BigDL Run as Standard Spark Programs
BigDL Parameter Synchronization
Model Development - DNN for HLF • Model is instantiated using the Keras-compatible API provided by Analytics Zoo
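A sketch of how the HLF classifier might be instantiated with Analytics Zoo's Keras-compatible API; the layer sizes are assumptions, not figures from the slides:

```python
# Sketch of the HLF classifier using the Keras-compatible API from
# Analytics Zoo. Layer sizes are illustrative assumptions.
from zoo.pipeline.api.keras.models import Sequential
from zoo.pipeline.api.keras.layers import Dense

model = Sequential()
model.add(Dense(50, activation="relu", input_shape=(14,)))
model.add(Dense(20, activation="relu"))
model.add(Dense(10, activation="relu"))
model.add(Dense(3, activation="softmax"))
model.compile(optimizer="adam", loss="categorical_crossentropy")
```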
Model Development - GRU + HLF • A more complex network topology, combining a GRU over the Low-Level Features with a DNN over the High-Level Features
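A tf.keras functional-API sketch of such a combined topology, with a GRU branch over the sequence of low-level particle features concatenated with the high-level features; all shapes and layer sizes are illustrative assumptions:

```python
# Sketch of the combined GRU + HLF topology. Input shapes and layer
# sizes are assumptions for illustration.
from tensorflow import keras

low_level = keras.Input(shape=(801, 19))   # particles x low-level features (assumed)
high_level = keras.Input(shape=(14,))      # high-level features (assumed)

gru_branch = keras.layers.GRU(50)(low_level)
merged = keras.layers.Concatenate()([gru_branch, high_level])
hidden = keras.layers.Dense(25, activation="relu")(merged)
output = keras.layers.Dense(3, activation="softmax")(hidden)

model = keras.Model(inputs=[low_level, high_level], outputs=output)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```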
Distributed Training
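Continuing the Analytics Zoo sketch above, distributed training runs as a standard Spark job with a single fit call, BigDL handling parameter synchronization across executors; the RDD name and batch size are illustrative:

```python
# 'model' is the Analytics Zoo model above; 'train_rdd' is an assumed
# RDD of BigDL Samples built from the prepared data.
model.fit(train_rdd,
          batch_size=4096,  # global batch size, partitioned across executors
          nb_epoch=12)
```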
Performance and Scalability of Analytics Zoo/BigDL
Results - Model Performance
Workload Characterization
Training with TensorFlow 2.0 • Training and test data
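An illustrative TF 2.0 setup that reads the prepared data with tf.data and trains a Keras model; the feature spec, file pattern, and model architecture are assumptions for the sketch:

```python
# Illustrative TF 2.0 training: parse TFRecord files with tf.data and
# fit a Keras model. Feature spec, paths, and architecture are assumed.
import tensorflow as tf

feature_spec = {
    "features": tf.io.FixedLenFeature([14], tf.float32),
    "label": tf.io.FixedLenFeature([3], tf.float32),
}

def parse(record):
    example = tf.io.parse_single_example(record, feature_spec)
    return example["features"], example["label"]

dataset = (tf.data.TFRecordDataset(tf.io.gfile.glob("train-*.tfrecord"))
           .map(parse, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .shuffle(10_000)
           .batch(256)
           .prefetch(tf.data.experimental.AUTOTUNE))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(50, activation="relu", input_shape=(14,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.fit(dataset, epochs=5)
```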
Recap: our Deep Learning Pipeline with Spark
Model Serving and Future Work
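A hypothetical export step for serving, saving the trained model in SavedModel format so tools such as TensorFlow Serving can load it:

```python
# 'model' is the trained tf.keras model from the sketch above.
model.save("serving/hep_classifier/1", save_format="tf")
```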
Summary • The use case developed addresses the need for higher efficiency in event filtering at LHC experiments • Spark, Python notebooks
Labeled Data for Training and Test • Simulated events: software simulators are used to generate the events
Taught by
Databricks