Explore an innovative approach to parallelizing embedding tables (EMTs) for large-scale recommendation models in this 20-minute conference talk from USENIX ATC '24. Dive into OPER, an algorithm-system co-design that addresses the challenges of deploying Deep Learning Recommendation Models (DLRMs) across multiple GPUs. Learn how OPER's optimality-guided EMT parallelization technique improves upon existing methods by accounting for input-dependent behavior, yielding more balanced workload distribution and reduced inter-GPU communication. Discover the heuristic search algorithm used to approximate near-optimal EMT parallelization and the distributed shared memory-based system that supports fine-grained EMT parallelization. Gain insights into the significant performance improvements achieved by OPER, with reported average speedups of 2.3× in training and 4.0× in inference compared to state-of-the-art DLRM frameworks.
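To make the load-balancing idea concrete, here is a minimal Python sketch of a greedy placement heuristic that assigns embedding tables to GPUs based on input-dependent cost estimates. This is an illustrative simplification, not OPER's actual search algorithm; the function name `greedy_balance` and the per-table cost model are assumptions for the example.

```python
from typing import Dict, List

def greedy_balance(table_costs: Dict[str, float], num_gpus: int) -> List[List[str]]:
    """Assign embedding tables to GPUs so that estimated lookup cost is balanced.

    table_costs maps a table name to an input-dependent cost estimate,
    e.g. average lookups per batch times embedding width (a hypothetical metric).
    """
    # Longest-processing-time-first greedy: sort tables by descending cost,
    # then place each table on the currently least-loaded GPU.
    placement: List[List[str]] = [[] for _ in range(num_gpus)]
    load = [0.0] * num_gpus
    for name, cost in sorted(table_costs.items(), key=lambda kv: -kv[1]):
        gpu = min(range(num_gpus), key=lambda g: load[g])
        placement[gpu].append(name)
        load[gpu] += cost
    return placement

if __name__ == "__main__":
    # Hypothetical per-table cost estimates derived from profiled input batches.
    costs = {"user_id": 9.0, "item_id": 7.5, "category": 1.2, "region": 0.8, "device": 0.5}
    print(greedy_balance(costs, num_gpus=2))
```

A real system would also need to model inter-GPU communication and memory capacity when placing tables; the sketch above only balances estimated compute load.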