Overview
Syllabus
Glamorous Life of an AI Scientist
A Simple Example • How long does a matrix-matrix multiply take?
Practically
What About Memory?
Three Types of Parallelism
Within-operation Parallelism
Operation-wise Parallelism
Example-wise Parallelism
Implementing Data Parallelism • Many modern libraries make data parallelism relatively easy, e.g. PyTorch DistributedDataParallel
Computation Across Large Vocabularies
Noise Contrastive Estimation (Mnih & Teh 2012)
Mini-batch Based Negative Sampling
Class-based Softmax (Goodman 2001) • Assign each word to a class • Predict class first, then word given class
Binary Code Prediction (Dietterich and Bakiri 1995, Oda et al. 2017)
Two Improvements to Binary Code Prediction
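
The class-based softmax item in the syllabus (assign each word to a class, predict the class first, then the word given the class) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and does not come from the lecture: the six-word vocabulary, the two-class assignment, the random weights, and the helper names `softmax`, `dot`, and `word_prob`. The point is only that the two-step factorization still yields a valid distribution over the whole vocabulary, while each softmax runs over far fewer items than the full vocabulary.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy setup: 6 words split evenly into 2 classes.
word_to_class = [0, 0, 0, 1, 1, 1]
class_members = [[0, 1, 2], [3, 4, 5]]

rng = random.Random(0)
hidden_size = 4
# One weight vector per class and one per word (random, for illustration).
class_weights = [[rng.gauss(0, 1) for _ in range(hidden_size)]
                 for _ in class_members]
word_weights = [[rng.gauss(0, 1) for _ in range(hidden_size)]
                for _ in word_to_class]


def word_prob(hidden, word):
    """P(word | hidden) = P(class | hidden) * P(word | class, hidden)."""
    cls = word_to_class[word]
    # Step 1: softmax over classes only.
    p_class = softmax([dot(w, hidden) for w in class_weights])[cls]
    # Step 2: softmax over the words inside the chosen class only.
    members = class_members[cls]
    p_within = softmax([dot(word_weights[m], hidden) for m in members])
    return p_class * p_within[members.index(word)]


hidden = [rng.gauss(0, 1) for _ in range(hidden_size)]
probs = [word_prob(hidden, w) for w in range(len(word_to_class))]
total_prob = sum(probs)  # should be 1 up to floating-point error
```

With V words split into roughly sqrt(V) equally sized classes, each prediction scores about 2 * sqrt(V) items instead of V, which is the usual motivation for this factorization.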
Taught by
Graham Neubig