Overview
Syllabus
Intro
Why Are Neural Networks Slow and What Can We Do?
A Simple Example • How long does a matrix-matrix multiply take?
Practically
Speed Trick 3
Reduce # of Operations
Reduce CPU-GPU Data Movement
What About Memory?
Three Types of Parallelism
Within-operation Parallelism
Operation-wise Parallelism
Example-wise Parallelism
Computation Across Large Vocabularies
A Visual Example of the Softmax
Importance Sampling (Bengio and Senecal 2003)
Noise Contrastive Estimation (Mnih & Teh 2012)
Mini-batch Based Negative Sampling
Hard Negative Mining • Select the top n hardest examples
Efficient Maximum Inner Product Search
Structure-based Approximations
Class-based Softmax (Goodman 2001) • Assign each word to a class
Binary Code Prediction (Dietterich and Bakiri 1995, Oda et al. 2017)
Two Improvements to Binary Code Prediction
Taught by
Graham Neubig