Overview
Explore the computational power of transformers in this 57-minute lecture by Daniel Hsu from Columbia University. Delve into the relationship between self-attention layers and communication rounds in Massively Parallel Computation (MPC). Discover how logarithmic depth enables transformers to efficiently solve computational tasks that remain challenging for other neural sequence models and for sub-quadratic transformer approximations. Gain insight into parallelism as a crucial distinguishing feature of transformers. Learn about collaborative research with Clayton Sanford from Google and Matus Telgarsky from NYU, which shows that a constant number of self-attention layers can simulate, and be simulated by, a constant number of MPC communication rounds.
Syllabus
Transformers, parallel computation, and logarithmic depth
Taught by
Simons Institute