Tokenized and Continuous Embedding Compressions of Protein Sequence and Structure
Valence Labs via YouTube
Overview
Explore a comprehensive lecture on machine learning representations of proteins, focusing on the joint distribution of sequence and structure. Dive into an analysis of ESMFold embeddings, uncovering massive activations and their implications. Learn about continuous compression schemes that significantly reduce the size of ESMFold embeddings while preserving structural information and performance on protein function benchmarks. Discover a novel tokenized all-atom structure vocabulary that enables high reconstruction accuracy from sequence alone. Examine the CHEAP (Compressed Hourglass Embedding Adaptations of Proteins) embeddings and the HPCT (Hourglass Protein Compression Transformer) architecture, and understand their potential for compact representation of protein structure and sequence. Gain insights into information-content asymmetries between sequence and structure, and explore how representations captured by large models can be democratized. Investigate flexible downstream applications of CHEAP embeddings, including generation, search, and prediction. The lecture concludes with a Q&A session, providing an opportunity to delve deeper into this cutting-edge research in protein machine learning.
Syllabus
- Introduction
- Background
- CHEAP
- Q&A
Taught by
Valence Labs