Expand your knowledge of the concepts discussed in Spark Fundamentals I with a focus on RDDs (Resilient Distributed Datasets). RDDs are the main abstraction Spark provides to enable parallel processing across the nodes of a Spark cluster.
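To make the idea of parallel processing over RDDs concrete, here is a minimal, hedged sketch in Scala. The application name, `local[*]` master, and the numbers used are illustrative assumptions, not part of the course material; on a real cluster the same code would distribute the partitions across executor nodes.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddParallelismSketch {
  def main(args: Array[String]): Unit = {
    // local[*] runs one task slot per core; a cluster deployment would set its own master.
    val conf = new SparkConf().setAppName("rdd-parallelism-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // An RDD split into 4 partitions; each partition is processed by a separate task.
    val numbers = sc.parallelize(1 to 1000000, numSlices = 4)

    // Transformations are lazy; the reduce action triggers the parallel computation.
    val sumOfSquares = numbers.map(n => n.toLong * n).reduce(_ + _)
    println(s"sum of squares = $sumOfSquares")

    sc.stop()
  }
}
```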
- Get in-depth knowledge of Spark’s architecture and how data is distributed and tasks are parallelized.
- Learn how to optimize your data for joins using Spark’s in-memory caching (see the sketch after this list).
- Learn how to use the more advanced operations available in the API.
- The lab exercises for this course are performed exclusively in the cloud, using a notebook interface.
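As a hedged illustration of caching data that is reused in a join, the sketch below builds two small pair RDDs keyed by a user id; the data, object name, and key/value layout are hypothetical and chosen only to show the pattern, not taken from the course labs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CachedJoinSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cached-join-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Hypothetical pair RDDs keyed by a user id.
    val users     = sc.parallelize(Seq((1, "alice"), (2, "bob"), (3, "carol")))
    val purchases = sc.parallelize(Seq((1, 25.0), (1, 40.0), (3, 10.0)))

    // cache() keeps the RDD in executor memory after its first computation,
    // so reusing it across several actions avoids recomputing its lineage.
    val cachedUsers = users.cache()

    // join on the key produces (userId, (name, amount)) for every matching pair.
    val joined = cachedUsers.join(purchases)
    joined.collect().foreach(println)

    // A second action over the cached RDD reads the in-memory copy.
    println(s"user records: ${cachedUsers.count()}")

    sc.stop()
  }
}
```

Caching pays off when the same RDD feeds more than one action or join; for data used only once, persisting it adds memory pressure without saving any work.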