This EuroPython 2021 conference talk covers the challenges of managing schema evolution in data lakes and the solutions Episource adopted. It reviews best practices for storage, control, scalability, and availability in data lake design, then describes how Episource stores and searches evolving nested JSON data produced by its NLP engine, which processes millions of medical documents. The solution uses the Avro format for schema evolution, a schema registry for version control, and Athena for distributed SQL queries. The talk also explains how the "schema-on-write" and "schema-on-read" approaches each help maintain data integrity and compatibility across schema changes.
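
To make the idea of schema evolution concrete, here is a minimal sketch in plain Python of the Avro-style rule that lets a newer reader schema consume records written with an older one: a field added in the new schema must carry a default, and fields the reader does not know are ignored. This is a simplified illustration, not the Avro library and not Episource's implementation. The schemas, field names, and the helper functions `is_backward_compatible` and `resolve` are hypothetical, and type checking is deliberately omitted.

```python
# Hypothetical writer schema (version 1), written as Avro-style JSON.
SCHEMA_V1 = {
    "type": "record",
    "name": "Extraction",
    "fields": [
        {"name": "doc_id", "type": "string"},
        {"name": "code", "type": "string"},
    ],
}

# Hypothetical reader schema (version 2): adds an optional field with a default.
SCHEMA_V2 = {
    "type": "record",
    "name": "Extraction",
    "fields": [
        {"name": "doc_id", "type": "string"},
        {"name": "code", "type": "string"},
        {"name": "confidence", "type": ["null", "double"], "default": None},
    ],
}


def is_backward_compatible(reader, writer):
    """Return True if every reader field missing from the writer has a default.

    Simplification: only field presence is checked, not field types.
    """
    writer_names = {field["name"] for field in writer["fields"]}
    return all(
        field["name"] in writer_names or "default" in field
        for field in reader["fields"]
    )


def resolve(record, reader):
    """Project a record written with an older schema onto the reader schema."""
    resolved = {}
    for field in reader["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            # Field was added after the record was written: use its default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


# A record written under version 1, read under version 2.
old_record = {"doc_id": "d-001", "code": "E11.9"}
resolved_record = resolve(old_record, SCHEMA_V2)
```

In a real deployment, a compatibility check of this kind is what a schema registry typically runs before it accepts a new schema version ("schema-on-write"), while the projection step is what happens at query time when older files are read with the latest schema ("schema-on-read").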