Overview
Explore tools and strategies for migrating large-scale datasets to cloud platforms in this 47-minute Devoxx conference talk. Delve into the experiences of the Hotels.com big data platform team as they tackle the challenges of moving extensive datasets and pipelines from on-premises clusters to cloud-based solutions. Discover two open-source tools developed to overcome unexpected obstacles: Circus Train, a dataset replication tool for copying Hive tables between clusters and clouds, and Waggle Dance, a federated Hive query service enabling data querying across multiple Hive metastores. Learn about the unique features of these tools, their advantages over existing solutions, and how they've been successfully implemented to build a petabyte-scale platform now utilized by other Expedia brands. Gain insights into real-world problems and solutions encountered in a large, organically grown corporation, moving beyond idealized architectures to practical applications in big data migration.
Syllabus
Introduction
Agenda
Company structure
Data processing
Migrating jobs first
It's going to take years
Dataset replication
Finding an open source solution
Naming your project
Configuration
Distributed Copy
Hive Diff
Other features
Bridging multiple clusters
Waggle Dance
Hive CLI example
Priori pattern
Cloud architecture
Taught by
Devoxx