Linux Foundation

Seagull - A Distributed, Fault Tolerant, Concurrent Task Runner

Linux Foundation via YouTube

Overview

Explore a conference talk detailing Yelp's distributed task runner Seagull, which enables running 100,000 tests in under 10 minutes. Learn how this system utilizes Apache Mesos, AWS spot instances, and custom autoscaling to efficiently execute millions of tests daily. Discover the architecture, algorithms, and challenges involved in building a large-scale distributed testing infrastructure. Gain insights into optimizing resource utilization, handling spot instance volatility, and addressing issues like bandwidth constraints and orphaned containers. Examine future improvements planned for Seagull's scheduler, executor, and autoscaler components.

Syllabus

Seagull: A Distributed, Fault tolerant, Concurrent task runner
Outline: What is Seagull? Why did we build it? Deep dive into Seagull. Fleetmiser: Yelp's in-house cluster autoscaler. Challenges and lessons learned. Future of Seagull.
Testing at Yelp: Yelp needs to run ~100,000 tests for its applications. Tests take ~2 days to run if executed serially.
Current Seagull scale: ~350 Seagull runs every day. Average runtime: 10-15 minutes. ~2.5 million ephemeral containers every day. Cluster scales from ~70 instances to ~450 instances. All spot instances. ~25 million tests executed every day.
Applications of Seagull: Test execution: unit, integration, acceptance, and Selenium tests. Locust: Yelp's load testing framework. Photo classification: classify tens of millions of photos in less than a day.
Seagull Mesos scheduler: Written in Python. Uses libmesos. One scheduler per test suite per run. 50 schedulers running simultaneously at peak. Customizable concurrency. Fault tolerant.
Placement strategies: Aim: optimize for Seagull bundle setup time. Affinity for already used agents. Use as many resources in an offer as possible. This also simplifies scaling down.
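
The two placement rules above (prefer agents that were already used, then fill each offer as fully as possible) can be sketched as follows. This is only an illustration, not Seagull's scheduler code: the `Offer` type, the `place` function, and the choice to model resources as CPUs alone are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Offer:
    """Hypothetical, simplified resource offer: one agent and its free CPUs."""
    agent_id: str
    cpus: float


def place(offers: List[Offer], used_agents: Set[str],
          pending: int, cpus_per_task: float) -> Dict[str, int]:
    """Return how many pending tasks to launch on each agent."""
    plan: Dict[str, int] = {}
    # Already-used agents sort first (affinity), then larger offers first.
    ordered = sorted(offers,
                     key=lambda o: (o.agent_id not in used_agents, -o.cpus))
    for offer in ordered:
        if pending <= 0:
            break
        # Use as much of the offer as the remaining tasks allow.
        fit = min(int(offer.cpus // cpus_per_task), pending)
        if fit > 0:
            plan[offer.agent_id] = plan.get(offer.agent_id, 0) + fit
            pending -= fit
    return plan
```

Packing tasks onto few agents in this way tends to leave other agents completely idle, which is presumably why the slide notes that it simplifies scaling down.
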
Seagull executor: Custom Mesos executor written in Python. Uses Mesos containerizer and cgroups isolators. Does setup and teardown of bundles. Reports resource utilization stats. Uploads log files to S3, sends metrics to Elasticsearch and SignalFx.
Clusterwide resources: Selenium and database connections. Resources are not tied to specific agents. ZooKeeper ephemeral znodes keep track of how many connections are being used.
Log aggregation in Splunk: stdout and stderr of all the executors are stored in Splunk, which allows us to see failure trends across multiple Seagull runs.
Greedy algorithm: Test timings are stored in Elasticsearch. The P90 of test timings for the last week is stored in DynamoDB every day. The list is sorted in ascending order of test timings. Tests are packed into bins of 10 minutes.
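
A minimal sketch of the greedy packing step, assuming a simple next-fit rule: walk the tests in ascending order of P90 timing and start a new bin whenever the next test would push the current bin past 10 minutes. The slide does not say exactly how bins are filled, so the next-fit rule, the function name `pack_tests`, and the dictionary input are assumptions for illustration.

```python
from typing import Dict, List

BIN_SECONDS = 10 * 60  # bin size from the slide: 10 minutes


def pack_tests(p90_seconds: Dict[str, float],
               bin_seconds: float = BIN_SECONDS) -> List[List[str]]:
    """Pack tests into bins whose summed P90 timings stay within bin_seconds.

    A single test longer than bin_seconds still gets a bin of its own.
    """
    bins: List[List[str]] = []
    current: List[str] = []
    current_total = 0.0
    # Ascending order of timings; ties are broken by name for determinism.
    ordered = sorted(p90_seconds.items(), key=lambda kv: (kv[1], kv[0]))
    for name, secs in ordered:
        if current and current_total + secs > bin_seconds:
            bins.append(current)
            current, current_total = [], 0.0
        current.append(name)
        current_total += secs
    if current:
        bins.append(current)
    return bins
```

For example, timings of 100, 200, 300, 400 and 700 seconds would end up in three bins: the first three tests together (exactly 600 seconds), then the 400-second test, then the 700-second test alone.
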
Linear programming algorithm: Handles test dependencies. Some tests cannot be run together. Some tests need to run together. We use the PuLP LP solver.
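
One way such constraints could be written down is as a bin-packing integer program. The formulation below is an assumption for illustration, not the model from the talk: $x_{t,b}$ is 1 if test $t$ is placed in bundle $b$, $y_b$ is 1 if bundle $b$ is used, $d_t$ is the test's P90 duration, $C$ is the bin size (10 minutes), $X$ is the set of test pairs that cannot run together, and $G$ is the set of pairs that must run together.

```latex
\begin{align*}
\text{minimize}\quad & \sum_{b} y_b \\
\text{subject to}\quad
  & \sum_{b} x_{t,b} = 1 && \text{for every test } t \\
  & \sum_{t} d_t \, x_{t,b} \le C \, y_b && \text{for every bundle } b \\
  & x_{t,b} + x_{u,b} \le 1 && \text{for every } (t,u) \in X \text{ and every } b \\
  & x_{t,b} = x_{u,b} && \text{for every } (t,u) \in G \text{ and every } b \\
  & x_{t,b},\, y_b \in \{0, 1\}
\end{align*}
```

A model of this shape can be expressed with PuLP's binary variables and handed to whichever solver backend is installed.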
Auto scaling strategies: CPU utilization. Seagull runs in flight.
Based on CPU utilization: Our tasks are CPU bound. Autoscaler tracks the CPU utilization in the cluster and makes decisions based on that. Scale up if the cluster utilization is above 65% for 15 minutes. Scale down if the cluster utilization is below 35% for 30 minutes.
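
The utilization rule can be sketched as a small decision function. It reads the slide's thresholds as "above 65% sustained for 15 minutes" and "below 35% sustained for 30 minutes"; that reading, the `(timestamp, utilization)` sample format, and the names `decide` and `_sustained` are assumptions, not Fleetmiser's actual interface.

```python
from typing import Callable, List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, cluster CPU utilization 0..1)

SCALE_UP_THRESHOLD = 0.65
SCALE_UP_WINDOW_S = 15 * 60
SCALE_DOWN_THRESHOLD = 0.35
SCALE_DOWN_WINDOW_S = 30 * 60


def _sustained(samples: List[Sample], now: float, window_s: float,
               ok: Callable[[float], bool]) -> bool:
    """True if data covers the whole window and every sample in it passes ok."""
    start = now - window_s
    window = [u for t, u in samples if start <= t <= now]
    # Require at least one sample at or before the window start, so a
    # freshly started autoscaler does not act on partial data.
    has_full_window = any(t <= start for t, _ in samples)
    return has_full_window and bool(window) and all(ok(u) for u in window)


def decide(samples: List[Sample], now: float) -> str:
    """Return 'scale_up', 'scale_down', or 'hold'."""
    if _sustained(samples, now, SCALE_UP_WINDOW_S,
                  lambda u: u > SCALE_UP_THRESHOLD):
        return "scale_up"
    if _sustained(samples, now, SCALE_DOWN_WINDOW_S,
                  lambda u: u < SCALE_DOWN_THRESHOLD):
        return "scale_down"
    return "hold"
```

The longer window for scaling down makes the sketch slower to remove capacity than to add it, which matches the asymmetry in the slide's numbers.
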
Based on the number of Seagull runs submitted: Autoscaler gets notified when a new Seagull run is submitted.
Scaling down is difficult! AWS Spot Fleet does not allow us to specify which instances to terminate. Autoscaler finds and terminates the idle instances, and readjusts the Spot Fleet capacity.
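
The planning half of that scale-down step might look like the sketch below: pick instances with no running tasks, never go below a floor, and compute the new fleet target. The function name, the `min_capacity` parameter, and the assumption that every instance counts as one unit of fleet capacity are all hypothetical; the talk does not describe these details.

```python
from typing import Dict, List, Tuple


def plan_scale_down(tasks_per_instance: Dict[str, int],
                    target_capacity: int,
                    min_capacity: int) -> Tuple[List[str], int]:
    """Return (instance ids to terminate, new fleet target capacity).

    Assumes each instance contributes exactly one unit of capacity.
    """
    # Idle means no tasks are currently running on the instance.
    idle = sorted(i for i, n in tasks_per_instance.items() if n == 0)
    # Never shrink the fleet below the configured floor.
    removable = max(0, target_capacity - min_capacity)
    to_terminate = idle[:removable]
    return to_terminate, target_capacity - len(to_terminate)
```

In a real deployment the returned plan would still have to be applied through the EC2 API, for instance by lowering the fleet's target capacity and then terminating the chosen instances so that the fleet does not launch replacements; the exact sequence Yelp uses is not stated in the slides.
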
Bandwidth issues while talking to S3: Artifact and Docker image downloads take a long time, causing Seagull runs to be delayed.
Use VPC S3 endpoints: Fast and secure access to S3 without any limitations on bandwidth. Traffic does not leave the Amazon network. Caveat: it can only be enabled for S3 buckets in the same AWS region.
Central Docker registries get overwhelmed: Setup: multiple Docker registries on a single host behind an nginx proxy. It failed to cope with the requests being made. Solution: run Docker registries on every agent. Use /etc/hosts for address resolution.
Spot instances: AWS gives a warning 2 minutes before reclaiming spot instances.
Spot markets are volatile: Fluctuations in spot prices of instances in certain markets can have an adverse effect on your application. Getting the bid price right is hard. Trade-off between availability and cost savings.
Orphaned Docker containers: Containers cannot be killed when the Docker daemon gets busy, which leads to orphaned Docker containers. These containers take up resources that are not accounted for in Mesos.
docker-reaper: Proxy for the Docker daemon. Written in Go. Forwards all signals to its children. Cleans up all the containers after the child process exits.
Mesos maintenance mode: Designed to be used by a single operator. Needs an external locking mechanism to work for multiple operators.
Scheduler improvements: Use oversubscription. Use the task_processing library to replace the core component of the scheduler. Use a CSI plugin to implement clusterwide resources. Make it easier for other services and applications to use Seagull for parallelizing tasks.
Executor improvements: Containerize everything! Use the Docker runtime in the Mesos containerizer and eliminate the need to talk to the Docker daemon.
Autoscaler improvements: More advanced autoscaling for better resource utilization. Use multiple spot fleets (we may save more money?). Use more instance types in the cluster.

Taught by

Linux Foundation
