Seagull - A Distributed, Fault Tolerant, Concurrent Task Runner

Seagull - A Distributed, Fault Tolerant, Concurrent Task Runner

Linux Foundation via YouTube Direct link

We are hiring in Europe! • Offices in London or Hamburg, remote workers as welcome • Engineers or managers with dist-sys experience: Strong knowledge of systems and replication design Ability to work…

28 of 28

28 of 28

We are hiring in Europe! • Offices in London or Hamburg, remote workers as welcome • Engineers or managers with dist-sys experience: Strong knowledge of systems and replication design Ability to work…

Class Central Classrooms beta

YouTube playlists curated by Class Central.

Classroom Contents

Seagull - A Distributed, Fault Tolerant, Concurrent Task Runner

Automatically move to the next video in the Classroom when playback concludes

  1. 1 Seagull: A Distributed, Fault tolerant, Concurrent task runner
  2. 2 Outline What is Seagull? Why did we build it? Deep dive into Seagull Fleetmiser: Yelp's in-house cluster autoscaler Challenges and lessons learned Future of Seagull
  3. 3 Testing at Yelp Yelp needs to run-100,000 tests for its applications. Tests take-2 days to run if executed serially
  4. 4 Current seagull scale -350 seagull runs every day. Average runtime - 10-15 mins. -2.5 million ephemeral containers every day. Cluster scales from -70 instances to -450 instances. All spot instances -…
  5. 5 Applications of seagull Test execution: unit, integration, acceptance, and selenium tests Locust: Yelp's load testing framework Photo classification: Classify tens of millions of photos in less than …
  6. 6 Seagull Mesos scheduler Written in pythong Uses libmesos One scheduler per test suite per run 50 schedulers running simultaneously at peak Customizable concurrency Fault tolerant
  7. 7 Placement strategies Aim: Optimize for seagull bundle setup time. Affinity for already used agents. Use as many resources in an offer as possible. This also simplifies the scale down.
  8. 8 Seagull executor Custom mesos executor written in python. Uses Mesos containerizer and group isolators. Does setup and teardown of bundles. Reports resource utilization stats. Uploads log files to 53…
  9. 9 Clusterwide resources Clusterwide resources: selenium and database connections Resources are not tied to specific agents. Zookeeper ephemeral znodes to keep track of how many connections are being us…
  10. 10 Log aggregation in splunk stdout & stderr of all the executors is stored in Splunk which allows us to see failure trends across multiple seagull runs.
  11. 11 Greedy Algorithm Test timings are stored in ElasticSearch P90 of test timings for last one week are stored in DynamoDB every day. The list is sorted in ascending order of test timings. Tests are pack…
  12. 12 Linear Programming algorithm Handle test dependencies. Some tests cannot be run together. Some tests need to run together We use the PULP LP solver.
  13. 13 Auto scaling strategies CPU utilization Seagull runs in flight
  14. 14 Based on CPU utilization Our tasks are CPU bound Autoscaler tracks the CPU utilization in the cluster, and makes decisions based on that. Scale up if the cluster utilization 654 for 15 minutes Scale …
  15. 15 Based on the number of Seagull run submitted Autoscaler gets notified when a new seagullrunis submitted.
  16. 16 Scaling down is difficult! AWS Spotfleet does not allow us to specify which instances to terminate Autoscaler finds and terminates the idle instances, and readjusts the Spotfleet capacity
  17. 17 Bandwidth issues while talking to s3 Artifact and docker image download takes a long time causing seagull runs to be delayed.
  18. 18 Use VPC S3 endpoints Fast and secure access to 53 without any limitations on bandwidth Traffic does not leave Amazon network. *Caveat*: It can be only enabled for the S3 buckets in the same AWS regio…
  19. 19 Central Docker registries get overwhelmed Setup: Multiple Docker registries on a single host behind an nginx proxy It failed to cope up with requests being made. Solution: Run Docker registries on ev…
  20. 20 Spot instances AWS gives a warning 2 mins before reclaiming spot
  21. 21 Spot markets are volatile Fluctuations in spot prices of instances in certain markets can have an adverse effect on your application Getting the bid price right is hard. Trade-off between availabilit…
  22. 22 Orphaned Docker containers Cannot kill containers because docker daemon gets busy which leads to orphaned docker containers. Containers take up resources that are not accounted for in Mesos.
  23. 23 docker-reaper Proxy for Docker daemon. Written in go. Forwards all the signals to its children. Cleans up all the containers after child process exits.
  24. 24 Mesos maintenance mode Designed to be used by a single operator Need external locking mechanism to make it work for multiple operators.
  25. 25 Scheduler improvements Use oversubscription Use task_processing library to replace the core-component of the scheduler. Use CSI plugin to implement clusterwide resources. Make it easier for other ser…
  26. 26 Executor improvements Containerize everything!!! Use Docker runtime in Mesos containerizer and eliminate the need to talk to Docker daemon
  27. 27 Autoscaler improvements More advanced autoscaling for better resource utilization Use multiple spot fleets. We may save more money? Use more instance types in the cluster.
  28. 28 We are hiring in Europe! • Offices in London or Hamburg, remote workers as welcome • Engineers or managers with dist-sys experience: Strong knowledge of systems and replication design Ability to work…

Never Stop Learning.

Get personalized course recommendations, track subjects and courses with reminders, and more.

Someone learning on their laptop while sitting on the floor.