Mesos - thin resource sharing layer enabling fine grained resource sharing across frameworks
One design option - global scheduler
optimal scheduling, but several drawbacks
hard to design expressive API
complexity makes it hard to scale, less resilient to failure
frameworks already implement their own scheduling
Mesos design - resource offers
send offers of resources (CPU, memory) to frameworks (fair sharing)
framework can accept or reject then use resources however they want
resource offers count against framework’s allocation, so framework scheduler is incentivized to respond quickly
Architecture - master process which manages set of slave daemons on cluster nodes - allocation through pluggable policy (usually fair sharing, priority) - frameworks must implement a scheduler and an executor module - mesos master gathers resource information from slaves daemons - sends resource offers to framework scheduler - scheduler can accept or reject offer - if accept, then it sends the mesos master spec of tasks to run - master passes tasks along to correct node, which runs them using framework executor - frameworks can also use filters to specify that they will always reject a set of resource offers - only offer nodes from list L (whitelist), only offer nodes with at least R resources free - isolation through Linux containers - resource allocation - guaranteed allocation, if exceeded, mesos can kill cany tasks
Fault tolerance - master node maintains soft state - active slave, active frameworks, running tasks - use ZooKeeper with hot standby masters - when master fails, standby master state populated by schedulers and slaves - master reports node failures and executor crashes to frameworks
Mesos Behavior - performs well with frameworks which scale elastically, homogeneous task duration, prefer all nodes equally - Definitions: - Framework ramp-up time: time it takes a new framework to achieve its allocation (e.g., fair share) - Job completion time: time it takes a job to complete, assuming one job per framework - System utilization: total cluster utilization - elastic framework can start using nodes immediately and release upon task completion (Hadoop, Dryad) - rigid framework needs full resource allocation before begin work, cannot scale up/down - mandatory vs. preferred resources (GPUs etc.) - assume mandatory resources requested by framework doesn’t exceed its guranteed share
Implementation - not too hard to port Hadoop, Torque, MPI, and Spark to Mesos
Evaluation - Macrobenchmark - 96 node mesos cluster, (4 CPU 15GB RAM) compare to static partioning (24 nodes per framework) - Hadoop mix of small and large - Hadoop large batch jobs - Spark running ML - Torque running MPI - Mesos achieves higher utilization and mostly faster job completion (except for MPI which makes sense) - delay scheduling allows greater data locality, improves job completion time
While the abstraction provided by Mesos is useful, not much is said in the paper about long running services, which were later scheduled on Mesos with Marathon, a framework developed by AirBnB for this purpose. Moreover, since Mesos is in fact targetted towards a cloud cluster environment, it fails to address some usability problems that Kubernetes later would - for example easy monitoring, logging, deployment. Mesos also assumes that scheduling has to be done by each framework individually, which may or may not be an issue depending on whether a scheduling framework can then be easily run on top of Mesos.
Mesos has already been widely adopted at large tech companies such as Twitter and AirBnB, making it a very successful framework. Moreover, the design and architecture of Mesos directly influences how many distributed systems are structured. For example, see the GCS, scheduler and nodes with drivers and workers in Ray.