Jockey (paper)

2012 · 293 citations

Topics: Cloud Computing; Cloud Computing and Resource Management; Distributed Systems and Fault Tolerance; Distributed and Parallel Computing Systems

Abstract

Data processing frameworks such as MapReduce [8] and Dryad [11] are used today in business environments where customers expect guaranteed performance. To date, however, these systems are not capable of providing guarantees on job latency because scheduling policies are based on fair-sharing, and operators seek high cluster use through statistical multiplexing and over-subscription. With Jockey, we provide latency SLOs for data parallel jobs written in SCOPE. Jockey precomputes statistics using a simulator that captures the job's complex internal dependencies, accurately and efficiently predicting the remaining run time at different resource allocations and in different stages of the job. Our control policy monitors a job's performance, and dynamically adjusts resource allocation in the shared cluster in order to maximize the job's economic utility while minimizing its impact on the rest of the cluster. In our experiments in Microsoft's production Cosmos clusters, Jockey meets the specified job latency SLOs and responds to changes in cluster conditions.
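The control idea in the abstract (look up a precomputed prediction of remaining run time, then pick an allocation that maximizes utility at the least cost to the cluster) can be illustrated with a minimal sketch. This is not the paper's implementation: the table values, the utility shape, and the names `REMAINING`, `predict_remaining`, `utility`, and `choose_allocation` are all hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

# Hypothetical precomputed table (all numbers invented for illustration):
# for each progress checkpoint (fraction of the job completed), the predicted
# remaining run time in minutes at each resource allocation.
REMAINING = {
    0.0: {10: 120.0, 20: 70.0, 40: 45.0},
    0.5: {10: 60.0, 20: 35.0, 40: 22.0},
    0.9: {10: 12.0, 20: 7.0, 40: 5.0},
}
CHECKPOINTS = sorted(REMAINING)


def predict_remaining(progress, allocation):
    """Look up the prediction at the closest checkpoint at or below progress."""
    idx = max(bisect_right(CHECKPOINTS, progress) - 1, 0)
    return REMAINING[CHECKPOINTS[idx]][allocation]


def utility(finish_time, deadline):
    """Assumed utility: 1.0 if on time, falling linearly after the deadline."""
    if finish_time <= deadline:
        return 1.0
    return max(0.0, 1.0 - (finish_time - deadline) / deadline)


def choose_allocation(elapsed, progress, deadline):
    """Return the smallest allocation whose predicted utility is maximal."""
    best_alloc, best_utility = None, -1.0
    for alloc in sorted(REMAINING[CHECKPOINTS[0]]):
        finish = elapsed + predict_remaining(progress, alloc)
        u = utility(finish, deadline)
        if u > best_utility:  # strict '>' keeps the smallest allocation on ties
            best_alloc, best_utility = alloc, u
    return best_alloc
```

A monitoring loop would call `choose_allocation` periodically with the job's current elapsed time and progress. A job that falls behind gets a larger allocation, and a job that is ahead of schedule releases resources back to the shared cluster.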

Authors (4 of 5 shown)

Rodrigo Fonseca

Éric Boutin

Srikanth Kandula

Peter Bodík
