Performance debugging for distributed systems of black boxes (paper)

2003 · ACM SIGOPS Operating Systems Review · Citations: 568

Topics: Software System Performance and Reliability; Distributed systems and fault tolerance; Cloud Computing and Resource Management

Abstract

Many interesting large-scale systems are distributed systems of multiple communicating components. Such systems can be very hard to debug, especially when they exhibit poor performance. The problem becomes much harder when systems are composed of "black-box" components: software from many different (perhaps competing) vendors, usually without source code available. Typical solutions-provider employees are not always skilled or experienced enough to debug these systems efficiently. Our goal is to design tools that enable modestly-skilled programmers (and experts, too) to isolate performance bottlenecks in distributed systems composed of black-box nodes.

We approach this problem by obtaining message-level traces of system activity, as passively as possible and without any knowledge of node internals or message semantics. We have developed two very different algorithms for inferring the dominant causal paths through a distributed system from these traces. One uses timing information from RPC messages to infer inter-call causality; the other uses signal-processing techniques. Our algorithms can ascribe delay to specific nodes on specific causal paths. Unlike previous approaches to similar problems, our approach requires no modifications to applications, middleware, or messages.
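To make the signal-processing idea more concrete, here is a minimal sketch, not the paper's algorithm. The abstract does not say which signal-processing technique is used, so the sketch assumes plain cross-correlation for illustration. It bins message timestamps on two edges (A to B, and B to C) into time series, then picks the lag at which the two series line up best. That lag is a rough estimate of the delay node B adds. The trace values, the bin width, and the helper names `to_series` and `best_lag` are all hypothetical.

```python
from collections import Counter

def to_series(timestamps, bin_width, n_bins):
    """Count messages per time bin, giving a simple activity time series."""
    counts = Counter(int(t // bin_width) for t in timestamps)
    return [counts.get(i, 0) for i in range(n_bins)]

def best_lag(inbound, outbound, max_lag):
    """Return the lag (in bins) that maximizes the cross-correlation
    between the inbound series and the outbound series shifted by that lag."""
    def score(lag):
        return sum(inbound[i] * outbound[i + lag]
                   for i in range(len(inbound) - lag))
    return max(range(max_lag + 1), key=score)

# Hypothetical passive trace: send times (ms) of messages from A to B.
a_to_b = [5, 120, 260, 410, 555, 730]
# Assumption for the example: B forwards each message to C about 30 ms later.
b_to_c = [t + 30 for t in a_to_b]

bin_width = 10   # ms per bin (assumed)
n_bins = 100     # covers 0..1000 ms

x = to_series(a_to_b, bin_width, n_bins)
y = to_series(b_to_c, bin_width, n_bins)

lag_bins = best_lag(x, y, max_lag=20)
estimated_delay_ms = lag_bins * bin_width
print(f"estimated delay at node B: about {estimated_delay_ms} ms")
```

The sketch uses only message timestamps, in keeping with the abstract's requirement that traces be gathered without knowledge of node internals or message semantics. Real traces would be noisy and interleaved across many paths, so a peak this clean should not be expected.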

Authors (5 in total; 4 listed here)

Athicha Muthitacharoen

Patrick Reynolds

Janet L. Wiener

Jeffrey C. Mogul
