Fundamentals of fault-tolerant distributed computing in asynchronous environments (paper)

1999, ACM Computing Surveys, cited by 347
Distributed systems and fault tolerance; Optimization and Search Problems; Parallel Computing and Optimization Techniques

Abstract

Fault tolerance in distributed computing is a wide area with a significant body of literature that is vastly diverse in methodology and terminology. This paper aims at structuring the area and thus guiding readers into this interesting field. We use a formal approach to define important terms like fault, fault tolerance, and redundancy. This leads to four distinct forms of fault tolerance and to two main phases in achieving them: detection and correction. We show that this can help to reveal inherently fundamental structures that contribute to understanding and unifying methods and terminology. By doing this, we survey many existing methodologies and discuss their relations. The underlying system model is the close-to-reality asynchronous message-passing model of distributed computing.