Proactive fault tolerance for HPC with Xen virtualization (paper)

Year: 2007. Citations: 341

Topics: Cloud computing; Cloud Computing and Resource Management; Distributed and Parallel Computing Systems; Distributed systems and fault tolerance

Abstract

Large-scale parallel computing increasingly relies on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming commonplace. Current techniques to tolerate faults focus on reactive schemes to recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today's systems, node failures can often be anticipated by detecting a deteriorating health status.

Authors (4)

Stephen L. Scott

Christian Engelmann

Frank Mueller

Arun Babu Nagarajan
