Proactive fault tolerance for HPC with Xen virtualization (paper)

2007, cited by 341
Cloud Computing and Resource Management; Distributed and Parallel Computing Systems; Distributed systems and fault tolerance

Abstract

Large-scale parallel computing increasingly relies on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming commonplace. Current fault-tolerance techniques focus on reactive schemes that recover from faults after they occur, and they generally rely on a checkpoint/restart mechanism. Yet, in today's systems, node failures can often be anticipated by detecting a deteriorating health status.
