Transparent resilience for Chapel
Abstract
High-performance systems pose a number of challenges to traditional fault-tolerance
approaches. The exponential growth in core counts in large-scale distributed
systems increases the incidence of permanent, intermittent, and transient faults.
The redundancy schemes in current use dedicate a growing share of system resources
to recovery, while the prevalence of silent failure modes limits a system's ability
to detect faults that hinder application progress. As parallel computations strive to
survive these high failure rates, software is shifting its focus towards support for resilience.
This thesis proposes a mechanism for resilience support in Chapel, the high-performance
language developed by Cray. We investigate the potential of embedded transparent
resilience to enable uninterrupted program completion on distributed hardware in the
event of component failures. Our goal is graceful degradation: continued application
execution when nodes in the system suffer fatal failures. We aim to provide a
resilience-enabled version of the language that requires no application code
modifications. We focus on Chapel's task- and data-parallel constructs and enhance
their functionality with mechanisms that support resilience.
In particular, we build on the existing language constructs that facilitate parallel
execution in Chapel. We focus on constructs that introduce unstructured and structured
parallelism, as well as constructs that introduce locality, as derived from the
Partitioned Global Address Space (PGAS) programming model. Furthermore, we extend
resilience support to data distributions at the library level.
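As an illustration of the constructs concerned, the sketch below (not taken from the
thesis) shows Chapel's begin statement for unstructured parallelism, cobegin and
coforall for structured parallelism, and on-clauses for locality:

    // Unstructured parallelism: spawn an asynchronous task.
    begin writeln("asynchronous task on locale ", here.id);

    // Structured parallelism: the parent waits for both child tasks.
    cobegin {
      writeln("first child task");
      writeln("second child task");
    }

    // One task per locale, each migrated to its locale via an on-clause.
    coforall loc in Locales do
      on loc do
        writeln("hello from locale ", loc.id);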
The core implementation resides at the runtime level, primarily in Chapel's tasking
and communication layers; we introduce mechanisms to support automatic task adoption
and recovery by directing control to re-execute failed tasks. On the data-parallel
side, we propose a resilience-enabled version of the Block data distribution module.
We develop an in-memory data redundancy mechanism that exploits Chapel's concept of
locales. We apply the concept of buddy locales as the primary means of storing data
redundantly and of adopting remote workload from failed locales.
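As a minimal sketch (not the thesis implementation), the fragment below declares a
standard Block-distributed array, of the kind the resilient module targets, together
with a hypothetical buddy() mapping in which each locale's block would additionally
be kept on the next locale in a ring:

    use BlockDist;

    config const n = 1_000_000;

    // A Block-distributed domain and array: elements are partitioned
    // across the locales the program runs on.
    const D = {1..n} dmapped Block(boundingBox={1..n});
    var A: [D] real;

    forall i in D do      // each index is updated on its owning locale
      A[i] = i;

    // Hypothetical buddy mapping (illustration only): locale id's block
    // would also be stored on locale (id+1) mod numLocales, which could
    // adopt id's workload after a failure.
    proc buddy(id: int): int {
      return (id + 1) % numLocales;
    }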
We evaluate the resilient task-parallel mechanism with respect to the overheads
introduced by embedded resilience, using a set of purpose-built micro-benchmarks.
For the evaluation of resilient data parallelism, we present results for the STREAM
triad benchmark and the N-body all-pairs algorithm on a 32-node Beowulf cluster.
To support the evaluation, we develop an error injection interface that simulates
node failures.
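For reference, a minimal Chapel rendition of the STREAM triad over Block-distributed
vectors is sketched below; the problem size and scaling constant are placeholders
rather than the parameters used in the evaluation:

    use BlockDist;

    config const m = 1_000_000,
                 alpha = 3.0;

    const Dom = {1..m} dmapped Block(boundingBox={1..m});
    var A, B, C: [Dom] real;

    B = 1.0;
    C = 2.0;

    // The triad: a whole-array, data-parallel statement that a resilient
    // Block distribution must complete even if a locale fails mid-way.
    A = B + alpha * C;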