Transparent resilience for Chapel
High-performance systems pose a number of challenges to traditional fault-tolerance approaches. The exponential increase in core counts in large-scale distributed systems raises the incidence of permanent, intermittent, and transient faults. The redundancy schemes in use increase the number of system resources dedicated to recovery, while the prevalence of silent failures inhibits a system's ability to detect faults that hinder application progress. As parallel computation strives to survive high failure rates, software is shifting its focus towards support for resilience.

This thesis proposes a mechanism for resilience support in Chapel, the high-performance language developed by Cray. We investigate the potential for embedded transparent resilience, to assist uninterrupted program completion on distributed hardware in the event of component failures. Our goal is graceful degradation: continued application execution when nodes in the system suffer fatal failures. We aim to provide a resilience-enabled version of the language without application-code modifications. We focus on Chapel's task- and data-parallel constructs and enhance their functionality with mechanisms that support resilience. In particular, we build on the existing language constructs that facilitate parallel execution in Chapel: constructs that introduce unstructured and structured parallelism, and constructs that introduce locality, as derived from the Partitioned Global Address Space programming model. Furthermore, we extend the resilience support to cover data distributions at the library level. The core implementation is at the runtime level, primarily in Chapel's tasking and communication layers; we introduce mechanisms to support automatic task adoption and recovery by directing control to perform task re-execution. On the data-parallel track, we propose a resilience-enabled version of the Block data-distribution module.
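For context, the constructs referred to above can be illustrated with a minimal, simplified Chapel sketch (an assumed example, not code from the thesis): `begin` introduces unstructured parallelism, `coforall` introduces structured parallelism, and `on` introduces locality.

```chapel
// Illustrative sketch (assumed, simplified) of the Chapel constructs
// that a resilient runtime would need to protect.
coforall loc in Locales do on loc {   // structured parallelism + locality
  begin {                             // unstructured, asynchronous task
    writeln("task running on locale ", here.id);
  }
}
```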
We develop an in-memory data-redundancy mechanism that exploits Chapel's concept of locales. We apply the concept of buddy locales as the primary means to store data redundantly and to adopt remote workload from failed locales. We evaluate our resilient task-parallel mechanism with respect to the overheads introduced by embedded resilience. We use a set of constructed micro-benchmarks to evaluate the resilient task-parallel implementation, while for the evaluation of resilient data-parallelism we present results for the STREAM triad benchmark and the N-body all-pairs algorithm on a 32-node Beowulf cluster. To support the evaluation, we develop an error-injection interface that simulates node failures.
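The buddy-locale idea can be sketched as follows; the `buddy` pairing function and the ring arrangement are assumptions for illustration only, not the implementation described in the thesis.

```chapel
// Hedged sketch of buddy-locale redundancy (pairing scheme assumed):
// each locale mirrors data onto a neighbouring "buddy" locale, which
// could adopt the workload if the owner suffers a fatal failure.
proc buddy(id: int): locale {
  return Locales[(id + 1) % numLocales];  // assumed ring pairing
}

coforall loc in Locales do on loc {
  var localData = here.id * 10;           // data owned by this locale
  on buddy(here.id) {
    var backup = localData;               // redundant copy on the buddy
  }
}
```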