ROS Theses Repository


Transparent resilience for Chapel

View/Open
PanagiotopoulouK_0120_macs.pdf (3.962Mb)
Date
2020-01
Author
Panagiotopoulou, Konstantina
Abstract
High-performance systems pose a number of challenges to traditional fault tolerance approaches. The exponential increase in core counts in large-scale distributed systems exposes a growing number of permanent, intermittent, and transient faults. The redundancy schemes in use increase the share of system resources dedicated to recovery, while the extensive use of silent-failure mode inhibits systems’ ability to detect faults that hinder application progress. As parallel computation strives to survive high failure rates, software shifts its focus towards supporting resilience. This thesis proposes a resilience mechanism for Chapel, the high-performance language developed by Cray. We investigate the potential for embedded transparent resilience to assist uninterrupted program completion on distributed hardware in the event of component failures. Our goal is graceful degradation: continued application execution when nodes in the system suffer fatal failures. We aim to provide a resilience-enabled version of the language that requires no application code modifications. We focus on Chapel’s task- and data-parallel constructs and enhance them with mechanisms to support resilience. In particular, we build on existing language constructs that facilitate parallel execution in Chapel: constructs that introduce unstructured and structured parallelism, and constructs that introduce locality, as derived from the Partitioned Global Address Space programming model. Furthermore, we extend the resilience support to cover data distributions at the library level. The core implementation is at the runtime level, primarily in Chapel’s tasking and communication layers; we introduce mechanisms to support automatic task adoption and recovery by directing control flow to re-execute tasks. On the data-parallel track, we propose a resilience-enabled version of the Block data distribution module.
We develop an in-memory data redundancy mechanism that exploits Chapel’s concept of locales. We apply the concept of buddy locales as the primary means to store data redundantly and to adopt remote workload from failed locales. We evaluate our resilient task-parallel mechanism with respect to the overheads introduced by embedded resilience, using a set of constructed micro-benchmarks. For the evaluation of resilient data parallelism, we present results for the STREAM triad benchmark and the N-body all-pairs algorithm on a 32-node Beowulf cluster. To assist the evaluation, we develop an error injection interface that simulates node failures.
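
The buddy-locale idea described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the thesis implementation: it is written in Python rather than Chapel, and the names `Locale`, `buddy_of`, `distribute`, `fail`, `recover`, and `gather` are hypothetical. In particular, choosing the next locale in a ring as the buddy is an assumption made for illustration; the abstract does not say how buddies are assigned.

```python
from dataclasses import dataclass, field


@dataclass
class Locale:
    """A simulated node: its own block plus a copy of another locale's block."""
    ident: int
    alive: bool = True
    primary: dict = field(default_factory=dict)  # global index -> value (owned)
    mirror: dict = field(default_factory=dict)   # redundant copy held as a buddy


def buddy_of(ident: int, n: int) -> int:
    # Assumption for this sketch: the buddy is the next locale in a ring.
    return (ident + 1) % n


def distribute(values, n):
    """Block-distribute values over n locales and mirror each block on its buddy."""
    locales = [Locale(i) for i in range(n)]
    block = -(-len(values) // n)  # ceiling division: elements per block
    for idx, value in enumerate(values):
        owner = idx // block
        locales[owner].primary[idx] = value
        locales[buddy_of(owner, n)].mirror[idx] = value
    return locales


def fail(locales, ident):
    """Inject a node failure: everything held in that locale's memory is lost."""
    loc = locales[ident]
    loc.alive = False
    loc.primary.clear()
    loc.mirror.clear()


def recover(locales):
    """Let the buddy of each failed locale adopt the block it was mirroring."""
    n = len(locales)
    for loc in locales:
        if not loc.alive:
            buddy = locales[buddy_of(loc.ident, n)]
            if not buddy.alive:
                raise RuntimeError("locale and its buddy both failed; data lost")
            buddy.primary.update(buddy.mirror)
            buddy.mirror.clear()


def gather(locales):
    """Collect the array from the surviving locales, ordered by global index."""
    merged = {}
    for loc in locales:
        if loc.alive:
            merged.update(loc.primary)
    return [merged[i] for i in sorted(merged)]
```

With eight elements on four locales, failing locale 1 and then calling `recover` leaves its block on locale 2, so `gather` still returns the full array. The sketch does not re-establish redundancy after a failure, and it gives up when a locale and its buddy fail together; both are simplifications of this illustration.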
URI
http://hdl.handle.net/10399/4318
Collections
  • Doctoral Theses (Mathematical & Computer Sciences)

©Heriot-Watt University, Edinburgh, Scotland, UK EH14 4AS.

Maintained by the Library
Tel: +44 (0)131 451 3577
Library Email: libhelp@hw.ac.uk
ROS Email: open.access@hw.ac.uk

Scottish registered charity number: SC000278
