Adaptive architecture-transparent policy control in a distributed graph reducer
Abstract
The end of the frequency scaling era occured around 2005 as the clock frequency
has stalled for commodity architectures. Thus performance improvements that could
in the past be expected with each new hardware generation needed to originate
elsewhere. Almost all computer architectures exhibit substantial and growing levels
of parallelism, exploiting which became one of the key sources of performance and
scalability improvements. Alas, parallel programming proved much more difficult
than sequential, due to the need to specify coordination and parallelism management
aspects. Whilst low-level languages place the burden on the programmers reducing
productivity and portability, semi-implicit approaches delegate the responsibility to
sophisticated compilers and run-time systems.
This thesis presents a study of adaptive load distribution based on work stealing
using history and ancestry information in a distributed graph reducer for a nonstrict functional language. The results contribute to the exploration of more flexible
run-time-system-level parallelism control implementing a semi-explicit model of parallelism, which offers productivity and high level of abstraction by delegating the
responsibility of coordination to the run-time system.
After characterising a set of parallel functional applications, we study the use of
historical information to adapt the choice of the victim to steal from in a work stealing scheduler. We observe substantially lower numbers of messages for data-parallel
and nested applications. However, this heuristic fails in cases where past application behaviour is not resembling future behaviour, for instance for Divide-&-Conquer
applications with a large number of very fine-grained threads and generators of parallelism that move dynamically across processing elements. This mechanism is not
specific to the language and the run-time system, and applies to other work stealing
schedulers.
Next, we focus on the other key work stealing decision of which sparks that represent potential parallelism to donate, investigating the effect of Spark Colocation
on the performance of five Divide-&-Conquer programs run on a cluster of up to
256 PEs. When using Spark Colocation, the distributed graph reducer shares related
work resulting in a higher degree of both potential and actual parallelism, and more
fine-grained and less variable thread size. We validate this behaviour by observing
a reduction in average fetch times, but increased amounts of FETCH messages and
of inter-PE pointers for colocation, which nevertheless results in improved load balance for three of the five benchmark programs. The results show high speedups and
speedup improvements for Spark Colocation for the three more regular and nested
applications and performance degradation for two programs: one that is excessively
fine-grained and one exhibiting limited scalability. Overall, Spark Colocation appears most beneficial for higher numbers of PEs, where improved load balance and
higher degree of parallelism have more opportunities to pay off.
In more general terms, we show that a run-time system can beneficially use historical information on past stealing successes that is gathered dynamically and used
within the same run and the ancestry information dynamically reconstructed at run
time using annotations. Moreover, the results support the view that different heuristics are beneficial for applications using different parallelism patterns, underlining
the advantages of a flexible architecture-transparent approach.