Effective scheduling techniques for high-level parallel-programming languages
Mike Rainey
University of Chicago
SWS Colloquium
Michael Rainey received a BSc in Computer Science and a BSc in
Cognitive Science from Indiana University in 2004 and an MSc in
Computer Science from the University of Chicago in 2007. He expects
to receive a PhD in Computer Science from the University of Chicago
in the summer of 2010.
In the not-so-distant past, parallel programming was mostly the
concern of programmers specializing in high-performance
computing. Nowadays, however, many desktop and laptop computers come
equipped with a species of shared-memory multiprocessor called a
multicore processor, making parallel programming a concern for a much
broader range of programmers. High-level parallel languages, such as
Parallel ML (PML) and Haskell, seek to reduce the complexity of
programming multicore processors by giving programmers abstract
execution models, such as implicit threading, in which programmers
annotate their programs to suggest the parallel
decomposition. Implicitly-threaded programs, however, do not specify
the actual decomposition of computations or the mapping from
computations to processors. The annotations act simply as hints that
can be ignored and safely replaced by their sequential
counterparts. The parallel decomposition itself is the responsibility
of the language implementation and, more specifically, of the
scheduling system.
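As a concrete illustration of such hints, consider GHC Haskell's par
combinator, which merely suggests that an expression may be evaluated
in parallel; a runtime that ignores the hint simply runs the program
sequentially. The following minimal sketch (my illustration, not an
example from the talk) uses par and pseq from the parallel package:

    import Control.Parallel (par, pseq)

    -- 'x `par` e' is only a hint: the runtime may evaluate x in
    -- parallel with e, or ignore the hint entirely, in which case
    -- the program behaves exactly like its sequential counterpart.
    pfib :: Int -> Int
    pfib n
      | n < 2     = n
      | otherwise = x `par` (y `pseq` (x + y))
      where
        x = pfib (n - 1)
        y = pfib (n - 2)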
Threads can take arbitrarily different amounts of time to execute,
and these times are difficult to predict. Implicit threading
encourages the programmer to divide the program into threads that are
as small as possible, because doing so increases the flexibility of
the scheduler in its duty to distribute work evenly across
processors. The downside of such fine-grain parallelism is that, if
the total scheduling cost is too large, parallelism is not
worthwhile. This problem is the focus of this talk.
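To make the granularity tension concrete, a standard hand-tuned
mitigation is a sequential cutoff: below some grain size, the program
falls back to plain sequential code so that the scheduling cost is
amortized over enough useful work. In the hypothetical Haskell sketch
below, the cutoff constant 20 is illustrative and untuned; choosing
such constants per application (and per machine) is precisely the
kind of tuning burden that the techniques in this talk aim to avoid:

    import Control.Parallel (par, pseq)

    -- Below an (untuned, hypothetical) cutoff, fall back to the
    -- sequential version so scheduling costs are amortized.
    pfibCutoff :: Int -> Int
    pfibCutoff n
      | n < 20    = sfib n
      | otherwise = x `par` (y `pseq` (x + y))
      where
        x = pfibCutoff (n - 1)
        y = pfibCutoff (n - 2)

    sfib :: Int -> Int
    sfib n = if n < 2 then n else sfib (n - 1) + sfib (n - 2)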
The starting point of this talk is work stealing, a scheduling policy
well known for its scalable parallel performance, and the work-first
principle, which serves as a guide for building efficient
implementations of work stealing. In this talk, I present two
techniques, Lazy Promotion and Lazy Tree Splitting, for implementing
work stealing. Both techniques derive their efficiency from adhering
to the work-first principle. Lazy Promotion improves the
execution-time performance of a work-stealing scheduler by reducing
the load that the scheduler places on the garbage collector. Lazy
Tree Splitting is a technique
for automatically scheduling the execution of parallel operations over
trees to yield scalable performance and eliminate the need for
per-application tuning. I use Manticore, PML's compiler and runtime
system, and a sixteen-core NUMA machine as a testbed for these
techniques.
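For readers unfamiliar with work stealing: each processor owns a
double-ended queue of tasks, pushing and popping work at one end,
while idle processors steal from the other end. The work-first
principle dictates keeping the owner's common path as cheap as
possible and shifting overheads onto the rare steal path. The Haskell
sketch below is a toy, list-backed approximation of such a deque (my
illustration, with hypothetical names); production schedulers rely on
carefully engineered concurrent deques whose owner-side operations
avoid synchronization in the common case:

    import Data.IORef

    -- A toy work-stealing deque backed by a list. The owner pushes
    -- and pops tasks at the bottom (the common, work path), while
    -- idle processors steal from the top (the rare, steal path).
    newtype Deque a = Deque (IORef [a])

    newDeque :: IO (Deque a)
    newDeque = Deque <$> newIORef []

    -- Owner-side operations; the work-first principle demands that
    -- these stay as cheap as possible.
    pushBottom :: Deque a -> a -> IO ()
    pushBottom (Deque r) x = atomicModifyIORef' r (\ts -> (x : ts, ()))

    popBottom :: Deque a -> IO (Maybe a)
    popBottom (Deque r) = atomicModifyIORef' r pop
      where
        pop []       = ([], Nothing)
        pop (t : ts) = (ts, Just t)

    -- Thief-side operation; synchronization costs belong here, on
    -- the rare path. (This toy version synchronizes on both paths;
    -- real implementations avoid that on the owner's side.)
    stealTop :: Deque a -> IO (Maybe a)
    stealTop (Deque r) = atomicModifyIORef' r steal
      where
        steal [] = ([], Nothing)
        steal ts = (init ts, Just (last ts))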
In addition, I present two empirical studies. In the first study, I
evaluate Lazy Promotion over six PML benchmarks. The results
demonstrate that Lazy Promotion either matches or outperforms an
alternative scheme based on Eager Promotion. This study also
evaluates the design of the Manticore runtime system, in particular
its split-heap memory manager, by comparing the system to an
alternative based on a unified-heap memory manager and showing that
the unified version has limited scalability due to poor
locality. In the second study, I evaluate Lazy Tree Splitting over
seven PML benchmarks by comparing Lazy Tree Splitting to its
alternative, Eager Tree Splitting. The results show that, although the
two techniques offer similar scalability, only Lazy Tree Splitting is
suitable for building an effective language implementation.