(LaTeX) Abstract: |
The Parallel Random Access Machine, \de{PRAM}, is the dominant
theoretical parallel computer model. It consists of a number of
processing units, \de{PU}s, which operate synchronously, and which
can access a shared memory in constant time. Unfortunately, this
high-level model is hardly realizable in hardware using current
technology. One possibility is to directly design algorithms for the
distributed memory computer, \de{DMC}, under consideration, but this
involves many hairy details, and is not portable at all. More
practical, is to develop for every DMC a library of basic algorithms
(sorting, matrix multiplication, list ranking, \dots), a PRAM
simulator, and a `compiler'. In this way, the programmer can program
in a high-level language, which is essential for the acceptance of
parallel computing. In this paper, we sketch our experiences with
a deterministic PRAM simulator for two-dimensional meshes.
The success of a PRAM simulator depends on the achieved speed-up. We
made considerable progress. Presently, the simulation of one step of
a PRAM with $65536 = 16 \cdot 64^2$ PUs, on a $64 \times 64$ mesh,
takes about 12000 routing steps, if all 65536 PRAM PUs simultaneously
make an access. The time depends on the precise access pattern, and
CRCW steps are slightly more expensive than EREW steps. Larger speed-up
is achieved if the mesh is larger or if more PRAM PUs are simulated on
every mesh PU. It may appear disappointing that we do not achieve
substantially better. However, on an $n \times n$ mesh, we can never
expect to achieve a speed-up larger than $n / (c + o(n))$, for some
constant $c$. In our case $c$ is about 12. Our techniques are directly
applicable to higher dimensional meshes. There the speed-up is larger.
Furthermore, in our conception, the PRAM simulator should only be used
for those operations that are not available in the library of DMC
algorithms. So, a moderate speed-up for them, does not necessarily
impair the speed-up of the whole program.
We developed a small PRAM language which allows to almost directly
simulate the basic PRAM algorithms presented in textbooks. It
was used to simulate a constant time algorithm for computing the
maximum \cite[Sec. 2.6.1]{Ja}, and for the standard logarithmic time
summation algorithm. The summation of 65536 numbers requires 33 PRAM
steps, which can be simulated on a $36 \times 36$ in 55158 routing
steps with maximal queue size~$126$.
In order to be able to analyze the details of the PRAM simulation,
we have performed it on a mesh simulator. Actually, we developed a
complete programming environment, containing many basic mesh
operations: several variants of routing and sorting algorithms,
ranking, etc. All these algorithms have been optimized, in order
to get competitive results for the moderate mesh sizes under
consideration. |