New for: D1, D2, D3, D4
whose I/O cost approaches the lower bound and that
guarantees almost perfect overlap between I/O and computation.
Previous algorithms either have suboptimal I/O volume
or cannot guarantee that I/O and computation can always be overlapped.
We give an efficient implementation
that is at least competitive with the best practical
implementations while also providing additional performance guarantees.
For the experiments, we have configured a state-of-the-art machine
that can sustain full-bandwidth I/O with eight disks and
is very cost-effective.