Max-Planck-Institut für Informatik
max planck institut
informatik
mpii logo Minerva of the Max Planck Society
 

MPI-I-1999-1-001

A theoretical and experimental study on the construction of suffix arrays in external memory

Crauser, Andreas and Ferragina, Paolo

MPI-I-1999-1-001. March 1999, 40 pages. | Status: available - back from printing | Next --> Entry | Previous <-- Entry

Abstract in LaTeX format:
The construction of full-text indexes on very large text collections
is nowadays a hot problem. The suffix array [Manber-Myers,~1993] is
one of the most attractive full-text indexing data structures due to
its simplicity, space efficiency and powerful/fast search operations
supported. In this paper we analyze, both theoretically and
experimentally, the I/O-complexity and the working space of six
algorithms for constructing large suffix arrays. Three of them are
the state-of-the-art, the other three algorithms are our new
proposals. We perform a set of experiments based on three different
data sets (English texts, Amino-acid sequences and random texts) and
give a precise hierarchy of these algorithms according to their
working-space vs. construction-time tradeoff. Given the current
trends in model design~\cite{Farach-et-al,Vitter} and disk
technology~\cite{dahlin,Ruemmler-Wilkes}, we will pose particular
attention to differentiate between ``random'' and ``contiguous''
disk accesses, in order to reasonably explain some practical
I/O-phenomena which are related to the experimental behavior of
these algorithms and that would be otherwise meaningless in the
light of other simpler external-memory models.

At the best of our knowledge, this is the first study which provides
a wide spectrum of possible approaches to the construction of suffix
arrays in external memory, and thus it should be helpful to anyone
who is interested in building full-text indexes on very large text
collections.

Finally, we conclude our paper by addressing two other issues. The
former concerns with the problem of building word-indexes; we show
that our results can be successfully applied to this case too,
without any loss in efficiency and without compromising the
simplicity of programming so to achieve a uniform, simple and
efficient approach to both the two indexing models. The latter issue
is related to the intriguing and apparently counterintuitive
``contradiction'' between the effective practical performance of the
well-known BaezaYates-Gonnet-Snider's algorithm~\cite{book-info},
verified in our experiments, and its unappealing (i.e., cubic)
worst-case behavior. We devise a new external-memory algorithm that
follows the basic philosophy underlying that algorithm but in a
significantly different manner, thus resulting in a novel approach
which combines good worst-case bounds with efficient practical
performance.

Acknowledgement:
References to related material:

To download this research report, please select the type of document that fits best your needs.Attachement Size(s):
MPI-I-1999-1-001.ps479 KBytes
Please note: If you don't have a viewer for PostScript on your platform, try to install GhostScript and GhostView
URL to this document: http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/1999-1-001
Hide details for BibTeXBibTeX
@TECHREPORT{CrauserFerragina99,
  AUTHOR = {Crauser, Andreas and Ferragina, Paolo},
  TITLE = {A theoretical and experimental study on the construction of suffix arrays in external memory},
  TYPE = {Research Report},
  INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik},
  ADDRESS = {Stuhlsatzenhausweg 85, 66123 Saarbr{\"u}cken, Germany},
  NUMBER = {MPI-I-1999-1-001},
  MONTH = {March},
  YEAR = {1999},
  ISSN = {0946-011X},
}