Symbiose Project Team - INRIA/Irisa

Suffix Array Update

Written by Matthias GALLÉ

Supplementary material paper

Note: the updated webpage can be found here:

In this page you will find supplementary material for the paper (Matthias Gallé, Pierre Peterlongo, François Coste) "In-place Update of Suffix Array while Recoding Words". May 2008.

In this paper we present an algorithm to update a suffix array while in the indexed text some occurrences of a given repeat are substituted by a new character.

The problem araised when we wanted to replace a word (w_i in fig. 1) that we found using a suffix array, by the character M_i. As this was an iterative process, we thought it should be possible to obtain the suffix array of the new sequence (SA(s₁)) without recalculating it from s₁_.In fig. 1, that means to follow the red arrows instead of the blue ones.

fig 1

On this web site you will find the results of the test we performed to compare the efficiency of this algorithm, and of course the source code of the algorithm itself.

Tests

We compared our algorithm to one that recalculate the suffix from scratch.

We executed several iterations of the above shown process:

calculate the "best" repetition given a score function, and return a non-overlapping subset of this occurrences
replace all occurrences by a new character
{update, recalculate} the suffix array

The time measured is the one spent in step 3.

For the from scratch algorithm we used two different algorithms: Kärkkäinen and Sanders's (pdf and code) and Larsson and Sadakane's (ps and code).

All tests were executed on a 1GHz AMD Opteron processors with 4Gb of memory. We used three different score functions (random choice, maximal length, and maximal coverage) over the following corpora: canterbury, large canterbury, calgary, Manzini's historical, purdue.

Under the "time" column, you will find the time spent for each file, (divided by user and system time) for the update and both from scratch algorithms. Under the ratio column, you will find the addition of both times and a ratio (from_scratch / update) to better comparison.

Each file contains the three different strategies (mc=maximal coverage, ml=maximal length, rd=random)

corpus	times	ratio
canterbury
large
calgary
historical
purdue

Source code

Here you will find the source code, written in C++. It has been compiled with gcc 3.4 on linux and (darwin9) gcc 4.0.1 on Mac OS X.

Read the README.txt for more information.

Update (september 2008)

After some discussion in PSC, we compared our update algorithm with newer versions of Suffix Array Creation Algorithms. Our requirement of flexibel and not constant alphabet size discard lots of them (notably, libdivsufsort which seems to be the practical fastest SACA at this moment). We followed Yuta Mori's advice and used his implementation of the Induced Sorting algorithm from Nong, Zhang and Chan (paper and source-code) and tested also Mori's implementation of the same algorithm,

Here the results over two of the corpuses used:

corpus	times & ratio original implementation	times & ratio Yuta Mori's implementation
large
calgary

< Prev		Next >

[ Back ]