Paper supplementary material (last update: march 2009)
In this page you will find supplementary material for the paper (Matthias Gallé, Pierre Peterlongo, François Coste) "In-place Update of Suffix Array while Recoding Words". International Journal of Foundations of Computer Science. 2009.
In this paper we present an algorithm to update a suffix array while in the indexed text some occurrences of a given repeat are substituted by a new character.
The problem araised when we wanted to replace a word (wi in fig. 1) that we found using a suffix array, by the character Mi. As this was an iterative process, we thought it should be possible to obtain the suffix array of the new sequence (SA(s1)) without recalculating it from s1. In fig. 1, that means to follow the red arrows instead of the blue ones.
fig 1
On this web site you will find the results of the test we performed to compare the efficiency of this algorithm, and of course the source code of the algorithm itself.
Tests
We compared our algorithm to one that recalculate the suffix from scratch.
We executed several iterations of the above shown process:
- calculate the "best" repetition given a score function, and return a non-overlapping subset of this occurrences
- replace all occurrences by a new character
- {update, recalculate} the suffix array
The time measured is the one spent in step 3.
For the from scratch algorithm we used two different algorithms: Kärkkäinen and Sanders's (pdf and code) and Larsson and Sadakane's (ps and code).
All tests were executed on a 1GHz AMD Opteron processor with 4Gb of memory. We used three different score functions (random choice, maximal length, and maximal coverage) over the following corpora: canterbury, large canterbury, calgary, Manzini's historical, purdue.
Under the "time" column, you will find the time spent for each file, (divided by user and system time) for the update and both from scratch algorithms. Under the ratio column, you will find the addition of both times and a ratio (from_scratch / update) to better comparison.
Each file contains the three different strategies (mc=maximal coverage, ml=maximal length, rd=random)
corpus |
times |
ratio |
canterbury |
|
|
large |
|
|
calgary |
|
|
historical |
|
|
purdue |
|
|
Source code
Here you will find the source code, written in C++. It has been compiled with gcc 3.4 on linux and (darwin9) gcc 4.0.1 on Mac OS X.
Read the README.txt for more information.
Update (september 2008)
After some discussion in PSC, we compared our update algorithm with newer versions of Suffix Array Creation Algorithms. Our requirement of flexibel and not constant alphabet size discard lots of them (notably, libdivsufsort which seems to be the practical fastest SACA at this moment). We followed Yuta Mori's advice and used his implementation of the Induced Sorting algorithm from Nong, Zhang and Chan (paper and source-code) and tested also Mori's implementation of the same algorithm,
Here the results over two of the corpuses used:
corpus |
times & ratio original implementation |
times & ratio Yuta Mori's implementation |
large |
|
|
calgary |
|
|
|