ARC INRIA
Seed design optimization
BLAST-like programs use seeds to increase speed. Every computed
alignment must be anchored by at least one seed, so seed design
is the keystone of a successful method. Seeds with low
specificity lead to time-consuming applications, while seeds with
low sensitivity miss relevant alignments. Good seeds have to
strike the right balance between sensitivity and specificity.
In addition, since seeds are fully indexed, particular effort must
be made to limit the memory required by the index. In this
project, we want to find, for a fixed sensitivity, seeds that lead
to fast applications (high specificity) while minimizing the index
size.
Results
In this context, we focused our research on subset
seeds, based on the work of L. Noé and G. Kucherov of the
LIFL bioinformatics group. A subset seed is composed of
non-consecutive flexible characters. Such seeds give
better results than "classic" seeds, but their design is much
more difficult. To be able to quickly compute the
characteristics of a subset seed, or of a set of subset seeds,
we developed an environment that rapidly evaluates the
specificity and sensitivity of such seeds. This environment
uses a large bank of exhaustive alignments obtained by intensive
Smith-Waterman computation, aligning the human proteome with
bacterial and archaeal proteomes.
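As an illustration of what such an evaluation involves, here is a
minimal Python sketch. It is not the environment described above.
It assumes that each seed letter partitions the amino acid alphabet
into groups and that two residues match at a position when they
fall in the same group. The seed letters "#", "@" and "_", the
grouping used for "@", the uniform background distribution and all
function names are hypothetical choices made for the example, and
alignments are assumed to be ungapped pairs of equal length.

```python
from collections import Counter
from math import prod

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def partition(groups):
    """Map each amino acid to a group id; unlisted residues stay alone."""
    mapping = {}
    next_id = 0
    for group in groups:
        for aa in group:
            mapping[aa] = next_id
        next_id += 1
    for aa in AMINO_ACIDS:
        if aa not in mapping:
            mapping[aa] = next_id
            next_id += 1
    return mapping


# Hypothetical seed alphabet (illustrative only, not taken from the project).
SEED_LETTERS = {
    "#": partition([]),                                      # exact match
    "@": partition(["ILMV", "FWY", "DE", "KR", "ST", "NQ"]),  # grouped match
    "_": partition([AMINO_ACIDS]),                           # don't care
}


def hits_at(seed, a, b, i):
    """True if every seed letter accepts the residue pair it faces at offset i."""
    return all(
        SEED_LETTERS[s][a[i + j]] == SEED_LETTERS[s][b[i + j]]
        for j, s in enumerate(seed)
    )


def has_hit(seed, a, b):
    """True if the seed hits the ungapped alignment (a, b) at least once."""
    return any(hits_at(seed, a, b, i) for i in range(len(a) - len(seed) + 1))


def sensitivity(seed, alignments):
    """Fraction of reference alignments detected by the seed."""
    return sum(has_hit(seed, a, b) for a, b in alignments) / len(alignments)


def random_hit_probability(seed):
    """Chance of a hit at a random position, assuming uniform residues."""
    p = 1.0
    for s in seed:
        sizes = Counter(SEED_LETTERS[s].values()).values()
        p *= sum((n / len(AMINO_ACIDS)) ** 2 for n in sizes)
    return p


def index_keys(seed):
    """Number of distinct index keys: product of group counts per position."""
    return prod(len(set(SEED_LETTERS[s].values())) for s in seed)


toy = [("MKVLA", "MRILA"), ("AAAAA", "CCCCC")]  # made-up toy alignments
print(sensitivity("#@_#", toy), random_hit_probability("#@_#"), index_keys("#@_#"))
```

In this toy model, a lower random hit probability stands for a
higher specificity, and the number of index keys gives a rough idea
of how the index grows with the seed. A real evaluation would use
the alignment bank and a realistic residue distribution.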
This corpus, together with the YASS program developed in the
Sequoia team, guided us in constructing several efficient sets
of subset seeds whose sensitivity is slightly better than that
of the seed used by Blastp. These seeds were implemented on the
ReMIX machine. The tests performed showed a 16x speed-up
factor: a 13x speed-up comes from the hardware, while the
purely algorithmic approach contributes an additional 24%
speed-up.
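The reported figures fit together if the two gains are assumed to
combine multiplicatively. That assumption is not stated in the text
but matches its numbers, as this short Python check shows (the
variable names are illustrative):

```python
hardware_speedup = 13.0   # hardware factor reported in the text
algorithmic_gain = 0.24   # additional 24% from the algorithmic approach

# Assumption: the two contributions multiply.
overall = hardware_speedup * (1.0 + algorithmic_gain)
print(f"overall speed-up: about {overall:.1f}x")
```

Under this assumption the product is about 16.1, in line with the
16x factor observed in the tests.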
This work, carried out as part of the study of the dependency
between algorithms and hardware, produces results that improve
Blastp performance through the algorithmic approach alone.
Furthermore, this research led to a very fruitful collaboration
between the Sequoia and Symbiose teams: it raised promising new
ideas on indexing methods that allow a significant reduction in
space together with a real additional speed-up.
Publications
- P. Peterlongo, L. Noé, D. Lavenier, G. Georges,
J. Jacques, G. Kucherov, M. Giraud, Protein similarity search
with subset seeds on a dedicated reconfigurable hardware, PBC
2007, Workshop on Parallel Computational Biology, Gdansk,
Poland, September 9-12, 2007
- M. Giraud, G. Kucherov, D. Lavenier, L. Noé, P. Peterlongo,
Utilization of Subset Seeds on a Reconfigurable Architecture,
LAW 2007, London Algorithm Workshop, King's College London, UK,
February 7-8, 2007