ARC INRIA FLASH - Seed Optimisation and Indexing of Genomic Databases


Contacts

Dominique Lavenier

Partners

Symbiose Project
LIFL bioinfo Group
LESTER lab
LBBMA lab


Seed design optimization

BLAST-like programs use seeds to increase speed. Every computed alignment must go through an anchoring step that involves at least one seed. Seed design is thus the keystone of a successful method. Seeds with low specificity lead to time-consuming applications, while seeds with low sensitivity miss alignments and return imprecise results. Good seeds must strike the right balance between sensitivity and specificity.
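The anchoring idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the seed is a plain contiguous word of length `k`, the sequences are toy DNA strings, and the names `build_index` and `seed_hits` are hypothetical, not taken from BLAST or from this project.

```python
from collections import defaultdict

def build_index(text, k):
    """Map every k-mer of `text` to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(text) - k + 1):
        index[text[i:i + k]].append(i)
    return index

def seed_hits(query, index, k):
    """Return (query_pos, db_pos) pairs where a k-mer of the query occurs in the indexed text."""
    hits = []
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], []):
            hits.append((j, i))
    return hits

# Toy data (assumption): a short "database" and a query sharing a common region.
database = "ACGTACGTTGCA"
query = "TTACGTT"

short_hits = seed_hits(query, build_index(database, 2), 2)  # short seed: many anchors
long_hits = seed_hits(query, build_index(database, 5), 5)   # long seed: few anchors
```

Each hit is a candidate anchor that a later stage would have to extend into an alignment. On this toy input the shorter seed produces more anchors than the longer one, which is the cost side of the trade-off: a less specific seed means more extensions to try, while a longer seed risks missing weaker similarities.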

In parallel, since seeds are fully indexed, particular effort is needed to limit the memory required by the index. In this project, we want to find, for a fixed sensitivity, seeds that lead to fast applications (high specificity) and minimize the index size.
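To give a feel for why index memory matters, here is a rough back-of-the-envelope sketch in Python. The layout it assumes (one bucket pointer per possible key plus one stored position per database offset, 4 bytes each, repeated for every seed) is a hypothetical model chosen for illustration; the actual index structure used in the project is not described here.

```python
def index_footprint(db_length, key_space, n_seeds=1, bytes_per_entry=4):
    """Rough index size in bytes under an assumed layout:
    one bucket pointer per key plus one position per database offset, per seed."""
    bucket_table = key_space * bytes_per_entry
    position_lists = db_length * bytes_per_entry
    return n_seeds * (bucket_table + position_lists)

# Contiguous seeds of weight 3 and 4 over the 20 amino acids,
# for a hypothetical database of one million residues.
size_w3 = index_footprint(db_length=1_000_000, key_space=20 ** 3)
size_w4 = index_footprint(db_length=1_000_000, key_space=20 ** 4)
```

Under this model the position lists dominate for a single seed, and the total grows linearly with the number of seeds, which is why using several seeds at once puts pressure on memory.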

Results

In this context, we oriented our research toward subset seeds, based on the work of L. Noé and G. Kucherov of the LIFL bioinformatics group. A subset seed is composed of non-consecutive flexible characters. Such seeds lead to better results than "classic" seeds, but their design is much more difficult. Thus, in order to quickly compute the characteristics of a subset seed, or of a set of subset seeds, we developed an environment that rapidly evaluates the specificity and sensitivity of such seeds. This environment uses a large bank of exhaustive alignments obtained by intensive Smith-Waterman computation of alignments of the human proteome against bacterial and archaeal proteomes.
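A minimal Python sketch may help make the notion of "flexible characters" concrete. It assumes a simplified reading of subset seeds in which every seed letter partitions the 20 amino acids into classes, and two residues match at a position when they fall in the same class. The seed letters `#`, `@`, `_`, the amino-acid grouping, the toy alignments and all function names are hypothetical; they are not the seeds designed in this project, and the real evaluation environment is not reproduced here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical seed letters, each mapping a residue to its class.
EXACT = {aa: aa for aa in AMINO_ACIDS}                # '#': residues must be identical
GROUPS = ["LIVM", "FYW", "KRH", "DE", "NQ", "ST", "AG", "C", "P"]
COARSE = {aa: g for g in GROUPS for aa in g}          # '@': same group is enough
JOKER = {aa: "*" for aa in AMINO_ACIDS}               # '_': any pair matches
LETTERS = {"#": EXACT, "@": COARSE, "_": JOKER}

def seed_key(seed, word):
    """Index key of `word` under `seed`: one class per seed position."""
    return tuple(LETTERS[s][c] for s, c in zip(seed, word))

def key_space(seed):
    """Number of distinct keys the seed can produce (drives the index size)."""
    size = 1
    for s in seed:
        size *= len(set(LETTERS[s].values()))
    return size

def hits(seed, a, b):
    """True if the seed matches the gapless alignment of a and b at some offset."""
    span = len(seed)
    return any(seed_key(seed, a[i:i + span]) == seed_key(seed, b[i:i + span])
               for i in range(min(len(a), len(b)) - span + 1))

def sensitivity(seed, alignments):
    """Fraction of reference alignments hit by the seed at least once."""
    return sum(hits(seed, a, b) for a, b in alignments) / len(alignments)

# Toy stand-in for a bank of reference alignments (assumption).
bank = [("MKLVD", "MRIVE"), ("ACDEF", "ACDEF"), ("GHIKL", "PQRST")]
classic = sensitivity("###", bank)    # three identical residues in a row
subset = sensitivity("#@@#", bank)    # two exact positions around two group matches
```

On this toy bank the subset seed `#@@#` detects the conservative substitutions in the first pair that the exact seed `###` misses. Measuring specificity would follow the same pattern, counting hits on unrelated sequence pairs instead of reference alignments.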

This corpus, together with the YASS program developed in the Sequoia team, guided us in constructing several efficient sets of subset seeds whose sensitivity is slightly better than that of the seed used by Blastp. These seeds were implemented on the ReMIX machine. The tests we performed showed a 16X speed-up factor. This speed-up results from a 13X hardware speed-up factor, while the purely algorithmic approach contributes an additional 24% speed-up.
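The reported figures are consistent with one another if the hardware and algorithmic gains are assumed to combine multiplicatively. That assumption is ours, and the symbols below are introduced only for this check:

```latex
S_{\text{total}} = S_{\text{hw}} \times (1 + g_{\text{algo}})
                 = 13 \times 1.24
                 \approx 16.1
```

Here $S_{\text{hw}}$ is the hardware speed-up factor and $g_{\text{algo}}$ the additional relative gain from the algorithmic approach, giving roughly the 16X overall factor.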

This work, carried out within the framework of the dependency between algorithm and hardware, yields results that improve Blastp performance through the algorithmic approach alone. Furthermore, this research led to a very fruitful collaboration between the Sequoia and Symbiose teams: it raised promising new ideas on indexing methods that allow a significant reduction in index space coupled with a real additional speed-up.

Publications

P. Peterlongo, L. Noé, D. Lavenier, G. Georges, J. Jacques, G. Kucherov, M. Giraud, Protein similarity search with subset seeds on a dedicated reconfigurable hardware, PBC 2007, Workshop on Parallel Computational Biology, Gdansk, Poland, September 9-12, 2007
M. Giraud, G. Kucherov, D. Lavenier, L. Noé, P. Peterlongo, Utilization of Subset Seeds on a Reconfigurable Architecture, LAW 2007, London Algorithm Workshop, King's College London, UK, February 7-8, 2007