Comparing 700,000 bacterial proteins against the human genome
The Inserm U694 laboratory studies mitochondrial diseases.
Its strategy is to perform an in silico study to locate potential
mitochondrial proteins on the human genome. As the mitochondrion may
originate from ancestral bacteria, a systematic comparison with the
proteomes of all available bacteria is required.
From a computational point of view, this is equivalent to running a
tblastn search of 700,000 proteins against the human genome.
The computation time has been estimated at about 1 year on the
Inserm U694 server.
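To illustrate why a tblastn-style search is costly, the sketch below shows the
translation step that such a search implies: the nucleotide sequence is
translated in all six reading frames before protein queries can be compared
against it. This is a minimal illustration using the standard genetic code; the
function names are hypothetical and it is not the code run by the laboratory.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon; codons not in the table (e.g. with N) give 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame (+1..+3, -1..-3) to its protein translation."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


frames = six_frame_translation("ATGGCC")
print(frames[1])   # forward frame 1
print(frames[-1])  # reverse-complement frame 1
```

Each of the six translated sequences then has to be searched with every protein
query, which is what makes the full comparison so expensive.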
Results
A tblastn-like program has been implemented. The indexing
scheme is based on blast-like seeds and serves as
a reference. The index is about 40 times the size of
the raw human genome data (about 90 Gbytes).
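The idea of a seed-based index can be sketched as follows. This is a simplified
illustration that assumes exact-match words of fixed length w as seeds; the
actual seed model and index layout of the program are not described here, and
the names build_seed_index and find_seed_hits are hypothetical.

```python
from collections import defaultdict


def build_seed_index(text, w=3):
    """Map every word of length w in text to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(text) - w + 1):
        index[text[pos:pos + w]].append(pos)
    return index


def find_seed_hits(query, index, w=3):
    """Return (query_pos, text_pos) pairs where a query word matches an indexed word."""
    hits = []
    for qpos in range(len(query) - w + 1):
        for tpos in index.get(query[qpos:qpos + w], ()):
            hits.append((qpos, tpos))
    return hits


# Toy translated subject sequence and protein query (made-up data).
subject = "MKVLAAGMKV"
index = build_seed_index(subject)
print(find_seed_hits("GMKVL", index))
```

Because such an index stores a position for every word occurrence, it is
naturally much larger than the sequence it describes. Each hit is only a
candidate: a tblastn-like program would then extend it into a scored alignment.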
A reconfigurable operator implementing the time-consuming part of the tblastn process has
been designed. It houses 160 small dedicated processors working in parallel.
With a single ReMIX board, the complete human genome has been compared against the
bacterial proteome (700,000 proteins) in 10 days.
Based on the algorithmic enhancements provided by the
design of new seeds, we can now expect a 25% reduction in both the computation time
and the ReMIX FLASH occupancy. These results can also be
generalized to standard computers, especially multicore architectures.
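The figures quoted above imply a rough speedup, worked out below. This is a
back-of-the-envelope calculation that assumes "about 1 year" means 365 days;
since the server time is itself an estimate, the ratios are indicative only.

```python
DAYS_PER_YEAR = 365  # assumption: "about 1 year" is taken as 365 days

server_days = 1 * DAYS_PER_YEAR  # estimated time on the Inserm U694 server
remix_days = 10                  # measured time on a single ReMIX board
speedup = server_days / remix_days

reduction = 0.25                 # expected gain from the new seeds
projected_days = remix_days * (1 - reduction)
projected_speedup = server_days / projected_days

print(f"speedup with ReMIX: about {speedup:.1f}x")
print(f"projected run time with new seeds: {projected_days:.1f} days")
print(f"projected speedup: about {projected_speedup:.1f}x")
```

Under these assumptions, the single board is roughly 36 times faster than the
estimated server run, and the new seeds would bring the run down to about 7.5 days.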
Publications
- D. Lavenier, G. Georges, X. Liu, A Reconfigurable Index
FLASH Memory Tailored to Seed-Based Genomic Sequence Comparison
Algorithms, Journal of VLSI Signal Processing Systems,
special issue on Computing Architectures and Acceleration for
Bioinformatics Algorithms, vol. 43, issue 3, Sept. 2007
- D. Lavenier, X. Xinchun, G. Georges, Seed-based Genomic Sequence
Comparison using a FPGA/FLASH Accelerator, International IEEE
Conference on Field Programmable Technology (FPT), Bangkok, Thailand,
2006