Symbiose Project Team - INRIA/Irisa

Scientific Axes

Symbiose: a Bioinformatics center

Symbiose is a bioinformatics research project. It focuses on methodological research at the interface between computer science and molecular biology, excluding "standard" informatics ("biocomputing") for the routine management of biological data.

The Symbiose team brings together two entities in the bioinformatics domain: a research group and a technical platform called GenOuest.

The research group focuses on high performance computing for large-scale genomic data and modeling of large-scale biological systems.
The GenOuest platform belongs to Biogenouest, the life science network of western France. It has been labeled IBiSA since 2009 and is certified ISO 9001:2008. It coordinates the activities of the RENABI-GO regional center, one of the six French bioinformatics resource centers. It offers a range of bioinformatics services (computing power, storage, databanks, development, and training) to a wide regional and national community.

The two entities collaborate closely to offer full technological, research and training support to the biological community. Research and technological development projects are conducted in collaboration with INRA, Inserm and CNRS biology teams from across the country.

This environment offers the opportunity to bring together in one place computer scientists with strong expertise in high performance computing and dynamical modeling for bioinformatics, along with experience of interacting with genomics research labs through a resource center. In this competitive field, we are concerned with the storage, analysis and interpretation of large-scale and multi-timescale datasets produced by other platforms and research teams, including (although not exclusively) the analysis of Next Generation Sequencing data. Our project addresses both the pragmatic needs of managing and exploiting a high-throughput resource and the longer-term need to develop original algorithms and applications through dedicated research.

A few slides presenting Symbiose are available here

Scientific axes

Our research is characterized by an interest in large-scale studies (genomes, proteomes or regulation networks) and in the discrete methods needed to handle the associated complexity. We have a general concern for high performance computing and for two types of modeling tasks: modeling sequences and structures, and modeling regulation networks.

Optimized algorithms on parallel specialized architectures

First and foremost, large-scale studies require fine tuning and careful management of computational resources. We investigate the practical use of parallelism to speed up computations in genomics. Topics of interest range from intensive sequence comparisons to pattern or model matching, including structure prediction. We work on the co-design of algorithms and hardware architectures tailored to such applications. This work is based on the study of reconfigurable machines employing Field Programmable Gate Arrays (FPGA) or fast components such as Flash memories or Graphics Processing Units (GPU).

Modeling sequences and structures

This track concerns the search for relevant (e.g. functional) spatial or logical structures in macromolecules, with the intent to model either specific spatial structures (secondary and tertiary structures, disulfide bonds, etc.) or general biological mechanisms (transposition, etc.). In the framework of language theory and combinatorial optimization, we address various types of problems: design of grammatical models of biological sequences and machine learning of grammatical models from sequences; efficient filtering and model matching in data banks; protein structure prediction. The corresponding disciplinary fields are language theory, algorithmics on words, machine learning, data analysis and combinatorial optimization.

Systems biology

We address the question of constructing accurate models of biological systems with respect to available data and knowledge. The availability of high-throughput methods in molecular biology has led to a tremendous increase in measurable data, along with the resulting knowledge repositories gathered on the web (e.g. KEGG, MetaCyc, RegulonDB). However, both measurements and biological networks are prone to incompleteness, heterogeneity, and mutual inconsistency, making it highly non-trivial to draw biologically meaningful conclusions in an automated way. Based on this observation, we develop methods for the analysis of large-scale biological networks that formalize various reasoning modes in order to highlight incomplete regions in a regulatory model and to point at network products that need to be activated or inactivated to globally explain the experimental data. We also consider small-scale biological systems to gain a fine understanding of the conclusions that can be drawn about active pathways from available data, working on deducible properties rather than simulation. The corresponding disciplinary fields are model checking, constraint-based analysis and dynamical systems.

Main results in 2010

Annotation tools

This work relied heavily on the computing facilities hosted by the GenOuest platform within the Symbiose project. It involved high performance computing approaches and probabilistic methods.

Omic information management: MIMAS 3.0 is a Multiomics Information Management and Annotation System. [link to the publication]
Annotation tool: AphidBase is a centralized bioinformatic resource for annotation of the pea aphid genome. [Reference in HAL]
Pea aphid genome annotation: Annotation and analysis of the pea aphid genome by a large international community. We were strongly involved in database management, annotation protocols and gene curation. [Reference in HAL]
Size of indexing structures: Factor and suffix oracles provide an economical and efficient solution for storing all the factors and suffixes, respectively, of a given text. We give an estimate of the average size for dedicated examples of factor/suffix oracles. [Reference in HAL]
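
To make the notion of oracle size concrete, here is a minimal Python sketch of the standard online factor oracle construction. It is an illustration only, not the code behind the cited result: the function names are hypothetical, and measuring size as the number of transitions is an assumption made for this example.

```python
def build_factor_oracle(text):
    """Build the factor oracle of `text` with the classical online construction.

    Returns a list of transition dicts, one per state (len(text) + 1 states).
    """
    n = len(text)
    transitions = [dict() for _ in range(n + 1)]
    supply = [-1] * (n + 1)  # supply (suffix link) of each state; -1 for the initial state
    for i, char in enumerate(text):
        new_state = i + 1
        transitions[i][char] = new_state  # "spine" transition along the text
        k = supply[i]
        # Follow supply links, adding the missing transitions on `char`.
        while k > -1 and char not in transitions[k]:
            transitions[k][char] = new_state
            k = supply[k]
        supply[new_state] = 0 if k == -1 else transitions[k][char]
    return transitions


def accepts(transitions, word):
    """Return True if `word` labels a path starting at the initial state."""
    state = 0
    for char in word:
        if char not in transitions[state]:
            return False
        state = transitions[state][char]
    return True


def oracle_size(transitions):
    """Size of the oracle, taken here as its total number of transitions."""
    return sum(len(out) for out in transitions)
```

For a text of length n, the oracle has n + 1 states and accepts every factor of the text; it may also accept some words that are not factors, which is the price paid for its compactness.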

Sequence alignment and comparison

These works were based on high performance computing approaches and grammatical inference.

Genomic sequence comparison: Parallel Genomic Sequence Comparison [Reference in HAL]
Protein sequence comparison: Book chapter dedicated to Seed-Based Parallel Protein Sequence Comparison Combining Multithreading, GPU, and FPGA technologies [Reference in HAL]
Comparative genomics: Comparative genomics on lepidoptera species [Reference in HAL] and aphid species [Reference in HAL].
Short sequences alignment: GASSST, Global Alignment Short Sequence Search Tool [Reference in HAL]
Multiple repeats in DNA sequences: An algorithm designed for the fast filtration of full genomes in order to detect multiple repeats. [Reference in HAL]
Homology searches: Book chapter dedicated to the use of advanced algorithmic techniques based on filtration and on the use of seeds for retrieving homologies with high specificity and sensitivity in large datasets. [Reference in HAL]
Structure of biological sequences: A tutorial on Modelling Biological Sequences by Grammatical Inference [Reference in HAL]
Inference of genomic sequences structure: We address the smallest grammar problem on large sequences, that is, finding a smallest context-free grammar that generates exactly one sequence. We use the concept of maximal repeats and propose a new algorithm which can be applied to whole genomes of model organisms and is able to produce grammars up to 10% smaller than the state of the art. [Reference in HAL] and [Reference in HAL]
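
To illustrate what a grammar that generates exactly one sequence looks like, here is a minimal Python sketch of a greedy pair-replacement heuristic in the style of Re-Pair. This is a well-known baseline swapped in for illustration; it is not the maximal-repeat algorithm mentioned above. All names are hypothetical, and grammar size is assumed here to be the total length of the right-hand sides.

```python
from collections import Counter


def repair(sequence):
    """Greedy pair replacement: returns (start_symbols, rules).

    Terminals are the characters of `sequence`; nonterminals are integers.
    Each rule maps a nonterminal to a pair of symbols.
    """
    symbols = list(sequence)
    rules = {}
    while len(symbols) > 1:
        pair, count = Counter(zip(symbols, symbols[1:])).most_common(1)[0]
        if count < 2:
            break  # no adjacent pair repeats: nothing left to factor out
        nonterminal = len(rules)
        rules[nonterminal] = pair
        # Replace non-overlapping occurrences of the pair, left to right.
        rewritten, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                rewritten.append(nonterminal)
                i += 2
            else:
                rewritten.append(symbols[i])
                i += 1
        symbols = rewritten
    return symbols, rules


def expand(symbol, rules):
    """Expand a symbol back into the string of terminals it derives."""
    if symbol in rules:
        left, right = rules[symbol]
        return expand(left, rules) + expand(right, rules)
    return symbol


def grammar_size(start, rules):
    """Total length of all right-hand sides, including the start rule."""
    return len(start) + sum(len(rhs) for rhs in rules.values())
```

Expanding the start symbols with `"".join(expand(s, rules) for s in start)` recovers the input sequence, so the grammar is a lossless, and on repetitive input more compact, description of it.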

Structure alignment and comparison

We mainly used linear optimization methods to address questions from structural biology.

Local protein threading, sequence-structure alignment: This paper presents a novel approach to the protein threading problem (PTP) which makes it possible to align part of a protein structure onto a protein sequence in order to detect local similarities. [link to the publication]
Protein structure comparison, global alignment: We introduce a new integer programming model for Contact Map Overlap (CMO), a scoring scheme for similarities between protein structures. We propose an exact branch-and-bound algorithm with bounds obtained by a novel Lagrangian relaxation. [Reference in HAL]
SHREC'10 Track: Protein Models (Eurographics Workshop on 3D Object Retrieval competition): This paper presents the results of the SHREC'10 Protein Models Classification Track. The aim of this track is to evaluate how well 3D shape recognition algorithms can classify protein structures according to the CATH [CSL08] superfamily classification. The global alignment tools developed in the Symbiose team obtained the best result. [link to the publication]
Protein structure comparison, local alignment: In this paper, we propose a new protein structure comparison method based on internal distances (DAST), whose main characteristic is that it generates alignments with an RMSD smaller than any previously given threshold. [link to the publication]
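
As an illustration of the contact map overlap score mentioned above, here is a minimal Python sketch that evaluates the overlap for one given alignment. It only computes the objective value; it does not implement the branch-and-bound or the Lagrangian relaxation. The function names, the distance threshold and the representation of an alignment as an order-preserving dictionary are all assumptions made for this example.

```python
from itertools import combinations
from math import dist


def contact_map(coords, threshold=7.5):
    """Set of residue index pairs (i, j), i < j, closer than `threshold`.

    `coords` is a list of 3D points, one per residue (hypothetical input).
    """
    return {
        (i, j)
        for i, j in combinations(range(len(coords)), 2)
        if dist(coords[i], coords[j]) <= threshold
    }


def overlap(contacts_a, contacts_b, alignment):
    """Number of contacts of A mapped onto contacts of B by `alignment`.

    `alignment` maps residue indices of A to residue indices of B and is
    assumed to be order-preserving, so i < j implies alignment[i] < alignment[j].
    """
    return sum(
        1
        for i, j in contacts_a
        if i in alignment
        and j in alignment
        and (alignment[i], alignment[j]) in contacts_b
    )
```

The optimization problem is then to find the order-preserving alignment that maximizes this count, which is where the exact methods cited above come in.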

Markers identification in genome sequences

The results were obtained either with high performance computing approaches or with logic programming.

SNP: SNP identification without a reference genome, an approach for calling SNPs by comparing two sets of raw NGS reads without assembly or mapping on a reference genome. [Reference in HAL]
miRNA: Deep sequencing of microRNAs and expression analysis during phenotypic plasticity in the pea aphid, Acyrthosiphon pisum [Reference in HAL]
Rearrangement breakpoints: Cassis, detection of genomic rearrangement breakpoints. [link to the publication]
Transcription factors: We introduce a parallel scheme for comparing transcription factor binding site matrices [Reference in HAL]
ModuleOrganizer, modules in families of transposable elements: We introduce the concept of a transposable element module. We propose a new assembly method that does not require multiple sequence alignment. We show its sensitivity on several examples. [Reference in HAL]

Confronting (omic) data with knowledge-based regulatory models

The studies below mainly involved advanced logic programming (ASP) and discrete mathematics.

Model construction: Model of cap-dependent translation initiation in sea urchin. A step towards the eukaryotic translation regulation network [Reference in HAL]
Active regulations: Use of constraint-based approaches (Bioquali tool) to localize potentially active post-transcriptional regulations in the Ewing's sarcoma gene regulatory network [Reference in HAL]
Source of phenotypes: Designing Logical Rules to Model the Response of Biomolecular Networks with Complex Interactions: An Application to Cancer Modeling. [Reference in HAL]
Model correction tools: Repair and Prediction (under Inconsistency) in Large Biological Networks with Answer Set Programming [Reference in HAL]
Model inference: Complex update strategies for Probabilistic Boolean Networks [Reference in HAL]
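
To give a flavor of what confronting data with a regulatory model can mean, here is a minimal Python sketch of a sign consistency check on an influence graph. It assumes the commonly used rule that an observed variation of a node must be explained by at least one of its regulators. This is a plain Python illustration, not the Bioquali tool or an ASP encoding, and all names and the data layout are hypothetical.

```python
def inconsistent_nodes(edges, observed):
    """Return the observed nodes whose variation no regulator can explain.

    edges: iterable of (source, target, sign), sign = +1 (activation) or -1 (inhibition).
    observed: dict mapping node -> +1 (increase) or -1 (decrease).
    """
    incoming = {}
    for source, target, sign in edges:
        incoming.setdefault(target, []).append((source, sign))

    flagged = []
    for node, variation in observed.items():
        regulators = incoming.get(node, [])
        if not regulators:
            continue  # input node: its variation needs no explanation
        if any(source not in observed for source, _ in regulators):
            continue  # an unobserved regulator could still explain the variation
        explained = any(
            sign * observed[source] == variation for source, sign in regulators
        )
        if not explained:
            flagged.append(node)
    return sorted(flagged)
```

Nodes returned by this function mark places where the model and the data disagree, which is the starting point for the repair and prediction tasks listed above.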

Extract relevant and robust information from dynamical models

The results below were inspired by discrete and continuous dynamical systems properties.

Time-scale reduction: Asymptotology of Chemical Reaction Networks. Pruning, pooling and limiting steps in metabolic networks.
Metabolic scale in genetically regulated metabolic networks: We mix Gale-Nikaido reduction steps and differential inequalities to understand the role of genetic regulation over a metabolic model of lipid metabolism. [Reference in HAL]
Parameter robustness: Parametric robustness in gene networks: reliable functioning with unreliable components
Model flexibility: Use of elementary mode based methods for metabolic networks to illustrate the flexibility of the mammary gland in lactating dairy cows [Reference in HAL]
Average behavior of dynamical models: A survey on probabilistic approaches for investigating biological networks [Reference in HAL]
Hybrid systems: Piecewise smooth hybrid systems as models for networks in molecular biology
