Symbiose: a Bioinformatics center
Symbiose is a bioinformatics research project. It focuses on
methodological research at the interface between computer science and
molecular biology, excluding "standard" informatics (“biocomputing”) for
routine management of biological data.
The Symbiose team brings together two
bioinformatics entities: a research group and a technical
platform called GenOuest.
- The research group focuses on high performance computing for large-scale
genomic data and modeling of large-scale biological systems.
- The GenOuest platform belongs to Biogenouest,
the life science network of western France. It has held the IBiSA label
since 2009 and is also ISO 9001:2008 certified. It coordinates the
activities of the RENABI-GO regional center, one of the six French
bioinformatics resource centers. It offers a range of bioinformatics
services (computing power, storage, databanks, development and training)
to a wide regional and national community.
The two
entities collaborate closely to offer full technological, research and
training support to the biological community. Research and
technological development projects are conducted in collaboration with
INRA, Inserm and CNRS biological teams from across the country.
This environment makes it possible to bring together, in one place,
computer scientists with strong expertise in high performance
computing and dynamical modeling for bioinformatics and the
experience of interacting with genomics research labs through a
resource center. In this competitive environment, we are
concerned with the storage, analysis and interpretation of large-scale
and multi-timescale datasets produced by other platforms and research
teams, including, although not exclusively, the analysis of Next
Generation Sequencing data. Our project addresses both the pragmatic
needs of managing and exploiting a high-throughput resource and
the longer-term need to develop original algorithms and
applications through dedicated research.
A few slides presenting Symbiose are available here
Scientific axes
Our research is characterized by an interest in large-scale studies (genomes, proteomes or regulation networks) and in the discrete methods needed to handle the associated complexity.
We have an overall concern for high performance computing and for two types of modeling tasks:
modeling sequences and structures, and modeling regulation networks.
Optimized algorithms on parallel specialized architectures
First and foremost, large-scale studies require
fine tuning and careful management of computational resources. We investigate
the practical use of parallelism to speed up computations in genomics.
Topics of interest
range from intensive sequence comparisons to pattern or model matching,
including structure
prediction. We work on the co-design of algorithms and hardware
architectures tailored to such applications. This work is
based on the study of reconfigurable machines employing Field
Programmable Gate Arrays (FPGA) or fast components such as Flash
memories or Graphics Processing Units (GPU).
Modeling sequences and structures
This track concerns the search for relevant
(e.g. functional) spatial or logical structures in macromolecules,
with the intent to model either specific spatial structures (secondary and
tertiary structures, disulfide bonds, etc.) or general biological
mechanisms (transposition, etc.). In the framework of
language theory and combinatorial optimization, we address
various types of problems:
the design of grammatical models of biological sequences and
the machine learning of grammatical models from sequences;
efficient filtering and model matching in data banks; and protein structure
prediction.
The corresponding disciplinary fields are language theory, algorithms on
words, machine learning, data analysis and combinatorial optimization.
Systems biology
We address the question of constructing
accurate models of biological systems with respect to available data and
knowledge. The availability of high-throughput methods in molecular
biology has led to a tremendous increase in measurable data, along with
the resulting knowledge repositories gathered on the web (e.g.
KEGG, MetaCyc, RegulonDB). However, both measurements and
biological networks are prone to incompleteness, heterogeneity and
mutual inconsistency, which makes it highly non-trivial to draw biologically
meaningful conclusions in an automated way. Based on this observation, we
develop methods for the analysis of large-scale biological networks
that formalize various reasoning modes in order to highlight incomplete
regions in a regulatory model and to point at network products that
need to be activated or inactivated to globally explain the experimental
data. We also consider small-scale biological systems, to gain a fine
understanding of the conclusions that can be drawn about active pathways from
available data, working on deducible properties rather than simulation.
The corresponding disciplinary fields are model checking, constraint-based
analysis and dynamical systems.
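The consistency reasoning described above can be illustrated with a small sketch. It assumes one common rule for signed influence graphs, chosen here for illustration and not necessarily the exact rule implemented in the team's tools: the observed variation of a node must be explained by at least one of its regulators. The example graph, the node names and the function `unexplained_nodes` are hypothetical.

```python
from typing import Dict, List, Tuple

# Signed influence graph edge: (source, target, sign),
# with sign = +1 for an activation and -1 for an inhibition.
Edge = Tuple[str, str, int]


def unexplained_nodes(edges: List[Edge], obs: Dict[str, int]) -> List[str]:
    """Return the observed nodes whose variation no regulator can explain.

    `obs` maps a node to +1 (increase) or -1 (decrease).
    Nodes without any regulator are treated as inputs and never reported.
    """
    regulators: Dict[str, List[Tuple[str, int]]] = {}
    for src, dst, sign in edges:
        regulators.setdefault(dst, []).append((src, sign))

    conflicts = []
    for node, variation in obs.items():
        preds = regulators.get(node, [])
        if not preds:
            continue  # input node: nothing to explain
        # An unobserved regulator may vary either way, so it can explain anything.
        explained = any(
            src not in obs or obs[src] * sign == variation
            for src, sign in preds
        )
        if not explained:
            conflicts.append(node)
    return sorted(conflicts)


# Hypothetical four-node network: A activates B and inhibits C,
# B activates D and C inhibits D.
edges = [("A", "B", +1), ("A", "C", -1), ("B", "D", +1), ("C", "D", -1)]

consistent = unexplained_nodes(edges, {"A": +1, "B": +1, "C": -1, "D": +1})
inconsistent = unexplained_nodes(edges, {"A": +1, "B": -1, "C": -1, "D": +1})
print(consistent)    # no conflict
print(inconsistent)  # B decreases although its only regulator A activates it
```

A node reported by this check marks a region where either the model is incomplete or the data are inconsistent with it; repairing or completing the model is the harder problem and is not attempted in this sketch.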
Main results in 2010
Annotation tools
This work relied heavily on the computing facilities hosted by the GenOuest platform within the Symbiose project. It involved high performance computing approaches and probabilistic methods.
- Omic information management: MIMAS 3.0 is a Multiomics Information Management and Annotation System. [link to the publication]
- Annotation tool: AphidBase is a centralized bioinformatic resource for annotation of the pea aphid genome. [Reference in HAL]
- Pea aphid genome annotation: Annotation and analysis of the pea aphid genome by a large international community. We were strongly involved in database management, annotation protocols and gene curation. [Reference in HAL]
- Size of indexing structures: Factor and suffix oracles provide an economical and efficient solution for storing all the factors and all the suffixes, respectively, of a given text. We give an estimate of the average size of factor and suffix oracles. [Reference in HAL]
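To make the notion of oracle size concrete, here is a minimal sketch of the classic online construction of a factor oracle, written for illustration and not taken from the referenced publication. The function names `build_factor_oracle` and `accepts` and the example text are assumptions. The size of the structure is the number of transitions, which this sketch counts directly.

```python
from typing import Dict, List


def build_factor_oracle(text: str) -> List[Dict[str, int]]:
    """Build the factor oracle of `text` with the online construction.

    Returns one transition dictionary per state (states 0..len(text)).
    """
    n = len(text)
    trans: List[Dict[str, int]] = [dict() for _ in range(n + 1)]
    supply = [-1] * (n + 1)  # supply links; -1 means "undefined"
    for i, c in enumerate(text):
        trans[i][c] = i + 1  # internal transition that spells the text
        k = supply[i]
        while k != -1 and c not in trans[k]:
            trans[k][c] = i + 1  # external transition added along the links
            k = supply[k]
        supply[i + 1] = 0 if k == -1 else trans[k][c]
    return trans


def accepts(trans: List[Dict[str, int]], word: str) -> bool:
    """Every state is accepting: a word is accepted if it can be read from state 0."""
    state = 0
    for c in word:
        if c not in trans[state]:
            return False
        state = trans[state][c]
    return True


text = "abbbaab"
oracle = build_factor_oracle(text)
n_transitions = sum(len(t) for t in oracle)
print(len(oracle), n_transitions)  # number of states and of transitions
print(accepts(oracle, "bbaa"))     # a genuine factor of the text
print(accepts(oracle, "aba"))      # accepted although it is not a factor
```

The oracle accepts every factor of the text using only len(text) + 1 states, at the price of also accepting some words that are not factors, as the last line shows.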
Sequence alignment and comparison
This work was based on high performance computing approaches and grammatical inference.
- Genomic Sequence comparison: Parallel Genomic Sequence Comparison [Reference in HAL]
- Protein sequence comparison: Book chapter dedicated to Seed-Based Parallel Protein Sequence Comparison Combining Multithreading, GPU, and FPGA technologies [Reference in HAL]
- Comparative genomics: Comparative genomics on lepidoptera species [Reference in HAL] and aphid species [Reference in HAL].
- Short sequences alignment: GASSST, Global Alignment Short Sequence Search Tool [Reference in HAL]
- Multiple repeats in DNA sequences: algorithm designed for the fast filtration of full genomes in order to detect multiple repeats. [Reference in HAL]
- Homology searches: Book chapter dedicated to the use of advanced algorithmic techniques based on filtration and on the use of seeds for retrieving homologies with high specificity and sensitivity in large datasets. [Reference in HAL]
- Structure of biological sequences: A tutorial on Modelling Biological Sequences by Grammatical Inference [Reference in HAL]
- Inference of genomic sequence structure: We address the smallest grammar problem on large sequences, that is, finding a smallest context-free grammar that generates exactly one sequence. Using the concept of maximal repeats, we propose a new algorithm that can be applied to whole genomes of model organisms and that finds grammars up to 10% smaller than the state of the art. [Reference in HAL] and [Reference in HAL]
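Several of the results above rely on seeds and filtration. As a rough illustration of the general idea only, and not of any of the tools listed, the sketch below indexes the exact k-mers of a target sequence, looks up the k-mers of a query, and keeps the diagonals that collect enough seed hits as candidates for a later, more expensive alignment step. The function names, the sequences and the thresholds are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def index_seeds(seq: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer of `seq` to the list of positions where it starts."""
    index: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def seed_hits(query: str, target: str, k: int) -> List[Tuple[int, int]]:
    """Return all (query_pos, target_pos) pairs that share an exact k-mer."""
    index = index_seeds(target, k)
    hits = []
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], []):
            hits.append((q, t))
    return hits


def candidate_diagonals(hits: List[Tuple[int, int]], min_hits: int) -> List[int]:
    """Filtration step: keep diagonals (target_pos - query_pos) with enough hits."""
    counts: Dict[int, int] = defaultdict(int)
    for q, t in hits:
        counts[t - q] += 1
    return sorted(d for d, c in counts.items() if c >= min_hits)


query = "ACGTAC"
target = "TTACGTACGG"
hits = seed_hits(query, target, k=4)
print(hits)                          # shared 4-mers as (query_pos, target_pos)
print(candidate_diagonals(hits, 2))  # diagonals worth aligning in full
```

Only the regions around the surviving diagonals would then be passed to a full alignment, which is where parallel hardware can be put to use.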
Structure alignment and comparison
We mainly used linear optimization methods to address questions from structural biology.
- Local protein threading, sequence-structure alignment: This paper presents a novel approach to the protein threading problem (PTP) that aligns part of a protein structure onto a protein sequence in order to detect local similarities. [link to the publication]
- Protein structure comparison, global alignment: We introduce a new integer programming model for Contact Map Overlap (CMO), a scoring scheme for similarities between protein structures. We propose an exact branch-and-bound algorithm with bounds obtained by a novel Lagrangian relaxation. [Reference in HAL]
- SHREC'10 Track: Protein Models (Eurographics Workshop on 3D Object Retrieval competition): This paper presents the results of the SHREC'10 Protein Models Classification Track. The aim of this track is to evaluate how well 3D shape recognition algorithms can classify protein structures according to the CATH [CSL08] superfamily classification. The global alignment tools developed in the Symbiose team obtained the best result. [link to the publication]
- Protein structure comparison, local alignment: In this paper, we propose a new protein structure comparison method based on internal distances (DAST), whose main characteristic is that it generates alignments with an RMSD smaller than any previously given threshold. [link to the publication]
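The Contact Map Overlap score mentioned above can be made concrete with a short sketch. It only evaluates the score of a given order-preserving alignment between two contact maps; finding the alignment that maximizes this score is the hard optimization problem, and nothing here reproduces the branch-and-bound or Lagrangian relaxation of the referenced work. The function name `contact_overlap` and the toy contact maps are assumptions.

```python
from typing import Dict, Iterable, Tuple

Contact = Tuple[int, int]  # a pair of residue indices that are in contact


def contact_overlap(
    contacts_a: Iterable[Contact],
    contacts_b: Iterable[Contact],
    alignment: Dict[int, int],
) -> int:
    """Count the contacts of A that `alignment` maps onto contacts of B.

    `alignment` maps residue indices of A to residue indices of B and must
    preserve the sequence order (strictly increasing).
    """
    images = [j for _, j in sorted(alignment.items())]
    if any(x >= y for x, y in zip(images, images[1:])):
        raise ValueError("alignment must be strictly increasing")

    b_set = {frozenset(c) for c in contacts_b}  # contacts are unordered pairs
    return sum(
        1
        for i, j in contacts_a
        if i in alignment
        and j in alignment
        and frozenset((alignment[i], alignment[j])) in b_set
    )


# Toy contact maps for two hypothetical proteins.
contacts_a = [(0, 2), (1, 3), (2, 4)]
contacts_b = [(0, 2), (1, 4), (3, 5)]
alignment = {0: 0, 1: 1, 2: 2, 3: 4, 4: 5}

score = contact_overlap(contacts_a, contacts_b, alignment)
print(score)  # number of contacts shared under this alignment
```

Here the contacts (0, 2) and (1, 3) of the first protein land on contacts of the second one, while (2, 4) does not.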
Markers identification in genome sequences
The results were obtained either with high performance computing approaches or with logic programming.
- SNP: SNP identification without a reference genome, an approach for calling SNPs by comparing two sets of raw NGS reads with neither assembly nor mapping onto a reference genome. [Reference in HAL]
- miRNA: Deep sequencing of microRNAs and expression analysis during phenotypic plasticity in the pea aphid, Acyrthosiphon pisum [Reference in HAL]
- Rearrangement breakpoints: Cassis, detection of genomic rearrangement breakpoints. [link to the publication]
- Transcription factors: We introduce a parallel scheme for comparing transcription factor binding site matrices [Reference in HAL]
- ModuleOrganizer: modules in families of transposable elements. We introduce the concept of a transposable element module. We propose a new assembly method that does not require multiple sequence alignment. We show its sensitivity in several examples. [Reference in HAL]
Confronting (omic) data with knowledge-based regulatory models
The studies below mainly involved advanced logic programming (ASP) and discrete mathematics.
- Model construction: Model of cap-dependent translation initiation in sea urchin. A step towards the eukaryotic translation regulation network [Reference in HAL]
- Active regulations: Use of constraint-based approaches (Bioquali tool) to localize potentially active post-transcriptional regulations in the Ewing's sarcoma gene regulatory network [Reference in HAL]
- Source of phenotypes: Designing Logical Rules to Model the Response of Biomolecular Networks with Complex Interactions: An Application to Cancer Modeling. [Reference in HAL]
- Model correction tools: Repair and Prediction (under Inconsistency) in Large Biological Networks with Answer Set Programming [Reference in HAL]
- Model inference: Complex update strategies for Probabilistic Boolean Networks [Reference in HAL]
Extracting relevant and robust information from dynamical models
The results below were inspired by properties of discrete and continuous dynamical systems.
- Time-scale reduction: Asymptotology of Chemical Reaction Networks. Pruning, pooling and limiting steps in metabolic networks.
- Metabolic scale in genetically regulated metabolic networks: We combine Gale-Nikaido reduction steps and differential inequalities to understand the role of genetic regulation in a model of lipid metabolism. [Reference in HAL]
- Parameter robustness: Parametric robustness in gene networks: reliable functioning with unreliable components
- Model flexibility: Use of elementary-mode-based methods on metabolic networks to illustrate the flexibility of the mammary gland in lactating dairy cows [Reference in HAL]
- Average behavior of dynamical models: A survey on probabilistic approaches for investigating biological networks [Reference in HAL]
- Hybrid systems: Piecewise smooth hybrid systems as models for networks in molecular biology