Learning grammars on genomic sequences

The position has been filled.

Internship subject:
Using a linguistic approach to model genomic sequences has long been advocated by David Searls [1]. Such models are sometimes designed by experts. In our team, we study how to design these models automatically by machine learning, and we have proposed a successful approach for learning automata on protein sequences [2,3]. The subject of the internship is to study how this approach can be extended to learn more expressive grammars [4,5,6], which make it easier to model long-distance correlations. The proposed algorithm will be implemented and tested on real genomic datasets.

Keywords: Machine learning, Bioinformatics, Formal Grammars

Duration: 6 months

Prerequisites: Master's studies in computer science or equivalent (this is a research subject: applicants should be able to continue with a PhD thesis after the internship).

Application: Students eligible for an INRIA internship have to apply through this program, but don't hesitate to contact:

François Coste

Bibliography:

[1] The language of genes, David Searls, Nature, 2002.

[2] Learning automata on protein sequences, François Coste and Goulven Kerbellec, JOBIM 2006.

[3] Apprentissage d'automates modélisant des familles de séquences protéiques [Learning automata modeling families of protein sequences], Goulven Kerbellec, Computer Science PhD Thesis, Université de Rennes 1, June 2008.

[4] Polynomial identification in the limit of substitutable context-free languages, Alexander Clark and Rémi Eyraud, Journal of Machine Learning Research, August 2007.

[5] Comparing two unsupervised grammar induction systems: Alignment-Based Learning vs. EMILE, Menno van Zaanen and Pieter Adriaans, Technical Report TR2001.05.

[6] Unsupervised learning of natural languages, Zach Solan, David Horn, Eytan Ruppin, and Shimon Edelman, Proc. Natl. Acad. Sci., August 2005.
