Learning grammars on genomic sequences
Written by François COSTE
The position has been filled.
Internship subject:
David Searls has long advocated a linguistic approach to modeling genomic sequences [1]. Such models are sometimes designed by hand by experts. In our team, we study how to design them automatically by machine learning, and we have proposed a successful approach for learning automata on protein sequences [2,3]. The subject of the internship is to study how this approach can be extended to learn more expressive grammars [4,5,6], which would make it easier to model long-distance correlations. The proposed algorithm will be implemented and tested on real genomic datasets.
Keywords: Machine learning, Bioinformatics, Formal Grammars
Duration: 6 months
Prerequisites: Master's-level studies in computer science or equivalent (this is a research subject: applicants should be able to continue with a PhD thesis after the internship)
Application: Students eligible for an INRIA internship have to apply through this program, but don't hesitate to contact François Coste.
Bibliography:
[1] The language of genes, David Searls, Nature, 2002.
[2] Learning automata on protein sequences, François Coste and Goulven Kerbellec, JOBIM 2006.
[3] Apprentissage d'automates modélisant des familles de séquences protéiques, Goulven Kerbellec, Computer Science PhD Thesis, Université de Rennes 1, June 2008.
[4] Polynomial identification in the limit of substitutable context-free languages, Alexander Clark and Rémi Eyraud, Journal of Machine Learning Research, August 2007.
[5] Comparing two unsupervised grammar induction systems: Alignment-Based Learning vs. EMILE, Menno van Zaanen and Pieter Adriaans, Technical Report TR2001.05.
[6] Unsupervised learning of natural languages, Zach Solan, David Horn, Eytan Ruppin, and Shimon Edelman, Proc. Natl. Acad. Sci., August 2005.