Analysis systems for serial sources in collections of historical image documents

Location: 
Team: 
Context: 

The Intuidoc team (https://www.irisa.fr/intuidoc) conducts research on document image recognition. For many years, the team has been developing a system, called the DMOS-PI method, for document structure analysis. The DMOS-PI method is used for document recognition and field extraction in archive documents, handwritten content, and damaged documents (musical scores, archives, newspapers, letters, electronic schemas, …).

The EURHISFIRM European project aims to develop a research infrastructure to connect, collect, collate, align, and share reliable long-run company-level data for Europe, enabling researchers, policymakers and other stakeholders to analyze, develop, and evaluate effective strategies to promote investment and economic growth. To achieve this goal, EURHISFIRM develops innovative tools to spark a “Big data” revolution in the historical social sciences and to open access to cultural heritage.

EURHISFIRM is a project funded by the European Commission within the Infrastructure Development Program of Horizon 2020. The first phase of the Infrastructure Development Program lasts three years and aims to produce an in-depth design study of the Research Infrastructure. Development and Consolidation Phases will follow if further applications are successful. EURHISFIRM brings together eleven research institutions in economics, history, information technologies and data science from seven European countries.

Mission: 

The post-doctoral fellow will join a team working on the EURHISFIRM workflow. The goal is to extract information from images of 20th-century financial documents. We mainly focus on two collections: yearbooks, which describe companies and their administrators, and price lists, which are newspapers showing daily stock prices.

Because these documents vary widely, the task requires a flexible, easy-to-adapt document recognition system. The system is based on modelling knowledge not only at the page level but also at the collection level, in interaction with experts on the historical sources.

A first system has been designed using the DMOS-PI method. It uses a grammatical language, EPF (Enhanced Position Formalism), to describe a general page layout, together with perceptive vision mechanisms and an iterative analysis. The system also combines structural methods with deep learning. For each new collection, an adapted description of the document layout must be developed. This has to be done across a wide range of structure levels: from highly structured pages, such as the tables in stock exchange lists, to the paragraph-oriented structures of yearbooks.

Initial experiments have been conducted on the recognition of “price list” documents, using a specific French corpus. The objective of the work will be to generalize this system to other collections of price lists from other countries. This requires identifying the parts that price lists have in common across stock exchanges and countries, and determining how to make the system easy to adapt to a new collection.

Profile / skills: 

PhD or master's degree in computer science.

Experience in document recognition or statistical analysis.

Fluent English

Skills in grammars and languages and/or logic programming are nice to have.

Required degree: 
PhD or master's degree in computer science.
Workplace: 
Rennes - Irisa
Contract type: 
Fixed-term contract (CDD)
Contract duration (in months): 
18
Working time: 
100%
Position / category: 
Post-doctoral researcher / engineer in computer science
Gross monthly salary (€): 
€2,500 to €2,900 gross per month, depending on qualifications
Expected start date: 
Monday, 3 February 2020
Expected start date: 
As soon as possible
How to apply: 

Candidates should get in touch by email: bertrand.couasnon@irisa.fr, aurelie.lemaitre@irisa.fr