ETIAB

PEPS CNRS-PSL

Textual Information Extraction to automatically Supply Databases, information transfers and evolution of thesauri.

The goal is to extract character strings from texts structured in catalog format, in order to fill the spreadsheets that feed our databases and atlases. The method is currently being tested on "The Archaeological Maps of Gaul" (Cartes Archéologiques de la Gaule).

Project being finalized


Institutional Partners

AOROC - UMR8546-CNRS/ENS
CNRS - Centre National de la Recherche Scientifique
ENS - Ecole Normale Supérieure | Paris
FRANTIQ - Fédération et ressources sur l’Antiquité
LaTTiCe - Laboratoire
PSL - Paris Sciences et Lettres | université de recherche

The collaboration between researchers, archaeologists, linguists and engineers aims to develop and validate automatic processing of the corpus, so as to reduce manual intervention time, increase the reliability of the results and facilitate cross-disciplinary sharing of data: in other words, to improve the scientific environment for archaeological studies and research.

The objective is to associate automatically extracted information with the fields that make up the basic tables. The ontology links each piece of extracted information to the table column that has to be filled in to describe an element recognized in the text. For example, for a site in the BaseFer, “fibula” matches a type of material (Categories table = jewelry and Material table = fibula).
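
The link between an extracted term and the table columns can be pictured with a minimal Python sketch. It is only an illustration: the dictionary layout and the names ONTOLOGY and fill_record are assumptions made for this example, and only the “fibula” entry comes from the description above.

```python
# Hypothetical, hand-written ontology fragment: each concept points to the
# table columns it fills. Only the "fibula" entry is taken from the text;
# the structure itself is an assumption for illustration.
ONTOLOGY = {
    "fibula": {"Categories": "jewelry", "Material": "fibula"},
}


def fill_record(extracted_terms, ontology=ONTOLOGY):
    """Map extracted terms to table columns.

    Returns a (record, unmatched) pair: the columns that could be filled,
    and the terms for which the ontology has no entry.
    """
    record = {}
    unmatched = []
    for term in extracted_terms:
        entry = ontology.get(term.lower())
        if entry is None:
            unmatched.append(term)  # left for manual review
        else:
            record.update(entry)
    return record, unmatched


record, unmatched = fill_record(["Fibula", "sword"])
print(record)
print(unmatched)
```

Terms without an ontology entry are returned separately rather than dropped, so that they can still be reviewed by hand.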

From the point of view of textual analysis, the information is extracted as “candidate terms”, meaning that a word or a group of words matches a concept, for example: sword, Iron Age. The main difficulty lies in recognizing the different variants of a term composed of several words.
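
One simple way to approach variants of multi-word terms is to reduce every candidate to a normalized set of words before comparing it with the known concepts. The sketch below is an assumption for illustration, not the method used by the tool: the stopword list, the crude plural stripping and the names normalize and match_concept are all hypothetical.

```python
import re
import unicodedata

# Assumed, deliberately tiny stopword list for the example.
STOPWORDS = {"of", "the", "in"}


def normalize(term):
    """Reduce a term to an order-independent set of normalized words."""
    # Strip accents so that accented and unaccented spellings compare equal.
    text = unicodedata.normalize("NFKD", term)
    text = "".join(c for c in text if not unicodedata.combining(c))
    # Lowercase and split on anything that is not a letter or digit,
    # so "iron-age" and "Iron Age" yield the same words.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Crude plural stripping (an approximation, not real lemmatization).
    tokens = [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]
    return frozenset(t for t in tokens if t not in STOPWORDS)


# The two concepts quoted in the text serve as the reference list.
CONCEPTS = ["sword", "Iron Age"]
INDEX = {normalize(c): c for c in CONCEPTS}


def match_concept(candidate):
    """Return the known concept matching the candidate, or None."""
    return INDEX.get(normalize(candidate))


print(match_concept("iron-age"))
print(match_concept("Swords"))
print(match_concept("bronze axe"))
```

Because word order is ignored, “Age of Iron” would also match “Iron Age”; this is convenient for variants but can produce false matches, which is one reason a user-controlled validation step remains useful.
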
The Extraction Terms tool for Archaeology is made up of a set of forms, which enable the user to control every step of the extraction. It is self-learning, meaning that it memorizes the choices the user has made.
The first form thus has two parts: the left part displays the text; the right part is used to configure the tool and to display the lists of terms.
It is backed by the “Pactols” ontologies, developed in several languages by “Frantiq”, and by the value lists built into the BaseFer for protohistory.

The web tool is currently being developed and tested by Lattice and AOrOc.

Actors:
F. Mélanie, Lattice, CNRS, ENS Paris
J. Ferguth, Lattice
M. Cartereau, AgroParisTech, AOrOc
K. Gruel, CNRS, ENS Paris