By Weronika Adrian,
Università della Calabria
July 2017
Abstract
Information Extraction (IE) consists in obtaining structured information from unstructured and semi-structured sources. Existing solutions use advanced methods from the fields of Natural Language Processing and Artificial Intelligence, but they usually aim at solving sub-problems of IE, such as entity recognition, relation extraction or co-reference resolution. In practice, however, it is often necessary to build on the results of several tasks and combine them in an intelligent way. Moreover, Information Extraction now faces new challenges related to large-scale collections of documents in complex formats beyond plain text.
An apparent limitation of existing work is the lack of a uniform representation of document analysis from multiple perspectives, such as semantic annotation of text, structural analysis of the document layout and processing of the integrated knowledge. Recent proposals for ontology-based Information Extraction do not fully exploit the possibilities of ontologies: they use them only as a reference model for a single extraction method, such as semantic annotation, or for defining the target schema of the extraction process.
In this thesis, we address the problem of Information Extraction from homogeneous collections of documents, i.e., sets of files that share some common properties with respect to content or layout. We observe that interleaving semantic and structural analysis can improve the results of the IE process, and we propose an ontology-driven approach that integrates and extends existing solutions.
The contributions of this thesis are both theoretical and practical. On the theoretical side, we propose a model and a process of Semantic Information Extraction that integrates techniques from semantic annotation of text, document layout analysis, object-oriented modeling and rule-based reasoning. We adapt existing solutions to enable their integration under a common ontological view and advance the state of the art in semantic annotation and document layout analysis. In particular, we propose a novel method for automatic lexicon generation for semantic annotators, and an original approach to layout analysis based on common label identification and structure recognition. On the practical side, we design and implement a framework named KnowRex that realizes the proposed methodology and integrates the elaborated solutions.